How to Deal with Categorical Variables in Machine Learning with Python?

If you have categorical variables in your dataset and want to know how to handle them in machine learning, this tutorial is for you. In this article, you will learn the most popular method for encoding categorical variables, along with Python code. So give this article a few minutes and clear your doubts.

Now, without any further ado, let's get started.

Before we dive into the techniques for handling categorical variables, let's first understand what categorical variables are.

What are Categorical variables?

Categorical variables assign each observation to one of a fixed set of categories or labels. Their values are usually text rather than numbers, which is why we need to convert this textual data into numerical form.

For example, in the dataset used in this tutorial, the "Country" variable has 3 categories: France, Spain, and Germany. Most machine learning models work on numbers, so they cannot compute anything meaningful directly from these strings. That's why we need to convert these strings into numbers.

Some more examples of categorical variables are:

  1. A "weather" variable with the values: Sunny, Cloudy, and Rainy.
  2. A "color" variable with the values: Green, Yellow, and Red.

There are various methods for converting these strings into numbers, but the most popular one is one hot encoding. Now let's understand one hot encoding in detail.

One hot Encoding

One hot encoding turns the Country column into 3 different columns. Why only three columns?

Because there are 3 different categories in total in the Country variable: France, Spain, and Germany. If there were 5 different countries, we would turn this column into five columns.
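To make the "one column per category" rule concrete, here is a minimal sketch using pandas. The sample values are a hypothetical stand-in for the Country column, and `pd.get_dummies` is used only for illustration; it is not the method used later in this tutorial.

```python
import pandas as pd

# Hypothetical sample standing in for the "Country" column
countries = pd.Series(["France", "Spain", "Germany", "Spain", "Germany"], name="Country")

# Number of distinct categories in the column
n_categories = countries.nunique()

# One new column per distinct category
encoded = pd.get_dummies(countries, dtype=int)

print(n_categories)
print(encoded.shape[1])
```

Both numbers should match: the count of distinct categories decides how many columns one hot encoding creates.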

I hope that makes sense.

One more important thing: one hot encoding creates a binary vector for each country. That means every value of the categorical variable is represented using only 0s and 1s. Let me explain this in detail.

As I said, one hot encoding turns the Country column into 3 different columns. After creating the three columns and filling them with 0s and 1s, the result looks like this:

Country   France   Spain   Germany
France    1        0       0
Spain     0        1       0
Germany   0        0       1
Spain     0        1       0
Germany   0        0       1
France    1        0       0
Spain     0        1       0
France    1        0       0
Germany   0        0       1
France    1        0       0

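How are these values filled?

Each row gets a 1 in the column that matches its country and a 0 in the other two columns. For example, the first row is France, so France = 1, Spain = 0, and Germany = 0. The second row is Spain, so Spain = 1 and the other two columns are 0.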
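The filling rule can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the scikit-learn implementation; the helper name `one_hot` is made up for this example, and the country list is the one from the table above.

```python
# Country values, in the same order as the rows of the table
countries = ["France", "Spain", "Germany", "Spain", "Germany",
             "France", "Spain", "France", "Germany", "France"]

# Column order used in the table
categories = ["France", "Spain", "Germany"]


def one_hot(value, categories):
    """Return a binary vector with a 1 in the position matching `value`."""
    return [1 if value == category else 0 for category in categories]


rows = [one_hot(country, categories) for country in countries]

for country, row in zip(countries, rows):
    print(country, row)
```

Every row contains exactly one 1, which is where the name "one hot" comes from.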

I hope that is clear now. Let's see how to implement one hot encoding in Python.

One hot Encoding in Python

For the implementation, I am going to use a small dataset so the steps are easy to follow. It has two categorical variables, "Country" and "Purchased", and we have to convert this textual data into numerical form.

So the first step is:

1. Import the libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

NumPy is an open-source Python library used to perform various mathematical and scientific tasks. NumPy is used for working with arrays. It also has functions for working in the domain of linear algebra, Fourier transform, and matrices.

Matplotlib is a plotting library used for creating figures, adding plotting areas to a figure, drawing lines in a plotting area, decorating the plot with labels, and so on.

Pandas is a tool used for data wrangling and analysis.

So in step 1, we imported all the required libraries. Now the next step is:

2. Load the Dataset

dataset = pd.read_csv('Data.csv')
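If you don't have Data.csv at hand, you can follow along with an in-memory stand-in. The sketch below is only an assumption about the file's shape (one text column, two numeric columns, and a Yes/No target); the column names "Age" and "Salary" and all the values are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for Data.csv: column names and values are invented
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,,No
Germany,,61000,Yes
"""

# read_csv accepts any file-like object, so StringIO works like a file on disk
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.head())
print(dataset.isna().sum())  # missing values per column
```

Empty fields in the CSV are read as NaN, which is what the missing-value step later in this tutorial deals with.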

As you can see in the dataset, there are 3 independent variables and 1 dependent variable. That's why we need to split the data into the independent variables, X, and the dependent variable, y. So the next step is:

3. Split Dataset into X and y

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
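Here is a small self-contained sketch of what the two `iloc` slices select. The DataFrame is a hypothetical stand-in with invented column names and values; using `-1` for the last column is my own variation and selects the same column as index 3 when there are four columns.

```python
import pandas as pd

# Hypothetical stand-in dataset: 3 independent variables + 1 dependent variable
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
    "Purchased": ["No", "Yes", "No"],
})

X = dataset.iloc[:, :-1].values  # all rows, every column except the last
y = dataset.iloc[:, -1].values   # all rows, only the last column

print(X.shape)
print(y.shape)
```

X ends up as a 2-D array with one column per independent variable, and y as a 1-D array holding the target.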

Now we have split the dataset into X and y. But as you can see in the dataset, there are some missing values too. I have already written an article on How to Handle Missing Values in Machine Learning, which you can check out. So the next step is:

4. Handling Missing Values

from sklearn.impute import SimpleImputer
# Imputer was removed from scikit-learn; SimpleImputer replaces it and always works column by column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
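To see what mean imputation does, here is a minimal sketch on a tiny made-up array. It uses `SimpleImputer` from `sklearn.impute`, which is the class current scikit-learn releases provide for this; the numbers are invented.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric columns with one missing value each
values = np.array([[40.0, 50000.0],
                   [np.nan, 60000.0],
                   [30.0, np.nan]])

# Replace each NaN with the mean of its own column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(values)

print(filled)
```

The missing entry in the first column becomes 35.0 (the mean of 40 and 30), and the missing entry in the second column becomes 55000.0 (the mean of 50000 and 60000).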

After handling missing values, it’s time to apply one hot encoding to the dataset.

5. Encoding categorical data

First we encode the independent variable "Country":

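from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# OneHotEncoder no longer accepts categorical_features; ColumnTransformer selects column 0 instead
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
# sparse_threshold=0 returns a dense array; the other columns are passed through unchanged
X = ct.fit_transform(X).astype(float)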

Why did I use 0?

Because the Country variable is at column index 0.

After running this code, the "Country" column is replaced by three numeric columns of 0s and 1s, one per country.
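The sketch below shows one hot encoding of column 0 on a tiny made-up array, using `ColumnTransformer` with `OneHotEncoder`. The data values and the transformer name "country" are assumptions for illustration. One detail worth knowing: by default scikit-learn orders the categories alphabetically, so the new columns come out as France, Germany, Spain rather than in the order used in the table earlier.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows: a text column followed by two numeric columns
X = np.array([["France", 44.0, 72000.0],
              ["Spain", 27.0, 48000.0],
              ["Germany", 30.0, 54000.0]], dtype=object)

ct = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), [0])],  # encode column index 0
    remainder="passthrough",                           # keep the other columns as they are
    sparse_threshold=0,                                # always return a dense array
)
X_encoded = ct.fit_transform(X).astype(float)

# Category order decides which new column belongs to which country
categories = list(ct.named_transformers_["country"].categories_[0])

print(categories)
print(X_encoded)
```

The first three columns of the result are the one hot columns, and the remaining two are the untouched numeric columns.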

Now let's encode the dependent variable "Purchased":

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

Now the values of the dependent variable "Purchased" are converted into 0 and 1, where 0 means No and 1 means Yes.
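A quick sketch of how `LabelEncoder` maps the labels, using a made-up list of Yes/No values. The classes are sorted alphabetically, which is why "No" becomes 0 and "Yes" becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up target values
y = ["No", "Yes", "No", "No", "Yes"]

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

# classes_ holds the original labels; a label's position is its integer code
classes = list(encoder.classes_)
# inverse_transform maps the integer codes back to the original labels
decoded = list(encoder.inverse_transform([0, 1]))

print(classes)
print(list(y_encoded))
print(decoded)
```

Keeping the fitted encoder around is handy, because `inverse_transform` lets you turn model predictions back into the original labels.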

So that is all about encoding categorical variables. I hope you understood the concept easily. Now it's time to wrap up.

Conclusion

In this article, I have discussed how to deal with categorical variables in machine learning. If you have any questions, feel free to ask me in the comment section. But if you found this article helpful, kindly share it with others.

All the Best!

Happy Learning!

Thank YOU!

Thought of the Day…

Anyone who stops learning is old, whether at twenty or eighty. Anyone who keeps learning stays young.

– Henry Ford

Written By Aqsa Zafar

Founder of MLTUT, Machine Learning Ph.D. scholar at Dayananda Sagar University. Research on social media depression detection. Create tutorials on ML and data science for diverse applications. Passionate about sharing knowledge through website and social media.
