One Hot Encoding in Python

07.03.2021

Intro

One-hot encoding is a method of converting categorical variables into numerical form. It is a preprocessing step that some machine learning algorithms need in order to perform well. In this article, we will learn how to do one-hot encoding in Python.

For example, let’s say we have the following list of expenses with categories.

Category Amount
Transport 100
Grocery 300
Bills 200

We would like to convert this data to numerical form. Here is the same data after one-hot encoding.

Amount Transport Grocery Bills
100 1 0 0
300 0 1 0
200 0 0 1

We can see that each row now has an entry for each category: a 1 if the row belongs to that category and a 0 if it doesn’t. As you may be able to tell, this type of encoding adds one new feature per distinct category, so a column with many categories produces a lot of extra features, but that is not always a problem.
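To make the idea concrete, here is a minimal sketch in plain Python that builds the table above by hand. The variable names (categories, amounts, labels, rows) are made up for illustration, and this is only meant to show the logic, not how you would do it in practice.

```python
# Toy data mirroring the expense table above.
categories = ["Transport", "Grocery", "Bills"]
amounts = [100, 300, 200]

# Distinct categories, kept in order of first appearance.
labels = list(dict.fromkeys(categories))

# For each row, emit the amount followed by one 0/1 flag per label.
rows = [
    [amount] + [1 if category == label else 0 for label in labels]
    for category, amount in zip(categories, amounts)
]

print(["Amount"] + labels)
for row in rows:
    print(row)
```

Each row ends up with exactly one 1, in the column that matches its category.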

Creating the Data Set

To start, let’s create our data set. The example here is a list of expenses and their respective categories.

import pandas as pd

df = pd.DataFrame({
    "category": ["Transport", "Grocery", "Bills"],
    "amount": [100, 300, 200]
})

df.head()
category amount
0 Transport 100
1 Grocery 300
2 Bills 200

Encoding Variables

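To encode categorical variables, we can use the get_dummies function from pandas. We pass our data frame to the function, and it returns a new data frame in which each categorical column is replaced by one-hot encoded columns. Numeric columns, such as amount, are left as they are.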

pd.get_dummies(df)
amount category_Bills category_Grocery category_Transport
0 100 0 0 1
1 300 0 1 0
2 200 1 0 0
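Depending on your pandas version, the encoded columns may be shown as True/False instead of 1/0, since pandas 2.0 and later return boolean columns by default. The sketch below rebuilds the same data frame and uses the columns and dtype parameters of get_dummies to choose which column to encode and to ask for integer output. The variable name encoded is just for illustration.

```python
import pandas as pd

# Same data frame as in the article.
df = pd.DataFrame({
    "category": ["Transport", "Grocery", "Bills"],
    "amount": [100, 300, 200]
})

# Encode only the "category" column and request integer 0/1 values,
# so the result does not depend on the pandas default dtype.
encoded = pd.get_dummies(df, columns=["category"], dtype=int)

print(encoded)
```

Naming the columns explicitly also helps with larger data frames, where you may want to encode only some of the text columns.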