One Hot Encoding in Python

Intro

One-hot encoding is a method of converting categorical variables into numerical form. It is a preprocessing step that some machine learning algorithms need in order to work with categorical data and perform well. In this article, we will learn how to do one-hot encoding in Python.

For example, let’s say we have the following list of expenses with categories.

Category	Amount
Transport	100
Grocery	300
Bills	200

We would like to convert this data to numerical form. One-hot encoding the Category column gives the following table.

Amount	Transport	Grocery	Bills
100	1	0	0
300	0	1	0
200	0	0	1

We can see that each row now has an entry for each category. There is a 1 if the row belongs to the category and a 0 if it does not. As you may be able to tell, this type of encoding adds one new feature per category, so the number of features can grow quickly, but that is not always a problem.
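To make the idea concrete before bringing in any library, here is a minimal sketch of the same conversion in plain Python. The names `rows`, `categories`, and `encoded` are made up for this illustration, and the approach is only meant to show the logic, not to be an efficient implementation.

```python
# The example expenses as (category, amount) pairs.
rows = [("Transport", 100), ("Grocery", 300), ("Bills", 200)]

# Collect the distinct categories in order of first appearance.
categories = []
for category, _ in rows:
    if category not in categories:
        categories.append(category)

# Build one dictionary per row: the amount plus a 0/1 flag for every category.
encoded = [
    {"Amount": amount, **{c: int(c == category) for c in categories}}
    for category, amount in rows
]

for row in encoded:
    print(row)
```

Each row ends up with exactly one flag set to 1, which is where the name "one-hot" comes from.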

Creating the Data Set

To start, let’s create our data set. The example here is a list of expenses and their respective categories.

import pandas as pd

df = pd.DataFrame({
    "category": ["Transport", "Grocery", "Bills"],
    "amount": [100, 300, 200]
})

df.head()

	category	amount
0	Transport	100
1	Grocery	300
2	Bills	200

Encoding Variables

To encode categorical variables, we can use the get_dummies function from pandas. We pass our data frame to the function and it returns a new data frame with the categorical columns one-hot encoded.

pd.get_dummies(df)

	amount	category_Bills	category_Grocery	category_Transport
0	100	0	0	1
1	300	0	1	0
2	200	1	0	0
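One thing to watch for: in pandas 2.0 and later, the indicator columns returned by get_dummies default to boolean values (True/False) rather than 0 and 1. The sketch below shows one way to get integer columns like the output above, using the `columns` and `dtype` parameters of get_dummies. The variable name `encoded` is just for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["Transport", "Grocery", "Bills"],
    "amount": [100, 300, 200]
})

# columns= names the columns to encode explicitly;
# dtype=int requests 0/1 integers instead of booleans.
encoded = pd.get_dummies(df, columns=["category"], dtype=int)

print(encoded)
```

Naming the columns explicitly is optional here, since get_dummies encodes string columns by default, but it makes the intent clearer when a data frame has several columns.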

07.03.2021
