One-hot encoding is a method of converting categorical variables into numerical form. It is a preprocessing step that some machine learning algorithms need, because they only accept numerical input, and it can improve their performance. In this article, we will learn how to do one-hot encoding in Python with pandas.
For example, let’s say we have the following list of expenses with categories.
Category | Amount |
---|---|
Transport | 100 |
Grocery | 300 |
Bills | 200 |
We would like to convert this data to numerical form. Here is the same data after one-hot encoding.
Amount | Transport | Grocery | Bills |
---|---|---|---|
100 | 1 | 0 | 0 |
300 | 0 | 1 | 0 |
200 | 0 | 0 | 1 |
We can see that each row now has an entry for each category. There is a 0 if the row doesn’t have the category and a 1 if it does. As you may be able to tell, this type of encoding adds one feature per distinct category, so a column with many categories produces many extra features, but that is not always a problem.
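To make the mapping concrete, here is a minimal sketch that builds the table above by hand in plain Python. It is for illustration only: the variable names (`categories`, `amounts`, `distinct`, `rows`) are assumptions, and in practice a library function does this work.

```python
# Hand-rolled one-hot encoding of the expense list, for illustration only.
categories = ["Transport", "Grocery", "Bills"]
amounts = [100, 300, 200]

# Distinct categories in order of first appearance; each one becomes a column.
distinct = list(dict.fromkeys(categories))

rows = []
for category, amount in zip(categories, amounts):
    row = {"Amount": amount}
    for name in distinct:
        # 1 if this row belongs to the category, 0 otherwise.
        row[name] = 1 if category == name else 0
    rows.append(row)

for row in rows:
    print(row)
```

Each row ends up with exactly one 1 among the category columns, which is where the name "one-hot" comes from.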
To start, let’s create our data set. The example here is a list of expenses and their respective categories.
import pandas as pd
df = pd.DataFrame({
"category": ["Transport", "Grocery", "Bills"],
"amount": [100, 300, 200]
})
df.head()
| | category | amount |
|---|---|---|
| 0 | Transport | 100 |
| 1 | Grocery | 300 |
| 2 | Bills | 200 |
To encode categorical variables, we can use the `get_dummies` function from pandas. We pass our data frame to the function, and it returns a new data frame in which the categorical columns are one-hot encoded.
pd.get_dummies(df)
| | amount | category_Bills | category_Grocery | category_Transport |
|---|---|---|---|---|
| 0 | 100 | 0 | 0 | 1 |
| 1 | 300 | 0 | 1 | 0 |
| 2 | 200 | 1 | 0 | 0 |
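A few optional parameters of `get_dummies` are worth knowing. The sketch below rebuilds the same data frame so it runs on its own; the names `encoded` and `reduced` are assumptions chosen for illustration. Note that pandas 2.0 and later return boolean (`True`/`False`) columns by default, so passing `dtype=int` is one way to get the 0/1 output shown above.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["Transport", "Grocery", "Bills"],
    "amount": [100, 300, 200],
})

# columns= restricts encoding to the named columns;
# dtype=int requests 0/1 integers instead of booleans.
encoded = pd.get_dummies(df, columns=["category"], dtype=int)
print(encoded)

# drop_first=True drops the first level of each encoded column,
# leaving k-1 indicator columns for k categories.
reduced = pd.get_dummies(df, columns=["category"], dtype=int, drop_first=True)
print(list(reduced.columns))
```

Dropping one level can be useful for models such as linear regression, where a full set of indicator columns is redundant because the last one can always be inferred from the others.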