One-hot encoding is a method of converting categorical variables into numerical form. It is a preprocessing step that some machine learning algorithms need, because they cannot work with text categories directly, and it can improve their performance. In this article, we will learn how to do one-hot encoding in Python with pandas.
For example, let’s say we have the following list of expenses with categories.
| Category | Amount |
|---|---|
| Transport | 100 |
| Grocery | 300 |
| Bills | 200 |
We would like to convert this data to numerical form. One-hot encoding the Category column gives the following table.
| Amount | Transport | Grocery | Bills |
|---|---|---|---|
| 100 | 1 | 0 | 0 |
| 300 | 0 | 1 | 0 |
| 200 | 0 | 0 | 1 |
We can see that each row now has an entry for each category. There is a 0 if the row doesn’t have the category and a 1 if it does. As you may be able to tell, this type of encoding adds one new feature for every distinct category, so a column with many categories produces many extra columns. That is not always a problem.
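The idea described above can be sketched in plain Python before reaching for a library. This is only an illustration: the names `rows`, `categories`, and `encoded` are made up for this sketch and are not part of any library.

```python
# Plain-Python sketch of one-hot encoding (illustrative names, no libraries).
rows = [
    {"category": "Transport", "amount": 100},
    {"category": "Grocery", "amount": 300},
    {"category": "Bills", "amount": 200},
]

# Collect the distinct categories in order of first appearance.
categories = []
for row in rows:
    if row["category"] not in categories:
        categories.append(row["category"])

# Build one indicator per category: 1 if the row has it, 0 otherwise.
encoded = [
    {"amount": row["amount"], **{c: int(row["category"] == c) for c in categories}}
    for row in rows
]

for row in encoded:
    print(row)
```

Each resulting row keeps its amount and gains one 0/1 entry per category, which matches the table shown earlier.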
To start, let’s create our data set. The example here is a list of expenses and their respective categories.
```python
import pandas as pd

df = pd.DataFrame({
    "category": ["Transport", "Grocery", "Bills"],
    "amount": [100, 300, 200]
})
df.head()
```

| | category | amount |
|---|---|---|
| 0 | Transport | 100 |
| 1 | Grocery | 300 |
| 2 | Bills | 200 |
To encode categorical variables, we can use the get_dummies function from pandas. We pass our data frame to the function and it returns a new data frame in which the categorical column is replaced by one-hot encoded columns.
```python
pd.get_dummies(df)
```

| | amount | category_Bills | category_Grocery | category_Transport |
|---|---|---|---|---|
| 0 | 100 | 0 | 0 | 1 |
| 1 | 300 | 0 | 1 | 0 |
| 2 | 200 | 1 | 0 | 0 |
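Depending on the installed pandas version, the indicator columns may be displayed as `True`/`False` instead of `1`/`0` (newer releases default to a boolean dtype). A minimal sketch, assuming you want integer columns and want to name the column to encode explicitly, is to pass the `columns` and `dtype` parameters of `get_dummies`:

```python
import pandas as pd

# Same example data as above.
df = pd.DataFrame({
    "category": ["Transport", "Grocery", "Bills"],
    "amount": [100, 300, 200],
})

# Encode only the "category" column and request integer 0/1 indicators.
encoded = pd.get_dummies(df, columns=["category"], dtype=int)
print(encoded)
```

Listing the columns explicitly also keeps other text columns in a larger data frame from being encoded by accident.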
