One-hot encoding is a method of converting categorical variables into numerical form. It is a preprocessing step that some machine learning algorithms need, and it can improve performance. In this article, we will learn how to do one-hot encoding in R.
For example, let’s say we have the following list of expenses with categories.
| Category | Amount |
|---|---|
| Transport | 100 |
| Grocery | 300 |
| Bills | 200 |
We would like to convert this data to numerical form. Here is the same data after one-hot encoding.
| Amount | Transport | Grocery | Bills |
|---|---|---|---|
| 100 | 1 | 0 | 0 |
| 300 | 0 | 1 | 0 |
| 200 | 0 | 0 | 1 |
We can see that each row now has an entry for each category: 1 if the row belongs to that category and 0 if it does not. As you may be able to tell, this type of encoding adds one new feature per category, so a variable with many categories produces many extra columns, but that is not always a problem.
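To make the 0/1 rule concrete, here is a minimal base R sketch that builds the indicator columns by hand. The names `expenses`, `indicators`, and `encoded` are hypothetical and chosen only for this illustration; the packages used later in the article do this work for you.

```r
# Hypothetical toy data mirroring the expense table above
expenses <- data.frame(
  category = factor(c("Transport", "Grocery", "Bills")),
  amount   = c(100, 300, 200)
)

# One indicator column per factor level:
# 1 where the row has that level, 0 otherwise
indicators <- sapply(levels(expenses$category), function(lvl) {
  as.integer(expenses$category == lvl)
})

# Put the numeric column next to the new indicator columns
encoded <- cbind(amount = expenses$amount, as.data.frame(indicators))
encoded
```

Because factor levels are sorted alphabetically by default, the indicator columns come out in the order Bills, Grocery, Transport rather than in the order the categories appear in the data.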
To start, let’s create our data set.

    data <- data.frame(
      category = as.factor(c("Transport", "Grocery", "Bills")),
      amount = c(100, 300, 200)
    )
    data

    ##    category amount
    ## 1 Transport    100
    ## 2   Grocery    300
    ## 3     Bills    200

There are many libraries we can use for encoding categorical variables. Let’s look at a few.
We start with the mltools package and its one_hot function. Because one_hot works on a data.table, we first need to convert our data frame with as.data.table.

    library(mltools)

    ## Warning: package 'mltools' was built under R version 4.0.5

    library(data.table)

    ## Warning: package 'data.table' was built under R version 4.0.5

    df.table <- as.data.table(data)
    df.encoded <- one_hot(df.table)
    df.encoded

    ##    category_Bills category_Grocery category_Transport amount
    ## 1:              0                0                  1    100
    ## 2:              0                1                  0    300
    ## 3:              1                0                  0    200

Next, we can use the caret package. Its dummyVars function builds an encoder from a formula, and predict applies that encoder to the data.

    library(caret)

    ## Warning: package 'caret' was built under R version 4.0.5
    ## Loading required package: lattice
    ## Loading required package: ggplot2

    dummy <- dummyVars(" ~ .", data = data)
    newdata <- data.frame(predict(dummy, newdata = data))
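For comparison, base R can produce the same kind of indicator columns without any extra package. The sketch below is an addition to the libraries covered above, not a replacement for them: it uses `model.matrix` from the built-in stats package and recreates the article's `data` so it runs on its own. The name `encoded` is hypothetical.

```r
# Recreate the article's example data so this block is self-contained
data <- data.frame(
  category = as.factor(c("Transport", "Grocery", "Bills")),
  amount = c(100, 300, 200)
)

# "- 1" removes the intercept, so every level of category keeps its own column
# instead of one level being dropped as the reference level
encoded <- as.data.frame(model.matrix(~ category + amount - 1, data = data))
encoded
```

Note that `model.matrix` names the new columns by pasting the variable name and the level together (for example `categoryBills`). When a formula contains several factors, only the first one keeps all of its levels after the intercept is removed, so this approach is most convenient for a single categorical variable.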