One-hot encoding is a method of converting categorical variables into numerical form. It is a preprocessing step that some machine learning algorithms need in order to perform well. In this article, we will learn how to do one-hot encoding in R.
For example, let’s say we have the following list of expenses with categories.
Category | Amount |
---|---|
Transport | 100 |
Grocery | 300 |
Bills | 200 |
We would like to convert this data to numerical form. Here is an example.
Amount | Transport | Grocery | Bills |
---|---|---|---|
100 | 1 | 0 | 0 |
300 | 0 | 1 | 0 |
200 | 0 | 0 | 1 |
We can see that each row now has a column for each category. The value is 1 if the row belongs to that category and 0 if it doesn’t. As you may be able to tell, this type of encoding adds one feature per category, so a variable with many categories produces many extra columns, but that is not always a problem.
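To make the idea concrete before turning to any packages, here is a minimal base R sketch that builds the indicator columns by hand. The names `expenses`, `indicators`, and `encoded` are chosen for illustration only, and the factor levels are set explicitly so the columns appear in the same order as the table above.

```r
# Hypothetical copy of the expense table, used only for this sketch.
expenses <- data.frame(
  category = factor(c("Transport", "Grocery", "Bills"),
                    levels = c("Transport", "Grocery", "Bills")),
  amount = c(100, 300, 200)
)

# For each category level, build a 0/1 vector marking the rows in that level.
# sapply() returns a matrix with one column per level, named after the level.
indicators <- sapply(levels(expenses$category), function(lvl) {
  as.integer(expenses$category == lvl)
})

# Combine the numeric column with the indicator columns.
encoded <- data.frame(amount = expenses$amount, indicators)
encoded
```

Each row has exactly one 1 among the indicator columns, which is where the name "one-hot" comes from.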
To start, let’s create our data set.
data <- data.frame(
category = as.factor(c("Transport", "Grocery", "Bills")),
amount = c(100, 300, 200)
)
data
## category amount
## 1 Transport 100
## 2 Grocery 300
## 3 Bills 200
There are many libraries we can use for encoding categorical variables. Let’s look at a few.
We start with the `mltools` package and its `one_hot` function. Because `one_hot` works on a `data.table`, we first need to convert our data frame to one.
library(mltools)
## Warning: package 'mltools' was built under R version 4.0.5
library(data.table)
## Warning: package 'data.table' was built under R version 4.0.5
df.table <- as.data.table(data)
df.encoded <- one_hot(df.table)
df.encoded
## category_Bills category_Grocery category_Transport amount
## 1: 0 0 1 100
## 2: 0 1 0 300
## 3: 1 0 0 200
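When a table has several factor columns, it can be useful to encode only some of them. The sketch below assumes the `cols` argument of `one_hot` for that purpose. The `payment` column and the name `dt` are hypothetical additions for illustration and are not part of the data set used in this article.

```r
library(mltools)
library(data.table)

# Hypothetical data with two factor columns.
dt <- data.table(
  category = factor(c("Transport", "Grocery", "Bills")),
  payment  = factor(c("Card", "Cash", "Card")),
  amount   = c(100, 300, 200)
)

# Encode only `category`; `payment` is left as a factor.
encoded_dt <- one_hot(dt, cols = "category")
names(encoded_dt)
```

Leaving `cols` at its default encodes every unordered factor column, which is why the article's data set stores `category` as a factor.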
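Next, we use the `caret` package. Its `dummyVars` function builds an encoder from a formula, and `predict` applies that encoder to the data.
library(caret)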
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: lattice
## Loading required package: ggplot2
dummy <- dummyVars(" ~ .", data=data)
newdata <- data.frame(predict(dummy, newdata = data))
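If the extra columns are a concern, for example in a linear model where a full set of indicator columns is redundant with the intercept, `dummyVars` accepts a `fullRank` argument that drops one level per factor. The following is a sketch under that assumption. The names `expenses`, `dummy_fr`, and `encoded_fr` are illustrative, and the data is a copy of the article's data set so the block runs on its own.

```r
library(caret)

# Copy of the article's data set, repeated so this block is self-contained.
expenses <- data.frame(
  category = as.factor(c("Transport", "Grocery", "Bills")),
  amount = c(100, 300, 200)
)

# fullRank = TRUE keeps one fewer indicator column than there are levels;
# the dropped level is implied when all remaining indicators are 0.
dummy_fr <- dummyVars(" ~ .", data = expenses, fullRank = TRUE)
encoded_fr <- data.frame(predict(dummy_fr, newdata = expenses))
encoded_fr
```

With three categories this leaves two indicator columns plus `amount`, rather than the three indicator columns produced by the default settings.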