One Hot Encoding in R

06.20.2021

Intro

One-hot encoding is a method of converting categorical variables into numerical form. It is a preprocessing step that some machine learning algorithms need in order to perform well. In this article, we will learn how to do one-hot encoding in R.

For example, let’s say we have the following list of expenses with categories.

Category Amount
Transport 100
Grocery 300
Bills 200

We would like to convert this data to numerical form. Here is the same data after one-hot encoding.

Amount Transport Grocery Bills
100 1 0 0
300 0 1 0
200 0 0 1

We can see that each row now has an entry for each category: 1 if the row belongs to that category and 0 if it does not. As you may be able to tell, this type of encoding adds one new feature per category, so a variable with many categories produces many extra columns, but that is not always a problem.
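To make the rule above concrete, here is a minimal sketch that builds the indicator columns by hand in base R. The `expenses` data frame and the variable names are made up for illustration and simply mirror the expense table above.

```r
# Hypothetical example data, mirroring the expense table above
expenses <- data.frame(
  category = c("Transport", "Grocery", "Bills"),
  amount = c(100, 300, 200),
  stringsAsFactors = FALSE
)

# One indicator column per distinct category, in order of first appearance
categories <- unique(expenses$category)
indicators <- sapply(categories, function(level) {
  # 1 where the row belongs to this category, 0 otherwise
  as.integer(expenses$category == level)
})

# Combine the original numeric column with the new indicator columns
encoded <- data.frame(amount = expenses$amount, indicators)
encoded
##   amount Transport Grocery Bills
## 1    100         1       0     0
## 2    300         0       1     0
## 3    200         0       0     1
```

Each row contains exactly one 1 among the indicator columns, which is where the name "one-hot" comes from.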

Creating the Data Set

To start, let’s create our data set.

data <- data.frame(
  category = as.factor(c("Transport", "Grocery", "Bills")),
  amount = c(100, 300, 200)
)

data
##    category amount
## 1 Transport    100
## 2   Grocery    300
## 3     Bills    200
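Before encoding, it can help to check the levels of the factor, since each level will typically become one column. The sketch below rebuilds the same data frame so that it runs on its own.

```r
# Rebuild the same data frame so this snippet runs on its own
data <- data.frame(
  category = as.factor(c("Transport", "Grocery", "Bills")),
  amount = c(100, 300, 200)
)

# Factor levels are sorted alphabetically by default, not by order of appearance
levels(data$category)
## [1] "Bills"     "Grocery"   "Transport"

# Number of indicator columns a one-hot encoding of this column would produce
nlevels(data$category)
## [1] 3
```

The alphabetical ordering of the levels is why encoded columns usually appear as Bills, Grocery, Transport rather than in the order of the rows.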

One-Hot Encoding

There are many packages we can use for encoding categorical variables. Let’s look at a few.

We start with the mltools package, which provides the one_hot function. It works on data.table objects and encodes factor columns, so we first convert our data frame to a data.table.

library(mltools)
library(data.table)
df.table <- as.data.table(data)
df.encoded <- one_hot(df.table)
df.encoded
##    category_Bills category_Grocery category_Transport amount
## 1:              0                0                  1    100
## 2:              0                1                  0    300
## 3:              1                0                  0    200

Next, let’s use the caret package. Its dummyVars function builds an encoder from a formula, and predict applies that encoder to the data.

library(caret)
dummy <- dummyVars(" ~ .", data = data)
newdata <- data.frame(predict(dummy, newdata = data))
newdata
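If installing extra packages is not an option, base R can produce a similar result. The following is a minimal sketch using model.matrix from the built-in stats package. It is an alternative to the packages above rather than part of them, and the `encoded` name is only illustrative.

```r
# Rebuild the same data frame so this snippet runs on its own
data <- data.frame(
  category = as.factor(c("Transport", "Grocery", "Bills")),
  amount = c(100, 300, 200)
)

# "- 1" drops the intercept so every level of the factor gets its own column
encoded <- as.data.frame(model.matrix(~ category + amount - 1, data = data))

names(encoded)
## [1] "categoryBills"     "categoryGrocery"   "categoryTransport" "amount"
```

Note that with more than one factor in the formula, model.matrix gives a full set of indicator columns only to the first factor and drops one level from each of the others, so this sketch is best suited to a single categorical column.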