One-hot encoding is a method of converting categorical variables into numerical form. It is a preprocessing step that some machine learning algorithms need in order to work with categorical data, and it can improve their performance. In this article, we will learn how to do one-hot encoding in R.
For example, let’s say we have the following list of expenses with categories.
We would like to convert this data to numerical form. Here is an example.
We can see that each row now has an entry for each category. There is a 0 if the row doesn’t have the category and a 1 if the row does. As you may be able to tell, this type of encoding adds one new feature per category, so the number of columns can grow quickly, but that is not always a problem.
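To make the 0/1 rule concrete, here is a minimal base R sketch that builds the indicator columns by hand. It assumes a small three-row expense data set; the names `expenses`, `indicators`, and `encoded` are illustrative only, and this is not how the packages discussed in this article implement the encoding internally.

```r
# Illustrative data: three expenses, each with exactly one category.
expenses <- data.frame(
  category = factor(c("Transport", "Grocery", "Bills")),
  amount = c(100, 300, 200)
)

# For every level of the factor, build a column that is 1 where the row
# belongs to that level and 0 otherwise.
indicators <- sapply(
  levels(expenses$category),
  function(lvl) as.integer(expenses$category == lvl)
)

# Prefix the new columns with the original column name and attach the
# remaining numeric column unchanged.
colnames(indicators) <- paste0("category_", levels(expenses$category))
encoded <- cbind(as.data.frame(indicators), amount = expenses$amount)
encoded
```

Each distinct category becomes its own column, which is why a variable with many levels produces many new features.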
To start, let’s create our data set.
data <- data.frame(
  category = as.factor(c("Transport", "Grocery", "Bills")),
  amount = c(100, 300, 200)
)
data
##    category amount
## 1 Transport    100
## 2   Grocery    300
## 3     Bills    200
There are many libraries we can use for encoding categorical variables. Let’s look at a few.
We start by using the mltools package. We can use its one_hot method. We do need to convert the data to a data.table first for mltools to use.
library(mltools)
library(data.table)
df.table <- as.data.table(data)
df.encoded <- one_hot(df.table)
df.encoded
##    category_Bills category_Grocery category_Transport amount
## 1:              0                0                  1    100
## 2:              0                1                  0    300
## 3:              1                0                  0    200
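As a quick sanity check on an encoded table like this one, the sketch below verifies that every row has exactly one 1 among the category columns and then recovers the original labels. It uses only base R and retypes the values from the output above into a plain data frame, so it runs without mltools; the names `encoded`, `indicator_cols`, and `decoded` are illustrative assumptions.

```r
# Encoded values copied from the output above, held in a plain data frame.
encoded <- data.frame(
  category_Bills = c(0, 0, 1),
  category_Grocery = c(0, 1, 0),
  category_Transport = c(1, 0, 0),
  amount = c(100, 300, 200)
)

# Select the indicator columns by their shared "category_" prefix.
indicator_cols <- grep("^category_", names(encoded), value = TRUE)

# Each row should contain exactly one 1 across the indicator columns.
stopifnot(all(rowSums(encoded[indicator_cols]) == 1))

# max.col() returns the position of the largest value in each row,
# which here is the column holding the 1.
position <- max.col(as.matrix(encoded[indicator_cols]))

# Strip the prefix to get the original category labels back.
decoded <- sub("^category_", "", indicator_cols[position])
decoded
```

Being able to reverse the encoding this way shows that no information about the category is lost.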
Next, let’s use the caret package. Its dummyVars function builds the encoder, and predict applies it to the data.
library(caret)
dummy <- dummyVars(" ~ .", data = data)
newdata <- data.frame(predict(dummy, newdata = data))
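For comparison, a similar result can be sketched without any extra package using base R's `model.matrix()`. This is a different technique from the caret call above, shown only as a dependency-free alternative; it recreates the article's `data` so that it runs on its own, and the `encoded` name is illustrative.

```r
# Same data set as used throughout the article.
data <- data.frame(
  category = as.factor(c("Transport", "Grocery", "Bills")),
  amount = c(100, 300, 200)
)

# "- 1" removes the intercept so that every level of `category` gets its
# own indicator column instead of one level being absorbed as a baseline.
encoded <- as.data.frame(model.matrix(~ . - 1, data = data))
encoded
```

Note that `model.matrix()` names the new columns by joining the variable name and the level without a separator (for example `categoryBills`), so the column names differ from those produced by the packages above.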