Ordinal Encoding in R

07.04.2021

Intro

Ordinal Encoding is similar to Label Encoding: we take a list of categories and convert them into integers. However, unlike Label Encoding, we preserve an order. For example, if we are encoding rankings of 1st place, 2nd place, etc., there is an inherent order. In this article, we will learn how to use Ordinal Encoding in R.
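
To make the difference concrete, here is a minimal base R sketch that does not use any encoding package. The `sizes` vector and its labels are made up for illustration. By default, R sorts factor levels alphabetically, so the integer codes carry no meaningful order; supplying the levels explicitly gives codes that follow the intended order.

```r
# Hypothetical example data: these labels are made up for illustration
sizes <- c("low", "high", "medium", "low")

# Label-style encoding: levels default to sorted order (high, low, medium)
label.codes <- as.integer(factor(sizes))

# Ordinal encoding: we state the order ourselves (low < medium < high)
ordinal.codes <- as.integer(factor(sizes, levels = c("low", "medium", "high")))

label.codes
ordinal.codes
```

With the default levels, "high" receives the smallest code, which is usually not what we want for ordered data.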

The Data

Let’s create a small data frame with rankings and their prize money. The data here is fake, but the process will work on any data frame. We have a list of ranks, which is a categorical variable, that we want to encode.

df <- data.frame(
  money = c(3000, 1000, 2000, 2000),
  rank = as.factor(c('1st', '2nd', '3rd', '1st'))
)
df
##   money rank
## 1  3000  1st
## 2  1000  2nd
## 3  2000  3rd
## 4  2000  1st

Using an Ordinal Encoder in R

To encode our ranks, that is, to turn them into numbers, we will use the encode_ordinal function from the cleandata package. To use it, we simply pass our data frame of categorical columns to encode_ordinal. Note that the columns we pass must be of type factor. We also pass a vector giving the order of the levels: the first level is encoded as 1, the second as 2, and so on.

library(cleandata)
## Warning: package 'cleandata' was built under R version 4.0.5
order.list = c('1st', '2nd', '3rd')

## Create a data frame of all categories. We only have one here
cat.df = df[, c("rank"), drop = FALSE]

encoded = encode_ordinal(cat.df, order = order.list)
##   rank  
##  1st:2  
##  2nd:1  
##  3rd:1  
## coded 1 cols 3 levels 
##  rank 
##  1:2  
##  2:1  
##  3:1
encoded
##   rank
## 1    1
## 2    2
## 3    3
## 4    1
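
As a sanity check, the same integers can be reproduced with base R alone. This is a sketch of an equivalent approach, not a description of how cleandata works internally: we rebuild the factor with explicit levels and take its integer codes. The data frame is recreated so the block runs on its own, and `rank.codes` is a name introduced here.

```r
df <- data.frame(
  money = c(3000, 1000, 2000, 2000),
  rank = as.factor(c('1st', '2nd', '3rd', '1st'))
)
order.list <- c('1st', '2nd', '3rd')

# Rebuild the factor with an explicit level order, then take the integer codes
rank.codes <- as.integer(factor(as.character(df$rank), levels = order.list))
rank.codes

# In this base R approach, a label missing from order.list becomes NA
stopifnot(!anyNA(rank.codes))
```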

Encoding many variables concurrently

Let’s update our data frame to have t-shirt sizes along with the ranks. We will also add ranks for a second tournament. Then, let’s see how we can encode multiple categories at the same time.

df <- data.frame(
  money = c(3000, 1000, 2000, 2000),
  rank = as.factor(c('1st', '2nd', '3rd', '1st')),
  rank2 = as.factor(c('2nd', '3rd', '3rd', '1st')),
  shirt = as.factor(c('sm', 'sm', 'med', 'lrg'))
)
df
##   money rank rank2 shirt
## 1  3000  1st   2nd    sm
## 2  1000  2nd   3rd    sm
## 3  2000  3rd   3rd   med
## 4  2000  1st   1st   lrg
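
Since the encoder expects factor columns, it can help to confirm which columns qualify before selecting them. The following is a small base R sketch; `is.cat` and `cat.cols` are names introduced here for illustration, and the data frame is recreated so the block is self-contained.

```r
df <- data.frame(
  money = c(3000, 1000, 2000, 2000),
  rank = as.factor(c('1st', '2nd', '3rd', '1st')),
  rank2 = as.factor(c('2nd', '3rd', '3rd', '1st')),
  shirt = as.factor(c('sm', 'sm', 'med', 'lrg'))
)

# TRUE for each column stored as a factor, FALSE otherwise
is.cat <- vapply(df, is.factor, logical(1))

# Names of the columns that are candidates for encoding
cat.cols <- names(df)[is.cat]
cat.cols
```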

Now, we can create a data frame consisting of multiple categorical columns and pass it to encode_ordinal.

library(cleandata)

order.list = c('1st', '2nd', '3rd')

## Create a data frame of all categories. Here we select both rank columns
cat.df = df[, c("rank", "rank2"), drop = FALSE]

## Encode both ranks
encoded = encode_ordinal(cat.df, order = order.list)
##   rank   rank2  
##  1st:2   1st:1  
##  2nd:1   2nd:1  
##  3rd:1   3rd:2  
## coded 2 cols 3 levels 
##  rank  rank2
##  1:2   1:1  
##  2:1   2:1  
##  3:1   3:2
encoded
##   rank rank2
## 1    1     2
## 2    2     3
## 3    3     3
## 4    1     1
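## Encode shirts
## Note: order.list still holds the rank labels ('1st', '2nd', '3rd'), which
## do not match the shirt levels, so the shirt column comes back unchanged
cat.df = df[, c("shirt"), drop = FALSE]
encoded = encode_ordinal(cat.df, order = order.list)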
##  shirt  
##  lrg:1  
##  med:1  
##  sm :2  
## coded 1 cols 3 levels 
##  shirt  
##  lrg:1  
##  med:1  
##  sm :2
encoded
##   shirt
## 1    sm
## 2    sm
## 3   med
## 4   lrg
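
In the shirt output above, the column is unchanged, since the order vector passed in lists rank labels rather than shirt sizes. Each variable needs an order that matches its own levels. The sketch below shows one way to handle this in base R, with one order vector per column. The shirt ordering (sm, med, lrg) is an assumption, and `orders`, `to.codes`, and `encoded.all` are hypothetical names. The same shirt ordering is presumably what encode_ordinal would need in its order argument.

```r
df <- data.frame(
  money = c(3000, 1000, 2000, 2000),
  rank = as.factor(c('1st', '2nd', '3rd', '1st')),
  rank2 = as.factor(c('2nd', '3rd', '3rd', '1st')),
  shirt = as.factor(c('sm', 'sm', 'med', 'lrg'))
)

# One order vector per column; the shirt order is assumed (smallest to largest)
orders <- list(
  rank = c('1st', '2nd', '3rd'),
  rank2 = c('1st', '2nd', '3rd'),
  shirt = c('sm', 'med', 'lrg')
)

# Hypothetical helper: map one column to integer codes in the given order
to.codes <- function(x, order) {
  as.integer(factor(as.character(x), levels = order))
}

# Pair each column with its own order vector and encode them all
encoded.all <- as.data.frame(Map(to.codes, df[names(orders)], orders))
encoded.all
```

Keeping the orders in a named list makes it harder to apply one variable's ordering to another by accident.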