Label Encoding in Python

07.06.2021

Intro

Label Encoding is one of many encoding techniques for converting categorical variables into numerical variables, which many machine learning algorithms require. Label Encoding is used when your categories don't have a natural order. If your data is ordered, like small, medium, large, you should use Ordinal Encoding instead. In this article, we will learn how to use label encoding in Python.
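For contrast, here is a minimal sketch of the ordinal case mentioned above. It assumes scikit-learn's OrdinalEncoder with an explicitly supplied category order; the `sizes` data frame and its values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered data: small < medium < large
sizes = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Pass the order explicitly; without it, categories are sorted alphabetically
oe = OrdinalEncoder(categories=[["small", "medium", "large"]])

# OrdinalEncoder expects 2D input, so select the column with double brackets
encoded_sizes = oe.fit_transform(sizes[["size"]])

print(encoded_sizes.ravel())
```

With this order, small maps to 0, medium to 1, and large to 2, and the result is returned as floats.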

The Data

Let's create a small data frame with cities and their populations. The data here is fake, but the process will work on any data frame. The city column is a categorical variable, and it is the one we want to encode.

import pandas as pd

df = pd.DataFrame({
    "city": ['Dallas', 'Austin', 'Denver', 'Boulder'],
    "pop": [1000, 2000, 3000, 4000]
})

df.head()
      city   pop
0   Dallas  1000
1   Austin  2000
2   Denver  3000
3  Boulder  4000

Using a Label Encoder in Python

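To encode our cities, that is, to turn them into numbers, we will use the LabelEncoder class from the sklearn.preprocessing module. We first create an instance of the class, then call its fit_transform method on the column. The encoder sorts the unique values and numbers them starting from 0, so Austin becomes 0 and Denver becomes 3. Note that scikit-learn documents LabelEncoder as a tool for encoding target labels; for input features, its OrdinalEncoder and OneHotEncoder classes are the documented alternatives.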

from sklearn import preprocessing
  
le = preprocessing.LabelEncoder()

le.fit_transform(df['city'])
array([2, 0, 3, 1])
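To see which number belongs to which city, we can inspect the fitted encoder's classes_ attribute. The sketch below rebuilds the same data frame so it runs on its own; the `mapping` dictionary is just an illustrative helper, not part of scikit-learn.

```python
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({
    "city": ['Dallas', 'Austin', 'Denver', 'Boulder'],
    "pop": [1000, 2000, 3000, 4000]
})

le = preprocessing.LabelEncoder()
codes = le.fit_transform(df['city'])

# classes_ holds the unique labels in sorted order;
# a label's position in that array is its encoded value
mapping = {str(city): code for code, city in enumerate(le.classes_)}

print(mapping)
```

The labels are sorted alphabetically, which is why Dallas, the first row, is encoded as 2 rather than 0.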

If we separate the fit and transform steps, we can reuse the fitted encoder to reverse the encoding. We can do the following:

  • Create a LabelEncoder
  • Fit the categories
  • Use transform to encode
  • Use inverse_transform to decode

le = preprocessing.LabelEncoder()

le = le.fit(df['city'])

encoded = le.transform(df['city'])

list(le.inverse_transform(encoded))
['Dallas', 'Austin', 'Denver', 'Boulder']
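In practice you will probably want to keep the encoded values next to the original column. The sketch below stores them in a new column (the name city_encoded is an arbitrary choice) and also shows what happens when a fitted encoder meets a label it never saw during fit. 'Houston' is a made-up example value.

```python
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({
    "city": ['Dallas', 'Austin', 'Denver', 'Boulder'],
    "pop": [1000, 2000, 3000, 4000]
})

le = preprocessing.LabelEncoder()
le.fit(df['city'])

# Store the encoded values in a new column
df['city_encoded'] = le.transform(df['city'])

# A label that was not present during fit cannot be transformed
try:
    le.transform(['Houston'])
    unseen_ok = True
except ValueError as err:
    unseen_ok = False
    print("Could not encode:", err)

print(df)
```

Because transform raises a ValueError for unseen labels, it is worth fitting the encoder on every category you expect to encounter.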