Label encoding is one of many techniques for converting categorical variables into numerical variables, which many machine learning algorithms require. Use label encoding when your categories have no inherent order. If your data is ordered, such as small, medium, large, use ordinal encoding instead. In this article, we will learn how to use label encoding in Python.
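For contrast, here is a minimal sketch of the ordinal case mentioned above. It assumes scikit-learn's `OrdinalEncoder` and a made-up `size` column (both are illustrative choices, not part of the city example that follows). The order is passed in explicitly through `categories` so that small < medium < large is preserved.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered data, used only for illustration
sizes = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Spell out the order explicitly; otherwise categories are sorted alphabetically
enc = OrdinalEncoder(categories=[["small", "medium", "large"]])

# OrdinalEncoder expects 2D input, hence the double brackets
size_codes = enc.fit_transform(sizes[["size"]])
print(size_codes.ravel())  # small -> 0, medium -> 1, large -> 2
```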
Let's create a small data frame with cities and their populations. The data here is fake, but the process will work on any data frame. The list of cities is the categorical variable that we want to encode.
import pandas as pd
df = pd.DataFrame({
    "city": ['Dallas', 'Austin', 'Denver', 'Boulder'],
    "pop": [1000, 2000, 3000, 4000]
})
df.head()
| | city | pop |
|---|---|---|
| 0 | Dallas | 1000 |
| 1 | Austin | 2000 |
| 2 | Denver | 3000 |
| 3 | Boulder | 4000 |
To encode our cities, that is, to turn them into numbers, we will use the `LabelEncoder` class from the `sklearn.preprocessing` package. We first create an instance of the class, then call its `fit_transform` method to encode our variable.
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit_transform(df['city'])
array([2, 0, 3, 1])
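To see where those numbers come from, it can help to inspect the fitted encoder. The sketch below relies on the encoder's `classes_` attribute, which holds the unique labels in sorted order, so each label's position is its code. The `mapping` dictionary is just an illustrative helper name.

```python
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({"city": ["Dallas", "Austin", "Denver", "Boulder"]})

le = preprocessing.LabelEncoder()
codes = le.fit_transform(df["city"])

# classes_ stores the sorted unique labels; the index of each label is its code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)  # {'Austin': 0, 'Boulder': 1, 'Dallas': 2, 'Denver': 3}
```

Because the codes follow alphabetical order rather than anything meaningful about the cities, they should not be read as a ranking.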
A fitted encoder can also reverse the encoding. Here we separate the fit and transform steps, then call `inverse_transform` to recover the original labels:
le = preprocessing.LabelEncoder()
le = le.fit(df['city'])
encoded = le.transform(df['city'])
list(le.inverse_transform(encoded))
['Dallas', 'Austin', 'Denver', 'Boulder']
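In practice you will usually want to keep the codes alongside the original data. The following sketch stores them in a new column (the name `city_code` is an arbitrary choice) and also shows what happens when the encoder meets a label it was not fitted on, using a made-up city, `Houston`. In that case `transform` raises a `ValueError`.

```python
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({
    "city": ["Dallas", "Austin", "Denver", "Boulder"],
    "pop": [1000, 2000, 3000, 4000],
})

le = preprocessing.LabelEncoder()

# Store the encoded values in a new column next to the original labels
df["city_code"] = le.fit_transform(df["city"])
print(df)

# The encoder only knows the labels it saw during fit
try:
    le.transform(["Houston"])
    unseen_ok = True
except ValueError as err:
    unseen_ok = False
    print(f"Cannot encode unseen label: {err}")
```

If new categories can appear later, one option is to refit the encoder on the full set of labels before transforming.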