How to Encode Categorical Variables in Sklearn

02.08.2021

Intro

Data sets often contain categorical data. For example, say we have a size column whose values are small, medium, and large. Often, we want to transform these variables into numbers so that we can apply mathematical algorithms to them. In this article, we will see how to encode categorical variables in sklearn.

Encoding Variables

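To encode categorical variables, we can use the OneHotEncoder class and call fit_transform on the data. In the example below, we encode iris.target, the class labels of the iris data set. OneHotEncoder expects a 2D input, so we first reshape the labels into a single column. By default fit_transform returns a sparse matrix, which we convert to a regular array with toarray.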

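from sklearn import datasets
from sklearn import preprocessing

# Load the iris data set; y holds the class labels as the integers 0, 1, and 2
iris = datasets.load_iris()

X = iris.data
y = iris.target

# Reshape y into a single column, since OneHotEncoder expects a 2D array.
# toarray() converts the sparse result into a dense NumPy array.
cat_encoder = preprocessing.OneHotEncoder()
encoded = cat_encoder.fit_transform(y.reshape(-1, 1)).toarray()

# Each row has a 1 in the column for its class and 0 everywhere else
print(encoded)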
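
The same approach should work for string categories, such as the size column mentioned in the intro. The following is a minimal sketch rather than a definitive recipe: the sizes array is made-up illustration data, and the variable names are assumptions, not part of any real data set.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical size column; the values are illustrative only.
# reshape(-1, 1) turns the 1D array into a single column.
sizes = np.array(["small", "medium", "large", "medium"]).reshape(-1, 1)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded_sizes = encoder.fit_transform(sizes).toarray()

# The learned categories are sorted, so the columns are large, medium, small.
print(encoder.categories_)
print(encoded_sizes)

# inverse_transform maps the one-hot rows back to the original labels.
decoded_sizes = encoder.inverse_transform(encoded_sizes)
print(decoded_sizes.ravel())
```

The categories_ attribute shows which output column corresponds to which label, which helps when reading the encoded array.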