A Box-Cox transformation is a preprocessing technique used to transform a distribution into a normally distributed one. Normal distribution is often a requirement, especially for linear regression. The Box-Cox transformation doesn't guarantee that your data will be normally distributed afterwards, so you will always need to check. In this article, we will learn how to conduct a box-cox transformation in python.
We begin by creating some mock data. We will generate samples from the exponential distribution. Notice from the histogram that our data is not normal.
import numpy as np
from scipy.stats import boxcox
import seaborn as sns
data = np.random.exponential(size=1000)
sns.displot(data)
<seaborn.axisgrid.FacetGrid at 0x27052effbe0>
The scipy.stats
package provides a function called boxvox
that will automatically transform the data for you. We pass our X vector in and the transformed data is returned.
transformed_data, best_lambda = boxcox(data)
sns.displot(transformed_data)
<seaborn.axisgrid.FacetGrid at 0x27052effb80>
Notice that our data is more normal, but not completely normal. This is why you need to confirm with a graph. After this visual check, it would be good to run other normality tests, like shahpiro-wilk, to give further evidence of normality.