A Box-Cox transformation is a preprocessing technique used to transform a non-normal distribution into an approximately normal one. Approximate normality is a common assumption in statistical methods; linear regression, for example, assumes normally distributed residuals. The Box-Cox transformation doesn’t guarantee that your data will be normally distributed afterwards, so you will always need to check. In this article, we will learn how to conduct a Box-Cox transformation in R.
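To make the idea concrete, here is a minimal sketch of the one-parameter Box-Cox formula written by hand. The helper name box_cox_manual and the example vector v are hypothetical and used only for illustration; they do not come from any package.

```r
# Hypothetical helper (not part of any package): one-parameter Box-Cox transform
#   y = (x^lambda - 1) / lambda   if lambda != 0
#   y = log(x)                    if lambda == 0
box_cox_manual = function(x, lambda) {
  stopifnot(all(x > 0))  # the transform is only defined for positive values
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

v = c(0.5, 1, 2, 4)     # illustrative values, assumed for this sketch
box_cox_manual(v, 1)    # lambda = 1 only shifts the data: v - 1
box_cox_manual(v, 0.5)  # a square-root-like transform
box_cox_manual(v, 0)    # lambda = 0 is the natural log
```

The parameter lambda controls how strongly the transformation compresses large values, which is why choosing it well matters for how normal the result looks.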
We begin by creating some mock data. We will generate samples from the exponential distribution, which is strictly positive (Box-Cox requires positive values) and strongly right-skewed. Notice from the histogram that our data is definitely not normal.
set.seed(42)  # make the mock data reproducible
x = rexp(10000, rate = 5)  # 10,000 draws from an exponential distribution
hist(x)
There are quite a few methods for using Box-Cox in R. However, many of them require a model. The forecast package provides a function called BoxCox that will transform the data for you and, with lambda = "auto", select the transformation parameter automatically. We pass our x vector in and the transformed data is returned.
library(forecast)
x.transformed = BoxCox(x, lambda = "auto")
hist(x.transformed)
Notice that our data is more normal, but not completely normal. This is why you need to check the result rather than assume it. After this visual check, it would be good to run a formal normality test, such as Shapiro-Wilk, to give further evidence of normality.
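The follow-up checks described above could look roughly like the sketch below. To keep it self-contained in base R, it recreates the mock data and uses a hand-picked lambda of 0.25 as a stand-in for the transformed data; that value and the names x.check, idx, sw.raw and sw.check are assumptions made for this example, and BoxCox(x, lambda = "auto") may well choose a different lambda.

```r
set.seed(42)
x = rexp(10000, rate = 5)

# Stand-in for the transformed data: Box-Cox with a hand-picked lambda of 0.25
# (an assumption for this sketch, not the value chosen by the forecast package)
lambda = 0.25
x.check = (x^lambda - 1) / lambda

# Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(x.check)
qqline(x.check)

# shapiro.test() accepts between 3 and 5000 observations,
# so test a random subsample of the 10,000 values
idx = sample(length(x.check), 5000)
sw.raw = shapiro.test(x[idx])
sw.check = shapiro.test(x.check[idx])
sw.raw$statistic    # W for the untransformed data
sw.check$statistic  # W for the transformed data; closer to 1 is more normal
```

With thousands of observations, Shapiro-Wilk tends to flag even small departures from normality, so it is usually worth reading the p-value together with the Q-Q plot rather than on its own.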