How to Build a Random Forest Model with Important Features in Sklearn

02.28.2021

Intro

In a previous article, we learned how to find the most important features of a Random Forest model. In practice, it is often useful to simplify a model so that it generalizes better and is easier to interpret. To do that, we may want to fit a model using only the important features. In this article, we will learn how to fit a Random Forest model using only the important features in Sklearn.

Building a Model with Important Features

To build a random forest model with only important features, we use the SelectFromModel class from the feature_selection module. We create an instance of SelectFromModel and pass it a random forest estimator (in this example we use a classifier). We also specify a threshold for "how important" we want features to be. Any feature with an importance below 0.2 will not be used.
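To make the threshold concrete, here is a minimal sketch that fits a random forest on the iris data and compares each feature's importance against 0.2. The random_state value and the variable names are assumptions added for reproducibility and illustration; they are not part of the original example.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()
features, target = iris.data, iris.target

# random_state is an assumption, added so repeated runs give the same importances
forest = RandomForestClassifier(random_state=0).fit(features, target)

threshold = 0.2
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    # SelectFromModel keeps features whose importance is >= the threshold
    status = "kept" if importance >= threshold else "dropped"
    print(f"{name}: {importance:.3f} ({status})")
```

The importances of a random forest sum to 1, so with four features a threshold of 0.2 is slightly below the average importance of 0.25.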

Next, we call the fit_transform method on our features and target. This fits the random forest inside the selector and returns only the feature columns that meet the threshold. Finally, we fit a random forest model as usual on the reduced set of important features.

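from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.feature_selection import SelectFromModel

# Load the iris data: 4 features, 3 classes
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Random forest estimator, used for feature selection and for the final model
randomforest = RandomForestClassifier()

# Keep only features whose importance is at least 0.2
selector = SelectFromModel(randomforest, threshold=0.2)

# Fit the selector and drop the unimportant feature columns
importantFeatures = selector.fit_transform(features, target)

# Fit the random forest on the reduced feature set
model = randomforest.fit(importantFeatures, target)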
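Once the selector is fitted, the same column filter has to be applied to any new data before the model can predict on it. The sketch below shows one way to do this, assuming a held-out split made with train_test_split; the split, the random_state values, and the variable names are illustrative assumptions, not part of the original example.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0, stratify=iris.target
)

# Fit the selector on the training data only
selector = SelectFromModel(RandomForestClassifier(random_state=0), threshold=0.2)
X_train_selected = selector.fit_transform(X_train, y_train)

# Apply the same column filter to the held-out data (no refitting)
X_test_selected = selector.transform(X_test)

# get_support() returns a boolean mask over the original columns
kept = selector.get_support()
print("Kept features:", [n for n, k in zip(iris.feature_names, kept) if k])

# Fit the final model on the selected columns and score it on the held-out data
model = RandomForestClassifier(random_state=0).fit(X_train_selected, y_train)
print("Test accuracy:", model.score(X_test_selected, y_test))
```

Calling transform rather than fit_transform on the held-out data keeps the selected columns identical to the ones the model was trained on.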