In a previous article, we learned how to find the most important features of a random forest model. In practice, it is often useful to simplify a model so that it generalizes better and is easier to interpret. Thus, we may want to fit a model with only the important features. In this article, we will learn how to fit a random forest model using only the important features in scikit-learn.
To build a random forest model with only important features, we need to use the SelectFromModel class from the feature_selection module. We create an instance of SelectFromModel, passing it a random forest estimator (in this example, a classifier). We also specify a threshold for how important a feature must be to be kept: any feature with an importance below 0.2 will not be used.
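To get a feel for what the threshold does, it can help to look at the importances directly. The sketch below is an illustration rather than a required step: it fits a random forest on the iris data and prints each feature's importance next to whether it would clear a 0.2 threshold. The `random_state=0` argument and the variable names are assumptions added here for reproducibility and readability.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# random_state is an assumption, added only so repeated runs agree
forest = RandomForestClassifier(random_state=0)
forest.fit(iris.data, iris.target)

threshold = 0.2
importances = forest.feature_importances_

# A feature is kept when its importance is at least the threshold
for name, importance in zip(iris.feature_names, importances):
    status = "kept" if importance >= threshold else "dropped"
    print(f"{name}: {importance:.3f} ({status})")
```

Because the importances of a random forest sum to 1, a threshold of 0.2 on a four-feature dataset always keeps at least one feature.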
Next, we call fit_transform on our features, which fits the random forest inside the selector and filters out the unimportant features. Finally, we fit a random forest model as usual, using only the important features.
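# Load libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.feature_selection import SelectFromModel

# Load the iris data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create the random forest classifier
randomforest = RandomForestClassifier()

# Create a selector that keeps features with importance >= 0.2
selector = SelectFromModel(randomforest, threshold=0.2)
# Fit the selector and keep only the important features
important_features = selector.fit_transform(features, target)
# Train a random forest using only the important features
model = randomforest.fit(important_features, target)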
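Once the selector is fitted, it can be useful to check which features were kept and to apply the same filtering to new observations before predicting. The following sketch rebuilds the iris example so that it runs on its own. The names `kept_names` and `new_observation`, and the use of `random_state=0`, are illustrative assumptions and not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
features, target = iris.data, iris.target

# random_state is an assumption, added only for reproducibility
selector = SelectFromModel(RandomForestClassifier(random_state=0), threshold=0.2)
important_features = selector.fit_transform(features, target)

# get_support() returns a boolean mask over the original columns
mask = selector.get_support()
kept_names = [name for name, keep in zip(iris.feature_names, mask) if keep]
print(kept_names)

# Train the final model on the reduced feature matrix
model = RandomForestClassifier(random_state=0).fit(important_features, target)

# New data must pass through the same fitted selector before prediction
new_observation = features[:1]
prediction = model.predict(selector.transform(new_observation))
print(prediction)
```

Reusing the fitted selector's transform method keeps the columns seen at prediction time consistent with those the model was trained on.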