Overview
While researching the feature selection literature for my PhD, I came across Boruta, a mostly overlooked but really clever all-relevant feature selection method. Since it didn’t have a Python implementation, I wrapped it up in a scikit-learn-like module and open-sourced it. I also extended and modified it slightly.
It has become quite popular and is now part of scikit-learn-contrib, with more than 700 stars on GitHub.
Example use
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# import path of the packaged (PyPI) release; the original standalone
# module was imported with `from boruta_py import boruta_py`
from boruta import BorutaPy

# load X and y as NumPy arrays
X = pd.read_csv('my_X_table.csv', index_col=0).values
# flatten y to 1-D, the shape scikit-learn expects for labels
y = pd.read_csv('my_y_vector.csv', index_col=0).values.ravel()

# define random forest classifier, utilising all cores and
# weighting classes inversely to their frequency in y
# ('balanced' replaces the older 'auto' option in scikit-learn)
forest = RandomForestClassifier(n_jobs=-1, class_weight='balanced')

# define Boruta feature selection method
feat_selector = BorutaPy(forest, n_estimators='auto', verbose=2)

# find all relevant features
feat_selector.fit(X, y)

# check selected features (boolean mask over the columns of X)
feat_selector.support_

# check ranking of features (rank 1 = selected)
feat_selector.ranking_

# call transform() on X to filter it down to selected features
X_filtered = feat_selector.transform(X)
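Because the example converts the tables to plain arrays with .values, the column names are lost, and support_ and ranking_ come back as bare arrays. A minimal sketch of mapping them back to names might look like the following. The helper summarize_selection is hypothetical (it is not part of BorutaPy), and the feature names, mask, and ranks are made-up stand-ins for a real fit.

```python
def summarize_selection(feature_names, support, ranking):
    """Pair each column name with its mask entry and rank, best rank first."""
    if not (len(feature_names) == len(support) == len(ranking)):
        raise ValueError("feature_names, support and ranking must have equal length")
    rows = [
        {"feature": name, "selected": bool(keep), "rank": int(rank)}
        for name, keep, rank in zip(feature_names, support, ranking)
    ]
    # sorted() is stable, so features with equal rank keep their column order
    return sorted(rows, key=lambda row: row["rank"])


# Made-up values standing in for feat_selector.support_ and feat_selector.ranking_
feature_names = ["age", "height", "noise_a", "weight", "noise_b"]
support = [True, True, False, True, False]
ranking = [1, 1, 3, 1, 2]

summary = summarize_selection(feature_names, support, ranking)
selected = [row["feature"] for row in summary if row["selected"]]
print(selected)
```

With a real fit, you would pass feat_selector.support_ and feat_selector.ranking_, plus the column names read from the DataFrame before calling .values. In ranking_, selected features have rank 1 and tentative ones rank 2.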
Further details
Please make sure to check out my blog post about how Boruta actually works and why you should use an all-relevant feature selection method.