While researching the feature selection literature for my PhD, I came across a mostly overlooked but really clever all-relevant feature selection method called Boruta. Since it didn't have a Python implementation, I wrapped it up in a scikit-learn-like module and open-sourced it. I also extended and modified it slightly.
It has become quite popular and is currently part of scikit-learn-contrib, with more than 700 stars on GitHub.
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

# load X and y; BorutaPy expects numpy arrays, and y should be 1-dimensional
X = pd.read_csv('my_X_table.csv', index_col=0).values
y = pd.read_csv('my_y_vector.csv', index_col=0).values.ravel()

# define random forest classifier, utilising all cores and
# sampling in proportion to y labels
forest = RandomForestClassifier(n_jobs=-1, class_weight='balanced')

# define Boruta feature selection method
feat_selector = BorutaPy(forest, n_estimators='auto', verbose=2)

# find all relevant features
feat_selector.fit(X, y)

# check selected features
feat_selector.support_

# check ranking of features
feat_selector.ranking_

# call transform() on X to filter it down to selected features
X_filtered = feat_selector.transform(X)
```
Please make sure to check out my blog post about how Boruta actually works and why you should use an all-relevant feature selection method.
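To give a flavour of the idea behind Boruta without the full iterative algorithm: it pits each real feature against "shadow" features, which are copies of the real columns with their values shuffled. A feature is only considered relevant if it beats the best shadow feature's importance. Below is a minimal one-round sketch of that comparison on a synthetic dataset; the dataset, seed, and single-round threshold are illustrative choices of mine, not what BorutaPy does internally (it repeats this test many times and applies statistical correction).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# toy data: the first 5 features are informative, the last 5 are pure noise
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, n_redundant=0,
                           shuffle=False, random_state=42)

# shadow features: each real column with its values independently shuffled,
# destroying any relationship with y while keeping the marginal distribution
X_shadow = np.column_stack(
    [rng.permutation(X[:, j]) for j in range(X.shape[1])]
)

# fit a forest on [real | shadow] and compare importances
forest = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
forest.fit(np.hstack([X, X_shadow]), y)

n = X.shape[1]
real_imp = forest.feature_importances_[:n]
shadow_max = forest.feature_importances_[n:].max()

# a feature scores a "hit" if it is more important than the best shadow
hits = real_imp > shadow_max
print(hits)
```

In the real algorithm this shuffle-fit-compare round is repeated, hits are accumulated per feature, and a binomial test decides whether each feature is confirmed, rejected, or left tentative.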