Overview
While researching the feature selection literature for my PhD, I came across several filter methods that use information-theoretic principles to select relevant features.
I wrapped three of the most promising mutual-information-based feature selection methods in a scikit-learn-like module and open-sourced it.
Example use
import pandas as pd
import mifs
# load X and y
X = pd.read_csv('my_X_table.csv', index_col=0).values
# flatten y to a 1-D vector, the shape scikit-learn style estimators expect
y = pd.read_csv('my_y_vector.csv', index_col=0).values.ravel()
# define MI_FS feature selection method
feat_selector = mifs.MutualInformationFeatureSelector()
# find all relevant features
feat_selector.fit(X, y)
# check selected features
feat_selector.support_
# check ranking of features
feat_selector.ranking_
# call transform() on X to filter it down to selected features
X_filtered = feat_selector.transform(X)
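For intuition about what a mutual-information-based filter does, here is a minimal, standard-library-only sketch. It is not the module's implementation: `mutual_information` and `greedy_mi_selection` are hypothetical helper names, the estimator only handles discrete values, and the scoring rule is a simple mRMR-style criterion (relevance to the target minus mean redundancy with the features already chosen) picked purely for illustration.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Estimate I(X; Y) in nats for two equal-length discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (a, b), count in pxy.items():
        # p(a, b) * log(p(a, b) / (p(a) * p(b))), written with raw counts
        mi += (count / n) * log(count * n / (px[a] * py[b]))
    return mi


def greedy_mi_selection(columns, target, k):
    """Greedily pick k column indices using an mRMR-style score
    (illustrative assumption, not the module's actual criterion)."""
    selected = []
    remaining = list(range(len(columns)))

    def score(j):
        relevance = mutual_information(columns[j], target)
        if not selected:
            return relevance
        redundancy = sum(
            mutual_information(columns[j], columns[s]) for s in selected
        ) / len(selected)
        return relevance - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: the target has four classes encoded by two "bits".
target = [0, 1, 2, 3, 0, 1, 2, 3]
f0 = [0, 0, 1, 1, 0, 0, 1, 1]  # high bit of the target
f1 = [0, 0, 1, 1, 0, 0, 1, 1]  # exact duplicate of f0 (fully redundant)
f2 = [0, 1, 0, 1, 0, 1, 0, 1]  # low bit of the target

selected = greedy_mi_selection([f0, f1, f2], target, k=2)
print(selected)  # [0, 2]
```

All three toy features are equally informative about the target on their own, so ranking by relevance alone could keep the duplicate. The redundancy term is what steers the second pick to `f2`, which carries the information `f0` is missing.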
Further details
Please make sure to check out my blog post about the mathematical principles behind these algorithms and how I made them fast by parallelizing their execution.