Overview

While researching the feature selection literature for my PhD, I came across several filter methods that leverage information-theoretic principles to select relevant features.

I wrapped three of the most promising mutual-information-based feature selection methods in a scikit-learn-style module and open-sourced it.
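To illustrate the general filter idea (this is a sketch, not the module's own implementation), a filter method scores each feature by its estimated dependence on the target and keeps the top scorers. scikit-learn's `mutual_info_classif` does exactly this kind of per-feature mutual information scoring; the synthetic data below is purely for demonstration:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: only feature 0 carries any information about y
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] > 0).astype(int)

# One MI score per feature; higher means more relevant to y
scores = mutual_info_classif(X, y, random_state=0)
top_feature = int(np.argmax(scores))  # feature 0 dominates here
```

Plain per-feature MI scoring like this ignores redundancy between features; the methods in the module (and the literature they come from) are precisely about correcting for that.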

Example use

    import pandas as pd
    import mifs

    # load X and y
    X = pd.read_csv('my_X_table.csv', index_col=0).values
    y = pd.read_csv('my_y_vector.csv', index_col=0).values.ravel()

    # define MI_FS feature selection method
    feat_selector = mifs.MutualInformationFeatureSelector()

    # find all relevant features
    feat_selector.fit(X, y)

    # check selected features
    feat_selector.support_

    # check ranking of features
    feat_selector.ranking_

    # call transform() on X to filter it down to selected features
    X_filtered = feat_selector.transform(X)

Further details

Please make sure to check out my blog post about the mathematical principles behind these algorithms and how I made them performant by parallelizing their execution.
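The blog post has the details; as a rough, self-contained sketch of the underlying idea (not the module's actual code), per-feature mutual information estimates are independent of one another, so they can be farmed out to a pool of workers:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def mi_binned(x, y, bins=10):
    """Crude histogram estimate of mutual information I(x; y) in nats."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y
    nz = p_xy > 0                           # skip empty cells to avoid log(0)
    return float((p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])).sum())

def mi_per_feature(X, y, bins=10):
    """Score every column of X against y, one independent task per feature."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda j: mi_binned(X[:, j], y, bins),
                             range(X.shape[1])))

# Demo on synthetic data where only feature 0 matters
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(float)
scores = mi_per_feature(X, y)
```

Because each column's score depends only on that column and on y, the work parallelizes embarrassingly well, which is what makes this family of methods fast in practice.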
