## Quick summary

I wrapped up three mutual information based feature selection methods in a scikit-learn like module. You can find it on my GitHub. It is very easy to use, you can run the example.py or import it into your project and apply it to your data like any other scikit-learn method.

## Background

### Mutual information based filter methods

In the past twenty years, a large variety of information theory based filter methods were developed. Filter methods represent a subclass of feature selection algorithms which are classifier independent and capture the discriminating power of each feature by calculating some sort of relevance statistics with respect to the outcome variable. These statistics are used in a heuristic scoring criterion which acts as a proxy measure for classification accuracy. Therefore filter methods can rank the features by their relevance computationally cheaply, without the need of training classifiers on the data.

Information theory based filter methods assess the mutual information between the $latex m$ features of $latex X^{m \times n}$ and $latex y$. To follow the conventional notation of introductory texts in information theory however, $latex X$ will have a different meaning in this post.

Information theory is concerned about quantifying the uncertainty present in distributions, by measuring their entropy. The entropy of the distribution of variable $latex X$, denoted as $latex H(X)$, is defined as:

$latex H(X) = -\sum_{x \in \mathcal{X}}p(x)\log p(x),$

where $latex x$ denotes a possible value of variable $latex X$, that it can take from a set of values $latex \mathcal{X}$, and $latex p(x)$ is the distribution of $latex X$. This equation quantifies the uncertainty in $latex X$, and for discrete variables it could be computed by estimating $latex p(X)$ as the fraction of observations taking on value $latex x$ from the total $latex N$: $latex \bar{p}(x)=\frac{\#x}{N}.$

If $latex p(x)$ peaks around a certain value, than the entropy of it will be low, while if it is uniform, meaning all events in $latex X$ are equally likely, it will be high. Furthermore, conditional entropy of two distributions could be defined as:

$latex H(X|Y)=-\sum_{y \in \mathcal{Y}}p(y) \sum_{x \in \mathcal{X}}p(x|y)\log p(x|y),$

which represents the amount of uncertainty remaining in$latex X$ after we have seen $latex Y$.

Mutual information (Shannon 1948) between $latex X$ and $latex Y$ is then defined as:

$latex I(X;Y) = H(X) - H(X|Y)$

$latex \sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}}p(xy)\log\frac{p(xy)}{p(x)p(y)}$

In this difference the first term represents the uncertainty before $latex Y$ is known, while the second term captures the uncertainty after $latex Y$ is known. Thus mutual information could also be thought of, as the amount of uncertainty in $latex X$ that is removed by knowing $latex Y$. Mutual information is symmetric $latex I(X;Y)=I(Y;X)$ and is zero if and only if $latex X$ and $latex Y$ are statistically independent.

Building on the these concepts, and given we have three discrete random variables $latex X,Y,Z$, the conditional mutual information could be defined as:

$latex I(X;Y|Z)=H(X|Z)-H(X|Y,Z)=I(X;Y,Z)-I(X;Z),$

where $latex I(X;Y,Z)$ is the joint mutual information.

As described earlier, filter methods attempt to rank the features based on statistical relevance measures. Therefore the simplest information theoretic filter techniques simply calculate the mutual information between all features and $latex y$ independently, rank them, and select the $latex k$ best ones. This is called Mutual Information Maximisation and it has been used in many early algorithms (Lewis 1992). This method is known to be suboptimal however, when the features are interdependent, as it will end up selecting highly correlated thus redundant features.

Using the Joint Mutual Information (JMI) Moody Moody 1999) and Meyer et al. (Meyer 2008) proposed a method that focuses on increasing the complementary information between the selected features:

$latex J_{jmi}(X_i) = \sum_{X_j \in S}I(X_i X_j;Y),$

where $latex J_{jmi}(X_i)$ is the JMI score for feature $latex i$ under consideration, and $latex X_j \in S$ represents all features that were selected in previous iterations of the algorithm. This selection criterion ensures, that the candidate feature $latex X_i$ has to be complementary with all the previously selected features $latex X_j$ in order to be added to $latex \bar{S}$.

Brown et al. (Brown 2012) in their extensive review, systematically benchmarked 17 information theoretic filter methods including the widely used Mutual Information Feature Selection (Battiti 1994) and Max-Relevance Min-Redundancy (Peng 2005) algorithms. They performed a large empirical study to rank these methods by their accuracy, stability and flexibility, and the JMI criterion based feature selection methods were picked as overall winners.

### How to select features using mutual information?

OK, see we have these mathematical concepts of entropy and mutual information and joint mutual information. They are very useful as they don’t really assume anything about our data, yet they can somehow capture if two random variables have any sort of similarity or “connection”. But how could we use these concepts to actually perform feature selection?

Let’s introduce some notation. We have a matrix of data $latex X^{n \times p}$ and an outcome variable $latex y$, which could be discrete or continuous. We want to select a set of features $latex S$ from $latex X$, such that $latex |S| << p$. Let’s call the set of all features $latex F, |F|=p$. We do this btw to understand which features govern the system mostly and/or to reduce the chance of over-fitting the noise in our data.

There are two problems to address with mutual information based feature selection:

1. As the number of features in our dataset ($latex p$) grows, the possible combinations to consider for $latex S$ grows exponentially and becomes intractable even for a few dozen features.
2. Although it is simple to compute entropy and consequently mutual information for discrete random variables, most of the time we have continuous measurements in real life datasets. Rounding them to the nearest integer, might seem tempting but it introduces a bias into our MI estimates. Using binning or histogram based methods are better but still suffer from the same bias issue.

#### The first problem

To overcome the first problem we could calculate the MI between each feature and our outcome variable and select an arbitrary amount of the top-scoring ones. This would be a univariate approach, and require us to assume that all of our features are completely independent from each other. As we have discussed in our earlier post about Boruta, this is rarely the scenario we find ourselves in.

A better way would be to calculate the MI between each feature and $latex y$, $latex \forall f_i \in F \text{compute} I(f_i;y)$. Then select $latex f_i$ with the largest $latex I(f_i;y)$, remove it from $latex F$ and add it to $latex S$.

Then in each consecutive round we would perform a greedy search and find the feature $latex f_i \in F$, which has the maximum joint mutual information with the previously selected features $latex f_i \in S$ and $latex y$:

$latex \arg\max_{f_i \in F-S}(\sum_{s \in S} (I(f_i, f_s; y)))$

This is the selection criteria of the JMI method and it was show to perform best out of 17 information theory based filter methods. As with every greedy search algorithm it can happen that we end up with suboptimal set of $latex S$.

A recent alteration of this criterion is the Joint Mutual Information Maximisation (JMIM) which uses the maximum of the minimum of the JMI:

$latex \arg\max_{f_i \in F-S}(\min_{s \in S} (I(f_i, f_s; y))).$

Beyond these two methods, I also implemented a third MI based filter method the Minimum Redundancy Maximum Relevance algorithm, which is quite famous  and heavily used in life sciences.

#### The second problem

To overcome the second problem I used the well established kNN based MI estimation methods in the case when both $latex X$ and $latex y$ are continuous. I used the excellent sklearn based Python implementation of Gael Varoquaux on GitHub to get started.

Most of the times in life sciences however, we have continuous measurements in $latex X$ and a discrete $latex y$ denoting some class membership like “treatment” vs “control”. None of the previously described MI estimating procedures apply to this scenario and until recently there was no accepted way to deal with this (at least to the best of my knowledge).

Luckily an extension of the above described kNN method got published recently which deals with this problem. I coded up the method in Python and compared the MI estimates I got from it, with the Matlab implementation that came with the paper. The numbers matched up nicely and the Python version was 5-10 times faster, thanks to the fantastic kNN algorithm in scikit-learn. So the mifs module can be used to select features from a continuous $latex X$ matrix against a categorical or a discrete outcome variable, using mutual information based metrics.

### Going parallel

Since these filter methods spend most of the CPU time on calculating MI between columns of $latex X$ and $latex y$, which is embarrassingly parallel, I used joblib to make the mifs module run in parallel. It utilizes all the cores the user has and this can easily cut the time spent on feature selection in half or even make it shorter.

## Feedback

Hopefully this easy to use implementation will make these information theory based feature selection methods more appealing to researchers. Please let me know if I cocked something up or if you think I could improve this work. I’m planning to adopt a chapter from my late stage PhD report into another blog post and show some benchmarking results where I compare the MI based filter methods, the Boruta method and other better established FS methods, so stay tuned..

Tags:

Topics:

Updated: