SymPy is an amazing library for symbolic mathematics in Python. It’s like Mathematica, and its online shell version, along with SymPy Gamma, is pretty much like Wolfram Alpha (WA). OK, I know you can ask WA some pretty cool questions, but let’s face it, most of us just want to find the derivative of a function, or simplify an expression, and not …
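For those two everyday tasks, differentiating and simplifying, a minimal sketch could look like this. The expressions are made up for illustration, and it assumes SymPy is installed:

```python
import sympy as sp

# Declare a symbolic variable
x = sp.symbols('x')

# Differentiate x**2 * sin(x) with respect to x
derivative = sp.diff(x**2 * sp.sin(x), x)

# Simplify a trigonometric expression
simplified = sp.simplify(sp.sin(x)**2 + sp.cos(x)**2)

print(derivative)
print(simplified)
```

Both results stay symbolic, so they can be passed on to further SymPy calls such as `sp.integrate` or `sp.solve`.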

## MIFS – parallelized Mutual Information based Feature Selection module

TL;DR: I wrapped up three mutual information based feature selection methods in a scikit-learn-like module. You can find it on my GitHub. It is very easy to use: you can run example.py, or import it into your project and apply it to your data like any other scikit-learn method.

```python
import pandas as pd
import mifs

# load X and y
X = pd.read_csv('my_X_table.csv', index_col=0).values
y = pd.read_csv('my_y_vector.csv', index_col=0).values

# define MI_FS feature selection method
feat_selector = mifs.MutualInformationFeatureSelector()

# find all relevant features
feat_selector.fit(X, y)

# check selected features
feat_selector.support_

# check ranking of features
feat_selector.ranking_

# call transform() on X to filter it down to selected features
X_filtered = feat_selector.transform(X)
```

### Mutual information based filter methods

The following bit is adopted …

## Linear algebra notes and LaTeX

TL;DR: I wanted to take a linear algebra course. I also wanted to learn LaTeX. I did both, and wrote a 70-something-page document from my notes on the Linear Algebra: Foundations to Frontiers MOOC by The University of Texas at Austin. It still isn’t completely finished, and I’m sure there are tons of typos in it, but here it is: LAFF notes. …

## BorutaPy – an all relevant feature selection method

TL;DR: There’s a pretty clever all-relevant feature selection method, which was conceived by Witold R. Rudnicki and developed by Miron B. Kursa at the ICM UW. Here is its website. While working on my PhD project I read their paper and really liked the method, but didn’t quite like how slow it was. It’s based on R’s Random Forest implementation which runs …