CorrMapper was the main project of my PhD. It is an online research tool for the integration and visualisation of complex biomedical and omics datasets.
Why is CorrMapper needed?
CorrMapper is best understood by considering the problems that researchers working with complex biomedical datasets face today.
Analytical platforms are getting a lot cheaper
Most of the analytical platforms we use today in life sciences and medical research are getting cheaper each year. Some of them are getting cheaper at a ridiculous rate.
The sequencing and assembly of the first human genomes cost hundreds of millions to billions of dollars ($3 billion for the US government funded project [1] and $300 million for the private Celera initiative) and took 11 and 3 years respectively.
Yet today, less than 15 years later, we are about to pass the $1000 price point in human genome sequencing with the Illumina HiSeq X Ten, which will be capable of sequencing 18,000 human genomes per year at the gold standard of 30× coverage.
Rise of multi-omics studies
Usually, when products get cheaper two things happen:
- more people start using them,
- people use more of them.
If this happens to several analytical platforms simultaneously, then researchers with the same budget will suddenly be able to design studies which utilise multiple platforms at once. This is precisely what has happened in the life sciences over the past 10 years, and multi-omics studies are becoming a lot more popular and affordable.
Multi-omics just means that we have more than one omics dataset, where omics is the terminology used in life sciences to collectively refer to the data coming from genomics, transcriptomics, metagenomics, metabolomics, etc.
Multi-omics studies have great potential as they allow us to examine the biology behind a disease from multiple viewpoints, each analytical platform opening a new window to the underlying biochemical processes.
For example, the change of gene expression in colon cancer is just as important as the changes in epigenomic markers. The gut microbiome cannot be ignored either, as there seems to be a complex, multi-level interplay between the bugs in our gut and our health.
Multi-omics studies have too many features
How do we relate these disparate datasets and combine them so that their complementary information can be harnessed to expand our biomedical knowledge?
Modern omics datasets are extremely feature rich, and in multi-omics studies this complexity is compounded by a second or even third dataset. Many of these features, however, might be completely irrelevant to the studied biological problem, or redundant in the context of others (multicollinearity). We wouldn’t expect, for example, all 25,000 human genes to be involved in breast cancer, or all urinary metabolites to be candidate biomarkers for liver failure.
Multi-omics studies are hard to visualise
Learning from such feature rich datasets inevitably incurs an increased computational cost. It also increases the chance of overfitting the noise in our data, while reducing the predictive power of our models. Finally, the correlation networks arising from these high-throughput datasets are often hard to interpret and explore due to their density and the lack of interactive tools.
Clinical metadata presents additional complexity
Finally, biomedical studies will have increasing amounts of metadata attached to the actual omics measurements, making the stratification of patients easier than ever before. This is largely due to the explosion of digital/wearable health gadgets and the radical improvement in the digitization of healthcare records.
How does CorrMapper work?
CorrMapper attempts the near impossible and addresses several of these problems at once.
Interactive metadata explorer
CorrMapper provides a fully automatically generated metadata explorer: an interactive dashboard that lets researchers explore and stratify their patient cohort, and that seamlessly integrates metadata with up to two omics datasets.
Advanced feature selection
Several cutting-edge feature selection algorithms have been built into CorrMapper to allow researchers to focus their attention on only those features which have the most discriminatory power with respect to a metadata variable, for example cancer vs. control. This not only decreases the computational cost of subsequent analysis steps but also helps with the interpretation of the data.
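To make the idea concrete, here is a minimal sketch of selecting features with respect to a binary metadata variable. It does not reproduce the algorithms built into CorrMapper; it swaps in a simple univariate ANOVA F-test from scikit-learn as a stand-in, and the data, the group labels and the choice of k = 5 are all assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Synthetic omics matrix: 60 samples x 200 features (illustrative data only).
n_samples, n_features = 60, 200
X = rng.normal(size=(n_samples, n_features))

# Hypothetical metadata variable: 0 = control, 1 = cancer.
y = np.repeat([0, 1], n_samples // 2)

# Shift the first five features in the "cancer" group so that they
# genuinely discriminate between the two groups.
X[y == 1, :5] += 3.0

# Keep the k features with the highest F-statistic with respect to y.
# This univariate filter is a stand-in, not CorrMapper's own method.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
selected_idx = selector.get_support(indices=True)

print("Selected feature indices:", selected_idx)
print("Reduced matrix shape:", X_selected.shape)
```

Only the reduced matrix would be passed on to the later steps, which is where the saving in computational cost comes from.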
Estimation of sparse covariance structures
The selected features are then used to estimate a sparse inverse covariance matrix using the graphical lasso algorithm. This is useful because, under Gaussian assumptions, an entry of the inverse covariance matrix is zero if and only if the row and column variables of that cell are conditionally independent given all other features in the matrix.
In plain English, we can work out which variables are conditionally independent (having removed all the confounding effects of the others). This is hugely important because without this, the complexity of biological systems would almost certainly guarantee that we will see a lot of spurious and confounded correlations in our analysis. Based on the inverse covariance matrix we can draw a network of correlated variables, see below.
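The difference between marginal and conditional dependence can be sketched with a toy example. The snippet below uses scikit-learn's GraphicalLasso on synthetic data in which a influences b and b influences c, so a and c are correlated only through b. The variable names, the sample size and the penalty alpha = 0.01 are assumptions chosen for illustration, not CorrMapper's settings.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Synthetic chain a -> b -> c: a and c are related only via b.
n = 2000
a = rng.normal(size=n)
b = a + rng.normal(size=n)
c = b + rng.normal(size=n)
X = np.column_stack([a, b, c])
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardise each column

# Ordinary (marginal) correlations: a and c look clearly correlated.
marginal = np.corrcoef(X, rowvar=False)

# Sparse inverse covariance estimate; alpha is an illustrative penalty.
model = GraphicalLasso(alpha=0.01).fit(X)
precision = model.precision_

# Convert the precision matrix to partial correlations:
# rho_ij = -P_ij / sqrt(P_ii * P_jj)
d = np.sqrt(np.diag(precision))
partial = -precision / np.outer(d, d)
np.fill_diagonal(partial, 1.0)

print("marginal corr(a, c):", round(marginal[0, 2], 3))
print("partial  corr(a, c):", round(partial[0, 2], 3))
print("partial  corr(a, b):", round(partial[0, 1], 3))
```

In this setup the marginal correlation between a and c is substantial, while their partial correlation is close to zero, so a network built from the inverse covariance matrix would keep the a-b and b-c edges and drop the confounded a-c edge.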
Robust estimation of statistical significance
The edges of the network represent Spearman rank correlations, for which p-values are estimated using 10,000 permutations. The p-values are made more precise using a Generalized Pareto Distribution based method. Finally, the p-values are corrected for multiple testing using a method selected by the user.
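A minimal sketch of the permutation step, assuming NumPy and SciPy, might look like the following. The function names are hypothetical, the Generalized Pareto refinement of small p-values is not included, and Benjamini-Hochberg is used only as one common example of a multiple testing correction rather than as CorrMapper's default.

```python
import numpy as np
from scipy.stats import rankdata

def permutation_pvalue(x, y, n_perm=10000, rng=None):
    """Spearman correlation with a two-sided permutation p-value."""
    rng = np.random.default_rng() if rng is None else rng
    # Spearman's rho is the Pearson correlation of the ranks.
    rx = rankdata(x) - rankdata(x).mean()
    ry = rankdata(y) - rankdata(y).mean()
    denom = np.sqrt((rx @ rx) * (ry @ ry))
    observed = (rx @ ry) / denom
    # Null distribution: break the pairing by shuffling one variable.
    null = np.array([rx @ rng.permutation(ry) for _ in range(n_perm)]) / denom
    exceed = np.sum(np.abs(null) >= abs(observed))
    # The +1 terms keep the estimate away from an impossible p = 0.
    return observed, (exceed + 1) / (n_perm + 1)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, in the original order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

# Synthetic example: y depends on x, z is unrelated noise.
rng = np.random.default_rng(42)
x = rng.normal(size=50)
y = x + 0.5 * rng.normal(size=50)
z = rng.normal(size=50)

rho_xy, p_xy = permutation_pvalue(x, y, rng=rng)
rho_xz, p_xz = permutation_pvalue(x, z, rng=rng)
adjusted = benjamini_hochberg([p_xy, p_xz])
print("x-y:", rho_xy, p_xy, "x-z:", rho_xz, p_xz, "adjusted:", adjusted)
```

With 10,000 permutations the smallest p-value this sketch can return is 1/10,001, which is one reason a tail approximation such as the Generalized Pareto method is useful for very strong correlations.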
Highly interactive visualisation of correlation networks
The resulting heatmap and networks of correlations are then simultaneously visualised and interlinked.
If the uploaded datasets have genomic features, CorrMapper can use a more appropriate genomic network visualisation, in which the features are laid out clockwise along the genome of the given species.
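One way to picture such a layout is to map each feature's genomic position to an angle on a circle. The sketch below is an assumption about how this could be done, not CorrMapper's implementation: the chromosome lengths are a made-up toy genome, the function names are hypothetical, and starting at the 12 o'clock position is an arbitrary choice.

```python
import math

# Hypothetical toy genome: chromosome name -> length in base pairs.
CHROM_LENGTHS = {"chr1": 1000, "chr2": 600, "chr3": 400}

def genome_angle(chrom, pos, chrom_lengths=CHROM_LENGTHS):
    """Angle in radians of a position, with chromosomes laid end to end."""
    total = sum(chrom_lengths.values())
    offset = 0
    for name, length in chrom_lengths.items():
        if name == chrom:
            return 2 * math.pi * (offset + pos) / total
        offset += length
    raise KeyError(f"unknown chromosome: {chrom}")

def to_xy(angle, radius=1.0):
    """Convert an angle to x, y; 0 is 12 o'clock and angles grow clockwise."""
    return radius * math.sin(angle), radius * math.cos(angle)

# Place three illustrative features on the circle.
features = [("geneA", "chr1", 0), ("geneB", "chr1", 500), ("geneC", "chr2", 0)]
for name, chrom, pos in features:
    x, y = to_xy(genome_angle(chrom, pos))
    print(f"{name}: x={x:.2f}, y={y:.2f}")
```

Edges of the correlation network can then be drawn as chords between these points, so that features which are close on the genome also sit close together on the circle.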