Successful PhD Defense – Machine Learning Approaches For High-Dimensional Genome-Wide Association Studies

On August 24th, Muhammad Ammar Malik successfully defended his PhD thesis, titled "Machine learning approaches for high-dimensional genome-wide association studies".

Genome-wide association studies (GWAS) aim to find statistical associations between genetic variants and traits of interest. Genetic variants that explain a large share of the variation in genome-wide gene expression may lead to confounding in expression quantitative trait loci (eQTL) analyses. To account for these confounding factors, we proposed LVREML, a method conceptually analogous to estimating fixed and random effects in linear mixed models (LMM). We showed that the maximum-likelihood latent variables can always be chosen orthogonal to the known factors (such as genetic variants). This indicates that the maximum-likelihood latent variables explain the sample covariance that is not already explained by the genetic variants in the model.
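The orthogonality property can be illustrated with a small simulation. The sketch below is not the LVREML estimator itself; it is a simplified stand-in that projects the known covariates out of the expression matrix and takes the leading principal components of the residual. The function name `residual_latent_factors` and all simulated data are assumptions made for illustration.

```python
import numpy as np

def residual_latent_factors(Y, Z, n_latent):
    """Return latent factors orthogonal to the known covariates.

    Y: (samples x genes) expression matrix.
    Z: (samples x covariates) known factors, e.g. genotypes.
    Simplified illustration only, not the LVREML estimator.
    """
    Y = Y - Y.mean(axis=0)
    Z = Z - Z.mean(axis=0)
    Q, _ = np.linalg.qr(Z)           # orthonormal basis of the known factors
    R = Y - Q @ (Q.T @ Y)            # expression variation not explained by Z
    U, _, _ = np.linalg.svd(R, full_matrices=False)
    return U[:, :n_latent]           # leading components of the residual

# Simulated data (hypothetical): 3 known variants and 2 hidden confounders
rng = np.random.default_rng(0)
n, g, k = 100, 50, 3
Z = rng.integers(0, 3, size=(n, k)).astype(float)
hidden = rng.normal(size=(n, 2))
Y = (Z @ rng.normal(size=(k, g))
     + hidden @ rng.normal(size=(2, g))
     + 0.1 * rng.normal(size=(n, g)))

X = residual_latent_factors(Y, Z, n_latent=2)

# Largest inner product between any known factor and any latent factor
max_overlap = np.abs((Z - Z.mean(axis=0)).T @ X).max()
```

Here `max_overlap` is zero up to floating-point error, which is the sense in which the latent factors capture only the covariance left unexplained by the known factors.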

To identify which traits are affected by the identified genetic variants, we need to reverse the functional relation between genotypes and traits. In this regard, multi-trait approaches are more advantageous than studying the traits individually. They benefit from increased power, because they take cross-trait covariances into account, and from a reduced multiple-testing burden, because a single test suffices to test for association with a set of traits. Therefore, we analyzed various machine learning methods (ridge regression, Naive Bayes/independent univariate correlation, random forests and support vector machines) for reverse regression in multi-trait GWAS. To evaluate the methods, we used genotypes, gene expression data and ground-truth transcriptional regulatory networks from the DREAM5 SysGen Challenge and from a cross between two yeast strains.
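The idea of reverse regression can be sketched as follows: instead of regressing each trait on a genotype, the genotype is predicted from all traits jointly, and the held-out prediction performance serves as a single association score per variant. This is a minimal sketch on simulated data using closed-form ridge regression in NumPy; the helper names `ridge_fit` and `reverse_score`, the penalty `alpha`, and the simulated effect sizes are assumptions, not the settings used in the thesis.

```python
import numpy as np

def ridge_fit(T, g, alpha=1.0):
    """Closed-form ridge regression of genotype g (n,) on traits T (n, p)."""
    t_mean, g_mean = T.mean(axis=0), g.mean()
    Tc = T - t_mean
    w = np.linalg.solve(Tc.T @ Tc + alpha * np.eye(T.shape[1]),
                        Tc.T @ (g - g_mean))
    return w, g_mean - t_mean @ w    # weights and intercept

def reverse_score(T_train, g_train, T_test, g_test, alpha=1.0):
    """Correlation between predicted and true genotype on held-out samples."""
    w, b = ridge_fit(T_train, g_train, alpha)
    pred = T_test @ w + b
    return np.corrcoef(pred, g_test)[0, 1]

# Simulated data (hypothetical): two binary variants and 20 traits
rng = np.random.default_rng(1)
n, p = 400, 20
G = rng.integers(0, 2, size=(n, 2)).astype(float)
T = rng.normal(size=(n, p))
T[:, :5] += 1.5 * G[:, [0]]          # variant 0 affects 5 traits, variant 1 none

train, test = slice(0, 300), slice(300, n)
scores = [reverse_score(T[train], G[train, j], T[test], G[test, j])
          for j in range(2)]
```

The variant that influences several traits is predicted well from the trait matrix, while the variant with no effect is not, so one score per variant replaces one test per variant-trait pair.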

We then extended the above approach to a human dataset. We used genotypes and brain-imaging features extracted from MRIs obtained from the ADNI database. Our results showed that the genotype prediction performance varied across genetic variants. This helped in identifying genomic regions that are associated with a high number of traits in high-dimensional phenotypic data. We also observed that the feature coefficients of the fitted machine learning models correlated with the strength of association between variants and traits. Our results also showed that non-linear machine learning methods such as random forests identified genetic variants distinct from those found by the linear methods. In particular, random forests identified single-nucleotide polymorphisms (SNPs) that were distinct from the ones identified by ridge and lasso regression. Further analysis showed that the identified SNPs belonged to genes previously associated with brain-related disorders.
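The relation between model coefficients and association strength can be illustrated on simulated data. The sketch below fits a ridge model that predicts one genotype from all traits and compares the absolute coefficients with the absolute univariate correlation between the genotype and each trait. The data, effect sizes and penalty `alpha` are hypothetical and unrelated to the ADNI analysis.

```python
import numpy as np

# Simulated data (hypothetical): one variant coded 0/1/2 and 30 traits
rng = np.random.default_rng(2)
n, p = 500, 30
g = rng.integers(0, 3, size=n).astype(float)
effects = np.zeros(p)
effects[:6] = np.linspace(1.0, 0.2, 6)     # six traits with decreasing effect
T = rng.normal(size=(n, p)) + np.outer(g, effects)

# Ridge coefficients for predicting the genotype from all traits jointly
Tc, gc = T - T.mean(axis=0), g - g.mean()
alpha = 10.0
coef = np.linalg.solve(Tc.T @ Tc + alpha * np.eye(p), Tc.T @ gc)

# Univariate association strength: |Pearson correlation| per trait
assoc = np.abs(Tc.T @ gc) / (np.linalg.norm(Tc, axis=0) * np.linalg.norm(gc))

# Agreement between coefficient magnitude and association strength
agreement = np.corrcoef(np.abs(coef), assoc)[0, 1]
```

In this simplified setting with independent trait noise, traits with larger simulated effects receive larger coefficients, so `agreement` is high; with strongly correlated traits the relation can be weaker.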

Publications

Restricted maximum-likelihood method for learning latent variance components in gene expression data with known and unknown confounders

(https://doi.org/10.1093/g3journal/jkab410)

Malik, M. A., & Michoel, T.

G3, 12(2), jkab410.

High-dimensional multi-trait GWAS by reverse prediction of genotypes

(https://doi.org/10.48550/arXiv.2111.00108)

Malik, M. A., Ludl, A. A., & Michoel, T.

In 2021 International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics. Springer, Cham.

rfPhen2Gen: A machine learning based association study of brain imaging phenotypes to genotypes

(https://doi.org/10.48550/arXiv.2204.00067)

Malik, M. A., Lundervold, A. S., & Michoel, T.

Under review at Bioinformatics Advances.