Biomarker Discovery in Parkinson’s disease using Machine Learning on Public Multi-omic Datasets: A Pilot Study
Objective: To use machine learning (ML) as a framework for biomarker studies and assess whether combining multiple data modalities in a single model outperforms models built on any one modality.
Background: Parkinson’s disease (PD) is a complex, progressive disorder in which rare and common genetic variants contribute to the risk, onset, and progression of disease. Given the long latency between damage to dopaminergic cells and the onset of clinical symptoms, there is an increasing need to identify reliable biomarkers that can predict 1) onset, 2) disease progression, or 3) response to therapeutic interventions. Currently, most biomarker studies focus on only a handful of features from a single assay. Preliminary results show that multimodal data improve discrimination between cases and controls.
Methods: Using public datasets such as the Parkinson’s Progression Markers Initiative (PPMI), accessed via the Accelerating Medicines Partnership - Parkinson’s Disease (AMP-PD), we developed an automated ML tool (GenoML) that applies different ML algorithms to genetic, clinical, and transcriptomic data, both separately and combined, to assess the accuracy, sensitivity, and specificity of predictive models. The analysis included 872 samples with sequenced genomes, clinical data, and ~50K normalized transcripts from RNA sequencing.
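The comparison described in the Methods can be sketched roughly as follows. This is an illustrative sketch, not GenoML's actual interface: the data are synthetic random features (no AMP PD data are used), the helper names `make_block` and `evaluate` are hypothetical, the feature counts are arbitrary, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so that the example depends only on NumPy and scikit-learn. Only the sample count (872) and the 70/30 split come from the abstract.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples = 872  # sample count from the abstract; all values below are synthetic

# Synthetic case/control labels (1 = case, 0 = control).
y = rng.integers(0, 2, size=n_samples)


def make_block(n_features, signal):
    """Hypothetical helper: random features whose first 3 columns carry class signal."""
    X = rng.normal(size=(n_samples, n_features))
    X[:, :3] += signal * y[:, None]
    return X


# Feature counts and signal strengths are arbitrary assumptions for illustration.
modalities = {
    "clinical": make_block(10, 0.8),
    "genetic": make_block(50, 0.5),
    "transcriptomic": make_block(100, 0.5),
}
# "Combined" simply concatenates the per-sample feature blocks column-wise.
modalities["combined"] = np.hstack(
    [modalities[k] for k in ("clinical", "genetic", "transcriptomic")]
)


def evaluate(X, y, model):
    """Train on 70% of samples and report metrics on the withheld 30%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=0
    )
    model.fit(X_tr, y_tr)
    prob = model.predict_proba(X_te)[:, 1]
    pred = (prob >= 0.5).astype(int)  # 0.5 threshold is an assumption
    tn, fp, fn, tp = confusion_matrix(y_te, pred, labels=[0, 1]).ravel()
    return {
        "auc": roc_auc_score(y_te, prob),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "n_test": len(y_te),
    }


results = {}
for name, X in modalities.items():
    # Logistic regression for clinical data, gradient boosting elsewhere,
    # loosely mirroring the algorithm choices reported in the Results.
    if name == "clinical":
        model = LogisticRegression(max_iter=1000)
    else:
        model = GradientBoostingClassifier(n_estimators=50, random_state=0)
    results[name] = evaluate(X, y, model)
    r = results[name]
    print(f"{name:>15}: AUC={r['auc']:.3f} "
          f"sens={r['sensitivity']:.2f} spec={r['specificity']:.2f}")
```

The numbers this prints describe the synthetic data only and should not be compared with the values reported in the Results.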
Results: We selected the best algorithm for each data type by area under the curve (AUC) for predicting peri-diagnostic PD. Genetic and transcriptomic data each performed best with XGBoost, while clinical data performed best with logistic regression. Each data type performed well on its own (clinical: AUC=85.5%; genetic: AUC=79.5%; transcriptomic: AUC=79.6%). Clinical data had the lowest sensitivity (clinical: 0.71; genetic: 0.73; transcriptomic: 0.80) but the highest specificity (clinical: 0.88; genetic: 0.69; transcriptomic: 0.83). With all three data types combined, XGBoost performed best, with AUC=89.88%, sensitivity=0.78, and specificity=0.83 in withheld test samples.
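For readers less familiar with the metrics reported above, the sketch below shows how sensitivity (true positive rate) and specificity (true negative rate) are computed from predicted probabilities, and how the two trade off as the decision threshold moves. The labels, scores, and thresholds are hypothetical values chosen for illustration. The abstract does not state which threshold was used, and the function name `sensitivity_specificity` is an assumption.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Return (sensitivity, specificity) for binary labels at a given threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # fraction of cases correctly flagged
    specificity = tn / (tn + fp)  # fraction of controls correctly cleared
    return sensitivity, specificity


# Hypothetical data: 5 cases (1) followed by 5 controls (0).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.7, 0.35, 0.2, 0.1, 0.05]

strict = sensitivity_specificity(y_true, scores, threshold=0.5)
lenient = sensitivity_specificity(y_true, scores, threshold=0.25)
print("threshold 0.50:", strict)   # (0.6, 0.8)
print("threshold 0.25:", lenient)  # (1.0, 0.6)
```

Lowering the threshold catches more cases at the cost of more false positives, which is one reason a model can pair high specificity with low sensitivity, as the clinical-only model does here.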
Conclusions: When trained on 70% of samples and evaluated on the withheld 30%, a predictive model that combines multiple modalities achieves the highest AUC. By incorporating different modalities, we can develop more comprehensive predictive models to better understand this complex disease and identify better biomarkers.
GenoML Website: https://genoml.github.io
GitHub: https://github.com/GenoML/genoml/tree/python_v1.5
Acknowledgements:
AMP DUA: Users acknowledge the PPMI, BioFIND, PDBP, and HBS cohorts, their personnel, and any other cohorts that provided AMP PD data and/or funding for the studies, and will include language in manuscripts similar to the following:
AMP PD Acknowledgement
“Data used in the preparation of this article were obtained from the AMP PD Knowledge Platform. For up-to-date information on the study, visit https://www.amp-pd.org. AMP PD – a public-private partnership – is managed by the FNIH and funded by Celgene, GSK, the Michael J. Fox Foundation for Parkinson’s Research, the National Institute of Neurological Disorders and Stroke, Pfizer, Sanofi, and Verily.”
AMP PD Cohort Acknowledgements
“PPMI – a public-private partnership – is funded by the Michael J. Fox Foundation for Parkinson’s Research and funding partners, including GlaxoSmithKline, Golub Capital, Handl Therapeutics, Insitro, Janssen Neuroscience, Lilly, Lundbeck, Merck, Meso Scale Discovery, Neurocrine, Pfizer, Piramal, Prevail Therapeutics, Roche, Sanofi Genzyme, Servier, Takeda, Teva, UCB, Verily, and Voyager Therapeutics. The PPMI Investigators have not participated in reviewing the data analysis or content of the manuscript. For up-to-date information on the study, visit www.ppmi-info.org.”