Genomic Data
Genomic data in AMP PD are contributed and curated by the Global Parkinson’s Genetics Program (GP2). To increase ancestral diversity in PD genetic datasets, GP2 is working with researchers around the world and aims to genotype over 150,000 individuals with Parkinson’s disease and controls. Visit the GP2 dashboard to view GP2's progress so far and the cohorts participating in this effort to date.
Genotyping was performed using the Illumina NeuroBooster array, which includes a Global Diversity Array-8 (GDA) backbone. The array was developed to test 1.9 million markers and includes ~95,000 neurodegenerative disease-specific variants. Samples coming from GP2 are genotyped at various international centers using standard Illumina infrastructure. For more information on the content of the NeuroBooster array, please see the NeuroBooster GitHub.
GP2 Ancestry | Total | PD | Non-PD |
---|---|---|---|
African Admixed | 162 | 81 | 81 |
Ashkenazi Jewish | 520 | 396 | 124 |
Latino and Indigenous people of the Americas | 214 | 152 | 62 |
East Asian | 46 | 35 | 11 |
European | 3919 | 2732 | 1187 |
South Asian | 47 | 38 | 9 |
Total | 4908 | 3434 | 1474 |
Genotype Data Processing
Genotype data were produced and clustered using standard Illumina genotyping protocols, and quality control of the data was performed by GP2.
All data processing was performed against ancestry-appropriate reference panels as well as TOPMed or equivalent.
Genotype Quality Control Process
The genotype calling and quality control practices can be found on GitHub [https://github.com/GP2code/GenoTools] and are continually improved and updated. In general, QC filters on basic sample-level metrics such as call rate and extreme heterozygosity or homozygosity, and includes sex checks that compare clinically reported and genetically ascertained sex. Ancestry is adjudicated genomically and the data are subset accordingly: all samples passing QC are split by ancestry and imputed within ancestry. Related samples are retained for imputation and removed before further analysis, keeping probands to maximize case:control balance. Standard pre- and post-imputation variant filtering is then applied, including retaining only variants with an imputation quality score (RSQ) >= 0.3 and a minor allele count >= 10.
Specific ancestry populations include: European, Ashkenazi Jewish, Finnish, African, African-American/Caribbean, Native American, South Asian, East Asian.
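As an illustration of the post-imputation variant filter described above (RSQ >= 0.3 and minor allele count >= 10), the following is a minimal Python sketch. It is not the GenoTools implementation. The `ImputedVariant` record, its field names, and the helper functions are hypothetical and introduced only for this example. It also assumes that the minor allele count is the smaller of the two allele counts at a biallelic site.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Thresholds quoted in the QC description above.
MIN_RSQ = 0.3
MIN_MAC = 10


@dataclass(frozen=True)
class ImputedVariant:
    """Hypothetical record for one biallelic imputed variant."""
    variant_id: str
    rsq: float             # imputation quality score
    alt_allele_count: int  # observed copies of the alternate allele
    total_alleles: int     # total alleles observed (2 per diploid sample)


def minor_allele_count(variant: ImputedVariant) -> int:
    """Return the count of the less frequent allele (assumed definition)."""
    ref_allele_count = variant.total_alleles - variant.alt_allele_count
    return min(variant.alt_allele_count, ref_allele_count)


def passes_variant_filter(variant: ImputedVariant) -> bool:
    """True when the variant meets both thresholds."""
    return variant.rsq >= MIN_RSQ and minor_allele_count(variant) >= MIN_MAC


def filter_variants(variants: Iterable[ImputedVariant]) -> List[ImputedVariant]:
    """Keep only variants that pass the filter, preserving input order."""
    return [v for v in variants if passes_variant_filter(v)]
```

Taking the minimum of the two allele counts means a variant whose alternate allele is the common one is still judged on its rarer allele. Both thresholds are inclusive, matching the ">=" in the text.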
Genotype data products provided by GP2
- PLINK2 binary containing filtered imputed SNPs
- HDF5 containing metadata with counts of samples/variants removed at each QC step, Ancestry makeup of cohort, PC/UMAP plots
Whole Genome Data
GP2 includes whole genome sequencing (WGS) data for some study participants. For GP2, sequencing was performed using the Illumina HiSeq XTen sequencer or an Illumina NovaSeq or newer/equivalent instrument, with samples coming from whole blood, saliva, or brain tissue.
Quality control of the sequenced data is performed by GP2.
WGS Sample Processing
For GP2, the WGS workflow was run using Cromwell, the execution engine from the Broad Institute, with workflows written in the Workflow Description Language (WDL) and published by Broad.
WGS Integrated Quality Control Process
Quality control included:
Sample Quality
- Contamination (Freemix < 3%)
- Coverage (Mean coverage < 25)
- WGS metric outliers (TiTv < 2)
- Missingness (missingness genotype rates per sample > 5%)
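The sample-quality thresholds listed above can be sketched as a simple flagging function. This is an illustrative Python sketch, not the GP2 pipeline. The `WgsSampleMetrics` record, its field names, and the failure labels are hypothetical. The list gives the Freemix bound as "< 3%", which reads as a pass condition, while the other three bounds read as exclusion conditions. The sketch therefore assumes a sample is flagged when Freemix is 3% or higher, mean coverage is below 25, TiTv is below 2, or the per-sample missing genotype rate is above 5%.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WgsSampleMetrics:
    """Hypothetical per-sample WGS metrics; rates are fractions, not percentages."""
    sample_id: str
    freemix: float        # estimated contamination, e.g. 0.03 == 3%
    mean_coverage: float  # mean sequencing depth
    titv: float           # transition/transversion ratio
    missing_rate: float   # per-sample missing genotype rate, e.g. 0.05 == 5%


def sample_qc_failures(metrics: WgsSampleMetrics) -> List[str]:
    """Return the sample-quality checks this sample fails (empty if none)."""
    failures = []
    if metrics.freemix >= 0.03:      # assumed reading of "Freemix < 3%" as the pass condition
        failures.append("contamination")
    if metrics.mean_coverage < 25:
        failures.append("coverage")
    if metrics.titv < 2:
        failures.append("titv")
    if metrics.missing_rate > 0.05:
        failures.append("missingness")
    return failures
```

Returning the list of failed checks, rather than a single pass/fail flag, keeps the reason for each exclusion visible, in the spirit of the per-step removal counts that GP2 reports for genotype QC.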
Genetic Data Checks
- Duplication check
- Clinically reported sex
- Excessive heterozygosity / homozygosity
Data Availability
After processing and QC, GP2 makes the following data available.
Data products for WGS data provided by GP2
- PLINK
- CRAM
- VCF
- Metric files
- CSV files