Genomic Data

Genomic data in AMP PD are contributed and curated by the Global Parkinson’s Genetics Program (GP2). To increase ancestral diversity in PD genetic datasets, GP2 is working with researchers around the world and aims to genotype over 150,000 individuals with Parkinson’s disease and controls. Visit the GP2 dashboard to view GP2's progress so far and the cohorts participating in this effort to date.

Genotyping was performed using the Illumina NeuroBooster array, which includes a Global Diversity Array-8 (GDA) backbone. The array was developed to test 1.9 million markers and includes ~95,000 neurodegenerative disease-specific variants. Samples coming from GP2 are being genotyped at various international centers using standard Illumina infrastructure. For more information on the content of the NeuroBooster array, please see the NeuroBooster GitHub.

GP2 Ancestry                                 | Total | PD   | Non-PD
African Admixed                              |   162 |   81 |     81
Ashkenazi Jewish                             |   520 |  396 |    124
Latino and Indigenous people of the Americas |   214 |  152 |     62
East Asian                                   |    46 |   35 |     11
European                                     |  3919 | 2732 |   1187
South Asian                                  |    47 |   38 |      9
Total                                        |  4908 | 3434 |   1474
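
The counts in the table above can be cross-checked with a few lines of Python. This is only an illustrative sketch: the numbers are copied from the table, while the names `COUNTS`, `column_totals`, and `pd_fraction` are made up for this example and are not part of any GP2 tooling.

```python
# Counts copied from the table above: (total, PD, non-PD) per GP2 ancestry group.
COUNTS = {
    "African Admixed": (162, 81, 81),
    "Ashkenazi Jewish": (520, 396, 124),
    "Latino and Indigenous people of the Americas": (214, 152, 62),
    "East Asian": (46, 35, 11),
    "European": (3919, 2732, 1187),
    "South Asian": (47, 38, 9),
}


def column_totals(counts):
    """Sum the Total, PD, and Non-PD columns across ancestry groups."""
    totals = [0, 0, 0]
    for row in counts.values():
        for i, value in enumerate(row):
            totals[i] += value
    return tuple(totals)


def pd_fraction(counts):
    """Fraction of PD cases within each ancestry group."""
    return {name: cases / total for name, (total, cases, _controls) in counts.items()}


if __name__ == "__main__":
    print(column_totals(COUNTS))
    for name, fraction in pd_fraction(COUNTS).items():
        print(f"{name}: {fraction:.1%} PD")
```

The column sums reproduce the "Total" row of the table, and each row's PD and Non-PD counts add up to its total.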

Genotype Data Processing

Illumina Infinium Global Diversity Array-8 Kit. Source: https://gp2.org/the-components-of-gp2-first-data-release/

Genotype data were produced and clustered using standard Illumina genotyping protocols, and quality control of the data was performed by GP2.

All data processing was performed against ancestry-appropriate reference panels as well as TOPMed or equivalent.

Genotype Quality Control Process

The genotype calling and quality control practices are available on GitHub [https://github.com/GP2code/GenoTools] and are continually improved and updated. In general, QC involves filtering on basic sample-level metrics such as call rate and extreme heterozygosity or homozygosity outliers, as well as sex checks that compare clinically reported and genetically ascertained sex. Ancestry is genomically adjudicated and the data are subset accordingly, i.e., all samples passing QC are split by ancestry and imputed within ancestry. Related samples are retained for imputation and removed before further analysis, keeping probands, to maximize case:control balance. Standard pre- and post-imputation variant filtering is carried out, including retaining variants that have an imputation quality score (RSQ) >= 0.3 and a minor allele count >= 10.
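
As an illustration of the post-imputation variant filter described above, here is a minimal Python sketch. It is not the GenoTools implementation: the `Variant` record and the function names are hypothetical, and only the two thresholds (RSQ >= 0.3, minor allele count >= 10) come from the text.

```python
from dataclasses import dataclass

# Thresholds stated in the text above.
RSQ_MIN = 0.3
MAC_MIN = 10


@dataclass(frozen=True)
class Variant:
    """Hypothetical per-variant summary; field names are assumptions."""
    variant_id: str
    rsq: float              # imputation quality score
    alt_allele_count: int   # number of alternate alleles observed
    total_alleles: int      # 2 * number of non-missing diploid samples


def minor_allele_count(variant):
    """Count of the less common allele (alternate or reference)."""
    return min(variant.alt_allele_count,
               variant.total_alleles - variant.alt_allele_count)


def passes_post_imputation_filter(variant):
    """Keep variants with RSQ >= 0.3 and minor allele count >= 10."""
    return variant.rsq >= RSQ_MIN and minor_allele_count(variant) >= MAC_MIN


def filter_variants(variants):
    """Return only the variants that pass both thresholds."""
    return [v for v in variants if passes_post_imputation_filter(v)]
```

The minor allele count is taken as the smaller of the alternate and reference allele counts, so a variant whose alternate allele is nearly fixed is treated the same way as one whose alternate allele is rare.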

Specific ancestry populations include: European, Ashkenazi Jewish, Finnish, African, African-American/Caribbean, Native American, South Asian, East Asian.

Genotype data products provided by GP2

  • PLINK2 binary containing filtered imputed SNPs
  • HDF5 containing metadata with counts of samples/variants removed at each QC step, ancestry makeup of the cohort, and PC/UMAP plots
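
A quick way to inspect the samples in a PLINK2 fileset is to read its sample file. The sketch below assumes the "PLINK2 binary" is delivered as a standard .pgen/.pvar/.psam fileset with a tab-delimited .psam file; that layout, the file name in the usage note, and the helper names are assumptions, not something stated by GP2.

```python
import csv
import io
from collections import Counter


def parse_psam_text(text):
    """Parse the contents of a tab-delimited PLINK2 .psam file into dicts.

    The header line of a .psam file starts with '#FID' or '#IID';
    the leading '#' is stripped from the first column name.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")
    return [dict(zip(header, row)) for row in reader if row]


def read_psam(path):
    """Read a .psam file from disk (the path is supplied by the caller)."""
    with open(path, newline="") as handle:
        return parse_psam_text(handle.read())


def count_by(rows, column):
    """Count samples per value of one column, e.g. 'SEX' or 'PHENO1'."""
    return dict(Counter(row[column] for row in rows))
```

For example, `count_by(read_psam("cohort.psam"), "SEX")` would tally samples by the sex code, where `cohort.psam` is a placeholder file name and the `SEX` column is only present if the fileset includes it.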

Whole Genome Data

GP2 includes whole genome sequencing (WGS) data for some study participants. For GP2, sequencing was performed using an Illumina NovaSeq (or newer/equivalent) or an Illumina HiSeq X Ten sequencer, with samples derived from whole blood, saliva, or brain tissue.
Quality control of the sequenced data is performed by GP2.

WGS Sample Processing

For GP2, WGS processing was run with Cromwell, the workflow execution engine from the Broad Institute, using workflows written in the Workflow Description Language (WDL) and published by the Broad.
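
To give a sense of how a WDL workflow is launched with Cromwell, the Python sketch below builds the inputs mapping and the command line for Cromwell's `run` mode. The workflow name, input names, and file paths are hypothetical placeholders, and the sketch does not reflect the actual GP2 configuration. It only constructs the command and does not execute it.

```python
import json


def qualify_inputs(workflow_name, inputs):
    """Prefix each input with the workflow name, as WDL inputs files expect."""
    return {f"{workflow_name}.{key}": value for key, value in inputs.items()}


def inputs_json(workflow_name, inputs):
    """Serialize the qualified inputs to the JSON text Cromwell reads."""
    return json.dumps(qualify_inputs(workflow_name, inputs), indent=2, sort_keys=True)


def cromwell_run_command(cromwell_jar, wdl_path, inputs_path):
    """Build (but do not run) the command for a single local Cromwell run."""
    return ["java", "-jar", str(cromwell_jar),
            "run", str(wdl_path),
            "--inputs", str(inputs_path)]


if __name__ == "__main__":
    # All names below are placeholders for illustration only.
    print(inputs_json("ExampleWorkflow", {"sample_id": "SAMPLE_001"}))
    print(" ".join(cromwell_run_command("cromwell.jar", "example.wdl", "inputs.json")))
```

Returning the command as a list keeps it suitable for passing to `subprocess.run` without shell quoting issues, should one choose to execute it.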

WGS Integrated Quality Control Process

Quality control included:
Sample Quality

  • Contamination (Freemix < 3%)
  • Coverage (Mean coverage < 25)
  • WGS metric outliers (TiTv < 2)
  • Missingness (per-sample genotype missingness rate > 5%)

Genetic Data Checks

  • Duplication check
  • Sex check against clinically reported sex
  • Excessive heterozygosity / homozygosity
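
The sample quality thresholds listed above can be expressed as a simple per-sample check. The Python sketch below rests on an assumption about direction that the list does not spell out: "Freemix < 3%" is read as the passing condition, while "mean coverage < 25", "TiTv < 2", and "missingness > 5%" are read as conditions that flag a sample. The `SampleMetrics` record and function names are hypothetical and are not taken from the GP2 pipeline.

```python
from dataclasses import dataclass

# Thresholds from the list above; the pass/fail direction is an assumption.
FREEMIX_MAX = 0.03        # contamination: pass if Freemix < 3%
MEAN_COVERAGE_MIN = 25.0  # flag if mean coverage < 25
TITV_MIN = 2.0            # flag if TiTv < 2
MISSING_RATE_MAX = 0.05   # flag if per-sample missingness > 5%


@dataclass(frozen=True)
class SampleMetrics:
    """Hypothetical per-sample WGS metrics; rates are fractions, not percents."""
    sample_id: str
    freemix: float
    mean_coverage: float
    titv: float
    missing_rate: float


def qc_failures(metrics):
    """Return the names of the sample quality checks this sample fails."""
    failures = []
    if metrics.freemix >= FREEMIX_MAX:
        failures.append("contamination")
    if metrics.mean_coverage < MEAN_COVERAGE_MIN:
        failures.append("coverage")
    if metrics.titv < TITV_MIN:
        failures.append("titv")
    if metrics.missing_rate > MISSING_RATE_MAX:
        failures.append("missingness")
    return failures


def passes_sample_qc(metrics):
    """A sample passes when no check is flagged."""
    return not qc_failures(metrics)
```

Returning the list of failed checks, rather than a single boolean, makes it easy to report why a sample was excluded.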

Data Availability
Following processing and QC, the following data are made available by GP2.
Data products for WGS data provided by GP2

  • PLINK
  • CRAM
  • VCF
  • Metric files
  • CSV files