Whole Genome Data

AMP PD includes whole genome sequencing data for most study participants. All sequencing was performed by Macrogen and the Uniformed Services University of the Health Sciences (USUHS) using the Illumina HiSeq X Ten sequencer, with samples derived from whole blood.

Quality control of sequenced data was performed by Hampton Leonard and Hirotaka Iwaki from Datatecnica as part of a contract with the Laboratory of Neurogenetics (NIA).
 

Processed WGS Totals

             Control                         Case                            Other
  Cohort     No Mutations  With Mutations    No Mutations  With Mutations    No Mutations  With Mutations  Total
  BioFIND              69               1              90               9               3               0    172
  HBS                 217              10             585              55               0               0    867
  PDBP                454              26             772              88             119              10  1,469
  PPMI                194             373             420             364              61              21  1,433
  Total               934             410           1,867             516             183              31  3,941
Control = Clinically Healthy; Case = Clinical Diagnosis of Parkinson's Disease; Other = Clinical Diagnosis of SWEDD (scans without evidence of dopaminergic deficit) or other neurodegenerative disorders
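
As a quick consistency check on the Processed WGS Totals table, the short Python sketch below recomputes the row, column, and grand totals from the per-cohort counts. The counts are transcribed from the table. The variable names are illustrative only and are not part of any AMP PD tooling.

```python
# Counts transcribed from the Processed WGS Totals table.
# Column order: Control (no mutations, with mutations),
#               Case (no mutations, with mutations),
#               Other (no mutations, with mutations).
counts = {
    "BioFIND": (69, 1, 90, 9, 3, 0),
    "HBS": (217, 10, 585, 55, 0, 0),
    "PDBP": (454, 26, 772, 88, 119, 10),
    "PPMI": (194, 373, 420, 364, 61, 21),
}

# Total samples per cohort (the "Total" column).
row_totals = {cohort: sum(row) for cohort, row in counts.items()}

# Total samples per category (the "Total" row).
column_totals = tuple(sum(col) for col in zip(*counts.values()))

# All processed WGS samples.
grand_total = sum(row_totals.values())

# Samples carrying at least one listed mutation: every second column.
carriers = sum(column_totals[1::2])

print(row_totals)
print(column_totals, grand_total, carriers)
```

The recomputed totals agree with the "Total" row and column published in the table.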

WGS Sample Mutations:

GBA: N370S (rs76763715); T408M (rs75548401); E365K (rs2230288)

LRRK2: G2019S (rs34637584); R1441G_T (rs33939927); R1441G_G (rs33939927)

SNCA: A53T (rs104893877); G51D (rs431905511); E46K (rs104893875); A30P (rs104893878); H50Q (rs201106962)

Data Processing

All data processing was performed on the Google Cloud Platform against Build 38 of the human genome reference (GRCh38DH, 1000 Genomes Project version).


Single Sample Processing

FASTQs were processed using the Broad Institute's implementation of the Functional Equivalence Pipeline to produce alignments (output as CRAM files) and variant calls (output as gVCF files).


Joint Genotyping

After single sample processing was completed, joint genotyping was performed on the gVCF files using the Broad Institute's Joint Genotyping pipeline.


Variant Annotations

Variant annotations attach variant identifiers and gene identifiers to each variant. Annotations were generated on the joint genotyped variants using the Variant Effect Predictor (VEP). The annotation fields are listed on the WGS Variant Effect Predictor Fields page.
 

Whole Genome Sequencing Methodology: Alignment approach; Variant calling approach; Variant annotation

WGS Data Dictionary

If you want to download a version of the full AMP PD Whole Genome Sequencing Data Dictionary, click one of the buttons below for a specific format.

Data Availability

The following are available to registered researchers at this time:

Per sample

  • CRAM
  • gVCF
  • metric files

Per release

Joint genotyped and annotated variants

  • Per chromosome VCF files
  • PLINK files
  • Variants table

AMP PD will be generating and processing additional WGS data, which will be available in subsequent releases. When new WGS data becomes available in a new release, new joint genotyping results produced with the Broad Institute GATK pipeline will be available as well. Joint genotyping results using the TOPMed GotCloud pipeline will also be available to AMP PD users in the near future.

WGS Workflow Overview & Execution

Cromwell: Execution engine from the Broad Institute. Runs workflows written in the Workflow Description Language (WDL)

MySQL: database of submitted, running, and completed jobs  

Cromwell Workspace: Directory in Google Cloud Storage used by Cromwell to communicate with workflow tasks

Pipelines API: The following steps detail the process of turning FASTQs into CRAMs and gVCFs using two workflows from the Broad:

  1. Operator submits a workflow request to Cromwell through its REST API, which listens on port 8000
  2. Cromwell creates a subdirectory in gs://<bucket>/cromwell_executions for each workflow
  3. Repeat until workflow completes
    • Cromwell creates a task-specific directory in the workflow directory and populates it with the script to run
    • Cromwell calls the Pipelines API to launch a VM to run the step
    • Pipelines API downloads input files, executes the task, and writes outputs back to the task-specific directory in the workflow.
    • Cromwell gets "job status" information both from the Pipelines API and the task-specific directory of the workflow
  4. Operator copies outputs to "final" location
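
The submission and status-polling steps above can be sketched in Python. This is a minimal illustration using only the standard library, not the operators' actual tooling. It builds a multipart POST for Cromwell's `/api/workflows/v1` endpoint with the `workflowSource` and `workflowInputs` form fields, and builds the matching status URL. The host `localhost`, the function names, and the toy WDL and inputs are assumptions made for the example, and the request has not been exercised against a live server.

```python
import json
import urllib.request
import uuid


def build_submit_request(cromwell_url, wdl_text, inputs):
    """Build (but do not send) a workflow submission request for Cromwell."""
    boundary = uuid.uuid4().hex
    parts = {
        "workflowSource": wdl_text,            # the WDL workflow itself
        "workflowInputs": json.dumps(inputs),  # inputs as a JSON document
    }
    lines = []
    for name, value in parts.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")
        lines.append(value)
    lines.append(f"--{boundary}--")
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return urllib.request.Request(
        url=f"{cromwell_url}/api/workflows/v1",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def status_url(cromwell_url, workflow_id):
    """URL that reports the status of a submitted workflow."""
    return f"{cromwell_url}/api/workflows/v1/{workflow_id}/status"


# Hypothetical usage: Cromwell assumed to listen on localhost, port 8000.
request = build_submit_request(
    "http://localhost:8000",
    "workflow example {}",             # placeholder WDL, for illustration only
    {"example.input_fastq": "gs://<bucket>/sample.fastq.gz"},
)
print(request.get_method(), request.full_url)
# To actually submit: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the sketch runnable without a Cromwell server; an operator would pass the request to `urllib.request.urlopen` and then poll the status URL until the workflow completes.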
     

WGS Workflow Overview

WGS Quality Control Process

Hampton Leonard and Hirotaka Iwaki from Datatecnica have performed QC analysis on 4,047 AMP PD WGS samples. This analysis has included:

  • Sample Quality
    • Contamination (fail if Freemix >= 3%)
    • Coverage (fail if mean coverage < 25)
    • WGS metric outliers (fail if TiTv < 2)
    • Missingness (fail if the missing genotype rate per sample > 5%)
       
  • Genetic Data Checks
    • Duplication check
    • Concordance against NeuroX data
    • Clinically reported sex
    • Excessive heterogeneity
    • Clinically reported race/ethnicity
       

WGS QC Process

Sample Quality Checks

  1. Contamination - Some samples show clear signs of contamination as reported by VerifyBamID. Contaminated samples were removed from AMP PD.
    Pass/Fail Criteria: Fail if VerifyBamID FREEMIX >= 0.03
  2. Read Coverage - In sequencing experiments, some samples may have low mean coverage. Outliers identified by this QC criterion were excluded from joint calling and flagged for wet laboratory follow-up.
    Pass/Fail Criteria: Pass if Mean_Coverage >= 25 reads per variant; fail otherwise
  3. WGS metric outliers - Samples with a low transition/transversion ratio (TiTv)
    Pass/Fail Criteria: Fail if TiTv < 2, based on dbSNP variants
  4. Missingness - Refers to the rate of missing genotypes per sample
    Pass/Fail Criteria: Fail if missingness > 0.05 (5%)
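
The four sample quality thresholds can be expressed as a small Python sketch. This is an illustration of the stated criteria, not the QC team's actual code. The `SampleMetrics` class, its field names, and the example values are hypothetical, and missingness is taken as a fraction with a 5% (0.05) cutoff, in line with the summary list of checks.

```python
from dataclasses import dataclass


@dataclass
class SampleMetrics:
    sample_id: str
    freemix: float        # VerifyBamID FREEMIX, as a fraction
    mean_coverage: float  # mean read coverage
    titv: float           # transition/transversion ratio
    missingness: float    # fraction of missing genotypes for the sample


def sample_qc_failures(m):
    """Return the names of the sample quality checks that a sample fails."""
    failures = []
    if m.freemix >= 0.03:
        failures.append("contamination")
    if m.mean_coverage < 25:
        failures.append("coverage")
    if m.titv < 2:
        failures.append("titv")
    if m.missingness > 0.05:
        failures.append("missingness")
    return failures


# Hypothetical samples, for illustration only.
good = SampleMetrics("S1", freemix=0.001, mean_coverage=34.2, titv=2.05, missingness=0.002)
bad = SampleMetrics("S2", freemix=0.04, mean_coverage=19.0, titv=2.10, missingness=0.08)

print(sample_qc_failures(good))
print(sample_qc_failures(bad))
```

Returning the list of failed checks, rather than a single pass/fail flag, keeps the reason for each exclusion visible for follow-up.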

Genetic Data Checks

  1. Duplication check - Some samples matched their NeuroX data, but also matched another WGS sample (which matched its NeuroX data). This indicates that the same individual has been included in AMP PD more than once. Some samples had no NeuroX data to match against, but matched another WGS sample. This indicates that either the same individual has been included in AMP PD, or one of the samples was mislabeled. The higher quality WGS samples were used in joint genotyping and the lower quality WGS samples were made available in AMP PD, but not in joint genotyping.
    Pass/Fail Criteria: KING software relatedness = dup/MZTwin
  2. Concordance against NeuroX data - For some samples, NeuroX data was available, but the WGS sample did not match it based on rates of genotype concordance. The WGS data is a superset of the NeuroX data, so samples with only WGS data were not included in this phase of analysis. Discordant samples were removed from AMP PD, as discordance suggests a problem with the DNA itself.
    Pass/Fail Criteria: KING software relatedness != dup/MZTwin
  3. Clinically reported sex - Sex estimated from WGS was checked against self-reported sex. Discordant sex generally suggests a sample mix-up. These samples were removed from joint calling, and the biological data used for the assay and further assays were flagged for caution going forward.
    Pass/Fail Criteria: Fail when reported and genetically estimated sex disagree (M vs. F or F vs. M); blank values are ignored
  4. Excessive heterogeneity - Computes observed and expected autosomal homozygous genotype counts for each sample, and reports method-of-moments F coefficient estimates, i.e. F = (<observed hom. count> - <expected count>) / (<total observations> - <expected count>).
    Pass/Fail Criteria: Fail if F > 0.15 or F < -0.15
  5. Clinically reported race/ethnicity - Identifies samples whose clinically reported race/ethnicity conflicts with genetically inferred ancestry, for example subjects who reported white but are genetically admixed, or who reported multiracial but are genetically European. Ancestry outliers are determined using PCA by comparison with HapMap samples. Any sample within plus/minus 6 standard deviations of a population's mean in PC1 and PC2 is considered to be part of that population genetically.
    Excluded:
      • Genetically inferred African or Asian = clinically reported “white”
      • Genetically inferred European = clinically reported “african/asia”
    Flagged:
      • Genetically inferred Admix = anything other than “mixed race”
      • Clinically reported “unknown”
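
Two of the genetic data checks above are simple calculations, sketched here in Python under stated assumptions. The first function applies the method-of-moments F coefficient formula and the 0.15 cutoff. The second tests whether a sample lies within 6 standard deviations of a reference population's mean on PC1 and PC2. The function names, the use of the sample standard deviation, and all numeric values are hypothetical illustrations, not the QC team's implementation.

```python
from statistics import mean, stdev


def f_coefficient(observed_hom, expected_hom, total_obs):
    """Method-of-moments F: (observed - expected) / (total - expected)."""
    return (observed_hom - expected_hom) / (total_obs - expected_hom)


def fails_f_check(f, threshold=0.15):
    """A sample fails when F is above +threshold or below -threshold."""
    return abs(f) > threshold


def within_population(pc1, pc2, ref_pc1, ref_pc2, n_sd=6.0):
    """True if the sample is within n_sd standard deviations of the
    reference population's mean on both PC1 and PC2."""
    def inside(value, reference):
        return abs(value - mean(reference)) <= n_sd * stdev(reference)
    return inside(pc1, ref_pc1) and inside(pc2, ref_pc2)


# Hypothetical genotype counts, for illustration only.
f_high = f_coefficient(observed_hom=700, expected_hom=600, total_obs=1000)
f_ok = f_coefficient(observed_hom=620, expected_hom=600, total_obs=1000)
print(f_high, fails_f_check(f_high))
print(f_ok, fails_f_check(f_ok))

# Hypothetical reference principal-component values for one population.
ref_pc1 = [-1.0, 0.0, 1.0]
ref_pc2 = [-1.0, 0.0, 1.0]
print(within_population(0.5, -0.5, ref_pc1, ref_pc2))
print(within_population(10.0, 0.0, ref_pc1, ref_pc2))
```

In practice the reference values would come from projecting HapMap samples onto the same principal components as the AMP PD samples.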