Whole Genome Data

AMP PD includes whole genome sequencing data for most study participants. All sequencing was performed by Macrogen and the Uniformed Services University of Health Sciences (USUHS) using the Illumina HiSeq XTen sequencer with samples coming from whole blood.

Quality control of sequenced data was performed by Hampton Leonard and Hirotaka Iwaki from Datatecnica as part of a contract with the Laboratory of Neurogenetics (NIA).

What's on this page:

WGS Data Processing Overview

WGS Data Dictionary

WGS Data Availability

WGS Workflow Overview & Execution

WGS Quality Control Process

WGS Quality & Data Checks

Processed WGS Totals for PD Cases and Controls

	PD Case		Control (Asymptomatic Individuals)		Others		All Enrollment Types
	With Known Mutations	No Known Mutations	With Known Mutations	No Known Mutations	With Known Mutations	No Known Mutations	Unknown Mutation Status
BioFIND	24	75	22	48	2	1	0
HBS	204	435	146	385	0	3	7
LBD	0	0	541	1424	1367	1247	0
LCC	181	61	203	154	0	0	0
PDBP	278	573	145	370	38	111	6
PM	0	0	0	0	0	0	100
PPMI	605	352	583	173	83	53	2
SURE-PD	90	169	0	0	0	0	0
STEADY-PD3	101	227	0	0	0	1	0
Total	1483	1892	1640	2554	1490	1416	115

WGS Known Sample Mutations

GBA: N370S (rs76763715); T369M (rs75548401); E326K (rs2230288)

LRRK2: G2019S (rs34637584); R1441G_T (rs33939927); R1441G_G (rs33939927)

SNCA: A53T (rs104893877); G51D (rs431905511); E46K (rs104893875); A30P (rs104893878)

APOE: 388T_C (rs429358); 526C_T (rs7412)

			E2/E2	E2/E3	E3/E3	E2/E4	E3/E4	E4/E4	Unknown
						44	1122	6205	213	2563	328	115
PD	1521	Case	5	100	532	14	185	15	3
		Control	2	78	307	15	107	6	3
		Other	0	26	94	3	24	2	1
PP	1851	Case	7	121	611	19	185	14	1
		Control	7	98	487	14	143	7	1
		Other	0	13	87	5	29	2	0
PM	100	Case	0	0	0	0	0	0	75
		Control	0	0	0	0	0	0	25
		Other	0	0	0	0	0	0	0
BF	172	Case	2	8	73	1	13	2	0
		Control	8	9	40	2	18	1	0
		Other	0	0	1	0	1	1	0
HB	1180	Case	1	85	389	12	145	7	1
		Control	4	68	333	9	110	7	6
		Other	0	0	3	0	0	0	0
LB	4579	Case	0	0	0	0	0	0	0
		Control	4	179	1305	17	424	36	0
		Other	5	192	1200	85	920	212	0
LC	599	Case	1	31	152	4	52	2	0
		Control	0	48	224	3	81	1	0
		Other	0	0	0	0	0	0	0
SY	329	Case	4	44	208	7	59	6	0
		Control	0	0	0	0	0	0	0
		Other	0	0	1	0	0	0	0
SU	259	Case	2	22	158	3	67	7	0
		Control	0	0	0	0	0	0	0
		Other	0	0	0	0	0	0	0

Additional participant variant data is now available in AMP PD Tier 2 data identifying the APOE genotype for all AMP PD participants. Apolipoprotein (Apo) E is produced under the direction of the APOE gene and is one of five main types of blood lipoproteins (A-E). AMP PD has evaluated participant’s WGS data to determine what combination of APOE forms (genotype) is present. The APOE gene exists in three different forms (alleles) – e2, e3, and e4 – with e3 being the most common allele, found in 60% of the general population.

AMP PD’s public dataset includes 3,095 participants with at least one copy of the APOE E4 gene, and 327 participants with two copies.

Release 1 WGS Samples by Cohort and Study

Release 2 WGS Samples by Cohort

Release 3 WGS Samples by Cohort

Data Processing

Data processing was performed on the Google Cloud Platform. All data processing was performed against Build 38 of the Human Genome reference (GRCh38DH, 1000 Genomes Project version).

Single Sample Processing

FASTQs were processed using the Broad Institute's implementation of the Functional Equivalence Pipeline to produce alignments (output as CRAM files) and variant calls (output as gVCF files).

Joint Genotyping

After single sample processing was completed joint genotyping, using the Broad Institute's Joint Genotyping pipeline, was performed on the gVCF files.

Variant Annotations

Variant annotations add variant identifiers and gene identifiers as annotations. The annotation fields can be seen on the WGS Variant Effect Predictor Fields page. Annotations were generated on the joint genotyped variants using the Variant Effect Predictor (VEP).

Whole Genome Sequencing Methodology: Alignment approach; Variant Calling approach; Variant of annotation

WGS Data Dictionary

If you want to download a version of the full AMP PD Whole Genome Sequencing Data Dictionary, click one of the buttons below for a specific format.

WGS Data Dictionary (pdf)

WGS Data Dictionary (excel)

Data Availability

The following are available to registered researchers at this time:

Per sample

CRAM
gVCF
metric files

Per release

Joint genotyped and annotated variants

Per chromosome VCF files
PLINK files
Variants table

AMP PD will be generating and processing additional WGS data, which will be available in subsequent releases. When new WGS data is available in a new release, new joint genotyping results using the Broad Institute GATK pipeline will be availalble. Joint genotyping results using the TopMED GotCloud pipeline will also be available to AMP PD users in the near future.

WGS Workflow Overview & Execution

Cromwell: execution engine from the Broad institute. Runs workflows written in the workflow definition language (WDL)

MySQL: database of submitted, running, and completed jobs

Cromwell Workspace: Directory in Google Cloud Storage used by Cromwell to communicate with workflow tasks

Pipelines API: The following steps detail the process of turning FASTQs into CRAMs and gVCFs using two workflows from the Broad:

Operator submits workflow request to Cromwell on a REST API - listening on port 8000
Cromwell creates a subdirectory in gs://<bucket/cromwell_executions for each workflow
Repeat until workflow completes
- Cromwell creates task-specific directories in the workflow directory and populates it with the script to run
- Cromwell calls the Pipelines API to launch a VM to run the step
- Pipelines API downloads input files, executes the task, and writes outputs back to the task-specific directory in the workflow.
- Cromwell gets "job status" information both from the Pipelines API and the task-specific directory of the workflow
Operator copies outputs to "final" location

WGS Quality Control Process

Hampton Leonard and Hirotaka Iwaki from Datatecnica have performed QC analysis on 10,418 AMP PD WGS samples. This analysis has included:

Sample Quality
- Contamination (Freemix < 3%)
- Coverage (Mean coverage < 25)
- WGS metric outliers (TiTv < 2)
- Missingness (missingness genotype rates per sample > 5%)
Genetic Data Checks
- Duplication check
- Concordance against NeuroX data
- Clinically reported sex
- Excessive heterogeneity
- Clinically reported race/ethnicity

Sample Quality Checks

Contamination - Some samples show clear signs of contamination as reported by VerifyBAMId. Contaminated samples were removed from AMP PD.
Pass/Fail Criteria: VerifyBamID FREEMIX >= 0.03
Read Coverage - In sequencing experiments, some samples may have low mean coverage. Outliers identified by this QC criteria were excluded from joint calling and flagged for wet laboratory follow-up.
Pass/Fail Criteria: Mean_Coverage >= 25 reads per variant
WGS metric outliers - Low transition transversion ratio (TiTv)
Pass/Fail Criteria: Failing samples at values < 2 based on dbSNPs
Missingness - Refers to missing genotype rates per sample
Pass/Fail Criteria: Sample with > 0.05% missingness

Genetic Data Checks

Duplication check - Some samples matched their NeuroX data, but also matched another WGS sample (which matched its NeuroX data). This indicates that the same individual has been included in AMP PD more than once. Some samples had no NeuroX data to match against, but matched another WGS sample. This indicates that either the same individual has been included in AMP PD, or one of the samples was mislabeled. The higher quality WGS samples were used in joint genotyping and the lower quality WGS samples were made available in AMP PD, but not in joint genotyping.
Pass/Fail Criteria: Software King Relatedness = dup/MZTwi
Concordance against NeuroX data - For some samples, there was NeuroX data available, but the WGS sample did not match this NeuroX data based on rates of genotype concordance. The WGS data is a superset of the NeuroX data, so samples with only WGS were not included in this phase of analysis. Discordant samples were removed from AMP as this suggests a problem with the DNA itself.
Pass/Fail Criteria: Software King Relatedness !=dup/MZTwin
Clinically reported sex - Sex estimated from WGS was checked against data from self-report. Discordant sex suggests a sample mix-up generally. These samples were removed from joint calling and the biological data used for the assay and further assays were flagged for caution going forward.
Pass/Fail Criteria: M=F or F=M, Blanks ignore
Excessive heterogeneity - Computes observed and expected autosomal homozygous genotype counts for each sample, and reports method-of-moments F coefficient estimates (i.e. (<observed hom. count> - <expected count>) / (<total observations> - <expected count>)).
Pass/Fail Criteria: F > +/- 0.15

Clinically reported race/ethnicity - Samples for subjects who reported white and are admix or reported multiracial and are genetically European. Ancestry outliers are determined using PCA and comparing to hapmap samples. Any sample within a distribution of plus/minus 6 standard deviations from the mean in PC1 and PC2 are considered to be part of that population genetically.

Excluded	Flagged
Genetically inferred African or Asian = clinically reported “white”	Genetically inferred Admix = anything other than “mixed race”
Genetically inferred European = clinically reported “african/asia”	Clinically reported “unknown”