News & Updates

GP2 Release Notes – May 2022

May 16, 2022

Storage updates and release schedules: We intend to update the stored data each quarter through scheduled releases. Each release is a top-level directory within the GP2 Tier 1 (gp2tier1) and Tier 2 (gp2tier2) buckets, which contain public summary-level data and private participant-level data, respectively.

For example, release 2, dated May 6th 2022, is in the top-level directory /release2_06052022 in both the tier1 and tier2 storage buckets (the date suffix is written as day, month, year).
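
The naming pattern above can be sketched in a few lines of Python. This is only an illustration of the convention as read from the two release names in these notes; the function name release_dir is hypothetical and not part of any GP2 tooling.

```python
from datetime import date


def release_dir(release_number: int, release_date: date) -> str:
    """Build a top-level release directory name such as release2_06052022."""
    # The suffix is day, month, four-digit year with no separators (DDMMYYYY).
    return f"release{release_number}_{release_date.strftime('%d%m%Y')}"


print(release_dir(2, date(2022, 5, 6)))    # release2_06052022
print(release_dir(1, date(2021, 11, 29)))  # release1_29112021
```

Reading the suffix as DDMMYYYY matters here because 06052022 could otherwise be mistaken for June 5th.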

Contact: For questions relating to data processing, please email admin@gp2.org.

Release specific info follows below.

Current Release

Release2_06052022 (beta)

For a current list of samples, studies, cohorts and geographic territories covered by GP2 please see the GP2 website here [https://gp2.org/cohort-dashboard/].

For more information regarding this release, please check out the GP2 blog post titled ‘Components of GP2’s Second Data Release’: [https://gp2.org/blog/].

Complex Disease
General Information:

3,736 samples are added in this release; the number of shared GP2 samples now equals 8,644 (5,249 PD cases, 3,395 non-PD).
New genotype samples were processed using GenoTools version 0.1 [https://github.com/dvitale199/GenoTools]. All samples were imputed to the TOPMed reference panel, as detailed in the GenoTools pipeline.
All data provided is GRCh38 (hg38).
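
As a quick consistency check on the counts above, the release 1 total (4908, taken from the release 1 notes further down) plus the samples added here should match both the stated total and the case/non-case split. The variable names below are illustrative only.

```python
# Counts as stated in the release notes.
release1_total = 4908      # total after release 1
added_in_release2 = 3736   # samples added in release 2
pd_cases = 5249
non_pd = 3395

total = release1_total + added_in_release2

# The running total and the PD / non-PD breakdown should agree.
assert total == pd_cases + non_pd
print(total)  # 8644
```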

GDPR note:

Currently, all data included in this release has been determined to comply with GDPR guidelines, as it comes from countries not governed by GDPR or participants who are no longer living.

Bucket and Directory Structure:

gp2tier1 @release2_06052022
└── summary_statistics/

gp2tier2 @release2_06052022
├── raw_genotypes/
├── imputed_genotypes/
├── cnvs/
├── meta_data/
├── clinical_data/
└── summary_statistics/

Bucket and Directory Overview:

gp2tier1: this is the bucket for summary statistics and other non-participant-level data. Its top-level directories always correspond to releases, with the same structure mirrored in each release.
- summary_statistics - The file META5_no23_with_rsids2.txt contains open access summary statistics from the most recent Parkinson’s GWAS (Nalls et al 2019, https://pubmed.ncbi.nlm.nih.gov/31701892/), excluding 23andMe samples. It can be found here as well as in the tier 2 storage bucket. Column headers conform to the standard METAL meta-analysis output [https://genome.sph.umich.edu/wiki/METAL_Documentation].
gp2tier2: this is the bucket for participant-level data. Its top-level directories always correspond to releases, with the same structure mirrored in each release. Its contents are described below.
- raw_genotypes - PLINK binary files for each ancestry group for all samples passing quality control prior to imputation. Each PLINK binary includes all attempted variants from the array for that ancestry group. As a note, for flexibility in community analyses, all known duplicate samples were removed but related samples remain.
- imputed_genotypes - All genotype data has been imputed using the TOPMed reference panel and is contained in PLINK2 files separated by chromosome. Prior to upload, these files were filtered for minor allele count > 10 and imputation quality > 0.3, as is industry standard. Each file set is separated by genetically defined ancestry groups prior to imputation. The workflows for ancestry determination and all other QC processes are available at [https://github.com/dvitale199/GenoTools].
- cnvs - Probabilistic estimates of copy number variation per gene, including +/- 250kb flanking regions, for deletions, duplications and insertions in all samples. Code for these estimates can be found here [https://github.com/GP2code/GenoTools/tree/main/CNV]. This is currently “hypothesis generating” data and will be improved for the next release.
- meta_data - Meta data included in the HDF5 file GP2_round2.QC.metrics.h5 currently comprises QC metrics, ancestry counts, predicted ancestry labels, confusion matrix, new samples UMAP, projected principal components, pruned samples, reference principal components, reference UMAP, and total (samples and reference) UMAP. The top-level HDF5 keys (KeysViewHDF5) are: ['QC', 'ancestry_counts', 'ancestry_labels', 'confusion_matrix', 'new_samples_umap', 'projected_pcs', 'pruned_samples', 'ref_pcs', 'ref_umap', 'total_umap']
  - GP2_[ancestry]_release1_samples - ID lists per ancestry group of all participants included in release 1
- clinical_data - The corresponding data dictionary for an explanation of the columns can be found in release2_26042022_data_dictionary.csv.
- summary_statistics - this includes basic summary statistics from gp2tier1
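
To confirm that a downloaded copy of the meta_data HDF5 file exposes the keys listed above, something like the following sketch may help. It assumes the third-party h5py package is installed and that the file has been copied to a local path; the helper names missing_keys and list_hdf5_keys are hypothetical, and how each key's contents should be parsed is not specified in these notes.

```python
# Top-level keys as listed in the release notes for GP2_round2.QC.metrics.h5.
EXPECTED_KEYS = [
    "QC", "ancestry_counts", "ancestry_labels", "confusion_matrix",
    "new_samples_umap", "projected_pcs", "pruned_samples",
    "ref_pcs", "ref_umap", "total_umap",
]


def missing_keys(found_keys, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent from found_keys, in order."""
    found = set(found_keys)
    return [key for key in expected if key not in found]


def list_hdf5_keys(path):
    """List the top-level keys of an HDF5 file (requires h5py)."""
    import h5py  # third-party; imported here so missing_keys works without it

    with h5py.File(path, "r") as handle:
        return list(handle.keys())


# Example usage, assuming a local copy of the file (path is an assumption):
#   keys = list_hdf5_keys("GP2_round2.QC.metrics.h5")
#   print(missing_keys(keys))
```

An empty list from missing_keys would indicate that all ten documented keys are present.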

Ancestry group definitions

AAC - African American / Caribbean
AFR - African ancestry
AJ - Ashkenazi Jewish
AMR - Latino and indigenous Americas populations
EUR - general European ancestry
EAS - East Asian ancestry
SAS - South Asian ancestry
FIN - Finnish population isolate
CAS - Central Asian

This release does not contain FIN due to insufficient sample size for accurate estimates of imputation quality.
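
For scripts that loop over per-ancestry file sets, the labels above can be kept in a small lookup table. The sketch below is illustrative only: the names ANCESTRY_GROUPS and groups_in_release are hypothetical, and the result is simply the defined labels minus the stated exclusion; these notes do not confirm how many samples each remaining group contains.

```python
# Ancestry labels and descriptions as defined in the release 2 notes.
ANCESTRY_GROUPS = {
    "AAC": "African American / Caribbean",
    "AFR": "African ancestry",
    "AJ": "Ashkenazi Jewish",
    "AMR": "Latino and indigenous Americas populations",
    "EUR": "general European ancestry",
    "EAS": "East Asian ancestry",
    "SAS": "South Asian ancestry",
    "FIN": "Finnish population isolate",
    "CAS": "Central Asian",
}

# Release 2 states that FIN is not included.
EXCLUDED_IN_RELEASE2 = {"FIN"}


def groups_in_release(excluded):
    """Return the defined labels not excluded from a release, in listed order."""
    return [code for code in ANCESTRY_GROUPS if code not in excluded]


print(groups_in_release(EXCLUDED_IN_RELEASE2))
```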

Previous Release

Storage updates and release schedules: We will attempt to make at least quarterly updates to stored data as scheduled releases. These are the top level directories within gp2_tier1 and gp2_tier2 that contain public summary level and private participant level data (respectively). For example, release 1 on November 29th 2021 would be in the top level directory /release1_29112021 in both the tier1 and tier2 storage buckets.

Contact: For questions relating to data processing, please email dawg@gp2.org.

Release specific info follows below.

Release1_29112021

For a current list of samples, studies, cohorts and geographic territories covered by GP2 please see the GP2 website here.

General Information:

4908 samples are added in this release; the number of available GP2 samples now equals 4908.
New genotype samples were processed using GenoTools version 0.1 [https://github.com/dvitale199/GenoTools]. All samples were imputed to the TOPMed reference panel, as detailed in the GenoTools pipeline.
All data provided is GRCh38 (hg38).

* GDPR note: Currently, all data included in this release has been determined to comply with GDPR guidelines, as it comes from countries not governed by GDPR or participants who are no longer living.

Bucket and Directory Structure:

gp2_tier1
└── summary_statistics/

gp2_tier2
├── raw_genotypes/
├── imputed_genotypes/
├── meta_data/
├── clinical_data/
└── summary_statistics/

Bucket and Directory Overview:

gp2_tier1: this is the bucket for summary statistics and other non-participant-level data. Its top-level directories always correspond to releases, with the same structure mirrored in each release.
summary_statistics - The file META5_no23_with_rsids2.txt contains open access summary statistics from the most recent Parkinson’s GWAS (Nalls et al 2019, https://pubmed.ncbi.nlm.nih.gov/31701892/), excluding 23andMe samples. It can be found here as well as in the tier 2 storage bucket. Column headers conform to the standard METAL meta-analysis output [https://genome.sph.umich.edu/wiki/METAL_Documentation].
gp2_tier2: this is the bucket for participant-level data. Its top-level directories always correspond to releases, with the same structure mirrored in each release. Its contents are described below.
raw_genotypes - PLINK binary files for each ancestry group for all samples passing quality control prior to imputation. Each PLINK binary includes all attempted variants from the array for that ancestry group. As a note, for flexibility in community analyses, all known duplicate samples were removed but related samples remain.
imputed_genotypes - All genotype data has been imputed using the TOPMed reference panel and is contained in PLINK2 files separated by chromosome. Prior to upload, these files have been filtered for minor allele count > 10 and imputation quality > 0.3 as is industry standard. Each file set is separated by genetically defined ancestry groups prior to imputation.
meta_data - Meta data included in the HDF5 file GP2_round1.QC.metrics.h5 currently comprises QC metrics, ancestry counts, ancestry labels, confusion matrix, new samples UMAP, projected principal components, pruned samples, reference principal components, reference UMAP, and total (samples and reference) UMAP.
clinical_data - The corresponding data dictionary for an explanation of the columns can be found in release1_29112021_data_dictionary.csv.
summary_statistics - this includes basic summary statistics from gp2_tier1
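
The upload filter described for imputed_genotypes above (minor allele count > 10 and imputation quality > 0.3) can be expressed as a simple predicate. This is a minimal sketch, not the GenoTools implementation: the function name and the example variants are hypothetical, and the notes do not say which imputation quality metric is used.

```python
def passes_upload_filter(minor_allele_count: int, imputation_quality: float) -> bool:
    """Return True if a variant meets both thresholds stated in the notes."""
    # Both comparisons are strict, matching "> 10" and "> 0.3".
    return minor_allele_count > 10 and imputation_quality > 0.3


# Hypothetical variants: (minor allele count, imputation quality).
examples = [(25, 0.95), (10, 0.95), (25, 0.30), (11, 0.31)]
for mac, quality in examples:
    print(mac, quality, passes_upload_filter(mac, quality))
```

Variants sitting exactly on a threshold (a count of 10 or a quality of 0.3) are excluded under this reading, since both inequalities are strict.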

Ancestry group definitions

AAC - African Admixed
AFR - African Ancestry
AJ - Ashkenazi Jewish
AMR - Latino and Indigenous Americas populations
EUR - general European ancestry
EAS - East Asian ancestry
SAS - South Asian ancestry
FIN - Finnish population isolate

This release does not contain AFR or FIN due to insufficient sample size for imputation quality.
