GP2 10th Data Release Notes – June 2025
The Components of GP2’s 10th Data Release
Tags
Research Operations; Research Collaboration; Complex Disease Genetics; Release
Authors
Hampton Leonard
DataTecnica/National Institutes of Health | USA
Hampton has a background in data science and machine learning, which she applies to large multi-omic datasets in the neurodegenerative disease space. She is passionate about investigating differences on both clinical and omic levels and how these differences can affect clinical trial outcomes.
Mike Nalls
DataTecnica/National Institutes of Health | USA
Mike founded Data Tecnica in early 2017 after over a decade of experience in large dataset analytics and methods research in healthcare and other scientific fields. Mike has 400+ peer-reviewed publications in the field of applied statistics in large datasets, brain diseases, and genomics. He is a strong advocate of open science, collaboration, and transparency in science.
Dan Vitale
DataTecnica/National Institutes of Health | USA
Dan is a data science consultant for Data Tecnica, consulting primarily for the Laboratory of Neurogenetics and CARD at the National Institute on Aging of the National Institutes of Health. His work is focused on open science, automation, development of genetic analytic pipelines and software, and machine learning.
Mathew Koretsky
DataTecnica/National Institutes of Health | USA
Mat is a data science consultant for Data Tecnica, consulting primarily for CARD at the National Institute on Aging of the National Institutes of Health. He is passionate about pipeline development and meaningful applications of computer science in the biomedical research space.
Kristin Levine
DataTecnica/National Institutes of Health | USA
Kristin works with the Data Tecnica and National Institute on Aging (NIA) teams on data and code sharing plus real-world data analysis of biobanks and healthcare systems. She is also an accomplished writer, now applying her communication skills to scientific domains.
Mary B Makarious
DataTecnica/National Institutes of Health | USA
Mary is a biomedical data scientist committed to open science principles and enhancing diversity in genomic studies. With her background in machine learning, data science, and genetics, she analyzes large-scale multi-omics datasets to develop open, reproducible pipelines and user-friendly notebooks and tools. Her efforts aim to empower others to effectively explore and interpret their own data and to foster a more inclusive and collaborative scientific community.
Lietsel Jones
DataTecnica/National Institutes of Health | USA
Lietsel is an analyst with Data Tecnica with a keen interest in the intersection between epidemiology and genetics. She is also a clinical data manager with GP2 working to collect and harmonize large clinical datasets from worldwide contributors.
Zih-Hua Fang
German Center for Neurodegenerative Diseases | Germany
Zih-Hua leads the whole-genome sequencing data analysis efforts in GP2 and contributes to GP2’s work on monogenic and familial Parkinson’s disease.
J Solle
Michael J. Fox Foundation for Parkinson’s Research | USA
J is the implementation Program Lead for GP2, co-lead for the Operations & Compliance Working Group, and a member of the Operations Committee.
On behalf of the GP2 Operations & Compliance, Complex Disease Data Analysis, Monogenic Data Analysis, Clinical Integration, and Data and Code Dissemination Working Groups.
Overview
In July 2025, GP2 announced the 10th data release on the Terra and Verily® Workbench (VWB) platforms in collaboration with AMP® PD. This release includes 11,109 additional genotyped participants and 13,339 additional WGS participants.
- The genotype array (NBA) data, including locally-restricted samples, now consists of a total of 82,944 genotyped participants (36,939 PD cases, 19,821 Controls, and 26,184 ‘Other’ phenotypes).
- Excluding the locally-restricted samples, the NBA data consists of 65,303 samples (28,586 PD cases, 15,258 Controls, and 21,459 ‘Other’ phenotypes).
- The whole genome sequencing (WGS) data now consists of a total of 21,073 sequenced participants (8,134 PD cases, 3,531 Controls, and 9,408 ‘Other’ phenotypes).
- Excluding the locally-restricted samples, the WGS data consists of 16,608 participants (6,801 PD cases, 3,244 Controls, and 6,563 ‘Other’ phenotypes).
- Of note, cases recruited via the Monogenic network are coded as ‘Other’.
- The clinical exome data now consists of 10,454 samples with PD (Release 8).
- Of the 92,021 unique samples with genetic data (NBA, WGS, or clinical exome), 26,982 individuals also have additional extended clinical information.
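The figures in this overview can be cross-checked with a little arithmetic. The sketch below uses only the counts quoted above; the dictionary layout and labels are illustrative and not part of any GP2 tooling. It sums the phenotype groups for each dataset and derives how many participants are locally restricted.

```python
# Counts copied from the overview above: (PD cases, controls, other).
counts = {
    "NBA, all samples": (36_939, 19_821, 26_184),
    "NBA, excluding locally-restricted": (28_586, 15_258, 21_459),
    "WGS, all samples": (8_134, 3_531, 9_408),
    "WGS, excluding locally-restricted": (6_801, 3_244, 6_563),
}

# Each total is the sum of its three phenotype groups.
totals = {label: sum(groups) for label, groups in counts.items()}

# Locally-restricted participants = all samples minus the unrestricted subset.
restricted = {
    "NBA": totals["NBA, all samples"] - totals["NBA, excluding locally-restricted"],
    "WGS": totals["WGS, all samples"] - totals["WGS, excluding locally-restricted"],
}

for label, total in totals.items():
    print(f"{label}: {total:,}")
for platform, n in restricted.items():
    print(f"{platform} locally-restricted: {n:,}")
```

The computed totals match the stated ones (82,944 and 65,303 for NBA; 21,073 and 16,608 for WGS), which implies 17,641 locally-restricted NBA samples and 4,465 locally-restricted WGS participants.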
What’s New In This Release?
Expanding Genomic Data
This release introduces a substantial expansion in the number of participants with available genetic data. We have added:
- 11,109 new participants with genotype array (NBA) data
- 13,339 new participants with whole genome sequencing (WGS) data
- 12,311 new participants with extended clinical data
- A family file (and corresponding data dictionary) which reports pairwise kinship estimates between individuals within families. It includes both inferred relationships (with kinship coefficients) and reported relationships.
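To illustrate how inferred and reported relationships in a file like this might be compared, here is a minimal sketch. It assumes KING-style kinship coefficients and the commonly used KING cut-offs; these notes do not state which tool or thresholds GP2 used. The field names (`id1`, `id2`, `kinship`, `reported`) and the example records are hypothetical, so consult the family file's data dictionary for the actual columns.

```python
# Commonly used KING kinship cut-offs (an assumption; GP2's thresholds may differ).
KING_BOUNDS = [
    (0.354, "duplicate/MZ twin"),
    (0.177, "1st degree"),
    (0.0884, "2nd degree"),
    (0.0442, "3rd degree"),
]


def infer_degree(kinship: float) -> str:
    """Map a kinship coefficient to a relationship degree."""
    for lower_bound, label in KING_BOUNDS:
        if kinship > lower_bound:
            return label
    return "unrelated"


def discordant_pairs(pairs):
    """Return pairs whose reported degree differs from the inferred one."""
    return [p for p in pairs if infer_degree(p["kinship"]) != p["reported"]]


# Made-up example records, not real GP2 data.
example = [
    {"id1": "A", "id2": "B", "kinship": 0.25, "reported": "1st degree"},
    {"id1": "A", "id2": "C", "kinship": 0.12, "reported": "1st degree"},
    {"id1": "B", "id2": "C", "kinship": 0.01, "reported": "unrelated"},
]
flagged = discordant_pairs(example)
print([(p["id1"], p["id2"]) for p in flagged])
```

In this toy example only the A/C pair is flagged: it is reported as first degree, but a coefficient of 0.12 falls in the second-degree band.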
Inclusion of PAR Region in Imputation
We’ve reintroduced the pseudoautosomal region (PAR) in the imputation of genotype array data, improving coverage and interpretation of sex chromosome variation. This change is part of ongoing efforts to improve genomic coverage and analytic accuracy.
Joint-calling Now Includes AMP® PD Cohorts
- The jointly-called WGS variant sets now include samples from the following five AMP® PD cohorts: BioFind, PPMI, LCC, STEADY-PD3 and SURE-PD3.
- Processing these samples together with GP2 samples, rather than independently, minimizes missingness and artifacts and improves genotype accuracy.
- We have added a column to the master key denoting which GP2 samples are also present in the AMP® PD dataset.
Targeted Imputation of rs3115534 Across Select Ancestries
In response to strong community interest in the intronic variant rs3115534, which has been associated with increased risk of Parkinson’s disease and REM sleep behavior disorder and has been functionally validated, we have implemented a targeted imputation strategy to ensure its inclusion in the released datasets:
- Specifically, chromosome 1 was imputed for five ancestries (AFR, AAC, AMR, MDE, and CAH) using the 1000 Genomes Phase 3 30x high coverage reference panel.
- Following imputation, data for rs3115534 was merged back into the TOPMed-based imputed files provided with GP2 releases. Note that imputation metrics for this variant did not meet quality thresholds (R2 < 0.3) in other ancestry groups.
rs3115534 Release 10 Imputation Metrics using Phase 3 30x 1000 Genomes Panel

Population | Status | AF | MAF | AVG_CS | R2 |
---|---|---|---|---|---|
AFR | IMPUTED | 0.761049 | 0.238951 | 0.992879 | 0.968831 |
AAC | IMPUTED | 0.855586 | 0.144414 | 0.993458 | 0.959606 |
AMR | IMPUTED | 0.983414 | 0.0165855 | 0.991453 | 0.507857 |
MDE | IMPUTED | 0.980407 | 0.0195926 | 0.990912 | 0.584081 |
CAH | IMPUTED | 0.93793 | 0.0620703 | 0.993982 | 0.909959 |
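In the table above, the MAF column is consistent with MAF = min(AF, 1 - AF), and the R2 column is the value compared against the 0.3 quality threshold mentioned earlier. The sketch below re-derives both from the table values; the variable and function names are illustrative and not taken from the GP2 pipelines.

```python
# (AF, R2) per ancestry, copied from the table above.
metrics = {
    "AFR": (0.761049, 0.968831),
    "AAC": (0.855586, 0.959606),
    "AMR": (0.983414, 0.507857),
    "MDE": (0.980407, 0.584081),
    "CAH": (0.93793, 0.909959),
}
R2_THRESHOLD = 0.3  # quality threshold quoted in the text


def minor_allele_frequency(af: float) -> float:
    """The minor allele is whichever allele is rarer."""
    return min(af, 1.0 - af)


maf = {pop: minor_allele_frequency(af) for pop, (af, _) in metrics.items()}
passing = [pop for pop, (_, r2) in metrics.items() if r2 >= R2_THRESHOLD]

for pop in metrics:
    print(f"{pop}: MAF={maf[pop]:.6f} R2={metrics[pop][1]:.6f}")
print("Pass R2 threshold:", passing)
```

All five listed ancestries clear the threshold, although AMR and MDE, where the minor allele is rare, do so with noticeably lower R2 than AFR, AAC, and CAH.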
New Summary Statistics Now Available
We’ve made available several new GWAS summary statistics datasets, expanding global representation:
- GP2’s European (EUR) meta-GWAS (pre-print; GitHub)
- South African GWAS (pre-print pending; GitHub)
- Indian GWAS (pre-print; GitHub)
- RBD (REM Sleep Behavior Disorder) GWAS (pre-print pending; GitHub pending)
- LARGE-PD GWAS, which includes Latino American participants (pre-print pending; GitHub pending)
Clinical Data
This release contains clinical data for a total of 92,021 individuals who have genetic and core clinical data available. Of these, 26,982 have deep clinical phenotyping data available. This information consists of:
- Age at diagnosis and onset
- Primary, current, and latest diagnoses
- Cognitive exams such as the Mini-Mental State Examination (MMSE) and the Montreal Cognitive Assessment (MoCA)
- Movement Disorder Society-Sponsored Revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS)
- Detailed “other” phenotypes, such as Lewy body dementia (LBD)
Individual-Level Data
We now capture the data from a total of 124 cohorts. Please refer to the GP2 Cohort Dashboard for more information on the cohorts that have been shared.
The genetically-determined ancestry of GP2 participants is broken into 11 ancestry groups. The tables below give the genetically-determined ancestry of participants in this release who have passed quality control for array data and for whole genome sequencing data. These numbers reflect samples from previous releases, reclustered using the updated cluster file and subjected to quality control, as well as newly genotyped samples exclusive to this release. The final table provides information about the genetically-determined ancestry of selected other, non-PD phenotypes.
Array Genotyped Data - GP2 Release 10

Ancestry | Total (+VWB) | PD (+VWB) | Control (+VWB) | Other (+VWB) |
---|---|---|---|---|
African | 3,754 (3,780) | 1,181 (1,191) | 2,305 (2,307) | 268 (282) |
African Admixed | 1,192 (1,215) | 361 (370) | 760 (763) | 71 (82) |
Ashkenazi Jewish | 3,265 (3,472) | 1,482 (1,531) | 408 (435) | 1,375 (1,506) |
Latino and Indigenous people of the Americas | 3,564 (3,608) | 1,974 (1,995) | 1,433 (1,439) | 157 (174) |
East Asian | 6,619 (6,662) | 2,393 (2,411) | 2,697 (2,705) | 1,529 (1,546) |
European | 41,901 (58,823) | 18,703 (26,778) | 5,899 (10,372) | 17,299 (21,673) |
South Asian | 801 (945) | 270 (317) | 260 (269) | 271 (359) |
Central Asian | 1,670 (1,691) | 776 (782) | 624 (626) | 270 (283) |
Middle Eastern | 1,349 (1,493) | 675 (752) | 535 (559) | 139 (182) |
Finnish | 116 (144) | 87 (106) | 8 (12) | 21 (26) |
Complex Admixture | 1,072 (1,111) | 684 (706) | 329 (334) | 59 (71) |
Total | 65,303 (82,944) | 28,586 (36,939) | 15,258 (19,821) | 21,459 (26,184) |
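Each cell in the table above holds two counts: the number outside the parentheses and the "+VWB" number inside them. Judging by the Total row, which matches the overview figures, the first count excludes locally-restricted samples and the second includes them. Below is a minimal sketch for splitting such cells when working with the table as text; `parse_cell` is a hypothetical helper, not part of any GP2 package.

```python
import re

# Matches cells such as "3,754 (3,780)", with or without thousands separators.
CELL = re.compile(r"^\s*([\d,]+)\s*\(([\d,]+)\)\s*$")


def parse_cell(cell: str) -> tuple[int, int]:
    """Split 'N (M)' into (count, count_including_vwb)."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    base, with_vwb = (int(group.replace(",", "")) for group in match.groups())
    return base, with_vwb


base, with_vwb = parse_cell("65,303 (82,944)")
print(with_vwb - base)  # locally-restricted samples in the Total row
```

The pattern tolerates missing thousands separators and missing spaces before the parenthesis, so it also handles cells that are formatted less consistently.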

Whole Genome Sequenced Data - GP2 Release 10

Ancestry | Total (+VWB) | PD (+VWB) | Control (+VWB) | Other (+VWB) |
---|---|---|---|---|
African | 1,671 (1,696) | 646 (656) | 848 (853) | 177 (187) |
African Admixed | 254 (267) | 126 (130) | 113 (114) | 15 (23) |
Ashkenazi Jewish | 1,389 (1,485) | 337 (355) | 100 (106) | 952 (1,024) |
Latino and Indigenous people of the Americas | 301 (333) | 154 (171) | 24 (24) | 123 (138) |
East Asian | 2,525 (2,542) | 576 (582) | 343 (343) | 1,606 (1,617) |
European | 8,354 (12,461) | 4,155 (5,389) | 1,131 (1,397) | 3,068 (5,675) |
South Asian | 309 (417) | 47 (73) | 10 (16) | 252 (328) |
Central Asian | 833 (840) | 259 (261) | 329 (330) | 245 (249) |
Middle Eastern | 788 (824) | 386 (394) | 308 (309) | 94 (121) |
Finnish | 22 (30) | 17 (20) | 4 (4) | 1 (6) |
Complex Admixture | 162 (178) | 98 (103) | 34 (35) | 30 (40) |
Total | 16,608 (21,073) | 6,801 (8,134) | 3,244 (3,531) | 6,563 (9,408) |

Selected Other (Non-PD) Phenotypes, Array Genotyped (NBA) and Whole Genome Sequenced (WGS) Data - GP2 Release 10

Ancestry | Prodromal NBA/WGS | PSP NBA/WGS | AD NBA/WGS | DLB NBA/WGS | MSA NBA/WGS | CBD/CBS NBA/WGS | FTD NBA/WGS |
---|---|---|---|---|---|---|---|
African | 16/7 | 6/4 | 0/0 | 2/0 | 7/4 | 1/0 | 0/0 |
African Admixed | 23/7 | 4/2 | 1/0 | 0/0 | 2/0 | 1/0 | 0/0 |
Ashkenazi Jewish | 308/71 | 23/12 | 9/0 | 14/6 | 8/3 | 4/3 | 2/1 |
Latino and Indigenous people of the Americas | 30/11 | 5/0 | 5/0 | 2/0 | 2/0 | 1/0 | 0/0 |
East Asian | 27/4 | 14/63 | 4/4 | 18/0 | 6/178 | 2/32 | 0/0 |
European | 4,206/848 | 1,307/920 | 484/136 | 442/340 | 421/334 | 166/159 | 65/63 |
South Asian | 3/2 | 34/32 | 1/0 | 5/1 | 5/8 | 9/9 | 2/2 |
Central Asian | 4/4 | 4/1 | 70/72 | 4/1 | 1/0 | 4/1 | 0/0 |
Middle Eastern | 14/1 | 9/4 | 2/2 | 1/0 | 0/0 | 1/1 | 1/1 |
Finnish | 9/0 | 2/1 | 2/0 | 0/0 | 1/1 | 0/0 | 1/0 |
Complex Admixture | 9/2 | 7/5 | 5/4 | 3/1 | 1/0 | 0/0 | 1/1 |
Total | 4,649/957 | 1,415/1,044 | 583/218 | 491/349 | 454/528 | 189/205 | 72/68 |

Snapshot of Clinical Data - GP2 Release 10 (on VWB)

Clinical Data | N, Unique IDs | N, IDs with Follow-up |
---|---|---|
Age at Sample Collection | 71,747 | - |
Age at Onset | 38,718 | - |
Age at Diagnosis | 31,667 | - |
Basic Family History | 92,021 | - |
Demographics | 26,701 | - |
Hoehn & Yahr Stage | 11,486 | 5,515 |
UPDRS Part 1 Score | 2,359 | 1,057 |
UPDRS Part 2 Score | 2,338 | 1,049 |
UPDRS Part 3 Score | 3,606 | 1,084 |
UPDRS Part 4 Score | 1,739 | 1,090 |
MDS UPDRS Part 1 Score | 5,168 | 2,802 |
MDS UPDRS Part 2 Score | 5,242 | 2,854 |
MDS UPDRS Part 3 Score | 7,532 | 2,870 |
MDS UPDRS Part 4 Score | 2,479 | 1,016 |
MoCA | 9,500 | 2,753 |
MMSE | 1,954 | - |
RBD Score | 3,986 | 3,290 |
Head Trauma | 5,495 | 3,747 |
Vitals | 5,895 | 4,035 |
Smell | 5,200 | 1,466 |
Data Access
Locally-restricted GDPR Samples via the Verily Viewpoint Workbench
We are continuing to pilot granting access to locally-restricted samples, otherwise known as samples governed by the General Data Protection Regulation (GDPR) policy, through our collaboration with the Verily Viewpoint Workbench.
At this time, as GP2 continues to roll out data sharing solutions for GDPR-protected data, Release 10 data with regional restrictions will be available only to GP2 consortium members and partners. As testing and implementation continue in 2025, this solution will be made available to the broader research community. All Release 10 samples can be found on the Verily Workbench, while Release 10 samples not governed by GDPR requirements can also be found on the community workbench on Terra (like all previous releases). To gain access to the full release on VWB you must:
- Have approved GP2 Tier 2 access
- Fill out the GDPR-governed sample request form
- Be a GP2 consortium member (contributing cohort, GP2 partner, or project analyses team member)
Future data releases will continue to grow the diversity of participants available. You can check out our dashboard to see our progress. Users who already have Tier 2 access can explore the data further on our cohort browser, which is described in a previous blog post.
As always, please refer to the README that accompanies each GP2 release for further details regarding recommendations for quality control, pipelines, data, and analyses!