GP2 10th Data Release Notes – June 2025


July 8, 2025

The Components of GP2’s 10th Data Release

Authors

Hampton Leonard
DataTecnica/National Institutes of Health | USA

Hampton has a background in data science and machine learning, which she applies to large multi-omic datasets in the neurodegenerative disease space. She is passionate about investigating differences on both clinical and omic levels and how these differences can affect clinical trial outcomes.

Mike Nalls
DataTecnica/National Institutes of Health | USA

Mike founded Data Tecnica in early 2017 after over a decade of experience in large dataset analytics and methods research in healthcare and other scientific fields. Mike has 400+ peer-reviewed publications in the field of applied statistics in large datasets, brain diseases, and genomics. He is a strong advocate of open science, collaboration, and transparency in science.

Dan Vitale
DataTecnica/National Institutes of Health | USA

Dan is a data science consultant for Data Tecnica, consulting primarily for the Laboratory of Neurogenetics and CARD at the National Institute on Aging of the National Institutes of Health. His work is focused on open science, automation, development of genetic analytic pipelines and software, and machine learning.

Mathew Koretsky
DataTecnica/National Institutes of Health | USA

Mat is a data science consultant for Data Tecnica, consulting primarily for CARD at the National Institute on Aging of the National Institutes of Health. He is passionate about pipeline development and meaningful applications of computer science in the biomedical research space.

Kristin Levine
DataTecnica/National Institutes of Health | USA

Kristin works with the Data Tecnica and National Institute on Aging (NIA) teams on data and code sharing plus real-world data analysis of biobanks and healthcare systems. She is also an accomplished writer, now applying her communication skills to scientific domains.

Mary B Makarious
DataTecnica/National Institutes of Health | USA

Mary is a biomedical data scientist committed to open science principles and enhancing diversity in genomic studies. With her background in machine learning, data science, and genetics, she analyzes large-scale multi-omics datasets to develop open, reproducible pipelines and user-friendly notebooks and tools. Her efforts aim to empower others to effectively explore and interpret their own data and to foster a more inclusive and collaborative scientific community.

Lietsel Jones
DataTecnica/National Institutes of Health | USA

Lietsel is an analyst with Data Tecnica with a keen interest in the intersection between epidemiology and genetics. She is also a clinical data manager with GP2 working to collect and harmonize large clinical datasets from worldwide contributors.

Zih-Hua Fang
German Center for Neurodegenerative Diseases | Germany

Zih-Hua leads the whole-genome sequencing data analysis efforts in GP2 and contributes to GP2’s work on monogenic and familial Parkinson’s disease.

J Solle
Michael J. Fox Foundation for Parkinson’s Research | USA

J is the implementation Program Lead for GP2, co-lead for the Operations & Compliance Working Group, and a member of the Operations Committee.

On behalf of the GP2 Operations & Compliance, Complex Disease Data Analysis, Monogenic Data Analysis, Clinical Integration, and Data and Code Dissemination Working Groups.

Overview

In July 2025, GP2 announced the 10th data release on the Terra and the Verily® Workbench platforms in collaboration with AMP® PD. This release includes 11,109 additional genotyped participants and 13,339 additional WGS participants.

The genotype array (NBA) data, including locally-restricted samples, now consists of a total of 82,944 genotyped participants (36,939 PD cases, 19,821 Controls, and 26,184 ‘Other’ phenotypes).
- Excluding the locally-restricted samples, the NBA data consists of 65,303 samples (28,586 PD cases, 15,258 Controls, and 21,459 ‘Other’ phenotypes).
The whole genome sequencing (WGS) data now consists of a total of 21,073 sequenced participants (8,134 PD cases, 3,531 Controls, and 9,408 ‘Other’ phenotypes).
- Excluding the locally-restricted samples, the WGS data consists of 16,608 participants (6,801 PD cases, 3,244 Controls, and 6,563 ‘Other’ phenotypes).
- Of note, cases recruited via the Monogenic network are coded as ‘Other’.
The clinical exome data now consists of 10,454 samples with PD (Release 8).
Of the 92,021 unique samples with genetic data (NBA, WGS, or clinical exome), 26,982 individuals also have additional extended clinical information.
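
As a quick consistency check on the counts above, the phenotype breakdowns can be summed and the size of each locally-restricted subset derived by subtraction. This is a minimal Python sketch that uses only the figures quoted in this overview; the dictionary layout and variable names are illustrative and not part of the release.

```python
# Counts quoted in the overview: (PD cases, controls, 'Other'), reported total.
counts = {
    "NBA, all samples": ((36_939, 19_821, 26_184), 82_944),
    "NBA, excluding locally-restricted": ((28_586, 15_258, 21_459), 65_303),
    "WGS, all samples": ((8_134, 3_531, 9_408), 21_073),
    "WGS, excluding locally-restricted": ((6_801, 3_244, 6_563), 16_608),
}


def breakdown_matches(parts, total):
    """Return True when the phenotype counts add up to the reported total."""
    return sum(parts) == total


all_consistent = all(breakdown_matches(p, t) for p, t in counts.values())

# Locally-restricted subset = full total minus the total without those samples.
nba_restricted = (
    counts["NBA, all samples"][1]
    - counts["NBA, excluding locally-restricted"][1]
)
wgs_restricted = (
    counts["WGS, all samples"][1]
    - counts["WGS, excluding locally-restricted"][1]
)

print(all_consistent, nba_restricted, wgs_restricted)
```

With the figures above, every breakdown sums to its reported total, and the locally-restricted subsets come to 17,641 for NBA and 4,465 for WGS.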

What’s New In This Release?

Expanding Genomic Data
This release introduces a substantial expansion in the number of participants with available genetic data. We have added:

11,109 new participants with genotype array (NBA) data
13,339 new participants with whole genome sequencing (WGS) data
12,311 new participants with extended clinical data
A family file (and corresponding data dictionary) which reports pairwise kinship estimates between individuals within families. It includes both inferred relationships (with kinship coefficients) and reported relationships.
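
To illustrate how a pairwise kinship coefficient maps to a relationship degree, here is a minimal Python sketch using the widely cited KING inference cutoffs. The cutoffs, the function name, and the example values are assumptions for illustration; this post does not state which method or thresholds were used to produce the family file, so the accompanying data dictionary remains the reference for the actual definitions.

```python
def infer_relationship(kinship: float) -> str:
    """Map a kinship coefficient to a relationship degree.

    Cutoffs follow the commonly used KING convention. This is an assumption
    here, not confirmed as the method behind the GP2 family file.
    """
    if kinship > 0.354:
        return "duplicate or monozygotic twin"
    if kinship > 0.177:
        return "first degree"
    if kinship > 0.0884:
        return "second degree"
    if kinship > 0.0442:
        return "third degree"
    return "unrelated or more distant"


# Hypothetical example values, not taken from the release.
examples = [0.5, 0.25, 0.125, 0.0625, 0.0]
labels = [infer_relationship(k) for k in examples]

for k, label in zip(examples, labels):
    print(f"{k:.4f} -> {label}")
```

Comparing an inferred degree like this against the reported relationship is one way to flag pairs that may need a closer look.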

Inclusion of PAR Region in Imputation
We’ve reintroduced the pseudoautosomal (PAR) region in the imputation of genotype array data, improving coverage and interpretation of sex chromosome variation. This change is part of ongoing efforts to improve genomic coverage and analytic accuracy.

Joint-calling Now Includes AMP® PD Cohorts

The jointly-called WGS variant sets now include samples from the following five AMP® PD cohorts: BioFind, PPMI, LCC, STEADY-PD3 and SURE-PD3.
- Processing these samples together with GP2 samples, rather than independently, minimizes missingness and artifacts and improves genotype accuracy.
We have added a column to the master key denoting which GP2 samples are also present in the AMP® PD dataset.

Targeted Imputation of rs3115534 Across Select Ancestries
In response to strong community interest in the intronic variant rs3115534, which has been associated with increased risk of Parkinson’s disease and REM sleep behavior disorder and has been functionally validated, we have implemented a targeted imputation strategy to ensure its inclusion in the released datasets.

Specifically, chromosome 1 was imputed for five ancestries (AFR, AAC, AMR, MDE, and CAH) using the 1000 Genomes Phase 3 30x high coverage reference panel.
Following imputation, data for rs3115534 was merged back into the TOPMed-based imputed files provided with GP2 releases. Note that imputation metrics for this variant did not meet quality thresholds (R2 < 0.3) in other ancestry groups.

rs3115534 Release 10 Imputation Metrics using Phase 3 30x 1000 Genomes Panel
Population	Status	AF	MAF	AVG_CS	R2
AFR	IMPUTED	0.761049	0.238951	0.992879	0.968831
AAC	IMPUTED	0.855586	0.144414	0.993458	0.959606
AMR	IMPUTED	0.983414	0.0165855	0.991453	0.507857
MDE	IMPUTED	0.980407	0.0195926	0.990912	0.584081
CAH	IMPUTED	0.93793	0.0620703	0.993982	0.909959
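
The relationship between the AF and MAF columns, and the R2 quality threshold mentioned above, can be sketched in a few lines of Python. The values are copied from the metrics table; the variable and function names are illustrative, and the tolerance only absorbs rounding in the published figures.

```python
# (population, AF, MAF, R2) copied from the Release 10 metrics table.
metrics = [
    ("AFR", 0.761049, 0.238951, 0.968831),
    ("AAC", 0.855586, 0.144414, 0.959606),
    ("AMR", 0.983414, 0.0165855, 0.507857),
    ("MDE", 0.980407, 0.0195926, 0.584081),
    ("CAH", 0.93793, 0.0620703, 0.909959),
]

R2_THRESHOLD = 0.3  # quality threshold quoted in the text


def minor_allele_frequency(af: float) -> float:
    """MAF is the frequency of the less common of the two alleles."""
    return min(af, 1.0 - af)


# The published MAF should match min(AF, 1 - AF) up to rounding.
maf_consistent = all(
    abs(minor_allele_frequency(af) - maf) < 1e-5 for _, af, maf, _ in metrics
)

# Populations whose imputation R2 meets the threshold.
passing = [pop for pop, _, _, r2 in metrics if r2 >= R2_THRESHOLD]

print(maf_consistent, passing)
```

All five listed ancestries clear the 0.3 threshold, though AMR and MDE do so with noticeably lower R2 than the others, which is worth keeping in mind when analyzing this variant in those groups.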

New Summary Statistics Now Available
We’ve made available several new GWAS summary statistics datasets, expanding global representation:

GP2’s European (EUR) meta-GWAS (pre-print; GitHub)
South African GWAS (pre-print pending; GitHub)
Indian GWAS (pre-print; GitHub)
RBD (REM Sleep Behavior Disorder) GWAS (pre-print pending; GitHub pending)
LARGE-PD GWAS, which includes Latino American participants (pre-print pending; GitHub pending)

Clinical Data
This release contains clinical data for a total of 92,021 individuals who have genetic and core clinical data available. Of these, 26,982 have deep clinical phenotyping data available. This information consists of:

Age at diagnosis and onset
Primary, current, and latest diagnoses
Cognitive exams such as the Mini-Mental State Examination (MMSE) and the Montreal Cognitive Assessment (MoCA)
Movement Disorder Society-Sponsored Revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS)
Detailed “other” phenotypes, such as Lewy body dementia (LBD)

Individual-Level Data
We now capture the data from a total of 124 cohorts. Please refer to the GP2 Cohort Dashboard for more information on the cohorts that have been shared.

The genetically-determined ancestry of array-genotyped GP2 participants is broken into 11 ancestry groups; the tables below provide details of the genetically-determined ancestry of participants in this release that have passed quality control for array data and whole genome sequencing data. These numbers reflect samples from previous releases, reclustered using the updated cluster file and subjected to quality control, as well as newly genotyped samples exclusive to this release. The final table provides information about the genetically-determined ancestry of selected other, non-PD phenotypes.

Array Genotyped Data - GP2 Release 10
Ancestry	Total (+VWB)	PD (+VWB)	Control (+VWB)	Other (+VWB)
African	3,754 (3,780)	1,181 (1,191)	2,305 (2,307)	268 (282)
African Admixed	1,192 (1,215)	361 (370)	760 (763)	71 (82)
Ashkenazi Jewish	3,265 (3,472)	1,482 (1,531)	408 (435)	1,375 (1,506)
Latino and Indigenous people of the Americas	3,564 (3,608)	1,974 (1,995)	1,433 (1,439)	157 (174)
East Asian	6,619 (6,662)	2,393 (2,411)	2,697 (2,705)	1,529 (1,546)
European	41,901 (58,823)	18,703 (26,778)	5,899 (10,372)	17,299 (21,673)
South Asian	801 (945)	270 (317)	260 (269)	271 (359)
Central Asian	1,670 (1,691)	776 (782)	624 (626)	270 (283)
Middle Eastern	1,349 (1,493)	675 (752)	535 (559)	139 (182)
Finnish	116 (144)	87 (106)	8 (12)	21 (26)
Complex Admixture	1,072 (1,111)	684 (706)	329 (334)	59 (71)
Total	65,303 (82,944)	28,586 (36,939)	15,258 (19,821)	21,459 (26,184)
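
Each cell in the ancestry tables pairs a count with a second count in parentheses (the "+VWB" figure). As a minimal sketch of how such cells can be parsed and the Total row re-derived, the Python below sums the Total column of the array table. The function name and dictionary layout are illustrative; the parser tolerates missing thousands separators.

```python
import re

# "Total (+VWB)" column of the array genotyped table, one entry per ancestry.
rows = {
    "African": "3,754 (3,780)",
    "African Admixed": "1,192 (1,215)",
    "Ashkenazi Jewish": "3,265 (3,472)",
    "Latino and Indigenous people of the Americas": "3,564 (3,608)",
    "East Asian": "6,619 (6,662)",
    "European": "41,901 (58,823)",
    "South Asian": "801 (945)",
    "Central Asian": "1,670 (1,691)",
    "Middle Eastern": "1,349 (1,493)",
    "Finnish": "116 (144)",
    "Complex Admixture": "1,072 (1,111)",
}
reported_total = "65,303 (82,944)"

CELL = re.compile(r"\s*([\d,]+)\s*\(\s*([\d,]+)\s*\)\s*")


def parse_cell(cell: str) -> tuple[int, int]:
    """Split 'N (M)' into two integers, ignoring commas and extra spaces."""
    match = CELL.fullmatch(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    base, with_vwb = match.groups()
    return int(base.replace(",", "")), int(with_vwb.replace(",", ""))


parsed = [parse_cell(cell) for cell in rows.values()]
summed = (sum(b for b, _ in parsed), sum(v for _, v in parsed))

print(summed, summed == parse_cell(reported_total))
```

The per-ancestry figures add up to the reported Total row in both columns, and the same check can be applied to the PD, Control, and Other columns or to the WGS table.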

Whole Genome Sequenced Data - GP2 Release 10
Ancestry	Total (+VWB)	PD (+VWB)	Control (+VWB)	Other (+VWB)
African	1,671 (1,696)	646 (656)	848 (853)	177 (187)
African Admixed	254 (267)	126 (130)	113 (114)	15 (23)
Ashkenazi Jewish	1,389 (1,485)	337 (355)	100 (106)	952 (1,024)
Latino and Indigenous people of the Americas	301 (333)	154 (171)	24 (24)	123 (138)
East Asian	2,525 (2,542)	576 (582)	343 (343)	1,606 (1,617)
European	8,354 (12,461)	4,155 (5,389)	1,131 (1,397)	3,068 (5,675)
South Asian	309 (417)	47 (73)	10 (16)	252 (328)
Central Asian	833 (840)	259 (261)	329 (330)	245 (249)
Middle Eastern	788 (824)	386 (394)	308 (309)	94 (121)
Finnish	22 (30)	17 (20)	4 (4)	1 (6)
Complex Admixture	162 (178)	98 (103)	34 (35)	30 (40)
Total	16,608 (21,073)	6,801 (8,134)	3,244 (3,531)	6,563 (9,408)

Selected Other (Non-PD) Phenotypes, Array Genotyped (NBA) / WGS Data - GP2 Release 10
Ancestry	Prodromal NBA/WGS	PSP NBA/WGS	AD NBA/WGS	DLB NBA/WGS	MSA NBA/WGS	CBD/CBS NBA/WGS	FTD NBA/WGS
African	16/7	6/4	0/0	2/0	7/4	1/0	0/0
African Admixed	23/7	4/2	1/0	0/0	2/0	1/0	0/0
Ashkenazi Jewish	308/71	23/12	9/0	14/6	8/3	4/3	2/1
Latino and Indigenous people of the Americas	30/11	5/0	5/0	2/0	2/0	1/0	0/0
East Asian	27/4	14/63	4/4	18/0	6/178	2/32	0/0
European	4206/848	1307/920	484/136	442/340	421/334	166/159	65/63
South Asian	3/2	34/32	1/0	5/1	5/8	9/9	2/2
Central Asian	4/4	4/1	70/72	4/1	1/0	4/1	0/0
Middle Eastern	14/1	9/4	2/2	1/0	0/0	1/1	1/1
Finnish	9/0	2/1	2/0	0/0	1/1	0/0	1/0
Complex Admixture	9/2	7/5	5/4	3/1	1/0	0/0	1/1
Total	4649/957	1415/1044	583/218	491/349	454/528	189/205	72/68

Snapshot of Clinical Data - GP2 Release 10 (on VWB)
Clinical Data	N, Unique IDs	N, IDs with Follow-up
Age at Sample Collection	71,747	-
Age at Onset	38,718	-
Age at Diagnosis	31,667	-
Basic Family History	92,021	-
Demographics	26,701	-
Hoehn & Yahr Stage	11,486	5,515
UPDRS Part 1 Score	2,359	1,057
UPDRS Part 2 Score	2,338	1,049
UPDRS Part 3 Score	3,606	1,084
UPDRS Part 4 Score	1,739	1,090
MDS-UPDRS Part 1 Score	5,168	2,802
MDS-UPDRS Part 2 Score	5,242	2,854
MDS-UPDRS Part 3 Score	7,532	2,870
MDS-UPDRS Part 4 Score	2,479	1,016
MoCA	9,500	2,753
MMSE	1,954	-
RBD Score	3,986	3,290
Head Trauma	5,495	3,747
Vitals	5,895	4,035
Smell	5,200	1,466

Data Access

Locality-restricted GDPR samples via the Verily Viewpoint Workbench

We are continuing to pilot granting access to locally-restricted samples, otherwise known as samples governed by the General Data Protection Regulation (GDPR) policy, through our collaboration with the Verily Viewpoint Workbench.

At this time, as GP2 continues to roll out data sharing solutions for GDPR-protected data, release 10 data with regional restrictions will be available only to GP2 consortium members and partners. As testing and implementation continue in 2025, this solution will be made available to the broader research community. All release 10 samples can be found on the Verily Viewpoint Workbench (VWB), while all release 10 samples not governed by GDPR requirements can be found on the community workbench on Terra (like all previous releases). To gain access to the full release on VWB you must:

Have approved GP2 Tier 2 access
Fill out the GDPR-governed sample request form
Be a GP2 consortium member (contributing cohort, GP2 partner, or project analyses team member)

Future data releases will continue to grow the diversity of participants available. You can check out our dashboard to see our progress. Users who already have Tier 2 access can explore the data further on our cohort browser, which is described in more detail in a previous blog post.

As always, please refer to the README that accompanies each GP2 release for further details regarding recommendations for quality control, pipelines, data, and analyses!