Transcriptomics Data

AMP PD provides RNA FASTQ files and workflow products from Salmon, STAR, and featureCounts for the BioFIND, PDBP, and PPMI cohorts. All RNA sequencing was performed by HudsonAlpha at 150 base pairs, and the data are supplied along with corresponding clinical data.

Processed RNA-Seq Totals

Cohort   Baseline  Month 0.5  Month 06  Month 12  Month 18  Month 24  Month 36  Totals
BioFIND         0        208         0         0         0         0         0     208
PDBP        1,466          0       574       614       506       471         0   3,631
PPMI        1,522          0       853       873         0       833       541   4,622
Totals      2,988        208     1,427     1,487       506     1,304       541   8,461
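
The totals in the table above can be cross-checked with a few lines of Python. This is only a sanity check on the published numbers: the counts are copied from the table, and the variable and function names are illustrative.

```python
# Per-cohort sample counts by visit, copied from the table above.
VISITS = ["Baseline", "Month 0.5", "Month 06", "Month 12",
          "Month 18", "Month 24", "Month 36"]
COUNTS = {
    "BioFIND": [0, 208, 0, 0, 0, 0, 0],
    "PDBP": [1466, 0, 574, 614, 506, 471, 0],
    "PPMI": [1522, 0, 853, 873, 0, 833, 541],
}


def cohort_totals(counts):
    """Sum each cohort's row across all visits."""
    return {cohort: sum(row) for cohort, row in counts.items()}


def visit_totals(counts):
    """Sum each visit's column across all cohorts."""
    return [sum(column) for column in zip(*counts.values())]


row_totals = cohort_totals(COUNTS)
col_totals = visit_totals(COUNTS)
grand_total = sum(row_totals.values())
```

Both the row sums and the column sums should add up to the same grand total as the bottom-right cell of the table.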

Library Preparation and Protocol Details

All RNA samples were normalized to 30 ng/µL. Depending on the available material, the amount of RNA used in the rRNA and globin reduction step ranged from 684 to 752 ng. All products were used following the manufacturer's directions except where noted. All samples underwent rRNA and globin reduction via the Illumina Globin-Zero Gold kit (catalog number GZG1224). Following RNA reduction, stranded libraries were prepared by first-strand and second-strand synthesis using the New England Biolabs (NEB) Ultra II First Strand Module (catalog number E7771L) followed by the NEB Ultra II Directional Second Strand Module (catalog number E7550L).

Following second strand synthesis, the double-stranded cDNA was converted to a sequencing library by standard, ligation-based library preparation. The following NEB modules were used, in order:

  1. NEB End Repair Module (catalog number E6050L)
  2. NEB A-tailing Module (catalog number E6053L)
  3. NEB Quick Ligation Module (catalog number E6056L)

Each of the modules was scaled back 1:2 from the recommended enzyme concentrations using the appropriate buffers. Following ligation to standard Illumina paired-end adaptors, each library was amplified for 12 cycles of PCR using Roche Kapa HiFi polymerase (catalog number KK2612). Each forward and reverse PCR primer included an 8nt unique index sequence. After PCR, the insert sizes were evaluated via Perkin-Elmer Caliper GX.  

Libraries were quantitated using the Roche Kapa SYBR FAST Universal kit (catalog number KR0389) and diluted to 2nM final stocks, pooled in equal molar amounts and sequenced on the Illumina NovaSeq 6000 platform to generate 100M paired reads per sample at 150 nt read lengths. Samples were demultiplexed based on the unique i5 and i7 indexes to individual sample FASTQ files.
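
The demultiplexing step can be illustrated with a minimal sketch. It assumes exact matching of the 8 nt i7/i5 index pair against a sample sheet. The index sequences and sample IDs below are made up for illustration, and production demultiplexers typically also tolerate a small number of index mismatches, which this sketch does not.

```python
from collections import Counter

# Hypothetical sample sheet: (i7, i5) 8 nt index pair -> sample ID.
# These sequences and IDs are invented for illustration only.
SAMPLE_SHEET = {
    ("ACGTACGT", "TTGCAAGC"): "SAMPLE_A",
    ("GGATCCAA", "CATGGTAC"): "SAMPLE_B",
}


def assign_sample(i7, i5, sample_sheet):
    """Return the sample ID for an exact index-pair match, else None."""
    return sample_sheet.get((i7, i5))


def demultiplex(index_pairs, sample_sheet):
    """Count reads per sample; unmatched pairs go to 'Undetermined'."""
    counts = Counter()
    for i7, i5 in index_pairs:
        sample = assign_sample(i7, i5, sample_sheet)
        counts[sample if sample is not None else "Undetermined"] += 1
    return counts
```

Because both indexes are unique per sample, a read whose i7 belongs to one sample and whose i5 belongs to another is left unassigned rather than assigned to either sample.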

Workflows

  1. Salmon is a method for quantifying transcript abundance from RNA-seq reads that is accurate and fast. Salmon uses new algorithms to provide accurate expression estimates quickly while using little memory. Salmon performs its inference using an expressive and realistic model of RNA-seq data that takes into account experimental attributes and biases commonly observed in real RNA-seq data.
  2. STAR (Spliced Transcripts Alignment to a Reference) aligns high-throughput long and short RNA-seq data to a reference genome using uncompressed suffix arrays. STAR is standalone software capable of aligning reads in a continuous streaming mode. It can detect canonical junctions, non-canonical splices, and chimeric transcripts, and can map full-length RNA sequences.
  3. featureCounts is a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments. The program is developed for counting reads to genomic features such as genes, exons, promoters and genomic bins.

Transcriptomics Processing & Sequencing Strategy

  1. Develop a comprehensive RNA resource from whole blood samples that can be easily accessed and utilized by researchers
  2. Choose methods to comprehensively profile the samples for researchers to interrogate genes, pathways, and mechanisms that play a role in disease
  3. Approach and sequencing strategy should enable scientific inquiries and data analysis into the future with broad applicability
  4. Organize the data in a way that will be accessible to a wide range of investigators - from investigators that have the ability to download and analyze the raw files to researchers that do not have significant bioinformatics capabilities

Transcriptomics Research Data Dictionary

The full AMP PD Transcriptomics Research Data Dictionary is available for download in several formats.

Data Analysis & Processing

Transcript abundance was estimated using two pipelines. First, Transcripts Per Million (TPM) values were generated with the Salmon pipeline directly from FASTQ files against GENCODE v29. Second, counts per gene were generated by counting aligned reads from STAR-generated BAMs aligned to build 38 (GRCh38) of the human genome.
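
For readers unfamiliar with the TPM unit, the sketch below shows the standard definition: each transcript's read count is divided by its effective length, and the resulting rates are rescaled to sum to one million. It only illustrates the unit. Salmon estimates abundances with its own inference model, and the function name and inputs here are illustrative.

```python
def tpm(read_counts, effective_lengths):
    """Convert per-transcript counts to Transcripts Per Million.

    TPM_i = (c_i / l_i) / sum_j(c_j / l_j) * 1e6
    """
    rates = [count / length
             for count, length in zip(read_counts, effective_lengths)]
    total = sum(rates)
    if total == 0:
        # No reads assigned to any transcript.
        return [0.0] * len(rates)
    return [rate / total * 1e6 for rate in rates]
```

Because of the length normalization, three transcripts with counts proportional to their lengths receive equal TPM values, and the values for a sample always sum to one million.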
 

     Salmon v0.11.3

        Options: quant
        --libType A
        --threads 16 --numBootstraps 100
        --seqBias --gcBias
        --dumpEq
        --geneMap gencode.v29.primary_assembly.annotation.gtf

     STAR v2.6.1d

        STAR --genomeDir STARREF --runMode alignReads \
        --twopassMode Basic \
        --outFileNamePrefix SAMPLEID --readFilesCommand zcat \
        --readFilesIn FASTQL1 FASTQL2 \
        --outSAMtype BAM SortedByCoordinate \
        --outFilterType BySJout --outFilterMultimapNmax 20 \
        --outFilterMismatchNmax 999 \
        --outFilterMismatchNoverLmax 0.1 \
        --alignIntronMax 1000000 --alignMatesGapMax 1000000 \
        --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 \
        --chimOutType WithinBAM --chimSegmentMin 15 \
        --chimJunctionOverhangMin 15 --runThreadN 16 \
        --outSAMstrandField intronMotif \
        --outSAMunmapped Within \
        --outSAMattrRGline RGTAGLIST

     featureCounts v1.6.2

        Options:
        -T 2 -p -t exon -g gene_id
        -a gencode.v19.annotation.patched_contigs.gtf
        -s 2

 

Transcriptomics Quality Control Approach

AMP PD Transcriptomics data goes through a series of quality control steps prior to making the data available to researchers. This QC process is motivated by a philosophy that encompasses the following principles:

  • Eliminate samples that are fundamentally unusable
    • An example of an unusable sample is one that has contamination (sample partially matches with two unrelated participants).
       
  • Annotate samples that are difficult to use
    • An example of a sample that is difficult to use is one that has a low number of reads, is an outlier in PCA, or has moderate contamination.
       
  • Publish and make available metrics about all samples
    • Metrics are available in GCS and BigQuery
       

Following these principles, AMP PD transcriptomics data goes through a series of quality control checks, some of which will result in samples and all derived data being withheld from the published dataset (with potential of being made available in a future release). Other checks will result in annotations being provided in a table for researchers.
 

Also under consideration is adoption of a method similar to the ENCODE project: Red = a critical issue was identified in the data, Orange = a moderate issue was identified in the data, Yellow = a mild issue was identified in the data.
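
The eliminate/annotate/publish principles and the color scheme under consideration can be sketched as a small triage function. This is an illustration only: which finding maps to which color is an assumption made here, not a published AMP PD rule, and the finding names are hypothetical.

```python
from enum import Enum


class Severity(Enum):
    RED = "critical"
    ORANGE = "moderate"
    YELLOW = "mild"


# Hypothetical mapping from QC finding to severity (an assumption).
FINDING_SEVERITY = {
    "contamination": Severity.RED,
    "moderate_contamination": Severity.ORANGE,
    "pca_outlier": Severity.ORANGE,
    "low_read_count": Severity.YELLOW,
}


def triage(findings):
    """Return (action, flags) for a sample's list of QC findings.

    Any critical finding withholds the sample; any other finding
    releases it with annotations; no findings releases it as-is.
    """
    severities = [FINDING_SEVERITY[finding] for finding in findings]
    flags = sorted({severity.name for severity in severities})
    if Severity.RED in severities:
        return "withhold", flags
    if severities:
        return "annotate", flags
    return "release", flags
```

The design keeps every flag even for withheld samples, in line with the principle that metrics about all samples are published.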

 

RNASeq Proof of Concept

As part of the Transcriptomics quality control process, a pilot program was designed to test and validate sequencing methods.

Step 1: Pilot Design & Sample Collection

Samples were obtained from Indiana University, Tel Aviv, and BioRep. Whole blood was collected in PAXgene tubes, and RNA was isolated using the PAXgene Blood miRNA Kit (total RNA isolation) and DNase treated.

Step 2: Pilot Objectives

Test potential variability across sites, varying RNA Integrity Number (RIN) (average RIN in PPMI is 7.2), sample preparation methods, and read depth.

Step 3: Pilot Test Setup

Tested kits that would provide the greatest transcript diversity, including transcripts without poly(A) tails (circRNA, lncRNA, splicing patterns, splicing junctions, etc.). All kits tested included globin depletion, rRNA depletion, and stranded library preparation: 1) NEB/Kapa; 2) Swift; and 3) TruSeq.

Step 4: Pilot Conclusions

NEB/Kapa provided high transcript diversity, high correlations, and worked well with HAIB automation. At 100 M read pairs, new gene detection reached a plateau. Use of UMI showed a 28% duplication rate.

Transcriptomics Quality Control Process

Quality control checks were performed for 8,670 RNASeq samples for AMP PD. The subsections below describe at a high level what checks were executed as part of the RNASeq QC process.

Concordance Checks

Validation that RNA samples are correctly associated with participants

  • Sex Check: a sex check is used during sample processing to ensure that, at a coarse level, samples have been properly identified. Ultimately, the sex check is superseded by the SNP check against genomic data.
    Key Checks:
    (1) Sex is determined for all RNA samples for comparison against clinically reported sex
    (2) Does the RNA sample expression level for sex-linked genes match the clinically reported sex for the individual?
     
  • All RNA samples genotype comparison to WGS samples
    Key Checks:
    (1) Does the RNA sample match the WGS sample for that individual?
    (2) Does the RNA sample match a WGS sample for a different individual?
     
  • All RNA samples genotype comparison against all RNA samples
    Key Checks:
    (1) Does the RNA sample match other samples for the same individual?
    (2) Does the RNA sample match samples for other individuals?
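
A coarse sex check of the kind described above might look like the following sketch. The marker genes (XIST as a female-expressed gene, RPS4Y1 as a Y-linked gene) and the 1.0 TPM threshold are assumptions chosen for illustration. The source does not state which sex-linked genes or cutoffs AMP PD used.

```python
def infer_sex(expression, female_gene="XIST", male_gene="RPS4Y1",
              min_tpm=1.0):
    """Infer sex from two marker genes; return 'F', 'M', or None.

    `expression` maps gene symbol -> TPM. The marker genes and the
    threshold are illustrative assumptions, not AMP PD parameters.
    """
    female_on = expression.get(female_gene, 0.0) >= min_tpm
    male_on = expression.get(male_gene, 0.0) >= min_tpm
    if female_on and not male_on:
        return "F"
    if male_on and not female_on:
        return "M"
    return None  # Both or neither marker expressed: ambiguous.


def sex_check(expression, reported_sex):
    """Compare inferred sex against the clinically reported sex."""
    inferred = infer_sex(expression)
    if inferred is None:
        return "ambiguous"
    return "pass" if inferred == reported_sex else "fail"
```

An ambiguous result is kept distinct from a failure, since a sample expressing both markers may point to contamination rather than a simple sample swap.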
     

Mismatched Samples Check

RNA SNP checks are matched against genomic SNPs. AMP PD has whole genome sequencing (WGS) data for most participants. We compared the WGS genotypes for a set of SNPs against the genotypes observed in the transcriptomic data.

  • Samples that fail to match WGS for the same participant_id were removed from AMP PD for later evaluation
     
  • If an RNA sample has no associated WGS sample and passes the sex check, but does not match other RNA samples for the individual, then the sample was removed from AMP PD for later evaluation
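
The genotype comparison behind these checks can be sketched as a concordance score: the fraction of shared SNP sites where the RNA-derived genotype equals the WGS genotype. The data layout (site ID mapped to a genotype string) and the function names are assumptions, and the source does not give the SNP set or the pass/fail threshold that AMP PD applied.

```python
def concordance(rna_calls, wgs_calls):
    """Fraction of shared sites with identical genotypes.

    Both arguments map a site ID to a genotype string such as '0/1'.
    Returns None when the two call sets share no sites.
    """
    shared = [site for site in rna_calls if site in wgs_calls]
    if not shared:
        return None
    matches = sum(1 for site in shared
                  if rna_calls[site] == wgs_calls[site])
    return matches / len(shared)


def best_match(rna_calls, wgs_by_participant):
    """Return the participant ID whose WGS calls best match the RNA."""
    scores = {}
    for participant_id, wgs_calls in wgs_by_participant.items():
        score = concordance(rna_calls, wgs_calls)
        if score is not None:
            scores[participant_id] = score
    if not scores:
        return None
    return max(scores, key=scores.get)
```

A sample whose best match is a participant_id other than its own, or which scores highly against more than one participant, would be a candidate for the mismatch and duplicate checks described in this section.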

Duplicate Samples Check

 

  • Matched against own WGS data and matched against other WGS data: Sample is genetically identical to its own WGS sample as well as to a WGS sample with a different participant_id
     
  • Matched against own RNA data and matched against other RNA data: Sample is genetically identical to expected RNA sample and to RNA sample(s) with a different participant_id

 

Post-Alignment Quality Check

  • Quant Based PCA - PCA of Salmon output to identify outliers for potential re-analysis, resequencing, or even re-prepping from new samples
     
  • Count Based PCA
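
A PCA-based outlier screen can be sketched in pure Python as follows. This is a simplified illustration, not the AMP PD procedure: it projects samples onto the first principal component only (found by power iteration) and flags samples by a modified z-score using the median absolute deviation. The 3.5 threshold and the function names are assumptions.

```python
import math
from statistics import median


def first_pc_scores(matrix, iterations=200):
    """Project samples (rows) onto the first principal component.

    Uses power iteration on X^T X of the column-centered data. The
    uniform starting vector fails only in the unlikely case that it
    is orthogonal to the first component.
    """
    n_samples, n_features = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_samples
             for j in range(n_features)]
    centered = [[row[j] - means[j] for j in range(n_features)]
                for row in matrix]
    vec = [1.0 / math.sqrt(n_features)] * n_features
    for _ in range(iterations):
        scores = [sum(x * v for x, v in zip(row, vec)) for row in centered]
        new_vec = [sum(centered[i][j] * scores[i] for i in range(n_samples))
                   for j in range(n_features)]
        norm = math.sqrt(sum(v * v for v in new_vec))
        if norm == 0.0:
            break
        vec = [v / norm for v in new_vec]
    return [sum(x * v for x, v in zip(row, vec)) for row in centered]


def flag_outliers(sample_ids, scores, threshold=3.5):
    """Flag samples whose modified z-score exceeds the threshold."""
    center = median(scores)
    mad = median(abs(score - center) for score in scores)
    if mad == 0:
        return []
    return [sample_id for sample_id, score in zip(sample_ids, scores)
            if 0.6745 * abs(score - center) / mad > threshold]
```

In practice the input rows would be per-sample expression vectors (for example log-scaled TPMs or counts), and more than one component would usually be inspected.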

 

 

Extracellular Vesicle Pilot Study Data

AMP PD already includes whole blood transcriptomic data, which raises a further question: “what information do the cells release?” This question led to the Extracellular Vesicle (EV) exRNA pilot. Extracellular RNA is of interest because cells release a great deal of information when exposed to stressors or other environmental factors. In this pilot study, samples from the BioFIND cohort were used to conduct bulk RNA and small RNA processing experiments.

Bulk EV exRNA Experiment

Bulk exRNA experiments were conducted on matching plasma and CSF samples from 185 participants. Sequencing data was processed with two different annotations: VG29 and VGLN. VG29 uses GENCODE version 29 and is harmonized with the existing whole blood data. VGLN uses GENCODE version 29 plus the LNCipedia version 5.2 annotation, which adds further long non-coding RNAs.

RNA Isolation


Kits that isolated all extracellular RNA were utilized for this experiment – total extracellular RNA preparation. Samples were then processed in the same way as the existing AMP PD whole blood transcriptomics samples.

CSF RNA Isolation
For each CSF subject sample, 1mL of CSF was thawed on ice. RNA was then isolated with Qiagen’s miRNeasy Serum/Plasma Kit (Qiagen, Cat. No. 217184) using a modified protocol to include an on-column DNase treatment (Qiagen, Cat. No. 79256).
 
For each CSF pool, 5mL of CSF provided in 200uL aliquots was thawed on ice, then pooled together, gently mixed, and re-aliquoted into five 1mL aliquots. RNA was then isolated from each 1mL CSF aliquot with Qiagen’s miRNeasy Serum/Plasma Kit (Qiagen, Cat. No. 217184) using a modified protocol to include an on-column DNase treatment (Qiagen, Cat. No. 79256). Isolated RNA across all five aliquots was then pooled together, gently mixed, and re-aliquoted evenly across 5 microcentrifuge tubes. 

Plasma RNA Isolation
For each of 18 plasma subject samples, 1mL of plasma was thawed at room temperature, then immediately moved to ice. RNA was then isolated with Qiagen’s miRNeasy Serum/Plasma Kit (Qiagen, Cat. No. 217184) using a modified protocol to include an on-column DNase treatment (Qiagen, Cat. No. 79256).
 
For each of 169 plasma subject samples, 1mL of plasma was thawed at room temperature, then immediately moved to ice. RNA was then isolated with Norgen Biotek’s Plasma/Serum Circulating and Exosomal RNA Purification Mini Kit - Slurry Format (Norgen Biotek, Cat. No. 51000) with DNase treatment (Norgen Biotek, Cat. No. 25720).
 
For each plasma pool, 5mL of plasma provided in 200uL aliquots was thawed at room temperature then immediately moved to ice. All aliquots were pooled together, gently mixed, and re-aliquoted into five 1mL aliquots. RNA was then isolated from each 1mL plasma aliquot with Norgen Biotek’s Plasma/Serum Circulating and Exosomal RNA Purification Mini Kit - Slurry Format (Norgen Biotek, Cat. No. 51000) with DNase treatment (Norgen Biotek, Cat. No. 25720). Isolated RNA across all five aliquots was then pooled together, gently mixed, and re-aliquoted evenly across 5 microcentrifuge tubes.  

Whole Transcriptome Library Preparation

CSF RNA Whole Transcriptome Library Preparation & Sequencing
For each CSF RNA sample, a uniquely dual-indexed, Illumina-compatible, double-stranded cDNA whole transcriptome library was synthesized from all total RNA in 1mL of CSF with Takara Bio’s SMART-Seq Stranded Kit Components (Takara Bio, Cat. No. 634447) and SMARTer RNA Unique Dual Index Kit (Takara Bio, Cat. No. 634452 & 634457). A replicate of each CSF pool was prepared in parallel with every batch of subject CSF. Briefly, this library preparation included RNA priming (72 °C for 3 min), a 5-cycle Indexing PCR, ribosomal cDNA depletion, and a 16-cycle enrichment PCR. 

Each library was measured for size with Agilent’s High Sensitivity D1000 ScreenTape and buffer (Agilent, Cat. No. 5067-5584 & 5067-5603) and concentration with KAPA SYBR FAST Universal qPCR Kit (Roche, Cat. No. KK4824). Two libraries did not produce quantifiable libraries and were not sequenced. Libraries were then combined into an equimolar pool which was also measured for size and concentration. The pool was then normalized to 2 nM with a 1% v/v PhiX Control v3 spike-in (Illumina, Cat. No. FC-110-3001), denatured and further diluted, loaded into a NovaSeq 6000 flow cell cartridge (Illumina, Cat. No. 20028313), and sequenced at 101 x 9 x 9 x 101 cycles with standard workflow and a final flow cell concentration of 400 pM. Libraries were sequenced to at least 50 M read pairs (or 100 M paired-end reads).
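
The pooling and normalization arithmetic in this step can be sketched as below. The sketch relies on two facts: 1 nM equals 1 fmol/µL, and a dilution follows C1·V1 = C2·V2. The function names and the example figures are illustrative and are not taken from the protocol.

```python
def equimolar_volumes(concentrations_nm, fmol_per_library):
    """Volume (µL) of each library that contributes the same amount.

    Since 1 nM == 1 fmol/µL, volume = fmol / concentration_nM.
    """
    return {library: fmol_per_library / concentration
            for library, concentration in concentrations_nm.items()}


def dilution_volumes(stock_nm, target_nm, final_volume_ul):
    """Return (stock µL, diluent µL) using C1 * V1 = C2 * V2."""
    if target_nm > stock_nm:
        raise ValueError("target concentration exceeds stock")
    stock_ul = target_nm * final_volume_ul / stock_nm
    return stock_ul, final_volume_ul - stock_ul
```

For example, a pool measured at 8 nM would need 25 µL of pool plus 75 µL of diluent to make 100 µL at 2 nM.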

Plasma RNA Whole Transcriptome Library Preparation & Sequencing
For each plasma RNA sample, a uniquely dual-indexed, Illumina-compatible, double-stranded cDNA whole transcriptome library was synthesized from up to 2 ng of total plasma RNA with Takara Bio’s SMART-Seq Stranded Kit Components (Takara Bio, Cat. No. 634447) and SMARTer RNA Unique Dual Index Kit (Takara Bio, Cat. No. 634452 & 634457). A replicate of each plasma pool was prepared in parallel with subject samples. Briefly, this library preparation included RNA fragmentation (85 °C for 2 min), a 5-cycle Indexing PCR, ribosomal cDNA depletion, and a 16-cycle enrichment PCR.

Each library was measured for size with Agilent’s High Sensitivity D1000 ScreenTape and buffer (Agilent, Cat. No. 5067-5584 & 5067-5603). 1 µL of each library was combined into a non-equimolar pool which was then measured for size via TapeStation and concentration via Roche’s KAPA SYBR FAST Universal qPCR Kit (Roche, Cat. No. KK4824), diluted to 70 pM, then loaded into an iSeq flowcell cartridge (Illumina, Cat. No. 20031371) with a 1% v/v PhiX Control v3 spike-in (Illumina, Cat. No. FC-110-3001), and sequenced at 101 x 8 x 8 x 101 cycles.

Passing filter cluster counts per library were generated from this data and used to make a re-balanced pool, which was subsequently measured for size and concentration, diluted to 2 nM with a 1% v/v PhiX Control v3 spike-in, denatured and further diluted, loaded into a NovaSeq 6000 flow cell cartridge (Illumina, Cat. No. 20028313), and sequenced at 101 x 9 x 9 x 101 cycles with standard workflow and a final flow cell concentration of 400 pM. Libraries were sequenced to at least 50 M read pairs (or 100 M paired-end reads).
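
One plausible way to turn passing filter cluster counts into a re-balanced pool is sketched below. The source does not give the formula that was used, so this is an assumption: because equal volumes of each library went into the first pool, a library's cluster count is taken as proportional to its yield per µL, and the new volume is made inversely proportional to that count.

```python
def rebalance_volumes(pf_clusters, total_volume_ul):
    """Volumes (µL) for a re-balanced pool.

    `pf_clusters` maps library -> passing filter cluster count from
    the equal-volume test pool. Each new volume is proportional to
    1 / count, scaled so that the volumes sum to `total_volume_ul`.
    """
    if any(count <= 0 for count in pf_clusters.values()):
        raise ValueError("every library needs a positive cluster count")
    weights = {library: 1.0 / count
               for library, count in pf_clusters.items()}
    scale = total_volume_ul / sum(weights.values())
    return {library: weight * scale
            for library, weight in weights.items()}
```

With this weighting, a library that produced half as many clusters as another in the test run receives twice the volume in the re-balanced pool.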

Small EV exRNA Experiment

Small exRNA experiments were conducted on the same plasma biosamples used in the bulk exRNA experiment, of which 180 passed QC and have been released.

RNA Isolation

For each of the plasma subject samples, 1mL of plasma was thawed at room temperature, then immediately moved to ice. RNA was then isolated with Norgen Biotek’s Plasma/Serum Circulating and Exosomal RNA Purification Mini Kit - Slurry Format (Norgen Biotek, Cat. No. 51000) with DNase treatment (Norgen Biotek, Cat. No. 25720).

Small RNA Library Preparation and Sequencing

Plasma RNA samples were prepared for library generation with QIAGEN’s RNeasy MinElute Clean-up kit (74204), as follows: 5 ng of total RNA were treated with buffer RLT, and 100% ethanol. The sample was passed through a MinElute column, washed, dried, and RNA was eluted with ultra-pure water. Immediately after the clean-up, RNA was concentrated using a speed-vacuum centrifuge, and used for library preparation with Perkin Elmer’s NEXTFLEX Small RNA-Seq v3 and UDI barcodes (NOVA-5132-06). RNA was denatured at 70°C, and underwent 3’ adapter ligation for 2 hours at 25°C. NEXTFLEX Cleanup Beads were then used to remove excess free adapter, and the 5’ adapter was ligated for 1 hour at 20°C. cDNA was generated from the 3’ and 5’ ligated RNA, followed by a second bead clean-up, and finally PCR-amplified using UDI primers, for 18 cycles.
Libraries were size-selected and cleaned up using PAGE and the DNA Clean and Concentrate kit (Zymo, D4014). Briefly, samples were separated on 6% polyacrylamide (PA) gels, the band of interest was excised, and the gel piece was crushed and incubated in water overnight with constant agitation. DNA binding buffer was added to precipitate the DNA, which was then applied to a column, washed, and eluted in ultra-pure water. Library size and concentration were determined via Agilent 2100 Bioanalyzer, using the High Sensitivity DNA kit.
Library pools were denatured with NaOH, and clustered onto flow cells at 14 pM, with 5% PhiX spike-in, using cBot instruments. Sequencing was carried out on Illumina’s HiSeq 2500 using TruSeq v3 reagents.