Transcriptomics Data

AMP PD provides RNA FASTQ files and workflow products from Salmon, STAR, and featureCounts for the BioFIND, PDBP, and PPMI cohorts. All RNA sequencing was performed by HudsonAlpha at 150 nt read lengths, and the data are supplied along with corresponding clinical data.

Processed RNA-Seq Totals

Cohort    Baseline  Month 0.5  Month 06  Month 12  Month 18  Month 24  Month 36  Totals
BioFIND          0        208         0         0         0         0         0     208
PDBP         1,427          0       561       598       493       459         0   3,538
PPMI         1,512          0       853       873         0       831       541   4,610
Totals       2,939        208     1,414     1,471       493     1,290       541   8,356

Library Preparation and Protocol Details

All RNA samples were normalized to 30 ng/µl. Depending on the available material, the amount of RNA used as input to the rRNA and globin reduction step ranged from 684 to 752 ng. All products were used following the manufacturer's directions except where noted. All samples underwent rRNA and globin reduction via the Illumina Globin-Zero Gold kit (catalog number GZG1224). Following RNA reduction, stranded libraries were prepared by first strand synthesis and second strand synthesis using the New England Biolabs (NEB) Ultra II First Strand Module (catalog number E7771L) followed by the NEB Ultra II Directional Second Strand Module (catalog number E7550L).

Following second strand synthesis, the double-stranded cDNA was converted to a sequencing library by standard, ligation-based library preparation. The following NEB modules were used, in order:

  1. NEB End Repair Module (catalog number E6050L)
  2. NEB A-tailing Module (catalog number E6053L)
  3. NEB Quick Ligation Module (catalog number E6056L)

Each of the modules was scaled back 1:2 from the recommended enzyme concentrations using the appropriate buffers. Following ligation to standard Illumina paired-end adaptors, each library was amplified for 12 cycles of PCR using Roche Kapa HiFi polymerase (catalog number KK2612). Each forward and reverse PCR primer included an 8 nt unique index sequence. After PCR, the insert sizes were evaluated on the Perkin-Elmer Caliper GX.

Libraries were quantitated using the Roche Kapa SYBR FAST Universal kit (catalog number KR0389), diluted to 2 nM final stocks, pooled in equimolar amounts, and sequenced on the Illumina NovaSeq 6000 platform to generate 100 M read pairs per sample at 150 nt read lengths. Samples were demultiplexed into individual sample FASTQ files based on the unique i5 and i7 indexes.
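Demultiplexing by dual index can be pictured as a lookup from the (i7, i5) index pair to a sample. The sketch below is illustrative only: the sample sheet, index sequences, and sample names are made up, and the header parsing assumes the common Illumina FASTQ header layout in which the index pair is the last colon-separated field, written as i7+i5. Production demultiplexers also tolerate index mismatches, which is omitted here.

```python
# Hypothetical sample sheet: (i7, i5) 8 nt index pair -> sample name.
SAMPLE_SHEET = {
    ("ATCACGTT", "AGGCTATA"): "sample_A",
    ("CGATGTTT", "GCCTCTAT"): "sample_B",
}


def index_pair_from_header(header):
    """Extract (i7, i5) from a header such as
    '@inst:1:FC:1:1:1:1 1:N:0:ATCACGTT+AGGCTATA'."""
    index_field = header.strip().rsplit(":", 1)[-1]
    i7, i5 = index_field.split("+")
    return i7, i5


def assign_sample(header, sample_sheet=SAMPLE_SHEET):
    """Return the sample for a read, or 'undetermined' if the pair is unknown."""
    return sample_sheet.get(index_pair_from_header(header), "undetermined")
```

Requiring both indexes to match, rather than one, is what lets a read with an unexpected index combination fall into the undetermined bin instead of being assigned to the wrong sample.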
 

Workflows

  1. Salmon is a fast and accurate method for quantifying transcript abundance from RNA-seq reads. Salmon uses new algorithms to provide accurate expression estimates quickly while using little memory. Salmon performs its inference using an expressive and realistic model of RNA-seq data that takes into account experimental attributes and biases commonly observed in real RNA-seq data.
  2. STAR (Spliced Transcripts Alignment to a Reference) aligns high-throughput long and short RNA-seq data to a reference genome using uncompressed suffix arrays. STAR is standalone software capable of aligning reads in a continuous streaming mode. It is able to detect canonical junctions, non-canonical splices, and chimeric transcripts, and to map full-length RNA sequences.
  3. featureCounts is a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments. The program is developed for counting reads to genomic features such as genes, exons, promoters and genomic bins.

Transcriptomics Processing & Sequencing Strategy

  1. Develop a comprehensive RNA resource from whole blood samples that can be easily accessed and utilized by researchers
  2. Choose methods to comprehensively profile the samples for researchers to interrogate genes, pathways, and mechanisms that play a role in disease
  3. Approach and sequencing strategy should enable scientific inquiries and data analysis into the future with broad applicability
  4. Organize the data in a way that will be accessible to a wide range of investigators - from investigators that have the ability to download and analyze the raw files to researchers that do not have significant bioinformatics capabilities
     

Data Analysis & Processing

Transcript abundance was estimated using two pipelines. First, Transcripts Per Million (TPM) values were generated with the Salmon pipeline directly from FASTQ files against GENCODE v29. Second, counts per gene were generated with featureCounts by counting reads from the STAR-generated BAM files, which were aligned to build 38 (B38) of the human genome.
 

     Salmon v0.11.3

        Options: quant
        --libType A
        --threads 16 --numBootstraps 100
        --seqBias --gcBias
        --dumpEq
        --geneMap gencode.v29.primary_assembly.annotation.gtf
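For orientation, the TPM unit can be written out explicitly. The sketch below is not Salmon's inference procedure (Salmon estimates read assignments and effective lengths with its own statistical model); it only shows how TPM follows once per-transcript read counts and effective lengths are known. The transcript names and numbers are made up.

```python
def tpm(est_reads, eff_lengths):
    """Convert estimated read counts to Transcripts Per Million.

    rate_i = reads_i / effective_length_i
    TPM_i  = rate_i / sum_j(rate_j) * 1e6
    """
    rates = {t: est_reads[t] / eff_lengths[t] for t in est_reads}
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in rates}
    return {t: rate / total * 1e6 for t, rate in rates.items()}


# Made-up example: equal read counts, but one transcript is four times longer.
example = tpm(
    {"tx_short": 100, "tx_long": 100},
    {"tx_short": 500.0, "tx_long": 2000.0},
)
```

Because the read count is divided by length before normalizing, the shorter transcript receives the larger TPM even though both transcripts have the same number of reads, and the values always sum to one million within a sample.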

     STAR v2.6.1d

        STAR --genomeDir STARREF --runMode alignReads \
        --twopassMode Basic \
        --outFileNamePrefix SAMPLEID --readFilesCommand zcat \
        --readFilesIn FASTQL1 FASTQL2 \
        --outSAMtype BAM SortedByCoordinate \
        --outFilterType BySJout --outFilterMultimapNmax 20 \
        --outFilterMismatchNmax 999 \
        --outFilterMismatchNoverLmax 0.1 \
        --alignIntronMax 1000000 --alignMatesGapMax 1000000 \
        --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 \
        --chimOutType WithinBAM --chimSegmentMin 15 \
        --chimJunctionOverhangMin 15 --runThreadN 16 \
        --outSAMstrandField intronMotif \
        --outSAMunmapped Within \
        --outSAMattrRGline RGTAGLIST

     featureCounts v1.6.2

        Options:
        -T 2 -p -t exon -g gene_id
        -a gencode.v19.annotation.patched_contigs.gtf
        -s 2
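Downstream, the per-gene counts can be read from the featureCounts output table. The sketch below assumes the standard tab-delimited layout (a leading # comment line, then a header with Geneid, Chr, Start, End, Strand, Length, followed by one count column per BAM). The gene IDs, BAM name, and counts in the example are made up.

```python
import csv
import io


def read_gene_counts(handle):
    """Return {gene_id: [count per BAM column]} from a featureCounts table."""
    # Skip the leading comment line(s) that record the program and command.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.reader(rows, delimiter="\t")
    header = next(reader)
    # Count columns start right after the fixed 'Length' annotation column.
    first_count_col = header.index("Length") + 1
    counts = {}
    for row in reader:
        if row:
            counts[row[0]] = [int(v) for v in row[first_count_col:]]
    return counts


# Made-up two-gene table for a single BAM file.
EXAMPLE = (
    "# Program:featureCounts v1.6.2; Command:...\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tSAMPLEID.bam\n"
    "GENE_A\tchr1\t100\t200\t+\t101\t42\n"
    "GENE_B\tchr2\t300\t500\t-\t201\t0\n"
)
gene_counts = read_gene_counts(io.StringIO(EXAMPLE))
```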

 

Transcriptomics Quality Control Approach

AMP PD Transcriptomics data goes through a series of quality control steps prior to making the data available to researchers. This QC process is motivated by a philosophy that encompasses the following principles:

  • Eliminate samples that are fundamentally unusable
    • An example of an unusable sample is one that has contamination (sample partially matches with two unrelated participants).
       
  • Annotate samples that are difficult to use
    • An example of a sample that is difficult to use is one that has a low number of reads, is an outlier in PCA, or shows moderate contamination.
       
  • Publish and make available metrics about all samples
    • Metrics are available in GCS and BigQuery
       

Following these principles, AMP PD transcriptomics data goes through a series of quality control checks, some of which will result in samples and all derived data being withheld from the published dataset (with potential of being made available in a future release). Other checks will result in annotations being provided in a table for researchers.
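The split between withholding and annotating can be sketched as a small triage function. Everything numeric here is a placeholder: this document does not state thresholds, so MIN_READS, the contamination cut-offs, and the field names are hypothetical and chosen only to make the example run.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the actual AMP PD cut-offs are not given here.
MIN_READS = 20_000_000
SEVERE_CONTAMINATION = 0.10
MODERATE_CONTAMINATION = 0.02


@dataclass
class SampleQC:
    sample_id: str
    read_pairs: int
    contamination: float  # estimated fraction of foreign material
    pca_outlier: bool


def triage(qc):
    """Return ('withhold' or 'publish', list of annotations)."""
    # Fundamentally unusable samples are withheld from the release.
    if qc.contamination >= SEVERE_CONTAMINATION:
        return "withhold", ["contamination"]
    # Difficult-to-use samples are published with annotations.
    notes = []
    if qc.read_pairs < MIN_READS:
        notes.append("low_read_count")
    if qc.pca_outlier:
        notes.append("pca_outlier")
    if qc.contamination >= MODERATE_CONTAMINATION:
        notes.append("moderate_contamination")
    return "publish", notes
```

For example, triage(SampleQC("s2", 5_000_000, 0.05, True)) keeps the sample in the release but attaches all three annotations, whereas a heavily contaminated sample is withheld regardless of its other metrics.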
 

Also under consideration is adoption of a flagging scheme similar to the one used by the ENCODE project: Red = a critical issue was identified in the data, Orange = a moderate issue was identified in the data, Yellow = a mild issue was identified in the data.
 

 

RNASeq

 Proof of Concept

As part of the Transcriptomics quality control process, a pilot program was designed to test and validate sequencing methods.

RNA-Pilot Step 1: Pilot Design & Sample Collection

Samples were obtained from Indiana University, Tel Aviv, and BioRep. Whole blood was collected in PAXgene tubes, RNA was isolated using the PAXgene Blood miRNA Kit (total RNA isolation), and the RNA was DNase treated.

RNA-Pilot Step 2: Pilot Objectives

Test potential variability across sites, varying RNA Integrity Number (RIN) (average RIN in PPMI is 7.2), sample preparation methods, and read depth.

RNA-Pilot Step 3: Pilot Test Setup

Tested kits that would provide the greatest transcript diversity, including transcripts without poly(A) tails (circRNA, lncRNA, splicing patterns, splicing junctions, etc.). All kits tested provided globin depletion, rRNA depletion, and stranded libraries: 1) NEB/Kapa; 2) Swift; and 3) TruSeq.

RNA-Pilot Step 4: Pilot Conclusions

NEB/Kapa provided high transcript diversity, high correlations, and worked well with HAIB automation. At 100 M read pairs, new gene detection reached a plateau. Use of UMI showed a 28% duplication rate.

Transcriptomics Quality Control Process

Quality control checks were performed for ~8,670 RNASeq samples for AMP PD. The subsections below describe at a high level what checks were executed as part of the RNASeq QC process.

Concordance Checks

Validation that RNA samples are correctly associated with participants

  • Sex Check: a sex check was used during sample processing to ensure that, at a coarse level, samples were properly identified. In the end, the sex check is superseded by the SNP check against genomic data.
    Key Checks:
    (1) Sex was determined for all RNA samples for comparison against clinically reported sex
    (2) Does the RNA sample expression level for sex-linked genes match the clinically reported sex for the individual?
     
  • All RNA samples genotype comparison to WGS samples
    Key Checks:
    (1) Does the RNA sample match the WGS sample for that individual?
    (2) Does the RNA sample match a WGS sample for a different individual?
     
  • All RNA samples genotype comparison against all RNA samples
    Key Checks:
    (1) Does the RNA sample match other samples for the same individual?
    (2) Does the RNA sample match samples for other individuals?
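The sex check in the first bullet above can be sketched as a comparison of sex-linked gene expression. This document does not name the genes used, so the markers below (XIST, which is expressed from the inactive X chromosome in females, and the Y-linked genes RPS4Y1 and DDX3Y) are an assumption, as is the log2 margin. The input is assumed to be a mapping from gene symbol to TPM.

```python
import math

# Assumed marker genes (not named in this document).
FEMALE_MARKERS = ("XIST",)
MALE_MARKERS = ("RPS4Y1", "DDX3Y")


def infer_sex(tpm_by_gene, margin=2.0):
    """Infer sex from the log2 ratio of male to female marker expression.

    `margin` is a hypothetical log2 threshold; ratios inside +/- margin
    are reported as 'ambiguous' rather than forced into a call.
    """
    male = sum(tpm_by_gene.get(g, 0.0) for g in MALE_MARKERS) / len(MALE_MARKERS)
    female = sum(tpm_by_gene.get(g, 0.0) for g in FEMALE_MARKERS) / len(FEMALE_MARKERS)
    # A pseudocount of 1 keeps the ratio finite when a marker is unexpressed.
    log_ratio = math.log2((male + 1.0) / (female + 1.0))
    if log_ratio >= margin:
        return "male"
    if log_ratio <= -margin:
        return "female"
    return "ambiguous"


def sex_check(tpm_by_gene, reported_sex):
    """True when expression-inferred sex agrees with the clinically reported sex."""
    return infer_sex(tpm_by_gene) == reported_sex.lower()
```

Returning an explicit "ambiguous" call is a design choice for the sketch: a sample with weak signal from both marker sets is better routed to the SNP-based checks than counted as a pass or a fail.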
     

Mismatched Samples Check

RNA SNP checks match against genomic SNPs: AMP PD has whole genome sequencing (WGS) data for most participants. For a set of SNPs, we compared the WGS genotypes against the genotypes observed in the transcriptomic data.

  • Samples that fail to match WGS for the same participant_id were removed from AMP PD for later evaluation
     
  • If an RNA sample has no associated WGS sample and passes the sex check, but does not match other RNA samples for the individual, then the sample was removed from AMP PD for later evaluation
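One simple way to express such a match is the fraction of shared SNP sites at which two samples carry the same genotype. This is an illustrative sketch, not the AMP PD implementation: the genotype encoding (alternate-allele count 0, 1, or 2), the min_sites guard, the 0.9 threshold, and the SNP names are all assumptions.

```python
def concordance(rna_gt, wgs_gt, min_sites=3):
    """Fraction of shared SNP sites with identical genotypes.

    Each argument maps a SNP id to an alternate-allele count (0, 1, or 2).
    Returns None when fewer than `min_sites` sites are shared, since a
    fraction over very few sites is not informative.
    """
    shared = [snp for snp in rna_gt if snp in wgs_gt]
    if len(shared) < min_sites:
        return None
    same = sum(1 for snp in shared if rna_gt[snp] == wgs_gt[snp])
    return same / len(shared)


def matches(rna_gt, wgs_gt, threshold=0.9):
    """True when concordance is defined and meets the (hypothetical) threshold."""
    score = concordance(rna_gt, wgs_gt)
    return score is not None and score >= threshold


# Made-up genotypes for one RNA sample and two WGS samples.
rna = {"snp1": 0, "snp2": 1, "snp3": 2, "snp4": 1}
wgs_same = {"snp1": 0, "snp2": 1, "snp3": 2, "snp4": 1, "snp5": 0}
wgs_other = {"snp1": 2, "snp2": 0, "snp3": 2, "snp4": 0}
```

Only sites present in both samples are compared, which matters for RNA-derived genotypes because SNPs in unexpressed regions have no coverage.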

Duplicate Samples Check

  • Matched against own WGS data and matched against other WGS data: Sample is genetically identical to their own WGS sample as well as a WGS sample with a different participant_id
     
  • Matched against own RNA data and matched against other RNA data: Sample is genetically identical to expected RNA sample and to RNA sample(s) with a different participant_id
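The mismatch and duplicate outcomes can be distinguished once a sample has been scored against every participant's reference genotypes. The sketch below assumes those scores are already available as a mapping from participant_id to a concordance value between 0 and 1; the 0.9 threshold and the participant IDs used in the usage note are hypothetical.

```python
def classify_sample(own_id, concordance_by_participant, threshold=0.9):
    """Classify one RNA sample as 'match', 'mismatch', or 'duplicate'.

    `concordance_by_participant` maps participant_id -> concordance score
    of this sample against that participant's reference (WGS or RNA).
    """
    hits = {
        pid
        for pid, score in concordance_by_participant.items()
        if score >= threshold
    }
    others = hits - {own_id}
    if own_id in hits and others:
        return "duplicate"  # identical to its own reference and to another participant
    if own_id in hits:
        return "match"
    return "mismatch"  # fails to match its own participant
```

For instance, classify_sample("P1", {"P1": 0.99, "P2": 0.98}) returns "duplicate", while classify_sample("P1", {"P1": 0.30, "P2": 0.98}) returns "mismatch".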

Post-Alignment Quality Check

  • Quant Based PCA - PCA of Salmon output to identify outliers for potential re-analysis, resequencing, or even re-prepping from new samples
     
  • Count Based PCA
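A minimal sketch of PCA-based outlier screening follows. It is not the AMP PD pipeline: it computes only the first principal component by power iteration in plain Python, and flags samples with a median/MAD-based robust z-score whose cutoff (3.5) is an arbitrary illustrative choice. The input is assumed to be a samples-by-genes matrix of log-scaled Salmon TPMs or featureCounts counts; the numbers in the example are made up.

```python
import math
import statistics


def first_pc_scores(matrix, iterations=200):
    """Project samples (rows) onto the first principal component."""
    n, p = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in matrix]
    # Non-uniform start vector; power iteration would stall only if it were
    # exactly orthogonal to the leading component.
    vec = [1.0 / (j + 1) for j in range(p)]
    for _ in range(iterations):
        # Apply X^T X without forming it: compute X v, then X^T (X v).
        proj = [sum(r[j] * vec[j] for j in range(p)) for r in centered]
        new = [sum(centered[i][j] * proj[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm == 0:
            break
        vec = [v / norm for v in new]
    length = math.sqrt(sum(v * v for v in vec))
    vec = [v / length for v in vec]
    return [sum(r[j] * vec[j] for j in range(p)) for r in centered]


def flag_outliers(sample_ids, scores, cutoff=3.5):
    """Return sample ids whose robust z-score on the PC exceeds `cutoff`."""
    med = statistics.median(scores)
    mad = statistics.median([abs(s - med) for s in scores])
    if mad == 0:
        return []
    return [
        sid
        for sid, s in zip(sample_ids, scores)
        if 0.6745 * abs(s - med) / mad > cutoff
    ]


# Made-up log-expression values: five similar samples and one shifted sample.
ids = ["s1", "s2", "s3", "s4", "s5", "s6"]
rows = [
    [5.0, 5.1, 4.9],
    [5.2, 5.0, 5.1],
    [4.9, 5.2, 5.0],
    [5.1, 4.9, 5.2],
    [5.0, 5.0, 5.0],
    [9.0, 9.2, 8.8],
]
outliers = flag_outliers(ids, first_pc_scores(rows))
```

The median and MAD are used instead of the mean and standard deviation because a single extreme sample inflates the standard deviation enough to hide itself in small batches.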