Transcriptomics Data
AMP PD provides RNA FASTQ files and workflow products from Salmon, STAR, and featureCounts for the BioFIND, PDBP, and PPMI cohorts. All RNA sequencing was performed by HudsonAlpha with 150 base pair reads, and the data are supplied along with corresponding clinical data.
Processed RNA-Seq Totals
Cohort | Baseline | Month 0.5 | Month 06 | Month 12 | Month 18 | Month 24 | Month 36 | Totals |
---|---|---|---|---|---|---|---|---|
BioFIND | 0 | 208 | 0 | 0 | 0 | 0 | 0 | 208 |
PDBP | 1,466 | 0 | 574 | 614 | 506 | 471 | 0 | 3,631 |
PPMI | 1,522 | 0 | 853 | 873 | 0 | 833 | 541 | 4,622 |
Totals | 2,988 | 208 | 1,427 | 1,487 | 506 | 1,304 | 541 | 8,461 |
Library Preparation and Protocol Details
All RNA samples were normalized to 30 ng/µL. Depending on the available material, the input amount of RNA used in the rRNA and globin reduction step ranged from 684 to 752 ng. All products were used following the manufacturer's directions except where noted. All samples underwent rRNA and globin reduction via the Illumina Globin-Zero Gold kit (catalog number GZG1224). Following RNA reduction, stranded libraries were prepared by first-strand and second-strand synthesis using the New England Biolabs (NEB) Ultra II First Strand Module (catalog number E7771L) followed by the NEB Ultra II Directional Second Strand Module (catalog number E7550L).
Following second strand synthesis, the double-stranded cDNA was converted to a sequencing library by standard, ligation-based library preparation. The following NEB modules were used, in order:
- NEB End Repair Module (catalog number E6050L)
- NEB A-tailing Module (catalog number E6053L)
- NEB Quick Ligation Module (catalog number E6056L)
Each of the modules was scaled back 1:2 from the recommended enzyme concentrations using the appropriate buffers. Following ligation to standard Illumina paired-end adaptors, each library was amplified with 12 cycles of PCR using Roche Kapa HiFi polymerase (catalog number KK2612). Each forward and reverse PCR primer included an 8 nt unique index sequence. After PCR, the insert sizes were evaluated on the PerkinElmer Caliper GX.
Libraries were quantitated using the Roche Kapa SYBR FAST Universal kit (catalog number KR0389), diluted to 2 nM final stocks, pooled in equimolar amounts, and sequenced on the Illumina NovaSeq 6000 platform to generate 100M paired reads per sample at 150 nt read lengths. Samples were demultiplexed into individual sample FASTQ files based on the unique i5 and i7 indexes.
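As a rough back-of-the-envelope check on the figures above, the sketch below estimates the per-sample sequencing yield and the stock volume needed for a 2 nM dilution. It assumes that "100M paired reads" means 100 million read pairs (200 million individual reads); if it means 100 million reads in total, halve the yield. The function names, the 10 nM stock concentration, and the 50 µL final volume are hypothetical values chosen for illustration, not taken from the protocol.

```python
def bases_per_sample(read_pairs: int, read_length_nt: int) -> int:
    """Total sequenced bases for one paired-end library."""
    reads_per_pair = 2  # paired-end: one forward and one reverse read
    return read_pairs * reads_per_pair * read_length_nt


def stock_volume_ul(stock_nm: float, final_nm: float, final_volume_ul: float) -> float:
    """Volume of stock library needed for a dilution, from C1 * V1 = C2 * V2."""
    if stock_nm < final_nm:
        raise ValueError("stock must be at least as concentrated as the target")
    return final_nm * final_volume_ul / stock_nm


# Assumption: "100M paired reads" = 100 million read pairs at 150 nt each.
total_bases = bases_per_sample(100_000_000, 150)
gigabases = total_bases / 1e9

# Hypothetical example: dilute a 10 nM library to 2 nM in a 50 uL final volume.
volume_needed = stock_volume_ul(stock_nm=10.0, final_nm=2.0, final_volume_ul=50.0)
```

Under these assumptions the yield works out to about 30 gigabases per sample, and the hypothetical dilution needs 10 µL of stock topped up to 50 µL.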
Workflows
- Salmon is a fast, accurate method for quantifying transcript abundance from RNA-seq reads. Its algorithms provide expression estimates quickly while using little memory. Salmon performs its inference using an expressive and realistic model of RNA-seq data that accounts for experimental attributes and biases commonly observed in real RNA-seq data.
- STAR (Spliced Transcripts Alignment to a Reference) aligns high-throughput long and short RNA-seq data to a reference genome using uncompressed suffix arrays. STAR is standalone software capable of aligning reads in a continuous streaming mode. It can detect canonical junctions, non-canonical splices, and chimeric transcripts, and can map full-length RNA sequences.
- featureCounts is a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments. It was developed to count reads mapped to genomic features such as genes, exons, promoters, and genomic bins.
Transcriptomics Processing & Sequencing Strategy
- Develop a comprehensive RNA resource from whole blood samples that can be easily accessed and utilized by researchers
- Choose methods to comprehensively profile the samples for researchers to interrogate genes, pathways, and mechanisms that play a role in disease
- Approach and sequencing strategy should enable scientific inquiries and data analysis into the future with broad applicability
- Organize the data so that it is accessible to a wide range of investigators, from those who can download and analyze the raw files to researchers who do not have significant bioinformatics capabilities
Transcriptomics Research Data Dictionary
If you want to download a version of the full AMP PD Transcriptomics Research Data Dictionary, click one of the buttons below for a specific format.
Data Analysis & Processing
Transcript abundance was estimated using two pipelines. First, Transcripts Per Million (TPM) values were generated with the Salmon pipeline directly from FASTQ files against GENCODE v29. Second, counts per gene were generated by counting aligned reads from STAR-generated BAMs against build 38 (GRCh38) of the human genome.
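For readers unfamiliar with the TPM unit, the following minimal sketch shows the standard TPM normalization applied to per-transcript counts and effective lengths. It is not Salmon's inference procedure: Salmon estimates the read counts and effective lengths with its own statistical model, and this sketch only illustrates the final normalization. The function name and the example numbers are hypothetical.

```python
def tpm(counts, effective_lengths):
    """Convert per-transcript read counts to Transcripts Per Million.

    TPM_i = (c_i / l_i) / sum_j(c_j / l_j) * 1e6
    """
    if len(counts) != len(effective_lengths):
        raise ValueError("counts and effective_lengths must have the same length")
    # Read rate per base of effective transcript length.
    rates = [c / l for c, l in zip(counts, effective_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Rescale so that the values sum to one million.
    return [r / total * 1e6 for r in rates]


# Hypothetical example: three transcripts with counts and effective lengths.
example = tpm([100, 200, 300], [1000.0, 1000.0, 3000.0])
```

Because the values always sum to one million within a sample, TPM is a relative measure: the third transcript above has the most reads but, being three times longer, ends up with the same TPM as the first.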
Tool | Version | Command / Options |
---|---|---|
Salmon | v0.11.3 | quant |
STAR | v2.6.1d | STAR --genomeDir STARREF --runMode alignReads |
featureCounts | v1.6.2 | |
Transcriptomics Quality Control Approach
AMP PD Transcriptomics data goes through a series of quality control steps prior to making the data available to researchers. This QC process is motivated by a philosophy that encompasses the following principles:
- Eliminate samples that are fundamentally unusable
- An example of an unusable sample is one that has contamination (sample partially matches with two unrelated participants).
- Annotate samples that are difficult to use
- An example of a sample that is difficult to use is one that has a low number of reads, is an outlier by PCA, or has moderate contamination.
- Publish and make available metrics about all samples
- Metrics are available in GCS and BigQuery
Following these principles, AMP PD transcriptomics data goes through a series of quality control checks. Some checks result in samples and all derived data being withheld from the published dataset (with the potential to be made available in a future release). Other checks result in annotations being provided in a table for researchers.
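The withhold-or-annotate logic described above can be sketched as a small triage function. This is only an illustration of the principle: the metric names, the thresholds, and the decision labels are all hypothetical and are not AMP PD's actual QC criteria.

```python
from dataclasses import dataclass


@dataclass
class SampleMetrics:
    sample_id: str
    total_reads: int
    contamination: float  # hypothetical estimated contamination fraction, 0..1
    pca_outlier: bool


# Hypothetical thresholds, for illustration only; not AMP PD's actual cutoffs.
MIN_READS = 20_000_000
SEVERE_CONTAMINATION = 0.10
MODERATE_CONTAMINATION = 0.02


def triage(m: SampleMetrics):
    """Return (decision, annotations) for one sample."""
    # Fundamentally unusable samples are withheld from the release.
    if m.contamination >= SEVERE_CONTAMINATION:
        return "withhold", ["contamination"]
    # Difficult-to-use samples are published with annotations.
    notes = []
    if m.total_reads < MIN_READS:
        notes.append("low_read_count")
    if m.pca_outlier:
        notes.append("pca_outlier")
    if m.contamination >= MODERATE_CONTAMINATION:
        notes.append("moderate_contamination")
    return ("annotate" if notes else "publish"), notes


# Hypothetical samples.
clean = triage(SampleMetrics("S1", 95_000_000, 0.0, False))
flagged = triage(SampleMetrics("S2", 8_000_000, 0.03, True))
withheld = triage(SampleMetrics("S3", 90_000_000, 0.25, False))
```

In keeping with the third principle, the metrics themselves would be recorded for every sample regardless of the decision.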
Also under consideration is adoption of a method similar to the ENCODE project: Red = a critical issue was identified in the data, Orange = a moderate issue was identified in the data, Yellow = a mild issue was identified in the data.
As part of the Transcriptomics quality control process, a pilot program was designed to test and validate sequencing methods.
Transcriptomics Quality Control Process
Quality control checks were performed for 8,670 RNASeq samples for AMP PD. The subsections below describe at a high level what checks were executed as part of the RNASeq QC process.
Post-Alignment Quality Check
- Quant Based PCA - PCA of Salmon output to identify outliers for potential re-analysis, resequencing, or even re-prepping from new samples
- Count Based PCA
- Query and reshape the featureCounts BigQuery data into a count matrix (rows = genes, columns = sample IDs, values = counts). This count matrix, together with the metadata table, was used to create a DESeqDataSet (dds) object using DESeq2
- Source: http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
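The reshaping step in the count-based PCA can be sketched as follows. This standard-library Python example only shows how long-form rows (one row per gene and sample) become a genes-by-samples count matrix; the DESeq2 step itself is done in R and is not shown. The row layout, the identifiers, and the choice to fill missing gene/sample pairs with 0 are assumptions made for illustration.

```python
from collections import defaultdict


def to_count_matrix(rows):
    """Pivot (gene_id, sample_id, count) rows into a genes x samples matrix.

    Returns (genes, samples, matrix), where matrix[i][j] is the count for
    genes[i] in samples[j].
    """
    cells = defaultdict(dict)
    samples = set()
    for gene_id, sample_id, count in rows:
        cells[gene_id][sample_id] = count
        samples.add(sample_id)
    genes = sorted(cells)
    sample_order = sorted(samples)
    # Assumption: a gene/sample pair missing from the query result counts as 0.
    matrix = [[cells[g].get(s, 0) for s in sample_order] for g in genes]
    return genes, sample_order, matrix


# Hypothetical long-form rows, as they might come back from a query.
long_rows = [
    ("ENSG_A", "S1", 10),
    ("ENSG_A", "S2", 0),
    ("ENSG_B", "S1", 5),
    ("ENSG_B", "S2", 7),
]
genes, samples, matrix = to_count_matrix(long_rows)
```

The column order of the resulting matrix must match the row order of the sample metadata table before the two are combined into a DESeqDataSet.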
Extracellular Vesicle Pilot Study Data
AMP PD already includes whole blood transcriptomic data, so a natural follow-up question is “what information do the cells release?” This question led to the Extracellular Vesicle (EV) exRNA pilot. Extracellular RNA is of interest because cells release a great deal of information when exposed to stressors or other environmental factors. In this pilot study, samples from the BioFIND cohort were used to conduct bulk RNA and small RNA processing experiments.