RNA-seq: Introduction, Applications and Protocols
The identification and quantification of gene expression levels have revealed the complexity of many biological mechanisms within the cell. Much of what is known about the cell transcriptome has come from RNA sequencing (RNA-seq), a method that detects and quantifies RNA molecules using high-throughput sequencing. RNA-seq also achieves higher coverage and resolution than Sanger sequencing and microarray-based methodologies. Nevertheless, a strong experimental design is crucial for a successful RNA-seq study.
RNA-seq analysis was initially applied to identify differential gene expression profiles by quantifying gene expression levels between treatment and control samples. This technique provides insights into genes associated with distinct phenotypes in response to environmental conditions. RNA-seq can also be used to identify novel transcripts and alternative splicing events. Notably, disrupted splice sites may alter mRNA and protein products, leading to disease susceptibility in humans.
More recently, RNA-seq analysis has also been used to quantify non-coding transcripts. This group includes two main classes: short non-coding RNAs and long non-coding RNAs. Both play important roles in signaling and in the regulation of gene transcription. Furthermore, RNA-seq analysis allows the detection of allele-specific expression (ASE), which measures the expression of each allele at a given genetic variant. This is useful for studying biological phenomena such as genomic imprinting, X chromosome inactivation, RNA editing, random monoallelic expression, and nonsense-mediated decay.
Protocols for RNA-seq analysis can be adapted to the RNA population investigated in each study, including total RNA, pre-mRNA, mRNA, and non-coding RNA (ncRNA). However, some data processing steps, such as quality control (QC), read mapping, and post-processing, are common to all of these applications and to other NGS approaches. Regardless of the RNA-seq protocol used, the experimental design of the study must consider the number of replicates, randomization, the desired statistical power, and whether the samples capture enough of the variability under study.
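As an illustration of the randomization step, the sketch below assigns samples to processing batches so that every condition is spread across batches instead of being confounded with one of them. The idea of a "batch" (for example a library preparation day or a sequencing lane), the function name, and the sample names are assumptions made for this example; the article itself does not prescribe a specific scheme.

```python
import random


def assign_batches(samples_by_condition, n_batches, seed=0):
    """Randomly spread the samples of each condition across batches.

    samples_by_condition: dict mapping a condition name to a list of sample IDs.
    Returns a list of batches, each a list of (sample_id, condition) tuples.
    """
    rng = random.Random(seed)  # fixed seed so the layout can be reproduced
    batches = [[] for _ in range(n_batches)]
    offset = 0
    for condition in sorted(samples_by_condition):
        samples = list(samples_by_condition[condition])
        rng.shuffle(samples)  # random order within each condition
        for i, sample in enumerate(samples):
            # Round-robin placement keeps conditions balanced across batches.
            batches[(i + offset) % n_batches].append((sample, condition))
        offset += len(samples)
    return batches


# Hypothetical design: 4 treated and 4 control replicates, 2 batches.
design = {
    "treated": ["T1", "T2", "T3", "T4"],
    "control": ["C1", "C2", "C3", "C4"],
}
layout = assign_batches(design, n_batches=2, seed=42)
for index, batch in enumerate(layout):
    print(index, batch)
```

With a balanced design like this one, each batch receives the same number of treated and control samples, so a batch effect cannot be mistaken for a treatment effect.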
Quality control (QC) analysis must be performed at each step of the protocol to monitor data quality. After raw data are generated by the sequencer, QC involves the detection of sequencing errors, PCR artifacts, and contamination. It also evaluates GC content, adapter content, and overrepresented k-mers. FastQC, fastp, and NGSQC are standard bioinformatics tools for computing these QC metrics. Once low-quality bases and/or adapter sequences have been identified, software such as FASTX-Toolkit or Trimmomatic can be used to remove them.
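To make two of these metrics concrete, here is a minimal sketch that computes per-read GC content and mean base quality from FASTQ records. It is not how FastQC or fastp are implemented; it assumes four-line FASTQ records with Phred+33 quality encoding, and the function names, the quality threshold of 20, and the example reads are all invented for illustration.

```python
from io import StringIO


def fastq_records(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def gc_fraction(seq):
    """Fraction of G or C bases in the read."""
    if not seq:
        return 0.0
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def summarize(handle, min_mean_q=20):
    """Count reads, average GC content, and reads below a mean-quality cutoff."""
    n_reads = 0
    n_low_quality = 0
    gc_total = 0.0
    for _name, seq, qual in fastq_records(handle):
        n_reads += 1
        gc_total += gc_fraction(seq)
        if mean_phred(qual) < min_mean_q:
            n_low_quality += 1
    return {
        "reads": n_reads,
        "mean_gc": gc_total / n_reads if n_reads else 0.0,
        "low_quality_reads": n_low_quality,
    }


# Two made-up reads: one high quality ('I' = Q40), one low quality ('#' = Q2).
EXAMPLE_FASTQ = "@read1\nGGCCAATT\n+\nIIIIIIII\n@read2\nATATATAT\n+\n########\n"
print(summarize(StringIO(EXAMPLE_FASTQ)))
```

In practice a dedicated tool is preferable, since it also reports per-position quality, adapter content, and overrepresented sequences.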
Read alignment and data processing
Just like whole-genome sequencing (WGS), RNA-seq analysis can be performed either with or without a reference genome for the organism. However, reference-based assembly is preferable to de novo approaches, since it generally produces higher-quality transcript reconstruction. In reference-based alignment, either the genome or the transcriptome can be used as the reference, and the better choice depends on the application. For example, if the objective is to identify novel transcripts or aberrant splice regions, genome-based alignment is preferable, since a transcriptome reference would limit the analysis to the set of known transcripts. Unlike DNA aligners, tools that map RNA reads to a genome must be able to recognize splice junctions. The most popular tools are STAR, TopHat, TopHat2, and HISAT. Quality control after mapping should consider discarding multi-mapping and duplicated reads, which can be done with samtools and Picard, respectively.
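The post-mapping filter and the notion of a splice-aware alignment can be sketched directly on SAM records. The example below drops unmapped, secondary, supplementary, and duplicate-flagged alignments using the standard SAM flag bits, applies a mapping-quality cutoff, and recognizes spliced reads by an `N` operation in the CIGAR string. The cutoff of 10 and the example records are assumptions; the MAPQ value that indicates a uniquely mapped read differs between aligners, so the threshold should be checked against the aligner's documentation. A comparable filter is normally applied with `samtools view` through its `-F` and `-q` options.

```python
import re

# Standard SAM flag bits.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800
DROP_FLAGS = FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_DUPLICATE | FLAG_SUPPLEMENTARY


def parse_sam_line(line):
    """Extract the fields needed for filtering from one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),
        "mapq": int(fields[4]),
        "cigar": fields[5],
    }


def keep_alignment(rec, min_mapq=10):
    """True if no unwanted flag bit is set and MAPQ passes the cutoff."""
    return (rec["flag"] & DROP_FLAGS) == 0 and rec["mapq"] >= min_mapq


def is_spliced(cigar):
    """True if the CIGAR contains an 'N' operation (skipped reference region)."""
    operations = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return any(op == "N" for _length, op in operations)


# Made-up alignments for illustration only.
EXAMPLE_SAM = [
    "r1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*",         # unspliced, kept
    "r2\t0\tchr1\t200\t255\t20M500N30M\t*\t0\t0\t*\t*",  # spans an intron, kept
    "r3\t256\tchr1\t300\t1\t50M\t*\t0\t0\t*\t*",         # secondary alignment
    "r4\t1024\tchr1\t400\t255\t50M\t*\t0\t0\t*\t*",      # flagged as duplicate
    "r5\t0\tchr2\t500\t3\t50M\t*\t0\t0\t*\t*",           # low mapping quality
]

for line in EXAMPLE_SAM:
    rec = parse_sam_line(line)
    print(rec["qname"], keep_alignment(rec), is_spliced(rec["cigar"]))
```

Note that the duplicate bit is only present after a duplicate-marking step, such as the one Picard provides, has been run on the alignments.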
Once RNA-seq data have been preprocessed, different types of analyses can be applied, including transcript quantification, differential gene expression (DGE), alternative splicing, non-coding RNA profiling, and allele-specific expression. Transcript quantification and differential gene expression analyses are both based on counting the reads mapped to each transcript or gene. Both approaches use a gene transfer format (GTF) file as the annotation that provides the coordinates of exons and genes. HTSeq-count and featureCounts are among the software tools that perform read counting against a GTF file. To compare expression levels, raw read counts must then be normalized using measures such as RPKM (reads per kilobase of exon model per million mapped reads), FPKM (fragments per kilobase of exon model per million mapped reads), or the more often recommended TPM (transcripts per million).
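The difference between RPKM and TPM is easiest to see in code. The sketch below computes both from a table of raw counts and gene lengths; the gene names, counts, and lengths are invented, and gene length is taken here as the total exon length in base pairs. FPKM follows the same formula as RPKM with fragments (read pairs) counted instead of individual reads.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of exon model per million mapped reads.

    counts: dict gene -> raw read count; lengths_bp: dict gene -> length in bp.
    """
    total_reads = sum(counts.values())
    if total_reads == 0:
        return {gene: 0.0 for gene in counts}
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
    return {
        gene: counts[gene] * 1e9 / (lengths_bp[gene] * total_reads)
        for gene in counts
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    rate_sum = sum(rates.values())
    if rate_sum == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rates[gene] * 1e6 / rate_sum for gene in counts}


# Hypothetical counts and gene lengths for three genes in one sample.
example_counts = {"geneA": 100, "geneB": 300, "geneC": 600}
example_lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

print(rpkm(example_counts, example_lengths))
print(tpm(example_counts, example_lengths))
```

In this example geneA and geneB receive the same normalized value even though geneB has three times as many reads, because geneB is also three times longer. TPM values always sum to one million within a sample, which makes proportions easier to compare between samples than RPKM or FPKM values.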
DGE analysis then tests for differences in expression across conditions. edgeR and DESeq2, both available as R packages, are the most popular methods for estimating DGE. Note that both take raw read counts as input and apply their own normalization internally, rather than working on RPKM, FPKM, or TPM values. Data generated during RNA-seq analysis can also be used to analyze alternative splicing by measuring isoform or exon expression levels. CuffDiff2, rSeqDiff, DEXSeq, and DSGSeq are some of the tools used for this analysis. Finally, to detect allele-specific expression events, variant calling and filtration steps must be performed before allele counting. This application is very similar to whole-exome sequencing (WES) analysis, with an additional statistical estimation step. ASE inference can be achieved through workflows in the GATK software, specifically using the ASEReadCounter tool.
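The article does not say which statistical test follows allele counting. One simple and commonly used option, shown here as an assumption and not as part of GATK, is an exact two-sided binomial test of the reference and alternate read counts at a heterozygous site against the 50:50 ratio expected without allelic imbalance. The function name and the example counts are hypothetical.

```python
from math import comb


def binomial_two_sided_p(ref_count, alt_count, p_ref=0.5):
    """Exact two-sided binomial p-value for allelic imbalance at one site.

    Sums the probabilities of all outcomes that are no more likely than the
    observed reference-allele count under the null proportion p_ref.
    """
    n = ref_count + alt_count
    if n == 0:
        return 1.0  # no reads, no evidence either way
    pmf = [
        comb(n, k) * p_ref ** k * (1 - p_ref) ** (n - k)
        for k in range(n + 1)
    ]
    observed = pmf[ref_count]
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical allele counts at three heterozygous sites: (ref, alt).
sites = {"site1": (5, 5), "site2": (10, 0), "site3": (30, 12)}
for name, (ref, alt) in sites.items():
    print(name, binomial_two_sided_p(ref, alt))
```

This is only a starting point: real ASE analyses usually have to account for reference mapping bias, overdispersion between replicates, and multiple testing across many sites.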