Received date: October 27, 2016; Accepted date: December 15, 2016; Published date: December 17, 2016
Citation: Zhang C, Zhang B, Vincent MS, Zhao S (2016) Bioinformatics Tools for RNA-seq Gene and Isoform Quantification. Next Generat Sequenc & Applic 3:140. doi:10.4172/2469-9853.1000140
Copyright: © 2016 Zhang C, et al. This is an open-access article distributed under the terms of the creative commons attribution license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Visit for more related articles at Journal of Next Generation Sequencing & Applications
In recent years, RNA-seq has emerged as a powerful technology in estimation of gene or transcript expression. âÂ€Â˜Union-exonâÂ€Â™ and transcript based approaches are widely used in gene quantification. The âÂ€Â˜Union-exonâÂ€Â™ based approach is simple, but it does not distinguish between isoforms when multiple alternatively spliced transcripts are expressed from the same gene. Because a gene is expressed in one or more transcript isoforms, the transcript based approach is more biologically meaningful than the âÂ€Â˜union exonâÂ€Â™-based approach. However, estimating the expression of individual isoform is intrinsically challenging because different isoforms of a gene usually have a high proportion of genomic overlap. Recently, a number of tools have been developed for RNA-seq isoform quantification. We review those methods and their main features to serve as guidance for users to choose suitable tools for their specific projects.
Bioinformatics tools; RNA-seq gene; Spliced transcripts; Isoforms; Union-exon; Human transcriptome; Annotation databases
Production of distinct mRNA isoforms from the same gene locus is common phenomena in metazoans. According to gene annotation in GENCODE Release 25, there are around 60,000 annotated genes, including protein coding, lncRNA and processed pseudogene, which produce about 200,000 transcripts in the human transcriptome (http:// www.gencodegenes.org). Of those annotated genes, 20,000 are protein coding genes, which in turn produce about 144,000 transcripts. The current annotation gives an estimate of 7 transcript isoforms per protein coding gene; however, the annotation is far from complete. A recent study suggests that more than 1/3 of tissue dependent transcripts have complex local splicing variations (LSVs), where an exon can be involved in more than two alternative junctions . These LSVs can produce a large number of alternatively spliced isoforms per gene locus, generating a much more diverse human transcriptome than previously estimated.
Alternative splicing is a process by which exons or portions of exons or noncoding regions within a pre-mRNA transcript are differentially joined or skipped, resulting in multiple isoforms being encoded by a single gene. The process of splicing occurs in a large ribonucleoprotein (RNP) machine called the spliceosome, which functions in a dynamic assembly-disassembly cycle involving five small nuclear ribonucleoprotein (snRNP) complexes (U1, U2, U4/U6, and U5).
New insights suggest that constitutive splicing primarily occurs cotranscriptionally in the nucleus, whereas alternative splicing mainly occurs post-transcriptionally [2-4]. Alternative splicing generates a tremendous amount of proteomic diversity in humans and significantly affects various functions in cellular processes, tissue specificity, developmental states, and disease conditions. Errors in alternative splicing can lead to various diseases, including muscle disorders [5,6] and cancers [7-9], emphasizing the need to accurately quantify isoform expression.
RNA sequencing (RNA-seq) is emerging as a new technology in transcriptome profiling. Beyond quantifying gene expression, the data generated by RNA-seq facilitates discovery of novel transcripts, identification of alternatively spliced genes, and detection of allele specific expression [10-12]. Compared with microarray technology, RNA-seq not only overcomes some of the technical limitations including varying probe performance and nonspecific hybridization, but can also detect alternative splicing isoforms and subtle changes of splicing under different physiological conditions .
Furthermore, RNA-seq allows for the detection of novel transcript species in well studied organisms, such as unique transcripts in certain tissues or in rare cell types, and has been instrumental to catalog the diversity of novel transcript species including long non-coding RNA, miRNA, siRNA, and other small RNA classes .
Recently, quite a number of methods have been developed for the inference of gene and isoform abundance. In general, these methods can be divided into two categories: ‘union exon’-based approach and transcript-based approach, as illustrated in Figure 1. The ‘union exon’- based methods, such as FeatureCounts and HTSeq is widely used in RNA-seq gene quantification because of its simplicity [15,16].
Figure 1: Illustration of different gene quantification methods. A) A hypothetical gene, its two isoforms and read coverage profile. Assuming that the sum of mapped reads from all genes is 1 million, and each small and large exon is 1 kb and 2 kb long, respectively. B) â€˜Union-exonâ€™ based approach. After exon flattening, the â€˜union exonsâ€™ are 2 kb, 1 kb and 2 kb long, respectively. The calculated RPKM is 6.4. C) transcriptbased approach. Mapped reads are first assigned to individual isoforms, and the corresponding expressions for the two isoforms are 2 RPKM and 6 RPKM, respectively. In this calculation, the entire gene expression is 8 RPKM.
First, all overlapping exons from a gene are merged into ‘union exons’. Then, per-gene read counts are aggregated by intersecting all mapped reads with ‘union exons’ of the gene. Compared with transcript level quantification, reads can generally be assigned to genes with higher confidence. However, gene-level counts do not distinguish between isoforms when multiple alternatively spliced transcripts are expressed from the same gene. Consequently, some important events such as isoform switches and minor isoform expression changes are masked at the gene level, as shown in Figure 2. For UHRR (Universal Human Reference RNA), both transcripts are expressed, and the expression of the short isoform ENST00000375512 is supported by 48 exon-exon spanning reads between exons #7 and #8 (colored in indigo). However, in human brain sample HBRR_C4, only the long transcript ENST00000356884.6 is expressed.
Figure 2: Minor isoform changes are masked at the gene level. Gene BICD2 (Ensembl ID: ENSG00000185963.9) consists of two very similar isoforms. At the gene level, overall, there is not much difference between sample UHRR_C1 and HBRR_C4, as shown in read coverage profiles. However, the difference is dramatic at the transcript level. For human brain sample HBRR_C4, only the long transcript ENST00000356884.6 is expressed; while in sample UHRR_C1, both isoforms are present. Note in sample UHRR_C1, there are 48 reads that span across the junction site between exons #7 and #8 (colored in indigo), and such reads can only originate from the short transcript ENST00000375512. The stranded RNA-seq dataset were downloaded from Illuminaâ€™s Base Space, consisting of 2 UHRR (Universal Human Reference RNA) and 2 HBRR (Human Brain Reference RNA) samples. Only UHRR_C1 and HBRR_C4 are shown in the figure.
At the gene level, overall, there is not much difference between sample UHRR_C1 and HBRR_C4. As a result, the expression change in minor isoform ENST00000375512 is masked at the gene level. Therefore, ignoring transcript isoforms and counting reads at the gene level does not always give correct answers [17,18].
The side-by-side comparison of ‘union exon’-based approach and transcript-based method revealed that gene expression levels are significantly underestimated by ‘union exon’-based approach in some cases, and the average of RPKM from ‘union exons’-based method is less than 50% of the mean expression obtained from transcript-based approach . Consequently, the ‘union exons’-based approach in gene quantification is discouraged despite of its popularity. Other studies also suggest that tools that quantify expressions at the transcript level give better results than at the gene level .
Multiple mRNA transcripts can be generated from a gene locus by the usage of alternative transcription start site, alternative splicing and alternative polyadenylation. Different isoforms of the gene typically have a high proportion of overlapping exons (Figure 3). The transcriptome landscape is further complicated by the prevalence of gene overlap on the same or opposite strands in DNA .
Figure 3: Diagram of the complexity of gene structures. Multiple isoforms generated by alternative splicing, together with overlapping genes, make accurate quantification at the transcript level difficult. Assigning ambiguous reads in overlapping regions to its true origin is hard. Genes on the plus strand are colored in blue while genes on the minus strand in green.
Therefore, accurate estimation of expression levels of individual isoforms is intrinsically very difficult. In the following sections, we will review methods available for isoform quantification with a focus on recent advancement in this field. We classify the methods into two main categories: tools that consider only known transcripts and those that incorporate novel transcript discovery.
If the transcriptome of a species is annotated, its annotation databases can be leveraged to map and quantify the expression of most common transcripts. For the human genome, RefGene , Ensembl  and the UCSC  annotation database are most frequently used. The choice of gene annotation can have a significant impact on quantification even at the gene level . In addition to annotation databases, it is possible to incorporate novel transcripts discovered from de novo assembly into existing annotations. One of the most popular assemblers is the TRINITY package .
A large number of RNA-seq specific mapping algorithms have been developed to align large numbers of sequence reads to a reference genome and/or transcriptome, including Bowtie [27,28], TopHat2 , STAR , GSNAP , and Map Splice . Most of the tools require SAM/BAM files as inputs, which are generated by the aligners described above. Some new isoform quantification tools have their own built-in mapping algorithms and can take sequencing reads as inputs directly, instead of aligned SAM/BAM files [33-36].
Nearly all isoform quantification algorithms use either maximum likelihood (ML) estimate or Bayesian inference and expectationmaximization (EM) methods to assign ambiguously mapped reads to transcript isoforms, and they differ mainly in EM convergence speed and some other important features.
Table 1 summarizes these tools and their features, including run time, bias correction, multi-threading, stranded protocol support and indel read alignments. The output of these tools usually reports the uncertainty of isoform quantification as well. By appropriately accounting for uncertainty in quantification, more accurate downstream differential analyses were obtained at both the gene and isoform levels. Sleuth  is a downstream analyses tool specifically developed for differential analyses at the transcript level.
|Run time||Algorithm||Bias correction||Stranded protocol||Multi-thread
Note: The run time, algorithm used, support for bias correction, stranded library preparation protocol, multi-thread computing and indelalignments of popular tools are compared. For a sample with 50 million reads running on a typical HPC with multi-thread computing turned on, run time over 8 hours is considered “+++++”, less than 3 hours but more than 1 hour is considered “+++”, less than 30 minutes but more than 10 minutes is considered “++”, and less than 10 minutes is considered “+”
Table 1: Frequency Distribution of Unvaccinated Measles Count with Predictor Variables.
RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-seq data . It estimates the ML of relative abundances of the transcript isoforms and then fractionally assigns reads to the isoforms based on these abundances. The assignments of reads to isoforms come from iterations of EM method. According to the comparative assessment , RSEM is relatively slow. Recent tools, such as eXpress , aim to reduce the computational burden of isoform quantification by substantially altering the EM algorithm.
It improves the convergence speed using an online-EM algorithm that models indels, fragment length, sequencing errors and corrects sequence-specific fragment biases. Tigar2 utilizes Bayesian inference as an alternative to ML estimation and provides better accuracy for longer reads and supports for variable-length reads produced by Ion Torrent PGM sequencer [39,40]. However, Tigar2 is by far the slowest algorithm, taking more than 9 hours to process 100 million reads even with multi-thread processing . BitSeq also uses Bayesian inference and a user can choose either markov chain monte carlo (MCMC) or variational Bayesian (VB) methods for quantification [41,42]. MCMC is generally slower than VB method, but provides better accuracy.
For the isoform quantification, not all information in the alignment is necessary. Simply the knowledge of the transcripts and positions to which a given read maps is sufficient for quantification purpose. In support of such ‘analysis-efficient’ computation, the concept of quasimapping, lightweight-alignment or pseudo alignment has been introduced recently [33-36]. With appropriate optimization, such tools can map and quantify 100 million reads in 10 minutes even on a laptop. Sailfish was initially implemented using a k-mer approach .
It entirely skips read mapping, a time-consuming step, and provides quantification estimates much faster than existing approaches. However, shredding reads into k-mers discards valuable information present in whole reads, and accordingly, results in loss of accuracy, since each k-mer can potentially align to more transcripts than the read itself. To circumvent this issue, Kallisto uses fast hashing of kmers together with the transcriptome de Bruijn graph to produce a list of transcripts that are compatible with each read while avoiding alignment of individual bases .
K-mer based Sailfish has now been deprecated, and the latest Sailfish uses the quasi-mapping procedure provided by RapMap , as opposed to individual k-mer counting, for transcript-level quantification to increase accuracy without sacrificing speed.
Interestingly, the developer of Sailfish introduces Salmon , another novel method and software tool for isoform quantification that exhibits state-of-the-art accuracy while being significantly faster than most other tools. Salmon achieves this through a two-phase inference procedure, a reduced data representation, and the quasi-mapping algorithm of RapMap. During its online phase, in addition to performing streaming inference of transcript abundances, Salmon also constructs a highly-reduced representation of the sequencing experiment.
Specially, Salmon constructs “rich” equivalence classes over all of the sequenced fragments to greatly reduce the time required to perform iterative optimization. Both Salmon and Sailfish allow users to choose between ML estimate and VB inference for isoform quantification. However, there are a few main differences between Sailfish and Salmon. One is that Salmon implements a dual-phase inference algorithm, consisting of both an online and offline phase, while Sailfish uses only an offline algorithm. Salmon also accepts alignment files in SAM/BAM format, making it a flexible tool for isoform quantification, but Sailfish does not. Another difference is that Salmon contains richer models for bias correction, whereas Sailfish does not.
Transcriptome analysis from RNA-seq data typically involves two sub-problems, i.e. identification of the set of isoforms and estimation of the abundance of these isoforms. If no annotation for a species of interest is available or novel transcript discovery is desired, isoform structures have to be constructed from RNA-seq data first . One popular approach is to include novel transcript discovery and quantification in the same package. These tools include Cufflinks , Scripture , IsoLasso , NSMAP , SLIDE , iReckon , Traph , MiTie , and FlipFlop . These tools either assemble the transcriptome ab initio, or use existing annotations to infer new splicing junctions. The discovered novel transcripts can then be used or added to existing annotations for quantification. However, recent reviews indicate that the quantification algorithms provided by these tools are not on par with tools from the previous category [20,53]. This is because identification of isoforms from RNA-seq data is far from being solved and is still challenging, due in particular to the incomplete nature of RNA-seq reads and the fact that the number of potential candidate isoforms is very large, growing almost exponentially with the number of exons. As a result, the performance reported by the state-of-the-art algorithms is often unsatisfactory.
Cufflinks is perhaps one of the most popular tools for novel transcripts discovery and quantification. It assembles transcripts ab initio and merges them with existing annotations and then quantify the transcripts . In a sense, the quantification strategy in Cufflinks is similar to one iteration of the EM algorithm used in RSEM . During assembly, Cufflinks attempts to explain the observed reads with minimum number of isoforms. iReckon  is introduced for simultaneous determination of the isoforms and estimation of their abundances. Their probabilistic approach incorporates multiple biological and technical phenomena, including novel isoforms, intron retention, unspliced pre-mRNA, PCR amplification biases, and multimapped reads. iReckon utilizes regularized EM to accurately estimate the abundances of known and novel isoforms .
The strategy of simultaneously discovering and quantifying transcripts is also adopted by many other state-of-the-art methods (e.g. SLIDE , StringTie , IsoLasso  and CIDANE . Similar to iReckon, SLIDE requires existing annotations and cannot perform de novo assembly. A unique advantage of SLIDE is that it has the flexibility of incorporating other transcriptomic data, such as RACE, CAGE, and EST, to increase isoform discovery accuracy. IsoLasso  is based on the well-known LASSO algorithm, a multivariate regression method designated to seek a balance between the maximization of prediction accuracy and the minimization of interpretation. By including some additional constraints in the quadratic program involved in LASSO, IsoLasso is able to make the set of assembled transcripts as complete as possible StringTie . It is another popular assembler developed by Salzberg group. It uses a network flow algorithm for the simultaneous discovery and quantification of transcripts. Another advantage of StringTie is that it is part of the HISAT2-StringTie-Ballgown workflow and requires less effort to setup the entire RNA-seq data analysis pipeline . CIDANE  is a novel framework for genome-based transcript reconstruction and quantification from RNA-seq reads. Its algorithmic core not only reconstructs transcripts ab initio, but also allows the use of the growing annotation of known splice sites, transcription start and end sites, or full-length transcripts, which are available for most model organisms. Accurately estimating isoforms in multiple samples is an important preliminary step to differential expression analysis at the level of isoforms. One promising direction to improve isoform identification and quantification is to exploit several samples at the same time, such as biological replicates or time course experiments. If some isoforms are shared by several samples, potentially with different abundances, the identifiability issue may vanish and the statistical power of the deconvolution methods may increase due to the availability of more data for estimation. The joint RNA isoform detection and quantification from multiple RNA-seq samples is effective in reducing false positive transcript discoveries [43,57].
In this review, we summarize the tools for gene and transcript isoform quantification and provide guidance for end users to choose the tools with desired features. Each of the tools offers unique features that are suitable for answering specific research questions. RNA-seq is emerging as a powerful approach for identification and quantification of transcript isoforms. However, short read fragments that cover only part of the transcript make it difficult to reconstruct full-length transcripts, especially for those expressed at low levels.