alexa Mining Next Generation Sequencing Data: How to Avoid “Treasure in, Error Out” | Open Access Journals
ISSN: 2153-0602
Journal of Data Mining in Genomics & Proteomics
Make the best use of Scientific Research and information from our 700+ peer reviewed, Open Access Journals that operates with the help of 50,000+ Editorial Board Members and esteemed reviewers and 1000+ Scientific associations in Medical, Clinical, Pharmaceutical, Engineering, Technology and Management Fields.
Meet Inspiring Speakers and Experts at our 3000+ Global Conferenceseries Events with over 600+ Conferences, 1200+ Symposiums and 1200+ Workshops on
Medical, Pharma, Engineering, Science, Technology and Business

Mining Next Generation Sequencing Data: How to Avoid “Treasure in, Error Out”

Zhihua Jiang*

Department of Animal Sciences and Center for Reproductive Biology, Washington State University, USA

*Corresponding Author:
Dr. Zhihua Jiang
Department of Animal Sciences and Center for Reproductive Biology
Washington State University, Pullman, WA 99164-7620, USA
Tel: 509-335-8761
Fax: 509-335-4246
E-mail: jiang[email protected]

Received date: May 29, 2015; Accepted date: May 29, 2015; Published date: June 06, 2015

Citation: Jiang Z (2015) Mining Next Generation Sequencing Data: How to Avoid “Treasure in, Error Out”. J Data Mining Genomics Proteomics 6:e119. doi: 10.4172/2153-0602.1000e119

Copyright: © 2015 Jiang Z. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Visit for more related articles at Journal of Data Mining in Genomics & Proteomics


During the last ten years, next generation sequencing methods, technologies and platforms have revolutionized genomics and transcriptomics research fields and advanced their applications in agriculture and biomedicine [1-3]. To date, the Roche 454 GS FLX(+) system, Applied Biosystems SOLiD (supported oligonucleotide ligation and detection) and Ion Proton/PGM/Chef systems now owned by Life Technologies (Grand Island, NY); Solexa GA (Genome Analyzer)/ HiSeq/MiSeq/NextSeq developed by Illumina (San Diego, CA); and PacBio RSII system made by Pacific Biosciences (Menlo Park, CA) present five major platforms in the market. They utilize different sequencing chemistries (e.g., sequencing by ligation vs. sequencing by synthesis); templates (e.g., single molecules vs. clusters amplified by emulsion or bridge PCR); product sizes (e.g., from 75 bp to 8,500 bp in length) and number of reads per run (e.g., from one million to 5,000 million) [2,3].

Both DNA and RNA sequencing projects have been largely carried out using these platforms, which require construction of relevant libraries. Depending on the project goals, genomic DNA sequencing can be classified into 1) whole genome sequencing/re-sequencing and 2) reduced genome complexity sequencing. However, transcriptome analysis methods can be classified in different ways. When each transcript is considered as a unit, transcriptome sequencing can be grouped into two categories: 1) end to end (5’ – 3’ end) sequencing and 2) either 5’ end or 3’ end sequencing only. Alternatively, transcriptome analysis can be conducted by sequencing specific RNA sub-populations, such as polyA+ RNA sequencing, polyA- RNA sequencing, microRNA sequencing, circular RNA sequencing, isoform sequencing or others [2].

Nevertheless, the original DNA and RNA molecules are manipulated during library preparation. Fragmentation with chemical treatment, enzyme digestion or physical shearing is often used, breaking DNA/RNA molecules into pieces. Enrichment is a process to develop libraries for specific sub-groups or targets/signatures of DNA/RNA for sequencing. Both 5’ and 3’ end adaptors are ligated to fragmented DNA/RNA molecules, forming “5’ adaptor – DNA/RNA inserts – 3’ end adaptor” sandwiches that fit into sequencing platforms. To date, direct RNA sequencing [4] is the only method that does not require PCR in any part of the process, while other methods use PCR in both library preparation and sequencing stages (such as emulsion and bridge amplification). Conversion of RNA molecules into complementary DNA molecules by transcriptase is an essential step in gene expression profiling. Size selection of library products not only helps remove unused adaptors or oligos, but it also enhances sequencing efficiency set up for each sequencing platform. In addition, end repair and tailing are also needed in some methods. Certainly, these manipulations can produce biases, noise or artifacts, depending on how libraries are constructed.


It is well known that RNA-seq data analyses possesses three biases: fragmentation bias, length bias and transcriptome composition bias [5,6]. When RNA molecules are chemically cleaved or physically sheared, fragmentation bias results in fewer reads derived from both 5’ and 3’ ends of transcripts, because the process favors their internal sequences. Short transcripts should have less total reads in libraries as compared to long transcripts, causing a length bias even though long and short transcripts are expressed at similar levels. Transcriptome composition bias results when one or a few transcripts in a given sample are expressed at extremely high levels, thereby downplaying the number of reads collected for other transcripts [6]. Therefore, several methods have been proposed to correct these biases, thus providing unbiased estimates of gene expression through data normalization. As pointed out recently by Finotello and colleagues [6], however, some methods remain error prone.

Genotyping by sequencing leads to biases as well, particularly in number of reads per sample, number of sites sequenced per sample and number of reads per site. For example, Byrne and colleagues [7] reported that coefficients of variation ranged from 27% to 62% for number of reads per sample. The authors found that both barcode (F(31, 223) = 20.93, p,0.001) and barcode length (F(4, 250) = 29.89, p,0.001) had significant effects on read numbers per sample. In addition to variation in number of reads per library, Chen et al. [8] also observed variations in number of sites with reads mapped per library, ranging from 171,472 to 447,051 sites per sample. Based on the Zea mays genome reference, frequency of enzyme ApeKI recognition sites in the reference genome and size selection of 70 – 318 bp in library construction, Beissinger and colleagues [9] predicted that this process should yield reads for a total of 1,406,269 sites. Theoretically, they expected that the number of reads per site would follow a Poisson distribution with mean equal to the average coverage (~40 reads per site). However, in reality, 0 reads were recorded for 1,021,382 sites and the remaining sites had 1 – 95,014 read(s) per site.


End-to-end sequencing of transcripts by RNA-seq is not costeffective for conventional transcriptome analyses. As such, the community has developed more than 15 methods and technologies to carry out whole transcriptome tag/target sequencing [2]. These methods can produce noisy reads from unwanted products generated in library construction. For example, Ma and coworkers [10] compared different methods for target sequencing of PATs (polyA tags). Surprisingly, the authors reported that the percentage of reads mapped to the 3’untranslated regions accounted for 1.8% to 52.8% among 20 libraries. This means that the noisy read rate ranged from 47.2% to 98.2%. These results clearly indicate that noisy data can be overwhelmingly produced at some point during the library preparation process.

The SAGE (serial analysis of gene expression) technique was designed to catch one tag per transcript only [11]. However, this is not true in practice. When we studied the reactomes of porcine alveolar macrophages infected with porcine reproductive and respiratory syndrome virus (PRRSV), we found that noisy tags were abundant [12]. In addition to the 3’ most tag, for example, at least 10 other tags for pig interleukin 1, alpha (IL1A) were present in the library. Generally speaking, the 3’ most tag is expressed most abundantly, but the numbers of tags per million (TPM) for other tags decrease significantly depending on their distances from poly(A) end of the mRNA. Nevertheless, the results indicate that SAGE-based tools generate noisy data for transcriptome analysis. In fact, only the 3’ most cut site tag is the only tag that should be used in analysis. If other tags were wrongly chosen for analysis, expression patterns/trends of a given gene might be dramatically different.


Library preparation often uses ligases, either T4 ligase for DNA ligation or T4 RNA ligase for RNA ligation. As discussed above, ligation adds adaptors/linkers to 5’ end and/or 3’ end of DNA/RNA inserts, allowing them to fit in specific sequencing platforms. In reality, not only 5’ adaptors – DNA/RNA inserts – 3’ adaptors, but other combinations within inserts or between adaptors and inserts can occur randomly, especially when DNA/cDNA molecules are digested with restriction enzymes. Our recent experience shows that even a 120 bp read can contain three digested fragments that are located on different chromosomes. For example, after “unchaining” of the ligated artifacts, 106 million reads were increased up to 132 million reads with ≥ 36 bp in length. As a consequence, the unique mapping rate was also improved from ~45% to ~65%. In my opinion, the community needs to pay more attention to the artifact issues that occur during the preparation of next generation sequencing libraries.


Here I just listed a few examples about biases, noisy data and artifacts due to DNA/RNA manipulation during preparation of the next generation sequencing libraries. Therefore, bioinformaticians should get to know how the libraries are prepared before they can develop programs and software to handle these challenges sufficiently and carefully. In particular, we need to be cautious if we would develop tools to explore polyadenylation events using RNA-seq data, because the 3’ UTR reads are biased and incomplete. We would not collect reliable information if we simply use the tag counts derived from a SAGE library or its derivatives to determine alternative polyadenylation events or proximal to distal/distal to proximal site shifts. Furthermore, ignoring artifact reads is not cost-effect. Next generation sequencing technologies generate high quality reads since none of the platforms use Sanger sequencing. However, data analysis errors will occur if we do not pay attention to biases, noisy data and artifacts.


This work was supported by the Eunice Kennedy Shriver National Institute of Child Health & Human Development of the National Institutes of Health under Award Number R21HD076845 to ZJ. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.


Select your language of interest to view the total content in your interested language
Post your comment

Share This Article

Relevant Topics

Recommended Conferences

  • 9th International Conference on Bioinformatics
    October 23-24, 2017 Paris, France
  • 9th International Conference and Expo on Proteomics
    October 23-25, 2017 Paris, France

Article Usage

  • Total views: 11724
  • [From(publication date):
    July-2015 - Oct 17, 2017]
  • Breakdown by view type
  • HTML page views : 7927
  • PDF downloads :3797

Post your comment

captcha   Reload  Can't read the image? click here to refresh

Peer Reviewed Journals
Make the best use of Scientific Research and information from our 700 + peer reviewed, Open Access Journals
International Conferences 2017-18
Meet Inspiring Speakers and Experts at our 3000+ Global Annual Meetings

Contact Us

Agri, Food, Aqua and Veterinary Science Journals

Dr. Krish

[email protected]

1-702-714-7001 Extn: 9040

Clinical and Biochemistry Journals

Datta A

[email protected]

1-702-714-7001Extn: 9037

Business & Management Journals


[email protected]

1-702-714-7001Extn: 9042

Chemical Engineering and Chemistry Journals

Gabriel Shaw

[email protected]

1-702-714-7001 Extn: 9040

Earth & Environmental Sciences

Katie Wilson

[email protected]

1-702-714-7001Extn: 9042

Engineering Journals

James Franklin

[email protected]

1-702-714-7001Extn: 9042

General Science and Health care Journals

Andrea Jason

[email protected]

1-702-714-7001Extn: 9043

Genetics and Molecular Biology Journals

Anna Melissa

[email protected]

1-702-714-7001 Extn: 9006

Immunology & Microbiology Journals

David Gorantl

[email protected]

1-702-714-7001Extn: 9014

Informatics Journals

Stephanie Skinner

[email protected]

1-702-714-7001Extn: 9039

Material Sciences Journals

Rachle Green

[email protected]

1-702-714-7001Extn: 9039

Mathematics and Physics Journals

Jim Willison

[email protected]

1-702-714-7001 Extn: 9042

Medical Journals

Nimmi Anna

[email protected]

1-702-714-7001 Extn: 9038

Neuroscience & Psychology Journals

Nathan T

[email protected]

1-702-714-7001Extn: 9041

Pharmaceutical Sciences Journals

John Behannon

[email protected]

1-702-714-7001Extn: 9007

Social & Political Science Journals

Steve Harry

[email protected]

1-702-714-7001 Extn: 9042

© 2008-2017 OMICS International - Open Access Publisher. Best viewed in Mozilla Firefox | Google Chrome | Above IE 7.0 version