On the Benefit of Publishing Uncurated Genome Assembly Data | OMICS International
ISSN: 2155-9597
Journal of Bacteriology & Parasitology


On the Benefit of Publishing Uncurated Genome Assembly Data

Ferenc Orosz*

Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, Budapest, Hungary

*Corresponding Author:
Ferenc Orosz
Institute of Enzymology, Research Centre for Natural Sciences
Hungarian Academy of Sciences, Budapest, Hungary
E-mail: [email protected]

Received date: July 03, 2017; Accepted date: August 30, 2017; Published date: September 04, 2017

Citation: Orosz F (2017) On the Benefit of Publishing Uncurated Genome Assembly Data. J Bacteriol Parasitol 8:317. doi: 10.4172/2155-9597.1000317

Copyright: © 2017 Orosz F. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.



Keywords: Genome sequencing; Parasitology; Neisseria; Ribosomal RNA

Short Communication

Uncurated genome assembly data often contain contaminating DNA from foreign organisms, introduced during DNA extraction or sequencing. Sometimes this contamination is not removed before the sequence is deposited in public databases such as GenBank or the European Nucleotide Archive. Consequently, database searches can give misleading results because of these impurities [1,2]. Human DNA, originating from the scientists who extract and sequence the samples, is a common contaminant [3]. Impurities of human origin and other laboratory contaminants such as E. coli and cloning vectors can be effectively eliminated by applying computational filters to the draft sequences [4,5]. However, other contaminations, discussed below, are more difficult to identify. With the spread of next-generation sequencing this has become a common problem, owing to the vast number of reads, which are generally short and of low quality in these projects [6-8].

A further source of contamination can be pathogens present in the samples used for sequencing. Substantial bacterial contamination, likely arising from environmental sources, is routinely found in existing human-derived clinical RNA-seq datasets [9]. Insect and other arthropod sequences were identified during the analysis of plant transcriptomes [10]. The opposite can also happen: a pathogen genome may be contaminated by host DNA. This was the case, for example, when the genome of the bacterium Neisseria gonorrhoeae was found to include sequences of cow and sheep origin [1].

I found [11] that apicortin, a characteristic protein of apicomplexan parasites that is absent from higher animals (Eumetazoa), appeared to be present in an animal genome assembly, that of the northern bobwhite (Colinus virginianus). I therefore decided to investigate this problem systematically: sequences of the apicoplast, an apicomplexan organelle, were used as queries in BLASTN searches against nucleotide sequences of various animal groups, looking for possible contaminations. I found that the draft genomes of the bobwhite [12] and of a bat, Myotis davidii [13], contained at least 6 and 17 contigs, respectively, of apicoplast origin. This is a general method for the fast identification of genomes contaminated by DNA of apicomplexan origin; it needs limited computation and gives practically no false positives, since animals have no apicoplast and any significant hit is therefore a clear indication of contamination. Moreover, by comparing some contaminating sequences with sequences of known apicomplexan parasites, I was able to construct phylogenetic trees showing the phylogenetic position of the putative contaminating species. Although the number of complete apicomplexan genomes is increasing continuously, there are still not enough apicoplast sequences to construct trees from them. I therefore used two characteristic markers that are often used for phylogenetic analysis and are known for many species: 18S ribosomal RNA and the internal transcribed spacer 1 (ITS-1). I suggested that a second member of the genus Nephroisospora exists which, like the first member, Nephroisospora eptesici, is hosted by a bat, in this case Myotis davidii, and proposed the tentative name “Nephroisospora myotisi”. Of course, the naming of the unknown species was not accepted, since the strict rules of parasitology require the isolation and taxonomic description of a species.
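The screening step described above can be illustrated with a minimal sketch. It assumes that the BLASTN search was run with apicoplast sequences as queries and saved in the standard 12-column tabular format (`-outfmt 6`), so that the subject column holds the contig of the animal assembly. The function name, the example identifiers and the thresholds (`max_evalue`, `min_length`) are hypothetical choices for illustration, not values taken from the original study.

```python
# Sketch: flag assembly contigs that receive a significant apicoplast hit.
# Input is BLAST tabular text (-outfmt 6) with the 12 standard columns:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

def flag_contaminated_contigs(blast_tabular, max_evalue=1e-20, min_length=100):
    """Return {contig id: best hit} for hits passing the (assumed) thresholds."""
    best = {}
    for line in blast_tabular.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard tabular record
        query, contig = fields[0], fields[1]
        pident = float(fields[2])
        length = int(fields[3])
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if evalue > max_evalue or length < min_length:
            continue  # hit is too weak or too short to count
        # Keep only the highest-scoring hit for each contig.
        if contig not in best or bitscore > best[contig]["bitscore"]:
            best[contig] = {
                "query": query,
                "pident": pident,
                "length": length,
                "evalue": evalue,
                "bitscore": bitscore,
            }
    return best


# Invented example records (identifiers and numbers are hypothetical).
EXAMPLE = "\n".join([
    "apicoplast_q1\tcontig_17\t92.5\t850\t60\t4\t1\t850\t1200\t2049\t0.0\t1100",
    "apicoplast_q2\tcontig_17\t88.0\t400\t45\t3\t10\t409\t3000\t3399\t1e-90\t420",
    "apicoplast_q1\tcontig_99\t100.0\t25\t0\t0\t5\t29\t10\t34\t0.5\t40.1",
])

if __name__ == "__main__":
    for contig, hit in flag_contaminated_contigs(EXAMPLE).items():
        print(contig, hit["query"], hit["evalue"])
```

Because the host lineage lacks an apicoplast, every contig returned by such a filter is a candidate contaminant; how strict the thresholds should be remains a judgment call for the user.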

However, this idea was taken up and developed significantly by Janus Borner and Thorsten Burmester [14]. They pointed out that “while previous approaches have mostly focused on the removal of contaminating sequences, the identification of parasite-derived contaminations may also enable the discovery of novel parasite taxa and shed light on previously unknown host-parasite associations” [14]. The high accuracy and sensitivity of next-generation sequencing in quantifying genetic material across organismal boundaries offer tremendous potential for pathogen discovery. Previously, the PathSeq program [15] was developed to identify microorganisms by deep sequencing of human tissue; it first subtracts all reads derived from the human host. Of course, this method can be used only when high-quality genome data are available for the host, as is the case for the human genome. The new approach of Borner and Burmester [14] can be applied in a much broader field.
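The idea of subtracting host reads first can be shown with a toy sketch. It is not the PathSeq implementation, which relies on alignment against reference databases; here a simple shared k-mer fraction stands in for alignment, and the function names, `k` and `max_shared_fraction` are assumptions made for illustration only.

```python
# Toy sketch of subtractive filtering: discard reads that look host-derived,
# keep the remainder as candidates for non-host (e.g. pathogen) origin.
# Only the forward strand is compared; reverse complements are ignored here.

def kmers(seq, k):
    """Return the set of all k-mers in a sequence (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def subtract_host_reads(reads, host_reference, k=8, max_shared_fraction=0.5):
    """Keep reads whose fraction of k-mers shared with the host is small."""
    host_kmers = kmers(host_reference, k)
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: cannot be classified, drop it
        shared = len(read_kmers & host_kmers) / len(read_kmers)
        if shared <= max_shared_fraction:
            kept.append(read)
    return kept
```

The sketch also makes the limitation mentioned above visible: the filter is only as good as `host_reference`, so an incomplete host genome lets host reads pass as candidate contaminants.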

In the case of wild animals, infection by parasites cannot be avoided before sequencing. For example, in the cases mentioned above, the kidney of the bat and the muscle of the bobwhite, respectively, contained the parasitic cysts. Undetected contaminations of animal genomes cause misinterpretation of data; however, once recognised, parasite-derived sequences can provide useful information. Thus Borner and Burmester [14] suggested that parasite-derived “impurities” are a rich source of information that can help the discovery and identification of novel parasites. They argued “that uncurated assembly data should routinely be made available in addition to the final assemblies” [14]. They showed that sequences of apicomplexan origin are present in many animal transcriptomes and genomes, which indicates apicomplexan infection of the sequenced host. They extracted these sequences from the datasets with a novel bioinformatic pipeline (ContamFinder) and assigned them to distinct taxa using phylogenetic methods. (The software is freely available for download.) They analysed 920 datasets, of which 51 were contaminated, and recognised more than twenty thousand contigs derived from apicomplexan parasites. The contaminating species belonged to various apicomplexan taxa: Haemosporida, Piroplasmida, Coccidia and Gregarinasina. A characteristic example was that the assembly of the superseded genome of Gorilla gorilla gorilla (western lowland gorilla) contained sequences more than 99.9% identical at the nucleotide level to those of Plasmodium falciparum, including the full mitochondrial genome. For other, less investigated parasite species, for which no or only a few molecular data were previously known, such draft (uncurated) genomes may represent an abundant source of information on the gene repertoire of the parasites.
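To make the notion of nucleotide-level identity concrete, the following minimal sketch computes percent identity between two already aligned sequences. Several conventions exist for the denominator; this one counts every alignment column except those where both sequences have a gap, which is an assumption made here and not necessarily the definition used in [14]. The function name and the gap symbol are likewise illustrative.

```python
# Sketch: percent identity of two pre-aligned sequences of equal length.
# Convention (assumed): a column counts unless both sequences have a gap;
# a gap opposite a nucleotide counts as a difference.

def percent_identity(aligned_a, aligned_b, gap="-"):
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    columns = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap and b == gap:
            continue  # column carries no information, skip it
        columns += 1
        if a == b:
            matches += 1
    if columns == 0:
        raise ValueError("no aligned columns to compare")
    return 100.0 * matches / columns
```

Under this convention, an identity of 99.9% corresponds to a single differing column per 1000 aligned columns.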

These results are of significant importance for apicomplexan research. Sequencing of apicomplexans is strongly biased towards genera of medical or veterinary interest, first of all Plasmodium, then Babesia, Eimeria, Toxoplasma, etc., while for Gregarinasina, which parasitize only invertebrates, far fewer data are available. Analysis of contaminations makes possible the identification, or even the discovery, of new parasite taxa and sheds light on apicomplexan phylogeny. Moreover, their method can be generalized and applied to investigate contaminations by bacteria, viruses and other pathogens. I fully agree with their final conclusion that draft genome assembly data should also be made public.

