| Research Article |
Open Access |
|
| Mining Unique-m Substrings from Genomes |
| Kai Ye1*, Zhenyu Jia2, Yipeng Wang2,3, Paul Flicek4 and Rolf Apweiler5 |
| 1Molecular Epidemiology section, Medical Statistics and Bioinformatics, Leiden University Medical Center, The Netherlands |
| 2Department of Pathology & Laboratory Medicine, University of California, Irvine, CA 92697, USA |
| 3Vaccine Research Institute of San Diego, San Diego, CA 92121, USA |
| 4Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands |
| 5EMBL Outstation, European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK |
| *Corresponding author: |
Kai Ye, Ph.D.,
Molecular Epidemiology section
Medical Statistics and Bioinformatics, Leiden University Medical Center,
The Netherlands,
E-mail : K.Ye@lumc.n |
|
| |
| Citation: Ye K, Jia Z, Wang Y, Flicek P, Apweiler R (2010) Mining Unique-m
Substrings from Genomes. J Proteomics Bioinform 3: 099-100. doi: 10.4172/jpb.1000127 |
| |
| Copyright: © 2010 Ye K, et al. This is an open-access article distributed
under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided
the original author and source are credited. |
| |
| Abstract |
| Unique substrings in genomes may indicate a high level of
specificity, which is crucial to many genetic
studies, such as PCR, microarray hybridization, Southern
and Northern blotting, RNA interference (RNAi), and
genome (re)sequencing. However, being unique in the
genome is not by itself adequate to guarantee high specificity.
For example, nucleotide mismatches within a certain
tolerance may impair specificity even if a substring of
interest occurs only once in the genome. In this study we
propose the concept of unique-m substrings of genomes for
controlling specificity in genome-wide assays. A substring is unique-m if it has a single perfect match on
one strand of the entire genome and every other approximate
match has more than m mismatches. We developed
a pattern growth approach to systematically mine such
unique-m substrings from a given genome. Our algorithm
does not need the pre-processing step for extracting sequential
information that most rival methods require.
The search for unique-m substrings in genomes is
performed as a single data mining task, so that the
similarities among queries are exploited to achieve a tremendous
speedup. The runtime of our algorithm is linear in the
size of the input genome and the length of the unique-m substrings.
In addition, the unique-m mining algorithm has been
parallelized to facilitate genome-wide computation on a cluster
or on a single multi-CPU machine with shared
memory. |
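The unique-m definition above can be made concrete with a small brute-force check. The sketch below is not the pattern growth algorithm described in the abstract. It is a naive scan, quadratic in the worst case, meant only to illustrate the definition. The function names (`reverse_complement`, `hamming`, `is_unique_m`) are hypothetical, and the sketch assumes that mismatches are counted as Hamming distance and that windows on both the forward and the reverse-complement strand count as candidate matches.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def is_unique_m(genome, start, length, m):
    """Naive check of the unique-m property (illustrative assumption,
    not the authors' method).

    The substring genome[start:start+length] is treated as unique-m if
    every other window of the same length, on either strand, differs
    from it by more than m mismatches.
    """
    query = genome[start:start + length]
    if len(query) != length:
        raise ValueError("substring runs past the end of the genome")

    strands = (("+", genome), ("-", reverse_complement(genome)))
    for strand, seq in strands:
        for i in range(len(seq) - length + 1):
            # Skip only the query's own position on the forward strand.
            if strand == "+" and i == start:
                continue
            if hamming(query, seq[i:i + length]) <= m:
                return False
    return True
```

For the toy genome `"AAAAACCCCC"`, the substring `"AACC"` at position 3 occurs exactly once, so `is_unique_m("AAAAACCCCC", 3, 4, 0)` holds, but the neighbouring window `"AAAC"` differs by a single mismatch, so the same substring is not unique-1.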
| |