ISSN: 0974-276X
Journal of Proteomics & Bioinformatics
Like us on:
Make the best use of Scientific Research and information from our 700+ peer reviewed, Open Access Journals that operates with the help of 50,000+ Editorial Board Members and esteemed reviewers and 1000+ Scientific associations in Medical, Clinical, Pharmaceutical, Engineering, Technology and Management Fields.
Meet Inspiring Speakers and Experts at our 3000+ Global Conferenceseries Events with over 600+ Conferences, 1200+ Symposiums and 1200+ Workshops on
Medical, Pharma, Engineering, Science, Technology and Business

Mining Unique-m Substrings from Genomes

Kai Ye1*, Zhenyu Jia2, Yipeng Wang2,3, Paul Flicek4 and Rolf Apweiler5
1Molecular Epidemiology section, Medical Statistics and Bioinformatics, Leiden University Medical Center, The etherlands
2Department of Pathology & Laboratory Medicine, University of California, Irvine, CA 92697, USA
3Vaccine Research Institute of San Diego, San Diego, CA 92121, USA
4Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands
5EMBL Outstation, European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
Corresponding Author : Kai Ye, Ph.D
Molecular Epidemiology section Medical Statistics
and Bioinformatics Leiden University Medical Center The Netherlands
Received February 12, 2010; Accepted March 16, 2010; Published March 16, 2010
Citation: Ye K, Jia Z, Wang Y, Flicek P, Apweiler R (2010) Mining Unique-m Substrings from Genomes. J Proteomics Bioinform 3:099-100. doi:10.4172/jpb.1000127
Copyright: © 2010 Ye K, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Related article at
DownloadPubmed DownloadScholar Google

Visit for more related articles at Journal of Proteomics & Bioinformatics


Unique substrings in genomes may indicate high level of specificity which is crucial and fundamental to many genetics studies, such as PCR, microarray hybridization, Southern and Northern blotting, RNA interference (RNAi), and genome (re)sequencing. However, being unique sequence in the genome alone is not adequate to guaranty high specificity. For example, nucleotides mismatches within a certain tolerance may impair specificity even if an interested substring occur only once in the genome. In this study we propose the concept of unique-m substrings of genomes for controlling specificity in genome-wide assays. A unique-m substring is defined if it only has a single perfect match on one strand of the entire genome while all other approximate matches must have more than m mismatches. We developed a pattern growth approach to systematically mine such unique-m substrings from a given genome. Our algorithm does not need a pre-processing step to extract sequential information which is required by most of other rival methods. The search for unique-m substrings from genomes is performed as a single task of regular data mining so that the similarities among queries are utilized to achieve tremendous speedup. The runtime of our algorithm is linear to the sizes of input genomes and the length of unique-m substrings. In addition, the unique-m mining algorithm has been parallelized to facilitate genome-wide computation on a cluster or a single machine of multiple CPUs with shared memory.

Data mining; Genomes; Mismatch; Sequence
Unique substrings in genomes may indicate high level of specificity which is crucial to many genetics studies, such as PCR, microarray hybridization, Southern and Northern blotting, RNA interference (RNAi), and genome (re)sequencing. However, being unique sequence in the genome alone is not adequate to guaranty high specificity. For example, in microarray studies, a small fragment of DNA probe is used to fish for particular RNA/ DNA molecules in the test sample. This probe will capture those RNA/DNA molecules that contain a perfect reverse-complement counterpart. Albeit with relatively lower hybridization intensity, molecules with a couple of mismatches can also be caught by the probe, which is called cross-hybridization and therefore reduces the probe specificity. We consider a DNA fragment as a poor probe if it has quite a few approximately matched sequences in the genome. Those approximate matches can be very confounding since we are uncertain which RNA/DNA molecules in the sample does the probe represent. The chance of RNA/DNA being fished by the probe drops dramatically as the number of mismatches increases. Thus we may use the number of mismatches as an indicator for the probability of cross-hybridization for a probe if we ignore the impact of the distribution of the mismatches along the probe sequence.
In this study we define a DNA fragment as a unique-m probe if it perfectly matches exactly one spot on one of the two strands of the genome while any of its approximate matches along the genome have more than m mismatches. In this way we have a solid criterion to design appropriate DNA probes for Microarrays. Similarly, we might also use these unique-m substrings to design preferable siRNA to reduce off-target effects. As for genome re-sequencing efforts, we may use the information about unique-m substrings to interpret and evaluate the mapping results of short reads. Those reads that meet with unique-m properties (m > 2) are trusted with higher confidence than others which match other locations of the genome with one or two mismatches. Last but not least, with the exponential growth of sequence databases, there is tremendous need to develop techniques to mine and output unique-m substrings and their locations in genomes in an effective way, e.g., linear in time and space.
The data structures of hash tables and suffix arrays were previously used to find (minimum) unique substrings of the human genome for the whole genome tiling arrays (Bertone et al., 2006; Graf et al., 2007). Bertone et al applied BLAST-like scheme: sequences in the database are preprocessed into a hash table of k-tuples so that the potential locations of a query can be efficiently narrowed down by looking for the k-tuples in the query sequence from the hash table (Altschul et al., 1990; Ning et al., 2001). As for suffix arrays, they can be built linearly in time and space, given sufficient memory. Several algorithms have been proposed to build suffix trees in memory efficiently (Edward 1976; Giegerich and Kurtz, 1997; Ukkonen, 1995; Weiner, 1973), but they do not scale up well due to huge space consumption (about 5-16 times of the input database) when the input databases approach the size of the human genome (Publisi et al., 2007).
To overcome aforementioned obstacles, we provide a pattern growth scheme, a data mining oriented algorithm, which guarantees to find all unique-m substrings of genomes linearly in time and space. Our algorithm traces both perfect and approximate matches with no more than m mismatches on both strands of the genome sequences where the value of m is provided by the user. The divide-and-conquer strategy is readily incorporated into the mining process so that the eventual calculation can be allocated either over multiple CPUs with shared memory or on a cluster.
The inputs of the algorithm are a dataset S which consists of a series of non-empty sequences of the alphebet {A, C, G, T}, a minimum (min_l) as well as a maximum (max_l) length of substrings to search for and the number of mismatches m.
The output of the algorithm consists of all substrings that appear exactly once in one of the two strands of the input sequences and all the other approximate matches must have more than m mismatches. The locations of those unique-m substrings will be given from implementing the algorithm.
The algorithm is as follows. Let a be a substring and Sa be the so-called projected database that contains all sequences that contain the substring a, where the last element of each occurrence of a is marked (the so-called a-locations) in both strands of the sequences and the number of mismatches for each occurrence is also recorded. The computation of Sa’ from Sa requires, for each a-location, the check of whether or not the base on its right side (if any) equals the newly appended item b. In case of equality this gives an a’-location in Sa’ with the same number of mismatches; otherwise one more mismatch will be logged. If the number of mismatches exceeds m, this a’-location is not stored.

unique-m (a, Sa)
      if |Sa| ≥ 1 then
            if min_llength (a) ≤ max_l and |Sa| = 1 then
                  report a and Sa
      if length (a) < max_l     
            for each base b do
                        a’ := a with b appended to it
                        unique-m (a’, Sa’)

The main call is unique-m (Λ, SΛ), where Λ is the empty substring. Note that each base in database SΛ is a Λ–location, including the position before each sequence (with Φ mismatch). This call creates a projected database that marks all occurrences of the first base b, reports all unique-m substrings that begin with this b, and then proceeds to the next base.
To evaluate our unique-m substring mining program, we collected the whole “genome” sequences of various “species” with different sizes. The smallest genome in this study is the genome of the Phi-X 174 bacteriophage while the human genome is the largest, as listed in Table 1.
Both single and multi-threaded tests were performed on the same PC configured with 8 processors of Intel Xeon E5420, 8GB of RAM and 250GB hard drive. In benchmarking studies for single thread, we implemented the program to run in serial mode so that only one core was used.
Implementations of the unique-m substring mining algorithm
The unique-m substring mining algorithm was implemented in C++ and i t has been mult i threaded wi th openMP (, a shared memory parallelism). User needs to input genome sequence, the length of unique-m substrings, the maximum number of mismatches (m), and may also specify the number of threads to operate when executing the program.
Runtime of mining unique substrings is linear to the input size
As mentioned in the Method section, our program first initializes a projected database for a given nucleotide or substring and then recursively extends the substring and mines the associated projected databases. Theoretically, the size of the projected database and the number of recursive mining operations are proportional to the size of the input sequences, which has been versified in our first experiment. We ran our unique-m program to mine unique 25-mers using various “genomes” as test sets. As shown in Figure 1a, if the genome size was smaller than 1 million bases, the runtime was a few seconds only. When the genome size reached 100 million bases (C.elegans), it took about 15 minutes to find all unique 25-mers. As for the human genome which is more than 30 times that of C.elegans’s, the entire computation was completed within 10 hours.
Runtime for mining longer unique substrings
It is apparent that when we look for longer substrings, the number of potential substrings (4n) to examine will increase exponentially. However, as shown in Figure 1b, for our algorithm the runtime is still linear to the length of substrings (from 10 to 100) because the number of survived locations in the genome decreases exponentially in each extension step. The genome of Thermoplasma acidophilum (about 1.5 million bases) was used to search for all unique n-mers for demonstration.
Search for unique-m substrings in genomes
As for the unique-m substrings, they will only match one position of the genome and all the other approximate matches will have more than m mismatches. Apparently, by increasing the m values, we will be able to identify probes that have improved uniqueness in genome-wide assays. However, the increase in m compensates for computational expenses since more variants of a substring and more locations in the genome need to be considered with extended substrings.
As shown in Figure 1c, for any given genome, the runtime of mining unique-m substrings indeed increases exponentially as we increase the value of m. However for any particular m value, the runtime is still linear to the size of input genome. The number of unique-m substrings in a particular genome shrinks with the increase of the m value (Figure 2). Such phenomenon tends to be more profound in larger genomes. For C.elegans, about 5% of the unique 25-mers match another location on the genome with a single mismatch, and more than 30% with no more than 3 mismatches. As for the human genome, 10% of the unique 25-mers will match another location of the genome.
As major CPU vendors are encountering many technological problems in boosting speeds on single CPU, they have shifted gears away from ramping up clock speeds to adding parallelism with multicore processors to offer a better-performing processor. In this study, the unique-m algorithm is parallelized with openMP (, a shared memory parallelism) and runs on machines configured with multi-CPUs. As explained in the Method section, we first initialized a projected database to store the locations of a particular substring in a genome and then recursively extended this substring while updating the projected database. When we initialize the projected database with the 1-mers, the computation is then readily divided into four subroutines for the four nucleobases. We could split the computation further by initializing the projected databases with longer substrings. In this case, the number of subroutines is 4n where n is the length of the substrings.
As shown in Figure 3, the program shows excellent speed up with openMP because each subroutine is completely independent so that there is little synchronization among different threads.
The source code of our unique-m mining can be compiled using Intel C++ compiler to run on a cluster. Alternatively, the calculation can be divided into multiple jobs according to prefixes and then run all of them on a cluster. For example, if the user specifies a prefix “ACG”, our program will only search for the unique-m substrings with this particular prefix.
Pattern growth reduces search space
A straightforward algorithm to find all unique-m k-mers from genomes is to generate and test all possible k-mers. It is not difficult to estimate the number of 25-mers to be examined in such an approach. Since at each position all four nucleotides may occur, the total number of 25-mer candidates would be 425˜ 1015. Moreover, for each 25-mer candidate, the entire genome sequences would have to be scanned to record the locations of matches with up to m mismatches. Thus, a sophisticated method is demanded.
In this study, we present a pattern growth solution to remarkably reduce the number of k-mers and the cost of multiple scanning of the dataset. We grow a given substring by adding elements to its right-hand side. Once the new substring does not appear in both strands of the genome, we stop growing it and its extension will not be tested. For example, suppose a string AGCCGAT does not have any matches in both strands, four possible extensions, AGCCGATA, AGCCGATC, AGCCGATG and AGCCGATT won’t be subjected to any further test. In this way, the number of k-mer candidates examined is significantly reduced while all k-mers that satisfy the input requirement will be reported.
In addition, this pattern growth method significantly reduces the efforts in scanning the entire dataset. If we want to examine the locations of a particular string in the genome, the algorithm does not scan the whole genome but only checks the locations where its prefix is, recursively.
Memory requirement
In the present prototype of unique-m substring mining algorithm,we have not fully optimized its speed and memory usage. For example, the genome is loaded into the memory without compressing the bases into bits. We used unsigned integers in C++ to record the locations of a given substring in the projected databases. We may consider using relative distances to anchor points instead of absolute locations to save space in the projected databases.
In the current setting, our program still utilizes much less space than suffix trees or suffix arrays algorithms in finding unique substrings in genomes. Let us consider the average memory usage if we assume that the numbers of occurrences for four bases are comparable in the genome. We take the size of the input genome as the unit. In this case, we first spent one unit of memory to store the input genome. Then for each of the four bases, we created a projected database to record the locations of this base on both strands. We used 32 bits integers in C++ to index the locations. Thus we need two units of memory to store such projected database. After that, for each extension of the substring, the projected database is about 25% of the previous one. Thus the total memory consumption of mining all unique substrings with any given length n is
which converges to if n approaches infinity.
We may also initialize the projected database with a longer prefix to save memory consumption. For example, if we had used k-mers instead of the 1-mers to initialize the projected database, the memory consumption is
which converges to
In this case, if we use 2-mers to initialize the projected database, we only need 5/3* 3GB = 5GB of memory to find unique substrings of any length from the human genome. Note that 3 GB out of 5 GB is used to store the sequences of the human genome.
On the other hand, if we want to mine unique-m substrings instead of unique ones, the memory consumption is
Approaches to improving oligonucleotide probe specificity
A variety of approaches have been routinely implemented for estimating the specificity of oligonucleotide probes in microarray platform design (Nordberg, 2005; Chen, 2002; Reymond, 2004),including (i) removing probes with more than 75% identity with a non-target sequence (Hughes et al., 2001; Kane et al., 2000), (ii) removing probes that include contiguous stretches (more than 15nt) of identical sequences with a non-target sequence (Kane et al., 2000), (iii) removing probes with low complexity regions, such as long stretches of the same base (i.e. “AAAAAA” or “TTTTTT”). We recently found that there is still a lot of room to improve the probe specificity although great endeavor had been made to do so (Xia, in press). Unique-m substring approach provides a unique way to control for the probe specificity. The number of mismatches (m) should be customized for different microarray studies which vary in probe length, sample species and microarray platform etc.
In this study we propose the concept of unique-m substrings of genomes for controlling specificity in genome-wide assays. A unique-m substring has exactly one perfect match on one of the strands of the genome while all other approximate matches must have more than m mismatches. We developed a pattern growth approach to systematically mine such unique-m substrings from a given genome. The search for unique-m substrings from genomes is performed as one data mining task so that the similarities among queries are utilized to achieve tremendous speedup. The runtime of our algorithm is linear to the sizes of input genomes and the length of unique-m substrings. In addition, the unique-m mining algorithm has been parallelized to facilitate genome-wide computation on a cluster or a single machine of multiple CPUs with shared memory.
This work was financially suppor ted by an NWOBioinformatics Breakthrough Grant (050-71-041), The Netherlands.
  1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403-410.»  CrossRef  »  PubMed  »  Google Scholar
  2. Bertone P, Trifonov V, Rozowsky JS, Schubert F, Emanuelsson O, et al. (2006) Design optimization methods for genomic DNA tiling arrays. Genome Res 16: 271-281.»  CrossRef  »  PubMed  »  Google Scholar
  3. Chen H, Sharp BM (2002) Oliz, a suite of perl scripts that assist in the design of microarrays using 50mer oligonucleotides from the 3’ untranslated region. BMC Bioinformatics 3: 27.»  CrossRef  »  PubMed  »  Google Scholar
  4. Edward MM (1976) A Space-Economical Suffix Tree Construction Algorithm. J ACM 23: 262-272.»  CrossRef   »  Google Scholar
  5. Giegerich R, Kurtz S (1997) From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. Algorithmica. 19: 331-353.»  CrossRef »  Google Scholar
  6. Graf S, Nielsen FG, Kurtz S, Huynen MA, Birney E, et al. (2007) Optimized design and assessment of whole genome tiling arrays. Bioinformatics 23: 195-204.»  CrossRef  »  PubMed  »  Google Scholar
  7. Hughes TR, Mao M, Jones AR, Burchard J, Marton MJ, et al. (2001) Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nat Biotechnol 19: 342-347.»  CrossRef  »  PubMed  »  Google Scholar
  8. Kane MD, Jatkoe TA, Strumpf CR, Lu J, Thomas JD, et al. (2000) Assesment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res 28: 4552-4557.»  CrossRef  »  PubMed  »  Google Scholar
  9. Kurtz S, Schleiermacher C (1999) REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics 15: 426-427.»  CrossRef  »  PubMed  »  Google Scholar
  10. Ning ZM, Cox AJ, Mullikin JC (2001) SSAHA: A fast search method for large DNA databases. Genome Research 11: 1725-1729.»  CrossRef  »  PubMed  »  Google Scholar
  11. Nordberg EK (2005) Yoda: selecting signature oligonucleotides. Bioinformatics 21: 1365-1370.»  CrossRef  »  PubMed  »  Google Scholar
  12. Publisi SJ, Smyth WF, Turpin AH (2007) A taxonomy of suffix array construction algorithms. Acm Computing Surveys 39.»  CrossRef »  Google Scholar
  13. Reymond N, Charles H, Duret L, Calevro F, Beslon G, et al. (2004) Roso: optimizing oligonucleotide probes for microarrays. Bioinformatics 20: 271-273.»  CrossRef  »  PubMed  »  Google Scholar
  14. Ukkonen E (1995) On-line construction of suffix trees. Algorithmica 14: 249- 260.»  CrossRef »  Google Scholar
  15. Weiner P (1973) Linear pattern matching algorithms. In 14th IEEE Symp. on Switching and Automata Theory.
  16. Xia X, Jia Z, Porwollik S, Long F, Hoemme C, et al. (2010) DNA Microarray Probe Design: Evaluating the Significance of Oligonucleotide Probe Properties and Development of a New Factor. Nucleic Acids Res. In press.
Select your language of interest to view the total content in your interested language
Post your comment

Share This Article

Relevant Topics

Recommended Conferences

Article Usage

  • Total views: 11444
  • [From(publication date):
    March-2010 - Oct 25, 2016]
  • Breakdown by view type
  • HTML page views : 7705
  • PDF downloads :3739

Post your comment

captcha   Reload  Can't read the image? click here to refresh

OMICS International Journals
Make the best use of Scientific Research and information from our 700 + peer reviewed, Open Access Journals
OMICS International Conferences 2016-17
Meet Inspiring Speakers and Experts at our 3000+ Global Annual Meetings

Contact Us

Agri, Food, Aqua and Veterinary Science Journals

Dr. Krish

1-702-714-7001 Extn: 9040

Clinical and Biochemistry Journals

Datta A

1-702-714-7001Extn: 9037

Business & Management Journals


1-702-714-7001Extn: 9042

Chemical Engineering and Chemistry Journals

Gabriel Shaw

1-702-714-7001 Extn: 9040

Earth & Environmental Sciences

Katie Wilson

1-702-714-7001Extn: 9042

Engineering Journals

James Franklin

1-702-714-7001Extn: 9042

General Science and Health care Journals

Andrea Jason

1-702-714-7001Extn: 9043

Genetics and Molecular Biology Journals

Anna Melissa

1-702-714-7001 Extn: 9006

Immunology & Microbiology Journals

David Gorantl

1-702-714-7001Extn: 9014

Informatics Journals

Stephanie Skinner

1-702-714-7001Extn: 9039

Material Sciences Journals

Rachle Green

1-702-714-7001Extn: 9039

Mathematics and Physics Journals

Jim Willison

1-702-714-7001 Extn: 9042

Medical Journals

Nimmi Anna

1-702-714-7001 Extn: 9038

Neuroscience & Psychology Journals

Nathan T

1-702-714-7001Extn: 9041

Pharmaceutical Sciences Journals

John Behannon

1-702-714-7001Extn: 9007

Social & Political Science Journals

Steve Harry

1-702-714-7001 Extn: 9042

© 2008-2016 OMICS International - Open Access Publisher. Best viewed in Mozilla Firefox | Google Chrome | Above IE 7.0 version