Data Mining Empowers the Generation of a Novel Class of Chromosome-specific DNA Probes

Probes that allow accurate delineation of chromosome-specific DNA sequences in interphase or metaphase cell nuclei have become important clinical tools that deliver life-saving information about the gender or chromosomal make-up of a product of conception, the probability that an embryo will implant, and the definition of tumor-specific genetic signatures. Often such highly specific DNA probes are proprietary in nature and have been the result of extensive probe selection and optimization procedures. We describe a novel approach that eliminates costly and time-consuming probe selection and testing by applying data mining and common bioinformatics tools. Similar to a rational drug design process in which drug-protein interactions are modeled in the computer, the rational probe design described here uses a set of criteria and publicly available bioinformatics software to select the desired probe molecules from libraries comprising hundreds of thousands of probe molecules. Examples describe the selection of DNA probes for the human X and Y chromosomes, both with unprecedented performance; in a similar fashion, this approach can be applied to other chromosomes or species.


Introduction
Fluorescence in situ hybridization (FISH) is an established laboratory method to delineate specific nucleic acid sequences within interphase cell nuclei or metaphase spreads. It has proven indispensable for the detection of specific features in DNA and a variety of chromosomal rearrangements in cytogenetic and cancer research [1][2][3].
In the typical procedure, non-isotopically labeled nucleic acid probes are prepared first, and hybridized to denatured, single-stranded DNA or RNA targets, before unbound probe molecules are removed by wash steps. The bound probes can be seen in a fluorescence microscope equipped with filter combinations that match the excitation and emission properties of the fluorescent reporter molecules. Several targets can be tagged or 'decorated' simultaneously, and seen as individual hybridization signals, if the probes were labeled with different haptens [4][5][6][7].
The studies described below focus on the detection of DNA targets, and more specifically on detection, scoring or enumeration of specific chromosomes in interphase cell nuclei. This type of investigation finds clinical applications in a variety of fields. For example, if one of the parents of an unborn child, embryo or fetus is a known carrier of one of the many known recessive X-linked diseases, the chance that the offspring will be affected by the disease can be better estimated by determining its gender [8][9][10]. This is a relevant issue: according to Online Mendelian Inheritance in Man (OMIM), as of February 22, 2011, the current estimate of sequenced X-linked genes has reached 620 and the total including vaguely defined traits is an estimated 1138 [11].
Another common use of Y chromosome-specific DNA probes is the identification and tracking of grafted cells in sex-mismatched xenograft studies. For example, when Y chromosome-bearing white blood cells from a male donor are transplanted to a female recipient, minute amounts of grafted cells can be identified following cell harvest and hybridization due to the presence of the Y-specific signal [12][13][14][15].

Materials and Methods
All procedures involving human subjects were reviewed and approved by the UC Berkeley Institutional Review Board.

Retrieval of target sequence for the Y probe
The nucleic acid sequence used for data mining was defined in our previous studies on in vitro DNA amplification of Y chromosome-specific DNA repeat sequences.

Database searches
We screened the National Center for Biotechnology Information (NCBI) human genome nucleotide DNA database for homologous sequences with one of the most widely used bioinformatics programs, Basic Local Alignment Search Tool (BLAST) [24].
The BLAST approach to rapid sequence comparison directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. The basic algorithm is simple, robust and versatile; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences [24,25].
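The MSP score mentioned above can be illustrated with a minimal sketch. The code below performs an exhaustive search for the highest-scoring ungapped segment pair between two short sequences; it is not the heuristic word-seeding procedure that BLAST itself uses, and the match/mismatch values (+5/-4) as well as the function name `msp_score` are assumptions chosen for illustration only.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Return (score, start_a, start_b, length) of the best ungapped segment pair.

    Exhaustive search over all diagonals; the scoring values are illustrative.
    """
    best = (0, 0, 0, 0)
    # A diagonal d pairs a[i] with b[i + d]; an ungapped segment lies on one diagonal.
    for d in range(-(len(a) - 1), len(b)):
        i = max(0, -d)
        run, run_start = 0, i
        while i < len(a) and i + d < len(b):
            step = match if a[i] == b[i + d] else mismatch
            if run <= 0:
                # A non-positive running score cannot help; start a new segment here.
                run, run_start = step, i
            else:
                run += step
            if run > best[0]:
                best = (run, run_start, run_start + d, i - run_start + 1)
            i += 1
    return best


# Two toy sequences sharing the 7 nt segment "ACGTACG" at offset 3 in both.
print(msp_score("TTTACGTACGTTT", "GGGACGTACGGGG"))  # (35, 3, 3, 7)
```

The exhaustive search is quadratic in sequence length and is shown only to make the scoring measure concrete; database-scale searches rely on the approximations described in [24,25].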

Retrieval of probe information for the pericentromeric region of the X chromosome
We used the University of California Santa Cruz (UCSC) Genome Browser (assembly GRCh37/hg19, February 2009 build), accessible at http://genome.ucsc.edu/, to identify bacterial artificial chromosome (BAC) clones with high satellite DNA content [26]. The graphical user interface was set to display BAC end pairs and repeat DNA elements in the pericentromeric region of the human X chromosome, i.e., from position 58,232,531 bp to position 61,922,800 bp.
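The same region restriction can be expressed programmatically once clone coordinates have been exported from the browser. The following is a minimal sketch under stated assumptions: the clone names and coordinates are invented placeholders (not real RP11 mapping data), and the `Clone` record and `overlaps` helper are hypothetical names introduced here. Only the region boundaries are taken from the text.

```python
from dataclasses import dataclass

# Pericentromeric window on the X chromosome as given in the text (GRCh37/hg19).
REGION = ("chrX", 58_232_531, 61_922_800)


@dataclass(frozen=True)
class Clone:
    name: str
    chrom: str
    start: int  # first mapped base, inclusive
    end: int    # last mapped base, inclusive


def overlaps(clone, region=REGION):
    """True if the clone shares at least one base with the region."""
    chrom, region_start, region_end = region
    return (clone.chrom == chrom
            and clone.start <= region_end
            and clone.end >= region_start)


# Placeholder clones with invented coordinates, for illustration only.
clones = [
    Clone("clone_A", "chrX", 58_100_000, 58_280_000),  # straddles the left boundary
    Clone("clone_B", "chrX", 62_000_000, 62_150_000),  # lies beyond the right boundary
    Clone("clone_C", "chrY", 58_500_000, 58_650_000),  # wrong chromosome
]
in_region = [c.name for c in clones if overlaps(c)]
print(in_region)  # ['clone_A']
```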

Image acquisition and analysis
Fluorescence microscopy was performed on a Zeiss Axioskop microscope (Zeiss, New York, NY) equipped with filter sets for observation of Cy5/Cy5.5, Texas red/rhodamine, FITC or DAPI (84000v2 Quad, Chroma Technology; Brattleboro, VT).
Images were collected using a CCD camera (VHS Vosskuehler; Osnabrueck, Germany) and processed using Adobe Photoshop® software (Adobe Inc.; Mountain View, CA).

Chromosome Y BAC clone selection and validation
The bacterial artificial chromosome clone with the highest homology score, RP11-242E13 (GenBank accession number AC068123), contains 28 identical copies of 23 nt of the 27 nt query sequence in its 98,295 bp insert (Fig. 1). According to the NCBI database, this clone has been mapped to the long arm of the Y chromosome, band q12 (Table 1). With one exception (hit #27), the 23 bp sites (Fig. 1) are spaced at more-or-less regular intervals of slightly more than 3500 bp (Table 2). Site #27 is spaced 1994 bp from the previous site of homology with the query sequence, thus indicating a single incomplete 3.5 kb DNA repeat in BAC RP11-242E13. Excluding site #27, the sites are spaced an average of 3566 bp apart, which is in excellent agreement with the published sequence for a single repeat of 3564 bp described by Nakahori et al. [16].
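The spacing analysis described above amounts to taking differences between consecutive hit positions and separating the regular intervals from outliers. The sketch below illustrates this with synthetic positions constructed to mimic the reported pattern (regular 3566 bp intervals plus one short 1994 bp interval); these are not the coordinates of Table 2, and the 10% tolerance and the function name `spacing_summary` are assumptions.

```python
from statistics import mean, median


def spacing_summary(positions, tolerance=0.10):
    """Return (mean of regular gaps, list of irregular gaps).

    A gap is 'regular' if it lies within `tolerance` (as a fraction)
    of the median gap; the threshold is an illustrative choice.
    """
    ordered = sorted(positions)
    gaps = [right - left for left, right in zip(ordered, ordered[1:])]
    typical = median(gaps)
    regular = [g for g in gaps if abs(g - typical) <= tolerance * typical]
    irregular = [g for g in gaps if abs(g - typical) > tolerance * typical]
    return mean(regular), irregular


# Synthetic hit positions: four 3566 bp intervals and one 1994 bp interval.
hits = [1000, 4566, 8132, 11698, 13692, 17258]
average_gap, short_gaps = spacing_summary(hits)
print(average_gap, short_gaps)
```

As a rough consistency check on the figures in the text, a 98,295 bp insert can accommodate about 27.6 repeats of 3566 bp, which is compatible with the 28 sites reported.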

Validation of the Y-specific BAC clone RP11-242E13 by FISH
The result of an in situ hybridization experiment using a biotinylated DNA probe prepared from BAC clone RP11-242E13 is shown in Figure 2A. As anticipated, the 98 kb DNA probe decorated the distal long arm of the human Y chromosome, suggesting that there are hundreds or even thousands of copies of the complementary DNA target sequence [35] (Figure 2A). Interestingly, even in this simple hybridization experiment that did not include any type of blocking DNA, the probe lit up only chromosome Y, band q12. No cross-hybridization was found with other chromosomes that are known to carry satellite III-type heterochromatin, such as human chromosome 9 [19]. In summary, our findings suggest that BAC clone RP11-242E13 carries multiple copies of the 3.6 kb DNA repeat, which hybridize exclusively to the human Y chromosome.

BAC clones from the pericentromeric region of the X chromosome
The data mining effort searching for a large-insert, recombinant DNA clone that hybridizes specifically to the heterochromatic DNA tandem repeats in the pericentromeric region of the human X chromosome was guided by our prior observation that chromosome-specific alpha satellite DNA repeats can be identified in BAC clones through database searches [36]. We had noticed that alpha satellite DNA repeats, which typically appear as chromosome-specific, higher-order tandem repeats in the pericentromeric region of human autosomes [26,37], can also be found dispersed in low copy numbers at a significant distance from the centromeric regions, and that in vitro DNA amplification can be used to enrich the alpha satellite sequences [36].
The graphical user interface of the UC Santa Cruz Genome Browser suggested that several clones from the RP11 BAC library map into the X chromosome-specific pericentromere and contain alpha satellite DNA repeats (Figure 3). We decided to test two BAC clones: one that is essentially free of interspersed DNA repeats such as short or long interspersed element (LINE) sequences (RP11-294C12) and maps to chromosome X, band q11.2 (Table 1), and a second clone (RP11-348G24), which contains a few interspersed DNA repeats and maps to chromosome X, band p11.1 (Figure 3). After retrieval from the freezer, both clones grew in Luria-Bertani broth at approximately the same rate. The DNA yields from 5 ml overnight cultures were similar, but the labeled DNAs showed drastically different hybridization patterns. The results are shown in Figures 2B and 2C.
The digoxigenin-labeled probe prepared from BAC clone RP11-348G24, which contains the interspersed repeats, gave a strong signal on the X chromosome, but also unacceptably high levels of cross-hybridization with autosomes (Figure 2B). Many interphase nuclei exhibited such high levels of cross-hybridization that it was not possible to delineate the X chromosome target. In contrast, hybridization of a biotinylated DNA probe prepared from BAC clone RP11-294C12 resulted in signals that localized exclusively to the pericentromeric region of the human X chromosome (Figure 2C).
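The contrast between the two clones suggests a simple screening rule: prefer candidates whose inserts carry the least interspersed repeat sequence. A minimal sketch of such a ranking is given below. It assumes repeat annotations are available as (class, length) pairs, for example exported from a repeat annotation track; the class labels, clone names, lengths and the helper names `interspersed_fraction` and `rank_clones` are all hypothetical and do not describe the clones tested here.

```python
# Repeat classes treated as interspersed (an assumption for this sketch).
INTERSPERSED = {"LINE", "SINE", "LTR"}


def interspersed_fraction(annotations, insert_length):
    """Fraction of the insert covered by interspersed repeats.

    annotations: list of (repeat_class, length_bp) pairs.
    """
    covered = sum(length for cls, length in annotations if cls in INTERSPERSED)
    return covered / insert_length


def rank_clones(candidates):
    """Return clone names ordered from lowest to highest interspersed fraction."""
    return sorted(candidates,
                  key=lambda name: interspersed_fraction(*candidates[name]))


# Invented candidates: name -> (annotations, insert length in bp).
candidates = {
    "clone_high": ([("Satellite", 100_000), ("LINE", 6_000), ("SINE", 1_500)], 150_000),
    "clone_low": ([("Satellite", 120_000)], 150_000),
}
ranking = rank_clones(candidates)
print(ranking)  # ['clone_low', 'clone_high']
```

Such a ranking only prioritizes candidates; as in the experiments above, specificity still has to be confirmed by hybridization.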
To demonstrate that the highly specific DNA probes for chromosomes X and Y can be combined in dual-color multiplex hybridization experiments [38][39][40], we labeled the Y chromosome-specific DNA from BAC clone RP11-242E13 with Spectrum Green-dUTP (green fluorescence) and the X chromosome-specific DNA from BAC clone RP11-294C12 with Spectrum Orange-dUTP (red fluorescence). Hybridization of this probe mixture gave strong, specific signals in metaphase as well as interphase cells (Figure 2D).

Probe generation and testing rates
Turnaround times for generation and validation of DNA probes prepared from BACs are at least one order of magnitude shorter than those observed when yeast artificial chromosome clones serve as large-insert human genomic DNA probe templates [41][42][43].

Conclusions
This study was undertaken to seek answers to three questions:
1. Can researchers with limited access to computational capabilities data mine and take advantage of the huge resources generated by the International Human Genome Project?
2. How can one identify recombinant, large-insert DNA clones with features that make them ideal hybridization probes?
3. What constitutes an inexpensive and rapid approach to validate the selected clones?
We believe that all three questions have been answered in this study. First, we have been able to demonstrate that simple, publicly available bioinformatics tools such as BLAST searches running on a simple desktop computer allow the operator to data mine and extract the desired information from publicly available archives. Secondly, we began to define rules to predict DNA probe properties based on DNA sequence analysis. Lastly, we were able to show that in a matter of days, chromosome-specific DNA probes can be defined, prepared and validated by FISH.
The importance of probe specificity cannot be overemphasized. While recombinant DNA clones carrying large chunks of the human genome such as the BAC or P1 clones [39,40,43,44] are easy to propagate, the presence of non-chromosome-specific interspersed DNA repeats can lead to major impediments in FISH-based interphase cell analysis [36]. As shown above, careful probe selection based on mining of the publicly accessible databases may circumvent some of these problems. The simple, yet efficient data mining approach presented in this paper is our first step towards creation of an 'in silico' DNA probe set that may find extensive use in the analysis of next-generation, deep sequencing data [45,46]. Thus, this new class of 'in silico probes' has little or nothing in common with the conventional recombinant DNA probes [1,2,4,19,47], many of which have not even been sequenced, and it is more akin to the well-known 'in silico PCR' [48]. Here, we describe a software-based interactive approach in which the user checks the results of the data mining operation and validates the BAC clones via FISH. In our lab setting, the approach allowed us to select and validate a few clones (<10) per week.
In closing, it is important to mention that polymorphisms or centromeric heteromorphisms might present serious problems in the FISH-based analysis of cells from affected individuals using DNA repeat probes [49], regardless of whether such an analysis is performed by conventional FISH, Spectral Imaging or multicolor FISH [6,34,50]. The aforementioned deep sequencing in combination with 'in silico' copy number analysis might prove to be a reasonable alternative approach.