Identification and Pattern Analysis of SNPs Involved in Colorectal Cancer

Colorectal cancer (CRC) is the second leading cause of cancer-related deaths globally, posing a lifetime risk of 80-100% in every individual. Genetics and the mechanisms underlying key signaling pathways such as Wnt, TGF, p53 and K-ras play a decisive role in governing the predisposition to CRC. A high percentage of colorectal tumors (adenomas and carcinomas) show activating mutations in beta-catenin or axin, whereas loss of certain tumor suppressor genes (TSGs), like APC, causes the initiation of random polyps in the colon. All of these molecules are critical components of the evolutionarily conserved Wnt signaling pathway, which is instrumental at various time-points in the development of this disease. Differences in SNP profiles amongst sample groups in the genomic landscape can be recognized through efficient use of machine learning techniques. The statistics and pattern analyses of these SNP profiles provide a concrete and logical platform upon which the relative contribution of each unique SNP, ranging “from cause to effect”, can be assessed. The biological relevance of these SNP variations with respect to cancer prediction and predisposition, however, remains to be resolved, pending a better understanding of the impact of rational control design in SNP studies. Our results from the analyses of the significant SNPs reported here demonstrate the utility of relevant bioinformatics tools and machine learning techniques in discriminating diseased populations based on realistic SNP data. In this study, we have primarily targeted critical members of the Wnt signaling pathway, which play important developmental roles during different stages of colorectal cancer, depicting the classical “multigene-multistep nature” of cancer.
We have identified and related common genetic variants for the “early-acting” and “late-acting” members of this pathway that are most prevalent in patients with CRC, by harnessing the power of developmental biology tools. In addition, complex relationships and correlations hidden in large data-sets have been mined and analyzed here by deploying various data-mining (bioinformatics) techniques. The report discusses the scope of such a combinatorial approach by identifying some potential candidate targets of therapy for translational research and clinical interventions.


Introduction
The Wnt signaling pathway is an evolutionarily conserved signaling pathway that governs processes such as cell motility and cell fate. During tumorigenesis, too, the Wnt signaling pathway plays a central role, and inappropriate regulation of this pathway is observed in several human cancers [1]. The pathway is essential to many biological processes, such as stem cell renewal and differentiation [2], and numerous studies over the last few years have led to the identification of several of its critical components. Nevertheless, many of the mechanisms involved in its activation or inactivation remain to be elucidated.

Colorectal cancer
Colon cancer is a carcinoma (a cancer of epithelial origin) of the large intestine (colon), the lower part of the digestive system. Rectal cancer is a carcinoma of the last several inches of the colon. Together, they are often referred to as colorectal cancers. Most cases of colon cancer begin as small, noncancerous (benign) clumps of cells called adenomatous polyps or "adenomas". Over time, some of these polyps become colon cancers (carcinomas) following a multigene-multistep process [3]. More than half a million people worldwide die of colorectal cancer (CRC), making it the second largest cause of death by cancer, owing to its genetic epidemiology [4]. A high percentage of tumors show activating mutations in beta-catenin or axin. Mutations in APC and beta-catenin, a key mediator of Wnt signaling, are found in the vast majority of sporadic colon cancers [5,6]. These molecules are crucial components of the Wnt signaling pathway (see Figure 1). Inactivating mutations of the Wnt signaling pathway inhibitor APC (a tumor-suppressor gene) are associated with the colon cancer susceptibility syndrome, and they are a prominent genetic alteration present in early premalignant lesions in the intestine, such as aberrant crypt foci and small adenomas. Thus, for colon cancer to be cured or stalled, effective measures are needed to inhibit or block this particular stage of development (i.e., early polyps).

Single nucleotide polymorphisms (SNPs)
identifying these genetic variants (SNPs) that are prevalent in patients with specific disease [10].
Thus, SNPs serve as invaluable markers (landmarks in the genome) for association-based approaches to discovering the genetic components of complex traits. Furthermore, where large sample sizes are required, their bi-allelic nature is amenable to high-throughput automated genotyping. SNPs have also been of particular interest to evolutionary biologists. Genomic areas with extreme SNP density may hold specific functions, i.e., they may have specific gene and structural content [11]. In addition, SNPs are useful genetic markers for family-based linkage studies of Mendelian diseases [12], for studies of population history and genetics, and for the personalization of medicine [13]. Previous SNP-based efforts have focused on (1) candidate genes for common diseases, (2) genes with expressed sequence tags, or (3) genomic sequences. SNP profiles and patterns may therefore find immense utility in identifying a comprehensive collection of genes that contribute to the development of, and susceptibility to, complex diseases such as cancer [14]. Mining of SNPs from either EST databases [15] or DNA-sequence databases [16] has been the most prevalent approach in such ventures.

Machine learning
Human genome analysis and the development of high-throughput techniques have provided us with a wide array of complicated biological data. Because these data are so unwieldy, traditional statistical methods have not performed efficiently in this kind of analysis. Machine learning algorithms carry the potential for mining significant information from relatively large, noisy, and complicated data.
Machine learning (ML) is essentially the study and computer modelling of learning processes, including the acquisition of new declarative knowledge, the organization of new knowledge into general, effective representations, and the discovery of new facts through observation and experimentation. Such programs are advantageous in the many cases where input/output pairs can be specified but the precise relationship between them is not known. Machine learning programs can thus help in extracting complex relationships and obscure correlations from relatively large datasets [17] with the intention of uncovering hidden patterns (a process sometimes referred to as data-mining). Machine learning deals with computer programs that learn from training data and improve with experience [18]. However, learners designed to recognize patterns in SNP training data drawn from realistic clinical data are scarce. Complex diseases arise from the effects of multiple genes, which often interact with each other to produce the symptomatic traits. SNP markers are most commonly used to identify such genes for complex disease diagnosis. An increasingly popular and promising way to utilize SNP data for complex disease diagnosis [19] is to employ Artificial Neural Networks (ANN). An ANN* is a computer-based algorithm that can be trained to recognize and categorize complex patterns.
Figure 1: In the absence of Wnt signal (left panel), action of the destruction complex (CKIα, GSK3β, APC, Axin) creates a hyperphosphorylated β-catenin, which is a target for ubiquitination and degradation by the proteasome. Binding of Wnt ligand to a Frizzled/LRP-5/6 receptor complex (right panel) leads to stabilization of hypophosphorylated β-catenin, which interacts with TCF/LEF (T-cell Factor/Lymphoid Enhancer-binding Factor) proteins in the nucleus to activate transcription and trigger tumorigenesis. Dishevelled (Dsh) is a family of proteins involved in canonical and non-canonical Wnt-signalling pathways. Dsh is a cytoplasmic phospho-protein that acts directly downstream of Frizzled receptors. In the canonical pathway, CKIα, GSK3β, APC, and Axin act as negative regulators and all other components act towards positive regulation. It is widely accepted now that constitutive activation of Wnt signaling caused by mutations in the components of this pathway is responsible for initiation of Colorectal Cancer [7].
[*ANN is a mathematical model or computational model that is inspired by the structure and/or functional aspects of biological neural networks. A neural network consists of an interconnected group of artificial neurons, and it processes information using a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. Modern neural networks are non-linear statistical data modeling tools. They are usually used to model complex relationships between inputs and outputs or to find patterns in data.]
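To make the ANN description above concrete, the following is a minimal sketch of a feed-forward network with one hidden layer, trained by plain gradient descent. It is not the model used or planned in this study. The class name TinyANN, the network size, the learning rate and the toy genotypes (minor-allele counts 0/1/2 at three hypothetical SNPs, with made-up case/control labels) are all assumptions chosen purely for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyANN:
    """One hidden layer of sigmoid neurons feeding a single sigmoid output."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # One row per hidden neuron; the last entry of each row is its bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        # Output neuron: one weight per hidden neuron plus a bias.
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
                  for row in self.w_hidden]
        out = sigmoid(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, out

    def predict(self, x):
        return self.forward(x)[1]

    def loss(self, samples, labels):
        # Mean cross-entropy between predicted probabilities and labels.
        total = 0.0
        for x, y in zip(samples, labels):
            p = self.predict(x)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        return total / len(samples)

    def train(self, samples, labels, epochs=500, lr=0.5):
        n = len(samples)
        for _ in range(epochs):
            g_out = [0.0] * len(self.w_out)
            g_hid = [[0.0] * len(row) for row in self.w_hidden]
            for x, y in zip(samples, labels):
                hidden, p = self.forward(x)
                delta_out = p - y  # gradient at the output pre-activation
                for j, h in enumerate(hidden + [1.0]):
                    g_out[j] += delta_out * h / n
                for j, h in enumerate(hidden):
                    # Back-propagate the error through hidden neuron j.
                    delta_h = delta_out * self.w_out[j] * h * (1.0 - h)
                    for k, v in enumerate(x + [1.0]):
                        g_hid[j][k] += delta_h * v / n
            # Apply the averaged gradients after the full pass (batch update).
            self.w_out = [w - lr * g for w, g in zip(self.w_out, g_out)]
            self.w_hidden = [[w - lr * g for w, g in zip(row, grow)]
                             for row, grow in zip(self.w_hidden, g_hid)]


# Hypothetical toy data: minor-allele counts at three SNPs, 1 = case, 0 = control.
genotypes = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 2],
             [2, 1, 0], [1, 2, 1], [2, 2, 0], [2, 1, 2]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
samples = [[g / 2.0 for g in row] for row in genotypes]  # scale counts to [0, 1]

net = TinyANN(n_inputs=3, n_hidden=3)
loss_before = net.loss(samples, labels)
net.train(samples, labels)
loss_after = net.loss(samples, labels)
print(f"cross-entropy before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The weights are the only part of the network that changes during training, which is the sense in which the footnote calls an ANN an adaptive system. A realistic model would need far more samples, held-out validation data and a tested library implementation.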

Tissue sampling and expression profiling
The paraffin embedded archived tissue biopsies of colorectal cancer patients (45 cases) were obtained from the Veterans Affairs Hospital, Long Beach, CA (USA). The details of these cases are as follows:

Number of cases analyzed: 45 patients
To examine protein expression in situ, the archived tissue biopsies (paraffin-embedded sections) were analyzed by immunolabeling using fluorescence-based histochemical methods. Normal, cancerous and polyp tissues were processed following a standard protocol [20] with slight modifications, such as antigen retrieval by pressure-cooker treatment in dilute citric acid buffer ([21]; see Appendix-I) prior to treatment with primary and secondary antibodies. For the results described in this report, the following antibodies (Santa Cruz Biotechnology, Santa Cruz, CA) were used at the following concentrations: HDlg (anti-human, monoclonal mouse) at 7 µg/ml, HCKIε (polyclonal mouse) at 10 µg/ml and HAPC (polyclonal rabbit, C-20) at 8 µg/ml. ("H" denotes antibodies raised in various animals against the human antigens Dlg, CKIε and APC.)

For combination 1:
*Green tag-FITC-conjugated (anti-mouse)
*Red tag-Cy3-conjugated (anti-rabbit)
For combination 2:
*Red tag-Cy3-conjugated (anti-mouse)
*Green tag-FITC-conjugated (anti-rabbit)
The processed sections were mounted in Vectashield (a standard anti-quenching mountant) and, after undergoing the immunohistochemical regimen, preserved in the dark at 4 °C (to prevent bleaching of the fluorophores) for further observation. The localization of the proteins in question (in several permutations and combinations; see Results section) was viewed using an MRC-1024 BioRad/Nikon Diaphot-200 confocal microscope and Laser Sharp image analysis software (BioRad Microscience Division, Cambridge, MA). For each case, the intensity of each protein in question was compared with that in case-matched normal intestinal tissue (from the same patient), which served as a control. Each case was independently scored three times in a single-blind manner to reduce scoring bias. The mean of the three scores for each case indicated whether there was a significant loss of the protein or no change (as compared to the control) for that particular permutation/combination.
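The scoring step described above can be sketched as follows. The text does not state the scoring scale or the cut-off used to call a "significant loss", so the relative-intensity scale (control = 1.0), the 0.5 cut-off and the function name call_expression are hypothetical placeholders rather than the values used in the study.

```python
from statistics import mean


def call_expression(scores, control=1.0, loss_cutoff=0.5):
    """Average three blinded scores and compare the mean with the matched control.

    The cut-off is a hypothetical fraction of the control intensity.
    """
    if len(scores) != 3:
        raise ValueError("expected exactly three independent scores")
    avg = mean(scores)
    verdict = "loss" if avg < loss_cutoff * control else "no change"
    return avg, verdict


# Hypothetical scores for two cases, each relative to its own normal tissue.
print(call_expression([0.2, 0.3, 0.1]))
print(call_expression([0.9, 1.0, 1.1]))
```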

Software used
QualitySNP: This tool was used to detect reliable SNPs and insertions/deletions (indels) in EST data, both with and without quality files [22]. It basically uses three filters for the identification of reliable SNPs. Filter 1 screens for all potential SNPs and identifies variation between or within genotypes. Filter 2 is the core filter that uses a haplotype-based strategy to detect reliable SNPs. Clusters with potential paralogs as well as false SNPs caused by sequencing errors are thus located. Filter 3 screens SNPs by calculating a confidence score, based upon sequence redundancy and quality. QualitySNP is an efficient tool for SNP detection, storage and retrieval in diploid as well as polyploid species. It is available for running on Linux or UNIX.
FastSNP: The FASTSNP (Function Analysis and Selection Tool for Single Nucleotide Polymorphisms) web server was used to efficiently identify and prioritize high-risk SNPs [23] according to their phenotypic risks and putative functional effects. A unique feature of FastSNP is that the functional effect information used for SNP prioritization is always up-to-date, since FastSNP extracts the information from 11 external web-servers at query time using a team of web wrapper agents. Moreover, FastSNP is extendable by simply deploying more Web wrapper agents. It prioritizes SNPs according to 13 phenotypic risks and putative functional effects, such as changes to the transcriptional level, pre-mRNA splicing, and protein structure and so on.
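As a rough illustration of the kind of screen performed by the first QualitySNP filter described above (finding alignment positions that vary among EST reads), here is a minimal sketch. It is not QualitySNP's actual algorithm or interface. The function name potential_snps, the min_reads support threshold and the toy reads are assumptions, and the haplotype-based and confidence-score filters are not modelled.

```python
from collections import Counter


def potential_snps(aligned_reads, min_reads=2):
    """Return {column: allele counts} for columns with two or more supported alleles.

    An allele counts as supported when at least `min_reads` reads carry it, a
    crude guard against single-read sequencing errors (assumed threshold).
    """
    hits = {}
    for col in range(len(aligned_reads[0])):
        # Count only unambiguous bases; gaps and N calls are ignored.
        counts = Counter(read[col] for read in aligned_reads if read[col] in "ACGT")
        supported = {allele: n for allele, n in counts.items() if n >= min_reads}
        if len(supported) >= 2:
            hits[col] = supported
    return hits


# Hypothetical pre-aligned EST reads of equal length.
reads = [
    "ACGTTAGC",
    "ACGTTAGC",
    "ACATTAGC",
    "ACATTAGC",
    "ACGTTAGG",  # lone mismatch in the last column: below the support threshold
]
snps = potential_snps(reads)
print(snps)
```

Only column 2 (0-based) is reported, because both of its alleles are seen in at least two reads. The single discordant base in the last column is ignored as a likely sequencing error.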

Results
Expression profiling of potential "early" and "late" acting
Upon observing the tissue biopsies for spatio-temporal protein localization patterns of various candidate molecules implicated in colorectal cancer (such as p53, h-Dlg, CKI, EGFr and ZO-1), the most prominent results were those for the APC-CSNK1ε interaction presented here (CSNK1ε is referred to as CKI for ease; Figure 2). The results for the hDlg-APC interaction are also reported here as a control and for comparison (Figure 3). It was observed (in cases of polyposis) that, in the early-to-late polyp formation stages, when the cytoplasmic component of APC begins to disappear (a characteristic feature of APC mutant cells seen in in vitro culture studies by Rosin-Arbesfeld et al. [24]; see illustration below), CKI protein is concomitantly lost from the cells that are mutant for APC (Figure 2). The same pattern was recapitulated in the aberrantly developing epithelial cells of the colonic polyps (Figure 2B). In normal cells, by contrast, APC and CKI were seen to be strongly co-localized (Figure 2A). This indicates a strong functional correlation between the two molecules and suggests an early temporal requirement of CKI for the APC tumor suppressor gene. In contrast, when the APC-hDlg protein interaction was closely monitored, there was no obvious change in hDlg expression per se. Dlg expression (discs large (dlg), initially identified in Drosophila, belongs to the MAGUK family of membrane-associated proteins [25]) thus seemed to be perturbed only following the architectural changes in abnormal dysplastic tissues per se (Figure 3C and 3D), rather than following the loss of the cytoplasmic APC protein in APC mutant cells* (*which is a prominent characteristic feature of APC mutant cells of growing polyps of the intestinal villi; see illustration below and Figure 3D). However, both CKI and hDlg were seen to be continually lost from the cancerous tissues in the later stages of development (i.e., in carcinomas, not shown here) due to the loss of cell components, cytoskeletal alterations and damage. Thus, hDlg was considered to be a late-acting gene, in contrast to CKI, which is an early-acting genetic factor (and thus a potential target gene) temporally interacting and/or associated with the APC gene during polyp formation, for the initiation of CRC (Figure

Pattern analysis of "Region Specific SNPs" residing in 5q21 genes
These striking observations on the sub-cellular localization of the gene products in question provided a good starting point from which to explore and analyze the SNPs associated with and within these genes. The QualitySNP tool was used to detect reliable SNPs and insertions/deletions (indels) in EST data for the genes in question. Focusing mainly on the Wnt signaling pathway members, we obtained data with and without quality files. The quality data were filtered using the standard method described (see Materials and methods). Using FASTSNP (Function Analysis and Selection Tool for Single Nucleotide Polymorphisms), which allows users to efficiently identify and prioritize high-risk SNPs according to their phenotypic risks and putative functional effects, we charted the data for the related SNPs (Tables 1 and 2).
On analyzing the pattern of distribution of these SNPs, we found that the total number of SNPs retrieved in the HDlg region was the highest. This probably accounts for all the "noise" and does not carry much weight in mutational cause-effect terms; it also correlates with the functionally persistent condition of HDlg during the early stages of CRC development shown by the wet-lab data presented here (Figure 3). Although the total number of exonic SNPs obtained in the APC region was, as expected, very high (44), only three significant (3* in Table 1) non-synonymous SNPs were found that actually contribute a significant change of the nonsense/missense type. APC, being the actual cause of polyp initiation, should ideally have borne a larger number of them, but that does not seem to be the case here. However, a strikingly higher number of exonic SNPs (even more than for APC) was found for CSNK1ε, which came as a big surprise. A significant cause-effect value (i.e., 2*/266 for CSNK1ε as against 2*/1917 for HDlg in Table 1) was seen on chromosome 22q13, which further suggests that Casein Kinase-I is one of the most significant and functionally relevant candidate genes in the group analyzed here.
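The cause-effect values quoted above can be compared directly with a few lines of arithmetic. This sketch assumes that each value is the number of significant (starred) SNPs divided by the total number of SNPs retrieved for that region, which is how the figures read in the text. The gene keys and variable names are illustrative.

```python
from fractions import Fraction

# (significant SNPs, total SNPs retrieved), as quoted in the text from Table 1.
counts = {"CSNK1E": (2, 266), "HDLG": (2, 1917)}

ratios = {gene: Fraction(sig, total) for gene, (sig, total) in counts.items()}
for gene, ratio in ratios.items():
    print(f"{gene}: {ratio} = {float(ratio):.5f}")

# How many times larger the CSNK1E value is than the HDLG value.
fold = ratios["CSNK1E"] / ratios["HDLG"]
print(f"fold difference: {float(fold):.2f}")
```

Under that reading, the CSNK1ε value is roughly seven times the HDlg value even though both regions carry the same number of significant SNPs. The difference comes entirely from the much larger pool of SNPs retrieved for HDlg.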

SNP distribution data
Further careful analysis of the pattern of SNP distribution within the Casein Kinase family yielded a second potential candidate in this gene-set: CSNK1δ, which is a sister-gene of CSNK1ε and happens to share ~98% amino acid homology with it in the kinase domain [26]. Of the 40 SNPs retrieved for CSNK1δ in the exonic region, 11* cause a significant effect by generating missense changes in the protein (Table 2). Thus, one can assess the putative functional relevance of these potential candidates from their pattern of distribution and the derived "cause-effect" value. The chromosomal region 22q13, in which CSNK1ε resides, is the richest in exonic SNPs in this data-set and therefore again tops the chart in bearing significant functional relevance to the early initiation phase of CRC.

23, 22q13, 3q21, corresponding to APC, CKI and HDLG

Discussion
Genotypic diversity is presumed to underlie the heritable phenotypic differences observed as variation in drug response, susceptibility to disease, and other complex traits. It is therefore conceivable that cataloging DNA sequence variation in the genome will provide invaluable tools to relate genotypes to complex phenotypes, and hence enable discovery of the causative genetic factors and mechanisms underlying various complex diseases, such as cancer [27,13]. Common types of sequence variation in humans include single nucleotide polymorphisms (SNPs), insertions and deletions of a few nucleotides, and variation in the repeat number of a motif (i.e., mini- and microsatellites). SNPs have a relatively high abundance in the genome, occurring at a density of ~1 per kb when any two genomes are compared, and they also possess a relatively low mutation rate. Thus, they serve as a reliable platform on which to search for such complex gene functions and diseases. Precise SNP research thus provides a fundamental understanding of many polygenic diseases, paving a promising way to discovering new therapeutic targets. However, the race amongst pharmaceutical companies today is to apply a new genomics approach to identify "novel targets" and validate these targets in the most efficient fashion. We have a crucial clue from the "pattern analysis" reported here (on the basis of the significant "cause-effect" value; see Tables 1 and 2) of the retrieved "region-specific" SNPs. Our results point to the fact that, although a particular chromosomal region or stretch may be richer in SNPs than others, in effect only the SNPs that cause a drastic change by hampering the function of the protein in question really matter. These SNPs result in a dysfunctional or non-functional protein due to the associated frame-shift, nonsense mutations etc. Here, the CKIε (epsilon) and δ (delta) regions have emerged as such susceptible regions over HDlg.
Thus, CKI can be suggested as a "potential candidate" for such a study. CSNK1ε (CKI), for which the expression profiling was also conducted (see Figure 2 and illustration below), resides on the SNP-rich chromosome 22q [11]. This chromosome can thus be exploited for further analysis in the context of cancer-predisposition genes and oncogenomics. The SNP-rich regions in the human genome are presumed to have specific functions (hot-spots), unlike the SNP-poor regions (cold-spots), and could therefore act as "potential genetic targets" for therapeutic interventions (Figure 5).
We have already created a simplistic Colorectal Cancer (CRC) SNP data set. A promising way to utilize SNP data for complex disease diagnosis is to employ Artificial Neural Networks (ANN), and we intend to design and train an "ANN model" for determining the predisposition to this disease in our forthcoming ventures. Our ANN model will hopefully predict such predispositions (see Figure below), owing to the very encouraging results presented here. A trained ANN model is mostly helpful in estimating the probability of disease occurrence in an individual, indicating susceptibility or predisposition to that particular disease (in this context, colorectal cancer). We would focus on the molecules of the four major alternative signaling pathways defined for colorectal cancer, namely Wnt, p53, TGF-beta and K-Ras [28], like the targeted approaches deployed earlier by Pierotti et al. [29], and/or other so-called emerging pathways in CRC [30,31]. The Wnt pathway has been of particular interest to us because of its implication in stem cell renewal and differentiation [2]. Pathways that bear relevance to both carcinogenesis and stem cell biology would thus be of prime interest for such a translational-research venture. A corollary of this ongoing project would be to identify precisely those SNPs that are associated with obvious biological effects in response to chemical drugs (deploying chemi-informatics). Thus, this particular SNP effort would, in part, serve as the bedrock of pharmacogenomics (in this context, also oncogenomics), which is an emerging field for personalized medicine.
Figure 5: CKI emerges as a potential target/candidate for pharmacogenomics study and also as potential connecting link between APC and H-Dlg.