Biomarkers for the Detection of Pre-Cancerous Stage of Cervical Dysplasia

Introduction: Early diagnosis of cancer can dramatically increase healing probability. However, many cancer detection methods are time-consuming, invasive, and require skilled medical staff and/or expensive detection systems. Cervical cancer is the fourth most common malignant disease among women, and the fourth leading cause of cancer death in women worldwide. Aim: This pilot study sought to identify reliable biomarkers indicative of early stages of cervical dysplasia, by analysis of changes in volatile organic compound composition in urine samples. Methods: Urine samples of 17 patients with cervical intraepithelial neoplasia (CIN I) and of 9 healthy female subjects were used. The sample composition was analyzed using gas chromatography-mass spectrometry. The statistical analysis of the data was performed using supervised artificial neural networks. Results: We identified four molecules with the potential to serve as biomarkers of cervical dysplasia, together with two molecules whose absence in the urine can confirm the existence of cervical dysplasia. All indications show that these six potential biomarkers are produced in the body during various physiological processes that are enhanced in sick women. Hence, these potential biomarkers are not related to environmental or dietary origins. Conclusion: Validation of the statistical method used indicated that the biomarkers identified are highly reliable for detection of cervical dysplasia.
Journal of Molecular Biomarkers & Diagnosis, ISSN: 2155-9929. Citation: Elia P, Raizelman S, Katorza E, Matana Y, Zeiri O, et al. (2015) Biomarkers for the Detection of Pre-Cancerous Stage of Cervical Dysplasia. J Mol Biomark Diagn 6: 255. doi:10.4172/2155-9929.1000255


Introduction
Early diagnosis of cancer can dramatically increase healing probability. However, many of the methods used for cancer detection are time-consuming, invasive, and require skilled medical staff and/or expensive detection systems such as computed tomography (CT), magnetic resonance imaging (MRI), endoscopy and ultrasonography [1][2][3][4]. In recent years much research has focused on the development of safe, reliable, non-invasive and inexpensive early detection schemes. A very promising route is finding specific and reliable biomarkers that are indicative of early stages of cancer. Biomarkers have shown potential for detection of various diseases, including cancer, by their identification in exhaled breath, blood and urine. Urine samples have high potential to contain biomarkers indicative of the physiological condition. Indeed, urinary biomarkers have been identified for prostate and lung cancers [5,6], tuberculosis [7], exposure to toxic vapors [8] and diabetes mellitus [9]. Identification of biomarkers requires combining skills from different disciplines including analytical chemistry and statistical data analysis [10].
Cervical cancer was the fourth most commonly diagnosed cancer in women in 2012 and the fourth leading cause of cancer related death in women worldwide [11]. A cervical tumor develops from abnormal cell growth and has been linked to the human papillomavirus (HPV) [12]. In many cases, infection with certain types of HPV is the first step in the progression from a normal cervix to cervical cancer. It is well established that sexually transmitted HPV induces the growth of abnormal cells that can become malignant [13,14]. As cancer cells form, cells of abnormal size and shape appear on the surface of the cervix and begin to multiply. Cervical cancer can be detected using a laboratory test that examines cervical cells obtained through a gynecological procedure called a Papanicolaou test (Pap test in short) or by a new technology based on liquid-based cytology [15]. The most effective way of screening for cervical cancer is through routine Pap tests or by testing for HPV. Women who undergo routine screening have a better chance of early diagnosis and treatment [16,17].
Cervical dysplasia is the term used to describe the early growth of abnormal cells on the cervix that could progress to cancer. Cervical dysplasia is usually the first stage of cervical cancer, but women with cervical dysplasia do not necessarily develop cancer. Dysplastic cells look like cancer cells, but they are not considered malignant provided that they remain on the surface of the cervix and do not invade healthy tissue. Cervical dysplasia is classified into three stages: Cervical Intraepithelial Neoplasia I, II and III (CIN I, CIN II and CIN III) [18], described in more detail in the appendix. Women with a pre-cancerous condition will in most cases remain under physician follow-up, since progression to cervical carcinoma in situ (CIS) occurs in approximately 11% of the CIN I cases and progression to invasive cancer occurs in approximately 1% of the cases [19]. Most cases of cervical carcinoma can be prevented through proper screening. In fact, during 2012 in developed countries, with better awareness of early detection and better diagnostic tools, cervical cancer was only the ninth leading cause of cancer death in women, while in undeveloped countries cervical cancer was the third leading cause of cancer related death [11]. In addition, errors in cervical sampling and in sample interpretation are common. Consequently, reports of Pap test sensitivity and specificity vary significantly, and current screening is far from being sufficiently accurate [20,21].
Herein, we describe results of a pilot study in which the composition of volatile organic compounds (VOCs) in urine samples of women with cervical dysplasia in stage CIN I is compared with that in urine samples of healthy women. The small sample size used in this pilot study is related to the limited budget provided for this pilot research. The main goal of the study is to identify statistically meaningful potential biomarkers that will allow a rapid, non-invasive, reliable and inexpensive test for cervical dysplasia in the CIN I stage.

Methods
Composition analysis of VOCs in the headspace over urine samples was performed using gas chromatography (GC) combined with mass spectrometry (MS). The VOC composition in the headspace of urine samples from 17 patients with CIN I and from 9 healthy female subjects was analyzed.

Subjects
Cervical dysplasia/cancer patients were admitted to the Department of Obstetrics and Gynecology at San Camillo-Forlanini Hospital (Rome, Italy). Colposcopy exams were carried out in all patients before surgery. After LEEP (Loop Electrosurgical Excision Procedure) and pathological analysis, the histological type and grade of the cervical tumors were determined. Only CIN I grade patients were selected for the study and their urine samples were collected. All procedures involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. All patients included in the study signed an informed medical consent form, previously approved by the Institutional Ethical Committee and the Medical Board of the San Camillo-Forlanini Hospital. Urine samples were obtained from 17 female CIN I patients (mean age 36.8 years, SD ± 9.9) and 9 healthy female subjects (mean age 39.6 years, SD ± 14.2). See details of the subjects in the Appendix Table A2. None of the healthy participants presented symptoms of any kind of cancer, acute inflammation, flu, pregnancy, or infectious disease before or during the period of the experiments. Participants were asked to supply 50 ml of urine in sterile screw-top plastic vials. Urine samples were divided into two 25 ml septum-topped vials and frozen at -18°C until use.

Sample analysis
GC-MS analysis was performed using an Agilent 6890 series GC system (Agilent, USA) connected to an Agilent 5973 Network Mass Selective Detector (Agilent). The samples were concentrated prior to all measurements and introduced into the GC system using static headspace sample extraction, achieved by exposure of a 65 μm polydimethylsiloxane/divinylbenzene (PDMS/DVB) solid-phase microextraction (SPME) fiber (SUPELCO, Bellefonte, PA) to the headspace over the solvent-free urine sample for 20 min at 80°C. A detailed description of the chromatographic analysis procedures used is given in the appendix section.

Chromatogram pre-processing
The noise level in each chromatogram was evaluated using in-house MATLAB code, according to Clinical and Laboratory Standards Institute (CLSI) guideline EP17. The code was used to set the threshold between noise and actual peaks according to the Limit of Detection (LOD) definition, which is given in the appendix section. The retention times of peaks in the chromatogram were assigned by identification of local maxima that are above the threshold value. Each measured chromatogram was represented by a pair of vectors containing the peak retention times and the corresponding peak areas. The fluctuations in peak positions were measured by analysis of our GC calibration mixture, described in the Appendix section. The chromatograms of 40 measurements performed on different days were analyzed to yield a maximum uncertainty in peak position of 1.2 sec. Thus, the entire measurement period was sub-divided into time intervals of 2.4 sec, which ensured that identical peaks in two different chromatograms would be located in the same time interval. The area of each peak in the chromatogram was normalized twice consecutively: first by the creatinine level found in the urine sample and then by the largest peak area in the chromatogram. Hence, all area values in all chromatograms were in the range of 0-1.
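The peak-picking and double-normalization steps described above can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code: the function names `pick_peaks` and `normalize_areas` are hypothetical, the noise threshold is passed in as a plain number rather than derived from the LOD definition in the appendix, and the creatinine level is assumed to be a single positive value per sample.

```python
def pick_peaks(times, signal, threshold):
    """Return retention times of local maxima that exceed the threshold.

    `threshold` stands in for the LOD-based noise threshold (assumed given).
    """
    peaks = []
    for i in range(1, len(signal) - 1):
        is_local_max = signal[i - 1] < signal[i] >= signal[i + 1]
        if is_local_max and signal[i] > threshold:
            peaks.append(times[i])
    return peaks


def normalize_areas(areas, creatinine):
    """Normalize peak areas twice: by creatinine, then by the largest peak.

    After the second step every value lies in the range 0-1.
    """
    scaled = [area / creatinine for area in areas]
    largest = max(scaled)
    return [value / largest for value in scaled]
```

Because the second normalization divides by the largest peak, the creatinine factor cancels within a single chromatogram; the two-step form is kept here only to mirror the order of operations stated in the text.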

Statistical analysis
The statistical analysis used is a non-linear approach, based on Artificial Neural Networks (ANN). The ANN contained two layers of "neurons" appropriately connected by weights. Data inputs were connected to the neurons in the first layer ("hidden" neurons), which were connected in turn to the second layer of the "output" neurons.
Adjusting the values of the weights between the neurons during the training of the ANN was carried out using "back-propagation" of the errors between the output neurons and the known data outputs. A constant bias threshold is included as additional input to all hidden and output neurons.
Once the ANN is trained, it is verified by presenting examples not used in the training. ANN modeling has been used in analyzing bio-medical data [22,23], and high dimensional data was successfully modeled and analyzed by the GB ANN algorithm set [24].
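To make the network structure concrete, the following Python sketch implements a two-layer network with sigmoid neurons, a constant bias input, and back-propagation of the output error. It is an illustration under stated assumptions, not the GB ANN algorithm set used in the study: the learning rate, number of epochs, weight initialization range, per-sample update order and the target coding (0.1 for one class, 0.9 for the other) are all assumed.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_network(n_inputs, n_hidden, rng):
    """Create small random weights; the last weight in each row is the bias."""
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(net, x):
    """Return hidden activations and the single output value."""
    hidden, output = net
    xb = list(x) + [1.0]  # constant bias input
    h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in hidden]
    y = sigmoid(sum(w * v for w, v in zip(output, h + [1.0])))
    return h, y


def mean_squared_error(net, samples):
    return sum((forward(net, x)[1] - t) ** 2 for x, t in samples) / len(samples)


def train(net, samples, epochs=2000, rate=0.5):
    """Adjust weights by back-propagating the output error, sample by sample."""
    hidden, output = net
    for _ in range(epochs):
        for x, target in samples:
            h, y = forward(net, x)
            delta_out = (y - target) * y * (1.0 - y)
            # Hidden deltas are computed before the output weights change.
            delta_hidden = [delta_out * output[j] * h[j] * (1.0 - h[j])
                            for j in range(len(h))]
            for j, v in enumerate(h + [1.0]):
                output[j] -= rate * delta_out * v
            xb = list(x) + [1.0]
            for j, row in enumerate(hidden):
                for i, v in enumerate(xb):
                    row[i] -= rate * delta_hidden[j] * v
    return net
```

In use, each training sample would be a pair of a binned chromatogram vector and its target value, and `forward` would then be applied to samples that were held out of training.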
The chromatogram obtained for each sample was assigned as belonging to one of two classes: 'Healthy' or 'Sick'. The contribution vector for the separation between the two classes was calculated. The compounds with the largest values in this vector were assigned as possible biomarkers of cervical dysplasia [25][26][27]. Once a trained ANN is available, it can be analyzed to extract the identification of the more relevant features [28]. The causal index (CI) based algorithm [29] was found to be very useful in revealing the influence of input change on the change in relative magnitude and sign of each output. A detailed description of this approach is given in the appendix section.
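The feature-ranking idea can be sketched as follows. The exact CI definition used in the study is given in its appendix and is not reproduced here; this sketch assumes the commonly used form in which the index of input i is the sum, over hidden neurons, of the product of the input-to-hidden weight and the hidden-to-output weight. The function names and the exclusion of bias weights are assumptions.

```python
def causal_index(hidden_weights, output_weights):
    """Assumed CI for a single-output network.

    hidden_weights[j][i] is the weight from input i to hidden neuron j,
    output_weights[j] is the weight from hidden neuron j to the output.
    Bias weights are assumed to have been removed beforehand.
    """
    n_inputs = len(hidden_weights[0])
    return [
        sum(v * row[i] for v, row in zip(output_weights, hidden_weights))
        for i in range(n_inputs)
    ]


def rank_features(ci, top=3):
    """Return indices of the inputs with the largest absolute CI values."""
    order = sorted(range(len(ci)), key=lambda i: abs(ci[i]), reverse=True)
    return order[:top]
```

Under this assumed definition, the sign of an entry indicates whether increasing that input pushes the output up or down, and the magnitude gives its relative influence, which is how candidate time intervals (and hence compounds) would be shortlisted.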

Validation
Two cross-validation methods were used to examine the generalization of the statistical model obtained by the approach described above. Both validation methods are suitable for cases with small numbers of samples such as those in this study [30]. The two validation methods used were leave-one-out (LOO) and 7-fold [31,32]. In the 7-fold cross-validation the original sample set is randomly partitioned into seven subsamples of approximately equal size. One of the subsamples is chosen for validation while the other subsamples are used as training data. The process is repeated 7 times, so that each subsample is used once for validation. The LOO validation approach subdivides the sample set into one sample used for validation and all remaining samples used for training. The process is repeated until every sample has been used for validation. The classification of each prediction sample for LOO, or prediction set for 7-fold, was determined by the classification output of each model, as explained in the appendix section.
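The two partitioning schemes can be sketched in a few lines of Python. The function names and the fixed random seed are illustrative assumptions. Note that 26 samples cannot be divided into seven exactly equal folds, so this sketch produces five folds of four samples and two folds of three.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Return a list of (training_indices, validation_indices) pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, like dealing cards.
    folds = [indices[i::k] for i in range(k)]
    all_indices = set(indices)
    return [(sorted(all_indices - set(fold)), sorted(fold)) for fold in folds]


def leave_one_out_splits(n_samples):
    """LOO is the special case of k-fold with one sample per fold."""
    return k_fold_splits(n_samples, n_samples)
```

A model would be trained on the training indices of each pair and evaluated on the corresponding validation indices, with the per-fold classifications pooled to give the overall validation result.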

GC-MS results
A comparison between typical GC-MS chromatograms of urine samples obtained from healthy and sick women is presented in Figure 1. Four compounds that were present in all urine samples examined were identified. These four compounds served as internal references in the data analysis stage. The internal reference compound peaks are marked by arrows in Figure 1 as Iref. 1 (ammonia), Iref. 2 (hexamethyl-cyclotrisiloxane), Iref. 3 (octamethyl-cyclotetrasiloxane) and Iref. 4 (1,3-dihydro-5-methyl-2H-Benzimidazol-2-one). The origin of these internal reference peaks is related to both the urine sample composition and the experimental system (column and SPME fiber coatings).
Inspection of the chromatograms in Figure 1 shows clear differences between the data of sick and healthy women. There are many peaks that appear in both healthy and sick urine samples but it is clear that each chromatogram exhibits a large number of unique peaks that do not exist in the other chromatogram. This is not surprising, since urine samples of different individuals are expected to exhibit variations due to differences in diet, habits and physiology. Thus, the central goal of the present study is to check whether one can find a group of statistically meaningful peaks that can uniquely identify the urine sample as belonging to a sick woman.

Data pre-processing
The output obtained by the GC-MS software (enhanced msd ChemStation Vers. E.02.01.1177) is in the form of two column vectors containing the retention time assigned to each peak detected and the area under the peak respectively. All area values in the chromatogram were first normalized according to the concentration of creatinine in the sample. Next, since ammonia (Iref. 1) exhibited the largest peak area in all chromatograms, its value was used to normalize the area of all other peaks in the chromatogram. The peak retention time vector was mapped onto a vector that contained 625 elements, each of duration of 2.4 sec. This time box size was defined by the statistical error in retention time of peaks corresponding to different known compounds used for the calibration. Thus, each time box could contain only a single peak with retention time equal to the value at the box center. This procedure ensured that identical peaks were assigned identical retention times. The normalized peak area was assigned to the box, and if no peak existed in a given time interval, its value was defined to be zero.
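The mapping of a peak list onto a fixed-length vector can be sketched as below. The 625 elements and the 2.4 sec width come from the text; the assumptions are that the time axis starts at zero, that bin k covers the half-open interval from 2.4k to 2.4(k+1) seconds, that peaks outside the covered 1500 sec are ignored, and that the larger area is kept if two peaks ever fall in one bin. The function name is hypothetical.

```python
BIN_WIDTH_S = 2.4  # time box duration from the text
N_BINS = 625       # vector length from the text (625 * 2.4 s = 1500 s)


def bin_chromatogram(retention_times_s, areas, n_bins=N_BINS, width=BIN_WIDTH_S):
    """Map (retention time, normalized area) pairs onto a fixed-length vector."""
    vector = [0.0] * n_bins  # bins without a peak stay at zero
    for t, area in zip(retention_times_s, areas):
        index = int(t // width)
        if 0 <= index < n_bins:
            # Assumption: keep the larger area if a bin is hit twice.
            vector[index] = max(vector[index], area)
    return vector
```

Every chromatogram then has the same length regardless of how many peaks it contains, which is what allows the vectors to be used directly as inputs to the network.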

Statistical analysis and its verification
A supervised ANN was used for modeling the GC-MS data, where all the input features were connected to 5 neurons in the hidden layer and those were connected to a single neuron in the output layer. The back-propagation method was used to perform the training of the ANN using 70% (18 out of 26) randomly chosen chromatograms. The convergence of the training set to a model that correctly reproduced all the training samples is shown in the Appendix, Figure A1. The output layer contained a single neuron whose value was in the range 0.1-0.9 for all samples. Output values in the range 0.1-0.4 were classified as belonging to sick women, those in the range 0.6-0.9 as belonging to healthy women, and outputs in the range 0.4-0.6 were defined as undecided. The quality and generalizability of the resulting model were tested on the remaining 30% of the chromatograms, which were not used in the training process. The outcome of this generalization was very good, as can be seen in Appendix Table A1.
Validation analysis of the ANN based model was carried out using two different procedures, both suitable for small statistical ensembles [33]: 7-fold cross-validation and leave-one-out (LOO) validation. The validation results of the ANN based model, using both methods, are presented in Table 1. Inspection of the results clearly suggests that the ANN based model of the GC-MS results is highly accurate. This accuracy is related to the non-linear correlations between the dependent and independent variables.
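The decision rule applied to the output neuron can be written as a small Python function. The thresholds 0.4 and 0.6 are taken from the text; the function name is hypothetical, and the treatment of values exactly at 0.4 or 0.6 (assigned to the undecided band here) is an assumption, since the text does not specify it.

```python
def classify_output(value):
    """Translate the single output neuron value into a class label."""
    if value < 0.4:
        return "Sick"
    if value > 0.6:
        return "Healthy"
    return "Undecided"  # 0.4-0.6 band, boundaries included by assumption
```

For example, an output of 0.15 would be labeled "Sick", 0.85 "Healthy", and 0.5 "Undecided".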

Potential biomarkers for CIN I stage of cervical dysplasia
The ANN based model allowed us to identify VOCs in urine samples that may serve as biomarkers for CIN I stage cervical dysplasia. As discussed above, the validation of the model showed that the ANN yields very good results. In the following we focus on the potential biomarkers suggested by the ANN analysis. The chemical identities of the potential biomarkers obtained by the ANN analysis are listed in Table 2. The peak identification shown in Table 2 was assigned based on the search results of the GC-MS software, ChemStation, using the NIST'08 libraries. Chemical identification was assigned only if the Quality Factor (QF) obtained by the search code was above 80. In some cases several assignments with QF>80 were obtained, and in these cases all the possibilities are given. For each possible biomarker the QF is given together with the molecular structure and the percent of repetition in the healthy and sick groups. For each chemically identified potential biomarker we also added possible routes of its generation or entrance to the body and its identification as a biomarker in different types of illnesses according to the literature. Inspection of the data in Table 2 shows that the main sources for most of the potential biomarkers are dietary sources, peroxidation processes and environmental sources. Bold fonts were used to mark the names of all potential biomarkers that are produced in the body (Table 2). There are four potential biomarkers that are not identified as originating from environmental or dietary sources. These four potential biomarkers are: 3-Hexanone; Hexanal; Dodecane, 4-methyl-; and 3-Ethylcyclopentanone. The physiological production of these potential biomarkers is associated with processes related to oxidative stress. Three of these compounds were not identified in the urine sample of any member of the healthy women group, while the fourth appeared in both groups but its concentration in the urine of sick women was over twofold higher than in samples from healthy women. This group of four compounds has a high probability of serving as reliable biomarkers for CIN I stage cervical dysplasia.

Table 2: Summary of the chemical identity of potential biomarkers obtained using the ANN approach. The chemical identity of a peak is assigned only if the Quality Factor (QF) was greater than 80; otherwise the peak is assigned as unknown. For some peaks in the chromatogram several assignments were possible; in such cases all possibilities with QF>80 are shown. There are compounds which occur in samples of both sick and healthy women; however, their concentration in the urine of sick women is 2-3 fold higher.
24.00  Unknown  --  35.5%  22%  --

A similar set of compounds that characterizes the healthy women group was also identified. The VOCs identified using the ANN based approach are presented in Table 3. Two compounds, 2,7-Dimethyloxepin and 1-Butene, 4-isothiocyanato-, were detected as indicative of healthy women and were not identified in any sample of sick women. All other chemicals listed in the two tables appeared in samples of both sick and healthy women with similar probability. This may suggest that the absence of these two compounds in the urine can also serve as a potential biomarker for CIN I stage cervical dysplasia, if both are absent while the four potential biomarkers described above appear in the urine.

Conclusions
Potential biomarkers in the urine of stage CIN I of cervical dysplasia were identified. Urine sample analysis was carried out using GC-MS. The data obtained were analyzed using an ANN based statistical approach. The resultant models yield very good separation of the data into two groups: healthy and sick women. The accuracy of the model was examined by two validation methods, both appropriate for examining small statistical groups of data. The validation clearly shows that the ANN-based model is a highly reliable one.
Implementation of the ANN method to the GC-MS data was carefully analyzed. Most of the potential biomarkers in urine samples of sick women that were identified by the ANN method are of environmental or dietary origin. However, the ANN analysis identified four potential biomarkers that are produced by the body. These compounds constitute a sub-set of urine related VOCs, all produced in the body in oxidative stress related processes, and have a high probability of serving as biomarkers for the CIN I stage of cervical dysplasia. These compounds are (see Table 2 for references):
1. 3-Hexanone: a product of the lipid peroxidation processes in the body. It has been detected, but not quantified, as a breast cancer biomarker in urine.
2. Hexanal: an alkyl aldehyde found in human biofluids including milk samples. It is a mediator of oxidative stress. Hexanal is a volatile compound that has been associated with the development of undesirable flavors. The content of hexanal, which is a major breakdown product of linoleic acid (LA, n-6 PUFA) oxidation, has been used to follow the course of lipid oxidation. It is a product of the lipid peroxidation process in the body and can be found normally in urine and in cerebrospinal fluid. Abnormal concentrations can be found in urine, blood and exhaled breath in lung, liver and breast cancers.
3. Dodecane, 4-methyl-: this compound has been reported as a biomarker of tuberculosis in a number of publications.
4. 3-Ethylcyclopentanone: this compound has been reported as a VOC found in the urine whose concentration increases in case of oxidative stress.
In addition, two chemicals were identified as indicative of urine samples of healthy women and were absent in all samples of the sick women. These compounds are 2,7-Dimethyloxepin and 1-Butene, 4-isothiocyanato-, and their absence in a urine sample may be considered a potential biomarker for CIN I cervical dysplasia, provided that the four potential biomarkers listed above are identified in the urine. Most of the potential biomarkers identified here for CIN I stage cervical dysplasia have been observed, separately, in previous investigations of different cancers. However, when identified as a group, this set of VOCs, together with the two VOCs found only in urine of healthy women, is suggested to be highly suitable for the identification of CIN I cervical dysplasia. This set of biomarkers could be used as a simple, non-invasive and inexpensive screening procedure for CIN I stage cervical dysplasia that complements the identification procedures used at present. A last point to be noted is that none of the potential biomarkers identified in the present study were observed in urine samples of smokers or in the urine of individuals who used to smoke in the past [34,35]. The main drawback of the present study is the small number of urine samples examined for both sick and healthy women. This limitation, as stated in the introduction section, is mainly due to the lack of appropriate financial support for the present investigation. However, the very good results obtained in the validation suggest that these biomarkers have a high potential to become extremely useful.
The reliability of these biomarkers has to be further proven in more extensive studies (with a larger number of urine samples from sick and healthy women), which are planned for the future.