ISSN: 2153-0602
Journal of Data Mining in Genomics & Proteomics

Visualization of High Throughput Genomic Data Using R and Bioconductor

Ruchi Yadav and Prachi Srivastava*

Amity Institute of Biotechnology, Amity University, Uttar Pradesh, Lucknow, India

*Corresponding Author:
Prachi Srivastava
Assistant Professor
Department of Biotechnology
Amity University, Uttar Pradesh
Lucknow, India
Tel: +919453141916
E-mail: [email protected], [email protected]

Received date: May 02, 2016; Accepted date: May 17, 2016; Published date: May 26, 2016

Citation: Yadav R, Srivastava P (2016) Visualization of High Throughput Genomic Data Using R and Bioconductor. J Data Mining Genomics Proteomics 7:197. doi: 10.4172/2153-0602.1000197

Copyright: © 2016 Yadav R, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Abstract

DNA microarray technology measures mRNA levels in particular cells or tissues for many genes simultaneously. Microarray experiments in molecular biology produce huge datasets that need rigorous computational analysis to extract biological information and reach conclusions. Every step, from printing of the microarray chip to hybridization and scanning, introduces variability in data quality, so that the actual signal is either lost or over-represented. Computational analysis therefore plays an important part in processing the biological information embedded in microarray results and in comparing gene expression results obtained from different samples under different conditions for biological interpretation. A basic yet challenging task is quality control and visualization of microarray gene expression data. One of the most popular platforms for microarray analysis is Bioconductor, an open source and open development software project for the analysis and comprehension of genomic data, based on the R programming language. This paper describes specific procedures for conducting quality assessment of Affymetrix GeneChip arrays using data from the GEO database (GSE53890) and describes the quality control packages of Bioconductor with reference to visualization plots for detailed analysis. It can help any researcher working on microarray analysis with quality control of Affymetrix chips and with scientific interpretation.

Keywords

Microarray; R; Bioconductor; Transcriptomics; Quality control; Genome visualization

Introduction

In the context of the human genome project, new technologies emerged that facilitate the parallel execution of experiments on a large number of genes simultaneously. The measurement of transcriptional activity in living cells is of fundamental importance in many fields of research from basic biology to the study of complex diseases such as cancer [1]. The so-called DNA microarrays, or DNA chips, constitute a prominent example. This technology aims at the measurement of mRNA levels in particular cells or tissues for many genes at once [2]. DNA microarrays provide an instrument for measuring the mRNA abundance of tens of thousands of genes. Currently, the measurements are based on mRNA from samples of hundreds to millions of cells, thus expression estimates provide an ensemble average of a possibly heterogeneous population [3].

Gene expression profiling provides unprecedented opportunities to study patterns of gene expression regulation, for example, in diseases or developmental processes. DNA microarray technology takes advantage of the hybridization properties of nucleic acids and uses complementary molecules attached to a solid surface, referred to as probes. Single strands of complementary DNA for the genes of interest, which can number many thousands, are immobilized on spots arranged in a grid (array) on a support, which will typically be a glass slide, a quartz wafer, or a nylon membrane. From a sample of interest, e.g. a tumor biopsy, the mRNA is extracted, labeled and hybridized to the array [4]. Measuring the quantity of label on each spot then yields an intensity value that should be correlated to the abundance of the corresponding RNA transcript in the sample [5]. Microarrays provide a rich source of data on the molecular working of cells. Each microarray reports on the abundance of tens of thousands of mRNAs. Virtually every human disease is being studied using microarrays with the hope of finding the molecular mechanisms of disease [6-8].

Bioinformatics analysis plays an important part in processing the information embedded in large-scale expression profiling studies and in laying the foundation for biological interpretation [8-10].

A basic, yet challenging task in the analysis of microarray gene expression data is the identification of changes in gene expression that are associated with particular biological conditions. Careful statistical design and analysis are essential to improve the efficiency and reliability of microarray experiments throughout the data acquisition and analysis process [11-12].

Microarray Studies

Microarrays are useful in a wide variety of studies with a wide variety of objectives. Many of these objectives fall into the following categories [13,14].

a) A typical microarray experiment is one that looks for genes differentially expressed between two or more conditions, that is, genes which behave differently in one condition (for instance healthy [or untreated or wild type] cells) than in another (for instance tumor [or treated or mutant] cells). These are known as class comparison experiments.

b) When the emphasis is on developing a statistical model that can predict to which class a new individual belongs, we have a class prediction problem. Examples of this are predicting the response to a treatment (e.g. classes are responder and non-responder) or the evolution of a disease (e.g. relapsed or cured).

c) Sometimes the objective is the identification of novel sub-types of individuals within a population. For example it has been shown that certain types of leukemia present some subclasses that are very hard to distinguish morphologically but which can be classified using gene expression. This is an example of class discovery.

d) Pathway analysis studies are those that try to find genes whose co-regulation reflects their participation in common or related biochemical processes.

One of the most popular platforms for microarray analysis is Bioconductor, an open source and open development software project for the analysis and comprehension of genomic data, based on the R programming language. This paper describes specific procedures for conducting quality assessment of Affymetrix GeneChip arrays, using data from the GEO database and the open-source R programming environment in conjunction with the open-source Bioconductor software [15].

Bioconductor and R

R is a programming language. The name “R” comes from the first initial of its two authors, Robert Gentleman and Ross Ihaka. R was introduced in 1991, and R 1.0.0 was released in 2000. Bioconductor has become a boon to the life sciences and to high throughput experiments, since its analysis tools are available free of cost. Bioconductor version 2.4 was released in 2009, and Bioconductor releases follow the R release schedule. At the time of writing, the current Bioconductor version is 3.2 and the R version is 3.2.2 [16].

The R environment is easy to use and coherent, and it provides tools for data analysis. What sets R apart is the combination of a GUI for quick and easy loading of data with tools for data manipulation, calculation and analysis; its statistical tools cover standard deviation, variance, the t-test, the F-test and other methods. R can be obtained from http://cran.r-project.org/ [17].

Bioconductor (www.bioconductor.org) provides bioinformatics tools for analyzing high throughput data from experiments such as microarray, SAGE, MS and MS-MS. Bioconductor packages are divided into three categories: Software, Annotation Data and Experiment Data. Currently there are 1104 software packages, 898 annotation data packages and 257 experiment data packages, so it is hard to identify the right package for a given set of experiments. This paper reviews the methods for visualization of Affymetrix gene expression data. These steps are an essential part of microarray data analysis and should be carried out before the data are used for differential expression analysis [18].

Microarray data contain many experimental errors arising from bias in dye intensities, the laser scanner, spotting errors and hybridization bias. Before analysis, microarray data must be cleaned and processed so that biological information can be extracted. Visualization and graphical representation are well suited to studying microarray intensity files and comparing probe hybridization signals [19].

Evaluation of data quality is a major issue in microarray analysis, and many packages can be used for quality control. Table 1 lists the Bioconductor packages that are used to assess chip quality and to visualize high throughput microarray data [20].

S.No  Package              Description
1     a4                   Automated Affymetrix Array Analysis
2     a4Base               Analysis of Affymetrix microarray experiments
3     a4Core               Automated Affymetrix Array Analysis
4     a4Preproc            Package for preprocessing of microarray data
5     a4Reporting          Automated Affymetrix Array Analysis Reporting Package
6     affxparser           Package for parsing Affymetrix files (CDF, CEL, CHP, BPMAP, BAR)
7     affy                 Exploratory oligonucleotide array analysis
8     affycomp             Compare expression measures for Affymetrix Oligonucleotide Arrays
9     affyContam           Affymetrix CEL file data
10    affycoretools        Analyses with Affymetrix GeneChips
11    AffyExpress          Quality assessment and identification of differentially expressed genes for Affymetrix chips
12    affyio               Parsing Affymetrix data files
13    affylmGUI            A Graphical User Interface for analysis of Affymetrix microarray gene expression data using the affy and limma packages
14    affyPara             Oligonucleotide array analysis
15    affyPLM              Quality assessment tools for Affymetrix chips
16    affyQCReport         Quality of a set of Affymetrix arrays
17    AffyRNADegradation   Assessment and correction of RNA degradation effects in Affymetrix 3' expression arrays
18    AffyTiling           Extraction and annotation of individual probes from Affymetrix tiling arrays
19    annmap               Deep sequencing data analysis
20    arrayQualityMetrics  Microarray quality metrics reports
21    ArrayTools           Quality assessment and detection of differentially expressed genes for Affymetrix GeneChips
22    ExpressionView       Visualizes possibly overlapping biclusters in a gene expression matrix
23    limma                Linear Models for Microarray Data
24    MiChip               Microarray platform using locked oligonucleotides for the analysis of the expression of microRNAs in a variety of species
25    simpleaffy           High level analysis of Affymetrix data
26    Starr                Affymetrix tiling arrays (ChIP-chip data)
27    yaqcaffy             Quality control of Affymetrix GeneChip expression data
28    annmap               Genome annotation and visualisation package pertaining to Affymetrix arrays

Table 1: Quality control packages of Bioconductor.

After quality control analysis and normalization, differentially expressed genes can be identified for biological interpretation. Table 2 lists the differential expression packages available in Bioconductor [21].

S.No  Name of Package   Description
1     DEGseq            Identifies differentially expressed genes from RNA-seq data
2     DEGreport         Report of DEG analysis
3     DEGraph           Two-sample tests on a graph
4     DEDS              Differential expression via distance summary for microarray data
5     DESeq2            Differential expression analysis based on the negative binomial distribution
6     DEXSeq            Inference of differential exon usage in RNA-seq
7     dexus             Identifies differential expression in RNA-seq studies with unknown conditions or without replicates
8     derfinder         Annotation-agnostic differential expression analysis of RNA-seq data at base-pair resolution
9     derfinderPlot     Plotting functions for derfinder
10    DiffBind          Differential binding analysis of ChIP-seq peak data
11    diffGeneAnalysis  Performs differential gene expression analysis

Table 2: DEG analysis packages of Bioconductor.

Materials and Methods

The raw data for this study were retrieved from the Gene Expression Omnibus (GEO) database: http://www.ncbi.nlm.nih.gov/geo/; GEO accession: GSE53890; http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE53890. GSE53890 is a microarray experiment on REST and stress resistance in aging and Alzheimer’s disease (Figure 1) [22].


Figure 1: Gene expression Omnibus database.

Result and Conclusion

Six CEL files were selected from the GSE53890 data, three from female and three from male samples; the chip used in this experiment is HG-U133_Plus_2. A gene expression file was created using the affy package of Bioconductor, visualization packages were used for quality control analysis, and the output generated by these packages is summarized below [23].

Gene expression file

The gene expression file is created using the affy package.

> library(affy)

> array = ReadAffy(widget=TRUE)   # select the CEL files interactively

> eset = rma(array)               # RMA preprocessing to expression values

> write.exprs(eset, file="array.txt")

Figure 2 shows the gene expression file and its expression values. The commands above create this file in R; it can then be analyzed either in R or in other software, such as MeV, that does not accept .cel and .cdf files as input. Alternatively, the expression values can be stored as an R object using > expr_matrix = exprs(eset)

Figure 2: The gene expression file and expression values.

The object expr_matrix can then be used directly in R to inspect the expression values and analyze the intensities.
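As a minimal sketch of how such a tab-delimited expression file can be read back into R, the example below writes a toy matrix to a temporary file and loads it again. The probe set names, array names and values are made up for illustration, the temporary file stands in for array.txt, and the layout (array names in the header row, probe set IDs in the first column) is an assumption about how the file was written.

```r
# Toy expression matrix: rows are probe sets, columns are arrays (values are made up).
expr <- matrix(c(7.1, 7.3, 6.9,
                 10.2, 10.0, 10.4),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("probe_A", "probe_B"),
                               c("chip1.CEL", "chip2.CEL", "chip3.CEL")))

# Write a tab-delimited file with row names, as a stand-in for array.txt.
path <- tempfile(fileext = ".txt")
write.table(expr, file = path, quote = FALSE, sep = "\t", col.names = NA)

# Read it back: the first column holds the probe set IDs.
expr_in <- as.matrix(read.delim(path, row.names = 1, check.names = FALSE))

dim(expr_in)            # number of probe sets and arrays
summary(expr_in[, 1])   # intensity summary for the first array
```

The same read.delim call can be pointed at a real expression file, provided it follows the assumed layout.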

Visualization plots

The quality control packages provide a number of built-in plots for visualizing and analyzing chips; these plots are also used to compare chips before and after normalization. Here we describe some of these plots and their interpretation.

Boxplot

A boxplot, also called a box-and-whisker plot, is a statistical plot that displays numerical values graphically for comparison of chips. Five values are plotted for each chip: the minimum, lower quartile, median, upper quartile and maximum; the chips are drawn in parallel so that their intensities can be compared. A boxplot displays variation in a sample without assuming any statistical distribution. The spacing between the lines of a box indicates the spread, or degree of dispersion, of the intensity values of one sample. Boxplots can also be created for individual variables in R using the boxplot function, and the fivenum command returns the five values (Figure 3).


Figure 3: The Boxplot of gene expression file.

> library(affyQCReport)

> affyQAReport(array)

A folder named affyQA is created in the current R working directory. This folder contains the individual graph files.
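The five-number summary and the boxplot described in this section can be sketched in base R as follows. The intensities are simulated, not taken from GSE53890, and the array names are hypothetical; note that fivenum reports Tukey's hinges, which are close to but not always identical to the quartiles.

```r
set.seed(1)
# Simulated log2 intensities for 3 arrays (made-up data, not GSE53890).
chips <- data.frame(chip1 = rnorm(1000, mean = 8, sd = 2),
                    chip2 = rnorm(1000, mean = 8, sd = 2),
                    chip3 = rnorm(1000, mean = 9, sd = 3))

# Five-number summary per array: minimum, lower hinge, median, upper hinge, maximum.
five <- sapply(chips, fivenum)
rownames(five) <- c("min", "lower", "median", "upper", "max")
print(round(five, 2))

# Side-by-side boxplots; an array whose box is shifted or stretched stands out.
boxplot(chips, ylab = "log2 intensity", main = "Simulated arrays")
```

In this simulation chip3 was given a higher mean and a wider spread, so its box sits higher and is taller than the other two.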

Intensity plot

The intensity plot is similar to the boxplot but gives a more detailed view: the x-axis represents probe intensity and the y-axis the density of probes at that intensity. Figure 4 shows the intensity plot for the 6 CEL files. Any array whose intensity curve is very different from those of the other arrays is considered problematic.


Figure 4: The intensity plot of the 6 CEL files.
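An intensity (density) plot of this kind can be sketched in base R by overlaying one density curve per array. The data are simulated and the array names hypothetical; one array is deliberately shifted to show how an outlying curve looks.

```r
set.seed(2)
# Simulated log2 intensities; chip3 is deliberately shifted to mimic a problem array.
chips <- list(chip1 = rnorm(2000, 8, 1.5),
              chip2 = rnorm(2000, 8, 1.5),
              chip3 = rnorm(2000, 10, 1.5))

dens <- lapply(chips, density)

# Overlay the curves: x-axis is log2 intensity, y-axis is estimated density.
xlim <- range(sapply(dens, function(d) range(d$x)))
ylim <- c(0, max(sapply(dens, function(d) max(d$y))))
plot(dens[[1]], xlim = xlim, ylim = ylim,
     main = "Simulated arrays", xlab = "log2 intensity")
for (i in seq_along(dens)[-1]) lines(dens[[i]], col = i)
legend("topright", legend = names(dens), col = seq_along(dens), lty = 1)

# Location of each curve's peak; the outlying array has a clearly different mode.
modes <- sapply(dens, function(d) d$x[which.max(d$y)])
print(round(modes, 2))
```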

RNA degradation plot

The RNA degradation plot is used to assess the quality of the RNA hybridized to the chips. Affymetrix probes are designed toward the 3’ end of the mRNA molecule, and because RNA degradation starts from the 5’ end, probe intensity should be lower at the 5’ end of a probe set than at the 3’ end. The plot is used for quality analysis of the probes on Affymetrix chips (Figure 5).


Figure 5: RNA degradation plot.
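The idea behind the RNA degradation plot can be illustrated with a simplified base R sketch: average the intensity at each probe position across probe sets and look at the trend from the 5' to the 3' end. The data are simulated, the number of probes per probe set (11) and the built-in trend are assumptions, and this is not the exact computation performed by the Bioconductor packages.

```r
set.seed(3)
n_sets <- 500    # probe sets
n_pos  <- 11     # probes per probe set, ordered from the 5' end to the 3' end

# Simulated log2 intensities with a mild upward trend toward the 3' end.
position <- 0:(n_pos - 1)
pm <- matrix(rnorm(n_sets * n_pos, mean = 8, sd = 1), nrow = n_sets) +
  matrix(0.15 * position, nrow = n_sets, ncol = n_pos, byrow = TRUE)

# Mean intensity at each probe position, averaged over all probe sets.
mean_by_pos <- colMeans(pm)

# Slope of mean intensity against position: a summary of the 5'-to-3' trend.
fit <- lm(mean_by_pos ~ position)
slope <- unname(coef(fit)["position"])
print(round(slope, 3))

plot(position, mean_by_pos, type = "b",
     xlab = "probe position (5' to 3')", ylab = "mean log2 intensity")
```

Arrays whose slope differs markedly from that of the other arrays in the same experiment would deserve a closer look.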

MA plot

In an MA plot, M stands for “minus” and A for “add” (the average); both are computed from the red (R) and green (G) intensities of a two-channel microarray experiment.

Log ratio: $M = \log_2(R/G) = \log_2 R - \log_2 G$

Average log intensity: $A = \log_2 \sqrt{RG} = \tfrac{1}{2}(\log_2 R + \log_2 G)$

The MA plot is used to determine whether there is any bias between the red and green dye intensities, or other systematic or instrumental errors, in the experiment, and so to visualize the need for normalization before any analysis.

The MA plot thus displays intensity-dependent bias and indicates the extent of error in a microarray experiment. Figure 6 shows the MA plots between the CEL files and the variation in their intensities; for single-channel Affymetrix arrays, M and A are computed from the intensities of two arrays in place of the red and green channels.


Figure 6: MA plot between CEL files.
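The M and A definitions can be checked numerically with a short base R sketch. The intensities are simulated, and the constant bias given to G is an assumption made only so that the plot shows a visible offset from M = 0.

```r
set.seed(4)
# Simulated raw intensities for two channels (or two arrays); values are made up.
R <- 2^rnorm(1000, mean = 9, sd = 1.5)
G <- R * 2^rnorm(1000, mean = 0.3, sd = 0.4)   # G carries a constant bias relative to R

M <- log2(R) - log2(G)          # log ratio
A <- (log2(R) + log2(G)) / 2    # average log intensity

# The two forms of each definition agree numerically.
stopifnot(isTRUE(all.equal(M, log2(R / G))),
          isTRUE(all.equal(A, log2(sqrt(R * G)))))

plot(A, M, pch = 20, cex = 0.5)
abline(h = 0, lty = 2)               # unbiased data would scatter around this line
abline(h = median(M), col = "red")   # observed centre of the cloud
print(round(median(M), 2))
```

A cloud centred away from the dashed line, or one that bends with A, is the kind of pattern that signals a need for normalization.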

NUSE plot

Normalized Unscaled Standard Error (NUSE) is based on the standard error estimated for each probe set on each chip. The standard errors are normalized so that the median for each probe set across all arrays is one. The plot shows the variation between arrays; an array with higher standard errors than the others is of poor quality and can be rejected from further analysis for biological interpretation. Figure 7 shows the NUSE plot.


Figure 7: NUSE plot.
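The normalization step behind NUSE can be sketched in base R, assuming a matrix of standard errors with one row per probe set and one column per array is already available. Here that matrix is simulated, the array names are hypothetical, and one array is made noisier on purpose.

```r
set.seed(5)
n_sets <- 1000
# Simulated standard errors (probe sets x arrays); chip4 is made noisier on purpose.
se <- matrix(rgamma(n_sets * 4, shape = 20, rate = 100), nrow = n_sets,
             dimnames = list(NULL, paste0("chip", 1:4)))
se[, "chip4"] <- se[, "chip4"] * 1.5

# Divide each probe set's standard errors by their median across arrays.
nuse <- se / apply(se, 1, median)

boxplot(as.data.frame(nuse), ylab = "NUSE")
abline(h = 1, lty = 2)

# Per-array median NUSE; a value well above 1 points to a lower-quality array.
nuse_median <- apply(nuse, 2, median)
print(round(nuse_median, 2))
```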

RLE plot

The Relative Log Expression (RLE) plot displays RLE values, which are calculated for each probe set by comparing its expression on each array against the median expression of that probe set across all arrays. Assuming that most genes do not change in expression across arrays, RLE values should be near 0. The RLE values are plotted as boxplots, one per chip, which indicate the quality of the chips.

Figure 8 shows the RLE plot of the chips as boxplots; for good-quality chips the boxes are centred near 0. The plot shows the variation between chips and is used for quality assessment.


Figure 8: RLE plot showing the variation between chips, used for quality assessment.
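The RLE calculation can be sketched in base R as the log2 expression of each probe set minus its median across arrays. The expression matrix is simulated, the array names are hypothetical, and one array is shifted artificially to show how a deviating chip appears.

```r
set.seed(6)
n_sets <- 1000
# Simulated log2 expression (probe sets x arrays); chip4 gets an artificial shift.
base <- rnorm(n_sets, mean = 8, sd = 2)   # per-probe-set expression level
expr <- sapply(1:4, function(j) base + rnorm(n_sets, sd = 0.2))
colnames(expr) <- paste0("chip", 1:4)
expr[, "chip4"] <- expr[, "chip4"] + 0.5

# RLE: each value minus the median of its probe set across arrays.
rle_vals <- expr - apply(expr, 1, median)

boxplot(as.data.frame(rle_vals), ylab = "RLE")
abline(h = 0, lty = 2)

# Per-array median RLE; boxes of comparable arrays should be centred near 0.
rle_median <- apply(rle_vals, 2, median)
print(round(rle_median, 2))
```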

QC stats

The QC plot is recommended by Affymetrix. Any array shown in red indicates a problem, whereas blue indicates that the array is within the limits of the scale factors, i.e. within 3-fold. Figure 9 represents the QC plot of the arrays.


Figure 9: QC plot.
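The 3-fold rule for scale factors can be illustrated with a small base R check. The scale factors below are made-up numbers, and centring the 3-fold window on the median scale factor is an assumption of this sketch rather than a description of how any particular package draws its limits.

```r
# Hypothetical scale factors for six arrays (made-up numbers for illustration).
sf <- c(chip1 = 0.9, chip2 = 1.1, chip3 = 1.0, chip4 = 1.3, chip5 = 0.8, chip6 = 3.5)

# Express each scale factor as a fold difference from the median, on the log2 scale.
log_fold <- log2(sf / median(sf))

# Flag arrays lying outside a 3-fold window centred on the median,
# i.e. more than sqrt(3)-fold above or below it.
half_window <- log2(3) / 2
flagged <- abs(log_fold) > half_window
print(names(sf)[flagged])   # prints "chip6"
```

If every array lies inside the window, any two scale factors differ by less than 3-fold.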

Conclusion and Discussion

This paper reviews the different quality control plots used for visualizing high throughput microarray data and summarizes the Bioconductor packages that are used for microarray quality control and differential expression analysis. Any researcher working on microarray analysis can use this paper as a guide to microarray quality control.

Acknowledgements

This acknowledgement is not written merely to follow custom but to express and record our heartfelt thanks to all those who directly or indirectly helped us in this work. We further wish to acknowledge the bioinformatics tools used in conducting this study.

References
