Received date: May 24, 2017; Accepted date: June 14, 2017; Published date: June 19, 2017
Citation: Kondo Y, Hayashi C, Miyazaki S (2017) Comparative Analysis of Intronic Noncoding RNA Genes among Organisms. J Mol Genet Med 11:271 doi:10.4172/1747-0862.1000271
Copyright: © 2017 Kondo Y, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Advances in sequencing techniques have allowed genomic sequences to be determined in many organisms. A sequenced genome contains not only protein-coding genes but also noncoding RNA (ncRNA) genes. Analyzing the evolutionary histories of such genes may help to estimate future evolutionary directions. Meanwhile, recent studies have shown that some ncRNA genes are located within protein-coding genes, which are then called host genes. We considered that such information can help in discussing gene evolution. In this study, we constructed a database for analyzing the evolution of protein-coding and noncoding genes based on gene locations in genomic sequences. We found that 547 of 2,691 human host genes are orthologous to 546 of 1,633 mouse host genes. These orthologous host genes are involved in similar biological functions, whereas some non-orthologous host genes have different functions. For example, non-orthologous host genes in human are annotated with neuron-related terms but those in mouse are not. Similarity searches for intronic microRNA (miRNA) genes between human and mouse showed that 85 of the orthologous host genes have retained miRNA genes in their intronic regions, and 64 of these have retained intronic miRNA genes among human, mouse and rat. These results suggest that some orthologous genes have retained ncRNA genes in their intronic regions during evolution.
Genome; Protein coding region; Gene evolution; Introns; Database
A genomic DNA sequence contains regions that are involved in gene expression. Such regions are scattered along the DNA sequence; they can, for example, be transcribed into RNA molecules or regulate gene expression. The conversion from a DNA sequence to an RNA sequence is called transcription, which usually makes an exact copy of the DNA sequence. This information flow is one of the important steps in gene expression and produces transcripts that play a variety of roles. Some transcripts function in their unprocessed form, whereas others are further processed, so that they are no longer an exact copy of the DNA sequence. One representative type of processing is the removal of stretches of RNA sequence, as in splicing, in which the region between two exons (short for expressed regions) is removed from the precursor messenger RNA (pre-mRNA). The spliced-out region is called an intron (short for intragenic region). Splicing thus creates the mature form (mRNA) from the immature form (pre-mRNA). The mature mRNA is then translated into an amino acid sequence based on a codon usage table. Therefore, a DNA sequence that is eventually converted into an amino acid sequence is called a protein-coding sequence. This is an essential part of the expression of protein-coding genes.
Meanwhile, genes other than protein-coding genes also exist. Such genes are grouped as noncoding RNA (ncRNA) genes [3,4]. Some ncRNAs play important roles in protein biosynthesis, for example small nuclear RNA (snRNA), small Cajal body-associated RNA (scaRNA), ribosomal RNA (rRNA), transfer RNA (tRNA) and small nucleolar RNA (snoRNA). snRNAs, which are modified by scaRNAs, are components of the spliceosome, which consists of many proteins and five snRNAs and carries out splicing of pre-mRNAs. A tRNA delivers an amino acid to an mRNA on a ribosome, which elongates the amino acid sequence. The ribosome includes rRNAs derived from precursor rRNAs (pre-rRNAs), which are processed and modified by snoRNAs. In addition, some ncRNAs regulate the expression of protein-coding genes. MicroRNAs (miRNAs) play a role in silencing protein-coding genes, and some miRNAs are involved in brain- or neuron-related diseases such as Alzheimer’s disease. Long ncRNAs (lncRNAs), which are longer than 200 bases, can serve as molecular signals, and some lncRNAs are upregulated or downregulated in cancers [9,10]. Piwi-interacting RNAs (piRNAs) are responsible for epigenetic inheritance in germline cells and in somatic cells bordering the germline. Thus, ncRNAs, which are classified into many subcategories, have many functions, such as working with proteins and regulating gene expression.
Understanding how genes have evolved is important for predicting how they may evolve in the future. Evolutionary lineages among protein-coding genes have been well discussed. However, it is not well known how ncRNA genes have evolved. One reason is that evolutionary analysis is more difficult for ncRNA genes than for protein-coding genes, because alignment accuracy for ncRNA genes may be lower than for amino acid sequences [13,14]. Unlike alignments of protein-coding genes, sequence alignments of ncRNA genes cannot use information from a codon usage table. Therefore, it is difficult to identify the important regions of an ncRNA gene. In addition, the lower accuracy may be caused by the fact that there are fewer base types than amino acid types.
Meanwhile, some ncRNA genes are located in intronic regions. A protein-coding gene that includes an ncRNA gene in an intronic region is called a host gene. Host genes and intronic ncRNA genes are of interest for investigating how transcription is regulated, because such an ncRNA can be transcribed simultaneously with its host gene. The relationship between an intronic ncRNA gene and its host gene is therefore useful for discussing how gene expression is regulated. The expression of intronic ncRNA genes is regulated by several mechanisms. Some intronic ncRNA genes depend on transcription of their host genes; such an ncRNA gene shares a transcription unit with the host gene. On the other hand, some intronic ncRNA genes are regulated by an independent transcription unit with its own promoter. These mechanisms suggest that the expression regulation of some ncRNA genes depends on their locations in genomic DNA sequences. Studies of intronic ncRNA genes have shown that many ncRNA genes are located in intronic regions [18-21]. Therefore, we should classify ncRNA genes based on gene locations in genomic sequences.
In this study, we construct a database that classifies ncRNA genes based on gene locations in genomic sequences by collecting genomic information for several model organisms. The database also stores orthologous relationships between protein-coding genes. We then summarize statistics for coding and noncoding genes and for orthologous relationships between organisms. Moreover, focusing on the host genes of intronic ncRNA genes, we investigate which functions are enriched in the host genes. Furthermore, sequences of ncRNA genes are compared in order to investigate whether host genes have retained ncRNA genes in their intronic regions. We discuss whether gene locations in genomic sequences are effective for discussing gene evolution.
Classification for ncRNA genes based on mRNA-transcribed locations
We propose a new classification for ncRNA genes based on gene locations in a genomic sequence. An outline of the classification is shown in Figure 1. The DNA sequence in Figure 1 is separated into two kinds of regions: pre-mRNA-transcribed regions and intergenic regions. This separation divides ncRNA genes into three categories.
1. Intergenic (located on an intergenic region)
2. Intronic (located on a pre-mRNA-transcribed region)
3. Sense (located on a boundary region)
Figure 1: Classification for ncRNA genes based on mRNA-transcribed regions in a DNA sequence. ncRNA genes are classified into three categories: (1) intergenic, (2) intronic and (3) sense. An intergenic ncRNA gene is located on a region not transcribed into precursor mRNAs. An intronic ncRNA gene is located on a region transcribed into a precursor mRNA. A sense ncRNA gene is located on a boundary region between an intergenic region and a precursor mRNA-transcribed region.
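The three categories can be illustrated with a minimal coordinate-based sketch. It is not the authors' implementation: it assumes that each gene is reduced to an inclusive (start, end) interval on one chromosome, that "located on a pre-mRNA-transcribed region" means full containment in a coding gene's span, that "boundary region" means any partial overlap, and that strand is ignored. The function name classify_ncrna is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) on a single chromosome


def classify_ncrna(ncrna: Interval, coding_genes: List[Interval]) -> str:
    """Return 'intronic', 'sense' or 'intergenic' for one ncRNA gene.

    Assumption: containment in a coding gene's span stands in for
    'located on a pre-mRNA-transcribed region'; strand is ignored.
    """
    start, end = ncrna
    overlaps_any = False
    for g_start, g_end in coding_genes:
        if g_start <= start and end <= g_end:
            return "intronic"      # fully inside a transcribed region
        if start <= g_end and g_start <= end:
            overlaps_any = True    # partial overlap: spans a boundary
    return "sense" if overlaps_any else "intergenic"


coding = [(100, 500)]
print(classify_ncrna((150, 180), coding))  # inside the gene span
print(classify_ncrna((50, 120), coding))   # crosses the gene start
print(classify_ncrna((10, 40), coding))    # no overlap at all
```

Checking containment before partial overlap means that an ncRNA gene contained in one coding gene is labeled intronic even if it also crosses the boundary of another; a different priority would be an equally defensible choice.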
Construction of a database based on gene locations
The database consists of 9 tables shown in Figure 2. The database was designed to store all the data downloaded from the Ensembl genome database (release 87). The ‘gene sets’ table contains the data regarding gene sets downloaded from Ensembl in GTF format. This table only includes the data whose feature is ‘transcript’. The ‘ncRNA’ table contains the data regarding ncRNA genes downloaded from Ensembl in FASTA format. The ‘intronic ncRNA’ table contains information regarding the host gene that includes each intronic ncRNA gene. The ‘sense ncRNA’ table contains information regarding the protein-coding gene that overlaps each sense ncRNA gene. The ‘category’ table stores the three categories of ncRNA genes. The ‘organism’ table stores 6 organisms: Homo sapiens (human), Mus musculus (mouse), Rattus norvegicus (rat), Drosophila melanogaster (fruit fly), Caenorhabditis elegans (nematode) and Saccharomyces cerevisiae (yeast). The ‘coding gene’ table contains the data regarding coding genes extracted from the ‘gene sets’ table. The ‘ortholog’ table contains the data regarding orthologs downloaded from Ensembl BioMart. The ‘type’ table contains orthologous types such as one2one, one2many and many2many.
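The restriction of the ‘gene sets’ table to ‘transcript’ features can be sketched as a filter over GTF lines. GTF is a tab-separated format with nine columns whose third column is the feature type; everything else here (the function name transcript_rows, the choice of returned fields and the sample records) is an assumption made for illustration and not the authors' loader.

```python
def transcript_rows(gtf_lines):
    """Yield (seqname, start, end, strand, attributes) for 'transcript' features."""
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GTF record
        seqname, _source, feature, start, end, _score, strand, _frame, attrs = fields
        if feature == "transcript":
            yield seqname, int(start), int(end), strand, attrs


# Made-up records; real Ensembl GTF lines carry more attributes.
sample = [
    '#!genome-build example\n',
    '1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n',
    '1\tsrc\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    '1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]
for row in transcript_rows(sample):
    print(row)
```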
Gene set enrichment analysis for host genes
We conducted gene set enrichment analyses (GSEA) by Gene Ontology (GO) terms using the GOstats package in Bioconductor. All host genes and coding genes in human and mouse were extracted from the database. Then, the host genes were divided into orthologous and non-orthologous genes. We conducted GSEA of the host genes compared with all coding genes in human and mouse. In addition, we extracted orthologous host genes possessing intronic miRNAs detected by the BLAST searches described below, and conducted GSEA of these genes compared with all orthologous host genes in human and mouse. In each GSEA, the detected GO terms were visualized with the tagcloud R package. We investigated whether the GO terms are identical or similar between human and mouse.
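GOstats tests for over-representation with the hypergeometric distribution, so the p-value of a single GO term can be sketched as an upper-tail probability. The sketch below is a plain-Python illustration, not the GOstats code; the function and parameter names are hypothetical, and it does not model options such as GOstats' conditional testing.

```python
from math import comb


def hypergeom_pvalue(n_universe, n_annotated, n_selected, n_hits):
    """P(X >= n_hits) for one GO term.

    n_universe  : genes in the background (e.g. all coding genes)
    n_annotated : background genes annotated with the term
    n_selected  : genes in the tested set (e.g. host genes)
    n_hits      : tested genes annotated with the term
    """
    upper = min(n_annotated, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total


# Toy numbers only: 10 background genes, 5 annotated, 5 selected, all 5 hit.
print(hypergeom_pvalue(10, 5, 5, 5))
```

With the counts reported here, the background would be, for example, the 19,961 human coding genes and the tested set the 2,144 non-orthologous human host genes; the per-term annotation counts are not given in the text.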
Sequence comparisons for intronic miRNA genes
Sequences of all intronic miRNA genes were extracted from the database. BLASTN searches [26,27] were conducted for all pairwise combinations of these genes with the word size set to 14. We then investigated whether the host genes of each detected pair of intronic miRNA genes are orthologous or not.
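One way to set up such a search is sketched below, assuming the NCBI BLAST+ command-line blastn program; the text does not state which BLAST implementation was used. The options -query, -subject, -word_size, -evalue and -outfmt are BLAST+ options, whereas the FASTA file names and the helper blastn_command are hypothetical. The sketch only builds the argument list and does not run BLAST.

```python
def blastn_command(query_fasta, subject_fasta, word_size=14, evalue=None):
    """Build an argument list for a BLAST+ blastn run with tabular output."""
    cmd = [
        "blastn",
        "-query", query_fasta,      # e.g. intronic miRNA genes of one organism
        "-subject", subject_fasta,  # e.g. intronic miRNA genes of the other
        "-word_size", str(word_size),
        "-outfmt", "6",             # 12-column tab-separated hits
    ]
    if evalue is not None:
        cmd += ["-evalue", str(evalue)]
    return cmd


# Hypothetical file names.
print(" ".join(blastn_command("human_intronic_mirna.fa", "mouse_intronic_mirna.fa")))
```

Passing multi-sequence FASTA files as query and subject compares every query sequence with every subject sequence in a single run, which corresponds to the all-combinations design described above.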
Statistics of coding and noncoding genes
Our database stores 462,331 transcripts, including 231,749 protein-coding transcripts, as shown in Table 1. These protein-coding transcripts are produced from 105,246 protein-coding genes. The remaining transcripts relate to ncRNAs, pseudogenes and immunological products. Some pairs of protein-coding genes in different organisms have the same evolutionary origin. Our database stores such information as orthologous relationships, shown in Table 2; for instance, 18,023 human genes are orthologous to 18,425 mouse genes. The numbers of human and mouse orthologous genes differ because there are three types of orthologs: one-to-one, one-to-many and many-to-many.
|Organism||Transcript||Coding transcript||Coding gene|
Table 1: The numbers of transcripts and protein coding genes.
Table 2: The numbers of orthologous genes.
All ncRNA genes on chromosomes were annotated with one of the three categories: intergenic, intronic or sense. In each organism, intergenic ncRNA genes are the most numerous and intronic ncRNA genes the second most numerous of the three categories, as shown in Table 3. The intronic ncRNA genes have a variety of biotypes, as shown in Table 4. The biotypes include ncRNAs with known and unknown functions. For instance, ncRNAs with known functions are miRNA, rRNA, ribozyme, scaRNA, snRNA, snoRNA and tRNA. ncRNAs with unknown functions are 3′ overlapping ncRNA, antisense, lincRNA, misc RNA, non-coding, ncRNA, processed transcript, retained intron, sense intronic and sense overlapping.
Table 3: The numbers of intergenic, intronic and sense ncRNA genes.
Table 4: Biotypes of intronic ncRNA genes.
Table 3 shows 3,886 intronic ncRNA genes in human and Figure 3 shows 2,691 host genes in human. Likewise, Table 3 shows 2,267 intronic ncRNA genes in mouse and Figure 3 shows 1,633 host genes in mouse. Because the intronic ncRNA genes outnumber the host genes, some host genes include more than one ncRNA gene in their intronic regions. Figure 3 also shows that 547 human host genes and 546 mouse host genes have orthologous relations with each other. These orthologs include one-to-many orthologous relations. For example, one human gene, ENSG00000104131, is orthologous to two mouse genes, ENSMUSG00000027236 and ENSMUSG00000043424. One mouse gene, ENSMUSG00000030738, is orthologous to two human genes, ENSG00000205609 and ENSG00000184110. One mouse gene, ENSMUSG00000013701, is orthologous to two human genes, ENSG00000265354 and ENSG00000204152.
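The effect of one-to-many relations on the gene counts can be reproduced with a short sketch. It assumes that orthology is stored as (human gene, mouse gene) pairs; the helper distinct_gene_counts is hypothetical, and the pairs are taken from the first two examples above.

```python
def distinct_gene_counts(pairs):
    """Return (number of distinct human genes, number of distinct mouse genes)."""
    human = {h for h, _ in pairs}
    mouse = {m for _, m in pairs}
    return len(human), len(mouse)


one_human_to_two_mouse = [
    ("ENSG00000104131", "ENSMUSG00000027236"),
    ("ENSG00000104131", "ENSMUSG00000043424"),
]
two_human_to_one_mouse = [
    ("ENSG00000205609", "ENSMUSG00000030738"),
    ("ENSG00000184110", "ENSMUSG00000030738"),
]

print(distinct_gene_counts(one_human_to_two_mouse))
print(distinct_gene_counts(one_human_to_two_mouse + two_human_to_one_mouse))
```

The first relation alone gives one human gene against two mouse genes, which is how the same set of orthologous relations can yield different totals in the two organisms.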
Enrichment analyses for host genes
We conducted enrichment analyses by GO terms for 2,144 non-orthologous host genes out of 19,961 coding genes in human and 1,087 non-orthologous host genes out of 22,050 coding genes in mouse. Figure 4 shows the results of the enrichment analyses. For example, Figure 4A includes cellular component organization-, neuron development-, regulation of GTPase- and modification-related terms. Figure 4B includes cell morphogenesis-, glutamate receptor- and splicing-related terms. Figure 4C shows intracellular-related terms. Figure 4D shows synapse- and lumen-related terms. Figure 4E shows GTP-related terms. Figure 4F includes glutamate receptor- or channel-related terms. The GO terms that are identical between human and mouse are ‘postsynaptic density’ and ‘cell junction’.
Figure 4: Gene set enrichment analyses of non-orthologous host genes by GO terms. The GO terms show the results of gene set enrichment analyses for 2,144 and 1,087 non-orthologous genes in human and mouse host genes, respectively. The p-value of each GO term is less than 10⁻⁴. The character size of each GO term depends on the p-value: the larger the size, the smaller the p-value.
Meanwhile, we also conducted enrichment analyses by GO terms for 547 orthologous host genes out of 19,961 genes in human and 546 orthologous host genes out of 22,050 genes in mouse. Figure 5 shows the results of the enrichment analyses. Figure 5A includes neuron-related terms. Figure 5B includes cellular component organization-, neuron- and cytoskeleton-related terms. Figures 5C and 5D show junction-related terms. Figures 5E and 5F include binding-related terms. Except for transport-related terms in Figure 5A, the detected GO terms are identical or similar between human and mouse.
Figure 5: Gene set enrichment analyses of orthologous host genes by GO terms. The GO terms show the results of gene set enrichment analyses for 547 and 546 orthologous genes in human and mouse host genes, respectively. The p-value of each GO term is less than 10⁻⁴. The character size of each GO term depends on the p-value: the larger the size, the smaller the p-value.
Similarity searches to detect conserved intronic miRNA genes
As shown in Table 4, our database stores 771 and 959 intronic miRNA genes in human and mouse, respectively. We conducted 771 × 959 (739,389) BLAST searches, with a human gene as the BLAST query and a mouse gene as the subject. The BLAST searches found 125 hits whose e-value is less than 10⁻¹⁴, as shown in Figure 6A. For the 125 BLAST hits, the mean sequence identity and mean alignment length were 93.76% and 79, respectively. Of these, 111 hits are between intronic miRNA genes located in host genes that are orthologous between human and mouse. Figure 6B shows that the number of such orthologous host genes possessing intronic miRNA genes detected by the BLAST searches is 85 in both human and mouse. On the other hand, the numbers of non-orthologous host genes are 6 and 5 in human and mouse, respectively. Of the 111 hits, Table 5 shows 26 hits whose genes are located on the X chromosome. These 26 hits involve intronic miRNA genes located in 9 host genes that are orthologous between human and mouse.
|miRNA (human)||miRNA (mouse)||Ident||Len||E-val||Host (human)||Host (mouse)|
Table 5: BLAST hits between pairs of intronic miRNA genes on the X chromosome. The sequence identity (Ident), alignment length (Len) and e-value (E-val) show the BLAST search result for each pair of miRNA genes. The host gene is the coding gene that includes the miRNA gene. This table only includes hits whose genes are located on the X chromosome.
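The step from raw BLAST hits to hits in orthologous host genes can be sketched as follows. The sketch assumes BLAST+ tabular output (-outfmt 6), in which columns 1, 2 and 11 hold the query ID, subject ID and e-value; the paper does not specify its output format. The identifiers, the host_of mapping and the helper split_hits are made up for illustration.

```python
def split_hits(blast_lines, host_of, ortholog_pairs, max_evalue=1e-14):
    """Split significant hits by whether their host genes are orthologous.

    blast_lines    : tab-separated lines in 12-column tabular format
    host_of        : dict mapping a miRNA gene ID to its host gene ID
    ortholog_pairs : set of (human host gene, mouse host gene) tuples
    """
    orthologous, other = [], []
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue >= max_evalue:
            continue  # keep only hits with e-value below the threshold
        pair = (host_of[query], host_of[subject])
        target = orthologous if pair in ortholog_pairs else other
        target.append((query, subject))
    return orthologous, other


# Made-up hits and mappings, for illustration only.
hits = [
    "hsa_A\tmmu_A\t95.0\t80\t4\t0\t1\t80\t1\t80\t2e-30\t120",
    "hsa_B\tmmu_B\t92.0\t75\t6\t0\t1\t75\t1\t75\t1e-20\t100",
    "hsa_C\tmmu_C\t88.0\t30\t3\t0\t1\t30\t1\t30\t1e-5\t40",
]
host_of = {"hsa_A": "H1", "mmu_A": "M1", "hsa_B": "H2",
           "mmu_B": "M9", "hsa_C": "H3", "mmu_C": "M3"}
orthologs = {("H1", "M1"), ("H3", "M3")}
print(split_hits(hits, host_of, orthologs))
```

Applied to the reported human versus mouse results, this kind of split corresponds to 111 hits in orthologous host genes out of 125 significant hits.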
We conducted enrichment analyses for the 85 orthologous host genes possessing conserved miRNA genes out of 547 orthologous host genes in human and the 85 out of 546 orthologous host genes in mouse. Figure 7 shows the results of the enrichment analyses by GO terms. Figures 7A and 7B include process-related terms, and many GO terms detected by the GSEA are identical or similar between human and mouse. On the other hand, Table 6 shows 14 hits whose host genes are not orthologous. Table 6 contains 11 unique host genes in human and mouse. GO terms associated with these host genes in human and mouse are shown in Figures 8 and 9, respectively. They show that some pairs of host genes possessing miRNA genes conserved between human and mouse have identical or similar GO terms. For instance, Figures 8A and 9A show 4 identical GO terms. Figures 8B and 9B show some identical GO terms such as ‘canonical Wnt signaling pathway’ and ‘ventricular cardiac muscle tissue morphogenesis’.
Figure 7: Gene set enrichment analyses of orthologous host genes possessing intronic miRNA genes by GO terms. The GO terms show the results of gene set enrichment analyses for 85 and 85 orthologous genes in human and mouse host genes, respectively. The p-value of each GO term is less than 0.01. The character size of each GO term depends on the p-value: the larger the size is, the smaller the p-value is.
|miRNA (human)||miRNA (mouse)||Ident||Len||E-val||Host (human)||Host (mouse)|
Table 6: BLAST hits between pairs of intronic miRNA genes among non-orthologous host genes. The sequence identity (Ident), alignment length (Len) and e-value (E-val) show the BLAST search result for each pair of miRNA genes. The host gene is the coding gene that includes the miRNA gene. This table only includes hits for which the pair of host genes is not orthologous.
We conducted BLAST searches for the remaining combinations of organisms by using the intronic miRNA genes in the 6 organisms shown in Table 4. Figures 10A and 10B show the results of BLAST searches in human versus rat. Figures 10C and 10D show the results of BLAST searches in mouse versus rat. No hits were detected for other combinations of organisms, such as human versus fly or mouse versus nematode. Figure 10A shows that the BLAST searches found 110 hits whose e-value is less than 10⁻¹⁴. Of these, 98 hits are between intronic miRNA genes located in host genes that are orthologous between human and rat. Figure 10B shows that the number of such orthologous host genes possessing intronic miRNA genes detected by the BLAST searches is 66 in both human and rat. The numbers of non-orthologous host genes are 5 and 4 in human and rat, respectively. Figure 10C shows that the BLAST searches found 970 hits whose e-value is less than 10⁻¹⁴. Of these, 322 hits are between intronic miRNA genes located in host genes that are orthologous between mouse and rat. Figure 10D shows that the numbers of such orthologous host genes possessing intronic miRNA genes detected by the BLAST searches are 232 and 233 in mouse and rat, respectively. The numbers of non-orthologous host genes are 63 and 66 in mouse and rat, respectively. Thus, the number of BLAST hits is largest in mouse versus rat. In addition, by integrating Figures 6B, 10B and 10D, we found 64 orthologous host genes possessing intronic miRNA genes among human, mouse and rat, as shown in Figure 11.
Figure 10: Results of BLAST searches in human vs. rat and mouse vs. rat. (A) BLAST hits of intronic miRNA genes possessed by orthologous host genes in all BLAST hits in human vs. rat. (B) Orthologous host genes in all host genes possessing intronic miRNA genes detected by the BLAST searches in human vs. rat. (C) BLAST hits of intronic miRNA genes possessed by orthologous host genes in all BLAST hits in mouse vs. rat. (D) Orthologous host genes in all host genes possessing intronic miRNA genes detected by the BLAST searches in mouse vs. rat.
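The integration of the three pairwise comparisons into one human, mouse and rat result is not described in detail. One plausible reading, sketched here purely as an assumption, is to keep a triple of host genes only when all three pairwise comparisons support it. The function conserved_triples and the gene identifiers are hypothetical.

```python
def conserved_triples(human_mouse, human_rat, mouse_rat):
    """Return sorted (human, mouse, rat) host-gene triples supported by all
    three pairwise sets of orthologous host genes with conserved miRNAs."""
    return sorted(
        (h, m, r)
        for (h, m) in human_mouse
        for (h2, r) in human_rat
        if h == h2 and (m, r) in mouse_rat
    )


# Made-up identifiers, for illustration only.
hm = {("H1", "M1"), ("H2", "M2")}
hr = {("H1", "R1"), ("H2", "R2")}
mr = {("M1", "R1")}
print(conserved_triples(hm, hr, mr))
```

In this toy input only the H1, M1, R1 triple survives, because the H2 and M2 pair lacks support in the mouse versus rat comparison.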
Our database is created from the Ensembl data. The Ensembl database can identify gene locations in genomic sequences. However, it does not categorize protein-coding and noncoding genes by their relative locations. Our database categorizes ncRNA genes based on genomic positions and can easily identify the host genes of intronic ncRNAs. Our database stores all transcripts in Ensembl, including protein-coding transcripts. Some protein-coding transcripts contain a 5′-UTR (untranslated region) and a 3′-UTR but others do not, so some protein-coding transcripts contain only the coding region. Our database also stores all ncRNA genes in Ensembl. We assigned a category to each ncRNA gene based on its location in the genomic sequence. Among intergenic, intronic and sense ncRNA genes, we focused on intronic ncRNA genes. First, we investigated the functions of ncRNA genes. The functions of some intronic ncRNA genes are known, as shown in Table 4. However, the functions of many ncRNA genes are unknown because they are annotated as lincRNA, misc RNA, ncRNA and so on. This indicates that the number of ncRNA genes in each biotype may increase in the future.
We next focused on the host genes of the intronic ncRNA genes. As shown in Figure 3, the numbers of host genes are 2,691 and 1,633 in human and mouse, respectively; that is, mouse has fewer host genes than human. In addition, Table 3 shows that the total numbers of ncRNA genes in human and mouse are 33,884 and 17,988, respectively; that is, mouse also has fewer annotated ncRNA genes than human. These results indicate that the number of intronic ncRNA genes in mouse may increase in the future. We divided the host genes by using orthologous relationships. We then conducted GSEA in order to investigate what kinds of genes the host genes are. These GSEA show that some non-orthologous genes tend to have different functions but orthologous genes have similar functions. Moreover, we explored our database to identify host genes associated with diseases. Because the GSEA show that host genes are associated with some neuron-related terms, we searched for host genes associated with Alzheimer’s disease. We found that approximately 5% of host genes are associated with diseases. Moreover, 6 human host genes are involved in Alzheimer’s disease. For example, a human gene (ENSG00000182240) is annotated with a GO term (GO:0050435, beta-amyloid metabolic process) and is therefore associated with a term concerning Alzheimer’s disease. Additionally, this human gene is a host gene of an ncRNA (ENST00000458830, snoRNA). This shows that our database is useful for identifying ncRNA genes associated with diseases. Furthermore, 4 mouse host genes are involved in Alzheimer’s disease. This shows that our database may help future research to investigate relationships between gene evolution and diseases.
We conducted BLAST searches in order to investigate the conservation of ncRNA genes in host genes. However, our database stores a variety of ncRNA genes, and comparing all ncRNA genes would have a very high computational cost. We therefore focused on miRNA genes, because miRNA genes are relatively short compared with lncRNA genes and other ncRNA genes. We conducted BLAST searches for all combinations of intronic miRNA genes in human and mouse. The BLAST hits show high sequence identities between intronic miRNA genes. This shows that some intronic miRNA genes are conserved between human and mouse. In addition, most of the BLAST hits are between intronic miRNA genes located in host genes that are orthologous between human and mouse, as shown in Figure 6A. This shows that many orthologous host genes possess conserved miRNA genes in their intronic regions. We then conducted GSEA in order to investigate which orthologous host genes possess the intronic miRNA genes conserved between human and mouse. The results show that such orthologous host genes are responsible for processes such as biosynthetic process and metabolic process. On the other hand, the BLAST searches also found some host genes that possess conserved intronic miRNA genes but are not orthologous between human and mouse. As shown in Figures 8 and 9, some of these host genes are responsible for similar functions. This shows that host genes possessing conserved intronic miRNA genes can have similar functions even if they are not orthologous.
In addition, we investigated pairs of miRNA genes for the remaining combinations of the 6 organisms. However, yeast was excluded because it has no data on intronic miRNA genes, as shown in Table 4.
The BLAST searches found hits in human versus rat and in mouse versus rat, as shown in Figure 10. Because the number of hits is largest in mouse versus rat, the number of host genes possessing conserved miRNA genes in their intronic regions appears to be larger when the organisms are more closely related evolutionarily. In addition, we found 64 orthologous host genes possessing intronic miRNA genes conserved among human, mouse and rat, as shown in Figure 11. Figure 12 shows a schematic view of such a relationship. This indicates that host genes have retained miRNA genes in their intronic regions in human, mouse and rat. On the other hand, conserved intronic miRNA genes in host genes were found only among human, mouse and rat. In other words, no hits were found by the BLAST searches for combinations involving fly or nematode. This suggests some possibilities. One possibility is that the host gene lost the miRNA gene from its intronic region in the course of evolution. Another is that the intronic miRNA gene emerged in the intronic region of the host gene in the course of evolution.
By using the database, we found that orthologous protein-coding genes have retained intronic miRNA genes in the evolutionary process. Therefore, our database is useful for identifying relationships between protein-coding and noncoding genes. Meanwhile, we have not clarified whether other kinds of ncRNAs such as lncRNAs have been retained in the host genes. Because our database stores not only miRNA genes but also information concerning such ncRNAs, it will be necessary to investigate whether an intronic ncRNA gene is conserved among organisms and is located within orthologous protein-coding genes. In addition, we plan to associate lncRNAs with diseases such as cancers and to store such information in our database. The database would then be useful for identifying lncRNAs involved in diseases.
In this study, we discussed the evolution of intronic ncRNA genes based on gene locations in genome sequences. We found that protein-coding genes which are orthologous among human, mouse and rat possess conserved miRNA genes in their intronic regions. This result suggests that some orthologous genes have retained ncRNA genes in their intronic regions in the evolutionary process.