Wei Bin^{1*} and Zhao Jing^{2,3}
^{1}Engineering University of CAPF, Xi’an 710086, China
^{2}Xijing University, Xi’an 710123, China
^{3}Xi’an Jiaotong University, Xi’an 710049, China
Received Date: April 21, 2014; Accepted Date: June 11, 2014; Published Date: June 16, 2014
Citation: Bin W, Jing Z (2014) A Novel Artificial Neural Network and an Improved Particle Swarm Optimization used in Splice Site Prediction. J Appl Computat Math 3:166. doi: 10.4172/21689679.1000166
Copyright: © 2014 Bin W, et al. This is an openaccess article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Visit for more related articles at Journal of Applied & Computational Mathematics
The amount of DNA sequence data produced by several genomic projects had increased dramatically in recent years. One of the main goals of bioinformatics was to identify genes. A crucial part of the gene identification was to precisely detect the exon intron boundaries, i.e. the splice sites. This paper introduced a new type of artificial neural network (called NANN), which was designed specifically to solve the splice sites prediction problem. Moreover, the network connection weights of NANN were determined by an improved particle swarm optimization which was inspired by the wolves' activities circle. In addition, three types of encoding approaches were applied to generate the input for the NANN. Intensive experiments were presented in this paper, and the results showed that our algorithm was better than some current methods, that is, the NANN_IPSO was applicable to splice site prediction problem.
Particle swarm optimization; Neural network; Ensemble; Splice sites prediction
With the completion of sequencing the genome from several eukaryotes [1], discovering the location and structure of genes (often referred to as gene prediction or gene finding) is an increasingly important task [2]. Eukaryotic genes are composed of two types of segments: exons and introns; while the exons are regions of the genetic material that are encoded proteins, on the other hand the introns are noncoding regions that are removed from the primary transcript [3]. The intronexon boundary is referred to as acceptor splice site and the exonintron boundary as donor splice site [4]. The donor and acceptor splice sites are the most critical signals for gene prediction [5].
In the past few decades, numerous methods have been proposed to detect the splice sites, such as hidden Markov model [5,6], Bayesian networks [7,8], artificial neural network (ANN) [9,10], support vector machine [11,12] and decisiontrees [13]. However, due to complex dependencies existing among the bases around splice sites [7,14], the splice site prediction is still a difficult problem, i.e., splice site prediction is still a major bottleneck in gene finding [15]. Thus, development of new methods to accurately predict the splice sites is continuing to be expected.
Recently, the ANN has attracted wide spread attention [1618], and it can be constructed without detailed domain knowledge [19]. However, defining the architecture of ANN is difficult. We propose a novel ANN (we call it NANN, hereafter) that incorporates into the model the codon, which is consecutive three bases that encode an amino acid.
Although ANN model has capability for extracting nonlinear relationships through training, the training process is difficult. Traditional training methods such as back propagation are very slow and possibly stuck at local minimum [20,21]. Recently, several researchers applied the PSO algorithm in training the neural network [22]. However, studies showed that similar to other population based methods, such as genetic algorithm, PSO may be trapped in local optimum and the convergence rate is slow in the later iterations [2326]. Thus, mimicking the activities of wolves circle, we propose an IPSO, which aims to speed up the convergence and avoid getting into the local optimum, to train the NANN. Intensive experiments were presented in this paper, and the results showed that our algorithm was significantly better than some current methods.
A novel ensemble ANN
NANN: Studies showed that ANN with enough hidden layers (include number of nodes per layer) can achieve any function very well through an appropriate training. The NANN consists of four layers: an input layer, two hidden layers and an output layer. On the basis of fact that every threebase (codon) in exons stands for an amino acid (e.g. CAC?histidine), the connection between the input layer and the first hidden layer is defined as follows: every threebase in the region of exons is connected to one node in the first hidden layer, and other bases are onetoone connected with the node in that layer. The other layers’ nodes are fullconnected (Figure 1).
The input is a segment of nucleotide sequences with an uneven window size. The output of the networks comes from just one unit giving a value of 0 or 1, which is used to represent the category (splice site or nonsplice site). The number of nodes in the first hidden layer for donor and acceptor splice sites prediction problem are defined in (1) and (2), respectively.
(1)
(2)
where N_{lu} is the number of inputs in upstream, N_{ld} is the number of inputs in downstream.
Ensemble NANN: The theoretical and experimental results showed that combing multiple neural networks (training several individual ANNs and combining their outputs) can effectively improve the performance of ANN [27,28]. Therefore, in this paper, several individual NANNs are used as the component of the ensemble one (Figure 2). The main steps of the ensemble NANN are given as follows:
1) Several NANNs are trained with various window size (input) and second hidden layer's nodes.
N of the better model is selected.
Create an ensemble NANN consisting of N individual NANNs according to the step 2.
The final decision is made by combining ensemble of NANN with unweighted majority voting.
IPSO
Particle swarm optimization: PSO is an iterative optimization algorithm inspired by the observation of collective behaviors in animals (e.g., bird flocking) [29,30]. In PSO, each candidate solution to an optimization problem is represented by one particle. Each particle i is described by its position x_{i} and velocity v_{i} . The algorithm starts with random initialization of the particles. Then, the particles change their positions according to their velocities, which are updated in each iteration. Given that p_{i} is the best position found by particle i in all the preceding iterations and p_{g} is the best position found so far by the entire swarm, the velocity and position of particle i in bit j will be updated according to the following formulae:
(3)
(4)
where 1 r and 2 r are random numbers between 0 and 1, and c_{1} and c_{2} define the degree of influence of p_{i} and p_{g} on the particle’s velocity. The velocity v_{ij} is bounded within a range of [−V_{max} ,V_{max} ] to prevent the particle from flying out of the solution space.
IPSO: Biologists studied on wolves and found that each pack of wolves has a 15 km radius of activities circle. Miniaturizing three wolves' activities circles to the drawings, they found that the circles are crossing, neither isolated nor completely blending. The intersection provides the possibility of hybrid, and the other sections make them retain some of personality. When the activities circle overlap, wolves are killing; conversely, complete isolation will bring degradation [31]. Inspired by the above phenomenon, we propose an improved PSO named IPSO.
In IPSO, the swarm consists of three subswarms A, B and C. At each iteration, n particles are randomly selected from each subswarm to constitute the set D , where , is the number of particles in • . D represents the intersection of A, B and C which ensures the information sharing. In addition, three enhancing strategies are used in this paper including timevarying acceleration coefficients strategy, crossover and mutation.
a) Timevarying acceleration coefficients strategy
The coefficients change with iteration so as to speed up the convergence during the earlier iterations and keep the diversity during the later iterations. The social coefficient 2 c is linearly decreased during the iterations; inversely, the cognitive coefficient 1 c is increased linearly.
(5)
where iter_{cru} and iter_{max} are the current and maximal iteration, respectively, and c_{max} is the upper bound, and c_{min} is the lower bound. In this paper, the values of c_{max} and c_{min} are set to 2.2 and 1.8, respectively.
b) Crossover
In PSO, the particle swarm has “memory” of past successes, and tends to converge upon the regions of search space that have afforded success previously. However, there is no mechanism for leaps from one region to another, and crossover allows that leaps [32]. Therefore, the following mechanism is used in this paper to prevent the algorithm from trapping in local optimization. If the rand ()<p_{c} , the following operation is executed.
(6)
(7)
where p_{c} is the crossover probability, j_{rand} is randomly chosen between 1 to M with equal probabilities.
c) Mutation
Lack of the diversity, particularly during the later stages of the optimization, is the dominant factor of the convergence to local optimum [33]. Hence, the concept of “mutation” is adopted to enhance the global search capability.
(8)
where p_{m} is the mutation probability.
Based on the above analysis, the main steps of the IPSO are given as follows:
Step 1: Initializing. Generate three independent swarms A, B and C (each one include M particles) randomly.
Step 2: Calculating. Calculate the fitness value of each particle.
Step 3: Updating. A, B and C are updated independently according to the equations (3), (4) and (5)
Step 4: Selecting. n particles are randomly selected from A and C to constitute the set D.
Step 4.1: Crossover. The particles in D are crossed according to (6) and (7).
Step 4.2: Mutating. If rand () < p_{m} , (8) is executed.
Step 5: Checking. If the termination criteria are satisfied, stop. Otherwise, go to Step 2.
Hybrid NANN and IPSO
Although the NANN model has been reported in the previous section, the problem of training the network connection weight is not a trivial problem. It is a nonlinear and dynamic process in that any change of one weight requires adjustment of many others. In this paper, the IPSO is applied to optimize the weights (Figure 3). The fitness function is defined as the rate of misclassification.
In most of the previous studies, the nucleotide encoding was often ignored. In this paper, three encoding methods are used: singlenucleotide (SN) encoding, SN with frequency difference between the true and false sets (SN_FD) [34], and SN_FD with distribution information (SN_FD _DI).
SN encoding
Each nucleotide is given an integer number: A (1), T (1), G (2) and C (2).
SN with FD encoding
A position weight matrix is derived from the true set by counting the frequency which each nucleotide occurs at each position.
(9)
where n is the number of samples in the true set, l is the number of nucleotide in each sample and N_{tj}∈{2,1,1, 2}. In the same way, a position weight matrix can be obtained for the false set. The SN with FD encoding is obtained by subtracting the true coding matrix from the false one [34].
SN_FD with distribution information encoding
However, SN_FD method has its limits. For example, A’s frequency is 0.8 in true set and 0.7 in false set. C’s frequency is 0.2 in true set and 0.1 in false set. When the SN_FD is used, the two features have the same value. However, the contribution of the two bases may be different. Therefore, we propose a novel encoding method that the distribution information (DI) is added to (9).
(10)
where is the frequency of N_{tj} in the class c_{t} , c_{t} ∈{true, false} .
In this section, we tested IPSO on four benchmark functions, traveling salesman problem and bin packing problem. Then, NANN_ IPSO was tested on the Homo Sapiens Splice Sites Dataset (HS3D), and the results were compared with those of several current known algorithms.
Tested the effect of IPSO
Experiment for benchmark testing: Four wellknown benchmark functions were used to evaluate the effectiveness of IPSO, and the results were compared with PSO, EGPSO and MPSO_SCSS [31,35,36]. All the functions used in this paper are defined in Table 1.
Test functions  Search space  Global optimum  

F1  (5.12,5.12)  0  
F2  (2.048,2.048)  0  
F3  (1.28,1.28)  0  
F4  (100,100)  0 
Table 1: Details of benchmark functions.
Each experiment was conducted 30 times, and the average results and the standard deviation were presented. Based on the parameter sensitive analysis (see 4.2.4), we set the parameters of IPSO as follows: p_{m} =0.01, p_{c} =0.8 and k =5. Table 2 shows the results of the four algorithms. In the table, “F” indicates the function, and “D” indicates the dimension of function.
F  D  Average/(Standard Deviation)  
PSO  EGPSO  MPSO_SCSS  IPSO  
F1  5  2.6177*10^{115} （2.8075*10^{115}） 
2.1337*10^{68} （3.7709*10^{68}） 
2.4057*10^{30} （4.4985*10^{30}） 
1.8713*10^{169} （0） 
10  5.6843*10^{15} （1.1874*10^{14}） 
7.1827*10^{11} （1.5699*10^{10}） 
1.4352*10^{18} （1.3143*10^{18}） 
6.9166*10^{35} （1.6381*10^{34}） 

F2  2  0 （0） 
0 （0） 
0 （0） 
0 （0） 
F3  5  2.7501*10^{4} （1.7581*10^{4}） 
5.3997*10^{4} （3.0793*10^{4}） 
5.0818*10^{4} （2.5073*10^{4}） 
2.9644*10^{323 }（0） 
10  1.3445*10^{2} （8.0537*10^{3}） 
2.1525*10^{2} （1.1245*10^{2}） 
4.0352*10^{3} （1.8352*10^{3}） 
1.1449*10^{61} （3.5054*10^{61}） 

F4  2  0 （0） 
6.5451*10^{3} （4.7254*10^{3}） 
0 （0） 
0 （0） 
Table 2: Average and the standard deviation of the optimal values obtained by PSO, EGPSO, MPSO_SCSS and IPSO.
It can be seen from the table that all the methods performed well on functions 2 and 4. However, for the 1 and 3, IPSO improved the average solutions significantly compared with those of others. Therefore, we can reach the conclusion that IPSO outperforms the other three algorithms.
Experiment for bin packing problem testing: The bin packing problem (BPP) is a NPhard combinatorial optimization problem where the aim is to pack a finite number of items using the least bins possible [37]. The problem can be stated as follows: given a finite set of numbers (the items) and a constant C (the bin’s capacity), find the packing pattern that requires the minimum number of bins. The m items were randomly drawn from the range (0,1) to be packed into bins of capacity 1. In this paper, three sets of parameters were used: (1) m=100; (2)m=500; (3)m=1000. Table 3 shows the results of the four algorithms (PSO, IBPSO [38], MPSO [39] and IPSO). According to Table 3, the IBPSO was a very poor method for the BPP. On the other hand, the IPSO considerably outperformed others, finding the better packing even for the difficult instance where m=1000. The results on the three datasets confirm the superiority of the IPSO in comparison with others.
Dataset  Average/(Standard Deviation)  

PSO  IBPSO  MPSO  IPSO  
VR100  36 (0) 
36.6 (0.5040) 
36 (0) 
35.1 (0.1568) 
VR500  178.8 (0.6644) 
180.0 (1.1592) 
178.8 (0.6477) 
176.2 (0.3488) 
VR1000  406 (126.2) 
416 (74.3) 
409 (62.1) 
397 (36.8) 
Table 3: The results of the bin packing problem for 30 trials.
Experiment for traveling salesman problem testing: Traveling salesman problem (TSP) is one of most widely studied combinational optimization problems in which a salesman has to visit every city in the shortest way [40]. The TSP instances were derived from the TSPLIB [41]. The three versions of PSO (PSO, PSO_c3dyn [42] and IPSO) were tested on four instances. According to Table 4, it can be seen that IPSO obtained the best results in all instances. From the table, even the worst results obtained by our algorithm were better than the mean results obtained by others in some of cases. According to standard deviations, the IPSO is more stable.
Dataset (best known) 
Average/(Standard Deviation)  

PSO  PSO_c3dyn  IPSO  
Eil51 (426) 
441.5 (3.62) 
457.9 (4.17) 
434.4 (2.68) 
Berlin52 (7542) 
7657.4 (144.84) 
8062.2 (68.37) 
7548.4 (13.58) 
St70 (675) 
685.8 (6.94) 
700.6 (5.09) 
674.8 (6.05) 
Pr76 (108159) 
117390.6 (1251.30) 
119550.3 (933.02) 
108852.0 (232.21) 
Table 4: The results of the Traveling Salesman Problem for 30 trials.
Experiment for splice site prediction
Dataset: The dataset was extracted from GenBank Rel.123. HS3D contains 2796 donor sites and 2880 acceptor sites. In addition, there are 271937 false donor sites and 329374 false acceptor splice sites, which were selected by searching GT and AG dinucleotides in nonsplicing positions [43].
Evaluation criteria: The evaluation criteria used in this paper are sensitivity (Sn), specificity (Sp), accuracy (Acc) and Q^{9} .
(11)
(12)
(13)
(14)
(15)
True negative (TN): number of false splice sites that were correctly classified as false.
True positive (TP): number of true splice sites that were correctly classified as true.
False negative (FN): number of true splice sites that were wrongly classified as false.
False positive (FP): number of false splice sites that were wrongly classified as true.
In addition, the kfold method was employed in the experiments, with the value of k set to five. For fivefold crossvalidation, the whole dataset was divided into five subsets with approximately equal size. Then, the classifier was trained five timeseach time, one subset was used as the testing data to validate the classifier.
Firstly, an individual NANN with different encoding methods and different inputs were used to analyze the effects of three encoding methods, and the results are showed in Figure 4 and Figure 5, where LU is the length of upstream sequences of splice sites and LD is the length of downstream sequences of splice sites. From the figures we can see that the SN obtained the worst Acc for both the donor and acceptor datasets. On the other hand, the SN_FD_DI obtained the best results compared with other two encoding methods. Therefore, it can be concluded that the SN_FD_DI encoding method can more accurately reflect the difference between true and false sites.
Secondly, the effects of the number of individual ( N ) NANN(s) used in the ensemble one were analyzed (Tables 5 and 6). It is clear that the results were greatly influenced by the value of N, and the best results were obtained when N = 5 . Figures 6 and 7 shows the frequency of nucleotides which used as the inputs of NANN_IPSO. Yaxis indicates the frequency of nucleotides composition bias. From figure we can see that: a) the adjacent sequences of splice sites have remarkable conservativeness; (b) introns show more remarkable conservativeness than exons.
N  Sn  Sp  Acc  Q^{9} 

1  76.18%  98.28%  87.23%  83.11% 
3  86.91%  98.93%  92.92%  90.71% 
5  90.77%  98.93%  94.85%  93.43% 
7  85.62%  99.14%  92.38%  89.82% 
9  84.12%  98.71%  91.42%  88.73% 
Table 5: Donor sites prediction with different N NANN(s).
N  Sn  Sp  Acc  Q^{9} 

1  57.50%  97.08%  77.29%  69.88% 
3  77.92%  97.71%  87.81%  84.30% 
5  85.63%  97.62%  91.77%  89.73% 
7  75.42%  98.13%  86.77%  82.57% 
9  74.17%  97.50%  58.83%  81.65% 
Table 6: Acceptor sites prediction with different N NANN(s).
Finally, our method was compared with MM1SVM [11], SVMB [44] and HMM [5]. The results are showed in Table 7. We can see that all the algorithms generated better results for the donor dataset than the acceptor one, and the reason is that the donor set is more conservative than the acceptor one. Moreover, our method outperformed the other algorithms in the evaluation criteria Q^{9} on all the datasets, and it can be explained by the effectiveness (the information used to construct NANN were usefulness) of the algorithm proposed in this paper.
Method  donor  acceptor  

Sn  Sp  Q^{9}  Sn  Sp  Q^{9}  
MM1SVM  93.06%  91.31%  92.10%  90.24%  87.57%  88.79% 
SVMB  94.31%  90.99%  92.38%  90.90%  88.16%  89.37% 
HMM  94.24%  92.42%  93.23%  88.23%  90.11%  89.40% 
Our method  90.77%  98.93%  93.43%  85.63%  97.62%  89.73% 
Table 7: Prediction accuracies obtained by three algorithms
Sensitivity in relation to parameters
To study the effects of parameters, we solved the splice site prediction problem by NANN_IPSO with variousk, p_{c} and p_{m}.
• Sensitivity in relation to the k
The k is an important parameter for our algorithm. Tables 8 and 9 show the effects of k on the solutions where the mutation probability and crossover probability were 0.01 and 0.8, respectively. From the table we can see that our algorithm achieved the best results when the k equaled 5. This can be explained as follows: the small value leads to overlaps among the three groups, and then the particles become over competitive, and the swarms degrade in the end. In contrast, the large value makes it cannot communicate with others.
K  Sn  Sp  Acc  Q^{9} 

20  85.41%  98.93%  92.17%  89.65% 
10  87.34%  99.79%  93.56%  91.05% 
5  90.77%  98.93%  94.85%  93.43% 
2  87.55%  98.50%  93.03%  91.14% 
Table 8: Variation of the results obtained by NANN_IPSO with different k (donor site).
k  Sn  Sp  Acc  Q^{9} 

20  83.33%  97.56%  90.63%  88.12% 
10  84.17%  97.92%  91.04%  88.71% 
5  85.63%  97.62%  91.77%  89.73% 
2  83.75%  97.92%  90.83%  88.42% 
Table 9: Variation of the results obtained by NANN_IPSO with different k (acceptor site).
• Sensitivity in relation to crossover probability
Then, the effects of crossover probability on the results were investigated. Crossover probability plays a key role in the method. For small values, the algorithm will converge slowly. However, large values will lead the particles to a random walk in the search space. Therefore, a proper value is vital for the method. The effects of the crossover probability on the results are shown in Tables 10 and 11 where k and mutation probability were set to 5 and 0.01, respectively. From the tables we can see that the algorithm achieved the best results when the crossover probability was 0.8.
p_{c}  Sn  Sp  Acc  Q^{9} 

0.2  86.70%  98.50%  92.60%  90.53% 
0.4  88.41%  98.71%  93.56%  91.76% 
0.6  88.20%  99.36%  93.78%  91.64% 
0.8  90.77%  98.93%  94.85%  93.43% 
Table 10: Variation of the results obtained by NANN_IPSO with different crossover probability (donor site).
p_{c}  Sn  Sp  Acc  Q9 

0.2  82.50%  97.92%  90.21%  87.54% 
0.4  84.79%  98.33%  91.56%  89.18% 
0.6  85.00%  98.13%  91.56%  89.31% 
0.8  85.63%  97.62%  91.77%  89.73% 
Table 11: Variation of the results obtained by NANN_IPSO with different crossover probability (acceptor site).
Sensitivity in relation to mutation probability
Tables 12 and 13 show the effects of the mutation probability on the solutions where the k and the crossover probability were 5 and 0.8, respectively. Variations of results were observed with different mutation probabilities. From the tables we can see that our method achieved the best results when the mutation probability equaled 0.01.
p_{m}  Sn  Sp  Acc  Q^{9} 

0.01  90.77%  98.93%  94.85%  93.43% 
0.05  87.12%  99.14%  93.13%  90.88% 
0.1  84.76%  99.14%  91.95%  89.12% 
0.2  83.91%  98.93%  91.42%  88.59% 
Table 12: Variation of the results obtained by NANN_IPSO with different mutation probability (donor site).
p_{m}  Sn  Sp  Acc  Q^{9} 

0.01  85.63%  97.62%  91.77%  89.73% 
0.05  81.88%  98.75%  90.31%  87.15% 
0.1  80.00%  98.33%  89.17%  85.81% 
0.2  78.75%  98.96%  88.85%  84.96% 
Table 13: Variation of the results obtained by NANN_IPSO with different mutation probability (acceptor site).
In summary, the computational results confirmed that the method used in this paper works well. Its performance was fairly robust, and this is a very appealing feature, as most real world applications involve analyzing huge volumes of genome sequence data. Moreover, the five fold crossvalidation experiment was reliable and appropriate for splice site prediction.
Predicting splice sites is an important part of gene prediction. Due to complex dependencies existing among bases around splice sites, splice site prediction remains a major bottleneck in gene prediction. A novel neural network model (NANN) was introduced, which was based on the fact that threebase code words (codons) in mRNA stand for amino acids in proteins, to predict the splice sites. In addition, the IPSO mimicking the activities of wolves circle was used as a training phase of NANN to determine the weights of network. Moreover, three encoding methods were applied and showed how the predicting performance was benefited from the use of the SN_FD_DI encoding approach. The experimental results demonstrated that the proposed algorithm was able to discriminate between the true and false splice sites and had better prediction accuracy than others.
This study was supported by the National Natural Science Foundation of China (Grant No. 61309022).
Make the best use of Scientific Research and information from our 700 + peer reviewed, Open Access Journals