An Immunobioinformatic Comparison of Influenza A Subtype Hemagglutinins

The purpose of this research is to identify nucleotide and amino acid positions that are essential for the function of the HA gene and its protein products. The metric for sequence variability was Hamming distance, determined for all possible inter-subtype pairs of H1N1, H3N2 and H6N1 HA sequences in the complete NCBI HA gene datasets. Almost all (97.22%) of the nucleotide positions at which the Hamming distance was zero were in the HA2 domain of the HA gene; the invariant nucleotides occupied first and second codon positions, except for an invariant tryptophan encoded in the tail region. In contrast with the results at the nucleotide level, the patterns of epitope distribution in the encoded HA proteins were similar except for a 25-amino acid sequence (283-307) in the HA1 region of the H6N1 HA. These results demonstrate a similar organization of immunological epitope patterns in influenza A hemagglutinins at the protein level, even in the face of large differences in the sequences of the encoding nucleotides and the encoded amino acids.


Introduction
The hemagglutinin (HA) gene of the influenza A virus encodes proteins that enable the influenza virus to bind to sialic acid on target cell membranes and to be internalized by those cells through a process involving fusion of the cell lipid membrane with the viral lipid membrane [1,2]. After synthesis of the hemagglutinin protein (HA0) encoded by the HA gene, the HA0 protein is cleaved by cellular proteases into a signal peptide, an HA1 protein and an HA2 protein [3]. The HA1 protein contains the recognition site for binding the virus to the sialic acid [4,5] on the membrane of the target cell. Internalization of the virus into the target cell takes place by a process of membrane fusion, mediated by the HA2 protein [1]. The influenza virus thus enters into the endosomes of the cells where, following lowering of pH, the virus replicates. The N-terminal signal peptide of the HA0 protein is involved with translocation of the nascent influenza virus across the endoplasmic reticulum membrane of the infected cell [6]. Other regions encoded by the HA gene, mainly in the HA2 domain, are involved with packaging of the (-) RNA and the proteins of the virus into the functional viral structure [7], although participation of HA2 may not be critical in the packaging process [8]. Immunogenic regions of the HA1 protein, involved in binding the virus to the host cell surface, are important targets in vaccine development [9,10].
A purpose of this research is to help identify patterns of organization in the nucleotide positions of the influenza A HA gene and the HA proteins. A reduced variability at such positions may reflect biological constraints on that variability, especially given the high mutation rate of influenza viruses [11] and the lack of error-correction [12]. Nucleotide variability at second codon positions reflects variability of protein sequence since there are no degenerate codons with mutations in the second position [13].
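The statement about second codon positions can be checked directly against the standard genetic code. The following is a minimal sketch, not part of the original analysis: it assumes NCBI translation table 1, writes codons with T in place of U (as in the DNA versions of the sequences), and uses the hypothetical helper name `synonymous_substitutions`.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1); codons run TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_substitutions(codon_position):
    """Return (codon, mutant) pairs in which a single-base change at the given
    codon position (0, 1 or 2) leaves the encoded residue unchanged."""
    pairs = []
    for codon, residue in CODON_TABLE.items():
        for base in BASES:
            if base == codon[codon_position]:
                continue
            mutant = codon[:codon_position] + base + codon[codon_position + 1:]
            if CODON_TABLE[mutant] == residue:
                pairs.append((codon, mutant))
    return pairs


second_position = synonymous_substitutions(1)
third_position = synonymous_substitutions(2)
print(sorted(second_position))
print(len(third_position))
```

Under this table, the only second-position change that preserves the translated symbol is the interchange of the stop codons TAA and TGA; every second-position substitution in a sense codon alters the encoded amino acid, whereas many third-position substitutions are synonymous.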
In this research on sequence variation in the influenza A virus, the HA genes of subtypes H1N1, H3N2 and H6N1 are analyzed. Subtypes H1N1 and H3N2 are the causes of epidemics and pandemics [3], while H6N1 has posed a recent threat [14]. The research presented here uses Hamming distance [15,16] as the metric for comparing the HA gene nucleotide variation. Hamming distance does not require prior realignment of the sequences according to statistical criteria and is defined only for sequences of equal length. Hamming distance has been used to analyze the HA gene of individual HA subtypes [17,18] but, to my knowledge, this is the first report of the simultaneous application of Hamming distance to HA genes of multiple subtypes. The second codon Hamming distance distributions reflect differences in protein sequence. Despite the differences in protein sequences, the patterns of B-cell linear epitope distributions remained intact in the encoded HA proteins. Thus, it is reported here that nucleotide variation in regions of the HA gene is non-randomly distributed and it is proposed that these observed variational patterns reflect host immunological and other biologically significant evolutionary forces that act on the HA proteins.

Methods
Sequence sorting, codon translation, data analysis and plotting were performed with Python 2.7 (EPD 7.3-1, 64-bit), NumPy 1.6.1, SciPy 0.10.1 and Biopython 1.61. Consensus sequences were determined with Jalview [19]. Hamming distances [15,16] were computed with Python 2.7.3 on the computer facilities of the Brown University Center for Computation and Visualization (CCV) for all possible inter-subtype sequence pairs. Hamming distances at first, second and third codon positions were obtained by decimative multiplication, as previously reported for arrays of information entropy [20].
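One plausible reading of the decimation step is multiplication of the per-position Hamming distance array by a 0/1 mask that keeps every third nucleotide. The sketch below illustrates that reading only; the procedure of reference [20] is not reproduced here, and the function names and toy values are hypothetical.

```python
def codon_position_mask(length, codon_position):
    """0/1 mask that selects nucleotides at codon position 1, 2 or 3."""
    return [1 if i % 3 == codon_position - 1 else 0 for i in range(length)]


def decimate(per_position_values, codon_position):
    """Zero out every value that is not at the requested codon position."""
    mask = codon_position_mask(len(per_position_values), codon_position)
    return [value * m for value, m in zip(per_position_values, mask)]


# Toy per-position Hamming distances for three codons (hypothetical values).
toy = [5, 0, 9, 4, 0, 7, 6, 1, 8]
first = decimate(toy, 1)
second = decimate(toy, 2)
third = decimate(toy, 3)
print(second)
```

Because each nucleotide belongs to exactly one codon position, the three decimated arrays add back to the original array, and plain slicing (`toy[1::3]` for second positions) yields the same retained values without the zero padding.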
The entire set of FLU project influenza A HA coding region nucleotide sequences (17017) was downloaded from the NCBI Influenza Virus Resource Database [21] in FASTA format [22] on December 26, 2013 along with an additional special set of H7N9 HA sequences linked through the NCBI website. These nucleotide sequences are DNA versions of the influenza virus HA mRNA. Only complete sequences were downloaded and laboratory strains were excluded. The hemagglutinin (HA) gene sequences of the following influenza A virus subtypes were sorted from the complete download set: H1N1, H2N2, H3N2, H5N1, H6N1 and H7N9. It was determined that 82.85% of the H1N1 HA sequence dataset and 99.77% of the H3N2 HA gene dataset consisted of complete HA gene nucleotide segments of length 1701, with 16.63% of the H1N1 HA dataset being of length 1698 nucleotides. One of the H6N1 sequences was of length 1698 and one was of length 1704; the remaining H6N1 sequences (98.77%) were of length 1701. The HA genes of none of the other influenza A subtypes were 1701 nucleotides in length. Thus, the HA genes of influenza A subtypes H1N1, H3N2 and H6N1 of length 1701 nucleotides provided the requisite and suitable sequence datasets for a Hamming distance-based analysis.
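The length tallies and the selection of 1701-nucleotide sequences described above can be sketched as follows. This is an illustrative outline in plain Python rather than the Biopython-based workflow actually used; the FASTA headers, toy sequences and function names are hypothetical, and a required length of 6 stands in for 1701 in the toy example.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def length_percentages(records):
    """Percentage of records at each sequence length."""
    counts = Counter(len(seq) for _, seq in records)
    total = sum(counts.values())
    return {length: 100.0 * n / total for length, n in counts.items()}


def keep_length(records, required_length=1701):
    """Retain only sequences of the length required for Hamming distance."""
    return [(h, s) for h, s in records if len(s) == required_length]


toy_fasta = ">seqA\nATGGCA\n>seqB\nATG\nGCC\n>seqC\nATG\n"
toy_records = read_fasta(toy_fasta)
toy_percentages = length_percentages(toy_records)
toy_kept = keep_length(toy_records, required_length=6)
print(toy_percentages, [h for h, _ in toy_kept])
```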
Hamming distance (d) at a single nucleotide position of two nucleotide sequences (i, j) of equal length, one sequence from set i and one sequence from set j, is defined as $d = 0$ if nucleotide(i) = nucleotide(j) and as $d = 1$ if nucleotide(i) ≠ nucleotide(j). The total Hamming distance (D) at a single nucleotide position of the HA subtype sets is defined as $D = \sum_{k=1}^{\phi} d_k$, where d is summed over ϕ, $d_k$ is the value of d for the k-th sequence pair, and ϕ is the total number of possible pairs between sequences i and j. The summed Hamming distance (ΣD) between paired (i, j) HA subtype gene sequences is $\Sigma D = \sum_{p=1}^{n} D_p$, where D is summed over n nucleotide positions and $D_p$ is the value of D at position p. In the HA genes of the three subtypes analyzed in this study, the maximum value of n is L, where L is the length of the complete H1N1, H3N2 and H6N1 HA genes and L=1701.
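The definitions of d, D and ΣD can be illustrated with a short sketch on toy sequences. It is not the code used for this study; the function names and sequences are hypothetical. The count-based variant is an assumption added for illustration: it obtains the same D from per-position base counts, which avoids enumerating every pair.

```python
from collections import Counter


def pair_distance(seq_i, seq_j):
    """Hamming distance between two equal-length sequences (d summed over positions)."""
    if len(seq_i) != len(seq_j):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(1 for a, b in zip(seq_i, seq_j) if a != b)


def per_position_totals(set_i, set_j):
    """D at each position: d summed over all inter-set sequence pairs."""
    totals = [0] * len(set_i[0])
    for seq_i in set_i:
        for seq_j in set_j:
            for p, (a, b) in enumerate(zip(seq_i, seq_j)):
                totals[p] += int(a != b)
    return totals


def per_position_totals_from_counts(set_i, set_j):
    """Same D, computed as (number of pairs) minus (number of matching pairs)."""
    n_pairs = len(set_i) * len(set_j)
    totals = []
    for p in range(len(set_i[0])):
        counts_i = Counter(seq[p] for seq in set_i)
        counts_j = Counter(seq[p] for seq in set_j)
        matches = sum(counts_i[base] * counts_j[base] for base in counts_i)
        totals.append(n_pairs - matches)
    return totals


set_i = ["ATGGCA", "ATGGCC"]
set_j = ["ATGACA", "TTGACA", "ATGGCA"]
D = per_position_totals(set_i, set_j)
summed_D = sum(D)
invariant_positions = [p + 1 for p, value in enumerate(D) if value == 0]  # 1-based
print(D, summed_D, invariant_positions)
```

Positions at which D is zero are those at which every sequence in both sets carries the same nucleotide, which is the criterion used for the invariant positions reported in this study.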
Prediction and detection of B-cell linear epitopes were performed on the influenza A consensus HA protein sequences with Bepipred [23] using the recommended default cutoff score of 0.35. The Bepipred website was accessed via the Immune Epitope Database and Analysis Resource, La Jolla Institute for Allergy & Immunology, CA (http://www.iedb.org/). Percentage identity (PID) was calculated for the raw consensus protein sequences and for consensus sequences that were realigned with Clustal-Omega [24; http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi].
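Several definitions of percentage identity are in use, and the one applied here is not specified. The sketch below assumes the simplest form: identical residues divided by the number of compared columns, with columns containing a gap in either sequence skipped for realigned sequences. The function name and toy peptides are hypothetical.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percentage of identical residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences (or aligned sequences) must have equal length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not compared:
        return 0.0
    identical = sum(1 for a, b in compared if a == b)
    return 100.0 * identical / len(compared)


raw_pid = percent_identity("MKAIL", "MKTIL")          # ungapped, equal length
realigned_pid = percent_identity("MK-AIL", "MKTAIL")  # gap column skipped
print(raw_pid, realigned_pid)
```

For raw sequences of equal length the gap handling has no effect, so the same function covers both the native and the realigned comparisons.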

Results
FLU Project HA sequences of length 1701 nucleotides of the following subtypes (sequence numbers in parentheses) were downloaded from the NCBI Influenza Virus Resource database: H1N1 (5655), H3N2 (4107) and H6N1 (160). In order to remove sequences that could interfere with Hamming distance analysis, downloaded sequences that could not serve as templates for translation into full-length (566 amino acid) HA0 proteins were purged, yielding: H1N1 (5365; 94.87% of download), H3N2 (3989; 97.12% of download) and H6N1 (159; 99.38% of download). The (H1N1, H3N2) Hamming distance, summed over 1701 positions, was computed from 5365*3989=21,400,985 sequence pairs, the (H1N1, H6N1) Hamming distance was computed from 5365*159=853,035 sequence pairs and the (H3N2, H6N1) Hamming distance was computed from 3989*159=634,251 sequence pairs. Hamming distances between the pairs of these subtype sequences are shown in Table 1. The median Hamming distance of 1223 observed between HA genes of subtypes H1N1 and H3N2 represents a frequency of 0.7190 summed over the 1701 nucleotide positions. Using that frequency as a probability for a binomial approximation leads to a predicted standard deviation that is 2.54 greater than the observed standard deviation. Similarly, the standard deviations binomially predicted for the (H1N1, H6N1) pairs and for the (H3N2, H6N1) pairs are 3.20 and 2.75 greater than those observed, respectively. The summed Hamming distance of 883 for the (H1N1, H6N1) paired datasets is less than those observed for both the (H1N1, H3N2) paired datasets and for the (H3N2, H6N1) paired datasets; there was no overlap between the (H1N1, H6N1) Hamming distance range and those of the other two dataset pairs. The probability (p) values for the observed absolute non-overlap between the (H1N1, H6N1) Hamming distance and the Percent identity (PID) matrices are given in Table 2 for the native and for the realigned H1N1, H3N2 and H6N1 HA0 proteins.
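The pair counts and the binomial approximation quoted above follow from simple arithmetic, sketched below. The sketch assumes the usual binomial standard deviation, sqrt(n p (1 - p)), with n = 1701 positions and p taken as the median Hamming distance divided by n; the variable and function names are illustrative only.

```python
import math

# Numbers of retained sequences per subtype, as reported in the text.
counts = {"H1N1": 5365, "H3N2": 3989, "H6N1": 159}


def pair_count(subtype_a, subtype_b):
    """Number of inter-subtype sequence pairs."""
    return counts[subtype_a] * counts[subtype_b]


def binomial_sd(n, p):
    """Standard deviation of a binomial distribution with n trials and probability p."""
    return math.sqrt(n * p * (1.0 - p))


n_positions = 1701
frequency = 1223 / n_positions  # median (H1N1, H3N2) Hamming distance per position
predicted_sd = binomial_sd(n_positions, frequency)

print(pair_count("H1N1", "H3N2"), pair_count("H1N1", "H6N1"), pair_count("H3N2", "H6N1"))
print(round(frequency, 4), round(predicted_sd, 2))
```

With these inputs the binomially predicted standard deviation for the (H1N1, H3N2) pairs is approximately 18.5 nucleotide differences; the observed standard deviations are those reported in Table 1.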
Considerable differences among the HA0 sequences are present even after realignment, especially between HA0 H3N2 and the HA0 proteins of the other two subtypes. Other, smaller joint epitope clusters were distributed throughout the sequences. There was a paucity of epitopic residues in amino acid residues 283-307 of the H6N1 HA protein.

Discussion
A goal of this research is to detect nonrandom patterns in the variation of the HA gene and protein of subtypes H1N1, H3N2 and H6N1 influenza A viruses because such nonrandom patterns may be direct evidence of significant biological forces acting on those viruses. HA genes of subtypes H1N1, H3N2 and H6N1 are of epidemiological importance and provide the opportunity for detecting such nonrandom patterns by the Hamming distance metric because these genes are of equal length (see Results). Hamming distance provided a robust, yet sensitive metric, involving determination of sequence differences between all possible inter-subtype pairs in the sequence sets. Despite the commonality of sequence length, there were nucleotide differences between many of the nucleotide position pairs. Hamming distances were nonzero at 71.90% of (H1N1, H3N2) HA sequence pairs, 46.73% of (H1N1, H6N1) HA sequence pairs and at 65.75% of the (H3N2, H6N1) HA sequence pairs (Table 1). However, Hamming distances between the HA(H1N1, H6N1) sequence pairs were lower than those of the other two sets of pairs, so that the HA(H1N1, H6N1) Hamming distances were completely non-overlapping (disjoint) with the HA(H1N1, H3N2) and HA(H3N2, H6N1) Hamming distances. The probability (p) of each of these results having occurred randomly is only p < 5.4777 × 10^-14 and p < 1.8483 × 10^-12, respectively.
The lower Hamming distance between HA (H1N1, H6N1) pairs, discussed above, is associated with a nonlinear distribution of Hamming distance beginning at approximately nucleotide position 800 (Figure 1b), especially at nucleotides in first and second codon positions (Figure 2). The observed reduced distribution of Hamming distances at these codon positions suggests biological constraints acting at the protein level rather than synonymous mutations at the nucleotide level. Nucleotide position 800 of the HA gene is near the 3'-end of the HA1 region of the gene.
Thirty-six (36) nucleotide positions were identified at which the Hamming distance was equal to zero between all sequence pairs of the HA genes of all three influenza A subtypes (Table 3; Figure 2). This pattern of constrained Hamming distance suggests processes whose actions begin on the 3'-end of the HA1 domain (near nucleotide position 848) and act at the 35 positions in the HA2 domain. Such a distribution suggests that there are constraining effects on the HA0 protein, i.e., before proteolytic cleavage [1] of HA0 into HA1 and HA2 proteins.
The HA2 protein is known to have several functions in the influenza virus. HA2 protein interacts with the HA1 protein so as to enable HA1 to bind to sialic acid on the membrane of the target cell [1]. HA2 participates in a membrane fusion process during internalization of influenza virus into the target cell. HA2 participates in the packaging of nascent influenza A virus particles, thereby facilitating secretion of mature virions from endosomes [7], although it should be noted that HA2 participation in packaging may not be essential [8]. The HA2 protein provides a hydrophobic virus tail that anchors the virion to the virus lipid membrane coat [26]. The nucleotide positions and corresponding amino acid positions listed in Table 3 have not yet been specifically implicated in any of these processes. It seems especially interesting that the nucleotides at positions 1657, 1658 and 1659 are uniformly U, G, and G, encoding Trp553. Trp553 is a hydrophobic amino acid located 13 amino-acid positions from the COOH-terminus of the anchoring tail region. It is postulated here that the 36 nucleotide positions with zero Hamming distance, identified in Table 3 as absolutely conserved in all of the HA nucleotide sequences studied, are participating in these, and perhaps other, biological processes that are essential to the influenza A virus. Insight into the mechanisms of nucleotide conservation at these 36 nucleotide positions can be increased by analyzing the biological effects of base substitutions at these positions on the structure and function of influenza A viruses.
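The correspondence between nucleotide positions 1657-1659 and Trp553 is a matter of codon arithmetic, which the following minimal sketch makes explicit. It assumes 1-based numbering from the first nucleotide of the 1701-nucleotide coding sequence; the function name is hypothetical.

```python
def codon_of(nucleotide_position):
    """Map a 1-based nucleotide position to (1-based codon number, codon position 1-3)."""
    index = nucleotide_position - 1
    return index // 3 + 1, index % 3 + 1


HA0_LENGTH = 566  # full-length HA0 protein, in amino acids

trp_codon = [codon_of(p) for p in (1657, 1658, 1659)]
residues_from_c_terminus = HA0_LENGTH - trp_codon[0][0]
print(trp_codon, residues_from_c_terminus)
print(codon_of(848))
```

Under the same numbering, nucleotide position 848 is the second position of codon 283, which is the first residue of the 283-307 segment discussed for the H6N1 HA protein.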
Soundararajan et al. [27] have reported epitope networking interaction patterns within realigned HA sequences and have effectively coupled immune epitope data with crystallographic analyses. The research described here is based upon analysis of HA sequences of equal length, thereby avoiding the sequence realignment step. Avoiding sequence realignment may make it possible to detect factors influencing and governing immunological epitope distribution patterns by direct, computerized application to large sequence sets in a manner analogous to that shown for the consensus sequences in Figure 4.
The nucleotide substitutions discussed above, especially those at first and second codon positions, represent non-degenerate substitutions that could significantly affect the biological activity of the HA0 protein and of the derivative signal peptides, HA1 and HA2 proteins of the influenza A subtypes studied. Accordingly, the comparison of epitope distributions in the HA proteins of the three subtypes was undertaken. The percent identity (PID) values of these HA proteins were low, even after realignment (Table 2). Despite these differences in amino acid sequence, the patterns of epitopic potential were correlated (Figure 3) and were similarly distributed (Figure 4). The paucity of epitopic sites in amino acid residues 283-307 of the H6N1 HA protein is in the HA1 region of the HA protein. The similar overall distribution of epitopic sites in the three hemagglutinins, despite the differences in amino acid sequence and composition, is consistent with the non-randomness displayed at the nucleotide level. These non-random nucleotide and amino acid patterns suggest the existence of governing networks of biological organization. Detection and analysis of such networks in sets of sequences should increase our understanding of influenza biology and may be useful for the design of vaccines.