Joel K Weltman*
Department of Medicine, Alpert Medical School, Brown University, USA
Received Date: June 25, 2013; Accepted Date: June 29, 2014; Published Date: July 01, 2014
Citation: Weltman JK (2014) An Immunobioinformatic Comparison of Influenza A Subtype Hemagglutinins. J Med Microb Diagn 3: 135. doi: 10.4172/2161-0703.1000135
Copyright: © 2014 Weltman JK. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
The purpose of this research is to identify nucleotide and amino acid positions that are essential for the function of the HA gene and its protein products. The metric for sequence variability was Hamming distance, determined for all possible inter-subtype HA sequence pairs of H1N1, H3N2 and H6N1 HA sequences in the complete NCBI HA gene datasets. Almost all (97.22%) of the nucleotide positions at which the Hamming distance was zero were in the HA2 domain of the HA gene; the invariant nucleotides occupied second and first codon positions except for a tail-region encoded invariant tryptophan. In contrast with the results at the nucleotide level, the patterns of epitope distribution in the encoded HA proteins were similar except for a 25-amino acid sequence (283-307) in the HA1 region of HA H6N1. These results demonstrate the occurrence of similar organization of immunological epitopic biopatterns in influenza A hemagglutinins at the protein level, even in the face of large differences in the sequences of encoding nucleotides and encoded amino acids.
Influenza A; Subtypes; Hemagglutinin; Bioinformatics; Hamming distance; Bepipred; Epitopes
The hemagglutinin (HA) gene of the influenza A virus encodes proteins that enable the influenza virus to bind to sialic acid on target cell membranes and to be internalized by those cells through a process involving fusion of the cell lipid membrane with the viral lipid membrane [1,2]. After synthesis of the hemagglutinin protein (HA0) encoded by the HA gene, the HA0 protein is cleaved by cellular proteases into a signal peptide, an HA1 protein and an HA2 protein. The HA1 protein contains the recognition site for binding the virus to the sialic acid [4,5] on the membrane of the target cell. Internalization of the virus into the target cell takes place by a process of membrane fusion, mediated by the HA2 protein. The influenza virus thus enters into the endosomes of the cells where, following lowering of pH, the virus replicates. The N-terminal signal peptide of the HA0 protein is involved with translocation of the nascent influenza virus across the endoplasmic reticulum membrane of the infected cell. Other regions encoded by the HA gene, mainly in the HA2 domain, are involved with packaging of the (-) RNA and the proteins of the virus into the functional viral structure, although participation of HA2 may not be critical in the packaging process. Immunogenic regions of the HA1 protein, involved in binding the virus to the host cell surface, are important targets in vaccine development [9,10].
A purpose of this research is to help identify patterns of organization in the nucleotide positions of the influenza A HA gene and the HA proteins. Reduced variability at such positions may reflect biological constraints on that variability, especially given the high mutation rate of influenza viruses and the lack of error-correction. Nucleotide variability at second codon positions reflects variability of protein sequence since there are no degenerate codons with mutations in the second position.
In this research on sequence variation in the influenza A virus, the HA genes of subtypes H1N1, H3N2 and H6N1 are analyzed. Subtypes H1N1 and H3N2 are the causes of epidemics and pandemics, while H6N1 has posed a recent threat. The research presented here uses Hamming distance [15,16] as the metric for comparing the HA gene nucleotide variation. Hamming distance does not require prior realignment of the sequences according to statistical criteria and is defined only for sequences of equal length. Hamming distance has been used to analyze the HA gene of individual HA subtypes [17,18] but, to my knowledge, this is the first report of the simultaneous application of Hamming distance to HA genes of multiple subtypes. The second codon Hamming distance distributions reflect differences in protein sequence. Despite the differences in protein sequences, the patterns of B-cell linear epitope distributions remained intact in the encoded HA proteins. Thus, it is reported here that nucleotide variation in regions of the HA gene is non-randomly distributed, and it is proposed that these observed variational patterns reflect host immunological and other biologically significant evolutionary forces that act on the HA proteins.
Sequence sorting, codon translation, data analysis and plotting were performed with Python 2.7 (EPD 7.3-1, 64-bit), Numpy 1.6.1, Scipy 0.10.1 and Biopython 1.61. Consensus sequences were determined with Jalview. Hamming distances [15,16] were computed with Python 2.7.3 on the computer facilities of the Brown University Center for Computation and Visualization (CCV) for all possible inter-subtype sequence pairs. Hamming distances at first, second and third codon positions were obtained by decimative multiplication, as previously reported for arrays of information entropy.
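One plausible reading of "decimative multiplication" is multiplication of a per-position array by a repeating 0/1 codon mask (100, 010 or 001), so that only one codon position contributes. The sketch below illustrates that reading; the function name `decimate` and the use of `np.tile` are assumptions for illustration, not the original implementation.

```python
import numpy as np

def decimate(values, frame):
    """Zero out all but one codon position of a per-nucleotide array.

    frame is "100", "010" or "001" (first, second or third codon position).
    The array length is assumed to be a whole number of codons.
    """
    values = np.asarray(values)
    if len(values) % 3 != 0:
        raise ValueError("length must be a multiple of 3")
    # Repeat the 3-element mask once per codon, then multiply element-wise.
    mask = np.tile([int(c) for c in frame], len(values) // 3)
    return values * mask

# Cumulative curve for second codon positions of a toy 2-codon array.
second = decimate([5, 6, 7, 8, 9, 10], "010")
cumulative = np.cumsum(second)
```

Applying `np.cumsum` to the decimated array gives a cumulative curve restricted to one codon position, in the manner of the decimative cumulative plots.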
The entire set of FLU project influenza A HA coding region nucleotide sequences (17017) was downloaded from the NCBI Influenza Virus Resource Database in FASTA format on December 26, 2013 along with an additional special set of H7N9 HA sequences linked through the NCBI website. These nucleotide sequences are DNA versions of the influenza virus HA mRNA. Only complete sequences were downloaded and laboratory strains were excluded. The hemagglutinin (HA) gene sequences of the following influenza A virus subtypes were sorted from the complete download set: H1N1, H2N2, H3N2, H5N1, H6N1 and H7N9. It was determined that 82.85% of the H1N1 HA sequence dataset and 99.77% of the H3N2 HA gene dataset consisted of complete HA gene nucleotide segments of length 1701, with 16.63% of the H1N1 HA dataset being of length 1698 nucleotides. One of the H6N1 sequences was of length 1698 and one was of length 1704; the remaining H6N1 sequences (98.77%) were of length 1701. The HA genes of none of the other influenza A subtypes were 1701 nucleotides in length. Thus, the HA genes of influenza A subtypes H1N1, H3N2 and H6N1 of length 1701 nucleotides provided the requisite and suitable sequence datasets for a Hamming distance-based analysis.
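The length screening described above can be sketched in a few lines. This is a minimal illustration that assumes the sequences of one subtype are already loaded as plain strings; the helper names `length_profile` and `keep_length` are hypothetical.

```python
from collections import Counter

def length_profile(seqs):
    """Percentage of sequences at each observed length."""
    counts = Counter(len(s) for s in seqs)
    total = float(sum(counts.values()))
    return {length: 100.0 * n / total for length, n in counts.items()}

def keep_length(seqs, length=1701):
    """Retain only sequences of the requested length (1701 nt in this study)."""
    return [s for s in seqs if len(s) == length]

# Toy data: half the sequences have length 3.
toy = ["AAA", "AAA", "AAAA", "AA"]
profile = length_profile(toy)
kept = keep_length(toy, length=3)
```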
The entire set of FLU project influenza A HA protein sequences (13944) was downloaded from the NCBI Influenza Virus Resource Database in FASTA format on May 27, 2014, yielding 5905 H1N1 intact HA protein sequences (yield=96.79%), 5268 H3N2 intact HA protein sequences (yield=98.39%) and 184 H6N1 intact HA protein sequences (yield=99.46%).
Hamming distance (d) at a single nucleotide position of two nucleotide sequences (i, j) of equal length, one sequence from set i and one sequence from set j, is defined as d=0 if nucleotide(i)=nucleotide(j) and as d=1 if nucleotide(i) ≠ nucleotide(j):
$$d = \begin{cases} 0, & \text{nucleotide}(i) = \text{nucleotide}(j) \\ 1, & \text{nucleotide}(i) \neq \text{nucleotide}(j) \end{cases}$$
The total Hamming distance (D) at a single nucleotide position of the HA subtype sets is defined as:
$$D = \sum_{F} d$$
where d is summed over F, and F is the total number of possible pairs between sequences i and j. The summed Hamming distance (SD) between paired (i, j) HA subtype gene sequences is:
$$SD = \sum_{n} D$$
where D is summed over n nucleotide positions. In the HA genes of the three subtypes analyzed in this study, the maximum value of n is L, where L is the length of the complete H1N1, H3N2 and H6N1 HA genes and L=1701.
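The per-position total D can be obtained without enumerating every sequence pair, because the number of mismatching pairs at a position equals F minus the number of matching pairs, and the matching pairs can be counted from per-base tallies in each set. The sketch below assumes sequences are equal-length strings over A, C, G, T; any other symbol is counted as a mismatch against everything, which is a simplification. The function names are illustrative and this is not the code used in the study.

```python
import numpy as np

ALPHABET = "ACGT"

def base_counts(seqs):
    """Return an (L, 4) array of per-position counts of A, C, G, T."""
    arr = np.array([list(s) for s in seqs])  # shape (num_seqs, L)
    return np.stack([(arr == b).sum(axis=0) for b in ALPHABET], axis=1)

def total_hamming_per_position(set_i, set_j):
    """D at each position: number of (i, j) pairs whose nucleotides differ."""
    ci, cj = base_counts(set_i), base_counts(set_j)
    n_pairs = len(set_i) * len(set_j)   # F, all inter-set pairs
    matches = (ci * cj).sum(axis=1)     # pairs sharing the same nucleotide
    return n_pairs - matches

D = total_hamming_per_position(["ACG", "ACT"], ["ACG", "TCG", "ACA"])
SD = int(D.sum())                        # D summed over all positions
invariant = (np.flatnonzero(D == 0) + 1).tolist()  # 1-based positions with D = 0
```

Positions where D equals zero are those at which every inter-set pair carries the same nucleotide, which is the criterion for the invariant positions discussed in the Results.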
Prediction and detection of B-cell linear epitopes was performed on the influenza A consensus HA protein sequences with Bepipred using the recommended default cutoff score of 0.35. The Bepipred website was accessed via the Immune Epitope Database and Analysis Resource, La Jolla Institute for Allergy & Immunology, CA (http://www.iedb.org/). Percentage identity (PID) was calculated for the raw consensus protein sequences and for consensus sequences that were realigned with Clustal-Omega [24; http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi].
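For two sequences of equal length compared position by position, percentage identity can be sketched as the share of matching positions. Several PID conventions exist, particularly for the treatment of gaps in realigned sequences, so the definition below (matches divided by full length) is an assumption and not necessarily the one used here.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be of equal length")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

# Toy peptides differing at one of four positions.
pid = percent_identity("ACDE", "ACDF")
```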
FLU Project HA sequences of length 1701 nucleotides of the following subtypes (sequence numbers in parentheses) were downloaded from the NCBI Influenza Virus Resource database: H1N1 (5655), H3N2 (4107) and H6N1 (160). In order to remove sequences that could interfere with Hamming distance analysis, downloaded sequences which could not serve as templates for translation into full-length (566 amino acid) HA0 proteins were purged, yielding: H1N1 (5365; 94.87% of download), H3N2 (3989; 97.12% of download) and H6N1 (159; 99.38% of download). The (H1N1, H3N2) Hamming distance, summed over 1701 positions, was computed from 5365*3989=21,400,985 sequence pairs, the (H1N1, H6N1) Hamming distance was computed from 5365*159=853,035 sequence pairs and the (H3N2, H6N1) Hamming distance was computed from 3989*159=634,251 sequence pairs. Hamming distances between the pairs of these subtype sequences are shown in Table 1. The median Hamming distance of 1223 observed between HA genes of subtypes H1N1 and H3N2 represents a frequency of 0.7190 summed over the 1701 nucleotide positions. Using that frequency as a probability for a binomial approximation leads to a predicted standard deviation that is 2.54 greater than the observed standard deviation. Similarly, the standard deviations binomially predicted for the (H1N1, H6N1) pairs and for the (H3N2, H6N1) pairs are 3.20 and 2.75 greater than those observed, respectively. The summed Hamming distance of 883 for the (H1N1, H6N1) paired datasets is less than those observed for both the (H1N1, H3N2) paired datasets and the (H3N2, H6N1) paired datasets; there was no overlap between the (H1N1, H6N1) Hamming distance range and those of the other two dataset pairs. The probability (p) values for the observed absolute non-overlap between the (H1N1, H6N1) Hamming distance and the Hamming distances of the (H1N1, H3N2) and (H3N2, H6N1) datasets are: p<(1/853,035)*(1/21,400,985)=5.4777×10^-14 and p<(1/853,035)*(1/634,251)=1.8483×10^-12, respectively.
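The purge of sequences that cannot template a full-length HA0 could be approximated by a simple reading-frame check. The criteria below are assumptions chosen for illustration (an ATG start codon, unambiguous bases only, no internal stop codon, and a terminal stop codon so that 566 codons plus the stop account for 1701 nucleotides); the actual filter used in the study may have differed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def encodes_full_length_ha0(seq, n_codons=567):
    """True if seq reads as ATG ... stop with no internal stop codon.

    n_codons counts the terminal stop codon: 566 amino acids + 1 stop = 567.
    """
    seq = seq.upper()
    if len(seq) != 3 * n_codons or set(seq) - set("ACGT"):
        return False
    codons = [seq[k:k + 3] for k in range(0, len(seq), 3)]
    return (codons[0] == "ATG"
            and codons[-1] in STOP_CODONS
            and not any(c in STOP_CODONS for c in codons[1:-1]))

# Toy 3-codon examples (n_codons lowered for readability).
ok = encodes_full_length_ha0("ATGAAATAA", n_codons=3)
premature_stop = encodes_full_length_ha0("ATGTAATAA", n_codons=3)
```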
Table 1: Hamming Distances Summed Over the 1701 Nucleotide Positions of the Hemagglutinin Gene. Influenza A subtypes of the paired gene sequences of the sequence datasets are identified in parentheses. The predicted values were obtained from 1×10^6 pseudorandom Bernoulli binomial trials.
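The binomially predicted standard deviation can be illustrated for the (H1N1, H3N2) case, where n=1701 positions and p=0.7190. The sketch compares the closed-form value sqrt(n·p·(1−p)) with a pseudorandom simulation; the seed, function names and use of NumPy's `default_rng` are assumptions, not the original procedure.

```python
import math
import numpy as np

def binomial_sd(n, p):
    """Closed-form standard deviation of a Binomial(n, p) count."""
    return math.sqrt(n * p * (1.0 - p))

def simulated_sd(n, p, trials=1_000_000, seed=0):
    """Standard deviation of pseudorandom Binomial(n, p) draws."""
    rng = np.random.default_rng(seed)
    return float(rng.binomial(n, p, size=trials).std())

analytic = binomial_sd(1701, 0.7190)    # roughly 18.5 mismatching positions
simulated = simulated_sd(1701, 0.7190)  # should agree closely with analytic
```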
The cumulative distributions of Hamming distances (SD) over the complete lengths of the H1N1, H3N2 and H6N1 HA genes are shown in Figure 1. Hamming distance is distributed linearly between HA (H1N1, H3N2) sequence pairs (Figure 1a) and between HA (H3N2, H6N1) sequence pairs (Figure 1c) but has a clearly nonlinear distribution in HA (H1N1, H6N1) sequence pairs (Figure 1b). HA (H1N1, H3N2) gene sequence pairs and HA (H3N2, H6N1) pairs are well described as linear summations of Bernoulli binomial functions, but correlation of a binomial function with the observed summed Hamming distance along gene length in HA (H1N1, H6N1) pairs has a large residual. The nonlinearity in Hamming distance distribution along the length of the HA genes occurs mainly in first (decimative 100) and second (decimative 010) codon positions (Figure 2), beginning at about nucleotide number 800.
Figure 1: Cumulative Hamming Distances (ΣD) Between Paired Influenza A Subtype Hemagglutinins. A. Data are shown for (H1N1, H3N2) sequence pairs (left), B. (H1N1, H6N1) sequence pairs (middle) and C. (H3N2, H6N1) sequence pairs (right). Black lines (observed ΣD); red dashed lines (approximations by summed single binomial function).
Figure 2: Decimative Cumulative Hamming Distances (ΣD) Between Paired Influenza A Subtype Hemagglutinins. Data are shown for (H1N1, H3N2) sequence pairs, (H1N1, H6N1) sequence pairs and (H3N2, H6N1) sequence pairs for decimative frame 100 (first codon position; left), decimative frame 010 (second codon position; center) and decimative frame 001 (third codon position; right).
Percent identity (PID) matrices are given in Table 2 for the native and for the realigned H1N1, H3N2 and H6N1 HA0 proteins. Considerable differences among the HA0 sequences are present even after realignment, especially between HA0 H3N2 and the HA0 proteins of the other two subtypes.
|(a) Native HA0 sequences|
|(b) Realigned HA0 sequences|
Table 2: Percent Identity (PID) Matrices for HA0 Proteins of Influenza A Subtypes H1N1, H3N2 and H6N1. (a) Native, unrealigned sequences; (b) realigned sequences.
|N(nt)||N(aa)||Codon Position||Nucleotide||HA0 Domain|
Table 3: HA Nucleotide Positions with Zero Hamming Distance Between All H1N1, H3N2 and H6N1 Sequence Pairs. Nucleotide (nt) and amino acid (aa) positions are identified by number (N) in the second and third columns. Codon positions are designated as decimation frame first codon position (100), second codon position (010) and third codon position (001) in the fourth column. The termination codon is designated as TER in row 40. The HA nucleotides are identified in the mRNA representation.
Correlations between the Bepipred Scores (B) of the 566 amino acid positions of all pairs of HA proteins are shown in Figure 3. The Spearman correlation coefficients (r) for B values of the amino acid positions of the HA proteins shown in Figure 3 are: HA (H1N1) and HA (H3N2) r=0.4497, p=1.5971e-29; HA (H1N1) and HA (H6N1) r=0.8282, p=6.2993e-144; HA (H3N2) and HA (H6N1) r=0.4287, p=1.0401e-26. All of these correlations are statistically significant.
The distributions of Bepipred scores (B) for each of the three influenza A subtype HA proteins are shown in Figure 4. Only values above the recommended cutoff (0.35) are shown. B values at positions that were significant for all three HA subtypes were summed and are also shown for reference and comparison. There were 86 amino acid positions with significant B scores jointly, in all three subtypes. Clusters of amino acids at positions with significant B values were observed at HA protein positions 141-153, 200-205, 222-228, 236-241 and 371-390. Other, smaller joint epitope clusters were distributed throughout the sequences. There was a paucity of epitopic residues in amino acid residues 283-307 of the H6N1 HA protein.
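The selection of jointly significant positions can be sketched as follows, assuming the Bepipred scores of the three consensus HA proteins are available as equal-length numeric arrays (one score per amino acid position). The function names and the toy scores are hypothetical; only the 0.35 cutoff comes from the text.

```python
import numpy as np

CUTOFF = 0.35  # recommended Bepipred default cited in the text

def joint_epitope_positions(scores_by_subtype, cutoff=CUTOFF):
    """1-based positions above cutoff in every subtype, plus summed scores there."""
    scores = np.asarray(scores_by_subtype, dtype=float)  # (subtypes, positions)
    joint = (scores > cutoff).all(axis=0)
    summed = np.where(joint, scores.sum(axis=0), 0.0)    # zero where not joint
    return (np.flatnonzero(joint) + 1).tolist(), summed

def clusters(positions):
    """Group sorted positions into (start, end) runs of consecutive residues."""
    runs, start, prev = [], None, None
    for p in positions:
        if start is None:
            start = prev = p
        elif p == prev + 1:
            prev = p
        else:
            runs.append((start, prev))
            start = prev = p
    if start is not None:
        runs.append((start, prev))
    return runs

toy = [[0.40, 0.50, 0.10, 0.60],
       [0.36, 0.90, 0.50, 0.20],
       [0.50, 0.40, 0.40, 0.70]]
positions, summed = joint_epitope_positions(toy)
runs = clusters(positions)
```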
Figure 4: Bepipred Scores (B) for Amino Acid Positions of Influenza A Subtype HA Proteins. B values are shown only for those amino acid positions with a B value above the cutoff (B> 0.35). (a) HA (H1N1); (b) HA (H3N2); (c) HA(H6N1); (d) summed B scores (ΣB) for positions that are significant (>0.35) jointly for all three subtypes.
A goal of this research is to detect nonrandom patterns in the variation of the HA gene and protein of subtypes H1N1, H3N2 and H6N1 influenza A viruses because such nonrandom patterns may be direct evidence of significant biological forces acting on those viruses. HA genes of subtypes H1N1, H3N2 and H6N1 are of epidemiological importance and provide the opportunity for detecting such nonrandom patterns by the Hamming distance metric because these genes are of equal length (see Results). Hamming distance provided a robust, yet sensitive metric, involving determination of sequence differences between all possible inter-subtype pairs in the sequence sets. Despite the commonality of sequence length, there were nucleotide differences between many of the nucleotide position pairs. Hamming distances were nonzero at 71.90% of (H1N1, H3N2) HA sequence pairs, 46.73% of (H1N1, H6N1) HA sequence pairs and at 65.75% of the (H3N2, H6N1) HA sequence pairs (Table 1). However, Hamming distances between the HA (H1N1, H6N1) sequence pairs were lower than those of the other two sets of pairs, so that the HA (H1N1, H6N1) Hamming distances were completely non-overlapping, disjoint with the HA (H1N1, H3N2) and HA (H3N2, H6N1) Hamming distances. The probabilities (p) of each of these results having occurred randomly are only p<5.4777×10^-14 and p<1.8483×10^-12, respectively.
The lower Hamming distance between HA (H1N1, H6N1) pairs, discussed above, is associated with nonlinear distribution of Hamming distance beginning at approximately nucleotide position 800 (Figure 1b), especially at nucleotides in first and second codon positions (Figure 2). The observed reduced distribution of Hamming distances at these codon positions suggests biological constraints acting at the protein level rather than synonymous mutations at the nucleotide level. Nucleotide position 800 of the HA gene is near the 3’-end of the HA1 region of the gene.
Thirty-six (36) nucleotide positions were identified at which the Hamming distance was equal to zero between all sequence pairs of the HA genes of all three influenza A subtypes (Table 3). One of the positions (848) is in the HA1 domain of the HA gene and the other 35 positions are in the HA2 domain. 61.1% of these nucleotide positions are second positions of codons and 31.1% are first positions of codons. These results are consistent with the pattern of reduced distribution of Hamming distance in (H1N1, H6N1) HA sequence pairs (Figure 2). This pattern of constrained Hamming distance suggests processes whose actions begin on the 3’-end of the HA1 domain (near nucleotide position 848) and act at the 35 positions in the HA2 domain. Such a distribution suggests that there are constraining effects on the HA0 protein, i.e., before proteolytic cleavage (1) of HA0 into HA1 and HA2 proteins.
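The correspondence between a nucleotide number, its codon position and its amino acid number follows from integer arithmetic, assuming numbering starts at the first nucleotide of the start codon (an assumption that is consistent with nucleotides 1657-1659 encoding Trp553, as stated in the text). The helper name `locate` is illustrative.

```python
FRAMES = ("100", "010", "001")  # first, second, third codon position

def locate(nt_number):
    """Map a 1-based nucleotide number to (amino acid number, decimation frame)."""
    index = nt_number - 1
    return index // 3 + 1, FRAMES[index % 3]

trp = [locate(n) for n in (1657, 1658, 1659)]  # the invariant tryptophan codon
ha1_site = locate(848)                          # the single invariant HA1 position
```

Under this numbering, position 848 is a second codon position of amino acid 283.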
The HA2 protein is known to have several functions in the influenza virus. HA2 protein interacts with the HA1 protein so as to enable HA1 to bind to sialic acid on the membrane of the target cell. HA2 participates in a membrane fusion process during internalization of influenza virus into the target cell. HA2 participates in the packaging of nascent influenza A virus particles, thereby facilitating secretion of mature virions from endosomes, although it should be noted that HA2 participation in packaging may not be essential. The HA2 protein provides a hydrophobic virus tail that anchors the virion to the virus lipid membrane coat. The nucleotide positions and corresponding amino acid positions listed in Table 3 have not yet been specifically implicated in any of these processes. It seems especially interesting that the nucleotides at positions 1657, 1658 and 1659 are uniformly U, G, and G, encoding Trp553. Trp553 is a hydrophobic amino acid located 13 amino-acid positions from the COOH-terminus of the anchoring tail region. It is postulated here that the 36 nucleotide positions with zero Hamming distance, identified in Table 3, which are absolutely conserved in all of the HA nucleotide sequences studied, participate in these, and perhaps other, biological processes that are essential to the influenza A virus. Insight into the mechanisms of nucleotide conservation at these 36 nucleotide positions can be increased by analyzing the biological effects of base substitutions at these positions on the structure and function of influenza A viruses.
Soundararajan et al. have reported epitope networking interaction patterns within realigned HA sequences and have effectively coupled immune epitope data with crystallographic analyses. The research described here is based upon analysis of HA sequences of equal length, thereby avoiding the sequence realignment step. Avoiding sequence realignment may make it possible to detect factors influencing and governing immunological epitope distribution patterns by direct, computerized application to large sequence sets in a manner analogous to that shown for consensus sequences in Figure 4.
The nucleotide substitutions discussed above, especially at second and first codon positions, represent non-degenerate substitutions that could significantly affect the biological activity of the HA0 protein and the derivative signal peptides, HA1 and HA2 proteins of the influenza A subtypes studied. Accordingly, the comparison of epitope distributions in the HA proteins of the three subtypes was undertaken. The percent identities (PID) of these HA proteins were low, even after realignment (see Results). Despite these differences in amino acid sequence, the patterns of epitopic potential were correlated (Figure 3) and were similarly distributed (Figure 4). The region with a paucity of epitopic sites, amino acid residues 283-307 of the H6N1 HA protein, lies in the HA1 region of the HA protein. The similar overall distribution of epitopic sites in the three hemagglutinins, despite the differences in amino acid sequence and composition, is consistent with the non-randomness displayed at the nucleotide level. These non-random nucleotide and amino acid patterns suggest the existence of governing networks of biological organization. Detection and analysis of such networks in sets of sequences should increase our understanding of influenza biology and may be useful for the design of vaccines.
This research was conducted using computational resources and services at the Center for Computation and Visualization, Brown University.