Development and Evaluation of a Commercial Sequence-Based Strain Typing Service for Listeria monocytogenes


Introduction
Listeria monocytogenes is an environmentally ubiquitous bacterium capable of contaminating various foods, particularly dairy products and processed meats, but also smoked fish and raw fruits and vegetables [1,2]. Contamination is aided by its ability to grow at low temperatures and its tolerance to freezing, drying, and heat; long-term persistence in food processing facilities has been well documented [3][4][5][6]. Upon ingestion, L. monocytogenes can cause severe infection, particularly in immunocompromised individuals, the elderly, pregnant women, and neonates [7]. Its virulence derives from its ability to survive and grow within host phagocytes and to spread aggressively through tissues, mediated by internalin proteins on its surface and induction of host actin polymerization [8,9]. Although most cases of food-borne listeriosis are sporadic, there have been multiple outbreaks in recent years, including one in 2011 responsible for 147 infections and 33 deaths across 28 states that was traced to contaminated cantaloupe (www.cdc.gov/listeria/outbreaks).
Tracking down the source of food-borne listeriosis is a challenging task that begins with laboratory culture (enrichment followed by plating) and species identification (to distinguish L. monocytogenes from its non-pathogenic sister species). This may be followed by conventional serotyping, although its limitations, including low resolution (serotypes 4b, 1/2a, and 1/2b predominate) and high cost, have largely led to its replacement with molecular typing systems. Among the latter, the gold standard is pulsed-field gel electrophoresis (PFGE), which compares fragment lengths following restriction enzyme digestion of chromosomal DNA [10]. Its primary advantage is high resolution, but its limitations include lengthy turnaround time, technical complexity, high cost, and the inherently low portability of length data, although the last of these has been addressed through strict standardization and pattern analysis algorithms as implemented in the PulseNet system [11]. The limitations of serotyping and PFGE have encouraged the development of numerous alternative L. monocytogenes typing methods [12][13][14]. The most promising and practical of these is multilocus variable number of tandem repeats analysis (MLVA), first adapted to L. monocytogenes by Murphy et al. [15] and subsequently by other laboratories [16][17][18][19][20][21]. MLVA typically employs multiplex PCR and capillary electrophoresis to assess length variation due to insertions and deletions (indels) in 9 or so loci that contain tandem repeats, which are typically polymorphic due to slippage between repeat units during DNA replication [22].
Tandem repeats in complete genome sequences were identified using the Tandem Repeats Database (TRDB; https://tandem.bu.edu). BLASTN searches were conducted on the NCBI website (www.ncbi.nlm.nih.gov) against the NCBI Genomes database, and as needed against the WGS and Nucleotide collection databases.
Downloaded sequences were trimmed to common termini and aligned with clustalw2 (www.ebi.ac.uk/Tools/msa/clustalw2) in PHYLIP format using default parameters. Alignments were analyzed using dnapars (DNA parsimony), and dendrograms were constructed using drawgram, both from the PHYLIP package (http://evolution.genetics.washington.edu/phylip.html). Simpson's index of diversity was calculated using the formula D = 1 - [Σ n(n-1)] / [N(N-1)], where n = number of strains with a given allele, N = number of (epidemiologically unrelated) strains, and the sum is taken over all alleles. PCR-based serogroups were determined by BLASTN searches for serogroup-specific genes [41]. For strains lacking experimentally determined PFGE profiles but having complete genome sequences, profiles were modelled in silico (http://insilico.ehu.es/digest/index.php?mo=Listeria).
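The diversity calculation described above can be sketched as follows. This is a minimal illustration, assuming each epidemiologically unrelated strain is represented by one allele label; the function name and the example labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def simpsons_diversity(alleles):
    """Simpson's index of diversity, D = 1 - sum(n(n-1)) / (N(N-1)).

    `alleles` holds one allele label per (unrelated) strain.
    """
    total = len(alleles)  # N: number of strains
    if total < 2:
        raise ValueError("at least two strains are required")
    counts = Counter(alleles)  # n for each allele
    same_allele_pairs = sum(n * (n - 1) for n in counts.values())
    return 1 - same_allele_pairs / (total * (total - 1))


# Hypothetical example: six strains carrying four distinct alleles.
example = ["A", "A", "B", "C", "C", "D"]
print(round(simpsons_diversity(example), 3))
```

D approaches 1 when nearly every strain carries its own allele and falls to 0 when all strains share one allele.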

Strains and culture conditions
A diverse set of 60 L. monocytogenes strains from the NRRL collection, including representatives of each lineage and serogroup, was obtained from T. Ward (USDA-ARS, Peoria, IL). Strains 1875 and 1877 are laboratory-generated deletion derivatives of F2365 [42], and ATCC strain 19115 was obtained from J. Brewster (USDA-ARS, Wyndmoor, PA). Strains were cultured on brain heart infusion agar at 37 °C. As per website guidelines (www.microbitype.com/submissionguidelines), isolated colonies were suspended in the provided buffer-containing tubes, bacteria were heat inactivated by incubating at 100 °C for 15 min, and tubes were transported to MicrobiType (Plymouth Meeting, PA) by overnight courier at ambient temperature.

PCR and sequence analysis
Proprietary primers for PCR and sequencing were designed based on conserved regions identified by clustalw2 analyses of LmMT1 and LmMT2 loci, and synthesized by IDT (Coralville, Iowa). DNA was purified from heat-inactivated lysates, amplified with Taq polymerase, and subjected to Sanger dideoxynucleotide sequencing using proprietary modifications of standard methods. Chromatograms were visually scanned, and sequences edited as needed. All new LmMT1 sequences generated in this study have been submitted to GenBank with accession numbers KT626013-KT626044.

Identification of candidate typing loci LmMT1 and LmMT2
Tandem repeats were identified within the complete L. monocytogenes genome sequences for serotype 4b strain J1776 (GenBank accession CP006598) and serotype 1/2a strain EGD (accession HG421741) using TRDB (https://tandem.bu.edu). The repeats plus 500 nucleotides upstream and downstream flanking sequence were used as queries in BLASTN searches of the NCBI Genomes (Organism: Listeria monocytogenes) database which, as of March 2015, included 109 strains. Each repeat-containing region was evaluated for degree of polymorphism, presence in all or nearly all strains, and total length <1000 nucleotides to permit coverage by a single sequencing reaction. Two promising candidates were identified that satisfied these criteria.
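The query-construction step above can be sketched as follows. This is a minimal sketch, assuming the genome is held as a plain string and the repeat is given by 0-based, end-exclusive coordinates; the function names are hypothetical, and the length check simply restates the <1000 nucleotide criterion rather than reproducing how candidate loci were actually measured.

```python
def region_with_flanks(genome, start, end, flank=500):
    """Return the repeat at genome[start:end] plus up to `flank` nt on each side.

    Flanks are truncated at the sequence ends.
    """
    if not (0 <= start < end <= len(genome)):
        raise ValueError("repeat coordinates are out of range")
    left = max(0, start - flank)
    right = min(len(genome), end + flank)
    return genome[left:right]


def fits_single_read(locus_length, max_length=1000):
    """True if a candidate locus is shorter than the single-read limit."""
    return locus_length < max_length


# Hypothetical toy genome with three copies of a 9-nt repeat unit.
toy_genome = "A" * 600 + "CCGGTAGAT" * 3 + "T" * 600
query = region_with_flanks(toy_genome, 600, 627)
print(len(query))
```

The resulting string would then serve as the BLASTN query; the search itself is not shown here.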
The inherently higher portability of DNA sequence-based methods, facilitating day-to-day and lab-to-lab comparisons, is exploited by methods including multilocus sequence typing (MLST), which analyzes 7 loci [28], and the related multi-virulence-locus sequence typing (MVLST), which analyzes 6 loci [29]. These methods analyze single nucleotide polymorphisms (SNPs) within relatively conserved loci, and hence are more useful for analyzing evolutionary trends than for epidemiology. Furthermore, their multilocus requirement adds to their complexity and cost. Two additional methods representing variations on sequence analysis include multilocus genotyping (MLGT), which employs >100 probes and flow cytometry to detect the presence or absence of previously documented SNPs [30], and a microarray which detects the presence or absence of >100 genes previously shown to vary among strains [31]. Development of both of these methods hinged on genomics; i.e., the availability in recent years of whole genome sequences (WGS) for large numbers of diverse L. monocytogenes strains. Indeed, WGS analysis itself is being explored as a direct approach to L. monocytogenes outbreak detection and investigation [32][33][34][35]. The sequencing technologies commonly employed for WGS generate large numbers of short reads which are subsequently analyzed for SNPs, analogous to MLST and MVLST but providing considerably higher resolution. Unfortunately, short reads preclude the unambiguous sequence assembly of some tandem repeat regions. A general concern involves the substantial investments in equipment, reagents, and trained personnel required to implement MLGT, microarrays, and WGS as routine typing tools.
A more practical alternative to the multilocus or WGS approaches described above would be sequence-based typing of one (or two) highly polymorphic loci. Current representatives of this approach include Staphylococcus aureus spa typing [36], Streptococcus pyogenes emm typing [37], Neisseria meningitidis porA and fetA typing [38], and Campylobacter jejuni flaA typing [39]. These loci were selected largely based on their previously characterized roles as immunodominant antigens or virulence factors in these four pathogens. Identifying highly informative loci for sequence-based typing of other bacterial pathogens might benefit from a less biased approach. For example, Miya et al. [40] identified several promising loci for sequence-based typing of L. monocytogenes based on MLVA data generated in their lab.
The approach used here to identify promising loci for sequence-based typing exploits the current availability of numerous genome sequences and, analogous to MLVA, focuses on tandem repeats as major contributors to polymorphism. In addition to indels, it was anticipated that these tandem repeat loci, consistent with their length polymorphism, would also exhibit higher rates of SNPs within flanking regions and within the repeats themselves.
The goal was to develop and evaluate a cost-effective but uncompromising outsourcing option for L. monocytogenes strain typing that could be used by food industries to monitor and track down sources of contamination, and by public health and clinical labs for surveillance and outbreak detection and investigation. In addition to identifying promising loci, it was critical to develop a simple and safe procedure for sample submission, robust approaches to sequencing these samples, and a format for typing results that emphasizes interpretability and utility.

Bioinformatics
Clustalw2 alignments of LmMT1 and LmMT2 sequences from representative strains are shown in Figure 1. As expected, polymorphisms in these loci are mediated primarily by indels involving the tandem repeats (Figure 1A and 1B). However, there are also SNPs within the repeat regions, reflecting redundancy in the genetic code. Alignments of full-length LmMT1 (Figure 1C) and LmMT2 (not shown) sequences from serotype 4b and 1/2a strains reveal substantial polymorphism extending beyond the repeats to both flanking sequences, primarily SNPs but also additional indels. Thus, analysis of this region by length polymorphism alone, as is the case with MLVA, would be much less informative than sequence analysis that weighs both indels and SNPs.
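The contrast between length-only and sequence-based comparison can be illustrated with a short sketch. It assumes two sequences taken from the same alignment (equal length, "-" as the gap character) and simply counts SNP columns and contiguous indel events; the function name and example sequences are hypothetical, and no particular weighting scheme from the parsimony analysis is implied.

```python
def compare_aligned(seq_a, seq_b, gap="-"):
    """Count SNP columns and contiguous indel events between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    snps = 0
    indel_events = 0
    gapped_side = None  # which sequence is currently inside a gap run
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap and b == gap:
            continue  # column is gapped in both sequences; uninformative
        if a == gap or b == gap:
            side = "a" if a == gap else "b"
            if side != gapped_side:
                indel_events += 1  # a new gap run starts here
            gapped_side = side
        else:
            gapped_side = None
            if a != b:
                snps += 1
    return snps, indel_events


# Hypothetical pair: one 9-nt repeat unit deleted, plus one flanking SNP.
allele_1 = "ACGTCCGGTAGATCCGGTAGATTTGA"
allele_2 = "ACGTCCGGTAGAT---------TTGC"
print(compare_aligned(allele_1, allele_2))
```

A fragment-length comparison of this pair would register only the 9-nt difference, whereas the sequence comparison also captures the flanking SNP.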

Phylogenetic analyses
All full-length LmMT1 and LmMT2 sequences from strains within the NCBI Genomes database were downloaded, aligned with clustalw2, and subjected to DNA parsimony analysis, weighting both indels
revealed that this CCGGTAGAT repeat was included in several validated MLVA assays, where its length polymorphism yielded the highest diversity index of all repeats; specifically, 0.87 to 0.93 [15,17-20]. Furthermore, LmMT1 includes the 0.3 kbp TR1 region analyzed by Miya et al. [40] that yielded a diversity index of 0.95 in sequence-based typing of 70 strains, mostly from Japan.
The second candidate, LmMT2, involves a 0.7-0.8 kbp region encoding Asp-Ala repeats (13 to 40 residues) within a putative peptidoglycan-binding protein (gene lmo1799 in EGD). This GATGCR repeat was also included in several validated MLVA assays, where its length polymorphism yielded a diversity index of 0.80 to 0.88 [16,18-21]. Similarly, LmMT2 includes the 0.5 kbp TR2 which yielded a diversity index of 0.91 in the Miya et al. [40] studies noted above. Interestingly, this same gene encodes a second polymorphic Asp-Ala repeat which, however, was a less promising typing target due to its excessive length in many strains.
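As an illustration of how copy number of such a repeat can be read directly from sequence, the following minimal sketch counts the longest uninterrupted run of a repeat unit. The degenerate unit GATGCR is written as the regular expression GATGC[AG] (R = A or G); the function name and test sequence are hypothetical, and real loci may contain imperfect units that this simple pattern would not bridge.

```python
import re


def longest_repeat_run(sequence, unit="GATGC[AG]"):
    """Return the copy number of the longest uninterrupted run of `unit`."""
    run_pattern = re.compile(f"(?:{unit})+")
    best = 0
    for match in run_pattern.finditer(sequence.upper()):
        # Each run is a concatenation of units, so counting unit matches gives copies.
        copies = len(re.findall(unit, match.group(0)))
        best = max(best, copies)
    return best


# Hypothetical fragment with five consecutive GATGCR units.
fragment = "TTT" + "GATGCA" * 3 + "GATGCG" * 2 + "CCC"
print(longest_repeat_run(fragment))
```

The same function can be applied to the LmMT1 repeat by passing unit="CCGGTAGAT".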
For LmMT2, phylogenetic analysis (Figure 3) resolved 38 alleles from the full-length sequences for 69 strains represented in the NCBI Genomes database. (The reduced number, 69 strains compared to 109 for LmMT1, is largely due to reduced representation of R8 strains.) Again, based on their annotations and literature searches, it was estimated that 59 of these strains are epidemiologically unrelated. This value combined with 38 alleles yielded a Simpson's index of diversity of 0.98. Consistent with this slightly lower value, several sets of strains that were resolved with LmMT1 (Figure 2) were clustered with LmMT2.
Although their intended use is for epidemiology rather than evolutionary analysis, it is noteworthy that both LmMT1 and LmMT2 dendrograms divide the strains into three distinct groups representing lineages I (serotype 4b complex, including 4b, 4d, and 4e), II (serotype 1/2a complex, including 1/2a, 1/2c, 3a, 3c), and III/IV (serotypes 4a, 4c), as previously described [30,43].

LmMT1 typing of epidemiologically related strains with comparison to PFGE
In light of its higher diversity index and ability to resolve closely related strains as noted above, LmMT1 was selected for further analysis. Below it is compared to PFGE with respect to four well-characterized outbreaks:
(1) Identical LmMT1 sequences, unique relative to all other strains in the NCBI Genomes and WGS databases, were obtained for four serotype 1/2a strains: two from a 1988 outbreak (F6900, F6854) and two from a 2000 outbreak (J0161, J2818) traced to the same Texas meat processor [5]. These strains are also indistinguishable by PFGE [5], and nearly identical based on whole genome SNP analysis [34]. In addition to confirming their epidemiological connection, this lack of variation over 12 years is remarkable, and demonstrates the stability of LmMT1 typing.
(2) The two serotype 1/2b strains (R2-502 and R2-503) from the 1994 Illinois outbreak traced to chocolate milk [44] have identical, and unique, LmMT1 sequences. All strains associated with this outbreak analyzed by PFGE also had PFGE profiles that were indistinguishable or nearly so [44].
(3) The two serotype 1/2a strains (08-5578, 08-5923) from the 2008 Ontario outbreak traced to ready-to-eat meat have identical, unique LmMT1 sequences. By PFGE, these two strains are nearly indistinguishable, differing by a single AscI band which was shown by genome sequencing to be due to prophage insertion, a recognized source of PFGE profile variation in otherwise related strains [32].
(4) Four serotype 4b strains associated with the 2002 outbreak in northeastern U.S. traced to a Pennsylvania poultry processor [45] are indistinguishable by PFGE and MVLST [46], and also have identical LmMT2 sequences (Figure 3). They are, however, resolved by LmMT1 (Figure 2) into two closely related pairs, human/food strains J1776/J1926 and processing plant/environment strains J1816/J1817, differing by a 9 bp indel within the repeat region (Figure 4). This split is supported by analysis of additional indel-based polymorphic loci (Figure 4). Furthermore, although both pairs have unique LmMT1 sequences relative to all other NCBI Genomes and WGS strains, the J1816/J1817 pair differs only by two SNPs from recent retail deli isolates such as R8-5726 (Figure 2). These data cast doubt on the presumed connection of environmental strains J1816 and J1817 to the 2002 outbreak [47].

Typing of non-monocytogenes Listeria species
While human pathogenesis is limited to L. monocytogenes, Listeria ivanovii is a pathogen of ruminants; additional species are non-pathogenic but often co-isolate with, and must be differentiated from, L. monocytogenes. Thus, it would be advantageous if LmMT1 or LmMT2 typing clearly distinguished between these species. Eleven strains of four non-monocytogenes Listeria species are represented in the NCBI Genomes database (as of March 2015). Among these, the LmMT1 locus is present in only Listeria innocua, where it resolves all three strains (Figure 5A). The LmMT2 locus, on the other hand, is present in all 11 non-monocytogenes strains, 10 of which have complete sequences and are resolved into 8 alleles (Figure 5B). For both LmMT1 and LmMT2, the non-monocytogenes strains are clearly resolved from L. monocytogenes lineages I, II, and III/IV strains.

Laboratory evaluation
In addition to identifying promising loci, the development of a practical outsourcing option for strain typing relies on a simple and safe procedure for sample submission and robust approaches to sequencing these samples. Non-thermophilic and non-spore-forming bacteria are effectively inactivated by heating to 100 °C, and in this form do not require expensive and cumbersome biohazard packaging and shipping. To test the compatibility of heat inactivation with LmMT1 and LmMT2 typing, isolated colonies from L. monocytogenes strains representing lineages I (serotypes 1/2b and 4b), II (1/2a and 1/2c), and III (4a) were prepared, submitted, and analyzed as described in Materials and Methods. The resulting chromatograms and sequences (representative results in Figure 6) were of uniformly high quality for both LmMT1 and LmMT2.
Using this protocol, LmMT1 analysis was extended to an additional 56 L. monocytogenes strains and 1 L. innocua strain from the USDA-ARS collection, and the resulting dendrogram is shown in Figure 7. Of the 62 total strains, 34 defined 32 new alleles (i.e., unique LmMT1 sequences). The remaining 28 had LmMT1 sequences identical to 12 NCBI database strains (underlined in Figure 7). These include: strain F2365 and its laboratory-generated derivatives 1875 and 1877, which matched database F2365; strain ScottA, which matched database Scott A; strain 33419 (derived from J0161), which matched database F6854 (epidemiologically related to J0161 as discussed above); and strain 33233 (derived from H7858), which matched database H7858. All of these matches were expected, and demonstrate that routine passaging does not alter LmMT1 sequence. The 59 epidemiologically unrelated strains (excluding L. innocua, 1875, and 1877) defined 44 LmMT1 alleles, and yielded a Simpson's index of diversity of 0.98.

Discussion
The goal of this study was to develop and validate a commercial sequence-based strain typing service for the foodborne pathogen L. monocytogenes that meets the following requirements: (1) resolution comparable to PFGE, the current gold standard; (2) clustering consistent with epidemiological relatedness, evolutionary lineage, and serotype; (3) data portability to facilitate day-to-day and lab-to-lab comparisons; (4) readily interpretable results that reference public domain databases; (5) simple and safe sample submission; (6) turnaround time of 2 to 3 days; and (7)  genomics-based approach was employed which led to the identification of LmMT1 and LmMT2. These tandem repeat-containing loci were present in all L. monocytogenes genomes, and could be amplified and sequenced by robust, inexpensive methods that are compatible with samples submitted as heat-inactivated colonies. Both demonstrate complex patterns of strain-dependent indels and SNPs that can be readily compared, using standard bioinformatics tools, to publicly available NCBI sequence databases currently representing >300 total strains. Importantly, LmMT1 sequence analysis was demonstrated here to provide strain resolution comparable to PFGE, and hence sufficient for outbreak detection and investigation in the public health sector and tracking down contamination in the food processing sector. Furthermore, LmMT1 and LmMT2 include the more limited TR1 and TR2 regions, respectively, previously shown by Miya et al. [40] using a distinct set of strains to exceed the resolution obtained by virulence gene-based MLST. Finally, the PCR-based technology behind LmMT1 and LmMT2 typing provides the potential to be used with unpurified and mixed samples, while technologies such as PFGE and whole genome sequencing rely on pure cultures. This is highly relevant in light of the trend toward culture-independent diagnostic methods [48].
Two approaches are currently used to summarize and share typing data. Dendrograms (e.g., Figure 2) are the most informative, as they can reveal epidemiological connections between isolates while also providing evolutionary perspective. A second, space-saving approach is to apply a type designation in a nomenclature format that, unfortunately, is often cryptic and uninformative (i.e., a new type is arbitrarily assigned the next highest number). We propose a combination of these two approaches for sequence-based single locus typing systems such as LmMT1, where identical matches to NCBI database strains can be readily determined by BLASTN search. For example, USDA-ARS strains 33868 and 33873 are designated type LmMT1:N1-017 since they share their LmMT1 sequence with database strain N1-017 (Figure 7). (If there are multiple strains with that sequence in the database, as is the case for N1-017, the first one deposited is given priority.) On the other hand, strain 57066 (and 32 additional USDA-ARS strains) had a unique LmMT1 sequence (deposited in NCBI with accession numbers KT626013-KT626044), and hereafter strains with matching sequence will be designated type LmMT1:57066. These designations are intrinsically informative, since the strain name (e.g., N1-017) can be used to search sequence, literature, and internet databases. Of course, the utility of this type designation system would be enhanced by consistent and comprehensive annotation of NCBI sequence files.
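The proposed naming rule can be sketched in a few lines. This is a minimal illustration, assuming the reference database is available locally as a list of (strain name, sequence) pairs ordered by deposit date, and using exact string identity as a stand-in for the BLASTN identity search described above; the function name and toy sequences are hypothetical.

```python
def designate_type(query_strain, query_seq, database, locus="LmMT1"):
    """Return a type designation of the form '<locus>:<strain name>'.

    `database` is a list of (strain_name, sequence) pairs, earliest deposit first,
    so the first identical match takes priority. If no database sequence is
    identical, the query strain names a new type.
    """
    normalized = query_seq.upper()
    for strain_name, sequence in database:
        if sequence.upper() == normalized:
            return f"{locus}:{strain_name}"
    return f"{locus}:{query_strain}"


# Hypothetical toy database; real entries would be full-length locus sequences.
toy_database = [("N1-017", "ACGTCCGGTAGAT"), ("later-deposit", "ACGTCCGGTAGAT")]
print(designate_type("33868", "acgtccggtagat", toy_database))
print(designate_type("57066", "ACGTCCGGTAGATCCGGTAGAT", toy_database))
```

Keeping the database in deposit order makes the priority rule implicit in the search order, which avoids a separate tie-breaking step.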