The Genomic Blueprint of Salmonella enterica subspecies enterica serovar Typhi P-stx-12

Salmonella enterica subspecies enterica serovar Typhi is a rod-shaped, Gram-negative, facultatively anaerobic bacterium. It belongs to the family Enterobacteriaceae in the class Gammaproteobacteria, and can persist in the human gallbladder by forming a biofilm, causing the host to become a typhoid carrier. Here we present the complete genome of Salmonella enterica subspecies enterica serovar Typhi strain P-stx-12, which was isolated from a chronic carrier in Varanasi, India. The complete genome comprises a 4,768,352 bp chromosome with a total of 98 RNA genes and 4,691 protein-coding genes, and a 181,431 bp plasmid. Genome analysis revealed that the organism is closely related to Salmonella enterica serovar Typhi strain Ty2 and Salmonella enterica serovar Typhi strain CT18, although their genome structures differ slightly.


Introduction
Salmonella enterica serovar Typhi is the Salmonella serovar that causes typhoid fever [1][2][3]. An estimated 20 million cases of typhoid fever and 200,000 deaths from this disease occur worldwide each year [4,5]. S. enterica serovar Typhi belongs to the family Enterobacteriaceae. All Enterobacteriaceae ferment glucose, reduce nitrates, and are oxidase-negative [6]. In general, S. enterica serovar Typhi is motile, produces minimal H2S, and is resistant to bile acids [7]. S. enterica serovar Typhi has three types of antigens [3]: the flagellar H antigen, associated with motility; the specific O antigen, a component of the lipopolysaccharide that is also involved in biofilm formation; and the Vi antigen, a capsular polysaccharide that acts as a major virulence factor. The Vi antigen is specific to S. enterica serovar Typhi and is encoded within Salmonella Pathogenicity Island-7 [8]. In 2003, comparative genomics of S. enterica serovar Typhi strains Ty2 and CT18 was carried out by Deng et al. [9]. That study discovered a half-genome interreplichore inversion in Ty2 relative to CT18. It was reported that S. enterica serovar Typhi Ty2 does not harbor any plasmid and hence is susceptible to antibiotics. On the other hand, S. enterica serovar Typhi CT18 carries two plasmids, one of which confers multidrug resistance. We previously published the complete genome sequence of S. enterica serovar Typhi P-stx-12 [10]. This sequencing project helps us to better understand the genome organization and the contribution of the virulence machinery in this pathogen. Here we present a summary of S. enterica serovar Typhi P-stx-12 and its unique features, together with a description of the complete genome sequencing and annotation.

Classification and features
S. enterica serovar Typhi P-stx-12 was isolated from a typhoid carrier in Varanasi, Uttar Pradesh, northern India, in 2009. This serovar is known to inhabit the Peyer's patches (lymphoid tissue) of the small intestine, as well as the liver, spleen, bone marrow, bile, and bloodstream of infected humans. Cells of S. enterica serovar Typhi P-stx-12 were Gram-negative, motile, rod-shaped, and non-spore-forming. This strain grew at an optimum temperature of 35°C to 37°C, but could tolerate temperatures between 7°C and 45°C. Strain P-stx-12 is a facultative anaerobe and utilizes glucose as the main carbon source. The pure isolate did not produce cytochrome oxidase but was able to reduce nitrate and to break down glucose by both oxidative and fermentative pathways. This strain did not produce urease. In Triple Sugar Iron medium, there was an alkaline/acid reaction with a very small amount of H2S production. Indole was not produced in peptone water. The strain was able to ferment glucose and mannitol without production of gas; however, lactose and sucrose were not fermented. The strain could be agglutinated by poly O, poly H, factors O9, H-d, and Vi antisera (data not shown). Figure 1 shows the phylogenetic neighborhood of S. enterica serovar Typhi P-stx-12 in a 16S rRNA-based tree. There were seven 16S rRNA gene copies in the genome of S. enterica serovar Typhi P-stx-12. Two of the seven copies differed from the rest by a single base substitution (G to A), so the common gene copy was used for tree building. Within the genus Salmonella, strain P-stx-12 is closely related to S. enterica serovar Typhi strain Ty2 and S. enterica serovar Typhi strain CT18. The classification and features of this organism are summarized in Table 1.

Genome sequencing and annotation
Genome project history
S. enterica serovar Typhi P-stx-12 was selected for sequencing because it was isolated from a typhoid carrier in India, where the rate of typhoid fever cases is high. This isolate was obtained from a 32-year-old male who had shown persistently high titers in the Widal test and for Vi antibody for more than one year. DNA isolation was carried out at Banaras Hindu University. This genome sequence was first published in April 2013 [10]. A summary of the project information is shown in Table 2.

Figure 1.
Phylogenetic tree highlighting the position of Salmonella enterica serovar Typhi strain P-stx-12 relative to other strains within the Enterobacteriaceae. Strains shown are those within the Enterobacteriaceae having corresponding GenBank accession numbers. The phylogenetic tree was constructed using Ribosomal Database Project [11] tree builder that utilizes the Weighbor weighted neighbor-joining tree building algorithm [12]. The bootstrap value was 100. Escherichia coli strain Z83205 was used as an outgroup.

Current classification
Phylum Proteobacteria TAS [14]
Class Gammaproteobacteria TAS [15,16]
Order Enterobacteriales TAS [17]
Family Enterobacteriaceae TAS [18][19][20]
Genus Salmonella TAS [18,21,22,23]
Species Salmonella enterica TAS [23,24]
Subspecies Salmonella enterica enterica TAS [23,24]
Gram stain negative TAS [6]
Cell shape Rod-shape TAS [

, not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence). These evidence codes are from the Gene Ontology project [27]. If the evidence code is IDA, then the property was directly observed for a living isolate by one of the authors or an expert mentioned in the acknowledgements.

Growth conditions and DNA isolation
The stool specimen from which strain P-stx-12 was isolated was collected from a known chronic typhoid carrier. To isolate the bacterium, 5 g of freshly passed, unpreserved stool was sieved through a piece of gauze to remove coarse particles. The filtrate was centrifuged at 4,000 rpm for 5 min. The pellet was washed twice with phosphate-buffered saline (pH 7.2) and suspended in 50 ml of selenite F broth for enrichment, using a modified technique that is in the process of being patented. After overnight incubation, the broth was examined for turbidity and subcultured on deoxycholate citrate agar and MacConkey agar. Genomic DNA was extracted using a phenol-chloroform and proteinase K method with some modifications [28]. The DNA preparation was checked by PCR amplification of the flagellin (fliC) gene of S. enterica serovar Typhi [29,30] and the 16S rRNA gene [31].

Genome sequencing and assembly
Whole-genome sequencing was performed with a combined strategy of 454 and Illumina sequencing technologies. A 4-kb paired-end library was constructed according to the manufacturer's instructions (454). A total of 242,499 reads were generated using the GS FLX Titanium system, giving ~18× coverage of the genome. Initial assembly of 97.09% of the reads using the Newbler assembler (Roche) resulted in ~200 large contigs within 11 scaffolds. A total of ~500 Mb of 3-kb mate-pair sequencing data were generated on an Illumina GA IIx to reach a depth of 100× coverage. These sequences were mapped to the scaffolds using the Burrows-Wheeler Aligner (BWA) [32]. Most of the gaps within the scaffolds were filled by local assembly of 454 and Illumina reads. The remaining gaps were filled by sequencing PCR products spanning the gaps using an ABI 3730xl capillary sequencer. Putative sequencing errors were checked against the coverage of 454 and Illumina reads.
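
The coverage figures quoted above can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the original analysis: it assumes that fold coverage is total sequenced bases divided by genome size, and it takes the genome size to be the finished chromosome plus plasmid reported in this paper. The function name and the derived mean 454 read length are illustrative only.

```python
# Sanity check of the reported sequencing depths (illustrative sketch).
# Assumption: fold coverage = total sequenced bases / genome size.

CHROMOSOME_BP = 4_768_352   # from the text
PLASMID_BP = 181_431        # from the text
GENOME_BP = CHROMOSOME_BP + PLASMID_BP


def fold_coverage(total_bases: float, genome_size: int) -> float:
    """Return the average number of times each base is sequenced."""
    return total_bases / genome_size


# Illumina: ~500 Mb of mate-pair data
illumina_cov = fold_coverage(500e6, GENOME_BP)

# 454: 242,499 reads at ~18x coverage imply this mean read length
# (derived value, not stated in the text)
implied_454_read_len = 18 * GENOME_BP / 242_499

print(f"Illumina coverage: ~{illumina_cov:.0f}x")
print(f"Implied mean 454 read length: ~{implied_454_read_len:.0f} bp")
```

Under these assumptions the Illumina data correspond to roughly 100× coverage, in line with the stated depth, and the 454 figures imply a mean read length of a few hundred bases.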

Genome annotation
Annotation of the S. enterica serovar Typhi P-stx-12 genome was done using a combination of ISGA (Integrative Services for Genomic Analysis) [33] and the DIYA (Do-It-Yourself Annotator) pipeline [34], which comprises Glimmer [35], tRNAscan-SE [36], RNAmmer [37], BLAST [38], and Asgard [39]. RPS-BLAST searches against the Clusters of Orthologous Groups (COG) database enabled assignment of COG functional categories to the ORFs. CLC Genomics Workbench was used to further check and improve the annotation results. Frameshifts and partial gene fragments that indicate potential pseudogenes were identified by the NCBI Submission Check tool and manually verified. Protein-coding genes were searched against the NCBI RefSeq database using BLASTP [40]. Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) regions were identified using the CRISPR Finder program [41]. PHAST (PHAge Search Tool) [42] was used to search for prophage sequences within the genome. Potential genomic islands were identified using the IslandViewer web server [43]. Comparison between different S. enterica serovar Typhi strains was done using progressiveMauve [44].

Genome properties
The complete genome of S. enterica serovar Typhi P-stx-12 contains a single circular chromosome of 4,768,352 bp with a GC content of 52.1%, and a circular plasmid of 181,431 bp with a GC content of 46.4% (Figure 2 and Figure 3). The chromosome contains 4,885 predicted genes, of which 4,691 are protein-coding genes, 22 are rRNA genes, and 76 are tRNA genes. Specific COGs were assigned to 75.34% of the genes in the chromosome, and 25% of these genes were also assigned enzyme classification numbers, linking them to 268 metabolic pathways. The properties and statistics of the genome are summarized in Tables 3 and 4. The plasmid harbors 234 protein-coding genes, of which 187 are annotated as hypothetical proteins with unknown function. The remaining genes were grouped into specific COGs, the majority of which fell into the category of information storage and processing with respect to replication, recombination and repair.
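
The chromosome gene counts above can be tallied as a quick consistency check. This is a minimal sketch using only the numbers given in the text. Interpreting the remainder as pseudogenes is an assumption, made because the difference equals the 96 pseudogenes reported in the genome comparison section.

```python
# Tally of chromosome gene counts reported in the text (illustrative sketch).

TOTAL_PREDICTED_GENES = 4885
counts = {
    "protein_coding": 4691,
    "rRNA": 22,
    "tRNA": 76,
}

# rRNA + tRNA should match the 98 RNA genes quoted in the abstract
rna_genes = counts["rRNA"] + counts["tRNA"]

# Genes not covered by the three listed categories
annotated = sum(counts.values())
remainder = TOTAL_PREDICTED_GENES - annotated

print(f"RNA genes: {rna_genes}")
print(f"Genes in listed categories: {annotated}")
# Assumption: the remainder corresponds to the pseudogenes reported elsewhere
print(f"Unlisted remainder: {remainder}")
```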

Paralog clusters
To identify paralog families, BLASTP was used to calculate all possible protein homologs in the S. enterica serovar Typhi P-stx-12 genome. Homologs that shared at least 30% amino acid similarity were selected. Paralog pairs were imported into the S. enterica serovar Typhi P-stx-12 database in Pathway Studio as a new type of interaction called "Paralog" [46]. Protein functional families were identified as clusters in the global Paralog network using the direct force layout algorithm. A biological function was assigned to each paralog cluster based on the functional annotation of its proteins (Figure 4). The major paralog clusters identified include ATPase components that are mainly involved in transport systems, transcriptional regulators, transcriptional repressors, transposases, major facilitator superfamily permeases, response regulators containing a CheY-like receiver domain and an HTH DNA-binding domain, P-pilus assembly proteins, multidrug efflux system proteins, and fimbrial-like adhesins.
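
The idea of grouping proteins into paralog families from pairwise hits can be sketched as follows. This is not the Pathway Studio workflow used in this study: instead of a force-directed layout, the sketch swaps in plain connected components (union-find) over pairs that pass the 30% threshold. The function name, the input tuple format, and the protein identifiers are hypothetical.

```python
from collections import defaultdict


def paralog_clusters(hits, min_similarity=30.0):
    """Group proteins into clusters of putative paralogs.

    hits: iterable of (query_id, subject_id, percent_similarity) tuples,
          e.g. parsed from an all-vs-all BLASTP table (format assumed here).
    Returns a list of clusters (sorted lists of IDs), largest first.
    """
    parent = {}

    def find(x):
        # Find the cluster representative, with path compression
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, similarity in hits:
        # Skip self-hits and pairs below the similarity threshold
        if query == subject or similarity < min_similarity:
            continue
        root_q, root_s = find(query), find(subject)
        if root_q != root_s:
            parent[root_s] = root_q

    groups = defaultdict(set)
    for protein in parent:
        groups[find(protein)].add(protein)
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Toy input with hypothetical protein IDs
example_hits = [
    ("p1", "p1", 100.0),  # self-hit, ignored
    ("p1", "p2", 45.0),
    ("p2", "p3", 31.0),
    ("p4", "p5", 28.0),   # below threshold, ignored
    ("p6", "p7", 60.0),
]
clusters = paralog_clusters(example_hits)
print(clusters)
```

Connected components are a coarse approximation: a single weak link merges two families, which is one reason a network layout with manual inspection, as used here, can give cleaner clusters.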

Insights into the genome
Comparisons with other fully sequenced S. enterica serovar Typhi genomes
The genome of S. enterica serovar Typhi P-stx-12 was compared with the other two published S. enterica serovar Typhi genomes, CT18 (isolated in Vietnam) and Ty2 (isolated in Russia). Comparison between these three genomes revealed that the coding genes of S. enterica serovar Typhi P-stx-12 were 84% similar to those of CT18 [47] and Ty2 [9]. The genome organization of these three strains is shown in Figure 5. The location of the genes in strains P-stx-12 and Ty2 is identical. Both have three blocks of genes that are inverted relative to strain CT18. Our observations agree with the work of Deng et al. [9], who discovered that half of the Ty2 genome was inverted relative to the CT18 genome. Nevertheless, most of the genes have the same function, indicating that these may be housekeeping genes that maintain the survival of this pathogen. In addition, strain P-stx-12 has one plasmid, which shares 169 orthologous CDSs with pHCM1, the plasmid belonging to CT18 (GenBank accession number AL513383). pHCM1 is a conjugative plasmid that encodes resistance to antimicrobial agents and heavy metals, and is similar to the IncHI plasmid R27. This further supports the hypothesis that the presence of a plasmid signifies a dynamic link between resistance and pathogenicity. Indeed, it was reported that IncHI1 plasmids were stably maintained in S. enterica serovar Typhi throughout the development of antibiotic resistance in this serovar [48]. It is worth noting that the plasmid of P-stx-12 carries genes encoding the tetracycline resistance protein and the tetracycline repressor protein TetR, possibly conferring drug resistance to this strain. This resistance protein is also found in strain CT18. On the other hand, only 96 pseudogenes were identified in this genome, fewer than in S. enterica serovar Typhi CT18 and S. enterica serovar Typhi Ty2 (> 200).

Genomic Islands (GIs) and Salmonella Pathogenicity Islands (SPIs)
There are 31 possible genomic islands (GIs) as predicted by IslandViewer (Figure 6). Analysis of these GIs revealed that most of the genes within the islands encode hypothetical proteins. Eight Salmonella Pathogenicity Islands (SPI-11, SPI-2, SPI-16, SPI-6, SPI-8, SPI-4, SPI-7 and SPI-10) were found to be embedded in these GIs, whereas the rest of the SPIs lay between the GIs. Interestingly, the proteins found in SPI-8 are located next to the proteins of SPI-13, which is not among the predicted GIs. Three GIs between coordinates 4,376,723 and 4,508,803 together make up SPI-7. A comparison between the SPIs found in strains CT18 and P-stx-12 revealed that the location of several SPIs differs between the two genomes (Figure 7). Indeed, the orientation of SPI-6, SPI-16, SPI-5, SPI-18, SPI-2, SPI-11, SPI-12, and SPI-17 is inverted between the two genomes. These SPIs fall within the inverted genomic regions shown in Figure 5.

Prophage Regions
Prophages are among the diverse mobile genetic elements that are acquired through horizontal gene transfer. Prophage genes are involved in lysogenic conversion. PHAST (PHAge Search Tool) was used to identify the prophage regions of S. enterica serovar Typhi P-stx-12. Based on the analysis, five predicted prophage regions (three intact, two partial) were identified in the genome (Table 5).

CRISPR Region
Using the CRISPR Finder tool, one CRISPR repeat region with a length of 394 bp was identified in the S. enterica serovar Typhi P-stx-12 genome. The CRISPR region starts at position 2,900,675 and ends at position 2,901,069, with 6 spacers in between. The confirmed CRISPR has the following direct repeat consensus sequence: CGGTTTATCCCCGCTGGCGCGGGGAACAC. Strains CT18 and Ty2 also have a single CRISPR repeat region, with lengths of 385 bp and 394 bp, respectively. In all three strains, the CRISPR region lies at around 2.9 Mbp on the chromosome. All three strains have 6 spacers and share the same direct repeat consensus sequence. It is worth noting that the CRISPR region of S. enterica serovar Typhi P-stx-12, including its length and spacer sequences, is identical to that of S. enterica serovar Typhi Ty2. This provides strong evidence of their close evolutionary relationship and shows that the CRISPR region in S. enterica serovar Typhi is conserved. As CRISPRs function as a prokaryotic immune system and confer resistance to plasmids and phages (thus interfering with the spread of antibiotic resistance and virulence factors), it is reasonable to find only one CRISPR with very few spacers in this pathogen, compared with non-pathogenic bacterial strains [49].

Figure 5. Alignment of the S. enterica serovar Typhi CT18, S. enterica serovar Typhi P-stx-12, and S. enterica serovar Typhi Ty2 genomes using progressiveMauve [44]. Colored blocks in the first genome are connected by lines to similarly colored blocks in the second and third genomes. Inverted regions in S. enterica serovar Typhi P-stx-12 and S. enterica serovar Typhi Ty2 are presented as blocks below the center line of the genome. Lines indicate regions in each genome that are homologous.
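
The CRISPR figures reported for strain P-stx-12 can be checked with a short calculation. This is a minimal sketch under a stated assumption that is not given in the text: that the array consists of 6 spacers flanked by 7 copies of the direct repeat (one more repeat than spacers). The mean spacer length it derives is therefore an estimate, not a reported value.

```python
# Arithmetic check of the P-stx-12 CRISPR array (illustrative sketch).

DIRECT_REPEAT = "CGGTTTATCCCCGCTGGCGCGGGGAACAC"  # consensus from the text
START, END = 2_900_675, 2_901_069                # coordinates from the text
N_SPACERS = 6                                    # from the text

# Length of the region as the difference of its end coordinates
span_bp = END - START

# Assumption: spacers alternate with repeats, so repeats = spacers + 1
n_repeats = N_SPACERS + 1
repeat_bp = n_repeats * len(DIRECT_REPEAT)

# Whatever is not repeat sequence is attributed to the spacers
mean_spacer_bp = (span_bp - repeat_bp) / N_SPACERS

print(f"Direct repeat length: {len(DIRECT_REPEAT)} bp")
print(f"Region span: {span_bp} bp")
print(f"Estimated mean spacer length: {mean_spacer_bp:.1f} bp")
```

Under this assumption the coordinates reproduce the stated 394 bp length, and the spacers average a little under 32 bp each.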