Non-contiguous finished genome sequence and description of Salmonella enterica subsp. houtenae str. RKS3027

Salmonella enterica subsp. houtenae serovar 16:z4,z32:-- str. RKS3027 was isolated from a human in Illinois, USA. S. enterica subsp. houtenae is a facultatively anaerobic, rod-shaped, Gram-negative bacterium. Here we describe the features of this organism, together with the draft genome sequence and annotation. The 4,404,136 bp long genome (97 contigs) contains 4,335 protein-coding genes and 28 RNA genes.


Introduction
Salmonella is an important genus of human and animal pathogens [1], and more than 2,600 different serovars have been described. Currently, the genus Salmonella is divided into two species, S. enterica and S. bongori [2]. S. enterica comprises seven subspecies: I (also called subspecies enterica), II (also called subspecies salamae), IIIa (also called subspecies arizonae), IIIb (also called subspecies diarizonae), IV (also called subspecies houtenae), VI (also called subspecies indica), and VII [3]. Most Salmonella serovars belong to S. enterica subspecies I and are responsible for disease in warm-blooded animals and humans [4]. Serovars of the other subspecies are usually isolated from cold-blooded organisms and the environment, but occasionally also cause human disease. In contrast with S. enterica subspecies I, very limited information is available regarding the pathogenicity of the other subspecies. When infecting humans, these serovars usually cause an intestinal infection (e.g., diarrhea), but previous reports in the literature [5] have shown that serovars of Salmonella subspecies II-IV are capable of causing serious infections, including septicemia and abscesses. There has been an increase in case reports of extraintestinal infections caused by these subspecies [6]. S. enterica subsp. houtenae serovar 16:z4,z32:-- str. RKS3027 is a human isolate. This strain is of interest because of its pathogenicity as well as its divergent phylogenetic position within S. enterica.

Classification and features
Few 16S rRNA sequences are available for Salmonella subspecies other than S. enterica subsp. enterica. Meanwhile, it is increasingly commonplace to construct phylogenetic trees from whole-genome sequences for higher precision and robustness [7,8]. We therefore used a total of 2,500 orthologs from 18 Salmonella strains to construct a genome-scale phylogenetic tree. The genetic relatedness of S. enterica subsp. houtenae strain RKS3027 to strains of other Salmonella subspecies is shown in Figure 1. On the tree, all S. enterica subsp. enterica strains clustered together, and S. enterica subsp. houtenae RKS3027 was positioned between S. enterica subsp. enterica and S. bongori.
The genus Salmonella belongs to the bacterial family Enterobacteriaceae [11]. The bacteria are rod-shaped and Gram-negative, with a diameter of 0.7 to 1.5 µm and a length of 2 to 5 µm (Table 1). They are facultative anaerobes, non-spore-forming, flagellated, and motile. They grow optimally at 35-37 °C and at pH 7.2-7.6. S. enterica subsp. houtenae is salicin-positive and able to grow in KCB medium, two characteristics that distinguish it from S. enterica subsp. enterica. The strain is deposited in the Salmonella Genetic Stock Centre (SGSC), University of Calgary, Canada, as S. enterica subsp. houtenae RKS3027 (= SGSC 3086).
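As an illustration of how shared orthologs can feed a genome-scale tree like the one in Figure 1, the Python sketch below concatenates per-ortholog alignments into one sequence per strain and computes pairwise p-distances. It is a minimal sketch under stated assumptions, not the procedure used in this study: the text does not specify the alignment or tree-building method, and the function names (concatenate_alignments, p_distance, distance_matrix) and the input layout are hypothetical.

```python
from itertools import combinations


def concatenate_alignments(alignments, strains):
    """Join per-ortholog aligned sequences into one supermatrix row per strain.

    alignments: list of dicts mapping strain name -> aligned sequence.
    Every dict must contain exactly the given strains, and all sequences
    within one dict must have the same length (hypothetical input layout).
    """
    parts = {s: [] for s in strains}
    for aln in alignments:
        if set(aln) != set(strains):
            raise ValueError("each ortholog needs one sequence per strain")
        if len({len(seq) for seq in aln.values()}) != 1:
            raise ValueError("sequences of one ortholog must be equal length")
        for s in strains:
            parts[s].append(aln[s])
    return {s: "".join(p) for s, p in parts.items()}


def p_distance(a, b):
    """Proportion of differing sites, skipping columns with a gap ('-')."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(supermatrix):
    """Pairwise p-distances keyed by alphabetically sorted strain pairs."""
    return {
        (s, t): p_distance(supermatrix[s], supermatrix[t])
        for s, t in combinations(sorted(supermatrix), 2)
    }
```

A distance matrix of this kind could be passed to a distance-based tree method such as neighbor joining; character-based methods would instead work on the concatenated alignment directly.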

Genome sequencing information
Genome project history
This organism was selected for sequencing on the basis of its phylogenetic position and its serious virulence in humans compared with reptiles. This Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession ANHR00000000. The version described in this paper is the first version, ANHR01000000, and the sequence consists of 97 large contigs. Table 2 presents the project information and its association with MIGS version 2.0 compliance [12].

Growth conditions and DNA isolation
S. enterica subsp. houtenae strain RKS3027 was grown in Luria broth (LB) medium at 37 °C. DNA was extracted from the cells, concentrated, and purified using the QIAamp kit (Qiagen) according to the manufacturer's instructions.

Genome sequencing and assembly
The genome of S. enterica subsp. houtenae RKS3027 was sequenced using the Illumina sequencing platform by the paired-end strategy (2×100 bp). The details of library construction and sequencing can be found at the Illumina web site.
Table 1 (final row): Altitude: not reported (evidence code NAS).
Table 1 footnote: a) Evidence codes - IDA: Inferred from Direct Assay; TAS: Traceable Author Statement (i.e., a direct report exists in the literature); NAS: Non-traceable Author Statement (i.e., not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence). These evidence codes are from the Gene Ontology project [25].

Genome annotation
Genes were predicted using RAST (Rapid Annotation using Subsystem Technology) [27] with the gene caller GLIMMER3 [28], followed by manual curation. The predicted bacterial protein sequences were compared with the annotated genes from four available Salmonella genomes: S. enterica subsp. enterica Typhi P-stx-12, S. enterica subsp. enterica Heidelberg B182, S. enterica subsp. enterica Typhimurium UK-1, and S. enterica subsp. enterica Typhimurium 4/74. They were also searched against the Clusters of Orthologous Groups (COG) databases using BLASTP. The BLAST results were filtered with the following parameters: identity >90% and compared length >70%. CGViewer was used for visualization of genomic features [29].
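The filtering step above can be sketched in Python as follows. This is a minimal sketch under assumptions that the text does not confirm: it assumes BLASTP tabular output (the standard 12-column "outfmt 6" layout, where column 3 is percent identity and column 4 is alignment length), and it interprets "compared length" as alignment length relative to the query protein length. The function names and the query_lengths lookup are hypothetical.

```python
def parse_outfmt6(lines):
    """Yield hits from BLAST tabular lines (assumed 12-column outfmt 6)."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        yield {
            "query": fields[0],
            "subject": fields[1],
            "pident": float(fields[2]),  # percent identity
            "length": int(fields[3]),    # alignment length (includes gaps)
        }


def filter_hits(hits, query_lengths, min_identity=90.0, min_coverage=70.0):
    """Keep hits with identity > min_identity and coverage > min_coverage.

    Coverage is taken here as alignment length / query length * 100,
    which is an assumed reading of "compared length" in the text.
    """
    kept = []
    for hit in hits:
        coverage = 100.0 * hit["length"] / query_lengths[hit["query"]]
        if hit["pident"] > min_identity and coverage > min_coverage:
            kept.append(hit)
    return kept
```

Both thresholds are applied as strict inequalities, matching the ">" signs in the text. Because alignment length counts gap columns, coverage computed this way can slightly exceed 100%.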

Genome properties
The genome of S. enterica subsp. houtenae RKS3027 is 4,404,136 bp long (97 contigs) with a 51.68% G + C content (Table 3 and Figure 2). Of the 4,363 predicted genes, 4,335 were protein-coding genes and 28 were RNA genes (one 5S rRNA gene and 27 predicted tRNA genes). A total of 3,378 genes (77.42%) were assigned a putative function.
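The counts reported above are internally consistent, which a short Python sketch can confirm: 4,335 + 28 = 4,363 genes, and 3,378 of 4,363 is 77.42%. The sketch also shows one way G + C content could be computed from contig sequences. The helper names (gc_content, summarize_genes) are hypothetical, and ignoring ambiguous bases such as N in the G + C calculation is an assumption, not something stated in the text.

```python
def gc_content(contigs):
    """Percent G+C over all contigs, counting only unambiguous A/C/G/T."""
    gc = at = 0
    for seq in contigs:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        at += seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("no unambiguous bases found")
    return 100.0 * gc / (gc + at)


def summarize_genes(protein_coding, rna_genes, with_function):
    """Return the total gene count and the percentage with assigned function."""
    total = protein_coding + rna_genes
    return {
        "total_genes": total,
        "pct_with_function": round(100.0 * with_function / total, 2),
    }


# Values reported in the text for strain RKS3027.
stats = summarize_genes(protein_coding=4335, rna_genes=28, with_function=3378)
print(stats["total_genes"])        # 4363
print(stats["pct_with_function"])  # 77.42
```

Note that the 77.42% figure matches only when the denominator is the total of 4,363 predicted genes, not the 4,335 protein-coding genes.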
The remaining genes were annotated as hypothetical proteins. The properties and statistics of the genome are summarized in Table 3. The distribution of genes into COGs functional categories is presented in Table 4.
Table 3 footnote: a) The total is based on either the size of the genome in base pairs or the total number of protein-coding genes in the annotated genome.
Table 4 footnote: a) The total is based on the total number of protein-coding genes in the annotated genome.
Standards in Genomic Sciences