Complete genome sequence of the facultatively chemolithoautotrophic and methylotrophic alpha Proteobacterium Starkeya novella type strain (ATCC 8093T)

Starkeya novella (Starkey 1934) Kelly et al. 2000 is a member of the family Xanthobacteraceae in the order ‘Rhizobiales’, which is thus far poorly characterized at the genome level. Cultures from this species are most interesting due to their facultatively chemolithoautotrophic lifestyle, which allows them to both consume carbon dioxide and to produce it. This feature makes S. novella an interesting model organism for studying the genomic basis of regulatory networks required for the switch between consumption and production of carbon dioxide, a key component of the global carbon cycle. In addition, S. novella is of interest for its ability to grow on various inorganic sulfur compounds and several C1-compounds such as methanol. Besides Azorhizobium caulinodans, S. novella is only the second species in the family Xanthobacteraceae with a completely sequenced genome of a type strain. The current taxonomic classification of this group is in significant conflict with the 16S rRNA data. The genomic data indicate that the physiological capabilities of the organism might have been underestimated. The 4,765,023 bp long chromosome with its 4,511 protein-coding and 52 RNA genes was sequenced as part of the DOE Joint Genome Institute Community Sequencing Program (CSP) 2008.


Introduction
Strain ATCC 8093T (ATCC 8093 = DSM 506 = NBRC 14993) is the type strain of the species Starkeya novella [1] and the type species of the genus Starkeya [1], which currently contains only one other species, S. koreensis [2]. The most prominent feature of S. novella is its ability to grow as a facultative chemolithoautotroph [3], a heterotroph [4], or methylotroph [1,5]. Cultures of strain ATCC 8093T were first isolated from soil samples taken from agricultural land in New Jersey by Robert L. Starkey in the early 1930s [6,7] and deposited in the American Type Culture Collection (ATCC) under the basonym Thiobacillus novellus [3,8]. The bacterium was referred to as the 'new' Thiobacillus as it was the first facultatively chemolithoautotrophic sulfur oxidizer to be isolated. Until then, all known dissimilatory sulfur-oxidizing bacteria were also obligate autotrophs. As a result, the metabolism of T. novellus was intensely studied for many years following its discovery, and particularly following the development of more sophisticated biochemical and molecular methods in the 1960s.
In 2000, Kelly et al. [1] proposed the reclassification of T. novellus as S. novella based on the 16S rRNA gene sequence. The genus name Starkeya honors Robert L. Starkey and his important contributions to soil microbiology and sulfur biochemistry [1]; the species epithet is derived from the Latin adjective 'novella', new [3]. Here we present a summary classification and a set of features for S. novella ATCC 8093T, together with the description of the genomic sequencing and annotation.

Classification and features
16S rRNA analysis
The single genomic 16S rRNA sequence of strain ATCC 8093T was compared with the most recent release of the Greengenes database [32] using NCBI BLAST [30,31] under default settings (e.g., considering only the high-scoring segment pairs (HSPs) from the best 250 hits), and the relative frequencies of taxa and keywords (reduced to their stem [33]) were determined, weighted by BLAST scores. The most frequently occurring genera were Ancylobacter (30.0%), Starkeya (13.4%), Agrobacterium (13.1%), Xanthobacter (12.4%) and Azorhizobium (11.5%) (98 hits in total). Regarding the three hits to sequences from members of the species, the average identity within HSPs was 99.5%, whereas the average coverage by HSPs was 92.8%. Among all other species, the one yielding the highest score was Ancylobacter rudongensis (AY056830), which corresponded to an identity of 98.1% and an HSP coverage of 98.4%. (Note that the Greengenes database uses the INSDC (= EMBL/NCBI/DDBJ) annotation, which is not an authoritative source for nomenclature or classification.) The highest-scoring environmental sequence was EU835464 ('structure and quorum sensing reverse osmosis RO membrane biofilm clone 3M02'), which showed an identity of 98.4% and an HSP coverage of 100.0%. The most frequently occurring keywords within the labels of all environmental samples that yielded hits were 'skin' (6.0%), 'microbiom' (3.0%), 'human, tempor, topograph' (2.5%), 'compost' (2.1%) and 'dure' (2.1%) (152 hits in total); they fit the known habitat of the species only partially. Environmental samples that yielded hits of a higher score than the highest scoring species were not found. Figure 1 shows the phylogenetic neighborhood of S. novella in a 16S rRNA based tree. The sequence of the single 16S rRNA gene copy in the genome differs by nine nucleotides from the previously published 16S rRNA sequence (D32247), which contains one ambiguous base call.
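The score-weighted tally of taxa described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the helper name `weighted_frequencies` is hypothetical, and the `hits` list holds invented placeholder genera and bit scores rather than the real Greengenes hits.

```python
from collections import defaultdict


def weighted_frequencies(hits):
    """Return {label: percentage}, weighting each hit by its BLAST score."""
    totals = defaultdict(float)
    for label, score in hits:
        totals[label] += score
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: 100.0 * s / grand_total for label, s in totals.items()}


# Placeholder (genus, bit score) pairs; assumed values for illustration only.
hits = [
    ("Ancylobacter", 2400.0),
    ("Ancylobacter", 2350.0),
    ("Starkeya", 2500.0),
    ("Xanthobacter", 2250.0),
]

freqs = weighted_frequencies(hits)
for genus, pct in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{genus}: {pct:.1f}%")
```

Weighting by score rather than counting hits means that a genus with many weak hits does not automatically outrank one with fewer, stronger hits.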
To measure the conflict between the 16S rRNA data and the taxonomic classification in detail, we followed a constraint-based approach as described recently [41], conducting both unconstrained searches and searches constrained for the monophyly of both families. Our own re-implementation of CopyCat [42], in conjunction with AxPcoords and AxParafit [43], was used to determine those leaves (species) whose placement deviated significantly between the constrained and the unconstrained tree.
The best-supported ML tree had a log likelihood of -12,191.55, whereas the best tree found under the constraint had a log likelihood of -12,329.92. The constrained tree was significantly worse than the globally best one in the SH test as implemented in RAxML [37,44] (α = 0.01). The best-supported MP trees had a score of 1,926, whereas the best constrained trees found had a score of 1,982 and were also significantly worse in the KH test as implemented in PAUP [8,44] (α < 0.0001). Accordingly, the current classification of the family as used in [45,46], on which the annotation of Figure 1 is based, is in significant conflict with the 16S rRNA data. Figure 1 also shows, in green font color, those species that cause phylogenetic conflict as detected using the ParaFit test (i.e., those with a p value > 0.05, because ParaFit measures the significance of congruence). According to our analyses, the Hyphomicrobiaceae genera (Blastochloris and Prosthecomicrobium) nested within the Xanthobacteraceae display significant conflict. In the constrained tree (data not shown), the Angulomicrobium-Methylorhabdus clade is placed at the base of the Xanthobacteraceae clade (forced to be monophyletic). For this reason, Angulomicrobium and Methylorhabdus were not detected as causing conflict (note that the ParaFit test essentially compares unrooted trees). A taxonomic revision of the group would probably need to start with the reassignment of these genera to different families.
Standards in Genomic Sciences
Figure 1. Phylogenetic tree highlighting the position of S. novella relative to the type strains of the other species within the family Xanthobacteraceae (blue font color). The tree was inferred from 1,381 aligned characters [34,35] of the 16S rRNA gene sequence under the maximum likelihood (ML) criterion [36]. Hyphomicrobiaceae (green font color for those species that caused conflict according to the ParaFit test, black color for the remaining ones; see the main text for the difference) were included in the dataset for use as outgroup taxa but then turned out to be intermixed with the target family; hence, the rooting shown was inferred by the midpoint-rooting method [29]. The branches are scaled in terms of the expected number of substitutions per site. Numbers adjacent to the branches are support values from 550 ML bootstrap replicates [37] (left) and from 1,000 maximum-parsimony bootstrap replicates [38] (right) if larger than 60%. Lineages with type strain genome sequencing projects registered in GOLD [39] are labeled with one asterisk, those also listed as 'Complete and Published' with two asterisks (see [40] and CP000781 for Xanthobacter autotrophicus, CP002083 for Hyphomicrobium denitrificans and CP002292 for Rhodomicrobium vannielii).
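The cost of enforcing family monophyly can be read directly from the scores reported in the text. The short Python sketch below only computes the differences. It does not perform the SH or KH tests, which need per-site likelihoods or step counts that are not given here. The variable names are illustrative, and the constrained parsimony score is read as 1,982 steps, which is an assumption about the intended value.

```python
# Scores as reported in the text (constrained MP score read as 1,982; assumption).
lnl_unconstrained = -12191.55  # best ML tree, no constraint
lnl_constrained = -12329.92    # best ML tree with both families forced monophyletic
mp_unconstrained = 1926        # best MP tree length in steps
mp_constrained = 1982          # best constrained MP tree length in steps

# Positive values mean the constraint makes the tree fit the data worse.
delta_lnl = lnl_unconstrained - lnl_constrained
extra_steps = mp_constrained - mp_unconstrained

print(f"log-likelihood cost of the constraint: {delta_lnl:.2f}")
print(f"extra parsimony steps under the constraint: {extra_steps}")
```

Under these readings the constraint costs roughly 138 log-likelihood units and 56 additional parsimony steps, which is the gap that the SH and KH tests evaluate for significance.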

Morphology and physiology
Cells of S. novella ATCC 8093T are non-motile, Gram-negative staining short rods or coccobacilli with a size of 0.4-0.8 μm × 0.8-2.0 μm, occurring singly or in pairs (Figure 2, Table 1) [1]. Colonies grown on thiosulfate agar turn white with sulfur on biotin-supplemented growth media [1], while in the presence of small amounts of yeast extract (DSMZ medium 69) the colonies have a pale pink appearance following growth on thiosulfate and no sulfur formation is observed. Cells grow on thiosulfate and tetrathionate under aerobic conditions, but not on sulfur or thiocyanate [1]. Ammonium salts, nitrates, urea and glutamate can serve as nitrogen sources [1]. Several surveys of substrates supporting heterotrophic growth have been published; these substrates include glucose, formate, methanol and oxalate [1,2,4,6]. Growth occurs from 10 to 37°C, with an optimum at 25-30°C, and from pH 5.7 to 9.0, with an optimum at pH 7.0 [1].

Chemotaxonomy
The lipopolysaccharide of strain ATCC 8093T lacks heptoses and has only 2,3-diamino-2,3-dideoxyglucose as the backbone sugar [1]; other data on the cell wall structure of strain ATCC 8093T are not available. The major isoprenoid quinone is ubiquinone Q-10 [1], and the major cellular fatty acids are octadecenoic acid (C18:1) and C19 cyclopropane acid; no hydroxy acids are present [1]. Cells contain putrescine and homospermidine.

Genome sequencing and annotation
Genome project history
This organism was selected for sequencing on the basis of the DOE Joint Genome Institute Community Sequencing Program (CSP) 2008. The genome project is deposited in the Genomes On Line Database [39] and the complete genome sequence is deposited in GenBank. Sequencing, finishing and annotation were performed by the DOE Joint Genome Institute (JGI). A summary of the project information is shown in Table 2.

Growth conditions and DNA isolation
Strain ATCC 8093T was grown from a culture of DSMZ 506 in DSMZ medium 69 at 28°C. gDNA was purified using the Genomic-tip 100 System (Qiagen) following the directions provided by the supplier. The purity, quality and size of the bulk gDNA preparation were assessed by JGI according to DOE-JGI guidelines.

Genome sequencing and assembly
The genome was sequenced using a combination of Illumina and 454 sequencing platforms. All general aspects of library construction and sequencing can be found at the JGI website [57]. Pyrosequencing reads were assembled using the Newbler assembler (Roche). The initial Newbler assembly consisting of 13 contigs in one scaffold was converted into a phrap [58]

Genome annotation
Genes were identified using Prodigal [62] as part of the Oak Ridge National Laboratory genome annotation pipeline, followed by a round of manual curation using the JGI GenePRIMP pipeline [63]. The predicted CDSs were translated and used to search the National Center for Biotechnology Information (NCBI) non-redundant database and the UniProt, TIGRFam, Pfam, PRIAM, KEGG, COG, and InterPro databases. These data sources were combined to assert a product description for each predicted protein. Non-coding genes and miscellaneous features were predicted using tRNAscan-SE [64], RNAMMer [65], Rfam [66], TMHMM [67], and SignalP [68].

Genome properties
The genome consists of a circular 4,765,023 bp chromosome with a 67.9% G+C content (Table 3 and Figure 3). Of the 4,563 genes predicted, 4,511 were protein-coding genes and 52 were RNA genes; 80 pseudogenes were also identified. The majority of the protein-coding genes (74.8%) were assigned a putative function, while the remaining ones were annotated as hypothetical proteins. The distribution of genes into COGs functional categories is presented in Table 4. A total of 388 genes are predicted to encode proteins involved in signal transduction, including 284 one-component systems, 41 histidine kinases, 47 response regulators, seven chemotaxis proteins and two additional unclassified proteins.
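The summary figures above follow from simple counting, which the Python sketch below reproduces. The `gc_content` helper is a hypothetical name and is shown on a toy sequence, not on the actual chromosome. The number of genes with a putative function is only approximate, because it is back-calculated from the rounded 74.8% figure.

```python
def gc_content(seq):
    """Percentage of G and C among the unambiguous A/C/G/T bases of seq."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    acgt = gc + sum(seq.count(base) for base in "AT")
    return 100.0 * gc / acgt if acgt else 0.0


# Counts reported in the text.
protein_coding = 4511
rna_genes = 52
total_genes = protein_coding + rna_genes

# Approximate count, derived from the rounded percentage given in the text.
with_function = round(protein_coding * 0.748)

print(f"total genes: {total_genes}")
print(f"protein-coding genes with a putative function: about {with_function}")
print(f"G+C content of a toy sequence: {gc_content('GCGCATGC'):.1f}%")
```

Ambiguous bases such as N are left out of the denominator in this sketch; other tools may treat them differently, so small deviations from a published G+C value are possible.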

Insights into the genome
As indicated in the introduction, because S. novella was the first facultative sulfur chemolithotrophic bacterium to be isolated, many studies of its metabolic capabilities were carried out following its discovery. Several groups worked on the carbon metabolism of S. novella, which led to the discovery of an operational pentose phosphate pathway in this bacterium [69]; this is also the only pathway of glucose metabolism reported in the description of S. novella [1]. However, analysis of the genome sequence revealed that, in addition to a pentose phosphate pathway, S. novella also contains the enzymes required for the Entner-Doudoroff pathway (Snov_2999 & Snov_3400, 2-dehydro-3-deoxyphosphogluconate aldolase; 6-phosphogluconate dehydratase; BioCyc database) and the enzymes required for the Embden-Meyerhof pathway, although this pathway appears to lack a phosphofructokinase (EC 2.7.1.11), indicating that it may function only in gluconeogenesis.
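The reasoning above, in which a pathway is judged by whether its diagnostic enzymes are annotated, can be sketched as a set comparison. The Python example below is a simplified illustration under stated assumptions: the marker sets cover only the enzymes named in the text, the helper `missing_markers` is a hypothetical name, and the `annotated` set is an invented toy input rather than the actual S. novella annotation.

```python
# Diagnostic enzymes per route, by EC number. A deliberate simplification:
# real pathway inference considers every step, not just marker enzymes.
PATHWAY_MARKERS = {
    "Entner-Doudoroff": {
        "4.2.1.12",  # 6-phosphogluconate dehydratase
        "4.1.2.14",  # 2-dehydro-3-deoxyphosphogluconate aldolase
    },
    "Embden-Meyerhof (glycolytic direction)": {
        "2.7.1.11",  # phosphofructokinase
    },
}


def missing_markers(annotated_ecs):
    """Map each pathway to the sorted list of marker ECs absent from the annotation."""
    return {
        name: sorted(required - annotated_ecs)
        for name, required in PATHWAY_MARKERS.items()
    }


# Toy annotation set (assumed for illustration): both Entner-Doudoroff markers
# and fructose-1,6-bisphosphatase (EC 3.1.3.11), but no phosphofructokinase.
annotated = {"4.2.1.12", "4.1.2.14", "3.1.3.11"}

result = missing_markers(annotated)
for pathway, missing in result.items():
    status = "complete" if not missing else "missing " + ", ".join(missing)
    print(f"{pathway}: {status}")
```

An empty list marks a route whose markers are all present, while a missing phosphofructokinase flags the Embden-Meyerhof route as unlikely to run in the glycolytic direction.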
The respiratory chain of S. novella has also been studied, and an aa3-type terminal oxidase was identified and characterized in some detail [70][71][72][73]. It was also discovered that the cytochrome c that interacts with this cytochrome oxidase (most likely encoded by Snov_1033) has properties reminiscent of the mitochondrial respiratory chain cytochrome c [70][71][72][73][74][75], including a high pI and an ability to transfer electrons to the bovine cytochrome oxidase [76]. The analysis of the genome revealed a much greater diversity of respiratory chain complexes than previously recognized, including two NADH oxidases (gene regions Snov_1853 & Snov_2407), one succinate dehydrogenase (Snov_3317 gene region) and a cytochrome bc1 complex (Snov_2477 gene region).
In addition to these components, the genome encodes two aa3
We also re-evaluated the range of substrates that support growth of S. novella. In the description of the genus Starkeya [1], only glucose, formate, methanol and oxalate were listed as growth-supporting substrates in addition to thiosulfate and tetrathionate. An early test of the heterotrophic potential of S. novella was published in 1969 by Taylor and Hoare [4], who identified 16 potential growth substrates (Table 7 in [4]), including all of the above except oxalate. Oxalate was identified subsequently in [5], whose authors were seeking to evaluate the C1 compound metabolism of S. novella and also identified formamide as a potential substrate. It is unclear why the description of the genus Starkeya did not list all of the 16 growth substrates identified by Taylor and Hoare. To confirm the earlier data, we carried out a growth substrate screen using the Biolog system (GN2 assay plates) as well as an API20NE test for bacterial identification. Some substrates that are not part of the Biolog GN2 plate (e.g., oxalate, fructose and succinate) were independently tested in the laboratory for their ability to support growth. In the API20NE test, in addition to a positive oxidase response, S. novella tested positive for ESC/Fecit and p-nitrophenyl hydrolysis and for glucose, mannitol and gluconate utilization. The Biolog assay clearly showed that the heterotrophic potential of this bacterium is greater than previously identified, with a total of 28 growth-supporting substrates being identified in the screen (  39 substrates that have been identified as supporting heterotrophic growth of S. novella. In addition to sugars such as glucose, fructose and arabinose, several sugar alcohols and amino acids as well as some organic acids can be used as growth substrates (Table 5). This reasonably large range of growth substrates is reflected in the size and the diversity of metabolic pathways present in the S. novella genome which, with a size of 4.8 Mb, is comparable to the genomes of, e.g., Escherichia coli and Rhodopseudomonas palustris.
Although the analyses presented above are limited, they illustrate that the genome data confirm many of the results from early studies of the physiology of this bacterium, while the metabolic capabilities of S. novella indicated by the genome clearly exceed those previously published in the literature. This suggests that versatility and adaptability to changing environments are likely significant factors in its survival.