Genome sequence of the reddish-pigmented Rubellimicrobium thermophilum type strain (DSM 16684T), a member of the Roseobacter clade

Rubellimicrobium thermophilum Denner et al. 2006 is the type species of the genus Rubellimicrobium, a representative of the Roseobacter clade within the Rhodobacteraceae. Members of this clade were shown to be abundant especially in coastal and polar waters, but were also found in microbial mats and sediments. They are metabolically versatile and form a physiologically heterogeneous group within the Alphaproteobacteria. Strain C-Ivk-R2A-2T was isolated from colored deposits in a pulp dryer; however, its natural habitat is so far unknown. Here we describe the features of this organism, together with the draft genome sequence and annotation and novel aspects of its phenotype. The 3,161,245 bp long genome contains 3,243 protein-coding and 45 RNA genes.


Introduction
Strain C-Ivk-R2A-2T (= DSM 16684 = CCUG 51817 = HAMBI 2421) is the type strain of the species Rubellimicrobium thermophilum [1]. The genus name Rubellimicrobium was derived from the Neo-Latin adjective 'rubellus', red or reddish, and the Neo-Latin noun 'microbium', microbe, referring to its reddish pigmentation. The species epithet was derived from the Greek noun 'thermê', heat, as well as from the Neo-Latin adjective 'philus -a -um', friend/loving, referring to its growth temperature [1]. C-Ivk-R2A-2T was isolated from colored deposits in a pulp dryer in Finland; its natural habitat is so far unknown [1]. At the time of writing, PubMed records did not indicate any follow-up research with strain C-Ivk-R2A-2T after the initial description and valid publication of the new species Rubellimicrobium thermophilum [1]. Here we present a summary classification and a set of features for R. thermophilum C-Ivk-R2A-2T, together with the description of the genomic sequencing and annotation. We also describe novel aspects of its phenotype.

Features of the organism

16S rRNA gene analysis
The single genomic 16S rRNA gene sequence of R. thermophilum DSM 16684T was compared using NCBI BLAST [2,3] under default settings (e.g., considering only the high-scoring segment pairs (HSPs) from the best 250 hits) with the most recent release of the Greengenes database [4], and the relative frequencies of taxa and keywords (reduced to their stem [5]) were determined, weighted by BLAST scores. The most frequently occurring genera were Rubellimicrobium (26.9%), Oceanicola (18.5%), Rhodobacter (12.4%), Methylarcula (10.4%) and Loktanella (10.1%) (37 hits in total). Regarding the five hits to sequences from members of the species, the average identity within HSPs was 99.9%, whereas the average coverage by HSPs was 99.2%. Among all other species, the one yielding the highest score was 'Pararubellimicrobium aerilata' (EU338486), which corresponded to an identity of 94.2% and an HSP coverage of 97.8%. (Note that the Greengenes database uses the INSDC (= EMBL/NCBI/DDBJ) annotation, which is not an authoritative source for nomenclature or classification.) The highest-scoring environmental sequence was AJ489269 (Greengenes short name 'food Echinamoeba thermarum clone'), which showed an identity of 99.9% and an HSP coverage of 99.1%. The most frequently occurring keywords within the labels of all environmental samples which yielded hits were 'skin' (10.1%), 'fossa' (6.0%), 'poplit' (3.6%), 'forearm, volar' (3.6%) and 'water' (2.5%) (213 hits in total). The most frequently occurring keywords within the labels of those environmental samples which yielded hits of a higher score than the highest scoring species were 'biofilm' (18.2%), 'echinamoeba, food, thermarum' (9.1%) and 'color, machin, moder, paper, paper-machin, thermophil' (9.1%) (2 hits in total). Figure 1 shows the phylogenetic neighborhood of R. thermophilum in a 16S rRNA gene sequence-based tree.
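The score-weighted relative frequencies reported above can be illustrated with a minimal Python sketch. It assumes that each hit contributes its BLAST score to the label it carries and that percentages are taken over the summed scores; the function name `weighted_frequencies` and the example hits are hypothetical and are not data or code from this study.

```python
from collections import defaultdict


def weighted_frequencies(hits):
    """Return {label: percent}, with each hit weighted by its BLAST score.

    `hits` is an iterable of (label, score) pairs, e.g. (genus, bit score).
    """
    totals = defaultdict(float)
    for label, score in hits:
        totals[label] += score
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: 100.0 * s / grand_total for label, s in totals.items()}


# Hypothetical (genus, score) pairs for illustration only.
example_hits = [
    ("Rubellimicrobium", 2400.0),
    ("Rubellimicrobium", 2300.0),
    ("Oceanicola", 1900.0),
    ("Rhodobacter", 1400.0),
]
freqs = weighted_frequencies(example_hits)
for genus, pct in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{genus}: {pct:.1f}%")
```

The same function could be applied to stemmed keywords instead of genus names, since only the labels change.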
The sequence of the single 16S rRNA gene copy in the genome does not differ from the previously published 16S rDNA sequence (AJ844281).

Figure 1. Phylogenetic tree highlighting the position of R. thermophilum relative to the type strains of the type species of the other genera within the family Rhodobacteraceae. The tree was inferred from 1,330 aligned characters [6,7] of the 16S rRNA gene sequence under the maximum likelihood (ML) criterion [8]. Rooting was done initially using the midpoint method [9] and then checked for its agreement with the current classification (Table 1). The branches are scaled in terms of the expected number of substitutions per site. Numbers adjacent to the branches are support values from 650 ML bootstrap replicates [10] (left) and from 1,000 maximum-parsimony bootstrap replicates [11] (right) if larger than 60%. Lineages with type strain genome sequencing projects registered in GOLD [12] are labeled with one asterisk, those also listed as 'Complete and Published' with two asterisks [13].
According to [1], R. thermophilum is able to metabolize a wide range of carbon sources. This observation is not fully confirmed by the OmniLog measurements at 28°C. For instance, more than eleven sugars were not metabolized under the given cultivation conditions in the Generation-III microplates. This is apparently caused by the differing cultivation conditions, because the behavior agrees closely with [1] if a temperature of 37°C is chosen, which is closer to the reported optimum temperature [1]. In particular, the optimal growth temperature of 45°C differs markedly from the one that had to be used in the OmniLog assays (28°C). Conversely, in contrast to [1], the OmniLog measurements yielded positive reactions for citrate, L-histidine and L-serine at 28°C and additionally for propionate at 37°C. This may be due to the higher sensitivity of respiratory measurements compared to growth measurements [24,25].

Genome sequencing and annotation

Genome project history
The genome was sequenced within the project "Ecology, Physiology and Molecular Biology of the Roseobacter clade: Towards a Systems Biology Understanding of a Globally Important Clade of Marine Bacteria" funded by the German Research Council (DFG). The strain was chosen for genome sequencing according to the Genomic Encyclopedia of Bacteria and Archaea (GEBA) criteria [26,27]. Project information is stored at the Genomes On-Line Database [12]. The Whole Genome Shotgun (WGS) sequence is deposited in GenBank and the Integrated Microbial Genomes database (IMG) [28]. A summary of the project information is shown in Table 2.

Growth conditions and DNA isolation
A culture of DSM 16684T was grown aerobically in DSMZ medium 830 (R2A medium) [29] at 45°C. Genomic DNA was isolated using the Jetflex Genomic DNA Purification Kit (GENOMED 600100) following the standard protocol provided by the manufacturer, modified by an incubation time of 60 min, incubation on ice overnight on a shaker, the use of an additional 50 µl proteinase K, and the addition of 100 µl protein precipitation buffer. DNA is available from DSMZ through the DNA Bank Network [30].

Genome sequencing and assembly
The genome was sequenced using a combination of Illumina and 454 libraries (Table 2). Illumina sequencing was performed on a GA IIx platform with 150 cycles. The paired-end library contained inserts of 456 nt length on average. To correct sequencing errors and improve the quality of the reads, clipping was performed using fastq-mcf [31] and Quake [32]. The remaining 4,190,250 reads with an average length of 106 nt were assembled using Velvet [33]. To gain information on the contig arrangement, an additional 454 run was performed.
The paired-end jumping library of 3 kb insert size was sequenced on a 1/8 lane. Pyrosequencing resulted in 115,925 reads with an average read length of 451 nt, which were assembled with Newbler (Roche Diagnostics) into a draft assembly comprising 36 scaffolds. Both draft assemblies (Illumina and 454 sequences) were fractionated into artificial Sanger reads of 1000 nt in length plus 75 nt overlap on each side. These artificial reads served as input for the phred/phrap/consed package [34]. By manual editing, the number of contigs was reduced to 44, organized in ten scaffolds. The combined sequences provided a 203× coverage of the genome.
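The fractionation of draft contigs into artificial reads can be sketched in Python as follows. This is only one possible reading of the step, under the assumption that each contig is cut into consecutive 1000 nt windows and that every window is extended by up to 75 nt on both sides (truncated at contig ends); the function name `artificial_reads` and its parameters are hypothetical and do not reproduce the script used in the project.

```python
def artificial_reads(contig, core=1000, overlap=75):
    """Split a contig into windows of `core` nt, each extended by up to
    `overlap` nt on both sides (assumed interpretation of the procedure)."""
    reads = []
    for start in range(0, len(contig), core):
        left = max(0, start - overlap)                      # extend to the left
        right = min(len(contig), start + core + overlap)    # extend to the right
        reads.append(contig[left:right])
    return reads


# Hypothetical 2,500 nt contig for illustration only.
contig = "ACGT" * 625
reads = artificial_reads(contig)
print([len(r) for r in reads])
```

Under this assumption, neighboring artificial reads share 150 nt, which gives an assembler such as phrap sufficient overlap to rejoin them.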

Genome annotation
Genes were identified using Prodigal [35] as part of the JGI genome annotation pipeline [36]. The predicted CDSs were translated and used to search the National Center for Biotechnology Information (NCBI) non-redundant database, UniProt, TIGRFam, Pfam, PRIAM, KEGG, COG, and InterPro databases. Identification of RNA genes was carried out using HMMER 3.0rc1 [37] (rRNAs) and tRNAscan-SE 1.23 [38] (tRNAs). Other non-coding genes were predicted using INFERNAL 1.0.2 [39]. Additional gene prediction analysis and functional annotation were performed within the Integrated Microbial Genomes - Expert Review (IMG-ER) platform [40]. CRISPR elements were detected using CRT [41] and PILER-CR [42].

Genome properties
The genome statistics are provided in Table 3 and Figure 3. The genome has a total length of 3,161,245 bp and a G+C content of 69.1%. Of the 3,288 genes predicted, 3,243 were protein-coding genes and 45 were RNA genes. The majority of the protein-coding genes (80.4%) were assigned a putative function, while the remaining ones were annotated as hypothetical proteins. The distribution of genes into COG functional categories is presented in Table 4.
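As a quick consistency check, the figures in this paragraph can be combined in a few lines of Python. The input values are taken from the text; the derived quantities (approximate number of genes with a putative function, genes per Mb) are back-of-the-envelope estimates computed here and are not values quoted from Table 3.

```python
genome_length_bp = 3_161_245
protein_coding = 3_243
rna_genes = 45
fraction_with_function = 0.804  # 80.4% of protein-coding genes

total_genes = protein_coding + rna_genes
# Approximate count, since the percentage in the text is rounded.
approx_with_function = round(protein_coding * fraction_with_function)
approx_hypothetical = protein_coding - approx_with_function
genes_per_mb = total_genes / (genome_length_bp / 1_000_000)

print(total_genes)                # total predicted genes
print(approx_with_function)       # approx. genes with a putative function
print(approx_hypothetical)        # approx. hypothetical proteins
print(f"{genes_per_mb:.0f} genes per Mb")
```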

Insights into the genome
The ten scaffolds of the draft genome sequence of strain C-Ivk-R2A-2T were screened with BLAST for the presence of the four abundant plasmid replicases from the Rhodobacterales, representing DnaA-like, RepABC-, RepA- and RepB-type replicons [43]. None of these typical extrachromosomal elements was detected. Prophage-like structures have been found in many bacteria and are known to drive bacterial diversity by facilitating lateral gene transfer [44]. Genome analysis of strain DSM 16684T revealed the presence of several genes encoding proteins associated with prophages (ruthe_00218 to 00220, ruthe_00605, ruthe_00607 to 00610, ruthe_00612, ruthe_00614, ruthe_00617, ruthe_00618, ruthe_00620, ruthe_2061, ruthe_2066, ruthe_02072, ruthe_02185, ruthe_02480, ruthe_02482 to 02484, ruthe_02495, ruthe_02499, ruthe_02502, ruthe_02972, ruthe_02974, ruthe_02976, ruthe_02977, ruthe_02984, ruthe_02988, and ruthe_02991 to 03295). The soxB gene (ruthe_01788) encodes a component of the thiosulfate-oxidizing Sox enzyme complex, which is known to be part of the genomes of various groups of bacteria [45]. Several other genes involved in this process were also detected (e.g., ruthe_01784, ruthe_01785 and ruthe_01786).
Additional gene sequences of interest encode a predicted ring-cleavage extradiol dioxygenase (ruthe_00477), which indicates a possible degradation of aromatic compounds. A sensor of blue light using FAD (BLUF, ruthe_01818) was also found, indicating possible blue-light dependent signal transduction.