Complete genome sequence of Geodermatophilus obscurus type strain (G-20T)

Geodermatophilus obscurus Luedemann 1968 is the type species of the genus, which is the type genus of the family Geodermatophilaceae. G. obscurus is of interest because it has frequently been isolated from stressful environments such as rock varnish in deserts, and because it exhibits notable phenotypes such as the ability to lyse yeast cell walls, UV-C resistance, strong production of extracellular functional bacterial amyloid (FuBA) and manganese oxidation. This is the first completed genome sequence of the family Geodermatophilaceae. The 5,322,497 bp long genome with its 5,161 protein-coding and 58 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.


Introduction
Strain G-20T (= DSM 43160 = ATCC 25078 = JCM 3152) is the type strain of the species Geodermatophilus obscurus, which is the type species of the genus Geodermatophilus, the type genus of the family Geodermatophilaceae [1,2]. The species name derives from the Latin word 'obscurus' meaning dark, obscure, indistinct, unintelligible [1]. The genus Geodermatophilus and the family Geodermatophilaceae were originally proposed in 1968 by Luedemann [1]. Geodermatophilus was first described as a genus closely related to the genus Dermatophilus but isolated from soil, as indicated by the prefix 'geo', which derives from the Greek 'Gea' meaning Earth [1]. In contrast, members of the genus Dermatophilus originated from skin lesions of cattle, sheep, horses, deer, and man [3], in keeping with the genus name, which means 'skin-loving'. Yet, on the basis of 16S rRNA gene sequences, Geodermatophilus proved to be only distantly related to Dermatophilus [4] and was thus included in 1989 in the family Frankiaceae [5], together with the genera Blastococcus and Frankia. In 1996, the genera Geodermatophilus and Blastococcus were excluded again from the family Frankiaceae [6] and finally formally combined with the genus Modestobacter in the family Geodermatophilaceae [2]. G. obscurus is the only validly described species in the genus Geodermatophilus [7], and consists of four subspecies [1] which have never been validly published [8].
The type strain G-20T, together with other strains, was isolated from soil in the Amargosa Desert of Nevada, USA [3]. Further Geodermatophilus strains were isolated from limestone [8,9] and rock varnish [10] in the Negev Desert, Israel, from marble in Delos, Greece [8,9], from chestnut soil in Gardabani, Central Georgia [11], from rock varnish in the Whipple Mountains, California, USA [12], from orange patina of calcarenite in Noto, Italy [13], from gray to black patinas on marble in Ephesus, Turkey [13], and from high-altitude Mount Everest soils [14,15]. Here we present a summary classification and a set of features for G. obscurus G-20T, together with the description of the complete genomic sequencing and annotation.

Classification and features
Cells of Geodermatophilus produce densely packed cell aggregates [8], which are described as a muriform, tuber-shaped, noncapsulated, holocarpic thallus consisting of masses of cuboid cells averaging 0.5 to 2.0 µm in diameter (Table 1 and Figure 1) [1]. The thallus breaks up, liberating cuboid or coccoid nonmotile cells and elliptical to lanceolate zoospores [1]. The single cell can differentiate further into polar flagellated motile zoospores [15]. Thus, cells of Geodermatophilus may express a morphogenetic growth cycle in which they switch between a thalloid C-form and a motile zoosporic R-form [15]. It has been supposed that tryptose (Difco) contains an unidentified factor, M, which controls morphogenesis in Geodermatophilus [15], though others could not observe the motile, budding zoospores of the R-form [8]. As colonies, Geodermatophilus strains usually exhibit a dark brownish, greenish, or black pigmentation with a smooth to rough surface and in most cases a solid consistency, with minor variations in colony shape [8]. Young colonies are almost colorless and have smooth edges, which become distorted and lobed in older colonies, where the colony consistency becomes somewhat crumbly [8]. The colonies become darkly pigmented as soon as they start to protrude upwards into the space above the agar [8]. Geodermatophilus does not produce hyphae, vesicles, outer membranous spore layers or capsules [5]. Strain G-20T utilizes L-arabinose, D-galactose, D-glucose, glycerol, inositol, D-levulose, D-mannitol, sucrose, and D-xylose as single carbon sources for growth, but not D-arabinose, dulcitol, β-lactose, melezitose, α-melibiose, raffinose, D-ribose, or ethanol [1,23]. Growth on L-rhamnose is poor [1]. Strain G-20T is negative for β-hemolysis on blood agar (10% human blood) [1]. Nitrate reduction occurs only sporadically in both inorganic and organic nitrate broth [1].
Strain G-20T hydrolyses starch, is weakly positive for gelatin liquefaction and negative for casein utilization [23]. Strain G-20T showed a remarkable production of extracellular functional bacterial amyloid (FuBA), which is accessible to WO2 antibodies without saponification [24]. The WO2 antibody has been shown to bind only to amyloid and not to other kinds of protein aggregates [20,24]. One strain of G. obscurus was described as having lytic activity on yeast cell walls [12]. Another strain from rock varnish was shown to exhibit very strong resistance to UV-C light (220 J·m⁻²) [12]. Two strains from rock varnish in the Negev Desert were able to oxidize manganese [10]. Only three G. obscurus isolates have 16S rRNA gene sequences with >98% sequence similarity to strain G-20T: isolate G18 from Namibia, 99.1% [2]; isolate 06102S3-1 from deep-sea sediments of the East Pacific and Indian Ocean (EU603760), 98.5%; and G. obscurus subspecies utahensis DSM 43162, 98.03% [8]. The highest degree of sequence similarity in environmental metagenomic surveys, 93.3%, was reported from a marine metagenome (AACY020064011) from the Sargasso Sea [25] (as of January 2010). Figure 2 shows the phylogenetic neighborhood of G. obscurus G-20T in a 16S rRNA based tree. The sequences of the three 16S rRNA gene copies in the genome of G. obscurus G-20T do not differ from each other, but differ by 24 nucleotides from the previously published 16S rRNA sequence obtained from DSM 43160 (X92356). These considerable discrepancies are most likely due to sequencing errors in the latter sequence. GenBank accession L40620, which was obtained from ATCC 25078, differs by only a single nucleotide from the 16S rRNA gene copies in the genome obtained from DSM 43160.

Genome sequencing and annotation
Genome project history
This organism was selected for sequencing on the basis of its phylogenetic position, and is part of the Genomic Encyclopedia of Bacteria and Archaea project. The genome project is deposited in the Genome OnLine Database [30] and the complete genome sequence is deposited in GenBank. Sequencing, finishing and annotation were performed by the DOE Joint Genome Institute (JGI). A summary of the project information is shown in Table 2.

Genome sequencing and assembly
The genome was sequenced using a combination of Sanger and 454 sequencing platforms. All general aspects of library construction and sequencing performed at the JGI can be found at the JGI website (http://www.jgi.doe.gov/). The 454 pyrosequencing reads were assembled using the Newbler assembler version 1.1.02.15 (Roche). Large Newbler contigs were broken into 5,725 overlapping fragments of 1,000 bp and entered into the assembly as pseudo-reads. The sequences were assigned quality scores based on Newbler consensus q-scores, with modifications to account for overlap redundancy and to adjust inflated q-scores. A hybrid 454/Sanger assembly was made using the parallel phrap assembler (High Performance Software, LLC). Possible misassemblies were corrected with Dupfinisher or transposon bombing of bridging clones [38]. A total of 1,530 Sanger finishing reads were produced to close gaps, to resolve repetitive regions, and to raise the quality of the finished sequence. Illumina reads were used to improve the final consensus quality using an in-house developed tool (the Polisher). The error rate of the completed genome sequence is less than 1 in 100,000. Together, the Sanger and 454 sequencing platforms provided 29.8× coverage of the genome. The final assembly contains 48,209 Sanger reads and 353,553 pyrosequencing reads.
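The shredding of large contigs into overlapping pseudo-reads, and the notion of fold coverage, can be illustrated with a minimal Python sketch. This is not the JGI pipeline: the function names and the 500 bp step size are assumptions made for illustration only, since the text specifies the 1,000 bp fragment length but not the amount of overlap.

```python
def shred_contig(sequence, fragment_len=1000, step=500):
    """Split a contig into overlapping fixed-length fragments (pseudo-reads).

    fragment_len matches the 1,000 bp stated in the text; step (and hence
    the overlap) is a hypothetical value, not taken from the paper.
    """
    if fragment_len <= 0 or step <= 0:
        raise ValueError("fragment_len and step must be positive")
    if len(sequence) <= fragment_len:
        # Contig is too short to shred; keep it as a single pseudo-read.
        return [sequence]
    fragments = []
    start = 0
    while start + fragment_len < len(sequence):
        fragments.append(sequence[start:start + fragment_len])
        start += step
    # Anchor the last fragment at the contig end so that no bases are lost.
    fragments.append(sequence[-fragment_len:])
    return fragments


def fold_coverage(total_sequenced_bases, genome_length):
    """Average fold coverage: sequenced bases divided by genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return total_sequenced_bases / genome_length


if __name__ == "__main__":
    toy_contig = "ACGT" * 625  # 2,500 bp toy sequence, not real data
    pseudo_reads = shred_contig(toy_contig)
    print(len(pseudo_reads), "pseudo-reads of", len(pseudo_reads[0]), "bp")
```

Read in reverse, the same ratio implies that 29.8× coverage of a 5,322,497 bp genome corresponds to roughly 1.6 × 10^8 sequenced bases.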

Genome annotation
Genes were identified using Prodigal [39] as part of the Oak Ridge National Laboratory genome annotation pipeline, followed by a round of manual curation using the JGI GenePRIMP pipeline [40]. The predicted CDSs were translated and used to search the National Center for Biotechnology Information (NCBI) nonredundant database and the UniProt, TIGRFam, Pfam, PRIAM, KEGG, COG, and InterPro databases. Additional gene prediction analysis and functional annotation were performed within the Integrated Microbial Genomes - Expert Review (IMG-ER) platform [41].

Genome properties
The genome is 5,322,497 bp long and comprises one main chromosome with a 74.0% GC content (Figure 3 and Table 3). Of the 5,219 genes predicted, 5,161 were protein-coding genes and 58 were RNA genes. In addition, 350 pseudogenes were identified. The majority of the protein-coding genes (69.8%) were assigned a putative function, while the remaining ones were annotated as hypothetical proteins. The distribution of genes into COG functional categories is presented in Table 4.
Table 5 provides an overall comparison of the genome of G. obscurus strain G-20T with the closest available genomes, that is, those of Acidothermus cellulolyticus 11BT, Frankia alni ACN14A and Nakamurella multipartita Y-104T. The total length of (non-overlapping) high-scoring segment pairs (HSPs) and the number of identical base pairs within these HSPs were determined using the GGDC web server [42] by directly applying NCBI Blastn to the genomes represented as nucleotide sequences [43].
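As a quick consistency check on the gene counts reported under Genome properties, the following Python sketch recomputes the total and derives approximate counts from the stated percentage. The input numbers are taken from the text; the function name is hypothetical, and the derived counts are rounded approximations, not values taken from Table 3.

```python
GENOME_LENGTH_BP = 5_322_497
TOTAL_GENES = 5_219
PROTEIN_CODING_GENES = 5_161
RNA_GENES = 58
FRACTION_WITH_FUNCTION = 0.698  # 69.8% of protein-coding genes


def summarize_gene_counts():
    """Cross-check the reported gene counts and derive approximate figures."""
    # Approximate count only: the text gives a rounded percentage.
    with_function = round(PROTEIN_CODING_GENES * FRACTION_WITH_FUNCTION)
    return {
        # Protein-coding plus RNA genes should add up to the total.
        "counts_consistent": PROTEIN_CODING_GENES + RNA_GENES == TOTAL_GENES,
        "with_function": with_function,
        "hypothetical": PROTEIN_CODING_GENES - with_function,
        # Gene density in genes per kilobase of genome sequence.
        "genes_per_kb": TOTAL_GENES / (GENOME_LENGTH_BP / 1000),
    }


if __name__ == "__main__":
    for key, value in summarize_gene_counts().items():
        print(key, value)
```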

Comparison with closest related genomes
The number and proportion of shared homologs were determined using the 'Phylogenetic Profiler' function of the IMG system [41] with default values. While the relative order of the 16S rRNA differences does not correspond to the genomic similarities, the four genome-based measures uniformly indicate that N. multipartita Y-104T possesses the genome most similar to that of G. obscurus G-20T, followed by F. alni ACN14A and A. cellulolyticus 11BT.
Table 5. Percent-wise 16S rRNA sequence divergence compared to genomic similarity for the three closest available genomes to G. obscurus strain G-20T. GGD formulas: formula 1, length of sequence fragments not in HSPs per average total genome length; formula 2, number of non-identical bases per total HSP length; formula 3, number of non-identical bases within HSPs per average total genome length.
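The three GGD formulas defined above can be sketched in Python as follows. This is a literal reading of the verbal definitions, not the GGDC server's implementation, whose exact normalisation may differ. The function name, the argument names and the example numbers are hypothetical, and the sketch assumes that the total HSP length is expressed on the same scale as the average genome length.

```python
def ggd_distances(genome_len_a, genome_len_b, hsp_length, identities):
    """Return (formula 1, formula 2, formula 3) as read from the definitions.

    hsp_length: total length of non-overlapping HSPs between the genomes.
    identities: number of identical base pairs within those HSPs.
    """
    if hsp_length <= 0 or not 0 <= identities <= hsp_length:
        raise ValueError("need 0 <= identities <= hsp_length and hsp_length > 0")
    avg_len = (genome_len_a + genome_len_b) / 2
    if hsp_length > avg_len:
        raise ValueError("hsp_length exceeds the average genome length")
    non_identical = hsp_length - identities
    # Formula 1: sequence not covered by HSPs per average genome length.
    formula_1 = (avg_len - hsp_length) / avg_len
    # Formula 2: non-identical bases per total HSP length.
    formula_2 = non_identical / hsp_length
    # Formula 3: non-identical bases within HSPs per average genome length.
    formula_3 = non_identical / avg_len
    return formula_1, formula_2, formula_3


if __name__ == "__main__":
    # Toy numbers for illustration only, not values from Table 5.
    print(ggd_distances(1000, 1000, 500, 450))
```

Under this reading, formula 1 penalises genome regions without any HSP, formula 2 measures divergence only inside the aligned regions, and formula 3 scales that divergence by genome size.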