Complete genome sequence of Streptosporangium roseum type strain (NI 9100T)

Strain NI 9100T is the type strain of Streptosporangium roseum Couch 1955, the type species of the genus Streptosporangium. The ‘pinkish coiled Streptomyces-like organism with a spore case’ was isolated from vegetable garden soil in 1955. Here we describe the features of this organism, together with the complete genome sequence and annotation. This is the first completed genome sequence of a member of the family Streptosporangiaceae, and the second largest microbial genome sequence ever deciphered. The 10,369,518 bp long genome with its 9,421 protein-coding and 80 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.


Introduction
Strain NI 9100T (= DSM 43021 = ATCC 12428 = JCM 3005) is the type strain of the species Streptosporangium roseum, which is the type species of the genus Streptosporangium, the type genus of the actinobacterial suborder Streptosporangineae [1][2][3][4]. S. roseum NI 9100T was isolated from vegetable garden soil and first described by Couch in 1955 [2,4]. The genus name combines the Greek 'strepto', meaning 'coiled', with 'sporangium', Latin for 'spore case', to denote an organism that is 'streptomyces-like' but bears sporangia [2,4]. The species epithet 'roseum' derives from the pinkish color on potato dextrose agar [2]. Here we present a summary classification and a set of features for S. roseum NI 9100T, together with the description of the complete genome sequence and its annotation.

Classification and features
The 16S rRNA genes of the thirteen other validly named species currently ascribed to the genus Streptosporangium share 96-100% (S. vulgare [5]) sequence identity with NI 9100T, but S. claviforme (94%) [6,7] apparently belongs to the genus Herbidospora rather than to Streptosporangium and was therefore excluded from the phylogenetic analysis (see below). Two reference strains, DSM 43871 (X89949), and DSM 44111 (X89947), differ by just one nucleotide from strain NI 9100T, whereas the not effectively published 'species' 'S. [...] Figure 1a and Figure 1b show the phylogenetic neighborhood of S. roseum NI 9100T in 16S rRNA based trees. The sequences of the six 16S rRNA gene copies in the genome do not differ from each other and are identical to the previously published sequence generated from DSM 43021 (X89947), whereas the sequence generated in the same year from the JCM 3005 version of strain NI 9100T (U48996) differs by 24 nucleotides (1.7%).
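The nucleotide differences and percentages quoted above are pairwise comparisons of aligned 16S rRNA gene sequences. The following is a minimal sketch of such a comparison, not the procedure used in this study; the function name `pairwise_difference` and the decision to skip gapped columns are assumptions made for illustration.

```python
def pairwise_difference(seq_a: str, seq_b: str) -> tuple[int, float]:
    """Count mismatches between two pre-aligned sequences of equal length.

    Columns in which either sequence carries a gap ('-') are skipped
    (an illustrative choice; other conventions count gaps as differences).
    Returns (mismatch count, percent difference over compared columns).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # gapped column: not compared
        compared += 1
        if a != b:
            mismatches += 1

    if compared == 0:
        raise ValueError("no comparable columns in the alignment")
    return mismatches, 100.0 * mismatches / compared


# Toy alignment (hypothetical sequences, not real 16S rRNA data).
count, percent = pairwise_difference("ACGT-ACGT", "ACGTTACGA")
print(count, percent)  # 1 12.5
```

Under this convention, 24 mismatches over roughly 1,400 compared columns (an assumed alignment length) would correspond to about 1.7%.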

Figure 1a.
Phylogenetic tree highlighting the position of S. roseum NI 9100T relative to the type strains of the other species within the genus except for S. claviforme (see text). The tree was inferred from 1,411 aligned characters [8,9] of the 16S rRNA gene sequence under the maximum likelihood criterion [10] and rooted with the results of Figure 1b. The branches are scaled in terms of the expected number of substitutions per site. Numbers above branches are support values from 1,000 bootstrap replicates if larger than 60%. Lineages with type strain genome sequencing projects registered in GOLD [11] are shown in blue, published genomes in bold.

Figure 1b.
Phylogenetic tree highlighting the position of S. roseum NI 9100T relative to the type strains of the other genera within the suborder Streptosporangineae. The tree was inferred from 1,369 aligned characters [8,9] of the 16S rRNA gene sequence under the maximum likelihood criterion [10] and rooted in accordance with the current taxonomy. The branches are scaled in terms of the expected number of substitutions per site. Numbers above branches are support values from 1,000 bootstrap replicates if larger than 60%. Lineages with type strain genome sequencing projects registered in GOLD [11] are shown in blue, published genomes in bold.
A summary of the classification and features for S. roseum is listed in Table 1. Note that older and more recent literature report a number of contradictory results (see below). One possible, though not the only, source of these discrepancies is the use of different experimental methods. A variety of media were used in the original description of cellular and mycelium morphology (Figure 2).
The characteristics of the ribosomal protein AT-L30 of strain S. roseum JCM2178T in comparison to other bacteria of the genus Streptosporangium are described elsewhere [25]. These data should be treated with caution: according to the catalogue of the Japan Collection of Microorganisms (JCM), the strain number "JCM2178" is affiliated with Aspergillus oryzae (catalogue accessed in August 2009), hence the true identity of strain S. roseum JCM2178T in the study of Ochi [25] is unclear.
Altitude: not reported.
Evidence codes - IDA: Inferred from Direct Assay (first time in publication); TAS: Traceable Author Statement (i.e., a direct report exists in the literature); NAS: Non-traceable Author Statement (i.e., not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence). These evidence codes are from the Gene Ontology project [23]. If the evidence code is IDA, then the property was directly observed for a live isolate by one of the authors or an expert mentioned in the acknowledgements.

Genome project history
This organism was selected for sequencing on the basis of its phylogenetic position, and is part of the Genomic Encyclopedia of Bacteria and Archaea project. The genome project is deposited in the Genomes OnLine Database [11] and the complete genome sequence is deposited in GenBank. Sequencing, finishing and annotation were performed by the DOE Joint Genome Institute (JGI). A summary of the project information is shown in Table 2.

Growth conditions and DNA isolation
S. roseum NI 9100T, DSM 43021, was grown in DSMZ medium 535, Trypticase Soy Broth [28], at 28°C. DNA was isolated from 0.5-1 g of cell paste using the JGI CTAB procedure with modification ALM as described in [29].

Genome sequencing and assembly
The genome was sequenced using a combination of Sanger and 454 sequencing platforms. All general aspects of library construction and sequencing performed at the JGI can be found at http://www.jgi.doe.gov/. 454 pyrosequencing reads were assembled using the Newbler assembler version 1.1.02.15 (Roche). Large Newbler contigs were broken into 11,709 overlapping fragments of 1,000 bp and entered into the assembly as pseudo-reads. The sequences were assigned quality scores based on Newbler consensus q-scores with modifications to account for overlap redundancy and to adjust inflated q-scores. A hybrid 454/Sanger assembly was made using the parallel phrap assembler (High Performance Software, LLC). Possible mis-assemblies were corrected with Dupfinisher [30] or transposon bombing of bridging clones (Epicentre Biotechnologies, Madison, WI). Gaps between contigs were closed by editing in Consed, custom primer walks or PCR amplification. A total of 2,837 Sanger finishing reads were produced to close gaps, to resolve repetitive regions, and to raise the quality of the finished sequence. The error rate of the completed genome sequence is less than 1 in 100,000. Together all sequence types provided 36.05× coverage of the genome. The final assembly contains 128,042 Sanger and 1,033,578 pyrosequence reads.
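The step in which large contigs are broken into overlapping 1,000 bp pseudo-reads can be illustrated with a short sketch. This is not the JGI implementation: the function name `shred_contig` is hypothetical, and the 500 bp step (that is, a 500 bp overlap) is an assumption, since the text gives the fragment length but not the overlap.

```python
def shred_contig(sequence: str, fragment_len: int = 1000, step: int = 500) -> list[str]:
    """Split one contig into overlapping fragments ("pseudo-reads").

    Consecutive fragments start `step` bases apart, so they overlap by
    `fragment_len - step` bases. The default step of 500 is an assumed
    value chosen for illustration only.
    """
    if fragment_len <= 0 or not 0 < step <= fragment_len:
        raise ValueError("require fragment_len > 0 and 0 < step <= fragment_len")
    if not sequence:
        return []
    if len(sequence) <= fragment_len:
        return [sequence]  # contig too short to shred

    fragments = []
    start = 0
    while start + fragment_len < len(sequence):
        fragments.append(sequence[start:start + fragment_len])
        start += step
    # Anchor the last fragment to the contig end so no bases are dropped.
    fragments.append(sequence[-fragment_len:])
    return fragments


# Hypothetical 2,300 bp contig (placeholder sequence, not real data).
contig = "ACGT" * 575
pseudo_reads = shred_contig(contig)
print(len(pseudo_reads))  # 4
```

Anchoring the final fragment to the contig end keeps every fragment at the full length, at the cost of a larger overlap between the last two fragments.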

Genome annotation
Genes were identified using Prodigal [31] as part of the Oak Ridge National Laboratory genome annotation pipeline, followed by a round of manual curation using the JGI GenePRIMP pipeline (http://geneprimp.jgi-psf.org) [32]. The predicted CDSs were translated and used to search the National Center for Biotechnology Information (NCBI) nonredundant database, UniProt, TIGRFam, Pfam, PRIAM, KEGG, COG, and InterPro databases. Additional gene prediction analysis and functional annotation were performed within the Integrated Microbial Genomes - Expert Review (IMG-ER) platform [33].

Genome properties
The genome consists of a 10,341,314 bp long chromosome and a small 28,204 bp plasmid, with an overall GC content of 70.9% (Table 3 and Figure 3). Of the 9,501 genes predicted, 9,421 were protein-coding genes and 80 were RNA genes. In addition, 446 pseudogenes were identified. The majority of protein-coding genes (62.5%) were assigned a putative function, while those remaining were annotated as hypothetical proteins. The distribution of genes into COGs functional categories is presented in Table 4.
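The totals quoted in this section follow directly from the replicon sizes and gene counts. The sketch below is a simple consistency check using only the figures stated in the text; the function name `genome_summary` and the derived quantities (genes per Mb, replicon shares) are illustrative additions and are not values reported in Table 3.

```python
def genome_summary(replicon_sizes_bp: dict[str, int],
                   protein_coding: int,
                   rna_genes: int) -> dict:
    """Derive total genome size, total gene count and simple ratios."""
    total_bp = sum(replicon_sizes_bp.values())
    total_genes = protein_coding + rna_genes
    return {
        "total_bp": total_bp,
        "total_genes": total_genes,
        # Gene density over the whole genome (derived, not from the paper).
        "genes_per_mb": total_genes / (total_bp / 1_000_000),
        # Share of each replicon in the total genome length, in percent.
        "replicon_share_pct": {
            name: 100.0 * size / total_bp
            for name, size in replicon_sizes_bp.items()
        },
    }


summary = genome_summary(
    {"chromosome": 10_341_314, "plasmid": 28_204},
    protein_coding=9_421,
    rna_genes=80,
)
print(summary["total_bp"])     # 10369518
print(summary["total_genes"])  # 9501
```

Both sums agree with the genome size and gene count given in the abstract and in this section.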