Complete genome sequence of Nocardiopsis dassonvillei type strain (IMRU 509T)

Nocardiopsis dassonvillei (Brocq-Rousseau 1904) Meyer 1976 is the type species of the genus Nocardiopsis, which in turn is the type genus of the family Nocardiopsaceae. This species is of interest because of its ecological versatility. Members of N. dassonvillei have been isolated from a large variety of natural habitats such as soil and marine sediments, from different plant and animal materials as well as from human patients. Moreover, representatives of the genus Nocardiopsis participate actively in biopolymer degradation. This is the first complete genome sequence in the family Nocardiopsaceae. Here we describe the features of this organism, together with the complete genome sequence and annotation. The 6,543,312 bp long genome consists of a 5.77 Mbp chromosome and a 0.78 Mbp plasmid, and with its 5,570 protein-coding and 77 RNA genes is a part of the Genomic Encyclopedia of Bacteria and Archaea project.


Introduction
Strain IMRU 509T (= DSM 43111 = ATCC 23218 = JCM 7437) is the type strain of Nocardiopsis dassonvillei, which in turn is the type species of the genus Nocardiopsis. Currently, N. dassonvillei is one of 40 validly published species belonging to the genus. The genus name derives from the Greek name opsis, appearance, and from Edmond Nocard, who first described in 1888 the type species of the genus Nocardia, N. farcinica [1,2]. Nocardiopsis means "that which has the appearance of Nocardia". The species epithet is chosen in honor of Charles Dassonville, a contemporary French veterinarian [3]. The genus Nocardiopsis was first described by Meyer in 1976 [4] for bacteria that were previously classified as either Streptothrix dassonvillei (Brocq-Rousseau 1904) [3], Nocardia dassonvillei [5], or Actinomadura dassonvillei [6] on the basis of their morphological characteristics and cell wall type [4]. Strain IMRU 509T is the neotype of the species N. dassonvillei (Brocq-Rousseau 1904). Databases provide contradictory speculations on the ecological and geographical origin of strain IMRU 509T (e.g., soil from Paris, France; mildewed grain of unspecified geographical origin); however, solid information could not be extracted from the original literature [4,5,7-9].
Members of this species can be isolated from a variety of different habitats, including mildewed grain and fodder [3], different soils [10-13], an Antarctic glacier [14], marine sediments [10,15], actinorhizal plant rhizosphere [16], the gut tract of animals [17], active stalactites [18], cotton waste and occasionally hay [19], the air of a cattle barn [20], the atmosphere of a composting facility [21], salterns [22] and patients suffering from conjunctivitis [23] or cholangitis [8]. N. dassonvillei strains were also isolated from nodules and draining sinuses associated with an actinomycetoma of the anterior aspect of the right leg below the knee of a 39-year-old man [24].
A microorganism identical to Streptothrix dassonvillei was isolated two years later, but was placed in the genus Nocardia and designated N. dassonvillei [23]. Subsequently, the genus Actinomadura was described to harbor, among other species, also N. dassonvillei (Brocq-Rousseau) Liegard and Landrieu [4,8]. Further analysis supplied evidence that A. dassonvillei is not related to nocardiae [7]. Therefore, a new genus was created for A. dassonvillei on the basis of the characteristic development of spores, including the specific zig-zag formation of aerial hyphae before spore dispersal and the lack of madurose [4]. In 1976, A. dassonvillei was transferred to this new genus and was designated Nocardiopsis dassonvillei [4]. Also, N. dassonvillei is an earlier heterotypic synonym of N. alborubida [25]. The species epithet alborubida was considered orthographically incorrect and was corrected by Evtushenko to albirubida [10]. Subsequently, the species N. dassonvillei has been divided into three subspecies, namely subsp. prasina [26], subsp. albirubida (Grund and Kroppenstedt 1990) [10] and subsp. dassonvillei (Brocq-Rousseau 1904) [4,27], which is an earlier heterotypic synonym of Streptomyces flavidofuscus Preobrazhenskaya 1986 [28]. DNA-DNA hybridization data, as well as the results of biochemical tests, indicated that N. alborubida DSM 40465, N. antarctica DSM 43884, and N. dassonvillei DSM 43111 represent a single species designated N. dassonvillei [25].
Here we present a summary classification and a set of features for N. dassonvillei strain IMRU 509T, together with the description of the complete genomic sequencing and annotation.

Classification and features
The 16S rRNA gene sequences of strain IMRU 509T share 95.9 to 99.5% sequence similarity with the 16S rRNA gene sequences of the type strains of the other members of the genus Nocardiopsis [29]. The 16S rRNA gene of strain IMRU 509T also shares 99% similarity with an uncultured 16S rRNA gene sequence of the clone AKIW919 from urban aerosol in the USA [30], but none of the sequences in metagenomic libraries (env_nt) shares more than 89% sequence identity, indicating that members of the species, genus and even family are poorly represented in the habitats screened thus far (as of November 2010). A representative genomic 16S rRNA sequence of N. dassonvillei was compared with the most recent release of the Greengenes database [31] using NCBI BLAST under default values, and the relative frequencies of taxa and keywords, weighted by BLAST scores, were determined. The three most frequent genera were Nocardiopsis (91.1%), Streptomyces (7.1%) and Prauseria (1.8%). The species yielding the highest score was N. dassonvillei (including hits to N. dassonvillei subsp. dassonvillei, formerly also known as Streptomyces flavidofuscus [9,28]). The five most frequent keywords within the labels of environmental samples which yielded hits were 'soil(s)' (15.4%), 'algeria, nocardiopsis, saccharothrix, saharan' (5.7%), 'source' (2.0%) and 'alkaline' (2.0%). These keywords fit the morphology of the type strain as well as the ecology of the habitats from which the type strain and other members of the species were isolated. The single most frequent keyword within the labels of environmental samples which yielded hits of a higher score than the highest scoring species was 'desert/soil' (50.0%). Figure 1 shows the phylogenetic neighborhood of N. dassonvillei strain IMRU 509T in a 16S rRNA based tree.
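The score-weighted frequencies described above can be illustrated with a minimal sketch. The exact weighting scheme is not given in the text, so the code assumes the simplest reading: each hit contributes its BLAST score to its genus (or keyword), and frequencies are the resulting shares of the total score. The function name and the example hits are hypothetical and are not data from this study.

```python
from collections import defaultdict


def weighted_frequencies(hits):
    """Return the percentage share of total score per label.

    `hits` is an iterable of (label, score) pairs, e.g. the genus of a
    database hit and its BLAST score. Assumption: weighting means that
    scores are summed per label and divided by the grand total.
    """
    totals = defaultdict(float)
    for label, score in hits:
        totals[label] += score
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: 100.0 * s / grand_total for label, s in totals.items()}


# Hypothetical hits for illustration only (not values from the paper).
example_hits = [
    ("Nocardiopsis", 2000.0),
    ("Nocardiopsis", 1800.0),
    ("Streptomyces", 200.0),
]
example_frequencies = weighted_frequencies(example_hits)
```

Weighting by score rather than counting hits gives high-scoring matches more influence on the reported percentages than weak ones.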
The sequences of the five 16S rRNA gene copies in the genome differ from each other by up to ten nucleotides, and differ by up to eight nucleotides from the previously published 16S rRNA sequence (X97886).
Figure 1. Phylogenetic tree highlighting the position of N. dassonvillei strain IMRU 509T relative to the type strains of the other species within the genus and to the type strains of the other genera within the family Nocardiopsaceae. The tree was inferred from 1,442 aligned characters [32,33] of the 16S rRNA gene sequence under the maximum likelihood criterion [34] and rooted in accordance with the current taxonomy [35]. The branches are scaled in terms of the expected number of substitutions per site. Numbers above branches are support values from 750 bootstrap replicates [36] if larger than 60%. Lineages with type strain genome sequencing projects registered in GOLD [37] are shown in blue, published genomes in bold [38]. Note that the tree is more in accordance with the view of Grund and Kroppenstedt (1990) [39] to treat N. alborubida as a species of its own, rather than with the view of Yassin et al. (1997) [25] and Evtushenko et al. (2000) [10] to regard it as a subspecies of N. dassonvillei based on a 71% DDH value [10].
Table 1 (fragment). Altitude: not reported (evidence code NAS). Evidence codes - IDA: Inferred from Direct Assay (first time in publication); TAS: Traceable Author Statement (i.e., a direct report exists in the literature); NAS: Non-traceable Author Statement (i.e., not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence). These evidence codes are from the Gene Ontology project [50]. If the evidence code is IDA, then the property was directly observed by one of the authors or an expert mentioned in the acknowledgements.
The cells of strain IMRU 509T are aerobic and Gram-positive [4] (Table 1). Aerial mycelia are long, moderately branched, and, at the beginning of sporulation, more or less zig-zag-shaped (Figure 2). Later, the hyphae are straight or somewhat coiled [4]. They then divide into long segments which subsequently subdivide into smaller spores of irregular size [4]. Spores are elongated and smooth.

Genome sequencing and annotation
Genome project history
This organism was selected for sequencing on the basis of its phylogenetic position [54], and is part of the Genomic Encyclopedia of Bacteria and Archaea project [55]. The genome project is deposited in the Genome OnLine Database [37] and the complete genome sequence is deposited in GenBank. Sequencing, finishing and annotation were performed by the DOE Joint Genome Institute (JGI). A summary of the project information is shown in Table 2.
[...] medium) [56] at 28°C. DNA was isolated from 0.5-1 g of cell paste using the Qiagen Genomic 500 DNA Kit (Qiagen, Hilden, Germany) following the standard protocol as recommended by the manufacturer, with modification st/DALM for cell lysis as described in Wu et al. [55].

Genome sequencing and assembly
The genome was sequenced using a combination of Sanger and 454 sequencing platforms. All general aspects of library construction and sequencing can be found at the JGI website [57]. Pyrosequencing reads were assembled using the Newbler assembler version 2.1-PreRelease (Roche). Large Newbler contigs were broken into 6,356 overlapping fragments of 1,000 bp and entered into the assembly as pseudo-reads. The sequences were assigned quality scores based on Newbler consensus q-scores, with modifications to account for overlap redundancy and to adjust inflated q-scores. A hybrid 454/Sanger assembly was made using the PGA assembler. Possible mis-assemblies were corrected and gaps between contigs were closed by editing in Consed and by custom primer walks from sub-clones or PCR products. A total of 462 Sanger finishing reads were produced to close gaps, to resolve repetitive regions, and to raise the quality of the finished sequence. Illumina reads were used to improve the final consensus quality using an in-house developed tool (the Polisher) [58]. The error rate of the completed genome sequence is less than 1 in 100,000. Together, the combination of the Sanger and 454 sequencing platforms provided 28.77× coverage of the genome. The final assembly contains 68,385 Sanger reads and 1,376,163 pyrosequencing reads.
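The step of breaking large contigs into overlapping 1,000 bp pseudo-reads can be sketched as below. Only the fragment length comes from the text; the overlap size (100 bp here), the handling of the contig tail and the function name are assumptions made for illustration, and this is not the JGI pipeline code.

```python
def shred_contig(seq, fragment_len=1000, overlap=100):
    """Split a contig into overlapping fixed-length fragments.

    fragment_len=1000 follows the text; overlap=100 is an assumed value,
    since the overlap actually used is not stated. The last fragment is
    anchored at the contig end so the tail is also covered at full length.
    """
    if fragment_len <= 0 or not 0 <= overlap < fragment_len:
        raise ValueError("need fragment_len > 0 and 0 <= overlap < fragment_len")
    if not seq:
        return []
    if len(seq) <= fragment_len:
        return [seq]  # contig too short to shred; keep it whole
    step = fragment_len - overlap
    fragments = []
    start = 0
    while start + fragment_len < len(seq):
        fragments.append(seq[start:start + fragment_len])
        start += step
    fragments.append(seq[-fragment_len:])  # final fragment ends at the contig end
    return fragments


# Toy 2,500 bp contig for illustration only.
toy_contig = "ACGT" * 625
toy_fragments = shred_contig(toy_contig)
```

Entering consensus sequence as overlapping pseudo-reads lets a Sanger-oriented assembler treat the 454 consensus like ordinary reads, and the overlaps allow neighboring fragments to be rejoined.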

Genome annotation
Genes were identified using Prodigal [59] as part of the Oak Ridge National Laboratory genome annotation pipeline, followed by a round of manual curation using the JGI GenePRIMP pipeline [60]. The predicted CDSs were translated and used to search the National Center for Biotechnology Information (NCBI) nonredundant database, UniProt, TIGRFam, Pfam, PRIAM, KEGG, COG, and InterPro databases. Additional gene prediction analysis and functional annotation were performed within the Integrated Microbial Genomes - Expert Review (IMG-ER) platform [61].

Genome properties
The genome consists of a 5,767,958 bp long chromosome with a 73% GC content, and a 775,354 bp long plasmid with a 72% GC content (Table 3, Figure 3a and Figure 3b). Of the 5,647 genes predicted, 5,570 were protein-coding genes and 77 were RNA genes; 73 pseudogenes were also identified. The majority of the protein-coding genes (69.6%) were assigned a putative function, while the remaining ones were annotated as hypothetical proteins. The distribution of genes into COGs functional categories is presented in Table 4.
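As a consistency check on the figures above, the following sketch recomputes the totals from the per-replicon values. The length-weighted GC content and the count of genes with an assigned function are estimates derived from the rounded percentages in the text, not values reported by the authors; the function name is hypothetical.

```python
# Values quoted in the text.
CHROMOSOME_BP = 5_767_958
PLASMID_BP = 775_354
CHROMOSOME_GC = 0.73  # rounded percentage from the text
PLASMID_GC = 0.72     # rounded percentage from the text
PROTEIN_CODING = 5_570
RNA_GENES = 77
FUNCTION_ASSIGNED_FRACTION = 0.696


def summarize_genome():
    """Recompute genome totals from the per-replicon figures."""
    total_bp = CHROMOSOME_BP + PLASMID_BP
    total_genes = PROTEIN_CODING + RNA_GENES
    # Length-weighted GC estimate; approximate because the inputs are rounded.
    overall_gc = (CHROMOSOME_BP * CHROMOSOME_GC + PLASMID_BP * PLASMID_GC) / total_bp
    # Estimated number of protein-coding genes with a putative function.
    with_function = round(PROTEIN_CODING * FUNCTION_ASSIGNED_FRACTION)
    return {
        "total_bp": total_bp,
        "total_genes": total_genes,
        "overall_gc": overall_gc,
        "with_function": with_function,
        "hypothetical": PROTEIN_CODING - with_function,
    }


summary = summarize_genome()
```

The recomputed totals (6,543,312 bp and 5,647 genes) agree with the genome size given in the abstract and the gene count given above.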