Complete genome sequence of Gordonia bronchialis type strain (3410T)

Gordonia bronchialis Tsukamura 1971 is the type species of the genus. G. bronchialis is a human-pathogenic organism that has been isolated from a large variety of human tissues. Here we describe the features of this organism, together with the complete genome sequence and annotation. This is the first completed genome sequence of the family Gordoniaceae. The 5,290,012 bp long genome with its 4,944 protein-coding and 55 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.


Introduction
Strain 3410T (= DSM 43247 = ATCC 25592 = JCM 3198) is the type strain of the species Gordonia bronchialis, which is the type species of the genus. The genus Gordonia (formerly Gordona) was originally proposed by Tsukamura in 1971 [1]. The generic name Gordona was chosen to honor Ruth E. Gordon, who extensively studied 'Mycobacterium' rhodochrous (later included as a member of Gordona) [1]. In 1977, the genus was subsumed into the genus Rhodococcus [2], but it was revived in 1988 by Stackebrandt et al. [3]. At the time of writing, the genus contained 28 validly published species [4]. The genus Gordonia is of great interest for its bioremediation potential [5]. Some species of the genus have been used for the decontamination of polluted soils and water [6,7]. Other species were isolated from industrial waste water [8], activated sludge foam [9], an automobile tire [10], mangrove rhizosphere [11], tar-contaminated oil [12], soil [13] and an oil-producing well [7]. Further industrial interest in Gordonia species stems from their use as a source of novel enzymes [14,15]. A number of Gordonia species, however, are associated with human and animal diseases [16], among them G. bronchialis. Here we present a summary classification and a set of features for G. bronchialis 3410T, together with the description of the complete genomic sequencing and annotation.

Classification and features
Strain 3410T was isolated from the sputum of a patient with pulmonary disease (probably in Japan) [1]. Further clinical strains in Japan have been isolated from pleural fluid, a tumor in the eyelid, granuloma, leukorrhea, skin tissue and pus [17]. In other cases, G. bronchialis caused bacteremia in a patient with a sequestrated lung [18] and a recurrent breast abscess in an immunocompetent patient [19]. Finally, G. bronchialis was isolated from sternal wound infections after coronary artery bypass surgery [20]. G. bronchialis shares 95.8-98.7% 16S rRNA gene sequence similarity with the other type strains of the genus Gordonia, and 95.3-96.4% with the type strains of the neighboring genus Williamsia. Figure 1 shows the phylogenetic neighborhood of G. bronchialis 3410T in a 16S rRNA based tree. The sequences of the two 16S rRNA gene copies in the genome of G. bronchialis 3410T differ from each other by one nucleotide, and differ by up to 5 nucleotides from the previously published 16S rRNA sequence from DSM 43247 (X79287). These discrepancies are most likely due to sequencing errors in the latter sequence.

Figure 1. Phylogenetic tree highlighting the position of G. bronchialis 3410T relative to the other type strains within the genus Gordonia. The tree was inferred from 1,446 aligned characters [21,22] of the 16S rRNA gene sequence under the maximum likelihood criterion [23] and rooted with the type strains of the neighboring genus Williamsia. The branches are scaled in terms of the expected number of substitutions per site. Numbers above branches are support values from 1,000 bootstrap replicates if larger than 60%. Lineages with type strain genome sequencing projects registered in GOLD [24] are shown in blue, published genomes in bold.
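The similarity percentages and nucleotide differences quoted above can be related to each other with a small calculation. The following is a minimal sketch, not the method used in this study: the function names are hypothetical, the toy sequences are invented for illustration, and the use of 1,446 columns as the comparison length is an assumption borrowed from the alignment length reported for the tree, since the text does not state the length over which the 16S rRNA copies were compared.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip gapped columns rather than counting them as mismatches
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


def identity_from_mismatches(mismatches: int, compared_length: int) -> float:
    """Percent identity implied by a mismatch count over a given number of columns."""
    if compared_length <= 0 or not 0 <= mismatches <= compared_length:
        raise ValueError("invalid mismatch count or length")
    return 100.0 * (compared_length - mismatches) / compared_length


if __name__ == "__main__":
    # Toy aligned sequences (hypothetical, not real 16S rRNA data).
    print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))
    # ASSUMPTION: 1,446 compared columns, taken from the tree alignment length.
    print(round(identity_from_mismatches(1, 1446), 2))
    print(round(identity_from_mismatches(5, 1446), 2))
```

Under that assumed length, one mismatch or five mismatches both correspond to identities well above 99.6%, which is consistent with attributing the differences from X79287 to sequencing errors rather than to strain divergence.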

Chemotaxonomy
The cell-wall peptidoglycan is based upon meso-diaminopimelic acid (variation A1γ). The glycan moiety of the peptidoglycan contains N-glycolylmuramic acid. The wall sugars are arabinose and galactose. Mycolic acids are present with a range of ca. 48-66 carbon atoms. The predominant menaquinone is MK-9(H2), with only low amounts of MK-9(H0), MK-8(H2), and MK-7(H2) [3,8,39-41]. Moreover, the cell envelope of G. bronchialis 3410T contains a lipoarabinomannan-like lipoglycan [42]. The same study also observed a second amphiphilic fraction with properties suggesting a phosphatidylinositol mannoside [42]. The cellular fatty acid composition (%) is C16:0 (23), tuberculostearic acid (20), C16:1cis9 (16), C16:1cis7 (11), C18:1 (10), and 10-methyl C17:0 (7). All other fatty acids are at 3% or below [8].

Altitude: not reported. Evidence codes: IDA, Inferred from Direct Assay (first time in publication); TAS, Traceable Author Statement (i.e., a direct report exists in the literature); NAS, Non-traceable Author Statement (i.e., not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence). These evidence codes are from the Gene Ontology project [35]. If the evidence code is IDA, then the property was directly observed for a live isolate by one of the authors or an expert mentioned in the acknowledgements.

Genome project history
This organism was selected for sequencing on the basis of its phylogenetic position, and is part of the Genomic Encyclopedia of Bacteria and Archaea project. The genome project is deposited in the Genome OnLine Database [24] and the complete genome sequence is deposited in GenBank. Sequencing, finishing and annotation were performed by the DOE Joint Genome Institute (JGI). A summary of the project information is shown in Table 2.

Genome sequencing and assembly
The genome was sequenced using a combination of Sanger and 454 sequencing platforms. All general aspects of library construction and sequencing performed at the JGI can be found on the JGI website. 454 pyrosequencing reads were assembled using the Newbler assembler version 1.1.02.15 (Roche). Large Newbler contigs were broken into 5,776 overlapping fragments of 1,000 bp and entered into the assembly as pseudo-reads. The sequences were assigned quality scores based on Newbler consensus q-scores, with modifications to account for overlap redundancy and to adjust inflated q-scores. A hybrid 454/Sanger assembly was made using the parallel phrap assembler (High Performance Software, LLC). Possible mis-assemblies were corrected with Dupfinisher [45] or transposon bombing of bridging clones (Epicentre Biotechnologies, Madison, WI). Gaps between contigs were closed by editing in Consed, custom primer walking or PCR amplification. A total of 876 primer walk reactions, 12 transposon bombs, and 1 PCR shatter library were necessary to close gaps, to resolve repetitive regions, and to raise the quality of the finished sequence. The error rate of the completed genome sequence is less than 1 in 100,000. Together, all sequence types provided 51.2× coverage of the genome. The final assembly contains 52,329 Sanger and 508,130 pyrosequence reads.
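The step in which large Newbler contigs are broken into overlapping 1,000 bp pseudo-reads can be illustrated with a short sketch. This is not the JGI implementation: the function name `shred_contig` is hypothetical, and the 500 bp step between fragment starts is an assumption, since the text gives only the fragment length and the total fragment count, not the overlap.

```python
from typing import Iterator, Tuple


def shred_contig(
    sequence: str, fragment_len: int = 1000, step: int = 500
) -> Iterator[Tuple[int, str]]:
    """Yield (start, fragment) pairs that tile a contig with overlapping windows.

    ASSUMPTION: `step` (distance between fragment starts) is illustrative only;
    the source does not report the overlap that was used.
    """
    if fragment_len <= 0 or not 0 < step <= fragment_len:
        raise ValueError("need fragment_len > 0 and 0 < step <= fragment_len")
    n = len(sequence)
    if n <= fragment_len:
        # Short contigs are passed through as a single pseudo-read.
        yield 0, sequence
        return
    start = 0
    while start + fragment_len < n:
        yield start, sequence[start:start + fragment_len]
        start += step
    # Anchor the last window to the contig end so the tail is fully covered.
    yield n - fragment_len, sequence[n - fragment_len:]


if __name__ == "__main__":
    toy_contig = "ACGT" * 575  # hypothetical 2,300 bp contig
    for start, fragment in shred_contig(toy_contig):
        print(start, len(fragment))
```

Every fragment except a whole short contig has exactly `fragment_len` bases, and consecutive fragments overlap, so the assembler can rejoin the pseudo-reads and merge them with Sanger reads.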

Genome annotation
Genes were identified using Prodigal [46] as part of the Oak Ridge National Laboratory genome annotation pipeline, followed by a round of manual curation using the JGI GenePRIMP pipeline [47]. The predicted CDSs were translated and used to search the National Center for Biotechnology Information (NCBI) nonredundant database, UniProt, TIGRFam, Pfam, PRIAM, KEGG, COG, and InterPro databases. Additional gene prediction analysis and manual functional annotation were performed within the Integrated Microbial Genomes Expert Review (IMG-ER) platform [48].

Genome properties
The genome consists of a 5.2 Mbp long chromosome and an 81,410 bp plasmid (Table 3 and Figure 3). Of the 4,999 genes predicted, 4,944 were protein-coding genes and 55 were RNA genes; 264 pseudogenes were also identified. The majority of the protein-coding genes (69.1%) were assigned a putative function, while the remaining ones were annotated as hypothetical proteins. The distribution of genes into COG functional categories is presented in Table 4.
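The figures reported above can be cross-checked against one another with simple arithmetic. The sketch below is illustrative only: the class and field names are hypothetical, and it assumes that the 5,290,012 bp genome size given in the abstract is the sum of the chromosome and the plasmid, which the text implies but does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomeSummary:
    total_bp: int             # ASSUMPTION: chromosome + plasmid combined
    plasmid_bp: int
    protein_coding: int
    rna_genes: int
    pseudogenes: int
    pct_with_function: float  # percent of protein-coding genes with a putative function

    @property
    def chromosome_bp(self) -> int:
        """Chromosome length implied by total size minus plasmid size."""
        return self.total_bp - self.plasmid_bp

    @property
    def total_genes(self) -> int:
        return self.protein_coding + self.rna_genes

    @property
    def genes_with_function(self) -> int:
        """Approximate count, rounded because the percentage is rounded."""
        return round(self.protein_coding * self.pct_with_function / 100.0)


summary = GenomeSummary(
    total_bp=5_290_012,
    plasmid_bp=81_410,
    protein_coding=4_944,
    rna_genes=55,
    pseudogenes=264,
    pct_with_function=69.1,
)

if __name__ == "__main__":
    print(summary.chromosome_bp)        # implied chromosome length in bp
    print(summary.total_genes)          # should match the 4,999 predicted genes
    print(summary.genes_with_function)  # approximate, derived from 69.1%
```

Under that assumption the implied chromosome length rounds to the 5.2 Mbp quoted in the text, and the protein-coding and RNA gene counts sum to the reported total of 4,999.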