Genome of the marine alphaproteobacterium Hoeflea phototrophica type strain (DFL-43T)

Hoeflea phototrophica Biebl et al. 2006 is a member of the family Phyllobacteriaceae in the order Rhizobiales, which is thus far only partially characterized at the genome level. This marine bacterium contains the photosynthesis reaction-center genes pufL and pufM and is of interest because it lives in close association with toxic dinoflagellates such as Prorocentrum lima. The 4,467,792 bp genome (permanent draft sequence) with its 4,296 protein-coding and 69 RNA genes is a part of the Marine Microbial Initiative.


Introduction
Strain DFL-43T (= DSM 17068 = NCIMB 14078) is the type strain of Hoeflea phototrophica, a marine member of the Phyllobacteriaceae (Rhizobiales, Alphaproteobacteria) [1]. The genus, which was named in honor of the German microbiologist Manfred Höfle [2], contains four species, with H. marina as type species [2]; the name of a fifth member of the genus, 'Hoeflea siderophila', has so far only been effectively published [3]. H. phototrophica DFL-43T and strain DFL-44 were found in the course of a screening program for marine bacteria containing the photosynthesis reaction-center genes pufL and pufM [4]. The species epithet 'phototrophica' refers to the likely ability of H. phototrophica strains to use light as an additional energy source [1]. Strain DFL-43T was isolated from single cells of a culture of the toxic dinoflagellate Prorocentrum lima maintained at the Biological Research Institute of Helgoland, Germany [1]. Here we present a summary classification and a set of features for H. phototrophica DFL-43T, including previously undescribed aspects of its phenotype, together with the description of the complete genomic sequencing and annotation. This work is part of the Marine Microbial Initiative (MMI), which enabled the J. Craig Venter Institute (JCVI) to sequence the genomes of approximately 165 marine microbes with funding from the Gordon and Betty Moore Foundation. These microbes were contributed by collaborators worldwide and represent an array of physiological diversity, including carbon fixers, photoautotrophs, photoheterotrophs, nitrifiers, and methanotrophs. The MMI was designed to complement other ongoing research at JCVI and elsewhere to characterize the microbial biodiversity of marine and terrestrial environments through metagenomic profiling of environmental samples.

Classification and features

16S rRNA analysis
A representative genomic 16S rRNA sequence of H. phototrophica DFL-43T was compared with the most recent release of the Greengenes database [7] using NCBI BLAST [5,6] under default settings (e.g., considering only the high-scoring segment pairs (HSPs) from the best 250 hits), and the relative frequencies of taxa and keywords (reduced to their stem [8]) were determined, weighted by BLAST scores. The most frequently occurring genera were Rhizobium (53.7%), Sinorhizobium (24.0%), Hoeflea (4.5%), Bartonella (4.5%) and Ahrensia (3.7%) (132 hits in total). Regarding the two hits to sequences from members of the species, both the average identity within HSPs and the average coverage by HSPs were 100.0%. Regarding the single hit to sequences from other members of the genus, the average identity within HSPs was 98.2%, whereas the average coverage by HSPs was 100.0%. Among all other species, the one yielding the highest score was H. marina (AY598817), which corresponded to an identity of 98.2% and an HSP coverage of 100.0%. (Note that the Greengenes database uses the INSDC (= EMBL/NCBI/DDBJ) annotation, which is not an authoritative source for nomenclature or classification.) The highest-scoring environmental sequence was AY922224 (Greengenes short name 'whalefall clone 131720'), which showed an identity of 98.1% and an HSP coverage of 97.5%. The most frequently occurring keywords within the labels of all environmental samples which yielded hits were 'bee' (3.1%), 'singl' (3.0%), 'abdomen, bumbl, distinct, honei, microbiota, simpl' (2.9%), 'microbi' (2.8%) and 'structur' (1.8%) (118 hits in total). Environmental samples which yielded hits of a higher score than the highest-scoring species were not found, indicating that H. phototrophica is rarely found in environmental samples. Figure 1 shows the phylogenetic neighborhood of H. phototrophica in a 16S rRNA-based tree.
The sequences of the two identical 16S rRNA gene copies in the genome differ by one nucleotide from the previously published 16S rRNA sequence (AJ582088).
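The score-weighted relative frequencies described above can be illustrated with a minimal Python sketch. It assumes that each hit contributes its BLAST score to the taxon (or keyword stem) it is labeled with and that frequencies are the resulting shares of the summed score; the function name `weighted_frequencies` and the example hits are hypothetical and are not taken from the analysis reported here.

```python
from collections import defaultdict

def weighted_frequencies(hits):
    """Return {label: percent}, each label's share of the summed BLAST score.

    `hits` is an iterable of (label, score) pairs, e.g. (genus, bit score).
    """
    totals = defaultdict(float)
    for label, score in hits:
        totals[label] += score
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: 100.0 * s / grand_total for label, s in totals.items()}

# Hypothetical hits for illustration only; not values from the study.
example_hits = [
    ("Rhizobium", 2000.0),
    ("Rhizobium", 1000.0),
    ("Hoeflea", 1000.0),
]

freqs = weighted_frequencies(example_hits)
for genus, pct in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{genus}: {pct:.1f}%")
```

Weighting by score rather than counting hits means that a few strong matches can outweigh many weak ones, which is one plausible reason to prefer it for summarizing a hit list.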

Chemotaxonomy
Phosphatidylglycerol, phosphatidylethanolamine and phosphatidylmonomethylethanolamine were the predominant polar lipids of the membrane. The most frequent cellular fatty acids in strain DFL-43T are the mono-unsaturated straight chain acids C18:1 ω7 (62.8%) and its methylated form C18:1 ω7 11Me (21%), followed by C16:0 (6.3%) and C19:1 (3.4%) [1]. The absorption spectrum of an acetone/methanol extract showed the presence of bacteriochlorophyll a and an additional carotenoid (possibly spheroidenone) in small amounts [1]. Further experiments indicated that the pigment production depends on the concentration of sea salts in the medium [1].

Figure 1 (caption, opening truncated): ... [11]. Rooting was done initially using the midpoint method [12] and then checked for its agreement with the current classification (Table 1). The branches are scaled in terms of the expected number of substitutions per site. Numbers adjacent to the branches are support values from 1,000 ML bootstrap replicates [13] (left) and from 1,000 maximum-parsimony bootstrap replicates [14] (right) if larger than 60%. Lineages with type strain genome sequencing projects registered in GOLD [15] are labeled with one asterisk, those also listed as 'Complete and Published' (CP002279 for Mesorhizobium opportunistum) with two asterisks.

Table 1 (footnote): Evidence codes: TAS, Traceable Author Statement (i.e., a direct report exists in the literature); NAS, Non-traceable Author Statement (i.e., not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence). Evidence codes are from the Gene Ontology project [24].

Genome sequencing and annotation

Genome project history
The genome was sequenced within the MMI supported by the Gordon and Betty Moore Foundation. Initial sequencing was performed by the J. Craig Venter Institute (JCVI; Rockville, MD, USA), and a high-quality draft sequence was deposited at INSDC. The number of scaffolds and contigs was reduced and the assembly improved by a subsequent round of manual gap closure at HZI/DSMZ. A summary of the project information is shown in Table 2.

Growth conditions and DNA extractions
Cells of strain DFL-43T were grown for two to three days on an LB & sea-salt agar plate containing (per liter) 10 g tryptone, 5 g yeast extract, 10 g NaCl, 17 g sea salt (Sigma-Aldrich S9883) and 15 g agar. A single colony was used to inoculate LB & sea-salt liquid medium, and the culture was incubated at 28°C on a shaking platform. The genomic DNA was isolated using the Qiagen Genomic 500 DNA Kit (Qiagen 10262) as indicated by the manufacturer. DNA quality and quantity were in accordance with the instructions of the genome sequencing center. DNA is available through the DNA Bank Network [26].
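As a small worked example, the per-liter recipe given above can be scaled to other batch volumes. The sketch below is illustrative only: the helper name `scale_recipe` is hypothetical, and it assumes that the liquid medium has the same composition as the plates without the agar, which the text does not state explicitly.

```python
# Per-liter amounts (grams) of the LB & sea-salt agar as stated in the text.
PER_LITER_G = {
    "tryptone": 10.0,
    "yeast extract": 5.0,
    "NaCl": 10.0,
    "sea salt": 17.0,
    "agar": 15.0,
}

def scale_recipe(volume_ml, solid=True):
    """Return grams of each component for `volume_ml` of medium.

    With solid=False the agar is left out (assumption: the liquid medium
    is otherwise identical to the plate medium).
    """
    factor = volume_ml / 1000.0
    return {
        name: grams * factor
        for name, grams in PER_LITER_G.items()
        if solid or name != "agar"
    }

plates = scale_recipe(500)              # 500 ml of agar medium
broth = scale_recipe(250, solid=False)  # 250 ml of liquid medium

for name, grams in plates.items():
    print(f"{name}: {grams:.2f} g")
```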

Genome sequencing and assembly
The genome was sequenced with the Sanger technology using a combination of two libraries. All general aspects of library construction and sequencing performed at the JCVI can be found on the JCVI website. Base calling of the sequences was performed with the phredPhrap script using default settings. The reads were assembled and assemblies analyzed using the phred/phrap/consed pipeline [27]. The last gaps were closed by adding new reads produced by recombinant PCR and PCR primer walks. In total, 21 Sanger reads were required for gap closure and improvement of low-quality regions. The final consensus sequence was built from 46,086 Sanger reads (10.3× coverage).
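The relationship between read count and coverage can be made explicit with a short calculation. The sketch below assumes the usual definition, coverage = (number of reads × mean read length) / genome length, and uses the contig total of 4,467,792 bp as the genome length. The mean read length is not stated in the text; the value computed here (roughly 1 kb) is only what the reported read count and coverage would imply under this definition, and the function names are hypothetical.

```python
GENOME_BP = 4_467_792      # total contig length reported for the genome
N_READS = 46_086           # Sanger reads in the final consensus
REPORTED_COVERAGE = 10.3   # coverage reported in the text

def coverage(n_reads, mean_read_len_bp, genome_bp):
    """Average sequencing depth: total read bases divided by genome length."""
    return n_reads * mean_read_len_bp / genome_bp

def implied_mean_read_length(n_reads, cov, genome_bp):
    """Mean read length (bp) implied by a given coverage and read count."""
    return cov * genome_bp / n_reads

mean_len = implied_mean_read_length(N_READS, REPORTED_COVERAGE, GENOME_BP)
print(f"implied mean read length: {mean_len:.0f} bp")
print(f"coverage check: {coverage(N_READS, mean_len, GENOME_BP):.1f}x")
```

If coverage was instead computed from quality-trimmed or aligned bases only, the implied read length would differ, so this figure should be read as an order-of-magnitude check rather than a measured value.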

Genome annotation
Gene prediction was carried out using GeneMark as part of the genome annotation pipeline in the Integrated Microbial Genomes Expert Review (IMG-ER) system [28]. To identify coding genes, Prodigal [29] was used, while ribosomal RNA genes within the genome were identified using RNAmmer [30]. Other non-coding genes were predicted using Infernal [31]. Manual functional annotation was performed within the IMG platform [28] and the Artemis Genome Browser [32].

Genome properties
The draft genome consists of one circular scaffold with a total length of 4,467,822 bp containing five large contigs with a total length of 4,467,792 bp and a G+C content of 59.8%. Contig lengths vary from 133,683 bp to 2,215,172 bp (Figure 3); genome statistics are provided in Table 3. Of the 4,296 genes predicted, 4,227 were protein-coding genes and 69 were RNA genes; pseudogenes were not identified. The majority of the protein-coding genes (83.1%) were assigned a putative function, while the remaining ones were annotated as hypothetical proteins. The distribution of genes into COG functional categories is presented in Table 4.
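The figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in this section; the variable names are hypothetical, and the counts of genes with and without a putative function are approximations derived from the rounded percentage (83.1%), not values reported in the text.

```python
scaffold_bp = 4_467_822      # circular scaffold length
contig_bp = 4_467_792        # summed length of the five contigs
total_genes = 4_296
protein_coding = 4_227
rna_genes = 69
frac_with_function = 0.831   # rounded percentage from the text

# Scaffold positions not covered by contig sequence (gaps between contigs).
gap_bp = scaffold_bp - contig_bp

# Protein-coding and RNA genes should add up to the predicted total.
genes_consistent = (protein_coding + rna_genes == total_genes)

# Approximate split of protein-coding genes, limited by the rounding of 83.1%.
with_function = round(protein_coding * frac_with_function)
hypothetical = protein_coding - with_function

print(f"gap bases in scaffold: {gap_bp}")
print(f"gene counts consistent: {genes_consistent}")
print(f"approx. genes with putative function: {with_function}")
print(f"approx. hypothetical proteins: {hypothetical}")
```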