Genome sequence of the Antarctic rhodopsins-containing flavobacterium Gillisia limnaea type strain (R-8282T)

Gillisia limnaea Van Trappen et al. 2004 is the type species of the genus Gillisia, which is a member of the well-characterized family Flavobacteriaceae. The genome of G. limnaea R-8282T is the first sequenced genome (permanent draft) from a type strain of the genus Gillisia. Here we describe the features of this organism, together with the permanent-draft genome sequence and annotation. The 3,966,857 bp long chromosome (two scaffolds) with its 3,569 protein-coding and 51 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.


Introduction
Strain R-8282T (= DSM 15749 = LMG 21470 = CIP 108418) is the type strain of the species Gillisia limnaea [1], which in turn is the type species of the genus Gillisia, which currently encompasses six known species [1]. The strain was isolated from a microbial mat in Lake Fryxell, Antarctica [1] during the MICROMAT project, which systematically collected novel strains from Antarctic lakes [2]. The genus was named after the Belgian bacteriologist Monique Gillis for her work on bacterial taxonomy [1]. The species epithet was derived from the Neo-Latin adjective 'limnaeae', living in the water, referring to the microbial mats in Lake Fryxell where the organism was first isolated [1].
Standards in Genomic Sciences
The 16S rRNA gene sequence of strain R-8282T was compared using NCBI BLAST [3] (considering only the high-scoring segment pairs (HSPs) from the best 250 hits) with the most recent release of the Greengenes database [5], and the relative frequencies of taxa and keywords (reduced to their stem [6]) were determined, weighted by BLAST scores. The most frequently occurring genera were Flavobacterium (80.2%), Gillisia (17.8%), Chryseobacterium (1.0%) and Cytophaga (1.0%) (94 hits in total). Regarding the single hit to sequences from members of the species, the average identity within HSPs was 99.1%, whereas the average coverage by HSPs was 98.2%. Regarding the five hits to sequences from other members of the genus, the average identity within HSPs was 95.6%, whereas the average coverage by HSPs was 94.3%. Among all other species, the one yielding the highest score was Gillisia hiemivivida (AY694006), which corresponded to an identity of 97.1% and an HSP coverage of 90.8%. (Note that the Greengenes database uses the INSDC (= EMBL/NCBI/DDBJ) annotation, which is not an authoritative source for nomenclature or classification.) The highest-scoring environmental sequence was EU735617 (Greengenes short name: 'archaeal structures and pristine soils China oil contaminated soil Jidong Oilfield clone SC78'), which showed an identity of 99.0% and an HSP coverage of 98.4%.
The most frequently occurring keywords within the labels of all environmental samples which yielded hits were 'librari' (3.2%), 'dure' (3.0%), 'bioremedi, broader, chromat, groundwat, microarrai, polylact, sampl, stimul, subsurfac, typic, univers' (2.9%), 'spring' (2.5%) and 'soil' (2.4%) (156 hits in total). The most frequently occurring keywords within the labels of those environmental samples which yielded hits of a higher score than the highest scoring species were 'soil' (15.4%), 'archaeal, china, contamin, jidong, oil, oilfield, pristin, structur' (7.7%) and 'antarct, cover, lake' (7.7%) (2 hits in total). Whereas some of these keywords confirm the environment of G. limnaea, others are indicative of other habitats in which related taxa are found. Figure 1 shows the phylogenetic neighborhood of G. limnaea in a 16S rRNA based tree. The sequences of the two 16S rRNA gene copies in the genome differ from each other by up to eleven nucleotides, and differ by up to eight nucleotides from the previously published 16S rRNA sequence (AJ440991), which contains seven ambiguous base calls.
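The score-weighted frequencies reported above can be illustrated with a minimal sketch. The exact weighting scheme is not spelled out in the text, so the code assumes that each hit contributes its BLAST score to its label and that frequencies are expressed as a percentage of the summed scores. The function name `weighted_frequencies` and the example hits are hypothetical and are not taken from the actual analysis.

```python
from collections import defaultdict


def weighted_frequencies(hits):
    """Return {label: percent of total score} for (label, score) pairs.

    Assumption: every hit adds its BLAST score to the total of its label
    (a genus name or a stemmed keyword).
    """
    totals = defaultdict(float)
    for label, score in hits:
        totals[label] += score
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: 100.0 * s / grand_total for label, s in totals.items()}


# Hypothetical hits for illustration only (not data from this study).
example_hits = [
    ("Gillisia", 900.0),
    ("Flavobacterium", 600.0),
    ("Gillisia", 300.0),
    ("Cytophaga", 200.0),
]

freqs = weighted_frequencies(example_hits)
for label, percent in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {percent:.1f}%")
```

Under this assumption, a genus with few but high-scoring hits can outrank a genus with many weak hits, so the percentages are not simple hit counts.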
Cells of G. limnaea strain R-8282T are Gram-negative and rod-shaped (Figure 2) [1]. They were reported to be 0.7 µm in width and 3.0 µm in length [1], whereas scanning electron micrographs of strain R-8282T revealed a cell diameter that varies from 0.4 µm to 0.5 µm and a length that varies from 1.6 µm to more than 4.9 µm (Figure 2), which is more consistent with data previously reported for several Gillisia strains [32][33][34]. Motility, especially gliding motility, was not observed [1], despite the presence of numerous genes associated with gliding motility (see below) and the presence of pili-containing cells in scanning electron micrographs of strain R-8282T. It is unclear whether these pili are involved in gliding motility or in bacterial adhesion to surfaces. Cells are strictly aerobic, psychrophilic and chemoheterotrophic [1]. Growth occurs between 5°C and 30°C with an optimum at 20°C [1]; the strain is unable to grow at temperatures above 37°C [1]. Growth occurs within a salinity range of 0% to 5% NaCl, but not in 10% NaCl, indicating moderate halotolerance [1]. Peptone and yeast extract are required for growth [1]. When cultivated on marine agar, colonies are yellow, convex and translucent, with diameters of 1-3 mm and entire margins after 6 days of incubation [1]. When cultivated on Anacker & Ordal's agar, colonies become flat and round with entire margins and are 0.7 to 0.9 mm in diameter after 14 days of incubation [1]. Additionally, growth is detectable on both nutrient agar and R2A, but the strain does not grow on trypticase soy agar [1]. Further detailed physiological data, such as carbon source utilization, carbon degradation and enzyme activities, have been reported previously [1].

Genome sequencing and annotation

Genome project history
This organism was selected for sequencing on the basis of its phylogenetic position [35], and is part of the Genomic Encyclopedia of Bacteria and Archaea project [36]. The genome project is deposited in the Genomes On Line Database [13] and the complete genome sequence is deposited in GenBank. Sequencing, finishing and annotation were performed by the DOE Joint Genome Institute (JGI). A summary of the project information is shown in Table 2.

Figure 1.
The tree was inferred from 1,366 aligned characters [7,8] of the 16S rRNA gene sequence under the maximum likelihood (ML) criterion [9]. Rooting was done initially using the midpoint method [10] and then checked for its agreement with the current classification (Table 1). The branches are scaled in terms of the expected number of substitutions per site. Numbers adjacent to the branches are support values from 1,000 ML bootstrap replicates [11] (left) and from 1,000 maximum-parsimony bootstrap replicates [12] (right) if larger than 60%. Lineages with type strain genome sequencing projects registered in GOLD [13] are labeled with one asterisk, those also listed as 'Complete and Published' with two asterisks [14][15][16] (for Ornithobacterium rhinotracheale see CP003283).

[17] and NamesforLife [18].

Genome sequencing and assembly
The genome was sequenced using a combination of Illumina and 454 sequencing platforms. All general aspects of library construction and sequencing can be found at the JGI website [39]. Pyrosequencing reads were assembled using the Newbler assembler (Roche). The initial Newbler assembly consisting of 93 contigs in one scaffold was converted into a phrap [40]

Genome annotation
Genes were identified using Prodigal [44] as part of the Oak Ridge National Laboratory genome annotation pipeline, followed by a round of manual curation using the JGI GenePRIMP pipeline [45]. The predicted CDSs were translated and used to search the National Center for Biotechnology Information (NCBI) non-redundant database and the UniProt, TIGRFam, Pfam, PRIAM, KEGG, COG, and InterPro databases. These data sources were combined to assert a product description for each predicted protein. Additional gene prediction analysis and functional annotation were performed within the Integrated Microbial Genomes - Expert Review (IMG-ER) platform [46].

Genome properties
The genome consists of two scaffolds of 3,558,876 bp and 407,981 bp in length, respectively, with a G+C content of 37.6% (Table 3 and Figure 3). Of the 3,620 genes predicted, 3,569 were protein-coding genes and 51 were RNA genes; 135 pseudogenes were also identified. The majority of the protein-coding genes (66.0%) were assigned a putative function, while the remaining ones were annotated as hypothetical proteins. The distribution of genes into COG functional categories is presented in Table 4.
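The totals quoted in this section can be cross-checked with simple arithmetic. The short sketch below uses only numbers stated in the text. The variable names are illustrative, and the count of genes with a putative function is an approximation derived from the rounded 66.0% figure, not a value taken from Table 3.

```python
# Values as stated in the text.
scaffold_lengths_bp = [3_558_876, 407_981]
protein_coding_genes = 3_569
rna_genes = 51
fraction_with_function = 0.660  # rounded percentage from the text

# Derived values.
genome_size_bp = sum(scaffold_lengths_bp)
total_genes = protein_coding_genes + rna_genes

# Approximation only: back-calculated from a rounded percentage.
approx_genes_with_function = round(fraction_with_function * protein_coding_genes)

print(f"Genome size: {genome_size_bp:,} bp")
print(f"Total genes: {total_genes:,}")
print(f"Protein-coding genes with putative function (approx.): "
      f"{approx_genes_with_function:,}")
```

The two scaffolds sum to the 3,966,857 bp chromosome length given in the abstract, and the protein-coding and RNA genes sum to the 3,620 predicted genes.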

Insights into the genome sequence
Genome analysis of G. limnaea R-8282T revealed the presence of three rhodopsin genes: one related to proteorhodopsin (PR; GenBank accession no. EHQ04368, Gilli_0216), one related to xanthorhodopsin (XR; EHQ02967, Gilli_2340), and a third rhodopsin sequence (EHQ02971, Gilli_2344) that seems to be truncated. Another finding was a set of genes involved in β-carotene biosynthesis, together with a gene encoding a β-carotene 15,15'-monooxygenase (EHQ04367, Gilli_0215), an enzyme that oxidatively cleaves β-carotene into two molecules of retinal, which is necessary for rhodopsin function. PRs and XRs are photoactive transmembrane opsins that bind retinal and belong to the microbial rhodopsin superfamily [47]. When the protein is exposed to light, a conformational change involving its cofactor retinal causes proton translocation from the inside to the outside of the cell [48]. This proton-pump activity generates a proton motive force across the cell membrane, which can be used in heterologously PR-expressing E. coli cells for ATP synthesis [49] as well as to power general cellular functions such as transmembrane nutrient transport or flagellar rotation [50]. In contrast to PRs, XRs are light-driven proton pumps containing a dual chromophore, one retinal molecule and one carotenoid antenna [51,52], a system first discovered in Salinibacter ruber M31T [53,54]. Its carotenoid antenna, salinixanthin, transfers as much as 40-45% of the absorbed photons to retinal [55], resulting in a potentially much more efficient light-capturing system compared to PRs from Bacteria [56,57] or bacteriorhodopsins from Archaea [58].
NCBI BLAST analysis [3] revealed that the protein encoded by Gilli_0216 shares sequence identity with many PR protein sequences found in other species within the Flavobacteriaceae (Figure 4). It shows typical features necessary for proton pump activity: K224 (K231) for retinal binding, and D88 (D97) and E99 (E108), which act as proton acceptor and proton donor, respectively, in the retinylidene Schiff base proton transfer during the PR photocycle [60,61] (EBAC31A08 numbering in parentheses). Furthermore, the putative PR (Gilli_0216 protein) has M96 (L105) (EBAC31A08 numbering in parentheses), which mainly indicates that it is a green-light-absorbing proteorhodopsin [48,62].
The putative XR (Gilli_2340) of strain R-8282T shows sequence identity to XR-related proteins, but provides evidence of a new cluster of rhodopsins found in very few flavobacterial isolates, such as Dokdonia donghaensis PRO95 (EHQ04368) [63] and Krokinobacter sp. 4H-3-7-5 (AEE18495) [64], the latter of which was reclassified into the genus Dokdonia [65,66] (Figure 4). This rhodopsin sequence also reveals typical features necessary for rhodopsin function: K316 (K231) for retinal binding and L181 (L105), which mainly indicates a green-light-absorbing rhodopsin [48,62] (EBAC31A08 numbering in parentheses). However, the amino acid residues at the positions that function as proton acceptor and proton donor in proteorhodopsin differ from those commonly known. Instead of D97 and E108 (EBAC31A08 numbering), the corresponding amino acids N173 and Q184 are found in the protein sequence encoded by Gilli_2340, which indicates a possible new kind of rhodopsin.
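The residue comparison described above can be summarized in a small sketch. It encodes only the residues named in the text for Gilli_0216 and Gilli_2340 at four EBAC31A08 reference positions (97, 105, 108 and 231). The class and function names are hypothetical, and the rule that L or M at position 105 suggests a green-light-absorbing type is a simplification of the statements in the text, not a validated classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyResidues:
    """One-letter residues aligned to four EBAC31A08 positions."""
    pos97: str   # proton acceptor position (D in canonical PR)
    pos105: str  # spectral tuning position
    pos108: str  # proton donor position (E in canonical PR)
    pos231: str  # retinal-binding position (K)


def summarize(r: KeyResidues) -> dict:
    """Flag the features discussed in the text (simplified rules)."""
    return {
        "retinal_binding_lysine": r.pos231 == "K",
        "canonical_acceptor_donor": (r.pos97, r.pos108) == ("D", "E"),
        # Assumption: L or M at position 105 is read as 'green type',
        # following the two cases described in the text.
        "green_type_suggested": r.pos105 in {"L", "M"},
    }


# Residues as reported in the text (local numbering in comments).
GILLI_0216 = KeyResidues(pos97="D", pos105="M", pos108="E", pos231="K")  # D88, M96, E99, K224
GILLI_2340 = KeyResidues(pos97="N", pos105="L", pos108="Q", pos231="K")  # N173, L181, Q184, K316

for name, residues in (("Gilli_0216", GILLI_0216), ("Gilli_2340", GILLI_2340)):
    print(name, summarize(residues))
```

In this representation, both proteins carry the retinal-binding lysine and a green-type residue at position 105, but only Gilli_0216 has the canonical D/E acceptor and donor pair.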
Interestingly, no rhodopsin-encoding sequence could be detected in the genome sequence of Gillisia sp. strain CBA3202 [67], which was isolated from the littoral zone on Jeju Island, Republic of Korea [67]. Digital DNA-DNA hybridization (DDH) [68] between strains R-8282T and CBA3202 yielded an estimate between 9.7% and 13.9% (depending on the formula used), indicating that Gillisia sp. strain CBA3202 does not belong to the species G. limnaea.

Figure 4.
Rhodopsin tree for Gillisia and relatives. Amino acid sequences were processed in the same way as the 16S rRNA sequences used in Figure 1 except for the explicit determination of an optimal maximum-likelihood model, which turned out to be LG [59]. GenBank accession numbers are shown in parentheses.