Complete genome sequencing and analysis of Saprospira grandis str. Lewin, a predatory marine bacterium

Saprospira grandis is a coastal marine bacterium that can capture and prey upon other marine bacteria using a mechanism known as ‘ixotrophy’. Here, we present the complete genome sequence of Saprospira grandis str. Lewin, isolated from La Jolla beach in San Diego, California. The complete genome sequence comprises a chromosome of 4.35 Mbp and a plasmid of 54.9 kbp. Genome analysis revealed incomplete pathways for the biosynthesis of nine essential amino acids but the presence of a large number of peptidases. The genome encodes multiple copies of sensor globin-coupled rsbR genes thought to be essential for stress response, and the presence of such sensor globins in Bacteroidetes is unprecedented. A total of 429 spacer sequences were identified within the three CRISPR repeat regions of the genome, the largest number among all Bacteroidetes sequenced to date.


Introduction
Saprospira grandis is an obligately aerobic, Gram-negative marine bacterium belonging to the family Saprospiraceae and is commonly found in marine littoral sand and coastal zones around the world [1,2]. Saprospira was first isolated and described by Gross in 1911 [3], and both marine and freshwater species have since been isolated and studied [1,2,4-8]. S. grandis is an unusual bacterium because it can prey upon other bacteria to obtain nutrients, using a mechanism known as 'ixotrophy' [1]. Members of Saprospiraceae are also known to actively hydrolyze proteins in activated-sludge waste treatment plants [9], which highlights their role as decomposers in various habitats. Bacteria of the family Saprospiraceae have been shown to actively prey upon harmful diatoms [10] and cyanobacteria such as Microcystis aeruginosa [11]. Saprospiraceae are also found in an epiphytic bacterial biofilm community that colonizes algal surfaces [12]. This association of Saprospiraceae with marine phytoplankton and algae is of considerable interest because the bacteria may play an active role in controlling harmful algal blooms in oceans. Lysis of cyanobacterial cells by Saprospira species has also been reported in another study, in which the experiments indicated that lysis took place through direct cell-to-cell contact and not through bactericidal substances [13]. Another curious feature of S. grandis is the presence of phage-like structures known as "rhapidosomes" [14-18]. Although rhapidosomes superficially resemble phage particles, bactericidal activities have not been recorded in growth assays, and the rhapidosomes appear to be normal components of the cells [15,16].
Although bacteria of the genus Saprospira have been studied quite extensively, genome information has so far been lacking. It is therefore of interest to obtain the complete genome sequence of S. grandis in order to examine its metabolic potential, its predatory lifestyle, and the genes that encode proteins involved in rhapidosome formation. Here, we report the complete genome sequencing and annotation of S. grandis str. Lewin, the first member of the family Saprospiraceae to have its complete genome sequenced. We also performed proteomic experiments to identify the proteins that form rhapidosomes in S. grandis str. Lewin.

Classification and features
There are three identical copies of the 16S rRNA gene in the Saprospira grandis str. Lewin genome, and one copy was chosen to search against the nucleotide database using NCBI BLAST [19]. It has the highest sequence identity to Saprospira grandis SS98-5 (99.7%, AB088636), isolated from Kagoshima Bay, Japan in 1998 [10], followed by 99.4% identity to Saprospira grandis DSM 2844 and 98.0% identity to the type strain Gross [20]. S. grandis DSM 2844 is the only other strain with a draft genome sequence currently available from the Joint Genome Institute (JGI). Figure 1 shows the phylogenetic neighborhood of S. grandis str. Lewin in relation to type and non-type strains within the family Saprospiraceae. Chitinophaga pinensis was used as an outgroup to root the tree.
Figure 1. Phylogenetic tree highlighting the position of Saprospira grandis strain Lewin relative to other type and non-type strains within the Saprospiraceae. The tree was inferred from 1,350 aligned characters of the 16S rRNA gene sequence using the maximum likelihood method. The branch lengths indicate the expected number of substitutions per site, and the numbers adjacent to the branches are support values from 1,000 bootstrap replicates. Bootstrap values are indicated only if they are larger than 60%. The best tree topology was inferred with the phylogenetic analysis tool RAxML using the GTR (General Time Reversible) model of substitution with the gamma model of rate heterogeneity [21]. The Chitinophaga pinensis 16S rRNA gene was used to root the tree.
Saprospira grandis has helical, filamentous cells about 1 μm wide and 5-500 μm long [1]. Individual cells within filaments are about 1-5 μm long [1]. Cells grow well at 30ºC and can survive at 40ºC for several hours [2]. S. grandis moves by gliding motility at a speed of 2-5 μm/s [1]. S. grandis is known to be auxotrophic for the following amino acids: arginine, histidine, isoleucine, leucine, methionine, phenylalanine, threonine, tryptophan, and valine [2], and prefers nutrients rich in peptides and amino acids [2,8]. Saprospira grandis str. Lewin was originally isolated from La Jolla beach in San Diego, California (Table 1) by the late marine microbiologist Ralph A. Lewin and was a gift to S-I Aizawa [19]. The strain has not yet been deposited in a culture collection but is available from the Aizawa lab upon request. We plan to deposit the strain in a culture collection as soon as possible. Table 2 presents the project information and associated MIGS version 2.0 identifiers [27].
Table 1 (fragment). Altitude: sea level (evidence code NAS). a) Evidence codes: IDA, Inferred from Direct Assay; TAS, Traceable Author Statement (i.e., a direct report exists in the literature); NAS, Non-traceable Author Statement (i.e., not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence). These evidence codes are from the Gene Ontology project [26].

Growth conditions and DNA isolation
S. grandis str. Lewin was cultured at 30ºC in seawater medium (3% CrystalSea Marine Mix (Marine Enterprises International, Inc.) with 0.5% tryptone). Cells were grown with gentle shaking for 1 day for DNA isolation and for 2-3 days for isolation of rhapidosomes. Cells were harvested by low-speed centrifugation and suspended in TE buffer (50 mM Tris-HCl pH 8.0, 0.15 M EDTA).
Lysozyme, proteinase K, and SDS were gradually added to the suspension, which was then incubated at 37ºC for 30 min. RNase A was added and the sample was incubated at 65ºC for 30 min. To purify the genomic DNA, phenol-chloroform-isoamyl alcohol (PCI) solution was added to the cell lysate, and the genomic DNA was collected by ethanol precipitation.

Genome sequencing and assembly
The genome of S. grandis str. Lewin was sequenced using two different sequencing technologies: capillary-based Sanger sequencing and 454 pyrosequencing. For Sanger sequencing, 3-kb and 8-kb shotgun libraries were constructed and the inserts were sequenced from both ends using ABI 3730xl sequencers. A total of 28,669 3-kb and 8,727 8-kb paired-end reads were generated. A further 378,705 pyrosequencing reads were generated on the Roche GS FLX system. Sequences from both methods were assembled using Newbler, and finishing primers were designed from the assembled contig scaffolds. Several rounds of PCR amplification and sequencing with these custom-designed primers closed most of the remaining gaps. The final gaps were closed manually using the Minimus assembler from the AMOS package [28] and the SeqMan II program from DNASTAR (DNASTAR Inc., Madison, WI). Together, the reads provided roughly 30× coverage of the genome.
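As a rough illustration of how a fold-coverage figure relates to read counts, the sketch below divides the total sequenced bases by the combined size of the chromosome and plasmid. The read counts and replicon sizes are taken from the text; the mean read lengths are not reported here, so the two `ASSUMED_*` constants are placeholders chosen only for illustration, and the result should not be read as a reconstruction of the reported figure.

```python
# Fold coverage = total sequenced bases / genome size.
# Read counts and replicon sizes come from the text.
sanger_reads = 28_669 + 8_727          # 3-kb and 8-kb library reads
pyro_reads = 378_705                   # Roche GS FLX reads
genome_size_bp = 4_345_237 + 54_948    # chromosome + plasmid

# ASSUMPTION: mean read lengths are not given in the text.
# These values are hypothetical placeholders, not reported data.
ASSUMED_SANGER_READ_BP = 800
ASSUMED_PYRO_READ_BP = 250


def fold_coverage(libraries, genome_bp):
    """Return total bases across (read_count, mean_read_bp) pairs
    divided by the genome size in bp."""
    total_bases = sum(count * mean_bp for count, mean_bp in libraries)
    return total_bases / genome_bp


coverage = fold_coverage(
    [(sanger_reads, ASSUMED_SANGER_READ_BP),
     (pyro_reads, ASSUMED_PYRO_READ_BP)],
    genome_size_bp,
)
print(f"Estimated coverage under the assumed read lengths: {coverage:.1f}x")
```

With these placeholder lengths the estimate falls in the same general range as the reported value, but that agreement depends entirely on the assumed read lengths.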

Genome annotation
Annotation of S. grandis str. Lewin was done using the NCBI PGAAP annotation pipeline [29] and manually checked to improve the assignment of protein functions. The pipeline uses GeneMark to predict open reading frames (ORFs) and searches them against a manually curated collection of prokaryotic proteins known as Protein Clusters [30]. Frameshifts and partial gene fragments that indicate potential pseudogenes were identified by the NCBI Submission Check tool and manually verified. Protein-coding genes were searched against the NCBI RefSeq database using BLASTp [19]. RPS-BLAST searches against the COG database enabled assignment of COG functional categories to the ORFs. In addition, InterPro searches were performed using the "iprscan.pl" tool [31,32] to identify conserved domains and protein signatures in each ORF. Transfer RNA and ribosomal RNA-coding regions were searched using tRNAscan-SE [33] and the Infernal programs [34]. Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) regions were searched using the CRISPR Finder program [35], and predicted protein-coding sequences found within these regions were manually removed. Potential genomic islands were identified using the IslandViewer web server [36].
To reconstruct metabolic pathways, the annotated genome in GenBank format was first imported into the Pathway Tools program [37] and pathways were automatically reconstructed. Next, the automatically built pathways in BioPAX format were imported into Pathway Studio® software from Ariadne Genomics (Rockville, MD, USA) for manual curation. Orthologs of S. grandis str. Lewin proteins in the following 18 bacterial genomes were identified via reciprocal best BLAST hit (RBH) as reported previously [38]: Clostridium acetobutylicum, Escherichia coli K12, Escherichia coli CFT073, Escherichia coli O157:H7 str. EDL933, Bacillus subtilis, Helicobacter pylori, Staphylococcus aureus subsp. aureus N315, Pasteurella multocida subsp. multocida str. Pm70, Salmonella typhimurium LT2, Agrobacterium tumefaciens str. C58, Burkholderia xenovorans LB400, Streptococcus pneumoniae TIGR4, Bordetella pertussis, Listeria monocytogenes EGD-e, Actinobacillus pleuropneumoniae L20, Flavobacterium johnsoniae UW101, Streptococcus suis 05ZYH33, and Pseudomonas aeruginosa PAO1. Custom-built bacterial genome databases from Pathway Studio and MetaCyc were used as references to manually reconstruct the metabolic pathways in S. grandis str. Lewin. All metabolic pathways were inspected manually to remove functional classes with no members, which indicate the absence of the corresponding enzymatic step(s) in the pathway. Pathways that did not have any gaps after manual curation were considered fully reconstructed.
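The reciprocal best BLAST hit criterion used for ortholog identification can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it assumes BLASTp results have already been parsed into (query, subject, bit score) tuples, and all protein identifiers and scores below are invented.

```python
# Toy reciprocal-best-hit (RBH) ortholog detection.
# All protein IDs and bit scores are hypothetical examples.

def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bit_score) tuples,
    e.g. parsed from tabular BLASTp output.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit in genome B is b
    and b's best hit in genome A is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Genome A proteins searched against genome B, and the reverse search.
a_vs_b = [("A1", "B1", 520.0), ("A1", "B2", 110.0),
          ("A2", "B2", 300.0), ("A3", "B2", 340.0)]
b_vs_a = [("B1", "A1", 515.0), ("B2", "A3", 335.0), ("B2", "A2", 298.0)]

rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh_pairs)
```

In this toy input, A2 and A3 both hit B2 best, but only A3 is also B2's best hit, so A2 is not assigned an ortholog. Real analyses usually add E-value or alignment-coverage thresholds before taking best hits.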

Genome properties
The genome contains a single circular chromosome of 4,345,237 bases and a circular plasmid of 54,948 bases. The circular genomic maps of the S. grandis str. Lewin chromosome and plasmid are shown in Figure 2A and Figure 2B, respectively, and the general genome features are listed in Table 3. The G+C content of the genome is 46.36%. A total of 4,251 ORFs with an average length of 886 bp were predicted. Protein-coding genes with known functions account for 50.4% of the genes identified, and 34.8% of the gene products have no known function associated with them, i.e., they are annotated as hypothetical proteins. Conserved hypothetical proteins account for 14.7% of the coding sequences. The distribution of genes into COG functional categories is listed in Table 4. There are 3 ribosomal RNA operons and 48 tRNA genes. The IslandViewer web server predicted 18 putative genomic islands within the genome (Figure 2A).
Clustered regularly interspaced short palindromic repeats (CRISPRs) and their associated protein modules form a type of immune system, present in many bacteria and archaea, that protects them from invading viruses and plasmids [39]. Using the CRISPR Finder tool, we identified three confirmed CRISPR repeat regions in the genome, with sizes of 11,778 bp, 10,545 bp, and 8,255 bp (Figure 2A). The three CRISPR regions have the following direct repeat consensus sequences: CRISPR region 1 (GTTTCAATGCTGCTTCGCCTGCAAAGGGTTTAGTAT), CRISPR region 2 (ATACTAAACCCATTGCAGGCAAAGCAGCATTGAAAC), and CRISPR region 3 (GTTTCAATGCTGCTTCGCCTGCAAAGGGTTTAGTAT). CRISPR regions 1, 2, and 3 contain 165, 148, and 116 spacers, respectively, for a total of 429 spacers. Spacer sequences range from 32 to 76 bp in length. S. grandis str. Lewin has the largest number of CRISPR spacers among all the Bacteroidetes genomes with identified CRISPR regions and the second largest number of spacers among all bacteria with CRISPR regions.
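The spacer total and the relationship between the three direct repeat consensus sequences can be checked with a few lines of code. The sketch below uses only the values quoted above, with the line-break hyphens removed from the repeat sequences; the reverse-complement comparison is an added illustration and not an analysis described in the text.

```python
# Spacer counts and repeat consensus sequences as quoted in the text.
spacers = {"CRISPR region 1": 165, "CRISPR region 2": 148, "CRISPR region 3": 116}
total_spacers = sum(spacers.values())

repeat_1 = "GTTTCAATGCTGCTTCGCCTGCAAAGGGTTTAGTAT"
repeat_2 = "ATACTAAACCCATTGCAGGCAAAGCAGCATTGAAAC"
repeat_3 = "GTTTCAATGCTGCTTCGCCTGCAAAGGGTTTAGTAT"

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


print("Total spacers:", total_spacers)
print("Repeats 1 and 3 identical:", repeat_1 == repeat_3)
print("Mismatches, repeat 2 vs. reverse complement of repeat 1:",
      mismatches(reverse_complement(repeat_1), repeat_2))
```

The comparison shows that the repeats of regions 1 and 3 are identical, and that the region 2 repeat is close to their reverse complement, which would be consistent with region 2 lying on the opposite strand; the text itself does not state the orientation of the regions.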

Isolation and purification of rhapidosomes for proteomic analysis
S. grandis str. Lewin cells were cultivated at 30ºC in seawater medium with gentle shaking for 3 days, harvested by low-speed centrifugation, and suspended in sucrose solution (0.5 M sucrose, 0.15 M Tris base) with gentle stirring. Lysozyme (final conc. 0.1 mg/ml) and EDTA (final conc. 0.2 mM) were gradually added to the suspension, and the mixture was incubated on ice with gentle stirring. After 60 min of incubation, the cells were lysed with Triton X-100 (final conc. 1%), and the cell debris and non-lysed cells were removed by low-speed centrifugation. To recover rhapidosomes, the supernatant was centrifuged again and the pellet was resuspended in TET (10 mM Tris-HCl pH 8, 1 mM EDTA, and 0.1% Triton X-100). The samples were analyzed by sodium dodecyl sulfate-polyacrylamide gel electrophoresis (SDS-PAGE) and 2D gel electrophoresis, and each band was analyzed by LC/MS Q-TOF and MALDI-TOF/TOF. The peptide fragments identified were searched against all proteins in the S. grandis str. Lewin genome by BLASTp and also against the genome by tBLASTn.

Insights from the genome
Metabolic pathway reconstruction from the S. grandis str. Lewin genome revealed incomplete pathways for the biosynthesis of nine essential amino acids, which strongly indicates a need for external sources of amino acids. The large number of peptidases encoded in the genome may facilitate acquisition of amino acids from the surrounding environment. The genome also contains ten copies of putative globin-coupled sensors. All ten have an N-terminal sensor globin domain and a C-terminal STAS domain. Sensor globin-like domains were not identified in any of the other Bacteroidetes genomes in our analysis, so the presence of this domain and of multiple copies of the rsbR gene in the genome is intriguing. Of the ten putative sensor globins, three were experimentally confirmed to bind oxygen, i.e., they showed the characteristic spectra of globin proteins (data not shown). The top BLASTp hits for all of these rsbR genes are from Vibrio species. We conclude that an rsbR gene was likely acquired from a Vibrio species in a marine habitat and was later duplicated in the genome. While the exact role of the sensor globin domain in S. grandis is unknown, these RsbR paralogs may be needed for oxygen sensing or for the response to oxidative stress.
The biological functions of rhapidosomes remain a mystery despite previous attempts to understand their roles [16-18]. Using genomics and proteomics, we identified three proteins that may be involved in the formation of rhapidosome structures: SGRA_0791, SGRA_1316, and SGRA_1317. SGRA_0791 matches the Pfam domain "Band_7", which is classified as a stomatin-like integral membrane domain found in all domains of life and also in viruses [40]. SGRA_1316 has a "CHP2241_phage" domain that is usually found in phage tail proteins. SGRA_1317 contains a "Phage_sheath_1" domain. All three can be considered phage-like proteins, but they do not seem to be part of a functional phage; they appear to be remnants of horizontally acquired phage genes adapted for as yet unknown functions in S. grandis.
To better understand the ecophysiology and phylogeny of S. grandis, we profiled the complete genomes of 46 Bacteroidetes (including S. grandis str. Lewin) and 1 Chlorobi based on the 14,228 orthologous groups identified among them. ORFs from these genomes were searched against each other using the reciprocal best BLAST hit (RBH) method. Orthologous genes shared between the organisms were identified by the Markov clustering method using OrthoMCL [41,42]. A 14,228 × 47 matrix recording the presence or absence of each ortholog in each genome was then imported into R [43], and the "gplots" package was used to calculate the Pearson correlation and to display the correlation matrix as a heatmap (Figure 3).
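The correlation step can be illustrated on a toy presence/absence matrix. The study used R and the "gplots" package on a 14,228 × 47 matrix; the sketch below is a pure-Python stand-in with three hypothetical genomes and six hypothetical ortholog groups, shown only to make the calculation concrete.

```python
from math import sqrt

# Hypothetical presence (1) / absence (0) profiles over six ortholog groups.
profiles = {
    "genome_A": [1, 1, 0, 1, 0, 1],
    "genome_B": [1, 1, 0, 1, 1, 0],
    "genome_C": [0, 0, 1, 0, 1, 1],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


names = sorted(profiles)
# Genome-by-genome correlation matrix, as nested dictionaries.
corr = {a: {b: pearson(profiles[a], profiles[b]) for b in names} for a in names}

for a in names:
    print(a, " ".join(f"{corr[a][b]:6.2f}" for b in names))
```

Genomes that share more ortholog groups receive higher correlation values, and it is this genome-by-genome matrix that a heatmap with hierarchical clustering then displays.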
Using the orthologous clustering approach, we were able to group different Bacteroidetes with similar physiologies and concluded that S. grandis is closely related to Cytophaga hutchinsonii and Marivirga tractuosa in terms of niche specialization and adaptation (Figure 3). M. tractuosa DSM 4126 is also a member of Cytophagales, was isolated from beach sand in Vietnam [44], and is very similar to S. grandis str. Lewin in terms of the niche it occupies. Both also have chitinases that help them utilize chitin from marine eukaryotes. This orthologous gene clustering method is a powerful way to classify bacteria based on physiological adaptation and could be useful for characterizing newly identified bacteria (especially uncultivated ones) whose physiology is unknown.