Non-contiguous finished genome sequence and description of Collinsella massiliensis sp. nov.

Collinsella massiliensis strain GD3T is the type strain of Collinsella massiliensis sp. nov., a new species within the genus Collinsella. This strain, whose genome is described here, was isolated from the fecal flora of a 53-year-old French Caucasoid woman who had been admitted to an intensive care unit for Guillain-Barré syndrome. Collinsella massiliensis is a Gram-positive, obligately anaerobic, non-motile and non-sporulating bacillus. Here, we describe the features of this organism, together with the complete genome sequence and annotation. The genome is 2,319,586 bp long (1 chromosome, no plasmid), exhibits a G+C content of 65.8% and contains 2,003 protein-coding and 54 RNA genes, including 1 rRNA operon.


Introduction
Collinsella massiliensis strain GD3T (= CSUR P902 = DSM 26110) is the type strain of C. massiliensis sp. nov. This bacterial strain was isolated from the fecal flora of a 53-year-old French Caucasoid female admitted to the intensive care unit (ICU) of the Timone Hospital in Marseille, France, for Guillain-Barré syndrome. This study was part of a "culturomics" effort to cultivate all bacteria within human feces [1]. C. massiliensis is a Gram-positive, obligately anaerobic, non-endospore-forming, non-motile, rod-shaped bacterium. Thanks to the development of high-throughput sequencers and the rapidly declining cost of genome sequencing, the number of sequenced bacterial genomes had reached almost 12,000 as of January 2nd, 2014, with an additional 18,000 sequencing projects ongoing [2]. In an effort to include genomic information among the genotypic criteria used for the taxonomic description of bacterial isolates, rather than relying only on a combination of 16S rRNA gene phylogeny and nucleotide sequence similarity, G+C content and DNA-DNA hybridization [3][4][5][6], we proposed a new strategy named taxono-genomics, which we have used to describe several new bacterial taxa.
In 1999, Kageyama et al. reclassified Eubacterium aerofaciens into a new genus named Collinsella [39], based on 16S rRNA gene sequence divergence and the presence of a unique peptidoglycan type when compared to other members of the genus Eubacterium. In addition to the type species, C. aerofaciens [39], the genus Collinsella currently includes C. intestinalis [40], C. stercoris [40] and C. tanakaei [41]. All four species have been isolated from the human gastrointestinal tract. In the present manuscript, we apply the taxono-genomics strategy to the description of Collinsella massiliensis sp. nov., and describe the complete genome sequencing and annotation of Collinsella massiliensis strain GD3T (= CSUR P902 = DSM 26110). These characteristics support the circumscription of the C. massiliensis species.

Classification and Features
A stool sample was collected from a 53-year-old female admitted to the intensive care unit of the Timone Hospital in Marseille, France, for Guillain-Barré syndrome. The patient gave written informed consent for the study, which was approved by the Ethics Committee of the Institut Fédératif de Recherche 48, Faculty of Medicine, Marseille, France, under agreement number 09-022. She was receiving antibiotics at the time of stool sample collection, and the fecal specimen was preserved at -80°C immediately after collection. Strain GD3T (Table 1) was first isolated in January 2012 after incubation for two weeks in an anaerobic blood culture bottle that also contained clarified and sterile sheep rumen. The strain was then subcultured anaerobically at 37°C on 5% sheep blood-enriched Columbia agar (BioMerieux, Marcy l'Etoile, France). Several other new bacterial species were isolated from this stool specimen using various culture conditions. When compared to sequences available in GenBank, the 16S rRNA sequence of C. massiliensis strain GD3T (GenBank accession number JX424766) exhibited the highest sequence identity, 95.7%, with Collinsella tanakaei (Figure 1). This value was lower than the threshold (98.7%) recommended by Stackebrandt and Ebers to delineate a new species without carrying out DNA-DNA hybridization [4], and was in the range of 16S rRNA identity values observed among the four Collinsella species with validly published names (from 92.2% between C. intestinalis and C. aerofaciens to 97.7% between C. intestinalis and C. stercoris) [49].
Table 1 evidence codes: … not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence. These evidence codes are from the Gene Ontology project [48]. If the evidence is IDA, then the property was directly observed for a live isolate by one of the authors or an expert mentioned in the acknowledgements.
Standards in Genomic Sciences
Growth of the strain was tested on 5% sheep blood-enriched Columbia agar (BioMerieux) under anaerobic and microaerophilic conditions (GENbag anaer and GENbag microaer systems, respectively, BioMerieux), and under aerobic conditions, with or without 5% CO2. Growth was achieved only anaerobically. In addition, among the four incubation temperatures tested (25, 30, 37 and 45°C), no growth was observed at 25°C or 30°C, but strain GD3T grew at 37 and 45°C. The best growth was obtained at 37°C after 48 hours of incubation. Colonies were grey, translucent and 0.4 mm in diameter on blood-enriched Columbia agar. Gram staining showed Gram-positive rods unable to form spores (Figure 2). A motility test was negative. In electron microscopy, cells grown on agar had a mean diameter of 0.57 µm and a mean length of 1.19 µm (Figure 3), and were mostly grouped in short chains or small clumps. Strain GD3T showed neither catalase nor oxidase activity. Using an API ZYM strip (BioMerieux), positive reactions were observed for acid phosphatase, naphthol-AS-BI-phosphohydrolase, galactosidase, alkaline phosphatase, leucine arylamidase and α-glucosidase. Negative reactions were observed for cystine arylamidase, β-glucuronidase, nitrate reduction, urease, esterase (C4), esterase lipase (C8), lipase (C14), trypsin, α-chymotrypsin, N-acetyl-β-glucosaminidase, α-mannosidase and α-fucosidase. Using an API Rapid ID 32A strip (BioMerieux), positive reactions were observed for α-galactosidase, α-glucosidase, α-fucosidase, leucine arylamidase, proline arylamidase, arginine dihydrolase, serine arylamidase and glycine arylamidase. Negative reactions were observed for histidine arylamidase, urease, phenylalanine arylamidase, tyrosine arylamidase, leucyl-glycyl arylamidase, alanine arylamidase and arginine arylamidase.
Table residue (habitat row and footnote): habitat, human gut for the four compared species; na: data not available; +/-: depending on tests used.
Matrix-assisted laser-desorption/ionization time-of-flight (MALDI-TOF) MS protein analysis was performed as previously described [50] using a Microflex spectrometer (Bruker Daltonics, Leipzig, Germany). The spectra from 12 distinct colonies from a culture agar plate were imported into the MALDI BioTyper software (version 2.0, Bruker) and analyzed by standard pattern matching (with default parameter settings) against the main spectra of 4,706 bacteria, including 2 spectra from Collinsella aerofaciens, which were part of the reference data contained in the BioTyper database. The resulting score enabled the presumptive identification and discrimination of the tested isolate from those in the database according to the following rule: a score > 2 with a validated species enabled identification at the species level; a score > 1.7 but < 2 enabled identification at the genus level; and a score < 1.7 did not enable any identification. No significant score was obtained for strain GD3T, suggesting that the isolate was not a member of any known species. The reference mass spectrum of Collinsella massiliensis strain GD3T and the gel view comparing this spectrum with those of other phylogenetically close species are presented in Figures 4 and 5, respectively.
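The score interpretation rule described above can be sketched as a small function. This is only an illustration: the function name is hypothetical, the requirement that the best match be a validated species is not modelled, and the handling of scores exactly equal to 1.7 or 2 (which the rule as stated leaves unassigned) is an assumption.

```python
def interpret_biotyper_score(score: float) -> str:
    """Map a pattern-matching score to the identification level stated in the text.

    Hypothetical helper, not part of the BioTyper software.
    """
    if score > 2.0:
        return "species"
    if score > 1.7:
        # Covers 1.7 < score <= 2. Assigning a score of exactly 2 to the
        # genus level is an assumption; the text only covers "< 2".
        return "genus"
    # Covers score <= 1.7. Treating exactly 1.7 as "no identification"
    # is likewise an assumption; the text only covers "< 1.7".
    return "none"


for s in (2.3, 1.85, 1.2):
    print(s, interpret_biotyper_score(s))
```

Under this rule, an isolate whose best score falls below 1.7 against every reference spectrum, as reported for strain GD3T, is not assigned to any species or genus in the database.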

Genome project history
The organism was selected for sequencing on the basis of its phylogenetic position and 16S rRNA similarity to members of the genus Collinsella, and is part of a study of the human digestive flora aiming at isolating all bacterial species within human feces [1]. It was the fifth genome of a Collinsella species and the first genome of C. massiliensis sp. nov. The genome sequence, deposited under GenBank accession number CAPI00000000, consists of 15 scaffolds and 118 large contigs. Table 3 shows the project information and its association with MIGS version 2.0 compliance [42].

Genome sequencing and assembly
Five µg of DNA was mechanically fragmented on a Covaris device (KBioScience-LGC Genomics, Teddington, UK) using miniTUBE-red. The DNA fragmentation was visualized through an Agilent 2100 BioAnalyzer on a DNA labchip 7500, with an optimal size of 1.9 kb. A 5 kb paired-end library was constructed according to the 454 GS FLX Titanium paired-end protocol (Roche). Circularization and nebulization were performed and generated a pattern with an optimum at 567 bp. After PCR amplification through 17 cycles followed by double size selection, the single-stranded paired-end library was quantified with the Quant-it Ribogreen kit (Invitrogen) on the Genios Tecan fluorometer at 505 pg/µL. The library concentration equivalence was calculated as 8.17E+09 molecules/µL. The library was stored at -20°C until further use. The paired-end library was clonally amplified with 0.5 cpb and 1 cpb in 4 SV-emPCR reactions with the GS Titanium SV emPCR Kit (Lib-L) v2 (Roche). The yields of the emPCR reactions were 9.35 and 14.76%, respectively, within the range of 5 to 20% expected from the Roche procedure. The library was loaded on a GS Titanium PicoTiterPlate PTP Kit 70x75 and sequenced with the GS Titanium Sequencing Kit XLR70 (Roche). The run was performed overnight and then analyzed on the cluster through the gsRunBrowser and Newbler assembler (Roche). A total of 672,867 passed-filter wells were obtained and generated 214.2 Mb with an average length of 301 bp. These sequences were assembled using Newbler (Roche) with 90% identity and 40 bp as overlap. The final assembly identified 15 scaffolds and 118 large contigs (>1,500 bp), generating a genome size of 2.32 Mb, which corresponds to a coverage of 92× genome equivalents.
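As a quick arithmetic check of the coverage figure, the depth can be estimated as the total sequenced bases divided by the assembled genome size. The sketch below uses only the two values reported above; the function name is illustrative.

```python
def sequencing_depth(total_bases_mb: float, genome_size_mb: float) -> float:
    """Return the mean sequencing depth as sequenced bases per genome base."""
    return total_bases_mb / genome_size_mb


# Values taken from the text: 214.2 Mb of sequence, 2.32 Mb assembled genome.
depth = sequencing_depth(214.2, 2.32)
print(f"Estimated coverage: {depth:.1f}x")
```

The result rounds to the 92× genome equivalents stated in the text.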

Genome annotation
Open Reading Frames (ORFs) were predicted using Prodigal [51] with default parameters. However, predicted ORFs that spanned a sequencing gap region were excluded. The predicted bacterial protein sequences were searched against the GenBank [52] and Clusters of Orthologous Groups (COG) databases using BLASTP. The tRNAScan-SE [53] and RNAmmer [54] tools were used to predict tRNAs and rRNAs, respectively. Signal peptides and numbers of transmembrane helices were predicted using SignalP [55] and TMHMM [56], respectively. Mobile genetic elements were predicted using PHAST [57] and RAST [58]. ORFans were identified if their BLASTP E-value was lower than 1e-03 for alignment lengths greater than 80 amino acids. If alignment lengths were smaller than 80 amino acids, we used an E-value of 1e-05. Such parameter thresholds have already been used in previous works to define ORFans. Artemis [59] and DNA Plotter [60] were used for data management and visualization of genomic features, respectively. The Mauve alignment tool (version 2.3.1) was used for multiple genomic sequence alignment [61].
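The length-dependent E-value thresholds mentioned for ORFan detection can be written down as a short sketch. The function names are hypothetical, and two points are assumptions: an alignment of exactly 80 amino acids is given the stricter 1e-05 cutoff (the text does not cover that case), and the sketch stops at the threshold test itself because the text does not spell out how the test result is turned into the final ORFan call.

```python
LENGTH_THRESHOLD_AA = 80  # alignment length boundary given in the text


def evalue_cutoff(alignment_length_aa: int) -> float:
    """Return the BLASTP E-value cutoff that applies to an alignment length."""
    if alignment_length_aa > LENGTH_THRESHOLD_AA:
        return 1e-03
    # Shorter alignments use the stricter cutoff. Applying it to a length of
    # exactly 80 amino acids is an assumption of this sketch.
    return 1e-05


def hit_passes_cutoff(evalue: float, alignment_length_aa: int) -> bool:
    """Check whether a BLASTP hit has an E-value below the applicable cutoff."""
    return evalue < evalue_cutoff(alignment_length_aa)


# The same E-value passes for a long alignment but not for a short one.
print(hit_passes_cutoff(1e-04, 120))
print(hit_passes_cutoff(1e-04, 50))
```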
To estimate the mean level of nucleotide sequence similarity at the genome level between C. massiliensis and the other 4 members of the genus Collinsella (Table 6), we used the Average Genomic Identity Of gene Sequences (AGIOS) home-made software [7]. Briefly, this software uses Proteinortho [62] to detect orthologous proteins between pairs of genomes, then retrieves the corresponding genes and determines the mean percentage of nucleotide sequence identity among orthologous ORFs using the Needleman-Wunsch global alignment algorithm. C. massiliensis strain GD3T was compared to C. intestinalis strain DSM 13280 (GenBank accession number ABHX00000000), C. aerofaciens strain ATCC 25986 (AAVN00000000), C. stercoris strain DSM 13279 (ABXJ00000000), C. tanakaei strain YIT 12063 (ADLS00000000), Eggerthella lenta strain DSM 2243 (CP001726) and Coriobacterium glomerans strain PW2 (CP002628).
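To illustrate the final step of this procedure, the sketch below aligns pairs of orthologous gene sequences with a basic Needleman-Wunsch implementation and averages the percentage identity. It is not the AGIOS software: the scoring scheme (match +1, mismatch -1, linear gap -1), the definition of identity as matches divided by alignment length, and all function names are assumptions, and the ortholog pairs are toy sequences rather than Proteinortho output.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences; return the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Percentage of identical columns over the full alignment length."""
    aligned_a, aligned_b = needleman_wunsch(a, b)
    if not aligned_a:
        return 0.0
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


def mean_identity(ortholog_pairs):
    """Average percent identity over a list of (gene_a, gene_b) pairs."""
    values = [percent_identity(a, b) for a, b in ortholog_pairs]
    return sum(values) / len(values)


# Toy ortholog pairs (hypothetical sequences, for illustration only).
pairs = [("ACGT", "ACGT"), ("ACGT", "ACGA")]
print(mean_identity(pairs))
```

A production tool would typically use an affine gap penalty and a nucleotide substitution matrix; the simple linear scheme here keeps the example short.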

Genome properties
The genome of C. massiliensis strain GD3T is 2,319,586 bp long (1 chromosome, no plasmid) with a 65.8% G+C content (Table 4 and Figure 6). Of the 2,057 predicted genes, 2,003 were protein-coding genes and 54 were RNAs (51 tRNA and 3 rRNA genes). A total of 1,503 genes (73.06%) were assigned a putative function, and 500 genes (24.30%) were annotated as hypothetical proteins. A total of 165 genes (8.02%) were identified as ORFans. The properties and statistics of the genome are summarized in Tables 4 and 5, and the distribution of genes into COG functional categories is presented in Table 5.
Table footnote: the total is based on either the size of the genome in base pairs or the total number of protein-coding genes in the annotated genome.
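The gene counts reported above can be cross-checked with a few lines of arithmetic. The sketch assumes that the percentages are computed against the 2,057 total predicted genes; this denominator is inferred from the numbers rather than stated explicitly, and with it the ratios come out within 0.01 percentage points of the reported 73.06%, 24.30% and 8.02%.

```python
TOTAL_GENES = 2057          # all predicted genes
PROTEIN_CODING = 2003
RNA_GENES = {"tRNA": 51, "rRNA": 3}
WITH_FUNCTION = 1503        # genes assigned a putative function
HYPOTHETICAL = 500          # genes annotated as hypothetical proteins
ORFANS = 165


def percent_of_total(count: int, total: int = TOTAL_GENES) -> float:
    """Express a gene count as a percentage of the total gene count."""
    return 100.0 * count / total


# Internal consistency of the reported counts.
assert PROTEIN_CODING + sum(RNA_GENES.values()) == TOTAL_GENES
assert WITH_FUNCTION + HYPOTHETICAL == PROTEIN_CODING

for label, count in [("putative function", WITH_FUNCTION),
                     ("hypothetical", HYPOTHETICAL),
                     ("ORFans", ORFANS)]:
    print(f"{label}: {percent_of_total(count):.2f}%")
```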

Comparison with other genomes
The genome of C. massiliensis was compared with those of C. intestinalis, C. aerofaciens, C. stercoris, C. tanakaei, Eggerthella lenta and Coriobacterium glomerans (Table 6). The draft genome of C. massiliensis is larger than those of C. intestinalis and Coriobacterium glomerans (2.32, 1.8 and 2.12 Mb, respectively) but smaller than all other studied genomes (Table 6). In contrast, it exhibits a higher G+C content than all other genomes (Table 6). The distribution of genes into COG categories in the genomes of all 5 compared Collinsella species and Coriobacterium glomerans was similar, but differed from that of Eggerthella lenta (Figure 7). In addition, C. massiliensis shared 867, 947, 953, 1,029, 751 and 841 orthologous genes with C. intestinalis, C. aerofaciens, C. stercoris, C. tanakaei, Eggerthella lenta and Coriobacterium glomerans, respectively. Among the compared Collinsella genomes other than C. massiliensis, AGIOS values ranged from 74.19% between C. aerofaciens and C. tanakaei to 81.80% between C. intestinalis and C. stercoris. When C. massiliensis was compared to other Collinsella species, AGIOS values ranged from 74.37% with C. tanakaei to 76.52% with C. stercoris (Table 7). In addition, C. massiliensis exhibited AGIOS values of 71.24% and 73.73% with Eggerthella lenta and Coriobacterium glomerans, respectively (Table 7).