Complete genome of the switchgrass endophyte Enterobacter cloacae P101

The Enterobacter cloacae complex is genetically very diverse. The increasing number of complete genomic sequences of E. cloacae is helping to determine the exact relationship among members of the complex. E. cloacae P101 is an endophyte of switchgrass (Panicum virgatum) and is closely related to other E. cloacae strains isolated from plants. The P101 genome consists of a 5,369,929 bp chromosome. The chromosome has 5,164 protein-coding regions, 100 tRNA sequences, and 8 rRNA operons.


Introduction
Numerous Enterobacter cloacae strains have been associated with plants as agents of disease [1][2][3][4]. E. cloacae strains have also been associated with plants as endophytes [5][6][7][8], used for biocontrol of fungal pathogens [9][10][11][12][13][14][15][16], and associated with nosocomial infections in hospital settings [17][18][19]. E. cloacae belongs to the E. cloacae complex, which also includes the species E. asburiae, E. hormaechei, E. kobei, E. ludwigii, and E. nimipressuralis. While 16S rRNA sequences are used to initially identify E. cloacae strains, the sequence is not always sufficient for identification at the species and sub-species level [17]. Previous phylogenetic studies using multi-locus sequence analyses of common housekeeping genes demonstrate that there is considerable diversity among the strains designated as E. cloacae: the strains form multiple clades, and only 3% of them group with the type strain E. cloacae subsp. cloacae ATCC 13047 [17,18]. The number of draft and complete E. cloacae genomes has increased recently, and there are currently five complete and five draft E. cloacae genomes, with additional registered genome projects [20]. Sequencing and analysis of more E. cloacae genomes may establish a basis for explaining the diversity within the E. cloacae complex and provide new means for more definitive species or sub-species designation.

MIGS-22 Oxygen requirement: facultative anaerobe; TAS [37]
Carbon source: carbohydrates; TAS [37]
Energy source: chemoorganotroph; TAS [37]

MIGS Evidence codes - IDA: Inferred from Direct Assay (first time in publication); TAS: Traceable Author Statement (i.e., a direct report exists in the literature); NAS: Non-traceable Author Statement (i.e., not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence). These evidence codes are from the Gene Ontology project [38]. If the evidence code is IDA, then the property was directly observed for a live isolate by one of the authors, or an expert mentioned in the acknowledgements.

Figure 1.
Phylogenetic tree of 16S rRNA sequences from Enterobacter sp. with genome sequences. E. cloacae strains grouped separately into a clade from other Enterobacter species using Bayesian phylogenetic analyses of the 16S rRNA region. Analyses were implemented in MRBAYES [39] and the Bayesian Information Criterion (BIC), DT-ModSel [40] was used to determine the nucleotide substitution model best suited for the dataset. To ensure that the average split frequency between runs was less than 1%, the Markov chain Monte Carlo search included two runs with four chains each for 10,000,000 generations. Pectobacterium carotovorum served as the outgroup for the analysis. Numbers in parentheses behind the bacterial names correspond to the GenBank accession numbers for the genome sequences. The scale bar indicates the number of substitutions/site.

Genome sequencing and annotation

Genome project history
The E. cloacae P101 genome project was initiated as part of an undergraduate class at the University of Florida [36]. For the class, whole-genome sequence data were obtained using a Genome Sequencer 20 (454 Life Sciences, Branford, CT), and the students used PCR and sequencing to resolve some gaps. Although the project began with these data, little progress was made towards closing the genome. New next-generation DNA sequencing data for P101 were therefore obtained at the Laboratory for Biotechnology and Bioanalysis at Washington State University using the PacBio RS platform, and the PCR products generated to confirm the genome assembly were sequenced at Elim Biopharmaceuticals (Hayward, CA). A BglII-cut optical map of P101 was obtained from OpGen (Gaithersburg, MD) in 2009 and was also used in the genome assembly process. The complete chromosome sequence has been deposited in GenBank under the accession number CP006580.

The raw data from the 16 SMRT cells were assembled using the HGAP protocol of the SMRT Analysis v2.0.0 software (Pacific Biosciences), with the standard bacterial HGAP assembly protocol and an expected genome size of 5.0 Mb. The same protocol was also used to assemble the data from 12 SMRT cells. This second set excluded four CLR SMRT cells run under instrument software v1.3.0, because of concerns that the way that software version handled quality scores could introduce artifacts into the assembly. The 20 contigs from the 16 SMRT cell assembly were used as the base set of contigs. The largest contig was 1.7 Mbp in length, and the average coverage for all the contigs was 131× with an N50 of 591,864 bp. The 12 SMRT cell contig set was essentially the same, but there were 28 contigs with an N50 of 3,479,841 bp (also the length of the longest contig). The contigs were mapped to the P101 optical map, which allowed the contigs to be ordered and overlapping regions to be joined.
Primer pairs for regions throughout the genome assembly were generated and used to verify the assembly by PCR. Reactions used GoTaq Polymerase (Promega) according to the manufacturer's protocol with 50 ng of P101 genomic DNA, an annealing temperature of 52°C, and an extension time of 1 min. Both strands of the PCR amplicons were sequenced using the same primers used for amplification of the fragments. The assembled chromosome and sequences from the PCR products were aligned with Bioedit (Ibis Biosciences, Carlsbad, CA).
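
For readers unfamiliar with the N50 statistic quoted for the contig sets, the following is a minimal Python sketch of how it is commonly computed from a list of contig lengths. It is not the authors' pipeline (the assembler reports this value itself); the function name `n50` and the example lengths are hypothetical and chosen only for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    The N50 is the length L such that contigs of length >= L together
    account for at least half of the total assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards, accumulating bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (not taken from the P101 assembly).
example_lengths = [100, 80, 60, 40, 20]
print(n50(example_lengths))  # prints 80
```

Under this definition, an assembly dominated by one very long contig has an N50 equal to the length of that contig, which is consistent with the 12 SMRT cell set, where the N50 is also the length of the longest contig.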

Genome annotation
The submission file for GenBank was prepared using Sequin [46]. The genome sequence was submitted to GenBank and annotated with the NCBI Prokaryotic Genome Annotation Pipeline [44].

Genome properties
The genome of E. cloacae P101 has one circular chromosome of 5,369,929 bp (Table 3). The average G+C content for the genome is 54.4% (Table 3). There are 100 tRNA genes and 8 rRNA operons, each consisting of a 16S, 23S, and 5S rRNA gene. There are 5,164 predicted protein-coding regions and 29 pseudogenes in the genome. A total of 4,419 genes (83.6%) have been assigned a predicted function, while the remainder have been designated as hypothetical proteins (Table 3). The numbers of genes assigned to each COG functional category are listed in Table 4. Of the annotated genes, 19.6% were not assigned to a COG or are of unknown function. The total is based on the total number of protein-coding genes in the entire annotated genome.
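
As an illustration of how a G+C percentage like the one reported above can be derived from a sequence, here is a minimal Python sketch. It assumes the sequence is available as a plain string. The function name `gc_content` and the toy sequence are hypothetical, and ambiguous bases such as N are simply excluded from the denominator, which is one common convention rather than necessarily the one used for Table 3.

```python
def gc_content(sequence):
    """Return the G+C content of a DNA sequence as a percentage.

    Only unambiguous bases (A, C, G, T) are counted; anything else
    (for example N) is ignored. This is an assumed convention.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * gc / (gc + at)


# Toy sequence for illustration only (not from the P101 chromosome).
print(round(gc_content("ATGCGCNN"), 1))  # prints 66.7
```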