Complete genome sequence of Pyrobaculum oguniense

Pyrobaculum oguniense TE7 is an aerobic hyperthermophilic crenarchaeon isolated from a hot spring in Japan. Here we describe its main chromosome of 2,436,033 bp, with three large-scale inversions, and an extra-chromosomal element of 16,887 bp. We have annotated 2,800 protein-coding genes and 145 RNA genes in this genome, including nine H/ACA-like small RNA, 83 predicted C/D box small RNA, and 47 transfer RNA genes. Comparative analyses with the closest known relative, the anaerobe Pyrobaculum arsenaticum from Italy, reveal unexpectedly high synteny and nucleotide identity between these two geographically distant species. Deep sequencing of a mixture of genomic DNA from multiple cells has illuminated some of the genome dynamics potentially shared with other species in this genus.

The genus Pyrobaculum is known for its range of respiratory capabilities [6]. Three of the currently known members of the genus can respire oxygen: P. aerophilum is a facultative micro-aerobe, while P. calidifontis and P. oguniense can utilize atmospheric oxygen. P. aerophilum [7], P. calidifontis, and four other metabolically unique Pyrobaculum species have been fully sequenced; by adding P. oguniense, we sought to further broaden the understanding of this important hyperthermophilic group. Pairwise whole-genome alignments of previously sequenced Pyrobaculum species reveal many structural rearrangements. With the availability of high-throughput sequencing, we were able to further explore rearrangements that occur between species, and our use of a not-quite-clonal population allowed exploration of rearrangements within a single species. Table 1 summarizes the phylogenetic position and characteristics of Pyrobaculum oguniense TE7 relative to other members of the Pyrobaculum genus.

Genome sequencing information
Genome project history
Table 2 presents the project information and its association with MIGS version 2.0 compliance [23].

Growth conditions and DNA isolation
The initial culture was obtained in 2003 from the Leibniz Institute-German Collection of Microorganisms and Cell Cultures (DSMZ), and grown anaerobically in stoppered 150ml glass culture bottles at 90°C. This culture was stored at 4°C for an extended period (six years) before being sampled for this study.

Figure caption (phylogenetic tree): Sequences were aligned using MAFFT v.6 [8], followed by manual curation [9] to remove 16S ribosomal introns and all terminal gap columns caused by missing sequence. The maximum likelihood tree was constructed using Tree-Puzzle v. 5.2 [10] with exact parameter estimates, 10,000 quartets and 1,000 puzzling steps. Thermoproteus tenax Kra1 (NC_016070.1, DSM 2078) was included as an outgroup. Numbered branches show bootstrap percentages, and branch lengths depict nucleotide mutation rate (see scale bar, upper right).
A set of ten-fold dilutions of an actively growing culture (~10⁸ cells/ml) was prepared, and growth was monitored over a five-day period. All cultures were grown at 90°C without shaking in 200ml modified DSM 390 medium (1g tryptone, 1g yeast extract, pH 7), supplemented with 10mM Na₂S₂O₃, in 1L flasks under a headspace of nitrogen. At day four of growth, a new 400ml aerobic culture was inoculated with 20ml from the penultimate member of the dilution series (10⁻⁸), supplemented with 10mM Na₂S₂O₃ and shaken at 100 rpm; this culture was subsequently used for sequencing. We note that at day five, turbid growth was seen in the final member of the dilution series (10⁻⁹ initial dilution). Because that ten-fold more dilute culture must have contained at least one viable cell, the initial 10⁻⁸ inoculum used for sequencing likely included more than 10 cells.

Cell pellets were obtained from the 400ml aerobic culture, frozen at -80°C, suspended in 15ml SNET II lysis buffer (20mM Tris-Cl pH 8, 5mM EDTA, 400mM NaCl, 1% SDS) supplemented with 0.5mg/ml Proteinase K, and incubated at 55°C for four hours. DNA was extracted from this digest using an equal volume of Tris-buffered (pH 8) PCI (Phenol:Chloroform:Isoamyl-OH, 25:24:1). Following phase separation (3220g, 10 min at 4°C), the resulting aqueous phase was treated with RNase A (25µg/ml) for 30 minutes at 37°C. This reaction was PCI-extracted a second time, followed by CHCl₃ extraction of the resulting aqueous phase and a final phase separation as before.

Contigs were assembled into a single scaffold using the mate-pair library generated for use on the ABI SOLiD sequencer. The library was produced with an insert size range of 1,000-3,500 bp, and final sequencing yielded 30,631,205 read pairs of 25 bp read length. Those read pairs were mapped to the 20 pyrosequencing-derived contigs to produce a From::To table of uniquely mapping read pairs; read-pair counts were accumulated for each of the 20×20 contig-pair assignments in each of the three possible relative contig orientations (same, converging or diverging). The scaffold closed easily with these data and yielded a single main chromosome with three major inversions and an extrachromosomal element.
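The From::To accumulation described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the input tuple layout, the function names `classify` and `link_table`, and the rule that maps strand combinations onto "same", "converging" and "diverging" are all assumptions made for illustration, and read pairs that fall within a single contig are simply skipped here.

```python
from collections import Counter

def classify(strand_a: str, strand_b: str) -> str:
    """Relative orientation of two contigs linked by one read pair.

    Assumed convention (hypothetical): contig A is listed before contig B,
    '+'/'-' is the mapped strand of the read on each contig.
    """
    if strand_a == strand_b:
        return "same"
    return "converging" if strand_a == "+" else "diverging"

def link_table(pairs):
    """Accumulate read-pair counts per (contig_a, contig_b, orientation).

    pairs: iterable of (contig_a, strand_a, contig_b, strand_b) tuples,
    one per uniquely mapping read pair.
    """
    table = Counter()
    for contig_a, strand_a, contig_b, strand_b in pairs:
        if contig_a == contig_b:
            continue  # within-contig pairs are ignored in this sketch
        # Canonical ordering so (A, B) and (B, A) land in the same cell.
        if contig_a > contig_b:
            contig_a, strand_a, contig_b, strand_b = (
                contig_b, strand_b, contig_a, strand_a)
        table[(contig_a, contig_b, classify(strand_a, strand_b))] += 1
    return table

# Illustrative input only; these are not data from the study.
example_pairs = [
    ("c1", "+", "c2", "-"),
    ("c2", "-", "c1", "+"),
    ("c1", "+", "c2", "+"),
    ("c3", "-", "c3", "+"),
    ("c2", "-", "c5", "+"),
]
links = link_table(example_pairs)
```

In a table like this, contig pairs with a high count in one orientation are candidate neighbors in the scaffold, while a pair of contigs supported in two conflicting orientations hints at a region present in both orientations in the sampled population.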

Genome annotation
Gene prediction and annotation were performed using the IMG/ER service of the Joint Genome Institute [24], where protein-coding genes were identified using Prodigal [25]. RNase P RNA [26], SRP RNA and ribosomal RNA (5S, 16S, 23S) were identified by homology to the currently described Pyrobaculum members using the UCSC Archaeal Genome Browser (archaea.ucsc.edu) [27]. Transfer RNA (tRNA) genes were annotated using tRNAscan-SE [28], supplemented with manual curation of noncanonical introns. C/D box sRNA genes were identified computationally using Snoscan [29], with extensions supported by transcriptional sequencing [30]. H/ACA-like sRNA genes were identified using transcriptionally supported homology modeling of experimentally validated sRNA transcripts [31]. CRISPR repeats were identified using CRT [32] or CRISPR-finder [33], with strandedness established by transcriptional sequencing.

Genome properties
The properties and overall statistics of the genome are summarized in Table 3, Table 4, Table 5, Table 6, and Table 7. The single main chromosome (55.08% GC content) has a total size of 2,436,033 bp. Ultra-deep mate-pair sequencing has revealed three regions of the genome that are present in an inverted orientation within a minority of the population (Table 7). The genome also includes an extrachromosomal element of 16,887 bp (50.58% GC) that encodes 35 predicted protein-coding genes. Of those genes, seven have an annotated function and the remaining 28 are annotated as hypothetical proteins. Of the seven annotated genes, three are annotated with viral functions [35].
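As a rough illustration of how a minority inversion can be quantified from mate-pair data, the sketch below estimates the inverted fraction at one breakpoint as the share of spanning read pairs that support the inverted arrangement. This estimator, the function name `inversion_fraction`, and the example counts are assumptions for illustration; they are not taken from the study, and a real analysis would also need to account for mapping ambiguity in repeated sequence at the inversion boundaries.

```python
def inversion_fraction(reference_pairs: int, inverted_pairs: int) -> float:
    """Fraction of spanning read pairs that support the inverted orientation.

    reference_pairs: read pairs consistent with the assembled orientation
    inverted_pairs:  read pairs consistent only with the inverted orientation
    """
    if reference_pairs < 0 or inverted_pairs < 0:
        raise ValueError("read-pair counts must be non-negative")
    total = reference_pairs + inverted_pairs
    if total == 0:
        raise ValueError("no read pairs span this breakpoint")
    return inverted_pairs / total

# Hypothetical counts chosen only to show the calculation.
fraction = inversion_fraction(reference_pairs=830, inverted_pairs=170)
print(f"inverted fraction: {fraction:.2%}")
```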
The majority of the P. oguniense genome is structurally syntenic to the genome of P. arsenaticum, and genes found in both species show an average of approximately 97% nucleotide identity. The P. oguniense genome is approximately 15% larger than that of P. arsenaticum, with the former encoding 536 more open reading frames (ORFs) predicted to be genes (2,835 vs. 2,299). Vast stretches of sequence are syntenic between the two species (Figure 2, regions in blue), broken by relatively few regions that appear to arise from either gene loss in P. arsenaticum or genomic expansion in P. oguniense, possibly a result of the numerous paREP elements present in these genomes (Figure 2). These repetitive regions are difficult to assemble, and some are putative transposons (paREP2b, for example).
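For readers who want to reproduce a figure like the average nucleotide identity quoted above, the following Python sketch shows one common way to compute it from pairwise gene alignments. It is an assumed approach, not the method reported here: identity is counted over aligned columns in which neither sequence has a gap, and the per-gene values are averaged without weighting by gene length. The function names and the toy alignments are hypothetical.

```python
def percent_identity(aln_a: str, aln_b: str) -> float:
    """Percent identity over ungapped columns of a pairwise alignment."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aln_a.upper(), aln_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("alignment has no ungapped columns")
    return 100.0 * matches / compared

def mean_identity(alignments) -> float:
    """Unweighted mean of per-gene percent identities."""
    values = [percent_identity(a, b) for a, b in alignments]
    if not values:
        raise ValueError("no alignments supplied")
    return sum(values) / len(values)

# Toy alignments for illustration only; not sequences from either genome.
toy = [("ACGT", "ACGT"), ("ACGT-A", "ACGA-A")]
print(mean_identity(toy))
```

Weighting each gene by its aligned length instead of averaging per gene would be an equally reasonable choice and can shift the result slightly when gene lengths vary widely.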
We can identify specific genes and gene clusters that are present in P. oguniense but missing in P. arsenaticum. Notably, the cobalamin synthetic cluster and two thiamine synthetic genes (ThiW and ThiC) are absent in P. arsenaticum. The terminal cytochrome cluster associated with aerobic respiration [36] is also absent in P. arsenaticum, as expected for an obligate anaerobe. Among the 16 largest deletions in P. arsenaticum (relative to P. oguniense), four are associated with paREP2 genes, six with paREP1/8, and one with paREP6 (Table 5).

Conclusion
Genomic sequencing and assembly of Pyrobaculum oguniense has yielded a complete genome and an extra-chromosomal element. The main chromosome is largely syntenic to that of Pyrobaculum arsenaticum and contains a number of gene clusters that are absent in that species. This is of particular interest considering that these species were isolated on opposite sides of the Eurasian continent; P. oguniense was isolated in Japan, while P. arsenaticum was isolated in an arsenic-rich anaerobic pool in Italy. The synteny that has been retained between the genomes of P. oguniense and P. arsenaticum allows a close examination of gene gain or loss events in the genetic history of these two species. P. arsenaticum is missing the gene clusters that support cobalamin and thiamine synthesis, and it is missing the aerobic cytochrome cluster. Given that P. oguniense and the next closest member in the clade, P. aerophilum, have both retained these capabilities, the most parsimonious explanation is gene loss in P. arsenaticum. Because these genes are located at disparate positions in the P. oguniense genome, it would further appear that these losses are the result of multiple events in the evolutionary history of P. arsenaticum.

Within this genome, 145 non-coding RNA genes are described. These include a single operon encoding 16S and 23S ribosomal RNA, the associated 5S rRNA, the 7S signal recognition particle (SRP) RNA, and the RNase P RNA. There are 47 annotated tRNA genes, plus a single tRNA pseudogene. Also included are 83 predicted C/D box sRNA genes and nine additional H/ACA-like sRNA, each of which has been transcriptionally validated [31]. The non-coding RNA content of the P. oguniense genome is now the most extensively annotated among crenarchaeal genomes to date.

The use of a not-quite-clonal cell population for DNA isolation, coupled with ultra-deep sequencing, has provided a view of three major inversions that are each present in over 17% of the sample population. The boundaries of one of these inversions are defined by an inverted repeat encoding a duplication of glutamate dehydrogenase (GluDH). Notably, this duplication appears to be present in each of the currently sequenced Pyrobaculum members, suggesting that those genomes may also host similar inversions. A second inversion has at its termini another inverted duplication, encoding a gene associated with one of the paREP members and a CRISPR-associated gene. It remains unclear whether these common structural variants impart a physiological advantage and, if so, how the variation provides utility to its host. Based on our expanded genome diversity observations, we suggest that avoiding the use of a strictly clonal population for sequencing can provide a significant benefit to understanding both the biology of the host and the genome dynamics of the species.