Non-contiguous genome sequence of Mycobacterium simiae strain DSM 44165T

Mycobacterium simiae is a nontuberculous mycobacterium that causes pulmonary infections in both immunocompetent and immunocompromised patients. We announce the draft genome sequence of M. simiae DSM 44165T. The 5,782,968-bp genome, with 65.15% GC content (one chromosome, no plasmid), contains 5,727 open reading frames (ORFs; 33% with unknown function, and 11 ORFs larger than 5,000 bp), three rRNA operons, 52 tRNAs, 389 repetitive DNA sequences and one 66-bp tmRNA matching tmRNA tags from Mycobacterium avium, Mycobacterium tuberculosis, Mycobacterium bovis, Mycobacterium microti, Mycobacterium marinum, and Mycobacterium africanum. When the ORF size distributions of M. simiae and five other Mycobacterium species were compared, M. simiae clustered with M. abscessus and M. smegmatis. A 40-kb prophage was predicted, in addition to two prophage-like elements of 7 kb and 18 kb, but no mycobacteriophage was seen after the observation of 106 M. simiae cells. Fifteen putative CRISPRs were found. Three genes were predicted to encode resistance to aminoglycosides, beta-lactams and macrolide-lincosamide-streptogramin B. A total of 163 CAZymes were annotated. M. simiae contains ESX-1 to ESX-5 genes encoding a type VII secretion system. Availability of the genome sequence may help elucidate the unique properties of this environmental, opportunistic pathogen.


Genome sequencing and annotation
Genome project history
M. simiae is the first member of the M. simiae species complex for which a genome sequence has been completed. This organism was selected to gain a detailed understanding of the genetics of the M. simiae complex (Table 2).

Genome sequencing and assembly
The DNA concentration, measured with a Quant-iT PicoGreen kit (Invitrogen) on the Genios Tecan fluorometer, was 79.36 ng/µL. A 5-µg quantity of DNA was mechanically fragmented on the Covaris device (KBioScience-LGC Genomics, Teddington, UK) through miniTUBE-Red 5Kb. The DNA fragmentation was visualized on an Agilent 2100 BioAnalyzer with a DNA labchip 7500 and showed an optimal size of 3.57 kb. The library was constructed according to the 454 Titanium paired-end protocol (Roche, Boulogne-Billancourt, France). Circularization and nebulization generated a fragment pattern with an optimum at 415 bp. After PCR amplification through 17 cycles followed by double size selection, the single-stranded paired-end library was quantified with the Quant-iT RiboGreen kit (Invitrogen) on the Genios Tecan fluorometer at 865 pg/µL. The library concentration equivalence was calculated as 1.91E+09 molecules/µL. The library was stored at -20°C until used.
The library was clonally amplified with 0.5 copies per bead (cpb) in 2 emPCR reactions with the GS Titanium SV emPCR Kit (Lib-L) v2 (Roche, Boulogne-Billancourt, France). The yield of the emPCR was 20.2%, slightly above the 5 to 20% range given in the Roche procedure. A total of 790,000 beads were loaded on the GS Titanium PicoTiterPlate PTP Kit 70x75 and sequenced with a GS Titanium Sequencing Kit XLR70 (Roche, Boulogne-Billancourt, France). The run was performed overnight and analyzed on the cluster with gsRunBrowser and gsAssembler (Roche). A total of 241,405 passed-filter wells were obtained, generating 88.64 Mb of sequence with an average read length of 367 bp. The passed-filter sequences were assembled with gsAssembler (Roche, Boulogne-Billancourt, France), with 90% identity and a 40-bp overlap, yielding one scaffold and 338 large contigs (>1,500 bp) for a genome size of 5.78 Mb, which corresponds to a coverage of 15.33× genome equivalents.
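The arithmetic behind the reported read length, coverage and library concentration can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names are hypothetical, and the conversion factor of 656.6 g/mol per base pair is an assumption (the text does not state which formula was used), chosen because it reproduces the reported 1.91E+09 molecules/µL from 865 pg/µL and a 415-bp fragment optimum.

```python
AVOGADRO = 6.022e23  # molecules per mole


def mean_read_length(total_bases: float, n_reads: int) -> float:
    """Average read length in bp: total sequenced bases divided by read count."""
    return total_bases / n_reads


def coverage(total_bases: float, genome_size_bp: int) -> float:
    """Sequencing depth in genome equivalents."""
    return total_bases / genome_size_bp


def library_molecules_per_ul(conc_ng_per_ul: float,
                             mean_fragment_bp: float,
                             g_per_mol_per_bp: float = 656.6) -> float:
    """Convert a mass concentration into molecules per microliter.

    g_per_mol_per_bp is an ASSUMED average molar mass per base pair;
    it is not given in the text.
    """
    grams_per_ul = conc_ng_per_ul * 1e-9
    moles_per_ul = grams_per_ul / (mean_fragment_bp * g_per_mol_per_bp)
    return moles_per_ul * AVOGADRO


if __name__ == "__main__":
    total_bases = 88.64e6      # 88.64 Mb of passed-filter sequence
    n_reads = 241_405          # passed-filter wells
    genome_bp = 5_782_968      # assembled genome size

    print(f"mean read length: {mean_read_length(total_bases, n_reads):.0f} bp")
    print(f"coverage: {coverage(total_bases, genome_bp):.2f}x")
    # 865 pg/uL is 0.865 ng/uL
    print(f"library: {library_molecules_per_ul(0.865, 415):.2e} molecules/uL")
```

Using the full genome size (5,782,968 bp) rather than the rounded 5.78 Mb matters here: only the former rounds to the reported 15.33×.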

Genome annotation
Open reading frames (ORFs) were predicted using Prodigal [35,36] with default parameters. The predicted bacterial protein sequences were searched against the NCBI NR database, UniProt [37] and COGs [38] using BLASTP. The ARAGORN software tool [39] was used to find tRNA genes, whereas ribosomal RNAs were found using RNAmmer [40] and BLASTn against the NR database. Proteins were also checked for domains using a hidden Markov model (HMM) search against the Pfam database [41]. Tandem Repeats Finder was used for repetitive DNA prediction [42]. Prophage regions were predicted using PHAST (PHAge Search Tool) [43]. CRISPRs were found using CRISPRFinder [44]. The antibiotic resistance genes were annotated using. The CAZymes, which are enzymes involved in the synthesis, metabolism, and transport of carbohydrates, were annotated using the CAZymes Analysis Toolkit (CAT) (mothra.ornl.gov/cgibin/cat.cgi?tab=CAZymes).

Genome properties
The genome of M. simiae strain DSM 44165T consists of a 5,782,968-bp chromosome (65.15% GC content) without plasmids (Figure 4). Table 3 summarizes the properties and statistics of the genome, including nucleotide content and gene counts, and Table 4 presents the distribution of genes into COG functional categories. The genome contains three rRNAs (5S, 16S and 23S rRNA), 52 tRNA genes, one transfer-messenger RNA (tmRNA) and 5,727 ORFs, of which 4,673 (81.6%) have at least one Pfam domain. Of the coding sequences, 66% could be assigned to COG families (Table 4).
Standards in Genomic Sciences
a) The total is based on either the size of the genome in base pairs or the total number of protein-coding genes in the annotated genome.
[...] (Table 3) were annotated. M. simiae DSM 44165T contains 163 carbohydrate-active enzyme (CAZyme) genes belonging to 36 CAZy families (supplementary data S1).
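As a small illustration of how summary statistics such as GC content and the share of ORFs with a Pfam domain are derived, the sketch below computes both. The function names and the toy sequence are hypothetical and are not part of the authors' pipeline; only the counts 4,673 and 5,727 come from the text.

```python
def gc_content(seq: str) -> float:
    """Percent G+C among unambiguous bases (A, C, G, T) of a DNA sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / (gc + at)


def percent(part: int, total: int) -> float:
    """Share of `part` in `total`, as a percentage."""
    return 100.0 * part / total


if __name__ == "__main__":
    # Toy sequence for illustration only (not from the M. simiae genome).
    print(f"GC content: {gc_content('GCGCATNN'):.2f}%")
    # ORFs with at least one Pfam domain, as reported in the text.
    print(f"Pfam-annotated ORFs: {percent(4673, 5727):.1f}%")
```

Ambiguous bases such as N are excluded from the denominator, which is one common convention; other tools may count them differently.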
Analysis of the ORF size distribution in M. simiae revealed 11 ORFs > 5,000 bp, including two ORFs > 10,000 bp: a 12,942-bp ORF showed 77% similarity with a M. avium 104 gene encoding a linear gramicidin synthase subunit D, and a 14,415-bp ORF showed no similarity to any sequence in the NR database. We verified the reading frames of these two ORFs using the online ORF Finder software [45] and found that they encode proteins of 4,313 and 4,804 amino acids, respectively. A heatmap based on the distribution of ORF sizes in M. simiae and five other genomes, generated in R [46], clustered M. simiae with M. abscessus and M. smegmatis, indicating that the three genomes have similar ORF size distributions (Figure 5).
Recent evidence shows that mycobacteria have developed novel and specialized secretion systems for the transport of extracellular proteins across their hydrophobic, highly impermeable cell wall [47]. M. tuberculosis genomes encode up to five of these transport systems, and the ESX-1 and ESX-5 systems are involved in virulence [47]. By BLASTP comparison with the type VII clusters of M. tuberculosis H37Rv, a total of 77 type VII secretion system proteins were annotated in M. simiae (supplementary data II). ESX-5 appears to be conserved between M. tuberculosis and M. simiae, consistent with the opportunistic pathogenicity of M. simiae.
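The relationship between the ORF lengths and the protein lengths reported above can be checked with a short sketch. It assumes that each ORF length includes the terminal stop codon, which is consistent with the reported figures; the function names and the example list of ORF lengths are hypothetical.

```python
from typing import Iterable


def protein_length_aa(orf_length_bp: int, includes_stop: bool = True) -> int:
    """Number of amino acids encoded by an ORF of the given length.

    Assumes the ORF length is a whole number of codons; if the stop codon
    is counted in the ORF length, it is subtracted from the codon count.
    """
    if orf_length_bp % 3 != 0:
        raise ValueError("ORF length is not a multiple of 3")
    codons = orf_length_bp // 3
    return codons - 1 if includes_stop else codons


def count_longer_than(orf_lengths_bp: Iterable[int], threshold_bp: int) -> int:
    """Count ORFs strictly longer than a size threshold."""
    return sum(1 for n in orf_lengths_bp if n > threshold_bp)


if __name__ == "__main__":
    for orf_bp in (12_942, 14_415):  # the two largest ORFs in the text
        print(f"{orf_bp} bp -> {protein_length_aa(orf_bp)} aa")

    # Hypothetical ORF lengths, for illustration only.
    example = [900, 5_400, 12_942, 14_415, 3_000]
    print(count_longer_than(example, 5_000), "ORFs > 5,000 bp in the example")
```

With the stop codon included, 12,942 bp and 14,415 bp correspond to 4,313 and 4,804 amino acids, matching the values given in the text.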