Comparative genomics of the Bifidobacterium breve taxon

Background Bifidobacteria are commonly found as part of the microbiota of the gastrointestinal tract (GIT) of a broad range of hosts, where their presence is positively correlated with the host’s health status. In this study, we assessed the genomes of thirteen representatives of Bifidobacterium breve, which is not only a frequently encountered component of the (adult and infant) human gut microbiota, but can also be isolated from human milk and vagina. Results In silico analysis of genome sequences from thirteen B. breve strains isolated from different environments (infant and adult faeces, human milk, human vagina) shows that the genetic variability of this species principally consists of hypothetical genes and mobile elements, but, interestingly, also genes correlated with the adaptation to host environment and gut colonization. These latter genes specify the biosynthetic machinery for sortase-dependent pili and exopolysaccharide production, as well as genes that provide protection against invasion of foreign DNA (i.e. CRISPR loci and restriction/modification systems), and genes that encode enzymes responsible for carbohydrate fermentation. Gene-trait matching analysis showed clear correlations between known metabolic capabilities and characterized genes, and it also allowed the identification of a gene cluster involved in the utilization of the alcohol-sugar sorbitol. Conclusions Genome analysis of thirteen representatives of the B. breve species revealed that the deduced pan-genome exhibits an essentially close trend. For this reason our analyses suggest that this number of B. breve representatives is sufficient to fully describe the pan-genome of this species. Comparative genomics also facilitated the genetic explanation for differential carbon source utilization phenotypes previously observed in different strains of B. breve. Electronic supplementary material The online version of this article (doi:10.1186/1471-2164-15-170) contains supplementary material, which is available to authorized users.


Background
Bifidobacteria are a common component of the microbiota of the gastrointestinal tract (GIT) of a broad range of hosts, and their presence is associated with a positive health status of the gut [1]. However, little is known about the precise molecular mechanisms that explain these probiotic effects [1][2][3]. For this reason a considerable number of ongoing scientific efforts aim to precisely explain how these benefits are being provided, and in many cases such efforts involve comparative and functional genome analyses.
Sequenced bifidobacterial genomes range in size from 1.94 to 2.8 Mbp (Bifidobacterium animalis subsp. lactis DSM 10140 and Bifidobacterium longum subsp. infantis ATCC 15697, respectively), and their genomic organization is in line with that of a typical bacterial chromosome [4].
B. longum subsp. infantis, Bifidobacterium bifidum and Bifidobacterium breve are typical inhabitants of the infant intestine, which is presumed sterile at birth but becomes rapidly colonized by bacteria immediately following (vaginal) delivery [5,6]. Functional analyses conducted on bifidobacterial genomes have also revealed how they adapt to a certain niche. For example, the presence of enzymes dedicated to the metabolism of human milk oligosaccharides (HMOs) in B. longum subsp. infantis showed how this species is specialized in colonizing the infant gut [7].
In vivo gene expression analyses conducted on B. breve UCC2003 and B. bifidum PRL2010 have revealed genes that encode functions required for gut colonization and persistence [8,9]. Furthermore, Comparative Genome Hybridization (CGH) analyses on various B. breve isolates has highlighted the existence of a high level of sequence homology among members of this species, and it also identified genetic functions that appear to be more variable within this bifidobacterial taxon [8]. Such variable functions are associated with bifidobacterial adaptation to the host environment and defence against invasion of foreign DNA. They include, among others, CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) sequences, (type II) Restriction/Modification (R/M) systems and genes involved in the production of particular extracellular structures, such as capsule exopolysaccharides (EPS) and sortase-dependent pili [8]. However, CGH analysis is not sufficient to describe the genetic diversity of a species, as it can only detect genes present in the reference genome, but cannot identify genes that are present in tested genomes yet absent in the reference genome. For this reason we decided to investigate the genome variability within the B. breve taxon by performing whole genome sequencing and comparative analysis of thirteen B. breve strains. The generated genome data sets were used to perform a pan-genomic analysis which allowed the definition of the total number of different genes encoded by the entire B. breve group (the pangenome), as well as the total number of common genes present in all isolates (the core-genome) [10,11]. Corresponding pan-and core-genome information, as obtained by an increasing number of genome sequences, can be used to determine if sequenced representatives of a certain species have provided all expected gene diversity present in that taxon (closed trend), or if additional sequencing is still necessary before essentially all genes of the species have been identified (open trend) [10][11][12]. As this approach takes the overall collection of genetic functions assigned to a certain species (pan-genome) in consideration, rather than conducting individual analyses for each strain, it is believed to represent an accurate and advanced method to explore genomic diversity of a particular bacterial taxon.

General genome features
In order to assess the chromosomal features of representative members of the B. breve species, we analyzed the genome sequences of thirteen different B. breve strains (  [8], established an average sequence length of 2,323,100 bp, where B. breve JCM 7017 represents the strain with the smallest chromosome (with a size of 2,288,919 bp), while B. breve UCC2003 possessed the largest chromosome (with a size of 2,422,684). All B. breve genomes here analyzed displayed an average G+C content of 58%, which is consistent with the range of G+C mol% content of genomes of the Bifidobacterium genus [12].
In order to facilitate a coherent comparative analysis, we performed a consistent open reading frame (ORF) prediction for all available B. breve (complete and incomplete) genome sequences. In this way, a comparable number of genes was obtained for each genome, with an average value of 1817 Open Reading Frames (ORFs) per genome (Table 2). Notably, a (BLAST-based) functional in silico prediction could be made for 74% of the identified ORFs, while the remaning 26% were predicted to encode hypothetical proteins.
As displayed in Table 2, all fully sequenced genomes were observed and experimentally verified to encompass two identical rRNA loci located at non-adjacent positions in the genome with the exception of B. breve S27 which contains three of such loci; an average of 53 dispersed tRNA genes were noted per B. breve genome.
BLASTP comparisons performed between all ORFs identified on the eight fully sequenced genomes (as it produces a better definition of the variable regions; Figure 1) revealed the presence of 1141 orthologues (i.e. homologous genes, which are present in single copy, that are considered to have evolved vertically from a single ancestral gene) in addition to 924 gene families whose presence was observed in some but not all of the eight investigated B. breve members (also named shared genes). The comparative analyses of the eight fully sequenced B. breve genomes also allowed the identification of an average of 53 unique genes per genome, mostly representing hypothetical functions or mobile elements ( Figure 1, panel a). The Cluster of Orthologous Group (COG) classification performed for the identified orthologous genes showed that the majority of these genes are predicted to be involved in various housekeeping functions, especially those related to carbohydrate and amino acid metabolism, and associated transport activities, in agreement with the general features previously observed for bifidobacteria ( Figure 1, panel b) [13][14][15][16].
The predicted mobilome of B. breve species All complete genome sequences were investigated for the presence of mobile elements such as IS elements and genes specifying transposases, and this analysis revealed that the B. breve JCM 7019 genome contains the largest number (i.e. 54) of such mobile elements, while the B. breve ACS-071-V-Sch8b genome encompasses just 12 IS elements and transposase-encoding genes. The IS classification according to the ISFinder database [17] also showed that IS30 is the most frequently occurring insertion family in B. breve.  Plasmid Prophage/Integrated episome The complete chromosomes were also examined for the presence of prophages and plasmids. The prophagelike DNA element Bbr-1 of B. breve UCC2003 was previously analysed and represents a likely prophage-remnant [18]. Notably, in our analysis, we identified two other prophage like-elements (Additional file 1: Figure S1), 689b-1 in B. breve 689b (represented by the DNA region occupied by locus tags B689b_0284 through to B689b_ 0311), which appears to be incomplete, and 7019-1 in B. breve JCM 7019 (encompassing locus tags B7019_ 0905 through to B7019_1003), which appears to represent a complete prophage. Notably, the B. breve S27 genome appears to contain an integrated episome S27-1 (encompassing locus tags BS27_1090 through to BS27_1136), which is predicted to specify several hypothetical proteins, an integrase (BS27_1090), a DNA transfer protein (BS27_1114), and a cell wall anchor domain protein (BS27_1125). An extra-chromosomal (plasmid) sequence was confirmed to be present in B. breve NCFB 2258, and this plasmid is 100% identical to the previously published pCIBb1 [19] in full length BLASTN alignment.

Whole-genome alignments and phylogenetic analysis
The eight fully sequenced B. breve genomes were also aligned using B. breve UCC2003 as the reference chromosome. The observed degree of alignment as displayed in a dot-plot exhibited a near-continuous straight line, indicating that all these genomes are highly syntenic, with the only exception of apparent inversions in the middle of the genome sequences of B. breve ACS-071-V-Sch8b (1126 Kb, corresponding to B. breve UCC2003 genome coordinates 611,964 -1,653,404) and B. breve JCM 7017 (169 Kb, corresponding to UCC2003 genome coordinates 1,181,452 -1,350,961) (Additional file 2: Figure S2). In the case of B. breve JCM 7017 we confirmed this genomic inversion by PCR, demonstrating that this was not due to an assembly error (data not shown). Furthermore, analysis of the DNA that directly flanks these two inversions revealed the presence of sequences specifying mobile elements/transposases, which may have acted as mechanistic drivers for this genomic reshuffling through homologous recombination [20]. B. breve ACS-071-V-Sch8b contains truncated integrases and transposases flanking the inversion (HMPREF9228_0467-69 and HMPREF9228_1435-38, respectively), while the B. breve JCM 7017 genome contains an hypothetical protein and a putative conjugative transposon at the left end of the inversion and sequences encoding a replication initiation factor, excisionase and integrase at the other inversion end (B7017_0896-97 and B7017_1053-55, respectively). A BLAST alignment performed on the above mentioned genes for B. breve JCM 7017 revealed high identity (88-100% in BLASTP alignment) with mobile elements found in Clostridium difficile 630 [21], which suggests their acquisition by means of horizontal gene transfer (HGT).
In order to investigate the phylogenetic relationship between B. breve and other bifidobacteria, a phylogenetic supertree was computed based on 165 orthologues, selected on the basis of the comparison of the thirteen B. breve genomes (see above), other sequenced Bifidobacterium species (B. longum subsp. longum NCC2705, B. longum subsp. infantis ATCC 15697, B. bifidum PRL2010, B. adolescentis ATCC 15703, B. dentium Bd1, B. animalis subsp. animalis ATCC 25527, B. animalis subsp. lactis DSM 10140 and Bifidobacterium asteroides PRL2011), and three additional actinobacterial genomes, (i.e. Gardenerella vaginalis ATCC 14019, Leifsonia xyli subsp. xyli CTCB07 and Tropheryma whipplei TW08/27), combined with a member of Firmicutes as a representative outgroup (Lactobacillus plantarum WCFS1). As shown in the resulting consensus tree (Figure 1, panel c) all thirteen B. breve members fall into the B. longum phylogenetic group, which is consistent with a previous assignment based on a multilocus approach [22]. As shown in a previous study the colonization of the infant gut by representatives of B. breve and B. longum occurs immediately after birth, with a correlation being observed between strains present in mother and progeny, thus suggesting that such bifidobacteria are transmitted from mother to child during vaginal delivery and/or breast feeding [5,6]. The strains analyzed in this study possess very similar ecological origins and it was therefore not surprising that no clear separation of these strains was observed within the tree. However, B. breve JCM 7019, an isolate from adult faeces, clustered in a separate branch, while the B. breve milk isolates also cluster together. Additionally, most of the infant stool isolates were shown to cluster in the same group at the bottom of the tree.

B. breve core and dispensable genome
Comparative genomic analysis based on BLASTP comparisons and MCL clustering algorithm between the eight complete B. breve genomes (see Methods) allowed the definition of a set of 1323 gene families, representing the core genome for the B. breve species, defined as a pool of gene families that is present in all of the considered genomes [10,11], and representing the 1141 orthologues mentioned above plus an additional 182 paralogues. Inspection of corresponding COG assignments (Additional file 3: Figure S3) revealed that many components of this core genome represent functions related to cellular housekeeping. It is also worth mentioning that this set of core families is composed of common functions which can be present in single copy (named orthologues and including a large proportion of the identified housekeeping genes), but also functions present in multiple copy (also named paralogues, of which ATP Binding Cassette (ABC)-type transporters represent a typical example). Variability among the B. breve genomes is due to a specific set of functions also called dispensable genes which are present in more than one of the examined B. breve genomes, yet not present in all, as well as genes that are specific for just one strain [10,11]. Our analysis revealed a total of 924 families of variable genes, 426 of which are classified as unique. Of these 924 gene families, 49% encode hypothetical proteins, while the remainder is assigned to more informative features, such as genes predicted to encode proteins involved in capsular exopolysaccharide (EPS) synthesis, in phage resistance (CRISPR locus and R/M systems), in the production of sortasedependent pili, and in carbohydrate metabolism, including various carbohydrate transporters (Additional file 4: Table S1). Notably, our in silico data corroborate and extend previously published CGH analyses, that had been performed to explore the genomic diversity of B. breve [8].
Furthermore, the total gene pool (ORFeome) extracted from the eight B. breve complete genomes was compared with that of six complete and publicly available chromosome sequences of B. longum subsp. longum (B. longum subsp. longum NCC2705, B. longum subsp. longum DJO10A, B. longum subsp. longum JCM 1217, B. longum subsp. longum ATCC 15697, B. longum subsp. longum 157F, B. longum subsp. longum BBMN68), which is phylogenetically the closest related taxon to B. breve [22]. This comparison showed that 564 gene families (Additional file 5: Table S2) are specifically present in B. breve, yet absent in B. longum subsp. longum. Of these 564 gene families, approximately 50% encode unknown or hypothetical functions, while the other 50% represent functions similar to the ones observed in the variable regions of B. breve (i.e. glycosyl hydrolases, ABC transporters, CRISPR genes and mobile elements). The analysis also showed that 581 genes families are specifically present in B. longum subsp.
longum, yet absent in B. breve, where approximately 68% are coding unknown or hypothetical functions, while the remaining 32% specify mobile elements, ABC transporters and glycosyl hydrolases (data not shown).

Variability among B. breve genomes
In order to determine the presence of regions containing clusters of genes putatively acquired by HGT (Horizontal Gene Transfer) in B. breve, the G+C mol% for each ORF was calculated and only genes with a significantly different G+C content (i.e. higher that 68% or lower than 49%) were plotted ( Figure 2, panel b). This analysis established that 10% of the total of gene families present in B. breve displays a deviating G+C content (Figure 2, panel c), the majority of which (85%) are mapped in eight variable genomic regions (indicated as REG1-8 in Figure 2, panel b) that were concordant with the regions of variability identified by a BLAST-based comparative analysis ( Figure 2, panel a) and the variable regions previously detected by CGH [8]. As shown by the outcomes of a hierarchical clustering analysis that was performed to scrutinize the presence or absence of genes in the eight complete B. breve representatives, the overall of variability accounts for approximately 30% of the total of genes predicted for this species (a third of which is predicted to be acquired by HGT), while the remaining 70% represents the core genes ( Figure 2, panel c).
Comparative analysis conducted on the eight fully sequenced B. breve chromosomes revealed the presence of an apparently intact EPS cluster 2 in B. breve JCM 7017, B. breve JCM 7019, B. breve 689b and B. breve S27, positioned in an identical genomic location as B. breve UCC2003 with genes ranging between 100-50% of similarity in BLASTP alignments. In contrast, B. breve ACS-071-V-Sch8b, B. breve NCFB 2258, and B. breve 12L appear to only contain a remnant EPS biosynthesis cluster, where the gene encoding a presumed priming glycosyl transferase is present, while many other genes related to EPS biosynthesis are absent (Table 3; Additional file 6: Figure S5, panel a). Interestingly, two EPS clusters were observed in the draft genome B. breve 31L (B31L_0002-0010 and B31L_1353-84), one of which containing several transposases and located in the same genomic position as the EPS cluster 2 of B. breve UCC2003. A second putative exopolysaccharide biosynthesis cluster named EPS cluster 1 (Bbr_1786-1801, unpublished data), was also shown to be present in the other analysed B. breve genomes and it appears to be more conserved than EPS cluster 2 among the strains analyzed (Additional file 6: Figure S5, panel b). A CRISPR locus, previously described for B. breve UCC2003 [8], was found in the chromosome of B. breve S27, B. breve ACS-071-V-Sch8b, B. breve NCFB 2258, B. breve JCM 7017 and B. breve JCM 7019, where variability was observed in the spacer region of each strainspecific CRISPR (Table 3). In contrast, a CRISPR locus appears to be absent from the chromosomes of B. breve  (Table 3).
Regarding genes that encode (predicted) adhesion factors, a type IVb tight adherence (tad) locus was previously characterized in B. breve UCC2003 [8] and its presence was also observed in all other B. breve strains with an high degree of similarity (100-98% in BLASTP alignment; Additional file 7: Figure S4, panel a). In contrast to the Tad-like genes, the analyzed B. breve genomes were shown to contain a varying number of sortase-dependent pilus-encoding loci: B. breve UCC2003, B. breve NCFB 2258 and B. breve JCM 7017 contain three sortase-dependent pili loci (designated pil1, pil2 and pil3), while B. breve JCM 7019, B. breve 12L, B. breve 689b, B. breve ACS-071-V-Sch8b, B. breve S27 only contain 2 (pil1 and pil2; Table 3) (Additional file 7: Figure S4, panel b). In most cases (with the only exception of B. breve 12L where the clusters also appear to lack a dedicated sortaseencoding gene), an apparent frameshift within a 10-11 guanine nucleotide stretch in the first surface proteinencoding gene of pil1 and pil3 was present, a phenomenon previously observed for B. breve UCC2003 [8], as well as for B. bifidum PRL2010 [9,25].

Pan-genome analysis
In order to estimate the total number of different genes present in representatives of the B. breve species, and the expected number of new genes for every inclusion of an additional B. breve genome, we applied a pan-genome analysis pipeline named PGAP v1.0 [26]. A total number  of thirteen B. breve genomes (both complete and incomplete genome sequences) were included in this analysis, and from the results as displayed in Figure 3 the pangenome curve displays an asymptotic trend, growing with an average rate of 196 genes per genome in the first nine iterations, after which the number of new genes rapidly decreases and where the resulting curve leads to a value of 3667 genes. This suggests that following the inclusion of nine genomes, the incorporation of additional genomes only leads to a minor increase in the pan genome size. In fact, in silico analysis of such 'new' genes show that they mostly encode small and hypothetical proteins. Similar results were achieved using the core-genome function, where the asymptotic trend is even more evident and after the 9th genome iteration the total number of genes in the core genome stabilizes to a value of 1307, which is comparable with the number of gene families extracted in our comparative analysis (see above). Both trends observed in the pan-genome and core-genome functions indicate that B. breve displays an essentially closed pan-genome, and that the number of genomes analyzed here is sufficient to describe the complete gene repertoire of this bifidobacterial species.

Carbohydrate utilization
As from the previous CGH work performed on 18 B. breve isolates [8], genetic variability among the analysed representatives of this group was observed for genes previously characterized as being involved in the utilization of the carbohydrates ribose [27], sucrose [28] and raffinose [29], as well as the plant-derived polysaccharides starch [30], galactan [31] and cellodextrin [32]. Interestingly, in all analysed strains, genes are present that are predicted to encode enzymes involved in the uptake and utilization of host-derived mono/oligosaccharides, in particular mucin and Human Milk Oligosaccharides (HMOs). Examples of this include gene clusters predicted to be involved in the metabolism of sialic acid (Bbr_0160-73 and Bbr_1247), lacto-N-biose through a Leloir-like metabolic pathway [33] (Bbr_1587, Bbr_0491, Bbr_1884 and Bbr_1585), fucose (Bbr_1740-42) and Nlinked glycans [34] (Bbr_1141-50). Although B. breve is not known to be able to grow on mucin or HMOs [35,36], host-derived mono/oligosaccharides may become available through hydrolytic activities of other (bifido) bacteria present in the gut (e.g. B. bifidum PRL2010 [14] and B. longum subsp. infantis [37]), allowing B. breve strains to utilize such liberated carbohydrates through a phenomenon called cross-feeding [38].
In order to extend our knowledge on carbohydrate fermentation capabilities of B. breve, an in silico prediction of all glycosyl hydrolases (GHs) was first performed on the eight complete B. breve genomes. This analysis essentially confirmed an abundance of members of GH family 13 (α-amylase function), which was previously defined as a distinctive characteristic of B. breve [30]. It also highlighted the presence of a small number of GHs that appear to be present in just a single member of B. breve (GH9, GH10, GH23, GH59, GH129 and GHNc [39]; Figure 4, panel b), and their annotation suggests that they may encode novel hydrolytic activities.
Comparing carbohydrate fermentation profiles of nine B. breve strains (B. breve UCC2003, B. breve 689b, B. breve 12L, B. breve 2 L, B. breve 31 L, B. breve 2258, B. breve JCM 7017, B. breve JCM 7019 and B. breve S27) revealed that all strains are able to ferment a common set of sugars, such as glucose, lactose, lactulose, maltose and raffinose. In contrast, fermentation capabilities for the other sugars such as galactan, sucrose, pullulan, amylopectin, starch, maltodextrins, sorbitol, mannitol, fructose, melezitose, cellobiose, xylose and ribose, were shown to be variable among the strains tested. None of the B. breve strains assayed here was shown to be capable of utilizing inulin, arabinose, maltulose, mannose, trehalose and galactose (Figure 4, panel a).
In bifidobacteria, genes involved in the utilization of a given sugar are frequently organized in gene clusters containing genes that encode one or more specific GHs and associated transport system, and are frequently placed under the transcriptional control of a LacI-type regulator specified by a gene that is also located adjacent to or within such a gene cluster [27][28][29][30][31][32].
A gene-trait matching analysis performed on these B. breve strains, based on the association between the 51 presence/absence clusters of genes (named Cluster1 through to Cluster51) obtained by hierarchical clustering (HCL) analysis and growth/no growth phenotype, allowed an in silico assessment of the role of certain genes associated with carbohydrate metabolism and carried by the B. breve chromosomes sequenced here, several of which had previously been characterized in B. breve UCC2003 [27][28][29][30][31][32]. This analysis was carried out on the sugars that generated differential carbohydrate profiles among the tested B. breve strains (Figure 4, panel a), and allowed the identification of 34 genes that correspond with a strain's ability to grow on ribose, galactan, sucrose, melezitose, cellobiose, mannitol and sorbitol (Table 4).
For example, growth on ribose was shown to correspond with Cluster45, which contains genes that have previously been described to be required for ribose utilization [27], such as rbsD, encoding a component of the presumed ribose transport system, and rbsK, encoding a ribokinase. The obtained strain-specific growth profile distribution on galactan corresponds with Cluster6, which encompasses a gene specifying an endogalactanase (Bbr_0422), which has been previously shown to be involved in galactan metabolism in B. breve UCC2003 [31]. Also, strain-specific growth on melezitose corresponds to Cluster10, which includes a gene encoding an alphagalactosidase (Bbr_1856) shown to be required for the utilization of this carbohydrate [40,41]. Furthermore, this analysis allowed the identification of a gene cluster in B. breve JCM 7017 (ORFs B7017_1846-1848), which is predicted to encode elements for the regulation, transport and metabolism of the sugar alcohol sorbitol by some members of the B. breve taxon ( Figure 5, panel a). Notably, this locus encompasses an alcohol dehydrogenaseencoding gene (B7017_1848), which was targeted for insertion mutagenesis, and followed by a gene (B7010_1847) predicted to specify an ABC transporter, and presumed to be responsible for internalization of this carbohydrate, and gene (B7017_1845) encoding a predicted a transcriptional regulator. As compared to the wild type B. breve JCM 7017, the insertion mutant strain, designated here as B. breve JCM 7017-1848, was incapable of using sorbitol as a sole carbohydrate source, thereby confirming the role of this gene cluster in sorbitol metabolism ( Figure 5,  panel b).

Conclusions
Genome sequencing of eight B. breve strains and comparative analysis of these genomes, combined with five additional, publicly available B. breve genomes, allowed the description of the pan-genome of the B. breve species, which was shown to follow an essentially closed trend. As pan-genomic analysis was only recently introduced for the description of bacterial species [10], its application is still somewhat controversial and subject to scientific scrutiny. One clear limitation of pan-genome analysis is the difficulty of assessing whether a pangenome is really closed or not, given the dynamic nature of a given bacterial population and its associated tendency to evolve and exchange genetic material. In general it can be said that a closed pan-genome implies that the gene exchange within a species is low and this certainly seems to be the case for the thirteen B. breve genomes analyzed here.
Moreover, the in silico prediction of ORFs with a deviating G+C mol% content coupled with a comparative genomics analysis, allowed the identification of eight genomic regions of variability in the B. breve pangenome representing approximately 30% of the total of gene content within the B. breve species, containing a large fraction of ORFs that have been acquired by HGT (which constitutes the 10% of the total of gene content in B. breve). Apart from hypothetical proteins and mobile elements, the gene functions contained within these variable regions are predicted to be required for environmental niche adaptation of this group. For a gut commensal the process of colonization involves cell-cell and cell-host interactions (involving for example genes that encode sortase-dependent pili for adhesion, biofilm formation and cell aggregation [8,9]) and evasion of the host adaptive immune response [23] (requiring genes specifying the biosynthesis of an exopolysaccharide capsule), as well as metabolic flexibility to acquire energy from a variety of carbon sources independent of the age of the host (when the host shifts from a milk-only diet to a diversity of solid foods, thus explaining the predicted capacity of producing a wide variety of GHs). Furthermore, evolutionary pressures to resist invasion of foreign DNA (e.g. phages and plasmids) also appears to provide an explanation for the presence of CRISPR, CRISPR-associated genes, as well as R/M systems in the variable regions of B. breve.
The comparative genomics approach used in this study also facilitated the explanation of certain differences previously observed in the carbon sources utilization in some B. breve members and allowed the definition of a new cluster responsible for the fermentation of the sugar alcohol sorbitol.
For this reason the in silico analysis presented in this study represents a robust starting point for future functional genomics investigations focusing on (individual members of ) this bifidobacterial species, in order to elucidate the spectrum of functions and mechanisms of interaction with the host environment to explain the presence of these bacteria in the human gut and the reported beneficial effects on their host.

Genome sequencing and data assembly
All genomes used in this study are human isolates of B. breve, of which B. breve UCC2003 was isolated and sequenced as part of a previous study [8], B. breve 12L, B. breve 2L and B. breve 31L, B. breve 689b, B. breve S27 were isolated from human milk and breast-fed infant feaces as previously described [42], while B. breve JCM 7017, B. breve JCM 7019, B. breve NCFB 2258 were obtained from the Japan Collection of Microorgnisms and National Collection of Food Bacteria, respectively). All genomes were sequenced using one or more Next Generation Sequencing (NGS) platforms. In order to construct an initial scaffold backbone, reads were first obtained using a 454 Roche genome sequencer FLX Titanium instrument employing a long-tag, paired-end library (average read length of 400 bp).
The genomes of B. breve 689b, B. breve 12L and B. breve S27 were sequenced using a Roche 454 FLX Titanium instrument by the commercial sequencing service providers Agencourt Bioscience (Beverly, MA) and Eurofins MWG Operon (Germany) and then assembled, after which remaining gaps were closed using Sanger sequencing of PCR products. The obtained consensus genome sequence consisted of an approximately 30-fold overall coverage, where any remaining low quality regions or other sequence conflicts were resolved using additional Sanger sequencing of PCR products. Assembly was performed using Newbler v2.6 (http://454.com/ products/analysis-software/index.asp) and Gap4 (Staden package v1.6.0 (http://sourceforge.net/projects/staden/).
In the case of the genomes of B. breve NCFB 2258, B. breve JCM 7017 and B. breve JCM 7019, sequences were obtained using a combination of the afore mentioned 454 FLX Titanium and Illumina Hiseq 2000 sequencing platforms, both performed by Macrogen (Seoul, Republic of Korea) (and using a paired-end library with an average read length of 450 bp and 101 bp, respectively). The obtained sequences were assembled employing a hybrid assembly using a combination of Newbler v2.6 (http://454. com/) for long reads and Abyss v1.3.4 (http://www. bcgsc.ca/) for short reads, resulting in a 200-fold coverage. Any remaining gaps and quality issues were resolved using Sanger sequencing of PCR products.
Finally, the draft genome sequences of B. breve 2L and B. breve 31L were carried out by GenProbio Ltd. (http:// genprobio.com) combining the output of two runs of Ion Torrent PGM (Life Technologies, Germany) following a previously described protocol [43], reaching a coverage of approximately 50-fold. The obtained raw data were assembled using MIRA v.3.9 (http://www.chevreux.org/ projects_mira.html), applying default parameters recommended for Ion Torrent data processing. All reads were quality checked and trimmed in order to improve the assembly process; this quality check/trimming step was performed for both 454 FLX and Illumina data using NGSQCToolkit v.2.3 [44]. For Ion Torrent reads the pre-processing step was performed using a built-in function of the Mira assembler software (v3.9) (http://www. chevreux.org/).

General features prediction
Open Reading Frame (ORF) prediction was performed with a combined approach of the predictor Prodigal v2.0 (http://prodigal.ornl.gov) and BLASTX v2.2.26 [45] alignment for all the genomes analysed in this study; identified ORFs were then automatically annotated on the basis of BLASTP v2.2.26 [45] analysis using B. breve UCC2003 as the reference genome (NCBI Reference Sequence: NC_020517.1). Functional assignment was performed and manually edited based on similarity searches against the non-redundant protein database curated by the National Centre for Biotechnology Information (ftp://ftp.ncbi.nih. gov/blast/db/).
Artemis v.14 (http://www.sanger.ac.uk/resources/software/artemis/) was used to combine and inspect the results of the ORF finder and the associated BLASTP [45] results, while this software tool was also used for manual editing, where necessary, of the start codon of a predicted gene. Where appropriate, annotations were further refined, verified or adjusted using information retrieved from alternative databases, e.g. Uniprot/EMBL (http://www.uniprot.org/), protein family (Pfam) (http:// pfam.sanger.ac.uk) and COGs [46].
Deviations in G+C mol% were computed based on the ORFs nucleotide sequence using Geecee function from the EMBOSS v6.5.7 package [49].
All genome sequences were searched for the presence of Restriction/Modification systems using a BLASTP [45] alignment function of the REBASE database (http:// rebase.neb.com/rebase/rebase.html) (cut-off E-value of 0.0001; and at least showing 30% of similarity of at least 80% of the sequence length).

Comparative genomics
Whole-genome sequence alignments were performed at the DNA level using the software package MUMmer v3.0 [50]. Sequence comparisons at protein level were performed using an all-against-all, bi-directional BLAST alignment [45] (cut-off: E-value 0.0001, with at least 50% identity across at least 50% of either protein sequence), and the resulting output was then clustered into protein families sharing the same function using the Markov Cluster Algorithm (MCL) implemented in the mclblastline pipeline v12-0678 [51]. The obtained gene families were classified as belonging to either the core or to the dispensable genome based on their presence in either all strains or in a subset of the investigated strains, respectively. In the orthologues extraction an additional filter for paralogues was applied by selecting only those families that were shown to contain a single protein member for each genome. Proteins identified as belonging to the mobilome, such as IS elements or phages, were also discarded from this pool of genes and orthologues were then functionally classified using COG category assignments [46].

Phylogenetic analysis
The supertree computation was performed from the alignment of a set of orthologous genes obtained from the same BLAST-based comparative approach as indicated above (Additional file 8: Table S3).

Pangenome calculation
For the available B. breve genomes a pan-genome computation was calculated using the PGAP v1.0 [26], which performs this analysis according to the Heap's law pangenome model [10]; the ORF content of each genome is organized in functional gene clusters using the GF (Gene Family) method and a pan-genome profile was then built.

Carbohydrates fermentation profiles
In order to investigate their carbohydrate-utilization capabilities, nine B. breve strains, which were available to us (B. breve UCC2003, B. breve 689b, B. breve 12L, B. breve 2L, B. breve 31L, B. breve 2258, B. breve JCM 7017, B. breve JCM 7019 and B. breve S27) were experimentally tested for growth on 24 different carbohydrates (glucose, lactose, lactulose, maltose, raffinose, galactan, sucrose, pullulan, amylopectin, starch, maltodextrin, sorbitol, mannitol, fructose, melezitose, cellobiose, inulin, arabinose, maltulose, mannose, trehalose, galactose, xylose and ribose). The seven B. breve strains were grown in modified Rogosa medium supplemented with a given carbohydrate (final concentration 0.5%) and optical densities (OD at 600 nm) were recorded at regular intervals during 24 hours. In order to evaluate the phenotypic patterns of such strains, a lower limit OD of 0.3 was used as a cut-off value to discriminate between carbohydrates that did or did not support growth of a given strain. A further in silico gene-trait matching excercise was performed in order to correlate an observed carbohydrate-linked growth phenotype with the presence/absence of particular genes. For this analysis all shared gene families as obtained from the comparative genomic analysis described above were organized in 51 clusters, according to their presence in each strain. Subsequently all the data (phenotypic and genomic) were binarized and compared on an individual basis. The resulting matching distances were then reported in a heatmap and manually inspected with the additional support of PFAM database (http://pfam.sanger.ac.uk).

Construction of B. breve JCM7017 insertion mutant
In order to verify our predictions from the gene trait matching an insertion mutant was created in the alcohol dehydrogenase encoding gene of the predicted sorbitol utilization gene cluster of B. breve JCM7017. An internal fragment of open reading frame B7017_1848 (corresponding to codon numbers 78 through to 175 out of the 335 codons present in B7017_1848) were amplified by PCR using B. breve JCM 7017 chromosomal DNA as a template and primer pairs IM1848F (5'-CCTACAAGCTTCA GAAGTCACCAACGTCAAG-3') and IM1848R (5'-CGA TGCTCTAGAGATTCCGGCAAGATCCACCTG-3') The insertion mutation was generated as described previously [24] to produce B. breve JCM7017-1848. Site-specific recombination in potential Tet-resistant mutant isolates was confirmed by colony PCR using primer combinations tetWFw (5'-ATGCTCATGTACGGTAAG-3') and tetWRv (5'-CATTACCTTCTGAAACATA-3') to verify tetW gene integration, and primers 1848-F (5'-GCTCCGCTGCCG CAGTTCCG-3' , positioned upstream of the selected internal fragment of B7017_1848), in combination with tetWFw to confirm integration at the correct chromosomal location.

Nucleotide sequence accession numbers
All the sequences here generated have been submitted to GenBank database with the following accession num-