Resistome SNP Calling via Read Colored de Bruijn Graphs

New Results

doi: https://doi.org/10.1101/156174

Abstract

The microbiome and resistome, which refers to all the antimicrobial resistance (AMR) genes in pathogenic and non-pathogenic bacteria, are often studied using shotgun metagenomics data [13, 50]. Unfortunately, there are few methods capable of identifying single nucleotide polymorphisms (SNPs) in metagenomics data, and to the best of our knowledge, there are no methods that identify SNPs in AMR genes. Nonetheless, the identification of SNPs in AMR genes is an important problem since it allows these genes, which confer resistance to antibiotics, to be "fingerprinted" and tracked across multiple samples or time periods. In this paper, we present LueVari, which allows SNPs to be identified in AMR genes from metagenomics data. LueVari is based on the read colored de Bruijn graph, an extension of the traditional de Bruijn graph that we present and formally define in this paper. We show that read coloring allows regions longer than the k-mer length and shorter than the read length to be identified unambiguously. In addition to this theoretical concept, we present a succinct data structure that allows large datasets to be analyzed in a reasonable amount of time and space. Our experiments demonstrate that LueVari was the only SNP caller that reliably reported sequences that spanned on average 47.5% of the AMR gene. Competing methods (GATK and SAMtools) only reported specific loci and require a reference to do so. This feature, along with the high accuracy of LueVari, allows distinct AMR genes to be detected reliably in a de novo fashion.

1 Introduction

Antimicrobial resistance (AMR) refers to the ability of an organism to stop an antimicrobial from working against it and is described as "an increasingly serious threat to global public health" since it causes standard treatments (e.g. antibiotics) to be ineffective [40]. This threat prompted the creation of the National Action Plan for Combating Antibiotic-Resistant Bacteria, whose fifth goal is to: "improve international collaboration and capacities for antibiotic-resistance prevention, surveillance, control, and antibiotic research and development" [15, pp. 49]. The plan recognizes that humans, animals, and the environment are sources of AMR and calls for a "one-health approach to disease surveillance" [15, pp. 5]. The resistome, which refers to the set of all AMR genes found in pathogenic and non-pathogenic bacteria, defines the potential resistance to known antibiotics. Shotgun metagenomics data has been generated to characterize the resistome in clinical [13, 50] and food production [39, 38, 51] settings. This characterization corresponds to the identification of specific AMR genes, their abundance, and the single nucleotide polymorphisms (SNPs) in the identified genes. Although systems exist that identify AMR genes and their abundance, no methods exist to identify SNPs in AMR genes from metagenomics data [13, 21]. The lack of methods is due to both the complexity of the problem and the recentness of this application of metagenomics data [50]. In this paper, we tackle this problem by developing LueVari (Finnish for "read coloring"), which is a scalable reference-free method that is tailored to identify and quantify SNPs in AMR genes.

Although there exist methods to identify SNPs and other variants, the majority of these methods are tailored for eukaryote species, a sentiment expressed by Nijkamp et al. [37] when they state: "Although many tools are available to study variation and its impact in single genomes, there is a lack of algorithms for finding such variation in metagenomes." Current reference-free methods that are specifically designed for metagenomics data require an "assembly" step, which employs an overlap-layout-consensus (OLC) approach [19] or an Eulerian approach that uses the de Bruijn graph [3, 28, 34, 23, 47, 17]. Thus, SNP detection in metagenomics data is a difficult problem since it exacerbates the weaknesses of these two algorithmic approaches: methods that apply an overlap graph are computationally too inefficient to handle large datasets, whereas de Bruijn graphs can be constructed efficiently but are prone to combining read segments inappropriately from different species, producing what we refer to as chimeric sequences. OLC approaches, such as Bambus2 [19], have an advantage over de Bruijn graph approaches, which require each read to be fragmented into smaller k-length sequences (called k-mers), further convoluting read-species information. For this reason, many metagenomic variant callers use overlap graphs, which relate the entirety of a read to a single region and thus produce graphs that have minimal complexity. Sequences and variants assembled from such graphs are composed of and supported by collections of reads, rather than k-mers, an attribute of overlap graphs known as read coherence [33]. This, in theory, reduces the frequency of chimeric variants. Unfortunately, computing the overlap between pairs of reads is inefficient and thus, these approaches are unable to handle large datasets [25].

De Bruijn graph approaches have greater capacity to scale to larger datasets, particularly in light of existing succinct data structures for constructing and processing the de Bruijn graph [44, 6, 7, 52, 4], but lose the sensitivity needed to detect longer variants accurately. Thus, such approaches face two undesirable choices: only being able to call short variant sequences constructed from non-branching paths, or risking the introduction of chimeric sequences that arise from spanning branches in the graph. This potential occurrence of chimeric sequences becomes more frequent in resistome analysis due to sequence homology between AMR genes. AMR genes in the same class share homologous regions that are typically between 60 bp and 150 bp in length [21], which is longer than the typical k-mer value and shorter than the read length. Hence, such regions often correspond to several connected paths in the de Bruijn graph that are difficult to traverse unambiguously, implying that the de Bruijn graph cannot be used to reliably construct the correct sequences corresponding to these genes and their respective SNPs. As previously mentioned, these paths are more likely to be read coherent in the overlap graph, but the time complexity of OLC methods is too large to be practical for resistome analysis, which encompasses the detection of AMR genes and their SNPs in hundreds or thousands of samples. For example, the USDA is calling for food production facilities to use sequencing to monitor AMR by 2022 [48]; if achieved, even at a small scale, thousands of samples will be sequenced and analysed to monitor food-borne outbreaks.

Therefore, an approach for identification of SNPs in AMR genes necessitates a method that combines the scalability of de Bruijn graph approaches with the read coherence of OLC approaches. To address this need, we develop LueVari, a read colored de Bruijn graph based SNP caller. It extends the concept of the colored de Bruijn graph, which was first introduced by Iqbal et al. [16] for the detection of variants in eukaryote species. Given a set of n samples, the colored de Bruijn graph extends the traditional de Bruijn graph in that each vertex (and edge) in the graph has a set of colors associated with it, where each color corresponds to one of the n samples. In Iqbal et al.'s [16] original application, each sample corresponds to the sequence data of one individual and traversal of the colored de Bruijn graph allows for variation to be detected, along with the individuals containing that variation. In 2017, Muggli et al. [32] created a succinct data structure for constructing and storing the colored de Bruijn graph. Although methods that apply the colored de Bruijn graph allow for variation to be detected among individuals of a population, they have the same issue of read coherence as the traditional de Bruijn graph and do not assist the identification of SNPs in AMR genes. Very briefly, a read colored de Bruijn graph annotates each vertex (and edge) with a unique color for each individual read (in one or more samples), allowing for read coherence to be preserved among paths longer than the k-mer size (typically, ≤ 60 bp) and shorter than the read length (typically, 150 bp). We formally define this concept later in this paper.

The read colored de Bruijn graph is an attractive concept since chimeric sequences are avoided by keeping each read as a separate color, but it does present construction challenges not present in the colored de Bruijn graph. One such challenge is that the color information for a metagenomic sample may be too large to store on even the largest servers' hard drives in uncompressed form. For example, a set of metagenomic samples from a cattle production facility [38] contains close to 41 billion 32-mers, with the first sample containing over 57 million reads¹. Storing each k-mer-read combination with a single bit would require 285 petabytes of space. This mandates that the succinct representation be built in an online fashion such that the complete uncompressed matrix need never be stored explicitly. Therefore, we present a succinct data structure to construct and store the read colored de Bruijn graph, which extends the representation of Muggli et al. [32].
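The storage figure above follows from one bit per k-mer-read pair. The sketch below redoes that arithmetic; because the text only gives approximate counts ("close to 41 billion", "over 57 million"), the round inputs used here are assumptions chosen for illustration, and the function name is hypothetical.

```python
# Back-of-the-envelope size of an uncompressed color matrix:
# one bit for every (k-mer, read) pair.

# Assumed round figures for illustration; the exact counts are not given here.
NUM_KMERS = 40_000_000_000  # roughly 40 billion distinct 32-mers
NUM_READS = 57_000_000      # roughly 57 million reads in the first sample


def uncompressed_matrix_petabytes(num_kmers: int, num_reads: int) -> float:
    """Return the matrix size in petabytes (10**15 bytes), one bit per cell."""
    total_bits = num_kmers * num_reads
    total_bytes = total_bits / 8
    return total_bytes / 10**15


if __name__ == "__main__":
    size = uncompressed_matrix_petabytes(NUM_KMERS, NUM_READS)
    print(f"about {size:.0f} PB uncompressed")
```

With sub-reads as columns instead of reads, the column count can only grow, so this is a lower bound on the uncompressed size.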

Our contributions

In this paper, we formally define the read colored de Bruijn graph, along with several new concepts, including multi-colored bulges and color coherence, and demonstrate how they can be used to resolve chimeric sequences that are between 60 bp and 150 bp. Our experiments show the utility of LueVari since it is the only SNP caller that reliably reports sequences (containing the identified SNPs) that spanned on average 47.5% of the AMR gene. It is able to achieve this without a reference genome, whereas GATK and SAMtools only reported specific loci and require a reference to do so. This is a significant advantage as it allows specific (fingerprinted) AMR genes to be detected in a de novo fashion, which is needed for the complete characterization of the resistome. Both SAMtools and LueVari had high accuracy on the simulated data, but SAMtools reported a high number of false positives. GATK had no false positives but low accuracy (≤ 7%) on simulated data. We demonstrate the utility of our method in characterizing the resistome of a food production facility in the United States. Lastly, we note that LueVari is freely available at https://github.com/baharpan/cosmo/tree/LueVari.

2 Related Work

Metagenomics enables the study of microbiomes as whole communities, allowing the assembly and analysis of genomes that belong to both known and unknown bacterial species, many of which remain difficult to culture in isolation [46]. For completeness, we give a brief overview of metagenomic assemblers in addition to our discussion of metagenomic variant callers.

There are two major approaches being used for assembly of metagenomic data: the overlap graph and the de Bruijn graph, the latter of which has become the dominant approach due to its computational efficiency [26]. Some prominent assemblers that are based on the de Bruijn graph include: metaSPAdes [3], SOAPdenovo2 [28], MetaVelvet [34], and MegaHit [23].

For metagenomic variation detection, there are packages, such as metafast [47], crAss [8], Commet [30], compareads [29], and FOCUS [43], that perform comparative metagenomics, but return similarity measures instead of specific variants. There are many that align reads to a reference, such as Hansel and Gretel [36], LENS [20], Platypus [42], MIDAS [35], Sigma [1], Strainer [9], and ConStrains [27]. Some reference-based read alignment tools, such as QuRe [41], ShoRAH [53], and Vispa [2], additionally use combinatorial optimization techniques to find sequences that best explain the reads and can be used on viral metagenomic samples. The combinatorial optimization approaches are computationally expensive, limiting their applications to relatively small datasets [12]. In some of these programs, an assembly from one sample may serve as reference for the analysis step. This approach means they will miss variants when the varying section does not align to the reference, so they tend to focus on haplotypes and SNPs. Many programs, such as SAMtools [24] and GATK [31], while commonly used to identify SNPs, indels, and structural variants, were developed to detect variations within diploid organisms, namely eukaryotes, which limits their effectiveness on haploid prokaryotes [54]. Reference-guided assembly packages, like SHEAR [22], also inherently encounter variation relative to a reference, so they could be used for structural variant detection by aligning emitted contigs to the reference, if the bubble cannot be directly emitted. However, such programs are not designed for the unique challenges faced in metagenomic samples.

The closest package is MaryGold [37], which does reference-free, metagenomic variant detection. There is some overlap between graph-based variant detection and metagenomic co-assembly programs. They differ in that co-assembly may focus on maximizing contiguity of each sample independently, whereas variant detection benefits from long contiguous sequences partially shared between two samples, specifically a common start and end. Variant detection also reports the specific alignment of these sequences, whereas co-assembled output would require an all-pairs contig alignment post-processing step to produce the same results. Thus, MaryGold compares itself to a metagenomic scaffolding program called Bambus2 [19], which can report bubbles in a graph constructed from the read data.

3 Methods

As previously discussed, almost all de Bruijn graph approaches for SNP calling share the problem of losing read information when the reads are fragmented into k-mers, introducing the possibility that the sequences containing the variation are not read coherent [33]. This lack of read coherence is more problematic in metagenomics since it can lead to chimeric sequences that contain genomic regions of different species. In this section, we describe the concept of read coloring and demonstrate how it can be incorporated into SNP calling to improve the accuracy of detecting SNPs in AMR genes from metagenomics data.

3.1 Read Colored De Bruijn Graphs

Let R = {r_1, …, r_n} be the set of n input reads. We begin by first defining the de Bruijn graph. We use the following standard constructive definition. The de Bruijn graph for R is constructed by first creating an edge for each unique k-mer in R, labelling the vertices of that edge as the prefix and suffix of that k-mer, and lastly, gluing vertices that have the same label. We let K = {k_1, …, k_m} be the set of unique k-mers constructed from R.
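The constructive definition above can be sketched in a few lines. This is a minimal, uncompressed illustration using a Python dictionary, not the succinct representation used later in the paper; the function names are hypothetical.

```python
from collections import defaultdict


def unique_kmers(reads, k):
    """Return the sorted set K of distinct k-mers occurring in the reads."""
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            seen.add(read[i:i + k])
    return sorted(seen)


def de_bruijn_graph(reads, k):
    """Build the de Bruijn graph with one edge per unique k-mer.

    Each k-mer is an edge from its (k-1)-length prefix to its
    (k-1)-length suffix; using the labels as dictionary keys "glues"
    vertices that share a label.
    """
    graph = defaultdict(set)
    for kmer in unique_kmers(reads, k):
        graph[kmer[:-1]].add(kmer[1:])
    return {vertex: sorted(successors) for vertex, successors in graph.items()}


if __name__ == "__main__":
    print(de_bruijn_graph(["ACGTA", "ACGCA"], 3))
```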

Next, we define a sub-read, a concept that has not been previously applied to the context of de Bruijn graph based assembly or variant detection. Given a read r and a value k, the sub-reads of r are sets of k-mers constructed by fragmenting r into k-mers until a repeated k-mer occurs; at this point those k-mers are grouped into one sub-read, a new sub-read for r is created, and the process of fragmenting into k-mers continues. We note that if there is no repeated k-mer in r then the set of k-mers itself is the sub-read of r. For example, let r be equal to ACGTACGTACGT and k = 3. The substring ACGT is repeated three times in r and therefore, the sets of sub-reads are S_1 = {ACG, CGT, GTA, TAC}, S_2 = {ACG, CGT, GTA, TAC}, and S_3 = {ACG, CGT}.
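The sub-read definition can be checked against the example above with a short sketch. The function name is hypothetical, and each sub-read is returned as an ordered list of k-mers (rather than a set) purely so the output is easy to read.

```python
def split_into_subreads(read, k):
    """Fragment a read into sub-reads, starting a new one at each repeated k-mer."""
    subreads = []
    current = []   # k-mers of the sub-read being built, in read order
    seen = set()   # k-mers already present in the current sub-read
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in seen:
            # A repeated k-mer closes the current sub-read.
            subreads.append(current)
            current, seen = [], set()
        current.append(kmer)
        seen.add(kmer)
    if current:
        subreads.append(current)
    return subreads


if __name__ == "__main__":
    for index, subread in enumerate(split_into_subreads("ACGTACGTACGT", 3), start=1):
        print(f"S_{index} =", subread)
```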

Lastly, we define the read colored de Bruijn graph G = (V, E) as the de Bruijn graph (constructed for R) with the modification that there exists a list of colors associated with each edge (k-mer). These colors are stored in a two-dimensional binary array C, where there exists a row for each unique k-mer in K, and a column for each sub-read. Hence, C(i, j) = 1 if the k-mer associated with edge e_i ∈ E is present in the jth sub-read, and C(i, j) = 0 otherwise. We refer to C as the color matrix. Hence, we first construct the de Bruijn graph on R, split each read in R = {r_1, …, r_n} into sub-reads, and lastly, construct the color matrix based on the sub-reads. Genomic repeats that are shorter than or equal to the read length, and longer than or equal to the k-mer length, can be resolved using sub-reads as we will see in Subsection 3.3. Figure 1 illustrates an example of a read colored de Bruijn graph.

Fig. 1. (a) Illustration of two reads r1 and r2 representing a SNP (shown in grey). (b) Shows the sub-reads of r1 and r2. S1, S2, and S3 originate from r1, and S4 and S5 originate from r2. (c) Shows the set of k-mers constructed from r1 and r2. (d) Shows the de Bruijn graph constructed from the set of k-mers. (e) The color matrix constructed from the sub-reads and de Bruijn graph. By traversing in a color coherent manner, the sequences ACATTGGACATTGGACATTGG and ACATTGGACCTTGGACATTGG will be recovered. In this example we show the transpose of the color matrix to save space.
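As a small illustration of the color matrix C defined in this subsection, the sketch below builds it as a dense list of lists, with rows in lexicographic k-mer order and one column per sub-read. The dense layout and the function names are assumptions for readability only; the paper stores C in compressed form.

```python
def subreads(read, k):
    """Split a read into sub-reads (sets of k-mers), breaking at repeated k-mers."""
    result, current = [], set()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in current:
            result.append(current)
            current = set()
        current.add(kmer)
    if current:
        result.append(current)
    return result


def color_matrix(reads, k):
    """Return (kmers, C) where C[i][j] = 1 iff k-mer i occurs in sub-read j."""
    columns = [s for read in reads for s in subreads(read, k)]
    kmers = sorted(set().union(*columns)) if columns else []
    matrix = [[1 if kmer in column else 0 for column in columns] for kmer in kmers]
    return kmers, matrix


if __name__ == "__main__":
    kmers, C = color_matrix(["ACGCA", "ACGTA"], 4)
    for kmer, row in zip(kmers, C):
        print(kmer, row)
```

In this toy input the two reads differ by a single base, so no k-mer spanning the difference shares a color with a k-mer from the other read.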

3.2 Multi-Colored Bulges

We define a bulge in G as a set of disjoint paths (p_1, …, p_n) which share a source vertex and a sink vertex, and refer to p_i for all i from 1 to n as the branches of the bulge. We note that n ≤ 4 due to the alphabet size. We define a path p = e_1 … e_ℓ in the read colored de Bruijn graph as color coherent if the sets of colors S_i and S_{i+1} corresponding to consecutive edges e_i and e_{i+1} are such that S_i ∩ S_{i+1} ≠ ∅, for all i = 1, …, ℓ−1. Next, we refer to a multi-colored bulge as a bulge whose branches are color coherent but also have disjoint lists of colors. See Figure 2 for an illustration of a multi-colored bulge. Lastly, we define an embedded multi-colored bulge in G as a multi-colored bulge that occurs in a branch of another multi-colored bulge.

Fig. 2.

Left: Example of a bulge which comes from the existence of a SNP in one gene. Middle: Reads 1 and 2, each supporting one of the branches in the bulge. Right: The representation of the read colored succinct de Bruijn graph for the bulge on the left side.
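The definitions of color coherence and of a multi-colored bulge in this subsection can be expressed as two small predicates. This is a sketch under one stated interpretation: "disjoint lists of colors" is read as the union of colors on each branch being pairwise disjoint from every other branch. Branches are given simply as lists of per-edge color sets, and the function names are hypothetical.

```python
def is_color_coherent(edge_colors):
    """True if every pair of consecutive edges shares at least one color."""
    return all(bool(a & b) for a, b in zip(edge_colors, edge_colors[1:]))


def is_multicolored_bulge(branches):
    """True if all branches are color coherent and use pairwise disjoint colors.

    `branches` is a list of branches; each branch is a list of color sets,
    one set per edge along that branch.
    """
    if len(branches) < 2:
        return False
    if not all(is_color_coherent(branch) for branch in branches):
        return False
    used = [set().union(*branch) for branch in branches]
    return all(
        used[i].isdisjoint(used[j])
        for i in range(len(used))
        for j in range(i + 1, len(used))
    )


if __name__ == "__main__":
    branch_read1 = [{1}, {1}, {1}]
    branch_read2 = [{2}, {2}, {2}]
    print(is_multicolored_bulge([branch_read1, branch_read2]))
```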

3.3 Construction of the Read Colored de Bruijn Graph

Next, we present a succinct data structure for representing the read colored de Bruijn graph that enables it to be constructed and stored efficiently. There are several succinct representations for the de Bruijn graph, including the methods of Simpson et al. [44], Chikhi and Rizk [6], Conway and Bromage [7] and Ye et al. [52]. In this paper, we extend the BOSS data structure [4], which is based on the Burrows-Wheeler Transform (BWT) [5]. We refer the reader to the original paper by Bowe et al. [4] for an overview of BWT, FM-index, and a more thorough explanation of this data structure. Here, we will give a brief overview of this representation and then demonstrate how it can be extended to construct and store read colored de Bruijn graphs.

The first step of constructing this graph G for a given set of k-mers is to add dummy k-mers (edges) so that the first k − 1 symbols of every edge (k-mer) occur as the last k − 1 symbols of some edge. These dummy edges ensure that each edge in G has an incoming vertex. After this small perturbation of the data, a list of all edges sorted into right-to-left lexicographic order of their last k − 1 symbols (with ties broken by the first character) is constructed. We denote this list as F, and refer to its ordering as co-lexicographic (colex) ordering. Next, we define L to be the list of edges sorted co-lexicographically by their starting vertices with ties broken co-lexicographically by their ending vertices. Hence, two edges with the same label have the same relative order in both lists; otherwise, their relative order in F is the same as their labels' lexicographic order. The sequence of edge labels sorted by their order in list L is called the edge-BWT (EBWT).

Now, let B_F be a bit vector in which every 1 indicates the last incoming edge of each vertex in L, and let B_L be another bit vector in which every 1 marks the position of the last outgoing edge of each vertex in L. Given a character c and a vertex v with co-lexicographic rank rank(v), we can determine the set of v's outgoing edges using B_L and then search EBWT(G) for the position of the edge e with label c. Using B_F we can find the co-lexicographic rank of e's out-vertex. By repeating this process we can traverse the graph.
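To make the list L, the EBWT, and B_L concrete, the following sketch builds them with plain Python lists for a handful of k-mers. It is a simplified illustration only: dummy edges and B_F are omitted, nothing is compressed, and the function name is hypothetical rather than part of the BOSS implementation.

```python
def build_edge_bwt(kmers):
    """Return (L, EBWT, B_L) for a set of k-mers treated as edges.

    L    : edges sorted co-lexicographically by starting vertex
           (the first k-1 symbols read right to left), ties broken by
           the edge label (the last symbol).
    EBWT : the edge labels in the order of L.
    B_L  : 1 marks the last outgoing edge of each starting vertex.
    """
    edges = sorted(set(kmers), key=lambda e: (e[:-1][::-1], e[-1]))
    ebwt = [e[-1] for e in edges]
    b_l = []
    for i, edge in enumerate(edges):
        is_last = i + 1 == len(edges) or edges[i + 1][:-1] != edge[:-1]
        b_l.append(1 if is_last else 0)
    return edges, ebwt, b_l


if __name__ == "__main__":
    L, ebwt, b_l = build_edge_bwt(["ACGC", "ACGT", "CGCA", "CGTA"])
    for edge, label, bit in zip(L, ebwt, b_l):
        print(edge, label, bit)
```

Vertex ACG has two outgoing edges here, so its run in B_L is a 0 followed by a 1.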

After we construct the BOSS representation of the de Bruijn graph on R, we store each k-mer and its lexicographical order in a map M so it can be used for the construction of the color matrix. We use Elias-Fano vector encoding to store C since it permits on-line construction as long as all 1 bits are added in increasing order of their index in the vector. For example, column six of C cannot be filled before column five. Therefore, we build the color matrix by initializing each position to 0 and then updating one row at a time. We recall that each row corresponds to a k-mer and the rows are sorted in lexicographical order. Thus, to efficiently find the row in C corresponding to a particular k-mer, we find its lexicographical ordering using M, which we denote as i. Next, we align the k-mer to the union of all the sub-reads using BWA [14] in order to find all the sub-reads that contain it. Next, we store the indices of these sub-reads into an array, sort this array, and update the ith row of C in this order, i.e., C(i, j) = 1 for all jth sub-reads containing the ith k-mer. Storing and sorting the indices ensures that we meet the construction requirement of Elias-Fano. After we construct and compress (using Elias-Fano) a row, i.e., a binary vector, we append it to the growing color matrix C. We continue with this process until all k-mers have been explored.
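The row-by-row construction can be sketched as follows. Two simplifications are assumed: sub-read membership is tested by set lookup instead of BWA alignment, and the Elias-Fano encoder is a toy version that stores bits in Python lists, written only to show why the column indices must arrive in increasing order. All function names are hypothetical.

```python
import math


def elias_fano_encode(sorted_values, universe):
    """Encode strictly increasing integers in [0, universe) (toy Elias-Fano)."""
    if any(b <= a for a, b in zip(sorted_values, sorted_values[1:])):
        raise ValueError("values must be strictly increasing")
    n = len(sorted_values)
    low_bits = int(math.log2(universe / n)) if n and universe > n else 0
    mask = (1 << low_bits) - 1
    lows = [v & mask for v in sorted_values]
    highs, previous_high = [], 0
    for v in sorted_values:
        high = v >> low_bits
        highs.extend([0] * (high - previous_high) + [1])  # unary-coded gap
        previous_high = high
    return low_bits, lows, highs


def elias_fano_decode(low_bits, lows, highs):
    """Recover the original increasing integers."""
    values, high, index = [], 0, 0
    for bit in highs:
        if bit == 0:
            high += 1
        else:
            values.append((high << low_bits) | lows[index])
            index += 1
    return values


def encode_color_rows(kmers, subreads):
    """Encode one row per k-mer (in sorted order), appending rows as they finish."""
    rows = []
    for kmer in sorted(kmers):
        columns = sorted(j for j, s in enumerate(subreads) if kmer in s)
        rows.append(elias_fano_encode(columns, len(subreads)))
    return rows


if __name__ == "__main__":
    encoded = elias_fano_encode([2, 5, 9], 16)
    print(encoded, elias_fano_decode(*encoded))
```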

3.4 Multi-Colored Bulge Search

We search for multi-colored bulges by iterating through each vertex in G, determining those that are potential source vertices, and traversing G from each potential source vertex to find one or more potential sink vertices. A vertex is determined to be a potential source vertex if its out-degree is greater than one and the sets of colors of its adjacent edges are disjoint. We determine the out-degree of a vertex v using B_L. Suppose v has index i in colex order, and recall that a 1 in B_L marks the last out-going edge of a vertex. We locate the ith 1 in B_L; if there are ℓ 0s immediately preceding it (ℓ + 1 ≤ 4 due to the alphabet size), then the out-degree of v is ℓ + 1. Next, after finding a vertex with out-degree greater than one, we determine the colors of its adjacent edges. We recall that the lexicographical order of the edges (k-mers) gives their row indices in the color matrix C. Let v be a vertex that has out-going edges e_1 and e_2 with lexicographical order i and j, respectively. The colors for e_1 correspond to the set of columns k for which C(i, k) = 1, and the colors for e_2 correspond to the set of columns k' for which C(j, k') = 1. If these two sets are disjoint then v is a potential source vertex.
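The out-degree computation and the source-vertex test described above can be sketched with an uncompressed B_L. Ranks are 1-based here (the ith vertex in colex order corresponds to the ith 1), the linear scan stands in for the constant-time select operation a succinct bit vector would provide, and the function names are hypothetical.

```python
def out_degree(b_l, rank):
    """Out-degree of the vertex with 1-based colex rank `rank`.

    Finds the rank-th 1 in B_L and counts the 0s immediately before it.
    """
    ones_seen = 0
    for position, bit in enumerate(b_l):
        if bit == 1:
            ones_seen += 1
            if ones_seen == rank:
                zeros = 0
                p = position - 1
                while p >= 0 and b_l[p] == 0:
                    zeros += 1
                    p -= 1
                return zeros + 1
    raise IndexError("rank exceeds the number of vertices in B_L")


def is_potential_source(b_l, rank, out_edge_colors):
    """True if the vertex branches and its out-edges have pairwise disjoint colors."""
    if out_degree(b_l, rank) <= 1:
        return False
    return all(
        out_edge_colors[i].isdisjoint(out_edge_colors[j])
        for i in range(len(out_edge_colors))
        for j in range(i + 1, len(out_edge_colors))
    )


if __name__ == "__main__":
    b_l = [1, 0, 1, 1]  # the second vertex has two outgoing edges
    print(out_degree(b_l, 2), is_potential_source(b_l, 2, [{0}, {1}]))
```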

Next, we perform a depth first search at each potential source vertex after it is found, comparing the colors of the next edge with those of the previous one. If these sets are color coherent and there exists at least one unvisited color then we mark the next color as visited and continue with the traversal; otherwise, we terminate. When we encounter a vertex that has in-degree greater than one, it is stored as a potential sink vertex, along with the potential source vertex and the set of branches. We can find the in-degree of a vertex using B_F, by a process similar to the one described above for finding the out-degree. We note that we do not stop the traversal after visiting a potential sink vertex, but instead continue traversing until all potential sink vertices that are reachable from the source vertex are visited and stored. We stop traversing after all possible outgoing edges have been visited. At the conclusion of the traversal algorithm for a potential source vertex, a list of tuples (sink vertex, source vertex, and branches) describing each bulge is stored and ordered by occurrence. We note that the iterative depth first search algorithm of our traversal method can deal with multiple embedded multi-colored bulges.
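A simplified version of this search is sketched below on an explicit edge list rather than on the succinct representation. It follows an edge only when its colors intersect those of the previous edge, records every vertex with in-degree greater than one as a potential sink, and reports a bulge when at least two different out-edges of the source reach the same sink. Two simplifications are assumptions of this sketch: a "vertex not already on the path" guard replaces the visited-color bookkeeping, and the function name and input format are hypothetical.

```python
from collections import defaultdict


def find_bulges(edges, source):
    """Return {sink: [branch paths]} for bulges starting at `source`.

    `edges` is a list of (from_vertex, to_vertex, colors) triples.
    """
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    for u, v, colors in edges:
        out_edges[u].append((v, frozenset(colors)))
        in_degree[v] += 1

    reached = defaultdict(list)  # sink -> [(index of first out-edge, path)]
    for branch_id, (first_v, first_colors) in enumerate(out_edges[source]):
        stack = [(first_v, first_colors, [source, first_v])]
        while stack:  # iterative depth first search
            v, previous_colors, path = stack.pop()
            if in_degree[v] > 1:
                reached[v].append((branch_id, path))  # potential sink; keep going
            for w, colors in out_edges[v]:
                if previous_colors & colors and w not in path:
                    stack.append((w, colors, path + [w]))

    bulges = {}
    for sink, hits in reached.items():
        if len({branch_id for branch_id, _ in hits}) > 1:
            bulges[sink] = [path for _, path in hits]
    return bulges


if __name__ == "__main__":
    snp_edges = [
        ("ACG", "CGC", {1}), ("CGC", "GCA", {1}), ("GCA", "CAT", {1}), ("CAT", "ATC", {1}),
        ("ACG", "CGT", {2}), ("CGT", "GTA", {2}), ("GTA", "TAT", {2}), ("TAT", "ATC", {2}),
    ]
    print(find_bulges(snp_edges, "ACG"))
```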

This traversal is illustrated in Figure 2. In this illustration, we see that the out-degree of vertex ACG is equal to two since its colex order is four and the fourth 1 in B_L occurs at B_L[5]. We determine the two out-edge labels by considering EBWT(G)[3,4] (labels start from 0). We see that they are C and T. Next, suppose we want to follow edge e_1 with label C. Since there are only two edges with label A in the graph, and the number of occurrences of C in EBWT(G)[0,1,2,3] is two, e_1 is F[3]. Now, by counting the number of 1s in B_F[0,1,2,3], we see that e_1 has the colex order of 3. By checking the 4th element in L (since labels start from 0, the element with order 3 occurs in the 4th position) we find the out-vertex, which is CGC. (Note that the edge e_1 is ACGC.) Doing the same, we find the other out-edge (e_2), which is ACGT. Next, we need to check the colors of e_1 and e_2. By checking positions 0 and 1 in C^T (the transpose of C) we see that ACGC belongs only to r_1 and ACGT belongs only to r_2 (note that the order of k-mers in the color matrix is lexicographical; hence ACGC has order 0 and ACGT has order 1). Since the color sets of the two out-edges are disjoint, we note the vertex ACG as a potential source vertex. While continually checking the color sets, we explore the branch starting with ACGC (let us call this branch p_1). The potential sink vertex on the way of p_1 is vertex ATC, since its in-degree is 2 (greater than one) and its incoming edges CATC and TATC each belong to only one of the reads. We store this vertex as a potential sink vertex of p_1 and, since in this example this is the only potential sink vertex reachable from the source vertex ACG, the exploration of the other branch (p_2, starting with ACGT) begins. Since this branch also meets the vertex ATC (the potential sink vertex of p_1) through the edge TATC, a bulge is detected and the traversal stops.

3.5 SNP Recovery

Lastly, we process each multi-colored bulge b with branches {p_1, …, p_n} by recovering the longest color coherent path that occurs prior to the source vertex s. We start at s and travel backward on the incoming vertices as long as there exists an unambiguous incoming edge, meaning that exactly one incoming edge (possibly part of a branch of an embedded bulge) can be added to the current path while keeping it color coherent. If there exists such an edge then it is added to the current path and the backward traversal continues; otherwise the traversal is halted and the current path p_s is saved. Similarly, a color coherent outgoing path is obtained by traversing the graph in a forward direction from the sink vertex t. We refer to this resulting path as p_t. Lastly, p_s is concatenated to each of the branches in {p_1, …, p_n}, p_t is concatenated to each of these resulting paths, and their corresponding sequences are output. The variation between these sequences is recovered by alignment. This process continues for all multi-colored bulges.
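The final concatenation and comparison step can be illustrated on the two sequences from Fig. 1. In this sketch the paths are assumed to be already spelled out as strings, and a position-by-position comparison of equal-length sequences stands in for the alignment step; that substitution is only adequate for substitutions such as SNPs. The function names are hypothetical.

```python
def assemble_variants(p_s, branches, p_t):
    """Concatenate the incoming path, each branch, and the outgoing path."""
    return [p_s + branch + p_t for branch in branches]


def snp_positions(seq_a, seq_b):
    """Return (position, base_in_a, base_in_b) for every mismatch.

    Positions are 0-based. Equal lengths are required because this
    comparison does not handle insertions or deletions.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


if __name__ == "__main__":
    variants = assemble_variants("ACATTGGAC", ["A", "C"], "TTGGACATTGG")
    print(variants)
    print(snp_positions(*variants))
```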

4 Results

We evaluate LueVari by comparing its performance against competing methods on both simulated and real metagenomic data. The simulated data established the sensitivity and specificity of all the methods, whereas the real dataset demonstrates the ability to identify distinct (fingerprinted) AMR genes in a sample taken from a food production facility. All experiments were performed on a server with two Intel(R) Xeon(R) E5-2650 v2 CPUs @ 2.60 GHz and 1 TB of RAM, and both resident set size and user process time were reported by the operating system.

4.1 Results on Simulated Data

We simulated two metagenomics datasets using BEAR, a metagenomics read simulator [18]. The datasets were generated to imitate the characteristics (number of reads, number of distinct AMR genes, and their copy number) of real shotgun metagenomics data generated for resistome analysis, namely those of Noyes et al. [38] and Gibson et al. [13]. The number of reads is relatively small for a single dataset since eukaryote DNA is filtered prior to resistome analysis [38], which can pose a challenge for many SNP callers. The first simulated dataset consists of 193,752 paired-end short reads from 30 AMR genes from the MEGARes database [21], two copies of the E. coli K-12 MG1655 reference genome, and two copies of the Salmonella reference genome. The average copy number of the AMR genes is 294x. Hence, we refer to this as the "294x dataset".

The second dataset consists of 2,504,238 paired-end sequence reads simulated from 3,824 AMR genes from the MEGARes database, two copies of the E. coli K-12 MG1655 reference genome, and two copies of the Salmonella reference genome. The average copy number of the AMR genes is 25x. Hence, we refer to this as the "25x dataset". We used a 1% error rate for both these simulations. SNPs were inserted into the AMR genes at a 0.05% polymorphism rate. Hence, 23 and 681 SNPs were inserted into the 294x and 25x datasets, respectively.

We compared our method against several SNP and variant callers. Several methods could not be compared due to their unsuitability. MaryGold and Bambus2 were unable to run and are no longer supported [11]. kSNP3 could not be used for comparison because it is specifically designed for calling variants between samples, i.e., sequence data from different genomes [10]. DiscoSnp [49] ran very quickly (11m CPU time) but did not produce reasonable output. This is unsurprising since we are re-purposing this tool, which is specifically designed for eukaryote genomes. GATK [31] and SAMtools [24] could be compared against LueVari. The results are summarized in Table 1. Both GATK and SAMtools are reference-guided, and thus, we ran them with their recommended parameters and the MEGARes database as the reference genome. LueVari is reference-free and was run as such. All methods had competitive run-time (≤ 5h CPU time) and used a reasonable amount of memory (≤ 6GB).

Table 1.

Results on the simulated data with 193,752 and 2,504,238 sequence reads. Total refers to the total number of SNPs reported by all programs. Total inserted reports the total number of inserted SNPs that were detected by all programs.

LueVari is the only SNP caller that reliably reported sequences (containing the identified SNPs) that spanned on average 47.5% of the AMR gene; these genes are typically between 1,000 and 1,400 bp in length. Out of the 2,749 reported SNPs, 436 (16%) were reported in a sequence spanning more than 80% of the target AMR gene. GATK and SAMtools only reported specific loci and require a reference to do so. This is a significant issue in the analysis of AMR genes since existing AMR databases are largely believed to be incomplete [45]. Because LueVari reports the sequence surrounding each SNP without a reference, specific (fingerprinted) AMR genes can be detected in a de novo manner. Both SAMtools and LueVari detected all the inserted SNPs, but SAMtools reported a high number of false positives. GATK had no false positives but only detected a small portion of the inserted SNPs (0% and 6.3% for 294x and 25x, respectively). This performance was due to the filtering step in GATK that removed a large number of reads (50% and 48% for 294x and 25x, respectively).
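The comparison above rests on counting inserted SNPs that were detected and reported SNPs that were not inserted. A minimal sketch of that bookkeeping follows; the function `score_calls`, the `(gene_id, position)` representation, and the toy call sets are illustrative assumptions, not the authors' evaluation code. The last line simply re-derives the 16% figure quoted in the text.

```python
def score_calls(called, inserted):
    """Compare called SNPs with inserted (ground-truth) SNPs.

    Both arguments are sets of (gene_id, position) tuples (an assumed encoding).
    """
    true_pos = called & inserted   # inserted SNPs that were detected
    false_pos = called - inserted  # reported SNPs that were never inserted
    missed = inserted - called     # inserted SNPs that were not detected
    return {
        "true_positives": len(true_pos),
        "false_positives": len(false_pos),
        "missed": len(missed),
        "recall": len(true_pos) / len(inserted) if inserted else 0.0,
        "precision": len(true_pos) / len(called) if called else 0.0,
    }

# Toy data: three inserted SNPs, three calls, two of which are correct.
inserted = {("geneA", 10), ("geneA", 200), ("geneB", 55)}
called = {("geneA", 10), ("geneB", 55), ("geneB", 90)}
result = score_calls(called, inserted)

# Check of a percentage quoted in the text: 436 of 2,749 reported SNPs.
long_span_pct = round(100 * 436 / 2749)
```

In these terms, a caller that finds every inserted SNP but also reports many spurious ones has high recall and low precision, while a caller with no false positives that misses most inserted SNPs has high precision and low recall.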

4.2 Results on Real Metagenomic Data

We demonstrate the ability of LueVari to analyze real shotgun metagenomics data, consisting of 29,415,748 paired-end reads that were sequenced on an Illumina HiSeq 2500 system. The samples were selected across the beef production system, which contains different interventions (such as high-heat and lactic acid treatment) aimed at decreasing AMR in consumable beef. Hence, this dataset is used to explore how microbial communities surrounding beef production facilities evolve in the presence of different food production interventions that aim to reduce pathogen load [38]. We ran LueVari on this shotgun metagenomic dataset and detected 2,129 distinct AMR genes, i.e., AMR genes found in MEGARes containing a unique SNP. This was accomplished in 25 CPU hours. In future work, we plan to analyze these SNPs and apply them to track the movement and evolution of the genes.

Footnotes

1 Reads were trimmed and those with ambiguous base calls removed.

References

1. T.-H. Ahn, J. Chai, and C. Pan. Sigma: Strain-level inference of genomes from metagenomic analysis for biosurveillance. Bioinformatics, 31(2):170–177, 2015.
2. I. Astrovskaya et al. Inferring viral quasispecies spectra from 454 pyrosequencing reads. BMC Bioinformatics, 12(Suppl 6):S1, 2011.
3. A. Bankevich et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comp Bio, 19(5):455–477, 2012.
4. A. Bowe et al. Succinct de Bruijn graphs. In Proc. WABI, pages 225–235, 2012.
5. M. Burrows and D.J. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
6. R. Chikhi and G. Rizk. Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol. Biol., 8(22), 2012.
7. T.C. Conway and A.J. Bromage. Succinct data structures for assembling large genomes. Bioinformatics, 27(4):479–486, 2011.
8. B.E. Dutilh et al. Reference-independent comparative metagenomics using cross-assembly: crAss. Bioinformatics, 28(24):3225–3231, 2012.
9. J.M. Eppley et al. Strainer: software for analysis of population variation in community genomic datasets. BMC Bioinformatics, 8(1):398, 2007.
10. S.N. Gardner, T. Slezak, and B.G. Hall. SNP detection and phylogenetic analysis of genomes without genome alignment or reference genome. Bioinformatics, 31(17):2877–2878, 2015.
11. Jay Ghurye. Personal communication, May 2017.
12. J.S. Ghurye, V. Cepeda-Espinoza, and M. Pop. Metagenomic Assembly: Overview, Challenges and Applications. Yale J Biol Med, 89(3):353–362, 2016.
13. M.K. Gibson, K.J. Forsberg, and G. Dantas. Improved annotation of antibiotic resistance determinants reveals microbial resistomes cluster by ecology. ISME, 9(1):207–216, 2014.
14. H. Li and R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754–60, 2009.
15.
16. Z. Iqbal et al. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature Genetics, 44(2):226–232, 2012.
17. Zamin Iqbal, Mario Caccamo, Isaac Turner, Paul Flicek, and Gil McVean. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature Genetics, 44(2):226–232, 2012.
18. S. Johnson et al. A better sequence-read simulator program for metagenomics. BMC Bioinformatics, 15(Suppl 9):S14, 2014.
19. S. Koren, T.J. Treangen, and M. Pop. Bambus 2: scaffolding metagenomes. Bioinformatics, 27(21):2964–2971, 2011.
20. V. Kuleshov et al. Synthetic long-read sequencing reveals intraspecies diversity in the human microbiome. Nature Biotech, 34(1):64–69, 2016.
21. S. Lakin, C. Dean, N. Noyes, A. Dettenwanger, A. Ross, E. Doster, P. Rovira, Z. Abdo, K. Jones, J. Ruiz, K. Belk, P. Morley, and C. Boucher. MEGARes: an antimicrobial resistance database for high throughput sequencing. Nucleic Acids Research, 45(D1):D574–D580, 2017.
22. S.R. Landman et al. SHEAR: sample heterogeneity estimation and assembly by reference. BMC Genomics, 15(1):84, 2014.
23. D. Li et al. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics, 31(10):1674, 2015.
24. H. Li et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078–2079, 2009.
25. Z. Li et al. Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph. Brief Funct Genomics, 11(1):25–37, 2012.
26. Z. Li et al. Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de Bruijn graph. Briefings in Functional Genomics, 11(1):25, 2012.
27. C. Luo et al. ConStrains identifies microbial strains in metagenomic datasets. Nature Biotech, 33(10):1045–1052, 2015.
28. R. Luo et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience, 1(1):1, 2012.
29. N. Maillet et al. Compareads: comparing huge metagenomic experiments. BMC Bioinformatics, 13(19):1, 2012.
30. N. Maillet et al. COMMET: comparing and combining multiple metagenomic datasets. In Proc. of IEEE BIBM, pages 94–98, 2014.
31. A. McKenna et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20:1297–303, 2010.
32. M.D. Muggli, A. Bowe, N.R. Noyes, P. Morley, K. Belk, R. Raymond, T. Gagie, S. Puglisi, and C. Boucher. Succinct colored de Bruijn graphs. Bioinformatics, to appear, 2017.
33. E.W. Myers. The fragment assembly string graph. Bioinformatics, 21(Suppl 2):ii79–ii85, 2005.
34. T. Namiki et al. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res, 40(20):e155, 2012.
35. S. Nayfach and K.S. Pollard. Population genetic analyses of metagenomes reveal extensive strain-level variation in prevalent human-associated bacteria. bioRxiv, page 031757, 2015.
36. S.M. Nicholls et al. Advances in the recovery of haplotypes from the metagenome. bioRxiv, page 067215, 2016.
37. J.F. Nijkamp et al. Exploring variation-aware contig graphs for (comparative) metagenomics using MaryGold. Bioinformatics, 29(22):2826–2834, 2013.
38. N.R. Noyes et al. Resistome diversity in cattle and the environment decreases during beef production. eLife, 5:e13195, 2016.
39. N.R. Noyes, X. Yang, L.M. Linke, R.J. Magnuson, A. Dettenwanger, S. Cook, R. Zaheer, H. Yang, D. Woerner, I. Geornaras, J. McArt, S.P. Gow, J. Ruiz, K.L. Jones, C.A. Boucher, T. McAllister, P.S. Morley, and K.E. Belk. Characterization of the resistome in manure, soil and wastewater from dairy and beef production systems. Scientific Reports, 6:24645, 2016.
40. World Health Organization. Diet, nutrition and the prevention of chronic diseases. Technical Report 916, WHO Technical Report Series, Geneva, Switzerland, 2003. Report of a joint WHO/FAO expert consultation.
41. M.C.F. Prosperi and M. Salemi. QuRe: software for viral quasispecies reconstruction from next-generation sequencing data. Bioinformatics, 28(1):132–133, 2012.
42. A. Rimmer et al. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nature Genetics, 46(8):912–918, 2014.
43. G.G.Z. Silva et al. FOCUS: an alignment-free model to identify organisms in metagenomes using non-negative least squares. PeerJ, 2:e425, 2014.
44. Jared T. Simpson and Richard Durbin. Efficient construction of an assembly string graph using the FM-index. Bioinformatics, 26(12):i367–i373, 2010.
45. A.C. Singer, H. Shaw, V. Rhodes, and A. Hart. Review of antimicrobial resistance in the environment and its relevance to environmental regulators. Front Microbiol, 7:1728, 2016.
46. Susannah Green Tringe, Christian von Mering, Arthur Kobayashi, Asaf A. Salamov, Kevin Chen, Hwai W. Chang, Mircea Podar, Jay M. Short, Eric J. Mathur, John C. Detter, Peer Bork, Philip Hugenholtz, and Edward M. Rubin. Comparative metagenomics of microbial communities. Science, 308(5721):554–557, 2005.
47. V.I. Ulyantsev et al. MetaFast: fast reference-free graph-based comparison of shotgun metagenomic data. Bioinformatics, 32(18):2760–7, 2016.
48. United States Department of Agriculture. Food safety and inspection strategic plan 2017-2021. Technical report, USDA Report Series, Washington, DC, 2017.
49. R. Uricaru et al. Reference-free detection of isolated SNPs. Nucleic Acids Res, 43(2):e11, 2015.
50. M. Willmann and S. Peter. Translational metagenomics and the human resistome: confronting the menace of the new millennium. J Mol Med, 95(1):41–51, 2017.
51. X. Yang, N.R. Noyes, E. Doster, J.N. Martin, L.M. Linke, R.J. Magnuson, H. Yang, I. Geornaras, D.R. Woerner, K.L. Jones, J. Ruiz, C. Boucher, P.S. Morley, and K.E. Belk. Use of metagenomic shotgun sequencing technology to detect food-borne pathogens within the microbiome of the beef production chain. Applied and Environmental Microbiology, 82(8):2433–2443, 2016.
52. C. Ye, Z.S. Ma, C.H. Cannon, M. Pop, and D.W. Yu. Exploiting sparseness in de novo genome assembly. BMC Bioinformatics, Suppl 6:S1, 2012.
53. O. Zagordi et al. ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data. BMC Bioinformatics, 12(1):1, 2011.
54. M. Zojer et al. Variant profiling of evolving prokaryotic populations. PeerJ, 5:e2997, 2017.


Source: https://www.biorxiv.org/content/10.1101/156174v1.full
