Abstract
Advances in next-generation sequencing (NGS) have allowed significant breakthroughs in microbial ecology studies. This has led to the rapid expansion of research in the field and the establishment of “metagenomics”, often defined as the analysis of DNA from microbial communities in environmental samples without prior need for culturing. Many metagenomics statistical/computational tools and databases have been developed in order to allow the exploitation of the huge influx of data. In this review article, we provide an overview of the sequencing technologies and how they are uniquely suited to various types of metagenomic studies. We focus on the currently available bioinformatics techniques, tools, and methodologies for performing each individual step of a typical metagenomic dataset analysis. We also outline future trends in the field with respect to tools and technologies currently under development. Moreover, we discuss data management, distribution, and integration tools that are capable of performing comparative metagenomic analyses of multiple datasets using well-established databases, as well as commonly used annotation standards.
Introduction
The advent of next-generation sequencing (NGS) or high-throughput sequencing has revolutionized the field of microbial ecology and brought classical environmental studies to another level. This type of cutting-edge technology has led to the establishment of the field of “metagenomics”, defined as the direct genetic analysis of genomes contained within an environmental sample without the prior need for cultivating clonal cultures. Initially, the term was only used for functional and sequence-based analysis of the collective microbial genomes contained in an environmental sample, but currently it is also widely applied to studies performing polymerase chain reaction (PCR) amplification of certain genes of interest. The former can be referred to as “full shotgun metagenomics”, and the latter as “marker gene amplification metagenomics” (ie, 16S ribosomal RNA gene) or “meta-genetics”.
Such methodologies allow a much faster and more detailed genomic/genetic profile of an environmental sample to be generated at a very acceptable cost. Full shotgun metagenomics has the capacity to fully sequence the majority of available genomes within an environmental sample (or community). This creates a community biodiversity profile that can be further associated with functional composition analysis of known and unknown organism lineages (ie, genera or taxa). Shotgun metagenomics has evolved to address the questions of who is present in an environmental community, what they are doing (function-wise), and how these microorganisms interact to sustain a balanced ecological niche. It further provides unparalleled access to the functional gene composition of microbial communities inhabiting natural ecosystems.
Marker gene metagenomics is a fast, if coarser, way to obtain a community/taxonomic distribution profile or fingerprint using PCR amplification and sequencing of evolutionarily conserved marker genes, such as the 16S rRNA gene. This taxonomic distribution can subsequently be associated with environmental data (metadata) derived from the sampling site under investigation.
Several types of ecosystems have been studied so far using metagenomics, including extreme environments such as areas of volcanism or areas of extreme temperature, alkalinity, acidity, low oxygen, and high heavy-metal composition.17 This invaluable resource provides an immense capacity for bioprospecting and allows the discovery of novel enzymes capable of catalyzing reactions of commercial biotechnological interest.
The first metagenomic studies focused on low-diversity environments, such as an acid mine drainage site, the human gut microbiome, and water samples from the Sargasso Sea, mainly because, at the time, neither high-throughput sequencing technologies nor software suitable for scaffold assembly were available. As more and more researchers entered this new field of study, the need for powerful tools and software became apparent and led to the creation of several such tools.
Sequencing Technologies
The two NGS technologies most commonly utilized to date are the 454 Life Sciences and the Illumina systems, with the ratio of usage recently shifting in favor of the latter. Both technologies have been widely used in metagenomic studies, and hence it is important to briefly describe their advantages and disadvantages with respect to the sequencing of metagenomic samples.
The 454 pyrosequencer was the first next-generation sequencer to be introduced commercially, in 2004. Its chemistry relies on the immobilization of DNA fragments on DNA-capture beads in a water–oil emulsion, followed by PCR amplification of the fixed fragments. The beads are placed on a PicoTiterPlate (a fiber-optic chip). DNA polymerase is also packed in the plate, and pyrosequencing is performed. Its main difference from classic Sanger sequencing is that pyrosequencing relies on the detection of pyrophosphate release on nucleotide incorporation rather than chain termination with dideoxynucleotides. The release of pyrophosphate is converted into light through enzyme reactions, which is then translated into actual sequence information.
In the initial years of high-throughput sequencing, scientists embraced the new technology and thereby discovered the existence of the “rare biosphere”. However, in many cases the apparent assignment of a microbial operational taxonomic unit (OTU) was in fact an artifact of sequencing errors, which caused an overinflation of diversity estimates.27 Noise generated by 454 pyrosequencing affected different aspects of metagenomic data analysis and led to biased results.
PCR errors may lead to replicate sequence artifacts, which can cause overestimation of species abundance and functional gene abundance in 16S rRNA and full shotgun metagenomics, respectively. PCR can also generate noise in the form of single base pair errors (ie, substitutions, deletions) that can cause frame shifts in protein coding genes in shotgun metagenomics. Moreover, PCR chimeras (sequences generated by undesired end-joining of two or more true sequences) can also affect 16S metagenomics results with respect to species distribution. Sequencing errors can also occur due to the actual chemistry underlying the technology. For example, there is an inherent difficulty in clearly resolving the signal intensities of 454 pyrosequencing-generated flowgrams. This task becomes even more difficult during the sequencing of homopolymers. The 454 pyrosequencing technology can generate reads up to 1,000 bp in length and ~1,000,000 reads per run. The relatively long read length generated by this technology (in comparison to other sequencing technologies) allows a significantly less error-prone assembly in shotgun metagenomics and permits greater annotation accuracy. The cost of sequencing using 454 pyrosequencing technology is estimated at around US$20 per Mb, but it has a relatively low coverage of 0.7 GB per sequencing run. With respect to pyrosequencing, <20 ng of DNA is sufficient for sequencing single-end libraries, although paired-end sequencing may require larger quantities of DNA.
Although the 454 platform will eventually stop being supported, one should take into account that a large number of existing unpublished datasets have been generated with this technology. It is therefore important to include it in this review and to compare it with the sequencing platforms that have become more popular over the last years, namely Illumina.
Illumina dye sequencing by synthesis begins with the attachment of DNA molecules to primers on a slide, followed by amplification of that DNA to produce local colonies. This generation of “DNA clusters” is accompanied by the addition of fluorescently labeled, reversible terminator bases (adenine, cytosine, guanine, and thymine) attached to a blocking group. The four bases then compete for binding sites on the template DNA to be sequenced, and the nonincorporated molecules are washed away. After each synthesis cycle, a laser is used to excite the dyes, and a high-resolution scan of the incorporated base is made. A chemical deblocking step removes the 3′ terminal blocking group and the dye in a single step. The process is repeated until the full DNA molecule is sequenced. Illumina has a variety of sequencing instruments dedicated to different applications. MiSeq, for example, has an output of 15 GB and 25 million sequencing reads of 300 bp in length; clustered fragments can be sequenced from both ends (paired-end sequencing), and the two reads can be merged so that 600 bp fragments are obtained. HiSeq2500 has a much greater output (1,000 GB per run) but offers 125 bp reads. Illumina sequencing comes at a much lower cost (~US$0.50 per Mb), but the run time is longer than that of 454 pyrosequencing. This drawback is addressed by the MiSeq instrument, which has been developed to run smaller jobs at a much faster rate with relatively high throughput. Illumina allows sample preparation from <20 ng of DNA (similar to 454 pyrosequencing). The shorter read length produced by Illumina may increase errors during assembly and, subsequently, annotation inaccuracies during shotgun metagenomics data analysis. In contrast, when analyzing 16S metagenomics data, this technology obviates the need for the time-consuming noise removal algorithms required for pyrosequencing and makes analysis less error-prone. The greater coverage/yield generally offered by Illumina allows a significant decrease in systematic errors. This advantage and the low cost are the decisive factors that have turned Illumina into the preferred high-throughput sequencing technology for metagenomics studies.
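To make the paired-end merging step mentioned above concrete, the following minimal Python sketch joins a forward read with the reverse complement of its mate when their ends overlap. The reads, overlap threshold, and mismatch allowance are all illustrative; real pipelines use dedicated, quality-aware merging tools.

```python
# A minimal sketch (not a production tool) of paired-end read merging:
# join the forward read with the reverse complement of its mate when
# their ends overlap. Reads and thresholds below are illustrative.

def revcomp(seq):
    comp = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))

def merge_pair(r1, r2, min_overlap=10, max_mismatch=2):
    """Merge forward read r1 with reverse read r2 if their ends overlap."""
    r2 = revcomp(r2)
    # Try the longest plausible overlap first, shrinking until one fits.
    for olen in range(min(len(r1), len(r2)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(r1[-olen:], r2[:olen]))
        if mismatches <= max_mismatch:
            return r1 + r2[olen:]  # merged fragment spanning both reads
    return None  # no confident overlap; keep the reads unmerged

# Toy 14 bp "reads" from a 20 bp fragment, overlapping by 8 bp:
print(merge_pair("ACGTACGTACGTAA", "GGAACCTTACGTAC", min_overlap=6))
# -> ACGTACGTACGTAAGGTTCC
```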
Additional sequencing technologies are available and can potentially be used for metagenomic studies. These include the Applied Biosystems SOLiD 5500 W Series sequencer, which offers higher coverage than 454 pyrosequencing but lower than Illumina (~120 GB per run). It allows fragment or mate-paired sequencing; however, it can only guarantee a low error rate for sequencing reads of at most 50 bp in length. This reduces the possibility of generating a reliable and usable de novo assembly for shotgun metagenomics; on the other hand, this technology performs very well when a reference genome is available for mapping or assembly of reads. Moreover, using the Exact Call Chemistry (ECC) module, the SOLiD system can boost the accuracy of its ligation-based sequencing.
An emerging sequencing technology that may have a high impact on the fields of genomics and metagenomics was recently developed by Pacific Biosciences (PacBio). This technology uses single-molecule real-time (SMRT) sequencing, a parallelized single-molecule DNA sequencing-by-synthesis approach. SMRT sequencing utilizes the zero-mode waveguide (ZMW), whereby a single DNA polymerase enzyme is fixed to the bottom of a ZMW with a single molecule of DNA as a template. The ZMW is a structure that creates an illuminated observation volume small enough to allow the observation of a single nucleotide of DNA (also known as a base) being incorporated by DNA polymerase. Each of the four DNA bases is attached to one of four different fluorescent dyes. When a nucleotide is incorporated by the DNA polymerase, the fluorescent tag is cleaved off and diffuses out of the observation area of the ZMW, where its fluorescence is no longer observable. A detector records the fluorescent signal of the nucleotide incorporation, and the base call is made according to the corresponding fluorescence of the dye. PacBio provides much longer read lengths (~10,000 bp) than the aforementioned technologies, with obvious advantages when addressing issues of annotation and assembly for shotgun metagenomics. PacBio technology uses a process called strobing to perform paired-end read sequencing. Despite its high read length, this technology is limited by high error rates and low coverage (albeit at higher throughput than Sanger sequencing).
In addition to the aforementioned technologies, which are based on optics, technologies such as Ion Torrent's semiconductor sequencing benchtop sequencer and Ion Proton are now coming into play. These technologies detect nucleotide incorporation via the protons emitted during DNA polymerization. This system promises read lengths of >200 bp and relatively high throughput, on the order of magnitude achieved by 454 Life Sciences systems. Additionally, it offers higher quality than 454, especially when sequencing homopolymers, but at a similar cost (about US$23 per Mb for the Ion Torrent PGM 314 Chip). Looking into the future, and given that the 454 platform will eventually be phased out, it is very likely that former users of 454 pyrosequencing will switch to Ion Torrent sequencing chemistry, owing to the similarities between the two (eg, the emulsion PCR step) and the significant advantages of the latter.
An even more cutting-edge technology is currently under development by Oxford Nanopore Technologies, which is developing “strand sequencing”, a method of DNA analysis that could potentially sequence completely intact DNA strands/polymers passed through a protein nanopore. This obviates the need for shotgun sequencing and aims to revolutionize the sequencing industry in the future. Oxford Nanopore intends to commercialize this technology with the company's GridION™ and MinION™ systems. For metagenomics, this technology can have obvious advantages, as it would eliminate the errors introduced by shotgun fragmentation and exclude the need for the error-prone assembly step during data analysis (see below). However, nanopore sequencing is at the moment noncommercialized (offered only through the MinION™ Access Program) and is still being optimized on a case-by-case basis for specific templates and sequencing needs.
Another example of an innovative and very promising technology is the Irys Technology (BioNano Genomics), which uses micro- and nanostructures and offers new ways of constructing genome maps de novo. The input is DNA labeled at specific sequence motifs that can be used for imaging and identification in IrysChips. These labeling steps result in a uniquely identifiable, sequence-specific pattern of labels to be used for de novo map assembly or for anchoring sequencing contigs.
Shotgun Metagenomics
Assembly of shotgun metagenomics data
Metagenomics studies are commonly applied to investigate the specific genomes (known as well as unknown, both cultured and uncultured) that are present within an environmental community under study. Moreover, when performing full shotgun metagenomics, the complete sequences of protein coding genes (previously characterized or novel) as well as full operons in the sequenced genomes can offer invaluable functional knowledge about the community. For these reasons, an assembly of shorter reads into genomic contigs, and the orientation of these into scaffolds, is often performed to provide a more compact and concise view of the sequenced community under investigation. Early attempts at metagenomic data assembly utilized tools initially implemented for single genome assemblies. They therefore fell short when forced to assemble reads into contigs for metagenomic samples. However, assembly tools have significantly evolved since then, and the current line of tools have been modified and specifically designed to assemble samples containing multiple genomes, thereby rendering them much more effective for the task at hand.
The process of assembling shorter reads into contigs can take two different routes: 1) reference-based assembly and 2) de novo assembly. The choice of which route to follow depends on the dataset that needs to be analyzed and on the specific needs of each research project. For example, de novo assembly could in theory be used even when a reference genome exists, provided the computational power allows for it.
Reference-based assembly refers to the use of one or more reference genomes as a “map” in order to create contigs, which can represent genomes or parts of genomes belonging to a specific species or genus. Tools such as Newbler (Roche), MIRA 4, or AMOS, as well as the recent MetAMOS, are commonly used in metagenomics for performing reference-based assemblies. These tools are not computationally intensive and perform well when metagenomic samples are derived from extensively studied and researched environments. In such cases, sequences from closely related organisms would already have been deposited in online data repositories and databases, allowing them to be used as references for the assembly process. Often, assemblies are visually evaluated using genome browser tools such as Artemis. The observation of large gaps in the query genome(s) of the resulting assembly, when compared to the reference genome(s), can be taken as an indication that the assembly is incomplete or that the reference genome(s) used are too distantly related to the community under investigation for the assembly to perform optimally.
De novo assembly refers to the generation of assembled contigs without reference to known genome(s). This task is computationally expensive and relies heavily on sophisticated graph theory algorithms, such as de Bruijn graphs, which were specifically employed to tackle this job. Tools such as EULER, Velvet, SOAP, and ABySS were amongst the first to perform de novo assembly and are still widely used today. They require computers with large amounts of memory and generally long execution times (depending on the size of the dataset). However, these tools were built with the assumption of assembling a single genome and often underperform when used for metagenome assemblies. Problems arise from 1) variation between similar subspecies, 2) genomic sequence similarity between different species, and 3) differences in abundance between species in a sample, compounded by uneven sequencing depths across individual species. These issues introduce kinks (or branches) in the de Bruijn graph, which have to be addressed in order to improve the assembly.
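As a minimal illustration of the de Bruijn graph idea underlying these assemblers, the Python sketch below maps each (k-1)-mer prefix to the suffixes that follow it. The reads and the value of k are toy inputs; real assemblers add error correction, graph simplification, and paired-end resolution on top of this basic structure.

```python
# Toy de Bruijn graph construction from short reads. Each k-mer in a
# read contributes an edge from its (k-1)-mer prefix to its (k-1)-mer
# suffix; assembly then corresponds to walking paths in this graph.
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return graph

reads = ["ACGTAC", "CGTACG", "GTACGT"]  # illustrative overlapping reads
for node, neighbours in de_bruijn(reads, k=4).items():
    print(node, "->", neighbours)
```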
The next generation of assembly tools, such as MetaVelvet, its recent successor MetaVelvet-SL, and Meta-IDBA, was developed to address these issues. MetaVelvet and Meta-IDBA employ a combined binning (for details on binning, see below) and assembly approach to create more accurate assemblies from datasets containing a mixture of multiple genomes. They make use of k-mer frequencies to detect kinks in the de Bruijn graph and then use k-mer thresholds to decompose the graph into subgraphs. These tools further assemble contigs and scaffolds based on the decomposed subgraphs, and thus perform a more efficient grouping/assembly of contigs, effectively separating those belonging to different species.
The IDBA-UD algorithm was recently developed to additionally address the problem of uneven sequencing depths in metagenomic data. It makes use of multiple depth-relative k-mer thresholds to remove erroneous k-mers in both low-depth and high-depth regions. Comparison of the performance of these tools is often based on the N50 length score, which is defined as “the length for which the collection of all contigs of that length, or longer, contains at least half of the total of the lengths of the contigs in the assembly”. A recent comparison of the latest line of assembly tools shows that IDBA-UD can reconstruct longer contigs with higher accuracy. However, there is still much room for improvement of metagenomic assembly algorithms in order for them to conceptually capture the task at hand.
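The N50 definition quoted above translates directly into a few lines of Python; the contig lengths in the example are made up.

```python
# N50 computed directly from the definition above: the length L such that
# contigs of length >= L together contain at least half of the total
# assembly length.
def n50(contig_lengths):
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

print(n50([100, 80, 60, 40, 20]))  # total=300; 100+80=180 >= 150 -> N50=80
```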
Binning tools for metagenomes
Binning is the process of grouping (binning) reads or contigs into individual genomes and assigning the groups to specific species, subspecies, or genera. Binning methods can be characterized in two different ways depending on the information used to group the sequences at hand: 1) Composition-based binning is based on the observation that individual genomes have a unique distribution of k-mer sequences (also denoted as genomic signatures). By making use of this conserved species-specific nucleotide composition, these methods are capable of grouping sequences into their respective genomes. 2) Similarity- or homology-based binning refers to the process of using alignment algorithms such as BLAST or profile hidden Markov models (pHMMs) to obtain similarity information about specific sequences/genes from publicly available databases (eg, NCBI's nonredundant database, nr, or PFAM). Thereafter, sequences are binned according to their assigned taxonomic information.
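As a minimal sketch of such a genomic signature, assuming simple frequency counts without the normalization that real tools apply, the following Python snippet computes the 256-dimensional tetranucleotide frequency vector of a sequence:

```python
# Sketch of a composition-based genomic signature: the 256-dimensional
# tetranucleotide frequency vector. The sequence is illustrative; real
# tools also normalize counts against expected background frequencies.
from itertools import product
from collections import Counter

def tetra_signature(seq):
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = max(sum(counts.values()), 1)
    # Fixed ordering over all 4**4 = 256 tetranucleotides.
    return [counts["".join(t)] / total for t in product("ACGT", repeat=4)]

sig = tetra_signature("ACGTACGTACGTTTGCAACGT" * 50)
print(len(sig), round(max(sig), 3))  # 256 dimensions; peak frequency
```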
Available composition-based binning algorithms are included in tools such as TETRA, S-GSOM, Phylopythia and its successor PhylopythiaS, TACOA, PCAHIER, ESOM, and ClaMS, while examples of purely similarity-based binning software include tools such as CARMA, MetaPhyler, and SOrt-ITEMS. Some tools employ similarity-based binning algorithms in their metagenomics analysis pipelines. Examples of such tools are IMG/MER 4, MG-RAST, and MEGAN, which will be described in more detail below.
Certain binning tools employ a hybrid approach using both composition- and similarity-based information to group sequences. Examples of such tools are PhymmBL and MetaCluster. More innovative binning approaches include co-abundance gene segregation across a series of metagenomic samples, thus facilitating the assembly of microbial genomes without the need for reference sequences. This new method promises to overcome the usual computational challenges of other binning tools and has been tested on a human gut microbiome.
Binning tools can further be characterized with respect to the type of algorithm they employ: 1) ab initio unsupervised classifiers and 2) supervised/training-based classifiers. Unsupervised binning refers to the process of using pre-existing bins derived from genomic sequences to classify a given dataset without user supervision. In contrast, supervised binning allows user interference and supervision in the training process per se. More specifically, the user may specify the type of sequences that will be used to train each bin and, furthermore, select sequences from known taxonomic lineages to use while training the classifier. Sophisticated algorithms such as support vector machines (PhylopythiaS), hidden Markov models (PhymmBL, TETRA), as well as self-organizing maps (ESOMs) have been used in binning algorithms. However, tools such as PhylopythiaS and TETRA allow little user intervention, while ClaMS and ESOM provide a more supervised training approach that can be fine-tuned to allow optimal classification for the specific dataset under consideration.
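To give a flavor of the supervised, composition-based approach, in the spirit of PhylopythiaS but greatly simplified, the sketch below trains a linear SVM on dinucleotide signatures of two synthetic "taxa" and classifies a new fragment. It assumes scikit-learn is installed; all sequences and labels are invented.

```python
# Hedged sketch of supervised composition-based binning: train an SVM on
# k-mer signatures of labeled reference fragments, then classify unknown
# contigs. Training data here are synthetic, not real genomes.
from collections import Counter
from sklearn.svm import SVC

def kmer_vector(seq, k=2):
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    keys = [a + b for a in "ACGT" for b in "ACGT"]
    total = max(sum(counts.values()), 1)
    return [counts[key] / total for key in keys]

train_seqs = ["ACACACACGTGT" * 10, "GGGGCCCCGGCC" * 10]  # two mock taxa
labels = ["taxonA", "taxonB"]
clf = SVC(kernel="linear").fit([kmer_vector(s) for s in train_seqs], labels)
print(clf.predict([kmer_vector("ACACACGTACAC" * 10)]))  # expected: ['taxonA']
```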
There are certain aspects that one must take into consideration when performing the binning of metagenomic sequences. Composition-based binning using genomic signatures has its drawbacks, especially when performed on short reads (ie, 150 bp). Given that all possible tetranucleotide combinations amount to 256, short reads are unlikely to yield sufficient information to reliably assign a taxonomic rank to a specific bin. Therefore, it is common practice to perform composition-based binning on assembled datasets. This way, longer contigs can provide the required k-mer distribution information, which allows effective binning and taxonomic assignment. Observation of a taxonomic marker sequence (ie, the 16S rRNA gene) within the bins can further facilitate reliable taxonomic assignment for the respective bin. Similarity-based binning also has its disadvantages. Although capable of binning reads of short length, it fails to do so accurately when the metagenome under consideration consists of numerous closely related species. This may cause assignment of closely related sequences to the same reference genome, perhaps at a higher taxonomic level (ie, order or class), thereby generating bins containing a mixture of genomes. Therefore, optimal binning results are expected to be attained by combining both composition- and similarity-based approaches, as adopted by hybrid tools such as PhymmBL and MetaCluster.
Annotation of metagenomics sequences
Annotation pipelines for metagenomes are specifically designed to work with mixtures of genomes and contigs of varying length. Initially, a series of preprocessing steps prepare the reads for annotation. These include 1) Trimming of low-quality reads using platform-specific tools such as the FASTX-Toolkit; additionally, FastQC can provide summary statistics for FASTQ files, and both have recently been integrated into the Galaxy platform, while SolexaQA and Lucy 2 are also used for FASTQ files. Most of these tools make use of Phred (Q) quality scores, the thresholds of which depend on the sequencing technology (a minimal illustration of Phred-based trimming is sketched below); 2) Masking of low-complexity reads, performed using tools such as DUST; 3) A de-replication step that removes sequences that are more than 95% identical; 4) A screening step performed by some tools (ie, MG-RAST) in which the pipeline provides the option of removing reads that are near-exact matches to the genomes of a handful of model organisms, including fly, mouse, cow, and human. This is done using mapping tools such as Bowtie 2.
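The sketch below shows the arithmetic behind Phred-based trimming (Q = -10 log10 P, with the common Sanger/Illumina 1.8+ ASCII offset of 33). The read, quality string, and threshold are illustrative; production pipelines use the dedicated tools named above.

```python
# Minimal FASTQ quality-trimming sketch using Phred scores. Each quality
# character encodes Q = ord(char) - 33; Q20 corresponds to a 1% error
# probability. Thresholds and the example read are illustrative.
def phred_scores(quality_string, offset=33):
    return [ord(ch) - offset for ch in quality_string]

def trim_read(seq, qual, min_q=20):
    """Truncate the read at the first base whose quality drops below min_q."""
    scores = phred_scores(qual)
    for i, q in enumerate(scores):
        if q < min_q:
            return seq[:i], qual[:i]
    return seq, qual

print(trim_read("ACGTACGT", "IIIIII#I"))  # '#' = Q2 -> read trimmed to 6 bp
```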
The next main stage of the annotation pipeline is the identification of genes within the reads/assembled contigs, a process often denoted as “gene calling”. Genes are labeled as coding DNA sequences (CDSs) and noncoding RNA genes, and certain annotation pipelines (eg, IMG/MER) also predict regulatory elements such as clustered regularly interspaced short palindromic repeats (CRISPRs).
CDSs are identified using a number of tools including MetaGeneMark, Metagene, Prodigal, Orphelia, and FragGeneScan, all of which utilize ab initio gene prediction algorithms. Often, annotation pipelines use an intersection of these tools to obtain a more informative prediction of the protein coding genes. Gene prediction tools utilize codon information (ie, the start codon, AUG) to identify potential open reading frames and hence label sequences as coding or noncoding. Most tools can be trained using a desired training set. For example, FragGeneScan is trained for prokaryotic genomes only, and is used by IMG/MER and MG-RAST as well as EBI Metagenomics. It is believed to be one of the most accurate gene-prediction tools currently available. However, like most of these tools, it is expected to have an average prediction accuracy of only ~65%–70%, meaning that many genes are missed altogether.
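For intuition, a toy open-reading-frame scan over the three forward frames is sketched below. Real gene callers such as Prodigal or FragGeneScan rely on trained statistical models, handle both strands, and cope with fragmented genes, none of which this sketch does.

```python
# Toy ab initio ORF scan: report start-to-stop stretches in the three
# forward reading frames only. Coordinates and the minimum length are
# illustrative; this is not a substitute for a trained gene caller.
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=10):
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                       # open a candidate ORF
            elif codon in STOPS and start is not None:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # half-open ORF coordinates
                start = None
    return orfs

print(find_orfs("ATG" + "GCT" * 12 + "TAA"))  # -> [(0, 42)]
```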
CRISPR elements are identified by programs such as CRT and PILER-CR. IMG/MER uses a concatenation of results obtained from both these programs, retaining the longest element prediction in case of overlap.
Noncoding RNAs such as tRNAs are predicted using programs like tRNAscan. Ribosomal RNA (rRNA) genes (5S, 16S, and 23S) are predicted using internally developed rRNA models in IMG/MER, whereas MG-RAST predicts rRNA genes by similarity searches against three well-known databases (SILVA, Greengenes, and the Ribosomal Database Project, RDP).
The next stage of the annotation pipeline involves functional assignment of the predicted protein coding genes. This is currently achieved by homology-based searches of query sequences against databases containing known functional and/or taxonomic information. Due to the large size of metagenomic datasets, this stage is often computationally very expensive and highly automated. BLAST or other sequence-similarity-based algorithms often run on high-performance computer clusters. Often, multithreading or other parallel programming approaches are used to divide jobs among multiple central/graphics processing units (CPUs/GPUs), significantly reducing execution time.
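One common pattern, sketched below under the assumption that BLAST+ (blastp) and a formatted protein database are installed, and that the query FASTA has already been split into chunk files (the file names are invented), is to fan the chunks out over a process pool:

```python
# Sketch of embarrassingly parallel homology searching: split the query
# FASTA into chunks and run one BLAST+ process per chunk. Assumes the
# blastp binary and an "nr" database are available; names are illustrative.
import subprocess
from multiprocessing import Pool

CHUNKS = ["chunk_00.faa", "chunk_01.faa", "chunk_02.faa"]  # pre-split queries

def run_blast(chunk):
    out = chunk.replace(".faa", ".tsv")
    subprocess.run(
        ["blastp", "-query", chunk, "-db", "nr",
         "-outfmt", "6", "-out", out],  # tabular output, one file per chunk
        check=True,
    )
    return out

if __name__ == "__main__":
    with Pool(processes=3) as pool:
        print(pool.map(run_blast, CHUNKS))  # one result table per chunk
```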
Some widely used data repositories for obtaining annotations for metagenomic datasets include functional annotation databases such as KEGG, SEED, eggNOG, and COG/KOG, as well as protein domain databases such as PFAM and TIGRFAM. Often, annotation pipelines make use of multiple databases, or of composite protein domain databases such as Interpro (see EBI Metagenomics), in order to obtain a more collective, cumulative biological functional annotation.
IMG/MER utilizes HMMsearch (profile HMMs) to associate genes with PFAM, and genes are further annotated using COGs. Databases of position-specific scoring matrices (PSSMs) for COGs are downloaded from NCBI and used to annotate protein sequences. Moreover, genes are labeled with KEGG-associated KO terms and EC numbers, and assigned phylogeny using similarity searches. With a large set of genomes in its public repositories, IMG/MER can exploit its own resources, using them as nonredundant reference databases from which it obtains additional functional annotation.
MG-RAST utilizes many of the databases described above for annotation mapping as well as the NCBI taxonomy. The primary data product displayed to the user by MG-RAST is in the form of abundance profiles, and taxonomic information is projected against this data.
Both IMG/MER and MG-RAST are widely used data management repositories and comparative genomics environments. They are fully automated pipelines that provide quality control, gene prediction, and functional annotation. Both tools support user download of data products generated, as well as optional sharing and publishing within the respective portals. However, there are important differences between MG-RAST and IMG/MER that are relevant to the way MG-RAST calculates abundance profiles.
MG-RAST predicts all genes in the metagenome, and then identifies the best homologs of those genes in isolate genomes using a tool called BLAT (BLAST-like alignment tool). BLAT misses similarities below 70% identity, so many strong hits to other genes are missed. After the best hits to genes from isolate genomes are identified, all subsequent analysis is done using the genes of the isolate genomes, not the genes of the metagenome at hand. This creates limitations, because the analysis is not performed on the original genes of the metagenome but on these “proxy” genes from the isolate genomes instead. The advantage of this method is its speed; the only computationally intensive step is to find the best hits of the metagenome genes against the isolates. Once this is done, all other comparisons are already pre-computed. The other major advantage is that the MG-RAST database does not grow in size, as is the case with the IMG/MER database.
IMG/MER also begins with the prediction of all genes from the metagenome, but then runs all computations on those genes rather than on their proxies. This allows the identification of PFAM hits (which is not supported in MG-RAST) and provides much more detailed functional information than COGs, the only protein families database used in MG-RAST. The major bottleneck for IMG/MER is the exponential growth of the gene number, which is not an issue for MG-RAST since the metagenome genes are not kept for analysis. It is, however, important to use PFAM for functional analysis: comparing the number of genes from any metagenome that fall into COG or PFAM clusters shows that the latter provides significantly higher coverage and therefore allows a much deeper analysis. Another major advantage of IMG/MER is that, since the tool keeps the original metagenome genes, it also keeps the original contigs, which provide synteny information. It is therefore far more suitable if one is interested in identifying novel biosynthetic gene clusters (BGCs) in metagenomes, a type of analysis that may be less viable using MG-RAST. The prediction of BGCs from metagenomics data has recently been gaining a great deal of interest due to their potential in biotechnological applications. The possibility of engineering BGCs for the production of secondary metabolites with improved properties, known for their use in anticancer drugs and antibiotics, offers limitless potential for bioprospecting.
The EBI Metagenomics service is a newly developed web-based portal that uses metadata structures and formats complying with the Genomic Standards Consortium (GSC) guidelines. Moreover, a novel data scheme currently hosted by EMBL-EBI is being adopted by the EBI Metagenomics service. This is known as the European Nucleotide Archive (ENA) data schema and aims to integrate data derived from sequencing technologies under a consensus, mutually accepted standard. EBI Metagenomics offers a dual shotgun and marker gene analysis service. It allows the extraction of rRNA data from shotgun metagenomic data using tools such as rRNASelector for concurrent marker gene metagenomic analysis. It therefore supports additional 16S rRNA-based analysis tools such as QIIME (see the section on Marker Gene Metagenomics) for the efficient taxonomic assignment of these sequences. For functional analysis and annotation of CDS sequences, EBI Metagenomics uses FragGeneScan to obtain protein coding sequences and thereafter utilizes databases such as Interpro, a composite, cumulative system comprising multiple protein family databases, which allows protein domain prediction and functional assignment. EBI Metagenomics provides data archiving via ENA and provides unique accession numbers for submitted datasets. Archiving policies require the data to be made public; however, there is a 2-year period (upon submission) during which the data are kept private pending user publication of analysis results.
CAMERA is another online cloud computing service that provides hosted software tools and a high-performance computing infrastructure for the analysis of metagenomic data. One advantage of CAMERA is that it allows greater user intervention and flexibility during the analysis process. However, this means that users must have expertise, knowledge, and hands-on experience in metagenomic data analysis per se, in order to ensure correct execution of the pipeline and accuracy of results. Moreover, in order to perform comparative metagenomics using CAMERA, the datasets at hand must be run through the CAMERA pipeline, making the integration of data from different resources more computationally demanding. MEGAN 5 is yet another tool that performs analysis of metagenomic data and offers a wide range of visualization tools for metagenomic annotation results. It supports multiple visualization schemes, including functional or taxonomic dendrograms, tag clouds, bar charts, and Krona taxonomic plots, which allow hierarchical data to be explored in the form of zoomable pie charts.
Marker Gene Metagenomics
It is widely accepted that sequencing of the 16S rRNA gene reflects eubacterial evolution. Since the introduction of SSU rDNA-based molecular techniques, the study of microbial diversity in natural environments has advanced significantly. In addition, pyrosequencing of the 16S rRNA gene has been widely applied in the field of microbial ecology and has resulted in a great number of sequences deposited in relevant databases, thus enhancing the value of 16S as the “gold standard” in microbial ecology. While the 16S rRNA gene fragment, containing one or more variable regions, is the preferred target marker gene for bacteria and archaea, this is not the case for fungi and other eukaryotes, where the preferred marker genes are the internal transcribed spacer (ITS) and the 18S rRNA gene, respectively.
Taxonomic analysis of prokaryotes (ie, bacteria and archaea) is regularly performed using 16S data derived from various sequencing technologies (ie, 454 pyrosequencing as well as Illumina, SOLiD, and Ion Torrent), and, for the purposes of this review, we list software relevant to most sequencing technologies. Commonly used tools for 16S data analysis and denoising include QIIME, Mothur, SILVAngs, MEGAN, and AmpliconNoise. Despite the vast availability of algorithms and software for the analysis of 16S metagenomics datasets, QIIME appears to have become established as the “gold standard”.
It is important to be aware of certain terminology required for the efficient analysis of 16S metagenomics data. This includes the following: 1) Amplicon – a DNA fragment that is amplified by PCR, eg, one or more 16S rRNA variable regions or other marker genes. Most researchers will make use of standard PCR primers; 2) OTU – the operational proxy for species distinction in microbiology, typically defined using rRNA sequences and a percentage similarity threshold for classifying microbes within the same, or different, OTUs; 3) Barcode – a short DNA sequence that is added to each read during amplification and that is specific for a given sample. This allows samples to be mixed (multiplexed) to reduce sequencing cost. During analysis, sequences need to be demultiplexed, ie, separated by sample, as sketched below.
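A toy version of that demultiplexing step follows: each read is assigned to a sample by its 5' barcode, which is then stripped. The barcodes, sample names, and reads are all invented.

```python
# Toy demultiplexing: route each read to its sample via the leading
# barcode, then remove the barcode from the read. Values are illustrative.
BARCODES = {"ACGT": "sample_1", "TGCA": "sample_2"}
BARCODE_LEN = 4

def demultiplex(reads):
    by_sample = {name: [] for name in BARCODES.values()}
    unassigned = []
    for read in reads:
        sample = BARCODES.get(read[:BARCODE_LEN])
        if sample:
            by_sample[sample].append(read[BARCODE_LEN:])  # barcode removed
        else:
            unassigned.append(read)  # unknown barcode, set aside
    return by_sample, unassigned

print(demultiplex(["ACGTGGGTTTCCC", "TGCAAAACCCGGG", "NNNNACGTACGT"]))
```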
Analysis usually requires a reference database that is searched to find the closest match to an OTU, from which a taxonomic lineage is inferred. Some widely utilized databases include Greengenes (16S), the Ribosomal Database Project (16S), SILVA (16S + 18S), and UNITE (ITS). These databases are less suitable for certain groups of organisms, such as protists and viruses, which are extremely diverse and for which considerably less sequence information is available compared to bacteria.
Denoising
Denoising is important for 16S metagenomic data analysis, and it is platform-specific; ie, certain platforms (eg, Illumina) require less denoising than others (eg, pyrosequencing). For example, denoising of 454 pyrosequencing data, despite being computationally expensive, is necessary due to intrinsic errors generated by pyrosequencing that can give rise to erroneous OTUs. A procedure called “flowgram clustering” removes problematic reads and increases the accuracy of the taxonomic analysis. Several denoising algorithms have been developed so far, but for the purpose of this review three of them will be described in detail.
Denoising is performed very efficiently by AmpliconNoise, a tool that uses the following basic denoising steps: 1) Filtering of noisy reads: reads are truncated based on the appearance of low signal intensities; 2) Removing pyrosequencing noise: a distance between flowgrams is defined, and true sequences and their frequencies are inferred by an expectation-maximization (EM) algorithm; 3) Removing PCR noise: the same ideas are used for removing PCR errors; 4) Chimera identification and removal: for each sequence, exact pairwise alignments are performed against all sequences of equal or greater abundance, which constitute the set of possible parents. Although a considerable number of sequences are lost during the denoising process, it results in high-quality sequences; however, there has been some debate on the level of stringency required to achieve such high quality.
A very popular software package for the analysis of microbial communities is QIIME. Initially, QIIME was implemented for 454 pyrosequencing datasets only, ie, using sff (Standard Flowgram Format) files, but it has since been modified to accept the fastq file format, thereby making the analysis of Illumina datasets possible. The QIIME developers provide users with extensive online tutorials for several workflows, and, moreover, QIIME is available as an open-source software package mostly implemented in the programming language Python.
Another widely used software package for the analysis of microbial communities is Mothur. It was created from the combination of pre-existing software, such as DOTUR, SONS, and Treeclimber, but, owing to the community support it has received, it currently incorporates many more algorithms, thus providing the user with a variety of choices.
More recently, a web-based application called SILVAngs was developed, which provides a fully automated analysis pipeline for data derived from rRNA marker gene amplicon sequencing. The analysis workflow is based on 1) alignment of reads, 2) quality assessment and filtering of reads, 3) dereplication, whereby identical sequences are collapsed to avoid overestimation (a minimal sketch of this step follows below), 4) clustering and OTU picking using a priori defined thresholds, and 5) taxonomic assignment of OTUs using the SILVA rDNA database.
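As a minimal sketch of the dereplication step, assuming exact-duplicate collapsing only (real pipelines may also cluster near-identical reads), the following snippet collapses identical sequences while retaining their abundances:

```python
# Minimal dereplication: collapse identical sequences and keep their
# abundances so that downstream OTU counts are not inflated by duplicates.
from collections import Counter

def dereplicate(seqs):
    counts = Counter(seqs)
    # Most abundant unique sequences first, as pipelines typically report.
    return counts.most_common()

print(dereplicate(["ACGT", "ACGT", "GGCC", "ACGT"]))
# -> [('ACGT', 3), ('GGCC', 1)]
```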
The choice of denoising algorithm largely depends on the user. Once a choice is made, the user should also consider whether to deviate from the default parameters. Parameter adjustment is related to the dataset produced, ie, which specific 16S rRNA region was sequenced and which technology was used to perform the actual sequencing. In addition, it has been suggested that the use of different denoising methods can produce significantly different outcomes, which should be taken into careful consideration when comparing studies that have utilized different algorithms for data analysis.
OTU clustering, picking, and taxonomic assignment
After the demultiplexing of the dataset, ie, the assignment of reads to samples using barcode information, the next step is OTU picking. For bacteria/archaea, it is accepted that OTUs of similarity greater than 97% correspond to the same species, but other dissimilarity cutoffs can be employed if needed for the downstream analyses. There are numerous OTU picking strategies: 1) De novo picking is used if amplicons overlap and a reference sequence collection is not available. It clusters all reads without using a reference and is computationally quite expensive, hence not very suitable for very large datasets. 2) Closed-reference picking is used if amplicons do not overlap and a reference sequence collection is available. This approach discards reads that do not hit a reference sequence. 3) Open-reference picking is used if amplicons overlap and a reference sequence collection is available; reads that fail to hit the reference are subsequently clustered de novo. A toy illustration of greedy de novo clustering is sketched after the tool listing below.

The tools commonly used at each step of the two analysis routes are summarized below (see also Fig. 1):

Shotgun metagenomics
Assembly: EULER, Velvet, SOAP, ABySS, MetaVelvet, MetaVelvet-SL, Meta-IDBA, IDBA-UD, Newbler (Roche), MIRA, Mapsembler, ALLPATHS, MetaORFA, MetAMOS
Binning: TETRA, S-GSOM, PhylopythiaS, TACOA, PCAHIER, ESOM, ClaMS, CARMA, WGSQuikr, SPHINX, MetaPhyler, SOrt-ITEMS, PhymmBL, MetaCluster
Annotation: FASTX-Toolkit, FastQC, SolexaQA, Lucy 2, DUST, Bowtie, MetaGeneMark, LEfSe, TACOA, Metagene, CREST, Prodigal, mOTU-LG, Orphelia, Kraken, FragGeneScan, CRT, NBC, MyTaxa, RITA, PILER-CR, tRNAscan, KEGG, MetaCluster TA, SEED, eggNOG, ProViDE, COG/KOG, PFAM, TIGRFAM, MetaPhlAn, HighSSR, BLAT
Analysis pipelines: IMG/MER, MG-RAST, MEGAN 5, CAMERA, Parallel-META, EBI Metagenomics, METAREP, PHACCS

Marker gene metagenomics
Standalone software: QIIME, Mothur, JAguc, M-pick, OTUbase, CopyRighter, AbundantOTU, UniFrac, ESPRIT
Analysis pipelines: SILVAngs, FunFrame, PANGEA, FastGroupII, CLOTU
Denoising: AmpliconNoise, DADA, JATAC, UCHIME, Bellerophon, CANGS
Databases: SILVA, Greengenes, Ribosomal Database Project (RDP), UNITE
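The toy sketch below illustrates greedy de novo OTU clustering at the 97% threshold discussed above. Identity here is a naive position-wise comparison of equal-length reads, chosen purely for brevity; real tools such as UCLUST compute proper pairwise alignments.

```python
# Toy greedy de novo OTU clustering at a 97% identity threshold. The
# first read of each OTU serves as its centroid; identity is a naive
# position-wise measure, not a real alignment.
def identity(a, b):
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def cluster_otus(reads, threshold=0.97):
    centroids = []
    otus = []
    for read in reads:
        for idx, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                otus[idx].append(read)   # join an existing OTU
                break
        else:
            centroids.append(read)       # found nothing close: new OTU
            otus.append([read])
    return otus

print(len(cluster_otus(["A" * 100, "A" * 99 + "T", "G" * 100])))  # -> 2
```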
There has been some controversy within the metagenomics community regarding the actual need for performing assembly on metagenomes. One contention is that using clustering algorithms such as CD-HIT or UCLUST is sufficient to group similar reads together and thereafter proceed to annotation of these clusters without prior assembly. This clustering approach may allow more accurate annotation of highly diverse samples containing rare, uncultured genomes that may otherwise be excluded from the assembly process due to their low coverage. One drawback of not performing an assembly is that complex regulatory elements such as CRISPRs may not be identified successfully.
Binning and annotation methods are also constantly being modified and altered to specifically address metagenomic analysis pipelines. A significant improvement of these processes will be achieved once the repository of cultured as well as uncultured genomes within the public databases grows. Composition-based as well as similarity-based binning methods, especially those making use of supervised machine learning algorithms (ie, PhylopythiaS, trained on reference genomes), will become increasingly accurate as more reliable information becomes available.
At this stage it is important to mention that, in spite of the best efforts to reconstruct and prepare datasets by 1) quality filtering, 2) performing assemblies, and 3) binning sequences into taxonomically informative groups, annotation pipelines still achieve successful annotation for only ~50% of the sequences under analysis. As mentioned above, the annotation process is highly dependent on the available databases and hence limited by the amount of information present within these repositories. Sequences that have no similarity to any other sequence in a known database are termed “orphan genes”. These genes are believed to be 1) a consequence of sequencing errors and/or the inaccuracy of gene prediction tools, or 2) truly novel genes that have no sequence or functional similarity to known genes but may share higher-order similarity in the form of protein folds. Much work is currently being undertaken to shed light on these unknowns/orphans using various types of information. Some existing tools use pathway information from metagenomic neighbors, as well as context-dependent metabolomic data, to assign a functional annotation to unknown genes. Along these lines, the use of metabolomic, metatranscriptomic, and/or metaproteomic data will provide a more elaborate view of the picture, addressing all aspects of the central dogma in the metagenomics era. Moreover, single-cell genomics, which derives information from sequencing individual cells, is becoming increasingly popular. The synergy of single-cell genomics with metagenomics can allow a more accurate separation of metagenomic sequences into individual genomes, guided by the single-cell sequencing data.
A wide array of software is currently available to perform each step of the marker gene metagenomics analysis pipeline. What is missing from the literature is a systematic evaluation of the software and algorithms used so far and a standardized means of comparing results derived from different workflows. Variation in results can occur due to inconsistencies in a number of factors, such as DNA extraction, primer pair and amplification region, sequencing platform, and the software used. All of the aforementioned sources of variation make it very difficult to compare studies and obtain trustworthy results. Improvements to the already available software can be achieved, but only through benchmarks, simulations, and thorough testing. Initiatives such as the GSC could potentially take over the design of “Minimum Analysis Requirements for Metagenome Sequences (MARMS)”. This would consist of standardized methodologies and a consensus on the choice of software, analysis steps, threshold values, and parameters. Such an initiative would eliminate, or at least minimize, the biases that can be generated by analyzing data using multiple methodologies.
The availability of web-based platforms such as EBI Metagenomics, IMG/MER, MG-RAST, and SILVAngs will further allow users with limited computational facilities to perform analysis of metagenomic samples. In comparative metagenomic analyses, one can use such tools to compare samples from different ecological niches and extract information that is common and/or unique to a specific environment. Moreover, the GSC is striving toward the successful integration of analyzed data under a unified and mutually acceptable structure/format that will facilitate the exchange of valuable insights and information in the field of microbial ecology and environmental microbiology.
To sum up, we have created a metagenomics flowchart (Fig. 1) outlining all the aforementioned basic steps of the analysis pipeline. Analysis can take two different routes depending on the type of sequencing data (marker gene or shotgun metagenomics). Every analysis step shown in the flowchart is complemented by a list of some well-established tools used by the metagenomics community.
Flowchart of basic metagenomics steps and tools currently in practice.
Notes: The analysis pipeline can take two different routes depending on the type of sequencing data (marker gene or shotgun metagenomics) available. The flowchart outlines the basic steps in the analysis pipeline, starting with preprocessing of the data and ending with the final extraction of results and concurrent storage and management of the data. Some popular tools that have been used extensively by the metagenomics community are shown for every step, as well as the databases and algorithms in common practice.
Footnotes
ACADEMIC EDITOR: J.T. Efird, Editor in Chief
FUNDING: This work was supported by the European Commission FP7 programs INFLA-CARE (EC grant agreement number 223151), “Translational Potential” (EC grant agreement number 285948), and LifeWatchGreece Research Infrastructure (http://www.lifewatchgreece.eu/) [384676–94/GSRT/NSRF(C&E)]. The authors confirm that the funder had no influence over the study design, content of the article, or selection of this journal.
COMPETING INTERESTS: Authors disclose no potential conflicts of interest.
Paper subject to independent expert blind peer review by minimum of two reviewers. All editorial decisions made by independent academic editor. Upon submission manuscript was subject to anti-plagiarism scanning. Prior to publication all authors have given signed confirmation of agreement to article publication and compliance with all applicable ethical and legal requirements, including the accuracy of author and contributor information, disclosure of competing interests and funding sources, compliance with ethical requirements relating to human and animal study participants, and compliance with any copyright requirements of third parties. This journal is a member of the Committee on Publication Ethics (COPE). Provenance: the authors were invited to submit this paper.
Author Contributions
AO, GAP, II conceived the idea of the manuscript. AO, CP wrote the first draft of the manuscript. All other authors (GAP, II, NP, PP, GK, CA) made critical revisions and approved the final version of the manuscript.
Abstract
Technological advances have led to the introduction of next-generation sequencing (NGS) platforms in cancer investigation. NGS allows massive parallel sequencing that affords maximal tumor genomic assessment. NGS approaches vary and concern both DNA and RNA analysis. DNA sequencing includes whole-genome, whole-exome, and targeted sequencing, which focuses on a selection of genes of interest for a specific disease. RNA sequencing facilitates the detection of alternative gene-spliced transcripts, posttranscriptional modifications, gene fusions, mutations/single-nucleotide polymorphisms, small and long noncoding RNAs, and changes in gene expression. Most applications are in the cancer research field, but lately NGS technology has been revolutionizing cancer molecular diagnostics, owing to the many advantages it offers compared to traditional methods. Knowledge is more extensive for solid cancer diagnostics, but interest has recently extended to the field of hematologic cancer. In this review, we report the latest data on NGS diagnostic/predictive clinical applications in solid and hematologic cancers. Moreover, since the amount of NGS data produced is very large and their interpretation is very complex, we briefly discuss two bioinformatic aspects, variant-calling accuracy and copy-number variation detection, which are gaining importance in cancer-diagnostic assessment.
Introduction
In recent years, next-generation sequencing (NGS) technologies have played an essential role in understanding the altered genetic pathways involved in human cancer. Compared to earlier genome-sequencing methods, NGS has numerous advantages. Primarily, it is a high-throughput method: massive parallel sequencing interrogates multiple targeted genomic regions in multiple samples simultaneously, allowing concomitant mutations to be detected in the same run. Another important advantage for routine tumor sequencing is the reduced turnaround time of analysis, which leads to reduced clinical reporting time. Moreover, an NGS analysis requires very low input of DNA/RNA, in contrast to traditional sequencing methods. A variety of genomic aberrations can be screened simultaneously with high accuracy and sensitivity, such as single/multiple-nucleotide variants, small and large insertions and deletions, copy-number variations (CNVs), and fusion transcripts. The sensitivity of NGS is higher than that of Sanger sequencing (detection down to 2%–10% versus 15%–25% allele frequency, respectively), and it allows quantitative evaluation of the mutated allele.
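As a worked illustration of that quantitative readout (the counts are invented), the variant allele frequency at a position is simply the fraction of covering reads that carry the mutant base:

```python
# Variant allele frequency (VAF): mutant reads divided by total reads
# covering the position. Counts below are illustrative.
def variant_allele_frequency(alt_reads, ref_reads):
    total = alt_reads + ref_reads
    return alt_reads / total if total else 0.0

# 60 mutant reads out of 1,000 -> 6% VAF: within the ~2%-10% range
# detectable by NGS but likely below the ~15%-25% Sanger threshold.
print(variant_allele_frequency(60, 940))  # 0.06
```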
The NGS workflow consists of several steps, from nucleic acid extraction to variant annotation, as shown in Figure 1. There are currently three main companies offering NGS platforms: Roche, Illumina, and Life Technologies (Thermo Fisher Scientific, Waltham, MA, USA). Each of the available platforms uses different sequencing chemistry and methods for signal detection. Roche 454 platforms employ pyrosequencing, whereby a chemiluminescent signal indicates base incorporation and the intensity of the signal correlates with the number of bases incorporated in homopolymer reads. However, the NGS platforms most commonly used employ sequencing by synthesis, in which the DNA strand to be sequenced is used as a template, a complementary strand is synthesized, and the sequence of the template strand is thereby obtained. Illumina MiSeq and HiSeq sequencers use four distinct fluorescently labeled nucleotides and optical imaging to visualize the growing complementary strand. The error rate estimated for Illumina technology is <0.4%. Life Technologies, in contrast, uses a nonoptical approach and unlabeled nucleotides. Sequencing by synthesis is performed in microscopic wells interfaced with a semiconductor chip. The DNA is clonally amplified on microscopic beads. As nucleotides are incorporated one at a time, the protons released produce a change in pH, which is measured by the semiconductor chip. The error rate estimated for Ion Torrent technology is 1.8%–1.9%, mostly in the detection of homopolymer stretches.
NGS workflow from nucleic acid extraction to variant annotation.
Abbreviation: NGS, next-generation sequencing.
NGS approaches differ, and concern tumoral DNA and RNA analysis. DNA sequencing includes whole-genome sequencing (WGS), whole-exome sequencing (WES), and targeted sequencing. WGS allows sequencing of the entire genome but requires a large DNA sample. To detect clinical mutations accurately, 100- to 200-fold sequencing coverage may be needed, which is both time- and cost-prohibitive. Usually, 30- to 60-fold coverage, sufficient to identify structural rearrangements, is employed. WES focuses on the coding regions (exons) of a genome, typically ~2.5% of the human genome, to discover rare or common variants associated with a disorder or phenotype. WES reduces cost and time compared to WGS. The most common methods rely on hybridization with oligonucleotide probes to “capture” targeted DNA fragments, thereby enriching for exonic sequences. Targeted sequencing, which focuses on a selection of genes of interest for a specific disease, can be more accurate and more accessible in terms of time and cost, making clinical application feasible for more laboratories.
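A back-of-the-envelope calculation makes the WGS/WES coverage trade-off above concrete. The sketch assumes the usual Lander-Waterman approximation (coverage = reads x read length / target size); the run size is hypothetical.

```python
# Fold-coverage estimate: coverage = reads x read length / target size.
# Genome size and exome fraction follow the text; the run is hypothetical.
def fold_coverage(n_reads, read_len, target_bp):
    return n_reads * read_len / target_bp

GENOME = 3.2e9           # human genome, bp
EXOME = 0.025 * GENOME   # coding exons, ~2.5% of the genome (see text)

reads = 1.0e9            # hypothetical run of one billion 100 bp reads
print(round(fold_coverage(reads, 100, GENOME), 1))  # ~31x whole genome
print(round(fold_coverage(reads, 100, EXOME), 1))   # ~1250x if exome-targeted
```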
RNA sequencing (RNA-Seq) facilitates the detection of alternatively spliced transcripts, posttranscriptional modifications, gene fusions, mutations/single-nucleotide polymorphisms (SNPs), and changes in gene expression. The extracted RNA is first enriched and reverse-transcribed into complementary DNA, which is then processed. Moreover, the NGS approach makes it possible to investigate epigenetic alterations, such as promoter methylation, microRNAs, and the expression of other small RNAs, even if no relevant panels are currently available for diagnostic use. Life Technologies has focused more on disease-specific kits (Ion AmpliSeq Colon and Lung Panel version 2, BRCA1/2 Panel, AML Panel, and RNA Lung Fusion Panel), whereas the Illumina approach is based on the development of generic cancer-panel kits covering genes from several cancers (TruSeq Amplicon and TruSight Cancer).
Although NGS is extensively used for research purposes, its application in clinical practice has not yet been fully formalized in guidelines, owing to the novelty of the approach. Despite this, NGS is beginning to be widely used for diagnostic requests. The Italian Society of Human Genetics has recently released early indications on this topic, summarizing the criteria required for new NGS-based molecular diagnoses.
This review includes the advances and initial clinical applications of NGS in solid and hematologic cancer diagnosis. Moreover, we briefly discuss two bioinformatic aspects that are gaining significant importance in cancer-diagnostic assessment: first, the accuracy and quality of variant calling, which is still an open question in terms of reducing the false-positive rate; and second, CNV detection, which is an essential analysis in the clinical setting.
NGS analysis for solid cancer diagnosis
Detection of critical cancer-gene alterations in solid-tumor samples better defines patient diagnosis and prognosis, and indicates which targeted therapies should be administered to improve the care of selected cancer patients in the personalized-medicine scenario. The NGS studies on solid cancers described here offer a fundamental overview of how molecular approaches to cancer are changing, highlighting advantages over traditional diagnostic methods.
Hereditary breast cancer
Hereditary breast cancers (HBCs) account for 5%–10% of all BCs, and in about 30% of cases are caused by BRCA1 and BRCA2 mutations. The BRCA1 and BRCA2 genes encode tumor-suppressor proteins essential for DNA repair and genomic stability. The presence of these mutations increases the lifetime risk of developing HBC, so genetic counseling and a BRCA-gene test are recommended for BC patients with early onset or a significant family history.
Conventional DNA sequencing, such as direct Sanger sequencing, requires long analysis times and incurs high costs, because the BRCA1 and BRCA2 genes span 23 and 27 exons, respectively. For this reason, prescreening methods, such as denaturing high-performance liquid chromatography, have been suggested to speed up the molecular analysis.
Our lab experience and several recent papers have demonstrated that NGS methods are adequate to detect point mutations and indels in the BRCA1/BRCA2 genes, revolutionizing this genetic analysis and reducing time and costs. This approach is in fact suitable for the routine diagnostic workflow, since it is faster and more sensitive than denaturing high-performance liquid chromatography/Sanger sequencing methods. Data quality is assured by participation in international quality programs on BRCA1/BRCA2 testing with the NGS method (ie, the European Molecular Genetics Quality Network), which also provide specific certification of correct results, sensitivity, specificity, and interpretation of variant calling.
Genes other than BRCA1/BRCA2 have now been shown to confer high BC risk. NGS platforms allow the customization of gene panels, giving patients more opportunity to determine their BC risk. Tung et al found that the frequency of mutations in non-BRCA1/BRCA2 genes was 4.3% with their 25-gene panel. Lin et al developed a sequencing panel containing 68 genes associated with cancer risk for patients with early-onset or familial BC. They discovered alterations in RAD50, TP53, ATM, BRIP1, FANCI, MSH2, MUTYH, and RAD51C, which may be valuable in BC-risk assessment. Lhota et al performed NGS of 581 genes in 325 BC patients (who had tested negative in previous BRCA1/BRCA2/PALB2 analyses), identifying 127 truncating variants.
Despite several findings on HBC with NGS, a recent study of the two most common platforms demonstrated that neither the Illumina MiSeq sequencer with the supplied MiSeq Reporter software nor the Life Technologies Ion Torrent Personal Genome Machine (Ion PGM) with the supplied Torrent Suite software was completely suitable for clinical laboratory sequencing of BRCA1 or BRCA2. The MiSeq system failed to detect insertions and deletions larger than nine base pairs; similarly, the Ion PGM with Torrent Suite software failed to detect a ten-base-pair insertion and a 64-base-pair deletion. However, the authors reported that an alternative alignment and variant-calling software, Quest Sequencing Analysis Pipeline (QSAP), was capable of detecting large deletions and insertions. By combining the MiSeq platform with QSAP alignment, they were able to design an assay with 100% sensitivity and specificity for BRCA1- and BRCA2-sequence variations. These results underline the strong impact of choosing alignment and variant-calling tools appropriate to the application of interest, as we describe herein.
Melanoma
BRAF mutations play a key role in 40%–70% of malignant melanomas. According to the COSMIC database, 44% of melanomas harbor BRAF mutations, and 97.1% of these mutations are localized in codon 600 of the BRAF gene. Mutated BRAF can be inhibited by small-molecule kinase inhibitors, among which are vemurafenib (Roche), approved by the US Food and Drug Administration (FDA) in August 2011 for unresectable or metastatic melanoma, and dabrafenib. For these therapies, it is mandatory to detect BRAF alterations by gold-standard methods, such as Sanger sequencing and real-time polymerase chain reaction (PCR).
Ihle et al evaluated several parameters of different methods for BRAF-mutation analysis. They compared allele-specific PCR performed with the Cobas BRAFV600 test, pyrosequencing using the Therascreen BRAF Pyro kit, high-resolution melting analysis, immunohistochemistry, the NGS approach, and Sanger sequencing with regard to sensitivity, specificity, costs, workload, feasibility, and limitations. They suggested that the best method was a combination of VE1-antibody staining and high-resolution melting for p.V600E-mutation analysis, combining the lowest detection limit with a fast method and 100% sensitivity. However, the authors also reported the numerous advantages of NGS for melanoma molecular diagnostics, supporting the future substitution of the current methods with an NGS approach.
However, there is a clinical need to analyze other genes, both to find other types of targeted therapy and to understand eventual resistance. Currently, validated diagnostic panels are not commercially available, and very few studies have addressed the development of a custom-designed gene panel. van Engen-van Grunsven et al designed a panel containing hotspot alterations, such as BRAF exon 15, NRAS exons 2 and 3, HRAS exons 2 and 3, AKT1 exon 3, GNAQ exons 4 and 5, GNA11 exons 4 and 5, KIT exons 8, 9, 11, 13, and 14, and PDGFRA exons 12, 14, and 18. Our AmpliSeq custom panel includes eleven crucial full-length genes (BRAF, NRAS, PTEN, MITF, CDK4, MGMT, CTLA4, PIK3CA, MC1R, KIT, and RB1) involved in melanoma carcinogenesis and therapy-response pathways. We tested its clinical applicability on the Ion PGM NGS platform in order to identify new or already known SNPs and mutations that could be related to differences in the duration of response to BRAF inhibitors. Our results showed higher sensitivity and specificity in detecting a wide range of genetic alterations compared to traditional sequencing methods. Moreover, we identified alterations in CTLA4, MITF, PIK3CA, KIT, and MC1R related to BRAF-inhibitor response duration. This panel is now undergoing validation for routine use in diagnosis, prognosis, and therapy prediction.
Prostate cancer
Prostate cancer (PC) has become the leading cause of cancer death among males in many countries. Its high tumor heterogeneity suggests that numerous genetic events are responsible for the indolent and aggressive forms of the disease. Currently, there is no way to differentiate accurately between these two forms before treatment. Most men diagnosed with PC have clinically indolent disease that does not require immediate radical treatment, and overtreatment of these men can lead to worse quality of life. The clinical response to therapy varies widely from patient to patient: some patients relapse shortly after treatment, whereas others remain disease-free for a long time before relapsing.
Recent advances in NGS technology have improved the understanding of PC biology and clinical variability. In particular, DNA-Seq, RNA-Seq, chromatin immunoprecipitation-Seq, and methyl-Seq experiments have better elucidated the major pathways affecting prostate tumorigenesis, which are the AR-signaling, PI3K–PTEN–Akt, and RTK–Ras–MAPK pathways.
Two studies have demonstrated the feasibility of large-scale screening of PC patients in routine diagnosis. Manson-Bahr et al showed that DNA from cancer material dissected from transrectal ultrasound needle-core biopsy specimens can be analyzed. The authors observed a pattern of mutation consistent with those previously observed in PC surgical tissues, including TMPRSS2–ERG fusion and mutations in SPOP, TP53, ATM, and MEN1, while nonsense mutations were observed in the MAP2K5 and NCOR2 genes. Iacono et al performed the first retrospective NGS study on 60 specimens: 30 high- and 30 intermediate-risk patients. They identified nonsynonymous variations and SNPs with an allelic frequency ≥10% in the TP53, CSFR1, KDR, KIT, PIK3CA, MET, and FGFR2 genes, evidencing their role in the progression and aggressiveness of PC. However, at present the study of multiple genetic alterations in PC is not suggested for routine diagnostic purposes.
Thyroid cancer
Thyroid nodules, very frequent in the general population, are mostly benign, but accurate identification of the nodules that could be precursors of cancer is needed. A common diagnostic approach that allows differential diagnosis between cancerous and benign nodules in most cases is ultrasound-guided fine-needle aspiration (FNA) of the thyroid nodule followed by cytological examination. However, in approximately 25% of nodules, the diagnosis cannot be established by FNA cytology, since the limited diagnostic material available is not sufficient to perform a comprehensive molecular characterization by traditional techniques.
In the last few years, several studies have explored the possibility of improving thyroid cancer (TC) diagnosis with an NGS molecular test. In 2013, the first custom gene panel, ThyroSeq, was developed, allowing the targeting of 284 mutational hotspots in 12 cancer genes. Sequencing was performed on 228 neoplastic and nonneoplastic thyroid samples, including 105 frozen, 72 formalin-fixed, and 51 FNA samples, representing all major types of TC. Using this approach, point mutations were detected in 30%–83% of specific types of TC, but in only 6% of benign thyroid nodules.
In 2014, Nikiforov et al validated the performance of a new gene-mutation panel (ThyroSeq version 2) and a gene-fusion panel (ThyroSeq RNA) in a large series of thyroid nodules cytologically classified as follicular or oncocytic (Hürthle cell) neoplasm/suspicious for a follicular or oncocytic (Hürthle cell) neoplasm, demonstrating that it allowed accurate cancer-risk assessment in these nodules. In 2015, the same authors demonstrated the possibility of stratifying, with high sensitivity and specificity, patients with benign and malignant thyroid nodules cytologically diagnosed as atypia of undetermined significance/follicular lesion. The latest custom panel developed by the authors (ThyroSeq version 2.1) included 14 genes analyzed for point mutations and 42 types of gene fusion occurring in TC.
Recently, Simbolo et al investigated the diagnostic stratification of sporadic medullary TC using the Ion AmpliSeq Hot Spot Cancer Panel version 2 (Life Technologies). Thirteen cases had a somatic RET mutation; only ten of these were detected by both Sanger sequencing and NGS, while three were missed by Sanger, revealing the higher sensitivity of NGS. In summary, these studies demonstrated that NGS offers the possibility of better classifying thyroid nodules. This should improve patient management and allow clinicians to avoid diagnostic surgeries associated with significant costs and potential risks.
Lung cancer
Lung cancer (LC) is the leading cause of cancer-related death in developed countries, and is often diagnosed at an advanced stage. A comprehensive knowledge of predictive biomarkers has enabled the selection of LC patients for treatment with tyrosine-kinase inhibitors (TKIs). In clinical practice, EGFR mutations must be evaluated to assign patients to TKI treatment appropriately. Most (80%–90%) EGFR mutations are either small exon 19 deletions or the L858R mutation in exon 21, but other TKI-sensitive EGFR mutations can occur in exons 18–21. The T790M mutation in exon 20 needs to be investigated, because it is associated with resistance to first-generation TKIs but sensitivity to third-generation TKIs. Another marker of TKI resistance is ALK rearrangement. Indeed, to date, EGFR and ALK are the only actionable genes with drugs approved by the FDA for LC treatment.
Formalin-fixed paraffin-embedded tissue is considered an optimal specimen for molecular analysis. For several years, the gold-standard technique to detect EGFR mutations was Sanger sequencing, but recently other methods have been employed for molecular diagnostics (high-resolution melting, restriction fragment-length polymorphism, mutant allele-specific PCR, peptide nucleic acid-mediated PCR, pyrosequencing, immunohistochemistry with specific EGFR antibodies, and the Scorpion Amplification Refractory Mutation System). For ALK rearrangements, in contrast, the gold standard is still immunohistochemistry or fluorescence in situ hybridization.
Several studies have documented the changes brought by the introduction of NGS into daily clinical practice for LC molecular diagnosis, reporting high sensitivity for detecting actionable alterations using a gene panel on LC specimens. In fact, Lim et al recently reported that 58% of patients classified as wild-type by standard EGFR/KRAS/ALK testing showed alterations identified by NGS, thus giving these patients a therapeutic chance.
However, tissue biopsies are not always available, because 60% of non-small-cell LCs (NSCLCs) are high-stage, locally advanced, and/or inoperable tumors that have already metastasized to distant sites by the time they are detected. The diagnosis of LC sometimes depends on metastatic lymph-node specimens obtained by FNA cytology. In these patients, cytology specimens are usually the only material available for histological typing and molecular analysis. In these cases, the tumor-cell content may be very low, implying the need for very sensitive methods. Scarpa et al demonstrated for the first time, in 2013, the diagnostic relevance of the Ion AmpliSeq Colon and Lung Cancer Panel on lung adenocarcinoma cytological samples. The first version of this panel included 504 mutational hotspot regions in 22 cancer-related genes, and it was able to detect variants down to 1% allelic frequency, which corresponds to 2% of cancer cells in a sample. An implementation of the Ion AmpliSeq Colon and Lung Cancer Panel was reported in a study in which seven different labs belonging to the OncoNetwork Consortium tested the NGS panel on the same samples. This final version of the panel comprised 1,825 selected mutational hotspots in 22 cancer-related genes. Recent studies have confirmed that DNA from cytological LC samples is of sufficient quantity and quality for NGS molecular analysis.
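The correspondence between a 1% allelic frequency and 2% cancer cells quoted above follows from simple arithmetic: a clonal heterozygous mutation in a diploid tumor contributes one of two alleles per tumor cell. The minimal sketch below illustrates this relationship, assuming diploidy and no copy-number changes (an illustrative simplification).

def expected_vaf(tumor_fraction: float, mutated_copies: int = 1, total_copies: int = 2) -> float:
    """Expected variant allele frequency for a clonal mutation, assuming
    diploid tumor and normal cells (illustrative simplification)."""
    return tumor_fraction * mutated_copies / total_copies

print(expected_vaf(0.02))  # 0.01 -> a 1% VAF detection limit implies >=2% tumor cells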
Neoplastic tissues remain the standard specimen for molecular analysis. However, the potential to obtain noninvasive sampling compared with tissue biopsy is very attractive. Blood collection is less invasive than tissue sampling, and can be used when tissue specimens are limited/not available or for critically ill patients. Moreover, it can allow for sampling at several time points to monitor the genetic evolution of the tumor and also to predict early treatment resistance or nonresponse.
Plasma DNA can also be analyzed by NGS to detect cancer-related gene alterations useful in LC-treatment decisions, because plasma may reflect disease status more comprehensively than a single tumor biopsy. Moreover, during treatment, plasma analysis could reveal EGFR treatment-resistance mutations, indicating early clinical progression.
In our lab, two NGS panels on Ion Torrent are in daily use for NSCLC patients: the Ion AmpliSeq Colon and Lung Cancer Panel version 2 and the Ion AmpliSeq RNA Fusion Lung Cancer Research Panel. We also participated in Thermo Fisher Scientific’s international validation program for the final version of this fusion panel. Routinely, NGS clinical analysis is performed on NSCLC formalin-fixed paraffin-embedded and cytological samples. A comparative study of NGS application to tissue and plasma is ongoing, with encouraging results (manuscript in preparation). Moreover, the Ion AmpliSeq Colon and Lung Cancer Panel is a fundamental step in our clinical analysis for characterizing the EGFR deletion type, because specific in vitro diagnostic molecular tests on Rotor-Gene real-time PCR do not provide this information.
Colorectal cancer
EGFR, involved in cancer growth and survival, is targeted by several drugs in colorectal cancer (CRC) therapy. However, only a small subgroup of patients with metastatic CRC can benefit from anti-EGFR therapies (cetuximab or panitumumab), and thus prediction of patient responses is necessary to avoid side effects and to save costs. Ras proteins (HRas, KRas, and NRas) are important downstream effectors that transmit signals from EGFR to the intracellular signaling cascade. KRAS is considered a predictive biomarker for the efficacy of anti-EGFR therapy, since KRAS-mutant CRC patients (codons 12 and 13 in exon 2) are resistant to treatment with EGFR inhibitors. However, approximately 40%–50% of patients harboring wild-type KRAS exon 2 do not benefit from these targeted agents, suggesting the potential involvement of other genetic alterations in pathways downstream of EGFR. In fact, a recent study suggested that additional mutations in KRAS and NRAS, as well as downstream mutations in BRAF or PIK3CA, may cause resistance to anti-EGFR treatment. Inter- and intratumoral genetic heterogeneity is another factor in predicting treatment failure and drug resistance in CRC therapies. The recently updated National Comprehensive Cancer Network guideline strongly recommends genotyping of tumor tissue (either primary tumor or metastasis) in all patients with metastatic CRC for RAS (exons 2–4 of KRAS and NRAS), and patients with any known KRAS or NRAS mutation should not be treated with cetuximab or panitumumab. To date, the gold standard for analysis of these genes is real-time PCR or pyrosequencing, methods that are time-consuming and of limited sensitivity.
In order to investigate CRC specimens with NGS in clinical practice, Tops et al developed a multigene panel already used for LC investigation. This panel has also been employed by several other groups in CRC research, who have recommended it for clinical use over traditional methods. Another clinical application of NGS to CRC is represented by an interesting recent study in which a mutational-load cutoff, identified via multigene tumor profiling, discriminated between CRC patients proficient and deficient in the DNA-mismatch repair (MMR) pathway, since 15%–20% of CRC patients are deficient in one or more MMR genes. This approach can be used for initial screening for Lynch syndrome. Moreover, the authors demonstrated the feasibility of analyzing MMR deficiency and RAS/BRAF mutations in CRC patients with the same panel, reducing the time and costs of analysis.
Lately, several custom gene panels have been developed with Illumina and Life Technologies to investigate many other crucial CRC genes. A multigene approach is in fact necessary to capture a larger mutational spectrum simultaneously, increasing our knowledge of CRC. In the future, additional information emerging from these NGS studies will probably prove useful for predicting the duration of anti-EGFR therapy response or for developing other targeted therapies.
NGS and hematologic cancer
Hematological malignancies are grounded in genetic aberrations, in particular large mutations that underlie the different phenotypes in the spectrum of hematologic cancers. NGS technologies have been applied to hematological disorders in a variety of contexts: guiding diagnosis (TCR gene rearrangement to establish T-cell clonality), subclassification (recurrent cytogenetic translocations in acute myeloid leukemia), prognosis (Philadelphia chromosome positivity in acute lymphoblastic leukemia), and minimal residual disease (MRD) testing (BCR–ABL transcripts in chronic myelogenous leukemia), often allowing the identification of novel mutations. The characterization of leukemias, lymphomas, and myelomas is continually evolving, and includes the precise identification of additional common mutations that may be of great prognostic value and clinical importance.
Multiple myeloma
Multiple myeloma (MM) is a malignancy of plasma cells that arises through a multistep transformation process: an asymptomatic stage of monoclonal gammopathy of undetermined significance (MGUS) precedes virtually all cases of MM. Its genetic landscape changes over time due to additional events, such as somatic mutations and epigenetic and chromosomal copy-number changes, driving its progression from MGUS to symptomatic MM and ultimately, in some patients, to aggressive extramedullary disease.
The first important event in plasma-cell transformation is hyperdiploidy, observed in up to 55% of patients. The second is IGH translocation, found in 40%–50% of patients. Moreover, t(11;14) (dysregulation and overexpression of the CCND1 gene), t(4;14) (upregulation of FGFR3 and MMSET/WHSC1), and many other chromosomal rearrangements are present in the tumor plasma cells at the time of diagnosis. All these abnormalities have long been known, because they are visible on the conventional karyotype. More recent data based on comparative genomic hybridization or SNP-array technologies have revealed other important chromosomal changes, especially homozygous deletions.
With the development of NGS, the understanding of MM has greatly improved over the past 5 years, confirming its wide heterogeneity at the molecular level, but also providing a clearer picture of disease pathogenesis and progression. The quantitative nature of NGS data allows for higher resolution of the subclonal architecture of cancers. Nevertheless, initial reports of genomic evolution in MM using NGS were conducted on small cohorts, suggesting that MM shows a heterogeneous subclonal structure at diagnosis and only a few recurrently mutated genes of likely pathogenetic significance, including KRAS, NRAS, TP53, BRAF, and FAM46C.
With NGS, Bolli et al confirmed subclonal KRAS, NRAS, and BRAF mutations in about one-third of MM patients: findings with crucial therapeutic implications for trials of MEK and BRAF inhibitors. Recently, Kortüm et al designed a targeted panel of 47 genes, comprising 39 genes known to be mutated in ≥3% of MM cases and eight genes in pathways therapeutically targeted in MM. Mutation analysis revealed KRAS as the most commonly mutated gene, followed by NRAS, TP53, DIS3, FAM46C, and SP140. They tracked clonal evolution and identified mutation acquisition and/or loss in FAM46C, FAT1, KRAS, NRAS, SPEN, PRDM1, NEB, and TP53, as well as two mutations in XBP1, a gene associated with bortezomib resistance.
Lymphomas
In recent years, the development of NGS has also allowed the acquisition of important molecular information in a variety of lymphoid tumors, including Hodgkin’s lymphoma, diffuse large B-cell lymphoma, Burkitt’s lymphoma, chronic lymphocytic leukemia, follicular lymphoma, mantle-cell lymphoma, hairy-cell leukemia, and splenic marginal zone lymphoma. Although there have been many advances in this field, NGS panels are not yet available for clinical practice. The current modalities for diagnosing hematological disease are based on fluorescence in situ hybridization, classic molecular biology, and radiographic studies; the latter in particular are associated with radiation exposure and limited specificity.
The new sequencing technologies, in addition to identifying somatic mutations involved in cancer progression (ie, mutations of BRAF, MYD88, and NOTCH2), have provided scientific evidence that might be useful for clinical treatment, as well as for the diagnosis and monitoring of progression of these diseases. NGS aims to detect the tumor-specific clonotype and circulating tumor-specific sequences in the peripheral blood of patients with Hodgkin’s lymphoma. Quesada et al used this approach to identify lymphoma-specific immunoglobulin gene rearrangements in primary tumor samples at diagnosis or disease recurrence, as well as in follow-up. Moreover, the sequencing of B-cell lymphoma genomes has identified recurrent mutations, some of which have prognostic impact or serve as drug targets. Mutation of TP53 predicts poor response to treatment and shortened overall survival across lymphoma entities, and mutations in NOTCH1 and SF3B1 have been shown to be independent predictors of poor outcome in chronic lymphocytic leukemia.
Minimal residual disease
MRD is defined as the small number of cancer cells that persist in a patient during or after treatment, even though clinical and microscopic examinations confirm complete remission and the patient shows no signs or symptoms of disease. MRD detection and quantification are used for the evaluation of treatment efficiency, patient-risk stratification, and long-term outcome prediction in hematological malignancies.
Currently, flow cytometry is the most commonly used technique for the diagnosis and characterization of hematological malignancies and MRD. Although the method is widely used, a high level of expertise is required to interpret the data precisely when it comes to rare-event detection, such as MRD. The sensitivity for the detection of malignant cells varies according to the type of disorder, the panel of antibodies used, the number of cells analyzed, and the expertise of the laboratory. Furthermore, DNA and RNA tests usually lack the sensitivity required for MRD monitoring.
NGS approaches allow searching not only for known mutations/translocations but also for all clonal gene mutations and rearrangements present in diagnostic samples, providing a better understanding of the possible evolution of MRD. In a recent study, consensus primers and high-throughput sequencing were employed to amplify and sequence all rearranged IGH and TCR gene segments.
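As an illustration of how such rearrangement sequencing supports MRD quantification, the hypothetical sketch below counts reads matching a patient-specific clonotype sequence and reports its frequency among all rearranged reads. The sequences and sample data are invented for illustration; real assays (eg, LymphoTrack) use dedicated pipelines that also correct for amplification bias and sequencing error.

def mrd_fraction(reads, clonotype_seq):
    """Fraction of rearranged-receptor reads containing the patient-specific
    clonotype sequence (toy example with invented data)."""
    if not reads:
        return 0.0
    hits = sum(clonotype_seq in read for read in reads)
    return hits / len(reads)

# Toy usage: two of four reads carry the (invented) clonotype sequence.
reads = ["TTACGTTGCAAT", "GGCCTTAAGGCC", "ACGTTGCAGGTT", "CCCGGGAAATTT"]
print(mrd_fraction(reads, "ACGTTGCA"))  # 0.5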
Ladetto et al described a comparison between real-time quantitative PCR and LymphoSight NGS as methods for MRD detection using clonal IGH rearrangements. The primary results demonstrated that NGS enabled the detection of this molecular marker in a high proportion of cases, including a fraction in which standard PCR-based amplification failed. In addition, NGS showed a sensitivity comparable to that obtained by real-time quantitative PCR, allowing its use for detection of MRD.
Unfortunately, NGS for this purpose is not yet routinely employed in clinical practice. NGS might overcome some disadvantages of PCR-based methods and avoid the need for patient-specific reagents. In addition, the NGS approach enables the analysis of genetic diversity and clonogenic heterogeneity, which may contribute to our current understanding of disease biology and relapse kinetics. To date, only one CE (Conformité Européenne)-marked in vitro diagnostic panel is commercially available: the LymphoTrack Dx assay (for Illumina MiSeq and Ion PGM), used to identify the DNA sequence, clonal prevalence, and V–J family identity of each gene rearrangement and, for IGH assays, the extent of IGHV somatic hypermutation.
Variant calling and copy-number variations
NGS produces large-scale data that continue to pose a major challenge. To call variants from NGS data, many aligners and variant callers have been developed and assembled into diverse pipelines. A typical pipeline contains an aligner, which maps the sequencing reads to a reference genome, and a variant caller, which identifies variant sites and assigns a genotype to a subject. The performances of different aligners have been extensively studied, and great effort is still needed to identify the best analysis pipeline correctly.
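A minimal sketch of such a pipeline is shown below, using one common open-source tool chain (BWA-MEM for alignment, samtools for sorting/indexing, GATK HaplotypeCaller for variant calling) and generic file names as assumptions. Production pipelines additionally assign read groups, mark duplicates, recalibrate base qualities, and run quality control.

import subprocess

REF = "reference.fasta"   # indexed reference genome (assumed file name)
FASTQ = "sample.fastq"    # sequencing reads (assumed file name)

def run(cmd: str) -> None:
    """Run one pipeline step, failing loudly if the tool reports an error."""
    subprocess.run(cmd, shell=True, check=True)

# 1) Map reads to the reference genome.
run(f"bwa mem {REF} {FASTQ} > sample.sam")
# 2) Coordinate-sort and index the alignments.
run("samtools sort -o sample.bam sample.sam")
run("samtools index sample.bam")
# 3) Identify variant sites and assign genotypes; the output is a VCF
#    ready for downstream annotation and filtering.
run(f"gatk HaplotypeCaller -R {REF} -I sample.bam -O sample.vcf.gz")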
The Genome Analysis Toolkit (GATK; Broad Institute, Cambridge, MA, USA) is a powerful set of tools for NGS-data analysis. Recently, we focused on optimizing GATK to call variants from data sets produced with an Ion Torrent targeted custom panel including eleven genes involved in melanoma. In particular, we investigated the variant-filtration step. For this step, GATK provides variant quality-score recalibration (VQSR). VQSR filtering uses annotation metrics (eg, quality by depth, mapping quality, strand bias) from true variants, annotated in HapMap for instance, to generate an adaptive model. Applied to the other variants, such a model allows calculation of the probability that a variant is true or false. Although this is a powerful method, it requires a large call set. Indeed, GATK’s best practices suggest not applying VQSR in “small-scale experiments, such as targeted gene panels or exome studies with fewer than 30 exomes”. In these cases, hard filtering is the approach indicated by GATK. General rules are available, but appropriate filters have to be set up specifically for each study, considering also that GATK does not provide any technical documentation for Ion Torrent data.

Therefore, starting from a comparison of results from GATK and the proprietary Torrent Suite variant caller (TVC) on a real data set, our aim was to determine a framework for GATK hard filtering in order to lower false-positive calls (Figure 2). We observed a high discrepancy between TVC and GATK, particularly for indels, suggesting that such variant types are difficult to detect even with present bioinformatic tools. We then simulated two data sets, each with a different coverage and each carrying alterations found in the real data. Indeed, defining a “gold standard” data set to test variant-calling methods is a very active topic, and “synthetic” matched tumor–normal samples have recently been created to compare the performance of popular variant callers in detecting “somatic” single-nucleotide variants (SNVs). The first important finding is that results are strictly correlated with coverage: in the high-coverage data set, SNV calling produced fewer false positives than in the low-coverage data set. For indels, however, the picture is more complex, and the number of false positives was high in both data sets when looking at the variants suggested by GATK in the phase preceding the filtering of “good” variants.

To select suitable hard filters, we considered the most important quality parameters reported in the raw Variant Call Format (VCF) file. We built regression trees to identify the best choices for hard filtering, in order to discriminate better between true and false calls. We performed the analyses on SNV and indel subsets, both stratified by genotype, in the high- and low-coverage data sets. The regression trees allowed us to set a series of filters for each type of alteration. Recently, Vanni et al used GATK to analyze sequencing data from the targeted AmpliSeq Colon and Lung Cancer Panel (Life Technologies); methodologically, they filtered out variants with a Phred score of 5–30, marking them as low quality. Our results showed that such an approach might not be enough to obtain a high-quality GATK call set. In detail, we found that different parameters could be tuned depending on the type of mutation and the genotype suggested.
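To make the hard-filtering step concrete, the sketch below applies GATK’s documented generic SNP thresholds (QD < 2.0, FS > 60.0, MQ < 40.0) to a VCF. These defaults are only illustrative starting points; as discussed above, the thresholds should be tuned per study (eg, with regression trees), since GATK provides no Ion Torrent-specific guidance, and the file name is an assumption.

def parse_info(info: str) -> dict:
    """Turn a VCF INFO field into a {key: value} dict."""
    out = {}
    for field in info.split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            out[key] = value
    return out

def passes_hard_filters(info: dict) -> bool:
    """Apply illustrative GATK SNP thresholds; records missing a metric pass."""
    qd = float(info.get("QD", "inf"))
    fs = float(info.get("FS", "0"))
    mq = float(info.get("MQ", "inf"))
    return qd >= 2.0 and fs <= 60.0 and mq >= 40.0

with open("sample.vcf") as vcf:
    for line in vcf:
        if line.startswith("#"):
            continue  # skip header lines
        cols = line.rstrip("\n").split("\t")
        if passes_hard_filters(parse_info(cols[7])):
            print(cols[0], cols[1], cols[3], cols[4])  # CHROM POS REF ALT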
The application of hard filtering reduced the number of false positives. At times the loss of true variants could be high, in particular for indels, but it should be noted that the number of false variants was also high. Therefore, hard filtering can help to reduce this high number of false positives drastically, and we argue that increasing coverage should improve filtering results, in that fewer true variants would be incorrectly discarded. We explored the flanking regions of each type of alteration, in particular searching for recurrent homopolymeric strings, and showed that they are partly responsible for false-positive calls. The hard filters were tested on an independent real cohort, sequenced with the same custom panel, and we found almost 100% concordance for SNV calling (manuscript in preparation).
Our approach to setting up a pipeline for SNV calling.
Abbreviations: SNV, single-nucleotide variant; GATK, Genome Analysis Toolkit; TVC, Torrent Suite variant caller; VCF, Variant Call Format.
Another NGS application is CNV analysis. CNVs occur frequently during carcinogenesis, and thus the detection of these aberrations is essential in cancer-genome analysis to improve diagnosis and treatment. NGS-based CNV algorithms frequently handle WGS and WES data, and a number of somatic CNV-detection programs have been developed, each based on a different approach. With regard to targeted sequencing, the approach used in diagnostic settings, however, the bioinformatic challenge remains open. In essence, all pipelines for CNV detection in targeted-sequencing data use the read-depth approach: they calculate the coverage of the amplicons and detect outliers after an appropriate normalization step. Some algorithms require matched tumor–normal samples or a reference DNA, but recently an R package, Ioncopy, was introduced that does not need control samples. Different biases have to be considered in a read depth-based approach. PCR can distort coverage because of nonuniform amplification efficiency. A well-studied issue in CNV identification is guanine–cytosine (GC)-content bias, which affects read coverage. Another important bias concerns the alignment step, because short reads might not be unambiguously mapped to the reference genome. In conclusion, even though a number of methods have been developed, validation is still needed before they can be included in a clinical setting.
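The read-depth idea can be illustrated with a short sketch: normalize each amplicon’s coverage by the sample median (to remove library-size effects), compare against a reference built from normal samples, and flag amplicons whose log2 ratio is an outlier. The data, amplicon names, and the ±1 log2 cutoff are illustrative assumptions; real tools additionally model GC bias and mappability.

import math

def normalize(depths):
    """Scale amplicon depths by the sample median to remove library-size effects."""
    median = sorted(depths.values())[len(depths) // 2]
    return {amp: d / median for amp, d in depths.items()}

def call_cnvs(sample, reference, cutoff=1.0):
    """Return amplicons whose |log2(sample/reference)| exceeds the cutoff."""
    calls = {}
    for amp, ref_depth in reference.items():
        ratio = math.log2(sample[amp] / ref_depth)
        if abs(ratio) > cutoff:
            calls[amp] = round(ratio, 2)
    return calls

# Toy usage: "AMP3" is covered ~4-fold more than in the normal reference.
sample = normalize({"AMP1": 480, "AMP2": 510, "AMP3": 2100, "AMP4": 495})
reference = normalize({"AMP1": 500, "AMP2": 500, "AMP3": 500, "AMP4": 500})
print(call_cnvs(sample, reference))  # only AMP3 is flagged, as a gain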
Conclusion
Despite several critical points, mostly regarding technology implementation and data interpretation, in this review we have shown numerous benefits of an NGS approach (Figure 3). Indeed, recent innovations in sequencing technologies have made it possible to capture a wide spectrum of the genomic alterations occurring within tumors.
Benefits obtained from the use of NGS methods in clinical molecular diagnostics.
Note: Introducing NGS into clinical guidelines requires improvement of the critical points shown in this figure.
Abbreviation: NGS, next-generation sequencing.
At present, the clinical utility and efficacy of comprehensive genomic profiling with NGS are under evaluation, with a view to introducing this technology into clinical guidelines for solid and hematologic cancer management. Initial results demonstrate that NGS might improve patient care, guiding patients toward specific screening programs and targeted therapies with greater accuracy and specificity than traditional sequencing methods, even if many further studies are needed.
Footnotes
Disclosure
The authors report no conflicts of interest in this work.