A “bioinformatics pipeline” refers to a series of computational tasks, generally applied sequentially (hence the term “pipeline”), which take as input the output of an MPS instrument, such as image or FASTQ files, and progressively analyse these data through key steps, ending with a VCF file or, further downstream, an annotated spreadsheet (CSV, TSV) or text file.
While there is no single standard pipeline, most bioinformatics pipelines convert the data through a series of fairly standardised milestones.
A bioinformatics pipeline can be provided by the MPS instrument vendor, built with proprietary software, or assembled from open-source software. None of these approaches has been shown to be innately superior to the others, provided the pipeline is selected, tuned, validated or verified (as appropriate) and applied correctly.
This phase receives raw electronic information from the MPS instrument and converts it, using the vendor’s proprietary algorithms, into genomic signals such as the identity and order of nucleotides (“base calling”). The laboratory usually has relatively little control over this phase, as it is under the instrument manufacturer’s control.
Where multiplexing strategies have been applied, de-multiplexing is performed at this analysis stage to re-identify the sample from which each individual sequence read was derived.
For amplicon sequencing strategies, primers must also be trimmed from the reads.
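By way of illustration, the following minimal Python sketch de-multiplexes gzipped FASTQ records on a 6 bp inline barcode and trims a known primer prefix. The barcode-to-sample map, primer sequence and file names are hypothetical, and production pipelines would normally rely on dedicated de-multiplexing and trimming tools.

```python
# Minimal sketch: de-multiplex gzipped FASTQ records on a 6 bp inline
# barcode and trim a known primer prefix. BARCODES, PRIMER and all file
# names are hypothetical examples.
import gzip

BARCODES = {"ACGTAC": "sample_A", "TGCATG": "sample_B"}  # barcode -> sample
PRIMER = "GGATTACAGG"                                    # amplicon primer

def fastq_records(path):
    """Yield (header, sequence, quality) tuples from a gzipped FASTQ file."""
    with gzip.open(path, "rt") as fh:
        while True:
            header = fh.readline().rstrip("\n")
            if not header:
                return  # end of file
            seq = fh.readline().rstrip("\n")
            fh.readline()  # the '+' separator line
            qual = fh.readline().rstrip("\n")
            yield header, seq, qual

outputs = {name: open(f"{name}.fastq", "w") for name in BARCODES.values()}
for header, seq, qual in fastq_records("run1.fastq.gz"):
    barcode, insert, iqual = seq[:6], seq[6:], qual[6:]
    sample = BARCODES.get(barcode)
    if sample is None:
        continue  # no exact barcode match: read stays unassigned
    if insert.startswith(PRIMER):  # trim the primer from amplicon reads
        insert, iqual = insert[len(PRIMER):], iqual[len(PRIMER):]
    outputs[sample].write(f"{header}\n{insert}\n+\n{iqual}\n")
for fh in outputs.values():
    fh.close()
```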
The outputs of the primary analysis phase are usually FASTQ files. Quality control (including machine metrics) and acceptance criteria should be applied at this stage.
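As one illustrative acceptance check, the sketch below computes the fraction of bases at or above Q30 in a FASTQ file; the 90% threshold and the file name are assumptions for the example, not recommended acceptance criteria.

```python
# Minimal sketch of a FASTQ acceptance check: the fraction of bases at or
# above Q30. The 90% cut-off is an illustrative assumption only.
def q30_fraction(path):
    total = q30 = 0
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 3:  # every fourth FASTQ line holds Phred+33 qualities
                quals = [ord(c) - 33 for c in line.rstrip("\n")]
                total += len(quals)
                q30 += sum(q >= 30 for q in quals)
    return q30 / total if total else 0.0

if q30_fraction("sample_A.fastq") < 0.90:
    raise ValueError("FASTQ failed the illustrative Q30 acceptance criterion")
```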
This phase receives the FASTQ files from the primary analysis, maps (or aligns) the reads to the reference sequence, and identifies changes from the reference sequence (variant calling).
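A minimal sketch of the mapping step is shown below, invoking BWA-MEM and samtools through the shell from Python. It assumes bwa and samtools are installed and that the reference has been indexed with `bwa index`; all file names are placeholders.

```python
# Minimal sketch of read mapping: BWA-MEM writes SAM to stdout, which
# samtools sorts into a coordinate-sorted BAM. Assumes bwa and samtools
# are on PATH and ref.fa has been indexed (`bwa index ref.fa`).
import subprocess

subprocess.run(
    "bwa mem ref.fa sample_A.fastq | samtools sort -o sample_A.bam -",
    shell=True, check=True,
)
subprocess.run(["samtools", "index", "sample_A.bam"], check=True)
```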
The secondary analysis pipeline must be tailored to the MPS technical platform used. For example, PCR duplicates are typically marked in capture-/enrichment-based approaches, where doing so helps identify clonally derived sequences and potential sequence artefacts; in contrast, PCR duplicates are not marked in amplicon-based sequencing strategies, in which reads legitimately share identical start positions. Local realignment (for example around insertions/deletions) can resolve mismatches, increasing accuracy and minimising false-positive variant calls.
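The following simplified sketch marks PCR duplicates in a coordinate-sorted BAM by flagging reads that share a reference, start position and strand. Real tools such as Picard MarkDuplicates also consider soft-clipping, mate coordinates and base qualities, so this is illustrative only; pysam is assumed to be available and the file names are placeholders.

```python
# Simplified PCR-duplicate marking for a capture-based experiment: reads
# sharing the same reference, start position and strand are flagged.
# Illustrative only; ignores soft-clipping, mates and base qualities.
import pysam

seen = set()
with pysam.AlignmentFile("sample_A.bam", "rb") as inbam, \
     pysam.AlignmentFile("sample_A.dupmarked.bam", "wb", template=inbam) as out:
    for read in inbam:
        if read.is_unmapped:
            out.write(read)
            continue
        key = (read.reference_id, read.reference_start, read.is_reverse)
        if key in seen:
            read.is_duplicate = True  # flag, rather than remove, the read
        seen.add(key)
        out.write(read)
```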
Variant calling is then performed to identify sequence variations from the reference, such as SNVs, small insertions/deletions, copy number alterations and structural changes.
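For SNVs and small insertions/deletions, a minimal calling step might look like the sketch below, which pipes bcftools mpileup into bcftools call; copy number alterations and structural changes require dedicated callers. bcftools is assumed to be installed and the file names are placeholders.

```python
# Minimal sketch of SNV/small-indel calling with bcftools: mpileup computes
# genotype likelihoods, call emits variant sites as compressed VCF.
import subprocess

subprocess.run(
    "bcftools mpileup -f ref.fa sample_A.dupmarked.bam"
    " | bcftools call -mv -Oz -o sample_A.vcf.gz",
    shell=True, check=True,
)
```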
The outputs of the secondary analysis phase are usually BAM and VCF files. A large number of commercial, academic and in-house tools are in use for the secondary analysis of MPS data.
Further quality control should be applied at this stage.
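One common check at this stage is mean depth over the target region. The sketch below computes it with pysam over a hypothetical interval, with an illustrative 30x acceptance threshold; an indexed, coordinate-sorted BAM is assumed, and the contig, coordinates and cut-off are examples only.

```python
# Minimal QC sketch: mean depth over a target region. Contig, coordinates
# and the 30x threshold are hypothetical; assumes an indexed BAM.
import pysam

with pysam.AlignmentFile("sample_A.dupmarked.bam", "rb") as bam:
    # count_coverage returns per-base counts for A, C, G and T separately
    acgt = bam.count_coverage("chr17", 41196311, 41277500)
    depths = [sum(base[i] for base in acgt) for i in range(len(acgt[0]))]

mean_depth = sum(depths) / len(depths)
if mean_depth < 30:
    raise ValueError(f"Mean target depth {mean_depth:.1f}x below 30x threshold")
```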
Tertiary analysis concerns the annotation of the identified sequence variants and may involve a combination of the following strategies (a minimal illustrative sketch follows the list):
- Comparison of the identified sequence variants to those reported in the most appropriate of the various polymorphism databases (e.g. dbSNP, dbVar, 1000 Genomes, Exome Aggregation Consortium, Exome Variant Server)
- Annotation of the resulting transcript consequences (synonymous, truncating, missense, splice site etc.)
- Application of tools to predict the severity of the alteration, such as in silico pathogenicity prediction tools, splice site prediction tools, Grantham difference, assessment of sequence conservation, comparison to known protein domains
- Comparison to variants documented in clinical variant databases (e.g. ClinVar, HGMD, OMIM, LOVD, DECIPHER) and locus- and disease-specific databases
- Review of functional data relevant to the variant/locus, including gene expression data and in vitro and in vivo studies
- Searching the peer-reviewed published literature for reports of the variant/gene
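The sketch below illustrates the database comparison steps in miniature: each VCF record is looked up in local stand-ins for a population frequency database and a clinical variant database. The dictionaries, coordinates and file name are hypothetical placeholders for real resources such as dbSNP or ClinVar; production pipelines use dedicated annotation tools and full database releases.

```python
# Minimal annotation sketch: look up each variant in hypothetical local
# stand-ins for a population frequency and a clinical variant database.
POPULATION_AF = {("17", 41245466, "G", "A"): 0.0001}    # hypothetical entry
CLINVAR_LIKE = {("17", 41245466, "G", "A"): "Pathogenic"}

def annotate_vcf(path):
    for line in open(path):
        if line.startswith("#"):
            continue  # skip VCF header lines
        # multi-allelic records would need splitting in a real pipeline
        chrom, pos, _id, ref, alt, *_ = line.split("\t")
        key = (chrom, int(pos), ref, alt)
        yield {
            "variant": key,
            "population_af": POPULATION_AF.get(key),     # None if unseen
            "clinical_significance": CLINVAR_LIKE.get(key, "not reported"),
        }

for record in annotate_vcf("sample_A.vcf"):
    print(record)
```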
For large-scale genomic investigations, such as expanded gene panels, whole-exome or whole-genome analysis, tertiary analysis further involves a process of variant filtering and prioritisation, i.e. the removal of findings of lesser interest. The aim of variant filtering and prioritisation is to reduce the number of candidate variants to those most likely associated with disease. For genome-scale investigations, variant filtering and prioritisation is typically performed in a (semi-)automated fashion. The resulting pre-filtered set of candidate variants is then manually reviewed in further detail to allow clinical interpretation and classification of the sequence variants, and to take into account the current limitations of annotation databases; clinical interpretation and reporting of findings are discussed in chapter 5.
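A minimal sketch of such (semi-)automated filtering and prioritisation follows: common polymorphisms and synonymous changes are removed, and the remainder is ranked by a crude consequence-severity ordering. The allele-frequency threshold and the severity ranking are illustrative assumptions, not recommended settings.

```python
# Minimal sketch of variant filtering and prioritisation: drop common and
# synonymous variants, rank the rest by an illustrative severity order.
SEVERITY = {"truncating": 0, "splice_site": 1, "missense": 2}

def filter_and_prioritise(variants, max_af=0.01):
    candidates = [
        v for v in variants
        if (v["population_af"] or 0.0) <= max_af     # drop common variants
        and v["consequence"] in SEVERITY              # drop e.g. synonymous
    ]
    return sorted(candidates, key=lambda v: SEVERITY[v["consequence"]])

variants = [
    {"population_af": 0.0001, "consequence": "truncating"},
    {"population_af": 0.35, "consequence": "missense"},   # common: removed
    {"population_af": None, "consequence": "missense"},   # absent from db
]
print(filter_and_prioritise(variants))
```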
The outputs of the annotation and filtering phases are commonly annotated VCF or CSV/TSV (spreadsheet) files. Further quality control should be applied at this stage.