Bioinformatics

1. Scope

1.1 Scope

Diagnostic applications of genomic testing span a wide range of approaches. These may include copy number analysis using DNA microarrays or resequencing of single genes (in high multiplex), gene panels, whole exomes, whole genomes, tumour profiling, non-invasive prenatal screening/testing, methylation analyses and RNA-Seq.

The scope of this chapter is restricted to consideration of MPS technologies applied to clinical diagnostic DNA analysis. Excluded from scope are analyses of RNA, transcriptomes, epigenetic and methylation analysis and other applications of MPS.

Issues addressed cover the range of MPS testing for genes, panels of genes, exomes and whole genomes. As the size and complexity of the analysis increases, additional procedures and safeguards may need to be included to ensure robustness and reliability of the analysis.

1.2 The Bioinformatics Pipeline

A “bioinformatics pipeline” refers to a series of computational tasks, generally applied sequentially (hence the term “pipeline”), which take as input the output of an MPS sequencing instrument, such as image or FASTQ files, and progressively analyse the data through key steps, ending with a VCF file or, further downstream, an annotated spreadsheet (CSV, TSV) or text file.

While there is no one standard pipeline, most bioinformatics pipelines convert the data through a series of fairly standardised milestones.
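
As an illustration only, the sketch below (Python) strings together one common set of open-source tools (bwa, samtools, bcftools) to traverse the standard FASTQ-to-BAM-to-VCF milestones. The file names and parameters are illustrative assumptions, not a recommended configuration.

    import subprocess

    def run(cmd):
        """Run one pipeline step; check=True stops the pipeline if the step fails."""
        subprocess.run(cmd, shell=True, check=True)

    reference = "GRCh38.fa"        # assumed reference genome, pre-indexed with 'bwa index'
    reads = "sample_R1.fastq.gz"   # assumed FASTQ output of primary analysis

    # Mapping (alignment) to the reference sequence: FASTQ -> sorted, indexed BAM
    run(f"bwa mem {reference} {reads} | samtools sort -o sample.sorted.bam -")
    run("samtools index sample.sorted.bam")

    # Variant calling against the reference: BAM -> VCF
    run(f"bcftools mpileup -f {reference} sample.sorted.bam | "
        "bcftools call -mv -o sample.vcf")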

A bioinformatics pipeline can be provided by the MPS instrument vendor, built from proprietary software, or assembled from open-source software. None of these approaches has been shown to be innately superior to the others, provided the components are selected, tuned, validated or verified (as appropriate) and applied correctly.

Primary analysis:

This phase receives raw electronic information from the MPS instrument and converts it, using the vendor’s proprietary algorithms, into genomic signals such as nucleotide positions and ordering (“base calling”). The laboratory usually has relatively little control over this phase, as it is under the instrument manufacturer’s control.

Where multiplexing strategies have been applied, de-multiplexing is performed at this analysis stage; de-multiplexing re-identifies the sample from which individual sequence reads were derived.

For amplicon sequencing strategies, primers have to be trimmed from the reads.

The outputs of the primary analysis phase are usually FASTQ files. Quality control (including machine metrics) and acceptance criteria should be applied at this stage.
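
As a simple illustration of acceptance checks at this stage, the sketch below computes the read count and mean base quality of a FASTQ file; it assumes gzip-compressed input with Phred+33 quality encoding, and the file name is illustrative.

    import gzip

    def fastq_stats(path):
        """Basic read-level statistics from a gzip-compressed FASTQ file."""
        reads = qual_sum = base_count = 0
        with gzip.open(path, "rt") as fh:
            for i, line in enumerate(fh):
                if i % 4 == 3:                      # every 4th line holds base qualities
                    reads += 1
                    quals = [ord(c) - 33 for c in line.strip()]
                    qual_sum += sum(quals)
                    base_count += len(quals)
        return {"reads": reads,
                "mean_base_quality": qual_sum / base_count if base_count else 0.0}

    print(fastq_stats("sample_R1.fastq.gz"))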

Secondary analysis:

This phase receives the FASTQ files from the primary analysis, maps (or aligns) them to the reference sequence, and identifies changes from the reference sequence (variant calling).

The secondary analysis pipeline must be tailored to the MPS technical platform used. For example, duplicates arising from PCR strategies are typically marked in capture-/enrichment-based approaches, where this helps identify clonally derived sequences and potential sequence artefacts. In contrast, PCR duplicates are not marked in amplicon-based sequencing strategies. Local realignment can resolve mismatches, increasing accuracy and minimising false-positive variant calls.
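
The sketch below illustrates one way to inspect the duplicate marking produced during secondary analysis of a capture-based assay. It assumes the third-party pysam library and a coordinate-sorted, indexed BAM whose duplicates have already been marked; the file name is illustrative.

    import pysam  # third-party library for reading BAM files

    dup = total = 0
    with pysam.AlignmentFile("sample.sorted.bam", "rb") as bam:
        for read in bam.fetch():          # requires the index (.bai) to be present
            total += 1
            if read.is_duplicate:         # duplicate flag set by the marking tool
                dup += 1

    if total:
        print(f"{dup}/{total} reads flagged as duplicates ({100 * dup / total:.1f}%)")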

Variant calling is then performed to identify sequence variations from the reference such as SNVs and small insertions/deletions, copy number alterations and structural changes.

The outputs of the secondary analysis phase are usually BAM and VCF files. A large number of commercial, academic and in-house tools are in use for the secondary analysis of MPS data.

Further quality control should be applied at this stage.

Tertiary analysis:

Tertiary analysis concerns the annotation of the identified sequence variants and may involve a combination of the following strategies (a toy annotation sketch follows the list):

  • Comparison of the identified sequence variants to those reported in the most appropriate of the various polymorphism databases (e.g. dbSNP, dbVar, 1000Genomes, Exome Aggregation Consortium, Exome Variant Server)
  • Annotation of the resulting transcript consequences (synonymous, truncating, missense, splice site etc.)
  • Application of tools to predict the severity of the alteration, such as in silico pathogenicity prediction tools, splice site prediction tools, Grantham difference, assessment of sequence conservation, comparison to known protein domains
  • Comparison to variants documented in clinical variant databases (e.g. ClinVar, HGMD, OMIM, LOVD, DECIPHER) and locus- and disease-specific databases
  • Review of functional data relevant to the variant/locus, including gene expression data, in vitro, and in vivo studies
  • Research of the variant/gene published in peer-reviewed literature
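
As a toy illustration of the database-comparison strategies above, the sketch below joins called variants to a local clinical variant table. The tab-separated layout (chrom, pos, ref, alt, classification) and file name are illustrative assumptions, not the format of any real database.

    import csv

    def load_db(path):
        """Load a local variant table keyed by (chrom, pos, ref, alt)."""
        with open(path, newline="") as fh:
            return {(r["chrom"], int(r["pos"]), r["ref"], r["alt"]): r["classification"]
                    for r in csv.DictReader(fh, delimiter="\t")}

    def annotate(variants, db):
        """Attach any known classification to each (chrom, pos, ref, alt) tuple."""
        return [(v, db.get(v, "not_in_database")) for v in variants]

    db = load_db("local_clinical_variants.tsv")   # illustrative file name
    print(annotate([("chr7", 117559590, "A", "G")], db))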

For large-scale genomic investigations, such as expanded gene panels, whole-exome or whole-genome analysis, tertiary analysis further involves a process of variant filtering and prioritisation, by removal of findings of lesser interest. The aim of variant filtering and prioritisation is to reduce the number of candidate variants to those most likely associated with disease. For genome-scale investigations, variant filtering and prioritisation is typically performed in a (semi-)automated fashion. The resulting pre-filtered set of candidate variants is then manually reviewed in further detail to allow clinical interpretation and classification of the sequence variants, and to take into account the current limitations of annotation databases; clinical interpretation and reporting of findings are discussed in chapter 5.
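
A minimal sketch of such (semi-)automated filtering and prioritisation is given below. The field names (“gnomad_af”, “consequence”), the consequence categories and the frequency threshold are illustrative assumptions; a production filter would be defined and locked down during validation.

    # Illustrative consequence categories treated as potentially damaging
    DAMAGING = {"missense", "stop_gained", "frameshift", "splice_site"}

    def prioritise(variants, max_af=0.01):
        """Keep rare variants with a potentially damaging transcript consequence.

        variants: list of dicts carrying annotation fields for each call.
        """
        kept = [v for v in variants
                if v.get("gnomad_af", 0.0) <= max_af
                and v.get("consequence") in DAMAGING]
        # Rank rarest first so reviewers see the strongest candidates at the top
        return sorted(kept, key=lambda v: v.get("gnomad_af", 0.0))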

The outputs of the annotation and filtering phases are commonly annotated VCF or CSV/TSV (spreadsheet) files. Further quality control should be applied at this stage.

2. Documentation

Comment: Laboratories have a choice of using vendor-supplied pipelines, open-source pipelines, or some combination of both. In general, less documentation is required for vendor-supplied pipelines, but more customisation and fine-tuning is possible for in-house developed or applied software. The requirements described in this section apply regardless of the source of the bioinformatics pipeline.

2.1 The laboratory must document all components of, changes to, and auditing of the informatics pipeline.

The laboratory must document all components of the informatics pipeline, including software packages, custom scripts and algorithms, reference sequences and databases. Any changes, patch releases or updates in processes or version numbers must be documented with the date of implementation, such that the precise informatics pipeline and annotation sources used for each test and report are traceable. If information from public websites is used, the date of access should be documented.
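
One way to support this traceability requirement is to emit a machine-readable manifest with every run. The sketch below writes such a manifest as JSON; the tool names, versions and annotation sources shown are illustrative assumptions.

    import datetime
    import json

    # Illustrative run manifest recording pipeline components and versions
    manifest = {
        "pipeline_version": "1.4.2",
        "date_run": datetime.date.today().isoformat(),
        "components": {"bwa": "0.7.17", "samtools": "1.19", "bcftools": "1.19"},
        "reference_genome": "GRCh38",
        "annotation_sources": {"dbSNP": "build 156",
                               "ClinVar": "accessed 2024-01-15"},
    }

    with open("pipeline_manifest.json", "w") as fh:
        json.dump(manifest, fh, indent=2)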

2.2 The laboratory must use version control to track software releases and updates to analysis methods.

The laboratory may consider use of dedicated version control software to assist with this requirement for managing software code, such as Concurrent Versions System (CVS), Apache Subversion (SVN), or Git. There are also dedicated software tools for management and control of laboratory method documents and validation records.
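
For example, a pipeline can record the exact revision of its own code with each result. The sketch below captures the current Git commit hash at run time; it assumes the pipeline code lives in a Git working copy.

    import subprocess

    def pipeline_revision(repo_dir="."):
        """Return the Git commit hash of the pipeline code in repo_dir."""
        out = subprocess.run(["git", "rev-parse", "HEAD"],
                             cwd=repo_dir, capture_output=True, text=True,
                             check=True)
        return out.stdout.strip()

    print("Pipeline code revision:", pipeline_revision())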

2.3 The laboratory must document the quality metrics assessed during a test.

For the informatics pipeline, relevant quality metrics include but are not limited to: the total number of reads passing quality filters, the percentage of reads aligned, the number of single nucleotide polymorphisms (SNPs) and insertions and deletions (indels) called, and the percentage of variants in dbSNP.
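
The sketch below derives the last two of these metrics directly from a finished VCF, counting SNVs and indels and the percentage of calls carrying a dbSNP identifier in the ID column (“sample.vcf” is an illustrative file name).

    snvs = indels = known = total = 0
    with open("sample.vcf") as fh:
        for line in fh:
            if line.startswith("#"):
                continue                       # skip header lines
            chrom, pos, vid, ref, alt = line.split("\t")[:5]
            total += 1
            if len(ref) == 1 and all(len(a) == 1 for a in alt.split(",")):
                snvs += 1
            else:
                indels += 1
            if vid != ".":                     # rsID present => variant in dbSNP
                known += 1

    print(f"SNVs={snvs} indels={indels} "
          f"dbSNP={100 * known / max(total, 1):.1f}%")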

2.4 The laboratory must document the results of the pipeline validation.

The validation documentation must detail the performance of the pipeline, such as the sensitivity, specificity and accuracy of the pipeline in detecting variants, and any limitations of the pipeline. The validation document must be readily available to staff involved in MPS-based genetic testing.

2.5 The laboratory should document all training and staff qualifications.

Given the rapid advances in bioinformatics, laboratories implementing MPS-based assays need to consider appropriate staff training and ongoing professional development of staff in bioinformatics. Staff involved in the reporting of MPS results must have, as a minimum, an understanding of the bioinformatics analysis steps and resources used for annotation.

2.6 The laboratory must document the process of data handling and storage.

The laboratory needs to define the minimum set of data to store. Typically, this will involve storage of BAM and VCF files but not image files. Alternatively, the laboratory may store FASTQ files to allow re-analysis of the primary data. Interpreted variant call files, such as those produced after review of the initial calls, must also be stored.

2.7 The laboratory must define and document the conditions for data reanalysis.

As our understanding of sequence variation expands and our bioinformatics tool set improves, it may be necessary to re-evaluate the annotation of a variant or to re-analyse the sequence data. The laboratory must specify under which circumstances, if any, such reanalysis is to be performed.

3. Validation

The general principles of validation of laboratory tests (IVDs) (see NPAAC Requirements for the Development and Use of In-House In Vitro Diagnostic Medical Devices - 2014) also apply to MPS assays. These include design, production, technical validation, monitoring/improvement, and documentation requirements. However, that document does not address aspects specific to genomics and MPS, which are covered in greater depth in resource documents such as the Clinical and Laboratory Standards Institute guideline MM09-A2: Nucleic Acid Sequencing Methods in Diagnostic Laboratory Medicine; Approved Guideline - Second Edition (February 2014) and in Gargis et al. (2012).

Risk of errors in the bioinformatics pipeline: In an analysis pipeline for identification of sequence variants, one must have high confidence that the resulting variant calls have high sensitivity and specificity. Although true positives (TP) can be distinguished from false positives (FP) relatively easily through external validation, it is almost impossible to systematically distinguish false negatives (FN) from the vast number of true negatives (TN). Different pipelines may vary widely in their degree of concordance (e.g. O’Rawe et al. 2013), with the false-negative rate being particularly difficult to address, especially for indels compared to SNVs. The majority of differences between variant calling pipelines appear, however, in ‘problem regions’ of the genome, such as repeat sequences, regions with sequence homology elsewhere, low-complexity regions and regions with errors in the reference assembly; the concordance between calls can often be further improved by applying post-variant-calling filters to remove artefactual calls (Li et al., Bioinformatics 2014, PMID: 24974202).

Besides variant calling, the use of different variant annotation software programs and transcript annotation files can also make a substantial, and not commonly appreciated, difference to annotation results (McCarthy et al. 2014). These troubling reports highlight the need to ensure bioinformatics pipelines are subjected to rigorous validation and QC, especially for clinical diagnostic applications.

3.1 Design of validation study

3.1.1 The validation study must be designed to provide objective evidence that the bioinformatics pipeline is fit for the intended purpose.

Validation is the process of measuring the performance characteristics of a bioinformatics pipeline, and ensuring that the pipeline meets certain pre-defined minimum performance characteristics before it is deployed.

3.1.2 The validation study must identify and rectify common sources of errors that may challenge the analytical validity of the bioinformatics pipeline.

As part of the validation study, it is important to gain an understanding of common error sources that may compromise the validity of the pipeline, such as:

  • Inherent limitations of individual programs
  • Inadequate optimization of parameters of individual programs
  • Problems with data flow between individual programs
  • Use of incorrect auxiliary files (e.g. wrong human genome reference)
  • Hardware or operating system failure

3.1.3 The validation study must establish the analytical validity of the bioinformatics pipeline in terms of being able to correctly detect sequence variants (secondary analyses) and correctly annotate sequence variants (tertiary analyses).

Analytical validity refers to the ability of a bioinformatics pipeline to correctly call and annotate a variant. Analytical validity must be achieved before clinical validity can be established.

Clinical validity refers to the ability of a test to detect or predict a phenotype of interest. Clinical validity must be established by external knowledge such as results from large-scale population studies or functional studies (Refer to chapter on Reporting).

3.1.4 The laboratory must validate the entire bioinformatics pipeline as a whole, under the given operational environment.

A laboratory may choose to put together its bioinformatics pipeline using any combination of commercial, open-source, or custom software. Regardless of whether an individual component has been validated, the laboratory is still required to validate the entire bioinformatics pipeline under its operational environment (i.e., same hardware specification, same operating system, same parameter settings, and same input load).

3.1.5 The validation study must be designed to avoid bias caused by testing on training data.

It is important to ensure that quality metrics were measured on reference materials that have not been used for tuning (training) the parameters of any part of the pipeline. The use of training data as testing data may lead to artefactually inflated measurement of various quality metrics.

3.2 Validation process

3.2.1 The laboratory must determine standardised performance metrics of the pipeline.

The use of standardised performance metrics ensures that validation results can be communicated and compared unambiguously. Some commonly used performance metrics are:

  • The frequency of True Positive, True Negative, False Positive, and False Negative results
  • Accuracy
  • Precision
  • Sensitivity
  • Specificity
  • Reportable range
  • Reference range
  • Limit of detection

The usefulness of these metrics depends on testing a diverse collection of Reference Materials in an environment that realistically simulates the actual operational environment. Depending on the performance characteristics of the analytical system, it may be necessary to use replicate analyses or duplicate samples to achieve satisfactory technical reproducibility.
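
For reference, the first five metrics listed above reduce to simple arithmetic on the TP/FP/TN/FN counts obtained by comparing pipeline calls to a truth set, as in the sketch below; the example counts are invented for illustration.

    def performance(tp, fp, tn, fn):
        """Standard performance metrics from confusion-matrix counts."""
        return {
            "sensitivity": tp / (tp + fn),   # recall: fraction of true variants called
            "specificity": tn / (tn + fp),
            "precision":   tp / (tp + fp),   # positive predictive value
            "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        }

    # e.g. 4,980 of 5,000 truth variants called, with 30 false positives
    print(performance(tp=4980, fp=30, tn=994970, fn=20))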

3.2.2 The validation study must define valid ranges for commonly assessed quality metrics.

We generally do not know the correct answer associated with an input FASTQ file, except in the case of reference materials. Nonetheless, based on the results from RM and other previous experience, it is possible to establish general statistics that can be expected from a valid pipeline. For example, a whole-exome data set can be checked for the expected number of variants (typically 10,000 – 50,000). The transition/transversion ratio (Ti/Tv) can also be checked against a defined range. Deviation from these pre-defined ranges may indicate a need for closer examination, but does not automatically imply a validity problem.
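
The sketch below computes Ti/Tv from the biallelic SNVs in a VCF so the result can be compared against the laboratory’s pre-defined range (“sample.vcf” is an illustrative file name).

    # Purine<->purine and pyrimidine<->pyrimidine changes are transitions
    TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

    ti = tv = 0
    with open("sample.vcf") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            _, _, _, ref, alt = line.split("\t")[:5]
            if len(ref) == 1 and len(alt) == 1:    # biallelic SNVs only
                if (ref, alt) in TRANSITIONS:
                    ti += 1
                else:
                    tv += 1

    # Compare the result against the laboratory's pre-defined valid range
    print(f"Ti/Tv = {ti / max(tv, 1):.2f}")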

3.2.3 Acceptability criteria must be defined to describe clearly the minimum quality metrics required to demonstrate the bioinformatics pipeline is fit for purpose.

One way to demonstrate acceptability and fitness for purpose is to undertake proficiency testing carried out by a NATA-accredited (or internationally equivalent) third party using a different set of Reference Materials.

3.2.4 The laboratory must benchmark the bioinformatics pipeline using reference material, where available. The reference materials chosen must be appropriate for assessing performance of the pipeline for its intended purpose.

Validation of a bioinformatics pipeline generally involves executing it on input data for which the correct status of the variants is known. These input data are called Reference Materials (RM). The usefulness of an RM depends on obtaining a large variety of inputs, from sequences containing only simple SNVs to sequences containing complex indels. RM can be generated entirely by in silico simulation, or by sequencing real oligonucleotides of known sequence. Note that for the purposes of specific bioinformatics quality assurance, this RM may consist of well-characterised data sets (e.g. FASTQ files) rather than physical materials such as DNA samples. It is possible to obtain a large variety of RM from in silico simulation. Nonetheless, RM from real sequences should also be employed, as they likely better capture the characteristics of real data. Examples of bioinformatics reference materials are the consensus variant calls for NA12878 distributed by the Genome in a Bottle Consortium, accessible, for example, on the Genome Comparison and Analytics Testing website (http://www.bioplanet.com/gcat), and the consensus calls distributed under the Illumina Platinum Genomes initiative (http://www.illumina.com/platinumgenomes/). Both of these datasets include a consensus set of calls from multiple pipelines to allow identification of pipeline-specific artefacts.
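
A bare-bones version of such benchmarking is sketched below: both call sets are reduced to (chrom, pos, ref, alt) tuples and compared by set operations. Real comparisons additionally normalise variant representation and restrict the comparison to the consortium’s high-confidence regions; the file names are illustrative.

    def load_calls(vcf_path):
        """Reduce a VCF to a set of (chrom, pos, ref, alt) tuples."""
        calls = set()
        with open(vcf_path) as fh:
            for line in fh:
                if line.startswith("#"):
                    continue
                chrom, pos, _, ref, alt = line.split("\t")[:5]
                calls.add((chrom, int(pos), ref, alt))
        return calls

    truth = load_calls("giab_na12878_truth.vcf")   # illustrative file names
    test = load_calls("pipeline_output.vcf")

    tp = len(truth & test)      # called and in the truth set
    fp = len(test - truth)      # called but absent from the truth set
    fn = len(truth - test)      # in the truth set but missed
    print(f"TP={tp} FP={fp} FN={fn} sensitivity={tp / (tp + fn):.4f}")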

3.2.5 The laboratory should compare the results from multiple pipelines, where possible, to allow identification of pipeline-specific artefacts.

Multiple pipelines could generate quite different variant calling results from the same input FASTQ file. One strategy to validate a pipeline is to measure the concordance between the results of a given pipeline against several other widely used pipelines. High concordance does not necessarily guarantee correctness, but low concordance indicates problems. Poor concordance commonly overlaps with ‘problem regions’ of the genome, e.g. low complexity regions, as discussed above. Any limitations of the chosen pipeline must be defined as part of the validation study.

3.2.6 The validation study must establish appropriate error handling within the pipeline.

A bioinformatics pipeline could fail due to the corruption of an input file generated by primary analysis or intermediate steps within the pipeline. It could also fail due to excessive load on the server or interrupted network connection. As part of the validation procedure, it is important to assess whether the pipeline can detect corrupted files or interrupted execution, and generate appropriate error messages.
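
The sketch below illustrates the kind of defensive input checking meant here: it verifies that a gzip-compressed FASTQ file is present, non-empty and fully decompressible, and fails with an explicit error message rather than letting corrupt data propagate downstream.

    import gzip
    import os
    import sys

    def check_fastq(path):
        """Fail loudly if the input file is missing, empty or corrupted."""
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            sys.exit(f"ERROR: input file missing or empty: {path}")
        try:
            with gzip.open(path, "rt") as fh:   # a truncated gzip stream raises here
                for _ in fh:
                    pass
        except (OSError, EOFError) as exc:
            sys.exit(f"ERROR: input file appears corrupted: {path} ({exc})")

    check_fastq("sample_R1.fastq.gz")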

3.2.7 The validation study must establish appropriate hardware and operating system environments to allow successful execution of the pipeline.

The bioinformatics pipeline can be executed on a dedicated computer server, in a shared high-performance computing (HPC) environment, or in the cloud. The successful execution of these programs also depends on the use of an appropriate operating system, appropriate auxiliary software programs, and supporting reference files (e.g., the human reference genome file and gene annotation file). Validation should be conducted in a system that closely resembles the actual operational environment. See also the issues raised in section 5 of this chapter.

3.2.8 When changes are made to the test system, the laboratory must demonstrate that acceptable performance specifications have been met before using the changed test system for clinical purposes.

3.2.9 The laboratory must define the limitations of the informatics pipeline.

Common limitations of the bioinformatics pipeline include but are not limited to: the maximum size of indels detectable, regions of poor mapping and/or excessive read depth, regions of poor sequence coverage, repeat regions and homopolymer sequence regions that may affect variant calling. There may also be specific limitations of individual specimens that can affect the capability of a given bioinformatics pipeline.

4. Defining Quality Control and Quality Assurance Criteria

Quality control (QC) of sequencing data vs. QC of the bioinformatics pipeline: It is important to distinguish QC for checking the quality of sequencing data from QC for ensuring the correct execution of the bioinformatics pipeline. Data QC is important for checking whether the sequencing data are of sufficiently good quality to ensure variant calling can be performed to the required standard. Pipeline QC, on the other hand, is concerned with whether the bioinformatics pipeline has been correctly executed, according to the predefined quality metrics, for a given sequencing data input. Both types of QC are important.

QC of the bioinformatics pipeline may include the following metrics (a range-checking sketch follows the list):

  • Mapping quality
  • Transition/Transversion ratio
  • Presence of duplicate reads
  • Expected number of variants
  • Expected percentage of known variants (e.g. variants in dbSNP)
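
A minimal sketch of checking achieved metrics against the valid ranges defined during validation is shown below; the metric names and ranges are illustrative assumptions, not recommended values.

    # Illustrative valid ranges, as would be fixed during pipeline validation
    VALID_RANGES = {
        "titv_ratio": (2.0, 3.3),
        "variant_count": (10_000, 50_000),
        "pct_in_dbsnp": (95.0, 100.0),
    }

    def qc_check(metrics):
        """Return the metrics falling outside their pre-defined valid range."""
        return {name: value
                for name, value in metrics.items()
                if name in VALID_RANGES
                and not VALID_RANGES[name][0] <= value <= VALID_RANGES[name][1]}

    failures = qc_check({"titv_ratio": 1.4,
                         "variant_count": 24_310,
                         "pct_in_dbsnp": 97.2})
    if failures:
        print("Investigate before release:", failures)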

4.1 Quality Control and Quality Assurance

4.1.1 The laboratory must monitor quality metrics and acceptability criteria of the informatics pipeline established during pipeline validation.

Quality metrics are to be recorded for each test performed and interpreted in the context of the acceptability criteria that were defined during pipeline validation.

4.1.2 Deviation of achieved quality metrics from defined acceptability criteria must be investigated and mitigated.

Significant deviations may require a repeat of the test. For example, an observed deviation in the percentage of SNPs present in dbSNP may indicate a problem with variant calling for that sample.

4.1.3 Quality metrics and acceptability criteria must be reviewed regularly to ensure relevance to current test performance.

Revalidation must be performed where ongoing deviations are observed and/or substantial changes to the informatics pipeline have been made. Choice of appropriate quality metrics can be of significant help in troubleshooting the source of the problem in an underperforming test. Trend analysis of bioinformatics quality metrics may also prove useful. The appropriateness of the chosen quality metrics for monitoring test performance needs to be reviewed regularly, and at least annually.

4.2 Confirmatory processes

4.2.1 The laboratory must define the policy for confirmation of reported variants.

The policy must include a statement as to the circumstances, if any, under which clinically actionable findings are to be confirmed by use of an orthogonal technology. For example, this may involve re-sequencing using Sanger sequencing, or using a second, different MPS technology, or applying an independent or different technique (such as protein, enzyme or functional assay). Confirmation of the results in an independent sample with the same assay may be considered in an effort to minimise stochastic effects. The circumstances may depend on the nature of the test request, the performance characteristics of the assay (in particular the defined accuracy of the test), and the intended use of the reported result.

4.2.2 The laboratory should consider use of multiple independent software tools to establish consensus calls or for confirmation of calls.

Depending on the accuracy of individual software tools, establishing consensus of multiple tools may significantly improve the accuracy of the prediction. The policy for use of multiple software tools and the confirmation of calls should be established during pipeline validation.
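
A simple consensus rule is sketched below: a variant is accepted when at least min_support of the independent callers report it, with variants represented as (chrom, pos, ref, alt) tuples. The support threshold is an illustrative assumption to be fixed during validation.

    from collections import Counter

    def consensus(call_sets, min_support=2):
        """Accept variants reported by at least min_support independent callers.

        call_sets: list of sets of (chrom, pos, ref, alt) tuples, one per tool.
        """
        counts = Counter(v for calls in call_sets for v in calls)
        return {v for v, n in counts.items() if n >= min_support}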

4.3 Quality assurance

4.3.1 The laboratory must participate in QAP programs for the analysis and interpretation of DNA sequence variants, where such programs are available.

Examples of QAP programs include those organised by the RCPA and the EMQN network. Currently, programs for MPS analysis are in pilot phases.

4.3.2 The laboratory should consider the use of reference materials for ongoing monitoring of test performance.

For example, alignment and variant calling pipelines can be validated and monitored using the Genome in a Bottle, Coriell NA12878, Illumina Platinum Genomes or similar reference materials.

4.3.3 The laboratory should establish a local process for proficiency testing.

Proficiency testing may involve an external QA program, sample exchange, use of electronic sequence files, reference materials and other approaches.

5. General Informatics Aspects

This section refers to general issues that are applicable in all circumstances and environments. Where a laboratory uses off-site or hosted facilities (including “cloud” facilities), these requirements must be met for all stages of the process, including those not physically co-located or under the direct control of the laboratory.

5.1 Data security and privacy

5.1.1 The laboratory must ensure that data management meets requirements for data integrity and security including avoidance of tampering with primary data files and/or corruption of result files.

MPS data may involve the management of very large data files (in excess of hundreds of gigabytes) on shared compute resources. Strategies need to be put in place to ensure the integrity of data files is maintained (e.g. use of checksum tools during file transfer, management of data permissions and ‘write’ access rights) and that a secure copy of the primary data files (FASTQ) is maintained separately from ‘working copies’, allowing regeneration of result files (BAM, VCF, annotations) should this be required.
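
The sketch below shows checksum-based integrity verification of the kind described, comparing the SHA-256 digest of an archived primary file against a working copy; the paths are illustrative.

    import hashlib

    def sha256sum(path, chunk_size=1 << 20):
        """SHA-256 digest of a file, read in 1 MiB chunks to bound memory use."""
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    archived = sha256sum("/archive/sample_R1.fastq.gz")   # illustrative paths
    working = sha256sum("/scratch/sample_R1.fastq.gz")
    assert archived == working, "checksum mismatch: possible corruption or tampering"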

5.1.2 The laboratory must use structured databases wherever possible.

The use of spreadsheets or text files to store information is discouraged, as these typically do not allow satisfactory traceability or auditing of changes.

5.1.3 The laboratory must ensure that data management meets the requirements for protecting patient privacy and autonomy.

General requirements for privacy as they relate to the practice of pathology can be found in the NPAAC Standards: Requirements for Medical Pathology Services and Requirements for Information Communication. Patient autonomy here relates to a patient’s wish to learn, or not to learn, of incidental findings that may arise in the course of testing, and to the general scope of testing to which the patient consented. Data management strategies should consider the masking of information that is outside the scope of testing for a given patient sample. This may involve masking of loci other than those targeted for analysis in a given patient. Masking may be performed at any stage of the bioinformatics analysis pipeline, but must be performed prior to providing annotated variant calls to a laboratory scientist for review, to ensure the scientist is not exposed to information outside the scope of testing.
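
As an illustration of locus masking, the sketch below writes out only those VCF records falling within the consented target regions, supplied as simple (chrom, start, end) intervals in BED-style 0-based, half-open coordinates; the region, coordinates and file names are illustrative assumptions.

    # Illustrative consented target region (e.g. a single gene locus)
    TARGETS = [("chr7", 117479963, 117668665)]

    def in_scope(chrom, pos):
        """pos is the 1-based VCF coordinate; intervals are 0-based, half-open."""
        return any(c == chrom and start < pos <= end for c, start, end in TARGETS)

    with open("annotated.vcf") as src, open("masked.vcf", "w") as dst:
        for line in src:
            if line.startswith("#"):
                dst.write(line)               # keep all header lines
                continue
            chrom, pos = line.split("\t")[:2]
            if in_scope(chrom, int(pos)):
                dst.write(line)               # variants outside scope are dropped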

5.2 Data storage and backup

5.2.1 The laboratory must establish a procedure for the storage and backup of data with particular reference to the management of raw sequence data, primary, secondary, and tertiary analysis files. The data files to be stored long-term must be identified.

5.2.2 The laboratory must ensure adequate data storage and backup capacity is available.

For MPS data this may require terabytes of storage to accommodate primary and secondary analysis files. Network speed to manage data transfer and access also needs to be considered.

6. Resources

Validation (generic):

  • Jennings et al. Recommended Principles and Practices for Validating Clinical Molecular Pathology Tests. Arch Pathol Lab Med. 2009;133(5).
  • Mattocks et al. A standardized framework for the validation and verification of clinical molecular genetic tests. Eur J Hum Genet. 2010.
  • NPAAC. Requirements for the Development and Use of In-House In Vitro Diagnostic Medical Devices (Third Edition 2014).

Validation (secondary analyses):

  • Linderman et al. Analytical validation of whole exome and whole genome sequencing for clinical applications. BMC Medical Genomics. 2014;7:20.
  • Cornish and Guda. A comparison of variant calling pipelines using Genome in a Bottle as a reference. BioMed Research International.
  • Heinrich et al. The allele distribution in next-generation sequencing data sets is accurately described as the result of a stochastic branching process. Nucleic Acids Research.
  • Meynert et al. Variant detection sensitivity and biases in whole genome and exome sequencing. BMC Bioinformatics. 2014.
  • Zook et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nature Biotechnology. 2014.
  • Meynert et al. Quantifying single nucleotide variant detection sensitivity in exome sequencing. BMC Bioinformatics. 2013.
  • Chin et al. Assessment of clinical analytical sensitivity and specificity of next-generation sequencing for detection of simple and complex mutations. BMC Genetics. 2013.
  • Pirooznia et al. Validation and assessment of variant calling pipelines for next-generation sequencing. Human Genomics. 2014.
  • O’Rawe et al. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Med. 2013;5:28.

Validation (tertiary analyses):

  • Walters-Sen et al. Variability in pathogenicity prediction programs: impact on clinical diagnostics. Molecular Genetics & Genomic Medicine. 2014.
  • McCarthy et al. Choice of transcripts and software has a large effect on variant annotation. Genome Med. 2014;6:26.

Guidelines (tertiary analyses/annotation):

  • ACGS. Practice Guidelines for the Evaluation of Pathogenicity and the Reporting of Sequence Variants in Clinical Molecular Genetics.
  • ACMG. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genetics in Medicine. 2015.
  • CMGS. Practice guidelines for Targeted Next Generation Sequencing Analysis and Interpretation.

Other:

  • College of American Pathologists’ Laboratory Standards for Next-Generation Sequencing Clinical Tests. doi: 10.5858/arpa.2014-0250-CP
