IT Infrastructure

Genomic technologies introduce complex analytical methods that require substantial bioinformatic and IT infrastructure, areas that are not the usual domain of regulatory and/or accreditation agencies. This chapter discusses the specific IT infrastructure issues that should be addressed by laboratories considering genomic methods.

Following sequencing, primary analysis (base-calling) usually occurs on the sequencing instrument and is beyond the control of the user.

Secondary analysis (alignment and variant-calling) can occur on- or off-instrument.

Tertiary analysis usually occurs off-instrument. Most sequencing instrument manufacturers provide appropriate computing power and storage on-instrument. Where analyses occur off-instrument, the laboratory will need to consider the following issues:

  • The level of processing power required to perform timely analyses
  • The need to ensure data integrity during transfer across a network
  • Data management and storage

Given the vast potential of genomic methods to generate genome-wide data, laboratories will need to actively consider precisely which data they will store and for how long that data will be retained. In some cases, institutional IT departments and policies may be able to accommodate the data within centralized storage facilities; in many other cases this will not be possible and the problem will need to be addressed locally.

Specific requirements will vary according to the platform and style of analysis (i.e. the requirements for small-scale targeted sequencing will differ from those for whole-exome or whole-genome sequencing). The choice of computing hardware specification (i.e. type and number of CPUs or GPUs, amount of RAM, type and amount of storage, platform and operating system) will be governed by the chosen software/analytical pipelines (see Bioinformatics chapter).

Further consideration should be given to equipment that exceeds the minimum specifications in order to reduce processing time and, hence, turnaround time.

The choice of operating system will also be largely determined by the specific software and analytical programs being used. At a minimum, a 64-bit operating system should be installed (memory allocation can be severely restricted under 32-bit operating systems).
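
By way of illustration, a simple check such as the following Python sketch (hypothetical, not part of any particular pipeline) can confirm that the host architecture and interpreter are 64-bit before memory-intensive analysis tools are deployed:

```python
# Illustrative check only: confirm a 64-bit environment before installing
# memory-intensive analysis tools.
import platform
import sys

is_64bit_interpreter = sys.maxsize > 2**32   # False on a 32-bit Python build
architecture = platform.machine()            # e.g. 'x86_64', 'AMD64', 'arm64'

print(f"64-bit interpreter: {is_64bit_interpreter}")
print(f"Machine architecture: {architecture}")

if not is_64bit_interpreter:
    raise SystemExit("32-bit environment detected: memory allocation will be restricted")
```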

Wherever possible, data should not be transferred using USB “memory sticks” or external hard drives.

Genomic methods have the capacity to generate very large data files. During analysis, data may need to be transferred between different computing hardware (i.e. from sequencer to analytical computer or from sequencer to storage location). A minimum data transfer speed of 1 gigabit/second (i.e. Gigabit Ethernet) is suggested. This requirement applies to network cables as well as routers/switches. Infrastructure capable of faster transfers will reduce the delays introduced by the transfer of large files.
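
To illustrate the impact of link speed, the following Python sketch estimates transfer times for a hypothetical 100 GB run folder; the file size and link efficiency are assumptions for illustration only, not measured values.

```python
# Illustrative estimate of network transfer time for a sequencing run folder.
# The 100 GB file size and 70% link efficiency are assumptions, not measurements.

def transfer_time_minutes(size_gb: float, link_gbit_per_s: float, efficiency: float = 0.7) -> float:
    """Approximate wall-clock transfer time in minutes for a given link speed."""
    size_gbit = size_gb * 8                      # 1 gigabyte = 8 gigabits
    return size_gbit / (link_gbit_per_s * efficiency) / 60

for link_speed in (1, 10):                       # Gigabit vs 10-Gigabit Ethernet
    minutes = transfer_time_minutes(100, link_speed)
    print(f"{link_speed} Gbit/s link: ~{minutes:.0f} minutes for 100 GB")
```

Under these assumptions, the suggested 1 Gbit/s minimum moves a 100 GB run folder in roughly 15–20 minutes, while a 10 Gbit/s link reduces this to around two minutes.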

Data integrity during transfer is a significant issue, especially as files increase in size. Laboratories should implement a system to demonstrate that data transferred between different elements of their computing hardware have not been corrupted during the transfer. Consideration should also be given to similar mechanisms for data transferred to external organisations for analysis. Checksums for individual files or compressed archives can be generated using a variety of software packages.
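
By way of example, checksums can be generated with standard command-line tools (e.g. md5sum or sha256sum) or with a short script. The following Python sketch, using hypothetical file names, verifies that a copied file is bit-for-bit identical to its source:

```python
# Illustrative sketch: generate and compare SHA-256 checksums to confirm that
# a file copied across the network is identical to the original.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical file names, for illustration only
source = Path("run001.fastq.gz")
copy = Path("/mnt/archive/run001.fastq.gz")

if sha256_of(source) == sha256_of(copy):
    print("Checksums match: transfer verified")
else:
    print("Checksum mismatch: file corrupted during transfer")
```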

3.4.1 Resources

See also resources section in Ethical & Legal Issues.

During data generation and analysis, a series of files of varying sizes are created.

In Sanger sequencing, the stored data includes unedited chromatograms (“raw” data), edited chromatograms, sequence alignments and summarized results/reports. Equivalent components can be identified within NGS pipelines, although the amount of storage required will be significantly larger.

Some genomic data may need to be repeatedly accessed and analysed over a longer period than is anticipated by typical data retention policies (e.g. whole-genome or whole-exome data). Where possible, the laboratory should determine the feasibility of very long-term data retention. The laboratory should develop a formal data management policy which minimizes the possibility of data loss. During the course of analysis, genomic data will be transferred to a number of different computers for processing and/or storage.
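
As part of such a policy, it can be helpful to project storage requirements over the intended retention period. The following Python sketch is illustrative only; the per-sample sizes, sample numbers and retention period are assumptions and should be replaced with the laboratory's own measured figures.

```python
# Illustrative projection of storage required if every sample is retained for
# the full retention period. All figures below are assumptions, not measurements.

ASSUMED_GB_PER_SAMPLE = {      # retained data per sample (assumed values)
    "targeted_panel": 2,
    "whole_exome": 15,
    "whole_genome": 120,
}

def projected_storage_tb(assay: str, samples_per_year: int, retention_years: int) -> float:
    """Total storage (TB) needed to retain all samples for the retention period."""
    total_gb = ASSUMED_GB_PER_SAMPLE[assay] * samples_per_year * retention_years
    return total_gb / 1000

print(f"Whole genome, 500 samples/year, 10-year retention: "
      f"~{projected_storage_tb('whole_genome', 500, 10):.0f} TB")
```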

The specific choice of computing hardware for storage purposes will vary between laboratories. The specifications of storage devices will be substantially different from those of processing devices (see above); the important characteristics of storage devices are capacity, speed and redundancy. It is suggested that “solid state” devices are inappropriate for long-term data storage, as their life-span has not been empirically determined.

Cloud storage has the potential to reduce data loss due to hardware failure and is readily scalable, but bandwidth for access and the confidentiality of identifiable data remain major concerns.

4.2.1 Resources

4.2.2 Other Resources

This section lists some of the documents which address quality issues in genomic sequencing generally. More specific references are provided in subsequent sections of this document.

