IT Infrastructure

1. Introduction

Genomic technologies introduce complex analytical methods that require substantial bioinformatic and IT infrastructure, an area that is not the usual domain of regulatory and/or accreditation agencies. This chapter discusses the specific IT infrastructure issues that should be addressed by laboratories considering genomic methods.

1.1 IT process overview

Following sequencing, primary analysis (base-calling) usually occurs on the sequencing instrument and is beyond the control of the user.

Secondary analysis (alignment and variant-calling) can occur on- or off-instrument.

Tertiary analysis usually occurs off-instrument. Most of the sequencing manufacturers provide appropriate computing power and storage on-instrument. Where analyses occur off-instrument, it is necessary for the laboratory to consider the following issues:

  • The level of processing power required to perform timely analyses
  • The need to ensure data integrity during transfer across a network
  • Data management and storage

Given the vast potential of genomic methods to generate genome-wide data, laboratories will need to actively consider precisely which data they will store and for how long. In some cases, institutional IT departments and policies may be able to accommodate data within centralized storage facilities. However, there may be many cases where this is not possible and the problems will need to be addressed locally.

2. Data processing infrastructure and capacity

2.1 The computing hardware and other IT infrastructure should be fit for purpose.

Specific requirements will vary according to the platform and style of analysis (e.g. the requirements for small-scale targeted sequencing will differ from those for whole-exome or whole-genome sequencing). The choice of computing hardware specification (e.g. type and number of CPUs or GPUs, amount of RAM, type and amount of storage, platform and operating system) will be governed by the chosen software/analytical pipelines (see Bioinformatics chapter).

2.2 Computing hardware should at least meet the minimum specifications of the software.

Further consideration should be given to equipment which exceeds the minimum specifications in order to reduce processing time, and hence turnaround time.

2.3 The laboratory should show that the choice of hardware and software can be maintained appropriately, including installation, updates and troubleshooting.

The choice of operating system will also be largely determined by the specific software and analytical programs being used. At a minimum, a 64-bit operating system should be installed (memory allocation can be severely restricted in some/all 32-bit operating systems).
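The minimum-specification and 64-bit checks described in 2.2 and 2.3 can be automated as a simple pre-flight script. The following is a minimal sketch only: the thresholds `MIN_CPU_CORES` and `MIN_FREE_DISK_GB` are hypothetical placeholders, not values taken from any vendor, and should be replaced with the figures published for the chosen software. Note also that the pointer-size test reports whether the Python interpreter is 64-bit, which is only an indirect indication of the operating system.

```python
import os
import shutil
import struct

# Hypothetical minimums (assumptions for illustration only);
# replace with the figures published for the chosen software.
MIN_CPU_CORES = 8
MIN_FREE_DISK_GB = 500


def is_64bit_interpreter() -> bool:
    """Return True when this Python interpreter uses 64-bit pointers."""
    return struct.calcsize("P") * 8 == 64


def check_minimums(cpu_cores, free_disk_gb, is_64bit,
                   min_cores=MIN_CPU_CORES, min_disk_gb=MIN_FREE_DISK_GB):
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []
    if not is_64bit:
        failures.append("64-bit environment required")
    if cpu_cores < min_cores:
        failures.append(f"CPU cores: {cpu_cores} < {min_cores}")
    if free_disk_gb < min_disk_gb:
        failures.append(f"free disk: {free_disk_gb:.0f} GB < {min_disk_gb} GB")
    return failures


if __name__ == "__main__":
    cores = os.cpu_count() or 0
    free_gb = shutil.disk_usage(".").free / 1e9  # decimal gigabytes
    problems = check_minimums(cores, free_gb, is_64bit_interpreter())
    for problem in problems:
        print("FAIL:", problem)
    if not problems:
        print("All minimum-specification checks passed")
```

Installed RAM is deliberately not checked here because the Python standard library has no portable call for it; a laboratory would need a platform-specific method for that figure.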

2.4 The chosen computing hardware should be shown capable of performing the required analyses and/or capable of running the chosen software using training/control datasets (i.e. datasets with characteristics consistent with clinical samples to be analysed).
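One way to document that the hardware can run the chosen software is to time the pipeline on a control dataset and compare the result with the laboratory's turnaround target. The sketch below assumes the pipeline can be launched as a single command line; the command and the time limit shown in the usage note are hypothetical.

```python
import subprocess
import sys
import time


def time_command(command):
    """Run a command to completion and return the elapsed wall-clock seconds.

    Raises subprocess.CalledProcessError if the command exits with an error,
    so a failed run is never recorded as a successful benchmark.
    """
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


def meets_target(elapsed_seconds, target_seconds):
    """Return True when the run finished within the laboratory's time target."""
    return elapsed_seconds <= target_seconds


if __name__ == "__main__":
    # Placeholder command: substitute the real pipeline invocation
    # and the path to the control dataset.
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"Elapsed: {elapsed:.2f} s")
```

For example, `meets_target(time_command(pipeline_command), 4 * 3600)` would check a hypothetical four-hour target, where `pipeline_command` is the laboratory's own command list.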

Datasets may be supplied by software providers, or may be obtained externally.

3. Data Transfer

3.1 Wherever possible, data should not be transferred using USB “memory sticks” or external hard drives.

3.2 Consideration should be given to the use of high-speed network connections between the various components of the computing hardware.

Genomic methods have the capacity to generate very large data files. During analysis, data may need to be transferred to different computing hardware (e.g. from sequencer to analytical computer or from sequencer to storage location). A speed of 1 gigabit/second (i.e. Gigabit Ethernet) is suggested as a minimum data transfer speed. This requirement will affect network cables as well as routers/switches. Infrastructure capable of faster transfers will reduce delays introduced by the transfer of large files.
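The effect of link speed on transfer delay can be estimated with simple arithmetic: file size in gigabits divided by the usable link rate. The sketch below is illustrative only. The 100 GB file size is a hypothetical example, and the `efficiency` parameter is an assumption standing in for protocol overhead and network contention, which in practice reduce throughput below the nominal rate.

```python
def transfer_time_seconds(file_size_gb, link_gbit_per_s, efficiency=1.0):
    """Estimate the seconds needed to move a file over a network link.

    file_size_gb    -- file size in decimal gigabytes (1 GB = 8 gigabits)
    link_gbit_per_s -- nominal link speed in gigabits per second
    efficiency      -- assumed fraction of the nominal speed actually achieved
    """
    if link_gbit_per_s <= 0:
        raise ValueError("link speed must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    size_gbit = file_size_gb * 8
    return size_gbit / (link_gbit_per_s * efficiency)


if __name__ == "__main__":
    # Hypothetical 100 GB data set at the suggested minimum and at a faster link.
    for speed in (1, 10):
        minutes = transfer_time_seconds(100, speed) / 60
        print(f"{speed} Gbit/s: about {minutes:.1f} minutes")
```

Under these idealised assumptions a 100 GB data set takes 800 seconds (roughly 13 minutes) at 1 gigabit/second and 80 seconds at 10 gigabit/second.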

3.3 Confidentiality of data should be maintained during data transfer.

3.4 Appropriate steps should be taken to ensure that data corruption does not occur during transfer.

This is a significant issue, especially as files increase in size. Laboratories should implement a system to show that data transferred between different elements of their computing hardware have not been corrupted during the transfer. Consideration should also be given to similar mechanisms for data transferred to external organisations for analysis. Checksums for individual files or compressed files can be generated using a variety of software packages.
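As an illustration of the checksum approach, the sketch below computes a SHA-256 digest of a file before and after transfer and compares the two. SHA-256 is chosen here as an assumption; the guidance above does not mandate a particular algorithm, and the function names are hypothetical. The file is read in chunks so that very large sequencing files do not need to fit in memory.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(source_path, destination_path):
    """Return True when source and destination files have identical digests."""
    return sha256_of_file(source_path) == sha256_of_file(destination_path)
```

In practice the digest of the source file would typically be recorded at the point of generation and stored alongside the data, so that the copy can be re-verified later without access to the original.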

3.4.1 Resources

See also resources section in Ethical & Legal Issues.

4. Data management and storage

4.1 The laboratory should determine and justify which data are to be stored.

During data generation and analysis, a series of files of varying sizes are created.

In Sanger sequencing, the stored data includes unedited chromatograms (“raw” data), edited chromatograms, sequence alignments and summarized results/reports. Equivalent components can be identified within NGS pipelines, although the amount of storage required will be significantly larger.

Some genomic data may need to be repeatedly accessed and analysed over a longer period than typical data retention policies anticipate (e.g. whole genome or whole exome data). Where possible, the laboratory should determine the feasibility of very long-term data retention. The laboratory should develop a formal data management policy which minimizes the possibility of data loss. During analysis, genomic data will be transferred to a number of different computers for analysis and/or storage.

4.2 The laboratory should ensure that data are stored in a manner that prevents loss in the event of hardware failure (i.e. data should have redundant backup).

The specific choice of computing hardware for storage purposes will vary between laboratories. The specifications of storage devices will be substantially different from the specifications of processing devices (see above). The important characteristics of storage devices will be quantity, speed and redundancy. It is suggested that “solid state” devices are inappropriate for long-term data storage as their life-span has not been empirically determined.
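The storage quantity needed can be roughly estimated from sample throughput, data size per sample, retention period and the number of redundant copies. The sketch below is a simplified model: all input figures in the example are hypothetical, and it assumes a constant workload and no compression or deletion during the retention period.

```python
def storage_required_tb(samples_per_year, gb_per_sample, retention_years, copies=2):
    """Estimate the total storage (decimal terabytes) held at the end of the retention period.

    copies -- number of redundant copies kept of every file (an assumed value;
              2 means one primary copy plus one backup)
    """
    if min(samples_per_year, gb_per_sample, retention_years) < 0 or copies < 1:
        raise ValueError("inputs must be non-negative and copies must be at least 1")
    total_gb = samples_per_year * gb_per_sample * retention_years * copies
    return total_gb / 1000


if __name__ == "__main__":
    # Hypothetical laboratory: 500 samples/year, 100 GB retained per sample,
    # 10-year retention, two copies of every file.
    print(storage_required_tb(500, 100, 10), "TB")
```

With these hypothetical figures the estimate is 1000 TB, which illustrates why the decision about which files to retain (see 4.1) has a large effect on storage requirements.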

Cloud storage has the potential for reducing the loss of data due to hardware failure, and is readily scalable, but issues of bandwidth for access and confidentiality of identifiable data remain major concerns.

4.2.1 Resources

4.2.2 Other Resources

This section lists some of the documents which address quality issues in genomic sequencing generally. More specific references are provided in subsequent sections of this document.
