Storing, Protecting, and Providing Access to Petabytes of Genomic Data

Genomics England has ambitious aims. The organization was established in 2013 by the UK’s Department of Health & Social Care to sequence the genomes of 100,000 people, generating new insights that can help improve treatments—while also accelerating the development of the UK genomics industry. In 2018, the project was significantly expanded: the new goal was to sequence up to five million genomes over five years.

Unfortunately, the existing network-attached storage (NAS) solution used for storing genomic data was not up to the task. The NAS, which held 21 PB of data, had reached its node-scaling limit. “We needed something that’s much more scalable than existing NAS solutions, an infrastructure that could grow to hundreds of petabytes,” says David Ardley, head of technical delivery at Genomics England. A new solution also had to facilitate simple, flexible access to data by more than 3,000 researchers around the world.

Using Quantum ActiveScale Object Storage
Genomics England called on Nephos Technologies, an independent UK-based data services organization, to design and implement a new storage solution. Together, teams from Nephos and Genomics England deployed a multi-faceted solution that incorporates a WekaIO high-performance file system, Mellanox high-speed networking, and ActiveScale object storage.

The solution creates a two-tier architecture that combines flash storage with an ActiveScale object storage system, which serves as a long-term data lake repository. The two storage tiers, each of which can be scaled independently, present as a single hybrid storage environment. As a result, researchers have the flexibility to query data in a highly randomized fashion.

Taking on New Challenges During the COVID-19 Pandemic
Within a few years of deploying the new storage environment, Genomics England needed to expand again. The emergence of the COVID-19 pandemic in early 2020 presented new, urgent challenges for the global medical-scientific community, and Genomics England was in a prime position to help better understand who is susceptible to the virus. The organization committed to sequencing the genomes of up to 20,000 intensive care patients with COVID-19 plus up to 15,000 people with the virus who are experiencing only mild symptoms.

Around the same time that Genomics England was ramping up participation in COVID-19 research, the ActiveScale solution platform was acquired by Quantum. A Quantum team facilitated a smooth transition for Genomics England, which expanded the object-storage environment from 40 PB to more than 100 PB.

Scaling was seamless. “What we love about the ActiveScale system is that its inherent architecture is underpinned by its RAID replacement technology, the intelligent, dynamic placement of erasure-coded data,” says Ardley. That dynamic placement eliminates the need for system rebalancing, which can compromise performance and availability.

Protecting Vital Genomic Data
ActiveScale object storage provides the data resiliency that Genomics England needs for its critical work. The organization takes advantage of the geo-distributed capability of ActiveScale, distributing data across three data centers to protect against a major disaster such as the loss of an entire site.
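
As a rough illustration of why spreading erasure-coded data across three sites can protect against site loss, the following Python sketch checks whether enough shards remain to rebuild an object after the largest site goes offline. The shard counts (12 data plus 6 parity) and the function names are hypothetical, chosen only for the example. They are not taken from ActiveScale documentation or from the Genomics England deployment.

```python
def shards_per_site(total_shards: int, sites: int) -> list[int]:
    """Spread shards as evenly as possible across the given number of sites."""
    base, extra = divmod(total_shards, sites)
    return [base + (1 if i < extra else 0) for i in range(sites)]


def survives_site_loss(data_shards: int, parity_shards: int, sites: int) -> bool:
    """Return True if losing any single site still leaves at least
    `data_shards` shards, the minimum an erasure code needs to rebuild."""
    layout = shards_per_site(data_shards + parity_shards, sites)
    worst_case_remaining = sum(layout) - max(layout)
    return worst_case_remaining >= data_shards


# Hypothetical scheme: 12 data + 6 parity shards over 3 sites (6 per site).
print(shards_per_site(18, 3))        # [6, 6, 6]
print(survives_site_loss(12, 6, 3))  # True: 12 shards remain after a site loss
print(survives_site_loss(12, 3, 3))  # False: only 10 of the required 12 remain
```

Under these assumed numbers, the object stays recoverable after a full site loss at a raw-capacity overhead of 1.5x, whereas keeping a complete copy at each of three sites would require 3x.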

Gaining Scalability While Controlling Costs and Complexity
With ActiveScale, Genomics England no longer faces the capacity limits of its previous NAS solution. The organization has expanded its object storage to support more genomic analysis and taken on the additional COVID-19 work without a major storage overhaul.

This scalable storage environment also helps reduce costs. According to Nephos, the Genomics England team decreased storage costs by 75 percent per genome compared with the previous environment. The organization is expected to reduce costs by 96 percent by 2023.

Just as important, the Genomics England team has experienced these benefits without adding complexity. The new integrated storage environment makes it simple for researchers from around the world to store and access the genomic data they need for their work.