The Scalable Data Infrastructure for Science Initiative

The challenges of research data management, data validation and analysis, and data governance—combined with the technical complexity of building and managing high-volume data infrastructure—make delivering robust, scalable, and high-performance data solutions a daunting task. Add the powerful resources that create mountains of scientific data from computer simulations, experiments, and observations at various facilities, and that daunting task becomes a grand challenge.

The Scalable Data Infrastructure for Science (SDIS) Initiative in the Computing and Computational Sciences Directorate at the Department of Energy’s Oak Ridge National Laboratory aims to tackle this grand challenge by defining and building a scalable data framework that addresses every stage of the data lifecycle to support scientists’ workflows and, by extension, accelerate scientific breakthroughs.

Modern science is driven by large-scale instruments and scientific user facilities that produce increasing amounts of highly heterogeneous data, which is then analyzed by teams of specialists from different domains. The increasing variety and volume of data, along with the large teams formed through multi-facility collaboration among distinct domain specialists, create several challenges that must be addressed to plan and conduct scientific research efficiently and effectively. Most facilities are grappling independently with these challenges in data acquisition, pre-processing, collection, management, analysis, and publication, all of which must be resolved to create a data-rich scientific ecosystem.

SDIS has three focus areas that support researchers by building scalable data infrastructure and providing data management guidance across the entire data lifecycle:

  • Scientist’s sandbox: With an ecosystem of computational resources, data needs to be moved around the planet with low latency for immediate consumption. The sandbox framework supports scientists’ creativity across a data ecosystem, from inception to experiment to results, while allowing facilities to deploy and maintain their own environments.
  • Library: A key part of science is knowledge dissemination. The SDIS library supports the publication and archival stages of the data lifecycle and is part of the infrastructure needed to put the FAIR data principles (findable, accessible, interoperable, and reusable) into practice. Its main feature is an ecosystem of libraries that are interoperable and searchable across facilities.
  • Governance: Data governance is the process of managing the availability, usability, integrity, and security of data based on data standards and policies. Through the SDIS Data Assets Council, a governance framework will be defined to provide guidance and best practices in data management, interoperability, data sharing, privacy, and other data-related topics.