Data has evolved from a nice-to-have into something that organizations depend on heavily to improve customer experience and generate revenue. Before explaining Data Reliability Engineering (DRE), let us first explain what Site Reliability Engineering (SRE) is. SRE is a collection of practices and principles that takes all aspects of software engineering and applies them to operations and infrastructure problems. The main objective of SRE is to create highly reliable and scalable software systems.

DRE is a sub-field of SRE. Just as SRE deals with the reliability of an organization's software systems, DRE deals with the reliability of an organization's data infrastructure.

Why is Data Reliability Engineering Important?

Data Reliability Engineering essentially means refining and improving the quality of data, moving data on time, and ensuring that analytics and Artificial Intelligence (AI) products receive appropriate data inputs. This work is done by data scientists, data engineers, and analytics engineers, who historically did not have mature processes and data infrastructure tools at their disposal. Modern software engineering and DevOps teams already enjoy the luxury of such tools and processes. As a result, data reliability work today tends to involve kicking off late-night backfills, spot-checking data, and hand-rolling a bit of SQL-into-Grafana monitoring rather than repeatable processes such as incident management and supervision.
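To make the "hand-rolled spot-check" concrete, it often amounts to a small script that queries a table and asserts basic invariants such as freshness and row count. A minimal sketch in Python, using an in-memory SQLite table as a stand-in for a production warehouse (the `orders` table, `loaded_at` column, and thresholds are hypothetical, not from any particular stack):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def spot_check(conn, table, ts_column, max_staleness_hours=24, min_rows=1):
    """Hand-rolled data quality check: verify row count and freshness."""
    cur = conn.execute(f"SELECT COUNT(*), MAX({ts_column}) FROM {table}")
    row_count, latest_ts = cur.fetchone()
    issues = []
    if row_count < min_rows:
        issues.append(f"{table}: only {row_count} rows (expected >= {min_rows})")
    if latest_ts is None:
        issues.append(f"{table}: no timestamps found")
    else:
        age = datetime.now(timezone.utc) - datetime.fromisoformat(latest_ts)
        if age > timedelta(hours=max_staleness_hours):
            issues.append(f"{table}: data is stale by {age}")
    return issues

# Demo with an in-memory table (schema is illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, loaded_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, ?)",
             (datetime.now(timezone.utc).isoformat(),))
print(spot_check(conn, "orders", "loaded_at"))  # [] -> table is fresh
```

Scripts like this get the job done, but they live outside any shared process: no alert routing, no runbook, no ownership. That gap is precisely what DRE borrows from SRE to close.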

To sum up, data reliability is an organization’s ability to offer high data availability and health through the whole data life cycle.

Database Reliability is now a specialization in its own right within the context of SRE. A Database Reliability Engineer is someone who takes care of the reliability of data infrastructure, which comprises databases, data pipelines, deployments, CI/CD, high availability (HA), access and privileges (in some cases), storage and archival, and data warehouses. In short, the Database Reliability Engineer takes care of all the pieces of infrastructure that a Data Engineer requires to do their job.

Under the name DRE, some data teams are starting to change this state of affairs by borrowing practices from SRE and DevOps.

Why Is This Happening Now?

Data quality has received growing attention over the last two years. This is the result of a few trends converging at the same time.

  • Data is being used in ever-higher-impact applications: product recommendations, support chatbots, financial planning, inventory management, and much more. These data-driven applications offer huge gains in efficiency, but they can also cause real losses to the business when there is a data outage. As organizations push for higher- and higher-ROI use cases, their dependency on data grows, which raises the bar for reliability and quality.
  • Humans are less in the loop: ML models that retrain at regular intervals, streaming data, self-service dashboards, and other applications reduce the need for a human in the loop. This means pipelines have to be reliable by default, because there is no analyst or data scientist available to spot-check the data anymore.
  • There are not enough data engineers available: hiring data engineers is expensive and difficult. This puts immense pressure on teams to be resource-effective, avoid reactive firefighting, and automate problem detection and resolution.
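The last point, automating problem detection, can start very simply: instead of a human eyeballing a dashboard, a script flags a pipeline metric that deviates sharply from its recent history. A minimal sketch using a z-score over daily row counts (the metric, sample values, and threshold are illustrative assumptions, not a prescribed method):

```python
from statistics import mean, stdev

def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it deviates from `history` by more than
    `z_threshold` standard deviations (a simple z-score check)."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

# Daily row counts for a hypothetical pipeline.
daily_rows = [10_120, 9_980, 10_340, 10_050, 10_210]
print(is_anomalous(daily_rows, 10_100))  # False: within normal range
print(is_anomalous(daily_rows, 120))     # True: likely a broken load
```

Even a crude check like this frees a team from manually reviewing every table, which is exactly the kind of leverage that automation is meant to buy.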

The Future of Data Reliability Engineering 

DRE is a relatively new concept and is part of Data Operations (DataOps), which refers to the broader set of operational issues that data platform owners face. Many companies are helping to create and define the tools and practices that will make DRE as mature and effective a discipline as SRE.