Data Mesh Architecture: Understanding the Four Key Components

by Spencer Nguyen • October 12, 2022

Many organizations invest heavily in a centralized data platform where information is gathered, stored, and managed. Their data engineers build pipelines that turn difficult-to-decipher datasets into data that data scientists, analysts, and other consumers can use. The data mesh concept, championed by Zhamak Dehghani, director of technology at the IT consultancy ThoughtWorks, takes a different route: it decentralizes data so that domain teams can conduct cross-domain data analysis independently. Before delving into its architecture, let us first understand what data mesh is and how it functions.

What Is Data Mesh?

The data mesh is a recent response to extensive monolithic data platforms that involve giant silos of data lakes and warehouses. It also responds to the "great divide of data," the sharp separation between operational and analytical data.

The data mesh attempts to address the bottlenecks of centralized data architecture. To do so, it takes inspiration from the domain-driven design popularly used in software engineering and applies its concepts to data architecture.

It is worth noting that data mesh promotes the adoption of cloud-native and cloud platform technologies to scale and achieve data management goals.

Experts in the field of data science often compare this concept to microservices. A microservices architecture brings together lightweight services to deliver the functionality of an application. Data mesh does something similar for data warehousing by adopting a distributed architecture along with decentralized governance.

Principles of Data Mesh Architecture

Data mesh is not just another type of data architecture; it is a new concept that helps design modern data architectures. The approach combines data-centric concepts with organizational constructs. To understand how data mesh works, let’s look at the four critical components of data mesh architecture.

Domain-Oriented Decentralized Ownership of Data

A domain is an area of control or a sphere of knowledge within a business. In practice, however, it is often hard to agree on what counts as a domain within an organization. One way to address this challenge is to start from source systems. Source systems are the starting point of a data warehouse: they are the data sources on which the entire warehouse is based.

Dehghani refers to these source systems as "systems of reality." This is because the datasets from those systems have not been changed in any way, so they give a clear picture of what happened in the business. From the data perspective, a domain can be formed by coupling the source systems whose data is interconnected into a domain dataset. These domain datasets come together to form a source domain dataset.

Once it is clear what defines a domain, data ownership comes into play. Each domain is responsible for managing its own data, which involves ingesting, cleansing, and aggregating it to produce data assets for business intelligence apps.
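To make the ingest, cleanse, and aggregate responsibilities concrete, here is a minimal Python sketch of a pipeline a domain team might own. The "sales" domain, the `RAW_ORDERS` records, and the function names are all hypothetical and chosen only for illustration; a real domain would read from its actual source systems.

```python
from collections import defaultdict

# Hypothetical raw records from a "sales" domain's source system.
RAW_ORDERS = [
    {"order_id": "1", "region": "EU", "amount": "120.50"},
    {"order_id": "2", "region": "eu ", "amount": "79.50"},
    {"order_id": "3", "region": "US", "amount": None},  # missing amount
    {"order_id": "4", "region": "US", "amount": "200.00"},
]


def ingest(source):
    """Copy records out of the source system without changing them."""
    return [dict(record) for record in source]


def cleanse(records):
    """Drop records with no amount and normalize region and amount."""
    cleaned = []
    for record in records:
        if record["amount"] is None:
            continue
        cleaned.append({
            "order_id": record["order_id"],
            "region": record["region"].strip().upper(),
            "amount": float(record["amount"]),
        })
    return cleaned


def aggregate(records):
    """Sum order amounts per region to produce the data asset."""
    totals = defaultdict(float)
    for record in records:
        totals[record["region"]] += record["amount"]
    return dict(totals)


# The domain team runs the whole chain itself, end to end.
revenue_by_region = aggregate(cleanse(ingest(RAW_ORDERS)))
```

The point of the sketch is that all three steps live inside the domain, so the team that knows the data best is the one that decides how to clean and summarize it.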

It is crucial to note that data mesh divides data ownership between the data product owner and the data engineer, the two primary roles in a domain. The data product owner creates the blueprint for data products. This role is geared toward customer satisfaction, meaning that users are content with the data they receive. The data product owner needs to understand the user personas to stay abreast of their needs.

The data engineer focuses on building the data product within the domain. This role combines software engineering expertise with data engineering skills, especially competency in ETL solutions such as AWS Glue and IBM DataStage.

The goal is to apply the concepts of domain-driven design used in software architectures to data architecture to create analytical data products. The result is a decentralized approach in which many parties collaborate to provide high-quality data.

Data as a Product

From a strategic perspective, data as a product gives ownership of an organization’s data to separate domain teams rather than a central data team. The data product is central to the data mesh architecture and is built to answer essential business questions.

Accordingly, a data product is used by data customers and is essential to make a business data-driven. A data customer is any individual in the organization who requires data to accomplish a task or function. For instance, analysts might need specific data for predictive modeling, so they are data customers.

What exactly is a data product? A data product can refer to any of the following:

Data files, views, and streams

Metadata, such as column, row, and other definitions, that makes the data easier to consume

Access patterns, also called query patterns, which outline how the system and its users can access data to satisfy business needs

The data infrastructure and code that comprise the meta-part of the data

To summarize, a data product is an amalgamation of data, metadata, infrastructure, and code. Dehghani describes data products as having the following qualities:

Discoverable, meaning it should be listed in a data catalog or registry with metadata so you can identify its owners and source.

Addressable, so that users can access it. This requires the development of standard global conventions.

Trustworthy and truthful, providing provenance and lineage to data customers. This also requires data owners to take responsibility for data quality by strictly adhering to predefined objectives.

Self-describing, with data schemas whose semantics and syntax help data customers self-serve the data products.

Interoperable and governed by global standards to facilitate collaboration between domains when required. For this purpose, discoverability, governance, formatting, and metadata should be standardized.

Secure and governed by global access control so users can access data securely.
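The qualities above can be pictured as fields on a small descriptor that travels with each data product. The following Python sketch is illustrative only: the `DataProduct` and `Catalog` classes, the `mesh://` address convention, and the role names are assumptions made for this example, not part of any data mesh standard or product.

```python
from dataclasses import dataclass


@dataclass
class DataProduct:
    name: str                 # discoverable: the name it is registered under
    owner: str                # discoverable: who is responsible for it
    address: str              # addressable: follows one global convention
    schema: dict              # self-describing: column name -> type
    lineage: list             # trustworthy: upstream sources it was built from
    allowed_roles: frozenset  # secure: roles that may read it


class Catalog:
    """Hypothetical in-memory registry shared by all domains."""

    ADDRESS_PREFIX = "mesh://"  # assumed global addressing convention

    def __init__(self):
        self._products = {}

    def register(self, product):
        # Interoperable: every domain must follow the same address format.
        if not product.address.startswith(self.ADDRESS_PREFIX):
            raise ValueError(f"address must start with {self.ADDRESS_PREFIX}")
        self._products[product.name] = product

    def find(self, name):
        return self._products[name]

    def can_read(self, name, role):
        return role in self._products[name].allowed_roles


revenue = DataProduct(
    name="revenue-by-region",
    owner="sales-team",
    address="mesh://sales/revenue-by-region",
    schema={"region": "string", "revenue": "float"},
    lineage=["sales.orders"],
    allowed_roles=frozenset({"analyst", "data-scientist"}),
)

catalog = Catalog()
catalog.register(revenue)
```

With this in place, a data customer can look up `catalog.find("revenue-by-region")` to learn the owner, schema, and lineage before requesting access, which is the self-serve behavior the list describes.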

Self-Serve Data Infrastructure as a Platform

Every enterprise has logically autonomous domains. These domains serve various purposes, from supporting business processes and functions to developing data products. Consequently, there is a need for an underlying infrastructure for data products. Different enterprise domains can easily use this self-serve data infrastructure to create data products.

Domain teams should therefore not have to deal with server errors, operating system and networking issues, and other infrastructure complexities. A central IT organization is tasked with providing a self-service data platform that enables the creation of high-quality data products that generate business value.

This platform is domain-agnostic, so it can be customized according to each domain. Domain engineers can use it to create end-to-end solutions without outside interference or vendor constraints. It supports efficient design and the creation of intelligible data products.

The self-serve data infrastructure platform is well suited to analytics. Although there is no shortage of analytics platforms, many of them do not scale well when it comes to accessing and sharing data. This proposed platform provides analytical data in a decentralized manner to a variety of data customers, including data analysts, engineers, scientists, and others who can create and manage their data products.

Moreover, self-service analytics empowers the employees in the organization to make data-driven decisions.
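As a rough illustration of what "self-serve" and "domain-agnostic" could mean in code, the sketch below models the platform as a small Python class that any domain can call without touching servers or networking. The `SelfServePlatform` class and its method names are hypothetical, and the in-memory dictionary stands in for real storage that a central IT team would operate.

```python
class SelfServePlatform:
    """Hypothetical in-memory stand-in for a centrally operated platform."""

    def __init__(self):
        # domain name -> {product name -> rows}
        self._workspaces = {}

    def provision(self, domain):
        """Create an isolated workspace for a domain (safe to call twice)."""
        self._workspaces.setdefault(domain, {})

    def publish(self, domain, product, rows):
        """Store a data product in the domain's own workspace."""
        if domain not in self._workspaces:
            raise KeyError(f"{domain!r} has no workspace; call provision() first")
        self._workspaces[domain][product] = list(rows)

    def read(self, domain, product):
        """Return a copy of a published data product's rows."""
        return list(self._workspaces[domain][product])

    def list_products(self):
        """List every published product as 'domain/product'."""
        return sorted(
            f"{domain}/{product}"
            for domain, products in self._workspaces.items()
            for product in products
        )


platform = SelfServePlatform()

# Two different domains use exactly the same calls.
platform.provision("sales")
platform.provision("logistics")
platform.publish("sales", "revenue-by-region", [{"region": "EU", "revenue": 200.0}])
platform.publish("logistics", "late-shipments", [{"shipment_id": "s-1", "days_late": 2}])
```

Nothing in the class knows what "sales" or "logistics" means, which is the sense in which such a platform is domain-agnostic: the same interface serves every domain.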

Federated Computational Governance

As data platforms provide more extensive access to self-service analytics and make crucial data available to employees without a background in data science or engineering, it is essential to have safeguards in place. Moreover, conventional centralized data governance may not be able to extract enough value from the data.

Data mesh advocates a paradigm shift in the governance of data. The idea is to make governance federated rather than centralized. This means that the responsibility for maintaining data quality and security lies with the business domains. The role of the central governing body, on the other hand, is limited to defining the framework or guidelines for data quality and security.

So, this data model requires the collaboration of domain teams and centralized data governance teams to fulfill the data requirements of the enterprise. It also emphasizes a shared data infrastructure layer that domains can use to build their own data pipelines that comply with security standards and guidelines. Federated computational governance ensures that each domain doesn’t have to build its infrastructure from the ground up.
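The "computational" part of federated governance is often read as expressing the central guidelines as code that each domain runs against its own data products. The Python sketch below illustrates that split under stated assumptions: the two policies, the product dictionaries, and the `check_compliance` helper are hypothetical examples, not rules prescribed by the data mesh literature.

```python
# Global policies, defined once by the central governance body.
def has_owner(product):
    """Every data product must name an owner."""
    return bool(product.get("owner"))


def pii_is_masked(product):
    """Every column flagged as PII must also be flagged as masked."""
    return all(
        column.get("masked", False)
        for column in product.get("columns", [])
        if column.get("pii", False)
    )


GLOBAL_POLICIES = {
    "has_owner": has_owner,
    "pii_is_masked": pii_is_masked,
}


def check_compliance(product, policies=GLOBAL_POLICIES):
    """Return the sorted names of the policies the product violates."""
    return sorted(name for name, rule in policies.items() if not rule(product))


# Each domain describes its own products and runs the shared checks itself.
customers = {
    "name": "customers",
    "owner": "crm-team",
    "columns": [
        {"name": "email", "pii": True, "masked": False},
        {"name": "country"},
    ],
}
orders = {
    "name": "orders",
    "owner": "sales-team",
    "columns": [{"name": "amount"}],
}

customer_violations = check_compliance(customers)  # email is unmasked PII
order_violations = check_compliance(orders)        # no violations
```

Here the central team owns only `GLOBAL_POLICIES`, while each domain remains responsible for fixing whatever `check_compliance` reports for its own products.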

Why Are APIs Integral to Data Mesh?

Application programming interfaces (APIs) allow an organization to access its data mesh. They provide a consistent method for specifying the data structure, security constructs, and query methods. So, APIs enable a data consumer to search and consume data in an organization’s data mesh.

Moreover, in a data mesh architecture each cross-functional team is responsible for the entire lifecycle of its data. This implies that each domain, made up of microservices, has its own file storage system and an OLAP database. The microservices expose their data through HTTP REST APIs, giving the data domain efficient interfaces that other teams can easily access.
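As a hedged illustration of a domain exposing a data product over a REST-style interface, the Python sketch below maps GET paths to JSON responses. The `/products` routes, the `DATA_PRODUCTS` contents, and the `handle_get` function are assumptions made for this example; wiring the function into an actual HTTP server is left out to keep the sketch self-contained.

```python
import json

# Hypothetical data products published by a "sales" domain.
DATA_PRODUCTS = {
    "sales/revenue-by-region": {
        "schema": {"region": "string", "revenue": "float"},
        "rows": [
            {"region": "EU", "revenue": 200.0},
            {"region": "US", "revenue": 200.0},
        ],
    },
}


def handle_get(path):
    """Return (status_code, json_body) for a GET request path."""
    prefix = "/products/"
    if path == "/products":
        # Discovery endpoint: list the names of all data products.
        return 200, json.dumps(sorted(DATA_PRODUCTS))
    if path.startswith(prefix):
        product = DATA_PRODUCTS.get(path[len(prefix):])
        if product is not None:
            # The response carries both the schema and the data.
            return 200, json.dumps(product)
    return 404, json.dumps({"error": "not found"})


status, body = handle_get("/products/sales/revenue-by-region")
```

Returning the schema alongside the rows gives consumers in other domains a consistent way to learn the data structure before they query it.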

DreamFactory’s data mesh enables you to create sophisticated and transparent APIs. Start your free 14-day trial here.

Spencer Nguyen

As a seasoned content moderator with a keen eye for detail and a passion for upholding the highest standards of quality and integrity in all of their work, Spencer Nguyen brings a professional yet empathetic approach to every task.