Tips for Weaving and Implementing a Successful Data Mesh

This past decade, organisations in every industry have been working diligently to consolidate all of their available data into a single common location. In the past, this common location has been a data warehouse, but more recently, it has been a data lake. Though they serve a similar purpose, the key difference between these storage mechanisms is that data lakes do not require information to be structured and organised, whereas traditional repositories such as data warehouses and data marts do.

Lately, however, businesses have begun to question whether the costs of building and managing data lakes justify their value. The centralised approach to data infrastructure has also presented several unintended consequences. The foremost of these is the knowledge discrepancy that arises when centralised data teams and individual business teams are unable to properly interpret the information at their disposal. This is compounded by the rigidity of the infrastructure itself: centralised data architectures are not designed to accommodate the needs of different departments across a business, which leads to information gaps. Finally, centralising data is a time-intensive process that leaves data consumers facing a significant time-to-value gap.

Organisations have begun to address these issues through a “data mesh” configuration – a new, decentralised architectural approach to data infrastructure. Deloitte has described this concept as a democratised approach to data management that sees each business domain operationalise its own data with the backing of a central self-service data infrastructure. This takes the form of a bundled set of data pipeline engines, storage systems, and computing capabilities.

Instead of the traditional data lake approach of compiling all incoming information into a single repository, a data mesh filters it into individualised data products. In an organisation with a data mesh, an individual domain such as finance provides only usable data that is relevant to a given task. This ensures that the different departments within an organisation maintain ownership over their individual data products, enabling them to easily and reliably apply the deep domain knowledge relevant to each data task.

A data mesh system enables this through a central self-serve data platform. To ensure interoperability, a set of overarching standards is also established. Each data domain is structured to deliver data in a way that prioritises ease of consumption and adheres to an organisation’s global guidelines. While the ownership of individual segments is decentralised, structural aspects such as provisioning and governance remain partly centralised. In this way, the data mesh approach offers a solution to the problems accompanying fully centralised infrastructures. To find the correct balance between independent data domains and an overarching central platform, many organisations have turned to existing technologies such as data virtualization.
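
To make this division of responsibilities concrete, here is a minimal Python sketch. It assumes a hypothetical DataProduct contract: each domain owns and publishes its own product, while the central platform enforces organisation-wide standards. All names and required fields are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical global standards the central platform enforces on every
# domain-owned data product (field names are illustrative only).
REQUIRED_METADATA = {"domain", "owner", "update_frequency", "schema_version"}

@dataclass
class DataProduct:
    name: str
    domain: str            # owning business domain, e.g. "finance"
    owner: str             # accountable team or individual
    update_frequency: str  # e.g. "daily"
    schema_version: str    # version of the published schema
    published_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

def validate_against_global_standards(product: DataProduct) -> None:
    """Central platform check: the domain keeps ownership of its product,
    but the product must satisfy the organisation-wide contract."""
    missing = {f for f in REQUIRED_METADATA if not getattr(product, f, None)}
    if missing:
        raise ValueError(f"{product.name} violates global standards: {missing}")

# A finance-owned product that passes the central checks.
validate_against_global_standards(
    DataProduct("quarterly_revenue", "finance", "finance-team", "daily", "1.2")
)
```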

Enabling Replication-Free Data Access 

Data virtualization is rapidly emerging as one of the key components in the implementation of a data mesh. Its ability to provide access to data without first replicating it to a centralised repository differentiates it from earlier batch-oriented data integration approaches. This provides businesses with a decentralised data integration strategy that links their various data silos. By acting as a data access portal, the data virtualization layer grants data consumers rapid access to any requested information, without the usual access formalities that accompany such requests.
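
As a rough illustration of replication-free access, the sketch below uses two in-memory SQLite databases to stand in for separate departmental silos and resolves a join at query time, without copying either source into a central repository. A real virtualization layer federates many heterogeneous source types; this is only a simplified analogue.

```python
import sqlite3

# Two in-memory SQLite databases stand in for separate departmental silos.
sales = sqlite3.connect(":memory:")
sales.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
sales.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 120.0), (2, 75.5)])

crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

def virtual_customer_revenue():
    """Join data from both silos at query time; nothing is replicated
    into a central store beforehand."""
    names = dict(crm.execute("SELECT id, name FROM customers"))
    totals = sales.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    )
    return [(names[cid], total) for cid, total in totals]

print(virtual_customer_revenue())  # e.g. [('Acme', 120.0), ('Globex', 75.5)]
```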

The provision of a single storage point for metadata also enables organisations to implement automatic role-based security and data governance protocols from a single point of control. A request for sensitive data is automatically approved only for employees with the appropriate credentials. In this way, a data virtualization layer provides the self-serve data platform functionality required in a data mesh architecture.
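
A minimal sketch of that idea, assuming a hypothetical column-level policy held in centrally stored metadata: a single point of control decides automatically which fields each role may see, and sensitive fields are withheld from everyone else.

```python
# Hypothetical central metadata: which roles may read which columns.
COLUMN_POLICY = {
    "salary":     {"hr_admin"},
    "name":       {"hr_admin", "analyst"},
    "department": {"hr_admin", "analyst"},
}

def authorised_columns(requested: list[str], role: str) -> list[str]:
    """Approve only the columns the caller's role is entitled to;
    sensitive fields are automatically withheld from other roles."""
    return [c for c in requested if role in COLUMN_POLICY.get(c, set())]

print(authorised_columns(["name", "salary"], "analyst"))   # ['name']
print(authorised_columns(["name", "salary"], "hr_admin"))  # ['name', 'salary']
```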

Organisations can further build on the data virtualization layer by incorporating a variety of semantic layers. Structured on a departmental basis and functioning as semi-autonomous data domains, these semantic layers can be modified or removed without any impact on the underlying data. As a result, organisations can establish standard data definitions compatible across domains and achieve semantic interoperability amongst data products. 
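
The sketch below illustrates the principle with hypothetical per-domain semantic mappings: a shared business term resolves to different physical columns in each domain, and a mapping can be edited or dropped without touching the underlying data.

```python
# Hypothetical per-domain semantic layers: business-friendly terms mapped
# onto physical columns. Changing a mapping never touches the data itself.
SEMANTIC_LAYERS = {
    "finance":   {"revenue": "tbl_gl.amt_total", "period": "tbl_gl.fiscal_qtr"},
    "marketing": {"leads":   "crm_raw.lead_cnt", "period": "crm_raw.quarter"},
}

def translate(domain: str, business_term: str) -> str:
    """Resolve a shared business term to the owning domain's physical column."""
    return SEMANTIC_LAYERS[domain][business_term]

# "period" means the same thing in every domain, which is what gives the
# organisation semantic interoperability across data products.
print(translate("finance", "period"))    # tbl_gl.fiscal_qtr
print(translate("marketing", "period"))  # crm_raw.quarter
```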

Creating Data Products  

With organisations increasingly reliant on data mesh for the development of data products, data virtualization is being leveraged to create virtual models that every stakeholder can access without interacting with the underlying complexity of multiple data sources or writing additional code. These models can be consumed through a variety of methods, such as SQL, REST, OData, GraphQL, and MDX.
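
As one illustrative access path, the sketch below publishes a virtual model over a REST-style endpoint using only the Python standard library. The static rows stand in for a federated query result, and the same logical model could equally be exposed over SQL, OData, or GraphQL; the endpoint name and data are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Consumers see one clean virtual model, not the source systems behind it
# (static rows here stand in for a federated query result).
CUSTOMER_REVENUE = [
    {"customer": "Acme", "revenue": 120.0},
    {"customer": "Globex", "revenue": 75.5},
]

class RestFacade(BaseHTTPRequestHandler):
    """Minimal REST-style access path to the virtual model."""
    def do_GET(self):
        if self.path == "/customer-revenue":
            body = json.dumps(CUSTOMER_REVENUE).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    # GET http://localhost:8000/customer-revenue returns the model as JSON.
    HTTPServer(("localhost", 8000), RestFacade).serve_forever()
```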

With data virtualization, data products can also support features such as data lineage tracking, self-documentation, change impact analysis, identity management, and single sign-on (SSO). Through centrally stored metadata, data virtualization is able to use these features to provide domain-specific data product catalogues across an organisation.
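
A minimal sketch, assuming a hypothetical central metadata store: domain-specific catalogues, with lineage carried along, are all derived from the same single body of metadata.

```python
# Hypothetical centrally stored metadata for published data products.
PRODUCT_METADATA = [
    {"name": "quarterly_revenue", "domain": "finance",
     "lineage": ["tbl_gl", "fx_rates"], "description": "Revenue by quarter"},
    {"name": "campaign_leads", "domain": "marketing",
     "lineage": ["crm_raw"], "description": "Leads generated per campaign"},
]

def domain_catalogue(domain: str) -> list[dict]:
    """Derive a domain-specific product catalogue, lineage included,
    from the single central metadata store."""
    return [p for p in PRODUCT_METADATA if p["domain"] == domain]

for product in domain_catalogue("finance"):
    print(product["name"], "<-", ", ".join(product["lineage"]))
```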

Establishing Data Domain Autonomy

Data virtualization’s ability to provide organisations with semantic models without affecting the underlying data is an ideal foundation for the autonomy of data domains. This structure enables data domain stakeholders to access the data sources their products require and remain flexible enough to change them as needed. Because data domains can scale independently, business verticals that already operate their own data marts and environments can adopt a data mesh configuration and repurpose their information with minimal effort.

Despite the enhanced capabilities offered by data virtualization, it should not be viewed as a substitute for traditional repositories such as data warehouses and lakes. Instead, data virtualization treats these repositories as any other source. When applied to a data mesh configuration, they are incorporated as nodes. As such, data domains closely linked to existing data repositories can continue to rely on them for specific data products, including those that require machine learning. In these instances, data products would continue to be accessed through the virtual layer and be governed by the same protocols that oversee the rest of the data mesh.
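
To illustrate, a small sketch with a hypothetical source registry: an existing warehouse or lake is registered as just another node behind the virtual layer, where it falls under the same governance as every other source. The class and method names are invented for this example.

```python
# Hypothetical registry: existing warehouses and lakes join the mesh as
# ordinary nodes rather than being replaced.
class VirtualLayer:
    def __init__(self):
        self.sources: dict[str, str] = {}

    def register_source(self, name: str, kind: str) -> None:
        """Once behind the virtual layer, a legacy repository is governed
        by the same protocols as any other node in the mesh."""
        self.sources[name] = kind

mesh = VirtualLayer()
mesh.register_source("enterprise_warehouse", "data warehouse")
mesh.register_source("ml_feature_lake", "data lake")
mesh.register_source("finance_mart", "data mart")
print(mesh.sources)
```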

Data mesh is a promising new architecture that avoids many of the pitfalls of highly centralised data infrastructures. But organisations need the right technology to leverage data mesh effectively and straightforwardly, without having to replace legacy hardware.

About the Author:

Alberto Pan is chief technology officer at Denodo, a leader in data management. He is also an associate professor at the University of A Coruña. Alberto has authored more than 25 scientific papers in areas such as data virtualization, data integration, and web automation.