Data lineage: why and how to implement it in your company?

Artist's vision of data lineage

Accurate data lineage is necessary to get the most out of your data. But what does data lineage consist of? It describes how data moves and is transformed throughout the enterprise information system, and so gives an overview of how processes transform data.

Implementing data lineage is also essential to track sensitive data and verify its accuracy. Consequently, all sectors are concerned: banking, insurance, finance, health, energy, luxury goods, industry, etc. They all have an interest in optimizing their data flows to get the most out of their data analysis and gain an advantage over their competitors.

What is data lineage?

Data lineage, sometimes loosely referred to as data flow or data traceability, describes the entire data life cycle throughout the data processing chain within a company: from data origin to use, archiving and deletion.

Data lineage is often represented by a graph that visualizes the path of the data from one processing step to the next.

Data lineage tools allow you to visualize the extraction, loading and transformation processes of data, also called ETL (Extract Transform Load), or more recently ELT (Extract Load Transform) since the advent of big data.

For quality results, companies need to visualize the journey of data: how it moves, reaches a particular destination or is consumed. If data lineage is not tracked, downstream analytics and applications can suffer.

Decision making relies on accurate information, which is why data lineage is crucial: it allows users, both business and technical, to verify the accuracy of data.

Due to the digital boom, we are facing an explosion in the volume and variety of data (big data); fortunately, companies can keep up by automating their data lineage to improve their processing and data flows. But what is the link between data flow and data lineage?

Data lineage provides an overview of how your data flows through the information system, while a data flow is the movement of data from point A to point B. A data flow is described by a source, a destination and possible transformation processes occurring between the source and the destination. By bringing together all the data flows in the information system, we can better understand how information flows, how it is processed and what problems may arise.

What are the components of data flows?

Data flow components

The data lineage is composed of the following components:

  • Data source: a set of source data sets; a data source can be fed by other data sources.
  • Processes: activities carried out on data; automatic or manual processing.
  • Data storage: physical storage of a data source; example: database, data lake, data hub, data warehouse, data mart…
  • Data flow: how data moves, in which direction, who sends it, who receives it.
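As a minimal sketch, the components above can be modelled with plain Python. The names `DataFlow`, `sources_of` and `destinations_of`, as well as the sample data sets, are hypothetical and only serve the illustration; a real data catalog stores much richer metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFlow:
    """One movement of data: a source, a destination, and the process in between."""
    source: str        # data source the flow reads from
    destination: str   # data source (or storage) the flow writes to
    process: str = ""  # optional transformation applied on the way


# Hypothetical flows; bringing them together gives the lineage of the system.
FLOWS = [
    DataFlow("erp.invoices", "warehouse.invoices", "nightly ETL load"),
    DataFlow("crm.customers", "warehouse.customers", "nightly ETL load"),
    DataFlow("warehouse.invoices", "mart.revenue_kpi", "aggregation by month"),
    DataFlow("warehouse.customers", "mart.revenue_kpi", "join on customer id"),
]


def sources_of(dataset: str, flows: list[DataFlow]) -> list[str]:
    """Direct sources feeding `dataset`."""
    return sorted({f.source for f in flows if f.destination == dataset})


def destinations_of(dataset: str, flows: list[DataFlow]) -> list[str]:
    """Direct destinations fed by `dataset`."""
    return sorted({f.destination for f in flows if f.source == dataset})


print(sources_of("mart.revenue_kpi", FLOWS))
print(destinations_of("erp.invoices", FLOWS))
```

Storing each flow as an immutable record keeps the lineage easy to query in both directions: who feeds a data set, and whom it feeds.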

All companies that want to master data management must understand data flows.

The importance of data flows

Data flows are a crucial tool in data management, enabling companies to understand the interconnections between data elements in the information system. Data flow diagrams (DFD) are sometimes used to visually represent the processes that capture, manipulate, store and distribute data within a company.

By understanding how information flows through their systems, organizations increase efficiency and reduce operational risk, while unlocking new business opportunities (see the benefits of data governance).

Knowing the data flows helps answer questions such as:

  • What is the source of this data?
  • When was it recorded?
  • Which departments use this data?
  • What systems use this data? What systems are impacted if I change or delete this data? Very useful for migrations!

Companies that want to be data-centric, that is, efficient and scalable in their use of data, need to be able to answer these questions quickly, and this is precisely what data lineage makes possible.

What is data provenance?

Data lineage and data provenance are related, but different:

  • Data lineage answers the question: where does this data come from?
  • Data provenance answers the questions: why and how was this data collected? Is it created or copied? How was it created? What is the history of its values?

Data provenance

In short, data provenance provides the historical documentation of the data, its origin and method of creation. It allows businesses to ensure the veracity of the information carried by a set of data.

Within a data catalog, the data lineage appears between the data sources (under the name of data process), while the data provenance is held in the metadata of the data sources.

What is the link between data lineage and data classification?

As the term implies, data classification is the process of putting data into categories based on its characteristics. Data classification is an integral part of data security and compliance policies and procedures, which are governed by data governance. Data classification is all the more necessary when dealing with huge amounts of information.

It also provides a solid foundation for data security and compliance methods by providing insight into where sensitive or regulated information is stored (data storage).

In addition, data classification increases user productivity through rapid search, eliminates redundant information and reduces storage and maintenance costs.

Data lineage greatly facilitates data classification. Indeed, starting from a data element that is already classified, you can infer that the data upstream and downstream of it carries the same classification.

For example, let’s imagine that the “date of birth” field in a table holds your customers’ dates of birth and is classified as “personal data” so that it is handled in line with the GDPR. In this case, data lineage allows you to classify as “personal data” all the fields that feed this field and all the fields derived from it. As you can see, data lineage and classification are very powerful tools for carrying out compliance or data security projects.
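The propagation described above can be sketched as a graph traversal. This is an illustration under stated assumptions, not a catalog feature: the function `propagate_classification` and the field names are hypothetical. In practice the inherited labels are best treated as suggestions to review, since a derived field (an aggregate, for instance) may no longer be personal data.

```python
from collections import defaultdict, deque


def propagate_classification(edges, start, label):
    """Label `start` plus every field upstream and downstream of it.

    `edges` is an iterable of (source_field, destination_field) pairs.
    """
    downstream = defaultdict(set)
    upstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)

    def reachable(neighbours):
        # Breadth-first walk in one direction only (all upstream or all downstream).
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    fields = {start} | reachable(upstream) | reachable(downstream)
    return {field: label for field in fields}


# Hypothetical field-level lineage.
EDGES = [
    ("web_form.birth_date", "crm.customer.date_of_birth"),
    ("crm.customer.date_of_birth", "warehouse.customer.birth_date"),
    ("warehouse.customer.birth_date", "mart.customer.age_band"),
    ("erp.invoice.amount", "mart.revenue.total"),
]

labels = propagate_classification(EDGES, "crm.customer.date_of_birth", "personal data")
for field in sorted(labels):
    print(field, "->", labels[field])
```

The walk goes strictly upstream and strictly downstream of the starting field, so unrelated fields such as the invoice amount are left unclassified.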

Data classification process

Benefits of data lineage for IT

Let’s now look at the use cases of data lineage for data management teams and IT departments:

Identify the causes of errors

Data managers can rely on data lineage to track data and discover the cause of even the most subtle errors. In the age of big data, processing chains have grown so complex that debugging them without data lineage is almost impossible.
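As a rough sketch of this kind of root-cause search, suppose each data set has a quality check and the lineage tells us its direct sources. Walking upstream from a faulty report, the likely culprits are the failing data sets whose own sources are all healthy. The function `root_causes` and the sample data are hypothetical.

```python
def root_causes(upstream, failing, target):
    """Return the failing datasets upstream of `target` whose own sources are all healthy.

    `upstream` maps a dataset to the list of its direct sources;
    `failing` is the set of datasets whose quality checks fail.
    """
    causes, seen, stack = set(), set(), [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = upstream.get(node, [])
        # A failing node with no failing parent is where the error first appears.
        if node in failing and not any(p in failing for p in parents):
            causes.add(node)
        stack.extend(parents)
    return causes


# Hypothetical lineage and check results.
UPSTREAM = {
    "report.revenue": ["mart.revenue_kpi"],
    "mart.revenue_kpi": ["warehouse.invoices", "warehouse.customers"],
    "warehouse.invoices": ["erp.invoices"],
    "warehouse.customers": ["crm.customers"],
}
FAILING = {"report.revenue", "mart.revenue_kpi", "warehouse.invoices"}

print(root_causes(UPSTREAM, FAILING, "report.revenue"))
```

Here the error first appears in `warehouse.invoices`, which points to the load from the ERP rather than to the report itself.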

Perform impact analysis to prevent failures

Manipulations or modifications can negatively impact the data being processed. In these cases, prevention is better than cure: studying the data lineage beforehand makes it possible to better prepare upgrade and maintenance operations and to protect production applications from failures that can lead to significant financial losses.
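An impact analysis of this kind amounts to listing everything downstream of the data set you plan to change. The sketch below assumes the lineage is available as (source, destination) pairs; the function `impacted_by` and the data set names are hypothetical.

```python
from collections import deque


def impacted_by(flows, changed):
    """Return every dataset downstream of `changed`, with its distance in hops."""
    downstream = {}
    for src, dst in flows:
        downstream.setdefault(src, []).append(dst)

    distance = {}
    queue = deque([(changed, 0)])
    while queue:
        node, hops = queue.popleft()
        for nxt in downstream.get(node, []):
            if nxt not in distance and nxt != changed:
                distance[nxt] = hops + 1
                queue.append((nxt, hops + 1))
    return distance


# Hypothetical lineage as (source, destination) pairs.
FLOWS = [
    ("erp.invoices", "warehouse.invoices"),
    ("warehouse.invoices", "mart.revenue_kpi"),
    ("mart.revenue_kpi", "report.revenue"),
    ("crm.customers", "warehouse.customers"),
    ("warehouse.customers", "mart.revenue_kpi"),
]

print(impacted_by(FLOWS, "erp.invoices"))
```

The hop count gives a rough ordering for testing: the closest data sets break first, the most distant ones are usually the reports that users actually see.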

Prepare for data migrations

Data migrations allow the transfer of data from one storage location to another. Data lineage makes the migration process easier and less risky by providing all the details of the data life cycle.

Note: other techniques, such as data profiling, are also useful.

Reduce data debt

Data lineage improves the ability of teams to discern which data sets are obsolete or no longer exist, and which are still in use. By reducing the amount of information stored unnecessarily, organizations can accelerate the implementation of their data projects. Data lineage is therefore a critical component of data debt reduction.
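One hedged way to picture this: compare the inventory of stored data sets with the documented flows. A stored data set that no flow reads is a candidate for archiving, and a flow that reads a data set missing from the inventory is a dangling reference. The function `audit_datasets` and all names below are hypothetical, and the result is only a list of candidates to review, since undocumented uses may exist.

```python
def audit_datasets(flows, inventory):
    """Compare stored datasets with the datasets that flows actually read.

    `flows` is an iterable of (source, destination) pairs;
    `inventory` is the set of datasets known to be stored.
    """
    read = {source for source, _ in flows}
    return {
        # Stored, but never used as the source of any documented flow.
        "never_read": sorted(set(inventory) - read),
        # Read by a flow, but absent from the inventory.
        "missing": sorted(read - set(inventory)),
    }


# Hypothetical inventory and flows.
INVENTORY = {
    "warehouse.invoices",
    "warehouse.customers",
    "warehouse.invoices_2019_backup",
    "mart.revenue_kpi",
}
FLOWS = [
    ("warehouse.invoices", "mart.revenue_kpi"),
    ("warehouse.customers", "mart.revenue_kpi"),
    ("mart.revenue_kpi", "dashboard.revenue"),
    ("staging.orders_old", "mart.revenue_kpi"),
]

print(audit_datasets(FLOWS, INVENTORY))
```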

Data Vault Modeling

Since the mid-2010s, big data architectures – a central data hub fed by numerous peripheral data sources – and increased compliance requirements have led to a renewed interest in so-called Data Vault modeling.

This standard includes a system of link keys – a link corresponding to a data flow in this standard – allowing the path of data to be followed through time and the various transformations, i.e. data lineage. This makes Data Vault modeling particularly useful for compliance audits and data quality management.

Data Vault 2.0 certified professionals (by the Data Vault Alliance) are also highly sought after on the job market, and Data Éclosion has several of them among its workforce.

Benefits of data lineage for the business

Data lineage and business intelligence

From a business perspective, data is often seen as key performance indicators (KPIs), displayed in summary visual reports.

Usually designed by business intelligence developers (see also our article on BI consultants), these reports are read and analyzed by business people and decision-makers who do not have the technical skills to understand how they are built and where the data comes from. How much can we trust IT specialists? Even if their honesty is not questioned, how can we be sure that there is no misunderstanding about the origin of the indicators displayed on the screen?

Data lineage sheds light on the origin and calculation method of the indicators. A good data lineage should be visual and accessible to non-technical users.

Data classification process

Improve data quality

Knowing the origin, provenance and processing of data clearly contributes to improving data quality. Indeed, it is then much easier to understand the causes of poor quality and to remedy them.

For the business, data quality is extremely important to have reliable information to make the best strategic decisions. Conversely, poor quality data leads to decision errors, lost customers and reduced sales.

Six Dimensions of Data Quality

Data governance and data lineage

As we explain in our article, data governance allows you to manage data as an asset, through effective data policies including: security, compliance, quality, documentation, etc.

The data catalog is the main tool for data governance and one of its most important features is of course data lineage.

Implementing data lineage

What are the data lineage techniques?

Here are some techniques to establish a data lineage:

Manual data lineage

The first solution that comes to mind is to describe the data lineage by hand, by studying the current information system and filling in Excel sheets or a wiki space such as SharePoint or Confluence.

This solution is often carried out by consultants mapping the information system or carrying out a compliance project (GDPR, for example).

Tip for data stewards who perform data lineage by hand

If the data lineage is documented in a wiki, each dataset’s page can include a “data sources” section. For each dataset, the simplest approach is then to fill in only the list of its sources rather than both its sources and its destinations: it is up to the data steward of each destination to declare its sources. In other words, to find out where a particular dataset is copied to (its destinations), it is enough to search for all the datasets that list it among their data sources.
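The search described in this tip can be sketched in a few lines. The CRM entry reuses the sources from the example in this section; the other systems and the function `find_destinations` are hypothetical.

```python
# Each data steward declares only the sources of their own dataset.
DECLARED_SOURCES = {
    "CRM": ["ERP", "Online survey application", "PIM", "Website contact form"],
    "Data warehouse": ["CRM", "ERP"],        # hypothetical
    "Marketing automation": ["CRM"],          # hypothetical
}


def find_destinations(dataset, declared_sources):
    """Destinations of `dataset` = every dataset that lists it among its sources."""
    return sorted(
        destination
        for destination, sources in declared_sources.items()
        if dataset in sources
    )


print(find_destinations("CRM", DECLARED_SOURCES))
print(find_destinations("ERP", DECLARED_SOURCES))
```

Because destinations are derived rather than typed in, nobody has to keep two lists in sync by hand.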

Data source                             Imported data sets
ERP                                     Accounts receivable, invoices
Online survey application               Surveys
PIM                                     Product repository
Customer contact form of the website    Qualified leads

Example: data sources of the CRM

Disadvantages of manual data lineage

Although it requires simple tools, the manual approach is not recommended. Indeed, when there are tens of thousands of fields and flows, the manual documentation work is immense, especially since the systems are constantly evolving. It is almost impossible to obtain an exhaustive inventory, not to mention that it is a tedious and error-prone task.

Manual data lineage is often done on a one-time basis, but it is a job that will need to be done repeatedly if you want to remain compliant as systems and regulations change.

This is why it is recommended to use automatic tools.

ETL data lineage

Since ETLs move and transform data in the information system, a first idea would be to use them to build the data lineage.

Some tools, such as Talend Open Studio, do have a graphical interface to represent their data flows, but the people who access them are primarily technical. Moreover, other tools, such as open source big data tools, do not have a visual representation of data lineage at all. Finally, an information system does not usually use only one technology but several. This is why using ETLs is not a viable approach for a global data lineage accessible to all – IT and business.

Data lineage of the data catalog

Data catalogs are intuitive and bridge the gap between your local databases and your data centers. They allow you to carefully organize technical and business data with the help of a classification and a glossary.

Most importantly, they are able to automatically establish data lineage in a heterogeneous data architecture. Truly technology agnostic, they are the essential tool for data governance and that is why we recommend them to all our customers who want effective data governance.
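One common way to derive lineage automatically is to read it from the SQL that moves the data. The sketch below is a deliberately naive illustration, not how any particular catalog works: the function `extract_flows` is hypothetical, it only handles simple `INSERT INTO ... SELECT` statements, and it ignores subquery aliases, CTEs and semicolons inside string literals. Real tools rely on full SQL parsers and on connectors for each technology.

```python
import re

# Assumption: table names are plain identifiers, optionally schema-qualified.
INSERT_RE = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_flows(sql):
    """Return (source_table, target_table) pairs found in simple INSERT ... SELECT statements."""
    flows = []
    for statement in sql.split(";"):
        target = INSERT_RE.search(statement)
        if not target:
            continue  # not a statement that writes to a table
        for source in SOURCE_RE.findall(statement):
            flows.append((source, target.group(1)))
    return flows


# Hypothetical script taken from an ETL job.
SQL = """
INSERT INTO warehouse.invoices
SELECT * FROM erp.invoices;

INSERT INTO mart.revenue_kpi
SELECT c.segment, SUM(i.amount)
FROM warehouse.invoices i
JOIN warehouse.customers c ON c.id = i.customer_id
GROUP BY c.segment;
"""

for source, target in extract_flows(SQL):
    print(source, "->", target)
```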

Example of data lineage in the Alation data catalog. Source: Alation.com, 2023.
Example of data lineage in the DataGalaxy data catalog. Source: Datagalaxy.com, 2023.

How to set up a data lineage program?

Note that we are talking about a data lineage program, not a data lineage project. While it is possible, though not recommended, to establish a data lineage for one particular need, it is far better to establish and maintain the data lineage over the long term, within a data catalog.

Data lineage is therefore part of a data governance program. However, to justify the cost of such a program, it is sometimes necessary to choose a pilot project that will be the opportunity to demonstrate the added value of data governance for all other projects of the company. At Data Éclosion, we are specialists on the subject: do not hesitate to contact us to discuss it.

Recap

Data lineage is necessary to understand how data flows, manage risk, and carry out impact analysis. Making data-driven decisions is critical for businesses, and understanding the data path is essential to improving data quality. With this information in hand, companies can be more confident that the data they work with is accurate and reliable. It also helps them identify potential problems or inconsistencies in the data and develop better strategies.

Data governance teams need to understand the importance of data lineage and how it relates to cybersecurity and privacy. Data lineage helps organizations protect their data and comply with changing regulations, which supports the longevity of their data infrastructures and strengthens their security controls. Implementing it puts you in the best position to make the most of your data: it’s up to you!