Data Catalog

Data, like all assets, must be inventoried in what we call a data catalog. A true bible for anyone working with data, it is the essential tool for sustainable data governance within the organization. We first present its essential features, then its additional features, and finally the different ways to implement it with the solutions available on the market.

Essential features

Managing data without a data catalog is like managing a library without a complete book catalog. First, let’s discover the essential features of a catalog.

Glossary

The glossary contains concepts and definitions of business terms that are frequently used in the daily activities of an organization. It establishes the consensus necessary to avoid misinterpreting the meaning of the data.

It has a more or less complex structure:

  • All the terms in a flat, dictionary-like format.
  • Terms organized in a tree structure or taxonomy. Example: all types of prices are children of the generic term price.
  • Terms organized according to a concept graph or ontology. Example: human resources can have several statuses, such as internal or external, and they can have skills, such as actuary or software developer.

The terms are then linked to the metadata.
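To make the three glossary structures more concrete, here is a minimal Python sketch. The term names, definitions, relation names (`has_status`, `has_skill`) and helper functions are assumptions chosen for illustration; real catalogs each have their own storage format.

```python
# Illustrative only: terms, definitions and relation names are assumptions.

# 1. Flat glossary: term -> definition.
flat_glossary = {
    "price": "Amount charged to a customer for one unit of an item.",
    "list price": "Price before any discount.",
    "net price": "Price after discounts.",
}

# 2. Taxonomy: each term points to its parent (None for a root term).
taxonomy_parent = {
    "price": None,
    "list price": "price",
    "net price": "price",
}


def ancestors(term, parents):
    """Return the chain of broader terms, nearest first."""
    chain = []
    current = parents.get(term)
    while current is not None:
        chain.append(current)
        current = parents.get(current)
    return chain


# 3. Ontology: typed relationships stored as (subject, relation, object) triples.
ontology = [
    ("human resource", "has_status", "internal"),
    ("human resource", "has_status", "external"),
    ("human resource", "has_skill", "actuary"),
    ("human resource", "has_skill", "software developer"),
]


def related(subject, relation, triples):
    """Return every object linked to `subject` by `relation`."""
    return [o for s, r, o in triples if s == subject and r == relation]


print(ancestors("net price", taxonomy_parent))
print(related("human resource", "has_status", ontology))
```

The flat form is the simplest to maintain. The taxonomy adds a single "is a kind of" relationship, and the triples allow any number of named relationships between terms.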

Metadata

Metadata is data that is used to describe other data.

A piece of data is an elementary description of a reality, such as the first name of a person, the price of an item or the temperature of a place. Since the advent of computers, it is usually stored in digital form.

The meaning of a piece of data is called information. An elementary piece of data does not give much information without its context. For example, the value 38 alone is just a number, whereas linked to other data it gives more precise information: 38 euros is the price of a kilogram of 75% dark chocolate on January 22, 2022 in our stores in France.

An elementary piece of data is generally called an attribute, a property, a field or a column.

Attributes can be grouped to form an entity, an object or a row.

Finally, a collection of entities constitutes a dataset, which can be linked to other datasets in different ways within a data source. The dataset can be modeled in several ways: relational database, objects, documents, graphs, etc.

Metadata describes data at all levels of its organization: attribute, dataset, data source… and its life cycle: original file, transaction database, data warehouse, analysis cube, archive, etc.
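As an illustration of these levels, the following sketch models attribute, dataset and data source metadata with Python dataclasses. The class names, fields and sample values (`sales_db`, `product_prices`, `unit_price_eur`) are hypothetical; they reuse the chocolate price example rather than any real catalog schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AttributeMeta:
    name: str
    data_type: str
    description: str = ""
    glossary_terms: List[str] = field(default_factory=list)


@dataclass
class DatasetMeta:
    name: str
    lifecycle_stage: str  # e.g. "transaction database", "data warehouse", "archive"
    attributes: List[AttributeMeta] = field(default_factory=list)


@dataclass
class DataSourceMeta:
    name: str
    model: str  # e.g. "relational", "document", "graph"
    datasets: List[DatasetMeta] = field(default_factory=list)


def describe(source: DataSourceMeta) -> List[str]:
    """List every attribute with its fully qualified name and description."""
    return [
        f"{source.name}.{dataset.name}.{attr.name}: {attr.description}"
        for dataset in source.datasets
        for attr in dataset.attributes
    ]


# Hypothetical sample metadata based on the price example.
price = AttributeMeta("unit_price_eur", "decimal", "Price per kilogram, in euros", ["price"])
date = AttributeMeta("price_date", "date", "Day on which the price applies")
prices = DatasetMeta("product_prices", "transaction database", [price, date])
source = DataSourceMeta("sales_db", "relational", [prices])

for line in describe(source):
    print(line)
```

Note that the metadata carries the context (unit, meaning, life cycle stage) that the raw value 38 lacks on its own.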

At this point, it is possible to proceed with the classification of the data in order to know which policies to apply to them.

Data classification

First of all, data can be classified according to the business domain to which it belongs, such as finance, sales, human resources, etc.

Secondly, some data needs to be processed according to specific data policies.

For example:

  • Personal data is subject to privacy regulations such as the GDPR in Europe or the CCPA in California.
  • Sensitive data is subject to specific security policies that will, for example, severely restrict access or even prohibit storage in a non-sovereign cloud. Examples of this are classified data or health data.

Data catalogs must allow for the assignment of one or more classes to different data assets. This is usually done using a tagging system.

As with the business terms in the glossary, the list of tags can be simple or follow a hierarchy or more complex relationships. This is why some catalogs have merged the concepts of terms and tags.
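A tagging system with a simple hierarchy could look like the sketch below. The tag names, the parent relationships and the policy labels are assumptions made for the example; the point is only that an asset tagged with a child class also inherits the policies of its parent class.

```python
# Hypothetical tag hierarchy: child tag -> parent tag (None for a root tag).
TAG_PARENT = {
    "personal": None,
    "sensitive": None,
    "health": "sensitive",
    "classified": "sensitive",
}

# Hypothetical policies attached to tags.
POLICIES = {
    "personal": ["privacy policy (e.g. GDPR, CCPA)"],
    "sensitive": ["restricted access", "no storage in a non-sovereign cloud"],
}


def policies_for(tags):
    """Collect the policies of each tag and of all its ancestor tags."""
    result = []
    for tag in tags:
        current = tag
        while current is not None:
            for policy in POLICIES.get(current, []):
                if policy not in result:
                    result.append(policy)
            current = TAG_PARENT.get(current)
    return result


# Tags assigned to data assets (asset names are made up).
asset_tags = {
    "hr.employee.email": ["personal"],
    "clinic.patient.diagnosis": ["personal", "health"],
}

for asset, tags in asset_tags.items():
    print(asset, "->", policies_for(tags))
```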

Data lineage

Data lineage is a complete mapping of the path and transformation stages of data within the information system. It lets you trace the complete life cycle of a piece of data, from the moment it enters the system to the moment it is archived or deleted.

The objective of data lineage is to answer the following questions:

  • What is the origin of the data, and does it come from a reliable source?
  • Has it been transformed or even altered?
  • Does it transit through insecure applications, in violation of data policies?
  • How is it used and by which applications?
  • How often is it used and how popular is it?
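Several of these questions amount to walking a graph. The sketch below stores lineage as a directed graph and answers "where does this come from?" and "what does this feed?". The asset names and the graph itself are invented for the example; production catalogs typically also record the transformation applied on each edge.

```python
from collections import deque

# Hypothetical lineage graph: source asset -> assets it feeds.
LINEAGE = {
    "crm_export.csv": ["staging.customers"],
    "staging.customers": ["warehouse.dim_customer"],
    "erp.orders": ["warehouse.fact_sales"],
    "warehouse.dim_customer": ["report.sales_dashboard"],
    "warehouse.fact_sales": ["report.sales_dashboard"],
}


def downstream(node, edges):
    """Breadth-first list of every asset reachable from `node`."""
    seen = {node}
    queue = deque([node])
    order = []
    while queue:
        current = queue.popleft()
        for nxt in edges.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def upstream(node, edges):
    """Every asset that `node` depends on, found by reversing the edges."""
    reverse = {}
    for src, targets in edges.items():
        for target in targets:
            reverse.setdefault(target, []).append(src)
    return downstream(node, reverse)


print(downstream("crm_export.csv", LINEAGE))
print(upstream("report.sales_dashboard", LINEAGE))
```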

You can read our article on data lineage to learn more.

Data discovery and search engine

One of the most important features is obviously the navigation and search engine. This makes it easy to understand all of the company’s data assets.

The navigation is often a tree structure of data assets by domain or data source.

The search engine allows you to instantly find any type of item based on other items. These can be attributes, datasets, people (data stewards, data owners, etc.), tags, terms, etc.

Search engines increasingly use artificial intelligence techniques that surface the most relevant data based on the search history and habits of the company’s employees.

Catalog vendors call this data discovery.
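At its simplest, searching across item types is a keyword match over a few fields. The sketch below is a deliberately naive illustration with made-up catalog items; it does no ranking and does not attempt the AI-based relevance described above.

```python
# Hypothetical catalog items of different types.
CATALOG_ITEMS = [
    {"type": "dataset", "name": "product_prices", "tags": ["sales"], "owner": "Alice"},
    {"type": "attribute", "name": "unit_price_eur", "tags": ["sales", "price"], "owner": "Alice"},
    {"type": "term", "name": "price", "tags": [], "owner": "Bob"},
    {"type": "dataset", "name": "employees", "tags": ["human resources", "personal"], "owner": "Carol"},
]


def search(query, items, item_type=None):
    """Case-insensitive substring match of `query` on name, owner and tags."""
    q = query.lower()
    hits = []
    for item in items:
        if item_type is not None and item["type"] != item_type:
            continue
        haystack = [item["name"], item["owner"], *item["tags"]]
        if any(q in text.lower() for text in haystack):
            hits.append(item["name"])
    return hits


print(search("price", CATALOG_ITEMS))
print(search("alice", CATALOG_ITEMS, item_type="dataset"))
```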

Documentation of data visualization and other uses

Since the purpose of data is decision making, it generally ends up in data visualizations. It is also used by the applications that support the company’s operations.

Catalogs link these uses and data sources to answer questions like:

  • Where did the data in this report come from?
  • Are there redundant or inconsistent reports?
  • Does this application use the source of truth of this master data?
  • What data is most used for the supply chain?
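Once the catalog records which sources each report reads, two of these questions become simple lookups. The following sketch uses invented report and table names, and its notion of "possibly redundant" (two reports reading exactly the same sources) is an assumption that a data steward would still need to confirm.

```python
# Hypothetical mapping: report -> data sources it reads.
REPORT_SOURCES = {
    "weekly_sales": {"warehouse.fact_sales", "warehouse.dim_customer"},
    "sales_overview": {"warehouse.fact_sales", "warehouse.dim_customer"},
    "stock_levels": {"erp.inventory"},
}


def reports_using(source, usage):
    """Reports that read a given data source, in alphabetical order."""
    return sorted(report for report, sources in usage.items() if source in sources)


def possibly_redundant(usage):
    """Groups of reports that read exactly the same set of sources."""
    by_sources = {}
    for report, sources in usage.items():
        by_sources.setdefault(frozenset(sources), []).append(report)
    return [sorted(group) for group in by_sources.values() if len(group) > 1]


print(reports_using("warehouse.fact_sales", REPORT_SOURCES))
print(possibly_redundant(REPORT_SOURCES))
```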

Automatic synchronization between catalog and data

A good data catalog must be able to connect to most enterprise systems: applications, databases, warehouses and ETLs to import and maintain metadata and data lineage.

Catalog publishers often offer a wide range of connectors or scanners that are compatible with the most common databases and software in the enterprise. When the connector does not exist, the company using the catalog must be able to import the metadata and lineage in several ways:

  • By involving the data stewards manually:
    • Via the user interface.
    • Through bulk imports of Excel or CSV files.
  • By extending the connectors via an API through ad hoc development.
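A bulk import from CSV can be sketched with Python's standard `csv` module. The column names (`dataset`, `attribute`, `data_type`, `description`) and the sample rows are assumptions; each catalog defines its own import template.

```python
import csv
import io

# Hypothetical import file; a real one would be read from disk.
CSV_TEXT = """dataset,attribute,data_type,description
product_prices,product_id,integer,Internal product identifier
product_prices,unit_price_eur,decimal,Price per kilogram in euros
stores,country,string,Country where the store is located
"""

REQUIRED_COLUMNS = {"dataset", "attribute", "data_type", "description"}


def load_metadata(text):
    """Parse the CSV and group attribute metadata by dataset."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    catalog = {}
    for row in reader:
        catalog.setdefault(row["dataset"], []).append(
            {
                "name": row["attribute"],
                "type": row["data_type"],
                "description": row["description"],
            }
        )
    return catalog


catalog = load_metadata(CSV_TEXT)
for dataset, attributes in catalog.items():
    print(dataset, [a["name"] for a in attributes])
```

Validating the header before reading any row gives the data steward an early, explicit error instead of a half-imported file.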

Collaboration

The catalog is also a collaborative space where it is possible to leave comments, annotate requirements, report problems and chat with other users to share information.

It also incorporates the roles and responsibilities of data governance. Indeed, the following roles are commonly found:

  • Administrators: manage permissions, configuration and catalog structure.
  • Data owners: business managers responsible for the data in the area of the company that concerns them: human resources, sales, supply chain, etc. They approve the documentation produced by the data stewards.
  • Data stewards: import metadata and lineage via connections; edit glossary terms and perform curation, documentation, and classification work, while interacting with other stakeholders.
  • Data users: all other users who consult the catalog in read-only mode.
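These roles map naturally onto a permission table. The sketch below is one possible encoding; the action names are invented, and giving every role read access is an assumption rather than something a specific product guarantees.

```python
# Hypothetical role -> allowed actions table derived from the roles above.
ROLE_PERMISSIONS = {
    "administrator": {"read", "manage_permissions", "configure", "edit_structure"},
    "data_owner": {"read", "approve_documentation"},
    "data_steward": {"read", "import_metadata", "edit_glossary", "document", "classify"},
    "data_user": {"read"},
}


def can(role, action):
    """True if the role is allowed to perform the action; unknown roles get nothing."""
    return action in ROLE_PERMISSIONS.get(role, set())


print(can("data_steward", "edit_glossary"))
print(can("data_user", "edit_glossary"))
```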

Additional features

In addition to documenting data assets, some catalogs add features, usually for one of two reasons:

  • The catalog complements the other products of its vendor: ETL, data quality, data analysis, BI.
  • The catalog is specialized in a particular area such as compliance or security.

Workflows

Some catalogs go further than the collaborative and conversational aspect. They also integrate a workflow or ticket system to organize the work on the data: from the expression of needs by the data owners to the realization of extractions or processing by the data engineers, including the resolution of quality problems. The ticket system can be internal or based on an existing system such as Jira.

Data Profiling

Data profiling is the process of examining data by gathering statistics and information about it.

Unlike data analysis, which has a business purpose, profiling is a technical evaluation that gives insight into the characteristics of the data: fill rate, minimum, maximum, percentage of unique values, distribution, etc.

This functionality does not generally require a complete reading of the data but a sampling, which is performed either by the databases at the request of the catalog or by a module of the catalog located closer to the data sources. Sampling therefore limits the impact on performance and security.
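The statistics listed above are straightforward to compute on a column, with or without sampling. The sketch below is a minimal illustration: the function name, the use of `None` for missing values and the definition of "percentage of unique values" as distinct values over filled values are all assumptions.

```python
import random
from collections import Counter


def profile(values, sample_size=None, seed=0):
    """Profile a column, optionally on a random sample; None marks a missing value."""
    if sample_size is not None and sample_size < len(values):
        values = random.Random(seed).sample(values, sample_size)
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "rows": len(values),
        "fill_rate": len(present) / len(values) if values else 0.0,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "unique_pct": 100 * len(counts) / len(present) if present else 0.0,
        "most_common": counts.most_common(3),  # crude view of the distribution
    }


# Hypothetical price column with two missing values.
prices = [38, 12, None, 38, 7, 12, None, 38]

print(profile(prices))                  # full read
print(profile(prices, sample_size=4))   # sampled read, cheaper on large tables
```

On a sample the statistics are only estimates, which is the trade-off for the lower load on the source system.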

Data quality management

Beyond data profiling, some catalogs go much further and allow for complete data quality management: measurements, alerts and data cleansing campaigns.

Obviously, this requires full access to the data and carries several risks:

  • Data leakage and quality compromise, especially if the catalog is running in a foreign cloud…
  • Performance costs due to frequent scanning of the entire data set.
  • Risk of data corruption or loss if the catalog malfunctions.

This is why some publishers have chosen not to manage quality and to leave it to publishers who specialize in it.
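The "measurements and alerts" part of quality management can be pictured as rules evaluated against every row, with an alert when the pass rate falls below a threshold. The rules, sample rows and the 95% threshold below are invented for the example; note that, unlike profiling, this reads the full data.

```python
def check_quality(rows, rules, threshold=0.95):
    """Return (rule name, pass rate) for every rule whose pass rate is below the threshold."""
    alerts = []
    for name, predicate in rules.items():
        passed = sum(1 for row in rows if predicate(row))
        rate = passed / len(rows) if rows else 1.0
        if rate < threshold:
            alerts.append((name, round(rate, 2)))
    return alerts


# Hypothetical rows and rules.
rows = [
    {"price": 38, "currency": "EUR"},
    {"price": -5, "currency": "EUR"},
    {"price": 12, "currency": None},
    {"price": 7, "currency": "EUR"},
]

rules = {
    "price is positive": lambda r: r["price"] is not None and r["price"] > 0,
    "currency is filled": lambda r: r["currency"] is not None,
}

print(check_quality(rows, rules))
```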

Various

  • Data mining – requires a full reading
  • Documentary repository in the form of a wiki
  • Artificial intelligence to suggest more business friendly attribute names.
  • Representation of the relationships between the tables of a relational database.
  • Management of access to data sources from the catalog.
  • External data catalogs: repositories, open data, etc.

Data catalog market

Updated in 2022.

First of all, it is possible to create a catalog by hand in collaborative wiki software, as long as you formalize at least the outline and the page templates.

This approach has the merit of familiarizing employees with the practice and of framing the company’s needs before investing in a dedicated and more costly solution. However, a catalog tool becomes necessary sooner or later because the wiki method is not sustainable in the long term: the number of databases and their frequent changes make catalog maintenance too heavy, and thus too expensive, to be carried out manually.

Growth

According to Mordor Intelligence, the data catalog market was valued at $524 million worldwide in 2020. By 2026, it is projected to reach $1,788 million, about 3.4 times its 2020 value (an increase of roughly 240%).
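As a quick sanity check on these figures, the sketch below computes the multiple between the two values, the percentage increase and the implied compound annual growth rate. Only the two dollar amounts come from the figures cited above; the variable names and the choice to compute an annualized rate are ours.

```python
value_2020 = 524.0    # USD millions, 2020 valuation
value_2026 = 1788.0   # USD millions, 2026 projection
years = 2026 - 2020

multiple = value_2026 / value_2020               # how many times larger
increase_pct = (multiple - 1) * 100              # growth relative to 2020
cagr_pct = (multiple ** (1 / years) - 1) * 100   # compound annual growth rate

print(f"multiple: {multiple:.2f}x")
print(f"increase: {increase_pct:.0f}%")
print(f"CAGR:     {cagr_pct:.1f}% per year")
```

A value that is about 3.4 times the starting point corresponds to an increase of about 240%, not 340%, since the starting value itself accounts for the first 100%.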

It is therefore a booming market based on a rapidly evolving technology.

Competition

The data catalog market already contains dozens of players. It is divided into three categories:

  • Pure players that only offer a data catalog and are specialized in data governance.
  • Data specialists for whom the catalog completes a product line: ETL vendors, data visualization specialists, or data platforms on premises or in the cloud.
  • Generalist software companies, usually tech giants, whose data needs are so large that they have developed their own solutions for internal use and for their customers.

Figure: some players in the data catalog market in 2022 (logos of the main data catalog vendors).

Conclusion

This article is a first introduction to data catalogs and their features. The catalog is the main tool for data governance in the age of big data and artificial intelligence.

At Data Éclosion, we have evaluated dozens of solutions and we have expertise in the most important ones. Choosing and implementing a catalog successfully requires a good knowledge of the company’s needs and of the catalog market. This is why we strongly advise seeking support from professionals.