Data, like all assets, must be inventoried, and this inventory is what we call a data catalog. A true bible for anyone working with data, it is the essential tool for sustainable data governance within the organization. We first present its essential features, then its additional features, and finally the different ways to implement it with the solutions available on the market.

Essential features

Managing data without a data catalog is like managing a library without a complete book catalog. First, let’s discover the essential features of a catalog.

Glossary

The glossary contains the concepts and definitions of business terms that are frequently used in the daily activities of an organization. It establishes the consensus necessary to avoid misinterpreting the meaning of the data.

It has a more or less complex structure:

The terms are then linked to the metadata.
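As an illustration only, a glossary with a simple hierarchy and links to metadata could be sketched as follows. The `Term` class, the example terms and the metadata path `crm_db.customers.last_purchase_date` are hypothetical and do not come from any particular catalog product.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Term:
    """A business term of the glossary (hypothetical structure)."""
    name: str
    definition: str
    parent: Optional["Term"] = None          # broader term, if any
    linked_metadata: list = field(default_factory=list)

    def path(self) -> str:
        """Return the full path of the term, starting from the root term."""
        if self.parent is None:
            return self.name
        return f"{self.parent.path()} > {self.name}"


# Hypothetical terms and definitions.
customer = Term("Customer", "A person or company that has bought at least once.")
active_customer = Term(
    "Active customer",
    "A customer with a purchase in the last 12 months.",
    parent=customer,
)

# Linking the term to a metadata element (a hypothetical column path).
active_customer.linked_metadata.append("crm_db.customers.last_purchase_date")

print(active_customer.path())
```

A flat list of terms is the simplest case of this structure: every term then has no parent.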

Metadata

Metadata is data that is used to describe other data.

A piece of data is an elementary description of a reality, such as the first name of a person, the price of an item or the temperature of a place. Since the advent of computers, it is usually stored in digital form.

The meaning of a piece of data is called information. An elementary piece of data does not give much information without its context. For example, the value 38 alone is just a number, whereas linked to other data it gives more precise information: 38 euros is the price of a kilogram of 75% dark chocolate on January 22, 2022 in our stores in France.

An elementary piece of data is generally called an attribute, a property, a field or a column.

Attributes can be grouped to form an entity, an object or a row.

Finally, a collection of entities constitutes a dataset, which can be linked to other datasets in different ways within a data source. The dataset can be modeled in several ways: relational database, objects, documents, graphs, etc.

Metadata describes data at all levels of its organization: attribute, dataset, data source… and its life cycle: original file, transaction database, data warehouse, analysis cube, archive, etc.
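The levels described above (attribute, dataset, data source) can be sketched as a small data model. This is a minimal illustration, not the model of any specific catalog: the class names, the `sales_db` source and the column names are assumptions built around the chocolate price example.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """Metadata describing one elementary piece of data (a field or column)."""
    name: str
    data_type: str
    description: str = ""


@dataclass
class Dataset:
    """Metadata describing a collection of entities (for example a table)."""
    name: str
    attributes: list = field(default_factory=list)


@dataclass
class DataSource:
    """Metadata describing a system that holds datasets."""
    name: str
    kind: str  # e.g. "relational database", "document store"
    datasets: list = field(default_factory=list)


# Hypothetical dataset built around the "38 euros per kilogram" example.
price_list = Dataset(
    name="price_list",
    attributes=[
        Attribute("product", "text", "Product name"),
        Attribute("price_per_kg", "decimal", "Price of one kilogram"),
        Attribute("currency", "text", "Currency of the price"),
        Attribute("valid_on", "date", "Date on which the price applies"),
        Attribute("country", "text", "Country of the stores"),
    ],
)
sales_db = DataSource("sales_db", "relational database", [price_list])

# One entity (row): the value 38 becomes information once read with its metadata.
row = {"product": "dark chocolate 75%", "price_per_kg": 38, "currency": "EUR",
       "valid_on": "2022-01-22", "country": "France"}
described = {a.name: (row[a.name], a.description) for a in price_list.attributes}
print(described["price_per_kg"])
```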

At this point, it is possible to proceed with the classification of the data in order to know which policies to apply to them.

Data classification

First of all, data can be classified according to the business domain to which it belongs, such as finance, sales, human resources, etc.

Secondly, some data needs to be processed according to specific data policies.

For example:

Data catalogs must allow for the assignment of one or more classes to different data assets. This is usually done using a tagging system.

As with the business terms in the glossary, the list of tags can be simple or follow a hierarchy or more complex relationships. This is why some catalogs have merged the concepts of terms and tags.
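A tagging system with a hierarchy could look like the sketch below. The tag names, the hierarchy and the asset paths are hypothetical examples chosen for illustration; real catalogs define their own classes and policies.

```python
# Hypothetical tag hierarchy: each tag maps to its parent tag (None for a root).
TAG_PARENT = {
    "sensitive": None,
    "personal data": "sensitive",
    "health data": "personal data",
    "finance": None,
}

# Tags assigned directly to each data asset.
asset_tags = {}


def tag(asset: str, *tags: str) -> None:
    """Assign one or more known tags to a data asset."""
    for t in tags:
        if t not in TAG_PARENT:
            raise ValueError(f"unknown tag: {t}")
        asset_tags.setdefault(asset, set()).add(t)


def expanded_tags(asset: str) -> set:
    """Return the asset's tags plus all their ancestors in the hierarchy."""
    result = set()
    for t in asset_tags.get(asset, set()):
        while t is not None:
            result.add(t)
            t = TAG_PARENT[t]
    return result


def assets_with(tag_name: str) -> list:
    """Return the assets carrying the tag directly or through a child tag."""
    return sorted(a for a in asset_tags if tag_name in expanded_tags(a))


tag("hr_db.employees.medical_leave", "health data")
tag("sales_db.invoices.amount", "finance")

print(assets_with("sensitive"))
```

Because ancestors are resolved at query time, a policy attached to a broad class such as "sensitive" also covers assets tagged only with a narrower class.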

Data lineage

Data lineage is a complete mapping of the path and transformation stages of data within the information system. It lets you trace the complete life cycle of a piece of data, from the moment it enters the system to the moment it is archived or deleted.

The objective of data lineage is to answer the following questions:

You can read our article on data lineage to know everything about it.
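Lineage is naturally represented as a directed graph. The following sketch assumes a hypothetical set of assets and dependencies, and shows how two typical questions (where does this data come from, and what is affected if it changes) reduce to graph traversals.

```python
from collections import deque

# Hypothetical lineage: each key is produced from the listed inputs.
PRODUCED_FROM = {
    "warehouse.sales_fact": ["staging.orders", "staging.customers"],
    "staging.orders": ["erp.orders_export.csv"],
    "staging.customers": ["crm_db.customers"],
    "dashboard.monthly_revenue": ["warehouse.sales_fact"],
}


def upstream(asset: str) -> list:
    """Return every asset the given asset depends on, nearest first (BFS)."""
    seen, order, queue = {asset}, [], deque([asset])
    while queue:
        current = queue.popleft()
        for parent in PRODUCED_FROM.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


def downstream(asset: str) -> list:
    """Return every asset that directly or indirectly uses the given asset."""
    return sorted(a for a in PRODUCED_FROM if asset in upstream(a))


print(upstream("dashboard.monthly_revenue"))
print(downstream("crm_db.customers"))
```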

Data discovery and search engine

One of the most important features is obviously the navigation and search engine. This makes it easy to understand all of the company’s data assets.

The navigation is often a tree structure of data assets by domain or data source.

The search engine allows you to instantly find any type of item based on other items. These can be attributes, datasets, people (data stewards, data owners, etc.), tags, terms, etc.

Search engines increasingly use artificial intelligence techniques that surface the most relevant data by learning from the history and habits of the company’s employees.

Catalog vendors call this data discovery.
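To make the idea concrete, here is a deliberately naive sketch of a search across heterogeneous catalog items. The items, names and scoring rule are assumptions for illustration; a real catalog would rely on a proper search index and, as noted above, on learned relevance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Item:
    kind: str       # "attribute", "dataset", "person", "tag", "term", ...
    name: str
    text: str = ""  # free-text description


# Hypothetical catalog content.
ITEMS = [
    Item("dataset", "sales_db.invoices", "Invoices issued to customers"),
    Item("term", "Customer", "A person or company that has bought at least once"),
    Item("person", "Jane Doe", "Data steward for the sales domain"),
]


def search(query: str, kinds: Optional[set] = None) -> list:
    """Rank items by how often the query words occur in their name and text."""
    words = query.lower().split()
    scored = []
    for item in ITEMS:
        if kinds and item.kind not in kinds:
            continue
        haystack = f"{item.name} {item.text}".lower()
        score = sum(haystack.count(word) for word in words)  # substring count
        if score:
            scored.append((score, item))
    # sorted() is stable, so items with equal scores keep their catalog order.
    return [item for _score, item in sorted(scored, key=lambda pair: -pair[0])]


print([item.name for item in search("customer")])
print([item.name for item in search("sales", kinds={"person"})])
```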

Documentation of data visualization and other uses

Since the purpose of data is decision making, it generally ends up in data visualizations. It is also used by the applications that support the company’s operations.

Catalogs link these uses and data sources to answer questions like:

Automatic synchronization between catalog and data

A good data catalog must be able to connect to most enterprise systems: applications, databases, warehouses and ETLs to import and maintain metadata and data lineage.
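As a rough idea of what a connector does, the sketch below reads table and column metadata from a SQLite database through its system catalog, using Python's standard `sqlite3` module. The `scan_sqlite` function and the `customers` table are hypothetical; commercial connectors cover many more systems and also collect lineage.

```python
import sqlite3


def scan_sqlite(conn: sqlite3.Connection) -> dict:
    """Read table and column metadata from SQLite's own system catalog."""
    metadata = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns: cid, name, type, notnull, dflt_value, pk
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        metadata[table] = [
            {
                "name": name,
                "type": col_type,
                "nullable": not notnull,
                "primary_key": bool(pk),
            }
            for _cid, name, col_type, notnull, _default, pk in columns
        ]
    return metadata


# Hypothetical source database, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL, city TEXT)"
)
catalog_entry = scan_sqlite(conn)
print(catalog_entry["customers"])
```

Running such a scan on a schedule is one simple way to keep the catalog synchronized with the source: only metadata is read, never the rows themselves.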

Catalog publishers often offer a wide range of connectors or scanners that are compatible with the most common databases and software in the enterprise. When the connector does not exist, the company using the catalog must be able to import the metadata and lineage in several ways:

Collaboration

The catalog is also a collaborative space where it is possible to leave comments, annotate requirements, report problems and chat with other users to share information.

It also incorporates the roles and responsibilities of data governance. Indeed, the following roles are commonly found:

Additional features

In addition to documenting data assets, some catalogs add functionality and for good reason:

Workflows

Some catalogs go further than the collaborative and conversational aspect. They also integrate a workflow or ticket system to organize the work on the data: from the expression of needs by the data owners to the realization of extractions or processing by the data engineers, including the resolution of quality problems. The ticket system can be internal or based on an existing system such as Jira.

Data Profiling

Data profiling is the process of examining data by gathering statistics and information about it.

Unlike data analysis, which has a business purpose, profiling is a technical evaluation of the data that gives insight into its characteristics: fill rate, minimum, maximum, percentage of unique values, distribution, etc.

This functionality generally does not require a complete read of the data, only a sample, which is taken either by the database at the catalog's request or by a catalog module located close to the data sources. This limits the impact on performance and security.
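The statistics listed above can be computed on a sample in a few lines. This is a minimal sketch: the `profile` function, its parameters and the sample column are assumptions, and real catalogs usually push the sampling down to the database.

```python
import random


def profile(values: list, sample_size: int = 1000, seed: int = 0) -> dict:
    """Compute basic profiling statistics on a random sample of a column."""
    rng = random.Random(seed)
    if len(values) <= sample_size:
        sample = values
    else:
        sample = rng.sample(values, sample_size)
    filled = [v for v in sample if v is not None]
    return {
        "sampled_rows": len(sample),
        "fill_rate": len(filled) / len(sample) if sample else 0.0,
        "min": min(filled) if filled else None,
        "max": max(filled) if filled else None,
        "unique_pct": 100 * len(set(filled)) / len(filled) if filled else 0.0,
    }


# Hypothetical column of prices with one missing value.
prices = [38, 12, None, 38, 25]
stats = profile(prices)
print(stats)
```

With this column, four of the five values are filled and three of those four are distinct, which gives a fill rate of 0.8 and 75% unique values.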

Data quality management

Beyond data profiling, some catalogs go much further and allow for complete data quality management: measurements, alerts and data cleansing campaigns.

Obviously, this requires full access to the data and carries several risks:

This is why some publishers have chosen not to manage quality and to leave it to publishers who specialize in it.

Various

Data catalog market

Updated in 2022.

First of all, it is possible to create a catalog by hand in collaborative wiki software, as long as you formalize at least the outline and the page templates.

This approach has the merit of familiarizing employees with the subject and of framing the company’s needs before investing in a dedicated and more costly solution. However, a catalog tool becomes necessary sooner or later because the wiki method is not sustainable in the long term: the number of databases and their frequent changes make manual maintenance of the catalog too time-consuming and therefore too expensive.

Growth

According to Mordor Intelligence, the data catalog market was valued at $524 million worldwide in 2020. By 2026, it is projected to reach $1,788 million, about 3.4 times its 2020 value (a growth of roughly 240%).
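The two market figures quoted above can be compared with a short calculation. The annual rate below is derived from those two figures alone, assuming steady compound growth over the six years from 2020 to 2026; it is not a number taken from the Mordor Intelligence report.

```python
value_2020 = 524   # million USD, figure quoted above
value_2026 = 1788  # million USD, projection quoted above
years = 2026 - 2020

ratio = value_2026 / value_2020                # how many times larger
growth_pct = (ratio - 1) * 100                 # relative increase over the period
cagr_pct = (ratio ** (1 / years) - 1) * 100    # implied compound annual growth rate

print(f"ratio: {ratio:.2f}x")
print(f"growth 2020-2026: {growth_pct:.0f}%")
print(f"implied annual growth: {cagr_pct:.1f}%")
```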

It is therefore a booming market based on a rapidly evolving technology.

Competition

The data catalog market already contains dozens of players. It is divided into three categories:

Figure: some players in the data catalog market in 2022 (logos of the main vendors).

Conclusion

This article is a first introduction to data catalogs and their features. The data catalog is the main tool for data governance in the age of big data and artificial intelligence.

At Data Éclosion, we have evaluated dozens of solutions and we have expertise in the most important ones. Choosing and implementing a catalog successfully requires a good knowledge of the company’s needs and of the catalog market. This is why we strongly advise seeking support from professionals.