As you know, the better you analyze your data, the more value it brings you. However, it is important to evaluate the content and quality of your data beforehand, especially since the volume and variety of your data are constantly increasing (big data). In short, there has never been a greater need to monitor and clean up data.
Data profiling is a powerful technique to combat inaccurate, missing or unusable data. It helps to improve data quality and gain a competitive advantage in the market.
In this article, we explore the definition, techniques and technologies associated with data profiling, as well as how it works in practice and how it helps companies solve their data problems.
What is data profiling?
Data profiling is the process of evaluating data sources, analyzing their structure, content and relationships in order to identify their potential uses and problems.
In other words, data profiling is the act of examining and analyzing data to learn more about its structure, the information it includes, the relationships between different data sets and potential applications.
For example, data analytics teams perform data profiling to better understand the state and value of their data before embarking on development work. In short, data profiling is a kind of data inventory.
Why do we need data profiling?
Data profiling can help you find, understand and manage your data. If you are not already using it in your company, we recommend implementing it without delay, and here is why:
First of all, data profiling allows you to verify that your data matches the way it is described (ideally in your data catalog).
It can then help you better understand your data by revealing connections between different databases, source applications or tables.
In addition, data profiling helps you find nuggets of information buried in your own data and ensure that your data complies with industry standards for your specific business requirements.
For example, Country columns may contain two-letter codes as well as full country names that are sometimes misspelled. Data profiling will reveal this discrepancy and inform a standardization effort, for example converging on uniform two-letter codes across all enterprise applications. In this way, it can support a master data management (MDM) program.
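To make this concrete, here is a minimal sketch of what such a standardization could look like with pandas, assuming a hypothetical customer table with a country column; the mapping and the misspelling are purely illustrative:

```python
import pandas as pd

# Hypothetical customer table with mixed country representations,
# exactly the kind of discrepancy data profiling reveals.
df = pd.DataFrame({"country": ["FR", "France", "Frnace", "US", "United States", "DE"]})

# Illustrative mapping from observed variants (including a misspelling)
# to uniform two-letter codes.
country_map = {
    "FR": "FR", "France": "FR", "Frnace": "FR",
    "US": "US", "United States": "US",
    "DE": "DE",
}

df["country_code"] = df["country"].map(country_map)

# Any value the mapping does not cover shows up as NaN and can be
# routed back to data stewards for review.
print(df)
print("Unmapped values:", df["country"][df["country_code"].isna()].unique())
```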
Types of data profiling
Data profiling can be of three main types:
- Structure discovery
- Content discovery
- Relationship discovery
Structure discovery
Structure analysis, commonly known as structure discovery, confirms the accuracy and correct formatting of the data you already have. You can achieve this by using techniques such as pattern matching.
For example, pattern matching can help you identify the set of valid formats present in a phone number column. This allows you to determine whether a field is textual or numeric, along with other format-specific details, as in the sketch below.
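Here is a minimal pattern-matching sketch in Python, assuming a hypothetical phone number column; reducing each value to a digit/letter pattern and counting the patterns exposes the formats actually in use:

```python
import re
import pandas as pd

# Hypothetical phone number column with heterogeneous formats.
phones = pd.Series(["+33 1 23 45 67 89", "0123456789", "01-23-45-67-89", "not a number"])

# Reduce each value to a format "pattern": digits become 9, letters become A,
# everything else is kept as-is. Counting patterns reveals the formats in use.
def to_pattern(value: str) -> str:
    value = re.sub(r"\d", "9", value)
    return re.sub(r"[A-Za-z]", "A", value)

print(phones.map(to_pattern).value_counts())
# The output shows how many rows follow each format and exposes outliers
# such as free text in a field that should be numeric.
```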
Verification can be extended to data models. For example: are there redundant columns in the same table?
Content discovery
The content discovery process consists of examining all rows of the data tables to verify the veracity of the data. This can help you identify null, inaccurate or ambiguous values.
Any use of a dataset should start with data profiling to highlight inconsistent or ambiguous elements.
For example, if you want to conduct a holiday card mailing campaign to your customers, you should first verify that your customers’ addresses are consistent to minimize the rate of undelivered mail.
This type of analysis relies on elementary statistical methods (minimum, maximum, mean, median, mode, standard deviation…) or on more advanced methods such as machine learning, popularly referred to as artificial intelligence.
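As a simple illustration, the sketch below uses pandas to compute these elementary statistics on a hypothetical customer table; the column names and values are purely illustrative:

```python
import pandas as pd

# Hypothetical customer table used to illustrate content discovery.
df = pd.DataFrame({
    "age": [34, 29, None, 41, 29, 130],        # one null, one implausible value
    "city": ["Paris", "paris", "Lyon", None, "Lyon", "Lyon"],
})

# Elementary statistics: count, mean, standard deviation, min, max, quartiles.
print(df["age"].describe())

# Null counts and value distributions highlight missing or ambiguous content,
# such as inconsistent capitalization of "Paris".
print(df.isna().sum())
print(df["city"].value_counts(dropna=False))
```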
Relationship discovery
Relationship discovery involves understanding the relationships between data sets. It’s about identifying the most important links between data and focusing on areas where data overlap. When used with data lineage, this technique aligns data that shares the same meaning across the enterprise.
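As an illustration, the sketch below checks how well a candidate key overlaps between two hypothetical tables (orders and customers); the column names are assumptions made for the example:

```python
import pandas as pd

# Two hypothetical tables that may share a customer identifier.
orders = pd.DataFrame({"customer_id": [1, 2, 2, 3, 5]})
customers = pd.DataFrame({"customer_id": [1, 2, 3, 4]})

# Measure how much the candidate key overlaps between the two data sets.
order_keys = set(orders["customer_id"])
customer_keys = set(customers["customer_id"])

overlap = order_keys & customer_keys
print(f"Overlap: {len(overlap)} of {len(order_keys)} order keys exist in customers")
print("Orphan keys (orders without a matching customer):", order_keys - customer_keys)
```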
Uses of data profiling
Data profiling is mainly used for the following purposes:
Data quality
As noted above, data profiling is widely used to quickly assess data quality and bring data up to the company’s standards. This is done by unifying data models and/or running correction or data cleansing campaigns.
Data cleansing is an essential step in the data preparation process, as it helps with de-duplication and error correction while ensuring data accuracy and relevance. However, it is only useful where data quality is actually poor, and that is where data profiling comes in: it detects problems that would otherwise go completely unnoticed.
For this reason, data quality and data profiling technologies scrutinize massive amounts of data to identify inaccurate fields, null values, and other statistical anomalies that may affect data processing.
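As a rough illustration, the sketch below computes a minimal quality report on a hypothetical contact table with pandas; the column names and the plausibility threshold are assumptions made for the example:

```python
import pandas as pd

# Hypothetical contact table to be assessed before a cleansing campaign.
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", None, "b@y.com"],
    "signup_year": [2021, 2021, 2022, 1899],   # 1899 is almost certainly wrong
})

# Minimal quality report: duplicates, nulls, implausible values.
report = {
    "rows": len(df),
    "duplicate_rows": int(df.duplicated().sum()),
    "null_counts": df.isna().sum().to_dict(),
    "suspect_years": df.loc[df["signup_year"] < 2000, "signup_year"].tolist(),
}
print(report)
```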
This use case is of particular interest to data stewards, who are responsible for data sets within the framework of data governance.
Data migration
Data migration is the process of transferring a large amount of data from a source system (to be decommissioned) to a target (new) system. For example: migrating from an old customer relationship management (CRM) system to a new one while keeping all customer data.
To maintain consistency between the old and new systems, differences in the data must first be identified and resolved before the transfer begins.
Data profiling techniques are ideal for reducing the risk of errors, duplicates and inaccurate data before the migration process begins. Failure to manage this type of risk can lead to migration failure with significant financial repercussions for the company.
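As an illustration, the sketch below performs a few simple pre-migration reconciliation checks between hypothetical source and target extracts; the table layout is an assumption made for the example:

```python
import pandas as pd

# Hypothetical extracts from the source (old CRM) and target (new CRM).
source = pd.DataFrame({"id": [1, 2, 3], "email": ["a@x.com", "b@y.com", None]})
target = pd.DataFrame({"id": [1, 2], "email": ["a@x.com", "b@y.com"]})

# Simple reconciliation checks before (and after) the migration.
print("Column differences:", set(source.columns) ^ set(target.columns))
print("Row count delta:", len(source) - len(target))
print("Missing ids in target:", set(source["id"]) - set(target["id"]))
print("Null emails in source (to fix before migrating):", int(source["email"].isna().sum()))
```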
This use case is of particular interest to enterprise architects or data architects, who are sometimes in charge of this type of project.
Data integration
By combining data from multiple sources, data integration provides a complete view of the company’s data. When source data is ingested and fed into a data warehouse, data hub or data mart, profiling it prior to integration is a safeguard against the most common errors.
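As an illustration, the sketch below shows a lightweight profiling gate that a hypothetical batch could pass through before being loaded into a warehouse; the expected schema and the specific checks are assumptions made for the example:

```python
import pandas as pd

# Hypothetical batch about to be loaded into a data warehouse.
batch = pd.DataFrame({"order_id": [10, 11, 11], "amount": [99.5, None, 42.0]})

expected_columns = {"order_id", "amount"}

# Lightweight profiling gate before ingestion: schema, key uniqueness, nulls.
issues = []
if set(batch.columns) != expected_columns:
    issues.append(f"unexpected schema: {sorted(batch.columns)}")
if batch["order_id"].duplicated().any():
    issues.append("duplicate order_id values")
if batch["amount"].isna().any():
    issues.append("null amounts")

print("Load blocked:" if issues else "Load allowed", issues)
```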
This use case is of interest to data engineers or data custodians.
Cybersecurity and fraud detection
Finally, thanks to the overview it provides, data profiling can highlight anomalies resulting from a cyberattack or fraud.
For example, among the data of all the accounts of a social network, a statistically abnormal repetition can suggest false accounts used to spread fake news on a large scale…
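As a rough illustration, the sketch below flags a statistically abnormal repetition in a hypothetical account table (many sign-ups from the same IP address); the data and the threshold are purely illustrative:

```python
import pandas as pd

# Hypothetical account table: many accounts created from the same IP address
# is a statistically abnormal repetition worth flagging.
accounts = pd.DataFrame({
    "account_id": range(15),
    "signup_ip": ["1.1.1.1"] * 10
                 + ["2.2.2.2", "3.3.3.3", "4.4.4.4", "5.5.5.5", "6.6.6.6"],
})

counts = accounts["signup_ip"].value_counts()
threshold = counts.mean() + 2 * counts.std()   # crude anomaly threshold

print("Suspicious IPs:")
print(counts[counts > threshold])
```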
Strategies for implementing data profiling tools
There are several approaches to implementing data profiling in an organization:
Data profiling tools in ETLs
First of all, let’s remember that an ETL (Extract, Transform, Load) tool automates the transfer of data from one data source (e.g. a database) to another (e.g. a data warehouse). This is also known as data integration. As seen above, it is recommended to perform data quality assessments before proceeding. This is why most ETL solutions on the market offer data profiling modules, such as Talend Open Studio.
Talend Open Studio is one of the best known open source data integration and profiling tools. It performs simple data integration tasks in batches or in streaming. This tool has a number of features, such as data management and cleansing, text field analysis, real-time data integration from any source, and more. Time series management is one of the tool’s distinctive value propositions. In addition, it provides a simple user interface that displays a number of graphs and tables that show the profiling results for each data element.
Data profiling tools in data management suites
Data management and business intelligence platforms – whether on-premises or in the cloud – typically have data profiling capabilities. We can mention Microsoft Power BI or Qlik for BI; or Ataccama or Denodo for data platforms. If you use other platforms, refer to their documentation to see what they offer in terms of data quality and profiling.
Data profiling tools in data governance tools
In addition to their data catalog, the richest data governance suites – including Collibra and Alation – also include a data profiling module. Users see metadata side by side with an overview of the quality of the data that this metadata describes.
On the other hand, this approach raises security concerns: while it is desirable for as many employees as possible to have access to the metadata in the data catalog (see our article on data literacy), what about the risk of exposing information about the data itself? These modules can be configured not to display everything… at the cost of significantly more data stewardship work.
Stand-alone data profiling tools
When a company wants a flexible and scalable information system, it tends to limit the number of functions per tool. For example, if you base your data profiling on Talend Open Studio, it will be harder to replace it with another ETL later, because you will also have to replace the data profiling that comes with it.
This is why there are various tools dedicated to data profiling, from the simplest to the most powerful:
OpenRefine
OpenRefine is an open source application for cleaning and formatting data. It was previously known as Google Refine and, before that, Freebase Gridworks. Released in 2010, this data profiling tool has been improved by the active OpenRefine community to keep up with changing user needs.
Open Source Data Quality and Profiling
Open Source Data Quality and Profiling is an integrated, high-performance data management platform with capabilities for data preparation, metadata identification, anomaly discovery and other data management tasks.
Apache Griffin
For big data, there is an open source solution dedicated to data quality and data profiling: Apache Griffin. It supports batch and streaming modes. In addition, it provides a unified process for measuring the quality of your data from different angles, helping you create trusted data assets.
Python ecosystem and data science
As for data scientists, they have no shortage of options thanks to the many libraries available in Python (and Jupyter). The best known include pandas-profiling, Lux, Sweetviz, AutoViz, dataprep and skimpy…
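As an illustration, here is a minimal pandas-profiling usage sketch (note that recent releases of the library are published under the name ydata-profiling, with a matching import); the DataFrame is purely illustrative:

```python
import pandas as pd
from pandas_profiling import ProfileReport  # ydata_profiling in recent versions

# Any DataFrame can be profiled; here a tiny illustrative one.
df = pd.DataFrame({
    "age": [34, 29, None, 41],
    "country": ["FR", "FR", "US", "DE"],
})

# Generates an interactive HTML report with distributions, missing values,
# correlations and duplicate detection for every column.
profile = ProfileReport(df, title="Customer data profile")
profile.to_file("customer_profile.html")
```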
Data profiling and data governance
This article gives you an overview of what data profiling is and why it is useful. It also discusses the different approaches to implement it.
Today, most companies use many data applications on a daily basis. As a result, data is scattered across various databases (known as silos), which makes it difficult to exploit in a 360-degree view. To remedy this, companies tend to consolidate all their data in a central location called a data hub, following a so-called data-centric approach. Nevertheless, the quality of the data must be controlled for the result to live up to expectations: data profiling is the first tool to use.
Finally, since it helps organizations get more value from their data and manage it as an asset, data profiling is also one of the tools of data governance, alongside the data catalog. At Data Éclosion, we have the expertise to help you with data profiling and much more… So don’t hesitate to contact us.