What is a Data Catalog? How to Choose One

Data unification and collaboration are the keystones of any organization boasting a sound data-driven strategy. A data catalog helps you maintain and manage an inventory of data assets while letting you explore and understand the data sources you have at your disposal.

At the same time, it allows you to extract the maximum value out of your information assets. Generally, it is necessary for a user to know the connection string or path pertaining to a data source in order to access it. A data catalog addresses this problem and assists the user in discovering these data sources.

A data catalog is especially useful for analysts: it helps them not only find the data they need for self-service analytics, but also trust it. Data catalogs accumulate metadata on datasets that describes standard database objects, such as tables, queries, and schemas stored in a data warehouse.

Further, this metadata can be enriched with annotations created in analytic applications and shared through the catalog. Just like a search engine, a data catalog crawls databases to provide a single point of access for enterprise data, eventually resulting in an index of all the data that can be reached from a single place.

The process of evaluating and selecting a data catalog from the wide array of solutions available in the market can be daunting. Many products come close to being a data catalog but fail to address the actual problems. You will also come across solutions that are meant to be integrated with a real data catalog rather than to serve as one.

Therefore, it is important to know what makes a data catalog successful. Let’s have a look.

Automated Population of the Catalog

It would save data stewards a lot of time if they did not have to connect data sources manually. Automated data catalog population is achieved either by scanning through APIs that gather metadata from tables and stored procedures, or by analyzing data values with algorithms that tag the data.
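Both approaches can be illustrated with a minimal sketch. It assumes a SQLite database as the source, and the function name `scan_catalog`, the `email` tag, and the sample table are hypothetical choices made for illustration; a real catalog would use its own connectors and much richer tagging rules.

```python
import re
import sqlite3

# Assumption: a very rough pattern for email-like values, used only for tagging.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def scan_catalog(conn, sample_size=100):
    """Return {table: {column: {"type": ..., "tags": [...]}}} for every table."""
    catalog = {}
    # Step 1: gather structural metadata from the database itself.
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        columns = {}
        table_info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        for _cid, name, col_type, *_rest in table_info:
            # Step 2: sample the values and tag the column based on its content.
            rows = conn.execute(
                f'SELECT "{name}" FROM "{table}" '
                f'WHERE "{name}" IS NOT NULL LIMIT ?',
                (sample_size,),
            ).fetchall()
            values = [str(row[0]) for row in rows]
            tags = []
            if values and all(EMAIL_RE.match(v) for v in values):
                tags.append("email")
            columns[name] = {"type": col_type, "tags": tags}
        catalog[table] = columns
    return catalog


# Hypothetical sample data for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, contact TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "ann@example.com"), (2, "bob@example.com")],
)
catalog = scan_catalog(conn)
print(catalog)
```

In this sketch the `contact` column is tagged from its values alone, even though nothing in its name suggests that it holds email addresses.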

Crowd-sourced Tag Curation

As great as it is to have automatically populated tags in a data catalog, the catalog must also allow crowd-sourcing: human contributions of ratings, annotations, and documentation, as well as human appraisal of tags. AI-powered data catalogs should use machine learning to learn from this human-provided content so as to make automated tagging more precise.

Powerful Search

It is advisable to use a data catalog that is powered by a proven search engine, one that allows users to search the catalog for effective data-driven decision making. Moreover, the search must be multi-dimensional, which means a user must be able to define different parameters in their search query. Some of these parameters may be name, size, owner, time, and format.
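A multi-dimensional search of this kind can be sketched as a filter over catalog entries. The `Entry` fields and the `search` function below are assumptions made for illustration, not the API of any particular product, and a production catalog would delegate this work to a real search engine.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Entry:
    """Hypothetical catalog entry carrying the parameters named above."""
    name: str
    size_mb: float
    owner: str
    updated: date
    fmt: str


def search(entries, name_contains=None, owner=None, fmt=None,
           max_size_mb=None, updated_after=None):
    """Return entries matching every parameter that was supplied."""
    results = []
    for e in entries:
        if name_contains is not None and name_contains.lower() not in e.name.lower():
            continue
        if owner is not None and e.owner != owner:
            continue
        if fmt is not None and e.fmt != fmt:
            continue
        if max_size_mb is not None and e.size_mb > max_size_mb:
            continue
        if updated_after is not None and e.updated <= updated_after:
            continue
        results.append(e)
    return results


# Hypothetical entries for the demonstration.
entries = [
    Entry("sales_2019", 120.0, "finance", date(2019, 7, 1), "parquet"),
    Entry("sales_archive", 900.0, "finance", date(2017, 1, 15), "csv"),
    Entry("web_clicks", 50.0, "marketing", date(2019, 6, 20), "json"),
]

hits = search(entries, name_contains="sales", owner="finance",
              updated_after=date(2019, 1, 1))
print([e.name for e in hits])
```

Each parameter is optional, so the same function serves a one-word lookup as well as a query that combines name, owner, and time.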

Scalability at Enterprise Level

Another important feature of a data catalog is its ability to scale across the whole enterprise. If a data catalog is built on big data technology that relies on cloud infrastructure, chances are that it is scalable enough.
It must be able to manage a wide array of data sources and scale to match your evolving data landscape, from relational, semi-structured, and unstructured data to data that resides anywhere: on-premise, hybrid, or cloud.

Lineage for Root Cause Analysis

Lineage helps you establish a link between a dashboard and the data it shows, which lets a user understand the relationships between different data sources. This is particularly useful when a dashboard shows inconsistent data: a data steward can use lineage to identify the root of the problem and come up with an appropriate solution.
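Tracing a dashboard back to its sources amounts to walking a dependency graph. The sketch below assumes lineage is stored as a simple mapping from each asset to its direct upstream assets; the asset names and both functions are hypothetical.

```python
# Assumption: each asset maps to the assets it is directly built from.
UPSTREAM = {
    "sales_dashboard": ["monthly_sales_view"],
    "monthly_sales_view": ["orders", "currency_rates"],
    "orders": ["crm_export"],
    "currency_rates": [],
    "crm_export": [],
}


def trace_upstream(asset, upstream):
    """Return every asset the given asset depends on, directly or indirectly."""
    found = []
    visited = {asset}
    stack = [asset]
    while stack:
        node = stack.pop()
        for parent in upstream.get(node, []):
            if parent not in visited:  # guard against revisiting shared parents
                visited.add(parent)
                found.append(parent)
                stack.append(parent)
    return found


def root_sources(asset, upstream):
    """Return the upstream assets that have no further upstream of their own."""
    return sorted(a for a in trace_upstream(asset, upstream) if not upstream.get(a))


print(trace_upstream("sales_dashboard", UPSTREAM))
print(root_sources("sales_dashboard", UPSTREAM))
```

A data steward investigating an inconsistent dashboard could start with the root sources and work forward through the intermediate assets until the faulty step is found.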

Business Glossary

There needs to be a standard terminology to unify the people who work on the data. If they don’t share a common understanding of the terms that link to the data, only confusion and complexity will arise. Linking glossary terms to the data makes the glossary actionable. For example, if you search for PII in a data catalog and find the relevant data sources containing it, that is extremely useful in a GDPR context.
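The PII example can be sketched as a lookup from a glossary term to the tags it covers, and from those tags to the data sources that carry them. The glossary content, the tag names, and the `find_assets` function are all assumptions made for illustration.

```python
# Assumption: a glossary term lists the column tags that fall under it.
GLOSSARY = {
    "PII": {
        "definition": "Personally identifiable information",
        "tags": {"email", "phone", "national_id"},
    },
}

# Assumption: each data source is described by the tags found on its columns.
ASSETS = {
    "customers": {"email", "country"},
    "employees": {"national_id", "salary"},
    "product_list": {"sku", "price"},
}


def find_assets(term, glossary, assets):
    """Return the names of data sources carrying any tag linked to the term."""
    entry = glossary.get(term.upper())
    if entry is None:
        return []
    return sorted(name for name, tags in assets.items() if tags & entry["tags"])


pii_assets = find_assets("pii", GLOSSARY, ASSETS)
print(pii_assets)
```

Because the term is tied to tags rather than to table names, a newly scanned source that carries one of those tags would show up in the same search without anyone editing the glossary.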

In conclusion, a data catalog should be an integral part of your data strategy. Taking control of data can be a challenge, and it is important to make a collaborative effort and set up a single place for trusted data. So, stop polluting the data lake, and include a data catalog in your analytic toolkit.

Last updated: 7/24/2019 10:58:52 PM
John Reiley

John Reiley is a Texas-based senior business analyst. He has been helping small business owners plan their strategies for success since 2005. He is a big gadget freak who loves to share his views on the latest technologies and applications.

