Large Language Models for Data Catalog Enrichment: A Survey With Operational Evidence From Enterprise Deployments

Kuladeep Sandra

doi:10.37082/IJIRMPS.v12.i3.233063

Authors: Kuladeep Sandra

DOI: https://doi.org/10.37082/IJIRMPS.v12.i3.233063

Short DOI: https://doi.org/hbxvf9

Country: United States

Abstract: Enterprise data catalogs have failed to achieve adoption despite billion-dollar investments because high-touch human curation does not scale with data volume. In deployments across banking and insurance, the first two catalog implementations stalled: a 2018 Azure Purview rollout reached only 350 registered tables and 12 active users, while a 2020 Collibra deployment grew to 1,200 tables but left 28% without registered owners 18 months after launch. A third implementation succeeded by integrating the catalog with the data access workflow, reaching 3,000 tables in 3 months. This paper surveys how Large Language Models (LLMs) address the residual curation gap. We report on a production pilot enriching 10,000 tables with GPT-4: 88% of generated descriptions were rated good or excellent, owner suggestion accuracy reached 72% exact match, and sensitivity classification achieved 85% agreement with human stewards. Steward review time fell from 8 minutes to 2 minutes per table. Ownership coverage rose from 28% to 89%; description completeness rose from 19% to 84%; active users grew from 8 to 127. We present a reference enrichment architecture, discuss failure modes including hallucination and inappropriate owner inference, and identify open research challenges in quality measurement, generalization, and privacy.
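The headline figures in the abstract can be turned into aggregate effort and coverage numbers. The short Python sketch below does that arithmetic; it assumes, purely for illustration, that every one of the 10,000 pilot tables receives exactly one steward review at the quoted per-table time, which the abstract does not state. The function and constant names are hypothetical.

```python
# Figures quoted in the abstract.
TABLES = 10_000            # tables enriched in the pilot
REVIEW_MIN_BEFORE = 8      # steward minutes per table before LLM assistance
REVIEW_MIN_AFTER = 2       # steward minutes per table with LLM assistance


def summarize() -> dict:
    """Aggregate the abstract's per-table figures.

    Assumption (not from the source): each table is reviewed exactly once.
    """
    minutes_before = TABLES * REVIEW_MIN_BEFORE
    minutes_after = TABLES * REVIEW_MIN_AFTER
    return {
        "hours_before": minutes_before / 60,
        "hours_after": minutes_after / 60,
        "hours_saved": (minutes_before - minutes_after) / 60,
        # Percentage-point gains reported in the abstract.
        "ownership_gain_points": 89 - 28,
        "description_gain_points": 84 - 19,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.1f}")
```

Under that single-review assumption, the drop from 8 to 2 minutes per table corresponds to 1,000 steward hours saved across the pilot, alongside gains of 61 percentage points in ownership coverage and 65 in description completeness.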

Paper Id: 233063

Published On: 2024-06-14

Published In: Volume 12, Issue 3, May-June 2024

All research papers published in this journal and on this website are openly accessible and licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. Accordingly, any user can read, download, copy, distribute, print, search, or link to the full texts of the articles that authors and researchers have submitted and published, crawl them for indexing, pass them as data to any software, or use them for any other lawful purpose. The journal fulfills the DOAJ's definition of open access.

International Journal of Innovative Research in Engineering & Multidisciplinary Physical Sciences
E-ISSN: 2349-7300 • Impact Factor - 9.907

A Widely Indexed Open Access Peer Reviewed Online Scholarly International Journal
