Large Language Models for Data Catalog Enrichment: A Survey With Operational Evidence From Enterprise Deployments
Authors: Kuladeep Sandra
DOI: https://doi.org/10.37082/IJIRMPS.v12.i3.233063
Short DOI: https://doi.org/hbxvf9
Country: United States
Abstract: Enterprise data catalogs have failed to achieve adoption despite billion-dollar investments because high-touch human curation does not scale with data volume. In deployments across banking and insurance, the first two catalog implementations stalled: a 2018 Azure Purview rollout reached only 350 registered tables and 12 active users, while a 2020 Collibra deployment grew to 1,200 tables but left 28% without registered owners 18 months after launch. A third implementation succeeded by integrating the catalog with the data access workflow, reaching 3,000 tables in 3 months. This paper surveys how Large Language Models (LLMs) address the residual curation gap. We report on a production pilot enriching 10,000 tables with GPT-4: 88% of generated descriptions were rated good or excellent, owner suggestion accuracy reached 72% exact match, and sensitivity classification achieved 85% agreement with human stewards. Steward review time fell from 8 minutes to 2 minutes per table. Ownership coverage rose from 28% to 89%; description completeness rose from 19% to 84%; active users grew from 8 to 127. We present a reference enrichment architecture, discuss failure modes including hallucination and inappropriate owner inference, and identify open research challenges in quality measurement, generalization, and privacy.
Paper Id: 233063
Published On: 2024-06-14
Published In: Volume 12, Issue 3, May-June 2024
All research papers published in this journal/on this website are openly accessible and licensed under