Metadata management is becoming a big issue again (for the “nth” time over the years), thanks to attention on initiatives such as data governance, data quality, and master data management. This time it finally feels more serious, and that’s a good thing. One concept that vendors and architects are promoting as part of metadata management (whether home-grown or part of a packaged solution) is “data lineage.” What does that mean?
Let’s imagine that a user is confused about a value in a report…perhaps a numeric result labeled “Quarterly Profit Variance.”
What do they do today to gain further awareness of this amount and trust it enough to make a decision? In many large enterprises, they call someone at the “help desk.” This leads to additional phone calls and emails, and then to one or more analysts or developers sifting through various systems, reviewing source code, and tracking down “subject matter experts.” One large bank I visited recently said that this can take _days_…and that’s assuming they ever find the answer at all. In too many other cases the executive cannot wait that long and makes the decision without knowing the background, with all the risks that entails.
Carefully managed metadata that supports data lineage can help.
Using the example above, quick access to a corporate glossary of terminology would enable an executive to look up the business definition of “Quarterly Profit Variance.” That may help them understand the business semantics, but it may not be enough. They or their support team may need to drill deeper: “Where did the value come from?” “How is it calculated?”
Data lineage can answer these questions by tracing the data’s path (its “lineage”) upstream from the report. Many sources and expressions may have contributed to the final value. The lineage path may run through cubes and database views, ETL processes that load a warehouse or data mart, intermediate staging tables, shell and FTP scripts, and even legacy systems on the mainframe. This lineage should be presented in a visual format, preferably at a summary level with the ability to drill down into individual column and process details.
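Under the hood, that upstream trace is just a walk over a directed graph of “this artifact is fed by these sources.” As a minimal sketch, here is one way it could look, assuming lineage metadata has been captured as a mapping from each downstream artifact to its immediate upstream inputs (all of the names below — the reports, tables, ETL jobs, and systems — are hypothetical, invented purely for illustration):

```python
# Hypothetical lineage metadata: each downstream artifact maps to the
# immediate upstream artifacts (tables, ETL jobs, scripts, systems) that feed it.
UPSTREAM = {
    "report.quarterly_profit_variance": ["mart.profit_summary"],
    "mart.profit_summary": ["etl.load_profit_mart"],
    "etl.load_profit_mart": ["staging.gl_extract", "staging.budget_extract"],
    "staging.gl_extract": ["mainframe.general_ledger"],
    "staging.budget_extract": ["erp.budget_tables"],
}

def trace_lineage(artifact, graph):
    """Walk upstream from an artifact, collecting every contributing source."""
    seen = set()
    stack = [artifact]
    while stack:
        node = stack.pop()
        for parent in graph.get(node, []):
            if parent not in seen:
                seen.add(parent)       # record each upstream contributor once
                stack.append(parent)   # keep walking further upstream
    return seen

lineage = trace_lineage("report.quarterly_profit_variance", UPSTREAM)
```

Here `lineage` would contain everything from the data mart back to the mainframe general ledger. A real metadata repository would of course also carry the transformation logic and column-level detail on each edge, which is what makes the drill-down view described above possible.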
Knowing the original source, and understanding “what happens” to the data as it flows into a report, helps boost confidence in the results and in the overall business intelligence infrastructure. Ultimately this leads to better decision making. Happy reporting!
Update (2017-03-30) to this old but still very valid discussion: please check out this excellent new blog entry on data lineage by one of my IBM colleagues, Distinguished Engineer and thought leader Mandy Chessell: https://poimnotes.blog/2017/03/19/understanding-the-origin-of-data/