Metadata management is becoming a big issue (…again, for the “nth” time over the years), thanks to attention to initiatives such as data governance, data quality, master data management, and others. This time it finally feels more serious. That’s a good thing. One concept that vendors and architects are pushing in metadata management (whether home-grown or part of a packaged solution) is “data lineage.” What does that mean?
Let’s imagine that a user is confused about a value in a report…perhaps it is a numeric result labeled as “Quarterly Profit Variance”.
What do they do today to learn more about this amount and trust it for making a decision? In many large enterprises, they call someone at the “help desk”. This leads to additional phone calls and emails, and then one or more analysts or developers sifting through various systems, reviewing source code, and tracking down “subject matter experts.” One large bank I visited recently said that this can take _days_! …and that’s assuming they ever successfully find the answer. In too many other cases the executive cannot wait that long and makes a decision without knowing the background, with all the risks that entails.
Carefully managed metadata that supports data lineage can help.
Using the example above, quick access to a corporate glossary of terminology will enable an executive to look up the business definition for “Quarterly Profit Variance.” That may help them understand the business semantics, but may not be enough…. They or their support team may need to drill deeper. “Where did the value come from?” “How is it calculated?”
Data lineage can answer these questions, tracing the data path (its “lineage”) upstream from the report. Many sources and expressions may have contributed to the final value. The lineage path may run through cubes and database views, ETL processes that load a warehouse or datamart, intermediate staging tables, shell and FTP scripts, and even legacy systems on the mainframe. This lineage should be presented in a visual format, preferably as a summary-level view with an option to drill down for individual column and process details.
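As a rough illustration of “tracing upstream,” here is a minimal Python sketch that treats lineage as a graph and walks it backwards from the report. Every asset name in it (the cube, view, ETL job, staging tables, FTP transfer, and mainframe source) is hypothetical and invented for the example. A real lineage tool would harvest these relationships from the systems themselves rather than from a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets it is derived from.
UPSTREAM = {
    "report.quarterly_profit_variance": ["cube.finance"],
    "cube.finance": ["view.profit_by_quarter"],
    "view.profit_by_quarter": ["warehouse.fact_profit"],
    "warehouse.fact_profit": ["etl.load_profit"],
    "etl.load_profit": ["staging.gl_extract", "staging.budget_extract"],
    "staging.gl_extract": ["ftp.gl_transfer"],
    "ftp.gl_transfer": ["mainframe.general_ledger"],
    "staging.budget_extract": [],
    "mainframe.general_ledger": [],
}


def trace_upstream(asset, upstream=UPSTREAM):
    """Return every asset upstream of `asset`, breadth-first (nearest first)."""
    seen = set()
    order = []
    queue = deque(upstream.get(asset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited via another path
        seen.add(node)
        order.append(node)
        queue.extend(upstream.get(node, []))
    return order


def original_sources(asset, upstream=UPSTREAM):
    """Return the upstream assets that have nothing further upstream of them."""
    return [n for n in trace_upstream(asset, upstream) if not upstream.get(n)]


if __name__ == "__main__":
    report = "report.quarterly_profit_variance"
    for step in trace_upstream(report):
        print(step)
    print("Original sources:", original_sources(report))
```

The full list from `trace_upstream` corresponds to the drill-down view, while `original_sources` is closer to the summary answer to “Where did the value come from?”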
Knowing the original source, and understanding “what happens” to the data as it flows to a report helps boost confidence in the results and the overall business intelligence infrastructure. Ultimately this leads to better decision making. Happy reporting!
Ernie Ostic
Update (2017-03-30) to this old but still very valid discussion….. please check out this excellent new blog entry by one of my IBM colleagues, Distinguished Engineer and thought leader Mandy Chessell, regarding Data Lineage … https://poimnotes.blog/2017/03/19/understanding-the-origin-of-data/
December 15, 2009 at 10:37 pm
One caveat I can see with respect to data lineage is there may be intellectual property involved in a calculation somewhere upstream from the report. In that case, the lineage might intentionally be less complete and useful than it otherwise would be.
December 16, 2009 at 7:55 am
Hi Bill…
Thanks for the thought. Yes, this can happen. The key thing for the metadata administrator [or “the one wearing the hat who is responsible for the lineage research” 🙂 ] is to be able to at least document that the intellectual property exists, whether it is a black box, a company-sensitive calculation, an embedded “purchased” solution, or an external web service (etc.). There needs to be a way in the lineage tooling to represent this artifact (and represent it graphically as noted), and if nothing more, have a URL to it, a phone number, a “steward” responsible for it, or something so that the details can be obtained if/when absolutely necessary.
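One possible way to picture this, purely as a sketch: a lineage node that can be flagged as a black box and still carries pointers to whoever can supply the details. The class name, field names, and sample values below are all hypothetical, not taken from any particular lineage product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LineageNode:
    """One artifact in a lineage path (hypothetical structure for illustration)."""
    name: str
    kind: str                             # e.g. "table", "etl_job", "black_box"
    steward: Optional[str] = None         # person or team responsible
    contact: Optional[str] = None         # phone number or email
    reference_url: Optional[str] = None   # where details can be requested

    def is_opaque(self) -> bool:
        """True when the internals are intentionally not documented."""
        return self.kind == "black_box"

    def describe(self) -> str:
        """Text a lineage viewer could show for this node."""
        if not self.is_opaque():
            return f"{self.name} ({self.kind})"
        pointers = [p for p in (self.steward, self.contact, self.reference_url) if p]
        detail = "; ".join(pointers) if pointers else "none recorded"
        return f"{self.name} (black box, details withheld; contact: {detail})"


if __name__ == "__main__":
    # Hypothetical proprietary calculation sitting upstream of a report.
    calc = LineageNode(
        name="risk_adjustment_calc",
        kind="black_box",
        steward="Pricing team",
        contact="pricing-steward@example.com",
    )
    print(calc.describe())
```

The point is only that the node still appears in the lineage path, so the gap is visible and someone is named as responsible for it.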
Ernie
December 16, 2009 at 10:13 am
I have always wondered how lineage deals with some of the more complex, yet very real, data management approaches. E.g. if data is stored in an object, subject, predicate form or other similar forms where the “meta-data” i.e. what class of items are being dealt with, is held as data. In this case dozens of individual items may be managed in the same set of columns. I presume this makes it hard to specify rules, such as: when dealing with “agent contracts” (predicate) xyz rules are applied to the data, while when dealing with “direct customers”, abc rules are applied to the data.
In the simple case, metadata lineage makes perfect sense and should certainly be encouraged, I am just wondering how in practice some of the more complex nuances of data are supported by lineage?
December 16, 2009 at 11:14 am
Thanks Darren. Good point. Some people’s “data” is really “metadata,” and the degree to which you have to “drill down” to find the actual details can be extensive. I’ve been finding that good metadata management and data lineage can often take some creative “artwork” and brainstorming. A key part of that brainstorming is determining “who” the lineage is for, and where you draw the line between “lineage” and management of metadata and simply “going to the tool or product or application and opening it up”.
Two different banks offered use cases that are helpful here. In one case they already had a legacy home-grown metadata management application that keeps low-level “rule” detail in a relational table, but they didn’t have an up-to-date distribution system (old 3270 green screen stuff), or a way to inter-relate the legacy system with new objects blossoming all around them, from ERwin models to newly acquired database objects and transformation tools. So a hybrid is being put together: some of their tangible objects are being represented directly, and where needed, a URL dumps out “metadata” from their generic table with appropriate filtering and stylesheet display.
The other use case more clearly outlines what “degree” of metadata is required to be useful. Their concern is 1000’s of mainframe data sets that are managed by many 100’s of COBOL programs. They no longer have the intellectual knowledge of what happens to individual “fields” in the copy books of those programs, nor anyone who would be able to, in a brief glance, even comprehend what happens at the field level. But simply knowing “which” COBOL program moves “which” files to and fro (and which files are sourced by which other files and systems) would save them days of combing through JCL. Consequently, the “black box” that they represent in lineage doesn’t need to be too detailed.
If someone needs to know exactly what the MOVE statement looks like in the COBOL code, they can go directly to their source management system and look at the code itself (assuming it still exists — I’ve met sites that don’t have the source anymore either!)
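To make the “program-level is enough” idea concrete, here is a small Python sketch. It records only which program reads and writes which data sets, with no field-level detail, and walks back from a data set to everything that feeds it. The program and data set names are hypothetical, made up for the example.

```python
# Hypothetical program-level records: (program, input data sets, output data sets).
# Nothing here describes what happens to individual fields inside a program.
PROGRAM_IO = [
    ("PGM001", ["RAW.CUSTOMER"], ["CLEAN.CUSTOMER"]),
    ("PGM002", ["CLEAN.CUSTOMER", "RAW.ACCOUNT"], ["MASTER.ACCOUNT"]),
    ("PGM003", ["MASTER.ACCOUNT"], ["REPORT.EXTRACT"]),
]


def producers_of(dataset, program_io=PROGRAM_IO):
    """Return (program, inputs) for every program that writes `dataset`."""
    return [(prog, inputs) for prog, inputs, outputs in program_io
            if dataset in outputs]


def upstream_hops(dataset, program_io=PROGRAM_IO, _seen=None):
    """Return (source, program, target) hops feeding `dataset`, nearest first."""
    seen = set() if _seen is None else _seen
    hops = []
    for program, inputs in producers_of(dataset, program_io):
        for source in inputs:
            hop = (source, program, dataset)
            if hop in seen:
                continue  # guard against cycles and repeated paths
            seen.add(hop)
            hops.append(hop)
            hops.extend(upstream_hops(source, program_io, seen))
    return hops


if __name__ == "__main__":
    for source, program, target in upstream_hops("REPORT.EXTRACT"):
        print(f"{source} --[{program}]--> {target}")
```

Each program stays a black box here. Anyone who needs the actual MOVE logic still goes to the source code, but the “which program moves which files” question is answered without reading JCL.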
Ernie
January 23, 2012 at 10:01 am
Very nice explanation
June 10, 2013 at 5:09 pm
Love it. What is meta data? “Wherever you are” – ‘look up’ – that’s meta data to you!
June 10, 2013 at 5:14 pm
Thank you Barbara! …the person who really opened my eyes to the value of meta data! 🙂
May 12, 2016 at 9:34 am
If you are interested in sql data lineage try http://sqldep.com.
February 15, 2017 at 10:08 am
Awesome explanation. Thanks for sharing. I kept searching and read more articles, but it was clarified here.
July 5, 2018 at 3:35 am
Hi Ernie… Apologies, I couldn’t find the right topic to post this query.
Could you please share any insights, case studies, or demos available for creating Information Governance Dashboards? IBM Knowledge Center has vast information but doesn’t show any demos or lab exercises as such.
Please share any reference links you have for any of the above.