What is Data Lineage? (a reprise)

(from Merriam-Webster) re-prise: a recurrence, renewal, or resumption of an action

Hi Everyone.

Ten years ago I posted an entry called “What exactly is Data Lineage?”   Ten years!


Since that time, the concept of lineage has evolved and grown and taken on more meanings.  Lineage is now a major topic in every conversation that surrounds data, regulatory compliance, governance, metadata management, decision support, artificial intelligence, data intelligence, data quality, machine learning, and much, much more.  Let’s quickly review what ten years have done to the definition of “lineage”…

  • Ten years ago, we had barely started uttering the words “information governance” or “data governance”.  Today, “governance” is just one small part of the lineage equation.
  • Ten years ago, Hadoop and Data Lakes were in their infancy, and we were just starting to grasp the explosion of data we are swimming in today.
  • Ten years ago, we were exploring the display of lineage on our laptops, and “maybe” on our Blackberries. Today we expect graphical rendering on any device.
  • Ten years ago, many questioned whether we would need lineage for COBOL and other legacy systems alongside lineage for modern ETL tooling and coding methods. Now we demand lineage for everything!
  • Ten years ago, we didn’t think there would be any chance for metadata and lineage standardization. Today there are initiatives underway for common models and metadata sharing protocols.
  • Ten years ago, we weren’t thinking about lineage for ground-to-cloud, Spark, or lineage to illustrate decisions made by data citizens building machine learning or predictive analytic models. Today we are spawning new methods in open source and data science that demand lineage engineering.
  • Ten years ago…(I am sure you can think of many more…)…

Ten years ago!  Whew.

Today there are a multitude of web sites where you can dive into the topic of lineage.  Depending on your background, or interest, you can find resources pointing to everything from the calculation of lineage and its representation within a mathematical graph to the use of lineage for predicting bottlenecks and potential security breaches, and everything in-between.   You will find many definitions of lineage and its nuances.

Here are the major areas and definitions of lineage that are trending among my customers.

Data Lineage.   The basic definition of data lineage has remained constant, albeit with a lot of sub-divisions and “extended” descriptions.  Data Lineage is a representation of the “flow” of data, with the ability to trace what “happened” (or will happen) to that data, going back to “where it came from” or illustrating “where it goes to”.     The “extensions” of that definition generally branch out in terms of the level of granularity and the kind of lineage that is being tracked, traced, or followed.  I won’t try to list them all here — it would be redundant.  I encourage you to look at your own requirements, and then what YOUR users need.
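To make the “where it came from” and “where it goes to” idea concrete, here is a minimal sketch that treats lineage as a directed graph and walks it in either direction. The asset names and the helper functions are hypothetical, invented purely for illustration; a real lineage tool would populate the edges from scanners or parsers rather than a hand-written list.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source, target) means data flows from source to target.
EDGES = [
    ("crm.customers", "staging.customers"),
    ("erp.orders", "staging.orders"),
    ("staging.customers", "warehouse.customer_orders"),
    ("staging.orders", "warehouse.customer_orders"),
    ("warehouse.customer_orders", "reports.revenue_by_region"),
]


def build_index(edges):
    """Index the edges in both directions so either kind of trace is a lookup."""
    upstream = defaultdict(set)
    downstream = defaultdict(set)
    for source, target in edges:
        downstream[source].add(target)
        upstream[target].add(source)
    return upstream, downstream


def trace(start, neighbors):
    """Breadth-first walk returning every asset reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


UPSTREAM, DOWNSTREAM = build_index(EDGES)


def where_it_came_from(asset):
    """All assets that feed, directly or indirectly, into the given asset."""
    return trace(asset, UPSTREAM)


def where_it_goes_to(asset):
    """All assets that the given asset feeds, directly or indirectly."""
    return trace(asset, DOWNSTREAM)


if __name__ == "__main__":
    print(sorted(where_it_came_from("reports.revenue_by_region")))
    print(sorted(where_it_goes_to("crm.customers")))
```

The same traversal answers both questions; only the direction of the index changes. That is also why impact analysis (“what breaks if I change this?”) and root-cause tracing (“where did this number come from?”) tend to be offered side by side.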

Regarding level of granularity, how deep does your lineage need to go?   How many different ways does it need to be rendered?  Do you need low level technical lineage that drills into individual expressions and the actual “if…then…else” syntax that exists in your source code?   Or are users overwhelmed by that much information and need a higher level “Business Lineage” or “Conceptual Lineage” showing the general flow of data through your information lifecycle or the logical handling of your critical assets?  Do you need both?  Can you achieve either of these levels of granularity automatically?  Are parsers/scanners/bridges available?  Do you even have access to the source of your integration programs if an automated solution exists?   As you look at lineage solutions or build your own, first understand the granularity you want and need, based on your consumers and their use cases.
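One way to picture the granularity question is to start from fine-grained, column-level lineage and roll it up into a coarser table-level view. The sketch below assumes a hypothetical `schema.table.column` naming scheme and made-up expressions; it is only meant to show that a higher-level view can be derived from a lower-level one, while the reverse is not possible.

```python
# Hypothetical column-level lineage: (source column, target column, expression).
COLUMN_LINEAGE = [
    ("staging.orders.amount", "warehouse.sales.net_amount", "amount - discount"),
    ("staging.orders.discount", "warehouse.sales.net_amount", "amount - discount"),
    ("staging.customers.region", "warehouse.sales.region", "UPPER(region)"),
]


def table_of(column):
    """Strip the final name segment: 'schema.table.column' -> 'schema.table'."""
    return column.rsplit(".", 1)[0]


def roll_up(column_lineage):
    """Collapse column-level edges into distinct, sorted table-level edges."""
    return sorted({(table_of(src), table_of(dst)) for src, dst, _ in column_lineage})


if __name__ == "__main__":
    for source_table, target_table in roll_up(COLUMN_LINEAGE):
        print(f"{source_table} -> {target_table}")
```

Three technical edges collapse into two table-level edges here, and the expressions disappear from view. Whether that loss of detail is a relief or a problem depends entirely on who is consuming the lineage.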

Regarding the types of lineage, what are you trying to achieve?  As with the topic of granularity, determine what “kinds” of lineage YOUR users require.   Here are just a few of the types of lineage that are practiced and/or are being discussed.

  • Design Lineage. What the code, process or ‘thing’ you are exploring for lineage is “supposed” to do.
  • Operational or Run-time lineage. What the code, process or ‘thing’ you are exploring for lineage actually “did” (last night, last week, last version, last <fill in the blank>).  This discussion usually gets deep into capturing actual run-time parameter values.
  • Process Flow Lineage (flow of control, as opposed to flow of data). Which processes call other processes?  How are your systems initially invoked or “kicked off”?   This will then also have its own design vs runtime considerations.
  • …and finally, a type that is being driven harder and faster by the growing concerns for the handling of personal data. This is “Value-based Lineage” or “Data Provenance”.  Years ago, this was largely in the domain of customer relationship management; it refers to the ability to trace how a “specific, individual” record flows or flowed through the system.  This is of course critical now for GDPR, CCPA and similar efforts to “really know” where particular personal information lies and where it is going.
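The difference between design lineage and operational lineage from the list above can be sketched as a simple set comparison: what the job was “supposed” to do versus what it actually “did”. The table names below are hypothetical, and real run-time lineage would come from execution logs or captured parameter values rather than a hard-coded set.

```python
# Hypothetical edges declared in the job design: (source, target).
DESIGN = {
    ("staging.orders", "warehouse.sales"),
    ("staging.customers", "warehouse.sales"),
}

# Hypothetical edges observed in last night's run, for example after a
# run-time parameter pointed the job at a different source table.
OBSERVED = {
    ("staging.orders", "warehouse.sales"),
    ("staging.orders_backup", "warehouse.sales"),
}


def compare(design, observed):
    """Report lineage edges that differ between design time and run time."""
    return {
        "designed_but_not_run": sorted(design - observed),
        "run_but_not_designed": sorted(observed - design),
    }


if __name__ == "__main__":
    for label, edges in compare(DESIGN, OBSERVED).items():
        print(label, edges)
```

An empty result on both sides would suggest the run matched the design; anything else is a prompt to look at parameter values or recent changes.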

Why all these new definitions and branching disciplines?  During these ten years, the domain of lineage has not stopped growing.   We are coding fast and furiously with new tools and new environments (without doing lineage “up front” where it would be less expensive and simpler to implement), and we are also continually realizing new use cases and solutions where data lineage can provide value and insight.  Lineage is not “just” for impact analysis, and it is no longer “just” for improved decision making and data quality.  Its value for regulatory compliance, actionable data management, performance analysis, data protection, and more is just starting to be realized.

Besides new “things” to scan and parse, what is next?   Expect to see more progress with “open metadata” and standardization.   Apache Atlas and now ODPi Egeria  (https://egeria.odpi.org/) are leading to multi-vendor development of a common model for sharing — not only general metadata, but also lineage information.  This offers the promise of untangling complex efforts to ingest, reconcile, and normalize lineage details from diverse and otherwise incongruent repositories.

The next challenge will be learning how to better exploit our increasing insight into lineage.  What are we doing with the insight?  How are we taking advantage of what lineage delivers?  What “should” we be doing with it?

I am looking forward to the next steps in this journey!


*** update ***

Hi everyone.  Several weeks ago I left IBM for a new opportunity with MANTA Software ( www.getmanta.com ).  I am looking forward to continuing to drive innovations in data lineage and common understanding of data to meet all the challenges above and more!  Thank you to everyone for your support and encouragement!    -ernie


3 Responses to “What is Data Lineage? (a reprise)”

  1. Shanks Says:

    Hey Ernie,

    Great post! I am intrigued as to how Lineage can help from a performance analysis perspective.

    • dsrealtime Says:

      Great question! Thanks.

      The best use cases I’ve discussed with customers about performance analysis and lineage have usually been around “what happened?” (that is causing something to run more slowly) or “why is this code path faster than this one?” or “what can we do?” to improve performance. The first two are assisted by lineage comparisons — “what has changed in this SQL vs last month?”. The addition of a particular join, or change in a WHERE clause could be enough to dramatically impact behavior. Or consider two different ETL programs. Why is one much faster than another? Lineage can help uncover differences in transformation logic that are otherwise difficult to see. More proactively, lineage reporting can help suggest columns that should have indexes on them, by quickly illustrating JOIN and WHERE clause dependencies. Doing this via lineage reporting can help more quickly narrow down “which” procedures to look at first, or flag expressions in code that is not as well known to the team.

      • Shankara Narayanan S Says:

        I am a few months late to reply to this, very sorry about that.

        Coincidentally, around the time I read this is when I moved to a project where Data Lineage was readily available (through IGC) and I totally see what you mean. The difference is night and day.

        My previous experience was in a poorly managed project where the person who had been there for 10 years was more or less the Lineage source of truth. Especially for transformation rules, if something did not make sense, they were the bottleneck for us in developing ETL code. Sadly, the information we got from a human wasn’t always right either. I was talking to an old colleague from there and I just remembered your note here. Sending this to them.

        thanks so much.
