Defining Lineage Flows (Part 1)

This entry is one of many in a series that describes the InfoSphere Open IGC API, which allows you to define your own objects for information governance using InfoSphere Information Server and the Information Governance Catalog.

Previous post in this series:
Uploading New Assets!

Original post in this series:
Open IGC is here!

If you have been following along, we finished designing our first simple “bundle” and uploaded some new instances of those objects. Our use case for that first bundle was a not-so-standard source and target: a set of “message queues”. The objects we created for that use case “might” contain their own lineage, but it is more likely that they will simply participate as end-points in other lineage definitions. Once created, they can be referenced by Extension Mappings or by the new Open IGC.

Now let’s get into the use case for a true “data mover” by defining a set of objects that actually move and transform data: programs, scripts, stored procedures, independent ETL tools, Java, etc. Open IGC provides constructs that allow us to define, and therefore graphically illustrate, the processes and sub-processes that we use for transformation. Further, it allows us to describe internal and external flows, to establish a “zoom” point where we can “dive in” for more detail (“Expand” for those of you who use DataStage lineage today), and to specify both Design and Operational level lineage. There are other goodies too, so let’s get right to it.

For this exercise I have defined a new bundle. I call it “PixelStage”. This is a fictitious ETL tool from the future that moves and transforms light beams. 😉 I took the liberty of using this example for the object types and their properties to force me to think “outside the box” and frankly, to keep things light (no pun intended) and interesting. Ultimately I morphed back to a fairly normal “column lineage” and data oriented paradigm, but this approach helped with the early learning curve over six months ago. You already know how to construct a bundle, so I will cover the highlights of what makes this bundle just a “little bit” different from our Messaging one.

First we define our bundle ID, and then lay out a hierarchy of object types. The “Processes” we are defining belong to a “Workspace”, and then inside each “Process” we will be defining a set of “Tasks”. By analogy to DataStage, this is like Project, Job, and Stage. Many programming disciplines have similar structures (albeit deeper or shallower) that you can describe in this fashion. Beneath “Tasks” will be “Columns”, the lowest unit of data flow for our lineage definition. Here is the screen shot of our new family of objects in this bundle:



Looking more closely at this bundle, here are some very interesting and important properties:



The expandableInLineage property defines the “summary” level that you want to appear by default in your lineage reports. It is designed primarily to support “data mover” types of bundles that additionally have their own “internal” lineage at a more granular level. Only one object in each independent hierarchical path can have this property. Here we are defining expandableInLineage on each “Process”. This means that the user, upon seeing lineage initially displayed at the “Process” level, can drill deeper by clicking on the “Expand” link and see the lineage that is “inside” that Process, at its underlying “Task” level:



While this diagram looks a little bit like an expanded DataStage Job, you can quickly see that some of the icons are unique (I stole the others from DataStage for this example because I didn’t have time to play in MS Paint!). Each icon inside the Process is identifying a different “Task” in this bundle, and each with its own internal lineage showing flows from one Task to another. The user can then hover over and further examine a Task, and then request lineage on one of its columns:


So you can see how Open IGC lets us define, and then explore, very fine-grained lineage patterns.
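To make the “summary level” idea concrete, here is a minimal Python sketch of the PixelStage hierarchy. It is only an illustration under stated assumptions: the `ObjectType` class, the `TYPES` table, and the helper functions are hypothetical and are not the Open IGC bundle format. Only the type names and the meaning of expandableInLineage come from the discussion above.

```python
# Hypothetical in-memory model of the PixelStage object types. This is NOT the
# Open IGC bundle descriptor format; only the type names and the idea behind
# expandableInLineage come from the post.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectType:
    name: str
    parent: Optional[str] = None         # containing type, None for the root
    expandable_in_lineage: bool = False  # mirrors expandableInLineage


TYPES = {
    t.name: t
    for t in (
        ObjectType("Workspace"),
        ObjectType("Process", parent="Workspace", expandable_in_lineage=True),
        ObjectType("Task", parent="Process"),
        ObjectType("Column", parent="Task"),
    )
}


def path_to_root(name):
    """Return the type names from `name` up to the root of its hierarchy."""
    path = []
    current = name
    while current is not None:
        path.append(current)
        current = TYPES[current].parent
    return path


def summary_level(name):
    """Return the expandable type on the path of `name`, or None if there is none."""
    expandable = [n for n in path_to_root(name) if TYPES[n].expandable_in_lineage]
    if len(expandable) > 1:
        # The post allows only one expandable object per hierarchical path.
        raise ValueError(f"more than one expandable type on path: {expandable}")
    return expandable[0] if expandable else None


print(summary_level("Column"))     # Process
print(summary_level("Workspace"))  # None
```

In this toy model, lineage for a Column or a Task is first summarized at the “Process” level, and “Expand” would then reveal the Tasks inside it.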

Another interesting property, especially when defining lineage for data movement tools that have their own graphical development paradigm, is canHaveImage="true". This is a nice feature that allows an IGC metadata author to edit the object and include a static screen shot for better identification and governance purposes.

The subprocesses of any transformation tool or process that you describe will often have different purposes and apply different functions. In our use case they are all still called “Tasks”, as they each belong to an overall “Process”, but each has its own unique properties. Open IGC allows us to reflect this relationship in the bundle, simplifying our definitions by supporting the inheritance of common properties. Here we see the overall definition of a class called “Task”, with some Header properties that will be common to all Tasks I define:


As I define additional task types and their custom properties, I refer back to the overall “Task” definition using the “superClass” attribute:


Class “Converter” above inherits the Header properties from “Task” and further defines its own (inWaveLength and outWaveLength). We see this again in the Reader subclass, which has properties to keep track of security credentials:
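As a rough illustration of this inheritance, written in Python rather than in the bundle's own syntax, the sketch below resolves the full property list of a class by walking its “superClass” chain. The `CLASSES` table and the `effective_properties` helper are hypothetical, and so are all property names except inWaveLength and outWaveLength.

```python
# Hypothetical stand-in for superClass resolution. Property names other than
# inWaveLength and outWaveLength are invented purely for illustration.
CLASSES = {
    "Task": {"superClass": None, "properties": ["description", "author"]},
    "Converter": {
        "superClass": "Task",
        "properties": ["inWaveLength", "outWaveLength"],
    },
    "Reader": {
        "superClass": "Task",
        "properties": ["userName", "credentialAlias"],
    },
}


def effective_properties(class_name):
    """Collect properties from the top-most superclass down to class_name."""
    chain = []
    current = class_name
    while current is not None:
        chain.append(current)
        current = CLASSES[current]["superClass"]

    props = []
    for name in reversed(chain):  # inherited properties come first
        props.extend(CLASSES[name]["properties"])
    return props


print(effective_properties("Converter"))
```

Each subclass only lists what is unique to it, and the shared Header properties are defined once on “Task”.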


While we are here, it is worth noting that a bundle often contains objects for which you do not want ANY lineage definition: objects that you still want to govern and provide icons for, but for which the user should not be able to ask for “Data Lineage”. In this example, I want to illustrate “Variables” used by a Process. We may want to represent any number of them and have them appear in lists, with their own icons, and be available for independent reporting, but not be something that directly participates in a “flow”. Note the attribute called dataAccessRole="None". This indicates that the object cannot be directly defined for lineage, and the icon that a user clicks to request lineage will not appear for this object in a hover window or on its detail page.


The variables still appear in a Process detail page, but don’t illustrate lineage themselves:


Note that assets with dataAccessRole="None" will not have “Reads from…” and “Writes to…” properties available in IGC queries. High-level “container” type assets (e.g., “Workspace”, “Project”) are great candidates for this attribute.
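The effect of dataAccessRole="None" can be sketched with a small Python model. Everything here is hypothetical (the `ASSETS` list and both helper functions), and it rests on one assumption worth stating loudly: other values of dataAccessRole are not covered in this post, so the sketch simply treats any asset without the "None" value as one that can participate in lineage.

```python
# Hypothetical model: only the dataAccessRole="None" value comes from the post.
# ASSUMPTION: any asset without that value is treated as lineage-capable.
ASSETS = [
    {"type": "Workspace", "dataAccessRole": "None"},
    {"type": "Process"},
    {"type": "Variable", "dataAccessRole": "None"},
    {"type": "Column"},
]


def offers_lineage(asset):
    """True if the lineage icon would be shown for this asset in this model."""
    return asset.get("dataAccessRole") != "None"


def available_query_properties(asset):
    """List the query properties this toy model exposes for an asset."""
    props = ["name"]
    if offers_lineage(asset):
        props += ["Reads from", "Writes to"]
    return props


for asset in ASSETS:
    print(asset["type"], offers_lineage(asset), available_query_properties(asset))
```

In this model the Variable and Workspace assets still exist and can be listed, but they expose neither a lineage icon nor the “Reads from…” and “Writes to…” properties.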

Whew. This post is getting long. Next time we will see how we get all of these objects connected to one another and to other assets in the enterprise.

Next post in this series:
Defining Lineage Flows (Part 2)


