This entry is one of many in a series that describes the InfoSphere Open IGC API, which allows you to define your own objects for information governance using InfoSphere Information Server and the Information Governance Catalog.
Previous post in this series:
Defining Lineage Flows (Part 1)
Original post in this series:
Open IGC is here!
Now it is time to start connecting your assets, processes and objects together to complete your illustration of lineage. In our previous discussion we reviewed a bundle that describes a data movement and transformation process, complete with inner functionality and flows that are consistent with common ETL and programming patterns. Of course, once you have defined this new bundle, you need to upload instances for it (instances of the objects that represent your actual programs, processes, and sub-processes). We described this effort in an earlier post [https://dsrealtime.wordpress.com/2015/08/20/open-igc-uploading-new-assets/]. Once our instances are loaded, we need to describe for Open IGC exactly how those inner flows are tied to each other, and how they link to the other enterprise assets that define our sources and targets. Ultimately, the chaining of sources, targets and processes, along with all the other lineage definitions already captured or known to the Information Governance Catalog (DataStage, QualityStage, Extension Mappings, lineage via SQL views, business intelligence tools, FastTrack, etc.), will give us a complete end-to-end view of the lineage for the entire data integration lifecycle.
Lineage via Open IGC is defined by the ingestion of a “flow” document. Like the import of new assets, this is an XML document, submitted in a single REST call to Open IGC. This document first lists the inventory of assets that will be used in lineage definitions, and then defines the exact source and target specifications (what is connected to what) that represent the flow of data. Let’s first look at the list of assets. Here we see a snippet of the flow doc that provides an inventory of our “bundle” assets. Each asset node identifies one “instance” of an asset that will be used in a lineage flow, or else identifies the parent of an asset within its hierarchy so that it can be properly located by Open IGC.
The asset ID="w1" (red box) above is an arbitrary value, hand-coded here but usually generated programmatically. As with the uploading of new asset instances, this value is meaningful only within this XML document, for this one invocation of a REST call to Open IGC. It is not a persistent value connected to this resource. The purple boxes identify the critical parts of the hierarchy leading to the $PixelStage-Column that will be formally used as a source or target.
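As a rough illustration of the idea, here is a minimal Python sketch that generates document-local IDs and emits two asset nodes, a column and its parent. Only the class `$PixelStage-Column` and the ID style (`w1`) come from the example above. The element and attribute names (`asset`, `class`, `repr`, `ID`, `attribute`, `reference`, `assetIDs`), the parent class `$PixelStage-Stage`, the reference name `$Stage`, and the asset names are assumptions made for illustration; check them against your own bundle definition and the igc-rest-explorer before relying on them.

```python
import itertools
import xml.etree.ElementTree as ET


def id_generator(prefix):
    """Yield document-local IDs such as w1, w2, ... (valid only inside one flow doc)."""
    for n in itertools.count(1):
        yield f"{prefix}{n}"


def add_asset(assets, asset_id, asset_class, name, parent=None):
    """Append one <asset> node carrying only its identity (name) and parent link.

    `parent` is an optional (reference_name, parent_asset_id) pair.
    All element and attribute names are assumptions, not a confirmed schema.
    """
    node = ET.SubElement(
        assets, "asset", {"class": asset_class, "repr": name, "ID": asset_id}
    )
    ET.SubElement(node, "attribute", {"name": "name", "value": name})
    if parent is not None:
        ref_name, parent_id = parent
        ET.SubElement(node, "reference", {"name": ref_name, "assetIDs": parent_id})
    return node


assets = ET.Element("assets")
ids = id_generator("w")

# Hypothetical parent of the column; the real class name depends on your bundle.
stage_id = next(ids)
add_asset(assets, stage_id, "$PixelStage-Stage", "lookupStage")

# The column that will later act as a source or target in a flow.
column_id = next(ids)
add_asset(assets, column_id, "$PixelStage-Column", "customerId",
          parent=("$Stage", stage_id))

print(ET.tostring(assets, encoding="unicode"))
```

Because the IDs live only inside one document, a simple counter like this is enough; nothing needs to be stored between uploads.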
Further down in our asset inventory we see the identification for the Database Columns of a Database Table. Like our newly loaded bundle assets, Database Columns belong to a hierarchy, each level of which is properly identified. Notice that I don’t need to provide any detailed properties here or in the example above, just the identity information (name in this case) and the containment relation for its parent.
Once again, asset ID="db1" is arbitrary and unique only within this XML document. The purple boxes identify the hierarchy that leads to Database Column “mycol1”. This should be familiar to you if you have reviewed the hierarchy of any Implementation Model with the Information Governance Catalog browser interface. We are simply identifying each part of the “tree”.
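The “identify each part of the tree” step can be sketched as a small helper that walks a hierarchy from the top down and emits one asset node per level, each pointing at its parent. This is only a sketch: the column name `mycol1` and the `db` ID prefix come from the example above, while the class names, reference names, and the other asset names in `HIERARCHY` are placeholders that I am assuming for illustration. Confirm the real ones in the “Types” section of the igc-rest-explorer.

```python
import xml.etree.ElementTree as ET

# (class, name, reference-to-parent), listed from the top of the tree down.
# Class and reference names are assumed placeholders, not a confirmed schema.
HIERARCHY = [
    ("host", "myhost", None),
    ("database", "mydb", "host"),
    ("database_schema", "myschema", "database"),
    ("database_table", "mytable", "database_schema"),
    ("database_column", "mycol1", "database_table_or_view"),
]


def build_chain(assets, levels, prefix):
    """Emit one <asset> per level, each referencing the level above it.

    Returns the document-local ID of the lowest asset in the chain.
    """
    parent_id = None
    for index, (asset_class, name, parent_ref) in enumerate(levels, start=1):
        asset_id = f"{prefix}{index}"
        node = ET.SubElement(
            assets, "asset", {"class": asset_class, "repr": name, "ID": asset_id}
        )
        # Identity only: a name, plus the containment link to the parent.
        ET.SubElement(node, "attribute", {"name": "name", "value": name})
        if parent_id is not None:
            ET.SubElement(node, "reference", {"name": parent_ref, "assetIDs": parent_id})
        parent_id = asset_id
    return parent_id


assets = ET.Element("assets")
column_id = build_chain(assets, HIERARCHY, prefix="db")
print(column_id)
print(ET.tostring(assets, encoding="unicode"))
```

The ID returned for the column is the one you would later use as a source or target; the IDs of the upper levels exist only so that Open IGC can locate the column.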
Similarly, here is the identification for the Data File Fields of a Data File. I don’t have to go down to this level, but it is a best practice to define lineage at the lowest possible point in the hierarchy, which is generally columns and fields. At the very least, aim to define lineage at the table level, because the lineage results will be clearer for your end users.
This identification of objects needs to be done for anything that you want to include in your lineage path. Data File and Database Table assets are the most common, but any object that is available for lineage in IGC is a potential candidate: Business Intelligence assets, members of an Extended Data Source collection, or parts of other bundles, such as the Messaging objects we reviewed in an earlier post of this series. How do you figure out that hierarchy, and learn the object class names? Well, admittedly, that can be tricky, although once you deal with the most common ones for a while, you will become familiar with the names and their relationships. It is important that you become familiar with all the tooling that is available at the igc-rest-explorer page that we have reviewed in earlier posts. The “Types” section and the “Assets” section are invaluable for reviewing the class names for primary objects and their properties, and Open IGC will be sure to remind you with useful errors about not finding an asset if you spell a class incorrectly or guess wrong on the hierarchy.
After we have identified our “inventory of assets” we are ready to connect them. Here we bring our attention to the “flowUnit” nodes of the XML document. Each “flowUnit” is associated with an asset (usually a higher-level asset, such as a whole Job or Process) and holds a collection of individual “flows”, each of which is the detailed unit for a point-to-point source/target specification. Let’s look at a representative sample and identify some of its meaningful parts:
The first important attribute in the flowUnit XML element is assetID="p1". This is the main asset that is associated with this flow unit. It refers to the in-document assetID that is associated with each node up above in our asset inventory (in the initial screenshot above in this post, assetID="p3" describes a Process called lookupCustomer and would be the typical asset for flowUnit details). The value “p1” identifies a whole process object in our bundle hierarchy: an entire “Job”. It might also be a single “instance” of an object that references a formal “execution” or “run” of this Job, if such an object is defined in your bundle.
In this scenario, the next interesting attribute, flowType="DESIGN", provides a “descriptor” for the kind of lineage I am defining. This value will appear for users when they “hover” with the mouse over a particular line/arrow in a lineage diagram. “DESIGN” represents the “intended” lineage for this process, and might be a way to show the process’s own default values as coded by a developer. “DESIGN” might not be needed for your use case. Many times you might only need “OPERATIONAL” for the flowType, when the lineage you are defining reflects an actual run-time history of the process and the data that flowed through it.
Now look carefully at the “flow” element above. It is very simple, with two critical XML attributes: one for source IDs and one for target IDs. These point to other assetIDs from the asset inventory you defined above. This is the ultimate key to defining your lineage. Lay out your point-to-point lineage connections here. IGC will aggregate and summarize your low-level lineage specifications to display a larger lineage rendering.
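Since every flow depends on IDs declared in the inventory, it can help to check those references before uploading. The sketch below parses a tiny flow document and reports any ID used by a flowUnit or flow that has no matching asset. The `flowUnit`, `flow`, `assetID`, and `flowType` names come from the discussion above; the `doc`, `assets`, `flowUnits`, and `subFlows` wrapper elements, the exact attribute spellings `sourceIDs` and `targetIDs`, the placement of `flowType` on `subFlows`, and the idea that several IDs are separated by spaces are my assumptions. A real document may also declare an XML namespace, which this simplified sample omits.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical flow document (no namespace, assets reduced to IDs).
FLOW_DOC = """
<doc>
  <assets>
    <asset ID="p1"/>
    <asset ID="db1"/>
    <asset ID="w1"/>
  </assets>
  <flowUnits>
    <flowUnit assetID="p1">
      <subFlows flowType="DESIGN">
        <flow sourceIDs="db1" targetIDs="w1"/>
      </subFlows>
    </flowUnit>
  </flowUnits>
</doc>
"""


def dangling_ids(doc_xml):
    """Return a sorted list of IDs that flows reference but the inventory lacks."""
    root = ET.fromstring(doc_xml)
    known = {asset.get("ID") for asset in root.iter("asset")}
    missing = set()
    for unit in root.iter("flowUnit"):
        # The flowUnit itself must point at an asset in the inventory.
        if unit.get("assetID") not in known:
            missing.add(unit.get("assetID"))
        for flow in unit.iter("flow"):
            for attr in ("sourceIDs", "targetIDs"):
                # Assumption: multiple IDs are whitespace-separated.
                for ref in flow.get(attr, "").split():
                    if ref not in known:
                        missing.add(ref)
    return sorted(missing)


print(dangling_ids(FLOW_DOC))
broken = FLOW_DOC.replace('targetIDs="w1"', 'targetIDs="w9"')
print(dangling_ids(broken))
```

Open IGC reports assets it cannot find on its own, so a local check like this is only a convenience for catching typos in generated documents before the REST call.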
Once the lineage specifications have been outlined, the XML is uploaded via Open IGC using the POST that is available in the Flows section of the igc-rest-explorer.
As with other call samples on the igc-rest-explorer “learn and test” page, there is a property where you can paste your XML payload, followed by an example of the formal URL and the expected response that you will use within the Open IGC interface you are developing.
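Outside the explorer page, the same upload can be scripted. The following Python sketch only builds the POST request, so it can be inspected without a server. Everything specific in it is an assumption on my part: the host name, the path `/ibm/iis/igc-rest/v1/flows/upload`, the use of HTTP basic authentication, the `application/xml` content type, and the credentials. Copy the real URL from the Flows section of your igc-rest-explorer rather than trusting these values.

```python
import base64
import urllib.request


def build_upload_request(base_url, flows_path, flow_xml, user, password):
    """Build (but do not send) a POST request carrying the flow document."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=base_url.rstrip("/") + flows_path,
        data=flow_xml.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/xml",    # assumed content type
            "Authorization": f"Basic {token}",    # assumed authentication scheme
        },
    )


request = build_upload_request(
    base_url="https://igc.example.com:9443",          # hypothetical server
    flows_path="/ibm/iis/igc-rest/v1/flows/upload",   # assumed path; verify in the explorer
    flow_xml="<doc/>",                                # placeholder payload
    user="isadmin",                                   # placeholder credentials
    password="secret",
)
print(request.get_method(), request.full_url)

# To actually send it (needs a reachable server and a trusted TLS certificate):
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```

Keeping the request construction separate from the send makes it easy to log or review exactly what will be submitted before touching the catalog.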
If all goes well, your flow XML will be uploaded successfully and you can view lineage for your defined process! Lineage can be invoked in many ways. When initially testing lineage for new process assets, I try to start lineage from the overall “process” asset itself. This will generally show me all of the lineage connections that were defined in that XML submission. Then you can move on and validate other lineage connections, starting on various assets that are significant for your use case.
…and drill in deeper with “Expand” if you have enabled that capability in your bundle!
…and as noted earlier, you can optionally hover over an individual lineage “arrow” and see the flowType for that particular data flow connection.
With this post, you now have reviewed all the basic ingredients you need to (1) design and register a new “bundle” of custom assets, (2) load various assets that you want to govern and make available for lineage, and (3) define and render the lineage that illustrates the flow of data through your systems.
In the next post we will start looking at advanced topics for fine tuning your lineage displays, updating bundles, etc.
Next post in this series:
Open IGC Advanced Topics: Virtual Assets