Sample Bundles

This entry is one of many in a series that describes the InfoSphere Open IGC API, which allows you to define your own objects for information governance using InfoSphere Information Server and the Information Governance Catalog.

Previous post in this series:
Updating Your Bundle

Original post in this series:
Open IGC is here!

Here are some sample bundles for you! These bundles correspond to the use cases that I have been describing within this blog series (see “Original post” above). Each .zip file contains a directory structure that is formatted as I described in one of my early posts on bundle design (Open IGC: Defining a new bundle!). These bundles are for demonstration and learning purposes only. There are no warranties or certified methodologies implied.

Each bundle is complete with the asset_type_descriptor, along with several instance publishing upload files and one or more flow model uploads (if applicable to the use case). I have tried to include examples of various techniques, some of which I have already reviewed in these posts, or intend to in the near future. The values for various string properties are fictitious, and in some cases, just repeated and copied in the interest of more quickly building the example. This is especially true with the asset IDs (the ID= attribute in the publishing and flow upload xml documents), whose values are fairly random. These xml documents were crafted by hand, which is a good way to start testing, but ultimately most of you will probably generate these unique identifiers programmatically. The prior posts in this series are enough to help you take these examples, register their bundles and upload their assets and lineage specifications. Then you can play with the instances within IGC, add new ones, update property values via the user interface or with new xml documents, and get further inspired to build your own!

Let me know if you have any problems accessing these zip files, or if you have any further questions about their use. Also let me know if you would like to share your own creative bundles!


Note: This site doesn’t allow me to upload .zip files, so the files at these URLs have been renamed with “.ppt” as an additional suffix. Just rename them after download. They are normal .zip files.
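If you download several bundles, the rename can be scripted. Here is a minimal sketch, assuming the downloaded files end in “.zip.ppt” (the function name and the folder you pass in are my own placeholders):

```python
from pathlib import Path

def strip_ppt_suffix(folder):
    """Rename every '*.zip.ppt' file in folder back to '*.zip'."""
    renamed = []
    for path in sorted(Path(folder).glob("*.zip.ppt")):
        target = path.with_suffix("")  # drops only the trailing '.ppt'
        path.rename(target)
        renamed.append(target.name)
    return renamed
```

Call it with the folder that holds your downloads, for example strip_ppt_suffix("Downloads"). Files without the extra suffix are left alone.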

Messaging Use Case

Abstract “Access Control” Use Case

Transformation Tool Use Case

Updating Your Bundle

This entry is one of many in a series that describes the InfoSphere Open IGC API, which allows you to define your own objects for information governance using InfoSphere Information Server and the Information Governance Catalog.

Previous post in this series:
Open IGC Advanced Topics: Virtual Assets

Original post in this series:
Open IGC is here!

So….by now you have a custom bundle, loaded with instances of new objects for governance. You can see them in the Query tool of IGC, and view their icons when browsing all information assets. Hopefully you have tried to assign them to Business Terms and give them Stewards, and maybe even given them additional meaning by including them in a lineage flow.

Equally important, I hope that you have shown them to your colleagues and other members of your governance team, and also many of the general business and technical users in your enterprise! How did they like it? Have you illustrated a concept critical to your business that they can follow? Do they understand this concept better than they did before? Did they have any questions? It is very important that you expose your new bundle and its purpose to your entire user community. Their feedback will be critical as you fine tune the solution and make it a regular part of your governance activities.

If you have done all of these things, then it is likely that you need to make adjustments and changes. Maybe the labels for your new objects aren’t descriptive enough. Maybe you made a typo. Perhaps you need some alternative structures, or want to tweak the behavior of an object when it is used within a lineage report. How do you apply changes to a bundle? You could just delete the whole thing and start over, registering the bundle again, but that might not be necessary. What if you have already loaded a lot of assets, and lineage flows — it would be frustrating to have to run those calls again or manually re-load all of that metadata. There are some changes you can safely make to a bundle without receiving any errors:

— Add a new class
— Add a new property to a class
— Change label names
— Change default locale label
— Change label properties files
— Change dataAccessRole, expandableInLineage
— Add or change icons
— Add another literal to an existing enumeration (for an object declared as having an enumerated list).

What you cannot change (requires deletion and re-registering of your bundle):

— rename a class or attribute
— change a datatype
— remove a class or attribute
— change containment (change parentage)
— change inheritance (superclass)
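If you script your bundle deployments, the two lists above can be captured as a simple lookup that decides between an in-place update and a full delete and re-register. This is only a sketch: the change labels are names I invented for illustration, not Open IGC identifiers.

```python
# Change kinds taken from the two lists above. The label strings are
# invented for this sketch and are not Open IGC identifiers.
SAFE_CHANGES = {
    "add_class", "add_property", "change_label",
    "change_default_locale_label", "change_label_properties_file",
    "change_dataAccessRole", "change_expandableInLineage",
    "add_or_change_icon", "add_enum_literal",
}
BREAKING_CHANGES = {
    "rename_class", "rename_attribute", "change_datatype",
    "remove_class", "remove_attribute",
    "change_containment", "change_inheritance",
}

def plan_bundle_change(changes):
    """Return 'update' if every change is safe, else 're-register'."""
    requested = set(changes)
    unknown = requested - SAFE_CHANGES - BREAKING_CHANGES
    if unknown:
        raise ValueError(f"unclassified changes: {sorted(unknown)}")
    return "re-register" if requested & BREAKING_CHANGES else "update"
```

A single breaking change in the batch is enough to force the re-register path, which mirrors the rule described here.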

Make and then save your changes to your original asset_type_descriptor.xml, and then zip up the asset_type_descriptor.xml, your icon subdirectory and the language subdirectory. Apply the updates using the PUT call that is available at the igc-rest-explorer page.



If your changes are in the “allowed list” above, and there aren’t any other errors, your update will be applied successfully and you can immediately see the impact on your existing object instances. If your changes are not in the allowed list, you will need to entirely delete your bundle, apply the changed bundle, and then re-load the instances.
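The zip step can be scripted before you paste or send the result to the PUT call. The sketch below only builds the archive. It assumes the bundle folder holds asset_type_descriptor.xml plus the icon and language subdirectories, and it zips whatever is in that folder, so write the zip file somewhere outside it.

```python
import zipfile
from pathlib import Path

def package_bundle(bundle_dir, zip_path):
    """Zip the descriptor plus the icon and label subdirectories."""
    bundle_dir = Path(bundle_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(bundle_dir.rglob("*")):
            if path.is_file():
                # Store each file relative to the bundle root so the
                # descriptor sits at the top level of the archive.
                archive.write(path, path.relative_to(bundle_dir).as_posix())
    return zip_path
```

The resulting file is what you would supply to the bundle update call shown on the igc-rest-explorer page.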

Happy “bundling!”


DataStage on YARN! …running in Hadoop!

Hi all…

Just a quick note. Yesterday we announced Information Server 11.5. It has some new features for governance, such as support for XML and also for detailed data classifications…..and it also has the ability for DataStage Jobs to run in Hadoop, controlled by YARN!

One of my colleagues with deep experience with Hadoop has written a very nice post on this exciting new capability…

Be the first to start using this feature to take additional advantage of your enterprise’s investment in Hadoop!


Open IGC Advanced Topics: Virtual Assets

This entry is one of many in a series that describes the InfoSphere Open IGC API, which allows you to define your own objects for information governance using InfoSphere Information Server and the Information Governance Catalog.

Previous post in this series:
Defining Lineage Flows (Part 2)

Original post in this series:
Open IGC is here!

In the previous post, we reviewed how you define a formal lineage “flow” — first by defining the “inventory” of assets that will be established as sources and targets in your exact flow specification, and then the flowUnit xml node that explicitly states what will be the “source” and “target” for each point to point connection.

Assets, as we reviewed, might be Data Files, Database Columns, objects from other Bundles, or anything else that is able to participate in a lineage report within the Information Governance Catalog. We looked at how you define the hierarchy, identifying (for example) the Host, Database, Schema, Table and specific column name for a Database Column that has been formally imported into the repository (at an earlier time, via Metadata Asset Manager or other mechanism).

(as a reminder, here is the hierarchy that identifies a Database Column to be used in a flowUnit)


(and here is a flow unit that includes that column)


But what happens if you haven’t yet imported that Database Table and its columns? What if this is a temporary table, with a dynamic, timestamp-generated name, and you don’t care about ever formally importing it into the repository for governance purposes? What if you simply made a typo in your code, picked up the wrong name from somewhere in your program, or were given misinformation by the tool whose lineage you are recording? Open IGC supports the idea of a “Virtual Asset”. This allows you to define the objects that will be seen in a lineage report as a source or a target, but without any concern about whether they actually exist in the repository. These assets appear in the lineage diagram, but will be slightly greyed out, to indicate their status as a “Virtual Asset”.

In the first screen shot above, Database Column mycol1 doesn’t really exist. I have never imported it. It is used for illustration purposes, but could also easily be a column in a temporary table that only exists for a given run of the application. Note that it still appears in the lineage report, but with a slightly greyed out appearance. All the details from the definition above will appear in the report…the “red” box in the screen shot below (the top source icon on the left) identifies the “Virtual Asset”:


This Virtual Asset is viewed here in lineage, and you can even click on it directly and go to its detail page. However, it is considered “non governable”. This means that you can’t assign Terms or Stewards to it, or use it in a Collection, or anything else related to governance. It is a tool to assist you in enabling lineage, providing additional insight where needed regarding the flow of your data. If it is truly an important asset, then it makes sense to formally import it and give it a full definition in the flow xml.

If an asset is found (using its name based object identity) in the repository, then it appears in a clear font and fully colored icon, without being greyed out. The green box (the bottom icon on the left in the lineage picture) identifies a “real” asset. This is an asset that truly exists in the repository that was imported earlier by formal means, and is fully governable (searchable, can be assigned Terms and Stewards, etc.).

Virtual Assets can be created for native IGC objects or for assets that you have created with your bundles. They are a powerful mechanism for illustrating lineage quickly and simply, without worrying about whether metadata has been formally imported or defined elsewhere. Later on, if metadata is imported and matches your flow XML, the Virtual Asset will become “real” in each lineage report.
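To make the “real” versus “virtual” distinction concrete, here is a small conceptual model of name-based matching. It is not how IGC is implemented internally. It only illustrates the idea that an asset in your flow xml is “real” when its full identity path (host, database, schema, table, column) already exists in the repository. All names are made up.

```python
def classify_flow_assets(flow_assets, repository_identities):
    """Label each flow asset 'real' if its identity path is already imported."""
    imported = {tuple(path) for path in repository_identities}
    return {
        asset_id: "real" if tuple(path) in imported else "virtual"
        for asset_id, path in flow_assets.items()
    }

# Hypothetical data: only mycol2 has been imported into the repository.
flow_assets = {
    "db1": ["myhost", "mydb", "myschema", "mytable", "mycol1"],
    "db2": ["myhost", "mydb", "myschema", "mytable", "mycol2"],
}
repository = [["myhost", "mydb", "myschema", "mytable", "mycol2"]]
status = classify_flow_assets(flow_assets, repository)
```

If mycol1 is imported later and added to the repository list, the same call would report it as “real”, which is the behavior described above for lineage reports.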

Virtual Assets allow you to illustrate objects in lineage that don’t require governance, but need to be shown so that users fully understand the big picture for your overall data flows. They enable you to more quickly get your lineage solutions up and running for all IGC users.


Defining Lineage Flows (Part 2)

This entry is one of many in a series that describes the InfoSphere Open IGC API, which allows you to define your own objects for information governance using InfoSphere Information Server and the Information Governance Catalog.

Previous post in this series:
Defining Lineage Flows (Part 1)

Original post in this series:
Open IGC is here!

Now it is time to start connecting your assets, processes and objects together to complete your illustration of lineage. In our previous discussion we reviewed a bundle that describes a data movement and transformation process, complete with inner functionality and flows that are consistent with common ETL and programming patterns. Of course, once you have defined this new bundle, you need to upload instances for it (instances of the objects that represent your actual programs, processes, and sub-processes). We described this effort in an earlier post (Open IGC: Uploading New Assets!). Once our instances are loaded, we will want to describe for the Open IGC exactly how those inner flows are tied to each other, and how they link to other enterprise assets that define our sources and targets. Ultimately, the chaining of sources, targets and processes, along with all the other lineage definitions already captured or known to the Information Governance Catalog (DataStage, QualityStage, Extension Mappings, lineage via SQL views, business intelligence tools, FastTrack, etc.) will give us a complete end-to-end view of the lineage for the entire data integration lifecycle.

Lineage via Open IGC is defined by the ingestion of a “flow” document. Like the import of new assets, this is an xml document, defined in a single REST call to Open IGC. This document first lists the inventory of assets that will be used in lineage definitions, and then defines the exact source and target specifications (what is connected to what) that represent the flow of data. Let’s first look at the list of assets. Here we see a snippet of the flow doc that provides an inventory of our “bundle” assets. Each asset node identifies one “instance” of an asset that will be used in a lineage flow, or else identifies the parent of an asset within its hierarchy so that it can be properly located by Open IGC:


The asset ID="w1" (red box) above is an arbitrary value, hand coded here but usually generated programmatically. As with the uploading of new asset instances, this value is pertinent only for this xml document in this invocation of a REST call to Open IGC. It is not a persistent value connected to this resource. The purple boxes identify the critical parts of the hierarchy leading to the $PixelStage-Column that will be formally used as a source or target.

Further down in our asset inventory we see the identification for the Database Columns of a Database Table.  Like our newly loaded bundle assets, Database Columns belong to a hierarchy, each level of which is properly identified.  Notice that I don’t need to provide any detailed properties here and in the example above…just the identity information (name in this case) and the containment relation for its parent.


Once again, asset ID = “db1” is arbitrary and unique only for this xml.  The purple boxes identify the hierarchy that leads to Database Column “mycol1”.   This should be familiar to you when you review the hierarchy of any Implementation Model with the Information Governance Catalog browser interface.   We are simply identifying each part of the “tree”.
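For those who want to generate this part of the flow document programmatically, here is a rough sketch that builds the Database Column hierarchy. The element and attribute names (asset, attribute, reference, class, repr, ID, assetIDs) follow what is described in this series, but the exact class and reference names below are my assumptions. Confirm them in the “Types” section of igc-rest-explorer before relying on them.

```python
import xml.etree.ElementTree as ET

def add_asset(assets, asset_id, cls, name, parent=None):
    """Append one <asset> identified only by its name and its parent reference."""
    asset = ET.SubElement(assets, "asset", {"class": cls, "repr": name, "ID": asset_id})
    ET.SubElement(asset, "attribute", {"name": "name", "value": name})
    if parent is not None:
        ref_name, parent_id = parent
        ET.SubElement(asset, "reference", {"name": ref_name, "assetIDs": parent_id})
    return asset

# Class and reference names here are assumptions made for illustration.
assets = ET.Element("assets")
add_asset(assets, "h1", "host", "myhost")
add_asset(assets, "d1", "database", "mydb", ("host", "h1"))
add_asset(assets, "s1", "database_schema", "myschema", ("database", "d1"))
add_asset(assets, "t1", "database_table", "mytable", ("database_schema", "s1"))
add_asset(assets, "db1", "database_column", "mycol1", ("database_table_or_view", "t1"))
inventory_xml = ET.tostring(assets, encoding="unicode")
```

Each level carries only its name and a pointer to its parent, which matches the point made above: no detailed properties are needed, just identity and containment.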

Similarly, here is the identification for the Data File Fields of a Data File.  I don’t have to go down to this level, but it is a best practice to define lineage at the lowest possible point in the hierarchy, which is generally columns and fields.    At the very least, aim to define lineage at the table level — lineage results will be more clear for your end users.


This identification of objects needs to be done for anything that you want to include in your lineage path. Data Files and Database Table assets are often the most common, but any object that is available for lineage in IGC is a potential candidate: Business Intelligence assets, members of an Extended Data Source collection, or parts of other bundles, such as the Messaging objects we reviewed in an earlier post of this series. How do you figure out that hierarchy, and learn the object class names? Well….admittedly, that can be tricky, although once you deal with the most common ones for a while, you will become familiar with the names and their relationships. It is important that you become familiar with all the tooling that is available at the igc-rest-explorer page that we have reviewed in earlier posts. The “Types” section and the “Assets” section are invaluable for reviewing the class names for primary objects and their properties…and Open IGC will be sure to remind you with useful errors about not finding an asset if you spell a class incorrectly or guess wrong on the hierarchy.

After we have identified our “inventory of assets” we are ready to connect them.  Here we bring our attention to the “flowUnit” nodes of the xml document.  Each “flowUnit” is associated with an asset (usually a higher level asset, such as a whole Job or Process) and has a collection of individual “flows” that are the detailed unit for a point-to-point source/target specification.    Let’s look at a representative sample and identify some of its meaningful parts:


The first important attribute in the flowUnit xml element is assetID="p1". This is the main asset that is associated with this flow unit. It refers to the in-document assetID that is associated with each node up above in our asset inventory (in the initial screen shot above in this post, assetID="p3" describes a Process called lookupCustomer and would be the typical asset for flowUnit details). The value “p1” identifies a whole process object in our bundle hierarchy: an entire “Job”. This also might be a single “instance” of an object that references a formal “execution” or “run” of this Job, if such an object is defined in your bundle. In this scenario, the next interesting attribute, flowType="DESIGN", provides a “descriptor” for the kind of lineage I am defining. This value will appear for the user when they use their mouse and “hover” over a particular line/arrow in a lineage diagram. “DESIGN” represents the “intended” lineage for this process, and perhaps might be a way to show the process’s own default values as coded by a developer. “DESIGN” might not be needed for your use case. Many times you might only need “OPERATIONAL” for the flowType, when the lineage you are defining reflects an actual run-time history of the process and the data that flowed through it.

Now look carefully at the “flow” element above. Very simple. It has two critical xml attributes. One for source IDs…and one for targets. These point to other assetIDs from the asset inventory you defined above. This is the ultimate key to defining your lineage. Lay out your point-to-point lineage connections here. IGC will aggregate and summarize your low level lineage specifications to display a larger lineage rendering.
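Here is a rough sketch of generating a flowUnit from a list of point-to-point connections. The attribute names assetID and flowType come from the discussion above. The sourceIDs and targetIDs attribute names, and the subFlows wrapper element that carries flowType, are my assumptions about the layout, so check them against the documented examples. Each flow here has exactly one source and one target, to avoid guessing at how multiple IDs are delimited.

```python
import xml.etree.ElementTree as ET

def build_flow_unit(process_id, flow_type, connections):
    """Build a flowUnit for one process from (source_id, target_id) pairs."""
    flow_unit = ET.Element("flowUnit", {"assetID": process_id})
    # Assumption: flowType sits on a wrapper element inside the flowUnit.
    sub_flows = ET.SubElement(flow_unit, "subFlows", {"flowType": flow_type})
    for source_id, target_id in connections:
        ET.SubElement(sub_flows, "flow", {"sourceIDs": source_id, "targetIDs": target_id})
    return flow_unit

# In-document IDs such as db1, w1 and p1 refer back to the asset inventory.
unit = build_flow_unit("p1", "DESIGN", [("db1", "w1"), ("w1", "w2")])
flow_xml = ET.tostring(unit, encoding="unicode")
```

Keeping the connections as plain pairs makes it easy to produce them from whatever your own tool knows about its sources and targets.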

Once the lineage specifications have been outlined, the xml is uploaded via Open IGC using the POST that is available in the Flows section of the igc-rest-explorer.


As with other call samples on the igc-rest-explorer “learn and test” page, there is a property where you can paste your xml payload, and then an example of the formal URL and expected response that you will use within the formal Open IGC interface you are developing.

If all goes well, your flow xml will be uploaded successfully and you can view lineage for your defined process! Lineage can be invoked in many ways — when initially testing lineage for new process assets I try to start lineage from the overall “process” asset itself. This will generally show me all of the lineage connections that were defined in that xml submission. Then you can move on and validate other lineage connections, starting on various assets that are significant for your use case.


…and drill in deeper with “Expand” if you have enabled that capability in your bundle!


…and as noted earlier, you can optionally hover over an individual lineage “arrow” and see the flowType for that particular data flow connection.

With this post, you now have reviewed all the basic ingredients you need to (1) design and register a new “bundle” of custom assets, (2) load various assets that you want to govern and make available for lineage, and (3) define and render the lineage that illustrates the flow of data through your systems.

In the next post we will start looking at advanced topics for fine tuning your lineage displays, updating bundles, etc.

Next post in this series:
Open IGC Advanced Topics: Virtual Assets


Defining Lineage Flows (Part 1)

This entry is one of many in a series that describes the InfoSphere Open IGC API, which allows you to define your own objects for information governance using InfoSphere Information Server and the Information Governance Catalog.

Previous post in this series:
Uploading New Assets!

Original post in this series:
Open IGC is here!

If you have been following along, we finished designing our first simple “bundle” and uploaded some new instances of those objects. Our use case for that first bundle was a not-so-standard source and target; a set of “message queues”. The objects we created for that use case “might” contain their own lineage, but it is more likely that they will simply participate as end-points in other lineage definitions. Once created, they can be referenced by Extension Mappings or by the new Open IGC.

Now let’s get into the use case for a true “data mover” by defining a set of objects that actually move and transform data — programs, scripts, stored procedures, independent ETL tools, java, etc. Open IGC provides constructs that allow us to define, and therefore graphically illustrate the processes and sub-processes that we use for transformation. Further, it allows us to describe internal and external flows, to establish a “zoom” point where we can “dive in” for more detail (“Expand” for those of you who use DataStage lineage today), and to specify both Design and Operational level lineage. There are other goodies too, so let’s get right to it.

For this exercise I have defined a new bundle. I call it “PixelStage”. This is a fictitious ETL tool from the future that moves and transforms light beams.😉 I took the liberty of using this example for the object types and their properties to force me to think “outside the box” and frankly, to keep things light (no pun intended) and interesting. Ultimately I morphed back to a fairly normal “column lineage” and data oriented paradigm, but this approach helped with the early learning curve over six months ago. You already know how to construct a bundle, so I will cover the highlights of what makes this bundle just a “little bit” different from our Messaging one.

First we define our bundle ID, and then lay out a hierarchy of object types. The “Processes” we are defining belong to a “Workspace”, and then inside each “Process” we will be defining a set of “Tasks”. By analogy to DataStage, this is like Project, Job, and Stage. Many programming disciplines have similar structures (albeit deeper or shallower) that you can describe in this fashion. Beneath “Tasks” will be “Columns”, the lowest unit of data flow for our lineage definition. Here is the screen shot of our new family of objects in this bundle:



Looking more closely at this bundle, here are some very interesting and important properties:



The first is expandableInLineage. It defines the “summary” level that you want to initially appear by default in your lineage reports. It is designed primarily to support “data mover” types of bundles that additionally have their own “internal” lineage at a more granular level. One object in each independent hierarchical path can have this property. Here we are defining expandableInLineage on each “Process”. This means that the user, upon seeing lineage initially displayed at the “Process” level, can drill deeper by clicking on the “Expand” link and see lineage that is “inside” that Process…at its underlying “Task” level:



While this diagram looks a little bit like an expanded DataStage Job, you can quickly see that some of the icons are unique (I stole the others from DataStage for this example because I didn’t have time to play in MS Paint!). Each icon inside the Process is identifying a different “Task” in this bundle, and each with its own internal lineage showing flows from one Task to another. The user can then hover over and further examine a Task, and then request lineage on one of its columns:


So you can see how Open IGC lets us define, and then explore, very fine-grained lineage patterns.

Another interesting property, especially when defining lineage for data movement tools that have their own graphical development paradigm, is canHaveImage="true". This is a nice feature that allows an IGC metadata author to edit the object and include a static screen shot for better identification and governance purposes.

The subprocesses for any transformation tool or process that you describe will often have different purposes; different functions that they apply. In our use case they are all still called “Tasks”, as they each belong to an overall “Process”, but each has its own unique properties. Open IGC allows us to reflect this relationship in the bundle, simplifying our definitions by supporting the inheritance of common properties. Here we see the overall definition of a class called “Task”, with some Header properties that will be common to all Tasks I define:


As I define additional task types and their custom properties, I refer back to the overall “Task” definition using the “superClass” attribute:


Class “Converter” above inherits Header properties from object “Task”, but further defines its own (inWaveLength and outWaveLength) and we see this again in the Reader subclass that has properties to keep track of security credentials:
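As a way to picture how superClass inheritance resolves, here is a small model in Python. It is not the descriptor format itself. The class names Task, Converter and Reader and the two wavelength properties come from this post; the other property names are placeholders I made up.

```python
# Simplified stand-in for the class definitions in the bundle descriptor.
# 'author', 'lastModified', 'userName' and 'password' are hypothetical names.
CLASSES = {
    "Task": {"superClass": None, "properties": ["author", "lastModified"]},
    "Converter": {"superClass": "Task", "properties": ["inWaveLength", "outWaveLength"]},
    "Reader": {"superClass": "Task", "properties": ["userName", "password"]},
}

def effective_properties(class_name, classes=CLASSES):
    """Walk the superClass chain and list inherited properties first."""
    chain = []
    while class_name is not None:
        chain.append(class_name)
        class_name = classes[class_name]["superClass"]
    props = []
    for name in reversed(chain):
        props.extend(classes[name]["properties"])
    return props
```

Calling effective_properties("Converter") returns the common Task properties followed by the two wavelength properties, which is the effect that superClass gives you in the bundle.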


While we are here it is worth noting that there are often objects in a bundle that you might not want to have ANY definition for lineage. Objects that you still want to govern, and provide icons for, but not allow the user to ask for “Data Lineage”. In this example, I want to illustrate “Variables” used by a Process. We may want to represent any number of them, and have them appear in lists, with their own icons, and available for independent reporting — but not be something that directly participates in a “flow”. Note the attribute called dataAccessRole=”None”. This indicates that the object cannot be directly defined for lineage, and the icon that a user clicks to request lineage will not appear for this object in a hover window or on its detail page.


The variables still appear in a Process detail page, but don’t illustrate lineage themselves:


Note that assets with dataAccessRole="None" will not have “Reads from…” and “Writes to…” properties available in IGC queries. High level “container” type assets (e.g. “Workspace”, “Project”) are great candidates for this attribute.

Whew. This post is getting long. Next time we will see how we get all of these objects connected to one another and to other assets in the enterprise.

Next post in this series:
Defining Lineage Flows (Part 2)


Open IGC: Uploading New Assets!

This entry is one of many in a series that describes the InfoSphere Open IGC API, which allows you to define your own objects for information governance using InfoSphere Information Server and the Information Governance Catalog.

Previous post in this series:
Open IGC: Defining a new bundle!

Original post in this series:
Open IGC is here!

At this point you have your bundle defined, you can see your objects and their icons in the “browse assets” page, and the detailed properties of your new objects are visible within the Query Tool. Congratulations! Now you are ready to start loading new assets, or new “instances” of the objects that you have modeled with your bundle design.

New assets are added using XML and another REST call, this time a special POST for the upload of new assets. The documentation (see URL in the prior blog entry) includes example xml documents and the xsd, but let’s look more closely at one here.

Our Messaging bundle describes a simple hierarchy that has Queue Managers at the highest level, and then Queues. Queue Managers also have “Listeners”. These are the major objects in the bundle. Initially I am just defining new Queue Managers and their Queues. To keep my sanity while initially learning and playing with the API, I am creating a single xml document for each Queue Manager. This is not a requirement, but keeping the documents small, and focused on one higher level object in the hierarchy will help you understand the structure of the xml and speed up your learning curve. Each whole “xml document” or “xml string” is what you will be passing in a single http POST when performing the actual upload.

Here is a list of an initial set of these xml documents.


To stay organized, I keep them in a folder structure, per bundle type, that has a subdirectory for bundle details (see prior blog entry), a subdirectory for publishing new assets, and a subdirectory for publishing “flows” (for lineage…a future post). Ultimately, many of these xml documents will be built “on the fly” in your programs that craft the interface between Information Governance Catalog and whatever you are modeling with your bundle. However, the simplest way to learn the Open IGC is by using static xml documents. Depending on your use case, some of you may only have a few objects to govern, and might always use this file based approach.


Let’s take a closer look at one of the publishing xml documents:


The elements and attributes above are well documented in the examples, so I won’t go into excruciating detail, but want to point out several items.

1. Your custom properties (light green box). Note how their names each begin with a dollar sign. This uniquely identifies them as “yours”. Every object gets name, short_description, and long_description. Think of these as “free” in your bundle. You didn’t need to define them in the bundle — they are just “there”. As such, they don’t require the dollar sign prefix.

2. The value of the repr attribute in the object header, and the string used for the “value” attribute for “name” immediately below it (purple box) must be identical! This is for internal reasons. It is a requirement of the API. You will get a nice error if they are not identical, so I am pointing it out here to save you the trouble.

3. The ID value (red boxes) is a unique identifier for the asset within this xml document. It is just an internal reference that is used throughout this particular xml document (it doesn’t have any overall system significance). It is critical for establishing the hierarchy of your objects and will be even more important when you learn about the “flow” xml for lineage.

4. The “reference” element (blue box) is what helps establish the hierarchy, identifying the parent asset (if applicable). Note the use of “ID”.
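Points 2 through 4 are easy to check before you upload. The sketch below is a pre-flight check under some assumptions: the element and attribute names follow the descriptions above, the sample document omits any xml namespace that the real documents may declare, and the class and property names in the sample are invented.

```python
import xml.etree.ElementTree as ET

def check_publish_assets(xml_text):
    """Return problems found: repr/name mismatches and unknown parent IDs."""
    root = ET.fromstring(xml_text)
    assets = list(root.iter("asset"))
    known_ids = {asset.get("ID") for asset in assets}
    problems = []
    for asset in assets:
        names = [a.get("value") for a in asset.findall("attribute")
                 if a.get("name") == "name"]
        if names != [asset.get("repr")]:
            problems.append(f"{asset.get('ID')}: repr and name differ")
        for ref in asset.findall("reference"):
            if ref.get("assetIDs") not in known_ids:
                problems.append(f"{asset.get('ID')}: unknown parent {ref.get('assetIDs')}")
    return problems

# Hypothetical sample: the Queue's repr does not match its name attribute.
SAMPLE = """
<doc>
  <assets>
    <asset class="$Messaging.QueueManager" repr="QM1" ID="a1">
      <attribute name="name" value="QM1"/>
      <attribute name="$platform" value="Linux"/>
    </asset>
    <asset class="$Messaging.Queue" repr="ORDERS" ID="a2">
      <attribute name="name" value="ORDERS.IN"/>
      <reference name="$QueueManager" assetIDs="a1"/>
    </asset>
  </assets>
</doc>
"""
problems = check_publish_assets(SAMPLE)
```

Running this on the sample flags asset a2, because its repr and name values differ. The API would reject that document anyway, but catching it locally saves a round trip.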

Another very important part of the publishing xml is the “importAction” element at the bottom of your xml document. This is an important property that controls the behavior of the API when managing a complex hierarchy. This can be a difficult concept to understand, but I will do my best to explain it here.


The element importAction has two attributes, partialAssetIDs and completeAssetIDs. These attributes contain a set of comma delimited IDs from up above in the xml document. They describe whether a particular asset, in this xml document, is being uploaded with ALL of its children, or only “some”. If the parent ID is listed in “completeAssetIDs”, then the parent and its collection of child objects is considered complete; any pre-existing child instances “not mentioned” in this new xml document will be blown away. Mentioned child instances, if pre-existing, will have current properties edited (if desired) and retain all governance references (stewardship, term assignments, etc.). If you want to preserve the pre-existing children for a particular parent, place that parent ID in “partialAssetIDs”.
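Because this is the part that tends to surprise people, here is a tiny simulation of the rule as I described it. It models only which child names survive an upload, not the real API. Refusing a parent that is listed in neither attribute is a choice made for this sketch.

```python
def apply_import(existing_children, uploaded_children, parent_id,
                 complete_ids=(), partial_ids=()):
    """Return the child names left under parent_id after an upload."""
    if parent_id in complete_ids:
        # Complete: pre-existing children not mentioned in the upload are removed.
        return sorted(set(uploaded_children))
    if parent_id in partial_ids:
        # Partial: pre-existing children are preserved alongside uploaded ones.
        return sorted(set(existing_children) | set(uploaded_children))
    raise ValueError(f"{parent_id} is listed in neither attribute")

existing = ["Q1", "Q2"]   # queues already in the catalog
uploaded = ["Q2", "Q3"]   # queues mentioned in the new xml document
after_complete = apply_import(existing, uploaded, "a1", complete_ids=["a1"])
after_partial = apply_import(existing, uploaded, "a1", partial_ids=["a1"])
```

With the parent in completeAssetIDs, Q1 disappears because the new document did not mention it. With the parent in partialAssetIDs, all three queues remain.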

Once you have built your xml document, and have checked it for well-formed-ness (at the very least, make sure you can open it in your browser as a well-formed and recognized xml document), you are ready to upload it to IGC. Go to the igc-rest-explorer page for the Open IGC API and find the bundle “POST” invocation for publishing assets:
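If you would rather not open each document in a browser, the well-formedness check is a few lines of Python using the standard library parser. It only confirms that the xml parses. It does not validate against the Open IGC xsd.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as xml, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

An unclosed element, such as the text `<doc><assets></doc>`, makes this return False.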


Open your xml document in a regular editor and copy/paste the entire xml string into the available property (red box in the screen shot above) and then click “Try it out!”

If there are any errors, you will receive them here directly, and if all is “ok”, you will receive a clean 200 response code, and your assets will have been loaded.
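Once you move from the “Try it out!” page to your own program, the same upload becomes an http POST with the xml string as the body. The sketch below prepares such a request with the Python standard library but does not send it. The URL is a placeholder (copy the real one from the igc-rest-explorer page), the application/xml content type is my assumption, and credentials are left out.

```python
import urllib.request

def build_publish_request(url, xml_text):
    """Prepare (but do not send) the POST that carries one xml document."""
    return urllib.request.Request(
        url,
        data=xml_text.encode("utf-8"),
        headers={"Content-Type": "application/xml"},  # assumed content type
        method="POST",
    )

# Placeholder URL: replace it with the one shown on igc-rest-explorer.
request = build_publish_request("https://igc.example.com/placeholder/assets", "<doc/>")
# urllib.request.urlopen(request) would send it once credentials are added.
```

A response code of 200 from the real call corresponds to the clean result described above.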


At this point, you can immediately return to the Information Governance Catalog and view your new assets!


Browse them by returning to the main “Information Assets…Browse All” pull down where you found the icons for your bundle, and then look around….see if your child assets are also loaded, and how they are displayed “within” the parent! Try doing a Query. Edit one of your new assets and make adjustments to one of the properties!

Your assets are now being governed…they can be assigned Terms and Labels, belong to Collections, become the responsibility of a Steward — just about everything that you can do within Information Governance Catalog is now available for your new objects! In the next post we will look at how you can apply your own custom flow definitions for data lineage that includes your new object instances.

Next post in this series:
Defining Lineage Flows (Part 1)


