Defining Lineage Flows (Part 2)

This entry is one of many in a series that describes the InfoSphere Open IGC API, which allows you to define your own objects for information governance using InfoSphere Information Server and the Information Governance Catalog.

Previous post in this series:
Defining Lineage Flows (Part 1)

Original post in this series:
Open IGC is here!

Now it is time to start connecting your assets, processes and objects together to complete your illustration of lineage. In our previous discussion we reviewed a bundle that describes a data movement and transformation process, complete with inner functionality and flows that are consistent with common ETL and programming patterns. Of course, once you have defined this new bundle, you need to upload instances for it (instances of the objects that represent your actual programs, processes, and sub-processes). We described this effort in an earlier post [https://dsrealtime.wordpress.com/2015/08/20/open-igc-uploading-new-assets/]. Once our instances are loaded, we will want to describe for Open IGC exactly how those inner flows are tied to each other, and how they link to other enterprise assets that define our sources and targets. Ultimately, the chaining of sources, targets and processes, along with all the other lineage definitions already captured or known to the Information Governance Catalog (DataStage, QualityStage, Extension Mappings, lineage via SQL views, business intelligence tools, FastTrack, etc.), will give us a complete end-to-end view of the lineage for the entire data integration lifecycle.

Lineage via Open IGC is defined by the ingestion of a “flow” document. Like the import of new assets, this is an xml document, submitted in a single REST call to Open IGC. This document first lists the inventory of assets that will be used in lineage definitions, and then defines the exact source and target specifications (what is connected to what) that represent the flow of data. Let’s first look at the list of assets. Here we see a snippet of the flow doc that provides an inventory of our “bundle” assets. Each asset node identifies one “instance” of an asset that will be used in a lineage flow, or else identifies the parent of an asset within its hierarchy so that it can be properly located by Open IGC:

[Image: assetListForFlows]

The asset ID=”w1” (red box) above is an arbitrary value, hand-coded here but usually generated programmatically. As with the uploading of new asset instances, this value is pertinent only for this xml document in this invocation of a REST call to Open IGC. It is not a persistent value connected to this resource. The purple boxes identify the critical parts of the hierarchy leading to the $PixelStage-Column that will be formally used as a source or target.

Further down in our asset inventory we see the identification for the Database Columns of a Database Table.  Like our newly loaded bundle assets, Database Columns belong to a hierarchy, each level of which is properly identified.  Notice that I don’t need to provide any detailed properties here and in the example above…just the identity information (name in this case) and the containment relation for its parent.

[Image: assetListDatabaseForFlows]

Once again, asset ID = “db1” is arbitrary and unique only for this xml.  The purple boxes identify the hierarchy that leads to Database Column “mycol1”.   This should be familiar to you when you review the hierarchy of any Implementation Model with the Information Governance Catalog browser interface.   We are simply identifying each part of the “tree”.
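Since the screen shots are not reproduced here, the idea of “identifying each part of the tree” can be sketched in a few lines of Python. This is only an illustration, not the documented schema: the element and attribute names (assets, asset, attribute, reference, assetIDs), the IGC class names, the parent reference names, and every instance name except “mycol1” are assumptions on my part. Check them against the xsd and the “Types” section of the igc-rest-explorer before relying on them.

```python
import xml.etree.ElementTree as ET


def build_inventory(path, id_prefix):
    """Return an <assets> element for one top-to-leaf hierarchy.

    path: list of (class_name, name, parent_reference_name) tuples,
    ordered from the top of the tree down to the leaf.
    Element and attribute names are assumptions, not the official schema.
    """
    assets = ET.Element("assets")
    parent_id = None
    for depth, (class_name, name, parent_ref) in enumerate(path, start=1):
        # The ID is arbitrary and only meaningful inside this one document.
        asset_id = f"{id_prefix}{depth}"
        asset = ET.SubElement(
            assets, "asset", {"class": class_name, "repr": name, "ID": asset_id}
        )
        # Only identity information is supplied, no detailed properties.
        ET.SubElement(asset, "attribute", {"name": "name", "value": name})
        if parent_id is not None:
            # Containment relation pointing at the parent's in-document ID.
            ET.SubElement(
                asset, "reference", {"name": parent_ref, "assetIDs": parent_id}
            )
        parent_id = asset_id
    return assets


# Hypothetical hierarchy leading to the Database Column "mycol1".
column_path = [
    ("host", "myhost", None),
    ("database", "mydb", "host"),
    ("database_schema", "myschema", "database"),
    ("database_table", "mytable", "database_schema"),
    ("database_column", "mycol1", "database_table_or_view"),
]
inventory = build_inventory(column_path, "db")
print(ET.tostring(inventory, encoding="unicode"))
```

Each level carries only its name and a pointer to its parent, which is all Open IGC needs in order to locate the existing asset.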

Similarly, here is the identification for the Data File Fields of a Data File. I don’t have to go down to this level, but it is a best practice to define lineage at the lowest possible point in the hierarchy, which is generally columns and fields. At the very least, aim to define lineage at the table level; lineage results will be clearer for your end users.

[Image: assetListDataFileforFlows]

This identification of objects needs to be done for anything that you want to include in your lineage path. Data Files and Database Tables are often the most common, but any object that is available for lineage in IGC is a potential candidate: Business Intelligence assets, members of an Extended Data Source collection, or parts of other bundles, such as the Messaging objects we reviewed in an earlier post of this series. How do you figure out that hierarchy, and learn the object class names? Well, admittedly, that can be tricky, although once you deal with the most common ones for a while, you will become familiar with the names and their relationships. It is important that you become familiar with all the tooling that is available at the igc-rest-explorer page that we have reviewed in earlier posts. The “Types” section and the “Assets” section are invaluable for reviewing the class names for primary objects and their properties, and Open IGC will be sure to remind you with useful errors about not finding an asset if you spell a class incorrectly or guess wrong on the hierarchy.

After we have identified our “inventory of assets” we are ready to connect them.  Here we bring our attention to the “flowUnit” nodes of the xml document.  Each “flowUnit” is associated with an asset (usually a higher level asset, such as a whole Job or Process) and has a collection of individual “flows” that are the detailed unit for a point-to-point source/target specification.    Let’s look at a representative sample and identify some of its meaningful parts:

[Image: flowUnit]

The first important attribute in the flowUnit xml element is assetID=”p1”. This is the main asset that is associated with this flow unit. It refers to the in-document assetID that is associated with each node up above in our asset inventory (in the initial screen shot above in this post, assetID=”p3” describes a Process called lookupCustomer and would be the typical asset for flowUnit details). The value “p1” identifies a whole process object in our bundle hierarchy: an entire “Job”. This also might be a single “instance” of an object that references a formal “execution” or “run” of this Job, if such an object is defined in your bundle. In this scenario, the next interesting attribute, flowType=”DESIGN”, provides a “descriptor” for the kind of lineage I am defining. This value will appear for the user when they use their mouse and “hover” over a particular line/arrow in a lineage diagram. “DESIGN” represents the “intended” lineage for this process, and perhaps might be a way to show the process’s own default values as coded by a developer. “DESIGN” might not be needed for your use case. Many times you might only need “OPERATIONAL” for the flowType, when the lineage you are defining reflects an actual run-time history of the process and the data that flowed through it.

Now look carefully at the “flow” element above. Very simple. It has two critical xml attributes: one for source IDs and one for targets. These point to other assetIDs from the asset inventory you defined above. This is the ultimate key to defining your lineage. Lay out your point-to-point lineage connections here. IGC will aggregate and summarize your low level lineage specifications to display a larger lineage rendering.
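To make the point-to-point idea concrete, here is a minimal Python sketch that assembles one flowUnit and refuses to build it if a flow refers to an ID missing from the inventory. The element names flowUnit and flow and the attributes assetID and flowType come from the discussion above; the attribute names sourceIDs and targetIDs, the exact placement of flowType, and the sample IDs other than “p1”, “w1” and “db1” are my assumptions, so verify them against the flow xsd.

```python
import xml.etree.ElementTree as ET


def build_flow_unit(process_id, flow_type, flows, known_ids):
    """Build a <flowUnit> from (source_id, target_id) pairs.

    Attribute names sourceIDs/targetIDs are assumed, not confirmed.
    """
    used = {process_id}
    for source_id, target_id in flows:
        used.update((source_id, target_id))
    # Every ID must already exist in this document's asset inventory.
    missing = sorted(used - set(known_ids))
    if missing:
        raise ValueError(f"IDs not in the asset inventory: {missing}")

    unit = ET.Element("flowUnit", {"assetID": process_id, "flowType": flow_type})
    for source_id, target_id in flows:
        ET.SubElement(unit, "flow", {"sourceIDs": source_id, "targetIDs": target_id})
    return unit


# Hypothetical inventory: a process, a database column, two task columns, a file field.
inventory_ids = {"p1", "db1", "w1", "w2", "f1"}
flow_unit = build_flow_unit(
    "p1", "DESIGN", [("db1", "w1"), ("w1", "w2"), ("w2", "f1")], inventory_ids
)
print(ET.tostring(flow_unit, encoding="unicode"))
```

Checking the IDs locally is optional, since Open IGC reports unknown assets itself, but it catches typos before you make the REST call.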

Once the lineage specifications have been outlined, the xml is uploaded via Open IGC using the POST that is available in the Flows section of the igc-rest-explorer.

[Image: flowsCall]

As with other call samples on the igc-rest-explorer “learn and test” page, there is a property where you can paste your xml payload, and then an example of the formal URL and expected response that you will use within the formal Open IGC interface you are developing.
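Once you move from the explorer page to your own code, the same POST can be prepared with nothing but the Python standard library. The sketch below is only a starting point under several assumptions: the URL is a placeholder (copy the real one from the “formal URL” shown on the explorer page), the user name and password are made up, and both the application/xml content type and the use of HTTP basic authentication are my guesses about what your server expects.

```python
import base64
import urllib.request


def build_flow_post(url, xml_payload, user, password):
    """Prepare (but do not send) a POST carrying the flow xml.

    Content type and basic authentication are assumptions; adjust to your server.
    """
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=xml_payload.encode("utf-8"),
        headers={
            "Content-Type": "application/xml",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Placeholder URL and credentials; replace with the values for your environment.
request = build_flow_post(
    "https://igc.example.com/replace/with/url/from/explorer",
    "<doc/>",
    "isadmin",
    "secret",
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would actually send it.
```

Keeping the request construction separate from the send makes it easy to inspect the payload and headers while you are still learning the API.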

If all goes well, your flow xml will be uploaded successfully and you can view lineage for your defined process! Lineage can be invoked in many ways — when initially testing lineage for new process assets I try to start lineage from the overall “process” asset itself. This will generally show me all of the lineage connections that were defined in that xml submission. Then you can move on and validate other lineage connections, starting on various assets that are significant for your use case.

[Image: expandLink]

…and drill in deeper with “Expand” if you have enabled that capability in your bundle!

[Image: expandedProcess]

…and as noted earlier, you can optionally hover over an individual lineage “arrow” and see the flowType for that particular data flow connection.

With this post, you now have reviewed all the basic ingredients you need to (1) design and register a new “bundle” of custom assets, (2) load various assets that you want to govern and make available for lineage, and (3) define and render the lineage that illustrates the flow of data through your systems.

In the next post we will start looking at advanced topics for fine tuning your lineage displays, updating bundles, etc.

–ernie

Defining Lineage Flows (Part 1)

This entry is one of many in a series that describes the InfoSphere Open IGC API, which allows you to define your own objects for information governance using InfoSphere Information Server and the Information Governance Catalog.

Previous post in this series:
Uploading New Assets!

Original post in this series:
Open IGC is here!

If you have been following along, we finished designing our first simple “bundle” and uploaded some new instances of those objects. Our use case for that first bundle was a not-so-standard source and target; a set of “message queues”. The objects we created for that use case “might” contain their own lineage, but it is more likely that they will simply participate as end-points in other lineage definitions. Once created, they can be referenced by Extension Mappings or by the new Open IGC.

Now let’s get into the use case for a true “data mover” by defining a set of objects that actually move and transform data: programs, scripts, stored procedures, independent ETL tools, Java, etc. Open IGC provides constructs that allow us to define, and therefore graphically illustrate, the processes and sub-processes that we use for transformation. Further, it allows us to describe internal and external flows, to establish a “zoom” point where we can “dive in” for more detail (“Expand” for those of you who use DataStage lineage today), and to specify both Design and Operational level lineage. There are other goodies too, so let’s get right to it.

For this exercise I have defined a new bundle. I call it “PixelStage”. This is a fictitious ETL tool from the future that moves and transforms light beams. ;) I took the liberty of using this example for the object types and their properties to force me to think “outside the box” and frankly, to keep things light (no pun intended) and interesting. Ultimately I morphed back to a fairly normal “column lineage” and data oriented paradigm, but this approach helped with the early learning curve over six months ago. You already know how to construct a bundle, so I will cover the highlights of what makes this bundle just a “little bit” different from our Messaging one.

First we define our bundle ID, and then lay out a hierarchy of object types. The “Processes” we are defining belong to a “Workspace”, and then inside each “Process” we will be defining a set of “Tasks”. By analogy to DataStage, this is like Project, Job, and Stage. Many programming disciplines have similar structures (albeit deeper or shallower) that you can describe in this fashion. Beneath “Tasks” will be “Columns”, the lowest unit of data flow for our lineage definition. Here is the screen shot of our new family of objects in this bundle:

[Image: pixelStageFamily]

Looking more closely at this bundle, here are some very interesting and important properties:

expandableInLineage=”true”

[Image: expandable]

This defines the “summary” level that you want to initially appear by default in your lineage reports. One object in each independent hierarchical path can have this property. Here we are defining expandableInLineage on each “Process”. This means that the user, upon seeing lineage initially displayed at the “Process” level, can drill deeper by clicking on the “Expand” link and see lineage that is “inside” that Process…at its underlying “Task” level:

[Image: expandLink]

[Image: expandedProcess]

While this diagram looks a little bit like an expanded DataStage Job, you can quickly see that some of the icons are unique (I stole the others from DataStage for this example because I didn’t have time to play in MS Paint!). Each icon inside the Process is identifying a different “Task” in this bundle, and each with its own internal lineage showing flows from one Task to another. The user can then hover over and further examine a Task, and then request lineage on one of its columns:

[Image: columnLineage]

So you can see how Open IGC lets us define, and then explore, very fine-grained lineage patterns.

Another interesting property, especially when defining lineage for data movement tools that have their own graphical development paradigm, is canHaveImage=”true”. This is a nice feature that allows an IGC metadata author to edit the object and include a static screen shot for better identification and governance purposes.

The subprocesses for any transformation tool or process that you describe will often have different purposes; different functions that they apply. In our use case they are all still called “Tasks”, as they each belong to an overall “Process”, but each has its own unique properties. Open IGC allows us to reflect this relationship in the bundle, simplifying our definitions by supporting the inheritance of common properties. Here we see the overall definition of a class called “Task”, with some Header properties that will be common to all Tasks I define:

[Image: mainTask]

As I define additional task types and their custom properties, I refer back to the overall “Task” definition using the “superClass” attribute:

[Image: superClass01]

Class “Converter” above inherits Header properties from object “Task”, but further defines its own (inWaveLength and outWaveLength) and we see this again in the Reader subclass that has properties to keep track of security credentials:
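If it helps to picture the inheritance, here is a small Python model of it. This is not how the bundle is written (the bundle is xml) and it is not an Open IGC API; it only mimics the rule that a subclass sees its superclass properties plus its own. The Task header property names and the Reader property name are hypothetical, while inWaveLength and outWaveLength are the Converter properties mentioned above.

```python
# A toy model of bundle classes and their "superClass" links.
BUNDLE_CLASSES = {
    # Hypothetical header properties shared by every Task.
    "Task": {"superClass": None, "properties": ["taskType", "taskOwner"]},
    "Converter": {
        "superClass": "Task",
        "properties": ["inWaveLength", "outWaveLength"],
    },
    # Hypothetical credential-related property for the Reader subclass.
    "Reader": {"superClass": "Task", "properties": ["userName"]},
}


def effective_properties(class_name, classes=BUNDLE_CLASSES):
    """Return inherited properties first, then the class's own."""
    chain = []
    current = class_name
    while current is not None:
        chain.append(current)
        current = classes[current]["superClass"]
    properties = []
    for name in reversed(chain):  # walk from the top superclass down
        properties.extend(classes[name]["properties"])
    return properties


print(effective_properties("Converter"))
```

The common properties are declared once on “Task”, and each new task type only lists what makes it different.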

[Image: superClass02]

While we are here it is worth noting that there are often objects in a bundle that you might not want to have ANY definition for lineage. Objects that you still want to govern, and provide icons for, but not allow the user to ask for “Data Lineage”. In this example, I want to illustrate “Variables” used by a Process. We may want to represent any number of them, and have them appear in lists, with their own icons, and available for independent reporting — but not be something that directly participates in a “flow”. Note the attribute called dataAccessRole=”none”. This indicates that the object cannot be directly defined for lineage, and the icon that a user clicks to request lineage will not appear for this object in a hover window or on its detail page.

[Image: dataAccessRole]

The variables still appear in a Process detail page, but don’t illustrate lineage themselves:

[Image: dataAccessRole]

[Image: variable]

Whew. This post is getting long. Next time we will see how we get all of these objects connected to one another and to other assets in the enterprise.

Next post in this series:
Defining Lineage Flows (Part 2)

–ernie

Open IGC: Uploading New Assets!

This entry is one of many in a series that describes the InfoSphere Open IGC API, which allows you to define your own objects for information governance using InfoSphere Information Server and the Information Governance Catalog.

Previous post in this series:
Open IGC: Defining a new bundle!

Original post in this series:
Open IGC is here!

At this point you have your bundle defined, you can see your objects and their icons in the “browse assets” page, and the detailed properties of your new objects are visible within the Query Tool. Congratulations! Now you are ready to start loading new assets, or new “instances” of the objects that you have modeled with your bundle design.

New assets are added using XML and another REST call, this time a special POST for the upload of new assets. The documentation (see URL in the prior blog entry) includes example xml documents and the xsd, but let’s look more closely at one here.

Our Messaging bundle describes a simple hierarchy that has Queue Managers at the highest level, and then Queues. Queue Managers also have “Listeners”. These are the major objects in the bundle. Initially I am just defining new Queue Managers and their Queues. To keep my sanity while initially learning and playing with the API, I am creating a single xml document for each Queue Manager. This is not a requirement, but keeping the documents small, and focused on one higher level object in the hierarchy will help you understand the structure of the xml and speed up your learning curve. Each whole “xml document” or “xml string” is what you will be passing in a single http POST when performing the actual upload.

Here is a list of an initial set of these xml documents.

[Image: xmlDocsForPublishing]

To stay organized, I keep them in a folder structure, per bundle type, that has a subdirectory for bundle details (see prior blog entry), a subdirectory for publishing new assets, and a subdirectory for publishing “flows” (for lineage…a future post). Ultimately, many of these xml documents will be built “on the fly” in your programs that craft the interface between Information Governance Catalog and whatever you are modeling with your bundle. However, the simplest way to learn the Open IGC is by using static xml documents. Depending on your use case, some of you may only have a few objects to govern, and might always use this file based approach.

[Image: highLevelBundleFolderStructure]

Let’s take a closer look at one of the publishing xml documents:

[Image: publishXML]

The elements and attributes above are well documented in the examples, so I won’t go into excruciating detail, but want to point out several items.

1. Your custom properties (light green box). Note how their names each begin with a dollar sign. This uniquely identifies them as “yours”. Every object gets name, short_description, and long_description. Think of these as “free” in your bundle. You didn’t need to define them in the bundle — they are just “there”. As such, they don’t require the dollar sign prefix.

2. The value of the repr attribute in the object header, and the string used for the “value” attribute for “name” immediately below it (purple box) must be identical! This is for internal reasons. It is a requirement of the API. You will get a nice error if they are not identical, so I am pointing it out here to save you the trouble.

3. The ID value (red boxes) is a unique identifier for the asset within this xml document. It is just an internal reference that is used throughout this particular xml document (it doesn’t have any overall system significance). It is critical for establishing the hierarchy of your objects and will be even more important when you learn about the “flow” xml for lineage.

4. The “reference” element (blue box) is what helps establish the hierarchy, identifying the parent asset (if applicable). Note the use of “ID”.
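The four points above can be folded into one small Python helper that generates an asset node. Treat it as a sketch under assumptions: the element names asset, attribute and reference and the repr, value and ID attributes are mentioned above, but the class names ($Messaging-queue and so on), the reference attribute names, and the sample values are hypothetical and should be checked against the example documents and the xsd.

```python
import xml.etree.ElementTree as ET

# Properties every object gets for free; they need no dollar sign prefix.
FREE_PROPERTIES = {"name", "short_description", "long_description"}


def make_asset(class_name, name, asset_id, properties=None, parent=None):
    """Build one <asset> node; names of elements/attributes are assumed."""
    # Point 2: repr and the "name" value are written from the same string,
    # so they can never differ.
    asset = ET.Element("asset", {"class": class_name, "repr": name, "ID": asset_id})
    ET.SubElement(asset, "attribute", {"name": "name", "value": name})
    for prop, value in (properties or {}).items():
        # Point 1: custom properties must start with a dollar sign.
        if prop not in FREE_PROPERTIES and not prop.startswith("$"):
            raise ValueError(f"custom property {prop!r} needs a '$' prefix")
        ET.SubElement(asset, "attribute", {"name": prop, "value": str(value)})
    if parent is not None:
        # Point 4: the reference names the parent by its in-document ID (point 3).
        reference_name, parent_id = parent
        ET.SubElement(
            asset, "reference", {"name": reference_name, "assetIDs": parent_id}
        )
    return asset


# Hypothetical Queue Manager and Queue instances.
manager = make_asset("$Messaging-queue_manager", "QM1", "a1")
queue = make_asset(
    "$Messaging-queue",
    "ORDERS.IN",
    "a2",
    {"short_description": "Inbound orders", "$default_persistence": "yes"},
    parent=("$queue_manager", "a1"),
)
print(ET.tostring(queue, encoding="unicode"))
```

Generating the nodes this way removes the two easiest mistakes: a repr that drifts from the name, and a custom property without its prefix.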

Another very important part of the publishing xml is the “importAction” element at the bottom of your xml document. This is an important property that controls the behavior of the API when managing a complex hierarchy. This can be a difficult concept to understand, but I will do my best to explain it here.

[Image: partialAssetID]

The element importAction has two attributes, partialAssetIDs and completeAssetIDs. These attributes contain a set of comma-delimited IDs from up above in the xml document. They describe whether a particular asset, in this xml document, is being uploaded with ALL of its children, or only “some”. If the parent ID is listed in “completeAssetIDs”, then the parent and its collection of child objects are considered complete; any pre-existing child instances “not mentioned” in this new xml document will be blown away. Mentioned child instances, if pre-existing, will have current properties edited (if desired) and retain all governance references (stewardship, term assignments, etc.). If you want to preserve the pre-existing children for a particular parent, place that parent ID in “partialAssetIDs”.
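Because this behavior is easy to get wrong, here is a tiny Python simulation of the rule just described. It does not call Open IGC at all; it only models which children a parent ends up with, given the children already in the repository and the children mentioned in the new xml document. The queue names are made up.

```python
def children_after_import(existing, uploaded, parent_listed_in):
    """Model which child assets remain under a parent after an upload.

    existing: child names already in the repository
    uploaded: child names mentioned in the new xml document
    parent_listed_in: "completeAssetIDs" or "partialAssetIDs"
    """
    existing, uploaded = set(existing), set(uploaded)
    if parent_listed_in == "completeAssetIDs":
        # The document is the whole truth: unmentioned children are removed.
        return uploaded
    if parent_listed_in == "partialAssetIDs":
        # The document is only part of the picture: existing children survive.
        return existing | uploaded
    raise ValueError(f"unknown importAction attribute: {parent_listed_in}")


existing_queues = {"Q1", "Q2", "Q3"}
uploaded_queues = {"Q2", "Q4"}
print(sorted(children_after_import(existing_queues, uploaded_queues, "completeAssetIDs")))
print(sorted(children_after_import(existing_queues, uploaded_queues, "partialAssetIDs")))
```

With “completeAssetIDs” the parent keeps only Q2 and Q4; with “partialAssetIDs” it keeps Q1 through Q4. In both cases Q2 is the pre-existing child that retains its governance references.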

Once you have built your xml document, and have checked it for well-formed-ness (at the very least, make sure you can open it in your browser as a well-formed and recognized xml document), you are ready to upload it to IGC. Go to the igc-rest-explorer page for the Open IGC API and find the bundle “POST” invocation for publishing assets:
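If you would rather not rely on the browser, a well-formedness check is a few lines of Python using the standard library parser. Note that this only confirms the xml is well-formed; it does not validate the document against the Open IGC xsd.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the string parses as xml, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<doc><assets/></doc>"))   # balanced tags
print(is_well_formed("<doc><assets></doc>"))    # <assets> never closed
```

The first call reports True and the second False, because the second string leaves an element unclosed.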

[Image: publishAssets]

Open your xml document in a regular editor and copy/paste the entire xml string into the available property (red box in the screen shot above) and then click “Try it out!”

If there are any errors, you will receive them here directly, and if all is “ok”, you will receive a clean 200 response code, and your assets will have been loaded.

[Image: successReturnCode]

At this point, you can immediately return to the Information Governance Catalog and view your new assets!

[Image: NewAssets]

Browse them by returning to the main “Information Assets…Browse All” pull down where you found the icons for your bundle, and then look around….see if your child assets are also loaded, and how they are displayed “within” the parent! Try doing a Query. Edit one of your new assets and make adjustments to one of the properties!

Your assets are now being governed…they can be assigned Terms and Labels, belong to Collections, become the responsibility of a Steward — everything that you can do within Information Governance Catalog is now available for your new objects! In the next post we will look at how you can apply your own custom flow definitions for data lineage that includes your new object instances.

Next post in this series:
Defining Lineage Flows (Part 1)

–ernie

Open IGC: Defining a new bundle!

This post discusses another topic regarding Open IGC, the new “extensibility” API that allows you to define your own objects inside the Information Server repository, and then govern them with Stewardship, Business Terms, or detailed lineage reporting.

First post in this series:
https://dsrealtime.wordpress.com/2015/07/29/open-igc-is-here/

Previous post in this series:
https://dsrealtime.wordpress.com/2015/08/07/open-igc-a-simple-messaging-system-use-case/

So…….you have decided that you want to extend the Information Server repository, and have decided that you want to create your own custom objects with their own icons and their own internal relationships. Now what?

Your first goal is to model what you want or need to represent. What objects do you want to govern? What kinds of lineage do you want to display? See the prior posts in this series for some ideas of what this might mean. Also, be sure to look at the documentation, and play with the real examples for extensibility that are included. [formal documentation for the Open IGC is here: http://www-01.ibm.com/support/docview.wss?uid=swg21699130 ].

Work it out first on paper, or on a whiteboard. What objects do you want a user to be able to click on, and request lineage? What levels in your database schemas do you want to show as connected objects in a lineage graph? If you are illustrating a process, one that has sub-processes and even additional sub-sub-processes, at what level do you want to provide a drill down or “expand” capability to the user for additional detail?

These specifications for your new object types are outlined in a “bundle”. The bundle represents each of the new object types and their icons that you will be defining. The bundle describes the relationships between the objects (parent/child or other “containment” definition) and also captures all of the individual properties (and their data types) for those objects. It establishes formal property names for their use in your code and in the user interface.

The bundle is defined using XML. A well documented xml schema is provided with the Open IGC, and is fairly easy to follow, even if you don’t spend much time with XML. Here is a snippet of the bundle I used for the prior post, to define my “Messaging” environment:

[Image: bundleSnippet]

Specifically note the class element at the top, named “queue”, and its various properties such as default_persistence towards the bottom. Its parent is “queue_manager” (defined earlier in the xml). Note also the “header section”. These properties will appear towards the top of the detail page when a user is reviewing this object, and also will be shown in the unique “hover” view that is available throughout all of IGC. Properties can be defined as simple strings, integers, float, etc. and also with enumerated types, as illustrated here. When using an enumerated type, the pre-defined list of values is automatically provided in a drop-down selection for any Steward who might edit this object. The values are also validated when objects are entered via API into the system.

The bundle xml, known as the “asset descriptor”, is arranged alongside two special folders for language conventions (not yet fully supported) and custom matching icons:

[Image: bundleFolderStructure]

Your matching icons are placed into an “icon” folder, following a naming convention for their class and size. As documented, the supported icon sizes are 32×32 (big) and 16×16 (small). These different icons will then appear in various places in IGC, depending on the context and what the user is doing.

[Image: iconList]

Ultimately, the asset_descriptor.xml and the two folders are zipped together into a single archive:

[Image: bundleZip]

A good practice to follow is to name the .zip file by the name of your bundle.
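Zipping by hand works fine, but if you rebuild the bundle often, a short Python script keeps the archive name and layout consistent. This is a sketch under assumptions: it simply archives whatever is in the bundle folder, keeping asset_descriptor.xml at the root of the zip. The demo uses a throwaway folder with a placeholder descriptor and a made-up icon file name, since the real icon naming convention is described in the documentation rather than here.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_bundle(bundle_dir, bundle_name, out_dir):
    """Zip the contents of bundle_dir into out_dir/<bundle_name>.zip."""
    bundle_dir = Path(bundle_dir)
    archive = Path(out_dir) / f"{bundle_name}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(bundle_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the bundle folder, so the
                # descriptor sits at the top level of the archive.
                zf.write(path, path.relative_to(bundle_dir).as_posix())
    return archive


def demo():
    """Build a throwaway bundle folder and list what ends up in the zip."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "src"
        (source / "icon").mkdir(parents=True)
        (source / "asset_descriptor.xml").write_text("<descriptor/>")  # placeholder
        (source / "icon" / "placeholder.gif").write_bytes(b"")  # made-up file name
        archive = zip_bundle(source, "Messaging", tmp)
        with zipfile.ZipFile(archive) as zf:
            return archive.name, sorted(zf.namelist())


print(demo())
```

Writing the archive outside the bundle folder avoids accidentally zipping the zip file into itself on the next run.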

This is the file that is sent to Information Server to formally “register” your new objects (go to https://:/ibm/iis/igc-rest-explorer ). This can be done programmatically, of course, but the igc-rest-explorer page makes this very convenient, especially when you are first getting started, or if you haven’t done much with REST APIs and their invocation as an HTTP based web service. In a later post, I will discuss various ways of making these calls in an automated fashion. Here is a screen shot of how this looks:

[Image: register]

Click on “bundle” when you first get to the igc-rest-explorer page, and then POST for registering a new bundle. A convenient “browse” button allows you to select your bundle zip file and then just click “Try it Out!”. It is very simple to get started! Error checking is very thorough — if you mess up your bundle, the IGC registration will let you know. Here I have made a very simple error, trying to re-register the same bundle:

[Image: registrationError]

…it also picks up other subtle errors that you might make when defining your new objects.

When the registration works, you will get a clean confirmation, and can then immediately go and see the results of your creative thinking and design efforts! I like to immediately check the “browse all assets” list, to see what new icons and bundle “section” I have:

[Image: MessagingIcons]

I also like to immediately select one of my objects in the IGC Query tool, and check to see that my special Open IGC custom properties are showing up as I expect:

[Image: igcQueryWithProperties]

If you need to make updates to your bundle, such as add new object types or properties or make mild changes to the labels or names shown in the user interface, or add/change icons, there is a REST call (also available at the igc-rest-explorer page) to “Update a previously registered asset bundle”. You cannot make radical structural changes, or alter datatypes or the formal names of registered objects, but simple changes and additions are permitted. If you make changes to your icons, or add new ones, be sure to clear your browser cache to ensure that they are visible the next time you return to and refresh the browse page.

That’s it! Now I am ready to start adding real instances of my new objects to the repository and start governing!!!

Next post in this series:
Uploading New Assets!

Ernie

Open IGC. A Simple “Messaging System” Use Case

In the previous post in this series about Open IGC (https://dsrealtime.wordpress.com/2015/07/29/open-igc-is-here/), I described several use cases to get you thinking about how you might apply this technology to your own solutions. I have since encountered several other great use cases that I will discuss in future posts — but for now, let’s dive into one of them that has already been discussed: Messaging Systems.

A Messaging System or environment is a unique case of Source and/or Target. It’s not quite a “file”, although it can “contain a file”…nor is it the same as a Table. Queues have “data” but they also can store other things, and have lots of other qualifiers, such as persistence, message types, and read methods. There is an implied hierarchy in a messaging system, but it isn’t the same as a subdirectory with files or a schema with its collection of tables.

Governance covers many things, and queues and queuing systems certainly qualify as objects worth governing, depending on your specific needs. Queues and their accompanying objects may require Stewardship, Application and Term definitions, and can carry operational information, such as Current Queue Depths, or historical statuses. Queues certainly can and should participate in lineage and impact analysis reporting, as they are often the “beginning” or the “termination” of a lineage “flow”.

All of these unique qualities justify the application of “Open IGC” to my Messaging System. I also should consider “volume” and “available skill sets”, but for now let’s assume that I have a significant number of messaging artifacts to justify the work effort, and the skills in xml and REST to get it done.

What will it look like for my users? What can I do with it once it is defined with Open IGC?

Let’s see what the finished result looks like in the Information Governance Catalog (IGC). Once we register a new set of Object types (we call this “registering a new bundle”), the objects appear within each regular and expected context of IGC. I can browse the new Objects:

[Image: MessagingIcons]

I can assign them to a Business Term or other relationships:

[Image: Assignment]

…use them in a Query:

[Image: query]

…and have them participate in Data Lineage Reporting:

[Image: lineage]

The bottom line is that I can use them as I would most any other object that is part of Information Server, including Stewardship and integration with Rules and Policies. The fact that I am able to give these objects their own structure, their own properties, and their own icons, makes their use for governance more inviting to the user community and more understandable by everyone. This helps encourage adoption and participation in the governance framework.

Once the new bundle of object types is registered, I can populate the repository with actual instances. The brief lineage picture above gives you an idea of how objects of this messaging bundle participate in lineage analysis, but we can also review their details. Here is the detail page for one of the Queue Managers, showing just a few of the properties that have been modeled with this bundle, and populated for our environment:

[Image: QueueMgr]

The Open IGC also provides a paradigm for including “Operational Metadata” in a one:many relationship that makes it convenient to include run time statistics or other details of your processes that may be important for your governance scenarios. Here you see how queue statistics might be captured and stored for later review:

[Image: runstats]

This is a simple implementation. I am not representing a complex process, with inner subtasks [we’ll get there in a later post], yet have created a new set of objects that more clearly illustrate an important concept for the enterprise. Governance adoption can be simpler, and will bring aboard a new audience whose needs have been met with custom objects, icons, and relationships. Data lineage is supported with known tooling, using Extension Mappings that are already in use by other parts of the governance team.

Next post we’ll take a look at what is required to define new bundles like this and to load up new instances of metadata into the Information Server repository!

Next post in this series:

Open IGC: Defining a new bundle

–ernie

Validating your REST based Service calls from DataStage

About a year ago the Hierarchical Stage (used to be called the “XML” Stage) added the capability of invoking REST based Web Services. REST based Web Services are increasing in popularity, and are a perfect fit for this Stage, because most REST based services use payloads in XML or JSON for their requests and responses.

REST based Web Services have a couple of challenges, however, because they do not use SOAP, and consequently, they rarely have a schema that defines their input and output structures. There is no “WSDL” like there is for a classic SOAP based service. On the other hand, they are far less complex to work with. The payloads are clean and obvious, and lack the baggage that comes with many SOAP based systems. We won’t debate that here…both kinds of Web Services are with us these days, and we need to know how to handle all of them from our DataStage/QualityStage environments.

Here are some high level suggestions and steps I have for working with REST and the Hierarchical Stage:

1. Be sure that you are comfortable with the Hierarchical Stage and its ability to parse or create JSON and XML documents. Don’t even think about using the REST step until you are comfortable parsing and reading the XML or JSON that you anticipate receiving from your selected service.

2. Start with a REST service “GET” call that you are able to run directly in your browser. Start with one that has NO security attached. Run it in your browser and save the output payload that is returned.

3. Put that output in a .json or .xml file, and write a Job that reads it (using the appropriate XML and/or JSON parser Steps in the Assembly). Make sure the Job works perfectly and obtains all the properties, elements, attributes, etc. that you are expecting. If the returned response has multiple instances within it, be sure you are getting the proper number of rows. Set that Job aside.

4. Write another Job that uses the REST Step and just tries to return the payload, intact, and save it to a file. I have included a .dsx for performing this validation. Make sure that Job runs successfully producing the output that you expect, and that matches the output from using the call in your browser.

5. NOW you can work on putting them together. You can learn how to pass the payload from one step to another, and include your json or xml parsing steps in the same Assembly as the REST call, or you could just pass the response downstream to be picked up by another instance of the Hierarchical Stage. Doing it in the same Assembly might be more performant, but you may have other reasons that you want to pass this payload further along in the Job before parsing.
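If you want to rehearse steps 3 and 4 outside of DataStage before building the Jobs, here is a minimal Python sketch of the same idea: one function saves a response untouched, another parses a saved JSON payload and returns the repeated instances so you can check the row count. It is not the DataStage implementation, just a way to know what the Jobs should produce. The payload content and the "items" key are made up for illustration; your service will have its own structure.

```python
import json
import urllib.request
from pathlib import Path


def fetch_payload(url, out_path):
    """Step 4: call the service with a plain GET and save the body untouched."""
    with urllib.request.urlopen(url, timeout=30) as response:
        body = response.read()
    Path(out_path).write_bytes(body)
    return body


def rows_from_payload(text, list_key):
    """Step 3: parse a saved JSON payload and return one dict per instance."""
    document = json.loads(text)
    items = document[list_key]
    if not isinstance(items, list):
        raise ValueError(f"expected a list under {list_key!r}")
    return items


# Stand-in for the payload saved from the browser (hypothetical content).
saved = '{"items": [{"id": "q1", "name": "ORDERS.IN"}, {"id": "q2", "name": "ORDERS.OUT"}]}'
rows = rows_from_payload(saved, "items")
print(len(rows), [row["name"] for row in rows])
```

The row count printed here is the number your parsing Job should also deliver for the same saved file.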

One key technique when using REST with DataStage is the ability to “build” the URL that you will be using for your invocations. You probably aren’t going to be considering DataStage/QualityStage for your REST processes if you only need to make one single call. You probably want to repeat the call, using different input parameters each time, or a different input payload. One nice thing about REST is that you can pass arguments within the URL, if the REST API you are targeting was written that way by its designers.
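To make the idea of “building” the URL concrete, here is a small Python sketch that produces one URL per input row, the way an upstream Derivation would produce one URL per record. The base address, the "asset_id" column, and the "format" argument are all assumptions for illustration; substitute whatever your target REST API actually expects.

```python
from urllib.parse import quote, urlencode


def build_url(base_url, asset_id, params=None):
    """Append an escaped identifier to the base URL, plus optional query arguments."""
    path = f"{base_url.rstrip('/')}/{quote(asset_id, safe='')}"
    if params:
        return f"{path}?{urlencode(params)}"
    return path


# Hypothetical base address and input rows; one URL is built per row.
BASE = "https://igc.example.com/api/assets"
input_rows = [{"asset_id": "abc 123"}, {"asset_id": "def/456"}]
urls = [build_url(BASE, row["asset_id"], {"format": "json"}) for row in input_rows]

for url in urls:
    print(url)
```

Escaping the identifier matters as soon as your input values contain spaces or slashes, which is easy to overlook when the first test uses a hard coded URL.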

In the Job that I have provided, you will see that the URL is set within the upstream Derivation. It is very primitive here — just hard coded. It won’t work in your environment, as this is a very specific call to the Information Governance Catalog API, with an asset identifier unique to one of my systems. But it illustrates how you might build YOUR url for the working REST call that you are familiar with from testing inside of your browser or other testing tool. Notice in the assembly that I create my own “argument” within the REST step which is then “mapped” at the Mappings section to one of my input columns (the one with the Derivation). The Job is otherwise very primitive — without Job Parameters and such, but simply an example to help you get started with REST.

Ernie

…another good reference is this developerWorks article by one of my colleagues:

https://www.ibm.com/developerworks/data/library/techarticle/dm-1407governrest/

BasicRESTvalidation.dsx

Open IGC is here!

Hi Everyone….

Been awhile since I’ve posted anything — been too busy researching and supporting many new things that have been added in the past year — for data lineage, for advanced governance (stewardship and workflow), and now “Open IGC”.  This is the ability to create nearly “any” type of new object within the Information Governance Catalog and then connect it to other objects with a whole new lineage paradigm.    If you are a user of Extensions (Extension Mapping Documents and Extended Data Sources), think of Open IGC as the “next evolution” for extending the Information Server repository.   If you are a user of DataStage, think of what it would be like to create your own nested objects and hierarchies, with their own icons, and their own level of “Expand” (like zoom) capability for drilling into further detail.

This new capability is available at Fix Central for 11.3 with Roll-up 16 (RU 16) and all of its pre-requisites (FP 2 among other things).

So exactly what is this “Open IGC”?

Open IGC (you may also hear or see “Open IGC for Lineage” or “Open IGC API”) gives us the ability to entirely define our “own” object types.   This means having them exist with their own names, their own icons, and their own set of dedicated properties.   They can have their own containment relationships and define just about “anything” you want. They are available via the detailed “browse” option, and appear in the query tool. They can be assigned to Terms and vice versa, participate in Collections, and be included in Extension Mappings…and then…once you have defined them, you can describe your own lineage among these objects, also via the same API, and define what you perceive as “Operational” vs “Design” based lineage (lineage without needing to use Extensions, and supporting “drill down” capabilities as we see with DataStage lineage).

Here are some use cases:

a) Represent a data integration/transformation process…or “home grown” ETL.    This is the classic use case.  Define what you call a “process” (like a DataStage Job)….and its component parts…the subparts like columns and transformations, and properties that are critical.   Outline the internal and external flows between such processes and their connections to other existing objects (tables, etc.) in the repository.

b)  Represent some objects that are “like” Extended Data Sources, but you want more definition…..such as (for example) all the parts of an MQ Series or other messaging system configuration…objects for the Servers, the Queue Managers, and individual Queues.  Give them their own icons, and their own “containment” depths and relationships.   Yes — you could use Extensions for this, but at some point it becomes desirable to have your own custom properties, your own object names for the user interface, and your own creative icons!

c)  Overload the catalog and represent some logical “concept” that lends itself to IGC’s graphical layout features, but isn’t really in the direct domain of Information Integration.   One site I know of wants to show something with “ownership”…but illustrate it graphically.  They are interested in having “responsibility roles” illustrated as objects…whose “lineage” is really just relationships to the objects that they control.  Quite a stretch, and would need some significant justification vs using tooling more appropriate for this use case, but very do-able via this API.

It’s all done based on XML and REST, and does not require that you re-install or otherwise re-configure the repository.  You design and register a “bundle” with your new assets and their properties, and then use other REST invocations to “POST” new instances of the objects you are representing.
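As a rough picture of what “XML plus a REST POST” means in practice, here is a hedged Python sketch that assembles an XML document describing a couple of instances and prepares (but does not send) a POST request. The element names, attribute names, class name, and URL below are illustrative placeholders, not the confirmed Open IGC schema or endpoint; the later posts in this series and the product documentation describe the real formats.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_instance_document(instances):
    """Build an XML body listing asset instances (illustrative element names)."""
    root = ET.Element("doc")
    assets = ET.SubElement(root, "assets")
    for index, instance in enumerate(instances, start=1):
        asset = ET.SubElement(
            assets, "asset", {"class": instance["class"], "ID": f"a{index}"}
        )
        for name, value in instance["properties"].items():
            ET.SubElement(asset, "attribute", {"name": name, "value": value})
    return ET.tostring(root, encoding="utf-8")


def prepare_post(url, body):
    """Create the POST request object; sending it is left to the caller."""
    return urllib.request.Request(
        url, data=body, method="POST", headers={"Content-Type": "application/xml"}
    )


# Hypothetical instances and endpoint, for illustration only.
instances = [
    {"class": "QueueManager", "properties": {"name": "QM_ORDERS"}},
    {"class": "QueueManager", "properties": {"name": "QM_BILLING"}},
]
body = build_instance_document(instances)
request = prepare_post("https://igc.example.com/hypothetical/assets", body)
print(request.get_method(), len(body), "bytes")
```

Keeping the document builder separate from the request makes it easy to inspect or save the XML before anything is sent to the repository.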

Quite cool…….and more to come…..I will be documenting my experiences with the API and the various use cases that I encounter.

What use cases do YOU have in mind?    :)

Next post in this series: Open IGC: a Simple Messaging Use Case

Ernie
