Validating your REST based Service calls from DataStage

About a year ago the Hierarchical Stage (used to be called the “XML” Stage) added the capability of invoking REST based Web Services. REST based Web Services are increasing in popularity, and are a perfect fit for this Stage, because most REST based services use payloads in XML or JSON for their requests and responses.

REST based Web Services have a couple of challenges, however, because they do not use SOAP, and consequently, they rarely have a schema that defines their input and output structures. There is no “WSDL” like there is for a classic SOAP based service. On the other hand, they are far less complex to work with. The payloads are clean and obvious, and lack the baggage that comes with many SOAP based systems. We won’t debate that here…both kinds of Web Services are with us these days, and we need to know how to handle all of them from our DataStage/QualityStage environments.

Here are some high level suggestions and steps I have for working with REST and the Hierarchical Stage:

1. Be sure that you are comfortable with the Hierarchical Stage and its ability to parse or create JSON and XML documents. Don’t even think about using the REST step until you are comfortable parsing and reading the XML or JSON that you anticipate receiving from your selected service.

2. Start with a REST service “GET” call that you are able to run directly in your browser. Start with one that has NO security attached. Run it in your browser and save the output payload that is returned.

3. Put that output in a .json or .xml file, and write a Job that reads it (using the appropriate XML and/or JSON parser Steps in the Assembly). Make sure the Job works perfectly and obtains all the properties, elements, attributes, etc. that you are expecting. If the returned response has multiple instances within it, be sure you are getting the proper number of rows. Set that Job aside.
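As a quick sanity check outside of DataStage, a short script can confirm that the saved payload parses cleanly and contains the number of instances you expect your parser Step to emit. The payload and property names below are purely hypothetical — substitute the file you saved from your browser:

```python
import json

# Hypothetical payload saved from the browser "GET" call in step 2.
payload = '''
{
  "assets": [
    {"id": "a1", "name": "Customer", "type": "table"},
    {"id": "a2", "name": "Orders",   "type": "table"}
  ]
}
'''

doc = json.loads(payload)

# Confirm every expected property is present on every instance.
rows = doc["assets"]
for row in rows:
    assert {"id", "name", "type"} <= row.keys()

# The count here should match the number of rows your JSON parser
# Step produces in the DataStage Assembly.
print(len(rows))
```

If the counts disagree, the usual culprit is the repetition path chosen in the Assembly mapping, not the payload itself.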

4. Write another Job that uses the REST Step and just tries to return the payload, intact, and save it to a file. I have included a .dsx for performing this validation. Make sure that Job runs successfully producing the output that you expect, and that matches the output from using the call in your browser.

5. NOW you can work on putting them together. You can learn how to pass the payload from one step to another, and include your json or xml parsing steps in the same Assembly as the REST call, or you could just pass the response downstream to be picked up by another instance of the Hierarchical Stage. Doing it in the same Assembly might be more performant, but you may have other reasons that you want to pass this payload further along in the Job before parsing.

One key technique when using REST with DataStage is the ability to “build” the URL that you will be using for your invocations. You probably aren’t going to be considering DataStage/QualityStage for your REST processes if you only need to make one single call. You probably want to repeat the call, using different input parameters each time, or a different input payload. One nice thing about REST is that you can pass arguments within the URL, if the REST API you are targeting was written that way by its designers.
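The URL-building idea can be sketched in a few lines. This is just an illustration of the concatenation your upstream Derivation performs — the base URL and parameter names here are hypothetical, and you would substitute the ones from the call you tested in your browser:

```python
from urllib.parse import urlencode

# Hypothetical REST endpoint and arguments -- replace with the base URL
# and parameter names of the service you are actually targeting.
base_url = "https://myserver:9443/ibm/iis/igc-rest/v1/search"
params = {"types": "term", "text": "customer", "pageSize": 25}

# Mirrors what the upstream Derivation does in the sample Job:
# a fixed base concatenated with per-row argument values.
url = base_url + "?" + urlencode(params)
print(url)
```

In the DataStage Job itself, the same concatenation happens in the Transformer Derivation, with each incoming row supplying different argument values.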

In the Job that I have provided, you will see that the URL is set within the upstream Derivation. It is very primitive here — just hard coded. It won’t work in your environment, as this is a very specific call to the Information Governance Catalog API, with an asset identifier unique to one of my systems. But it illustrates how you might build YOUR url for the working REST call that you are familiar with from testing inside of your browser or other testing tool. Notice in the assembly that I create my own “argument” within the REST step which is then “mapped” at the Mappings section to one of my input columns (the one with the Derivation). The Job is otherwise very primitive — without Job Parameters and such, but simply an example to help you get started with REST.


…another good reference is this developerWorks article by one of my colleagues:


Open IGC is here!

Hi Everyone….

Been a while since I’ve posted anything — been too busy researching and supporting many new things that have been added in the past year — for data lineage, for advanced governance (stewardship and workflow), and now “Open IGC”. This is the ability to create nearly “any” type of new object within the Information Governance Catalog and then connect it to other objects with a whole new lineage paradigm. If you are a user of Extensions (Extension Mapping Documents and Extended Data Sources), think of Open IGC as the “next evolution” for extending the Information Server repository. If you are a user of DataStage, think of what it would be like to create your own nested objects and hierarchies, with their own icons, and their own level of “Expand” (like zoom) capability for drilling into further detail.

This new capability is available at Fix Central for 11.3 with Roll-up 16 (RU 16) and all of its pre-requisites (FP 2 among other things).

So exactly what is this “Open IGC”?

Open IGC (you may also hear or see “Open IGC for Lineage” or “Open IGC API”) provides us with the ability to entirely define our “own” object types. This means having them exist with their own names, their own icons, and their own set of dedicated properties. They can have their own containment relationships and define just about “anything” you want. They are available via the detailed “browse” option, and appear in the query tool. They can be assigned to Terms and vice versa, participate in Collections, and be included in Extension Mappings. …and then…once you have defined them, you can describe your own lineage among these objects, also via the same API, and define what you perceive as “Operational” vs “Design” based lineage (lineage without needing to use Extensions, and supporting “drill down” capabilities as we see with DataStage lineage).

Here are some use cases:

a) Represent a data integration/transformation process…or “home grown” ETL.    This is the classic use case.  Define what you call a “process” (like a DataStage Job)….and its component parts…the subparts like columns and transformations, and properties that are critical.   Outline the internal and external flows between such processes and their connections to other existing objects (tables, etc.) in the repository.

b)  Represent some objects that are “like” Extended Data Sources, but you want more definition…..such as (for example) all the parts of an MQ Series or other messaging system configuration…objects for the Servers, the Queue Managers, and individual Queues.  Give them their own icons, and their own “containment” depths and relationships.   Yes — you could use Extensions for this, but at some point it becomes desirable to have your own custom properties, your own object names for the user interface, and your own creative icons!

c) Overload the catalog and represent some logical “concept” that lends itself to IGC’s graphical layout features, but isn’t really in the direct domain of Information Integration. One site I know of wants to show something with “ownership”…but illustrate it graphically. They are interested in having “responsibility roles” illustrated as objects…whose “lineage” is really just relationships to the objects that they control. Quite a stretch, and it would need some significant justification vs using tooling more appropriate for this use case, but very do-able via this API.

It’s all done based on XML and REST, and does not require that you re-install or otherwise re-configure the repository.  You design and register a “bundle” with your new assets and their properties, and then use other REST invocations to “POST” new instances of the objects you are representing.
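To make the “XML and REST” part concrete, here is a minimal sketch of building an asset-instance document before POSTing it. The bundle name, class names, attributes, and the endpoint path in the comment are all illustrative assumptions — the exact schema and URLs come from the Open IGC documentation for your release:

```python
import xml.etree.ElementTree as ET

# Build a tiny asset-instance document for a hypothetical "MyBundle"
# bundle containing one Server object. Element and attribute names
# here are assumptions for illustration only.
doc = ET.Element("doc")
assets = ET.SubElement(doc, "assets")
asset = ET.SubElement(assets, "asset",
                      {"class": "$MyBundle-Server", "repr": "mq_host_01", "ID": "s1"})
ET.SubElement(asset, "attribute", {"name": "name", "value": "mq_host_01"})

xml_payload = ET.tostring(doc, encoding="unicode")
print(xml_payload)

# This payload would then be POSTed to the igc-rest endpoint on your
# services tier, e.g. something along the lines of:
#   POST https://<host>:9443/ibm/iis/igc-rest/v1/bundles/assets
```

The same pattern applies to registering the bundle itself and to the lineage “flow” documents: build the XML, then POST it to the appropriate igc-rest resource.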

Quite cool…….and more to come…..I will be documenting my experiences with the API and the various use cases that I encounter.

What use cases do YOU have in mind?    :)


New Recording on DataStage/QualityStage Lineage!

Hi Everyone…

Our engineering team just posted a very nice, short recording on how easily you can view Data Lineage for your existing DataStage and QualityStage Jobs after simply importing them into 11.3 !

The Jobs do NOT have to be “running” in 11.3. They can continue to run in their current production environment while you take advantage of all the new metadata features in 11.3. You import the Jobs and can also import the Operational Metadata from your earlier releases.


DataStage and Minecraft (just for fun…)!

Hi Everyone!

Do your kids play Minecraft? Do you? Here is a “just for fun” recording of DataStage in a Minecraft world….. if you use DataStage and you and/or your family members play Minecraft, we hope you’ll enjoy this little adventure into the “world of Transformation”….. ; )


Manually Creating Shared Tables and Data Files

There are lots of ways to import Shared Tables from other tools and other kinds of metadata — but sometimes it’s helpful to just “do it yourself.” You may have home-grown spreadsheets or documents that contain your column metadata, or simply don’t have immediate access to the COBOL FDs, XSDs, ERwin (or other popular modeling tools) models, or other standard forms of metadata that contain critical table and column information.

See this link for more information on Shared Tables and Data Files… What are Shared Table Definitions?

[Note: 9.1 Update — in 9.1 this utility is now formally part of istool. It is called istool workbench generate ]

A new utility (courtesy of our creative engineers) has been built that supports the manual creation of Shared Tables and Data Files within the repository. This provides new flexibility for dynamically and programmatically creating these structures for data lineage and impact analysis purposes. The utility takes advantage of the istool archive format and syntax. Attached to this posting are three files…the documentation, a .jar file to be copied to your istool directory structure, and a sample import .csv with a variety of Shared Tables and Data Files. It will work with istool in both the Server and Client side environments. Please see the documentation for further details…
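Because the input is a plain .csv, the import file is easy to generate from whatever home-grown source holds your column metadata. The exact column layout is defined in the attached documentation and sample file — the header below is a purely hypothetical illustration of generating rows programmatically:

```python
import csv
import io

# Hypothetical rows of table/column metadata pulled from a spreadsheet
# or home-grown inventory. The column layout shown here is illustrative
# only; use the layout from the utility's documentation and sample .csv.
rows = [
    ["HOSTA", "SALESDB", "DBO", "CUSTOMER", "CUST_ID",   "INTEGER", "10"],
    ["HOSTA", "SALESDB", "DBO", "CUSTOMER", "CUST_NAME", "VARCHAR", "64"],
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["host", "database", "schema", "table", "column", "sqltype", "length"])
writer.writerows(rows)

print(buf.getvalue())
```

The generated file would then be fed to the utility (or, on 9.1, to istool workbench generate) to produce the .isx archive for import.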


1) After download, remove the .jpg suffix for the .jar and .csv files !!!! (the suffix is there just to satisfy this blog’s restrictions)
2) I noticed that the .jar file had its dots (.) changed to dashes (-). Leave the underscore, but rename the file so that each dash becomes a dot. The actual jar file you copy into your subdirectory should be:
3) USE/READ THE DOCUMENT. It illustrates the detailed syntax that you need.
4) I noticed on linux that a ‘/’ was appended to the .isx file. I renamed the file before doing the “import” step. It worked fine. It seems to work fine “as is” on Windows.

jar file [remove “.jpg” after download]
Sample .csv file with detailed table metadata [remove “.jpg” after download]

(what are?) Shared Table Definitions

Information Server 8.x introduced a concept called a “Shared Table Definition.” This is a “container” for Table and Column metadata. It holds the Column names and other critical metadata (sql data type, length, etc.) for your relational tables. We call it a “Shared Table” because it is automatically shared among all of the tooling within Information Server and Foundation Tools.

A Shared Table is immediately visible (and usable) in Information Analyzer, FastTrack, Metadata Workbench and Business Glossary. It is also available for DataStage and QualityStage via the Metadata Sharing option within the DataStage Designer. Shared Tables are “like” DataStage Table Definitions, but with a broader, more enterprise scope. DataStage Table Definitions are still supported as always, but are unique to the DataStage environment. Inside of DataStage, a Shared Table appears in the Repository listing with a little white “book” icon above it. I always call it the “dove” icon because it looks to me like a bird. :)

Shared Tables often reflect a “real”, physical table. They have a multipart name that includes Host, Database, Schema and Table and finally Column names. Shared Tables can be created in a variety of ways:

— Importing tabular metadata via “Connector” inside of FastTrack, Information Analyzer, or the DataStage Connector Wizard (under import…table defs).
— Importing tabular metadata via the Import/Export Manager using Bridges and/or Brokers
— via the “Create Shared Table from Table Definition” option from within DataStage
— via a new utility that allows the creation of Shared Tables and Data Files (also known as Shared Files) from .csv [saving that subject for another blog entry]
— via export of physical models from InfoSphere Data Architect (which could have imported the model(s) from other design tools)
— via Optim export from InfoSphere Discovery

Once they are created, you can see them in the “tree” views of FastTrack, IA, and Workbench, and also delete them in most of those Information Server clients…..each client is looking at the exact same copy of the metadata — so there is absolute consistency between the views, regardless of which client you are using or which role you play in the organization.

These Shared Tables drive the integration activity, or the review of such activity, throughout Information Server, and they also serve as an important “anchor” for concepts that help enable strong information governance. Shared Tables retain associated profiling information (domain population details, data quality assessment history, etc.) from InfoSphere Discovery and Information Analyzer. Shared Tables can be assigned a Data Steward — a subject matter expert who is most familiar with a particular table and its value to the organization. Shared Tables and their Columns can be assigned to a Business Term — or to multiple Business Terms….to further define them and qualify their purpose and meaning to the enterprise. Perhaps most important, Shared Tables are the key sources and targets for Data Lineage reporting and analysis. Shared Tables are “connected” for data lineage purposes via DataStage Jobs or Extension Mappings that represent processes external to Information Server and Foundation Tools.

Related to a Shared Table is a “Data File”. A Data File is also a container for Fields (or Columns), but is associated with a sequential or flat file instead of an RDBMS. A Data File has similar characteristics, and participates in data lineage and metadata enhancement (Stewards, Terms, etc.) like Shared Tables, but has a different icon when viewed within the Metadata Workbench. The naming convention for a Data File is Host, DataFile name, and Data File Structure….and then Fields. Today, Data Files are uniquely used by DataStage, and created from there using the “Create Shared Table from Table Definition” as noted above. The difference is in the “Locator” tab of the original DataStage Table Definition. If it says “sequential”, then a Data File will be created instead of a Shared Table. However, a new utility just became available that allows Data Files to be created at the command line. I will be writing another entry in the blog to describe this new utility that provides new flexibility for creating these useful objects.


Data Lineage vs Business Lineage

Hi all…

Thought it would be a good idea to go into some detail about Data Lineage and Business Lineage.

Data Lineage is a more “technical” detailed lineage from sources to targets that includes ETL Jobs, FTP processes and detailed column level flow activity. It should trace everything from source to target, and be flexible enough to encompass legacy sources, all RDBMSs, abstract sources and targets like message queues and real-time transactions, and every sort of transformation and movement process that touches the data along the way.

Business Lineage is a summary of that lineage — showing primarily only the sources, targets and reporting assets. I think of those assets as sort of “floating” to the top for display in Business Lineage (the same lineage is created under the covers). Business Lineage is often more useful to business members of your teams who aren’t interested in all of the gory detail at the lower level, but want to see a few critical business rules along with the ultimate sources for their reports and portal displays.

With Information Server and Foundation Tools, Lineage between sources and targets (Shared Tables) is established in a variety of ways:

a) via DataStage Jobs that illustrate the transformation of data from source(s) to target(s).

b) via parsing of SQL in rdbms views, or as an extension of the above when custom SQL is used in a Job.

c) via Extension Mappings, an 8.1.1 feature of Metadata Workbench (June, 2009) that supports the illustration and lineage for any source and target.

DataStage Jobs are just “there”. As a developer builds and edits a Job, lineage just “happens”. Either as a matter of course as you follow the links through a complex Job, or via the algorithms that Metadata Workbench uses to parse through Jobs and find their relationships (i.e., JobA writes to a target that becomes the source for JobB). Views are picked up during import from an RDBMS catalog. Metadata for a view is contained in the SQL SELECT that defines it. This SQL is parsed by Metadata Workbench to determine the actual base tables used. Extension Mappings allow you to be creative and define anything you want. This is important for illustrating any type of source-to-target relationship, and can represent anything from a COBOL program to a shell script, or nothing at all — it simply may be important to illustrate a from/to relationship between two or more objects in your enterprise.

Business Lineage was an 8.1.2 capability (December 2009) that expands on the lineage concept by displaying only the sources and targets and reporting resources. The DataStage Jobs or Extension Mappings are “just under the surface”, but their detail, which could be extensive, is suppressed by the Business Lineage display. This makes consumption of those screens for lineage review simpler for certain categories of users.


