Validating your REST based Service calls from DataStage

About a year ago the Hierarchical Stage (used to be called the “XML” Stage) added the capability of invoking REST based Web Services. REST based Web Services are increasing in popularity, and are a perfect fit for this Stage, because most REST based services use payloads in XML or JSON for their requests and responses.

REST based Web Services have a couple of challenges, however, because they do not use SOAP and, consequently, they rarely have a schema that defines their input and output structures. There is no "WSDL" like there is for a classic SOAP based service. On the other hand, they are far less complex to work with. The payloads are clean and obvious, and lack the baggage that comes with many SOAP based systems. We won't debate that here…both kinds of Web Services are with us these days, and we need to know how to handle all of them from our DataStage/QualityStage environments.

Here are some high level suggestions and steps I have for working with REST and the Hierarchical Stage:

1. Be sure that you are comfortable with the Hierarchical Stage and its ability to parse or create JSON and XML documents. Don’t even think about using the REST step until you are comfortable parsing and reading the XML or JSON that you anticipate receiving from your selected service.

2. Start with a REST service "GET" call that you are able to run directly in your browser. Start with one that has NO security attached. Run it in your browser and save the output payload that is returned (a quick scripted way to capture the same payload is sketched just after this list).

3. Put that output in a .json or .xml file, and write a Job that reads it (using the appropriate XML and/or JSON parser Steps in the Assembly). Make sure the Job works perfectly and obtains all the properties, elements, attributes, etc. that you are expecting. If the returned response has multiple instances within it, be sure you are getting the proper number of rows. Set that Job aside.

4. Write another Job that uses the REST Step and just tries to return the payload, intact, and save it to a file. I have included a .dsx for performing this validation. Make sure that Job runs successfully, producing the output that you expect and matching the output you saw when making the call in your browser.

5. NOW you can work on putting them together. You can learn how to pass the payload from one step to another and include your JSON or XML parsing steps in the same Assembly as the REST call, or you can simply pass the response downstream to be picked up by another instance of the Hierarchical Stage. Doing it in the same Assembly might be more performant, but you may have other reasons to pass this payload further along in the Job before parsing.
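As a side note, if you want a quick way to capture that payload outside of DataStage (the same thing your browser does in step 2), a small script can fetch the call, save the document, and count the repeating instances you expect to see as rows. This is only a minimal sketch: the URL and the "items" element are placeholders, so substitute the unsecured GET call that you have already tested.

```python
# Minimal sketch (not part of the .dsx): capture a REST GET payload so you have a
# known-good file to feed into the parsing Job from step 3.
# The URL and the "items" element name are placeholders for your own service.
import json
import urllib.request

url = "http://example.com/api/resource?format=json"   # hypothetical endpoint

with urllib.request.urlopen(url) as response:
    payload = response.read().decode("utf-8")

# Save the raw payload; this file becomes the input for the parsing Job.
with open("rest_payload.json", "w", encoding="utf-8") as f:
    f.write(payload)

# Sanity-check the number of repeating instances so you know how many rows
# the Hierarchical Stage should produce when it parses the same document.
document = json.loads(payload)
print("instances returned:", len(document.get("items", [])))
```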

One key technique when using REST with DataStage is the ability to "build" the URL that you will be using for your invocations. You probably aren't going to be considering DataStage/QualityStage for your REST processes if you only need to make a single call. More likely you want to repeat the call, using different input parameters each time, or a different input payload. One nice thing about REST is that you can pass arguments within the URL, if the REST API you are targeting was written that way by its designers.
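To make the idea concrete, here is a minimal sketch of building a URL per input row, the same idea as the upstream Derivation in the sample Job described below. The base URL and the parameter names ("id", "format") are illustrative only; use whatever arguments your target REST API actually accepts.

```python
# Minimal sketch: build one GET URL per input row, analogous to the upstream
# Derivation in the sample Job. Base URL and parameter names are placeholders.
from urllib.parse import urlencode

base_url = "http://example.com/api/assets"   # hypothetical REST endpoint

def build_url(asset_id: str, fmt: str = "json") -> str:
    """Return the GET URL for one input row."""
    return f"{base_url}?{urlencode({'id': asset_id, 'format': fmt})}"

for row_id in ["a1", "a2", "a3"]:            # stand-ins for values arriving on input rows
    print(build_url(row_id))
```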

In the Job that I have provided, you will see that the URL is set within the upstream Derivation. It is very primitive here — just hard coded. It won't work in your environment, as this is a very specific call to the Information Governance Catalog API, with an asset identifier unique to one of my systems. But it illustrates how you might build YOUR URL for the working REST call that you are familiar with from testing inside of your browser or other testing tool. Notice in the Assembly that I create my own "argument" within the REST step, which is then "mapped" in the Mappings section to one of my input columns (the one with the Derivation). The Job is otherwise very primitive — without Job Parameters and such — but it is simply an example to help you get started with REST.

Ernie

…another good reference is this developerWorks article by one of my colleagues:

https://www.ibm.com/developerworks/data/library/techarticle/dm-1407governrest/

BasicRESTvalidation.dsx

Open IGC is here!

Hi Everyone….

Been a while since I’ve posted anything — been too busy researching and supporting many new things that have been added in the past year — for data lineage, for advanced governance (stewardship and workflow), and now “Open IGC”. This is the ability to create nearly “any” type of new object within the Information Governance Catalog and then connect it to other objects with a whole new lineage paradigm. If you are a user of Extensions (Extension Mapping Documents and Extended Data Sources), think of Open IGC as the “next evolution” for extending the Information Server repository. If you are a user of DataStage, think of what it would be like to create your own nested objects and hierarchies, with their own icons, and their own level of “Expand” (like zoom) capability for drilling into further detail.

This new capability is available at Fix Central for 11.3 with Roll-up 16 (RU 16) and all of its prerequisites (FP 2, among other things).

So exactly what is this “Open IGC”?

Open IGC (you may also hear or see “Open IGC for Lineage” or “Open IGC API”) provides us with the ability to define our very “own” object types. This means having them exist with their own names, their own icons, and their own set of dedicated properties. They can have their own containment relationships and define just about “anything” you want. They are available via the detailed “browse” option, and appear in the query tool. They can be assigned to Terms and vice versa, participate in Collections, and be included in Extension Mappings. And then, once you have defined them, you can describe your own lineage among these objects, also via the same API, and define what you perceive as “Operational” vs “Design” based lineage (lineage without needing to use Extensions, and supporting “drill down” capabilities as we see with DataStage lineage).

Here are some use cases:

a) Represent a data integration/transformation process…or “home grown” ETL. This is the classic use case. Define what you call a “process” (like a DataStage Job), its component parts (the subparts like columns and transformations), and the properties that are critical. Outline the internal and external flows between such processes and their connections to other existing objects (tables, etc.) in the repository.

b) Represent some objects that are “like” Extended Data Sources, but where you want more definition…such as all the parts of an MQ Series or other messaging system configuration: objects for the Servers, the Queue Managers, and individual Queues. Give them their own icons, and their own “containment” depths and relationships. Yes — you could use Extensions for this, but at some point it becomes desirable to have your own custom properties, your own object names for the user interface, and your own creative icons!

c) Overload the catalog and represent some logical “concept” that lends itself to IGC’s graphical layout features, but isn’t really in the direct domain of Information Integration. One site I know of wants to show something with “ownership”…but illustrate it graphically. They are interested in having “responsibility roles” illustrated as objects…whose “lineage” is really just relationships to the objects that they control. Quite a stretch, and it would need some significant justification versus using tooling more appropriate for this use case, but it is very doable via this API.

It’s all based on XML and REST, and does not require that you re-install or otherwise re-configure the repository. You design and register a “bundle” with your new assets and their properties, and then use other REST invocations to “POST” new instances of the objects you are representing.
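To give a feel for the flow, here is a minimal sketch of those two calls using the third-party requests library. The host, credentials, endpoint paths, bundle id, and asset payload are all illustrative assumptions on my part; consult the Open IGC documentation for the exact URLs and the XML/JSON shapes expected on your release.

```python
# Minimal sketch of the two Open IGC steps described above: register a bundle, then
# POST instances of the new asset types. Endpoint paths, bundle id, and payload shape
# are assumptions for illustration only -- verify against the Open IGC documentation.
import requests

host = "https://igc-server:9445"          # hypothetical Information Server host
auth = ("isadmin", "password")            # credentials for the REST API

# 1. Register the bundle (an XML descriptor plus icons, packaged as a zip).
with open("MyBundle.zip", "rb") as bundle:
    r = requests.post(f"{host}/ibm/iis/igc-rest/v1/bundles",
                      auth=auth, files={"file": bundle}, verify=False)
    r.raise_for_status()

# 2. POST instances of the new asset types defined in that bundle.
assets = {
    "bundleId": "MyBundle",
    "assets": [
        {"_type": "$MyBundle-QueueManager",
         "name": "QMGR1",
         "short_description": "Demo queue manager"}
    ]
}
r = requests.post(f"{host}/ibm/iis/igc-rest/v1/bundles/assets",
                  auth=auth, json=assets, verify=False)
print(r.status_code)
```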

Quite cool…and more to come…I will be documenting my experiences with the API and the various use cases that I encounter.

What use cases do YOU have in mind?    :)

Ernie

New Recording on DataStage/QualityStage Lineage!

Hi Everyone…

Our engineering team just posted a very nice, short recording on how easily you can view Data Lineage for your existing DataStage and QualityStage Jobs after simply importing them into 11.3!

https://www.youtube.com/watch?v=zBHdC0lxLDc

The Jobs do NOT have to be “running” in 11.3. They can continue to run in their current production environment while you take advantage of all the new metadata features in 11.3. You import the Jobs and can also import the Operational Metadata from your earlier releases.

Ernie

DataStage and Minecraft (just for fun…)!

Hi Everyone!

Do your kids play Minecraft? Do you? Here is a “just for fun” recording of DataStage in a Minecraft world….. if you use DataStage and you and/or your family members play Minecraft, we hope you’ll enjoy this little adventure into the “world of Transformation”….. ; )

http://youtu.be/YFbLxbPuScA

Ernie

Methodology for Building Extensions

Hi Everyone…

In the last post I talked about “why” Metadata Extensions are useful and important (Building Metadata Extensions…“Why?”). Today I want to review the basic steps in a high-level “methodology” that you can apply when making decisions about extensions and their construction, helping you meet the objectives of your governance initiatives.

Step 1. Decide what you want to “see” in lineage and/or accomplish from a governance perspective. Do you have custom Excel spreadsheets that you would like to illustrate as little “pie charts” in a lineage diagram, as the ultimate targets? Do you have mainframe legacy “green screens” whose names business users would like to see as “icons” in a business lineage report? Are there home grown ETL processes that you need to identify, at least by “name”, when they move data between a flat file and your operational data store? Lineage helps boost users’ confidence, whether they are report and ETL developers, DBAs tracking actual processes, or reporting users who need some level of validation of a data source. Which objects are “missing” from today’s lineage picture? Which ones would add clarity to the users’ “big picture”? Each of the use cases above represents a scenario where lineage from the “known” sources (such as DataStage) wasn’t enough. There are no industry “bridges” for custom SQL, personalized spreadsheets, or home grown JavaScript. And in the green screen case, the natural lineage that illustrated the fields from a COBOL FD left business users confused and in the dark.

The “…accomplish from a governance perspective” in the first sentence above takes this idea further. The value of your solution is not just lineage — it will be valuable to assign Stewards or “owners” to those custom reports, or expiration dates to the green screens. Perhaps those resources are also influenced by formal Information Governance Rules or Terms in the business glossary. The need to manage those resources, beyond their function in lineage, is also something to measure.

Step 2. How will you model it inside of Information Server? Once you know which objects and “things” you want to manage or include in lineage, what objects should you use inside of Information Server? The answer to this is a bit trickier. It requires that you have some knowledge of Information Server and its metadata artifacts: how they are displayed, which ones exist in a parent-child hierarchy (if that is desirable), which ones are dependent upon others, what their icons look like in data lineage reports, etc. There aren’t any “wrong” answers here, although some methods will have advantages over others. There are many kinds of relationships within Information Server’s metadata, and nearly anything can be illustrated. Generally speaking, if the “thing” you are representing is closest in concept to a “table” or a “file”, then use those formal objects (Database Tables and Data Files). If it is conceptual, consider a formal logical modeling object. If it looks and tastes like a report, then a BI object (pie chart) might be preferred. If it is something entirely odd or abstract (the green screen above, or perhaps a proprietary message queue), then consider an Extended Data Source. I’ll go into more detail on each of these in later posts, but for now, from a methodology perspective, consider this your planning step. It often requires some experimentation to determine how best to illustrate your desired “thing”.

Step 3. How much detail do you need? This question is a bit more difficult to answer, but consider the time-to-value needed for your governance solution, and what your ultimate objectives are. If you have a home grown ETL process, do you need to illustrate every single column mapping expression and its syntax? Or do you just need to be able to find “that” piece of code within a haystack of hundreds of other processes? Both are desirable, but of course, there is a cost attached to capturing explicit detail. More detail requires more mappings, and potentially more parsing (see the further steps below). A case in point is a site that is looking at the lineage desired for tens of thousands of legacy COBOL programs. They have the details in a spreadsheet that will provide significant lineage: module name, source dataset, and target dataset. Would they benefit from having individual MOVE statements illustrated in lineage and searchable in their governance archive? Perhaps, but if they can locate the exact module in a chain in several minutes — something that often takes hours or even days today — the detail of that code can easily be examined by pulling the source code from available libraries. Loading the spreadsheet into Information Server is child’s play — parsing the details of the COBOL code, while interesting and potentially useful, is a far larger endeavor. On a lesser note, “how much detail you need” is also answered by reviewing Information Server technology and determining things like “Will someone with a Basic BG User role be able to see this ‘thing’?”…which leads to “Do I want every user to see this ‘thing’?”. Also important is whether the metadata detail you are considering is surfaced directly in the detail of lineage, or whether you have to drill down in order to view it. How important is that? It depends on your users, their experience with Information Server, how much training they will be getting, etc.

Step 4. Where can you get the metadata that you need? Is it available from another tool or process via extract? In XML? In .csv? Something else? Do you need to use a Java or C++ API to get it? Do you have those skills? Will you obtain the information (descriptions, purposes) by interviewing the end users who built their own spreadsheets? Is it written in comments in Excel? Some of the metadata may be floating in the heads of your enterprise users and other employees. Structured interviews may be the best way to capture that metadata and expertise for the future. Other times it is in a popular tool that provides push-button exports, or one that has an open-enough model to go directly after its metadata via SQL. ASCII-based exports/extracts have proven to be one of the simplest methods. Governance teams are usually technical, but they rarely have resources with lower-level API skills. Character-based exports, whether XML, .csv, or something else, are often readable by many ETL tools, by popular scripting languages like Perl, or can even be manipulated by hand with an editor like Notepad. I use DataStage because it’s there and I am comfortable with it — but the key is that you need to easily garner the metadata you decided you need in the previous steps.
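As a small illustration of how approachable a character-based export can be, here is a minimal sketch that reads a hypothetical .csv listing legacy modules with their source and target datasets (the kind of spreadsheet described in Step 3). The file name and column headings are placeholders for whatever your own export produces.

```python
# Minimal sketch: harvest metadata from a character-based export. The file name and
# column headings (MODULE, SOURCE_DATASET, TARGET_DATASET) are placeholders.
import csv

with open("legacy_modules.csv", newline="", encoding="utf-8") as f:
    flows = [
        {"module": row["MODULE"],
         "source": row["SOURCE_DATASET"],
         "target": row["TARGET_DATASET"]}
        for row in csv.DictReader(f)
    ]

print(f"captured {len(flows)} module-level flows")
```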

Step 5. Start small! This could easily be one of the earlier steps — the message here is “don’t try to capture everything at once”. Start with a selected set of metadata, perhaps related to one report, or one project. Experiment with each of the steps here with that smaller subset — giving you the flexibility to change the approach, get the metadata from somewhere else, model it differently or change your level of detail as you trial the solution with a selected set of users. Consider the artifacts that will have the most impact, especially for your sponsors. This will immediately focus your attention on a smaller set of artifacts that need to be illustrated for lineage and governance, and allow you to more quickly show a return on the governance investment that you are making.

Step 6. Build it! [and they will come :) ] Start doing your parsing and construct Extensions per your earlier design. Extension Mapping Documents are simple .csv files…no need for Java or .NET or other types of API calls. Adding objects and connecting them for lineage is easy. Extended Data Sources, Data Files, Terms, BI objects — each is created using simple .csv files or, in the case of Terms, XML. I suggest that you do your initial prototypes entirely by hand. Learn how Extensions and other such objects are structured, imported, and stored. As noted earlier, I will go into each of these in more detail in future posts, but all of them are well documented and easily accessible via the Information Server user interfaces. Once you have crafted a few, test the objects for lineage. Assign Terms to them. Experiment with their organization and management. Assign Stewards, play with adding Notes. Work with Labels and Collections to experience the full breadth of governance features that Information Server offers. Then don’t wait — get this small number of objects into the hands of those users — all kinds of users. Have a “test group” that includes selected executives, business reporting users, and decision makers in addition to your technical teams. Get their feedback and adjust the hand-crafted Extensions as necessary. Then you can move on and investigate how you’d create those in automated fashion while also loading them via the command line instead of via the user interfaces.
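To show how little machinery “Build it!” really takes, here is a minimal sketch that writes an Extension Mapping Document style .csv. The rows stand in for the module-level flows harvested in Step 4, and the column headings are placeholders; the Information Server documentation defines the exact layout the Extension Mapping import expects, so adjust before loading.

```python
# Minimal sketch: generate an Extension Mapping Document style .csv. The flow rows
# and the column headings below are illustrative placeholders -- confirm the exact
# layout expected by the Extension Mapping import before loading.
import csv

flows = [   # stand-ins for rows harvested from the legacy export in Step 4
    {"module": "MOD0001", "source": "PROD.CUSTOMER.MASTER", "target": "ODS.CUSTOMER"},
    {"module": "MOD0002", "source": "ODS.CUSTOMER", "target": "DW.DIM_CUSTOMER"},
]

with open("legacy_module_mappings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Rule", "Description"])   # illustrative headings
    for flow in flows:
        writer.writerow([flow["source"], flow["target"], flow["module"],
                         f"Generated from legacy module {flow['module']}"])
```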

Keep track of your time while doing these things so that you can measure the effectiveness of the solution vis-à-vis the effort that is required. For some of your extensions, you may decide that you only need a limited number of objects, and that they almost never change — and that no future automation will be necessary. For others, you may decide that it is worth the time to invest in your own enterprise’s development of a more robust parser or extract-and-create-extension mechanism that can be re-run as metadata stores change over time. This also makes it simpler to determine when it makes sense to invest in an IBM partner solution for existing “metadata converters” that already load the repository. These are trusted partners who work closely with all of us at IBM to build solutions that have largely answered the methodology questions above in their work at other locations. IBM Lab Services can also help you build such interfaces. When appropriate and market forces prevail, IBM evaluates such interfaces for regular inclusion in our offerings.

Ultimately, this methodology provides you with a road map towards enhancing your governance solution and meeting your short and longer term objectives for better decision making and streamlined operations via Information Governance.

-Ernie

Business Glossary and Cognos — Integrated together…

Hi Everyone…

Just wanted to share a video I completed today that illustrates the integration of Cognos reporting with InfoSphere Business Glossary… It shows a user inside of Cognos using the right-click integration that Cognos provides to do a context-based search into Business Glossary, display a term, and then navigate further through the metadata to find details about a value and concept in the report.

This is very much like the Business Glossary Anywhere, except that it is a capability built directly into the Cognos Report Studio and Cognos Report Viewer. Enjoy!

Ernie
