DataStage and DB2 pureXML

Just saw a great new article on DataStage and DB2 pureXML on developerWorks, written by some of my IBM colleagues who drill deep into some very cool aspects of XML and it’s integration with DB2 and DataStage. I’ve dabbled a bit myself into pureXML, but only enough to scratch the surface. This article has some good pointers and is just what I need to push me to spend some more time there…….

Enjoy!

http://www.ibm.com/developerworks/data/library/techarticle/dm-0908datastagepurexml1/index.html?S_TACT=105AGX11&S_CMP=FP

Ernie

Data Lineage and Web Services

The requirement to call Web Services from ETL (Extraction-Transformation-Load) is becoming increasingly common. Web Services are being included in ETL streams as a source or target, or for re-use of a critical calculation or function. Often they are invoked on a row-by-row basis for Transformation purposes. Web Services are defined by their WSDL (Web Services Description Language) documents, but most times that is just a “signature” that identifies the input and output parameters. What happens “inside” the Web Service? What actions are being performed on those input parameters? How useful would it be to know exactly what is being done to your data when it is sent to a remote “black box?” Services repositories, if utilized at your site, provide a rich source of metadata, and may include descriptions of the function, who owns it, when it was last updated, and provide automation for the bindings. However, many sites have not standardized on one particular repository, and more importantly, even if such information is available, it is unlikely that the repository participates in data lineage.

Data lineage is the ability to view a complete source-to-target flow, thru every process, every function, and every intermediate step. It illustrates what happens to a column, how it is used in expressions and when its name changes or when it is combined with another column to create something new. Data lineage helps decipher where a value came from, and how it was derived. Do your lineage reports have visibility into the “black boxes” that you are calling remotely? Would you like to “see” where the data flows when it goes “outside” the system? The IBM InfoSphere Metadata Workbench has a new capability called the Data Lineage Extender. This new feature lets me describe the details of any external process and include it in all of my metadata exploration and analysis.
In this blog I try to focus on real-time issues, and so will start with the benefits of defining a remote Web Service; however, the Data Lineage Extender and the “Extension Mappings” that it offers are not limited to Web Services. Extensions, as they are also called, can be used to define external scripts, Stored Procedures, Java classes, COBOL programs, third party transformation tools — anything that you believe is important for completing your lineage picture.

Extensions appear as special new objects in the Metadata Workbench, can be queried and viewed like any other object in the repository, and as noted earlier, participate fully in data lineage reports and graphs. I had the chance to start working with this new capability a few months ago in our lab, and believe that it is going to help fulfill the lineage requirement of many data integration applications that have “missing links” or “gaps” whether or not they revolve around DataStage and QualityStage.

To establish an Extension, first identify the external process you want to include in your lineage. Then decide how much detail is desired. The Extensions support specifications at many levels, including Table, Column, DataStage Stage, etc. and support one to one, one to many, and many to one relationships/mappings to other Extensions / native DataStage objects. (for example, First Name and Last name from Object A map to Name in Object B) Where possible, define the mappings of your external object at the column level.

Let’s look at an example. Here is a snippet of a Job that reaches out to a remote “Lookup” Web Service. The WSDL was brought into the Job, but there is no visibility to the Service except for the input and output parameters*:

pic1flow

A data lineage report without Extensions will simply show that the Web Services Transformer is invoked, but with little other detail. To “complete the lineage picture,” we will want to create an Extension that defines what happens within the Lookup. This is done inside the Metadata Workbench. The screen shot below shows the details of a new Extension Mapping. It contains the columns involved in the lookup, commentary about the type of Lookup that (below) yields the “C0_CorpExpFactor” value, and also provides the ability, via balloon help, to see the ultimate source of my Lookup Column:

pic2ExtMapping

Now that the Lookup Extension is defined, it will appear in a Data Lineage report. This next screen shot illustrates a data lineage report at the column level. On the left we see the lineage for the column CORPEXPFACTOR, and where it came from, and on the right we see the context for each column, whether part of a DataStage Link and Job, or within an Extension Mapping (LookupWebService).

pic3textlineage

Clicking on the “LookupWebService” icon or link on the right will take you to the details of the definition (the previous screenshot of the Extension Mapping Document) where you can review the methods used for the lookup and any commentary from the author. This first lineage view is called the “textual” view of Data Lineage. When requesting a lineage report I can alternate between this textual view and the “graphical” view:

pic4graphRpt

Over the next few weeks I’ll be documenting my experiences with this feature, outlining the various use cases, and sharing the techniques and methods used to define Extension Mappings.

Ernie

* …of course, philosophically, one may wish to be entirely abstracted from a remote Web Service. “Who cares what tables it accesses, or what procedures it uses, provided I can trust it and rely on its execution” That’s one of the benefits of Service Oriented Architecture, right? This is a debate with many valid positions, however, most sites I speak to want to ensure extensive re-use of their Services, and simpler use/maintenance, but still maintain visibility into data quality and source/target details. Looking at WSDL alone doesn’t tell you very much.: If your directory strategy provides more than just a WSDL catalog, great. The Extensions described here may help complement what you are already doing, or “fill in the gaps” where there are holes.

ETL vs EAI for Real Time …revisiting an old subject…

Here’s a topic I hear less and less about these days. There are many reasons for this, the most likely being that the lines continue to blur between these technologies, especially now that various independent vendors in each space have been absorbed or morphed into other things. However, it raises its head every once in awhile, and even if classic “ETL” (extraction-transformation-load) and “EAI” (enterprise application integration) are under the same tool umbrella, it’s likely that choices will still need to be made as to the use of underlying features and capabilities.

Thought I’d share some thoughts that were discussed on a recent conference call with a site who is working thru the “ETL vs EAI for Real Time” decision tree. These are some of the major issues to consider (no particular priority attached to their order):

Protocol/medium. MQ Series? JMS? Tibco? Socket? Named Pipe? Sonic? MSMQ? other? Many of the EAI tools that still exist have more of these protocols “in the box.” ETL tools — not always. How hard or easy is it to extend your choice of tool.

The “shape” of the source. Is it relational in nature (rows and columns)? or hierarchical (tree like structure)? XML? COBOL with multiple record types, OCCURS, variable lengths? …or fixed Format? Both ETL and EAI typically handle these, consuming metadata from XML Schema, COBOL, etc. Exceptions would be things like SWIFT or EDI and EDIFACT, where EAI-style tools such as IBM WebSphere Transformation Extender typically have legacy history and metadata import capabilities. Shape of the source is important, but becomes even more significant in light of the next item.

The “shape” of the target. Is it relational (rows and columns)? Tables in an rdbms? Flat file? …or is it also XML with a complex multi-path hierarchy or COBOL with OCCURS depending on, variable length and mulitiple record types, or SWIFT, EDI, EDIFACT, HIPPA, etc.? ETL tools are best at performing relational work. Most can all do hierarchical as well — but often at the price of complexity. If the source and the target are hierarchical, how many transformations of this type are you doing? Just a few, or 98% of your project?

Data Volume.ETL tools excel here. Parallelism and grid work is a given. Being able to manage huge volumes of data, multiple terabytes, in batch, are where they are regularly exercised. EAI tools typically do not fare well in huge volume scenarios, or at least not without crafting parallelism and other configurations that are out-of-the-box with tools like DataStage.

Units of Work.Transactional paradigms. Delete from one rdbms table and write to another. Commit. If either resource fails, rollback everything. Delete from one queue and write to another, or to a relational table. Commit. If either resource fails, rollback. ETL tools can do some of this and do it well, but are often limited in the choices of queues and databases supported. In EAI tools, the semantics are usually simpler and more at the “core” of any EAI tool, richer in resource support (which databases, queuing systems, etc.) and often provide more flexibility and control to react to failures.

Skills.Beauty is in the eyes of the beholder…and ease of use/learning curve is no different. Still, ETL tools, while far from being designed for ‘end-users,’ have typically appealed to “data professionals” with technical skills but not necessarily hard core programming skills. EAI tooling is usually camped in the more technical domain. There is a lot of grey area and overlap here, but the history is clear — EAI tools have a rich development history of providing ensured delivery for binary EDI formats running 24 by 7 — and ETL tools have a rich history of providing consumable transformations for large volumes of data to support business intelligence initiatives. Enough said.

Long ago I recall a great write-up from an analyst who concluded that most companies require both technologies because the goals of each tool are so different. Years later, many companies have both technologies in their toolbox — so it may not be a purchasing decision as much as an implementation one. There are no wrong answers here — just answers that might lead to a bumpier road down the line.

Good luck as you review your variables.

Ernie

Facebook puts a new spin on “real-time”

Well — I finally took the plunge and registered on Facebook. It is something I have avoided for awhile, even as Facebook has made crept into the business world. It’s certainly easy to do, and after one or two “friends” have invited you or accepted you to their “wall party,” it’s amazing to see the messages and notes fly! I’m still an infant there, not even qualified to say that my experience rates me as a “novice,” but the time spent there thus far raises several interesting thoughts:

First, I find it overwhelming to think of all the places a random thought will travel as it moves among friends walls/places/pages….and whether that’s a good thing or not. Online etiquette has grown-up over the past few years, and self-censoring is something we all do these days, even in our email. Still….Facebook may be “too” realtime in that sense. I suppose the real issue here is that I can’t imagine anyone caring at any particular moment that I’m currently updating my blog or preparing for a project out in the yard. I’m probably not a good candidate for a Twitter account! ; ) I suspect that this point of view may be altered as I spend more time on Facebook, but that’s my initial impression.

Second is how cool this technology would be for a business team, and in this discipline, and Information Server data integration team. With the right filtering, wouldn’t it be informative to have automated events and notes, driven by each user, showing up on a “Project Wall?” Such technology may exist already, or perhaps even Facebook supports APIs that would make it possible. “jsmith has just compiled DataStage Job LoadFact.” or “droberts has entered a Note about column Revenue in table Q461T711: There is an anomaly in some of the date fields….”. And from the runtime engine (again with filtering), it might be useful to know that “QualityStage Job mergeNames completed at 1:16:42 with no errors.” . We can certainly do a lot of this with instant messaging, but the broadcast capability here is inviting, as is the “no one has to think about it” aspect of event based messages, if possible, being floated throughout the team.

Guess I’m going to have to explore Facebook further…

Ernie

Anyone else “fried” by Nulls?

Just finished another lengthy debugging session, helping resolve a problem with null handling. Nulls can be a struggle at times in DataStage, but that is hardly the issue. It’s painful for every part of our industry. Do a search on “problems with nulls” — you’ll get hits with subjects on just about every database, product offering, language and package that exists. Five minutes reading and you’ll find lengthy diatribes on both sides of the fence (liking or disliking) regarding nulls and null handling. I remember in the early 80’s how “nulls” would simplify many types of processing, improve the quality of COUNTs and averages, and make lives simpler and applications more manageable. I wonder in hindsight how many billions of dollars in man-hours have been spent (wasted!) trying to correct the problems introduced by nulls in the past 25 years, the incompatibility of null handling among tools and processes, and the errors introduced by the disparity of skill sets and expertise that exists across the spectrum on this topic. I at least hope that I’ve been able to clear the smog surrounding nulls in my little world here on this blog (I have an entry on NULLs and their impact on XML)….

Ernie

Bowling has become “too” real-time

Took the family bowling yesterday. What happened to what used to be a relaxing afternoon in the lanes, trying to knock down a set of pins 60 feet away with a heavy round ball? Technology has been changing this sport for a long time, so this isn’t new to anyone, but it struck me yesterday how the casual bowling experience is entirely different than it was twenty plus years ago. I’m sure the owners of the alleys are happy, as they are able to cram more bowlers into an afternoon than ever before, and the accountants for those alleys are happy, as they can track more effective use of each alley and indeed every ball that is thrown down the lanes, but is the casual player still happy?

Does anyone else miss coloring in the little boxes? …teaching the little ones how to add up the score after a spare or strike? …taking a practice turn at the start to get a “feel” for the lanes? The score is done instantly, and the pin positions after the first ball immediately identified (now, there’s a great “real time” feature). Years ago the area behind the lane had lots of room for everyone, and we used to fight over who could sit at the little lighted table with the overhead projector so the score could be placed overhead. There was advertising on the edges of the score sheet, and lots of room for doodles. It was bad luck to complete the score when someone had a run of strikes, and there was an “extra” throw for fun if you had two strikes and (say) a seven on the third throw of the tenth frame.

Now it’s all done for you. The cashier entered our names at the counter, we prepaid for all the games we would play (and the next family was “waiting” to enter the lane when we were almost thru), you couldn’t compare one game vs another, and the whole thing was rushed. We finished two games in 1/2 the time that it took years ago and the afternoon was not as satisfying as it could have been. Something was missing. Perhaps it was just the location. Hard to say.

Technology is great, and as a data integration specialist, I’m intrigued about the data warehouse some of the companies must have, breaking down revenue by the number of balls thrown. But this is one place where “real time” has diminished the overall experience.

…another way to load Terms into InfoSphere Business Glossary

copy-of-createbgimportscsv
copyofcreatebusinesstermsandattributesxmldsx3

Here are a few other Jobs for loading new Terms and Categories into Business Glossary. Like the earlier post on Business Glossary, these DataStage Jobs read a potential source of Terms (just alter the source stage as needed) and then create a target csv file that is in the correct format for loading into Business Glossary using the new 8.1 csv import/export features available at the Information Server Web Console… Glossary tab. The Jobs are fairly well annotated and should be self explanatory. I haven’t yet set them up for Custom Attributes, nor have they been widely tested —– but they are already being implemented at a variety of locations. Please let me know if you find them useful.

Ernie

(the one with “XML” in the name is the same as the prior blog entry. Each is named .doc, but is actually a .dsx file).

RSS for your Data Transformations!

Huh? What? I thought RSS was for news feeds from CNN or NBC!? The best way to get the latest on the election (like I haven’t had enough) or the latest storm! Instant updates to my browser or other cool electronic gadget.

Well, RSS (Really Simple Syndication in most circles) is coming to your business solutions, if it hasn’t already. You may already be accustomed to that cute little “radio beam” icon, usually orange, that appears on this blog, most web sites, and many other places that you frequent on the internet. You may have an RSS “reader” or “aggregator” in your browser, dedicated on your desktop, or enabled for your Blackberry. Something that updates your favorite team’s box scores, the latest news from IBM or your favorite vendor, or as noted above, the latest gossip in the race for US President. How about sending out intelligent data oriented updates driven by DataStage or a customized SQL statement?

Moving into the mainstream, RSS is not only for updated news clips. Let your executives aim their RSS reader at your customer update feed, and they’ll know whenever a new transaction is generated or when a customer logs a complaint. Behind the scenes, you can have your favorite ETL tool, tickled by request from an RSS reader, pull the latest updated profile for a given customer. The “publish date” that is part of the RSS standard will be used by the sales rep’s RSS reader to highlight new material, making it bold, alternate color, or whatever else their chosen reader is set up to perform. You monitor and supply the data, the user chooses the RSS tooling . That’s freedom that makes everyone happy.

Information Server release 8.1 delivers RSS in the new Information Services Director (the next release of wISD/ISD). I’ve had the pleasure of playing with this new capability for the past few weeks, and it opens up a lot of possibilities [it's also been a fun way to start learning about this new SOA medium :) ]. Changed Data Capture tools, such as InfoSphere CDC (former DataMirror) can recognize log events, trigging DataStage loads to a target system, and then “always ready” RSS Services delve into the data to find updated gems for selected customer segments or territories. The RSS capability is a new binding added to existing SOAP and EJB capabilities, and is complemented also by the new 8.1 bindings for REST and JMS. I’ll dig deeper into these and other real-time related Info Server features in some upcoming posts.

Web 2.0 is coming. Don’t let it pass you by!

Ernie