RSS for your Data Transformations!

Huh? What? I thought RSS was for news feeds from CNN or NBC!? The best way to get the latest on the election (like I haven’t had enough) or the latest storm! Instant updates to my browser or other cool electronic gadget.

Well, RSS (Really Simple Syndication in most circles) is coming to your business solutions, if it hasn’t already. You may already be accustomed to that cute little “radio beam” icon, usually orange, that appears on this blog, most web sites, and many other places that you frequent on the internet. You may have an RSS “reader” or “aggregator” in your browser, dedicated on your desktop, or enabled for your Blackberry. Something that updates your favorite team’s box scores, the latest news from IBM or your favorite vendor, or as noted above, the latest gossip in the race for US President. How about sending out intelligent data oriented updates driven by DataStage or a customized SQL statement?

Moving into the mainstream, RSS is not only for updated news clips. Let your executives aim their RSS reader at your customer update feed, and they’ll know whenever a new transaction is generated or when a customer logs a complaint. Behind the scenes, you can have your favorite ETL tool, tickled by request from an RSS reader, pull the latest updated profile for a given customer. The “publish date” that is part of the RSS standard will be used by the sales rep’s RSS reader to highlight new material, making it bold, alternate color, or whatever else their chosen reader is set up to perform. You monitor and supply the data, the user chooses the RSS tooling . That’s freedom that makes everyone happy.

Information Server release 8.1 delivers RSS in the new Information Services Director (the next release of wISD/ISD). I’ve had the pleasure of playing with this new capability for the past few weeks, and it opens up a lot of possibilities [it's also been a fun way to start learning about this new SOA medium :) ]. Changed Data Capture tools, such as InfoSphere CDC (former DataMirror) can recognize log events, trigging DataStage loads to a target system, and then “always ready” RSS Services delve into the data to find updated gems for selected customer segments or territories. The RSS capability is a new binding added to existing SOAP and EJB capabilities, and is complemented also by the new 8.1 bindings for REST and JMS. I’ll dig deeper into these and other real-time related Info Server features in some upcoming posts.

Web 2.0 is coming. Don’t let it pass you by!

Ernie

Using Shortcuts with Information Services Director

If you are publishing Services with the Information Services Director (ISD, a.k.a. wISD), you may find it useful to take advantage of the shortcut feature that is available in the graphical interface. The navigation inside of the Console can be a bit confusing when you first use it, and if you aren’t a frequent user, these shortcuts will help when you return to manage the Web Services and other SOA artifacts you are generating with your DataStage and QualityStage Jobs and SQL Stored Procedures and Queries.

Create shortcuts and navigation aids. I recommend that you setup shortcuts to simplify things and help you remember the different workspaces that you will be using as you gain experience with ISD. These are my preferences, which of course you can change later, but I find that they make the experience with ISD, especially when you are first learning it, much easier.

a. Go to the Dashboard view (#2 icon…first to the right of the circle). Look around, click among the options on the left and read what appears on the right, but don’t click on anything at the right. After you’ve spent some time looking at the dashboard, click on the small “arrow” icon at the upper right and create a shortcut. I call mine “theDashboard.”

b. Now find “First steps” on the left, click on “Setup Information Services Connections” and then click on the link called “Go to the Information Services Connections workspace” on the right. Don’t do anything here yet – just click on the small arrow (shortcut) icon on the upper right and create another shortcut. I call this one “myProviders”.

c. Pull down Edit…Preferences…Key Commands. Find “toggle shortcut palette.” Choose Edit (button on lower right)…now choose “Add” and you will see a pull down (default is “none”). Choose a function key for this toggle switch. I use F2. Say ok and the window should close.

d. Now press F2. Your shortcut bar should open. Personally, I like to have this be its own window, as the opening bar is sometimes confusing. If you want to do this, drag the shortcut tab “off” it’s docked location and you will end up with an independent window. Click on your previously created “theDashboard” shortcut.

e. Back at the Dashboard, open Design…click on Create an Application and then click on “Go to the Applications Workspace” on the right. When you get there create another shortcut called myApplications. Press F2 and return again to the Dashboard.

f. Now open Run time activities…click on Manage Deployed Applications…and then click on “Go to the Deployed Information Services workspace” at the right. Once again, create a shortcut. I call mine myDeployedApps.

g. Lastly, return again to My Home (use the circle icon). I like to have a shortcut for this page also, called myHome.

h. Press F2. You should have five shortcuts:
i. myApplications
ii. myDeployedApps
iii. myProviders
iv. theDashboard
v. myHome

Now toggle your shortcut window again with F2, play with your shortcuts, and bounce around ISD and explore, just getting familiar with the available workspaces, options and capabilities!

Ernie

DataStage and TIBCO….anyone done it?

Yes indeed!

This post was something I starting writing last December, and at that time only had vague memories of sites connecting DataStage with TIBCO queues. And those memories were of C++ code used in a custom plugin to communicate with TIBCO Rendevous. I wanted to see if anyone out there had tried it in one fashion or another with TIBCO and JMS (Java Message Service) Queues. Fast forward 9 months and I can tell you that it’s been done several times — by at least one of my colleages in IBM Lab Services (more than once), and most recently by myself for a DataStage evaluation.

I’m happy to report that it was a perfectly normal and standard JMS implementation, and JavaPack handled it well, yielding control to the Java Class that connected to the TIBCO JMS queue. Having long since proven that I could connect DataStage to BEA, JBoss, and Sun JMS, once I had the appropriate TIBCO libraries (jars) and connection properties, the class compiled and the Job just “worked.” Amazing how rewarding it is to get some “new” connectivity working. Makes you feel like a kid in a candy store.

Anyway, the process is very straightforward, assuming you are comfortable with Java and JMS (see my other Java entry for example class that works with DataStage). The “initialization()” method does all the start-up connectivity work, and the core of the JMS reading and writing is placed inside of the “process()” method. When the “process()” method is invoked by the DataStage engine, the class determines if the invocation is for an output link or an input link, and the JMS code communicates with TIBCO JMS and either writes to, or reads from, a selected Queue. Data to be written to a target queue is first read from the DataStage Input Link, and payloads read by the java class are “written” to the DataStage Output Link….

If you need to do this with DataStage, find a Java developer who knows JMS, and knows the particular provider…each provider has unique properties for things like “InitialContextFactory” and “ProviderURL” and values for the jndi names of the queues and their Connection Factories (lots of interesting Java stuff — too much to post here). JMS queues can have pure strings in “Text Messages” or have other types of messages with different formatting. There are headers to deal with here also. Not a place to jump into lightly if you are only immersed in DataStage. Instead, find a Java expert, and then work together with him/her to find the synergy you can get with JavaPack integrated inside the DS Engine.

Ernie

Handling Nulls in XML Sources

Reading XML is very straightforward when everything is present, all the elements are populated, and the entire XML “tree” exists in the current document that you are reading. However, when elements are missing from the document entirely, are “empty” (by XML definition), or present but formally indicated as null, things can get tricky. Nulls can be confusing enough as it is! I’ll try here to clarify the null issues you should be considering in your ETL processes, and outline specifically how such things are managed in DataStage when you use XMLInput…

For starters, let’s leave the “Repeating Element” (the one you mark as “key”) out of this blog entry. It has its own important considerations. We’ll concentrate for now on standard elements that may or may not appear in your document, may be empty, or may be formally marked as null (element and its tags exist, but it is empty with attribute xsi:nil=”true”). Let’s look at some examples.

Imagine the details of a document about products. It might have an element called “color.” Color could have a value:

<product>
<number>123</number>
<color>blue</blue>
<location>A</location>
.
.
.
</product>

 

Color could be “empty”:

 

<product>
<number>123</number>
<color></color> [or <color/> ]
<location>A</location>
.
.
.
</product>

 

Color could be formally null:

 

<product>
<number>123</number>
<color xsi:nil=”true”></color>
<location>A</location>
.
.
.

</product>

 

…or be entirely missing (for a given product):

 

<product>
<number>123</number>
<location>A</location>
.
.
.
</product>

 

What constitutes a null value? When is it blank? When your tool reads and parses this XML document, what do you want it to do?

DataStage gives us several options. There are two check boxes in the Transformation tab of the Output Link. By default, these boxes are unchecked:

Replace NULLs with empty values [blanks]
Replace empty values with NULLs

Without doing anything, empty elements will be treated as blanks, and null indicated or missing elements will be treated as null.

If you only check “Replace NULLs with empty values,” nothing will be null anymore….

…and if you only check “Replace empty values with NULLs,” more of your rows will have nulls for this column, because anything that was blank in the default scenario will now be null.

Finally, if you check both boxes, only the empty elements will be reteived as null, because the missing elements and the formally null ones will be treated as blank.

Clear as mud, I know. ;)

Last week I worked with a customer who wanted to separate the xsi:nil=”true” elements from the ones that were “missing”. They wanted the xsi:nil=”true” to become null columns, and the ones that were truly missing to be assigned an actual string value of “MISSING.”. I’ll follow this up with an entry on using XSLT to perform subtle column assignments like this one.

-e

…test line added to validate a new RSS reader…

Posted in datastage, etl, xml. Tags: . 3 Comments »

Dealing with “primitive” XML

Yesterday I encounted an interesting XML scenario.   One I’ve seen before, but not in awhile.  It began as many do, with questions regarding the import of XML metadata.   Our importer was having problems reading the structure, and the issue trickled onto my desk.  Upon further inspection of the XML document instance, it became clear why we couldn’t read it, but later it was the structure of the document (and the issues encountered trying to generate it) that encouraged me to make this entry today.

The situation we enountered is one that I see on occasion, especially for “older” (usually 5 – 6 years) XML interfaces that were designed when XML was still the “new kid on the block” as far as technology is concerned.  XML has been around longer than six years, but it was around that time that XSDs were starting to blossom (vs DTDs) and XML was turning from its SGML roots and primarily text oriented usage (remember all the “Shakespeare” examples that used to be popular?) to being structured for datatypes and data oriented use cases (I’ve seen testimony by early XML experts who believe that the XML specification was “hijacked” by the database vendors of the world ;) ).   Whether you agree with that or not, be prepared to deal with XML structures like the following:

<RECORD><KEY>any record key value</KEY>
<NAME>FIRST_NAME</NAME><VALUE>ernie</VALUE>
<NAME>ADDRESS</NAME><VALUE>123 Main Street</VALUE>
<NAME>PHONE</NAME><VALUE>111 111 1111</VALUE>
<NAME>EMAIL</NAME><VALUE>ernie@xyz.com</VALUE>
<NAME>ACCOUNT_NUMBER</NAME><VALUE>22222</VALUE>

First of all, it can’t be read by the importer, because it is not a valid document.   So….try to open any document that you have in Internet Explorer, or other tool that handles valid XML.   If IE can’t read it, neither will DataStage.   In this case, it’s not a valid XML document.  It has no header (<?xml version=”1.0″ ?> or similar), and there is no closing tag for <RECORD>.  Further, it turns out that this one one physical record on a flat file, followed on each line by another identical structure.    So this “xml document” is actually a “collection” of xml “chunks”.    With a little bit of extra refinement, it could be a set of repeating nodes in a larger document, or perhaps at least complete documents, one per physical record (as it might be if you had complete documents in a message queue). 

The second “offense” in this document is the “repeating group that is not really a repeating group.”   Here we see the “primitive” design (read “poor” design).   There may be some debate here, but this is really the “easy” way out when trying to model your data structures with XML.    Part of the beauty of XML is the fact that the metadata is represented by the tags.   The element “NAME” above is not the representative meta data for the value “123 Main Street” — it’s simply a tag that contains the “name” of the column that represents “123 Main Street,” which is ADDRESS.   Primitive and lame.  I’m sure there are good reasons out there, but it’s still disappointing to see.    XML gives us the ability to have self describing metadata — far better it would be to have an element of <ADDRESS>123 Main</ADDRESS>, whose “value” is “123 Main”…..simply the value of the text within the element.  Why carry an extra element?    Why have all that additional data?   Do I need an element actually called “value”?   Why?  XML already handles that within its tagging mechanism!  

That being said, how do I go about creating this?   Tooling like DataStage for building XML are designed to bring in many rows from a source and aggregate them into the desired XML structure.  Repeating groups, in a well designed document, would be “rows” of PERSON, each with unique element names.    In this case, the xml repeating group is not a repeating group.    If we have 100’s of columns, it might be justifiable to first pivot the data, creating “n” rows for “m” columns, and then generate the XML noted above.   However, if we’re talking about a minimal number of columns, then it’s probably best to deal with “primitive” xml, with the least complex and most “primitive” solution…….  just hard code the XML string in a Transformation…….  in two minutes you will have been able to generate the right string just by concatenating the tags in question into a long string.  You don’t need any headers, you don’t need dynamic validation, no namespaces or complex nested repeating groups.  Just physical generation of a poorly designed XML “chunk”.    In DataStage, something like “<RECORD><KEY>”:linkname.KEY:”</KEY><NAME>PERSON</NAME><VALUE>”:linkname.PERSON
:”</VALUE><NAME>ADDRESS</NAME><VALUE>”:linkname.ADDRESS: …..and so on to the end of your record.

Not elegant….nor is the structure you are trying to match….   -ernie

 

Are your Lookups Real-Time Ready?

This is just a reminder for all of you using Lookups of any kind in a real-time ETL process. When the focus is on performance in a “batch” ETL flow, we often choose options for lookups that ask the ETL engine to pre-load reference tables into memory. Static lookups of this type might be fine for a real-time scenario, but may also be disasterous if you are expecting to get positive results on a lookup for a reference table that constantly changes.

Check your lookups and choose the right option for your needs.   I was recently reminded of this fact on a current project.  Something to always remember to consider.

In DataStage Server edition, lookups are dynamic by nature, and you need a hash table to force lookups into memory. In Enterprise Edition, however, you specifically need what is referred to as a “sparse” lookup. This is a property that you select inside the Stage.

-e

Calling a Web Service that supports Arrays for input and output

Web Service implementations, especially row-oriented ones, can be “chatty”. If you have a special web service that you’d like to invoke as part of your data integration job, it could be very expensive and time consuming to call it individually for every single row that passes by. If you have 1000 rows, you aren’t only running those 1000 rows thru some remote Web Service…you are also communicating a request and response via SOAP, 1000 times “on the wire.” See if the author of the Service is able to accept and return an “array”. Then you can minimize the traffic, calling the Service only once in this example of 1000 rows, and then processing the single response after it is returned. It could still be problematic, as now your SOAP Envelope to grow to large proportions, but odds are it will be better than individual calls.

To do this in DataStage requires that you be a bit creative. You will need to construct the SOAP Body for the input request prior to invoking the Web Service, and then parse out the details of the SOAP Body that is returned.

I’ve included a .dsx in this post, along with the WSDL that was used to create it. This is a web service I published locally on my own system, so it’s not available to be called externally and tested on your systems, but there are detailed annotations in the Job that outline the necessary steps for this technique. The Web Service it calls still conforms nicely to a row orientation (single relational set of columns in, and single set of relational columns out)…the only difference is that I’m sending a whole array of such rows to the service at one time.

One thing to consider as you build one of these — walk before you run.  Address the input side, and just send the output to a flat file instead of XMLInput as I’ve done here in this completed Job.   Once you have reviewed the flat file and can read it successfully with XMLInput, then you can put everything together.

Enjoy.

-e

callarraywebservicedsx3

arrayexamplewsdl2

I need “Google House” (how I found some valuable .dsx’s).

I just recently re-installed Google Desktop. What a life saver. A prior installation was causing me some problems with email, so a month ago I uninstalled it, and only this week found the time to download another edition and have it index my machine….. and “lo!” I found some .dsx files (DataStage Exports) that I’d forgotten I created a few years back. In particular, I’ve been meaning to write up some instructions on how to use arrays in Web Services, among other complicated Web Service invocations, and wanted to avoid re-inventing the examples from scratch. I found ‘em, and will see if I can clean them up, test them in version 8, and explain them here.

Bless all of you who can keep their hard drive in a perfectly sorted arrangement of easy to locate subdirectories and well documented files! If there was a “Google House,” maybe I’d be able to find some tools I misplaced after the last renovation project….. ;)

-e

Data Quality and Transformation as a Service

In your ETL jobs, do you ever perform “Lookups” to validate account numbers, confirm product codes, or return discount rates?  (I’m sure the answer is “of course you do”)  Are you certifying address lines as part of your batch ETL processes?    (for many of you, another “of course”)  How complex is the logic you are using?  How much work have you done to establish such lookups, certify addresses, and test the functionality?   …and maintain it?   What data sources, security issues and other techniques have you invested in?

Have you thought about how valuable those data integration activities might be to the rest of your organization?

Turn them into a “Service.” One that is easily re-usable, not only for the batch work you are doing today via ETL, but for your real-time applications — your java development teams building a portal, or .NET groups setting up front ends for remote devices.   Or perhaps for other applications that are performing internal communications using your company’s choice of enterprise service bus.

This is where the Information Services Director comes into play for information Server — providing the ability to publish DataStage, QualityStage, SQL queries and stored procedures as Services. Take the “guts” of your data integration activity (Lookups, Transforms, etc.) and “publish” them as a Service.   A “Service” that is supported by industry standards such as Web Services (SOAP over HTTP) and other protocols, along with the necessary artifacts that include a built-in directory and automatically generated WSDL that illustrates the metadata for your creation.

Services that focus on data integration are often most successful when they are built, documented, and maintained by the teams who truly understand the data. In many organizations, this is the same team that has been living with the data migration, transformation, and data warehousing applications.  They have invested time and energy researching the models, the legacy systems, and the oddities of the data under-the-covers.  They’ve built extensive transformations using ETL tooling. Why not exploit those skills, expertise, and business rule investments for benefits beyond the decision support systems?

I’m not talking about new stuff, bleeding edge creations, or upcoming technology.   DataStage, via “RTI,” has been doing this for 4+ years.  Web Services are increasingly mainstream.   Data integration is still at the core, and ETL tools have proven their mettle at simplifying the management of data access technology.  It’s a shame to see it only used for batch.   

Ernie

ps… if you are interesting in reading further about this subject, check out this article on Information as a Service: Data Cleansing Pattern. I was honored to be asked to play a small role in developing this article, written by some of my esteemed IBM teammates. -e

Information as a Service: Data Cleansing Pattern

MQSeries…Ensuring Message Delivery from Queue to Target

Using MQSeries in DataStage as a source or target is very easy…..but ensuring delivery from queue to queue is a bit more tricky. Even more difficult is trying to ensure delivery from queue to database without dropping any messages…

The best way to do this is with an XA transaction, using a formal transaction coordinator, such as MQSeries itself. This is typically done with the Distributed Transaction Stage, which works with MQ to perform transactions across resources….deleting a message from the source queue, INSERTing a row to the target, and then committing the entire operation. This requires the most recent release of DataStage, and the right environment, releases, and configuration of MQSeries and a database that it supports for doing such XA activity….

So what happens if you don’t have the right release of any of these things, or are using an rdbms that is not supported for XA with MQ?

You can come “real close” and accomplish what you’ll need in most scenarios with the attached Job and text file with DDL. This .dsx defines a technique where messages are read from a source queue, then written to a target rdbms……if the INSERT works, messages are immediately removed from the source queue……but if it fails, removal is not performed, and the messages remain in the source queue. Careful thought, testing and review of recovery strategy is necessary, but this technique may be useful in a lot of situations.

Once again, I haven’t mastered the uploading on this site, and had to rename MQCommitTestv8.dsx and TestDDL.txt with pdf suffixes.   Just rename them and you will be fine.  They are both just simple ascii files.

Ernie

targetddl1 mqcommittestv8