Are your Lookups Real-Time Ready?

This is just a reminder for all of you using Lookups of any kind in a real-time ETL process. When the focus is on performance in a “batch” ETL flow, we often choose options for lookups that ask the ETL engine to pre-load reference tables into memory. Static lookups of this type might be fine for a real-time scenario, but may also be disasterous if you are expecting to get positive results on a lookup for a reference table that constantly changes.

Check your lookups and choose the right option for your needs.   I was recently reminded of this fact on a current project.  Something to always remember to consider.

In DataStage Server edition, lookups are dynamic by nature, and you need a hash table to force lookups into memory. In Enterprise Edition, however, you specifically need what is referred to as a “sparse” lookup. This is a property that you select inside the Stage.

-e

Calling a Web Service that supports Arrays for input and output

Web Service implementations, especially row-oriented ones, can be “chatty”. If you have a special web service that you’d like to invoke as part of your data integration job, it could be very expensive and time consuming to call it individually for every single row that passes by. If you have 1000 rows, you aren’t only running those 1000 rows thru some remote Web Service…you are also communicating a request and response via SOAP, 1000 times “on the wire.” See if the author of the Service is able to accept and return an “array”. Then you can minimize the traffic, calling the Service only once in this example of 1000 rows, and then processing the single response after it is returned. It could still be problematic, as now your SOAP Envelope to grow to large proportions, but odds are it will be better than individual calls.

To do this in DataStage requires that you be a bit creative. You will need to construct the SOAP Body for the input request prior to invoking the Web Service, and then parse out the details of the SOAP Body that is returned.

I’ve included a .dsx in this post, along with the WSDL that was used to create it. This is a web service I published locally on my own system, so it’s not available to be called externally and tested on your systems, but there are detailed annotations in the Job that outline the necessary steps for this technique. The Web Service it calls still conforms nicely to a row orientation (single relational set of columns in, and single set of relational columns out)…the only difference is that I’m sending a whole array of such rows to the service at one time.

One thing to consider as you build one of these — walk before you run.  Address the input side, and just send the output to a flat file instead of XMLInput as I’ve done here in this completed Job.   Once you have reviewed the flat file and can read it successfully with XMLInput, then you can put everything together.

Enjoy.

-e

callarraywebservicedsx3

arrayexamplewsdl2

I need “Google House” (how I found some valuable .dsx’s).

I just recently re-installed Google Desktop. What a life saver. A prior installation was causing me some problems with email, so a month ago I uninstalled it, and only this week found the time to download another edition and have it index my machine….. and “lo!” I found some .dsx files (DataStage Exports) that I’d forgotten I created a few years back. In particular, I’ve been meaning to write up some instructions on how to use arrays in Web Services, among other complicated Web Service invocations, and wanted to avoid re-inventing the examples from scratch. I found ‘em, and will see if I can clean them up, test them in version 8, and explain them here.

Bless all of you who can keep their hard drive in a perfectly sorted arrangement of easy to locate subdirectories and well documented files! If there was a “Google House,” maybe I’d be able to find some tools I misplaced after the last renovation project….. ;)

-e

Data Quality and Transformation as a Service

In your ETL jobs, do you ever perform “Lookups” to validate account numbers, confirm product codes, or return discount rates?  (I’m sure the answer is “of course you do”)  Are you certifying address lines as part of your batch ETL processes?    (for many of you, another “of course”)  How complex is the logic you are using?  How much work have you done to establish such lookups, certify addresses, and test the functionality?   …and maintain it?   What data sources, security issues and other techniques have you invested in?

Have you thought about how valuable those data integration activities might be to the rest of your organization?

Turn them into a “Service.” One that is easily re-usable, not only for the batch work you are doing today via ETL, but for your real-time applications — your java development teams building a portal, or .NET groups setting up front ends for remote devices.   Or perhaps for other applications that are performing internal communications using your company’s choice of enterprise service bus.

This is where the Information Services Director comes into play for information Server — providing the ability to publish DataStage, QualityStage, SQL queries and stored procedures as Services. Take the “guts” of your data integration activity (Lookups, Transforms, etc.) and “publish” them as a Service.   A “Service” that is supported by industry standards such as Web Services (SOAP over HTTP) and other protocols, along with the necessary artifacts that include a built-in directory and automatically generated WSDL that illustrates the metadata for your creation.

Services that focus on data integration are often most successful when they are built, documented, and maintained by the teams who truly understand the data. In many organizations, this is the same team that has been living with the data migration, transformation, and data warehousing applications.  They have invested time and energy researching the models, the legacy systems, and the oddities of the data under-the-covers.  They’ve built extensive transformations using ETL tooling. Why not exploit those skills, expertise, and business rule investments for benefits beyond the decision support systems?

I’m not talking about new stuff, bleeding edge creations, or upcoming technology.   DataStage, via “RTI,” has been doing this for 4+ years.  Web Services are increasingly mainstream.   Data integration is still at the core, and ETL tools have proven their mettle at simplifying the management of data access technology.  It’s a shame to see it only used for batch.   

Ernie

ps… if you are interesting in reading further about this subject, check out this article on Information as a Service: Data Cleansing Pattern. I was honored to be asked to play a small role in developing this article, written by some of my esteemed IBM teammates. -e

Information as a Service: Data Cleansing Pattern

MQSeries…Ensuring Message Delivery from Queue to Target

Using MQSeries in DataStage as a source or target is very easy…..but ensuring delivery from queue to queue is a bit more tricky. Even more difficult is trying to ensure delivery from queue to database without dropping any messages…

The best way to do this is with an XA transaction, using a formal transaction coordinator, such as MQSeries itself. This is typically done with the Distributed Transaction Stage, which works with MQ to perform transactions across resources….deleting a message from the source queue, INSERTing a row to the target, and then committing the entire operation. This requires the most recent release of DataStage, and the right environment, releases, and configuration of MQSeries and a database that it supports for doing such XA activity….

So what happens if you don’t have the right release of any of these things, or are using an rdbms that is not supported for XA with MQ?

You can come “real close” and accomplish what you’ll need in most scenarios with the attached Job and text file with DDL. This .dsx defines a technique where messages are read from a source queue, then written to a target rdbms……if the INSERT works, messages are immediately removed from the source queue……but if it fails, removal is not performed, and the messages remain in the source queue. Careful thought, testing and review of recovery strategy is necessary, but this technique may be useful in a lot of situations.

Once again, I haven’t mastered the uploading on this site, and had to rename MQCommitTestv8.dsx and TestDDL.txt with pdf suffixes.   Just rename them and you will be fine.  They are both just simple ascii files.

Ernie

targetddl1 mqcommittestv8