Incorporating Java classes into your DataStage Jobs

Java comes up a lot when we talk about “real time.”  Not that Java has any special dibs on the term, but when a site is interested in things like Service Oriented Architecture (SOA), Web Services, messaging, and XML, it is usually also interested in Java, J2EE, Application Servers and other things related to Sun’s language standard.

Integrating Java with your ETL processing becomes the next logical discussion, whether or not “real time” even applies.  There may be some functionality, some existing algorithm worth re-using, or some remote Java-oriented or Java-managed system or message queue that contains valuable source data (or would be a valuable target), that you’d like to bring into a data integration flow.  DataStage can easily be extended to include your Java functionality or take advantage of your Java experience.

Two Stages, once referred to collectively as the JavaPack, are now included with DataStage: JavaClient and JavaTransformer.  Both allow you to integrate the functionality of a Java class into the flow of a DataStage Job.  JavaClient is used for sources or targets (only an output link, or only an input link), and JavaTransformer is used for row-by-row processing, where you have something you’d like to invoke for each row that passes through.

DataStage provides a simple API for including Java classes in your Jobs.  This API allows your class to interact directly with the DataStage engine at run time: to obtain meta data about the columns and links that exist in the currently executing Job, and to read and write rows from and to those links when called upon to do so.  You define several special methods in your class, such as process(), which the engine calls whenever it needs a row, or when it is handing control to your class because it has a row to give you.  Within that method you have various calls to make, such as readRow [from an input link] and writeRow [to an output link].  You can control what comes in and what goes out, and also handle rejections based on logic in your class.  Beyond that, your class can do whatever it wants: read messages from JMS queues, invoke remote EJBs, whatever.
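
To make that concrete, here is a minimal sketch of the kind of class you might plug into a JavaTransformer: it reads a row from the input link, upper-cases a single string column, and writes the result to the output link.  The package, method, and constant names are my recollection of the Java Pack API, so treat them as assumptions and verify them against the Java Pack documentation (and the ExamineRows source linked below) before relying on them.

    // A minimal JavaTransformer class: one input link, one output link.
    // NOTE: the class, method, and constant names here are assumptions from my
    // notes on the Java Pack API -- check them against the Java Pack docs.
    import com.ascentialsoftware.jds.Row;
    import com.ascentialsoftware.jds.Stage;

    public class UpperCaseColumn extends Stage {

        // The engine calls process() each time it is ready to give us a row.
        public int process() {
            Row inputRow = readRow();                 // next row from the input link
            if (inputRow == null) {
                return OUTPUT_STATUS_END_OF_DATA;     // nothing left to read
            }

            Row outputRow = createOutputRow();        // row bound for the output link
            // Assume a single string column at position 0; upper-case it on the way through.
            outputRow.setValueAsString(0, inputRow.getValueAsString(0).toUpperCase());
            writeRow(outputRow);                      // send it down the output link

            return OUTPUT_STATUS_READY;               // ready to be called again
        }
    }

Compile it against the jar that ships with the Java Pack (the Readme below describes the classpath the Job expects), then point the JavaTransformer Stage at the class name in its properties.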

The JavaPack is very well documented, with examples and descriptions of all the API calls.  However, I’ve included an additional example here for anyone who is interested, including the Java class, its source, a .dsx export and usage notes.  Have fun!

-ernie

btw, I haven’t exactly figured out yet how best to get the names of the files below represented here on this blog, but if you save them from here, each file except the Readme begins with “ExamineRows”: ExamineRows.dsx (the export), ExamineRows.java (the source) and ExamineRows.class (the compiled class).  I haven’t had a chance to re-try it after downloading from here, so worst case you’ll need to recompile the class yourself in your environment.  Otherwise, it should run in v8 “as is”.  See the file at the Readme link for details on the expected classpath in the Job, etc., and read the annotations in the Job itself after you import it.  -e

Examine Rows Class, Examine Rows Java Source, Examine Rows Readme, Examine Rows DataStage Export

Simulating end-of-file for Real Time ETL Transformations

In my initial entry for this blog (https://dsrealtime.wordpress.com/2007/11/23/what-is-real-time-etl-anyway/), I wrote about some of the issues facing “always on” transformations that continuously read data from a real-time source (MQSeries, sockets, named pipes, etc.).  Tools that provide a graphical metaphor for transformation design, and that have their origins in high-volume batch functionality (classic ETL tools fit this description), are often challenged by the need for a signal that terminates processing.

If you are just doing one-to-one loads of messages into an rdbms, this issue might not matter.  But if you are concerned about individual units of work, have multiple rows “inside” a given message (like an XML document with repeating elements), or are processing individual service requests from a myriad of SOA clients, then something needs to be done to recognize these logical groupings of rows.  Transactional tools that fire up tiny, entirely compiled modules (WebSphere TX being one example) were designed for this, but classic ETL tools, often with interpreted connectivity and performance features that require some ramp-up time, need to stay “always on” for maximum efficiency.  Blocking functions, those that have to “wait” on all rows before completing, are particularly sensitive.  These include Aggregations, Sorts, Pattern Matching, XML document creation, and others.

DataStage and QualityStage manage this by supporting a concept known as end-of-wave.  Driven automatically by the receipt of a SOAP envelope, or under developer control by the reading of “n” messages or other factors, end-of-wave is a “signal” that is sent thru the Job, following all rows in a group, along every possible path.  The end-of-wave signal makes all the downstream Stages “think” that processing is complete.  The Stages that block by design (Aggregator, QualityStage matching, XMLOutput, etc.) are notified to go about their business of clean-up or processing “as though” they’ve drained an entire source and hit end-of-file.  However, in the real-time pattern, they haven’t.  End-of-wave is merely the signal that separates two requests from entirely independent users, or the related contents of one MQSeries message from another.  The Job, as noted before, is “always on.”  It simply continues running and immediately receives data for the next “wave.”  This behavior is inherent in the Information Services Director, which manages traffic from incoming SOA clients via SOAP or other bindings, and it is directly available in Stages like the MQSeries Connector.

The following diagram pictorially represents the end-of-wave concept.  Moving from left to right through the Job, end-of-wave follows each of the row groupings.

   [diagram: end-of-wave signals following each row grouping through the Job]

This is one way of handling the need for high-performance parallel transformations in a real-time scenario where volume is defined as lots of concurrent, yet independent, sets of rows, while the same transformations and tooling are re-used for massive volumes (read: hundreds of gigabytes or multiple terabytes) in batch.  There are other approaches, I’m sure, but be certain that the tool you are using has a way to deal with it.

Tips for using Web Services Pack with DataStage Part I: Getting Started

Thought I’d start sharing the notes I’ve collected over the years of working with the Web Services Pack.  This is the ability for DataStage to be a SOAP Client; for a DataStage Job to reach out and invoke a Web Service located “out there in the ether.”  That phrase is a reference to the benefits of Web Services: you don’t have to know where the Service is located, what machine it’s on, who wrote it, or what language it’s written in (not that knowing those things is a bad idea, especially if your business is relying on the Service, but I think you get the point).  Give me a WSDL document, and that should be enough to invoke some type of remote function [this is what you might call the opposite of WISD, or RTI, which allow you to publish a DataStage Job or other Information Server asset “as” a Web Service].

Before you get started with it, ask yourself how comfortable you are with Web Services technology.  I’m speaking here about SOAP over HTTP in particular, the protocol supported by WS Pack, but it would be wise to get yourself a good intro to all of SOA.  There are good books out there, but also 1000’s of great resources here on the web.  Just search for something like “Web Services Introduction XML,” and poke around.  You’ll find something for everyone.

Then you need the software.  If you are a release 8 user, it’s just “there.”  If you are on 7.5, you have to speak with your account team about downloading it, but it won’t cost you anything.  The installation is simple, although be reminded that there is a client-side and a server-side component.

Once installed, you’ll have two new Stages, the WSClient Stage and the WSTransformer Stage.  WSClient is for Web Services that will be a source or target to your Job, and the WSTransformer is for those Services that you expect to invoke on a row-by-row basis.

I have to put in a plug for the documentation.  The WS Pack documentation is very thorough.  It has some easy-to-understand graphics that do a nice job of introducing Web Services in general, as well as how WSDL, SOAP, SOAP Envelopes and SOAP Bodies come into play.  What I’d like to add in these entries are some other points to consider, to help you be more successful as you prepare to include Web Services in your Jobs.
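
If the SOAP terminology is new to you, it can help to watch an Envelope take shape outside of DataStage.  The little sketch below uses the SAAJ API (javax.xml.soap) that ships with Java to build and print a request by hand; the service namespace, the “getQuote” operation, and its “symbol” argument are made-up placeholders, and in a real Job the WS Pack assembles the equivalent Envelope for you from the WSDL and your input columns.

    // Build a SOAP request by hand, just to see the Envelope and Body that
    // WS Pack normally constructs for you.  The namespace, operation ("getQuote")
    // and "symbol" argument are hypothetical placeholders.
    import javax.xml.namespace.QName;
    import javax.xml.soap.MessageFactory;
    import javax.xml.soap.SOAPBody;
    import javax.xml.soap.SOAPElement;
    import javax.xml.soap.SOAPMessage;

    public class SoapEnvelopeDemo {
        public static void main(String[] args) throws Exception {
            // An empty SOAP 1.1 message: an Envelope wrapping a Header and a Body.
            SOAPMessage request = MessageFactory.newInstance().createMessage();
            SOAPBody body = request.getSOAPPart().getEnvelope().getBody();

            // The Body carries the operation and its input arguments, as described by the WSDL.
            SOAPElement operation = body.addChildElement(
                    new QName("http://example.com/quotes", "getQuote", "q"));
            operation.addChildElement("symbol").addTextNode("IBM");

            request.saveChanges();
            request.writeTo(System.out);   // print the Envelope that would go over the wire

            // To actually invoke a service you would post this message to the endpoint
            // named in the WSDL, e.g. via javax.xml.soap.SOAPConnection.call(...).
        }
    }

Compare the printed output with the Envelope and Body diagrams in the WS Pack documentation and the vocabulary should click pretty quickly.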

A stand-alone Web Services testing tool is a good idea too, unless you are already comfortable invoking Web Services from a common IDE for Java, C#, .NET, etc.

Finally, find yourself a good Web Service to start with, just to learn the mechanics.  There are some great ones out at www.xmethods.net.  You can test them there to see how they function, and get a solid idea of what their input and output requirements are.  There’s a nice mix of publicly available services there; some are free, others are part of a business.  The free ones are sponsored by folks who are making their expertise known, and some very solid consulting firms and companies with Web Services expertise are represented there.  I’ve seen many of the services hosted at that site continue to function and be supported for more than five years.  For your first foray into Web Services Pack, pick one that has very few input and output arguments, accepts and returns one row (a perfect candidate for WSTransformer), and doesn’t require you to pay anything or come up with an access code.

This entry is getting long.  Next time I’ll walk you thru access to the Service and then discuss how to prepare for a more complex one that you might find inside your enterprise.