Launching Jobs via Web Service

Kicking off Transformation Jobs or functions via the command line is an important feature, but it is equally important that we are able to launch such processing via a Web Service or other service-oriented invocation.

Command line utilities are generally text based. They require that you learn the vendor’s proprietary syntax. They may require user IDs and passwords, so you have to worry about clear-text credentials or make sure there are ways to hide them. You probably need to know shell scripting to take maximum advantage of the command language. You may also need to know things like Job or Map names, the parameter structure, or the administrative naming conventions of the Transformation tooling (project names, folders, etc.). As a developer, you probably need to know the hostname where the Transformation is running, or at least the engine you are connecting to for launching the process.

Command line utilities are powerful… there are times, however, when a simpler invocation method is needed: when the calling application doesn’t have easy access to scripts or the operating system command line, when the skills of the developer(s) establishing the invocation don’t include scripting or the details of the Transformation tool’s command line syntax, or when the ultimate location of the transformation function being launched is unknown or moves frequently. Service invocations using industry standards help abstract all of that: servers can be anywhere, and awareness of the underlying Transformation tooling is hidden from the invoking client. Datatypes for Job Parameters are less painful to handle, and authentication and/or encryption can be handled at the transport layer.

A financial site I’ve been working with uses Web Services to kick off DataStage ETL Jobs from within a portal application. The authors of the portal application are Visual Basic .NET experts. They don’t know the first thing about DataStage, are under deadlines like the rest of us, and needed the ability to start an ETL Job to refresh a datamart when a user pushes a button or chooses an option. Using the RTI module for DataStage (release 7.x), the .NET development team merely consumes the automatically published WSDL definition; within minutes, the functionality is included in their application. No scripts to maintain, no complex awareness of DataStage. For all the .NET developers know, the refresh process might be home-grown C++ code. That’s the value of using the SOA industry standards. The .NET developers don’t want to know, and they don’t have to know, that ultimately it’s a DataStage Job doing the work. Meanwhile, the skilled DataStage ETL team, who really understands the data, is maintaining the Jobs, the rules, and the access to the source and target data structures, and publishing the functionality (as a Web Service, including the WSDL “contract”) for use by the rest of the enterprise.
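For illustration, here is a minimal sketch of what such a client call can look like. The endpoint URL, namespace, operation name, and parameter are all hypothetical; in practice the .NET team would generate a typed proxy from the published WSDL, but any language that can POST a SOAP envelope can do the same thing.

```python
# Minimal sketch of invoking a service published from an ETL job.
# The endpoint, namespace, operation, and parameter names are hypothetical;
# a real client would normally be generated from the published WSDL.
import urllib.request

ENDPOINT = "http://etl-services.example.com/services/RefreshDatamart"  # hypothetical

envelope = """<?xml version="1.0" encoding="utf-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                  xmlns:svc="http://example.com/etl/services">
  <soapenv:Body>
    <svc:refreshDatamart>
      <svc:martName>SALES_SUMMARY</svc:martName>
    </svc:refreshDatamart>
  </soapenv:Body>
</soapenv:Envelope>"""

req = urllib.request.Request(
    ENDPOINT,
    data=envelope.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8",
             "SOAPAction": "refreshDatamart"},
)

# The caller never knows (or cares) that a DataStage Job is doing the work.
with urllib.request.urlopen(req) as response:
    print(response.read().decode("utf-8"))
```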

Command line control in transformation tools (in DataStage we generally call it the Job Control API or Job Control Command language, a.k.a. dsjob) is great – but increasingly we need simpler methods that exploit the new standards.
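By contrast, here is roughly what the command-line route looks like when scripted. The host, credentials, project, Job, and parameter names are hypothetical, and the exact dsjob flags vary by release, so treat this as a sketch rather than a reference.

```python
# Rough sketch of scripting dsjob; host, credentials, project, Job, and
# parameter values are hypothetical, and flag details vary by release.
import subprocess

cmd = [
    "dsjob",
    "-server", "etlhost",          # the caller must know the engine host...
    "-user", "dsadm",              # ...and credentials...
    "-password", "secret",         # ...which are clear-text unless hidden
    "-run",
    "-jobstatus",                  # wait and return the Job's finishing status
    "-param", "INPUT_DIR=/data/incoming",
    "MyProject",                   # ...and the project and Job names
    "LoadDataMart",
]

result = subprocess.run(cmd, capture_output=True, text=True)
print(result.returncode, result.stdout)
```

Everything the Web Service client was shielded from (hostnames, credentials, naming conventions) is right there in the script, which is exactly the point.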

“Almost” real-time Jobs

It must be Q4, and close to the holidays.   Questions come in like heavy rain, and it’s times like this that I see common patterns in the scenarios, whether they are from my IBM teammates helping customers understand our offerings, or from customers and users directly, trying to get things finished before taking well-deserved time off (what’s that? 🙂 )

At any rate, I was recently involved in a bunch of discussions around ETL jobs that wait for files, or loop thru directories on a continuous basis, reading their contents. This is a common scenario for reading XML input. Imagine a subdirectory that receives XML documents (all matching the same schema) throughout the night, via FTP from remote locations. It is desirable to have the job or map sitting there, waiting, until a file arrives, then immediately process that file and go back to waiting.

There are myriad solutions for handling this, usually with some type of polling mechanism. It’s not as pure and simple, however, as the classic “always on” Job that uses a standard API (MQSeries, JMS, etc.) with a built-in wait parameter. Those “always on” jobs are more predictable, across vendor and across protocol, and the pattern is well known whether it’s developed with a graphical toolset or directly in C++. The polling solutions are as varied as there are creative uses of these tools… some possibilities are scripts that sleep, run, then loop and sleep again; remote daemons that wait for a specific file to arrive and then say “go” to the transformation job; or an external scheduler (cron, BMC Patrol, AutoSys, etc.) that monitors the conditions and kicks off the ETL process when appropriate. There may be some custom API-based solutions out there that force an ETL transformation such as a DataStage Job to blindly wait for input that “happens” to come from a subdirectory, but I haven’t seen any. Most of the polling solutions are looping batch jobs… they do their work, finish reading their input, complete normally, then terminate and are asked to wait again [or their launcher does the waiting]. They act like “always on,” but have fewer of the complexities because they are typically “very frequent batch.” Again, custom solutions may exist, but these polling solutions are fairly easy to build out of the box using the tools themselves or in conjunction with system management products. Many people prefer to use the system management tooling because it’s better designed to handle circumstances such as machine re-starts, clean-up strategies, etc.
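As a deliberately simple illustration, here is the shape of that “very frequent batch” polling pattern sketched in Python. The directory names are hypothetical, and process_file() stands in for whatever actually launches the transformation (a dsjob call, a scheduler trigger, and so on).

```python
# A bare-bones "very frequent batch" polling loop. Directories are
# hypothetical; process_file() stands in for launching the real ETL job.
# Real deployments add logging, restart handling, and clean-up strategies.
import glob
import os
import shutil
import time

INCOMING = "/data/incoming"    # where FTP drops the XML documents overnight
PROCESSED = "/data/processed"  # archive so each file is handled only once
POLL_SECONDS = 30

def process_file(path):
    # Placeholder: in practice this would kick off the transformation
    # (for example, a dsjob run) with the file name as a parameter.
    print("processing", path)

while True:
    for path in sorted(glob.glob(os.path.join(INCOMING, "*.xml"))):
        process_file(path)
        shutil.move(path, os.path.join(PROCESSED, os.path.basename(path)))
    time.sleep(POLL_SECONDS)  # go back to "waiting"
```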

Simulating end-of-file for Real Time ETL Transformations

In my initial entry for this blog (https://dsrealtime.wordpress.com/2007/11/23/what-is-real-time-etl-anyway/), I wrote about some of the issues facing “always on” transformations that continuously read data from a real-time source (MQSeries, sockets, named pipes, etc.). Tools that provide a graphical metaphor for transformation design, and that find their origin in high-volume batch functionality (classic ETL tools fit this description), are often challenged by the need for a signal that terminates processing.

If you are just doing one-to-one loads of messages to an RDBMS, this issue might not matter. But if you are concerned about individual units of work, have multiple rows “inside” a given message (like an XML document with repeating elements), or are processing individual service requests for a myriad of SOA clients, then something needs to be done to recognize these logical groupings of rows. Transactional tools that fire up tiny, entirely compiled modules (WebSphere TX being one example) were designed for this, but classic ETL tools, often with interpreted connectivity and performance features that require some ramp-up time, need to stay “always on” for maximum efficiency. Blocking functions, those that have to “wait” for all rows before completing, are particularly sensitive. These include aggregations, sorts, pattern matching, XML document creation, and others.

DataStage and QualityStage manage this by supporting a concept known as end-of-wave. Driven automatically by the receipt of a SOAP envelope, or under developer control by the reading of “n” messages or other factors, end-of-wave is a “signal” that is sent thru the Job, following all rows in a group, along every possible path. The end-of-wave signal makes all the downstream Stages “think” that processing is complete. The Stages that block by design (Aggregator, QualityStage matching, XMLOutput, etc.) are notified to go about their business of clean-up or processing “as though” they’ve drained an entire source and hit end-of-file. However, in the real-time pattern, they haven’t. End-of-wave is merely the signal that separates two requests from entirely independent users, or the related contents of one MQSeries message from another. The Job, as noted before, is “always on.” It simply continues running and immediately receives data for the next “wave.” This behavior is inherent in the Information Services Director, as it manages traffic from incoming SOA clients via SOAP or other bindings, and is directly available in Stages like the MQSeries Connector.

The following diagram pictorially represents the end-of-wave concept. Moving from left to right through the Job, an end-of-wave signal follows each of the row groupings.

[Diagram: end-of-wave signals following each group of rows through the Job]

This is one way of handling the need for high-performance parallel transformations in a real-time scenario where volume is defined as lots of concurrent, yet independent, sets of rows… while the same transformations and tooling are re-used for massive volumes (read: hundreds of gigabytes or multiple terabytes) in batch. There are other approaches, I’m sure, but be certain that the tool you are using has a way to deal with it.
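To make the wave idea concrete, here is a toy sketch of the pattern (not of DataStage internals): an “always on” aggregator treats a wave marker in its input stream the way a batch job treats end-of-file, flushing its totals and then continuing to run. The sentinel object, field names, and values are all invented for illustration.

```python
# Toy illustration of the end-of-wave pattern (not DataStage internals):
# a marker flows down the stream after each group of rows, telling a
# blocking operator to flush "as though" it had hit end-of-file.
import queue

END_OF_WAVE = object()  # sentinel separating independent groups of rows

def aggregator(stream):
    """Sum amounts per wave; stay 'always on' between waves."""
    total = 0
    while True:
        row = stream.get()
        if row is END_OF_WAVE:
            print("wave complete, total =", total)
            total = 0            # reset and keep running for the next wave
        else:
            total += row["amount"]

stream = queue.Queue()
for amount in (10, 20, 30):
    stream.put({"amount": amount})
stream.put(END_OF_WAVE)          # first request/message ends here
for amount in (5, 7):
    stream.put({"amount": amount})
stream.put(END_OF_WAVE)          # second, entirely independent wave

# aggregator(stream) would print 60, then 12, and then keep waiting for more.
```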

Reading XML Content as a Source

Reading XML content as a source can sometimes be a challenge for ETL types of tools. There are various reasons, but the most common issue is that XML documents can optionally contain line-ending characters (typically hex 0D0A, carriage return plus line feed) to make them look “pretty” when opened in common editors. These carriage return and line feed characters are ignored by XML processors, but are typically respected, by default, by flat-file ETL tooling.

Fortunately, with the increased popularity of XML as a source, these problems have largely gone away. Most ETL tools by now have a way of reading XML files from disk, or directly from message queues and similar sources, so that this is not a problem.
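If you ever do have to handle the raw document yourself, the fix is simply to use a real XML parser rather than line-oriented reads. A quick sketch (with an invented document) shows that the same content, pretty-printed or collapsed onto one line, parses to the same values, even though a flat-file reader would see several “records” in one case and one in the other.

```python
# Pretty-printed and single-line versions of the same document parse
# identically: the CR/LF "formatting" characters are insignificant to an
# XML parser, unlike a line-oriented flat-file reader.
import xml.etree.ElementTree as ET

pretty = """<order>
  <item>
    <sku>A100</sku>
    <qty>3</qty>
  </item>
</order>"""

compact = "<order><item><sku>A100</sku><qty>3</qty></item></order>"

for doc in (pretty, compact):
    root = ET.fromstring(doc)
    print(root.find("./item/sku").text, root.find("./item/qty").text)
# Both iterations print: A100 3
```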

DataStage in particular has a variety of techniques that can be used to read XML. Which one you choose will depend largely on which type of DataStage you are using and whether you want to read in a single document or a long list of them in a subdirectory. The most common methods I’ve seen for reading in XML in DataStage are:

Folder Stage. For Server Jobs, this is the easiest. The Folder Stage looks for a document or documents on disk based on a subdirectory entry and a wildcard argument for a filename. Both can be parameterized. Look for a built-in Table Definition called Folder that contains the two columns returned by the Stage: FileName and Record. FileName returns the actual (and optionally fully qualified) filename of the document, and Record contains the entire contents of the file: all of its tags, characters, line feeds, etc. Everything. This is then handed as a unit to the XMLInput Stage, where the elements and attributes are deciphered. If the documents are really large, it may simply be better to send only the filename and ask the XMLInput Stage to do the actual I/O.

External Source Stage. I’ve found this Stage to be the easiest to use for Enterprise Edition (EE) Jobs. It easily allows me to issue an “ls” command for a particular subdirectory and then send the resulting list to a waiting XMLInput Stage downstream. The Specific Program property contains the string “ls #xmlSubDirectory#*.xml | sort” (without the quotes), and of course it could do other things, like filter the list via grep. I place the results into a column on the Output link called resultingFileNames, with a varchar datatype and a length of 200. (A rough sketch of this list-then-parse pattern, outside of DataStage, appears at the end of this post.)

Update: This technique was also recently documented in the XML Stage Redbook, section 7.1.2, page 152. The redbook can be found at the following link:

XML Redbook

MQSeries Stage. In this case I’m reading the XML content from an MQSeries message queue. It’s just a “chunk” of character bytes as far as MQ and DataStage are concerned. The chunk received is then sent into XMLInput. This can be limiting because MQSeries has its own 100 MB limit on message size, although that can be alleviated with various MQSeries techniques like segmentation.
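For readers who don’t live in DataStage, the file-based techniques above boil down to the same simple pattern, sketched here in Python with hypothetical paths: build a (sorted) list of document names, then hand each whole document, line feeds and all, to the XML parser.

```python
# Rough analogue of the Folder / External Source techniques above:
# list the matching documents (like "ls /data/xml/*.xml | sort"), then
# hand each complete document -- tags, line feeds, and all -- to the parser.
import glob
import xml.etree.ElementTree as ET

XML_DIR = "/data/xml"  # hypothetical; this would be a job parameter in practice

for filename in sorted(glob.glob(XML_DIR + "/*.xml")):
    with open(filename, encoding="utf-8") as f:
        record = f.read()          # the entire document as one "record"
    root = ET.fromstring(record)   # the parser deciphers elements/attributes
    print(filename, root.tag, len(root))
```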