Launching Jobs via Web Service

Kicking off Transformation Jobs or functions via Command-line is an important feature, but it is equally important that we are able to launch such processing via Web Service or other service-oriented invocation.

Command line utilities are generally text based.  They require that you learn the vendor’s proprietary syntax.   They may require passwords and userids.   You may have to worry about clear-text or be sure there are ways to hide credential information.  You probably need to know about shell scripting to take maximum advantage of the Command language.  You may also need to know things like Job or Map names, need to understand the parameter structure, or the administrative naming conventions for the Transformation tooling (Project Names, Folders, etc.).   You probably need to know, as a developer, the hostname where the Transformation is running, or at least the engine you are connecting to for launching the process.  

Command line utitilies are powerful…. there are times, however, when a simpler invocation method is needed.  Like when the calling application doesn’t have easy access to scripts or the operating system command line.   Or when the skills of the developer(s) establishing the invocation do not include scripting or details of the Tranformation tool’s command line syntax.   Or when the ultimate location of the tranformation function being launched is unknown or moves frequently.    Service invocations using industry standards help abstract all of that….Servers can be anywhere, and underlying awareness of the Tranformation tooling is hidden from the invoking client.   Datatypes for Job Parameters are less painful to handle, and authentication and/or encryption can be handled at the transport layer.

A financial site I’ve been working with uses Web Services to kick off DataStage ETL Jobs from within a portal application.   The authors of the portal application are Visual Basic .NET experts.   They don’t know the first thing about DataStage, are under deadlines like the rest of us, and needed the ability to start an ETL Job to refresh a datamart as the result of a user pushing a button or choosing an option.   Using the RTI module for DataStage (release 7.x), the .NET development team merely consumes the automatically published WSDL definition to include this functionality.   Within minutes, they have the functionality included in their application.  No scripts to maintain, no complex awareness of DataStage.  For all the .NET developers know, the refresh process might be home-grown C++ code.    That’s the value of using the SOA industry standards.  The .NET developers don’t want to know and they don’t have to know that ultimately, it’s a DataStage Job doing the work.    Meanwhile, the skilled DataStage ETL team, who really understands the data, is maintaining the Jobs, the rules, the access to the source and target data structures, and publishing the functionality (as a Web Service, including the WSDL “contract”) for use by the rest of the enterprise. 

Command line control (in DataStage we generally call it the Job Control API or Job Control Command language, a.k.a. dsjob) in transformation tools is
great – but increasingly we need simpler methods that exploit the new standards.

You need a command line tool for “event” based real-time

…by the way, in relation to my previous post, especially when triggering ETL for “kinda” real-time functionality from scheduling environments, make sure the tool you are using or choosing has a command-line mechanism for starting jobs and passing values.   All of the major tools have one – a quick search gave me examples in DataStage, Informatica and Business Objects Data Integrator, and Ab-I.   Any data integration tool worth its salt should have a mechanism.  Learn the command line method that is pertinent to your favorite.  I thought this answer on an Ab-I user collaboration site ( http://datawarehouse.ittoolbox.com/groups/technical-functional/abinitio-l/need-help-on-autosys-jobs-and-wrappers-1397033 ) was great.  I took it a bit out of context, because it was somewhat specific to Ab-I, but the rest makes sense for any of these offerings:

…People then make the mistake of putting these things in the start and end script of the graph – but the best place for them is in a single wrapper that invokes all graphs – effectlvely adapting Ab Initio invokation to Autosys rather than adapting Autosys to Ab Initio. In the end, Ab Initio is along for the ride (as it should be) and Autosys is the controlling harness.

In DataStage, we might suggest that the “first” call from a tool like Autosys would invoke a Graphical Job Sequence, and then endeavor to have as much Job-to-Job-to-Job control and overall process logic manged by the Sequence itself… but it’s still the same concept for complex “event” functions, especially the ones that approximate “always on.” 

 DataStage does all of this with the dsjob command.  You’ll find it document ed in Version 8 in Chapter 23 of the Server Job Developer Guide (i46desjd.pdf). 

 …but there is an exception….what if you want the “event” to be something really creative, like a user pushing a button on a portal built via .NET and running in a remote location?    What if the developers for that button don’t know the first thing about unix or shells?   What if the configuration is so dynamic that the transformation server endpoint changes frequently?  Being able to do this via Web Services or other SOA Service invocation becomes equally imporant.  The subject for my next entry…

“Almost” real-time Jobs

It must be Q4, and close to the holidays.   Questions come in like heavy rain, and it’s times like this that I see common patterns in the scenarios, whether they are from my IBM teammates helping customers understand our offerings, or from customers and users directly, trying to get things finished before taking well-deserved time off (what’s that? :) )

At any rate, I recently was involed in a bunch of discussions around ETL jobs that wait-for-files, or loop thru directories on a continuous basis, reading their contents.  This is a common scenario for reading XML input.   Imagine a subdirectory that receives xml documents  (with the same matching schema) throughout the night, via FTP from remote locations.  It is desirable to have the job or map sitting there, waiting, until a file arrives, then immediately process that file and go back to waiting.

There are a myriad of solutions for handling this, usually with some type of polling mechanism.  It’s not as pure and simple, however, as the classic “always on” Job that uses a standard API (MQSeries, JMS, etc.) with a built-in wait parameter.   Those “always on” jobs are more predictable, across vendor, across protocol, and the pattern is well-known whether it’s developed with a graphical toolset or directly with C++.   The polling solutions are as varied as there are creative uses of these tools…..some possibilities are scripts that sleep, run, then loop and sleep again, remote daemons that wait for a specific file to arrive and then say “go” to the transformation job, or use of an external scheduler (cron, BMC Patrol, AutoSys, etc.) that monitors the conditions and kicks off the ETL process when appropriate.   There may be some custom API based solutions out there that force an ETL transformation such as a DataStage Job to blindly wait for input that “happens” to be from a subdirectory, but I haven’t seen any.   Most of the polling solutions are looping batch jobs…they do their work, finish reading their input, complete normally, then terminate and are asked to wait again [or their launcher does the waiting].    They act like ‘always on,’ but have less of the complexities because they are typically “very frequent batch.”    Again, custom solutions may exist, but these polling solutions are fairly easy to build out-of-the-box using the tools themselves or in conjunction with system managment products.  Many people prefer to use the system management tooling because it’s better designed to handle circumstances such as machine re-starts, clean-up strategies, etc.

Simulating end-of-file for Real Time ETL Transformations

In my initial entry for this blog (http://dsrealtime.wordpress.com/2007/11/23/what-is-real-time-etl-anyway/), I wrote about some of the issues facing “always on” transformations that continously read data from a real-time source (MQSeries, sockets, named-pipes, etc.).   Tools that provide a graphical metaphor for transformation design, and find their origin in high volume batch functionality (classic ETL tools fit this description), are often challenged by the need for a signal that terminates processing.

If you are just doing one-to-one loads of messages to an rdbms, this issue might not matter.  But if you are concerned about individual units of work, have multiple rows “inside” a given message (like an XML document with repeating elements), or are processing individual service requests for a myriad of SOA clients, then something needs to be done to recognize these logical groupings of rows.  Transactional tools that fire up tiny, entirely compiled modules (WebSphere TX being one example) were designed for this, but classic ETL tools, often with interpreted connecivity and performance features that require some ramp-up time, need to stay “always on” for maximum efficiency.   Blocking functions, those that have to “wait” on all rows for completion, are particularly sensitive.  These include Aggregations, Sorts, Pattern Matching, XML document creation, and others.

DataStage and QualityStage manage this by supporting a concept known as end-of-wave.  Driven automatically by the receipt of a SOAP envelope, or on developer control by the reading of “n” messages or other factors, end-of-wave is a “signal” that is sent thru the Job, following all rows in a group, along every possible path.  The end-of-wave signal tells makes all the downstream Stages “think” that processing is complete.   The Stages who block by design (Aggregator, QualityStage matching, XMLOutput, etc.), are notified to go about their business of clean-up or processing “as though” they’ve drained an entire source and hit end-of-file.  However, in the real-time pattern, they haven’t.   End-of-wave is merely the signal that separates two requests from entirely independent users, or the related contents of one MQSeries message from another.   The Job, as noted before, is “always on.”  It simply continues running and immediately receives data for the next ”wave.”  This behavior is inherent in the Information Services Director, as it manages traffic from incoming SOA clients via SOAP or other bindings, and is directly available in Stages like the MQSeries Connector.

The following diagram pictorially represents the end-of-wave concept.  Moving from left to right through the Job, end-of-wave follows each of the co row groupings.

   end-of-wave

This is one way of handling the need for high performance parallel transformations in a real-time scenario where volume is defined as lots of concurrent, yet independent sets of rows……while the same transformations and tooling is to be re-used for massive volumes (read: 100′s of gigabytes or multiple terabytes) in batch.  There are other approaches I’m sure, but be certain that the tool you are using has a way to deal with it.

Real-time Upgrades

I just returned from the west coast on “one more” red-eye, so that I could ostensibly get home and back to work and have a fully productive day.   I never really accomplish that goal, but I keep trying, like all of us, to squeeze out a few extra hours wherever I can find them.  Today’s trip was in coach, but I still managed to get a few hours of sleep, even with two fellow passengers snoring loudly nearby.   Not this time, but quite often I get the “automatic” upgrade as one of Continental’s frequent fliers, which always enhances the otherwise tiring nightime journey.

If you’ve never read anything about CO’s real-time business intelligence system, and the benefits they’ve gained as an enterprise, google it or check out this paper at http://www.fuqua.duke.edu/centers/ccrm/datasets/continental/Continental_Airlines_Case_Study.pdf

They’ve received various awards over the years for their ability to identify their best customers and recognize trends in all aspects of their business.    As a customer, it’s great to be on the receiving end of their real-time BI system.  They notify me when I get an upgrade, and wonder where I’ve been, usually with an inviting mileage offer, if I haven’t flown in awhile.   And I’ve been proactively re-routed in advance when schedules get messed up beyond anyone’s control, as they sometimes do.  Maybe they’ll invent a system that makes real-time seat changes to group all the snorers at the back of the plane.  :)

Reading XML Content as a Source

Reading XML Content as a Source can sometimes be a challenge for ETL types of tools. There are various reasons, but the most common issue is that XML documents can optionally contain line feed characters (typically hex 0D0A) to make them look “pretty” when opened in common editors. These carriage return and line feed characters are ignored by XML processors, but typically are respected by default flat file ETL tooling.

Fortunately, with the increased popularity of XML as a source, these problems no longer exist. Most ETL tools by this time should have a way of reading XML files from disk, or directly (from message queues, etc.) so that this is not a problem.

DataStage in particular has a variety of techniques that can be used to read XML. Which one you choose will depend largely on which type of DataStage you are using and whether you want to read in a single document or a long list of them in a subdirectory. The most common methods I’ve seen for reading in XML in DataStage are:

Folder Stage. For Server Jobs, this is the easiest. The Folder Stage looks for a document or documents on disk based on a subdirectory entry and a wildcard argument for a filename. Both can be parameterized. Look for a built-in Table Definition called Folder that contains the two columns returned by the Stage… FileName and Record. FileName returns the actual and optionally fully qualified filename for the document, and Record contains the entire contents of the file….all of its tags, characters, line feeds, etc. Everything. This is then handed as a unit to the XMLInput Stage where the elements and attributes are deciphered. If the documents are really large, it may simply be better to send only the filename, and ask the XMLInput Stage to do the actual I/O.

External Source Stage. I’ve found this Stage to be the easiest to use for Enterprise Edition (EE) Jobs. It easily allows me to issue an “ls” command for a particular subdirectory, and then send the resulting list to a waiting XMLInput Stage downstream. The Specific Program property contains the string “ls #xmlSubDirectory#*.xml | sort” (without the quotes), and of course could do other things like filter the list via grep. I place the results into a column on the Output link called resultingFileNames, with a varchar datatype and length of 200.

MQSeries Stage. In this case I’m reading the XML Content from an MQ Series message queue. It’s just a “chunk” of character bytes as far as MQ and DataStage are concerned. The chunk received is then sent into XMLInput. This can be limiting because MQSeries has it’s own 100M limit on message size, although this can be alleviated with various MQSeries techniques like Segmentation.

Tips for WSPack Part III: Best practices and Guidelines

Now that we’ve gone thru the basics of invoking a Web Service from within DataStage we can cover some of the more interesting stuff.   Here are some guidelines and best practices for using each of the capabilities within Web Services Pack, whether it’s a Web Service Routine (Server) or WSClient or WSTransformer Stages.

WSRoutines

 

These are Server side Routines (create via Import Web Services Function Definitions after you’ve imported the WSDL).   Nice because they can be included directly in a Transformer.  Of course, that also means that they are not as visual.  

  • Better for Boolean WS invocation (getting back a true/false) when used in the Constraints of a Transformer.
  • Better for when you are calling many Web Services conditionally, or in conjunction with each other, for many columns within a link.
  • Best when used with RPC style Web Services
  • Best for single output argument, one or more input arguments
  • Great for quick testing/prototypes, particularly because of the “Test” function within the Routine editor.

NOT so easily used when:

  • Proxy is needed
  • Multi-argument out is needed
  • Complex XML or array output is needed
  • Web Service has a void input message

(these can be done in Routines, but not without difficulty.   These really only should be handled in the WSTransformer, where there is a lot more flexibility and support for things like security and proxy)

 

 

  WS Transformer Stage    

  • Best for single invocations (one call needed per row for a particular service)

  • Complements the graphical design metaphor

  • Best when proxy or security options are needed

  • Consistent for both RPC or complex output scenarios

  • More easily hack-able for extracting SOAP bodies/envelopes, etc.

  • Try to always Load link meta data via the Message Load button and NOT via the Column grid load button, especially when you are first using the Stage.

  • BEST PRACTICE:  When first building a job with WSTranformer (or with WSClient, for that matter, connect ONLY empty links to your WS Stage [ie….don’t ‘carry’ meta data from some other plugin over to the WS Stage] and LOAD the meta data only from your imported WSDL!!!!!!).   One easy trick is to always surround a new WSTransformer with standard transformers…you can always remove them later once you have things working and you are aware of the behavior of the job and the Service.

WS Client Stage 

  • Designed for single call (target or source) functionality
  • As a source, is easily driven by Job Parameters
  • Easiest for void input message signatures (when used as source)
  • Easiest for void output message signatures (when used as a target)

Tips for Using Web Services Pack with DataStage: Part II – Invoking your first Service

Here’s begins a step by step set of “cheat notes” for implementing (ultimately “invoking” or calling) your first Web Service. See prior Web Services Pack entry for some thoughts on getting started in general and finding a sample service.

Assuming you have the WSDL for a nice, basic Service, and have DataStage Designer up and running, and (if on release 7.x) have WSPack successfully installed (it’s already “in” v8), then you are ready to import the WSDL based meta data. From Import…Table Definitions…Web Services Definitions, start the WSDL Meta Data Importer, when it’s up, locate your WSDL. It could be out at a URL somewhere, or use the button to find an already downloaded WSDL. When it’s opened successfully, you’ll its XML on the left, and a tree of its “Operations” on the upper right. For now, think of each “Operation” as like a ‘routine’ or ‘method’ or ‘subroutine.’ You may have one or many listed in this tree if you expand it. Select one of the Operations and then click the “Import” button on the lower right. If all goes well, you’ll see import messages for the input and output of your service.

Now take a moment and review the table definitions that were imported. In your TableDefinitions/WebServices category, find your Operation, and then in there, a set of TableDefs. Open them up. Take a look at their column names and associated meta data. These are the placeholders for the critical meta data of the service you wish to invoke. Some contain SOAP oriented information….but the key ones are named <yourOperation>_IN and <yourOperation>_OUT. They will illustrate the actual input and output arguments of your selected Operation. In most scenarios you wll never need to review these, and it’s just a curiousity for your review. But for some complex services, it will useful to review these Tables and know where they exist.

Now you are ready to build a job. The formal documentation is fairly clear on this point, but as a tip, if you’ve selected a simple service as I’ve advised, be mindful of the following:

  • Place a Web Services Transformation Stage on your canvas, and immediately preceed it and follow it by regular Transformer Stages. This is a best practice when startingthat will become clear as you get into advanced techniques.
  • When you first open the Stage, the critical button you’ll first see allows you to select your Web Service Operation. This brings up a GUI with some “gear” icons. Select your Web Service, and then click on the “Advanced” button. Note the various Web Service meta data attributes that appear. Port Address is the location of the Service, and is often a good candidate for a DataStage Job Parameter. The others are mostly informational only.
  • When “loading” metadata to your import and output links, choose the usual input and output tabs respectively, and then the “message” tab that is visible. When you “load” the message, it knows which Table Definition to bring in, automatically pulls in the namespace details, and will also automatically populate the Columns grid. The most common errors in using WSTransformer for the first time are when users try to load the Column grid first, before loading the message detail and letting the Stage populate the link grid.

If you chose a simple service (some good ones on the link in my first WSPack entry concerning Temperatures), you will be good to go at this point. Supply your input data and send your output to a Peek or Seqeuential file or whatever you like for testing and see if it works!

Follow

Get every new post delivered to your Inbox.