Guidelines for publishing services with DataStage and QualityStage…..

There are a variety of issues to consider when publishing a DataStage or QualityStage Job as a Web Service. One of these is end-of-wave, which I cover in a separate entry on this blog (see the table of contents). Another is ensuring that the ISD input is the only driving link for the transformation.

Ensuring that WISD Input is the driving stream for the Job flow

For “always on” Jobs, the flow of rows through the Job must be driven exclusively by the primary input Stage that is responsible for the “always on” condition. No path through an “always on” Job should have its own independent origin. This condition is most often encountered in Enterprise Edition Jobs that use Stages designed to accept two or more equal inputs (Reference Match, Join, Merge, and Funnel are typical examples).

This construct is not always easy to recognize, especially when converting very lengthy and complex Jobs to run under WISD. Let’s look at some examples.

In this first example we see a Job using the Funnel Stage. Two links, or two “flows” from somewhere upstream in the Job, arrive at the Funnel. If we apply the rule that WISD must “drive” the flow, this Job design is invalid: it is imperative that both links have their origin at the WISD Stage. It is not possible to have an ISD Job where one path (like the bottom path below, using link01) starts at WISD while the other path (link02, on top) begins at a Sequential File or RDBMS:

While it could be debated what the engine’s behavior should be when it encounters this construct in an “always on” Job, the ambiguity is easy to appreciate. In a simple batch Job, rows from each path to the Funnel are intermixed: they can be sorted together, pushed through “as they arrive,” or grouped (finish one path and then get the other). What happens in a real-time scenario, when a single row or set of rows arrives from a remote client followed by end-of-wave? Should the entire set of relational rows be sent down link02, followed by the current WISD row on link01? Should only one Sequential or RDBMS row be sorted in with the row or rows from the WISD input (saving the “next” link02 row for the “next” WISD request), and then what happens for the subsequent WISD request? Should end-of-wave result in a complete re-run of all the Stages upstream of link02? What if the ultimate source of that path contains 12 million rows? Do we want to wait for that much processing on every request that arrives at the “always on” Job?
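
To make the ambiguity concrete, here is a minimal sketch in plain Python (not DataStage code; the stream contents and function names are invented for illustration) of three plausible but mutually incompatible behaviors the engine could adopt:

```python
# Hypothetical illustration only: a Funnel fed by a WISD request stream
# (link01) and an independent static source (link02) could plausibly
# behave in any of three incompatible ways.

static_source = ["db_row_%d" % i for i in range(5)]  # pretend RDBMS path (link02)
wisd_requests = ["request_A", "request_B"]           # one row per wave (link01)

# Behavior 1: replay the entire static path for every wave.
# Every request re-processes all upstream rows -- imagine 12 million of them.
def replay_all(request):
    return static_source + [request]

# Behavior 2: consume one static row per wave, saving the rest.
# A "leftover" link02 row leaks into the NEXT request's response.
cursor = iter(static_source)
def one_per_wave(request):
    return [next(cursor, None), request]

# Behavior 3: the static path is read once and then exhausted.
# The first request "works"; later requests get nothing from link02.
leftover = list(static_source)
def drain_once(request):
    rows = leftover + [request]
    leftover.clear()  # nothing remains for the next wave
    return rows

for req in wisd_requests:
    print(req, replay_all(req), one_per_wave(req), drain_once(req))
```

None of the three is obviously “right,” and the third corresponds to the common symptom described below, where the first request works and subsequent requests return nothing.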

This is a potentially confusing area, not only for the DataStage/QualityStage developer but for the engine as well, and it is best avoided. The result will be incorrect or invalid responses, if not outright failure or hung requests. A common symptom is that the first request works and the second returns nothing, or the same payload as the first.

Does this mean you cannot use a Funnel with ISD? Of course not! There are many reasons for using a Funnel, the most common being the need for independent paths of logic for data in a single request. The screenshot below illustrates a Job using the Funnel Stage where all paths to the Funnel are correctly “driven” by the request that flows from the WISDInput Stage:

As noted in an earlier section, the same circumstances can occur when using the Join Stage: data on the “independent” path will not be refreshed, and the results will be invalid, inconsistent, or yield errors. The alternative is to use a Lookup Stage instead. As with the Funnel, however, the Join is allowed in an “always on” Job provided all incoming paths are driven by the WISDInput Stage.

Incorrect use of Join Stage in a WISD Job. Use a Lookup instead:

Acceptable use of Join Stage in an “always on” WISD Job:

(A recently discovered variant of the Job pattern immediately above is also problematic under WISD: a Job flow where the “Join” above is replaced by a Lookup, with link “l3” used as the “reference” side of the Lookup. Such a Job needs to be designed with a Join instead of expecting one of the WISD feeds to provide the “lookup table” values.)

You can expect this behavior from Jobs that use any of the following Stages: Difference, Compare, Merge, Join, and the QualityStage Reference Match.
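
If it helps to see why these Stages misbehave while a Lookup does not, here is a minimal sketch (again plain Python with invented names, not DataStage code) contrasting the two consumption models: a Join-style Stage co-consumes both inputs, so an unrefreshed reference input is drained after the first wave, while a Lookup keeps the reference passive and merely probes it per incoming row:

```python
# Hypothetical sketch of two consumption models; not DataStage code.

reference = {"k1": "ref_value_1", "k2": "ref_value_2"}  # fixed reference data

def join_style(primary_rows, reference_rows):
    # A Join CO-CONSUMES both inputs. In an "always on" Job, an independent
    # reference input is read once and never refreshed, so from wave 2
    # onward it is already exhausted.
    ref = dict(reference_rows)  # reading this input drains it
    return [(k, ref.get(k)) for k in primary_rows]

def lookup_style(primary_rows, probe):
    # A Lookup keeps the reference passive: each incoming row probes it,
    # nothing is drained, and every wave behaves identically.
    return [(k, probe(k)) for k in primary_rows]

ref_stream = iter(reference.items())       # a one-shot, file-like input
print(join_style(["k1"], ref_stream))      # wave 1: [('k1', 'ref_value_1')]
print(join_style(["k2"], ref_stream))      # wave 2: [('k2', None)] -- drained

print(lookup_style(["k1"], reference.get))  # wave 1: [('k1', 'ref_value_1')]
print(lookup_style(["k2"], reference.get))  # wave 2: [('k2', 'ref_value_2')]
```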

Special Considerations for QualityStage

Developers of QualityStage Jobs intended for use under WISD need to be particularly conscious of the concept described in the previous section, especially when feeding data into a Reference Match. An often-desired pattern is to feed the incoming service data to the primary link of the Reference Match and bring in the reference data from a fixed data source. This violates the rules described above. Instead of attaching a fixed data source to the reference link, perform a Lookup based on the high-level blocking factors in the incoming WISD request. This involves using a Copy Stage that splits the incoming row, sending one row to the primary input as before while sending the other to a Lookup where multiple reference candidates can be dynamically retrieved. Be sure to make this a sparse Lookup if you expect that the source data could change while the WISD Job is “on” (enabled), and check it carefully to be sure you’ve set it to return multiple rows.
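
As a rough model of that data flow, here is a sketch in plain Python standing in for the Copy, sparse Lookup, and Reference Match Stages (the table, columns, and the zip-code blocking key are invented for illustration):

```python
# Hypothetical model of the recommended pattern: a Copy splits the request;
# one copy is the primary Match input, the other drives a sparse Lookup
# that fetches MULTIPLE candidate reference rows by blocking key, per request.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (zip TEXT, name TEXT, addr TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [("10001", "J SMITH",    "1 MAIN ST"),
                  ("10001", "JOHN SMYTH", "1 MAIN STREET"),
                  ("94105", "A JONES",    "9 BAY RD")])

def sparse_lookup(blocking_zip):
    # Sparse: the query runs per incoming row, so it always sees current
    # data and can return several candidates -- both properties matter here.
    cur = conn.execute("SELECT name, addr FROM customers WHERE zip = ?",
                       (blocking_zip,))
    return cur.fetchall()

def handle_request(request):
    primary = dict(request)                     # Copy Stage: one copy to the Match
    candidates = sparse_lookup(request["zip"])  # the other copy drives the Lookup
    # A Reference Match would now score `primary` against each candidate.
    return primary, candidates

print(handle_request({"zip": "10001", "name": "JON SMITH"}))
```

A normal (in-memory) Lookup would load the reference table once when the Job is enabled, so rows added afterwards would never be seen; the sparse setting trades a per-request query for currency.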

Happy “servicing…” 🙂

Ernie


3 Responses to “Guidelines for publishing services with DataStage and QualityStage…..”

  1. Ravindra Harve Says:

    Great write-up, Ernie. We will be starting to build our enhancements to DataStage using ISD. This article will definitely help us toward a successful implementation.

  2. satish Says:

    Hi Ernie, how can I accept an array of input data from a web service into an ISD Job and run it on demand? Right now I am using an ISD Output Stage with Job Parameters corresponding to the input data… but how can I accept an array of input through Job Parameters?

    If I use an ISD Input Stage and an XML Input Stage to accept the array of data, the Job becomes “always running”…

    • dsrealtime Says:

      If you are only running this service once in a while, then the batch paradigm (not having an ISD input) is fine… but of course, if the input arrives only via Job Parameter, you need a way to pass up multiple “rows”. You’ll have to package them somehow. I like XML, but you could just as easily do it with some sort of delimiter: pack the values into a single string delimited by pipes or commas and then use various pivot techniques, or send them up as a “chunk” of XML and hand it (ultimately) to xmlInput…

      Ernie
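
As a rough illustration of the packing approach described in the reply above (plain Python; the delimiters and field names are invented), the client packs the rows into a single parameter string and the Job side pivots them back out:

```python
# Hypothetical sketch: pack several "rows" into a single Job Parameter
# value using pipe (row) and comma (field) delimiters, then unpack.

rows = [("101", "SMITH"), ("102", "JONES"), ("103", "LEE")]

# Client side: one flat string to pass as the single parameter value.
packed = "|".join(",".join(fields) for fields in rows)
# -> "101,SMITH|102,JONES|103,LEE"

# Job side: the pivot -- split the parameter back into individual rows.
unpacked = [tuple(row.split(",")) for row in packed.split("|")]
assert unpacked == rows

# XML alternative: wrap the rows as a chunk and hand it to an XML parser
# (inside the Job, an XML parsing Stage would do this step).
import xml.etree.ElementTree as ET
chunk = "<rows>" + "".join(
    "<row id='%s' name='%s'/>" % r for r in rows) + "</rows>"
parsed = [(e.get("id"), e.get("name")) for e in ET.fromstring(chunk)]
assert parsed == rows
```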

