It must be Q4, and close to the holidays. Questions come in like heavy rain, and it’s times like this that I see common patterns in the scenarios, whether they are from my IBM teammates helping customers understand our offerings, or from customers and users directly, trying to get things finished before taking well-deserved time off (what’s that? 🙂 )
At any rate, I recently was involed in a bunch of discussions around ETL jobs that wait-for-files, or loop thru directories on a continuous basis, reading their contents. This is a common scenario for reading XML input. Imagine a subdirectory that receives xml documents (with the same matching schema) throughout the night, via FTP from remote locations. It is desirable to have the job or map sitting there, waiting, until a file arrives, then immediately process that file and go back to waiting.
There are a myriad of solutions for handling this, usually with some type of polling mechanism. It’s not as pure and simple, however, as the classic “always on” Job that uses a standard API (MQSeries, JMS, etc.) with a built-in wait parameter. Those “always on” jobs are more predictable, across vendor, across protocol, and the pattern is well-known whether it’s developed with a graphical toolset or directly with C++. The polling solutions are as varied as there are creative uses of these tools…..some possibilities are scripts that sleep, run, then loop and sleep again, remote daemons that wait for a specific file to arrive and then say “go” to the transformation job, or use of an external scheduler (cron, BMC Patrol, AutoSys, etc.) that monitors the conditions and kicks off the ETL process when appropriate. There may be some custom API based solutions out there that force an ETL transformation such as a DataStage Job to blindly wait for input that “happens” to be from a subdirectory, but I haven’t seen any. Most of the polling solutions are looping batch jobs…they do their work, finish reading their input, complete normally, then terminate and are asked to wait again [or their launcher does the waiting]. They act like ‘always on,’ but have less of the complexities because they are typically “very frequent batch.” Again, custom solutions may exist, but these polling solutions are fairly easy to build out-of-the-box using the tools themselves or in conjunction with system managment products. Many people prefer to use the system management tooling because it’s better designed to handle circumstances such as machine re-starts, clean-up strategies, etc.