MQSeries…Ensuring Message Delivery from Queue to Target

Using MQSeries in DataStage as a source or target is very easy…but ensuring delivery from queue to queue is a bit trickier. Even more difficult is ensuring delivery from queue to database without dropping any messages…

The best way to do this is with an XA transaction, using a formal transaction coordinator such as MQSeries itself. This is typically done with the Distributed Transaction Stage, which works with MQ to perform transactions across resources…deleting a message from the source queue, INSERTing a row into the target, and then committing the entire operation as one unit. This requires the most recent release of DataStage, and the right environment, releases, and configuration of MQSeries and of a database that it supports for such XA activity…

So what happens if you don’t have the right release of any of these things, or are using an rdbms that is not supported for XA with MQ?

You can come “real close” and accomplish what you’ll need in most scenarios with the attached Job and text file with DDL. This .dsx defines a technique where messages are read from a source queue, then written to a target rdbms…if the INSERT works, the messages are immediately removed from the source queue…but if it fails, the removal is not performed, and the messages remain in the source queue. Careful thought, testing, and review of your recovery strategy are necessary, but this technique may be useful in a lot of situations.
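Just to make the pattern concrete, here is a rough Python sketch of the same idea. The queue calls (browse_message, remove_message) are hypothetical placeholders rather than a real MQ API, and sqlite3 is used only to keep the example self-contained:

```python
import sqlite3

def browse_message(queue):
    """Hypothetical placeholder: non-destructively read the next message, or return None."""
    raise NotImplementedError

def remove_message(queue, msg_id):
    """Hypothetical placeholder: destructively get (delete) the message just browsed."""
    raise NotImplementedError

def process_queue(queue, db_path):
    conn = sqlite3.connect(db_path)
    try:
        while True:
            msg = browse_message(queue)          # message stays on the queue for now
            if msg is None:
                break
            try:
                conn.execute(
                    "INSERT INTO target_table (msg_id, payload) VALUES (?, ?)",
                    (msg.id, msg.body),
                )
                conn.commit()                    # the INSERT is now durable...
            except sqlite3.Error:
                conn.rollback()                  # ...but if it failed, leave the
                break                            # message on the queue for recovery
            remove_message(queue, msg.id)        # only now is the message deleted
    finally:
        conn.close()
```

Note the small window between the database commit and the destructive get: if the process dies right there, the message will be processed twice on restart, which is exactly why the recovery strategy deserves the careful review mentioned above.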

Once again, I haven’t mastered the uploading on this site, and had to rename MQCommitTestv8.dsx and TestDDL.txt with .pdf suffixes. Just rename them and you will be fine. They are both just simple ASCII files.

Ernie

Attachments: targetddl1, mqcommittestv8

“Almost” real-time Jobs

It must be Q4, and close to the holidays.   Questions come in like heavy rain, and it’s times like this that I see common patterns in the scenarios, whether they are from my IBM teammates helping customers understand our offerings, or from customers and users directly, trying to get things finished before taking well-deserved time off (what’s that? 🙂 )

At any rate, I was recently involved in a bunch of discussions around ETL jobs that wait for files, or loop thru directories on a continuous basis, reading their contents. This is a common scenario for reading XML input. Imagine a subdirectory that receives XML documents (all matching the same schema) throughout the night, via FTP from remote locations. It is desirable to have the job or map sitting there, waiting, until a file arrives, then immediately process that file and go back to waiting.

There are a myriad of solutions for handling this, usually with some type of polling mechanism. It’s not as pure and simple, however, as the classic “always on” Job that uses a standard API (MQSeries, JMS, etc.) with a built-in wait parameter. Those “always on” jobs are more predictable, across vendor, across protocol, and the pattern is well-known whether it’s developed with a graphical toolset or directly with C++. The polling solutions are as varied as there are creative uses of these tools…some possibilities are scripts that sleep, run, then loop and sleep again; remote daemons that wait for a specific file to arrive and then say “go” to the transformation job; or use of an external scheduler (cron, BMC Patrol, AutoSys, etc.) that monitors the conditions and kicks off the ETL process when appropriate. There may be some custom API-based solutions out there that force an ETL transformation such as a DataStage Job to blindly wait for input that “happens” to be from a subdirectory, but I haven’t seen any. Most of the polling solutions are looping batch jobs…they do their work, finish reading their input, complete normally, then terminate and are asked to wait again [or their launcher does the waiting]. They act like “always on,” but have less of the complexities because they are typically “very frequent batch.” Again, custom solutions may exist, but these polling solutions are fairly easy to build out-of-the-box using the tools themselves or in conjunction with system management products. Many people prefer to use the system management tooling because it’s better designed to handle circumstances such as machine re-starts, clean-up strategies, etc.
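For what it’s worth, the skeleton of such a polling loop is simple enough to sketch. This is a minimal Python illustration of the “very frequent batch” idea; the directory names, polling interval, and the process_file() hook are all stand-ins for whatever your real job invocation looks like:

```python
import glob
import os
import shutil
import time

INCOMING = "/data/incoming"      # assumed: where the FTP'd XML documents land
ARCHIVE = "/data/processed"      # assumed: where files go once handled
POLL_SECONDS = 30

def process_file(path):
    """Placeholder for invoking the real transformation job or map."""
    print(f"processing {path}")

def poll_forever():
    while True:
        # pick up whatever has arrived since the last pass, oldest name first
        for path in sorted(glob.glob(os.path.join(INCOMING, "*.xml"))):
            process_file(path)
            shutil.move(path, os.path.join(ARCHIVE, os.path.basename(path)))
        time.sleep(POLL_SECONDS)
```

A real implementation still has to worry about the things the system management tooling handles for you, such as files that are still being FTP’d when the loop wakes up, restarts after a machine failure, and clean-up of the archive directory.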

Simulating end-of-file for Real Time ETL Transformations

In my initial entry for this blog (https://dsrealtime.wordpress.com/2007/11/23/what-is-real-time-etl-anyway/), I wrote about some of the issues facing “always on” transformations that continuously read data from a real-time source (MQSeries, sockets, named pipes, etc.). Tools that provide a graphical metaphor for transformation design, and that find their origin in high-volume batch functionality (classic ETL tools fit this description), are often challenged by the need for a signal that terminates processing.

If you are just doing one-to-one loads of messages to an rdbms, this issue might not matter. But if you are concerned about individual units of work, have multiple rows “inside” a given message (like an XML document with repeating elements), or are processing individual service requests for a myriad of SOA clients, then something needs to be done to recognize these logical groupings of rows. Transactional tools that fire up tiny, entirely compiled modules (WebSphere TX being one example) were designed for this, but classic ETL tools, often with interpreted connectivity and performance features that require some ramp-up time, need to stay “always on” for maximum efficiency. Blocking functions, those that have to “wait” on all rows for completion, are particularly sensitive. These include Aggregations, Sorts, Pattern Matching, XML document creation, and others.

DataStage and QualityStage manage this by supporting a concept known as end-of-wave. Driven automatically by the receipt of a SOAP envelope, or under developer control by the reading of “n” messages or other factors, end-of-wave is a “signal” that is sent thru the Job, following all rows in a group, along every possible path. The end-of-wave signal makes all the downstream Stages “think” that processing is complete. The Stages that block by design (Aggregator, QualityStage matching, XMLOutput, etc.) are notified to go about their business of clean-up or processing “as though” they’ve drained an entire source and hit end-of-file. However, in the real-time pattern, they haven’t. End-of-wave is merely the signal that separates two requests from entirely independent users, or the related contents of one MQSeries message from another. The Job, as noted before, is “always on.” It simply continues running and immediately receives data for the next “wave.” This behavior is inherent in the Information Services Director, as it manages traffic from incoming SOA clients via SOAP or other bindings, and is directly available in Stages like the MQSeries Connector.
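If it helps to see the idea outside of the product, here is a toy Python sketch, purely illustrative, of what an end-of-wave-aware “blocking” operation does: it flushes its state at the end of each wave (one message’s worth of rows) instead of waiting for an end-of-file that never arrives. The read_messages() generator and the sample rows are assumptions standing in for an always-on source.

```python
from collections import defaultdict

def read_messages():
    """Stand-in for an always-on source; each yielded list is one message, i.e. one wave."""
    yield [("A", 10), ("A", 5), ("B", 7)]     # wave 1
    yield [("B", 2), ("C", 1)]                # wave 2

def emit(result):
    """Stand-in for sending results downstream (rdbms, response message, etc.)."""
    print("wave complete:", result)

def run_job():
    for rows in read_messages():              # one message == one wave
        totals = defaultdict(int)             # state of a "blocking" aggregation
        for key, amount in rows:
            totals[key] += amount
        emit(dict(totals))                    # end-of-wave: flush downstream, keep running

run_job()
```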

The following diagram pictorially represents the end-of-wave concept. Moving from left to right through the Job, end-of-wave follows each of the row groupings.

[Diagram: end-of-wave signals following each row grouping through the Job]

This is one way of handling the need for high-performance parallel transformations in a real-time scenario where volume is defined as lots of concurrent, yet independent, sets of rows…while the same transformations and tooling are re-used for massive volumes (read: hundreds of gigabytes or multiple terabytes) in batch. There are other approaches, I’m sure, but be certain that the tool you are using has a way to deal with it.

Reading XML Content as a Source

Reading XML Content as a Source can sometimes be a challenge for ETL types of tools. There are various reasons, but the most common issue is that XML documents can optionally contain carriage return and line feed characters (typically hex 0D0A) to make them look “pretty” when opened in common editors. These carriage return and line feed characters are ignored by XML processors, but are typically respected by default flat-file ETL tooling.
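A tiny Python illustration of the difference (the sample document here is an assumption, not from the original post): a line-oriented reader sees several “records” in a pretty-printed document, while an XML parser treats the whole thing as one unit and ignores the formatting whitespace.

```python
import xml.etree.ElementTree as ET

# An assumed, pretty-printed document with CR/LF (hex 0D0A) between tags.
doc = '<orders>\r\n  <order id="1"/>\r\n  <order id="2"/>\r\n</orders>\r\n'

print(len(doc.splitlines()))          # a line-oriented reader sees 4 "records"
root = ET.fromstring(doc)
print(len(root.findall("order")))     # an XML parser sees exactly 2 orders
```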

Fortunately, with the increased popularity of XML as a source, these problems have largely gone away. Most ETL tools by now have a way of reading XML files from disk, or directly from message queues and other sources, so the formatting characters are no longer a problem.

DataStage in particular has a variety of techniques that can be used to read XML. Which one you choose will depend largely on which type of DataStage you are using and whether you want to read in a single document or a long list of them in a subdirectory. The most common methods I’ve seen for reading in XML in DataStage are:

Folder Stage. For Server Jobs, this is the easiest. The Folder Stage looks for a document or documents on disk based on a subdirectory entry and a wildcard argument for a filename. Both can be parameterized. Look for a built-in Table Definition called Folder that contains the two columns returned by the Stage… FileName and Record. FileName returns the actual and optionally fully qualified filename for the document, and Record contains the entire contents of the file….all of its tags, characters, line feeds, etc. Everything. This is then handed as a unit to the XMLInput Stage where the elements and attributes are deciphered. If the documents are really large, it may simply be better to send only the filename, and ask the XMLInput Stage to do the actual I/O.
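As a rough analogue of what the Folder Stage hands downstream, here is a minimal Python sketch; the column names FileName and Record mirror the built-in Table Definition described above, and everything else is an assumption for illustration.

```python
import glob
import os

def folder_rows(directory, pattern="*.xml"):
    """Yield one row per file: the fully qualified FileName and the whole Record."""
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path, encoding="utf-8") as handle:
            yield {"FileName": os.path.abspath(path), "Record": handle.read()}
```

For very large documents, the same idea applies to the option mentioned above: yield only the filename and let the XML parser do the actual I/O.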

External Source Stage. I’ve found this Stage to be the easiest to use for Enterprise Edition (EE) Jobs. It easily allows me to issue an “ls” command for a particular subdirectory, and then send the resulting list to a waiting XMLInput Stage downstream. The Specific Program property contains the string “ls #xmlSubDirectory#*.xml | sort” (without the quotes), and of course could do other things like filter the list via grep. I place the results into a column on the Output link called resultingFileNames, with a varchar datatype and length of 200.
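The equivalent of that Specific Program property, expressed as a small Python sketch just to show what the command is producing; the optional contains filter plays the role of the grep step, and the names are assumptions.

```python
import glob
import os

def resulting_file_names(xml_sub_directory, contains=None):
    """Sorted list of XML file names for the subdirectory, one per output row."""
    names = sorted(glob.glob(os.path.join(xml_sub_directory, "*.xml")))
    if contains:                       # equivalent of piping the list through grep
        names = [name for name in names if contains in name]
    return names
```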

Update: This technique was also recently documented in the XML Stage Redbook, section 7.1.2, page 152. The Redbook can be found at the following link:

XML Redbook

MQSeries Stage. In this case I’m reading the XML content from an MQSeries message queue. It’s just a “chunk” of character bytes as far as MQ and DataStage are concerned. The chunk received is then sent into XMLInput. This can be limiting because MQSeries has its own 100MB limit on message size, although this can be alleviated with various MQSeries techniques like message segmentation.
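The same pattern in miniature, as a hedged Python sketch: the get_message() call is a hypothetical placeholder for whatever MQ client library you use, and each payload is handed to the XML parser as one complete document.

```python
import xml.etree.ElementTree as ET

def get_message(queue_name):
    """Hypothetical placeholder: return the next message payload as bytes, or None."""
    raise NotImplementedError

def read_xml_messages(queue_name):
    while True:
        payload = get_message(queue_name)      # just a chunk of bytes to MQ
        if payload is None:
            break
        yield ET.fromstring(payload)           # the whole chunk is one XML document
```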

Why use ETL for Real Time? …for Metadata support!

ETL tools were designed for back-room, nightly batch processing, right? Yes…maybe….I suppose. If you look at their history, with most ETL tooling born in the decision support and data warehousing world, the biggest challenges were for point-in-time refreshes and loading of vast amounts of information. However, requirements have evolved, missions have changed, and ETL is no longer used only for decision support. Indeed, a certain percentage of sites never have used ETL for data warehousing, even if that is admittedly still a large segment of the application for such tools and technologies. Today, ETL is a great choice for real-time, and it’s safe to say that the tools are now being designed for top notch real-time functionality. I’d like to just stop using the term “ETL” (or ELT, ETML and some of the other acronyms that have been floating around for years)! It’s not your father’s ETL anymore……..[but terms stick, so for now we’ll go with it unless any of you have better suggestions for us and our friends at the analysts 🙂 ].

If not ETL for Real Time, what else? A lot has already been written on ETL (Extract Transform Load) vs EAI (Enterprise Application Integration), with ETL generally being credited with better high volume abilities, and EAI better at complex, multi-construct (occurs, record types) sources and targets, and other pros and cons for either. As I learn more about how to manage this site I’ll create a page with my favorite links on this subject. In many of these comparisons, real-time often defaults to the EAI category.

However, what is often overlooked in this comparison is a pair of “soft” issues: the user community (your teammates who will actually be doing the development) and the requirements for metadata management. While there are exceptions, ETL tools “tend” to be used by what I like to refer to as “data professionals.” These are folks who may have formal programming backgrounds, but gravitated to their role in the enterprise because they understand the business and they know the data. With their initial focus on business intelligence, ETL tools (I know, beauty is in the eye of the beholder) are often more inviting to this type of user. Not an “end user” by any means, but also not the user who is typically comfortable with C header files, Java types, and code snippets. ETL vendors have competed for years on the usability issue. Their success with DBAs and more technical end users is a testament to their appeal.

The other “soft” issue worth noting as ETL moves into “real time” is the support for metadata. No longer is metadata something that people merely pay lip service to. Data lineage and impact analysis (the ability to link a column name to a real-time Service, its rdbms target, its ERwin model AND its business intelligence report) are unique to ETL tools. Most EAI-type tools, until recently, could hardly spell metadata, let alone provide impact analysis and data lineage reporting from soup to nuts. This is changing, but deep metadata reporting has been a key component in the data warehousing space (and thus has received massive investment from ETL vendors) for ten years or more.

Data Governance, regulatory compliance, and metadata management are on everyone’s minds. We can’t pay lip service to metadata and data lineage for any kind of data integration. SOA and real-time data integration need the deep metadata support provided by ETL tooling, as much as business intelligence applications do.

Increasingly, ETL tools, and the platforms they operate in, are being chosen for real-time data integration because of their support for metadata, and because “data professionals” prefer these tools over their “closer-to-the-code” IDE cousins for programming development.

Ernie

What is Real Time ETL anyway?

What is Real Time ETL? What does it mean? This question keeps coming up in discussions with customers and prospects, for enterprises large and small, and with tool jockeys and home grown coders. It surfaces in debates about EAI vs ETL (subject for another blog), Changed Data Capture, transactional vs batch processing, and more. I won’t debate the definitions of real-time, right-time, real-time data warehousing, active data warehousing, just-in-time or near-real-time — a lot of really smart people have already been there. I just want to look at what people are actually doing, and calling, Real Time ETL.

Trying to formally define real time isn’t easy — there are so many points of view, and critical differences based on industry segment. Those of us in the commercial “data world” spend lots of time discussing the finer points of “real time”….however, I stopped trying to come up with a single definition after reading pure academic and engineering definitions of “real-time computing” that talked about robotic arms in an assembly line reacting in microsecond “real time” to things like minute temperature changes!

I’d like to reflect here instead on the technical aspects of common patterns that those of us in the data integration space run into regarding Real-Time ETL, and mention some of the gotchas that often go overlooked. I see four basic “patterns” that, depending on your point of view and problem you are trying to solve, qualify as Real Time ETL:

  • Frequently executed ETL processes (i.e. every 5 minutes, every minute, or every 10 seconds). Really a “batch” pattern, but run in small windows with tiny (by comparison to large batch loads) quantities of data.
  • Messaging or other “continually live” medium as a Source.
  • Messaging or other “continually live” medium as a Target.
  • Request/Response with a continually live medium on either end (Source and Target).

The second one above interests me right now, as I’ve had numerous questions on this subject in the past few days. I want to speak here about the technical definition for jobs, maps, procedures (or whatever you call your ETL processes) that need to “read” data from a commonly accepted “real time” technology. Real-time sources may be popular messaging engines, such as MQSeries, TIBCO Rendezvous, or MSMQ, or Java-based standards such as JMS, or more custom solutions such as sockets or even named pipes. Most ETL tools can access these, or provide extensions that make it possible to utilize some of the lesser-known APIs.

This is the most commonly requested pattern. When someone says “I need Real-Time ETL,” it generally turns out that they want to “read” from such a source. Reasons for needing it vary. Some sites desire immediate updates to decision support systems or portals, while others are merely “dipping” into an available source that is passing through for other purposes. An already-built MQSeries infrastructure, shipping messages between applications, is often the perfect source of data for ETL, whether the objective is immediate updates or not. It’s just “there” and available…and simpler to get than trying to wrestle with security folks for access to source legacy systems. Of course there are hundreds of variants, whether the target is decision support oriented (data warehouse or datamart) or ERP (such as SAP). Either way, I’m talking about a persistent target.

Regardless of the reasons, such ETL processes have to deal with issues like the following:

  • Always On. Typically an initialization issue. ETL tools do a lot of preparation when they start…they validate connections, formally “PREPARE” their SQL, load data into memory, establish parallel processes, etc. Twenty seconds of initialization may be acceptable in a 45 minute batch job that processes ½ gigabyte. In a real time scenario, that’s unacceptable. You can’t afford to perform all of that initialization for every message or packet….it needs to be done once, then leave the process “always on” and waiting for new data. I like to think of it “floating” while it waits. Of course, this invites other problems…
  • End-of-file processing for “blocking” functionality. If you have an “always on” job, what do you do if someone wants to use an aggregation or sum() function? How does the process know when it’s finished and can flush rows thru such an operation? This is particularly critical when we move on to Web Services in the request/response pattern, but equally important when reading messages that contain multiple rows, such as when the message payload is a complex XML document.
  • Live vs buffered or in-memory lookups. A common technique for performance in large-volume batch processes is to bring values into memory. The same performance concerns apply in “always on” jobs, but consider that “always on” means needing a strategy to refresh that in-memory copy. Or else ensure that a constant connection to the original source is feasible and performs well…and that the DBA who owns the real-time source won’t kill your long-running database connection in an “always on” scenario. (A rough sketch of one refresh strategy follows this list.)
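By way of illustration only, here is a minimal Python sketch of one refresh strategy for the in-memory lookup problem; the load_reference() callable and the refresh interval are assumptions, not anything from the original post.

```python
import time

REFRESH_SECONDS = 300        # assumed staleness threshold

class CachedLookup:
    """In-memory reference data that is reloaded when it gets stale."""

    def __init__(self, load_reference):
        self._load = load_reference            # callable that queries the real source
        self._data = {}
        self._loaded_at = float("-inf")        # force a load on the first lookup

    def get(self, key, default=None):
        if time.monotonic() - self._loaded_at > REFRESH_SECONDS:
            self._data = self._load()          # refresh the in-memory copy
            self._loaded_at = time.monotonic()
        return self._data.get(key, default)
```

The same shape also speaks to the first bullet: the expensive load happens once (or rarely), while the per-row path stays a cheap in-memory probe.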

These aren’t the only issues, and there are numerous ways of dealing with them. Make sure the tool or techniques you choose give you ways to deal with these problems. Next time I’ll share my notes on these issues and the other real-time patterns in more detail.
