What is Real Time ETL anyway?

What is Real Time ETL? What does it mean? This question keeps coming up in discussions with customers and prospects, for enterprises large and small, and with tool jockeys and home grown coders. It surfaces in debates about EAI vs ETL (subject for another blog), Changed Data Capture, transactional vs batch processing, and more. I won’t debate the definitions of real-time, right-time, real-time data warehousing, active data warehousing, just-in-time or near-real-time — a lot of really smart people have already been there. I just want to look at what people are actually doing, and calling, Real Time ETL.

Trying to formally define real time isn’t easy — there are so many points of view, and critical differences based on industry segment. Those of us in the commercial “data world” spend lots of time discussing the finer points of “real time”….however, I stopped trying to come up with a single definition after reading pure academic and engineering definitions of “real-time computing” that talked about robotic arms in an assembly line reacting in microsecond “real time” to things like minute temperature changes!

I’d like to reflect here instead on the technical aspects of common patterns that those of us in the data integration space run into regarding Real-Time ETL, and mention some of the gotchas that often go overlooked. I see four basic “patterns” that, depending on your point of view and problem you are trying to solve, qualify as Real Time ETL:

  • Frequently executed ETL processes (ie. every 5 minutes, one minute, or every 10 seconds). Really a “batch” pattern, but run in small windows with tiny (by comparison to large batch loads) quantities of data.
  • Messaging or other “continually live” medium as a Source.
  • Messaging or other “continually live medium as a Target.
  • Request/Response with a continually live medium on either end (Source and Target).

The second one above interests me right now, as I’ve had numerous questions on this subject in the past few days. I want to speak here about the technical definition for jobs, maps, procedures (or whatever you call your ETL processes) that need to “read” data from a commonly accepted “real time” technology. Real time sources may be popular messaging engines, such as MQSeries, TIBCO Rendevous, or MSMQ, or java based standards such as JMS, or more custom based solutions such as sockets or even named pipes. Most ETL tools can access these, or provide extensions that make it possible to utilize some of the lesser known APIs.

This is the most commonly requested pattern. When someone says “I need Real-Time ETL,” it generally turns out that they want to “read” from such a source. Reasons for needing it vary. Some sites desire immediate updates to decision support systems or portals, while others are merely “dipping” into an available source that is passing through for other purposes. An already built MQ Series infrastructure, shipping messages between applications, are often the perfect source of data for ETL, whether the objective is immediate updates or not. It’s just “there” and available…and simpler to get than trying to wrestle with security folks for access to source legacy systems. Of course there are hundreds of variants, whether the target is decision support oriented (data warehouse or datamart), or ERP (such as SAP). Either way I’m talking about a persistent target.

Regardless of the reasons, such ETL processes have to deal with issues like the following:

  • Always On. Typically an initialization issue. ETL tools do a lot of preparation when they start…they validate connections, formally “PREPARE” their SQL, load data into memory, establish parallel processes, etc. Twenty seconds of initialization may be acceptable in a 45 minute batch job that processes ½ gigabyte. In a real time scenario, that’s unacceptable. You can’t afford to perform all of that initialization for every message or packet….it needs to be done once, then leave the process “always on” and waiting for new data. I like to think of it “floating” while it waits. Of course, this invites other problems…
  • End-of-file processing for “blocking” functionality. If you have an “always on” job, what do you do if someone wants to use an aggregation or sum() function? How does the process know when it’s finished and can flush rows thru such an operation? This is particularly critical when we move on to Web Services in the request/response pattern, but equally important when reading messages that contain multiple rows, such as when the message payload is a complex XML document.
  • Live vs buffered or in-memory lookups. A common technique for performance in large volume batch processes is to bring values into memory. Same issues for performance in “always on” jobs, but consider that “always on” means needing a strategy to refresh that in-memory copy. Or else ensure that a constant connection to the original source is feasible and performs well….and that the DBA who owns the real time source won’t kill your long running database connection in an “always on” scenario.

These aren’t the only issues, and there are numerous ways of dealing with them. Make sure the tool or techniques you choose give you ways to deal with these problems. Next time I’ll share my notes on these issues and the other real-time patterns in more detail.

Posted in etl, mq, RealTime. Tags: , , . 7 Comments »

7 Responses to “What is Real Time ETL anyway?”

  1. Darlene Says:

    Time-ly and interesting! Thanks, Ernie – keep the information coming.

  2. Michael Curry Says:

    Good summary. Can you comment on techniques to retain snapshot integrity when you do real-time ETL? Also, how do you deal with indices and aggregates?

  3. dsrealtime Says:

    Thank you Michael.

    Interesting question. I’d be remiss if I didn’t ask exactly what you mean by “retain snapshot integrity,” but I’ll take a crack at it, assuming for now that you are considering it from the replication point of view.

    Retaining snapshot integrity, if we think of it from a hot backup and “replication” perspective, is not usually the domain of ETL, mostly because significant transformation is not often the requirement of snapshots…and if the transformations are light or non-existent, there are other tools for simple capture and apply that are often more applicable to the task.

    However (assuming replication and preservation of the snapshot of rows or tranactions for replication-like functionality meets your definition above), if we need to address this with ETL, I would first consider not using ‘always on’ functionality as defined in the original posting. If a “snapshot” is defined for a realtime source, it often means that the contents of the queue are “sliced” into it on a scheduled period. Whether this period is 5 minutes or 5 days, the entire “contents” of that queue are critical as a single unit of work. An ETL process sheduled along the same basis can treat the whole “contents” of the queue (or some filtered value that relates the messages together) as a single “set,” applying all or none of the rows in case of a failure. The ETL process in this case is effectively treating the queue source as a flat file.

    …and of course, I haven’t addressed yet the “two phase” commit issues that arise when reading a persistent queue capable of assured delivery and using an ETL tool to apply such rows/records/messages to an rdbms. Hmm. That’s probably worthy of its own posting. 🙂

    Indexes. Classic ETL loading procedures invite age-old techniques like removing indexes before bulk loads, then rebuilding them later for improved load times. In the 24 by 7 “real time” pattern, the indexes are usually left “on.” There may not be any reasonable time to stop the loading and then re-create them (as there would be during a scheduled batch window), or the ETL procedure itself may need certain indecies for lookup purposes. Other applications running concurrently may also need them, and perhaps most important, the real-time “trickle feed” ETL pattern is often chosen to avoid the need for a one-time huge batch load…. rows are coming in all day long instead of piling up, and the performance hit is not as great.

    I’ll save aggregation for an expanded posting and discussion of ‘always on’ and ‘blocking’ functionality as noted above.


  4. Minhajuddin Says:

    Thanks for sharing this information Ernie!!
    Looking forward for more such posts.
    And thanks to Vincent for directing me to this blog!

  5. 2010 in review « Real-Time Data Integration Says:

    […] What is Real Time ETL anyway? November 20074 comments 4 […]

  6. gammydodger Says:

    A recent post on GigaOm discusses Big Data in Real Time: http://gigaom.com/cloud/big-data-in-real-time-is-no-fantasy/ – focusing on the new Hadoop type tools available for fast high volume processing.

    First Retail has been working on technologies for real-time classification at high volume – which means that data from any source can be mapped to an existing schema as it is fed in. So streamed data from databases, social media or any structured or unstructured source can be fed into a database or warehouse for further analysis.

    More information here: http://www.firstretail.com/blog/Time_for_Semantic_ETL

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: