ETL vs EAI for Real Time …revisiting an old subject…

Here’s a topic I hear less and less about these days. There are many reasons for this, the most likely being that the lines continue to blur between these technologies, especially now that various independent vendors in each space have been absorbed or morphed into other things. However, it raises its head every once in awhile, and even if classic “ETL” (extraction-transformation-load) and “EAI” (enterprise application integration) are under the same tool umbrella, it’s likely that choices will still need to be made as to the use of underlying features and capabilities.

Thought I’d share some thoughts that were discussed on a recent conference call with a site who is working thru the “ETL vs EAI for Real Time” decision tree. These are some of the major issues to consider (no particular priority attached to their order):

Protocol/medium. MQ Series? JMS? Tibco? Socket? Named Pipe? Sonic? MSMQ? other? Many of the EAI tools that still exist have more of these protocols “in the box.” ETL tools — not always. How hard or easy is it to extend your choice of tool.

The “shape” of the source. Is it relational in nature (rows and columns)? or hierarchical (tree like structure)? XML? COBOL with multiple record types, OCCURS, variable lengths? …or fixed Format? Both ETL and EAI typically handle these, consuming metadata from XML Schema, COBOL, etc. Exceptions would be things like SWIFT or EDI and EDIFACT, where EAI-style tools such as IBM WebSphere Transformation Extender typically have legacy history and metadata import capabilities. Shape of the source is important, but becomes even more significant in light of the next item.

The “shape” of the target. Is it relational (rows and columns)? Tables in an rdbms? Flat file? …or is it also XML with a complex multi-path hierarchy or COBOL with OCCURS depending on, variable length and mulitiple record types, or SWIFT, EDI, EDIFACT, HIPPA, etc.? ETL tools are best at performing relational work. Most can all do hierarchical as well — but often at the price of complexity. If the source and the target are hierarchical, how many transformations of this type are you doing? Just a few, or 98% of your project?

Data Volume.ETL tools excel here. Parallelism and grid work is a given. Being able to manage huge volumes of data, multiple terabytes, in batch, are where they are regularly exercised. EAI tools typically do not fare well in huge volume scenarios, or at least not without crafting parallelism and other configurations that are out-of-the-box with tools like DataStage.

Units of Work.Transactional paradigms. Delete from one rdbms table and write to another. Commit. If either resource fails, rollback everything. Delete from one queue and write to another, or to a relational table. Commit. If either resource fails, rollback. ETL tools can do some of this and do it well, but are often limited in the choices of queues and databases supported. In EAI tools, the semantics are usually simpler and more at the “core” of any EAI tool, richer in resource support (which databases, queuing systems, etc.) and often provide more flexibility and control to react to failures.

Skills.Beauty is in the eyes of the beholder…and ease of use/learning curve is no different. Still, ETL tools, while far from being designed for ‘end-users,’ have typically appealed to “data professionals” with technical skills but not necessarily hard core programming skills. EAI tooling is usually camped in the more technical domain. There is a lot of grey area and overlap here, but the history is clear — EAI tools have a rich development history of providing ensured delivery for binary EDI formats running 24 by 7 — and ETL tools have a rich history of providing consumable transformations for large volumes of data to support business intelligence initiatives. Enough said.

Long ago I recall a great write-up from an analyst who concluded that most companies require both technologies because the goals of each tool are so different. Years later, many companies have both technologies in their toolbox — so it may not be a purchasing decision as much as an implementation one. There are no wrong answers here — just answers that might lead to a bumpier road down the line.

Good luck as you review your variables.

Ernie