learning more about blogging

Whew.  Needed to put an entry in here for myself as a reminder and to keep track of the overwhelming set of concepts and issues that I’m going thru to figure out how best to manage blogging so that I can be productive and yet still enjoy leaving informative bits and pieces here on the web.   In just two weeks’ time, I’ve learned a wealth of information and also piled a lot more things on my doorstep.  Thanks to the bloggers and non-bloggers alike who have pointed me in various directions.

Where to put your blog?  Why did you pick WordPress?   Well, it was free, for starters, and seemed to have some good features, after doing a few reviews. 

Are you going to host it yourself?  I’m trying to find time to spend on the blog — host my own web site?  Not happening.  Hats off to all of you who do.

How many blogs?   Personal one, technical, a little bit of both?   Internal to IBM or external?  A lot of thoughts here, and I’m still formulating ideas.   I’m leaning now towards having several.  Maybe no one else will read ‘em, but I need a place to rant about the NHL!

Technorati Tags.   That’s a whole new one to me.   Blogging begets blogging…and a need to find other bloggers…and to index your own blogs.  Still learning about this… see www.technorati.com .

Blog Clients.  Who knew?   What if you want to blog “offline?”  There are tools made for the purpose!  About to try BlogDesk.   And here I thought Notepad would be effective for cut/paste.  Ha!

Categories, RSS, etiquette, security, blogrolls.  It’ll take a while.  Can’t spend so much time learning about blogging if it takes away from continually learning more about realtime with Information Server!  :)

Tips for using Web Services Pack with DataStage Part I: Getting Started

Thought I’d start sharing the notes I’ve collected over the years of working with the Web Services Pack.  This is the ability for DataStage to be a SOAP Client; for a DataStage Job to reach out and invoke a Web Service located “out there in the ether.”  That phrase is a reference to the benefits of Web Services: you don’t have to know where the Service is located, what machine it’s on, who wrote it, or what language it’s written in.  That doesn’t mean knowing those things is a bad idea, especially if your business is relying on the Service, but I think you get the point.   Give me a WSDL document, and that should be enough to invoke some type of remote function [this is what you might call the opposite of WISD, or RTI, which allow you to publish a DataStage Job or other Information Server asset "as" a Web Service].

Before you get started with it, ask yourself how comfortable you are with Web Services technology.   I’m speaking here about SOAP over HTTP in particular, the protocol supported by WS Pack, but it would be wise to get yourself a good intro to all of SOA.    There are good books out there, but also thousands of great resources here on the web.   Just use your favorite search engine to look for something like “Web Services Introduction XML,” and poke around.  You’ll find something for everyone.

Then you need the software.  If you are a release 8 user, it’s just “there.”   If 7.5, you have to speak with your account team about downloading it, but it won’t cost you anything.   The installation is simple, although be reminded that there is a client side and server side component.

Once installed, you’ll have two new Stages, the WSClient Stage and the WSTransformer Stage.  WSClient is for Web Services that will be a source or target to your Job, and the WSTransformer is for those Services that you expect to invoke on a row-by-row basis.

I have to put in a plug for the documentation.  The WSPack documentation is very thorough.  It has some very easy-to-understand graphics that do their own nice job of introducing Web Services in general, as well as how WSDL, SOAP, SOAP Envelopes and SOAP Bodies come into play.  What I’d like to add in these entries are some other points to consider as you prepare to include Web Services in your jobs, to help you be more successful.

A standalone Web Services testing tool is a good idea too, unless you are already comfortable using Web Services from a common IDE for Java, C#, .NET, etc.

Finally, find yourself a good Web Service to start with, just to learn the mechanics.  There are some great ones out at www.xmethods.net .  You can test them there to see how they function, and get a solid idea of what their input and output requirements are.   There’s a nice mix of publicly available services there; some are free, others are part of a business.  The free ones are sponsored by folks who are making known their expertise.  Some very solid consulting firms and companies with Web Services expertise are represented there.  I’ve seen many of the services hosted at that site continue to function and be supported for more than five years.  For your first foray into Web Services Pack, pick one that has very few input and output arguments, accepts and returns one row (a perfect candidate for WSTransformer) and doesn’t require you to pay anything or come up with an access code.
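To make the mechanics a little more concrete, here is a minimal Python sketch of what a one-row-in, one-row-out SOAP call boils down to: wrap the input arguments in a SOAP Envelope and Body, then pull the output arguments back out of the response. This is only an illustration, not how the WSTransformer Stage works internally. The operation name GetQuote, the namespace http://example.com/quotes and the canned response are all made up, and no network call is made.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SVC_NS = "http://example.com/quotes"  # hypothetical service namespace (assumption)


def build_request(operation, args):
    """Wrap one row of input arguments in a SOAP 1.1 Envelope and Body."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{SVC_NS}}}{operation}")
    for name, value in args.items():
        ET.SubElement(op, f"{{{SVC_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def parse_response(xml_text):
    """Return the output arguments (children of the first Body element) as a dict."""
    root = ET.fromstring(xml_text)
    body = root.find(f"{{{SOAP_NS}}}Body")
    result = body[0]
    # Strip the "{namespace}" prefix from each tag to get a plain column name.
    return {child.tag.split("}")[-1]: child.text for child in result}


# Canned response standing in for what a real service would send back.
SAMPLE_RESPONSE = f"""<soap:Envelope xmlns:soap="{SOAP_NS}">
  <soap:Body>
    <q:GetQuoteResponse xmlns:q="{SVC_NS}">
      <q:price>101.5</q:price>
    </q:GetQuoteResponse>
  </soap:Body>
</soap:Envelope>"""

request_xml = build_request("GetQuote", {"symbol": "IBM"})
row = parse_response(SAMPLE_RESPONSE)
print(request_xml)
print(row)
```

A service with only a couple of arguments keeps both documents this small, which is why it makes a good first test.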

This entry is getting long.  Next time I’ll walk you thru access to the Service and then discuss how to prepare for a more complex one that you might find inside your enterprise.

Why use ETL for Real Time? …for Metadata support!

ETL tools were designed for back-room, nightly batch processing, right? Yes…maybe…I suppose. If you look at their history, with most ETL tooling born in the decision support and data warehousing world, the biggest challenges were point-in-time refreshes and the loading of vast amounts of information. However, requirements have evolved, missions have changed, and ETL is no longer used only for decision support. Indeed, a certain percentage of sites have never used ETL for data warehousing, even if warehousing is admittedly still a large share of what such tools and technologies are used for. Today, ETL is a great choice for real-time, and it’s safe to say that the tools are now being designed for top-notch real-time functionality. I’d like to just stop using the term “ETL” (or ELT, ETML and some of the other acronyms that have been floating around for years)! It’s not your father’s ETL anymore…[but terms stick, so for now we'll go with it unless any of you have better suggestions for us and our friends at the analysts :) ].

If not ETL for Real Time, what else? A lot has already been written on ETL (Extract Transform Load) vs EAI (Enterprise Application Integration), with ETL generally being credited with better high volume abilities, and EAI better at complex, multi-construct (occurs, record types) sources and targets, and other pros and cons for either. As I learn more about how to manage this site I’ll create a page with my favorite links on this subject. In many of these comparisons, real-time often defaults to the EAI category.

However, two “soft” issues are often overlooked in this comparison: the user community (your teammates who will actually be doing the development) and the requirements for metadata management. While there are exceptions, ETL tools “tend” to be used by what I like to refer to as “data professionals.” These are folks who may have formal programming backgrounds, but gravitated to their role in the enterprise because they understand the business and they know the data. With their initial focus on business intelligence, ETL tools (I know, beauty is in the eye of the beholder) are often more inviting to this type of user. Not an “end-user” by any means, but also not the user who is typically comfortable with C header files, Java types and code snippets. ETL vendors have competed for years on the usability issue. Their success with DBAs and more technical end users is a testament to their appeal.

The other “soft” issue worth noting as ETL moves into “real time” is the support for metadata. No longer is metadata something that people merely pay lip service to. Data lineage and impact analysis, meaning the ability to link a column name to a real-time Service, its RDBMS target, its ERwin model AND its business intelligence report, are unique to ETL tools. Most EAI type tools, until recently, could hardly spell metadata, let alone provide impact analysis and data lineage reporting from soup to nuts. This is changing, but deep metadata reporting has been a key component in the data warehousing space (and thus receiving massive investment from ETL vendors) for ten years or more.

Data Governance, regulatory compliance, and metadata management are on everyone’s minds. We can’t pay lip service to metadata and data lineage for any kind of data integration. SOA and real-time data integration need the deep metadata support provided by ETL tooling, as much as business intelligence applications do.

Increasingly, ETL tools, and the platforms they operate in, are being chosen for real-time data integration because of their support for metadata, and because “data professionals” prefer these tools over their “closer-to-the-code” IDE tool cousins for development.

Ernie

What is Real Time ETL anyway?

What is Real Time ETL? What does it mean? This question keeps coming up in discussions with customers and prospects, for enterprises large and small, and with tool jockeys and home grown coders. It surfaces in debates about EAI vs ETL (subject for another blog), Changed Data Capture, transactional vs batch processing, and more. I won’t debate the definitions of real-time, right-time, real-time data warehousing, active data warehousing, just-in-time or near-real-time — a lot of really smart people have already been there. I just want to look at what people are actually doing, and calling, Real Time ETL.

Trying to formally define real time isn’t easy — there are so many points of view, and critical differences based on industry segment. Those of us in the commercial “data world” spend lots of time discussing the finer points of “real time”….however, I stopped trying to come up with a single definition after reading pure academic and engineering definitions of “real-time computing” that talked about robotic arms in an assembly line reacting in microsecond “real time” to things like minute temperature changes!

I’d like to reflect here instead on the technical aspects of common patterns that those of us in the data integration space run into regarding Real-Time ETL, and mention some of the gotchas that often go overlooked. I see four basic “patterns” that, depending on your point of view and problem you are trying to solve, qualify as Real Time ETL:

  • Frequently executed ETL processes (i.e., every 5 minutes, every minute, or every 10 seconds). Really a “batch” pattern, but run in small windows with tiny (by comparison to large batch loads) quantities of data.
  • Messaging or other “continually live” medium as a Source.
  • Messaging or other “continually live” medium as a Target.
  • Request/Response with a continually live medium on either end (Source and Target).

The second one above interests me right now, as I’ve had numerous questions on this subject in the past few days. I want to speak here about the technical definition for jobs, maps, procedures (or whatever you call your ETL processes) that need to “read” data from a commonly accepted “real time” technology. Real time sources may be popular messaging engines, such as MQSeries, TIBCO Rendezvous, or MSMQ, or Java-based standards such as JMS, or more custom solutions such as sockets or even named pipes. Most ETL tools can access these, or provide extensions that make it possible to utilize some of the lesser known APIs.

This is the most commonly requested pattern. When someone says “I need Real-Time ETL,” it generally turns out that they want to “read” from such a source. Reasons for needing it vary. Some sites desire immediate updates to decision support systems or portals, while others are merely “dipping” into an available source that is passing through for other purposes. An already built MQSeries infrastructure, shipping messages between applications, is often the perfect source of data for ETL, whether the objective is immediate updates or not. It’s just “there” and available…and simpler to get than trying to wrestle with security folks for access to source legacy systems. Of course there are hundreds of variants, whether the target is decision support oriented (data warehouse or datamart), or ERP (such as SAP). Either way I’m talking about a persistent target.

Regardless of the reasons, such ETL processes have to deal with issues like the following:

  • Always On. Typically an initialization issue. ETL tools do a lot of preparation when they start…they validate connections, formally “PREPARE” their SQL, load data into memory, establish parallel processes, etc. Twenty seconds of initialization may be acceptable in a 45 minute batch job that processes ½ gigabyte. In a real time scenario, that’s unacceptable. You can’t afford to perform all of that initialization for every message or packet….it needs to be done once, then leave the process “always on” and waiting for new data. I like to think of it “floating” while it waits. Of course, this invites other problems…
  • End-of-file processing for “blocking” functionality. If you have an “always on” job, what do you do if someone wants to use an aggregation or sum() function? How does the process know when it’s finished and can flush rows thru such an operation? This is particularly critical when we move on to Web Services in the request/response pattern, but equally important when reading messages that contain multiple rows, such as when the message payload is a complex XML document.
  • Live vs buffered or in-memory lookups. A common technique for performance in large volume batch processes is to bring values into memory. Same issues for performance in “always on” jobs, but consider that “always on” means needing a strategy to refresh that in-memory copy. Or else ensure that a constant connection to the original source is feasible and performs well….and that the DBA who owns the real time source won’t kill your long running database connection in an “always on” scenario.
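The three issues above can be sketched together in a few lines of Python. This is a toy illustration under stated assumptions, not any particular tool's implementation: the END_OF_WAVE and SHUTDOWN markers, the RefreshingLookup class and the store/amount fields are all hypothetical names invented for the example, and an in-process queue stands in for a real messaging source. The loop does its setup once and then blocks on the source, an explicit marker tells the sum when it may flush, and the in-memory lookup reloads itself once it is older than a chosen age.

```python
import queue
import time

END_OF_WAVE = object()  # hypothetical marker: one logical unit of work is complete
SHUTDOWN = object()     # hypothetical marker: stop the always-on loop


class RefreshingLookup:
    """In-memory lookup that reloads itself once it is max_age seconds old."""

    def __init__(self, loader, max_age, clock=time.monotonic):
        self._loader = loader
        self._max_age = max_age
        self._clock = clock
        self._data = loader()          # expensive load happens once, up front
        self._loaded_at = clock()

    def get(self, key, default=None):
        if self._clock() - self._loaded_at >= self._max_age:
            self._data = self._loader()    # refresh the stale in-memory copy
            self._loaded_at = self._clock()
        return self._data.get(key, default)


def run_always_on(source, lookup, emit):
    """Stay up, block on the source, and flush a running sum at each END_OF_WAVE."""
    totals = {}
    while True:
        item = source.get()            # blocks ("floats") until data arrives
        if item is SHUTDOWN:
            break
        if item is END_OF_WAVE:
            emit(dict(totals))         # the aggregation now knows it can flush
            totals.clear()
            continue
        region = lookup.get(item["store"], "UNKNOWN")
        totals[region] = totals.get(region, 0) + item["amount"]


# Usage: two "waves" of messages pass through one long-running process.
source = queue.Queue()
for message in [
    {"store": "S1", "amount": 10},
    {"store": "S2", "amount": 5},
    {"store": "S1", "amount": 7},
    END_OF_WAVE,
    {"store": "S9", "amount": 1},
    END_OF_WAVE,
    SHUTDOWN,
]:
    source.put(message)

lookup = RefreshingLookup(lambda: {"S1": "EAST", "S2": "WEST"}, max_age=300)
results = []
run_always_on(source, lookup, results.append)
print(results)
```

The marker is the interesting design choice here: without some agreed signal in the data stream, a blocking operation in an always-on process has no way to tell "no more rows for this unit of work" from "no rows yet."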

These aren’t the only issues, and there are numerous ways of dealing with them. Make sure the tool or techniques you choose give you ways to deal with these problems. Next time I’ll share my notes on these issues and the other real-time patterns in more detail.


First Blog Entry!

Ok.   I’m here.   Finally decided to try this “blogging” thing.  Thanks to some encouragement from a few close friends, I’ll start sharing things with the world.  Seems that this might be a good place to leave opinions about the data integration business, where I’ve spent my entire career, and also make observations on my other passions, like NHL hockey and family, for anyone who cares to listen.   First up — I need to spend some time figuring out how this site works! …and then I’ll think of some content that someone (?) might consider interesting to read.  ;)   
