Data Quality and Transformation as a Service

In your ETL jobs, do you ever perform “Lookups” to validate account numbers, confirm product codes, or return discount rates?  (I’m sure the answer is “of course you do”)  Are you certifying address lines as part of your batch ETL processes?    (for many of you, another “of course”)  How complex is the logic you are using?  How much work have you done to establish such lookups, certify addresses, and test the functionality?   …and maintain it?   What data sources, security issues and other techniques have you invested in?

Have you thought about how valuable those data integration activities might be to the rest of your organization?

Turn them into a “Service.” One that is easily re-usable, not only for the batch work you are doing today via ETL, but for your real-time applications — your java development teams building a portal, or .NET groups setting up front ends for remote devices.   Or perhaps for other applications that are performing internal communications using your company’s choice of enterprise service bus.

This is where the Information Services Director comes into play for information Server — providing the ability to publish DataStage, QualityStage, SQL queries and stored procedures as Services. Take the “guts” of your data integration activity (Lookups, Transforms, etc.) and “publish” them as a Service.   A “Service” that is supported by industry standards such as Web Services (SOAP over HTTP) and other protocols, along with the necessary artifacts that include a built-in directory and automatically generated WSDL that illustrates the metadata for your creation.

Services that focus on data integration are often most successful when they are built, documented, and maintained by the teams who truly understand the data. In many organizations, this is the same team that has been living with the data migration, transformation, and data warehousing applications.  They have invested time and energy researching the models, the legacy systems, and the oddities of the data under-the-covers.  They’ve built extensive transformations using ETL tooling. Why not exploit those skills, expertise, and business rule investments for benefits beyond the decision support systems?

I’m not talking about new stuff, bleeding edge creations, or upcoming technology.   DataStage, via “RTI,” has been doing this for 4+ years.  Web Services are increasingly mainstream.   Data integration is still at the core, and ETL tools have proven their mettle at simplifying the management of data access technology.  It’s a shame to see it only used for batch.   


ps… if you are interesting in reading further about this subject, check out this article on Information as a Service: Data Cleansing Pattern. I was honored to be asked to play a small role in developing this article, written by some of my esteemed IBM teammates. -e

Information as a Service: Data Cleansing Pattern

How to Invoke Complex Web Services

Thought it was about time for another set of entries on Web Services. A few months ago I wrote about how we can invoke or “call” a Web Service from within DataStage, with a brief overview of the process. Now it’s time to address more difficult scenarios. I’ve had some comments from some of you who have run into issues — hopefully these details will help.

What exactly is a “complex” web service? Well, the word “complex” is relative, but I guess for our purposes here, it means one that you couldn’t easily get to work the first time out-of-the-box! 😉 Seriously though, Web Services come in a lot of flavors. The industry standards are in place, but not always rigorously followed. Web Services may have been designed and written before the standards were solid, or there may simply be constructs within and issues around a particular web service and its pattern that may it difficult to implement in any client tool. In fact, understanding the pattern of the service may be the most important research you can do before trying to utilize a web service in your applications. How often is it called? What does it do? What does it deliver? Does it send back one row or many rows? Does it expect you to “give it” one row or many rows when called? Does it send back data directly or just a big chunk of xml? Does it need you to send it a big chunk of xml?

Let’s start with defining what I’d call a “simple” web service and then go on from there. A perfectly simple web service is one with single row input and single row output, with maybe just a couple of columns in/out, no out-of-the-ordinary datatypes, and located on a nearby machine behind the firewall without any security. It will be invoked for every row that flows past. For DataStage purposes, it will use the Web Services Transformer Stage. In contast, here are many of the factors I’d use to qualify a web service as “complex”:

Output or Input only. Some web services are sources or targets. Call it once and it delivers rows, or receives information at a final target. In DataStage, we use the Web Services Client Stage for these. The gui lets you choose inputs that are static, particularly important if this is a “source”. For what it’s worth though, I tend to use the Web Services Transformer stage for these also, because I prefer to send my input via a link…it allows me to be more creative with the source parameters. What’s important though is that you understand the “pattern” of your desired Web Service

Security and proxies. Can you get there from here? What do you need to get past your firewall? Is HTTPS and SSL part of the web service you need to invoke? In basic scenarios, DataStage provides properties directly for these. WS-Security on the other hand, is more difficult, and involves a lot more hand-shaking and coordination between SOAP client and the provider of the Service.

Complex SOAP Bodies. XML comes in a lot of varieties. Sometimes the SOAP body being sent or received is in a hierarchical form. If that’s the case, we’ll need to decipher that XML into its relational rows and columns after we receive it at the client.

Embedded XML. A SOAP envelope is xml, but sometimes “contains” xml. In otherwords, we might have a simple string called for in the web service “myWebServiceResponse,” but there is no detail for it — it’s simply a big giant chunk of XML that is being passed back for further deciphering by the client. This is similar to the complex SOAP body, except in this case the WSDL contract knows nothing about the structure expect that a single string is being sent back. Again, we’ll need to decipher that XML into its appropriate parts after reception.

Arrays. The request or response is looking for, or sending back, an “array,” or “list” of values. There may be multiple columns in each entry of the list, or just one. This is most easily identified by looking at the WSDL in a browser — developers usually name the data areas as “ArrayOf……” or “ListOf…..”, although not always. A weather example I saw recently had “ListOfCities,” for example.

SOAP Headers. Ah. These can be tricky. Getting SOAP Headers right usually means having to know a bit more about the Service. The authors of the service will hopefully have documented what they expect in a SOAP Header, and when. Complex Web Services APIs may have a sequence of calls…one to “login” and perhaps get an access code, and then another to get started and do real work. The access code and other details are often parts of the header. Formal SOAP Headers are one thing, and what you might call “pseudo-headers” are another. In certain cases, the authors of a Service may have chosen, for various reasons, to place userID and access details inside the Body instead. It depends on the age of the service, how mature the standards, and the SOAP clients using them were when the service was first put into production, and the creativity of the original developer. We can retrieve these from the response, or if necessary, build them for the input request.

A fellow DataStage user has shared a WSDL with some of the issues above. As we work thru it and get it operational, I’ll share the techniques required.