RSS for your Data Transformations!

Huh? What? I thought RSS was for news feeds from CNN or NBC!? The best way to get the latest on the election (like I haven’t had enough) or the latest storm! Instant updates to my browser or other cool electronic gadget.

Well, RSS (Really Simple Syndication, in most circles) is coming to your business solutions, if it hasn’t already. You may already be accustomed to that cute little “radio beam” icon, usually orange, that appears on this blog, most web sites, and many other places that you frequent on the internet. You may have an RSS “reader” or “aggregator” in your browser, a dedicated one on your desktop, or one enabled for your Blackberry. Something that updates your favorite team’s box scores, the latest news from IBM or your favorite vendor, or, as noted above, the latest gossip in the race for US President. How about sending out intelligent, data-oriented updates driven by DataStage or a customized SQL statement?

Moving into the mainstream, RSS is not only for updated news clips. Let your executives aim their RSS reader at your customer update feed, and they’ll know whenever a new transaction is generated or when a customer logs a complaint. Behind the scenes, you can have your favorite ETL tool, tickled by a request from an RSS reader, pull the latest updated profile for a given customer. The “publish date” that is part of the RSS standard will be used by the sales rep’s RSS reader to highlight new material, making it bold, an alternate color, or whatever else their chosen reader is configured to do. You monitor and supply the data, the user chooses the RSS tooling. That’s freedom that makes everyone happy.
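For what it’s worth, here’s a tiny, purely illustrative sketch (plain Java, with made-up titles, links, and identifiers) of the kind of RSS 2.0 item a data-driven feed might emit; the pubDate is the piece a reader compares to decide what to flag as new:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class RssItemSketch {
    public static void main(String[] args) {
        // RSS 2.0 uses RFC 822 dates, e.g. "Sat, 01 Nov 2008 14:30:00 EST"
        String pubDate = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.US)
                .format(new Date());
        // Hypothetical item for a customer-update feed; element values are invented
        String item =
            "<item>\n" +
            "  <title>Customer 12345: new complaint logged</title>\n" +
            "  <link>http://example.com/feeds/customers/12345</link>\n" +
            "  <description>Complaint CMP-889 opened against order 55512</description>\n" +
            "  <pubDate>" + pubDate + "</pubDate>\n" +
            "</item>";
        System.out.println(item);
    }
}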

Information Server release 8.1 delivers RSS in the new Information Services Director (the next release of wISD/ISD). I’ve had the pleasure of playing with this new capability for the past few weeks, and it opens up a lot of possibilities [it’s also been a fun way to start learning about this new SOA medium 🙂 ]. Changed Data Capture tools, such as InfoSphere CDC (formerly DataMirror), can recognize log events, triggering DataStage loads to a target system, and then “always ready” RSS Services delve into the data to find updated gems for selected customer segments or territories. The RSS capability is a new binding added to the existing SOAP and EJB capabilities, and is complemented by the new 8.1 bindings for REST and JMS. I’ll dig deeper into these and other real-time related Info Server features in some upcoming posts.

Web 2.0 is coming. Don’t let it pass you by!

Ernie

Handling Nulls in XML Sources

Reading XML is very straightforward when everything is present, all the elements are populated, and the entire XML “tree” exists in the current document that you are reading. However, when elements are missing from the document entirely, are “empty” (by XML definition), or present but formally indicated as null, things can get tricky. Nulls can be confusing enough as it is! I’ll try here to clarify the null issues you should be considering in your ETL processes, and outline specifically how such things are managed in DataStage when you use XMLInput…

For starters, let’s leave the “Repeating Element” (the one you mark as “key”) out of this blog entry. It has its own important considerations. We’ll concentrate for now on standard elements that may or may not appear in your document, may be empty, or may be formally marked as null (element and its tags exist, but it is empty with attribute xsi:nil=”true”). Let’s look at some examples.

Imagine the details of a document about products. It might have an element called “color.” Color could have a value:

<product>
<number>123</number>
<color>blue</color>
<location>A</location>
.
.
.
</product>

 

Color could be “empty”:

 

<product>
<number>123</number>
<color></color> [or <color/> ]
<location>A</location>
.
.
.
</product>

 

Color could be formally null:

 

<product>
<number>123</number>
<color xsi:nil="true"></color>
<location>A</location>
.
.
.

</product>

 

…or be entirely missing (for a given product):

 

<product>
<number>123</number>
<location>A</location>
.
.
.
</product>

 

What constitutes a null value? When is it blank? When your tool reads and parses this XML document, what do you want it to do?

DataStage gives us several options. There are two check boxes in the Transformation tab of the Output Link. By default, these boxes are unchecked:

Replace NULLs with empty values [blanks]
Replace empty values with NULLs

Without doing anything, empty elements will be treated as blanks, and formally null or missing elements will be treated as nulls.

If you only check “Replace NULLs with empty values,” nothing will be null anymore….

…and if you only check “Replace empty values with NULLs,” more of your rows will have nulls for this column, because anything that was blank in the default scenario will now be null.

Finally, if you check both boxes, only the empty elements will be retrieved as null, because the missing elements and the formally null ones will be treated as blank.

Clear as mud, I know. 😉
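If it helps, here is the same decision logic written out as a small sketch. This is plain, illustrative Java, not DataStage internals; it just restates the behavior of the two check boxes described above:

public class XmlNullLogicSketch {
    // Illustrative only: mirrors the check-box behavior described in this post.
    static String resolve(boolean elementPresent, boolean xsiNil, String rawText,
                          boolean replaceNullsWithEmpty, boolean replaceEmptyWithNulls) {
        if (!elementPresent || xsiNil) {
            // Originally NULL: element missing, or present with xsi:nil="true"
            return replaceNullsWithEmpty ? "" : null;
        }
        if (rawText.isEmpty()) {
            // Originally empty: <color></color> or <color/>
            return replaceEmptyWithNulls ? null : "";
        }
        return rawText;   // a real value such as "blue" passes through untouched
    }
}

With both flags false you get the default behavior; flip them to trace out the other three combinations described above.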

Last week I worked with a customer who wanted to separate the xsi:nil=”true” elements from the ones that were “missing”. They wanted the xsi:nil=”true” elements to become null columns, and the ones that were truly missing to be assigned an actual string value of “MISSING”. I’ll follow this up with an entry on using XSLT to perform subtle column assignments like this one.

-e



Data Quality and Transformation as a Service

In your ETL jobs, do you ever perform “Lookups” to validate account numbers, confirm product codes, or return discount rates? (I’m sure the answer is “of course you do.”) Are you certifying address lines as part of your batch ETL processes? (For many of you, another “of course.”) How complex is the logic you are using? How much work have you done to establish such lookups, certify addresses, and test the functionality? …and maintain it? What data sources, security issues, and other techniques have you invested in?

Have you thought about how valuable those data integration activities might be to the rest of your organization?

Turn them into a “Service.” One that is easily re-usable, not only for the batch work you are doing today via ETL, but for your real-time applications — your Java development teams building a portal, or .NET groups setting up front ends for remote devices. Or perhaps for other applications that are performing internal communications using your company’s choice of enterprise service bus.

This is where the Information Services Director comes into play for Information Server — providing the ability to publish DataStage, QualityStage, SQL queries, and stored procedures as Services. Take the “guts” of your data integration activity (Lookups, Transforms, etc.) and “publish” them as a Service. A “Service” that is supported by industry standards such as Web Services (SOAP over HTTP) and other protocols, along with the necessary artifacts, including a built-in directory and an automatically generated WSDL that illustrates the metadata for your creation.
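As a rough illustration of what a consumer of such a published service might do with that generated WSDL, here is a minimal Java sketch using the standard JAX-WS Dispatch API. The endpoint, namespace, service and port names, and payload element names are invented for the example; the real ones come from the WSDL that Information Services Director generates for your service:

import java.io.StringReader;
import java.net.URL;
import javax.xml.namespace.QName;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.ws.Dispatch;
import javax.xml.ws.Service;

public class PublishedServiceCaller {
    public static void main(String[] args) throws Exception {
        // Hypothetical values: take the real ones from the generated WSDL
        URL wsdl = new URL("http://myserver:9080/services/CustomerLookup?wsdl");
        QName serviceName = new QName("http://example.com/isd", "CustomerLookupService");
        QName portName = new QName("http://example.com/isd", "CustomerLookupPort");

        Service service = Service.create(wsdl, serviceName);
        Dispatch<Source> dispatch =
            service.createDispatch(portName, Source.class, Service.Mode.PAYLOAD);

        // Request payload: element names here are assumptions, not the real contract
        String request =
            "<lookupCustomer xmlns='http://example.com/isd'>" +
            "<customerId>12345</customerId>" +
            "</lookupCustomer>";

        // Send the SOAP request and get back the response payload for further parsing
        Source response = dispatch.invoke(new StreamSource(new StringReader(request)));
        System.out.println("Got a response: " + response);
    }
}

The point is not the Java itself — a .NET or ESB client would look equally small — but that the caller needs nothing beyond the published WSDL contract.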

Services that focus on data integration are often most successful when they are built, documented, and maintained by the teams who truly understand the data. In many organizations, this is the same team that has been living with the data migration, transformation, and data warehousing applications. They have invested time and energy researching the models, the legacy systems, and the oddities of the data under-the-covers. They’ve built extensive transformations using ETL tooling. Why not exploit those skills, expertise, and business rule investments for benefits beyond the decision support systems?

I’m not talking about new stuff, bleeding edge creations, or upcoming technology. DataStage, via “RTI,” has been doing this for 4+ years. Web Services are increasingly mainstream. Data integration is still at the core, and ETL tools have proven their mettle at simplifying the management of data access technology. It’s a shame to see that investment used only for batch.

Ernie

ps… if you are interested in reading further about this subject, check out this article on Information as a Service: Data Cleansing Pattern. I was honored to be asked to play a small role in developing this article, written by some of my esteemed IBM teammates. -e


How to Invoke Complex Web Services

Thought it was about time for another set of entries on Web Services. A few months ago I wrote about how we can invoke or “call” a Web Service from within DataStage, with a brief overview of the process. Now it’s time to address more difficult scenarios. I’ve had some comments from some of you who have run into issues — hopefully these details will help.

What exactly is a “complex” web service? Well, the word “complex” is relative, but I guess for our purposes here, it means one that you couldn’t easily get to work the first time out-of-the-box! 😉 Seriously though, Web Services come in a lot of flavors. The industry standards are in place, but not always rigorously followed. Web Services may have been designed and written before the standards were solid, or there may simply be constructs within, and issues around, a particular web service and its pattern that make it difficult to implement in any client tool. In fact, understanding the pattern of the service may be the most important research you can do before trying to utilize a web service in your applications. How often is it called? What does it do? What does it deliver? Does it send back one row or many rows? Does it expect you to “give it” one row or many rows when called? Does it send back data directly or just a big chunk of xml? Does it need you to send it a big chunk of xml?

Let’s start with defining what I’d call a “simple” web service and then go on from there. A perfectly simple web service is one with single row input and single row output, with maybe just a couple of columns in/out, no out-of-the-ordinary datatypes, and located on a nearby machine behind the firewall without any security. It will be invoked for every row that flows past. For DataStage purposes, it will use the Web Services Transformer Stage. In contrast, here are many of the factors I’d use to qualify a web service as “complex”:

Output or Input only. Some web services are sources or targets. Call it once and it delivers rows, or receives information at a final target. In DataStage, we use the Web Services Client Stage for these. The GUI lets you choose inputs that are static, particularly important if this is a “source”. For what it’s worth though, I tend to use the Web Services Transformer stage for these also, because I prefer to send my input via a link… it allows me to be more creative with the source parameters. What’s important, though, is that you understand the “pattern” of your desired Web Service.

Security and proxies. Can you get there from here? What do you need to get past your firewall? Are HTTPS and SSL part of the web service you need to invoke? In basic scenarios, DataStage provides properties directly for these. WS-Security, on the other hand, is more difficult, and involves a lot more hand-shaking and coordination between the SOAP client and the provider of the Service.

Complex SOAP Bodies. XML comes in a lot of varieties. Sometimes the SOAP body being sent or received is in a hierarchical form. If that’s the case, we’ll need to decipher that XML into its relational rows and columns after we receive it at the client.

Embedded XML. A SOAP envelope is xml, but sometimes it “contains” xml. In other words, we might have a simple string called for in the web service, “myWebServiceResponse,” but there is no detail for it — it’s simply a big giant chunk of XML that is being passed back for further deciphering by the client. This is similar to the complex SOAP body, except in this case the WSDL contract knows nothing about the structure except that a single string is being sent back. Again, we’ll need to decipher that XML into its appropriate parts after reception.

Arrays. The request or response is looking for, or sending back, an “array,” or “list” of values. There may be multiple columns in each entry of the list, or just one. This is most easily identified by looking at the WSDL in a browser — developers usually name the data areas as “ArrayOf……” or “ListOf…..”, although not always. A weather example I saw recently had “ListOfCities,” for example.

SOAP Headers. Ah. These can be tricky. Getting SOAP Headers right usually means having to know a bit more about the Service. The authors of the service will hopefully have documented what they expect in a SOAP Header, and when. Complex Web Services APIs may have a sequence of calls… one to “login” and perhaps get an access code, and then another to get started and do real work. The access code and other details are often parts of the header. Formal SOAP Headers are one thing, and what you might call “pseudo-headers” are another. In certain cases, the authors of a Service may have chosen, for various reasons, to place userID and access details inside the Body instead. It depends on the age of the service, how mature the standards and the SOAP clients using them were when the service was first put into production, and the creativity of the original developer. We can retrieve these from the response, or if necessary, build them for the input request; there is a small sketch of building such a request just below.
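For anyone who hasn’t built one of these by hand, here is a minimal, hypothetical Java sketch using the standard SAAJ API, showing a request whose SOAP Header carries an access token from an earlier “login” call. The endpoint, namespace, and element names are all invented; the real ones come from the service’s WSDL and documentation:

import javax.xml.soap.MessageFactory;
import javax.xml.soap.SOAPBody;
import javax.xml.soap.SOAPConnection;
import javax.xml.soap.SOAPConnectionFactory;
import javax.xml.soap.SOAPElement;
import javax.xml.soap.SOAPEnvelope;
import javax.xml.soap.SOAPHeader;
import javax.xml.soap.SOAPMessage;

public class SoapHeaderRequestSketch {
    public static void main(String[] args) throws Exception {
        String endpoint = "http://example.com/services/AccountService";   // hypothetical
        String ns = "http://example.com/ns";                              // hypothetical

        SOAPMessage request = MessageFactory.newInstance().createMessage();
        SOAPEnvelope env = request.getSOAPPart().getEnvelope();

        // SOAP Header: the access token obtained from an earlier "login" operation
        SOAPHeader header = env.getHeader();
        header.addChildElement("SessionToken", "svc", ns)
              .addTextNode("token-from-login-call");

        // SOAP Body: the actual operation and its parameters
        SOAPBody body = env.getBody();
        SOAPElement op = body.addChildElement("getAccountDetails", "svc", ns);
        op.addChildElement("accountNumber", "svc", ns).addTextNode("1000234");
        request.saveChanges();

        // Send it and print whatever comes back
        SOAPConnection conn = SOAPConnectionFactory.newInstance().createConnection();
        SOAPMessage response = conn.call(request, endpoint);
        response.writeTo(System.out);
        conn.close();
    }
}

A “pseudo-header,” as described above, would simply mean adding those same credential elements inside the Body instead of the Header.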

A fellow DataStage user has shared a WSDL with some of the issues above. As we work thru it and get it operational, I’ll share the techniques required.

Ernie

…another good link on using XML and Web Services

…been meaning to put a link to the DeveloperWorks site and a very nice detailed piece that some of my IBM colleagues put together on XMLPack and Web Services Pack last year… I haven’t had a chance to review every fine point, but I especially like the level of detail in their screen shots, which offer clarity on when/where/how DataStage uses XPath in its XML functionality. Another good resource for review… http://www.ibm.com/developerworks/db2/library/techarticle/dm-0703xiong/index.html

 -e

Incorporating Java classes into your DataStage Jobs

Java comes up a lot when we talk about “real time.” Not that Java in particular has any special dibs on the term, but frequently when a site is interested in things like Service Oriented Architecture (SOA), Web Services, messaging, and XML, they are often also interested in Java, J2EE, Application Servers, and other things related to Sun’s language standard.

Integrating Java with your ETL processing becomes the next logical discussion, whether “real time” even applies. There may be some functionality, some existing algorithms worth re-using, or some remote Java-oriented or Java-managed system or message queue that contains valuable source data (or would be a valuable target), that you’d like to integrate into a data integration flow. DataStage can easily be extended to include your Java functionality or take advantage of your Java experience.

There are two Stages, which used to be referred to as the JavaPack, that are included with DataStage: JavaClient and JavaTransformer. Both allow you to integrate the functionality of a Java class into the flow of a DataStage Job. JavaClient is used for sources or targets (only an output link or only an input link), and JavaTransformer is used for row-by-row processing where you have something you’d like to invoke for each row that passes through.

DataStage provides a simple API for including Java classes in your Jobs. This API allows your class to directly interact with the DataStage engine at run-time — to obtain metadata about the columns and links that exist in the currently executing job, and to read and write rows from and to those links when called upon to do so. You define several special methods in your class, such as Process(), that the engine calls whenever it needs a row, or when it is giving your class control because it’s ready to give you a row. Within that method you have various calls to make, such as readRow [from an input link] and writeRow [to an output link]. You can control what comes in and goes out, and also process rejections based on logic in your class. Other than that, your class can do whatever it wants… read messages from JMS queues, invoke remote EJBs… whatever.
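To show the shape of such a class, here is a minimal sketch in the spirit of the example attached below. Treat the specific names as assumptions on my part: the package, base class, row accessors, and return constants shown here follow the pattern described above, but the JavaPack documentation (and the ExamineRows source) is the authority on the exact signatures.

// Illustrative sketch of a row-by-row JavaTransformer class; verify names against the JavaPack docs.
import com.ascentialsoftware.jds.Row;
import com.ascentialsoftware.jds.Stage;

public class UpperCaseColumn extends Stage {

    // Called by the engine whenever it is ready to exchange rows with this class.
    public int process() {
        Row in = readRow();                          // pull the next row from the input link
        if (in == null) {
            return OUTPUT_STATUS_END_OF_DATA;        // no more input: tell the engine we are finished
        }
        Row out = createOutputRow();                 // build a row for the output link
        // Pass column 0 through, upper-cased; real logic could reject rows, call EJBs, read queues...
        out.setValueAsString(0, in.getValueAsString(0).toUpperCase());
        writeRow(out);                               // send the row down the output link
        return OUTPUT_STATUS_READY;                  // more rows may follow
    }
}

The compiled class goes on the Job’s classpath and is named in the stage properties; the ExamineRows Readme below covers those Job-side settings.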

The JavaPack is very well documented, with examples and descriptions of all the API calls. However, I’ve included an additional example here for anyone who is interested, including the Java class, source, .dsx export, and usage notes. Have fun!

-ernie

btw… I haven’t exactly figured out yet how best to get the names of the files below represented here on this blog, but if you save them from here, each file except the Readme begins with “ExamineRows” and should be ExamineRows.dsx (for the export), ExamineRows.java (for the source), and ExamineRows.class (for the actual compiled class). I haven’t had a chance to re-try it after downloading from here, so worst case, you’ll need to recompile the class yourself in your environment. Otherwise, it should run in v8 “as is”. See the file at the Readme link for details on the expected classpath in the Job, etc., and read the annotations in the Job itself after you import it. -e

Examine Rows Class, Examine Rows Java Source, Examine Rows Readme, Examine Rows DataStage Export

Launching Jobs via Web Service

Kicking off Transformation Jobs or functions via the command line is an important feature, but it is equally important that we are able to launch such processing via a Web Service or other service-oriented invocation.

Command line utilities are generally text based. They require that you learn the vendor’s proprietary syntax. They may require passwords and userids. You may have to worry about clear text, or be sure there are ways to hide credential information. You probably need to know about shell scripting to take maximum advantage of the command language. You may also need to know things like Job or Map names, understand the parameter structure, or know the administrative naming conventions for the transformation tooling (Project Names, Folders, etc.). You probably need to know, as a developer, the hostname where the transformation is running, or at least the engine you are connecting to for launching the process.
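As a rough illustration of that burden from a calling application’s point of view, here is a small Java sketch that shells out to dsjob. The flags shown are typical but vary by release, and the project, job, and parameter names are invented for the example:

import java.io.IOException;

public class DsJobLauncher {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical project/job/parameter names; exact dsjob options depend on your release
        ProcessBuilder pb = new ProcessBuilder(
            "dsjob",
            "-run",
            "-jobstatus",                        // wait for the job and report its status
            "-param", "LOAD_DATE=2008-11-01",
            "MyProject",                         // project name: an admin detail the caller must know
            "RefreshDataMartJob");               // job name: ditto
        pb.inheritIO();                          // show dsjob's output on our console
        int rc = pb.start().waitFor();
        System.out.println("dsjob return code: " + rc);
    }
}

Every one of those literal strings is exactly the kind of tool-specific knowledge that a service interface, as described below, can hide from the caller.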

Command line utilities are powerful… there are times, however, when a simpler invocation method is needed. Like when the calling application doesn’t have easy access to scripts or the operating system command line. Or when the skills of the developer(s) establishing the invocation do not include scripting or the details of the transformation tool’s command line syntax. Or when the ultimate location of the transformation function being launched is unknown or moves frequently. Service invocations using industry standards help abstract all of that… servers can be anywhere, and underlying awareness of the transformation tooling is hidden from the invoking client. Datatypes for Job Parameters are less painful to handle, and authentication and/or encryption can be handled at the transport layer.

A financial site I’ve been working with uses Web Services to kick off DataStage ETL Jobs from within a portal application. The authors of the portal application are Visual Basic .NET experts. They don’t know the first thing about DataStage, are under deadlines like the rest of us, and needed the ability to start an ETL Job to refresh a datamart as the result of a user pushing a button or choosing an option. Using the RTI module for DataStage (release 7.x), the .NET development team merely consumes the automatically published WSDL definition to include this functionality. Within minutes, they have the functionality included in their application. No scripts to maintain, no complex awareness of DataStage. For all the .NET developers know, the refresh process might be home-grown C++ code. That’s the value of using the SOA industry standards. The .NET developers don’t want to know, and they don’t have to know, that ultimately it’s a DataStage Job doing the work. Meanwhile, the skilled DataStage ETL team, who really understand the data, is maintaining the Jobs, the rules, the access to the source and target data structures, and publishing the functionality (as a Web Service, including the WSDL “contract”) for use by the rest of the enterprise.

Command line control (in DataStage we generally call it the Job Control API or Job Control Command language, a.k.a. dsjob) in transformation tools is great, but increasingly we need simpler methods that exploit the new standards.

“Almost” real-time Jobs

It must be Q4, and close to the holidays. Questions come in like heavy rain, and it’s times like this that I see common patterns in the scenarios, whether they are from my IBM teammates helping customers understand our offerings, or from customers and users directly, trying to get things finished before taking well-deserved time off (what’s that? 🙂 )

At any rate, I recently was involved in a bunch of discussions around ETL jobs that wait for files, or loop through directories on a continuous basis, reading their contents. This is a common scenario for reading XML input. Imagine a subdirectory that receives xml documents (with the same matching schema) throughout the night, via FTP from remote locations. It is desirable to have the job or map sitting there, waiting, until a file arrives, then immediately process that file and go back to waiting.

There are a myriad of solutions for handling this, usually with some type of polling mechanism. It’s not as pure and simple, however, as the classic “always on” Job that uses a standard API (MQSeries, JMS, etc.) with a built-in wait parameter. Those “always on” jobs are more predictable, across vendor, across protocol, and the pattern is well known whether it’s developed with a graphical toolset or directly with C++. The polling solutions are as varied as there are creative uses of these tools… some possibilities are scripts that sleep, run, then loop and sleep again; remote daemons that wait for a specific file to arrive and then say “go” to the transformation job; or use of an external scheduler (cron, BMC Patrol, AutoSys, etc.) that monitors the conditions and kicks off the ETL process when appropriate. There may be some custom API-based solutions out there that force an ETL transformation such as a DataStage Job to blindly wait for input that “happens” to be from a subdirectory, but I haven’t seen any.

Most of the polling solutions are looping batch jobs… they do their work, finish reading their input, complete normally, then terminate and are asked to wait again [or their launcher does the waiting]. They act like “always on,” but have less of the complexities because they are typically “very frequent batch.” Again, custom solutions may exist, but these polling solutions are fairly easy to build out-of-the-box using the tools themselves or in conjunction with system management products. Many people prefer to use the system management tooling because it’s better designed to handle circumstances such as machine re-starts, clean-up strategies, etc.
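To make the “very frequent batch” polling pattern concrete, here is a small, purely illustrative Java sketch of a poll-and-launch loop. The directory, file pattern, project and job names, and clean-up strategy are all placeholders; a real deployment would more likely live in a script or a system management tool, for the reasons above:

import java.io.File;

public class XmlDropPoller {
    public static void main(String[] args) throws Exception {
        File dropDir = new File("/data/incoming/xml");      // hypothetical FTP drop directory
        while (true) {
            File[] docs = dropDir.listFiles((dir, name) -> name.endsWith(".xml"));
            if (docs != null) {
                for (File doc : docs) {
                    // Kick off the ETL job for this document (dsjob here, but it could be
                    // a script, a scheduler request, or a Web Service call)
                    new ProcessBuilder("dsjob", "-run",
                            "-param", "INPUT_FILE=" + doc.getAbsolutePath(),
                            "MyProject", "LoadXmlJob")
                        .inheritIO().start().waitFor();
                    // Simple clean-up so the same file is not picked up again
                    doc.renameTo(new File(doc.getPath() + ".done"));
                }
            }
            Thread.sleep(30000);   // sleep, then poll again: "very frequent batch"
        }
    }
}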