Always use an Input Link AND an Output Link with the XML Stage

Another way to say this is to “avoid using the XML Stage to perform I/O.” The XML Stage is capable of reading an XML document directly (a feature in the Parser Step) and is also able to write a new document to disk in the Composer Step. However, while that may seem simpler initially, it makes your Jobs and Stage designs less flexible and less reusable. You should have an Input Link that feeds XML to the XML Stage when you are “reading” or parsing XML (and of course you will have Output Links that send the results downstream), and you should have an Output Link that sends your completed XML document(s) downstream when you are writing XML (and of course you will have Input Links that feed in the source data).

Let’s see why.

When you are first learning the XML Stage, it seems convenient to just “put in the name of the xml document” and keep going. The Parser Step allows you to specify the filename directly (or it can be parameterized), and then you continue with the assignment of the Document Root. Similarly, when creating a new XML document, the Composer Step allows you to specify the actual document to be written to disk.

Then someone comes along and says “Our application is changing. The xml documents we currently read from disk will now be coming from MQ Series…” …or maybe “…from a relational table” …or “from hadoop”…. Well, you can’t just “change the Stage type at the end of the link” in that case. You have to “add” the link, and then make what could potentially be extensive changes to your Assembly. While not especially difficult once you are familiar with the Stage, if you have moved on to other projects, or have been promoted and are no longer supporting the Job, a less experienced DataStage developer will be challenged.

So…when using the Parser Step, use one of the options that describes your incoming content either as content arriving directly (from a column in an upstream Stage) or as a set of filenames (the best choice when reading xml documents from disk, especially when you have a whole lot of them in a single sub-directory; see also Reading XML Content as a Source).

[Screenshot: XML Parser Step options]
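The same idea can be sketched outside of DataStage. This is only an analogy in Python, not anything the XML Stage does internally: the parsing function accepts content and knows nothing about where it came from, so a file-based source can be swapped for a column of content without touching the parsing logic. The element names (orders, order, customer) and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict, Iterable, Iterator, List, Optional


def parse_orders(xml_content: str) -> List[Dict[str, Optional[str]]]:
    """Parsing logic only: takes XML content, knows nothing about its origin."""
    root = ET.fromstring(xml_content)
    return [
        {"id": order.get("id"), "customer": order.findtext("customer")}
        for order in root.iter("order")
    ]


def content_from_files(paths: Iterable[Path]) -> Iterator[str]:
    """One possible source: a set of filenames, e.g. every file in a sub-directory."""
    for path in paths:
        yield path.read_text(encoding="utf-8")


def content_from_rows(rows: Iterable[Dict[str, str]],
                      column: str = "xmlContent") -> Iterator[str]:
    """Another possible source: a column of XML content (a table, a queue, ...)."""
    for row in rows:
        yield row[column]


# Hypothetical sample document, standing in for whatever arrives on the link.
sample = '<orders><order id="1"><customer>Acme</customer></order></orders>'

# Swapping the source means swapping the generator; parse_orders is untouched.
for content in content_from_rows([{"xmlContent": sample}]):
    print(parse_orders(content))
```

Moving from files to rows changes only the generator that feeds parse_orders, which is the same benefit you get from feeding the Parser Step through an Input Link.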

The same thing is true for writing XML. Send your xml content downstream. Whether you write it to a sequential file, to DB2, to MQ Series, or to some other target, the logic and coding of your XML Stage remains the same! In the Composer Step, choose the “Pass as String” option and then, in the Output Step, map the “composer result” to a single large column (I like to call mine “xmlContent”) that has a longvarchar datatype and some arbitrary long length like 99999. While there may be times when this can’t be easily done, or when you need to use the option for long binary strings (Pass as Large Object), this will work great for many/most use cases.

[Screenshot: XML Composer Step options]
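Here is a rough Python analogy for “Pass as String” (again, a sketch of the design idea and not how the Composer Step is implemented). The composing function returns the document as a string, and a separate step wraps it in a single “xmlContent” column for whatever target sits downstream. The customers/customer/name structure and the function names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable


def compose_customers(rows: Iterable[Dict[str, str]]) -> str:
    """Build the XML document and return it as a string; no file I/O here."""
    root = ET.Element("customers")
    for row in rows:
        customer = ET.SubElement(root, "customer", id=row["id"])
        ET.SubElement(customer, "name").text = row["name"]
    # encoding="unicode" makes tostring return a str instead of bytes.
    return ET.tostring(root, encoding="unicode")


def to_output_row(xml_content: str) -> Dict[str, str]:
    """Wrap the composed document in one large column for the downstream target."""
    return {"xmlContent": xml_content}


row = to_output_row(compose_customers([{"id": "7", "name": "Acme"}]))
print(row["xmlContent"])
```

Whether that row ends up in a sequential file, a database column, or a message queue is decided downstream, so compose_customers never has to change.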

Get in the habit of always using Input and Output Links with the XML Stage. Your future maintenance and changes/adaptations will be cleaner, and you can take better advantage of features such as Shared Containers for your xml transformation logic.

Ernie

XML Stage: Establish Meaningful Link Names

…and then stick with them! Decide early what you want your Link names to be, before you even open up the Stage and begin your work on the Assembly, and then lock them in. Make a conscious decision not to change or alter them. Why? Unlike other Stages and Connectors on the DataStage canvas, the XML Stage is not immune to Link name changes.

How many of you are perfectly happy with DSLink2 and DSLink35 or other automatically generated Link names? I know I don’t spend time on every Job, running around putting on fancy Link names, especially when I’m first building it. It’s nice for documentation, and I know that I should always create meaningful names, but how many of us do?

And how often do we “go back” and edit the Link names later? That’s actually a good thing — for most Stages and Connectors. But for the XML Stage, it is something you want to avoid. Changing Link names will break your Assembly and require that you edit the stage and make changes.

Here is an example of the XML Stage reading xml documents from a subdirectory and performing validation. Valid xml will be sent down the “goodXML” Link, and rejected, invalid xml content will be sent down the “badXML” Link.

[Screenshot: Job canvas showing the Link names]

Notice how, inside the Assembly, these link names are used. Here in the Assembly Parser step, you see the toXML linkname used for the specification of the xml Source:

[Screenshot: Link name used in the Parser Step]

…and here, in the Assembly Output Step, you can see how the Link names are used in the Mapping:

[Screenshot: Link names used in the Output Step mapping]

Those screen shots illustrate how the link name becomes critical to the internals of the Assembly. If you change the link names outside the Stage, the Assembly will end up with errors (various red marks throughout the Assembly, depending on how complex it is):

[Screenshot: Assembly errors after a Link name change]
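To see why a rename hurts, here is a toy model in Python. It is purely hypothetical and says nothing about how DataStage actually stores an Assembly: it just treats each place in the Assembly as holding a Link name, and reports the places left dangling once a name on the canvas changes. The step descriptions and the “rejectedXML” name are invented; the other Link names come from the example above.

```python
from typing import Dict, List, Set


def broken_references(assembly_refs: Dict[str, str],
                      canvas_links: Set[str]) -> List[str]:
    """Return the Assembly locations whose Link name no longer exists on the canvas."""
    return sorted(
        location
        for location, link_name in assembly_refs.items()
        if link_name not in canvas_links
    )


# Hypothetical record of where each Link name is used inside the Assembly.
assembly_refs = {
    "Parser Step: xml Source": "toXML",
    "Output Step: mapping for valid documents": "goodXML",
    "Output Step: mapping for rejected documents": "badXML",
}

# Links as originally drawn on the canvas: nothing is broken.
print(broken_references(assembly_refs, {"toXML", "goodXML", "badXML"}))

# "badXML" renamed to "rejectedXML" outside the Stage: one reference now dangles.
print(broken_references(assembly_refs, {"toXML", "goodXML", "rejectedXML"}))
```

In this toy model, every place that mentioned the old name has to be revisited by hand, which is roughly what the red marks are asking you to do.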

Are you able to correct the Assembly when this happens? Of course…and for most scenarios, it’s not difficult…you might just need to change a setting or re-map a couple of columns. But save yourself the trouble. Decide on your Link names, set them up early (preferably before you ever enter the Stage) and then don’t touch them!

—Ernie

Best Practices and Techniques for the “New” XML Stage

Hi Everyone…

It’s been a while since I’ve posted anything. A certain amount of “blog” fatigue is to blame, but so is my preference for posting things that are (as much as possible) proven and time-honored (and not release dependent). Many/most of the techniques I write about here are ones that I’ve spent many hours helping customers and colleagues implement in real situations.

It’s time I write about the XML Stage. It is not-so-new anymore, but still feels new as it has had some very important xsd handling features added to it in the last few releases. This week I will start posting suggestions and tips for using the XML Stage to read and write xml documents using DataStage.

I’ll start with a pointer to a valuable RedBook that came out last year regarding the XML Stage.  I had the pleasure of reviewing the material as the authors put it together.  It is a great place to start when learning to work with this important Information Server capability.

XML Stage Redbook

Ernie

…and a link to the first “New” XML Stage post…

Establish Meaningful Link Names when using the XML Stage!


New RedBook for XML Stage is available!

The new redbook is available for the enhanced XML capabilities introduced by the “XML Stage” in Release 8.5 in October of 2010. It represents a lot of hard work by my colleagues who work with, developed, and tested this enhanced way of processing XML content in an ETL tool. Congrats to the entire authoring team, the reviewers, and the people who made publication of the Redbook possible, and congrats to the rest of us who now have another excellent resource for reading and writing complex XML using DataStage, QualityStage, and Information Server!

You will find this new redbook here:

http://www.redbooks.ibm.com/abstracts/sg247987.html?Open

Ernie

New developerWorks article on DataStage and new XML Stage!

Hi all…

My esteemed colleagues on the xml development team have published a great article on the new XML Stage in 8.5….enjoy!

devWorks article on the New XML Stage!

Ernie

The new XMLPack in 8.5….generating xsd’s….

As noted in an earlier post and outlined nicely in Vincent’s blog (new xml!), the new XML Pack is here… It is very powerful: it provides new features for reading and transforming hierarchical data, performs faster (and smarter) than earlier xml technologies within DataStage, and much more. It also requires that you have an XML Schema Definition (xsd) for the import of xml metadata. Most of the time this is not an issue. The xml documents you are reading and/or writing are well defined, complying with a formal xsd developed within your organization, by a partner, by yourself, or by a standards body. But sometimes, there is no xsd. You may not have access to one, it might have been lost, or it never existed. The XML might be simple enough that it was just generated by another tool without the use of an xsd (or you are asked to generate it), or the xml might be old enough to pre-date the arrival of xsd’s.

There are many ways to generate an xsd. Popular tools such as Altova XMLSpy support this capability, as do many others, including xml Max. A quick search on the web will invite you to try a lengthy list of possibilities. One that I’ve been very successful with is called “trang”.

http://www.thaiopensource.com/relaxng/trang.html

This little tool does more than just xsd generation, although that is the functionality that I have found most useful. I’ve tried it on Windows and on Linux. It is easy to use, well documented, has references from other bloggers across the web, and does the job. It is command-line based, and requires that you have a Java runtime installed locally. There may be more sophisticated tools out there, but this is sufficient for what I need to be productive with the new XML Stage.
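For anyone who wants to script it, here is a minimal sketch of calling trang from Python to infer an xsd from one or more sample documents. It assumes that java is on your PATH and that you have downloaded trang.jar yourself; the jar location and the file names are placeholders, and the helper functions are mine, not part of trang.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def trang_command(jar: Path, xml_samples: Sequence[Path], xsd_out: Path) -> List[str]:
    """Build the command line: input format xml, output format xsd."""
    return [
        "java", "-jar", str(jar),
        "-I", "xml",   # treat the inputs as sample XML documents
        "-O", "xsd",   # write a W3C XML Schema
        *[str(sample) for sample in xml_samples],
        str(xsd_out),
    ]


def generate_xsd(jar: Path, xml_samples: Sequence[Path], xsd_out: Path) -> None:
    """Run trang; raises CalledProcessError if it exits with a non-zero status."""
    subprocess.run(trang_command(jar, xml_samples, xsd_out), check=True)


# Placeholder paths: adjust to wherever trang.jar and your samples live.
print(" ".join(trang_command(Path("trang.jar"), [Path("orders.xml")], Path("orders.xsd"))))
```

An inferred xsd only reflects what appears in the samples you give it, so feed it documents that cover the optional and repeating elements you expect, and review the result before importing it.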

Let me know if you find any others!

Ernie


The new XML Stage is here!

Just announced yesterday…the new XML Stage is available for 8.5! This introduces a whole new level of XML Transformation to the Information Server platform! Among its new capabilities are the ability to read single huge documents using a new streaming methodology that avoids the need to load the document into memory, support for any type of xsd, or collection of xsd’s, to define your xml metadata, and perhaps most important, a whole new hierarchical editing mode called an “Assembly”, which provides support for the creation of complex multi-node hierarchical structures! There’s much more, such as very explicit control of xml validation, a built-in test facility to ease transformation development, and support for both EE and Server Jobs. I’ve had a chance to play with the Stage over the last few months, and will share my experiences and techniques in upcoming posts.

In the meantime, I’d like to congratulate my IBM teammates in engineering for this accomplishment! This new capability will change how we approach many transformation solutions!

You can find the new XML Transformation capability at fix pack central for application to your 8.5 installation.

Ernie
