Reviewing really LARGE xml documents…

Recently I came across a tool that was very helpful in the review of large XML documents… REALLY large XML documents… hundreds of megabytes and more! In the years I’ve been working with XML, every so often I come across a document that I can’t read, can’t process, or can’t otherwise easily research for a data integration application. The tools I usually employ either hit errors or take forever to open while parsing through thousands upon thousands of elements and attributes.

XMLMax, from xponentsoftware, is a convenient alternative to using IE to open my XML, alleviating the need to wait and wait while a large document is opened (all too often failing in the process in one way or another). According to the author, XMLMax uses several proprietary buffering techniques that support the immediate display of content and faster consumption of the entire document. I found that it was able to open two and three hundred megabyte documents in mere seconds, far faster than the other mechanisms I’ve used.

XMLMax is not just an XML reader, as it also has the ability to edit and manipulate your documents. I found it intuitive to use and easy to install, and liked its features for finding the closing tags of any element, especially large ones. Another great feature is the default for ignoring DTDs. So many other tools default to looking for a DTD if specified in a document, and then blow up when trying to open them! My interests lie in particular with the study and examination of large documents for use as a source or target, and so I focused on those features. It is worth noting, however, that XMLMax also provides ways to use XSLT to break up documents into smaller pieces. At the time of this review, the product only ran on Windows, but there are apparently plans to port it to other platforms and also to provide a command line interface for programmatic operation.

Check it out at http://www.xponentsoftware.com/.

Ernie


Handling Nulls in XML Sources

Reading XML is very straightforward when everything is present, all the elements are populated, and the entire XML “tree” exists in the current document that you are reading. However, when elements are missing from the document entirely, are “empty” (by XML definition), or present but formally indicated as null, things can get tricky. Nulls can be confusing enough as it is! I’ll try here to clarify the null issues you should be considering in your ETL processes, and outline specifically how such things are managed in DataStage when you use XMLInput…

For starters, let’s leave the “Repeating Element” (the one you mark as “key”) out of this blog entry. It has its own important considerations. We’ll concentrate for now on standard elements that may or may not appear in your document, may be empty, or may be formally marked as null (the element and its tags exist, but it is empty, with the attribute xsi:nil="true"). Let’s look at some examples.

Imagine the details of a document about products. It might have an element called “color.” Color could have a value:

<product>
<number>123</number>
<color>blue</color>
<location>A</location>
.
.
.
</product>

Color could be “empty”:

<product>
<number>123</number>
<color></color>  <!-- or, equivalently, <color/> -->
<location>A</location>
.
.
.
</product>

Color could be formally null:

<product>
<number>123</number>
<color xsi:nil="true"></color>
<location>A</location>
.
.
.
</product>

…or be entirely missing (for a given product):

<product>
<number>123</number>
<location>A</location>
.
.
.
</product>

What constitutes a null value? When is it blank? When your tool reads and parses this XML document, what do you want it to do?

DataStage gives us several options. There are two check boxes in the Transformation tab of the Output Link. By default, these boxes are unchecked:

Replace NULLs with empty values [blanks]
Replace empty values with NULLs

Without doing anything (both boxes unchecked), empty elements will be treated as blanks, and null-indicated or missing elements will be treated as nulls.

If you only check “Replace NULLs with empty values,” nothing will be null anymore…

…and if you only check “Replace empty values with NULLs,” more of your rows will have nulls for this column, because anything that was blank in the default scenario will now be null.

Finally, if you check both boxes, only the empty elements will be retrieved as null, because the missing elements and the formally null ones will be treated as blank.

Clear as mud, I know. 😉
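
If it helps to see the four cases side by side, here’s a minimal sketch in plain Python (entirely outside of DataStage; the classify_color function and the sample document are mine, for illustration) that tells apart a real value, an empty element, a formally null element, and a missing one:

import xml.etree.ElementTree as ET

XSI_NIL = "{http://www.w3.org/2001/XMLSchema-instance}nil"

def classify_color(product):
    """Return (status, value) for the <color> element of one <product> node."""
    color = product.find("color")
    if color is None:
        return ("missing", None)        # element absent from the document
    if color.get(XSI_NIL) == "true":
        return ("nil", None)            # formally null: <color xsi:nil="true"/>
    if color.text is None or color.text == "":
        return ("empty", "")            # <color></color> or <color/>
    return ("value", color.text)        # an actual value, e.g. "blue"

doc = ET.fromstring(
    '<products xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
    '<product><number>1</number><color>blue</color></product>'
    '<product><number>2</number><color/></product>'
    '<product><number>3</number><color xsi:nil="true"/></product>'
    '<product><number>4</number></product>'
    '</products>')

for p in doc.findall("product"):
    print(p.findtext("number"), classify_color(p))

In effect, the two check boxes decide how the “empty,” “nil,” and “missing” cases collapse into blanks or NULLs on the output link.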

Last week I worked with a customer who wanted to separate the xsi:nil="true" elements from the ones that were “missing”. They wanted the xsi:nil="true" elements to become null columns, and the ones that were truly missing to be assigned an actual string value of “MISSING”. I’ll follow this up with an entry on using XSLT to perform subtle column assignments like this one.
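
While that XSLT entry is still to come, here’s a rough sketch of the column assignment they wanted, in the same plain-Python terms as above (reusing classify_color from the earlier sketch; the function name and the mapping are mine):

def color_column(product):
    status, value = classify_color(product)
    if status == "nil":
        return None         # xsi:nil="true" becomes a true NULL column
    if status == "missing":
        return "MISSING"    # a truly absent element gets the literal marker string
    return value            # "" for an empty element, or the real color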

-e



…another good link on using XML and Web Services

…been meaning to post a link to the DeveloperWorks site and a very nice, detailed piece that some of my IBM colleagues put together on XMLPack and Web Services Pack last year. I haven’t had a chance to review every fine point, but I especially like the level of detail in their screen shots, which clarify when, where, and how DataStage uses XPath in its XML functionality. Another good resource for review: http://www.ibm.com/developerworks/db2/library/techarticle/dm-0703xiong/index.html

-e

XML with embedded XPath in the content…

Wow. I haven’t written in a while. Been off in the land of metadata for the past few months, or heads down on some other projects. No excuse; some juicy real-time/xml/java issues have been coming up lately and are piled on my desk, waiting to be written up here. I also have to finally figure out how to do attachments so that I can share some DataStage Jobs and other stuff.

Ran into an interesting one today that I haven’t seen before, and I see a lot of XML issues. Guess there’s always something new around the corner; in reality I’ve only scratched the surface of what XML has to offer… and unless my teammates at IBM and/or our customers run into something, I’m not likely to see it.

Anyway, this one is pretty cool. I was given an XML document whose content contains references that point to “other” XML content in the same document. In this case, it is a node with an attribute containing XPath that points to another node somewhere (presumably above) in the document. This attribute is present if and only if all the other sub-elements of the node are missing. If so, the processing needs to locate the other node and retrieve the values found there in order to populate all the values of the node containing this “reference” attribute. A sort of “recursive” lookup into the same XML document.

I haven’t entirely worked out the solution using DataStage yet, although the strategy is fairly clear. First retrieve the node, check for the reference attribute, and if present, pull out the XPath. This XPath then needs to be used further on in the flow to pull the “other” content from the same XML document. Compare for nulls and then take the populated content before moving forward… A bit tedious, but doable, since DataStage lets me throw XPath around and dynamically create and use stylesheets.
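
In the meantime, for anyone who wants to play with the lookup itself, here’s a minimal sketch in Python with lxml (this is not the DataStage/XSLT flow described above; the ref attribute name and the document shape are my inventions for illustration):

from lxml import etree   # assumes the lxml library is installed

SAMPLE = (
    b'<catalog>'
    b'<item id="a"><color>blue</color><size>L</size></item>'
    b'<item ref="/catalog/item[@id=\'a\']"/>'   # sub-elements missing, so it points elsewhere
    b'</catalog>')

doc = etree.fromstring(SAMPLE)

for item in doc.findall("item"):
    path = item.get("ref")           # the "reference" attribute, present only when the content lives elsewhere
    source = item
    if path is not None:
        matches = doc.xpath(path)    # evaluate the embedded XPath against the same document
        source = matches[0] if matches else None
    if source is not None:
        print(source.get("id"), [(child.tag, child.text) for child in source])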

Anyone else encountered XML like this with internal references?

Ernie

“Almost” real-time Jobs

It must be Q4, and close to the holidays.   Questions come in like heavy rain, and it’s times like this that I see common patterns in the scenarios, whether they are from my IBM teammates helping customers understand our offerings, or from customers and users directly, trying to get things finished before taking well-deserved time off (what’s that? 🙂 )

At any rate, I was recently involved in a bunch of discussions around ETL jobs that wait for files, or loop through directories on a continuous basis, reading their contents. This is a common scenario for reading XML input. Imagine a subdirectory that receives XML documents (all matching the same schema) throughout the night, via FTP from remote locations. It is desirable to have the job or map sitting there, waiting, until a file arrives, then immediately process that file and go back to waiting.

There are myriad solutions for handling this, usually with some type of polling mechanism. It’s not as pure and simple, however, as the classic “always on” Job that uses a standard API (MQSeries, JMS, etc.) with a built-in wait parameter. Those “always on” jobs are more predictable, across vendors and across protocols, and the pattern is well known whether it’s developed with a graphical toolset or directly in C++. The polling solutions are as varied as there are creative uses of these tools. Some possibilities: scripts that sleep, run, then loop and sleep again; remote daemons that wait for a specific file to arrive and then say “go” to the transformation job; or an external scheduler (cron, BMC Patrol, AutoSys, etc.) that monitors the conditions and kicks off the ETL process when appropriate. There may be some custom API-based solutions out there that force an ETL transformation such as a DataStage Job to blindly wait for input that “happens” to come from a subdirectory, but I haven’t seen any.

Most of the polling solutions are looping batch jobs… they do their work, finish reading their input, complete normally, then terminate and are asked to wait again [or their launcher does the waiting]. They act like “always on,” but have less of the complexity because they are typically “very frequent batch.” Again, custom solutions may exist, but these polling solutions are fairly easy to build out of the box, using the tools themselves or in conjunction with system management products. Many people prefer to use the system management tooling because it’s better designed to handle circumstances such as machine restarts, clean-up strategies, etc.
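
As an illustration of the “very frequent batch” flavor, here’s a minimal polling sketch in Python; the directory names, the interval, and the process stub are all invented, and in real life process() would launch the actual job (with DataStage, for example, by shelling out to dsjob):

import glob
import os
import time

WATCH_DIR = "/data/incoming"      # hypothetical FTP landing directory
DONE_DIR = "/data/processed"      # files are moved aside so they are not reread
POLL_SECONDS = 30

def process(path):
    # Stand-in for kicking off the real ETL job for this one file.
    print("processing", path)

while True:
    for path in sorted(glob.glob(os.path.join(WATCH_DIR, "*.xml"))):
        process(path)
        os.rename(path, os.path.join(DONE_DIR, os.path.basename(path)))
    time.sleep(POLL_SECONDS)      # then go back to waiting

One real-world wrinkle to watch for: a poller can pick up a file while FTP is still writing it, which is why many shops have the sender upload under a temporary name and rename only when the transfer is complete.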

Reading XML Content as a Source

Reading XML Content as a Source can sometimes be a challenge for ETL types of tools. There are various reasons, but the most common issue is that XML documents can optionally contain carriage return and line feed characters (typically hex 0D0A) to make them look “pretty” when opened in common editors. These characters are ignored by XML processors, but are typically respected by default flat-file ETL tooling, which wants to treat every line as a record.

Fortunately, with the increased popularity of XML as a source, these problems have largely disappeared. By now, most ETL tools have a way of reading XML files from disk, or directly from message queues and the like, so that the formatting of the document is not an issue.

DataStage in particular has a variety of techniques that can be used to read XML. Which one you choose will depend largely on which type of DataStage you are using and whether you want to read in a single document or a long list of them in a subdirectory. The most common methods I’ve seen for reading in XML in DataStage are:

Folder Stage. For Server Jobs, this is the easiest. The Folder Stage looks for a document or documents on disk based on a subdirectory entry and a wildcard argument for a filename. Both can be parameterized. Look for a built-in Table Definition called Folder that contains the two columns returned by the Stage: FileName and Record. FileName returns the actual, optionally fully qualified, filename for the document, and Record contains the entire contents of the file… all of its tags, characters, line feeds, etc. Everything. This is then handed as a unit to the XMLInput Stage, where the elements and attributes are deciphered. If the documents are really large, it may simply be better to send only the filename, and ask the XMLInput Stage to do the actual I/O.
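
Outside of DataStage, the Folder Stage’s two-column output is roughly equivalent to this little sketch (the function name is mine):

import glob
import os

def folder_stage(directory, pattern="*.xml"):
    """Yield (FileName, Record) pairs, like the Folder Stage's two output columns."""
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path, encoding="utf-8") as f:
            yield path, f.read()   # Record holds the whole document: tags, line feeds, everything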

External Source Stage. I’ve found this Stage to be the easiest to use for Enterprise Edition (EE) Jobs. It easily allows me to issue an "ls" command for a particular subdirectory, and then send the resulting list to a waiting XMLInput Stage downstream. The Specific Program property contains the string "ls #xmlSubDirectory#*.xml | sort" (without the quotes), and of course could do other things like filter the list via grep. I place the results into a column on the Output link called resultingFileNames, with a varchar datatype and length of 200.

Update: This technique was also recently documented in the XML Stage Redbook (section 7.1.2, page 152). The redbook can be found at the following link:

XML Redbook

MQSeries Stage. In this case I’m reading the XML content from an MQSeries message queue. It’s just a “chunk” of character bytes as far as MQ and DataStage are concerned. The chunk received is then sent into XMLInput. This can be limiting because MQSeries has its own 100MB limit on message size, although this can be alleviated with various MQSeries techniques like segmentation.
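
If you’d like to see the same “chunk of bytes” idea outside of DataStage, here’s a hedged sketch using the pymqi client library (the queue manager, channel, and queue names are invented; the MQSeries Stage does all of this natively, wait parameters included):

import xml.etree.ElementTree as ET

import pymqi   # assumes the pymqi MQ client library is installed

qmgr = pymqi.connect("QM1", "SVRCONN.CHANNEL", "localhost(1414)")   # hypothetical connection details
queue = pymqi.Queue(qmgr, "XML.INPUT.QUEUE")
try:
    payload = queue.get()           # one message = one chunk of character bytes
    root = ET.fromstring(payload)   # hand the chunk to the parser, just as XMLInput would
    print("root element:", root.tag)
finally:
    queue.close()
    qmgr.disconnect()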