Reviewing really LARGE xml documents…

Recently I came across a tool that was very helpful in the review of large XML documents…..REALLY large xml documents…..Hundreds of megabytes and more! In the years I’ve been working with XML, every so often I come across a document that I can’t read, can’t process, or can’t otherwise easily research for a data integration application. The tools I usually employ either hit errors or take forever to open while parsing thru thousands upon thousands of elements and attributes.

XMLMax, from xponentsoftware, is a convenient alternative to using IE to open my XML, alleviating the need to wait and wait while a large document is opened (all too often failing in the process in one way or another). According to the author, XML Max uses several proprietary buffering techniques that support the immediate display of content and faster consumption of the entire document. I found that it was able to read two and three hundred megabyte documents in mere seconds when compared with other mechanisms.

XMLMax is not just an XML reader, as it also has the ability to edit and manipulate your documents. I found it intuitive to use and easy to install, and liked its features for finding the closing tags of any element, especially large ones. Another great feature is the default for ignoring DTDs. So many other tools default to looking for a DTD if specified in a document, and then blow up when trying to open them! My interests lie in particular with the study and examination of large documents for use as a source or target, and so I focused on those features. It is worth noting, however, that XMLMax also provides ways to use XSLT to break up documents into smaller pieces. At the time of this review, the product only ran on Windows, but there are apparently plans to port it to other platforms and also to provide a command line interface for programmatic operation.

Check it out at http://www.xponentsoftware.com/.

Ernie

Posted in xml. Tags: . Leave a Comment »

Handling Nulls in XML Sources

Reading XML is very straightforward when everything is present, all the elements are populated, and the entire XML “tree” exists in the current document that you are reading. However, when elements are missing from the document entirely, are “empty” (by XML definition), or present but formally indicated as null, things can get tricky. Nulls can be confusing enough as it is! I’ll try here to clarify the null issues you should be considering in your ETL processes, and outline specifically how such things are managed in DataStage when you use XMLInput…

For starters, let’s leave the “Repeating Element” (the one you mark as “key”) out of this blog entry. It has its own important considerations. We’ll concentrate for now on standard elements that may or may not appear in your document, may be empty, or may be formally marked as null (element and its tags exist, but it is empty with attribute xsi:nil=”true”). Let’s look at some examples.

Imagine the details of a document about products. It might have an element called “color.” Color could have a value:

<product>
<number>123</number>
<color>blue</blue>
<location>A</location>
.
.
.
</product>

 

Color could be “empty”:

 

<product>
<number>123</number>
<color></color> [or <color/> ]
<location>A</location>
.
.
.
</product>

 

Color could be formally null:

 

<product>
<number>123</number>
<color xsi:nil=”true”></color>
<location>A</location>
.
.
.

</product>

 

…or be entirely missing (for a given product):

 

<product>
<number>123</number>
<location>A</location>
.
.
.
</product>

 

What constitutes a null value? When is it blank? When your tool reads and parses this XML document, what do you want it to do?

DataStage gives us several options. There are two check boxes in the Transformation tab of the Output Link. By default, these boxes are unchecked:

Replace NULLs with empty values [blanks]
Replace empty values with NULLs

Without doing anything, empty elements will be treated as blanks, and null indicated or missing elements will be treated as null.

If you only check “Replace NULLs with empty values,” nothing will be null anymore….

…and if you only check “Replace empty values with NULLs,” more of your rows will have nulls for this column, because anything that was blank in the default scenario will now be null.

Finally, if you check both boxes, only the empty elements will be reteived as null, because the missing elements and the formally null ones will be treated as blank.

Clear as mud, I know. 😉

Last week I worked with a customer who wanted to separate the xsi:nil=”true” elements from the ones that were “missing”. They wanted the xsi:nil=”true” to become null columns, and the ones that were truly missing to be assigned an actual string value of “MISSING.”. I’ll follow this up with an entry on using XSLT to perform subtle column assignments like this one.

-e

…test line added to validate a new RSS reader…

Posted in datastage, etl, xml. Tags: . 3 Comments »