Handling Nulls in XML Sources

Reading XML is very straightforward when everything is present, all the elements are populated, and the entire XML “tree” exists in the current document that you are reading. However, when elements are missing from the document entirely, are “empty” (by XML definition), or present but formally indicated as null, things can get tricky. Nulls can be confusing enough as it is! I’ll try here to clarify the null issues you should be considering in your ETL processes, and outline specifically how such things are managed in DataStage when you use XMLInput…

For starters, let’s leave the “Repeating Element” (the one you mark as “key”) out of this blog entry. It has its own important considerations. We’ll concentrate for now on standard elements that may or may not appear in your document, may be empty, or may be formally marked as null (element and its tags exist, but it is empty with attribute xsi:nil=”true”). Let’s look at some examples.

Imagine the details of a document about products. It might have an element called “color.” Color could have a value:

<product>
<number>123</number>
<color>blue</blue>
<location>A</location>
.
.
.
</product>

 

Color could be “empty”:

 

<product>
<number>123</number>
<color></color> [or <color/> ]
<location>A</location>
.
.
.
</product>

 

Color could be formally null:

 

<product>
<number>123</number>
<color xsi:nil=”true”></color>
<location>A</location>
.
.
.

</product>

 

…or be entirely missing (for a given product):

 

<product>
<number>123</number>
<location>A</location>
.
.
.
</product>

 

What constitutes a null value? When is it blank? When your tool reads and parses this XML document, what do you want it to do?

DataStage gives us several options. There are two check boxes in the Transformation tab of the Output Link. By default, these boxes are unchecked:

Replace NULLs with empty values [blanks]
Replace empty values with NULLs

Without doing anything, empty elements will be treated as blanks, and null indicated or missing elements will be treated as null.

If you only check “Replace NULLs with empty values,” nothing will be null anymore….

…and if you only check “Replace empty values with NULLs,” more of your rows will have nulls for this column, because anything that was blank in the default scenario will now be null.

Finally, if you check both boxes, only the empty elements will be reteived as null, because the missing elements and the formally null ones will be treated as blank.

Clear as mud, I know. ;)

Last week I worked with a customer who wanted to separate the xsi:nil=”true” elements from the ones that were “missing”. They wanted the xsi:nil=”true” to become null columns, and the ones that were truly missing to be assigned an actual string value of “MISSING.”. I’ll follow this up with an entry on using XSLT to perform subtle column assignments like this one.

-e

…test line added to validate a new RSS reader…

Posted in datastage, etl, xml. Tags: . 3 Comments »