Dealing with “primitive” XML

Yesterday I encounted an interesting XML scenario.   One I’ve seen before, but not in awhile.  It began as many do, with questions regarding the import of XML metadata.   Our importer was having problems reading the structure, and the issue trickled onto my desk.  Upon further inspection of the XML document instance, it became clear why we couldn’t read it, but later it was the structure of the document (and the issues encountered trying to generate it) that encouraged me to make this entry today.

The situation we enountered is one that I see on occasion, especially for “older” (usually 5 – 6 years) XML interfaces that were designed when XML was still the “new kid on the block” as far as technology is concerned.  XML has been around longer than six years, but it was around that time that XSDs were starting to blossom (vs DTDs) and XML was turning from its SGML roots and primarily text oriented usage (remember all the “Shakespeare” examples that used to be popular?) to being structured for datatypes and data oriented use cases (I’ve seen testimony by early XML experts who believe that the XML specification was “hijacked” by the database vendors of the world 😉 ).   Whether you agree with that or not, be prepared to deal with XML structures like the following:

<RECORD><KEY>any record key value</KEY>

First of all, it can’t be read by the importer, because it is not a valid document.   So….try to open any document that you have in Internet Explorer, or other tool that handles valid XML.   If IE can’t read it, neither will DataStage.   In this case, it’s not a valid XML document.  It has no header (<?xml version=”1.0″ ?> or similar), and there is no closing tag for <RECORD>.  Further, it turns out that this one one physical record on a flat file, followed on each line by another identical structure.    So this “xml document” is actually a “collection” of xml “chunks”.    With a little bit of extra refinement, it could be a set of repeating nodes in a larger document, or perhaps at least complete documents, one per physical record (as it might be if you had complete documents in a message queue). 

The second “offense” in this document is the “repeating group that is not really a repeating group.”   Here we see the “primitive” design (read “poor” design).   There may be some debate here, but this is really the “easy” way out when trying to model your data structures with XML.    Part of the beauty of XML is the fact that the metadata is represented by the tags.   The element “NAME” above is not the representative meta data for the value “123 Main Street” — it’s simply a tag that contains the “name” of the column that represents “123 Main Street,” which is ADDRESS.   Primitive and lame.  I’m sure there are good reasons out there, but it’s still disappointing to see.    XML gives us the ability to have self describing metadata — far better it would be to have an element of <ADDRESS>123 Main</ADDRESS>, whose “value” is “123 Main”…..simply the value of the text within the element.  Why carry an extra element?    Why have all that additional data?   Do I need an element actually called “value”?   Why?  XML already handles that within its tagging mechanism!  

That being said, how do I go about creating this?   Tooling like DataStage for building XML are designed to bring in many rows from a source and aggregate them into the desired XML structure.  Repeating groups, in a well designed document, would be “rows” of PERSON, each with unique element names.    In this case, the xml repeating group is not a repeating group.    If we have 100’s of columns, it might be justifiable to first pivot the data, creating “n” rows for “m” columns, and then generate the XML noted above.   However, if we’re talking about a minimal number of columns, then it’s probably best to deal with “primitive” xml, with the least complex and most “primitive” solution…….  just hard code the XML string in a Transformation…….  in two minutes you will have been able to generate the right string just by concatenating the tags in question into a long string.  You don’t need any headers, you don’t need dynamic validation, no namespaces or complex nested repeating groups.  Just physical generation of a poorly designed XML “chunk”.    In DataStage, something like “<RECORD><KEY>”:linkname.KEY:”</KEY><NAME>PERSON</NAME><VALUE>”:linkname.PERSON
:”</VALUE><NAME>ADDRESS</NAME><VALUE>”:linkname.ADDRESS: …..and so on to the end of your record.

Not elegant….nor is the structure you are trying to match….   -ernie



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: