Reading XML Content as a Source

Reading XML Content as a Source can sometimes be a challenge for ETL types of tools. There are various reasons, but the most common issue is that XML documents can optionally contain line feed characters (typically hex 0D0A) to make them look “pretty” when opened in common editors. These carriage return and line feed characters are ignored by XML processors, but typically are respected by default flat file ETL tooling.

Fortunately, with the increased popularity of XML as a source, these problems no longer exist. Most ETL tools by this time should have a way of reading XML files from disk, or directly (from message queues, etc.) so that this is not a problem.

DataStage in particular has a variety of techniques that can be used to read XML. Which one you choose will depend largely on which type of DataStage you are using and whether you want to read in a single document or a long list of them in a subdirectory. The most common methods I’ve seen for reading in XML in DataStage are:

Folder Stage. For Server Jobs, this is the easiest. The Folder Stage looks for a document or documents on disk based on a subdirectory entry and a wildcard argument for a filename. Both can be parameterized. Look for a built-in Table Definition called Folder that contains the two columns returned by the Stage… FileName and Record. FileName returns the actual and optionally fully qualified filename for the document, and Record contains the entire contents of the file….all of its tags, characters, line feeds, etc. Everything. This is then handed as a unit to the XMLInput Stage where the elements and attributes are deciphered. If the documents are really large, it may simply be better to send only the filename, and ask the XMLInput Stage to do the actual I/O.

External Source Stage. I’ve found this Stage to be the easiest to use for Enterprise Edition (EE) Jobs. It easily allows me to issue an “ls” command for a particular subdirectory, and then send the resulting list to a waiting XMLInput Stage downstream. The Specific Program property contains the string “ls #xmlSubDirectory#*.xml | sort” (without the quotes), and of course could do other things like filter the list via grep. I place the results into a column on the Output link called resultingFileNames, with a varchar datatype and length of 200.

MQSeries Stage. In this case I’m reading the XML Content from an MQ Series message queue. It’s just a “chunk” of character bytes as far as MQ and DataStage are concerned. The chunk received is then sent into XMLInput. This can be limiting because MQSeries has it’s own 100M limit on message size, although this can be alleviated with various MQSeries techniques like Segmentation.

6 Responses to “Reading XML Content as a Source”

  1. Leo Says:

    For External Soruce Stage, you said that you put the string “ls #xmlSubDirectory#*.xml | sort” in the Specific Program property. Can you tell me how you were able to enter the string above into the Specific Program property?

  2. dsrealtime Says:

    …Hi Leo…sorry for the late reply…not sure I understand the question….the property in the Stage is just an open text property that allows you to type in whatever you want….perhaps we’re talking about different Stage/Operator types? — ernie

  3. Pete Morris Says:

    Hi Leo,
    when using the external source stage is there only one output column i.e. resultingFileNames.

    what are the xml input properties set to on the input tab i.e input column = resultingFileNames (is it a key column?) etc.

    .

  4. dsrealtime Says:

    Hi Pete…

    The single column coming from the External Source is tagged with the option on input as “URL” (as opposed to “XML Content”). This directs XMLInput to do its own i/o using the location sent from the unix “ls” command in the External Source. Each “row” sent from the External Source will point to another XML document.

    Ernie

  5. John Chaves Says:

    Great stuff..

    What if the xml is behind a protected SSL site? Which Stage should I use to connect to the server (using https), login, and retrieve the xml?

    • dsrealtime Says:

      The only way I’ve seen creative retrieval done via HTTP and HTTPs is to use JavaPack. I haven’t done it myself, but have talked to folks who have written simple Java classes to get things from a remote site over HTTP.


Leave a Reply