Reading XML Content as a Source can sometimes be a challenge for ETL types of tools. There are various reasons, but the most common issue is that XML documents can optionally contain line feed characters (typically hex 0D0A) to make them look “pretty” when opened in common editors. These carriage return and line feed characters are ignored by XML processors, but typically are respected by default flat file ETL tooling.
Fortunately, with the increased popularity of XML as a source, these problems no longer exist. Most ETL tools by this time should have a way of reading XML files from disk, or directly (from message queues, etc.) so that this is not a problem.
DataStage in particular has a variety of techniques that can be used to read XML. Which one you choose will depend largely on which type of DataStage you are using and whether you want to read in a single document or a long list of them in a subdirectory. The most common methods I’ve seen for reading in XML in DataStage are:
Folder Stage. For Server Jobs, this is the easiest. The Folder Stage looks for a document or documents on disk based on a subdirectory entry and a wildcard argument for a filename. Both can be parameterized. Look for a built-in Table Definition called Folder that contains the two columns returned by the Stage… FileName and Record. FileName returns the actual and optionally fully qualified filename for the document, and Record contains the entire contents of the file….all of its tags, characters, line feeds, etc. Everything. This is then handed as a unit to the XMLInput Stage where the elements and attributes are deciphered. If the documents are really large, it may simply be better to send only the filename, and ask the XMLInput Stage to do the actual I/O.
External Source Stage. I’ve found this Stage to be the easiest to use for Enterprise Edition (EE) Jobs. It easily allows me to issue an “ls” command for a particular subdirectory, and then send the resulting list to a waiting XMLInput Stage downstream. The Specific Program property contains the string “ls #xmlSubDirectory#*.xml | sort” (without the quotes), and of course could do other things like filter the list via grep. I place the results into a column on the Output link called resultingFileNames, with a varchar datatype and length of 200.
Update:This technique was also recently documented in the XML Stage Redbook, section 7.1.2, page 152. The redbook can be found at the following link:
MQSeries Stage. In this case I’m reading the XML Content from an MQ Series message queue. It’s just a “chunk” of character bytes as far as MQ and DataStage are concerned. The chunk received is then sent into XMLInput. This can be limiting because MQSeries has it’s own 100M limit on message size, although this can be alleviated with various MQSeries techniques like Segmentation.