Reading XML Content as a Source

Reading XML Content as a Source can sometimes be a challenge for ETL types of tools. There are various reasons, but the most common issue is that XML documents can optionally contain line feed characters (typically hex 0D0A) to make them look “pretty” when opened in common editors. These carriage return and line feed characters are ignored by XML processors, but typically are respected by default flat file ETL tooling.

Fortunately, with the increased popularity of XML as a source, these problems no longer exist. Most ETL tools by this time should have a way of reading XML files from disk, or directly (from message queues, etc.) so that this is not a problem.

DataStage in particular has a variety of techniques that can be used to read XML. Which one you choose will depend largely on which type of DataStage you are using and whether you want to read in a single document or a long list of them in a subdirectory. The most common methods I’ve seen for reading in XML in DataStage are:

Folder Stage. For Server Jobs, this is the easiest. The Folder Stage looks for a document or documents on disk based on a subdirectory entry and a wildcard argument for a filename. Both can be parameterized. Look for a built-in Table Definition called Folder that contains the two columns returned by the Stage… FileName and Record. FileName returns the actual and optionally fully qualified filename for the document, and Record contains the entire contents of the file….all of its tags, characters, line feeds, etc. Everything. This is then handed as a unit to the XMLInput Stage where the elements and attributes are deciphered. If the documents are really large, it may simply be better to send only the filename, and ask the XMLInput Stage to do the actual I/O.

External Source Stage. I’ve found this Stage to be the easiest to use for Enterprise Edition (EE) Jobs. It easily allows me to issue an “ls” command for a particular subdirectory, and then send the resulting list to a waiting XMLInput Stage downstream. The Specific Program property contains the string “ls #xmlSubDirectory#*.xml | sort” (without the quotes), and of course could do other things like filter the list via grep. I place the results into a column on the Output link called resultingFileNames, with a varchar datatype and length of 200.

Update:This technique was also recently documented in the XML Stage Redbook, section 7.1.2, page 152. The redbook can be found at the following link:

XML Redbook

MQSeries Stage. In this case I’m reading the XML Content from an MQ Series message queue. It’s just a “chunk” of character bytes as far as MQ and DataStage are concerned. The chunk received is then sent into XMLInput. This can be limiting because MQSeries has it’s own 100M limit on message size, although this can be alleviated with various MQSeries techniques like Segmentation.

Advertisements

31 Responses to “Reading XML Content as a Source”

  1. Leo Says:

    For External Soruce Stage, you said that you put the string “ls #xmlSubDirectory#*.xml | sort” in the Specific Program property. Can you tell me how you were able to enter the string above into the Specific Program property?

  2. dsrealtime Says:

    …Hi Leo…sorry for the late reply…not sure I understand the question….the property in the Stage is just an open text property that allows you to type in whatever you want….perhaps we’re talking about different Stage/Operator types? — ernie

  3. Pete Morris Says:

    Hi Leo,
    when using the external source stage is there only one output column i.e. resultingFileNames.

    what are the xml input properties set to on the input tab i.e input column = resultingFileNames (is it a key column?) etc.

    .

  4. dsrealtime Says:

    Hi Pete…

    The single column coming from the External Source is tagged with the option on input as “URL” (as opposed to “XML Content”). This directs XMLInput to do its own i/o using the location sent from the unix “ls” command in the External Source. Each “row” sent from the External Source will point to another XML document.

    Ernie

  5. John Chaves Says:

    Great stuff..

    What if the xml is behind a protected SSL site? Which Stage should I use to connect to the server (using https), login, and retrieve the xml?

    • dsrealtime Says:

      The only way I’ve seen creative retrieval done via HTTP and HTTPs is to use JavaPack. I haven’t done it myself, but have talked to folks who have written simple Java classes to get things from a remote site over HTTP.

  6. Swetha Says:

    Hi,
    I’m trying to read Xml data using Datastage server job and write to sequential file,
    This xml file is stored in sql server table. The job runs fine for around 30,000 rows and then aborts with error “XML input document parsing failed. Reason: Xalan fatal error (publicId: , systemId: , line: 1, column: 1826): Invalid character (Unicode: 0x1A)” . I serached and couldnot find any help. Have you ever come across this, Not sure what is causing the issue. The output seems fine for all other rows processed before this particular row. How should i remove or convert this Unicode 0x1A?
    Thanks

    • dsrealtime Says:

      There is something bad about that particular row. To be sure, you might want to identify that particular row if you can, save it to disk, and then try to open it independently in IE or other tooling and see if it indeed has an offending character in its content. …and whether it is “offending” or something else like “real data” that someone hoped to be storing here [unlikely but you never know]…. Simplest thing to do (outside of formally spitting out the row and ensuring that someone fixes it in SQL Server) is to pass the column containing the row thru a Transformer and just repair it there. Convert the offensive characters, if found, to hex 20’s (blanks).

  7. Swetha Says:

    I saved the xml and when opening using IE I got the error “An invalid character was found in text content. Error processing resource” , I think the issue is because of character ’ (for example, an element in this xml has text like “balcony’s”) it is not a quote (‘). when I remove that character the xml opens fine. I tried to replace it in transformer but it still the job aborts with the same Unicode error.

    • dsrealtime Says:

      The character is still there….look thru the manual for the Transformer and the detailed expressions that are available…you have to find the offensive characters and delete them entirely. ereplace() might be a suggestion, but there are other functions to look at also.

      Ernie

  8. 2010 in review « Real-Time Data Integration Says:

    […] Reading XML Content as a Source December 200710 comments 3 […]

  9. Sagar Says:

    I am using a real time job with no input and output ISD stages, that is Topology 1. My request is a xml with CDATA tag. In trasnformer when I map to a varchar column it throws a warning called Invalid argvalue and puts ‘,missing =. at the end of CDATA tag. My record gets processed successfully and not sure why I get this warning for end of CDATA tag.

    • dsrealtime Says:

      Not sure I fully understand. Certainly the issue has nothing to do with ISD. Are you writing the xml or reading it? What release and what xml stage are you using? Is it only on “one” of the rows, or every row that contains CDATA? You mentioned the Transformer…do you mean a regular transformer?

      Ernie

  10. Sagar Says:

    I am using IIS 8.1 fix pack 1 and am using a regular trasnformer. The request is passed as a paramter to a job and the job paramter is a string. In trasnformer when I map this job parameter to a output column which is VARCHAR it gives a warning saying invalid arg value and displays the entire xml is warning with “missing” at the end of CDATA tag. Not sure if its treating the end of CDATA in incorrect manner. Have u experienced such a beahviour before ?

    • dsrealtime Says:

      No. It’s not clear which stage is giving the error. If the xml stage didn’t exist downstream, do you still get the error? It’s possible that it has nothing at all to do with xml — maybe the string is being truncated somehow. How big is the string? Any special or odd characters in your xml? Does it happen with only one particular document that you send? Does it matter that you send the value via web services? You noted that it is not a Job with ISDinput. Does the same job fail when you run it from the Director and send the same parameter value? These are all things to check as you debug this.

  11. Rocker Says:

    Hi Ernie,
    Whatever you are, you are a genius to have found this out. I was struggling to read an xml file, but after reading just 5 lines from this blog of yours, I could do my job within 5 mins. Great help buddy. Thanks a lot.
    Thanks.

  12. Arash Says:

    Hi all,
    I have to use xml file as input source, and my xml file is very complicate, it has 4 differents language for each tag/atr
    i m looking for some tuto
    my email is id_2pc@yahoo.com
    thank you
    Best rgards
    Arash

    • dsrealtime Says:

      Hi Arash….language issues are usually handled successfully by the encoding of the document and your NLS settings…….your greater difficulty (if any) will probably be in the structure and retrieval. The size and structure of any xml document is usually where the challenges might be when parsing it. Take a look thru my blog entries, and spend a lot of time searching thru the 1000’s of xml postings on http://www.dsxchange.com . The documentation is very good also, for all the xml Stages that are in DataStage. If you have any specific questions, post them here and I’ll do my best to provide ideas or direct you to other resources.

      Ernie

  13. dspxguy Says:

    Hi Ernie,
    I would like to know the limitation of handling large XML files in 8.5. Do you think reading 8-10 GB files will work in 8.5?
    Thanks

    • dsrealtime Says:

      8 – 10 GB files will work fine in 8.5….they will work fine in 7.5 ………800 to 1gig will also work in 8.5….. : ) Prior to 8.5, the limit is between 200 and 300meg in my experience, depending on the document and its structure. The new xmlStage in 8.5 uses a different methodology for reading xml, and isn’t hindered by this limit.

  14. sumi Says:

    Hi ernie,

    I was able to read the file which has one xml file. I was not able to read the file when there are multiple xml files in the same file.
    The file content is
    First line- one xml file
    second line – another xml file

    I am getting the below warning
    XML input document parsing failed. Reason: Xalan fatal error (publicId: , systemId: xmlres.xml, line: 2, column: 7): Not valid after content

    Could you please help to resolve this?

    Regards,
    Sumi

    • dsrealtime Says:

      For this to work you will have to treat each indivual line of that file as a single document on the link…….as a whole “file”, it’s invalid as xml…but as “individual” documents, each treated alone, it will be fine……..the problem will be getting each of those into their own column……..if you can ensure that there are no odd characters, especially CRLFs embedded within each document, you can probably read it with a sequential Stage…..try that first….see if you can get a job that reads the file using a sequential stage, with one giant column, and puts out as many rows as you have documents. Hopefully there is only one CRLF per line, at the end of each “document”.

  15. Sunil Says:

    ‘Great information Pete. Can you please eloborate more on “If the documents are really large, it may simply be better to send only the filename, and ask the XMLInput Stage to do the actual I/O.” with some example. How to configure the XML Input Stage!

  16. Vidhya Says:

    Hi Ernie,
    I am trying to read xml file using DS Parallell job and writing into seq.file. I am using External source stage and gave “ls pathname” in the source program. Also set imput column content as url/File path in xml input stage. Still I am getting “The primary document entity could not be opened. ” warning and 0 records imported. I am able to open the file using IE and there are no wierd looking characters in the xml that I could find. What else do I need to look into..?

  17. Pierluigi Modesti Says:

    Hello Ernie, we met some months ago in Turin (Italy).
    I have the following issue and hope you can help me: the XML input is tored in a Sql Server xml-type column. What’s your suggestion to read it with DS8.1?
    Thank you,
    Pierluigi

  18. Priya Says:

    Hi Ernie,

    I am retrieving data from DB2 to Oracle in xml format. I have imported my xsd and using XML Stage. In xml assembly, I am using Regroup and XML Composer to step. But I am getting the output combined in one structure. I would like to have each child in one record. Also, I am not getting in my xml output. My data within the tags are read as &gt and &lt. But I would like to read as > and <. Could you please help me with the below.

    • dsrealtime Says:

      Hi Priya.

      The &gt and &lt are by design and are required by the xml standard if they are parts of random text that is “inside” of an element. Such as please check this document &gt schedule.doc! Put that in a valid xml document and open it in your browser — it will show up just fine as: please check this document > schedule.doc!. Don’t use notepad or other text editor to review your xml — open it in a tool that knows how to parse it.

      As for having each record, you can either map your changing key (customer number, page number, etc.) to the document collection (see page 253 in the xml stage redbook), or instead, work with the end-of-wave operator upstream from your xml stage. Set the end-of-wave so that it checks a flag whose value changes every time you get a new group of sorted rows that meets the conditions for a “new” xml document.

      Ernie


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: