Lost in translation?

editor’s note:  It gives me great pleasure to introduce Beate Porst, a good friend and colleague, who is the Offering Manager for DataStage and other parts of the Information Server platform.  Beate will be sharing her insights into Unified Governance and Integration, based on many years of experience with this platform and the issues surrounding data transformation and management.  Today she introduces some of the key new capabilities of Information Server v11.7.  Please welcome Beate to dsrealtime!   –ernie

How IBM Information Server v11.7 could have saved NASA’s 125-million-dollar Mars orbiter from becoming lost.

We all know the saying: measure twice, cut once. But what if we measure carefully and still don’t know the context of our data?

That is what happened to NASA in 1999. The numbers themselves were right, but the 125-million-dollar Mars orbiter was designed to use the metric system while mission control performed course corrections using the imperial system. The mismatch brought the orbiter in at far too low an altitude, and contact with it was lost. An embarrassing moment for NASA.

But it wasn’t the only such incident. In 2003, German and Swiss engineers started to build a bridge over the river Rhine at the border town of Laufenburg. Each country began building from its own side, with the goal of meeting in the middle. So much for the plan. Both teams used “sea level” as their reference point. The problem: sea level in Germany is based on the North Sea, whereas in Switzerland it is based on the Mediterranean, a difference of 27 cm. The builders on the German side knew about the difference, but apparently not whether to add it to or subtract it from their base. They made the wrong choice.


Historical documents show that using out-of-context, incomplete, or inaccurate data has caused problems ever since mankind started to develop different units of measurement.

So the question is: how can you avoid costly incidents like these and successfully conquer your data problems, and how can IBM Information Server help you on that journey?

Whether you want to build a bridge, send an orbiter to Mars, or simply try to identify new markets, you will only be as good as the data you use. This means it must be complete, in context, trusted, and easily accessible in order to drive insights. As if this weren’t challenging enough, your competitiveness also depends on your organization’s ability to quickly adapt to changing conditions.

For more than a decade, IBM InfoSphere Information Server has been one of the market-leading platforms for data integration and governance. Users have relied on its powerful and scalable integration, quality and governance capabilities to deliver trusted information to their mission critical business initiatives.

John Muir once wrote: “The power of imagination makes us infinite”.  We have applied our power of imagination to once again reinvent the Information Server platform.

Because business agility depends on the flexibility, autonomy, competency, and productiveness of the tools that power your business, we have infused Information Server’s newest release with a number of game-changing inventions. These include deeper insight into the context of and relationships among your data, increased automation so your users can complete their work faster and more safely, and more flexible workload execution for better resource optimization. All of these are aimed at making your business more successful when tackling your most challenging data problems.

Let’s look at four of those game-changing inventions and how they are going to help your business:

  1. Contextual Search: Out-of-context data was the root cause of NASA’s failed Mars mission. The new contextual search feature, called Enterprise Search, provides your users with the context to avoid such costly mistakes. It greatly simplifies and accelerates the understanding, integration, and governance of enterprise data. Users can visually search, explore, and easily gain insights through an enriched search experience powered by a knowledge graph. The graph provides context, insight, and visibility across enterprise information, giving you a much better understanding of how data is related, linked, and used.
  2. Cognitive Design: Getting trusted data to your end users quickly is an imperative, and that process starts with your integration design environment. To help address your data integration, transformation, or curation needs quickly, Information Server V11.7 now includes a brand-new, versatile designer called DataStage™ Flow Designer. It features an intuitive, modern, and secure interface accessible to all users through a no-install, browser-based experience, accelerating your users’ productivity through automatic schema propagation, highlighted design errors, powerful type-ahead search, and full backwards compatibility with the desktop version of the DataStage™ Designer.
  3. Hybrid Execution: Data warehouse optimization is one of the leading use cases for addressing growing data volumes while simplifying and accelerating data analytics. Once again, Information Server V11.7 has strengthened its ability to run on Hadoop with a set of novel features to more efficiently operationalize your data lake environment. Among them is an industry-unique hybrid execution feature that lets you balance integration workloads across Hadoop and non-Hadoop environments, minimizing data movement and optimizing your integration resources.
  4. Automation powered by machine learning: Poor data quality is known to cost businesses millions of dollars each year. The inadvertent use of different units of measurement for the Mars orbiter was ultimately a data quality problem. However, the amount of manual work involved, combined with exponential data growth, continues to be an inhibitor for businesses trying to maintain high data quality. To counter this, Information Server V11.7 further automates the data quality process by underpinning data discovery and classification with machine learning, so that you can spend your time focusing on your business goals. The two innovative aspects are:

  - Automation rules, which let business users graphically define rules that automatically apply data rule definitions and quality dimensions to data sets based on business term assignments, and
  - One-click automated discovery, which enables discovery and analysis of all data behind a connection in a single click, providing easy and fast analysis of hundreds or thousands of data sets.

Don’t want to get lost in translation? Choose IBM Information Server V11.7 for your next data project.

…another way to load Terms into InfoSphere Business Glossary

copy-of-createbgimportscsv
copyofcreatebusinesstermsandattributesxmldsx3

Here are a few other Jobs for loading new Terms and Categories into Business Glossary. Like the earlier post on Business Glossary, these DataStage Jobs read a potential source of Terms (just alter the source stage as needed) and then create a target CSV file in the correct format for loading into Business Glossary using the new 8.1 CSV import/export features available at the Information Server Web Console… Glossary tab. The Jobs are fairly well annotated and should be self-explanatory. I haven’t yet set them up for Custom Attributes, nor have they been widely tested, but they are already being implemented at a variety of locations. Please let me know if you find them useful.

Ernie

(the one with “XML” in the name is the same as the prior blog entry. Each is named .doc, but is actually a .dsx file).

Handling Nulls in XML Sources

Reading XML is very straightforward when everything is present, all the elements are populated, and the entire XML “tree” exists in the current document that you are reading. However, when elements are missing from the document entirely, are “empty” (by XML definition), or present but formally indicated as null, things can get tricky. Nulls can be confusing enough as it is! I’ll try here to clarify the null issues you should be considering in your ETL processes, and outline specifically how such things are managed in DataStage when you use XMLInput…

For starters, let’s leave the “Repeating Element” (the one you mark as “key”) out of this blog entry. It has its own important considerations. We’ll concentrate for now on standard elements that may or may not appear in your document, may be empty, or may be formally marked as null (the element and its tags exist, but it is empty and carries the attribute xsi:nil="true"). Let’s look at some examples.

Imagine the details of a document about products. It might have an element called “color.” Color could have a value:

<product>
<number>123</number>
<color>blue</color>
<location>A</location>
.
.
.
</product>

 

Color could be “empty”:

 

<product>
<number>123</number>
<color></color> [or <color/> ]
<location>A</location>
.
.
.
</product>

 

Color could be formally null:

 

<product>
<number>123</number>
<color xsi:nil="true"></color>
<location>A</location>
.
.
.
</product>

 

…or be entirely missing (for a given product):

 

<product>
<number>123</number>
<location>A</location>
.
.
.
</product>

 

What constitutes a null value? When is it blank? When your tool reads and parses this XML document, what do you want it to do?

DataStage gives us several options. There are two check boxes in the Transformation tab of the Output Link. By default, these boxes are unchecked:

Replace NULLs with empty values [blanks]
Replace empty values with NULLs

Without doing anything, empty elements will be treated as blanks, and null-indicated or missing elements will be treated as null.

If you only check “Replace NULLs with empty values,” nothing will be null anymore….

…and if you only check “Replace empty values with NULLs,” more of your rows will have nulls for this column, because anything that was blank in the default scenario will now be null.

Finally, if you check both boxes, only the empty elements will be retrieved as null, because the missing elements and the formally null ones will be treated as blank.

Clear as mud, I know. 😉
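If it helps to see those three cases outside of DataStage, here is a small standalone Java sketch (plain JAXP, nothing DataStage-specific, and certainly not how XMLInput works internally) that classifies the color element of each product as having a value, being empty, being formally null via xsi:nil, or missing entirely. The element names match the examples above; everything else is purely illustrative.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import java.io.ByteArrayInputStream;

public class ColorNullCheck {
    private static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";

    public static void main(String[] args) throws Exception {
        // One product per case: value, empty, formally null, missing.
        String xml =
            "<products xmlns:xsi='" + XSI + "'>" +
            "  <product><number>121</number><color>blue</color></product>" +
            "  <product><number>122</number><color></color></product>" +
            "  <product><number>123</number><color xsi:nil='true'></color></product>" +
            "  <product><number>124</number></product>" +
            "</products>";

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                          .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));

        NodeList products = doc.getElementsByTagName("product");
        for (int i = 0; i < products.getLength(); i++) {
            Element product = (Element) products.item(i);
            String number = product.getElementsByTagName("number").item(0).getTextContent();
            NodeList colors = product.getElementsByTagName("color");

            String verdict;
            if (colors.getLength() == 0) {
                verdict = "missing";                      // element absent from the document
            } else {
                Element color = (Element) colors.item(0);
                if ("true".equals(color.getAttributeNS(XSI, "nil"))) {
                    verdict = "formally null (xsi:nil)";  // tags present, explicitly nil
                } else if (color.getTextContent().isEmpty()) {
                    verdict = "empty";                    // tags present, no content
                } else {
                    verdict = "value: " + color.getTextContent();
                }
            }
            System.out.println("product " + number + ": color is " + verdict);
        }
    }
}

The two check boxes described above then simply decide which of those buckets end up as null and which end up as blank on the output link.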

Last week I worked with a customer who wanted to separate the xsi:nil="true" elements from the ones that were “missing”. They wanted the xsi:nil="true" elements to become null columns, and the ones that were truly missing to be assigned an actual string value of “MISSING”. I’ll follow this up with an entry on using XSLT to perform subtle column assignments like this one.

-e



MQSeries…Ensuring Message Delivery from Queue to Target

Using MQSeries in DataStage as a source or target is very easy…..but ensuring delivery from queue to queue is a bit more tricky. Even more difficult is trying to ensure delivery from queue to database without dropping any messages…

The best way to do this is with an XA transaction, using a formal transaction coordinator, such as MQSeries itself. This is typically done with the Distributed Transaction Stage, which works with MQ to perform transactions across resources….deleting a message from the source queue, INSERTing a row to the target, and then committing the entire operation. This requires the most recent release of DataStage, and the right environment, releases, and configuration of MQSeries and a database that it supports for doing such XA activity….

So what happens if you don’t have the right release of any of these things, or are using an rdbms that is not supported for XA with MQ?

You can come “real close” and accomplish what you’ll need in most scenarios with the attached Job and text file with DDL. This .dsx defines a technique where messages are read from a source queue, then written to a target rdbms……if the INSERT works, messages are immediately removed from the source queue……but if it fails, removal is not performed, and the messages remain in the source queue. Careful thought, testing and review of recovery strategy is necessary, but this technique may be useful in a lot of situations.
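For anyone curious what that pattern looks like outside of DataStage, here is a hedged sketch in plain JMS and JDBC. It is not the attached Job (that one uses MQ and database stages), and the connection factory, queue name, table, and column below are made-up placeholders; the point is only the ordering of operations: receive in a transacted session, attempt the INSERT, and commit the session (which finally removes the message from the queue) only after the database commit succeeds. Otherwise roll back both and the message stays on the queue.

import javax.jms.*;
import java.sql.PreparedStatement;

public class QueueToTableSketch {
    // All names below are placeholders; substitute your own MQ connection factory,
    // queue, database connection, and target table.
    public static void drain(ConnectionFactory factory, String queueName,
                             java.sql.Connection db) throws Exception {
        db.setAutoCommit(false);
        javax.jms.Connection jmsConn = factory.createConnection();
        jmsConn.start();

        // Transacted session: the receive is not final until session.commit()
        Session session = jmsConn.createSession(true, Session.SESSION_TRANSACTED);
        MessageConsumer consumer = session.createConsumer(session.createQueue(queueName));
        PreparedStatement insert =
            db.prepareStatement("INSERT INTO MSG_TARGET (MSG_TEXT) VALUES (?)");

        Message msg;
        while ((msg = consumer.receive(1000)) != null) {   // short wait, then assume drained
            try {
                insert.setString(1, ((TextMessage) msg).getText());
                insert.executeUpdate();
                db.commit();        // the row is safe in the target...
                session.commit();   // ...so now remove the message from the queue
            } catch (Exception e) {
                db.rollback();      // INSERT (or commit) failed: keep the message queued
                session.rollback();
                throw e;            // surface the failure; recovery strategy is yours
            }
        }
        jmsConn.close();
    }
}

Note the same caveat as the Job: this is “real close”, not true XA. If the process dies between the database commit and the session commit, the message will be redelivered and the row inserted twice, which is exactly why careful review of your recovery strategy is necessary.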

Once again, I haven’t mastered the uploading on this site, and had to rename MQCommitTestv8.dsx and TestDDL.txt with pdf suffixes.   Just rename them and you will be fine.  They are both just simple ascii files.

Ernie

targetddl1 mqcommittestv8

…another good link on using XML and Web Services

…been meaning to put a link to the DeveloperWorks site and a very nice detailed piece that some of my IBM colleagues put together on XMLPack and Web Services Pack last year….  I haven’t had a chance to review every fine point, but I especially like the level of detail in their screen shots, which offer clarity to when/where/how DataStage uses XPath in its XML functionality.    Another good resource for review…  http://www.ibm.com/developerworks/db2/library/techarticle/dm-0703xiong/index.html

 -e

Incorporating Java classes into your DataStage Jobs

Java comes up a lot when we talk about “real time.”   Not that Java in particular has any special dibs on the term, but when a site is interested in things like Service Oriented Architecture (SOA), Web Services, messaging, and XML, it is often also interested in Java, J2EE, Application Servers, and other things related to Sun’s language standard.

Integrating Java with your ETL processing becomes the next logical discussion, whether or not “real time” even applies.   There may be some functionality, some existing algorithms worth re-using, or some remote Java-oriented or Java-managed system or message queue that contains valuable source data (or would be a valuable target) that you’d like to integrate into a data integration flow.   DataStage can easily be extended to include your Java functionality or take advantage of your Java experience.

There are two Stages, formerly referred to together as the JavaPack, that are included with DataStage:  JavaClient and JavaTransformer.   Both allow you to integrate the functionality of a Java class into the flow of a DataStage Job.   JavaClient is used for sources or targets (only an output link or only an input link), and JavaTransformer is used for row-by-row processing where you have something you’d like to invoke for each row that passes through.

DataStage provides a simple API for including Java classes in your Jobs.  This API allows your class to directly interact with the DataStage engine at run-time — to obtain metadata about the columns and links that exist in the currently executing job, and to read and write rows from and to those links when called upon to do so.   You define several special methods in your class, such as Process(), which the engine calls whenever it needs a row, or is giving your class control because it’s ready to give you a row.  Within that method you have various calls to make, such as readRow [from an input link] and writeRow [to an output link].    You can control what comes in and goes out, and also process rejections based on logic in your class.  Other than that, your class can do whatever it wants……read messages from JMS queues, invoke remote EJBs….whatever.
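To make that shape a bit more concrete, here is a rough skeleton. Fair warning: this is a hedged sketch, not lifted from the JavaPack documentation. The Stage and Row classes below are minimal stand-ins I defined just so the example compiles; the real base class, package, row accessors, and return constants come from the JavaPack docs and the attached example. Only the overall pattern (the engine driving a process-style method that calls readRow and writeRow) follows the description above.

import java.util.HashMap;
import java.util.Map;

// Stand-in for the JavaPack base class -- the real one has its own package,
// names, and signatures (see the JavaPack documentation).
abstract class Stage {
    static final int READY = 0;          // stand-in: "call process() again"
    static final int END_OF_DATA = 1;    // stand-in: "no more rows"
    abstract int process() throws Exception;
    Row readRow() { return null; }       // real version pulls a row from the input link
    void writeRow(Row r) { }             // real version pushes a row to the output link
}

// Stand-in for the JavaPack row object.
class Row {
    private final Map<String, Object> cols = new HashMap<>();
    Object getValue(String col) { return cols.get(col); }
    void setValue(String col, Object val) { cols.put(col, val); }
}

// The part you would actually write: one method the engine calls per row.
public class UpperCaseNames extends Stage {
    @Override
    int process() throws Exception {
        Row in = readRow();              // ask the engine for the next input row
        if (in == null) {
            return END_OF_DATA;          // nothing left on the input link
        }
        Row out = new Row();
        out.setValue("CUST_ID", in.getValue("CUST_ID"));
        out.setValue("CUST_NAME", String.valueOf(in.getValue("CUST_NAME")).toUpperCase());
        writeRow(out);                   // hand the transformed row to the output link
        return READY;                    // ready for another call
    }
}

In a real Job the column names and link layout would come from the Job design metadata rather than being hard-coded as they are here.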

The JavaPack is very well documented, with examples and descriptions of all the API calls.    However, I’ve included an additional example here for anyone who is interested, including java class, source, .dsx and usage notes.    Have fun!

-ernie

btw…I haven’t exactly figured out yet how to best get the names of the files below represented here on this blog, but if you save them from here, each file except the Readme begins with “ExamineRows” and should be ExamineRows.dsx (for the export), ExamineRows.java (for the Source) and ExamineRows.class for the actual compiled class.   I haven’t had a chance to re-try it after downloading from here, so worst case, you’ll need to recompile the class yourself in your environment.  Otherwise, it should run in v8 “as is”.  See the file at the Readme link for details on the expected classpath in the Job, etc., and read the annotations in the Job itself after you import it.  -e

Examine Rows Class, Examine Rows Java Source, Examine Rows Readme, Examine Rows DataStage Export

XML with embedded XPath in the content…

Wow.  I haven’t written in a while.    Been off in the land of metadata for the past few months, or heads down on some other projects.   No excuse — some juicy real-time/xml/java issues have been coming up lately and are piled on my desk waiting to be entered here.   I also have to finally figure out how to do attachments so that I can share some DataStage Jobs and other stuff.

 Ran into an interesting one today that I haven’t seen before, and I see a lot of XML issues.   Guess there’s always something new around the corner, and in reality I’ve only scratched the surface of what XML has to offer….and unless my teammates at IBM and/or our customers run into something, I’m not likely to see it.

 Anyway, this one is pretty cool.   I was given an XML document whose content contains references that point to “other” XML content in the same document.   In this case, it is a node with an attribute containing XPath that points to another node somewhere (presumably above) in the document.   This attribute is present if-and-only-if all the other sub-elements of the node are missing.  If so, the processing needs to locate the other node and retrieve the values found there in order to populate all the values of the node containing this “reference” attribute.  A sort of “recursive” lookup into the same XML document.

I haven’t entirely worked out the solution using DataStage yet, although the strategy is fairly clear.   First retrieve the node, check for the reference attribute, and if present, pull out the XPath.   This XPath then needs to be used further on in the flow to pull some “other” content from the same XML document.  Compare for nulls and then take the populated content before moving forward…  A bit tedious, but do-able, since DataStage lets me throw XPath around and dynamically create and use Stylesheets.
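Here is a small plain-Java sketch of that core lookup, outside of DataStage and the Stylesheet machinery, just to show the technique. The document shape, element names, and the "ref" attribute are all invented for illustration; the real document obviously looks different. The idea is simply: if the node carries the reference attribute, treat its value as XPath and evaluate it against the same document to fetch the “other” content.

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.*;
import java.io.ByteArrayInputStream;

public class ResolveInternalRefs {
    public static void main(String[] args) throws Exception {
        // Invented shape: part B has no sub-elements, only an XPath reference
        // pointing back at part A's color node.
        String xml =
            "<catalog>" +
            "  <part id='A'><color>blue</color></part>" +
            "  <part id='B' ref=\"/catalog/part[@id='A']/color\"/>" +
            "</catalog>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        XPath xpath = XPathFactory.newInstance().newXPath();

        NodeList parts = (NodeList) xpath.evaluate("/catalog/part", doc, XPathConstants.NODESET);
        for (int i = 0; i < parts.getLength(); i++) {
            Element part = (Element) parts.item(i);
            String color;
            if (part.hasAttribute("ref")) {
                // The "recursive" lookup: the XPath lives inside the document itself,
                // so evaluate it at run-time against the same document.
                color = xpath.evaluate(part.getAttribute("ref") + "/text()", doc);
            } else {
                color = xpath.evaluate("color/text()", part);
            }
            System.out.println("part " + part.getAttribute("id") + " -> color: " + color);
        }
    }
}

In the Job itself the same thing happens with XPath/XSLT inside the XML stages rather than in Java, but the pattern is identical: pull the reference attribute, then use its value as a second XPath against the same document.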

Anyone else encountered XML like this with internal references?

Ernie