Lost in translation?

editor’s note:  It gives me great pleasure to introduce Beate Porst, a good friend and colleague, who is the Offering Manager for DataStage and other parts of the Information Server platform.  Beate will be sharing her insights into Unified Governance and Integration, based on many years of experience with this platform and the issues surrounding data transformation and management.  Today she introduces some of the key new capabilities of Information Server v11.7.  Please welcome Beate to dsrealtime!   –ernie

How IBM Information Server v11.7 could have saved NASA’s 125-million-dollar Mars orbiter from being lost.

We all know the slogan: Measure twice, cut once. But what if we measure twice and still don’t know the context of our data?

That is what happened to NASA in 1999. The numbers themselves were right: the 125-million-dollar Mars orbiter was designed to use the metric system, but mission control performed course corrections using the imperial system. The resulting trajectory brought the orbiter in at too low an altitude, and contact with it was lost. An embarrassing moment for NASA.

But it wasn’t the only such incident. In 2003, German and Swiss engineers began building a bridge over the river Rhine at the border town of Laufenburg. Each country built from its own side, with the goal of meeting in the middle. That was the plan, anyway. Engineers on both sides used “sea level” as their reference point. The problem is that sea level in Germany is based on the North Sea, while in Switzerland it is based on the Mediterranean, a difference of 27 cm. The builders in Germany knew about the difference, but apparently not whether to add or subtract it from their base. They made the wrong choice.


Historical documents show that using out-of-context, incomplete, or inaccurate data has caused problems ever since mankind started to develop different units of measurement.

So how can you avoid costly incidents such as these and successfully conquer your data problems, and how can IBM Information Server help you on that journey?

Whether you want to build a bridge, send an orbiter to Mars, or simply try to identify new markets, you will only be as good as the data you use. This means it must be complete, in context, trusted, and easily accessible in order to drive insights. As if this weren’t challenging enough, your competitiveness also depends on your organization’s ability to quickly adapt to changing conditions.

For more than a decade, IBM InfoSphere Information Server has been one of the market-leading platforms for data integration and governance. Users have relied on its powerful and scalable integration, quality and governance capabilities to deliver trusted information to their mission critical business initiatives.

John Muir once wrote: “The power of imagination makes us infinite”.  We have applied our power of imagination to once again reinvent the Information Server platform.

As business agility depends on the flexibility, autonomy, competency, and productiveness of the tools that power your business, we have infused Information Server’s newest release with a number of game-changing inventions. These include deeper insight into the context of and relationships amongst your data, increased automation so your users can complete their work faster and more safely, and more flexible workload placement for higher resource optimization. All of these are aimed at making your business more successful when tackling your most challenging data problems.

Let’s look at four of those game-changing inventions and how they are going to help your business:

  1. Contextual Search: Out-of-context data was the leading cause of error for NASA’s failed mission. The new contextual search feature, called Enterprise Search, provides your users with the context to avoid such costly mistakes. It greatly simplifies and accelerates the understanding, integration, and governance of enterprise data. Users can visually search, explore, and easily gain insights through an enriched search experience powered by a knowledge graph. The graph provides context, insight, and visibility across enterprise information, giving you a much better understanding and awareness of how data is related, linked, and used.
  2. Cognitive Design: Getting trusted data to your end users quickly is an imperative. This process starts with your integration design environment. To help address your data integration, transformation or curation needs quickly, Information Server V11.7 now includes a brand new versatile designer, called DataStage™ Flow Designer. It features an intuitive, modern, and secure interface accessible to all users through a no-install, browser-based experience, accelerating your users’ productivity through automatic schema propagation, highlighted design errors, powerful type ahead search as well as full backwards compatibility to the desktop version of the DataStage™ Designer.
  3. Hybrid Execution: Data Warehouse optimization is one of the leading use cases for addressing growing data volumes while simplifying and accelerating data analytics. Once again, Information Server V11.7 has strengthened its ability to run on Hadoop with a set of novel features to more efficiently operationalize your Data Lake environment. Amongst those is an industry-unique hybrid execution feature which lets you balance integration workloads across Hadoop and non-Hadoop environments, aimed at minimizing data movement and optimizing your integration resources.
  4. Automation powered by machine learning: Poor data quality is known to cost businesses millions of dollars each year. The inadvertent use of different units of measurement for the Mars orbiter was ultimately a data quality problem. However, the high degree of manual work combined with exponential data growth continues to be an inhibitor for businesses trying to maintain high data quality. To counter this, Information Server V11.7 further automates the data quality process by underpinning data discovery and classification with machine learning, so that you can spend your time focusing on your business goals. The two innovative aspects are:

Automation rules, which let business users define graphical rules that automatically apply data rule definitions and quality dimensions to data sets based on business term assignments, and

One-click automated discovery, which enables discovery and analysis of all data from a connection in one click, providing easy and fast analysis of hundreds or thousands of data sets.

Don’t want to get lost in translation? Choose IBM Information Server V11.7 for your next data project.


…another way to load Terms into InfoSphere Business Glossary


Here are a few other Jobs for loading new Terms and Categories into Business Glossary. Like the earlier post on Business Glossary, these DataStage Jobs read a potential source of Terms (just alter the source stage as needed) and then create a target csv file in the correct format for loading into Business Glossary, using the new 8.1 csv import/export features available at the Information Server Web Console… Glossary tab. The Jobs are fairly well annotated and should be self-explanatory. I haven’t yet set them up for Custom Attributes, nor have they been widely tested, but they are already being implemented at a variety of locations. Please let me know if you find them useful.
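If you just want to prototype the csv generation outside of DataStage, the same idea fits in a few lines of Python. A word of caution: the header row below is only a placeholder — take the real column names from a csv export of your own Glossary at the Web Console, since I’m not reproducing the exact 8.1 import format here.

```python
import csv

# Hypothetical source of terms: (category, term name, description).
# In the DataStage Jobs this is whatever source stage you plug in.
terms = [
    ("Finance", "Net Revenue", "Revenue after returns and discounts."),
    ("Finance", "Gross Margin", "Net revenue minus cost of goods sold."),
]

# Placeholder header row -- confirm the real column names against a csv
# *export* from your own Information Server Web Console before importing.
with open("bg_terms.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Category", "Term", "Description"])
    for category, name, description in terms:
        writer.writerow([category, name, description])
```

The csv module takes care of quoting any descriptions that contain commas, which is the usual thing that breaks hand-rolled glossary loads.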

Ernie

(the one with “XML” in the name is the same as the prior blog entry. Each is named with a .doc suffix, but is actually a .dsx file).

Handling Nulls in XML Sources

Reading XML is very straightforward when everything is present, all the elements are populated, and the entire XML “tree” exists in the current document that you are reading. However, when elements are missing from the document entirely, are “empty” (by XML definition), or present but formally indicated as null, things can get tricky. Nulls can be confusing enough as it is! I’ll try here to clarify the null issues you should be considering in your ETL processes, and outline specifically how such things are managed in DataStage when you use XMLInput…

For starters, let’s leave the “Repeating Element” (the one you mark as “key”) out of this blog entry. It has its own important considerations. We’ll concentrate for now on standard elements that may or may not appear in your document, may be empty, or may be formally marked as null (the element and its tags exist, but it is empty, with attribute xsi:nil="true"). Let’s look at some examples.

Imagine the details of a document about products. It might have an element called “color.” Color could have a value:

<product>
<number>123</number>
<color>blue</color>
<location>A</location>
.
.
.
</product>

 

Color could be “empty”:

 

<product>
<number>123</number>
<color></color> [or <color/> ]
<location>A</location>
.
.
.
</product>

 

Color could be formally null:

 

<product>
<number>123</number>
<color xsi:nil="true"></color>
<location>A</location>
.
.
.
</product>

 

…or be entirely missing (for a given product):

 

<product>
<number>123</number>
<location>A</location>
.
.
.
</product>

 

What constitutes a null value? When is it blank? When your tool reads and parses this XML document, what do you want it to do?

DataStage gives us several options. There are two check boxes in the Transformation tab of the Output Link. By default, these boxes are unchecked:

Replace NULLs with empty values [blanks]
Replace empty values with NULLs

Without doing anything, empty elements will be treated as blanks, and null indicated or missing elements will be treated as null.

If you only check “Replace NULLs with empty values,” nothing will be null anymore….

…and if you only check “Replace empty values with NULLs,” more of your rows will have nulls for this column, because anything that was blank in the default scenario will now be null.

Finally, if you check both boxes, only the empty elements will be retrieved as null, because the missing elements and the formally null ones will be treated as blank.

Clear as mud, I know. 😉
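If it helps, the matrix above can be modeled in a few lines of Python. This is purely an illustration of the behavior described above, not anything DataStage executes:

```python
# Models the four element states and the two XMLInput check boxes.
# Illustration only -- the real behavior lives inside the stage.

def resolve(state, replace_nulls_with_empty=False, replace_empty_with_nulls=False):
    """state is one of: 'value', 'empty', 'nil', 'missing'.
    Returns the column value: a string, '' for blank, or None for null."""
    if state == "value":
        return "blue"
    # Default mapping: empty element -> blank; nil or missing -> null.
    was_empty = (state == "empty")   # <color></color> or <color/>
    if was_empty:
        return None if replace_empty_with_nulls else ""
    else:                            # xsi:nil="true" or element missing
        return "" if replace_nulls_with_empty else None
```

Note that each check box acts on the element’s original state, which is why checking both doesn’t cancel out: the empties become null while the nulls and missing elements become blank.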

Last week I worked with a customer who wanted to separate the xsi:nil="true" elements from the ones that were “missing”. They wanted the xsi:nil="true" elements to become null columns, and the ones that were truly missing to be assigned an actual string value of “MISSING”. I’ll follow this up with an entry on using XSLT to perform subtle column assignments like this one.

-e



MQSeries…Ensuring Message Delivery from Queue to Target

Using MQSeries in DataStage as a source or target is very easy… but ensuring delivery from queue to queue is a bit more tricky. Even more difficult is trying to ensure delivery from queue to database without dropping any messages…

The best way to do this is with an XA transaction, using a formal transaction coordinator such as MQSeries itself. This is typically done with the Distributed Transaction Stage, which works with MQ to perform transactions across resources… deleting a message from the source queue, INSERTing a row into the target, and then committing the entire operation. This requires the most recent release of DataStage, and the right environment, releases, and configuration of MQSeries and a database that it supports for doing such XA activity…

So what happens if you don’t have the right release of any of these things, or are using an rdbms that is not supported for XA with MQ?

You can come “real close” and accomplish what you’ll need in most scenarios with the attached Job and text file with DDL. This .dsx defines a technique where messages are read from a source queue, then written to a target rdbms… if the INSERT works, messages are immediately removed from the source queue, but if it fails, the removal is not performed and the messages remain in the source queue. Careful thought, testing, and review of your recovery strategy are necessary, but this technique may be useful in a lot of situations.
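The choreography the Job relies on can be sketched generically. The snippet below is not MQ or DataStage code — it stands in a Python list for the queue and sqlite3 for the target rdbms, purely to show the order of operations: look at the message, attempt the INSERT, and only destructively remove the message from the queue once the INSERT has committed.

```python
import sqlite3

def drain_queue(queue, conn):
    """Move messages from `queue` (a list standing in for an MQ queue) into
    the database, removing each message only AFTER its INSERT commits.
    Messages whose INSERT fails are left on the queue for recovery."""
    cur = conn.cursor()
    failed = []
    while queue:
        msg = queue[0]                   # browse: look, don't destroy yet
        try:
            cur.execute("INSERT INTO target(id, body) VALUES (?, ?)", msg)
            conn.commit()                # the INSERT is durable...
            queue.pop(0)                 # ...so now it's safe to remove it
        except sqlite3.IntegrityError:
            conn.rollback()              # INSERT failed: keep the message
            failed.append(queue.pop(0))  # set aside so we don't loop forever
    queue.extend(failed)                 # failed messages remain queued

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, body TEXT)")
queue = [(1, "hello"), (1, "duplicate id, INSERT fails"), (2, "world")]
drain_queue(queue, conn)
```

The window to think about is a crash between the commit and the removal: the message would be redelivered and INSERTed twice, which is why the real recovery strategy (and a duplicate-tolerant target key, as here) deserves the careful review mentioned above.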

Once again, I haven’t mastered the uploading on this site, and had to rename MQCommitTestv8.dsx and TestDDL.txt with pdf suffixes. Just rename them and you will be fine. They are both simple ASCII files.

Ernie
