DataStage and Minecraft (just for fun…)!

Hi Everyone!

Do your kids play Minecraft? Do you? Here is a “just for fun” recording of DataStage in a Minecraft world….. if you use DataStage and you and/or your family members play Minecraft, we hope you’ll enjoy this little adventure into the “world of Transformation”….. ; )

http://youtu.be/YFbLxbPuScA

Ernie

Manually Creating Shared Tables and Data Files

There are lots of ways to import Shared Tables from other tools and other kinds of metadata, but sometimes it’s helpful to just “do it yourself.” You may have home-grown spreadsheets or documents that contain your column metadata, or you may simply not have immediate access to the COBOL FDs, XSDs, ERwin models (or models from other popular modeling tools), or other standard forms of metadata that contain critical table and column information.

See this link for more information on Shared Tables and Data Files… What are Shared Table Definitions?

[Note: 9.1 update: as of release 9.1 this utility is formally part of istool, and is invoked as istool workbench generate.]

A new utility (courtesy of our creative engineers) has been built that supports the manual creation of Shared Tables and Data Files within the repository. This provides new flexibility for dynamically and programmatically creating these structures for data lineage and impact analysis purposes. The utility takes advantage of the istool archive format and syntax. Attached to this posting are three files: the documentation, a .jar file to be copied to your istool directory structure, and a sample import .csv with a variety of Shared Tables and Data Files. It works with istool in both the Server and Client side environments. Please see the documentation for further details…
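To give you a feel for what the input looks like, here is a minimal sketch of generating such a .csv programmatically. The column headers and their ordering below are placeholders I made up for illustration; the utility’s actual required layout is spelled out in the attached documentation, so follow that, not this sketch.

import csv

# Hypothetical rows describing two columns of one Shared Table.
# The real header names, ordering, and Data File variations are defined in
# the utility's documentation -- these are placeholders for illustration only.
rows = [
    {"host": "MYHOST", "database": "SALESDB", "schema": "DW",
     "table": "CUSTOMER", "column": "CUST_ID", "type": "INTEGER", "length": "10"},
    {"host": "MYHOST", "database": "SALESDB", "schema": "DW",
     "table": "CUSTOMER", "column": "CUST_NAME", "type": "VARCHAR", "length": "50"},
]

with open("shared_tables.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()      # one header row naming the metadata columns
    writer.writerows(rows)    # one row per column definition

However you produce it, the .csv is fed to the utility (or, in 9.1, to istool workbench generate), which produces an .isx archive that you then import with istool.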

Notes!

1) After download, remove the .jpg suffix from the .jar and .csv files! (The suffix is there just to satisfy this blog’s restrictions.)
2) I noticed that the .jar file had its dots (.) changed to dashes (-). Leave the underscore, but rename the file so that each dash becomes a dot. The actual jar file you copy into your subdirectory should be: com.ibm.istools.cli.generate_1.0.0.jar
3) USE/READ THE DOCUMENT. It illustrates the detailed syntax that you need.
4) I noticed on Linux that a ‘/’ was appended to the .isx file. I renamed the file before doing the “import” step and it worked fine. It seems to work fine “as is” on Windows.

Documentation
jar file [remove ".jpg" after download]
Sample .csv file with detailed table metadata [remove ".jpg" after download]

What are Shared Table Definitions?

Information Server 8.x introduced a concept called a “Shared Table Definition.” This is a “container” for Table and Column metadata. It holds the Column names and other critical metadata (sql data type, length, etc.) for your relational tables. We call it a “Shared Table” because it is automatically shared among all of the tooling within Information Server and Foundation Tools.

A Shared Table is immediately visible (and usable) in Information Analyzer, FastTrack, Metadata Workbench and Business Glossary. It is also available for DataStage and QualityStage via the Metadata Sharing option within the DataStage Designer. Shared Tables are “like” DataStage Table Definitions, but with a broader, more enterprise scope. DataStage Table Definitions are still supported as always, but are unique to the DataStage environment. Inside of DataStage, a Shared Table appears in the Repository listing with a little white “book” icon above it. I always call it the “dove” icon because it looks to me like a bird. :)

Shared Tables often reflect a “real”, physical table. They have a multipart name that includes Host, Database, Schema and Table names, and finally Column names (for example, a column might be fully qualified as MYHOST.SALESDB.DW.CUSTOMER.CUST_ID). Shared Tables can be created in a variety of ways:

– Importing tabular metadata via a “Connector” inside of FastTrack, Information Analyzer, or the DataStage Connector Wizard (under import…table defs)
– Importing tabular metadata via the Import/Export Manager using Bridges and/or Brokers
– Via the “Create Shared Table from Table Definition” option from within DataStage
– Via a new utility that allows the creation of Shared Tables and Data Files (also known as Shared Files) from .csv [saving that subject for another blog entry]
– Via export of physical models from InfoSphere Data Architect (which could have imported the model(s) from other design tools)
– Via Optim export from InfoSphere Discovery

Once they are created, you can see them in the “tree” views of FastTrack, IA, and Workbench, and you can also delete them in most of those Information Server clients. Each client is looking at the exact same copy of the metadata, so there is absolute consistency between the views, regardless of which client you are using or which role you play in the organization.

These Shared Tables drive the integration activity, or the review of such activity, throughout Information Server, and they also serve as an important “anchor” for concepts that help enable strong information governance. Shared Tables retain associated profiling information (domain population details, data quality assessment history, etc.) from InfoSphere Discovery and Information Analyzer. A Shared Table can be assigned a Data Steward: a subject matter expert who is most familiar with a particular table and its value to the organization. Shared Tables and their Columns can be assigned to a Business Term, or to multiple Business Terms, to further define them and qualify their purpose and meaning to the enterprise. Perhaps most important, Shared Tables are the key sources and targets for Data Lineage reporting and analysis. Shared Tables are “connected” for data lineage purposes via DataStage Jobs or via Extension Mappings that represent processes external to Information Server and Foundation Tools.

Related to a Shared Table is a “Data File”. A Data File is also a container for Fields (or Columns), but it is associated with a sequential or flat file instead of an RDBMS. A Data File has similar characteristics, and participates in data lineage and metadata enhancement (Stewards, Terms, etc.) just like a Shared Table, but it has a different icon when viewed within the Metadata Workbench. The naming convention for a Data File is Host, Data File name, and Data File Structure, and then Fields. Today, Data Files are used only by DataStage, and are created from there using “Create Shared Table from Table Definition” as noted above. The difference is in the “Locator” tab of the original DataStage Table Definition: if it says “sequential”, then a Data File will be created instead of a Shared Table. However, a new utility just became available that allows Data Files to be created at the command line. I will be writing another entry in the blog to describe this new utility, which provides new flexibility for creating these useful objects.

Ernie

Data Lineage vs Business Lineage

Hi all…

Thought it would be a good idea to go into some detail about Data Lineage and Business Lineage.

Data Lineage is a more “technical”, detailed lineage from sources to targets that includes ETL Jobs, FTP processes, and detailed column-level flow activity. It should trace everything from source to target, and be flexible enough to encompass legacy sources, all RDBMSs, abstract sources and targets like message queues and real-time transactions, and every sort of transformation and movement process that touches the data along the way.

Business Lineage is a summary of that lineage — showing primarily only the sources, targets and reporting assets. I think of those assets as sort of “floating” to the top for display in Business Lineage (the same lineage is created under the covers). Business Lineage is often more useful to business members of your teams who aren’t interested in all of the gory detail at the lower level, but want to see a few critical business rules along with the ultimate sources for their reports and portal displays.

With Information Server and Foundation Tools, Lineage between sources and targets (Shared Tables) is established in a variety of ways:

a) via DataStage Jobs that illustrate the transformation of data from source(s) to target(s).

b) via parsing of the SQL in RDBMS views, or, as an extension of the above, when custom SQL is used in a Job.

c) via Extension Mappings, an 8.1.1 feature of Metadata Workbench (June 2009) that supports illustrating lineage between any source and target.

DataStage Jobs are just “there”. As a developer builds and edits a Job, lineage just “happens”, either as a matter of course as you follow the links through a complex Job, or via the algorithms that Metadata Workbench uses to parse through Jobs and find their relationships (i.e., JobA writes to a target that becomes the source for JobB). Views are picked up during import from an RDBMS catalog. The metadata for a view is contained in the SQL SELECT that defines it; this SQL is parsed by Metadata Workbench to determine the actual base tables used. Extension Mappings allow you to be creative and define anything you want. This is important for illustrating any type of source-to-target relationship, and can represent anything from a COBOL program to a shell script, or nothing at all; it may simply be important to illustrate a from/to relationship between two or more objects in your enterprise.
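To make the view-parsing idea a bit more concrete, here is a toy sketch of what “finding the base tables behind a view” means. Metadata Workbench uses a real SQL parser for this, so treat the snippet purely as an illustration of the concept; the view, the tables, and the deliberately naive regular expression are all made up for the example.

import re

# A toy illustration only: pull candidate base-table names out of the SELECT
# that defines a view. Metadata Workbench uses a real SQL parser for this;
# a regex like the one below is just a sketch of the concept.
view_sql = """
CREATE VIEW DW.V_CUSTOMER_ORDERS AS
SELECT c.CUST_ID, c.CUST_NAME, o.ORDER_ID, o.ORDER_DATE
FROM DW.CUSTOMER c
JOIN DW.ORDERS o ON o.CUST_ID = c.CUST_ID
"""

base_tables = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", view_sql, flags=re.IGNORECASE)
print(base_tables)   # ['DW.CUSTOMER', 'DW.ORDERS'] -- the tables the view depends on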

Business Lineage was an 8.1.2 capability (December 2009) that expands on the lineage concept by displaying only the sources and targets and reporting resources. The DataStage Jobs or Extension Mappings are “just under the surface”, but their detail, which could be extensive, is suppressed by the Business Lineage display. This makes consumption of those screens for lineage review simpler for certain categories of users.

Ernie

Guidelines for publishing services with DataStage and QualityStage…..

There are a variety of issues to consider when publishing a DataStage or QualityStage Job as a Web Service. One of these is end-of-wave, and I have an entry on my blog for that (I’ll find the link later, but I think it’s in the table of contents)… Another is ensuring that ISD is “the” only driving link for the transformation…

Ensuring that WISD Input is the driving stream for the Job flow

For “always on” Jobs, the flow of rows through the Job must be driven exclusively by the primary input Stage that is responsible for the “always on” condition. NO path through an “always on” Job should have its own independent origin. This condition is most notably encountered in Enterprise Edition Jobs that use Stages that support (by design) two or more equal inputs (Reference Match, Join, Merge, and Funnel are typical illustrations of these Stage types).

This is not always an easy construct to recognize, especially when converting very lengthy and complex Jobs into WISD. Let’s look at some examples.

In this first example we see a Job using the Funnel Stage. Two links, or two “flows” from “somewhere” upstream in the Job, arrive at the Funnel. If we consider the rule of having WISD “drive” the flow, then this Job design would be invalid. It is imperative that both links have their origin at the WISD Stage. It is not possible to have an ISD Job where one path (like the bottom path below, using link01) starts at WISD, and the other path (link02 on top) begins at a Sequential file or RDBMS:

While it could be debated what the behavior of the engine should be when it encounters this construct in an “always on” Job, it’s easy to appreciate the ambiguity. In a simple batch job, rows from each path to the Funnel are intermixed. They can be sorted together, pushed through “as they arrive”, or grouped (finish one path and then get the other). What happens in a real-time scenario when one single row or set of rows comes from a remote client, followed by end-of-wave? Should the entire set of relational rows be sent down link02, followed by the current WISD row on link01? Should only one Sequential or RDBMS row be sorted into the row or rows from the WISD input (saving the “next” link02 row for the “next” WISD request?) …and then what happens for the subsequent WISD request? Should end-of-wave result in a complete re-running of all the Stages that feed the Funnel from the link02 path? What if the ultimate source of that path contains 12 million rows? Do we want to wait for that amount of processing for each request that arrives at the “always on” Job?

This is a potentially confusing area, not only for the DataStage/QualityStage Developer, but for the engine as well. Best that it be avoided. Incorrect and/or invalid responses will be the result, if not complete failure or hanging of requests. A common symptom is that the first request works and the second request returns nothing, or the same payload as the first.

Does this mean you cannot use a Funnel with ISD? Of course not! There are many reasons for using a Funnel, the most common being the need to have independent paths of logic for data in a single request. The screen shot below illustrates a Job using the Funnel Stage, where all paths to the Funnel are correctly “driven” by the request that flows from the WISDInput Stage:

As noted in an earlier section, the same circumstances can occur when using the Join Stage. Data on the “independent” path will not be refreshed, and results will be invalid, inconsistent, or yield errors. The alternative is to use a Lookup Stage instead. However, as with the Funnel, the Join is allowed in an “always on” Job, provided all incoming paths are driven by the WISDInput Stage.

Incorrect use of Join Stage in a WISD Job. Use a Lookup instead:

Acceptable use of Join Stage in an “always on” WISD Job:

(A recently discovered variant of the Job pattern immediately above is also problematic when used under WISD. This is a Job flow where the “Join” above is replaced by a Lookup, and link “l3” is being used for the “reference” side of the Lookup. Such a Job needs to be designed using a Join instead of expecting one of the WISD feeds to provide the “lookup table” values).

You can expect this behavior to be exhibited by Jobs that use any of the following Stages: Difference, Compare, Merge, Join and QualityStage Reference Match.

Special Considerations for QualityStage

QualityStage Jobs developed for use under WISD need to be particularly conscious of the concept described in the previous section, especially when feeding data into a Reference Match. An often-desired pattern is to feed the incoming service-based data as the primary link for the Reference Match, and then bring in reference data from a fixed data source. This violates the rules described above. Instead of having a fixed data source attached to the reference link, perform a Lookup based upon the high-level blocking factors in the incoming WISD request. This involves using a Copy Stage that splits the incoming row, sending one row to the primary input as before while sending the other row to a Lookup where multiple reference candidates can be dynamically retrieved. Be sure to make this a sparse lookup if you expect that the source data could change while the WISD Job is “on” (enabled), and check it carefully to be sure you’ve set it to return multiple rows.

Happy “servicing…” :)

Ernie

Data Lineage for “other” ETL tools and external processes

Hi all…

Recently I have had the pleasure of working with a couple of consulting firms who have been very creative with Metadata Workbench and the “Extension” features (Extension Mappings and Extended Data Sources) that were introduced last year in release 8.1.1. These companies have deep expertise on Information Server and DataStage as well as other ETL tools, and they have built tooling that automatically generates Extensions.

This is just a technical heads-up. No bias here, nor am I doing their marketing, but it’s important that everyone knows that building lineage in Workbench for non-DataStage processes is very doable. All it takes is some creativity and awareness of the shell, 3GL program, or other ETL tool or process you are trying to illustrate. We’ve had some entries in the forum on this subject, and some of you have asked me questions about it offline, so I wanted to share these other resources.

Here are their sites if you are interested.

Accantec (http://www.accantec.com): Cognos Data Manager. Accantec is based in Germany and has deep Cognos expertise. They have built tooling to illustrate Cognos Data Manager processes in Metadata Workbench for complete lineage.

Compact (http://www.compactBI.com): expertise in Informatica, AbI, and others. They too have a solution that represents these tools within Metadata Workbench, often in combination with DataStage.

Ernie

Online XML Document Validation

Formal XML document validation (based on an XML Schema Definition) is often confusing. The schema says one thing, and the document instance says another, but the tool you are using doesn’t seem to interpret things correctly (?!@#?).

Sometimes the problem is in the interpretation of “invalid”, and deciphering the XML standard and how vendors implement it will make your head spin. When you have time, go through the W3C site (www.w3c.org). It’s a great site, but you will need to spend lots of time there to understand all the nuances of XML and its parsing.

What is probably more important is that you know all the ways you can quickly test validation, to better understand your documents and the situation you are currently dealing with. I usually just google “xml validation” to find the myriad of sites available, but I found one this morning that has a really elegant interface and thought I’d mention it here.

Like many online validation sites, quick validation is offered for free as an attraction to come to the site and see what other XML tools are available. I will certainly be browsing this site further and looking at their other tools because their validation tooling was so easy to use:

http://tools.decisionsoft.com/schemaValidate/

When you go there you will see a dialog that lets you browse for a local document and then also browse for its XSD. You then get formal XSD validation results that tell you whether the document is valid, and show you any errors produced by the parser.

Keep this URL handy for when you need to quickly check a document against its XSD and better understand the behavior of the tool you are using to manipulate it.
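And if you ever need to run the same kind of check locally (say, on a server with no internet access), a few lines of Python with the lxml library will do formal XSD validation as well. This is just a sketch; it assumes you have lxml installed, and the file names are placeholders for your own document and schema.

from lxml import etree

# Validate a local XML document against its XSD, similar to what the online
# validator does. File names here are placeholders -- substitute your own.
schema = etree.XMLSchema(etree.parse("order.xsd"))
doc = etree.parse("order.xml")

if schema.validate(doc):
    print("Document is valid against the schema.")
else:
    # error_log lists each violation with its line number in the document
    for error in schema.error_log:
        print(f"Line {error.line}: {error.message}")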

Ernie
