Real-Time Information Governance and Data Integration

  • What’s this Blog about?

    Thoughts and techniques concerning all things data and data integration, especially “data lineage” and how and why it needs to be tracked and monitored.


Are Your Kids Addicted to Minecraft?

October 2, 2012 — dsrealtime

Are your kids playing Minecraft yet? ….or should I say “Are your kids addicted to Minecraft?” This creative game by Mojang (www.minecraft.net) puts your imagination and creativity to work building your own “world” out of square blocks that you construct, or dig and destroy, to make tunnels, walls, caves, houses, mountains and more. The square blocks themselves represent many elements (stone, iron, coal, etc.) and, in combination, lead you to the discovery or building of new things that you need to expand your empire. You find blocks of iron ore as you are “mining” (basically blowing up blocks of earth as you create new paths), and in combination with the coal that fuels a furnace, you can obtain iron and gold tools. Of course, to keep things interesting there are monsters and spiders that can kill you as the game turns from day to night, so you need to construct various shelters in order to survive.

I may have some of the facts incorrect (I have only watched the game, not played it), but I am amazed by the phenomenon and how “catching” it has been for 10 to 13 year olds around the world. Nearly every parent I speak to has to remind their children to put down the iPad, iPod touch, or iPhone when they’ve played Minecraft for too many hours, or to turn off the Xbox or computer running another version of the game!

What makes this game so addictive? My only reference point is our own creation of mini-worlds when we were kids. What kept us outside for hours with our toys, designing private and public spaces and sharing them with our friends? For me it was Matchbox cars and trucks and Hot Wheels, with an occasional GI Joe, platoon of plastic soldiers, or Tonka truck mixed in. My best friend and I would build elaborate “villages” in and around the garden, with pachysandra posing as giant palm trees. We used sticks and pebbles to mark parking lots, driveways, and highways, and a small patch of sand at the other end of the garden served as our “remote gravel pit.” This was our own “mining” operation, where our special mission trucks would go on “expeditions.” Mojang has clearly tapped into that same experience, only now it is on a touch screen, able to be played anywhere and at any time, and shared over the web between children who teach each other new things and proudly show off their new designs. Video games exist with far more spectacular graphics and intricate plot lines, but bravo to Mojang for delivering a platform that inspires the imagination with basic simplicity while allowing for an infinite array of unique and challenging experiences.

Why write about this on “this” blog?

Watch some youngsters playing this game for a while. It won’t be long before you are amazed at the speed at which they build/destroy/re-build/tear down and continue to evolve their world, all the while looking out for dangerous spiders or avoiding “Creepers” that will blow up and kill their game character if they step too near. Players make quick decisions about where to dig, what to build, or whether to leave a cave without knowing if it is nighttime (players learn early that monsters come out at night). How do they achieve this speed? Practice, of course, but also collaboration with their peers. They can play on the same network and within each other’s worlds. They learn a whole new vocabulary and continually learn from others where to go (within their own groups or on the web). Some will accuse me of making a leap here, but purely for fun, this is governance in action. New terminology is shared by everyone in the Minecraft community (do you know what “Creepers” are, or how to get Glowstone and Blaze Rods out of “The Nether”?), helped along by Stewards (check out YouTube — there are hundreds of tutorials and videos out there from experienced “guides”) and metadata galore, as players manage “Chests” full of artifacts collected and made, along with accurate counts of their inventory. Lineage is a bit of a stretch and not a concept you can directly apply to Minecraft, but there is a cottage industry of recording software that will let you create videos of a trip through your world or a fight with monsters.

If you are still reading this blog entry and don’t know anything about metadata, I hope you enjoy watching or playing Minecraft with your kids, providing it doesn’t push every other important activity out of the way! …and if you are into governance, I hope you had fun with the analogy and enjoyed a brief respite from metadata and governance in our technical realm. 🙂

Posted in Information Governance. Tags: creeper, governance, metadata, minecraft, RealTime.

Linking DataStage Jobs Together

September 30, 2010 — dsrealtime

Once you have mastered the “navigation” and asset selection options of Data Lineage reporting, it’s time to look at how DataStage Jobs are automatically linked together. By now you should be comfortable with thinking about your “starting position” for a data lineage report — your initial “perspective” if you will (what object are you standing on when you begin). You should also be comfortable with thinking about the “direction” for your Data Lineage investigation — are you looking “upstream” for “Where did this come from?” or downstream for “Where does this go to?” [In 8.7 or higher, you no longer have to request “upstream” or “downstream”, but you should still give it some thought so that you have an idea of what you will be expecting or what you are looking for].

If you need a refresher on the basics, please see Getting Started with Data Lineage!.
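Before we look at how Jobs get linked, it may help to picture “perspective” and “direction” in the abstract. The Workbench keeps all of this in its own repository; the sketch below is just a conceptual illustration in Python, with made-up asset names and flow edges that have nothing to do with the product’s internals:

```python
# Conceptual sketch only: "perspective" (where you are standing) and
# "direction" (upstream vs. downstream) in a lineage graph.
# All asset names and edges are made up for illustration.

# Directed edges: data flows from the first asset to the second.
FLOWS = [
    ("MAINFRAME_EXTRACT.csv", "Job1.SequentialFile_In"),
    ("Job1.SequentialFile_In", "Job1.ODBC_Out"),
    ("Job1.ODBC_Out", "STAGING.ORDERS"),
    ("STAGING.ORDERS", "Job2.ODBC_In"),
    ("Job2.ODBC_In", "Job2.ODBC_Out"),
    ("Job2.ODBC_Out", "DATAMART.ORDER_FACT"),
]

def trace(start, direction="upstream"):
    """Walk the flow graph from 'start': upstream answers
    'Where did this come from?', downstream answers 'Where does this go to?'"""
    found, frontier = [], [start]
    while frontier:
        current = frontier.pop()
        for src, tgt in FLOWS:
            if direction == "upstream" and tgt == current:
                nxt = src
            elif direction == "downstream" and src == current:
                nxt = tgt
            else:
                continue
            if nxt not in found:
                found.append(nxt)
                frontier.append(nxt)
    return found

# "Standing on" the datamart table and asking "Where did this come from?"
print(trace("DATAMART.ORDER_FACT", "upstream"))
# "Standing on" the mainframe extract and asking "Where does this go to?"
print(trace("MAINFRAME_EXTRACT.csv", "downstream"))
```

Asking “Where did this come from?” is simply a walk against the arrows from the object you are standing on; “Where does this go to?” is a walk with them.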

A typical production site for DataStage/QualityStage has MANY Jobs — hundreds perhaps….even thousands. All integrated and working together to transform your data and move it from one place to another. Sometimes they are written by one very hard-working developer, who might have all the lineage in his or her head, but more often it is a larger-scale endeavor, with lots of team members, often scattered around the globe, with varied skill sets, and possibly working on related albeit independent solutions. They may know each other, or may not. How are the DataStage Jobs sequenced from a data flow perspective? How does data flow between a Job developed to process data received via FTP from the mainframe and then ultimately to a datamart that supports a reporting system? How does one Job connect to another? Sometimes it may be one giant Job, but it’s not likely. Intermediate temporary tables are often created for everything from checkpoints to Operational Data Stores to “parking lots” where data can be restructured or delivered to another application along the way. Workbench can sort this out and provide you with lineage through all these Jobs.

Inter-Job Data Lineage (Data Lineage between Jobs) is largely automatic. You simply have to pay attention to a few “good sense” leading practices and understand the pattern. Note that this has “nothing” to do with Shared Tables or Table Definitions at all… it’s entirely done by merely parsing through your Jobs [this is a key reason why you can get immediate insight on 7.x Jobs that are imported into the 8.x environment — even if you haven’t compiled a single one or started your formal testing and QA process!]

Automated Services is the “parsing” step at the Advanced Tab….when you say “run” with your Project(s) checked, Workbench combs through your Jobs, looking for similarities that will link Jobs together end-to-end. [Note: If you are using 8.7 or higher, this process is called “Detect Associations” …find it by first clicking on “Manage Lineage” at the Advanced Tab, and then clicking on the larger “arrow” icon on the right after selecting one of your DataStage Projects.] Here’s what it looks for among Jobs (a rough conceptual sketch follows the list):

a) Common or “like” Stages between the Target of one Job and the Source of another. Two ODBC Stages match, but so do ODBC and, say, Oracle OCI. Or DB2Load and DB2Connector. Or two Sequential Stages. Or two Dataset Stages.

b) At least one column in common. [Update Note… the research leading to this post was done originally in early 8.1 … I’ve since discovered in 8.5 and higher that the connection is more flexible now, and one column name in common is no longer required. The default values and Stage type are the major controlling factors in bringing Jobs together.]

c) Same hard-coded values (yuk…who does that?) OR….same “default” values for Job Parameters for the critical common properties. For RDBMS-type Stages, it’s ServerName, Schema, and Tablename. For Sequential-type Stages, it’s the filename. Automated Services will combine multiple Job Parameter default values if needed.
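To make the pattern concrete, here is a rough sketch (again in Python, with hypothetical Job and Stage definitions, not the product’s actual logic) of the kind of matching that “stitching” performs between the target Stage of one Job and the source Stage of another:

```python
# Rough sketch of the "stitching" idea: compare the target Stage of one Job
# with the source Stage of another, and link the Jobs when the Stage types
# are compatible and their resolved values (Job Parameter defaults or
# hard-coded literals) point at the same table or file.
# All Stage types, property names, and Jobs below are hypothetical.
# (Rule "b" -- a column name in common -- was only needed in older releases
# and is omitted here.)

COMPATIBLE = [
    {"ODBC", "OracleOCI", "DB2Connector", "DB2Load"},  # RDBMS-style Stages
    {"Sequential"},                                    # flat files
    {"DataSet"},                                       # DataStage datasets
]

def compatible(type_a, type_b):
    return any(type_a in group and type_b in group for group in COMPATIBLE)

def same_resource(a, b):
    """True when two Stages resolve to the same table (server/schema/table)
    or the same file (filename)."""
    keys = ("server", "schema", "table") if "table" in a else ("filename",)
    return all(a.get(k) == b.get(k) for k in keys)

def detect_links(jobs):
    """Yield (writing_job, reading_job) pairs that should be stitched together."""
    for writer in jobs:
        for reader in jobs:
            if writer is reader:
                continue
            t, s = writer["target"], reader["source"]
            if compatible(t["type"], s["type"]) and same_resource(t, s):
                yield writer["name"], reader["name"]

jobs = [
    {"name": "LoadStaging",
     "source": {"type": "Sequential", "filename": "/data/in/orders.csv"},
     "target": {"type": "ODBC", "server": "DSN_DW", "schema": "STG", "table": "ORDERS"}},
    {"name": "BuildMart",
     "source": {"type": "OracleOCI", "server": "DSN_DW", "schema": "STG", "table": "ORDERS"},
     "target": {"type": "OracleOCI", "server": "DSN_MART", "schema": "DM", "table": "ORDER_FACT"}},
]

print(list(detect_links(jobs)))  # [('LoadStaging', 'BuildMart')]
```

Notice that nothing in this sketch depends on Table Definitions or on compiled Jobs — only on the Stage types and the resolved property values, which is why good parameter hygiene pays off so quickly.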

If your team follows good practices, such as parameter sets or common default values for things like an ODBC DSN, a high degree of lineage will often appear immediately after the first Automated Services run (informally known as “stitching”). Expect that the “first” time you run Automated Services it could take a long time and be resource-intensive. Do it during off hours if you have hundreds or thousands of Jobs. After that first time it will recognize the delta and only parse through the Jobs that are new or have been changed.

Now when you do your lineage reporting, starting while “standing” on the target Stage of a downstream Job in your application, and you ask “Where does this come from?”, you can expect to see Stages through many Jobs back to the ultimate source. If the lineage dead-ends unexpectedly, it’s probably because one of the three rules above didn’t apply, or there’s an odd Stage that isn’t supported for lineage (Redbrick is one of the only ones I’m aware of at this point).

If that fails, use the “Stage Binding” at the Advanced Services tab. This is one of the options for “manual” binding of metadata — a sort of “toolbox” of wrenches and bolts for when you need ’em. Stage Binding is designed to be used only when absolutely necessary, to “force” two Stages together for lineage purposes. It is fairly easy to use…it prompts you first for the name of a Job, and then, when it presents you with a list of Stages, slide your cursor over to the right so that you can “add” a Stage from another Job. I have used it effectively for unsupported Stages, as noted, but also when the rules above don’t apply. In one case I was sending the output of an XML Stage into a Sequential File…and in the next Job, I was reading that with an “External Source” Stage. There is nothing in common between those Stage types, and the columns were entirely different (the Sequential File Stage contains a column called “myXML”, and the External Source merely carries the output of a unix list command, i.e., a set of filenames). I was able to establish perfect lineage, however, by using a manual “Stage Binding”, forcing the Sequential Stage of the first Job and the External Source Stage of the second Job to be “bolted” together.
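Conceptually, a manual Stage Binding is just an extra link that you declare yourself when the automatic rules cannot find one. Continuing the hypothetical sketch above (it reuses detect_links and jobs from that block, and the Job and Stage names here are invented):

```python
# Continuing the hypothetical sketch above (reuses detect_links and jobs):
# a manual Stage Binding is simply a link you declare yourself when the
# automatic rules cannot find one -- e.g. unsupported Stage types, or
# nothing in common between the writing and reading Stages.
manual_bindings = [
    # (writing Job, Stage)                     (reading Job, Stage)
    (("WriteXmlToFile", "SequentialFile_Out"), ("ReadFileList", "ExternalSource_In")),
]

all_links = list(detect_links(jobs)) + [
    (writing_job, reading_job)
    for (writing_job, _), (reading_job, _) in manual_bindings
]
print(all_links)
```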

Good reporting! Next topic — Connecting Database Table objects to your Jobs

Ernie

Posted in Data Lineage. Tags: data lineage, datastage, etl, metadata.

Getting started with Data Lineage!

September 28, 2010 — dsrealtime

Last night I was reminded about a series of blog entries I’ve wanted to make concerning the InfoSphere Metadata Workbench and how to get the most out of its Data Lineage capabilities. The Workbench is very powerful — it illustrates relationships between processes, business concepts, people, databases, columns, data files, and much, much more. Combined with Business Glossary, it gives you reporting capabilities for the casual business user as well as the (often) more technical dedicated metadata researcher.

I’ve had a variety of entries about Workbench in the past two years (see the table of contents link in the top right, and find the metadata section), but nothing on “getting started”. As Metadata Workbench starts to support more and more objects, certain skills and techniques become that much more important. This is especially true when trying to gain the most from Metadata Workbench when it is being used to illustrate Business Terms, Stewards, FastTrack Mappings, DataStage Jobs, Tables and Files, External ETL Tools, scripts and processes, operational metadata, and a vast list of other data integration artifacts.

Many of you who start with Metadata Workbench begin with DataStage/QualityStage Jobs.

[Note: this particular post applies mostly to pre-8.7 Metadata Workbench, but it is still worth reading initially, even if you are on a newer release of the workbench…the remaining posts in the series discuss the critical things needed for linking Jobs together and linking Jobs to their source and target table or file objects]

So I will start there.

Once you have mastered lineage with DataStage, and its combination with other objects, you can then easily move on to other concepts for non-DataStage metadata, which I will also cover in this series of blog entries. If you are using Metadata Workbench and are not a DataStage user, stay tuned. As we progress I will take a tour through Extensions, Extension Mappings, Extended Data Sources and all other such concepts.

MAKE SURE YOU START learning about lineage and Metadata Workbench using a small number of Jobs. NO MORE THAN 10 – 15. Any more and the results will be overwhelming or confusing, and will prevent you from understanding some very important rules. Find 10 or so Jobs, probably in one application, and in one folder or related folders, that have things in common. Jobs that share datasets or are part of an overall flow. Get to know these Jobs before ever opening the Workbench to review them. Eventually you will be comfortable using the Workbench to review metadata on Jobs you have never seen — but that’s a poor way to learn the power of the tooling.

From those Jobs, pick a reasonably complex one. Maybe one with a lookup or a Join, a reasonable sequence of Stages (8 to 10 or so) and preferably a single “major” target. Since you are learning about the Workbench, you should be familiar, even intimate, with this Job. That will help as you learn the various ways to navigate through the user interface, because you will know what to expect at each particular dialog, report or screen.

[this first “getting started” assumes that you have NEVER performed Automated Services against your DataStage Project….if you have, it’s ok, but you might not get the same results as I am outlining below — you may get more metadata than I am describing in this initial learning step. …and if you don’t know what I’m talking about (yet), that’s ok too…]

Log into the Metadata Workbench and notice the “Engine” pull-down at the left. This is the list of your DataStage Servers and their Projects. Open up the project and its folders, and find your Job. Click directly on it. Scroll up and down in the detailed page that appears. There is the main page with the picture of the Job (click on it and you will get an expanded view, in a new window, of what the Job looks like). The metadata you are viewing is up-to-date as of the last moment you or a developer saved the Job in the DS Designer. There is also a very important listing of the Stage types in the Job, along with their icons. Note that below you have many “expandable” sections for things like Job Operational metadata…..investigate the options.

Now click on the “main” target Stage of this Job. This brings you to a similar looking detail page, this one for the “Stage.” Look around, but don’t click anything — when you are ready, select “Data Lineage” at the upper right. As you do so, consider “where you are standing” (you are on a “Stage”) and what sort of lineage you would like to see. As you will discover, knowing “where you are” when you start your lineage is very important.

[If you are using 8.7 or higher, at this point you should soon see a single graphic, probably just one big icon in the middle of the page. This is the lineage for THIS Job — all by itself (unless you’ve already done some other work in lineage, in which case you will get other things linked to it). Look around and then click on the “Expand” link that is on the Job itself. This brings up a detailed page “for that Job”. Look around… “grab” an empty part of the screen with your left mouse button and move the picture around; zoom up and down (there is a little bar at the top left for this). Click on the other buttons that show you both a “mini” edition of your lineage as well as a key for the kinds of lineage that are displayed. Then move to the next post in this series (Linking Jobs) ]

The default option at the next dialog is “Where did this come from”. Ignore the three checked boxes for now and click “Create Report”. This will comb through ALL the possible resources that data for the Stage you started on could have come from. Look through the list. Note also the highlighted line. Move it up and down. This highlight bar lets you select EXACTLY which resource you’d like to see for your actual report. The “total” collection of lineage resources is in front of you right now — you will select which one you want for a detailed source-to-target report. This is often a point of confusion because the highlight bar is not always obvious. Data lineage doesn’t show you “ALL” the sources — just the path to/from the ones that you select [we’ll contrast this in a later entry with Business Lineage, which DOES provide a summary of ALL sources or ALL targets for a particular resource].

Look at the bottom of the page. Find the button labeled “Display Final Assets”. Click it. The list of objects above should get much smaller. Most likely, it should just show the source stage for this Job, or maybe its ultimate source as well as a lookup source stage or a source for a Join. Pick the primary source stage for the Job and then click “Show Textual” Report.

Review the result. The textual report isn’t as pretty, but it tends to be more scalable. Scroll up and down, and note what you see on the left, and the Job details you see on the right. Everything is hyperlinked. Now find the little triangle towards the top left of this center pane where your report is (it’s called Report Selection or similar) and click on it. That should expose again the “assets” page. Now you can try “Show Graphical”. When you get there, play with it. Grab some white space around the diagram and move the whole thing around…..try the zoom bar in the upper left. Click on the various icons in the lineage and then right mouse on one of the stages and find “open details in new window”. That will bring you back to a detailed viewing page and the process starts again.

What happens if you choose the target stage of your original Job (the first stage you selected earlier) and ask for “Data Lineage” and select “Where does this go to”? If you haven’t done Automated Services as I’ve noted above, you will likely receive “No assets found” or “No data for the report”. This is because it’s the “final” target — there isn’t anything else. “Where did this come from” will yield a similar result if you happen to be “sitting” on a source when you start your lineage exercise.

If you practice this, you should become very familiar with the lineage report user interface, and will have a strong base for moving forward with more complex, and deeper, scenarios.

Next entry: Linking Jobs together……

(link to next post in this series: Linking Jobs )

Ernie

Posted in data lineage, datastage, etl, meta data, metadata, Metadata Workbench. Tags: data lineage, etl, metadata, metadata workbench.

What exactly is Data Lineage?

December 15, 2009 — dsrealtime

Metadata management is becoming a big issue (…again, finally, for the “nth” time over the years), thanks to attention to initiatives such as data governance, data quality, master data management, and others. This time it finally feels more serious. That’s a good thing. One concept that vendors and architects are promoting in connection with metadata management (whether home-grown or part of a packaged solution) is “data lineage.” What does that mean?

Let’s imagine that a user is confused about a value in a report…perhaps it is a numeric result labeled as “Quarterly Profit Variance”.

What do they do today to gain further awareness of this amount and trust it for making a decision? In many large enterprises, they call someone at the “help desk”. This leads to additional phone calls and emails, and then one or more analysts or developers sifting through various systems, reviewing source code and tracking down “subject matter experts.” One large bank I visited recently said that this can take _days_! …and that’s assuming they ever successfully find the answer. In too many other cases the executive cannot wait that long and makes a decision without knowing the background, with all the risks that entails.

Carefully managed metadata that supports data lineage can help.

Using the example above, quick access to a corporate glossary of terminology will enable an executive to look up the business definition for “Quarterly Profit Variance.” That may help them understand the business semantics, but may not be enough…. They or their support team may need to drill deeper. “Where did the value come from?” “How is it calculated?”

Data lineage can answer these questions, tracing the data path (its “lineage”) upstream from the report. Many sources and expressions may have contributed to the final value. The lineage path may run through cubes and database views, ETL processes that load a warehouse or datamart, intermediate staging tables, shell and FTP scripts, and even legacy systems on the mainframe. This lineage should be presented in a visual format, preferably with a summary-level view and the option to drill down for individual column and process details.
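To make the idea tangible, here is a toy sketch (in Python, with entirely hypothetical report, table, and expression names) of how captured lineage metadata could answer both “Where did the value come from?” and “How is it calculated?” for a single report item by walking upstream through the recorded hops:

```python
# Toy illustration (hypothetical report, tables, and expressions) of how
# captured lineage metadata can answer both "where did this come from?"
# and "how is it calculated?" for a single report item.

# Each record: a target item, the source items it was derived from, and the
# transformation rule applied along the way.
LINEAGE = {
    "Report.Quarterly_Profit_Variance": {
        "sources": ["MART.PROFIT_FACT.actual_profit", "MART.PROFIT_FACT.plan_profit"],
        "rule": "actual_profit - plan_profit",
    },
    "MART.PROFIT_FACT.actual_profit": {
        "sources": ["STAGING.GL_BALANCES.amount"],
        "rule": "SUM(amount) GROUP BY quarter",
    },
    "MART.PROFIT_FACT.plan_profit": {
        "sources": ["MAINFRAME.PLAN_EXTRACT.plan_amt"],
        "rule": "direct copy via nightly ETL load",
    },
    "STAGING.GL_BALANCES.amount": {
        "sources": ["MAINFRAME.GL_EXTRACT.amt"],
        "rule": "FTP from mainframe, then bulk load",
    },
}

def explain(item, depth=0):
    """Walk upstream from a report item, printing each hop and its rule."""
    step = LINEAGE.get(item)
    if not step:
        print("  " * depth + f"{item}  <-- original source")
        return
    print("  " * depth + f"{item}  =  {step['rule']}")
    for source in step["sources"]:
        explain(source, depth + 1)

explain("Report.Quarterly_Profit_Variance")
```

A real metadata repository holds far richer detail (jobs, stages, hosts, stewards, and so on), but the question-answering pattern is the same: follow the recorded flow backwards from the item the user is asking about.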

Knowing the original source, and understanding “what happens” to the data as it flows to a report helps boost confidence in the results and the overall business intelligence infrastructure. Ultimately this leads to better decision making. Happy reporting!

Ernie Ostic

Update (2017-03-30) to this old but still very valid discussion….. please check out this excellent new blog entry by one of my IBM colleagues, Distinguished Engineer and thought leader Mandy Chessell, regarding Data Lineage … https://poimnotes.blog/2017/03/19/understanding-the-origin-of-data/

Posted in Business Glossary, Data Governance, data lineage, datastage, general, Information Governance, meta data, metadata, Metadata Workbench. Tags: data lineage, metadata.
  • please note

    The postings on this site are my own and don’t necessarily represent current or former employers or their positions, strategies or opinions.
