Real-Time Information Governance and Data Integration

  • What’s this Blog about?

    Thoughts and techniques concerning all things data and data integration, especially data lineage and how and why it needs to be tracked and monitored.


Open Metadata Sharing with ODPi/Egeria and IGC

October 2, 2018 — dsrealtime

Hi everyone! I'm here to share with you the continuing evolution of Open Metadata and its soon-to-be-released implementation for Information Server 11.7 and the Information Governance Catalog (IGC).

Recently I was given the opportunity to start working with what is being called the igc-omrs-connector — an implementation of ODPi/Egeria and its Open Metadata Repository Services (OMRS) APIs that enables IGC to be the first OMRS-compliant repository!

The link below points to a demonstration that illustrates the real-time and bi-directional sharing of metadata between two instances of IGC. It reviews several key concepts of ODPi/Egeria and OMRS (such as the meaning of a "cohort") and then dives deeper, using windows into the Kafka topics that help enable OMRS sharing to "watch" the metadata being exchanged between the repositories.
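To picture what those Kafka windows are showing, here is a minimal Python sketch of summarizing one metadata event as it might flow on a cohort topic. The event shape below is an assumption for illustration only; the actual OMRSEventV1 payload is defined by the Egeria project and is richer than this.

```python
import json

# An illustrative OMRS-style event as it might appear on a cohort Kafka topic.
# Field names here are assumptions for the sketch, not the real OMRSEventV1 schema.
raw_event = json.dumps({
    "eventType": "NEW_ENTITY_EVENT",
    "originator": {"metadataCollectionId": "igc-repo-1"},
    "entity": {"typeName": "RelationalTable", "displayName": "CUSTOMER"},
})

def summarize_event(message: str) -> str:
    """Turn one cohort-topic message into a one-line summary."""
    event = json.loads(message)
    origin = event["originator"]["metadataCollectionId"]
    entity = event["entity"]
    return f'{event["eventType"]}: {entity["typeName"]} "{entity["displayName"]}" from {origin}'

print(summarize_event(raw_event))
```

In the demo, each repository in the cohort both publishes events like this and consumes its peers' events, which is what makes the sharing bi-directional.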

The 15-minute recording starts with a brief overview of ODPi/Egeria and OMRS, then moves into the actual demonstration, illustrating the sharing of technical database metadata. It continues with the sharing of glossary information and then the assignment of business terms to technical assets.

The video highlights ODPi/Egeria sharing metadata between two instances of IGC, which is attractive for its real-time, bi-directional, and automated metadata exchange. The real value, however, arrives when (in the next few months) Apache Atlas and other repositories _also_ become OMRS-compliant, enabling metadata sharing among _independent_ repositories! This is where the benefits of Open Metadata will be fully realized. Costs of building and maintaining custom bridging solutions can be reduced, and developers, business users, and data scientists alike will be able to more easily find, understand, and validate valuable data assets and their meaning while further exploiting metadata-driven solutions throughout their organizations.

https://www.youtube.com/watch?v=P_RhQXXEbd4&t=11s

Stay tuned. As additional compliant repositories come online, I will profile and (where possible) demonstrate their capabilities here, and continue this important discussion!

Thanks.

-ernie

 

Posted in Data Governance, Information Governance, metadata, Metadata Management. Tags: apache atlas, Egeria, igc, information governance catalog, ODPi/Egeria, Open Metadata.

Getting started with Data Lineage!

September 28, 2010 — dsrealtime

Last night I was reminded about a series of blog entries I’ve wanted to make concerning the InfoSphere Metadata Workbench and how to get the most out of its Data Lineage capabilities. The Workbench is very powerful — it illustrates relationships between processes, business concepts, people, databases, columns, data files, and much, much more. Combined with Business Glossary, it gives you reporting capabilities for the casual business user as well as the (often) more technical dedicated metadata researcher.

I’ve had a variety of entries about Workbench in the past two years (see the table of contents link in the top right, and find the metadata section), but nothing on “getting started”. As Metadata Workbench starts to support more and more objects, knowing certain skills and techniques becomes that much more important. This is especially true when trying to gain the most from Metadata Workbench when it is being used to illustrate Business Terms, Stewards, FastTrack Mappings, DataStage Jobs, Tables and Files, External ETL Tools, scripts and processes, operational metadata, and a vast list of other data integration artifacts.

Many of you who start with Metadata Workbench begin with DataStage/QualityStage Jobs.

[Note: this particular post applies mostly to pre-8.7 Metadata Workbench, but it is still worth reading initially, even if you are on a newer release of the workbench. The remaining posts in the series discuss the critical things needed for linking Jobs together and linking Jobs to their source and target table or file objects.]

So I will start there.

Once you have mastered lineage with DataStage, and its combination with other objects, you can then easily move on to other concepts for non-DataStage metadata, which I will also cover in this series of blog entries. If you are using Metadata Workbench and are not a DataStage user, stay tuned. As we progress I will take a tour through Extensions, Extension Mappings, Extended Data Sources and all other such concepts.

MAKE SURE YOU START learning about lineage and Metadata Workbench using a small number of Jobs. NO MORE THAN 10 – 15. Any more and the results will be overwhelming or confusing, and will prevent you from understanding some very important and critical rules. Find 10 or so Jobs, probably in one application, and in one folder or related folders, that have things in common. Jobs that share datasets or are part of an overall flow. Get to know these Jobs before ever opening the Workbench to review them. Eventually you will be comfortable using the Workbench to review metadata on Jobs you have never seen — but that’s a poor way to learn the power of the tooling.

From those Jobs, pick a reasonably complex one. Maybe one with a lookup or a Join, a reasonable sequence of Stages (8 to 10 or so) and preferably a single “major” target. Since you are learning about the Workbench, you should be familiar, even intimate, with this Job. That will help as you learn the various ways to navigate through the user interface, because you will know what to expect at each particular dialog, report or screen.

[This first “getting started” post assumes that you have NEVER performed Automated Services against your DataStage Project. If you have, that’s OK, but you might not get the same results as I outline below; you may get more metadata than I describe in this initial learning step. And if you don’t know what I’m talking about (yet), that’s OK too.]

Log into the Metadata Workbench and notice the “Engine” pull-down at the left. This is the list of your DataStage Servers and their Projects. Open up the project and its folders, and find your Job. Click directly on it. Scroll up and down in the detailed page that appears. There is the main page with the picture of the Job (click on it and you will get an expanded view, in a new window, of what the Job looks like). The metadata you are viewing is up to date as of the last moment you or a developer saved the Job in the DS Designer. There is also a very important listing of the Stage types in the Job, along with their icons. Note below that you have many “expandable” sections for things like Job operational metadata; investigate the options.

Now click on the “main” target Stage of this Job. This brings you to a similar-looking detail page, this one for the “Stage.” Look around, but don’t click anything — when you are ready, select “Data Lineage” at the upper right. As you do so, consider “where you are standing” (you are on a “Stage”) and what sort of lineage you would like to see. As you will discover, knowing “where you are” when you start your lineage is very important.

[If you are using 8.7 or higher, at this point you should soon see a single graphic, probably just one big icon in the middle of the page. This is the lineage for THIS Job — all by itself (unless you’ve already done some other work in lineage, in which case you will get other things linked to it). Look around and then click on the “Expand” link that is on the Job itself. This brings up a detailed page “for that Job”. Look around… “grab” an empty part of the screen with your left mouse button and move the picture around; zoom up and down (there is a little bar at the top left for this). Click on the other buttons that show you both a “mini” edition of your lineage as well as a key for the kinds of lineage that are displayed. Then move to the next post in this series (Linking Jobs) ]

The default option at the next dialog is “Where did this come from”. Ignore the three checked boxes for now and click “Create Report”. This will comb through ALL the possible resources for “where” the data for the “stage you started on” came from. Look through the list. Note also the highlighted line. Move it up and down. This highlight bar lets you select EXACTLY which resource you’d like to see for your actual report. The “total” collection of lineage resources is in front of you right now — you will select which one you want for a detailed source-to-target report. This is often a point of confusion because the highlight bar is not always obvious. Data lineage doesn’t show you “ALL” the sources — just the path to/from the ones that you select [we’ll contrast this in a later entry with Business Lineage, which DOES provide a summary of ALL sources or ALL targets from a particular resource].

Look at the bottom of the page. Find the button labeled “Display Final Assets”. Click it. The list of objects above should get much smaller. Most likely, it should just show the source stage for this Job, or maybe its ultimate source as well as a lookup source stage or a source for a Join. Pick the primary source stage for the Job and then click “Show Textual” Report.

Review the result. The textual report isn’t as pretty, but it tends to be more scalable. Scroll up and down, and note what you see on the left and the Job details you see on the right. Everything is hyperlinked. Now find the little triangle towards the top left of the center pane where your report is (it’s called Report Selection or similar) and click on it. That should expose the “assets” page again. Now you can try “Show Graphical”. When you get there, play with it. Grab some white space around the diagram and move the whole thing around; try the zoom bar in the upper left. Click on the various icons in the lineage, then right-click one of the stages and find “open details in new window”. That will bring you back to a detailed viewing page, and the process starts again.

What happens if you choose the target stage of your original Job (the first stage you selected earlier) and ask for “Data Lineage” and select “Where does this go to”? If you haven’t done Automated Services as I’ve noted above, you should likely receive “No assets found” or “No data for the report”. This is because it’s the “final” target — there isn’t anything else. “Where did this come from” will yield a similar result if you happen to be “sitting” on a source when you start your lineage exercise.

If you practice this, you should become very familiar with the lineage report user interface, and will have a strong base for moving forward with more complex, and deeper, scenarios.

Next entry: Linking Jobs together……

(link to next post in this series: Linking Jobs )

Ernie

Posted in data lineage, datastage, etl, meta data, metadata, Metadata Workbench. Tags: data lineage, etl, metadata, metadata workbench.

What exactly is Data Lineage?

December 15, 2009 — dsrealtime

Metadata management is becoming a big issue (…again, finally, for the “nth” time over the years), thanks to attention to initiatives such as data governance, data quality, master data management, and others. This time it finally feels more serious. That’s a good thing. One conceptual issue that vendors and architects are pushing related to metadata management (whether home-grown or part of a packaged solution) is “data lineage.” What does that mean?

Let’s imagine that a user is confused about a value in a report…perhaps it is a numeric result labeled as “Quarterly Profit Variance”.

What do they do today to gain further awareness of this amount and trust it for making a decision? In many large enterprises, they call someone at the “help desk”. This leads to additional phone calls and emails, and then one or more analysts or developers sifting through various systems, reviewing source code, and tracking down “subject matter experts.” One large bank I visited recently said that this can take _days_! …and that’s assuming they ever successfully find the answer. In too many other cases the executive cannot wait that long and makes a decision without knowing the background, with all the risks that entails.

Carefully managed metadata that supports data lineage can help.

Using the example above, quick access to a corporate glossary of terminology will enable an executive to look up the business definition for “Quarterly Profit Variance.” That may help them understand the business semantics, but may not be enough. They or their support team may need to drill deeper: “Where did the value come from?” “How is it calculated?”

Data lineage can answer these questions, tracing the data path (its “lineage”) upstream from the report. Many sources and expressions may have contributed to the final value. The lineage path may run through cubes and database views, ETL processes that load a warehouse or datamart, intermediate staging tables, shell and FTP scripts, and even legacy systems on the mainframe. This lineage should be presented in a visual format, preferably with a summary-level view and an option to drill down for individual column and process details.
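Conceptually, that upstream trace is a graph walk: each asset points at the assets it was derived from, and lineage follows those edges back to the original sources. Here is a toy sketch in Python; the asset names are invented for illustration, and a real lineage tool does this over a managed metadata repository rather than a hard-coded dictionary.

```python
# A toy lineage graph: each asset maps to the assets it was derived from.
# Asset names are invented for illustration.
derived_from = {
    "report.quarterly_profit_variance": ["mart.PROFIT_FACT.variance"],
    "mart.PROFIT_FACT.variance": ["staging.STG_PROFIT.amount"],
    "staging.STG_PROFIT.amount": ["mainframe.GL.PROFIT_AMT"],
    "mainframe.GL.PROFIT_AMT": [],  # an ultimate source: no upstream parents
}

def trace_upstream(asset, graph):
    """Walk the lineage graph upstream, returning every contributing asset."""
    seen, stack = [], [asset]
    while stack:
        current = stack.pop()
        for parent in graph.get(current, []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen

print(trace_upstream("report.quarterly_profit_variance", derived_from))
```

The output lists everything that contributed to the report value, ending at the mainframe field — exactly the question the confused report user wanted answered.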

Knowing the original source, and understanding “what happens” to the data as it flows to a report helps boost confidence in the results and the overall business intelligence infrastructure. Ultimately this leads to better decision making. Happy reporting!

Ernie Ostic

Update (2017-03-30) to this old but still very valid discussion: please check out this excellent blog entry by one of my IBM colleagues, Distinguished Engineer and thought leader Mandy Chessell, regarding data lineage: https://poimnotes.blog/2017/03/19/understanding-the-origin-of-data/

Posted in Business Glossary, Data Governance, data lineage, datastage, general, Information Governance, meta data, metadata, Metadata Workbench. Tags: data lineage, metadata.

Column-Names, COLUMN_NAMEs, %CoLuMnNaMeS% !!!

November 30, 2009 — dsrealtime

Column names. Sigh. What a long and sordid history our industry has with column names, or “field” names if we go back far enough. In the early days they were limited to eight characters, and all caps! It wasn’t unusual to have cryptic meaning squeezed into the first two characters of a name, so that one could categorize twisted column names to give them more meaning. MTCSTTOT might mean Midwest Territory Customer Total. That one is fairly self-explanatory, but there were (there still are) much worse.

Then along came the standards committees, open systems solutions, and lots of vendors to “save the day.” Long column names (and table names and directory names, etc.) were promised to “clear up everything.” And then what happened? We’ve all spent a decade and a half updating parsers, fixing bugs, and learning how to deal with Midwest Territory – Customer Total, and how to let the system know that we weren’t trying to subtract “Customer Total” from “Midwest Territory”.

I used to be a purist. Long column names, short ones, numeric characters, punctuation: let’s allow it all! Why be picky? Even reserved words should be allowed, provided they are in the proper context. The burden should be on the vendors to figure it out and get it right. A piece of me still feels that way, but it is getting beaten up by the realist. Practically speaking, it seems we’ve gone too far. Maybe it’s the increased integration and cross-vendor pollination (or confrontation) that has resulted in the trashing of such lofty goals, but there still isn’t a good enough standard that works across all platforms, all databases, all languages, all tools, all operating systems, and in every context. Recently I’ve watched multiple sites lose sleep over problems that ultimately came down to files not being found in strange directory paths, column names causing strange and misleading syntax errors because of special characters used in a system that didn’t understand them, and incorrect results in a scenario that mis-parsed an expression. This was across tools from different vendors, and across different environments — increasingly a situation that occurs in many enterprises.

It’s an issue we have to live with. It’s not an easy problem to solve, obviously. I’d still like to see it resolved, but as long as we have lots of different software, lots of different skills, and lots of different environments, the problem will continue to rear its ugly head. Long names seem to be fairly well supported, and adoption of the CamelCaseMethod that Java programmers are fond of seems to work well in most places. Spaces and strange characters, though, require better justification. Somewhere, somehow, they still tend to show up as thorns.

If you have to use blanks or strange characters in a column name, please document it. Document why you made the choice, so that it can be revisited and/or evaluated later. Document the places that you are using it, and most importantly, be sure to employ tools that will help you find it everywhere if something goes awry.
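The advice above — find risky names before they find you — is easy to automate. Here is a small sketch of a column-name lint in Python; the regex encodes a conservative “letters, digits, underscore, no leading digit” convention, and the reserved-word list is purely illustrative (a real one would come from your database’s documentation):

```python
import re

# Purely illustrative reserved-word list; a real check would use the
# reserved words documented for your specific database or tool.
RESERVED = {"SELECT", "FROM", "ORDER", "GROUP"}

def risky_column_names(names):
    """Flag column names likely to cause trouble across tools and platforms."""
    flagged = []
    for name in names:
        # Conservative convention: letters/digits/underscore, no leading digit.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
            flagged.append((name, "special characters or spaces"))
        elif name.upper() in RESERVED:
            flagged.append((name, "reserved word"))
    return flagged

print(risky_column_names([
    "MTCSTTOT",
    "Midwest Territory - Customer Total",
    "Order",
    "CustomerTotal",
]))
```

Anything flagged is exactly what deserves the documentation I’m urging above: why the name was chosen, and where it is used.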

Ernie

Posted in etl, meta data, metadata.

Cool new Widget for Business Glossary!

September 15, 2009 — dsrealtime

Check out this cool new example of a “widget” for access to the InfoSphere Business Glossary! My IBM colleagues in engineering created an example that illustrates how you can integrate the Business Glossary directly into your portals and custom web applications using the REST API that is included with BG. This example is easy to deploy and, for the web-savvy developer, easy to customize. Enjoy!

https://www.ibm.com/developerworks/data/library/techarticle/dm-0909infosphererest/
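For a rough sense of what such a widget does under the covers, here is a hedged Python sketch of building a glossary search request. The server URL, endpoint path, and parameter name below are hypothetical placeholders, not the product’s actual API; the authoritative details are in the article linked above and in the REST API documentation shipped with the product.

```python
import base64
from urllib.parse import urlencode, urljoin

def build_term_search(base_url, term, user, password):
    """Build the URL and headers for a hypothetical glossary term search.

    The path "rest/v1/terms/search" and the "searchString" parameter are
    invented for illustration; substitute the documented ones.
    """
    query = urlencode({"searchString": term})
    url = urljoin(base_url, "rest/v1/terms/search") + "?" + query
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {"Authorization": f"Basic {token}", "Accept": "application/json"}
    return url, headers

url, headers = build_term_search(
    "https://myserver:9443/bg/", "Quarterly Profit Variance", "admin", "secret")
print(url)
```

A widget would issue this request from the browser (or via a small proxy) and render the returned term definitions inline in the portal page.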

Ernie

Posted in bg, Business Glossary, metadata, RealTime, REST.

Using DataStage to load new Terms into Business Glossary

November 30, 2008 — dsrealtime

There are a variety of ways to import new Terms into the InfoSphere Business Glossary. One of these, for initial loads, is to use an XML import. The XML format is fairly easy to produce (a sample is provided with the Business Glossary and can be found at your Information Server Web Console).

The attached DataStage Server Job illustrates how to load new Terms and Attributes from some external structure. In this example I use a simple sequential file, but if you look at the Job you will see that it can easily be adapted to any source; or simply write another Job to go from your source to a target that reflects the sample terms I’ve provided below.

This was tested in 8.0, although it has since been modified for use with 8.1. I’m not sure how well it will import into 8.0. The terms and attributes are very simplistic and use a hockey theme, just to keep things simple and allow for discussion. The code is an example for instructional purposes only. Please let me know if you have any questions or run into problems.
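If you’d rather generate the import file outside of DataStage, the same idea can be sketched in a few lines of Python. The element and attribute names below are invented for illustration only; the authoritative format is the sample XML that ships with the Business Glossary, as noted above.

```python
import xml.etree.ElementTree as ET

# Hockey-themed sample terms, matching the spirit of the attached example.
terms = [
    ("Goal", "A score awarded when the puck fully crosses the goal line."),
    ("Assist", "Credit given to up to two players who set up a goal."),
]

# Build a small glossary document. Element/attribute names here are
# illustrative placeholders for the product's real import schema.
root = ET.Element("glossary")
category = ET.SubElement(root, "category", name="Hockey")
for name, description in terms:
    term = ET.SubElement(category, "term", name=name)
    ET.SubElement(term, "description").text = description

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Whatever produces the file — DataStage Job or script — the key is to mirror the sample structure exactly before importing via the Web Console.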

(I hope you can make use of these. It seems the blog has changed and won’t allow me to upload .txt files. I’ve put all the content into the “notes pages” of a .ppt. You’ll need to download and open the .ppt [one of only a few file types allowed here] and then see if you can cut and paste the sample .txt file, sample XML, and .dsx.)

Ernie

sample-terms-attributes

Posted in bg, Business Glossary, datastage, meta data, metadata.