Real-Time Information Governance and Data Integration


Open Metadata Sharing with ODPi/Egeria and IGC

October 2, 2018 — dsrealtime

Hi everyone… here to share with you the continuing evolution of Open Metadata and its soon-to-be-released implementation for Information Server 11.7 and the Information Governance Catalog (IGC).

Recently I was given the opportunity to start working with what is being called the igc-omrs-connector: an implementation of ODPi/Egeria and its Open Metadata Repository Services (OMRS) APIs that enables IGC to be the first OMRS-compliant repository!

The link below points to a demonstration that illustrates the real-time, bi-directional sharing of metadata between two instances of IGC. It reviews several key concepts of ODPi/Egeria and OMRS (such as the meaning of a “cohort”) and then dives deeper, with windows into the Kafka topics that help enable OMRS sharing, so you can watch metadata flow between the repositories.
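Those Kafka topics carry OMRS events that any cohort member can inspect. The sketch below is a simplified, hypothetical illustration of routing such an event; the field names and event shape here are assumptions for illustration, as real Egeria event payloads carry many more fields.

```python
import json

# A simplified, HYPOTHETICAL OMRS instance event as it might appear on a
# cohort Kafka topic; real Egeria payloads are richer than this sketch.
raw_event = json.dumps({
    "eventCategory": "INSTANCE",
    "eventType": "NEW_ENTITY_EVENT",
    "originator": {"serverName": "igc-repo-1"},
    "entity": {"typeName": "RelationalTable", "displayName": "CUSTOMERS"},
})

def summarize(event_json: str) -> str:
    """Return a one-line summary of the event for logging or auditing."""
    event = json.loads(event_json)
    entity = event.get("entity", {})
    return (f"{event['originator']['serverName']}: "
            f"{event['eventType']} for {entity.get('typeName')} "
            f"'{entity.get('displayName')}'")

print(summarize(raw_event))
# igc-repo-1: NEW_ENTITY_EVENT for RelationalTable 'CUSTOMERS'
```

In a real cohort you would consume these events from the topic with a Kafka client rather than building them by hand; the point is that the exchange is open and observable.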

The 15-minute recording starts with a brief overview of ODPi/Egeria and OMRS, then moves into the demonstration itself, illustrating the sharing of technical database metadata. It continues with the sharing of glossary information and then the assignment of business terms to technical assets.

The video highlights ODPi/Egeria sharing metadata between two instances of IGC, which is attractive for its real-time, bi-directional, and automated metadata exchange. The real value, however, comes when (in the next few months) Apache Atlas and other repositories _also_ become OMRS-compliant, enabling metadata sharing among _independent_ repositories! This is where the benefits of Open Metadata will be fully realized. Costs associated with building and maintaining custom bridging solutions can be reduced, and developers, business users, and data scientists alike will be able to more easily find, understand, and validate valuable data assets and their meaning while further exploiting metadata-driven solutions throughout their organizations.

https://www.youtube.com/watch?v=P_RhQXXEbd4&t=11s

Stay tuned. As additional compliant repositories come online, I will profile and (where possible) demonstrate their capabilities here, and continue this important discussion!

Thanks.

-ernie


Posted in Data Governance, Information Governance, metadata, Metadata Management. Tags: apache atlas, Egeria, igc, information governance catalog, ODPi/Egeria, Open Metadata.

Building Metadata Extensions for Information Server: Why?

March 5, 2014 — dsrealtime

Lately I have been working with many sites that are interested in “Extensions”: simple ways of defining new objects within Information Server and/or tying them together for data lineage purposes.

Extensions come in two flavors. Extended Data Sources are the equivalent of defining your own tables, columns, files, or other “things” that you want to appear as individual icons in your lineage diagrams. Extension Mapping Documents are the specifications that define sources and targets (along with other useful metadata properties) and describe the lineage that the Metadata Workbench will draw when performing any type of lineage reporting.
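To make the mapping idea concrete, here is a minimal sketch of generating a source-to-target mapping file. The column names and asset names below are illustrative assumptions, not the exact IGC import format; consult the product documentation for the real layout.

```python
import csv
import io

# Hypothetical rows for an extension mapping document: each row declares
# that a source asset feeds a target asset, with an optional rule note.
# Column and asset names are illustrative, NOT the exact IGC format.
mappings = [
    {"Source": "MQ.ORDERS_QUEUE", "Target": "DB2.STAGING.ORDERS",
     "Rule": "FTP + load script nightly"},
    {"Source": "DB2.STAGING.ORDERS", "Target": "HDFS./landing/orders",
     "Rule": "Sqoop export"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Source", "Target", "Rule"])
writer.writeheader()
writer.writerows(mappings)
print(buffer.getvalue())
```

Because mapping documents are just structured text, they are easy to generate from inventories, scheduler definitions, or scripts you already maintain.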

Why create them? Doesn’t Information Server already allow imports of tables, columns, files, and other artifacts in our environments? Doesn’t DataStage provide me with data lineage, describing complex flows of data?

The answer to that question depends largely on what you are trying to accomplish with your Information Governance objectives. If you are only narrowly concerned about the DataStage Jobs in your application, and the datamarts that they flow to, there may not be a need for Extensions. However, many of you are expanding your horizons beyond just DataStage, and looking at all of the other elements of your enterprise that need tracking, management, oversight, and governance. Such sites are looking to include in their lineage ALL of their objects — not just the tables and columns defined in their relational databases, but also the legacy objects, the message queues, the green screens, the CICS transactions or even the illustration of “people”, so that Tweets and other social media feeds can be shown as the “source” in a lineage diagram that ends up in Hadoop! Those same sites also need to outline the processes that move and transform data, whether they are DataStage, another ETL tool, shells, FTP scripts, java or other 3gl programs.

Every one of those objects may be important to lineage, especially when there is a need to provide detailed source information to upper management. Equally, those objects also demand governance — such as being assigned Stewards, becoming associated with business concepts and Terms, or shown as “Implementing” a particular data quality “Policy” or “Rule”. Further, such objects benefit by being categorized, labeled, or otherwise organized into Collections that make them more useful to everyone who is in need of further definition and deeper understanding. Anyone who “touches” a piece of data, whether it is for development, evaluating a report, or making a crucial decision will benefit by the addition of Extensions.

Several years ago I talked about Extensions as a way of defining an external Web Service (Data Lineage and Web Services). This is just one example of a flow, outside of normal ETL, that has value in being tracked and managed. I have worked with many customers who have defined other ETL tools for lineage, with or without DataStage. Always the goal is to provide more insight to decision makers who need to know where things come from, how they were calculated, who the experts are (and more).

Building Extensions first requires thinking far outside the box and looking at “all” the metadata that is important to your data integration efforts. What metadata will be meaningful to those business users? There is also the need for impact analysis, providing value to your developers who want to answer questions such as “Which processes use this table?” and “Which processes will be affected if we make changes to this MQ Series queue definition?”
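Those impact-analysis questions boil down to walking the source-to-target graph that your mappings describe. Here is a toy sketch of that traversal; the asset names are hypothetical, and a real deployment would query the repository rather than a hand-built list.

```python
from collections import defaultdict

# Toy impact analysis over source -> target mappings: given an asset,
# find everything reachable downstream of it. Names are hypothetical.
edges = [
    ("MQ.ORDERS_QUEUE", "ETL.load_orders"),
    ("ETL.load_orders", "DB2.STAGING.ORDERS"),
    ("DB2.STAGING.ORDERS", "MART.ORDERS_FACT"),
]

downstream = defaultdict(set)
for src, tgt in edges:
    downstream[src].add(tgt)

def impacted(asset: str) -> set:
    """Return all assets downstream of `asset` (depth-first walk)."""
    seen, stack = set(), [asset]
    while stack:
        for nxt in downstream[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(sorted(impacted("MQ.ORDERS_QUEUE")))
# ['DB2.STAGING.ORDERS', 'ETL.load_orders', 'MART.ORDERS_FACT']
```

Running the same walk in the other direction (target to source) gives you the lineage view: where did this datamart column actually come from?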

These are some of the key reasons “why” people are creating Extensions. There is a lot of “built-in” metadata that exists within Information Server. However, you can extract even MORE value from your Information Server investment by adding new objects and new capabilities to the collection of metadata that you are already successfully managing.

Next post will suggest ways to decide which extensions you need, and then we’ll dive into how to create them and what you should consider…

Ernie

Next post in this series: Methodology for Building Extensions

Posted in Metadata Management. Tags: datastage, metadata workbench.

Reviewing the Advanced Tab in the Metadata Workbench

February 9, 2011 — dsrealtime

Hi all…

Just thought I’d throw in a quick review of the important (imho) links at the Advanced tab. Some of these factoids are buried in my other posts, but I needed a cheat sheet for myself and others. Here it is:

Automated Services. This option brings up the dialog that runs the parsing or “stitching” process for the detailed metadata in your DataStage Jobs or Connector-imported RDBMS views. It does a lot of work, takes time the first time you run it (if you have a ton of metadata), and should be scheduled during off-hours. After the first run against a particular project, it uses a change recognition mechanism to pick up only Jobs that have been updated. Note the “checked” DS Projects carefully. Select only those that are really critical, and once checked, don’t “uncheck” — as you will see from the warnings, this will remove all parsing history. Ultimately, this step is the one that reviews the Jobs, connects them via common information found in Stages, etc. See my other posts for how the connection of Jobs to each other is determined.

Stage Binding. When all else fails, you can connect two stages to each other. Use this when, for some reason, two Jobs won’t connect, or when the rules for connecting them can’t be met. I’ve needed this with some custom Stage or Operator implementations, and when I am using a technique that prevents automatic connection. Imagine having a Sequential Stage at the end of a Job that is writing out some XML content, and then using the XML Stage in the next Job to read that content. There isn’t much in common between those Jobs, but I still want lineage to run directly through them.

Data Item Binding. This provides a “manual” binding of particular Stages to Database Tables and Data Files (see other posts for what those are, how they are created, and how they are different from “DataStage Table Definitions”). Use this when you are unable to get Database Alias to work as you expect and you simply want to “bolt” a particular Database Table or Data File to a Stage in one of your Jobs to complete the lineage picture.

Data Source Identity. Use this when, for whatever reason, you want to link two identical tables for lineage purposes. Reasons? Two people might have imported the same metadata accidentally and you don’t want to delete it, or you might have the “design” information from an ERwin model and also the “actual” table information from the RDBMS catalog. There are many valid reasons. This link lets you relate tables together. They must have the same name — the option here lets you relate the “Schemas” of two different databases. Identical tables within those schemas will become linked for lineage reporting — and therefore, also linked to whatever those individual tables connect to for lineage.

Database Alias. This option establishes the connection between an abstract string in a DataStage Stage (server name, DSN name, etc., as defined by the relational stage) and the “Host/Database” combination that was actually imported. Database Tables in Metadata Workbench are typically “actual” tables — but in DataStage, as in any well-designed application, the “name” is a placeholder. This option assigns the placeholder to the host and database. The schema.tablename used in the Stage will then be matched against the Host/Database set of Tables to create a lineage connection. The list presented at this option will be entirely empty until you perform Automated Services; then it will be populated with each StageType and “server string” combination found in your Jobs.
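The alias lookup described above amounts to a simple two-step resolution, which the toy sketch below makes explicit. Every name here (stage type, DSN, host, database, table) is a hypothetical placeholder, not a value from a real installation.

```python
# Toy version of the Database Alias lookup: map the placeholder server
# string found in a DataStage stage to the imported Host/Database pair,
# then qualify schema.table against it. All names are HYPOTHETICAL.
aliases = {
    ("DB2ConnectorPX", "PROD_DSN"): ("dbhost01", "SALESDB"),
}

def resolve(stage_type: str, server_string: str, schema_table: str) -> str:
    """Return the fully qualified table the stage actually touches."""
    host, database = aliases[(stage_type, server_string)]
    return f"{host}.{database}.{schema_table}"

print(resolve("DB2ConnectorPX", "PROD_DSN", "SALES.ORDERS"))
# dbhost01.SALESDB.SALES.ORDERS
```

Once that qualified name matches an imported Database Table, the lineage connection is drawn automatically on the next stitching run.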

Hope this helps in understanding these options.

Ernie

Posted in Metadata Management.