Methodology for Building Extensions

Hi Everyone…

In the last post I talked about “why” Metadata Extensions are useful and important (Building Metadata Extensions….”Why?”). Today I want to review the basic steps in a high level “methodology” that you can apply when making decisions about extensions and their construction that will help you meet the objectives of your governance initiatives.

Step 1. Decide what you want to “see” in lineage and/or accomplish from a governance perspective. Do you have custom Excel Spreadsheets that you would like to illustrate as little “pie charts” in a lineage diagram, as the ultimate targets? Do you have mainframe legacy “green screens” that business users would like to see the names of as “icons” in a business lineage report? Are there home grown ETL processes that you need to identify, at least by “name” when they move data between a flat file and your operational data store? Lineage is helping boost confidence to the users, whether they are report and ETL developers, or DBAs tracking actual processes, or reporting users who need some level of validation of a data source. Which objects are “missing” from today’s lineage picture? Which ones would add clarity to the users “big picture”. Each of the use cases above represent scenarios where lineage from the “known” sources (such as DataStage) wasn’t enough. There are no industry “bridges” for custom SQL, personalized spreadsheets, or home grown javascript. And in the green screen case, the natural lineage that illustrated the fields from a COBOL FD left business users confused and in the dark.

The “…accomplish from a governance perspective” in the first sentence above takes this idea further. The value of your solution is not just lineage — it will be valuable to assign Stewards or “owners” to those custom reports, or expiration dates to the green screens. Perhaps those resources are also influenced by formal Information Governance Rules or Terms in the business glossary. The need to manage those resources, beyond their function in lineage, is also something to measure.

Step 2. How will you model it inside of Information Server? Once you know which objects and “things” you want to manage or include in lineage, what objects should you use inside of Information Server? The answer to this is a bit trickier. It requires that you have some knowledge of Information Server and its metadata artifacts, how they are displayed, which ones exist in a parent-child hierarchy (if that is desirable), which ones are dependent upon others, what does their icon look like in data lineage reports, etc. There aren’t any “wrong” answers here, although some methods will have advantages over others. There are many kinds of relationships within Information Server’s metadata, and nearly anything can be illustrated. Generally speaking, if the “thing” you are representing is closest in concept to a “table” or a “file”, then use those formal objects (Database Tables and Data Files). If it is conceptual, consider a formal logical modeling object. If it looks and tastes like a report, then a BI object (pie chart) might be preferred. If it is something entirely odd or abstract (the green screen above, or perhaps a proprietary message queue), then consider an Extended Data Source. I’ll go into more details on each of these things in later posts, but for now, from a methodology perspective, consider this your planning step. It often requires some experimentation to determine how best to illustrate your desired “thing”.

Step 3. How much detail do you need? This question is a bit more difficult to answer, but consider the time-to-value needed for your governance solution, and what your ultimate objectives are. If you have a home grown ETL process, do you need to illustrate every single column mapping expression, and syntax? Or do you just need to be able to find “that” piece of code within a haystack of hundreds of other processes. Both are desirable, but of course, there is a cost attached to capturing explicit detail. More detail requires more mappings, and potentially more parsing (see further steps below). A case in point is a site that is looking at the lineage desired for 10’s of thousands of legacy cobol programs. They have the details in a spreadsheet that will provide significant lineage…..module name, source dataset and target dataset. Would they benefit by having individual MOVE statements illustrated in lineage and searchable in their governance archive? Perhaps, but if they can locate the exact module in a chain in several minutes — something that often takes hours or even days today — the detail of that code can easily be examined by pulling the source code from available libraries. Loading the spreadsheet into Information Server is child’s play — parsing the details of the COBOL code, while interesting and potentially useful, is a far larger endeavor. On a lesser note, “how much detail you need” is also answered by reviewing Information Server technology and determining things like “Will someone with a Basic BG User role be able to see this ‘thing’? “…..which leads to “Do I want every user to see this ‘thing’?”. Also important is whether the metadata detail you are considering is surfaced directly in the detail of lineage, or if you have to drill down in order to view it. How important is that? It depends on your users, their experience with Information Server, how much training they will be getting, etc.

Step 4. Where can you get the metadata that you need? Is it available from another tool or process via extract? in xml? in .csv? Something else? Do you need to use a java or C++ API to get it? Do you have those skills? Will you obtain the information (descriptions, purposes) by interviewing the end users who built their own spreadsheets? Is it written in comments in Excel? Some of the metadata may be floating in the heads of your enterprise users and other employees. Structured interviews may be the best way to capture that metadata and expertise for the future. Other times it is in a popular tool that provides push-button exports, or that has an open-enough model to go directly after its metadata via SQL. ASCII based exports/extracts have proven to be one of the simplest methods. Most often, governance teams are technical but do not often have resources with lower level API skills. Character based exports, whether xml, or .csv or something else, are often readable by many ETL tools, popular character based languages like PERL or similar, or even manipulated by hand with an editor like NotePad. I use DataStage because it’s there, and I am comfortable with it — but the key is that you need to easily garner the metadata you decided you need in the previous steps.

Step 5. Start small! This could easily be one of the earlier steps — the message here is “don’t try to capture everything at once”. Start with a selected set of metadata, perhaps related to one report, or one project. Experiment with each of the steps here with that smaller subset — giving you the flexibility to change the approach, get the metadata from somewhere else, model it differently or change your level of detail as you trial the solution with a selected set of users. Consider the artifacts that will have the most impact, especially for your sponsors. This will immediately focus your attention on a smaller set of artifacts that need to be illustrated for lineage and governance, and allow you to more quickly show a return on the governance investment that you are making.

Step 6. Build it! [and they will come 🙂 ] Start doing your parsing and construct Extensions per your earlier design. Extension Mapping Documents are simple .csv files…no need for java or .net or other type of API calls. Adding objects and connecting them for lineage is easy. Extended Data Sources, Data Files, Terms, BI objects — each are created using simple .csv files, and/or in the case of Terms, xml. I suggest that you do your initial prototypes entirely by hand. Learn how Extensions and other such objects are structured, imported, and stored. As noted earlier, I will go into each of these in more detail in future posts, but all of them are well documented and easily accessible via the Information Server user interfaces. Once you have crafted a few, test the objects for lineage. Assign Terms to them. Experiment with their organization and management. Assign Stewards, play with adding Notes. Work with Labels and Collections to experience the full breadth of governance features that Information Server offers. Then don’t wait — get this small number of objects into the hands of those users — all kinds of users. Have a “test group” that includes selected executives, business reporting users and decision makers in addition to your technical teams. Get their feedback and adjust the hand crafted Extensions as necessary. Then you can move on and investigate how you’d create those in automated fashion while also loading them via command line instead of via the user interfaces.

Keep track of your time while doing these things so that you can measure the effectiveness of the solution vis-a-vis the effort that is required. For some of your extensions, you may decide that you only need a limited number of objects, and that they almost never change — and no future automation will be necessary. For others, you may decide that it is worth the time to invest in your own enterprise’s development of a more robust parser or extract-and-create-extension-mechanism that can be implemented as metadata stores change over time. This also makes it simpler to determine when it makes sense to invest with an IBM partner solution for existing “metadata converters” that already load the repository. These are trusted partners who work closely with all of us at IBM to build solutions that have largely answered the methodology questions above in their work at other locations. IBM Lab Services also can help you build such interfaces. When appropriate and market forces prevail, IBM evaluates such interfaces for regular inclusion in our offerings.

Ultimately, this methodology provides you with a road map towards enhancing your governance solution and meeting your short and longer term objectives for better decision making and streamlined operations via Information Governance.


Creating Data File objects from inside of DataStage

A seldom used object in Metadata Workbench is a “Data File”. It is not as common because it has to be manually created. Database Tables are created whenever you use a Connector or other bridge to import relational tables from a database. Data Files, however, can only be created manually, using the istool workbench generate feature, or from inside of the DataStage/QualityStage Designer.

Why create Data Files?

A Data File is the object available in the Metadata Workbench that represents flat files, .csv files or DataSets. It is able to connect to the Sequential Stage or Dataset Stage for data lineage purposes. A Data File object might be used also for pure governance reasons — a special transaction file might be defined by a particular Business Term, or you might want to assign a Steward to the Data File object — the subject matter expert on one particular file. Of course, if you are a DataStage user, you probably use regular Sequential Table Definitions all the time. Data Files are similar but are more “fixed” — they are designed to represent a specfic flat file, on a given machine, and in a particular sub-directory, as opposed to being a general metadata mapping with proper column offsets for any file that matches the selected schema.

The simplest way to create a formal Data File is to start with a DataStage Table Definition. You may already have one that was created when you imported a sequential file, or can easily create one using the “Save” button on any column list within most Stages. Once you have the Table Definition, double click on it. Review all of the tabs across the top. Pay special attention to the “Locator” Tab. Click on that one. Look at its detail properties. Values at the Locator tab control the creation of Data Files or Database Tables.

Set the pull-down option at the top to “Sequential”. If that value is not already in your pull-down list, type it in… Towards the bottom you will see an entry for the Data Collection — put in the name you want for your file. Close the Table Definition.

Now put your cursor on that Table Definition in the “tree”. Right mouse and select “Shared Table Creation Wizard”. When that dialog opens, click Next. Then open the pull-down dialog and select “create new”, and click Next. Notice the properties at this new page….you have the Filename, the Host (pick a machine or enter a new one) and Path. Make the filename the SAME as what you have hard coded in your Sequential or Dataset Stage, or the filename of any fully expanded Job Parameter default values that you are passing into it. Then set the “Path” value to the fully qualified path of the expanded Job Paramters or what you have in the same filename property. For example, if your filename in the Stage looks like this:

/tmp/myfile/#myfilename# …and #myfilename# has a default value of mySequentialFile.txt

Then use mySequentialFile.txt as the Filename and /tmp/myfile (without the final slash) for the path. Now you will have a Data File inside of Metadata Workbench that you can govern with Steward and Term assignments, and it also will stitch to the Stages that use its name in hard coded fashion or expanded Job Parameters for Design time or Operational lineage.


Lineage for RDBMS “Views”

Hi Everyone…

Someone asked me yesterday about being able to perform lineage in Metadata Workbench on a database “View”. It dawned on me that I may have never created a post on this very important subject. Formal Database views, as created in the rdbms catalog via a CREATE VIEW sql statement, are fully supported for Data Lineage purposes by Information Server and the Metadata Workbench.

The key is in the method of import.

When you import your rdbms catalog information via Connector (in 8.7 Metadata Asset Manager, or in any 8.x release via DataStage, Information Analyzer, or FastTrack), the views are imported and get their own icon for display purposes within the “Hosts” tree (or Implemented Data Resources if you are in 8.7). The details of the view are available for display, and will show the SQL used in the CREATE VIEW statement to originally create it.

More importantly, when you perform “Automated Services” (a.k.a. “stitching”, or in 8.7, “Detect Associations”), Metadata Workbench will parse thru the SQL of the CREATE VIEW and establish data lineage connections to the “source tables” of the view! Once this is done, you will have lineage for the view back to its source tables, and of course, anything that is upstream from those tables, or downstream from the view itself!


Connecting Shared Tables to your DataStage/QualityStage Jobs

At this point you should be familiar with Shared Tables, and understand how data lineage works within, and among, multiple DataStage Jobs. Stage to Stage lineage is very useful, and may be all that you require. It’s also powerful, though to go beyond this and connect your Jobs to the tables that you have imported from various places.

[click here for the prior post in this series, Linking your DataStage Jobs together ]


One of the key reasons for including Shared Tables in lineage is for “Business Lineage”. This is a high level summary of lineage that doesn’t illustrate the lower level transformation details…just the “key” sources and targets (files, tables, reports, etc.) along the way. Another is to connect your Jobs to “external, non Information Server assets.” DataStage doesn’t write directly to a (for example) Cognos report….it writes to a table somewhere, and that table is then read by the business intelligence tool. The connection thru that table is critical for accurate data lineage reporting. Here is how Metadata Workbench makes the connection between a DataStage Job and a table…

Shared Tables have a four (4) part name: Host/Database/Schema/Tablename (see other entries in the table of contents on Shared Tables for more details on the names of Shared Tables or alternatives for importing them).

Relational Stages in DataStage typically use two (2) parts to identify a particular table. “Server” or “Database” or “DSN” name, and tablename. The Server/DSN/Database name is usually in a dedicated property within the Stage. The Tablename might be in a dedicated property, or it could also be embedded in User Defined SQL. Any of them might be hard coded or established via Job Parameters.

The first thing Metadata Workbench needs to do is to “map” the abstract “Server/Database/DSN” name to a particular “Host” and “Database” combination in the list of Shared Tables. Like any application, DataStage is just pointing to some abstract “string” when trying to find a database. An ODBC DSN, for example, “might” be the name of the database, but it could also just be “myODBCdatabase”, which really points to (in the ODBC definition) a DB2 table called HRMAIN. Even if we use the string HRMAIN in the “Server” property of the Stage, we still need a way to identify the particular host that we are considering for data lineage. This is done via the “Database Alias” link in the “Advanced” tab of the Metadata Workbench.

Go to the Advanced Tab after signing into the Workbench. Perform Automated Services for one of your projects (this could take a long time if it’s the first time you are doing it). When it finishes, click on “Database Alias”. Look carefully at the values there. These are the “strings” that are used in your various Jobs to identify databases. Pick the string that is appropriate for the Stage Type that you are working with and slide your cursor over to the right. The “Add” button will allow you to select the desired host/database combination that this abstract string should connect to. In the example noted above, I might assign the string (alias) myODBCdatabase to the host/database combination of QR2H004/HRMAIN. QR2H0004 with Database HRMAIN must be something I have already imported and is viewable in the left navigation pane of the Workbench, or in the Repository Management tab of the Information Server Web Console.

Save it (button on the lower right).

The next time you perform Automated Services, whenever that Stage type with that particular string (myODBCdatabase) is found, Metadata Workbench will use QR2H004/HRMAIN, combined with the fully qualified tablename* in the Stage, to match to a particular Shared Table that has been imported previously.

[* Note… if your tablename is NOT fully qualified, Metadata Workbench will, for most RDBMS Stage Types, pick up and use the value of the Username in order to identify your table]

Are you using Job Parameters? For Design based lineage, Metadata Workbench is smart, and will use the “default values” of those Job Parameters when finding alias “strings” and also when obtaining the fully qualified (schema.tablename) tablenames to use in the linking.

Are you using $PROJDEF? Run the ProcessEnvVars shell or bat file inside of /IBM/InformationServer/ASBNode/bin to obtain the project definitions for use in this algorithm.

Operational Metadata (a whole separate blog entry is needed to discuss OMD) is used to populate the Job Parameter values from “run time”…otherwise the same rules apply. If you are just starting, get to know lineage w/o worrying about OMD. That’s an advanced topic. Understand how database alias works by performing Automated Services, reviewing the Database Alias page in the Workbench, doing the assignment and running Automated Services again. Once it is complete, go to the actual table using the navigator frame at the left (open the Host tree and find the table) and right mouse and select “data lineage”. If it is a source, select “where does this go to”…..if it is a target, select “where did this come from”…… and validate your results.