The requirement to call Web Services from ETL (Extraction-Transformation-Load) is becoming increasingly common. Web Services are being included in ETL streams as a source or target, or for re-use of a critical calculation or function. Often they are invoked on a row-by-row basis for Transformation purposes. Web Services are defined by their WSDL (Web Services Description Language) documents, but most of the time the WSDL is just a “signature” that identifies the input and output parameters. What happens “inside” the Web Service? What actions are being performed on those input parameters? How useful would it be to know exactly what is being done to your data when it is sent to a remote “black box”? Services repositories, if utilized at your site, provide a rich source of metadata: they may include descriptions of the function, who owns it, and when it was last updated, and can provide automation for the bindings. However, many sites have not standardized on one particular repository, and more importantly, even if such information is available, it is unlikely that the repository participates in data lineage.
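To make the row-by-row pattern concrete, here is a minimal sketch (not any product API) of a Transformer-style step that invokes a remote service once per input row. A local stub stands in for the real SOAP/HTTP call; its inputs and outputs mirror what a WSDL would declare, while everything inside it remains the “black box”:

```python
# A minimal sketch, assuming a hypothetical name-standardization service.
# The stub below stands in for the remote call -- only its "signature"
# (inputs and outputs) is visible, just as with a WSDL.

def remote_standardize(name):
    """Stand-in for the remote Web Service; the logic inside is hidden."""
    return {"name_std": " ".join(name.split()).upper()}

def transform(rows):
    """Row-by-row Transformation: one service call per input row."""
    for row in rows:
        row.update(remote_standardize(row["name"]))
        yield row

source = [{"id": 1, "name": "  ada   lovelace "}]
result = list(transform(source))
print(result)  # name_std becomes "ADA LOVELACE"
```

Each row pays the round-trip cost of one service call, which is why visibility into what the service actually does (and to which columns) matters for both lineage and performance.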
Data lineage is the ability to view a complete source-to-target flow, through every process, every function, and every intermediate step. It illustrates what happens to a column: how it is used in expressions, when its name changes, and when it is combined with another column to create something new. Data lineage helps decipher where a value came from and how it was derived. Do your lineage reports have visibility into the “black boxes” that you are calling remotely? Would you like to “see” where the data flows when it goes “outside” the system? The IBM InfoSphere Metadata Workbench has a new capability called the Data Lineage Extender. This new feature lets me describe the details of any external process and include it in all of my metadata exploration and analysis.
In this blog I try to focus on real-time issues, and so will start with the benefits of defining a remote Web Service; however, the Data Lineage Extender and the “Extension Mappings” that it offers are not limited to Web Services. Extensions, as they are also called, can be used to define external scripts, Stored Procedures, Java classes, COBOL programs, third-party transformation tools — anything that you believe is important for completing your lineage picture.
Extensions appear as special new objects in the Metadata Workbench, can be queried and viewed like any other object in the repository, and, as noted earlier, participate fully in data lineage reports and graphs. I had the chance to start working with this new capability a few months ago in our lab, and believe that it is going to help fulfill the lineage requirements of the many data integration applications that have “missing links” or “gaps,” whether or not they revolve around DataStage and QualityStage.
To establish an Extension, first identify the external process you want to include in your lineage, then decide how much detail is desired. Extensions support specifications at many levels (Table, Column, DataStage Stage, and so on) and allow one-to-one, one-to-many, and many-to-one mappings to other Extensions or to native DataStage objects (for example, First Name and Last Name from Object A map to Name in Object B). Where possible, define the mappings of your external object at the column level.
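As a sketch of the kind of column-level detail worth capturing, such mappings could be drafted as a simple CSV before being entered into the Workbench. Note that the field names below (“source”, “rule”, “target”) are illustrative assumptions, not the product’s actual import format:

```python
import csv
import io

# Hypothetical column-level mapping rows; "source", "rule", and "target"
# are invented field names for illustration, not the Metadata Workbench
# import schema.
mappings = [
    # many-to-one: two source columns combine into one target column
    {"source": "ObjectA.FIRST_NAME;ObjectA.LAST_NAME",
     "rule": "Concatenated with a single space",
     "target": "ObjectB.NAME"},
    # one-to-one: a straight pass-through
    {"source": "ObjectA.CUST_ID",
     "rule": "Copied unchanged",
     "target": "ObjectB.CUST_ID"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["source", "rule", "target"])
writer.writeheader()
writer.writerows(mappings)
mapping_csv = buf.getvalue()
print(mapping_csv)
```

Drafting the mappings this way forces the column-level thinking the Extension needs: every target column gets an explicit source (or sources) and a short rule describing the derivation.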
Let’s look at an example. Here is a snippet of a Job that reaches out to a remote “Lookup” Web Service. The WSDL was brought into the Job, but there is no visibility into the Service beyond its input and output parameters*:
A data lineage report without Extensions will simply show that the Web Services Transformer is invoked, with little other detail. To “complete the lineage picture,” we will want to create an Extension that defines what happens within the Lookup. This is done inside the Metadata Workbench. The screen shot below shows the details of a new Extension Mapping. It contains the columns involved in the lookup, commentary about the type of Lookup that yields the “C0_CorpExpFactor” value, and also provides the ability, via balloon help, to see the ultimate source of my Lookup Column:
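To make the idea concrete, here is a hypothetical reconstruction of what such a remote Lookup might do internally. The reference data, key columns, and default value are invented for illustration; the point is that this is exactly the kind of logic the Extension Mapping lets you document, so lineage can show how the factor value is derived:

```python
# Invented reference data standing in for whatever tables the remote
# service actually consults; the real data is hidden behind the Service.
FACTOR_TABLE = {
    ("CORP", "NA"): 1.15,
    ("CORP", "EU"): 1.08,
}

def lookup_corp_exp_factor(acct_type, region):
    """Keyed lookup with a fallback, as the Extension commentary
    might describe it: unmatched keys default to 1.0."""
    return FACTOR_TABLE.get((acct_type, region), 1.0)

print(lookup_corp_exp_factor("CORP", "EU"))    # 1.08
print(lookup_corp_exp_factor("RETAIL", "NA"))  # 1.0 (default)
```

Even a prose version of this (“keyed lookup on account type and region, default 1.0”) in the Extension’s commentary is enough to turn the black box into a documented step in the lineage graph.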
Now that the Lookup Extension is defined, it will appear in a Data Lineage report. This next screen shot illustrates a data lineage report at the column level. On the left we see the lineage for the column CORPEXPFACTOR and where it came from; on the right we see the context for each column, whether part of a DataStage Link and Job or within an Extension Mapping (LookupWebService).
Clicking on the “LookupWebService” icon or link on the right will take you to the details of the definition (the previous screenshot of the Extension Mapping Document) where you can review the methods used for the lookup and any commentary from the author. This first lineage view is called the “textual” view of Data Lineage. When requesting a lineage report I can alternate between this textual view and the “graphical” view:
Over the next few weeks I’ll be documenting my experiences with this feature, outlining the various use cases, and sharing the techniques and methods used to define Extension Mappings.
* …of course, philosophically, one may wish to be entirely abstracted from a remote Web Service. “Who cares what tables it accesses, or what procedures it uses, provided I can trust it and rely on its execution?” That’s one of the benefits of Service Oriented Architecture, right? This is a debate with many valid positions; however, most sites I speak to want to ensure extensive re-use of their Services and simpler use and maintenance, yet still maintain visibility into data quality and source/target details. Looking at the WSDL alone doesn’t tell you very much. If your directory strategy provides more than just a WSDL catalog, great. The Extensions described here may help complement what you are already doing, or “fill in the gaps” where there are holes.