Once you have mastered the “navigation” and asset selection options of Data Lineage reporting, it’s time to look at how DataStage Jobs are automatically linked together. By now you should be comfortable with thinking about your “starting position” for a data lineage report — your initial “perspective” if you will (what object are you standing on when you begin). You should also be comfortable with thinking about the “direction” for your Data Lineage investigation — are you looking “upstream” for “Where did this come from?” or downstream for “Where does this go to?” [In 8.7 or higher, you no longer have to request “upstream” or “downstream”, but you should still give it some thought so that you have an idea of what you will be expecting or what you are looking for].
If you need a refresher on the basics, please see Getting Started with Data Lineage!.
A typical production site for DataStage/QualityStage has MANY Jobs — hundreds perhaps….even thousands. All integrated and working together to transform your data and move it from one place to another. Sometimes they are written by one very hard-working developer, who might have all the lineage in his or her head, but more often it is a larger scale endeavor, with lots of team members, often scattered around the globe, and with varied skill sets and possibly working on related albeit independent solutions. They may know each other, or may not. How are the DataStage Jobs sequenced from a data flow perspective? How does data flow between a Job developed to process data received via FTP from the mainframe and then ultimately to a datamart that supports a reporting system? How does one Job connect to another? Sometimes it may be one giant Job, but its not likely. Intermediate temporary tables are often created for everything from checkpoints to Operational Data Stores to “parking lots” where data can be restructured or delivered to another application along the way. Workbench can sort this out and provide you with lineage through all these Jobs.
Intra-Job Data Lineage (Data Lineage between Jobs) is largely automatic. You simply have to pay attention to a few “good sense” leading practices and understand the pattern. Note that this has “nothing” to do with Shared Tables or Table Definitions at all… it’s entirely done by merely parsing thru your Jobs [this is a key reason why you can get immediate insight on 7.x Jobs that are imported into the 8.x environment — even if you haven’t compiled a single one or started your formal testing and QA process!]
Automated Services is the “parsing” step at the Advanced Tab….when you say “run” with your Project(s) checked, Workbench combs through your Jobs, looking for similarities that will link Jobs together end-to-end. [Note: If you are using 8.7 or higher, this process is called “Detect Associations” …find it by first clicking on “Manage Lineage” at the Advanced Tab, and then clicking on the larger “arrow” icon on the right after selecting one of your DataStage Projects.] Here’s what it looks for among Jobs:
a) Common or “like” Stages between the Target of one Job and the Source of another. Two ODBC Stages are in common, but so is ODBC and say, Oracle OCI. Or DB2Load and DB2Connector. Or two Sequential Stages. Or two Dataset Stages.
b) At least one column in common. [Update Note… the research leading to this post was done originally in early 8.1 … I’ve since discovered in 8.5 and higher that the connection is more flexible now, and one column name in common is no longer required. The default values and Stage type are the major controlling factor in bringing Jobs together.]
c) Same hard coded values (yuk…who does that?) OR….same “default” values for Job Parameters for the critical common properties. For RDBMS type Stages, it’s ServerName, Schema, and Tablename. For Sequential type Stages, it’s the filename. The Automated Services will put together multiple Job Parameter default values if needed.
If your team follows good practices of having parameter sets or common values for things like an ODBC DSN, a high degree of lineage will often occur immediately after the first Automated Services (informally known as “stitching”) occurs. Expect that the “first” time you run Automated Services, it could take a long time and be very intense. Do it during off hours if you have hundreds or thousands of Jobs. After that first time it will recognize a delta and only parse thru the Jobs that are new or have been changed.
Now when you do your lineage reporting, and you start while “standing” on the target Stage of an application down-stream Job….when you ask for “Where does this come from?”, you can expect to see Stages through many Jobs back to the ultimate source. If it dead-ends surprisingly, it’s probably because one of the three rules above didn’t apply, or there’s an odd Stage that isn’t supported for lineage (Redbrick is one of the only ones I’m aware of at this point).
If that fails, use the “Stage Binding” at the Advanced Services tab. This is one of the options for “manual” binding of metadata — a sort of “toolbox” of wrenches and bolts for when you need ’em. The Stage Binding is designed to be used, only when absolutely necessary, to “force” two Stages together for lineage purposes. It is fairly easy to use…it prompts you first for the name of a Job, and then when it presents you with a list of Stages, slide your cursor over to the right so that you can “add” a Stage from another Job. I have used it effectively for unsupported Stages, as noted, but also when the rules above don’t apply. In one case I was sending the output of an XML stage into a Sequential File…and in the next Job, I was reading that with an “External Source” Stage. There is nothing in common between those stage types, and the columns were entirely different (the Sequential File Stage contains a column called “myXML” and the External Source merely carries the output of a unix list command (a set of filenames). I was able to establish perfect lineage however, by using a Manual “Stage Binding”, forcing the Sequential Stage of the first Job and the External Source Stage of the second Job to be “bolted” together.
Good reporting! Next topic — Connecting Database Table objects to your Jobs
Ernie