Apache Atlas Update: Have you been watching?

It has been a while since I’ve written anything.  Time to “catch up!”

A lot has been happening in the world of metadata management and governance.  We are now seeing many real-life use cases, as machine learning, intelligent data classifications, graph database technology and more are being applied to the information governance domain.  Efforts for standardization in the metadata and governance space are also moving forward.  For this post, let’s take a look at Apache Atlas.

Apache Atlas continues to mature, celebrating several major milestones in 2017.  Shortly after its second birthday (Apache Atlas was launched as an incubator project in May of 2015), Apache Atlas graduated to top-level project status, signifying that the project’s community and products have been well-governed under the Apache Software Foundation’s (ASF) meritocratic process and principles.  This is evidence of the hard work of the collective Apache Atlas team, and a sign that Apache Atlas is increasingly ready for real-world implementations.  Of course, that milestone, while worthy of recognition, is just one of the many steps Atlas has taken and continues to take.  Here are other significant developments for Apache Atlas this year:

  • Introduction of OMRS and its other complementary APIs.  OMRS is a key part of the Open Metadata framework that introduces the notion of repository metadata sharing and access.  In the true spirit of Apache communities, Apache Atlas is not alone in the world of enabling information governance; sharing of metadata between diverse metadata repositories can now be realized, in addition to simpler federation of metadata across multiple Atlas repositories.
  • New common models for critical types of metadata.  To facilitate metadata sharing via OMRS, and to establish a more widely adaptable set of asset definitions, the Atlas team agreed on a common definition for data structures, processes, and other data asset attributes.  This helps facilitate metadata sharing by increasing the likelihood that integrators building interfaces to Atlas will choose a common type definition for their content instead of designing their own custom types, while still providing extension points if needed.
  • New Glossary Model.  A detailed new glossary model was designed (and API implemented) for a stronger semantic layer.  Business concepts and their relationships are the cornerstone of disciplined information governance.
  • Streamlining of the Apache Atlas infrastructure.   The underlying graph database implementation was upgraded to take maximum advantage of JanusGraph, itself becoming the leading standard for open source graph engines.
  • Continued/ongoing clean-up of the install and build procedures.  Considering the wider adoption of Apache Atlas throughout the governance community, the Atlas team has enhanced its test suites to ensure that new functionality is well tested and that the build and install processes are more streamlined.  For example, Apache Atlas can now be packaged and built within Docker containers.
  • The number of new Committers!  Apache, as everyone knows (or should know), is a meritocracy.  This means that recognition and influence are determined by an acknowledged investment of time, effort, and contributions.  Formal recognition as a committer requires many months of hard work moving a project forward.  Congratulations to all the new Committers this year!  Even more important, the increase in Committers and contributors overall is yet another illustration of how Apache Atlas is growing in importance and general industry awareness.
  • The Virtual Data Connector (VDC) use case.  Self-service data exploration environments need to provide an integrated view of data from many different systems and organizations.  Access is needed in order to discover new uses and interesting patterns in the data.  The VDC project aims to provide a single endpoint for accessing data that presents a virtualized view of the data assets with the appropriate data security.  This is accomplished by extending the integration of Apache Atlas with Apache Ranger via the tag-based security access introduced in Apache Atlas in 2016, in order to provide security access based on the classification tags (e.g., PII and SPI tags, subject area of the data, etc.).  An additional plug-in is added to Apache Atlas to control access to metadata based on whether an end user is allowed to discover a data source’s metadata.
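
To make the tag-based access idea in the Virtual Data Connector item a little more concrete, here is a minimal Python sketch.  It is purely illustrative and is not the Apache Ranger or Apache Atlas API: the policy table, the role names, and the helper functions are all hypothetical, and the sketch assumes a simple rule where a restricted tag hides an asset unless the user holds a role allowed for that tag.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """A data asset with the classification tags attached to it."""
    name: str
    tags: set = field(default_factory=set)


# Hypothetical policy table: classification tag -> roles allowed to see
# assets carrying that tag. Tags not listed here are treated as unrestricted.
TAG_POLICIES = {
    "PII": {"privacy_officer"},
    "SPI": {"privacy_officer", "security_analyst"},
}


def can_access(user_roles, asset):
    """Return True only if every restricted tag on the asset is covered
    by at least one of the user's roles."""
    roles = set(user_roles)
    for tag in asset.tags:
        allowed = TAG_POLICIES.get(tag)
        if allowed is not None and not (allowed & roles):
            return False
    return True


def discoverable(user_roles, assets):
    """Names of the assets whose metadata this user is allowed to discover."""
    return [a.name for a in assets if can_access(user_roles, a)]
```

With an asset tagged PII and an untagged asset, a user holding only a general analyst role would discover just the untagged one, while a user with the privacy_officer role would discover both.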

So… it’s been a very busy year for Apache Atlas.  While most of these capabilities have already been developed and are being tested, they will become generally available in the upcoming Apache Atlas v1.0, which will be a huge milestone release for the community.  The project is maturing, and gaining increased attention across the industry, in the information governance space, and beyond.  The code continues to mature, with adoption and the variety of applications increasing every week.  The critical mass of industry expertise contributing to Apache Atlas continues to grow.  Start watching!  Start playing!  Join in and help Apache Atlas reach its next set of milestones!



Main Apache Atlas web site
Atlas Wiki

Links to specific Apache Atlas Topics

Open Metadata and Governance
Link to more details on OMRS
Building Out the Open Metadata Typesystem
Virtual Data Connector


New Governance Blog covering IGC

Hi everyone… here’s a pointer to another IGC and Governance resource written by some of my IBM colleagues.  This post includes details on the advantages of using OpenIGC to extend governance to any kind of asset:  https://ibm.co/2AGpdaq   Happy reading!


Explore the Benefits of Information Governance with the IGC Trial

Earlier today we released the first implementation of the InfoSphere Information Governance Catalog Trial!  This is a downloadable module that lets you quickly and easily get a closer look at the Information Governance Catalog (IGC), complete with real pre-loaded metadata, business terms, and lineage.  It is a modified Docker-based implementation, and is not intended for production use or full-blown Information Server capability, but it allows you and your team to explore IGC, work with its features, and see how you can achieve your governance objectives for common understanding, data quality monitoring, and data lineage.  Tutorials and videos will also guide you along the way.  The links below go into far more detail and also lead to the formal download page…  Good luck and enjoy!


Main IGC Trial download page and introduction…  https://www.ibm.com/us-en/marketplace/information-governance-catalog

Overview of the IGC Trial and its benefits…  https://www.linkedin.com/pulse/fast-track-your-data-information-governance-catalog-rakesh-ranjan

Insightful post from Marc Haber, IGC Offering Manager… http://www.ibmbigdatahub.com/blog/how-take-next-step-information-governance-now

IBM and Hortonworks!

Hi everyone…

Some exciting recent news, if you haven’t seen it yet…announced a few days ago at the DataWorks Summit/Hadoop Summit in San Jose, a new relationship between IBM and Hortonworks!   Read about it here to learn how IBM and Hortonworks are partnering to further the efforts of our customers to expand their big data solutions.


More important for this blogger is the increased attention this brings to Apache Atlas.  Apache Atlas, if you aren’t already familiar, is an evolving open source approach to enterprise information governance, metadata management, and lineage […go here for a general overview:  https://hortonworks.com/apache/atlas/ ].  One highlight from the news above draws particular attention to the contributions IBM and Hortonworks are making to this effort:

“Partnering On Apache

As part of their wide-ranging partnership, the companies will also team to advance the development of Unified Governance (IBM BigIntegrate, IBM BigQuality and IBM Information Governance Catalog) on the Apache Atlas open platform. …”

It’s all a work-in-progress, but this is significant news that will hopefully accelerate the initiative.   Have any of you started working heavily with Atlas?   Which release?  Are you using it exclusively with Hadoop, or externally?   Have you interchanged metadata with Atlas and IGC?  Considering it?    Share your experiences!


Related posts:

Evolving Atlas…




Re-defining Data Lineage

Well… not so much “re-defining” as refining, and adding clarity to the definition and the discussion.  Please find the time to review this excellent blog entry by my IBM colleague, Distinguished Engineer and thought leader, Mandy Chessell:  https://poimnotes.blog/2017/03/19/understanding-the-origin-of-data/


OpenIGC Accelerator

Hi Everyone…

Happy Spring! [for those of you in the northern hemisphere  ; )  ].   Great time to start “cleaning out” and “fixing up” things….whether around the house, or in the corners of our special projects.    In that latter category, I have “tidied up” a little utility I have been working on to assist everyone in building their OpenIGC prototypes or to assist in “getting to know” OpenIGC — a “form builder” for the “Publishing XML” needed to realize instances of your newly modeled and registered OpenIGC artifacts.

A lot of you have expressed the desire to get deeper into OpenIGC, but have found it difficult to get your arms around the xml aspects of it.  Either that, or cutting and pasting xml in a text editor is just not your thing.   For those reasons and others, I have been exploring various ways that a user interface could be created for OpenIGC assets — without resorting to an elegant albeit complex and lengthy GUI development effort.

Digging around, I found some open source JavaScript tooling to assist, and brushed off enough JavaScript and HTML skills to put it together.  At the URL listed below you will find a tool that allows you to upload your bundle descriptor and generate a self-populating “form” to construct a publishing XML document for OpenIGC.  It also provides options to save the publishing XML to disk (for future use/editing) or to directly cut and paste it into the igc-rest-explorer page.

It’s not “perfect” (I suspect it probably has its share of anomalies if you click on things out of order), but is hopefully a “helper” that will accelerate your efforts to implement custom assets for governance within IGC.

Please carefully READ the instructions (there is a link to instructions and a simple screen shot on the initial page).  The tool does not entirely “hide” your XML, and it REQUIRES that you understand your bundle (if you don’t know what I am talking about regarding OpenIGC and bundles, please review the blog series starting with https://dsrealtime.wordpress.com/2015/07/29/open-igc-is-here/ )!  Still, it does a few nice things for you:

  • Performs all the xml tagging/formatting, ensuring that your xml remains “well-formed”
  • Presents a “pull-down” select list for your classNames and attribute enumerations
  • Generates the list of attributes (properties) for whatever class you select
  • Automatically generates the unique “assetIDs” for the asset instances that you define
  • Generates and presents a pull-down list for selecting “parent” assetIDs
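
For readers who want a feel for what those steps involve, here is a rough Python sketch of the same ideas: letting an XML library handle the tagging and escaping, generating unique asset IDs, and pointing children at a parent ID.  It is not the tool itself (which is JavaScript), and the element and attribute names used here (doc, assets, asset, attribute, reference) are illustrative placeholders rather than a statement of the actual OpenIGC publishing schema; check your own bundle and the OpenIGC documentation for the real format.

```python
import itertools
import xml.etree.ElementTree as ET


def build_publishing_xml(instances):
    """Build a small XML document from a list of asset descriptions.

    Each item is a dict with:
      class_name: name of the asset class (hypothetical)
      attributes: dict of attribute name -> value
      parent:     index of the parent item in the list, or None
    Parents must appear in the list before their children.
    """
    counter = itertools.count(1)
    root = ET.Element("doc")
    assets = ET.SubElement(root, "assets")
    ids = []  # generated asset IDs, in the same order as `instances`

    for inst in instances:
        asset_id = f"a{next(counter)}"  # unique ID per asset instance
        ids.append(asset_id)
        asset = ET.SubElement(
            assets, "asset", {"class": inst["class_name"], "ID": asset_id}
        )
        for name, value in inst["attributes"].items():
            # ElementTree escapes &, <, > and quotes, so the output
            # stays well-formed whatever the attribute values contain.
            ET.SubElement(asset, "attribute", {"name": name, "value": str(value)})
        parent = inst.get("parent")
        if parent is not None:
            ET.SubElement(
                asset, "reference", {"name": "parent", "assetIDs": ids[parent]}
            )

    return ET.tostring(root, encoding="unicode")
```

Calling build_publishing_xml with a "folder" item followed by a "file" item whose parent is 0 yields a document in which the second asset carries a reference to the first asset's generated ID.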

As noted above, I can’t promise that it is entirely bug-free, but I can say that it has already helped me accelerate the prototyping of several bundles that I have been building recently to illustrate the power of OpenIGC for extending the repository.    Have fun, good luck, and please let me know how you make out in using this tool!       –ernie



Accessing IGC via cURL

Hi Everyone…

This is a long overdue post …pointing to an article written by one of my IBM colleagues about accessing IGC metadata via its REST APIs — using cURL as your tooling.   He provides some excellent examples, complete with screen shots and recommendations.  Enjoy!