OpenLineage, a recently graduated project of the LF AI & Data Foundation, is an open standard for metadata and lineage collection. It can instrument jobs as they run and exchange static lineage about datasets and jobs. The spec defines a generic model of run, job, and dataset entities identified using consistent naming strategies. The core lineage model is extensible: specific facets can be defined to enrich those entities. Integrations provide built-in support for Apache Airflow, Spark, Flink, dbt, data warehouses, and more.
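
To make the run, job, and dataset model more concrete, here is a minimal sketch of how a run event might be assembled as plain JSON using only the Python standard library. The top-level field names (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`, `schemaURL`) follow the OpenLineage spec as we understand it; every namespace, job name, dataset name, column name, and URL value below is a made-up placeholder, and the helper functions `dataset` and `run_event` are hypothetical, not part of any OpenLineage client.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical values for illustration only.
PRODUCER = "https://example.com/hypothetical-producer"
# Placeholder: consult the published spec for the exact schema URL to use.
SCHEMA_URL = "https://openlineage.io/spec/OpenLineage.json"


def dataset(namespace, name, facets=None):
    """A dataset is identified by namespace + name; facets enrich it."""
    return {"namespace": namespace, "name": name, "facets": facets or {}}


def run_event(event_type, run_id, job_namespace, job_name, inputs, outputs):
    """Assemble one event tying a run to its job and its datasets."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
        "run": {"runId": run_id},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": inputs,
        "outputs": outputs,
    }


# A schema facet enriching the output dataset (column names are made up).
schema_facet = {
    "schema": {
        "_producer": PRODUCER,
        "_schemaURL": SCHEMA_URL,
        "fields": [
            {"name": "order_id", "type": "INTEGER"},
            {"name": "total", "type": "DECIMAL"},
        ],
    }
}

run_id = str(uuid.uuid4())  # the same run ID links START and COMPLETE
source = dataset("example_warehouse", "public.orders")
target = dataset("example_warehouse", "public.order_totals", schema_facet)

start = run_event("START", run_id, "example_namespace",
                  "daily_order_totals", [source], [])
complete = run_event("COMPLETE", run_id, "example_namespace",
                     "daily_order_totals", [source], [target])

print(json.dumps(complete, indent=2))
```

The two events share one `runId`, which is how a consumer can correlate the start and end of the same run. The facet is just a named, nested object on the entity it describes, which is what keeps the core model small while leaving room for extension.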

OpenLineage’s graduation within the LF AI & Data Foundation represents the culmination of two years of effort devoted to cultivating adoption, fostering partnerships in the ecosystem, collaborating on integrations, and following best practices in governance and security. Graduation signals the project’s endorsement by leaders in the AI and data space, including Ericsson, IBM, Intel, and Huawei.

The project has grown significantly in both adoption and external contributions. OpenLineage also meets OpenSSF Gold Badge standards for project governance, code quality, and security. OpenLineage’s growing roster of partners includes Atlan, Egeria, Manta, and Microsoft. Its Technical Steering Committee members are affiliated with Apache Airflow, Apache Iceberg, Apache Parquet, dbt, Egeria, Marquez, Microsoft, Snowflake, and Superconductive.

Recent developments in the project include:

  • Completion of an OpenLineage Provider in the Apache Airflow project, making OpenLineage support a built-in feature of Airflow rather than an externally maintained integration (coming soon in Airflow 2.7).
  • Release of Static Lineage capability in OpenLineage 1.0, adding support for static (AKA “design-time”) use cases such as dataset ownership checks.

Key partnerships in the OpenLineage ecosystem:

  • Atlan’s integration uses job facets to catalog operational metadata from pipelines, enrich existing assets, and provide persona-based lineage information using OpenLineage SDKs.
  • Manta’s OpenLineage Scanner uses job facets to ingest OpenLineage metadata and enrich overall enterprise data pipeline analysis.
  • Microsoft’s integration lets users both extract and visualize lineage from their Databricks notebooks and jobs inside Microsoft Purview. 
  • Egeria’s integration captures OpenLineage events directly via HTTP or the proxy backend. It also publishes events to lineage integration connectors with OpenLineage listeners registered in the same instance of the Lineage Integrator OMIS.
  • Astronomer’s integration uses the OpenLineage Airflow library (`openlineage-airflow`) to extract lineage from Airflow tasks and stores that data in the Astro control plane. This package includes default extractors for popular Airflow operators. The Astronomer UI then renders a graph and list of all tasks and datasets that include OpenLineage data. Astronomer is also a key contributor to the other open source OpenLineage integrations.
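
Several of the integrations above receive OpenLineage events over HTTP. As a rough sketch of what that can look like from the sending side, the snippet below builds (but does not send) a JSON POST request with Python's standard library. The endpoint URL is an assumption modeled on a local Marquez-style deployment, the helper `build_lineage_request` is hypothetical, and the event values are placeholders; a real backend may expect a different path, authentication, or additional fields.

```python
import json
import urllib.request

# Assumed endpoint for illustration; adjust to the backend actually in use.
LINEAGE_URL = "http://localhost:5000/api/v1/lineage"


def build_lineage_request(event, url=LINEAGE_URL):
    """Wrap an OpenLineage event dict in a JSON POST request."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder event; all identifiers and URLs here are made up.
event = {
    "eventType": "COMPLETE",
    "eventTime": "2023-09-01T00:00:00+00:00",
    "producer": "https://example.com/hypothetical-producer",
    "schemaURL": "https://openlineage.io/spec/OpenLineage.json",
    "run": {"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"},
    "job": {"namespace": "example_namespace", "name": "daily_order_totals"},
    "inputs": [],
    "outputs": [],
}

request = build_lineage_request(event)
print(request.get_method(), request.full_url)

# Sending is left to the caller, since it needs a running backend:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

In practice the official client libraries and integrations handle transport, so a hand-rolled request like this is mainly useful for understanding or debugging what goes over the wire.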

To learn more about OpenLineage, join us on Slack and check out the project website, where you can get started with Airflow+Marquez or Spark, find out about opportunities to get involved, read the docs, and more. To learn about contributing to OpenLineage, read the guide for new contributors on GitHub.
