The Linux Foundation Projects
Skip to main content

LF AI & Data is excited to announce its newest sandbox project: Unity Catalog. Originally developed and introduced by Databricks in 2021, Unity Catalog has now been open-sourced and donated to the LF AI & Data Foundation. This marks a significant advancement in our mission to support and sustain open source innovation in artificial intelligence (AI) and data.

Introducing Unity Catalog

Unity Catalog offers an open source solution designed to seamlessly integrate with various clouds, data formats, and data platforms. Today’s data platforms often resemble walled gardens, employing “native tables” that are inaccessible in open formats or imposing costs on organizations for always-on computing. Furthermore, data and AI assets such as tables, unstructured data, and AI tools, are arbitrarily siloed into different single-purpose tools. This creates a fragmented landscape of siloed data and governance. Unity Catalog aims to break down these barriers and foster a more unified and flexible approach to data governance.

Features and Capabilities of Unity Catalog

The first release of Unity Catalog is available today. While some of our APIs and features will still be evolving, this release showcases several important capabilities of Unity Catalog:

Universal Interoperability

Unity Catalog provides a universal interface that supports any data format and compute engine. This includes the capability to read tables with Delta Lake, Apache Iceberg™, and Apache Hudi™ clients via Delta Lake UniForm, as well as supporting the Iceberg REST Catalog and Hive Metastore (HMS) interface standards. The catalog is also interoperable with major cloud platforms like Microsoft Azure, AWS, GCP, and Salesforce, and compute engines such as Apache Spark™, Presto, and Trino.

Unity Catalog is supported by a vibrant community of industry leaders and innovators in data and AI – including major cloud platforms such as Microsoft Azure, AWS, GCP, and Salesforce, data and AI platforms such as NVIDIA, Confluent, dbt Labs, Fivetran, Granica, Immuta, Informatica, LanceDB, LangChain, LlamaIndex, Tecton, Unstructured, and compute engines such as CelerData / StarRocks, Daft, DuckDB, and PuppyGraph.

We are already seeing a lot of exciting innovation from partners for AI / ML use cases with Unity Catalog, enabling users to

  • Build unstructured data ETL pipelines to Unity Catalog Volumes with platforms like UnstructuredIO
  • Build LLM agents which can run functions in DBSQL from libraries like LangChain
  • Leverage tabular and non-tabular data together in Python query engines like Daft for multimodal data processing

Unified Governance

With Unity Catalog, organizations can achieve unified governance across all their data and AI assets, simplifying management, discovery, and development at scale. This integrated approach helps in maintaining a cohesive governance framework and consistent access control, thereby reducing complexity and enhancing security.

Open and Flexible

Emphasizing openness, Unity Catalog is built with open APIs and is licensed under Apache 2.0. This ensures maximum flexibility and broad interoperability across various engines, tools, and platforms, allowing customer choice and adaptability.

Join Us on This Exciting Journey

We invite you to join us in this exciting journey to further develop Unity Catalog and help shape the future of data management and governance. Together, we can drive forward the principles of open source, enhancing the landscape of AI and data technology for everyone.

For more information and to get involved with Unity Catalog, please visit https://www.unitycatalog.io/ or the Unity Catalog repository at https://github.com/unitycatalog/

LF AI & Data Resources

Access other resources on LF AI & Data’s GitHub or Wiki