LakeSoul is an end-to-end real-time Lakehouse framework for both BI and AI applications. It includes a centralized metadata layer built on PostgreSQL that manages large numbers of tables, partitions, and files in data lakes, with support for concurrent row-level upserts and ACID control. LakeSoul also provides tools to ingest data from RDBMS CDC streams and Kafka log streams in real time, with automatic table/topic discovery and DDL synchronization. Data managed by LakeSoul is stored in Apache Parquet format and accessed through a native Parquet IO layer optimized for cloud storage. Access layers are provided for big data engines such as Spark and AI engines such as PyTorch, so that both BI and AI applications can benefit from a real-time Lakehouse.

LakeSoul implements incremental upserts at both the row and column level and allows concurrent updates. It uses an LSM-tree-like structure to support updates on hash-partitioned tables with primary keys, achieving very high write throughput (30 MB/s per core) on cloud object stores such as S3 while providing optimized merge-on-read performance. LakeSoul scales metadata management and achieves ACID control by using PostgreSQL. It also provides tools to ingest CDC and log streams automatically in a zero-ETL style.
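
The upsert and merge-on-read idea described above can be illustrated with a small, purely in-memory sketch. This is not LakeSoul's API or implementation: the class `ToyUpsertTable`, its methods, the `id` primary key column, and the CRC32 bucketing are all hypothetical choices made for illustration. It only shows the general pattern of routing rows to hash buckets by primary key, appending each write as a new sorted run, and resolving the latest value per key (and per column) at read time.

```python
import zlib
from collections import defaultdict


class ToyUpsertTable:
    """Toy model only: hash-bucketed, append-only sorted runs, merged at read time."""

    def __init__(self, num_buckets=4):
        self.num_buckets = num_buckets
        # bucket id -> list of runs; each run is a list of (key, row) sorted by key
        self.buckets = defaultdict(list)

    def _bucket_of(self, key):
        # Stable hash so the same primary key always lands in the same bucket.
        return zlib.crc32(str(key).encode("utf-8")) % self.num_buckets

    def upsert(self, rows, key="id"):
        # Split the batch by bucket; writes never rewrite existing runs.
        parts = defaultdict(dict)
        for row in rows:
            k = row[key]
            # Within one batch, later rows for a key are merged over earlier ones.
            parts[self._bucket_of(k)].setdefault(k, {}).update(row)
        for bucket, part in parts.items():
            # Append the batch as a new run sorted by primary key.
            self.buckets[bucket].append(sorted(part.items()))

    def read(self):
        # Merge on read: apply runs oldest to newest so newer columns win.
        merged = {}
        for runs in self.buckets.values():
            for run in runs:
                for k, row in run:
                    merged.setdefault(k, {}).update(row)
        return [merged[k] for k in sorted(merged)]


table = ToyUpsertTable()
table.upsert([{"id": 1, "name": "a", "score": 10},
              {"id": 2, "name": "b", "score": 20}])
table.upsert([{"id": 1, "score": 11}])  # column-level update: only "score" changes
table.upsert([{"id": 3, "name": "c", "score": 30}])
rows = table.read()
```

In this sketch a write only appends a new run, which is what keeps writes cheap, and the cost of reconciling versions is deferred to `read()`. A real system would persist runs as files and compact them periodically; those parts are omitted here.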

More information can be found on GitHub: https://github.com/meta-soul/LakeSoul and in the documentation: https://www.dmetasoul.com/en/docs/lakesoul/intro/.

LakeSoul is a Sandbox-stage project of the LF AI & Data Foundation.

Contributed by DMetaSoul in May 2023.