Angel, an LF AI & Data Foundation Graduated-Stage Project, has released version 3.2.0. Angel is a machine learning framework originally developed by Tencent as the company's first open-source AI project. The Angel project joined the LF AI & Data Foundation in August 2018 as an Incubation-Stage project, and in December 2019 became a Graduated-Stage project with the support of the Foundation and its technical community.
With full-stack facilities for the AI pipeline, from feature engineering to model training and inference, Angel provides an end-to-end, easy-to-use platform for engineers and scientists. In particular, it focuses on high-dimensional sparse model training and graph neural network learning at production scale. In the previous version, 3.1, Angel introduced graph learning for the first time and provided a set of well-optimized algorithms that were already adopted in a variety of applications. In release 3.2.0, Angel enhances the core of graph learning with numerous new features and optimizations.
Flexible Architecture for Extension
In release 3.2.0, Angel organizes its graph learning framework into three general-purpose layers: computing engine, operators, and models. This architecture decouples high-level algorithms from low-level manipulation of graph data (vertices and edges), giving it good extensibility for both engineering enhancements and new model development. For example, the operator layer defines a group of primitive abstract operator interfaces, such as init(), get(), walk(), and sample(), that developers can implement to build customized operators and extend models.
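As an illustration of this operator pattern, a customized operator might be structured as follows. This is a minimal Python sketch, not Angel's actual (JVM-based) API: the operator names mirror those mentioned above, while the class layout and signatures are assumptions for illustration.

```python
import random
from abc import ABC, abstractmethod

class GraphOperator(ABC):
    """Illustrative stand-in for Angel's primitive operator interfaces.

    Only init() and sample() are sketched here; walk() and get() are
    omitted for brevity. Signatures are assumptions, not Angel's API.
    """

    @abstractmethod
    def init(self, graph):
        """Load graph state (vertices/edges) before computation starts."""

    @abstractmethod
    def sample(self, vertex_id, k):
        """Return up to k neighbors of the given vertex."""

class UniformNeighborSampler(GraphOperator):
    """A customized operator: uniform neighbor sampling over an adjacency dict."""

    def init(self, graph):
        self.adj = graph  # vertex_id -> list of neighbor ids

    def sample(self, vertex_id, k):
        neighbors = self.adj.get(vertex_id, [])
        return list(neighbors) if len(neighbors) <= k else random.sample(neighbors, k)

sampler = UniformNeighborSampler()
sampler.init({0: [1, 2, 3], 1: [0]})
```

Because the algorithm layer only sees the abstract interface, swapping in a different sampling strategy requires no change to the models built on top of it.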
Hybrid Running Mode for Best Performance
There are two main running modes for large-scale graph learning algorithms: Parameter Server (PS) and MPI. Different models, such as graph embeddings and graph neural networks, generate different volumes of communication messages during learning, so it is hard to achieve good performance for all models with a single running mode. Version 3.2.0 introduces a hybrid running mode that combines PS and MPI communication methods, leveraging the advantages of both. This hybrid mode can significantly speed up the training of graph traversal algorithms.
Adaptive Model Data Partitioning
For a big graph model that cannot be loaded on a single machine, the model data usually needs to be partitioned into several parts across several machines. Range partitioning and hash partitioning are the two commonly used methods: the former takes less memory but may cause load skew among machines, while the latter achieves good load balance at the cost of much more memory. In this release, Angel automatically and adaptively chooses between range and hash partitioning according to the model, striking a good tradeoff between memory cost and load balancing.
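The tradeoff between the two strategies can be sketched in a few lines of Python. The functions below are illustrative stand-ins, not Angel's implementation; they only show why clustered keys skew a range partitioner while a hash partitioner spreads them out.

```python
from collections import Counter

def range_partition(key, num_parts, key_space):
    # Range partitioning: only one boundary per partition must be stored,
    # but clustered ("skewed") keys pile into a single partition.
    part_size = (key_space + num_parts - 1) // num_parts
    return min(key // part_size, num_parts - 1)

def hash_partition(key, num_parts):
    # Hash partitioning: spreads clustered keys evenly, at the cost of
    # storing keys in per-partition hash maps instead of dense arrays.
    return key * 2654435761 % 2**32 % num_parts  # Knuth multiplicative hash

# 25 keys clustered at the low end of a 0..999 key space:
keys = range(25)
print(Counter(range_partition(k, 4, 1000) for k in keys))  # all land in partition 0
print(Counter(hash_partition(k, 4) for k in keys))         # spread across partitions
```

An adaptive scheme like Angel's can inspect the model's key distribution and pick whichever strategy fits: range partitioning when keys are dense and uniform, hash partitioning when they are sparse or skewed.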
Support for Heterogeneous Graph Learning
The structure of a graph is usually heterogeneous, with multiple types of edges between each pair of vertices and multiple types of vertex attributes. This complexity raises challenges for the graph learning framework in terms of storage and computation. To support heterogeneous graphs, Angel optimizes the graph storage data structure for fast I/O and provides an interface for users to implement customized PS functions, so that heterogeneous graph learning algorithms can be executed easily on the Angel framework, even with high-dimensional sparse attributes on each vertex. Based on these optimizations, Angel ships several built-in heterogeneous models, including HAN, GAT, GraphSAGE, IGMC Prediction, and Bipartite-GraphSAGE.
Optimizations for Huge Graphs
Learning on a huge graph with about 100 billion edges is very challenging in terms of stability and performance. Angel has been deeply optimized for this kind of huge-graph problem, which is increasingly common in real applications such as social network mining and shopping recommendations. With these enhancements, K-core and Common Friends model training can run three times faster than before while reducing memory cost by 30%.
The Angel Project invites you to adopt or upgrade to version 3.2.0 and welcomes feedback. For details on the additional features and improvements, please refer to the release notes here. Want to get involved with Angel? Be sure to join the Angel-Announce and Angel-Technical-Discuss mailing lists to join the community and stay connected on the latest updates.
Congratulations to the Angel team! We look forward to continued growth and success as part of the LF AI & Data Foundation. To learn about how to host an open source project with the Foundation, visit the LF AI & Data website.
LF AI & Data Foundation—the organization building an ecosystem to sustain open source innovation in artificial intelligence (AI) and data open source projects, today is announcing OpenLineage as its latest Sandbox Project.
Released and open sourced by Datakin, OpenLineage is an open standard for metadata and lineage collection designed to instrument jobs as they are running. It defines a generic model of run, job, and dataset entities identified using consistent naming strategies. The core lineage model is extensible by defining specific facets to enrich those entities.
OpenLineage is a cross-industry effort involving contributors from major open source data projects, including the LF AI & Data projects Marquez, Amundsen, and Egeria. Without OpenLineage, each project has to instrument all jobs itself, and integrations are maintained externally, where new versions can break them. With OpenLineage, the integration effort is shared and integrations can be pushed into each project, so users no longer need to play catch-up.
Dr. Ibrahim Haddad, Executive Director of LF AI & Data, said: “We are excited to welcome the OpenLineage project in LF AI & Data. The project addresses a critical component in governing AI and data projects and further expands the robustness of our portfolio of hosted technical projects. We look forward to working with the OpenLineage project to grow the project’s footprint in the ecosystem, expand its community of adopters and contributors, and to foster the creation of collaboration opportunities with our members and other related projects.”
Julien Le Dem, founder of OpenLineage, said: “Data lineage is a complicated and multidimensional problem; the best solution is to directly observe the movement of data through heterogeneous pipelines. That requires the kind of broad industry coordination that the Linux Foundation has become known for. We are proud for OpenLineage to become an LF AI & Data project, and look forward to an ongoing collaboration.”
LF AI & Data supports projects via a wide range of services, and the first step is joining as an Incubation Project. Learn more about OpenLineage on their GitHub and be sure to join the OpenLineage-Announce and OpenLineage-Technical-Discuss mail lists to join the community and stay connected on the latest updates.
A warm welcome to OpenLineage! We look forward to the project’s continued growth and success as part of the LF AI & Data Foundation. To learn about how to host an open source project with us, visit the LF AI & Data website.
LF AI & Data Foundation—the organization building an ecosystem to sustain open source innovation in artificial intelligence (AI) and data open source projects, today is announcing Substra as its latest Incubation Project.
Substra is a framework offering distributed orchestration of machine learning tasks among partners while guaranteeing secure and trustless traceability of all operations. The Substra project was released and open sourced by OWKIN under the Apache-2.0 license.
Substra enables privacy-preserving federated learning projects, where multiple parties collaborate on a Machine Learning objective while each one keeps their private datasets behind their own firewall. Its ambition is to make new scientific and economic data science collaborations possible.
Data scientists using the Substra framework are able to:
Use their own ML algorithm with any Python ML framework
Ship their algorithms to remote data for training and/or prediction, and monitor their performance
Build advanced Federated Learning strategies for learning across several remote datasets
Data controllers using the Substra framework are able to:
Make their dataset(s) available to other partners for training/evaluation, ensuring it cannot be viewed or downloaded
Choose fine-tuned permissions for their datasets to control their lifecycle
Monitor how the data was used
Engage in advanced multi-partner data science collaborations, even with partners owning competing datasets
Dr. Ibrahim Haddad, Executive Director of LF AI & Data, said: “We’re excited to welcome the Substra project in LF AI & Data. The project enables data scientists to use their own ML algorithm with any Python framework, deploy their algorithm on remote data for training and/or prediction and monitor their performances, and build advanced Federated Learning strategies for learning across several remote datasets. We look forward to working with the community to grow the project’s footprint and to create new collaboration opportunities for it with our members and other hosted projects.”
Substra operates distributed Machine Learning and aims to provide tools for traceable Data Science.
Data Locality: Data remains in the owner’s data stores and is never transferred. AI models travel from one dataset to another.
Decentralized Trust: All operations are orchestrated by a distributed ledger technology. There is no need for a single trusted actor or third party; security arises from the network.
Traceability: An immutable audit trail registers all operations performed on the platform, simplifying the certification of models.
Modularity: Substra is highly flexible; various permission regimes and workflow structures can be enforced corresponding to every specific use case.
Camille Marini, Founder of the Substra project, said: “On behalf of all people who contributed to the Substra framework, I am thrilled and proud that it has been accepted as an incubation project in the LF AI & Data Foundation. Substra has been designed to enable the collaboration / cooperation around the creation of ML models from distributed sources of sensitive data. Indeed, we believe that making discoveries using ML cannot be done without making sure that data privacy and governance are not compromised. We also believe that collaboration between data owners and data scientists is key to be able to create good ML models. These values are shared with the Linux Foundation AI and Data, which thus appears as the perfect host for the Substra project. We hope that it will bring value in the AI & Data community.”
Eric Boniface, General Manager of Substra Foundation, said: “We are very happy and proud at Substra Foundation to see the Substra project becoming an LF AI & Data hosted project. Having been its first umbrella for the open source community, hosting the repositories, elaborating the documentation, animating community workgroups and contributing to first real-world flagship use cases like the HealthChain and MELLODDY projects was an incredible experience shared with the amazing Owkin team developing the framework. It was only a first step at a moderate scale, and we are convinced that joining an experienced and global foundation like the LF AI & Data as an incubation project is a great opportunity and the perfect next chapter for the Substra project, its community, and many more privacy-preserving federated learning use cases to come!”
LF AI & Data supports projects via a wide range of services, and the first step is joining as an Incubation Project. LF AI & Data will support the neutral open governance for Substra to help foster the growth of the project. Learn more about Substra on their GitHub and be sure to join the Substra-Announce and Substra-Technical-Discuss mail lists to join the community and stay connected on the latest updates.
A warm welcome to Substra! We look forward to the project’s continued growth and success as part of the LF AI & Data Foundation. To learn about how to host an open source project with us, visit the LF AI & Data website.
We are excited to welcome five new members to the LF AI & Data Foundation. OPPO Mobile Telecommunications Corp has joined as a Premier member, GSI Technology as a General member and Banque de France, Chaitanya Bharathi Institute of Technology, and Sahyadri College of Engineering & Management as Associate members.
The LF AI & Data Foundation will build and support an open community and a growing ecosystem of open source AI and data by accelerating development and innovation, enabling collaboration and the creation of new opportunities for all the members of the community.
“We are thrilled to continue seeing growth among our member community spanning a wide range of organizations. We see huge potential for driving AI and data innovation, and the support from our members is critical to the success of that effort. A big welcome to our newest members, and we hope more organizations will join us to support the LF AI & Data Foundation mission,” said Dr. Ibrahim Haddad, LF AI & Data Foundation Executive Director.
The LF AI & Data Premier membership is for organizations who contribute heavily to open source AI and data as well as bring in their own projects to be hosted at the Foundation. These companies want to take the most active role in enabling open source AI and Data. Premier members also lead via their voting seats on the Governing Board, Technical Advisory Council, and Outreach Committee.
Learn more about the newest Premier member below:
OPPO is a leading global smart device brand. Since the launch of its first smartphone – “Smiley Face” – in 2008, OPPO has been in relentless pursuit of the perfect synergy of aesthetic satisfaction and innovative technology. Today, OPPO provides a wide range of smart devices spearheaded by the Find and Reno series. Learn more here.
The LF AI & Data General membership is targeted for organizations that want to put their organization in full view in support of LF AI & Data and our mission. Organizations that join at the General level are committed to using open source technology, helping LF AI & Data grow, voicing the opinions of their customers, and giving back to the community.
Learn more about the newest General member below:
GSI Technology, Inc. is a leading provider of SRAM semiconductor memory solutions. GSI’s newest products leverage its market-leading SRAM technology. The Company recently launched radiation-hardened memory products for extreme environments and the Gemini® APU, a memory-centric associative processing unit designed to deliver performance advantages for diverse AI applications. Learn more here.
The LF AI & Data Associate membership is reserved for pre-approved non-profits, open source projects, and government entities who support the LF AI & Data mission.
Learn more about the newest Associate members below:
The Banque de France is the French pillar of the Eurosystem, a federal system formed by the European Central Bank and the national central banks of the euro area. Its three main missions are monetary strategy, financial stability and the provision of economic services to the community.
Chaitanya Bharathi Institute of Technology, established in 1979 and esteemed as a premier engineering institute in the states of Telangana and Andhra Pradesh, was promoted by a group of visionaries from the engineering, medical, legal, and management professions. Its objective is to provide the best engineering and management education to students and to help meet the need for skilled, technically conversant engineers and management professionals in a country embarked on an economic growth plan. Learn more here.
Sahyadri College of Engineering and Management (SCEM), Mangaluru was established in the year 2007 under the Bhandary Foundation. SCEM is one of the premier technological institutions inculcating quality and value based education through innovative teaching learning process for holistic development of the graduates. The Institute is affiliated to Visvesvaraya Technological University (VTU), Belagavi with permanent affiliation for most of the programs, approved by the AICTE and the Government of Karnataka. Learn more here.
Welcome New Members!
We look forward to partnering with these new LF AI & Data Foundation members to help support open source innovation and projects within the artificial intelligence (AI) and data space. Welcome to our new members!
Interested in joining the LF AI & Data community as a member? Learn more here and email firstname.lastname@example.org for more information and/or questions.
The LF AI & Data Foundation, the organization building an ecosystem to sustain open source innovation in artificial intelligence (AI) and data open source projects, is announcing today that hosted project Milvus is advancing from an Incubation level project to a Graduate level. This graduation is the result of Milvus demonstrating thriving adoption, an ongoing flow of contributions from multiple organizations, and both documented and structured open governance processes. Milvus has also achieved a Core Infrastructure Initiative Best Practices Badge, and demonstrated a strong commitment to its community of users and contributors.
Milvus is an open-source vector database built to manage embedding vectors generated by machine learning models and neural networks. The platform is widely used in applications such as computer vision, natural language processing, computational chemistry, personalized recommender systems, and more. The Milvus project extends the capabilities of best-in-class approximate nearest neighbor (ANN) search libraries including Faiss, NMSLIB, and Annoy with a cloud-native database system design. Built with machine learning operations (MLOps) in mind, Milvus provides an efficient, reliable, and flexible database component that contributes to simplified management of the entire machine learning model lifecycle. Milvus has been adopted by over 1,000 organizations worldwide including iQiyi, Kingsoft, Tokopedia, Trend Micro, and more. More than 2,300 developers have joined the Milvus open-source community on GitHub, Slack, mailing lists, and WeChat.
Originally developed and open sourced by Zilliz, Milvus joined LF AI & Data as an incubation project in January 2020. As an Incubation project, the project has benefited from the LF AI & Data’s various enablement services to foster its growth and adoption; including program management support, event coordination, legal services, and marketing services ranging from website creation to project promotion.
“Milvus is a great example of a project that joined us in its early stages and grew significantly with the enablement of our services to graduate as a sign of maturity, functioning open governance, and large-scale adoption,” said Dr. Ibrahim Haddad, Executive Director of the LF AI & Data Foundation. “The development activities, the growth of its users and contributors community, and its adoption are particularly noteworthy. Milvus meets our graduation criteria and we’re proud to be its host Foundation. As a Graduate project, we will continue to support it via an extended set of services tailored for Graduated projects. We’re also excited that the project is now eligible for a voting seat on LF AI & Data’s Technical Advisory Council. Congratulations, Milvus!”
“We have made significant progress since Milvus joined the LF AI & Data Foundation 16 months ago. With all the good support from the Foundation, we have grown a mature community around the Milvus project. We have also found a lot of collaboration opportunities with other members and projects in the Foundation. It helped us a lot in promoting the Milvus project,” said Milvus project lead Xiaofan Luan.
Milvus in Numbers
The stats below capture Milvus’ development efforts as of their graduation in June 2021:
Congratulations to the Milvus team! We look forward to continued growth and success as part of the LF AI & Data Foundation. To learn about hosting an open source project with us, visit the LF AI & Data Foundation website.
Adlik, an LF AI & Data Foundation Incubation-Stage Project, has released version 0.3.0, called Cheetah. Adlik is a toolkit for accelerating deep learning inference that provides overall support for bringing trained models into production and eases the learning curve for different kinds of inference frameworks. In Adlik, the Model Optimizer and Model Compiler deliver optimized and compiled models for a given hardware environment, and the Serving Engine provides deployment solutions for cloud, edge, and device.
In version 0.3.0, Cheetah, you’ll find more frameworks integrated, and the Adlik Optimizer succeeds in boosting the inference performance of models. In an MLPerf test, a ResNet-50 model optimized by the Adlik optimizer had its size compressed by 93% and its inference latency reduced to 1.33 ms. In the Adlik compiler, TVM auto scheduling, which globally and automatically searches for the optimal scheduling solution by re-designing scheduling templates, achieves lower latency for ResNet-50 on x86 CPUs than OpenVINO. This release enhances features, improves usability, and continues to showcase improvements across a wide range of scenarios. A few release highlights include the following:
Integrate deep learning frameworks including PaddlePaddle, Caffe and MXNet
Support compiling into TVM
Support FP16 quantization for OpenVINO
Support TVM auto scheduling
Specific optimization for YOLO V4
Pruning, distillation and quantization for ResNet-50
Support runtime of TVM and TF-TRT
Docker images for cloud native environments support newest version of inference components including OpenVINO (2021.1.110), TensorFlow (2.4.0), TensorRT (22.214.171.124), TFLite (2.4.0), TVM (0.7)
Support Paddle models, such as Paddle OCR, PP-YOLO, and PPresnet-50
A special thank you goes out to contributors from Paddle for their support in this release. Your contributions are greatly appreciated!
The Adlik Project invites you to adopt or upgrade to Cheetah, version 0.3.0, and welcomes feedback. To learn more about the Adlik 0.3.0 release, check out the full release notes. Want to get involved with Adlik? Be sure to join the Adlik-Announce and Adlik Technical-Discuss mailing lists to join the community and stay connected on the latest updates.
Congratulations to the Adlik team! We look forward to continued growth and success as part of the LF AI & Data Foundation. To learn about how to host an open source project with us, visit the LF AI & Data website.
The LF AI & Data Foundation, which supports and builds a sustainable ecosystem for open source AI and Data software, today announced that OPPO has joined its governance board as a Premier member.
OPPO is a leading global smart device brand. Since the launch of its first smartphone – “Smiley Face” – in 2008, OPPO has been in relentless pursuit of the perfect synergy of aesthetic satisfaction and innovative technology. Today, OPPO provides a wide range of smart devices spearheaded by the Find and Reno series. Beyond devices, OPPO provides its users with ColorOS and internet services like HeyTap and OPPO+. OPPO operates in more than 40 countries and regions, with 6 Research Institutes and 4 R&D Centers worldwide, as well as an International Design Center in London. The recently opened R&D centre in Hyderabad, the company's first outside of China, is playing a pivotal role in the development of 5G technologies. In line with OPPO’s commitment to Make in India, manufacturing at the Greater Noida plant has been increased to 50 million smartphones per year. According to IDC, OPPO ranked 4th among the top 5 smartphone brands in India, with 88.4% year-on-year growth in Q4 2019.
Premier membership is LF AI & Data’s highest tier of membership, reserved for organizations that contribute heavily to the open source artificial intelligence (AI), machine learning (ML), deep learning (DL), and Data space. These organizations, along with members at the General and Associate levels, work in concert with LF AI & Data team members to take the most active role in enabling open source AI, ML, DL, and Data; growing the ecosystem; facilitating collaboration and integration efforts across projects; and spearheading efforts in areas such as interoperability and ethical and responsible AI.
“We are very pleased to welcome OPPO as a Premier Member to our Governing Board with a voting seat on all of our Foundation level committees,” said Dr. Ibrahim Haddad, LF AI & Data Foundation Executive Director. “We are supporting open source development, creating a sustainable ecosystem that makes it easier to rely on and integrate with open source AI and data technologies. Corporations like OPPO realize the importance of this effort and are working hard to foster a healthy ecosystem. We are thrilled that OPPO has strategically joined us at the highest membership level to further drive innovation in the community and support our hosted technical projects.”
LF AI & Data Membership
The LF AI & Data Foundation now has 49 members participating across the Premier, General, and Associate membership levels. We’ve seen a diverse group of companies get involved across various industries, and we welcome those interested in supporting open source projects within the AI, ML, DL, and Data space. Interested in becoming a member of LF AI & Data? Learn more here and email email@example.com for more information and/or questions.
The LF AI & Data Foundation’s mission is to build and support an open AI community, and drive open source innovation in the AI, ML, DL, and Data domains by enabling collaboration and the creation of new opportunities for all the members of the community.
A big thank you to Orange for hosting a great virtual meetup! LF AI & Data Day EU Virtual was held on June 10, 2021 with 76 attendees joining live.
This event featured keynote speakers from leading AI industries from IBM, Orange, AIvancity School, and Banque de France with a focus on ML Breakthroughs, open source strategies for scaling machine learning, and Trusted AI. Various AI topics were covered, including technical presentations on MLOps, AI learning, Trusted AI, and new LF AI & Data projects such as Rosae NLG, ONNX, Machine Learning Exchange, and Datashim. ITU’s AI Activities were also presented during the closing session.
Missed the event? Check out all of the presentations and recording here.
This meetup took on a virtual format but we look forward to connecting again at another event in person soon. LF AI & Data Day is a regional, one-day event hosted and organized by local members with support from LF AI & Data, its members, and projects. If you are interested in hosting an LF AI & Data Day please email firstname.lastname@example.org to discuss.
Event host, Orange, is a leading telecommunications company with headquarters in France. They are the largest telecoms operator in France, with the bulk of their operations in Europe, Africa and the Middle East. As an LF AI & Data General Member, Orange is involved with the LF AI & Data Governing Board, Outreach Committee, Trusted AI Committee, and is an active contributor to the LF AI & Data Acumos project.
LF AI & Data Foundation—the organization building an ecosystem to sustain open source innovation in artificial intelligence (AI) and Data open source projects, today is announcing DELTA as its latest Incubation Project.
DELTA is a deep learning based end-to-end natural language and speech processing platform. It aims to provide easy and fast experiences for using, deploying, and developing natural language processing (NLP) and speech models for both academia and industry use cases. DELTA is mainly implemented using TensorFlow and Python 3. It was released and open sourced by DiDi Global.
Dr. Ibrahim Haddad, Executive Director of LF AI & Data, said: “We are excited to welcome DELTA to LF AI & Data and help it thrive in a neutral, vendor-free environment under an open governance model. We look forward to helping the project grow its community of users and contributors, and to enabling collaboration and integration opportunities with other hosted projects to drive innovation in open source AI technologies.”
DELTA has been used to develop several state-of-the-art algorithms for publications and has been deployed in real production serving millions of users. It helps you train, develop, and deploy NLP and/or speech models, featuring:
One command to train NLP and speech models, including:
NLP: text classification, named entity recognition, question answering, text summarization, etc.
Use configuration files to easily tune parameters and network structures
What you see in training is what you get in serving: all data processing and feature extraction are integrated into the model graph
Uniform I/O interfaces and no changes for new models
Easily build state-of-the-art models using modularized components
All modules are reliable and fully-tested
Yunbo Wang, co-creator of DELTA, said: “NLP and voice technology have been widely applied throughout DiDi’s business. For instance, DiDi has built an intelligent customer service system based on AI to assist the efficiency of human customer service and reduce repetitive effort. Based on voice recognition and natural language understanding, DiDi has built a voice assistant function for drivers and applied it to contact-free ride-hailing services in Japan and Australia. In the future, DiDi will continue to actively promote the opening of related capabilities. Through one-stop natural language processing tools and platforms, DiDi will help its industrial partners deploy better AI applications.”
LF AI & Data supports projects via a wide range of services, and the first step is joining as an Incubation Project. LF AI & Data will support the neutral open governance for DELTA to help foster the growth of the project. Check out the Documentation to start working with DELTA today. Learn more about DELTA on their GitHub and be sure to join the DELTA-Announce and DELTA-Technical-Discuss mail lists to join the community and stay connected on the latest updates.
A warm welcome to DELTA! We look forward to the project’s continued growth and success as part of the LF AI & Data Foundation. To learn about how to host an open source project with us, visit the LF AI & Data website.
We’re excited to announce the v0.4 release of Ludwig — the open source, low-code declarative deep learning framework created and open sourced by Uber and hosted by the LF AI & Data Foundation. Ludwig enables you to apply state-of-the-art tabular, NLP, and computer vision models to your existing data and put them into production with just a few short commands.
The focus of this release is to bring MLOps best practices through declarative deep learning with enhanced scalability for data processing, training, and hyperparameter search. The new features of this release include:
Integration with Ray for large-scale distributed training that combines Dask and Horovod
A new distributed hyperparameter search integration with Ray Tune
The addition of TabNet as a combiner for state-of-the-art deep learning on tabular data
MLflow integration for unified experiment tracking and model serving
Preconfigured datasets for a wide variety of different tasks, leveraging Kaggle
Ludwig combines all of these elements into a single toolkit that guides you through machine learning end-to-end:
Experimentation with different model architectures using Ray Tune
Data cleaning and preprocessing, scaling up to large out-of-memory datasets with Dask and Ray
Distributed training on multi-node clusters with Horovod and Ray
Deployment and serving of the best model in production with MLflow
Ludwig abstracts away the complexity of combining all these disparate systems together through its declarative approach to structuring machine learning pipelines. Instead of writing code for your model, training loop, preprocessing, postprocessing, evaluation, and hyperparameter optimization, you only need to declare the schema of your data as a simple YAML configuration:
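Such a configuration can be as small as a pair of feature lists. For instance, a minimal text classifier in Ludwig's YAML format (the feature names below are hypothetical) might be declared as:

```yaml
input_features:
  - name: review_text
    type: text
output_features:
  - name: sentiment
    type: category
```

Training then reduces to pointing `ludwig train` at this file and a dataset; Ludwig infers the rest of the pipeline from defaults.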
Starting from a simple config like the one above, any and all aspects of the model architecture, training loop, hyperparameter search, and backend infrastructure can be modified as additional fields in the declarative configuration to customize the pipeline to meet your requirements:
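For instance, a config could be extended with preprocessing, combiner, training, and backend sections along these lines (section names follow Ludwig v0.4 conventions; the specific values are placeholders):

```yaml
input_features:
  - name: Age
    type: numerical
    preprocessing:
      missing_value_strategy: fill_with_mean   # customize preprocessing
output_features:
  - name: Survived
    type: binary
combiner:
  type: concat
  num_fc_layers: 2          # customize the model architecture
training:
  learning_rate: 0.001      # customize the training loop
  epochs: 50
backend:
  type: ray                 # customize the backend infrastructure
```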
Why Declarative Machine Learning Systems?
Ludwig’s declarative approach to machine learning combines the simplicity of conventional AutoML solutions with the flexibility of full-featured frameworks like TensorFlow and PyTorch. This is achieved through an extensible, declarative configuration with optional parameters for every aspect of the pipeline. Ludwig’s declarative programming model enables key features such as:
Multi-modal, multi-task learning in zero lines of code. Mix and match tabular data, text, imagery, and even audio into complex model configurations without writing code.
Integration with any structured data source. If it can be read into a SQL table or Pandas DataFrame, Ludwig can train a model on it.
Easily explore different model configurations and parameters with hyperopt. Automatically track all trials and metrics with tools like Comet ML, Weights & Biases, and MLflow.
Automatically scale training to multi-GPU, multi-node clusters. Go from training on your local machine to the cloud without code or config changes.
Fully customize any part of the training process. Every part of the model and training process is fully configurable in YAML, and easy to extend through custom TensorFlow modules with a simple interface.
Ludwig distributed training and data processing with Ray
Ludwig on Ray is a new backend introduced in v0.4 that illustrates the power of declarative machine learning. Starting from any existing Ludwig configuration like the one above, users can scale their training process from running on their local laptop, to running in the cloud on a GPU instance, to scaling across hundreds of machines in parallel, all without changing a single line of code.
By integrating with Ray, Ludwig provides a unified way of doing distributed training:
Ray enables you to provision a cluster of machines in a single command through its cluster launcher.
Horovod on Ray enables you to do distributed training without needing to configure MPI in your environment.
Dask on Ray enables you to process large datasets that don’t fit in memory on a single machine.
Ray Tune enables you to easily run distributed hyperparameter search across many machines in parallel.
All of this comes for free without changing a single line of code in Ludwig. When Ludwig detects that you’re running within a Ray cluster, the Ray backend will be enabled automatically.
After launching a Ray cluster by running ray up on the command line, you need only ray submit your existing Ludwig training command to scale out across all the nodes in your Ray cluster.
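Concretely, this might look like the following (file names are placeholders, and the exact Ray CLI form depends on your Ray version: `ray submit` takes a Python driver script, while `ray exec` runs a shell command on the head node):

```shell
# Provision the cluster defined in cluster.yaml (placeholder file name)
ray up cluster.yaml

# Run the existing Ludwig training command on the cluster head node
ray exec cluster.yaml \
    "ludwig train --config config.yaml --dataset s3://my-bucket/train.parquet"
```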
Behind the scenes, Ludwig will do the work of determining what resources your Ray cluster has (number of nodes, GPUs, etc.) and spreading out the work to speed up the training process.
Ludwig on Ray will use Dask as a distributed DataFrame engine, allowing it to process large datasets that do not fit within the memory of a single machine. After processing the data into Parquet or TFRecord format, Ludwig on Ray will automatically spin up Horovod workers to distribute the TensorFlow training process across multiple GPUs.
To get you started, we provide Docker images for both CPU and GPU environments. These images come pre-installed with Ray, CUDA, Dask, Horovod, TensorFlow, and everything else you need to train any model with Ludwig on Ray. Just add one of these Docker images to your Ray cluster config and you can start doing large scale distributed deep learning in the cloud within minutes:
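As a sketch, the Docker image slots into the Ray cluster config like this (the image tag and provider settings are assumptions to be checked against the Ludwig and Ray documentation):

```yaml
# Fragment of a Ray cluster config (cluster.yaml)
cluster_name: ludwig-gpu
provider:
  type: aws
  region: us-west-2
docker:
  image: ludwigai/ludwig-ray-gpu:latest   # Ludwig-on-Ray GPU image; verify tag
  container_name: ludwig
```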
As with other aspects of Ludwig, the Ray backend can be configured through the Ludwig config YAML. For example, when running on large datasets in the cloud, it can be useful to customize the cache directory where Ludwig writes the preprocessed data to use a specific bucket in a cloud object storage system like Amazon S3:
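A sketch of what that configuration might look like (the bucket path is a placeholder, and the exact option name should be verified in the user guide):

```yaml
backend:
  type: ray
  cache_dir: s3://my-bucket/ludwig-cache   # where preprocessed data is written
```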
In Ludwig v0.4, you can use cloud object storage like Amazon S3, Google Cloud Storage, Azure Data Lake Storage, and MinIO for datasets, processed data caches, config files, and training output. Just specify your filenames using the appropriate protocol and environment variables, and Ludwig will take care of the rest.
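For example, with Amazon S3 (credentials and paths below are placeholders):

```shell
# Standard cloud credentials are picked up from the environment
export AWS_ACCESS_KEY_ID="<your-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret>"

# Datasets and outputs can then reference the object store directly
ludwig train --config config.yaml --dataset s3://my-bucket/train.csv
```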
Check the Ludwig user guide for a complete description of available configuration options.
New distributed hyperparameter search with Ray Tune
Another new feature of the 0.4 release is the ability to do distributed hyperparameter search. With this release, Ludwig users will be able to execute hyperparameter search using cutting edge algorithms, including Population-Based Training, Bayesian Optimization, and HyperBand, among others.
We first introduced hyperparameter search capabilities for Ludwig in v0.3, but the integration with Ray Tune — a distributed hyperparameter tuning library native to Ray — makes it possible to distribute the search process across an entire cluster of machines, and use any search algorithms provided by Ray Tune within Ludwig out-of-the-box. Through Ludwig’s declarative configuration, you can start using Ray Tune to optimize over any of Ludwig’s configurable parameters with just a few additional lines in your config file:
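A hyperopt section along these lines would do it (option spellings approximate the v0.4 schema and should be treated as assumptions; the output feature name is hypothetical):

```yaml
hyperopt:
  goal: minimize
  metric: loss
  output_feature: Survived        # hypothetical output feature name
  parameters:
    training.learning_rate:
      space: loguniform
      lower: 0.00001
      upper: 0.1
  sampler:
    type: ray                     # delegate the search to Ray Tune
  executor:
    type: ray
    num_samples: 10               # number of trials to run
```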
To run this on Ray across all the nodes in your cluster, you need only take the existing ludwig hyperopt command and ray submit it to the cluster:
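For example (file names are placeholders; depending on your Ray version, `ray exec` may be the way to run a shell command on the cluster):

```shell
ray submit cluster.yaml \
    ludwig hyperopt --config config.yaml --dataset s3://my-bucket/train.parquet
# or, on some Ray versions:
# ray exec cluster.yaml "ludwig hyperopt --config config.yaml --dataset ..."
```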
Within the hyperopt.sampler section of the Ludwig config, you’re free to customize the hyperparameter search process with the full set of search algorithms and configuration settings provided by Ray Tune:
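For instance, swapping in a Bayesian search algorithm and a HyperBand-style scheduler might look like this sketch (type names mirror Ray Tune's options; the exact spellings accepted by Ludwig should be checked against the user guide):

```yaml
hyperopt:
  sampler:
    type: ray
    search_alg:
      type: bayesopt              # Bayesian optimization search algorithm
    scheduler:
      type: async_hyperband       # early-stopping scheduler (ASHA)
      time_attr: time_total_s
      max_t: 3600
```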
State-of-the-Art Tabular Models with TabNet
The first version of Ludwig released in 2019 supported tabular datasets using a concat combiner that implements the Wide and Deep learning architecture. When users specify numerical, category, and binary feature types, the concat combiner will concatenate the features together and build a stack of fully connected layers.
In this release we are extending Ludwig’s support for tabular data by adding a new TabNet combiner. TabNet is a state-of-the-art deep learning model architecture for tabular data that uses sparsity and multiple steps of feature transformations and attention to achieve high performance. The Ludwig implementation allows users to also use feature types other than the classic tabular ones as inputs.
Training a TabNet model is as easy as specifying a tabnet combiner and providing its hyperparameters in the Ludwig configuration.
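For example (hyperparameter values are placeholders; the parameter names follow Ludwig's tabnet combiner, which mirrors the paper's N_d / N_a / N_steps notation):

```yaml
combiner:
  type: tabnet
  size: 32                # attentive transformer size (N_a in the paper)
  output_size: 128        # decision output size (N_d)
  num_steps: 5            # number of sequential attention steps
  relaxation_factor: 1.5
  sparsity: 0.0001        # sparsity regularization strength
```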
We compared the Ludwig TabNet implementation with the results reported in the original paper, where the authors trained for longer and performed hyperparameter optimization. Even when trained locally for a short time, the Ludwig implementation achieves comparable accuracy, as shown in the table below.
Table: TabNet paper accuracy vs. Ludwig TabNet accuracy on the Forest Tree Cover dataset.
In addition to TabNet, we also added a new Transformer-based combiner and improved the existing concat combiner with support for optional skip connections. These additions make Ludwig a powerful and flexible option for training deep learning models on tabular data.
Experiment Tracking and Model Serving with MLflow
MLflow is an open source experiment tracking and model registry system.
Ludwig v0.4 introduces first-class support for tracking Ludwig train, experiment, and hyperopt runs in MLflow with just a single extra command-line argument: --mlflow.
The experiment_name you provide to Ludwig will map directly to an experiment in MLflow so you can organize multiple training or hyperopt runs together.
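For example (the dataset and experiment names are placeholders):

```shell
ludwig experiment --config config.yaml \
    --dataset train.csv \
    --experiment_name titanic_runs \
    --mlflow
```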
This functionality is also exposed through the Python API via a single callback:
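A sketch of the Python-API equivalent (requires Ludwig v0.4 with MLflow installed; the callback import path should be verified against your installed version):

```python
from ludwig.api import LudwigModel
from ludwig.contribs.mlflow import MlflowCallback  # path may vary by version

# Attach the MLflow callback so train/experiment runs are tracked
model = LudwigModel(config="config.yaml", callbacks=[MlflowCallback()])
model.train(dataset="train.csv", experiment_name="titanic_runs")
```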
In addition to tracking experiment results, MLflow can also be used to store and serve models in production. Ludwig v0.4 makes it easy to take an existing Ludwig model (either saved as a directory or in an MLflow experiment) and register it with the MLflow model registry:
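A hypothetical sketch of that registration step (the command and flag names below are assumptions, not confirmed by the post; consult `ludwig --help` for the exact interface):

```shell
ludwig export_mlflow \
    --path results/experiment_run/model \
    --registered_model_name titanic_classifier
```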
The Ludwig model will be converted automatically to MLflow’s model.pyfunc format, allowing it to be executed in a framework-agnostic way through a REST endpoint, Spark UDF, Python API with Pandas, etc.
Preconfigured datasets from Kaggle
Since its initial release, Ludwig has required datasets to be provided in tabular form, with a header containing names that can be referenced from the configuration file. In order to make it easy to get started with applying Ludwig to popular datasets and tasks, we’ve added a new datasets module in v0.4 that allows you to download datasets, process them into a tabular format ready for use with Ludwig, and load them into a DataFrame for training in a single line of code.
The Ludwig datasets module integrates with the Kaggle API to provide instant access to popular datasets used in Kaggle competitions. In v0.4, we provide access to popular competition datasets like Titanic, Rossmann Store Sales, Ames Housing and more. Here is an example of how to load the Titanic dataset:
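A sketch of that one-liner (requires Ludwig v0.4 and a configured Kaggle API token; the exact return signature of `load()` should be checked in the datasets module documentation):

```python
from ludwig.datasets import titanic

# Downloads via the Kaggle API, processes the data into tabular form,
# and returns DataFrames ready for training
train_df, test_df, _ = titanic.load()
```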
Adding a new dataset is straightforward: just extend the Dataset abstract class and implement minimal data manipulation code. This has allowed us to quickly expand the set of supported datasets to include SST, MNIST, Amazon Review, Yahoo Answers, and many more. For a full list of the available datasets, please check the User Guide. We encourage you to contribute your own favorite datasets!
Our goal is to make machine learning easier and more accessible to a broader audience. We’re excited to continue to pursue this goal with features for Ludwig in the pipeline, including:
End-to-end AutoML with Neural Architecture Search – Offload part or all of the work of picking the optimal search strategy, tuning parameters, and choosing encoders/combiners/decoders for your given dataset and resources during model training.
Combined hyperopt & distributed training – Jointly run hyperopt and distributed training to find the best model within a provided time constraint.
Pure TensorFlow low-latency serving – Leverage a flexible and high-performance serving system designed for production machine learning environments using TensorFlow Serving.
PyTorch backend – Write custom Ludwig modules using all your favorite frameworks and take advantage of the rich ecosystem each provides.
We hope that these new capabilities will make it easier for our community to continue to build state-of-the-art models. If you are as excited about this direction as we are, join our community and get involved! We are building this open source project together; we will keep pushing toward the release of Ludwig v0.5, and we welcome contributions from anyone who is excited to see it happen!
We also recognize that for many organizations, success with machine learning means solving many challenges end-to-end: from connecting to and accessing data, to training and deploying model pipelines, to making those models easily available to the rest of the organization.
That’s why we’re also excited to announce that we are building a new solution called Predibase, a cohesive enterprise platform built on top of Ludwig, Horovod, and Ray to help realize the vision of making machine learning easier and more accessible. We’ll be sharing more details soon, and if you’re excited to get in touch in the meantime please feel free to reach out to us at email@example.com (we are hiring!).
We really hope that you find the new features in Ludwig 0.4 exciting, and want to thank our amazing community for the contributions and requests. Please drop us a comment or email with any feedback, and happy training!
A lot of work went into Ludwig v0.4, and we want to thank everyone who contributed and helped, and in particular the main contributors and community members to this release: our co-maintainer Jim Thompson, Saikat Kanjilal, Avanika Narayan, Nimit Sohoni, Kanishk Kalra, Michael Zhu, Elias Castro-Hernandez, Debbie Yuen, Victor Dai. Special thanks to the immense support from the Stanford’s Hazy research group led by Prof. Chris Ré, to Richard Liaw, Hao Zhang and Micheal Chau from the Ray team, and the LF AI & Data staff.