Substra Joins LF AI & Data as New Incubation Project

By Blog

LF AI & Data Foundation, the organization building an ecosystem to sustain open source innovation in artificial intelligence (AI) and data open source projects, is today announcing Substra as its latest Incubation Project.

Substra is a framework offering distributed orchestration of machine learning tasks among partners while guaranteeing secure and trustless traceability of all operations. The Substra project was released and open sourced by OWKIN under the Apache-2.0 license. 

Substra enables privacy-preserving federated learning projects, where multiple parties collaborate on a Machine Learning objective while each one keeps their private datasets behind their own firewall. Its ambition is to make new scientific and economic data science collaborations possible.

Data scientists using the Substra framework are able to:

  • Use their own ML algorithm with any Python ML framework
  • Ship their algorithm on remote data for training and/or prediction and monitor its performance
  • Build advanced Federated Learning strategies for learning across several remote datasets

Data controllers using the Substra framework are able to:

  • Make their dataset(s) available to other partners for training/evaluation, ensuring it cannot be viewed or downloaded
  • Choose fine-tuned permissions for their dataset to control its lifecycle
  • Monitor how the data was used
  • Engage in advanced multi-partner data science collaborations, even with partners owning competing datasets

Dr. Ibrahim Haddad, Executive Director of LF AI & Data, said: “We’re excited to welcome the Substra project in LF AI & Data. The project enables data scientists to use their own ML algorithm with any Python framework, deploy their algorithm on remote data for training and/or prediction and monitor their performances, and build advanced Federated Learning strategies for learning across several remote datasets. We look forward to working with the community to grow the project’s footprint and to create new collaboration opportunities for it with our members and other hosted projects.” 

Substra operates distributed Machine Learning and aims to provide tools for traceable Data Science.

  • Data Locality: Data remains in the owner’s data stores and is never transferred. AI models travel from one dataset to another.
  • Decentralized Trust: All operations are orchestrated by a distributed ledger technology. There is no need for a single trusted actor or third party; security arises from the network.
  • Traceability: An immutable audit trail registers all the operations carried out on the platform, which simplifies the certification of models.
  • Modularity: Substra is highly flexible; various permission regimes and workflow structures can be enforced corresponding to every specific use case.
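
The data locality and traceability principles above can be illustrated with a small, self-contained Python sketch. It is not Substra's API or its ledger implementation: the partner names, the one-parameter model, and the hash-chained log are hypothetical stand-ins, chosen only to show a model visiting each dataset in turn while every operation is recorded in a tamper-evident trail.

```python
import hashlib
import json


def local_update(w, data, lr=0.05, steps=20):
    """Fit y ~ w * x with a few gradient steps on one partner's private data."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def append_record(trail, record):
    """Append a record whose hash also covers the previous entry (a hash chain)."""
    prev_hash = trail[-1]["hash"] if trail else "0" * 64
    payload = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    trail.append({"record": record, "prev": prev_hash, "hash": digest})


def verify(trail):
    """Recompute every hash; any edited or reordered entry makes this False."""
    prev_hash = "0" * 64
    for entry in trail:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical partners: each list of (x, y) pairs never leaves its owner.
partners = {
    "hospital_a": [(1.0, 2.0), (2.0, 4.1)],
    "hospital_b": [(3.0, 5.9), (4.0, 8.0)],
}

w = 0.0      # the only thing that travels between partners
trail = []   # audit trail of every training operation
for round_no in range(3):
    for name, data in partners.items():
        w = local_update(w, data)
        append_record(trail, {"round": round_no, "partner": name, "weight": round(w, 6)})
```

Only the weight `w` moves between partners, and `verify(trail)` stops returning `True` as soon as any past record is altered. A real deployment would replace the in-memory list with the distributed ledger described above.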

Camille Marini, Founder of the Substra project, said: “On behalf of all people who contributed to the Substra framework, I am thrilled and proud that it has been accepted as an incubation project in the LF AI & Data Foundation. Substra has been designed to enable the collaboration / cooperation around the creation of ML models from distributed sources of sensitive data. Indeed, we believe that making discoveries using ML cannot be done without making sure that data privacy and governance are not compromised. We also believe that collaboration between data owners and data scientists is key to be able to create good ML models. These values are shared with the Linux Foundation AI and Data, which thus appears as the perfect host for the Substra project. We hope that it will bring value in the AI & Data community.”

Eric Boniface, General Manager of Substra Foundation, said: “We are very happy and proud at Substra Foundation to see the Substra project becoming an LF AI & Data hosted project. Having been its first umbrella for the open source community, hosting the repositories, elaborating the documentation, animating community workgroups and contributing to first real-world flagship use cases like the HealthChain and MELLODDY projects was an incredible experience shared with the amazing Owkin team developing the framework. It was only a first step at a moderate scale, and we are convinced that joining an experienced and global foundation like the LF AI & Data as an incubation project is a great opportunity and the perfect next chapter for the Substra project, its community, and many more privacy-preserving federated learning use cases to come!”.

LF AI & Data supports projects via a wide range of services, and the first step is joining as an Incubation Project. LF AI & Data will support the neutral open governance for Substra to help foster the growth of the project. Learn more about Substra on their GitHub and be sure to join the Substra-Announce and Substra-Technical-Discuss mailing lists to join the community and stay connected on the latest updates.

A warm welcome to Substra! We look forward to the project’s continued growth and success as part of the LF AI & Data Foundation. To learn about how to host an open source project with us, visit the LF AI & Data website.

LF AI & Data Resources

New LF AI & Data Members Welcome – Q2 2021

We are excited to welcome five new members to the LF AI & Data Foundation. OPPO Mobile Telecommunications Corp has joined as a Premier member, GSI Technology as a General member and Banque de France, Chaitanya Bharathi Institute of Technology, and Sahyadri College of Engineering & Management as Associate members. 

The LF AI & Data Foundation will build and support an open community and a growing ecosystem of open source AI and data by accelerating development and innovation, enabling collaboration and the creation of new opportunities for all the members of the community.

“We are thrilled to continue seeing growth among our member community spanning a wide range of organizations. We see huge potential for driving AI and data innovation, and the support from our members is critical to the success of that effort. A big welcome to our newest members and we hope more organizations will join us to support the LF AI & Data Foundation mission,” said Dr. Ibrahim Haddad, LF AI & Data Foundation Executive Director.

Premier Members

The LF AI & Data Premier membership is for organizations who contribute heavily to open source AI and data as well as bring in their own projects to be hosted at the Foundation. These companies want to take the most active role in enabling open source AI and Data. Premier members also lead via their voting seats on the Governing Board, Technical Advisory Council, and Outreach Committee.

Learn more about the newest Premier member below:

OPPO is a leading global smart device brand. Since the launch of its first smartphone – “Smiley Face” – in 2008, OPPO has been in relentless pursuit of the perfect synergy of aesthetic satisfaction and innovative technology. Today, OPPO provides a wide range of smart devices spearheaded by the Find and Reno series. Learn more here.

General Members

The LF AI & Data General membership is targeted for organizations that want to put their organization in full view in support of LF AI & Data and our mission. Organizations that join at the General level are committed to using open source technology, helping LF AI & Data grow, voicing the opinions of their customers, and giving back to the community.

Learn more about the newest General member below:

GSI Technology, Inc. is a leading provider of SRAM semiconductor memory solutions. GSI’s newest products leverage its market-leading SRAM technology. The Company recently launched radiation-hardened memory products for extreme environments and the Gemini® APU, a memory-centric associative processing unit designed to deliver performance advantages for diverse AI applications. Learn more here.

Associate Members

The LF AI & Data Associate membership is reserved for pre-approved non-profits, open source projects, and government entities who support the LF AI & Data mission.

Learn more about the newest Associate members below: 

The Banque de France is the French pillar of the Eurosystem, a federal system formed by the European Central Bank and the national central banks of the euro area. Its three main missions are monetary strategy, financial stability and the provision of economic services to the community.

Chaitanya Bharathi Institute of Technology, established in 1979 and regarded as a premier engineering institute in the states of Telangana and Andhra Pradesh, was founded by a group of visionaries from the engineering, medical, legal, and management professions. Its objective is to provide the best engineering and management education to students and to help meet the need for skilled, technically conversant engineers and management professionals in a country that had embarked on an economic growth plan. Learn more here.

Sahyadri College of Engineering and Management (SCEM), Mangaluru was established in the year 2007 under the Bhandary Foundation. SCEM is one of the premier technological institutions inculcating quality and value based education through innovative teaching learning process for holistic development of the graduates. The Institute is affiliated to Visvesvaraya Technological University (VTU), Belagavi with permanent affiliation for most of the programs, approved by the AICTE and the Government of Karnataka. Learn more here.

Welcome New Members!

We look forward to partnering with these new LF AI & Data Foundation members to help support open source innovation and projects within the artificial intelligence (AI) and data space. Welcome to our new members!

Interested in joining the LF AI & Data community as a member? Learn more here and email for more information and/or questions. 

LF AI & Data Foundation Announces Graduation of Milvus Project

The LF AI & Data Foundation, the organization building an ecosystem to sustain open source innovation in artificial intelligence (AI) and data open source projects, is announcing today that hosted project Milvus is advancing from an Incubation level project to a Graduate level. This graduation is the result of Milvus demonstrating thriving adoption, an ongoing flow of contributions from multiple organizations, and both documented and structured open governance processes. Milvus has also achieved a Core Infrastructure Initiative Best Practices Badge, and demonstrated a strong commitment to its community of users and contributors. 

Milvus is an open-source vector database built to manage embedding vectors generated by machine learning models and neural networks. The platform is widely used in applications such as computer vision, natural language processing, computational chemistry, personalized recommender systems, and more. The Milvus project extends the capabilities of best-in-class approximate nearest neighbor (ANN) search libraries including Faiss, NMSLIB, and Annoy with a cloud-native database system design. Built with machine learning operations (MLOps) in mind, Milvus provides an efficient, reliable, and flexible database component that contributes to simplified management of the entire machine learning model lifecycle. Milvus has been adopted by over 1,000 organizations worldwide including iQiyi, Kingsoft, Tokopedia, Trend Micro, and more. More than 2,300 developers have joined the Milvus open-source community on GitHub, Slack, mailing lists, and WeChat.
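
To make the idea of searching embedding vectors concrete, here is a minimal sketch of exact (brute-force) nearest neighbor search in plain Python. It is not the Milvus API; the three-dimensional vectors and their labels are invented for illustration. ANN libraries such as those named above approximate this kind of ranking so that it stays fast over very large collections.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Rank every stored vector against the query and keep the k most similar."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical embeddings produced by some upstream model.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

result = top_k([1.0, 0.0, 0.0], embeddings, k=2)
print(result)
```

The exhaustive scan compares the query with every stored vector, so its cost grows linearly with the collection size. That cost is what approximate indexes are designed to avoid.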

Originally developed and open sourced by Zilliz, Milvus joined LF AI & Data as an Incubation project in January 2020. As an Incubation project, Milvus has benefited from LF AI & Data’s various enablement services to foster its growth and adoption, including program management support, event coordination, legal services, and marketing services ranging from website creation to project promotion.

“Milvus is a great example of a project that joined us in its early stages and grew significantly with the enablement of our services to graduate as a sign of maturity, functioning open governance, and large-scale adoption,” said Dr. Ibrahim Haddad, Executive Director of the LF AI & Data Foundation. “The development activities, the growth of its users and contributors community, and its adoption are particularly noteworthy. Milvus meets our graduation criteria and we’re proud to be its host Foundation. As a Graduate project, we will continue to support it via an extended set of services tailored for Graduated projects. We’re also excited that the project is now eligible for a voting seat on LF AI & Data’s Technical Advisory Council. Congratulations, Milvus!”

“We have made significant progress since Milvus joined the LF AI & Data foundation 16 months ago. With all the good support from the foundation, we have grown a mature community around the Milvus project. We have also found a lot of collaboration opportunities with other members and projects in the foundation. It helped us a lot in promoting the Milvus project,” said Milvus project lead Xiaofan Luan.

Milvus in Numbers

The stats below capture Milvus’ development efforts as of their graduation in June 2021:

  • Contributors on GitHub: 140 
  • GitHub stars: 6.4K
  • GitHub forks: 887
  • Docker hub downloads: 320K
  • Known community members: 2.3K

[Figure: LFX Insights stats on the Milvus project]

Curious about how to get involved with Milvus? 

Check out the Milvus Quickstart Guide and be sure to join the Milvus Announce and Milvus Technical-Discuss mailing lists to join the community and stay connected on the latest updates. Learn more about Milvus on their website and GitHub.

Congratulations to the Milvus team! We look forward to continued growth and success as part of the LF AI & Data Foundation. To learn about hosting an open source project with us, visit the LF AI & Data Foundation website.

Adlik Cheetah Release (v0.3.0) Now Available!

Adlik, an LF AI & Data Foundation Incubation-Stage Project, has released version 0.3.0, called Cheetah. Adlik is a toolkit for accelerating deep learning inference; it provides overall support for bringing trained models into production and eases the learning curve for different kinds of inference frameworks. In Adlik, the Model Optimizer and Model Compiler deliver optimized and compiled models for a given hardware environment, and the Serving Engine provides deployment solutions for cloud, edge, and device.

In version 0.3.0, Cheetah, you’ll find more frameworks integrated, and the Adlik Optimizer succeeds in boosting the inference performance of models. In an MLPerf test, a ResNet-50 model optimized by the Adlik Optimizer had its model size compressed by 93% and its inference latency reduced to 1.33ms. In the Adlik Compiler, TVM auto scheduling, which globally and automatically searches for the optimal scheduling solution by re-designing scheduling templates, enables lower latency for ResNet-50 on x86 CPU than OpenVINO. This release enhances features, increases usability, and continues to showcase improvements across a wide range of scenarios. A few release highlights to note include the following:

  • Compiler
    • Integrate deep learning frameworks including PaddlePaddle, Caffe and MXNet
    • Support compiling into TVM
    • Support FP16 quantization for OpenVINO
    • Support TVM auto scheduling
  • Optimizer
    • Specific optimization for YOLO V4
    • Pruning, distillation and quantization for ResNet-50
  • Inference Engine
    • Support runtime of TVM and TF-TRT
    • Docker images for cloud native environments support newest version of inference components including OpenVINO (2021.1.110), TensorFlow (2.4.0), TensorRT (, TFLite (2.4.0), TVM (0.7)
  • Benchmark Test
    • Support Paddle models, such as Paddle OCR, PP-YOLO, PPresnet-50
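
Two of the optimizer techniques listed above, pruning and quantization, can be sketched in a few lines of plain Python. This is a generic illustration rather than Adlik's implementation: the weight values and the pruning threshold are made up, and distillation is not shown.

```python
def prune_by_magnitude(weights, threshold):
    """Zero out weights whose absolute value falls below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical layer weights.
weights = [0.50, -1.27, 0.02, 0.90, -0.01]

pruned = prune_by_magnitude(weights, threshold=0.05)
quantized, scale = quantize_int8(pruned)
restored = dequantize(quantized, scale)
```

Pruned zeros compress well and each remaining weight fits in one byte instead of four. The cost is a rounding error of at most half the scale factor per weight.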

A special thank you goes out to contributors from Paddle for their support in this release. Your contributions are greatly appreciated! 

The Adlik Project invites you to adopt or upgrade to Cheetah, version 0.3.0, and welcomes feedback. To learn more about the Adlik 0.3.0 release, check out the full release notes. Want to get involved with Adlik? Be sure to join the Adlik-Announce and Adlik Technical-Discuss mailing lists to join the community and stay connected on the latest updates. 

Congratulations to the Adlik team! We look forward to continued growth and success as part of the LF AI & Data Foundation. To learn about how to host an open source project with us, visit the LF AI & Data website.

LF AI & Data Foundation Welcomes OPPO to its Governing Board as a Premier Member

The LF AI & Data Foundation, which supports and builds a sustainable ecosystem for open source AI and Data software, today announced that OPPO has joined its Governing Board as a Premier member.

OPPO is a leading global smart device brand. Since the launch of its first smartphone – “Smiley Face” – in 2008, OPPO has been in relentless pursuit of the perfect synergy of aesthetic satisfaction and innovative technology. Today, OPPO provides a wide range of smart devices spearheaded by the Find and Reno series. Beyond devices, OPPO provides its users with ColorOS and internet services like HeyTap and OPPO+. OPPO operates in more than 40 countries and regions, with 6 Research Institutes and 4 R&D Centers worldwide, as well as an International Design Center in London. The recently opened, first-ever R&D centre outside of China, in Hyderabad, is playing a pivotal role in the development of 5G technologies. In line with OPPO’s commitment to Make in India, the manufacturing at the Greater Noida plant has been increased to 50 million smartphones per year. According to IDC, OPPO has ranked 4th among the top 5 smartphone brands in India with an 88.4% year on year growth in Q4 2019. 

Premier membership is LF AI & Data’s highest tier of membership, reserved for organizations who contribute heavily to the open source artificial intelligence (AI), machine learning (ML), deep learning (DL), and Data space. These organizations, along with members at the General and Associate levels, work in concert with LF AI & Data team members to take the most active role in enabling open source AI, ML, DL, and Data: growing the ecosystem, facilitating collaboration and integration efforts across projects, and spearheading efforts in areas such as interoperability and ethical and responsible AI.

“We are very pleased to welcome OPPO as a Premier Member to our Governing Board with a voting seat on all of our Foundation level committees,” said Dr. Ibrahim Haddad, LF AI & Data Foundation Executive Director. “We are supporting open source development, creating a sustainable ecosystem that makes it easier to rely on and integrate with open source AI and data technologies. Corporations like OPPO realize the importance of this effort and are working hard to foster a healthy ecosystem. We are thrilled that OPPO has strategically joined us at the highest membership level to further drive innovation in the community and support our hosted technical projects.”

LF AI & Data Membership

The LF AI & Data Foundation now has 49 members who are participating across the Premier, General, and Associate membership levels. We’ve seen a diverse group of companies getting involved across various industries and we welcome those interested in contributing to the support of open source projects within the AI, ML, DL, and Data space. Interested in becoming a member of LF AI & Data? Learn more here and email for more information and/or questions.

The LF AI & Data Foundation’s mission is to build and support an open AI community, and drive open source innovation in the AI, ML, DL, and Data domains by enabling collaboration and the creation of new opportunities for all the members of the community. 

Thank you to Orange for a Great Virtual LF AI & Data Day EU!

A big thank you to Orange for hosting a great virtual meetup! LF AI & Data Day EU Virtual was held on June 10, 2021 with 76 attendees joining live. 

This event featured keynote speakers from leading AI organizations, including IBM, Orange, AIvancity School, and Banque de France, with a focus on ML breakthroughs, open source strategies for scaling machine learning, and Trusted AI. Various AI topics were covered, including technical presentations on MLOps, AI learning, Trusted AI, and new LF AI & Data projects such as Rosae NLG, ONNX, Machine Learning Exchange, and Datashim. ITU’s AI activities were also presented during the closing session.

Missed the event? Check out all of the presentations and recording here.

This meetup took on a virtual format but we look forward to connecting again at another event in person soon. LF AI & Data Day is a regional, one-day event hosted and organized by local members with support from LF AI & Data, its members, and projects. If you are interested in hosting an LF AI & Data Day please email to discuss.

Event host, Orange, is a leading telecommunications company with headquarters in France. They are the largest telecoms operator in France, with the bulk of their operations in Europe, Africa and the Middle East. As an LF AI & Data General Member, Orange is involved with the LF AI & Data Governing Board, Outreach Committee, Trusted AI Committee, and is an active contributor to the LF AI & Data Acumos project.  

DELTA Joins LF AI & Data as New Incubation Project

LF AI & Data Foundation, the organization building an ecosystem to sustain open source innovation in artificial intelligence (AI) and data open source projects, is today announcing DELTA as its latest Incubation Project.

DELTA is a deep learning based end-to-end natural language and speech processing platform. It aims to provide easy and fast experiences for using, deploying, and developing natural language processing (NLP) and speech models for both academia and industry use cases. DELTA is mainly implemented using TensorFlow and Python 3. It was released and open sourced by DiDi Global.

Dr. Ibrahim Haddad, Executive Director of LF AI & Data, said: “We are excited to welcome DELTA to LF AI & Data and help it thrive in a neutral, vendor-free environment under an open governance model. We look forward to helping the project grow its community of users and contributors, and to enabling collaboration and integration opportunities with other hosted projects to drive innovation in open source AI technologies.”

DELTA has been used for developing several state-of-the-art algorithms for publications and delivering real production to serve millions of users. It helps you to train, develop, and deploy NLP and/or speech models, featuring:

  • Easy-to-use
    • One command to train NLP and speech models, including:
      • NLP: text classification, named entity recognition, question and answering, text summarization, etc
      • Speech: speech recognition, speaker verification, emotion recognition, etc
    • Use configuration files to easily tune parameters and network structures
  • Easy-to-deploy
    • What you see in training is what you get in serving: all data processing and features extraction are integrated into a model graph
    • Uniform I/O interfaces and no changes for new models
  • Easy-to-develop
    • Easily build state-of-the-art models using modularized components
    • All modules are reliable and fully-tested

Yunbo Wang, co-creator of DELTA, said: “NLP and voice technology have been widely applied throughout DiDi’s business. For instance, DiDi has built an intelligent customer service system based on AI to improve the efficiency of human customer service and reduce repetitive effort. Based on voice recognition and natural language understanding, DiDi has built a voice assistant function for drivers and applied it to the contact-free ride-hailing services in Japan and Australia. In the future, DiDi will continue to actively promote the opening of related capabilities. Through one-stop natural language processing tools and platforms, DiDi will help its industrial partners realize better AI application landing.”

LF AI & Data supports projects via a wide range of services, and the first step is joining as an Incubation Project. LF AI & Data will support the neutral open governance for DELTA to help foster the growth of the project. Check out the Documentation to start working with DELTA today. Learn more about DELTA on their GitHub and be sure to join the DELTA-Announce and DELTA-Technical-Discuss mailing lists to join the community and stay connected on the latest updates.

A warm welcome to DELTA! We look forward to the project’s continued growth and success as part of the LF AI & Data Foundation. To learn about how to host an open source project with us, visit the LF AI & Data website.

Ludwig AI v0.4 – Introducing Declarative MLOps with Ray, Dask, TabNet, and MLflow integrations

Guest Authors: Piero Molino, Travis Addair, Devvret Rishi

We’re excited to announce the v0.4 release of Ludwig — the open source, low-code declarative deep learning framework created and open sourced by Uber and hosted by the LF AI & Data Foundation. Ludwig enables you to apply state-of-the-art tabular, NLP, and computer vision models to your existing data and put them into production with just a few short commands.

The focus of this release is to bring MLOps best practices through declarative deep learning with enhanced scalability for data processing, training, and hyperparameter search. The new features of this release include: 

  • Integration with Ray for large-scale distributed training that combines Dask and Horovod
  • A new distributed hyperparameter search integration with Ray Tune
  • The addition of TabNet as a combiner for state-of-the-art deep learning on tabular data
  • MLflow integration for unified experiment tracking and model serving
  • Preconfigured datasets for a wide variety of different tasks, leveraging Kaggle

Ludwig combines all of these elements into a single toolkit that guides you through machine learning end-to-end:

  • Experimentation with different model architectures using Ray Tune
  • Data cleaning and preprocessing up to large out-of-memory datasets with Dask and Ray
  • Distributed training on multi-node clusters with Horovod and Ray
  • Deployment and serving the best model in production with MLflow

Ludwig abstracts away the complexity of combining all these disparate systems together through its declarative approach to structuring machine learning pipelines. Instead of writing code for your model, training loop, preprocessing, postprocessing, evaluation, and hyperparameter optimization, you only need to declare the schema of your data as a simple YAML configuration:
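
As an illustration, a declaration of this kind might look like the following, written here as a Python dictionary that mirrors the YAML structure. The `input_features` and `output_features` keys are Ludwig's; the column names and the `describe` helper are hypothetical and exist only for this sketch.

```python
# Hypothetical schema: two input columns and one column to predict.
config = {
    "input_features": [
        {"name": "review_text", "type": "text"},
        {"name": "product_category", "type": "category"},
    ],
    "output_features": [
        {"name": "sentiment", "type": "category"},
    ],
}


def describe(config):
    """Summarize which columns are inputs and which are prediction targets."""
    inputs = [f"{f['name']} ({f['type']})" for f in config["input_features"]]
    outputs = [f"{f['name']} ({f['type']})" for f in config["output_features"]]
    return "inputs: " + ", ".join(inputs) + " -> outputs: " + ", ".join(outputs)


summary = describe(config)
print(summary)

# With Ludwig installed, a dictionary like this is typically passed to
# ludwig.api.LudwigModel(config) and trained with model.train(dataset=...).
```

Nothing in the declaration says how the model is built or trained; it only names the columns and their types, which is the point of the declarative approach.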

Starting from a simple config like the one above, any and all aspects of the model architecture, training loop, hyperparameter search, and backend infrastructure can be modified as additional fields in the declarative configuration to customize the pipeline to meet your requirements:

Why Declarative Machine Learning Systems?

Ludwig’s declarative approach to machine learning presents the simplicity of conventional AutoML solutions with the flexibility of full-featured frameworks like TensorFlow and PyTorch. This is achieved by creating an extensible, declarative configuration with optional parameters for every aspect of the pipeline. Ludwig’s declarative programming model allows for key features such as:

  • Multi-modal, multi-task learning in zero lines of code. Mix and match tabular data, text, imagery, and even audio into complex model configurations without writing code.
  • Integration with any structured data source. If it can be read into a SQL table or Pandas DataFrame, Ludwig can train a model on it.
  • Easily explore different model configurations and parameters with hyperopt. Automatically track all trials and metrics with tools like Comet ML, Weights & Biases, and MLflow.
  • Automatically scale training to multi-GPU, multi-node clusters. Go from training on your local machine to the cloud without code or config changes.
  • Fully customize any part of the training process. Every part of the model and training process is fully configurable in YAML, and easy to extend through custom TensorFlow modules with a simple interface.

Ludwig distributed training and data processing with Ray

Ludwig on Ray is a new backend introduced in v0.4 that illustrates the power of declarative machine learning. Starting from any existing Ludwig configuration like the one above, users can scale their training process from running on their local laptop, to running in the cloud on a GPU instance, to scaling across hundreds of machines in parallel, all without changing a single line of code.

By integrating with Ray, Ludwig is able to provide a unified way for doing distributed training:

  • Ray enables you to provision a cluster of machines in a single command through its cluster launcher.
  • Horovod on Ray enables you to do distributed training without needing to configure MPI in your environment.
  • Dask on Ray enables you to process large datasets that don’t fit in memory on a single machine.
  • Ray Tune enables you to easily run distributed hyperparameter search across many machines in parallel.

All of this comes for free without changing a single line of code in Ludwig. When Ludwig detects that you’re running within a Ray cluster, the Ray backend will be enabled automatically.

After launching a Ray cluster by running ray up on the command line, you need only ray submit your existing Ludwig training command to scale out across all the nodes in your Ray cluster.

Behind the scenes, Ludwig will do the work of determining what resources your Ray cluster has (number of nodes, GPUs, etc.) and spreading out the work to speed up the training process.

Ludwig on Ray will use Dask as a distributed DataFrame engine, allowing it to process large datasets that do not fit within the memory of a single machine. After processing the data into Parquet or TFRecord format, Ludwig on Ray will automatically spin up Horovod workers to distribute the TensorFlow training process across multiple GPUs.

To get you started, we provide Docker images for both CPU and GPU environments. These images come pre-installed with Ray, CUDA, Dask, Horovod, TensorFlow, and everything else you need to train any model with Ludwig on Ray. Just add one of these Docker images to your Ray cluster config and you can start doing large scale distributed deep learning in the cloud within minutes:

As with other aspects of Ludwig, the Ray backend can be configured through the Ludwig config YAML. For example, when running on large datasets in the cloud, it can be useful to customize the cache directory where Ludwig writes the preprocessed data to use a specific bucket in a cloud object storage system like Amazon S3:
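
A backend section along these lines is sketched below as a Python dictionary mirroring the YAML. The key names `cache_dir` and `cache_format` are assumptions based on the v0.4 configuration and should be checked against the user guide; the bucket name is a placeholder, and `storage_scheme` is a hypothetical helper that only shows how a protocol prefix identifies the storage system.

```python
from urllib.parse import urlparse

# Assumed key names; "example-bucket" is a placeholder, not a real location.
backend_config = {
    "backend": {
        "type": "ray",
        "cache_dir": "s3://example-bucket/ludwig-cache",
        "cache_format": "parquet",
    }
}


def storage_scheme(path):
    """Return the protocol prefix of a path ('s3', 'gs', ...), or 'file' if none."""
    return urlparse(path).scheme or "file"


scheme = storage_scheme(backend_config["backend"]["cache_dir"])
print(scheme)
```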

In Ludwig v0.4, you can use cloud object storage like Amazon S3, Google Cloud Storage, Azure Data Lake Storage, and MinIO for datasets, processed data caches, config files, and training output. Just specify your filenames using the appropriate protocol and environment variables, and Ludwig will take care of the rest.

Check the Ludwig user guide for a complete description of available configuration options.

New distributed hyperparameter search with Ray Tune

Another new feature of the 0.4 release is the ability to do distributed hyperparameter search. With this release, Ludwig users will be able to execute hyperparameter search using cutting edge algorithms, including Population-Based Training, Bayesian Optimization, and HyperBand, among others. 

We first introduced hyperparameter search capabilities for Ludwig in v0.3, but the integration with Ray Tune —  a distributed hyperparameter tuning library native to Ray —  makes it possible to distribute the search process across an entire cluster of machines, and use any search algorithms provided by Ray Tune within Ludwig out-of-the-box. Through Ludwig’s declarative configuration, you can start using Ray Tune to optimize over any of Ludwig’s configurable parameters with just a few additional lines in your config file:
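
The sketch below shows what such a section could look like, again as a Python dictionary mirroring the YAML. The parameter paths and key names (`space`, `lower`, `upper`, `sampler`) are assumptions modeled on Ray Tune's search-space vocabulary and should be verified against the Ludwig user guide. The `sample` function is a plain-Python stand-in that mimics log-uniform and integer sampling; it does not call Ray Tune.

```python
import math
import random

# Assumed key names; verify against the Ludwig user guide before use.
hyperopt_config = {
    "hyperopt": {
        "goal": "minimize",
        "parameters": {
            "training.learning_rate": {"space": "loguniform", "lower": 0.001, "upper": 0.1},
            "combiner.num_fc_layers": {"space": "randint", "lower": 2, "upper": 6},
        },
        "sampler": {"type": "ray", "num_samples": 10},
    }
}


def sample(spec, rng):
    """Draw one value from a search-space spec using plain-Python equivalents."""
    if spec["space"] == "loguniform":
        low, high = math.log(spec["lower"]), math.log(spec["upper"])
        return math.exp(rng.uniform(low, high))
    if spec["space"] == "randint":
        return rng.randrange(spec["lower"], spec["upper"])  # upper bound excluded
    raise ValueError(f"unsupported space: {spec['space']}")


rng = random.Random(0)
settings = hyperopt_config["hyperopt"]
trials = [
    {name: sample(spec, rng) for name, spec in settings["parameters"].items()}
    for _ in range(settings["sampler"]["num_samples"])
]
```

Each entry in `trials` is one candidate configuration. A distributed search evaluates many such candidates in parallel and keeps the one that best meets the stated goal.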

To run this on Ray across all the nodes in your cluster, you need only take the existing ludwig hyperopt command and ray submit it to the cluster:

Within the hyperopt.sampler section of the Ludwig config, you’re free to customize the hyperparameter search process with the full set of search algorithms and configuration settings provided by Ray Tune:

State-of-the-Art Tabular Models with TabNet

The first version of Ludwig released in 2019 supported tabular datasets using a concat combiner that implements the Wide and Deep learning architecture. When users specify numerical, category, and binary feature types, the concat combiner will concatenate the features together and build a stack of fully connected layers.

In this release we are extending Ludwig’s support for tabular data by adding a new TabNet combiner. TabNet is a state-of-the-art deep learning model architecture for tabular data that uses sparsity and multiple steps of feature transformations and attention to achieve high performance. The Ludwig implementation allows users to also use feature types other than the classic tabular ones as inputs.

Training a TabNet model is as easy as specifying a tabnet combiner and providing its hyperparameters in the Ludwig configuration.
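For illustration, a configuration along these lines might look like the following Python dict. The column names are hypothetical, and the hyperparameter names (`size`, `output_size`, `num_steps`) are assumptions about the combiner’s options, so treat this as a sketch rather than a reference.

```python
import json

config = {
    # Hypothetical tabular columns using the classic tabular feature types.
    "input_features": [
        {"name": "age", "type": "numerical"},
        {"name": "occupation", "type": "category"},
        {"name": "is_member", "type": "binary"},
    ],
    "output_features": [
        {"name": "purchased", "type": "binary"},
    ],
    # Select the TabNet combiner; the hyperparameter names are assumed.
    "combiner": {
        "type": "tabnet",
        "size": 32,
        "output_size": 32,
        "num_steps": 5,
    },
}

print(json.dumps(config, indent=2))
```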

We compared the performance achieved by the Ludwig TabNet implementation with the performance reported in the original paper, where the authors trained for longer and performed hyperparameter optimization, and confirmed it can achieve very comparable results in minimal time even when trained locally, as shown in the table below.

Dataset | XGBoost Accuracy | TabNet Paper Accuracy | Ludwig TabNet Accuracy
Poker Hands | 0.711 | 0.992 | 0.9914
Higgs Boson | - | 0.7884 | 0.7846
Forest Tree Cover | 0.8934 | 0.9699 | 0.9508

In addition to TabNet, we also added a new Transformer based combiner and improved upon the existing concat combiner by supporting optional skip connections. These additions make Ludwig a powerful and flexible option for training deep learning models on tabular data.

Experiment Tracking and Model Serving with MLflow

MLflow is an open source experiment tracking and model registry system.

Ludwig v0.4 introduces first-class support for tracking Ludwig train, experiment, and hyperopt runs in MLflow with just a single extra command-line argument: --mlflow.

The experiment_name you provide to Ludwig will map directly to an experiment in MLflow so you can organize multiple training or hyperopt runs together.

This functionality is also exposed in the Python API through a single callback:

In addition to tracking experiment results, MLflow can also be used to store and serve models in production. Ludwig v0.4 makes it easy to take an existing Ludwig model (either saved as a directory or in an MLflow experiment) and register it with the MLflow model registry:

The Ludwig model will be converted automatically to MLflow’s model.pyfunc format, allowing it to be executed in a framework-agnostic way through a REST endpoint, Spark UDF, Python API with Pandas, etc.

Preconfigured datasets from Kaggle

Since its initial release, Ludwig has required datasets to be provided in tabular form, with a header containing names that can be referenced from the configuration file. In order to make it easy to get started with applying Ludwig to popular datasets and tasks, we’ve added a new datasets module in v0.4 that allows you to download datasets, process them into a tabular format ready for use with Ludwig, and load them into a DataFrame for training in a single line of code.

The Ludwig datasets module integrates with the Kaggle API to provide instant access to popular datasets used in Kaggle competitions. In v0.4, we provide access to popular competition datasets like Titanic, Rossmann Store Sales, Ames Housing and more. Here is an example of how to load the Titanic dataset:

Adding a new dataset is straightforward and just requires extending the Dataset abstract class and implementing minimal data manipulation code. This has allowed us to quickly expand the set of supported datasets to include SST, MNIST, Amazon Review, Yahoo Answers and many more. For a full list of the available datasets, please check the User Guide. We encourage you to contribute your own favorite datasets!

What’s Next?

Our goal is to make machine learning easier and more accessible to a broader audience. We’re excited to continue to pursue this goal with features for Ludwig in the pipeline, including:

  • End-to-end AutoML with Neural Architecture Search – Offload part or all of the work of picking the optimal search strategy, tuning parameters, and choosing encoders/combiners/decoders for your given dataset and resources during model training.
  • Combined hyperopt & distributed training – Jointly run hyperopt and distributed training to find the best model within a provided time constraint.
  • Pure TensorFlow low-latency serving – Leverage a flexible and high-performance serving system designed for production machine learning environments using TensorFlow Serving.
  • PyTorch backend – Write custom Ludwig modules using all your favorite frameworks and take advantage of the rich ecosystem each provides.

We hope that these new capabilities will make it easier for our community to continue to build state-of-the-art models. If you are as excited about this direction as we are, join our community and get involved! We are building this open source project together. We’ll keep pushing toward a release of Ludwig v0.5, and we welcome contributions from anyone who is excited to see this happen!

We also recognize that for many organizations, success with machine learning means solving many challenges end-to-end: from connecting to and accessing data, to training and deploying model pipelines, to making those models easily available to the rest of the organization.

That’s why we’re also excited to announce that we are building a new solution called Predibase, a cohesive enterprise platform built on top of Ludwig, Horovod, and Ray to help realize the vision of making machine learning easier and more accessible. We’ll be sharing more details soon, and if you’re excited to get in touch in the meantime please feel free to reach out to us at (we are hiring!).

We really hope that you find the new features in Ludwig 0.4 exciting, and want to thank our amazing community for the contributions and requests. Please drop us a comment or email with any feedback, and happy training!


A lot of work went into Ludwig v0.4, and we want to thank everyone who contributed and helped, in particular the main contributors and community members for this release: our co-maintainer Jim Thompson, Saikat Kanjilal, Avanika Narayan, Nimit Sohoni, Kanishk Kalra, Michael Zhu, Elias Castro-Hernandez, Debbie Yuen, Victor Dai. Special thanks to Stanford’s Hazy Research group led by Prof. Chris Ré for its immense support, to Richard Liaw, Hao Zhang and Micheal Chau from the Ray team, and to the LF AI & Data staff.


LF AI & Data Day EU Virtual – June 10, 2021

By Blog

Orange and the LF AI & Data Foundation are pleased to announce the upcoming LF AI & Data Day* EU Virtual, to be held via Zoom on June 10, 2021.

This virtual event will feature keynote speakers from leading AI industries with a focus on open source strategies for scaling machine learning and deep learning. The event schedule will cover various AI topics, including technical presentations on MLOps, Trusted AI, and new LF AI & Data projects.

Registration is now open and the event is free to attend. The capacity will be 100 attendees. Please see the schedule below and also visit the event website for up-to-date information.

Thursday, June 10, 1:00 – 5:00 PM CEST

1:00- 1:10

Setup video conferencing 

Welcome Message & Agenda, Orange 


5 Breakthroughs to scale ML beyond the limits of experiments, Jamil Chawki PhD, AI Program Director & Co-founder of Orange AI Marketplace, Orange Internal Networks Infrastructure & Services | Chair LF AI & Data Outreach Committee 


Open source for Advancing Education in AI Technology & Business, Tawhid Chtioui, PhD, President-founder & Dean, AIvancity School for Technology, Business & Society 


LF AI & Data Updates and the Road Ahead, Ibrahim Haddad PhD, Executive Director, LF AI & Data Foundation   


A visual and scalable DL component library for Trusted AI using  AI Explainability & Fairness 360, Adversarial Robustness Toolkit and Elyra AI pipeline editor, Romeo Kienzler, CTO and Chief Data Scientist, STSM | IBM Centre for Open Source Data and AI Technologies


Trusted AI principles RREPEATS, François Jezequel, Business Development, Orange Fab France | LF AI board member & Souad Ouali, Head of Inter-operators Relationships, Orange




RosaeNLG, an LF AI & Data Sandbox project on Natural Language Generation, Ludan Stoecklé, Founder of RosaeNLG


Accelerating AI maturity with MLOps, François Tillerot, Data-AI Product Business Owner & Co-founder of Orange AI Marketplace, Orange Business Services


Datashim, an LF AI & Data project to accelerate data access for Kubernetes/Openshift workloads, Yiannis Gkoufas, Research Software Engineer, IBM Research


Why ONNX Runtime matters for deploying AI in Institution, Xavier Tao, Data Engineer, Banque De France


Closing Session

Event host, Orange, is a leading telecommunications company with headquarters in France. They are the largest telecoms operator in France, with the bulk of their operations in Europe, Africa and the Middle East.

As an LF AI & Data General Member, Orange is involved with the LF AI & Data Governing Board, Outreach Committee, Trusted AI Committee, and is an active contributor to the LF AI & Data Acumos project.  

Note: In order to ensure the safety of our event participants and staff due to the Novel Coronavirus situation (COVID-19), the event hosts have decided to make this a virtual-only event via Zoom.

*LF AI & Data Day is a regional, one-day event hosted and organized by local members with support from LF AI & Data and its Projects. Learn more about the LF AI & Data Foundation here.


What You Cannot Miss in Any AI Implementation: Fairness

By Blog

Guest Authors: Utpal Mangla, VP & Senior Partner; Global Leader: IBM’s Telecom Media Entertainment Industry Center of Competency at IBM, & Luca Marchi, AI Innovation, Center of Competence for Telco, Media and Entertainment, IBM, & Kush Varshney, Distinguished Research Staff Member, Manager at IBM Thomas J. Watson Research Center, & Shikhar Kwatra, Data&AI Architect, AI/ML Operationalization Leader at IBM

AI Fairness

Artificial Intelligence (AI) is becoming a key cog in how the world works and how it lives. But the reality is that AI is not as widespread in critical enterprise workflows as it could be because it is not perceived to be safe, reliable, fair, and trustworthy. With increasing regulation, concern about brand reputation, burgeoning complexity, and a renewed focus on social justice, companies are not ready or willing to deploy a “science experiment” at scale in their operations. As Thomas J. Watson, Sr., an early chief executive of IBM, said, “The toughest thing about the power of trust is that it’s very difficult to build.”

We’ve seen many newsworthy examples of AI producing unfair outcomes: Black people being discriminated against in criminal recidivism prediction, low-income students systematically receiving low “predicted” exam scores when the coronavirus pandemic cancelled the real exam, men and women receiving different lending decisions despite having exactly the same assets, and many more. Why is this happening and what can we do about it?

Lessons from Commercial Aviation

It is instructive to look at the history of commercial aviation to understand what is happening with AI today. The period from the first flights by the Wright brothers and Santos-Dumont during 1903-1906 to the introduction of the commercial jetliner, the Boeing 707, in 1957 can be considered the first 50 years of aviation. This period was all about understanding how to make planes fly, with limited commercialization. In the second 50 years of aviation that followed, the fundamental nature of airplanes did not change (today’s commercial jets are basically the same as the Boeing 707), but there was a heavy emphasis on safety, efficiency, and automation. Now commercial airlines operate almost everywhere with safety records hundreds of times better than fifty years ago.

What is the lesson for AI? We are just at the beginning of the second 50 years of AI. We can trace the beginnings of AI to a 1956 conference at Dartmouth, and we can say that the first 50 years concluded when deep learning won the ImageNet competition in 2012. Just like in aviation, the first 50 years were spent on getting AI to simply work (to be competent and accurate at narrow tasks) with limited commercialization. Now our job is to work on making AI safer and more reliable, fair, trustworthy, efficient, and automated, and to bring commercialization everywhere.

Accuracy Isn’t All You Need

To make AI trustworthy, we need it to be more than accurate. We need it to be fair so that it doesn’t discriminate against certain groups and individuals based on their race, gender, or other protected social attributes. We need it to be reliable and robust so that it can be used in different settings and contexts without spectacularly falling apart. We need it to be explainable or interpretable so that people can understand how AI makes its predictions. We need it to realize when it is unsure.

The LF AI & Data Foundation’s three open source toolkits, AI Fairness 360, AI Explainability 360, and Adversarial Robustness 360 Toolbox, give practicing data scientists the means to address these needs and make AI more trustworthy. Let’s dig into fairness in more detail. Where do the problems come from and how can we mitigate them?

Where Do Fairness Issues Come From?

AI, specifically machine learning, tends to reflect back and sometimes amplify unwanted biases that are already present in society. There are four main reasons why there can be unfairness in AI:

  1. Problem misspecification – when the problem owner and data scientist pose the problem they are going to be creating a solution for, they may make choices that introduce unwanted behaviors. For example, if they want to predict whether someone will commit a crime in the future, but they design an AI system to predict whether someone will be arrested in the future, they can introduce unfairness. First, being arrested does not imply that a person is guilty of a crime. Second, there are more arrests made in neighborhoods where police patrol more often, and that is not done equally.
  2. Features containing social biases – some attributes in a dataset already contain traces of structural biases that provide systematic disadvantage to certain groups. For example, the SAT score may be used as a feature for predicting an applicant’s success in college, but it is known to already contain biases so that some minority groups do worse because of cultural knowledge embedded in the questions.
  3. Sampling biases – sometimes datasets overrepresent privileged groups and underrepresent unprivileged groups. For example, face attribute classification datasets are known to be skewed towards white males.
  4. Data preparation – one key step in AI development pipelines is feature engineering, where raw data is transformed before being fed to the AI. There are several subjective choices made in this process, some of which can lead to unfairness.

Measuring and Mitigating Unfairness

Just as there are many reasons why AI can yield unfairness, there are many ways to measure it and mitigate it. Choosing how to measure unfairness is not as easy as it sounds because different fairness metrics encode different worldviews and politics. As one option, you can measure the difference in selection rates of an AI, say the difference between the fraction of black applicants who got accepted to a college and the fraction of white applicants. As a different option, you can measure the difference in accuracy rates between the same two groups. They both sound about the same at face value but are actually quite different. In the first option, you are implicitly assuming that features have social biases (like the SAT score), but in the second option, you assume that all the unfairness is due to other reasons like sampling biases.
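To make the two options concrete, here is a small sketch that computes both gaps on made-up toy data. The group names, labels, and predictions are invented for illustration and do not come from any real admissions system.

```python
def selection_rate(predictions):
    """Fraction of individuals who received the favorable prediction (1)."""
    return sum(predictions) / len(predictions)


def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for y, p in zip(labels, predictions) if y == p)
    return matches / len(labels)


# Made-up toy data: 1 = accepted / would succeed, 0 = otherwise.
data = {
    "group_a": {"labels": [1, 1, 0, 0], "predictions": [1, 1, 1, 0]},
    "group_b": {"labels": [1, 1, 0, 0], "predictions": [1, 0, 0, 0]},
}

a, b = data["group_a"], data["group_b"]

# Option 1: difference in selection rates between the two groups.
selection_rate_gap = selection_rate(b["predictions"]) - selection_rate(a["predictions"])

# Option 2: difference in accuracy between the same two groups.
accuracy_gap = accuracy(b["labels"], b["predictions"]) - accuracy(a["labels"], a["predictions"])

print(f"selection rate gap: {selection_rate_gap:+.2f}")
print(f"accuracy gap: {accuracy_gap:+.2f}")
```

On this toy data the model is equally accurate for both groups, yet it selects group_b far less often, so the two metrics give opposite verdicts about the same predictions.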

If you measure that an AI system is behaving unfairly, what can you do about it? You can apply one of many possible bias mitigation algorithms. The basic idea of bias mitigation is that you want a sort of statistical independence between protected attributes like ethnicity or gender and the predicted outcome like success in college. Statistical independence is the notion that two dimensions are unlinked and have no relationship with each other. There are many statistical methods that encourage independence, but that is a longer discussion for another day. Feel free to check out the AI Fairness 360 documentation for more details about bias mitigation if you can’t wait!
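As one hedged illustration of encouraging independence, the sketch below implements reweighing, a well-known preprocessing technique that is among the algorithms available in AI Fairness 360. It is our choice of example, not necessarily the method the discussion above has in mind, and the toy data is made up. Each training example receives the weight P(group) × P(label) / P(group, label), so that in the weighted data the label no longer depends on the protected attribute.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Return a weight for each (group, label) pair.

    weight = P(group) * P(label) / P(group, label), estimated from counts.
    """
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return {
        (g, y): (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for (g, y) in joint_counts
    }


def weighted_positive_rate(groups, labels, weights, group):
    """Weighted fraction of favorable labels (1) within one group."""
    members = [(g, y) for g, y in zip(groups, labels) if g == group]
    total = sum(weights[pair] for pair in members)
    positive = sum(weights[pair] for pair in members if pair[1] == 1)
    return positive / total


# Made-up toy data: group "a" has 4 of 6 favorable labels, group "b" has 1 of 4.
groups = ["a"] * 6 + ["b"] * 4
labels = [1, 1, 1, 1, 0, 0] + [1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
for group in ("a", "b"):
    rate = weighted_positive_rate(groups, labels, weights, group)
    print(f"weighted favorable rate for group {group}: {rate:.2f}")
```

After reweighing, both groups have the same weighted favorable rate, equal to the overall rate in the data, which is what statistical independence between group and label requires.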
