All Posts By

Christina Harter

LF AI & Data Day ONNX Community Virtual Meetup – March 2021

By Blog

The LF AI & Data Foundation is pleased to sponsor the upcoming LF AI & Data Day* – ONNX Community Virtual Meetup – March 2021, to be held via Zoom on March 24, 2021.

ONNX, an LF AI & Data Foundation Graduated Project, is an open format to represent deep learning models. With ONNX, AI developers can more easily move models between state-of-the-art tools and choose the combination that is best for them. 

The virtual meetup will cover ONNX Community updates, partner/end-user stories, and SIG/WG updates. Thanks to Baidu and Ti Zhou, who volunteered to host the workshop. If you are using ONNX in your services and applications, building software or hardware that supports ONNX, or contributing to ONNX, you should attend! This is a great opportunity to connect with and hear from people working with ONNX across many companies.

Registration is now open and the event is free to attend. Capacity will be limited to 500 attendees. Please see the event schedule below and also visit the event website for up-to-date information on this virtual meetup.

8:00 AM China (Thur 3/25)  
5:00 PM PT/USA (Wed 3/24)  
Event Kickoff – Agenda Review
Host: Ti Zhou (Baidu)

ONNX Progress Update
Speakers: ONNX Steering Committee
Prasanth, Harry, Jim, Joohoon, Sheng
8:25 AM China (Thur 3/25)  
5:25 PM PT/USA (Wed 3/24)
Community Presentations – Agenda Review (10 minute short talks)
Host: Ti Zhou (Baidu)

popONNX: Support ONNX on IPU
Speaker: Han Zhao (GraphCore-UK)

Spring Project: Multi Backend Neural Network Auto Quantization and Deploy over ONNX
Speaker: Yu Feng Wei (SenseTime-HongKong)

ONNX Runtime for Mobile Scenarios: From model to on-device inferencing
Speaker: Tom Wildenhain (Microsoft-USA) and Scott McKay (Microsoft-Australia)

Introduction to DL Framework PaddlePaddle and Paddle2ONNX Module
Speaker: Wranky Wang (Baidu-China)

ONNX on microcontrollers
Speaker: Rohit Sharma (AITechSystems-USA_CA)

Monitoring and Explaining ONNX Models in Production
Speaker: Krishna Gade (FiddlerAI-USA_CA)

ONNX client for Acumos
Speaker: Philippe Dooze (Orange-France)

Deploy ONNX model seamlessly across the cloud, edge, and mobile devices using MindSpore
Speaker: Leon Wang (Huawei-China)

ONNX Runtime Training
Speaker: Peng Wang (Microsoft_China)

Quantization support for ONNX using LPOT (Low precision optimization tool)
Speakers: Haihao Shen (Intel – China) and Saurabh Tangri (Intel)
Contact: Rajeev Nalawadi (Intel-China)
10:15 AM China (Thur 3/25)
7:15 PM PT/USA (Wed 3/24)
SIGs and WGs Updates – Agenda Review (10 minute talks)
Speaker: Ti Zhou (Baidu)

Architecture/Infrastructure SIG Update
Chair: Ashwini Khade (Microsoft)

Operators SIG Update
Co-Chairs: Michał Karzyński (Intel) and Ganesan Ramalingen (Microsoft)

Converters SIG Update
Co-Chairs: Guenther Schmuelling (Microsoft), Kevin Chen (Nvidia), Chin Huang (IBM)

Model Zoo/Tutorials SIG Update
Co-Chairs: Wenbing Li (Microsoft) and Vinitra Swamy (Microsoft)

Q&A and Discussion

Want to get involved with ONNX? Be sure to join the ONNX-Announce mailing list to join the community and stay connected on the latest updates. You can join technical discussions on GitHub and more conversations with the community on LF AI & Data Slack’s ONNX channels.

Note: In order to ensure the safety of our event participants and staff due to the Novel Coronavirus situation (COVID-19) the ONNX Steering Committee decided to make this a virtual-only event via Zoom.

*LF AI & Data Day is a regional, one-day event hosted and organized by local members with support from LF AI & Data and its Projects. Learn more about the LF AI & Data Foundation here.

ONNX Key Links
The LF AI & Data Foundation is pleased to welcome you to LF AI & Data Day* – ONNX Community Virtual Meetup – March 2021. The event will be held online via Zoom on March 25, China time. It is dedicated to ONNX, a Graduated Project of the LF AI & Data Foundation.

The meetup will cover ONNX community updates, partner/end-user stories, and SIG/WG updates. Thanks to Baidu and Ti Zhou for volunteering to host the workshop.

If you are using ONNX in your services and applications, building software or hardware that supports ONNX, or contributing to ONNX, be sure to attend! This is a great opportunity to meet and connect with people from the many companies working with ONNX.

Note: Due to the novel coronavirus (COVID-19) situation, and to ensure the safety of our event participants and staff, the ONNX Steering Committee decided to hold this event online via Zoom.

*LF AI & Data Day is a regional, one-day event hosted and organized by local members with support from LF AI & Data and its projects. Learn more about the LF AI & Data Foundation here.

LF AI & Data Resources

LF AI & Data Foundation Announces Graduation of Pyro Project


The LF AI & Data Foundation, the organization building an ecosystem to sustain open source innovation in artificial intelligence (AI), machine learning (ML), deep learning (DL), and data open source projects, is announcing today that hosted project Pyro is advancing from an Incubation level project to a Graduate level. This graduation is the result of Pyro demonstrating thriving adoption, an ongoing flow of contributions from multiple organizations, and both documented and structured open governance processes. Pyro has also achieved a Core Infrastructure Initiative Best Practices Badge, and demonstrated a strong commitment to community.

As an Incubation Project, Pyro utilized the LF AI & Data Foundation’s various enablement services to foster its growth and adoption; including program management support, event coordination, legal services, and marketing services ranging from website creation to project promotion. 

Pyro is a universal probabilistic programming language (PPL) written in Python and supported by either PyTorch or JAX on the backend. Pyro enables flexible and expressive deep probabilistic modeling, unifying the best of modern deep learning and Bayesian modeling.
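The Bayesian modeling Pyro builds on can be illustrated with a standard-library-only toy example. To be clear, this is not Pyro code; it is the conjugate beta-Bernoulli update that PPLs like Pyro generalize to far richer models:

```python
# Toy Bayesian inference with the standard library only -- NOT Pyro code,
# just the beta-Bernoulli conjugate update that PPLs like Pyro generalize.
from fractions import Fraction

def beta_bernoulli_update(alpha, beta, observations):
    """Update a Beta(alpha, beta) prior on a coin's bias given 0/1 observations."""
    heads = sum(observations)
    tails = len(observations) - heads
    return alpha + heads, beta + tails

# Start from a uniform prior Beta(1, 1) and observe 8 heads out of 10 flips.
alpha, beta = beta_bernoulli_update(1, 1, [1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
posterior_mean = Fraction(alpha, alpha + beta)  # Beta mean = alpha / (alpha + beta)
print(posterior_mean)  # 3/4
```

Pyro's contribution is doing this kind of posterior inference automatically, via stochastic variational inference and other algorithms, for models where no closed-form update exists.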

It was open sourced by Uber, the project founder, and joined LF AI & Data as an Incubation Project in January 2019. 

“The journey of Pyro from Incubation to Graduation has been very impressive,” said Dr. Ibrahim Haddad, Executive Director of the LF AI & Data Foundation. “The development activities, the growth of its community, and its adoption are particularly noteworthy. Pyro has exceeded our graduation criteria and we’re proud to be its host Foundation and to support it across a number of services. As a Graduate project, our support for the Pyro project and its community will continue, and we’re excited to have the project represented as a voting member on our Technical Advisory Council. Congratulations, Pyro!”

“We on the Pyro team have been happily surprised at the wide adoption of Pyro in both industry and the sciences. Since branching out to provide NumPyro (a JAX-based implementation of Pyro), we’ve seen a growth in the diversity of contributors, from applied scientists and statistics practitioners to machine learning researchers. A big part of Pyro’s growth is due to user trust in our being part of a neutral foundation rather than a single company.” said Pyro project lead, Fritz Obermeyer. 

2020 in Numbers

The stats below capture Pyro’s development efforts from January 1, 2020 to December 14, 2020:

  • 6.6k GitHub stars
  • 797 GitHub forks
  • 324 GitHub dependents
  • >100 contributors
  • >1,000 forum topics

Curious about how to get involved with Pyro? 

Check out their Getting Started Guide. And be sure to join the Pyro Announce and Pyro Technical-Discuss mailing lists to join the community and stay connected on the latest updates. 

Congratulations to the Pyro team! We look forward to continued growth and success as part of the LF AI & Data Foundation. To learn about hosting an open source project with us, visit the LF AI & Data Foundation website.

Pyro Key Links

LF AI & Data Resources

JanusGraph Joins LF AI & Data as New Incubation Project


LF AI & Data Foundation—the organization building an ecosystem to sustain open source innovation in artificial intelligence (AI), machine learning (ML), deep learning (DL), and data open source projects, today is announcing JanusGraph as its latest Incubation Project. 

JanusGraph is a distributed, scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. The project was launched in 2017 through a partnership with organizations including Expero, Google, GRAKN.AI, Hortonworks, IBM, and others.

Dr. Ibrahim Haddad, Executive Director of LF AI & Data, said: “We’re really excited with JanusGraph joining LF AI & Data alongside several other AI and Data projects. JanusGraph provides the capability for storing and processing large-scale connected data, which is proving to be very useful for projects in many domains, including IoT, Social Networks, Malware & Fraud detection, Identity and Access Management, etc. These areas can also benefit from intelligent analytics and predictions from AI & machine learning, a key focus of the LF AI & Data Foundation. We look forward to working with the community to grow the project’s footprint and to create new collaboration opportunities with our members and other hosted projects.” 

JanusGraph is an open source, distributed graph database under The Linux Foundation. JanusGraph is available under the Apache License 2.0. The project recently reached the #5 position on the December 2020 global graph database ranking by DB-Engines. JanusGraph was originally forked from the TitanDB graph database, which has been developed since 2012. The first version of JanusGraph (v0.1.0) was released on April 20, 2017.

JanusGraph supports various storage backends, including Apache Cassandra, Apache HBase, Google Cloud Bigtable, Oracle BerkeleyDB, and Scylla. Additionally, JanusGraph supports third-party storage adapters for use with other storage backends, such as Aerospike, DynamoDB, and FoundationDB.

In addition to online transactional processing (OLTP), JanusGraph supports global graph analytics (OLAP) with its Apache Spark integration. JanusGraph supports geo, numeric range, and full-text search via external index storages (Elasticsearch, Apache Solr, Apache Lucene). JanusGraph has native integration with the Apache TinkerPop graph stack, including Gremlin graph query language and graph server.
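As a toy illustration of the property-graph model JanusGraph implements at scale, a graph can be thought of as vertices with properties plus labeled edges. This sketch is plain Python, not JanusGraph's storage engine or the real Gremlin API; the vertex and edge data are invented:

```python
# Toy property graph in plain Python -- illustrative only. JanusGraph stores
# this model in distributed backends and queries it via Gremlin traversals.
vertices = {
    1: {"label": "person", "name": "alice"},
    2: {"label": "person", "name": "bob"},
    3: {"label": "device", "name": "router-7"},
}
# Edges as (out_vertex, edge_label, in_vertex) triples.
edges = [
    (1, "knows", 2),
    (1, "uses", 3),
    (2, "uses", 3),
]

def out_neighbors(v, edge_label):
    """Roughly analogous to the Gremlin traversal g.V(v).out(edge_label)."""
    return [vertices[dst] for src, lbl, dst in edges if src == v and lbl == edge_label]

print([n["name"] for n in out_neighbors(1, "uses")])  # ['router-7']
```

JanusGraph's value lies in making traversals like this work when the vertex and edge sets are sharded across a cluster rather than held in local dictionaries.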

JanusGraph is a valuable graph database because it is designed as a layer on top of other databases, which lets its developers focus on challenges specific to graphs. Instead of spending time “reinventing the wheel,” they can leverage existing stores that focus on low-level storage optimization, performance, consistency, and compression, while JanusGraph concentrates on query optimization, an up-to-date implementation of the TinkerPop stack, and integration between data storage and index storage.

Oleksandr Porunov, member of the JanusGraph Technical Steering Committee, said: “On behalf of the JanusGraph Technical Steering Committee, we are excited to join LF AI & Data Foundation. JanusGraph has a number of applications in wide-ranging domains — including financial services, security, and Internet of Things — which benefit from managing and analyzing large amounts of connected data to derive insights and make intelligent predictions using relationships inherent in their data with the help of machine learning. We look forward to collaborating with other projects under the LF AI & Data umbrella to enable solving complex, large-scale problems with solutions built on scalable storage, analytics, and machine learning.”

LF AI & Data supports projects via a wide range of services, and the first step is joining as an Incubation Project. LF AI & Data will support the neutral open governance for JanusGraph to help foster the growth of the project. Learn more about JanusGraph on their GitHub and be sure to join the JanusGraph-Announce and JanusGraph-Dev mailing lists to join the community and stay connected on the latest updates.

A warm welcome to JanusGraph! We look forward to the project’s continued growth and success as part of the LF AI & Data Foundation. To learn about how to host an open source project with us, visit the LF AI & Data website.

JanusGraph Key Links

LF AI & Data Resources

Ludwig Joins LF AI & Data as New Incubation Project


LF AI & Data Foundation—the organization building an ecosystem to sustain open source innovation in artificial intelligence (AI), machine learning (ML), deep learning (DL), and Data open source projects, today is announcing Ludwig as its latest Incubation Project.

Ludwig is a toolbox built on top of TensorFlow that allows users to train and test deep learning models without writing code. All you need to provide is your data, a list of fields to use as inputs, and a list of fields to use as outputs; Ludwig will do the rest. Simple command line interfaces and programmatic APIs can be used to train models both locally and in a distributed way, and to use them to predict on new data. Ludwig was released and open sourced by Uber.

“We are very pleased to welcome Ludwig to LF AI. AI, ML, and DL can be perceived as difficult technologies to use. Ludwig gives less experienced engineers and data scientists the opportunity to use DL models in their work by providing easy-to-use tools and APIs,” said Dr. Ibrahim Haddad, Executive Director of LF AI & Data. “We look forward to supporting this project and helping it to thrive under a neutral, vendor-free, and open governance.” LF AI & Data supports projects via a wide range of benefits, and the first step is joining as an Incubation Project.

Dr. Piero Molino, Ludwig’s creator and maintainer, said: “I’m excited about Ludwig joining the Linux Foundation. The open governance will allow for both increased participation from the community and companies already using it as well as opening the door to new collaborations. This is definitely a step towards Ludwig’s goal of democratizing AI, ML and DL.” 

LF AI & Data will support the neutral open governance for Ludwig to help foster the growth of the project. Key features for Ludwig include:

  • General: A new data type-based approach to deep learning model design that makes the tool suited for many different applications.
  • Flexible: Experienced users have deep control over model building and training, while newcomers will find it easy to use.
  • Extensible: Easy to add new model architecture and new feature data-types.
  • Understandable: Deep learning model internals are often considered black boxes, but Ludwig provides standard visualizations to understand their performance and compare their predictions.
  • Easy: No coding skills are required to train a model and use it for obtaining predictions.
  • Open: Ludwig is released under the open source Apache License 2.0.

Ludwig’s type-based abstraction allows users to define combinations of input and output types to create deep learning models that solve many different tasks without writing code: a text classifier can be trained by specifying text as input and category as output; an image captioning system by specifying image as input and text as output; a speaker verification model by providing two audio inputs and a binary output; and a time series forecaster by providing a time series as input and a numerical value as output. By combining different data types, the number of possible tasks is limitless.
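The text-classifier case is expressed declaratively in Ludwig's YAML model definition. A minimal sketch follows; the column names (`review`, `sentiment`) are hypothetical and would match columns in your own dataset:

```yaml
# Illustrative Ludwig model definition: text in, category out.
# Column names ("review", "sentiment") are hypothetical.
input_features:
  - name: review
    type: text
output_features:
  - name: sentiment
    type: category
```

Training then reduces to a single CLI call pointing at this file and a dataset (exact flags vary by Ludwig version), with no model code written by the user.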

Despite not requiring any coding skills, Ludwig also provides an extremely simple programmatic interface that allows for training deep learning models and using them for prediction in just a couple of lines of code. It also comes with built-in REST serving capabilities, visualizations of models and predictions, and extensible interfaces for adding your own models and hyperparameter optimization.

Check out the Getting Started guide to start working with Ludwig today. Learn more about Ludwig on their website and be sure to join the Ludwig-Announce and Ludwig-Technical-Discuss mailing lists to join the community and stay connected on the latest updates.

A warm welcome to Ludwig and we look forward to the project’s continued growth and success as part of the LF AI & Data Foundation. To learn about how to host an open source project with us, visit the LF AI & Data website.

Ludwig Key Links

LF AI & Data Resources

Access other resources on LF AI & Data’s GitHub or Wiki

New LF AI & Data Member Welcome – Q3/Q4 2020


2020 has been a year of growth for LF AI. We welcomed 22 new members throughout the year, which means we now have a total of 40 members across our Premier, General, and Associate memberships. In October, LF AI extended its scope to include a data domain by joining efforts with ODPi and forming the LF AI & Data Foundation.

The LF AI & Data Foundation will build and support an open community and a growing ecosystem of open source AI, data and analytics projects, by accelerating development and innovation, enabling collaboration and the creation of new opportunities for all the members of the community.

In Q3 and Q4, we welcomed 16 new members to the LF AI & Data Foundation. Learn more about each of these organizations below:

Premier Members

The LF AI & Data Premier membership is for organizations who contribute heavily to open source AI, ML, DL, and Data and bring in their own projects to be hosted at the Foundation. They work in concert with LF AI & Data team members. These companies want to take the most active role in enabling open source AI, ML, DL, and Data.

We welcomed 1 new Premier Member, SAS Institute, in Q3/Q4. Learn more about this organization in their own words below:

SAS Institute

SAS is the leader in analytics. Through innovative software and services, SAS empowers and inspires customers around the world to transform data into intelligence. SAS is a trusted analytics powerhouse for organizations seeking immediate value from their data. A deep bench of analytics solutions and broad industry knowledge keep our customers coming back and feeling confident. With SAS®, you can discover insights from your data and make sense of it all. Identify what’s working and fix what isn’t. Make more intelligent decisions. And drive relevant change.

General Members

The LF AI & Data General membership is targeted for organizations that want to put their organization in full view in support of LF AI & Data and our mission. Organizations that join at the General level are committed to using open source technology, helping LF AI & Data grow, voicing the opinions of their customers, and giving back to the community.

In the second half of 2020, we welcomed 9 new General Members – AlphaBravo, Broadcom, Cloudera, D2IQ, Databricks, Herron Tech, Index Analytics, ING Bank and Precisely. Learn more about these organizations in their own words below:


AlphaBravo operates at the nexus of technical innovation and supports mission needs for federal and commercial customers.  Our team comes from all over the technology grid allowing us to bring a unique perspective to each engagement.


Broadcom Inc. is a global infrastructure technology leader built on 50 years of innovation, collaboration and engineering excellence. With roots based in the rich technical heritage of AT&T/Bell Labs, Lucent and Hewlett-Packard/Agilent, Broadcom focuses on technologies that connect our world. Through the combination of industry leaders Broadcom, LSI, Broadcom Corporation, Brocade, CA Technologies and Symantec, the company has the size, scope and engineering talent to lead the industry into the future.


At Cloudera, we believe that data can make what is impossible today, possible tomorrow. To that end, we deliver an enterprise data cloud that provides cloud-native services to manage and secure the entire data lifecycle from ingest to experimentation, from the Edge to AI, in any cloud or data center.


The journey to the cloud is filled with an abundance of decisions to be made — from the technologies you select, to the frameworks you decide on, to the management tools you’ll use. What you need is a trusted guide that’s been down this path before. That’s where D2iQ can help. D2iQ simplifies and automates the really difficult tasks needed for enterprise Kubernetes in production at scale, reducing operational burden and TCO. With an open source and flexible approach to automation, the D2iQ Kubernetes Platform will deliver the results required—regardless if you are deploying in the cloud, on-premise, in a highly secure air-gapped environment, or on the edge. As a cloud native pioneer, we have more than a decade of experience tackling the most complex, mission-critical deployments in the industry. For more information visit


With origins in both academia and the open-source community, Databricks has always been devoted to simplifying data, sharing knowledge and pursuing truths. Founded in 2013 by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks brings together data engineering, science and analytics on an open, unified platform so data teams can collaborate and innovate faster. More than five thousand organizations worldwide —including Shell, Conde Nast and Regeneron — rely on Databricks as a unified platform for massive-scale data engineering, collaborative data science, full-lifecycle machine learning and business analytics. Venture-backed and headquartered in San Francisco (with offices on four continents) and hundreds of global partners, including Microsoft, Amazon, Tableau, Informatica, Cap Gemini and Booz Allen Hamilton, Databricks is on a mission to help data teams solve the world’s toughest problems.

Herron Tech

Innovation, Time-to-Market, Standardization, Speed, Speed, Speed and Fail Fast – these are just a few of the terms that you will hear from us. Herron Tech’s mission is to empower enterprises with the digital business agility required to respond quickly and effectively to emerging threats to their business, and proactively seize new market opportunities. Herron Tech has leveraged the SOAJS Open Source Platform to provide an array of turnkey solutions that are designed to significantly increase stability, standardization and ROI, while decreasing valuable time-to-market.

Index Analytics

At Index Analytics, we bring together business and IT professionals to drive results and boost engagement for clients in federal health. With a passion for data science and customer experience, and a commitment to staying accessible, we’ve built a better approach to federal health IT and the programs it supports ‐ and we want to put it to work for you.

ING Bank 

ING is a global bank with a strong European base. Our 53,000 employees serve around 38.4 million customers, corporate clients and financial institutions in over 40 countries. Our purpose is to empower people to stay a step ahead in life and in business. Our products include savings, payments, investments, loans and mortgages in most of our retail markets. For our Wholesale Banking clients we provide specialised lending, tailored corporate finance, debt and equity market solutions, payments & cash management and trade and treasury services. Customer experience is what differentiates us and we’re continuously innovating to improve it. We also partner with others to bring disruptive ideas to market faster.


Precisely are the architects behind the accuracy and consistency of your data. Our approach gives you the confidence and context to reach beyond today’s performance. We move and help process data with integrity, giving tomorrow’s market leaders the ability to make better decisions and, ultimately, build new possibilities. With unmatched expertise across data domains, disciplines and platforms, we equip you with high quality, enriched insights that fuel innovation and power decision-making at scale. Simply put, we build trust in your data.

Associate Members

The LF AI & Data Associate membership is reserved for pre-approved non-profits, open source projects, and government entities. 

In the second half of 2020, we welcomed 6 new Associate Members – aivancity School for Technology, OpenI, Peng Cheng Laboratory, Shanghai OpenSource Information Technology Association, Université Libre de Tunis, and XPRIZE. Learn more about these organizations in their own words below: 

aivancity School for Technology

Aivancity is a hybrid school built around artificial intelligence, business, and ethics. Aivancity took the form of a mission-driven company, becoming the first higher education institution in France to adopt this status. aivancity, school for technology, business and society, places commitments to employability, diversity, responsibility, territorial anchoring, and openness to the city at the heart of its statutes. The values it carries are also the missions it sets for itself; they constitute the very essence of the school and will be monitored by an external body.


OpenIntelligence Platform (OpenI for short) is a new-generation AI open source platform initiated by the AITISA technology innovation strategic alliance, which organizes industry, universities, and research institutes to collaboratively build a shared open source software, hardware, and open data community. It carries the mission and ambition of being a “new generation of AI open source platform.” The platform aims to promote open source, openness, and collaborative innovation in the field of artificial intelligence; build the OpenI technology chain, innovation chain, and ecological chain; and promote the healthy, rapid development of the artificial intelligence industry and its wide application across social and economic fields.

Peng Cheng Laboratory

Peng Cheng Laboratory (PCL) is a new type of scientific research institution in the field of network communications in China. PCL focuses on the strategic, forward-looking, original scientific research and core technology development in the related fields. PCL is headquartered in Shenzhen, Guangdong, with main research themes in network communication, cyberspace and network intelligence. As an integral part of China’s national strategic scientific and technological initiatives, PCL is committed to serving China’s national developmental scheme in broadband communications, future network, as well as serving its key role in establishing the Guangdong-Hong Kong-Macao Greater Bay Area, and helping Shenzhen building itself towards a pioneering demonstration zone with Chinese characteristics.

Shanghai OpenSource Information Technology Association

Shanghai Open Source Information Technology Association is a professional non-profit social organization committed to open source information technology innovation and industry development. It was voluntarily formed by enterprises, universities, research institutes, social organizations and professionals.

Université Libre de Tunis

The Tunisia Private University (ULT) is a university in Tunis, Tunisia. It was founded in 1973 and is organized in six faculties.


XPRIZE, a 501(c)(3) nonprofit, is the global leader in designing and implementing innovative competition models to solve the world’s grandest challenges. XPRIZE has designed and operated global competitions in the domain areas of Space, Oceans, Learning, Health, Energy, Environment, Transportation, Safety and Robotics. Active competitions include the $20 Million NRG COSIA Carbon XPRIZE, the $15 Million XPRIZE Feed The Next Billion, the $10 Million Rainforest XPRIZE, the $10 Million ANA Avatar XPRIZE, the $5 Million IBM Watson AI XPRIZE, the $5 Million XPRIZE Rapid Reskilling, the XPRIZE NextGen Mask Challenge, and the $5 Million XPRIZE Rapid COVID Testing. For more information, visit

Welcome New Members!

We look forward to partnering with these new LF AI & Data Foundation members to help support open source innovation and projects within the artificial intelligence (AI), machine learning (ML), deep learning (DL), and data space. Welcome to our new members!

Interested in joining the LF AI & Data community as a member? Learn more here and email for more information and/or questions. 

LF AI & Data Resources

DataPractices Joins LF AI & Data as New Incubation Project


LF AI & Data Foundation—the organization building an ecosystem to sustain open source innovation in artificial intelligence (AI), machine learning (ML), deep learning (DL), and Data open source projects, today is announcing DataPractices as its latest Incubation Project. 

DataPractices is a “Manifesto for Data Practices,” comprising values and principles that illustrate the most effective, modern, and ethical approach to data teamwork. DataPractices was released and open sourced based on the output and work stemming from the Open Data Science Leadership Summit in 2017.

Dr. Ibrahim Haddad, Executive Director of LF AI & Data, said: “We’re really excited with DataPractices joining LF AI & Data alongside several other data and AI projects. We look forward to growing the project’s community and to creating new collaboration opportunities between DataPractices, our members, and other hosted projects.”

The project was originally launched as the DataPractices manifesto, the result of the Open Data Science Leadership Summit. This summit, held on 10 Nov 2017, was a collaborative gathering of some of the leaders in the Data Science, Semantics, Journalism, and Visualization worlds. The aim was to start discussions that could help crystallize thinking and action in the industry to better define Data Science and help the community evolve with cohesion.

As a part of the ongoing DataPractices effort to increase awareness and data literacy in the data ecosystem, it was always the plan to move beyond the “words on a page” of the Manifesto and give people real education and tools to help them increase their capabilities around modern data teamwork. Since the original launch, many contributors have helped to contribute open courseware content. All of the assets and courseware developed as a part of this effort will continue to be freely available to all, for personal or commercial use.

Patrick McGarry, founder of DataPractices, said: “ was started to act as a community-led resource to help accelerate the adoption of data literacy efforts. No one knows how to bridge commercial resources and community engagement like The Linux Foundation, so it’s incredibly exciting to be a part of their ongoing commitment to the data ecosystem.”

LF AI & Data supports projects via a wide range of services, and the first step is joining as an Incubation Project. LF AI & Data will support the neutral open governance for DataPractices to help foster the growth of the project. Learn more about DataPractices on their GitHub and be sure to join the DataPractices-Announce and DataPractices-Technical-Discuss mailing lists to join the community and stay connected on the latest updates.

A warm welcome to DataPractices! We look forward to the project’s continued growth and success as part of the LF AI & Data Foundation. To learn about how to host an open source project with us, visit the LF AI & Data website.

DataPractices Key Links

LF AI & Data Resources

Egeria 2.5 Release Now Available!


Egeria, an LF AI & Data Foundation Graduate Project, has released version 2.5! Egeria is an open source project dedicated to making metadata open and automatically exchanged between tools and platforms, no matter which vendor they come from. 

In version 2.5, Egeria adds a variety of improvements. Highlights include:

  • The following improvements to the presentation-server user interface:
    • The Type Explorer UI
      • supports options to show/hide deprecated types and/or deprecated attributes. Please refer to the Type Explorer for details.
      • preserves the user-selected focus type across reloads of type information from the repository server.
    • The Repository Explorer UI
      • has the Enterprise option enabled by default. It can be disabled to perform more specific, localized queries.
      • now indicates whether an instance was returned by an enterprise or local scope operation against its home repository or is a reference copy or proxy.
      • has a user-settable limit on the number of search results (and a warning to the user if it is exceeded)
      • now colors nodes based on their home metadata collection’s ID. Previously the metadata collection’s name was used, but a name can be changed, whereas the metadata collection’s ID is permanent.
      • has improved help information covering search.
    • The Dino UI
      • displays a server’s status history in a separate dialog instead of inline in the server details view.
  • The following improvements to the repositories:
    • The Graph Repository
      • Find methods have reinstated support for core properties; this support was temporarily disabled due to property name clashes, which are now resolved.
  • A new type ‘OpenMetadataRoot’ has been added as the root type for all Open Metadata Types. See the base model.
  • The admin services guide has some additional information on configuring TLS security
  • Improvements to the Gradle build scripts; the Gradle build remains incomplete at this point, and building Egeria still requires Maven
  • Bug Fixes
  • Dependency Updates

To learn more about the Egeria 2.5 release, check out the full release notes. Want to get involved with Egeria? Be sure to join the Egeria-Announce and Egeria Technical-Discuss mailing lists to join the community and stay connected on the latest updates. 

Congratulations to the Egeria team and we look forward to continued growth and success as part of the LF AI & Data Foundation! To learn about hosting an open source project with us, visit the LF AI & Data Foundation website.

Egeria Key Links

LF AI & Data Resources

Sparklyr 1.5 Release Now Available!

By Blog

Sparklyr, an LF AI & Data Foundation Incubation Project, has released version 1.5! Sparklyr is an R Language package that lets you analyze data in Apache Spark, the well-known engine for big data processing, while using familiar tools in R. The R Language is widely used by data scientists and statisticians around the world and is known for its advanced features in statistical computing and graphics. 

In version 1.5, sparklyr adds a variety of improvements. Highlights include:

  • A large number of pieces of user feedback were addressed in this release, especially ones related to the `dplyr` interface of `sparklyr`. Spark dataframes now work with a larger number of `dplyr` verbs in the same way that R dataframes do.
  • There were four useful additions to the `sdf_*` family of functions.
    • As the name suggests, functions starting with the prefix `sdf_` in `sparklyr` are ones interfacing with Spark dataframes.
    • `sdf_expand_grid()` performs the equivalent of `expand.grid()` with Spark dataframes
    • `sdf_partition_sizes()` computes partition size(s) of a Spark dataframe efficiently
    • `sdf_unnest_longer()` and `sdf_unnest_wider()` are Spark equivalents of `tidyr::unnest_longer()` and `tidyr::unnest_wider()`. They can be used to transform a struct column within a Spark dataframe.
      • `sdf_unnest_longer()` transforms fields in a struct column into new rows
      • `sdf_unnest_wider()` transforms fields in a struct column into new columns
  • The default non-arrow-based serialization format in sparklyr used to be CSV. It is now RDS starting from `sparklyr` 1.5 (a lot more detailed context can be found here).
    • Correctness issues that were previously hard to fix with CSV serialization were resolved easily with the new RDS format.
    • There was some performance improvement from RDS serialization too.
    • One can now import binary columns from R dataframe to Spark with RDS serialization.
  • RDS serialization also facilitated a reduction of serialization overhead in the Spark-based `foreach` parallel backend.

As usual, there is strong support for `sparklyr` from our fantastic open-source community! In chronological order, we thank the following individuals for making their pull request part of `sparklyr` 1.5:

To learn more about the sparklyr 1.5.0 release, check out the full release notes. Want to get involved with sparklyr? Be sure to join the sparklyr-Announce and sparklyr Technical-Discuss mailing lists to join the community and stay connected on the latest updates. 

Congratulations to the sparklyr team and we look forward to continued growth and success as part of the LF AI & Data Foundation! To learn about hosting an open source project with us, visit the LF AI & Data Foundation website.

sparklyr Key Links

LF AI & Data Resources

Human-Centered AI for BI Industry

By Blog

Guest Authors: Cupid Chan – BI & AI Committee Chair, Xiangxiang Meng, Scott Rigney, Dalton Ruer, Sachin Sinha, and Gerard Valerio – Original white paper here.

Introduction by Cupid Chan, BI & AI Committee Chair

When we want to “talk” to a database, the language we need to know is called Structured Query Language (SQL). It is, as its name suggests, very structured, and from a functional perspective the language fulfills its purpose pretty well, as we can tell the machine to execute our instructions. However, when you look at this from a user-friendliness perspective, even professionals may sometimes have trouble understanding the intricacies of the language, because it was not designed with user experience in mind when it was first launched in the 1970s. That’s why Business Intelligence (BI) tools were born and became the translator, introducing an interface on top of SQL to allow better human interaction. BI converts a high-level request into low-level SQL for a database to execute. This shifts the paradigm from computer-centric to a more human-centered approach for the end users.
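To make the translation concrete, here is a minimal sketch of how a BI layer might turn a drag-and-drop style request into SQL. The request format and field names are hypothetical, invented for illustration; no vendor's actual API is implied.

```python
# A toy "BI translator": a declarative request dict becomes a SQL string.
def build_query(request):
    """Turn a simple high-level request into low-level SQL."""
    columns = ", ".join(request["columns"])
    sql = f"SELECT {columns} FROM {request['table']}"
    filters = request.get("filters", [])
    if filters:
        clauses = " AND ".join(
            f"{f['field']} {f['op']} {f['value']!r}" for f in filters
        )
        sql += f" WHERE {clauses}"
    if "group_by" in request:
        sql += " GROUP BY " + ", ".join(request["group_by"])
    return sql

# What a dashboard might emit when a user drags "region" and "revenue"
# onto a chart and filters to a single year.
request = {
    "table": "sales",
    "columns": ["region", "SUM(revenue)"],
    "filters": [{"field": "year", "op": "=", "value": 2020}],
    "group_by": ["region"],
}
print(build_query(request))
# SELECT region, SUM(revenue) FROM sales WHERE year = 2020 GROUP BY region
```

The end user only ever sees the drag-and-drop gestures; the SQL stays behind the curtain, which is exactly the abstraction that made BI human-centered.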

History repeats itself. Today, we have another set of “languages” because of the new wave of Artificial Intelligence (AI). Even though a lot more people are diving in to learn these languages, and their popularity is growing at a much faster rate than when SQL was introduced, the very same computer-centric vs. human-centered gap is still here. AI is no exception to the rule that applies to any other technology: unless we understand that we, humans, are the ultimate beneficiaries, technology is just a toy for a geek. Fortunately, it didn’t take long for the AI community to realize this gap. For example, Stanford University started a Human-Centered Artificial Intelligence institute led by Fei-Fei Li and John Etchemendy last year. Human-Centered AI will for sure be a trend in the AI community. This should also set the direction to let AI be truly beneficial to our society.

In 2019, our group explored how BI is being impacted by and should respond to the AI phenomenon. This year, the BI & AI Committee takes a step further to investigate the influential topic of Human-Centered AI. A group of BI and analytics leaders dissect this subject into six different areas to see how the BI industry should adapt to this important theme.

Machine Teaching by Cupid Chan

Even though the term Artificial Intelligence (AI) was coined back in 1956 at the Dartmouth Conference, the true golden era of AI started in 2012, when Jeff Dean and Andrew Ng published their paper Building High-Level Features Using Large Scale Unsupervised Learning, leveraging multilayer neural nets known as deep neural networks, a subset of Machine Learning (ML). Since then, a lot of research has been done on how ML can handle different challenges with various algorithms. Moreover, most of the results were shared in open-source libraries and frameworks such as TensorFlow and Scikit-learn. We can then bring our own data and take advantage of the optimizations already done on the algorithms by other data scientists. Furthermore, with transfer learning, we can even take the results of pre-trained models and reuse them on a new data set as a starting point, cutting down a lot of training time. What a leap in ML in just the past 8 years.

In my YouTube video “A.Iron Chef”, I summarize seven steps in the machine learning process to show how AI/ML is relevant to cooking. As you can see from the diagram below, even though data scientists drive the sequential ML process, algorithm optimization and model efficiency are the focal points, with machine learning in the center. Other human interactions are scattered across different steps as separate activities to aid the “learning” process. For example, domain experts use brute force to annotate data as the input of the learning. Once this is done, data scientists pick it up and continue the ML process. Only when the model is finalized and deployed to production will the end user get the result decided by the machine.

Just like in a successful education system, learning is only one of the contributing factors. What about the other side of the equation, Machine Teaching? AI/ML, without a business context to anchor on with the purpose of serving humans, is just a toy for a geek. Yes, it may be fun to play with, but there will be a disconnect from the beneficiary: we humans. In order to turn this around, we need to inject the human into the overall picture and put the human at the center of it.


Data is the ingredient of all ML Supervised Learning recipes. To get it right, humans have traditionally used brute force, following a lengthy, error-prone, and tedious process to annotate data. To properly inject the human into a teaching process, we let the machine read the partially annotated input data by itself. At the same time, we allow the machine to be uncertain when it encounters something that cannot be deduced from the previously annotated data set. What is “uncertain,” you may ask? This is a threshold we can set depending on the criticality of having a correct label. If the machine comes up with a confidence score below the threshold, it will ask the human for clarification. The human will then respond, not only with the label, but potentially with the reason why it should be labelled that way. That means the reasoning will be taught, instead of just letting the algorithm learn by itself.
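The threshold-based hand-off described above can be sketched in a few lines. This is a simplified illustration, not a real annotation system: the model is a stand-in that simply reports a stored confidence score, and the human callback is a placeholder for an interactive prompt to a domain expert.

```python
# Machine teaching sketch: items the "model" is confident about are labeled
# automatically; uncertain items are routed to a human, who supplies both
# a label and the reason behind it.
CONFIDENCE_THRESHOLD = 0.8  # tune to the criticality of a correct label

def annotate(items, ask_human):
    labeled = []
    for item in items:
        if item["model_confidence"] >= CONFIDENCE_THRESHOLD:
            labeled.append({**item, "label": item["model_label"],
                            "source": "machine"})
        else:
            label, reason = ask_human(item)
            labeled.append({**item, "label": label, "reason": reason,
                            "source": "human"})
    return labeled

def human(item):
    # Stand-in for asking the domain expert; returns label plus reasoning.
    return "churn", "short tenure plus repeated support calls"

items = [
    {"id": 1, "model_label": "stay", "model_confidence": 0.95},
    {"id": 2, "model_label": "stay", "model_confidence": 0.55},
]
result = annotate(items, human)
print([r["source"] for r in result])  # ['machine', 'human']
```

The recorded reason is the important part: it captures the teaching, not just the label, and can later be fed back into training.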

In BI, vendors can incorporate this approach to replace the traditional human data wrangling feature. The tool can explore the data provided by the users and learn the preliminary patterns demonstrated by the data set in an unsupervised manner. Only when the machine encounters something that does not make sense will it use natural language to query the user. Even though this approach may miss a few data points, because the human does not exhaustively walk through each and every observation, this on-demand interaction between human and machine allows much better scalability of annotation. Reliability will also increase as humans engage more in the teaching with reasons, rather than using brute force to walk through millions of data points.


On the other hand, when a model is built and deployed to a production system, we should not just accept its output as the final decision for providing the service, which would be static and not reflect relevant considerations. Similar to what we do for the input, we allow the machine to return a “state of uncertainty” when its confidence falls below a certain threshold defined by the users. When this happens, humans step in and teach the machine why a certain decision should be made. At the same time, we can also audit the results to “accept” or “reject” the predictions generated by the trained model. This human feedback will then be recorded and fed back into the system as the input for another round of training. As we continue this iterative process, we teach the machine by providing dynamic feedback. More importantly, we convert one-way machine learning into a bi-directional machine learning and teaching process.

Most modern BI tools provide ways for users to insert comments for the sake of communicating and collaborating with other users. We can take this feature to another level by making a dashboard show not only the predicted results of a model, but also feedback fields allowing users to enter a confirmation (accept/reject) and comments on the predicted result. The comments provided by the users can teach the machine by being fed into a natural language processing (NLP) engine. Hence the original data will not be the only input of an evolving model. Combining the original data set with the users’ feedback completes the cycle with machine teaching.
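The accept/reject-plus-comment loop could look something like the sketch below. All names here (the toy model, the in-memory feedback log, the threshold value) are illustrative assumptions, not any product's API; a real system would persist the log and route it into retraining.

```python
# Bi-directional loop sketch: predictions below a user-defined threshold
# are flagged "uncertain", and every human accept/reject decision is
# recorded as input for the next round of training.
feedback_log = []  # would be persisted in a real system

def predict(model, x, threshold=0.7):
    label, confidence = model(x)
    state = "uncertain" if confidence < threshold else "confident"
    return {"label": label, "state": state, "confidence": confidence}

def review(x, prediction, verdict, comment=""):
    """Record a human accept/reject so it can feed the next training round."""
    feedback_log.append({"input": x, "predicted": prediction["label"],
                         "verdict": verdict, "comment": comment})

def toy_model(x):
    # Stand-in for a deployed model: less sure about large amounts.
    return ("approve", 0.65 if x["amount"] > 1000 else 0.9)

p = predict(toy_model, {"amount": 5000})
print(p["state"])  # uncertain
review({"amount": 5000}, p, "reject", "amounts this large need a co-signer")
print(len(feedback_log))  # 1
```

The free-text comment is what an NLP engine would consume, so the reasoning, not just the verdict, flows back into the model.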

From AI to IA

The following diagram summarizes the overall Machine Teaching in a Human-Centered AI. We are all fascinated by cool technology. That’s why Artificial Intelligence has grown and surged in the past few years. However, putting technology in the center exposes a risk of spending resources without actual benefits experienced by the end users. Therefore, let’s make a mindset change and flip AI to IA: Intelligence Augmentation. With the human at the center, AI, yet another technology, is just a peripheral to augment the intelligence of us humans.

The Self-learning AI Engine by Xiangxiang Meng

A human-centered AI requires the underlying AI engine to be as smart as possible. This means we want a human-centered AI that minimizes human input or intervention in the process of preparing the data, training the models, and selecting the champion models that generate insights for the BI frontend. Self-learning is the path to continuously evolve the AI engine and remove most of the routine process from the end user. This includes the capability to self-learn to be adaptive to the BI frontend, the platform that hosts the AI engine, and the input problem from the end user.

Rule #1 Frontend Agnostic

The first level of the self-learning perspective of the AI engine is being frontend-agnostic. In a human-centered design of the AI engine, users are looking for an autonomous AI layer that can automatically identify and adapt to the BI frontend.

For example, the AI engine should be able to automatically identify which type of BI frontend is sending the request and identify the current version of that frontend. Such information can be used to trigger an automatic data wrangling process, in which the input data is converted on a column-by-column basis (if necessary) by matching the data types used by the BI frontend with the data types used by the AI backend.

Also, the AI engine should learn to automate the data cleanup process. Without extra information from the BI backend, the AI engine should identify data quality checks and a data cleanup policy for each column of the input data set. Examples include removing duplicate columns, identifying and preventing the use of ID variables or high-cardinality variables in the models, discarding columns with too many missing values, and imputing other columns with missing values.
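A hedged sketch of that column-screening policy is shown below. The thresholds and the dict-of-lists table representation are illustrative assumptions; a real engine would work on dataframes and learn its thresholds rather than hard-code them.

```python
# Automated cleanup sketch: drop duplicate columns, likely ID /
# high-cardinality columns, and columns with too many missing values.
def screen_columns(table, missing_limit=0.5, cardinality_limit=0.95):
    """table: dict mapping column name -> list of values (None = missing)."""
    keep, dropped, seen = [], {}, {}
    for name, values in table.items():
        key = tuple(values)
        n = len(values)
        missing = sum(v is None for v in values) / n
        present = [v for v in values if v is not None]
        cardinality = len(set(present)) / n if present else 0
        if key in seen:
            dropped[name] = f"duplicate of {seen[key]}"
        elif missing > missing_limit:
            dropped[name] = "too many missing values"
        elif cardinality > cardinality_limit:
            dropped[name] = "looks like an ID / high-cardinality column"
        else:
            seen[key] = name
            keep.append(name)
    return keep, dropped

table = {
    "customer_id": [1, 2, 3, 4],                   # unique per row: ID-like
    "region": ["east", "west", "east", None],
    "region_copy": ["east", "west", "east", None], # exact duplicate
    "notes": [None, None, None, "vip"],            # 75% missing
}
keep, dropped = screen_columns(table)
print(keep)  # ['region']
```

Imputation of the surviving columns would be a separate, type-aware step.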

Rule #2 Platform Agnostic

A self-learning AI engine is platform agnostic. With the recent advances in machine learning, deep learning, and reinforcement learning, an AI engine is often an ecosystem consisting of dozens or even hundreds of machine learning, deep learning, and other analytics packages and algorithms. In order to achieve the goal of human-centered AI, the engine should self-learn to organize all these packages and be platform agnostic.

A human-centered AI engine should automatically install, update, and resolve package dependency issues so that the entire engine can be deployed in different on-prem, cloud, or other hosting environments. Software package dependency is a key concept for a human-centered AI engine to minimize human interventions, and it can be as small as checking the dependency between two single source files or as big as deploying dozens or hundreds of software packages in the same environment.

A human-centered AI engine should also self-learn to fully leverage the computing resources of the environment. The efficiency of the AI engine highly depends on how platform agnostic it can be in adjusting to the available computing resources and hardware configuration of the environment in which it is up and running. For example, this might include what kind of models to train, how many models to train, what the complexity level of parameter tuning should be, and so forth. In an environment with limited resources, complicated models and parameter tuning should be avoided to guarantee the minimum latency required by the BI frontend. In an environment with advanced computing resources such as GPUs or TPUs, the AI engine should automatically include more complicated models such as deep neural networks.

Rule #3 Model Recommendation

A human-centered AI engine is expected to minimize the amount of work for the end user in selecting which models to run and which models to retire. Besides being frontend agnostic and platform agnostic, it is crucial for a self-learning AI engine to recommend the right set of machine learning and deep learning models for a specific input problem from the BI engine.

A key prerequisite of model recommendation is the ability to memorize and classify the input problems from the BI engine. This allows the engine to maintain a list of candidate analytic models that achieve good accuracy for a specific type of input problem, and to update the candidate list over time to include new models and retire old ones. In many applications it is challenging to maintain a list of good models for a specific input data set. Therefore, the engine should be able to search for and identify similar problems solved in the past, and try to identify models that might work for a similar problem, to speed up the model selection process.

Besides keeping a model zoo that can quickly recommend a subset of models to execute instead of running all the models, a self-learning AI engine should be able to automatically update the champion model for the same input problem whenever the distribution of the data changes dramatically. Most machine learning or deep learning models yield high accuracy when the scoring data set is consistent with the training data set. However, this might change when the scoring data set starts to include new columns or add new levels to the existing columns, making the current champion model no longer produce very accurate predictions or classifications. In this case, the AI engine should start to include the new observations in the training data set, retrain all the models, and select a new champion model.
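The drift check described above can be sketched as a schema comparison. The plain-dict schemas here are an illustrative assumption; a production engine would compute them from the actual training and scoring data and also monitor numeric distributions.

```python
# Drift-detection sketch: if the scoring data contains columns or
# categorical levels unseen at training time, flag the champion model
# for retraining.
def needs_retraining(train_schema, scoring_schema):
    """Each schema maps column name -> set of observed categorical levels."""
    reasons = []
    for col, levels in scoring_schema.items():
        if col not in train_schema:
            reasons.append(f"new column: {col}")
        else:
            new_levels = levels - train_schema[col]
            if new_levels:
                reasons.append(f"new levels in {col}: {sorted(new_levels)}")
    return reasons

train = {"plan": {"basic", "pro"}, "region": {"east", "west"}}
scoring = {"plan": {"basic", "pro", "enterprise"},
           "region": {"east", "west"},
           "channel": {"web"}}
reasons = needs_retraining(train, scoring)
print(reasons)
```

A non-empty reason list would trigger folding the new observations into the training set and re-running model selection.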

The AI-BI Interface by Scott Rigney

Because artificial intelligence encompasses a broad collection of techniques for learning from data, its potential to impact many domains is clear. Unfortunately, business intelligence (BI) has yet to fully harness the power of artificial intelligence and expose it to users at scale. This is partially explained by the fact that AI is an evolving field and its use generally requires expertise in frameworks, algorithms, and techniques that are unfamiliar to BI practitioners. As a result, AI has yet to converge on uniform standards that are pervasive in BI. To achieve the end-goal of self-learning BI systems, the industry should define a uniform set of standards for model abstraction, reusability, auditability, and versioning.


BI users are familiar with a drag-and-drop paradigm whereby their actions within a dashboarding application are translated into SQL. A sophisticated BI platform can customize its SQL generation patterns to query hundreds of database platforms and database versions. This is no small feat. BI’s value proposition is that it simplifies the process of querying data so completely that even advanced users don’t need to be proficient in SQL.

What would be the impact if BI applications could provide a similar level of abstraction for machine learning? Imagine, for example, that a user could lasso a collection of data and guide the system towards identifying relevant patterns. This extends beyond enumerating the use cases and mapping them to preferred models. As another example, when training a model, a novice user may be unaware that walk-forward optimization is a more appropriate cross-validation technique than k-fold when working with time-series data. In both cases the system needs to adapt to user inputs and eliminate complex sub-steps. This effort to simplify the integration is needed in order to lower the barriers to adoption. The key point is that the BI system and its designers have an obligation to abstract away the complexity and build towards a model where drag-and-drop actions are automatically mapped to relevant algorithms.
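To see why walk-forward matters for time series, here is a minimal sketch of the split logic. The function name and parameters are invented for illustration (scikit-learn's `TimeSeriesSplit` plays a similar role in practice): unlike k-fold, every split trains only on rows that precede the test rows in time.

```python
# Walk-forward split sketch: the training window grows forward in time,
# and the test window is always strictly in the future.
def walk_forward_splits(n, initial_train=3, horizon=1):
    """Yield (train_indices, test_indices) pairs that respect time order."""
    splits = []
    end = initial_train
    while end + horizon <= n:
        splits.append((list(range(end)), list(range(end, end + horizon))))
        end += horizon
    return splits

for train_idx, test_idx in walk_forward_splits(6):
    print(train_idx, "->", test_idx)
# [0, 1, 2] -> [3]
# [0, 1, 2, 3] -> [4]
# [0, 1, 2, 3, 4] -> [5]
```

A k-fold split, by contrast, could place "future" rows in the training set, leaking information a deployed model would never have.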


The concept of reusability is common in BI: it provides economies of scale. For example, a single filter on a business dimension (“Year” = 2020) can be used in multiple dashboards and applied to multiple datasets. What is the equivalent reusability construct for machine learning models used in BI? Should the coefficients of a model that were found by training on the full dataset be accessible to a user who does not have access to the full dataset?

One approach is to think about the training pipeline in terms of reusability. In this example, a new regression model could be re-run against a subset of the data but re-use the training and optimization techniques used by its predecessor. A potential benefit to the end-user is that more relevant, nuanced patterns may be observable in a subset of the dataset than in the full dataset. While this premise runs counter to the general goal of model generalizability, the impact to the end-user may make it worthwhile. In terms of reusability, the result can be one model: one problem definition but containing coefficients (in the case of a regression model) that have a one-to-many relationship with end-users. The takeaway is that some adaptation to key concepts in machine learning, such as relaxing the requirement of generalizability, may increase relevance in a self-service BI context. 


Dashboards produced by BI systems are often subject to regulatory oversight. As a result, BI systems need to support calculation auditability. This requirement is complicated by the nature of machine learning: two different algorithms applied to the same dataset and given the same task may produce different results. Furthermore, some algorithms produce interpretable coefficients (regression models) and others do not (tree-based models). The result is that BI systems can’t guarantee auditability. To circumvent this, BI systems can pursue a path towards transparency by leveraging model interpretability frameworks. LIME, short for Local Interpretable Model-Agnostic Explanations, is one such framework that seeks to describe black-box model behavior. In it, model explainability is achieved by repeatedly perturbing the inputs to the model and analyzing the resulting predictions. Although imperfect due to the nature of randomness in the sampling process, LIME and its variants offer a strategy for the auditability requirements that are common in BI.
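A toy illustration of the perturb-and-observe idea behind LIME (not the LIME library itself, which fits a local surrogate model over many random perturbations) is sketched below. The black-box model and its weights are invented for the example.

```python
# Perturb one feature at a time and measure how much the black-box
# prediction moves; large movement = locally important feature.
def black_box(x):
    # Stand-in for an opaque model: income dominates, age barely matters.
    return 0.9 * x["income"] + 0.1 * x["age"]

def local_importance(model, x, delta=1.0):
    """Estimate each feature's local effect by perturbing it in isolation."""
    base = model(x)
    effects = {}
    for feature in x:
        perturbed = dict(x)
        perturbed[feature] += delta
        effects[feature] = model(perturbed) - base
    return effects

effects = local_importance(black_box, {"income": 50.0, "age": 30.0})
print(effects)  # income moves the prediction ~9x more than age
```

An explanation like this ("the prediction is driven mostly by income") is what a BI dashboard could surface to satisfy an auditor, even when the underlying model has no interpretable coefficients.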


Similar to the requirements around calculation auditability, in BI, the logic used to produce calculations can also be change-managed if the BI system makes use of version control logic. A similar approach is used by data science teams who have solutions deployed in production; they’re likely to use some variation of “model ops” or “ML ops” or “champion/challenger” techniques to control the model that is actually being used to make decisions. In such contexts, the rationale for promoting or demoting one model over another may be based on objective metrics or subjective business criteria. Because business data is messy and volatile (changes with time), BI stakeholders need tooling that would allow them to prevent a model from retraining, promote or demote a model, or restore a previous version of a model. This is an important consideration for real-world deployment of AI in BI, especially in industries that are heavily regulated.

Discussed here was a set of principles for technical integration that should be considered – by BI vendors and practitioners alike – in order to achieve more pervasive, yet practical, everyday use of AI in BI contexts. By model abstraction, we suggest that the BI system and its developers have a duty to abstract away the algorithmic decisions that would otherwise be made by a trained data scientist; the analogy of SQL generation and its impact on BI was given. Reusability defined a conceptual model for how to think about the training process in a way that provides a one-to-many relationship between the model and its BI end-users. The auditability requirement stipulates that models, no matter how sophisticated, need to be interpretable by business users and other stakeholders. Finally, with version control, BI users should be able to define the criteria by which the models used in the system are change-managed. It is hoped that the adoption of operating principles such as these represents a bedrock foundation for achieving technical integration between AI and BI. The result is a self-learning system that could usher in a new era of sophisticated, AI-powered analysis that is accessible to more users and with less friction.

Enabling Humans to Visualize the magic of AI by Dalton Ruer

Imagine, if you will, a world where human end users and Automated Machine Learning partner up. Alas, the majority of business users wouldn’t know a Random Forest from a real forest. They don’t seek to be data scientists, and for most of them Machine Learning is simply “magic.” While fun to watch in person, magic isn’t really trusted, and end users aren’t going to take action on something they don’t trust.

The insights and power of the machine, combined with the innate business knowledge and intuition of the business user, are too potent to ignore. Thus, the goal is to help them bridge that trust gap via:

  • Education right in their Business Intelligence framework where they are already working.
  • Allowing them to provide their input to the Artificial Intelligence to ensure it learns about the business, and not just about the 0’s and 1’s of the business.

To that end, we have derived these four rules any BI vendor should follow if they want their product to enable more end users to consume more Automated ML insights.

Rule #1: Protect end users from harm

As with many things, business users aren’t always right in their assertions of what it is that they want. You likely have countless years of experience sitting in meetings, gathering requirements, and watching users change their requests over and over and over. You are probably all too familiar with the fact that sometimes the very things that they ask for will cause them more harm than not doing anything. But for the sake of a good story, bear with me.

For instance, a sales manager looking at current sales trends might say, “I just want to see a forecast of how much we will sell next quarter.” While your Business Intelligence tool can certainly provide some super slick interface that allows them to ask, the right question isn’t “Can we forecast in a super intuitive way for the end user?” The right questions are “How reliable will a forecast from only two quarters of their data be?” and “Will generating that forecast invalidate or reinforce their trust in the technology they can’t see?”

Each Business Intelligence vendor will choose a different interface for the end users to take advantage of things like time-based forecasting. The sales manager will likely never care to know which “model” has been selected under the covers. But when the data suggests it is needed, the sales manager has to be informed by the Business Intelligence tool of how forecasting works and how the confidence of the results goes up the more data that can be provided.

Remember, the goal of Business Intelligence, regardless of the vendor, is for the insights to be actionable. Providing automated machine learning insights is the easy part; the tricky part is not presenting insights based on limited data that will erode the end users’ trust in the technology or cause them to take horribly incorrect actions.

Rule #1 for Business Intelligence vendors is the equivalent of the first rule of the Hippocratic Oath in the medical profession: protect end users from harm.

Rule #2: Ask for Input

Imagine a marketing manager who has a known budget and can run several different types of campaigns. Should he send mailers at a cost of X, make personal calls, or encourage face-to-face events? Each has a known cost. So, he asks for a prediction of what to do for which customers.

Easy, right? Behind the scenes we could run 57 different models, choose the one with the highest confidence score, and bingo, he has his results on screen. The marketing manager doesn’t need to know any of the 57 models’ names. But they likely do need to be prompted for one key piece of information … do they care about negative predictions? Sending an email to a customer who doesn’t respond has minimal cost. But what is the cost of staging an event for face-to-face involvement when our model has 98% accuracy but a 5% negative rate? Would they prefer that over 97.7% accuracy and a 0% negative rate? That’s where it gets interesting.

Like all business users, the marketing manager who wants to take advantage of the power of Machine Learning on the fly doesn’t care one iota about the fact that 57 models were run. But they are the only one who can provide that very key piece of cognitive intelligence that they have and the models don’t.

While any Business Intelligence solution can generate insights from raw data without educating or prompting end users, the data is only half of the solution. The success of human-centered Artificial Intelligence depends upon the BI vendors following this rule: educate the end users enough about what is happening, and ask the end users for their input.
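The accuracy-versus-cost trade-off in Rule #2 can be sketched as follows. The numbers mirror the example above; the function and the $200-per-event cost are illustrative assumptions supplied by the (hypothetical) marketing manager, which is exactly the input the models cannot provide on their own.

```python
# Instead of picking the model with the highest raw accuracy, weigh each
# model's error rate by the business cost the user supplied.
def expected_cost(model, n_customers, cost_per_false_positive):
    # A false positive here = staging a costly face-to-face event for a
    # customer who will not respond.
    return model["false_positive_rate"] * n_customers * cost_per_false_positive

models = [
    {"name": "A", "accuracy": 0.980, "false_positive_rate": 0.05},
    {"name": "B", "accuracy": 0.977, "false_positive_rate": 0.00},
]
best = min(models, key=lambda m: expected_cost(
    m, n_customers=1000, cost_per_false_positive=200))
print(best["name"])  # B: slightly lower accuracy, but no costly mistakes
```

With the user's cost input, the "worse" model by accuracy becomes the clearly better business choice.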

Rule #3: Confidence Scores instill Confidence

Another type of education that must be provided concerns the data size and its potential relationship to prediction accuracy. A data scientist who is asked to analyze employee churn is going to know that the prediction will be much more accurate if they are asked by a company providing 20,000 historical employee records than if they are asked by a company providing only 15 historical records.

However, the Human Resource managers at both companies, seeing a Business Intelligence screen with a button to run churn, are both going to want to press the button. After all, it is “magic.” They won’t know, until told, why the amount of data used matters. Likewise, they won’t understand what a z-score, t-score, or p-value is. That’s where the BI vendors need to provide the education in some way, rather than just automatically choosing to run and report whatever model had the highest score. Business users will need to know what the “score” is and be educated on whether it should be trusted, in a way that fits their role. After all, business users won’t understand why there is such a big deal between a .2 and a .7.

Put yourself in their shoes and imagine clicking that pretty icon to run churn while analyzing data in the company’s Business Intelligence tool and then seeing:

“{insert your name here} the data you provided in this application was run through a series of Machine Learning models and it has predicted that John Doe will remain with the organization. Because the magic system had so much data to work with, we are super-duper confident that this prediction is accurate and can be trusted.”

Or seeing

“{insert your name here} the data you provided in this application was run through a series of Machine Learning models and it has predicted that John Doe will remain with the organization. Because the magic system had so little data to work with in the application you are using, as a result of the filters you have in place, we suggest you take that prediction with a grain of salt.”

All kidding aside, the point is that, as a result of the size of the data being utilized by the business end user, the p-, z-, and t-values could be all over the place, and the end user needs to know. Regardless of a high or low confidence score, being educated will instill trust and confidence in the business users.
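The role-appropriate messaging above can be sketched as a simple translation layer: map the sample size behind a prediction into plain language instead of exposing raw p-, z-, or t-values. The thresholds and wording here are illustrative assumptions, echoing the two tongue-in-cheek messages above.

```python
# Turn the amount of data behind a prediction into a caveat a business
# user can act on, without mentioning p-, z-, or t-values.
def confidence_message(prediction, n_rows):
    if n_rows >= 10000:
        caveat = "based on a large amount of data; this can be trusted."
    elif n_rows >= 500:
        caveat = "based on a moderate amount of data; treat with some care."
    else:
        caveat = ("based on very little data (possibly due to your filters); "
                  "take this prediction with a grain of salt.")
    return f"Prediction: {prediction}. This result is {caveat}"

print(confidence_message("John Doe will remain with the organization", 15))
print(confidence_message("John Doe will remain with the organization", 20000))
```

In a real tool, the thresholds would come from the statistics of the model run rather than fixed row counts, but the principle of translating them for the user's role is the same.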

Rule #4: If it looks too good to be true, it probably is

Data science is a lot like an online dating application. The first thing a data scientist will do when considering a relationship with the data is run a data profile, because they will always want to know if the data is trustworthy, or if it looks too good to be true.

Imagine a telecommunication company wanting to investigate customer churn. A data scientist would obviously need to ask, “Which field represents whether your historical customers have actually churned?” When they run a data profile, it might report that there is a 100% correlation between Field A and the field specified as the target. Yeah, our p-z-t-a-b and c scores are going to be off the charts when we use that model. So, while the manager will be ecstatic if we follow rule #3 and report the scoring in some manner, it might not be trustworthy because it’s “too good.”

Thus, Business Intelligence vendors must also do some level of data profiling and educate the business users before blindly running Machine Learning algorithms. When something in the “profile” seems suspect, business users need to be informed and allowed to choose the path.

“Oh, silly me, that is just the text version of the target field before we translated it to a numeric flag. I shouldn’t have included it in the list of variables; please ignore that field.” Or they might say, “That is fantastic news, we found a 100% accurate correlation variable. Please run the predictions using it, and I am certainly going to get my bonus.”

The opposite scenario also needs to be handled via profiling where it’s clearly a data problem causing a really low score. “Are you kidding me? 92% of the data values you want me to use for the predictions are NULL. This isn’t magic, I can’t fill in the blanks for you and offer good predictions.”
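To make this concrete, here is a minimal sketch of the kind of pre-flight profiling checks a BI tool could run before blindly training a model. The thresholds, column values, and warning messages are illustrative, not taken from any particular product:

```python
# Sketch of pre-flight data profiling: flag suspicious candidate features
# before any model is trained. Thresholds here are illustrative only.

def profile_column(values, target):
    """Return a list of warnings for a single candidate feature column."""
    warnings = []
    n = len(values)

    # Mostly-NULL columns cannot support good predictions.
    null_rate = sum(v is None for v in values) / n
    if null_rate > 0.5:
        warnings.append(f"{null_rate:.0%} of values are NULL; exclude or fix this field")

    # A field that perfectly tracks the target is likely leakage,
    # e.g. a re-encoded copy of the target flag ("too good to be true").
    non_null = [(v, t) for v, t in zip(values, target) if v is not None]
    if non_null:
        match_rate = sum(v == t for v, t in non_null) / len(non_null)
        if match_rate == 1.0:
            warnings.append("100% agreement with the target; possible leakage, confirm with the user")

    return warnings


target = [1, 0, 1, 1, 0]
leaky = [1, 0, 1, 1, 0]               # copy of the target in disguise
sparse = [None, None, 3, None, None]  # mostly missing

print(profile_column(leaky, target))
print(profile_column(sparse, target))
```

A real profiler would of course check many more properties (cardinality, skew, drift), but surfacing even these two warnings to the business user covers both of the scenarios above.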

All kidding aside, business users need to be educated by the business intelligence tools regarding the data’s trustworthiness. If they aren’t going to want to trust the results, there is no sense even beginning the relationship.

If BI vendors can follow these rules, we can create something dramatically more powerful than BI or AI alone by forging a partnership between business users and automated machine learning. Naturally, each vendor will apply its own secret sauce, if you will, as to how to implement these 4 rules. The goal isn’t for all to have a common look or feel, just to ensure that:

  • End users are protected from harm
  • End users are asked for input when higher model scores aren’t the only thing that matters
  • End users are told in nonscientific ways how likely predictions are to be accurate
  • End users are prompted when their data looks too good to be true, or so bad it needs to be ignored

Data Quality by Sachin Sinha

In the structured data world, huge importance is given to data quality. It is widely perceived that “If your data is bad, your machine learning or business intelligence tools are useless”.

While bad data quality is not a new problem, its effect increases exponentially in the machine learning and ML-driven BI world. We have already seen in the public domain some of the biases these models can deliver because of bad-quality data. In the enterprise world, this can lead to catastrophic consequences if businesses start making decisions based on predictions from a model that was created from bad-quality data.

However, there are instances where data is just not within our control. We have all run into situations where a sparsely populated data set is all you have. We can all scream about the quality of the data, but sometimes data is what is given to us and there is no way to change that. Yet even with that sparsely populated data, a human subject matter expert can make decisions that will be better than anything a model will come up with using the same data.

The reason is the presence of human knowledge, experience, and wisdom, which are absent from anything that ML is building.

In the earlier sections we learned that machine teaching seeks to gain knowledge from people rather than extracting knowledge from data alone. Essentially, we can now use knowledge that was generated and internalized in the past to help with the decisions that machines are either going to take or help with in the future.

We all know about the Data-Information-Knowledge-Wisdom hierarchical pyramid. That is how humans learn. That is how we accumulate knowledge, insight and wisdom. By providing context to the data, we create information, which results in knowledge and eventually wisdom. BI vendors have operated for decades in this pyramid. In fact, BI tools are primarily responsible for taking corporate data and turning it into information by providing context and meaning to the data, which then results in corporate knowledge and eventually the wisdom to act.

Can BI tools expand their influence on the pyramid and perhaps even make use of the upper layers to solve the problem of data quality? Can we combine data and machine teaching, using the knowledge and wisdom of a subject matter expert, to build better models and, as a result, better analytics even in the absence of pristine data quality? Is it possible to finally address the issue of data quality without inserting synthetic data into the dataset? Countering poor data quality otherwise requires not only synthetic data but also labeling the data in a way that compensates for missing or poor-quality values. If BI tools can leverage knowledge and wisdom from subject matter experts, we can build models and create analytics with fewer labels and less-than-pristine data than traditional methods require.

Just as the role of the teacher is to optimize the transfer of knowledge to the learning algorithm so it can generate a useful model, SMEs can play a central role in data collection and labeling. SMEs can filter data to select specific examples, or look at the available example data and provide context based on their own intuition or biases. Similarly, given two features on a large unlabeled set, SMEs can conjecture that one is better than the other. What we are doing now is finally leveraging the knowledge and wisdom layers of the pyramid to enrich and enhance the data layer, filter out the noise, and provide better context, based not on gut feeling but on knowledge and wisdom created from years or decades of experience with data.

So how can BI vendors enable that? BI vendors should try to leverage advances in machine teaching to bring some real-life benefits to end users. First, they should incorporate data profiling algorithms to auto-detect the quality of the data, not just from the perspective of how well populated the dataset is, but also from the perspective of how accurate analytics modeled on this dataset will be. This serves two purposes: it informs users of the quality of their dataset, and it alerts them that this is something for which they will probably need the help of an SME.

Once the data profiling algorithms detect lower-than-threshold quality, the BI tool should move to the second step: workflows for SME interaction that kick in automatically when the algorithm detects data quality below a certain threshold. At this point, the BI tool will offer its user a workflow that teaches the machine to counter the poor data quality. As an example, this could take the form of eliminating rows of data because they depict seasonality that is not applicable to the analytics being generated.
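As a sketch of the gate just described, the following hypothetical routine scores a dataset and routes low-quality data to an SME workflow instead of straight to model training. The scoring rule, threshold, and route names are invented for illustration:

```python
# Sketch of a quality gate that routes low-quality datasets to an SME
# review workflow instead of silently training a model. All names and
# thresholds here are hypothetical.

QUALITY_THRESHOLD = 0.7  # illustrative cut-off

def quality_score(rows):
    """Naive quality metric: fraction of non-NULL cells. A real tool would
    also estimate how accurate a model trained on this data would be."""
    cells = [v for row in rows for v in row]
    return sum(v is not None for v in cells) / len(cells)

def route(rows):
    """Decide whether the dataset goes straight to modeling or to an SME."""
    if quality_score(rows) < QUALITY_THRESHOLD:
        # Kick off the machine-teaching workflow: let the SME filter rows,
        # e.g. drop seasonality that does not apply to this analysis.
        return "sme_review"
    return "auto_model"

clean = [[1, 2], [3, 4]]
sparse = [[1, None], [None, None]]
print(route(clean))   # auto_model
print(route(sparse))  # sme_review
```

The interesting design choice is that the threshold triggers a workflow rather than a hard error, so the SME’s corrections become part of the teaching loop rather than a one-off cleanup.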

Leveraging an SME and their wisdom to teach a machine and solve the data quality problem is not the only way to bring wisdom back into the tool set that helped generate that wisdom in the first place. Another avenue BI tools can leverage is data cataloging. Data cataloging is a way to create knowledge and wisdom about which datasets have better quality and which datasets are applicable in a particular context. BI vendors should work on better integration with data cataloging tools. This should not be limited to integration where an end user can open a dataset from the catalog in their BI tool of choice. Ideal integration would truly leverage the knowledge and wisdom the data catalog helped create. It would, for example, suggest better-suited datasets to end users when it identifies that the dataset in use has quality issues and others in the catalog are classified as better for this use case.

While machine teaching does appear to be something that can help immensely in the data quality area, the onus of making it available to end users is on the BI tool vendors. There are also other ways to reach into the upper layers of the information pyramid and use them to help end users. As the end-user tool of choice for analytics, business intelligence tools are the right medium not only to help users reach into the upper layers of the pyramid but also to make it useful for them, by using it to solve a very important question that remains a thorny issue: data quality.

AI Literacy by Gerard Valerio

We are in the midst of the 4th Industrial Revolution, where the convergence of various technologies is redefining the world we live in. Among these technologies are analytics and artificial intelligence (AI for short). Analytics and AI are the engines behind data and digital transformation in the business world. Everything from searching using Google to recommendations and suggestions on online shopping and social media sites to automated call center responses is being fueled by the use of analytics and AI. All of these are great examples of human-centered AI, where AI is applied to the human experience and interaction with technology.

Here’s a question to spark thought: What is AI Literacy and how do you know if you have achieved it? Good question. AI Literacy begins with asking ourselves what does great look like when AI is used successfully? What goes into successful application and use of AI?

Achieving AI literacy is both the journey and the destination. With respect to AI Literacy, here are 3 critical success factors to consider as a starting point for achieving AI literacy:

  1. AI Strategy
  2. AI Maturity, Proficiency, and Readiness
  3. AI Transparency and Trust

This list of critical success factors is by no means complete, so ask yourself: what else would you add as a critical success factor to achieve AI literacy?

AI Strategy

The successful use of AI begins with strategy. What is it that you are intending to happen or occur through the use of AI? Who are the benefactors? Why do the benefactors or users need AI? The use of AI rises out of use cases backed by a value proposition for a product or service that either augments or automates a business process or a human experience.

Any AI application or use involves data inputs and data outputs. The data inputs can collect user information or input, user interaction data, sensor data, etc. This input data is then processed by an algorithm or AI code that is developed to produce an output such as data or information that feeds an action to be taken or a decision to be made whether automated or requiring human involvement.

Here are some considerations that go into framing AI strategy:

  • Data Strategy – Data underpins AI and therefore the use of data as inputs and outputs to an AI algorithm needs to be well defined.
  • Ethical and Legal Considerations – Does the intended use of AI, or the outcome produced by an AI algorithm, raise any ethical or legal considerations? AI can be used for both social good and social evil. Today, we live in a world that is beginning to be plagued by disinformation through social media. While social media was designed to foster community and connection through family, friends, and followers, it is also being leveraged for nefarious purposes. The same AI used on social media sites such as Facebook and the like can also be manipulated to do just the opposite of what may have been intended: good information can help build community and connection, while disinformation can break down community and connection.
  • Human Factor – Central to the application and use of AI is the human experience. How are humans interacting with AI? Are there any requirements for a successful interaction? What is a failed interaction and why did it happen? How do we account for the variability of human action and behavior? Lastly, how is change integrated so that humans embrace AI comfortably?
  • Technology – At the core of AI is the use of technology as a vehicle through which AI acts. Considerations such as choice of technology, the skills and talent necessary to build AI, and deployment all matter in the successful application and use of AI.
  • Use Cases – What is it that you are trying to make better through the use of AI? Is it a product or service and making them better, faster, smarter? Is it process automation intended to improve efficiency and speed? Whatever the case may be, this is what informs the design and development of an AI algorithm.

AI Maturity, Proficiency, and Readiness

Organizations undertaking AI will find that this is an iterative and learning process that unfolds through experimentation (trial-and-error and trial-and-success). Achieving AI literacy as an individual or organization is an evolving process. Just like with any new capability or skill, an organization will first have to traverse a maturity lifecycle. Think crawl, walk, run.

AI literacy as a means to drive successful use of AI comes from an awareness and maturing AI skills (competencies and proficiencies). Think of AI maturity in the context of the Capability Maturity Model (CMM) and the 5 levels of maturity.

In the context of AI, here are the 5 levels of maturity of AI as a capability:

  • Level I:  Characterized as discovery and exploratory, individuals and organizations are developing an awareness of what AI is and what is possible with AI.
  • Level II: Characterized as application and experimentation, AI is being developed and tested to materialize a concept into existence and to learn from failure and success.
  • Level III: Characterized as operational and tactical, AI is “productionized” for experience and process efficiency.
  • Level IV: Characterized as executing and strategic, AI use is becoming pervasive and leading to broader change and innovation with respect to digital and data transformation. Critical mass of proficiency is achieved, and organizational use is beginning to scale.
  • Level V: Characterized as optimizing or transforming, the use of AI is broad and enterprise or organizational wide.

AI Transparency and Trust

Trusting AI is perhaps one of the bigger challenges on the road to AI literacy. Why? Mostly because the application or use of AI is encountered as a black-box experience by those who are knowingly or unknowingly the intended benefactors. We have all heard the phrase “trust, but verify” when it comes to analytics. To establish AI trust, producers of AI must test often and continuously monitor their implementation of AI in order to ensure the intended outcome or result is produced by the application or use of AI. Trust is built and gained by the consumers of AI when it consistently and positively benefits their experience with AI.

Another way to contribute to AI trust is to think about how AI can be made more transparent and easier to understand. AI transparency has been associated with, or equated to, the concept of explainable AI (XAI). When considering the use of AI and factoring in the idea of transparency, or explaining the AI algorithm in the context of human understanding, we build AI trust as we work to demystify the black-box experience and nature of AI algorithms.


The road to AI literacy is a complex journey with challenges. Having a roadmap helps define the journey in the context of creating intentional outcomes through the use of AI for a business purpose. This roadmap is AI Strategy. The road to AI literacy is a transformation process with different stages of organizational maturity in the application and use of AI. Knowing where an organization is in its AI capability maturity helps define the near-term focus and what is necessary to move to the next capability maturity level. The road to AI literacy also requires developing transparency and trust in the application and use of AI. Without a commitment and effort towards AI transparency and trust, AI literacy is not possible; the biases and fears about what people suppose AI is all about (that AI is here to replace humans, resulting in loss of income, loss of jobs, loss of personal prosperity, etc.) cannot be overcome.

Authors (by last name alphabetically)

Cupid Chan – Index Analytics

Cupid Chan is a seasoned professional who is well-established in the industry. His journey started as one of the key players in building a world-class BI platform. He has been a consultant for years, providing solutions to various Fortune 500 companies as well as the public sector. He is the Lead Architect of a contract at a government agency, leading a BI and analytics program on top of both Big Data and traditional DB platforms. He serves on the Board of Directors and the Technical Steering Committee (TSC), and is the Chairperson of the BI & AI Project in Linux Foundation ODPi.

Xiangxiang Meng – SAS

Xiangxiang Meng is a Staff Scientist in the Data Science Technologies department at SAS. Xiangxiang received his PhD and MS from the University of Cincinnati. The current focus of his work is on the algorithm development and platform design for machine learning and business intelligence software, including SAS Visual Statistics and SAS In-Memory Statistics on Hadoop. His research interests include decision trees and tree ensemble models, Bayesian networks and Naive Bayes, recommendation systems, and parallelization of machine learning algorithms on distributed data.

Scott Rigney – MicroStrategy

Scott works as a Principal Product Manager at MicroStrategy. He manages several products, including data science integrations, APIs, and SDKs, and is the creator of the “mstrio” package for Python and R. Before MicroStrategy, he worked in data science and developed machine learning systems for predicting application outages, process optimization, and IT system dependency discovery using network graph models. Scott holds a master’s degree in data science from Northwestern University and a Bachelor of Science in Finance from Virginia Tech.

Dalton Ruer – Qlik

Dalton Ruer is a Data Scientist Storyteller and Analytics Evangelist. He is a seasoned author, speaker, blogger and YouTube video creator who is best known for dynamically sharing inconvenient truths and observations in a humorous manner. The passion Dalton shares through all mediums moves and motivates others to action.

Sachin Sinha – Microsoft

Sachin Sinha is a Director and Technology Strategist at Microsoft. After graduating from the University of Maryland, he continued his information management research as a Data Engineer and Architect. Throughout his career, Sachin designed systems that helped his customers make decisions based on data. During this time he helped startups in the healthcare space get off the ground by building a business on data and mature from seed to series funding. Sachin also helped several organizations in public sector achieve their mission by enabling them for decisions based on data. Now at Microsoft, as a Technology Strategist, he helps customers with digital transformation. Sachin takes pride in engaging with public sector customers each day to help them achieve more for their organization’s mission. He currently lives in Fairfax, VA, with his wife and two sons, and remains a fervent supporter of Terps and Ravens.

Gerard Valerio – Tableau

With more than seven years at Tableau (now Salesforce), Gerard Valerio is a director leading a team of solution engineers in evangelizing Tableau and Salesforce to the U.S. Federal Government. He has built a career on data, spanning first-generation data warehouses in Oracle, Informix, and Sybase to implementing and selling business intelligence tools like SAP BusinessObjects, IBM Cognos, and MicroStrategy. Mr. Valerio also worked in the data integration space as a customer and employee of Informatica. His Big Data experience spans working with terabyte- to petabyte-sized data volumes staged in in-memory columnar databases like Vertica to structured/unstructured data residing in Hadoop-based data lakes.

LF AI & Data Resources

Access other resources on LF AI & Data’s GitHub or Wiki

Ethical AI: its implications for Enterprise AI Use-cases and Governance

By Blog

Guest Author: Debmalya Biswas, Philip Morris International/Nanoscope AI. Originally posted on

This is an extended article accompanying the presentation on “Open Source Enterprise AI/ML Governance” at Linux Foundation’s Open Compliance Summit, Dec 2020 (link) (pptx)

Abstract. With the growing adoption of Open Source based AI/ML systems in enterprises, there is a need to ensure that AI/ML applications are responsibly trained and deployed. This effort is complicated by different governmental organizations and regulatory bodies releasing their own guidelines and policies with little to no agreement on the definition of terms, e.g., there are 20+ definitions of ‘fairness’. In this article, we will provide an overview explaining the key components of this ecosystem: Data, Models, Software, Ethics and Vendor Management. We will outline the relevant regulations such that Compliance and Legal teams are better prepared to establish a comprehensive AI Governance framework. Along the way, we will also highlight some of the technical solutions available today that can be used to automate these requirements.

1. Enterprise AI

For the last 4–5 years, we have been working hard towards implementing various Artificial Intelligence/Machine Learning (AI/ML) use-cases at our enterprises. We have been focusing on building the most performant models, and now that we have a few of them in production, it is time to move beyond model precision to a more holistic AI Governance framework that ensures our AI adoption is in line with organizational strategy, principles and policies.

The interesting aspect of such enterprises, even for mid-sized ones, is that AI/ML use-cases are pervasive. The enterprise use-cases can be broadly categorized based on the three core technical capabilities enabling them: Natural Language Processing (NLP), Computer Vision and Predictive Analytics (Fig. 1).

Fig. 1 Enterprise AI use-cases

Their pervasiveness also implies that they get implemented and deployed via a diverse mix of approaches in the enterprise. We can broadly categorize them into three categories (Fig. 2):

  1. Models developed and trained from scratch, based on Open-Source AI/ML frameworks, e.g., scikit-learn, TensorFlow, PyTorch. Transfer Learning may have been used. The point here is that we have full source code and data visibility in this scenario. The recently established Linux Foundation AI & Data Community (link) plays an important role in supporting and driving this thriving ecosystem of AI/Data related Open Source projects. 
  2. Custom AI/ML applications developed by invoking ML APIs (e.g., NLP, Computer Vision, Recommenders) provided by Cloud providers, e.g., AWS, Microsoft Azure, Google Cloud, IBM Bluemix. The Cloud ML APIs can be considered as black box ML models, where we have zero visibility over the training data and underlying AI/ML algorithms. We however do retain visibility over the application logic. 
  3. Finally, we consider the “intelligent” functionality embedded within ERP/CRM application suites, basically those provided by SAP, Salesforce, Oracle, etc. We have very little control or visibility in such scenarios, primarily acting as users of a Software-as-a-Service (SaaS) application, restricted to vendor-specific development tools.

Fig. 2 Enterprise AI development/deployment approaches  

Needless to say, a mix of the three modes is also possible. And, in all three cases, the level of visibility/control varies depending on whether the implementation was performed by an internal team or an outsourced partner (service integrator). So, the first task from an enterprise governance point of view is to create an inventory of all AI/ML deployments – capturing at least the following details:

  • Use-case
  • Training/source data characteristics
  • Model type (algorithm)
  • Business function regulatory requirements
  • Deployment region
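One possible shape for such an inventory, sketched as a Python dataclass whose fields mirror the details listed above; the example entry and field names are hypothetical, not from any real governance system:

```python
# A sketch of the AI/ML deployment inventory described in the text,
# modeled as a dataclass. Example values are invented for illustration.
from dataclasses import dataclass

@dataclass
class AIDeployment:
    use_case: str
    training_data: str            # training/source data characteristics
    model_type: str               # algorithm
    regulatory_requirements: str  # business-function regulatory requirements
    deployment_region: str

inventory = [
    AIDeployment(
        use_case="customer churn prediction",
        training_data="CRM history, 3 years, EU customers",
        model_type="gradient boosted trees",
        regulatory_requirements="GDPR Art. 22 (automated decisions)",
        deployment_region="EU",
    ),
]

# Governance questions then become simple filters over the inventory:
eu_models = [d for d in inventory if d.deployment_region == "EU"]
print(len(eu_models))
```

Keeping the inventory as structured records rather than a spreadsheet makes it easy to answer audit questions (e.g., “which models process EU data?”) programmatically.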

2. Ethical AI

“Ethical AI, also known as Responsible (or Trustworthy) AI, is the practice of using AI with good intention to empower employees and businesses, and fairly impact customers and society. Ethical AI enables companies to engender trust and scale AI with confidence.” [1]

Failing to operationalize Ethical AI can not only expose enterprises to reputational, regulatory, and legal risks, but also lead to wasted resources, inefficiencies in product development, and even an inability to use data to train AI models. [2]

There has been a recent trend towards ensuring that AI/ML applications are responsibly trained and deployed, in line with enterprise strategy and policies. This is of course good news, but it has been complicated by different governmental organizations and regulatory bodies releasing their own guidelines and policies, with little to no agreement or standardization on the definition of terms, e.g., there are 20+ definitions of ‘fairness’ [3]. A recent survey [4] of some 84 AI ethics guideline documents around the world concluded “that no single ethical principle was common to all of the 84 documents on ethical AI we reviewed”. Examples of such guidelines include:

  • EU Ethics guidelines for Trustworthy AI (link)
  • UK Guidance on the AI auditing framework (link)
  • Singapore Model AI Governance Framework (link)

Software companies (e.g., Google, Microsoft, IBM) and the large consultancies (e.g., Accenture, Deloitte) have also jumped on the bandwagon, publishing their own AI Code of Ethics cookbooks.

  • AI at Google: our principles (link)
  • Microsoft AI Principals (link)
  • IBM Trusted AI for Business (link)
  • Accenture Responsible AI: A Framework for Building Trust in Your AI Solutions (link)

In general, they follow a similar storyline: first outlining their purpose and core principles, then describing what they will (and will not) pursue, mostly focusing on social benefits, fairness, accountability, and user rights/data privacy. 

At this stage, they all seem like public relations exercises, with very little detail on how to apply those high-level principles across AI use-cases at scale. 

In the context of the above discussion, we now turn our focus back to the top four AI/ML principles that we need to start applying (or at least start thinking about) at enterprises, ideally as part of a comprehensive AI Governance Framework. The fifth aspect would be ‘Data Privacy’, which has already received sufficient coverage, and there seem to be mature practices in place at enterprises to address those concerns.

2.1 Explainability

Explainable AI is an umbrella term for a range of tools, algorithms and methods which accompany AI model predictions with explanations. Explainability and transparency of AI models clearly rank high among the list of ‘non-functional’ AI features to be considered first by enterprises. For example, this implies having to explain why an ML model profiled a user into a specific segment, which led to him/her receiving an advertisement. This aspect is also covered under the ‘Right to Explainability’ in most regulations; e.g., the below paragraph is quoted from the Singapore AI Governance Framework:

“It should be noted that technical explainability may not always be enlightening, especially to the man in the street. Implicit explanations of how the AI models’ algorithms function may be more useful than explicit descriptions of the models’ logic. For example, providing an individual with counterfactuals (such as “you would have been approved if your average debt was 15% lower” or “these are users with similar profiles to yours that received a different decision”) can be a powerful type of explanation that organisations could consider.”
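The counterfactual style of explanation quoted above can be illustrated with a toy example. The decision rule, numbers, and function names below are invented for illustration, not taken from any real credit model:

```python
# Toy counterfactual explanation in the spirit of the quoted example:
# find how much lower the applicant's average debt would need to be for
# approval. The linear decision rule and figures are hypothetical.

def approved(income, debt):
    return income - 2 * debt >= 0  # invented decision rule

def debt_counterfactual(income, debt, step=0.01):
    """Smallest relative debt reduction that flips the decision to approval."""
    reduction = 0.0
    while not approved(income, debt * (1 - reduction)):
        reduction += step
        if reduction >= 1.0:
            return None  # even zero debt would not help
    return reduction

income, debt = 50_000, 30_000
if not approved(income, debt):
    r = debt_counterfactual(income, debt)
    print(f"You would have been approved if your average debt was {r:.0%} lower")
```

For a black-box model, the same search idea applies, except the candidate changes are generated by perturbing inputs rather than by inverting a known rule.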

The EU GDPR also covers the ‘Right to Explainability’ — refer to the below articles:

  • Limits to decision making based solely on automated processing and profiling (Art. 22)
  • Right to be provided with meaningful information about the logic involved in the decision (Art. 13, 15)

Note that GDPR does not mandate the ‘Right to Explainability’, rather it mandates the ‘Right to Information’. GDPR does allow the possibility of completely automated decision making as long as personal data is not involved, and the goal is not to evaluate the personality of a user — human intervention is needed in such scenarios.

AI/ML practitioners will know that having data and source code visibility is not the same as ‘explainability’. Machine (Deep) Learning algorithms vary in the level of accuracy and explainability that they can provide, and it is not surprising that the two are often inversely proportional.

For example (Fig. 3), fundamental ML algorithms, e.g., Logistic Regression and Decision Trees, provide better explainability, as it is possible to trace the independent variable weights and coefficients, and the various paths from nodes to leaves in a tree. ‘Explainability’ becomes more difficult as we move to Random Forests, which are basically an ensemble of Decision Trees. At the other end of the spectrum are Neural Networks, which have shown human-level accuracy. In a deep (multi-layer) neural network, it is very difficult to correlate the impact of a (hyper)parameter assigned to a layer with the final decision. This is also the reason why optimizing a neural network currently remains a very ad-hoc and manual process, often based on the Data Scientist’s intuition [5].


Fig. 3 AI Model Accuracy vs. Explainability

It is also important to understand that an ‘explanation’ can mean different things for different users.

“the important thing is to explain the right thing to the right person in the right way at the right time” [6]

The right level of explanation abstraction (Fig. 4) depends on the goal, domain knowledge and complexity comprehension capabilities of the subjects. It is fair to say that most explainability frameworks today are targeted towards the AI/ML Developer.

Fig. 4 AI explainability abstraction depending on user type

Improving model explainability is an active area of research within the AI/ML Community, and there has been significant progress in model agnostic explainability frameworks (Fig. 5). As the name suggests, these frameworks separate explainability from the model, by trying to correlate the model predictions to the training data, without requiring any knowledge of the model internals.

Fig. 5 Model Agnostic Explainable AI

One of the most widely adopted model-agnostic explainability frameworks is Local Interpretable Model-Agnostic Explanations (LIME). LIME is “a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner by learning an interpretable model locally around the prediction.” LIME provides easy-to-understand (approximate) explanations of a prediction by training an explainability model on samples around that prediction, weighting them by their proximity to the original instance. The approximate nature of the explainability model might limit its usage for compliance needs.
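The core idea (sample around the prediction, weight by proximity, fit an interpretable surrogate) can be sketched in a few lines on a one-feature black box. This is an illustration of the technique only, not the LIME library’s actual API, and the black-box model is invented:

```python
# Bare-bones sketch of LIME's core idea for a single-feature black box:
# sample points near the instance, weight them by proximity, and fit a
# weighted linear surrogate whose slope "explains" the local behaviour.
import math
import random

def black_box(x):
    """Stand-in for an opaque model: a shifted sigmoid."""
    return 1 / (1 + math.exp(-(x - 2)))

def local_slope(x0, n=200, spread=0.5, seed=0):
    rng = random.Random(seed)
    xs = [x0 + rng.gauss(0, spread) for _ in range(n)]
    ys = [black_box(x) for x in xs]
    # Proximity kernel: samples closer to x0 count more.
    ws = [math.exp(-((x - x0) ** 2) / (2 * spread ** 2)) for x in xs]
    # Weighted least squares for y ~ a + b*x (closed form for the slope b).
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    cov = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    return cov / var  # the local "feature weight"

print(round(local_slope(2.0), 2))  # close to the sigmoid's max slope of 0.25
```

The slope of the local surrogate is exactly the kind of “feature weight” LIME reports per feature in the multi-dimensional case.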

For example, the snapshot below (Fig. 6) shows the LIME output of an Artificial Neural Network (ANN) trained on a subset of the California Housing Prices Dataset (link). It shows the important features, positively and negatively impacting the model’s predictions.


Fig. 6 LIME output visualization

As mentioned, this is an active research area, and progress continues to be made with the release of more such (sometimes model-specific) explainability frameworks and tools, e.g., the recently released NLP Language Interpretability Tool by Google Research (link). In practice, for commercial systems, the current SOTA would be Facebook’s ‘Why you’re seeing this Post’ [7] – Fig. 7.

Fig. 7 Facebook’s ‘Why you’re seeing this Post’ feature (Source: Facebook)

It includes information on how often the user interacts with that post’s author and medium (videos, photos or links), and the popularity of the post compared to others. The feature is an extension of ‘Why am I seeing this ad?’, which includes information on how the user’s Facebook profile data matched details provided by an advertiser.

2.2 Bias and Fairness

[8] defines AI/ML Bias “as a phenomenon that occurs when an algorithm produces results that are systemically prejudiced due to erroneous assumptions in the machine learning process”.

Bias in AI/ML models is often unintentional; however, it has been observed far too frequently in deployed use-cases to be taken lightly. From Google Photos labeling pictures of a black Haitian-American programmer as “gorilla” to the more recent “White Barack Obama” images, there are numerous examples of ML models discriminating on race, gender, age, or sexual orientation. The unintentional nature of such biases will not prevent your enterprise from getting fined by regulatory bodies or facing a public backlash on social media, leading to loss of business. Even without these repercussions, it is simply ethical that AI/ML models should behave fairly towards everyone, without any bias. However, defining ‘fairness’ is easier said than done. Does fairness mean, e.g., that the same proportion of male and female applicants get high risk assessment scores? Or that the same level of risk results in the same score regardless of gender? It is impossible to fulfill both definitions at the same time [9].
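The tension between the two definitions can be seen on a toy example (all groups and numbers below are hypothetical). The score is identical for applicants at the same risk level, satisfying the second definition, yet the groups end up with different high-score rates, violating the first:

```python
# Toy applicant pool: (group, risk). The underlying risk distribution
# differs between the two groups.
applicants = ([("A", "high")] * 30 + [("A", "low")] * 70
              + [("B", "high")] * 60 + [("B", "low")] * 40)

def score(applicant):
    # "Fair by risk": identical risk always yields an identical score
    _, risk = applicant
    return 1 if risk == "high" else 0

def high_score_rate(group):
    members = [a for a in applicants if a[0] == group]
    return sum(score(a) for a in members) / len(members)

rate_a = high_score_rate("A")  # 0.3: 30% of group A get a high score
rate_b = high_score_rate("B")  # 0.6: 60% of group B get a high score
```

Because the base rates differ (30% vs. 60% high-risk), equalizing scores per risk level necessarily un-equalizes the per-group score rates, and vice versa.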

Bias creeps into AI models, primarily due to the inherent bias already present in the training data. As such, the ‘data’ part of AI model development is key to addressing bias.

[10] provides a good classification of the different types of ‘bias’ — introduced at different stages of the AI/ML development lifecycle (Fig. 8):

Fig. 8 AI/ML Bias types (Source: [10])

Focusing on the training-data related bias types:

  • Historical Bias: arises due to historical inequality of human decisions captured in the training data
  • Representation Bias: arises due to training data that is not representative of the actual population
  • Measurement & Aggregation Bias: arises due to improper selection and combination of features.

A detailed analysis of the training data is needed to ensure that it is representative and uniformly distributed over the target population, with respect to the selected features. Explainability also plays an important role in detecting bias in AI/ML models.
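As a minimal sketch of such an analysis (the group names, reference shares, and tolerance below are all hypothetical), one can compare the group distribution in the training sample against a reference population and flag large deviations, a simple check for representation bias:

```python
from collections import Counter

def representation_gaps(sample_labels, population_shares, tolerance=0.05):
    """Flag groups whose share in the training sample deviates from
    the reference population share by more than `tolerance`."""
    counts = Counter(sample_labels)
    total = sum(counts.values())
    gaps = {}
    for group, pop_share in population_shares.items():
        sample_share = counts.get(group, 0) / total
        if abs(sample_share - pop_share) > tolerance:
            gaps[group] = round(sample_share - pop_share, 3)
    return gaps

# Hypothetical census-style reference shares vs. a skewed training sample
population = {"18-30": 0.25, "31-50": 0.40, "51+": 0.35}
sample = ["18-30"] * 50 + ["31-50"] * 40 + ["51+"] * 10

gaps = representation_gaps(sample, population)
# The young group is over-represented (+0.25) and the oldest
# group under-represented (-0.25); "31-50" matches and is not flagged.
```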

In terms of tools, TensorFlow Fairness Indicators (link) is a good example of a library that enables easy computation of commonly identified fairness metrics.

Fig. 9 TensorFlow Fairness Indicators

Here, it is important to mention that bias detection is not a one-time activity. Similar to monitoring model drift, we need to plan for bias detection to be performed on a continual basis. As new data comes in (including feedback loops), a model that is unbiased today can become biased tomorrow. For instance, this is what happened with Microsoft’s ‘teen girl’ AI Chatbot, which within hours of its deployment had learned to become a “Hitler-loving sex robot”.
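Continual bias detection can be as simple as re-computing a fairness metric on every new batch of predictions and alerting when it drifts past a threshold. A minimal sketch follows; the groups, batches, and the 0.2 threshold are hypothetical:

```python
def parity_gap(outcomes):
    """Absolute difference in positive-outcome rates between two groups.
    `outcomes` is a list of (group, prediction), prediction in {0, 1}."""
    rates = {}
    for g in ("A", "B"):
        preds = [p for grp, p in outcomes if grp == g]
        rates[g] = sum(preds) / len(preds) if preds else 0.0
    return abs(rates["A"] - rates["B"])

def monitor(batches, threshold=0.2):
    """Re-compute the parity gap per incoming batch; return indices
    of batches where the gap exceeds the alert threshold."""
    return [i for i, batch in enumerate(batches) if parity_gap(batch) > threshold]

day1 = [("A", 1), ("A", 0), ("B", 1), ("B", 0)]  # gap 0.0: unbiased
day2 = [("A", 1), ("A", 1), ("B", 0), ("B", 0)]  # gap 1.0: drifted
alerts = monitor([day1, day2])  # only the second batch is flagged
```

In production the same idea would run over sliding windows of live predictions, alongside the drift monitoring mentioned above.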

2.3 Reproducibility 

Reproducibility is a basic principle of Ethical AI and it implies that “All AI/ML predictions must be reproducible.” If an outcome cannot be reproduced, it cannot be trusted. The combination of model version, (hyper-)parameter setting, training dataset features, etc. that contribute to a single prediction can make reproducibility challenging to implement in practice.

To make AI truly reproducible, we need to maintain the precise lineage and provenance of every ML prediction.

This is provided by a mix of MLOps and Data Governance practices. Facebook AI Research (FAIR) recently introduced the ML Code Completeness Checklist to enhance reproducibility and provide a collection of best practices and code repository assessments, allowing users to build upon previously published work [11]. While MLOps [12] seems to be in vogue, there is an inherent lack of principles/standardization on the Data side. A promising framework in this context is FAIR (Findable, Accessible, Interoperable, Reusable) [13], which has started to get widespread adoption in Healthcare Research. The FAIR principles state that data should be:

  • Findable: Data should be easy to find for both humans and machines, which implies rich metadata and unique/persistent identifier.
  • Accessible: Data can be accessed in a trusted fashion with authentication and authorization provisions.
  • Interoperable: Shared ontology for knowledge representation, ensuring that the data can interoperate with multiple applications/workflows for analysis, storage, and processing.
  • Reusable: Clear provenance/lineage specifications and usage licenses such that the data can be reused in different settings.

While FAIR is quite interesting from a Data Governance perspective, it remains to be seen how it gets adopted outside Healthcare Research. The Open Data licenses (e.g., Creative Commons) are very different from the more mature landscape of Open Source Software licenses, e.g. Apache, MIT [14].

There is a great risk that we will end up in another standardization/licensing mess, rather than a comprehensive framework to store, access and analyze both data and models (code) in a unified fashion.
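The lineage and provenance requirement above can be made concrete with a small sketch: hash the exact training data and bundle it with the model version, hyper-parameters, input, and prediction, so that any single prediction can later be traced and reproduced. All names and values below are illustrative:

```python
import hashlib
import json

def dataset_fingerprint(rows):
    """Content hash of the training data, so the exact dataset
    behind a prediction can later be verified."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def lineage_record(model_version, hyperparams, rows, features, prediction):
    """One provenance entry per prediction: the pieces needed
    to reproduce it later, plus a deterministic record id."""
    record = {
        "model_version": model_version,
        "hyperparams": hyperparams,
        "dataset_sha256": dataset_fingerprint(rows),
        "input_features": features,
        "prediction": prediction,
    }
    # Deterministic id: hashing the canonical JSON of the record means
    # identical inputs always yield the identical record id.
    record["record_id"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode("utf-8")
    ).hexdigest()[:16]
    return record

train_rows = [[1.0, 2.0, 0], [3.0, 4.0, 1]]
rec = lineage_record("v1.3.0", {"lr": 0.01, "epochs": 20},
                     train_rows, [2.5, 3.1], 1)
```

In practice such records would be written by the MLOps pipeline to an append-only store; the sketch only shows what a single entry might contain.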

2.4 Accountability

Consider the debate on self-driving cars: who is responsible if an accident happens? Is it the user, the car manufacturer, the insurance provider, or even the city municipality (due to a problem with the road traffic signals)? The same debate applies to AI models as well: who is accountable if something goes wrong, e.g., in the case of a biased AI/ML model deployment as explained above?

Accountability is especially important if the AI/ML model is developed and maintained by an outsourced partner/vendor.

Below we outline a few questions (in the form of a checklist) that need to be considered/clarified before signing the contract with your preferred partner:

  • Liability: Given that we are engaging with a 3rd party, to what extent are they liable? This is tricky to negotiate and depends on the extent to which the AI system can operate independently. For example, in the case of a Chatbot, if the bot is allowed to provide only a limited output (e.g., respond to a consumer with only limited number of pre-approved responses), then the risk is likely to be a lot lower as compared to an open-ended bot that is free to respond. In addition:

What contractual promises should we negotiate (e.g., warranties, SLAs)?
What measures do we need to implement if something goes wrong (e.g., contingency planning)?

  • Data ownership: Data is critical to AI/ML systems, so negotiating ownership not only of training data, but also of input data, output data, and other generated data is critical. For example, in the context of a consumer-facing Chatbot [15]:
    • Input data could be the questions asked by consumers whilst interacting with the bot.
    • Output data could be the bot’s responses, i.e., the answers given to the consumers by the bot.
    • Other generated data includes the insights gathered as a result of our consumers’ use of the AI, e.g., the number and types of questions asked.

Also, if the vendor is generating the training data, basically bearing the cost of annotation, do we still want to own the training data?

  • Confidentiality and IP/Non-Compete clauses: In addition to (training) data confidentiality, do we want to prevent the vendor from providing our competitors with access to the trained model, or at least any improvements to it — particularly if it is giving us a competitive advantage? With respect to IP, we are primarily interested in the IP of the source code – at an algorithmic level.

    Who owns the rights of the underlying algorithm? Is it proprietary to a 3rd party? If yes, have we negotiated appropriate license rights, such that we can use the AI system in the manner that we want?

When we engage with vendors to develop an AI system, is patent protection possible, and if so, who has the right to file for the patent?

3. Conclusion

To conclude, we have highlighted four key aspects of AI/ML model development that we need to start addressing today — as part of a holistic AI Governance Framework.

As with everything in life, especially in IT, there is no clear black and white: a blanket AI policy mandating the usage of only explainable AI/ML models is not optimal, as it implies missing out on what non-explainable algorithms can provide.

Depending on the use-case and geographic regulations, there is always scope for negotiation; regulations for specific use-cases (e.g., profiling) vary across geographies. In terms of bias and explainability as well, we have the full spectrum from ‘fully explainable’ to ‘partially explainable, but auditable’ to ‘fully opaque, but with very high accuracy’. Given this, there is a need to form a knowledgeable and interdisciplinary team (consisting of at least IT, Legal, Procurement, and Business representatives), often referred to as an AI Ethics Committee, that can take such decisions in a consistent fashion, in line with the company’s values and strategy.


References
  1. R. E-Porter. Beyond the promise: implementing Ethical AI (link)
  2. R. Blackman. A Practical Guide to Building Ethical AI (link)
  3. S. Verma, J. Rubin. Fairness definitions explained (link)
  4. A. Jobin, M. Ienca, E. Vayena. The global landscape of AI Ethics Guidelines (link)
  5. D. Biswas. Is AutoML ready for Business?. Medium, 2020 (link)
  6. N. Xie, et. al. Explainable Deep Learning: A Field Guide for the Uninitiated (link)
  7. CNBC. Facebook has a new tool that explains why you’re seeing certain posts on your News Feed (link)
  8. SearchEnterprise AI. Machine Learning bias (AI bias) (link)
  9. K. Hao. This is how AI Bias really happens — and why it’s so Hard to Fix (link)
  10. H. Suresh, J. V. Guttag. A Framework for Understanding Unintended Consequences of Machine Learning (link)
  11. Facebook AI Research. New code completeness checklist and reproducibility updates (link)
  12. Google Cloud. MLOps: Continuous Delivery and Automation Pipelines in Machine Learning (link)
  13. GO Fair. The FAIR Guiding Principles for Scientific Data Management and Stewardship (link)
  14. D. Biswas. Managing Open Source Software in the Enterprise. Medium, 2020 (link)
  15. D. Biswas. Privacy Preserving Chatbot Conversations. In proceeding of the 3rd NeurIPS Workshop on Privacy-preserving Machine Learning (PPML), 2020 (paper)
  16. Gartner. Improve the Machine Learning Trust Equation by Using Explainable AI Frameworks (link)
  17. Google. Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing (link)
  18. Forbes. ML Integrity: Four Production Pillars For Trustworthy AI (link)
  19. Forrester. Five AI Principles To Put In Practice (link)
  20. Deloitte. AI Ethics: A Business Imperative for Boards and C-suites (link)
