
By Irving Wladawsky-Berger

After decades of promise and hype, artificial intelligence has finally become the defining technology of our era. Over the past few decades, the necessary ingredients have come together to propel AI beyond universities and research labs into the broader marketplace: powerful, inexpensive computer technologies; advanced algorithms, models, and systems; and huge amounts of all kinds of data.

Like the steam engine, electricity, computers, and the internet, AI will have a historically transformative impact across economies and societies. But the past two centuries have taught us that while historically transformative technologies show great potential from the outset, realizing that potential requires major complementary investments: business process redesign; innovative new products, applications, and business models; the re-skilling of the workforce; and a fundamental rethinking of organizations, industries, economies, and societal institutions.

In the early 1990s, for example, it was clear that something big and exciting was taking place with the advent of the internet and World Wide Web, but it wasn’t at all clear where things were heading, how to sort the growing hype from reality, or what the implications would be for industries, companies, and jobs. As was the case with the dot-com bubble of the 1990s, there’s now a combination of rapid technology adoption, new standards-based open source infrastructures, and exciting new applications (some of which will turn out to be truly innovative and some rather silly), along with a speculative frenzy that is likely to end with the bursting of the current AI bubble.

However, while sharing many attributes with previous historically transformative technologies, AI is in a class by itself, especially when it comes to the kinds of concerns, fears and uncertainties it’s been generating.

Throughout the Industrial Revolution there were periodic panics about the impact of automation on jobs, going back to the Luddites, the textile workers who in the 1810s smashed the new machines that were threatening their jobs. But each time those fears arose in the past, technology advances ended up creating more jobs than they destroyed.

Automation fears have understandably accelerated in recent years, as our increasingly smart machines can now be applied to activities requiring intelligence and cognitive capabilities that not long ago were viewed as the exclusive domain of humans. Over the past decade, powerful AI systems have matched or surpassed human levels of performance in a number of tasks, such as image and speech recognition, skin cancer classification, breast cancer detection, and highly complex games like Go.

In the past few years, foundation models, large language models (LLMs), and generative AI systems like ChatGPT have propelled AI to a whole new level of expectations and uncertainties. AI is now raising not only additional concerns about job automation, but also fears that an increasingly powerful, human-like, out-of-control artificial general intelligence (AGI) could become an existential threat to humanity. We have not seen this before.

I’ve been closely following and writing about advances in AI over the past 20 years to help me sort out what’s hype and what’s real. Without a doubt, there’s been huge progress over these past two decades. But, based on everything I’ve read from the AI research community, the really big deal is the emergence of data-centric AI, which some are also calling software 2.0. The centrality of data is the common element in the key technologies that have advanced AI over the past 20 years, including big data and advanced analytics in the 2000s, machine and deep learning in the 2010s, and more recently foundation models, LLMs, and generative AI.

Over the past decade, AI has made dramatic progress in a number of domains, including natural language processing, computer vision, recommendation systems, healthcare, biology, finance, and scientific discovery, noted “Data-centric Artificial Intelligence: A Survey,” a June 2023 paper by researchers from Rice University and Texas A&M. “A vital enabler of these great successes is the availability of abundant and high-quality data. Many major AI breakthroughs occur only after we have the access to the right training data.”

“In parallel, the value of data has been well-recognized in industries. Many big tech companies have built infrastructures to organize, understand, and debug data for building AI systems. All these efforts in constructing training data, inference data, and the infrastructure to maintain data have paved the path for the achievements in AI today.”

AI systems have two major components: the models used to make predictions, whether based on advanced analytics, application-specific machine and deep learning systems, or large foundation models that can be adapted for a wide range of applications; and the data used to train those models.

In conventional model-centric AI systems, the primary goal of researchers and practitioners has been to improve the performance of the system by making modifications to the models so they can make more accurate predictions while keeping the training data mostly unchanged. Model-centric AI encourages model advancements, but puts too much trust in the data.

Data-centric AI systems shift the focus from models toward data. Data is viewed not merely as fuel for AI, but as a determining factor in the overall system quality, and a way to help build AI systems that deal with complex real-world problems.

“In the conventional model-centric AI lifecycle, researchers and developers primarily focus on identifying more effective models to improve AI performance while keeping the data largely unchanged. However, this model-centric paradigm overlooks the potential quality issues and undesirable flaws of data, such as missing values, incorrect labels, and anomalies. Complementing the existing efforts in model advancement, data-centric AI emphasizes the systematic engineering of data to build AI systems, shifting our focus from model to data.”

“A few rising AI companies have placed data in the central role because of many benefits, such as improved accuracy, faster deployment, and standardized workflow. These collective initiatives across academia and industry demonstrate the necessity of building AI systems using data-centric approaches.”

What’s the current state of data-centric AI research, and what are its potential future directions? To help address these important questions, the paper provides a high-level overview of data-centric AI by addressing four key questions:

What are the necessary tasks to make AI data-centric? “Data-centric AI encompasses a range of tasks that involve developing training data, inference data, and maintaining data. These tasks include but are not limited to 1) cleaning, labeling, preparing, reducing, and augmenting the training data, 2) generating in-distribution and out-of-distribution data for evaluation, or tuning prompts to achieve desired outcomes, and 3) constructing efficient infrastructures for understanding, organizing, and debugging data.”
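As a rough illustration of the first group of tasks, the toy sketch below runs three of the training-data operations just listed (cleaning, reduction, and augmentation) over a small NumPy array; the helper functions and their parameters are my own illustrative choices, not prescribed by the survey:

```python
import numpy as np

def clean(X):
    """Cleaning: replace missing values with per-column means."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

def reduce_rows(X):
    """Reduction: drop duplicate rows to shrink the training set."""
    _, idx = np.unique(X, axis=0, return_index=True)
    return X[np.sort(idx)]

def augment(X, copies=2, sigma=0.01, seed=0):
    """Augmentation: append Gaussian-jittered copies of each row."""
    rng = np.random.default_rng(seed)
    extra = [X + rng.normal(scale=sigma, size=X.shape) for _ in range(copies)]
    return np.vstack([X, *extra])

raw = np.array([[1.0, np.nan], [1.0, 2.0], [1.0, 2.0], [4.0, 5.0]])
prepared = augment(reduce_rows(clean(raw)))
print(prepared.shape)   # (9, 2): cleaned, deduplicated, then expanded 3x
```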

Why is automation significant for developing and maintaining data? “Given the availability of an increasing amount of data at an unprecedented rate, it is imperative to develop automated algorithms to streamline the process of data development and maintenance.” A number of such automated algorithms have been developed, spanning different automation levels: from programmatic or process-based automation, to machine-learning-based automation, to continuous deployment pipeline-based automation.
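One concrete flavor of programmatic automation is weak supervision, in the spirit of tools like Snorkel: simple rule-based “labeling functions” vote on each example, so training labels can be generated without a human annotating every row. The hand-rolled sketch below is illustrative; the rules and the majority-vote combiner are mine, not the survey’s:

```python
# Label conventions: -1 means a labeling function abstains.
ABSTAIN, NEG, POS = -1, 0, 1

def lf_mentions_great(text):
    return POS if "great" in text.lower() else ABSTAIN

def lf_mentions_terrible(text):
    return NEG if "terrible" in text.lower() else ABSTAIN

def lf_exclamation(text):
    return POS if text.endswith("!") else ABSTAIN

def weak_label(text, lfs):
    """Majority vote over the labeling functions that don't abstain."""
    votes = [lf(text) for lf in lfs]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

lfs = [lf_mentions_great, lf_mentions_terrible, lf_exclamation]
for review in ["A great movie!", "Terrible plot.", "Just okay."]:
    print(f"{review!r} -> {weak_label(review, lfs)}")
```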

In which cases and why is human participation essential in data-centric AI? “Human participation is necessary for many data-centric AI tasks, such as the majority of data labeling tasks and several tasks in inference data development. Notably, different methods may require varying degrees of human participation, ranging from full involvement to providing minimal inputs. Human participation is crucial in many scenarios because it is often the only way to ensure that the behavior of AI systems aligns with human intentions.”
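The “minimal inputs” end of that spectrum is often implemented with active learning: the model flags the examples it is least sure about and routes only those to a human annotator. Here is a small sketch assuming scikit-learn, with a simulated oracle standing in for the human; the seed-set construction, query size, and uncertainty measure are all illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
oracle = (X[:, 0] > 0).astype(int)     # stand-in for a human annotator

# Seed set chosen to contain both classes so the first fit succeeds.
order = np.argsort(X[:, 0])
labeled = list(order[:5]) + list(order[-5:])

for round_num in range(5):
    model = LogisticRegression().fit(X[labeled], oracle[labeled])
    proba = model.predict_proba(X)[:, 1]
    margin = np.abs(proba - 0.5)       # 0 = maximally uncertain
    margin[labeled] = np.inf           # never re-query labeled examples
    query = np.argsort(margin)[:20]    # route these 20 to the "human"
    labeled.extend(query.tolist())
    print(f"round {round_num}: {len(labeled)} labels, "
          f"accuracy on all data {model.score(X, oracle):.3f}")
```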

What is the current progress of data-centric AI? Although data-centric AI is a relatively new concept, considerable progress has already been made in many relevant tasks, the majority of which were viewed as preprocessing steps in the model-centric paradigm. Meanwhile, many new tasks have recently emerged, and research on them is still ongoing. Among the three key data-centric tasks (training data development, inference data development, and data maintenance), training data development has received relatively more research attention. “As research papers on data-centric AI are growing exponentially, we could witness even more progress in this field in the future.”

The paper adds that data-centric AI does not diminish the value of model-centric AI. “Instead, these two paradigms are complementarily interwoven in building AI systems.” Model-centric methods can be used to achieve data-centric AI goals, while data-centric AI can help improve model-centric AI objectives. “Therefore, in production scenarios, data and models tend to evolve alternatively in a constantly changing environment.”
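A schematic of that alternating evolution, again as an illustrative sketch rather than the paper’s recipe: each cycle first refines the data under the current model (here by dropping rows with implausible labels), then re-tunes and refits the model on the refreshed data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with 20% label noise (illustrative).
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 8))
y = (X[:, 0] - X[:, 1] > 0).astype(int)
y = np.where(rng.random(600) < 0.2, 1 - y, y)

model = LogisticRegression().fit(X, y)
for cycle in range(3):
    # Data step: drop rows whose label the current model finds implausible.
    conf = model.predict_proba(X)[np.arange(len(y)), y]
    X, y = X[conf > 0.35], y[conf > 0.35]
    # Model step: re-tune and refit on the refreshed data.
    best_C = max((0.1, 1.0, 10.0),
                 key=lambda C: cross_val_score(
                     LogisticRegression(C=C), X, y, cv=3).mean())
    model = LogisticRegression(C=best_C).fit(X, y)
    print(f"cycle {cycle}: {len(y)} rows kept, C={best_C}")
```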