Data Efficiency in Food Systems


Data efficiency plays a vital role in applying AI-based solutions to next-generation food systems. In this project, we will identify key challenges in food systems using domain-specific examples and develop technical approaches to address data efficiency in food systems.

The project ties closely to projects submitted from the application clusters, as explained below. In this project, we highlight and abstract the challenges in food systems with the goal of developing generalizable solutions to food systems. In application-cluster projects, the focus is to address the specific challenges of the corresponding applications, starting with data collection.


The data-efficiency challenge arises from two salient features of the food system: (1) high variability and diversity in crop traits, environmental conditions, multi-faceted quality measures, and consumer preferences; and (2) the high cost of data collection, in both labor and time (e.g., the innate growth cycle of crops), together with the low quality of observational data (e.g., self-reported dietary intake data). The first feature gives rise to a highly complex learning space that all AI solutions must navigate: inputs and outputs are high-dimensional, the rewards and losses that serve as feedback for adaptation and learning are difficult to define (e.g., the taste of a strawberry variety), and objective function landscapes are highly nonlinear and non-convex. Compounding this difficult learning task, the second feature starves AI models, leaving them only a few noisy and incomplete data points to learn from. Together, these two challenges highlight the importance of data efficiency in developing effective AI algorithms for next-generation food systems. Consider the following applications:

  1. Molecular breeding. One project proposes to develop a simulation tool for breeding programs to evaluate crossing decisions and experimental designs for data collection. It will use recent data on quality and yield from the UC Davis strawberry and pepper breeding programs to estimate genetic values of candidate lines.
  2. Ag production. A 3D plant growth model has been developed at UC Davis. The model can help guide algorithms in yield prediction and ag production control. The model can be improved using real-world data, e.g., real growth images or the fine-grained yield data from a specific field.
  3. Food processing. One project on sanitation proposes to develop a Digital Twin model and corresponding machine learning algorithms for optimized components associated with pathogen contact-tracing, sanitation and decontamination.
  4. Nutrition. One project predicts glycan content from real-world food diaries. Obtaining labels from dietitians is an expensive and time-consuming process, so reducing the number of labels they need to generate is key to scalability.

Common to all these examples are the foundational research questions that we hope to address in this project: 1) when data samples are expensive, how do we obtain them as efficiently as possible; and 2) how do we bridge the gap between (simulation) models and the real-world environment?


The project is closely related to projects proposed in the application clusters. The success of this project can benefit multiple food system applications.


To address the data-efficiency challenge common to all four application clusters identified above, we consider the following two-pronged approach.

Improving models through active learning and efficient exploration. In the application problems above, we face large-scale learning problems where obtaining labeled data is costly and time-consuming. This holds for simulation data as well as real data. In molecular breeding, obtaining a new sample takes multiple growing seasons and significant cost. In simulation, the better the model, the higher its computational complexity typically is. For example, it takes five minutes to generate one good image from the 3D model used in ag production. We will pursue an active learning approach, in which the decision maker actively chooses which data points to use or which experiments to carry out in order to gather the most relevant information for the learning task at hand. The focus is on exploiting latent structures in the learning space that are determined by the specific underlying applications in food systems. Furthermore, when the state-action space is large, we will explore efficient approximation and exploration techniques. The literature offers various techniques that use different mechanisms to measure state-action uncertainties. Building on our existing work using GAN-based models, we will investigate efficient exploration strategies that incorporate domain knowledge.
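As a rough illustration of the active learning idea described above, the following Python sketch applies uncertainty sampling to a toy one-dimensional threshold problem. It is not the project's method: the `oracle` function, the `true_threshold` value, and the random `pool` are hypothetical stand-ins for an expensive labeling source (such as a dietitian or a field trial) and its candidate samples.

```python
import random


def oracle(x, true_threshold=0.62):
    """Hypothetical expensive labeler; the threshold is an assumed ground truth."""
    return 1 if x >= true_threshold else 0


def boundary_interval(labeled):
    """Return (lo, hi): the largest x labeled 0 and the smallest x labeled 1."""
    lo = max((x for x, y in labeled if y == 0), default=0.0)
    hi = min((x for x, y in labeled if y == 1), default=1.0)
    return lo, hi


def most_uncertain(pool, labeled):
    """Uncertainty sampling: pick the unlabeled point nearest the boundary estimate."""
    lo, hi = boundary_interval(labeled)
    boundary = (lo + hi) / 2
    return min(pool, key=lambda x: abs(x - boundary))


def active_learn(pool, budget):
    """Spend at most `budget` oracle calls, always querying the most uncertain point."""
    pool = list(pool)
    labeled = []
    for _ in range(budget):
        if not pool:
            break
        x = most_uncertain(pool, labeled)
        pool.remove(x)
        labeled.append((x, oracle(x)))
    lo, hi = boundary_interval(labeled)
    return (lo + hi) / 2, labeled


random.seed(0)
pool = [random.random() for _ in range(1000)]  # unlabeled candidate samples
estimate, labeled = active_learn(pool, budget=12)
print(f"labels used: {len(labeled)}, estimated threshold: {estimate:.3f}")
```

In this toy setting the learner labels only 12 of the 1,000 candidates, because each query is placed where the current model is least certain. Real food-system problems are far higher-dimensional, so the uncertainty measure would have to come from the model itself (for example, predictive variance or ensemble disagreement).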

Bridging the gap between simulation models and real environments. Simulation plays a key role in studying complex real systems in a safe and cost-effective manner. Many projects have existing simulation tools or models built using existing data, and several proposed projects depend on building digital-twin models. While simulation tools are powerful, there is typically a gap between simulation models and the real world. How to bridge this gap is a key research question relevant to all application clusters. We plan to explore two types of approaches: 1) domain adaptation, where a model trained in a source domain (simulation) is adapted to a new target domain (a real environment); and 2) domain randomization, where discrepancies between the source and target domains are modeled as variability in the source domain. We plan to study the integration of model-based and model-free sequential decision processes. We also plan to investigate meta-learning in this setting, incorporating invariant features identified both through training and through domain knowledge. Finally, we plan to investigate approaches based on generative adversarial networks (GANs), in which GANs are used to augment the training data.
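To make the domain randomization idea above concrete, here is a minimal Python sketch under stated assumptions. The `simulate` function is a hypothetical simulator in which a camera reading equals a plant trait plus a lighting offset, and a reference reading (for example, a calibration card) captures the offset alone. None of these names, ranges, or noise levels come from the project; they are chosen only for illustration.

```python
import random


def simulate(n, offset_range, rng):
    """Hypothetical simulator returning (camera, reference, trait) samples.

    camera    = trait + lighting offset + sensor noise
    reference = lighting offset + sensor noise
    """
    data = []
    for _ in range(n):
        trait = rng.uniform(0.0, 1.0)
        offset = rng.uniform(*offset_range)
        camera = trait + offset + rng.gauss(0.0, 0.01)
        reference = offset + rng.gauss(0.0, 0.01)
        data.append((camera, reference, trait))
    return data


def fit_linear(data):
    """Least squares for trait ~= w1*camera + w2*reference (2x2 normal equations)."""
    s11 = sum(c * c for c, r, t in data)
    s12 = sum(c * r for c, r, t in data)
    s22 = sum(r * r for c, r, t in data)
    b1 = sum(c * t for c, r, t in data)
    b2 = sum(r * t for c, r, t in data)
    det = s11 * s22 - s12 * s12
    return (b1 * s22 - b2 * s12) / det, (b2 * s11 - b1 * s12) / det


def mean_abs_error(weights, data):
    """Average absolute prediction error of the linear model on `data`."""
    w1, w2 = weights
    return sum(abs(w1 * c + w2 * r - t) for c, r, t in data) / len(data)


rng = random.Random(0)
fixed_sim = simulate(2000, (0.0, 0.0), rng)        # nominal lighting only
randomized_sim = simulate(2000, (-0.5, 0.5), rng)  # lighting randomized in simulation
real_world = simulate(500, (0.2, 0.4), rng)        # stand-in for the real environment

err_fixed = mean_abs_error(fit_linear(fixed_sim), real_world)
err_randomized = mean_abs_error(fit_linear(randomized_sim), real_world)
print(f"fixed sim: {err_fixed:.3f}, randomized sim: {err_randomized:.3f}")
```

The model trained on the fixed simulator never sees lighting vary, so it cannot learn to use the reference reading and carries the real-world lighting offset into its predictions. The model trained on the randomized simulator learns to subtract the reference, which is the sense in which the sim-to-real discrepancy is treated as variability in the source domain.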

Project Team


The projects are jointly conducted with partners through the AIFS Network:

Cornell University

Q. Zhao


UC Davis

M. Earles


D. Lemay


X. Liu


N. Nitin


D. Runcie


I. Tagkopoulos


Z. Yu


UC Berkeley

T. Zohdi