Statistical Learning for Food Safety Assessment under Privacy and Resource Constraints


We aim to develop data analytics for food safety assessment, specifically,

  • to build digital-twin models of processing facilities, utilizing pre-existing industry datasets and developing the ability to integrate diverse streams of data for improved model parameterization;
  • to develop statistical learning and inference algorithms for rapid and reliable pathogen detection and risk-averse prediction of effective corrective measures, by integrating synthesized data from the digital-twin models with the facility-owned data;
  • to address practical constraints, in particular, privacy constraints to incentivize data sharing and improve human trust, as well as resource constraints in terms of computation and memory complexity.


These digital-twin models, integrated with the constraint-informed statistical learning and data privatization algorithms, would allow operators to receive streamlined information (e.g., in real time) and to rapidly identify key factors or sites at risk, ultimately improving food-safety-related decision making and resulting in more cost-effective pathogen control programs. Our long-term goal is to extend the data analytics proposed here, developed using Listeria monocytogenes models, to decision support tools for other pathogens affecting food production, including the spread of COVID-19 among workers in the food industry.


The food processing industry has a long-standing need for food-safety data analytics to aid rapid and reliable decision making. A major food safety concern is Listeria monocytogenes, a foodborne pathogen with a case fatality rate of 20% and more than $4 billion in annual costs to American consumers and food companies. Science-based Listeria environmental monitoring programs in food processing facilities are a key tool to reduce the risk of food contamination. The key challenge is the high cost and risk involved in Listeria testing and experimentation. Decision making in existing Listeria control programs is typically based on sparse data and intuitive human judgment, resulting in highly suboptimal solutions, especially in complex situations and under stress. While there is considerable interest in the food industry in agent-based simulation models to aid decision making, several challenges prevent their full utilization.

First, the models are complex: they present a steep learning curve for agricultural scientists and demand large memory and computation resources. Their integration with learning and prediction to generate actionable information in the presence of uncertainties in both models and data has not received sufficient research attention. Second, the data required for model development are privately held and distributed across a complex space of production facilities. In particular, because of privacy and liability concerns, any data on L. monocytogenes identification in a facility are kept under strict confidence. Addressing these challenges requires interdisciplinary collaboration that brings together experts in both food safety and machine learning.


The figure below illustrates our integrative approach that consists of three components.

Digital-twin models: We aim to develop agent-based modeling and simulation tools to facilitate decision-making in complex systems and to minimize the risk of making wrong decisions. Models for five produce processing facilities are under development. We will augment the agent-based models on Listeria dynamics in food processing facilities to allow integration of diverse industry datasets and real-time data streams.

Statistical learning for rare event detection: Food safety events are typically rare. The challenges in detecting such rare events lie in:

(i) the massive search space and the high cost and risk in assessing potential corrective measures;

(ii) the need for high detection accuracy due to the costly and potentially catastrophic consequences associated with false positives and negatives;

(iii) the time sensitivity of the problem due to the urgency for taking recourse measures.

Our technical approach is rooted in active hypothesis testing, which sequentially and adaptively determines where to search based on the current estimate of where the rare events may reside. Integrated into active hypothesis testing is an online learning component that mitigates the model uncertainty inherent to digital twins and adapts to real-time data, model drift, and abrupt changes. Risk aversion will be modeled explicitly.
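As a rough illustration of the sequential, adaptive search idea, the sketch below keeps a posterior over which of several sampling sites harbors the contamination and always tests the currently most likely site. It is a toy under stated assumptions, not the project's algorithm: exactly one contaminated site, a test with known detection and false-positive rates (the hypothetical parameters p_detect and p_false), a greedy test-selection rule, and no online learning or risk-aversion component.

```python
import random


def active_search(n_sites, true_site, p_detect=0.8, p_false=0.1,
                  confidence=0.99, max_tests=10_000, seed=0):
    """Toy active hypothesis test: locate the single contaminated site.

    Assumptions (illustrative only): exactly one site is contaminated,
    a test at that site is positive with probability p_detect, and a
    test at any clean site is positive with probability p_false.
    """
    rng = random.Random(seed)
    posterior = [1.0 / n_sites] * n_sites  # uniform prior over sites
    best, tests_used = 0, 0

    for tests_used in range(1, max_tests + 1):
        # Adaptive step: test the site currently believed most likely.
        site = max(range(n_sites), key=posterior.__getitem__)

        # Simulate the noisy test outcome at that site.
        p_pos = p_detect if site == true_site else p_false
        positive = rng.random() < p_pos

        # Bayes update under each hypothesis "site h is the contaminated one".
        for h in range(n_sites):
            p = p_detect if h == site else p_false
            posterior[h] *= p if positive else 1.0 - p
        total = sum(posterior)
        posterior = [w / total for w in posterior]

        # Stop once one hypothesis is sufficiently likely.
        best = max(range(n_sites), key=posterior.__getitem__)
        if posterior[best] >= confidence:
            break

    return best, tests_used, posterior


if __name__ == "__main__":
    site, n_tests, post = active_search(n_sites=20, true_site=7)
    print(f"declared site {site} after {n_tests} tests "
          f"(posterior {post[site]:.3f})")
```

The stopping threshold trades detection accuracy against the number of tests, which mirrors the tension between challenges (ii) and (iii) above: a higher confidence level reduces errors but delays the decision.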

Privacy guarantee: Data privacy is a primary concern for the food industry, where leakage of information about the detection of a pathogen can have significant financial and societal impacts. Differential privacy, typically achieved by adding calibrated noise to data or to query results, is widely adopted across industries to protect the privacy of the entities that contribute data. The added noise comes at the price of reduced utility. While the availability of large amounts of data can help mitigate this effect, the rarity of the events of interest could pose significant challenges. One solution to this challenge is the notion of context-aware privacy, which provides stronger privacy guarantees for events of concern. Performing active hypothesis testing on privatized data samples is also of interest.
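The utility loss for rare events can be seen in a minimal sketch of the standard Laplace mechanism applied to a count of positive test results. The function name, the sample data, and the choice of epsilon are hypothetical; the sketch covers only plain epsilon-differential privacy for a single count query and does not attempt context-aware privacy.

```python
import random


def private_positive_count(results, epsilon, rng=None):
    """Release the number of positive tests with epsilon-differential privacy.

    Uses the Laplace mechanism: changing one test result changes the
    count by at most 1 (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for r in results if r)
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


if __name__ == "__main__":
    # Hypothetical facility data: 3 positive swabs out of 1000.
    swabs = [True] * 3 + [False] * 997
    rng = random.Random(42)
    for eps in (0.1, 0.5, 2.0):
        released = private_positive_count(swabs, eps, rng)
        print(f"epsilon={eps}: released count {released:.2f}")
```

With only a handful of true positives, noise of scale 1/epsilon can be as large as the count itself at small epsilon, which is why rare events are harder to privatize than aggregate statistics over abundant data.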


Success, measured over multiple years, manifests in two aspects:

(i) digital-twin models contributing to an existing library, together with new risk-averse and constraint-informed AI solutions for food safety assessment;

(ii) training of personnel (postdoc and GRA) at the interface between AI and food safety for the next-generation workforce.

Project Team


The projects are jointly conducted with partners through the AIFS Network:

Cornell University

J. Acharya

R. Ivanek

M. Wiedmann

Q. Zhao

UC Berkeley

T. Zohdi

UC Davis

N. Nitin