Paper Info
Title: EVAAA: A Virtual Environment Platform for Essential Variables in Autonomous and Adaptive Agents
Authors: Sungwoo Lee, Jungmin Lee, Sohee Kim, Hyebhin Yoon, Shinwon Park, Junhyeok Park, Jaehyuk Bae, Seok-Jun Hong, Choong-Wan Woo
Venue: NeurIPS 2025, Datasets and Benchmarks Track
Code: cocoanlab/evaaa
I learned about this work through Dr. Choong-Wan Woo's talk "Pain, Affect, and Life-Inspired AI" at Rethinking Intelligence: A NeuroAI Symposium (RIANS). It was an amazing and insightful talk that connected the neuroscience of interoception and affect to the design of autonomous agents, and this paper was one of the works he presented. Many thanks to Dr. Woo for sharing this line of research.
Prior Knowledge
Homeostatic regulation is a core principle in biological systems: organisms maintain internal physiological variables within viable bounds to survive. W. Ross Ashby formalized this idea in Design for a Brain (1952), proposing that intelligent agents regulate "essential variables" to remain viable despite environmental disturbances. The idea was later operationalized in the Two-Resource Problem of McFarland and Spier (1997), which demonstrated the behavioral trade-offs agents face when managing competing internal needs under limited resources.
In the RL literature, Keramati and Gutkin proposed a normative framework unifying reward maximization with homeostatic regulation, defining reward as the reduction in deviation from an internal set point. More recent work by Yoshida (2017) implemented embodied agents in 3D environments that exhibit survival-oriented behaviors through internal-state regulation. These ideas connect to broader research on intrinsic motivation and neuromodulation, where internal signals such as hunger, thirst, or fatigue serve as context that shapes learning dynamics, exploration, and risk sensitivity.
Main Question
Can a unified, biologically inspired simulation environment that grounds agent reward in internal physiological state regulation serve as a scalable benchmark for studying autonomy, adaptivity, and generalization in RL agents, and does it reveal meaningful gaps between current algorithms and biological-level adaptive behavior?
Key Claims
- A single reward function derived from internal state dynamics can replace task-specific, manually engineered rewards and still support the emergence of diverse adaptive behaviors across progressively complex environments.
- Multimodal egocentric perception (vision, olfaction, thermoception, collision) combined with interoceptive signals provides a richer and more biologically grounded observation space than existing RL benchmarks.
- The two-tiered environment architecture (naturalistic curriculum for training, unseen testbeds for evaluation) enables systematic assessment of generalization under internal-state constraints, revealing that increased training complexity does not monotonically improve generalization.
- A large gap persists between state-of-the-art model-based RL (DreamerV3) and human performance on the proposed tasks, indicating that EVAAA is a challenging and useful benchmark for developing more adaptive agents.
Method
Formulation
EVAAA models agent behavior as an extended MDP $\langle \mathcal{S}, \mathcal{A}, P, R \rangle$, where the state space is factored as $\mathcal{S} = \mathcal{S}^{\text{ext}} \times \mathcal{S}^{\text{int}}$. The external state $s^{\text{ext}}$ consists of egocentric sensory inputs, while the internal state $s^{\text{int}} = (e_1, e_2, e_3, e_4)$ comprises four essential variables (EVs): satiation, hydration, body temperature, and tissue damage. Each EV evolves through passive time-dependent decay and discrete changes from agent-environment interactions (e.g., eating food increases satiation).
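To make the dynamics concrete, here is a minimal NumPy sketch of one internal-state update step. The drift constants, event keys, and viable-range check are illustrative assumptions of mine, not values from the paper or the EVAAA repository:

```python
import numpy as np

# Order of EVs: satiation, hydration, body temperature, tissue damage.
SET_POINTS = np.array([1.0, 1.0, 37.0, 0.0])       # illustrative set points
EV_DRIFT   = np.array([-0.001, -0.002, 0.0, 0.0])  # hypothetical passive decay per step

def step_internal_state(ev: np.ndarray, events: dict) -> np.ndarray:
    """Advance EVs by passive drift plus discrete interaction events."""
    ev = ev + EV_DRIFT
    ev[0] += events.get("ate_food", 0.0)       # eating raises satiation
    ev[1] += events.get("drank_water", 0.0)    # drinking raises hydration
    ev[2] += events.get("heat_exposure", 0.0)  # bonfires/weather shift temperature
    ev[3] += events.get("damage", 0.0)         # collisions/predators add damage
    return ev

def terminated(ev: np.ndarray, low: np.ndarray, high: np.ndarray) -> bool:
    """Episode ends if any EV leaves its viable range."""
    return bool(np.any((ev < low) | (ev > high)))
```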
The reward function captures the temporal reduction in homeostatic deviation:

$$r_t = D(s^{\text{int}}_t) - D(s^{\text{int}}_{t+1}),$$

where $D(s^{\text{int}}) = \sum_i (e_i - \bar{e}_i)^2$ measures squared deviation from the set point $\bar{e}_i$. An episode terminates if any EV exits its viable range. In the appendix, the authors also present the reward as a negative normalized Euclidean distance from the set points:

$$r_t = -\sqrt{\sum_i \left( \frac{e_{i,t} - \bar{e}_i}{d_i} \right)^2},$$

where $\bar{e}_i$ is the set point and $d_i$ is the allowed deviation for each variable. This formulation is preserved across all environments, eliminating the need for task-specific reward engineering.
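Both formulations fit in a few lines; the following NumPy sketch uses my own argument names (`set_points`, `allowed_dev`) and is not taken from the EVAAA code:

```python
import numpy as np

def drive(ev, set_points):
    """D(s_int): squared deviation of the EVs from their set points."""
    return np.sum((ev - set_points) ** 2)

def reward_drive_reduction(ev_t, ev_next, set_points):
    """r_t = D(s_int_t) - D(s_int_{t+1}): positive when deviation shrinks."""
    return drive(ev_t, set_points) - drive(ev_next, set_points)

def reward_negative_distance(ev, set_points, allowed_dev):
    """Appendix form: negative Euclidean distance, normalized per variable."""
    z = (ev - set_points) / allowed_dev
    return -np.sqrt(np.sum(z ** 2))
```

Note the design difference: the drive-reduction form rewards movement toward the set points (a difference of potentials, in the spirit of Keramati and Gutkin), while the appendix form penalizes the current distance itself.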
Agent Design
The agent is embodied in a 3D Unity environment and operates with five discrete actions: no action, move forward, turn left, turn right, and eat. Observations span five modalities: interoception (four EV scalars), egocentric vision (64x64 RGB from a first-person camera), olfaction (a 10-dimensional vector aggregated from nearby resource objects weighted by inverse distance), thermoception (a 3x3 grid of thermal sensors sampling from an underlying 100x100 temperature field), and collision detection (100 radial rays across 360 degrees, aggregated into a 10-dimensional binary vector).
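The olfaction channel is the least standard of these sensors, so a sketch may help. The inverse-distance kernel below follows the description above, but the exact cutoff radius and normalization used in EVAAA may differ:

```python
import numpy as np

def olfaction_vector(agent_pos, resources, n_types: int = 10) -> np.ndarray:
    """Aggregate nearby resource objects into a 10-d smell vector,
    weighted by inverse distance (a sketch; the real kernel may differ).

    `resources` is a list of (position, type_index) pairs, where
    `type_index` indexes the resource category (food kinds, water, ...)."""
    smell = np.zeros(n_types)
    agent_pos = np.asarray(agent_pos)
    for pos, type_idx in resources:
        dist = np.linalg.norm(np.asarray(pos) - agent_pos)
        smell[type_idx] += 1.0 / (dist + 1e-6)  # closer objects smell stronger
    return smell
```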
Training Environments
The naturalistic training curriculum consists of four levels with increasing complexity. Level 1 introduces basic resource foraging in open fields. Level 2 adds obstacles (bonfires, rocks, bushes) and stable spatial cues linking resources to landmarks. Level 3 requires navigation through semi-structured terrain with partial wall enclosures, and in Level 3-2 food spawns in a single randomly selected region per episode, forcing dynamic exploration. Level 4 introduces a patrolling predator with a state machine (resting, searching, chasing, attacking) and a day/night cycle that modulates visibility, temperature, and predator activity.
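One hypothetical way to express this curriculum as a config, with an assumed promotion rule; the paper does not specify the exact advancement criterion, and these keys are not the repository's format:

```python
# Illustrative curriculum schedule mirroring the paper's level descriptions.
CURRICULUM = [
    {"level": "1",   "obstacles": False, "landmarks": False, "predator": False},
    {"level": "2",   "obstacles": True,  "landmarks": True,  "predator": False},
    {"level": "3-1", "terrain": "semi-structured", "walls": "partial"},
    {"level": "3-2", "food_spawn": "single-random-region-per-episode"},
    {"level": "4",   "predator": True, "day_night_cycle": True},
]

def next_level(avg_survival_steps: float, threshold: float, idx: int) -> int:
    """Advance when average survival exceeds a threshold (assumed criterion)."""
    if avg_survival_steps >= threshold:
        return min(idx + 1, len(CURRICULUM) - 1)
    return idx
```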
Experimental Testbeds
A suite of seven unseen test tasks is organized into basic homeostatic regulation (two-resource choice, collision avoidance) and advanced adaptive skills (risk-taking, Y-maze spatial navigation, goal manipulation under volatile internal states, multi-goal planning, and a predator task with a day/night cycle). These testbeds isolate specific decision-making capacities such as need prioritization, anticipatory planning, rapid goal revision after sudden internal-state shifts, and temporally sensitive survival strategies.
Baseline Algorithms and Human Benchmarking
Three RL algorithms are evaluated: DQN (reactive, value-based), PPO (on-policy), and DreamerV3 (model-based, with a world model, planning, and memory). Both curriculum-based and non-curriculum training regimes are tested. Eight human participants completed the same training and testing tasks to provide an upper-bound reference. In training, the primary metric is survival steps per episode; in testing, success is defined task-specifically as either resource acquisition within a step budget or survival for the full episode duration.
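The two test-time success criteria can be stated as a small predicate; `task_kind` and the argument names here are hypothetical, chosen only to mirror the prose above:

```python
def task_success(task_kind: str, acquired: bool, steps_used: int,
                 step_budget: int, survived_steps: int, episode_len: int) -> bool:
    """Task-specific success, per the paper's two criteria (sketch):
    - "acquisition" tasks: obtain the needed resource within the step budget;
    - "survival" tasks: stay alive for the full episode duration."""
    if task_kind == "acquisition":
        return acquired and steps_used <= step_budget
    return survived_steps >= episode_len
```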
Result
In the naturalistic training environments, DreamerV3 consistently achieved the highest survival steps across all levels, benefiting from its world model and memory under partial observability. DQN converged faster than DreamerV3 on Level 1-1 but, like PPO, could not survive reliably beyond that level. Curriculum-based training outperformed direct training on complex levels for all algorithms, and performance dropped sharply beyond Level 3-2, confirming that the benchmark's difficulty scales meaningfully.
On the unseen testbeds, human participants achieved near-ceiling success rates across all tasks, substantially outperforming every RL agent. DreamerV3 performed best among the algorithms but exhibited inconsistent generalization: training on harder levels did not yield monotonic improvements on test tasks. For instance, success on the two-resource choice and collision avoidance tasks did not consistently benefit from increased training complexity.
A modality ablation study revealed that vision and the essential variables are both critical for learning: removing either causes complete failure. Among the remaining modalities, thermoception and collision proved more important than olfaction for both training and generalization. Emergent behaviors were also observed, including detouring after an initial misjudgment in the two-resource task (in agents trained on Level 1-2 or higher) and self-termination in the Y-maze (in agents trained on Level 3-2, which had learned to reset episodes when unable to locate food).
Related
- Homeostatic reinforcement learning for integrating reward collection and physiological stability - The normative framework by Keramati and Gutkin that defines reward as reduction in homeostatic deviation, which directly informs EVAAA's reward design.
- Homeostatic agent for general environment - Yoshida's prior work implementing embodied agents with survival-oriented behaviors through internal-state regulation in 3D environments.
- Mastering diverse control tasks through world models - DreamerV3, the best-performing baseline algorithm in EVAAA, demonstrating model-based RL with planning and memory.
- Benchmarking the spectrum of agent capabilities - Crafter, a survival benchmark with multiple internal states and unified reward, but lacking 3D embodiment and multimodal perception.
- Avalon: A benchmark for RL generalization using procedurally generated worlds - A first-person 3D survival benchmark with procedural generation, but limited to a single scalar energy variable rather than multi-dimensional internal states.
- Life-inspired interoceptive artificial intelligence for autonomous and adaptive agents - A companion paper from the same group outlining the broader vision for interoceptive AI that EVAAA is designed to support.