Training StateTree RL Agents via Hierarchical Reinforcement Learning

This example reproduces the simple hierarchical Branch Selector environment (an Evaluator plus Forward and Backward Tasks). When the flag is true, the agent should move in the positive direction; when the flag is false, it should move in the negative direction.

Overhead view of the training environment

Prerequisites

Before starting, please refer to the Getting Started with Schola guide to set up the Unreal Engine project and Schola plugin.

Architecture Overview

Schola’s StateTree integration consists of four main components:

  1. StateTree Training Environment - Actor that manages the training loop and agent lifecycle
  2. Step Inference Task - Task node that defines an agent’s observation/action spaces
  3. RL Decision Evaluator - Evaluator that drives branch selection via RL
  4. RL Branch Condition - Condition that checks which branch the RL agent selected

Example: Hierarchical Basic Branch Selector

This guide uses a simple hierarchical example where an evaluator learns to select forward or backward movement based on a flag value.

StateTree Structure:

Root State
├── [Evaluator] BranchSelector
│   ├── Observes: current flag (0 or 1)
│   └── Action Space: Discrete(2) → 0=Forward, 1=Backward
├── → Forward State [RLBranch == 0]
│   └── [Task] MoveForwardTask
│       ├── Observes: position
│       └── Action Space: Box(-1, 1)
└── → Backward State [RLBranch == 1]
    └── [Task] MoveBackwardTask
        ├── Observes: position
        └── Action Space: Box(-1, 1)
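To make the hierarchy easier to reason about outside the editor, it can be written down as plain data. The following is a minimal Python sketch, not Schola's API: `Space`, `BRANCH_SELECTOR`, `TASKS`, and `active_task` are hypothetical names that only mirror the tree above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Space:
    """Hypothetical stand-in for an action space description."""
    kind: str                     # "discrete" or "box"
    n: Optional[int] = None       # number of choices for a discrete space
    low: Optional[float] = None   # lower bound for a box space
    high: Optional[float] = None  # upper bound for a box space


# Evaluator: observes the flag and picks a branch index (Discrete(2)).
BRANCH_SELECTOR = Space(kind="discrete", n=2)

# Both tasks observe position and output a continuous value in [-1, 1].
TASKS = {
    0: ("MoveForwardTask", Space(kind="box", low=-1.0, high=1.0)),
    1: ("MoveBackwardTask", Space(kind="box", low=-1.0, high=1.0)),
}


def active_task(rl_branch: int) -> str:
    """Return the task whose state is entered for a given branch index."""
    if not 0 <= rl_branch < BRANCH_SELECTOR.n:
        raise ValueError(f"branch index {rl_branch} is outside Discrete(2)")
    return TASKS[rl_branch][0]
```

Calling `active_task(0)` returns `"MoveForwardTask"`, matching the `[RLBranch == 0]` condition on the Forward state.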

Setting Up a StateTree Training Environment

Step 1: Create the StateTree Asset

  1. In the Content Browser, right-click → Artificial Intelligence → State Tree
  2. Open the StateTree editor
  3. Design your state hierarchy with Forward and Backward states

Step 2: Create the Evaluator Blueprint

  1. Create a Blueprint that inherits from UStateTreeEvaluator_RLDecision
  2. Override Define to set up the observation and action spaces
  3. Override Observe to provide the current flag as the observation
  4. Override Compute Reward to reward choosing the branch that moves the agent in the correct direction for the current flag
  5. Override ResetForEpisode to reset the evaluator state at the beginning of each episode. In this example, nothing in the evaluator needs resetting, so we only call Observe to get the initial observation for the new episode.
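The evaluator overrides are implemented in Blueprint, but their logic can be sketched in a few lines of Python. This is an illustrative assumption rather than the Blueprint graph itself: the function names are hypothetical, and the +1/-1 reward values are one plausible choice that the guide does not specify.

```python
def observe_flag(flag: bool) -> float:
    """Encode the flag as the 0-or-1 observation shown in the tree."""
    return 1.0 if flag else 0.0


def expected_branch(flag: bool) -> int:
    """Branch 0 is Forward (flag true), branch 1 is Backward (flag false)."""
    return 0 if flag else 1


def evaluator_reward(flag: bool, selected_branch: int) -> float:
    """Assumed reward: +1 for the matching branch, -1 otherwise."""
    return 1.0 if selected_branch == expected_branch(flag) else -1.0
```

With this shaping, selecting Forward while the flag is true yields `1.0`, and selecting Backward in the same situation yields `-1.0`.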

Step 3: Create Task Blueprints

  1. Create a shared Blueprint for the Forward and Backward tasks that inherits from UStateTreeTask_StepInference
  2. Add a boolean variable bIsForwardTask to differentiate behavior
  3. Override Define to set up the observation and action spaces
  4. Override Observe to provide position observations
  5. Override Act to apply movement based on the selected action
  6. Override Compute Reward to reward movement in the correct direction based on the task type
  7. Override ResetForEpisode to reset the task state at the beginning of each episode. In this example, nothing in the task needs resetting, so we only call Observe to get the initial observation for the new episode.
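One way to share a single reward rule between both tasks is to flip its sign with the bIsForwardTask variable. The sketch below is a hypothetical Python rendering of that idea, assuming the action is a single value in the Box(-1, 1) space; the exact reward used in the example Blueprint is not given in this guide.

```python
def clip_action(action: float) -> float:
    """Keep the action inside the Box(-1, 1) bounds."""
    return max(-1.0, min(1.0, action))


def task_reward(is_forward_task: bool, action: float) -> float:
    """Assumed reward: positive when the action points the task's way.

    The forward task is rewarded for positive actions and the backward
    task for negative ones, so the same function serves both tasks.
    """
    clipped = clip_action(action)
    return clipped if is_forward_task else -clipped
```

Under this assumption, the forward task earns the most for action `1.0` and the backward task for action `-1.0`.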

Step 4: Set Up Transitions with RLBranch Conditions

  1. In the StateTree editor, create transitions from Root to Forward and Backward states
  2. Add an RL Branch Check condition to each transition
  3. Set BranchIndex to 0 for Forward, 1 for Backward
  4. Bind the condition’s SelectedBranch input to your evaluator’s output
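Conceptually, each transition passes only when the evaluator's selected branch equals that transition's BranchIndex. The following Python sketch illustrates this under the assumption that the condition is a plain equality test, as the `[RLBranch == 0]` notation in the tree suggests; `rl_branch_check`, `TRANSITIONS`, and `select_state` are hypothetical names.

```python
from typing import Optional


def rl_branch_check(selected_branch: int, branch_index: int) -> bool:
    """Pass when the evaluator's output matches this transition's index."""
    return selected_branch == branch_index


# (target state, BranchIndex) pairs as configured in the steps above.
TRANSITIONS = [("Forward", 0), ("Backward", 1)]


def select_state(selected_branch: int) -> Optional[str]:
    """Return the first state whose condition passes, or None if none do."""
    for state_name, branch_index in TRANSITIONS:
        if rl_branch_check(selected_branch, branch_index):
            return state_name
    return None
```

If the two transitions were given the same BranchIndex, only the first would ever fire in this model, which is why the indices must be distinct.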

Step 5: Create the Training Environment

  1. Create a Blueprint → Parent: StateTree Training Environment
  2. Set the StateTree Asset reference
  3. Add a boolean property bCurrentFlag
  4. Override IsEpisodeOver to define termination conditions
  5. Override OnEpisodeReset to randomize the flag
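The environment's episode logic can be pictured with a small Python stand-in. This is a sketch under stated assumptions: `ToyEnvironment` is a hypothetical class, and the step-limit termination rule is only an illustration, since the guide does not say which termination conditions the example uses.

```python
import random
from typing import Optional


class ToyEnvironment:
    """Hypothetical model of the flag and episode bookkeeping."""

    def __init__(self, max_steps: int = 50, seed: Optional[int] = None):
        self.max_steps = max_steps       # assumed termination: step limit
        self.rng = random.Random(seed)   # seeded for reproducible flags
        self.current_flag = False        # mirrors bCurrentFlag
        self.steps = 0

    def on_episode_reset(self) -> None:
        """Randomize the flag and clear the step counter."""
        self.current_flag = self.rng.random() < 0.5
        self.steps = 0

    def step(self) -> None:
        """Advance the episode by one step."""
        self.steps += 1

    def is_episode_over(self) -> bool:
        """End the episode once the assumed step limit is reached."""
        return self.steps >= self.max_steps
```

Randomizing the flag on every reset exposes the branch selector to both cases during training, so it cannot succeed by always picking one branch.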

Step 6: Set Up the Level

  1. Add your StateTree Training Environment actor to the level
  2. Add the actor that will be controlled by the StateTree
  3. Configure the environment’s StateTree reference

Training Configuration

Use RLlib via the Schola CLI to run training. Example command used for the simple hierarchical environment:

schola rllib train ppo editor \
--port 8002 \
--save-final-policy \
--export-onnx \
--fcnet-hiddens 16 16 \
--timesteps 100000

Argument explanations:

  • --port 8002: Port used by the Schola gRPC protocol (Unreal ↔ Python); change if you have port conflicts.

  • --save-final-policy: Persist the final learned policy to the checkpoint directory after training completes.

  • --export-onnx: Export the saved policy to ONNX for deployment in StateTree.

  • --fcnet-hiddens 16 16: Two hidden fully-connected layers with 16 units each for policy/value networks.

  • --timesteps 100000: Total environment timesteps for training.
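To get a feel for how small a 16 × 16 network is, the weights and biases of a plain fully-connected stack can be counted by hand. The Python sketch below is illustrative arithmetic only: `mlp_parameter_count` is a hypothetical helper, the input size of 1 and output size of 2 are assumptions based on the branch selector's flag observation and Discrete(2) action space, and the model RLlib actually builds may differ (for example, by adding separate value-function layers).

```python
from typing import Sequence


def mlp_parameter_count(input_dim: int,
                        hidden_sizes: Sequence[int],
                        output_dim: int) -> int:
    """Count weights plus biases for a plain fully-connected stack."""
    sizes = [input_dim, *hidden_sizes, output_dim]
    # Each layer has (inputs * outputs) weights and one bias per output.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


# Assumed shapes: 1 observation value in, 2 branch logits out.
branch_selector_params = mlp_parameter_count(1, [16, 16], 2)
print(branch_selector_params)  # 32 + 272 + 34 = 338
```

A network of a few hundred parameters is plausible for a task this simple, which is consistent with the short 100000-timestep training budget.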

Results

For the simple Branch Selector environment, the trained policies produced deterministic behavior:

  • The branch selector evaluator chooses 0 when the flag is true and 1 when the flag is false.

  • The forward task policy consistently outputs action 1.

  • The backward task policy consistently outputs action -1.
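The reported behavior can be sanity-checked against the task goal with a tiny rollout. This Python sketch hard-codes the three deterministic policies listed above and assumes, hypothetically, that each step moves the agent by action × step size; it is not a trained model and does not use Schola.

```python
def branch_policy(flag: bool) -> int:
    """Reported evaluator behavior: 0 when the flag is true, else 1."""
    return 0 if flag else 1


def task_policy(branch: int) -> float:
    """Reported task behavior: forward outputs 1, backward outputs -1."""
    return 1.0 if branch == 0 else -1.0


def rollout(flag: bool, steps: int = 10, step_size: float = 1.0) -> float:
    """Return the final position after following the policies."""
    position = 0.0
    branch = branch_policy(flag)
    for _ in range(steps):
        # Assumed dynamics: position changes by action * step_size.
        position += task_policy(branch) * step_size
    return position
```

With the flag true the position ends up positive, and with the flag false it ends up negative, which is the behavior the environment was designed to teach.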