Training StateTree RL Agents via Hierarchical Reinforcement Learning

This example reproduces the simple hierarchical Branch Selector environment (an evaluator plus Forward and Backward tasks). When the flag is true, the agent should move in the positive direction; when the flag is false, it should move in the negative direction.

Overhead view of the training environment

Prerequisites

Before starting, please refer to the Getting Started with Schola guide to set up the Unreal Engine project and Schola plugin.

Architecture Overview

Schola’s StateTree integration consists of four main components:

StateTree Training Environment - Actor that manages the training loop and agent lifecycle
Step Inference Task - Task node that defines an agent’s observation/action spaces
RL Decision Evaluator - Evaluator that drives branch selection via RL
RL Branch Condition - Condition that checks which branch the RL agent selected

Example: Hierarchical Basic Branch Selector

This guide uses a simple hierarchical example where an evaluator learns to select forward or backward movement based on a flag value.

StateTree Structure:

Root State
│
├── [Evaluator] BranchSelector
│   ├── Observes: current flag (0 or 1)
│   └── Action Space: Discrete(2) → 0=Forward, 1=Backward
│
├── → Forward State [RLBranch == 0]
│      └── [Task] MoveForwardTask
│          ├── Observes: position
│          └── Action Space: Box(-1, 1)
│
└── → Backward State [RLBranch == 1]
       └── [Task] MoveBackwardTask
           ├── Observes: position
           └── Action Space: Box(-1, 1)
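
The action spaces in the tree above can be sketched in plain Python to make the two levels of the hierarchy concrete. This is only an illustrative model: the Discrete and Box classes below are hypothetical stand-ins written for this example, not Schola or Unreal types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Discrete:
    """Hypothetical stand-in for a discrete space with n choices (0..n-1)."""
    n: int

    def contains(self, x) -> bool:
        return isinstance(x, int) and not isinstance(x, bool) and 0 <= x < self.n


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a one-dimensional continuous interval."""
    low: float
    high: float

    def contains(self, x) -> bool:
        return self.low <= float(x) <= self.high


# One action space per RL-driven node, mirroring the tree above.
ACTION_SPACES = {
    "BranchSelector": Discrete(2),       # 0=Forward, 1=Backward
    "MoveForwardTask": Box(-1.0, 1.0),
    "MoveBackwardTask": Box(-1.0, 1.0),
}

# Meaning of each branch index chosen by the evaluator.
BRANCH_NAMES = {0: "Forward", 1: "Backward"}
```

The evaluator makes a discrete choice between branches, while each task outputs a continuous value in [-1, 1].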

Setting Up a StateTree Training Environment

Step 1: Create the StateTree Asset

In Content Browser, right-click → Artificial Intelligence → State Tree
Open the StateTree editor
Design your state hierarchy with Forward and Backward states

Step 2: Create the Evaluator Blueprint

Create a Blueprint that inherits from UStateTreeEvaluator_RLDecision
Override Define to set up the observation and action spaces

Override Observe to provide the current flag (0 or 1) as the observation

Override Compute Reward to reward selecting the branch that matches the current flag

Override ResetForEpisode to reset the evaluator state at the beginning of each episode. In this example nothing in the evaluator needs resetting, so the override only calls Observe to get the initial observation for the new episode.
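
The logic behind the evaluator overrides can be sketched in plain Python. This is a hedged model, not the Blueprint itself: the class name BranchSelectorModel, the method names, and the +1/-1 reward values are assumptions made for illustration. Only the flag observation and the Discrete(2) action space come from the tree structure above.

```python
from dataclasses import dataclass

FORWARD, BACKWARD = 0, 1  # branch indices used throughout this guide


@dataclass
class BranchSelectorModel:
    """Plain-Python stand-in for the evaluator Blueprint (hypothetical)."""
    current_flag: bool = True

    def define(self) -> dict:
        # One flag observation (0 or 1) and a Discrete(2) branch choice.
        return {"observation": ("flag", (0, 1)), "action": ("discrete", 2)}

    def observe(self) -> float:
        # Encode the boolean flag as 1.0 (true) or 0.0 (false).
        return 1.0 if self.current_flag else 0.0

    def compute_reward(self, selected_branch: int) -> float:
        # Assumed reward shape: +1 for the branch matching the flag, else -1.
        correct = FORWARD if self.current_flag else BACKWARD
        return 1.0 if selected_branch == correct else -1.0

    def reset_for_episode(self) -> float:
        # Nothing to reset here; only return the initial observation.
        return self.observe()
```

With this reward, a flag of true favors branch 0 (Forward) and a flag of false favors branch 1 (Backward).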

Step 3: Create Task Blueprints

Create a shared Blueprint for Forward and Backward tasks inheriting from UStateTreeTask_StepInference:
Add a boolean variable bIsForwardTask to differentiate behavior
Override Define to set up the observation and action spaces

Override Observe to provide position observations

Override Act to apply movement based on the selected action

Override Compute Reward to reward movement in the correct direction based on the task type

Override ResetForEpisode to reset the task state at the beginning of each episode. In this example nothing in the task needs resetting, so the override only calls Observe to get the initial observation for the new episode.
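
The shared task logic can be modeled in a few lines of Python to show how bIsForwardTask changes the reward sign. This is a sketch under assumptions: MoveTaskModel, step_size, and the displacement-based reward are hypothetical names and choices, and the real Blueprint would move the controlled actor rather than a float.

```python
from dataclasses import dataclass


def clamp(value: float, low: float = -1.0, high: float = 1.0) -> float:
    """Keep an action inside the Box(-1, 1) range."""
    return max(low, min(high, value))


@dataclass
class MoveTaskModel:
    """Plain-Python stand-in for the shared task Blueprint (hypothetical)."""
    is_forward_task: bool      # mirrors the bIsForwardTask variable
    position: float = 0.0
    last_delta: float = 0.0
    step_size: float = 1.0     # hypothetical scale from action to movement

    def observe(self) -> float:
        return self.position

    def act(self, action: float) -> None:
        # Apply the clamped action as a signed displacement.
        self.last_delta = clamp(action) * self.step_size
        self.position += self.last_delta

    def compute_reward(self) -> float:
        # Assumed reward: displacement along the direction this task wants.
        direction = 1.0 if self.is_forward_task else -1.0
        return direction * self.last_delta

    def reset_for_episode(self) -> float:
        # Nothing to reset in the task; only return the initial observation.
        return self.observe()
```

Under this reward, the forward task is paid for positive displacement and the backward task for negative displacement, so one Blueprint can serve both states.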

Step 4: Set Up Transitions with RLBranch Conditions

In the StateTree editor, create transitions from Root to Forward and Backward states
Add an RL Branch Check condition to each transition
Set BranchIndex to 0 for Forward, 1 for Backward
Bind the condition’s SelectedBranch input to your evaluator’s output
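
The effect of the two conditions can be sketched as a simple predicate: each transition passes only when the evaluator's selected branch equals that transition's BranchIndex. The function and table names below are hypothetical and only model the comparison, not the StateTree condition class.

```python
def rl_branch_check(selected_branch: int, branch_index: int) -> bool:
    """Model of the condition: pass when the selection matches this index."""
    return selected_branch == branch_index


# (target state, BranchIndex) pairs as configured in the steps above.
TRANSITIONS = [("Forward", 0), ("Backward", 1)]


def resolve_transition(selected_branch: int) -> str:
    """Return the first state whose condition passes."""
    for state_name, branch_index in TRANSITIONS:
        if rl_branch_check(selected_branch, branch_index):
            return state_name
    raise ValueError(f"no transition configured for branch {selected_branch}")
```

Because the indices are distinct, exactly one transition passes for any valid selection.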

Step 5: Create the Training Environment

Create a Blueprint → Parent: StateTree Training Environment
Set the StateTree Asset reference
Add a boolean property bCurrentFlag
Override IsEpisodeOver to define termination conditions

Override OnEpisodeReset to randomize the flag
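
A minimal Python model of these two overrides is shown below. The step limit used as the termination condition is an assumption, since the guide does not state which condition the example uses; max_steps and the class name are hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class TrainingEnvironmentModel:
    """Plain-Python stand-in for the environment Blueprint (hypothetical)."""
    max_steps: int = 100          # assumed termination condition
    current_flag: bool = False    # mirrors the bCurrentFlag property
    step_count: int = 0
    rng: random.Random = field(default_factory=random.Random)

    def step(self) -> None:
        self.step_count += 1

    def is_episode_over(self) -> bool:
        # End the episode once the assumed step budget is used up.
        return self.step_count >= self.max_steps

    def on_episode_reset(self) -> None:
        # Start a fresh episode with a randomly chosen flag.
        self.step_count = 0
        self.current_flag = self.rng.random() < 0.5
```

Randomizing the flag on every reset exposes the evaluator to both cases, so it can learn to map each flag value to the matching branch.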

Step 6: Set Up the Level

Add your StateTree Training Environment actor to the level
Add the actor that will be controlled by the StateTree
Configure the environment’s StateTree reference

Training Configuration

Use RLlib via the Schola CLI to run training. The following command was used for the simple hierarchical environment:

schola rllib train ppo editor \
  --port 8002 \
  --save-final-policy \
  --export-onnx \
  --fcnet-hiddens 16 16 \
  --timesteps 100000

Argument explanations:

--port 8002: Port used by the Schola gRPC protocol (Unreal ↔ Python); change if you have port conflicts.
--save-final-policy: Persist the final learned policy to the checkpoint directory after training completes.
--export-onnx: Export the saved policy to ONNX for deployment in StateTree.
--fcnet-hiddens 16 16: Two hidden fully-connected layers with 16 units each for policy/value networks.
--timesteps 100000: Total environment timesteps for training.
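
If training is launched from a script rather than typed by hand, the same command can be assembled in Python. This sketch uses only the subcommands and flags shown above; the helper name build_train_command and its parameters are hypothetical.

```python
import shlex
from typing import List, Sequence


def build_train_command(
    port: int = 8002,
    timesteps: int = 100_000,
    hiddens: Sequence[int] = (16, 16),
) -> List[str]:
    """Assemble the training command from this guide as an argument list."""
    return [
        "schola", "rllib", "train", "ppo", "editor",
        "--port", str(port),
        "--save-final-policy",
        "--export-onnx",
        "--fcnet-hiddens", *[str(h) for h in hiddens],
        "--timesteps", str(timesteps),
    ]


command = build_train_command()
print(shlex.join(command))
# To launch it, pass the list to subprocess.run(command, check=True).
```

Building the command as a list avoids shell quoting problems and makes it easy to change the port when 8002 is already in use.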

Results

For the simple Branch Selector environment, the trained policies produced deterministic behavior:

The branch selector evaluator chooses 0 when the flag is true and 1 when the flag is false.
The forward task policy consistently outputs action 1.
The backward task policy consistently outputs action -1.
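
Composing the three reported policies shows that they satisfy the goal stated at the top of this guide. The sketch below hard-codes the reported outputs as plain functions and assumes that a positive action moves the agent in the positive direction; the function names are hypothetical.

```python
def branch_policy(flag: bool) -> int:
    """Reported evaluator behavior: 0 when the flag is true, 1 otherwise."""
    return 0 if flag else 1


def task_policy(branch: int) -> float:
    """Reported task behavior: forward outputs 1, backward outputs -1."""
    return 1.0 if branch == 0 else -1.0


def movement_direction(flag: bool) -> str:
    """Chain evaluator and task; assumes action sign equals movement sign."""
    action = task_policy(branch_policy(flag))
    return "positive" if action > 0 else "negative"
```

A true flag leads to branch 0, action 1, and positive movement; a false flag leads to branch 1, action -1, and negative movement.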