schola.scripts.sb3.settings.PPOSettings

Class Definition

class schola.scripts.sb3.settings.PPOSettings(learning_rate=0.0003, n_steps=2048, batch_size=64, n_epochs=10, gamma=0.99, gae_lambda=0.95, clip_range=0.2, normalize_advantage=True, ent_coef=0.0, vf_coef=0.5, max_grad_norm=0.5, use_sde=False, sde_sample_freq=-1)

Bases: object

Dataclass for configuring the settings of the Proximal Policy Optimization (PPO) algorithm. This includes parameters for the learning process, such as learning rate, batch size, number of steps, and other hyperparameters that control the behavior of the PPO algorithm.
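
For example, the defaults above can be overridden selectively when constructing the dataclass (a minimal sketch using only the fields documented on this page):

from schola.scripts.sb3.settings import PPOSettings

# Keep the defaults everywhere except for a shorter rollout and a small entropy bonus.
settings = PPOSettings(
    n_steps=1024,
    ent_coef=0.01,
)
print(settings.learning_rate)  # 0.0003, the unchanged default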

Parameters

learning_rate (float)
n_steps (int)
batch_size (int)
n_epochs (int)
gamma (float)
gae_lambda (float)
clip_range (float)
normalize_advantage (bool)
ent_coef (float)
vf_coef (float)
max_grad_norm (float)
use_sde (bool)
sde_sample_freq (int)

Attributes

batch_size

Type: int
Default: 64

Minibatch size for each update. This is the number of timesteps used in each gradient step when training the policy. It should evenly divide the rollout buffer size (n_steps multiplied by the number of parallel environments) so that the final minibatch is not truncated.
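
For instance, with the defaults and a single environment (the single-environment count is an assumption for this sketch), each update collects 2048 transitions and splits them into 32 minibatches per epoch:

n_steps = 2048
batch_size = 64
n_envs = 1  # assumed number of parallel environments
rollout_size = n_steps * n_envs
assert rollout_size % batch_size == 0, "final minibatch would be truncated"
print(rollout_size // batch_size)  # 32 minibatches per epoch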

clip_range

Type: float
Default: 0.2

Clipping range for the policy update. The probability ratio between the new and old policies is clipped to the range [1 - clip_range, 1 + clip_range], which limits how far the new policy can move away from the old policy in a single update and helps prevent large, destabilizing updates.
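
Conceptually, the clipped surrogate objective behaves like the following sketch (illustrative only, not Schola's or Stable-Baselines3's exact implementation):

import torch

def clipped_surrogate_loss(log_prob_new, log_prob_old, advantages, clip_range=0.2):
    # Probability ratio between the new and old policies.
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # PPO maximizes the minimum of the two terms; the training loss is its negation.
    return -torch.min(unclipped, clipped).mean()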

constructor

Type: Type[PPO]

critic_type

Type: str

ent_coef

Type: float
Default: 0.0

Coefficient for the entropy term in the loss function. This encourages exploration by adding a penalty for certainty in the policy’s action distribution. A higher value will encourage more exploration, while a lower value will make the policy more deterministic. Set to 0.0 to disable entropy regularization.

gae_lambda

Type: float
Default: 0.95

Lambda parameter for Generalized Advantage Estimation (GAE). This parameter balances bias and variance in the advantage estimates: a value of 1.0 corresponds to full Monte Carlo advantage estimates (unbiased but high variance), while lower values reduce variance at the cost of some bias.
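
The interplay between gamma and gae_lambda is easiest to see in a plain-Python version of the GAE recursion (an illustrative sketch, not the library's implementation):

def gae_advantages(rewards, values, last_value, dones, gamma=0.99, gae_lambda=0.95):
    # rewards, values, dones are per-step lists; last_value bootstraps the final state.
    advantages = [0.0] * len(rewards)
    next_advantage = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        non_terminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        next_advantage = delta + gamma * gae_lambda * non_terminal * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages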

gamma

Type: float
Default: 0.99

Discount factor for future rewards. This determines how much the agent values future rewards compared to immediate rewards. A value of 0.99 means that future rewards are discounted by 1% per time step.
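
As a quick illustration, with gamma = 0.99 a reward received 100 steps in the future is weighted by 0.99 ** 100 ≈ 0.37, and the effective planning horizon is roughly 1 / (1 - gamma) = 100 steps:

gamma = 0.99
print(gamma ** 100)           # ~0.366: weight of a reward 100 steps away
print(1.0 / (1.0 - gamma))    # ~100: rough effective horizon in steps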

learning_rate

Type: float
Default: 0.0003

Learning rate for the optimizer.

max_grad_norm

Type: float
Default: 0.5

Maximum gradient norm for clipping. This is used to prevent exploding gradients by scaling down the gradients if their norm exceeds this value. This can help to stabilize training, especially in environments with high variance in the rewards or gradients.
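
Gradient clipping of this kind is typically applied just before the optimizer step, as in this self-contained sketch (illustrative; not Schola's actual training loop):

import torch

policy = torch.nn.Linear(4, 2)                              # stand-in policy network
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
loss = policy(torch.randn(8, 4)).pow(2).mean()              # dummy loss for illustration

optimizer.zero_grad()
loss.backward()
# Scale gradients down if their global norm exceeds max_grad_norm (0.5 here).
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=0.5)
optimizer.step()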

n_epochs

Type: int
Default: 10

Number of epochs to update the policy. This is the number of times the model iterates over the collected rollout data during each update. More epochs can improve convergence but may overfit to the most recently collected data.

n_steps

Type: int
Default: 2048

Number of steps to run for each environment per update. This is the number of timesteps collected before updating the policy.

name

Type: str

normalize_advantage

Type: bool
Default: True

Whether to normalize the advantages. Normalizing the advantages can help to stabilize training by ensuring that they have a mean of 0 and a standard deviation of 1. This can lead to more consistent updates to the policy.
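
Normalization is a one-line transformation applied per minibatch (sketch; the small epsilon guards against division by zero):

import numpy as np

advantages = np.array([1.5, -0.3, 0.7, 2.1])
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
print(advantages.mean(), advantages.std())  # ~0.0 and ~1.0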

sde_sample_freq

Type: int
Default: -1

Frequency at which to sample the noise matrix when using State Dependent Exploration (SDE). A value of -1 samples the noise matrix only once at the beginning of each rollout, while a positive integer n resamples it every n steps. This controls how quickly the exploration noise changes during a rollout.

use_sde

Type: bool
Default: False

Whether to use generalized State Dependent Exploration (gSDE). When set to True, exploration noise is generated as a function of the current state rather than sampled independently at every step, which can produce smoother and more consistent exploration.
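
For example, to enable gSDE and resample the exploration noise every 4 steps instead of once per rollout:

from schola.scripts.sb3.settings import PPOSettings

settings = PPOSettings(
    use_sde=True,       # state-dependent exploration noise
    sde_sample_freq=4,  # resample the noise matrix every 4 steps
)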

vf_coef

Type: float
Default: 0.5

Coefficient for the value function loss in the overall loss function. This determines how much weight is given to the value function loss compared to the policy loss. A higher value will put more emphasis on accurately estimating the value function, while a lower value will prioritize the policy update.
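
Together, ent_coef and vf_coef weight the terms of the overall PPO objective, roughly as in this sketch (illustrative, not the exact library code):

def total_loss(policy_loss, value_loss, entropy, ent_coef=0.0, vf_coef=0.5):
    # The value-function error is weighted by vf_coef, and the entropy bonus
    # (weighted by ent_coef) is subtracted to encourage exploration.
    return policy_loss + vf_coef * value_loss - ent_coef * entropy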

Methods

__init__

__init__(learning_rate=0.0003, n_steps=2048, batch_size=64, n_epochs=10, gamma=0.99, gae_lambda=0.95, clip_range=0.2, normalize_advantage=True, ent_coef=0.0, vf_coef=0.5, max_grad_norm=0.5, use_sde=False, sde_sample_freq=-1)

Return type: None
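
A hypothetical end-to-end sketch, assuming the hyperparameter fields listed above map one-to-one onto stable_baselines3.PPO keyword arguments (how Schola consumes the settings internally may differ):

import dataclasses
from stable_baselines3 import PPO
from schola.scripts.sb3.settings import PPOSettings

settings = PPOSettings(learning_rate=1e-4, n_steps=1024)
# Field names mirror the PPO constructor's keyword arguments.
model = PPO("MlpPolicy", "CartPole-v1", **dataclasses.asdict(settings))
model.learn(total_timesteps=10_000)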