schola.scripts.sb3.settings.SACSettings
- class schola.scripts.sb3.settings.SACSettings(learning_rate=0.0003, buffer_size=1000000, learning_starts=100, batch_size=256, tau=0.005, gamma=0.99, train_freq=1, gradient_steps=1, action_noise=None, replay_buffer_class=None, replay_buffer_kwargs=None, optimize_memory_usage=False, ent_coef='auto', target_update_interval=1, target_entropy='auto', use_sde=False, sde_sample_freq=-1)
-
Bases: object
Dataclass for configuring the settings of the Soft Actor-Critic (SAC) algorithm. This includes parameters for the learning process, such as learning rate, buffer size, batch size, and other hyperparameters that control the behavior of the SAC algorithm.
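Example (a minimal sketch; the overridden values are illustrative, and how the resulting settings object is consumed by the Schola SB3 launch script is not shown here)::

    from schola.scripts.sb3.settings import SACSettings

    # Override a few hyperparameters; all other fields keep their defaults.
    settings = SACSettings(
        learning_rate=1e-4,
        buffer_size=500_000,
        batch_size=512,
        ent_coef="auto",
    )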
Methods

__init__([learning_rate, buffer_size, …])

Attributes

action_noise
    Action noise to use for exploration.
batch_size
    Minibatch size for each update.
buffer_size
    Size of the replay buffer.
ent_coef
    Coefficient for the entropy term in the loss function.
gamma
    Discount factor for future rewards.
gradient_steps
    Number of gradient steps to take during each training update.
learning_rate
    Learning rate for the optimizer.
learning_starts
    Number of timesteps before learning starts.
optimize_memory_usage
    Whether to optimize memory usage for the replay buffer.
replay_buffer_class
    Class to use for the replay buffer.
replay_buffer_kwargs
    Additional keyword arguments to pass to the replay buffer constructor.
sde_sample_freq
    Frequency at which to sample the SDE noise.
target_entropy
    Target entropy for the entropy regularization.
target_update_interval
    Interval for updating the target networks.
tau
    Soft update parameter for the target networks.
train_freq
    Frequency of training the policy.
use_sde
    Whether to use State Dependent Exploration (SDE).
- Parameters:
  - learning_rate (float)
  - buffer_size (int)
  - learning_starts (int)
  - batch_size (int)
  - tau (float)
  - gamma (float)
  - train_freq (int)
  - gradient_steps (int)
  - action_noise (Any)
  - replay_buffer_class (Any)
  - replay_buffer_kwargs (dict)
  - optimize_memory_usage (bool)
  - ent_coef (Any)
  - target_update_interval (int)
  - target_entropy (Any)
  - use_sde (bool)
  - sde_sample_freq (int)
-
- __init__(learning_rate=0.0003, buffer_size=1000000, learning_starts=100, batch_size=256, tau=0.005, gamma=0.99, train_freq=1, gradient_steps=1, action_noise=None, replay_buffer_class=None, replay_buffer_kwargs=None, optimize_memory_usage=False, ent_coef='auto', target_update_interval=1, target_entropy='auto', use_sde=False, sde_sample_freq=-1)
-
- Parameters:
  - learning_rate (float)
  - buffer_size (int)
  - learning_starts (int)
  - batch_size (int)
  - tau (float)
  - gamma (float)
  - train_freq (int)
  - gradient_steps (int)
  - action_noise (Any | None)
  - replay_buffer_class (Any | None)
  - replay_buffer_kwargs (dict | None)
  - optimize_memory_usage (bool)
  - ent_coef (Any)
  - target_update_interval (int)
  - target_entropy (Any)
  - use_sde (bool)
  - sde_sample_freq (int)
- Return type:
  None
- action_noise: Any = None
-
Action noise to use for exploration. This can be a callable function or a noise process (e.g., Ornstein-Uhlenbeck) that adds noise to the actions taken by the policy to encourage exploration. This is important in continuous action spaces to help the agent explore different actions and avoid getting stuck in local optima. If set to None, no noise will be added to the actions.
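For example, Gaussian noise from Stable Baselines3 could be supplied here (a sketch; the action dimensionality of 2 is an illustrative assumption)::

    import numpy as np
    from stable_baselines3.common.noise import NormalActionNoise
    from schola.scripts.sb3.settings import SACSettings

    n_actions = 2  # assumed action dimensionality, for illustration only
    noisy_settings = SACSettings(
        action_noise=NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions)),
    )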
- batch_size: int = 256
-
Minibatch size for each update. This is the number of samples drawn from the replay buffer to perform a single update to the policy. A larger batch size can lead to more stable updates but requires more memory. Must be less than or equal to buffer_size.
- buffer_size: int = 1000000
-
Size of the replay buffer. This is the number of transitions (state, action, reward, next state) that can be stored in the buffer. A larger buffer allows for more diverse samples to be used for training, which can improve performance but also increases memory usage.
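As a rough back-of-the-envelope check (a sketch assuming float32 storage and illustrative observation/action sizes; the real buffer layout may differ), the memory footprint grows linearly with buffer_size::

    obs_dim, act_dim = 32, 4  # illustrative sizes, not taken from any specific environment
    buffer_size = 1_000_000
    # obs, next_obs, action, reward, done stored as float32 (4 bytes each)
    bytes_per_transition = 4 * (2 * obs_dim + act_dim + 2)
    print(f"~{buffer_size * bytes_per_transition / 1e9:.1f} GB")  # ~0.3 GB for these sizes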
- property critic_type: str
- ent_coef: Any = 'auto'
-
Coefficient for the entropy term in the loss function. This encourages exploration by penalizing certainty in the policy's action distribution: a higher value encourages more exploration, while a lower value makes the policy more deterministic. When set to 'auto', the coefficient is learned automatically during training so that the policy's entropy stays close to the target entropy, which helps balance exploration and exploitation.
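The coefficient can also be pinned to a constant instead of being learned (a sketch; accepting a float here mirrors Stable Baselines3's SAC and is assumed to carry through)::

    from schola.scripts.sb3.settings import SACSettings

    auto_settings = SACSettings(ent_coef="auto")  # coefficient is learned during training
    fixed_settings = SACSettings(ent_coef=0.1)    # constant weight on the entropy bonus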
- gamma: float = 0.99
-
Discount factor for future rewards. This determines how much the agent values future rewards compared to immediate rewards. A value of 0.99 means that future rewards are discounted by 1% per time step. This is important for balancing the trade-off between short-term and long-term rewards in reinforcement learning.
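A common rule of thumb is that 1 / (1 - gamma) approximates the effective planning horizon in timesteps::

    for gamma in (0.9, 0.99, 0.999):
        print(gamma, "->", round(1 / (1 - gamma)), "steps")  # 10, 100, 1000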
- gradient_steps: int = 1
-
Number of gradient steps to take during each training update. This specifies how many times to update the model parameters using the sampled minibatch from the replay buffer. A value of 1 means that the model is updated once per training step, while a higher value (e.g., 2) means that the model is updated multiple times. This can help to improve convergence but may also lead to overfitting if set too high.
- learning_rate: float = 0.0003
-
Learning rate for the optimizer. This controls how much to adjust the model parameters in response to the estimated error each time the model weights are updated. A lower value means slower learning, while a higher value means faster learning.
- learning_starts: int = 100
-
Number of timesteps before learning starts. This is the number of steps to collect in the replay buffer before the first update to the policy. This allows the agent to gather initial experience and helps to stabilize training by ensuring that there are enough samples to learn from.
- property name: str
- optimize_memory_usage: bool = False
-
Whether to optimize memory usage for the replay buffer. When set to True, it will use a more memory-efficient implementation of the replay buffer, which can help to reduce memory consumption during training. This is particularly useful when working with large environments or limited hardware resources. Note that this may slightly affect the performance of the training process, as it may introduce some overhead in accessing the samples.
- replay_buffer_class: Any = None
-
Class to use for the replay buffer. This allows for customization of the replay buffer used for training. By default, it will use the standard ReplayBuffer class provided by Stable Baselines3. However, you can specify a custom class that inherits from ReplayBuffer to implement your own functionality or behavior for storing and sampling transitions.
- replay_buffer_kwargs: dict = None
-
Additional keyword arguments to pass to the replay buffer constructor. This allows for further customization of the replay buffer’s behavior and settings when it is instantiated. For example, you can specify parameters like buffer_size, seed, or any other parameters supported by your custom replay buffer class. This can help to tailor the replay buffer to your specific needs or environment requirements.
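For example, Stable Baselines3's HerReplayBuffer could be swapped in together with its constructor arguments (a sketch; it assumes a goal-conditioned environment, which is outside the scope of these settings)::

    from stable_baselines3 import HerReplayBuffer
    from schola.scripts.sb3.settings import SACSettings

    her_settings = SACSettings(
        replay_buffer_class=HerReplayBuffer,
        replay_buffer_kwargs=dict(n_sampled_goal=4, goal_selection_strategy="future"),
    )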
- sde_sample_freq: int = -1
-
Frequency at which to sample the SDE noise. This determines how often a new noise matrix is drawn when using State Dependent Exploration (SDE). A value of -1 means the noise is sampled only at the beginning of each rollout, while a positive integer specifies the number of steps between samples. Resampling more frequently gives more varied exploration, while resampling rarely gives smoother, more consistent exploration behavior.
- target_entropy: Any = 'auto'
-
Target entropy for the entropy regularization. This sets a target for the average entropy of the policy's action distribution, which the learned entropy coefficient is adjusted to maintain. When set to 'auto', the target is computed from the action space as the negative of its dimensionality. This helps balance exploration and exploitation during training by encouraging the agent to keep taking diverse actions.
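The 'auto' heuristic follows the convention of budgeting roughly one nat of entropy per action dimension (a sketch of the computation, assuming a continuous Box action space)::

    import numpy as np

    action_shape = (6,)  # illustrative action-space shape
    target_entropy = -float(np.prod(action_shape))  # 'auto' resolves to -6.0 here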
- target_update_interval: int = 1
-
Interval for updating the target networks. This determines how often the target networks are updated with the main networks’ weights. A value of 1 means that the target networks are updated every training step, while a higher value (e.g., 2) means that they are updated every other step. This can help to control the stability of training by ensuring that the target networks are kept up-to-date with the latest policy parameters.
- tau: float = 0.005
-
Soft update parameter for the target networks. This controls how much the target networks are updated towards the main networks during training. A smaller value (e.g., 0.005) means that the target networks are updated slowly, which can help to stabilize training. This is typically a small value between 0 and 1.
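Conceptually, each target parameter is nudged toward its online counterpart by a factor of tau (a sketch of the Polyak/soft update rule, not the library's actual implementation)::

    def soft_update(online_params, target_params, tau=0.005):
        """Blend each target parameter toward its online counterpart by a factor of tau."""
        return [tau * p + (1.0 - tau) * tp for p, tp in zip(online_params, target_params)]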
- train_freq: int = 1
-
Frequency of training the policy. This determines how often the model is updated during training. A value of 1 means an update is performed every environment step, while a higher value (e.g., 2) means an update every other step. Together with gradient_steps, this controls the ratio of gradient updates to collected environment data, as shown in the sketch below.
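A sketch of two equivalent update-to-data ratios (the values are illustrative)::

    from schola.scripts.sb3.settings import SACSettings

    per_step = SACSettings(train_freq=1, gradient_steps=1)  # one update per environment step
    chunked = SACSettings(train_freq=4, gradient_steps=4)   # collect 4 steps, then do 4 updates (same ratio)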
- use_sde: bool = False
-
Whether to use State Dependent Exploration (SDE). This can help to improve exploration by adapting the exploration noise based on the current state of the environment. When set to True, it will use SDE for exploration instead of the standard exploration strategy. This can lead to more efficient exploration in complex environments, but may also introduce additional computational overhead.
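Example of enabling it together with a resampling interval (a sketch; the values are illustrative)::

    from schola.scripts.sb3.settings import SACSettings

    sde_settings = SACSettings(use_sde=True, sde_sample_freq=4)  # resample exploration noise every 4 steps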