schola.scripts.sb3.settings.SACSettings
Class Definition
class schola.scripts.sb3.settings.SACSettings(learning_rate=0.0003, buffer_size=1000000, learning_starts=100, batch_size=256, tau=0.005, gamma=0.99, train_freq=1, gradient_steps=1, action_noise=None, replay_buffer_class=None, replay_buffer_kwargs=None, optimize_memory_usage=False, ent_coef='auto', target_update_interval=1, target_entropy='auto', use_sde=False, sde_sample_freq=-1)
Bases: object
Dataclass for configuring the settings of the Soft Actor-Critic (SAC) algorithm. This includes parameters for the learning process, such as learning rate, buffer size, batch size, and other hyperparameters that control the behavior of the SAC algorithm.
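For example, a minimal sketch of constructing the dataclass with a few commonly tuned fields (the values below are illustrative, not recommendations; how the resulting object is passed to the training launcher depends on your setup):

    from schola.scripts.sb3.settings import SACSettings

    # Override a few commonly tuned hyperparameters; every other field
    # keeps the default shown in the class signature above.
    sac_settings = SACSettings(
        learning_rate=1e-4,
        buffer_size=500_000,
        batch_size=512,
        gamma=0.98,
    )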
Parameters
learning_rate
Type: float
buffer_size
Type: int
learning_starts
Type: int
batch_size
Type: int
tau
Type: float
gamma
Type: float
train_freq
Type: int
gradient_steps
Type: int
action_noise
Type: Any | None
replay_buffer_class
Type: Any | None
replay_buffer_kwargs
Type: dict | None
optimize_memory_usage
Type: bool
ent_coef
Type: Any
target_update_interval
Type: int
target_entropy
Type: Any
use_sde
Type: bool
sde_sample_freq
Type: int
Attributes
action_noise
Type: Any
Default: None
Action noise to use for exploration. This is typically a noise process object (e.g., Gaussian or Ornstein-Uhlenbeck noise) that adds noise to the actions taken by the policy to encourage exploration. This can be useful in continuous action spaces to help the agent explore different actions and avoid getting stuck in local optima, although SAC's stochastic policy usually provides sufficient exploration on its own. If set to None, no noise is added to the actions.
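A minimal sketch of supplying Gaussian action noise, assuming the settings are ultimately forwarded to a Stable Baselines3 SAC model and that the environment has a 2-dimensional continuous action space (both assumptions are for illustration only):

    import numpy as np
    from stable_baselines3.common.noise import NormalActionNoise
    from schola.scripts.sb3.settings import SACSettings

    n_actions = 2  # hypothetical action-space dimensionality
    noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))
    settings = SACSettings(action_noise=noise)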
batch_size
Type: int
Default: 256
Minibatch size for each update. This is the number of samples drawn from the replay buffer to perform a single update to the policy. A larger batch size can lead to more stable updates but requires more memory. Must be less than or equal to buffer_size.
buffer_size
Type: int
Default: 1000000
Size of the replay buffer. This is the number of transitions (state, action, reward, next state) that can be stored in the buffer. A larger buffer allows for more diverse samples to be used for training, which can improve performance but also increases memory usage.
constructor
Type: Type[SAC]
critic_type
Type: str
ent_coef
Type: Any
Default: 'auto'
Coefficient for the entropy term in the loss function. This encourages exploration by adding a penalty for certainty in the policy's action distribution. A higher value encourages more exploration, while a lower value makes the policy more deterministic. When set to 'auto', the coefficient is learned automatically during training so that the policy's entropy tracks the target entropy (see target_entropy). This helps to balance exploration and exploitation during training.
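A brief sketch of the two common ways to set this field (the fixed value below is illustrative only):

    from schola.scripts.sb3.settings import SACSettings

    auto_settings = SACSettings(ent_coef="auto")  # coefficient learned during training
    fixed_settings = SACSettings(ent_coef=0.2)    # constant entropy coefficient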
gamma
Type: float
Default: 0.99
Discount factor for future rewards. This determines how much the agent values future rewards compared to immediate rewards. A value of 0.99 means that future rewards are discounted by 1% per time step, which corresponds to an effective horizon of roughly 1/(1 - gamma) = 100 steps. This is important for balancing the trade-off between short-term and long-term rewards in reinforcement learning.
gradient_steps
Type: int
Default: 1
Number of gradient steps to take during each training update. This specifies how many times to update the model parameters using the sampled minibatch from the replay buffer. A value of 1 means that the model is updated once per training step, while a higher value (e.g., 2) means that the model is updated multiple times. This can help to improve convergence but may also lead to overfitting if set too high.
learning_rate
Type: float
Default: 0.0003
Learning rate for the optimizer. This controls how much to adjust the model parameters in response to the estimated error each time the model weights are updated. A lower value means slower learning, while a higher value means faster learning.
learning_starts
Type: int
Default: 100
Number of timesteps before learning starts. This is the number of steps to collect in the replay buffer before the first update to the policy. This allows the agent to gather initial experience and helps to stabilize training by ensuring that there are enough samples to learn from.
name
Type: str
optimize_memory_usage
Type: bool
Default: False
Whether to optimize memory usage for the replay buffer. When set to True, it will use a more memory-efficient implementation of the replay buffer, which can help to reduce memory consumption during training. This is particularly useful when working with large environments or limited hardware resources. Note that this may slightly affect the performance of the training process, as it may introduce some overhead in accessing the samples.
replay_buffer_class
Type: Any
Default: None
Class to use for the replay buffer. This allows for customization of the replay buffer used for training. By default, it will use the standard ReplayBuffer class provided by Stable Baselines3. However, you can specify a custom class that inherits from ReplayBuffer to implement your own functionality or behavior for storing and sampling transitions.
replay_buffer_kwargs
Type: dict
Default: None
Additional keyword arguments to pass to the replay buffer constructor. This allows for further customization of the replay buffer's behavior and settings when it is instantiated. For example, you can pass any keyword arguments supported by the chosen replay buffer class, which can help to tailor the buffer to your specific environment or hardware requirements.
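A minimal sketch of customizing the buffer, assuming the class and keyword arguments are forwarded unchanged to the Stable Baselines3 SAC constructor:

    from stable_baselines3.common.buffers import ReplayBuffer
    from schola.scripts.sb3.settings import SACSettings

    settings = SACSettings(
        replay_buffer_class=ReplayBuffer,  # the SB3 default buffer, named explicitly here
        replay_buffer_kwargs={"handle_timeout_termination": False},  # an SB3 ReplayBuffer option
    )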
sde_sample_freq
Type: int
Default: -1
Frequency at which to sample the gSDE noise matrix. This determines how often a new noise matrix is sampled when using generalized State Dependent Exploration (gSDE). A value of -1 means the noise is sampled only once at the beginning of each rollout, while a positive integer n means it is resampled every n steps. More frequent resampling yields noisier, more varied exploration, while less frequent resampling gives smoother, more temporally consistent exploration.
target_entropy
Type: Any
Default: 'auto'
Target entropy for the entropy regularization. This is used to encourage exploration by setting a target for the average entropy of the actions taken by the policy. When set to 'auto', the target entropy is derived from the action space, specifically the negative of the number of action dimensions (e.g., -6 for a 6-dimensional continuous action space). This helps to balance exploration and exploitation during training by encouraging the agent to keep its action distribution suitably stochastic.
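As a worked example of the 'auto' heuristic (this mirrors the formula Stable Baselines3 uses; the action dimensionality below is hypothetical):

    import numpy as np

    action_shape = (6,)  # e.g., a 6-dimensional continuous (Box) action space
    target_entropy = -float(np.prod(action_shape))  # 'auto' heuristic -> -6.0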
target_update_interval
Type: int
Default: 1
Interval (in gradient steps) between updates of the target networks. This determines how often the target networks are moved toward the main networks' weights using the soft update controlled by tau. A value of 1 means the target networks are updated after every gradient step, while a higher value (e.g., 2) means they are updated every other step. Less frequent updates can improve training stability at the cost of the targets lagging further behind the latest policy parameters.
tau
Type: float
Default: 0.005
Soft update parameter for the target networks. This controls how much the target networks are updated towards the main networks during training. A smaller value (e.g., 0.005) means that the target networks are updated slowly, which can help to stabilize training. This is typically a small value between 0 and 1.
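The soft (Polyak) update that tau controls can be sketched as follows; this is illustrative pseudocode of the update rule, not the library's internal implementation:

    def soft_update(online_params, target_params, tau=0.005):
        # Each target parameter moves a small fraction tau toward its online
        # counterpart: theta_target = tau * theta_online + (1 - tau) * theta_target
        return [tau * p + (1.0 - tau) * tp for p, tp in zip(online_params, target_params)]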
train_freq
Type: int
Default: 1
Frequency of training the policy, in environment steps. This determines how often the model is updated during training. A value of 1 means a training update is performed every environment step, while a higher value (e.g., 2) means updates are performed every other step. Together with gradient_steps, this controls the trade-off between collecting new experience and performing gradient updates.
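train_freq and gradient_steps are usually tuned together; a short sketch of two common configurations (values are illustrative):

    from schola.scripts.sb3.settings import SACSettings

    # One gradient step per environment step (the defaults).
    per_step = SACSettings(train_freq=1, gradient_steps=1)

    # Collect 64 environment steps, then perform 64 gradient steps in a row.
    batched = SACSettings(train_freq=64, gradient_steps=64)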
use_sde
Type: bool
Default: False
Whether to use generalized State Dependent Exploration (gSDE). This can help to improve exploration by adapting the exploration noise based on the current state of the environment. When set to True, gSDE is used for exploration instead of sampling independent noise at every step. This can lead to smoother, more efficient exploration in continuous control environments, but may also introduce additional computational overhead.
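A short sketch of enabling gSDE together with sde_sample_freq (values are illustrative):

    from schola.scripts.sb3.settings import SACSettings

    settings = SACSettings(
        use_sde=True,       # use generalized State Dependent Exploration
        sde_sample_freq=4,  # resample the exploration noise matrix every 4 steps
    )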
Methods
__init__
__init__(learning_rate=0.0003, buffer_size=1000000, learning_starts=100, batch_size=256, tau=0.005, gamma=0.99, train_freq=1, gradient_steps=1, action_noise=None, replay_buffer_class=None, replay_buffer_kwargs=None, optimize_memory_usage=False, ent_coef='auto', target_update_interval=1, target_entropy='auto', use_sde=False, sde_sample_freq=-1)
Return type: None