Summary

(State-Action-Reward-State-Action)

SARSA is another value-based reinforcement learning algorithm. It differs from Q-learning in that it updates Q-values using the action the policy actually takes in the next state, rather than the maximizing (greedy) action.

SARSA update rule:

Q(s, a) ← Q(s, a) + α [ r + γ Q(s′, a′) − Q(s, a) ]

Explanation:

  • Q(s, a): The Q-value of the current state s and action a.
  • α: The learning rate, determining how much new information overrides old information.
  • r: The reward received after taking action a from state s.
  • γ: The discount factor, balancing immediate and future rewards.
  • Q(s′, a′): The Q-value for the next state s′ and the action a′ actually taken according to the policy.
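
For concreteness, below is a minimal Python sketch of this update rule on a small, made-up 5-state chain environment; the environment, the hyperparameters (alpha, gamma, epsilon), and the epsilon-greedy behavior policy are illustrative assumptions rather than anything specified above.

```python
import numpy as np

# Toy chain environment (hypothetical): states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 yields reward +1 and ends the episode.
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def step(s, a):
    """Deterministic transition along the chain."""
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    done = s_next == n_states - 1
    return s_next, reward, done

def epsilon_greedy(s):
    """Behavior policy: explore with probability epsilon, otherwise act greedily (ties broken at random)."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))

for episode in range(500):
    s = 0
    a = epsilon_greedy(s)                  # first action chosen by the policy
    done = False
    while not done:
        s_next, r, done = step(s, a)
        a_next = epsilon_greedy(s_next)    # action actually taken next (the second "A" in SARSA)
        # On-policy target: bootstrap from Q(s', a'), the action the policy takes, not the max.
        target = r if done else r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next

print(np.round(Q, 2))   # Q-values for moving right approach the gamma-discounted return of reaching state 4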

Notes:

  • SARSA’s on-policy nature ensures that it learns a policy that aligns with its exploration strategy, leading to more stable behavior in environments with randomness or noise.
  • Learning may be slower than with Q-learning, but SARSA can be more robust in settings where the agent’s behavior during training must stay close to the policy it is learning, since exploratory actions are reflected in the learned values (see the comparison sketch below).
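
Because these notes hinge on the on-policy/off-policy distinction, the short sketch below contrasts the SARSA target with the Q-learning target on hypothetical Q-values; the numbers and the choice of a_next are made up purely for illustration.

```python
import numpy as np

# Hypothetical Q-table with two states and three actions (illustrative values only).
Q = np.array([[0.0, 1.0, 0.5],
              [0.2, 0.0, 0.8]])
r, gamma = 1.0, 0.9
s_next = 1
a_next = 0   # suppose the epsilon-greedy behavior policy happened to pick action 0

# SARSA (on-policy): bootstrap from the action actually taken in s_next.
sarsa_target = r + gamma * Q[s_next, a_next]        # 1.0 + 0.9 * 0.2 = 1.18

# Q-learning (off-policy): bootstrap from the greedy action in s_next, regardless of what was taken.
q_learning_target = r + gamma * np.max(Q[s_next])   # 1.0 + 0.9 * 0.8 = 1.72

print(sarsa_target, q_learning_target)
```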