• On-Policy vs. Off-Policy: Unlike Q-learning, which is off-policy and updates toward the highest estimated action value in the next state regardless of what the agent does next, SARSA is on-policy and updates using the action the agent actually takes in the next state.
• Conservatism: Because its updates account for the exploratory actions the agent actually takes, SARSA tends to learn more conservative policies than Q-learning, which makes it a good fit for environments where exploratory mistakes are costly.
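
To make the difference concrete, here is a minimal sketch of the two tabular update rules side by side. The function names, the list-of-lists Q-table layout, and the default values for `alpha` (learning rate) and `gamma` (discount factor) are assumptions chosen for illustration, not taken from any particular library.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy update: bootstrap from the action actually taken next."""
    # q is an assumed Q-table indexed as q[state][action].
    target = r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])


def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy update: bootstrap from the best action in the next state."""
    # The max ignores which action the agent will really take in s_next.
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


if __name__ == "__main__":
    # Two states, two actions; state 1 has one poor and one good action.
    q_sarsa = [[0.0, 0.0], [1.0, 5.0]]
    q_qlearn = [[0.0, 0.0], [1.0, 5.0]]

    # Suppose exploration makes the agent pick the poor action (0) in state 1.
    sarsa_update(q_sarsa, s=0, a=0, r=0.0, s_next=1, a_next=0, alpha=0.5, gamma=0.9)
    q_learning_update(q_qlearn, s=0, a=0, r=0.0, s_next=1, alpha=0.5, gamma=0.9)

    print("SARSA:     ", q_sarsa[0][0])
    print("Q-learning:", q_qlearn[0][0])
```

In this toy case the SARSA estimate ends up lower than the Q-learning one, because SARSA is charged for the exploratory action the agent really took while Q-learning assumes the best action will follow. That gap is the source of SARSA's more cautious behaviour.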

In reinforcement learning, how do we optimise policies?