Q-learning is a value-based, model-free RL algorithm in which the agent learns optimal Q-values from the rewards it receives and derives its policy by acting greedily on them. It is particularly useful in discrete environments such as grid worlds.
Q-learning update rule:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$
Explanation:
- $Q(s, a)$: The Q-value of the current state $s$ and action $a$.
- $\alpha$: The learning rate, determining how much new information overrides old information.
- $r$: The reward received after taking action $a$ from state $s$.
- $\gamma$: The discount factor, balancing immediate and future rewards.
- $\max_{a'} Q(s', a')$: The maximum Q-value for the next state $s'$ across all possible actions $a'$.
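As a concrete illustration, here is a minimal sketch of this update as a single function operating on a NumPy Q-table. The function name `q_update` and the default values of `alpha` and `gamma` are illustrative assumptions, not part of any specific library.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to a tabular Q (shape: [n_states, n_actions])."""
    td_target = r + gamma * np.max(Q[s_next])  # r + gamma * max_a' Q(s', a')
    td_error = td_target - Q[s, a]             # gap between target and current estimate
    Q[s, a] += alpha * td_error                # move Q(s, a) a step toward the target
    return Q
```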
Notes:
- Q-learning is well-suited for environments where the state and action spaces are discrete and manageable in size.
- The algorithm converges to the optimal Q-values, and hence the optimal policy, even in stochastic (non-deterministic) environments, provided every state-action pair is visited sufficiently often and the learning rate is decayed appropriately.
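To make the notes above concrete, here is a small self-contained sketch that trains a tabular Q-learner with epsilon-greedy exploration on a hypothetical five-cell chain. The environment, reward scheme, and hyperparameters are illustrative assumptions, not a standard benchmark.

```python
import numpy as np

n_states, n_actions = 5, 2           # states 0..4; actions: 0 = left, 1 = right
goal = n_states - 1                  # reaching the rightmost cell yields reward 1

def step(s, a):
    """Move left or right along the chain; the episode ends at the goal cell."""
    s_next = max(0, s - 1) if a == 0 else min(goal, s + 1)
    reward = 1.0 if s_next == goal else 0.0
    return s_next, reward, s_next == goal

Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2
rng = np.random.default_rng(0)

for episode in range(300):
    s, done = 0, False
    while not done:
        # Epsilon-greedy exploration: random action with probability epsilon.
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Q-learning update; no bootstrapping from terminal states.
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print(Q)  # the greedy action in each non-goal state should come out as 1 (move right)
```

Because every state-action pair keeps being tried through the epsilon-greedy randomness, the learned Q-values approach their optimal values and the greedy policy recovers the shortest path to the goal.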