Building a Chess Analysis App (Part 1) - Exploring the Chess.com API with Python
Introduction
As part of my Streamlit Chess Analysis project, one of the first challenges was fetching and processing game data from Chess.com. While Chess.com provides a Published-Data API, working with it directly presents several challenges: navigating monthly archive pagination, handling edge cases in game timing data, and structuring JSON responses for analysis.
This post explores how I built the ChessDataFetcher class to handle these challenges. I’ll use real data from my account as a working example to demonstrate the system in action.
The Chess.com API: Structure and Challenges
Chess.com organises game data into monthly archives. The ChessDataFetcher class encapsulates this pipeline:
Chess.com API -> Fetch Games -> Validate Data -> Process -> DataFrame
Architecture: From API to Analysis
The fetcher provides three key capabilities:
1. Archive Discovery
The first step is discovering what data exists:
from src.data_fetcher import ChessDataFetcher
fetcher = ChessDataFetcher()
archives = fetcher.get_available_archives("RhysLWells")
This queries https://api.chess.com/pub/player/{username}/games/archives and returns monthly archive URLs:
| Month | Archive URL |
|---|---|
| 2025-09 | https://api.chess.com/pub/player/rhyslwells/games/2025/09 |
| 2025-10 | https://api.chess.com/pub/player/rhyslwells/games/2025/10 |
| 2025-11 | https://api.chess.com/pub/player/rhyslwells/games/2025/11 |
| 2025-12 | https://api.chess.com/pub/player/rhyslwells/games/2025/12 |
| 2026-01 | https://api.chess.com/pub/player/rhyslwells/games/2026/01 |
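A minimal sketch of what archive discovery might look like, not the actual ChessDataFetcher code. It assumes the endpoint responds with a JSON object holding an `archives` list of URLs, and it uses the standard library's urllib rather than requests to stay dependency-free. The names `http_get_json`, `list_archives` and `archive_month` are hypothetical.

```python
import json
import urllib.request
from typing import Callable, Dict, List

API_BASE = "https://api.chess.com/pub/player"
# User-Agent string copied from the post; the API asks clients to identify themselves.
HEADERS = {"User-Agent": "Chess Analysis Dashboard (Python/requests)"}


def http_get_json(url: str) -> Dict:
    """Fetch a URL and decode its JSON body, sending the User-Agent header."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def list_archives(
    username: str, get_json: Callable[[str], Dict] = http_get_json
) -> List[str]:
    """Return the monthly archive URLs for a player (hypothetical helper)."""
    url = f"{API_BASE}/{username.lower()}/games/archives"
    # Assumed response shape: {"archives": ["https://.../games/2025/09", ...]}
    return get_json(url).get("archives", [])


def archive_month(archive_url: str) -> str:
    """Turn '.../games/2025/09' into '2025-09'."""
    year, month = archive_url.rstrip("/").split("/")[-2:]
    return f"{year}-{month}"
```

Passing the HTTP function in as `get_json` keeps the parsing logic testable without touching the network.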
2. Bulk Fetching
Rather than manually requesting each archive, fetch_all_games() automates the process:
all_games = fetcher.fetch_all_games("RhysLWells", limit_months=12)
This method:
- Calls get_available_archives() and iterates through each monthly archive
- Implements rate limiting (0.5s delay between requests)
- Returns a complete list of game dictionaries
Running this on my account fetched 264 games across 123 days (2025-09-15 to 2026-01-16). The API returns each game as JSON, with metadata that includes the time control classification (blitz, bullet, rapid), player ratings, results, and PGN data.
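The loop behind bulk fetching can be sketched roughly as follows. This is an illustration under assumptions, not the real method: it assumes each monthly archive responds with a JSON object holding a `games` list, that the archive URLs arrive in chronological order, and that `limit_months` means "the most recent N months". The `get_json` parameter is a hypothetical injectable HTTP function.

```python
import time
from typing import Callable, Dict, List, Optional


def fetch_all_games(
    archive_urls: List[str],
    get_json: Callable[[str], Dict],
    limit_months: Optional[int] = None,
    delay_seconds: float = 0.5,
) -> List[Dict]:
    """Collect the games from the last `limit_months` archives (all if None)."""
    # Assumption: archive_urls is ordered oldest to newest.
    selected = archive_urls[-limit_months:] if limit_months else archive_urls
    games: List[Dict] = []
    for index, url in enumerate(selected):
        if index > 0:
            # Pause between requests; no need to wait before the first one.
            time.sleep(delay_seconds)
        # Assumed response shape: {"games": [{...}, {...}]}
        games.extend(get_json(url).get("games", []))
    return games
```

With five monthly archives and a 0.5s delay, the pauses alone add about two seconds to the run.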
3. Data Validation
Raw API data isn’t always clean. During development, I discovered games with negative durations (derived from PGNs). This would corrupt any time-based analysis. I found three main edge cases causing this:
- Midnight Crossings: Game starts at 23:50:00, ends at 00:10:00
- Timezone Issues: Durations > 24 hours
- Missing Data: No start time in PGN headers
The validate_game_durations() method addresses these issues.
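To make the three edge cases concrete, here is one way the checks could be written. It is a sketch under assumptions rather than the body of validate_game_durations(): it assumes start and end times are available as `HH:MM:SS` strings taken from the PGN headers, and the function names and the 24-hour cap are illustrative choices.

```python
from datetime import datetime, timedelta
from typing import Optional

TIME_FORMAT = "%H:%M:%S"
ONE_DAY = timedelta(days=1)


def duration_seconds(start_time: Optional[str], end_time: Optional[str]) -> Optional[int]:
    """Seconds between two clock times, or None when either is unusable."""
    if not start_time or not end_time:
        return None  # missing header: leave the duration unknown
    try:
        start = datetime.strptime(start_time, TIME_FORMAT)
        end = datetime.strptime(end_time, TIME_FORMAT)
    except ValueError:
        return None  # malformed header
    delta = end - start
    if delta < timedelta(0):
        delta += ONE_DAY  # the game ran past midnight
    return int(delta.total_seconds())


def is_plausible(seconds: Optional[int], max_seconds: int = 24 * 60 * 60) -> bool:
    """Reject unknown, negative, or longer-than-a-day durations."""
    return seconds is not None and 0 <= seconds <= max_seconds
```

For the midnight example above, `duration_seconds("23:50:00", "00:10:00")` gives 1200 seconds (20 minutes) instead of a negative value.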
Processing: From JSON to DataFrame
The final step is to turn the validated API responses into a structured format suitable for analysis. The process_and_save() method does this and returns a pandas DataFrame:
df = fetcher.process_and_save("RhysLWells", all_games, mode='json')
The process_and_save() method performs the following processing steps:
- Parse each game’s metadata (ratings, results, openings, ECO codes)
- Calculate results from user’s perspective (win/loss/draw)
- Validate game durations using the logic above
- Remove duplicates (based on timestamp + opponent)
It then builds a DataFrame with the following columns:
| Category | Fields |
|---|---|
| Temporal | date, timestamp |
| Players | user_color, opponent, user_rating, opponent_rating |
| Results | result, result_label |
| Opening | opening, eco, eco_url |
| Metadata | time_control, game_url, game_duration_seconds, pgn |
Performance: Processing 264 games takes less than 15 seconds, limited primarily by API rate constraints.
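Two of the processing steps listed above, computing the result from the user's perspective and removing duplicates, can be sketched in plain Python. This is illustrative only. It assumes each game dictionary carries `white` and `black` objects with `username` and `result` fields plus an `end_time` timestamp, and that `result` holds the raw API code while `result_label` holds win/loss/draw. The list of draw codes is also an assumption.

```python
from typing import Dict, Iterable, List

# Result codes treated as draws (assumed list, not taken from the post).
DRAW_CODES = {
    "agreed", "repetition", "stalemate",
    "insufficient", "50move", "timevsinsufficient",
}


def user_result(game: Dict, username: str) -> Dict[str, str]:
    """Describe one game from the user's side of the board."""
    is_white = game["white"]["username"].lower() == username.lower()
    me, other = (
        (game["white"], game["black"]) if is_white else (game["black"], game["white"])
    )
    code = me["result"]
    if code == "win":
        label = "win"
    elif code in DRAW_CODES:
        label = "draw"
    else:
        label = "loss"  # e.g. checkmated, resigned, timeout
    return {
        "user_color": "white" if is_white else "black",
        "opponent": other["username"],
        "result": code,
        "result_label": label,
    }


def drop_duplicates(games: Iterable[Dict], username: str) -> List[Dict]:
    """Keep the first game seen for each (timestamp, opponent) pair."""
    seen = set()
    unique = []
    for game in games:
        key = (game["end_time"], user_result(game, username)["opponent"].lower())
        if key not in seen:
            seen.add(key)
            unique.append(game)
    return unique
```

In a pandas pipeline the same de-duplication would typically be a single `drop_duplicates` call on the timestamp and opponent columns.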
Initial Insights
With clean, validated data, we can extract interesting patterns:
Opening Repertoire
From the opening data, we see the most frequently played openings:
| Opening | Games | % |
|---|---|---|
| Nimzowitsch Defense 2.d4 e6 3.Nf3 d5 4.e5 | 15 | 5.7% |
| French Defense Knight Variation | 10 | 3.8% |
| Philidor Defense 3.Bc4 | 8 | 3.0% |
| French Defense Advance Variation | 8 | 3.0% |
| Italian Game | 8 | 3.0% |
The French Defense appears across multiple variations, so it is clearly a key part of my repertoire.
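A table like the one above could be produced along these lines. The sketch assumes the opening name is read from an `ECOUrl` header in the PGN, with the last path segment used as the name, and that the percentage is each opening's share of all games. Both helper names are hypothetical.

```python
import re
from collections import Counter
from typing import List, Optional, Sequence, Tuple

# Assumed PGN header: [ECOUrl "https://www.chess.com/openings/Italian-Game"]
ECO_URL_PATTERN = re.compile(r'\[ECOUrl "([^"]+)"\]')


def opening_name(pgn: str) -> Optional[str]:
    """Derive a readable opening name from the ECOUrl header, if present."""
    match = ECO_URL_PATTERN.search(pgn)
    if match is None:
        return None
    slug = match.group(1).rstrip("/").rsplit("/", 1)[-1]
    return slug.replace("-", " ")


def top_openings(
    names: Sequence[Optional[str]], n: int = 5
) -> List[Tuple[str, int, float]]:
    """Return (opening, games, percent of all games) for the n most common."""
    total = len(names)
    if total == 0:
        return []
    counts = Counter(name for name in names if name)
    return [
        (name, count, round(100 * count / total, 1))
        for name, count in counts.most_common(n)
    ]
```

Leaving the slug's casing untouched avoids turning pawn moves such as `d4` into `D4`, which is what a blanket title-case conversion would do.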
Performance by Color
From the results data, we can summarise performance by color:
| Color | Win Rate | W | L | D | Total |
|---|---|---|---|---|---|
| White | 45.1% | 56 | 69 | 7 | 132 |
| Black | 49.6% | 63 | 64 | 5 | 132 |
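A summary of this shape could be computed as sketched below. It assumes each processed row exposes the `user_color` and `result_label` columns listed earlier, and it defines win rate as wins divided by all games played with that color. The post does not state which denominator it uses, so treat that definition as an assumption.

```python
from collections import Counter
from typing import Dict, Iterable


def summarise_by_color(rows: Iterable[Dict[str, str]]) -> Dict[str, Dict[str, float]]:
    """Tally wins, losses and draws per color and derive a win rate."""
    tallies: Dict[str, Counter] = {}
    for row in rows:
        tallies.setdefault(row["user_color"], Counter())[row["result_label"]] += 1

    summary: Dict[str, Dict[str, float]] = {}
    for color, tally in tallies.items():
        total = sum(tally.values())
        summary[color] = {
            "win": tally["win"],
            "loss": tally["loss"],
            "draw": tally["draw"],
            "total": total,
            # Assumed definition: wins as a percentage of all games.
            "win_rate": round(100 * tally["win"] / total, 1),
        }
    return summary
```

Excluding draws from the denominator is an equally common convention and would give slightly higher percentages.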
Design Principles
Two key decisions I made to ensure reliability were:
API Etiquette: We identify the application via User-Agent headers:
HEADERS = {"User-Agent": "Chess Analysis Dashboard (Python/requests)"}
This helps Chess.com track API usage and contact developers if issues arise.
Rate Limiting: A 0.5-second delay between requests prevents overwhelming the API
time.sleep(0.5) # Respect API constraints
This avoids potential IP bans while maintaining reasonable fetch times (264 games in ~15 seconds).
Conclusion
The ChessDataFetcher class hides the complexity of working with the Chess.com API by handling archive discovery, rate limiting, data validation, and structuring automatically. With this data in hand, I can proceed to further analysis and prediction in the chess analysis dashboard.