Design patterns and parallelization strategies for Python API scrapers — from a naive loop to async I/O and multi-core execution — plus how to benchmark each step.
Author
Jonathan Pearce
Published
March 2, 2026
1 Introduction
Public APIs are a goldmine for data collection. Two good examples are VIA Rail’s real-time train data (allData.json) and the MaxSold auction item catalogue (msapi/auctions/items). Both return structured JSON you can consume directly without HTML parsing, making them ideal targets for a well-engineered scraper.
This article walks through the full design journey:
Start with a naive synchronous scraper that is easy to reason about.
Layer in CPU parallelism (concurrent.futures) for CPU-bound post-processing.
Benchmark each stage so you know where your bottleneck actually is.
Decide whether to store raw JSON blobs or normalised tabular data.
Approximate read time: 10 minutes. Code is written for Python 3.10+.
2 Design Considerations Before You Write a Line of Code
2.1 Choose the right libraries
| Concern | Recommended library | Notes |
|---|---|---|
| Synchronous HTTP | requests | Simple, battle-tested |
| Async HTTP | aiohttp | Pairs with asyncio; generally faster than httpx for pure-async workloads |
| Async HTTP (alt) | httpx | Drop-in requests API with async support |
| CPU parallelism | concurrent.futures | ProcessPoolExecutor is stdlib and straightforward |
| Rate-limit awareness | tenacity | Retry with exponential back-off |
| Data wrangling | pandas / polars | polars is faster for large normalised tables |
2.2 Respect API rate limits
Most free APIs throttle by IP or token. Before looping over thousands of IDs:
Read the API docs for stated limits (requests/minute, concurrent connections).
Add a Retry-After header handler so you back off automatically on HTTP 429.
Keep a semaphore (asyncio.Semaphore) to cap your own concurrency — don’t rely solely on the server to push back.
2.3 Raw JSON vs. tabular storage
Storing raw JSON gives you a replayable source of truth; you can re-derive any schema without hitting the API again. The trade-off is storage and query complexity.
A practical middle ground: save the raw JSON alongside a normalised Parquet file. Tools like pandas.json_normalize and polars make the transformation cheap.
3 Stage 1 — The Naive Synchronous Scraper
Start here. It is easy to debug and gives you a timing baseline.
Synchronous scraper (baseline)
```python
import json
import time
from pathlib import Path

import requests

AUCTION_IDS = [103293, 103294, 103295, 103296, 103297]
BASE_URL = "https://maxsold.maxsold.com/msapi/auctions/items"
OUT_DIR = Path("data/raw")
OUT_DIR.mkdir(parents=True, exist_ok=True)


def fetch_auction(auction_id: int) -> dict:
    """Fetch items for a single auction and return the JSON payload."""
    resp = requests.get(BASE_URL, params={"auctionid": auction_id}, timeout=10)
    resp.raise_for_status()
    return resp.json()


def run_naive():
    start = time.perf_counter()
    results = {}
    for aid in AUCTION_IDS:
        results[aid] = fetch_auction(aid)
        (OUT_DIR / f"{aid}.json").write_text(json.dumps(results[aid]))
    elapsed = time.perf_counter() - start
    print(f"Naive: fetched {len(AUCTION_IDS)} auctions in {elapsed:.2f}s")
    return results


if __name__ == "__main__":
    run_naive()
```
Note
time.perf_counter() measures wall-clock time (real elapsed time). Use time.process_time() to measure only CPU time consumed by your own process. For I/O-heavy scrapers, wall time is what matters; for CPU-heavy post-processing, process time helps isolate computation cost.
4 Stage 2 — Async I/O with asyncio and aiohttp
Network requests spend most of their time waiting. asyncio lets a single thread juggle hundreds of pending requests by switching tasks while one is waiting for a response.
Async scraper with semaphore rate-limiting
```python
import asyncio
import json
import time
from pathlib import Path

import aiohttp

AUCTION_IDS = [103293, 103294, 103295, 103296, 103297]
BASE_URL = "https://maxsold.maxsold.com/msapi/auctions/items"
OUT_DIR = Path("data/raw")
OUT_DIR.mkdir(parents=True, exist_ok=True)
MAX_CONCURRENT = 5  # tune to stay within API limits


async def fetch_auction_async(
    session: aiohttp.ClientSession,
    sem: asyncio.Semaphore,
    auction_id: int,
) -> tuple[int, dict]:
    async with sem:  # blocks if MAX_CONCURRENT tasks are already running
        async with session.get(BASE_URL, params={"auctionid": auction_id}) as resp:
            resp.raise_for_status()
            data = await resp.json()
    return auction_id, data


async def run_async():
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    start = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_auction_async(session, sem, aid) for aid in AUCTION_IDS]
        pairs = await asyncio.gather(*tasks, return_exceptions=False)
    for aid, data in pairs:
        (OUT_DIR / f"{aid}.json").write_text(json.dumps(data))
    elapsed = time.perf_counter() - start
    print(f"Async: fetched {len(AUCTION_IDS)} auctions in {elapsed:.2f}s")
    return dict(pairs)


if __name__ == "__main__":
    asyncio.run(run_async())
```
Expected speedup: roughly linear with the number of concurrent requests, up to the API’s connection limit. A synchronous scraper that takes 10 s for 10 URLs often completes in ~1–2 s with async — the network latency that used to stack up now overlaps.
5 Stage 3 — CPU Parallelism with concurrent.futures
Async I/O solves the waiting problem. If your post-processing step (JSON normalisation, deduplication, feature engineering) is CPU-intensive, you need actual parallel execution across CPU cores. ProcessPoolExecutor spawns worker processes, bypassing Python’s Global Interpreter Lock (GIL).
CPU-parallel post-processing with ProcessPoolExecutor
```python
import json
import time
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

import pandas as pd

RAW_DIR = Path("data/raw")
OUT_PARQUET = Path("data/processed/auctions.parquet")
OUT_PARQUET.parent.mkdir(parents=True, exist_ok=True)


def normalise_file(path: Path) -> pd.DataFrame:
    """Load one raw JSON file and return a flat DataFrame."""
    payload = json.loads(path.read_text())
    # json_normalize handles nested dicts; 'record_path' depends on actual API shape
    df = pd.json_normalize(payload if isinstance(payload, list) else [payload])
    df["_source_file"] = path.stem
    return df


def run_parallel_normalise():
    json_files = list(RAW_DIR.glob("*.json"))
    start = time.perf_counter()
    frames = []
    with ProcessPoolExecutor() as executor:
        futures = {executor.submit(normalise_file, p): p for p in json_files}
        for future in as_completed(futures):
            frames.append(future.result())
    combined = pd.concat(frames, ignore_index=True)
    combined.to_parquet(OUT_PARQUET)
    elapsed = time.perf_counter() - start
    print(f"Parallel normalise: {len(json_files)} files in {elapsed:.2f}s")
    return combined


if __name__ == "__main__":
    run_parallel_normalise()
```
5.1 Combining async fetch + parallel processing
The two techniques compose naturally: fetch async, process in parallel.
```mermaid
flowchart LR
    A[Auction ID list] -->|asyncio.gather| B[Async HTTP fetches\naiohttp + Semaphore]
    B -->|raw JSON blobs| C[Disk / memory]
    C -->|ProcessPoolExecutor| D[CPU workers\nnormalise & clean]
    D --> E[Parquet / CSV output]
```
Warning: Avoid mixing async and multiprocessing carelessly
ProcessPoolExecutor spawns separate worker processes, each with its own memory space, and everything sent to a worker must be picklable. Do not pass aiohttp.ClientSession objects across process boundaries — they are not picklable. Fetch in async, then hand raw bytes/dicts to the process pool.
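Put together, the composition might look like this sketch, which reuses the shapes of the Stage 2 and Stage 3 examples (the function names here are illustrative, and only plain tuples of ints and dicts cross the process boundary):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

import aiohttp
import pandas as pd

BASE_URL = "https://maxsold.maxsold.com/msapi/auctions/items"


def normalise_payload(aid_payload: tuple[int, dict]) -> pd.DataFrame:
    """Runs in a worker process; receives plain dicts, never sessions."""
    aid, payload = aid_payload
    df = pd.json_normalize(payload if isinstance(payload, list) else [payload])
    df["_auction_id"] = aid
    return df


async def fetch_all(auction_ids: list[int], max_concurrent: int = 5) -> list[tuple[int, dict]]:
    sem = asyncio.Semaphore(max_concurrent)

    async def fetch(session: aiohttp.ClientSession, aid: int) -> tuple[int, dict]:
        async with sem:
            async with session.get(BASE_URL, params={"auctionid": aid}) as resp:
                resp.raise_for_status()
                return aid, await resp.json()

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, a) for a in auction_ids))


def pipeline(auction_ids: list[int]) -> pd.DataFrame:
    pairs = asyncio.run(fetch_all(auction_ids))  # async I/O in the main process
    with ProcessPoolExecutor() as pool:          # CPU work in worker processes
        frames = list(pool.map(normalise_payload, pairs))
    return pd.concat(frames, ignore_index=True)
```

The session lives and dies inside `fetch_all`; by the time the pool starts, only picklable data remains.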
6 Measuring Performance
6.1 Wall time vs. CPU time
Comparing wall time and CPU time
```python
import time


def measure(fn, *args, **kwargs):
    """Run fn and report wall time and CPU time."""
    t_wall_start = time.perf_counter()
    t_cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    wall = time.perf_counter() - t_wall_start
    cpu = time.process_time() - t_cpu_start
    print(f"Wall time : {wall:.3f}s")
    print(f"CPU time  : {cpu:.3f}s")
    print(f"I/O ratio : {1 - cpu / wall:.1%} of wall time was I/O wait")
    return result
```
A high I/O ratio (> 80 %) means async will help most. A low I/O ratio means CPU parallelism is the lever to pull.
6.2 Checking CPU core utilisation
Monitor CPU utilisation with psutil
```python
import threading

import psutil


def monitor_cpu(interval: float = 0.5):
    """Print per-core CPU % at regular intervals until the caller stops it."""
    stop_event = threading.Event()

    def _poll():
        while not stop_event.is_set():
            percents = psutil.cpu_percent(interval=interval, percpu=True)
            print("Cores:", [f"{p:5.1f}%" for p in percents])

    t = threading.Thread(target=_poll, daemon=True)
    t.start()
    return stop_event  # caller calls stop_event.set() to stop


# Usage:
# stop = monitor_cpu()
# run_parallel_normalise()
# stop.set()
```
If you see only one or two cores active during ProcessPoolExecutor work, check that your task granularity is large enough to justify the IPC overhead of spawning processes.
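One lever for task granularity is the chunksize argument of executor.map, which batches many items into a single inter-process round-trip. A rough benchmark sketch (the task and sizes here are illustrative stand-ins for a cheap per-item transform):

```python
import time
from concurrent.futures import ProcessPoolExecutor


def tiny_task(n: int) -> int:
    """Stand-in for a cheap per-item transform."""
    return n * n


def timed_map(chunksize: int, items: range) -> float:
    """Time a full pool.map pass at the given chunksize."""
    start = time.perf_counter()
    with ProcessPoolExecutor() as pool:
        list(pool.map(tiny_task, items, chunksize=chunksize))
    return time.perf_counter() - start


if __name__ == "__main__":
    items = range(5_000)
    for cs in (1, 100, 1_000):
        print(f"chunksize={cs:>5}: {timed_map(cs, items):.2f}s")
```

With chunksize=1, every item pays the IPC cost; larger chunks amortise it, at the price of coarser load balancing across workers.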
6.3 Using timeit for micro-benchmarks
Micro-benchmark with timeit
```python
import timeit

result = timeit.timeit(
    stmt="json.loads(raw)",
    setup='import json; raw = open("data/raw/103293.json").read()',
    number=1000,
)
print(f"Average per call: {result / 1000 * 1e6:.1f} µs")
```
7 Practical Development Workflow
Write the naive scraper first. Confirm the API shape, handle edge cases (empty pages, 404s), and get clean data before optimising.
Profile before parallelising. Use cProfile or line_profiler to find real bottlenecks, not imagined ones.
Add async. Switch requests → aiohttp, wrap the loop in asyncio.gather, add a semaphore. Re-measure.
Add CPU parallelism only if needed. If processing is trivial (< 5 % of total time), the ProcessPoolExecutor overhead is not worth it.
Tune concurrency empirically. Try MAX_CONCURRENT = 5, 10, 20 and plot wall time vs. value. You will see diminishing returns and, eventually, API-imposed 429 errors.
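For the profiling step, a minimal cProfile harness might look like this sketch (the helper name is invented here; line_profiler offers finer, per-line detail if you need it):

```python
import cProfile
import io
import pstats


def profile(fn, *args, top: int = 10, **kwargs):
    """Run fn under cProfile and print the top entries by cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(top)
    print(buf.getvalue())
    return result


# Usage:
# profile(run_naive)
```

Sorting by cumulative time surfaces the functions whose subtrees dominate the run — usually the HTTP calls for a scraper, which tells you async is the lever to pull first.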
Tip: Save raw JSON, always
Before optimising, ensure you are persisting raw API responses to disk. If your parser has a bug or the API changes shape, you want to re-run parsing without re-hitting the network.
8 Other Languages to Consider for Very Large Scraping Tasks
Python is an excellent default, but for millions of URLs or extremely tight latency budgets, consider:
| Language | Strength | Key libraries |
|---|---|---|
| Go | Native goroutines make concurrency trivial; compiles to a single binary | net/http, colly, chromedp |
| Rust | Near-C performance, zero-cost async via tokio; great for sustained high-throughput pipelines | reqwest, tokio, scraper |
| JavaScript / Node.js | Event loop is async by default; ideal if the target is a JS-rendered SPA | axios, playwright, cheerio |
| Java / Kotlin | Mature JVM thread pools; good choice in enterprise environments | OkHttp, Ktor, jsoup |
| Julia | Surprisingly fast HTTP; useful when scraping feeds directly into numerical analysis | HTTP.jl, JSON3.jl |
For most analysts, Python’s ecosystem and the speed gains from aiohttp + ProcessPoolExecutor are more than sufficient. Rewriting in Go or Rust becomes worthwhile when you are sustaining > 10 000 requests/minute over long periods or deploying scraping as a production microservice.
9 Conclusion
The path from naive to production-grade scraper is incremental:
A synchronous scraper built with requests is the right starting point — simple to write, easy to debug.
Switching to async I/O with aiohttp and a semaphore typically delivers a 5–20× speedup for network-bound workloads with minimal code change.
CPU parallelism via ProcessPoolExecutor complements async when post-processing is expensive.
Benchmark every stage with time.perf_counter() and psutil so optimisation decisions are data-driven, not guesswork.
Store raw JSON alongside your processed output to keep the pipeline reproducible.
The two APIs mentioned in this article — VIA Rail’s allData.json and MaxSold’s auction items endpoint — are useful real-world targets to practice against because they return structured JSON, have predictable shapes, and are publicly accessible.