<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Jonathan Pearce</title>
<link>https://jonathan-pearce.github.io/blog/</link>
<atom:link href="https://jonathan-pearce.github.io/blog/index.xml" rel="self" type="application/rss+xml"/>
<description>Personal website of Jonathan Pearce — projects, blog posts, datasets, and resources on data analytics, machine learning, and statistics.</description>
<generator>quarto-1.9.36</generator>
<lastBuildDate>Mon, 02 Mar 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Accelerating Web Scraping with APIs</title>
  <dc:creator>Jonathan Pearce</dc:creator>
  <link>https://jonathan-pearce.github.io/blog/posts/web-scraping-apis/</link>
  <description><![CDATA[ 




<section id="introduction" class="level2" data-number="1">
<h2 data-number="1" class="anchored" data-anchor-id="introduction"><span class="header-section-number">1</span> Introduction</h2>
<p>Public APIs are a goldmine for data collection. Two good examples are VIA Rail’s real-time train data (<a href="https://tsimobile.viarail.ca/data/allData.json"><code>allData.json</code></a>) and the MaxSold auction item catalogue (<a href="https://maxsold.maxsold.com/msapi/auctions/items?auctionid=103293"><code>msapi/auctions/items</code></a>). Both return structured JSON you can consume directly without HTML parsing, making them ideal targets for a well-engineered scraper.</p>
<p>This article walks through the full design journey:</p>
<ol type="1">
<li>Start with a <strong>naive synchronous scraper</strong> that is easy to reason about.</li>
<li>Add <strong>async I/O</strong> (<code>asyncio</code> + <code>aiohttp</code>) to eliminate network idle time.</li>
<li>Layer in <strong>CPU parallelism</strong> (<code>concurrent.futures</code>) for CPU-bound post-processing.</li>
<li><strong>Benchmark</strong> each stage so you know where your bottleneck actually is.</li>
<li>Decide whether to store raw JSON blobs or normalised tabular data.</li>
</ol>
<p>Approximate read time: 10 minutes. Code is written for Python 3.10+.</p>
<hr>
</section>
<section id="design-considerations-before-you-write-a-line-of-code" class="level2" data-number="2">
<h2 data-number="2" class="anchored" data-anchor-id="design-considerations-before-you-write-a-line-of-code"><span class="header-section-number">2</span> Design Considerations Before You Write a Line of Code</h2>
<section id="choose-the-right-libraries" class="level3" data-number="2.1">
<h3 data-number="2.1" class="anchored" data-anchor-id="choose-the-right-libraries"><span class="header-section-number">2.1</span> Choose the right libraries</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th>Concern</th>
<th>Recommended library</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Synchronous HTTP</td>
<td><code>requests</code></td>
<td>Simple, battle-tested</td>
</tr>
<tr class="even">
<td>Async HTTP</td>
<td><code>aiohttp</code></td>
<td>Pairs with <code>asyncio</code>; often reported as faster than <code>httpx</code> under high concurrency</td>
</tr>
<tr class="odd">
<td>Async HTTP (alt)</td>
<td><code>httpx</code></td>
<td><code>requests</code>-style API with both sync and async clients</td>
</tr>
<tr class="even">
<td>CPU parallelism</td>
<td><code>concurrent.futures</code></td>
<td><code>ProcessPoolExecutor</code> is stdlib and straightforward</td>
</tr>
<tr class="odd">
<td>Rate-limit awareness</td>
<td><code>tenacity</code></td>
<td>Retry with exponential back-off</td>
</tr>
<tr class="even">
<td>Data wrangling</td>
<td><code>pandas</code> / <code>polars</code></td>
<td><code>polars</code> is faster for large normalised tables</td>
</tr>
</tbody>
</table>
</section>
<section id="respect-api-rate-limits" class="level3" data-number="2.2">
<h3 data-number="2.2" class="anchored" data-anchor-id="respect-api-rate-limits"><span class="header-section-number">2.2</span> Respect API rate limits</h3>
<p>Most free APIs throttle by IP or token. Before looping over thousands of IDs:</p>
<ul>
<li>Read the API docs for stated limits (requests/minute, concurrent connections).</li>
<li>Honour the <code>Retry-After</code> response header so you back off automatically on HTTP 429 (Too Many Requests).</li>
<li>Keep a <strong>semaphore</strong> (<code>asyncio.Semaphore</code>) to cap your own concurrency — don’t rely solely on the server to push back.</li>
</ul>
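<p>The last two points can be sketched with the standard library alone. This is a minimal illustration, not the article’s scraper: <code>retry_after_seconds</code> and <code>get_with_backoff</code> are hypothetical helper names, and <code>send</code> is an assumed coroutine function that returns <code>(status, headers, body)</code>, standing in for whatever HTTP client you use.</p>

```python
import asyncio
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, default=1.0):
    """Convert a Retry-After header value into a delay in seconds.

    The header carries either a whole number of seconds or an HTTP date.
    """
    if value is None:
        return default
    try:
        return float(max(0, int(value)))
    except ValueError:
        pass  # not a number, so try to read it as a date
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default delay
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())


async def get_with_backoff(send, sem, max_attempts=4):
    """Call send() under a semaphore, sleeping on HTTP 429 before retrying.

    `send` is a hypothetical coroutine function returning (status, headers, body).
    """
    for _ in range(max_attempts):
        async with sem:  # cap our own concurrency
            status, headers, body = await send()
        if status != 429:
            return status, body
        # Sleep outside the semaphore so the slot is free for other requests.
        await asyncio.sleep(retry_after_seconds(headers.get("Retry-After")))
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")
```

<p>Sleeping after the semaphore has been released is a deliberate choice here: a throttled request should not hold a concurrency slot while it waits.</p>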
</section>
<section id="raw-json-vs.-tabular-storage" class="level3" data-number="2.3">
<h3 data-number="2.3" class="anchored" data-anchor-id="raw-json-vs.-tabular-storage"><span class="header-section-number">2.3</span> Raw JSON vs.&nbsp;tabular storage</h3>
<p>Storing raw JSON gives you a replayable source of truth: you can re-derive any schema later without hitting the API again. The trade-off is extra disk space and slower, more awkward queries than a tabular format allows.</p>
<p>A practical middle ground: <strong>save the raw JSON</strong> alongside a normalised Parquet file. Tools like <code>pandas.json_normalize</code> and <code>polars</code> make the transformation cheap.</p>
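<p>To make the “raw plus normalised” idea concrete, here is a dependency-free sketch. It swaps in a hand-written <code>flatten</code> helper and CSV output in place of <code>pandas.json_normalize</code> and Parquet, purely so it runs with the standard library. The function names and the sample payload fields are assumptions for illustration, not the real MaxSold schema.</p>

```python
import csv
import json
from pathlib import Path


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level with dotted keys; lists are left as-is."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def save_raw_and_flat(payload, raw_path, flat_path):
    """Write the untouched payload as JSON and a flattened copy as CSV."""
    raw_path = Path(raw_path)
    flat_path = Path(flat_path)
    raw_path.write_text(json.dumps(payload))  # replayable source of truth

    rows = [flatten(item) for item in payload]
    columns = sorted({col for row in rows for col in row})
    with flat_path.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)  # keys missing from a row are written as empty cells
    return columns
```

<p>If the flattening rules change later, only the second half needs to be rerun against the saved raw file; the API is not touched again.</p>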
<hr>
</section>
</section>
<section id="stage-1-the-naive-synchronous-scraper" class="level2" data-number="3">
<h2 data-number="3" class="anchored" data-anchor-id="stage-1-the-naive-synchronous-scraper"><span class="header-section-number">3</span> Stage 1 — The Naive Synchronous Scraper</h2>
<p>Start here. It is easy to debug and gives you a timing baseline.</p>
<div id="naive-scraper" class="cell" data-execution_count="1">
<details class="code-fold">
<summary>Synchronous scraper (baseline)</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> time</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> requests</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> json</span>
<span id="cb1-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pathlib <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Path</span>
<span id="cb1-5"></span>
<span id="cb1-6">AUCTION_IDS <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">103293</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">103294</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">103295</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">103296</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">103297</span>]</span>
<span id="cb1-7">BASE_URL <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://maxsold.maxsold.com/msapi/auctions/items"</span></span>
<span id="cb1-8">OUT_DIR <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Path(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"data/raw"</span>)</span>
<span id="cb1-9">OUT_DIR.mkdir(parents<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, exist_ok<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb1-10"></span>
<span id="cb1-11"></span>
<span id="cb1-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> fetch_auction(auction_id: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>:</span>
<span id="cb1-13">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Fetch items for a single auction and return the JSON payload."""</span></span>
<span id="cb1-14">    resp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> requests.get(BASE_URL, params<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"auctionid"</span>: auction_id}, timeout<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb1-15">    resp.raise_for_status()</span>
<span id="cb1-16">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> resp.json()</span>
<span id="cb1-17"></span>
<span id="cb1-18"></span>
<span id="cb1-19"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> run_naive():</span>
<span id="cb1-20">    start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.perf_counter()</span>
<span id="cb1-21"></span>
<span id="cb1-22">    results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb1-23">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> aid <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> AUCTION_IDS:</span>
<span id="cb1-24">        results[aid] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> fetch_auction(aid)</span>
<span id="cb1-25">        (OUT_DIR <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>aid<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">.json"</span>).write_text(json.dumps(results[aid]))</span>
<span id="cb1-26"></span>
<span id="cb1-27">    elapsed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.perf_counter() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start</span>
<span id="cb1-28">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Naive: fetched </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(AUCTION_IDS)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> auctions in </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>elapsed<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s"</span>)</span>
<span id="cb1-29">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> results</span>
<span id="cb1-30"></span>
<span id="cb1-31"></span>
<span id="cb1-32"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"__main__"</span>:</span>
<span id="cb1-33">    run_naive()</span></code></pre></div></div>
</details>
</div>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p><code>time.perf_counter()</code> measures <strong>wall-clock time</strong>: real elapsed time, including time spent sleeping or waiting on the network. <code>time.process_time()</code> counts only the CPU time (user plus system) consumed by the current process. For I/O-heavy scrapers, wall time is what matters; for CPU-heavy post-processing, process time helps isolate computation cost.</p>
</div>
</div>
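<p>The difference between the two clocks is easy to see with a small experiment. In the sketch below, <code>measure</code>, <code>wait_like_io</code>, and <code>compute</code> are hypothetical names; <code>time.sleep</code> stands in for a network round trip.</p>

```python
import time


def measure(func):
    """Return (wall_seconds, cpu_seconds) for one call to func()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def wait_like_io():
    time.sleep(0.2)  # stands in for a network round trip; uses almost no CPU


def compute():
    sum(i * i for i in range(300_000))  # pure CPU work


io_wall, io_cpu = measure(wait_like_io)
cpu_wall, cpu_cpu = measure(compute)
print(f"sleep:   wall={io_wall:.3f}s cpu={io_cpu:.3f}s")
print(f"compute: wall={cpu_wall:.3f}s cpu={cpu_cpu:.3f}s")
```

<p>The sleeping call should show a wall time near 0.2 s with CPU time close to zero, while the computation shows the two figures much closer together. Exact numbers depend on the machine.</p>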
<hr>
</section>
<section id="stage-2-async-io-with-asyncio-and-aiohttp" class="level2" data-number="4">
<h2 data-number="4" class="anchored" data-anchor-id="stage-2-async-io-with-asyncio-and-aiohttp"><span class="header-section-number">4</span> Stage 2 — Async I/O with <code>asyncio</code> and <code>aiohttp</code></h2>
<p>Network requests spend most of their time waiting. <code>asyncio</code> lets a single thread juggle hundreds of pending requests by switching tasks while one is waiting for a response.</p>
<div id="async-scraper" class="cell" data-execution_count="2">
<details class="code-fold">
<summary>Async scraper with semaphore rate-limiting</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> asyncio</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> time</span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> aiohttp</span>
<span id="cb2-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> json</span>
<span id="cb2-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pathlib <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Path</span>
<span id="cb2-6"></span>
<span id="cb2-7">AUCTION_IDS <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">103293</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">103294</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">103295</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">103296</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">103297</span>]</span>
<span id="cb2-8">BASE_URL <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://maxsold.maxsold.com/msapi/auctions/items"</span></span>
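<span id="cb2-9">OUT_DIR <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Path(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"data/raw"</span>)
OUT_DIR.mkdir(parents<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, exist_ok<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># create the output directory in case this script runs first</span></span>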
<span id="cb2-10">MAX_CONCURRENT <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>          <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># tune to stay within API limits</span></span>
<span id="cb2-11"></span>
<span id="cb2-12"></span>
<span id="cb2-13"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">async</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> fetch_auction_async(</span>
<span id="cb2-14">    session: aiohttp.ClientSession,</span>
<span id="cb2-15">    sem: asyncio.Semaphore,</span>
<span id="cb2-16">    auction_id: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>,</span>
<span id="cb2-17">) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">tuple</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>]:</span>
<span id="cb2-18">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">async</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> sem:            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># waits here (without blocking the event loop) once MAX_CONCURRENT requests are in flight</span></span>
<span id="cb2-19">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">async</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> session.get(BASE_URL, params<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"auctionid"</span>: auction_id}) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> resp:</span>
<span id="cb2-20">            resp.raise_for_status()</span>
<span id="cb2-21">            data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">await</span> resp.json()</span>
<span id="cb2-22">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> auction_id, data</span>
<span id="cb2-23"></span>
<span id="cb2-24"></span>
<span id="cb2-25"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">async</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> run_async():</span>
<span id="cb2-26">    sem <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> asyncio.Semaphore(MAX_CONCURRENT)</span>
<span id="cb2-27">    start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.perf_counter()</span>
<span id="cb2-28"></span>
<span id="cb2-29">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">async</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> aiohttp.ClientSession() <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> session:</span>
<span id="cb2-30">        tasks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [fetch_auction_async(session, sem, aid) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> aid <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> AUCTION_IDS]</span>
<span id="cb2-31">        pairs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">await</span> asyncio.gather(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>tasks, return_exceptions<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb2-32"></span>
<span id="cb2-33">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> aid, data <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> pairs:</span>
<span id="cb2-34">        (OUT_DIR <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>aid<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">.json"</span>).write_text(json.dumps(data))</span>
<span id="cb2-35"></span>
<span id="cb2-36">    elapsed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.perf_counter() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start</span>
<span id="cb2-37">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Async: fetched </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(AUCTION_IDS)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> auctions in </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>elapsed<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s"</span>)</span>
<span id="cb2-38">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(pairs)</span>
<span id="cb2-39"></span>
<span id="cb2-40"></span>
<span id="cb2-41"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"__main__"</span>:</span>
<span id="cb2-42">    asyncio.run(run_async())</span></code></pre></div></div>
</details>
</div>
<p><strong>Expected speedup:</strong> roughly linear in the number of requests allowed in flight at once (<code>MAX_CONCURRENT</code>), up to the API’s connection limit. A synchronous scraper that takes 10 s for 10 URLs often completes in ~1–2 s with async, because the network latency that used to stack up now overlaps.</p>
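<p>You can check that scaling without touching a real API. The sketch below replaces each HTTP call with <code>asyncio.sleep</code>, so the function names and the 0.05 s latency are assumptions. Total time should come out near <code>ceil(n_requests / max_concurrent) × latency</code>.</p>

```python
import asyncio
import time


async def fake_request(sem, latency):
    """Stand-in for one HTTP call: hold a semaphore slot while 'waiting'."""
    async with sem:
        await asyncio.sleep(latency)


async def timed_batch(n_requests, max_concurrent, latency=0.05):
    """Run n_requests simulated calls and return elapsed wall-clock seconds."""
    sem = asyncio.Semaphore(max_concurrent)
    start = time.perf_counter()
    await asyncio.gather(*(fake_request(sem, latency) for _ in range(n_requests)))
    return time.perf_counter() - start


# A cap of 1 behaves like the synchronous loop; a cap of 5 overlaps the waits.
sequential = asyncio.run(timed_batch(10, max_concurrent=1))
concurrent = asyncio.run(timed_batch(10, max_concurrent=5))
print(f"cap=1: {sequential:.2f}s, cap=5: {concurrent:.2f}s")
```

<p>With these settings the capped-at-one run takes about ten latencies and the capped-at-five run about two, which is the same 10 s versus ~2 s pattern described above at a smaller scale.</p>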
<hr>
</section>
<section id="stage-3-cpu-parallelism-with-concurrent.futures" class="level2" data-number="5">
<h2 data-number="5" class="anchored" data-anchor-id="stage-3-cpu-parallelism-with-concurrent.futures"><span class="header-section-number">5</span> Stage 3 — CPU Parallelism with <code>concurrent.futures</code></h2>
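<p>Async I/O solves the <em>waiting</em> problem. If your post-processing step (JSON normalisation, deduplication, feature engineering) is CPU-intensive, you need actual parallel execution across CPU cores. <code>ProcessPoolExecutor</code> runs the work in separate worker processes, each with its own interpreter and its own Global Interpreter Lock (GIL), so CPU-bound tasks execute in parallel. The cost is that arguments and results are pickled as they cross process boundaries, so keep both reasonably small.</p>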
<div id="cpu-parallel" class="cell" data-execution_count="3">
<details class="code-fold">
<summary>CPU-parallel post-processing with ProcessPoolExecutor</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> json</span>
<span id="cb3-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> time</span>
<span id="cb3-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> concurrent.futures <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ProcessPoolExecutor, as_completed</span>
<span id="cb3-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pathlib <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Path</span>
<span id="cb3-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb3-6"></span>
<span id="cb3-7">RAW_DIR <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Path(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"data/raw"</span>)</span>
<span id="cb3-8">OUT_CSV <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Path(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"data/processed/auctions.parquet"</span>)</span>
<span id="cb3-9">OUT_CSV.parent.mkdir(parents<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, exist_ok<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb3-10"></span>
<span id="cb3-11"></span>
<span id="cb3-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> normalise_file(path: Path) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> pd.DataFrame:</span>
<span id="cb3-13">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Load one raw JSON file and return a flat DataFrame."""</span></span>
<span id="cb3-14">    payload <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> json.loads(path.read_text())</span>
<span id="cb3-15">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># json_normalize flattens nested dicts into dotted columns; pass record_path if the items sit under a nested key (depends on the actual API shape)</span></span>
<span id="cb3-16">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.json_normalize(payload <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(payload, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> [payload])</span>
<span id="cb3-17">    df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"_source_file"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> path.stem</span>
<span id="cb3-18">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> df</span>
<span id="cb3-19"></span>
<span id="cb3-20"></span>
<span id="cb3-21"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> run_parallel_normalise():</span>
<span id="cb3-22">    json_files <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(RAW_DIR.glob(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"*.json"</span>))</span>
<span id="cb3-23">    start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.perf_counter()</span>
<span id="cb3-24"></span>
<span id="cb3-25">    frames <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb3-26">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> ProcessPoolExecutor() <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> executor:</span>
<span id="cb3-27">        futures <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {executor.submit(normalise_file, p): p <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> p <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> json_files}</span>
<span id="cb3-28">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> future <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> as_completed(futures):</span>
<span id="cb3-29">            frames.append(future.result())</span>
<span id="cb3-30"></span>
<span id="cb3-31">    combined <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.concat(frames, ignore_index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb3-32">    combined.to_parquet(OUT_CSV)</span>
<span id="cb3-33"></span>
<span id="cb3-34">    elapsed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.perf_counter() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start</span>
<span id="cb3-35">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Parallel normalise: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(json_files)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> files in </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>elapsed<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s"</span>)</span>
<span id="cb3-36">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> combined</span>
<span id="cb3-37"></span>
<span id="cb3-38"></span>
<span id="cb3-39"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"__main__"</span>:</span>
<span id="cb3-40">    run_parallel_normalise()</span></code></pre></div></div>
</details>
</div>
<section id="combining-async-fetch-parallel-processing" class="level3" data-number="5.1">
<h3 data-number="5.1" class="anchored" data-anchor-id="combining-async-fetch-parallel-processing"><span class="header-section-number">5.1</span> Combining async fetch + parallel processing</h3>
<p>The two techniques compose naturally: <strong>fetch async, process in parallel</strong>.</p>
<div class="cell" data-layout-align="default">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode default code-with-copy"><code class="sourceCode default"><span id="cb4-1">flowchart LR</span>
<span id="cb4-2">    A[Auction ID list] --&gt;|asyncio.gather| B[Async HTTP fetches\naiohttp + Semaphore]</span>
<span id="cb4-3">    B --&gt;|raw JSON blobs| C[Disk / memory]</span>
<span id="cb4-4">    C --&gt;|ProcessPoolExecutor| D[CPU workers\nnormalise &amp; clean]</span>
<span id="cb4-5">    D --&gt; E[Parquet / CSV output]</span></code></pre></div></div>
</details>
</div>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>Avoid mixing async and multiprocessing carelessly
</div>
</div>
<div class="callout-body-container callout-body">
<p><code>ProcessPoolExecutor</code> starts separate worker processes, each with its own interpreter and memory; none of them shares the parent’s event loop. Do not pass <code>aiohttp.ClientSession</code> objects across process boundaries, because they cannot be pickled. Fetch in async, <em>then</em> hand raw bytes/dicts to the process pool.</p>
</div>
</div>
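<p>The hand-off described above can be sketched roughly as follows. This is a minimal illustration, not the article’s scraper: <code>fake_fetch</code>, <code>parse_blob</code>, and <code>fetch_then_process</code> are hypothetical names, and the fetch step is simulated with <code>asyncio.sleep</code> so the example runs without <code>aiohttp</code> or network access.</p>

```python
import asyncio
import json
from concurrent.futures import ProcessPoolExecutor


async def fake_fetch(auction_id):
    """Hypothetical stand-in for an aiohttp GET; returns raw JSON bytes."""
    await asyncio.sleep(0.01)  # simulated network latency
    payload = {"auction_id": auction_id, "items": ["a", "b", "c"]}
    return json.dumps(payload).encode()


def parse_blob(raw):
    """CPU-side step. Defined at module level so workers can import it."""
    record = json.loads(raw)
    return {"auction_id": record["auction_id"], "n_items": len(record["items"])}


async def fetch_then_process(auction_ids, fetch=fake_fetch, executor=None,
                             max_concurrent=5):
    """Fetch everything on the event loop, then parse in a worker pool."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_fetch(auction_id):
        async with semaphore:
            return await fetch(auction_id)

    # Stage 1: network I/O stays on the event loop.
    blobs = await asyncio.gather(*(bounded_fetch(a) for a in auction_ids))

    # Stage 2: only plain bytes cross the process boundary, never a session.
    loop = asyncio.get_running_loop()
    pool = executor or ProcessPoolExecutor()
    try:
        jobs = [loop.run_in_executor(pool, parse_blob, blob) for blob in blobs]
        return await asyncio.gather(*jobs)
    finally:
        if executor is None:
            pool.shutdown()
```

<p>Calling <code>asyncio.run(fetch_then_process([103293, 103294]))</code> would use a fresh <code>ProcessPoolExecutor</code>; as in the script above, that call probably belongs under an <code>if __name__ == "__main__":</code> guard so worker processes can import the module safely. Passing your own executor lets you reuse one pool across several batches.</p>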
<hr>
</section>
</section>
<section id="measuring-performance" class="level2" data-number="6">
<h2 data-number="6" class="anchored" data-anchor-id="measuring-performance"><span class="header-section-number">6</span> Measuring Performance</h2>
<section id="wall-time-vs.-cpu-time" class="level3" data-number="6.1">
<h3 data-number="6.1" class="anchored" data-anchor-id="wall-time-vs.-cpu-time"><span class="header-section-number">6.1</span> Wall time vs.&nbsp;CPU time</h3>
<div id="timing-example" class="cell" data-execution_count="4">
<details class="code-fold">
<summary>Comparing wall time and CPU time</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> time</span>
<span id="cb5-2"></span>
<span id="cb5-3"></span>
<span id="cb5-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> measure(fn, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>kwargs):</span>
<span id="cb5-5">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Run fn and report wall time and CPU time."""</span></span>
<span id="cb5-6">    t_wall_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.perf_counter()</span>
<span id="cb5-7">    t_cpu_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.process_time()</span>
<span id="cb5-8"></span>
<span id="cb5-9">    result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> fn(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>kwargs)</span>
<span id="cb5-10"></span>
<span id="cb5-11">    wall <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.perf_counter() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> t_wall_start</span>
<span id="cb5-12">    cpu <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.process_time() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> t_cpu_start</span>
<span id="cb5-13"></span>
<span id="cb5-14">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Wall time : </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>wall<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s"</span>)</span>
<span id="cb5-15">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"CPU time  : </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>cpu<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s"</span>)</span>
<span id="cb5-16">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"I/O ratio : </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> cpu <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> wall<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.1%}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> of wall time was I/O wait"</span>)</span>
<span id="cb5-17">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> result</span></code></pre></div></div>
</details>
</div>
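<p>A high I/O ratio (&gt; 80 %) means the run spends most of its time waiting, so async will help most. A low I/O ratio means the work is CPU-bound and CPU parallelism is the lever to pull. Note that <code>time.process_time()</code> counts CPU time for the current process only, so work done inside <code>ProcessPoolExecutor</code> workers is not included and will make the I/O ratio look higher than it is.</p>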
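<p>To get a feel for the two extremes, the same measurement can be applied to a pure wait and a pure computation. The sketch below is an illustration under assumptions: <code>io_ratio</code> and <code>busy_loop</code> are hypothetical helpers, and <code>time.sleep</code> stands in for a blocking network call.</p>

```python
import time


def io_ratio(fn, *args, **kwargs):
    """Return the fraction of wall time not spent on CPU in this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn(*args, **kwargs)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    if wall <= 0:
        return 0.0
    # Clamp at zero: CPU time can slightly exceed wall time in threaded code.
    return max(0.0, 1 - cpu / wall)


def busy_loop(n=2_000_000):
    """CPU-bound stand-in for parsing or normalising."""
    return sum(i * i for i in range(n))


waiting = io_ratio(time.sleep, 0.2)  # stand-in for a blocking network call
computing = io_ratio(busy_loop)

print(f"sleep     : {waiting:.0%} of wall time spent waiting")
print(f"busy loop : {computing:.0%} of wall time spent waiting")
```

<p>The sleep should land close to 100 % and the busy loop far lower; exact figures depend on the machine and its load.</p>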
</section>
<section id="checking-cpu-core-utilisation" class="level3" data-number="6.2">
<h3 data-number="6.2" class="anchored" data-anchor-id="checking-cpu-core-utilisation"><span class="header-section-number">6.2</span> Checking CPU core utilisation</h3>
<div id="cpu-check" class="cell" data-execution_count="5">
<details class="code-fold">
<summary>Monitor CPU utilisation with psutil</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> psutil</span>
<span id="cb6-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> time</span>
<span id="cb6-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> threading</span>
<span id="cb6-4"></span>
<span id="cb6-5"></span>
<span id="cb6-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> monitor_cpu(interval: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, duration: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">10.0</span>):</span>
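<span id="cb6-7">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Print per-core CPU % every `interval` seconds, for up to `duration` seconds."""</span></span>
<span id="cb6-8">    stop_event <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> threading.Event()</span>
<span id="cb6-9">    deadline <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.monotonic() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> duration</span>
<span id="cb6-10">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _poll():</span>
<span id="cb6-11">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">while</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> stop_event.is_set() <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">and</span> time.monotonic() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> deadline:</span>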
<span id="cb6-12">            percents <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> psutil.cpu_percent(interval<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>interval, percpu<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb6-13">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Cores:"</span>, [<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>p<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:5.1f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">%"</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> p <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> percents])</span>
<span id="cb6-14"></span>
<span id="cb6-15">    t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> threading.Thread(target<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>_poll, daemon<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb6-16">    t.start()</span>
<span id="cb6-17">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> stop_event   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># call stop_event.set() to stop polling</span></span>
<span id="cb6-18"></span>
<span id="cb6-19"></span>
<span id="cb6-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Usage:</span></span>
<span id="cb6-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># stop = monitor_cpu()</span></span>
<span id="cb6-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># run_parallel_normalise()</span></span>
<span id="cb6-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># stop.set()</span></span></code></pre></div></div>
</details>
</div>
<p>If you see only one or two cores active during <code>ProcessPoolExecutor</code> work, the tasks are probably too small. Check that each task does enough work to justify the cost of starting worker processes and pickling data between them.</p>
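<p>One way to coarsen task granularity is to submit batches of files rather than single files. The sketch below is a hedged illustration: <code>batched</code>, <code>normalise_batch</code>, and <code>run_in_batches</code> are hypothetical names, and the per-batch work is a placeholder rather than the article’s <code>normalise_file</code>.</p>

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def normalise_batch(paths):
    """Hypothetical per-batch worker; here it only measures path lengths."""
    return [len(str(p)) for p in paths]


def run_in_batches(paths, batch_size=50, executor=None):
    """Submit one task per batch instead of one task per file."""
    pool = executor or ProcessPoolExecutor()
    try:
        results = []
        # map() yields batch results in submission order.
        for batch_result in pool.map(normalise_batch, batched(paths, batch_size)):
            results.extend(batch_result)
        return results
    finally:
        if executor is None:
            pool.shutdown()
```

<p>With 1 000 small files and <code>batch_size=50</code>, the pool handles 20 tasks instead of 1 000, so per-task overhead is paid far less often. <code>Executor.map</code> also accepts a <code>chunksize</code> argument that groups items in a similar way for <code>ProcessPoolExecutor</code>.</p>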
</section>
<section id="using-timeit-for-micro-benchmarks" class="level3" data-number="6.3">
<h3 data-number="6.3" class="anchored" data-anchor-id="using-timeit-for-micro-benchmarks"><span class="header-section-number">6.3</span> Using <code>timeit</code> for micro-benchmarks</h3>
<div id="timeit-example" class="cell" data-execution_count="6">
<details class="code-fold">
<summary>Micro-benchmark with timeit</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> timeit</span>
<span id="cb7-2"></span>
<span id="cb7-3">result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> timeit.timeit(</span>
<span id="cb7-4">    stmt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"json.loads(raw)"</span>,</span>
<span id="cb7-5">    setup<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'import json; raw = open("data/raw/103293.json").read()'</span>,</span>
<span id="cb7-6">    number<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>,</span>
<span id="cb7-7">)</span>
<span id="cb7-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Average per call: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e6</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.1f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> µs"</span>)</span></code></pre></div></div>
</details>
</div>
<hr>
</section>
</section>
<section id="practical-development-workflow" class="level2" data-number="7">
<h2 data-number="7" class="anchored" data-anchor-id="practical-development-workflow"><span class="header-section-number">7</span> Practical Development Workflow</h2>
<ol type="1">
<li><strong>Write the naive scraper first.</strong> Confirm the API shape, handle edge cases (empty pages, 404s), and get clean data before optimising.</li>
<li><strong>Profile before parallelising.</strong> Use <code>cProfile</code> or <code>line_profiler</code> to find real bottlenecks, not imagined ones.</li>
<li><strong>Add async.</strong> Switch <code>requests</code> → <code>aiohttp</code>, wrap the loop in <code>asyncio.gather</code>, add a semaphore. Re-measure.</li>
<li><strong>Add CPU parallelism only if needed.</strong> If processing is trivial (&lt; 5 % of total time), the <code>ProcessPoolExecutor</code> overhead is not worth it.</li>
<li><strong>Tune concurrency empirically.</strong> Try <code>MAX_CONCURRENT</code> = 5, 10, 20 and plot wall time against each value. You will see diminishing returns and, eventually, HTTP 429 (Too Many Requests) responses from the API.</li>
</ol>
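<p>The tuning step at the end of the list above can be sketched as a small sweep. This is an illustration under assumptions: <code>fake_fetch</code>, <code>timed_run</code>, and <code>sweep</code> are hypothetical names, and each request is simulated with a fixed 20 ms sleep instead of a real HTTP call.</p>

```python
import asyncio
import time


async def fake_fetch(delay=0.02):
    """Hypothetical stand-in for one HTTP request."""
    await asyncio.sleep(delay)


async def timed_run(n_requests, max_concurrent):
    """Return wall time for n_requests under a given concurrency cap."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded():
        async with semaphore:
            await fake_fetch()

    start = time.perf_counter()
    await asyncio.gather(*(bounded() for _ in range(n_requests)))
    return time.perf_counter() - start


def sweep(levels=(5, 10, 20), n_requests=40):
    """Measure each concurrency level in turn."""
    return {level: asyncio.run(timed_run(n_requests, level)) for level in levels}


timings = sweep()
for level, seconds in timings.items():
    print(f"MAX_CONCURRENT={level:>2}: {seconds:.3f}s")
```

<p>A simulated fetch scales almost perfectly, so it will not show the plateau or the 429 responses a real API produces. Swap in the real request coroutine, and keep the request count small, to see where the curve actually flattens.</p>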
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Save raw JSON, always
</div>
</div>
<div class="callout-body-container callout-body">
<p>Before optimising, ensure you are persisting raw API responses to disk. If your parser has a bug or the API changes shape, you want to re-run parsing without re-hitting the network.</p>
</div>
</div>
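<p>A rough sketch of that habit, assuming one file per auction ID under a raw-data directory: <code>save_raw</code> and <code>reparse_all</code> are hypothetical helpers, not functions from the scraper above.</p>

```python
import json
from pathlib import Path


def save_raw(raw_dir, auction_id, body):
    """Write the untouched response body to <raw_dir>/<auction_id>.json."""
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    path = raw_dir / f"{auction_id}.json"
    path.write_bytes(body)  # bytes exactly as received, before any parsing
    return path


def reparse_all(raw_dir, parse):
    """Re-run a parser over every saved response without touching the network."""
    results = {}
    for path in sorted(Path(raw_dir).glob("*.json")):
        results[path.stem] = parse(json.loads(path.read_bytes()))
    return results
```

<p>Because the bytes are written before parsing, a parser fix only requires calling <code>reparse_all</code> again with the corrected function.</p>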
<hr>
</section>
<section id="other-languages-to-consider-for-very-large-scraping-tasks" class="level2" data-number="8">
<h2 data-number="8" class="anchored" data-anchor-id="other-languages-to-consider-for-very-large-scraping-tasks"><span class="header-section-number">8</span> Other Languages to Consider for Very Large Scraping Tasks</h2>
<p>Python is an excellent default, but for millions of URLs or extremely tight latency budgets, consider:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th>Language</th>
<th>Strength</th>
<th>Key libraries</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Go</strong></td>
<td>Native goroutines make concurrency trivial; compiles to a single binary</td>
<td><code>net/http</code>, <code>colly</code>, <code>chromedp</code></td>
</tr>
<tr class="even">
<td><strong>Rust</strong></td>
<td>Near-C performance, zero-cost async via <code>tokio</code>; great for sustained high-throughput pipelines</td>
<td><code>reqwest</code>, <code>tokio</code>, <code>scraper</code></td>
</tr>
<tr class="odd">
<td><strong>JavaScript / Node.js</strong></td>
<td>Event-loop is async by default; ideal if the target is a JS-rendered SPA</td>
<td><code>axios</code>, <code>playwright</code>, <code>cheerio</code></td>
</tr>
<tr class="even">
<td><strong>Java / Kotlin</strong></td>
<td>JVM thread pool maturity; good choice in enterprise environments</td>
<td><code>OkHttp</code>, <code>Ktor</code>, <code>jsoup</code></td>
</tr>
<tr class="odd">
<td><strong>Julia</strong></td>
<td>Surprisingly fast HTTP; useful when scraping feeds directly into numerical analysis</td>
<td><code>HTTP.jl</code>, <code>JSON3.jl</code></td>
</tr>
</tbody>
</table>
<p>For most analysts, Python’s ecosystem and the speed gains from <code>aiohttp</code> + <code>ProcessPoolExecutor</code> are more than sufficient. Rewriting in Go or Rust becomes worthwhile when you are sustaining &gt; 10 000 requests/minute over long periods or deploying scraping as a production microservice.</p>
<hr>
</section>
<section id="conclusion" class="level2" data-number="9">
<h2 data-number="9" class="anchored" data-anchor-id="conclusion"><span class="header-section-number">9</span> Conclusion</h2>
<p>The path from naive to production-grade scraper is incremental:</p>
<ul>
<li>A <strong>synchronous scraper</strong> built with <code>requests</code> is the right starting point — simple to write, easy to debug.</li>
<li>Switching to <strong>async I/O</strong> with <code>aiohttp</code> and a semaphore typically delivers a 5–20× speedup for network-bound workloads with minimal code change.</li>
<li><strong>CPU parallelism</strong> via <code>ProcessPoolExecutor</code> complements async when post-processing is expensive.</li>
<li><strong>Benchmark every stage</strong> with <code>time.perf_counter()</code> and <code>psutil</code> so optimisation decisions are data-driven, not guesswork.</li>
<li><strong>Store raw JSON</strong> alongside your processed output to keep the pipeline reproducible.</li>
</ul>
<p>The two APIs mentioned in this article — VIA Rail’s <code>allData.json</code> and MaxSold’s auction items endpoint — are useful real-world targets to practice against because they return structured JSON, have predictable shapes, and are publicly accessible.</p>
<hr>
</section>
<section id="references-further-reading" class="level2" data-number="10">
<h2 data-number="10" class="anchored" data-anchor-id="references-further-reading"><span class="header-section-number">10</span> References &amp; Further Reading</h2>
<ul>
<li><p>Van Rossum, G. et al.&nbsp;<em>asyncio — Asynchronous I/O</em>. Python 3 documentation. <a href="https://docs.python.org/3/library/asyncio.html" class="uri">https://docs.python.org/3/library/asyncio.html</a></p></li>
<li><p>aiohttp maintainers. <em>aiohttp documentation</em>. <a href="https://docs.aiohttp.org" class="uri">https://docs.aiohttp.org</a></p></li>
<li><p>Python Software Foundation. <em>concurrent.futures — Launching parallel tasks</em>. <a href="https://docs.python.org/3/library/concurrent.futures.html" class="uri">https://docs.python.org/3/library/concurrent.futures.html</a></p></li>
<li><p>Rodolà, G. <em>psutil documentation</em>. <a href="https://psutil.readthedocs.io" class="uri">https://psutil.readthedocs.io</a></p></li>
<li><p>Reitz, K. <em>Requests: HTTP for Humans</em>. <a href="https://requests.readthedocs.io" class="uri">https://requests.readthedocs.io</a></p></li>
<li><p>pandas contributors. <em>pandas.json_normalize</em>. <a href="https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html" class="uri">https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html</a></p></li>
<li><p>Nguyen, T. (2021). <em>Fast API scraping in Python: asyncio vs threading vs multiprocessing</em>. Towards Data Science. <a href="https://towardsdatascience.com/fast-api-scraping-in-python-asyncio-vs-threading-vs-multiprocessing" class="uri">https://towardsdatascience.com/fast-api-scraping-in-python-asyncio-vs-threading-vs-multiprocessing</a></p></li>
<li><p>colly contributors. <em>colly — Fast and Elegant Scraping Framework for Go</em>. <a href="https://github.com/gocolly/colly" class="uri">https://github.com/gocolly/colly</a></p></li>
<li><p>Tokio contributors. <em>Tokio — An asynchronous Rust runtime</em>. <a href="https://tokio.rs" class="uri">https://tokio.rs</a></p></li>
<li><p>tenacity contributors. <em>tenacity — Retrying library for Python</em>. <a href="https://tenacity.readthedocs.io" class="uri">https://tenacity.readthedocs.io</a></p></li>
<li><p>VIA Rail Canada. <em>TSI Mobile API — allData.json</em>. <a href="https://tsimobile.viarail.ca/data/allData.json" class="uri">https://tsimobile.viarail.ca/data/allData.json</a></p></li>
<li><p>MaxSold Inc.&nbsp;<em>Auction Items API</em>. <a href="https://maxsold.maxsold.com/msapi/auctions/items?auctionid=103293" class="uri">https://maxsold.maxsold.com/msapi/auctions/items?auctionid=103293</a></p></li>
</ul>


</section>

 ]]></description>
  <guid>https://jonathan-pearce.github.io/blog/posts/web-scraping-apis/</guid>
  <pubDate>Mon, 02 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://jonathan-pearce.github.io/blog/posts/web-scraping-apis/thumbnail.png" medium="image" type="image/png" height="76" width="144"/>
</item>
<item>
  <title>Exploratory Data Analysis with Python</title>
  <dc:creator>Jonathan Pearce</dc:creator>
  <link>https://jonathan-pearce.github.io/blog/posts/example-eda/</link>
  <description><![CDATA[ 




<section id="overview" class="level2">
<h2 class="anchored" data-anchor-id="overview">Overview</h2>
<p>This example post demonstrates how a typical Exploratory Data Analysis (EDA) blog post looks on this site. It generates synthetic data so the post renders without external dependencies.</p>
</section>
<section id="generate-data" class="level2">
<h2 class="anchored" data-anchor-id="generate-data">Generate Data</h2>
<div id="7eb44bfb" class="cell" data-execution_count="1">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb1-4"></span>
<span id="cb1-5">np.random.seed(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>)</span>
<span id="cb1-6"></span>
<span id="cb1-7">n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">300</span></span>
<span id="cb1-8">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame({</span>
<span id="cb1-9">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"age"</span>: np.random.randint(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">65</span>, n),</span>
<span id="cb1-10">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"income"</span>: np.random.exponential(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50000</span>, n).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb1-11">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"score"</span>: np.random.beta(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, n).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>),</span>
<span id="cb1-12">})</span>
<span id="cb1-13">df.head()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display" data-execution_count="1">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">age</th>
<th data-quarto-table-cell-role="th">income</th>
<th data-quarto-table-cell-role="th">score</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">0</th>
<td>63</td>
<td>42639.29</td>
<td>0.0089</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">1</th>
<td>20</td>
<td>5414.64</td>
<td>0.4655</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">2</th>
<td>46</td>
<td>60170.99</td>
<td>0.2125</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">3</th>
<td>52</td>
<td>54112.52</td>
<td>0.1904</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">4</th>
<td>56</td>
<td>2517.17</td>
<td>0.2594</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
</section>
<section id="distributions" class="level2">
<h2 class="anchored" data-anchor-id="distributions">Distributions</h2>
<div id="cell-fig-distributions" class="cell" data-execution_count="2">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">fig, axes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.5</span>))</span>
<span id="cb2-2"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> ax, col <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(axes, df.columns):</span>
<span id="cb2-3">    ax.hist(df[col], bins<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>, edgecolor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"white"</span>)</span>
<span id="cb2-4">    ax.set_title(col.capitalize())</span>
<span id="cb2-5">fig.tight_layout()</span>
<span id="cb2-6">plt.show()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div id="fig-distributions" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-distributions-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://jonathan-pearce.github.io/blog/posts/example-eda/index_files/figure-html/fig-distributions-output-1.png" class="img-fluid figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-distributions-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: Distributions of age, income, and score.
</figcaption>
</figure>
</div>
</div>
</div>
</section>
<section id="correlations" class="level2">
<h2 class="anchored" data-anchor-id="correlations">Correlations</h2>
<div id="e20dd01a" class="cell" data-execution_count="3">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">df.corr().<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display" data-execution_count="3">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">age</th>
<th data-quarto-table-cell-role="th">income</th>
<th data-quarto-table-cell-role="th">score</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">age</th>
<td>1.000</td>
<td>0.096</td>
<td>0.012</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">income</th>
<td>0.096</td>
<td>1.000</td>
<td>-0.077</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">score</th>
<td>0.012</td>
<td>-0.077</td>
<td>1.000</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
</section>
<section id="next-steps" class="level2">
<h2 class="anchored" data-anchor-id="next-steps">Next Steps</h2>
<ul>
<li>Handle outliers</li>
<li>Engineer features</li>
<li>Build models</li>
</ul>


</section>

 ]]></description>
  <guid>https://jonathan-pearce.github.io/blog/posts/example-eda/</guid>
  <pubDate>Sun, 15 Feb 2026 00:00:00 GMT</pubDate>
  <media:content url="https://jonathan-pearce.github.io/blog/posts/example-eda/thumbnail.png" medium="image" type="image/png" height="96" width="144"/>
</item>
</channel>
</rss>
