Scraping Inside the Quota - Golam Ahmed Mugdha

A hard ceiling changes everything

The Open Tender Scraper runs against a portal with a strict budget: roughly 500 connections an hour. That single constraint dictated more design decisions than any feature requirement did. When you can't just do more work by trying harder, every wasted connection is a feature you can't ship.

Pooling, because the naive path is a luxury you can't afford

The obvious implementation opens a fresh connection per operation. It's simpler to write, but it spends far more connections than the work requires and walks straight into the quota wall. Instead, the scraper keeps a small pool of 2–3 persistent connections, checking them out and returning them with automatic cleanup on error. The code is slightly more complex; the payoff is a roughly 80% cut in connection spend. Against a hard ceiling, that's not an optimization. It's the difference between running and being locked out.
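As a rough illustration of the checkout-and-return pattern, here is a minimal sketch of a small pool. It is not the scraper's actual code: the `ConnectionPool` class, the `factory` callable, and the `opened` counter are hypothetical names, and the only assumption about a connection is that it has a `close()` method.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Tiny fixed-size pool. `factory` is any zero-argument callable that
    returns an object with a close() method (an assumption of this sketch)."""

    def __init__(self, factory, size=3):
        self._factory = factory
        self._slots = queue.Queue(maxsize=size)
        self.opened = 0  # how many real connections were spent against the quota
        for _ in range(size):
            self._slots.put(None)  # None means "open lazily on first checkout"

    @contextmanager
    def connection(self, timeout=30.0):
        # Blocks until a slot is free; raises queue.Empty after `timeout` seconds.
        conn = self._slots.get(timeout=timeout)
        try:
            if conn is None:
                conn = self._factory()
                self.opened += 1
            yield conn
        except Exception:
            # Cleanup on error: drop the possibly broken connection so the
            # next checkout of this slot opens a fresh one.
            if conn is not None:
                try:
                    conn.close()
                except Exception:
                    pass
            conn = None
            raise
        finally:
            # The slot always goes back, so the pool never shrinks.
            self._slots.put(conn)
```

Usage would look like `with pool.connection() as conn: ...`. A healthy connection returns to the pool when the block exits, so a long run spends only as many connections as the pool holds, plus one for each error that forces a replacement.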

When the resource is capped, code complexity is a currency you spend to buy headroom.

All-or-nothing writes

High-volume scraping into a normalized, multi-table schema invites a nasty failure: a half-written record, an order row with no line items, referential gaps nobody notices until a report comes out wrong. So every related set of inserts is wrapped in one transaction — BEGIN → INSERT… → COMMIT, and any failure triggers a full ROLLBACK. There is no such thing as a partial record. Foreign keys are enforced at the database, not merely hoped for in code, so integrity survives even a buggy deploy.
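To make the all-or-nothing idea concrete, here is a small sketch using Python's built-in `sqlite3` module as a stand-in for whatever database the scraper really targets. The `tenders` and `tender_items` tables and the `open_db` and `save_tender` helpers are hypothetical, chosen only to show a parent row and its children committing or rolling back together.

```python
import sqlite3

# Hypothetical two-table schema: a parent row plus child rows that reference it.
SCHEMA = """
CREATE TABLE tenders (
    id    TEXT PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE tender_items (
    tender_id   TEXT NOT NULL REFERENCES tenders(id),
    description TEXT NOT NULL
);
"""


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is on for the connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def save_tender(conn, tender_id, title, items):
    """Insert a tender and all of its items, or nothing at all."""
    # Used as a context manager, the connection commits if the block succeeds
    # and rolls back if any statement inside it raises.
    with conn:
        conn.execute(
            "INSERT INTO tenders (id, title) VALUES (?, ?)",
            (tender_id, title),
        )
        conn.executemany(
            "INSERT INTO tender_items (tender_id, description) VALUES (?, ?)",
            [(tender_id, description) for description in items],
        )
```

If any item violates a constraint, the exception propagates out of `save_tender` and the parent row is rolled back with it, so a tender without its items never reaches the table.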

Checking duplicates without asking the database

Deduplicating each scraped record with its own query is an N+1 trap that only gets slower as the dataset grows. Instead the scraper loads every existing ID into an in-memory set once at startup, which gives an O(1) membership check per record with no round trip. The set has to be updated as new records are accepted, or duplicates within a single run would slip through. It trades a little memory for the elimination of thousands of redundant queries, and every query avoided is one less draw against that same quota.
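A minimal sketch of the in-memory check, assuming records arrive as dicts with an `id` key and that existing IDs live in a `tenders` table. Both are illustrative assumptions, as are the `load_seen_ids` and `filter_new` names.

```python
import sqlite3


def load_seen_ids(conn: sqlite3.Connection) -> set:
    """One query at startup instead of one query per scraped record."""
    return {row[0] for row in conn.execute("SELECT id FROM tenders")}


def filter_new(records, seen):
    """Return only unseen records, using set lookups instead of queries."""
    fresh = []
    for record in records:
        if record["id"] in seen:
            continue  # already stored, or already accepted earlier in this batch
        seen.add(record["id"])  # keep the set current for the rest of the run
        fresh.append(record)
    return fresh
```

The set is mutated in place, so the same `seen` object can be passed to every batch for the lifetime of the process.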

A loop that adapts, not a cron that doesn't

Finally, scheduling. A fixed cron can't tell a healthy run from a rate-limited one — it just fires again on the clock and makes things worse. A supervised loop watches each run's exit code and adapts: the normal interval on success, a longer cool-down on a rate-limit signal, a recovery pause on failure. The result is self-healing 24/7 operation that respects the ceiling instead of crashing into it.
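The supervision logic can be sketched in a few lines. Everything specific here is an assumption for illustration: the exit code `75` as the rate-limit signal, the three interval values, and the `next_delay` and `supervise` names. The post does not state the real values.

```python
import subprocess
import time

# Assumed exit-code contract between the scraper and its supervisor.
EXIT_OK = 0
EXIT_RATE_LIMITED = 75

# Assumed intervals, in seconds.
NORMAL_INTERVAL = 15 * 60      # healthy run: come back on the usual schedule
RATE_LIMIT_COOLDOWN = 60 * 60  # quota hit: wait out a full hourly window
FAILURE_PAUSE = 5 * 60         # crash or unknown error: brief recovery pause


def next_delay(exit_code):
    """Map the last run's exit code to how long to wait before the next run."""
    if exit_code == EXIT_OK:
        return NORMAL_INTERVAL
    if exit_code == EXIT_RATE_LIMITED:
        return RATE_LIMIT_COOLDOWN
    return FAILURE_PAUSE


def supervise(command, max_runs=None, sleep=time.sleep):
    """Run `command` repeatedly, adapting the pause to each exit code.

    `max_runs` and `sleep` exist so the loop can be bounded and tested;
    a production loop would leave both at their defaults.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        exit_code = subprocess.run(command).returncode
        sleep(next_delay(exit_code))
        runs += 1
```

Unlike a cron entry, the wait here is decided after each run finishes, so a rate-limited run pushes the next attempt out instead of stacking another one on top of it.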

The lesson

Constraints aren't the enemy of good architecture — they're the brief. A generous environment lets sloppy designs survive; a 500-an-hour quota forces every connection to justify itself. Pooling, atomic writes, in-memory dedup, and adaptive supervision aren't unrelated tricks. They're four answers to the same discipline: spend the scarce resource like it's scarce.