The problem
The existing indexer was a single Node.js process that tailed RPC WebSocket streams and wrote to Postgres through an ORM. It fell behind during volatility spikes, reorgs left the data inconsistent, and a single analytics query could lock the write path for minutes.
The team needed to survive a 10× protocol volume increase without downtime.
What we built
A separated read/write data plane. Ingestion became an idempotent, checkpointed stream. Reads moved to ClickHouse behind a thin query API with enforced cost ceilings. We added a reconciliation pass that validates the last N blocks against an independent archive node and auto-heals drift.
Highlights
- Indexer rewritten in Go with a reorg-aware checkpoint protocol
- Analytics store on ClickHouse with materialized views for hot queries
- Grafana dashboards tied to SLO budgets, PagerDuty wired to SLO burn
- Zero-downtime cutover using dual-write and backfill
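The reorg-aware checkpoint protocol from the first bullet can be sketched as follows (types and names are illustrative assumptions, not the production code): keep the hashes of the last N processed blocks, and when a new block's parent hash doesn't match the tip, rewind to the fork point and replay.

```go
package main

import "fmt"

// Block is a minimal stand-in for a chain block.
type Block struct {
	Number     uint64
	Hash       string
	ParentHash string
}

// Checkpointer retains the last `depth` processed blocks so a reorg
// can be detected and rolled back without a full re-index.
type Checkpointer struct {
	depth int
	seen  []Block // most recent last
}

func NewCheckpointer(depth int) *Checkpointer {
	return &Checkpointer{depth: depth}
}

// Apply ingests a block. If its parent hash does not match our tip, it
// pops checkpoints back to the fork point and returns the block numbers
// that must be re-processed (idempotent writes make the replay safe).
// A reorg deeper than `depth` would fall through to a full backfill,
// which this sketch omits.
func (c *Checkpointer) Apply(b Block) (rewound []uint64) {
	for len(c.seen) > 0 && c.seen[len(c.seen)-1].Hash != b.ParentHash {
		top := c.seen[len(c.seen)-1]
		rewound = append(rewound, top.Number)
		c.seen = c.seen[:len(c.seen)-1]
	}
	c.seen = append(c.seen, b)
	if len(c.seen) > c.depth {
		c.seen = c.seen[1:]
	}
	return rewound
}

func main() {
	cp := NewCheckpointer(16)
	cp.Apply(Block{1, "a1", "a0"})
	cp.Apply(Block{2, "a2", "a1"})
	// A competing block 2 arrives: its parent is a1, not a2 → depth-1 reorg.
	fmt.Println(cp.Apply(Block{2, "b2", "a1"})) // [2]
}
```

Persisting this window alongside each checkpoint is what lets the ingester restart anywhere and converge to the same state.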
Outcome
- 10× increase in tracked volume, no incidents through the cutover
- p95 query latency dropped from 2.4s to 290ms
- Mean time to detect (MTTD) upstream RPC issues dropped from hours to under 2 minutes
- Cost per million indexed events down 40%
Stack
Go · ClickHouse · Postgres · Kafka · Grafana · PagerDuty · Terraform · AWS