Performance Benchmarks

dvb-WarpPool uses Criterion for hot-path benchmarks. Three suites cover the critical code paths. The goal is not maximum throughput, since a solo pool's workload is small, but a regression baseline that catches code changes early when they unexpectedly make a hot path more expensive.

Suites

Bench	Crate	Hot-Path	Frequency in Pool
`validate`	`warppool-share-validator`	`ShareValidator::validate()` per accepted+rejected share	per-share (most frequent)
`build_job`	`warppool-job-builder`	`JobBuilder::build()` per new block template	per-job (~every 30-60s)
`vardiff`	`warppool-stratum-v1`	`VarDiff::observe_share()` per accepted share	per-share

Running Locally

# Single bench
cargo bench -p warppool-share-validator --bench validate

# All three
cargo bench --workspace --benches

# Compile-check only, no runs (CI smoke)
cargo bench --workspace --no-run

Criterion writes reports to target/criterion/<bench-name>/report/index.html. On each subsequent run it compares against the previous run and reports drift (Performance has regressed. / Performance has improved.).

Baseline Numbers (2026-05-27, MacBook M-Series, release build)

These numbers are a snapshot, not a hard contract — they can vary by 2× depending on hardware and CPU throttling. On Linux x86_64 server hardware the values are typically similar or better.

`validate` (per share)

Bench	Time	Throughput
`validate_full/0` (no merkle branches)	1.32 µs	760K shares/s
`validate_full/8` (typical regtest)	5.49 µs	182K shares/s
`validate_full/12` (typical mainnet)	7.59 µs	132K shares/s
`sha256d_80b_header`	528 ns	—
`sha256d_500b_coinbase`	1.55 µs	—
`merkle_root/12` (hot-path portion)	~2 µs	—
`reconstruct_coinbase`	< 200 ns	—
`build_header`	< 30 ns	—

Take-away: validate time scales linearly with the merkle-branch count, at roughly 0.52 µs per branch. At 12 branches the pool can validate ~130K shares/s, which is more than 25,000× what a solo pool with 7 Bitaxes will ever see (typically 1-5 shares/s). Validate is NEVER the bottleneck.
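The linear scaling can be checked directly against the table. The following is a minimal sketch, not part of the bench suite: it fits a straight line through the `validate_full/0` and `validate_full/12` measurements, compares the prediction with `validate_full/8`, and converts latency to shares per second. The names `fit_linear` and `shares_per_second` are hypothetical and exist only for this illustration.

```rust
/// Measured points from the baseline table above: (merkle branches, nanoseconds).
const MEASURED: [(u32, f64); 3] = [(0, 1320.0), (8, 5490.0), (12, 7590.0)];

/// Hypothetical helper: linear cost model `base + per_branch * branches`,
/// fitted through the first and last measured points.
fn fit_linear(points: &[(u32, f64)]) -> (f64, f64) {
    let (b0, t0) = points[0];
    let (b1, t1) = points[points.len() - 1];
    let per_branch = (t1 - t0) / f64::from(b1 - b0);
    let base = t0 - per_branch * f64::from(b0);
    (base, per_branch)
}

/// Converts a per-share latency in nanoseconds into shares per second.
fn shares_per_second(latency_ns: f64) -> f64 {
    1e9 / latency_ns
}

fn main() {
    let (base, per_branch) = fit_linear(&MEASURED);
    println!("base = {base:.0} ns, per branch = {per_branch:.1} ns");
    for (branches, measured) in MEASURED {
        let predicted = base + per_branch * f64::from(branches);
        println!(
            "{branches:>2} branches: measured {measured:.0} ns, model {predicted:.0} ns, {:.0} shares/s",
            shares_per_second(measured)
        );
    }
}
```

With these inputs the model gives 522.5 ns per branch and predicts about 5.50 µs for 8 branches, against a measured 5.49 µs. The per-branch cost is close to the 528 ns measured for `sha256d_80b_header`, which is what one would expect if each branch costs roughly one double-SHA256.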

`build_job` (per job-refresh)

Bench	Time	Throughput
`build_job/0` (empty / regtest)	~100 µs	—
`build_job/100`	~150 µs	—
`build_job/1000`	~700 µs	—
`build_job/4000` (typical full block)	2.59 ms	386 jobs/s
`merkle_branches/4000`	2.30 ms	—

Take-away: Job-build time scales with tx-count and is dominated by merkle-branch computation (2.30 ms of the 2.59 ms at 4000 txs). 2.59 ms per job for a full mainnet block is clearly visible but not a problem: templates arrive every 30+ seconds, not every millisecond.
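To see why the cost grows with tx-count, it helps to count the hashing work in a Merkle tree. The sketch below performs no real hashing and is not the `JobBuilder` code; it only counts pair hashes and tree depth under Bitcoin's rule that an odd level pairs its last node with itself. It assumes the builder hashes the whole tree once per template, and `merkle_work` is a hypothetical name.

```rust
/// Hypothetical helper: counts the work needed to reduce `leaves` Merkle leaves
/// to a single root, using Bitcoin's rule that an odd level pairs its last
/// node with itself. Returns (pair hashes performed, tree depth).
fn merkle_work(leaves: u64) -> (u64, u32) {
    let mut width = leaves;
    let mut hashes = 0;
    let mut depth = 0;
    while width > 1 {
        // Each pair (or an odd node paired with itself) becomes one parent.
        width = (width + 1) / 2;
        hashes += width;
        depth += 1;
    }
    (hashes, depth)
}

fn main() {
    for leaves in [1u64, 100, 1000, 4000] {
        let (hashes, depth) = merkle_work(leaves);
        println!("{leaves:>4} leaves: {hashes:>4} pair hashes, depth {depth}");
    }
}
```

For 4000 leaves this counts 4001 pair hashes at depth 12. Dividing the measured 2.30 ms for `merkle_branches/4000` by that count gives roughly 575 ns per hash, the same order as the 528 ns `sha256d_80b_header` figure, and the depth of 12 lines up with the 12 branches used for the "typical mainnet" validate case.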

`vardiff` (per accepted share)

Bench	Time
`vardiff_observe_share_hold` (stationary)	5.2 ns
`vardiff_observe_share_retarget` (8-share burst)	37.9 ns (~5ns/share)
`difficulty_to_target_be`	12.85 ns
`vardiff_decision_variant_match`	432 ps

Take-away: VarDiff is effectively free. Even under extreme load (100K shares/s at 5.2 ns each) it consumes about 0.5 ms of CPU time per second.

Interpretation

What you can read from the numbers:

Question	Hint
"Is my pool burning too much CPU?"	No. At 10 shares/s and 12 merkle branches: ~76 µs share-validate time per second = 0.0076% CPU
"How many workers can my pool serve at most?"	The Stratum connection cap (profile-dependent, 64-4096). Share-validate is not the limit
"Is ASIC-boost / merkle-tree caching worth it?"	No, not in a solo pool. In a 10M-shares/s pool, per-template merkle-branch caching could yield a 5-10× speedup
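The CPU figures in this table follow from one multiplication: event rate times per-event cost. A minimal sketch of that arithmetic, using costs from the baseline tables, is shown here; `cpu_fraction` is a hypothetical helper, and the one-template-per-30-seconds rate is taken from the job frequency stated earlier.

```rust
/// Hypothetical helper: fraction of one CPU core spent on a hot path,
/// given an event rate (per second) and a per-event cost in nanoseconds.
fn cpu_fraction(events_per_second: f64, cost_ns: f64) -> f64 {
    events_per_second * cost_ns / 1e9
}

fn main() {
    // (label, events per second, per-event cost in ns), costs from the baseline tables.
    let scenarios = [
        ("validate, 10 shares/s, 12 branches", 10.0, 7590.0),
        ("vardiff hold, 100K shares/s", 100_000.0, 5.2),
        ("build_job/4000, one template per 30 s", 1.0 / 30.0, 2_590_000.0),
    ];
    for (label, rate, cost_ns) in scenarios {
        let fraction = cpu_fraction(rate, cost_ns);
        println!("{label}: {:.4}% of one core", fraction * 100.0);
    }
}
```

The first scenario reproduces the 0.0076% quoted above (10 × 7.59 µs ≈ 76 µs of work per second). The same helper can be fed your own share rate to estimate the load on different hardware, keeping in mind that the baseline numbers themselves can vary by 2×.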

CI

.github/workflows/benches.yml runs only on:

- Manual dispatch (operator clicks "Run workflow" in the UI)
- Tag push (release snapshot)

It does NOT run on every PR: Criterion runs are expensive (~5 min build + 5 min suite), and GitHub-runner noise makes microbenchmark comparisons unreliable.

Reports are uploaded as artifact criterion-reports-<sha> with 30-day retention. The operator can download the artifact and open the HTML report locally.

Regression Workflow

When a bench suddenly becomes 50% slower:

1. Run cargo bench --bench <name> locally to confirm the regression.
2. Run git bisect between the last known-good version and HEAD.
3. For dependency bumps, inspect the Cargo.lock diff; a bump often pulls in a new version of a transitive dependency.
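When comparing two reported times by hand, for example while marking commits good or bad during a bisect, a fixed threshold keeps the decision consistent. The sketch below is an assumption about how one might do that, not something the repository provides: `relative_change` and `needs_bisect` are hypothetical names, and the 0.5 cut-off mirrors the "50% slower" trigger above. Criterion's own regression verdict is based on statistical tests, which this simple ratio does not replicate.

```rust
/// Hypothetical helper: relative slowdown of `current_ns` against `baseline_ns`,
/// where 0.5 means 50% slower and a negative value means faster.
fn relative_change(baseline_ns: f64, current_ns: f64) -> f64 {
    (current_ns - baseline_ns) / baseline_ns
}

/// Flags a run as worth bisecting when it is slower than the baseline by at
/// least `threshold` (an assumed cut-off, for example 0.5 for 50%).
fn needs_bisect(baseline_ns: f64, current_ns: f64, threshold: f64) -> bool {
    relative_change(baseline_ns, current_ns) >= threshold
}

fn main() {
    let baseline = 7590.0; // validate_full/12 from the baseline table, in ns
    for current in [7800.0, 11_500.0] {
        let change = relative_change(baseline, current);
        println!(
            "{current:.0} ns: {:+.1}% vs baseline, bisect: {}",
            change * 100.0,
            needs_bisect(baseline, current, 0.5)
        );
    }
}
```

The threshold should sit well above run-to-run noise on the machine in use; small differences between runs on a throttling laptop are not evidence of a regression.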

Criterion automatically stores the results of the last run in target/criterion/, so a local bench compares against YOUR last run, not against CI. For a CI-vs-local comparison, download the artifact and unpack it into your local target/criterion/ directory before running the bench.

What is Deliberately Not Benched

Path	Why not
Stratum V1 TCP I/O	Tokio async I/O is syscall-bound; Criterion results would be noise-dominated. tokio-console is more useful for inspection.
Bitcoin RPC	Network I/O and Bitcoin Core's processing time dominate. The Phase 16.3 RPC-latency histogram is the right place to observe this.
Translator V1↔V2 mapping	Per-job (every 30-60s), not latency-critical. Would be effort for little benefit.
Storage SQL	sqlx + WAL mode dominates. If needed, bench directly with the `sqlite3` CLI.
Notifier sinks	HTTP/SMTP IO, not CPU-bound. End-to-end latency is readable from the /metrics histogram.