

The Routing Engine

Every request passes through three stages in order: semantic cache lookup, provider scoring, and execution with automatic failover. This page explains exactly how each stage works.

The Three Stages

01
Semantic Cache

The request is embedded using Google text-embedding-004 and compared against recent responses stored in Redis. If cosine similarity exceeds 0.92, the cached response is returned immediately — no LLM call, no cost.

02
Provider Scoring

All healthy providers are scored against your chosen routing mode (cost, speed, quality, or balanced). The top scorer is selected as the primary candidate. Two backup candidates are retained for failover.

03
Execution + Failover

The primary provider is called. On any failure — rate limit, timeout, 5xx — the router automatically retries with the next-ranked provider. This happens transparently; your client sees one clean response.
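The three stages above can be sketched as a single routing function. Types and names here are illustrative, not P402's actual internals:

```typescript
// Illustrative sketch of the three-stage pipeline; the Stage interface and
// all names are assumptions, not the real P402 implementation.
type Candidate = { provider: string; score: number };

interface Stage {
  cacheLookup(prompt: string): string | null;     // stage 1: semantic cache
  scoreProviders(mode: string): Candidate[];      // stage 2: ranked candidates
  call(provider: string, prompt: string): string; // stage 3: may throw on failure
}

function route(engine: Stage, prompt: string, mode: string): string {
  // Stage 1: semantic cache — return immediately on a hit, no LLM call.
  const cached = engine.cacheLookup(prompt);
  if (cached !== null) return cached;

  // Stage 2: rank healthy providers; keep the top 3 for failover.
  const candidates = engine.scoreProviders(mode).slice(0, 3);

  // Stage 3: execute, falling through to the next candidate on any failure.
  for (const c of candidates) {
    try {
      return engine.call(c.provider, prompt);
    } catch {
      continue; // rate limit, timeout, 5xx — try the next-ranked candidate
    }
  }
  throw new Error("PROVIDER_UNAVAILABLE"); // all 3 candidates failed
}
```

Note that the cache check happens before any provider is scored, so a cache hit costs neither money nor scoring time.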

Scoring Algorithm

Each provider candidate receives a composite score between 0 and 1. The weights shift depending on the routing mode you specify. A provider that is down or degraded is penalized before any mode-specific scoring.

Scoring Factors
| Factor | Source | Description |
| --- | --- | --- |
| success_rate | DB: facilitator_health | 7-day rolling success ratio. A provider at 99% beats one at 95%. |
| p95_settle_ms | DB: facilitator_health | 95th-percentile latency in ms. Weighted heavily in speed mode. |
| cost_per_1k_tokens | lib/ai-providers/registry.ts | Input + output token price. Primary factor in cost mode. |
| reputation_score | ERC-8004 on-chain | On-chain reputation from the ERC-8004 registry. Normalized 0–1. |
| health_status | Live health probe | healthy = 1.0, degraded = 0.5, down = 0. Applied as a multiplier. |
Weight Distribution by Mode
| Mode | Cost | Speed | Quality | Reliability |
| --- | --- | --- | --- | --- |
| cost | 70% | 10% | 10% | 10% |
| speed | 10% | 70% | 10% | 10% |
| quality | 10% | 10% | 70% | 10% |
| balanced | 25% | 25% | 25% | 25% |
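Under these weights, a composite score might be computed as follows. The normalization of each factor to 0–1 is an assumption based on the factor table; field names mirror it:

```typescript
// Sketch of the composite score under the weight table above — an
// assumption about the arithmetic, not the actual P402 scoring code.
type Mode = "cost" | "speed" | "quality" | "balanced";

const WEIGHTS: Record<Mode, { cost: number; speed: number; quality: number; reliability: number }> = {
  cost:     { cost: 0.7,  speed: 0.1,  quality: 0.1,  reliability: 0.1  },
  speed:    { cost: 0.1,  speed: 0.7,  quality: 0.1,  reliability: 0.1  },
  quality:  { cost: 0.1,  speed: 0.1,  quality: 0.7,  reliability: 0.1  },
  balanced: { cost: 0.25, speed: 0.25, quality: 0.25, reliability: 0.25 },
};

interface ProviderStats {
  costScore: number;     // 1 = cheapest, 0 = most expensive (from cost_per_1k_tokens)
  speedScore: number;    // 1 = fastest, 0 = slowest (from p95_settle_ms)
  qualityScore: number;  // reputation_score, already normalized 0–1
  successRate: number;   // 7-day rolling ratio, 0–1
  health: 1.0 | 0.5 | 0; // healthy / degraded / down, applied as a multiplier
}

function compositeScore(s: ProviderStats, mode: Mode): number {
  const w = WEIGHTS[mode];
  const base =
    w.cost * s.costScore +
    w.speed * s.speedScore +
    w.quality * s.qualityScore +
    w.reliability * s.successRate;
  return base * s.health; // a down provider scores 0 regardless of mode
}
```

The health multiplier guarantees that a down provider can never outrank a healthy one, whatever the mode weights say.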

Semantic Cache Detail

The cache is tenant-scoped — one tenant's responses never leak to another. The default similarity threshold of 0.92 is empirically calibrated: high enough to avoid false positives (serving a cached answer to a question it does not actually match), low enough to still catch paraphrased duplicates.

Embedding model: text-embedding-004 (Google)
Similarity metric: Cosine similarity
Default threshold: 0.92 (configurable per request)
Storage backend: Redis with vector index
Scope: Per-tenant (never shared across accounts)
Cache hit latency: < 50 ms (vs. 1–10 s for an LLM call)
Cache hit cost: $0.00
TTL: 24 hours (default)
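A minimal sketch of the hit test, assuming plain number arrays for embeddings (in production the comparison runs inside Redis's vector index, not in application code):

```typescript
// Cosine similarity between the request embedding and a stored embedding,
// compared against the default 0.92 threshold. Illustrative only.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const SIMILARITY_THRESHOLD = 0.92; // configurable per request

function isCacheHit(request: number[], stored: number[]): boolean {
  return cosineSimilarity(request, stored) > SIMILARITY_THRESHOLD;
}
```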

Opt out of caching

To bypass the cache for a specific request (e.g., real-time data queries), set "cache": false in the p402 object. The request will still be routed normally but the response will not be stored.
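For example, a request body that opts out of caching for a real-time query might be built like this (the helper function is illustrative; the p402 shape is from this page):

```typescript
// Build a request body that skips the semantic cache entirely.
function buildRealtimeRequest(question: string) {
  return {
    messages: [{ role: "user", content: question }],
    // cache: false skips both lookup and storage — routing still runs normally
    p402: { cache: false },
  };
}
```

POST this body to the completions endpoint as usual; the response is routed normally but never enters the cache.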

Automatic Failover

The router holds a ranked list of three provider candidates for every request. If the top-ranked provider fails for any reason, the router immediately retries with the next candidate — no delay, no error surfaced to your client.

| Failure | Action |
| --- | --- |
| HTTP 429 (rate limit) | Immediate retry with candidate #2 |
| HTTP 5xx (server error) | Immediate retry with candidate #2 |
| Connection timeout (> 30 s) | Abort + retry with candidate #2 |
| All 3 candidates fail | Return HTTP 503 with PROVIDER_UNAVAILABLE |
| Budget exceeded mid-stream | Return HTTP 403 with BUDGET_EXCEEDED; no retry |
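The failure handling above can be condensed into a small decision function. Names and shapes here are illustrative:

```typescript
// Sketch of the failover decision table; not the actual router code.
type FailureReason = "rate_limit" | "server_error" | "timeout" | "budget_exceeded";
type Outcome =
  | { action: "retry" }
  | { action: "fail"; status: number; code: string };

function nextAction(reason: FailureReason, candidatesLeft: number): Outcome {
  // Budget exhaustion is terminal — retrying another provider would still spend money.
  if (reason === "budget_exceeded") return { action: "fail", status: 403, code: "BUDGET_EXCEEDED" };
  // All three ranked candidates have already failed.
  if (candidatesLeft === 0) return { action: "fail", status: 503, code: "PROVIDER_UNAVAILABLE" };
  // 429, 5xx, and timeouts: immediate retry with the next-ranked candidate.
  return { action: "retry" };
}
```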

Failover is transparent

The p402_metadata.provider field in the response tells you which provider actually served the request. If failover occurred, this will differ from what you might expect based on your mode. You can use this to diagnose provider degradation.
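A client-side check along these lines can surface degradation early. The metadata fields are from this page; the check itself is a sketch:

```typescript
// Flag responses served by an unexpected provider — a sign of failover
// or degradation. Illustrative client-side code, not part of P402.
interface RoutingMeta {
  provider: string;  // provider that actually served the request
  failover: boolean; // true when the primary candidate failed
  mode: string;
}

function checkRouting(meta: RoutingMeta, expectedProvider: string): string | null {
  if (meta.failover || meta.provider !== expectedProvider) {
    return `failover: expected ${expectedProvider}, served by ${meta.provider} (mode=${meta.mode})`;
  }
  return null; // primary provider served the request as expected
}
```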

OpenRouter Meta-Provider

P402 treats OpenRouter as a single provider that proxies 300+ models. When your routing mode selects OpenRouter, the model within OpenRouter is further optimized based on your mode. This means you get access to every new frontier model the moment OpenRouter adds it — no adapter changes required on your end.

Model freshness

GPT-4.1, Claude 4.5, and Gemini 2.0 Pro are available the day they land on OpenRouter.

Unified billing

One OPENROUTER_API_KEY covers all 300+ models. P402 adds a transparent 1% routing fee on top.

Fallback depth

If your primary direct provider (e.g. Anthropic) fails, OpenRouter serves as a deep fallback pool.

Model pinning

Force a specific model with "model": "anthropic/claude-opus-4" in configuration.
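A pinned request body might look like this (the message content is a placeholder; the slug follows OpenRouter's vendor/model convention):

```typescript
// Pin a specific model, skipping model selection inside OpenRouter.
const pinned = {
  messages: [{ role: "user", content: "Summarize this contract." }],
  p402: { model: "anthropic/claude-opus-4" },
};
```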

Per-Request Configuration

The p402 object in your request body controls routing behavior for that request. All fields are optional; omitting them uses account defaults.

json — p402 configuration object
{
  "messages": [...],
  "p402": {
    "mode": "cost",          // "cost" | "speed" | "quality" | "balanced"
    "cache": true,           // true = use semantic cache (default: true)
    "session_id": "ses_...", // budget-capped session (optional)
    "max_cost_usd": 0.01,    // hard ceiling per request (optional)
    "provider": "anthropic", // force a specific provider (optional)
    "model": "claude-opus-4" // force a specific model (optional)
  }
}
mode
Selects the weight profile for provider scoring. Defaults to "balanced" if not set.
cache
Set false to skip semantic cache lookup and storage for this request. Useful for real-time or personalized responses.
session_id
Attaches this request to a budget-capped session. Requests that would exceed the session budget return HTTP 403 before the LLM is called.
max_cost_usd
A per-request cost ceiling. If the cheapest available provider exceeds this, you get COST_LIMIT_EXCEEDED rather than a surprise charge.
provider
Bypasses scoring entirely and routes to this specific provider. Use for compliance requirements or testing. Falls back to normal routing if the provider is down.
model
Forces a specific model. The provider that serves this model is selected automatically unless you also specify provider.
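For typed clients, the field descriptions above suggest a shape like the following. This is a sketch inferred from this page, not an official SDK type:

```typescript
// Possible TypeScript shape for the p402 configuration object.
interface P402Config {
  mode?: "cost" | "speed" | "quality" | "balanced"; // default: "balanced"
  cache?: boolean;       // default: true; false skips lookup and storage
  session_id?: string;   // attach to a budget-capped session
  max_cost_usd?: number; // hard per-request cost ceiling
  provider?: string;     // bypass scoring; falls back to normal routing if down
  model?: string;        // serving provider inferred unless provider is also set
}

// Apply account defaults for omitted fields.
function withDefaults(cfg: P402Config): P402Config {
  return { mode: "balanced", cache: true, ...cfg };
}
```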

Routing Decision in the Response

Every response includes a p402_metadata field that tells you exactly what happened: which provider was selected, what it cost, and whether the response came from cache.

json — p402_metadata
{
  "p402_metadata": {
    "provider": "deepseek",        // Provider that served the request
    "model": "deepseek-v3",        // Model used
    "cost_usd": 0.0003,            // What you were charged
    "direct_cost": 0.0031,         // What GPT-4o would have cost
    "savings": 0.0028,             // Savings from intelligent routing
    "input_tokens": 24,
    "output_tokens": 187,
    "cached": false,               // true = served from semantic cache
    "latency_ms": 1240,            // Time from request to first token
    "mode": "cost",                // Mode used for this request
    "failover": false              // true = primary provider failed, used fallback
  }
}
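The savings field is simply direct_cost minus cost_usd — in the example, 0.0031 − 0.0003 = 0.0028. A small sketch for sanity-checking that invariant client-side:

```typescript
// Verify savings = direct_cost - cost_usd, with an epsilon for
// floating-point currency values. Illustrative client-side check.
function verifySavings(meta: { cost_usd: number; direct_cost: number; savings: number }): boolean {
  return Math.abs(meta.direct_cost - meta.cost_usd - meta.savings) < 1e-9;
}
```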