Model Routing
Laghav's DistilBERT ONNX classifier routes every prompt to the cheapest capable model in under 5 ms. On average, 68% of prompts are classified as simple and routed to Haiku, saving 98% versus Opus on those queries.
How it works
When you set `model="auto"`, Laghav classifies the compressed prompt into one of four categories (simple, translation, code, complex) using a fine-tuned DistilBERT model exported to ONNX (3.4 ms inference on CPU). If the classifier's confidence is below 0.70, Laghav falls back to a pattern-matching rule set.
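The confidence gate described above can be sketched as follows. This is a minimal illustration, not Laghav's implementation: only the 0.70 threshold and the four category names come from this page. `classify_with_fallback`, `Decision`, and `toy_rules` are hypothetical names, and the keyword rules are invented stand-ins for the real rule set.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Threshold from the docs; all names and rules below are hypothetical.
CONFIDENCE_THRESHOLD = 0.70

# A classifier returns (category, confidence); a rule set returns a category.
MLClassifier = Callable[[str], Tuple[str, float]]
PatternRules = Callable[[str], str]


@dataclass
class Decision:
    category: str  # one of: simple, translation, code, complex
    source: str    # "ml" if the classifier was trusted, "pattern" otherwise


def classify_with_fallback(prompt: str, ml: MLClassifier, rules: PatternRules) -> Decision:
    category, confidence = ml(prompt)
    if confidence >= CONFIDENCE_THRESHOLD:
        return Decision(category, "ml")
    # Confidence below 0.70: discard the model's guess and use the rule set.
    return Decision(rules(prompt), "pattern")


def toy_rules(prompt: str) -> str:
    """Hypothetical keyword rules standing in for Laghav's real rule set."""
    text = prompt.lower()
    if "translate" in text:
        return "translation"
    if "code" in text or "bug" in text or "def " in text:
        return "code"
    if len(text.split()) <= 12:
        return "simple"
    return "complex"


def confident_model(prompt: str) -> Tuple[str, float]:
    return ("simple", 0.91)  # stand-in for the ONNX classifier


def unsure_model(prompt: str) -> Tuple[str, float]:
    return ("complex", 0.42)  # below threshold, so the rules decide


if __name__ == "__main__":
    prompt = "Translate 'hello' into French"
    print(classify_with_fallback(prompt, confident_model, toy_rules))
    # Decision(category='simple', source='ml')
    print(classify_with_fallback(prompt, unsure_model, toy_rules))
    # Decision(category='translation', source='pattern')
```

Returning the decision source alongside the category makes it easy to report whether the model or the rules made the call, which is the kind of information `routing_reason` exposes.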
| Category (share of prompts) | Routed to | Typical examples | Savings vs Opus |
|---|---|---|---|
| simple (68%) | claude-haiku-3 | FAQ, yes/no, greetings, classification | 98% |
| translation (8%) | claude-haiku-3 | Any language translation task | 98% |
| code (19%) | claude-sonnet-4 | Code gen, debugging, review | 80% |
| complex (5%) | claude-opus-4 | Research, legal, multi-step reasoning | 0% |
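Read as a weighted average, the table implies an overall saving versus sending every prompt to Opus. The sketch below copies the table into a dictionary and computes that figure. It assumes every prompt would cost the same on Opus. Code and complex prompts are usually longer, so the real blended saving is likely lower. `TIERS`, `route`, and `blended_savings` are hypothetical helpers, not part of Laghav's API.

```python
# Category -> (model, share of prompts, savings vs Opus), copied from the table above.
TIERS = {
    "simple": ("claude-haiku-3", 0.68, 0.98),
    "translation": ("claude-haiku-3", 0.08, 0.98),
    "code": ("claude-sonnet-4", 0.19, 0.80),
    "complex": ("claude-opus-4", 0.05, 0.00),
}


def route(category: str) -> str:
    """Return the model for a category (hypothetical helper)."""
    return TIERS[category][0]


def blended_savings(tiers: dict) -> float:
    """Share-weighted savings vs an all-Opus baseline.

    Assumes every prompt would cost the same on Opus, which real traffic
    rarely satisfies.
    """
    total_share = sum(share for _, share, _ in tiers.values())
    weighted = sum(share * saving for _, share, saving in tiers.values())
    return weighted / total_share


if __name__ == "__main__":
    print(route("code"))                    # claude-sonnet-4
    print(f"{blended_savings(TIERS):.1%}")  # 89.7%
```

The weighted sum is 0.68 × 0.98 + 0.08 × 0.98 + 0.19 × 0.80 + 0.05 × 0 = 0.8968, or about 89.7% under the equal-cost assumption.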
Routing reason in response
routing.py
```python
# `client` is an initialized Laghav client; `messages` is your message list.
response = client.complete(messages=messages, model="auto")

print(response.laghav_meta.routing_reason)   # "faq_pattern"
print(response.laghav_meta.model_requested)  # "auto"
print(response.model)                        # "claude-haiku-3-20240307"
```
Override routing
override.py
```python
# Force a specific model and bypass routing.
response = client.complete(
    messages=messages,
    model="claude-opus-4",            # always uses Opus
    laghav_options={"route": False},  # disable the routing middleware
)
```
routing_reason values
Common `routing_reason` values: `faq_pattern`, `translation_task`, `code_task`, `analytical`, `ml_high_confidence`, `ml_fallback_pattern`.