The Intelligent
Triage Layer
for AI.

Most apps use premium models for tasks that don't need them. Inferenz sits between your app and your providers, intelligently triaging queries to slash your inference bill without touching your code.

Join the waitlist

How it works The pipeline How it starts The proof The deal

01 — How it works

Four steps.
One line of code changed.

Inferenz intercepts every prompt before it reaches the model. In milliseconds, it decides which model should handle it. Your users never notice. Your invoice does.

01 — Intercept

Your traffic hits us first

You change your base URL from api.openai.com to your Inferenz endpoint. Every API call flows through us before it reaches any provider.

02 — Classify

Two engines, one decision

The Complexity Engine scores every prompt in under 5ms — no API call, no cost. Ambiguous prompts trigger the Cascade Engine: try cheap first, escalate only if needed.

03 — Route

Right model, right cost

Clear-cheap prompts go to GPT-4o mini ($0.15/1M). Clear-complex stay premium. Ambiguous prompts resolve cheap 60–80% of the time. Repeat or similar prompts are served instantly from cache at zero cost.

04 — Account

Every saving is logged

Shadow accounting tracks what you paid and what you would have paid — request by request, engine by engine. Savings are verifiable, transparent, and the basis of every invoice.

Live routing simulation Cache → Complexity → Cascade

Loading prompt...

—

GPT-4o mini · $0.15/1M

Gemini 2.0 Flash · $0.40/1M

Claude Sonnet 4.6 · $3.00/1M

02 — The pipeline

Three layers.
Zero wasted spend.

Most routing tools make one guess per prompt. Inferenz runs three layers — cache, classify, cascade — so repeat prompts cost nothing, clear prompts route instantly, and ambiguous ones get a safety net.

Layer 00

Cache Layer

Pre-classification · <1ms · $0 cost

Checks every prompt against stored responses before the classifier even runs. Exact matches return instantly. Similar prompts are caught by semantic matching using Jaccard + character n-gram similarity — no embeddings, no API calls.

Up to ~80% hit rate on repetitive workloads

Engine 01

Complexity Engine

Pre-generation · <5ms · no API call

Analyses the prompt before any model sees it. Scores cognitive complexity using Bloom's Taxonomy, information density via Shannon Entropy, structural signals, and domain vocabulary. Deterministic — same prompt, same decision, every time.

Handles ~80% of cache misses

Engine 02

Cascade Engine

Post-generation · response verification

For ambiguous prompts, tries the cheap model first then scores the response across four confidence signals: completeness, hedging language, refusal detection, and format compliance. Escalates to premium only when needed. Based on AutoMix (NeurIPS 2024).

Handles ~20% of cache misses · 60–80% resolve cheap

03 — How it starts

A free audit.
No commitment.

Before you change anything, we show you exactly what you'd save. Export your API logs, send them to us, and we return a one-page report with your numbers.

Export your API logs

Request a 24–48 hour CSV export from your OpenAI or Anthropic dashboard. We only need token counts and model names — no prompt content required.

We run both engines

We run the Complexity Engine and simulate Cascade routing against every row. Takes about 30 minutes on our end.

You see the numbers

We return a one-page savings report: current daily spend, optimised spend, projected monthly saving. If the numbers don't stack up, we'll tell you.

One line of code

Change your base URL. We handle routing, privacy compliance, and billing automatically.

baseURL: "https://api.inferenz.com.au/[your-id]"

04 — The proof

What a real audit
looks like.

A 24-hour log from a mid-size AI product. 12,400 requests, all routed to GPT-4o. Here's what both engines found.

Inferenz Audit Report — Sample client · 24h period Sample data

Requests analysed

100% on GPT-4o

Safely rerouted

37% of all requests

Daily spend — current

$292

AUD · all on premium

Daily spend — Inferenz

$185

AUD · optimised routing

Monthly saving

$3,226

AUD · projected 30 days

Key insight

37% of prompts needed no premium model at all

Based on a simulated 24-hour log.

05 — The deal

We don't charge
unless we save you money.

No monthly fee. No setup cost. No lock-in. We take 10% of the savings we generate — nothing more.

10%

of verified monthly savings

Savings are calculated from shadow accounting data — every request logged, every dollar accounted for. You can verify every line. We invoice via Stripe automatically.

Our incentive is always aligned with yours: the more we save you, the more we earn.

Example — 30 day period

Without Inferenz$5,550 AUD

With Inferenz$1,650 AUD

Total saved$3,900 AUD

Inferenz fee (10%)$390 AUD

You keep$3,510 AUD

06 — Early access

Join the waitlist.
Get the first audit free.

We're onboarding a small group of AU AI companies to validate the platform. Sign up below — if you're spending on inference we'll run a free audit and show you your numbers before you commit to anything.

Takes 2 minutes

Tell us about your AI spend.
We'll show you your savings.

The form covers your company, monthly AI spend, and what you're building. Takes 2 minutes. After you submit, we'll run a free audit and send you a one-page savings report within 48 hours.

Apply for early access

We review every application personally · AI companies only · Reply within 48 hours

Free

Audit cost

48h

Response time

10%

Of savings only

The IntelligentTriage Layerfor AI.

Four steps.One line of code changed.

Three layers.Zero wasted spend.

A free audit.No commitment.

What a real auditlooks like.

We don't chargeunless we save you money.

Join the waitlist.Get the first audit free.

Want to find out more?Get in touch with us