On-demand vs provisioned throughput, batch inference, model selection, and token caching — with actual numbers.
TL;DR — Five strategies that compound to 60-80% savings on Bedrock:
- Batch inference for async workloads → 50% discount, zero quality impact
- Prompt caching for repeated context → 90% cheaper on cached input tokens
- Model routing (cheap models for simple tasks) → 40-70% cost reduction
- Benchmarking + LLM-as-a-judge to validate cheaper models are good enough
- Observability (OTel + DoiT GenAI Intelligence) to catch drift and keep savings locked in
We took one customer from $40K/month to $18K using this playbook. Details below.
I lead the APAC engineering team at DoiT. Last quarter, a financial services customer asked me to review their Bedrock spend. They were running Claude Sonnet for everything — ticket classification, document extraction, customer-facing chat — and their monthly bill had quietly crept past $40K. Six weeks later, we had it down to $18K without touching output quality.
This is that playbook. If you own Bedrock costs for an engineering or data team, these are the levers that actually move your bill.
The customer examples in this article are composites based on patterns across multiple engagements — not individual case studies.
Understanding the Two Pricing Models
Bedrock has two pricing models, and I've seen expensive mistakes in both directions. (For a detailed breakdown of how Bedrock pricing works — including fine-tuning costs, token estimation, and tagging strategies — see Josh Palmer's CloudOps guide to Bedrock pricing. This article focuses on the engineering decisions that cut the bill.)
On-Demand Pricing
You pay per token — input and output, priced separately. No commitment, no minimum.
Pick this when your usage is bursty, you're still experimenting, or your monthly spend is below the crossover point (more on this below).
The thing that catches people: output tokens typically cost 3-5x more than input tokens. That financial services customer I mentioned? They were estimating costs based on input token pricing alone and underestimating their actual bill by nearly 3x. A task that consumes 1,000 input tokens but generates 2,000 output tokens is dominated by the output cost. Always model both sides.
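A quick way to avoid that trap is to model both sides explicitly. A minimal sketch — the per-1K prices here are illustrative placeholders, not current Bedrock rates:

```python
def monthly_cost(requests, in_tokens, out_tokens, in_price_per_1k, out_price_per_1k):
    """Estimate monthly on-demand cost, pricing input and output tokens separately."""
    input_cost = requests * in_tokens / 1000 * in_price_per_1k
    output_cost = requests * out_tokens / 1000 * out_price_per_1k
    return input_cost + output_cost

# 1,000 input / 2,000 output tokens per request, 100K requests/month.
# Prices are placeholders -- check the Bedrock pricing page for real rates.
input_only = 100_000 * 1_000 / 1000 * 0.003   # what an input-only estimate sees
total = monthly_cost(100_000, 1_000, 2_000, 0.003, 0.015)
print(f"input-only estimate: ${input_only:,.0f}, actual: ${total:,.0f}")
```

The gap between `input_only` and `total` is the underestimate that catches teams out.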
Provisioned Throughput
You purchase model units by the hour (or with 1-month/6-month commitments). A model unit is Bedrock's chunk of reserved capacity — think "a lane on the highway" you pay for by the hour whether you use it or not.
Pick this when you have sustained, predictable workloads, your utilization consistently exceeds the crossover threshold, and you need guaranteed latency.
I worked with a media company last year that had provisioned throughput running at 15% utilization. They'd set it up during a load test, forgot to switch back, and bled money for three months before anyone noticed. Switching them to on-demand was a one-line config change that saved them thousands immediately. Provisioned throughput with low utilization is the single most expensive mistake on Bedrock.
Finding the Crossover Point
Monthly on-demand cost = (input_tokens × input_price) + (output_tokens × output_price)

Monthly provisioned cost = model_units × hourly_rate × 730 hours

For most models, provisioned becomes cheaper around 60-70% sustained utilization of a model unit. The word "sustained" is doing heavy lifting there — don't look at peak usage. Traffic that sits around 70% all day with small spikes is a good candidate; traffic that idles at 10% and spikes to 100% for five minutes every hour is not.
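The break-even utilization falls out of those two formulas directly. A sketch with entirely hypothetical numbers for the hourly rate, unit capacity, and blended token price:

```python
HOURS_PER_MONTH = 730

def on_demand_cost(tokens_per_month, blended_price_per_1k):
    """On-demand cost at a blended (input+output weighted) price per 1K tokens."""
    return tokens_per_month / 1000 * blended_price_per_1k

def provisioned_cost(model_units, hourly_rate):
    """Provisioned cost: you pay for the hours whether you use them or not."""
    return model_units * hourly_rate * HOURS_PER_MONTH

def breakeven_utilization(unit_capacity_tokens_per_month, hourly_rate, blended_price_per_1k):
    """Fraction of one model unit's capacity you must sustain before
    provisioned beats on-demand."""
    return provisioned_cost(1, hourly_rate) / on_demand_cost(
        unit_capacity_tokens_per_month, blended_price_per_1k)

# Hypothetical unit: $40/hour, 15B tokens/month capacity, $0.003/1K blended.
print(f"break-even: {breakeven_utilization(15_000_000_000, 40.0, 0.003):.0%}")
```

With these made-up numbers the break-even lands around 65%, which is why sustained utilization below that level is a red flag for provisioned capacity.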
Batch Inference: The Overlooked 50% Discount
This is probably my favourite thing to talk about with customers because the reaction is always the same: "Wait, we can just... pay half?"
Yes. Bedrock prices batch tokens at ~50% of on-demand rates for supported models. Same models, same quality, half the price. (As of writing — check the pricing page before you bet a roadmap on that number.) The tradeoff is a 24-hour SLA with no streaming — your jobs go into a queue and complete asynchronously.
Any workload that doesn't need a real-time response is a candidate. Categorizing support tickets, processing document corpora for RAG, generating product descriptions, running eval suites, pulling structured data from unstructured documents. I'd estimate most teams have 30-40% of their Bedrock workload that could move to batch without any user-facing impact.
Here's what that looks like concretely. Take 10,000 requests averaging 500 input tokens and 250 output tokens each, using Claude Sonnet (on-demand pricing in ap-southeast-2):
| Mode | Input Cost | Output Cost | Total |
|---|---|---|---|
| On-demand | $15.00 | $37.50 | $52.50 |
| Batch | $7.50 | $18.75 | $26.25 |
$26.25 saved on a single batch run. Daily pipelines compound that to thousands per month. And the implementation is just JSONL files in S3 — it's a plumbing change, not an architecture change.
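A sketch of that plumbing: the input file is one JSON object per line, each with a recordId and a modelInput in the target model's native request format, and the job is submitted with the CreateModelInvocationJob API. Field names follow the Bedrock batch docs as of writing; the job name, S3 URIs, and Anthropic-style body are assumptions to adapt to your setup:

```python
import json

def build_batch_records(prompts, max_tokens=200):
    """One JSON line per request, in the Anthropic-on-Bedrock message format."""
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "recordId": f"rec-{i:06d}",
            "modelInput": {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return "\n".join(lines)

def submit_batch_job(bedrock, model_id, input_s3_uri, output_s3_uri, role_arn):
    """Submit the JSONL (already uploaded to S3) as a batch inference job.
    `bedrock` is the boto3 control-plane client: boto3.client("bedrock")."""
    return bedrock.create_model_invocation_job(
        jobName="ticket-classification-batch",  # placeholder name
        modelId=model_id,
        roleArn=role_arn,
        inputDataConfig={"s3InputDataConfig": {"s3Uri": input_s3_uri}},
        outputDataConfig={"s3OutputDataConfig": {"s3Uri": output_s3_uri}},
    )
```

Upload the output of `build_batch_records` to S3, point the job at it, and poll for completion; results land back in the output bucket as JSONL.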
Model Selection as Cost Strategy
Choosing the right model is a cost decision with 10-60x variance. Most teams default to something more expensive than they need because they picked it during prototyping and never revisited the decision.
The Tiered Architecture
The most cost-effective Bedrock deployments I've seen use tiers:
- Tier 1 (Haiku, Mistral 7B, Llama 3 8B) — classification, routing, extraction, simple Q&A. Fractions of a cent per request. Should handle 60-80% of your traffic.
- Tier 2 (Sonnet, Mistral Large, Llama 3 70B) — complex reasoning, nuanced generation. 5-10x cheaper than top-tier.
- Tier 3 (Opus, Claude 3.5 Sonnet) — the hardest tasks only. Premium pricing, reserved for work that genuinely requires it.
The Router Pattern
Bedrock has a built-in solution for within-family routing only. Intelligent Prompt Routing gives you a single serverless endpoint that routes requests between models in the same family based on prompt complexity. Specify a model family ARN (e.g., Anthropic Claude), and Bedrock decides whether each request goes to Haiku or Sonnet. AWS claims up to 30% cost reduction — from what I've seen with customers, that's a reasonable ballpark. Zero custom code, you just swap your model ID for the prompt router ARN.
For cross-family routing (Haiku for classification, Mistral for extraction, Sonnet for complex reasoning), you need to build something yourself. And honestly, that's what most teams in production do. There's no mature off-the-shelf framework that solves this well enough to recommend without caveats.
Three patterns that work, roughly in order of maturity:
Rule-based classification is where most teams start and many stay. Write 5-10 rules based on structural signals — task type keywords, input length, expected output format, conversation depth. No ML, no dependencies, deployable in a day. Handles 70-80% of routing decisions correctly. It's just a few hundred lines of conditional logic.
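A sketch of what those few hundred lines boil down to. The model IDs, thresholds, and keywords are illustrative placeholders to tune against your own traffic:

```python
def route(prompt: str, task_type: str, turn_count: int = 1) -> str:
    """Pick a model tier from structural signals -- no ML, no dependencies."""
    TIER1 = "anthropic.claude-3-haiku-20240307-v1:0"   # cheap, bounded tasks
    TIER2 = "anthropic.claude-3-sonnet-20240229-v1:0"  # complex reasoning
    TIER3 = "anthropic.claude-3-opus-20240229-v1:0"    # hardest tasks only

    # Well-bounded task types stay on tier 1 regardless of content.
    if task_type in {"classification", "routing", "extraction"}:
        return TIER1
    # Very long inputs or deep conversations suggest harder reasoning.
    if len(prompt) > 8_000 or turn_count > 6:
        return TIER3
    # Keyword signals for analytical work land on tier 2.
    keywords = ("analyze", "compare", "explain why", "trade-off")
    if any(k in prompt.lower() for k in keywords):
        return TIER2
    return TIER1  # default cheap
```

The point isn't the specific rules; it's that the whole router is inspectable and a bad routing decision can be traced to a single line.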
Confidence-based cascading is the highest-ROI pattern I've seen. Send every request to the cheapest model first. Check the response for hedging language, malformed output, or missing required fields. If it fails, escalate to the next tier. Expect an extra model call's worth of latency on the ~5-15% of escalated requests. One team I worked with — an e-commerce platform processing product queries — went from $38K/month to $15K/month this way. Their escalation rate was only 8%. You don't need the classifier to be perfect; you need it to be right often enough that escalation handles the exceptions. If your escalation rate creeps above 25-30%, fix the classifier.
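A minimal sketch of the cascade. The invoke function, the hedging phrases, and the quality check are all placeholders you would replace with your own client and failure signals:

```python
def looks_good(response: str) -> bool:
    """Crude quality gate: reject empty output or obvious hedging language.
    Real checks would also validate required fields and output format."""
    if not response or not response.strip():
        return False
    hedges = ("i'm not sure", "i cannot", "as an ai")
    return not any(h in response.lower() for h in hedges)

def cascade(prompt, tiers, invoke, check=looks_good):
    """Try the cheapest model first; escalate through `tiers` on weak responses.
    `invoke(model_id, prompt)` is your Bedrock call, injected for testability."""
    response = None
    for model_id in tiers:
        response = invoke(model_id, prompt)
        if check(response):
            return model_id, response
    return tiers[-1], response  # last tier's answer, even if still weak
```

The extra latency lives only on the escalated minority of requests, which is why the economics work as long as the escalation rate stays low.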
Hybrid rules + ML classifier is the graduation step. Once you have a few weeks of production traffic with labeled outcomes, train a small classifier (DistilBERT or logistic regression on TF-IDF). The classifier itself adds single-digit milliseconds of latency; the real cost is the engineering discipline of labeling data and retraining periodically. Teams that do this typically improve routing accuracy from ~78% to ~91%.
I looked hard at existing frameworks. RouteLLM from the LMSys team is research-backed but fundamentally a two-model router trained on general chat data — for domain-specific Bedrock workloads, you'd need to retrain. LiteLLM is excellent for infrastructure (unified API, fallbacks, retries) but its routing is load balancing, not quality-based model selection. Neither is a drop-in solution for the cross-family problem.
Start with Intelligent Prompt Routing. When you have evidence that cross-family routing would save meaningful money, build a rule-based classifier. It's predictable, debuggable, and the teams I work with who've gone this route consistently report 40-70% cost reductions.
Benchmarking Model Quality vs Cost
Don't assume you need the most expensive model. Run your actual tasks across multiple models and measure accuracy on your specific use case, cost per task (not per token), and latency.
We ran this exercise with the financial services customer. Their ticket classification task scored 94% accuracy on Haiku vs 97% on Sonnet — at 1/15th the cost. That 3% gap didn't matter for their use case. The numbers will vary for yours, but the pattern is consistent: benchmark before you commit.
Automating Quality Evaluation with LLM-as-a-Judge
For classification or extraction you can validate quality programmatically, but for open-ended generation — summaries, explanations, customer responses — scoring traditionally meant human reviewers. That doesn't scale.
Bedrock has a built-in answer: LLM-as-a-judge model evaluation. You provide a dataset of prompts (optionally with ground truth responses), pick an evaluator model, and Bedrock runs the candidate model's outputs through the judge across metrics like correctness, completeness, helpfulness, coherence, relevance, and safety. It runs as a managed job — you get back per-prompt scores and aggregates, stored in S3.
The point is to prove that a model costing 5-15x less per token is "good enough" before you move real traffic. Prepare a JSONL file with your representative prompts, run the same evaluation job against Haiku, Sonnet, and whatever else you're considering, and compare scores side by side. AWS published a walkthrough with code samples and there's a companion Jupyter notebook on GitHub.
A few things I've learned using this with customers.
Use the same evaluator model across all comparisons — if you judge Haiku's output with one model and Sonnet's with another, you're measuring evaluator differences, not model differences.
Include your actual production prompts, not generic benchmarks. MT-Bench is fine for a sanity check, but your ticket classification prompts will behave differently than academic question-answering.
The evaluation itself costs money (the judge model processes every output), so start with 200-500 representative prompts rather than your entire dataset. That's usually enough to see whether the cheap model is clearly worse, clearly fine, or "needs more investigation."
Ground truth is optional but valuable — for tasks where you have known-good outputs, the faithfulness and correctness metrics become much more meaningful.
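Preparing that JSONL file is a few lines of code. The prompt and referenceResponse field names follow my reading of the Bedrock model-evaluation dataset format at the time of writing; verify against the current docs before submitting a job:

```python
import json

def build_eval_dataset(examples, path="eval_prompts.jsonl"):
    """Write a Bedrock-evaluation-style JSONL file.

    examples: list of (prompt, reference) pairs; reference may be None for
    tasks without known-good outputs. Field names are assumptions to check
    against the current Bedrock docs.
    """
    with open(path, "w") as f:
        for prompt, reference in examples:
            record = {"prompt": prompt}
            if reference is not None:
                record["referenceResponse"] = reference
            f.write(json.dumps(record) + "\n")
    return path
```

Run the same file through evaluation jobs for each candidate model so the only variable is the model under test.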
benchmark_bedrock.py captures cost and latency; LLM-as-a-judge scores quality. Run both and you've got the full cost-vs-quality picture without manual review.
Prompt Caching: 90% Savings on Repeated Context
It surprises me how few teams have this enabled. Cached input tokens are billed at 90% less than standard pricing.
When you send a request with caching enabled, Bedrock stores the prefix of your prompt. Subsequent requests sharing that prefix hit the cache. Cache hits cost ~10% of the normal rate. Cache writes (first request) cost slightly more, so it only saves money when you reuse the same prefix multiple times — but if you have a system prompt or few-shot examples, the payback is immediate.
Where it shines: system prompts over 1,500 tokens, static few-shot examples, shared document context in RAG applications, multi-turn conversation history.
The math is stark. A 2,000-token system prompt with 10,000 requests/day on Claude Sonnet:
| Scenario | Daily Cost (system prompt portion) |
|---|---|
| No caching | 2,000 × 10,000 × $0.003/1K = $60.00 |
| With caching (99% hit rate) | ~$7.00 |
$53/day. Over $1,500/month. On the system prompt alone. I walked through this calculation with a customer last month and they enabled caching before our call ended.
Gotchas to keep in mind:
- Minimum cacheable prefix length varies by model (typically 1,024-2,048 tokens)
- Exact-prefix matching — whitespace and ordering matter
- TTL / expiry after inactivity (your cache goes cold if traffic drops)
- Not all models support caching yet — check current docs
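For models that support it through the Converse API, enabling caching is a matter of inserting a cache checkpoint after the static prefix. A sketch assuming the cachePoint syntax documented at the time of writing — confirm model support before relying on it:

```python
def cached_system_blocks(system_prompt):
    """System blocks with a cache checkpoint: everything before the
    cachePoint marker is eligible for caching on subsequent requests."""
    return [
        {"text": system_prompt},
        {"cachePoint": {"type": "default"}},
    ]

def converse_with_cached_system(client, model_id, system_prompt, user_message):
    """Converse call reusing the cached system prompt across requests.
    `client` is boto3.client("bedrock-runtime")."""
    return client.converse(
        modelId=model_id,
        system=cached_system_blocks(system_prompt),
        messages=[{"role": "user", "content": [{"text": user_message}]}],
    )
```

Because matching is exact-prefix, keep the system prompt byte-identical across requests and put anything dynamic after the checkpoint.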
Building a Benchmarking Framework
Your optimization decisions should be driven by data from your actual workloads. Here's what my team recommends.
For every model and configuration you test, capture: cost per task, input/output token counts, latency (time to first token and total), and a task-specific quality score.
The process: pick 3-5 tasks that represent your actual workload mix, create 100-500 test examples per task with known-good outputs, run each task on every candidate model, calculate cost per task (not per token — models tokenize differently), and plot cost vs quality to find the knee of the curve. Even a rough scatter plot is enough — you're looking for "same quality, much cheaper" or "slightly worse quality, dramatically cheaper."
I've put together a Python script (benchmark_bedrock.py) that automates this — runs prompts across multiple Bedrock models, records token counts and latency, outputs CSV, prints summary stats. Fork it, adapt it to your workloads.
When interpreting results, look for models where quality plateaus (Haiku at 93% vs Sonnet at 95% — is that 2% worth 10x?), latency outliers, token efficiency differences (a model that's cheaper per token but 2x more verbose isn't actually cheaper), and batch candidates.
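The verbosity point is worth making concrete. In this hypothetical, model B is 30% cheaper per token but twice as verbose, and ends up more expensive per task (all prices are made-up placeholders):

```python
def cost_per_task(in_tokens, out_tokens, in_price_per_1k, out_price_per_1k):
    """Cost of one task: what per-token comparisons hide."""
    return (in_tokens / 1000 * in_price_per_1k
            + out_tokens / 1000 * out_price_per_1k)

# Model A: baseline prices, concise output.
a = cost_per_task(500, 200, 0.003, 0.015)
# Model B: 30% cheaper per token, but twice as many output tokens per task.
b = cost_per_task(500, 400, 0.0021, 0.0105)
print(f"A: ${a:.5f}/task, B: ${b:.5f}/task")  # B is the more expensive model
```

This is why the benchmarking loop should always record tokens actually consumed per task, not just list prices.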
Unified GenAI Cost Visibility with DoiT GenAI Intelligence
Most teams I work with aren't just running Bedrock. They've got OpenAI for some workloads, Anthropic direct for others, maybe Vertex AI in a GCP project. Cost data ends up scattered across four consoles with incompatible dimensions.
We built GenAI Intelligence to fix this. It consolidates AI cost and usage data from Bedrock, Vertex AI, Azure ML, Databricks, Anthropic, and OpenAI into a single normalized view. Your Bedrock spend appears alongside everything else. No custom ETL.
For Bedrock, the value is in the normalized labels. Model / Model Family tracks cost per model across providers. Usage Type breaks down input vs output token spend — the split most teams undercount. Cached shows whether operations used cached tokens, so you can measure caching ROI from billing data. GenAI Spend flags all generative AI costs regardless of provider for total budget tracking.
You can ask questions like "How much did cached tokens on Claude Sonnet save us last month?" or "What's our spend split between Haiku and Sonnet for ticket classification?" — across providers, in one place.
The dashboard ships with preset reports: spend by provider (monthly), cost by model family (daily for last 14 days — catches spikes fast), token breakdown by model (30 days), and hourly token usage by provider (last 2 days — useful for correlating with deployments). Because GenAI data flows into our Cloud Analytics engine, you also get budgets, allocations, anomaly detection, and forecasting on your AI spend.
Say you're running Haiku for classification and Sonnet for reasoning, with caching on both:
| Model | Cached | Monthly Cost | Token Usage |
|---|---|---|---|
| Claude Haiku | true | $45 | 18M tokens |
| Claude Haiku | false | $320 | 12M tokens |
| Claude Sonnet | true | $180 | 4M tokens |
| Claude Sonnet | false | $2,100 | 6M tokens |
Haiku's cache hit rate is solid (60% of tokens cached). Sonnet's is poor (40%). Fixing Sonnet's caching could save hundreds per month — and you spotted it without writing a single CloudWatch query.
To get started: connect your AWS account to DoiT, go to Dashboards → GenAI Intelligence, and your Bedrock data populates automatically. No instrumentation needed on the Bedrock side.
Observability with OpenTelemetry
Benchmarking gives you a point-in-time snapshot. Production workloads drift — prompts change, traffic patterns shift, new models get released. GenAI Intelligence covers the billing side of ongoing visibility. For application-level instrumentation — latency, routing decisions, cache hits per request — OpenTelemetry is what I recommend.
The GenAI semantic conventions mean any OTel-compatible backend (Grafana, Datadog, Honeycomb) will understand your spans. Wrap every Bedrock invocation in a span with these attributes:
- gen_ai.system — "aws.bedrock"
- gen_ai.request.model — the model ID
- gen_ai.usage.input_tokens / gen_ai.usage.output_tokens — from response metadata
- gen_ai.usage.cost — calculated from your pricing table
- task.type — your application-level task name
What you get: cost-per-task trends over time (catches prompt drift before it hits your bill), model routing validation (one customer's Tier 1 was handling 40% instead of the expected 70% — a bug costing $800/month extra), and cache hit rate monitoring (catch drops within hours, not at end of billing cycle).
You also get the data to correlate with CloudWatch's ModelUnitUtilization for provisioned throughput decisions.
Minimal instrumentation (the request body here assumes an Anthropic model — adjust for other families):

```python
import json

from opentelemetry import trace

tracer = trace.get_tracer("bedrock-client")

def invoke_with_telemetry(client, model_id, prompt, max_tokens, task_type="unknown"):
    with tracer.start_as_current_span("bedrock.invoke") as span:
        span.set_attribute("gen_ai.system", "aws.bedrock")
        span.set_attribute("gen_ai.request.model", model_id)
        span.set_attribute("gen_ai.request.max_tokens", max_tokens)
        span.set_attribute("task.type", task_type)

        # Anthropic-on-Bedrock request body; other model families differ.
        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
        })
        resp = client.invoke_model(modelId=model_id, body=body,
                                   contentType="application/json")
        data = json.loads(resp["body"].read())

        # Token counts come back in the response metadata.
        in_tok = data["usage"]["input_tokens"]
        out_tok = data["usage"]["output_tokens"]
        span.set_attribute("gen_ai.usage.input_tokens", in_tok)
        span.set_attribute("gen_ai.usage.output_tokens", out_tok)

        return data
```

From there, build four panels — daily cost by model, cost per task by model, cache hit rate, p50/p95 latency — and review them weekly.
Common Pitfalls
The mistakes I see most often, in rough order of how much money they waste, each with its fix:
Monitor provisioned throughput utilization weekly. Below 50% sustained? Switch to on-demand. That media company was wasting more on idle provisioned capacity than they spent on actual inference.
Model and cap output tokens. Output tokens cost 3-5x more than input. One team cut their output costs 40% just by adding "respond in JSON, no explanations" to their extraction prompts.
Batch everything that doesn't need sub-24-hour latency. 50% discount. No quality compromise.
Don't let Sonnet be the default. Using it for everything because it worked during prototyping is the most common form of overspend I see.
Turn on caching wherever you reuse prompts. Static system prompt or few-shot examples? Minutes of work, immediate savings.
Measure per task, not per token. A cheaper model that produces more verbose output can cost more per task. Benchmark at the task level.
Putting It All Together
The optimization path I walk most teams through:
- Batch inference — move async workloads. Immediate 50% savings, zero quality impact.
- Prompt caching — enable for repeated context. Minutes of work.
- Benchmark model alternatives — you'll likely find 60-80% of your workload runs fine on Tier 1.
- Model routing — build a classifier. Start with rules.
- Provisioned throughput — only after the above are stable and your usage is predictable.
These compound. The financial services customer I mentioned at the top? They did steps 1-4 over six weeks. Batch inference saved 22%. Caching saved 18%. Model routing saved another 25%. Total: from $40K to $18K.
This isn't a one-time project, though. The Bedrock landscape shifts constantly. When Claude 3.5 Haiku launched, it was meaningfully better than Claude 3 Haiku at roughly the same price point — teams that had hardcoded model IDs missed the upgrade for months. AWS has adjusted Bedrock pricing multiple times without much fanfare. And your own workload drifts: prompts get longer, new task types get added, traffic ratios shift.
What works: a lightweight quarterly review. Re-run your benchmarks against any new models (takes an afternoon with the benchmarking script). Use LLM-as-a-judge to validate quality hasn't regressed. Check your OTel dashboards for routing drift — is Tier 1 still handling the traffic percentage you expect? Has your cache hit rate changed?
If you're on DoiT, GenAI Intelligence will show you cost-per-model trends that make it obvious when something has shifted. The whole review takes half a day once you've done it once.
The media company I mentioned earlier discovered during one of these reviews that a prompt change had silently dropped their cache hit rate from 94% to 61% — an extra $2K/month that nobody had noticed.
If you're on Bedrock today and want a second set of eyes on your bill, I'm always happy to look — especially if you're in APAC. And if you're already a DoiT customer, you've got the GenAI Intelligence dashboard — use it as your starting point for that quarterly review.
The benchmarking script referenced in this article is available at https://github.com/p-obrien/bedrock-cost-model-blog. Run it against your own workloads to generate the data you need for informed decisions.