Case study

Cost Control for AI-Backed Features: Token Budgets and Semantic Caching

Reducing LLM API spend without degrading UX — semantic caching, per-tenant token budgets, and model selection by task complexity.

Redis · PostgreSQL · Node.js

Context

After we shipped the LLM-powered summarization feature, API costs spiked. Some tenants were heavy users; others barely touched it. We had no visibility into who was spending what, and identical or near-identical requests were hitting the API repeatedly. We needed to bring costs under control without making the feature feel worse for users.

Constraints

  • Multi-tenant — we needed per-tenant visibility and limits, not just a global cap
  • Semantic similarity — two slightly different prompts might deserve the same cached response
  • We couldn't degrade quality — the fallback to a cheaper/smaller model had to be acceptable for the use case

Architecture

We implemented semantic caching in Redis: before calling the LLM, we normalize the input (trim, lowercase, extract key fields), hash it, and look up a cached response under that key. The normalization is what makes the cache "semantic" at our level of need — slightly different prompts collapse to the same key, so a hit means we return the stored response and skip the API call entirely. We added a tenants.llm_token_budget column and track usage per tenant per month; when a tenant exhausts their budget, we serve cached or fallback responses and notify them. For simple tasks (short summaries, classification) we route to a smaller, cheaper model; complex ones go to the larger model. Where possible, we batch non-urgent requests to reduce round trips. The key was measuring first — we added cost tracking before we optimized, so we knew where the spend was.

Alternatives considered

  • Hard rate limits per user: Would have frustrated power users and didn't address the root cause — duplicate or near-duplicate requests.
  • Self-host a smaller open-source model: Operational overhead, GPU costs, and quality tradeoffs. For our scale, managed APIs with caching were the right balance.

Lessons learned

  • Measure before you optimize. We added cost-per-tenant tracking and only then saw that 20% of requests were cacheable duplicates.
  • Semantic caching has limits — we use exact-match hashing for now; true embedding-based similarity would add complexity we didn't need yet.
  • Per-tenant budgets create the right incentives. Tenants self-regulate when they see usage; we avoid surprise bills.
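The per-tenant budget gate described above can be sketched like this, with in-memory maps standing in for the tenants.llm_token_budget column and the monthly usage counter in PostgreSQL (names are illustrative):

```javascript
// In production, budget comes from tenants.llm_token_budget and usage
// from a per-tenant monthly counter in PostgreSQL; Maps stand in here.
const budgets = new Map(); // tenantId -> monthly token budget
const usage = new Map();   // tenantId -> tokens used this month

// Decide how to serve a request: call the API while budget remains,
// otherwise fall back to cached / smaller-model responses and flag
// the tenant for a notification.
function routeRequest(tenantId, estimatedTokens) {
  const budget = budgets.get(tenantId) ?? 0;
  const used = usage.get(tenantId) ?? 0;
  if (used + estimatedTokens <= budget) {
    usage.set(tenantId, used + estimatedTokens);
    return { action: "call_api" };
  }
  return { action: "serve_fallback", notifyTenant: true };
}
```

In practice the pre-call token count is an estimate; we reconcile with the provider-reported usage after each call so the monthly counter stays honest.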