A surge in Large Language Model (LLM) costs – a 30% month-over-month increase – demanded immediate attention. While traffic was growing, it didn’t justify the escalating expenses. The root cause wasn’t increased usage, but rather redundant queries. Users were asking the same questions in countless different ways: “What’s your return policy?”, “How do I return something?”, and “Can I get a refund?” Each unique phrasing triggered a full LLM call, incurring significant costs for essentially identical responses.
Traditional, exact-match caching proved woefully inadequate, capturing only 18% of these repetitive requests. The nuance of human language meant that semantically similar questions bypassed the cache entirely. The solution? Semantic caching – a system that understands the meaning of a query, not just its literal wording. Implementing this approach boosted our cache hit rate to 67%, slashing LLM API costs by 73%. However, achieving these gains required navigating complexities often overlooked in naive implementations.
Why Exact-Match Caching Falls Short
Conventional caching relies on the query text as the cache key. This works flawlessly when queries are identical. The process is straightforward: hash the query text, and if that hash exists in the cache, return the stored response.
```python
# Exact-match caching: the query text itself is the cache key.
def lookup_exact(query_text: str, cache: dict):
    cache_key = hash(query_text)
    if cache_key in cache:
        return cache[cache_key]
    return None
```
But users rarely phrase questions identically. An analysis of 100,000 production queries revealed a stark reality:
- Only 18% were exact duplicates.
- 47% were semantically similar – same intent, different wording.
- 35% were genuinely novel queries.
That 47% represented a massive, untapped opportunity for cost savings. Each semantically similar query triggered a full LLM call, generating a response nearly indistinguishable from one already computed and stored.
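The article does not spell out how that breakdown was measured, but it can be estimated offline from a sampled query log. A minimal sketch follows, assuming hypothetical helpers `embed(query) -> vector` and `similarity(a, b) -> float`; the brute-force scan is quadratic, so it only suits a sample, not the full log.

```python
import hashlib
from collections import Counter

def classify_redundancy(queries, embed, similarity, threshold=0.92):
    """Label each query as an exact duplicate, a semantic duplicate, or novel."""
    seen_hashes = set()
    seen_embeddings = []
    counts = Counter()
    for query in queries:
        digest = hashlib.sha256(query.strip().lower().encode()).hexdigest()
        embedding = embed(query)
        if digest in seen_hashes:
            counts['exact_duplicate'] += 1
        elif any(similarity(embedding, prior) >= threshold for prior in seen_embeddings):
            counts['semantic_duplicate'] += 1
        else:
            counts['novel'] += 1
        seen_hashes.add(digest)
        seen_embeddings.append(embedding)
    return counts
```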
The Architecture of Semantic Caching
Semantic caching replaces text-based keys with embedding-based similarity lookup. Instead of hashing the query, it transforms it into a vector representation and searches for similar vectors within a vector database.
```python
from datetime import datetime
from typing import Optional


class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.92):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.vector_store = VectorStore()      # FAISS, Pinecone, etc.
        self.response_store = ResponseStore()  # Redis, DynamoDB, etc.

    def get(self, query: str) -> Optional[str]:
        """Return cached response if semantically similar query exists."""
        query_embedding = self.embedding_model.encode(query)

        # Find most similar cached query
        matches = self.vector_store.search(query_embedding, top_k=1)

        if matches and matches[0].similarity >= self.threshold:
            cache_id = matches[0].id
            return self.response_store.get(cache_id)

        return None

    def set(self, query: str, response: str):
        """Cache query-response pair."""
        query_embedding = self.embedding_model.encode(query)
        cache_id = generate_id()  # unique ID linking the vector to its stored response

        self.vector_store.add(cache_id, query_embedding)
        self.response_store.set(cache_id, {
            'query': query,
            'response': response,
            'timestamp': datetime.utcnow()
        })
```
The core insight is simple: embed queries into vector space and find cached queries within a defined similarity threshold. But the devil, as always, is in the details.
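Concretely, "within a defined similarity threshold" usually means cosine similarity between the embedding vectors. A minimal sketch, assuming the embedding model returns NumPy arrays (as sentence-transformer-style encoders typically do):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical usage with any encoder that returns vectors:
# emb_a = embedding_model.encode("How do I return something?")
# emb_b = embedding_model.encode("What's your return policy?")
# cache_hit = cosine_similarity(emb_a, emb_b) >= 0.92
```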
The Critical Role of the Similarity Threshold
The similarity threshold is the linchpin of semantic caching. Set it too high, and you miss valid cache hits, negating the benefits. Set it too low, and you risk returning incorrect responses. Our initial threshold of 0.85 seemed reasonable – 85% similar should equate to “the same question,” right?
Wrong. At 0.85, we encountered problematic cache hits, such as:
- Query: “How do I cancel my subscription?”
- Cached: “How do I cancel my order?”
- Similarity: 0.87
These are distinct questions requiring different answers. Returning the cached response would have been a frustrating experience for the user. We discovered that optimal thresholds vary significantly based on query type.
| Query type | Optimal threshold | Rationale |
| --- | --- | --- |
| FAQ-style questions | 0.94 | High precision needed; incorrect answers erode trust. |
| Product searches | 0.88 | More tolerance for near-matches. |
| Support queries | 0.92 | Balance between coverage and accuracy. |
| Transactional queries | 0.97 | Very low tolerance for errors. |
We implemented query-type-specific thresholds to address this nuance.
```python
class AdaptiveSemanticCache(SemanticCache):
    # Subclasses SemanticCache above so the embedding model and stores are reused.
    def __init__(self, embedding_model):
        super().__init__(embedding_model)
        self.thresholds = {
            'faq': 0.94,
            'search': 0.88,
            'support': 0.92,
            'transactional': 0.97,
            'default': 0.92
        }
        self.query_classifier = QueryClassifier()

    def get_threshold(self, query: str) -> float:
        query_type = self.query_classifier.classify(query)
        return self.thresholds.get(query_type, self.thresholds['default'])

    def get(self, query: str) -> Optional[str]:
        threshold = self.get_threshold(query)
        query_embedding = self.embedding_model.encode(query)
        matches = self.vector_store.search(query_embedding, top_k=1)

        if matches and matches[0].similarity >= threshold:
            return self.response_store.get(matches[0].id)

        return None
```
Tuning the Thresholds: A Data-Driven Approach
Blindly setting thresholds is a recipe for disaster. We needed ground truth – a clear understanding of which query pairs truly represented the “same intent.” Our methodology involved a rigorous, multi-step process:
- Sample Query Pairs: We sampled 5,000 query pairs across various similarity levels (0.80-0.99).
- Human Labeling: Annotators labeled each pair as “same intent” or “different intent.” We used three annotators per pair and employed a majority vote to ensure accuracy.
- Compute Precision/Recall Curves: For each threshold, we calculated precision (of cache hits, what fraction had the same intent?) and recall (of same-intent pairs, what fraction did we cache-hit?).
- Select Threshold Based on Cost of Errors: For FAQ queries, where incorrect answers damage trust, we optimized for precision (the 0.94 threshold yielded 98% precision). For search queries, where a missed cache hit simply costs money, we prioritized recall (0.88 threshold). A sketch of this threshold sweep follows the list.
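A hedged sketch of steps 3 and 4, assuming the labeled sample is available as `(similarity_score, same_intent)` tuples; the actual storage format is not specified in the article.

```python
from typing import List, Tuple

def precision_recall_at(pairs: List[Tuple[float, bool]], threshold: float) -> Tuple[float, float]:
    """Cache-hit precision and recall at a given similarity threshold.

    pairs: (similarity_score, human_labeled_same_intent) for each sampled query pair.
    """
    hits = [(sim, same) for sim, same in pairs if sim >= threshold]
    true_hits = sum(1 for _, same in hits if same)
    total_same = sum(1 for _, same in pairs if same)
    precision = true_hits / len(hits) if hits else 1.0  # no hits: vacuously precise
    recall = true_hits / total_same if total_same else 0.0
    return precision, recall

# Sweep candidate thresholds and pick the lowest one that meets the
# precision target for the query type.
# for t in [0.86, 0.88, 0.90, 0.92, 0.94, 0.96]:
#     p, r = precision_recall_at(labeled_pairs, t)
#     print(f"threshold={t:.2f} precision={p:.3f} recall={r:.3f}")
```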
Latency Considerations
Semantic caching introduces latency. Embedding the query and searching the vector store take time. Our measurements revealed:
| Operation | Latency (p50) | Latency (p99) |
| --- | --- | --- |
| Query embedding | 12ms | 28ms |
| Vector search | 8ms | 19ms |
| Total cache lookup | 20ms | 47ms |
| LLM API call | 850ms | 2400ms |
The 20ms overhead is negligible compared to the 850ms LLM call we avoid on cache hits. Even at the 99th percentile, the 47ms overhead is acceptable. However, cache misses now incur an additional 20ms delay. At our 67% hit rate, the overall impact is overwhelmingly positive:
- Before: 100% of queries × 850ms = 850ms average
- After: (33% misses × 870ms, i.e. the 850ms LLM call plus the 20ms lookup) + (67% hits × 20ms) ≈ 287ms + 13ms ≈ 300ms average
This translates to a net latency improvement of 65% alongside the substantial cost reduction.
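The latency math falls directly out of where the lookup sits in the request path. A minimal sketch of that path, using the SemanticCache class defined earlier; `call_llm` stands in for whatever LLM client function the application already uses (hypothetical here).

```python
from typing import Callable

def answer(query: str, cache: SemanticCache, call_llm: Callable[[str], str]) -> str:
    """Serve a query through the semantic cache, falling back to the LLM on a miss."""
    cached = cache.get(query)       # ~20ms p50: embedding + vector search
    if cached is not None:
        return cached               # hit: the ~850ms LLM call is avoided entirely

    response = call_llm(query)      # miss: full LLM latency plus the lookup overhead
    cache.set(query, response)      # store so future similar phrasings hit the cache
    return response
```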
Cache Invalidation: Keeping Responses Fresh
Cached responses inevitably become stale. Product information changes, policies are updated, and yesterday’s correct answer can become today’s misinformation. We implemented a three-pronged invalidation strategy (a sketch of how the pieces fit together follows the list):
- Time-Based TTL: Simple expiration based on content type (e.g., pricing updates every 4 hours, policies every 7 days).
- Event-Based Invalidation: When underlying data changes, we invalidate related cache entries.
- Staleness Detection: For responses that might become stale without explicit events, we periodically re-run the query against current data and compare the semantic similarity of the responses. If the similarity falls below a threshold, we invalidate the cached entry.
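A minimal sketch of how these three strategies might combine. It assumes cache entries carry an extra `content_type` field and that the response store can be queried by a related-entity tag (`find_by_entity`); neither appears in the earlier code, and the `regenerate`, `embed`, and `similarity` helpers are likewise placeholders.

```python
from datetime import datetime, timedelta

# Assumed per-content-type TTLs mirroring the policy described above.
TTL_BY_CONTENT_TYPE = {
    'pricing': timedelta(hours=4),
    'policy': timedelta(days=7),
    'default': timedelta(days=1),
}

def is_expired(entry: dict) -> bool:
    """Time-based TTL, keyed on an assumed 'content_type' field in the cache entry."""
    ttl = TTL_BY_CONTENT_TYPE.get(entry.get('content_type', 'default'),
                                  TTL_BY_CONTENT_TYPE['default'])
    return datetime.utcnow() - entry['timestamp'] > ttl

def invalidate_for_entity(cache, entity_id: str) -> None:
    """Event-based invalidation: drop entries tied to data that just changed.

    find_by_entity() and delete() are assumed store methods, not shown earlier.
    """
    for cache_id in cache.response_store.find_by_entity(entity_id):
        cache.vector_store.delete(cache_id)
        cache.response_store.delete(cache_id)

def is_stale(entry: dict, regenerate, embed, similarity, min_similarity: float = 0.90) -> bool:
    """Staleness detection: re-answer against current data and compare responses."""
    fresh_response = regenerate(entry['query'])
    return similarity(embed(entry['response']), embed(fresh_response)) < min_similarity
```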
Production Results and Lessons Learned
After three months in production, the results were compelling:
| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Cache hit rate | 18% | 67% | +272% |
| LLM API costs | $47K/month | $12.7K/month | -73% |
| Average latency | 850ms | 300ms | -65% |
| False-positive rate | N/A | 0.8% | — |
| Customer complaints (wrong answers) | Baseline | +0.3% | Minimal increase |
The 0.8% false-positive rate was within acceptable bounds; these cases occurred primarily at the boundaries of our similarity thresholds.
Beyond the Basics: Advanced Considerations
While the core principles of semantic caching are relatively straightforward, achieving optimal performance requires attention to several additional details:
- Vector Database Selection: The choice of vector database (FAISS, Pinecone, Milvus, etc.) depends on your scale and performance requirements.
- Embedding Model Fine-tuning: Fine-tuning your embedding model on your specific dataset can significantly improve accuracy.
- Query Classification: Accurate query classification is crucial for applying the correct similarity threshold.
- Monitoring and Alerting: Continuously monitor cache hit rates, false-positive rates, and latency to identify and address potential issues (a minimal sketch follows this list).
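For the monitoring point, a minimal in-process sketch of the counters worth tracking; in practice these would be exported to whatever metrics backend you already run, and the `CacheMetrics` class here is purely illustrative.

```python
import time
from dataclasses import dataclass, field
from typing import List

@dataclass
class CacheMetrics:
    """Illustrative in-process counters; production systems would export these."""
    hits: int = 0
    misses: int = 0
    lookup_ms: List[float] = field(default_factory=list)

    def record_lookup(self, hit: bool, elapsed_ms: float) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1
        self.lookup_ms.append(elapsed_ms)

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# Usage around a cache lookup:
# start = time.perf_counter()
# cached = cache.get(query)
# metrics.record_lookup(cached is not None, (time.perf_counter() - start) * 1000)
```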
Frequently Asked Questions About Semantic Caching
- What is semantic caching and how does it differ from traditional caching?
  Semantic caching focuses on the meaning of a query, using embeddings to find similar questions, while traditional caching relies on exact text matches. This allows semantic caching to capture redundancy that exact-match caching misses.
- How do you determine the optimal similarity threshold for semantic caching?
  The optimal threshold varies by query type. We recommend a data-driven approach involving human labeling and precision/recall analysis to find the best balance between accuracy and coverage.
- What are the key challenges of implementing semantic caching?
  The main challenges include threshold tuning, cache invalidation, and managing the latency overhead associated with embedding and vector search.
- What vector databases are commonly used with semantic caching?
  Popular choices include FAISS, Pinecone, Milvus, and Weaviate. The best option depends on your specific needs and scale.
- How important is cache invalidation in a semantic caching system?
  Cache invalidation is critical. Stale responses can erode user trust and lead to inaccurate information. A combination of TTL, event-based, and staleness detection strategies is recommended.
- Can semantic caching be used with any LLM?
  Yes, semantic caching is agnostic to the underlying LLM. It focuses on reducing the number of calls to the LLM, regardless of the model used.
Ready to optimize your LLM costs and improve user experience? Share this article with your team and let us know your thoughts in the comments below!
Disclaimer: This article provides general information and should not be considered professional advice. Consult with a qualified expert for specific guidance related to your situation.