Overview
Shannon implements a multi-layered caching system for LLM responses, providing significant cost savings and latency reduction for repeated or similar queries. The caching system supports both in-memory and Redis backends.
Architecture
Cache Backends
In-Memory Cache (Default)
The default caching backend uses an LRU (Least Recently Used) dictionary with automatic eviction. Features:
- Zero external dependencies
- Fast lookup (O(1) average)
- Automatic eviction when capacity reached
- Hit rate tracking
- Not shared across instances
- Lost on service restart
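As a rough illustration of this backend's behavior (a sketch of the idea, not Shannon's actual implementation), an LRU map with capacity-based eviction and hit/miss tracking can look like this:

```python
from collections import OrderedDict

class LRUResponseCache:
    """Toy LRU cache: O(1) average get/put, evicts the least recently used entry."""

    def __init__(self, max_entries: int = 10_000):
        self._store: OrderedDict[str, str] = OrderedDict()
        self._max_entries = max_entries
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str | None:
        if key in self._store:
            self._store.move_to_end(key)   # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def put(self, key: str, value: str) -> None:
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self._max_entries:
            self._store.popitem(last=False)  # evict least recently used entry

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```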
Redis Cache
For production deployments, Redis provides distributed caching across multiple instances. Features:
- Distributed across all LLM service instances
- Persistent storage
- Automatic TTL expiration
- High availability with Redis Sentinel/Cluster
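For illustration, writing and reading a cached response with automatic TTL expiration via redis-py might look like the sketch below; the connection URL and key names are placeholders, not Shannon's actual ones:

```python
import redis

# Placeholder URL; in Shannon this comes from REDIS_URL / LLM_REDIS_URL (see Environment Variables).
r = redis.Redis.from_url("redis://redis:6379/0", decode_responses=True)

def cache_response(key: str, response_json: str, ttl_seconds: int = 3600) -> None:
    # SETEX stores the value and lets Redis expire it automatically after the TTL.
    r.setex(key, ttl_seconds, response_json)

def get_cached_response(key: str) -> str | None:
    return r.get(key)  # None on a cache miss or after TTL expiry
```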
Configuration
Global Settings
Configure caching in `config/models.yaml`:
| Parameter | Default | Description |
|---|---|---|
| `enabled` | `true` | Master switch for caching |
| `ttl_seconds` | `3600` | Default cache entry lifetime (seconds) |
| `max_cache_size_mb` | `2048` | Maximum cache size in MB |
| `similarity_threshold` | `0.95` | Threshold for semantic matching |
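As an example, these settings might appear in `config/models.yaml` under a `prompt_cache` block; the `prompt_cache.enabled` key is referenced in Troubleshooting below, while the nesting of the remaining keys is an assumption:

```yaml
prompt_cache:
  enabled: true              # master switch for caching
  ttl_seconds: 3600          # default cache entry lifetime
  max_cache_size_mb: 2048    # maximum cache size in MB
  similarity_threshold: 0.95 # threshold for semantic matching
```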
Per-Request Override
Override cache behavior on individual requests:
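For example, a single request could opt out of caching or shorten its TTL. The field names `cache_enabled` and `cache_ttl_seconds` below are hypothetical placeholders, not Shannon's documented request schema:

```python
request = {
    "messages": [{"role": "user", "content": "Summarize this document"}],
    "model_tier": "small",
    # Hypothetical per-request overrides (illustrative names only):
    "cache_enabled": False,      # bypass the cache for this request
    "cache_ttl_seconds": 600,    # or keep caching but with a shorter TTL
}
```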
Cache Key Generation
Cache keys are generated deterministically from request parameters; a sketch of the derivation follows the two lists below.
Included Parameters
- Message content and roles
- Model tier and specific model
- Temperature and max_tokens
- Function definitions
- Random seed
Excluded Parameters
- Streaming flag (streaming not cached)
- Session/task IDs (cache is request-based)
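A minimal sketch of this derivation, assuming a canonical JSON serialization hashed with SHA-256 (Shannon's actual serialization, key prefix, and hash may differ):

```python
import hashlib
import json

def cache_key(messages, model_tier, model, temperature, max_tokens,
              functions=None, seed=None) -> str:
    # Only parameters that affect the completion are hashed; streaming flags
    # and session/task IDs are deliberately excluded.
    payload = {
        "messages": [{"role": m["role"], "content": m["content"]} for m in messages],
        "model_tier": model_tier,
        "model": model,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "functions": functions,
        "seed": seed,
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "llm:cache:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```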
Caching Rules
When Responses Are Cached
Responses are cached when all of the following hold (see the sketch after this list):
- Caching is enabled globally
- Request is non-streaming
- Response has non-empty content OR has function_call
- Finish reason is not `length` (truncated) or `content_filter`
- For JSON mode: content is valid JSON object
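These rules can be read as a single gate. The sketch below is illustrative (the response dict shape is assumed), and per the Cache Validation subsection the same checks are applied again before a cached entry is served:

```python
import json

def should_cache(response: dict, *, caching_enabled: bool,
                 streaming: bool, json_mode: bool) -> bool:
    """Return True only if the response satisfies the caching rules above."""
    if not caching_enabled or streaming:
        return False
    if response.get("finish_reason") in ("length", "content_filter"):
        return False  # truncated or filtered completions are never cached
    content = response.get("content") or ""
    has_function_call = response.get("function_call") is not None
    if not content.strip() and not has_function_call:
        return False  # require non-empty content or a function call
    if json_mode:
        try:
            parsed = json.loads(content)
        except ValueError:
            return False
        if not isinstance(parsed, dict):
            return False  # JSON mode requires a JSON object
    return True
```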
When Responses Are NOT Cached
Responses are not cached when any of the above conditions fails: caching is disabled, the request is streaming, the response has no content and no function_call, the completion finished with `length` (truncated) or `content_filter`, or JSON-mode output is not a valid JSON object.
Cache Validation
Before serving cached responses, Shannon validates:
- Finish Reason Check: Skips truncated or filtered responses
- JSON Mode Validation: Ensures valid JSON object for JSON mode
- Content Presence: Requires non-empty content or function_call
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `REDIS_URL` | - | Full Redis connection URL |
| `LLM_REDIS_URL` | - | Alternative Redis URL |
| `REDIS_HOST` | `redis` | Redis hostname |
| `REDIS_PORT` | `6379` | Redis port |
| `REDIS_PASSWORD` | - | Optional Redis password |
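A sketch of how these variables could be resolved into a client; the precedence shown (REDIS_URL, then LLM_REDIS_URL, then host/port/password) is an assumption, not documented behavior:

```python
import os
import redis

def redis_from_env() -> redis.Redis:
    # Assumed precedence: full URL first, then discrete host/port/password settings.
    url = os.environ.get("REDIS_URL") or os.environ.get("LLM_REDIS_URL")
    if url:
        return redis.Redis.from_url(url, decode_responses=True)
    return redis.Redis(
        host=os.environ.get("REDIS_HOST", "redis"),
        port=int(os.environ.get("REDIS_PORT", "6379")),
        password=os.environ.get("REDIS_PASSWORD") or None,
        decode_responses=True,
    )
```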
Performance Impact
Latency Reduction
| Scenario | Latency |
|---|---|
| Cache hit | 1-5ms |
| Cache miss (small model) | 500-2000ms |
| Cache miss (large model) | 2000-10000ms |
Cost Savings
Cache hits eliminate LLM provider costs entirely:
- Typical hit rates: 20-40% for diverse workloads
- High hit rates: 60-80% for repetitive queries
- Potential cost reduction: 20-80% depending on workload
Monitoring
Cache Metrics
The LLM service exposes cache metrics, including the hit and miss counts tracked by each backend and the resulting hit rate.
API Response Fields
Per-request cache information is also surfaced in fields of the API response.
Best Practices
Maximize Cache Hits
- Normalize prompts: Consistent formatting improves hit rates
- Use deterministic seeds: Set `seed` for reproducible outputs
- Standardize temperatures: Use consistent temperature values
- Reuse system prompts: Keep system messages consistent
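For example, a simple whitespace normalization pass applied before key generation (a sketch; whether Shannon normalizes prompts itself is not specified here):

```python
def normalize_prompt(text: str) -> str:
    # Collapse whitespace variations so semantically identical prompts
    # produce identical cache keys.
    return " ".join(text.split())

assert normalize_prompt("Summarize  this\n document") == normalize_prompt("Summarize this document")
```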
Cache Key Strategy
Because cache keys are derived from the full request (messages, model and tier, temperature, max_tokens, function definitions, and seed), any variation in these inputs yields a different key; keep them stable across requests to get repeat hits.
Redis Configuration
For production, run a dedicated Redis instance (or Redis Sentinel/Cluster for high availability) and point every LLM service instance at it via `REDIS_URL`, so all instances share one distributed cache.
Troubleshooting
Low Hit Rate
- Check prompt normalization
- Verify temperature consistency
- Review cache TTL settings
- Monitor cache eviction rate
Cache Not Working
- Verify `prompt_cache.enabled: true` in config
- Check the Redis connection (if using Redis)
- Ensure requests are non-streaming
- Verify responses are not being filtered
Memory Issues
- Reduce `max_cache_size_mb`
- Use Redis for large-scale deployments
- Implement cache partitioning by tenant