Quick Diagnostics
Before diving into specific issues, run these quick checks:
# Check all services are running
docker compose ps
# View recent logs from all services
docker compose logs --tail=50
# Check specific service health
curl http://localhost:8080/health
curl http://localhost:8000/health # LLM Service
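The health checks above can also be scripted. A minimal sketch in Python (standard library only; the ports are assumed to match the defaults used on this page):

```python
import urllib.request
import urllib.error

def check(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with HTTP 2xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or HTTP error status
        return False

if __name__ == "__main__":
    for name, url in [
        ("gateway", "http://localhost:8080/health"),
        ("llm-service", "http://localhost:8000/health"),
    ]:
        print(f"{name:12s} {'OK' if check(url) else 'DOWN'}")
```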
Installation & Setup Issues
Docker Compose Fails to Start
Symptoms:
Services won’t start
Exit code errors
Container crashes immediately
Common Causes:
1. Docker daemon not running
Check:
docker info
Solution:
# macOS
open -a Docker
# Linux
sudo systemctl start docker
# Verify
docker info
2. Port conflicts
Check which ports are in use:
# Check all Shannon ports
lsof -i :8080 # Gateway
lsof -i :50051 # Agent Core
lsof -i :50052 # Orchestrator
lsof -i :8000 # LLM Service
lsof -i :5432 # PostgreSQL
lsof -i :6379 # Redis
lsof -i :6333 # Qdrant
lsof -i :7233 # Temporal
Solution - Kill conflicting processes:
# Find process using port
lsof -ti :8080
# Kill the process (macOS/Linux)
kill -9 $(lsof -ti :8080)
Solution - Change Shannon ports :
Edit docker-compose.yml to use different ports:
gateway:
  ports:
    - "8081:8080"  # Use 8081 instead of 8080
3. Insufficient system resources
Check Docker resources:
docker system df
docker stats
Solution - Increase Docker resources:
macOS: Docker Desktop → Preferences → Resources
RAM: Minimum 8GB (16GB recommended)
CPUs: Minimum 4 cores
Disk: Minimum 20GB free
Linux: Edit the Docker daemon config
sudo nano /etc/docker/daemon.json
{
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Hard": 64000,
      "Soft": 64000
    }
  }
}
4. Missing API keys
Error: WARNING: The OPENAI_API_KEY variable is not set
Solution:
# Create .env from template
make setup
# Or manually
cp .env.example .env
# Add your API keys
echo "OPENAI_API_KEY=sk-..." >> .env
echo "ANTHROPIC_API_KEY=sk-ant-..." >> .env
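A quick way to confirm the keys actually landed in .env is to parse it and report anything missing or empty. A sketch (key names taken from the commands above):

```python
def missing_keys(env_text: str,
                 required=("OPENAI_API_KEY", "ANTHROPIC_API_KEY")) -> list:
    """Return required keys that are absent or empty in .env-style text."""
    present = {}
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        present[key.strip()] = value.strip().strip('"')
    return [k for k in required if not present.get(k)]

# Example: missing_keys(open(".env").read()) should return []
```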
5. Python WASI interpreter missing
Error: python_wasi/bin/python3.11: No such file or directory
Solution:
# Download and set up the Python WASI interpreter (20MB)
./scripts/setup_python_wasi.sh
# Verify installation
ls -lh python_wasi/bin/python3.11
API & Connection Issues
401 Unauthorized
Symptoms:
HTTP 401 responses
“Unauthorized” error messages
Diagnosis:
# Check if auth is enabled
docker compose exec orchestrator env | grep GATEWAY_SKIP_AUTH
Solution 1: Disable authentication (development)
Edit .env:
GATEWAY_SKIP_AUTH=1  # 1 = auth disabled, 0 = auth enabled
Restart:
docker compose restart gateway
Test:
curl http://localhost:8080/api/v1/tasks
# Should work without X-API-Key header
Solution 2: Provide valid API key (production)
Request with API key:
curl -H "X-API-Key: sk_test_123456" \
http://localhost:8080/api/v1/tasks
Python SDK:
from shannon import ShannonClient

client = ShannonClient(
    base_url="http://localhost:8080",
    api_key="sk_test_123456",
)
Connection Refused / Service Unavailable
Symptoms:
connection refused
dial tcp: connect: connection refused
Services not responding
Diagnosis:
# Check service status
docker compose ps
# Check specific service logs
docker compose logs orchestrator --tail=50
docker compose logs agent-core --tail=50
docker compose logs llm-service --tail=50
# Test endpoints
curl http://localhost:8080/health
curl http://localhost:50052 # Should fail - gRPC doesn't support HTTP GET
Solution 1: Services not ready
Wait for all services to initialize:
# Watch logs until services are ready
docker compose logs -f
# Look for these messages:
# orchestrator: "gRPC server listening on :50052"
# agent-core: "Server started on :50051"
# llm-service: "Uvicorn running on http://0.0.0.0:8000"
# gateway: "Gateway listening on :8080"
Typical startup time: 30-60 seconds
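Instead of watching the logs by hand, you can poll the gateway's /health endpoint until it answers. A standard-library sketch (the 90-second cap is an assumption based on the typical startup time above):

```python
import time
import urllib.request
import urllib.error

def wait_for(url: str, timeout: float = 90.0, interval: float = 2.0) -> bool:
    """Poll url until it returns HTTP 2xx, or give up after timeout seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if 200 <= resp.status < 300:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet; retry after a short pause
        time.sleep(interval)
    return False

# Example: after `docker compose up -d`
# wait_for("http://localhost:8080/health")
```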
Solution 2: Service crashed
Check for crash errors:
docker compose logs orchestrator | grep -i error
docker compose logs orchestrator | grep -i fatal
Restart crashed service:
docker compose restart orchestrator
docker compose restart agent-core
docker compose restart llm-service
Full reset if needed:
docker compose down
docker compose up -d
Solution 3: Database connection failed
Check PostgreSQL:
docker compose logs postgres --tail=20
# Test connection
docker compose exec postgres psql -U shannon -d shannon -c "SELECT 1;"
Solution:
# Restart database
docker compose restart postgres
# Wait for it to be ready
docker compose exec postgres pg_isready -U shannon
Task Stuck in RUNNING or QUEUED State
Symptoms:
Task never completes
Status remains RUNNING for hours
No progress updates
Diagnosis:
# Check Temporal workflows
docker compose logs temporal --tail=100
# Check orchestrator worker
docker compose logs orchestrator | grep -i workflow
# View task in Temporal UI
open http://localhost:8088
Solution 1: LLM API key invalid or quota exceeded
Check LLM service logs:
docker compose logs llm-service | grep -i "api key\|unauthorized\|quota"
Solution:
# Verify API keys in .env
grep -E "OPENAI_API_KEY|ANTHROPIC_API_KEY" .env
# Test API key
curl https://api.openai.com/v1/models \
-H "Authorization: Bearer $OPENAI_API_KEY"
# Update .env with valid key
nano .env
# Restart LLM service
docker compose restart llm-service
Solution 2: Temporal worker deadlock
Restart Temporal workers:
docker compose restart orchestrator
# Check workflow in Temporal UI
open http://localhost:8088
# Navigate to Workflows → Find your workflow → View execution history
Force workflow termination (last resort):
# In Temporal UI: Workflows → Select workflow → Terminate
Solution 3: Circuit breaker open
Check circuit breaker status:
docker compose logs orchestrator | grep -i "circuit"
Circuit breakers protect against cascading failures:
LLM Service circuit breaker
Database circuit breaker
Redis circuit breaker
Solution - Wait for automatic recovery (30-60 seconds)
Or restart services:
docker compose restart orchestrator agent-core llm-service
Budget & Cost Issues
Budget Exceeded Errors
Symptoms:
budget exceeded error
Tasks fail with cost limit errors
HTTP 429 (Too Many Requests) or 402 (Payment Required) responses
Diagnosis:
# Check budget configuration
docker compose exec orchestrator env | grep BUDGET
docker compose exec orchestrator env | grep MAX_COST
Solution 1: Increase budget limits
Edit .env:
MAX_COST_PER_REQUEST=1.00     # Increase from 0.50
MAX_TOKENS_PER_REQUEST=20000  # Increase from 10000
Restart:
docker compose restart orchestrator llm-service
Budgets are configured server-side via environment variables. The SDK does not accept per-request budget parameters.
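Conceptually, the server-side enforcement is a pre-flight guard: an estimate that exceeds the configured limits fails before any LLM call is made. A simplified illustration, not Shannon's actual code (the defaults mirror the values shown above):

```python
import os

class BudgetExceeded(Exception):
    """Raised when a request's estimate exceeds the configured limits."""

def enforce_budget(estimated_cost: float, estimated_tokens: int) -> None:
    """Reject a request whose estimate exceeds the server-side limits."""
    max_cost = float(os.environ.get("MAX_COST_PER_REQUEST", "0.50"))
    max_tokens = int(os.environ.get("MAX_TOKENS_PER_REQUEST", "10000"))
    if estimated_cost > max_cost:
        raise BudgetExceeded(f"cost {estimated_cost:.2f} > limit {max_cost:.2f}")
    if estimated_tokens > max_tokens:
        raise BudgetExceeded(f"tokens {estimated_tokens} > limit {max_tokens}")
```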
Solution 2: Use simpler execution mode
# Instead of forcing advanced mode, let the mode be auto-selected
client.submit_task(query="...")  # Mode auto-selected
# Advanced → Standard → Simple (cheapest)
Cost comparison:
Simple: 1 LLM call, $0.01-0.05
Standard: 3-5 LLM calls, $0.05-0.20
Advanced: 10+ LLM calls, $0.20-1.00+
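For rough capacity planning, the ranges above can be turned into a back-of-envelope estimator (illustrative numbers only; real costs vary with model and prompt size):

```python
# Per-task cost ranges from the comparison above (illustrative only).
COST_RANGES = {
    "simple": (0.01, 0.05),    # 1 LLM call
    "standard": (0.05, 0.20),  # 3-5 LLM calls
    "advanced": (0.20, 1.00),  # 10+ LLM calls; can exceed the upper bound
}

def estimate(mode: str, tasks: int):
    """Back-of-envelope (low, high) cost in dollars for a batch of tasks."""
    lo, hi = COST_RANGES[mode]
    return (round(lo * tasks, 2), round(hi * tasks, 2))
```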
Solution 3: Disable budget enforcement (development only)
⚠️ Warning: Only for development/testing
Edit .env:
LLM_DISABLE_BUDGETS=1  # Disable budget checks
Restart:
docker compose restart orchestrator llm-service
Slow Response Times
Symptoms:
Tasks take 2-3x longer than expected
High latency
Timeouts
Diagnosis:
# Check resource usage
docker stats
# Check for slow queries
docker compose logs postgres | grep "duration:"
# Check Redis latency
docker compose exec redis redis-cli --latency
# Check Qdrant performance
curl http://localhost:6333/metrics
Solution 1: Insufficient CPU/Memory
Check resources:
docker stats
# Look for CPU > 80% or Memory near limit
Increase Docker resources :
macOS: Docker Desktop → Resources → increase RAM to 16GB, CPUs to 6
Linux: More powerful machine or reduce concurrent workflows
Tune worker concurrency in .env:
WORKER_ACT_CRITICAL=5  # Reduce from 10
WORKER_WF_CRITICAL=3   # Reduce from 5
TOOL_PARALLELISM=2     # Reduce from 5
Solution 2: Cold start / cache misses
The first request is always slower (10-30s). Subsequent requests use caching:
LLM response cache (Redis)
Session context cache
Tool result cache
Solution: Warm up with a test request
curl -X POST http://localhost:8080/api/v1/tasks \
-H "Content-Type: application/json" \
-d '{"query": "Hello"}'
Solution 3: Database connection pool exhausted
Increase pool size in .env:
DB_MAX_OPEN_CONNS=50  # Increase from 25
DB_MAX_IDLE_CONNS=10  # Increase from 5
Restart:
docker compose restart orchestrator
Tokens > 0 but empty result
Symptoms:
Database or logs show non‑zero completion tokens, but the final result text is empty.
Complex prompts return nothing while simple prompts work.
Cause:
Some GPT‑5 chat responses return content as structured parts instead of a plain string. Older parsing could miss the text. This is fixed by routing GPT‑5 models via the Responses API and defensively normalizing content for chat responses.
Fix (Shannon ≥ 2025‑11‑05):
LLM Service routes GPT‑5 models to the Responses API and prefers output_text when available.
Chat providers normalize content by joining text parts when a list is returned.
If you upgraded from an older build, restart the LLM Service to clear cached empty responses.
Verify:
Re‑run a long, multi‑paragraph prompt. The result length should be > 0 and the session history should include the assistant message.
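The normalization described above, joining text parts when content arrives as a list instead of a plain string, can be sketched like this (hypothetical response shapes, not Shannon's actual types):

```python
def normalize_content(content) -> str:
    """Return plain text whether content is a string or a list of parts."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        parts = []
        for part in content:
            if isinstance(part, str):
                parts.append(part)
            elif isinstance(part, dict) and isinstance(part.get("text"), str):
                # Structured part, e.g. {"type": "text", "text": "..."}
                parts.append(part["text"])
        return "".join(parts)
    return ""  # None or unknown shape: treat as empty rather than crash
```

Parsing that only handles the plain-string case returns an empty result for the list shape even though tokens were billed, which matches the symptom above.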
High Memory Usage
Symptoms:
OOM (Out of Memory) errors
Container restarts
Swap usage high
Diagnosis:
docker stats
# Check session cache size
docker compose logs orchestrator | grep "session.*cache"
Solution: Reduce cache sizes
Edit config/shannon.yaml or set env vars:
# Reduce session cache
SESSION_CACHE_SIZE=5000   # From 10000
# Reduce history
SESSION_MAX_HISTORY=250   # From 500
# Reduce LRU caches
TOOL_CACHE_SIZE=1000      # From 5000
Restart:
docker compose restart orchestrator agent-core
Data & State Issues
Sessions Not Persisting
Symptoms:
Session context lost between requests
Agent doesn’t remember previous tasks
Diagnosis:
# Check Redis connectivity
docker compose exec orchestrator nc -zv redis 6379
# Check session data
docker compose exec redis redis-cli KEYS "session:*"
Solution 1: Redis connection failed
Check Redis status:
docker compose ps redis
docker compose logs redis --tail=20
Restart Redis:
docker compose restart redis
Test connection:
docker compose exec redis redis-cli ping
# Should return "PONG"
Solution 2: Session expired (TTL)
Sessions expire after 30 days by default. Increase the TTL in .env:
REDIS_TTL_SECONDS=7776000  # 90 days
Check session expiry:
docker compose exec redis redis-cli TTL "session:YOUR_SESSION_ID"
# Returns seconds until expiry, or -1 for no expiry
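Redis's TTL command returns -2 for a missing key and -1 for a key with no expiry, so a small helper can translate the raw value into something readable:

```python
def describe_ttl(ttl_seconds: int) -> str:
    """Interpret the integer returned by Redis TTL for a session key."""
    if ttl_seconds == -2:
        return "key does not exist (session expired or never created)"
    if ttl_seconds == -1:
        return "no expiry set"
    days, rem = divmod(ttl_seconds, 86400)
    hours = rem // 3600
    return f"expires in {days}d {hours}h"

# Example: describe_ttl(7776000) for the 90-day TTL shown above
```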
Solution 3: Use consistent session IDs
Provide a stable session_id explicitly:
session_id = "user-123-conversation"
handle1 = client.submit_task("Load data", session_id=session_id)
handle2 = client.submit_task("Analyze data", session_id=session_id)
Database Migration Errors
Symptoms:
Table doesn’t exist errors
Column not found errors
Schema version mismatch
Solution:
# Run migrations
docker compose exec orchestrator make migrate
# Or reset database (⚠️ DESTRUCTIVE)
docker compose down -v # Remove volumes
docker compose up -d
Viewing Logs
# All services
docker compose logs -f
# Specific service
docker compose logs -f orchestrator
docker compose logs -f agent-core
docker compose logs -f llm-service
# Last N lines
docker compose logs --tail=100 orchestrator
# Search logs
docker compose logs orchestrator | grep -i error
docker compose logs orchestrator | grep "task_id=YOUR_TASK_ID"
Temporal UI
Access: http://localhost:8088
Features:
View all workflows
See execution history
Replay failed workflows
Terminate stuck workflows
Time-travel debugging
Usage:
Navigate to Workflows
Search by workflow ID (task ID)
View execution history to see where it failed
Check Activity logs for detailed errors
Prometheus Metrics
# Orchestrator metrics
curl http://localhost:2112/metrics
# Agent Core metrics
curl http://localhost:2113/metrics
# LLM Service metrics
curl http://localhost:8000/metrics
Key metrics:
tasks_submitted_total
tasks_completed_total
tasks_failed_total
llm_requests_total
circuit_breaker_state
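The metrics endpoints return Prometheus text exposition format. A minimal parser for pulling out the counters listed above (simplified: it sums values across label sets and skips comment lines):

```python
def parse_metrics(text: str, names) -> dict:
    """Extract simple counter/gauge values from Prometheus text format."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and # HELP / # TYPE comments
        metric, _, value = line.rpartition(" ")
        base = metric.split("{", 1)[0]  # drop the label set if present
        if base in names:
            values[base] = values.get(base, 0.0) + float(value)
    return values

# Example: parse_metrics(body_of("http://localhost:2112/metrics"),
#                        {"tasks_submitted_total", "tasks_failed_total"})
```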
Real-time Monitoring
For real-time views of task execution:
Use the Shannon Desktop App (Runs view and Run Details) for live event streams
Use Prometheus/Grafana for metrics once configured (see Monitoring concepts)
Getting Help
Installation Guide Detailed setup instructions
API Documentation Complete API reference
GitHub Issues Report bugs or request features
Quick Reference Commands
# Health checks
curl http://localhost:8080/health
curl http://localhost:8000/health
# Service status
docker compose ps
docker stats
# Restart services
docker compose restart orchestrator
docker compose restart agent-core
docker compose restart llm-service
# View logs
docker compose logs -f orchestrator
# Full reset
docker compose down -v
docker compose up -d
# Database access
docker compose exec postgres psql -U shannon -d shannon
# Redis CLI
docker compose exec redis redis-cli
# Check environment
docker compose exec orchestrator env | grep -E "OPENAI|ANTHROPIC"