# Monitoring & Observability

> Monitor Shannon task execution and system health

<Note>
  Monitoring documentation is under development. Core concepts are outlined below.
</Note>

## Overview

Shannon provides monitoring and observability features for tracking task execution, system performance, and resource usage in production environments.

## Monitoring Capabilities

### Task Monitoring

Track individual task execution:

* Execution status and progress
* Resource consumption
* Error rates and types
* Latency metrics
* Cost tracking

### System Monitoring

Monitor Shannon infrastructure:

* Service health status
* API endpoint latency
* Queue depths
* Agent availability
* LLM provider status

## Metrics

### Task Metrics

| Metric                   | Description                     | Unit  |
| ------------------------ | ------------------------------- | ----- |
| `task.latency`           | End-to-end task completion time | ms    |
| `task.cost`              | Total cost per task             | USD   |
| `task.tokens.input`      | Input tokens consumed           | count |
| `task.tokens.output`     | Output tokens generated         | count |
| `task.iterations`        | Number of agent iterations      | count |
| `task.tools.invocations` | Tool usage count                | count |
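
As a rough illustration of how these metrics might be aggregated, the sketch below summarizes a batch of per-task samples. The `TaskSample` dataclass and `summarize` helper are hypothetical; their fields mirror the metric names in the table and are not part of a Shannon API.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskSample:
    """Hypothetical per-task record; fields mirror the metric table above."""
    latency_ms: float    # task.latency
    cost_usd: float      # task.cost
    tokens_input: int    # task.tokens.input
    tokens_output: int   # task.tokens.output


def summarize(samples: list[TaskSample]) -> dict:
    """Aggregate a non-empty batch of samples into headline numbers."""
    if not samples:
        raise ValueError("summarize() needs at least one sample")
    return {
        "tasks": len(samples),
        "avg_latency_ms": mean(s.latency_ms for s in samples),
        "max_latency_ms": max(s.latency_ms for s in samples),
        "total_cost_usd": round(sum(s.cost_usd for s in samples), 6),
        "total_tokens": sum(s.tokens_input + s.tokens_output for s in samples),
    }
```

Keeping the units in the field names (`latency_ms`, `cost_usd`) makes it harder to mix milliseconds with seconds when the numbers are later charted or alerted on.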

### System Metrics

| Metric          | Description        | Unit     |
| --------------- | ------------------ | -------- |
| `api.latency`   | API response time  | ms       |
| `api.requests`  | Request rate       | req/s    |
| `api.errors`    | Error rate         | errors/s |
| `queue.depth`   | Tasks waiting      | count    |
| `agents.active` | Active agent count | count    |

## Health Checks

### API Health

```bash theme={null}
curl http://localhost:8080/health
```

Response (gateway):

```json theme={null}
{
  "status": "healthy",
  "version": "0.3.0",
  "time": "2025-01-20T10:00:00Z",
  "checks": {
    "gateway": "ok"
  }
}
```

Readiness (checks orchestrator connectivity):

```bash theme={null}
curl http://localhost:8080/readiness
```

Response:

```json theme={null}
{
  "status": "ready",
  "version": "0.3.0",
  "time": "2025-01-20T10:00:02Z",
  "checks": {
    "orchestrator": "ok"
  }
}
```

### Component Health

```python theme={null}
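import requests

BASE_URL = "http://localhost:8080"  # gateway address used in the examples above

# Health: the gateway itself is responding.
health = requests.get(f"{BASE_URL}/health", timeout=5).json()
print("Gateway:", health.get("status"), health.get("checks"))

# Readiness: the gateway can reach the orchestrator.
ready = requests.get(f"{BASE_URL}/readiness", timeout=5).json()
print("Readiness:", ready.get("status"), ready.get("checks"))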
```
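
In deployment scripts it is often useful to block until the readiness endpoint reports `"ready"`. The helper below is a minimal sketch of that polling loop; `wait_until_ready` and its parameters are hypothetical names, and the HTTP call is passed in as a callable so the retry logic can be exercised without a running gateway.

```python
import time
from typing import Callable


def wait_until_ready(
    fetch: Callable[[], dict],
    attempts: int = 5,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `fetch` until it reports status "ready" or attempts run out.

    `fetch` should return the parsed JSON body of GET /readiness.
    Any exception (connection refused, invalid JSON) counts as "not ready yet".
    """
    for attempt in range(attempts):
        try:
            if fetch().get("status") == "ready":
                return True
        except Exception:
            pass  # service not reachable yet; fall through and retry
        if attempt < attempts - 1:
            sleep(delay_s)  # no sleep after the final attempt
    return False
```

With `requests` installed, a call might look like `wait_until_ready(lambda: requests.get("http://localhost:8080/readiness", timeout=2).json())`.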

## Logging

### Log Levels

Shannon emits structured logs at the following levels, in increasing severity:

* `DEBUG` - Detailed diagnostic information
* `INFO` - General operational messages
* `WARN` - Warning conditions
* `ERROR` - Error conditions
* `FATAL` - Critical failures

### Log Format

```json theme={null}
{
  "timestamp": "2024-10-27T10:00:00Z",
  "level": "INFO",
  "service": "orchestrator",
  "task_id": "task-dev-1730000000",
  "message": "Task submitted",
  "metadata": {
    "mode": "standard",
    "estimated_cost": 0.15
  }
}
```
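
Because each log entry is a JSON object, logs can be filtered with a few lines of code. The sketch below assumes one JSON object per line (a common convention, not confirmed here for Shannon) and keeps only records at or above a minimum level. `filter_logs` is a hypothetical helper, not a Shannon utility.

```python
import json

# Severity order, following the log levels listed earlier on this page.
LEVELS = ["DEBUG", "INFO", "WARN", "ERROR", "FATAL"]


def filter_logs(lines, min_level="WARN", task_id=None):
    """Yield parsed log records at or above `min_level`.

    Lines that are not JSON objects are skipped rather than raising,
    since log streams often contain stray non-JSON output.
    """
    threshold = LEVELS.index(min_level)
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        level = record.get("level")
        if level not in LEVELS or LEVELS.index(level) < threshold:
            continue
        if task_id is not None and record.get("task_id") != task_id:
            continue
        yield record
```

Passing `task_id="task-dev-1730000000"` narrows the output to a single task, which is usually the first step when investigating a failure.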

## Dashboards

### Task Dashboard

Monitor task execution in real time:

* Active tasks
* Completion rate
* Average latency
* Error rate
* Cost per hour

### System Dashboard

Track system health:

* Service status
* Resource utilization
* Queue lengths
* Provider availability

## Alerting

### Alert Types

Configure alerts for:

* Task failures
* Budget exceeded
* High latency
* Service degradation
* Rate limiting

### Alert Configuration

```yaml theme={null}
alerts:
  - name: high_error_rate
    condition: error_rate > 0.05
    action: notify
    channels: [email, slack]

  - name: budget_warning
    condition: daily_cost > 100
    action: notify
    channels: [email]

  - name: service_down
    condition: health_check_failed
    action: page
    channels: [pagerduty]
```
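
To make the semantics of the `condition` field concrete, here is a minimal sketch of how such rules could be evaluated against a snapshot of metrics. It illustrates the idea only and does not describe how Shannon evaluates alerts; `condition_fires` and `firing_alerts` are hypothetical helpers, and the `alerts` argument is assumed to be the list obtained by loading the YAML above (for example with PyYAML's `yaml.safe_load`).

```python
import operator

# Comparison operators this sketch understands.
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def condition_fires(condition: str, metrics: dict) -> bool:
    """Evaluate 'name > 0.05' or a bare boolean flag like 'health_check_failed'."""
    parts = condition.split()
    if len(parts) == 1:
        return bool(metrics.get(parts[0], False))
    if len(parts) == 3 and parts[1] in OPS:
        name, op, threshold = parts
        if name not in metrics:
            return False  # no data for this metric: do not fire
        return OPS[op](metrics[name], float(threshold))
    raise ValueError(f"unsupported condition: {condition!r}")


def firing_alerts(alerts: list, metrics: dict) -> list:
    """Return the names of alerts whose condition holds for `metrics`."""
    return [a["name"] for a in alerts if condition_fires(a["condition"], metrics)]
```

Parsing the condition into a name, an operator, and a threshold avoids calling `eval` on configuration text, at the cost of supporting only simple comparisons.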

## Prometheus Integration

Each Shannon service exposes a `/metrics` endpoint that Prometheus can scrape. Example scrape targets for local development:

```yaml theme={null}
# prometheus.yml
scrape_configs:
  - job_name: 'orchestrator'
    static_configs:
      - targets: ['localhost:2112']   # Go Orchestrator /metrics

  - job_name: 'agent_core'
    static_configs:
      - targets: ['localhost:2113']   # Rust Agent Core /metrics

  - job_name: 'llm_service'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:8000']   # Python LLM Service /metrics
```

### Available Metrics

```
# HELP shannon_task_total Total number of tasks
# TYPE shannon_task_total counter
shannon_task_total{status="completed"} 1234
shannon_task_total{status="failed"} 12

# HELP shannon_task_duration_seconds Task execution duration
# TYPE shannon_task_duration_seconds histogram
shannon_task_duration_seconds_bucket{le="1.0"} 100
shannon_task_duration_seconds_bucket{le="5.0"} 450
```
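
For quick checks without a Prometheus server, the text format above can be parsed directly. The following sketch is a simplified parser (it skips lines with timestamps and does not handle escaped characters in label values) that computes the task failure ratio from `shannon_task_total`. `parse_samples` and `failure_ratio` are hypothetical helpers; in production, a PromQL query over the scraped data would normally do this job.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'  # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'           # optional {label="value",...}
    r'\s+(?P<value>\S+)$'                   # sample value
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text: str) -> list:
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match:
            labels = dict(LABEL_RE.findall(match.group("labels") or ""))
            samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def failure_ratio(samples: list) -> float:
    """failed / (all statuses) for shannon_task_total; 0.0 if no tasks yet."""
    counts = {
        labels.get("status"): value
        for name, labels, value in samples
        if name == "shannon_task_total"
    }
    total = sum(counts.values())
    return counts.get("failed", 0.0) / total if total else 0.0
```

With the sample values shown above, the ratio is 12 failed tasks out of 1246 total, or just under 1%.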

## Grafana Dashboards

Pre-built Grafana dashboards cover:

* Task analytics
* Cost tracking
* Performance monitoring
* Error analysis

## OpenTelemetry

Shannon supports OpenTelemetry for distributed tracing:

```python theme={null}
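from opentelemetry import trace
from shannon import ShannonClient

# Spans are only exported if a TracerProvider with an exporter has been
# configured (for example via the OpenTelemetry SDK); otherwise they are no-ops.
tracer = trace.get_tracer(__name__)

# Wrap submission and the status lookup in one span so they share a trace.
with tracer.start_as_current_span("analyze_data"):
    client = ShannonClient()
    handle = client.submit_task(query="Analyze dataset")
    result = client.get_status(handle.task_id)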
```

## Best Practices

1. **Set up alerts** for critical metrics
2. **Monitor costs** to prevent budget overruns
3. **Track error patterns** to identify issues
4. **Use distributed tracing** for debugging
5. **Archive logs** for compliance
6. **Create custom dashboards** for your use case
7. **Implement SLOs** for reliability
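
For the last item, one common way to make an SLO actionable is to track how much of its error budget has been spent. The function below is a generic sketch of that arithmetic, not a Shannon feature; `error_budget_remaining` is a hypothetical name, and the inputs could come from the `shannon_task_total` counters.

```python
def error_budget_remaining(total: int, failed: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent).

    With a 99% success SLO, the budget is the 1% of tasks allowed to fail.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total == 0:
        return 1.0  # nothing ran, so nothing was spent
    allowed_failures = (1 - slo_target) * total
    return 1 - failed / allowed_failures
```

A value near zero suggests slowing down risky changes; a negative value means the SLO has already been missed for the measured window.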

## Debugging

### Enable Debug Logging

```python theme={null}
import logging

from shannon import ShannonClient

# Set the root logger to DEBUG before creating the client
logging.basicConfig(level=logging.DEBUG)

# Now Shannon will output detailed logs
client = ShannonClient()
```

### Trace Requests

Use distributed tracing via OpenTelemetry to follow a request across services, or raise the log level in the individual services. Exporter setup depends on your tracing backend (for example, Jaeger or Tempo); refer to that backend's configuration.

## Next Steps

<CardGroup cols={2}>
  <Card title="Troubleshooting" icon="wrench" href="/en/quickstart/troubleshooting">
    Common issues and solutions
  </Card>

  <Card title="Cost Control" icon="dollar" href="/en/quickstart/concepts/cost-control">
    Manage and optimize costs
  </Card>
</CardGroup>
