
AI Resilient (SRE)

Ensure your AI systems deliver low latency, high performance, and maximum availability through Site Reliability Engineering principles.

What is AI Resilient (SRE)?

AI Resilient (SRE) applies Site Reliability Engineering principles to the latency, performance, availability, and reliability of AI systems. It helps your AI agents and LLM applications meet their SLAs and SLOs and stay operationally healthy:

Low Latency - Fast response times and optimized performance
High Performance - Maximum throughput and efficiency
Reliable - High availability and fault tolerance with error budgets
Observable - Complete visibility into system behavior
Scalable - Handle increasing load gracefully
Monitored - Proactive alerting and incident response

Key Capabilities

1. End-to-End Tracing

Follow requests through your entire stack:

typescript

// Trace shows complete execution
//
// Trace: customer-support-query (2.5s)
// ├─ validate-input (10ms)
// ├─ retrieve-context (200ms)
// │  ├─ vector-search (150ms)
// │  └─ fetch-history (50ms)
// ├─ llm-inference (2.0s)
// │  ├─ prompt-construct (20ms)
// │  ├─ api-call (1.95s)
// │  └─ parse-response (30ms)
// └─ format-output (290ms)

// View in dashboard
const trace = await ants.traces.get('trace_abc123')
console.log(trace.duration)      // 2500ms
console.log(trace.spans.length)  // 9 spans
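To make the example trace above easier to reason about, here is a minimal, standalone sketch that rebuilds it as a tree and picks out the slowest leaf span. The `Span` class and the helper functions are hypothetical and are not part of the SDK; the sketch also assumes child spans run sequentially, so a parent's self time is its duration minus the sum of its children.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical in-memory span: a name, a duration, and child spans."""
    name: str
    duration_ms: int
    children: list["Span"] = field(default_factory=list)


def leaf_spans(span: Span) -> list[Span]:
    """Return every span in the tree that has no children."""
    if not span.children:
        return [span]
    leaves: list[Span] = []
    for child in span.children:
        leaves.extend(leaf_spans(child))
    return leaves


def self_time_ms(span: Span) -> int:
    """Time spent in the span itself, excluding its direct children."""
    return span.duration_ms - sum(c.duration_ms for c in span.children)


# The example trace from this page, rebuilt by hand.
trace = Span("customer-support-query", 2500, [
    Span("validate-input", 10),
    Span("retrieve-context", 200, [
        Span("vector-search", 150),
        Span("fetch-history", 50),
    ]),
    Span("llm-inference", 2000, [
        Span("prompt-construct", 20),
        Span("api-call", 1950),
        Span("parse-response", 30),
    ]),
    Span("format-output", 290),
])

slowest = max(leaf_spans(trace), key=lambda s: s.duration_ms)
print(f"Slowest leaf span: {slowest.name} ({slowest.duration_ms}ms)")
print(f"Root self time: {self_time_ms(trace)}ms")
```

In this trace the external `api-call` span dominates, which suggests looking at model latency before tuning retrieval or formatting.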

2. Performance Monitoring

Track key performance indicators:

python

# Get performance metrics
metrics = ants.sre.get_metrics(
    agent='customer-support',
    period='last_24h'
)

print(f"Latency p50: {metrics.latency.p50}ms")
print(f"Latency p95: {metrics.latency.p95}ms")
print(f"Latency p99: {metrics.latency.p99}ms")
print(f"Throughput: {metrics.throughput} req/s")
print(f"Error rate: {metrics.error_rate}%")
print(f"Success rate: {metrics.success_rate}%")
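For readers unfamiliar with p50, p95, and p99, the sketch below shows one common way to compute a percentile from raw latency samples, the nearest-rank method. It is a standalone illustration; the `percentile` helper is hypothetical, and the platform may use a different method (for example, interpolation or histogram buckets).

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic latencies of 1ms..100ms, purely for illustration.
latencies_ms = list(range(1, 101))
for pct in (50, 95, 99):
    print(f"Latency p{pct}: {percentile(latencies_ms, pct)}ms")
```

Tail percentiles such as p95 and p99 matter because an average can hide the slow requests that users actually notice.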

3. Automated Alerting

Get notified before users complain:

typescript

// Create SLO-based alert
await ants.sre.createAlert({
  name: 'High Error Rate',
  condition: 'error_rate > 5%',
  window: '5m',
  severity: 'critical',
  channels: ['slack', 'pagerduty']
})

await ants.sre.createAlert({
  name: 'Slow Response',
  condition: 'p95_latency > 3000ms',
  window: '10m',
  severity: 'warning',
  channels: ['email', 'slack']
})
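As a rough illustration of what a condition like `error_rate > 5%` over a `5m` window means, the following standalone sketch evaluates it against a list of request outcomes. `RequestOutcome`, `error_rate_pct`, and `should_alert` are hypothetical names, and this is only one plausible interpretation of a windowed condition, not the platform's actual evaluation logic.

```python
from dataclasses import dataclass


@dataclass
class RequestOutcome:
    timestamp: float  # seconds since some epoch
    ok: bool          # True if the request succeeded


def error_rate_pct(outcomes: list[RequestOutcome], now: float, window_s: float) -> float:
    """Percentage of failed requests among those seen in the last window_s seconds."""
    recent = [o for o in outcomes if now - window_s <= o.timestamp <= now]
    if not recent:
        return 0.0  # no traffic in the window: nothing to alert on
    failures = sum(1 for o in recent if not o.ok)
    return 100 * failures / len(recent)


def should_alert(outcomes: list[RequestOutcome], now: float,
                 window_s: float = 300, threshold_pct: float = 5.0) -> bool:
    """True when the windowed error rate is strictly above the threshold."""
    return error_rate_pct(outcomes, now, window_s) > threshold_pct


# 100 requests in the last 100 seconds, 6 of them failed.
now = 1_000.0
outcomes = [RequestOutcome(now - i, ok=(i >= 6)) for i in range(100)]
print(f"Error rate: {error_rate_pct(outcomes, now, 300):.1f}%")
print(f"Alert: {should_alert(outcomes, now)}")
```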

4. Incident Management

Quickly diagnose and resolve issues:

python

# Get active incidents
incidents = ants.sre.get_incidents(status='open')

for incident in incidents:
    print(f"Incident: {incident.title}")
    print(f"  Severity: {incident.severity}")
    print(f"  Started: {incident.start_time}")
    print(f"  Affected: {incident.affected_users} users")

    # Get root cause analysis
    root_cause = ants.sre.analyze_incident(incident.id)
    print(f"  Root cause: {root_cause.description}")

Service Level Objectives (SLOs)

Define SLOs

typescript

// Set SLOs for your service
await ants.sre.createSLO({
  name: 'Customer Support Availability',
  sli: 'availability',
  target: 99.9,  // 99.9% uptime
  window: '30d'
})

await ants.sre.createSLO({
  name: 'Response Time',
  sli: 'latency',
  target: 2000,  // p95 < 2000ms
  percentile: 95,
  window: '7d'
})

await ants.sre.createSLO({
  name: 'Error Budget',
  sli: 'error_rate',
  target: 1.0,  // < 1% errors
  window: '30d'
})

Monitor SLO Compliance

python

# Check SLO status
slo_status = ants.sre.get_slo_status('customer-support-availability')

print(f"SLO: {slo_status.name}")
print(f"Target: {slo_status.target}%")
print(f"Current: {slo_status.current}%")
print(f"Error budget remaining: {slo_status.error_budget}%")
print(f"Status: {slo_status.status}")  # healthy, at-risk, breached
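To show where an "error budget remaining" figure can come from, here is a small standalone calculation for an availability SLO. The function names are hypothetical, and the sketch assumes the budget is measured in minutes of downtime over the SLO window; the platform may compute it from request counts instead.

```python
def error_budget_minutes(target_pct: float, window_days: int) -> float:
    """Minutes of downtime an availability target allows over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - target_pct) / 100


def budget_remaining_pct(target_pct: float, window_days: int,
                         downtime_minutes: float) -> float:
    """Share of the error budget still unspent; negative once the SLO is breached."""
    budget = error_budget_minutes(target_pct, window_days)
    if budget <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return 100 * (budget - downtime_minutes) / budget


budget = error_budget_minutes(99.9, 30)
remaining = budget_remaining_pct(99.9, 30, downtime_minutes=21.6)
print(f"Allowed downtime: {budget:.1f} minutes")
print(f"Error budget remaining: {remaining:.1f}%")
```

A 99.9% target over 30 days allows about 43.2 minutes of downtime, so 21.6 minutes of downtime leaves roughly half the budget.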

Performance Optimization

Identify Bottlenecks

typescript

// Find slow operations
const bottlenecks = await ants.sre.findBottlenecks({
  agent: 'customer-support',
  threshold: 1000,  // > 1 second
  period: 'last_7_days'
})

bottlenecks.forEach(bn => {
  console.log(`${bn.operation}: ${bn.p95}ms`)
  console.log(`  Occurrences: ${bn.count}`)
  console.log(`  Total time wasted: ${bn.total_time / 1000}s`)
})

Performance Trends

python

import matplotlib.pyplot as plt

# Track performance over time
trends = ants.sre.get_performance_trends(
    agent='customer-support',
    metrics=['latency_p95', 'throughput', 'error_rate'],
    period='last_30_days',
    granularity='daily'
)

# Plot trends
plt.plot(trends.dates, trends.latency_p95)
plt.title('P95 Latency Trend')
plt.show()

Error Tracking

Error Analysis

typescript

// Get error breakdown
const errors = await ants.sre.getErrors({
  agent: 'customer-support',
  period: 'last_24h',
  groupBy: 'error_type'
})

errors.forEach(error => {
  console.log(`${error.type}: ${error.count} occurrences`)
  console.log(`  First seen: ${error.first_seen}`)
  console.log(`  Last seen: ${error.last_seen}`)
  console.log(`  Affected users: ${error.affected_users}`)
})

Error Rate Monitoring

python

# Monitor error rates
error_metrics = ants.sre.get_error_metrics(
    agent='customer-support',
    period='last_7_days'
)

print(f"Total errors: {error_metrics.total}")
print(f"Error rate: {error_metrics.rate}%")
print(f"MTBF: {error_metrics.mtbf} hours")    # Mean time between failures
print(f"MTTR: {error_metrics.mttr} minutes")  # Mean time to recovery
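MTBF and MTTR are easy to confuse, so the following standalone sketch derives both from a list of incident intervals. It uses the common definitions (MTBF as total uptime divided by the number of failures, MTTR as total downtime divided by the number of failures); the `reliability_summary` helper is hypothetical, and the platform's exact definitions may differ.

```python
from datetime import datetime, timedelta


def reliability_summary(period_start: datetime, period_end: datetime,
                        incidents: list[tuple[datetime, datetime]]) -> dict[str, float]:
    """Compute MTBF (hours) and MTTR (minutes) from (start, end) incident intervals."""
    if not incidents:
        raise ValueError("MTBF and MTTR are undefined without at least one incident")
    downtime = sum((end - start for start, end in incidents), timedelta())
    uptime = (period_end - period_start) - downtime
    count = len(incidents)
    return {
        "mtbf_hours": uptime.total_seconds() / 3600 / count,
        "mttr_minutes": downtime.total_seconds() / 60 / count,
    }


# Illustrative data: a 7-day period with three outages of 20, 30, and 40 minutes.
start = datetime(2024, 1, 1)
end = start + timedelta(days=7)
incidents = [
    (start + timedelta(days=1), start + timedelta(days=1, minutes=20)),
    (start + timedelta(days=3), start + timedelta(days=3, minutes=30)),
    (start + timedelta(days=5), start + timedelta(days=5, minutes=40)),
]

summary = reliability_summary(start, end, incidents)
print(f"MTBF: {summary['mtbf_hours']:.1f} hours")
print(f"MTTR: {summary['mttr_minutes']:.1f} minutes")
```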

Availability Monitoring

Uptime Tracking

typescript

// Get uptime statistics
const uptime = await ants.sre.getUptime({
  agent: 'customer-support',
  period: 'last_30_days'
})

console.log(`Uptime: ${uptime.percentage}%`)
console.log(`Downtime: ${uptime.downtime_minutes} minutes`)
console.log(`Incidents: ${uptime.incident_count}`)
console.log(`MTTR: ${uptime.mttr} minutes`)

Health Checks

python

# Configure health checks
ants.sre.configure_health_check({
    'endpoint': '/api/health',
    'interval': 60,  # seconds
    'timeout': 5,
    'retries': 3,
    'expected_status': 200
})

# Get current health status
health = ants.sre.get_health_status('customer-support')
print(f"Status: {health.status}")  # healthy, degraded, down
print(f"Last check: {health.last_check}")
print(f"Response time: {health.response_time}ms")
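The `retries` and `expected_status` settings above can be pictured with a small standalone sketch. `check_with_retries` is a hypothetical helper, and it assumes `retries` means extra attempts after the first one; the platform may count attempts differently. The probe is passed in as a callable so the logic can be tried without a network.

```python
from typing import Callable


def check_with_retries(probe: Callable[[], int], retries: int = 3,
                       expected_status: int = 200) -> bool:
    """Run probe up to 1 + retries times; pass as soon as one attempt matches."""
    for _ in range(1 + retries):
        try:
            if probe() == expected_status:
                return True
        except Exception:
            # Treat timeouts and connection errors as a failed attempt.
            pass
    return False


# A flaky endpoint that fails twice before recovering (simulated).
responses = iter([503, 503, 200])
print(check_with_retries(lambda: next(responses)))
```

Retrying before declaring a failure reduces false alarms from a single dropped request, at the cost of slower detection.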

Load Testing

Simulate Traffic

typescript

// Run load test
const loadTest = await ants.sre.createLoadTest({
  agent: 'customer-support',
  duration: '5m',
  rps: 100,  // 100 requests per second
  rampUp: '30s',
  regions: ['us-east-1', 'eu-west-1']
})

// Get results
const results = await loadTest.getResults()
console.log(`P95 latency: ${results.latency.p95}ms`)
console.log(`Error rate: ${results.error_rate}%`)
console.log(`Throughput: ${results.throughput} req/s`)
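To estimate how much traffic a test like the one above generates, the sketch below models the ramp-up as a linear increase from 0 to the target rate. Both helpers are hypothetical, and the model assumes the ramp-up is linear and counts toward the total duration; the actual load generator may behave differently.

```python
def target_rps(t_s: float, peak_rps: float, ramp_up_s: float) -> float:
    """Request rate at time t_s, ramping linearly from 0 to peak_rps."""
    if t_s < 0:
        return 0.0
    if ramp_up_s <= 0 or t_s >= ramp_up_s:
        return float(peak_rps)
    return peak_rps * t_s / ramp_up_s


def total_requests(duration_s: float, peak_rps: float, ramp_up_s: float) -> float:
    """Approximate request count: triangular ramp area plus the steady-state rectangle."""
    ramp = min(ramp_up_s, duration_s)
    ramp_part = 0.5 * ramp * target_rps(ramp, peak_rps, ramp_up_s)
    steady_part = peak_rps * max(0.0, duration_s - ramp_up_s)
    return ramp_part + steady_part


# 5 minutes at 100 req/s with a 30-second ramp-up.
print(f"Rate at 15s: {target_rps(15, 100, 30):.0f} req/s")
print(f"Total requests: {total_requests(300, 100, 30):.0f}")
```

Under these assumptions the test sends about 28,500 requests, which is useful when budgeting for LLM API costs before running it.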

Dashboards

Real-Time Monitoring

code

┌─ Agent Performance Dashboard ───────────┐
│                                         │
│  Throughput: 45 req/s (↑ 5%)            │
│  Latency P95: 1,850ms (↓ 12%)           │
│  Error Rate: 0.3% (↓ 0.2%)              │
│  Availability: 99.95% (target: 99.9%)   │
│                                         │
│  Active Requests: 12                    │
│  Queue Depth: 3                         │
│  CPU Usage: 45%                         │
│  Memory: 2.1GB / 4GB                    │
│                                         │
│  Recent Errors:                         │
│  • Timeout (2) - 5 min ago              │
│  • Rate limit (1) - 15 min ago          │
│                                         │
└─────────────────────────────────────────┘

Custom Dashboards

python

# Create custom dashboard
dashboard = ants.sre.create_dashboard({
    'name': 'Customer Support Overview',
    'widgets': [
        {'type': 'timeseries', 'metric': 'throughput'},
        {'type': 'gauge', 'metric': 'latency_p95'},
        {'type': 'counter', 'metric': 'error_count'},
        {'type': 'heatmap', 'metric': 'latency_distribution'}
    ],
    'refresh_interval': 30  # seconds
})

Incident Response

Automated Runbooks

typescript

// Define runbook
await ants.sre.createRunbook({
  name: 'High Error Rate Response',
  trigger: 'error_rate > 5%',
  steps: [
    { action: 'notify', channels: ['pagerduty'] },
    { action: 'scale', instances: 2 },
    { action: 'enable', feature: 'circuit_breaker' },
    { action: 'notify', channels: ['slack'], message: 'Mitigation applied' }
  ]
})
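The runbook above enables a `circuit_breaker` feature. For readers new to the pattern, here is a minimal standalone sketch of a circuit breaker; the class, its parameters, and its thresholds are hypothetical and do not describe how the platform implements the feature.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        """'closed' (normal), 'open' (rejecting), or 'half-open' (one trial allowed)."""
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"
        return "open"

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Trip at the threshold, or immediately if a half-open trial fails.
            if self._failures >= self.failure_threshold or self._opened_at is not None:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping a downstream call as `breaker.call(fn, ...)` fails fast while the dependency is unhealthy, which protects both the caller's latency and the struggling service.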

Post-Mortem Analysis

python

# Generate post-mortem
post_mortem = ants.sre.create_post_mortem(
    incident_id='inc_123',
    include_timeline=True,
    include_metrics=True,
    include_logs=True
)

# Export
post_mortem.export('incident_123_post_mortem.pdf')

Best Practices

1. Set Clear SLOs

typescript

// Define measurable objectives
await ants.sre.createSLO({
  name: 'API Availability',
  target: 99.9,  // Specific target
  window: '30d'  // Time window
})

2. Monitor Proactively

python

# Don't wait for users to report issues
ants.sre.create_alert({
    'condition': 'latency_p95 > 2000',
    'action': 'investigate_before_users_complain'
})

3. Automate Response

typescript

// Automated remediation
await ants.sre.createAutomation({
  trigger: 'high_error_rate',
  action: 'restart_service',
  conditions: {
    duration: '5m',
    threshold: 10
  }
})

4. Learn from Incidents

python

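# Always do post-mortems
# Assumes get_incidents accepts a 'closed' status filter, like 'open' above
closed_incidents = ants.sre.get_incidents(status='closed')

for incident in closed_incidents:
    post_mortem = ants.sre.create_post_mortem(incident.id)
    post_mortem.share_with_team()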

Integration with Tools

PagerDuty

typescript

await ants.sre.integrateWith('pagerduty', {
  apiKey: process.env.PAGERDUTY_API_KEY,
  serviceId: 'PXXXXXX',
  escalationPolicy: 'PXXXXXX'
})

Grafana

python

# Export metrics to Prometheus, which Grafana can read as a data source
ants.sre.configure_export({
    'provider': 'prometheus',
    'endpoint': 'http://prometheus:9090',
    'interval': 15  # seconds
})

Next Steps

Distributed Tracing - Deep dive into tracing
Metrics & Monitoring - Comprehensive metrics guide
Alerting - Set up effective alerts
