AI Resilient (SRE)
Ensure your AI systems deliver low latency, high performance, and maximum availability through Site Reliability Engineering principles.
What is AI Resilient (SRE)?
AI Resilient (SRE) applies Site Reliability Engineering principles to the latency, performance, availability, and reliability of AI systems. It helps your AI agents and LLM applications meet their SLAs and SLOs and stay operationally healthy:
- Low Latency - Fast response times and optimized performance
- High Performance - Maximum throughput and efficiency
- Reliable - High availability and fault tolerance with error budgets
- Observable - Complete visibility into system behavior
- Scalable - Handle increasing load gracefully
- Monitored - Proactive alerting and incident response
Key Capabilities
1. End-to-End Tracing
Follow requests through your entire stack:
// Trace shows complete execution
Trace: customer-support-query (2.5s)
├─ validate-input (10ms)
├─ retrieve-context (200ms)
│  ├─ vector-search (150ms)
│  └─ fetch-history (50ms)
├─ llm-inference (2.0s)
│  ├─ prompt-construct (20ms)
│  ├─ api-call (1.95s)
│  └─ parse-response (30ms)
└─ format-output (290ms)
// Fetch the same trace programmatically
const trace = await ants.traces.get('trace_abc123')
console.log(trace.duration) // 2500ms
console.log(trace.spans.length) // 9 spans below the root
2. Performance Monitoring
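The p50, p95, and p99 figures reported in this section are latency percentiles. As a minimal sketch of how such a figure can be derived from raw samples, the snippet below uses the nearest-rank method. The `percentile` helper and the sample latencies are illustrative assumptions and not part of the `ants` SDK, which may compute percentiles differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds
latencies_ms = [120, 135, 150, 180, 210, 250, 400, 900, 1800, 2500]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"Latency p50: {p50}ms")
print(f"Latency p95: {p95}ms")
```

With only ten samples the p95 equals the maximum, which is why tail percentiles are only meaningful over a reasonably large number of requests.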
Track key performance indicators:
# Get performance metrics
metrics = ants.sre.get_metrics(
    agent='customer-support',
    period='last_24h'
)
print(f"Latency p50: {metrics.latency.p50}ms")
print(f"Latency p95: {metrics.latency.p95}ms")
print(f"Latency p99: {metrics.latency.p99}ms")
print(f"Throughput: {metrics.throughput} req/s")
print(f"Error rate: {metrics.error_rate}%")
print(f"Success rate: {metrics.success_rate}%")
3. Automated Alerting
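An alert condition such as an error rate above 5% over a 5-minute window is commonly evaluated over a sliding window of recent requests. The sketch below shows one way to do that. The `ErrorRateAlert` class is hypothetical, timestamps are plain seconds assumed to arrive in non-decreasing order, and nothing here describes how the platform itself evaluates conditions.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error rate over the last window_seconds exceeds threshold_pct."""

    def __init__(self, threshold_pct, window_seconds):
        self.threshold_pct = threshold_pct
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, is_error), oldest first
        self._errors = 0

    def record(self, timestamp, is_error):
        """Add one request outcome and drop outcomes that fell out of the window."""
        self._events.append((timestamp, is_error))
        if is_error:
            self._errors += 1
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, was_error = self._events.popleft()
            if was_error:
                self._errors -= 1

    def error_rate_pct(self):
        if not self._events:
            return 0.0
        return 100.0 * self._errors / len(self._events)

    def is_firing(self):
        return self.error_rate_pct() > self.threshold_pct


alert = ErrorRateAlert(threshold_pct=5.0, window_seconds=300)
for second in range(100):
    # 4 errors in 100 requests: below the threshold
    alert.record(timestamp=second, is_error=(second % 25 == 0))
print(alert.is_firing())

for second in (100, 101, 102):
    # three more errors push the rate above 5%
    alert.record(timestamp=second, is_error=True)
print(alert.is_firing())
```

Keeping a running error count means each request is processed in amortized constant time, which matters when the window holds many requests.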
Get notified before users complain:
// Create SLO-based alert
await ants.sre.createAlert({
  name: 'High Error Rate',
  condition: 'error_rate > 5%',
  window: '5m',
  severity: 'critical',
  channels: ['slack', 'pagerduty']
})
await ants.sre.createAlert({
  name: 'Slow Response',
  condition: 'p95_latency > 3000ms',
  window: '10m',
  severity: 'warning',
  channels: ['email', 'slack']
})
4. Incident Management
Quickly diagnose and resolve issues:
# Get active incidents
incidents = ants.sre.get_incidents(status='open')
for incident in incidents:
    print(f"Incident: {incident.title}")
    print(f" Severity: {incident.severity}")
    print(f" Started: {incident.start_time}")
    print(f" Affected: {incident.affected_users} users")
    # Get root cause analysis
    root_cause = ants.sre.analyze_incident(incident.id)
    print(f" Root cause: {root_cause.description}")
Service Level Objectives (SLOs)
Define SLOs
// Set SLOs for your service
await ants.sre.createSLO({
  name: 'Customer Support Availability',
  sli: 'availability',
  target: 99.9, // 99.9% uptime
  window: '30d'
})
await ants.sre.createSLO({
  name: 'Response Time',
  sli: 'latency',
  target: 2000, // p95 < 2000ms
  percentile: 95,
  window: '7d'
})
await ants.sre.createSLO({
  name: 'Error Budget',
  sli: 'error_rate',
  target: 1.0, // < 1% errors
  window: '30d'
})
Monitor SLO Compliance
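The error budget is the amount of failure an SLO still tolerates within its window. As a rough sketch of the arithmetic for a request-based SLO: the function names, the request counts, and the 25% "at-risk" cut-off below are all assumptions made for illustration, not values defined by the platform.

```python
def error_budget_remaining(target_pct, total_requests, failed_requests):
    """Percentage of the error budget still unspent (negative once the SLO is breached)."""
    if not 0 < target_pct < 100:
        raise ValueError("target_pct must be between 0 and 100, exclusive")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = total_requests * (100 - target_pct) / 100
    return 100 * (1 - failed_requests / allowed_failures)


def classify(remaining_pct, at_risk_below=25.0):
    """Map remaining budget to a status label; the 25% cut-off is an arbitrary assumption."""
    if remaining_pct < 0:
        return "breached"
    if remaining_pct < at_risk_below:
        return "at-risk"
    return "healthy"


# Hypothetical 30-day window: 1,000,000 requests, 250 failures, 99.9% target
remaining = error_budget_remaining(99.9, 1_000_000, 250)
print(f"Error budget remaining: {remaining:.1f}%")
print(f"Status: {classify(remaining)}")
```

In this example a 99.9% target over 1,000,000 requests allows about 1,000 failures, so 250 failures leave roughly three quarters of the budget.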
# Check SLO status
slo_status = ants.sre.get_slo_status('customer-support-availability')
print(f"SLO: {slo_status.name}")
print(f"Target: {slo_status.target}%")
print(f"Current: {slo_status.current}%")
print(f"Error budget remaining: {slo_status.error_budget}%")
print(f"Status: {slo_status.status}")  # healthy, at-risk, breached
Performance Optimization
Identify Bottlenecks
// Find slow operations
const bottlenecks = await ants.sre.findBottlenecks({
  agent: 'customer-support',
  threshold: 1000, // > 1 second
  period: 'last_7_days'
})
bottlenecks.forEach(bn => {
  console.log(`${bn.operation}: ${bn.p95}ms`)
  console.log(` Occurrences: ${bn.count}`)
  console.log(` Total time wasted: ${bn.total_time / 1000}s`)
})
Performance Trends
import matplotlib.pyplot as plt
# Track performance over time
trends = ants.sre.get_performance_trends(
    agent='customer-support',
    metrics=['latency_p95', 'throughput', 'error_rate'],
    period='last_30_days',
    granularity='daily'
)
# Plot trends
plt.plot(trends.dates, trends.latency_p95)
plt.title('P95 Latency Trend')
plt.show()
Error Tracking
Error Analysis
// Get error breakdown
const errors = await ants.sre.getErrors({
  agent: 'customer-support',
  period: 'last_24h',
  groupBy: 'error_type'
})
errors.forEach(error => {
  console.log(`${error.type}: ${error.count} occurrences`)
  console.log(` First seen: ${error.first_seen}`)
  console.log(` Last seen: ${error.last_seen}`)
  console.log(` Affected users: ${error.affected_users}`)
})
Error Rate Monitoring
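MTTR and MTBF can both be derived from incident start and end times. The sketch below uses one common convention: MTTR is total downtime divided by the number of incidents, and MTBF is total uptime divided by the number of incidents. Definitions vary between teams, and the helper name and incident data here are hypothetical.

```python
from datetime import datetime, timedelta


def mttr_and_mtbf(incidents, period_start, period_end):
    """Return (MTTR, MTBF) as timedeltas for incidents given as (start, end) pairs.

    MTTR = total downtime / number of incidents
    MTBF = total uptime / number of incidents
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    downtime = sum((end - start for start, end in incidents), timedelta())
    uptime = (period_end - period_start) - downtime
    count = len(incidents)
    return downtime / count, uptime / count


# Hypothetical 7-day period with three non-overlapping incidents (30, 45 and 15 minutes)
period_start = datetime(2024, 1, 1)
period_end = period_start + timedelta(days=7)
incidents = [
    (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 9, 30)),
    (datetime(2024, 1, 4, 14, 0), datetime(2024, 1, 4, 14, 45)),
    (datetime(2024, 1, 6, 22, 0), datetime(2024, 1, 6, 22, 15)),
]

mttr, mtbf = mttr_and_mtbf(incidents, period_start, period_end)
print(f"MTTR: {mttr.total_seconds() / 60:.1f} minutes")
print(f"MTBF: {mtbf.total_seconds() / 3600:.1f} hours")
```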
# Monitor error rates
error_metrics = ants.sre.get_error_metrics(
    agent='customer-support',
    period='last_7_days'
)
print(f"Total errors: {error_metrics.total}")
print(f"Error rate: {error_metrics.rate}%")
print(f"MTBF: {error_metrics.mtbf} hours")  # Mean time between failures
print(f"MTTR: {error_metrics.mttr} minutes")  # Mean time to recovery
Availability Monitoring
Uptime Tracking
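An uptime percentage and a downtime figure are two views of the same measurement. The following sketch shows the conversion in both directions for a time-based availability target. The helper names and the 21.6-minute downtime figure are illustrative assumptions.

```python
MINUTES_PER_DAY = 24 * 60


def uptime_percentage(period_days, downtime_minutes):
    """Share of the period during which the service was up, as a percentage."""
    period_minutes = period_days * MINUTES_PER_DAY
    return 100 * (period_minutes - downtime_minutes) / period_minutes


def allowed_downtime_minutes(target_pct, period_days):
    """Downtime a given availability target tolerates over the period."""
    return period_days * MINUTES_PER_DAY * (100 - target_pct) / 100


print(f"Uptime: {uptime_percentage(30, 21.6):.2f}%")
print(f"Allowed downtime at 99.9%: {allowed_downtime_minutes(99.9, 30):.1f} minutes")
```

Over 30 days a 99.9% target tolerates about 43.2 minutes of downtime, so 21.6 minutes of downtime corresponds to roughly 99.95% uptime.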
// Get uptime statistics
const uptime = await ants.sre.getUptime({
  agent: 'customer-support',
  period: 'last_30_days'
})
console.log(`Uptime: ${uptime.percentage}%`)
console.log(`Downtime: ${uptime.downtime_minutes} minutes`)
console.log(`Incidents: ${uptime.incident_count}`)
console.log(`MTTR: ${uptime.mttr} minutes`)
Health Checks
# Configure health checks
ants.sre.configure_health_check({
    'endpoint': '/api/health',
    'interval': 60,  # seconds
    'timeout': 5,
    'retries': 3,
    'expected_status': 200
})
# Get current health status
health = ants.sre.get_health_status('customer-support')
print(f"Status: {health.status}")  # healthy, degraded, down
print(f"Last check: {health.last_check}")
print(f"Response time: {health.response_time}ms")
Load Testing
Simulate Traffic
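A load test with a ramp-up period does not send the peak request rate from the first second. As a sketch of what a schedule with a 5-minute duration, 100 requests per second, and a 30-second ramp-up could look like, the snippet below assumes a linear ramp from zero that counts toward the total duration. Both assumptions, and the helper names, are illustrative; the actual ramp shape used by the load-testing service is not documented here.

```python
def target_rps(elapsed_s, peak_rps, ramp_up_s):
    """Target request rate at a given moment, assuming a linear ramp from 0 to peak."""
    if elapsed_s < 0:
        raise ValueError("elapsed_s must not be negative")
    if ramp_up_s <= 0 or elapsed_s >= ramp_up_s:
        return float(peak_rps)
    return peak_rps * elapsed_s / ramp_up_s


def expected_requests(duration_s, peak_rps, ramp_up_s):
    """Approximate total requests: area under the ramp plus the plateau.

    Assumes the ramp-up is part of the total duration and ramp_up_s <= duration_s.
    """
    if not 0 <= ramp_up_s <= duration_s:
        raise ValueError("ramp_up_s must be between 0 and duration_s")
    ramp_requests = peak_rps * ramp_up_s / 2
    plateau_requests = peak_rps * (duration_s - ramp_up_s)
    return ramp_requests + plateau_requests


print(target_rps(15, peak_rps=100, ramp_up_s=30))
print(expected_requests(duration_s=300, peak_rps=100, ramp_up_s=30))
```

Estimating the total request volume up front helps when a load test runs against rate-limited or billed LLM endpoints.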
// Run load test
const loadTest = await ants.sre.createLoadTest({
  agent: 'customer-support',
  duration: '5m',
  rps: 100, // 100 requests per second
  rampUp: '30s',
  regions: ['us-east-1', 'eu-west-1']
})
// Get results
const results = await loadTest.getResults()
console.log(`P95 latency: ${results.latency.p95}ms`)
console.log(`Error rate: ${results.error_rate}%`)
console.log(`Throughput: ${results.throughput} req/s`)
Dashboards
Real-Time Monitoring
┌─ Agent Performance Dashboard ────────────┐
│                                          │
│ Throughput:    45 req/s (↑ 5%)           │
│ Latency P95:   1,850ms (↓ 12%)           │
│ Error Rate:    0.3% (↓ 0.2%)             │
│ Availability:  99.95% (target: 99.9%)    │
│                                          │
│ Active Requests: 12                      │
│ Queue Depth:     3                       │
│ CPU Usage:       45%                     │
│ Memory:          2.1GB / 4GB             │
│                                          │
│ Recent Errors:                           │
│ • Timeout (2) - 5 min ago                │
│ • Rate limit (1) - 15 min ago            │
│                                          │
└──────────────────────────────────────────┘
Custom Dashboards
# Create custom dashboard
dashboard = ants.sre.create_dashboard({
    'name': 'Customer Support Overview',
    'widgets': [
        {'type': 'timeseries', 'metric': 'throughput'},
        {'type': 'gauge', 'metric': 'latency_p95'},
        {'type': 'counter', 'metric': 'error_count'},
        {'type': 'heatmap', 'metric': 'latency_distribution'}
    ],
    'refresh_interval': 30  # seconds
})
Incident Response
Automated Runbooks
// Define runbook
await ants.sre.createRunbook({
  name: 'High Error Rate Response',
  trigger: 'error_rate > 5%',
  steps: [
    { action: 'notify', channels: ['pagerduty'] },
    { action: 'scale', instances: 2 },
    { action: 'enable', feature: 'circuit_breaker' },
    { action: 'notify', channels: ['slack'], message: 'Mitigation applied' }
  ]
})
Post-Mortem Analysis
# Generate post-mortem
post_mortem = ants.sre.create_post_mortem(
    incident_id='inc_123',
    include_timeline=True,
    include_metrics=True,
    include_logs=True
)
# Export
post_mortem.export('incident_123_post_mortem.pdf')
Best Practices
1. Set Clear SLOs
// Define measurable objectives
await ants.sre.createSLO({
  name: 'API Availability',
  target: 99.9, // Specific target
  window: '30d' // Time window
})
2. Monitor Proactively
# Don't wait for users to report issues
ants.sre.create_alert({
    'condition': 'latency_p95 > 2000',
    'action': 'investigate_before_users_complain'
})
3. Automate Response
// Automated remediation
await ants.sre.createAutomation({
  trigger: 'high_error_rate',
  action: 'restart_service',
  conditions: { duration: '5m', threshold: 10 }
})
4. Learn from Incidents
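# Always do post-mortems
# Assumes get_incidents accepts status='closed', mirroring status='open' above
closed_incidents = ants.sre.get_incidents(status='closed')
for incident in closed_incidents:
    post_mortem = ants.sre.create_post_mortem(incident_id=incident.id)
    post_mortem.share_with_team()
Integration with Tools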
PagerDuty
await ants.sre.integrateWith('pagerduty', {
  apiKey: process.env.PAGERDUTY_API_KEY,
  serviceId: 'PXXXXXX',
  escalationPolicy: 'PXXXXXX'
})
Grafana
# Export metrics to Prometheus, which Grafana can query as a data source
ants.sre.configure_export({
    'provider': 'prometheus',
    'endpoint': 'http://prometheus:9090',
    'interval': 15  # seconds
})
Next Steps
- Distributed Tracing - Deep dive into tracing
- Metrics & Monitoring - Comprehensive metrics guide
- Alerting - Set up effective alerts