AI Resilient (SRE)

Ensure your AI systems deliver low latency, high performance, and maximum availability through Site Reliability Engineering principles.

What is AI Resilient (SRE)?

AI Resilient (SRE) applies Site Reliability Engineering principles to AI systems, with a focus on latency, performance, availability, and reliability. It helps your AI agents and LLM applications meet their SLAs and SLOs and stay operationally healthy:

  • Low Latency - Fast response times and optimized performance
  • High Performance - Maximum throughput and efficiency
  • Reliable - High availability and fault tolerance with error budgets
  • Observable - Complete visibility into system behavior
  • Scalable - Handle increasing load gracefully
  • Monitored - Proactive alerting and incident response

Key Capabilities

1. End-to-End Tracing

Follow requests through your entire stack:

typescript
// Trace shows complete execution
// Trace: customer-support-query (2.5s)
//   validate-input (10ms)
//   retrieve-context (200ms)
//     vector-search (150ms)
//     fetch-history (50ms)
//   llm-inference (2.0s)
//     prompt-construct (20ms)
//     api-call (1.95s)
//     parse-response (30ms)
//   format-output (290ms)

// View in dashboard
const trace = await ants.traces.get('trace_abc123')
console.log(trace.duration) // 2500ms
console.log(trace.spans.length) // 8 spans
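When reading a trace like the one above, it can help to separate each span's own time from the time spent in its children. The following is a minimal, self-contained sketch of that calculation. The `Span` class and the `self_time_ms` and `walk` helpers are hypothetical illustrations, not part of the ANTS SDK, and the sketch assumes child spans run one after another rather than concurrently. Durations are copied from the example trace.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical span record: a name, a duration, and child spans."""
    name: str
    duration_ms: int
    children: list = field(default_factory=list)


def self_time_ms(span: Span) -> int:
    """Time spent in the span itself, excluding its direct children."""
    return span.duration_ms - sum(c.duration_ms for c in span.children)


def walk(span: Span):
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


# Durations taken from the example trace (assumed sequential children).
trace = Span("customer-support-query", 2500, [
    Span("validate-input", 10),
    Span("retrieve-context", 200, [
        Span("vector-search", 150),
        Span("fetch-history", 50),
    ]),
    Span("llm-inference", 2000, [
        Span("prompt-construct", 20),
        Span("api-call", 1950),
        Span("parse-response", 30),
    ]),
    Span("format-output", 290),
])

# The span with the largest self time is usually the first place to look.
slowest = max(walk(trace), key=self_time_ms)
print(f"{slowest.name}: {self_time_ms(slowest)}ms of self time")
```

In this example the parent spans have no self time of their own, so the external `api-call` dominates the request.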

2. Performance Monitoring

Track key performance indicators:

python
# Get performance metrics
metrics = ants.sre.get_metrics(
    agent='customer-support',
    period='last_24h'
)

print(f"Latency p50: {metrics.latency.p50}ms")
print(f"Latency p95: {metrics.latency.p95}ms")
print(f"Latency p99: {metrics.latency.p99}ms")
print(f"Throughput: {metrics.throughput} req/s")
print(f"Error rate: {metrics.error_rate}%")
print(f"Success rate: {metrics.success_rate}%")
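To make the p50, p95, and p99 figures above concrete, here is a small sketch of how a latency percentile can be computed from raw samples with the nearest-rank method. It is an illustration only: the `percentile` helper and the sample values are hypothetical, and the platform may use a different estimator.

```python
import math


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds, with one slow outlier.
latencies_ms = [120, 135, 150, 180, 210, 240, 300, 450, 900, 2400]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(f"p50={p50}ms p95={p95}ms p99={p99}ms")
```

The median barely notices the single slow request, while the tail percentiles are set by it. This is why latency SLOs are usually written against p95 or p99 rather than the average.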

3. Automated Alerting

Get notified before users complain:

typescript
// Create SLO-based alert
await ants.sre.createAlert({
  name: 'High Error Rate',
  condition: 'error_rate > 5%',
  window: '5m',
  severity: 'critical',
  channels: ['slack', 'pagerduty']
})

await ants.sre.createAlert({
  name: 'Slow Response',
  condition: 'p95_latency > 3000ms',
  window: '10m',
  severity: 'warning',
  channels: ['email', 'slack']
})
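A condition such as `error_rate > 5%` over a `5m` window can be pictured as a sliding-window check. The sketch below is one possible way to evaluate it and does not describe how the platform implements alerts. The `ErrorRateAlert` class and its method names are hypothetical, and timestamps are plain seconds so the example stays deterministic.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error rate over a sliding time window exceeds a threshold."""

    def __init__(self, threshold_pct: float, window_s: float):
        self.threshold_pct = threshold_pct
        self.window_s = window_s
        self._events = deque()  # (timestamp, is_error), oldest first

    def record(self, timestamp: float, is_error: bool) -> None:
        self._events.append((timestamp, is_error))

    def error_rate_pct(self, now: float) -> float:
        # Drop events that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()
        if not self._events:
            return 0.0  # no traffic in the window counts as no errors
        errors = sum(1 for _, is_error in self._events if is_error)
        return 100.0 * errors / len(self._events)

    def should_fire(self, now: float) -> bool:
        return self.error_rate_pct(now) > self.threshold_pct


alert = ErrorRateAlert(threshold_pct=5.0, window_s=300)  # error_rate > 5% over 5m

# 100 requests, one per second, with 4 failures: 4% error rate.
for t in range(100):
    alert.record(timestamp=t, is_error=(t % 25 == 0))
quiet = alert.should_fire(now=100)

# A burst of 10 consecutive failures pushes the rate to 14 out of 110.
for t in range(100, 110):
    alert.record(timestamp=t, is_error=True)
firing = alert.should_fire(now=110)

print(f"before burst: {quiet}, after burst: {firing}")
```

Treating an empty window as a 0% error rate is a design choice. In practice a minimum request count per window is often added so that a single failed request during low traffic does not page anyone.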

4. Incident Management

Quickly diagnose and resolve issues:

python
# Get active incidents
incidents = ants.sre.get_incidents(status='open')

for incident in incidents:
    print(f"Incident: {incident.title}")
    print(f"  Severity: {incident.severity}")
    print(f"  Started: {incident.start_time}")
    print(f"  Affected: {incident.affected_users} users")

    # Get root cause analysis
    root_cause = ants.sre.analyze_incident(incident.id)
    print(f"  Root cause: {root_cause.description}")

Service Level Objectives (SLOs)

Define SLOs

typescript
// Set SLOs for your service
await ants.sre.createSLO({
  name: 'Customer Support Availability',
  sli: 'availability',
  target: 99.9, // 99.9% uptime
  window: '30d'
})

await ants.sre.createSLO({
  name: 'Response Time',
  sli: 'latency',
  target: 2000, // p95 < 2000ms
  percentile: 95,
  window: '7d'
})

await ants.sre.createSLO({
  name: 'Error Budget',
  sli: 'error_rate',
  target: 1.0, // < 1% errors
  window: '30d'
})

Monitor SLO Compliance

python
# Check SLO status
slo_status = ants.sre.get_slo_status('customer-support-availability')

print(f"SLO: {slo_status.name}")
print(f"Target: {slo_status.target}%")
print(f"Current: {slo_status.current}%")
print(f"Error budget remaining: {slo_status.error_budget}%")
print(f"Status: {slo_status.status}")  # healthy, at-risk, breached
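The error budget figure above follows from simple arithmetic on the SLO target. The sketch below shows one common way to derive it from request counts. All function names are hypothetical, and the 25% cut-off for "at-risk" is an assumption made for illustration, not a documented platform threshold.

```python
def allowed_downtime_minutes(target_pct: float, window_days: int) -> float:
    """Downtime that an availability target permits over the window."""
    return (100.0 - target_pct) / 100.0 * window_days * 24 * 60


def error_budget_remaining_pct(target_pct: float, good: int, total: int) -> float:
    """Share of the error budget still unspent (negative once overspent)."""
    if total == 0:
        return 100.0
    allowed_bad = (100.0 - target_pct) / 100.0 * total
    actual_bad = total - good
    if allowed_bad == 0:
        return 100.0 if actual_bad == 0 else float("-inf")
    return 100.0 * (1 - actual_bad / allowed_bad)


def classify(remaining_pct: float, at_risk_below: float = 25.0) -> str:
    """Map remaining budget to a status; the 25% threshold is an assumption."""
    if remaining_pct <= 0:
        return "breached"
    if remaining_pct < at_risk_below:
        return "at-risk"
    return "healthy"


# A 99.9% target over 1,000,000 requests allows about 1,000 failures.
remaining = error_budget_remaining_pct(99.9, good=999_600, total=1_000_000)

print(f"Allowed downtime (30d): {allowed_downtime_minutes(99.9, 30):.1f} min")
print(f"Error budget remaining: {remaining:.1f}%")
print(f"Status: {classify(remaining)}")
```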

Performance Optimization

Identify Bottlenecks

typescript
// Find slow operations
const bottlenecks = await ants.sre.findBottlenecks({
  agent: 'customer-support',
  threshold: 1000, // > 1 second
  period: 'last_7_days'
})

bottlenecks.forEach(bn => {
  console.log(`${bn.operation}: ${bn.p95}ms`)
  console.log(`  Occurrences: ${bn.count}`)
  console.log(`  Total time wasted: ${bn.total_time / 1000}s`)
})

python
import matplotlib.pyplot as plt

# Track performance over time
trends = ants.sre.get_performance_trends(
    agent='customer-support',
    metrics=['latency_p95', 'throughput', 'error_rate'],
    period='last_30_days',
    granularity='daily'
)

# Plot trends
plt.plot(trends.dates, trends.latency_p95)
plt.title('P95 Latency Trend')
plt.show()

Error Tracking

Error Analysis

typescript
// Get error breakdown
const errors = await ants.sre.getErrors({
  agent: 'customer-support',
  period: 'last_24h',
  groupBy: 'error_type'
})

errors.forEach(error => {
  console.log(`${error.type}: ${error.count} occurrences`)
  console.log(`  First seen: ${error.first_seen}`)
  console.log(`  Last seen: ${error.last_seen}`)
  console.log(`  Affected users: ${error.affected_users}`)
})
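The breakdown above (count, first seen, last seen, affected users per error type) is a straightforward grouping over raw error events. The following sketch shows the idea with hypothetical event records. The field names `type`, `ts`, and `user` are assumptions for this example and may not match the platform's event schema.

```python
def group_errors(events: list) -> dict:
    """Group error events by type, tracking count, time range, and distinct users."""
    groups = {}
    for event in events:
        group = groups.setdefault(event["type"], {
            "count": 0,
            "first_seen": event["ts"],
            "last_seen": event["ts"],
            "users": set(),
        })
        group["count"] += 1
        group["first_seen"] = min(group["first_seen"], event["ts"])
        group["last_seen"] = max(group["last_seen"], event["ts"])
        group["users"].add(event["user"])
    return groups


# Hypothetical events; ts is seconds since the epoch, in arbitrary order.
events = [
    {"type": "Timeout", "ts": 1000, "user": "u1"},
    {"type": "RateLimit", "ts": 1010, "user": "u2"},
    {"type": "Timeout", "ts": 1100, "user": "u1"},
    {"type": "Timeout", "ts": 900, "user": "u3"},
]

groups = group_errors(events)

# Most frequent error types first.
for error_type, g in sorted(groups.items(), key=lambda kv: -kv[1]["count"]):
    print(f"{error_type}: {g['count']} occurrences, "
          f"{len(g['users'])} affected users")
```

Counting distinct users rather than raw events keeps one user retrying in a loop from looking like a widespread outage.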

Error Rate Monitoring

python
# Monitor error rates
error_metrics = ants.sre.get_error_metrics(
    agent='customer-support',
    period='last_7_days'
)

print(f"Total errors: {error_metrics.total}")
print(f"Error rate: {error_metrics.rate}%")
print(f"MTBF: {error_metrics.mtbf} hours")  # Mean time between failures
print(f"MTTR: {error_metrics.mttr} minutes")  # Mean time to recovery
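For readers unfamiliar with MTBF and MTTR, the sketch below computes both from a list of incidents using the common definitions: MTTR is the mean incident duration, and MTBF is the operational time in the period divided by the number of failures. The helper names and the incident data are hypothetical, and the platform's exact definitions may differ.

```python
from datetime import datetime, timedelta


def total_downtime(incidents: list) -> timedelta:
    """Sum of incident durations; each incident is a (start, resolved) pair."""
    return sum((end - start for start, end in incidents), timedelta())


def mttr(incidents: list) -> timedelta:
    """Mean time to recovery: average incident duration."""
    if not incidents:
        raise ValueError("no incidents in the period")
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents: list, period_start: datetime, period_end: datetime) -> timedelta:
    """Mean time between failures: operational time divided by failure count."""
    if not incidents:
        raise ValueError("no incidents in the period")
    uptime = (period_end - period_start) - total_downtime(incidents)
    return uptime / len(incidents)


# Hypothetical 7-day period with three non-overlapping incidents.
period_start = datetime(2026, 6, 1)
period_end = period_start + timedelta(days=7)
incidents = [
    (datetime(2026, 6, 2, 10, 0), datetime(2026, 6, 2, 10, 30)),  # 30 min
    (datetime(2026, 6, 4, 14, 0), datetime(2026, 6, 4, 15, 0)),   # 60 min
    (datetime(2026, 6, 6, 8, 0), datetime(2026, 6, 6, 9, 30)),    # 90 min
]

print(f"MTTR: {mttr(incidents).total_seconds() / 60:.0f} minutes")
print(f"MTBF: {mtbf(incidents, period_start, period_end).total_seconds() / 3600:.0f} hours")
```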

Availability Monitoring

Uptime Tracking

typescript
// Get uptime statistics
const uptime = await ants.sre.getUptime({
  agent: 'customer-support',
  period: 'last_30_days'
})

console.log(`Uptime: ${uptime.percentage}%`)
console.log(`Downtime: ${uptime.downtime_minutes} minutes`)
console.log(`Incidents: ${uptime.incident_count}`)
console.log(`MTTR: ${uptime.mttr} minutes`)
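Uptime percentage and downtime minutes are two views of the same data. One detail that is easy to get wrong is overlapping outages, which should be counted once. The sketch below merges overlapping intervals before computing uptime. The function names and outage data are hypothetical, and intervals are given as minutes from the start of the period.

```python
def downtime_minutes(outages: list) -> float:
    """Total minutes covered by outage intervals, counting overlaps only once."""
    total = 0.0
    current_start = current_end = None
    for start, end in sorted(outages):
        if current_end is None or start > current_end:
            # Close the previous merged interval and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps the current interval: extend it if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def uptime_pct(outages: list, period_minutes: float) -> float:
    """Uptime as a percentage of the period."""
    return 100.0 * (1 - downtime_minutes(outages) / period_minutes)


# Hypothetical outages; the first two overlap and merge into 100..120.
outages = [(100, 110), (105, 120), (5000, 5002)]
period = 30 * 24 * 60  # 30 days in minutes

down = downtime_minutes(outages)
up = uptime_pct(outages, period)
print(f"Downtime: {down:.0f} minutes, uptime: {up:.3f}%")
```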

Health Checks

python
# Configure health checks
ants.sre.configure_health_check({
    'endpoint': '/api/health',
    'interval': 60,  # seconds
    'timeout': 5,
    'retries': 3,
    'expected_status': 200
})

# Get current health status
health = ants.sre.get_health_status('customer-support')
print(f"Status: {health.status}")  # healthy, degraded, down
print(f"Last check: {health.last_check}")
print(f"Response time: {health.response_time}ms")
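To illustrate how `retries` and `expected_status` might interact, here is a minimal sketch of a health check loop. It is not the platform's implementation. The `check_health` and `scripted_probe` helpers are hypothetical, and the mapping used here (success on the first attempt is "healthy", success only after a retry is "degraded", no success is "down") is an assumption chosen for illustration. The probe is passed in as a function so the example runs without a network.

```python
from typing import Callable


def check_health(probe: Callable[[], int], retries: int = 3,
                 expected_status: int = 200) -> str:
    """Run the probe up to 1 + retries times and classify the result."""
    for attempt in range(1 + retries):
        try:
            if probe() == expected_status:
                return "healthy" if attempt == 0 else "degraded"
        except Exception:
            pass  # errors such as timeouts count as a failed attempt
    return "down"


def scripted_probe(responses: list) -> Callable[[], int]:
    """Test double: returns (or raises) the given responses in order."""
    remaining = iter(responses)

    def probe() -> int:
        response = next(remaining)
        if isinstance(response, Exception):
            raise response
        return response

    return probe


first_try = check_health(scripted_probe([200]))
after_retry = check_health(scripted_probe([500, TimeoutError(), 200]))
never = check_health(scripted_probe([500, 500, 500, 500]))
print(first_try, after_retry, never)
```

In a real check, the probe would issue an HTTP request to the configured endpoint with the configured timeout and return the response status code.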

Load Testing

Simulate Traffic

typescript
// Run load test
const loadTest = await ants.sre.createLoadTest({
  agent: 'customer-support',
  duration: '5m',
  rps: 100, // 100 requests per second
  rampUp: '30s',
  regions: ['us-east-1', 'eu-west-1']
})

// Get results
const results = await loadTest.getResults()
console.log(`P95 latency: ${results.latency.p95}ms`)
console.log(`Error rate: ${results.error_rate}%`)
console.log(`Throughput: ${results.throughput} req/s`)
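The `rps`, `rampUp`, and `duration` settings together describe a traffic profile. The sketch below shows one plausible interpretation, a linear ramp to the target rate followed by steady load. The `ramp_schedule` helper is hypothetical, and the platform may shape its ramp differently.

```python
def ramp_schedule(target_rps: int, ramp_up_s: int, duration_s: int) -> list:
    """Target requests per second for each second of the test.

    The rate rises linearly during the ramp-up and then holds at target_rps.
    """
    schedule = []
    for second in range(duration_s):
        if second < ramp_up_s:
            rps = round(target_rps * (second + 1) / ramp_up_s)
        else:
            rps = target_rps
        schedule.append(rps)
    return schedule


# Mirrors the configuration above: 100 rps, 30s ramp-up, 5 minutes total.
schedule = ramp_schedule(target_rps=100, ramp_up_s=30, duration_s=300)
total_requests = sum(schedule)

print(f"first second: {schedule[0]} rps, after ramp-up: {schedule[30]} rps")
print(f"total requests planned: {total_requests}")
```

Ramping up gradually avoids confusing cold-start effects, such as connection pool warm-up or autoscaling lag, with steady-state performance.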

Dashboards

Real-Time Monitoring

code
Agent Performance Dashboard

Throughput:       45 req/s    (5%)
Latency P95:      1,850ms     (12%)
Error Rate:       0.3%        (0.2%)
Availability:     99.95%      (target: 99.9%)

Active Requests:  12
Queue Depth:      3
CPU Usage:        45%
Memory:           2.1GB / 4GB

Recent Errors:
  Timeout (2) - 5 min ago
  Rate limit (1) - 15 min ago

Custom Dashboards

python
# Create custom dashboard
dashboard = ants.sre.create_dashboard({
    'name': 'Customer Support Overview',
    'widgets': [
        {'type': 'timeseries', 'metric': 'throughput'},
        {'type': 'gauge', 'metric': 'latency_p95'},
        {'type': 'counter', 'metric': 'error_count'},
        {'type': 'heatmap', 'metric': 'latency_distribution'}
    ],
    'refresh_interval': 30  # seconds
})

Incident Response

Automated Runbooks

typescript
// Define runbook
await ants.sre.createRunbook({
  name: 'High Error Rate Response',
  trigger: 'error_rate > 5%',
  steps: [
    { action: 'notify', channels: ['pagerduty'] },
    { action: 'scale', instances: 2 },
    { action: 'enable', feature: 'circuit_breaker' },
    { action: 'notify', channels: ['slack'], message: 'Mitigation applied' }
  ]
})
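The runbook above enables a `circuit_breaker` feature as a mitigation step. For readers new to the pattern, the following is a generic, simplified sketch of a circuit breaker state machine. It is not the platform's implementation. The class, its parameters, and the default thresholds are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
class CircuitBreaker:
    """Simplified circuit breaker with closed, open, and half_open states."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def allow_request(self, now: float) -> bool:
        """Reject calls while open; after the timeout, let trial calls through."""
        if self.state == "open":
            if now - self._opened_at >= self.reset_timeout_s:
                self.state = "half_open"
                return True
            return False
        return True

    def record_success(self) -> None:
        self._failures = 0
        self.state = "closed"

    def record_failure(self, now: float) -> None:
        self._failures += 1
        # A failed trial call, or too many consecutive failures, opens the circuit.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = now


breaker = CircuitBreaker(failure_threshold=3, reset_timeout_s=30.0)

for t in (0.0, 1.0, 2.0):          # three consecutive failures
    breaker.record_failure(now=t)
state_after_failures = breaker.state

blocked = not breaker.allow_request(now=10.0)   # still inside the timeout
trial_allowed = breaker.allow_request(now=32.0)  # timeout elapsed
breaker.record_success()                         # trial call succeeded
state_after_recovery = breaker.state

print(state_after_failures, blocked, trial_allowed, state_after_recovery)
```

This sketch admits every request while half-open. Production implementations typically limit the half-open state to a single trial request or a small quota.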

Post-Mortem Analysis

python
# Generate post-mortem
post_mortem = ants.sre.create_post_mortem(
    incident_id='inc_123',
    include_timeline=True,
    include_metrics=True,
    include_logs=True
)

# Export
post_mortem.export('incident_123_post_mortem.pdf')

Best Practices

1. Set Clear SLOs

typescript
// Define measurable objectives
await ants.sre.createSLO({
  name: 'API Availability',
  target: 99.9, // Specific target
  window: '30d' // Time window
})

2. Monitor Proactively

python
# Don't wait for users to report issues
ants.sre.create_alert({
    'condition': 'latency_p95 > 2000',
    'action': 'investigate_before_users_complain'
})

3. Automate Response

typescript
// Automated remediation
await ants.sre.createAutomation({
  trigger: 'high_error_rate',
  action: 'restart_service',
  conditions: {
    duration: '5m',
    threshold: 10
  }
})

4. Learn from Incidents

python
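# Always do post-mortems
# Assumption: get_incidents accepts status='closed', mirroring status='open' above
closed_incidents = ants.sre.get_incidents(status='closed')

for incident in closed_incidents:
    post_mortem = ants.sre.create_post_mortem(incident.id)
    post_mortem.share_with_team()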

Integration with Tools

PagerDuty

typescript
await ants.sre.integrateWith('pagerduty', {
  apiKey: process.env.PAGERDUTY_API_KEY,
  serviceId: 'PXXXXXX',
  escalationPolicy: 'PXXXXXX'
})

Grafana

python
# Export metrics to Prometheus, which Grafana can read as a data source
ants.sre.configure_export({
    'provider': 'prometheus',
    'endpoint': 'http://prometheus:9090',
    'interval': 15  # seconds
})

Next Steps

Learn About Tracing →

© 2026 ANTS Platform, Inc. · Docs v1.0 · Last updated June 2026