Topic 5: Monitoring and Observability
Monitoring and observability are essential DevOps practices that help you understand the health, performance, and reliability of your applications and infrastructure. In this topic, you'll learn how to set up monitoring with Prometheus and visualize metrics in Grafana. You'll then explore AI agents and agentic workflows with n8n, which let you automate incident response instead of relying purely on manual intervention.
Why are monitoring and observability important for cloud-native applications?
Cloud-native applications are typically distributed, dynamic, and run across many services and environments. Monitoring and observability are critical because they:
- Help detect and resolve issues quickly, minimizing downtime.
- Provide visibility into system health, performance, and user experience.
- Enable proactive alerting and troubleshooting in complex, rapidly changing environments.
- Support scalability and reliability by identifying bottlenecks and failures.
- Allow teams to understand dependencies and interactions between microservices.
Without effective monitoring and observability, it becomes difficult to maintain, debug, and optimize cloud-native systems.
How can AI agents enhance your monitoring systems?
Monitoring systems generate constant streams of alerts. Traditionally, engineers manually investigate and fix each alert. By equipping AI agents with the appropriate logic, you enable them to:
- Respond instantly to alerts without human delays
- Analyze logs and metrics to find root causes automatically
- Execute fixes (restart services, scale resources, rollback deployments) independently
- Learn from incidents to improve future responses
- Free teams from repetitive troubleshooting to focus on long-term reliability
Study
- What is Monitoring and Observability in DevOps?
- Prometheus Overview
- Grafana Overview
- Prometheus + Grafana Integration
- What are AI agents?
- What are agentic workflows?
- n8n Overview
Key Concepts
- Metrics: Quantitative data about your systems (CPU, memory, requests, etc.)
- Alerting: Automated notifications based on metric thresholds
- Dashboards: Visual representations of metrics for quick insights
- Instrumentation: Adding code or exporters to expose metrics
Hands-on Tasks
1. Set Up Prometheus
- Create a minimal `prometheus.yml` config:

  ```yaml
  global:
    scrape_interval: 15s

  scrape_configs:
    - job_name: 'prometheus'
      static_configs:
        - targets: ['localhost:9090']
  ```

- Install Prometheus using Docker:

  ```bash
  docker run \
    -p 9090:9090 \
    -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus
  ```

- Add your application's metrics endpoint to `static_configs` as needed; you can confirm targets are being scraped with the sketch below.
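To check that Prometheus is running and scraping its targets, you can query its HTTP API. A minimal Python sketch, assuming the `requests` library is installed and Prometheus is on the default `localhost:9090`:

```python
import requests

# List the targets Prometheus is scraping and report their health.
resp = requests.get("http://localhost:9090/api/v1/targets", timeout=5)
resp.raise_for_status()

for target in resp.json()["data"]["activeTargets"]:
    job = target["labels"].get("job", "<unknown>")
    print(f"{job}: {target['scrapeUrl']} -> {target['health']}")
```

You can also open http://localhost:9090/targets in a browser for the same information.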
2. Set Up Grafana
- Install Grafana using Docker:

  ```bash
  docker run -d --name=grafana -p 3000:3000 grafana/grafana
  ```

- Access Grafana at http://localhost:3000 (default login: `admin` / `admin`)
- Add Prometheus as a data source (URL: `http://host.docker.internal:9090` or `http://localhost:9090`); this can also be scripted, as shown below
- Add and connect your cloud provider's metrics if applicable (e.g., AWS CloudWatch, Azure Monitor)
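If you'd rather script the data source instead of clicking through the UI, Grafana exposes an HTTP API. A rough sketch, assuming the default `admin`/`admin` credentials and the `requests` library (point the URL at whichever Prometheus address works in your setup):

```python
import requests

# Register Prometheus as a Grafana data source via Grafana's HTTP API.
payload = {
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://host.docker.internal:9090",  # or http://localhost:9090
    "access": "proxy",
}

resp = requests.post(
    "http://localhost:3000/api/datasources",
    json=payload,
    auth=("admin", "admin"),  # default credentials; change them in real setups
    timeout=5,
)
print(resp.status_code, resp.json())
```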
3. Create Dashboards
- Create a new dashboard and add panels using PromQL queries (e.g., `up`, `http_requests_total`)
- Visualize metrics from your application or infrastructure
4. Instrument a Sample App
- For Node.js: use `prom-client` to expose metrics
- For Python: use `prometheus_client` (a minimal sketch follows this list)
- Add the metrics endpoint to the Prometheus config and visualize it in Grafana
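A minimal Python sketch using `prometheus_client`; the metric name, label, and port are illustrative assumptions:

```python
import random
import time

from prometheus_client import Counter, start_http_server

# Hypothetical counter tracking requests handled by a demo app.
REQUESTS = Counter("demo_requests_total", "Requests handled by the demo app", ["status"])

if __name__ == "__main__":
    # Expose metrics at http://localhost:8000/metrics for Prometheus to scrape.
    start_http_server(8000)
    while True:
        REQUESTS.labels(status=random.choice(["200", "500"])).inc()
        time.sleep(1)
```

Point a `static_configs` target at the app (e.g., `host.docker.internal:8000` if Prometheus runs in Docker), then chart it in Grafana with a query such as `rate(demo_requests_total[5m])`.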
5. Build an AI agent with n8n
- Install n8n using Docker:

  ```bash
  docker run -d -p 5678:5678 --name n8n n8nio/n8n:latest
  ```

- Access n8n at http://localhost:5678 and create your login.
Build your agent:
- Create a Schedule trigger (e.g. every 1-2 minutes)
- Query the Prometheus HTTP API for a specific metric (see the sketch after this list)
- Add an If node to check whether the metric exceeds a chosen threshold
- Call an LLM (e.g., via the OpenAI API) to analyze the anomaly and suggest root causes and remediation steps
- Send the analysis via email or Slack
- (Optional) Add automated remediation steps
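The n8n nodes above carry out the same logic as this rough Python sketch. It only illustrates the data flow; the metric, threshold, and model name are assumptions, and the notification step is a stand-in for the email/Slack node:

```python
import requests
from openai import OpenAI  # assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set

PROM_URL = "http://localhost:9090/api/v1/query"
QUERY = 'rate(http_requests_total{status="500"}[5m])'  # hypothetical error-rate metric
THRESHOLD = 5.0  # errors per second; tune to your app

def check_once() -> None:
    # 1. Query Prometheus (the "query Prometheus" step).
    result = requests.get(PROM_URL, params={"query": QUERY}, timeout=5).json()["data"]["result"]
    value = float(result[0]["value"][1]) if result else 0.0

    # 2. Threshold check (the "If" node).
    if value <= THRESHOLD:
        return

    # 3. Ask an LLM for likely root causes and remediation steps (the LLM node).
    client = OpenAI()
    analysis = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{
            "role": "user",
            "content": f"The 5xx error rate is {value:.2f}/s (threshold {THRESHOLD}/s). "
                       "Suggest likely root causes and remediation steps.",
        }],
    ).choices[0].message.content

    # 4. Notify (stand-in for the email/Slack node).
    print("ALERT:", analysis)

if __name__ == "__main__":
    check_once()
```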
Test:
- Intentionally trigger high traffic or errors on your monitored application (a simple load-generation sketch follows below)
- Verify the agent detects the anomaly, analyzes the root cause, and takes action
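One way to generate the spike is a small load loop against your app; the URL and request count are placeholders for your own service:

```python
import requests

APP_URL = "http://localhost:8000/"  # placeholder: point this at your monitored app

# Fire a burst of requests to push the request/error rate over the alert threshold.
for _ in range(500):
    try:
        requests.get(APP_URL, timeout=2)
    except requests.RequestException:
        pass  # failed requests are fine here; errors are what we want the agent to notice
```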
Test Your Knowledge
Use these prompts to test your understanding:
- What is the difference between monitoring and observability?
- How does Prometheus collect metrics from applications?
- What is PromQL and how is it used in Grafana dashboards?
- How would you set up alerting for high CPU usage using Prometheus?
- What are exporters in the context of Prometheus?
- How do you add a new data source in Grafana?
- What are some best practices for dashboard design?
- What are the key components of AI agent architecture?
- How does an LLM help an agent make decisions?