AWS Observability Overview
AWS provides a comprehensive suite of observability tools that integrate natively with AWS services. CloudWatch serves as the foundation for logs, metrics, and alarms, while X-Ray provides distributed tracing capabilities.
When to Use AWS-Native vs Third-Party
- AWS-Native: Deep AWS integration, pay-as-you-go, no additional infrastructure
- Third-Party: Multi-cloud support, advanced features, unified billing
Cost Consideration
AWS observability costs can escalate quickly with high log volumes. Plan retention policies and sampling strategies from the start.
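To make the trade-off concrete, a back-of-envelope estimate helps. The sketch below is ours, not an AWS tool: `estimateMonthlyLogCost` is a hypothetical helper that assumes the commonly published rates of $0.50/GB ingested and $0.03/GB-month stored, plus a steady-state simplification (with an N-month retention policy, roughly N months of logs sit in storage at any time). Real bills vary by region and free tiers.

```javascript
// Rough monthly CloudWatch Logs cost at steady state (hypothetical helper).
// Rates assumed: $0.50/GB ingested, $0.03/GB-month stored.
// Computed in integer cents to avoid floating-point drift.
function estimateMonthlyLogCost(ingestGbPerMonth, retentionMonths) {
  const ingestCents = Math.round(ingestGbPerMonth * 50);
  const storageCents = Math.round(ingestGbPerMonth * retentionMonths * 3);
  return (ingestCents + storageCents) / 100;
}
```

At 100 GB/month, a 3-month retention policy works out to about $59/month, while keeping the same logs for 15 months costs about $95/month, which is why setting retention up front pays off.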
CloudWatch Fundamentals
CloudWatch Logs Setup
CloudWatch Logs collects and stores log data from AWS services and applications. Logs are organised into log groups and log streams.
# Terraform: CloudWatch Log Group with retention
resource "aws_cloudwatch_log_group" "app_logs" {
name = "/aws/app/my-service" # illustrative name
retention_in_days = 30
}
# Log stream for specific instance/container
resource "aws_cloudwatch_log_stream" "app_stream" {
name = "container-1"
log_group_name = aws_cloudwatch_log_group.app_logs.name
}
CloudWatch Agent Installation
# Install CloudWatch Agent on EC2 (Amazon Linux 2)
sudo yum install -y amazon-cloudwatch-agent
# Configure and start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config \
-m ec2 \
-c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json \
-s
CloudWatch Logs Insights
# Find slow requests
fields @timestamp, @message, duration
| filter duration > 1000
| sort duration desc
| limit 50
Custom Metrics
// Node.js: publishing custom metrics with the AWS SDK v3
const { CloudWatchClient, PutMetricDataCommand } = require('@aws-sdk/client-cloudwatch');
const cloudwatch = new CloudWatchClient({});

async function publishMetric(name, value, unit, dimensions) {
await cloudwatch.send(new PutMetricDataCommand({
Namespace: 'MyApp',
MetricData: [{
MetricName: name,
Value: value,
Unit: unit,
Dimensions: dimensions.map(d => ({ Name: d.name, Value: d.value })),
}],
}));
}

// Publish order value metric
await publishMetric(
'OrderValue',
99.99,
'None',
[{ name: 'Environment', value: 'production' }]
);
AWS X-Ray for Distributed Tracing
X-Ray traces requests as they travel through your application, providing service maps, trace analysis, and performance insights.
X-Ray Concepts
- Segments: Work done by a service for a request
- Subsegments: Granular timing for specific operations
- Annotations: Indexed key-value pairs for filtering
- Metadata: Non-indexed additional data
X-Ray SDK Integration (Node.js)
// X-Ray SDK setup
const AWSXRay = require('aws-xray-sdk');
const express = require('express');
// Capture all AWS SDK calls
const AWS = AWSXRay.captureAWS(require('aws-sdk'));
// Capture outbound HTTP requests
AWSXRay.captureHTTPsGlobal(require('http'));
const app = express();
// Middleware to create segments
app.use(AWSXRay.express.openSegment('MyApp'));
app.get('/api/orders/:id', async (req, res) => {
const segment = AWSXRay.getSegment();
// Add annotation (indexed, searchable)
segment.addAnnotation('orderId', req.params.id);
// Add metadata (not indexed)
segment.addMetadata('request', { headers: req.headers });
// Create subsegment for database call
const subsegment = segment.addNewSubsegment('DynamoDB-GetOrder');
try {
const order = await getOrderFromDynamoDB(req.params.id);
subsegment.close();
res.json(order);
} catch (error) {
subsegment.addError(error);
subsegment.close();
throw error;
}
});
app.use(AWSXRay.express.closeSegment());
Sampling Rules
# Terraform: X-Ray sampling rule with higher sampling for errors
resource "aws_xray_sampling_rule" "error_sampling" {
rule_name = "error-requests"
priority = 100 # Lower values are evaluated first
version = 1
reservoir_size = 50
fixed_rate = 1.0 # Sample all errors
url_path = "*"
host = "*"
http_method = "*"
service_type = "*"
service_name = "*"
resource_arn = "*"
attributes = {
"http.status_code" = "5*"
}
}
Container Insights for EKS
Container Insights provides performance monitoring for EKS clusters, collecting metrics at the cluster, node, pod, and container level.
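Once the agent is deployed (next section), the collected performance events land in the `/aws/containerinsights/<cluster>/performance` log group as structured (EMF) records, so they can be queried directly with Logs Insights. A sketch, with field names following the Container Insights schema (verify against what your cluster actually emits):

```
# Top CPU-consuming pods over the query window
fields @timestamp, PodName, pod_cpu_utilization
| filter Type = "Pod"
| stats avg(pod_cpu_utilization) as avg_cpu by PodName
| sort avg_cpu desc
| limit 10
```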
Enable Container Insights
# Enable Container Insights on EKS: deploy the CloudWatch agent and Fluent Bit as a DaemonSet
kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml
Fluent Bit Configuration for EKS
# fluent-bit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: fluent-bit-config
namespace: amazon-cloudwatch
data:
fluent-bit.conf: |
[SERVICE]
Flush 5
Log_Level info
Daemon off
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
[INPUT]
Name tail
Tag application.*
Path /var/log/containers/*.log
Parser docker
DB /var/fluent-bit/state/flb_container.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
[FILTER]
Name kubernetes
Match application.*
Kube_URL https://kubernetes.default.svc:443
Kube_Tag_Prefix application.var.log.containers.
Merge_Log On
K8S-Logging.Parser On
K8S-Logging.Exclude Off
[OUTPUT]
Name cloudwatch_logs
Match application.*
region eu-west-1
log_group_name /aws/eks/my-cluster/application
log_stream_prefix ${HOSTNAME}-
auto_create_group true
AWS Distro for OpenTelemetry (ADOT)
ADOT is AWS's distribution of OpenTelemetry, providing vendor-neutral instrumentation that integrates with X-Ray and CloudWatch.
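One detail worth knowing when the `awsxray` exporter is in play: X-Ray trace IDs are the 128-bit OpenTelemetry trace ID re-encoded as `1-<8 hex chars of epoch seconds>-<24 hex chars>`. A small sketch of the mapping (the helper name is ours):

```javascript
// Convert a 32-char hex OpenTelemetry trace ID into X-Ray's format:
// version "1", then the first 8 hex chars (epoch seconds), then the rest.
function toXRayTraceId(otelTraceId) {
  if (!/^[0-9a-f]{32}$/.test(otelTraceId)) {
    throw new Error('expected a 32-char lowercase hex trace ID');
  }
  return `1-${otelTraceId.slice(0, 8)}-${otelTraceId.slice(8)}`;
}
```

This is also why X-Ray-compatible trace IDs carry a plausible timestamp in their first 4 bytes; IDs with an implausible epoch can be rejected by the backend.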
ADOT Collector Configuration
# adot-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 30s
send_batch_size: 8192
resource:
attributes:
- key: cloud.provider
value: aws
action: upsert
exporters:
awsxray:
region: eu-west-1
awsemf:
region: eu-west-1
namespace: MyApp
log_group_name: '/aws/otel/metrics'
dimension_rollup_option: NoDimensionRollup
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch, resource]
exporters: [awsxray]
metrics:
receivers: [otlp]
processors: [batch, resource]
exporters: [awsemf]
Deploy ADOT on EKS
# Deploy ADOT Collector as DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: adot-collector
namespace: observability
spec:
selector:
matchLabels:
app: adot-collector
template:
metadata:
labels:
app: adot-collector
spec:
serviceAccountName: adot-collector
containers:
- name: collector
image: amazon/aws-otel-collector:latest
args:
- --config=/etc/otel/config.yaml
env:
- name: AWS_REGION
value: eu-west-1
resources:
limits:
cpu: 500m
memory: 512Mi
requests:
cpu: 200m
memory: 256Mi
volumeMounts:
- name: config
mountPath: /etc/otel
volumes:
- name: config
configMap:
name: adot-config
Infrastructure as Code Setup
Complete Terraform Module
# main.tf - AWS Observability Infrastructure
# The full module also defines CloudWatch log groups, an SNS topic for
# alerts, CloudWatch alarms, and an IAM role for X-Ray; the dashboard
# resource is shown here.
# CloudWatch Dashboard
resource "aws_cloudwatch_dashboard" "main" {
dashboard_name = "${var.service_name}-dashboard"
dashboard_body = jsonencode({
widgets = [
{
type = "metric"
x = 0
y = 0
width = 12
height = 6
properties = {
metrics = [
["AWS/ApiGateway", "Count", "ApiName", var.api_name],
[".", "5XXError", ".", "."],
[".", "4XXError", ".", "."]
]
period = 300
stat = "Sum"
region = var.region
title = "API Requests"
}
},
{
type = "metric"
x = 12
y = 0
width = 12
height = 6
properties = {
metrics = [
["AWS/ApiGateway", "Latency", "ApiName", var.api_name, { stat = "p50" }],
["...", { stat = "p95" }],
["...", { stat = "p99" }]
]
period = 300
region = var.region
title = "API Latency"
}
}
]
})
}
CloudWatch Dashboards
Dashboard Best Practices
- Summary view: Key metrics at a glance (error rates, latency, throughput)
- Service view: Per-service metrics and health
- Infrastructure view: EC2, RDS, Lambda resource utilisation
Dashboard Tip
Use anomaly detection bands on dashboards to quickly spot unusual behaviour. CloudWatch can automatically calculate expected ranges.
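In dashboard JSON this is the `ANOMALY_DETECTION_BAND` metric math function. A hedged Terraform fragment for one widget (placement and the variable names mirror the dashboard module above; adjust to your own metrics):

```hcl
{
  type   = "metric"
  x      = 0
  y      = 6
  width  = 12
  height = 6
  properties = {
    metrics = [
      # e1 renders the expected band around the base metric m1
      [{ expression = "ANOMALY_DETECTION_BAND(m1, 2)", label = "Expected range", id = "e1" }],
      ["AWS/ApiGateway", "Latency", "ApiName", var.api_name, { id = "m1", stat = "Average" }]
    ]
    period = 300
    region = var.region
    title  = "API Latency vs expected band"
  }
}
```

The second argument (`2`) is the band width in standard deviations; widen it to reduce noise on spiky metrics.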
Alerting and Incident Response
Composite Alarms
# Composite alarm: Alert only when multiple conditions are met
resource "aws_cloudwatch_composite_alarm" "service_degraded" {
alarm_name = "${var.service_name}-service-degraded"
alarm_rule = "ALARM(\"${aws_cloudwatch_metric_alarm.high_error_rate.alarm_name}\") AND ALARM(\"${aws_cloudwatch_metric_alarm.high_latency.alarm_name}\")"
alarm_actions = [aws_sns_topic.alerts.arn]
alarm_description = "Service is degraded: both error rate and latency are high"
}
EventBridge Integration
# Trigger automation on alarm state change
resource "aws_cloudwatch_event_rule" "alarm_trigger" {
name = "alarm-state-change"
description = "Trigger on CloudWatch alarm state changes"
event_pattern = jsonencode({
source = ["aws.cloudwatch"]
detail-type = ["CloudWatch Alarm State Change"]
detail = {
alarmName = [{ prefix = var.service_name }]
state = { value = ["ALARM"] }
}
})
}
resource "aws_cloudwatch_event_target" "lambda" {
rule = aws_cloudwatch_event_rule.alarm_trigger.name
target_id = "incident-response"
arn = aws_lambda_function.incident_response.arn
}
Cost Optimisation
CloudWatch Pricing Breakdown
- Logs: $0.50/GB ingested + $0.03/GB-month stored (first 5 GB of ingestion free)
- Metrics: $0.30 per custom metric/month (first 10,000 metrics)
- Dashboards: $3.00/dashboard/month
- Alarms: $0.10/alarm/month (standard resolution)
Cost Reduction Strategies
- Set appropriate log retention periods
- Use log filters to reduce stored volume
- Archive old logs to S3 via subscription filters
- Use X-Ray sampling to reduce trace costs
- Choose standard resolution (1 minute) vs high resolution (1 second) metrics
# Archive logs to S3 via subscription filter
resource "aws_cloudwatch_log_subscription_filter" "archive" {
name = "archive-to-s3"
log_group_name = aws_cloudwatch_log_group.app.name
filter_pattern = ""
destination_arn = aws_kinesis_firehose_delivery_stream.logs.arn
role_arn = aws_iam_role.cloudwatch_to_firehose.arn
}
resource "aws_kinesis_firehose_delivery_stream" "logs" {
name = "logs-archive"
destination = "extended_s3"
extended_s3_configuration {
role_arn = aws_iam_role.firehose.arn
bucket_arn = aws_s3_bucket.logs_archive.arn
prefix = "logs/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"
error_output_prefix = "errors/"
buffering_size = 128
buffering_interval = 300
compression_format = "GZIP"
}
}
Troubleshooting
Common issues and solutions when setting up AWS observability.
CloudWatch Agent Not Sending Metrics
Symptom: Custom metrics not appearing in CloudWatch despite agent running.
Common causes:
- IAM role missing CloudWatch permissions
- Agent configuration file syntax errors
- Wrong region configured
- Metric namespace or dimensions exceeding limits
Solution:
# Check CloudWatch agent status
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a status
# View agent logs for errors
sudo tail -f /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log
# Verify IAM permissions (attach CloudWatchAgentServerPolicy)
aws iam list-attached-role-policies --role-name EC2-CloudWatch-Role
# Validate configuration file
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config -c file:/opt/aws/amazon-cloudwatch-agent/etc/config.json -s
X-Ray Traces Missing or Incomplete
Symptom: Some services appear in traces but others are missing, or traces are incomplete.
Common causes:
- X-Ray daemon not running or unreachable
- Sampling rate too low
- Missing instrumentation in downstream services
- VPC endpoints not configured for private subnets
Solution:
# Check X-Ray daemon status
sudo systemctl status xray
# Verify daemon can reach X-Ray API
curl -X POST http://127.0.0.1:2000/GetSamplingRules
# Configure higher sampling rate for debugging
{
"version": 2,
"rules": [
{
"description": "Debug sampling",
"service_name": "*",
"http_method": "*",
"url_path": "*",
"fixed_target": 10,
"rate": 1.0
}
]
}
# Ensure trace context propagation headers are forwarded
# X-Amzn-Trace-Id header must be passed between services
Log Group Reaching Retention or Storage Limits
Error: “ResourceLimitExceededException” or unexpectedly high CloudWatch costs.
Common causes:
- No retention policy set (logs kept indefinitely)
- Excessive debug logging in production
- Log events exceeding size limits
- Too many unique log streams
Solution:
# Set retention policy on log groups
aws logs put-retention-policy \
--log-group-name /aws/lambda/my-function \
--retention-in-days 30
# List log groups without retention (potential cost issue)
aws logs describe-log-groups \
--query 'logGroups[?!retentionInDays].logGroupName'
# Export old logs to S3 for cheaper storage
aws logs create-export-task \
--log-group-name /aws/ecs/my-service \
--from 1609459200000 --to 1612137600000 \
--destination my-log-archive-bucket \
--destination-prefix exports/
Container Insights Not Showing EKS Metrics
Symptom: Container Insights enabled but no metrics appearing for pods or nodes.
Common causes:
- CloudWatch agent DaemonSet not deployed
- IRSA not configured for the agent ServiceAccount
- Fluent Bit not forwarding logs
- Cluster name mismatch in configuration
Solution:
# Verify Container Insights addon is enabled
aws eks describe-addon --cluster-name my-cluster \
--addon-name amazon-cloudwatch-observability
# Check CloudWatch agent pods are running
kubectl get pods -n amazon-cloudwatch -l app=cloudwatch-agent
# Verify IRSA ServiceAccount annotation
kubectl describe sa cloudwatch-agent -n amazon-cloudwatch
# Check agent logs
kubectl logs -n amazon-cloudwatch -l app=cloudwatch-agent --tail=50
CloudWatch Alarms Not Triggering
Symptom: Alarm stays in OK or INSUFFICIENT_DATA despite threshold being breached.
Common causes:
- Metric dimensions don't match alarm configuration
- Evaluation period longer than expected
- Missing datapoints treated as “not breaching”
- Alarm in wrong region
Solution:
# Check alarm configuration and state
aws cloudwatch describe-alarms --alarm-names MyAlarm
# Verify metric data exists with exact dimensions
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-01-01T01:00:00Z \
--period 300 \
--statistics Average
# Set treat-missing-data appropriately
aws cloudwatch put-metric-alarm \
--alarm-name MyAlarm \
--treat-missing-data breaching
Conclusion
AWS provides a comprehensive observability stack that integrates seamlessly with AWS services. CloudWatch serves as the central hub for logs, metrics, and alarms, while X-Ray provides distributed tracing capabilities essential for debugging microservices.
For organisations standardising on OpenTelemetry, ADOT provides a path to vendor-neutral instrumentation while still leveraging AWS-native backends. Container Insights extends observability into EKS workloads with minimal configuration.
Start with the basics: logs, metrics, and alarms, then progressively add tracing and custom metrics as your observability maturity grows. Use Infrastructure as Code to ensure your observability setup is reproducible and version-controlled.

