In this blog post, we'll walk through setting up a robust monitoring and alerting system for a Kubernetes cluster on AWS EKS. We'll use the kube-prometheus-stack
to deploy Prometheus and Grafana, configure an ALB Ingress for external access, and set up Amazon SNS as the receiver for Alertmanager to handle alerts.
Using Amazon SNS as the alert receiver is particularly useful when an organization does not use Slack or wants a centralized way to distribute alerts via email. While Alertmanager supports email notifications, using email as a direct receiver requires an SMTP configuration, which adds administrative overhead. With SNS, you can subscribe multiple email addresses to a single topic without any SMTP setup, making it a simpler and more scalable solution.
However, it's important to note that SNS has limitations when handling HTML content. Alerts sent via SNS will be received as plain text rather than formatted HTML, which may impact readability.
Prerequisites
Before we begin, ensure you have the following:
- An AWS account with a running EKS cluster.
- kubectl and helm installed on your local machine.
- AWS CLI configured with the necessary credentials.
- AWS Load Balancer Controller installed in the EKS cluster.
- kubectl pointing to the correct cluster context.
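A quick way to sanity-check these prerequisites before continuing (a small sketch; the cluster name and region are placeholders for your own values):
# Confirm tooling and cluster access
kubectl config current-context
kubectl get nodes
helm version --short
aws sts get-caller-identity
aws eks describe-cluster --name <CLUSTER_NAME> --region <AWS_REGION> --query cluster.status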
Step 1: Install the kube-prometheus-stack Using Helm
The kube-prometheus-stack
is a collection of Kubernetes manifests, Grafana dashboards, and Prometheus rules that provide easy deployment and management of Prometheus and Grafana on Kubernetes.
1.1 Add the Prometheus Community Helm Repository
First, add the Prometheus Community Helm repository and update it:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
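Optionally, confirm that the chart is now available from the repository:
helm search repo prometheus-community/kube-prometheus-stack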
1.2 Deploy the kube-prometheus-stack
Create a prom-operator-values.yaml file to customize the deployment. For example, set the Grafana admin password, provision custom dashboards, and configure the Grafana ingress:
grafana:
  adminPassword: "your-secure-password"
  ### Provision grafana-dashboards-kubernetes ###
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'grafana-dashboards-kubernetes'
          orgId: 1
          folder: 'Kubernetes'
          type: file
          disableDeletion: true
          editable: true
          options:
            path: /var/lib/grafana/dashboards/grafana-dashboards-kubernetes
  dashboards:
    grafana-dashboards-kubernetes:
      k8s-system-api-server:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-system-api-server.json
        token: ''
      k8s-system-coredns:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-system-coredns.json
        token: ''
      k8s-views-global:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-global.json
        token: ''
      k8s-views-namespaces:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-namespaces.json
        token: ''
      k8s-views-nodes:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-nodes.json
        token: ''
      k8s-views-pods:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-pods.json
        token: ''
  ingress:
    annotations:
      kubernetes.io/ingress.class: alb
      alb.ingress.kubernetes.io/load-balancer-name: grafana-alb
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/target-type: ip
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
      alb.ingress.kubernetes.io/ssl-redirect: '443'
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:<AWS_REGION>:<ACCOUNT_ID>:certificate/a0ff498e-62fd-4397-8d4a-626360465d32
    enabled: true
    hosts:
      - grafana-test.apps.xyz-company.com
    labels: {}
    path: /
Deploy the kube-prometheus-stack
using Helm:
helm upgrade --install k-prom-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace \
--values prom-operator-values.yaml
This command installs Prometheus, Grafana, and Alertmanager in the monitoring
namespace.
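You can verify that the release came up cleanly by listing the pods and services created in the namespace:
kubectl get pods --namespace monitoring
kubectl get svc --namespace monitoring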
1.3 Access Prometheus and Alertmanager UIs
To access the Prometheus and Alertmanager web UIs, use kubectl port-forward
:
- Prometheus UI:
kubectl port-forward svc/prometheus-operated 9090:9090 --namespace monitoring
Open your browser and go to http://localhost:9090
.
- Alertmanager UI:
kubectl port-forward svc/alertmanager-operated 9093:9093 --namespace monitoring
Open your browser and go to http://localhost:9093
.
To access the Grafana web UI:
- Create a DNS entry mapping the domain defined in the ingress to the load balancer URL that was created (see the commands after this list).
- Use admin as the username and the password defined in prom-operator-values.yaml to log in to Grafana.
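To find the load balancer URL for that DNS entry, read the ingress status once the AWS Load Balancer Controller has provisioned the ALB (the ingress name below is an assumption based on the release name; confirm it with the first command):
kubectl get ingress --namespace monitoring
kubectl get ingress k-prom-stack-grafana --namespace monitoring -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'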
Step 2: Set Up Amazon SNS for Alertmanager
To receive alerts via email or other channels, we'll configure Amazon SNS as the receiver for Alertmanager.
2.1 Create an SNS Topic
Create an SNS topic to receive alerts:
aws sns create-topic --name alertTopic
Note the ARN (Amazon Resource Name) from the output, as it will be needed in later steps.
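If you prefer not to copy it manually, you can capture the ARN into a shell variable, since create-topic returns the topic ARN (and is idempotent if the topic already exists):
TOPIC_ARN=$(aws sns create-topic --name alertTopic --query TopicArn --output text)
echo "$TOPIC_ARN"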
2.2 Create an Email Subscription for SNS
Subscribe your email to the SNS topic to receive notifications:
aws sns subscribe \
--topic-arn arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic \
--protocol email \
--notification-endpoint <your-email@example.com>
Check your email inbox for a confirmation message and click the link to activate the subscription.
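You can verify the subscription state from the CLI; it shows PendingConfirmation until the link is clicked and the subscription ARN afterwards:
aws sns list-subscriptions-by-topic \
  --topic-arn arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic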
2.3 Create a New IAM Role for EKS Node Group to Assume and Publish to SNS
2.3.1 Create a Trust Relationship
Create a trust relationship policy to allow the EKS node group role to assume the new role:
cat << EOF > trust-policy.json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<ACCOUNT_ID>:role/eks-node-group-role"
},
"Action": "sts:AssumeRole"
}
]
}
EOF
Create the IAM role:
aws iam create-role \
--role-name alertmanager_role \
--assume-role-policy-document file://trust-policy.json
2.3.2 Attach Permissions to Publish to SNS
Create a policy to allow the role to publish messages to the SNS topic:
cat << EOF > sns-policy.json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "sns:Publish",
"Resource": "arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic"
}
]
}
EOF
Attach the policy to the new role:
aws iam put-role-policy \
--role-name alertmanager_role \
--policy-name SNSPublishPolicy \
--policy-document file://sns-policy.json
2.4 Create a Resource-Based Policy for SNS
Allow this new role named alertmanager_role
to publish alerts to the SNS topic by attaching a resource-based policy:
cat << EOF > sns-resource-policy.json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<ACCOUNT_ID>:role/alertmanager_role"
},
"Action": "sns:Publish",
"Resource": "arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic"
}
]
}
EOF
Apply the policy to the SNS topic:
aws sns set-topic-attributes \
--topic-arn arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic \
--attribute-name Policy \
--attribute-value file://sns-resource-policy.json
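To double-check that the resource-based policy was applied, read the topic attributes back:
aws sns get-topic-attributes \
  --topic-arn arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic \
  --query Attributes.Policy --output text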
2.5 Update the IAM Role for the EKS Node Group
Allow the EKS node group to assume this new role named alertmanager_role
by attaching a policy:
cat << EOF > eks-assume-policy.json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "sts:AssumeRole",
"Resource": "arn:aws:iam::<ACCOUNT_ID>:role/alertmanager_role"
}
]
}
EOF
Attach the policy to the EKS node group role:
aws iam put-role-policy \
--role-name eks-node-group-role \
--policy-name assume_alertmanager_role_policy \
--policy-document file://eks-assume-policy.json
Note: your EKS node group role name may differ from the eks-node-group-role used in this example.
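If you are unsure which role your node group uses, you can look it up (cluster and node group names are placeholders):
aws eks describe-nodegroup \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name <NODEGROUP_NAME> \
  --query nodegroup.nodeRole --output text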
Step 3: Configure Alertmanager to Send Alerts to SNS
Update the Alertmanager configuration to send alerts to the SNS topic in the same Helm values file, prom-operator-values.yaml:
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job', 'alertname', 'priority']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'sns-receivers'
    receivers:
      - name: 'null' # Add this to your config as well
      - name: sns-receivers
        sns_configs:
          - api_url: https://sns.<AWS_REGION>.amazonaws.com
            topic_arn: arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic
            sigv4:
              region: <AWS_REGION>
              role_arn: arn:aws:iam::<ACCOUNT_ID>:role/alertmanager_role
            subject: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]'
            message: |-
              {{ if gt (len .Alerts.Firing) 0 }}
              Alerts Firing:
              {{ template "__text_alert_list_markdown" .Alerts.Firing }}
              {{ end }}
              {{ if gt (len .Alerts.Resolved) 0 }}
              Alerts Resolved:
              {{ template "__text_alert_list_markdown" .Alerts.Resolved }}
              {{ end }}
            send_resolved: true # Sends a notification when an alert is resolved
Apply the updated configuration:
helm upgrade --install k-prom-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace \
--values prom-operator-values.yaml
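Before waiting for a real incident, you can push a synthetic alert to Alertmanager to confirm that the SNS receiver and email subscription work end to end. This is a sketch that assumes the port-forward from step 1.3 is still running; the alert name and labels are arbitrary test values:
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels": {"alertname": "SNSTestAlert", "severity": "critical", "job": "manual-test"}, "annotations": {"summary": "Test alert to verify SNS delivery"}}]'
Within the group_wait window (30s in the configuration above) the alert should appear in the Alertmanager UI and, shortly after, arrive as a plain-text email from SNS.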
Step 4: Configure Custom Prometheus Rules for Alerts
Update the prom-operator-values.yaml
to add:
additionalPrometheusRulesMap:
  custom-rules:
    groups:
      - name: customGroupA.rules
        rules:
          - alert: Custom-Alert Instance High CPU Utilization
            annotations:
              description: CPU usage has been over 75% for 5 minutes | Current usage is {{ $value | printf "%.2f" }}%
              summary: CPU usage is over 75% (instance {{ $labels.instance }})
            expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 75
            for: 5m
            labels:
              severity: critical
          - alert: Custom-Alert Instance High Memory Utilization
            annotations:
              description: Memory usage has been over 75% for 5 minutes | Current usage is {{ $value | printf "%.2f" }}%
              summary: Memory usage is over 75% (instance {{ $labels.instance }})
            expr: 100 - (sum by(instance) (node_memory_MemAvailable_bytes) / sum by(instance) (node_memory_MemTotal_bytes) * 100) > 75
            for: 5m
            labels:
              severity: critical
Apply the updated configuration:
helm upgrade --install k-prom-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace \
--values prom-operator-values.yaml
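Behind the scenes, the chart turns this values block into a PrometheusRule resource that the operator loads into Prometheus. You can confirm it was created and then check the Alerts page in the Prometheus UI (http://localhost:9090/alerts) for the new rules:
kubectl get prometheusrules --namespace monitoring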
The full Helm values file can be found on GitHub here.
Conclusion
In this blog post, we set up a comprehensive monitoring and alerting system for a Kubernetes cluster running on AWS EKS. We deployed Prometheus and Grafana using the kube-prometheus-stack, configured an ALB Ingress for external access to Grafana, and set up Amazon SNS as the Alertmanager receiver for alert delivery.
With this setup, you can now monitor your Kubernetes cluster, visualize metrics using Grafana, and receive alerts via Alertmanager and SNS. This ensures that your cluster is both observable and resilient to issues.
Author of article: Muhammad Ahmad Khan