In this blog post, we'll walk through setting up a robust monitoring and alerting system for a Kubernetes cluster on AWS EKS. We'll use the kube-prometheus-stack to deploy Prometheus and Grafana, configure an ALB Ingress for external access, and set up Amazon SNS as the receiver for Alertmanager to handle alerts.

Using Amazon SNS as the alert receiver is particularly useful when an organization does not use Slack or wants a centralized way to distribute alerts via email. While Alertmanager supports email notifications, setting up email as a direct receiver requires configuring an SMTP server, which adds administrative overhead. With SNS, you can subscribe multiple email addresses to a single topic without any SMTP setup, making it a simpler and more scalable solution.

However, it's important to note that SNS does not deliver formatted HTML to email subscribers: alerts sent via SNS arrive as plain text, which may impact readability.

Prerequisites

Before we begin, ensure you have the following:

  1. An AWS account with a running EKS cluster.
  2. kubectl and helm installed on your local machine.
  3. AWS CLI configured with the necessary credentials.
  4. AWS Load Balancer Controller installed in the EKS cluster.
  5. kubectl pointing to the correct cluster context.
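
Before moving on, you can quickly verify most of these prerequisites from your terminal. This is only a sanity-check sketch; adjust the namespace or deployment name if your AWS Load Balancer Controller was installed differently:

kubectl config current-context        # confirm kubectl points at the intended cluster
helm version                          # confirm Helm is installed
aws sts get-caller-identity           # confirm the AWS CLI credentials are valid
kubectl get deployment aws-load-balancer-controller -n kube-system   # confirm the controller is running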

Step 1: Install the kube-prometheus-stack Using Helm

The kube-prometheus-stack is a collection of Kubernetes manifests, Grafana dashboards, and Prometheus rules that provide easy deployment and management of Prometheus and Grafana on Kubernetes.

1.1 Add the Prometheus Community Helm Repository

First, add the Prometheus Community Helm repository and update it:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

1.2 Deploy the kube-prometheus-stack

Create a prom-operator-values.yaml file to customize the deployment. For example, set the Grafana admin password, provision a few custom dashboards, enable the Grafana Ingress, and adjust other settings:

grafana:
  adminPassword: "your-secure-password"
  ### Provision grafana-dashboards-kubernetes ###
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
      - name: 'grafana-dashboards-kubernetes'
        orgId: 1
        folder: 'Kubernetes'
        type: file
        disableDeletion: true
        editable: true
        options:
          path: /var/lib/grafana/dashboards/grafana-dashboards-kubernetes
  dashboards:
    grafana-dashboards-kubernetes:
      k8s-system-api-server:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-system-api-server.json
        token: ''
      k8s-system-coredns:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-system-coredns.json
        token: ''
      k8s-views-global:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-global.json
        token: ''
      k8s-views-namespaces:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-namespaces.json
        token: ''
      k8s-views-nodes:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-nodes.json
        token: ''
      k8s-views-pods:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-pods.json
        token: ''
  ingress:
    annotations:
      kubernetes.io/ingress.class: alb
      alb.ingress.kubernetes.io/load-balancer-name: grafana-alb
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/target-type: ip
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
      alb.ingress.kubernetes.io/ssl-redirect: '443'
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:<AWS_REGION>:<ACCOUNT_ID>:certificate/a0ff498e-62fd-4397-8d4a-626360465d32
    enabled: true
    hosts:
    - grafana-test.apps.xyz-company.com
    labels: {}
    path: /

Deploy the kube-prometheus-stack using Helm:

helm upgrade --install k-prom-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --values prom-operator-values.yaml

This command installs Prometheus, Grafana, and Alertmanager in the monitoring namespace.
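
Once the release is deployed, confirm that all components are running before moving on (the pod and service names include the release name, k-prom-stack in this example):

kubectl get pods --namespace monitoring
kubectl get svc --namespace monitoring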

1.3 Access Prometheus and Alertmanager UIs

To access the Prometheus and Alertmanager web UIs, use kubectl port-forward:

  • Prometheus UI:
  kubectl port-forward svc/prometheus-operated 9090:9090 --namespace monitoring

Open your browser and go to http://localhost:9090.

  • Alertmanager UI:
  kubectl port-forward svc/alertmanager-operated 9093:9093 --namespace monitoring

Open your browser and go to http://localhost:9093.

To access the Grafana web UI:

  • Create a DNS entry that maps the domain defined in the Ingress (grafana-test.apps.xyz-company.com in this example) to the URL of the load balancer created by the AWS Load Balancer Controller; the command after this list shows how to retrieve that URL.
  • Use admin as the username and the password defined in prom-operator-values.yaml to log in to Grafana.
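
The load balancer hostname appears in the ADDRESS column of the Grafana Ingress. A quick sketch (the exact Ingress name depends on the Helm release name, so list the Ingresses in the namespace first if unsure):

kubectl get ingress --namespace monitoring
# Print only the ALB hostname of the first Ingress in the namespace:
kubectl get ingress --namespace monitoring -o jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}'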

Step 2: Set Up Amazon SNS for Alertmanager

To receive alerts via email or other channels, we'll configure Amazon SNS as the receiver for Alertmanager.

2.1 Create an SNS Topic

Create an SNS topic to receive alerts:

aws sns create-topic --name alertTopic

Note the ARN (Amazon Resource Name) from the output, as it will be needed in later steps.
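
Since create-topic is idempotent, you can also capture the ARN in a shell variable and reuse it in the following commands instead of pasting it by hand (a small convenience sketch):

TOPIC_ARN=$(aws sns create-topic --name alertTopic --query 'TopicArn' --output text)
echo "$TOPIC_ARN"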

2.2 Create an Email Subscription for SNS

Subscribe your email to the SNS topic to receive notifications:

aws sns subscribe \
    --topic-arn arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic \
    --protocol email \
    --notification-endpoint <your-email@example.com>

Check your email inbox for a confirmation message and click the link to activate the subscription.
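
You can verify the subscription state from the CLI; a SubscriptionArn of PendingConfirmation means the confirmation link has not been clicked yet:

aws sns list-subscriptions-by-topic \
    --topic-arn arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic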

2.3 Create a New IAM Role for EKS Node Group to Assume and Publish to SNS

2.3.1 Create a Trust Relationship

Create a trust relationship policy to allow the EKS node group role to assume the new role:

cat << EOF > trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<ACCOUNT_ID>:role/eks-node-group-role"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

Create the IAM role:

aws iam create-role \
    --role-name alertmanager_role \
    --assume-role-policy-document file://trust-policy.json
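
To confirm the role was created with the expected trust relationship, you can read it back:

aws iam get-role --role-name alertmanager_role --query 'Role.AssumeRolePolicyDocument'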

2.3.2 Attach Permissions to Publish to SNS

Create a policy to allow the role to publish messages to the SNS topic:

cat << EOF > sns-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sns:Publish",
      "Resource": "arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic"
    }
  ]
}
EOF

Attach the policy to the new role:

aws iam put-role-policy \
    --role-name alertmanager_role \
    --policy-name SNSPublishPolicy \
    --policy-document file://sns-policy.json
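
A quick way to check that the inline policy is now attached to the role:

aws iam get-role-policy --role-name alertmanager_role --policy-name SNSPublishPolicy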

2.4 Create a Resource-Based Policy for SNS

Allow this new role named alertmanager_role to publish alerts to the SNS topic by attaching a resource-based policy:

cat << EOF > sns-resource-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<ACCOUNT_ID>:role/alertmanager_role"
      },
      "Action": "sns:Publish",
      "Resource": "arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic"
    }
  ]
}
EOF

Apply the policy to the SNS topic:

aws sns set-topic-attributes \
    --topic-arn arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic \
    --attribute-name Policy \
    --attribute-value file://sns-resource-policy.json
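
To double-check that the resource-based policy was applied, read back the Policy attribute of the topic:

aws sns get-topic-attributes \
    --topic-arn arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic \
    --query 'Attributes.Policy' --output text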

2.5 Update the IAM Role for the EKS Node Group

Allow the EKS node group to assume this new role named alertmanager_role by attaching a policy:

cat << EOF > eks-assume-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::<ACCOUNT_ID>:role/alertmanager_role"
    }
  ]
}
EOF

Attach the policy to the EKS node group role:

aws iam put-role-policy \
    --role-name eks-node-group-role \
    --policy-name assume_alertmanager_role_policy \
    --policy-document file://eks-assume-policy.json

Note:

  • Your EKS node group role name may differ from the one used here (e.g. eks-node-group-role); substitute your own role name (see the commands below to look it up).
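
This is a sketch using placeholder names; replace <CLUSTER_NAME> and <NODEGROUP_NAME> with your own cluster and node group:

aws eks list-nodegroups --cluster-name <CLUSTER_NAME>
aws eks describe-nodegroup --cluster-name <CLUSTER_NAME> --nodegroup-name <NODEGROUP_NAME> \
    --query 'nodegroup.nodeRole' --output text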

Step 3: Configure Alertmanager to Send Alerts to SNS

Update the Alertmanager configuration in the same Helm values file (prom-operator-values.yaml) so that alerts are sent to the SNS topic:

alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job', 'alertname', 'priority']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'sns-receivers'
    receivers:
    - name: 'null'  # Add this to your config as well
    - name: sns-receivers
      sns_configs:
        - api_url: https://sns.<AWS_REGION>.amazonaws.com
          topic_arn: arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic
          sigv4:
            region: <AWS_REGION>
            role_arn: arn:aws:iam::<ACCOUNT_ID>:role/alertmanager_role
          subject: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]'
          message: |-
            {{ if gt (len .Alerts.Firing) 0 }}
            Alerts Firing:
            {{ template "__text_alert_list_markdown" .Alerts.Firing }}
            {{ end }}
            {{ if gt (len .Alerts.Resolved) 0 }}
            Alerts Resolved:
            {{ template "__text_alert_list_markdown" .Alerts.Resolved }}
            {{ end }}
          send_resolved: true  # Sends notification when alert is resolved

Apply the updated configuration:

helm upgrade --install k-prom-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --values prom-operator-values.yaml
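
To test the delivery path without waiting for a real alert, you can post a synthetic alert to the Alertmanager API while the port-forward from step 1.3 is running. This is a sketch; the alert name and labels are arbitrary and chosen only for the test:

curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "SNSDeliveryTest", "severity": "warning"}, "annotations": {"summary": "Test alert to verify SNS email delivery"}}]'

After the group_wait interval (30 seconds in this configuration), you should receive a plain-text email from the SNS topic subscription.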

Step 4: Configure Custom Rules for Custom Alerts

Update prom-operator-values.yaml to add the following custom alerting rules:

additionalPrometheusRulesMap:
  custom-rules:
    groups:
    - name: customGroupA.rules
      rules:
      - alert: Custom-Alert Instance High CPU Utilization
        annotations:
          description: CPU usage has been over 75% for 5 minutes | Current usage is {{ $value | printf "%.2f" }}%
          summary: CPU usage is over 75% (instance {{ $labels.instance }})
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 75
        for: 5m
        labels:
          severity: critical
      - alert: Custom-Alert Instance High Memory Utilization
        annotations:
          description: Memory usage has been over 75% for 5 minutes | Current usage is {{ $value | printf "%.2f" }}%
          summary: Memory usage is over 75% (instance {{ $labels.instance }})
        expr: 100 - (sum by(instance) (node_memory_MemAvailable_bytes) / sum by(instance) (node_memory_MemTotal_bytes) * 100) > 75
        for: 5m
        labels:
          severity: critical

Apply the updated configuration:

helm upgrade --install k-prom-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --values prom-operator-values.yaml
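
The additionalPrometheusRulesMap entries are rendered into PrometheusRule resources, so you can confirm they were created and then check the Alerts page of the Prometheus UI (see step 1.3) for the new rules:

kubectl get prometheusrules --namespace monitoring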

The full Helm values file can be found on GitHub here.

Conclusion

In this blog post, we set up a comprehensive monitoring and alerting system for a Kubernetes cluster running on AWS EKS. We deployed Prometheus and Grafana using the kube-prometheus-stack, configured an ALB Ingress for external access to Grafana, and configured Amazon SNS as the Alertmanager receiver so that alerts are delivered by email.

With this setup, you can now monitor your Kubernetes cluster, visualize metrics using Grafana, and receive alerts via Alertmanager and SNS. This ensures that your cluster is both observable and resilient to issues.

Author: Muhammad Ahmad Khan