How to Monitor Kubernetes Clusters with Prometheus and Grafana on AWS EKS: Complete Tutorial
By Braincuber Team
Published on March 11, 2026
We got called in at 2 AM because a D2C brand's payment service crashed and nobody knew for 47 minutes. Their EKS cluster was running 14 microservices across 3 node groups with zero monitoring. A memory leak in one pod cascaded across 4 services before anyone noticed. $8,300 in lost orders. The fix was not a code patch — it was visibility. Prometheus scraping metrics every 15 seconds and Grafana dashboards showing CPU/memory in real time would have caught the spike in under 60 seconds. Here is how to set it up on AWS EKS.
What You'll Learn:
- How to install AWS CLI, eksctl, kubectl, and Helm on your server
- How to create an EKS cluster and install the Kubernetes Metrics Server
- How to configure IAM OIDC and the EBS CSI Driver for persistent storage
- How to deploy Prometheus and Grafana using the kube-prometheus-stack Helm chart
- How to create Grafana dashboards and monitor a deployed NGINX application
Monitoring vs. Observability: The Difference That Costs You Money
Most teams use these words interchangeably. They are not the same thing. Monitoring tells you what is happening (CPU at 92%, response time at 3.4s). Observability tells you why it is happening (a specific pod's garbage collector is thrashing because a memory limit is set 128MB too low).
Monitoring (Known Unknowns)
Tracks predefined metrics in real time: CPU usage, memory consumption, request counts, error rates. Those colorful dashboards on the wall of the IT department. Answers: "Is the system healthy right now?"
Observability (Unknown Unknowns)
Goes deeper using the three pillars: Metrics (time-series CPU/memory data), Logs (historical event records for root cause analysis), and Traces (request flow through microservices for latency debugging).
The Tools: Prometheus + Grafana
Prometheus (Data Collector)
Open-source metrics scraper. Pulls time-series data from your pods every 15-30 seconds. Includes AlertManager for firing alerts, PushGateway for short-lived jobs, and exporters for third-party services. Zero licensing cost.
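The pull model is driven by per-job scrape configuration. A minimal sketch of a prometheus.yml, assuming a hypothetical app named my-app exposing metrics on port 8080 (inside Kubernetes, the kube-prometheus-stack chart generates the real configuration for you from ServiceMonitor resources):

```yaml
# Illustrative Prometheus scrape config; the Helm chart generates
# the real one from ServiceMonitor resources
global:
  scrape_interval: 15s      # how often Prometheus pulls metrics
  evaluation_interval: 15s  # how often alerting rules are evaluated

scrape_configs:
  - job_name: "my-app"              # hypothetical job name
    static_configs:
      - targets: ["my-app:8080"]    # endpoint serving /metrics
```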
Grafana (Visualizer)
Transforms raw Prometheus metrics into live dashboards. Pre-built templates (like the Node Exporter Full dashboard, ID 1860) give you CPU, memory, network, and disk views within minutes. Also open-source and free.
Step by Step Guide: Deploy Prometheus and Grafana on EKS
Prerequisites
You need an AWS account with access keys configured and an EC2 instance running Ubuntu 22.04 (or any Linux/Mac environment). We will install all CLI tools on this server.
Install AWS CLI, eksctl, kubectl, and Helm
Install all four CLI tools on your server. AWS CLI authenticates with AWS. eksctl creates and manages EKS clusters. kubectl interacts with Kubernetes. Helm is the package manager that deploys Prometheus and Grafana via charts.
# AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
sudo apt install unzip && unzip awscliv2.zip
sudo ./aws/install
aws configure # Enter Access Key, Secret Key, Region
# eksctl
ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
sudo mv /tmp/eksctl /usr/local/bin
# kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x ./kubectl && sudo mv ./kubectl /usr/local/bin
# Helm
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
chmod 700 get_helm.sh && ./get_helm.sh
Create the EKS Cluster
Use eksctl to spin up a 2-node EKS cluster. This provisions the VPC, subnets, IAM roles, and node group. Takes about 15-20 minutes. Then verify with kubectl get nodes.
eksctl create cluster \
--name my-monitoring-cluster \
--version 1.30 \
--region us-east-1 \
--nodegroup-name worker-nodes \
--node-type t2.medium \
--nodes 2 \
--nodes-min 2 \
--nodes-max 3
# Verify nodes are ready
kubectl get nodes
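The same flags can also be captured in a declarative config file, which is easier to version-control and review. A sketch equivalent to the command above (run it with eksctl create cluster -f cluster.yaml):

```yaml
# cluster.yaml — declarative equivalent of the eksctl flags above
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-monitoring-cluster
  region: us-east-1
  version: "1.30"
managedNodeGroups:
  - name: worker-nodes
    instanceType: t2.medium
    desiredCapacity: 2
    minSize: 2
    maxSize: 3
```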
Install the Metrics Server
The Metrics Server aggregates CPU and memory usage from the Kubelet on each node and exposes it through the Kubernetes Metrics API. It is what powers kubectl top and the Horizontal Pod Autoscaler; without it, kubectl top pods returns nothing. (Prometheus gathers its own resource metrics separately, via the node-exporter and cAdvisor endpoints that the Helm chart wires up later.)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Verify deployment
kubectl get deployment metrics-server -n kube-system
Configure IAM OIDC and EBS CSI Driver
Prometheus needs persistent storage to retain metrics data across pod restarts. The IAM OIDC provider lets Kubernetes pods assume AWS IAM roles. The EBS CSI Driver dynamically creates EBS volumes as persistent storage for Prometheus pods. Skip this and your Prometheus data disappears every time the pod restarts.
# Associate IAM OIDC provider
eksctl utils associate-iam-oidc-provider \
--cluster my-monitoring-cluster --approve
# Create EBS CSI Driver IAM role
eksctl create iamserviceaccount \
--name ebs-csi-controller-sa \
--namespace kube-system \
--cluster my-monitoring-cluster \
--role-name AmazonEKS_EBS_CSI_DriverRole \
--role-only \
--attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
--approve
# Add the EBS CSI Driver addon (replace AWS_ACCOUNT_ID)
eksctl create addon \
--name aws-ebs-csi-driver \
--cluster my-monitoring-cluster \
--service-account-role-arn arn:aws:iam::AWS_ACCOUNT_ID:role/AmazonEKS_EBS_CSI_DriverRole \
--force
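With the CSI driver in place, you can ask the chart to back Prometheus with an EBS volume. A hedged sketch of a Helm values override (field paths follow the kube-prometheus-stack chart's values layout; pass it with -f values.yaml when installing in the next step):

```yaml
# values.yaml — request a persistent EBS-backed volume for Prometheus
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2          # default EBS StorageClass on EKS
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 30Gi
```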
Install Prometheus and Grafana via Helm
Add the Helm repos, create a prometheus namespace, and install the kube-prometheus-stack chart. This single chart deploys Prometheus server, Grafana, AlertManager, and all required exporters. Then change both Prometheus and Grafana services from ClusterIP to LoadBalancer to access their dashboards from your browser.
# Add Helm repos
helm repo add stable https://charts.helm.sh/stable
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# Create namespace and install
kubectl create namespace prometheus
helm install stable prometheus-community/kube-prometheus-stack -n prometheus
# Verify everything is running
kubectl get all -n prometheus
# Expose Prometheus dashboard (change ClusterIP to LoadBalancer)
kubectl edit svc stable-kube-prometheus-sta-prometheus -n prometheus
# Expose Grafana dashboard (change ClusterIP to LoadBalancer)
kubectl edit svc stable-grafana -n prometheus
# Get Grafana admin password
kubectl get secret --namespace prometheus stable-grafana \
-o jsonpath="{.data.admin-password}" | base64 --decode ; echo
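The secret is plain base64-encoded text, so the pipeline above simply decodes it. For example, decoding a sample value (an assumed example, not necessarily the password your install generated):

```shell
# Decoding a base64-encoded secret value, as the kubectl pipeline above does.
# "cHJvbS1vcGVyYXRvcg==" is an example value for illustration.
encoded="cHJvbS1vcGVyYXRvcg=="
echo "$encoded" | base64 --decode
# prints: prom-operator
```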
Grafana Login Credentials
Default username is admin. The initial password is set by the chart and stored as a Kubernetes secret; use the kubectl get secret command above to retrieve it. Change it immediately after first login, especially if your LoadBalancer is internet-facing.
Configure Grafana Dashboards
Open Grafana via the LoadBalancer URL and log in. The kube-prometheus-stack chart usually pre-configures Prometheus as the default data source; if it is missing, go to Add your first data source, choose Prometheus, enter the Prometheus service URL (e.g. http://stable-kube-prometheus-sta-prometheus:9090), and click "Save and Test." Then go to Dashboards, click Import, enter dashboard ID 1860 (Node Exporter Full), select your Prometheus data source, and click Load. You now have real-time CPU, RAM, network, and disk metrics for every node.
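Beyond imported dashboards, you can graph ad-hoc queries in Grafana's Explore view. Two illustrative PromQL sketches (the metric names come from cAdvisor, which the stack scrapes; adjust the namespace label to match your workloads):

```promql
# CPU usage (cores) per pod in the default namespace, averaged over 5m
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="default"}[5m]))

# Current memory working set per pod
sum by (pod) (container_memory_working_set_bytes{namespace="default"})
```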
Deploy a Test Application and Monitor It
Deploy an NGINX application with 2 replicas to see monitoring in action. Apply the YAML below, verify pods are running, then refresh Grafana to see the new pod metrics appear in your dashboard. This confirms end-to-end monitoring is working.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-app
  template:
    metadata:
      labels:
        app: nginx-app
    spec:
      containers:
        - name: nginx-app
          image: nginx:latest
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-app
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 80
  selector:
    app: nginx-app
# Save the YAML above as deployment.yml, then apply it
kubectl apply -f deployment.yml
kubectl get deployment
kubectl get pods
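The intro's memory-leak story is exactly why resource requests and limits matter: Grafana can only show usage as a percentage of a limit if a limit exists. A hedged sketch of the container spec above with requests and limits added (the specific values are illustrative starting points, not tuned recommendations):

```yaml
# nginx-app container spec with resource requests/limits added, so
# dashboards can chart usage against the limit
containers:
  - name: nginx-app
    image: nginx:latest
    ports:
      - containerPort: 80
    resources:
      requests:
        cpu: 100m
        memory: 64Mi
      limits:
        cpu: 250m
        memory: 128Mi
```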
Delete the Cluster When Done
An EKS cluster with 2x t2.medium nodes costs roughly $4.63/day ($0.10/hr for the EKS control plane + $0.0464/hr per node). Always delete the cluster when you are finished testing. Forgetting to do this is a roughly $140/month mistake we have seen dozens of times.
# Clean up when done (avoid AWS charges!)
eksctl delete cluster --name my-monitoring-cluster --region us-east-1
Key Components at a Glance
| Component | Purpose | Why You Need It |
|---|---|---|
| Metrics Server | Collects CPU/memory from Kubelets | Without it, kubectl top returns nothing |
| IAM OIDC Provider | Lets pods assume IAM roles | Required for EBS CSI Driver access |
| EBS CSI Driver | Creates persistent EBS volumes | Prometheus needs persistent storage |
| kube-prometheus-stack | Helm chart for the full monitoring stack | Installs Prometheus, Grafana, AlertManager in one command |
| Dashboard ID 1860 | Pre-built Node Exporter Full dashboard | Instant CPU, RAM, network, disk metrics visualization |
Frequently Asked Questions
Can I use Prometheus and Grafana outside of Kubernetes?
Yes. Both tools work with any infrastructure. You can monitor standalone EC2 instances, Docker containers, or bare-metal servers. Kubernetes is just the most common use case because of its dynamic pod scheduling.
Why use the kube-prometheus-stack instead of installing separately?
The stack bundles Prometheus, Grafana, AlertManager, node-exporter, and kube-state-metrics in one Helm chart. Installing them separately means managing 5+ deployments and their configurations individually. The stack handles all of that in one command.
How much storage does Prometheus need for metrics retention?
Prometheus's own default retention is 15 days. A small cluster generates roughly 1-2 GB of metrics per day, so plan for 20-30 GB of EBS storage. Retention is controlled by Prometheus's --storage.tsdb.retention.time flag; with the kube-prometheus-stack chart, set it through the prometheus.prometheusSpec.retention Helm value rather than editing the flag directly.
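The 20-30 GB figure can be sanity-checked with Prometheus's rough sizing formula: retention_seconds × ingested_samples_per_second × bytes_per_sample. A sketch assuming a small cluster ingests ~10,000 samples/s at ~2 bytes per sample after compression (both numbers are assumptions; check your own TSDB metrics for real rates):

```shell
# Rough Prometheus disk sizing: retention * ingestion rate * bytes/sample
retention_days=15
samples_per_sec=10000   # assumed ingestion rate for a small cluster
bytes_per_sample=2      # typical post-compression figure

bytes=$((retention_days * 86400 * samples_per_sec * bytes_per_sample))
echo "$((bytes / 1000000000)) GB"
# prints: 25 GB
```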
Is exposing Grafana via LoadBalancer safe for production?
Not without additional security. For production, use an Ingress controller with TLS termination, restrict access via security groups, and enable Grafana's built-in authentication with SSO or LDAP integration.
What Grafana dashboard ID should I use for Kubernetes monitoring?
Dashboard 1860 (Node Exporter Full) is the most popular for node-level metrics. For pod-level Kubernetes views, try 6417 (Kubernetes Cluster (Prometheus)) or 315 (Kubernetes cluster monitoring via Prometheus).
Running EKS Without Monitoring Dashboards?
We have diagnosed $8,300 outages caused by invisible memory leaks in unmonitored clusters. Whether you need Prometheus alerting pipelines, Grafana dashboard architecture, or full Odoo ERP infrastructure on AWS with production-grade observability — we build the DevOps stack so your team stops firefighting at 2 AM.
