Scaling Like a Pro: Horizontal Pod Autoscaling in Kubernetes
“Why run 10 pods when 2 will do? And why run 2 when traffic surges to 1,000 users?”
Enter the Horizontal Pod Autoscaler (HPA): Kubernetes' secret weapon for scaling smart.
🎬 Quick Story
I once launched an app demo during a DevOps webinar. The app was running in a K8s cluster — just 1 pod. Everything looked great on the outside… until the traffic hit.
As attendees flooded in to test the app, that single pod choked under pressure, CPU usage shot through the roof, and eventually — it crashed.
⚠️ No replicas. No autoscaling. No safety net.
In seconds, my demo became a case study in what not to do with production-like environments.
That day, I learned a painful but priceless lesson:
1 pod ≠ 1,000 users.
That’s when I discovered HPA — a controller that scales pods dynamically based on CPU, memory, or custom metrics. And today, I won’t deploy a service without it.
Thanks to HPA, Kubernetes is no longer just a scheduler — it’s a smart traffic cop, scaling up during peak hours and scaling down to save resources.
What is Horizontal Pod Autoscaling (HPA)?
HPA automatically adjusts the number of pods in a Deployment, ReplicaSet, or StatefulSet based on observed metrics.
Most common metric: CPU utilization (reported by metrics-server)
Others supported: memory (also via metrics-server), plus custom and external metrics such as Prometheus metrics (via a metrics adapter)
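To make "adjusts the number of pods based on observed metrics" concrete, here is a minimal sketch of the scaling rule described in the Kubernetes docs: desired replicas = ceil(current replicas × current metric / target metric). The function name is made up for illustration, the default bounds of 1 and 5 mirror the HPA manifest used later in this post, and the 10% tolerance is the controller's documented default. The real controller also deals with missing metrics and pods that aren't ready, which this sketch ignores.

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=5, tolerance=0.1):
    """Approximate the HPA scaling rule for a single utilization metric."""
    ratio = current_utilization / target_utilization
    # Inside the tolerance band the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the minReplicas/maxReplicas bounds set on the HPA.
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(1, 20, 50))   # 1 (already at the minimum)
print(desired_replicas(1, 120, 50))  # 3 (ceil of 2.4)
print(desired_replicas(3, 90, 50))   # 5 (ceil of 5.4 is 6, capped at max)
```

The utilization numbers here are illustrative inputs, not measurements from a real cluster.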
Let’s Build It – Step by Step
We’ll deploy a simple Flask app that spikes CPU on demand to test HPA.
app.py (CPU Burner)
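from flask import Flask
import time

app = Flask(__name__)

@app.route('/')
def index():
    return "Hello, world!"

@app.route('/load')
def load():
    # Busy-loop for 10 seconds to drive CPU usage up
    start = time.time()
    while time.time() - start < 10:
        pass
    return "CPU load generated!"

if __name__ == "__main__":
    # Listen on all interfaces so port 5000 is reachable inside the container
    app.run(host="0.0.0.0", port=5000)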
Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY . .
RUN pip install flask
CMD ["python", "app.py"]
Kubernetes Manifests
Deployment (flask-deploy.yaml)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flask
  template:
    metadata:
      labels:
        app: flask
    spec:
      containers:
        - name: flask
          image: your-dockerhub/flask-cpu-app
          ports:
            - containerPort: 5000
          resources:
            limits:
              cpu: "500m"
            requests:
              cpu: "200m"
Service (flask-svc.yaml)
apiVersion: v1
kind: Service
metadata:
  name: flask-svc
spec:
  type: NodePort  # expose on each node so <node-ip>:<node-port> is reachable
  selector:
    app: flask
  ports:
    - protocol: TCP
      port: 80
      targetPort: 5000
HPA (hpa.yaml)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flask-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flask-app
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
Deploy & Scale
kubectl apply -f flask-deploy.yaml
kubectl apply -f flask-svc.yaml
kubectl apply -f hpa.yaml
Confirm HPA is Working
kubectl get hpa
You’ll see output like this (TARGETS shows <unknown> until metrics-server starts reporting CPU usage):
NAME        REFERENCE              TARGETS   MINPODS   MAXPODS   REPLICAS
flask-hpa   Deployment/flask-app   20%/50%   1         5         1
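The TARGETS column reads as current/target, and the "current" side is CPU usage expressed as a percentage of the pods' CPU request (200m in the Deployment above). Here is a rough sketch of that arithmetic. The helper names and the usage figures (40m, 320m, and so on) are hypothetical, picked only so the results line up with the percentages shown in this post, and the sketch assumes every pod has the same request.

```python
def parse_millicores(quantity):
    """Convert a Kubernetes CPU quantity such as "200m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)

def average_cpu_utilization(pod_usages, request):
    """Total usage across pods as a percentage of their total CPU request."""
    total_usage = sum(parse_millicores(u) for u in pod_usages)
    total_request = parse_millicores(request) * len(pod_usages)
    return round(100 * total_usage / total_request)

print(average_cpu_utilization(["40m"], "200m"))           # 20
print(average_cpu_utilization(["320m"], "200m"))          # 160
print(average_cpu_utilization(["300m", "100m"], "200m"))  # 100
```

This is also why utilization can exceed 100%: the limit (500m) sits above the request (200m), so a busy pod can use up to 250% of what it requested.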
Test It: Simulate CPU Load
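Let’s hit the /load endpoint repeatedly with a load generator such as hey (a curl loop works too). Replace <node-ip> and <node-port> with an address where the Service is reachable from your machine: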
hey -z 30s -c 10 http://<node-ip>:<node-port>/load
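If hey isn't installed, a few lines of standard-library Python can stand in for it. This is only a sketch of the same idea (10 concurrent workers for 30 seconds), not a replacement for a proper load-testing tool. The names fetch and run_load are made up here, and the URL in the usage comment is the same placeholder as in the hey command.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Issue one GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.status

def run_load(url, duration_s=30, concurrency=10, fetch_fn=fetch):
    """Keep `concurrency` workers requesting `url` until `duration_s` elapses.

    Returns the number of requests that completed without a network error.
    """
    deadline = time.monotonic() + duration_s

    def worker():
        completed = 0
        while time.monotonic() < deadline:
            try:
                fetch_fn(url)
                completed += 1
            except OSError:
                pass  # skip connection errors and timeouts, keep the load going
        return completed

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(worker) for _ in range(concurrency)]
        return sum(f.result() for f in futures)

# Usage (fill in the placeholders first):
# run_load("http://<node-ip>:<node-port>/load", duration_s=30, concurrency=10)
```

Because each /load request burns CPU for 10 seconds, expect a low request count; the point is sustained CPU pressure, not throughput.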
Give it a little time (the HPA re-evaluates metrics about every 15 seconds by default), then check HPA again:
kubectl get hpa
You should see something like:
flask-hpa   Deployment/flask-app   160%/50%   1         5         3
📈 Result: Pods automatically scaled from 1 → 3 based on CPU!
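If you’d rather track the scale-up from a script than by re-running the command by hand, kubectl can emit the HPA as JSON. The sketch below reads the replica counts and CPU utilization from the status fields of an autoscaling/v2 HPA. The helper names are hypothetical, and the sample dictionary is hand-written to mirror the row above, not captured from a real cluster.

```python
import json
import subprocess

def summarize_hpa(hpa):
    """Pull the replica counts and CPU utilization out of an HPA object."""
    status = hpa.get("status", {})
    cpu = None
    for metric in status.get("currentMetrics") or []:
        resource = metric.get("resource", {})
        if metric.get("type") == "Resource" and resource.get("name") == "cpu":
            cpu = resource.get("current", {}).get("averageUtilization")
    return {
        "current_replicas": status.get("currentReplicas"),
        "desired_replicas": status.get("desiredReplicas"),
        "cpu_utilization": cpu,
    }

def fetch_hpa(name="flask-hpa"):
    """Read the live HPA object through kubectl (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "hpa", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)

# Hand-written sample shaped like `kubectl get hpa flask-hpa -o json`
sample = {
    "status": {
        "currentReplicas": 3,
        "desiredReplicas": 3,
        "currentMetrics": [
            {"type": "Resource",
             "resource": {"name": "cpu", "current": {"averageUtilization": 160}}}
        ],
    }
}
print(summarize_hpa(sample))
```

Against a real cluster, summarize_hpa(fetch_hpa()) in a loop with a short sleep gives a simple scaling timeline.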
📉 Cool Down
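Once load subsides, Kubernetes scales the pods back down to the minimum (1), conserving cluster resources. The scale-down is deliberately slow: by default the HPA applies a 5-minute stabilization window before removing pods, so a brief dip in traffic doesn’t cause replicas to flap.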
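The reason scale-down feels slow is the stabilization window: when deciding whether to shrink, the controller looks back over its recent replica recommendations (300 seconds by default) and uses the highest one. Here is a small sketch of that rule. The function name and the timestamps are invented for illustration, and the real controller tracks this history internally.

```python
def stabilized_replicas(history, current_recommendation, now, window_s=300):
    """Pick the highest recommendation seen within the last `window_s` seconds.

    history: list of (timestamp_seconds, recommended_replicas) tuples.
    """
    recent = [replicas for ts, replicas in history if now - ts <= window_s]
    return max([current_recommendation] + recent)

# Load stops at t=0; recommendations fall from 3 to 1 over the next minutes.
history = [(0, 3), (120, 2), (240, 1)]
print(stabilized_replicas(history, 1, now=290))  # 3 (the t=0 entry is still in the window)
print(stabilized_replicas(history, 1, now=310))  # 2 (the t=0 entry has aged out)
print(stabilized_replicas(history, 1, now=430))  # 1 (only the t=240 entry remains)
```

The window can be tuned through behavior.scaleDown.stabilizationWindowSeconds in an autoscaling/v2 HPA if five minutes is too cautious for your workload.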
Key Takeaways
HPA keeps your app lean when idle and powerful when under load.
CPU requests are essential for utilization-based autoscaling: the HPA measures usage as a percentage of each pod’s request.
You can scale on memory or custom metrics using autoscaling/v2.
HPA = built-in resiliency + performance optimization.