Scaling Like a Pro: Horizontal Pod Autoscaling in Kubernetes


DevOps engineer & developer passionate about building scalable, reliable systems. I design and automate pipelines, manage cloud infrastructure, and ensure deployments run smoothly. Turning complex workflows into seamless operations is my craft.

“Why run 10 pods when 2 will do? And why run 2 when traffic surges to 1,000 users?”

Enter the Horizontal Pod Autoscaler (HPA): Kubernetes' secret weapon for scaling smart.


🎬 Quick Story

I once launched an app demo during a DevOps webinar. The app was running in a K8s cluster — just 1 pod. Everything looked great on the outside… until the traffic hit.

As attendees flooded in to test the app, that single pod choked under the pressure: CPU usage shot through the roof, and eventually the pod crashed.

⚠️ No replicas. No autoscaling. No safety net.

In seconds, my demo became a case study in what not to do with production-like environments.

That day, I learned a painful but priceless lesson:
1 pod ≠ 1,000 users.

That’s when I discovered HPA — a controller that scales pods dynamically based on CPU, memory, or custom metrics. And today, I won’t deploy a service without it.

Thanks to HPA, Kubernetes is no longer just a scheduler — it’s a smart traffic cop, scaling up during peak hours and scaling down to save resources.


What is Horizontal Pod Autoscaling (HPA)?

HPA automatically adjusts the number of pods in a Deployment, ReplicaSet, or StatefulSet based on observed metrics.

Default metric: CPU utilization
Others supported: Memory (served by metrics-server, like CPU), plus custom and external metrics such as Prometheus metrics (served through an adapter, for example Prometheus Adapter)
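
Whatever the metric, the scaling decision boils down to one ratio. The Kubernetes docs describe it as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change when that ratio is within a tolerance (0.1 by default) of 1. Here's a minimal Python sketch of that rule. The function name, its arguments, and the sample numbers are mine for illustration, and the real controller also handles details this skips, such as pods that aren't ready or have no metrics yet.

```python
import math

def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=5, tolerance=0.1):
    """Replica count suggested by the documented HPA ratio rule (simplified)."""
    ratio = current_value / target_value
    # Close enough to the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Never go outside the minReplicas/maxReplicas bounds.
    return max(min_replicas, min(max_replicas, wanted))

print(desired_replicas(1, 120, 50))  # → 3 (1 pod at 120% vs a 50% target)
print(desired_replicas(3, 20, 50))   # → 2 (3 pods at 20% vs a 50% target)
print(desired_replicas(2, 52, 50))   # → 2 (within tolerance, no change)
```

The min/max defaults here simply mirror the 1 and 5 used in the HPA manifest later in this post.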


Let’s Build It – Step by Step

We’ll deploy a simple Flask app that spikes CPU on demand to test HPA.

app.py (CPU Burner)

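from flask import Flask
import time

app = Flask(__name__)

@app.route('/')
def index():
    return "Hello, world!"

@app.route('/load')
def load():
    # Busy-loop for 10 seconds to keep one CPU core pinned
    start = time.time()
    while time.time() - start < 10:
        pass
    return "CPU load generated!"

if __name__ == '__main__':
    # Start the server when run as "python app.py" (see the Dockerfile CMD).
    # Bind to all interfaces so containerPort 5000 is reachable from outside the pod.
    app.run(host='0.0.0.0', port=5000)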

Dockerfile

FROM python:3.11-slim
WORKDIR /app
COPY . .
RUN pip install flask
CMD ["python", "app.py"]

Kubernetes Manifests

Deployment (flask-deploy.yaml)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flask
  template:
    metadata:
      labels:
        app: flask
    spec:
      containers:
      - name: flask
        image: your-dockerhub/flask-cpu-app
        ports:
        - containerPort: 5000
        resources:
          limits:
            cpu: "500m"
          requests:
            cpu: "200m"
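
The requests value matters more than it looks: a CPU utilization target is measured as a percentage of the container's CPU request, not of its limit. A quick Python sketch of that arithmetic follows. The helper names and the 320m usage figure are made up for illustration, and the parser only covers the two quantity forms shown.

```python
def parse_cpu(quantity):
    """Convert a CPU quantity such as "200m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)

def utilization_percent(usage, request):
    """CPU usage expressed as a percentage of the CPU request."""
    return 100 * parse_cpu(usage) / parse_cpu(request)

# Against the 200m request above:
print(utilization_percent("320m", "200m"))  # → 160.0
# A pod pinned at the 500m limit tops out at:
print(utilization_percent("500m", "200m"))  # → 250.0
```

So with these settings a single pod can never report more than 250%, because the limit caps usage at 2.5 times the request.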

Service (flask-svc.yaml)

apiVersion: v1
kind: Service
metadata:
  name: flask-svc
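spec:
  type: NodePort  # expose the app on a node port so the load test can reach it from outside the cluster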
  selector:
    app: flask
  ports:
    - protocol: TCP
      port: 80
      targetPort: 5000

HPA (hpa.yaml)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flask-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flask-app
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

Deploy & Scale

kubectl apply -f flask-deploy.yaml
kubectl apply -f flask-svc.yaml
kubectl apply -f hpa.yaml

Confirm HPA is Working

kubectl get hpa

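You’ll see output like this (if TARGETS shows <unknown>, give it a minute, then check that metrics-server is installed in the cluster):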

NAME         REFERENCE              TARGETS   MINPODS   MAXPODS   REPLICAS
flask-hpa    Deployment/flask-app   20%/50%   1         5         1

Test It: Simulate CPU Load

Let’s hit the /load endpoint repeatedly using hey or curl:

hey -z 30s -c 10 http://<node-ip>:<node-port>/load
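
If hey isn't installed, a few lines of standard-library Python can stand in for it. This is a rough sketch, not a benchmarking tool: run_load and fetch_once are names I made up, and the defaults just mirror the -z 30s -c 10 flags above.

```python
import threading
import time
import urllib.request

def fetch_once(url):
    """Request the URL once and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.status

def run_load(url, duration_s=30, concurrency=10, fetch=fetch_once):
    """Call fetch(url) from `concurrency` threads until `duration_s` elapses."""
    deadline = time.monotonic() + duration_s
    results = {"ok": 0, "failed": 0}
    lock = threading.Lock()

    def worker():
        while time.monotonic() < deadline:
            try:
                fetch(url)
                outcome = "ok"
            except Exception:
                outcome = "failed"
            with lock:
                results[outcome] += 1

    threads = [threading.Thread(target=worker) for _ in range(concurrency)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Call it as run_load("http://<node-ip>:<node-port>/load") with your own address filled in. The fetch parameter is there so you can swap in a stub and try the function without a cluster.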

Give the metrics pipeline a little time to catch up (the HPA controller re-evaluates every 15 seconds by default), then check HPA again:

kubectl get hpa

You should see something like:

flask-hpa    Deployment/flask-app   160%/50%   1         5         3

📈 Result: Pods automatically scaled from 1 → 3 based on CPU!
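
If you'd rather log the scale-up than keep re-running kubectl get hpa, the same numbers are available as JSON. Below is a small Python sketch that shells out to kubectl and pulls out the interesting fields. The function names are hypothetical, it assumes kubectl is on your PATH and pointed at the right cluster, and it only looks at the CPU resource metric.

```python
import json
import subprocess

def summarize_hpa(hpa):
    """Pull replica counts and CPU utilization out of an HPA object (a dict)."""
    status = hpa.get("status", {})
    cpu = None
    for metric in status.get("currentMetrics") or []:
        resource = metric.get("resource", {})
        if metric.get("type") == "Resource" and resource.get("name") == "cpu":
            cpu = resource.get("current", {}).get("averageUtilization")
    return {
        "current": status.get("currentReplicas"),
        "desired": status.get("desiredReplicas"),
        "cpu_percent": cpu,
    }

def read_hpa(name="flask-hpa"):
    """Fetch the HPA as JSON via kubectl and summarize it (needs cluster access)."""
    out = subprocess.run(
        ["kubectl", "get", "hpa", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return summarize_hpa(json.loads(out))
```

Calling read_hpa() every 15 seconds or so during the load test gives you a simple timeline of current vs. desired replicas.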


📉 Cool Down

Once load subsides, Kubernetes scales pods back down to the minimum (1) after a cool-down delay (5 minutes by default), conserving cluster resources.
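
The reason scale-down isn't immediate is the stabilization window: rather than acting on the latest recommendation, the HPA looks back over a window (300 seconds by default for scale-down) and uses the highest replica count recommended in that period. Here's a simplified Python sketch of the idea. The function name and the timeline are invented for illustration, and the real controller layers scaling policies on top of this.

```python
def stabilized_replicas(recommendations, now, window_s=300):
    """Highest recommendation seen in the last `window_s` seconds.

    `recommendations` is a list of (timestamp_seconds, desired_replicas)
    pairs and must include at least one entry inside the window.
    """
    recent = [replicas for ts, replicas in recommendations if now - ts <= window_s]
    return max(recent)

# Hypothetical timeline: load stops after t=60, so later recommendations drop to 1.
history = [(0, 3), (60, 3), (120, 1), (240, 1), (360, 1)]

print(stabilized_replicas(history, now=240))  # → 3 (the 3s are still in the window)
print(stabilized_replicas(history, now=400))  # → 1 (only 1s remain in the window)
```

This is why a short dip in traffic doesn't immediately cost you pods you'd need again a minute later.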


Key Takeaways

  • HPA keeps your app lean when idle and powerful when under load.

  • CPU requests are essential for autoscaling on utilization: the percentage is measured against each container's request (limits are optional here).

  • You can scale on memory or custom metrics using autoscaling/v2.

  • HPA = built-in resiliency + performance optimization.
