Reliable Down-Scaling on GKE

In my team, we regularly run batch jobs with very specific hardware requirements. A typical example is model training, which usually requires (multiple) GPUs to finish within a reasonable amount of time. To run such jobs, we use Google Kubernetes Engine (GKE), a managed service on Google Cloud Platform (GCP). The workflow is to submit a job, let Kubernetes allocate the required resources, run the job, and finally deallocate those resources again to avoid unnecessary costs. Typically, Kubernetes acquires new resources by provisioning new compute nodes and adding them to the cluster.

GKE makes this workflow a no-brainer, but things sometimes get tricky when it comes to the details. In the past, we had the problem that Kubernetes sometimes didn’t remove apparently unused nodes from the cluster, which increased our cloud bill on a regular basis. Before clarifying why our nodes got stuck, let’s first look at a typical GKE configuration using Terraform:

provider "google-beta" {
  project = "myproject"
  region  = "europe-west1"
}

resource "google_container_cluster" "primary" {
  name               = "my-cluster"
  initial_node_count = 2
  subnetwork         = "default-europe-west1"

  addons_config {
    kubernetes_dashboard {
      disabled = true
    }
  }

  node_config {
    machine_type = "n1-standard-1"
    oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
    disk_size_gb = 20
  }
}

resource "google_container_node_pool" "model-training-np" {
  provider           = "google-beta"
  name               = "training-np"
  cluster            = "my-cluster"
  initial_node_count = 0

  autoscaling {
    min_node_count = 0
    max_node_count = 3
  }

  node_config {
    machine_type = "n1-standard-32"
    oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
    disk_size_gb = 100

    guest_accelerator {
      type  = "nvidia-tesla-k80"
      count = 4
    }
  }
}

This configuration creates a GKE cluster with a default node pool of two compute nodes, each with one vCPU and about 4GB of RAM. That’s usually sufficient for running all the GKE system services. In addition, there is a second node pool for model training whose initial node count is zero. This node pool provides us with on-demand nodes, each with 32 vCPUs, 120GB of RAM, and 4 GPUs! Note that autoscaling caps the pool at three nodes, so whenever jobs are submitted to this node pool, Kubernetes automatically provisions up to three nodes to run them. Now, let’s deploy this configuration to our GCP project:

gcloud auth application-default login # authenticate yourself to GCP
terraform init # initialize the working directory and download the required providers
terraform plan # check what Terraform is going to do
terraform apply # apply the changes
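
Before we can submit jobs with kubectl, we also need to fetch credentials for the new cluster. Assuming the cluster name and location from the configuration above (use --zone instead of --region if your cluster turns out to be zonal), this looks roughly as follows:

gcloud container clusters get-credentials my-cluster --region europe-west1 # configure kubectl for the new cluster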

The following snippet illustrates what a typical job configuration looks like. Notice that we use a node selector to assign this job to our training node pool.

apiVersion: batch/v1
kind: Job
metadata:
  name: "build-model"
spec:
  template:
    spec:
      containers:
      - name: "build-model"
        image: "python:3.7.3-alpine3.9"
        command: ["echo", "train my model"]
        resources:
          requests:
            cpu: 24
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-nodepool: "training-np"
  backoffLimit: 0

Let’s submit this job to our freshly created GKE cluster:

kubectl apply -f job.yaml

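While the job is pending, you can follow the scale-up with a few standard kubectl commands; note that the job-name label used below is added to the pod automatically by the Job controller:

kubectl get pods --watch # the job's pod stays Pending until a new node is ready
kubectl get nodes # after a few minutes, a node from the training pool appears
kubectl describe pod -l job-name=build-model # the events show why the pod could not be scheduled right away
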
You’ll see that Kubernetes now realizes that it cannot schedule the job on the currently available nodes (remember that the initial node count of our training node pool is zero!). It therefore creates a new node, adds it to the training node pool, and eventually runs the job on that node. So far, so good. Once the job has finished, we expect Kubernetes to scale the node count back down to zero. In the case of GKE, it usually takes 20 to 30 minutes for the autoscaler to become active. As it turns out, however, such nodes sometimes just remain in the node pool, because Kubernetes has rescheduled system services onto our training nodes. To avoid running services in places where they do not belong, we must add a taint to our node pool configuration:

...
  node_config {
    ...
    taint {
      key    = "special"
      value  = "strong-cpu"
      effect = "NO_SCHEDULE"
    }
  }
...
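
Since the taint lives in the Terraform configuration, we have to roll it out with another apply. Be aware that changes to node_config may force Terraform to recreate the node pool, so it’s worth reviewing the plan first:

terraform plan # review the change; the node pool may have to be recreated
terraform apply # roll out the tainted node pool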

A taint ensures that Kubernetes only assigns workloads to these nodes if they tolerate the taint. Thus, we also add a matching toleration to our job configuration:

...
      tolerations:
      - key: "special"
        operator: "Equal"
        value: "strong-cpu"
        effect: "NoSchedule"
...
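
With the toleration in place, we can re-submit the job and verify that only tolerating workloads land on the training node and that the pool scales back down to zero once the job has finished. A quick check might look like this (replace the node name placeholder with the name reported by kubectl get nodes):

kubectl apply -f job.yaml # re-submit the job with the added toleration
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name> # confirm that no system services run on the training node
kubectl get nodes --watch # watch the training node disappear once the job is done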

That’s basically it! We’ve made sure that autoscaling works reliably, and our manager is happy that we only pay for the resources we actually use.