
Why Kubernetes Is So Complex

10/17/2022

The container wars are over, and Kubernetes won. Docker, the company, has given up on Swarm and refocused on developer tools, and Mesos development has slowed to a crawl.

Kubernetes is embraced by large companies with dedicated infrastructure teams, but hobbyists and small businesses are being left behind. Unnecessary complexity is probably the most commonly mentioned reason for not adopting Kubernetes.

But why is Kubernetes so complex? And is the complexity actually unnecessary? Let's look at how Kubernetes works behind the scenes, and why the complexity may be a tradeoff worth making.

Kubernetes is modular

Beginners are often overwhelmed by Kubernetes. CRD? CNI? CSI? etcd? GroupVersionKind? Why is deploying containers so complicated?

At the core of Kubernetes is the API server, which is a CRUD API, meaning we can create, read, update and delete resources. The key to understanding the API server is the CustomResourceDefinition (CRD). CRDs tell the API server which resources exist and what fields they have. Core resources are technically not CRDs, but their behavior is otherwise identical.
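To make this concrete, here is a rough sketch of a CustomResourceDefinition, written as a Go object using the apiextensions types. It registers a resource like the SMTPCredentials resource we will meet later in this post; the schema here is trimmed and made up for illustration.

import (
  apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// A CustomResourceDefinition is itself just another resource we send to the
// API server. Once it is stored, manifests with kind: SMTPCredentials are valid.
var smtpCredentialsCRD = apiextensionsv1.CustomResourceDefinition{
  ObjectMeta: metav1.ObjectMeta{
    Name: "smtpcredentials.cloudplane.org", // must be <plural>.<group>
  },
  Spec: apiextensionsv1.CustomResourceDefinitionSpec{
    Group: "cloudplane.org",
    Scope: apiextensionsv1.NamespaceScoped,
    Names: apiextensionsv1.CustomResourceDefinitionNames{
      Plural:   "smtpcredentials",
      Singular: "smtpcredentials",
      Kind:     "SMTPCredentials",
      ListKind: "SMTPCredentialsList",
    },
    Versions: []apiextensionsv1.CustomResourceDefinitionVersion{{
      Name:    "v1alpha1",
      Served:  true,
      Storage: true,
      Schema: &apiextensionsv1.CustomResourceValidation{
        // The schema the API server validates incoming manifests against
        // (real CRDs would list their fields under Properties).
        OpenAPIV3Schema: &apiextensionsv1.JSONSchemaProps{Type: "object"},
      },
    }},
  },
}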

When we send a resource manifest to the API server, the following happens:

  1. It validates all fields against a stored CustomResourceDefinition.
  2. It calls registered webhooks ("admission controllers"). They perform additional validations that are specific to the resource.
  3. It stores the resource in its storage backend (typically etcd).

At this stage, we haven't actually done anything yet. No images are being pulled, no containers are being deployed. How does that happen? We left out one feature of the API server: The ability to watch for changes.

Nearly everything in Kubernetes happens through programs that watch the API server for changes; we call them controllers. When a resource changes, a controller watching it will run and perform an action. We call that reconciliation: turning the desired state into the actual state. When we create a Deployment, controllers use the information in the deployment manifest to create Pod resources (via an intermediate ReplicaSet), and the scheduler assigns each pod to a node. The node agent, or kubelet, watches the pod resource and deploys the container(s). When the pods are deployed (or failed to deploy), the kubelet reports the result by writing to the pod's status field.
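As a sketch of what "watching the API server" looks like in code, here is a minimal controller built with the controller-runtime library (the same library used by the controller later in this post). The DeploymentWatcher name is made up; the library maintains the watch connection and calls Reconcile on every change to a Deployment.

package main

import (
  "context"

  appsv1 "k8s.io/api/apps/v1"
  ctrl "sigs.k8s.io/controller-runtime"
  "sigs.k8s.io/controller-runtime/pkg/client"
)

type DeploymentWatcher struct {
  client.Client
}

// Reconcile is invoked for every change to a watched Deployment (and retried on errors).
func (r *DeploymentWatcher) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
  var dep appsv1.Deployment
  if err := r.Get(ctx, req.NamespacedName, &dep); err != nil {
    return ctrl.Result{}, client.IgnoreNotFound(err)
  }
  // Compare the desired state (dep.Spec) with the actual state and act on the difference here.
  return ctrl.Result{}, nil
}

func main() {
  mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
  if err != nil {
    panic(err)
  }
  // Register the watch: every Deployment event becomes a Reconcile call.
  err = ctrl.NewControllerManagedBy(mgr).
    For(&appsv1.Deployment{}).
    Complete(&DeploymentWatcher{Client: mgr.GetClient()})
  if err != nil {
    panic(err)
  }
  _ = mgr.Start(ctrl.SetupSignalHandler())
}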

This mode of operation is at the very core of Kubernetes: resources are decoupled from their implementation. The CSI (Container Storage Interface) is an excellent example of the strengths of this approach. Where Docker volumes are stored locally, Kubernetes' flexibility allows it to support any kind of volume: local disks, Ceph, NFS or provider-specific volume implementations.

It is important that we don't do the same thing twice, however: if we try to provision both a local and an NFS volume to the same mount point, we'll run into problems. Most resources are reconciled by a single controller, but for storage you sometimes have more. The storage class is a string field on volume claims that identifies the responsible controller. Notably, this is not a feature of the API server; we depend on the controllers to respect it.
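For illustration, here is roughly what that field looks like on a PersistentVolumeClaim, expressed with the Go API types. The names and the "nfs" class are made up; the point is only the StorageClassName field, which says which controller should provision the volume.

import (
  corev1 "k8s.io/api/core/v1"
  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// A claim that names its storage class. Only the controller responsible for
// the "nfs" class should reconcile it; the API server does not enforce this.
func exampleClaim() corev1.PersistentVolumeClaim {
  storageClass := "nfs" // hypothetical class name
  return corev1.PersistentVolumeClaim{
    ObjectMeta: metav1.ObjectMeta{Name: "example-data", Namespace: "default"},
    Spec: corev1.PersistentVolumeClaimSpec{
      StorageClassName: &storageClass,
      AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
      // Resource requests omitted for brevity.
    },
  }
}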

Kubernetes is eventually consistent

Anyone using Kubernetes will have come across one thing: CrashLoopBackOff. But why does it occur?

Kubernetes is a big system with many components, and some things must happen in a certain order. However, there is no attempt to synchronize or order operations; instead, failed operations are retried.

When a pod requires a volume, it will have to wait for the CSI driver to create and mount it. How long does that take? Usually a couple of seconds, but if there's an error with the CSI driver, it may not happen for hours or days. So when a pod is created, the kubelet will check if the volume is available. If it is not, it will wait a few seconds and try again. But, so as not to waste system resources, it will wait a little longer every time. This is called error back-off, and it's built into all controllers.
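The same pattern shows up when writing your own controller with controller-runtime (used later in this post): returning an error from Reconcile makes the library retry the request with an increasing delay, and you can also ask explicitly to be called again later. The PodVolumeChecker type and its helper below are hypothetical.

import (
  "context"
  "time"

  ctrl "sigs.k8s.io/controller-runtime"
  "sigs.k8s.io/controller-runtime/pkg/client"
)

type PodVolumeChecker struct {
  client.Client
}

func (r *PodVolumeChecker) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
  ready, err := r.volumeReady(ctx, req)
  if err != nil {
    // Returning an error requeues the request; controller-runtime's default
    // rate limiter waits exponentially longer after each consecutive failure.
    return ctrl.Result{}, err
  }
  if !ready {
    // Not an error, just not done yet: ask to be reconciled again later.
    return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
  }
  return ctrl.Result{}, nil
}

// volumeReady is a stand-in for "has the CSI driver mounted the volume yet?".
func (r *PodVolumeChecker) volumeReady(ctx context.Context, req ctrl.Request) (bool, error) {
  return false, nil
}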

All of the above has one major implication: it's difficult to figure out where things went wrong. When docker run fails, you are told why. When you create a deployment, your initial feedback is nothing at all. Only the status fields of the deployment and its pods will tell you more, and even they may not point directly at the problem.

With some experience and a user interface like Lens, debugging becomes easier. And there are great monitoring solutions for production use. But this is still a big hurdle for beginners taking their first steps with Kubernetes.

Kubernetes is declarative

Much has been said about the advantages of declarative configuration, infrastructure as code and GitOps. If you believe in them, Kubernetes is for you. Kubernetes is the natural progression of Terraform into an always-running API, except that Kubernetes does not require fixed ordering and scales far better.

Flux lets us use a git repository as the single source of truth for our cluster, but for dynamic resources we will want to use the API server instead. We use both of these approaches; Flux sets up our cluster, and applications are created using our own Application resource.

Let's write a controller

What does a controller actually look like? For Cloudplane, we have a resource to request SMTP credentials for our applications.

apiVersion: cloudplane.org/v1alpha1
kind: SMTPCredentials
metadata:
  name: example-app-smtp

When we submit this resource to the API server, a controller watching this resource type will use it to generate a secret. By convention, we emit the secret with the same name as the SMTPCredentials resource.

For development, we have deployed an instance of MailHog to our cluster, a simple mail server that catches all mail without forwarding it. It accepts any user/password combination. Here is what our MailHog controller looks like:

func (r *SMTPCredentialsReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
  // Fetch the SMTPCredentials resource that triggered this reconciliation.
  // (cloudplanev1alpha1 stands in for the generated Go types of the cloudplane.org/v1alpha1 group.)
  var cred cloudplanev1alpha1.SMTPCredentials
  if err := r.Get(ctx, req.NamespacedName, &cred); err != nil {
    return ctrl.Result{}, client.IgnoreNotFound(err)
  }

  // The secret is emitted under the same name and namespace as the SMTPCredentials resource.
  secret := corev1.Secret{
    TypeMeta: metav1.TypeMeta{APIVersion: corev1.SchemeGroupVersion.String(), Kind: "Secret"},
    ObjectMeta: metav1.ObjectMeta{
      Name:      req.Name,
      Namespace: req.Namespace,
    },
    Data: map[string][]byte{
      "host":     []byte("mailhog.dev"),
      "port":     []byte("1025"),
      "username": []byte("foo@example.com"),
      "password": []byte("smtp-password"),
    },
  }

  // Owning the secret means the API server garbage-collects it together with the SMTPCredentials resource.
  if err := ctrl.SetControllerReference(&cred, &secret, r.Scheme); err != nil {
    return ctrl.Result{}, err
  }

  // Server-side apply: creates the secret, or updates it if it already exists.
  if err := r.Patch(ctx, &secret, client.Apply, client.ForceOwnership, client.FieldOwner("smtpcredentials-controller")); err != nil {
    return ctrl.Result{}, err
  }

  return ctrl.Result{}, nil
}

Since we already know the hostname of our MailHog instance, and we don't need to generate a username or password, all values are static. Setting an owner reference on the secret means the API server will delete it when the SMTPCredentials resource is deleted, and finally Patch creates or updates the secret. If this function returned an error, it would automatically be retried and we would see the error back-off behavior described earlier.

In production, we replace this controller with our implementation for SES.

Kubernetes works

Kubernetes was created by engineers at Google who had been running a similar system for years. They knew exactly what they were doing. The design of Kubernetes is very intentional.

There are aspects of Kubernetes that are unfinished (StatefulSets) or need to be reworked (Ingress -> Gateway API); it is not a perfect system. But it probably is the best system we currently have.

The good news is that there are many efforts to make Kubernetes easier to use. Acorn aims to simplify application packaging and deployment. And services like Render and our very own Cloudplane use Kubernetes under the hood to deliver a user-friendly solution that requires zero Kubernetes knowledge.