10/17/2022
The container wars are over, and Kubernetes won. Docker, the company, has given up on Swarm and is refocusing on developer tools, and Mesos development has slowed to a crawl.
Kubernetes is embraced by large companies with dedicated infrastructure teams, but hobbyists and small businesses are being left behind. Unnecessary complexity is probably the most commonly mentioned reason for not adopting Kubernetes.
But why is Kubernetes so complex? And is the complexity actually unnecessary? Let's look at how Kubernetes works behind the scenes, and why the complexity may be a tradeoff worth making.
Beginners are often overwhelmed by Kubernetes. CRD? CNI? CSI? etcd? GroupVersionKind? Why is deploying containers so complicated?
At the core of Kubernetes is the API server, which is a CRUD API, meaning we can create, read, update and delete resources. The key to understanding the API server is the CustomResourceDefinition (CRD): CRDs tell the API server which resources exist and what fields they have. Core resources are technically not CRDs, but their behavior is otherwise identical.
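As an illustration, a simplified CustomResourceDefinition for the SMTPCredentials resource used later in this post could look roughly like this. The schema shown here is a sketch, not our actual definition:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: smtpcredentials.cloudplane.org   # must be <plural>.<group>
spec:
  group: cloudplane.org
  scope: Namespaced
  names:
    kind: SMTPCredentials
    singular: smtpcredentials
    plural: smtpcredentials
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object   # field definitions and validation rules would go here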
When we send a resource manifest to the API server, the following happens: the manifest is validated against the schema from its CustomResourceDefinition, and the resource is persisted in etcd.
At this stage, we haven't actually done anything yet. No images are being pulled, no containers are being deployed. How does that happen? We left out one feature of the API server: the ability to watch for changes.
Nearly everything in Kubernetes happens through programs that watch the API server for changes; we call them controllers. When a resource changes, a controller watching it will run and perform an action. We call that reconciliation: turning the desired state into the actual state. When we create a Deployment, a controller uses the information provided in the deployment manifest to find one or more nodes and creates Pod resources for those nodes. The node agent, or kubelet, watches the pod resource and deploys the container(s). When the pods are deployed (or fail to deploy), the kubelet reports the result by writing to the pod's status field.
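To make this concrete, here is a minimal Deployment manifest; the names and image are illustrative placeholders. Submitting it results in two Pod resources being created, and the kubelet on each assigned node pulls the image and starts the container:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 2            # the controller will create two Pod resources
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: web
          image: nginx:1.23   # illustrative image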
This mode of operation is at the very core of Kubernetes: resources are detached from their implementation. The CSI (Container Storage Interface) is an excellent example of the strengths of this approach. Where Docker volumes are stored locally, Kubernetes' flexibility allows it to support any kind of volume: local, Ceph, NFS or provider-specific volume implementations.
It is important that we don't do the same thing twice, however: if we try to provision both a local and an NFS volume to the same mount point, we'll run into problems. Most resources are reconciled by a single controller, but for storage, you sometimes have more than one. The storage class is a string field on volume claims that identifies the responsible controller. Notably, this is not a feature of the API server; we depend on the controllers to respect it.
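For example, a persistent volume claim selects its provisioner via storageClassName; the class name below is just an assumption about what is installed in the cluster:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-app-data
spec:
  storageClassName: nfs-csi   # the controller backing this class provisions the volume;
                              # the API server itself doesn't care what the string means
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi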
Anyone using Kubernetes will have come across one thing: CrashLoopBackOff. But why does it occur?
Kubernetes is a big system with many components, and some things must happen in a certain order. However, there is no attempt to synchronize or order operations; instead, failed operations are retried.
When a pod requires a volume, it has to wait for the CSI driver to create and mount it. How long does that take? Usually a couple of seconds, but if there's an error with the CSI driver, it may not happen for hours or days. So when a pod is created, the kubelet will check whether the volume is available. If it is not, it will wait a few seconds and try again. But, so as not to waste system resources, it will wait a little longer every time. This is called error back-off, and it's built into all controllers.
All of the above has one major implication: it's difficult to figure out where things went wrong. When docker run fails, you will be told why. When you create a deployment, your initial feedback will be nothing at all. Only the status of the deployment and pod will tell you more. And even they may not point directly at the problem.
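To illustrate, the relevant part of a crashing pod's status might look roughly like this (abbreviated):

status:
  phase: Running
  containerStatuses:
    - name: web
      ready: false
      restartCount: 4                 # the kubelet keeps retrying, backing off each time
      state:
        waiting:
          reason: CrashLoopBackOff    # waiting for the next restart attempt
      lastState:
        terminated:
          reason: Error
          exitCode: 1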
With some experience and a user interface like Lens, debugging becomes easier. And there are great monitoring solutions for production use. But this is still a big hurdle for beginners taking their first steps with Kubernetes.
What does a controller actually look like? For Cloudplane, we have a resource to request SMTP credentials for our applications.
apiVersion: cloudplane.org/v1alpha1
kind: SMTPCredentials
metadata:
  name: example-app-smtp
When we submit this resource to the API server, a controller watching this resource type will use it to generate a secret. By convention, we emit the secret with the same name as the SMTPCredentials resource.
For development, we have deployed an instance of MailHog to our cluster, a simple mail server that catches all mail without forwarding it. It accepts any user/password combination. Here is what our MailHog controller looks like:
func (r *SMTPCredentialsReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the SMTPCredentials resource that triggered this reconciliation
	// (the package alias for our API types is assumed here).
	var cred cloudplanev1alpha1.SMTPCredentials
	if err := r.Get(ctx, req.NamespacedName, &cred); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	secret := corev1.Secret{
		TypeMeta: metav1.TypeMeta{APIVersion: corev1.SchemeGroupVersion.String(), Kind: "Secret"},
		ObjectMeta: metav1.ObjectMeta{
			Name:      req.Name,
			Namespace: req.Namespace,
		},
		Data: map[string][]byte{
			"host":     []byte("mailhog.dev"),
			"port":     []byte("1025"),
			"username": []byte("foo@example.com"),
			"password": []byte("smtp-password"),
		},
	}

	// Make the SMTPCredentials resource the owner of the secret.
	if err := ctrl.SetControllerReference(&cred, &secret, r.Scheme); err != nil {
		return ctrl.Result{}, err
	}

	// Server-side apply: create the secret, or update it if it already exists.
	if err := r.Patch(ctx, &secret, client.Apply, client.ForceOwnership, client.FieldOwner("smtpcredentials-controller")); err != nil {
		return ctrl.Result{}, err
	}

	return ctrl.Result{}, nil
}
Since we already know the hostname of our MailHog instance, and we don't need to generate any username/password, all values are static. Setting an owner reference on the secret means it will be deleted by the API server when the SMTPCredentials resource is deleted, and finally Patch creates or updates the secret. If this function returned an error, it would be automatically retried and we would see the error back-off behavior described earlier.
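For reference, the secret emitted by this controller would end up looking roughly like this when read back from the API server (the namespace and owner UID depend on the cluster):

apiVersion: v1
kind: Secret
metadata:
  name: example-app-smtp
  namespace: default                       # same namespace as the SMTPCredentials resource
  ownerReferences:                         # added by SetControllerReference
    - apiVersion: cloudplane.org/v1alpha1
      kind: SMTPCredentials
      name: example-app-smtp
      uid: <uid of the SMTPCredentials object>
      controller: true
      blockOwnerDeletion: true
data:                                      # values appear base64-encoded when read back
  host: bWFpbGhvZy5kZXY=                   # "mailhog.dev"
  port: MTAyNQ==                           # "1025"
  username: Zm9vQGV4YW1wbGUuY29t           # "foo@example.com"
  password: c210cC1wYXNzd29yZA==           # "smtp-password"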
In production, we replace this controller with our implementation for SES.
Kubernetes was created by engineers at Google who had been running a similar system for years. They knew exactly what they were doing. The design of Kubernetes is very intentional.
There are aspects of Kubernetes that are unfinished (StatefulSets) or need to be reworked (Ingress → Gateway API); it is not a perfect system. But it is probably the best system we currently have.
The good news is that there are many efforts to make Kubernetes easier to use. Acorn aims to simplify application packaging and deployment. And services like Render and our very own Cloudplane use Kubernetes under the hood to deliver a user-friendly solution that requires zero Kubernetes knowledge.