Most applications need full-text search. When done right, it makes finding what you’re looking for faster and more accurate. There are a number of options to choose from, like Elasticsearch, Algolia, and RediSearch. If cost is an issue (as it is for most of us), the open-source version of Elasticsearch is a great choice. It will require a bit more work on your end, but in the end you’ll have enterprise-level search at a fraction of the cost of commercial solutions.
A container management tool like Kubernetes makes installing and maintaining Elasticsearch much easier. In essence, Kubernetes automates the installation and management of Docker containers. It runs on most cloud platforms, including the big three: Amazon Web Services, Google Cloud and Azure. I’ll be using Google Cloud for this article.
Kubernetes is a large topic. If you’re new to Kubernetes I highly recommend you read Kubernetes In Action published by Manning Publications. It’s a fantastic book that will serve as a reference long after you’ve finished reading it.
What You’ll Need to Follow Along
If you want to follow along with this article you’ll need:
- A Google Cloud account and project. Google Cloud is a paid service, but you get a $300 credit when you initially sign up, and we’ll clean up when we’re done to avoid charges. Nothing we do here should put you over your credit.
- The gcloud and kubectl command line utilities installed and configured for your Google Cloud project.
Some Basic Kubernetes Terminology
A very brief introduction to Kubernetes terminology will help you get through the rest of this article. A node in Kubernetes is typically a virtual machine (VM) provided by your cloud provider. A pod is a collection of containers (typically Docker containers, but they can be other types, like rkt). A container packages an application together with all of the supporting files it needs to run. A container typically runs a single process, like Elasticsearch, in the foreground and terminates when the process is killed or dies. Containers normally log to stdout and stderr, which are captured and sent to a logging service like Google’s Stackdriver.
Kubernetes lets you declare the resources you need in YAML or JSON files called manifests. There are a large number of possible manifests. I’m only going to talk about the ones we’ll use for our Elasticsearch cluster: stateful sets, config maps and services.
A Kubernetes Stateful Set For Elasticsearch
A Kubernetes stateful set is used to create a set of pods for applications that need a stable network identity and persistent storage. They are perfect for Elasticsearch. Here’s the stateful set we’re going to use for this article:
```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: esnode
spec:
  serviceName: es-sts-governor
  replicas: 2
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: es-cluster
    spec:
      securityContext:
        fsGroup: 1000
      initContainers:
      - name: init-sysctl
        image: busybox
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        command: ["sysctl", "-w", "vm.max_map_count=262144"]
      containers:
      - name: elasticsearch
        resources:
          requests:
            memory: 9Gi
        securityContext:
          privileged: true
          runAsUser: 1000
          capabilities:
            add:
            - IPC_LOCK
            - SYS_RESOURCE
        image: docker.elastic.co/elasticsearch/elasticsearch-platinum:6.2.4
        env:
        - name: ES_JAVA_OPTS
          valueFrom:
            configMapKeyRef:
              name: es-config
              key: ES_JAVA_OPTS
        readinessProbe:
          httpGet:
            scheme: HTTP
            path: /_cluster/health?local=true
            port: 9200
          initialDelaySeconds: 5
        ports:
        - containerPort: 9200
          name: es-http
        - containerPort: 9300
          name: es-transport
        volumeMounts:
        - name: es-data
          mountPath: /usr/share/elasticsearch/data
        - name: elasticsearch-yml
          mountPath: /usr/share/elasticsearch/config/elasticsearch.yml
          subPath: elasticsearch.yml
      volumes:
      - name: elasticsearch-yml
        configMap:
          name: es-config
          items:
          - key: elasticsearch.yml
            path: elasticsearch.yml
  volumeClaimTemplates:
  - metadata:
      name: es-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 100Gi
```
If you’re new to Kubernetes this is probably a little daunting. This manifest is the most complex one we’re going to use in this article. I struggled to understand manifests like this too so I’ll cover most of it before moving on.
- Line 4: The name of the stateful set is esnode. It’s under metadata.name. This name is how you’ll reference the stateful set when you use kubectl, e.g. kubectl get sts
- Line 6: The governing service is es-sts-governor. It’s under spec.serviceName. A governing service is a headless Kubernetes service that provides the network identity for the pods the stateful set creates. If that sounds like gibberish…don’t worry. I’ll explain it when I talk about services.
- Line 7: spec.replicas specifies the number of pods the stateful set will create. Two in our case.
- Line 10: spec.template is the pod template. Stateful sets use pod templates as “cookie cutters” to create identical pods. The number of pods it will create is specified in spec.replicas. Pods are arguably the most important object in Kubernetes. To keep this article manageable, I’m not going to cover them here. I encourage you to read more about them in the Kubernetes documentation or in the Kubernetes In Action book.
- Line 13: spec.template.metadata.labels allows you to label your pods. Labels are how services identify the pods they provide access to. More on services in a bit. Make sure you give your pods a label. Our label has a key of “app” and a value of “es-cluster”. You can use your own key:value pair if you prefer.
- Line 16: I use spec.template.spec.securityContext.fsGroup to set the group ID for files in the container. This setting is needed because Elasticsearch runs with userID:groupID of 1000:1000. If this isn’t set you’ll run into permission errors.
Line 17: The spec.template.spec.initContainers block:
```yaml
initContainers:
- name: init-sysctl
  image: busybox
  imagePullPolicy: IfNotPresent
  securityContext:
    privileged: true
  command: ["sysctl", "-w", "vm.max_map_count=262144"]
```
is also meant for Elasticsearch. It sets the vm.max_map_count kernel parameter to help avoid out-of-memory exceptions. You can read more about the setting in the Elasticsearch documentation.
Line 24: The spec.template.spec.containers list is where we finally install our Elasticsearch docker container. The resources block:
```yaml
resources:
  requests:
    memory: 9Gi
```
tells Kubernetes to schedule the pod onto a node with 9 gibibytes of available memory. This will work with the nodes we’re going to create in this article, but you’ll want to adjust it when you create your actual cluster.
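As an aside (not part of this article’s manifest), you can also cap how much memory the container may use with a limits block next to requests; the 18Gi figure below is purely illustrative:

```yaml
resources:
  requests:
    memory: 9Gi     # the scheduler guarantees this much free memory on the node
  limits:
    memory: 18Gi    # illustrative cap; tune it for your own nodes
```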
Line 29: The spec.template.spec.containers.securityContext block
```yaml
securityContext:
  privileged: true
  runAsUser: 1000
  capabilities:
    add:
    - IPC_LOCK
    - SYS_RESOURCE
```
sets the user the container should run as and adds some configuration options to help Elasticsearch run properly.
Line 36: The docker image we use, docker.elastic.co/elasticsearch/elasticsearch-platinum:6.2.4, is provided by Elastic. It includes x-pack. X-pack used to require a commercial license for most of its features. However, as of the 6.3 release, much of x-pack’s functionality became free to use like the rest of Elasticsearch. Elastic also provides an image without x-pack if you prefer.
- Line 37: The spec.template.spec.containers.env block sets environment variables in the container. Environment variables simplify passing configuration data to your application. Our manifest configures the ES_JAVA_OPTS environment variable, which sets the JVM options for Elasticsearch. The value is set by a config map, another Kubernetes resource that I’ll talk about a bit later.
- Line 43: The spec.template.spec.containers.readinessProbe block is an important part of the manifest. Kubernetes will invoke your readiness probe after the pod starts to determine if the pod is healthy enough to receive traffic. Unlike a livenessProbe, Kubernetes won’t kill a pod that fails a readinessProbe. It just won’t send traffic there.
- Line 49: The spec.template.spec.containers.ports section isn’t strictly necessary, but it does serve as documentation for the ports Elasticsearch uses and you can use the port names in other Kubernetes objects instead of the port number if you like.
- Line 54: spec.template.spec.containers.volumeMounts is where Kubernetes will mount pod volumes in the container. A volume is typically a directory that may or may not contain files. A volume belongs to a pod so any container running in the pod can mount it. The volumeMount specifies where the volume should be mounted in the container’s file system.
- Line 60: spec.template.spec.volumes are typically directories that can be mounted into a container’s filesystem. We use the volume in this manifest to mount our elasticsearch.yml config file from a Kubernetes config map.
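One probe from the manifest deserves a quick code aside. Unlike the readinessProbe on line 43, a livenessProbe (which Kubernetes uses to restart a container that fails it) is not defined in this article, but as a hedged sketch it would look nearly identical:

```yaml
# Illustrative only; not part of this article's manifests.
livenessProbe:
  httpGet:
    scheme: HTTP
    path: /_cluster/health?local=true   # same local health check as the readinessProbe
    port: 9200
  initialDelaySeconds: 90               # illustrative delay; give the JVM time to start
```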
Line 67: The spec.volumeClaimTemplates section is what makes stateful sets really useful. It tells Kubernetes to provision a disk of a certain type, the access mode to use, the disk size, etc. Because I don’t specify the type of disk (with storageClassName), Google Cloud’s default, gcePersistentDisk, will be used. The accessMode “ReadWriteOnce” means only a single node can mount the volume for reading and writing. We request a 100 gibibyte disk with the resource request block:
```yaml
resources:
  requests:
    storage: 100Gi
```
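If you do want to choose the disk type explicitly, a storageClassName can be added to the claim template. The sketch below is hypothetical and assumes you have already created a StorageClass named ssd (for example, one backed by pd-ssd on Google Cloud):

```yaml
volumeClaimTemplates:
- metadata:
    name: es-data
  spec:
    accessModes: [ "ReadWriteOnce" ]
    storageClassName: ssd     # hypothetical StorageClass; create it before referencing it
    resources:
      requests:
        storage: 100Gi
```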
I also want to cover config maps, since we use one in our stateful set. A config map is a Kubernetes resource that is used to store configuration data. Data is stored in a map of key/value pairs. Values can range from short literal values to complete config files. Here’s our config map file:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: es-config
data:
  elasticsearch.yml: |
    cluster.name: your-cluster-name
    network.host: "0.0.0.0"
    bootstrap.memory_lock: false
    discovery.zen.ping.unicast.hosts: es-sts-governor
    discovery.zen.minimum_master_nodes: 1
    xpack.security.enabled: false
    xpack.monitoring.enabled: false
  ES_JAVA_OPTS: -Xms4500m -Xmx4500m
```
- Line 5: The config map data block contains the actual keys and values. The elasticsearch.yml key contains the yaml that Elasticsearch expects in its config file (the | character indicates a literal multi-line value follows). When the stateful set creates a volume using our config map, the key ‘elasticsearch.yml’ will be used as the filename for a file mounted in the volumeMount directory, and the file will contain the yaml.
- Line 10: The discovery.zen.ping.unicast.hosts setting could use a longer explanation. You may recall that es-sts-governor is the name of the governing service for the stateful set. When the service is created, Kubernetes creates an internal DNS entry using its name. When Elasticsearch starts up, it issues a DNS query for ‘es-sts-governor’, which returns the IP addresses of all of the pods the service governs. Elasticsearch then uses the pod IPs to set up or join a cluster. Check out the Elasticsearch documentation for an explanation of the other settings. You should familiarize yourself with them before you move to a live environment.
- Line 12: xpack security and monitoring are turned off to avoid complicating this article. I recommend you use both in your production cluster.
- Line 14: The ES_JAVA_OPTS key is used to populate an environment variable in our stateful set. The environment variable will contain the string value “-Xms4500m -Xmx4500m”.
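As a hedged sketch, turning the x-pack settings above back on for production is just a matter of flipping two values in the config map; note that enabling security also requires TLS and user configuration that this article doesn’t cover:

```yaml
# Illustrative production values for the two x-pack keys in elasticsearch.yml.
xpack.security.enabled: true      # also requires TLS and user setup (not covered here)
xpack.monitoring.enabled: true
```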
Kubernetes services are used to map a single IP address to a group of pods that provide a service (like Elasticsearch). We are using two types of services in our stateful set, a ‘headless’ governing service and a load balancer. Here’s the manifest for the headless service:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: es-sts-governor
spec:
  clusterIP: None
  selector:
    app: es-cluster
  ports:
  - name: transport
    port: 9300
```
- Line 4: The name of the service is es-sts-governor
- Line 6: A clusterIP of None is what makes this service ‘headless’. A DNS query for a ‘headless’ service resolves to the IP addresses for the pods it governs. It’s just what we need to enable Elasticsearch to cluster using the discovery.zen.ping.unicast.hosts setting. Elasticsearch will issue a DNS query for the service when it starts up and receive a list of IP addresses for all of the pods running Elasticsearch. They can then form an Elasticsearch cluster.
- Line 8: The service applies to pods that have a label of “app” with a value of “es-cluster”.
By default pods are only accessible on internal, private IP addresses. That’s fine in a lot of scenarios, but if you want to access your pods from outside your network, over the Internet for example, you’ll want to set up a load balancer.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: es-load-balancer
spec:
  selector:
    app: es-cluster
  ports:
  - name: http
    port: 80
    targetPort: 9200
  type: LoadBalancer
```
- Line 4: The name of the service is es-load-balancer
- Line 6: The “selector” property tells the service to route traffic to pods that have a label of “app” with a value of “es-cluster”.
- Line 9: The service routes traffic it receives on port 80 to container port 9200 (the targetPort), which is the port the Elasticsearch REST API runs on.
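Incidentally, because the stateful set gave its container ports names, targetPort can reference the name es-http instead of the number; this variant of the ports block behaves the same as the manifest above:

```yaml
ports:
- name: http
  port: 80
  targetPort: es-http   # resolves to containerPort 9200 via the named port
```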
Setting Up the Cluster
Now that we’ve gone over our Kubernetes manifests, let’s set up our cluster! I set up a repo on GitHub that contains our manifests. You can clone the repo to get the files if you like. Alternatively, you can just copy the stateful set, config map and services into files on your local machine. Put them all in a single directory (it doesn’t matter where…your home directory is fine). Open up a terminal and cd into the directory you created. Then do the following in the order I’ve written here to minimize errors. First, let’s create a container cluster in Google Cloud that we can use for our Elasticsearch cluster.
gcloud container clusters create es-cluster --zone us-central1-a --machine-type n1-highmem-2 --num-nodes 2
Note: If you receive an error like: “The Kubernetes Engine API is not enabled for project…” you’ll need to enable the Kubernetes API through the Google Cloud Console and try again.
You can use kubectl to see the nodes:
kubectl get nodes
Create the stateful set governing service.
kubectl create -f es-govern-for-sts-svc.yaml
kubectl create -f tells Kubernetes to create the service from the file es-govern-for-sts-svc.yaml in the current directory.
Create our config map
kubectl create -f es-configmap.yaml
Create our stateful set.
kubectl create -f es-node-sts.yaml
It will take a minute or two for Kubernetes to provision everything. Here are a few commands you can use to monitor progress:
```shell
# Watch the pods being created...remove the --watch to just fetch the list
kubectl get po --watch

# Inspect the container logs for pod esnode-0
kubectl logs -f esnode-0

# Describe the stateful set
kubectl describe sts esnode
```
Verify the pods are running.
kubectl get pods -o=wide
You can get a shell inside the esnode-0 pod like this if you want to poke around:
kubectl exec -it esnode-0 -- /bin/bash
Now set up the load balancer so we can access Elasticsearch from outside our private network:
kubectl create -f es-load-balancer-svc.yaml
It may take a minute for Google to assign an external IP address to the load balancer. Use kubectl to check for it.
kubectl get svc es-load-balancer
As a last check let’s make sure Elasticsearch is accessible from outside our network with curl:
curl -s http://YOUR_LOAD_BALANCER_EXTERNAL_IP_GOES_HERE
Congratulations! You’ve successfully installed Elasticsearch in Kubernetes. This is a great start to adding full-text search to your application. There’s still work to be done before you move to a live environment. First you’ll want to configure Elasticsearch for TLS encryption. I also suggest you switch the load balancer out for an Ingress. You can read more about Ingresses in the Kubernetes documentation. They provide SSL termination and routing within your cluster. You’ll also want to configure x-pack security and monitoring.
Although this article hasn’t left you with a ‘production ready’ cluster, it covers many of the foundational topics you’ll need to understand. Once you get this down you shouldn’t have any trouble picking the rest up yourself.
Let’s clean things up to avoid charges from Google.
Delete the stateful set. This will also delete the pods. It takes a while to complete.
kubectl delete sts esnode
Delete our services
kubectl delete svc es-load-balancer es-sts-governor
Delete our configmap
kubectl delete cm es-config
Delete our cluster
gcloud container clusters delete es-cluster --zone us-central1-a
Thanks for reading!