November 15, 2020

Building a Solid Foundation: Kubernetes Infrastructure at Sonos

Mike Splain

Cloud Infrastructure Engineering Manager

Over the past few years, our cloud teams at Sonos have grown to support numerous services that serve our millions of customers throughout the world. Our teams have made significant strides in automating much of the work behind orchestrating, configuring and maintaining complex cloud deployments. Early in our cloud journey, Sonos embraced configuration as code and open source projects like Ansible to ensure consistency and reliability across our services. As time went on, our infrastructure teams started getting the same questions over and over:

How can we get code tested and deployed to production faster?
How can we reduce boilerplate and duplication across teams?
How can we continue to support DevOps best practices as teams evolve and change?

Luckily, over the past 5 years, there have been significant advancements in containerization and orchestration which can make answering these questions significantly easier. Although our team experimented with a number of platforms, we ended up choosing Kubernetes as a solution and were impressed with its ability to self-heal, scale, and provide a consistent abstraction layer our developers could more easily interact with. At this point, we easily could have said “Okay, let's move everything to Kubernetes”, but that would miss the additional complexity our Infrastructure teams would be taking on in supporting Kubernetes deployments. Although service teams would no longer have to support complex Ansible and cloud deployments, our team had some work ahead to better prepare for this journey.

Running services in the cloud can be a complex endeavor, so let's dive into a few of the requirements our teams have. First of all, monitoring and logging are essential to making sure applications are functioning properly. When problems arise, this tooling allows our teams to dive in quickly and diagnose the issues. Additionally, secret management and access to other dependencies such as databases, message queues and object stores are required to successfully stand up applications. Finally, we try to utilize autoscaling as much as we can for our services. We needed to ensure our Kubernetes autoscaling was robust and could react quickly to both predictable and unpredictable traffic levels to ensure good service reliability and cost management. We quickly realized our infrastructure teams would need to implement solutions that would take these and other burdens off our service teams, allowing them to focus less on “how” they run their applications and more on “what” their applications should be doing.

Cluster Configuration

When teams at Sonos deploy their applications, we set best practices that enforce reproducible and reliable application deployment across environments. We quickly realized our infrastructure team would have to do the same thing for Kubernetes cluster deployments. We began diving into areas where automation and configuration would be required to help build resilient clusters, allowing our service teams to focus on their applications and our infrastructure teams to focus on solving broad problems for multiple teams at once:

Networking & CIDR Management – To fit into existing infrastructure, we must connect to and manage many network ranges.
DNS – Services utilize DNS for service discovery inside and outside of our clusters.
User and Service account access – Teams and their automation need secure access to their namespaces and cluster tooling.
Alpha & Beta Features – Kubernetes developers often disable new APIs and features by default. We may want to enable these and allow teams to experiment with upcoming features.
Cluster Monitoring, Alerting & Logging – Clusters and their components will need to be monitored so our infrastructure teams can properly maintain them as we onboard service teams.
Autoscaling – Our clusters should scale themselves up or down as necessary to ensure all workloads are scheduled on cost effective instances.
Node Removal – Our clusters and monitoring tools should be able to remove instances or nodes that go bad for a variety of reasons without human intervention.

The next step our team followed was examining “how” we would deliver such a solution to Sonos. We spun up Kubernetes clusters with open source tools such as kOps and eksctl for our team to get familiar with the tooling. We knew we wanted options to be cloud agnostic and work with tooling that could easily be swapped out over time. Quickly, we realized that there were a few issues with building clusters and simply “deploying code to clusters.”

When building infrastructure, our teams try to follow the Pets vs Cattle analogy of Randy Bias and Bill Baker: we build highly available services and, when unforeseen issues arise, we simply remove bad instances. As we imagined a year or two down the road, we realized our clusters needed to be built with this same principle in mind. We will have many clusters over time and should avoid having clusters become the source of truth. If an issue occurs and a cluster has so many problems it can’t be recovered, we’ll rip it down and replace it with a fresh clean cluster. This required us to start building tooling that would allow us to easily deploy almost identical infrastructure in each of our environments.

Our first goal was to drive our cluster deployments using a combination of tools, including kOps, Terraform and Helm, that each manage different, tightly coupled layers of our cluster infrastructure. We began to orchestrate these tools together, driven by values files that could be layered across environments to ensure consistency while still giving us the ability to configure specific tools when necessary. We broke our configuration into three buckets: cloud account configuration, tier (dev vs prod environments) and cluster-specific configuration.

We hard-code specific tool versions to ensure consistency across our environments. Our automation utilizes these versions of tools and configuration when spinning up our clusters. For our team’s development tier, for example, we also disable Velero, a Kubernetes backup tool, since these clusters never contain real data that needs to be retained. Each cluster is tied to a tier, and we can override any value necessary. This abstraction layer gives our team a great deal of control and allows us to quickly roll out new versions or configurations to development environments and later promote them to production.
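To make the layering idea concrete, here is a minimal sketch of how account, tier and cluster values might be merged so that later layers override earlier ones. This is not our actual tooling: the `merge_layers` function, the key names, the cluster name and the version strings are all hypothetical placeholders chosen for illustration.

```python
from copy import deepcopy


def merge_layers(*layers):
    """Deep-merge dictionaries left to right; later layers win on conflicts."""
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                # Both sides are mappings: merge them key by key.
                merged[key] = merge_layers(merged[key], value)
            else:
                # Scalars, lists and new keys simply replace what was there.
                merged[key] = deepcopy(value)
    return merged


# Hypothetical values; the real files and keys are not shown in this post.
account = {
    "tools": {"kops": "1.0.0", "terraform": "1.0.0", "helm": "1.0.0"},
    "velero": {"enabled": True},
}
dev_tier = {"velero": {"enabled": False}}  # dev clusters hold no real data
cluster = {"name": "dev-example", "tier": "dev", "tools": {"helm": "1.1.0"}}

values = merge_layers(account, dev_tier, cluster)
print(values["velero"]["enabled"])  # False: the tier overrides the account
print(values["tools"])              # helm overridden by the cluster, others kept
```

Because the merge builds a new dictionary, the account and tier layers stay untouched and can be reused for every cluster in that tier.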

Deploying & Managing Clusters

Now that we had configured values and tools the way we wanted, we knew we needed to make it easy for our team to run both locally and in continuous integration. This would be critical to the stability of our clusters and we had to ensure our clusters were testable and easily manageable, just like our application service teams. We wrote shell scripts that would allow our team to securely authenticate with the proper cloud accounts, ensure tool versions defined in our values were utilized and begin working on a given cluster. By building all our tooling in an idempotent fashion, we can easily re-run our automation, allowing for fast iteration. Of course, we don’t typically want our engineers wasting time manually provisioning and running these types of things locally, but building tooling in a fashion that embraces running the same locally as in CI allows our team to quickly debug when issues arise.
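Our real tooling for this is written as shell scripts, but the idempotent pattern can be sketched in a few lines of Python. In this sketch, `version_mismatches`, `ensure_tools` and the `install` callback are hypothetical names, and the version strings are placeholders; the point is only that a second run finds nothing left to do.

```python
def version_mismatches(pinned, installed):
    """Return {tool: (wanted, found)} for each pinned tool whose version differs."""
    return {
        tool: (wanted, installed.get(tool))
        for tool, wanted in pinned.items()
        if installed.get(tool) != wanted
    }


def ensure_tools(pinned, installed, install):
    """Converge installed tools on the pinned versions.

    `install` is a hypothetical callable(tool, version) that performs the
    actual installation. Returns the list of tools that had to change.
    """
    changed = []
    for tool, (wanted, _found) in version_mismatches(pinned, installed).items():
        install(tool, wanted)
        installed[tool] = wanted
        changed.append(tool)
    return changed


# Placeholder versions, not the ones we actually pin.
pinned = {"kops": "1.0.0", "helm": "1.1.0"}
installed = {"kops": "1.0.0", "helm": "1.0.0"}
log = []

first_run = ensure_tools(pinned, installed, lambda t, v: log.append((t, v)))
second_run = ensure_tools(pinned, installed, lambda t, v: log.append((t, v)))
print(first_run, second_run)
```

The first run upgrades only the tool that is out of date, and the second run changes nothing, which is what makes re-running the automation safe both locally and in CI.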

Everything we’ve talked about here is versioned and controlled in GitHub, which led our teams to ask, “How can we better utilize CI/CD tooling to drive cluster automation?” We began to write a variety of tests that cover a range of infrastructure failure cases. For example, we created a simple test that checks our configuration values files to make sure they are valid YAML. This resolved a major pain point that had caused various hard-to-diagnose issues for our teams. We eventually created additional tests to verify every cluster configuration against a given change, letting us know if we accidentally broke the build before deployment.
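As an illustration of checking every cluster configuration on each change, the sketch below validates a list of already-parsed cluster definitions. It assumes a hypothetical schema in which each cluster must have a `name` and a `tier`, and the tier must be one that is defined; the function name and error messages are invented for this example.

```python
REQUIRED_KEYS = ("name", "tier")  # hypothetical schema for this sketch


def validate_clusters(clusters, tiers):
    """Return a list of human-readable errors; an empty list means all is well."""
    errors = []
    for cluster in clusters:
        name = cluster.get("name", "<unnamed>")
        for key in REQUIRED_KEYS:
            if key not in cluster:
                errors.append(f"{name}: missing required key '{key}'")
        tier = cluster.get("tier")
        if tier is not None and tier not in tiers:
            errors.append(f"{name}: unknown tier '{tier}'")
    return errors


# Hypothetical, already-parsed configuration.
tiers = {"dev": {}, "prod": {}}
clusters = [
    {"name": "dev-a", "tier": "dev"},
    {"name": "prod-a", "tier": "production"},  # typo: should be "prod"
    {"tier": "dev"},                           # missing name
]

errors = validate_clusters(clusters, tiers)
for error in errors:
    print(error)
```

A CI job built along these lines would fail the build whenever the returned list is non-empty, so a bad change is caught before it reaches any cluster.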

Here are a few other tests we developed on our cluster configuration:

Helm linter – We utilize Helm to deploy monitoring and other configuration tooling to our clusters. We use this linter to make sure those configurations are formatted correctly.
Terraform Validation – Before updating any infrastructure components, we run a validation to test the changes that are about to occur.
Kind deployment and upgrade – We utilize kind (Kubernetes in Docker) to spin up sample clusters quickly and upgrade them with any changes we’re making.
End to End Cluster – On certain git branches, we also spin up a full cluster in the cloud and check to make sure our configurations are working as planned.
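The first two checks in this list can be sketched as a small runner. This is an assumption about how such a job could be wired up, not a description of our pipeline: `build_checks`, `run_checks` and the directory names are hypothetical, and `helm lint` and `terraform validate` are used here as the closest standard commands to the checks described.

```python
import subprocess


def build_checks(chart_dir, terraform_dir):
    """Return (name, argv, working directory) for each static check."""
    return [
        ("helm-lint", ["helm", "lint", chart_dir], None),
        ("terraform-validate", ["terraform", "validate"], terraform_dir),
    ]


def run_checks(checks, runner=subprocess.run):
    """Run every check and return the names of the ones that failed.

    `runner` defaults to subprocess.run; it can be swapped for a fake
    when trying this out without helm or terraform installed.
    """
    failed = []
    for name, argv, cwd in checks:
        result = runner(argv, cwd=cwd)
        if result.returncode != 0:
            failed.append(name)
    return failed


# Hypothetical repository layout.
checks = build_checks("charts/monitoring", "infra")
```

Calling `run_checks(checks)` in CI would then return an empty list when both tools succeed. Running every check rather than stopping at the first failure means a single CI run reports all the problems in a change.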

Automation

By using Pull Requests to manage our Kubernetes Automation, we were able to allow our team to rapidly iterate, utilizing CI to test and manage code changes once members of our team reviewed and approved them. As these code changes piled up, our team began to experiment with Prow, a CI/CD system built by the Kubernetes project to speed up its own CI/CD workflows. Since Prow runs on Kubernetes, it gave our team additional practice managing our own workloads, scaling, and automation that would allow us to run many concurrent tests and automation tooling. We set up a Prow component called Tide to manage when our pull requests merge, making sure even old code changes are always retested against the latest configuration and tests before merging.

Once these code changes are merged, we have additional jobs that will automatically upgrade our team’s development Kubernetes cluster with the latest configuration. If a problem arises, our team is alerted in Slack that we should take a look.

Our next challenge was to build a deployment strategy that allowed us to gate specific versions of our automation code and ensure that we could selectively roll out changes to each cluster once we were confident in them. We realized that embracing the Unix philosophy of "Do One Thing and Do It Well" and writing multiple decoupled jobs to accomplish tasks was often more reliable than running long, complex jobs. For instance, we wrote a job that would check for a successful Kubernetes cluster upgrade and update a cluster version number stored in our configuration repo. If the version needs updating, this job opens a Pull Request for our team to review and approve. Before merging this pull request and rolling out the upgrade, we run additional tests to ensure the upgrade for a given cluster will be successful. If a problem arises, rolling back is as simple as reverting this pull request in GitHub.
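The decision at the heart of such a job can be sketched as a single small function. The names here (`plan_version_bump`, the `version` key, the cluster name and version strings) are hypothetical, and opening the pull request itself is left out; the sketch only shows how the job might decide whether there is anything to propose.

```python
def plan_version_bump(config, cluster, deployed_version, upgrade_succeeded):
    """Return an updated copy of `config` if the recorded version should change.

    Returns None when the upgrade failed or the recorded version already
    matches, in which case there is nothing to propose.
    """
    if not upgrade_succeeded:
        return None
    recorded = config.get(cluster, {}).get("version")
    if recorded == deployed_version:
        return None
    # Copy so the caller's configuration is never modified in place.
    updated = {name: dict(values) for name, values in config.items()}
    updated.setdefault(cluster, {})["version"] = deployed_version
    return updated


# Hypothetical configuration repo contents.
config = {"dev-example": {"version": "1.0.0"}}

proposal = plan_version_bump(config, "dev-example", "1.1.0", True)
print(proposal)  # the change that would go into a pull request
print(plan_version_bump(config, "dev-example", "1.0.0", True))  # None
```

Keeping the job this narrow is what makes it easy to reason about: it either proposes one reviewed, revertible change or does nothing.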

Wrap up

Our teams have learned quite a bit about working with and deploying Kubernetes. As with many highly technical and complicated tools, we found it very useful to “crawl, walk, run,” while applying good software development and deployment principles. Although Sonos is still in its early stages of utilizing Kubernetes, the additional work our team put into our deployment strategies has made it much simpler for us to iterate and spend more time working with service teams on solving broad deployment challenges. In the future, we hope to share additional updates on our Kubernetes deployments, tooling and migration.

Future work

Additional testing strategies – Kubernetes testing and reliability tooling is still relatively young with new projects released constantly.
Automate component upgrades – We’d like to create jobs that can automatically upgrade components and tooling when new versions are released.
Open source some of our components.

Mike Splain is a maintainer on the open source Kubernetes project kOps and will be speaking on a similar topic at KubeCon North America 2020.
