Releasing the Sonos System Software
Director of Software Engineering, Sonos Software Release Team
The Sonos sound system is powered by a unique combination of premium hardware and powerful software. We take pride in continually evolving and improving our customers’ experience with their Sonos system by providing regular software releases with new features, improvements to existing features, and bug fixes.
Similar to other modern software companies, we develop software incrementally following Lean and Agile principles. We also practice continuous integration and continuous delivery (CI/CD). Where we’re unusual is in the combination of distinctly different software domains in a single system. This article takes a quick tour of how we manage the software development and release pipelines for each of the domains that make up our system. By the end, you’ll understand the different types of software we release and the release pipelines that carry that software from developers to customers.
Let’s start with the different software domains that make up the Sonos system.
We build and release embedded software for hardware devices. Internally we call this our Player software; it's the software that runs on our Sonos speakers on a Linux operating system.
We also build and release software for mobile devices, laptops, and desktops. We call this our Controller software and it runs on iOS and Android devices as well as Mac and Windows machines.
Finally, we build and release micro-services that run in the cloud. We call this our Services software and currently we use Amazon Web Services (AWS) as our cloud provider.
Each of these domains has a unique software development culture and set of practices that offer opportunities and challenges for our release pipelines.
Now that we’ve listed the different types of software that make up the Sonos system (Player, Controller, and Services), let’s discuss the release cadences we use for each.
On-demand releases for Services, a fixed schedule for Player and Controller
For our Services software, we tend toward an on-demand release cadence with the Service development teams able to release whenever they’re ready. This is usually one or two times during our two-week sprint cycle, although it can be more frequent as we approach key milestones or react to issues. There are a few key elements that make this release cadence possible.
First, the Services are micro-services that are designed to provide single business functions. This makes it easier to quickly build, test, and release changes with minimal risk to other services and the system.
Second, the Services are designed from the ground up as ephemeral cloud applications that allow for quick release, real-time monitoring, and either roll back or roll forward should issues arise.
Third, our Services teams have full ownership of their services with responsibility for both building and operating them. This means there's no distance between the service running in production, its operating health, and the development teams that are enhancing and fixing it.
So what’s the release cadence for our Player and Controller software? Well, here we use a release train model with a fixed release schedule. In general, we follow an 8 week (4 sprint) development and release cycle for major releases combined with a 2 week (1 sprint) cycle for minor releases. This allows us to work on and test large features in our major releases, and enhance and improve existing features in the minor releases that come out between our major releases.
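The fixed schedule above can be sketched as a simple calendar generator. This is an illustrative sketch only (the start date and function names are hypothetical, not part of Sonos's actual tooling): a major release every 4 sprints, with a minor release each sprint in between.

```python
from datetime import date, timedelta

SPRINT = timedelta(weeks=2)   # one 2-week sprint
MAJOR_CYCLE = 4 * SPRINT      # 8-week major release cycle

def release_calendar(start: date, majors: int):
    """Yield (date, kind) pairs for a fixed release train:
    a major release every 4 sprints, minors every sprint between majors."""
    for m in range(majors):
        major_day = start + m * MAJOR_CYCLE
        yield major_day, "major"
        for s in range(1, 4):
            yield major_day + s * SPRINT, "minor"

calendar = list(release_calendar(date(2020, 1, 6), majors=2))
```

With two major cycles, this produces eight releases: two majors eight weeks apart, each followed by three minors at two-week intervals.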
Software release and feature launch
Additionally, we’ve made a distinction between deploying a software release and launching a feature. We use feature toggles to turn features off at build time, or experimentation engines to slowly roll out changes at run time. This untethers feature development and launch from the release cadence. Teams can put partially complete features in the code base for a release, then toggle them off before we deploy it. This means teams can alpha test features, get feedback, and keep working on them without having to meet a pre-scheduled release date. The experimentation engines we use let teams develop alternative feature solutions, experiment with them, and, based on metrics and feedback, decide which works best for customers and roll that option out to everyone.
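A build-time feature toggle can be as simple as a lookup consulted wherever a feature surfaces. The sketch below is a minimal illustration; the feature names are hypothetical, and a real toggle system would typically load this configuration from the build rather than hard-code it.

```python
# Build-time feature toggles: features ship in the code base but stay
# hidden until they're ready to launch. Feature names are hypothetical.
TOGGLES = {
    "new_eq_ui": False,        # partially complete: off for this release
    "group_volume_v2": True,   # complete: launched
}

def is_enabled(feature: str) -> bool:
    # Unknown features default to off, so half-built code stays hidden.
    return TOGGLES.get(feature, False)

def render_settings() -> list:
    """Build the settings panels, including only launched features."""
    panels = ["volume", "alarms"]
    if is_enabled("new_eq_ui"):
        panels.append("new_eq")
    if is_enabled("group_volume_v2"):
        panels.append("group_volume_v2")
    return panels
```

Flipping a single entry in `TOGGLES` launches or hides a feature without touching the release itself, which is what decouples the two processes.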
OK, so we have Services going out as needed and we have Controller and Player software going out on a fixed cadence. What do the different release pipelines look like?
On the Service side, the release pipeline consists of several environments from Integration, to Test, Performance, Stage, and then Production. New Service releases get deployed to an environment, are tested, and if the tests pass, the release is promoted to the next environment in the pipeline.
The Integration environment is the first production-like environment where a new release is deployed to AWS and functional tests are run against it. This environment can be somewhat unstable as teams work out kinks in new service functionality and APIs. If the release passes the integration tests, it’s promoted to our Test environment.
Test is a more stable environment; it’s where teams work together to verify end-to-end functionality. If the release looks good in Test, it’s promoted to the Performance environment where it is performance tested. This is a special, separate environment where we replicate the production environment and load. This allows us to run load and duration tests on the service without disrupting the teams and helps ensure the service meets its service level targets in production.
After Performance, the service is deployed to Stage. Typically, teams deploy to Stage and Production in quick succession. We ensure that Stage has the same configuration as Production and that deployments to these environments are automated in the same way. Stage is where we do our final check of our deployment scripts and processes as well as our post-deployment verification steps. If those look good, we deploy to Production. We try not to leave new versions of our services on Stage for long without deploying them to Production, because a version that lingers on Stage means Stage has diverged from Production.
Player and Controller release
Like Services, the Sonos Player and Controller software have a release pipeline. However, instead of environments, we use branches and tester pools to define the pipeline and promote Player and Controller builds and releases from Integration to Production. The day we copy the code from one branch to the next is called a branch day. It’s an important day for our release managers, product managers, developers, quality engineers, and Beta team. Each time a release’s code moves up in the branch hierarchy, we get closer to production users.
The first branch in our pipeline is our Integration branch. We expect this branch to be kept at Alpha quality level and we create Alpha releases from it every sprint. These releases go to internal Alpha testers so we can try things out and collect feedback as early in the development process as possible. For us, Alpha quality means the feature can be partially implemented (maybe only one or two of the requirements are done and ready for feedback), but it must not cause major regressions in other features. We find these regressions through build tests, functional automation and manual testing, and Alpha tester feedback.
As mentioned earlier, the development period for Player and Controller software is about 4 sprints (sometimes more, sometimes less, depending on holidays and other company events, such as Hack Week). At the end of this period, we copy the code on the Integration branch to the next branch in our hierarchy—Mainline. We expect the code on this branch to be at least Alpha level. It’s on this branch and at this time that we start to define the release scope. Features that aren’t targeted for the next major release are toggled off. They can stay on and keep testing in the Integration branch, but they should be disabled or hidden on the Mainline branch. Each sprint, we create a release from this branch and send it to external Alpha testers. We get feedback on the new features, enhancements, and bug fixes in the release from testers and from system telemetry. We also look for regressions via automation and manual testing.
You might ask why we essentially have two Alpha branches in our Player and Controller release pipeline: Integration and Mainline. The answer lies in the different development requirements for various features or initiatives. Some changes may need a lot of development and testing time. These features may touch fundamental pieces of our player software and require lots of regression testing. Or, they may be large features that have an impact across our Player, Controller, and Services software and require a long development cycle. Using the Integration branch and our internal Alpha releases allows the teams working on these features to integrate their changes with other changes and to get feedback on their implementation early and regularly over a long period of time. On the other hand, some features are small and don’t require extensive development and testing. Imagine a simple change in one of our controller applications. These changes can be developed, tested, and sent out for feedback more quickly on the Mainline branch without having to wait an extra 4 sprints on the Integration branch.
Returning to our Player and Controller pipeline, after another 4 sprints, the development period ends again and the code is copied from our Mainline branch to our Stage branch. At this point, the code is expected to be at Beta quality. This means all the features for the release are complete. A feature is complete when all of the requirements for the current iteration of the feature are implemented. Features that aren’t complete are toggled off. During the Beta period, we run a set of manual regression tests that cover the main user scenarios. We try to only address bugs during the Beta period to make sure our manual and automated test results are valid (although we allow for some exceptions when necessary). Just as with other branches, every sprint we create Beta releases from the Stage branch and send these out to Beta testers for feedback and to gather telemetry. We’re on the lookout for major bugs that must be fixed prior to the release.
On the next branch day, we move the code from Stage to our final branch, called Production. At this point, the release should be at Release Candidate (RC) level. That means there are no major or higher bugs associated with the features in the release and the code is ready for deployment to the general public. We typically spend one sprint in the RC phase. During this time, we complete our launch preparation including submitting our controller applications to the various app stores (primarily the Apple App Store, Google Play Store, and the Amazon store). When the controller applications are approved in the stores and all other preparation activities are complete, we deploy the release to our general availability (GA) pool. After that, the release code stays on the Production branch. We make minor releases that contain bug fixes and very small feature enhancements off the code in the Production branch. These releases are created each sprint and are considered release candidates as soon as they’re built. We send these to Beta testers to get telemetry data and we submit them to the app stores.
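The branch hierarchy and the quality level expected on each branch can be summarized in a small table, with branch day modeled as moving code one step up the ladder. This is a descriptive sketch of the process above, not actual release tooling.

```python
# Branch hierarchy in promotion order, with the minimum quality level
# expected on each branch, as described in the release process.
BRANCH_ORDER = ["Integration", "Mainline", "Stage", "Production"]
BRANCH_QUALITY = {
    "Integration": "Alpha",              # internal Alpha releases
    "Mainline": "Alpha",                 # external Alpha releases
    "Stage": "Beta",                     # Beta releases
    "Production": "Release Candidate",   # RC, then general availability
}

def branch_day(current: str) -> str:
    """Return the branch the code is copied to on the next branch day."""
    i = BRANCH_ORDER.index(current)
    if i == len(BRANCH_ORDER) - 1:
        raise ValueError("Production is the final branch")
    return BRANCH_ORDER[i + 1]
```

Each call to `branch_day` corresponds to one copy up the hierarchy, so code reaches Production three branch days after landing on Integration.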
So in summary, we have the Services release pipeline with Integration, Test, Performance, Stage, and Production environments, and the Player and Controller release pipeline with Integration, Mainline, Stage, and Production branches. We have continuous integration and deployment for all software, with Services, Players, and Controllers built whenever there are changes (CI builds). We also build our software each night (nightly builds), which allows for more extensive and time-consuming automation than is possible on the CI builds. From these nightly builds we create releases and push them through their respective pipelines. We typically test the services with mock players, and we typically test the controllers and players against the production services. This helps us emphasize getting new services code to production and keeping backward compatibility. We have a similar emphasis on the player and controller side by encouraging teams to get their changes into the Integration or Mainline branches even before they’re totally complete so that we can start getting real-world feedback via tester reports and telemetry.
Putting it All Together
Some interesting numbers
As we wrap up our tour, it might be fun to think about some numbers.
We deploy new service releases to production about once a day.
We deploy new Player and Controller releases to Alpha, Beta, and Production once every sprint.
Each sprint, we typically have at least 4 different releases going to our different test pools: internal Alpha, release Alpha, release Beta, and general availability.
Each of these releases represents a different major release and code base.
We’re constantly collecting operating telemetry from our services and our releases. We’re also collecting feedback from our Alpha and beta testers. Finally, we’re running tests and identifying bugs through internal automated and manual testing on our CI and nightly builds and on our releases. This lets us find issues early, triage them, and address the key ones before we deploy to production or to general availability. The iterative development, constant integration, and frequent deployment helps us deliver new experiences to our users and to keep the quality of the system high.
What’s next for us? Well, we like our pipelines, but we need to improve efficiency. Here are some things we’re working on:
Tool improvements in terms of our source control system, our continuous integration service, and our workflow automation.
Improving our underlying build scripts to make builds faster.
Standardizing our automation suites and breaking them down into more manageable chunks that can be run to definitively establish our quality levels and speed our pipeline promotion.
Standardizing our build time toggling to make it easier for teams to toggle features on and off.
Lastly, and probably most long term, decoupling our shared player and controller code so that these elements of our systems can develop more independently.
These next steps and improvements are directly tied to delighting our customers. When they buy a Sonos speaker, we aim to give them great hardware that works as part of a larger system (all of the speakers in the house, plus a great control app, plus cloud services) that keeps evolving and improving over time.