Building & Automating Cloud Services Performance Engineering Infrastructure at Sonos
Manager - Software Performance Quality Engineering
At Sonos, quality is a primary focus and performance testing plays a vital role in reducing the risk of cloud services failing under load by consistently deploying quality services to production at a faster pace. Performance automation provides adaptive execution at a high scale with the same real-time quality check.
Performance Engineering is a specialized team that strategizes around infrastructure and load rather than simply scaling indefinitely to sustain that load. Through our planning process, we have made investments up front and have repeatedly been able to cost down our services even though our load has grown over the years.
Our team plans, implements, and builds an infrastructure that can be consistently used for test iterations allowing for predictability, long-term scaling, and cost savings.
Why is this needed?
When a new service is developed on the cloud or an existing service provides an update, the Performance Engineering team is responsible for testing it at scale. We write scripts, test the service, document the results, and review them with the teams as quickly as possible.
To deliver the best results, we always ask:
How can this be built?
What are we trying to achieve?
As we answer these questions, we’re super clear that we’re going to build this automation in house to reduce costs in tooling and infrastructure, and so we own it and can reuse and repurpose it.
How can this be built?
The performance testing environment is now readily available to any developer or QA engineer, not only to performance test engineers. Previously, only a performance test engineer could run these tests, and it took weeks to wrap up a clean test, review the results with the teams, and get the go-ahead to migrate the cloud service to production.
How do I define the load model for the test execution?
How and when do I know to publish errors?
Which test runs are to be published?
Which results are to be documented, and who needs to be notified?
Is the right load model being executed?
Is the test running with the right parameters?
Are the logs available during and after the test execution for validation and debugging?
Were the results published to the right engineer, and was the engineer notified about the test execution phases?
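The questions about the load model and its parameters can be made concrete with a small sketch. The following is only an illustration: the `LoadModel` class, its field names, and the validation rules are assumptions made for this example, not the schema Sonos uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LoadModel:
    """Hypothetical load model; field names are illustrative only."""
    service: str
    target_rps: int         # steady-state requests per second
    ramp_up_seconds: int    # time allowed to reach target_rps
    duration_seconds: int   # total test length, including ramp-up
    containers: int         # number of load-generator containers

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the model is usable."""
        problems = []
        if not self.service:
            problems.append("service name is empty")
        if self.target_rps <= 0:
            problems.append("target_rps must be positive")
        if self.containers <= 0:
            problems.append("containers must be positive")
        if self.ramp_up_seconds < 0:
            problems.append("ramp_up_seconds must not be negative")
        if self.duration_seconds <= self.ramp_up_seconds:
            problems.append("duration_seconds must exceed ramp_up_seconds")
        return problems


good = LoadModel("example-service", target_rps=500, ramp_up_seconds=60,
                 duration_seconds=600, containers=5)
bad = LoadModel("example-service", target_rps=0, ramp_up_seconds=300,
                duration_seconds=120, containers=5)

print(good.validate())  # no problems reported
print(bad.validate())   # two problems reported
```

Checking a model like this before a run starts is one way to answer "is the test running with the right parameters?" without waiting for the test to finish.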
What did we achieve?
The entire infrastructure has to be built on the cloud, which is a challenge. The performance environment is a replica of the production environment so real world scenarios can be replicated. This gives an apples-to-apples comparison for all the tests run. The scenarios involve end-to-end tests across multiple microservices integrated with external partner services.
In a pre-automation world, engineers were required to file a ticket, describe the scenarios, and wait for the ticket to be picked up and the test to be modelled and executed. So, the infrastructure needs to be resilient and adaptable to critical scenarios, which are tested ahead of time as worst-case scenarios on the assumption that the same result will occur in the production environment. Test runs need to be completed quickly and the results must be error free.
Test scripts are built on Apache JMeter and Gatling. These scripts run in containers to scale up the requests sent to a microservice. The containers run on AWS Fargate and Kubernetes, and Jenkins orchestrates the test scripts. After automation, all of this is done with a single-click job, which reduced the manual effort by 90%.
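As a rough illustration of how load might be spread across containers, the sketch below splits a total thread count evenly and builds a non-GUI JMeter command for each container. The helper names, the file names, and the `threads` property are assumptions for this example; only the `-n`, `-t`, `-l`, and `-J` options are standard JMeter CLI flags, and the actual Sonos jobs may work differently.

```python
from typing import List


def per_container_threads(total_threads: int, containers: int) -> List[int]:
    """Split total_threads across containers as evenly as possible."""
    if containers <= 0:
        raise ValueError("containers must be positive")
    base, extra = divmod(total_threads, containers)
    # The first `extra` containers each take one additional thread.
    return [base + 1 if i < extra else base for i in range(containers)]


def jmeter_command(test_plan: str, results_file: str, threads: int) -> List[str]:
    """Build a non-GUI JMeter invocation as an argument list.

    -n runs without the GUI, -t names the test plan, -l names the results
    file, and -J sets a JMeter property. The property name `threads` is
    hypothetical: the test plan would have to read it via ${__P(threads)}.
    """
    return ["jmeter", "-n", "-t", test_plan, "-l", results_file,
            f"-Jthreads={threads}"]


shares = per_container_threads(1000, 3)
print(shares)  # [334, 333, 333]
for index, threads in enumerate(shares):
    # Hypothetical file names; each container writes its own results file.
    print(" ".join(jmeter_command("plan.jmx", f"results-{index}.jtl", threads)))
```

Returning the command as a list rather than a single string keeps it safe to pass to a process launcher without shell quoting issues.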
What did we achieve?
Monitoring, alerting, and controlled repetition of tests are the main reasons for building automation. This gives engineers (developers and QA engineers) peace of mind so they can observe live interaction of the test runs without constantly pulling information from the cloud to the local computer.
Grafana is the main tool for monitoring and alerting in Sonos performance test automation. It monitors real-time errors and successes, with detailed log descriptions at the client (container) level as well as the microservice level. It also provides health checks of the client containers, which are mandatory when running heavy load tests.
Grafana trends for the various test runs are processed in real time along with pass/fail metrics. Color coding shows the reasons for failure, which removes the need for stakeholders to review lengthy documentation.
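A pass/fail metric with reasons for failure can be sketched in a few lines. The thresholds, the `RunSummary` fields, and the `evaluate` function below are assumptions chosen for illustration; they are not the rules behind the Sonos dashboards.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical summary of one test run."""
    total_requests: int
    failed_requests: int
    p95_latency_ms: float


def evaluate(run: RunSummary,
             max_error_rate: float = 0.01,
             max_p95_ms: float = 500.0) -> Tuple[str, List[str]]:
    """Return ("PASS" or "FAIL", reasons); thresholds are illustrative."""
    reasons = []
    if run.total_requests == 0:
        reasons.append("no requests were recorded")
    else:
        error_rate = run.failed_requests / run.total_requests
        if error_rate > max_error_rate:
            reasons.append(
                f"error rate {error_rate:.2%} exceeds {max_error_rate:.2%}")
    if run.p95_latency_ms > max_p95_ms:
        reasons.append(
            f"p95 latency {run.p95_latency_ms:.0f} ms exceeds {max_p95_ms:.0f} ms")
    return ("FAIL" if reasons else "PASS", reasons)


print(evaluate(RunSummary(10000, 50, 320.0)))
print(evaluate(RunSummary(10000, 300, 750.0)))
```

Keeping the reasons alongside the verdict is what lets a dashboard show why a run failed instead of only that it failed.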
What did we achieve?
Notifications for client-side test runs are available only through automation. They weren't available previously: a manual check at the end of the test run was required to determine whether it had completed, and gigabytes of logs had to be analyzed to identify issues. This is no longer necessary. Grafana provides timely notifications to Slack and email for errors, test run completions, and the results of completed tests. This has made multitasking easy for the engineers, as notifications report on the tests at regular intervals.
This wasn’t the first out-of-the-box solution we came up with. We iterated on our automation procedures as we moved from basic CLI to Jenkins and other automated tools.
Jenkins was our first approach to moving from the initial CLI to automation. It solved the basic necessities at the time, such as triggering and scheduling jobs, on-the-fly parameterization, and report generation. But over the months we came across problems that made us start thinking about what was next for us:
Our tests generate on average 10-30 GB of logs per day, which are needed for debugging but are a huge volume for Jenkins containers.
The HTML report archive took up a lot of disk space on the containers, and the information in it was basic and restricted to the client side. On many days these logs caused problems for the Jenkins containers, crashing them and occupying shared resources.
Later we used Datadog, Dynatrace, and others, but they all restricted development. This is primarily because each of these solutions solved a single specific use case or current scenario rather than succeeding across multiple scenarios, which differ depending on the microservice and the tests handled. We were looking for more of a “plug n play” solution as new services with new technologies are built.
At this point we paused, took a step back, and started thinking about how to build an in-house automation infrastructure using open source tools (considering incremental costs with others) which is customizable as per teams' requirements. We got a stable automation environment through day-to-day learning over several months.
This automation is still new, and we’re making the best of the feedback we receive as new services are built for testing.
Automate High Availability scenarios and multi-region testing capabilities.
Replace Jenkins with self-service cloud APIs. This provides flexibility that Jenkins lacks today, and easy integration with other cloud architecture.
Evolve Performance Test Strategies as a single source of truth for all non-Production environments as part of CI/CD.