Tech Blog
Data Engineering
June 29, 2021

Automating Data Engineering and Data Discovery at Sonos

Bradford Littooy

Senior Data Engineer

Mark Morganstern

Senior Manager, Data Engineering

Julia Wilkins

Data Scientist

At Sonos, we use data to drive decisions around new product features, track key performance indicators, and help diagnose customer issues.

Sonos, by nature, is a complex product, consisting of three primary components:

  • Hardware (the physical speaker)

  • Embedded software (code running on the speaker and on the mobile devices used to control the Sonos system)

  • Fleet of cloud services (for example, Voice and Music Service Integrations)

With complex products comes complex data. To make an informed, data-driven decision, data across multiple components needs to be examined. This presents a challenge: the datasets must be well-documented and discoverable so a variety of teams can use them. Additionally, the user of a dataset is often not the person or team who generated the data, leading to an inherent barrier in using that dataset.

As time went on, managing the number of datasets and the sheer scale of data became very challenging for our small team of Data Engineers. The number of datasets we were maintaining continued to grow, and our data suffered from a lack of robust documentation and discoverability. This made it difficult for most of our users to use the valuable information in an effective and efficient manner.

We set a goal to develop an efficient way to onboard new data as it was generated and expose it to data consumers (referred to as users in this post), with automated processes and less direct involvement from the Data Engineering team.

To address these challenges, we present Data Disco: a multi-purpose data catalog built completely in-house. Data Disco is designed with two goals in mind:

  • Make data discoverable, understandable, and queryable regardless of its source, who or what generated it, and who or what is trying to use it.

  • Leverage Data Disco to automate our data engineering processes that create and manage these datasets.

Creating a Data Catalog

Before diving into the solution, let’s look at the data lifecycle at Sonos, from generation to consumption.

Data Lifecycle

Prior to the creation of Data Disco, we struggled to scale Step 4 (exposing data) and Step 5 (consumption of data). Both areas required data engineering involvement and tacit knowledge from the team or person who created the dataset (whom we call the data expert), before a dataset could actually be used.

We started by addressing Step 4: exposing our datasets. Prior to developing the data catalog, every time a new dataset was created, a number of manual and automated tasks were required:

  1. Learn the schema (or structure of the data) for that dataset.

  2. Manually create a Snowflake View/Table and/or an Elasticsearch Index to expose the data.

  3. Manually QA the newly created Snowflake or Elasticsearch objects.

  4. Build documentation to describe the dataset, so others could consume it.

Automating Data Engineering

At Sonos, we have hundreds of unique datasets. New datasets are generated daily, and existing datasets are constantly changing as the demand for data changes. Our existing data engineering process wasn't scaling, and it quickly became impossible to manage the rapidly growing and changing data. We began by building the Schema Crawler, a tool that automatically identifies new datasets and their schemas, giving us the metadata we needed to start cataloging our datasets. Using this metadata from the Schema Crawler, we then built a tool that automatically creates Snowflake views and Elasticsearch Indexes to expose the data, thus eliminating the need for manual data engineering.

The Schema Crawler

The schema crawler was built to identify and store the schema for any dataset entering our system. We built the crawler in Python and then created a custom distributor to scale the crawler on AWS Batch across as much data as we needed. At peak, we can run the crawler across hundreds of unique datasets, comprising 5 TB+ of data and billions of records in a few minutes for an extremely low cost relative to pre-built solutions such as AWS Glue. We also have the flexibility of tracking metadata that is tailored to how Sonos instruments telemetry, which helps us create solutions specific to Sonos. For example, we can track minimum and maximum software build versions for a given data source, which lets us track schemas relative to a specific build.

For each dataset, we identify the following:

  • All the fields or columns in the dataset and their data types.

  • Up to 200 unique values per column/field.

  • The minimum and maximum software build versions in which the dataset and schema was seen. We run multiple versions of software in the field and different versions of software can send data with different schemas as we iterate on data requirements.
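The crawling logic described above can be sketched as a fold over JSON records that tracks a type, sample values, and a build-version range per field. This is a minimal illustration, not the production implementation; names and the metadata shape are assumptions:

```python
# Illustrative sketch of the Schema Crawler's core loop. Field names,
# the metadata shape, and the type table are hypothetical.
TYPE_NAMES = {bool: "Boolean", int: "Int", float: "Float", str: "String"}
MAX_EXAMPLES = 200  # cap on unique example values stored per field

def crawl(records):
    """Fold a stream of JSON records into per-field schema metadata."""
    schema = {}
    for rec in records:
        build = rec["build_version"]
        for field, value in rec.items():
            meta = schema.setdefault(field, {
                "type": TYPE_NAMES[type(value)],
                "examples": set(),
                "min_build": build,
                "max_build": build,
            })
            if len(meta["examples"]) < MAX_EXAMPLES:
                meta["examples"].add(value)
            meta["min_build"] = min(meta["min_build"], build)
            meta["max_build"] = max(meta["max_build"], build)
    return schema
```

Run against the two-record example later in this post, this sketch reproduces the same min/max build tracking: a field seen only in build 2 gets a build range of (2, 2).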

Let’s look at an example of the metadata extracted by the Schema Crawler. Here’s the input data:

Dataset # 1

{
  "action_name": "volume down",
  "success": true,
  "new_vol": 3,
  "old_vol": 7,
  "build_version": 2
},
{
  "action_name": "volume up",
  "success": false,
  "new_vol": 10,
  "old_volume": 3,
  "build_version": 1
}

The crawler identified the schema as:

Schema Output

Field Name     Data Type   Example Values                Min Build   Max Build
-------------  ----------  ----------------------------  ---------   ---------
action_name    String      ("volume down", "volume up")      1           2
success        Boolean     (true, false)                     1           2
new_vol        Int         (3, 10)                           1           2
old_vol        Int         (7)                               2           2
old_volume     Int         (3)                               1           1
build_version  Int         (1, 2)                            1           2

The crawler saves the schema into the database and continues to update it as new fields are added or no longer seen from that data source. Notice in the example above that the field old_vol was introduced starting in build version 2, and old_volume stopped being reported after build version 1, which is important metadata for tracking changes over time.

Schema Abstraction

Once the schema was stored, we wanted to create an abstraction layer on top of it, so we gave every field a “Friendly Field Name.”

This abstraction allows us to:

  • Standardize schemas across datasets. For example, if one dataset contains the field action_name and another dataset has a field called action_id, but the two fields have the same definition, we want both of these datasets to use the same name for that field. In this case, we create a friendly field called ACTION_TYPE and map both fields to that name. Publicly, users will see this field as ACTION_TYPE in both datasets.

  • Standardize schemas across version changes, without requiring any downstream modifications (for example, old_volume → old_vol mentioned above), which we can now seamlessly map across versions.

  • Use shorter names and abbreviations at the source to save on data transfer costs, while still allowing for descriptive names.

  • Use the friendly field name as a way to control what is exposed to the end user: only fields that have friendly names are exposed. We encourage schema standardization by automatically assigning friendly names to fields that match our common schema.
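The abstraction layer can be pictured as a simple mapping from raw source field names to standardized friendly names, with unmapped fields hidden from users. A minimal sketch (the mapping entries here are illustrative, borrowing the examples above):

```python
# Hypothetical sketch of the friendly-field abstraction: raw source
# fields map onto standardized public names; unmapped fields stay hidden.
FRIENDLY_FIELDS = {
    "action_name": "ACTION_TYPE",      # standardized across datasets
    "action_id": "ACTION_TYPE",        # different source name, same meaning
    "old_volume": "OLD_VOLUME_LEVEL",  # name used before build version 2
    "old_vol": "OLD_VOLUME_LEVEL",     # name used from build version 2 on
}

def exposed_schema(raw_fields):
    """Return the public schema: only friendly-mapped fields are exposed."""
    return {FRIENDLY_FIELDS[f] for f in raw_fields if f in FRIENDLY_FIELDS}
```

Because both old_volume and old_vol resolve to the same friendly name, a version change at the source is invisible downstream.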

Once we identified the schema, we could automatically build objects, such as views in Snowflake or Indexes in Elasticsearch, and eliminate manual data engineering previously required.

Automating Snowflake View Creation

In Snowflake, we decided to store all of our data as raw JSON and apply the schema after the fact using metadata from our data catalog. Our schemas can change often, and we don’t enforce data types on ingestion. Storing our data as JSON, rather than writing directly to defined tables, gives us this flexibility by providing a view layer between the raw data and what the user sees. It also lets us hide columns from the user if they contain PII, or create calculated fields on the fly to enhance the user's experience. We built a simple adapter in Python that could automatically construct a view, shown in the example view creation SQL below:

Example Automated View Generation

CREATE OR REPLACE VIEW DATASET_1 AS (
SELECT
   TO_VARCHAR(RAW_JSON['action_name'])            AS ACTION_TYPE,
   TO_BOOLEAN(RAW_JSON['success'])                AS SUCCESS,
   TO_INTEGER(RAW_JSON['new_vol'])                AS NEW_VOLUME_LEVEL,
   CASE WHEN TO_INTEGER(RAW_JSON['build_version']) < 2
           THEN TO_INTEGER(RAW_JSON['old_volume'])
        ELSE TO_INTEGER(RAW_JSON['old_vol'])
   END                                            AS OLD_VOLUME_LEVEL,
   TO_INTEGER(RAW_JSON['build_version'])          AS BUILD_VERSION
FROM DATASET_1_RAW
);
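The adapter itself can be reduced to a small template renderer over catalog metadata. Here is a minimal sketch, assuming a simple (raw name, data type, friendly name) metadata shape; function and column names are illustrative, not the production adapter:

```python
# Hypothetical sketch of the Snowflake view adapter: render view DDL
# from catalog metadata. The metadata shape is an assumption.
CASTS = {"String": "TO_VARCHAR", "Boolean": "TO_BOOLEAN", "Int": "TO_INTEGER"}

def build_view_sql(view_name, source_table, fields):
    """fields: list of (raw_name, data_type, friendly_name) tuples."""
    columns = ",\n".join(
        f"   {CASTS[dtype]}(RAW_JSON['{raw}']) AS {friendly}"
        for raw, dtype, friendly in fields
    )
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS (\nSELECT\n"
        f"{columns}\nFROM {source_table})"
    )
```

When the crawler detects a schema change, regenerating and re-running this DDL is enough to keep the view current, with no hand-written SQL.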

Automatic Aggregates

For large and commonly used datasets, we took the view generation a step further. Using the field values and data types for a given field, every field is given a classification:

  • Dimension–Low cardinality field such as action_name

  • Measure–Numeric values that can be summed or averaged (for example, music play seconds)

  • Time–The source of truth timestamp for a given log or event

  • Unique Identifier–Fields that are unique and shouldn’t be aggregated

Using the classifications above, we automatically aggregate the data at hourly and daily time grains into tables. We group by Dimensions and average/sum the Measure fields over the Time field. This further automates our data engineering work, allowing us to create highly performant datasets on top of the raw data with very little manual work.
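Given those classifications, deriving an aggregate query is mechanical. A sketch of how the hourly grain could be generated (names and the exact SQL shape are illustrative assumptions):

```python
# Hypothetical sketch: derive an hourly aggregate query from field
# classifications (Dimension / Measure / Time). Names are illustrative.
def aggregate_sql(table, classified):
    """classified: dict mapping field name -> 'Dimension'|'Measure'|'Time'."""
    dims = [f for f, c in classified.items() if c == "Dimension"]
    measures = [f for f, c in classified.items() if c == "Measure"]
    time_col = next(f for f, c in classified.items() if c == "Time")
    select = ([f"DATE_TRUNC('HOUR', {time_col}) AS HOUR"] + dims
              + [f"SUM({m}) AS {m}_SUM" for m in measures])
    group_by = ["HOUR"] + dims
    return (f"SELECT {', '.join(select)} FROM {table} "
            f"GROUP BY {', '.join(group_by)}")
```

The daily grain is the same query with a different truncation unit; Unique Identifier fields are simply excluded from both the select and the group-by.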

Automatic Elasticsearch Indexes

We also commonly store data in Elasticsearch and expose it via Kibana for more real-time analytics. We scaled out our solution and built an adapter that could read the same metadata from our data catalog, and translate that metadata into an Elasticsearch schema:

Example Automated Elasticsearch Schema

{
  "success": {"type": "boolean"},
  "action_name": {"type": "keyword"},
  "old_vol": {"type": "integer"},
  "new_vol": {"type": "integer"},
  "build_version": {"type": "keyword"}
}
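The Elasticsearch adapter is essentially a type translation over the same catalog metadata. A minimal sketch, assuming the catalog's type names and a per-field override for cases like build_version above (stored as an Int but indexed as a keyword); the type table and function names are assumptions:

```python
# Hypothetical sketch of the Elasticsearch adapter: translate catalog
# data types into an index mapping. The type table is an assumption.
ES_TYPES = {"String": "keyword", "Boolean": "boolean", "Int": "integer"}

def es_mapping(fields, keyword_overrides=()):
    """fields: dict of field name -> catalog data type."""
    return {
        name: {"type": "keyword" if name in keyword_overrides
               else ES_TYPES[dtype]}
        for name, dtype in fields.items()
    }
```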

With this automation, we eliminated data engineering involvement for standard object creation in both Snowflake and Elasticsearch. Prior to this, we managed separate schemas for each database, which was quite a burden on our team and required duplicate onboarding efforts for data that lived in both Snowflake and Elasticsearch. Now, if a new database or technology arises, we can simply build a new adapter for that database and deploy objects in the same automated fashion. Once we automated our object creation, we then turned our focus to Documentation and Discovery, in an effort to aid Consumption (Step 5  in the Data Lifecycle diagram).

Documentation and Discovery

Automating object creation took a major weight off of our shoulders and quickly began to save hundreds of hours in tedious engineering work, but we still had a data documentation and discovery problem. Users could access the data, but understanding which dataset to access and what the data meant was still cloudy and problematic. To solve this, we created a webapp in which we crowdsourced documentation and enabled discovery using the metadata from our schema crawler and internal consumer-provided documentation. As part of this webapp, we built a fully standalone API using Python and Flask, so other applications could interact with it. We built the front end using React and hosted the app on an in-house Kubernetes cluster.

Documentation 

To address the issues with consumption, we added the following capabilities:

  • Definitions for each dataset, including the lineage on how this dataset was generated.

  • Definitions for each field (and if applicable, for  field values) in that dataset, so users know what fields mean and how to query them.

  • Tags for classification and searching.

  • Data experts or owners for that dataset—contacts for more information or if there was a problem with the data source.

  • Data source popularity–which data sources are most commonly used.

We try to automate as much of the documentation as possible. Common fields and standard objects are automatically given definitions. For the remaining documentation, we crowdsource it and make it part of the onboarding process. 

For a dataset to be usable, we require its core fields to be friendly-field mapped and documented. We also strongly recommend additional metadata, such as tagging and assigning data experts, to make the dataset more usable and valuable to others at Sonos.

We structured our app so that every dataset has a “homepage” where all relevant information and configuration about that dataset live.

Discovery

We took a twofold approach to discovery: a low-intent search for users who had only a vague idea of what they were looking for, and high-intent discovery for those who knew exactly which dataset they needed but wanted clarification or documentation for it.

For low-intent searches, we leverage metadata from our catalog, including user-provided descriptions, tagging, field names, and of course, the dataset name to curate a list of most relevant datasets based on the user's search.

For high-intent searches, we provide fast access to a dataset’s homepage via the landing page.
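A low-intent ranking over catalog metadata can be sketched as a weighted term match across dataset names, tags, fields, and descriptions. The weights and scoring below are illustrative assumptions, not our production ranking:

```python
# Hypothetical sketch of low-intent search: score each dataset by
# weighted term matches across its catalog metadata.
WEIGHTS = {"name": 3.0, "tags": 2.0, "fields": 1.5, "description": 1.0}

def rank(query, catalog):
    """catalog: list of dicts with 'name', 'tags', 'fields', 'description'."""
    terms = query.lower().split()
    def score(ds):
        total = 0.0
        for attr, weight in WEIGHTS.items():
            text = " ".join(ds[attr]) if isinstance(ds[attr], list) else ds[attr]
            total += weight * sum(t in text.lower() for t in terms)
        return total
    return sorted(catalog, key=score, reverse=True)
```

Dataset popularity, mentioned earlier, is a natural tie-breaker for a ranking like this.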

New Architecture

As our data catalog evolved, it became an essential part of our data landscape. The catalog became central in automated object creation and provided users a single source of truth for discovery and documentation. As shown in the diagram below, the data catalog, now known as Data Disco(very), sits in the middle of our data lifecycle and has greatly accelerated Steps 4 and 5 in the flow.

Future Work

Today, teams across the software organization actively use Data Disco for both data discovery and data engineering automation. We finished version 1.0 of the catalog after about two years of engineering effort, including about seven months of development effort on our front-end webapp.

Over the coming months we’re hoping to expand the functionality on both fronts–discovery and documentation. Here are some features we’re exploring to ensure we can continue to extract value out of data and pursue our quest of being a data-driven organization:

  • Support for creating custom tables and datasets via our web app, expanding on the support we currently have for creating automatic aggregates.

  • Illustrate relationships between datasets, for example–recommended joins.

  • Build out more adapters for new databases in addition to Elasticsearch and Snowflake.

  • Integrate IDL code generation at services generating logs and telemetry that automatically interacts with our data catalog and onboard the data with enriched metadata associated with it. This would eliminate the need for manual mapping and documentation after the data has been accepted into our platform.


© 2021 by Sonos, Inc.
All rights reserved. Sonos and Sonos product names are trademarks or registered trademarks of Sonos, Inc.
All other product names and services may be trademarks or service marks of their respective owners.