Feature Generating Framework
Senior Data Engineer, Product Partnering
At Sonos, we face at least three defining dimensions of big data: volume, variety, and velocity. The Product Data Engineering team has to think outside the box to provide efficient ways to work with the company's complex IoT data.
Every day, when people use their Sonos products and opt in to data sharing, a wide range of events used to enhance the user experience flows from their devices to us.
Raw events are typically ingested into our cloud warehouse as packed, semi-structured JSON documents. These documents are partially unpacked and stored in warehouse tables in real time. Each event type is usually stored in one large table, which over time grows to hundreds of billions of rows.
Each individual event is described by a large number of attributes representing a variety of characteristics; it is not uncommon for an event to have hundreds of attributes. In addition, there are thousands of event types, in many cases organized hierarchically as so-called nested events. This kind of structure produces even more data when flattened.
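Flattening is straightforward to sketch: a nested event expands into one dotted column name per leaf attribute, so the row grows wider with every level of nesting. A minimal illustration in Python, assuming a purely hypothetical event shape:

```python
import json

def flatten_event(event, parent_key="", sep="."):
    """Recursively flatten a nested event dict into dotted column names.
    The event structure here is hypothetical, for illustration only."""
    flat = {}
    for key, value in event.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_event(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

raw = json.loads('{"event": "play", "device": {"model": "One", "sw": {"ver": "14.6"}}}')
print(flatten_event(raw))
# {'event': 'play', 'device.model': 'One', 'device.sw.ver': '14.6'}
```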
Sometimes raw data may also contain human- or device-generated errors. The complexity and granularity of the events make it difficult to extract information and isolate trends from raw data, and subsequently to create knowledge and value to support decision-making processes.
At a high level, data contains both signal and noise. Large datasets can carry a considerable amount of noise, which can compromise performance in daily analytical tasks. There are two broad perspectives on mining data and extracting value:
Use complex, often poorly explainable algorithms
Engineer a library of smart features that will carry the signals
The first approach is focused on the use of complex algorithms, so the processes and results are difficult to explain. Additionally, our event data is minimally preprocessed, or not preprocessed at all. Even when it is, the preprocessing is highly customized and does not allow the data to be reused or repurposed.
The second approach places the focus on smart use of the data. It uses carefully crafted features as inputs, which allows simpler algorithms and yields highly explainable processes and results. In some sense, this is a more data-centric approach.
A quick reminder of what a feature is:
A characteristic, property or attribute extracted from raw data
Preferably in base units
For example: count, mean, median, standard deviation, variance
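As a minimal sketch, such base-unit features can be computed with nothing more than the standard library; the input values below are made up for illustration:

```python
import statistics

def base_features(values):
    """Compute base-unit features from a list of raw numeric values."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),        # sample standard deviation
        "variance": statistics.variance(values),  # sample variance
    }

print(base_features([2, 4, 4, 4, 5, 5, 7, 9]))
```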
Let us take a brief look at the Analytics Lifecycle:
In the image above, we can identify three phases which are directly related to data preparation, and notice that these phases take at least 37.5% of an analytics project, which is a pretty large portion of the whole process.
If we emphasize extracting the signals from data, we provide more flexibility for downstream users.
A few key benefits of well designed features:
Lower resource usage
A crucial benefit of this approach is that these generic smart features can play a multi-purpose role and be used as inputs for predictive modeling and/or any other analytics task at hand.
While this requires more development and investigation work, it provides flexible and reusable data structures that can be combined and tailored in a seemingly inexhaustible manner.
We decided to build a library of customizable, plug-and-play features using metadata-driven, highly automated processes: the Feature Generating Framework (FGF).
The idea we decided to follow is that high-quality data can be more powerful than the algorithm.
What does the feature engineering process look like?
Finding patterns that lead to feature creation (Exploratory Data Analysis)
Deciding what features to create
Testing the impact of the identified features on the problem (Analytics Lifecycle)
Monitoring the quality of the features and improving features (results driven process)
What is the Feature Generating Framework?
The FGF is a custom-built infrastructure which generates metadata-driven processes of feature creation. This infrastructure allows us to performantly and systematically reduce a body of data into smaller parts or views that yield more information.
It offers a streamlined way to create bite-sized data points that are trustworthy, accessible and understandable.
Most importantly, it automates the approach of how the features are extracted from the data so that the feature generation is performed in a systematic and organized manner. That’s why we call it a declarative feature engineering framework; it uses a parameterized approach to define what to generate and how, and allows highly configurable quantitative or qualitative output features. The input parameters are saved in the metadata model to provide dynamic generation of data wrangling processes. The features are organized in a dimensional model to offer the best performance when selecting data bites for analysis and/or predictive modeling.
The input parameters (metadata) are used and reused to assemble DDLs, DQLs and DMLs to perform the following:
Create data structures
Load the data
Create and start the jobs which drive data loading processes on schedules
Monitor data quality
Export data quality metrics for continuous visual data quality monitoring (DataDog, Tableau)
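While the framework's actual code is internal, the assembly step can be sketched as simple templating over metadata parameters. All table, column, and parameter names below are hypothetical:

```python
# Hypothetical metadata record describing one feature; field names
# are invented for illustration.
FEATURE_META = {
    "feature_name": "daily_play_count",
    "source_table": "raw_events",
    "key_attrs": ["household_id", "device_id"],
    "measure_expr": "COUNT(*)",
    "date_attr": "event_date",
}

def build_ddl(meta):
    """Assemble the DDL that creates the feature's fact table."""
    keys = ",\n    ".join(f"{k} VARCHAR" for k in meta["key_attrs"])
    return (
        f"CREATE TABLE IF NOT EXISTS fact_{meta['feature_name']} (\n"
        f"    {keys},\n"
        f"    {meta['date_attr']} DATE,\n"
        f"    {meta['feature_name']} NUMBER\n)"
    )

def build_dml(meta):
    """Assemble the DML that loads the feature from its data source."""
    keys = ", ".join(meta["key_attrs"])
    return (
        f"INSERT INTO fact_{meta['feature_name']}\n"
        f"SELECT {keys}, {meta['date_attr']}, {meta['measure_expr']}\n"
        f"FROM {meta['source_table']}\n"
        f"GROUP BY {keys}, {meta['date_attr']}"
    )

print(build_ddl(FEATURE_META))
print(build_dml(FEATURE_META))
```

Because every statement is derived from the same metadata record, changing a parameter (say, adding a key attribute) propagates consistently to the DDL, the load DML, and downstream monitoring.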
Any single table or complex SQL query which references multiple tables can be used as a data source for deriving a feature.
There is a one-to-one relationship between a data source and a feature, which allows for the implementation of deductive logic and, later, a bottom-up approach to combining features into custom datasets for analysis and/or predictive modeling.
Each data source includes a set of default key attributes which implement standardization and reusability with other in-house datasets.
The key attributes serve as consistent identifiers across all the internal data and can be used to stitch together datasets and produce rich high quality data.
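Conceptually, the stitching is an outer join on the shared key attributes. A toy sketch, with invented keys and feature names:

```python
# Two independently generated feature maps, keyed by the same
# (household_id, date) identifiers. All names are illustrative.
play_counts = {("hh1", "2024-01-01"): {"daily_play_count": 12},
               ("hh2", "2024-01-01"): {"daily_play_count": 3}}
volume_avgs = {("hh1", "2024-01-01"): {"avg_volume": 0.42}}

def stitch(*feature_maps):
    """Outer-join feature maps on their shared key tuples."""
    merged = {}
    for fmap in feature_maps:
        for key, feats in fmap.items():
            merged.setdefault(key, {}).update(feats)
    return merged

print(stitch(play_counts, volume_avgs))
```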
The feature generation process starts with new or existing parameters entered into the front-end UI. These are then processed by generic code which assembles SQL statements to either create a new feature or refresh an existing feature with newly arrived raw data.
The new set of raw data is cleaned, transformed, and loaded into a dimensional model. Once a feature is created, the data quality (DQ) phase starts. We monitor feature quality by applying custom-defined calculations and storing the resulting statistics. These statistics are imported into a cloud observability service where the historical trends are visualized over various time frames for anomaly detection and other DQ analytics.
Finally, we assemble merged datasets using the available metadata, adding new features to or removing existing features from custom datasets in an automated fashion. At this point, the raw features and the statistical features are available for data analysis and/or predictive modeling.
Let’s take a closer look at the “Load Data in DM” phase from the above process flow diagram. The FGF shapes and models the data and generates features into a dimensional model.
The dimensional model consists of facts, which include measures, and dimensions, which include descriptive data about the measures. The FGF dimensional model is used to organize and standardize features into data structures. These provide a variety of connectors used to merge features into complex, customized datasets. Quantitative features reside in fact tables and categorical features reside in dimensions.
The keys are ingrained into the tables during data transformation. The resulting tables are daily aggregates grouped by the key attributes. Each feature is time-stamped with a standard date attribute which can be customized based on rules.
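A minimal sketch of such a daily aggregate, assuming a made-up raw event schema:

```python
from collections import defaultdict
from datetime import datetime

# Illustrative raw events; the schema and field names are hypothetical.
events = [
    {"household_id": "hh1", "ts": "2024-01-01T08:00:00", "value": 2},
    {"household_id": "hh1", "ts": "2024-01-01T21:30:00", "value": 4},
    {"household_id": "hh2", "ts": "2024-01-01T09:15:00", "value": 1},
]

def daily_aggregate(rows):
    """Group raw events into daily aggregates keyed by (household_id, date),
    stamping each feature row with the standard date attribute."""
    agg = defaultdict(lambda: {"count": 0, "sum": 0})
    for row in rows:
        date = datetime.fromisoformat(row["ts"]).date().isoformat()
        key = (row["household_id"], date)
        agg[key]["count"] += 1
        agg[key]["sum"] += row["value"]
    return dict(agg)

print(daily_aggregate(events))
```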
The metadata store organizes input parameters into a normalized relational data structure. Metadata is used to support data-centric automation of the feature generation processes.
We use a dynamic programming approach where the code for each phase in the feature generation process (above diagram) is generalized and reused in conjunction with the metadata to create highly customized yet standardized data structures and transformations.
The metadata model is designed in a manner that allows:
Activation and deactivation of most FG processes.
For example: stop temporary data load jobs, stop generation of statistics, deactivate a feature and similar.
Metadata modifications which can affect a feature's content/data.
DQ statistics that can be extended or modified.
Most importantly, the metadata store is directly connected with the dimensional model via unique keys.
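Activation and deactivation, for instance, can be as simple as a flag in the metadata store that gates which jobs run on each schedule tick; the schema below is hypothetical:

```python
# Hypothetical metadata records; an 'active' flag gates which feature
# generation jobs run, without deleting any data or history.
feature_metadata = [
    {"feature_id": 1, "feature_name": "daily_play_count", "active": True},
    {"feature_id": 2, "feature_name": "avg_volume", "active": False},
]

def features_to_refresh(metadata):
    """Return the features whose load jobs should run on this tick;
    deactivated features are simply skipped."""
    return [m["feature_name"] for m in metadata if m["active"]]

print(features_to_refresh(feature_metadata))
# ['daily_play_count']
```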
Data Quality Monitoring
Data quality monitoring provides a glance at historical and current insights about the central tendency and distribution of the data/features. It supplies two kinds of DQ monitoring:
Feature values auditing
Data loading metrics
It calculates daily statistics for generated features, such as:
Mean, median, mode, standard deviation, variance, IQR, etc.
Daily data loading metrics for each fact:
Unique keys ratios
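Both kinds of metrics can be sketched in a few lines of standard-library Python; the feature values and key attribute below are illustrative:

```python
import statistics

def daily_dq_stats(values):
    """Feature-value audit: daily statistics for one feature."""
    q = statistics.quantiles(values, n=4)  # quartiles
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),
        "variance": statistics.variance(values),
        "iqr": q[2] - q[0],
    }

def unique_key_ratio(rows, key_attr):
    """Data-loading metric: distinct key values over total rows loaded."""
    return len({r[key_attr] for r in rows}) / len(rows)

print(daily_dq_stats([1, 2, 2, 3, 4]))
print(unique_key_ratio([{"k": "a"}, {"k": "a"}, {"k": "b"}, {"k": "c"}], "k"))
```

A sudden shift in any of these daily statistics, or in the unique-key ratio, is exactly the kind of signal the observability dashboards are built to surface.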
The feature statistics are saved in the data store and the reference keys are provided to the raw features as well as metadata.
DQ monitoring provides a wide range of feature statistics that are used to immediately detect any qualitative and quantitative shifts in data. We use it to remedy data issues promptly and before they dramatically affect any downstream processes.
The image below shows a sample of the median and mean time-series feature statistics:
The FGF creates and maintains a library of unique features containing fundamental information about product data that can be used across the company due to the universal structure.
The universal property comes from the relational keys that act as the connectors to the rest of the wide range of internal data.
The simplicity of this solution allows us to effectively extract useful signals from a large volume of data. The consistent structure of the outputs allows us to easily isolate and combine the most relevant information for a given company problem. With the metadata tracking and quality monitoring, users feel confident using these features, trusting in their accuracy.
The engineered features offer a more focused and simplified view of the raw data, as opposed to the unwieldiness and complexity of multiple raw datasets joined together, which may carry unwanted information that is not monitored for DQ.
The results are:
Accurate and reliable unit features
Statistical features generated by DQ monitoring
Plug-and-play features using consistent identifiers
Metadata to track and reproduce processes
DQ monitoring and alerting
Jobs which consistently maintain data & metadata
Finally, thanks to all of the above benefits, the FGF proves to be an invaluable saver of Data Engineering time and resources.