What To Do When There Is Too Much Data
This post is the second in a five-part series that explores the challenges associated with using streaming video data more effectively. The first part outlined the four primary challenges. This post explores the first of those challenges, the volume of data, and looks at ways that data volume can be better managed so that it doesn’t undermine usability.
If there’s one thing that streaming video doesn’t suffer from, it’s a lack of data. With a tech stack composed of dozens of components spread across physical hardware, virtual instances, and even third-party service providers, streaming video data is a rushing torrent. And in many ways, that can be a benefit across a number of business units. Operations is an obvious one: with more data, network engineers can better ascertain the root cause of issues that impact viewer QoE. Other uses of data may not be so obvious. Ad teams, for example, can use the data to understand how viewers are engaging with ads (and which ads may not have displayed correctly). Product teams can use the data to get a better picture of how viewers are discovering content. Still, there’s a difference between having access to the data that can help make business decisions and having too much data.
The Stages Of Data: From Collection To Analysis
No matter the intended use, every application of data follows a well-defined process that can best be imagined as a flow:
Collection. In this first step of the flow, data has to be collected. Collection can happen programmatically, such as pulling data from an encoder via its API, or manually, as when an engineer grabs a log file and loads it into a tool like Datadog.
Post-Processing. Once data has been collected (and delivered), it needs to be processed. In most cases, this involves normalizing variable names and values; in some cases, though, it might also involve calculations and establishing relationships. Regardless, this step in the flow is what makes the data useful (a brief sketch of these first two stages follows this list).
Visualization. After data has been processed, it can be visualized. In most cases, organizations already have tools, like Datadog or Looker, and have spent a lot of time and energy building visualizations against the post-processed data.
Analysis. Finally comes the analysis. Although this may seem like a very manual process (someone has to look at the visualizations and draw some conclusions, perhaps by drilling into the raw data), it is also a perfect application for automation via ML/AI. A well-trained AI could astutely suggest connections and meaning, providing a heavy dose of observability that is entirely automated.
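To make the first two stages a bit more concrete, here is a minimal Python sketch of programmatic collection and post-processing. The encoder endpoint, field names, and mappings are hypothetical and purely illustrative; real encoders and analytics vendors expose their own APIs and schemas.

```python
# Minimal sketch of the collection and post-processing stages.
# The endpoint URL, field names, and mappings below are hypothetical.

import requests

ENCODER_METRICS_URL = "https://encoder.example.com/api/v1/metrics"  # hypothetical endpoint

# Map vendor-specific keys to the canonical names used downstream.
FIELD_MAP = {
    "br": "bitrate_kbps",
    "bufferMs": "buffer_ms",
    "errCode": "error_code",
}

def collect():
    """Collection: pull raw metric records from the encoder API."""
    response = requests.get(ENCODER_METRICS_URL, timeout=5)
    response.raise_for_status()
    return response.json()  # assumed to be a list of raw records

def normalize(record):
    """Post-processing: rename fields into a common schema."""
    normalized = {FIELD_MAP.get(key, key): value for key, value in record.items()}
    # Illustrative unit fix: treat implausibly large bitrates as bps and convert to kbps.
    if normalized.get("bitrate_kbps", 0) > 100_000:
        normalized["bitrate_kbps"] = normalized["bitrate_kbps"] / 1000
    return normalized

if __name__ == "__main__":
    processed = [normalize(record) for record in collect()]
    print(f"Collected and normalized {len(processed)} records")
```

From there, the normalized records would feed whatever visualization and analysis tools the organization already has in place.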
Some of these steps in the data flow require a fair bit of up-front development and ongoing maintenance, and when there is a significant volume of data, each step in the flow can get bogged down.
When the Data Flow Is More Firehose Than Steady Stream
So what happens to each of the data flow stages when there’s too much data?
Collection. Although it may not seem like collecting too much data could cause a problem, it can. For example, if there is limited bandwidth available at an endpoint, pulling too much data at the same time a viewer is trying to stream a video may force a switch to a lower bitrate (because of the reduced available bandwidth), which can undermine QoE.
Post-processing. This is one of the stages hit hardest by too much data. When time is of the essence, any delay in being able to use the collected data can impact the business, and the more data there is to post-process, the longer it takes. Consider a live streaming event: issues must be resolved in real time to prevent potential refund requests. Yet what if calculating metrics and other KPIs from the raw data takes seconds or even minutes longer because of the volume? That could result in increasing viewer dissatisfaction as problems impacting QoE (everything from video quality to authentication) aren’t addressed quickly enough.
Visualization. The impact to this stage of the data flow is minimal. The main effect is a delay in seeing the data, depending on how long the post-processing stage has been held up.
Analysis. This is perhaps the stage with the biggest impact from data volume. For example, when there is a recognized issue, perhaps surfaced through a visualization, operations engineers need to drill into the data to see exactly what’s going on (as the visualization often only represents a calculated value). But imagine trying to sift through millions of lines of data rather than thousands. The impact on the time it takes for analysis can be significant: instead of a few minutes to find and address a problem with viewer quality, it may take ten minutes or half an hour, especially when coordination is needed with a partner or service provider.
The impact of data volume, then, can be significant across a variety of parts of the business. Although it may only result in a few minutes of delay, even that small amount of time can be the difference between attrition and continued subscription. And for ads, the inability to quickly identify ad errors can be costly, as failed ads must be redelivered to meet contractual impression obligations, eating into next month’s inventory.
Reducing Data Volume Equals Improving Observability
In many ways, it seems counter-intuitive that observability can be improved by reducing the amount of data. But observability is more than just identifying insights; it’s also about how quickly those insights can be reached. So if a reduced data volume can reveal the same level of insight, observability is improved because those insights arrive faster.
Thankfully, there’s an easy way to reduce data volume: data sampling.
In many cases, representative samples of data can be just as meaningful as the entire dataset. If there are 1,000 records that indicate an error, do you need all 1,000 records to tell you that, or would 500 of those records be enough? 250? The reduction in data volume doesn’t undermine observability. In addition to sampling, though, streaming operators can point the raw data to a secondary data lake and send a sampled subset to operational tools where post-processing, visualization, and analysis can take place (a brief sketch of this split-delivery approach follows). In this way, immediate issues can be resolved quickly (through a reduction in data volume) while, over the long term, operations and other business units can still access the full dataset for analysis that is not time-sensitive. That full dataset can also reveal anomalies that lead to a deeper understanding of recurring issues.
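As a rough illustration of this split, the Python sketch below writes every record to a data lake while forwarding only a sampled subset to latency-sensitive operational tools. The sink functions and the 25% sample rate are placeholders, not a prescribed implementation; in practice this routing would be handled by whatever pipeline or vendor tooling an operator already uses.

```python
# Sketch of split delivery: the full dataset goes to a data lake,
# a sampled subset goes to operational tools. Sinks and sample rate are placeholders.

import json
import random

SAMPLE_RATE = 0.25  # forward roughly 1 in 4 records to operational tools

def send_to_data_lake(record):
    """Placeholder sink: append the raw record to long-term storage."""
    with open("data_lake.jsonl", "a") as archive:
        archive.write(json.dumps(record) + "\n")

def send_to_operational_tools(record):
    """Placeholder sink: push the record to a dashboard or alerting tool."""
    print("operational record:", record)

def route(records):
    """Send everything to the lake; sample what goes to operational tools."""
    forwarded = 0
    for record in records:
        send_to_data_lake(record)          # full fidelity, not time-sensitive
        if random.random() < SAMPLE_RATE:  # reduced volume, time-sensitive
            send_to_operational_tools(record)
            forwarded += 1
    return forwarded

if __name__ == "__main__":
    demo = [{"session": i, "error_code": None} for i in range(1000)]
    print(f"Forwarded {route(demo)} of {len(demo)} records to operational tools")
```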
Using a Data-as-a-Service Platform to Help You Reduce Data Volume And Make Quick Business Decisions
Most operators rely on software providers, such as streaming analytics vendors, to help them collect the data from their players and other endpoints. When that collection is done programmatically, rather than through the vendor’s dashboard, there often isn’t a way to sample the data or divert it to a secondary location. Thankfully, Datazoom, a Data-as-a-Service Platform, includes features that allow data engineers to optimize data at the time of collection. While this can include activities like transforming variables and enriching with third-party data sets, it also allows the sampling of data and delivery to multiple locations. The result is an optimized data pipeline delivering the data that’s needed to the places it needs to be so that business decisions can happen as quickly as possible.