
Standardization Unifies Data

Standardizing player and Content Delivery Network (CDN) data can give video and audio publishers a cohesive, holistic view that streamlines QoE, product, content, and advertising decision-making by reducing the confusion caused by inconsistencies and eliminating the post-processing needed to unify datasets.

All companies need to standardize and normalize data received from multiple sources, but building that capability, and maintaining it over time, can be an unnecessary drain on resources.

Data Standardization Challenges

Data standardization is built into the Datazoom DaaS Platform. Users can easily select which data points to adjust based upon the player or CDN data dictionary.

Designing an Extensible Data Standard

Given how quickly market landscapes, product directions and needs of data constituents can change, designing a future-friendly data standard that captures relevant metrics in a format that is analysis- and budget-friendly can be difficult.

Enforcing Consistency Across Integrations

Given your wide array of CDN and video player integrations, enforcing the standard across disparate contexts requires a meticulous approach and robust mechanisms to guarantee compliance.

Complex Integration Efforts

The level of effort to implement and maintain a data standard data point by data point across your video players and CDNs can be significant. The idiosyncrasies of the different integration contexts are highly prone to inaccuracies and inconsistencies.

How Datazoom Standardizes Data

Standardizing data at the time of collection means no more post-processing. The data comes normalized and ready to use straight from the Datazoom platform.

Automated Data Collection

With pre-built collectors for various player versions and CDNs, Datazoom automatically captures data according to the Datazoom Dictionary, eliminating inaccuracies and inconsistencies that can arise from custom implementations.

Synergistic Player and CDN Standards

Datazoom’s player and CDN standards work hand in hand. By standardizing data collection at both levels, Datazoom bridges the gap between content consumption and content delivery.

Enhanced Validation

Datazoom’s platform thoroughly validates that collected data meets expected standards and runs frequent, rigorous quality checks.

Adaptability and Flexibility

Adapt the Datazoom dictionary to other required standards within your ecosystem by using Datazoom’s transformations.

Standardize Example: Bitrate

What's Happening?

Data sources coming from the same endpoint, like a player, may use different names for the same value, such as bitrate. In this case, bitrate could be named Resolution, or video_quality, or bit_rate. Different variable names can make it difficult to relate datasets to one another when the data is viewed through a visualization tool, requiring manual post-processing. Datazoom’s CDN Data Dictionary and Player Data Dictionaries provide automatic standardization of data names and values across datasets, speeding up analysis.
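
As a concrete illustration, here is a minimal sketch of collection-time standardization in Python. The alias set and the canonical name bitrate are illustrative assumptions, not the actual Datazoom Data Dictionary.

```python
# Hypothetical aliases seen for the same value across players and CDN logs.
# The alias set and canonical name are illustrative, not Datazoom's dictionary.
BITRATE_ALIASES = {"Resolution", "video_quality", "bit_rate", "bitrate"}

def standardize_event(raw_event: dict) -> dict:
    """Rename known bitrate aliases to a single canonical key."""
    return {("bitrate" if key in BITRATE_ALIASES else key): value
            for key, value in raw_event.items()}

# Events from two different sources now share one schema.
print(standardize_event({"bit_rate": 4_500_000, "cdn": "cdn-a"}))
print(standardize_event({"video_quality": 3_200_000, "player": "web"}))
```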

Data Nerd? Check Out The Datazoom Data Dictionaries

Datazoom’s Data Dictionary provides a source of truth for streaming-related variables. By standardizing data elements collected through Datazoom data pipes, streaming operators can ensure that similar variables collected from different sources are all named the same for analytical continuity.


What To Do When There Is Too Much Data

This post is the second in a five-part series that explores the challenges associated with using streaming video data more effectively. The first part outlined the four primary challenges. This post explores the first of those challenges, the volume of data, and looks at ways that data volume can be better managed so that it doesn’t undermine usability.


If there’s one thing that streaming video doesn’t suffer from, it’s a lack of data. With a tech stack composed of dozens of components, spread across physical hardware, virtual instances, and even third-party service providers, streaming video data is a rushing torrent. And in many ways, that can be a benefit across a number of business units. Operations is an obvious one: with more data, network engineers can better ascertain the root cause of issues which impact viewer QoE. Other uses of data may not be so obvious. Ad teams, for example, can use the data to understand how viewers are engaging with ads (and which ads may not have displayed correctly). Product teams can use the data to get a better picture of how viewers are discovering content. Still, there’s a difference between having access to the data that can help make business decisions, and having too much data.

The Stages Of Data: From Collection To Analysis

No matter what the intended use, every application of data follows a well-defined process that can best be imagined as a flow:

  1. Collection. In this first step of the flow, data has to be collected. This collection can be handled programmatically, such as pulling data from an encoder via APIs, or manually, such as an engineer grabbing a log file and dumping it into a tool (such as Datadog).

  2. Post-Processing. Once data has been collected (and delivered), it needs to be processed. In most cases, this involves normalizing variable names and values. In some cases, though, it might involve calculations and establishing relationships. Regardless, this step is what makes the data useful.

  3. Visualization. After data has been processed, it can be visualized. In most cases, organizations already have tools, like Datadog or Looker, and have spent a lot of time and energy building visualizations against the post-processed data.

  4. Analysis. Finally comes the analysis. Although this may seem like a very manual process (someone has to look at the visualizations and draw some conclusions, perhaps by drilling into the raw data) this is also a perfect application for automation via ML/AI. A well-trained AI could astutely suggest connections and meaning, providing a heavy dose of observability that is entirely automated.

Some of these steps in the data flow require a fair bit of up-front development and ongoing maintenance, but when there is a significant volume of data, steps in the flow can get bogged down.
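
To make the four stages above concrete, here is a minimal, self-contained sketch of the flow in Python. The function names, toy log format, and the zero-bitrate check are illustrative assumptions, not any vendor’s implementation.

```python
# A minimal sketch of the collect -> post-process -> visualize -> analyze flow.
# Function names and the toy log format are illustrative assumptions.

def collect() -> list[dict]:
    # Stand-in for pulling logs from an encoder API or a player beacon.
    return [{"bit_rate": 4_500_000}, {"video_quality": 3_200_000}, {"bit_rate": 0}]

def post_process(events: list[dict]) -> list[dict]:
    # Normalize variable names so downstream tools see one schema.
    aliases = {"bit_rate", "video_quality"}
    return [{("bitrate" if k in aliases else k): v for k, v in e.items()} for e in events]

def visualize(events: list[dict]) -> None:
    # Stand-in for pushing rows to a dashboard such as Datadog or Looker.
    for event in events:
        print(event)

def analyze(events: list[dict]) -> str:
    # A trivial rule standing in for human review or an ML-driven check.
    stalled = sum(1 for e in events if e.get("bitrate") == 0)
    return f"{stalled} of {len(events)} sessions report zero bitrate"

events = post_process(collect())
visualize(events)
print(analyze(events))
```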

When the Data Flow Is More Firehose Than Steady Stream

So what happens to each of the data flow stages when there’s too much data?

  • Collection. Although it may not seem like collecting too much data could cause a problem, it can. For example, if there is limited bandwidth available at an endpoint, pulling too much data at the same time a viewer is trying to stream a video may cause a switch to a lower bitrate (because of less available bandwidth), which can undermine QoE.

  • Post-processing. This is perhaps the stage that can see the biggest impact of too much data. When time is of the essence, any delay in being able to use the collected data can impact the business. But the more data to post-process, the longer it takes. Consider a live streaming event: issues must be resolved in real-time to prevent potential refund requests. Yet what if calculating metrics and other KPIs from the raw data takes seconds or even minutes longer because of the volume? That could result in increasing viewer dissatisfaction as problems impacting QoE (everything from video quality to authentication) aren’t addressed quickly enough.

  • Visualization. The impact to this stage of the data flow is minimal. The only effect may be a delay in seeing the data, depending on the extent of the delay in the post-processing stage.

  • Analysis. This stage, too, can be heavily impacted by data volume. For example, when there is a recognized issue, perhaps surfaced through a visualization, operations engineers need to drill into the data to see exactly what’s going on (as the visualization often only represents a calculated value). But imagine trying to sift through millions of lines of data rather than thousands. The impact on the time it takes for analysis can be significant. Instead of a few minutes to find and address a problem with viewer quality, it may take 10 minutes or half an hour, especially when there’s coordination needed with a partner or service provider.

The impact of data volume, then, can be significant across a variety of parts of the business. Although it may only result in a few minutes of delay, even that small amount of time can be the difference between attrition and continued subscription. And for ads, the inability to quickly identify ad errors can be costly, since failed ads must be redelivered, and contractual obligations for impressions can eat into next month’s inventory.

Reducing Data Volume Equals Improving Observability

In many ways, it seems counter-intuitive that observability can be improved by reducing the amount of data. But observability is more than just identifying insights; it’s also about the speed to those insights. So if reducing data volume can reveal the same level of insight, observability is improved by increasing the speed at which those insights can be achieved.

Thankfully, there’s an easy way to reduce data volume: data sampling.

In many cases, representative samples of data can be just as meaningful as the entire dataset. If there are 1000 records which indicate an error, do you need all 1000 records to tell you that, or would 500 of those records be enough? 250? The reduction in data volume doesn’t undermine observability. In addition to sampling, though, streaming operators can point the raw data to a secondary data lake and a sampled subset to operational tools where post-processing, visualization, and analysis can take place. In this way, immediate issues can be resolved quickly (through a reduction in data volume) while, long-term, operations and other business units can access the full data set for analysis that is not time-sensitive. This also makes it possible to identify data anomalies that may lead to a deeper understanding of recurring issues.
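
Here is a minimal sketch of that approach, assuming a simple random sample and two illustrative destinations (a data lake and an operational tool); the names and the 25% rate are placeholders, not a specific product’s API.

```python
# A minimal sketch of sampling plus dual routing: full data to a data lake,
# a sampled subset to real-time operational tooling. Destination names and
# the sample rate are illustrative placeholders.
import random

def route(events, sample_rate=0.25, seed=42):
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    lake, operational = [], []
    for event in events:
        lake.append(event)                  # everything lands in the data lake
        if rng.random() < sample_rate:      # ~25% goes to real-time tools
            operational.append(event)
    return lake, operational

# 1000 error records: the sampled subset still reveals the problem.
errors = [{"error": "PLAYBACK_STALL", "session": i} for i in range(1000)]
lake, operational = route(errors)
print(len(lake), "records archived,", len(operational), "sent to dashboards")
```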

Using a Data-as-a-Service Platform to Help You Reduce Data Volume And Make Quick Business Decisions

Most operators are relying on software providers, such as streaming analytics vendors, to help them collect the data from their players and other endpoints. When that collection is done programmatically, rather than through the vendor’s dashboard, there often isn’t a way to sample the data or divert it to a secondary location. Thankfully, Datazoom, a Data-as-a-Service Platform, includes features which allow data engineers to optimize data at the time of collection. While this can include activities like transforming variables and enriching with third-party data sets, it also allows the sampling of data and delivery to multiple locations. The result is an optimized data pipeline delivering the data that’s needed to the places it needs to be so that business decisions can happen as quickly as possible.
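
As a rough illustration of what collection-time optimization might cover, here is a hypothetical configuration shape in Python. Every key and value below is invented for illustration; it is not Datazoom’s actual configuration schema or API.

```python
# A hypothetical pipeline configuration illustrating collection-time optimization.
# All keys and values are invented for illustration; this is not Datazoom's
# actual configuration schema or API.
pipeline_config = {
    "collector": "web_player",
    "transforms": [
        {"rename": {"bit_rate": "bitrate", "video_quality": "bitrate"}},
    ],
    "enrichments": [
        {"join": "cdn_logs", "on": "session_id"},    # third-party / CDN data join
    ],
    "sampling": {"rate": 0.25},                      # keep ~25% for real-time tools
    "destinations": [
        {"type": "data_lake", "sampled": False},     # full-fidelity archive
        {"type": "observability_tool", "sampled": True},
    ],
}
print(pipeline_config["destinations"])
```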


Addressing The Streaming Video Data Challenges

There is no doubt that data is the lifeblood of streaming video. Although having the right content is critical to keeping and attracting subscribers, data provides the insight to what that content library should be. In the streaming video workflow, dozens of components throw off data which provide insight into everything from viewer behavior to revenue opportunities to performance information. Imagine the components of the streaming workflow as islands all connected together by bridges (APIs) suspended in a roaring river of data. And it’s important to remember that the data relevant to streaming video isn’t just from within the workflow. That river is fed by countless tributaries, including such sources as Google Ad Manager and content delivery networks. The fact that this is a river of data, and not just a trickle, emphasizes how challenging it is for streaming operators to make sense of the information within, to take action against what might amount to billions of data points. It’s like trying to identify two fish from the same clutch by just looking at the water. And yet that’s exactly what many streaming operators try to do in real-time: make sense of all the connections within the massive river of data.

The Impact Of Issues With Data

Of course, handling a large volume of data is only one challenge. There are others (as detailed below), but ultimately, any of these challenges result in one thing: slowing down the ability to take action. These challenges represent blockers to using the data in real-time to make critical business decisions. If there’s too much data coming at one time (rather than a sample of the data, for example), it can take too long to process and display in a visualization tool. Even if the visualization tool is connected to unlimited computational resources, it still takes time to process the data. Of course, unlimited resources are often not available, so processing massive amounts of data can add significant time. This, and other kinds of delays with handling the data from the streaming workflow, keeps operators from putting that data to use, and that undermines the value of the data in the first place. Consider this example: understanding five minutes after an outage where the outage happened doesn’t mitigate customer discontent. But the outage didn’t just suddenly happen. There was probably data which hinted at the impending problem, yet it was lost in the river. Only when the outage happened or was noticed (minutes after processing was completed), and the aftermath became evident (such as a sudden spike in customer emails), did it become impossible to miss.

Understanding the Challenges of Streaming Video Data

As was already pointed out, the volume of data is only one of the challenges facing streaming operators with respect to putting data to use. There are several other challenges which can have just as much of a negative impact as having too much data:

  • Delivery time. How fast does the data need to get where it’s going? Many streaming operators employ software in their player to capture information about the viewer experience. But what if that data comes two, three, or even 10 minutes after an issue is detected? Of course, the issue is not in the player. It’s most likely upstream. But having visibility into the viewer experience provides an indicator of other problems in the workflow. So the data needs to be delivered as quickly as it’s needed. The time constraints on individual pieces of data are not one-size-fits-all: different data needs to be delivered at different speeds.

  • Post-processing. Countless hours are spent processing data once it has been received. That post-processing may be automated, such as through programming attached to a data lake or a visualization dashboard, or it may be manual. However it’s carried out, it takes time. But that post-processing must happen to turn the data into usable information. For example, it doesn’t help the ad team to hand them raw numbers on time spent watching a particular ad. What helps is telling them whether a particular ad, across all views, has hit a certain threshold of viewing percentage (which is probably a contractual number), as shown in the sketch after this list. In other words, post-processing makes data usable. But when it takes too much time, the value of the data can diminish.

  • Standardization. Streaming video can be a unique monster when it comes to data sets. Lots of providers are collecting similar (if not identical) data but may represent it differently. When this happens, that data must be sanitized and scrubbed (post-processed) to ensure that it can be compared with similar values from other providers and used as part of larger roll-ups, such as KPIs. Content delivery network logs are a great example of this. Without any standardized approach to variable representation, streaming operators are forced to come up with their own lingua franca which has to be maintained and enforced with new providers.
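
As a rough illustration of the post-processing point above, here is a minimal sketch that rolls raw ad events up into a completion-rate KPI. The 75% threshold, field names, and sample events are illustrative assumptions, not real contractual numbers or a real schema.

```python
# A minimal sketch of post-processing raw ad events into a usable KPI.
# The 75% threshold, field names, and sample events are illustrative assumptions.

AD_EVENTS = [
    {"ad_id": "ad-123", "watched_s": 22.5, "duration_s": 30.0},
    {"ad_id": "ad-123", "watched_s": 30.0, "duration_s": 30.0},
    {"ad_id": "ad-123", "watched_s": 9.0,  "duration_s": 30.0},
]

def completion_rate(events, threshold=0.75):
    """Share of views where watched time met the completion threshold."""
    completed = sum(1 for e in events if e["watched_s"] / e["duration_s"] >= threshold)
    return completed / len(events)

# Raw seconds-watched become a single answer the ad team can act on.
print(f"{completion_rate(AD_EVENTS):.0%} of views hit the 75% threshold")
```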

Yes, data is critical to the success of streaming platforms. But actually using that data in a meaningful way is fraught with challenges: volume, delivery, processing, and standardization. So just as important as identifying and gathering the right sources of data is having a strategy to deal with these challenges. With the right data and the right strategy, streaming operators can ensure that their viewers always have the best experience because the operator has access to the right amount of data, optimized, transformed, and delivered right where, and when, it’s needed.


In the next blog post of this series, we’ll take a look at data volume in more detail and the ways that it might be mitigated. Getting the right data to the right people is critical for streaming platform success. But that’s sometimes easier said than done.
