Understanding the Datatecture Part 3: Video Infrastructure Deep-Dive
In part three of this series, we dig into some of the deeper layers of the Streaming Video Datatecture in the Infrastructure category, defining many of the individual sub-categories and explaining their purpose in the broader workflow.
As we covered in the first post of the series, the Datatecture is organized into three main categories: Operations, Infrastructure, and Workflow. Within these categories are a myriad of sub-categories, often branching into even more specific groups. This structure isn’t intended as a parent-child hierarchy. Rather, it is a way of illustrating relationships between specific components and categories of functionality. For example, many systems and technologies within analytics don’t compete against each other because they handle different sets of data, from video player metrics to customer behavior.
What is Video Infrastructure?
As was discussed in the initial blog post, Infrastructure refers to systems which house many of the streaming stack technologies. Infrastructure components represent the most foundational aspect of the stack: storage, databases, containers, and even queueing systems. These systems enable many of the other streaming technologies to work at scale.
Containers and Virtualization, Storage and Caching, and Queueing Systems
Within the Infrastructure category, there are three primary sub-categories which were outlined in the first post of this blog series. Let’s dig past those and go deeper into video Infrastructure to understand the individual systems involved in this area of the Datatecture.
Containers and Virtualization
As streaming providers have adopted cloud-based components within their technology stack and have moved from monolithic software architectures to microservices, containers and virtualization have become increasingly important. That’s because hardware-based approaches don’t scale well to global audiences.
To meet the needs of audiences in every geography, such as the need for low latency, providers would have to host physical servers around the globe. As those audiences grew, they would need to add more servers to support the demand. That quickly becomes a very expensive proposition.
Virtualization, though, and especially containers, allow operators to deploy new streaming infrastructure into existing cloud providers, enabling operations to grow or shrink programmatically. Containerization is especially exciting because it offers a simplified way, particularly when paired with one of a variety of management tools, to spin up new streaming components that are already pre-configured for production use.
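To make "grow or shrink programmatically" concrete, here is a minimal sketch of the kind of calculation a container management tool might run. The function name, the per-container capacity, the replica limits, and the regional audience numbers are all assumptions made up for illustration; they do not come from any specific orchestrator.

```python
import math


def desired_replicas(concurrent_viewers: int,
                     viewers_per_container: int,
                     min_replicas: int = 2,
                     max_replicas: int = 50) -> int:
    """Return how many containers to run for the current audience size.

    All parameters are hypothetical: a real deployment would measure
    per-container capacity and choose limits based on budget and redundancy.
    """
    if viewers_per_container <= 0:
        raise ValueError("viewers_per_container must be positive")
    # Round up so a partially filled container still gets provisioned.
    needed = math.ceil(concurrent_viewers / viewers_per_container)
    # Clamp between a redundancy floor and a cost ceiling.
    return max(min_replicas, min(max_replicas, needed))


# Hypothetical concurrent audience per cloud region at one point in time.
audience = {"us-east": 12_000, "eu-west": 4_500, "ap-south": 300}
plan = {
    region: desired_replicas(viewers, viewers_per_container=1_000)
    for region, viewers in audience.items()
}
print(plan)
```

With physical servers, each change in this plan would mean buying or retiring hardware in a region. With containers, it is a number handed to the management tool.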
Storage and Caching
Streaming is dependent upon storage. Without somewhere to keep the segmented HTTP video files, there would be no way to provide them to requesting viewers. Of course, sometimes the storage of those segments is transitory, such as in a caching system, and other times it is more permanent, such as for an on-demand library.
In addition to physical storage, this category of the Datatecture also includes other storage mechanisms such as databases.
Object Storage: This is part of the core infrastructure of a streaming service: a place to house the video segments or transcoded copies. In most cases, this will be a cloud provider which offers a geographically distributed, redundant storage solution that can work in conjunction with CDN caching systems.
Origin Services: This is where the content begins. It represents the storage of the original assets, which are then transcoded or packaged into different formats for delivery and storage downstream. In many cases, this storage isn’t as distributed as object storage, which is why it needs to be protected from low-efficiency caches. If there are lots of cache misses and requests need to travel back to the origin, the resulting flood of requests can easily overwhelm it. Given that, many streaming operators opt for origin services offered by other providers who can protect the origin against flooding and ensure that the master content is always available to be ingested into the delivery workflow.
Open Caching Management: Open Caching, a development by the Streaming Video Alliance, is an interoperable, API-based caching system that gives streaming operators, network operators, and content rights holders visibility into and control over the caching topology. As a developing set of specifications, Open Caching isn’t something that can be downloaded and installed. It needs to be built and implemented. As such, vendors are entering the market who can build and support Open Caching implementations.
Time-series Databases: Some aspects of streaming data, such as player analytics, are time-based. To monitor and ultimately troubleshoot player events, it’s critical to know at what point each event happened. That way, the event can be correlated with other streaming data, such as CDN logs, to help identify the root cause.
Data Warehouses: Streaming is driven by data. Every component within the workflow, as evidenced by the Datatecture, throws off data. But to provide opportunity for insight, that data needs to be related. For that to happen, it needs to be stored in a single location. Data warehouses, and more recently data lakes, provide a single storage location for all data sources, enabling streaming operators to see patterns and connections across datasets. By storing the data in a single location, analysis can be significantly sped up, as there is no need to query multiple data sources when relating variables.
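The time-based correlation described under time-series databases can be sketched in a few lines. This is only an illustration, assuming both data sources carry a timestamp and a shared session identifier. The record types, field names, and the five-second window are hypothetical and not taken from any particular player SDK or CDN log format.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class PlayerEvent:
    ts: float          # seconds since epoch (assumed clock-synchronized)
    session_id: str
    kind: str          # e.g. "rebuffer"


@dataclass(frozen=True)
class CdnLogLine:
    ts: float
    session_id: str
    status: int
    cache_hit: bool


def correlate(event, cdn_logs, window_s=5.0):
    """Return CDN log lines for the same session in the window_s seconds
    leading up to the player event. cdn_logs must be sorted by ts."""
    timestamps = [line.ts for line in cdn_logs]
    lo = bisect_left(timestamps, event.ts - window_s)
    hi = bisect_right(timestamps, event.ts)
    return [line for line in cdn_logs[lo:hi]
            if line.session_id == event.session_id]


# Hypothetical sample data: one rebuffer event and four CDN log lines.
logs = [
    CdnLogLine(100.0, "s1", 200, True),
    CdnLogLine(103.5, "s1", 200, False),
    CdnLogLine(104.0, "s2", 200, True),
    CdnLogLine(110.0, "s1", 200, True),
]
event = PlayerEvent(105.0, "s1", "rebuffer")

related = correlate(event, logs)
misses = [line for line in related if not line.cache_hit]
print(len(related), "related requests,", len(misses), "cache miss(es)")
```

A time-series database performs this kind of time-range lookup natively and at far larger scale. The sketch only shows why timestamps on both sides are what make the join possible.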
Queueing Systems
The streaming workflow is built upon servicing requests. Sometimes, those requests come from viewers. Sometimes, they come from other systems. For example, consider a user who requests a video that is not in cache. This request is passed up through the workflow to higher caches until it gets to the origin.
But what if the requested content is needed for a specific device or in a format that hasn’t been prepared? That triggers the workflow to push the origin content through transcoding so it can be returned to the user. But what if there are thousands or millions of such requests? A queueing system, such as a message bus, can help prioritize and organize those requests to ensure that affected systems receive them without being overloaded.
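As a rough illustration of prioritizing and organizing requests, the sketch below keeps transcode jobs in an in-process priority queue and drops duplicate requests for the same asset and format. The class name, the asset and profile labels, and the priority scheme (lower number is served first) are assumptions made for this example. A production message bus would do this across machines, with persistence and acknowledgements.

```python
import heapq
import itertools


class TranscodeQueue:
    """In-memory priority queue that collapses duplicate transcode requests."""

    def __init__(self):
        self._heap = []
        self._pending = set()
        # Tie-breaker so jobs with equal priority are served in arrival order.
        self._counter = itertools.count()

    def submit(self, asset_id, profile, priority):
        """Queue a job. Returns False if the same job is already waiting."""
        key = (asset_id, profile)
        if key in self._pending:
            return False
        self._pending.add(key)
        heapq.heappush(self._heap, (priority, next(self._counter), key))
        return True

    def next_job(self):
        """Pop the most urgent job, or None when the queue is empty."""
        if not self._heap:
            return None
        _, _, key = heapq.heappop(self._heap)
        self._pending.discard(key)
        return key

    def __len__(self):
        return len(self._heap)


queue = TranscodeQueue()
queue.submit("movie-42", "hevc-1080p", priority=5)   # on-demand title
queue.submit("live-7", "h264-720p", priority=1)      # live event, more urgent
queue.submit("movie-42", "hevc-1080p", priority=5)   # duplicate, ignored
print(len(queue), "jobs waiting; first up:", queue.next_job())
```

Because workers pull jobs at their own pace, a spike of a million requests lengthens the queue rather than overloading the transcoders, and repeated requests for the same rendition result in a single job.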
Infrastructure All Works Together
These components don’t work in a vacuum. An important point to understand is that data warehouses are linked to time-series databases, which are linked to object storage, which is linked to queueing systems. When looking at your own Datatecture, understanding the interplay between systems means you aren’t seeing data in a silo. Data from one component is often used by another, and data from one technology is often affected by data from another. Seeing these relationships will help you get better visibility across the entire workflow.
To learn more, visit and explore the Datatecture site. In the next blog post, we will explore the groups within the Workflow category.