
Why real-time data matters

What “real-time” means to Datazoom

The purpose of this article is to provide a standard for defining “real-time” and to explain why detailed, real-time user Quality of Experience (QoE) data is important for streaming media content providers. Datazoom provides a mechanism to capture detailed QoE data in sub-second real time. We built our technology to get your team detailed data ‘zooming’ fast.

For example, some leading analytics tools claim to provide a “real-time” dashboard view, but your ability to drill down to specifics is limited by their data processing capabilities. Sometimes as much as 30 minutes of delay exists between data collection and your ability to review metric details.

“We just lost 50% of our viewers! Why?!”

“Startup time just spiked from an average of 2 sec to 12 sec… sustained for the last 2 minutes. All platforms? All apps? Which CDN? Which ISP?”

“Authentication latency for MVPD X went up and stayed up. Our users can’t log in!”

Getting detailed, real-time KPI data into the hands of an analyst is important only if it’s actionable. Let’s review some actions a hypothetical Datazoom customer, RadarVid, could take with real-time granular data.

Use case 1: Operations is receiving alerts that indicate an SLA-impacting event

Say RadarVid has an SLA with a CDN vendor that guarantees X percent availability. Actionable data, in this case, prompts RadarVid to contact the CDN vendor.

RadarVid uses Datazoom-collected real-time data fed into their analytics tool(s) to:

  • Determine if any load-balancing/auto/manual failover service has mitigated the impact on users.

  • Inform their customer care team of the problem. Providing customer care with details related to platform/app version/region/ISP/Affiliate etc. is a bonus.

  • Inform their ad sales team of potential revenue impact.

  • Calculate costs against the vendor SLA for that calendar month (see the sketch after this list).
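As a concrete illustration of that last point, here is a minimal sketch of checking a month’s measured availability against an SLA target. The outage minutes, the 99.9% target, and the credit tiers are hypothetical values for this example, not terms of any real contract or Datazoom feature.

```python
# Hypothetical sketch: compare measured CDN availability against an SLA target.
# The outage total, SLA target, and credit tiers below are illustrative assumptions.
MINUTES_IN_MONTH = 30 * 24 * 60

def monthly_availability(unavailable_minutes, total_minutes=MINUTES_IN_MONTH):
    """Return availability as a percentage for the calendar month."""
    return 100.0 * (total_minutes - unavailable_minutes) / total_minutes

def sla_credit(availability_pct, sla_target=99.9):
    """Return a hypothetical service-credit percentage owed by the vendor."""
    if availability_pct >= sla_target:
        return 0.0   # SLA met, no credit
    if availability_pct >= 99.0:
        return 10.0  # minor breach
    return 25.0      # major breach

# e.g. 130 minutes of impact flagged by real-time QoE alerts this month
availability = monthly_availability(unavailable_minutes=130)
print(f"availability {availability:.3f}%, credit {sla_credit(availability)}%")
```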

Use case 2: Customer care is getting problem reports about their Direct-to-Consumer product.

“I paid to watch this show/event and I can’t. I want a refund!” Hell hath no fury like a disappointed subscription audience.

There can be any number of reasons for this. In this case, RadarVid’s customer care team would have access to that user’s session data, which may include things like available bandwidth, device connection type, application version, ISP and so on. All of this information can help guide the customer care representative toward a specific course of action.

RadarVid uses Datazoom-collected real-time data fed into their analytics tool to:

  • View relevant data about the problem the subscriber reported

  • Track similar problem reports to develop trends related to app version/platform/region/CDN/ISP

  • Use data to inform a course of action

Use case 3: Ad Operations is alerted about a decrease in fill rates during a predicted high-traffic event.

With Datazoom, RadarVid’s ad operations team can see real-time data that includes ad creative metrics. RadarVid uses Datazoom-collected real-time data fed into their analytics tool to:

  • Determine if an ad decisioning partner system is suspect.

  • Determine if the ad insertion system is suspect.

  • Determine if the ad creative is suspect, and confirm that internal teams have properly configured campaigns (see the sketch after this list).
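To make the diagnosis concrete, the sketch below breaks the fill rate down by ad decisioning partner and flags any partner that drops below a threshold. The event field names (ad_partner, ad_requests, ads_filled) and the 70% threshold are assumptions for illustration, not a Datazoom schema.

```python
# Hypothetical sketch: break fill rate down by ad decisioning partner to spot the suspect system.
# Field names and the 70% alert threshold are illustrative assumptions.
from collections import defaultdict

def low_fill_partners(ad_events, threshold=0.70):
    requests = defaultdict(int)
    filled = defaultdict(int)
    for event in ad_events:
        requests[event["ad_partner"]] += event["ad_requests"]
        filled[event["ad_partner"]] += event["ads_filled"]
    return {partner: filled[partner] / requests[partner]
            for partner in requests
            if requests[partner] and filled[partner] / requests[partner] < threshold}

events = [
    {"ad_partner": "partner_a", "ad_requests": 1000, "ads_filled": 910},
    {"ad_partner": "partner_b", "ad_requests": 1200, "ads_filled": 540},
]
print(low_fill_partners(events))  # {'partner_b': 0.45}
```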

Use case 4: Synchronous data

Leveraging the real-time capability of Datazoom Data Pipes, RadarVid can ensure that disparate data is time-aligned with client viewing experience data. A time data point captured by Datazoom makes it possible to display other time-indexed data in alignment with the time represented in a live video stream.

RadarVid uses Datazoom-collected real-time data fed into their analytics tool to establish the time of the currently viewed video frame versus the reference and react accordingly.
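As a rough illustration of that alignment, the sketch below matches a player-reported stream time to the closest record in a separately captured, time-indexed data set. The record fields and the half-second tolerance are assumptions for this example, not part of the Datazoom product.

```python
# Hypothetical sketch: align the currently viewed frame's time with other time-indexed data.
# Record fields and the half-second tolerance are illustrative assumptions.
from datetime import datetime, timedelta

def align_to_stream_time(stream_time, indexed_records, tolerance=timedelta(milliseconds=500)):
    """Return the record whose timestamp is closest to the currently viewed frame's time."""
    best = min(indexed_records, key=lambda r: abs(r["timestamp"] - stream_time), default=None)
    if best and abs(best["timestamp"] - stream_time) <= tolerance:
        return best
    return None  # nothing close enough to display alongside the stream

# Example: telemetry captured independently of the live video feed
start = datetime(2020, 1, 1, 12, 0, 0)
records = [{"timestamp": start + timedelta(seconds=s), "speed_kph": 180 + s} for s in range(10)]
print(align_to_stream_time(start + timedelta(seconds=3, milliseconds=200), records))
```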

Use case 5: Add new data points

Using the Datazoom Data Pipes configuration tool, a customer can react to platform-specific changes to the types of data points that become available.

RadarVid uses Datazoom-collected real-time data fed into their analytics tool(s) to:

  • Add a newly available platform data point without an often lengthy application update development cycle.

  • Begin collecting data on that metric immediately, once the data point is added to the configuration.



Setting up PostgreSQL on Kubernetes using Stolon

Overview

PostgreSQL is a powerful object-relational database management system (ORDBMS). Because it is ACID-compliant and transactional, and has earned a reputation for reliability, feature robustness, and performance, many companies use it to manage their master data. For production deployments, it is important for applications to maintain a highly available (HA) environment. PostgreSQL supports HA, but setting it up correctly in a Kubernetes environment has its challenges. Stolon is a cloud-native PostgreSQL manager for maintaining HA. It runs in a Kubernetes environment and builds on PostgreSQL’s native replication mechanisms to strengthen the high-availability feature.

Why a ‘Stolon’ PostgreSQL Cluster?

Implementing a PostgreSQL cluster inside Kubernetes is always a challenge, since stateful services do not integrate as directly as stateless ones. The well-known methods for implementing clusters are sorintlab’s Stolon, CrunchyData’s PostgreSQL cluster, and Zalando’s Patroni/Spilo PostgreSQL cluster.

As it stands, it is my opinion that Stolon is the best method for implementing a PostgreSQL cluster inside Kubernetes because of:

  • The High Availability of PostgreSQL data

  • Open source data storage services

  • Better customization of PostgreSQL versions based on application requirements

  • Its ability to easily modify service names, database names, and user access privileges

  • Automated failover switching with very minimal delay

  • High resiliency

  • Easy cluster scaling

  • Easy replication

Some information about Postgres Cluster configuration

A PostgreSQL cluster consists of:

  • Stolon Cluster

  • Stolon Sentinel(s)

  • Stolon keepers

  • Stolon Proxies

Stolon Cluster

A highly available PostgreSQL cluster is implemented with the help of a Stolon cluster in Kubernetes, with all configuration passed through a configmap (stolon-cluster-kube-stolon). Any update to a Postgres parameter can also be applied as a rolling update through the configmap.

Note: For the PostgreSQL cluster setup (including all three of the components mentioned above), wait for this Stolon cluster component to be available first.

Stolon Keeper

A PostgreSQL database engine runs as a Stolon keeper service and is implemented as a StatefulSet with a persistent volume. Each pod in the StatefulSet acts as either the master or a standby of the cluster. Data synchronization between the cluster members (master and standbys) is performed with the help of a separate Postgres user. Every keeper MUST have a different UID, which can either be provided manually (--uid option) or generated automatically; master election is based on this UID. After the first start, the keeper UID (provided or generated) is saved inside the keeper’s data directory.

Stolon Sentinel(s)

A Sentinel discovers and monitors Stolon keepers and calculates the optimal cluster view. The Sentinel uses the UIDs of the master and standby(s) to monitor and keep track of each Stolon keeper. The Sentinel service is set up as a Deployment.

Stolon Proxies

A Stolon proxy exposes the PostgreSQL service endpoint with a fixed IP and DNS name for accessing the PostgreSQL service. The proxy switches the master connection whenever a failover changes the master. The stolon-proxy is a sort of fencer, since it closes connections to old masters and directs new connections to the current master.
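From an application’s point of view, this means clients connect through the proxy’s stable service name rather than to any individual keeper pod. A minimal sketch, where the service name, namespace, database, and credentials are placeholders for this environment rather than values defined by Stolon:

```python
# Minimal sketch: applications connect to the stolon-proxy service, never to a keeper pod directly.
# The service name, namespace, database, and credentials are illustrative placeholders.
import psycopg2

conn = psycopg2.connect(
    host="stolon-proxy-service.default.svc.cluster.local",  # fixed DNS name fronted by the proxy
    port=5432,
    dbname="appdb",
    user="app_user",
    password="app_password",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT inet_server_addr()")  # shows which keeper the proxy routed us to
    print(cur.fetchone())
conn.close()
```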

PostgreSQL Users

Stolon requires two kinds of users:

The Superuser

  • manages/queries the keepers’ controlled instances. (AKA Normal connection users)

  • executes (if enabled) pg_rewind-based resync

The Replication user

  • manages/queries the keepers’ controlled instances

  • performs replication between postgres instances

Postgres Cluster for HA Production deployments

In order to obtain high availability and resilience, we have customized the default PostgreSQL cluster parameters. What follows is a description of the setup of our environment.

Synchronous Replication

PostgreSQL offers “synchronous replication” (SR) as an option for data availability. By default this option is disabled. We have enabled it so that transactions are committed on one or more replicas before a success message is returned to the database client. This guarantees that if the client saw a commit message, the data is persisted to at least two nodes (master and standby). This option is important when data is so valuable that we’d rather have the database reject writes than risk losing them because the cluster master fails after the commit.
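A hedged sketch of how synchronous replication can be enabled by patching the Stolon cluster spec, assuming stolonctl is available and the cluster uses the Kubernetes store backend with the kube-stolon cluster name implied by the configmap above; the exact spec fields and values are worth verifying against the Stolon documentation for the version in use.

```python
# Hedged sketch: enable synchronous replication by patching the Stolon cluster spec.
# Assumes stolonctl is on PATH and the cluster/store names match this environment.
import json
import subprocess

patch = {
    "synchronousReplication": True,
    "minSynchronousStandbys": 1,  # a commit must reach at least one standby
    "maxSynchronousStandbys": 1,
}

subprocess.run(
    [
        "stolonctl",
        "--cluster-name", "kube-stolon",
        "--store-backend", "kubernetes",
        "--kube-resource-kind", "configmap",
        "update", "--patch", json.dumps(patch),
    ],
    check=True,
)
```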

Fail Interval

This is the interval, after the first failure, within which the Stolon Sentinel service declares a master (keeper) not healthy. The default value is 20 seconds, but we lowered it to 5 seconds for faster recovery.

Streaming replication

Our current setup eliminates the requirement for shared storage between the master and standbys, since it uses Postgres streaming replication. With streaming replication, all standbys stay in sync with the master keeper.

Parameter changes

We can alter Postgres parameters using the Stolon cluster features, eliminating the need for any downtime beyond our failover switch time. The failover switch mechanism ensures the change is applied as a rolling update.
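For example, the max_connections increase described in the next section can be rolled out through the cluster spec instead of editing postgresql.conf on each pod. A hedged sketch, under the same stolonctl and cluster-name assumptions as above:

```python
# Hedged sketch: roll out a Postgres parameter change through the Stolon cluster spec.
# pgParameters values are passed as strings; names below match this environment's assumptions.
import json
import subprocess

patch = {"pgParameters": {"max_connections": "10000"}}

subprocess.run(
    [
        "stolonctl",
        "--cluster-name", "kube-stolon",
        "--store-backend", "kubernetes",
        "--kube-resource-kind", "configmap",
        "update", "--patch", json.dumps(patch),
    ],
    check=True,
)
# Stolon propagates the change to every keeper; parameters that require a Postgres
# restart may also need the cluster's automatic restart option enabled.
```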

Max_connections

We increased max_connections from 100 to 10,000 so that concurrent processes can handle the maximum number of transactions at a time.

Failover master Switching

Once the existing master is lost, a standby pod is elected as the new master by the Stolon Sentinel, and connections are maintained through the Stolon proxy service. Since data is synchronized between all pods through streaming replication, there are no data mismatches. The new master serves until it experiences an issue of its own. During the master change, the proxy service also redirects connections to the new master pod.

The master switch happens within a delay of 10 to 12 seconds: once the master connection is lost, the cluster elects another standby as the new master and switches connections to it within 12 seconds.
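From the application side, that window shows up as dropped connections, so clients should reconnect rather than fail hard. A minimal sketch, with the connection string and retry budget as assumptions of this example:

```python
# Hedged sketch: retry writes across the 10-12 second master switch instead of failing hard.
# The connection string and retry budget are illustrative assumptions.
import time
import psycopg2
from psycopg2 import OperationalError

DSN = ("host=stolon-proxy-service.default.svc.cluster.local port=5432 "
       "dbname=appdb user=app_user password=app_password")

def execute_with_retry(sql, params=None, attempts=8, delay_seconds=2):
    """Keep retrying through the proxy until the new master accepts connections."""
    for attempt in range(1, attempts + 1):
        try:
            conn = psycopg2.connect(DSN)
            try:
                with conn, conn.cursor() as cur:  # commits on success
                    cur.execute(sql, params)
                return
            finally:
                conn.close()
        except OperationalError:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)  # 8 x 2 s comfortably covers the ~12 s failover window

execute_with_retry("INSERT INTO heartbeat (seen_at) VALUES (now())")
```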

Need More Information?

PostgreSQL: https://www.postgresql.org/

Stolon: https://github.com/sorintlab/stolon



Using Sub-Second Data to Fight Latency in Streaming Video

Closing the performance gap between video and traditional television will require fast data.

Originally posted to LinkedIn | July 13, 2018

When it comes to video streaming, defining and understanding latency has become confusing and controversial. Latency factors into many aspects of video delivery, and yet what it means, how it is measured, and how we can improve it are topics not frequently discussed in the world of OTT and online video. Why? Because it requires vendor cooperation, a new set of data, and a fast infrastructure that has only been implemented by video giants like Netflix, Facebook, Amazon and Google.

What these companies have is a data-driven end-to-end infrastructure that enables them to adapt to issues from CDN origin to device without manual changes. Creating an end-to-end intelligent infrastructure is not going to happen overnight, but that doesn’t mean there aren’t changes we can all make today that would be immediately impactful. The future of OTT and online video stacks can look much brighter, simpler, and more manageable if we can leverage data to empower changes within a content distributor’s infrastructure, and within others’.

The Importance of Low-Latency Data for Video

Latency is the delay between output and reception of information. For content distributors, two types of latency are critically important. The first is video latency — or the delay between a content transmission’s initiation and completion. It’s an important factor to control, especially for live events, where audiences want an “immediate” experience, as close to physical presence as possible.

Video latency is built into linear TV, which has an infrastructure that is less prone to variability than the internet. It is typically about 5 seconds, according to Brightcove, and exists to give distributors control over what gets broadcast (Janet Jackson at Super Bowl XXXVIII comes to mind). Linear providers have the luxury of established distribution channels, which gives them the privilege of building in latency. For online video, the internet is too haphazard and unpredictable, and latency in live video (for example) can be anywhere from 30 seconds to minutes behind a linear broadcast.

The second type is data latency: the delay in the “returning” information that describes the streaming session and content playback, like QoS data points. For video distributors to get a handle on the latency with which they are streaming video, they need up-to-the-second data to accurately diagnose problems and improve quality.
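As a small illustration of the second kind, the sketch below computes data latency as the gap between when an event was captured on the client and when it becomes available for analysis. The event fields and the one-second budget are assumptions of this example, not a defined Datazoom schema.

```python
# Hypothetical sketch: measure data latency as the capture-to-availability delay per event.
# The field names and the 1-second budget are illustrative assumptions.
from datetime import datetime, timezone

DATA_LATENCY_BUDGET_S = 1.0

def data_latency_seconds(event):
    """Seconds between client-side capture and the moment the event is processed."""
    captured = datetime.fromisoformat(event["captured_at"])
    return (datetime.now(timezone.utc) - captured).total_seconds()

event = {"captured_at": "2018-07-13T17:00:00.250+00:00", "metric": "rebuffer_start"}
latency = data_latency_seconds(event)
if latency > DATA_LATENCY_BUDGET_S:
    print(f"stale data: {latency:.2f} s old, too slow to drive automated decisions")
```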

Latency for Incident Detection

If the quality of video playback is affected, real-time data can help identify which service, or where along the video supply chain, the hiccup is occurring. Maybe it’s a player-side issue that the engineering team needs to address. Perhaps there is an error occurring over the CDN. Having real-time data means being able to predict consequences faster, like users calling in to complain about the service, or preventing an inflammatory tweet by recognizing issues before the world publishes them first. Regardless of the disturbance, only real-time data informs managers about their video delivery stack and where to pinpoint problems.
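A minimal sketch of that kind of pinpointing: group the most recent playback events by CDN and flag any CDN whose error rate jumps past a threshold. The field names, the window, and the 5% threshold are assumptions for illustration.

```python
# Hypothetical sketch: flag the CDN whose short-window error rate spikes past a threshold.
# The field names, the window, and the 5% threshold are illustrative assumptions.
from collections import defaultdict

def suspect_cdns(playback_events, error_rate_threshold=0.05):
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in playback_events:  # e.g. events from the last 60 seconds
        totals[event["cdn"]] += 1
        errors[event["cdn"]] += 1 if event["error"] else 0
    return {cdn: errors[cdn] / totals[cdn]
            for cdn in totals
            if errors[cdn] / totals[cdn] > error_rate_threshold}

window = [
    {"cdn": "cdn_a", "error": False}, {"cdn": "cdn_a", "error": False},
    {"cdn": "cdn_b", "error": True},  {"cdn": "cdn_b", "error": False},
]
print(suspect_cdns(window))  # {'cdn_b': 0.5}
```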

It’s An Uphill Battle for CDNs Against Distribution Latency

In order for OTT to have the ability to control video latency with the same precision as linear television, and provide the same viewing experience, content distributors need to look deeper into their video delivery pathways and have better visibility within each handoff from origin to end-user.

Today, an overwhelming amount of pressure (and consideration) for optimizing the entire video delivery pathway falls on one point: the CDN. This is because the CDN is a service directly contracted by the content distributor. When it comes to end-to-end optimization, it’s generally accepted that the fewer file handoffs or “hops” to get to the end-user, the faster the speed and the higher the quality, as fewer hops mean fewer chances of congestion and packet loss, which cause delivery quality and image quality issues. This is one of the criteria used to balance video traffic within a CDN’s own network, where it can choose to send data over a Transit provider or peer directly with an ISP to reach the end user. Sometimes one CDN isn’t enough, and in order to reach a worldwide audience, build in redundancy, or control costs, many content distributors use multiple CDNs and CDN load-balancing platforms (like Cedexis, NicePeopleAtWork’s SmartSwitch and Conviva’s Precision) to help choose the best delivery pathway for each playback.

However, it seems as though the industry is placing all of its focus on addressing just a single step in a truly multi-step process. Anyone who has studied supply chains knows that improved throughput upstream has no effect on overall efficiency if the throughput downstream cannot be matched, and vice versa. Therefore, there is a natural “cap” on what we can gain by optimizing just a single part of the process. If we really want to optimize the end-to-end delivery chain, we need to build in consistency, quality and control across the end-to-end spectrum.

Understanding Transit, and How CDNs Make Transit Contracts

When a video file leaves a CDN, there are two options for the next step in the delivery path: either the file is handed off to the end-user’s ISP, or to a Transit provider. CDNs contract with Transit providers as a way of “extending” their infrastructure to connect to their audience’s ISPs. The Transit provider may have a direct relationship with the end-user’s ISP, or sometimes it will need to hand off the file to another Transit provider (and another, and another…) until the connection with the ISP is found. To some extent, the SLA of a CDN contract may only be as valuable as the contracts it has downstream.

But not all contracts are created equal. Different SLAs and guarantees are made with each type of Transit contract: contracts with an SLA, “Best Effort”, and even “No Effort”, priced in descending order. A contract with an SLA will “reserve” part of the connection throughput, with SLAs for packet loss and latency. A “Best Effort” contract means that there is no SLA, but traffic will take priority over “No Effort” traffic. There are cases where, even if a CDN has a direct peering relationship with an ISP, the CDN might still push traffic over a Transit provider if the throughput at the peering point with the ISP goes above and beyond their contract.

So, why should content distributors care about peering agreements and Transit contracts? Because they can be a driving force (or bottleneck) for driving quality viewing experiences.

Optimizing Against Latency in Video Delivery is an End-to-End Effort

Improvements can be made within the content delivery chain through the use of data acting as a real-time performance feedback loop. Content distributors benefit from the visibility provided by QoS Analytics tools like Conviva’s Pulse, NicePeopleAtWork’s YOUBORA and Mux, but we have yet to see this data become actively incorporated by the stakeholders involved in the rest of the video delivery chain. Although partnerships between QoS Analytics platforms and CDNs are not uncommon, the benefit is mostly for the CDNs to know what their customers are seeing and to be able to predict the reports of problems and complaints.

The act of sharing data (which is done today) and the act of using that data are two very different scenarios. In order to use data, it must be provided in both a format and a time-frame that fit the back-end of the receiving entity, like a CDN. Data that is collected at a different frequency, packaged with other data, or simply old no longer represents the latest conditions of the delivery network, and therefore automated changes cannot be programmed against it. There is a big opportunity for CDNs that have software-driven networking maps to turn them into intelligent networking maps if they’re able to use data from the client as a feedback mechanism.

The real question for CDNs is: if they had whatever real-time data they wanted coming from the client, would they balance their networks differently? From Datazoom’s discussions with the industry, the answer is yes. Just like trying to choose a flight when you don’t know the total flight duration or the number of layovers, CDNs are blind to key points of information, which makes optimizing paths downstream not just difficult, but impossible.

The Potential Impact of Data Sharing Across the Delivery Chain

But the impact potential for data sharing doesn’t stop with the CDN. Transit providers as well as ISPs can use real-time feedback loops of data to detect and route around issues within their own networks as well. Let’s look at ISPs for an example: when an end-user is streaming content on Netflix and the stream fails, who do they blame? Likely either the content distributor or the ISP. Calls into an ISP’s customer service center can create a significant cost center.

If ISPs used a real-time feedback loop of data to understand challenges in their own last mile of delivery, they could not only reduce customer service calls but also improve overall service reliability. The only problem is that today the content distributor is the only one who collects and has access (if at all) to this data. With real-time collection, data segmentation (to separate technical data from business data), and routing, content distributors should be motivated to share data with ISPs in order to get better performance for their content, for what is truly a shared end-user customer.

Control the Video Supply Chain with Adaptive Video Logistics

Adaptive Video Logistics is a new class of software invented and pioneered by Datazoom. It returns data ownership to the content distributor and gives customers the flexibility to design data pipes that carry data to wherever it is needed, whether that be an analytics tool or a CDN platform. We care about latency too: unlike other “real-time” tools, Datazoom collects and routes data in under 1 second, guaranteed. At Datazoom, we are dedicated to helping content distributors overcome the pitfalls associated with slow, siloed data.
