
Setting up PostgreSQL on Kubernetes using Stolon

Overview

PostgreSQL is a powerful object-relational database management system (ORDBMS). Its ACID-compliant, transactional nature and its reputation for reliability, feature robustness, and performance lead many companies to use it for managing their master data. For production deployments, it is important for applications to run in a highly available (HA) environment. PostgreSQL supports HA, but setting it up correctly in a Kubernetes environment has been a challenge. Stolon is a cloud-native PostgreSQL manager that maintains HA. It runs well in a Kubernetes environment and builds on PostgreSQL’s native replication mechanisms to add more value to the high-availability feature.

Why a ‘Stolon’ PostgreSQL Cluster?

Implementing a PostgreSQL cluster inside Kubernetes is always a challenge, since stateful services do not integrate with Kubernetes as directly as stateless ones. The well-known methods for implementing such clusters are sorintlab’s Stolon, the CrunchyData PostgreSQL cluster, and Zalando’s Patroni/Spilo PostgreSQL cluster.

As it stands, it is my opinion that Stolon is the best method for implementing a PostgreSQL cluster inside Kubernetes because of:

  • The High Availability of PostgreSQL data

  • Open source data storage services

  • Better customization of PostgreSQL versions based on application requirements

  • Its ability to easily modify service names, database names, and user access privileges

  • Automated failover switching with very minimal delay

  • High resiliency

  • Easy cluster scaling

  • Easy replication

Some information about Postgres Cluster configuration

A PostgreSQL cluster consists of:

  • Stolon Cluster

  • Stolon Sentinel(s)

  • Stolon Keepers

  • Stolon Proxies

Stolon Cluster

A highly available PostgreSQL cluster is implemented in Kubernetes with the help of a Stolon cluster, and all of the cluster configuration is passed through a ConfigMap (stolon-cluster-kube-stolon). Any update to a Postgres parameter can also be rolled out as a rolling update through this ConfigMap.

Note: Wait for this Stolon cluster component to be available before setting up the rest of the PostgreSQL cluster (the keepers, sentinels, and proxies described below).
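
As a quick sanity check, the current cluster view and specification can be inspected with stolonctl. This is a minimal sketch that assumes the cluster name is kube-stolon (matching the ConfigMap name above) and that the Kubernetes store backend is in use; adjust names and namespace to your deployment.

    # Show the current cluster view (master, standbys, sentinels, proxies)
    stolonctl status --cluster-name kube-stolon \
      --store-backend kubernetes --kube-resource-kind configmap

    # Print the cluster specification currently stored in the ConfigMap
    stolonctl spec --cluster-name kube-stolon \
      --store-backend kubernetes --kube-resource-kind configmap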

Stolon Keeper

A PostgreSQL database engine runs as a Stolon keeper service, implemented as a StatefulSet with persistent volumes. Each pod in the StatefulSet acts as either the master or a standby of the cluster. Data synchronization between the cluster members (master and standbys) is performed with the help of a dedicated Postgres replication user. Every keeper MUST have a different UID, which can either be provided manually (--uid option) or generated automatically. The master election takes place based on this UID. After the first start, the keeper UID (provided or generated) is saved inside the keeper’s data directory.
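
A common pattern (used in the upstream Stolon Kubernetes examples) is to derive the keeper UID from the StatefulSet pod name, so each replica keeps a stable, unique identity across restarts. The following startup-command sketch assumes the cluster name kube-stolon and a data directory at /stolon-data; treat it as an illustration rather than a drop-in manifest.

    # Sketch of a keeper entrypoint: derive a stable UID such as "keeper0" from the pod ordinal
    ORDINAL="$(hostname | grep -o '[0-9]*$')"
    exec stolon-keeper \
      --uid "keeper${ORDINAL}" \
      --cluster-name kube-stolon \
      --store-backend kubernetes --kube-resource-kind configmap \
      --data-dir /stolon-data \
      --pg-listen-address "$(hostname -i)"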

Stolon Sentinel(s)

A sentinel discovers and monitors the Stolon keepers and calculates the optimal cluster view. The sentinel uses the UIDs of the master and standby keepers to monitor and keep track of each one. The sentinel service is set up as a Deployment.
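
Because the sentinel is stateless, its Deployment only needs to point at the same cluster store as the keepers. A minimal sketch of the container command, again assuming the cluster name kube-stolon:

    # Sketch of a sentinel container command; no persistent storage is needed
    exec stolon-sentinel \
      --cluster-name kube-stolon \
      --store-backend kubernetes --kube-resource-kind configmap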

Stolon Proxies

A Stolon proxy exposes the PostgreSQL service endpoint behind a fixed IP and DNS name for accessing the PostgreSQL service. The proxy switches the master connection whenever a failover changes the master. The stolon-proxy acts as a sort of fencer, since it closes connections to old masters and directs new connections to the current master.
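
Clients should always connect through the proxy, so the proxy pods are typically fronted by an ordinary Kubernetes Service that provides the fixed DNS name. This is a sketch under assumed names and labels (stolon-proxy); match the selector to the labels used by your proxy Deployment.

    # stolon-proxy-service.yaml -- apply with: kubectl apply -f stolon-proxy-service.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: stolon-proxy-service
    spec:
      selector:
        component: stolon-proxy
      ports:
      - port: 5432
        targetPort: 5432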

PostgreSQL Users

Stolon requires two kinds of users (a sketch of how they are typically supplied to the keepers follows this list):

The Superuser

  • manages/queries the keepers’ controlled instances (over normal, non-replication connections)

  • executes pg_rewind-based resync (if enabled)

The Replication user

  • manages/queries the keepers’ controlled instances

  • performs replication between postgres instances
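
As noted above, here is a sketch of how these two users are commonly supplied to the keepers on Kubernetes: their passwords live in a Secret, and the keeper flags point at the mounted password files. The Secret name, user names, and mount paths below are illustrative assumptions.

    # Sketch: store the two passwords in a Secret (names are illustrative)
    kubectl create secret generic stolon-credentials \
      --from-literal=su-password='change-me' \
      --from-literal=repl-password='change-me-too'

    # The keepers then reference the mounted files, e.g. by appending to the keeper
    # command sketched earlier:
    #   --pg-su-username stolon     --pg-su-passwordfile   /etc/secrets/stolon/su-password
    #   --pg-repl-username repluser --pg-repl-passwordfile /etc/secrets/stolon/repl-password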

Postgres Cluster for HA Production deployments

In order to obtain high availability and resilience, we have customized the default PostgreSQL cluster parameters. What follows is a description of the setup of our environment.

Synchronous Replication

PostgreSQL offers “synchronous replication” (SR) as an option for data availability. By default this option is disabled. We have enabled it so that transactions are committed on one or more replicas before a success message is returned to the database client. This guarantees that if the client saw a commit message, the data is persisted on at least two nodes (the master and a standby). This option is important when data is valuable enough that we would rather have the database reject writes than risk losing them if the master fails after the commit.
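
With Stolon, synchronous replication is enabled in the cluster specification rather than by editing postgresql.conf on each node. A sketch of the corresponding patch, under the same cluster-name assumption as earlier (the minSynchronousStandbys value is illustrative):

    # Sketch: enable synchronous replication in the Stolon cluster spec
    stolonctl update --cluster-name kube-stolon \
      --store-backend kubernetes --kube-resource-kind configmap \
      --patch '{ "synchronousReplication": true, "minSynchronousStandbys": 1 }'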

Fail Interval

This is the interval, after the first failure, that the Stolon sentinel waits before declaring a master (keeper) as not healthy. The default value is 20 seconds, but we reduced it to 5 seconds for faster recovery.
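
The fail interval also lives in the cluster specification, so it can be changed online in the same way (a sketch, with the same assumptions as above):

    # Sketch: declare an unresponsive keeper unhealthy after 5s instead of the 20s default
    stolonctl update --cluster-name kube-stolon \
      --store-backend kubernetes --kube-resource-kind configmap \
      --patch '{ "failInterval": "5s" }'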

Streaming replication

Our current setup eliminates the requirement for shared storage between the master and standbys, since it uses Postgres streaming replication. With streaming replication, all standbys stay in sync with the master keeper.
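
Replication health can be checked directly on the master keeper with the standard pg_stat_replication view. A sketch, assuming the master keeper pod is stolon-keeper-0 and the superuser is named stolon:

    # Sketch: verify that standbys are streaming (and, with SR enabled, synchronous)
    kubectl exec -it stolon-keeper-0 -- psql -U stolon -d postgres \
      -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"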

Parameter changes

We can alter Postgres parameters using the Stolon cluster features, so a change never requires more downtime than our failover switch time. Stolon applies the change across the keepers as a rolling update.

max_connections

We increased max_connections from 100 to 10,000 so that the database can serve a much larger number of concurrent connections and transactions at a time.
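
Postgres parameters such as max_connections are set through the pgParameters map in the cluster specification, and Stolon rolls the change out to the keepers. A sketch of the patch, with the same assumptions as the earlier stolonctl examples:

    # Sketch: raise max_connections via the cluster spec's pgParameters map
    stolonctl update --cluster-name kube-stolon \
      --store-backend kubernetes --kube-resource-kind configmap \
      --patch '{ "pgParameters": { "max_connections": "10000" } }'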

Failover Master Switching

Once the existing master is lost, a standby pod is elected as the new master by the Stolon sentinel, and client connections are maintained accordingly with the help of the stolon proxy service. Since data is synchronized between all pods via streaming replication, there will not be any data mismatches. The new master serves until it experiences an issue of its own. During the master change, the proxy service also redirects connections to the new master pod.

The master switch happens within a delay of roughly 10 to 12 seconds: once the master connection is lost, the cluster elects another standby as the new master and switches connections to it within about 12 seconds.
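
Failover is easy to exercise in a test environment by deleting the current master keeper pod and watching the sentinel promote a standby. A sketch, assuming the current master is stolon-keeper-0:

    # Sketch: simulate a master failure and watch the cluster converge on a new master
    kubectl delete pod stolon-keeper-0

    # Re-run this until the cluster view shows a standby promoted to master
    stolonctl status --cluster-name kube-stolon \
      --store-backend kubernetes --kube-resource-kind configmap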

Need More Information?

PostgreSQL: https://www.postgresql.org/

Stolon: https://github.com/sorintlab/stolon



Using Sub-Second Data to Fight Latency in Streaming Video

Closing the performance gap between video and traditional television will require fast data.

Originally posted to LinkedIn | July 13, 2018

When it comes to video streaming, defining and understanding latency has become confusing and controversial. Latency factors into many aspects of video delivery, and yet what it means, how it is measured, and how we can improve it are topics not frequently discussed in the world of OTT and online video. Why? Because it requires vendor cooperation, a new set of data, and a fast infrastructure that has only been implemented by video giants like Netflix, Facebook, Amazon and Google.

What these companies have is a data-driven end-to-end infrastructure that enables them to adapt to issues from CDN origin to device without manual changes. Creating an end-to-end intelligent infrastructure is not going to happen overnight, but it doesn’t mean that there aren’t changes we can all be making today that can be immediately impactful. The future of OTT and online video stacks can look much brighter, simpler, and more manageable if we can leverage data to empower changes within a content distributor’s infrastructure, and within others.

The Importance of Low-Latency Data for Video

Latency is the delay between output and reception of information. For content distributors, two types of latency are critically important. The first is video latency — or the delay between a content transmission’s initiation and completion. It’s an important factor to control, especially for live events, where audiences want an “immediate” experience, as close to physical presence as possible.

Video latency is built into linear TV, which has an infrastructure less prone to variability than the internet. It is typically about 5 seconds, according to Brightcove, and exists to give distributors control over what gets broadcast (Janet Jackson at Super Bowl XXXVIII comes to mind). Linear providers have the luxury of established distribution channels, which gives them the privilege of building in latency. For online video, the internet is too haphazard and unpredictable, and latency in live video (for example) can be anywhere from 30 seconds to minutes behind a linear broadcast.

The second type of latency is data latency: the delay in the “returning” information describing the streaming session and content playback, like QoS data points. For video distributors to get a handle on the latency with which they are streaming video, they need up-to-the-second data to accurately diagnose problems and improve quality.

Latency for Incident Detection

If the quality of video playback is affected, real-time data can help identify which service, or where along the video supply chain, the hiccup is occurring. Maybe it’s a player-side issue which the engineering team needs to address. Perhaps there is an error occurring over the CDN. Having real-time data means being able to predict consequences faster, like users calling in to complain about the service, or preventing an inflammatory tweet by recognizing issues before the world publishes them first. Regardless of the disturbance, only real-time data informs managers about their video delivery stack and where to pinpoint problems.

It’s An Uphill Battle for CDNs Against Distribution Latency

In order for OTT to have the ability to control video latency with the same precision as linear television, and provide the same viewing experience, content distributors need to look deeper into their video delivery pathways and have better visibility within each handoff from origin to end-user.

Today, an overwhelming amount of pressure (and consideration) for optimizing the entire video delivery pathway falls on one point: the CDN. This is because the CDN is a service directly contracted by the content distributor. When it comes to end-to-end optimization, it’s generally accepted that the fewer file handoffs or “hops” to get to the end-user, the faster the speed and the higher the quality, as fewer hops mean fewer chances of congestion and packet loss, which cause delivery quality and image quality issues. This is one of the criteria used to balance video traffic within a CDN’s own network, where they can choose to send data over a Transit provider or peer directly with an ISP to reach the end user. Sometimes one CDN isn’t enough, and in order to reach a worldwide audience, build in redundancy, or control costs, many content distributors use multiple CDNs and CDN load-balancing platforms (like Cedexis, NicePeopleAtWork’s SmartSwitch and Conviva’s Precision) to help choose the best delivery pathway for each playback.

However, it seems as though the industry is placing all of its focus on addressing just a single step in a truly multi-step process. Anyone who has studied the supply chain knows that improved throughput upstream has no effect on overall efficiency if the throughput downstream cannot be matched, and vice versa. Therefore, there is a natural “cap” to what we can gain by optimizing just a single part of the process. If we really want to optimize the end-to-end delivery chain, we need to build in consistency, quality and control across the end-to-end spectrum.

Understanding Transit, and How CDNs Make Transit Contracts

When a video file leaves a CDN, there are two options for the next step in the delivery path: either the file is handed off to the end-user’s ISP, or to a Transit provider. CDNs contract with Transit providers as a way of “extending” their infrastructure to connect to their audience’s ISPs. The Transit provider may have a direct relationship with the end-user’s ISP, or sometimes it will need to hand off the file to another Transit provider (and another, and another…) until the connection with the ISP is found. To some extent, the SLA of a CDN contract may only be as valuable as the contracts they have downstream.

But not all contracts are created equal. Different SLAs and guarantees come with each type of Transit contract: contracts with an SLA, “Best Effort”, and even “No Effort”, priced in descending order. A contract with an SLA will “reserve” part of the connection throughput, with SLAs for packet loss and latency. A “Best Effort” contract means there is no SLA, but the traffic will take priority over “No Effort” traffic. There are cases where, even if a CDN has a direct peering relationship with an ISP, the CDN might still push traffic over a Transit provider if the throughput at the peering point with the ISP goes above and beyond their contract.

So, why should content distributors care about peering agreements and Transit contracts? Because they can be a driving force (or bottleneck) for driving quality viewing experiences.

Optimizing Against Latency in Video Delivery is an End-to-End Effort

Improvements can be made within the content delivery chain through the use of data acting as a real-time performance feedback loop. Content distributors benefit from the visibility provided by QoS Analytics tools like Conviva’s Pulse, NicePeopleAtWork’s YOUBORA and Mux, but we have yet to see this data become actively incorporated by the stakeholders involved in the rest of the video delivery chain. Although partnerships between QoS Analytics platforms and CDNs are not uncommon, the benefit is mostly for the CDNs to know what their customers are seeing and to be able to predict the reports of problems and complaints.

The act of sharing data (which is done today) and the act of using that data are two very different scenarios. In order to use data, it must be provided in both a format and a time-frame that fit the back-end of the receiving entity, like a CDN. Data that is collected at a different frequency, packaged with other data, or simply old no longer represents the latest conditions of the delivery network, and therefore we cannot program automated changes against it. There is a big opportunity for CDNs that have software-driven networking maps to have intelligent networking maps if they’re able to use data from the client as a feedback mechanism.

The real question for CDNs is: if they had whatever real-time data they wanted coming from the client, would they balance their networks differently? From Datazoom’s discussions with the industry, the answer is yes. Just like trying to choose a flight when you don’t know the total flight duration or the number of layovers, CDNs are blind to key points of information, which makes optimizing paths downstream not just difficult, but impossible.

The Potential Impact of Data Sharing Across the Delivery Chain

But the potential impact of data sharing doesn’t stop with the CDN. Transit providers as well as ISPs can use real-time feedback loops of data to detect and route around issues within their own networks. Let’s look at ISPs for an example: when an end-user is streaming content on Netflix and the stream fails, who do they blame? Likely either the content distributor or the ISP. Calls into an ISP’s customer service center can create a significant cost center.

If ISPs used a real-time feedback loop of data to understand challenges in their own last mile of delivery, they could not only reduce customer service calls but also improve overall service reliability. The only problem is that today the content distributor is the only one who collects and has access (if at all) to this data. With real-time collection, data segmentation (to separate technical data from business data), and routing, content distributors should be motivated to share data with ISPs in order to get better performance for their content, for what is truly a shared end-user customer.

Control the Video Supply Chain with Adaptive Video Logistics

Adaptive Video Logistics is a new class of software invented and pioneered by Datazoom. It returns data ownership to the content distributor and gives customers the flexibility to design data-pipes that carry data to wherever it is needed, whether that be an analytics tool or a CDN platform. We care about latency too: unlike other “real-time” tools, Datazoom collects and routes data in under 1 second, guaranteed. At Datazoom, we are dedicated to helping content distributors overcome the pitfalls associated with slow, siloed data.


Building A Scalable Strategy for Video Delivery

A Guide for Building a Data Architecture ready for Real-Time Decisioning and Automation

Originally posted to LinkedIn | June 8, 2018

Over the course of the nearly twenty-three years since the dawn of the streaming media space, beginning with that fateful Yankees-Mariners game in 1995, the industry has grown substantially, evolved its strategic approaches and technologies utilized, and then re-evolved accordingly to keep pace with the demands of the next generation. Today, online video commands 20% of total streaming hours, and counting.

What was once a concerted fight for legitimacy has transitioned into a quest for efficient and effective business development — with the biggest challenge now shifting from distribution to achieving sustainable margins. Thanks to advances in technology, audiences have benefited from a wider range of content, available to them at any time, in any place, and on any device. Although this has led to a paradigm shift in consumer expectations and the disruption of the linear television market, the business of streaming media has not caught up to its predecessors, with streamers struggling to monetize their content.

The challenge is two-sided: Content distributors need to increase demand for their content and grow earnings coming from online distribution (revenue), while at the same time they need to get more from less, and build greater efficiency and stability into the resources they are working with today (COGS). Despite falling price points for some key systems (like CDN), we have not evolved how we handle the dependencies, fail-overs and trigger points for how critical infrastructure and peripheral systems should work together to maximize potential.

We’ve reached a point where solving the latter challenge actually helps us kill two birds with one stone. If we want to build more profitable streaming businesses, we need to upgrade our core infrastructure technologies, not by replacing any one system but by creating technology glue that holds all of these systems together. Video distributors need a fundamentally different approach for harmonizing interactions between services, providing greater flexibility and ability to self-heal when failures are detected.

What is Adaptive Video Logistics?

Adaptive Video Logistics (AVL) is a new category of software for the streaming media industry, offering better data collection pipelines and integrations for content distributors. Datazoom is pioneering this software category with our patent-pending technology, which is uniquely able to pull data from any software-delivered environment (e.g. a webpage or video player framework) using “Collectors” — and push high-frequency, sub-second data to various software tools and destinations — what we call “Connectors”. We have worked with some awesome companies who have joined our growing ecosystem at Datazoom.

So far we’ve built Collectors for iOS, Android, Anvato, Brightcove, HTML5, JWPlayer, & THEOPlayer. We’ve completed Connectors for Amplitude Analytics, Google Analytics, Heap Analytics, YOUBORA Analytics, Adobe Analytics (Omniture), New Relic Insights, Google BigQuery, Amazon Kinesis Data Streams, Datadog, Keen IO, and Mixpanel. New integrations are released every 1–2 weeks.

From Datazoom, which acts as an integration mission control, you can manage pipelines of data moving from Collectors to Connectors without needing developers to deploy new code. In addition, our platform is redefining real-time when it comes to data: we operate with sub-second latency, enhancing the usability of the metrics output by other platforms. But the true value of Datazoom is found in its strategic utilizations.

Three Pillars of Data for Adaptive Video Logistics

1. Efficiency — Simplified Data Collection

Most content distributors use several analytics tools, each of which requires a separate SDK to capture data. There are several consequences of using multiple SDKs: the added weight caused by each script has led to what is now known in the industry as “player bloat.” Furthermore, data collected for each tool not only ends up in silos but also adds weight to the payload. But the real implication is the damage to the overall usefulness of the data — any attempt to unify data gathered from disparate sources has to face the challenges of data inconsistency, variability and duplication.

As a solution, the Datazoom SDK ensures efficient and consistent data collection and creates a single source for any data to be routed out, maximizing data utility.

2. Latency — Sub-Second Latency

Real-time seems to have a variety of definitions these days… In video streaming, conditions aren’t changing hour-to-hour or minute-to-minute, but second-to-second. And the stakes are high — 2 seconds is all it takes to lose a viewer. For an industry whose conditions can change so abruptly (with significant potential consequences), why do we permit our data to be minutes or even hours behind?

Datazoom is laying the groundwork (specifically the data pipelines) required to usher in a new era of video operations. It can enable content distributors to make decisions as changes occur, and to leverage the limitless scale of AI and machine learning to power better video experiences — automation. However, the strength of a decision (whether made by man or machine) is only as good as the information it’s given. We must begin to see that the value of our data is inversely correlated with latency — the greater the latency, the smaller the value.

We should begin to shift our focus not only to ensuring we have access to the data we need, but to getting it with as little delay as possible, so decisions can be made as soon as changes occur. We’ve built Datazoom to capture data in less than a second. In fact, we guarantee it under our SLA. We know that this is what will be required to grow the business of online video and match the experience of traditional television.

3. Utility — Creating the Data Feedback Loop

With a unified dataset being updated in real-time, video operations departments are in a better position than ever to receive meaningful feedback from distributed video. But the utility of data doesn’t stop there — the effort of video delivery requires alignment with critical, external services along the video delivery path, from Encode to Device. Data collected from the video player can be used to improve the hand-offs between these systems that ultimately impact the user experience and business of video.

Datazoom enables content distributors to provision access and distribute data in a way that creates a sub-second latency feedback loop for key providers. Since Datazoom can segment data and parse it into various intervals, we ensure that the data we capture can fit the unique backends of any system, increasing the total utility of data.
