
Delivering at Ludicrous-Velocity

March 30, 2017 Eagle Eye Networks


An account of how Continuous Delivery enabled us to quickly diagnose and scale in response to recent issues with our Motion Notification system. While chasing a wild goose on my adventure, I accidentally gathered some Gevent benchmarks worth sharing.


“They’ve gone plaid” — Spaceballs 1987

We ARE a high-velocity application delivery company. That’s only been true recently, and the transformation has been exciting. This post chronicles how we recently approached and resolved an issue with ludicrous speed.

I won’t dive into the gritty details of our setup in this article. My intent is to illustrate the tight instrument-analyze-adjust cycles that let us continually deliver for the customer at high velocity. As for the Spaceballs reference, I think our DevOps director would argue we’re only approaching ridiculous speed. He’s got big plans.

Alerts and Notifications

Motion alerts and system notifications are critical aspects of any video surveillance system. At Eagle Eye Networks, our customers rely on alerts and notifications to be timely and correct. During an incident, seconds matter.

Eagle Eye Networks customers can tailor motion alerts for specific cameras and regions. Camera motion regions have independent motion settings, alert levels, and recipients. Aside from motion alerts, Eagle Eye Networks also provides system notifications to indicate camera status (online/offline). Motion alerts are designed to be delivered immediately, while notifications about camera state changes have some built-in heuristics to prevent spamming users with momentary online/offline status changes, which are typical for networked systems.
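To make the configuration concrete, a per-camera alert setup looks conceptually like the sketch below. The field names are illustrative only, not our actual schema.

# Illustrative per-camera motion alert settings; the field names are
# hypothetical and do not reflect the real Eagle Eye Networks schema.
camera_alert_config = {
    "camera_id": "front-door",
    "regions": [
        {
            "name": "entrance",
            "sensitivity": 0.8,                      # per-region motion settings
            "alert_level": "high",                   # independent alert level
            "recipients": ["manager@example.com"],   # independent recipients
        },
        {
            "name": "parking-lot",
            "sensitivity": 0.5,
            "alert_level": "low",
            "recipients": ["security@example.com"],
        },
    ],
    # System notifications (online/offline) are configured separately and are
    # debounced, unlike motion alerts, which are delivered immediately.
    "notify_on_status_change": True,
}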

System Design

Prior to our recent update, the alert and notification system involved three components — a single alert monitor, a collection of archivers, and a collection of gateways. Archivers continually push device events (ETags) onto the poll stream. Both internal and external services can monitor the poll stream for various device events. The archivers also send out a heartbeat event reporting the current state of a device every 30 seconds. The alert monitor listens on the poll stream of every device for state-change and motion ETags. When the alert monitor determines the event is valid, a notification is sent to the gateway service, where, based on the account settings, an email is sent.
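Conceptually, the alert monitor’s job for a single device boils down to the sketch below. The poll-stream and gateway interfaces are hypothetical stand-ins for our internal services, not real APIs.

# Sketch of the per-device monitoring loop. poll_stream, gateway, and the
# etag fields are hypothetical stand-ins for our internal interfaces.
def monitor_device(poll_stream, gateway, account_id, device_id):
    """Listen for motion and state ETags for one device and forward valid events."""
    last_status = None
    for etag in poll_stream.subscribe(device_id):        # blocks on I/O
        if etag.kind == "motion":
            # Motion alerts are delivered immediately.
            gateway.notify(account_id, device_id, etag)
        elif etag.kind == "status":
            # Heartbeats arrive roughly every 30 seconds; only a real state
            # change matters, and even those are debounced before emailing.
            if etag.status != last_status:
                gateway.notify(account_id, device_id, etag)
            last_status = etag.status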

Running Late

Recently, we noticed that alerts were delayed. Over a short time, the delays were becoming significant — sometimes hours. We had to react fast.

Lead Suspect: Alert Monitor

We verified the poll stream was delivering ETags on time, our email server was not queueing, and our gateway was responding immediately to the alert monitor. All signs indicated the bottleneck was IN the alert monitor. Was it picking up events on time? How many events was it processing? Was it overworked?

We love Gevent, so naturally, we’d leverage the technology as much as possible for I/O-bound tasks. The alert monitor’s main thread spawns a greenlet for each account, which in turn spawns a device state machine (DSM) greenlet for each device. The DSM greenlet attaches to the poll stream for the device and begins listening for ETags and reacting accordingly. By simple math, we figured we were approaching several hundred thousand greenlets running in the monitoring service. Was this too many greenlets for one process? Did we hit some magic greenlet limit?
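Stripped way down, that spawning hierarchy looks roughly like this. The account and device objects are stand-ins for our real classes; the gevent calls are the real ones.

import gevent
from gevent import monkey
monkey.patch_all()   # make socket I/O cooperative so greenlets yield on reads

# Minimal sketch of the alert monitor's greenlet hierarchy.
def run_dsm(account, device):
    # Attach to the device's poll stream and react to ETags as they arrive.
    for etag in device.poll_stream():          # yields on I/O between events
        device.state_machine.handle(etag)

def run_account(account):
    # One device state machine (DSM) greenlet per device.
    gevent.joinall([gevent.spawn(run_dsm, account, d) for d in account.devices()])

def main(accounts):
    # One greenlet per account; each fans out into per-device DSM greenlets.
    gevent.joinall([gevent.spawn(run_account, a) for a in accounts])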

Scale It

Without going any further (being up against time), we thought we might see some improvement by horizontally scaling the alert monitor service. Distributing the work across multiple alert monitors would reduce our greenlets per instance. With a little help from Peter, our DevOps director (by the way, he’s hiring), this was a quick and easy rollout. Like minutes, really. We copied the alert monitor deployment for each of our 9 data centers, configured each alert monitor to service only the accounts in its data center, and Bob’s your uncle — we scaled.

** Like good boy scouts, we had coded the alert monitor a few months prior to limit the accounts by data center. When we decided to scale out, we just had to set an environment variable to control the account filtering (EEN_AGENT_DATACENTER=c001) and restart.
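The filtering itself is about as simple as it sounds; something along these lines, where list_accounts() is a placeholder for however the monitor enumerates accounts.

import os

# Sketch of the data-center account filtering driven by EEN_AGENT_DATACENTER.
def accounts_for_this_instance():
    datacenter = os.environ.get("EEN_AGENT_DATACENTER")   # e.g. "c001"
    accounts = list_accounts()                 # placeholder for account lookup
    if datacenter is None:
        return accounts                        # unscoped: service everything
    return [a for a in accounts if a.datacenter == datacenter]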

We observed that our Austin and Japan data centers were still experiencing delays, but they were under an hour. Our other data centers, with fewer cameras, were rock solid, delivering alerts in seconds. We were moving in the right direction.

Austin Data Center Numbers

Still, was there a magic greenlet limit, or was something else at play? I was suspicious that we were reaching co-routine saturation. It was time to gather some real statistics, given that we now had a few control subjects.

Each device greenlet waits on I/O from the poll stream. When an event arrives, it wakes to do some minimal CPU operations and then waits on the stream again. They are very I/O bound. I stated earlier that each device gets a heartbeat ETag every 30 seconds reporting the current camera status. If the status has not changed, no action is taken, but handling the ETag still requires CPU, preventing other greenlets from executing. In essence, each ETag read is an interrupt that needs attention.

Instrumenting logs with Scalyr

With Continuous Delivery (CD) pipelines, we quickly deployed a small change in the alert monitor to periodically drop a Scalyr event recording the interrupts. We rely on Scalyr to aggregate the logs from all of our microservices. Using the Scalyr web interface, we can slice and dice the logs from across our ecosystem to build a cohesive timeline of events. Scalyr has been the biggest productivity boost for me as a developer for mining events and data from our production environment.
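Because the Scalyr agent ships ordinary log lines for us, the instrumentation can be as small as a counter flushed on a timer. A rough sketch (the event fields and reporting period are illustrative):

import logging
import gevent

log = logging.getLogger("alert_monitor")
interrupt_count = 0    # incremented by the DSM greenlets on every ETag read

def report_interrupts(period=60):
    """Periodically drop a log line that Scalyr picks up and aggregates."""
    global interrupt_count
    while True:
        gevent.sleep(period)
        log.info("interruptReport count=%d periodSec=%d", interrupt_count, period)
        interrupt_count = 0

# Spawned once alongside the per-account greenlets:
# gevent.spawn(report_interrupts)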

At the same time, we made another quick change to the publishing service to calculate the delta between the ETag timestamp and the time the request was handled. This would effectively tell us, from an outside source, how long an event was held in the alert monitor — an alertDelay.
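A sketch of that change in the publishing service; the ETag timestamp field and the exact log format are assumptions, but the idea is a single subtraction at the point the request is handled.

import time
import logging

log = logging.getLogger("gateway")

def handle_notification(etag, notif_type):
    # How long the event sat upstream (in the alert monitor) before reaching us.
    alert_delay = time.time() - etag.timestamp          # seconds
    log.info("alertDelay=%.1f notifType=%s", alert_delay, notif_type)
    # ... continue with the normal email/notification handling ...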

By recording the alertDelay in Scalyr, we could monitor the situation and eventually set up alerts when alertDelay exceeded some threshold. But in the short term, we had a couple of metrics to keep us on course.

** Total time to complete the instrumentation (change code, test, code review, QA, deploy) was about 1 hour (CD for the win!).

Data analysis

Basing our analysis on the Test Data Center, which was working correctly, we calculated that the system needed to be able to process about 13 ETags per device per minute. In the D001-DC, we observed a maximum of 176,000 interrupts/min. This was significantly below the expected number of interrupts for the number of devices supported by the D001-DC. Our I/O-bound problem had become CPU bound. Not good for Gevent.
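As a back-of-the-envelope check: only the 13 ETags/device/min and 176,000 interrupts/min figures are measured; the device count below is a made-up placeholder to show the shape of the math.

ETAGS_PER_DEVICE_PER_MIN = 13               # derived from the healthy Test Data Center
OBSERVED_MAX_INTERRUPTS_PER_MIN = 176_000   # ceiling observed in the D001-DC

# How many devices can a single alert monitor keep up with at that ceiling?
max_devices = OBSERVED_MAX_INTERRUPTS_PER_MIN // ETAGS_PER_DEVICE_PER_MIN
print(max_devices)                          # ~13,500 devices

# Placeholder fleet size, chosen only to illustrate the shortfall.
devices_in_dc = 25_000
required = devices_in_dc * ETAGS_PER_DEVICE_PER_MIN
print(required)                             # 325,000 interrupts/min needed vs 176,000 observed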

We could solve this in two ways: further distribute the D001-DC load or reduce the interrupts. Recall that most of the interrupts were ignored, as they were repeating the same camera status, wasting resources. It was clear we needed to modify the archivers to only send ETags when the status actually changed. After we rolled out the change, we watched an immediate 98% improvement unfold in the Scalyr logs.
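Conceptually, the archiver change is tiny: remember the last status reported per device and skip the heartbeat ETag if nothing changed. A sketch (the publish call is a stand-in for the real poll-stream interface):

# Suppress heartbeat ETags whose status is unchanged, so the poll stream
# only carries real transitions.
_last_status = {}    # device_id -> last reported status

def maybe_publish_status(poll_stream, device_id, status):
    if _last_status.get(device_id) == status:
        return                                   # unchanged: skip the heartbeat
    _last_status[device_id] = status
    poll_stream.publish(device_id, {"kind": "status", "status": status})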


Scalyr logs as the archiver update was deployed. That drop is gorgeous!

Instrumenting alert delays with Scalyr

We put some thought into the format of the alertDelay log event so that we could use the Scalyr log parsing feature to pull out the $notifType (Alert, Offline, Online, InternetOffline) and the numeric $alertDelay.

Why? Because then we could filter for events using numerical comparison to find long delays, and trigger system alerts for the engineers on any event over a threshold ($alertDelay > 300).


Delay between the inception of the event and it being published

And, for the cool factor, graphs are great to show off, especially when you can see the positive effects of your work as it’s being deployed.


Motion Alert — Average delay (in seconds)


Online and Offline Notification — Average delay (in seconds). NOTE: there is a 300-second and a 95-second hold time in the Alert Monitor for state events to debounce flaky network connections.

Greenlet Benchmarks

While we were gathering metrics about the delays and interrupts, I took some time to see if I could figure out where Gevent breaks down. I really wanted to understand if there was a saturation point, and I was concerned about the scheduling. Greenlets “will be executed one by one, in an undefined order” according to the documentation. UNDEFINED ORDER?! This speaks to the scheduling of greenlets, and, if it was random, could greenlets starve? Were some greenlets favored? I needed to know. Of course, I could just read the code for the scheduling answer, but that’s no fun. I like graphs.


Latency Benchmarks

I read somewhere that performance starts to decline around 100,000 greenlets. I was curious about a few latency metrics. When X greenlets are ready to execute, what is the latency for the first greenlet to execute, and what is the latency of the middle quartile? My script was simple: spawn X greenlets to wait on an event from the main greenlet, record the time, and set the event to start the gathering. As each greenlet executes, it records the start time in a list.

When the greenlets were all done, I computed the delta for each start time and plotted the results on the box chart below using plotly.
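The script boiled down to something like this; a reconstruction of the idea rather than the exact code, but the gevent calls are the real ones.

import time
import gevent
from gevent.event import Event

def benchmark(n):
    """Spawn n greenlets that all wait on one Event, then measure how long
    each one takes to get scheduled after the event is set."""
    start_event = Event()
    wake_times = []

    def worker():
        start_event.wait()                # every greenlet blocks here
        wake_times.append(time.time())    # record when this greenlet ran

    greenlets = [gevent.spawn(worker) for _ in range(n)]
    gevent.sleep(0)                       # give the workers a chance to reach wait()
    t0 = time.time()
    start_event.set()                     # release them all at once
    gevent.joinall(greenlets)
    return [t - t0 for t in wake_times]   # per-greenlet scheduling latency

# e.g. latencies = benchmark(100_000)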


Box Plot of Latency

Findings: there is a noticeable increase in latency at 100,000 greenlets. Also of interest is how much deviation there is once you hit 100,000. Below is a link to the interactive Plotly data. Here is the plotter function to graph my results.
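In essence it just feeds each batch of latencies into a plotly box trace; the styling below is simplified and the results dict is keyed by greenlet count.

import plotly.graph_objects as go

def plot_latencies(results):
    """Box plot of scheduling latency per greenlet count.

    results maps a greenlet count (e.g. 100_000) to the list of latencies
    returned by benchmark() above.
    """
    fig = go.Figure()
    for count in sorted(results):
        fig.add_trace(go.Box(y=results[count], name=f"{count:,} greenlets"))
    fig.update_layout(title="Greenlet scheduling latency",
                      yaxis_title="latency (seconds)")
    fig.write_html("greenlet_latency.html")   # open in a browser

# plot_latencies({10_000: benchmark(10_000), 100_000: benchmark(100_000)})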

We are always looking for DevOps people to join our staff.

You can also find this article on Medium. 
