Optimizing Performance Analytics using Statistics
GOTO Community Day Chicago 2023

Tuesday Oct 24
3:15 PM – 4:00 PM
TAG (Kinzie 1)

This talk is for anyone who enjoys optimizing performance through data. In a typical monitoring setup, a benchmark is defined somewhat arbitrarily: an error rate below 5%, or a response time under 300 ms. These benchmarks can be met (or missed) for a variety of reasons: poor database performance, network latency, poor design under load; the list can seem endless. A single boolean value is often not enough to triage effectively and accurately. We need data to inform our decision-making process.

But what is the typical process? You receive a page, join the P1 incident, and start digging into the logs. But where do you start, and how do you know where to look? Perhaps you follow the clues and scrape the logs for certain keywords, but there’s no smoking gun. There is little confidence in a root cause until a hotfix is applied, and often we cross our fingers that it solves the issue. The RCA is done after the fact, once the escalation has died down and engineers can dig into the data streams they’ve collected and piece together what went wrong.

In this talk, I will show you how, through basic statistical testing and the properties of the cumulative distribution function, one can infer where and when an incident is occurring, or is even about to occur. Through basic normalization techniques, and by re-imagining pre/post testing, one can detect outlier deviations and even tune the sensitivity of such a model with ease.
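To make that concrete, here is a minimal sketch of the idea (my own illustration, not code from the talk): fit a baseline distribution to response times collected during normal operation, then use its CDF to flag observations that fall improbably far into the tail. The log-normal choice, the synthetic data, and the `alpha` threshold are all assumptions; `alpha` is the sensitivity knob.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical baseline: response times (ms) collected during normal operation.
baseline = rng.lognormal(mean=5.0, sigma=0.3, size=1_000)

# Fit a log-normal distribution to the baseline (returns shape, loc, scale).
params = stats.lognorm.fit(baseline, floc=0)

def is_anomalous(observation_ms: float, alpha: float = 0.001) -> bool:
    """Flag observations whose upper-tail probability under the baseline is below alpha."""
    tail_prob = stats.lognorm.sf(observation_ms, *params)  # P(X > observation)
    return tail_prob < alpha

print(is_anomalous(180.0))  # near the baseline median -> False
print(is_anomalous(900.0))  # deep in the tail -> True
```

Raising `alpha` makes the detector more sensitive (more alerts, more false positives); lowering it makes it stricter.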

Key Takeaways:
* Defining “performance” is multi-dimensional and rarely captured by arbitrary boolean values.
* Using baseline metrics to capture expected performance, one can determine whether a datapoint comes from the same distribution or a different one.
* “Performance” cannot be measured on an equal scale, and you will learn simple ways to normalize data so that it can be easily compared.
* “Performance” is not measured in a vacuum: there are many inputs into a system that produce an output, and I will show you how to measure that in a multidimensional way.
* The world is not “normal” (as in normally distributed), though that is a common starting assumption. I will show you how other distributions can also fit into this framework.
* There are ways to validate the chosen distribution against the sample, such as the Kolmogorov–Smirnov “goodness of fit” test (see the sketch after this list).
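As a hedged illustration of the normalization and goodness-of-fit points above (my own sketch, with synthetic data and SciPy assumed): z-scores put metrics with different units on one comparable, dimensionless scale, and a Kolmogorov–Smirnov test checks whether a candidate distribution fits a sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical metrics on very different scales.
latency_ms = rng.lognormal(5.0, 0.3, size=500)  # response times in milliseconds
error_rate = rng.beta(2, 50, size=500)          # fraction of failed requests

def zscore(x: np.ndarray) -> np.ndarray:
    """Normalize to zero mean and unit variance so unlike metrics are comparable."""
    return (x - x.mean()) / x.std(ddof=1)

# A dimensionless aggregate that weights both metrics equally.
combined = zscore(latency_ms) + zscore(error_rate)
print(f"aggregate score range: {combined.min():.2f} to {combined.max():.2f}")

# Goodness of fit: does a normal distribution describe these latencies?
stat, p_value = stats.kstest(
    latency_ms, "norm", args=(latency_ms.mean(), latency_ms.std(ddof=1))
)
print(f"KS statistic={stat:.3f}, p={p_value:.4f}")  # a small p-value rejects the fit
```

(Strictly speaking, estimating the parameters from the same sample biases the KS p-value; the test is shown here only to illustrate the mechanics.)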

Overall, I hope attendees come away with a basic understanding of statistics and of how statistical techniques can remove the dimensions from any measurement, so that an aggregate of comparisons can be made to produce something meaningful in performance metrics and analytics. I hope you come away from this talk understanding how to evolve the many boolean benchmarks in your tech ecosystem into a statistical framework for inferring and measuring “performance”.