What Are MTTR and MTTD?
Allyson Barr· 5 min read
Several metrics are used to gauge incident management success. This piece discusses two of them: MTTD and MTTR.
What Is MTTD?
MTTD is an acronym for “mean time to detect,” which refers to the average amount of time that passes between when a failure happens and when it is detected. MTTD is a key metric for analyzing how quickly your team can relate IT changes to incidents. The faster you detect anomalies, the faster you can solve problems.
To understand why MTTD and MTTR start with the word “mean,” we have to hop into our DeLorean and take a quick trip back in time to middle school math class.
You calculate the mean of a set of numbers by adding them all together and then dividing by how many numbers are in the data set. In everyday terms, the mean is the average. When it comes to MTTD, you get the mean by adding together all the times that passed between when each failure happened and when it was detected, then dividing that total by the number of incidents.
An MTTD data set can be relatively simple. It could look like this:
Monday: Your web server goes down at 2:00 p.m. The system discovers and reports the failure at 2:05 p.m. Discovery time is five minutes.
Wednesday: The same web server goes down at 11:00 a.m., and the failure is discovered at 11:15 a.m. Discovery time is 15 minutes.
Thursday: The server fails again at 1:00 p.m., and this is discovered at 1:04 p.m. Discovery time is four minutes.
5 + 15 + 4 = 24, so your MTTD would be that divided by the number of incidents in the data set, or three. That gives you an MTTD of eight minutes.
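The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the calculation above, not a prescribed tool: the function name `mean_time_to_detect` and the calendar dates are assumptions made for the example, while the clock times come from the three incidents listed.

```python
from datetime import datetime

# (failure time, detection time) pairs from the example above.
# The calendar dates are placeholders; only the clock times come from the article.
incidents = [
    (datetime(2024, 1, 1, 14, 0), datetime(2024, 1, 1, 14, 5)),   # Monday
    (datetime(2024, 1, 3, 11, 0), datetime(2024, 1, 3, 11, 15)),  # Wednesday
    (datetime(2024, 1, 4, 13, 0), datetime(2024, 1, 4, 13, 4)),   # Thursday
]


def mean_time_to_detect(incidents):
    """Return the mean delay between failure and detection, in minutes."""
    if not incidents:
        raise ValueError("need at least one incident to compute a mean")
    delays = [
        (detected - failed).total_seconds() / 60
        for failed, detected in incidents
    ]
    return sum(delays) / len(delays)


mttd_minutes = mean_time_to_detect(incidents)
print(mttd_minutes)  # 8.0
```

Working from timestamps rather than hand-entered durations keeps the calculation honest as the data set grows, since each delay is derived from the same two recorded events.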
However, MTTD can also hinge on someone speaking up and others listening. For example, suppose your e-commerce solution is failing due to an error in a database containing customer information. By the time a customer complains to a rep in your company, the MTTD clock has arguably been running since the error first occurred, and it keeps running until someone recognizes the complaint as an incident. How stakeholders respond will make all the difference in achieving a lower MTTD and, ultimately, customer satisfaction.
What Is MTTR?
MTTR is a slightly more flexible acronym signifying “mean time to recover,” but the last “R” can also stand for "repair," "restore," "resolve," or "remediate." MTTR is the average time that passes between when a failure has been discovered and when it has been fixed.
Depending on the systems in place, MTTR can vary a lot more than MTTD. With automated visibility solutions, MTTD can often be a function of programs that detect faults. MTTR, on the other hand, often involves people and a series of steps needed to fix the issue. So while MTTD may be a measurement of how well an automated alert system performs, MTTR often ends up being a measurement of both your systems and the people you depend on to jump into action after an incident.
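MTTR, as defined above, can be computed the same way as MTTD, just with a different pair of timestamps: detection and fix. The sketch below is a minimal illustration under stated assumptions. The detection times reuse the Monday, Wednesday and Thursday incidents from the MTTD example, but the "fixed" times are hypothetical values invented for this example, as are the function name `mean_time_to_recover` and the dictionary keys.

```python
from datetime import datetime
from statistics import mean

# Detection times come from the MTTD example earlier in the article.
# The "fixed" times are HYPOTHETICAL, made up purely to illustrate the math.
incidents = [
    {"detected": datetime(2024, 1, 1, 14, 5), "fixed": datetime(2024, 1, 1, 14, 35)},
    {"detected": datetime(2024, 1, 3, 11, 15), "fixed": datetime(2024, 1, 3, 12, 0)},
    {"detected": datetime(2024, 1, 4, 13, 4), "fixed": datetime(2024, 1, 4, 13, 19)},
]


def mean_time_to_recover(incidents):
    """Return the mean time from detection to fix, in minutes."""
    durations = [
        (incident["fixed"] - incident["detected"]).total_seconds() / 60
        for incident in incidents
    ]
    return mean(durations)


mttr_minutes = mean_time_to_recover(incidents)
print(mttr_minutes)  # 30.0
```

With these made-up numbers the individual recovery times are 30, 45 and 15 minutes. Keeping detection and fix as separate timestamps on each incident makes it possible to report MTTD and MTTR from the same records.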
Cut Recovery Time With StackState and Topology-Powered Observability
StackState's topology-powered observability solution helps organizations consolidate their monitoring systems data, visualize their entire IT environment in a single topology view, enhance IT observability and identify root cause faster than ever before. With real-time observability, the time it takes to discover a failure is cut down because admins and other stakeholders can see alerts from all segments of the stack and immediately pinpoint the root cause. This results in better MTTD numbers and greater resiliency.
The same is true for StackState’s impact on MTTR. A true recovery can’t happen unless you first identify the root cause. Otherwise, you may be slapping a Band-Aid on a cut that’s deeper than you realize. Because StackState understands the IT environment, including its relationships, its dependencies and the changes taking place within it, the system can immediately pinpoint what changed and the associated problems when an issue arises. No more hunting for the cause, no more "blame games" of "guilty until proven innocent" and fewer war rooms. This is supported by our time-traveling topology, which allows you to zoom back in time to the precise moment something changed that triggered anomalies or caused incidents.
Webinar: Crush your MTTR with StackState as turbo on top of Splunk
Are you using Splunk and interested in how StackState can help you crush your MTTR and MTTD goals? Whether or not you have Splunk, watch this webinar, Driving Business Performance With Observability in Financial Services. Learn how our customer NN Bank, one of the larger banks in the Netherlands, can now quickly relate IT incidents to business impact by using StackState's topology-powered observability solution in combination with their Splunk data lake. No matter what industry you are in, there are some best practices on display at NN Bank!