What Are MTTR and MTTD?
Allyson Barr· 6 min read
Several metrics are used to gauge incident management success. Two of them, MTTD and MTTR, are the subject of this piece.
What Is MTTD?
MTTD is an acronym for “mean time to detect,” which refers to the average amount of time that passes between when a failure happens and when it is detected. MTTD is a key metric for analyzing how well your team relates IT changes to incidents. The faster you detect anomalies, the quicker you can solve problems.
To understand why MTTD and MTTR start with the word “mean,” we have to hop in the DeLorean and take a quick trip back in time to middle school math class.
You calculate the mean, or average, of a set of numbers by adding them all together and then dividing by how many numbers are in the data set. For MTTD, you add together all the times that passed between when a failure happened and when it was detected, then divide the total by the number of incidents.
An MTTD data set can be relatively simple. It could look like this:
Monday: Your web server goes down at 2:00 p.m. The system discovers and reports the failure at 2:05 p.m. Discovery time is five minutes.
Wednesday: The same web server goes down at 11:00 a.m., and the failure is discovered at 11:15 a.m. Discovery time is 15 minutes.
Thursday: The server fails again at 1:00 p.m., and this is discovered at 1:04 p.m. Discovery time is four minutes.
5 + 15 + 4 = 24 minutes, so your MTTD is that total divided by the number of incidents in the data set, which is three. That gives you an MTTD of eight minutes.
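The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the calculation above: the calendar dates are placeholders chosen to fall on a Monday, Wednesday, and Thursday, and the helper name mean_minutes is made up for this example.

```python
from datetime import datetime

# (failure time, detection time) for the three incidents described above.
# The calendar dates are placeholders; only the clock times come from the example.
incidents = [
    (datetime(2024, 1, 1, 14, 0), datetime(2024, 1, 1, 14, 5)),   # Monday
    (datetime(2024, 1, 3, 11, 0), datetime(2024, 1, 3, 11, 15)),  # Wednesday
    (datetime(2024, 1, 4, 13, 0), datetime(2024, 1, 4, 13, 4)),   # Thursday
]


def mean_minutes(intervals):
    """Return the mean length, in minutes, of a list of (start, end) pairs."""
    total_seconds = sum((end - start).total_seconds() for start, end in intervals)
    return total_seconds / 60 / len(intervals)


mttd = mean_minutes(incidents)
print(f"MTTD: {mttd:g} minutes")  # MTTD: 8 minutes
```

In practice the timestamps would come from your monitoring or ticketing system rather than being typed in by hand.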
However, MTTD can also hinge on someone speaking up and others listening. For example, suppose your e-commerce solution is failing due to an error in a database containing customer information. When a customer complains to a rep in your company, the MTTD clock has arguably been running since the database error first occurred, and it stops only when the complaint reaches someone who recognizes it as an incident. How stakeholders respond will make all the difference.
What Is MTTR?
MTTR is a slightly more flexible acronym signifying “mean time to recover,” but the last “R” can also stand for “repair,” “restore,” “resolve,” or “remediate.” MTTR is the average time that passes between when a failure has been discovered and when it has been fixed.
Depending on the systems in place, MTTR can vary a lot more than MTTD. With automated visibility solutions, MTTD can often be a function of programs that detect faults. MTTR, on the other hand, often involves people and a series of steps needed to fix the issue. So while MTTD may be a measurement of how well an automated alert system performs, MTTR often ends up being a measurement of both your systems and the people you depend on to jump into action after an incident.
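MTTR is calculated the same way as MTTD, just over a different interval: from detection to fix. Here is a minimal Python sketch. The detection times reuse the Monday, Wednesday, and Thursday incidents from the MTTD example, but the resolution times are hypothetical, invented purely to show the arithmetic.

```python
from datetime import datetime
from statistics import mean

# (detected, resolved) pairs. Detection times match the MTTD example;
# the resolution times are hypothetical, made up for this illustration.
incidents = [
    (datetime(2024, 1, 1, 14, 5), datetime(2024, 1, 1, 14, 35)),   # Monday
    (datetime(2024, 1, 3, 11, 15), datetime(2024, 1, 3, 12, 15)),  # Wednesday
    (datetime(2024, 1, 4, 13, 4), datetime(2024, 1, 4, 13, 49)),   # Thursday
]

# Minutes between detection and resolution for each incident.
repair_minutes = [
    (resolved - detected).total_seconds() / 60
    for detected, resolved in incidents
]

mttr = mean(repair_minutes)
print(f"MTTR: {mttr:g} minutes")  # MTTR: 45 minutes
```

With these made-up numbers, repairs took 30, 60, and 45 minutes, giving an MTTR of 45 minutes. The wider spread compared with the detection times reflects the point above: recovery depends on people and process, not just automated alerts.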
Cut Recovery Time With StackState
The unified, cross-domain topology capability of StackState's observability solution helps organizations consolidate their monitoring systems, visualize topologies, enhance IT observability, and identify root causes faster than ever before. With real-time observability, the time it takes to discover a failure is cut down because admins and other stakeholders can see alerts from all segments of the stack and immediately pinpoint the root cause. This results in better MTTD numbers and greater resiliency.
The same is true for StackState’s impact on MTTR. A true recovery can’t happen unless you first identify the root cause. Otherwise, you may be slapping a Band-Aid on a cut that’s deeper than you realize. Because StackState has a complete understanding of the IT environment, its relationships and dependencies, and the changes taking place within it, when a problem arises, the system can immediately pinpoint what changed and the associated problems. No more hunting for the cause, no more “blame games” of “guilty until proven innocent,” and fewer war rooms. This is supported by the time-traveling topology, which allows you to go back to the precise moment something changed that triggered anomalies or caused incidents.
Webinar: crush your MTTR with StackState as Turbo on top of Splunk
Are you using Splunk and are you interested in how you can use StackState with Splunk to crush your MTTR and MTTD? Sign up for this webinar, in which we show how our client NN Bank - one of the larger banks in the Netherlands - now quickly relates IT incidents to business impact by using StackState's AI-powered observability tool.