Problem and incident investigation is hard because you usually have to search through multiple tools, correlate the data from all of those tools, and then interpret it.
At StackState we believe we should not only automate the release cycle of your software, but automate everything, including the hard parts like troubleshooting and incident investigation.
Some companies try to help during an incident investigation by correlating incidents reported by different tools: they check whether the incidents contain the same message and were reported in the same time frame. This type of analysis does help during an incident, but it usually takes quite some time.
As an engineer, I have carried out many of these investigations. Most of the time, the investigation follows the same pattern (sketched in code after the list below):
There is a bottleneck somewhere.
You search for bottlenecks in all dependencies of the hotspot (slow service calls, high disk usage, high CPU usage, increasing memory usage, slow queries, etc.).
When you find a bottleneck, you repeat these steps until you reach the root cause of the problem.
You fix the root cause.
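To make the pattern concrete, here is a minimal sketch of that investigation loop in Python. The dependency graph, metrics, and thresholds are invented examples for illustration only; they are not StackState's data model or API.

```python
# Hypothetical dependency graph: component -> components/resources it depends on.
DEPENDENCIES = {
    "checkout": ["payment-api", "inventory-db"],
    "payment-api": ["payment-db"],
    "inventory-db": [],
    "payment-db": [],
}

# Hypothetical latest metrics per component.
METRICS = {
    "checkout": {"latency_ms": 2400, "cpu_pct": 40},
    "payment-api": {"latency_ms": 2100, "cpu_pct": 35},
    "payment-db": {"latency_ms": 1900, "cpu_pct": 95},
    "inventory-db": {"latency_ms": 30, "cpu_pct": 20},
}

# Hypothetical thresholds that mark a metric as a bottleneck signal.
THRESHOLDS = {"latency_ms": 1000, "cpu_pct": 90}


def is_bottleneck(component: str) -> bool:
    """A component looks suspicious if any metric exceeds its threshold."""
    metrics = METRICS.get(component, {})
    return any(metrics.get(name, 0) > limit for name, limit in THRESHOLDS.items())


def find_root_cause(hotspot: str) -> str:
    """Walk the dependency chain until no dependency is also a bottleneck."""
    current = hotspot
    while True:
        suspicious = [d for d in DEPENDENCIES.get(current, []) if is_bottleneck(d)]
        if not suspicious:
            return current  # nothing downstream explains it: likely root cause
        current = suspicious[0]  # step into the first suspicious dependency


if __name__ == "__main__":
    print(find_root_cause("checkout"))  # -> payment-db (the overloaded database)
```

In practice each `is_bottleneck` check means opening yet another dashboard or query tool, which is exactly the manual, repetitive work that takes so long.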
At StackState we automated this process to reduce the time needed to find the root cause of any change or failure. However, it is always better to prevent these issues from impacting your end users, and thus your business. That's why we combine anomaly detection with graph analytics to become more proactive. This catches an issue at an early stage, before it becomes the root cause of an outage, so instead of reacting in a panic you can prevent the outage from happening at all.
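As an illustration of the idea, not StackState's actual algorithms, the sketch below flags components whose latest latency deviates strongly from their recent baseline and then follows anomalous dependencies through the graph to the deepest anomalous component, the most likely origin of the problem. The graph, samples, and threshold are all hypothetical.

```python
from statistics import mean, stdev

# Hypothetical dependency graph and recent latency samples per component.
DEPENDENCIES = {
    "frontend": ["orders-api"],
    "orders-api": ["orders-db"],
    "orders-db": [],
}
LATENCY_HISTORY = {
    "frontend": [110, 105, 115, 108, 240],
    "orders-api": [80, 85, 78, 82, 310],
    "orders-db": [20, 22, 19, 21, 190],
}


def is_anomalous(samples, z_threshold=3.0):
    """Flag the latest sample if it deviates strongly from the baseline."""
    baseline, latest = samples[:-1], samples[-1]
    spread = stdev(baseline) or 1.0  # avoid division by zero on a flat baseline
    return abs(latest - mean(baseline)) / spread > z_threshold


def probable_origin(start):
    """Follow anomalous dependencies downward; the deepest anomalous
    component is the most likely origin of the problem."""
    origin = start
    for dep in DEPENDENCIES.get(start, []):
        if is_anomalous(LATENCY_HISTORY[dep]):
            origin = probable_origin(dep)
    return origin


if __name__ == "__main__":
    anomalies = [c for c, s in LATENCY_HISTORY.items() if is_anomalous(s)]
    print("anomalous components:", anomalies)
    print("probable origin:", probable_origin("frontend"))  # -> orders-db
```

The point of the sketch is the combination: anomaly detection tells you something is drifting out of its normal range, and the graph tells you where in the chain of dependencies that drift originates, before users ever notice.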
All of this ensures service availability, stability, and performance. That means happy end users and a happy business department. But above all: it will save you money!