The Inciting Incident
NN Bank’s customer service net promoter score was a low 6.6. Sander Vijfschagt, the chapter lead of testing and monitoring, recognized the link to the bank’s low uptime relative to industry benchmarks. With an uptime of 97.57% and a mean time to repair of 4-6 hours per incident for impacted services, NN Bank was missing its service level objective, and the customer experience suffered.
The real problem was how long it took to identify the root cause of outages. NN Bank had more than twenty different IT teams and was using several monitoring solutions: SolarWinds, Prometheus, VMware vRealize Operations, AWS CloudWatch, and Azure. These all forwarded their data to a central data lake in Splunk, but Splunk showed neither how the systems were related nor how changes to supporting IT components led to service failures.
How could the teams quickly get systems back up and running when they couldn’t see the dependencies between systems or the root cause of incidents? Sander tried visualizing the dependencies manually, but the process was extremely time-consuming. Splunk’s dashboards were static and had to be updated by hand to draw insights from additional data streams, which was inadequate for NN Bank's dynamic, multi-cloud environment. Plus, Splunk identified only dependencies related to business processes and services, not those related to the underlying infrastructure. The bank needed a better, more innovative solution.
StackState - The Turbo on Top of Splunk
After chasing a few red herrings, Sander discovered StackState, which had helped other Splunk customers improve their ability to identify root cause and reduce resolution times. With the Splunk StackPack, NN Bank was able to dynamically correlate all performance data captured in the Splunk data lake into a comprehensive topology that clearly visualized the dependencies and relationships between applications, services, infrastructure, and environments. With StackState’s time-traveling topology, NN Bank could easily see how changes to IT environments led to performance degradations and observe exactly where in the topology the root cause of downtime originated. NN Bank could also simply add new sources of data for topological analysis with no manual effort. What’s more, with StackState’s predictive monitoring, they could prevent many issues from ever occurring.
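To illustrate why a dependency topology shortens root cause analysis, here is a minimal Python sketch. It is not StackState's algorithm or API; the component names, the `depends_on` map, and the `root_cause_candidates` function are all hypothetical. The idea is simply to start at a failing service, follow its unhealthy dependencies downward, and report the unhealthy components that have no unhealthy dependency of their own.

```python
from typing import Dict, List, Set


def root_cause_candidates(
    depends_on: Dict[str, List[str]],
    unhealthy: Set[str],
    start: str,
) -> List[str]:
    """Return unhealthy components reachable from `start` that have
    no unhealthy dependency of their own (likely root causes)."""
    candidates: List[str] = []
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        # Skip components already visited and components that are healthy.
        if node in seen or node not in unhealthy:
            continue
        seen.add(node)
        bad_deps = [d for d in depends_on.get(node, []) if d in unhealthy]
        if bad_deps:
            # The problem lies deeper in the topology; keep walking down.
            stack.extend(bad_deps)
        else:
            # Nothing below this component is failing, so it is a candidate.
            candidates.append(node)
    return sorted(candidates)


# Hypothetical topology: each component maps to what it depends on.
topology = {
    "mobile-app": ["payments-api", "auth-service"],
    "payments-api": ["db-cluster"],
    "auth-service": ["db-cluster"],
    "db-cluster": ["vm-17"],
}
# Hypothetical health state at the moment of the incident.
failing = {"mobile-app", "payments-api", "db-cluster"}

result = root_cause_candidates(topology, failing, "mobile-app")
print(result)
```

In this made-up example the walk ends at the database cluster, so a team would look there first instead of checking each monitoring tool in turn.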
The combination of Splunk’s dashboards and StackState’s topology gave NN Bank everything required to detect performance degradations and understand their root cause before outages occurred.
Business Impact - StackState Makes a Real Splash with Splunk
StackState's open, data-agnostic, unified-observability platform easily integrated with NN Bank’s existing technology. Now all of NN Bank’s 20+ teams have real-time visibility into dependencies across its hybrid IT landscape. With this improved insight and common understanding across tools, the teams can instantaneously prioritize and assign IT issues to the right people, saving valuable troubleshooting time.
With the addition of StackState, NN Bank has not only optimized its investment in Splunk but can also seamlessly incorporate any future technologies. With all its root causes finally revealed, the bank is well on its way to:
achieving high availability of 99.8%
reducing its MTTR to less than one hour
achieving a vastly improved customer net promoter score of 11.6
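To put the availability figures in perspective, the short Python sketch below converts an availability percentage into hours of downtime. It assumes both percentages are measured over a full 365-day year of round-the-clock service, which the case study does not state; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 24/7 service over a 365-day year


def downtime_hours_per_year(availability_pct: float) -> float:
    """Convert an availability percentage into yearly downtime in hours."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


before = downtime_hours_per_year(97.57)  # starting uptime from the case study
target = downtime_hours_per_year(99.8)   # availability goal from the case study

print(f"Downtime at 97.57%: {before:.1f} hours per year")
print(f"Downtime at 99.8%:  {target:.1f} hours per year")
```

Under that assumption, 97.57% uptime corresponds to roughly 213 hours of downtime a year, while 99.8% corresponds to about 17.5 hours.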