In this case study, we explain how NN Bank set off a positive chain reaction that began with faster root cause analysis in its troubleshooting process and ended with higher customer satisfaction. Here are some key terms:
DevOps: A software development culture that emphasizes shared goals and collaboration across development and operations teams
Root cause analysis (RCA): The process of figuring out what caused an outage or event
Net promoter score (NPS): A number representing how likely customers are to recommend a company or service to someone else
Mean time to repair (MTTR): The average time it takes to resolve the issues that caused an outage
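To make two of these terms concrete, here is a minimal sketch of how NPS and MTTR are commonly calculated. It assumes the standard 0-10 survey scale for NPS (promoters score 9-10, detractors 0-6); the sample ratings and repair durations are invented for illustration and are not NN Bank data.

```python
from datetime import timedelta


def net_promoter_score(ratings):
    """Percentage of promoters (9-10) minus percentage of detractors (0-6)."""
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)


def mean_time_to_repair(durations):
    """Average of the per-incident repair durations."""
    if not durations:
        raise ValueError("need at least one incident")
    return sum(durations, timedelta()) / len(durations)


# Hypothetical sample data, for illustration only.
sample_ratings = [10, 9, 8, 7, 6, 3, 9, 10, 5, 8]
sample_repairs = [timedelta(hours=4), timedelta(hours=6), timedelta(hours=5)]

print(net_promoter_score(sample_ratings))   # 4 promoters, 3 detractors out of 10
print(mean_time_to_repair(sample_repairs))  # average of 4, 6 and 5 hours
```

Because NPS subtracts detractors from promoters, it can range from -100 to 100, which is why a single-digit score such as 6.6 is considered low.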
Challenge: Low uptime and low customer satisfaction
Nationale-Nederlanden Bank (NN Bank) is a bank and insurance provider with over 10,000 employees and 18 million customers. NN Bank’s net promoter score (NPS) was flagging at a low 6.6, and they suspected a correlation between that and another stat: Uptime was only 97.5%.
When it takes several hours—or longer—to identify the root cause of an issue, uptime suffers, dragging down efficiency, customer satisfaction and overall morale along with it.
While it’s possible to manually visualize your network and painstakingly sniff out the root cause, automating the root cause analysis process can save precious hours. Observability in financial services, especially, depends on effective RCA because downtime interrupts your customers' access to the funds they need. Nationale-Nederlanden Bank’s RCA process was inefficient, resulting in excessive downtime and low customer satisfaction.
Sander Vijfschagt, Scrum Master Platform Services at NN Bank, explained that the conundrum they faced stemmed from large volumes of disorganized, difficult-to-interpret data: “A lot of stuff is going wrong and we see a lot of impact happening. We also see what is going wrong sometimes. But what is the cause if something goes wrong? Putting it all together: that's the real puzzle in operations.”
In addition, the lack of insight sometimes created tensions between members of the IT team and upper management. Sander continued, “If any of you ever had to do a root cause analysis, you know it's not so easy. Management often tells you it should be and asks, ‘Why are you taking such a long time?’ But it's really difficult.”
The connection between visibility, problem-solving and root cause analysis
Without real-time visibility into dependencies, NN Bank had to manually analyze the connections between siloed systems involved in an outage. This hindered the RCA process, resulting in unacceptable downtime levels.
NN Bank was using Splunk as a central data lake, and although Splunk was doing a great job capturing and indexing data, its dashboards were static and had to be updated manually. Given NN Bank’s dynamic, hybrid cloud environment and the challenges they faced in managing multiple services, it was clear that the IT teams needed a better way to quickly integrate the massive amounts of data from various systems, correlate it and get a unified view of the overall IT environment.
According to Sander, “Our availability over the entire NN Bank was 97.5%. And for a bank, which should be trustworthy, that’s extremely low.” He added that banks strive for 99% or higher availability. The shortfall showed up “in the customer complaints and in the user experience.” Sander also recalled that their “mean time to repair was quite high,” taking “four to six hours, on average.”
Combining StackState with Splunk
Splunk had been a very effective data lake for NN Bank, and even though it was managing disparate data streams from SolarWinds, Prometheus, CloudWatch, Azure and other monitoring tools extremely well, it didn’t provide an easy way to visualize the business impacts.
For example, all of the issues would get sent to Splunk. If you knew what to look for, you could parse through the data to identify the root cause. But even when the team in charge of pinpointing the root cause could do so, they were unable to see the business impact. Each team's view into the data was effectively siloed, which prevented comprehensive observability and made RCA challenging and time-consuming.
While Splunk did its job well as a data lake, NN Bank needed more. “That’s when this Gartner quote came along,” explained Sander. “‘Capturing where events occur and our up and downstream dependencies using graph and bottleneck analysis can provide great insight on where to focus your mediation efforts.’ This, for us, was key.”
Gaining unique auto-discovery abilities with observability
StackState automatically discovers when your landscape changes and then updates the topology on an ongoing basis.
Sander related that the benefits of this feature were substantive: "Auto-discovery is an absolute lifesaver when you're in a CI/CD (continuous integration / continuous delivery) environment, where at any moment in time, the landscape can change. It can detect the changes, if you hook it up with a server's datastream. And it can show you your full stack, including the network data, which was really, for us, what helped us move forward."
Enabling faster, more accurate root cause analysis
By adding StackState, NN Bank was able to capture where events were occurring and instantly visualize upstream and downstream dependencies using graph and bottleneck analysis. As a result, the bank quickly gained pinpointed insights that helped focus their remediation efforts.
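The general idea of dependency-graph analysis can be sketched in a few lines. This is an illustrative toy, not StackState's algorithm: the component names, the `DEPENDS_ON` map and both helper functions are hypothetical. It treats an unhealthy component as a probable root cause when none of its own dependencies are unhealthy, and finds business impact by walking the graph toward everything that relies on that component.

```python
from collections import deque

# Hypothetical topology: each component maps to the components it depends on.
DEPENDS_ON = {
    "mobile-app": ["payments-api"],
    "web-portal": ["payments-api", "auth-service"],
    "payments-api": ["core-db"],
    "auth-service": ["core-db"],
    "core-db": [],
}


def probable_root_causes(depends_on, unhealthy):
    """Unhealthy components whose own dependencies are all healthy."""
    return sorted(
        c for c in unhealthy
        if not any(dep in unhealthy for dep in depends_on.get(c, []))
    )


def impacted_components(depends_on, component):
    """Everything that directly or transitively depends on `component`."""
    # Reverse the edges so we can walk from a component to its dependents.
    dependents = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(node)
    seen, queue = set(), deque([component])
    while queue:  # breadth-first search over the reversed graph
        current = queue.popleft()
        for node in dependents.get(current, []):
            if node not in seen:
                seen.add(node)
                queue.append(node)
    return sorted(seen)


# Suppose monitoring flags these three components at the same time.
unhealthy = {"mobile-app", "payments-api", "core-db"}
roots = probable_root_causes(DEPENDS_ON, unhealthy)
impact = impacted_components(DEPENDS_ON, roots[0])
print(roots)
print(impact)
```

In this toy example the alerts on `mobile-app` and `payments-api` are symptoms, since both sit on top of the failing `core-db`. A real topology would also need to be kept current automatically, which is the role auto-discovery plays in the story above.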
Observability enabled NN Bank's troubleshooting teams to eliminate the time-consuming—and sometimes less-than-productive—discussions around what was happening, where, what caused it and the systems it impacted. Instead, everyone had a single view into where the problem was occurring and how it impacted connected systems. The right people could instantly get involved.
NN Bank adds predictive monitoring
At the highest level of observability maturity lies predictive monitoring, which enables you to see signs of potential issues and address them before they become full-blown incidents. NN Bank was able to use StackState to future-proof their system against a wide range of issues.
With StackState’s AIOps features, the system can learn what causes downtime, look out for those conditions in the future, and empower the IT team to mitigate issues before they grow into real problems.
Results: 99.8% availability, lower MTTR and higher customer satisfaction
The team’s efforts, along with StackState functionality, proved to be a difference-maker in NN Bank’s successful AIOps journey. StackState helped the bank increase uptime and reduce MTTR, resulting in their NPS jumping by 75.7%. Due to the platform’s ability to automate elements of the root cause analysis process, as well as provide predictive monitoring, NN Bank experienced:
IT availability increased from 97.5% to 99.8%
MTTR for P1 incidents went from 4-6 hours to less than one hour
Customer satisfaction (as measured by NPS) increased by 75.7%
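To put the availability figures in perspective, the short calculation below converts each percentage into hours of downtime per year. It is a back-of-the-envelope sketch that assumes availability is measured around the clock over a 365-day year; the source does not say how NN Bank measured it, and the function name is our own.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_percent):
    """Hours per year a service is unavailable at the given availability."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


before = annual_downtime_hours(97.5)
after = annual_downtime_hours(99.8)

print(f"97.5% availability: about {before:.1f} hours of downtime per year")
print(f"99.8% availability: about {after:.1f} hours of downtime per year")
```

Under that assumption, the move from 97.5% to 99.8% corresponds to a drop from roughly 219 hours of downtime a year to under 18.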