Having enough time available is a struggle we all experience. Technological innovations enable us to develop and deploy software at lightning speed: Sometimes we can push more to production than our organizations’ IT environments can handle.
At the same time, we want to increase customer satisfaction by reducing downtime. But how are you going to keep customer satisfaction rates high if a large majority of incidents are caused by changes? After all, changes in your IT environment are crucial if you want to deploy fast and often.
In order to keep their systems healthy and reliable, many organizations have started their observability journey by making use of lots of different APM tools and data lakes. Even though these tools collect useful monitoring and observability data, the amount of data they hold can be overwhelming. Moreover, they often collect different data types - which makes it impossible for these tools to talk to each other and for you to see how data is related.
Needless to say, it can be a challenge to make the most of the data you collect with APM tools and in data lakes. How can you make your data more actionable, so that you can boost productivity and free up time to focus on application development? In this blog post, I explain how topology-powered observability can help.
What Is Topology-Powered Observability?
First things first: let’s unpack what topology-powered observability means. We’ll start with topology. In IT, a topology describes the set of relationships and dependencies between the discrete components in an environment, such as business services, microservices, load balancers and databases.
In today’s modern environments, topologies evolve quickly as new code gets pushed into production continuously and the underlying infrastructure changes rapidly.
In order to successfully manage these dynamic environments - for example, to find root cause quickly when something breaks - you need to have a clear overview of the state of your topology at any point in time. Better said, you need to be able to track changes in topology over time. That way, you can see exactly which services impacted one another when an issue pops up.
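To make the idea of tracking topology over time concrete, here is a minimal Python sketch. It assumes a topology snapshot can be modeled as a set of "depends on" edges with a timestamp; the `Snapshot` class, the helper functions and the component names are hypothetical illustrations, not StackState's data model or API.

```python
from __future__ import annotations

from dataclasses import dataclass


# Hypothetical model: a snapshot is a set of directed "depends on" edges.
@dataclass(frozen=True)
class Snapshot:
    timestamp: int                      # e.g. seconds since some reference point
    edges: frozenset[tuple[str, str]]   # (component, dependency)


def snapshot_at(history: list[Snapshot], when: int) -> Snapshot | None:
    """Return the most recent snapshot taken at or before `when`."""
    candidates = [s for s in history if s.timestamp <= when]
    return max(candidates, key=lambda s: s.timestamp) if candidates else None


def diff(old: Snapshot, new: Snapshot) -> dict[str, set[tuple[str, str]]]:
    """Edges that appeared or disappeared between two snapshots."""
    return {
        "added": set(new.edges - old.edges),
        "removed": set(old.edges - new.edges),
    }


# Made-up history: a cache dependency is introduced between t=100 and t=200.
history = [
    Snapshot(100, frozenset({("checkout", "payments"), ("payments", "db")})),
    Snapshot(200, frozenset({("checkout", "payments"), ("payments", "db"),
                             ("payments", "cache")})),
]

before = snapshot_at(history, 150)   # state just before the issue
after = snapshot_at(history, 250)    # state when the issue was noticed
changes = diff(before, after)
```

Comparing the state before and after an issue narrows the search to the relationships that actually changed, which in this made-up history is the new dependency of `payments` on `cache`.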
This is exactly what StackState’s topology-powered observability does. Want to see how it works? Then watch this video: Save 85% in Time Spent on Root Cause Analysis with Topology-Powered Observability.
Now, let’s dive into the six ways topology-powered observability by StackState gives back precious time to your organization so that you can spend it on the things that matter.
1. Share Tribal Knowledge More Efficiently
In the current era with so many autonomous teams, we sometimes don’t see the big picture. Often, it is unclear how (micro)services and applications are connected to each other and how they depend on components that sit far outside of your own team. Moreover, for new joiners, it is extremely hard to quickly grasp what your dynamic IT environment looks like.
A full-stack observability platform like StackState enables you to move away from knowledge that is either captured in the brain of the superhero - someone who has been with the organization for a long time and knows all of the ins and outs of your IT system - or captured in an outdated Visio drawing.
How? Because topology-powered observability provides an accurate view of the state of your IT environment at any time. New team members will ramp up quickly by viewing the environment through a single pane of glass. Now you are no longer dependent on the internal superheroes and you can share knowledge more efficiently.
2. Find Root Cause Faster by Bringing the Right People to the Table
Every environment will face an outage occasionally. Hopefully not too often, but when it does happen you want to act fast, with the right group of people involved to quickly resolve it. Many companies use war rooms or mission control centers to gather people together and start acting upon an outage.
Often, it is difficult to find out who should be involved when there’s an outage - which increases the chances of calling in the wrong people. Needless to say, this is a waste of time: people need to switch context and are unable to focus on their core tasks.
Topology helps you to determine who should be involved. How? First of all, topology shows you how all components are correlated. Then, it adds relevant change and telemetry data on top of that, so that you can see which change(s) likely caused an issue. That way, you can rapidly determine which team(s) and people are impacted and should therefore be invited to the war room. This gives back time to everyone who would otherwise be invited, but would not ultimately be involved in resolution of the outage.
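As a rough illustration of how a topology can answer the "who should be in the war room?" question, the Python sketch below walks a dependency graph from a changed component to everything that depends on it and collects the owning teams. The graph, the ownership table and the function names are all made up for this example; a real platform would discover them automatically.

```python
from collections import deque

# Hypothetical topology: component -> components it depends on.
DEPENDS_ON = {
    "web-shop": ["checkout", "search"],
    "checkout": ["payments"],
    "payments": ["orders-db"],
    "search": ["search-index"],
}

# Hypothetical ownership data: component -> owning team.
OWNER = {
    "web-shop": "team-frontend",
    "checkout": "team-checkout",
    "payments": "team-payments",
    "orders-db": "team-data",
    "search": "team-search",
    "search-index": "team-search",
}


def impacted_components(changed: str) -> set[str]:
    """The changed component plus everything that depends on it, directly or indirectly."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents: dict[str, list[str]] = {}
    for component, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(component)

    # Breadth-first walk over the inverted graph.
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def teams_to_invite(changed: str) -> set[str]:
    """Teams that own at least one impacted component."""
    return {OWNER[component] for component in impacted_components(changed)}


invited = teams_to_invite("orders-db")
```

In this made-up environment, a change to `orders-db` involves the data, payments, checkout and frontend teams, while the search team can keep working undisturbed.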
3. Find Root Cause Faster by Triaging an Outage With the Right Context
Solving a problem always starts with understanding context. What components are involved? How do they relate to each other? Which customer-facing applications are impacted? This information is vital to understand before you start solving the problem.
As I mentioned earlier, topology-powered observability correlates observability data - logs, metrics, traces and events - from all the different services and components in your environment. It tells you how everything is related and shows dependencies between services and components.
For example, an AWS security group rule change might have affected your application’s ability to make a connection from the Kubernetes environment. In order to solve this problem, you need contextualized data: you need to be able to track all changes across all technologies. With topology-powered observability, you get this context automatically when an issue pops up, in minutes rather than hours. This is a tremendous time saving.
After receiving the context you need to triage the problem, topology-powered observability provides telemetry data on the components that are impacted, allows you to travel back in time, and also automatically directs you to the probable root cause. In IT environments that lack a holistic view, it can take minutes to hours to do these things. StackState can do this in minutes, allowing you to reduce triaging time and focus on solving the problem. This time saving matters not only for your R&D department but, more importantly, for your customer satisfaction and SLOs.
4. Reduce Alert Noise and Focus Only on What Matters
It can be painful to determine which alerts really matter. When the incident management process is triggered and tickets are automatically created in your ITSM system, they need to be right. Moreover, you want to avoid any duplicates or false alarms. Why? Every ticket created will cause more work to understand and process.
Topology-powered observability can reduce alert noise drastically. How? Because it understands - through its topology layer - how multiple events that are happening in your environment may all stem from the same issue. Before, you might have received 10 tickets because 10 events popped up. Now, you receive one ticket that immediately directs you to the probable root cause of all 10 events. This saves you lots of valuable time that you can repurpose to make your entire system more robust and stable. Bonus: you continue to avoid tons of alerts going forward.
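The grouping idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not StackState's correlation logic: the dependency graph is made up, it is assumed to contain no cycles, and a component counts as a probable root cause when it is alerting while none of its own dependencies are.

```python
from collections import defaultdict

# Hypothetical topology: component -> components it depends on (assumed acyclic).
DEPENDS_ON = {
    "web-shop": ["checkout"],
    "checkout": ["payments"],
    "payments": ["orders-db"],
    "reporting": ["orders-db"],
    "search": [],
    "orders-db": [],
}


def probable_root_causes(component: str, alerting: set[str]) -> set[str]:
    """Follow alerting dependencies downward; stop where no dependency is alerting."""
    alerting_deps = [d for d in DEPENDS_ON.get(component, []) if d in alerting]
    if not alerting_deps:
        return {component}
    roots: set[str] = set()
    for dep in alerting_deps:
        roots |= probable_root_causes(dep, alerting)
    return roots


def group_alerts(alerting: set[str]) -> dict[str, set[str]]:
    """Map each probable root cause to the set of alerts it explains."""
    tickets: dict[str, set[str]] = defaultdict(set)
    for component in alerting:
        for root in probable_root_causes(component, alerting):
            tickets[root].add(component)
    return dict(tickets)


# Six components are alerting at the same time.
alerts = {"web-shop", "checkout", "payments", "reporting", "orders-db", "search"}
tickets = group_alerts(alerts)
```

Here six alerts collapse into two tickets: one for `orders-db`, which explains five of them, and one for the unrelated `search` alert.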
5. Capture Expert Knowledge in Your Observability System
Do you have additional superheroes in your organization that know everything about a particular technology? This often happens when technology becomes older or when new technology is introduced. It can be a risk to only have a few superheroes like these in your organization, because customer satisfaction often depends on the stability and reliability of the old and new technologies.
It can be hard to capture all of the knowledge these experts possess and let other team members benefit from it as well. For example, some experts may know that in your Kubernetes environment, a ReplicaSet should run at least three replicas to prevent disruption. If you’re new to Kubernetes, you might be unaware of this.
Topology-powered observability helps you by capturing the state of your IT landscape - something that normally only the superheroes can do. Following the previous example, the observability solution will capture that the ReplicaSets in your environment currently run three replicas. When someone then adds a ReplicaSet with fewer than three replicas, the observability solution can notify that person. This makes it a lot easier for beginners to start developing Kubernetes applications.
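A check like this is simple to express once the expert's knowledge is written down. The Python sketch below assumes the ReplicaSet manifest has already been loaded into a dictionary; the `check_replicas` function and the threshold constant are hypothetical, while `metadata.name` and `spec.replicas` are the standard Kubernetes manifest fields.

```python
from __future__ import annotations

MIN_REPLICAS = 3  # the expert's rule of thumb from the example above


def check_replicas(manifest: dict, minimum: int = MIN_REPLICAS) -> str | None:
    """Return a warning if a ReplicaSet manifest requests fewer replicas than `minimum`."""
    name = manifest.get("metadata", {}).get("name", "<unnamed>")
    # Kubernetes defaults spec.replicas to 1 when the field is omitted.
    replicas = manifest.get("spec", {}).get("replicas", 1)
    if replicas < minimum:
        return (
            f"ReplicaSet '{name}' requests {replicas} replica(s); "
            f"at least {minimum} expected."
        )
    return None


# A newcomer's manifest with too few replicas (made-up example).
new_replicaset = {
    "kind": "ReplicaSet",
    "metadata": {"name": "web"},
    "spec": {"replicas": 2},
}
warning = check_replicas(new_replicaset)
```

The newcomer gets the warning at the moment of the change, without having to find the one colleague who knows why three is the magic number.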
6. Define Rules and Best Practices in Your System
Besides capturing knowledge from a single expert, you might want to set boundaries on the amount of freedom the different teams have. Here’s an example: you want to allow your teams to make use of AWS S3 buckets for storage. However, you want to enforce secure usage of them. How? By checking that encryption is enabled. In the same way, you can verify that root access is switched off on your EC2 machines.
A topology-powered observability platform is the perfect instrument for capturing rules and best practices like these. Since every component is checked by StackState’s monitors, the teams have the freedom to deploy quickly, while also receiving a warning whenever rules are not applied in the right way.
Rather than explaining these rules to every team or spending many hours checking if teams comply with these rules, you can save yourself a tremendous amount of time by setting up a system like this.
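To give a feel for what "rules in your system" could look like, here is a minimal Python sketch in which each rule is a named check that applies to one component type. Everything in it is an assumption made for illustration: the `Rule` class, the component types and the property names (`encryption_enabled`, `root_access_enabled`) are hypothetical and are neither AWS API fields nor StackState's monitor format.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    applies_to: str                       # hypothetical component type
    is_compliant: Callable[[dict], bool]  # check on the component's properties


# A missing property counts as a violation, so unknown state is never silently accepted.
RULES = [
    Rule("S3 bucket encryption enabled", "s3-bucket",
         lambda props: props.get("encryption_enabled") is True),
    Rule("EC2 root access disabled", "ec2-instance",
         lambda props: props.get("root_access_enabled") is False),
]


def evaluate(components: list[dict]) -> list[str]:
    """Return one warning per component that breaks a rule for its type."""
    warnings = []
    for component in components:
        for rule in RULES:
            if component["type"] != rule.applies_to:
                continue
            if not rule.is_compliant(component["properties"]):
                warnings.append(f"{component['name']}: violates '{rule.name}'")
    return warnings


# Made-up components as they might be discovered in an environment.
components = [
    {"name": "invoices-bucket", "type": "s3-bucket",
     "properties": {"encryption_enabled": False}},
    {"name": "logs-bucket", "type": "s3-bucket",
     "properties": {"encryption_enabled": True}},
    {"name": "batch-worker", "type": "ec2-instance",
     "properties": {"root_access_enabled": False}},
]
warnings = evaluate(components)
```

Because the rules live in one place and run against every component, adding a new best practice is a matter of adding one entry to the list rather than briefing every team.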
Save Hours by Leveraging Topology-Powered Observability
In this blog post, I described how many different teams and roles within your organization will benefit from topology-powered observability. Whether you’re an expert in this field or just starting out, involved in software development or operations, responsible for a particular application or an entire practice: all will save hours by making use of topology-powered observability.
Curious to learn more? Play in our sandbox or book a personalized StackState observability demo with one of our solution engineers.