The most recent “Annual Cloud Native Computing Foundation (CNCF) Survey” clearly supports the assertion that Kubernetes is growing, and “The State of Kubernetes 2022” report from Tanzu also confirms adoption. Also significant are Gartner and Forrester, each observing growth and wide adoption of Kubernetes. So in this article, there is no need to predict that Kubernetes adoption will continue to explode!
However, if we unpack this rapid adoption, what else needs to happen to support the widespread uptake of Kubernetes? What do leaders need to consider, in order to enable and foster healthy growth and let the ecosystem of Kubernetes flourish in their organization?
I have five predictions that you need to pay attention to, if Kubernetes is important to your IT organization:
Conversations will intensify around creating separate DevOps and platform teams, since knowledge areas are becoming so broad. Kubernetes adds complexity and expands the need for specific skills among teams.
Companies will find better ways to distribute site reliability engineering (SRE) knowledge across teams.
Policy-as-code for Kubernetes will mature and gain traction.
The struggle to quickly troubleshoot Kubernetes will be addressed.
SLIs and SLOs will be adopted by more teams and will drive investment decisions.
Let’s explore each of these predictions in more detail.
1. Companies create separate DevOps and platform teams
At various places on the internet, you’ll see articles and tweets about the fact that DevOps is dead. I entirely disagree: close collaboration between various disciplines remains critical. And the focus on automation and acceleration is vital to survive in this digital era.
However, companies struggle to bring yet another big area of knowledge to their teams. After CI/CD, the shift left of testing, the importance of monitoring, observability and security, teams are now tasked with gaining extensive knowledge of Kubernetes and other cloud platforms. These technologies offer tremendous business and potential financial advantages, but they are complicated environments to learn and maintain.
Companies of all sizes should consider where they want to build a knowledge base on Kubernetes. Many companies pick a platform team to build and set up this expertise, rather than trying to train every team that touches Kubernetes. A single platform team can support multiple DevOps teams. With this type of separation, DevOps teams continue to focus on developing and operating the (business) applications while the platform team builds and maintains a robust and reliable underlying platform. DevOps is not dead, but leaders should consider what levels of responsibility and technology support are realistic to push into every team.
2. Teams improve distribution of SRE knowledge
If you are a Kubernetes expert on an SRE team, you might not recognize this challenge. However, many teams do not have the site reliability engineering expertise to optimize Kubernetes usage. Many teams in companies of every size struggle with this issue. But with more and more companies striving to extend knowledge sharing, new models are starting to arise. 2023 will be a year in which many more companies develop best practices around how to spread knowledge internally.
Fundamental needs for a more reliable landscape, better performing applications and a process without too much waste will be the three main reasons to grow a culture of knowledge sharing. Here are several models I’m starting to see:
A central SRE team might be very suitable if you have mature engineering teams that are familiar with Kubernetes and other cloud technologies. If these engineering teams just need occasional guidance and direction or perhaps some support in tool selection, this model might fit your organization.
A coach squad is a perfect model if your teams are new to Kubernetes. A group of experts can go from one team to the other and help to grow the practice implementation as well as share knowledge. Prioritize working with your most critical teams first, help them out for a few weeks or months and then move on to the next team.
Local distribution of the knowledge in every DevOps team is obviously the strongest model. Enabling this approach might require time, but having site reliability knowledge in the teams that make use of Kubernetes is the ideal model, since it can continuously spark the improvement cycle.
3. Policy-as-code for Kubernetes will mature and gain traction
For several years the focus has been on enabling teams to become more and more autonomous in deploying applications to Kubernetes. Developing pipelines that can push out applications easily is now common practice in many enterprises.
Although autonomy is a great advantage, finding the balance in controlling some areas remains a challenge. With everything that is touched now being defined as-code, an entire world of possibilities is opening up. Policies defined as-code can be easily validated and reviewed by following established engineering practices. Therefore, policy frameworks will rise in importance. Within the CNCF, Open Policy Agent (OPA) is the most common policy framework. Open Policy Agent describes itself as an open source, general-purpose policy engine that unifies policy enforcement across the stack.
To continue the adoption of Kubernetes and autonomous teams, practices like this will mature in parallel to allow continuous growth while remaining in – or even gaining – control. Policy-as-code adoption enables you to control how Kubernetes is used by a wide range of teams.
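To make the idea of policy-as-code concrete, here is a minimal sketch in plain Python. It is not OPA and does not use OPA's policy language or APIs; the names `Policy`, `evaluate` and `EXAMPLE_POD`, as well as the two rules themselves, are assumptions made up for illustration. The point is only that a rule expressed as code can be versioned, reviewed and tested like any other change.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Policy:
    """A named rule; `check` returns a list of violation messages."""
    name: str
    check: Callable[[dict], list]


def _containers(manifest: dict) -> list:
    """Return the containers of a Pod-style manifest (empty if absent)."""
    return manifest.get("spec", {}).get("containers", [])


def no_privileged_containers(manifest: dict) -> list:
    # Illustrative rule: containers may not request privileged mode.
    return [
        f"container '{c.get('name')}' must not run privileged"
        for c in _containers(manifest)
        if c.get("securityContext", {}).get("privileged", False)
    ]


def require_resource_limits(manifest: dict) -> list:
    # Illustrative rule: every container must declare resource limits.
    return [
        f"container '{c.get('name')}' must set resources.limits"
        for c in _containers(manifest)
        if not c.get("resources", {}).get("limits")
    ]


POLICIES = [
    Policy("no-privileged", no_privileged_containers),
    Policy("require-limits", require_resource_limits),
]


def evaluate(manifest: dict, policies=POLICIES) -> dict:
    """Run every policy; return a mapping of policy name to violations."""
    results = {}
    for policy in policies:
        violations = policy.check(manifest)
        if violations:
            results[policy.name] = violations
    return results


# Hypothetical manifest, already parsed into a dict (for example from YAML).
EXAMPLE_POD = {
    "kind": "Pod",
    "metadata": {"name": "web"},
    "spec": {
        "containers": [
            {"name": "app", "securityContext": {"privileged": True}},
            {"name": "sidecar", "resources": {"limits": {"memory": "128Mi"}}},
        ]
    },
}
```

Calling `evaluate(EXAMPLE_POD)` reports the `app` container under both rules and leaves `sidecar` alone. In a real setup, a check like this would typically run in the delivery pipeline or at admission time, with the rules maintained in a dedicated policy framework rather than hand-written functions.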
4. Kubernetes troubleshooting challenges will be addressed
Troubleshooting applications running on a Kubernetes cluster at scale can be troublesome. Not just because Kubernetes itself is complex, but also due to the oft-hidden connections between so many moving parts. Having knowledge of all of them remains an issue. Providing troubleshooting solutions that help teams remediate issues effectively will provide a competitive advantage.
To get the full picture of what is happening when an alert occurs, there are four elements that need to be brought together. All four are needed to fully understand what's wrong, remediate it and do a deep analysis on what caused the issue to prevent it in the future:
Events: Various troubleshooting solutions are event driven; they show every change that has happened. Through this data, they can point you to where the issue is and what caused it.
Logs: Many teams use log analytics to spot both warnings and real issues, and then try to determine what went wrong. Logs provide great insights but they can be cumbersome to sift through.
Telemetry data: With more and more detailed metrics being produced and standards like OpenTelemetry growing in adoption, telemetry data is essential for troubleshooting Kubernetes. Detecting issues like performance degradation of services, memory usage problems or disks running out of space is helpful in solving these issues.
Trace data: Gathered through, for example, eBPF, trace data is powerful in helping you gain insights on golden signals like latency, error rate and throughput.
Solutions that bring these four elements together into a connected and easy-to-access format will start to help you more quickly determine what is wrong as well as how to remediate the issue. Vendors and open source frameworks will drive this trend.
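As a rough illustration of what "bringing the four elements together" can mean, the sketch below merges signals of all four kinds into one time-ordered view around an alert. Everything here is an assumption for the sake of the example: the `Signal` record, the resource naming such as `pod/checkout`, and the five-minute look-back window do not come from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The four elements discussed above, as simple labels.
KINDS = ("event", "log", "metric", "trace")


@dataclass(frozen=True)
class Signal:
    kind: str            # one of KINDS
    timestamp: datetime
    resource: str        # hypothetical identifier, e.g. "pod/checkout"
    detail: str


def timeline_for_alert(signals, resource, alert_time,
                       window=timedelta(minutes=5)):
    """Return the signals for one resource in the window leading up to
    the alert, oldest first."""
    start = alert_time - window
    relevant = [
        s for s in signals
        if s.resource == resource and start <= s.timestamp <= alert_time
    ]
    return sorted(relevant, key=lambda s: s.timestamp)


def missing_kinds(timeline):
    """List which of the four elements have no data in the timeline."""
    present = {s.kind for s in timeline}
    return [kind for kind in KINDS if kind not in present]
```

Reading the timeline oldest to newest is what lets a change event (say, a new image version) be placed next to the memory metric and the log line that followed it. `missing_kinds` flags the blind spots, which is often the first thing worth knowing during an incident.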
5. SLIs and SLOs will be more widely adopted and will drive investment decisions
Over the past several years, more and more teams have started using Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to measure and track how they are progressing against their defined targets. For most of that time, setting SLIs and SLOs has been an IT-focused exercise that happens without much visibility to the line of business.
The connection between SLIs, SLOs and Service Level Agreements (SLAs) will become more relevant and will be established through the help of tools that, at last, connect business and IT.
More importantly, SLIs and SLOs will not just be a unit of measurement, but will start sparking resource investment conversations. For example, “What services and areas in my Kubernetes environment are lagging behind and not meeting their SLOs?” These are the areas that will require additional investments. “What areas perform really well and require less investment?” Perhaps you can devote fewer resources there, or perhaps those areas receive the same level of investment but can be expected to support more experiments, go faster and try out new things.
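One common way to turn these questions into numbers is the error budget: the share of failures an SLO still allows. The sketch below assumes a simple request-based SLI (good events divided by total events); the `ServiceSLO` type, the service names and the figures in the usage note are hypothetical and only there to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSLO:
    name: str
    slo_target: float    # e.g. 0.999 means 99.9% of events must be good
    good_events: int
    total_events: int

    @property
    def sli(self) -> float:
        """Measured fraction of good events (1.0 when there is no traffic)."""
        if self.total_events == 0:
            return 1.0
        return self.good_events / self.total_events

    @property
    def budget_consumed(self) -> float:
        """Fraction of the error budget used; above 1.0 the SLO is missed."""
        allowed_bad = (1.0 - self.slo_target) * self.total_events
        actual_bad = self.total_events - self.good_events
        if allowed_bad == 0:
            return 0.0 if actual_bad == 0 else float("inf")
        return actual_bad / allowed_bad


def rank_by_budget_consumed(services):
    """Order services so the ones furthest behind their SLO come first."""
    return sorted(services, key=lambda s: s.budget_consumed, reverse=True)
```

For example, a service with a 99.9% target that served 100,000 requests with 250 failures has used roughly 2.5 times its budget, while a service with a 99% target and 200 failures on the same traffic has used about a fifth of its budget. The first is a candidate for additional investment; the second has room to move faster and experiment.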
In 2023, teams and leaders will become more aware of the data that is hidden in these SLOs and will turn it into valuable insights that drive investment decisions.
Conclusion
Kubernetes and the ecosystem around it are at an interesting moment in their growth towards maturity. With continued, or even accelerated, adoption of Kubernetes, companies and particularly engineering leaders need to start taking a broader look at knowledge within their teams, tools to facilitate growth and different ways of solving problems. If not, Kubernetes adoption might slow down, its results may be less than optimal, engineers might leave or compliance rules might be violated. These challenges would likely impact business goals, compromise stability and negatively affect customers. As you move into 2023, I recommend you join engineering leaders around the world in considering these five best practices to help build a solid Kubernetes foundation.