Query language requirements
First of all, let's get an understanding of the StackState query language’s requirements. StackState automatically maintains a real-time and time-traveling model of a live IT environment by using the 4T data model. This model consists of 4 T-dimensions:
Topology = All IT components (e.g. microservices, applications, hosts, docker containers, etc.) and how they depend on each other, i.e. a dependency graph.
Telemetry = All logs, events (e.g. upgrades, health states, etc.) and metrics (e.g. memory usage, latency, throughput, etc.) that relate to either the components or the relations between them.
Traces = All transactions that flow over the topology to fulfill a certain request.
Time = All of the above needs to be up-to-date in near real-time and accessible at any point in time.
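To make this a bit more tangible, here is a rough sketch of these four dimensions as data types (my own illustration in Scala; not StackState's actual internal model):

// Topology: components and the relations between them form a dependency graph.
final case class Component(id: String, componentType: String, name: String)
final case class Relation(sourceId: String, targetId: String)

// Telemetry: logs, events and metrics that attach to components or relations.
final case class MetricPoint(streamId: String, timestamp: Long, value: Double)

// Traces: transactions that flow over the topology.
final case class Span(traceId: String, componentId: String, startTime: Long, durationMs: Long)

// Time: the whole model can be queried "as of" any instant.
final case class TopologySnapshot(at: Long, components: List[Component], relations: List[Relation])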
Now imagine a query language that can access any of these types of data and use them to join data from one T-dimension to the other. Pretty neat, right?
Here is an example. Let's say I have an AWS account with a serverless stack that is updated regularly. I want to know which versions of which Lambda functions were the cause of most problems in the last week. I would want to express the following query:
1. Find all API Gateways in the current topology.
2. For each API Gateway, get the count of HTTP 500 responses per 5 minutes from AWS CloudWatch for the last week.
3. Find all points in time where the count exceeded 100 errors per 5 minutes (i.e. spikes).
4. Get the Lambda functions that were connected to that exact API Gateway, with their version number at the time of the spikes.
Below I have expressed this exact query in STQL; the numbered comments correspond to the steps above:
Topology.query('type = "API Gateway"').components() // 1
  .then { it.collect { apiGateway ->
    Telemetry.query('AWS CloudWatch', "name=\"${apiGateway.name}\" AND HTTP_CODE=\"500\"") // 2
      .aggregate("event count", "5m")
      .start("-1w")
      .then { it.findAll { it.value > 100 }.collect { it.timestamp } } // 3
      .then { it.collect { spike ->
        Topology.query("withNeighborsOf(name = \"${apiGateway.name}\") AND type = \"Lambda\"")
          .at(spike) // 4
      }}
  }}
Some takeaway points:
This query joins topology and telemetry data.
Telemetry is coming directly from AWS CloudWatch.
The Lambda functions in the response are the ones that existed at the time the spikes occurred, not just the current ones (expressed by ".at(spike)").
You may have noticed that the syntax looks like the Groovy programming language; you are right!
Why did we not use SQL?
Knowing our query language requirements, our first thought was obviously to see if we could use SQL - since most developers prefer using SQL. Nearly a year ago there was a really popular blog post called "I don't want to learn your garbage query language" (https://erikbern.com/2018/08/30/i-dont-want-to-learn-your-garbage-query-language.html). We recognize these complaints, and at times I have had the exact same ones. Nevertheless, we still decided not to go down the SQL route. Our two main reasons:
SQL is great for relational databases, but not suitable for graph databases. The topology part of StackState is fundamentally a graph. In fact, we have created our own time-traveling graph database to make temporal topology a reality (we can issue topology queries at any point in time, with up to millisecond accuracy). Therefore, being able to execute graph queries is a hard requirement that we needed to fulfill. Trying to do graph queries with SQL would be like trying to fit a square peg into a round hole.
N.B. Recently it was announced that a graph extension to SQL, namely GQL, is being worked on (see https://gql.today/). Having worked with several graph query languages, I have to admit that I am a bit skeptical about how much popularity this extension will gain. However, I certainly would not exclude that at some point in time we would switch to GQL.

In addition to that compelling first reason, there was a second one: trying to stick to the SQL spec from the beginning would have made our lives considerably more difficult, without getting many short-term benefits from it. One should take the economics of such choices into account; we are still a startup. Perhaps in some cases there are ways to start with a subset of SQL that eventually grows into the full language; however, I can understand why companies do not want to put those constraints on themselves.
Although we did not take the SQL route, we managed to find a very good alternative without reinventing the wheel - and we even did it in such a way that our users can reuse their knowledge of other languages. We are leveraging Groovy, a dynamic superset of Java, which is currently one of the most popular programming languages of all time (at least according to the TIOBE index of July 2019).
Researching other languages
Besides considering SQL, we looked at several languages from both graph databases and monitoring products. Naming them all would be too big a list for this blog, but below I will briefly go over a few of the languages that we considered.
SPL - Splunk query language.
This language has the benefit of being very easy to get started with, whilst offering quite a bit of power and flexibility. Each query starts with a search on a search index (think of a table), the results of which can then be piped into any other kind of filter or function (like UNIX pipes).
Here's an example SPL query that searches for all triggered alerts:
index=_audit action=alert_fired ss_app=*
| eval ttl=expiration-now()
| search ttl>0
| convert ctime(trigger_time)
| table trigger_time ss_name severity
| rename trigger_time as "Alert Time" ss_name as "Alert Name" severity as "Severity"
Considering the Splunk data model, Splunk could probably also have chosen to implement SQL. However, Splunk once started off as a startup too, so my guess is that economics played into it as well. Nevertheless, SPL is a very powerful language and quite popular in the monitoring space; a lot of people make their living by building advanced SPL queries. It is interesting to note that the power of this language comes from two concepts: pipes and sub-queries.
Kusto - the Azure Data Explorer language.
Kusto is used to query data from Azure Application Insights and Azure Monitor. Here is a query that gets the number of storms in Florida during November 2007:
StormEvents
| where StartTime >= datetime(2007-11-01) and StartTime < datetime(2007-12-01)
| where State == "FLORIDA"
| count
Considering their data model, Microsoft could probably have chosen to implement SQL - perhaps this once again was an economic decision? However, they did not. The description of the language even states:
"A Kusto query is a read-only request to process data and return results. The request is stated in plain text, using a data-flow model designed to make the syntax easy to read, author, and automate. The query uses schema entities that are organized in a hierarchy similar to SQL's: databases, tables, and columns."
Nonetheless, this is an interesting language to look at, as it takes some of the concepts of SPL but mixes them with SQL-like concepts, resulting in something a bit more readable. Instead of starting with a search, you start with a data source, then you progressively filter and transform the data going forward.
Gremlin - the TinkerPop graph query language.
StackState has its own time-traveling graph database under the hood. The query language that we use to query this graph database from our Scala codebase is, in fact, Gremlin. This allows our developers, but also our power users (we expose Gremlin), to get any data they want from the database. So we know and love Gremlin intimately. It is, however, a language that is highly specialized for graph querying and has a fairly steep learning curve.
The same concept of a starting point and then progressively filtering and transforming data also exists in Gremlin but is described in so-called steps. Each step describes a path to take in iterating through the graph, but can also simply be a filter or transform. With this simple abstraction, Gremlin makes it possible to describe complex graph algorithms - such as shortest path, centrality calculation, etc.
Here's a Gremlin query that looks for cyclic paths in a graph:
g.V().as('a').repeat(out().simplePath()).times(2).
  where(out().as('a')).path().
  dedup().by(unfold().order().by(id).dedup().fold())
(Source: http://tinkerpop.apache.org/docs/current/recipes/#cycle-detection)
The builder pattern
What is interesting to note is that the Gremlin language does not require a parser. In fact, it is actually not a language of its own, but a so-called 'embedded Domain-Specific Language' (eDSL). Gremlin queries can be written in many languages like Scala, Groovy, Java, Python, JavaScript, etc.
Gremlin is fully implemented with the so-called builder pattern, a pattern well known to object-oriented programmers. The builder pattern is a way of describing a computation by chaining method calls on a so-called 'builder object'. It is essentially a way to build an eDSL within any object-oriented programming language.
I would not be surprised if Kusto actually leverages the builder pattern under the hood. Here is the same Kusto example query, but now written in Scala with the builder pattern:
KustoBuilder("StormEvents")
.where("StartTime >= datetime(2007-11-01) and StartTime < datetime(2007-12-01)")
.where("State == 'FLORDIA'")
.count()
Here's the implementation of KustoBuilder in Scala:
case class KustoBuilder(tableName: String, filters: List[String] = List.empty) {
  // Builder method: remember the filter and return a new builder.
  def where(f: String): KustoBuilder = copy(filters = filters :+ f)

  // Evaluation method: evaluate the accumulated query and return the result.
  def count(): Int = {
    23 // execute the query here
  }
}
A builder class has two types of methods. There are the builder methods, such as "where", that only store the request and return a new builder. And there are the evaluation methods, such as "count", that use the stored-up requests and evaluate the query to get a certain type of result. Since the result of an evaluation can be the input for a new builder, you can recursively keep going and create a very powerful language with just a few lines of code.
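As a small illustration (reusing the KustoBuilder sketch above; the variable names are mine), the two kinds of methods play together like this:

// Each "where" call only stores a filter and returns a fresh, immutable builder,
// so intermediate builders can be shared and extended independently.
val november = KustoBuilder("StormEvents")
  .where("StartTime >= datetime(2007-11-01) and StartTime < datetime(2007-12-01)")

val floridaStorms = november.where("State == 'FLORIDA'").count() // evaluation happens here
val georgiaStorms = november.where("State == 'GEORGIA'").count() // reuses the shared prefix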
Hybrid scripting and parsing
The 4T data model essentially contains two types of data: temporal graph data (topology and traces) and table-based data (topology metadata and telemetry). The fact that the builder pattern works very well both for graphs (e.g. Gremlin) and for table-based data (e.g. Kusto) gave us a solid base to build on.
Acknowledging that Gremlin is too complex, and that languages such as Kusto and SPL follow a similar pattern that is easily imitated by the builder pattern, we chose a hybrid approach that combines scripting with a simple query language. In the builder example above, you can see how this hybrid approach would work: in the "count" method, the remembered "filters" strings would still need to be parsed. This is exactly how we execute a query in StackState:
Topology.query('domain = "AWS" OR domain = "Azure"')
  .diff(Topology.query('domain = "AWS" OR domain = "Azure"').at('-1d'))
(compare the current topology of AWS and Azure with the same topology yesterday)
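To make the "parse at evaluation time" idea concrete, here is a minimal, self-contained sketch in Scala; the names and the toy parser are mine, not StackState's actual implementation. Builder methods only accumulate raw filter strings, and the evaluation method parses them when the query runs:

object HybridQuerySketch {
  // A trivially simple parsed form: field = "value" conditions.
  final case class Condition(field: String, value: String)

  final case class TopologyQuery(filters: List[String] = Nil) {
    // Builder method: remember the raw filter string, return a new builder.
    def query(filter: String): TopologyQuery = copy(filters = filters :+ filter)

    // Evaluation method: only now are the stored strings parsed (and then executed).
    def components(): List[Condition] = filters.flatMap(parse)

    // A toy parser for 'field = "value" AND field = "value"' expressions.
    private def parse(filter: String): List[Condition] =
      filter.split("AND").toList.map { clause =>
        val Array(field, value) = clause.split("=", 2).map(_.trim)
        Condition(field, value.stripPrefix("\"").stripSuffix("\""))
      }
  }

  def main(args: Array[String]): Unit = {
    val parsed = TopologyQuery()
      .query("domain = \"AWS\" AND environment = \"Production\"")
      .components()
    println(parsed) // List(Condition(domain,AWS), Condition(environment,Production))
  }
}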
Learning curve
With this hybrid approach, we are able to cater to three types of users: beginner, intermediate and advanced users. Beginners and intermediates use a (parsed) query language, whereas advanced users embed queries within a scripting environment using the builder pattern.
Beginners navigate their way through the data with the user interface and a query is built under the hood. Here is a simple topology search that is generated by the GUI for a beginner browsing through a StackState topology:
environment = "Production"
The previous topology search can now be extended by the intermediate user by navigating to the advanced query section:
environment = "Production" AND label in ("user-facing", "client") AND domain != "agent"
Advanced users who want to weave through several types of data can take any query they have built and embed it into the scripting language:
Topology.query('environment = "Production" AND label in ("user-facing", "client") AND domain != "agent"')
.fullComponents()
.metricStreams()
.then { Telemetry.query(it).start('-1h').aggregation('5m', 'max') }
The power of a scripting language
The cool thing about this hybrid embedding strategy is that you can combine the capabilities of the builder classes with the capabilities of the underlying language. For example, if you wanted to get all telemetry with values above 100, you could add this line to the above query:
.then { metrics -> metrics.findAll { metric -> metric.value > 100 } }
With the power of Groovy, this can even be shortened to:
.then { it.findAll { it.value > 100 } }
The "then" method is a method of our builders' supply for asynchronous chaining. The "findAll" method, however, is a standard method that the Groovy script language supplies for any collection. It evaluates eagerly, which does make it a bit slower, but nothing stops us from implementing a findAllmethod on the Telemetry query builder such that we can prevent pulling too much data over the network. It would look something like this:
.findAll("value > 10")
Again, the economics work in our favor. We get a computationally expensive method for free, and at the same time, we get to optimize whenever we need to. We can learn our users' needs from what they are doing and evolve the language that way.
All in all, we see a great return on investment by using this strategy. Especially as we already had a scripting language with proper sandboxing, RBAC controls and tooling. Essentially, anybody who understands the 4T data model, a simple conditional filter language, and has a bit of understanding of scripting and/or Java, can build highly complicated queries that span multiple data types and data sources with just a few lines of code!
Tl;dr
We did not take the SQL route partly because it does not fit our data model (a graph) and partly because of economics.
Many query language features are also scripting language features. Using the builder pattern, one can build a query language as an embedded DSL on top of a scripting language without building a parser. Users can reuse their knowledge of the scripting language at hand, and there is no reinventing the wheel.
Combining a simple parser and a scripting language is a powerful and economical way to bootstrap a new query language. This hybrid approach works well for users at different stages of learning.
Using a scripting language gives a lot of power to advanced users. If at any point in time the features of the native scripting language become a performance bottleneck or usability issue, you can either augment the native script language features with builder methods or make your parser more powerful. This way you start with a powerful language and have a straightforward path to incrementally increasing performance and usability.
About StackState
StackState is a full-stack observability platform based on a one-of-a-kind versioned graph database that tracks changes in dependencies, relationships, and configurations over time. It gives you:
clear, end-to-end observability of your complete topology
the probable root cause of an issue - instead of stressful alert storms that just give symptoms
a zero-configuration anomaly detector that identifies abnormal behavior to help you prevent new issues from happening
Curious to see how it can make your job easier? Watch a video or book a live demo here.