For nearly a year now, Kensho has been using GraphQL in a highly unusual way — by compiling GraphQL queries directly into a single, optimized graph database query. By hiding the complexity of the underlying database behind an expressive and easy-to-use GraphQL layer, we have been able to prototype and release new functionality to our clients in record time. Now, we are excited to release our GraphQL compiler as an open-source project !
Despite its name, GraphQL was not meant to be SQL for graph databases. Instead, GraphQL is an API query language developed by Facebook to improve the performance of interactions between a server and a client, such as a web browser or mobile app.
The client specifies what data fields it needs. ** Avoiding a predefined output format means that no unneeded data is sent, reducing the response transfer time to the necessary minimum.
The client can fetch multiple resources in a single query. Avoiding multiple back-to-back server requests means significantly less latency to complete loading data on the client.
Unsurprisingly, GraphQL isn’t a silver bullet — the same problems that GraphQL helps avoid in the client-server setting now arise between the server and the database. Since the GraphQL execution model requires that a “resolver” function must be invoked for each field that is part of the query, a single GraphQL query could easily cause hundreds of database queries . Even if some of these queries can be cached, the network round-trips will add up.
We found we didn’t need separate resolver functions for every field of a given query. Instead, we’d much rather make sure that every GraphQL query executes efficiently in a single database round-trip — and that is precisely why we compile GraphQL queries directly into database queries.
Expressing all database queries in GraphQL has other advantages, too:
Indeed, when we started work on our GraphQL compiler, the OrientDB database we use had not yet added the MATCH graph querying operator we now rely on, and instead only supported the Gremlin graph query language. When MATCH support arrived, its significant performance improvements made it worthwhile to add support for generating MATCH-based queries in our compiler. Rather than needing to rewrite all our database queries in the new query language, the GraphQL abstraction made getting those performance improvements as simple as recompiling our GraphQL .
Every GraphQL query our compiler can compile is fully spec-compliant and valid “regular” GraphQL. ** All existing GraphQL tools just work: syntax highlighting , autocompletion , linting , IDE integration …
We use the Python port of the standard GraphQL library for parsing GraphQL input and validating it against the GraphQL schema:
The GraphQL compiler code then takes the generated AST, and returns a compiled query string together with inferred type metadata. The compiler library aims to be simple and lightweight: it does not run the query against the database and it does not include a server implementation — it simply transforms GraphQL strings into query strings. That way, the compiler library doesn’t care if you use Flask or Django, or if you talk to your database over HTTP, over a binary protocol, or via smoke signals.
Since we use GraphQL as a database query language, we use GraphQL’s built-in extension points to increase its expressive power. We add custom scalar types to represent date and date-time objects, as well as several custom directives that expose powerful database functionality. We’ll explore some of these in the remainder of this post and in future blog posts.
If you are already familiar with regular GraphQL, the two biggest differences you’ll notice when using compiled GraphQL are the lack of per-field “resolver” functions, and the fact that the output from running the compiled queries against the database is in the standard tabular format (rows and columns) rather than the nested format of regular GraphQL.
This section assumes that you are at least vaguely familiar with GraphQL syntax — feel free to quickly skim the official introduction to GraphQL if you aren’t already comfortable with it. In the spirit of Facebook’s Star Wars-based GraphQL examples , we’ll use examples inspired by Game of Thrones — a Game of GraphQL , if you will. And don’t worry — GraphQL does not support spoilers!
Before we are able to query the database with GraphQL, we need to define a schema for GraphQL to use. The GraphQL schema is simply a translation of the database’s own schema into the GraphQL type system and can even be generated automatically by introspecting the database.
In the above schema, CharacterOrHouse
is an interface type, corresponding to a vertex class declared ABSTRACT
in OrientDB terms, with two property fields: a string name
and a list of strings alias
. As a reminder, [String]
in GraphQL notation simply refers to the “list of strings” type.
In contrast, Region
is a concrete (non-abstract) type meant to represent geographical regions: Westeros (the continent), The North (the kingdom) and Winterfell (the castle) are all of type Region
.
All edges in the graph are directed. For example, the edge whose name in the database is Lives_In
starts at a Character
and goes to a Region
. This is reflected in the schema: Character
has a field named out_Lives_In
, of type [Region]
, and Region
has a corresponding field named in_Lives_In
of type [Character]
.
The directedness of edges allows Region
vertices to point to their parent Region
vertex that contains them: Westeros is the parent region of The North, which in turn is the parent region of Winterfell.
Edge fields are also always of list type: ** a single Region
may have multiple characters living in it, and some characters may live in multiple regions.
Let’s start off simple and get the names of all characters in Game of Thrones.
Executing the resulting query string against the database returns a list of dictionary objects , each of which will have the key name
and the name of a character as its value:
Readers already familiar with GraphQL may also object at the fact that our query explicitly marks the name
field for output , via the @output
directive. To show why that is necessary, consider the following GraphQL query, which returns the other names by which Jaime Lannister is known:
In the above query, we want to filter based on the name
field, but we do not want to output it — therefore, we must explicitly specify which fields we want to output, and which unique name to assign to them in the response.
Let’s examine the @filter
directive more closely. It reads from left to right, “the Character’s name equals the parameter named character_name
.” This allows us to use the same query with different data:
The compiler explicitly prohibits passing literal values as filter arguments. This does not limit functionality — users can simply pass the same value to a runtime parameter each time a given query is used — but does keep the compiler’s code simpler. Allowing literal values in GraphQL queries, much like literal values in SQL queries, may also open up a dangerous query injection vector: rather than using runtime parameters, users may be tempted to use simple string interpolation to insert data into their queries. After all, SQL injection as a result of string interpolation is still a common security vulnerability, despite the fact that SQL systems have supported parameterized queries for decades!
We now ask for “the seat of power of each noble house.” This requires that we output the names of noble houses and the regions to which they are connected via a Has_Seat
edge. Aside from the custom @output
directive, the query is just normal GraphQL:
With this query, we are already doing fewer database round-trips than with regular GraphQL. Rather than calling resolver functions that individually query the database for the NobleHouse
and out_Has_Seat
fields, the compiler allows us to fetch all data from the database in a single operation.
As we add more and more fields to the query, regular GraphQL has to do more and more round-trips to the database, while compiled GraphQL always needs only one round-trip. Deeply-nested queries can be especially problematic with regular GraphQL, whereas they are no problem at all for compiled GraphQL: here at Kensho , we regularly execute GraphQL queries with 10 or more levels of nesting!
If we were to examine the (long) list of outputs produced by the above query, we’d notice that some NobleHouse
vertices, like house Cassel, were not returned. Sir Rodrik Cassel is the beloved master-at-arms of Winterfell and the head of this house, and due to his service, his house does not have a castle of their own to make their seat. The above query filters out all NobleHouse vertices that did not have a Has_Seat
edge — the Has_Seat
edge is required.
If we so choose, we can instead make the Has_Seat
edge optional, asking: “For each noble house, return their name and seat of power, if one exists.”
Since the edge is now marked @optional
, the compiler notes in the computed output metadata that the seat_name
column is output from an optional block and therefore may not always exist. Indeed, since House Cassel did not have the Has_Seat
edge, the seat_name
column is not present in its output row.
In addition to the $
syntax discussed above, there is another way to pass arguments to @filter
directives. Consider the following question: “Which regions have a name that contains the name of the house that rules in that region?” This query is tricky because the filter applied to the name of the Region
needs the value of the NobleHouse
connected to that region — the query refers to itself. The @tag
directive and tagged arguments come in handy here:
Here, we use the @tag
directive to tell the compiler to remember the name
property of NobleHouse
, and refer to it in the future as house_name
. Then, in the @filter
directive later on, we specify %house_name
as the argument to the has_substring
operation. The %
prefix of this parameter, unlike the $
prefix we used in a previous example, tells the compiler that the parameter comes from a tagged value in this query rather than being externally provided.
This self-referential querying functionality is not available in regular GraphQL, and is just one example where our holistic approach to querying a database enabled us to expose new functionality. In an upcoming part 2 of this blog post, we’ll demonstrate more such functionality:
The goal of this project was simple: empower anyone at Kensho — engineer, analyst, or designer alike — to write complex graph queries that perform as well as hand-written queries from our most experienced graph database engineers. The project faced many difficult and unique challenges, and even required us to contribute a pull request with an improved, topological scheduler for OrientDB . Even so, the project has been a resounding success!
Overall, writing database queries in GraphQL has reduced time spent debugging correctness and performance problems, and significantly improved our iteration velocity. What used to require the direct attention of one of our handful of graph database engineers has now become easy, even routine, for everyone on our team. We hope that as this open-source project grows, we will be able to extend these benefits to everyone using graph databases.
We’re hiring! Check out our careers page for details.
We also welcome pull requests for all of our open-source projects !
This content was originally published here.