What about Zanzibar?

Zanzibar is Google’s internet-scale permissions service. Google described it in a 2019 whitepaper, “Zanzibar: Google’s Consistent, Global Authorization System.” Currently, no cloud host provides a publicly available implementation of Zanzibar, although several projects offer self-hosted or managed Zanzibar-as-a-service. The best current implementation is from Authzed, which provides both a free, open source, self-hosted server and a SaaS managed service option.

Most implementations in the wild are in-house services for distributed applications. A surprising number of large applications have a Zanzibar team in-house, which is directly responsible for their user experience, infrastructure, and tooling. Very few of them say anything in public about these efforts.

What is a permission service?

Authentication is a different conceptual problem from authorization. Authentication allows a user to prove their identity. Once authenticated, authorization provides a list of application resources and actions on those resources that the user has at that moment in time. Access to resources and actions may be granted or revoked within the operation of the application.

For instance, with a new personal account one may authenticate to GitHub, but afterwards one only has authorization to create or push to repositories in one’s personal account, and the user interface adjusts to only render actions the user can perform. Multitenant collaborative applications like GitHub are built around managed application objects like repositories, commits, and issues, which can be shared and operated on by many users. New GitHub users only get access to managed objects outside their account through administrative actions by a GitHub organization admin through the application.

Usually, developers who are confronted by the requirement to add authorization and permission to their application extend their database schemas to provide it. However, adding permission checks to your internet application database does not scale in a few important ways.

For performance reasons, no customer-facing endpoint handler should do database joins. Joining in the real-time request-response loop should be considered a design bug. If the application’s API requires a join in the most common 80 percent of operations, something almost certainly needs to be redesigned.

In addition, transitive permission graphs can be hard to maintain. What if a user is added to a team that has permissions on ten resources, each of which in turn has permissions on ten more resources? Does that operation need to write a hundred records? What if the application gets new requirements and has to change the permissions for every user, such as granting all “author” users permission to edit? Having to materialize every transitive permission change across a user base is a big job with many opportunities for error.

And finally, schema-hosted permission models usually have problems when the concept of a team enters the game. What if a user is added in an admin role to a team that has regional sub-teams, which in turn have functional sub-sub-teams? In order to check that permission, each team must be queried. How many sub-queries can a designer ask their database to do? Keep in mind this check needs to happen on every request to every endpoint.
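To make that cost concrete, here is a minimal sketch of a schema-hosted admin check over nested teams. The table names (`team`, `team_role`), the `parent_id` column, and the sample data are assumptions invented for illustration, not part of any real application. The point is only that each level of nesting adds more queries to a check that runs on every request.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create a hypothetical three-level team hierarchy in memory."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE team (id TEXT PRIMARY KEY, parent_id TEXT);
        CREATE TABLE team_role (team_id TEXT, user_id TEXT, role TEXT);
        INSERT INTO team VALUES
            ('global', NULL),
            ('emea', 'global'),
            ('emea-support', 'emea');
        INSERT INTO team_role VALUES ('global', 'alice', 'admin');
        """
    )
    return db


def is_admin(db: sqlite3.Connection, user_id: str, team_id: str):
    """Walk up the hierarchy; return (allowed, number_of_queries_issued)."""
    queries = 0
    current = team_id
    while current is not None:
        # One query to look for a direct admin role on this team...
        queries += 1
        row = db.execute(
            "SELECT 1 FROM team_role "
            "WHERE team_id = ? AND user_id = ? AND role = 'admin'",
            (current, user_id),
        ).fetchone()
        if row is not None:
            return True, queries
        # ...and another to find the parent team to check next.
        queries += 1
        current = db.execute(
            "SELECT parent_id FROM team WHERE id = ?", (current,)
        ).fetchone()[0]
    return False, queries
```

In this sketch, an admin of the top-level team needs five queries to be recognized on a sub-sub-team, and a user with no role at all costs six. A recursive query could fold these into one statement, but the database still does the walk.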

The reader should be picturing a red-hot cartoon database server and users drumming their fingers while a clock’s hands spin wildly. Further, all of the work the designer hands to the database, and all of the work they make for themselves in writing the schema, the specialized API, and the admin tools, does not get them any closer to making the application do its day job.

A secondary concern: schema-based permission checks make it difficult to figure out why an operation is slow. Is it slow because the application data throughput is slow, or is it spending a lot of time resolving permissions? Databases can be tuned, but the number of permission-bearing structures that have to be checked for a given resource is not visible in a query plan.

The Zanzibar paper (which you should read if this is of interest; it is a short read) describes data structures, cache design, and multi-server cache distribution functionality working in concert to keep permission checks fast.

Data structure

Because the permissions carry no application weight at all, the data structures are simple. Each resource type in the application can have a relation to any other. Users are usually, but not required to be, a resource type.

In the model shown below, teams are group proxies for users. They can contain users and other teams, with team member roles defined so as to be inherited by related resources. In this example, chat rooms can have relations to users or teams (and thereby teams’ related users and teams). Note that there is no cardinality constraint on relations. That is an application-domain concern.

Different sets of users are defined with named roles. Specific permissions are mapped to algebraic combinations of user sets defined by those roles.


definition user {
}
 
definition team {
    relation team: team
    relation owner: user | team#owner
    relation admin: user | team#admin
    relation member: user | team#member
 
    permission lock_team = owner
    permission unlock_team = owner
 
    permission add_team_admin = owner + admin
    permission delete_team_admin = owner + admin
    permission view_team_admins = owner + admin + member
 
    permission add_team_member = owner + admin
    permission delete_team_member = owner + admin
    permission view_team_members = owner + admin + member
 
    permission change_team_name = owner + admin
    permission notify_team = owner + admin
}
 
definition chat_room {
    relation team: team
    relation user: user
 
    relation owner: user | team#owner
    relation admin: user | team#owner | team#admin
    relation member: user | team#member
 
    permission lock_chat_room = owner
    permission unlock_chat_room = owner
 
    permission add_chat_room_admin = owner + admin
    permission delete_chat_room_admin = owner + admin
    permission view_chat_room_admins = owner + admin + member
 
    permission add_chat_room_member = owner + admin
    permission delete_chat_room_member = owner + admin
    permission view_chat_room_members = owner + admin + member
 
    permission add_team = owner + admin
    permission delete_team = owner
}

This example is only one way to model this application. There are many possible implementations depending on the application details. For instance, because this is modeling a chatroom and people are terrible, the ability to ban users might be required.


definition chat_room {
    relation team: team
    relation user: user
 
    relation banned: user      // add a relation for banned users
 
    relation owner: user | team#owner
    relation admin: user | team#owner | team#admin
    relation member: user | team#member
 
    // ...snip...
 
    // prevent banned users from doing these actions
    permission add_chat_room_admin = (owner + admin) - banned
    permission delete_chat_room_admin = (owner + admin) - banned
    permission view_chat_room_admins = (owner + admin + member) - banned
 
    // ...etc...
}
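The algebra in these permission lines reads naturally as set operations: `+` is a union of user sets and `-` excludes one set from another. The following is a minimal sketch of that idea using plain Python sets. The user names and the `Relations` dictionary are hypothetical, and a real permission server resolves these sets from stored tuples rather than from an in-memory dictionary.

```python
from typing import Dict, Set

# Maps a relation name to the set of user ids that hold it on one chat room.
Relations = Dict[str, Set[str]]


def can_add_chat_room_admin(rel: Relations, user: str) -> bool:
    # Mirrors: (owner + admin) - banned
    return user in (rel["owner"] | rel["admin"]) - rel["banned"]


def can_view_chat_room_admins(rel: Relations, user: str) -> bool:
    # Mirrors: (owner + admin + member) - banned
    return user in (rel["owner"] | rel["admin"] | rel["member"]) - rel["banned"]


# Hypothetical state for a single chat room.
room: Relations = {
    "owner": {"ana"},
    "admin": {"ben", "cy"},
    "member": {"cy", "dee"},
    "banned": {"cy"},
}
```

Here `cy` is both an admin and a member, but the exclusion removes `cy` from every permission that subtracts `banned`.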

Permission model DSL files can be uploaded to the permission server just as schema files can be uploaded to a database. Uploading a new one will replace the existing model, invalidate caches as necessary, and begin distributing the model to peer nodes.

Applications have access to an API that allows developers to write code that can check permissions, add and delete tuples, and get materialized resource and permission graphs as necessary.

The internal identifiers for objects should always be primary keys from the application database. Zanzibar stores instance information as assertion tuples of the form:

SUBJECT           is a RELATION    of OBJECT
unique@app.tld    viewer           chatroom_36
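As a rough illustration of how a check can be answered from such tuples, the sketch below stores assertions in a list and follows subjects of the form `object#relation` (for example, everyone who is a `member` of a team). The `RelationTuple` type, the `check` function, and the ids `team_7` and `other@app.tld` are hypothetical names chosen for this example. It is not the Zanzibar or Authzed API, and a real server would index tuples rather than scan a list.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class RelationTuple(NamedTuple):
    subject: str   # a user id, or a userset such as "team_7#member"
    relation: str
    obj: str


def check(
    tuples: List[RelationTuple],
    subject: str,
    relation: str,
    obj: str,
    _seen: Optional[Set[Tuple[str, str]]] = None,
) -> bool:
    """Return True if `subject` has `relation` on `obj`, directly or via usersets."""
    seen = _seen if _seen is not None else set()
    if (relation, obj) in seen:
        return False  # already explored; also guards against cycles
    seen.add((relation, obj))
    for t in tuples:
        if t.relation != relation or t.obj != obj:
            continue
        if t.subject == subject:
            return True  # direct assertion
        if "#" in t.subject:
            # Subject is a userset: recurse into "inner_obj#inner_relation".
            inner_obj, inner_relation = t.subject.split("#", 1)
            if check(tuples, subject, inner_relation, inner_obj, seen):
                return True
    return False


# Hypothetical data: the user is a team member, and the team's members
# are members of the chat room.
TUPLES = [
    RelationTuple("unique@app.tld", "member", "team_7"),
    RelationTuple("team_7#member", "member", "chatroom_36"),
    RelationTuple("other@app.tld", "owner", "chatroom_36"),
]
```

With this data, `check(TUPLES, "unique@app.tld", "member", "chatroom_36")` succeeds through the team, even though no tuple names that user on the chat room directly.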

Caching

Optimizing caching strategy for an application domain depends on the graph of objects being governed and the algebraic graph of the permissions over the relations.

Each relation that is used in a permission term, such as “owner,” “admin,” or “banned” in the above example, maps to a cached list of user ids for fast permission checking. Adding or deleting an assertion on one of those relations changes the corresponding cached list.
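A minimal sketch of that idea follows: a cache keyed by object and relation that holds the resolved set of user ids and is dropped when an assertion on that relation changes. The `RelationCache` class and its `load` callback are hypothetical names for this example, and the whitepaper's actual cache design is considerably more involved.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

Key = Tuple[str, str]  # (object id, relation name)


class RelationCache:
    """Caches the set of user ids per (object, relation) pair."""

    def __init__(self, load: Callable[[str, str], Iterable[str]]) -> None:
        self._load = load  # hypothetical callback into the tuple store
        self._cache: Dict[Key, FrozenSet[str]] = {}
        self.loads = 0     # counts trips to the backing store

    def users(self, obj: str, relation: str) -> FrozenSet[str]:
        key = (obj, relation)
        if key not in self._cache:
            self.loads += 1
            self._cache[key] = frozenset(self._load(obj, relation))
        return self._cache[key]

    def invalidate(self, obj: str, relation: str) -> None:
        # Called whenever an assertion on this relation is added or deleted.
        self._cache.pop((obj, relation), None)


# Hypothetical backing store standing in for the tuple database.
STORE = {("chatroom_36", "banned"): {"cy"}}
cache = RelationCache(lambda obj, rel: STORE.get((obj, rel), set()))
```

Repeated checks against the same relation hit the cached set, and only a write to that relation forces a reload.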

For multi-region applications, maintaining permission server instances in the regions where users are located is the only way to keep permission checks fast. Zanzibar servers maintain a list of cache segments that are kept fresh. Each server is notified when a cache segment is changed so it can use Most Frequently Used metrics to decide whether to pre-fetch invalidated segments or wait to fetch them on demand.
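One way to picture that decision is a per-server usage counter: segments among the most frequently used are re-fetched as soon as they are invalidated, and the rest wait for the next request. The `SegmentTracker` class and its fixed top-N policy are assumptions made for this sketch, not a description of how any real server chooses.

```python
from collections import Counter


class SegmentTracker:
    """Tracks how often each cache segment is used on this server."""

    def __init__(self, prefetch_top_n: int) -> None:
        self.hits: Counter = Counter()
        self.prefetch_top_n = prefetch_top_n  # hypothetical policy knob

    def record_use(self, segment: str) -> None:
        self.hits[segment] += 1

    def on_invalidated(self, segment: str) -> str:
        """Decide what to do when a peer reports that `segment` changed."""
        hot = {name for name, _ in self.hits.most_common(self.prefetch_top_n)}
        return "prefetch" if segment in hot else "on_demand"


tracker = SegmentTracker(prefetch_top_n=1)
for _ in range(5):
    tracker.record_use("chatroom_36/member")
tracker.record_use("chatroom_99/member")
```

In this example the busy segment is pre-fetched on invalidation, while the rarely used one is left until something asks for it.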

This job is easier for cloud hosts because they own their own networks and data centers, which they can tune and scale as they please. This scale of effort is not realistic for most application teams, which is why the SaaS model offered by Authzed might make sense for some clients. They have open-sourced their server, but the region-balancing and the hosted and managed service are paid only. The open-source version has the same public API, so an application can bootstrap for free and switch over to the paid tier reasonably painlessly.

Zookies

There is an additional feature in the whitepaper that is designed to keep caches fast. Server API calls return a “Zookie,” short for “Zanzibar cookie.” A Zookie is an opaque token that encodes a timestamp. The Zookie should be stored so it can be passed back with subsequent permission API calls about the resource. This allows the server to know whether the cache segments that were used to make the prior decision have been invalidated in the interim, thus requiring cache updates before an answer can be given. In many cases, the cache slices can be reused for the permission check at hand even if they are invalid for other uses.
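The sketch below illustrates the mechanism under one stated assumption: that the token is an opaque encoding of a timestamp, which is how the whitepaper describes it. The encoding chosen here (a base64-wrapped 64-bit integer) and the function names are invented for illustration, and real tokens should be treated as opaque by clients.

```python
import base64
import struct


def encode_zookie(timestamp_micros: int) -> str:
    """Server side: wrap a timestamp in an opaque, URL-safe token."""
    raw = struct.pack(">Q", timestamp_micros)
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_zookie(zookie: str) -> int:
    """Server side: recover the timestamp from a token it issued earlier."""
    (timestamp_micros,) = struct.unpack(">Q", base64.urlsafe_b64decode(zookie))
    return timestamp_micros


def can_answer_from_cache(cache_snapshot_micros: int, zookie: str) -> bool:
    """A cached segment is usable if it is at least as fresh as the zookie."""
    return cache_snapshot_micros >= decode_zookie(zookie)


# The application stores this token alongside the resource it was issued for.
stored_zookie = encode_zookie(1_700_000_000_000_000)
```

A segment that is stale relative to the latest writes can still satisfy a check whose token is older than the segment, which is why some checks avoid a refresh.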

Two sources of truth

Using a separate service for permissions and data means that the application state is spread over two services. Both sources must be in sync for applications to work properly.

This introduces complications in two areas. Depending on the location of an application’s users and the database infrastructure, race conditions can be introduced to an application’s operation.

The “new enemy” scenario demonstrates why this is a concern. In this scenario, “user Y” might revoke “user X”’s permission to see a document store, followed by Y adding a new document to the store that X is not supposed to see. It is possible, especially across regions, for the new document notification to arrive before the permission change.

Application endpoint middleware has more to do to prevent this. In a system with permissions built into the database schema, the middleware could just issue a database query and interpret the result to distinguish between a permission error, an application error, and an unexpected state.

With a Zanzibar service, in an endpoint which modifies permissions, such as add_team_member, an application must

• Check permission to use endpoint

• Update permission, returning error on failure

• Update application database, reversing permission update on failure

• Report state breakage if the reversal of the permission update fails
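The steps above can be sketched as a single handler. Everything named here is hypothetical: `InMemoryPermissions` and `InMemoryDatabase` are stand-ins for a permission service client and an application database, and `StateBreakage` is an invented exception for the case where the two stores can no longer be reconciled automatically.

```python
class StateBreakage(Exception):
    """Raised when the permission write could not be reversed."""


class InMemoryPermissions:
    """Stand-in for a permission service client (hypothetical)."""

    def __init__(self, tuples=(), fail_delete=False):
        self.tuples = set(tuples)
        self.fail_delete = fail_delete

    def check(self, subject, permission, obj):
        # add_team_member = owner + admin in the example schema.
        return any((subject, r, obj) in self.tuples for r in ("owner", "admin"))

    def write(self, subject, relation, obj):
        self.tuples.add((subject, relation, obj))

    def delete(self, subject, relation, obj):
        if self.fail_delete:
            raise RuntimeError("permission service unavailable")
        self.tuples.discard((subject, relation, obj))


class InMemoryDatabase:
    """Stand-in for the application database (hypothetical)."""

    def __init__(self, fail=False):
        self.members = set()
        self.fail = fail

    def add_member(self, team, user):
        if self.fail:
            raise RuntimeError("db down")
        self.members.add((team, user))


def add_team_member(perms, db, actor, team, new_member):
    # 1. Check permission to use the endpoint.
    if not perms.check(actor, "add_team_member", team):
        raise PermissionError(f"{actor} may not add members to {team}")
    # 2. Update the permission; any error propagates to the caller.
    perms.write(new_member, "member", team)
    try:
        # 3. Update the application database.
        db.add_member(team, new_member)
    except Exception:
        try:
            # Reverse the permission update on failure.
            perms.delete(new_member, "member", team)
        except Exception as undo_error:
            # 4. Both stores now disagree; report it loudly.
            raise StateBreakage(
                f"orphaned permission: {new_member} member of {team}"
            ) from undo_error
        raise  # re-raise the original database error
```

The ordering is a design choice: writing the permission first means a failure leaves, at worst, a permission that points at a missing membership row, which this sketch then tries to remove.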

These middleware patterns must be carefully thought through to keep state unified and handle consequences of conditions in either data source. In rare cases, both permissions and resources could be orphaned, with permission pointing to nonexistent resources or vice-versa.

Jason Harenski is an architect in the Logic20/20 Architecture practice.
