Our Quest for Faster Boot Times and Offline Support

Slack Engineering

By Jim Whimpey

We recently rolled out a new version of Slack on the desktop, and one of its headlining features is a faster boot time. In this post, we’ll take a look back at our quest to get Slack running quickly, so you can get to work.

The rewrite began as a prototype called “speedy boots” that aimed to–you guessed it–boot Slack as quickly as possible. Using a CDN-cached HTML file, a persisted Redux store, and a Service Worker, we were able to boot a stripped-down version of the client in less than a second (at the time, normal boot times for users with 1–2 workspaces was around 5 seconds). The Service Worker was at the center of this speed increase, and it unlocked the oft-requested ability to run Slack offline as well. This prototype showed us a glimpse of what a radically reimagined desktop client architecture could do. Based on this potential, we set about rebuilding a Slack client with these core expectations of boot performance and offline support baked in. Let’s dive into how it works.

A Service Worker is a powerful proxy for network requests that allows developers to take control of the way the browser responds to individual HTTP requests using a small amount of JavaScript. They come with a rich and flexible cache API designed to use Request objects as keys and Response objects as values. Like Web Workers, they live and run in their own process outside of any individually running window.

Service Workers are a follow-up to the now-deprecated Application Cache, a set of APIs previously used to enable offline-capable websites. AppCache worked by providing a static manifest of files you’d like to cache for offline use… and that was it. It was simple but inflexible and offered no control to developers. The W3C took that feedback to heart when they wrote the Service Worker specification that provides nuanced control of every network interaction your app or website makes.

When we first dove into this technology, Chrome was the only browser with released support, but we knew universal support was on its way. Now support is ubiquitous across all major browsers.

When you first boot the new version of Slack we fetch a full set of assets (HTML, JavaScript, CSS, fonts, and sounds) and place them in the Service Worker’s cache. We also take a copy of your in-memory Redux store and push it to IndexedDB. When you next boot we detect the existence of these caches; if they’re present we’ll use them to boot the app. If you’re online we’ll fetch fresh data post-boot. If not, you’re still left with a usable client.

To distinguish between these two paths we’ve given them names: warm and cold boots. A cold boot is most typically a user’s first ever boot with no cached assets and no persisted data. A warm boot has all we need to boot Slack on a user’s local computer. Note that most binary assets (images, PDFs, videos, etc.) are handled by the browser’s cache (and controlled by normal cache headers). They don’t need explicit handling by the Service Worker to load offline.

The basic cold and warm boot decision tree

There are three events a Service Worker can handle: install, fetch, and activate. We’ll dig into how we handle each but first we’ve got to download and register the Service Worker itself. The lifecycle depends on the way browsers handle updates to the Service Worker file. From MDN’s API docs:

Installation is attempted when the downloaded file is found to be new — either different to an existing Service Worker (byte-wisecompared), or the first Service Worker encountered for this page/site.

Every time we update a relevant JavaScript, CSS, or HTML file it runs through a custom webpack plugin that produces a manifest of those files with unique hashes (here’s a truncated example). This gets embedded into the Service Worker, triggering an update on the next boot even though the implementation itself hasn’t changed.

Install

Whenever the Service Worker is updated, we receive an install event. In response, we loop through the files in the embedded manifest, fetching each and putting them in a shared cache bucket. Files are stored using the new Cache API, another part of the Service Worker spec. It stores Response objects keyed by Request objects: delightfully straight-forward and in perfect harmony with the way Service Worker events receive requests and return responses.

We key our cache buckets by deploy time. The timestamp is embedded in our HTML so it can be passed to every asset request as part of the filename. Caching the assets from each deploy separately is important to prevent mismatches. With this setup we can ensure our initially fetched HTML file will only ever fetch compatible assets whether they be from the cache or network.

Fetch

Once registered, our Service Worker is setup to handle every network request from the same origin. You don’t have a choice about whether or not you want the Service Worker to handle the request but you do have total control over what to do with the request.

First we inspect the request. If it’s in the manifest and present in the cache we return the response we have cached. If not, we return a fetch call for the same request, passing the request through to the network as though the Service Worker was never involved at all. Here’s a simplified version of our fetch handler:

self.addEventListener('fetch', (e) => {
if (assetManifest.includes(e.request.url) {
e.respondWith(
caches
.open(cacheKey)
.then(cache => cache.match(e.request))
.then(response => {
if (response) return response;
return fetch(e.request);
});
);
} else {
e.respondWith(fetch(e.request));
}
});

In the actual implementation there’s much more Slack-specific logic but at its core the fetch handler is that simple.

Responses returned from the Service Worker will be labeled “(ServiceWorker)” in the size column of the network inspector

Activate

The activate event triggers after a new or updated Service Worker has been successfully installed. We use it to look back at our cached assets and invalidate any cache buckets more than 7 days old. This is generally good housekeeping but it also prevents clients booting with assets too far out of date.

You might have noticed that our implementation means anyone booting the Slack client after the very first time will be receiving assets that were fetched last time the Service Worker was registered, rather than the latest deployed assets at the time of boot. Our initial implementation attempted to update the Service Worker after every boot. However, a typical Slack customer may boot just once each morning and could find themselves perpetually a full day’s worth of releases behind (we release new code multiple times a day).

Unlike a typical website that you visit and quickly move on from, Slack remains open for many hours on a person’s computer as they go about their work. This gives our code a long shelf life, and requires some different approaches to keep it up to date.

We still want users on the latest possible version so they receive the most up-to-date bug fixes, performance improvements, and feature roll outs. Soon after we released the new client, we added registration on a jittered interval to bring the gap down. If there’s been a deploy since the last update, we’ll fetch fresh assets ready for the next boot. If not, the registration will do nothing. After making this change, the average age of assets that a client booted with was reduced by half.

New versions are fetched regularly but only the latest is used for booting

Feature flags are conditions in our codebase that let us merge incomplete work before it’s ready for public release. It reduces risk by allowing features to be tested freely alongside the rest of our application, long before they’re finished.

A common workflow at Slack is to release new features alongside corresponding changes in our APIs. Before Service Workers were introduced, we had a guarantee the two would be in sync, but with our one-version-behind cache assets our client was now more likely to be out of sync with the backend. To combat this, we cache not only assets but some API responses too.

The power of Service Workers handling every network request made the solution simple. With each Service Worker update we also make API requests and cache the responses in the same bucket as the assets. This ties our features and experiments to the right assets — potentially out-of-date but sure to be in sync.

This is the tip of the iceberg for what’s possible with Service Workers. A problem that would have been impossible to solve with AppCache or required a complex full stack solution is simple and natural using Service Workers and the Cache API.

The Service Worker enables faster boots by storing Slack’s assets locally, ready to be used by the next boot. Our biggest source of latency and variability, the network, is taken out of the equation. And if you can take the network out of the equation, you’ve got yourself a head start on adding offline support too. Right now, our support for offline functionality is straightforward — you can boot, read messages from conversations you’ve previously read and set your unread marker to be synced once you’re back online — but the stage is set for us to build more advanced offline functionality in the future.

After many months of development, experimentation, and optimization, we’ve learned a lot about how Service Workers operate in practice. And the technology has proven itself at scale: less than a month into our public release, we’re successfully serving tens of millions requests per day through millions of installed Service Workers. This has translated into a ~50% reduction in boot time over the legacy client and a ~25% improvement in start time for warm over cold boots.

p50 legacy vs. warm vs. cold boots in browser (lower is better)

If you like optimizing the performance of a complex and highly-used web app, come work with us! We still have a lot of work left to do.

Huge thanks to Rowan Oulton and Anuj Nair who helped implement our Service Worker and put together this blog post.

Read More

Leave a Comment