A tale of a small JavaScript bug

Sometimes it's assumed that the frontend is a relatively harmless place to be programming. We've been proven wrong once again.

πŸ”₯πŸΆβ˜•οΈπŸ”₯ This is fine

This is fine

For a few months now, we have been seeing relatively high loads on our web servers. The loads varied from around 90% of the total CPU power available, to a bit over 100% in peak hours. We assumed that this was normal and that our product was just being used more (which it was), and that this was a natural burden on the load on our servers. The approach we were going to take was to continue scaling horizontally.

Two weeks ago however, we noticed huge daily load spikes (up to 1000% of our capacity) during peak hours. We got alert SMS'es of Redis that couldn't handle the amount of requests, and saw Kibana logs of very slow user requests to certain routes, effectively rendering our application unusable at times. It was "all hands on deck" immediately, and we started digging. Before we could find out what was happening it stopped again, and the load dropped to what we assumed was "normal load". We found, also using Kibana, that there was a huge amount of calls to our /find endpoint, and so we decided to implement a rate limiter on those, to prevent our app from going down, and buy us some time to look for the actual issue. However, we couldn't see why that endpoint was hit so hard.

Kibana showing a lot of requests for one user in a short time frame

πŸ›πŸ•΅οΈβ€β™‚οΈ Bug Hunting

A few days later, when we were hit with a huge amount of calls again (this time, they were ratelimited and not producing those huge loads on our servers, but we could see them in the logs and in Kibana), we noticed that most of them were coming from only a limited number of customers. We tried to work in our app as they would do, but couldn't reproduce it. After some time however, we noticed in the network tab of the browser we were working in, that the same call to the /find endpoint was done multiple times simultaneously! Bingo!

Or not Bingo? The code seemed to be fine: When we mount our inbox's react component, we use Sonora.on(eventname, callback) to bind a callback to a given websocket eventname (Sonora is an abstraction around socket.io). When we unmount the react component again, we call Sonora.off(eventname, callback) to stop listening for those events. When such an event comes in, we would potentially need more info from the backend and a call to the /find endpoint is issued. It definitely looked like we didn't unregister the callbacks when unmounting that component, given how we saw multiple calls being made simultaneously in the console's network tab, whenever a websocket message came in.

Browser's performance tab showing lots of calls to the find endpoint

While everything was looking okay in the consuming code of the Sonora .on() and .off() methods, we concluded that something must've been wrong in the wrapper around socket.io itself. When looking inside the .on() method, we found out that there was a debugging statement added like this:

const Sonora = {
    // ...

    on: (event, callback) => {
        socket.on(
            event,
            () => {
                console.log('some debugging here');

                callback.apply(this, arguments);
            }
        );
    }

    // ...
}

As you can see, we wrapped the actual callback inside an anonymous function, and passed that on to Socket.io's .on() handler. Now, when calling .off(), we sent along the original callback which didn't match the wrapped one, and nothing was removed. Since it's possible to have multiple callbacks for each incoming event, this resulted in the same callback being added time after time and not being removed. So we had basically made all our clients do loads of unnecessary calls to /find by adding a debugging statement! And believe it or not, this debugging statement was there for a while! (Thanks git blame!)

πŸ”¨πŸ‘©β€πŸ”§ Fixing it

The fix was easy enough: don't add that anonymous function.

The load on our servers dropped immediately when we put that small change into production, and not only during peak hours. It seemed that we had been on a tipping point. A few more users online at any given time, a few more open tabs with our application running in them. The servers constantly running at semi-high loads. And then we went over it. πŸ”₯

Graph displaying the huge drop in load on our servers

Along the way we did some other optimisations:

  • we implemented a rate limiter on the /find endpoint
  • we disallowed to register exactly the same callback twice for the same event using Sonora.on()
  • we fixed a second bug in Sonora.off() where we didn't remove the correct callbacks sometimes

πŸ˜‡πŸ’­ And they lived happily ever after

This was the tale of the small JavaScript bug bringing down the huge web application, a modern David and Goliath if you will. Frontend debugging tricks can take down your application! Every change is important, certainly changes that happen in code that's used very frequently. Some changes that look inconspicuous can over time become real bottlenecks. Tools like Kibana and the browser profiler & network tabs really helped us a great deal finding the issue, so don't forget what you have at your disposal.

We hope you enjoyed reading about our failures! Happy debugging! πŸ‘‹

Categories: JavaScript