Imagebomb: All Engines Flameout
May-2020
You might have noticed us going down for about a week at the end of February. Not really going down, because we managed to keep the service somewhat up, thanks to our OOM-killer configuration and some helpful insight from bocchi and psudo. But let’s rewind.
From one night to the next, our webserver died. Along with the webserver died the most important tool that keeps load off our database servers: our cache server, which is a lone instance of memcached.
First aid was administered according to the usual troubleshooting manual:
Which process produces load?
Did we change anything recently?
What are the patterns? Does it happen hourly? Daily? At random!?
Sifting through logfiles, does anything stand out?
In case you’re wondering, no access or error logs could give us a useful answer. It’s a hunt for trees inside a forest, each one as meaningful as it is talkative.
One pattern emerged: after a while, the webserver’s memory was eaten up within 2-5 minutes. Trying to adjust the server configuration to distribute memory usage across php-fpm workers was in vain; something else was up, and it put us under pressure.
It couldn’t be our scripts, could it? But what if it was? More data was needed. Similar to a MySQL slowlog, php-fpm can be made more verbose about why a worker fails outside of your control. And if you see all your scripts failing at the same function, something is up.
With that, we could pin down what caused the php-fpm workers to die and who was causing it. Revisiting the access log gave us confidence: Ladies and gentlemen, we got him.
But what actually happened?
Having the IP and the upload time of said user made it easy to get an overview of the damage done. We quickly found out that over the course of a week, the same user uploaded up to 20 thousand (!) manga covers that abused the way we handle alternative cover image uploads. Confused about how cover uploads are related? Read on.
Because we resize cover uploads to a manageable size, we use PHP functions that are susceptible to this specific exploit. Similar to a zip bomb (which we also had to deal with before), these files masquerade as an image that is small in file size but blows up when put onto a canvas.
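To get a feeling for the numbers, here is a rough sketch in Python rather than the PHP involved. The dimensions and the four bytes per pixel are assumptions for illustration, not values taken from the actual uploads. It shows why a canvas can need far more memory than the file size suggests, and how well uniform pixel data compresses.

```python
import zlib

def decoded_bytes(width: int, height: int, bytes_per_pixel: int = 4) -> int:
    """Rough memory needed to hold the decoded image as an uncompressed canvas."""
    return width * height * bytes_per_pixel

# Hypothetical dimensions, chosen for illustration only.
width, height = 20_000, 20_000
print(decoded_bytes(width, height))  # 1600000000 bytes, roughly 1.6 GB

# Uniform pixel data compresses extremely well, which keeps the upload tiny.
raw = bytes(4_000_000)              # 4 MB of zero bytes standing in for blank pixels
compressed = zlib.compress(raw, 9)  # the same deflate family that PNG uses
print(len(raw), len(compressed))
```

The file on disk reflects the compressed size, while the resize step has to pay for the decoded size.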
How did we fix that?
Aside from banning the malicious actor in question, we did two things:
1) The maximum memory consumption per php-fpm worker is capped.
2) We determine the final dimensions of each image we process and cap them at a maximum size (width + height).
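The second point can be sketched as follows. This is not our production code: it is a minimal Python illustration that only handles PNG, and the limits `MAX_SIDE` and `MAX_PIXELS` as well as the helper names are hypothetical. The idea is to read the declared width and height from the file header before any pixel is decoded, and to refuse anything too large.

```python
import struct
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MAX_SIDE = 10_000        # hypothetical limit per side, in pixels
MAX_PIXELS = 40_000_000  # hypothetical limit on width * height

def png_dimensions(data: bytes) -> Tuple[int, int]:
    """Read width and height from the IHDR chunk without decoding any pixels."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", data[16:24])
    return width, height

def is_safe_to_process(data: bytes) -> bool:
    """Accept only files whose declared dimensions stay below the limits."""
    try:
        width, height = png_dimensions(data)
    except ValueError:
        return False
    if width == 0 or height == 0:
        return False
    return width <= MAX_SIDE and height <= MAX_SIDE and width * height <= MAX_PIXELS

def fake_png_header(width: int, height: int) -> bytes:
    """Build just enough of a PNG (signature plus start of IHDR) for the example."""
    return PNG_SIGNATURE + struct.pack(">I4sII", 13, b"IHDR", width, height)

print(is_safe_to_process(fake_png_header(800, 1200)))       # True
print(is_safe_to_process(fake_png_header(60_000, 60_000)))  # False
```

The check costs a few bytes of reading, so it can run before the expensive resize step ever sees the file.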
In the long run, processes that take a long time (more than one second) will be put into a queue and processed by a background worker, so that the webserver won’t be affected and the attack surface is reduced.
The MangaDex devblog is a place where we share selected stories from our daily adventures. Let us know what you think or where we could shed some light! Our Twitter is @MangaDex; this post was written by @md_rdn.
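The shape of that idea, as a small Python sketch: the request handler only enqueues the job and returns, while a worker does the slow part. Everything here is hypothetical (`resize_cover`, `handle_upload`, the queue size), and a thread inside the same process is only a stand-in. A real setup would presumably run the worker in a separate process or on a separate machine so that its memory use cannot hurt the webserver.

```python
import queue
import threading

jobs = queue.Queue(maxsize=100)  # hypothetical bound; a full queue rejects new work
results = {}

def resize_cover(upload_id: str) -> str:
    """Hypothetical stand-in for the slow image processing step."""
    return f"resized:{upload_id}"

def worker() -> None:
    """Process queued uploads one by one until the None sentinel arrives."""
    while True:
        upload_id = jobs.get()
        if upload_id is None:
            jobs.task_done()
            break
        try:
            results[upload_id] = resize_cover(upload_id)
        except Exception as exc:  # one bad upload must not kill the worker
            results[upload_id] = f"failed:{exc}"
        finally:
            jobs.task_done()

def handle_upload(upload_id: str) -> str:
    """Request handler: enqueue the job and return without waiting for it."""
    jobs.put_nowait(upload_id)  # raises queue.Full instead of blocking the request
    return "accepted"

thread = threading.Thread(target=worker, daemon=True)
thread.start()
print(handle_upload("cover-1"))  # accepted
jobs.put(None)                   # tell the worker to stop after the queued jobs
jobs.join()                      # wait until everything queued has been handled
print(results)
```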