As promised in December, I’ve figured out why the crash happened, why the server didn’t just restart automatically when it went down, and what will be implemented to prevent future incidents of the same sort. I’ll keep the technical details succinct.
The crash was caused by a series of unfortunate events: a memory leak, password protection on reboot, and the unusual situation that everyone who had the reboot password was away on vacation and couldn’t be reached.
There’s an image library at Scubbly that is used to generate thumbnail images of the pictures you upload. That library is known to gobble up RAM and not release it. Over time, the lost RAM causes “swapping” to disk. Swapping is bad. The cure for swapping is a reboot of the web server. It happens rarely, and when it does there’s hardly any service interruption at all. And there’s an automatic process in place with our web host that watches for swapping and triggers a reboot. All that stuff is set up and it’s good.
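Our host’s actual watchdog is their own, but a minimal sketch of that kind of swap check might look like this (the 512 MB threshold is an assumption for illustration):

```shell
#!/bin/sh
# Sketch of a swap watchdog. Reads swap usage from /proc/meminfo and
# flags when too much swap is in use. Threshold is illustrative.
THRESHOLD_KB=524288   # 512 MB

# Swap in use, in kB: SwapTotal minus SwapFree.
swap_used_kb=$(awk '/SwapTotal/ {t=$2} /SwapFree/ {f=$2} END {print t-f}' /proc/meminfo)

if [ "${swap_used_kb:-0}" -gt "$THRESHOLD_KB" ]; then
    echo "swap high (${swap_used_kb} kB) - restarting web service"
    # The real watchdog would follow this with a NON-interactive restart,
    # e.g.: apachectl -k graceful
    # ...which is exactly the step that failed for us, as described below.
fi
```

Run from cron every few minutes, a check like this catches the leak long before the machine grinds to a halt.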
The failure can be blamed on the password protection on the reboot process. Scubbly has authentication measures protecting pretty much everything from being screwed with. So when the web host rebooted the machine hosting Scubbly, the system required a password – literally typed in by a human logged in to the machine – before it would restart the web service.
Only one person has that password: me, on vacation with my family at a resort with no wifi, where no email could reach me for over a week. As we disembarked in Toronto after a week of sun and mojitos, I switched my phone out of Airplane Mode; that’s when the hundreds of critical downtime alerts flooded in and I learned that Scubbly.com had been in a coma for over three days. All I had to do was go home and type in my password to restart Apache. The fix itself took only a few seconds, but it had been waiting for several days because of my own unavailability.
If there had been no password impediment (allowing the automatic reboot process to succeed), or if there had been another person who could intervene, the interruption of service would have been completely avoided.
So. Here are some steps to prevent similar problems in the future. The first will be done immediately:
1) We will use a different signing method to secure our web service, one that doesn’t require a human to be present when rebooting. That way our automated disaster recovery procedures will actually work without intervention. I’ll have that fixed some time in February.
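The exact mechanism at Scubbly isn’t spelled out here, but the classic cause of a password prompt when Apache restarts is a passphrase-protected TLS private key. Purely as a demonstration (file names and the passphrase are made up, not Scubbly’s real setup), stripping the passphrase looks like this:

```shell
# Demonstration: create a passphrase-protected key, then strip the
# passphrase so the web server can restart with no human at the keyboard.
# File names and the "secret" passphrase are illustrative only.

# A key like this would prompt for "secret" every time Apache starts:
openssl genrsa -aes256 -passout pass:secret -out server.key 2048

# Decrypt it, and keep the plaintext key readable only by its owner:
openssl rsa -in server.key -passin pass:secret -out server.key.insecure
chmod 600 server.key.insecure
mv server.key.insecure server.key
```

The trade-off is that the key now relies entirely on file permissions for protection, which is why tightening them with `chmod 600` matters.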
The next two will happen some day, but not immediately:
2) Add some redundancy to the service by starting up a second server behind a load balancer. Then if one machine goes kaput with RAM issues, the service will still be available on the other.
3) Some day Scubbly will invest in “managed hosting”. Instead of running on self-serviced machines maintained by me, the site will be managed by a hired team of professional geeks with 24/7 customer assistance and immediate incident management. Hosting like that is expensive, but it is where Scubbly is headed.
The impediment for those last two is budgetary – Scubbly will be able to do them when the service is earning enough to cover the increased costs.
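For step 2, the shape of the setup is standard: a load balancer in front of two or more identical web servers. As a sketch only – Scubbly hasn’t built this yet, and the hostnames are made up – an nginx front end would look something like:

```nginx
# Hypothetical load balancer config; server names are illustrative.
upstream scubbly_backend {
    server app1.internal:8080;
    server app2.internal:8080;   # if app1 is swapping itself to death,
                                 # traffic fails over to app2
}

server {
    listen 80;
    location / {
        proxy_pass http://scubbly_backend;
    }
}
```

nginx’s passive health checks mark a backend as failed after a few connection errors and route around it, which covers exactly the “one machine goes kaput” scenario above.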
That’s all. Once again, please accept my apologies for letting Scubbly flatline like it did.