During the week we have been working on adding redundancy to the database that stores the contents of posts. To do this smoothly, we have been running some upgrades on the main post contents database. These upgrades took a lot longer than expected (10+ hours) and had to be retried a couple of times. The latest attempt had been running through Friday and into Saturday morning. As this upgrade process was reaching completion, the database process hung, taking down the entire database.
While the database was unreachable, requests to our application servers queued up to such a degree that it effectively took the entire system down. Some requests intermittently reached our servers, but no posts were loading.
We received the initial alerts at 02:34 (local Stockholm/Europe time). We went into action and tried restarting the database software at 02:45. The process did not respond, so we initiated a server reboot at 02:50. At 02:59 the server was up and running again.
1. The post contents database being down should not take down the entire site. At most, the “Simple” mode version of posts should be empty.
2. While no data loss has been recorded, we need to prioritise ensuring fast backup routines. The last full backup of the post contents database was taken 5 days ago. Had the server not been able to recover, we would have had major data loss. This upgrade process actually started because we were working on better and faster backups. Now we have gotten a first-hand taste of why this is important.
3. Right now the post contents database is a single point of failure, as we only have 1 replica. We need to ensure, as fast as possible, that we have 2 replicas, so that one is ready if the other fails. This is also something that this upgrade process was aiming to make possible.
4. Our status page was not showing the correct status for the service. It said “All systems operational” when this was clearly not the case. We need to investigate and improve this.
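To illustrate the first point, here is a minimal sketch of one common way to keep a slow or unreachable dependency from stalling every request: a per-call timeout combined with a simple circuit breaker. This is not our actual code. The class name `PostContentsLoader`, the `fetch` callable, and all thresholds are hypothetical and only show the general idea.

```python
import concurrent.futures
import time


class PostContentsLoader:
    """Hypothetical wrapper around a post contents lookup (illustration only)."""

    def __init__(self, fetch, timeout_s=0.5, max_failures=3, cooldown_s=30.0):
        self._fetch = fetch              # callable: post_id -> contents string
        self._timeout_s = timeout_s      # how long one request may wait
        self._max_failures = max_failures
        self._cooldown_s = cooldown_s    # how long to skip the database
        self._failures = 0               # consecutive failures (not thread-safe)
        self._open_until = 0.0           # breaker is open until this time
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

    def load(self, post_id):
        """Return the post contents, or "" if the database is slow or down."""
        if time.monotonic() < self._open_until:
            # Breaker is open: answer immediately instead of queueing up.
            return ""
        future = self._pool.submit(self._fetch, post_id)
        try:
            contents = future.result(timeout=self._timeout_s)
        except Exception:
            # Covers both the timeout and errors raised by fetch itself.
            future.cancel()
            self._failures += 1
            if self._failures >= self._max_failures:
                self._open_until = time.monotonic() + self._cooldown_s
                self._failures = 0
            return ""
        self._failures = 0
        return contents
```

With something along these lines, a request that cannot get post contents would return an empty body after at most `timeout_s`, and once the breaker opens, later requests skip the database entirely until the cooldown has passed. The rest of the page could then still render, with only the post contents missing.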
We take the uptime of Feeder very seriously. We’re always working to keep services and servers up-to-date and secure. We sincerely apologize for any trouble this downtime caused. If you have any questions, please contact us at firstname.lastname@example.org and we’ll get back to you as soon as possible.