The system runs within Amazon Web Services (AWS) and uses a highly resilient design in which every server has at least one spare to cover its complete failure. The following shows the architecture, with some components removed for clarity.
There are currently three web servers, and no important information is stored on them. They hold only the software, and their sole purpose is to respond to users by building the pages you see. If a web server stops working, the others share its load, and each is normally running at only about 5% capacity.
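To put the 5% figure in context, the arithmetic can be sketched as below. This is an illustration only: it assumes the load is spread evenly across the surviving web servers, and the function name is made up for the example.

```python
def load_per_server(total_servers: int, failed: int, utilisation: float) -> float:
    """Utilisation of each surviving server after `failed` servers stop.

    Assumes (for illustration) that the load balancers spread work evenly.
    """
    survivors = total_servers - failed
    if survivors <= 0:
        raise ValueError("no web servers left to carry the load")
    # The total amount of work is unchanged; it is split across fewer machines.
    return total_servers * utilisation / survivors


one_down = load_per_server(3, 1, 0.05)
two_down = load_per_server(3, 2, 0.05)
print(f"one down: {one_down:.1%}, two down: {two_down:.1%}")
```

Under that assumption, losing one of three servers moves the others to about 7.5% each, and even a single remaining server would sit at about 15%.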
All important transactional data, such as users, bookings, logs and invoices, is stored in the school-specific databases, which are mirrored across two servers as shown above. The two database servers are, again, in separate buildings. One acts as the primary and the other as a replica. If either server stops working, the system starts a new one automatically, and if it is the primary that fails, switch-over takes on the order of a few seconds.
There is also additional data, such as any images you upload as part of your newsletters or in your restricted content section. This is stored redundantly in a shared network file system that is also split across different buildings.
Using this architecture, any server can fail, or even a whole building, and the system will run on the spares. The database servers are self-healing, so there should be no human intervention needed. The web servers are quick and easy to create: it is simply a case of adding a new one from an image file.
The load balancers trigger a self-test on each web server, and should any test fail, that web server takes itself offline. The tests cover things like reading from and writing to the database and the file system, and checking that regular background processing is running.
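The self-test idea can be sketched roughly as follows. This is not the system's actual code. The check names and the use of a 200 or 503 status to signal the load balancer are assumptions made for the example.

```python
from typing import Callable, Dict, Tuple


def run_self_test(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run every check and return (status_code, per-check results).

    200 means "keep sending me traffic"; 503 means "take me out of service".
    A check that raises an exception is treated as a failure.
    """
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


# Hypothetical checks standing in for the real database, file system
# and background-processing tests.
status, results = run_self_test({
    "database_read_write": lambda: True,
    "file_system_read_write": lambda: True,
    "background_jobs_running": lambda: False,
})
print(status, results)
```

Because a single failed check is enough to return 503, a server with a partial fault would stop receiving traffic rather than serve some requests badly.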
A system called CloudWatch monitors all of this and will raise an alarm if the CPU load increases or if any server goes offline.
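As a rough idea of what a CPU alarm looks like, the sketch below builds the settings for one. The instance ID, alarm name, threshold and timings are all hypothetical and are not the values used by this system. The field names follow Amazon CloudWatch's `PutMetricAlarm` parameters.

```python
from typing import Any, Dict


def cpu_alarm_params(instance_id: str, threshold_percent: float) -> Dict[str, Any]:
    """Build settings for an alarm on average CPU use of one server.

    All values here are illustrative assumptions, not the real configuration.
    """
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,              # look at 5-minute averages
        "EvaluationPeriods": 2,     # two periods in a row must breach
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }


params = cpu_alarm_params("i-0123456789abcdef0", 50.0)
print(params["AlarmName"])

# With the boto3 library installed and AWS credentials configured,
# the alarm could then be created with:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```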
A full database backup is created on a schedule every night in the early hours. The old backup is deleted only once the new one has completed. This is to cover the case where all the database servers fail at exactly the same time. In that case the outage would be on the order of hours, because of the time we would need to start new servers from the backup.
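The "delete the old backup only after the new one has completed" rule can be sketched as below. This is a simplified stand-in using local files. The file names, the `.partial` and `.bak` suffixes and the `write_backup` callback are assumptions for the example, not how the real backups are produced.

```python
import os
import tempfile
from typing import Callable


def rotate_backup(backup_dir: str, new_name: str,
                  write_backup: Callable[[str], None]) -> str:
    """Write a new backup, and only then delete the previous ones."""
    old = [f for f in os.listdir(backup_dir) if f.endswith(".bak")]
    partial = os.path.join(backup_dir, new_name + ".partial")
    final = os.path.join(backup_dir, new_name + ".bak")

    # If this raises, the function stops here and the old backups are untouched.
    write_backup(partial)
    # Renaming marks the new backup as complete.
    os.replace(partial, final)

    for name in old:
        if name != new_name + ".bak":
            os.remove(os.path.join(backup_dir, name))
    return final


def fake_dump(path: str) -> None:
    # Stand-in for a real database dump.
    with open(path, "w") as fh:
        fh.write("dump")


with tempfile.TemporaryDirectory() as d:
    rotate_backup(d, "monday", fake_dump)
    rotate_backup(d, "tuesday", fake_dump)
    remaining = sorted(os.listdir(d))
print(remaining)
```

The ordering matters: if the nightly run fails part-way through, the previous night's backup is still there.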
The plan is to never need the backup!