As you will know, on February 28 the internet suffered one of those major failures that ended up affecting many of the websites, services and applications we use daily. The cause was an outage in S3 (Simple Storage Service), the Amazon Web Services (AWS) storage service, which is one of the largest on the internet and hosts companies such as Hootsuite, Twitch, Airbnb, Giphy, Trello, IFTTT and others.
After a disruption that lasted almost five hours, today, 48 hours later, Amazon has told us what happened. The company has published a report explaining the origin of the failure, which confirms that it was not an attack, as many thought, but a "simple" human error.
Human Errors and the Fragility of the Network
It all started on Tuesday morning, when some members of the Amazon S3 team were debugging the billing system, a task that required taking some servers offline. An authorized member of the team executed a command as set out in the manual. Unfortunately, one of the command's inputs was entered incorrectly, and it ended up disabling a larger set of servers than intended.
Among the servers taken offline were those supporting two important S3 subsystems, one of which is responsible for managing the metadata and location information of all S3 objects in the region. With that subsystem down, S3 could not perform basic data retrieval and storage tasks.
Once the error was discovered, the next step was to restart the entire system, which took longer than expected. While all this was happening, other AWS services stopped working, such as Elastic Compute Cloud (EC2), which companies use to rent computing capacity in the cloud. The bad news is that many of AWS's own services depend on S3, including the status dashboard, which during the failure showed that all services were working well when it was clear that they were not.
According to Amazon, the restart took much longer than expected because many of the servers had never been restarted. Although S3 is designed to keep working when some servers are lost, the failure of these subsystems affected performance significantly. The company says the error has prompted it to adjust its protocols and make changes: debugging will be carried out periodically on a scheduled basis, engineers will be more limited in their ability to disable servers, and the dashboard will become a system independent of S3.
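To make the idea of limiting what a single command can disable more concrete, here is a minimal sketch of a tool that refuses a removal request when it would leave a subsystem with too few servers. It is only an illustration: the names `Subsystem`, `remove_servers`, `CapacityGuardError` and the `min_required` threshold are hypothetical and are not taken from Amazon's report or its internal tooling.

```python
from dataclasses import dataclass


class CapacityGuardError(Exception):
    """Raised when a removal request would leave a subsystem under-provisioned."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical floor; real values are not public


def remove_servers(subsystem: Subsystem, count: int) -> int:
    """Take `count` servers offline, unless that would drop below the floor.

    Returns the number of servers still active after the removal.
    """
    if count <= 0:
        raise ValueError("count must be a positive integer")

    remaining = subsystem.active_servers - count
    if remaining < subsystem.min_required:
        # Reject the whole request instead of applying it partially.
        raise CapacityGuardError(
            f"refusing to remove {count} servers from {subsystem.name}: "
            f"{remaining} would remain, minimum is {subsystem.min_required}"
        )

    subsystem.active_servers = remaining
    return remaining
```

With a check like this, a mistyped input that asks for 50 servers instead of 5 would be rejected outright rather than executed, and the subsystem would be left untouched.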
Given all this, it would be interesting to know what became of the engineer, since a failure of this magnitude is serious, even though in the end they did everything that was in the manual. What is important to note is that, in the age of artificial intelligence, robots and technology, the network remains fragile, all the more so when human error is involved.