DCN911.COM Off-Network Support


Shared server ‘pilotisland’ outage, recovery

Posted by Agile on March 27th, 2008

This morning we encountered a combination of issues with our shared server ‘pilotisland’ which resulted in a lengthy (approx. 6 hours) outage. Now that the server is back up, we are able to determine what happened.

A site on the server experienced a large spike in distributed traffic (similar to the “slashdot” effect, although Slashdot was not the source). The traffic hit a dynamic PHP/MySQL script, which put an unsustainable load on the server and brought it down.

We attempted to reboot the server but kept running into two problems: access was difficult because of the network traffic, and the hard drives required an FSCK before they would come back up. The automatic FSCK would not complete, so we had to run a manual FSCK, which was very time-intensive due to the size of the drives.

The FSCK has now been completed, the server has been rebooted, traffic for the busy site has been remediated (through both server/script settings and some traffic sharing) and all server operations came back to normal at approximately 1:15 PM Central Time.

We apologize for the frustration that this has caused. We realize that our clients do not want their sites to be down. We don’t want your sites to be down either! This particular server had 100% uptime over the previous six months and showed no signs of trouble prior to this incident. This was an external traffic problem that was then compounded by the normal drive maintenance operations required for recovery. We are very sorry for the inconvenience, and will continue to do everything in our power to ensure uninterrupted service on this server moving forward.

Server svensbluff down, being rebooted; Update: Recovered

Posted by Agile on March 26th, 2008

UPDATE, 1:37 PM: ‘svensbluff’ is back online. Thank you for your patience! :)

———————————————————————————————-

1:29 PM: Our shared server ‘svensbluff’ is currently down following emergency maintenance. Our on-site technicians are working on the server now and will have it back online as soon as possible.

We apologize for the inconvenience!