Scheduled service outage Friday morning
Thursday, July 7, 2011

Stackable will be performing maintenance in its data center. If you are one of the customers expected to be affected, you have already been notified by email.

We are beginning work at 12:01 AM MDT on Friday, July 8th, 2011, and don’t anticipate it taking longer than 10 minutes. During this time, we will be shutting down the load balancer which directs traffic for some websites to their container(s). Visitors will not be able to reach these websites during this time. We are making every effort to minimize the impact this will have and, in the future, will work to find a way to reduce or eliminate the need for service windows like this. Normally we’d wait until the weekend to perform such service, but this situation is somewhat more pressing.

If you have any questions at all, please don’t hesitate to let us know. We’re available during regular business hours (9 AM - 5 PM Mountain Time, Monday - Friday) via the Live Chat link on our website (http://www.stackable.com) or by emailing help@stackable.com.

Update: Beginning at 12:01 AM MDT on Friday, July 8th, 2011, engineers replaced a failing load-balancer with a new server. Bringing down the old server and bringing the new one online was expected to take only ten minutes. A notice was sent to affected customers to inform them of the anticipated brief outage.

The switch to the new server went as anticipated until it was discovered that networking was not working as designed.

Nearly all IP addresses bound to customer sites were not accessible from beyond Stackable’s network. Engineers immediately began to investigate and quickly discovered that the ARP (Address Resolution Protocol) caches were not properly expiring on our upstream Cisco network devices.

To make matters worse, the Cisco routers refused to clear the cache despite repeated attempts using a variety of methods. Ultimately, the only available fix was to manually create and enter ARP entries for several hundred addresses bound to the load-balancer in order to restore upstream connectivity. It took some time for this fix to be written and implemented.
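For readers curious what a bulk fix of this kind can look like, here is a minimal sketch in Python that generates static ARP entries in the Cisco IOS form `arp <ip> <mac> arpa`. It is an illustration only, not the script our engineers wrote: the function names are made up for this example, the addresses come from the reserved documentation ranges (192.0.2.0/24 and 00:00:5e:00:53:xx), and it assumes the addresses are contiguous and all map to one load-balancer MAC address.

```python
import ipaddress
from typing import List


def cisco_mac(mac: str) -> str:
    """Convert a MAC such as 00:00:5e:00:53:01 to Cisco dotted form 0000.5e00.5301."""
    # Keep only the hex digits so ':', '-' and '.' separators are all accepted.
    digits = "".join(c for c in mac.lower() if c in "0123456789abcdef")
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return ".".join(digits[i:i + 4] for i in range(0, 12, 4))


def static_arp_commands(first_ip: str, count: int, mac: str) -> List[str]:
    """Build one static ARP command per address, starting at first_ip.

    Assumption for this sketch: the addresses are contiguous and all
    resolve to the same hardware address (the new load-balancer).
    """
    start = ipaddress.IPv4Address(first_ip)
    hw = cisco_mac(mac)
    return [f"arp {start + i} {hw} arpa" for i in range(count)]


if __name__ == "__main__":
    # Hypothetical values taken from documentation ranges, not real customer addresses.
    for line in static_arp_commands("192.0.2.10", 3, "00:00:5e:00:53:01"):
        print(line)
```

The generated lines would be pasted into the router in global configuration mode. Generating them from a list keeps several hundred entries consistent and makes it just as easy to produce the matching `no arp <ip>` lines when the static entries are later removed.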

This is obviously a very serious flaw in Cisco IOS and our engineers are working with Cisco to determine the cause and implement a permanent resolution.

Though some sites came online earlier, it took until 2:04 AM before engineers determined that operations were fully restored.

We certainly don’t consider this sort of extended outage acceptable and to our customers we offer a sincere apology. We’re sorry.

This morning, we’re working on a number of options to increase the availability of our front-end load-balancers far beyond what we have in place right now. As we work toward increasing reliability, we’ll continue to keep our customers informed.