Hardware Node Outage
Tuesday, January 17, 2012

For as-yet-unknown reasons, at approximately 3:02 AM MST one of the hardware nodes servicing our cluster began behaving erratically and negatively impacting the performance of customer containers on that node. Systems administrators were immediately notified, and the decision to reboot the affected hardware node was made at 4:43 AM. The hardware node fully completed its reboot at 5:13 AM MST, at which point customer containers recovered. In total, 44 customers were affected.

Of course, we are very concerned about this unexpected outage. We plan to begin working immediately with customers whose containers are on the affected hardware node to move them to other nodes. We are deeply sorry for this outage.

Hardware Node Outage
Tuesday, January 10, 2012

At approximately 2:46 PM MST one of the hardware nodes servicing our cluster unexpectedly experienced a kernel panic following a configuration change. Systems administrators on site were immediately notified and the node was brought back online at 2:59 PM MST. In total, 27 customers were affected.

Customer containers on that hardware node proceeded to recover over the course of the next few minutes. Some customers’ MySQL containers experienced minor issues restarting their MySQL daemons, which required manual intervention; however, our experienced team was quickly able to resolve those remaining issues.

We certainly strive to prevent issues like this, but we are lucky to have experienced talent and emergency response on hand to restore service as quickly as possible.

Upstream core router outage
Wednesday, November 16, 2011

One of our upstream provider’s core routers is currently experiencing significant routing issues, preventing us from contacting our DNS servers and degrading access to, from, and within our cluster. They are working on the problem now and we will update this status as more information becomes available. Thank you for your patience.

Update 1:55 PM MST: Network service has been restored. We deeply apologize for this outage and are working with our upstream network provider to find solutions to prevent this from occurring in the future.

Update 4:03 PM MST: One of our upstream provider’s redundant core routers had serious routing problems from approximately 1:37 to 1:53 PM MST on Wednesday, November 16th.

While they are still investigating the cause, BGP routing sessions on one of their core Cisco routers (a Cisco 6509-E) started to flap badly at approximately 1:37 PM. As a result, CPU utilization spiked, and the router became unresponsive to management commands and began dropping and re-establishing additional BGP sessions. The repeated flapping of these sessions and the resulting heavy load of processing routing information meant the router could not recover enough to fully establish all the sessions and keep them up.

To help reduce the load caused by this cycle, a network engineer decided to disconnect an Internet transit connection that went directly to this router. This kept the router from constantly processing and dropping the full feed of Internet routes from this provider and the router was able to stabilize very quickly afterward.

Although our upstream provider’s network is fully redundant, both at the border and core of their network, a partial outage like this is possible when a router is still partially operable because full failover to the other router might not get triggered. Among other services affected was one of their redundant recursive name servers, causing some DNS queries to time out.

Again, we are very sorry for any problems this may have caused and we are working to find solutions that will mitigate any future outages.

DNS resolution failures
Friday, November 11, 2011

During the early-morning hours of November 11th, 2011, DNS resolution for the stackablehost.com domain ceased to work correctly.

The most frequent issue that arose as a result was that customers depending on resolution of the stackablehost.com domain for inter-container requests, such as MySQL or Postgres connections, saw failures.

Due to a series of mistakes, the domain was not renewed by Stackable as it normally would be and briefly lapsed. 

The issue was corrected at 7:40 AM MST.

Unquestionably, we’re deeply concerned about an outage like this, which could have been prevented. We’ve renewed all domains for an additional ten years and we’re working on putting additional monitoring in place to notify us of potential failures of this type in advance. We deeply regret this outage and we’re very sorry.
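As a rough illustration of the kind of advance notice described above, the following Python sketch flags domains that have lapsed or are close to their expiry date. It is a minimal example under stated assumptions, not a description of the monitoring Stackable actually deployed: the 60-day threshold, the function names, and the idea of feeding in a table of expiry dates by hand are all hypothetical.

```python
from datetime import date

# Hypothetical threshold: warn when a domain is within this many days of expiry.
WARN_DAYS = 60


def days_until_expiry(expires_on: date, today: date) -> int:
    """Return the number of days from today until expiry (negative if already lapsed)."""
    return (expires_on - today).days


def domains_needing_attention(
    expiry_dates: dict[str, date], today: date, warn_days: int = WARN_DAYS
) -> list[str]:
    """Return domains that have lapsed or expire within warn_days, soonest first."""
    due = [(days_until_expiry(d, today), name) for name, d in expiry_dates.items()]
    return [name for days, name in sorted(due) if days <= warn_days]
```

A check like this could run daily from a scheduler and send its result to whoever handles renewals. The expiry dates would have to come from the registrar's records, which this sketch does not attempt to query.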

Exim regression causing mail delivery to fail on some PHP containers
Wednesday, October 26, 2011

A routine security update to the Exim mail delivery package has been rolled out on PHP containers; however, we have discovered that in some cases mail delivery fails when mail is sent via a PHP script that uses the system binaries to send mail.

A temporary fix for this problem is being rolled out but may not be available to all customer containers immediately. If you continue to experience problems with sending email from your container, please contact help@stackable.com for further assistance. We expect that this problem will be fully resolved by tomorrow, Thursday, October 27th, 2011. We apologize for the inconvenience.

DNS server degradation
Monday, October 24, 2011

At approximately 12:15 PM MST on Sunday, October 23, 2011, the recursive name servers at our upstream provider experienced degraded performance as a result of a Denial of Service attack lasting until about 6:33 PM MST. Systems engineers at Stackable were notified shortly after the problem began and, in concert with our upstream provider, were able to identify and resolve the issue.

Our provider has since implemented systems intended to prevent a future event of this nature.

We are, of course, continuing to investigate additional steps we can take to avoid any potential service interruption. As an added precaution, we will be setting up our own internal recursive name servers, independent of those currently used by our upstream provider, to provide increased reliability and redundancy. It has been, and remains, our top priority to provide you with the best service imaginable, and we deeply apologize for any inconvenience.
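To illustrate why a second, independent set of recursive name servers adds redundancy, here is a small Python sketch of trying resolvers in order and falling back when one fails. It is only an assumption about how such a fallback could be modeled: the `Resolver` type and `resolve_with_fallback` are hypothetical names, and each resolver is represented as a plain function rather than a real DNS client.

```python
from typing import Callable, Optional, Sequence

# Hypothetical resolver interface: takes a host name, returns an address,
# or returns None / raises OSError (e.g. a timeout) when it cannot answer.
Resolver = Callable[[str], Optional[str]]


def resolve_with_fallback(name: str, resolvers: Sequence[Resolver]) -> Optional[str]:
    """Ask each resolver in order and return the first answer obtained."""
    for resolver in resolvers:
        try:
            answer = resolver(name)
        except OSError:
            continue  # treat network errors and timeouts as "try the next one"
        if answer is not None:
            return answer
    return None  # every resolver failed
```

With only the upstream resolvers in the list, an attack on them leaves nothing to fall back to. Adding an internal resolver to the list gives lookups another path.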

Scheduled service outage Wednesday morning
Tuesday, October 18, 2011

Stackable will be performing maintenance in its data center. If you are one of the customers expected to be affected, you have already been notified by email.

We plan to begin this work on Wednesday, Oct 19, at 7:00 AM MDT and expect that all work will be concluded by 7:30 AM (30 minutes). We expect that affected containers will be down for only a few minutes during this time frame.

Stackable does offer high-availability options to ensure that even scheduled downtimes like these won’t affect your site. Contact us by emailing sales@stackable.com or calling (877) 977-2253 to learn more.

If you have any questions at all, please don’t hesitate to let us know. We’re available during regular business hours (9 AM - 5 PM MST, Monday - Friday) via the Live Chat link on our website http://www.stackable.com or by emailing help@stackable.com.

Control panel issues
Tuesday, September 6, 2011

The Stackable Control panel is currently inaccessible. Engineers are working to correct the problem.

If you have urgent requests, please use our Live Chat feature to connect with a technical support agent who can assist you or email help@stackable.com.

All containers and sites are operating normally. 

Update: As of 12:45 AM MST, the Stackable Control Panel is now functioning normally. We are sorry for any inconvenience this may have caused. Thank you for your patience.

Managed switch replacement
Thursday, August 25, 2011

On Friday, August 26, at 5:00 PM MST, we will be replacing a faulty managed switch with a hot spare we have available for such situations. The faulty switch has not resulted in any problems or outages to date, but we have decided to replace it as a precaution. There will be no downtime as a result of this replacement. As always, our goal is to provide our customers with high-availability, reliable service.

If you have any questions or concerns, please don’t hesitate to let us know. We’re available during regular business hours (9 AM - 5 PM MST, Monday - Friday) via the Live Chat link on our website http://www.stackable.com, by emailing help@stackable.com, or by calling (801) 983-7394.

Scheduled service outage Friday morning
Thursday, July 7, 2011

Stackable will be performing maintenance in its data center. If you are one of the customers expected to be affected, you have already been notified by email.

We are beginning work at 12:01 AM MDT on Friday, July 8th, 2011, and don’t anticipate it taking longer than 10 minutes. During this time period, we will be shutting down the load balancer which directs traffic for some websites to their container(s). Visitors will not be able to reach these websites during this time. We are making every effort to minimize the impact this will have and, in the future, will work to find a way to reduce or eliminate the need for service windows like this. Normally we’d wait until the weekend to perform such service, but this situation is somewhat more pressing.

If you have any questions at all, please don’t hesitate to let us know. We’re available during regular business hours (9 AM - 5 PM MST, Monday - Friday) via the Live Chat link on our website http://www.stackable.com or by emailing help@stackable.com.

Update: Beginning at 12:01 AM MST on Friday, July 8th, 2011, engineers replaced a failing load-balancer with a new server. The outage was expected to last only ten minutes, the time needed to bring down the old server and bring the new one online. A notice was sent to affected customers to inform them of the anticipated brief outage.

The switch to the new server went as anticipated until it was discovered that networking was not working as designed.

Nearly all IP addresses bound to customer sites were not accessible from beyond Stackable’s network. Engineers immediately began to investigate and quickly discovered that the ARP (Address Resolution Protocol) caches were not properly expiring on our upstream Cisco network devices.

To make matters worse, the Cisco routers refused to clear the cache despite repeated attempts using a variety of methods. Ultimately, the only available fix was to manually create and enter ARP entries for several hundred addresses bound to the load-balancer in order to restore upstream connectivity. It took some time for this fix to be written and implemented.
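Generating several hundred static ARP entries by hand is error-prone, so a fix like the one described above would typically be scripted. The Python sketch below shows one way that could look. It is an illustration only, not the script the engineers wrote: the function name is hypothetical, the addresses and MAC come from ranges reserved for documentation, and the output uses the general IOS static ARP form `arp <ip> <mac> arpa`, which should be checked against the specific device and software version before use.

```python
import ipaddress


def static_arp_commands(first_ip: str, count: int, mac: str) -> list[str]:
    """Build one IOS-style static ARP line per address, starting at first_ip.

    mac is expected in Cisco dotted notation, e.g. "0000.5e00.5301".
    """
    start = ipaddress.IPv4Address(first_ip)
    # Adding an integer to an IPv4Address yields the next address in sequence.
    return [f"arp {start + offset} {mac} arpa" for offset in range(count)]


if __name__ == "__main__":
    # Placeholder values: a documentation address range and MAC, not real ones.
    for line in static_arp_commands("192.0.2.10", 3, "0000.5e00.5301"):
        print(line)
```

The generated lines could then be reviewed and pasted into the router's configuration mode. Once the underlying caching problem is resolved, the static entries would need to be removed so that normal ARP resolution takes over again.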

This is obviously a very serious flaw in Cisco IOS and our engineers are working with Cisco to determine the cause and implement a permanent resolution.

Though some sites came online earlier, it took until 2:04 AM before engineers determined that operations were fully restored.

We certainly don’t consider this sort of extended outage acceptable and to our customers we offer a sincere apology. We’re sorry.

This morning, we’re working on a number of options to increase the availability of our front-end load-balancers far beyond what we have in place right now. As we work toward increasing reliability, we’ll continue to keep our customers informed.