Last Friday at 10:32 a.m. ET, our primary site experienced a major power disruption that took our website and product down for the rest of the business day and into the evening.
This was truly the perfect storm of events — and never should have happened given the level of investment we make in our hosting setup and systems. Transparency is a key tenet of how we work, and in that spirit, I’d like to explain what happened and how we responded.
A little background
Before jumping into what happened, I’ll share a little about our hosting setup. As I’ve mentioned, we invest heavily in this area, both in our own systems and with our data center space provider. Our primary site is hosted in a Tier 4 wholesale facility, run by a company that manages 20 million square feet of data center space on behalf of its customers. As a Tier 4 facility, it provides everything you would expect:
- Street and generator power
- Automatic transfer of power between feeds
- A and B power feeds into our private space, each fed via an Uninterruptible Power Supply (UPS)
- Dual-corded equipment on our critical gear
All of the core equipment required for our service is configured in a high-availability setup, which means at least two of everything: 200% of capacity provisioned and online.
We also have a warm backup site, also within a Tier 4 facility, to which we continuously replicate all customer data.
Root cause of our service interruption
In short, Constant Contact lost all power at our primary site on Friday. Working with our data center space provider, we have learned that a catastrophic power failure at the facility locked out both street and generator power. The complete outage shut down our systems, as well as those of other companies hosted at the site. This is not something that should happen in a Tier 4 facility.
Recovery steps
We treated this as a Severity 0 event and immediately began executing our disaster response procedures. This included starting the recovery of some services in our alternate site right away and assessing the situation at our primary site to determine which recovery path would give us the shortest recovery time (RTO) with the most current data set (RPO).
Because we had seen 90 minutes of unstable power and our systems had shut down abruptly, we had to completely restart all systems to assess their state. In doing so, we determined that while we had many significant hardware failures, all customer data was intact at our primary site. That finding, weighed against both RPO and RTO, led us to decide to recover in our primary site, and we started the systematic process of replacing hardware and restarting our software.
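To make that trade-off concrete, here is a minimal, purely illustrative sketch of how a recovery-path decision weighs data currency (RPO) against time to restore (RTO). The option names and numbers are hypothetical and do not come from our actual runbooks:

```python
from dataclasses import dataclass

@dataclass
class RecoveryOption:
    name: str
    est_recovery_hours: float      # estimated time until service is restored (RTO)
    est_data_loss_minutes: float   # estimated recent data that would be lost (RPO)

def choose_path(options, max_data_loss_minutes):
    # Keep only options that meet the data-loss (RPO) requirement,
    # then pick the one with the shortest estimated recovery time (RTO).
    viable = [o for o in options if o.est_data_loss_minutes <= max_data_loss_minutes]
    return min(viable, key=lambda o: o.est_recovery_hours) if viable else None

# Hypothetical numbers for illustration only.
options = [
    RecoveryOption("recover in primary site", est_recovery_hours=10, est_data_loss_minutes=0),
    RecoveryOption("fail over to warm site", est_recovery_hours=6, est_data_loss_minutes=15),
]

print(choose_path(options, max_data_loss_minutes=5).name)  # -> recover in primary site
```

In our case, the fact that all customer data was intact at the primary site meant recovering there gave us the best data set, which drove the decision even though it required replacing hardware first.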
We went through this process very carefully to ensure the integrity of our customers’ data; methodically restarting every application was the best way to make sure we got everything running in a safe and stable way. We were able to restore our website and tracking services first. The additional work of shutting down all other applications, restarting them, and verifying their status took us late into the night on Friday.
This all took longer than we would have liked. That said, it was a conscious decision to take a cautious approach, measuring twice and cutting once to protect our customers’ data and bring everything back online in a stable way. After everything was back up and running, we saw a few residual login issues, but those were simply a matter of customers needing to refresh their web browsers. At all times, customer account information and data were secure, with full data integrity.
The final chapter
Successfully recovering with all customer data intact is only the start, as there is a lot to learn from this event. We are working closely with our data center provider on a post-mortem review. We know that our customers depend on Constant Contact, and the entire company is focused on living up to the trust that is placed in us.