How do you prepare for the unexpected? When disaster strikes, having a recovery plan in place can make all the difference. In the healthcare industry, when we lose access to patient records and the systems that drive operations, patient care itself is negatively impacted. So even though outages can’t be predicted, it’s important to have a working disaster recovery plan in place before you need it.
A Technical Waterfall
The sun was shining. The crows were cawing. It was a typical summer day. My coworkers and I were just starting the great lunch debate as we entered the middle of the workday at Salinas Valley Memorial Healthcare System (Salinas, CA). Then it happened. We were notified one of our systems was down. Less than a minute later, MEDITECH was down, then PACS was down, one after the next. The downward cycle continued as several more systems went down one after the other. We contacted MEDITECH and immediately began to prep our recovery processes, but we didn’t know what was causing the issues.
Our server room is in a separate building from the hospital, so a couple of system administrators on our team rushed over to see what the problem was. When they entered the room, nothing was noticeably wrong, but the server lights in one rack had mostly gone amber. It wasn’t until they pulled one of the server blades out of the rack that we found out what the problem was: it was full of water.
One of the chillers in our server room was undergoing maintenance work that day. With the sidewall barriers down during repairs, some pressurized water sprayed the server rack across the aisle. Since the server blades are small and densely stacked on top of one another, the water hit and then cascaded down, shutting off systems one by one as it drained from one blade to the next. This “waterfall” explains why we saw a literal cascading system failure rather than everything going down at the same time.
Once we identified the impacted servers, we were able to start recovering MEDITECH and other systems onto unaffected hardware. Because of good teamwork among staff trained to handle disaster recovery and MEDITECH downtimes, things moved relatively quickly. Within the hour, we had the systems restored, MEDITECH validation completed, and the downtime lifted. The only requirement left was to decide on lunch.
All in all, we had 11 servers fail that day, plus another five that were damaged from water corrosion and needed replacement. Amazingly, our downtime on that fateful day was only 59 minutes. Although we lost all these servers, thankfully, the water spray did not hit our Storage Area Network (SAN), which is only inches away from the servers. Had that SAN gone down, the recovery time alone would have had some systems down for weeks.
We ended up shutting off systems that were in test or not yet in production, along with anything that was running as a redundancy. When all was said and done, we had just enough remaining server capacity to run our production systems. By design, we expected the Virtual Machines (VMs) to recover automatically from a server hardware failure. However, the VMs ended up orphaned on the damaged hardware because the cascading water fried the servers faster than failover could happen. The recovery effort was essentially a matter of finding the VMs on disk and restoring them to working hardware.
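To make that triage concrete, here is a minimal sketch in Python of how one might decide which VMs to bring back when surviving capacity is limited. It is not the tooling we used that day. The tier names, VM names, memory figures, and the plan_restore function are all hypothetical, and memory stands in for whatever resource is actually the constraint.

```python
from dataclasses import dataclass

# Lower number = restored first. These tier names are illustrative
# assumptions, not a real inventory scheme.
TIER_ORDER = {"production": 0, "redundant": 1, "test": 2}


@dataclass(frozen=True)
class VM:
    name: str
    tier: str
    memory_gb: int


def plan_restore(vms, surviving_capacity_gb):
    """Greedy triage: restore VMs in tier order until capacity runs out.

    Returns (restore, shed): names to bring up on working hardware,
    and names to leave powered off for now.
    """
    restore, shed = [], []
    remaining = surviving_capacity_gb
    # Production first; within a tier, smaller VMs first.
    for vm in sorted(vms, key=lambda v: (TIER_ORDER[v.tier], v.memory_gb)):
        if vm.memory_gb <= remaining:
            restore.append(vm.name)
            remaining -= vm.memory_gb
        else:
            shed.append(vm.name)
    return restore, shed


if __name__ == "__main__":
    # Hypothetical inventory of orphaned VMs found on shared storage.
    orphaned = [
        VM("app-test", "test", 32),
        VM("app-prod", "production", 64),
        VM("app-replica", "redundant", 64),
        VM("imaging-prod", "production", 96),
    ]
    restore, shed = plan_restore(orphaned, surviving_capacity_gb=160)
    print("restore:", restore)  # ['app-prod', 'imaging-prod']
    print("shed:", shed)        # ['app-replica', 'app-test']
```

With 160 GB of surviving capacity in this made-up example, production fits exactly and everything else stays off, which mirrors the "just enough" situation described above. A real plan would also weigh CPU, storage paths, and dependencies between systems.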
We also had quite a number of systems in addition to MEDITECH go down that day, but when we brought MEDITECH and the other systems back up, thankfully, we had no data corruption. After a hard shutdown like the one we had, that could easily have happened. Working with our MEDITECH Technical Account Manager, Jay Williams, we were able to get our systems validated by MEDITECH support and back in production very quickly on our remaining hardware. Because water damage tends to void warranties, getting fully back up and running took eight weeks of operating without any further redundancy while we waited for new hardware, and a quarter of a million dollars in replacement costs.
While a plan on paper is important, only so many potential scenarios can be written out. It is quite likely that the disaster recovery plan is going to sit in the binder during an actual disaster since the combination of factors in a particular event won’t match the paper plan. Yet the work that goes into the production of the plan is crucial in order to identify gaps and problems in the data center design. Designing the plan is also helpful in building familiarity with the tools available, so that IT staff can quickly evaluate and respond to the distinctions and variations that are unique to each disaster.
Our end goal is to have a strong design that can defend against any number of scenarios. Planning, redundancy, backups, off-site replication, budgeting, training, and exercises all contribute to a system design that is highly available (or at the very least, easily recoverable). At the same time, that design should minimize the performance impact on production.
Because we lost servers rather than data, most of our recovery tools were not needed for this particular event. We were confident, though, that the backups, off-site tapes, data redundancy, and replication with CloudWave were all available had the situation required them.
While it’s impossible to account for every imaginable data emergency, proper disaster recovery planning can save time and money when it is needed most. It certainly did for us.