When calculating total cost of ownership (TCO), it is important to look beyond the straightforward maintenance costs. For example, when a primary storage system fails, the time it takes to rebuild the data can be quantified in lost dollars. As ZDNet writer Steven Vaughan-Nichols recently noted, redundant array of independent discs (RAID) is a popular choice for adding infrastructure redundancy. However, RAID technology comes with several caveats.
As cloud storage companies build in better uptime guarantees, even a few hours of downtime translates to lost money—whether it takes the form of reimbursing customers for broken promises or losing customers due to cancellations. Vaughan-Nichols identified two core problems with existing technology:
- Reductions in system performance during recovery
- The time the recovery process takes from start to finish
Particularly as data volumes increase, improving these metrics will be crucial to maintaining low TCO. It may not be straightforward to plan for the cost of a disaster, but data centre operators can minimise risk by improving recovery metrics as much as possible.
RAID operational problems
Traditional RAID recovery operates by using the data on active drives to recover the failed drive. This process is not only slow but may also put data integrity at risk, something that businesses cannot sacrifice. The traditional recovery process also has a negative impact on overall system performance because it forces many drives into a long read/write cycle.
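To make the mechanism concrete, here is a minimal sketch of single-parity reconstruction of the kind a RAID-5 group relies on. The block contents and the `xor_blocks` helper are hypothetical, chosen only for illustration. Real controllers work on much larger stripes and rotate parity across drives, but the principle is the same: every surviving drive must be read to recover each lost block.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks from one stripe, one per drive.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data)

# Simulate losing drive 1: its block is recomputed from every surviving
# data block plus the parity block.
failed = 1
survivors = [blk for i, blk in enumerate(data) if i != failed] + [parity]
rebuilt = xor_blocks(survivors)
```

Because each rebuilt block needs a read from every remaining drive, the whole group stays busy for the duration of the rebuild, which is where the performance penalty comes from.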
Dragon Slayer Consulting, via Amplidata, provides an in-depth look at the problems associated with traditional RAID technology in its white paper titled “Ignore the Impending RAID Catastrophe at Your Own Risk”. When capacities were in gigabytes instead of terabytes, rebuilds took a few minutes and didn't cripple performance. According to Dragon Slayer, rebuilding a 2TB hard drive in a RAID-5 group could take between 50 and 60 hours, even when given a high priority.
“But high prioritisation reduces storage system performance by as much as 50% or more. Few IT organisations can tolerate that much storage system performance reduction for two and a half days,” notes Marc Staimer of Dragon Slayer. “Therefore, rebuild prioritisation is often set as a background task reducing that performance penalty to more tolerable levels. While this eliminates most of the negative performance impact on the storage system, it also lengthens that rebuild time by as much as 7×, or 700%, to more than two weeks.”
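The arithmetic behind those figures can be checked with a short calculation. This sketch uses only the numbers quoted above (50 to 60 hours at high priority, and a slowdown of up to 7× as a background task). The `rebuild_days` helper is a hypothetical name introduced for illustration.

```python
HOURS_PER_DAY = 24

def rebuild_days(high_priority_hours, slowdown_factor):
    """Convert a high-priority rebuild time into days at a given slowdown."""
    return high_priority_hours * slowdown_factor / HOURS_PER_DAY

for hours in (50, 60):
    high = rebuild_days(hours, 1)        # rebuild run at high priority
    background = rebuild_days(hours, 7)  # worst case quoted for a background task
    print(f"{hours} h at high priority = {high:.1f} days; "
          f"as a background task = up to {background:.1f} days")
```

At the quoted worst-case slowdown, the rebuild stretches to roughly 14.6 to 17.5 days, which matches the “more than two weeks” in the white paper.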
Lengthy downtime presents challenges on its own, but there is another issue associated with RAID technology: it presents a significant threat to data integrity. The traditional recovery process puts extra strain on the other drives in the same system. Dragon Slayer also observes that this strain makes it more likely that another piece of hardware will fail while data is being recovered.
Seagate RAID Rebuild technology
To combat the common pain points associated with the recovery process, Seagate RAID Rebuild technology takes a different approach. Rather than first attempting to rebuild data from existing drives, Seagate RAID Rebuild technology extracts as much data as possible from the failed drive and then initiates RAID recovery. This process is not only significantly faster, but also puts less pressure on the rest of the system. The net result is that both problems ZDNet touched on are reduced, along with a third: the risk of a secondary failure.
Although Seagate RAID Rebuild technology cannot extract data from a drive with no functional heads, a successful partial copy lowers recovery time significantly while also reducing the burden on the rest of the system. Because data is better protected and downtime is lessened, the technology reduces overall TCO.
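The copy-first idea can be illustrated with a conceptual sketch. This is not Seagate's implementation, whose internals are not described here. It assumes a simple single-parity layout, models an unreadable block as `None`, and uses hypothetical names (`xor_blocks`, `copy_then_rebuild`) throughout. It shows only why a partial copy reduces the work placed on the surviving drives.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def copy_then_rebuild(failed_drive, surviving_drives):
    """Fill a replacement drive, copying whatever the failed drive can still serve.

    failed_drive: list of blocks, with None where a block is unreadable (assumed model).
    surviving_drives: list of drives (lists of blocks), parity drive included.
    Returns the replacement blocks and how many needed a parity rebuild.
    """
    replacement = []
    parity_rebuilds = 0
    for index, block in enumerate(failed_drive):
        if block is not None:
            # Readable block: a straight copy, no load on the other drives.
            replacement.append(block)
        else:
            # Unreadable block: fall back to reading the whole stripe.
            stripe = [drive[index] for drive in surviving_drives]
            replacement.append(xor_blocks(stripe))
            parity_rebuilds += 1
    return replacement, parity_rebuilds

# Hypothetical three-drive group: two data drives and one parity drive.
drive_a = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]
drive_b = [b"\x10\x20", b"\x30\x40", b"\x50\x60"]
parity = [xor_blocks([a, b]) for a, b in zip(drive_a, drive_b)]

# drive_a fails, but only its middle block is unreadable.
partially_readable = [drive_a[0], None, drive_a[2]]
restored, rebuilds = copy_then_rebuild(partially_readable, [drive_b, parity])
```

In this toy case only one of three blocks requires reads from the surviving drives. A drive with no functional heads corresponds to every block being `None`, in which case the sketch degrades to a full traditional rebuild.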