A client contacted us to say that his Storage Spaces pool had failed and to ask for a way to recover the data from it. The pool had been created with thin provisioning at a nominal size of 50 TB! However, the failed pool showed only 18 TB of data in use.
The failure occurred when the client tried to add two more drives to a pool that already consisted of 16 (!) drives. Storage Spaces, for some strange reason, refused to add the disks. To make matters worse, it disconnected two existing disks from the pool and cut off access to the data.
Figure: What a failed storage space looks like
We started working the case and quickly determined that Storage Spaces no longer considered the two missing disks to be part of the current pool. As a result, the areas containing the respective disk identifiers had been completely overwritten.
Generally, there is an unwritten rule in data recovery: never repair anything "in place". The reason is that the risk of losing data during recovery is too great. Regardless of whether recovery is attempted by the data owner via software recovery tools or by a professional lab, recovery is always "read only". Any modifications are made to copies of the user data, either disk image files or drive clones.
However, in this case, scanning 20 TB of data and then asking the client to provide another 20 TB to hold the recovered data seemed, to put it mildly, unrealistic. So we made an exception to the above rule and, with the client's permission, took the risk. We attempted to fix the Storage Spaces metadata "in place", hoping that we could get Storage Spaces to detect the missing disks. As a precaution, we backed up the original Storage Spaces configuration, so that in case of failure it would be possible to restore the original, albeit non-working, state.
We adjusted the Storage Spaces metadata on one of the missing disks. But, alas, Storage Spaces still refused to recognize the "fixed" disk, complaining about inconsistent metadata. Although the same fix had brought the "missing" disks back into a test pool in our lab, it failed to work at the client's site.
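The precaution of backing up before an in-place edit can be sketched roughly as follows. This is only an illustration: the function names, the offset, and the length are hypothetical, and it makes no claim about where Storage Spaces actually keeps its metadata or about the tooling used in this case. The idea is simply to save the bytes that are about to be changed, along with a checksum, so the original state can be written back.

```python
import hashlib
import os


def backup_region(device_path, offset, length, backup_path):
    """Save `length` bytes starting at `offset` to `backup_path`.

    Returns the SHA-256 hex digest of the saved bytes. `offset` and
    `length` are hypothetical; the real metadata location is not
    described in this article.
    """
    with open(device_path, "rb") as src:
        src.seek(offset)
        data = src.read(length)
    if len(data) != length:
        raise IOError("short read: region extends past end of device")
    with open(backup_path, "wb") as dst:
        dst.write(data)
        dst.flush()
        os.fsync(dst.fileno())  # make sure the backup is on stable storage
    return hashlib.sha256(data).hexdigest()


def restore_region(device_path, offset, backup_path, expected_sha256):
    """Write the saved bytes back to `offset`, after verifying the backup."""
    with open(backup_path, "rb") as src:
        data = src.read()
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("backup file does not match recorded checksum")
    with open(device_path, "r+b") as dst:  # r+b: overwrite without truncating
        dst.seek(offset)
        dst.write(data)
        dst.flush()
        os.fsync(dst.fileno())
```

In practice the backup file should live on a different physical disk from the one being modified, and the digest should be recorded before any edit is attempted.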
Failing to fix Storage Spaces "in place", we convinced the client to face the fact that he needed to provide 20 TB of storage to copy data extracted from the recovered Storage Spaces configuration. At the moment, copying is well underway, with the speed of recovery mainly limited by the ability of the client to provide the disks.
An interesting note is that there was a single parity virtual disk in the pool, which by definition can operate with one disk missing. Accordingly, the prototype of our Storage Spaces recovery software can also handle a Storage Spaces parity volume with one disk missing. This allowed us to pull the biggest disk (3 TB) out of the pool and use it to store the data recovered from that same pool.
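Why a parity volume survives the loss of one disk can be shown with a small sketch. It assumes plain XOR parity of the kind used in RAID 5. The actual on-disk layout of Storage Spaces parity is not described in this article, and the function names here are made up for illustration. Each stripe holds several data blocks plus one parity block that is the XOR of the data blocks, so any single missing block equals the XOR of all the others.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    result = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def make_stripe(data_blocks):
    """Return the data blocks followed by their XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild_missing(stripe):
    """Rebuild the one block marked None from the surviving blocks.

    Returns (index_of_missing_block, rebuilt_bytes).
    """
    missing = [i for i, b in enumerate(stripe) if b is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one missing block")
    survivors = [b for b in stripe if b is not None]
    return missing[0], xor_blocks(survivors)


# Usage: three data "disks" plus parity; the second disk is pulled.
stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
damaged = list(stripe)
damaged[1] = None
index, block = rebuild_missing(damaged)
assert (index, block) == (1, b"BBBB")
```

The same property is what makes it possible to remove one healthy disk on purpose and still read the volume, at the cost of having no remaining redundancy while that disk is out.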
Lessons Learned
The key take-aways from this experience are:
- Our knowledge of Storage Spaces gained from reverse engineering is still incomplete.
- It takes a lot of time and tedious work to recover data from an 18 TB pool! We have already spent ten days and data is still being copied.
- You should resist the temptation to create one huge Storage Spaces pool. It is better to create several small pools. By doing this, you will save time and money should data recovery ever become necessary. If we had three 6 TB pools to recover vs. one 18 TB pool, it would be much easier to maneuver with free space and the recovery process would take less time.
- Back up data, even if you work with a parity volume. Storage Spaces, like RAID, is not a substitute for backup! If you need to purchase a lot of disks to recover data (as in our case), then use the disks for backup after recovery is complete.
Elena Pakhomova does both marketing and development for data recovery software company ReclaiMe.com.