Have you ever experienced a major uh-oh moment? You know, when something goes completely awry and your mind starts to race, your palms start to sweat and the classic Homer Simpson “d’oh” echoes in your head. Well, that’s what I imagine recently happened to one of the system administrators at GitLab, a web service for hosting and syncing source code. News broke on Tuesday that GitLab had suffered a major backup restoration failure following an accidental data deletion. According to the tech news site The Register, the incident was caused by an admin who was working late and appears to have deleted the wrong folder during a planned maintenance operation.
Whoops! It’s a textbook case of human error that can happen to any business – no matter how technically advanced or experienced. Nevertheless, this unfortunate incident highlights the critical need to provide true backup and disaster recovery (BDR) to your clients so they don’t become the next headline.
What Went Wrong?
Initially, the GitLab service seemed to be experiencing load-time and stability problems. Alas, the issue quickly escalated into emergency database maintenance after data was accidentally deleted, per a series of tweets from the @GitLabStatus account.
we are experiencing issues with our production database and are working to recover— GitLab.com Status (@gitlabstatus) February 1, 2017
We accidentally deleted production data and might have to restore from backup. Google Doc with live notes https://t.co/EVRbHzYlk8— GitLab.com Status (@gitlabstatus) February 1, 2017
As the tweets above show, the company confirmed the emergency and admitted that someone deleted something they shouldn't have. That someone was, supposedly, a tired GitLab admin who accidentally wiped a folder containing 300GB of live production data that was due to be replicated. The procedure calls for a snapshot every 24 hours, and the deletion occurred six hours after the most recent one. As a result, the six hours of data created since that snapshot were lost, perhaps permanently.
On top of all of this, the company then experienced a major backup restoration failure. While trying to restore the deleted data, the team noticed that the replication procedure was very fragile and prone to error. This led to the realization that, of the five backup techniques deployed, none had been working reliably or had been set up correctly in the first place, as explained in the company's Google Doc of live notes.
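To make the snapshot arithmetic concrete, here is a minimal sketch of how the data-loss window falls out of the snapshot schedule. The function names and the timestamps are hypothetical and purely illustrative. Only the 24-hour interval and the six-hour gap come from the reports.

```python
from datetime import datetime, timedelta


def data_lost(last_snapshot: datetime, incident: datetime) -> timedelta:
    """Everything written after the last good snapshot is at risk."""
    if incident < last_snapshot:
        raise ValueError("incident cannot precede the last snapshot")
    return incident - last_snapshot


def worst_case_loss(snapshot_interval: timedelta) -> timedelta:
    """If snapshots are the only copy, the worst case is one full interval."""
    return snapshot_interval


# Hypothetical timestamps: only the six-hour gap matters here.
last_snapshot = datetime(2017, 1, 1, 0, 0)
incident = last_snapshot + timedelta(hours=6)

print("lost:", data_lost(last_snapshot, incident))
print("worst case:", worst_case_loss(timedelta(hours=24)))
```

The takeaway is that the snapshot interval sets the ceiling on how much data a single mistake can cost, so shortening it directly shrinks the exposure.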
The Main Takeaways
The Dangers of Human Error
As an MSP, what should you take away from this? The first thing that comes to mind is that human error can be a risk to any organization. No matter the technology or procedures you have in place – one minor slip-up by an employee can lead to a major data disaster. It happened to GitLab, and it can happen to your clients just as easily. Having a reliable BDR solution gives MSPs like you the opportunity to strengthen your clients’ data security. After all, it’s your responsibility to ensure that they’re protected against any potential issues or threats.
Not All BDR Technologies are Created Equal
Each time a story such as this one breaks, it becomes that much more difficult for SMBs to blindly trust that their backup systems are working properly. We all understand that data loss and breaches are not uncommon today, but most of the time it’s how the situation is handled that matters most. Although GitLab handled this incident with honesty and transparency, their fatal mistake was skimping on backup verification! When attempting to recover their data, GitLab encountered a handful of problems, including not being able to figure out where regular backups were stored, having a flawed replication procedure and relying on single snapshots as verification. It’s especially interesting that they had five different backup techniques in place, yet not even one worked properly when they needed it most. In this case, quality should have outweighed quantity, which goes to show that having the right BDR platform can make all the difference.
Backup verification has become an important resource for MSPs. There are various ways that BDR solutions try to verify the validity of a backup, but none are as robust as Continuum BDR’s Tru-Verify™ feature. Tru-Verify allows for actual video verification of backups, which provides an even greater level of detail and allows for more thorough troubleshooting when something needs a closer look. The result is that MSPs using Continuum BDR can have a lot more confidence in their backup verifications, and that confidence is a huge value add when it comes to conversations with clients. Additionally, the Continuum Network Operations Center (NOC) proactively addresses any Tru-Verify failures, so MSPs don’t have to devote a technician to constantly babysit the product or take time each week addressing issues that pop up. It’s what makes Continuum BDR a comprehensive and highly reliable BDR platform.
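For a sense of what even the most basic verification looks like, here is a minimal sketch, assuming a backup is a single file on disk. It checks that the file exists, is not empty, is recent enough and, optionally, matches a checksum recorded when the backup was written. The function names, the 24-hour threshold and the file layout are assumptions for illustration; this is not how GitLab or any particular BDR product does it, and it is no substitute for a real test restore.

```python
import hashlib
import time
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash the file in 1 MiB chunks so large backups don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_backup(path, expected_sha256=None, max_age_hours=24.0, now=None):
    """Return a list of problems found; an empty list means the checks passed."""
    path = Path(path)
    if not path.is_file():
        return ["backup file is missing"]

    problems = []
    stat = path.stat()
    if stat.st_size == 0:
        problems.append("backup file is empty")

    # 'now' can be injected for testing; defaults to the current time.
    now = time.time() if now is None else now
    age_hours = (now - stat.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"backup is {age_hours:.1f} hours old")

    if expected_sha256 is not None and sha256_of(path) != expected_sha256:
        problems.append("checksum mismatch")
    return problems
```

Run on a schedule, with an alert whenever the returned list is not empty, a check like this turns a silent failure into a visible one. It only proves the file is present and intact, though, not that it can actually be restored.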
With stories like GitLab’s breaking more and more often, it’s now essential that you’re offering your clients the most efficient BDR solution on the market. Without it, their business could be spotlighted in the next major data loss headline.