
The night my virtual server burned down

On the morning of March 10th, I found overnight reports from my monitoring system in my e-mail inbox: one of my virtual private servers was no longer reachable. It took me a moment to realize what had actually happened. A lesson in the importance of a fireproof backup strategy.

At precisely 2:30 a.m. in the early hours of March 10th, the monitoring of my cloud provider OVH informed me that my Virtual Private Server (VPS) was no longer responding. When I saw the message early in the morning, I confirmed that I really could not reach the system anymore. Rather unusual: the machine normally runs without any problems. In addition, a mailbox on OVH's mail service was unreachable, answering only with "Connection refused".

So, off to the web interface and the VPS. Strange things happened when logging in to the OVH web GUI: the login hung, all rather unusual. Suspecting a major outage, I went looking for information. OVH maintains several status pages in different languages: French, English, sometimes also German.

Source: Twitter

None of that got me very far until I stumbled across Octave Klaba, the founder of OVH, on Twitter. At 3:42 a.m. he had posted about a major incident: a fire in the SBG2 data center in Strasbourg, where I knew my virtual server was running.

I didn't know exactly which of the data centers there hosted my server. But given the impressive pictures of the extent of the fire and the reports that followed, it hardly mattered: SBG2 completely destroyed, a third of SBG1 destroyed, SBG3 and SBG4 shut down.

Source: Twitter

Backup

I had not booked OVH's backup option for my VPS; instead, I backed it up the conventional way, with my own proven backup script, to another server.

First, a check of my own inventory: a full backup from March 1st, incremental backups on top of it from March 5th to March 9th. The run on March 10th never completed, since it was scheduled after 3 or 4 in the morning. If the system did not come back, I would be facing almost 24 hours of data loss, which is not critical for my hobby system.
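For illustration, a minimal sketch of such a full-plus-incremental scheme in Python, assuming a simple mtime-based approach. The paths and the scheme are made up for the example; this is not the script I actually use:

```python
#!/usr/bin/env python3
"""Minimal full/incremental backup sketch (paths and scheme are hypothetical).

A "full" run archives everything and records a timestamp; an "incremental"
run archives only files modified since the last full run.
"""
import sys
import tarfile
import time
from pathlib import Path

SOURCE = Path("/srv/data")           # what to back up (assumed)
TARGET = Path("/backup/vps")         # e.g. a mount from another server
STAMP = TARGET / "last_full.stamp"   # timestamp of the last full backup


def backup(kind: str) -> None:
    TARGET.mkdir(parents=True, exist_ok=True)
    since = 0.0
    if kind == "incremental" and STAMP.exists():
        since = float(STAMP.read_text())

    archive = TARGET / f"{kind}-{time.strftime('%Y%m%d-%H%M%S')}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for path in SOURCE.rglob("*"):
            if path.is_file() and path.stat().st_mtime >= since:
                tar.add(path, arcname=str(path.relative_to(SOURCE)))

    if kind == "full":
        STAMP.write_text(str(time.time()))


if __name__ == "__main__":
    backup(sys.argv[1] if len(sys.argv) > 1 else "incremental")
```

Because the timestamp is only updated on full runs, every incremental run captures everything changed since the last full backup, i.e. incrementals "at the same level" as in my inventory above.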

Data replication

But why shouldn't the system come back? A virtual server is, in principle, independent of the bare metal it runs on; it can be started on any host system. What matters is the system's data disk. When I booked the VPS at OVH back then, it was advertised with Ceph storage and triple replication.

Triple replication sounds great. The question is: where are the replica nodes? I had no further information on that. I could easily cope with my hobby and tinkering server being down for a few days, but when should I start the recovery? Frankly, I did not feel like doing a restore from backup. So I decided to wait for a final answer on what had happened to my data.

After OVH had its own infrastructure and web GUI running again, I could determine that my system was supposed to run in cluster 001 os-sbg1-002, with the location given as SBG1. Since only a third of SBG1 had been destroyed, I assumed, based on the information available at the time, roughly a 2/3 probability that my system had survived. Assuming the Ceph nodes were not all in the same room, the odds were even a little better.

Everything burned

On March 12th, however, the sobering news came that os-sbg1 had actually been located in SBG2 and had burned completely. With the total loss now certain, and OVH offering a few months of credit towards new orders, I restored the system on a comparable VPS in the Gravelines data center.

As always, there is a certain thrill in actually restoring a full backup. First find the encryption keys and decrypt the backup. Boot the new virtual server into the rescue system, prepare the disk, and restore the data over the network. IPv4 was assigned dynamically, but IPv6 had been configured statically, so that had to be adjusted quickly. Install and activate the bootloader, then boot the system. Voilà, the system is back. Correct the IP addresses in DNS; the CNAMEs follow automatically, and all services are reachable again.

The next morning: make sure the backup has run again, and then the system can go back to normal operation.
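That check can also be automated. A minimal sketch, assuming backups land as timestamped files in a single directory; the path and the 26-hour threshold are invented for the example:

```python
#!/usr/bin/env python3
"""Alert if the newest backup archive is older than a given threshold.

Sketch only: assumes backups are written as files into one directory;
path and threshold are illustrative, not a fixed recommendation.
"""
import sys
import time
from pathlib import Path

BACKUP_DIR = Path("/backup/vps")   # hypothetical backup target
MAX_AGE_HOURS = 26                 # a little slack beyond a daily cycle


def newest_backup_age_hours() -> float:
    archives = [p for p in BACKUP_DIR.iterdir() if p.is_file()]
    if not archives:
        return float("inf")
    newest = max(p.stat().st_mtime for p in archives)
    return (time.time() - newest) / 3600


if __name__ == "__main__":
    age = newest_backup_age_hours()
    if age > MAX_AGE_HOURS:
        print(f"WARNING: newest backup is {age:.1f} h old")
        sys.exit(1)        # non-zero exit lets monitoring pick it up
    print(f"OK: newest backup is {age:.1f} h old")
```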

Disaster recovery

What insights can be drawn from this little lesson?

For many years I have been quoting the old wisdom "Nobody wants backup, everyone wants restore" in backup discussions.

Disaster recovery is something you always have to keep in mind for local infrastructure. Fire and (extinguishing) water are no friends of servers. In the worst case, a total loss always has to be expected. At a certain level of abstraction, CPUs and memory are completely irrelevant; they can be procured again quickly. OVH managed to provide 10,000 new servers in Gravelines within a few days.

The data matters more. Redundancy does not help if all redundant storage nodes are destroyed together; at the very least, those nodes should sit in different fire compartments. Backups, in turn, should be kept off-site in a completely separate environment: stored securely, and ideally encrypted on the media. But decrypting the backups must still be possible even if the entire site has become unusable, which means the keys must be kept off-site as well.
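As a minimal sketch of that principle, here is symmetric encryption with Python's cryptography package; the file names are invented for the example. The point is simply that the key file is copied somewhere entirely separate from both the server and the backup media:

```python
#!/usr/bin/env python3
"""Encrypt a backup archive so that the key can be stored off-site.

Sketch only: uses Fernet (symmetric) from the 'cryptography' package;
paths are illustrative. Keep backup.key in a separate location
(password manager, printout in a safe, a second provider, ...).
"""
from pathlib import Path
from cryptography.fernet import Fernet

ARCHIVE = Path("full-20210301.tar.gz")   # hypothetical backup archive
KEY_FILE = Path("backup.key")            # store a copy somewhere else!

# Generate the key once and reuse it for later backups.
if not KEY_FILE.exists():
    KEY_FILE.write_bytes(Fernet.generate_key())

fernet = Fernet(KEY_FILE.read_bytes())

# Encrypt the archive (read fully into memory; fine for a small archive).
encrypted = fernet.encrypt(ARCHIVE.read_bytes())
Path(str(ARCHIVE) + ".enc").write_bytes(encrypted)

# The restore path that must still work without the original server:
# restored = Fernet(KEY_FILE.read_bytes()).decrypt(encrypted)
```

In my case this was exactly the first step of the restore: the keys had to be found, outside the burned-down site, before the backup was of any use.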

The price of lost data

As a rule, lost data cannot simply be bought back with money these days. The volume of data processed today is so large that we are no longer talking about a few typists re-entering the day's paper order forms. We are talking about extensive data sets that in most cases cannot be reconstructed at all. "Please re-submit all orders from March 10th" is a rather hopeless request directed at ... well, whom exactly?

Special appliances

Special appliances such as firewalls or proxy systems are a different story from "normal" computers. Here, even procuring replacement hardware can become a problem, on top of keeping a complete backup of the configuration and, again, of the encryption keys. That is why cluster nodes or redundant systems should be installed in different fire compartments: if one data center burns down, the other can at least take over in a disaster.

That leaves the subject of the Internet connection. Often there is only one handover point, so the primary uplink terminates in a single data center. As a rule, though, the replacement line can be brought into the second data center, provided the site can still be accessed. Alternatively, an LTE fallback link can be arranged with the carrier. Either way, that can be sorted out much faster than procuring all the appliances from scratch.

What does the cloud offer?

While local conditions can be surveyed and understood reasonably well, so that disaster scenarios are easy to describe, things are much harder in the cloud. As a rule, you can assume that the cloud provider largely excludes itself from liability through its terms and conditions. In the common Infrastructure as a Service (IaaS) offerings, i.e. the rental of virtual machines and storage for free use, backup is usually the customer's responsibility. Snapshots are sometimes offered, but they usually do not replace a backup.

With SaaS services there is sometimes a rudimentary backup on the provider's side; by now, interfaces that the common backup tools can hook into, so that customers can run backups on their own responsibility, are far more common. And as a rule the backup itself is not the challenge: granular restore is often the real challenge, and it is what the special tools are needed for.

Availability modeling in the cloud

Cloud data centers bring other challenges as well. At AWS, for example, you sometimes have to restart virtual machines so that they move to different hardware, because Amazon wants to carry out maintenance on the original host.

At Azure, for example, redundancy is described in terms of availability zones and availability sets. Each availability zone consists of at least one data center whose power supply, cooling, and networking operate independently. Availability sets, on the other hand, are merely independent servers, storage, and network components within a shared data center.

Azure in Germany does not yet offer availability zones. If you need them, you have to go to Azure West Europe, i.e. Amsterdam. Which shows that moving to the cloud is more complicated in the details than it looks at first glance.

Security comes at a price

Here it matters where you are coming from: compared to the server cabinet in the broom closet, the cloud is probably better protected. But if you are coming from a deliberately modeled availability setup with separate server rooms on your own company premises, you have to make sure that availability does not end up worse than before. And cloud providers usually charge for that improved availability, just as they do for everything else. Anyone who has ever tried to estimate their backup costs in Microsoft's Azure pricing calculator will understand.

Contingency planning

What really matters is that everyone responsible becomes clear about the risks and about the consequences of failure and destruction. This can be done in a structured way with a dedicated contingency-planning tool, or simply by hand in a wiki. The important thing is to consider whether this documentation will still be accessible after a disaster. And do not forget regular recovery tests: nothing is worse than a backup that cannot be restored.
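Such a recovery test can be kept very small. A minimal sketch, reusing the hypothetical encrypted archive and key file from above: decrypt and unpack into a scratch directory, then check that a few sentinel paths actually came back.

```python
#!/usr/bin/env python3
"""Minimal restore test: decrypt and unpack the backup into a scratch
directory and verify that a few sentinel paths exist afterwards.

Sketch only; archive, key file, and sentinel list are illustrative.
"""
import io
import tarfile
import tempfile
from pathlib import Path
from cryptography.fernet import Fernet

ENCRYPTED = Path("full-20210301.tar.gz.enc")
KEY_FILE = Path("backup.key")
SENTINELS = ["etc/passwd", "home"]   # paths a good restore must contain


def test_restore() -> bool:
    data = Fernet(KEY_FILE.read_bytes()).decrypt(ENCRYPTED.read_bytes())
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
            tar.extractall(scratch)
        missing = [s for s in SENTINELS if not (Path(scratch) / s).exists()]
    if missing:
        print("Restore test FAILED, missing:", ", ".join(missing))
        return False
    print("Restore test OK")
    return True


if __name__ == "__main__":
    raise SystemExit(0 if test_restore() else 1)
```

Run something like this regularly, ideally on a machine other than the one being backed up, so a broken backup chain is noticed before the disaster rather than after.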