Why didn't anyone bother Microsoft Office

Microsoft cloud failures

Marketing is unlikely to list the problems of cloud providers. After all, that could put customers off Office 365 or other cloud services. But anyone who takes the subject seriously has to shed light on it, and customers should understand that nobody can guarantee 100% availability. Microsoft at least documents its SLAs, something many IT departments find very difficult to do for their own services, even when asked. People also tend to have a somewhat rose-tinted view of past failures: in hindsight, everything wasn't that bad.

Where is the problem?

With cloud services, operations become even trickier, because at least three parties are involved:

  • Local
    A cloud service can never be used without working local components. Your PC has to be running and the application installed. You need a connection to the Internet with working name resolution and, if present, a proxy / firewall that passes the packets. But that's not all. Users have to authenticate to the cloud, and if you use pass-through authentication (PTA) or ADFS authentication, the cloud relies on your local servers. Appropriate monitoring is definitely advisable here.
  • Transfer networks
    Microsoft does everything it can to ensure that the fastest, shortest and "managed" path possible is used between your access point to the Internet and the service in the cloud. However, these routes via the provider can also be disrupted. The cause can be the classic excavator cutting a cable, but a DoS attack on your Internet access or your provider can also disrupt the path. You can and should monitor these sections as well, e.g. with End2End-HTTP, End2End Office 365, End2End-UDP3478 and others.
  • Microsoft cloud
    You can be very sure that Microsoft monitors its environment actively and continuously and therefore detects and fixes problems and errors very quickly. Errors can still occur. In that case your own monitoring should at least show that the problem is probably not on your side. The tricky thing is that there are a lot of servers and transitions at Microsoft, so a fault is practically never a binary situation.
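The three parties above suggest a simple triage: check name resolution locally, then reachability of some independent reference host, then reachability of the cloud service itself. The following is only a minimal sketch of that idea, not one of the End2End tools mentioned above. The function names and the reference host `example.com` are my assumptions, and a real check would also have to cover proxy and authentication.

```python
import socket


def check_dns(hostname):
    """Return True if the local resolver can resolve the hostname."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False


def check_tcp(hostname, port=443, timeout=3.0):
    """Return True if a TCP connection to hostname:port can be opened."""
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


def locate_fault(dns_ok, reference_ok, service_ok):
    """Rough guess at which of the three parties is affected.

    dns_ok:       local name resolution works
    reference_ok: an independent Internet host is reachable
    service_ok:   the cloud service endpoint is reachable
    """
    if not dns_ok:
        return "local"
    if not reference_ok:
        return "local or transit"
    if not service_ok:
        return "transit or cloud"
    return "none"


if __name__ == "__main__":
    service = "outlook.office365.com"
    reference = "example.com"  # hypothetical reference host, pick your own
    print(locate_fault(check_dns(service), check_tcp(reference), check_tcp(service)))
```

The result is deliberately coarse. As described above, a fault is rarely binary, so "transit or cloud" is often the most precise answer a single probe can give.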

If you think you have detected a failure, you must already have defined what a failure is. That is not so easy, because a disruption may affect only a subset of users. I have seen cases in which new users could no longer work while existing users stayed active. One possible cause: new users could no longer log in because of a local malfunction of the ADFS server, while the tickets of users who were already signed in were still valid. There have also been cases where a single server in the cloud had problems. Since Microsoft distributes users' mailboxes over a large number of servers, only very few users, or even just one, notice this. You cannot reliably monitor for such a thing. Even if you opened every mailbox again and again via EWS and impersonation, that would only test EWS, not ActiveSync or MAPI/HTTP. If you add the user's location to this fuzziness, i.e. a user in a branch office uses a different Internet access and thus a different Azure entry point, you end up with an almost infinite number of constellations, and you certainly cannot check them all.
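A quick back-of-the-envelope calculation shows how fast the number of constellations grows. All figures below (5000 mailboxes, 20 locations, 2 seconds per probe) are made-up assumptions for illustration only.

```python
def probes_per_round(mailboxes, protocols, access_paths):
    """Number of probes needed to test every mailbox via every protocol and path."""
    return mailboxes * len(protocols) * access_paths


def round_duration_hours(probes, seconds_per_probe, parallel_workers=1):
    """Time for one full round of probes, in hours."""
    return probes * seconds_per_probe / parallel_workers / 3600


protocols = ["EWS", "ActiveSync", "MAPI/HTTP"]
probes = probes_per_round(mailboxes=5000, protocols=protocols, access_paths=20)
print(probes)                                    # 300000
print(round(round_duration_hours(probes, 2)))    # 167 hours, about a week
```

Even with these modest numbers, one sequential round takes about a week, so a per-user outage would be found far too late.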

An interruption of availability without data loss has to be assessed differently from a disruption with data loss. With the DAG, available since Exchange 2010, Microsoft has created a constellation for Exchange Online in which mails should no longer be lost. The transport service delivers the mail to the next station and only deletes its own copy once the next station has confirmed that it has saved the information on another system, i.e. the next transport service or a replica of the mailbox. So two servers in a row would have to fail within a very short time of each other. That is very unlikely, but not impossible.

With other services, e.g. SQL databases, data can also be transferred to another server via log shipping or snapshots and thus be protected against loss much sooner than a classic "one backup per night" ever could. However, there is a time lag here as well. With Azure databases, Microsoft supposedly takes a snapshot every 5 minutes. If a database is deleted, as happened at the beginning of 2019, up to 5 minutes of data can no longer be restored. But here, too, you should ask yourself exactly which SLA you as the IT department actually guarantee your company for your local servers.
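To compare the time lag of a 5-minute snapshot with a nightly backup, the following sketch computes how much data is lost for a given failure time. It assumes snapshots run at a fixed interval from a known start time. The dates and function names are hypothetical and say nothing about how Azure actually schedules its snapshots.

```python
from datetime import datetime, timedelta


def last_snapshot_before(failure, first_snapshot, interval):
    """Return the time of the most recent snapshot at or before the failure."""
    completed = (failure - first_snapshot) // interval  # whole intervals elapsed
    return first_snapshot + completed * interval


def data_loss_window(failure, first_snapshot, interval):
    """Return how much time worth of data is lost when restoring the last snapshot."""
    return failure - last_snapshot_before(failure, first_snapshot, interval)


midnight = datetime(2020, 1, 1, 0, 0)          # hypothetical start of the schedule
failure = datetime(2020, 1, 1, 12, 14, 30)     # hypothetical time of the deletion

print(data_loss_window(failure, midnight, timedelta(minutes=5)))  # 0:04:30
print(data_loss_window(failure, midnight, timedelta(hours=24)))   # 12:14:30
```

The worst case is always just under one full interval: almost 5 minutes for the snapshots, almost 24 hours for the nightly backup.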

Anyone who, like Microsoft, operates millions of servers, hard drives and network connections always has to reckon with failures. Failures are in fact the order of the day. Large hosters speak of a failure rate of 1-2% per year for hard drives. Repairs, RAID rebuilds or reseeding of data in clusters and DAGs are therefore routine in the cloud. On-premises administrators are much more nervous about such operations, and if replacement hardware has to be delivered first, the tension lasts even longer.
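What a 1-2% annual failure rate means in practice depends entirely on the fleet size. The fleet sizes below (one million drives in the cloud, 20 drives on premises) are assumptions for illustration, not published figures.

```python
def expected_failures(drives, annual_failure_rate):
    """Return the expected number of drive failures per year and per day."""
    per_year = drives * annual_failure_rate
    return per_year, per_year / 365


for label, drives in [("cloud (assumed 1,000,000 drives)", 1_000_000),
                      ("on-premises (assumed 20 drives)", 20)]:
    for rate in (0.01, 0.02):
        per_year, per_day = expected_failures(drives, rate)
        print(f"{label}, {rate:.0%}: {per_year:.1f} per year, {per_day:.2f} per day")
```

Under these assumptions a large operator replaces dozens of drives every day, while a small on-premises installation sees one failed drive every few years. That goes a long way towards explaining why the same event is routine for one and a cause of nervousness for the other.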

Office 365 status

Microsoft publishes its own availability figures at https://docs.microsoft.com/de-de/office365/servicedescriptions/office-365-platform-service-description/service-health-and-continuity. In mid-February 2020 they all looked pretty good.

However, 99.99% still allows up to 52:36 minutes of downtime per year. Such summaries are also very general: a service consists of many sub-services, so a failure of one part is probably only partially counted. And every failure feels too long.
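The 52:36 figure can be reproduced with a few lines of arithmetic. The sketch assumes an average year of 365.25 days. With exactly 365 days the result is a few seconds lower.

```python
def allowed_downtime_minutes(availability_percent, days_per_year=365.25):
    """Return the downtime per year, in minutes, that an availability figure allows."""
    minutes_per_year = days_per_year * 24 * 60
    return minutes_per_year * (100 - availability_percent) / 100


def as_min_sec(minutes):
    """Convert fractional minutes into a (minutes, seconds) tuple."""
    return divmod(round(minutes * 60), 60)


print(as_min_sec(allowed_downtime_minutes(99.99)))  # (52, 36)
```

The same function makes it easy to see what each additional "nine" in an SLA actually buys.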

Major Office 365 and Azure disruptions

But Office 365 and Azure also suffer disruptions, some better known than others. For a long time I have run a PRTG sensor that retrieves the status panel of my Office 365 tenant and visualizes it in PRTG. It shows that there is almost never a point in time when all indicators are "green". There is always something somewhere. Fortunately, these are mostly small things, e.g. somewhat limited performance, delays in provisioning, or a partial function (e.g. free / busy queries in Outlook) that affects only a few users.
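The evaluation step of such a status check can be sketched as follows. This is not the PRTG sensor described above. It assumes the tenant status has already been retrieved as a list of entries with `service` and `status` fields, roughly the shape I would expect from the Microsoft Graph service health overview (`/admin/serviceAnnouncement/healthOverviews`), with `serviceOperational` meaning "green". The sample data is invented.

```python
OPERATIONAL = "serviceOperational"  # assumed value for a "green" service


def summarize_health(overviews):
    """Summarize a list of {'service': ..., 'status': ...} entries."""
    affected = {
        entry["service"]: entry["status"]
        for entry in overviews
        if entry["status"] != OPERATIONAL
    }
    return {
        "total": len(overviews),
        "all_green": not affected,
        "affected": affected,
    }


# Invented sample data for illustration only
sample = [
    {"service": "Exchange Online", "status": "serviceOperational"},
    {"service": "Microsoft Teams", "status": "serviceDegradation"},
    {"service": "SharePoint Online", "status": "serviceOperational"},
]
print(summarize_health(sample))
```

A monitoring system would typically plot the number of affected services over time, which makes the "almost never all green" observation visible at a glance.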

But every now and then there is a major disruption that also makes its way into the media.
