Why are several instances down simultaneously? Is it just me?

OhNoMyInstanceIsDown@lemm.ee · edited 1 year ago

Why are several instances down simultaneously? Is it just me?

zalack@kbin.social · edited 1 year ago

It’s not that strange. A timeout occurs on several servers overnight, and maybe a bunch of Lemmy instances are all run in the same timezone, so all their admins wake up around the same time and fix it.

Well, it’s a timeout, so by fixing it at the same time the admins have “synchronized” when timeouts across their servers are likely to occur again, since the failure is tangentially related to time. They’re likely to all fail again around the same moment.

It’s kind of similar to the thundering herd problem, where a bunch of clients that hit errors synchronize their retries into one giant herd and strain the server. It’s why good clients add exponential backoff AND jitter (a little bit of randomness in when the retry happens, not just a fixed 2^x seconds). That way, if you have a million clients that all got an error at the same moment when your server failed, it’s much less likely that all 1,000,000 of them will retry at the exact same time.
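For anyone curious, here is a rough Python sketch of backoff with jitter. It uses the "full jitter" variant, where the delay is a random value between zero and the exponential ceiling; that is only one common way to do it. The names `backoff_delay` and `retry` and the defaults (`base`, `cap`, `max_attempts`) are made up for illustration and are not from Lemmy or any particular client library.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a randomized delay in seconds for a zero-based retry attempt.

    The ceiling doubles with each attempt (base * 2**attempt) up to `cap`,
    and the actual delay is drawn uniformly from [0, ceiling] so that many
    clients failing at the same moment spread their retries out.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def retry(operation, max_attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    Hypothetical helper: any exception counts as a retryable failure here,
    which a real client would want to narrow down.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(backoff_delay(attempt, base, cap))
```

Without the random part, every client that failed at the same instant would wait exactly 1, 2, 4, 8... seconds and hit the server together again on each round. With it, the retries get smeared across the whole window.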

Edit: looked at the ticket and it’s not exactly the kind of timeout I was thinking of.

This timeout might be caused by something that’s loosely a function of time or resource usage. If it’s resource usage, then because the servers are federated, those spikes might happen across servers at once as everything pushes events to subscribers. So the failures get synchronized.

Or it could just be a coincidence. We as humans like to look for patterns in random events.

hitagi@ani.social · 1 year ago

Interesting. Never thought of it that way.