[Maintenance] Feb 7 - Mastodon Data Migration

Crashdoom@pawb.social · edit-2 9 months ago

huxley@pawb.social · 9 months ago

Looks like furry.engineer is down?

Stefen Auris@pawb.social · 9 months ago

I’m seeing the same here, something about an Argo tunnel error. @[email protected]

Crashdoom@pawb.social · 9 months ago

Aware and investigating!

Stefen Auris@pawb.social · 9 months ago

and that’s why you’re the best <3

liquidparasyte@pawb.social · edit-2 9 months ago

pawb.fun as well. Something got fucky wucky during the migration, it seems.

natebluehooves@pawb.social · 9 months ago

Correct! To give a bit of background while I wait for backups…

Last night we had what appears to be an out-of-memory error. Our Cloudflare tunnels broke around the same time that the internet went out (probably related), and we also didn’t have our nodes configured to keep some RAM reserved to allow Kubernetes to keep running. Additionally, we still had only 1 replica of the data for furry.engineer and pawb.fun, which we were still building/downloading from other instances (mostly cached images).

So it was the perfect storm. Node 1 runs out of memory and basically crashes, node 2 then tries to pick up the services that are suddenly offline, immediately causing it to run out of memory and crash. There’s only one copy of the data, so there’s nothing offline to check against for corruption. All the storage with 2 replicas was unaffected.
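
To make the failover chain concrete, here is a toy sketch of it in Python. It is not how Kubernetes actually schedules anything: the function name, the node names, and every number are hypothetical and chosen only to illustrate the idea. A node that exceeds its memory crashes, its workload is handed to the survivors, and an optional admission limit (standing in for "keep some RAM reserved") decides whether the survivors take everything or leave the overflow pending.

```python
def simulate_failover(load_gib, capacity_gib, admission_limit_gib=None):
    """Toy model of a cascading out-of-memory failure.

    load_gib: dict of node name -> memory in use (GiB), hypothetical values.
    capacity_gib: memory at which a node crashes.
    admission_limit_gib: if set, a surviving node only accepts work up to
        this amount; anything that does not fit stays pending.

    Returns (crashed_nodes_in_order, pending_gib).
    """
    load = dict(load_gib)
    crashed = []
    pending = 0.0
    while True:
        over = [name for name, used in load.items() if used > capacity_gib]
        if not over:
            return crashed, pending
        victim = over[0]
        orphaned = load.pop(victim)
        crashed.append(victim)
        survivors = list(load)
        if not survivors:
            # Nobody left to run the workload.
            return crashed, pending + orphaned
        share = orphaned / len(survivors)
        for name in survivors:
            take = share
            if admission_limit_gib is not None:
                # Accept only what fits under the limit (never a negative amount).
                take = max(0.0, min(share, admission_limit_gib - load[name]))
            load[name] += take
            orphaned -= take
        pending += orphaned


# No headroom: node2 inherits everything from node1 and crashes too.
no_limit = simulate_failover({"node1": 66, "node2": 40}, capacity_gib=64)

# With a limit below capacity, node2 takes what fits and the rest waits.
with_limit = simulate_failover(
    {"node1": 66, "node2": 40}, capacity_gib=64, admission_limit_gib=56
)
```

In this made-up scenario the first call loses both nodes, while the second loses only node1 and leaves part of the workload pending instead of taking node2 down with it.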

I’ve done an announcement post on the Telegram channel to try and keep people apprised, but this restore is probably going to take another couple of hours because I’m trying not to repeat my mistakes by setting things to 1 replica or skipping backups for expediency. My impatience pretty directly caused this issue.

Vincent Hayes@pawb.social · 9 months ago

SysAdmin lesson learned, always make the backups :3

natebluehooves@pawb.social · 9 months ago

Lessons do stick around when you have to learn the hard way!

Exec@pawb.social · 9 months ago

node 1 runs out of memory and basically crashes, node 2 then tries to pick up the services that are suddenly offline, immediately causing it to run out of memory and crash

Oof, that’s pretty much a cascading failure

natebluehooves@pawb.social · 9 months ago

Actually yes. Recovery was slow and painful, but I have policies in place to handle these failures now. I’m sure we will find another failure mode as we go forward!