tl;dr: furry.engineer and pawb.fun will be down for several hours this evening (5 PM Mountain Time onward) as we migrate data from the cloud to local storage. We’ll post updates via our announcements channel at https://t.me/pawbsocial.
In order to reduce costs and expand our storage pool, we’ll be migrating data from our existing Cloudflare R2 buckets to local replicated network storage, and from Proxmox-based LXC containers to Kubernetes pods.
Currently, according to Mastodon, we’re using about 1 TB of media storage, but according to Cloudflare, we’re using nearly 6 TB. This appears to be due to Cloudflare R2’s implementation of the underlying S3 protocol that Mastodon uses for cloud-based media storage, which is preventing Mastodon from properly cleaning up files that are no longer used.
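A gap like this can be checked by comparing what the bucket actually holds with what the application still references. The following is only a minimal sketch under stated assumptions: `find_orphans` is a hypothetical helper, the object keys and sizes are made up, and it assumes the bucket listing and the list of keys Mastodon still knows about have already been exported by some other means.

```python
from typing import Dict, Iterable, Set, Tuple


def find_orphans(
    bucket_objects: Dict[str, int],
    referenced_keys: Iterable[str],
) -> Tuple[Set[str], int]:
    """Return (orphaned keys, total orphaned bytes).

    bucket_objects maps object key -> size in bytes, as reported by the bucket.
    referenced_keys are the keys the application still refers to.
    """
    referenced = set(referenced_keys)
    # Anything in the bucket that the application no longer references is an orphan.
    orphans = {key for key in bucket_objects if key not in referenced}
    orphan_bytes = sum(bucket_objects[key] for key in orphans)
    return orphans, orphan_bytes


# Made-up example data (hypothetical keys and sizes, for illustration only).
bucket = {
    "media_attachments/files/1/original/a.png": 400,
    "cache/media_attachments/files/2/original/b.jpg": 300,
    "cache/media_attachments/files/3/original/c.jpg": 900,
}
known = [
    "media_attachments/files/1/original/a.png",
    "cache/media_attachments/files/2/original/b.jpg",
]

orphans, wasted = find_orphans(bucket, known)
print(f"{len(orphans)} orphaned object(s), {wasted} bytes not tracked by the application")
```

The difference between the bucket total and the referenced total is the storage being paid for but never cleaned up.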
As part of the move, we’ll be creating / using new Docker-based images for Glitch-SOC (the fork of Mastodon we use) and hooking that up to a dedicated set of database nodes and replicated storage through Longhorn. This should allow us to seamlessly move the instances from one Kubernetes node to another for performing routine hardware and system maintenance without taking the instances offline.
We’re planning to roll out the changes in several stages:
- Taking furry.engineer and pawb.fun down for maintenance to prevent additional media being created.
- Initiating a transfer from R2 to the new local replicated network storage for locally generated user content first, then remote media. (This will happen in parallel to the other stages, so some media may be unavailable until the transfer fully completes).
- Exporting and re-importing the databases from their LXC containers to the new dedicated database servers.
- Creating and deploying the new Kubernetes pods, and bringing one of the two instances back online, pointing at the new database and storage.
- Monitoring for any media-related issues, and bringing the second instance back online.
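The "local content first, then remote media" ordering from the transfer stage can be sketched as a simple sort over object keys. This is an illustration rather than the actual migration tooling: `order_for_transfer` is a hypothetical helper, the keys are made up, and it assumes cached remote media lives under a `cache/` key prefix while locally generated content does not.

```python
from typing import List

# Assumption: cached remote media is stored under this key prefix.
REMOTE_CACHE_PREFIX = "cache/"


def order_for_transfer(keys: List[str]) -> List[str]:
    """Put locally generated content ahead of cached remote media.

    sorted() is stable and False sorts before True, so local keys come
    first and the original order is preserved within each group.
    """
    return sorted(keys, key=lambda key: key.startswith(REMOTE_CACHE_PREFIX))


# Made-up keys for illustration.
queue = order_for_transfer([
    "cache/media_attachments/1.jpg",
    "media_attachments/2.png",
    "cache/accounts/avatars/3.png",
    "accounts/avatars/4.png",
])

for key in queue:
    print(key)
```

Transferring in this order means user uploads, which cannot be re-fetched from anywhere else, become available before cached copies of remote media.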
We’ll be beginning the maintenance window at 5 PM Mountain Time (4 PM Pacific Time) and have no ETA at this time. We’ll provide updates through our existing Telegram announcements channel at https://t.me/pawbsocial.
During this maintenance window, furry.engineer and pawb.fun will be unavailable until the maintenance has concluded. Our Lemmy instance at pawb.social will remain online, though you may experience longer than normal load times due to high network traffic.
Finally and most importantly, I want to thank those who have been donating through our Ko-Fi page as this has allowed us to build up a small war chest to make this transfer possible through both new hardware and the inevitable data export fees we’ll face bringing content down from Cloudflare R2.
Going forward, we’re looking into providing additional fediverse services (such as Pixelfed) and extending our data retention length to allow us to maintain more content for longer, but none of this would be possible if it weren’t for your generous donations.
Looks like furry.engineer is down?
I’m seeing the same here, something about an Argo tunnel error. @[email protected]
Aware and investigating!
and that’s why you’re the best <3
pawb.fun as well. Something got fucky wucky during the migration, it seems.
Correct! to give a bit of background while I wait for backups…
last night we had what appears to be an out of memory error. Our cloudflare tunnels broke around the same time that the internet went out (probably related), and we also didn’t have our nodes configured to keep some ram reserved to allow kubernetes to keep running. Additionally, we still only had 1 replica of the data for furry.engineer and pawb.fun that we were still building/downloading from other instances (mostly cached images).
so it was the perfect storm. node 1 runs out of memory and basically crashes, node 2 then tries to pick up the services that are suddenly offline, immediately causing it to run out of memory and crash. There’s only one copy of the data, so nothing offline to check for corruption against. all the storage with 2 replicas was unaffected.
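For anyone curious how this kind of failure spreads, here is a rough sketch of the mechanism. It is a toy model, not a reconstruction of the real cluster: `simulate_failover` is a hypothetical helper, the memory figures are made up, and it assumes all workloads from a crashed node land on whichever surviving node has the most free memory.

```python
from typing import Dict, List


def simulate_failover(
    capacity_gb: Dict[str, float],
    load_gb: Dict[str, float],
    first_failure: str,
) -> List[str]:
    """Return the nodes that end up down, in the order they crashed.

    When a node crashes, its whole load moves to the surviving node with
    the most free memory. A node crashes if its load exceeds its capacity.
    """
    load = dict(load_gb)
    down = [first_failure]
    orphaned = load.pop(first_failure)
    while orphaned > 0 and load:
        # Pick the surviving node with the most free memory.
        target = max(load, key=lambda node: capacity_gb[node] - load[node])
        load[target] += orphaned
        if load[target] <= capacity_gb[target]:
            orphaned = 0  # the survivor absorbed the extra load
        else:
            down.append(target)  # the survivor is overloaded and crashes too
            orphaned = load.pop(target)
    return down


# Made-up numbers: two 32 GB nodes.
capacity = {"node1": 32.0, "node2": 32.0}

# Both nodes nearly full: losing one takes the other down with it.
cascade = simulate_failover(capacity, {"node1": 24.0, "node2": 24.0}, "node1")

# Each node under half full: the survivor can absorb the failed node's load.
contained = simulate_failover(capacity, {"node1": 14.0, "node2": 14.0}, "node1")

print("nearly full:", cascade)
print("with headroom:", contained)
```

The takeaway is that failover only helps if the surviving nodes have enough spare memory to carry the extra load.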
I’ve done an announcement post on the telegram channel to try and keep people apprised, but this restore is going to take another couple hours probably because I’m trying not to repeat my mistakes by setting things to 1 replica or skipping backups for expediency. My impatience pretty directly caused this issue.
SysAdmin lesson learned, always make the backups :3
Lessons do stick around when you have to learn the hard way!
Oof, that’s pretty much a cascading failure
Actually yes. Recovery was slow and painful, but I have policies in place to handle these failures now. I’m sure we will find another failure mode as we go forward!