Camp Outage [Aug 17] - sh.itjust.works

We had an outage from 5:35PM to 4:29AM (all times Pacific).

It calmed down by itself, but service was still quite slow afterwards. A proper fix was done around 8AM. Around 9AM a configuration update was pushed to keep the scaling issue from happening again.

Timeline:

5:35pm: outage starts, service was completely unavailable
4:29am: traffic went down enough to allow some automatic recovery.
7:50am: Start investigation
8:04am: Just did a quick look and one of the nodes in the cluster was in a bad state. Orchestrator didn’t catch it and kept it in rotation. Just took it down and got it back in shape. So hopefully y’all see an improvement now.
8:52am: Just found a big issue with the Docker setup for /kbin. The guide for it is not really tested, so things aren’t fully covered. But I found another guide for bare-metal (when running directly on a server) that has an important step for letting /kbin handle a lot more processes. Implementing this right now. So that should make a huge difference.
9:21am: Alrighty, pushed some config changes that should improve Artemis.camp performance. Will monitor latency over the next day to measure improvements.
11:04am: Things seem to be more stable. Tho, will take another look at the end of the day 🙂
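
For the 8:04am item, here is a rough sketch of the kind of check that pulls a bad node out of rotation after repeated failed health probes. It is only an illustration, not the actual orchestrator setup: `Node`, `record_probe`, `nodes_in_rotation`, and the threshold of 3 failures are all made-up names and values.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical node record; a real orchestrator tracks far more state.
    name: str
    consecutive_failures: int = 0
    in_rotation: bool = True


def record_probe(node: Node, healthy: bool, max_failures: int = 3) -> None:
    """Update a node after one health probe.

    A healthy probe resets the failure count and puts the node back in
    rotation. An unhealthy probe increments the count, and the node is
    ejected once it reaches max_failures in a row (assumed threshold).
    """
    if healthy:
        node.consecutive_failures = 0
        node.in_rotation = True
    else:
        node.consecutive_failures += 1
        if node.consecutive_failures >= max_failures:
            node.in_rotation = False


def nodes_in_rotation(nodes: list[Node]) -> list[str]:
    """Return the names of nodes that should still receive traffic."""
    return [n.name for n in nodes if n.in_rotation]
```

With something like this, a node that fails three probes in a row stops getting traffic until a probe passes again, instead of staying in rotation until someone looks at it by hand.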