We had an outage from 5:35 PM to 4:29 AM (all times Pacific).

The service recovered on its own once traffic dropped, but it stayed quite slow. A proper fix was applied shortly after 8 AM, and shortly after 9 AM a configuration update was pushed to keep the scaling issue from recurring.

Timeline:

  • 5:35pm: Outage starts; service is completely unavailable.
  • 4:29am: traffic went down enough to allow some automatic recovery.
  • 7:50am: Investigation starts.
  • 8:04am: Just took a quick look and found one of the nodes in the cluster in a bad state. The orchestrator didn’t catch it and kept it in rotation. Took it down and got it back in shape, so hopefully y’all see an improvement now.
  • 8:52am: Just found a big issue with the Docker setup for /kbin. The guide for it isn’t really tested, so things aren’t fully covered. But I found another guide for bare-metal installs (running directly on a server) that has an important step for letting /kbin handle a lot more processes. Implementing this right now, so that should make a huge difference.
  • 9:21am: Alrighty, pushed some config changes that should improve Artemis.camp performance. Will monitor latency over the next day to measure the improvement.
  • 11:04am: Things seem to be more stable. Will check again at the end of the day, though 🙂
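
For anyone curious what "monitor latency to measure the improvement" could look like in practice, here is a minimal sketch. It is not the actual monitoring used for this instance: the function names and the sample numbers are made up for illustration. It assumes you already have two lists of response times in milliseconds (one collected before the config change, one after) and compares their median and 95th percentile using the nearest-rank method.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (milliseconds)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # 1-based nearest rank: the smallest rank covering pct percent of samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(samples):
    """Return the median (p50) and tail (p95) latency for one set of samples."""
    return {"p50": percentile(samples, 50), "p95": percentile(samples, 95)}


def compare(before, after):
    """Per-percentile change in latency; negative values mean it got faster."""
    b, a = summarize(before), summarize(after)
    return {name: a[name] - b[name] for name in b}


if __name__ == "__main__":
    # Hypothetical samples, not real measurements from the outage.
    before = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
    after = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    print(compare(before, after))
```

Looking at p95 alongside the median matters here because an overloaded service often keeps a reasonable median while the slowest requests get much worse.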