We had an outage from 5:35pm to 4:29am (all times Pacific).
The service recovered on its own once traffic dropped, but it stayed quite slow. A proper fix went in shortly after 8am, and at around 9:20am a configuration update was pushed to prevent the scaling issue from recurring.
Timeline:
- 5:35pm: outage starts, service was completely unavailable
- 4:29am: traffic went down enough to allow some automatic recovery.
- 7:50am: Start investigation
- 8:04am: Just did a quick look and one of the nodes in the cluster was in a bad state. Orchestrator didn’t catch it and kept it in rotation. Just took it down and got it back in shape. So hopefully y’all see an improvement now.
- 8:52am: Just found a big issue with the Docker setup for /kbin. The guide for it is not really tested, so things aren’t fully covered. But I found another guide for bare-metal installs (running directly on a server) that has an important step for letting /kbin handle a lot more processes. Implementing this right now, so that should make a huge difference.
- 9:21am: Alrighty, pushed some config changes that should improve Artemis.camp performance. Will monitor latency over the next day to measure improvements.
- 11:04am: Things seem to be more stable. Will take another look at the end of the day, though 🙂
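
For anyone curious about the 8:04am entry, where a bad node stayed in rotation: below is a rough sketch of the kind of health check that is meant to catch this. It is not the orchestrator's actual logic or config. The `NodeHealth` class, the `FAILURE_THRESHOLD` value, and the node names are all made up for illustration. The idea is simply that a node is pulled after a few consecutive failed probes and put back once a probe succeeds.

```python
from dataclasses import dataclass

# Hypothetical threshold: how many consecutive failed probes
# are tolerated before a node is pulled from rotation.
FAILURE_THRESHOLD = 3


@dataclass
class NodeHealth:
    """Tracks probe results for one node (illustrative only)."""
    name: str
    consecutive_failures: int = 0
    in_rotation: bool = True

    def record_probe(self, ok: bool) -> None:
        """Update the node's state from a single probe result."""
        if ok:
            # A successful probe resets the counter and restores the node.
            self.consecutive_failures = 0
            self.in_rotation = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= FAILURE_THRESHOLD:
                self.in_rotation = False


def nodes_in_rotation(nodes):
    """Return the names of nodes that should still receive traffic."""
    return [n.name for n in nodes if n.in_rotation]


if __name__ == "__main__":
    a, b = NodeHealth("node-a"), NodeHealth("node-b")
    a.record_probe(True)
    for _ in range(FAILURE_THRESHOLD):
        b.record_probe(False)
    print(nodes_in_rotation([a, b]))
```

In a real setup the probe would be an HTTP request or similar against each node, and the threshold trades off reacting quickly against flapping on a single slow response.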
Thank you for this post, and your transparency. Keep up the great work!