Serverless Functions Post-Mortem

sizeoftheuniverse@programming.dev · 1 year ago

Serverless Functions Post-Mortem

snowe@programming.dev · 1 year ago

continued

Functions often needed to be rewritten as you went, moving everything you could to the initialization phase and keeping all the connection logic out of the handler code. The initial momentem of serverless was crashing into the rewrites as teams learned painful lesson after painful lesson.

we have never encountered this. this is probably exacerbated by the fact that the author thinks having 100+ lambdas for a medium sized app is normal. you focus even more on the startup time, rather than solving business problems.

Price. Instead of being fire and forget, serverless functions proved to be very expensive at scale. Developers don’t think of routes of an API in terms of how many seconds they need to run and how much memory they use. It was a change in thinking and certainly compared to a flat per-month EC2 pricing, the spikes in traffic and usage was an unpleasant surprise for a lot of teams.

Combined with the cost of RDS and API Gateway and you are looking at a lot of cash going out every month.

lambdas are saving us tens of thousands of dollars a month because we don’t need to worry about massive monoliths and the required ec2 autoscaling instances needed, nor the insane costs of RDS.

The other cost was the requirement that you have a full suite of cloud services identical to production for testing. How do you test your application end to end with serverless functions? You need to stand up the exact same thing as production.

why do you need this? That’s not how most testing works. you mock what you need. Unless you’re using a monolith then this applies to any architecture.

Traditional applications you could test on your laptop and run tests against it in the CI/CD pipeline before deployment.

if by traditional you mean monoliths. Any sort of microservices, or even a slightly macroservices architecture.

Serverless stacks you need to rely a lot more on Blue/Green deployments and monitoring failure rates.

but why? this isn’t explained. We haven’t seen this. Maybe it’s how we use lambdas, but we use versioned lambdas and we deploy and immediately forget about it. There’s nothing to maintain about old versions, rollbacks are automatic.

Slow deployments. Pushing out a ton of new Lambdas is a time-consuming process. I’ve waited 30+ minutes for a medium-sized application. God knows how long people running massive stacks were waiting.

why are you ‘pushing out a ton of new lambdas’? The whole point is for things to be self contained. If you are needing to touch multiple things often then your lambdas should be a single thing, not multiple. This comes back to the ‘100+’ lambdas thing. That’s just bad design. Don’t blame lambdas for this.

We are able to build GraalVM Kotlin lambdas in less time than that, along with the deploy. The slowest part is literally the CDK synthesis. If we were using CF yaml then it would be half the time.

Security. Not running the server is great, but you still need to run all the dependencies. It’s possible for teams to spawn tons of functions with different versions of the same dependencies, or even choosing to use different libraries. This makes auditing your dependency security very hard, even with automation checking your repos. It is more difficult to guarantee that every compromised version of X dependency is removed from production than it would be for a smaller number of traditional servers.

This is going to completely depend on your team, your languages, and the frameworks you’re using. For us, it’s dead simple to keep up to date. Snyk helps us, it’s one click deploy for each lambda, and we can send to prod immediately due to having a very mature CI/CD pipeline. We are getting even better at this as we’ll be switching to gradle’s version catalogs which means that all of the applications can use the exact same version catalog and it will then require a single change whenever we need to update stuff, instead of hundreds.

The complexity of running a server in a modern cloud platform was massively overstated. Especially with containers, running a linux box of some variety and pushing containers to it isn’t that hard. All the cloud platform offer load balancers, letting you offload SSL termination, so really any Linux box with Podman or Docker can run listening on that port until the box has some sort of error.

so now you have to maintain your linux security, your autoscaling on linux, your deployment pipelines for linux, your nginx configs for linux, or if you’re using k8s you have to learn two stacks, k8s and AWS. If you’re adding loadbalancers then you should be deploying those with CDK anyway so now you’re both using cdk and k8s along with maintaining security on your ec2 instances.

Setting up Jenkins to be able to monitor Docker Hub for an image change and trigger a deployment is not that hard. If the servers are just doing that, setting up a new box doesn’t require the deep infrastructure skills that serverless function advocates were talking about. The “skill gap” just didn’t exist in the way that people were talking about.

O___o we literally have an entire infra team that is unable to manage Jenkins to the level that devs need due to how difficult it is to maintain a jenkins build pipeline. Not only that, but now you’re dependent on maintaining security for jenkins which is, and has always been, a nightmare. Jenkins pipelines aren’t testable locally (github actions you can use Act along with something like mock-github and act.js to even test your pipelines as part of ci/cd!). We’re currently switching the entire org to github actions due to how terrible jenkins is. And then you’re writing more pipelines to do monitoring! The author claims that serverless has more monitoring, but then goes on to say that you can ‘simply set up’ all this other stuff which is wayyyy harder to maintain in the long run.

People didn’t think critically about price. Serverless functions look cheap, but we never think about how many seconds or minute a server is busy. That isn’t how we’ve been conditioned to think about applications and it showed. Often the first bill was a shocker, meaning the savings from maintenance had to be massive and they just weren’t.

100+ lambdas once again.

Really hard to debug problems. Relying on logs and X-Ray to figure out what went wrong is just much harder than pulling the entire stack down to your laptop and triggering the same requests.

I don’t know what the author is doing, but it’s so dead simple to run lambdas locally and test locally that I really really don’t understand this. There’s only a single entrypoint. You know what the request was going in and out. It takes me way less time to debug something in a lambda than it ever did in a monolith (I’ve worked in a lot of monoliths and we still maintain a monolith on my team). If you have 100+ lambdas then maybe you should start blaming your architecture, rather than lambdas. It would be the exact same if you had 100+ microservices…a nightmare.

I am very sorry, but I honestly read through the whole article and agreed with a lot of it, and then when I went to write this up just got angrier and angrier because it’s very very clear that the author has a terrible architecture and is blaming it on lambdas. Lambdas don’t work for everything. In general don’t use them for web servers! But for a great solution to small self contained applications, or an architecture that might need one side to scale differently than others, or for step functions where you’re a state-flow diagram, the list goes on and on… then it’s a fantastic solution.

nibblebit@programming.dev · 1 year ago

Man, I have to agree. Your write up reflect my experience with Azure Functions in a mid-large sized application way more than the post. Fantastic