Achieving 1 million rps in Go
Following my previous article on data pipelines, I’ve written “Plutos”, which acts as a `bare metal` template for event streaming. In this article we’ll see how we can improve relatively naive code and get better throughput.
TL;DR: You can follow the debugging process and see all the final changes here. The main bottlenecks were: the UUID pkg, JSON marshaling, blocking on a mutex for too long, gzip compression, and having no buffers before compression.
The solutions were: using a non-crypto-secure random uint32, writing a custom serializer, using bytebufferpool, moving to LZ4 compression, and putting a 4MB bufio buffer before the compressor.
The goal: sending a 100B payload over a GET request and storing it in S3, while keeping costs as low as possible.
We’ll be running two c5n.9xlarge instances for this test, one for the server and one for wrk.
Before we begin, let’s recheck our assumption that an “out of the box” tool doesn’t offer better performance; for this we’ll use NGINX.
Installing NGINX
worker_processes set to auto, gzip disabled, access & error logs disabled:
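Roughly like this (the port, worker_connections value, and return body are assumptions, not taken from the original setup):

```nginx
worker_processes auto;          # one worker per core
error_log /dev/null crit;       # effectively silence the error log

events {
    worker_connections 65535;   # assumption: raise the per-worker connection cap
}

http {
    access_log off;             # no access logging
    gzip off;                   # no compression

    server {
        listen 8080 reuseport;  # assumption: port and reuseport
        location /health {
            return 200;
        }
    }
}
```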
NGINX is a bit slow, so we’ll compare it with fasthttp, using a simple health endpoint to match the previous test:
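Something along these lines (the port is an assumption):

```go
package main

import (
	"log"

	"github.com/valyala/fasthttp"
)

func main() {
	handler := func(ctx *fasthttp.RequestCtx) {
		// Minimal health endpoint: no parsing, no body, just a 200.
		if string(ctx.Path()) == "/health" {
			ctx.SetStatusCode(fasthttp.StatusOK)
			return
		}
		ctx.SetStatusCode(fasthttp.StatusNotFound)
	}
	log.Fatal(fasthttp.ListenAndServe(":8080", handler))
}
```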
Much better. This gives us some wiggle room to add request parsing, maybe even JSON formatting of the output, and basic enrichment like a nanosecond timestamp and a request id.
Taking a pprof snapshot now will make debugging easier later.
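One way to take such snapshots (not necessarily how Plutos wires it) is to expose the standard net/http/pprof handlers on a side server and pull profiles with `go tool pprof`:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

// startProfiling runs a side server used only for profiling; the port is an assumption.
func startProfiling() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}

// While wrk is running:
//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30   # CPU
//   go tool pprof http://localhost:6060/debug/pprof/heap                 # memory
```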
Adding Plutos
Even with no compression and the output discarded, we see a significant decrease in throughput. (full example)
Outputting a CPU profile:
This shows us a slightly different picture than the “health” endpoint: we are doing an extra syscall, to unix.GetRandom.
The solution: instead of a UUID pkg we’ll use a concatenation of
(MD5 of the payload + nanosecond timestamp + fastrand uint32)
This, in combination with the UUID of the filename we are writing to, gives us a pretty good unique id per event.
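A sketch of the idea (the formatting and the `event` package name are assumptions):

```go
package event

import (
	"crypto/md5"
	"encoding/hex"
	"strconv"
	"time"

	"github.com/valyala/fastrand"
)

// RequestID builds a "good enough" unique id without touching crypto/rand:
// MD5 of the payload + nanosecond timestamp + a fast, non-crypto random uint32.
func RequestID(payload []byte) string {
	sum := md5.Sum(payload)
	return hex.EncodeToString(sum[:]) +
		strconv.FormatInt(time.Now().UnixNano(), 10) +
		strconv.FormatUint(uint64(fastrand.Uint32()), 10)
}
```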
OK! Now we’re talking :) We are at 66% of the health endpoint’s throughput, which is much better, but there is still some room for improvement.
Next up is the JSON marshaling; right now our code looks like this:
Inside the writer there is a sync.Mutex, so while a goroutine is marshaling the struct it keeps other goroutines from writing to it, effectively making the process single threaded.
A possible solution is to marshal the response into a pooled buffer first; this removes the contention and avoids allocating more memory at the same time.
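A sketch of that change, assuming jsoniter for the encoding and placeholder field names (the actual Plutos types may differ):

```go
package event

import (
	"io"
	"sync"

	jsoniter "github.com/json-iterator/go"
	"github.com/valyala/bytebufferpool"
)

// Event is a placeholder for the enriched payload; the fields are assumptions.
type Event struct {
	RequestID string `json:"request_id"`
	Enriched  int64  `json:"enriched_at"`
	Data      string `json:"data"`
}

// WriteEvent marshals into a pooled buffer first, so the writer's mutex is
// held only for the single Write call instead of the whole marshal.
func WriteEvent(w io.Writer, mu *sync.Mutex, e *Event) error {
	buf := bytebufferpool.Get()
	defer bytebufferpool.Put(buf)

	// Encode into the pooled buffer without holding the writer's mutex.
	if err := jsoniter.ConfigFastest.NewEncoder(buf).Encode(e); err != nil {
		return err
	}

	// Only the actual write is serialized.
	mu.Lock()
	_, err := w.Write(buf.B)
	mu.Unlock()
	return err
}
```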
It’s better, but jsoniter is still taking about 2% of CPU time, so let’s implement a custom serializer.
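A minimal sketch of such a serializer, continuing the `Event` placeholder above; the field layout and the no-escaping shortcut are assumptions:

```go
import (
	"strconv"

	"github.com/valyala/bytebufferpool"
)

// AppendJSON hand-rolls the JSON encoding of an Event straight into a pooled
// buffer, skipping reflection entirely. It assumes Data never needs JSON
// escaping, which holds for our fixed 100B test payload.
func AppendJSON(buf *bytebufferpool.ByteBuffer, e *Event) {
	buf.B = append(buf.B, `{"request_id":"`...)
	buf.B = append(buf.B, e.RequestID...)
	buf.B = append(buf.B, `","enriched_at":`...)
	buf.B = strconv.AppendInt(buf.B, e.Enriched, 10)
	buf.B = append(buf.B, `,"data":"`...)
	buf.B = append(buf.B, e.Data...)
	buf.B = append(buf.B, "\"}\n"...)
}
```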
A ~5% uplift in throughput. We are now at 87% of the health endpoint’s throughput :)
For the purposes of this article we’ll keep the change, but for production use, a custom serializer is much harder to maintain, and I personally would pay 5% more on servers to keep the code readable.
It looks as though the only things remaining in both the CPU & memory profiles are the fasthttp overhead, MD5, and the timestamp write.
Removing the timestamp and MD5 doesn’t affect the throughput enough (1%–2%) to be worth optimizing further. Memory profiling shows that the only thing left is the fasthttp overhead, specifically bufio:
This behavior didn’t show up in our health endpoint, but that stands to reason because we didn’t send a payload to it :) Sending the same payload to “/health” produces a throughput of 1.24M req/sec.
This brings us within a margin of error between the `/health` and `/e` endpoints, with a slight benefit to `/health` when running the tests for longer.
Let’s send to S3!!!
:\ Well, that took a dive. The most likely culprit is the bandwidth limit between EC2 and S3; if we want to go forward we’ll have to compress the output. We’ll start with gzip at best compression.
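A sketch of that first S3 attempt, assuming an io.Pipe feeding aws-sdk-go’s s3manager (bucket, key, and the exact wiring in Plutos are assumptions):

```go
package sink

import (
	"compress/gzip"
	"io"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// StartGzipUpload returns a gzip writer; everything written to it is streamed
// through an io.Pipe into an s3manager upload, compressed at BestCompression.
func StartGzipUpload(bucket, key string) (*gzip.Writer, *io.PipeWriter, <-chan error) {
	pr, pw := io.Pipe()
	errc := make(chan error, 1)

	go func() {
		uploader := s3manager.NewUploader(session.Must(session.NewSession()))
		_, err := uploader.Upload(&s3manager.UploadInput{
			Bucket: aws.String(bucket),
			Key:    aws.String(key),
			Body:   pr,
		})
		errc <- err
	}()

	gz, _ := gzip.NewWriterLevel(pw, gzip.BestCompression) // level is valid, so the error is always nil
	return gz, pw, errc
}

// On shutdown: gz.Close(), then pw.Close(), then wait on errc.
```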
Better :) but still not good. Changing the compression level between 0, 1, 6 and 9 yields no improvement, so let’s switch to a faster compression algorithm.
Running LZ4 in front of the S3 driver yields:
And if we optimize this even further with a bufio buffer on the LZ4 writer, sized to match LZ4’s 4MB block size, we finally get:
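A sketch of that final writer stack, assuming github.com/pierrec/lz4/v4 with its default 4MB block size:

```go
package sink

import (
	"bufio"
	"io"

	"github.com/pierrec/lz4/v4"
)

// NewBufferedLZ4 stacks a 4MB bufio buffer in front of the LZ4 frame writer,
// so the compressor sees large chunks that line up with its 4MB block size
// instead of thousands of ~100B event writes.
func NewBufferedLZ4(s3w io.Writer) (buf *bufio.Writer, lzw *lz4.Writer) {
	lzw = lz4.NewWriter(s3w)              // compressor in front of the S3 pipe
	buf = bufio.NewWriterSize(lzw, 4<<20) // 4MB buffer before the compressor
	return buf, lzw
}

// On shutdown: buf.Flush(), then lzw.Close(), then close the S3 pipe writer.
```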
Cost
The test was performed on a c5n.9xlarge machine with no Load Balancer. Because we implemented graceful shutdown we can use spot instances and save around 60%, bringing the total to ~$567/mo (vs. $1,419.12/mo on-demand).
In conclusion
The steps we took to bring our performance from 400K req/sec to 1M req/sec:
- Using pprof we found that generating a UUID is effectively a single-threaded operation that costs a lot in performance (much more than JSON serialization or even compression). When using third-party packages, take note of their performance and implementation.
- bytebufferpool, and sync.Pool in general, are your friends, use them :) In this case they gave us a ~15% performance uplift.
- JSON serialization does have an impact, but a lot less than I thought. If you pay $100K/mo for this service, 5% is just $5K/mo; removing it will save that money… but at the cost of making the code more complex.
- Don’t write small payloads to an io.Writer that works best with bulk writes; try using bufio.NewWriterSize to push the performance a bit further almost for free — Free RAM is Wasted RAM.
- Choose your compression algorithm wisely. While LZ4 is faster, it’s significantly less effective, which is OK for raw logs that are going to be aggregated anyway. For other use cases, like long-term storage, GZIP/Snappy will save a lot of money down the line: with a data pipeline, the cost of collection is minor compared to the cost of storing that data for a long time. That said, you can always re-compress.
PR with full code can be found here