The road to the data pipeline, part 1
TL;DR: A ~200K req/sec events collector can be done in under 150 lines of code, a full project template with multiple drivers: here
Data, everybody are talking about it, everybody want to do it but the price of entrance is high and price of ownership is even higher. In this three part story I would like to explore the various possibilities to implement a Working, Cost efficient and Reliable data pipeline starting from the collection layer down right to analysis and transforming raw data into actionable intelligence.
Lets say you are Software Engineer & your amazing product is getting traction, and a need arises to be more “Data Driven (TM)”.
The chosen solution is sending GET/POST calls from Frontend on every user activity event, storing it and displaying it in your client facing back office dashboards.
To put it visually
General best practices that will guide us on our journey:
- Separate each layer of the design to enable “hot swapping” parts of the system as scale dictates.
- Always leave redundancies to avoid data loss.
- Future proofing must be though of, but never implemented.
Ingest:
First thing first our ingest layer must be able to write to raw storage regardless of what the raw storage is, must be durable, efficient and most importantly as stupid as possible, why? more logic => more bugs => data loss/corruption.
Option 1: nginx, Relatively fast but limited functionality
Off the shelf solution ≠ No work
An example implementation with nginx (replace default /etc/nginx/sites-available/default):
On an Intel Core i7 6700K @ 4.0Ghz 4 physical cores 8 logical cores, 32GB, 250GB Samsung EVO NVMe
wrk and nginx were allocated each with 4 logical cores
CPU utilization was ~100% and memory utilization was ~68MB
Option 2: Custom solution — Slower, and requires knowledge of a top performer language, but testable and offers best flexibility.
What happens when we want all of these nice features:
- Direct streaming to storage or batching
- On the fly compression
- Custom parses / Validators / Serializers (Protobuf?!)
- Enrichment (IP2Country, User agent parsing, Sessionazation?)
- Custom CORS/Auth/Security
- Multiple ingress options (HTTP/gRPC/UDP/QUIC)
- Custom metrics & monitoring
- Grace full Shutdowns
- ………….
In that case, you better buckle up for a nice ride :) but it might be a short one… the basic template for a logger that receives an HTTP/TCP/UDP connection, parses the request to a JSON/other format, compresses and streams the data to S3/SQS/Kafka in predefined buffers , autoscales and supports grace full shutdowns is not as hard as it first seems, but talk is easy, lets build it.
First we’ll break it down to features, lets start from the end, grace full shutdowns, or more accurately when the application receives a OS signal (interrupt or kill or other…) we’ll turn the service to an unhealthy state and wait for the health checker (ALB or Kubernetes or Route53) to drain all requests to us and then flush all the data.
Now we need to support receiving GET and POST data and print it to stdOut
Instead of printing to only stdOut lets stream it to s3 with GZIP & close every X seconds
So we now have a functioning data collector, so lets do a quick benchmark:
hmmm…. ~4x times faster than nginx implementation, in less than 150 lines of code, ugly code, but 150 lines of it non the less… cool.
So lets do some cleanup, move the implementation to use fasthttp for faster performance and lower memory usage and maybe add some request parsing to minimize work a bit down the line
wrapping it all up we get “Plutos” — minimal working template for high performance data collection
Try it yourself:
docker run --netowrk=host \
-e PORT=:8080 \
-e DRIVER=s3 \
-e S3_REGION=<REGION> \
-e S3_BUCKET=<BUCKET> \
-e S3_PREFIX=data \
-e MAX_BUFFER_TIME_SECONDS=60 \
-e GZIP_LVL=9 \
-e ENABLE_GZIP=true \
-e AWS_ACCESS_KEY_ID=<AWS_KEY> \
-e AWS_SECRET_ACCESS_KEY=<AWS SECRET>
maxkuder/plutos:0.1.2curl http://localhost:8080/e?test=me