Observability strategies to not overload engineering teams.
- Nicolas Takashi
- Observability, Infrastructure
- September 28, 2022
There is no doubt that implementing a certain level of Observability at your company without requiring engineering effort is a dream for everyone on that journey. Today I’m going to share strategies to help you implement Observability without adding cognitive load to your engineering teams.
Why is instrumentation a challenge?
Well, most teams have probably already experienced something similar to the image below.
Unfortunately, instrumenting code is not a priority in many cases, especially because this type of work usually does not bring instant value to the product.
But when the business needs a fast turnaround to resolve a problem, so that the SLA isn’t breached and customer satisfaction doesn’t drop, the entire company starts to wish it had more instrumentation.
Talk is cheap, now show me the value.
Using the essential observability signals is the best way to prove the importance of custom instrumentation. They will help you solve issues and improve the product, and you can then show your product manager how those signals helped the team improve the product or ensure that the SLA was not breached.
A couple of default Observability signals can be collected using auto instrumentation strategies, and they will give you enough information to achieve what I’ve described above.
Auto Instrumentation Strategies
There are three main strategies that make it possible to expose telemetry data without requiring any changes to the service code.
1 — Proxies
You probably already have a proxy in front of your service, handling a lot of things, such as:
- TLS Termination
- Circuit Breaker
- Rate Limiting
This is the perfect place to collect telemetry data about your HTTP or gRPC services, because all the traffic passes through this proxy, and you get Metrics, Logs, and Traces out of the box.
Most mainstream solutions, such as Nginx, HAProxy, and Envoy, already expose this telemetry data and also provide integrations with Observability platforms.
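To make the idea concrete, here is a minimal sketch of how request counts and average latency per route could be derived purely from proxy access logs, without touching the service. It assumes Nginx's `combined` log format with `$request_time` appended as the last field (that extra field is an assumption and has to be added to `log_format` explicitly). The function name `summarize` and the sample lines are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Nginx "combined" access-log format, plus $request_time at the end.
# Assumption: the trailing request_time field was added to log_format.
LINE_RE = re.compile(
    r'(?P<addr>\S+) - (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)" (?P<request_time>[\d.]+)'
)


def summarize(lines):
    """Aggregate request count and average latency per (method, path, status class)."""
    requests = Counter()
    total_seconds = defaultdict(float)
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not follow the expected format
        key = (match["method"], match["path"], match["status"][0] + "xx")
        requests[key] += 1
        total_seconds[key] += float(match["request_time"])
    return {
        key: {"count": count, "avg_seconds": total_seconds[key] / count}
        for key, count in requests.items()
    }


# Hypothetical sample lines, only for illustration.
SAMPLE_LINES = [
    '10.0.0.1 - - [28/Sep/2022:10:00:00 +0000] "GET /orders HTTP/1.1" 200 512 "-" "curl/7.85.0" 0.120',
    '10.0.0.2 - - [28/Sep/2022:10:00:01 +0000] "GET /orders HTTP/1.1" 200 498 "-" "curl/7.85.0" 0.080',
    '10.0.0.3 - - [28/Sep/2022:10:00:02 +0000] "POST /orders HTTP/1.1" 500 31 "-" "curl/7.85.0" 0.900',
    'not an access log line',
]

if __name__ == "__main__":
    for key, stats in summarize(SAMPLE_LINES).items():
        print(key, stats)
```

In a real setup you would more likely rely on the proxy's own metrics endpoint or an existing exporter instead of parsing logs by hand, but the sketch shows that the raw signal is already there.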
2 — OpenTelemetry
OpenTelemetry aims to be the de facto standard for collecting and shipping telemetry data to Observability back-ends such as Prometheus, Jaeger, and Elasticsearch.
OpenTelemetry has a feature called Automatic Instrumentation, which injects instrumentation code into your service to collect the telemetry data. Each language has its own implementation, and you can check all the details in the official documentation.
That strategy is not yet available for every language. Still, I do believe it’s important to keep it in mind, since it provides auto instrumentation for all kinds of services and not only HTTP or gRPC-based applications.
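To give a feeling for what "injecting instrumentation code" means, here is a stdlib-only sketch of the monkey-patching idea that the Python agent primarily relies on. It is not OpenTelemetry's API: `instrument`, `SPANS`, and the `orders_lib` module are all hypothetical stand-ins, and a real agent exports proper spans instead of appending dictionaries to a list.

```python
import functools
import time
import types

SPANS = []  # hypothetical in-memory stand-in for a telemetry exporter


def instrument(module, func_name):
    """Replace module.func_name with a wrapper that records a span-like dict."""
    original = getattr(module, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return original(*args, **kwargs)
        except Exception as exc:
            error = type(exc).__name__
            raise
        finally:
            # Runs for both successful and failing calls.
            SPANS.append({
                "name": f"{module.__name__}.{func_name}",
                "duration_s": time.perf_counter() - start,
                "error": error,
            })

    setattr(module, func_name, wrapper)


# Hypothetical library used by the service; its code is never edited.
orders_lib = types.ModuleType("orders_lib")


def fetch_order(order_id):
    if order_id < 0:
        raise ValueError("invalid order id")
    return {"id": order_id}


orders_lib.fetch_order = fetch_order

# The "agent" patches the library at start-up, before the service uses it.
instrument(orders_lib, "fetch_order")

if __name__ == "__main__":
    orders_lib.fetch_order(42)
    print(SPANS)
```

With the real Python agent, the equivalent step is usually done by starting the application through the `opentelemetry-instrument` command, so the service code itself stays unchanged.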
3 — eBPF
eBPF lets you run sandboxed programs inside the Linux kernel and attach them to events such as syscalls, so that your code executes whenever those events fire. Companies and open-source projects use it to automatically collect telemetry data without changing the application code.
Nowadays, a few tools already provide Observability out of the box using eBPF, such as Pixie, and if you want, you can also build your own eBPF programs using Go, Rust, or Python.
eBPF is a powerful technology for collecting telemetry data without changing your application code, but it also requires a high level of knowledge about Linux. If you’re not familiar with it yet, I recommend reading this blog post, which gives a very nice overview and also works as a resource to continue your study.
Conclusion
In my view, the proxy strategy is the fastest one to put in place, because the majority of distributed systems already use proxies in their layers. That makes it easier to implement and makes it possible to collect telemetry data from all services uniformly, without requesting code changes from the engineering teams.
It’s also important to highlight that the proxy strategy only works well for services based on HTTP or gRPC.
In the next blog post, I’m going to show you a simple implementation of each strategy, starting with the fastest one which is the proxy strategy.
I hope you’ve enjoyed this blog post, and if you have any questions or suggestions, feel free to comment or reach out on Twitter 😃.