Observability agents can consume significant resources while they run. To avoid unnecessary additional costs from an agent claiming too many resources, monitor the agent itself and make sure its resource consumption stays within a reasonable range.
We recently looked at how the OpenTelemetry Collector can be used as a filter to monitor telemetry. That approach applies when multiple applications or microservices are involved, especially for security reasons. As such, the OpenTelemetry Collector falls into the category of observability agents, which also includes tools such as Fluent Bit and Vector. Observability agents play a key role in how observability works: they handle data collection, processing, and transmission, ensuring that telemetry data arrives accurately, and they are central to monitoring system performance. They help surface unknown issues so that performance problems can be troubleshot and mitigated before they occur, which is a baseline expectation for observability. When used for data collection, an observability agent gathers data sent to it from one or more sources. In addition to receiving data, it also exports data to endpoints, such as Grafana panels for visualization. With an agent in place, you can configure which logs, traces, and metrics are collected for observability. If you are already deploying instrumented applications that send telemetry directly to your observability platform, you can choose not to use an observability agent at all. Collectors are also useful when monitoring applications that can't be instrumented, which is a very common use case, Google software developer Braydon Kains told The New Stack.
Without a collector, you have to configure each backend or monitoring destination separately, which can be cumbersome. Instead, the collector acts as a single endpoint for all of your microservices, simplifying access to applications and microservices through one unified point (a minimal configuration sketch follows below). With an observability agent serving as a collector, you can view and manage microservices centrally, with a unified view on platforms such as Grafana. Grafana does offer alternatives that don't involve the OpenTelemetry Collector, but the collector greatly simplifies the process. However, observability agents can consume significant resources. That is why the agents themselves can, and should, be monitored, to make sure they don't over-consume resources and create unnecessary costs. In other words, OpenTelemetry Collector, Fluent Bit, Vector and the like are all robust and able to handle a wide variety of tasks with good results, but their relative performance can differ.
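To make the single-endpoint idea concrete, the following is a minimal sketch of what an OpenTelemetry Collector configuration could look like, with one OTLP receiver shared by every service and a couple of exporters. The backend endpoints are placeholders, and the receivers, processors, and exporters you actually need depend on your stack; this is an illustration, not a recommended production setup.

```yaml
# Sketch: one collector endpoint for every microservice, batching before export.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317      # all services send OTLP telemetry here

processors:
  batch: {}                         # batch data to reduce export overhead

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889          # scraped by Prometheus/Grafana
  otlphttp:
    endpoint: https://otel-backend.example.com   # placeholder backend

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

Each microservice points its OTLP exporter at the collector's single endpoint, and backends can be added or swapped in one place rather than in every application.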
Most of the popular agents have Kubernetes filters and processors that fetch metadata from the Kubernetes API to enrich logs and other data. As Google software developer Braydon Kains noted in his KubeCon + CloudNativeCon talk "How Much Overhead: How to Evaluate Observability Agent Performance", Fluent Bit and Vector are growing in popularity alongside OpenTelemetry. "Each agent also has a way to build custom processing if the available defaults don't meet your needs," Kains told The New Stack after the talk. "The biggest challenge with this is that doing anything on a pipeline that processes terabytes of data per second has a multiplicative effect on your overhead. Especially for regex or JSON log parsing, the impact can grow rapidly," Kains said. "If you can't send data fast enough, I highly recommend increasing the number of workers or leveraging the agent's threading implementation if possible." According to Kains, export is the one step in the pipeline that can easily be parallelized, since most backends can handle slightly out-of-order timestamps. One feature Fluent Bit provides, for example, is the ability to configure eight workers, creating a thread pool of eight workers that send data at the same time. Dispatching data to that thread pool and letting the workers absorb the slower export step can dramatically improve the efficiency of the pipeline when a single process falls behind, Kains said (see the configuration sketch below). Organizations often need to work out for themselves which agent is best for them and what overhead to expect, Kains said. "The only way to do that is to try to run it. If you can replicate the production environment, install the agent, configure it and monitor the metrics," Kains said. "It's the best way to get answers."
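As a rough illustration of the worker pool Kains described, here is a minimal Fluent Bit configuration sketch that enables eight export workers on an output plugin. The log path and backend host are hypothetical placeholders; whether extra workers help depends on your output plugin and backend.

```
# Sketch: tail container logs and export them with eight parallel workers.
[INPUT]
    Name     tail
    # Hypothetical log path; adjust to your environment.
    Path     /var/log/containers/*.log
    Tag      kube.*

[OUTPUT]
    Name     http
    Match    *
    # Placeholder backend host and port.
    Host     logs-backend.example.com
    Port     443
    TLS      On
    # Thread pool of eight workers flushing data in parallel.
    Workers  8
```

With Workers set, flushes are dispatched to a pool of threads so a slow backend does not stall the rest of the pipeline.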
If replicating a production environment is challenging, Kains recommends using a log generator or a Prometheus scrape of a test workload instead. AWS's Log Bench is a good log generator for testing log pipelines; for Prometheus scraping, set up a mock server that serves a copy of the scrape text (a minimal sketch follows below). "If you're expecting high cardinality, especially for database metrics, force a high-cardinality scenario to stress-test the agent's performance. If you're not satisfied with the results of your assessment, consider reducing or offloading work to cut resource usage. Aggregation nodes and backend processing can also help manage resource usage," Kains said. "If you encounter unacceptable performance or find regressions, open an issue for the maintainers with details on how to reproduce it and associated performance data such as charts, CSVs, Linux perf reports, or pprof profiles." Kains' team at Google works on Google Cloud Operations' Ops Agent, which merges two agents, using Fluent Bit for log collection and OpenTelemetry for metrics and tracing. Behind the scenes, the team maintains a configuration layer that generates the configurations for OpenTelemetry and Fluent Bit. Those configurations are tuned primarily for users on ordinary virtual machines, with OpenTelemetry ensuring efficient metrics collection.
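To illustrate the mock-scrape-target idea, the Python sketch below serves Prometheus-style exposition text with artificially high label cardinality, so you can point an agent's Prometheus scrape configuration at it and stress-test the agent. The metric and label names are invented for this example; it is not a tool from Kains or AWS.

```python
# mock_metrics.py - serve high-cardinality Prometheus exposition text so an
# observability agent can scrape it as a stress test.
from http.server import BaseHTTPRequestHandler, HTTPServer

SERIES = 5000  # number of distinct label combinations to expose

def build_payload() -> bytes:
    lines = ["# TYPE mock_db_queries_total counter"]
    for i in range(SERIES):
        # Every unique label set becomes its own time series in the agent.
        lines.append(
            f'mock_db_queries_total{{db="db{i % 50}",table="t{i}"}} {i}'
        )
    return ("\n".join(lines) + "\n").encode()

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        payload = build_payload()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    # Point the agent's Prometheus scrape config at http://localhost:9109/metrics
    HTTPServer(("0.0.0.0", 9109), MetricsHandler).serve_forever()
```

Increase SERIES, or add more label dimensions, to push cardinality higher while you watch the agent's CPU and memory.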
Some time ago, Kains said, his team was interested in seeing whether OpenTelemetry logging could be used in the Ops Agent instead of Fluent Bit. "This would allow us to be completely unified on the OpenTelemetry Collector," Kains said. "At the time, OpenTelemetry logging wasn't mature enough to match Fluent Bit's throughput and memory usage, so we chose not to move forward. We haven't updated those benchmarks, so it's hard to say what the situation would be today."
For most ordinary users, however, benchmarking an agent the way Google does would be prohibitively expensive and complex, because it relies on Google's own infrastructure. "The benchmarks I run can't be replicated by the community," Kains said. "This is something I intend to commit to in the new year: reworking our benchmarking and performance-evaluation strategies and technologies so that they are open source and don't rely on any Google know-how or infrastructure." In the meantime, using AWS Log Bench, or even scripts created by Kains' team, it is possible to manually generate log load for an agent and then observe and compare metrics directly on the VM with tools such as htop, or collect metrics with scripts that gather information from /proc or something similar, Kains said (a rough sketch of such a script follows below). "I want to create guides or tools that can be open sourced to make this kind of benchmarking more accessible to less technical users," Kains said. "I don't have exact plans yet, but I hope to have more to say in the coming months."
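As a minimal sketch of that manual approach, the Python script below samples an agent process's CPU and memory from /proc at a fixed interval and prints CSV rows you can chart later. The command-line interface and output format are assumptions for illustration, not part of any tool from Kains' team.

```python
# sample_agent.py - poll /proc for an agent's CPU time and RSS and emit CSV.
# Usage: python3 sample_agent.py <pid> [interval_seconds]
import os
import sys
import time

CLK_TCK = os.sysconf("SC_CLK_TCK")        # clock ticks per second
PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")    # bytes per memory page

def read_stat(pid: int):
    """Return (utime+stime in clock ticks, resident set size in bytes)."""
    with open(f"/proc/{pid}/stat") as f:
        # The comm field can contain spaces, so split after the closing paren.
        fields = f.read().rpartition(")")[2].split()
    utime, stime = int(fields[11]), int(fields[12])   # stat fields 14 and 15
    with open(f"/proc/{pid}/statm") as f:
        rss_pages = int(f.read().split()[1])          # second field: resident
    return utime + stime, rss_pages * PAGE_SIZE

if __name__ == "__main__":
    pid = int(sys.argv[1])
    interval = float(sys.argv[2]) if len(sys.argv) > 2 else 5.0
    print("timestamp,cpu_percent,rss_mib")
    prev_cpu, prev_time = read_stat(pid)[0], time.time()
    while True:
        time.sleep(interval)
        cpu_ticks, rss_bytes = read_stat(pid)
        now = time.time()
        cpu_pct = 100.0 * (cpu_ticks - prev_cpu) / CLK_TCK / (now - prev_time)
        print(f"{int(now)},{cpu_pct:.1f},{rss_bytes / (1024 * 1024):.1f}",
              flush=True)
        prev_cpu, prev_time = cpu_ticks, now
```

Run the log generator against the agent, start this sampler with the agent's PID, and compare the resulting CSVs across agents or configurations.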