Balancing Act: Cutting Kubernetes Event Storage Costs Without Losing Visibility with OpenTelemetry

A few weeks ago I wrote a blog post about efficient Kubernetes event tracking. In it, I explained how you can collect Kubernetes events using OpenTelemetry, and why I think OpenTelemetry is a great solution for Kubernetes event collection.

After I shared that post, I received a lot of feedback from the community, and one of the most common concerns was storage costs: Kubernetes events are stored as logs, and logs are generally expensive to store.

So I decided to write this post to share how you can reduce the storage costs of your Kubernetes events without losing visibility, building on the event-filtering idea from my previous post.

Initial OpenTelemetry Configuration

As a starting point, let’s use the same OpenTelemetry configuration from my previous blog post:

receivers:
  k8sobjects:
    objects:
      - name: event
        mode: watch
        group: events.k8s.io

processors:
  batch:
  filter:
    logs:
      log_record:
        - 'IsMatch(body["reason"], "(Pulling|Pulled)") == true'

exporters:
  otlp:
    endpoint: otelcol:4317

service:
  pipelines:
    logs:
      receivers: [k8sobjects]
      processors: [filter, batch]
      exporters: [otlp]

This configuration works pretty well and already filters high-volume events, such as Pulling and Pulled, but it also removes all visibility into them. What we really want is to drop the whole event payload while still keeping track of how many events of each discarded type happened.

Enriching Kubernetes Events

First, let’s add a new processor to the OpenTelemetry configuration: the transform processor, which enriches each Kubernetes event record with an event.reason attribute so we can use it later to filter and count the events (a sketch of the event body it reads from is shown after the configuration).

receivers:
  k8sobjects:
    objects:
      - name: event
        mode: watch
        group: events.k8s.io

processors:
  batch:
  transform:
    error_mode: ignore
    log_statements:
    - context: log
      statements:
        - merge_maps(cache, body["object"], "upsert")
        - set(attributes["event.reason"], ConvertCase(cache["reason"], "lower"))

exporters:
  otlp:
    endpoint: otelcol:4317

service:
  pipelines:
    logs:
      receivers: [k8sobjects]
      processors: [transform, batch]
      exporters: [otlp]

Info

I have temporarily removed the filter processor from the pipeline; we’ll add it back later.
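To make the transform statements easier to follow, here is a rough, abbreviated sketch of what a watched event record’s body looks like. The field names follow the events.k8s.io/v1 API, but the exact shape can vary by receiver version, and the pod and image names below are made up. This is why the reason is read from body["object"] rather than from the body directly:

# abbreviated, illustrative sketch only; not a complete record
type: ADDED
object:
  apiVersion: events.k8s.io/v1
  kind: Event
  reason: Pulled
  note: 'Successfully pulled image "nginx:1.25"'   # hypothetical image
  type: Normal
  regarding:
    kind: Pod
    name: my-app-5d9c7b6d4f-abcde                  # hypothetical pod name
    namespace: default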

Counting Kubernetes Events

Before we filter the events, we need to count them, so we can keep visibility into how many events per reason the Kubernetes cluster is producing, even if we don’t export all of them to the backend.

Luckily, the OpenTelemetry Collector has a component that can help with that: the count connector. It counts the number of log records, in our case events per reason, and emits the result as metrics. After adding it to the configuration, it looks like this:

Note

If you’re not familiar with OpenTelemetry connectors, I recommend reading the OpenTelemetry Connectors documentation.

receivers:
  k8sobjects:
    objects:
      - name: event
        mode: watch
        group: events.k8s.io

connectors:
  count:
    logs:
      k8s.events.count:
        description: "Count the number of events"
        conditions:
          - 'attributes["k8s.resource.name"] == "events"'
        attributes:
          - key: event.reason
            default_value: unknown

processors:
  batch:
  transform:
    error_mode: ignore
    log_statements:
    - context: log
      statements:
        - merge_maps(cache, body["object"], "upsert")
        - set(attributes["event.reason"], ConvertCase(cache["reason"], "lower"))

exporters:
  otlp:
    endpoint: otelcol:4317
  prometheus:
    endpoint: localhost:9090

service:
  pipelines:
    logs:
      receivers: [k8sobjects]
      processors: [transform, batch]
      exporters: [count]
    metrics:
      receivers: [count]
      processors: [batch]
      exporters: [prometheus]

Let’s break down this configuration:

  • I added a new connectors section that defines a metric called k8s.events.count. It counts events per event.reason, but only for records whose k8s.resource.name attribute is set to events (a sample of the resulting metric is shown after this list).

  • Instead of exporting the events to the otlp exporter, I’m exporting the events to the count connector.

  • I added a new pipeline called metrics. It receives the metrics from the count connector, applies the batch processor, and exports the result to the prometheus exporter.
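For reference, here is a hypothetical sample of what this metric could look like when scraped from the prometheus exporter. The exact metric name depends on the exporter’s normalization rules (dots typically become underscores, and a _total suffix may be appended), so treat the names and values below as illustrative only:

# illustrative scrape output; metric name, labels, and values are assumptions
k8s_events_count{event_reason="pulling"} 128
k8s_events_count{event_reason="pulled"} 127
k8s_events_count{event_reason="backoff"} 12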

Adding the Filter Processor Back to the Pipeline

Now it’s time to add the filter processor back, but this time we’re going to leverage the event.reason attribute to drop the events we don’t want to export to the backend. The configuration looks like this:

receivers:
  k8sobjects:
    objects:
      - name: event
        mode: watch
        group: events.k8s.io

connectors:
  count:
    logs:
      k8s.events.count:
        description: "Count the number of events"
        conditions:
          - 'attributes["k8s.resource.name"] == "events"'
        attributes:
          - key: event.reason
            default_value: unknown

processors:
  batch:
  transform:
    error_mode: ignore
    log_statements:
    - context: log
      statements:
        - merge_maps(cache, body["object"], "upsert")
        - set(attributes["event.reason"], ConvertCase(cache["reason"], "lower"))
  filter:
    logs:
      log_record:
        - 'IsMatch(attributes["event.reason"], "(pulling|pulled)") == true'

exporters:
  otlp:
    endpoint: otelcol:4317
  prometheus:
    endpoint: localhost:9090

service:
  pipelines:
    logs:
      receivers: [k8sobjects]
      processors: [transform, batch]
      exporters: [count]
    metrics:
      receivers: [count]
      processors: [batch]
      exporters: [prometheus]

You probably noticed that I didn’t add the filter processor to the logs pipeline. That’s because doing so would drop the events before they are counted, and I would lose visibility into them. What I want is to filter the events after they have been counted, so I can drop the ones I don’t want to export to the backend while still keeping visibility into them.

Unfortunately, that’s not possible with the current configuration: processors in a pipeline run before its exporters, and the count connector acts as an exporter of the logs pipeline, so the filter would always run before the count. We need a way to work around this.

Splitting the OpenTelemetry Pipeline

To achieve what I described in the previous section, we need to split the logs pipeline in two: one pipeline counts the events, and the other filters them and exports the remainder to the backend.

This is super simple to do, because the OpenTelemetry Collector provides another connector called forward, which lets us forward the events from one pipeline to another. The configuration looks like this:

receivers:
  k8sobjects:
    objects:
      - name: event
        mode: watch
        group: events.k8s.io

connectors:
  forward:
  count:
    logs:
      k8s.events.count:
        description: "Count the number of events"
        conditions:
          - 'attributes["k8s.resource.name"] == "events"'
        attributes:
          - key: event.reason
            default_value: unknown

processors:
  batch:
  transform:
    error_mode: ignore
    log_statements:
    - context: log
      statements:
        - merge_maps(cache, body["object"], "upsert")
        - set(attributes["event.reason"], ConvertCase(cache["reason"], "lower"))
  filter:
    logs:
      log_record:
        - 'IsMatch(attributes["event.reason"], "(pulling|pulled)") == true'

exporters:
  otlp:
    endpoint: otelcol:4317
  prometheus:
    endpoint: localhost:9090

service:
  pipelines:
    logs:
      receivers: [k8sobjects]
      processors: [transform, batch]
      exporters: [count, forward]
    logs/filtered:
      receivers: [forward]
      processors: [filter, batch]
      exporters: [otlp]
    metrics:
      receivers: [count]
      processors: [batch]
      exporters: [prometheus]

Ok, let’s break down this configuration:

  • The first pipeline, logs, receives the events from the Kubernetes API, applies the transform and batch processors, and exports every event to both the count connector and the forward connector.

  • The second pipeline, logs/filtered, receives the events from the forward connector, applies the filter and batch processors, and exports the remaining events to the otlp exporter.

  • The third pipeline, metrics, receives the metrics from the count connector, applies the batch processor, and exports them to the prometheus exporter (see the example query after this list).
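With this setup, even the dropped Pulling and Pulled events remain visible as data points in the counter. As a rough sketch, assuming the metric ends up exposed as k8s_events_count with an event_reason label (check what your exporter actually produces), a query like the one below would show the event rate per reason, including the discarded ones:

# hypothetical PromQL; adjust the metric name to what your backend actually stores
sum by (event_reason) (rate(k8s_events_count[5m]))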

Conclusion

This is what I like most about the OpenTelemetry Collector: you can design an observability pipeline that fits your needs, do all the processing you need, and only then export the data, avoiding unnecessary costs on the backend.

The strategy described in this post is a great way to reduce backend costs, and it applies beyond Kubernetes events to other high-volume data sources, such as application logs, metrics, and traces.

I hope you enjoyed this blog post. If you have any questions, please feel free to reach out to me on Twitter. Don’t forget to share this post with your friends and colleagues. πŸš€
