Bit by Bit

Lossless Log Aggregation

Reduce Log Volume by 99% Without Dropping Data

Kevin Lin
Jan 29, 2024

Thomas Howard

On a rainy March day in 1538, Thomas Howard, the Duke of Norfolk, found himself confined within the cold stone walls of his grand estate. His fingers reluctantly penned a letter to sell his cherished lands to settle longstanding debts.

"A man can not have his cake and eat his cake", he wrote, the words etched with the pain of having to confront a reality that he did not want to accept.

This was the earliest written record of the proverb "you can't have your cake and eat it too". It's used in situations when we are forced to make tradeoffs that we'd rather not make.

When it comes to logging, that tradeoff is one of volume vs cost. You want the volume necessary to observe all your systems, and you don't want to spend most of your infra budget on logs. But you often can't have both.

Not Eating Cake

The common choice companies make when logging costs are high is to log less. That lowers the cost, but only by giving up volume.

This could be deciding that you only need ERROR logs in production. Or deciding that certain services can do without logs at all. Or only keeping logs on the host and hoping it doesn't crash.

While this can reduce costs, it's often a temporary bandage, one with very real downsides.

Not Having Cake

Companies log in the first place because logs are insurance. In the happy case, no one thinks about them (besides the CFO griping about the bill). In the other case, the one where things are on fire, logs are essential for engineers to properly diagnose and recover from an incident.

Incidents result in downtime, and downtime impacts the business's bottom line. When Amazon went down for 40 minutes in August 2013, a Forbes article estimated that the company lost $66,240 per minute during the outage.

Dropping logs does decrease your costs. But you need to weigh these savings against the additional costs incurred by every extra second of downtime.

Have it Both Ways

What if you didn't need to choose? What if you could store all your logs while lowering your costs? Enter Lossless Log Aggregation (LLA).

LLA is the process of aggregating similar logs into a larger aggregate log. Common metadata and values are deduplicated and merged during the aggregation. When done effectively, this can result in a 100X reduction in volume and a 40% reduction in size. Without dropping data.

Illustration of LLA

Types of Log Groups

Three types of log groups are good candidates for LLA:

  1. Logs with common message patterns

  2. Logs with common identifiers

  3. Multi-line Logs

For before-and-after examples of each type, see the examples section of the Nimbus documentation.

Logs with common message patterns

These are high-volume log events that repeat most of their content. For most applications most of the time, this will be the primary driver of log volume. Examples include health checks and heartbeat notifications.
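
One way to spot these groups, sketched below in Python, is to mask the parts of each message that vary and count how many logs collapse onto the same template. The masking rules and the `template_of` and `count_templates` helpers are illustrative assumptions, not taken from any particular tool, and the sample messages are made up.

```python
import re
from collections import Counter

# Hypothetical masking rules: replace values that vary between otherwise
# identical messages with a placeholder. More specific patterns come first.
_MASKS = [
    (re.compile(r"\b\d+(?:\.\d+)?ms\b"), "<duration>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]


def template_of(message: str) -> str:
    """Return the message with variable tokens replaced by placeholders."""
    for pattern, placeholder in _MASKS:
        message = pattern.sub(placeholder, message)
    return message


def count_templates(messages):
    """Count how many messages share each template."""
    return Counter(template_of(m) for m in messages)


if __name__ == "__main__":
    sample = [
        "health check on web1 passed in 23ms",
        "health check on web1 passed in 19ms",
        "heartbeat 1042 received",
        "heartbeat 1043 received",
    ]
    # Templates with the highest counts are the best aggregation candidates.
    for template, count in count_templates(sample).most_common():
        print(count, template)
```

Real traffic usually needs more masks (UUIDs, IP addresses, timestamps), so treat the two rules here as a starting point.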

Logs with common identifiers

These are logs that describe a sequence of related events. These sequences usually share a common identifier like a transactionId or a jobId. Examples include background jobs and business-specific user flows.
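
As a rough Python sketch of this case, the events of one job can be folded into a single log keyed by the shared identifier. The `group_by_identifier` helper, the `steps` and `size` output fields, and the event shape used here are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_identifier(events, id_field="jobId"):
    """Merge events that share an identifier into one aggregate per identifier.

    Fields whose value is identical across every event in a group are stored
    once at the top level; everything else stays in the per-event `steps` list.
    """
    groups = defaultdict(list)
    for event in events:
        groups[event[id_field]].append(event)

    aggregates = []
    for members in groups.values():
        shared = {
            key: value
            for key, value in members[0].items()
            if all(key in m and m[key] == value for m in members)
        }
        steps = [
            {k: v for k, v in m.items() if k not in shared} for m in members
        ]
        aggregates.append({**shared, "steps": steps, "size": len(members)})
    return aggregates


if __name__ == "__main__":
    events = [
        {"jobId": "a1", "service": "billing", "message": "started"},
        {"jobId": "a1", "service": "billing", "message": "finished"},
    ]
    print(group_by_identifier(events))
```

In a streaming pipeline you would also need a rule for when a group is complete, such as a terminal event or a timeout. That part is left out here.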

Multi-line Logs

These are logs where the message body is spread across multiple lines, such as stack traces. Unless you add special logic on the agent side, the default behavior is to emit each newline-delimited line as a separate log event.
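
The "special logic" is typically a rule for recognizing where a new event starts. Here is a minimal Python sketch, assuming every event begins with a YYYY-MM-DD date and any other line is a continuation of the previous event. The `merge_multiline` helper and the start pattern are illustrative, so adjust the pattern to your own log format.

```python
import re

# Assumption: a new event starts with an ISO-like date (YYYY-MM-DD). Any line
# that does not match is treated as a continuation of the previous event.
_EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2}")


def merge_multiline(lines):
    """Join continuation lines onto the event they belong to."""
    events = []
    for line in lines:
        if _EVENT_START.match(line) or not events:
            events.append(line)
        else:
            events[-1] += "\n" + line
    return events


if __name__ == "__main__":
    lines = [
        "2024-01-29 10:00:00 ERROR Unhandled exception",
        "Traceback (most recent call last):",
        '  File "app.py", line 1, in <module>',
        "ValueError: boom",
        "2024-01-29 10:00:01 INFO recovered",
    ]
    for event in merge_multiline(lines):
        print(repr(event))
```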

A Motivating Example

Below are logs from a load balancer performing health checks on a suite of targets.

{"timestamp":"Tue Jan 10 09:15:16 2023","host":"lb1.nimbus.dev:5678","target":"web1","path":"/health","latency":"23ms","status":"passed"}
{"timestamp":"Tue Jan 10 09:15:18 2023","host":"lb2.nimbus.dev:5678","target":"web2","path":"/health","latency":"57ms","status":"passed"}
{"timestamp":"Tue Jan 10 09:15:25 2023","host":"lb1.nimbus.dev:5678","target":"web4","path":"/health","latency":"14ms","status":"passed"}
{"timestamp":"Tue Jan 10 09:15:28 2023","host":"lb2.nimbus.dev:5678","target":"web5","path":"/health","latency":"38ms","status":"passed"}
{"timestamp":"Tue Jan 10 09:16:01 2023","host":"lb3.nimbus.dev:5678","target":"web3","path":"/health","latency":"16ms","status":"passed"}
{"timestamp":"Tue Jan 10 09:17:16 2023","host":"lb1.nimbus.dev:5678","target":"web1","path":"/health","latency":"19ms","status":"passed"}
{"timestamp":"Tue Jan 10 09:17:18 2023","host":"lb2.nimbus.dev:5678","target":"web2","path":"/health","latency":"41ms","status":"passed"}
{"timestamp":"Tue Jan 10 09:17:22 2023","host":"lb3.nimbus.dev:5678","target":"web3","path":"/health","latency":"32ms","status":"passed"}
{"timestamp":"Tue Jan 10 09:17:25 2023","host":"lb1.nimbus.dev:5678","target":"web4","path":"/health","latency":"27ms","status":"passed"}
{"timestamp":"Tue Jan 10 09:17:28 2023","host":"lb2.nimbus.dev:5678","target":"web5","path":"/health","latency":"62ms","status":"passed"}
// ... more logs

These are the same logs after applying LLA based on host and status:

{"host":"lb1.nimbus.dev:5678","status":"passed","path":"/health","data":[{"target":"web1","time":"23ms"},{"target":"web4","time":"14ms"},{"target":"web1","time":"19ms"},{"target":"web4","time":"27ms"},...], "size": 700, "timestamp":"Tue Jan 10 09:15:16 2023","timestamp_end":"Tue Jan 10 09:17:28 2023"}
{"host":"lb2.nimbus.dev:5678","status":"passed","path":"/health","data":[{"target":"web2","time":"57ms"},{"target":"web5","time":"38ms"},{"target":"web2","time":"41ms"},{"target":"web5","time":"62ms"},...], "size": 700, "timestamp":"Tue Jan 10 09:15:18 2023","timestamp_end":"Tue Jan 10 09:17:28 2023"}
{"host":"lb3.nimbus.dev:5678","status":"passed","path":"/health","data":[{"target":"web3","time":"101ms"},{"target":"web3","time":"16ms"},{"target":"web3","time":"32ms"},...],"size": 700,"timestamp":"Tue Jan 10 09:15:22 2023","timestamp_end":"Tue Jan 10 09:17:22 2023"}
// ... more logs

Some things to note:

  • a new field, size, has been added to indicate the number of logs aggregated

  • a new field, timestamp_end, has been introduced to mark the time of the last log in the aggregation

  • a new field, data, holds an array of the unique info of each log

For the above example, we've managed a 77% reduction in log size and a 99% reduction in log volume.

Log Results
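
To make the mechanics concrete, here is a minimal Python sketch of the aggregation described above, grouping on host and status. It is one possible implementation under stated assumptions, not the way Nimbus does it: the `aggregate` and `reduction` helpers are hypothetical, per-log fields keep their original names (so `latency` is not renamed to `time`), and, mirroring the output shown earlier, individual timestamps are summarized by `timestamp` and `timestamp_end`.

```python
import json
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%a %b %d %H:%M:%S %Y"  # matches "Tue Jan 10 09:15:16 2023"


def aggregate(logs, group_by=("host", "status")):
    """Merge logs sharing the `group_by` values into one aggregate log each."""
    groups = defaultdict(list)
    for log in logs:
        groups[tuple(log[field] for field in group_by)].append(log)

    aggregates = []
    for members in groups.values():
        # Fields with the same value in every member are stored once.
        shared = {
            key: value
            for key, value in members[0].items()
            if key != "timestamp"
            and all(key in m and m[key] == value for m in members)
        }
        # Whatever differs per log goes into the `data` array.
        data = [
            {k: v for k, v in m.items() if k not in shared and k != "timestamp"}
            for m in members
        ]
        ordered = sorted(
            members,
            key=lambda m: datetime.strptime(m["timestamp"], TIME_FORMAT),
        )
        aggregates.append({
            **shared,
            "data": data,
            "size": len(members),
            "timestamp": ordered[0]["timestamp"],
            "timestamp_end": ordered[-1]["timestamp"],
        })
    return aggregates


def reduction(before: float, after: float) -> float:
    """Percentage saved when going from `before` to `after`."""
    return 100 * (1 - after / before)


if __name__ == "__main__":
    raw = [
        {"timestamp": "Tue Jan 10 09:15:16 2023", "host": "lb1.nimbus.dev:5678",
         "target": "web1", "path": "/health", "latency": "23ms", "status": "passed"},
        {"timestamp": "Tue Jan 10 09:15:18 2023", "host": "lb2.nimbus.dev:5678",
         "target": "web2", "path": "/health", "latency": "57ms", "status": "passed"},
        {"timestamp": "Tue Jan 10 09:15:25 2023", "host": "lb1.nimbus.dev:5678",
         "target": "web4", "path": "/health", "latency": "14ms", "status": "passed"},
    ]
    merged = aggregate(raw)
    size_before = sum(len(json.dumps(log)) for log in raw)
    size_after = sum(len(json.dumps(log)) for log in merged)
    print(f"volume: {reduction(len(raw), len(merged)):.0f}% fewer logs")
    print(f"size: {reduction(size_before, size_after):.0f}% fewer bytes")
```

With only a handful of input logs the savings are modest. The reductions grow with the number of logs merged into each aggregate, since the shared fields are paid for once per group instead of once per log.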

How to do this at Home

LLA can be implemented in an observability pipeline. These are data pipelines that are optimized for receiving, transforming, and sending observability data. Popular open-source pipeline solutions include the OpenTelemetry (OTel) Collector and Vector.

To configure a pipeline for LLA, you'll need to identify log groups, create forwarding rules, normalize log data before and after aggregation, as well as perform the aggregation itself over the correct fields.

You also need to pay attention to vendor-specific details. For example, Datadog has limits for maximum content size, array size, and the size of a single log. This means you need to stay within each of these limits when aggregating Datadog logs, or you risk losing data.
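
One way to stay within such limits is to split an oversized aggregate into several smaller ones before sending it. The Python sketch below assumes an aggregate shaped like the earlier example (a `data` array plus shared fields). The `split_aggregate` helper is hypothetical, and the default `max_items` and `max_bytes` values are placeholders, not Datadog's actual quotas, so look up the real numbers in your vendor's documentation.

```python
import json


def split_aggregate(aggregate, max_items=100, max_bytes=200_000):
    """Split one aggregate log into several that each respect both limits.

    `max_items` caps the length of the `data` array and `max_bytes` caps the
    serialized size of each emitted log. Both defaults are placeholders.
    """
    # Shared fields are copied onto every chunk. Note that this includes
    # `timestamp` and `timestamp_end`, which then describe the whole
    # aggregate rather than the individual chunk.
    header = {k: v for k, v in aggregate.items() if k not in ("data", "size")}
    chunks, current = [], []

    def emit(items):
        chunks.append({**header, "data": items, "size": len(items)})

    for item in aggregate["data"]:
        candidate = {**header, "data": current + [item], "size": len(current) + 1}
        too_big = len(json.dumps(candidate).encode("utf-8")) > max_bytes
        # Flush the current chunk when adding this item would break a limit.
        # A single item that is too large on its own is still emitted alone.
        if current and (len(current) >= max_items or too_big):
            emit(current)
            current = []
        current.append(item)
    if current:
        emit(current)
    return chunks


if __name__ == "__main__":
    big = {
        "host": "lb1.nimbus.dev:5678",
        "data": [{"target": f"web{i}"} for i in range(5)],
        "size": 5,
    }
    for chunk in split_aggregate(big, max_items=2):
        print(chunk["size"], chunk["data"])
```

Re-serializing the candidate on every item keeps the sketch short but is quadratic in the chunk size. A production version would track the running byte count instead.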

Finally, as the pipeline itself is now part of your observability stack, you will need to operationalize it and make sure it can scale to handle traffic from all your services.

If you like the idea of LLA but not necessarily the toil, consider trying Nimbus. Nimbus is the first observability pipeline that automatically analyzes log traffic and can identify high-volume log groups as well as come up with LLAs to reduce their volume. On average, organizations save 60% off their logging costs within the first month of use.

Conclusion

It's 2024, and many companies find themselves at odds with their observability vendor - they want to ship all the logs necessary to provide a reliable service but cannot justify the expense to do so.

Like Duke Thomas, these organizations find themselves having to make the painful tradeoff of needing to part with something they need versus paying for something they cannot in good conscience afford.

But unlike in the Duke's time, there's a third choice - lossless log aggregations. Sometimes you don't have to choose, sometimes you can have your cake and eat it too!
