Advanced Monitoring of Fluentd with Datadog and Rollbar : Part 1 : Instrumentation & Collection

(This post is part 1/2 in a series. For part 2, see: Part 2 : Visualizing & Alerting)

At Twilio, we use the open source log-forwarder Fluentd to forward billions of log messages per day from thousands of instances in Amazon AWS into Google BigQuery. Since the reliability of Fluentd is crucial to our operations, we have extensive monitoring and alerting around the Fluentd process running on each host. This blog post is a write-up of how we monitor Fluentd at scale, using both Datadog and Rollbar.

Datadog’s Builtin Integrations

Datadog provides several built-in, easy to setup integrations that we can use for monitoring Fluentd. Here are the two primary integrations that we use:

Datadog-Fluentd Integration

The Datadog-Fluentd integration provides several metrics from Fluentd:

  • fluentd.retry_count (gauge)
  • fluentd.buffer_total_queued_size (gauge)
  • fluentd.buffer_queue_length (gauge)

To setup this integration:

# /etc/fluent/config.d/src/source-monitor-agent.conf
---
<source>
  type monitor_agent
  bind 127.0.0.1
  port 19837
</source>


# /etc/dd-agent/conf.d/fluentd.yaml
---
instances:
  - monitor_agent_url: http://127.0.0.1:19837/api/plugins.json

init_config:

Datadog Process Check

The Datadog Process Check captures metrics from specific running processes on a system, such as CPU %, memory usage, and disk I/O. We can use this to monitor the CPU & memory overhead of the Fluentd process itself.

This integration is also very easy to setup. Just configure the Datadog agent (dd-agent) on your host(s) to collect stats on your Fluentd process:

# /etc/dd-agent/conf.d/process.yaml
---
instances:
- name: fluentd
  search_string:
    - /usr/local/bin/fluentd
  exact_match: 'False'

init_config:

If this integration isn’t working for you, you likely have the wrong process name listed for your Fluentd process. You can determine which process name the dd-agent would match against by running the following snippet:

#!/opt/datadog-agent/embedded/bin/python

import psutil

for proc in psutil.process_iter():
    print proc.cmdline()

Metrics on Messages/Sec & Bytes/Sec

While the above metrics for CPU, memory, buffer size, and retries are helpful, none of them allow us to answer one of the most basic questions one would hope to answer: How many messages/sec or bytes/sec of log messages is a given host (or cluster) forwarding?

In order to answer that operational question, we had to build a creative solution:

1. For all messages that are forwarded to our typical output destination (ex. AWS Kinesis, Google BigQuery, etc), use Fluentd’s built-in copy plugin to also send those messages to fluent-plugin-flowcounter. Fluent-plugin-flowcounter will then emit new messages with metrics for bytes/sec and messages/sec as key/value pairs under a new tag, ex: “flowcount”.

<match app.* system.* service.*>
  @type copy

  <store>
    @type flowcounter
    tag flowcount
    aggregate tag
    output_style tagged
    count_keys *
    unit second
  </store>

  <store>
    @type kinesis_streams
    ...
  </store>
</match>

2. Create a new match rule against those “flowcount” messages. This rule will duplicate each of those messages with new tags, “myproject.fluentd.messages.bytes” and “myproject.fluentd.messages.count”. We do this because metrics for bytes and message count must be submitted to Datadog as separate metrics.

#
# Match messages that came from the "flowcounter" plugin and duplicate them.
# The resulting messages should have different tags though, so that we can
# process them separately.
#
<match flowcount>
  @type copy

  #
  # Need to "deep copy", otherwise the <store>s below will share & modify
  # the same record. This is bad because we need to modify the new
  # "bytes"/"count" records separately.
  #
  deep_copy true

  <store>
    type record_modifier
    tag quantico.fluentd.messages.bytes
  </store>
  <store>
    type record_modifier
    tag quantico.fluentd.messages.count
  </store>
</match>

3. Create a filter rule to match “myproject.fluentd.messages.*” (both variants that we produced in step #2). Use the record_transformer plugin to transform the “fluentd.messages.count” messages into messages that simply contain the message count metric, and “fluentd.messages.bytes” messages into messages that only contain the bytes metric.

#
# Messages coming into here will have the form:
#
#     quantico.fluentd.messages.count: {
#         "count": 0,
#         "bytes": 0,
#         "count_rate": 0.0,
#         "bytes_rate": 0.0,
#         "tag": "syslog.messages"
#     }
#
# ... where metrics for both "bytes" and "count" are packed into the same
# message.
#
# Since the Datadog statsd plugin can only handle 1 metric/message, we must
# duplicate the flowcount message and tag the resulting messages for either
# "count" or "bytes", respectively.
#
# Those resulting duplicated events then come here for additional
# transformations:
#
#     * Add a new JSON key named "value", whose value is the respective
#       count/bytes value, taken from the key "count" or "bytes".
#     * Rename the key named "tag" into "fluentd-tag". This will result in a
#       Datadog metric-tag named "fluentd-tag" which we can then pivot on.
#       Ex: {"fluentd-tag": "system.messages"}
#     * Drop the original keys: "bytes,bytes_rate,count,count_rate". If not
#       removed, they would show up as distinct metrics tags in Datadog and
#       given that they would have values of varying sizes, this would cause
#       our number of tracked metrics (and associated $$ costs) in Datadog to
#       grow unbounded.
#
# The resulting messages will have the following format, which we can then
# <match> into the "dogstatsd" output plugin, for shipping off to the local
# "statsd":
#
#     quantico.fluentd.messages.count: {
#         "fluentd-tag": "system.messages",
#         "value": "20"
#     }
#
#
<filter quantico.fluentd.messages.*>
  @type record_transformer
  enable_ruby true
  remove_keys bytes,bytes_rate,count,count_rate,tag

  <record>
    value ${record[tag_parts[3]]}
  </record>
  <record>
    fluentd-tag ${record['tag']}
  </record>
</filter>

4. A final match rule to accept those “myproject.fluentd.message.*” messages, and emit them to the local dogstatsd process as a counter metric (XXX), which will then automatically forward the metrics to Datadog.

#
# Send messages like "quantico.fluentd.messages.count" to the local dogstatsd
# daemon.
#
<match quantico.fluentd.**>
  @type dogstatsd
  port 8125
  host localhost
  metric_type count
  flat_tags true
  use_tag_as_key true
  value_key value

  #
  # The default flush interval of all fluentd output plugins is 60 seconds
  # which is far too infrequent for dogstatsd, which flushes to the Datadog
  # service every 10 seconds. Without a shorter "flush_interval" here,
  # metrics in Datadog will only appear every 60 seconds, with 0's padded
  # in between each data point. Yuk.
  #
  flush_interval 5s
</match>

One thing to note about the message that is pushed to dogstatsd plugin, and our use of “flat_tags true”. For the sample message below, the dogstatsd plugin will convert “fluentd-tag” into a tag on the Datadog metric. This means that we will be able to aggregate our Datadog dashboards on Fluentd tags, and enable us to get metrics on specific log files. Ex: Count/bytes for “fluentd-tag” of “nginx.access”, “haproxy”, “system.messages”, or “*”!

quantico.fluentd.messages.count: {
    "fluentd-tag": "system.messages",
    "value": "20"
}

Monitoring Fluentd’s Log File Itself with Rollbar

A final issue that we thought of: The Fluentd process itself has a log file. What if Fluentd emits errors or exceptions on any of our thousands of AWS instances? How would we ever find out? We certainly couldn’t trust Fluentd to forward it own logs in these situations, so therefore we needed an external and independent destination for Fluentd’s logs.

To solve this problem, I wrote a small GoLang app called “go-tail-rollbar-forwarder” which does exactly what its name implies: It uses the hpcloud/tail library to efficiently tail files from the file system, and if it finds ERROR or CRIT messages, it forwards those messages to Rollbar, a software error-tracking SaaS (Sentry could also be used).

This being the first GoLang app I’d written of any significance, I also setup Datadog process checks to measure CPU & RSS memory metrics of this process. I’m delighted to say that after 8 months of running on thousands of instances, we’ve never had any problems with resource consumption of “go-tail-rollbar-forwarder”. According to our heat maps in Datadog, this process typically uses just ~12MB of RSS memory, and <1% of a single CPU.

The Next Steps

This guide covered how to instrument and collect metrics and error messages from Fluentd. In Part 2 of this guide, we will cover best practices around visualizing and alerting on those metrics and messages. On to Advanced Monitoring of Fluentd with Datadog and Rollbar : Part 2 : Visualizing & Alerting.

redbug

2 thoughts on “Advanced Monitoring of Fluentd with Datadog and Rollbar : Part 1 : Instrumentation & Collection

  1. Pingback: Advanced Monitoring of Fluentd with Datadog and Rollbar : Part 2 : Visualizing & Alerting | Dan Tehranian's Blog

  2. Pingback: Advanced Monitoring of Fluentd with Datadog and Rollbar : Part 1 : Instrumentation & Collection

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s