Advanced Monitoring of Fluentd with Datadog and Rollbar : Part 2 : Visualizing & Alerting

(This post is part 2/2 in a series. For part 1, see: Part 1 : Instrumentation & Collection)

In part 1 of this series we went over how to instrument Fluentd so that it sends its errors and metrics to the monitoring services Datadog and Rollbar. In this post, we’ll cover how to effectively visualize and alert using those metrics.

Visualizing & Alerting with Datadog

Dashboard Template Variables

template-variables

Before jumping into the visualization themselves, it would be helpful to see how we have templatized the variables in our dashboards, so that one can quickly change the scope of all visualizations in that dashboard. For a complete guide on this powerful Datadog feature, see their Guide to Dashboard Templating.

Template variables we use:

  • $account-name: Account or environment name. Ex: “dev”, “stage”, “prod”, etc.
  • $realm: Twilio is a multi-region SaaS provider. Realm is analogous to the geographic region of a host. Ex: “usa”, “brazil”, “japan”.
  • $role: Functional type of the host. Ex: “mysql-master”, “mysql-slave”, “kafka”, “sms-sender”, etc.
  • $host: Host ID. We use an internal instance ID at Twilio, but AWS instance ID would also work.
  • $fluentd-tag: Thanks to the custom fluentd/dogstats instrumentation we built in part 1 of this post, we have metrics on bytes/sec and messages/sec tagged with their Fluentd tag. Ex: “nginx_access”, “nginx_error”, or “haproxy”.
  • $active: Load balancer state. Ex: “in-load-balancer” or “out-of-load-balancer”.

These template variables allow an operator to quickly modify all charts on a dashboard to visualize answers to questions like:

  • How much memory is being used by Fluentd in “prod” on all “mysql” hosts?
  • How much CPU is being used by Fluentd in “prod” by any 1 host?
  • How many “nginx_access” messages/sec are being sent in “prod” in Japan by all “kafka” hosts that are in-load-balancer?

Fluentd Bytes/Sec & Messages/Sec

02-bytessec02-messagessec

We visualize the sum of Fluentd bytes/sec and messages/sec across dimensions using the default, line graph visualization type, as rates. This simple, aggregated metric gives us a simple visualization of log message throughput. The dashed line in purple provides a week-over-week comparison, for context.

Fluentd Buffer Size

00-buffer-size-gb-by-role00-buffer-size-mb-by-host

We visualize Fluentd’s on-host disk output buffer via the above two charts. The first chart uses the stacked bar visualization type, aggregated on “role”. The absolute height of each bar tells us the sum of all Fluentd disk buffers across an account. The individual components of each bar (the different shades of blue) each represent an individual role. From this we can quickly tell how much any individual role is contributing to the sum of all buffers.

The second chart uses the heat map visualization aggregated by “host”. Because Twilio operates thousands of instances in Amazon AWS, a line graph containing a separate series for each host would be far too noisy to derive any meaning from. This heat map visualization automatically scales from hundreds of hosts to thousands of hosts, while still maintaining the ability for an operator to derive meaning.

With regards to alerting, we have Datadog monitors setup to page us if the on-disk buffer of a given host reaches a critical threshold, or if the sum of all disk buffers across an account reaches a certain threshold. In our case, these alerts would be indicative of issues in forwarding log messages to AWS Kinesis.

Fluentd Retries

00-fluentd-retries-by-host-2

We specifically do not use a line graph for visualizing retries, as retries are relatively infrequent events and the line graph visualization would try to connect/interpolate points with lines.

Instead, we visualize Fluentd retries via a stacked bar graph. The absolute height of the graph shows us the number of retries, and each colored segment of the stack represents a distinct host.

A red color palette for the bars was chosen instead of the default blue color palette to indicate impending doom. 🙂

Fluentd CPU & Memory Consumption

01-fluentd-cpu01-fluentd-rss-memory

For Fluentd %CPU & memory consumption, we again turn the the heat map visualization type. Because each of these heat maps is aggregated by host, they can represent thousands of hosts in a simple chart. Ex: From the bottom chart we see that on the vast majority of hosts, Fluentd is consuming 70-90MB of RSS memory.

We have Datadog monitors setup to alert us via Slack if Fluentd is consuming >80% of a single CPU on any host, or page us via PagerDuty if Fluentd’s memory consumption becomes unreasonable.

01-fluentd-cpu-distribution

We can also use the distribution visualization type to view absolute numbers for the most recent sampling period. A distribution chart is basically a cross-section of the heat map chart rotated sideways, showing only a single period. This differs from the heat map in that a heat map shows evolution of the metric over time. Is this distribution, we can see that 2,000+ hosts have a Fluentd CPU utilization of <=5%, for the latest period.

Rollbar-Forwarder CPU & Memory Consumption

01-go-tail-rollbar-cpu01-go-tail-rollbar-memory

We similarly use heat maps to measure the %CPU and memory consumption of our file-tailing, rollbar-forwarder which is installed on each host.

Ingress vs Egress

04-fluentd_output_vs_bigquery_output_redacted

Since we use AWS Kinesis as a intermediate transport, it can be helpful to visualize data in-flow vs out-flow in order to determine where slow-downs in the data pipeline are happening.

The blue line in the chart above represents data flowing into Kinesis from Fluentd (ingress). The purple line at the bottom represents data read out of Kinesis by our Kinesis consumer (egress). The yellow line in the middle is a simple subtraction of the two series (B – A). If all is well and egress ~= ingress, then the yellow line should be right around 0.

The metrics for AWS Kinesis come from Datadog’s AWS integration.

Top Graphs & Change Graphs

04-most-active-logs-by-bytes

04-top10-roles-by-count

04-role-change-by-count

Thanks to the tags and dimensions we have applied to our metrics, we were able to create the top charts and change graphs to answer questions like:

  • What log file is the top source of log messages?
  • Which roles are the top producers of logs?
  • Which roles are contributing most to a sudden spike in log messages?

Alerting on Fluentd Issues with Rollbar

fluentd-rollbar

Using the go-tail-rollbar-forwarder app that I mentioned in part 1, we are able to get WARN/ERROR/CRIT log messages from Fluentd’s own log file into a SaaS service, so that we can be automatically notified about Fluentd issues. Other benefits of using Rollbar are that Rollbar will automatically aggregate and group similar messages from different hosts (see message about “1,000 occurrences…” above), and save us from having to log into machines to view Fluentd’s application log.

Since Rollbar was designed to be used by a logging library (Ex: log4j) and not as log aggregation service, we had to configure custom grouping rules to get Rollbar to group similar messages correctly. This was relatively simple to do using regex’s, but I mention it here for the sake of completeness.

Conclusion

I hope you found this guide on how to monitor Fluentd with Datadog and Rollbar to be helpful. If you’ve made it this far in the post, I’ll reward you by leaving you with some wisdom on monitoring:

Start with questions, don’t start with metrics.

If you start with metrics, the temptation will be to make a dashboard with a bunch of line graphs and call it a day. If you start with questions, however, you’ll be forced to work backwards and figure out if you even have the metrics, instrumentation, tags/dimensions, etc to be able to answer that question.

Ex: Using the metrics that we discussed in this post one could answer the question, “How many hosts are running Fluentd in my infrastructure?”. I guarantee you, however, that if you started with the raw metrics, you would never end up with a Datadog counter that showed you an absolute count of how many hosts are running Fluentd (Hint: Use “count(not_null(fluentd.messages.count{$host}))”). That is the value of starting with questions instead of metrics.

What are good ways to come up with insightful questions? One way that we came up with the questions that drove the creation of the charts in this post was through Chaos Game Day exercises. By practicing incident response in a controlled environment (ex: stage), we were able to learn what types of instrumentation and visualizations we needed in order to answer questions like, “Is the problem getting better or worse?” or “If the problem is getting better, what is the ETA until full recovery?”. When we did our first Game Day exercises we were not able to answer these questions, but in future simulations and real incidents, we were prepared!

If you don’t do Chaos Game Day exercises, ultimately you will have to learn in production when the real-deal incidents happen, and during your blameless post-mortems. The choice in how you’d like to prepare for these situations is yours.

redbug

Advertisements

DockerCon 2015 Wrap-Up

I attended DockerCon 2015 in San Francisco from June 22-23. The official wrap-ups for Day 1 and Day 2 are available from Docker, Inc. Keynote videos are posted here. Slides from every presentation are available here.

Here are my personal notes and take-aways from the conference:

The Good

  • Attendance was much larger than I expected, reportedly at 2,000 attendees. It reminded me a lot of VMworld back in 2007. Lots of buzz.
  • There were many interesting announcements in the keynotes:
    • Diogo Mónica unveiled and demoed Notary, a tool for publishing and verifying the authenticity of content. (video)
    • Solomon Hykes announced that service discovery is being added into the Docker stack. Currently one needs to use an external tools like registrator, Consul, and etcd for this.
    • Solomon announced that multi-host networking is coming.
    • Solomon announced that Docker is splitting out its internal plumbing from the Docker daemon. First up is splitting out the container runtime plumbing into a new project called RunC. The net effect is that this creates a reusable component that other software can use for running containers. This will also make it easy to run containers without the full Docker daemon as well.
    • Solomon announced the Open Container Project & Open Container Format – Basically Docker, Inc. and CoreOS have buried the hatchet and are working with the Linux Foundation and over a dozen other companies to create open standards around containers. Libcontainer and RunC are bring donated to this project by Docker, while CoreOS is contributing the folks who were working on AppC. More info on the announcement here.
    • Docker revealed how they will start to monetize their success. They announced an on-prem Docker registry with a support plan starting at $150/month for 10 hosts.
  • Diptanu Choudhury unveiled Netflix’s Titan system in Reliably Shipping Containers in a Resource Rich World using Titan. Titan is a combination of Docker and Apache Mesos, providing a highly resilient and dynamic PaaS that is native to public clouds and runs across multiple geographies.
  • VMware announced he availability of AppCatalyst, a free, CLI-only version of VMware Fusion. That software, combined with the Vagrant plugin for AppCatalyst that Fabio Rapposelli released, means that developers no-longer need to pay for VMware Fusion in order to have a more stable and performant alternative to Oracle’s VirtualBox for use with Vagrant. William Lam has written a great Getting Started Guide for AppCatalyst.
  • The prize for most entertaining presentation goes to Bryan Cantrill for Running Aground: Debugging Docker in Production. Praise for his talk & funny excerpts from that talk were all over Twitter:

The Bad

I was pretty disappointed with most of the content of the presentations on the “Advanced” track. There were a lot of fluffy talks about micro-services, service discovery, and auto-scaling groups. Besides not getting into great technical detail, I was frustrated by these talks because there was essentially no net-new content for anyone who frequents meetups in the Bay Area, follows Hacker News, or follows a few key accounts on Twitter.

Speaking to other attendees, I found that I was not the only one who felt that these talks were very high-level and repetitive. Bryan Cantrill even eluded to this in his own talk when he mentioned “micro-services” for the first time, adding, “Don’t worry, this won’t be one of those talks.”

Closing Thoughts

I had a great time at DockerCon 2015. The announcements and presentations around security and network were particularly interesting to me because there were new things being announced in those areas. I could have done w/o all of the fluffy talks about micro-services and auto-scaling.

It was also great to meet new people and catch up with former colleagues. I got to hear a lot of interesting ways developers are using Docker in their development and production environments and can’t wait to implement some of the things I learned at my current employer.

Testing Ansible Roles with Test Kitchen

Recently while attending DevOps Days Austin 2015, I participated in a breakout session focused on how to test code for configuration management tools like Puppet, Chef, and Ansible. Having started to use Ansible to manage our infrastructure at Delphix I was searching for a way to automate the testing of our configuration management code across a variety of platforms, including Ubuntu, CentOS, RHEL, and Delphix’s custom Illumos-based OS, DelphixOS. Dealing with testing across all of those platforms is a seemingly daunting task to say the least!

Intro to Test Kitchen

The conversation in that breakout session introduced me to Test Kitchen (GitHub), a tool that I’ve been very impressed by and have had quite a bit of fun writing tests for. Test Kitchen is a tool for automated testing of configuration management code written for tools like Ansible. It automates the process of spinning up test VMs, running your configuration management tool against those VMs, executing verification tests against those VMs, and then tearing down the test VMs.

What’s makes Test Kitchen so powerful and useful is its modular design:

Using Test Kitchen

After learning about Test Kitchen at the DevOps Days conference, I did some more research and stumbled across the following presentation which was instrumental in getting started with Test Kitchen and Ansible: Testing Ansible Roles with Test Kitchen, Serverspec and RSpec (SlideShare).

In summary one needs to add three files to their Ansible role to begin using Test Kitchen:

  • A “.kitchen.yml” file at the top-level. This file describes:
    • The driver to use for VM provisioning. Ex: Vagrant, AWS, Docker, etc.
    • The provisioner to use. Ex: Puppet, Chef, Ansible.
    • A list of 1 or more operating to test against. Ex: Ubuntu 12.04, Ubuntu 14.04, CentOS 6.5, or even a custom VM image specified by URL.
    • A list of test suites to run.
  • A “test/integration/test-suite-name/test-suite-name.yml” file which contains the Ansible playbook to be applied.
  • One or more test files in “test/integration/test-suite-name/test-driver-name/”. For example, when using the BATS test-runner to run a test suite named “default”: “test/integration/default/bats/my-test.bats”.

Example Code

A full example of Test Kitchen w/Ansible is available via the delphix.package-caching-proxy Ansible role in Delphix’s GitHub repo. Here are direct links to the aforementioned files/directories:682240

Running Test Kitchen

Using Test Kitchen couldn’t be easier. From the directory that contains your “.kitchen.yml” file, just run “kitchen test” to automatically create your VMs, configure them, and run tests against them:

$ kitchen test
-----> Starting Kitchen (v1.4.1)
-----> Cleaning up any prior instances of 
-----> Destroying ...
 Finished destroying  (0m0.00s).
-----> Testing 
-----> Creating ...
 Bringing machine 'default' up with 'virtualbox' provider...
 ==> default: Importing base box 'opscode-ubuntu-14.04'...
==> default: Matching MAC address for NAT networking...
 ==> default: Setting the name of the VM: kitchen-ansible-package-caching-proxy-default-ubuntu-1404_default_1435180384440_80322
 ==> default: Clearing any previously set network interfaces...
 ==> default: Preparing network interfaces based on configuration...
 default: Adapter 1: nat
 ==> default: Forwarding ports...
 default: 22 => 2222 (adapter 1)
 ==> default: Booting VM...
 ==> default: Waiting for machine to boot. This may take a few minutes...

..  ...

-----> Running bats test suite
 ✓ Accessing the apt-cacher-ng vhost should load the configuration page for Apt-Cacher-NG
 ✓ Hitting the apt-cacher proxy on the proxy port should succeed
 ✓ The previous command that hit ftp.debian.org should have placed some files in the cache
 ✓ Accessing the devpi server on port 3141 should return a valid JSON response
 ✓ Accessing the devpi server via the nginx vhost should return a valid JSON response
 ✓ Downloading a Python package via our PyPI proxy should succeed
 ✓ We should still be able to install Python packages when the devpi contianer's backend is broken
 ✓ The vhost for the docker registry should be available
 ✓ The docker registry's /_ping url should return valid JSON
 ✓ The docker registry's /v1/_ping url should return valid JSON
 ✓ The front-end serer's root url should return http 204
 ✓ The front-end server's /_status location should return statistics from our web server
 ✓ Accessing http://www.google.com through our proxy should always return a cache miss
 ✓ Downloading a file that is not in the cache should result in a cache miss
 ✓ Downloading a file that is in the cache should result in a cache hit
 ✓ Setting the header 'X-Refresh: true' should result in a bypass of the cache
 ✓ Trying to purge when it's not in the cache should return 404
 ✓ Downloading the file again after purging from the cache should yield a cache miss
 ✓ The yum repo's vhost should return HTTP 200

 19 tests, 0 failures
 Finished verifying  (1m52.26s).
-----> Kitchen is finished. (1m52.49s)

And there you have it, one command to automate your entire VM testing workflow!

Next Steps

Giving individual developers on our team the ability to quickly run a suite of automated tests is a big win, but that’s only the first step. The workflow we’re planning is to have Jenkins also run these automated Ansible tests every time someone pushes to our git repo. If those tests succeed we can automatically trigger a run of Ansible against our production inventory. If, on the other hand, the Jenkins job which runs the tests is failing (red), we can use that to prevent Ansible from running against our production inventory. This would be a big win for validating infrastructure changes before pushing them to production.

ansible_logo_black_square

Why Adam Smith Would Be Impressed by AWS

In “The Moment I Bought into AWS” I mentioned how the father of modern economics, Adam Smith, would have been impressed by AWS.  Why is that so?

This is best illustrated by one of my most favorite EconTalk episodes ever, “Roberts on Smith, Ricardo, and Trade“.  I enjoyed that episode so much that I’ve probably listened to it 3 or 4 times.  The point I’m referring to gets made about 20 mins in, although you have to listen through the first 20 mins for the proper buildup.

Smith’s point is that if the extent of the market is large enough, only then does it make sense to invest in capital (technology) that will make it cheaper to produce output.  If the extent of the market is not large enough, investing in that capital will actually make the output more expensive.

Or in a more concrete example: If you wanted to make a sandwiches for yourself at home every day for lunch, it wouldn’t make sense to buy a deli meat slicer (the technology) for just yourself. The slicer is expensive and turning it on and cleaning it would be too much hassle for a single sandwich.  However if you were making 500 sandwiches per day as a sub shop owner, then having the slicer would allow you to produce sandwiches more quickly and more efficiently (less labor needed per sandwich).

This relates back to the the Amazon video in the examples of innovative datacenter optimizations they’ve created for themselves.  Since Amazon is operating at such a massive scale, they’re able to invest resources into optimizing power supplies, BIOSes, UPS firmwares, and custom storage systems in ways that a smaller player would be foolish to spend time investing in improving for their own benefit which is of a much smaller scale.

The Moment I Bought into AWS

I bought into AWS after watching this video that Alan G had sent me.  It’s an amazing example of how due to economies of scale, Amazon is able to invest in efficiencies in storage, servers, UPS’s, etc that it would never make sense for a smaller organization to invest in.  Then, because of these efficiencies that they develop, Amazon is able to pass better products and cost savings onto their customers.

Adam Smith would be proud.