Succeeding through Laziness and Open Source

Back in mid-2014 I was in the midst of Docker-izing the build process at Virtual Instruments. As part of that work I’d open sourced one component of that system: the Docker-in-Docker Jenkins build slave that I’d created.


While claiming that altruism drove me to post this code to GitHub (GH) would make for a great ex-post narrative, I have to admit that the real reasons for making the code publicly available were much more practical:

  • At the time the Docker image repositories on the Docker Hub Registry had to be tied to a GitHub repo (They’ve added Bitbucket support since then).
  • I was too cheap to pay for a private GitHub repo.

… And thus the code for the Docker-in-Docker Jenkins slave became open source! 😀

Unfortunately, making this image publicly available presented some challenges soon thereafter: folks started linking their blog posts to it, people I’d never met emailed me asking for help in getting set up with this system, others started filing issues against me on either GH or the Docker Hub Registry, and I started receiving pull requests (PRs) to my GH repo.

Having switched employers just a few months after posting the code to GH, dealing with the issues and PRs was a bit of a challenge: My new employer didn’t have a Dockerized build system (yet), and short of setting up my own personal Jenkins server and Dockerized build slaves, there was no way for me to verify issues/fixes/PRs for this side-project. And so “tehranian/dind-jenkins-slave” stagnated on GH with relatively little participation from me.

Having largely forgotten about this project, I was quite surprised a few weeks ago when, while perusing Disqus’s GitHub repos, I accidentally discovered that the engineering team at Disqus had forked my repo and had been actively committing changes to their fork!

Their changes had:

  • Optimized the container’s layers to make it smaller in size,
  • Updated the image to work with new versions of Docker,
  • And also modified some environment variable names to avoid collisions with names that popular frameworks would use.

Prompted by this, I went back to my own GH repo, looked at the graph of all other forks, and saw that several others had forked my GH repo as well.

One such fork had updated my image to work with Docker Swarm and also to be able to easily use SSH keys for authenticating with the build slave instead of using password-based auth.

“How cool!”, I thought. I’d put an idea into the public domain a year ago, others had found it, and improved it in ways that I couldn’t have imagined. Further, their improvements were now available for myself and others to use!

My Delphix colleague Michael Coyle summed this all up very nicely, saying “As a software developer I can only realistically work for one organization at a time. Open source allows developers from different organizations to collaborate with each other without boundaries. In that way one actually can contribute to more than one organization at once.”

In hindsight I’m absolutely delighted that my unwillingness to purchase a private GitHub repo led to me contributing the Docker-in-Docker Jenkins slave to the public domain. There was nothing proprietary that Virtual Instruments could have used in its product, and by making it available other organizations like Disqus and CloudBees have been able to benefit, along with software developers on the other side of the planet. How exciting!


Building Vagrant Boxes with Nested VMs using Packer

In “Improving Developer Productivity with Vagrant” I discussed the productivity benefits gained from using Vagrant in our software development tool chain. Here are some more details about the mechanics of how we created those Vagrant boxes as part of every build of our product.

Using Packer to Build VMware-Compatible Vagrant Boxes

Packer is a tool for creating machine images, also written by HashiCorp, the authors of Vagrant. It can build machine images for almost any type of environment, including Amazon AWS, Docker, Google Compute Engine, KVM, Vagrant, VMware, Xen, and more.

We used Packer’s built-in VMware builder and Vagrant post-processor to create the Vagrant boxes for users to run on their local desktops/laptops via VMware Fusion or Workstation.

Note: This required each user to install Vagrant’s for-purchase VMware plugin. In our experience running Vagrant boxes locally, the VMware virtualization providers delivered far better IO performance and stability than the free Oracle VirtualBox provider. In short, the for-purchase Vagrant-VMware plugin was worth every penny!
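For reference, the end-to-end flow is just a couple of commands once the Packer template is written. This is a sketch only; the template and box file names below are hypothetical, and the exact name of the box that Packer emits depends on the post-processor configuration:

# Validate and then run the Packer template that defines the VMware builder
# and the Vagrant post-processor (template name is hypothetical).
packer validate devenv.json
packer build devenv.json

# The Vagrant post-processor emits a .box file, which users register locally
# and then bring up with the for-purchase VMware provider.
vagrant box add devenv devenv_vmware.box
vagrant up --provider=vmware_fusion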

Running VMware Workstation VMs Nested in ESXi

One of the hurdles I came across in integrating the building of the Vagrant boxes into our existing build system is that Packer’s VMware builder needs to spin up a VM using Workstation or Fusion in order to perform configuration of the Vagrant box. Given that our builds were already running in static VMs, this meant that we needed to be able to run Workstation VMs nested within an ESXi VM with a Linux guest OS!

This sort of VM nesting was somewhat complicated to set up in the days of vSphere 5.0, but with vSphere 5.1+ it has become a lot simpler: you just need to make sure that your VMs are running with “Virtual Hardware Version 9” or newer, and enable “Hardware assisted virtualization” for the VM within the vSphere Web Client.
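If you prefer not to click through the web client, the same effect can be had by appending the hardware-virtualization flag directly to the powered-off VM’s configuration file. This is a sketch; the datastore path below is hypothetical, and “vhv.enable” is the key used with virtual hardware version 9+:

# Expose hardware-assisted virtualization to the guest OS by appending the
# flag to the VM's .vmx file (path is hypothetical; VM must be powered off).
echo 'vhv.enable = "TRUE"' >> /vmfs/volumes/datastore1/build-vm/build-vm.vmx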

Here’s what the correct configuration for supporting nested VMs looks like:

(Screenshot: the VM’s settings in the vSphere Web Client with “Hardware assisted virtualization” enabled)

Packer’s Built-in Remote vSphere Hypervisor Builder

One question that an informed user of Packer may correctly ask is: “Why not use Packer’s built-in Remote vSphere Hypervisor Builder and create the VM directly on ESXi? Wouldn’t this remove the need for running nested VMs?”

I agree that this would be a better solution in theory. There are several reasons why I chose to go with nested VMs instead:

  1. The “Remote vSphere Hypervisor Builder” requires manually running an “esxcli” command on your ESXi boxes to enable some sort of “GuestIP hack”. Doing this type of configuration on our production ESXi cluster seemed sketchy to me.
  2. The “Remote vSphere Hypervisor Builder” doesn’t work through vSphere, but instead directly ssh’es into your ESXi boxes as a privileged user in order to create the VM. The login credentials for that privileged ESXi/ssh user must be kept in the Packer build script or some other area of our build system. Again, this seems less than ideal to me.
  3. As far as I can tell from the docs, the “Remote vSphere Hypervisor Builder” only works with the “vmware-iso” builder and not the “vmware-vmx” builder. This would’ve painted us into a corner as we had plans to switch from the “vmware-iso” builder to the “vmware-vmx” builder once it had become available.
  4. The “Remote vSphere Hypervisor Builder” was not available when I implemented our nested VM solution because we were early adopters of Packer. It was easier to stick with a working solution that we already had 😛

Automating the Install of VMware Workstation via Puppet

One other mechanical piece I’ll share is how we automated the installation of VMware Workstation 10.0 into our static build VMs. Since all of the build VM configuration is done via Puppet, we could automate the installation of Workstation 10 with the following bit of Puppet code:

# Install VMware Workstation 10
  $vmware_installer = '/mnt/devops/software/vmware/VMware-Workstation-Full-10.0.0-1295980.x86_64.bundle'
  $vmware_installer_options = '--eulas-agreed --required'
  exec {'Install VMware Workstation 10':
    command => "${vmware_installer} ${vmware_installer_options}",
    creates => '/usr/lib/vmware/config',
    user    => 'root',
    require => [Mount['/mnt/devops'], Package['kernel-default-devel']],
  }

Building Docker Images within Docker Containers via Jenkins

If you’re like me and you’ve Dockerized your build process by running your Jenkins builds from within dynamically provisioned Docker containers, where do you turn next? You may want the creation of any Docker images themselves to also happen within Docker containers. In other words, running Docker nested within Docker (DinD).


I’ve recently published a Docker image to facilitate building other Docker images from within Jenkins/Docker slave containers. Details at:

Why would one want to build Docker images nested within Docker containers?

  1. For consistency. If you’re building your JARs, RPMs, etc, from within Docker containers, it makes sense to use the same high-level process for building other artifacts such as Docker images.
  2. For Docker version freedom. As I mentioned in a previous post, the Jenkins/Docker plugin can be finicky with regards to compatibility with the version of Docker that you are running on your base OS. In other words, Jenkins/Docker plugin 0.7 will not work with Docker 1.2+, so if you really need a feature from a newer version of Docker when building your images, you either have to wait for a fix from the Jenkins plugin author, or you can run Docker-nested-in-Docker with the plugin-compatible Docker 1.1.x on the host and a newer version of Docker nested within the container. Yes, this actually works!
  3. This:
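For the curious, launching such a nested-Docker build slave by hand looks roughly like the following. This is only a sketch: the image name mirrors my GH repo, and at the time the outer container had to run privileged for a Docker daemon to start inside it:

# Run the Docker-in-Docker Jenkins slave; "--privileged" is what allows a
# nested Docker daemon to run inside the container. sshd is exposed so that
# Jenkins can treat the container like any other SSH slave.
docker run -d --privileged -p 2222:22 tehranian/dind-jenkins-slave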

Docker + Jenkins: Dynamically Provisioning SLES 11 Build Containers

TL; DR

Using Jenkins’ Docker Plugin, we can dynamically spin up SLES 11 build slaves on demand to run our builds. One of the hurdles to getting there was to create a SLES 11 Docker base image, since there are no SLES 11 container images available at the Docker Hub Registry. We used SUSE’s Kiwi imaging tool to create a base SLES 11 Docker image for ourselves, and then layered our build environment and Jenkins build slave support on top of it. After configuring Jenkins’ Docker plugin to use our home-grown SLES image, we were off and running with our containerized SLES builds!

Jenkins/Docker Plugin

The path to Docker-izing our build slaves started with stumbling across this Docker Plugin for Jenkins: https://wiki.jenkins-ci.org/display/JENKINS/Docker+Plugin. This plugin allows one to use Docker to dynamically provision a build slave, run a single build, and then tear down that slave, optionally saving it. This is very similar in workflow to the build VM provisioning system that I created while working in VMware’s Release Engineering team, but much lighter weight. Compared to VMs, Docker containers can be spun up in milliseconds instead of in a few minutes, and they are much lighter on hardware resources.

The above link to the Jenkins wiki provides details about how to configure your environment as well as how to configure your container images. Some high-level notes:

  • Your base OS needs to have Docker listening on a TCP port. By default, Docker only listens on a Unix socket. (A quick way to verify the TCP listener is shown after this list.)
  • The container needs to run “sshd” so that Jenkins can connect to it. I suspect that once the container is provisioned, Jenkins just treats it as a plain-old SSH slave.
  • In my testing, the Docker/Jenkins plugin was not able to connect via SSH to the containers it provisioned when using Docker 1.2.0. After trial and error, I found that the current version of the Jenkins plugin (0.6) works well with Docker 1.0-1.1.2, but Docker 1.2.0+ did not work with this Jenkins Plugin. I used Puppet to make sure that our Ubuntu build server base VMs only had Docker 1.1.2 installed. Ex:
    • # VW-10576: install docker on the ubuntu master/slaves
      # * Have Docker listen on a TCP port per instructions at:
      # https://wiki.jenkins-ci.org/display/JENKINS/Docker+Plugin
      # * Use Docker 1.1.2 and not anything newer. At the time of writing this
      # comment, Docker 1.2.0+ does not work with the Jenkins/Docker
      # plugin (the port for sshd fails to map to an external port).
      class { 'docker':
        tcp_bind => 'tcp://0.0.0.0:4243',
        version  => '1.1.2',
      }
  • There is a sample Docker/Jenkins slave based on “ubuntu:latest” available at: https://registry.hub.docker.com/u/evarga/jenkins-slave/. I would recommend getting that working as a proof-of-concept before venturing into building your own custom build slave containers. It’s helpful to be familiar with the “Dockerfile” for that image as well: https://registry.hub.docker.com/u/evarga/jenkins-slave/dockerfile/
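Before wiring the Docker host into Jenkins, it’s worth sanity-checking that the daemon really is reachable over TCP. A minimal check (the hostname below is hypothetical; the port matches the Puppet snippet above):

# Hit the Docker Remote API over TCP; a JSON blob containing the daemon's
# version indicates the TCP listener is up and reachable.
curl http://ubuntu-docker-host.lab.vi.local:4243/version

# Equivalently, point a local docker client at the remote daemon.
docker -H tcp://ubuntu-docker-host.lab.vi.local:4243 info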

Once you have the Docker Plugin installed, you need to go to your Jenkins “System Configuration” page and add your Docker host as a new cloud provider. In my proof-of-concept case, this is an Ubuntu 12.04 VM running Docker 1.1.2, listening on port 4243, configured to use the “evarga/jenkins-slave” image, providing the “docker-slave” label which I can then configure my Jenkins build job to be restricted to. The Jenkins configuration looks like this:

Jenkins' "System Configuration" for a Docker host

Jenkins’ “System Configuration” for a Docker host

I then configured a job named “docker-test” to use that “docker-slave” label and run a shell script with basic commands like “ps -eafwww”, “cat /etc/issue”, and “java -version”. Running that job, I see that it successfully spins up a container of “evarga/jenkins-slave” and runs my little script. Note the hostname at the top of the log, and output of “ps” in the screenshot below:


A proof-of-concept of spinning up a Docker container on demand

 

Creating Our SLES 11 Base Image

Having built up the confidence that we can spin up other people’s containers on-demand, we now turned to creating our SLES 11 Docker build image. For reasons that I can only assume are licensing issues, SLES 11 does not have a base image up on the Docker Hub Registry in the same vein as the images that Ubuntu, Fedora, CentOS, and others have available.

Luckily I stumbled upon the following blog post: http://flavio.castelli.name/2014/05/06/building-docker-containers-with-kiwi/

At Virtual Instruments we were already using Kiwi to build the OVAs of our build VMs, so it wasn’t much more work to follow that blog post and get Kiwi to generate a tarball that could be consumed by “docker import”. This worked well for the next proof-of-concept phase, but ultimately we decided to go down another path.

Rather than have Kiwi generate fully configured build images for us, we decided it’d be best to follow the conventions of the “Docker Way” and have Kiwi generate a SLES 11 base image which we could then use with a “FROM” statement in a “Dockerfile”, installing the build environment via the Dockerfile. One of the advantages of this is that we only have to use Kiwi to generate the base image the first time; from there we can stay in Docker-land to build the subsequent images. Additionally, having a shared base image among all of our build image tags should allow for space savings as Docker optimizes the layering of filesystems over a common base image.
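Under this model Kiwi only has to run once, to produce the root filesystem tarball for the base image; everything after that is plain Docker. Importing and publishing that base image looks roughly like this (the tarball and image names are hypothetical; the registry hostname is the internal one referenced in the Dockerfile below):

# Import the Kiwi-generated root filesystem tarball as a Docker base image.
cat sles11-base.tar | docker import - vi-docker.lab.vi.local/sles11-base

# Push it to our internal registry so that downstream Dockerfiles can
# reference it with a "FROM" statement.
docker push vi-docker.lab.vi.local/sles11-base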

Configuring the Image for Use with Jenkins

Taking a SLES 11 image with our build environment installed and getting it to work with the Jenkins Docker plugin took a little bit of work, mainly spent trying to configure “sshd” correctly. Below is the “Dockerfile” that builds upon a SLES image with our build environment installed and prepares it for use with Jenkins:

# This Dockerfile is used to build an image containing basic
# configuration to be used as a Jenkins slave build node.

FROM vi-docker.lab.vi.local/pa-dev-env-master
MAINTAINER Dan Tehranian <REDACTED@virtualinstruments.com>


# Add user & group "jenkins" to the image and set its password
RUN groupadd jenkins
RUN useradd -m -g jenkins -s /bin/bash jenkins
RUN echo "jenkins:jenkins" | chpasswd


# Having "sshd" running in the container is a requirement of the Jenkins/Docker
# plugin. See: https://wiki.jenkins-ci.org/display/JENKINS/Docker+Plugin

# Create the ssh host keys needed for sshd
RUN ssh-keygen -A

# Fix sshd's configuration for use within the container. See VW-10576 for details.
RUN sed -i -e 's/^UsePAM .*/UsePAM no/' /etc/ssh/sshd_config
RUN sed -i -e 's/^PasswordAuthentication .*/PasswordAuthentication yes/' /etc/ssh/sshd_config

# Expose the standard SSH port
EXPOSE 22

# Start the ssh daemon
CMD ["/usr/sbin/sshd -D"]

Running a Maven Build Inside of a SLES 11 Docker Container

Having created this new image and pushed it to our internal docker repo, we can now go back to Jenkins’ “System Configuration” page and add a new image to our Docker cloud provider. Creating a new Jenkins “Maven Job” which utilizes this new SLES 11 image and running a build, we can see our SLES 11 container getting spun up, code getting checked out from our internal git repo, and Maven being invoked:


Hooray! A successful Maven build inside of a Docker container!


Output from the Maven Build that was run in the container. LGTM!

 

Wins

There are a whole slew of benefits to a system like this:

  • We don’t have to run & support SLES 11 VMs in our infrastructure alongside the easier-to-manage Ubuntu VMs. We can just run Ubuntu 12.04 VMs as the base OS and spin up SLES slaves as needed. This makes testing of our Puppet repository a lot easier as this gives us a homogeneous OS environment!
  • We can have portable and separate build environment images for each of our branches. Ex: legacy product branches can continue to have old versions of the JDK and third party libraries that are updated only when needed, but our mainline development can have a build image with tools that are updated independently.
    • This is significantly better than the “toolchain repository” solution that we had at VMware, where several hundred GBs of binaries were checked into a monolithic Perforce repo.
  • Thanks to Docker image tags, we can tag the build image at each GA release and keep that build environment saved. This makes reproducing builds significantly easier!
  • Having a Docker image of the build environment allows our developers to do local builds via their IDEs, if they so choose. Using Vagrant’s Docker provider, developers can spin up a Docker container of the build environment for their respective branch on their local machines, regardless of their host OS – Windows, Mac, or Linux. This allows developers to build RPMs with the same libraries and tools that the build system would!
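On that last point: assuming the branch carries a Vagrantfile that declares the build-environment image via the Docker provider (an assumption about the setup, not something shown here), spinning it up locally is a one-liner:

# Bring up the build-environment container on a developer's machine.
vagrant up --provider=docker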

A Local Caching Proxy for “pypi.python.org” via Docker

TL; DR

If your infrastructure automation installs packages from PyPI (the Python Package Index) via “pip” or similar tools, you can save yourself from annoying “pypi.python.org timeout” errors by running a local caching proxy of the PyPI service. After trying several of these services, I found “devpi” to be the most resilient. It’s available as both a Python package or as a Docker container that you can run in your data center.

Problem

If you have infrastructure automation that tries to install packages from PyPI then you’ve undoubtedly encountered availability issues with the PyPI web service, hosted at “pypi.python.org”. Example email alerts that we see from our Puppet infrastructure look like:

Tue Sep 02 23:47:23 -0700 2014 /Stage[main]/Jenkins::Dev_jenkins_slave/Package[jenkins-check-for-success]
(err): Could not evaluate: Could not get latest version: 
Timeout while contacting pypi.python.org: execution expired

One could choose to ignore these sorts of connectivity issues since they are transient, but there’s quite a few negative consequences to that:

  • If you’re spinning up new machines on-demand and they require a Python package as part of their configuration, then your ability to consistently spin up these machines successfully has become compromised by a dependency which is completely out of your own control.
  • If your infrastructure automation is configured to send email alerts on these types of errors, you’ll be getting un-actionable emails that add to the noise of email alerts that you get from your infrastructure. This effectively makes your alerting system less valuable, as your team will be trained to ignore their email alerts.
  • As you scale your infrastructure to hundreds or thousands of nodes, you’ll be receiving a lot of alerts about connectivity issues with “pypi.python.org” throughout the day and night time hours. When “pypi.python.org” goes down hard for a prolonged period of time, you’ll end up with all of the nodes in your infrastructure simultaneously bombarding you with alerts about not being able to contact “pypi.python.org”.

Solution

The solution for this problem is to run a local caching proxy for “pypi.python.org” within your data center. The Python community has developed proxy packages like pypiserver, chishop, devpi, and others specifically for this use case. After extensive research and trying several of them out, I’ve found devpi to be the most resilient as well as the most actively developed as of this writing.

One can either install devpi as a Python package (see instructions on their website) or via a Docker container. Since our infrastructure has been making the move to “All Docker Everything” I’ll write up the steps I took to setup the Docker container running devpi and how I configured our clients to use it.

Devpi Server Installation

Here’s some sample Puppet code for how to download & run the “scrapinghub/devpi” container with an nginx proxy in front of it. (I discussed why having an nginx proxy in front is advantageous in Private Docker Registry w/Nginx Proxy for Stats Collection)

You’ll want to change “DEVPI_PASSWORD” and the hostname for the Nginx vhost below.

# devpi server & nginx configuration
docker::image { 'scrapinghub/devpi': }
docker::run { 'devpi':
    image => 'scrapinghub/devpi',
    ports => ['3141:3141',],
    use_name => true,
    env => ['DEVPI_PASSWORD=1234',],
}

nginx::resource::upstream { 'pypi_app':
    members => ['localhost:3141',],
}
nginx::resource::vhost { 'vi-pypi.lab.vi.local':
    proxy => 'http://pypi_app',
}

Once your container is running you can run “docker logs” to see what it is up to. You can see the “devpi” proxy saving your bacon when “pypi.python.org” occasionally becomes unavailable via log statements like this:

172.17.42.1 - - [22/Jul/2014 22:12:09] "GET /root/public/+simple/argparse/ HTTP/1.0" 200 4316
2014-07-22 22:12:09,251 [INFO ] requests.packages.urllib3.connectionpool: Resetting dropped connection: pypi.python.org
2014-07-22 22:12:09,301 [INFO ] devpi_server.filestore: cache-streaming: https://pypi.python.org/packages/source/a/argparse/argparse-1.2.1.tar.gz, target root/pypi/+f/2fb/ef8cb61e506c706957ab6e135840c/argparse-1.2.1.tar.gz
2014-07-22 22:12:09,301 [INFO ] devpi_server.filestore: starting file iteration: root/pypi/+f/2fb/ef8cb61e506c706957ab6e135840c/argparse-1.2.1.tar.gz (size 69297)

Python Client Configuration

On the client side we need to configure both “pip” and “easy_install” to use the devpi container we just instantiated. This requires creating a special configuration file for each of those Python package managers. The configuration file tells those package managers to use your devpi proxy server for their package index URL.

You’ll want to change the URL to point to the hostname you use within your own infrastructure.

# ~/.pip/pip.conf

[global]
index-url = http://vi-pypi.lab.vi.local/root/public/

# ~/.pydistutils.cfg

[easy_install]
index_url = http://vi-pypi.lab.vi.local/root/public/
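With those files in place, every plain “pip install” will go through the proxy. You can also point at the index explicitly to verify that the devpi container is serving packages (hostname as above; “requests” is just an arbitrary example package):

# Install a package via the local devpi proxy rather than pypi.python.org.
pip install --index-url http://vi-pypi.lab.vi.local/root/public/ requests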

But Wait There’s More – Uploading Your Own Python Packages

One of the additional benefits to running a local PyPI proxy is that it becomes a distribution point for your private Python packages. Instead of clumsily checking out SCM repos full of your own custom Python scripts to each machine in your infrastructure, you can install your Python scripts as first-order Python packages, the same way you would install packages from PyPI. This lets you properly version your packages and define dependency requirements between your packages.

Creating a “setup.py” file for each of your Python projects is outside the scope of this post, but details can be found online. Once your Python project has its “setup.py” file, uploading your versioned package to your local devpi instance requires just a few simple commands. From our Jenkins job which publishes a new version of a Python package upon a git push:

# from the cwd of "setup.py"
devpi use http://vi-pypi.lab.vi.local/root/public/
devpi login root --password 1234 
devpi upload

More details at: http://doc.devpi.net/latest/quickstart-releaseprocess.html

Conclusion

By running a local caching proxy of “pypi.python.org” we’re able to improve the reliability of our infrastructure because we are no longer beholden to the availability of an external dependency. We also get the added benefit of having a proper Python package distribution point, which allows us to have better development & deployment practices. Finally, this local caching proxy provides better performance for installing packages, as local network copies are significantly faster than downloading from an external website.

Building a Better Dashboard for Virtual Infrastructure

TL; DR

We built a pretty sweet dashboard for our R&D infrastructure using Graphite, Grafana, collectd, and a home-made VMware VIM API data collector script.

Screenshot of Our Dashboard

Background – Tracking Build Times

The second major project I worked on after joining Virtual Instruments resulted in improving the product’s build time 15x, from 5 hours down to 20 minutes. Having spent a lot of time and effort to accomplish that goal, I wanted to track build time metrics going forward by having the build time of each successful build recorded into a system where it could be visualized. Setting up the collection and visualization of this data seemed like a great intern project, and to that end, for the Summer of 2014 we brought back our awesome intern from the previous Summer, Ryan Henrick.

Within a few weeks Ryan was able to setup a Graphite instance and add a “post-build action” for each of our Jenkins jobs (via our DSL’d job definitions in SCM) in order to have Jenkins push this build time data to the Graphite server. From Graphite’s web UI we could then visualize this build time data for each of the components of our product.
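Conceptually, what lands in Graphite is just its plaintext protocol: a metric path, a value, and a Unix timestamp sent to Carbon’s line receiver (port 2003 by default). A hand-rolled datapoint looks something like this (the metric name is hypothetical; the Graphite host is the one used elsewhere in this post):

# Send one datapoint to Graphite's plaintext listener: "<metric> <value> <timestamp>".
echo "build.myproduct.duration_seconds 1200 $(date +%s)" | nc vi-devops-graphite1.lab.vi.local 2003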

Here’s an example of our build time data visualized in Graphite. As you can see it is functionally correct, but it’s difficult to get excited about using something like this:

Graphite Visualization of Build Times

Setting up the Graphite server with Puppet looks something like this:

  class { 'graphite':
    gr_max_creates_per_minute    => 'inf',
    gr_max_updates_per_second    => 2000,

    # This configuration reads from top to bottom, returning first match
    # of a regex pattern. "default" catches anything that did not match.
    gr_storage_schemas        => [
      {
        name                => 'build',
        pattern             => '^build.*',
        retentions          => '1h:60d,12h:5y'
      },
      {
        name                => 'carbon',
        pattern             => '^carbon\.',
        retentions          => '1m:1d,10m:20d'
      },
      {
        name                => 'collectd',
        pattern             => '^collectd.*',
        retentions          => '10s:20m,30s:1h,3m:6h,12m:1d,1h:7d,4h:60d,1d:2y'
      },
      {
        name                => 'vCenter',
        pattern             => '^vCenter.*',
        retentions          => '1m:12h,5m:1d,1h:7d,4h:2y'
      },
      {
        name                => 'default',
        pattern             => '.*',
        retentions          => '10s:20m,30s:1h'
      }
    ],
    gr_timezone                => 'America/Los_Angeles',
    gr_web_cors_allow_from_all => true,
  }

There are some noteworthy settings there:

  • gr_storage_schemas – This is how you define the stat roll-ups, i.e. “10 second data for the first 2 hours, 1 minute data for the next 24 hours, 10 minute data for the next 7 days”, etc. See Graphite docs for details.
  • gr_max_creates_per_minute – This limits how many new “buckets” Graphite will create per minute. If you leave it at the default of “50” and then unleash all kinds of new data/metrics onto Graphite, it may take some time for the database files for each of the new metrics to be created. See Graphite docs for details.
  • gr_web_cors_allow_from_all – This is needed for the Grafana UI to talk to Graphite (see later in this post)

Collecting More Data with “collectd”

After completing the effort to log all Jenkins build time data to Graphite, our attention turned to what other sorts of things we could measure and visualize. We did a 1-day R&D Hackathon project around collecting metrics from the various production web apps that we run within the R&D org, ex: Jira, Confluence, Jenkins, Nexus, Docker Registry, etc.

Over the course of doing our Hackathon project we realized that there was a plethora of tools available to integrate with Graphite. Most interesting to us was “collectd” with its extensive set of plugins, and Grafana, a very rich Node.js UI app for rendering Graphite data in much more impressive ways than the static “Build Time” chart that I included above.

We quickly got to work and leveraged Puppet’s collectd module to collect metrics about CPU, RAM, network, disk IO, disk space, and swap activity from the guest OS’s of all of our production VMs that run Puppet. We were also able to quickly implement collectd’s postgresql and nginx plugins to measure stats about the aforementioned web apps.

Using Puppet, we can deploy collectd onto all of our production VMs and configure it to send its data to our Graphite server with something like:

# Install collectd for SLES & Ubuntu hosts; Not on OpenSuSE (yet)
  case $::operatingsystem {
    'SLES', 'Ubuntu': {
      class { '::collectd':
        purge        => true,
        recurse      => true,
        purge_config => true,
        version      => installed,
      }
      class { 'collectd::plugin::df': }
      class { 'collectd::plugin::disk':
        disks          => ['/^dm/'],
        ignoreselected => true,
      }
      class { 'collectd::plugin::swap':
        reportbydevice => false,
        reportbytes    => true,
      }
      class { 'collectd::plugin::write_graphite':
        protocol     => 'tcp',
        graphitehost => 'vi-devops-graphite1.lab.vi.local',
      }
      collectd::plugin { 'cpu': }
      collectd::plugin { 'interface': }
      collectd::plugin { 'load': }
      collectd::plugin { 'memory': }
      collectd::plugin { 'nfs': }
    }
    default: {}
  }

For another example w/details on monitoring an HTTP server, see my previous post: Private Docker Registry w/Nginx Proxy for Stats Collection

Collecting vCenter Data with pyVmomi

After the completion of the Hackathon, I pointed the intern to the pyvmomi Python module and the sample scripts available at http://vmware.github.io/pyvmomi-community-samples/, and told him to get crackin’ on making a Python data collector that would connect to our vCenter instance, grab metrics about ESXi cluster CPU & memory usage, VMFS datastore usage, along with per-VM CPU and memory usage, and ship that data off to our Graphite server.

Building a Beautiful Dashboard with Grafana

With all of this collectd and vCenter data now being collected in Graphite, the only thing left was to create a beautiful dashboard for visualizing what’s going on in our R&D infrastructure. We leveraged Puppet’s grafana module to setup a Grafana instance.

Grafana is a Node.js app that talks directly to Graphite’s WWW API (bypassing Graphite’s very rudimentary static image rendering layer) and pulls in the raw series data that you’ve stored in Graphite. Grafana allows you to build dashboards suitable for viewing in your web browser or for tossing up on a TV display. It has all kinds of sexy UI features like drag-to-zoom, mouse-over-for-info, and annotation of events.

There’s a great live demo of Grafana available at: http://play.grafana.org

Setting up the Grafana server with Puppet is pretty simple. The one trick is that in order to save your dashboards you need to set up Elasticsearch:

  class { 'elasticsearch':
    package_url => 'https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.2.2.deb',
    java_install => true,
  }

  elasticsearch::instance { 'grafana': }

  class { 'nginx':
      confd_purge => true,
      vhost_purge => true,
  }

  nginx::resource::vhost { 'localhost':
    www_root       => '/opt/grafana',
    listen_options => 'default_server',
  }

  class { 'grafana':
    install_dir         => '/opt',
    elasticsearch_host  => $::fqdn,
  }

Closing Thoughts

One of the main benefits of this solution is that we can have our data from disparate data sources in one place, and plot these series against one another. Unlike vCenter’s “Performance” graphs, which are slow to set up and regularly time out before rendering, days or weeks of rolled-up data will load very quickly in Graphite/Grafana.

Screenshot of Our Dashboard

Acknowledgements

  • Thanks to our intern Ryan Henrick for his work this past Summer in implementing these data collectors and creating the visualizations.
  • Thanks to Alan Grosskurth from Package Lab for pointing me to Grafana when I told him what we were up to with our Hackathon project.

RELENG 2014 Wrap-Up

I was a speaker at the 2nd International Workshop on Release Engineering @ Google HQ last week.  I enjoyed meeting up with other release engineers and sharing ideas with them.  Here are some talks I enjoyed:

Finally, my own presentation went quite well.  There were a lot of people in the audience that came up to chat with me afterwards and it seemed that the message really resonated with the audience (confirmation bias, much? : ).  From the sounds of it, I may have an opportunity to give an extended version of this talk at Google and VMware in the near future, which is great because there were several ROI examples and software industry anecdotes around quality and time-to-market that I had to cut out to meet the time requirements.

Here are the slides I used: