Testing Ansible Roles with Test Kitchen

Recently, while attending DevOps Days Austin 2015, I participated in a breakout session focused on how to test code for configuration management tools like Puppet, Chef, and Ansible. Having started to use Ansible to manage our infrastructure at Delphix, I was searching for a way to automate the testing of our configuration management code across a variety of platforms, including Ubuntu, CentOS, RHEL, and Delphix’s custom Illumos-based OS, DelphixOS. Testing across all of those platforms is a daunting task, to say the least!

Intro to Test Kitchen

The conversation in that breakout session introduced me to Test Kitchen (GitHub), a tool that I’ve been very impressed by and have had quite a bit of fun writing tests for. Test Kitchen is a tool for automated testing of configuration management code written for tools like Ansible. It automates the process of spinning up test VMs, running your configuration management tool against those VMs, executing verification tests against those VMs, and then tearing down the test VMs.

What makes Test Kitchen so powerful and useful is its modular design: the drivers that provision test machines, the provisioners that apply your configuration management code, and the test frameworks that verify the results are all pluggable and can be mixed and matched.

Using Test Kitchen

After learning about Test Kitchen at the DevOps Days conference, I did some more research and stumbled across the following presentation, which was instrumental in getting me started with Test Kitchen and Ansible: Testing Ansible Roles with Test Kitchen, Serverspec and RSpec (SlideShare).

In summary, one needs to add three files to their Ansible role to begin using Test Kitchen:

  • A “.kitchen.yml” file at the top level of the role (see the sample after this list). This file describes:
    • The driver to use for VM provisioning. Ex: Vagrant, AWS, Docker, etc.
    • The provisioner to use. Ex: Puppet, Chef, Ansible.
    • A list of one or more operating systems to test against. Ex: Ubuntu 12.04, Ubuntu 14.04, CentOS 6.5, or even a custom VM image specified by URL.
    • A list of test suites to run.
  • A “test/integration/test-suite-name/test-suite-name.yml” file which contains the Ansible playbook to be applied.
  • One or more test files in “test/integration/test-suite-name/test-driver-name/”. For example, when using the BATS test-runner to run a test suite named “default”: “test/integration/default/bats/my-test.bats”.
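To make that concrete, here is a minimal sketch of what a “.kitchen.yml” might look like for an Ansible role using the Vagrant driver and the kitchen-ansible provisioner. The platform names and provisioner options below are illustrative rather than copied from the actual delphix.package-caching-proxy role, and the exact settings available depend on the version of the kitchen-ansible gem:

  ---
  driver:
    name: vagrant

  provisioner:
    name: ansible_playbook
    hosts: all
    require_ansible_repo: true

  platforms:
    - name: ubuntu-14.04
    - name: centos-6.5

  suites:
    - name: default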

Example Code

A full example of Test Kitchen with Ansible is available via the delphix.package-caching-proxy Ansible role in Delphix’s GitHub repo, which contains all of the aforementioned files and directories.

Running Test Kitchen

Using Test Kitchen couldn’t be easier. From the directory that contains your “.kitchen.yml” file, just run “kitchen test” to automatically create your VMs, configure them, and run tests against them:

$ kitchen test
-----> Starting Kitchen (v1.4.1)
-----> Cleaning up any prior instances of <default-ubuntu-1404>
-----> Destroying <default-ubuntu-1404>...
 Finished destroying <default-ubuntu-1404> (0m0.00s).
-----> Testing <default-ubuntu-1404>
-----> Creating <default-ubuntu-1404>...
 Bringing machine 'default' up with 'virtualbox' provider...
 ==> default: Importing base box 'opscode-ubuntu-14.04'...
==> default: Matching MAC address for NAT networking...
 ==> default: Setting the name of the VM: kitchen-ansible-package-caching-proxy-default-ubuntu-1404_default_1435180384440_80322
 ==> default: Clearing any previously set network interfaces...
 ==> default: Preparing network interfaces based on configuration...
 default: Adapter 1: nat
 ==> default: Forwarding ports...
 default: 22 => 2222 (adapter 1)
 ==> default: Booting VM...
 ==> default: Waiting for machine to boot. This may take a few minutes...

  ...

-----> Running bats test suite
 ✓ Accessing the apt-cacher-ng vhost should load the configuration page for Apt-Cacher-NG
 ✓ Hitting the apt-cacher proxy on the proxy port should succeed
 ✓ The previous command that hit ftp.debian.org should have placed some files in the cache
 ✓ Accessing the devpi server on port 3141 should return a valid JSON response
 ✓ Accessing the devpi server via the nginx vhost should return a valid JSON response
 ✓ Downloading a Python package via our PyPI proxy should succeed
 ✓ We should still be able to install Python packages when the devpi container's backend is broken
 ✓ The vhost for the docker registry should be available
 ✓ The docker registry's /_ping url should return valid JSON
 ✓ The docker registry's /v1/_ping url should return valid JSON
 ✓ The front-end server's root url should return http 204
 ✓ The front-end server's /_status location should return statistics from our web server
 ✓ Accessing http://www.google.com through our proxy should always return a cache miss
 ✓ Downloading a file that is not in the cache should result in a cache miss
 ✓ Downloading a file that is in the cache should result in a cache hit
 ✓ Setting the header 'X-Refresh: true' should result in a bypass of the cache
 ✓ Trying to purge when it's not in the cache should return 404
 ✓ Downloading the file again after purging from the cache should yield a cache miss
 ✓ The yum repo's vhost should return HTTP 200

 19 tests, 0 failures
 Finished verifying <default-ubuntu-1404> (1m52.26s).
-----> Kitchen is finished. (1m52.49s)

And there you have it, one command to automate your entire VM testing workflow!

Next Steps

Giving individual developers on our team the ability to quickly run a suite of automated tests is a big win, but that’s only the first step. The workflow we’re planning is to have Jenkins also run these automated Ansible tests every time someone pushes to our git repo. If those tests succeed we can automatically trigger a run of Ansible against our production inventory. If, on the other hand, the Jenkins job which runs the tests is failing (red), we can use that to prevent Ansible from running against our production inventory. This would be a big win for validating infrastructure changes before pushing them to production.


Best VMworld 2014 Sessions

The videos from the VMworld 2014 Sessions have slowly been making their way online over the past few months. Some videos are freely available on the VMworld 2014 YouTube Playlist. Other videos are still only available to attendees of the conference who have credentials to access the VMworld 2014 video site. Hopefully all of the videos will be made available to the public on YouTube in the near future.

I’ve gone through most of the videos and found the following to be especially helpful in our team’s day-to-day activities:

Building Vagrant Boxes with Nested VMs using Packer

In “Improving Developer Productivity with Vagrant” I discussed the productivity benefits gained from using Vagrant in our software development tool chain. Here are some more details about the mechanics of how we created those Vagrant boxes as part of every build of our product.

Using Packer to Build VMware-Compatible Vagrant Boxes

Packer is a tool for creating machine images which, like Vagrant, was written by HashiCorp. It can build machine images for almost any type of environment, including Amazon AWS, Docker, Google Compute Engine, KVM, Vagrant, VMware, Xen, and more.

We used Packer’s built-in VMware builder and Vagrant post-processor to create the Vagrant boxes for users to run on their local desktops/laptops via VMware Fusion or Workstation.
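For illustration, a stripped-down Packer template that wires the VMware builder to the Vagrant post-processor looks roughly like the following. The ISO URL, checksum, and credentials are placeholders, not our actual values, and a real template would also include provisioners and a boot command:

  {
    "builders": [
      {
        "type": "vmware-iso",
        "iso_url": "http://example.com/isos/ubuntu-14.04-server-amd64.iso",
        "iso_checksum_type": "sha256",
        "iso_checksum": "REPLACE_WITH_REAL_CHECKSUM",
        "ssh_username": "vagrant",
        "ssh_password": "vagrant",
        "shutdown_command": "sudo shutdown -P now"
      }
    ],
    "post-processors": [
      {
        "type": "vagrant",
        "output": "builds/{{.Provider}}-ubuntu-14.04.box"
      }
    ]
  }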

Note: This required each user to install Vagrant’s for-purchase VMware plugin. In our usage of running Vagrant boxes locally we noted that the VMware virtualization providers delivered far better IO performance and stability than the free Oracle VirtualBox provider. In short, the for-purchase Vagrant-VMware plugin was worth every penny!

Running VMware Workstation VMs Nested in ESXi

One of the hurdles I came across in integrating the building of the Vagrant boxes into our existing build system is that Packer’s VMware builder needs to spin up a VM using Workstation or Fusion in order to configure the Vagrant box. Given that our builds were already running in static VMs, we needed to be able to run Workstation VMs nested within an ESXi VM with a Linux guest OS!

This sort of VM nesting was somewhat complicated to set up in the days of vSphere 5.0, but it has become a lot simpler in vSphere 5.1+. One just needs to make sure that the ESXi-hosted VMs are running with “Virtual Hardware Version 9” or newer, and enable “Hardware assisted virtualization” for the VM within the vSphere web client.
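For anyone scripting this rather than clicking through the web client, my understanding is that the same checkbox corresponds to the following setting in the VM’s .vmx file (worth verifying against your vSphere version before relying on it):

  vhv.enable = "TRUE"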

Here’s what the correct configuration for supporting nested VMs looks like:

Screenshot: VM settings in the vSphere web client with “Hardware assisted virtualization” enabled

Packer’s Built-in Remote vSphere Hypervisor Builder

One question that an informed user of Packer may correctly ask is: “Why not use Packer’s built-in Remote vSphere Hypervisor Builder and create the VM directly on ESXi? Wouldn’t this remove the need for running nested VMs?”

I agree that this would be a better solution in theory. There are several reasons why I chose to go with nested VMs instead:

  1. The “Remote vSphere Hypervisor Builder” requires manually running an “esxcli” command on your ESXi boxes to enable some sort of “GuestIP hack”. Doing this type of configuration on our production ESXi cluster seemed sketchy to me.
  2. The “Remote vSphere Hypervisor Builder” doesn’t work through vSphere, but instead directly ssh’es into your ESXi boxes as a privileged user in order to create the VM. The login credentials for that privileged ESXi/ssh user must be kept in the Packer build script or some other area of our build system. Again, this seems less than ideal to me.
  3. As far as I can tell from the docs, the “Remote vSphere Hypervisor Builder” only works with the “vmware-iso” builder and not the “vmware-vmx” builder. This would’ve painted us into a corner as we had plans to switch from the “vmware-iso” builder to the “vmware-vmx” builder once it had become available.
  4. The “Remote vSphere Hypervisor Builder” was not available when I implemented our nested VM solution because we were early adopters of Packer. It was easier to stick with a working solution that we already had 😛

Automating the Install of VMware Workstation via Puppet

One other mechanical piece I’ll share is how we automated the installation of VMware Workstation 10.0 into our static build VMs. Since all of the build VM configuration is done via Puppet, we could automate the installation of Workstation 10 with the following bit of Puppet code:

# Install VMware Workstation 10
  $vmware_installer = '/mnt/devops/software/vmware/VMware-Workstation-Full-10.0.0-1295980.x86_64.bundle'
  $vmware_installer_options = '--eulas-agreed --required'
  exec {'Install VMware Workstation 10':
    command => "${vmware_installer} ${vmware_installer_options}",
    creates => '/usr/lib/vmware/config',
    user    => 'root',
    require => [Mount['/mnt/devops'], Package['kernel-default-devel']],
  }

Building a Better Dashboard for Virtual Infrastructure

TL;DR

We built a pretty sweet dashboard for our R&D infrastructure using Graphite, Grafana, collectd, and a home-made VMware VIM API data collector script.

Screenshot of Our Dashboard

Background – Tracking Build Times

The second major project I worked on after joining Virtual Instruments resulted in improving the product’s build time 15x, from 5 hours down to 20 minutes. Having spent a lot of time and effort to accomplish that goal, I wanted to track build time metrics going forward by recording the build time of each successful build into a system where it could be visualized. Setting up the collection and visualization of this data seemed like a great intern project, and to that end, for the Summer of 2014, we brought back our awesome intern from the previous Summer, Ryan Henrick.

Within a few weeks Ryan was able to set up a Graphite instance and add a “post-build action” to each of our Jenkins jobs (via our DSL’d job definitions in SCM) in order to have Jenkins push this build time data to the Graphite server. From Graphite’s web UI we could then visualize this build time data for each of the components of our product.
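In our setup the actual push happens from a Jenkins post-build action, but the underlying mechanism is just Graphite’s plaintext protocol on TCP port 2003. Here is a minimal sketch of pushing a single build-time datapoint; the hostname and metric path are made up for illustration:

  import socket
  import time

  GRAPHITE_HOST = 'graphite.example.com'       # hypothetical Graphite server
  METRIC = 'build.myproduct.duration_seconds'  # hypothetical metric path
  VALUE = 1200                                 # e.g. a 20-minute build

  # Graphite's plaintext protocol: "<metric path> <value> <unix timestamp>\n"
  line = '%s %d %d\n' % (METRIC, VALUE, int(time.time()))
  sock = socket.create_connection((GRAPHITE_HOST, 2003))
  sock.sendall(line.encode('utf-8'))
  sock.close()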

Here’s an example of our build time data visualized in Graphite. As you can see, it is functionally correct, but it’s difficult to get excited about using something like this:

Graphite Visualization of Build Times

Setting up the Graphite server with Puppet looks something like this:

  class { 'graphite':
    gr_max_creates_per_minute    => 'inf',
    gr_max_updates_per_second    => 2000,

    # This configuration reads from top to bottom, returning first match
    # of a regex pattern. "default" catches anything that did not match.
    gr_storage_schemas        => [
      {
        name                => 'build',
        pattern             => '^build.*',
        retentions          => '1h:60d,12h:5y'
      },
      {
        name                => 'carbon',
        pattern             => '^carbon\.',
        retentions          => '1m:1d,10m:20d'
      },
      {
        name                => 'collectd',
        pattern             => '^collectd.*',
        retentions          => '10s:20m,30s:1h,3m:6h,12m:1d,1h:7d,4h:60d,1d:2y'
      },
      {
        name                => 'vCenter',
        pattern             => '^vCenter.*',
        retentions          => '1m:12h,5m:1d,1h:7d,4h:2y'
      },
      {
        name                => 'default',
        pattern             => '.*',
        retentions          => '10s:20m,30s:1h'
      }
    ],
    gr_timezone                => 'America/Los_Angeles',
    gr_web_cors_allow_from_all => true,
  }

There are some noteworthy settings there:

  • gr_storage_schemas – This is how you define the stat roll-ups, i.e., “10 second data for the first 2 hours, 1 minute data for the next 24 hours, 10 minute data for the next 7 days”, etc. See the Graphite docs for details.
  • gr_max_creates_per_minute – This limits how many new “buckets” Graphite will create per minute. If you leave it at the default of “50” and then unleash all kinds of new data/metrics onto Graphite, it may take some time for the database files for each of the new metrics to be created. See Graphite docs for details.
  • gr_web_cors_allow_from_all – This is needed for the Grafana UI to talk to Graphite (see later in this post).

Collecting More Data with “collectd”

After completing the effort to log all Jenkins build time data to Graphite, our attention turned to what other sorts of things we could measure and visualize. We did a 1-day R&D Hackathon project around collecting metrics from the various production web apps that we run within the R&D org, ex: Jira, Confluence, Jenkins, Nexus, Docker Registry, etc.

Over the course of doing our Hackathon project we realized that there was a plethora of tools available to integrate with Graphite. Most interesting to us were “collectd”, with its extensive set of plugins, and Grafana, a very rich Node.js UI app for rendering Graphite data in much more impressive ways than the static “Build Time” chart that I included above.

We quickly got to work and leveraged Puppet’s collectd module to collect metrics about CPU, RAM, network, disk IO, disk space, and swap activity from the guest OSes of all of our production VMs that run Puppet. We were also able to quickly implement collectd’s postgresql and nginx plugins to measure stats about the aforementioned web apps.

Using Puppet, we can deploy collectd onto all of our production VMs and configure it to send its data to our Graphite server with something like:

# Install collectd for SLES & Ubuntu hosts; Not on OpenSuSE (yet)
  case $::operatingsystem {
    'SLES', 'Ubuntu': {
      class { '::collectd':
        purge        => true,
        recurse      => true,
        purge_config => true,
        version      => installed,
      }
      class { 'collectd::plugin::df': }
      class { 'collectd::plugin::disk':
        disks          => ['/^dm/'],
        ignoreselected => true,
      }
      class { 'collectd::plugin::swap':
        reportbydevice => false,
        reportbytes    => true,
      }
      class { 'collectd::plugin::write_graphite':
        protocol     => 'tcp',
        graphitehost => 'vi-devops-graphite1.lab.vi.local',
      }
      collectd::plugin { 'cpu': }
      collectd::plugin { 'interface': }
      collectd::plugin { 'load': }
      collectd::plugin { 'memory': }
      collectd::plugin { 'nfs': }
    }
    default: {}
  }

For another example w/details on monitoring an HTTP server, see my previous post: Private Docker Registry w/Nginx Proxy for Stats Collection

Collecting vCenter Data with pyVmomi

After the completion of the Hackathon, I pointed the intern to the pyvmomi Python module and the sample scripts available at http://vmware.github.io/pyvmomi-community-samples/, and told him to get crackin’ on making a Python data collector that would connect to our vCenter instance, grab metrics about ESXi cluster CPU & memory usage and VMFS datastore usage, along with per-VM CPU and memory usage, and ship that data off to our Graphite server.
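To give a flavor of what such a collector looks like, here is a trimmed-down sketch using pyVmomi to walk the VM inventory and emit per-VM “quick stats” in Graphite’s plaintext format. The vCenter hostname, credentials, and metric naming are illustrative only; the real collector also gathered cluster CPU/memory and datastore usage, and shipped its lines to Graphite over TCP rather than printing them:

  import time
  from pyVim.connect import SmartConnect, Disconnect
  from pyVmomi import vim

  # Connect to vCenter (placeholder host/credentials; newer pyVmomi/Python
  # versions may require passing an SSL context or using SmartConnectNoSSL)
  si = SmartConnect(host='vcenter.example.com', user='readonly', pwd='secret')
  content = si.RetrieveContent()

  # Build a view of every VirtualMachine in the inventory
  vm_view = content.viewManager.CreateContainerView(
      content.rootFolder, [vim.VirtualMachine], True)

  now = int(time.time())
  for vm in vm_view.view:
      stats = vm.summary.quickStats
      name = vm.name.replace('.', '_')
      # overallCpuUsage is reported in MHz, guestMemoryUsage in MB
      print('vCenter.vm.%s.cpu_usage_mhz %d %d' % (name, stats.overallCpuUsage or 0, now))
      print('vCenter.vm.%s.mem_usage_mb %d %d' % (name, stats.guestMemoryUsage or 0, now))

  Disconnect(si)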

Building a Beautiful Dashboard with Grafana

With all of this collectd and vCenter data now being collected in Graphite, the only thing left was to create a beautiful dashboard for visualizing what’s going on in our R&D infrastructure. We leveraged Puppet’s grafana module to setup a Grafana instance.

Grafana is a Node.js app that talks directly to Graphite’s WWW API (bypassing Graphite’s very rudimentary static image rendering layer) and pulls in the raw series data that you’ve stored in Graphite. Grafana allows you to build dashboards suitable for viewing in your web browser or for tossing up on a TV display. It has all kinds of sexy UI features like drag-to-zoom, mouse-over-for-info, and annotation of events.

There’s a great live demo of Grafana available at: http://play.grafana.org

Setting up the Grafana server with Puppet is pretty simple. The one trick is that in order to save your dashboards you need to set up Elasticsearch:

  class { 'elasticsearch':
    package_url => 'https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.2.2.deb',
    java_install => true,
  }

  elasticsearch::instance { 'grafana': }

  class { 'nginx':
      confd_purge => true,
      vhost_purge => true,
  }

  nginx::resource::vhost { 'localhost':
    www_root       => '/opt/grafana',
    listen_options => 'default_server',
  }

  class { 'grafana':
    install_dir         => '/opt',
    elasticsearch_host  => $::fqdn,
  }

Closing Thoughts

One of the main benefits of this solution is that we can have our data from disparate data sources in one place, and plot these series against one another. Unlike vCenter’s “Performance” graphs, which are slow to set up and regularly time out before rendering, days or weeks of rolled-up data will load very quickly in Graphite/Grafana.

Screenshot of Our Dashboard

Acknowledgements

  • Thanks to our intern Ryan Henrick for his work this past Summer in implementing these data collectors and creating the visualizations.
  • Thanks to Alan Grosskurth from Package Lab for pointing me to Grafana when I told him what we were up to with our Hackathon project.

Vagrant VMware vCenter Simulator

When developing and testing a product that integrates with VMware vSphere, it can often be useful to have a vCenter to test against. Unfortunately for developers and testers, the process of setting up and managing their own vCenter instance, along with the necessary entities for it to manage, can be extremely time consuming. One alternative is to have a shared instance used by many people, but that can cause issues with engineers stepping on each other’s toes.

At Virtual Instruments we needed a way to test against VMware’s API for metrics collection. Leveraging the vCenter Server Appliance OVA and numerous posts on vCenter configuration automation from virtuallyGhetto, I came up with a builder to create a Vagrant box out of vCenter that runs in simulator mode. The vCenter simulator mode is built into vCenter and is an unsupported way to simulate an inventory of hosts, clusters, VMs, datastores, etc., along with performance metrics for those entities, which can then be queried via VMware’s API.

Note that the resulting Vagrant box can also be used with the “vagrant-vcenter” plugin to deploy these vCenter Simulators into a vSphere environment so that people don’t have to run the vCenter Simulator locally.

Code is available at: https://github.com/tehranian/vagrant-vcenter-simulator

Screen Shots

vCenter Dashboard w/Simulated Inventory

Simulated Performance Metrics in vCenter