Testing Ansible Roles with Test Kitchen

Recently while attending DevOps Days Austin 2015, I participated in a breakout session focused on how to test code for configuration management tools like Puppet, Chef, and Ansible. Having started to use Ansible to manage our infrastructure at Delphix I was searching for a way to automate the testing of our configuration management code across a variety of platforms, including Ubuntu, CentOS, RHEL, and Delphix’s custom Illumos-based OS, DelphixOS. Dealing with testing across all of those platforms is a seemingly daunting task to say the least!

Intro to Test Kitchen

The conversation in that breakout session introduced me to Test Kitchen (GitHub), a tool that I’ve been very impressed by and have had quite a bit of fun writing tests for. Test Kitchen is a tool for automated testing of configuration management code written for tools like Ansible. It automates the process of spinning up test VMs, running your configuration management tool against those VMs, executing verification tests against those VMs, and then tearing down the test VMs.

What’s makes Test Kitchen so powerful and useful is its modular design:

Using Test Kitchen

After learning about Test Kitchen at the DevOps Days conference, I did some more research and stumbled across the following presentation which was instrumental in getting started with Test Kitchen and Ansible: Testing Ansible Roles with Test Kitchen, Serverspec and RSpec (SlideShare).

In summary one needs to add three files to their Ansible role to begin using Test Kitchen:

  • A “.kitchen.yml” file at the top-level. This file describes:
    • The driver to use for VM provisioning. Ex: Vagrant, AWS, Docker, etc.
    • The provisioner to use. Ex: Puppet, Chef, Ansible.
    • A list of 1 or more operating to test against. Ex: Ubuntu 12.04, Ubuntu 14.04, CentOS 6.5, or even a custom VM image specified by URL.
    • A list of test suites to run.
  • A “test/integration/test-suite-name/test-suite-name.yml” file which contains the Ansible playbook to be applied.
  • One or more test files in “test/integration/test-suite-name/test-driver-name/”. For example, when using the BATS test-runner to run a test suite named “default”: “test/integration/default/bats/my-test.bats”.

Example Code

A full example of Test Kitchen w/Ansible is available via the delphix.package-caching-proxy Ansible role in Delphix’s GitHub repo. Here are direct links to the aforementioned files/directories:682240

Running Test Kitchen

Using Test Kitchen couldn’t be easier. From the directory that contains your “.kitchen.yml” file, just run “kitchen test” to automatically create your VMs, configure them, and run tests against them:

$ kitchen test
-----> Starting Kitchen (v1.4.1)
-----> Cleaning up any prior instances of 
-----> Destroying ...
 Finished destroying  (0m0.00s).
-----> Testing 
-----> Creating ...
 Bringing machine 'default' up with 'virtualbox' provider...
 ==> default: Importing base box 'opscode-ubuntu-14.04'...
==> default: Matching MAC address for NAT networking...
 ==> default: Setting the name of the VM: kitchen-ansible-package-caching-proxy-default-ubuntu-1404_default_1435180384440_80322
 ==> default: Clearing any previously set network interfaces...
 ==> default: Preparing network interfaces based on configuration...
 default: Adapter 1: nat
 ==> default: Forwarding ports...
 default: 22 => 2222 (adapter 1)
 ==> default: Booting VM...
 ==> default: Waiting for machine to boot. This may take a few minutes...

..  ...

-----> Running bats test suite
 ✓ Accessing the apt-cacher-ng vhost should load the configuration page for Apt-Cacher-NG
 ✓ Hitting the apt-cacher proxy on the proxy port should succeed
 ✓ The previous command that hit ftp.debian.org should have placed some files in the cache
 ✓ Accessing the devpi server on port 3141 should return a valid JSON response
 ✓ Accessing the devpi server via the nginx vhost should return a valid JSON response
 ✓ Downloading a Python package via our PyPI proxy should succeed
 ✓ We should still be able to install Python packages when the devpi contianer's backend is broken
 ✓ The vhost for the docker registry should be available
 ✓ The docker registry's /_ping url should return valid JSON
 ✓ The docker registry's /v1/_ping url should return valid JSON
 ✓ The front-end serer's root url should return http 204
 ✓ The front-end server's /_status location should return statistics from our web server
 ✓ Accessing http://www.google.com through our proxy should always return a cache miss
 ✓ Downloading a file that is not in the cache should result in a cache miss
 ✓ Downloading a file that is in the cache should result in a cache hit
 ✓ Setting the header 'X-Refresh: true' should result in a bypass of the cache
 ✓ Trying to purge when it's not in the cache should return 404
 ✓ Downloading the file again after purging from the cache should yield a cache miss
 ✓ The yum repo's vhost should return HTTP 200

 19 tests, 0 failures
 Finished verifying  (1m52.26s).
-----> Kitchen is finished. (1m52.49s)

And there you have it, one command to automate your entire VM testing workflow!

Next Steps

Giving individual developers on our team the ability to quickly run a suite of automated tests is a big win, but that’s only the first step. The workflow we’re planning is to have Jenkins also run these automated Ansible tests every time someone pushes to our git repo. If those tests succeed we can automatically trigger a run of Ansible against our production inventory. If, on the other hand, the Jenkins job which runs the tests is failing (red), we can use that to prevent Ansible from running against our production inventory. This would be a big win for validating infrastructure changes before pushing them to production.

ansible_logo_black_square

Advertisements

Ansible vs Puppet – Hands-On with Ansible

This is part 2/2 in a series. For part #1 see: Ansible vs Puppet – An Overview of the Solutions.

Notes & Findings From Going Hands-On with Ansible

After playing with Ansible for a week to Ansible-ize Graphite/Grafana (via Docker) and Jenkins (via an Ansible Galaxy role), here are my notes about Ansible:

  • “Batteries Included” and OSS Module Quality
    • While Ansible does include more modules out of the box, the “batteries included” claim is misleading. IMO an Ansible shop will have to rely heavily upon Ansible Galaxy to find community-created modules (Ex: for installing Jenkins, dockerd, or ntp), just as a Puppet shop would have to rely upon PuppetForge.
    • The quality and quantity of the modules on Ansible Galaxy is about on par with what is available at PuppetForge. Just as with PuppetForge, there are multiple implementations for any given module (ex: nginx, ntp, jenkins), each with their own quirks, strengths, and deficiencies.
    • Perhaps this is a deficiency of all of the configuration management systems. Ultimately a shop’s familiarity with Python or Ruby may add some preference here.
  • Package Installations
    • Coming from Puppet-land this seemed worthy of pointing out: Ansible does not abstract an OS’s package manager the same way that Puppet does with the “package” resource. Users explicitly call out the package manager to be used. Ex: the “apt” module or “yum” module. One can see that Ansible provides a tad bit less abstraction. FWIW a package installed via “pip” or “gem” in Puppet still requires explicit naming of the package provider. Not saying that either is better or worse here. Just a noticable difference to an Ansible newbie.
  • Programming Language Constructs
  • Noop Mode
  • Agent-less
    • Ansible’s agent-less, SSH-based push workflow actually was notably easier to deal with than a Puppetmaster, slave agents, SSL certs, etc.
  • Learning Curve
    • If I use my imagination and pretend that I was starting to use a configuration management tool for the first time, I perceive that I’d have an easier time picking up Ansible. Even though I’m not a fan of YAML by any stretch of the imagination, Ansible playbooks are a bit easier to write & understand than Puppet manifests.

Conclusions

After three years of using Puppet at VMware and Virtual Instruments, the thought of not continuing to use the market leader in configuration management tools seemed like a radical idea when it was first suggested to me. After spending several weeks researching Ansible and using it hands-on, I came to the conclusion that Ansible is a perfectly viable alternative to Puppet. I tend to agree with Lyft’s conclusion that if you have a centralized Ops team in change of deployments then they can own a Puppet codebase. On the other hand if you want more wide-spread ownership of your configuration management scripts, a tool with a shallower learning curve like Ansible is a better choice.

Ansible vs Puppet – An Overview of the Solutions

This is part 1/2 in a series. For part #2 see: Ansible vs Puppet – Hands-On with Ansible.

Having recently joined Delphix in a DevOps role, I was tasked with increasing the use of a configuration management tool within the R&D infrastructure. Having used Puppet for the past 3 years at VMware and Virtual Instruments I was inclined to deploy another Puppet-based infrastructure at Delphix, but before doing so I thought I’d take the time to research the new competitors in this landscape.

Opinions From Colleagues

I started by reaching out to three former colleagues and asking for their experiences with alternatives to Puppet, keeping in mind that two of the three are seasoned Chef experts. Rather than giving me a blanket recommendation, they responded with questions of their own:

  • How does the application stack look? Is is a typical web server, app, and db stack or more complex?
  • Are you going to be running the nodes in-house, cloud, or hybrid? What type of hypervisors are involved?
  • What is the future growth? (ie. number of nodes/environments)
  • How much in-depth will the eng team be involved in writing recipes, manifests, or playbooks?

After describing Delphix’s infrastructure and use-cases (in-house VMware; not a SaaS company/product but there are some services living in AWS; less than 100 nodes; I expect the Engineering team to be involved in writing manifests/cookbooks), I received the following recommendations:

Colleague #1: “Each CM tool will have its learning curve but I believe the decision will come down to what will the engineers adapt to the easiest. I would tend to lean towards Ansible for the mere fact thats it’s the easiest to implement. There are no agents to be installed and works well with vSphere/Packer. You will most likely have the engineers better off on it better since they know Python.”

Colleague #2: “I would say that if you don’t have a Puppet champion (someone that knows it well and is pushing to have it running) it is probably not worth for this. Chef would be a good solution but it has a serious learning curve and from your details it is not worth either. I would go with something simple (at least in this phase); I think Ansible would be a good fit for this stage. Maybe even with some Docker here or there ;). There will be no state or complicated cluster configs, but you seem to not need that. You don’t have to install any agent and just need SSH.“

The third colleague was also in agreement that I should take a good look at Ansible before starting with deployment of Puppet.

Researching Ansible on My Own

Having gotten similar recommendations from 3/3 of my former colleagues, I spent some time looking into Ansible. Here are some major differences between Ansible & Puppet that I gathered from my research:

  • Server Nodes
    • Puppet infrastructure generally contains 1 (or more) “puppetmaster” servers, along with a special agent package installed on each client node. (Surprisingly, the Puppetization of the puppetmaster server itself is an issue that does not have a well-defined solution)
    • Ansible has neither a special master server, nor special agent executables to install. The executor can be any machine with a list (inventory) of the nodes to contact, the Ansible playbooks, and the proper SSH keys/credentials in order to connect to the nodes.
  • Push vs Pull
    • Puppet nodes have special client software and periodically check into a puppet master server to “pull” resource definitions.
    • Ansible follows a “push” workflow. The machine where Ansible runs from SSH’s into the client machines and uses SSH to copy files, remotely install packages, etc. The client machine VM requires no special setup outside of a working installation of Python 2.5+.
  • Resources & Ordering
    • Puppet: Resources defined in a Puppet manifest are not applied in order of their appearance (ex: top->bottom) which is confusing for people who come from conventional programming languages (C, Java, etc). Instead resources are applied randomly, unless explicit resource ordering is used. Ex: “before”, ”require”, or chaining arrows.
    • Ansible: The playbooks are applied top-to-bottom, as they appear in the file. This is more intuitive for developers coming from other languages. An example playbook that can be read top-down: https://gist.github.com/phips/aa1b6df697b8124f1338
  • Resource Dependency Graphs
    • Internally, Puppet internally creates a directed graph of all of the resources to be defined in a system along with the order they should be applied in. This is a robust way of representing the resources to be applied and Puppet can even generate a graph file so that one can visualize everything that Puppet manages. On the down side building this graph is susceptible to “duplicate resource definition” errors (ex: multiple definitions of a given package, user, file, etc). Also, conflicting rules from a large collection of 3rdparty modules can lead to circular dependencies.
    • Since Ansible is basically a thin-wrapper for executing commands over SSH, there is no resource dependency graph built internally. One could view this as a weakness as compared with Puppet’s design but it also means that these “duplicate resource” errors are completely avoided. The simpler design lends itself to new users understanding the flow of the playbook more easily.
  • Batteries Included vs DIY
  • Language Extensibility
  • Syntax
  • Template Language
    • Puppet templates are based upon Ruby’s ERB.
    • Ansible templates are based upon Jinja2, which is a superset of Django’s templating language. Most R&D orgs will have more experience with Django.
  • DevOps Tool Support

Complexity & Learning Curve

I wanted to call this out even though it’s already been mentioned several times above. Throughout my conversations with my colleagues and in my research, it seemed that Ansible was a winner in terms of ease of adoption. In terms of client/server setup, resource ordering, “batteries included”, Python vs Ruby, and Jinja vs ERB, I got the impression that I’d have an easier time getting teammates in my R&D org to get a working understanding of Ansible.

Supporting this sentiment was a post I found about the Taxi startup Lyft abandoning Puppet because the dev team had difficulty learning Puppet. From “Moving Away from Puppet”:

“[The Puppet code base] was large, unwieldy and complex, especially for our core application. Our DevOps team was getting accustomed to the Puppet infrastructure; however, Lyft is strongly rooted in the concept of ‘If you build it you run it’. The DevOps team felt that the Puppet infrastructure was too difficult to pick up quickly and would be impossible to introduce to our developers as the tool they’d use to manage their own services.”

Getting Hands-On

Having done my research on Puppet vs Ansible, I was now ready to dive in and implement some sample projects. How did my experience with Ansible turn out? Read on in Part #2 of this series: Ansible vs Puppet – Hands-On with Ansible.

Testing Puppet Code with Vagrant

At Virtual Instruments we use Vagrant boxes to locally test our Puppet changes before pushing those changes into production. Here are some details about how we do this.

Puppet Support in Vagrant

Vagrant has built-in support for using Puppet as a machine provisioner, either by contacting a Puppet master to receive modules and manifests or by running “puppet apply” with a local set of modules and manifests (aka. masterless Puppet). We chose to use masterless Puppet with Vagrant in our test environment due to its simplicity of setup.

Starting with a Box for the Base OS

Before we can use Puppet to provision our machine, we need to have a base OS available with Puppet installed. At Virtual Instruments our R&D infrastructure is standardized on Ubuntu 12.04 which means that we want our Vagrant base box to be a otherwise minimal installation of Ubuntu 12.04 with Puppet also installed. Luckily this is a very common configuration and there are pre-made Vagrant boxes available for download at VagrantCloud.com. We’re going to use the box named “puppetlabs/ubuntu-12.04-64-puppet“.

If you are using a different OS you can search the Vagrant Cloud site for a Vagrant base box that matches the OS of your choice. See: https://vagrantcloud.com/discover/featured

If you can find a base box for your OS but not a base box for that OS which has Puppet pre-installed, you can use one of @mitchellh‘s nifty Puppet-bootstrap scripts with a Vagrant Shell Provisioner to get Puppet installed into your base box. See the README included in that repo for details: https://github.com/hashicorp/puppet-bootstrap/blob/master/README.md

The Vagrantfile

Having found a suitable base box, one can use the following “Vagrantfile” to start that box and invoke Puppet to provision the machine.

VAGRANTFILE_API_VERSION = "2"

# set the following hostname to a name that Puppet will match against. ex:
# "vi-cron9.lab.vi.local"
MY_HOSTNAME = "vi-nginx-proxy9.lab.vi.local"


Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  # from: https://vagrantcloud.com/search?utf8=✓&sort=&provider=&q=puppetlabs+12.04
  config.vm.box = "puppetlabs/ubuntu-12.04-64-puppet"
  config.vm.hostname = MY_HOSTNAME

  # needed to load hiera data for puppet
  config.vm.synced_folder "hieradata", "/data/puppet-production/hieradata"

  # Vagrant/Puppet docs:
  #   http://docs.vagrantup.com/v2/provisioning/puppet_apply.html
  config.vm.provision :puppet do |puppet|
    puppet.facter = {
      "is_vagrant_vm" => "true"
    }
    puppet.hiera_config_path = "hiera.yaml"
    puppet.manifest_file  = "site.pp"
    puppet.manifests_path = "manifests"
    puppet.module_path = "modules"
    # puppet.options = "--verbose --debug"
  end

end

Breaking Down the Vagrantfile

Setting Our Hostname

# Set the following hostname to a name that Puppet will match against. ex:
# "vi-cron9.lab.vi.local"
MY_HOSTNAME = "vi-nginx-proxy9.lab.vi.local"

Puppet determines which resources to apply based on the hostname of our VM. For ease of use, our “Vagrantfile” has a variable called “MY_HOSTNAME” defined at the top of the file which allows users to easily define which node they want to provision locally.

Defining Which Box to Use

# From: https://vagrantcloud.com/search?utf8=✓&sort=&provider=&q=puppetlabs+12.04
config.vm.box = "puppetlabs/ubuntu-12.04-64-puppet"

The value for “config.vm.box” is the name of the box we found on vagrantcloud.com. This allows Vagrant to automatically download the base VM image from the Vagrant Cloud service.

Puppet-Specific Configurations

  # Needed to load Hiera data for Puppet
  config.vm.synced_folder "hieradata", "/data/puppet-production/hieradata"

  # Vagrant/Puppet docs:
  #   http://docs.vagrantup.com/v2/provisioning/puppet_apply.html
  config.vm.provision :puppet do |puppet|
    puppet.facter = {
      "is_vagrant_vm" => "true"
    }
    puppet.hiera_config_path = "hiera.yaml"
    puppet.manifest_file  = "site.pp"
    puppet.manifests_path = "manifests"
    puppet.module_path = "modules"
    # puppet.options = "--verbose --debug"
  end

Here we are setting up the configuration of the Puppet provisioner. See the full documentation for Vagrant’s masterless Puppet provisioner at: https://docs.vagrantup.com/v2/provisioning/puppet_apply.html

Basically this code:

  • Sets up a shared folder to make our Hiera data available to the guest OS
  • Set a custom Facter fact called “is_vagrant_vm” to “true“. This fact can then be used by our manifests for edge-cases around running VMs locally (like routing collectd/SAR data to a non-production Graphite server to avoid pollution of the production Graphite server)
  • Tells the Puppet provisioner where the root Puppet manifest file is and where necessary Puppet modules can be found.

Conclusion

Vagrant is a powerful tool for testing Puppet code changes locally. With a simple “vagrant up” one can fully provision a VM from scratch. One can also use the “vagrant provision” command to locally test incremental updates to Puppet code as it is iteratively being developed, or to test changes to long-running mutable VMs.

Building Vagrant Boxes with Nested VMs using Packer

In “Improving Developer Productivity with Vagrant” I discussed the productivity benefits gained from using Vagrant in our software development tool chain. Here are some more details about the mechanics of how we created those Vagrant boxes as part of every build of our product.

Using Packer to Build VMware-Compatible Vagrant Boxes

Packer is a tool for creating machine images which was also written by Hashicorp, the authors of Vagrant. It can build machine images for almost any type of environment, including Amazon AWSDocker, Google Compute Engine, KVM, Vagrant, VMwareXen, and more.

We used Packer’s built-in VMware builder and Vagrant post-processor to create the Vagrant boxes for users to run on their local desktops/laptops via VMware Fusion or Workstation.

Note: This required each user to install Vagrant’s for-purchase VMware plugin. In our usage of running Vagrant boxes locally we noted that the VMware virtualization providers delivered far better IO performance and stability than the free Oracle VirtualBox provider. In short, the for-purchase Vagrant-VMware plugin was worth every penny!

Running VMware Workstation VMs Nested in ESXi

One of the hurdles I came across in integrating the building of the Vagrant boxes into our existing build system is that Packer’s VMware builder needs to spin up a VM using Workstation or Fusion in order to perform configuration of the Vagrant box. Given that our builds were already running in static VMs, this meant that we needed to be able to run Workstation VMs nested within an ESXi VM with a Linux guest OS!

This sort of VM-nesting was somewhat complicated to setup in the days of vSphere 5.0, but in vSphere 5.1+ this has become a lot simpler. With vSphere 5.1+ one just needs to make sure that their ESXi VMs are running with “Virtual Hardware Version 9” or newer, and one must enable “Hardware assisted virtualization” for the VM within the vSphere web client.

Here’s what the correct configuration for supporting nested VMs looks like:

2014-09-28 02.05.03 pm

Packer’s Built-in Remote vSphere Hypervisor Builder

One question that an informed user of Packer may correctly ask is: “Why not use Packer’s built-in Remote vSphere Hypervisor Builder and create the VM directly on ESXi? Wouldn’t this remove the need for running nested VMs?”

I agree that this would be a better solution in theory. There are several reasons why I chose to go with nested VMs instead:

  1. The “Remote vSphere Hypervisor Builder” requires manually running an “esxcli” command on your ESXi boxes to enable some sort of “GuestIP hack”. Doing this type of configuration on our production ESXi cluster seemed sketchy to me.
  2. The “Remote vSphere Hypervisor Builder” doesn’t work through vSphere, but instead directly ssh’es into your ESXi boxes as a privileged user in order to create the VM. The login credentials for that privileged ESXi/ssh user must be kept in the Packer build script or some other area of our build system. Again, this seems less than ideal to me.
  3. As far as I can tell from the docs, the “Remote vSphere Hypervisor Builder” only works with the “vmware-iso” builder and not the “vmware-vmx” builder. This would’ve painted us into a corner as we had plans to switch from the “vmware-iso” builder to the “vmware-vmx” builder once it had become available.
  4. The “Remote vSphere Hypervisor Builder” was not available when I implemented our nested VM solution because we were early adopters of Packer. It was easier to stick with a working solution that we already had 😛

Automating the Install of VMware Workstation via Puppet

One other mechanical piece I’ll share is how we automated the installation of VMware Workstation 10.0 into our static build VMs. Since all of the build VM configuration is done via Puppet, we could automate the installation of Workstation 10 with the following bit of Puppet code:

# Install VMware Workstation 10
  $vmware_installer = '/mnt/devops/software/vmware/VMware-Workstation-Full-10.0.0-1295980.x86_64.bundle'
  $vmware_installer_options = '--eulas-agreed --required'
  exec {'Install VMware Workstation 10':
    command => "${vmware_installer} ${vmware_installer_options}",
    creates => '/usr/lib/vmware/config',
    user    => 'root',
    require => [Mount['/mnt/devops'], Package['kernel-default-devel']],
  }

Docker + Jenkins: Dynamically Provisioning SLES 11 Build Containers

TL; DR

Using JenkinsDocker Plugin, we can dynamically spin-up SLES 11 build slaves on-demand to run our builds. One of the hurdles to getting there was to create a SLES 11 Docker base-image, since there are no SLES 11 container images available at the Docker Hub Registry. We used SUSE’s Kiwi imaging tool to create a base SLES 11 Docker image for ourselves, and then layered our build environment and Jenkins build slave support on top of it. After configuring Jenkins’ Docker plugin to use our home-grown SLES image, we were off and running with our containerized SLES builds!

Jenkins/Docker Plugin

The path to Docker-izing our build slaves started with stumbling across this Docker Plugin for Jenkins: https://wiki.jenkins-ci.org/display/JENKINS/Docker+Plugin. This plugin allows one to use Docker to dynamically provision a build slave, run a single build, and then tear-down that slave, optionally saving it. This is very similar in workflow to the build VM provisioning system that I created while working in VMware’s Release Engineering team, but much lighter weight. Compared to VMs, Docker containers can be spun up in milliseconds instead of a in few minutes and Docker containers are much lighter on hardware resources.

The above link to the Jenkins wiki provides details about how to configure your environment as well as how to configure your container images. Some high-level notes:

  • Your base OS needs to have Docker listening on a TCP port. By default, Docker only listens on a Unix socket.
  • The container needs run “sshd” for Jenkins to connect to it. I suspect that once the container is provisioned, Jenkins just treats it as a plain-old SSH slave.
  • In my testing, the Docker/Jenkins plugin was not able to connect via SSH to the containers it provisioned when using Docker 1.2.0. After trial and error, I found that the current version of the Jenkins plugin (0.6) works well with Docker 1.0-1.1.2, but Docker 1.2.0+ did not work with this Jenkins Plugin. I used Puppet to make sure that our Ubuntu build server base VMs only had Docker 1.1.2 installed. Ex:
    • # VW-10576: install docker on the ubuntu master/slaves
      # * Have Docker listen on a TCP port per instructions at:
      # https://wiki.jenkins-ci.org/display/JENKINS/Docker+Plugin
      # * Use Docker 1.1.2 and not anything newer. At the time of writing this
      # comment, Docker 1.2.0+ does not work with the Jenkins/Docker
      # plugin (the port for sshd fails to map to an external port).
      class { 'docker':
        tcp_bind => 'tcp://0.0.0.0:4243',
        version  => '1.1.2',
      }
  • There is a sample Docker/Jenkins slave based on “ubuntu:latest” available at: https://registry.hub.docker.com/u/evarga/jenkins-slave/. I would recommend getting that working as a proof-of-concept before venturing into building your own custom build slave containers. It’s helpful to be familiar with the “Dockerfile” for that image as well: https://registry.hub.docker.com/u/evarga/jenkins-slave/dockerfile/

Once you have the Docker Plugin installed, you need to go to your Jenkins “System Configuration” page and add your Docker host as a new cloud provider. In my proof-of-concept case, this is an Ubuntu 12.04 VM running Docker 1.1.2, listening on port 4243, configured to use the “evarga/jenkins-slave” image, providing the “docker-slave” label which I can then configure my Jenkins build job to be restricted to. The Jenkins configuration looks like this:

Jenkins' "System Configuration" for a Docker host

Jenkins’ “System Configuration” for a Docker host

I then configured a job named “docker-test” to use that “docker-slave” label and run a shell script with basic commands like “ps -eafwww”, “cat /etc/issue”, and “java -version”. Running that job, I see that it successfully spins up a container of “evarga/jenkins-slave” and runs my little script. Note the hostname at the top of the log, and output of “ps” in the screenshot below:

A proof-of-concept of spinning up a Docker container on demand

A proof-of-concept of spinning up a Docker container on demand

 

Creating Our SLES 11 Base Image

Having built up the confidence that we can spin up other people’s containers on-demand, we now turned to creating our SLES 11 Docker build image. For reasons that I can only assume are licensing issues, SLES 11 does not have a base image up on the Docker Hub Registry in the same vein as the images that Ubuntu, Fedora, CentOS, and others have available.

Luckily I stumbled upon the following blog post: http://flavio.castelli.name/2014/05/06/building-docker-containers-with-kiwi/

At Virtual Instruments we were already using Kiwi to build the OVAs of our build VMs, so we were already familiar with using Kiwi. Since we’d already been using Kiwi to create the OVA of our build environment it wasn’t much more work to follow that blog post and get Kiwi to generate a tarball that could be consumed by “docker import”. This worked well for the next proof-of-concept phase, but ultimately we decided to go down another path.

Rather than have Kiwi generate fully configured build images for us, we decided it’d be best to follow the conventions of the “Docker Way” and have Kiwi generate a SLES 11 base image which we could then use with a “FROM” statement in a “Dockerfile” and install the build environment via the Dockerfile. One of the advantages of this is that we only have to use Kiwi to generate the base image the first time. After there we can stay in Docker-land to build the subsequent images. Additionally, having a shared base image among all of our build image tags should allow for space savings as Docker optimizes the layering of filesystems over a common base image.

Configuring the Image for Use with Jenkins

Taking a SLES 11 image with our build environment installed and getting it to work with the Jenkins Docker plugin took a little bit of work, mainly spent trying to configure “sshd” correctly. Below is the “Dockerfile” that builds upon a SLES image with our build environment installed and prepares it for use with Jenkins:

# This Dockerfile is used to build an image containing basic
# configuration to be used as a Jenkins slave build node.

FROM vi-docker.lab.vi.local/pa-dev-env-master
MAINTAINER Dan Tehranian <REDACTED@virtualinstruments.com>


# Add user & group "jenkins" to the image and set its password
RUN groupadd jenkins
RUN useradd -m -g jenkins -s /bin/bash jenkins
RUN echo "jenkins:jenkins" | chpasswd


# Having "sshd" running in the container is a requirement of the Jenkins/Docker
# plugin. See: https://wiki.jenkins-ci.org/display/JENKINS/Docker+Plugin

# Create the ssh host keys needed for sshd
RUN ssh-keygen -A

# Fix sshd's configuration for use within the container. See VW-10576 for details.
RUN sed -i -e 's/^UsePAM .*/UsePAM no/' /etc/ssh/sshd_config
RUN sed -i -e 's/^PasswordAuthentication .*/PasswordAuthentication yes/' /etc/ssh/sshd_config

# Expose the standard SSH port
EXPOSE 22

# Start the ssh daemon
CMD ["/usr/sbin/sshd -D"]

Running a Maven Build Inside of a SLES 11 Docker Container

Having created this new image and pushed it to our internal docker repo, we can now go back to Jenkins’ “System Configuration” page and add a new image to our Docker cloud provider. Creating a new Jenkins “Maven Job” which utilizes this new SLES 11 image and running a build, we can see our SLES 11 container getting spun up, code getting checked out from our internal git repo, and Maven being invoked:

Hooray! A Successful Maven Build

Hooray! A successful Maven build inside of a Docker container!

Output From the Maven Build. LGTM!

Output from the Maven Build that was run in the container. LGTM!

 

Wins

There are a whole slew of benefits to a system like this:

  • We don’t have to run & support SLES 11 VMs in our infrastructure alongside the easier-to-manage Ubuntu VMs. We can just run Ubuntu 12.04 VMs as the base OS and spin up SLES slaves as needed. This makes testing of our Puppet repository a lot easier as this gives us a homogeneous OS environment!
  • We can have portable and separate build environment images for each of our branches. Ex: legacy product branches can continue to have old versions of the JDK and third party libraries that are updated only when needed, but our mainline development can have a build image with tools that are updated independently.
    • This is significantly better than the “toolchain repository” solution that we had at VMware, where several 100s of GBs of binaries were checked into a monolithic Perforce repo.
  • Thanks to Docker image tags, we can tag the build image at each GA release and keep that build environment saved. This makes reproducing builds significantly easier!
  • Having a Docker image of the build environment allows our developers to do local builds via their IDEs, if they so choose. Using Vagrant’s Docker provider, developers can spin up a Docker container of the build environment for their respective branch on their local machines, regardless of their host OS – Windows, Mac, or Linux. This allows developers to build RPMs with the same libraries and tools that the build system would!

A Local Caching Proxy for “pypi.python.org” via Docker

TL; DR

If your infrastructure automation installs packages from PyPI (the Python Package Index) via “pip” or similar tools, you can save yourself from annoying “pypi.python.org timeout” errors by running a local caching proxy of the PyPI service. After trying several of these services, I found “devpi” to be the most resilient. It’s available as both a Python package or as a Docker container that you can run in your data center.

Problem

If you have infrastructure automation that tries to install packages from PyPI then you’ve undoubtedly encountered encounded availability issues with the PyPI web service, hosted at “pypi.python.org”. Example email alerts that we see from our Puppet infrastructure look like:

Tue Sep 02 23:47:23 -0700 2014 /Stage[main]/Jenkins::Dev_jenkins_slave/Package[jenkins-check-for-success]
(err): Could not evaluate: Could not get latest version: 
Timeout while contacting pypi.python.org: execution expired

One could choose to ignore these sorts of connectivity issues since they are transient, but there’s quite a few negative consequences to that:

  • If you’re spinning up new machines on-demand and they require a Python package as part of their configuration, then your ability to consistently spin up these machines successfully has become compromised by a dependency which is completely out of your own control.
  • If your infrastructure automation is configured to send email alerts on these types of errors, you’ll be getting un-actionable emails that add to the noise of email alerts that you get from your infrastructure. This effectivly makes your alerting system less valuable as your team will be trained to ignore their email alerts.
  • As you scale your infrastructure to hundreds or thousands of nodes, you’ll be receiving a lot of alerts about connectivity issues with “pypi.python.org” throughout the day and night time hours. When “pypi.python.org” goes down hard for a prolonged period of time, you’ll end up with all of the nodes in your infrastructure simultaneously bombarding you with alerts about not being able to contact “pypi.python.org”.

Solution

The solution for this problem is to run a local caching proxy for “pypi.python.org” within your data center. The Python community has developed proxy packages like pypiserver, chishop, devpi, and others specifically for this use case. After extensive research and trying several of them out, I’ve found devpi to be the most resilient as well as the most actively developed as of this writing.

One can either install devpi as a Python package (see instructions on their website) or via a Docker container. Since our infrastructure has been making the move to “All Docker Everything” I’ll write up the steps I took to setup the Docker container running devpi and how I configured our clients to use it.

Devpi Server Installation

Here’s some sample Puppet code for how to download & run the “scrapinghub/devpi” container with an nginx proxy in front of it. (I discussed why having an nginx proxy in front is advantageous in Private Docker Registry w/Nginx Proxy for Stats Collection)

You’ll want to change “DEVPI_PASSWORD” and the hostname for the Nginx vhost below.

# devpi server & nginx configuration
docker::image { 'scrapinghub/devpi': }
docker::run { 'devpi':
    image => 'scrapinghub/devpi',
    ports => ['3141:3141',],
    use_name => true,
    env => ['DEVPI_PASSWORD=1234',],
}

nginx::resource::upstream { 'pypi_app':
    members => ['localhost:3141',],
}
nginx::resource::vhost { 'vi-pypi.lab.vi.local':
    proxy => 'http://pypi_app',
}

Once your container is running you can run “docker logs” to see what it is up to. You can see the “devpi” proxy saving your bacon when “pypi.python.org” occasionally becomes unavailable via log statements like this:

172.17.42.1 - - [22/Jul/2014 22:12:09] "GET /root/public/+simple/argparse/ HTTP/1.0" 200 4316
2014-07-22 22:12:09,251 [INFO ] requests.packages.urllib3.connectionpool: Resetting dropped connection: pypi.python.org
2014-07-22 22:12:09,301 [INFO ] devpi_server.filestore: cache-streaming: https://pypi.python.org/packages/source/a/argparse/argparse-1.2.
1.tar.gz, target root/pypi/+f/2fb/ef8cb61e506c706957ab6e135840c/argparse-1.2.1.tar.gz
2014-07-22 22:12:09,301 [INFO ] devpi_server.filestore: starting file iteration: root/pypi/+f/2fb/ef8cb61e506c706957ab6e135840c/argparse-1.2.1.tar.gz (size 69297)

Python Client Configuration

On the client side we need to configure both “pip” and “easy_install” to use the devpi container we just instantiated. This requires creating a special configuration file for each of those Python package managers. The configuration file tells those package managers to use your devpi proxy server for their package index URL.

You’ll want to change the URL to point to the hostname you use within your own infrastructure.

# ~/.pip/pip.conf

[global]
index-url = http://vi-pypi.lab.vi.local/root/public/
# ~/.pydistutils.cfg

[easy_install]
index_url = http://vi-pypi.lab.vi.local/root/public/

But Wait There’s More – Uploading Your Own Python Packages

One of the additional benefits to running a local PyPI proxy is that it becomes a distribution point for your private Python packages. Instead of clumsily checking out SCM repos full of your own custom Python scripts to each machine in your infrastructure, you can install your Python scripts as first-order Python packages, the same way you would install packages from PyPI. This lets you properly version your packages and define dependency requirements between your packages.

Creating a “setup.py” file for each of your Python projects is outside the scope of this post, but details can be found online. Once your Python project has its “setup.py” file, uploading your versioned package to your local devpi instance requires just a few simple commands. From our Jenkins job which publishes a new version of a Python package upon a git push:

# from the cwd of "setup.py"
devpi use http://vi-pypi.lab.vi.local/root/public/
devpi login root --password 1234 
devpi upload

More details at: http://doc.devpi.net/latest/quickstart-releaseprocess.html

Conclusion

By running a local caching proxy of “pypi.python.org” we’re able to improve the reliability of our infrastructure because we are no longer beholden to the availability of an external dependency. We also get the added benefit of having a proper Python package distribution point, which allows us to have better development & deployment practices. Finally, this local caching proxy provides better performance for installing packages, as local network copies are significantly faster than downloading from an external website.