Improving Developer Productivity with Vagrant

As part of improving developer productivity at Virtual Instruments during development of VirtualWisdom 4.0, I introduced Vagrant to the development team. At the time, the product was being re-architected from a monolithic Java app into a service-oriented architecture (SOA). Without Vagrant, the challenge for a Java developer working on any one of the Java services was that there was no integration environment available in which to test that service. In other words, a developer could run their Java service locally, but without the co-requisite services and databases they couldn’t do anything useful with it.

How Not To Solve

We could have documented a long set of instructions in a wiki, detailing how to set up and run each of the Java services locally, along with instructions on how to set up and run each of the databases manually, but there would be several problems with this approach:

  1. Following such instructions would be a very manual, time-consuming, and mistake-prone process. The total time spent on such efforts would be multiplied by the size of the R&D team, as each developer would have to duplicate the effort on their own.
  2. Such instructions would be a “living document”, continually changing over time. This means that if Jack followed the instructions on Day X, the instructions that Jane followed on Day X+Y could potentially differ and lead to two very different integration environments.
  3. All of our developers were running Mac OS or Windows laptops, but the product ran on SuSE Linux Enterprise Server 11 (SLES 11). No matter how complete our setup instructions were, there would still be the issue of consistency of environment. If developers tested their Java services in hand-crafted environments that were not identical to the environment that QA tested in or that the customer ran the product in, we would be sure to hit issues where functionality worked in one developer’s environment but not in QA’s or the customer’s! (i.e., “It worked on my box!”)

A Better Approach


Turning our integration environment into a portable Vagrant box (a virtual machine) solved all of these issues. The Vagrant box was an easily distributable artifact, generated by our build process, that contained fully configured instances of all of the Java services and databases that comprised our product. Developers could download the Vagrant box and get it running in minutes. The process for running the Vagrant box was so simple that even managers and directors could download a “Vagrantfile”, run “vagrant up”, and get a recent build running locally on their laptops. Finally, the Vagrant box generated by our build process used the same SLES 11 environment that QA and customers would be running, so developers would not hit issues caused by differences in environment. I will write a follow-up post about how we use Packer in our build process to create the Vagrant box, but for now I’ll provide some details about our Vagrant box workflow.
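From a developer’s perspective, the whole workflow boiled down to a handful of commands. Here’s a sketch of that flow, using a hypothetical URL for the “Vagrantfile” (ours was published by the build system):

# download the Vagrantfile for a recent build (URL is hypothetical)
curl -O http://devnull.vi.local/builds/latest/Vagrantfile

# boot the box; Vagrant downloads the .box image on the first run
vagrant up

# get a shell inside the running appliance, if desired
vagrant ssh

# tear everything down when finished
vagrant destroy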

The “Vagrantfile”

Here’s a partial sample of our “Vagrantfile”, in which I’d like to call out a few things:

VAGRANTFILE_API_VERSION = "2"  # Do not modify

VM_NUM_CPUS = "4"
VM_RAM_MB = "4096"
VM_SHOW_CONSOLE = false


Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|

  # Box name & URL to download from
  config.vm.box = "vw-21609-431"
  config.vm.box_url = "http://devnull.vi.local/builds/aruba-images-master/aruba-images-21609/vagrant/vmware/portal_appliance.vmware.21609-431.box"

...

  # Create a private network interface, /dev/eth1. This allows host-only access
  # to the machine using a specific IP. The host OS is available to the guest
  # at the "192.168.33.1" IP address.
  config.vm.network :private_network, ip: "192.168.33.10"

  config.vm.provider "vmware_fusion" do |v|
    v.gui = VM_SHOW_CONSOLE
    v.vmx["memsize"]  = VM_RAM_MB
    v.vmx["numvcpus"] = VM_NUM_CPUS
  end

...

end

Keep in mind that the “Vagrantfile” is executable Ruby code, so the possibilities for what you can accomplish with it are virtually limitless, depending on your needs and desired workflow.

Private Networking and Our “services.conf”

The workflow used by the developers of our Java services is to run the service that they are modifying via the IDE in their host OS (e.g., Eclipse or IntelliJ), and to have all other services and databases running within the Vagrant box (the guest OS). To facilitate communication between the host OS and guest OS, we direct the “Vagrantfile” to create a private network with static IP addresses for the host and guest. Here our host OS will have the IP “192.168.33.1” while the guest will be available at “192.168.33.10”:

  # Create a private network interface, /dev/eth1. This allows host-only access
  # to the machine using a specific IP. The host OS is available to the guest
  # at the "192.168.33.1" IP address.
  config.vm.network :private_network, ip: "192.168.33.10"
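A quick way to sanity-check the private network once the box is up (the addresses are the ones configured above):

# from the host OS, the guest should answer at its static address
ping -c 3 192.168.33.10

# from inside the guest, the host OS should answer at 192.168.33.1
vagrant ssh -c "ping -c 3 192.168.33.1"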

With private network connectivity in place, we modified our Java services to read the locations of their peer services from a hierarchy of configuration files. For example, when a Java service initializes, it reads the following hierarchy of configuration files to determine how to connect to the other services:

  • /etc/vi/services.conf (the default settings)
  • /vagrant/services.conf (overrides the defaults)
  • ~/services.conf (highest precedence)

Sample contents for these “services.conf” files:

# /vagrant/services.conf

com.vi.ServiceA=192.168.33.1
com.vi.ServiceB=localhost
com.vi.ServiceC=192.168.33.10

The “services.conf” hierarchy allows a developer to direct the service running in their IDE/host OS to connect to the Java services running within the Vagrant box/guest OS (via “~/services.conf”), as needed. It also allows the developer to configure the services within the Vagrant box/guest OS to connect to the Java services running on the host OS, via the “/vagrant/services.conf” file. One clarification: the “/vagrant/services.conf” file actually lives on the host OS, in the working directory of the “Vagrantfile” that the developer downloads. The file appears as “/vagrant/services.conf” via the default shared folder provided by Vagrant. Having “/vagrant/services.conf” live on the host OS is especially convenient as it allows for easy editing, and more importantly it preserves the developer’s configuration when tearing down and re-initializing newer versions of our Vagrant box.
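To make the precedence rules concrete, here is a minimal shell sketch of the lookup our Java services perform (the real implementation is in Java; this only illustrates the override order):

# resolve a service address by reading the hierarchy in order;
# later files override earlier ones, so ~/services.conf wins
lookup_service() {
  key="$1"; value=""
  for f in /etc/vi/services.conf /vagrant/services.conf "$HOME/services.conf"; do
    [ -r "$f" ] || continue
    v=$(grep "^${key}=" "$f" | tail -n 1 | cut -d= -f2)
    [ -n "$v" ] && value="$v"
  done
  echo "$value"
}

lookup_service com.vi.ServiceA   # -> 192.168.33.1, per the sample above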

Easy Downloading with “box_url”

As part of our workflow I found it easiest to have users not download the Vagrant .box file directly, but instead download the small (~3KB) “Vagrantfile”, which in turn contains the URL for the .box file. When the user runs “vagrant up” from the directory containing this “Vagrantfile”, Vagrant will detect that a Vagrant box of the respective name is not in the local library and start downloading the box from the URL listed in the “Vagrantfile”.

  # Box name & URL to download from
  config.vm.box = "vw-21609-431"
  config.vm.box_url = "http://devnull.vi.local/builds/aruba-images-master/aruba-images-21609/vagrant/vmware/portal_appliance.vmware.21609-431.box"

More details are available in the Vagrant docs: http://docs.vagrantup.com/v2/vagrantfile/machine_settings.html

Note: Earlier this year the authors of Vagrant released a SaaS service for box distribution called Vagrant Cloud. You may want to look into using it, along with the newer Vagrant box versioning functionality. We are not using the Vagrant Cloud SaaS service yet, as our solution pre-dates its availability and there hasn’t been sufficient motivation to change our workflow.
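Upgrading to a newer build follows the same pattern. A sketch of the flow (the URL and box name are from the examples above; yours will differ per build):

# tear down the VM for the old build
vagrant destroy -f

# fetch the Vagrantfile for the new build and boot it; Vagrant sees
# the new box name and downloads the new .box automatically
curl -O http://devnull.vi.local/builds/latest/Vagrantfile
vagrant up

# reclaim disk space from the old box when it's no longer needed
vagrant box remove vw-21609-431

Because “/vagrant/services.conf” lives in the host-side working directory, a developer’s service configuration survives this destroy/up cycle.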

VM Hardware Customization

In our “Vagrantfile” I wanted to make it dead-simple for people to modify the VM’s hardware resources. At VI some developers had very new laptops with lots of RAM while others had older machines. Putting the following Ruby variables at the top of the “Vagrantfile” made it easy for someone who knows absolutely nothing about Ruby to edit the hardware configuration of their Vagrant box:

VM_NUM_CPUS = "4"
VM_RAM_MB = "4096"
VM_SHOW_CONSOLE = false

...

  config.vm.provider "vmware_fusion" do |v|
    v.gui = VM_SHOW_CONSOLE
    v.vmx["memsize"]  = VM_RAM_MB
    v.vmx["numvcpus"] = VM_NUM_CPUS
  end
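After editing those values, applying them is a one-liner; “vagrant reload” restarts the VM with the updated “Vagrantfile” settings:

# restart the VM so the new CPU/RAM settings take effect
vagrant reload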

Conclusion

In developing an SOA application, having a Vagrant box in which developers can integrate the services they are working on has been an enormous boon for developer productivity. Downloading and running a Vagrant box is orders of magnitude faster than configuring and starting services by hand. The Vagrant box also solves the problem of “consistency of environment”, allowing developers to run their code in an environment that closely matches the QA/customer environment. In the post-mortem analysis of our VirtualWisdom 4.0 release, having Vagrant boxes for developer integration of our Java services was identified as one of the big “wins” of the release. As our Director of Engineering said, “Without the developer productivity gains from Vagrant, we would not have been able to ship VirtualWisdom 4.0 when we did.”

Docker + Jenkins: Dynamically Provisioning SLES 11 Build Containers

TL;DR

Using Jenkins’ Docker plugin, we can dynamically spin up SLES 11 build slaves on demand to run our builds. One of the hurdles to getting there was creating a SLES 11 Docker base image, since there are no SLES 11 container images available at the Docker Hub Registry. We used SUSE’s Kiwi imaging tool to create a base SLES 11 Docker image for ourselves, and then layered our build environment and Jenkins build-slave support on top of it. After configuring Jenkins’ Docker plugin to use our home-grown SLES image, we were off and running with our containerized SLES builds!

Jenkins/Docker Plugin

The path to Docker-izing our build slaves started with stumbling across this Docker Plugin for Jenkins: https://wiki.jenkins-ci.org/display/JENKINS/Docker+Plugin. This plugin allows one to use Docker to dynamically provision a build slave, run a single build, and then tear that slave down, optionally saving it. This is very similar in workflow to the build-VM provisioning system that I created while working in VMware’s Release Engineering team, but much lighter weight: compared to VMs, Docker containers can be spun up in milliseconds instead of in a few minutes, and they are much lighter on hardware resources.

The above link to the Jenkins wiki provides details about how to configure your environment as well as how to configure your container images. Some high-level notes:

  • Your base OS needs to have Docker listening on a TCP port. By default, Docker only listens on a Unix socket.
  • The container needs to run “sshd” so that Jenkins can connect to it. I suspect that once the container is provisioned, Jenkins just treats it as a plain-old SSH slave. (A smoke-test sketch of these two requirements follows this list.)
  • In my testing, the Jenkins Docker plugin was not able to connect via SSH to the containers it provisioned when using Docker 1.2.0. After trial and error, I found that the current version of the plugin (0.6) works well with Docker 1.0 through 1.1.2, but not with Docker 1.2.0+. I used Puppet to make sure that our Ubuntu build-server base VMs only had Docker 1.1.2 installed. For example:
    • # VW-10576: install docker on the ubuntu master/slaves
      # * Have Docker listen on a TCP port per instructions at:
      # https://wiki.jenkins-ci.org/display/JENKINS/Docker+Plugin
      # * Use Docker 1.1.2 and not anything newer. At the time of writing this
      # comment, Docker 1.2.0+ does not work with the Jenkins/Docker
      # plugin (the port for sshd fails to map to an external port).
      class { 'docker':
        tcp_bind => 'tcp://0.0.0.0:4243',
        version  => '1.1.2',
      }
  • There is a sample Docker/Jenkins slave based on “ubuntu:latest” available at: https://registry.hub.docker.com/u/evarga/jenkins-slave/. I would recommend getting that working as a proof-of-concept before venturing into building your own custom build slave containers. It’s helpful to be familiar with the “Dockerfile” for that image as well: https://registry.hub.docker.com/u/evarga/jenkins-slave/dockerfile/
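Before wiring the plugin up, it’s worth smoke-testing the two requirements above by hand. A sketch, assuming Docker is listening on port 4243 as configured earlier:

# pull the sample slave image and start a container with its ports published
docker -H tcp://localhost:4243 pull evarga/jenkins-slave
CID=$(docker -H tcp://localhost:4243 run -d -P evarga/jenkins-slave)

# find the host port that the container's port 22 was mapped to
PORT=$(docker -H tcp://localhost:4243 port "$CID" 22 | cut -d: -f2)

# that image sets up a "jenkins" user; if an interactive SSH login works
# here, the Jenkins plugin should be able to connect the same way
ssh -p "$PORT" jenkins@localhost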

Once you have the Docker Plugin installed, you need to go to your Jenkins “System Configuration” page and add your Docker host as a new cloud provider. In my proof-of-concept case, this is an Ubuntu 12.04 VM running Docker 1.1.2, listening on port 4243, configured to use the “evarga/jenkins-slave” image, and providing a “docker-slave” label to which I can restrict my Jenkins build jobs. The Jenkins configuration looks like this:

Jenkins' "System Configuration" for a Docker host

Jenkins’ “System Configuration” for a Docker host

I then configured a job named “docker-test” to use that “docker-slave” label and run a shell script with a few basic commands, shown below. Running the job, I see that it successfully spins up a container of “evarga/jenkins-slave” and runs the script; note the hostname at the top of the log and the output of “ps” in the screenshot that follows:
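# the "docker-test" job's entire shell build step
ps -eafwww
cat /etc/issue
java -version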

A proof-of-concept of spinning up a Docker container on demand

Creating Our SLES 11 Base Image

Having built up the confidence that we can spin up other people’s containers on-demand, we now turned to creating our SLES 11 Docker build image. For reasons that I can only assume are licensing issues, SLES 11 does not have a base image up on the Docker Hub Registry in the same vein as the images that Ubuntu, Fedora, CentOS, and others have available.

Luckily I stumbled upon the following blog post: http://flavio.castelli.name/2014/05/06/building-docker-containers-with-kiwi/

At Virtual Instruments we were already using Kiwi to build the OVAs of our build VMs, so it wasn’t much more work to follow that blog post and get Kiwi to generate a tarball that could be consumed by “docker import”. This worked well for the next proof-of-concept phase, but ultimately we decided to go down another path.
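The hand-off between the two tools is small. A sketch, with hypothetical tarball and image names, assuming Kiwi has already produced the root-filesystem tarball per the blog post above:

# import the Kiwi-generated root filesystem as a Docker base image
docker import - vi-docker.lab.vi.local/sles11-base < sles11-rootfs.tar

# sanity check: confirm the resulting image really is SLES 11
docker run --rm vi-docker.lab.vi.local/sles11-base cat /etc/SuSE-release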

Rather than have Kiwi generate fully configured build images for us, we decided it’d be best to follow the conventions of the “Docker Way”: have Kiwi generate a SLES 11 base image, reference that image in a “FROM” statement in a “Dockerfile”, and install the build environment via the Dockerfile. One of the advantages of this is that we only have to use Kiwi to generate the base image the first time; from there we can stay in Docker-land to build the subsequent images. Additionally, having a shared base image among all of our build-image tags should allow for space savings, as Docker optimizes the layering of filesystems over a common base image.

Configuring the Image for Use with Jenkins

Taking a SLES 11 image with our build environment installed and getting it to work with the Jenkins Docker plugin took a little bit of work, mainly spent trying to configure “sshd” correctly. Below is the “Dockerfile” that builds upon a SLES image with our build environment installed and prepares it for use with Jenkins:

# This Dockerfile is used to build an image containing basic
# configuration to be used as a Jenkins slave build node.

FROM vi-docker.lab.vi.local/pa-dev-env-master
MAINTAINER Dan Tehranian <REDACTED@virtualinstruments.com>


# Add user & group "jenkins" to the image and set its password
RUN groupadd jenkins
RUN useradd -m -g jenkins -s /bin/bash jenkins
RUN echo "jenkins:jenkins" | chpasswd


# Having "sshd" running in the container is a requirement of the Jenkins/Docker
# plugin. See: https://wiki.jenkins-ci.org/display/JENKINS/Docker+Plugin

# Create the ssh host keys needed for sshd
RUN ssh-keygen -A

# Fix sshd's configuration for use within the container. See VW-10576 for details.
RUN sed -i -e 's/^UsePAM .*/UsePAM no/' /etc/ssh/sshd_config
RUN sed -i -e 's/^PasswordAuthentication .*/PasswordAuthentication yes/' /etc/ssh/sshd_config

# Expose the standard SSH port
EXPOSE 22

# Start the ssh daemon
CMD ["/usr/sbin/sshd", "-D"]
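Building and publishing the image is then standard Docker fare. A sketch, with a hypothetical tag name for the resulting image:

# build the Jenkins slave image from the Dockerfile above and push it
# to our internal registry (the tag name here is hypothetical)
docker build -t vi-docker.lab.vi.local/pa-dev-env-master-jenkins .
docker push vi-docker.lab.vi.local/pa-dev-env-master-jenkins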

Running a Maven Build Inside of a SLES 11 Docker Container

Having created this new image and pushed it to our internal docker repo, we can now go back to Jenkins’ “System Configuration” page and add a new image to our Docker cloud provider. Creating a new Jenkins “Maven Job” which utilizes this new SLES 11 image and running a build, we can see our SLES 11 container getting spun up, code getting checked out from our internal git repo, and Maven being invoked:

Hooray! A successful Maven build inside of a Docker container!

Output from the Maven build that was run in the container. LGTM!

Wins

There are a whole slew of benefits to a system like this:

  • We don’t have to run & support SLES 11 VMs in our infrastructure alongside the easier-to-manage Ubuntu VMs. We can just run Ubuntu 12.04 VMs as the base OS and spin up SLES slaves as needed. This makes testing of our Puppet repository a lot easier, as it gives us a homogeneous OS environment!
  • We can have portable and separate build-environment images for each of our branches. For example, legacy product branches can continue to have old versions of the JDK and third-party libraries that are updated only when needed, while our mainline development can have a build image with tools that are updated independently.
    • This is significantly better than the “toolchain repository” solution that we had at VMware, where several hundred gigabytes of binaries were checked into a monolithic Perforce repo.
  • Thanks to Docker image tags, we can tag the build image at each GA release and keep that build environment saved. This makes reproducing builds significantly easier!
  • Having a Docker image of the build environment allows our developers to do local builds via their IDEs, if they so choose. Using Vagrant’s Docker provider, developers can spin up a Docker container of the build environment for their respective branch on their local machines, regardless of their host OS – Windows, Mac, or Linux. This allows developers to build RPMs with the same libraries and tools that the build system would!

A Local Caching Proxy for “pypi.python.org” via Docker

TL;DR

If your infrastructure automation installs packages from PyPI (the Python Package Index) via “pip” or similar tools, you can save yourself from annoying “pypi.python.org timeout” errors by running a local caching proxy of the PyPI service. After trying several of these services, I found “devpi” to be the most resilient. It’s available both as a Python package and as a Docker container that you can run in your data center.

Problem

If you have infrastructure automation that tries to install packages from PyPI, then you’ve undoubtedly encountered availability issues with the PyPI web service, hosted at “pypi.python.org”. Example email alerts that we see from our Puppet infrastructure look like:

Tue Sep 02 23:47:23 -0700 2014 /Stage[main]/Jenkins::Dev_jenkins_slave/Package[jenkins-check-for-success]
(err): Could not evaluate: Could not get latest version: 
Timeout while contacting pypi.python.org: execution expired

One could choose to ignore these sorts of connectivity issues since they are transient, but there are quite a few negative consequences to that:

  • If you’re spinning up new machines on-demand and they require a Python package as part of their configuration, then your ability to consistently spin up these machines successfully has become compromised by a dependency which is completely out of your own control.
  • If your infrastructure automation is configured to send email alerts on these types of errors, you’ll be getting un-actionable emails that add to the noise of email alerts that you get from your infrastructure. This effectively makes your alerting system less valuable, as your team will be trained to ignore their email alerts.
  • As you scale your infrastructure to hundreds or thousands of nodes, you’ll be receiving a lot of alerts about connectivity issues with “pypi.python.org” throughout the day and night. When “pypi.python.org” goes down hard for a prolonged period of time, all of the nodes in your infrastructure will simultaneously bombard you with alerts about not being able to contact it.

Solution

The solution for this problem is to run a local caching proxy for “pypi.python.org” within your data center. The Python community has developed proxy packages like pypiserver, chishop, devpi, and others specifically for this use case. After extensive research and trying several of them out, I’ve found devpi to be the most resilient as well as the most actively developed as of this writing.

One can either install devpi as a Python package (see instructions on their website) or via a Docker container. Since our infrastructure has been making the move to “All Docker Everything” I’ll write up the steps I took to setup the Docker container running devpi and how I configured our clients to use it.

Devpi Server Installation

Here’s some sample Puppet code showing how to download and run the “scrapinghub/devpi” container with an nginx proxy in front of it. (I discussed why having an nginx proxy in front is advantageous in “Private Docker Registry w/ Nginx Proxy for Stats Collection”.)

You’ll want to change “DEVPI_PASSWORD” and the hostname for the Nginx vhost below.

# devpi server & nginx configuration
docker::image { 'scrapinghub/devpi': }
docker::run { 'devpi':
    image => 'scrapinghub/devpi',
    ports => ['3141:3141',],
    use_name => true,
    env => ['DEVPI_PASSWORD=1234',],
}

nginx::resource::upstream { 'pypi_app':
    members => ['localhost:3141',],
}
nginx::resource::vhost { 'vi-pypi.lab.vi.local':
    proxy => 'http://pypi_app',
}
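Once Puppet has converged, a quick check that the proxy is answering (the “+simple” index path is the same one that shows up in the devpi logs below):

# fetch a package's simple index through the nginx vhost;
# an HTTP 200 response means devpi is up and proxying
curl -i http://vi-pypi.lab.vi.local/root/public/+simple/argparse/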

Once your container is running, you can use “docker logs” to see what it is up to. You can see the “devpi” proxy saving your bacon when “pypi.python.org” occasionally becomes unavailable, via log statements like these:

172.17.42.1 - - [22/Jul/2014 22:12:09] "GET /root/public/+simple/argparse/ HTTP/1.0" 200 4316
2014-07-22 22:12:09,251 [INFO ] requests.packages.urllib3.connectionpool: Resetting dropped connection: pypi.python.org
2014-07-22 22:12:09,301 [INFO ] devpi_server.filestore: cache-streaming: https://pypi.python.org/packages/source/a/argparse/argparse-1.2.1.tar.gz, target root/pypi/+f/2fb/ef8cb61e506c706957ab6e135840c/argparse-1.2.1.tar.gz
2014-07-22 22:12:09,301 [INFO ] devpi_server.filestore: starting file iteration: root/pypi/+f/2fb/ef8cb61e506c706957ab6e135840c/argparse-1.2.1.tar.gz (size 69297)

Python Client Configuration

On the client side, we need to configure both “pip” and “easy_install” to use the devpi container we just instantiated. This requires creating a small configuration file for each of those Python package managers, telling them to use the devpi proxy as their package index URL.

You’ll want to change the URL to point to the hostname you use within your own infrastructure.

# ~/.pip/pip.conf

[global]
index-url = http://vi-pypi.lab.vi.local/root/public/

# ~/.pydistutils.cfg

[easy_install]
index_url = http://vi-pypi.lab.vi.local/root/public/
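With those files in place, a quick way to confirm that traffic is actually flowing through the proxy (the package choice is arbitrary; “argparse” appears in the devpi logs above):

# the download URLs printed by a verbose install should point at the
# local devpi instance rather than at pypi.python.org
pip install --verbose argparse 2>&1 | grep vi-pypi.lab.vi.local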

But Wait There’s More – Uploading Your Own Python Packages

One of the additional benefits of running a local PyPI proxy is that it becomes a distribution point for your private Python packages. Instead of clumsily checking out SCM repos full of your own custom Python scripts onto each machine in your infrastructure, you can install your Python scripts as first-order Python packages, the same way you would install packages from PyPI. This lets you properly version your packages and define dependency requirements between them.

Creating a “setup.py” file for each of your Python projects is outside the scope of this post, but details can be found online. Once your Python project has its “setup.py” file, uploading your versioned package to your local devpi instance requires just a few simple commands. From our Jenkins job which publishes a new version of a Python package upon a git push:

# from the cwd of "setup.py"
devpi use http://vi-pypi.lab.vi.local/root/public/
devpi login root --password 1234 
devpi upload

More details at: http://doc.devpi.net/latest/quickstart-releaseprocess.html

Conclusion

By running a local caching proxy of “pypi.python.org” we’re able to improve the reliability of our infrastructure because we are no longer beholden to the availability of an external dependency. We also get the added benefit of having a proper Python package distribution point, which allows us to have better development & deployment practices. Finally, this local caching proxy provides better performance for installing packages, as local network copies are significantly faster than downloading from an external website.