Automating Linux Security Best Practices with Ansible

Suppose you’re setting up a new Linux server in Amazon AWS, DigitalOcean, or even within your private network. What are some of the low-hanging fruit one can knock out to improve the security of a Linux system? One of my favorite posts on this topic is “My First 5 Minutes On A Server; Or, Essential Security for Linux Servers”.

In that post Bryan Kennedy provides several tips, including:

  • Installing fail2ban – a daemon that monitors login attempts to a server and blocks suspicious activity as it occurs.
  • Modifying the configuration of “sshd” to permit neither “root” logins nor password-based logins (i.e. allow authentication with key-pairs only).
  • Configuring a software-based firewall with “iptables”.
  • Enabling automatic software security updates through the OS’s package manager.

These all seem like sane things to do on any new server, but how can one automate these steps for all of the machines in their infrastructure for speed, review-ability, and consistency? How can one do this in a way that allows them to easily modify the above rules for the different types of machines in their environment (ex. web servers should have port 80 open, but DB servers should have port 5432 open)? What are some gotchas to be aware of when doing this? Here are my notes on how to automatically bring these security best practices to the Linux machines in your infrastructure, using Ansible.

Community-Contributed Roles which Implement the Best Practices

The first rule of automating anything is: Don’t write a single line of code. Instead, see if someone else has already done the work for you. It turns out that Ansible Galaxy serves us well here.

Ansible Galaxy is full of community-contributed roles, and as it relates to this post there are existing roles which implement each of the best practices that were named above:

  • fail2ban – “tersmitten.fail2ban”
  • Hardened “sshd” configuration – “willshersystems.sshd”
  • A software firewall (ufw) – “franklinkim.ufw”
  • Unattended security upgrades – “jnv.unattended-upgrades”

Combining these well-maintained, open source roles, we can automate the implementation of the best practices from Bryan Kennedy’s blog post.

Making Security Best-Practices your Default

The pattern that I would use to automatically incorporate these third-party roles into your infrastructure is to make these roles dependencies of a “common” role that gets applied to all machines in your inventory. A “common” role is a good place to put all sorts of default-y configuration, like configuring NTP, installing VMware Tools, etc.

Assuming that your common role is called “my-company.common”, making these third-party roles dependencies is as simple as the following:

# requirements.yml
---
- src: franklinkim.ufw
- src: jnv.unattended-upgrades
- src: tersmitten.fail2ban
- src: willshersystems.sshd

# roles/my-company.common/meta/main.yml
---
dependencies:
  - { role: franklinkim.ufw }
  - { role: jnv.unattended-upgrades }
  - { role: tersmitten.fail2ban }
  - { role: willshersystems.sshd }

Wow, that was easy! All of Bryan Kennedy’s best practices can be implemented with just two files: One for downloading the required roles from Ansible Galaxy, and the other for including those roles for execution.
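For completeness, here is a minimal sketch of how those two files get used: the roles are downloaded with the standard “ansible-galaxy” command, and the common role is applied from a top-level playbook (the “site.yml” name and the “become” usage here are just illustrative):

# Download the third-party roles from requirements.yml into ./roles:
#
#   ansible-galaxy install -r requirements.yml -p roles/

# site.yml -- hypothetical top-level playbook applying the common role
---
- hosts: all
  become: yes
  roles:
    - my-company.common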

Creating Your Own Default Configuration for Security-Related Roles

Each of those roles comes with its own default configuration. That configuration can be seen by looking through the “defaults/main.yml” file in each role’s respective GitHub repo.

Suppose you wanted to provide your own configuration for these roles, overriding the authors’ defaults. How would you do that? A good place to put your own defaults for the configuration of your infrastructure is the “group_vars/all” file in your Ansible repo. Variables defined in “group_vars/all” take precedence over, and thus override, the default variables from the roles themselves.

The variable names and structure can be obtained by reading the docs for each respective role. Here’s an example of what your custom default security configuration might look like:

# group_vars/all
---
# Configure "franklinkim.ufw"
ufw_default_input_policy: DROP
ufw_default_output_policy: ACCEPT
ufw_default_forward_policy: DROP
ufw_rules:
  - { ip: '127.0.0.1/8', rule: allow }
  - { port: 22, rule: allow }

# Configure "jnv.unattended-upgrades"
unattended_automatic_reboot: false
unattended_package_blacklist: [jenkins, nginx]
unattended_mail: 'it-alerts@your-company.com'
unattended_mail_only_on_error: true

# Configure "tersmitten.fail2ban"
fail2ban_bantime: 600
fail2ban_maxretry: 3
fail2ban_services:
  - name: ssh
    enabled: true
    port: ssh
    filter: sshd
    logpath: /var/log/auth.log
    maxretry: 6
    findtime: 600

# Configure "willshersystems.sshd"
sshd_defaults:
  Port: 22
  Protocol: 2
  HostKey:
    - /etc/ssh/ssh_host_rsa_key
    - /etc/ssh/ssh_host_dsa_key
    - /etc/ssh/ssh_host_ecdsa_key
    - /etc/ssh/ssh_host_ed25519_key
  UsePrivilegeSeparation: yes
  KeyRegenerationInterval: 3600
  ServerKeyBits: 1024
  SyslogFacility: AUTH
  LogLevel: INFO
  LoginGraceTime: 120
  PermitRootLogin: no
  PasswordAuthentication: no
  StrictModes: yes
  RSAAuthentication: yes
  PubkeyAuthentication: yes
  IgnoreRhosts: yes
  RhostsRSAAuthentication: no
  HostbasedAuthentication: no
  PermitEmptyPasswords: no
  ChallengeResponseAuthentication: no
  X11Forwarding: no
  PrintMotd: yes
  PrintLastLog: yes
  TCPKeepAlive: yes
  AcceptEnv: LANG LC_*
  Subsystem: "sftp {{ sshd_sftp_server }}"
  UsePAM: yes

The above should be tailored to your needs, but hopefully you get the idea. By placing the above content in your “group_vars/all” file, it will provide the default security configuration for every machine in your infrastructure (ie. every member of the group “all”).

Overriding Your Own Defaults for Sub-Groups

The above configuration provides a good baseline of defaults for any machine in your infrastructure, but what about machines that need modifications or exceptions to these rules? Ex: Your HTTP servers need to have firewall rules that allow for traffic on ports 80 & 443, PostgreSQL servers need firewall rules for port 5432, or maybe there’s some random machine that needs X11 forwarding over SSH turned on. How can we override our own defaults?

We can continue to use the power of “group_vars” and “host_vars” to model the configuration for subsets of machines. If, for example, we wanted our web servers to have ports 80 & 443 open, but our PostgreSQL servers to have port 5432 open, we could override the variables from “group_vars/all” with respective variables in “group_vars/webservers” and “group_vars/postgresql-servers”. Ex:

# "group_vars/webservers"
---
ufw_rules:
  - { ip: '127.0.0.1/8', rule: allow }
  - { port: 22, rule: allow }
  - { port: 80, rule: allow }
  - { port: 443, rule: allow }

# "group_vars/postgresql-servers"
---
ufw_rules:
  - { ip: '127.0.0.1/8', rule: allow }
  - { port: 22, rule: allow }
  - { port: 5432, rule: allow }

Rules for individual hosts can be set using “host_vars/<hostname>”. Ex: “host_vars/nat-proxy1.your-company.com”

# host_vars/nat-proxy1.your-company.com
---
# Allow IP traffic forwarding for our NAT proxy
ufw_default_forward_policy: ACCEPT

Ansible’s “group_vars”, “host_vars”, and variable precedence are pretty powerful features. For additional reading, see the Ansible documentation on variable precedence.

Gotchas

There are a couple of “gotchas” to be aware of with the aforementioned scheme:

  1. Variable replacing vs. merging – When Ansible applies the precedence of variables from different contexts, it uses variable “replacement” as the default behavior. This makes sense for scalar variables like strings and integers, but may surprise you in the context of hashes, where you might have expected the sub-keys to be automatically merged. If you want your hashes to be merged then you need to set “hash_behaviour=merge” in your “ansible.cfg” file (see the sketch after this list). Be warned that this is a global setting though.
  2. Don’t ever forget to leave port 22 open for SSH – Building upon the first gotcha: Knowing that variables are replaced and not merged, you must remember to always include an “allow” rule for port 22 when overriding the “ufw_rules” variable. Ex:
    ufw_rules:
      - { ip: '127.0.0.1/8', rule: allow }
      - { port: 22, rule: allow }

    Omitting the rule for port 22 may leave your machine unreachable, except via console.

  3. Unattended upgrades for daemons – Unattended security upgrades are a great way to keep you protected from the next #HeartBleed or #ShellShock security vulnerability. Unfortunately, unattended upgrades can themselves cause outages.
    Imagine, for example, that you have unattended upgrades enabled on a machine with Jenkins and/or nginx installed: When the OS’s package manager checks for potential upgrades and finds that a new version of Jenkins/nginx is available, it will merrily go ahead and upgrade your package, which in turn will likely cause the package’s install/upgrade script to restart your service, thus causing a temporary and unexpected downtime. Oops!
    You can prevent specific packages from being automatically upgraded by listing them in the “unattended_package_blacklist” variable. This is especially useful for daemons. Ex:

    unattended_package_blacklist: [jenkins, nginx]
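For reference, the hash-merging behavior mentioned in gotcha #1 is controlled in “ansible.cfg”; a minimal sketch:

# ansible.cfg
[defaults]
# Merge hash variables across contexts instead of replacing them.
# Note: this is a global, repository-wide setting.
hash_behaviour = merge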
    

Conclusion

Ansible and the community-contributed code at Ansible Galaxy make it quick and easy to implement the best practices described at “My First 5 Minutes On A Server; Or, Essential Security for Linux Servers”. Ansible’s notion of “group_vars”, “host_vars”, and variable precedence provides a powerful way for one to configure these security best practices for groups of machines in their infrastructure. Using all of these together, one can automate & codify the security configuration of their machines for speed of deployment, review-ability, and consistency.

What other sorts of best-practices would be good to implement across your infrastructure? Feel free to leave comments below. Thanks!


Testing Ansible Roles with Test Kitchen

Recently while attending DevOps Days Austin 2015, I participated in a breakout session focused on how to test code for configuration management tools like Puppet, Chef, and Ansible. Having started to use Ansible to manage our infrastructure at Delphix, I was searching for a way to automate the testing of our configuration management code across a variety of platforms, including Ubuntu, CentOS, RHEL, and Delphix’s custom Illumos-based OS, DelphixOS. Dealing with testing across all of those platforms is a seemingly daunting task to say the least!

Intro to Test Kitchen

The conversation in that breakout session introduced me to Test Kitchen (GitHub), a tool that I’ve been very impressed by and have had quite a bit of fun writing tests for. Test Kitchen is a tool for automated testing of configuration management code written for tools like Ansible. It automates the process of spinning up test VMs, running your configuration management tool against those VMs, executing verification tests against those VMs, and then tearing down the test VMs.
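Each of those stages maps to its own Test Kitchen subcommand, which is handy when iterating on a role. Roughly:

kitchen create     # spin up the test VM(s)
kitchen converge   # run the provisioner (e.g. your Ansible playbook)
kitchen verify     # run the verification tests against the VM(s)
kitchen destroy    # tear the test VM(s) down
kitchen test       # all of the above, end to end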

What makes Test Kitchen so powerful and useful is its modular design: pluggable drivers handle the VM provisioning (Vagrant, AWS, Docker, etc.), pluggable provisioners run your configuration management tool (Ansible, Puppet, Chef), and pluggable test runners like BATS or Serverspec verify the results.

Using Test Kitchen

After learning about Test Kitchen at the DevOps Days conference, I did some more research and stumbled across the following presentation which was instrumental in getting started with Test Kitchen and Ansible: Testing Ansible Roles with Test Kitchen, Serverspec and RSpec (SlideShare).

In summary, one needs to add three files to their Ansible role to begin using Test Kitchen (a minimal “.kitchen.yml” sketch follows this list):

  • A “.kitchen.yml” file at the top-level. This file describes:
    • The driver to use for VM provisioning. Ex: Vagrant, AWS, Docker, etc.
    • The provisioner to use. Ex: Puppet, Chef, Ansible.
    • A list of 1 or more operating systems to test against. Ex: Ubuntu 12.04, Ubuntu 14.04, CentOS 6.5, or even a custom VM image specified by URL.
    • A list of test suites to run.
  • A “test/integration/test-suite-name/test-suite-name.yml” file which contains the Ansible playbook to be applied.
  • One or more test files in “test/integration/test-suite-name/test-driver-name/”. For example, when using the BATS test-runner to run a test suite named “default”: “test/integration/default/bats/my-test.bats”.
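Here is a minimal sketch of what that “.kitchen.yml” might look like, assuming the Vagrant driver and the kitchen-ansible “ansible_playbook” provisioner (the platform names and playbook path are illustrative):

# .kitchen.yml
---
driver:
  name: vagrant

provisioner:
  name: ansible_playbook
  hosts: all
  playbook: test/integration/default/default.yml

platforms:
  - name: ubuntu-14.04
  - name: centos-6.5

suites:
  - name: default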

Example Code

A full example of Test Kitchen w/Ansible is available via the delphix.package-caching-proxy Ansible role in Delphix’s GitHub repo, which contains working examples of the aforementioned files/directories.

Running Test Kitchen

Using Test Kitchen couldn’t be easier. From the directory that contains your “.kitchen.yml” file, just run “kitchen test” to automatically create your VMs, configure them, and run tests against them:

$ kitchen test
-----> Starting Kitchen (v1.4.1)
-----> Cleaning up any prior instances of 
-----> Destroying ...
 Finished destroying  (0m0.00s).
-----> Testing 
-----> Creating ...
 Bringing machine 'default' up with 'virtualbox' provider...
 ==> default: Importing base box 'opscode-ubuntu-14.04'...
==> default: Matching MAC address for NAT networking...
 ==> default: Setting the name of the VM: kitchen-ansible-package-caching-proxy-default-ubuntu-1404_default_1435180384440_80322
 ==> default: Clearing any previously set network interfaces...
 ==> default: Preparing network interfaces based on configuration...
 default: Adapter 1: nat
 ==> default: Forwarding ports...
 default: 22 => 2222 (adapter 1)
 ==> default: Booting VM...
 ==> default: Waiting for machine to boot. This may take a few minutes...

..  ...

-----> Running bats test suite
 ✓ Accessing the apt-cacher-ng vhost should load the configuration page for Apt-Cacher-NG
 ✓ Hitting the apt-cacher proxy on the proxy port should succeed
 ✓ The previous command that hit ftp.debian.org should have placed some files in the cache
 ✓ Accessing the devpi server on port 3141 should return a valid JSON response
 ✓ Accessing the devpi server via the nginx vhost should return a valid JSON response
 ✓ Downloading a Python package via our PyPI proxy should succeed
 ✓ We should still be able to install Python packages when the devpi contianer's backend is broken
 ✓ The vhost for the docker registry should be available
 ✓ The docker registry's /_ping url should return valid JSON
 ✓ The docker registry's /v1/_ping url should return valid JSON
 ✓ The front-end serer's root url should return http 204
 ✓ The front-end server's /_status location should return statistics from our web server
 ✓ Accessing http://www.google.com through our proxy should always return a cache miss
 ✓ Downloading a file that is not in the cache should result in a cache miss
 ✓ Downloading a file that is in the cache should result in a cache hit
 ✓ Setting the header 'X-Refresh: true' should result in a bypass of the cache
 ✓ Trying to purge when it's not in the cache should return 404
 ✓ Downloading the file again after purging from the cache should yield a cache miss
 ✓ The yum repo's vhost should return HTTP 200

 19 tests, 0 failures
 Finished verifying  (1m52.26s).
-----> Kitchen is finished. (1m52.49s)

And there you have it, one command to automate your entire VM testing workflow!

Next Steps

Giving individual developers on our team the ability to quickly run a suite of automated tests is a big win, but that’s only the first step. The workflow we’re planning is to have Jenkins also run these automated Ansible tests every time someone pushes to our git repo. If those tests succeed we can automatically trigger a run of Ansible against our production inventory. If, on the other hand, the Jenkins job which runs the tests is failing (red), we can use that to prevent Ansible from running against our production inventory. This would be a big win for validating infrastructure changes before pushing them to production.


Running Docker Containers on Windows & Mac via Vagrant

If you ever wanted to run a Docker container on a non-Linux platform (ex: Windows or Mac), here’s a “Vagrantfile” which will allow you to do that quickly and easily with Vagrant.

The Vagrantfile

For the purposes of this post, suppose that we want to run the “tutum/wordpress” Docker container on our Mac. That WordPress container comes with everything needed for a fully-functioning WordPress CMS installation, including MySQL and WordPress’s other dependencies.

VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|

  config.vm.box = "puphpet/ubuntu1404-x64"

  config.vm.provision "docker" do |d|
    d.run "tutum/wordpress", args: "-p '80:80'"
  end

  config.vm.network "forwarded_port", guest: 80, host: 8080

end

Explanation

  • This “Vagrantfile” will download the “puphpet/ubuntu1404-x64” Vagrant box which is a widely-used VirtualBox/VMware image of Ubuntu 14.04.
  • Once that Vagrant box is downloaded and the VM has booted, Vagrant will run the “docker” provisioner. The “docker” provisioner will then download and run the “tutum/wordpress” Docker container, using the “docker run” argument to expose port 80 of the container on port 80 of the Ubuntu 14.04 OS.
  • The final line of our “Vagrantfile” tells Vagrant to expose port 80 of the Ubuntu guest OS to port 8080 of our host OS (i.e., Windows or Mac OS). When we access http://localhost:8080 from our host OS, that TCP traffic will be transparently forwarded to port 80 of the guest OS which will then transparently forward the traffic to port 80 of the container. Neat!

Results

After running “vagrant up” the necessary Vagrant box and Docker container downloads will start automatically:

==> default: Waiting for machine to boot. This may take a few minutes
...
==> default: Machine booted and ready!
==> default: Forwarding ports...
 default: -- 80 => 8080
 default: -- 22 => 2222
==> default: Configuring network adapters within the VM...
==> default: Waiting for HGFS kernel module to load...
==> default: Enabling and configuring shared folders...
 default: -- /Users/tehranian/Downloads/boxes/docker-wordpress: /vagrant
==> default: Running provisioner: docker...
 default: Installing Docker (latest) onto machine...
 default: Configuring Docker to autostart containers...
==> default: Starting Docker containers...
==> default: -- Container: tutum/wordpress

Once our “vagrant up” has completed, we can access the WordPress app that is running within the Docker container by pointing our web browser to http://localhost:8080. This takes us to the WordPress setup wizard where we can finish the installation and configuration of WordPress:

Wordpress Setup Wizard

 

Voila! A Docker container running quickly and easily on your Mac or Windows PC!

Building Docker Images within Docker Containers via Jenkins

If you’re like me and you’ve Dockerized your build process by running your Jenkins builds from within dynamically provisioned Docker containers, where do you turn next? You may want the creation of any Docker images themselves to also happen within Docker containers. In other words, running Docker nested within Docker (DinD).


I’ve recently published a Docker image to facilitate building other Docker images from within Jenkins/Docker slave containers. Details at:

Why would one want to build Docker images nested within Docker containers?

  1. For consistency. If you’re building your JARs, RPMs, etc, from within Docker containers, it makes sense to use the same high-level process for building other artifacts such as Docker images.
  2. For Docker version freedom. As I mentioned in a previous post, the Jenkins/Docker plugin can be finicky with regards to compatibility with the version of Docker that you are running on your base OS. In other words, Jenkins/Docker plugin 0.7 will not work with Docker 1.2+, so if you really need a feature from a newer version of Docker when building your images you either have to wait for a fix from the Jenkins plugin author, or you can run Docker-nested-in-Docker with the Jenkins plugin-compatible Docker 1.1.x on the host and a newer version of Docker nested within the container. Yes, this actually works!
  3. This:

Docker + Jenkins: Dynamically Provisioning SLES 11 Build Containers

TL; DR

Using Jenkins’ Docker Plugin, we can dynamically spin-up SLES 11 build slaves on-demand to run our builds. One of the hurdles to getting there was to create a SLES 11 Docker base-image, since there are no SLES 11 container images available at the Docker Hub Registry. We used SUSE’s Kiwi imaging tool to create a base SLES 11 Docker image for ourselves, and then layered our build environment and Jenkins build slave support on top of it. After configuring Jenkins’ Docker plugin to use our home-grown SLES image, we were off and running with our containerized SLES builds!

Jenkins/Docker Plugin

The path to Docker-izing our build slaves started with stumbling across this Docker Plugin for Jenkins: https://wiki.jenkins-ci.org/display/JENKINS/Docker+Plugin. This plugin allows one to use Docker to dynamically provision a build slave, run a single build, and then tear down that slave, optionally saving it. This is very similar in workflow to the build VM provisioning system that I created while working in VMware’s Release Engineering team, but much lighter weight. Compared to VMs, Docker containers can be spun up in milliseconds instead of in a few minutes, and Docker containers are much lighter on hardware resources.

The above link to the Jenkins wiki provides details about how to configure your environment as well as how to configure your container images. Some high-level notes:

  • Your base OS needs to have Docker listening on a TCP port. By default, Docker only listens on a Unix socket.
  • The container needs to run “sshd” for Jenkins to connect to it. I suspect that once the container is provisioned, Jenkins just treats it as a plain-old SSH slave.
  • In my testing, the Docker/Jenkins plugin was not able to connect via SSH to the containers it provisioned when using Docker 1.2.0. After trial and error, I found that the current version of the Jenkins plugin (0.6) works well with Docker 1.0-1.1.2, but Docker 1.2.0+ did not work with this Jenkins Plugin. I used Puppet to make sure that our Ubuntu build server base VMs only had Docker 1.1.2 installed. Ex:
    • # VW-10576: install docker on the ubuntu master/slaves
      # * Have Docker listen on a TCP port per instructions at:
      # https://wiki.jenkins-ci.org/display/JENKINS/Docker+Plugin
      # * Use Docker 1.1.2 and not anything newer. At the time of writing this
      # comment, Docker 1.2.0+ does not work with the Jenkins/Docker
      # plugin (the port for sshd fails to map to an external port).
      class { 'docker':
        tcp_bind => 'tcp://0.0.0.0:4243',
        version  => '1.1.2',
      }
  • There is a sample Docker/Jenkins slave based on “ubuntu:latest” available at: https://registry.hub.docker.com/u/evarga/jenkins-slave/. I would recommend getting that working as a proof-of-concept before venturing into building your own custom build slave containers. It’s helpful to be familiar with the “Dockerfile” for that image as well: https://registry.hub.docker.com/u/evarga/jenkins-slave/dockerfile/

Once you have the Docker Plugin installed, you need to go to your Jenkins “System Configuration” page and add your Docker host as a new cloud provider. In my proof-of-concept case, this is an Ubuntu 12.04 VM running Docker 1.1.2, listening on port 4243, configured to use the “evarga/jenkins-slave” image, providing the “docker-slave” label which I can then configure my Jenkins build job to be restricted to. The Jenkins configuration looks like this:

Jenkins' "System Configuration" for a Docker host

Jenkins’ “System Configuration” for a Docker host

I then configured a job named “docker-test” to use that “docker-slave” label and run a shell script with basic commands like “ps -eafwww”, “cat /etc/issue”, and “java -version”. Running that job, I see that it successfully spins up a container of “evarga/jenkins-slave” and runs my little script. Note the hostname at the top of the log, and output of “ps” in the screenshot below:

A proof-of-concept of spinning up a Docker container on demand

 

Creating Our SLES 11 Base Image

Having built up the confidence that we can spin up other people’s containers on-demand, we now turned to creating our SLES 11 Docker build image. For reasons that I can only assume are licensing issues, SLES 11 does not have a base image up on the Docker Hub Registry in the same vein as the images that Ubuntu, Fedora, CentOS, and others have available.

Luckily I stumbled upon the following blog post: http://flavio.castelli.name/2014/05/06/building-docker-containers-with-kiwi/

At Virtual Instruments we were already using Kiwi to build the OVAs of our build VMs, so we were familiar with the tool. It wasn’t much more work to follow that blog post and get Kiwi to generate a tarball that could be consumed by “docker import”. This worked well for the next proof-of-concept phase, but ultimately we decided to go down another path.

Rather than have Kiwi generate fully configured build images for us, we decided it’d be best to follow the conventions of the “Docker Way” and have Kiwi generate a SLES 11 base image which we could then reference with a “FROM” statement in a “Dockerfile”, installing the build environment via that Dockerfile. One of the advantages of this is that we only have to use Kiwi to generate the base image the first time. From there we can stay in Docker-land to build the subsequent images. Additionally, having a shared base image among all of our build image tags should allow for space savings, as Docker optimizes the layering of filesystems over a common base image.
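For illustration, importing the Kiwi-generated tarball as a Docker base image is a one-time step along these lines (the tarball and image names here are hypothetical):

# One-time import of the Kiwi-generated root filesystem tarball as a base image
cat sles11-base.tar.gz | docker import - vi-docker.lab.vi.local/sles11-base

# Subsequent images can then build on top of it in their Dockerfiles:
#   FROM vi-docker.lab.vi.local/sles11-base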

Configuring the Image for Use with Jenkins

Taking a SLES 11 image with our build environment installed and getting it to work with the Jenkins Docker plugin took a little bit of work, mainly spent trying to configure “sshd” correctly. Below is the “Dockerfile” that builds upon a SLES image with our build environment installed and prepares it for use with Jenkins:

# This Dockerfile is used to build an image containing basic
# configuration to be used as a Jenkins slave build node.

FROM vi-docker.lab.vi.local/pa-dev-env-master
MAINTAINER Dan Tehranian <REDACTED@virtualinstruments.com>


# Add user & group "jenkins" to the image and set its password
RUN groupadd jenkins
RUN useradd -m -g jenkins -s /bin/bash jenkins
RUN echo "jenkins:jenkins" | chpasswd


# Having "sshd" running in the container is a requirement of the Jenkins/Docker
# plugin. See: https://wiki.jenkins-ci.org/display/JENKINS/Docker+Plugin

# Create the ssh host keys needed for sshd
RUN ssh-keygen -A

# Fix sshd's configuration for use within the container. See VW-10576 for details.
RUN sed -i -e 's/^UsePAM .*/UsePAM no/' /etc/ssh/sshd_config
RUN sed -i -e 's/^PasswordAuthentication .*/PasswordAuthentication yes/' /etc/ssh/sshd_config

# Expose the standard SSH port
EXPOSE 22

# Start the ssh daemon
CMD ["/usr/sbin/sshd -D"]

Running a Maven Build Inside of a SLES 11 Docker Container

Having created this new image and pushed it to our internal docker repo, we can now go back to Jenkins’ “System Configuration” page and add a new image to our Docker cloud provider. Creating a new Jenkins “Maven Job” which utilizes this new SLES 11 image and running a build, we can see our SLES 11 container getting spun up, code getting checked out from our internal git repo, and Maven being invoked:

Hooray! A successful Maven build inside of a Docker container!

Output from the Maven Build that was run in the container. LGTM!

 

Wins

There are a whole slew of benefits to a system like this:

  • We don’t have to run & support SLES 11 VMs in our infrastructure alongside the easier-to-manage Ubuntu VMs. We can just run Ubuntu 12.04 VMs as the base OS and spin up SLES slaves as needed. This makes testing of our Puppet repository a lot easier as this gives us a homogeneous OS environment!
  • We can have portable and separate build environment images for each of our branches. Ex: legacy product branches can continue to have old versions of the JDK and third party libraries that are updated only when needed, but our mainline development can have a build image with tools that are updated independently.
    • This is significantly better than the “toolchain repository” solution that we had at VMware, where several 100s of GBs of binaries were checked into a monolithic Perforce repo.
  • Thanks to Docker image tags, we can tag the build image at each GA release and keep that build environment saved. This makes reproducing builds significantly easier!
  • Having a Docker image of the build environment allows our developers to do local builds via their IDEs, if they so choose. Using Vagrant’s Docker provider, developers can spin up a Docker container of the build environment for their respective branch on their local machines, regardless of their host OS – Windows, Mac, or Linux. This allows developers to build RPMs with the same libraries and tools that the build system would!

Vagrant/VMware: Resolving “Waiting for HGFS kernel module” Timeouts

TL; DR

After you upgrade your VMware-based Vagrant box’s kernel, you’ll experience a timeout error on reboot during “Waiting for HGFS kernel module to load…”. You can fix this by enabling VMware Tools’s built-in “automatic kernel modules” feature via:

echo "answer AUTO_KMODS_ENABLED yes" | sudo tee -a /etc/vmware-tools/locations

Problem Description

The HGFS (or Host-Guest File System) driver is a VMware extension that provides the shared folder support for your VMware VM. Vagrant in turn uses this feature to provide its default shared folder implementation when using the VMware provider.

When you upgrade your Linux OS’s kernel and reboot, the new kernel does not have the HGFS driver available, and Vagrant will time out waiting for the driver to load when trying to set up the shared folders. The error looks like this:

$ time vagrant reload
...
==> default: Machine booted and ready!
...
==> default: Waiting for HGFS kernel module to load...
The HGFS kernel module was not found on the running virtual machine.
This must be installed for shared folders to work properly. Please
install the VMware tools within the guest and try again. Note that
the VMware tools installation will succeed even if HGFS fails
to properly install. Carefully read the output of the VMware tools
installation to verify the HGFS kernel modules were installed properly.

real    4m43.252s
user    0m8.948s
sys     0m1.400s

There are several potential solutions here:

  1. Never upgrade your Linux kernel. Heh.
  2. Disable Vagrant’s default shared folder via your “Vagrantfile”:
    config.vm.synced_folder ".", "/vagrant", disabled: true
  3. Use one of the alternative shared folder implementations like NFS or rsync. These in turn have their own drawbacks.
  4. Enable VMware Tools’s “automatic kernel modules” feature to have missing kernel modules automatically built upon boot. Details below.

Solution – Enabling VMware Tools’s Automatic Kernel Modules Feature

You can either run “sudo vmware-config-tools.pl” and answer “yes” to the following question:

VMware automatic kernel modules enables automatic building and installation of
VMware kernel modules at boot that are not already present. This feature can be
enabled/disabled by re-running vmware-config-tools.pl.

Would you like to enable VMware automatic kernel modules?
[no]

… or you can achieve the same effect by appending a line to a system file:

echo "answer AUTO_KMODS_ENABLED yes" | sudo tee -a /etc/vmware-tools/locations

More details are available in: https://github.com/mitchellh/vagrant/issues/4362

Private Docker Registry w/Nginx Proxy for Stats Collection

In using Docker at Virtual Instruments, we had to setup an internal Docker Registry in order to share our custom Docker images. Setting up a basic internal Docker Registry is trivially easy, but there was one issue we hit in production-izing this app.

In VI’s R&D infrastructure we run all of our web apps through proxies like Nginx for the following benefits:

  • We can run the app itself via an unprivileged user on an unprivileged port (>1024) and proxy HTTP traffic from port 80 on that host to the unprivileged app’s port. This means that our app does not need to run with root permissions in order to receive requests from port 80.
  • What follows from the above is the convenience that humans don’t have to remember the magical TCP port number for each of the numerous internal web apps that we have (ex, 5000 for Docker, 8080 for Jenkins, etc). A user can just put the DNS name of the server in their browser and go.
  • We can enable nginx’s “stub_status” module (HttpStubStatusModule) in our nginx proxy in order to provide a URL that displays realtime stats about our web server’s traffic, e.g. the number of HTTP connections or HTTP requests/second. We can then use collectd’s nginx plugin to collect that data and send it to our monitoring systems for analysis & visualization. The end result is something like this:

Graphite Visualization of collectd/nginx Data
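For reference, the “/nginx_status” URL that collectd scrapes comes from that stub_status module; the hand-written equivalent of the Puppet-managed location shown further below is roughly:

# Inside the vhost's server block -- what the Puppet-managed location expands to
location /nginx_status {
    stub_status on;
    access_log  off;
}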

The snag that we hit is that with our tried-and-true basic nginx proxy configuration, “docker push” commands from our clients were failing without any useful error messages. Googling around, I figured out that when serving a Docker Registry via nginx there is special nginx configuration required to enable chunked encoding of messages. A sample nginx config is available in the docker-registry repo at https://github.com/docker/docker-registry/blob/0.7.3/contrib/nginx.conf (also referenced in the Puppet code below).

After manually making the above configuration changes to our nginx proxy, our test “docker push”es were finally completing successfully. The last step was to put all of this configuration into Puppet, our configuration management system.

In case anyone would benefit from this, I have provided the meat of the Puppet configuration for a private Docker Registry w/nginx proxy serving web server stats to collectd here:

  # Create an internal docker registry w/an nginx proxy in front of it
  # directing requests from port 80. Also add a "/nginx_status" URL so that
  # we can monitor web server metrics via collectd.

  include docker

  $docker_dir = '/srv/docker'  # where to store images

  file { $docker_dir:
    ensure => directory,
    owner  => 'root',
    group  => 'root',
  }

  # see docs on docker's registry app at:
  #   https://github.com/dotcloud/docker-registry/blob/master/README.md

  docker::image { 'registry': }

  docker::run { 'registry':
    image           => 'registry',
    ports           => ['5000:5000',],
    volumes         => ["${docker_dir}:/tmp/registry"],
    use_name        => true,
    env             => ['SETTINGS_FLAVOR=local',],
    restart_service => true,
    privileged      => false,
    require         => File[$docker_dir],
  }

  # the following nginx vhost proxy and location are configured per:
  #   https://github.com/docker/docker-registry/blob/0.7.3/contrib/nginx.conf
  #
  # "nginx-extras" is the Ubuntu 12.04 nginx package with extra modules like
  # "chunkin" compiled in
  class { 'nginx':
    confd_purge   => true,
    vhost_purge   => true,
    manage_repo   => false,
    package_name  => 'nginx-extras',
  }
  nginx::resource::upstream { 'docker_registry_app':
    members => ['localhost:5000',],
  }
  nginx::resource::vhost { 'docker_registry_app':
    server_name          => ['vi-docker.lab.vi.local'],
    proxy                => 'http://docker_registry_app',
    listen_options       => 'default_server',
    client_max_body_size => 0,  # allow for unlimited image upload size
    proxy_set_header     => [
      'Host $http_host',
      'X-Real-IP $remote_addr',
    ],
    vhost_cfg_append     => {
      chunkin    => 'on',
      error_page => '411 = @my_411_error',
    },
  }
  nginx::resource::location { 'my_411_error':
    vhost               => 'docker_registry_app',
    location            => '@my_411_error',
    location_custom_cfg => {
      # HACK: the value for chunkin_resume key needs to be non-empty string
      'chunkin_resume' => ' ',
    },
  }

  # nginx_status URL for collectd to pull stats from. see:
  #   https://collectd.org/wiki/index.php/Plugin:nginx
  nginx::resource::location { 'nginx_status':
    ensure              => present,
    location            => '/nginx_status',
    stub_status         => true,
    vhost               => 'docker_registry_app',
    location_cfg_append => {
      'access_log' => 'off',
    },
  }

  # collectd configuration for nginx
  class { '::collectd':
    purge        => true,
    recurse      => true,
    purge_config => true,
  }
  class { 'collectd::plugin::nginx':
    url   => 'http://vi-docker.lab.vi.local/nginx_status',
  }