Vagrant/VMware: Resolving “Waiting for HGFS kernel module” Timeouts

TL;DR

After you upgrade the kernel inside a VMware-based Vagrant box, the next reboot will time out at “Waiting for HGFS kernel module to load…”. You can fix this by enabling VMware Tools’ built-in “automatic kernel modules” feature via:

echo "answer AUTO_KMODS_ENABLED yes" | sudo tee -a /etc/vmware-tools/locations

Problem Description

The HGFS (Host-Guest File System) driver is a VMware extension that provides shared folder support for your VMware VM. Vagrant in turn uses this feature for its default shared folder implementation when using the VMware provider.

When you upgrade your Linux OS’s kernel and reboot, the new kernel does not have the HGFS driver available, and Vagrant will time out waiting for the driver to load while trying to set up the shared folders. The error looks like this:

$ time vagrant reload
...
==> default: Machine booted and ready!
...
==> default: Waiting for HGFS kernel module to load...
The HGFS kernel module was not found on the running virtual machine.
This must be installed for shared folders to work properly. Please
install the VMware tools within the guest and try again. Note that
the VMware tools installation will succeed even if HGFS fails
to properly install. Carefully read the output of the VMware tools
installation to verify the HGFS kernel modules were installed properly.

real    4m43.252s
user    0m8.948s
sys     0m1.400s

There are several potential solutions here:

  1. Never upgrade your Linux kernel. Heh.
  2. Disable Vagrant’s default shared folder via your “Vagrantfile”:
    config.vm.synced_folder ".", "/vagrant", disabled: true
  3. Use one of the alternative shared folder implementations like NFS or rsync. These in turn have their own drawbacks.
  4. Enable VMware Tools’ “automatic kernel modules” feature to have missing kernel modules automatically built at boot. Details below.

Solution – Enabling VMware Tools’ Automatic Kernel Modules Feature

You can either run “sudo vmware-config-tools.pl” and answer “yes” to the following question:

VMware automatic kernel modules enables automatic building and installation of
VMware kernel modules at boot that are not already present. This feature can be
enabled/disabled by re-running vmware-config-tools.pl.

Would you like to enable VMware automatic kernel modules?
[no]

… or you can achieve the same result as answering “yes” to that question by appending a line to a system file:

echo "answer AUTO_KMODS_ENABLED yes" | sudo tee -a /etc/vmware-tools/locations

More details are available at: https://github.com/mitchellh/vagrant/issues/4362


Building a Better Dashboard for Virtual Infrastructure

TL;DR

We built a pretty sweet dashboard for our R&D infrastructure using Graphite, Grafana, collectd, and a home-made VMware VIM API data collector script.

Screenshot of Our Dashboard

Background – Tracking Build Times

The second major project I worked on after joining Virtual Instruments improved our product’s build time 15x, from 5 hours down to 20 minutes. Having spent a lot of time and effort to accomplish that goal, I wanted to track build time metrics going forward by having the duration of each successful build recorded into a system where it could be visualized. Setting up the collection and visualization of this data seemed like a great intern project, so for the Summer of 2014 we brought back our awesome intern from the previous Summer, Ryan Henrick.

Within a few weeks Ryan was able to set up a Graphite instance and add a “post-build action” to each of our Jenkins jobs (via our DSL’d job definitions in SCM) so that Jenkins pushes this build time data to the Graphite server. From Graphite’s web UI we could then visualize the build time of each component of our product.
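One common way to get a data point into Graphite is Carbon’s plaintext protocol: one “<metric path> <value> <timestamp>” line per sample over TCP (port 2003 by default). Whatever the Jenkins post-build action does under the hood, a minimal Python sketch of pushing a single build-duration sample looks like this; the metric path, the 1200-second duration, and the Carbon port are assumptions:

# Minimal sketch: push one build-duration sample to Carbon's plaintext
# protocol on the Graphite server. The metric path and duration below are
# placeholders, not our actual Jenkins job names.
import socket
import time

def send_build_time(metric, seconds,
                    host='vi-devops-graphite1.lab.vi.local', port=2003):
    line = '%s %d %d\n' % (metric, seconds, int(time.time()))
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(line.encode('ascii'))
    finally:
        sock.close()

send_build_time('build.component-x.duration_seconds', 1200)

Anything sent under a build.* path lands in the “build” retention schema defined in the Puppet snippet below.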

Here’s an example of our build time data visualized in Graphite. As you can see, it is functionally correct, but it’s difficult to get excited about using something like this:

Graphite Visualization of Build Times

Setting up the Graphite server with Puppet looks something like this:

  class { 'graphite':
    gr_max_creates_per_minute    => 'inf',
    gr_max_updates_per_second    => 2000,

    # This configuration reads from top to bottom, returning first match
    # of a regex pattern. "default" catches anything that did not match.
    gr_storage_schemas        => [
      {
        name                => 'build',
        pattern             => '^build.*',
        retentions          => '1h:60d,12h:5y'
      },
      {
        name                => 'carbon',
        pattern             => '^carbon\.',
        retentions          => '1m:1d,10m:20d'
      },
      {
        name                => 'collectd',
        pattern             => '^collectd.*',
        retentions          => '10s:20m,30s:1h,3m:6h,12m:1d,1h:7d,4h:60d,1d:2y'
      },
      {
        name                => 'vCenter',
        pattern             => '^vCenter.*',
        retentions          => '1m:12h,5m:1d,1h:7d,4h:2y'
      },
      {
        name                => 'default',
        pattern             => '.*',
        retentions          => '10s:20m,30s:1h'
      }
    ],
    gr_timezone                => 'America/Los_Angeles',
    gr_web_cors_allow_from_all => true,
  }

There are some noteworthy settings there:

  • gr_storage_schemas – This is how you define the stat roll-ups, i.e. “10 second data for the first 2 hours, 1 minute data for the next 24 hours, 10 minute data for the next 7 days”, etc. See the Graphite docs for details; a short sketch after this list shows how a retention string translates into archive sizes.
  • gr_max_creates_per_minute – This limits how many new “buckets” Graphite will create per minute. If you leave it at the default of “50” and then unleash all kinds of new data/metrics onto Graphite, it may take some time for the database files for each of the new metrics to be created. See Graphite docs for details.
  • gr_web_cors_allow_from_all – This is needed for the Grafana UI to talk to Graphite (see later in this post).
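To make those retention strings more concrete, here’s a rough sketch (not whisper’s actual parser) of how each “precision:length” pair in a schema becomes one archive holding length / precision data points:

# Rough illustration of how Graphite retention strings map to archives:
# each "precision:length" pair becomes one archive that stores
# length / precision data points. Unit letters follow whisper's
# conventions (s, m, h, d, w, y; a year is treated as 365 days).
UNITS = {'s': 1, 'm': 60, 'h': 3600, 'd': 86400, 'w': 604800, 'y': 31536000}

def to_seconds(token):
    # '10s' -> 10, '20m' -> 1200, '2y' -> 63072000
    return int(token[:-1]) * UNITS[token[-1]]

def describe(retentions):
    for pair in retentions.split(','):
        precision, length = pair.split(':')
        points = to_seconds(length) // to_seconds(precision)
        print('%-4s resolution for %-4s -> %6d points' % (precision, length, points))

# The collectd schema from the Puppet snippet above:
describe('10s:20m,30s:1h,3m:6h,12m:1d,1h:7d,4h:60d,1d:2y')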

Collecting More Data with “collectd”

After completing the effort to log all Jenkins build time data to Graphite, our attention turned to what other sorts of things we could measure and visualize. We did a one-day R&D Hackathon project around collecting metrics from the various production web apps that we run within the R&D org, e.g. Jira, Confluence, Jenkins, Nexus, Docker Registry, etc.

Over the course of our Hackathon project we realized that there is a plethora of tools available for integrating with Graphite. Most interesting to us were “collectd”, with its extensive set of plugins, and Grafana, a very rich browser-based JavaScript UI for rendering Graphite data in much more impressive ways than the static “Build Time” chart included above.

We quickly got to work and leveraged Puppet’s collectd module to collect metrics about CPU, RAM, network, disk I/O, disk space, and swap activity from the guest OSes of all of our production VMs that run Puppet. We were also able to quickly implement collectd’s postgresql and nginx plugins to measure stats about the aforementioned web apps.

Using Puppet, we can deploy collectd onto all of our production VMs and configure it to send its data to our Graphite server with something like:

  # Install collectd for SLES & Ubuntu hosts; not on OpenSuSE (yet)
  case $::operatingsystem {
    'SLES', 'Ubuntu': {
      class { '::collectd':
        purge        => true,
        recurse      => true,
        purge_config => true,
        version      => installed,
      }
      class { 'collectd::plugin::df': }
      class { 'collectd::plugin::disk':
        disks          => ['/^dm/'],
        ignoreselected => true,
      }
      class { 'collectd::plugin::swap':
        reportbydevice => false,
        reportbytes    => true,
      }
      class { 'collectd::plugin::write_graphite':
        protocol     => 'tcp',
        graphitehost => 'vi-devops-graphite1.lab.vi.local',
      }
      collectd::plugin { 'cpu': }
      collectd::plugin { 'interface': }
      collectd::plugin { 'load': }
      collectd::plugin { 'memory': }
      collectd::plugin { 'nfs': }
    }
    default: {}
  }

For another example with details on monitoring an HTTP server, see my previous post: Private Docker Registry w/Nginx Proxy for Stats Collection

Collecting vCenter Data with pyVmomi

After the Hackathon wrapped up, I pointed the intern at the pyvmomi Python module and the sample scripts available at http://vmware.github.io/pyvmomi-community-samples/, and told him to get crackin’ on a Python data collector that would connect to our vCenter instance; grab metrics on ESXi cluster CPU and memory usage, VMFS datastore usage, and per-VM CPU and memory usage; and ship that data off to our Graphite server.
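That collector isn’t included in this post, so here is only a minimal pyVmomi sketch of the idea: connect to vCenter, walk the ESXi hosts, and emit Graphite-style lines for their CPU/memory quick stats. The vCenter hostname, credentials, and metric paths are placeholders; the vCenter.* prefix simply matches the retention schema shown earlier.

# Minimal pyVmomi sketch: pull per-ESXi-host quick stats from vCenter and
# print Graphite plaintext lines. Hostname and credentials are placeholders;
# newer pyVmomi releases may also require an sslContext argument when
# vCenter uses a self-signed certificate.
import time

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host='vcenter.lab.example.com', user='readonly', pwd='secret')
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    now = int(time.time())
    for host in view.view:
        stats = host.summary.quickStats
        name = host.name.replace('.', '_')
        print('vCenter.esx.%s.cpu_mhz %d %d' % (name, stats.overallCpuUsage, now))
        print('vCenter.esx.%s.mem_mb %d %d' % (name, stats.overallMemoryUsage, now))
finally:
    Disconnect(si)

In the real collector these lines would be written to Carbon on a schedule (e.g. every minute, matching the 1m resolution of the vCenter schema) rather than printed.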

Building a Beautiful Dashboard with Grafana

With all of this collectd and vCenter data now being collected in Graphite, the only thing left was to create a beautiful dashboard for visualizing what’s going on in our R&D infrastructure. We leveraged Puppet’s grafana module to setup a Grafana instance.

Grafana is a browser-based JavaScript app that talks directly to Graphite’s WWW API (bypassing Graphite’s very rudimentary static image rendering layer) and pulls in the raw series data that you’ve stored in Graphite. Grafana lets you build dashboards suitable for viewing in your web browser or for putting up on a TV display. It has all kinds of sexy UI features like drag-to-zoom, mouse-over-for-info, and annotation of events.
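For example, the raw series data can be pulled straight from Graphite’s render endpoint as JSON, which is essentially what Grafana does behind the scenes. A small sketch follows; the target metric path is an assumption, so substitute any series you have stored:

# Query Graphite's render API for raw series data as JSON (the same
# WWW API that Grafana talks to). The target metric path is a placeholder.
import json
import urllib.request

url = ('http://vi-devops-graphite1.lab.vi.local/render'
       '?target=build.component-x.duration_seconds&from=-7d&format=json')
with urllib.request.urlopen(url) as resp:
    series = json.loads(resp.read().decode('utf-8'))

for s in series:
    points = [p for p in s['datapoints'] if p[0] is not None]
    print('%s: %d non-null points' % (s['target'], len(points)))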

There’s a great live demo of Grafana available at: http://play.grafana.org

Setting up the Grafana server with Puppet is pretty simple. The one trick is that in order to save your dashboards you need to set up Elasticsearch:

  class { 'elasticsearch':
    package_url  => 'https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.2.2.deb',
    java_install => true,
  }

  elasticsearch::instance { 'grafana': }

  class { 'nginx':
    confd_purge => true,
    vhost_purge => true,
  }

  nginx::resource::vhost { 'localhost':
    www_root       => '/opt/grafana',
    listen_options => 'default_server',
  }

  class { 'grafana':
    install_dir         => '/opt',
    elasticsearch_host  => $::fqdn,
  }

Closing Thoughts

One of the main benefits of this solution is that our data from disparate sources lives in one place, and we can plot these series against one another. Unlike vCenter’s “Performance” graphs, which are slow to set up and regularly time out before rendering, days or weeks of rolled-up data load very quickly in Graphite/Grafana.

Screenshot of Our Dashboard

Acknowledgements

  • Thanks to our intern Ryan Henrick for his work this past Summer in implementing these data collectors and creating the visualizations.
  • Thanks to Alan Grosskurth from Package Lab for pointing me to Grafana when I told him what we were up to with our Hackathon project.

Private Docker Registry w/Nginx Proxy for Stats Collection

In using Docker at Virtual Instruments, we had to set up an internal Docker Registry in order to share our custom Docker images. Setting up a basic internal Docker Registry is trivially easy, but there was one issue we hit in productionizing this app.

In VI’s R&D infrastructure we run all of our web apps through proxies like Nginx for the following benefits:

  • We can run the app itself via an unprivileged user on an unprivileged port (>1024) and proxy HTTP traffic from port 80 on that host to the unprivileged app’s port. This means that our app does not need to run with root permissions in order to receive requests from port 80.
  • What follows from the above is the convenience that humans don’t have to remember the magical TCP port number for each of the numerous internal web apps that we have (e.g. 5000 for Docker, 8080 for Jenkins, etc.). A user can just put the DNS name of the server in their browser and go.
  • We can configure the “HttpStubStatus” module into our nginx proxy in order to provide a URL that displays real-time stats about our web server’s traffic, e.g. the number of HTTP connections or HTTP requests per second. We can then use collectd’s nginx plugin to collect that data and send it to our monitoring systems for analysis and visualization. The end result is something like this (a small sketch of reading stub_status by hand follows the screenshot):

Graphite Visualization of collectd/nginx Data
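For reference, stub_status serves a tiny plaintext page, and reading it by hand takes only a few lines. The sketch below fetches the /nginx_status location configured in the Puppet code further down; it peeks at the same data collectd’s nginx plugin scrapes, but it is not the plugin itself.

# Fetch nginx's stub_status page (the /nginx_status location configured in
# the Puppet code below) and pull out the active connection count.
import re
import urllib.request

with urllib.request.urlopen('http://vi-docker.lab.vi.local/nginx_status') as resp:
    status = resp.read().decode('ascii')

# stub_status output begins with a line like: "Active connections: 291"
match = re.search(r'Active connections:\s+(\d+)', status)
if match:
    print('active connections:', match.group(1))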

The snag we hit is that with our tried-and-true basic nginx proxy configuration, “docker push” commands from our clients were failing without any useful error messages. Googling around, I figured out that serving a Docker Registry through nginx requires special configuration to support chunked transfer encoding; the sample nginx config referenced in the Puppet code below covers the required directives.

After manually making those configuration changes on our nginx proxy, our test “docker push”es finally completed successfully. The last step was to put all of this configuration into Puppet, our configuration management system.

In case anyone would benefit from this, I have provided the meat of the Puppet configuration for a private Docker Registry w/nginx proxy serving web server stats to collectd here:

  # Create an internal docker registry w/an nginx proxy in front of it
  # directing requests from port 80. Also add a "/nginx_status" URL so that
  # we can monitor web server metrics via collectd.

  include docker

  $docker_dir = '/srv/docker'  # where to store images

  file { $docker_dir:
    ensure => directory,
    owner  => 'root',
    group  => 'root',
  }

  # see docs on docker's registry app at:
  #   https://github.com/dotcloud/docker-registry/blob/master/README.md

  docker::image { 'registry': }

  docker::run { 'registry':
    image           => 'registry',
    ports           => ['5000:5000',],
    volumes         => ["${docker_dir}:/tmp/registry"],
    use_name        => true,
    env             => ['SETTINGS_FLAVOR=local',],
    restart_service => true,
    privileged      => false,
    require         => File[$docker_dir],
  }

  # the following nginx vhost proxy and location are configured per:
  #   https://github.com/docker/docker-registry/blob/0.7.3/contrib/nginx.conf
  #
  # "nginx-extras" is the Ubuntu 12.04 nginx package with extra modules like
  # "chunkin" compiled in
  class { 'nginx':
    confd_purge   => true,
    vhost_purge   => true,
    manage_repo   => false,
    package_name  => 'nginx-extras',
  }
  nginx::resource::upstream { 'docker_registry_app':
    members => ['localhost:5000',],
  }
  nginx::resource::vhost { 'docker_registry_app':
    server_name          => ['vi-docker.lab.vi.local'],
    proxy                => 'http://docker_registry_app',
    listen_options       => 'default_server',
    client_max_body_size => 0,  # allow for unlimited image upload size
    proxy_set_header     => [
      'Host $http_host',
      'X-Real-IP $remote_addr',
    ],
    vhost_cfg_append     => {
      chunkin    => 'on',
      error_page => '411 = @my_411_error',
    },
  }
  nginx::resource::location { 'my_411_error':
    vhost               => 'docker_registry_app',
    location            => '@my_411_error',
    location_custom_cfg => {
      # HACK: the value for chunkin_resume key needs to be non-empty string
      'chunkin_resume' => ' ',
    },
  }

  # nginx_status URL for collectd to pull stats from. see:
  #   https://collectd.org/wiki/index.php/Plugin:nginx
  nginx::resource::location { 'nginx_status':
    ensure              => present,
    location            => '/nginx_status',
    stub_status         => true,
    vhost               => 'docker_registry_app',
    location_cfg_append => {
      'access_log' => 'off',
    },
  }

  # collectd configuration for nginx
  class { '::collectd':
    purge        => true,
    recurse      => true,
    purge_config => true,
  }
  class { 'collectd::plugin::nginx':
    url   => 'http://vi-docker.lab.vi.local/nginx_status',
  }