Encrypting Login Credentials in Ansible Vault

One of Ansible’s benefits over Puppet and Chef is its server-less/agent-less architecture. Rather than having agents on client machines continuously checking in to a server for configuration changes (pull model), Ansible operates via a “push model” over SSH. One complication of the push model over SSH, however, is that the “pusher” needs to have login credentials for all of the target machines. Given that these credentials are collectively the keys to your kingdom, how can one store them securely?

Ideally one would use SSH public/private key pairs for authentication. This assumes a couple of prerequisites, though: 1) that the machine(s) you’ll be pushing from with the private key(s) are sufficiently locked down that the private key(s) will never be accessible to normal users; and 2) that each of your target hosts is configured with the corresponding public key as an authorized login key.
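
For reference, getting key-based authentication going against a single target host boils down to two commands (a sketch assuming OpenSSH; the user and hostname below are placeholders):

# Generate a key pair on the machine you'll be pushing from (if one doesn't already exist)
ssh-keygen -t rsa -b 4096

# Install the corresponding public key as an authorized login key on a target host
ssh-copy-id deploy@target1.example.com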

Suppose that you’re not yet able to fully implement public/private key pairs for authentication across your entire infrastructure. How should you proceed? Ansible’s docs on Inventory management indicate that one can put the login credentials in their inventory file. Ex:

[targets]
other1.example.com    ansible_connection=ssh    ansible_ssh_user=mpdehaan   ansible_ssh_pass=foobar
other2.example.com    ansible_connection=ssh    ansible_ssh_user=mdehaan    ansible_ssh_pass=foobar123

Unfortunately, here’s a dirty little secret about Ansible & Ansible Vault: The inventory file itself is not encrypt-able with Ansible Vault. If you proceed down the path of putting your login credentials into the plain-text Ansible inventory file in source control, anyone with access to your source control repo will have the login creds for all of the machines in your inventory. Ouch!

Luckily there is a solution for this: Buried deep within a page in Ansible’s documentation that describes Ansible’s support for Microsoft Windows, there is an example which provides a path forward for securely storing login credentials for machines managed by Ansible. The solution mentioned on that page is to use Ansible’s concept of “group_vars” and “host_vars” to store variables like “ansible_ssh_user” and “ansible_ssh_pass” on a per group/host basis, and then encrypt those variables with Ansible Vault.

Some examples:

# group_vars/webservers:
---
ansible_ssh_user: deployment
ansible_ssh_pass: my deployment password


# host_vars/db1.mycompany.com
---
ansible_ssh_user: ansible
ansible_ssh_pass: my other deployment password

From there one can encrypt those group/host variable files with “ansible-vault encrypt …”, and deploy with “ansible-playbook --ask-vault-pass …”.
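
For example (a sketch; the “site.yml” playbook and inventory path are placeholders for your own):

# Encrypt the group/host variable files that contain login credentials
ansible-vault encrypt group_vars/webservers host_vars/db1.mycompany.com

# Deploy, supplying the Vault password interactively so the encrypted var files can be read
ansible-playbook -i inventory site.yml --ask-vault-pass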

For more information on Ansible Vault see the official docs, or my two-part blog post on using Ansible Vault.


Automating Linux Security Best Practices with Ansible

Suppose you’re setting up a new Linux server in Amazon AWS, Digital Ocean, or even within your private network. What low-hanging fruit can one knock out to improve the security of a Linux system? One of my favorite posts on this topic is “My First 5 Minutes On A Server; Or, Essential Security for Linux Servers”.

In that post Bryan Kennedy provides several tips, including:

  • Installing fail2ban – a daemon that monitors login attempts to a server and blocks suspicious activity as it occurs.
  • Modifying the configuration of “sshd” to permit neither “root” logins, nor password-based logins (ie. allow authentication with key-pairs only).
  • Configuring a software-based firewall with “iptables”.
  • Enabling automatic software security updates through the OS’s package manager.

These all seem like sane things to do on any new server, but how can one automate these steps for all of the machines in their infrastructure, for speed, review-ability, and consistency? How can one do this in a way that allows them to easily modify the above rules for the different types of machines in their environment (ex. web servers should have port 80 open, but DB servers should have port 5432 open)? What are some gotchas to be aware of when doing this? Here are my notes on how to automatically bring these security best practices to the Linux machines in your infrastructure, using Ansible.

Community-Contributed Roles which Implement the Best Practices

The first rule of automating anything is: Don’t write a single line of code. Instead, see if someone else has already done the work for you. It turns out that Ansible Galaxy serves us well here.

Ansible Galaxy is full of community-contributed roles, and as it relates to this post there are existing roles which implement each of the best practices named above:

  • tersmitten.fail2ban – installs and configures fail2ban
  • willshersystems.sshd – manages the sshd configuration
  • franklinkim.ufw – configures a software firewall with UFW (a front-end for iptables)
  • jnv.unattended-upgrades – enables automatic security updates via the OS’s package manager

By combining these well-maintained, open-source roles, we can automate the implementation of the best practices from Bryan Kennedy’s blog post.

Making Security Best-Practices your Default

The pattern that I would use to automatically incorporate these third-party roles into your infrastructure is to make them dependencies of a “common” role that gets applied to all machines in your inventory. A “common” role is a good place to put all sorts of default-y configuration, like setting up your servers to use NTP, installing VMware Tools, etc.

Assuming that your common role is called “my-company.common”, making these third-party roles dependencies is as simple as the following:

# requirements.yml
---
- src: franklinkim.ufw
- src: jnv.unattended-upgrades
- src: tersmitten.fail2ban
- src: willshersystems.sshd

# roles/my-company.common/meta/main.yml
---
dependencies:
  - { role: franklinkim.ufw }
  - { role: jnv.unattended-upgrades }
  - { role: tersmitten.fail2ban }
  - { role: willshersystems.sshd }

Wow, that was easy! All of Bryan Kennedy’s best practices can be implemented with just two files: One for downloading the required roles from Ansible Galaxy, and the other for including those roles for execution.
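
In case it isn’t obvious how the first file gets used: the roles listed in “requirements.yml” are downloaded from Ansible Galaxy with a single command (the “-p roles/” destination is an assumption about your repo layout):

# Download the third-party roles from Ansible Galaxy into the local "roles/" directory
ansible-galaxy install -r requirements.yml -p roles/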

Creating Your Own Default Configuration for Security-Related Roles

Each of those roles comes with its own default configuration. That configuration can be seen by looking through the “defaults/main.yml” file in each role’s respective GitHub repo.

Suppose you wanted to provide your own configuration for these roles which overrides the authors’ defaults. How would you do that? A good place to put your own defaults for the configuration of your infrastructure is the “group_vars/all” file in your Ansible repo. The “group_vars/all” file defines variables which take precedence over, and thereby override, the default variables from the roles themselves.

The variable names and structure can be obtained by reading the docs for each respective role. Here’s an example of what your custom default security configuration might look like:

# group_vars/all
---
# Configure "franklinkim.ufw"
ufw_default_input_policy: DROP
ufw_default_output_policy: ACCEPT
ufw_default_forward_policy: DROP
ufw_rules:
  - { ip: '127.0.0.1/8', rule: allow }
  - { port: 22, rule: allow }

# Configure "jnv.unattended-upgrades"
unattended_automatic_reboot: false
unattended_package_blacklist: [jenkins, nginx]
unattended_mail: 'it-alerts@your-company.com'
unattended_mail_only_on_error: true

# Configure "tersmitten.fail2ban"
fail2ban_bantime: 600
fail2ban_maxretry: 3
fail2ban_services:
  - name: ssh
    enabled: true
    port: ssh
    filter: sshd
    logpath: /var/log/auth.log
    maxretry: 6
    findtime: 600

# Configure "willshersystems.sshd"
sshd_defaults:
  Port: 22
  Protocol: 2
  HostKey:
    - /etc/ssh/ssh_host_rsa_key
    - /etc/ssh/ssh_host_dsa_key
    - /etc/ssh/ssh_host_ecdsa_key
    - /etc/ssh/ssh_host_ed25519_key
  UsePrivilegeSeparation: yes
  KeyRegenerationInterval: 3600
  ServerKeyBits: 1024
  SyslogFacility: AUTH
  LogLevel: INFO
  LoginGraceTime: 120
  PermitRootLogin: no
  PasswordAuthentication: no
  StrictModes: yes
  RSAAuthentication: yes
  PubkeyAuthentication: yes
  IgnoreRhosts: yes
  RhostsRSAAuthentication: no
  HostbasedAuthentication: no
  PermitEmptyPasswords: no
  ChallengeResponseAuthentication: no
  X11Forwarding: no
  PrintMotd: yes
  PrintLastLog: yes
  TCPKeepAlive: yes
  AcceptEnv: LANG LC_*
  Subsystem: "sftp {{ sshd_sftp_server }}"
  UsePAM: yes

The above should be tailored to your needs, but hopefully you get the idea. By placing the above content in your “group_vars/all” file, you provide the default security configuration for every machine in your infrastructure (ie. every member of the group “all”).
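
For completeness, here’s a minimal sketch of a top-level playbook that applies the “common” role, and with it the security roles above, to every host in the inventory (the file name, role name, and use of “become” are assumptions; adapt them to your own repo):

# site.yml (hypothetical top-level playbook)
---
- hosts: all
  become: yes
  roles:
    - my-company.common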

Overriding Your Own Defaults for Sub-Groups

The above configuration provides a good baseline of defaults for any machine in your infrastructure, but what about machines that need modifications or exceptions to these rules? Ex: Your HTTP servers need to have firewall rules that allow for traffic on ports 80 & 443, PostgreSQL servers need firewall rules for port 5432, or maybe there’s some random machine that needs X11 forwarding over SSH turned on. How can we override our own defaults?

We can continue to use the power of “group_vars” and “host_vars” to model the configuration for subsets of machines. If, for example, we wanted our web servers to have ports 80 & 443 open, and our PostgreSQL servers to have port 5432 open, we could override the variables from “group_vars/all” with respective variables in “group_vars/webservers” and “group_vars/postgresql-servers”. Ex:

# "group_vars/webservers"
---
ufw_rules:
  - { ip: '127.0.0.1/8', rule: allow }
  - { port: 22, rule: allow }
  - { port: 80, rule: allow }
  - { port: 443, rule: allow }

===
# "group_vars/postgresql-servers"
---
ufw_rules:
  - { ip: '127.0.0.1/8', rule: allow }
  - { port: 22, rule: allow }
  - { port: 5432, rule: allow }

Rules for individual hosts can be set using “host_vars/<hostname>”. Ex: “host_vars/nat-proxy1.your-company.com”

# host_vars/nat-proxy1.your-company.com
---
# Allow IP traffic forwarding for our NAT proxy
ufw_default_forward_policy: ACCEPT

Ansible’s “group_vars”, “host_vars”, and variable precedence are pretty powerful features. For additional reading, see Ansible’s documentation on variable precedence.

Gotchas

There are a couple of “gotchas” to be aware of with the aforementioned scheme:

  1. Variable replacing vs. merging – When Ansible is applying the precedence of variables from different contexts, it will use variable “replacement” as the default behavior. This makes sense in the context of scalar variables like strings and integers, but may surprise you in the context of hashes, where you might have expected the sub-keys to be automatically merged. If you want your hashes to be merged then you need to set “hash_behaviour=merge” in your “ansible.cfg” file (see the snippet after this list). Be warned that this is a global setting, though.
  2. Don’t ever forget to leave port 22 open for SSH – Building upon the first gotcha: knowing that variables are replaced and not merged, you must remember to always include an “allow” rule for port 22 when overriding the “ufw_rules” variable. Ex:
    ufw_rules:
      - { ip: '127.0.0.1/8', rule: allow }
      - { port: 22, rule: allow }

    Omitting the rule for port 22 may leave your machine unreachable, except via console.

  3. Unattended upgrades for daemons – Unattended security upgrades are a great way to keep you protected from the next #HeartBleed or #ShellShock security vulnerability. Unfortunately unattended upgrades can themselves cause outages.
    Imagine, for example, that you have unattended upgrades enabled on a machine with Jenkins and/or nginx installed: When the OS’s package manager goes to check for potential upgrades and finds that a new version of Jenkins/nginx is available, it will merrily go ahead and upgrade your package, which in-turn will likely cause the package’s install/upgrade script to restart your service, thus causing a temporary and unexpected downtime. Oops!
    You can prevent specific packages from being automatically upgraded by listing them in the variable “unattended_package_blacklist”. This is especially useful for daemons. Ex:

    unattended_package_blacklist: [jenkins, nginx]
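
For reference, here’s what the global “hash_behaviour” setting from the first gotcha looks like (this is the only relevant stanza; the rest of “ansible.cfg” is omitted):

# ansible.cfg
[defaults]
hash_behaviour = merge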
    

Conclusion

Ansible and the community-contributed code at Ansible Galaxy make it quick and easy to implement the best-practices described at “My First 5 Minutes On A Server; Or, Essential Security for Linux Servers”. Ansible’s notion of “group_vars”, “host_vars”, and variable precedence provides a powerful way for one to configure these security best-practices for groups of machines in their infrastructure. Using all of these together one can automate & codify the security configuration of their machines for speed of deployment, review-ability, and consistency.

What other sorts of best-practices would be good to implement across your infrastructure? Feel free to leave comments below. Thanks!


Managing Secrets with Ansible Vault – The Missing Guide (Part 2 of 2)

(This post is part 2/2 in a series. For part 1 see: Managing Secrets with Ansible Vault – The Missing Guide (Part 1 of 2))

How to use Ansible Vault with Test Kitchen

Once you’ve codified all of your secrets into Ansible “var files” and encrypted them with Ansible Vault, you’ll probably want to test the deployment of these secrets with Test Kitchen. Unfortunately you will quickly find that Test Kitchen does not play well with Vault out of the box: in order for Test Kitchen to run “ansible-playbook”, it needs the password to your Vault to decrypt the secrets within the var files.

How does the “kitchen-ansible” plugin expect to receive the password to your Vault? Via a plain-text file on your filesystem, as specified by the “ansible_vault_password_file” parameter in your “.kitchen.yml” file. Oh boy!
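
To make the problem concrete, the relevant portion of such a “.kitchen.yml” looks roughly like this (a sketch; the password file path is a placeholder):

# .kitchen.yml (excerpt)
provisioner:
  name: ansible_playbook
  ansible_vault_password_file: /home/me/.vault_pass.txt    # a plain-text copy of your Vault password!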

This does not seem like a scalable solution to me… I hardly trust myself to manage a plain-text file with the password to our Vault. Beyond that, I would be terrified to let an entire organization of folks know the password to the Vault and instruct them to store that password in a plain-text file on their own respective file systems just so that they could run Test Kitchen tests as they iterate w/Ansible. In practice this would be only marginally better than simply checking the secrets into git as plain text, as all of this structure around Vault and Ansible vars would only be pushing the problem of secret management one level higher.

So how can we test with Test Kitchen when using Ansible Vault? Here’s a nifty solution to the problem that builds upon the solution that we implemented in Part 1 of this guide:

  • Define a well-known Unix hostname for your Test Kitchen VM. Ex: “test-kitchen”
  • Create two versions of your vars files: One for production which is encrypted, the other for your test environment which is unencrypted. The structure of the files will be largely the same (ex. the files to be placed, w/their respective owner, group, mode), but the contents of the files for production will differ from the files for your test environment.
  • In “tasks/main.yml”, use “include_vars” to include the appropriate var file for whichever environment you happen to be in. This can be done by using the “with_first_found” arg to “include_vars”. See example below.
# .kitchen.yml
---
# Set the hostname of our Test Kitchen-created VM to be “test-kitchen”
driver:
  name: vagrant
  vm_hostname: test-kitchen
...<snip>...

##########

# vars/vpn-secrets-prod.yml - A Vault-encrypted file
$ANSIBLE_VAULT;1.1;AES256
34336333316361306432303864336464623165316461396266626562393232316565383263663234
3963633535363737613136656535343436613335636663380a373766653966663337666539613166
32313738303263303130353665333031373930353938653766653732623061326462633065393134
3135386639333637630a393439343733616439373731383932383562356164633832363639636633
64373237333661653066346566366135326539636564343632666363663866653264396564396162
62353461326435373433633034313338376265396130363965313464656332373737306462323433
34646361363065656331336337313763313939303533646138323834336330323533353239363663
...<snip>...

# vars/vpn-secrets-test-kitchen.yml
---
vpn_secret_files:
  /etc/openvpn/easy-rsa/keys/ec2-openvpn.key:
    owner: root
    group: root
    mode: "u=r,go="
    content: |
      -----BEGIN PRIVATE KEY-----
      MIIEvAIBADANBgkqhkiG9w0BAQEFAASCBKYwggSiAgEAAoIBAQD5koXgI24E360f
      nhxCfOPVORzFW1CN7u/zOQdvKoIStogF0UQifDCnY/POEjoBmzBrg/UyAmsqLIli
      xMtRIuvEhwaGEUQPoZNCaRW+1XtJ3kDvr9MVTlJTcNGOlGe/E+HyAKBq5vinxzzM
      9ba8M9Nc1PQ93B1OTUY1QGHVYRvSFYDJ5Fnz23xKeNsnY3hmRkV7CDZXSdy9nbmy
      1X9uz7z5bG7PKUVD3JZjI75CHAEDJKtscBv9ez/z16YTxwahIL3CXfqBq8peyAZ0
      n4Mzj4Lt8Cwaw2Kw3w3gMhbhf4fy284+hYqHe9uqYJC6dJJSKDIXqoLSD+e8aN+v
      BAEQcAWXAgMBAAECggEAbmHJ6HqDHJC5h3Rs11NZiWL7QKbEmCIH6rFcgmRwp0oo
      GzqVQhNfiYmBubECCtfSsJrqhbXgJAUStqaHrlkdogx+bCmSyr8R3JuRzJerMd6l
      Jd3EJHZBnzoU1VT6Fd77Xge868tASySp1ZUPv2nEoBhn9jw2kf1HgiH5o2CR53ZP
      pnL72Ng7MHpKuyoAZ9DtUU7yGG4RTCN2JuPGD6IwKoXBs1b7tqsMncz86u6Iibwk
      Np4j3vPmSLfQxvBP85T0xzSURlnP+bFCaJDPfXYIgDLROkrFAgJ2ADCm4gwfk93i
      Z/wnk8tFjnxUy2V5UbtWqqkVHmvdHHCc/6bZfcNOsQKBgQD/v94YX3vhgZRiz1kZ
      c0v2lxFZqNgMPC7EADmO34nFq7KtmVXYQfpoiooGDfQXTqfVGQsyTcpg5HLZvlyb
      qm9oaXpZY4yP/SLF6Pc00/iDTleSxGROyqhsaBotXpqSSC3rv92D9Zas/Xdz3lHD
      NSY9EVsiFId7O4OkvLuZVDvZQwKBgQD50Rs873/yUdyCwKx9/GF4yWVRg7//FTyQ
      Cj1KCBK5tDqOc+hiIS1GF0HRkcvIot71owTe+PG9OouXlUuxWrtc+fzgGSPaYjMp
      Ub69EcSNtUsK8MUS+VADbR5VDzS27OM1g+pJO7BbHpPWuEI1cjYmW/+3cCzFYnIV
      5z6OctbjHQKBgEVQWP8+EbMijXbiP4G4T+Q7OUaVjkhynzIb5X2ldA+Q41JNdoiw
      CRAATDwr1/XhKXeF3BT8JFdyUvZUs4C1BpDD1ZcYdeYocx40b5tvv7DGsNFkTNNV
      9aO76yxUsYvn6Bo22/CBxR6Ja7CJlptTclOmuo5YBggOLzWcuTNrMvVFAoGASIoV
      lK4ewuhOVZFJBRRB4Wbpiq/tEk7CVTkD7vlFJrNUxYSWl9f2Y4HhVM83Ez1n7H+3
      rF8xIrdbTVrGresguLDGYvQp2wHkxTy9W/1Ky7M25ShgsU+/kh8fTaeqsOs8Vo/F
      ehpg7TSFzTWX1Bkj7COOr19dQLuDUSTin05tY2kCgYB35ZHVDMR6TlW0Kp/l7gAx
      FQx5hojllzHr3RRv8a4rBbhsdAJGBr5QHZbzVeuw1z6NlDc/4brer3y52FnnHbD3
      fkUrvh+g1xHeXF4Yekr5Mu2D7PoQoFRRai2hjPnIHRLmHI45EPri3USoHuNPl+qB
      l23chS70zQ9VDmqEs9gjLA==
      -----END PRIVATE KEY-----
  /etc/openvpn/easy-rsa/keys/ca.key:
    owner: root
    group: root
    mode: "u=r,go="
    content: |
      -----BEGIN PRIVATE KEY-----
      MIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwggSkAgEAAoIBAQDPm22e2QTeTnLN
      PT//6kyB8tM/2kE6+LsFD3TFA4XvS3gwNZLybjXpPtncF4qLxjq3c4uSBp2tuAa2
      VvWUCAyQX4EcOuCFhh1AIUHX9O4F2JhLtNH366D6LmfGE7Lck85R6bzErYJ5OzBN
      /3WSGtWmLbQWhXTvNwG5re17Ds7DLQ6/XRXCg91lAbtGqYCvw9F6X8N3VNdcovqN
      Ud+tJ4XjmGfPD8ZgSk/iVKeLzz5fuNxON+ygdUJ9IQJGu7kvJOhWD1F3p3lzuS4E
      7zyR8r9QK6lGdk2/ifmY5f+tmI92fvVl2HD2DroEVp42hCYEpNogm8BKXHFHBA9N
      0mugGMzVAgMBAAECggEAAK5a2rWNjYkmWUQFLLrBC4AXb1Mw+ZeNTYPydx7+1n0h
      5M6YL9Fqvdwl7NHq83BwCuAHKjB5XfOHmhuI7LZmDCc0DjqnN+jruaUiSSoVidFf
      Foh+U9jjC08RqhWwdYbKm3wv0VlcXzdxfiADa7pIzyXBPH2tl4dPqyNF7yxqQzum
      F42D4IExbYYkGR7bP6RePrUiaO3iU/EwDL5Dey4+93K+EaxbdxIhMLclvnQ8I0tl
      tFGn4AbbOqPqzPxWZhWk2gT//jMTtJh6FxQLQkvDoEnta5UYQ2E38r33jK+Wasga
      lGZEyNOTMq1MMdPrCzXloJSnerCXC4vTFt62AOdIQQKBgQD3SvUeaXV7Xf67vL0t
      EdBG9YL0Zz2MxxoVAth44svMzQ4gR6/pkakEhMzR51I/Skl/wzCJFHY1Z4nq9DoA
      RY5APjO63uHdZEKKYZ1MTXmO6F+IkUY5MCBvyCtLsnkAcToyuyDuhV4NBfjydw6E
      L5S1H9NI7klvaPxq5I7KkzSeJQKBgQDW6sDBvi6ctV3w8GCTUjP5Ker1FuKYL7Yn
      HI6RIGnWB2hS8NbEe8ODgzsVOVnC6x+WCNBiu/GmF8wlue7PCH7rLEa8diiM+J9/
      QYXtezfLIhPqhPJZDj5IX7bIotkvUzv+ywvUfCtJ3aCAu8DMi09x1GRgU6go/4ZK
      SCmVmj588QKBgQCNhr2gCRTuZM37nbnayF4drjajL06/eddIfRdsn8epTxWtjbl0
      gCNt7Z7W5n9gr2A/GXN2kFpSmA4LhHiJXUVbKP4sDZDQRqf6UIFYgOJ30i+SlinN
      Yui9cJ6utNahVSvMiuH/AB7iby+ZfF+3cQ+3VR5zl8Q5WalUd7fs4bB0bQKBgBI1
      x+lipO5wS6pro7M35uF41Mi5jK+ac1OzDr1rQqx46jUE5R224uUUzH/K4Tkr1PxQ
      eN+0zw/kuk6EB6ERNjfVA5VaaaswMcuFkMSDiUGz/H4Fj8dN9qcJPSKY8dAZvF6l
      c7YoYz6aAcyGnBp4v12EwpCK5he7NvS6UpOzgxHxAoGBAOjiBQtwikKLzLYwg1gF
      QYh1TLvEJIRFYEFQveVUKxmSskN4W6VQrTrcqobYHM9tOSbSe+Ib/y/khpaEz0PE
      E5gxeUbxhTj0PVvOKJmyCKWDPL8o61MGVhX1nAJarfbdP1XM9fl4S3pZH14bIhOU
      FG0e4jNsDq6vdwytV9R/GyAv
      -----END PRIVATE KEY-----

#######

# tasks/main.yml
#
# Leverage the fact that our ".kitchen.yml" file is setting the hostname of
# test VMs to "test-kitchen". Using "with_first_found" we can load the
# unencrypted "vpn-secrets-test-kitchen.yml" for test VMs, otherwise load the
# Ansible Vault-encrypted "vpn-secrets-prod.yml" file.
#
# Use "no_log: true" to keep from echoing the key contents to stdout.
# See: http://docs.ansible.com/faq.html#how-do-i-keep-secret-data-in-my-playbook
#
- name: VPN Server | Load VPN secret keys
  include_vars: "{{ item }}"
  no_log: true
  with_first_found:
    - "vpn-secrets-{{ ansible_hostname }}.yml"
    - "vpn-secrets-prod.yml"

- name: VPN Server | Copy secret files
  copy:
    dest="{{ item.key }}"
    content="{{ item.value.content }}"
    owner="{{ item.value.owner }}"
    group="{{ item.value.group }}"
    mode="{{ item.value.mode }}"
  with_dict: vpn_secret_files
  no_log: true
  notify:
    - restart openvpn

The magic lies in the “with_first_found” argument above. In the Test Kitchen environment “vpn-secrets-{{ ansible_hostname }}.yml” will interpolate to “vpn-secrets-test-kitchen.yml” because of our well-defined hostname. Since this “vpn-secrets-test-kitchen.yml” file exists in unencrypted form under “vars/”, Ansible will grab that var file for your Test Kitchen environment. If the hostname is something other than “test-kitchen” (ie. production), then Ansible’s “with_first_found” will reach the “vpn-secrets-prod.yml” var file, which is encrypted with Vault and will require a password to unlock and proceed.

Sanity Checking Ourselves with Serverspec

Now that we have Vault working nicely with Test Kitchen, a final step would be to add automated tests to make sure that we are indeed deploying files with the correct permissions, now and in the future. For more details on using Ansible & Test Kitchen with Serverspec, see Testing Ansible Roles with Test Kitchen. Here’s what a Serverspec test for our above files would look like:

# test/integration/default/serverspec/secret_keys_spec.rb

require 'serverspec'

# Run the assertions directly on the converged test VM
set :backend, :exec

# Secret keys should not be world readable.
secret_keys = [
  '/etc/openvpn/dh2048.pem',
  '/etc/openvpn/ipp.txt',
  '/etc/openvpn/openvpn.key',
  '/etc/openvpn/ta.key',
  '/etc/openvpn/easy-rsa/keys/ca.key',
  '/etc/openvpn/easy-rsa/keys/ec2-openvpn.key'
]

for secret_key in secret_keys
  describe file(secret_key) do
    it { should be_file }
    it { should be_mode 400 }
    it { should be_owned_by 'root' }
    it { should be_grouped_into 'root' }
  end
end

Deploying to Production with Jenkins

A final piece of the puzzle to figure out was how to actually run “ansible-playbook” with a code base that utilizes Ansible Vault within the context of a job-runner like Jenkins. In other words, how to provide Jenkins with the password to unlock the Vault. I found a few options here:

  • Put the Vault password into a locked-down file (mode 400) on your Jenkins slaves that run Ansible. This only works if your Jenkins slaves have some level of security around the users that Jenkins uses. I’m not crazy about passwords in text files, but in theory this shouldn’t be any worse than a locked-down, 400-mode file like those in “/etc/sudoers.d/…”.
  • Modify the Jenkins job that runs Ansible to require a Password parameter, run “ansible-playbook” within that job with that password parameter being echo’d in, and then use the Jenkins Mask Passwords plugin to mask the contents of that password from your build logs. The downside of this is that it complicates automated execution of the Jenkins job that invokes Ansible as it now requires a password to be invoked.
  • Store the Ansible Vault password in another secret management system like HashiCorp’s Vault. This starts to get pretty meta 🙂

Ultimately you have to decide which of these three options fits best within your infrastructure and workflow.
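
As an illustration of the first option above, the Jenkins job would simply point “ansible-playbook” at the locked-down password file (the inventory, playbook, and file path here are hypothetical):

# Executed by the Jenkins job on a locked-down slave
ansible-playbook -i production site.yml --vault-password-file /etc/ansible/vault_pass.txt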

Conclusion

There you have it, my two-part guide to using Ansible Vault from soup to nuts. Hopefully you’ve found these notes useful for putting together an end-to-end system for securely managing your infrastructure’s secrets. Please let me know in the comments if I’ve left anything out. Thanks!


Testing Ansible Roles with Test Kitchen

Recently while attending DevOps Days Austin 2015, I participated in a breakout session focused on how to test code for configuration management tools like Puppet, Chef, and Ansible. Having started to use Ansible to manage our infrastructure at Delphix I was searching for a way to automate the testing of our configuration management code across a variety of platforms, including Ubuntu, CentOS, RHEL, and Delphix’s custom Illumos-based OS, DelphixOS. Dealing with testing across all of those platforms is a seemingly daunting task to say the least!

Intro to Test Kitchen

The conversation in that breakout session introduced me to Test Kitchen (GitHub), a tool that I’ve been very impressed by and have had quite a bit of fun writing tests for. Test Kitchen is a tool for automated testing of configuration management code written for tools like Ansible. It automates the process of spinning up test VMs, running your configuration management tool against those VMs, executing verification tests against those VMs, and then tearing down the test VMs.

What makes Test Kitchen so powerful and useful is its modular design: the drivers, provisioners, and verification frameworks are all pluggable, so the same workflow can be reused across different hypervisors/clouds, configuration management tools, and test frameworks.

Using Test Kitchen

After learning about Test Kitchen at the DevOps Days conference, I did some more research and stumbled across the following presentation which was instrumental in getting started with Test Kitchen and Ansible: Testing Ansible Roles with Test Kitchen, Serverspec and RSpec (SlideShare).

In summary one needs to add three files to their Ansible role to begin using Test Kitchen:

  • A “.kitchen.yml” file at the top-level. This file describes:
    • The driver to use for VM provisioning. Ex: Vagrant, AWS, Docker, etc.
    • The provisioner to use. Ex: Puppet, Chef, Ansible.
    • A list of one or more operating systems to test against. Ex: Ubuntu 12.04, Ubuntu 14.04, CentOS 6.5, or even a custom VM image specified by URL.
    • A list of test suites to run.
  • A “test/integration/test-suite-name/test-suite-name.yml” file which contains the Ansible playbook to be applied.
  • One or more test files in “test/integration/test-suite-name/test-driver-name/”. For example, when using the BATS test-runner to run a test suite named “default”: “test/integration/default/bats/my-test.bats”.
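
Here’s a minimal sketch of what the “.kitchen.yml” might look like for an Ansible role (the platform names and playbook path are assumptions, and the provisioner options come from the “kitchen-ansible” plugin):

# .kitchen.yml
---
driver:
  name: vagrant

provisioner:
  name: ansible_playbook
  hosts: all
  playbook: test/integration/default/default.yml

platforms:
  - name: ubuntu-14.04
  - name: centos-6.5

suites:
  - name: default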

Example Code

A full example of Test Kitchen w/Ansible is available via the delphix.package-caching-proxy Ansible role in Delphix’s GitHub repo, which contains each of the aforementioned files/directories.

Running Test Kitchen

Using Test Kitchen couldn’t be easier. From the directory that contains your “.kitchen.yml” file, just run “kitchen test” to automatically create your VMs, configure them, and run tests against them:

$ kitchen test
-----> Starting Kitchen (v1.4.1)
-----> Cleaning up any prior instances of 
-----> Destroying ...
 Finished destroying  (0m0.00s).
-----> Testing 
-----> Creating ...
 Bringing machine 'default' up with 'virtualbox' provider...
 ==> default: Importing base box 'opscode-ubuntu-14.04'...
==> default: Matching MAC address for NAT networking...
 ==> default: Setting the name of the VM: kitchen-ansible-package-caching-proxy-default-ubuntu-1404_default_1435180384440_80322
 ==> default: Clearing any previously set network interfaces...
 ==> default: Preparing network interfaces based on configuration...
 default: Adapter 1: nat
 ==> default: Forwarding ports...
 default: 22 => 2222 (adapter 1)
 ==> default: Booting VM...
 ==> default: Waiting for machine to boot. This may take a few minutes...

..  ...

-----> Running bats test suite
 ✓ Accessing the apt-cacher-ng vhost should load the configuration page for Apt-Cacher-NG
 ✓ Hitting the apt-cacher proxy on the proxy port should succeed
 ✓ The previous command that hit ftp.debian.org should have placed some files in the cache
 ✓ Accessing the devpi server on port 3141 should return a valid JSON response
 ✓ Accessing the devpi server via the nginx vhost should return a valid JSON response
 ✓ Downloading a Python package via our PyPI proxy should succeed
 ✓ We should still be able to install Python packages when the devpi contianer's backend is broken
 ✓ The vhost for the docker registry should be available
 ✓ The docker registry's /_ping url should return valid JSON
 ✓ The docker registry's /v1/_ping url should return valid JSON
 ✓ The front-end serer's root url should return http 204
 ✓ The front-end server's /_status location should return statistics from our web server
 ✓ Accessing http://www.google.com through our proxy should always return a cache miss
 ✓ Downloading a file that is not in the cache should result in a cache miss
 ✓ Downloading a file that is in the cache should result in a cache hit
 ✓ Setting the header 'X-Refresh: true' should result in a bypass of the cache
 ✓ Trying to purge when it's not in the cache should return 404
 ✓ Downloading the file again after purging from the cache should yield a cache miss
 ✓ The yum repo's vhost should return HTTP 200

 19 tests, 0 failures
 Finished verifying  (1m52.26s).
-----> Kitchen is finished. (1m52.49s)

And there you have it, one command to automate your entire VM testing workflow!

Next Steps

Giving individual developers on our team the ability to quickly run a suite of automated tests is a big win, but that’s only the first step. The workflow we’re planning is to have Jenkins also run these automated Ansible tests every time someone pushes to our git repo. If those tests succeed we can automatically trigger a run of Ansible against our production inventory. If, on the other hand, the Jenkins job which runs the tests is failing (red), we can use that to prevent Ansible from running against our production inventory. This would be a big win for validating infrastructure changes before pushing them to production.


Ansible Role for Package Hosting & Caching

The Operations Team at Delphix has recently published an Ansible role called delphix.package-caching-proxy. We are using this role internally to host binary packages (ex. RPMs, Python “pip” packages), as well as to locally cache external packages that our infrastructure and build process depends upon.

This role provides:

  • A Docker Registry for hosting and caching Docker images
  • A devpi server for hosting and proxying Python “pip”/PyPI packages
  • An Apt-Cacher-NG proxy for caching Debian/Ubuntu packages
  • A yum repo for hosting RPMs
  • A front-end Nginx web server that caches and routes requests to each of the above

It also provides hooks for monitoring the front-end Nginx server through collectd.

Why Is this Useful?

This sort of infrastructure can be useful in a variety of situations, for example:

  • When your organization has remote offices/employees whose productivity would benefit from having fast, local access to large binaries like ISOs, OVAs, or OS packages.
  • When your dev process depends on external dependencies from services that are susceptible to outages, ex. NPM or PyPI.
  • When your dev process depends on third-party artifacts that are pinned to certain versions and you want a local copy of those pinned dependencies in case those specific versions become unavailable in the future.

Sample Usage

While there are a variety of configuration options for this role, the default configuration can be deployed with an Ansible playbook as simple as the following:

---
- hosts: all
  roles:
    - delphix.package-caching-proxy

Underneath the Covers

This role works by deploying a front-end Nginx webserver to do HTTP caching, and also configures several Nginx server blocks (analogous to Apache vhosts) which delegate to Docker containers for the apps that run the Docker Registry, the PyPI server, etc.

Downloading, Source Code, and Additional Documentation/Examples

This role is hosted in Ansible Galaxy at https://galaxy.ansible.com/list#/roles/3008, and the source code is available on GitHub at: https://github.com/delphix/ansible-package-caching-proxy.

Additional documentation and examples are available in the README.md file in the GitHub repo, at: https://github.com/delphix/ansible-package-caching-proxy/blob/master/README.md

Acknowledgements

Shoutouts to some deserving folks:

  • My former Development Infrastructure Engineering Team at VMware, who proved this idea out by implementing a similar set of caching proxy servers for our global remote offices in order to improve developer productivity.
  • The folks who conceived the Snakes on a Plane Docker Global Hack Day project.

How Should I Get Application Configuration into my Docker Containers?

A commonly asked question by folks that are deploying their first Docker containers into production is, “How should I get application configuration into my Docker container?” The configuration in question could be settings like the number of worker processes for a web service to run with, JVM max heap size, or the connection string for a database. The reality is that there are several standard ways to do this, each with their own pros and cons.

The ~3.5 Ways to Send Configuration to your Dockerized Apps

1. Baking the Configuration into the Container

Baking your application’s configuration into a Docker image is perhaps the easiest pattern to understand. Basically one can use commands within the “Dockerfile” to drop configuration files into the right places via the Dockerfile’s COPY directive, or modify those configuration files at image build time with “sed” or “echo” via the RUN command.

If there’s a container available on the Docker Hub Registry that does everything you want save for one or two config settings, you could fork that “Dockerfile” on GitHub, make modifications to the “Dockerfile” in your GitHub fork to drop in whatever configuration changes you want, then add it as a new container on the Docker Hub Registry.

That is what I did for:

Pros:

  • Simple and easy to reason about: the image is entirely self-contained, and it will behave the same way everywhere it runs (good dev/prod parity).

Cons:

  • Since the configuration is baked into the image, any future configuration changes will require additional modifications to the image’s build file (its “Dockerfile”) and a new build of the container image itself.

2a. Setting the Application Configuration Dynamically via Environment Variables

This is a commonly-used pattern for images on the Docker Hub Registry. For an example, see the environment variables “POSTGRES_USER” and “POSTGRES_PASSWORD” for the official PostgreSQL Docker image.

Basically, when you “docker run” you will pass in pre-defined environment variables like so: "docker run -e SETTING1=foo -e SETTING2=bar ... <image name>". From there, the container’s entry point (startup script) will look for those environment variables, and “sed” or “echo” them into whatever relevant config files the application uses before actually starting the app.

It’s worth mentioning that the container’s entry point script should contain reasonable defaults for each of those environment variables if the invoker does not pass those environment variables in, so that the container will always be able to start successfully. 
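
Here’s a minimal sketch of such an entry point script (the setting names, config file path, and app binary are hypothetical):

#!/bin/sh
# entrypoint.sh - fall back to sane defaults if the caller didn't pass -e SETTING1=... / -e SETTING2=...
SETTING1="${SETTING1:-some-sane-default}"
SETTING2="${SETTING2:-another-sane-default}"

# "sed" the values into the app's config file before starting the app
sed -i "s|^setting1 = .*|setting1 = ${SETTING1}|" /etc/myapp/app.conf
sed -i "s|^setting2 = .*|setting2 = ${SETTING2}|" /etc/myapp/app.conf

# Hand off to the real application as PID 1
exec /usr/local/bin/myapp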

Pros:

  • This approach makes your container more dynamic in terms of configuration. 

Cons:

  • You are sacrificing dev/prod parity because now folks can configure the container to behave differently in dev & prod.
  • Some configurations are too complex to model with simple key/value pairs, for example an nginx/apache virtual host configuration.

2b. Setting the Application Configuration Dynamically via a Key-Value Store

This is a similar idea to using environment variables to pass in configuration, but instead the container’s startup script will reach out to a key-value (KV) store on the network like Consul or etcd to get configuration parameters.

This makes it possible to do more complex configurations than is possible with simple environment variables, because the KV store can have a hierarchical structure of many levels. It’s worth noting that widely-used tooling exists for grabbing the values from the KV store and substituting them into your config files. Tools like confd even allow for automatic app-reloading upon changes to the KV configuration. This allows you to make your app’s configuration truly dynamic!
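
As a rough illustration of the idea (without confd), a container’s entry point script could fetch a value from etcd’s HTTP API and substitute it into a config file before starting the app; the etcd host, key, and config path below are hypothetical:

#!/bin/sh
# Fetch this app's worker count from etcd's v2 keys API
WORKERS=$(curl -s "http://etcd.internal:2379/v2/keys/myapp/workers" \
  | python -c 'import json,sys; print(json.load(sys.stdin)["node"]["value"])')

# Substitute the value into the app's config file, then start the app
sed -i "s|^workers = .*|workers = ${WORKERS}|" /etc/myapp/app.conf
exec /usr/local/bin/myapp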

See:

Pros:

  • This approach makes your container more dynamic in terms of configuration.
  • The KV store allows for more complex & dynamic configuration information 

Cons:

  • This introduces an external dependency of the KV store, which must be highly-available.
  • You are sacrificing dev/prod parity because now folks can configure the container to behave differently in dev & prod.

3. Map the Config Files in Directly via Docker Volumes

Docker Volumes allow you to map any file/directory from the host OS into a container, like so: “docker run -v <source path>:<dest path> ...”

Therefore if the config file(s) for your containerized app happened to be available on the filesystem of the base OS, then you could map that config file (or dir) into the container. Ex:

“docker run -v /home/dan/my_statsd_config.conf:/etc/statsd.conf hopsoft/graphite-statsd”

Pros:

  • You don’t have to modify the container to get arbitrary configurations in.

Cons:

  • You lose dev/prod parity because now your app’s config can be anything
  • If you’re doing this in production, now you have to get that external config file onto the base OS for sharing into the container (a configuration management tool like Ansible, Chef, or Puppet comes in handy here)

Conclusion

As you can see, there are many potential ways to get application configuration to your Dockerized apps, each with their own trade-offs. Which way is best? It really depends on how much dynamism you require and whether or not you want the extra burden of managing dependencies like a KV store.

Ansible vs Puppet – An Overview of the Solutions

This is part 1/2 in a series. For part #2 see: Ansible vs Puppet – Hands-On with Ansible.

Having recently joined Delphix in a DevOps role, I was tasked with increasing the use of a configuration management tool within the R&D infrastructure. Having used Puppet for the past 3 years at VMware and Virtual Instruments I was inclined to deploy another Puppet-based infrastructure at Delphix, but before doing so I thought I’d take the time to research the new competitors in this landscape.

Opinions From Colleagues

I started by reaching out to three former colleagues and asking for their experiences with alternatives to Puppet, keeping in mind that two of the three are seasoned Chef experts. Rather than giving me a blanket recommendation, they responded with questions of their own:

  • How does the application stack look? Is it a typical web server, app, and db stack or something more complex?
  • Are you going to be running the nodes in-house, cloud, or hybrid? What type of hypervisors are involved?
  • What is the future growth? (ie. number of nodes/environments)
  • How in-depth will the eng team be involved in writing recipes, manifests, or playbooks?

After describing Delphix’s infrastructure and use-cases (in-house VMware; not a SaaS company/product but there are some services living in AWS; less than 100 nodes; I expect the Engineering team to be involved in writing manifests/cookbooks), I received the following recommendations:

Colleague #1: “Each CM tool will have its learning curve but I believe the decision will come down to what the engineers will adapt to most easily. I would tend to lean towards Ansible for the mere fact that it’s the easiest to implement. There are no agents to be installed and it works well with vSphere/Packer. You will most likely have the engineers better off on it since they know Python.”

Colleague #2: “I would say that if you don’t have a Puppet champion (someone that knows it well and is pushing to have it running) it is probably not worth for this. Chef would be a good solution but it has a serious learning curve and from your details it is not worth either. I would go with something simple (at least in this phase); I think Ansible would be a good fit for this stage. Maybe even with some Docker here or there ;). There will be no state or complicated cluster configs, but you seem to not need that. You don’t have to install any agent and just need SSH.“

The third colleague was also in agreement that I should take a good look at Ansible before starting with deployment of Puppet.

Researching Ansible on My Own

Having gotten similar recommendations from 3/3 of my former colleagues, I spent some time looking into Ansible. Here are some major differences between Ansible & Puppet that I gathered from my research:

  • Server Nodes
    • Puppet infrastructure generally contains 1 (or more) “puppetmaster” servers, along with a special agent package installed on each client node. (Surprisingly, the Puppetization of the puppetmaster server itself is an issue that does not have a well-defined solution)
    • Ansible has neither a special master server, nor special agent executables to install. The executor can be any machine with a list (inventory) of the nodes to contact, the Ansible playbooks, and the proper SSH keys/credentials in order to connect to the nodes.
  • Push vs Pull
    • Puppet nodes have special client software and periodically check into a puppet master server to “pull” resource definitions.
    • Ansible follows a “push” workflow. The machine where Ansible runs from SSH’s into the client machines and uses SSH to copy files, remotely install packages, etc. The client machine VM requires no special setup outside of a working installation of Python 2.5+.
  • Resources & Ordering
    • Puppet: Resources defined in a Puppet manifest are not applied in order of their appearance (ex: top->bottom) which is confusing for people who come from conventional programming languages (C, Java, etc). Instead resources are applied randomly, unless explicit resource ordering is used. Ex: “before”, ”require”, or chaining arrows.
    • Ansible: The playbooks are applied top-to-bottom, as they appear in the file. This is more intuitive for developers coming from other languages. An example playbook that can be read top-down: https://gist.github.com/phips/aa1b6df697b8124f1338
  • Resource Dependency Graphs
    • Internally, Puppet creates a directed graph of all of the resources defined in a system, along with the order in which they should be applied. This is a robust way of representing the resources to be applied, and Puppet can even generate a graph file so that one can visualize everything that Puppet manages. On the down side, building this graph is susceptible to “duplicate resource definition” errors (ex: multiple definitions of a given package, user, file, etc). Also, conflicting rules from a large collection of 3rd-party modules can lead to circular dependencies.
    • Since Ansible is basically a thin-wrapper for executing commands over SSH, there is no resource dependency graph built internally. One could view this as a weakness as compared with Puppet’s design but it also means that these “duplicate resource” errors are completely avoided. The simpler design lends itself to new users understanding the flow of the playbook more easily.
  • Batteries Included vs DIY
  • Language Extensibility
  • Syntax
  • Template Language
    • Puppet templates are based upon Ruby’s ERB.
    • Ansible templates are based upon Jinja2, whose syntax is closely modeled on Django’s templating language. Most R&D orgs will have more experience with Django.
  • DevOps Tool Support

Complexity & Learning Curve

I wanted to call this out even though it’s already been mentioned several times above. Throughout my conversations with my colleagues and in my research, it seemed that Ansible was a winner in terms of ease of adoption. In terms of client/server setup, resource ordering, “batteries included”, Python vs Ruby, and Jinja vs ERB, I got the impression that I’d have an easier time getting teammates in my R&D org to get a working understanding of Ansible.

Supporting this sentiment was a post I found about the Taxi startup Lyft abandoning Puppet because the dev team had difficulty learning Puppet. From “Moving Away from Puppet”:

“[The Puppet code base] was large, unwieldy and complex, especially for our core application. Our DevOps team was getting accustomed to the Puppet infrastructure; however, Lyft is strongly rooted in the concept of ‘If you build it you run it’. The DevOps team felt that the Puppet infrastructure was too difficult to pick up quickly and would be impossible to introduce to our developers as the tool they’d use to manage their own services.”

Getting Hands-On

Having done my research on Puppet vs Ansible, I was now ready to dive in and implement some sample projects. How did my experience with Ansible turn out? Read on in Part #2 of this series: Ansible vs Puppet – Hands-On with Ansible.