Prevent Let's Encrypt failed authorizations with Ansible

It happens once every few years. For whatever reason I request another VPS at a service provider, provision the machine with Ansible and deploy a few services, usually as Docker containers. And then? They don’t work. In 99% of the cases I forgot to update the DNS records, and with Traefik and Let’s Encrypt I hit the rate limit for failed validations while investigating, so I can’t obtain a new certificate for the next hour. With a few lines of Ansible, this won’t happen again. Hopefully.

The goal is quite simple: with Ansible I can deploy any service, such as a static site or web application, in “one click” (or with a single command in my terminal). I just need to be sure the certificate renewal process will succeed.

The condition to check is simple: is the host’s IP address equal to the address that the service’s FQDN resolves to? Note that this is a deliberately simple approach, and it will look different if you work with clusters, CDNs and so on. But then I’m sure you manage your DNS at a more professional level than I do, as a guy who just wants to run his own things.

My choice here is to deploy the container and configure it completely, but leave it in a stopped state. You could handle it differently, by failing the deployment so that you correct the DNS settings first, but then you might be waiting on the DNS TTL before you can even configure your container.
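For comparison, the "fail the deployment" alternative could be sketched as a single assertion. This is only an illustration of the option I did not choose. It assumes that myservice_fqdn is defined, that ansible_host is a resolvable hostname rather than an IP address, and that the community.general collection and dnspython are available on the controller.

```yaml
# Sketch of the fail-fast alternative, not the approach used in this post.
# Assumes ansible_host is a hostname; an IP address would not resolve as an A record.
- name: Abort when the service FQDN does not resolve to this host
  ansible.builtin.assert:
    that:
      - lookup('community.general.dig', myservice_fqdn) == lookup('community.general.dig', ansible_host)
    fail_msg: "{{ myservice_fqdn }} does not resolve to {{ ansible_host }}. Fix the DNS records first."
    success_msg: "{{ myservice_fqdn }} resolves to this host."
```

With this variant the play stops for that host at the assertion, so nothing after it gets configured until the DNS records are correct.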

Another option might be to let Ansible configure the DNS settings for you, but that feels too risky if it goes wrong, I haven’t seen an Ansible module that talks to Digital Ocean’s networking API, and you would still have to deal with DNS TTLs.

DNS queries with Ansible

You can query domains with Ansible using the dig lookup plugin from the community.general collection. Note that it requires dnspython on the local controller node, so I had to install the package python3-dnspython on my Ubuntu laptop first.
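A quick way to confirm that the plugin and dnspython are working is a throwaway debug task. This is a minimal sketch: example.com is just a placeholder name, and the explicit qtype only spells out what I understand to be the default record type.

```yaml
# Lookups run on the controller, so dnspython has to be installed there,
# not on the target host.
- name: Show what the dig lookup returns for a name
  ansible.builtin.debug:
    msg: "{{ lookup('community.general.dig', 'example.com.', qtype='A') }}"
```

If dnspython is missing, this task should fail with an error from the lookup plugin instead of printing an address, which makes the missing dependency easy to spot before the real tasks depend on it.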

Then the Ansible logic is quite simple:

  1. Query DNS A record for the service you’re deploying

  2. Query DNS A record for the host machine you’re deploying to

  3. Check whether both records are equal. If they are, set the container state variable to started; otherwise set it to stopped

  4. Configure container with given state from above

In Ansible YAML, the tasks look like this:

- name: Query DNS lookup for domain {{ myservice_fqdn }}
  set_fact:
    service_ip: "{{ lookup('dig', myservice_fqdn) }}"

- name: Query DNS lookup for host
  set_fact:
    host_ip: "{{ lookup('dig', ansible_host) }}"

# A single inline expression avoids the stray whitespace that a
# multi-line if/else template would put into the state value.
- name: Set container service state based on DNS configuration
  set_fact:
    container_state: "{{ 'started' if service_ip == host_ip else 'stopped' }}"

# The module name was missing here; fail combined with ignore_errors
# reports the mismatch loudly while letting the play continue.
- name: Check DNS configuration for {{ myservice_fqdn }} is correct
  fail:
    msg: >
      Warning: Service is configured at '{{ myservice_fqdn }}'
      but it resolves to '{{ service_ip }}'.
      This does not match the host '{{ ansible_host }}'
      which is located at '{{ host_ip }}'.
      The container state is set to '{{ container_state }}'
      to prevent it from starting
      and flooding Let's Encrypt renewal requests.
  when: service_ip != host_ip
  ignore_errors: true

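- name: Create the container
  docker_container:
    name: "{{ docker_container }}"
    image: "{{ docker_image }}"
    pull: yes
    state: "{{ container_state }}"
    restart_policy: unless-stopped
    networks_cli_compatible: yes
    networks:
      - name: "{{ traefik_docker_network }}"
    labels:
      # The label keys were lost from the listing; these are the usual Traefik v2
      # router labels, and the router name "myservice" is a placeholder.
      traefik.enable: "true"
      traefik.http.routers.myservice.entrypoints: "websecure"
      traefik.http.routers.myservice.rule: "Host(`{{ myservice_fqdn }}`)"
      traefik.http.routers.myservice.tls: "true"
      traefik.http.routers.myservice.tls.certresolver: "le"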
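One edge case may be worth guarding against. As far as I understand the dig plugin, a name that does not exist comes back as the string NXDOMAIN and a name without an A record as an empty string, so two failed lookups would compare as equal and the container would start anyway. A hedged variant of the state task that treats those results as a mismatch could look like this:

```yaml
# Sketch only: the NXDOMAIN and empty-string return values are my reading of
# the plugin's behaviour, so verify them against your installed version.
- name: Set container state, treating failed lookups as a mismatch
  ansible.builtin.set_fact:
    container_state: "{{ 'started' if (service_ip == host_ip and service_ip not in ['', 'NXDOMAIN']) else 'stopped' }}"
```

It replaces the plain equality check and leaves the rest of the tasks unchanged.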

The tests all went well. Now I have to wait a few years to ascertain that this was the right and only thing needed to keep me from making this mistake again.