Reliability and infrastructure-as-code

I like using Ansible because it provides a reproducible way to install servers. Reproducible... in theory. This week I got stuck reinstalling a Jenkins server. I was confident because we use Ansible and I had already done a dozen installations. But as usual, unexpected events occurred :-). The Ansible playbook failed with errors like these:

  • Failed to fetch the yum repository key at https://jenkins-ci.org/redhat/jenkins-ci.org.key
  • Impossible to download some Jenkins plugins. Failed to connect to updates.jenkins-ci.org at port 443: [Errno 104]

It is a network problem that can be reproduced with a simple one-liner:

[root@vmt-jenkins ~]# LANG= curl -XGET -v -L https://jenkins-ci.org/redhat/jenkins-ci.org.key
* About to connect() to jenkins-ci.org port 443 (#0)
[...]
* Ignoring the response-body
* Connection #2 to host jenkins.io left intact
* Issue another request to this URL: 'https://pkg.jenkins.io/redhat/jenkins-ci.org.key'
* About to connect() to pkg.jenkins.io port 443 (#3)
*   Trying 52.202.51.185...
* Connection timed out
* Failed connect to pkg.jenkins.io:443; Connection timed out
* Closing connection 3
curl: (7) Failed connect to pkg.jenkins.io:443; Connection timed out

I first believed it was an issue on our company network, but the same error occurred on an external network :-(. So I created an issue on the Jenkins bug tracker. After opening it, I took some time to think about what could have been done to avoid this disaster. To be resilient in this faulty context, my advice is:

  • Write playbooks that can use a mirror repository. If you are using an additional repository (in my case http://pkg.jenkins-ci.org/redhat-stable/jenkins.repo), store its URL in a variable. If the main repository is not working, change the variable to point to a mirror.
  • Set up good monitoring of the services and dependencies your application requires. In my case, monitoring the Jenkins repository sounds weird, as it is an official repository and we expect it to always be up and running. But dependencies could also be LDAP or DNS services. Monitoring gives you an exact picture of the service's health: always broken, sometimes broken, always up. If your monitoring system detects failures of the service, do not plan an installation; wait for better conditions.
  • Run your playbooks regularly. If you don't, you don't see problems; you just postpone the troubleshooting phase. In the end, you have to fix all the bugs at once, which is harder than steadily making small fixes.
  • Write the most reliable playbook possible. Catch errors and, if possible, retry the command. This is impossible with some Ansible modules because they offer neither a timeout option nor a retries option.
  • Do not forget to back up the running hosts even if you are using infra-as-code. If your playbook fails, you will be happy to restore a backup.
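
To make the mirror and retry advice more concrete, here is a minimal playbook sketch. It is not the playbook I actually run. The `jenkins` host group, the variable names `jenkins_repo_baseurl` and `jenkins_repo_key_url`, the retry counts and the delays are all assumptions for illustration, and the base URL is my guess at what the .repo file above points to. Where a module has no retry option of its own, the task-level `until`, `retries` and `delay` keywords are one generic way to retry.

```yaml
---
# Sketch only: host group, variable names and retry values are illustrative assumptions.
- name: Install Jenkins from a configurable repository
  hosts: jenkins
  become: true
  vars:
    # Override these to point at a mirror when the upstream host is unreachable.
    jenkins_repo_baseurl: "http://pkg.jenkins-ci.org/redhat-stable"
    jenkins_repo_key_url: "https://jenkins-ci.org/redhat/jenkins-ci.org.key"

  tasks:
    # Pre-flight check: fail early, before anything is modified on the host.
    - name: Check that the repository key is reachable
      ansible.builtin.uri:
        url: "{{ jenkins_repo_key_url }}"
        method: GET
        timeout: 30
        status_code: 200
      register: key_check
      until: key_check is succeeded
      retries: 3
      delay: 10

    - name: Import the repository GPG key
      ansible.builtin.rpm_key:
        key: "{{ jenkins_repo_key_url }}"
        state: present
      register: key_import
      until: key_import is succeeded
      retries: 3
      delay: 10

    - name: Declare the Jenkins yum repository
      ansible.builtin.yum_repository:
        name: jenkins
        description: Jenkins stable
        baseurl: "{{ jenkins_repo_baseurl }}"
        gpgcheck: true

    # Package downloads can also hit a flaky network, so retry the install too.
    - name: Install Jenkins
      ansible.builtin.yum:
        name: jenkins
        state: present
      register: jenkins_install
      until: jenkins_install is succeeded
      retries: 3
      delay: 10
```

With this layout, switching to a mirror does not require editing the playbook: the two variables can be overridden from the inventory or on the command line with `-e`. Retries only help with transient failures; when the upstream host is down for hours, as it was for me, the mirror variable is what gets you out.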

Systems using infra-as-code are really cool. But in case of network issues or moving dependencies, you may be trapped with no possibility to install, all the more so if you are running on-premises servers.

By @Romain JACQUET