Operational Feedback on Concourse CI

Nowadays, most startups use some form of CI (continuous integration) system, either homemade or off-the-shelf.
I’m currently in charge of all software and infrastructure at Earthcube. I chose Concourse for CI (reasons here, plus suggestions from friends I trust), and this post shares some operational feedback about it.

First, let’s start with the good points. Concourse pipelines are defined in YAML files that you can keep in source control, so everything is versioned.
Second, everything Concourse does is done inside a container, so you don’t have to worry about configuring and maintaining the machines Concourse runs on.
Third, the Concourse web UI is really nice, and it makes tracking production builds easy and convenient.

However, running Concourse in production brings a few challenges that I had to overcome. I’d like to emphasize that my experience may differ from others’, and I know some other Concourse users would give different feedback. So I’m just describing here what worked for me in my configuration.
To put things in perspective, here is our usage pattern: Earthcube is a 15-person startup, and we have a dozen CI pipelines, each running 2-3 times per day.

Disclaimer: Several problems I encountered may be due to insufficient investigation on my side. I had to make it work quickly, under tight deadlines, in a Docker/Docker-compose based infrastructure, and this post describes just that.

The first thing to know is that Concourse comes from the CloudFoundry ecosystem. It does not use Docker to run its containers; it uses Garden, a component of CloudFoundry.
If you want to run Concourse the “natural” way, you would use Bosh, the CloudFoundry management tool for distributed systems.

So the key point is this: Concourse is not a CI system built around Docker.
Nevertheless, I set up Concourse with Docker-compose, which basically means running the Concourse binaries inside Docker containers.
At first I used the docker-compose setup described on the Concourse GitHub.

#Problem 1: Robustness to docker container restart

The problem is that if for some reason you perform a docker-compose restart, the concourse binary inside the docker container does not stop gracefully, and the next time the master is up, the worker is stuck or unresponsive.
I found a way around this thanks to https://github.com/EugenMayer/docker-image-concourseci-worker-solid. Basically, we use a wrapper to retire the worker gracefully when the docker-compose application is stopped.
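To illustrate the idea of such a wrapper (this is not the code from the repository above), here is a minimal Python sketch. It starts the worker as a child process and, when the container receives a stop signal, runs a "retire" command before terminating the worker. The function names are hypothetical, and the flags passed to `concourse retire-worker` are an assumption that should be checked against your Concourse version.

```python
import signal
import subprocess
import sys


def build_retire_command(worker_name, tsa_host):
    """Build the retire command. The flags are an assumption: verify them
    with `concourse retire-worker --help` for your Concourse version."""
    return ["concourse", "retire-worker", "--name", worker_name, "--tsa-host", tsa_host]


def run_with_graceful_stop(worker_cmd, retire_cmd):
    """Run the worker; on SIGTERM or SIGINT, retire it before stopping it."""
    worker = subprocess.Popen(worker_cmd)

    def handle_stop(signum, frame):
        try:
            # Tell the master that this worker is going away on purpose.
            subprocess.call(retire_cmd)
        except OSError as exc:
            print("retire command failed: %s" % exc, file=sys.stderr)
        # Then stop the worker process itself.
        worker.terminate()

    signal.signal(signal.SIGTERM, handle_stop)
    signal.signal(signal.SIGINT, handle_stop)
    # wait() resumes once the handler has run, then returns the exit code.
    return worker.wait()
```

Used as the container entrypoint, this would be called as something like `run_with_graceful_stop(["concourse", "worker"], build_retire_command("worker-1", "concourse-web"))`, where the worker name and TSA host are placeholders. The point of the design is that `docker-compose stop` only signals the entrypoint process, so the entrypoint has to be the one that announces the retirement.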

Here is what the final docker-compose.yml looks like:

version: '3'

services:
  concourse-db:
    image: postgres:9.6
    restart: always
    environment:
      POSTGRES_DB: concourse
      POSTGRES_USER: concourse
      PGDATA: /database

  concourse-web:
    image: concourse/concourse
    links: [concourse-db]
    command: web
    depends_on: [concourse-db]
    ports: ["9001:8080"]
    volumes: ["./keys/web:/concourse-keys"]
    restart: always

  concourse-worker:  # service name assumed; nothing else references it
    image: solidworker:latest
    privileged: true
    links: [concourse-web]
    depends_on: [concourse-web]
    restart: always
    volumes: ["./keys/worker:/concourse-keys"]
    environment:
      CONCOURSE_TSA_HOST: concourse-web

To build the solidworker image, use the GitHub link mentioned above. Without this trick, I ran into several documented Concourse issues:

Concourse job stuck in “pending” state (example)

Worker stays in “stalled” state indefinitely after restarting the docker containers (example)



#Problem 2: Docker daemon hang

In some rare situations (which I had difficulty characterizing precisely), the Concourse worker container hangs, and you cannot stop it, either with a docker-compose stop or by stopping the Docker daemon. In this case, you get the error message:

ERROR: An HTTP request took too long to complete.

(this Docker daemon issue has been documented in various other situations too)

This means the Concourse worker container (in a docker-compose setup) may, on rare occasions, hang and take your Docker daemon down with it.
To avoid any risk, I worked around this issue by installing a VM on the server and deploying the Concourse docker-compose setup inside the VM, so that it does not share its Docker daemon with other important applications.

#Problem 3: Connectivity between worker and master

At some point, while I was trying to solve the worker hang issue, I learned on the Concourse Slack channel that the worker is quite resource-intensive, so I thought I could run the master and database in one place and the worker somewhere else.
I had trouble doing that. Network connectivity was fine (thanks to a VPN), but I couldn’t get the worker to stay registered and pick up jobs. I suspect I forgot to open some ports somewhere, even though I had read the documentation describing the master/worker communications.
Distributing the work is probably much easier if you use Bosh for your infrastructure.
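Since my suspicion was a closed port, a quick reachability check run from the worker machine would have been a sensible first step. Here is a minimal Python sketch of one; the function names are hypothetical, and which host and port to test depends on your setup (as far as I know, the worker registers with the master over SSH on the TSA port, 2222 by default).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def check_endpoints(endpoints, timeout=3.0):
    """Map each (host, port) pair to whether it is reachable."""
    return {(host, port): can_connect(host, port, timeout) for host, port in endpoints}
```

For example, `check_endpoints([("concourse-web", 2222)])` run on the worker side tells you whether the master is reachable on that port. A successful TCP connection only rules out firewall problems; it does not prove that the worker keys are accepted.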

#Problem 4: Disk space usage

The Concourse worker sometimes does not clean up everything after itself, for example when builds are aborted. The typical error message is:

“No space left on device”  (example1, example2)

Disk space usage can increase quickly. I recommend 200GB, plus a script that regularly stops and removes (docker stop/rm) the Concourse worker container and its volume.
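Such a cleanup script could look like the following Python sketch, meant to be run from cron in the directory containing docker-compose.yml. It recycles the worker only when disk usage crosses a threshold. The service name `concourse-worker`, the threshold and the function names are assumptions for illustration, and the `-v` flag only removes anonymous volumes attached to the container.

```python
import shutil
import subprocess


def usage_percent(path):
    """Return the percentage of used disk space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def cleanup_commands(service):
    """Commands that stop the worker, remove it with its volumes, and recreate it."""
    return [
        ["docker-compose", "stop", service],
        ["docker-compose", "rm", "-f", "-v", service],
        ["docker-compose", "up", "-d", service],
    ]


def recycle_worker_if_full(path, threshold_percent, service, run=subprocess.check_call):
    """Recycle the worker container when disk usage reaches the threshold.

    Returns True if the cleanup commands were run, False otherwise.
    """
    if usage_percent(path) < threshold_percent:
        return False
    for cmd in cleanup_commands(service):
        run(cmd)
    return True
```

A call such as `recycle_worker_if_full("/", 80.0, "concourse-worker")` would do nothing below 80% usage and recycle the worker above it. Any build running on the worker at that moment is lost, so it is better scheduled outside working hours.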


So if I had to sum up the key take-aways:
– run Concourse with the docker-compose written above, with the worker wrapper trick
– deploy it inside a VM or on a separate server
– make sure Concourse has at least 150-200GB disk space

Overall, after working around the 4 problems described above, I’m now pretty satisfied with Concourse, and I’d recommend it for any small or medium-size startup (let’s say up to 50-70 people). This applies to a docker-compose setup; Concourse can probably be run at a much bigger scale with other tooling such as CloudFoundry Bosh.

