[Monday Dev Heaven] Continuous Integration at Nuxeo
What we test
At Nuxeo, our Jenkins CI is composed of hundreds of jobs. The primary goal of Continuous Integration (CI) is to give our developers feedback on:
- Compilation error
- Unit tests regression
- Unit tests database regression on PostgreSQL, Oracle, SQL Server and MySQL
- Unit tests OS regression: Linux and Windows
- Functional tests regression using FunkLoad, WebDriver, Selenium
- Functional tests database regression on PostgreSQL, Oracle, SQL Server and MySQL
- Functional tests OS regression: Linux and Windows
These builds are triggered by developers' commits. They follow implicit Maven dependencies as well as explicit dependencies to run functional tests.
Other chains are scheduled or manually triggered to run miscellaneous tasks such as:
- Performance regression using FunkLoad benchmark
- Permanent full build
- Daily snapshots deployment, synchronizing public and internal Maven repositories
- Daily integration release build
- Official releases and hotfix builds
- Miscellaneous administration tasks
To give you an idea, here are some numbers:
- 2 Nexus Maven repositories (internal and public)
- 40 commits/day on 100 Git and Mercurial repositories
- 200 Nuxeo jars to build a Nuxeo distribution
- 100 to 600 Jenkins builds/day
- 350 Jenkins jobs
- 15 Jenkins slaves
Because of our platform approach, we have many layers, which has some tricky side effects. For instance, a commit in the runtime layer will impact the core, services, features, ui, etc. Furthermore, incoming commits lengthen the chain that is already running. So in the end, a fully tested build can take a very long time, delaying the feedback that developers are waiting for.
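To make the layering effect concrete, here is a minimal sketch that walks a dependency graph and lists everything that has to be rebuilt after a commit. The graph itself is an assumption made for illustration: it simply chains the layers named above one after another, which is certainly simpler than the real Nuxeo module graph.

```python
from collections import deque

# Hypothetical layer graph: each key lists the layers that depend on it directly.
# The strictly linear chain is an assumption made for this sketch.
DOWNSTREAM = {
    "runtime": ["core"],
    "core": ["services"],
    "services": ["features"],
    "features": ["ui"],
    "ui": [],
}


def impacted_layers(changed, downstream=DOWNSTREAM):
    """Return the changed layer plus every layer downstream of it, breadth-first."""
    seen = [changed]
    queue = deque([changed])
    while queue:
        layer = queue.popleft()
        for dependent in downstream.get(layer, []):
            if dependent not in seen:
                seen.append(dependent)
                queue.append(dependent)
    return seen


print(impacted_layers("runtime"))   # every layer is rebuilt
print(impacted_layers("features"))  # only the top of the stack is rebuilt
```

The lower the commit lands in the stack, the longer the list, and the longer the chain that has to finish before the developer gets full feedback.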
The other drawback is that many different commits can be linked to a regression, making it harder to see what caused the error when the build broke. All of this really slows down the development process. If the build breaks during the day, it is hard to know why it broke, and it takes even longer to know when it has been fixed.
So we've taken a close look at our CI chain and tried to improve its flow.
Improved the build times
The first thing to do is to use the Timestamper Jenkins plugin and to configure your jobs to add timestamps to the console output. This will help you understand where the time is spent during a build.
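Once the console output carries timestamps, finding the slow steps is mostly a matter of subtracting consecutive timestamps. The sketch below assumes a plain `HH:MM:SS` prefix on each line and uses an invented log excerpt; the actual prefix depends on how the plugin is configured, so the parsing would need to be adapted.

```python
from datetime import datetime

# Hypothetical console excerpt; the "HH:MM:SS" prefix is an assumed format.
SAMPLE_LOG = """\
10:00:00 Checking for snapshot updates
10:04:30 Compiling module
10:05:10 Running unit tests
10:07:10 Build finished
"""


def step_durations(log_text):
    """Return (seconds, message) pairs, slowest first.

    The duration of a step is the gap between its line and the next one,
    so the last line has no duration. Builds crossing midnight are not handled.
    """
    entries = []
    for line in log_text.splitlines():
        stamp, _, message = line.partition(" ")
        entries.append((datetime.strptime(stamp, "%H:%M:%S"), message))
    durations = []
    for (start, message), (end, _) in zip(entries, entries[1:]):
        durations.append((int((end - start).total_seconds()), message))
    return sorted(durations, reverse=True)


for seconds, message in step_durations(SAMPLE_LOG):
    print(f"{seconds:5d}s  {message}")
```

In this made-up excerpt the snapshot update check dominates the build, which is the kind of pattern to look for in real logs.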
Fast Maven repository access
In our case, we noticed that the time Maven spent checking for snapshot updates was often random, and sometimes very, very long. This meant that our Nexus Maven repository had some issues.
Indeed, our Nexus scheduled task that removed old snapshots was taking a few hours and generating a constant load of 2 on our NFS access. This was in fact related to a Nexus bug. This problem was impacting the whole build/test chain, but also developers that used the same internal repository.
-> Moving the internal daily snapshots to a local disk and scheduling the task during the night fixed the problem. The removal task now completes in less than 20 minutes and the build times are much more consistent.
-> Using local instead of remote slaves for IO-bound jobs also helps a lot. IO-bound jobs are those that upload big artifacts or download a lot. Using Rackspace slaves for long, CPU-bound jobs is fine.
Once the needed artifacts have been downloaded, the Maven compilation and test steps are mainly single-CPU bound because Maven 2 does not support parallel builds (even in Maven 3 it will be hard with non-thread-safe plugins).
There is an exception for GWT builds, which use multiple processes and will try to use all the CPU you have (this is why your CPU fan starts!).
By default, GWT builds a separate permutation for every target (locales * browsers) and tries to do it in parallel. While this is fine for production, it is not necessary for the CI chain. We can speed up the build and save a lot of CPU by reducing the number of permutations from 36 to 6 (supporting FF, IE and Chrome with 2 locales is enough) and by disabling compilation optimization. See this page for more information on this.
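The arithmetic behind the reduced permutation count can be checked with a few lines. The three browsers and the count of two locales come from the text; the locale names are placeholders, and the breakdown of the original 36 permutations is not given, so it appears only as a total.

```python
from itertools import product

# Browsers and number of locales come from the text; locale names are placeholders.
CI_BROWSERS = ["FF", "IE", "Chrome"]
CI_LOCALES = ["locale_1", "locale_2"]
PRODUCTION_PERMUTATIONS = 36  # total quoted in the text, breakdown unknown


def permutations(browsers, locales):
    """One compiled permutation per (locale, browser) pair."""
    return list(product(locales, browsers))


ci_permutations = permutations(CI_BROWSERS, CI_LOCALES)
saved = 1 - len(ci_permutations) / PRODUCTION_PERMUTATIONS

print(len(ci_permutations))               # number of permutations built in CI
print(f"{saved:.0%} fewer permutations")  # share of permutations no longer compiled
```

Compile time is not strictly proportional to the number of permutations, so the saving in wall-clock time should be measured rather than inferred from this ratio.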
Running Selenium tests requires memory and CPU. A quad core is fine here because you have Selenium, Firefox, Maven and your application running at the same time.
-> We have made some tests using Chrome and yes, it seems a bit faster, but changing the target browser is not always an option.
-> We are in the process of rewriting all the Selenium test suites using WebDriver in plain Java. The refactoring should help.
Improved the build chain
Understanding your chain
Once your slaves are building as fast as possible, the next question is how to speed up the chain to have quicker feedback.
But how do you get an overview of the chain?
Looking at the build history page … hmm, the browser is stuck trying to display hundreds of daily builds.
Browsing the UI for downstream/upstream builds? With dozens of dependencies, it takes too long for a human :/
Now you can easily check if the dependencies are the ones you were expecting. You also get an idea of the time between a commit and the end of the subsequent build chain. The throughput gives you an idea of how well the different builds are parallelized.
Obviously this kind of tool should be a Jenkins plugin, but I didn't have time to dig into Jenkins plugin development. Also, another goal of Jenkviz was to put build info into a relational database to perform plain SQL queries. Maybe the Dependency Graph View plugin will be a good starting point to move Jenkviz into a Jenkins plugin.
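As an illustration of what plain SQL over build info can give you, here is a small sketch using an in-memory SQLite database. The table layout, the sample rows and the throughput formula (sum of build durations divided by elapsed wall-clock time) are all assumptions made for this example; they are not Jenkviz's actual schema or definition.

```python
import sqlite3

# Hypothetical build records: (job, start in seconds, duration in seconds, status).
ROWS = [
    ("runtime", 0, 300, "SUCCESS"),
    ("core", 300, 600, "SUCCESS"),
    ("addon-a", 900, 200, "SUCCESS"),
    ("addon-b", 900, 400, "UNSTABLE"),
]


def load_builds(rows):
    """Create an in-memory database with one row per build (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE build (job TEXT, start_s INTEGER, duration_s INTEGER, status TEXT)"
    )
    conn.executemany("INSERT INTO build VALUES (?, ?, ?, ?)", rows)
    return conn


def chain_summary(conn):
    """Return (elapsed, total, throughput) for the whole chain.

    elapsed: wall-clock seconds from the first start to the last finish.
    total: sum of all build durations.
    throughput: total / elapsed; a value above 1 means builds overlapped.
    """
    elapsed, total = conn.execute(
        "SELECT MAX(start_s + duration_s) - MIN(start_s), SUM(duration_s) FROM build"
    ).fetchone()
    return elapsed, total, total / elapsed


conn = load_builds(ROWS)
elapsed, total, throughput = chain_summary(conn)
print(elapsed, total, round(throughput, 2))
```

With the same table, a query such as `SELECT job FROM build WHERE status != 'SUCCESS'` lists the builds that broke the chain.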
With the dependency tree better optimized, here are some other choices we made:
-> Stop on the first failed/unstable build. Continuing the chain on unstable builds delays the feedback and makes it harder to understand. This also helps with the load on the slaves. Even though this is an obvious solution, it was previously not possible in our case due to a bug in Maven/Jenkins that prevented deployment to a repository other than the default. It is now fixed, and artifact deployment is done by Jenkins post-build.
-> We decided to group some addon artifacts that were generating a huge load because they were launched at the same time. This is possible using the Multiple SCMs plugin. We are still evaluating the gains from this change.
-> We pre-assigned priorities for jobs with low duration to give earlier feedback, using the Priority Sorter plugin.
-> We give fast functional feedback using FunkLoad, which takes only 3 minutes for basic coverage while the full Selenium test suites can take more than 20 minutes.
-> We use a permanent chain to make sure we have full feedback on functional tests at least every 2h, because the main chain can be postponed by new commits. This chain runs on dedicated slaves. Jobs are triggered in a predetermined sequence using the Join plugin. The drawback here is that we don't have an accurate list of changesets between builds.
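Two of the choices above, running short jobs first and stopping at the first failed or unstable build, can be sketched together in a few lines. This is a toy model and not how Jenkins or the Priority Sorter plugin work internally: the job names are invented, jobs are plain callables returning a status string, and the convention that a lower number means a higher priority is an assumption.

```python
import heapq


def run_chain(jobs):
    """Run queued jobs by priority and stop at the first non-SUCCESS status.

    jobs: list of (name, priority, callable) in arrival order.
    Lower priority number runs first (assumed convention); ties keep arrival order.
    Returns the (name, status) pairs of the jobs that were actually executed.
    """
    heap = [
        (priority, arrival, name, job)
        for arrival, (name, priority, job) in enumerate(jobs)
    ]
    heapq.heapify(heap)
    results = []
    while heap:
        _, _, name, job = heapq.heappop(heap)
        status = job()
        results.append((name, status))
        if status != "SUCCESS":
            break  # skip the rest of the chain to give feedback sooner
    return results


# Hypothetical queue: the long Selenium suite arrived first but has the lowest priority.
QUEUE = [
    ("selenium-suite", 3, lambda: "SUCCESS"),
    ("unit-tests", 1, lambda: "SUCCESS"),
    ("funkload-smoke", 2, lambda: "UNSTABLE"),
]

print(run_chain(QUEUE))
```

Here the unstable smoke test stops the chain before the long Selenium suite starts, which keeps the slaves free and the failure easy to attribute.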
Useful Jenkins plugins
Here is a summary of the Jenkins plugins that we used to improve our CI:
- Priority Sorter
- Jenkins Multiple SCMs plugin (trigger Git)
- Jenkins build timeout plugin
- Join plugin