Monday, February 4, 2019

Debugging Jenkins plugins and core with an IDE

I'll collect notes here as I go along, to later turn them into an article somewhere it belongs better, like https://jenkins.io/doc/developer/building/ or https://wiki.jenkins.io/display/JENKINS/Building+Jenkins (which contains a somewhat less detailed version of this information) :)

Debugging Jenkins plugins and core with an IDE, feeling like you're 5
Based on a true story :)

While developing some fixes to plugins that would improve their behavior for our use-cases (and not really being a Java developer), at some point the cycle of editing a bit of code, running `mvn package`, deploying the result to a Jenkins server instance and seeing how that goes (maybe reading some logs or System.err.println() lines) just did not scale well :)

Subsequently I found ways to extend the existing self-tests in plugins, mostly by looking at existing precedent code and working my way up from there, discovering what can be done at all and how -- and essentially anything you can do interactively, there is a Java command for that in the background... though finding ways to get it called right can be tricky for a newcomer to the ecosystem, thus reading into existing tests of this or other plugins is a really good way to get ideas.

But still, even though the testing became faster and more reliable, and I could more quickly (with just an `mvn test`) see that my further changes to the code did not break the expected behaviors that I defined with tests for the new or modified feature, it had a big overhead: sometimes hundreds of tests ran for minutes, with my new baby lost somewhere in the middle of that haystack. And if something failed, it was actually harder to debug, because I did not find a way to read the Logger or System.*.println messages that would tell me what was happening in my code. So when it misbehaved, I had little idea why, and no details to theorize on.

So it was time to use a real debugger, to step through the code and see the variables as they change.

For historic reasons, my IDE of choice was Netbeans (the details below apply to others, modulo their GUI nuances). It is not a frequent choice nowadays, but it was used by some Jenkins developers, who even made various plugins to integrate it with a Jenkins server, like http://wiki.netbeans.org/HudsonInNetBeans etc. The release I used was the last Oracle Netbeans development version (8.2+); note that the current ones come from Apache and are being released in phases (9, 10, 11...) during 2018-2019, with 8.2 plugins being the recommended choice for additional features that were not re-released yet (such as C/C++ support).

So, Netbeans supports maven well, and Jenkins and its plugins use maven to configure the projects. So it looks like a good match.

In my case however, the Maven provided as part of the older Netbeans distribution was older than what Jenkins requires, so I installed the OS package which is new enough (or I could download the newest from Apache) and went to Tools/Options/Java/Maven to specify the "Maven home" -- the base directory for the installation to use (/usr for the package).

Then I opened an existing project (from the git-checked-out workspace), and set breakpoints all over the place where I thought my code could fail or otherwise be in an interesting state. Then I right-clicked the sources, and there was a Debug Test File option, which started `mvn test` and eventually hit my breakpoints.

That simple thing was a huge step forward: I eventually found that the code was actually doing what I told it to do, by coding, though not what I intended it to do, by thinking ;)

And then I stumbled on something even better: when I went to Test Packages in the Navigator pane, and to the source of the test I am interested in, and to a test routine that I was tweaking, the context menu offered to Debug Focused Test Method. Now this was a big optimization - while the Debug Test File run of the project still did a lot of tests before it struck what I was interested in, this newly found mode actually started the test suite right from this routine, or close to that, shrinking my change-test iterations from minutes to seconds!
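For reference, the same narrowing can be done from a terminal, since the Maven Surefire plugin accepts a `Class#method` filter in its `test` property. Below is a minimal sketch with a hypothetical test class `MyPluginTest` and method `myFocusedTest` (both names are placeholders, not taken from any real plugin). I assume the IDE does something similar under the hood, but I have not verified that.

```shell
#!/bin/sh
# Placeholder names (assumptions): substitute your own test class and method.
TEST_CLASS="MyPluginTest"
TEST_METHOD="myFocusedTest"

# Compose a Maven command line that runs a single test method.
# Surefire's "test" property accepts the Class#method form.
single_test_cmd() {
    printf 'mvn test -Dtest=%s#%s\n' "$1" "$2"
}

# Same, but with -Dmaven.surefire.debug, which makes the forked test JVM
# wait for a debugger to attach (Surefire's default port is 5005).
single_test_debug_cmd() {
    printf 'mvn test -Dtest=%s#%s -Dmaven.surefire.debug\n' "$1" "$2"
}

# Only print the commands here; copy one into the terminal to run it.
single_test_cmd "$TEST_CLASS" "$TEST_METHOD"
single_test_debug_cmd "$TEST_CLASS" "$TEST_METHOD"
```

With the second variant, the IDE's debugger can be attached to the waiting test JVM over a socket, which helps when the tests have to run on a different machine than the IDE.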

(Note that while I did not yet try this, the docs imply that Java code can actually be changed, recompiled and reloaded on the fly during a debugging session, as long as the method signatures are not changed... this might work for plugin code/test development, but less likely for jenkins core where the WAR file is debugged).

Next, my interest turned toward improvements in the jenkins-core. Now this is a separate beast, compared to plugin debugging. When you test a plugin, it downloads a lot of maven artifacts, including some old Jenkins binary build (the minimal compatible version, as declared by the plugin manifest), and starts it as a server with your plugin loaded into it (and on a side note, I could not run the tests or debug sessions on an illumos system, because those old artifacts do not contain a fixed libzfs4j version and crash - so I had to test in a Debian VM). In contrast, the Jenkins repository actually contains a "Jenkins main module" with nested maven projects for jenkins-core, jenkins-cli, jenkins-war and several variants of tests for jenkins-core.

Normally, with the old approach of code peppered with logging statements, I would go to the main module project, run `mvn package`, wait for a while, and find the `war/target/jenkins.war` that I can run with a specified JENKINS_HOME on our server or my laptop.

But in the IDE, there was no good option to really debug such a build. If I trigger a debug of the main module, the server starts but tracing into the code does not happen. If I trigger the jenkins-core, it asks which classes I want to execute... and I have no idea; none of the suggested options succeeded. And when I tried the jenkins-war project, it referred to jenkins-cli and jenkins-core artifacts by versions which made it try to pull in something that I did not edit and build -- a binary from the internet, often one that does not exist there (an X.Y-SNAPSHOT version), so the attempt failed. Also, in some of those cases where the server at least started, some resources were not found, so the web interface was unusable after logging in.

On another side note, loading the project into the IDE also started its indexing, and my development VM ran out of temporary space. As I found, indexing produces and later removes huge amounts of data, peaking around 3 GB. I restarted Netbeans with `netbeans -J-Djava.io.tmpdir=$HOME/tmp` so it used the big home dataset rather than the smaller tmpfs.
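A quick way to check beforehand whether a candidate temporary directory has enough room is sketched below. The threshold of about 3 GB is just the peak I observed, and the `avail_kb` helper name is my own choice for illustration.

```shell
#!/bin/sh
# Print the free space (in KB) of the filesystem holding directory $1.
# "df -P" forces the portable one-line-per-filesystem output format,
# where the 4th column of the 2nd line is the available space.
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Assumed threshold: about 3 GB, the peak seen during indexing.
NEEDED_KB=$((3 * 1024 * 1024))

# Compare the default temporary location and the home directory.
for dir in "${TMPDIR:-/tmp}" "$HOME"; do
    free=$(avail_kb "$dir")
    if [ "$free" -ge "$NEEDED_KB" ]; then
        echo "$dir: ${free} KB free, looks sufficient"
    else
        echo "$dir: ${free} KB free, probably too small for indexing"
    fi
done
```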

So at the Jenkins hackathon today, I asked around and found that what works for people is to explicitly start maven in debug mode as a separate step, and then attach to it from the IDE. So:
* in your home directory, create or edit the `$HOME/.m2/settings.xml` and put in the pluginGroup tag for "org.jenkins-ci.tools" as follows (see also https://gist.github.com/daniel-beck/fdc54fb7df1aa65b9e6194bb5f2ea816):

<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
  <pluginGroups>
    <pluginGroup>org.jenkins-ci.tools</pluginGroup>
  </pluginGroups>
</settings>
* start a terminal window from the IDE (in case of Netbeans, go to the root of the workspace and click Tools / Open in terminal, or go to Window / IDE Tools / Terminal), especially if your test is on another machine (as it happens to be in my case);
* build the edited code, e.g. in the main module do: `mvn install -DskipTests=true` (note that `mvn package` would not do the job, presumably because the war module looks up the jenkins-core and jenkins-cli artifacts in the local maven repository, which only `install` populates)
* run mvnDebug with the newly made WAR file: `(cd war ; mvnDebug -DskipTests=true jenkins-dev:run)`
** the default JENKINS_HOME for the test would be a `war/work` subdirectory in your workspace
** the jenkins-dev mode launches it without the installation wizard, so when you get into the server, you instantly have the user interface with no login
** to optionally run with a JENKINS_HOME of your choice, e.g. with test jobs and user accounts previously set up, you can run `(cd war ; mvnDebug -DJENKINS_HOME=$HOME/jenkinstest -DskipTests=true jenkins-dev:run)` instead
* in the IDE, set your breakpoints and go to Debug / Attach to debugger / JPDA / SocketAttach / select the port number from terminal (e.g. 8000)
** make sure your internet connection works at this point, as maven would pull a number of dependencies
** the IDE can show your breakpoint as a torn red box, with a pop-up saying that these sources are not the preferred ones (e.g. if you have maven artifacts downloaded by earlier plugin builds, Netbeans likes them better as they are found earlier alphabetically by default, I guess). In this case follow the pop-up's suggestion to open Window / Debugging / Sources, and there right-click to Add Source Root and select your checked-out jenkins core workspace. Then uncheck the box next to the .m2/repository/org/jenkins-ci/main/jenkins-core/X.Y.Z/jenkins-core-X.Y.Z-sources.jar entry. Alternatively, the context menu after that right-click also offered to just move items (e.g. directories from your workspace) up in priority.
* when the console says that your "Jenkins is fully up and running", open a browser and go to http://localhost:8080/jenkins/
** be careful to not press ENTER in the terminal that your mvnDebug runs in, by default that restarts the tested program instance ;)
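The terminal side of the steps above can be sketched as a pair of shell functions. This is only a sketch under my assumptions: the function names are mine, the settings file is written only when none exists yet (so an existing one is never clobbered, and you would have to merge the pluginGroup tag by hand), and the second function has to be called from the root of the Jenkins main module checkout.

```shell
#!/bin/sh
# Write a minimal Maven settings.xml with the org.jenkins-ci.tools
# pluginGroup into directory $1 (normally $HOME/.m2), unless one exists.
write_m2_settings() {
    m2dir="$1"
    if [ -e "$m2dir/settings.xml" ]; then
        echo "$m2dir/settings.xml exists, edit it by hand" >&2
        return 1
    fi
    mkdir -p "$m2dir" && cat > "$m2dir/settings.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
  <pluginGroups>
    <pluginGroup>org.jenkins-ci.tools</pluginGroup>
  </pluginGroups>
</settings>
EOF
}

# Build the edited sources and start Jenkins under mvnDebug.
# $1 is an optional JENKINS_HOME; the default is war/work in the workspace.
run_jenkins_debug() {
    mvn install -DskipTests=true || return 1
    if [ -n "$1" ]; then
        (cd war && mvnDebug -DJENKINS_HOME="$1" -DskipTests=true jenkins-dev:run)
    else
        (cd war && mvnDebug -DskipTests=true jenkins-dev:run)
    fi
}
```

Typical use would be `write_m2_settings "$HOME/.m2"` once, and then `run_jenkins_debug` (or `run_jenkins_debug "$HOME/jenkinstest"`) from the workspace root, after which the IDE can attach to the port printed in the terminal.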

Many thanks to Daniel Beck for sharing the magic bits and helping me move forward with this.

Saturday, February 2, 2019

Long live the SUNWdhcpd!

I was updating a very aged and venerable infrastructure server from OpenIndiana 151a8 (the last "dev" release) into modern-day OpenIndiana hipster. This caused a number of disruptions due to changes in various daemons' configuration and old data file handling, which was not surprising given that half a decade was skipped overnight as far as installed software versions were concerned.

But the more complicated part was that this server was also providing DHCP to the network, using the SUNWdhcp server and relying heavily on its macros. SUNWdhcp macros are snippets which group a few DHCP options as relevant for this or that client profile, such as subnet-addressing related options, preferred "nearer" DNS and NTP servers, etc. They can be combined into further macros to build up ultimate configurations applicable to various subnets and hardware types, and IP addresses can be reserved (with or without a particular device's MAC address) to map a macro configuration to the device and thoroughly set it up, maybe not in the same way as its neighbor. The server also allows for programmatic reconfiguration of DHCP settings and reservations, using CLI and GUI tools. In particular, this is heavily used by the Sun Ray Server software to set up address management for its clients.

In short, migrating to some other respectable solution like ISC dhcpd was complicated, as it does not offer similar concepts. It was probably possible to generate one configuration file from another with some smart script, but then updating the configs (say, adding a new DNS server replica for clients to talk to) would require many coordinated changes in many places, rather than changing one line in a macro. So this was never pursued in the past years, as the server ticked on and its software grew older and older.

But now, the upgrades came... And the service just disappeared, because long ago between OI 151a8 and the recommended interim step of OI hipster-2015, the SUNWdhcp server just stopped working in new illumos-gate builds... and nobody looked much at the why's... and then it was ripped away.

I went into the snapshot of older version deployment and tarballed the files which were the content of "dhcp" and "dhcpmgr" packages, and unpacked that into the new root - but the service did not work, which was sort of sadly expected.

Finally, I had the big-kick incentive to take a harder look. If the issue were something simple, fixing it would be "cheaper" than migrating those server configs, and would allow us to retain the other benefits of having SUNWdhcp.

And indeed, with the help of Andy Fiddaman as we met up at FOSDEM, we traced the issue to the following chain of events: the dhcp-server service starts the in.dhcpd daemon; the daemon starts a helper, dsvclockd, that finds and loads shared-object modules to operate with different formats of data files with DHCP configurations (there are plain-text files and binary databases). However, the helper exited very quickly, and the in.dhcpd daemon did not find a "door" to interact with the helper, so it also exited. We ultimately found that the helper claims that no modules are to be used, meaning that either none were found, or none of those found contained the expected symbols. But just like in the years before, the service was looking in the correct directory, and the libraries there did contain the symbol... so why?

Then it caught our attention that the trace of the binary did refer to the directory, but not to the actual shared object files. Andy's magic with dtrace and mdb confirmed that the glob() call returns GLOB_NOMATCH, so for some reason it does not like the pattern it searches by. But the same pattern string did find the libraries when used e.g. in a shell command line...

The git history of glob.c showed that it was not very turbulent, with a screenful of most recent commits being in 2008, 2013, 2015 and 2017. But the timeframe of a very big change in 2013 did match the breakage of the DHCP software. Given the scale of that glob code change, it is reasonable to assume that some fringe behaviors changed, even if we can't quickly point where for this particular issue.

What's better, this bit of data pushed us to good experiments: the pattern that the DHCP code was looking for involved an escaped period character before the extension, which could be quite an edge case. Adding symlinks that would have the backslash (or two) in the filename did not resolve the problem. The next idea was to remove that backslash from the pattern (monkey-patching the binary /usr/lib/libdhcpsvc.so.1 with Midnight Commander, to remove the backslash character and add a zero character at the end of the string, so the file size stays the same). And this was a hit!

The SUNWdhcp server has again started and hands out addresses, and its management tools work again!

I am not quite holding my breath that the fixed version at https://github.com/jimklimov/illumos-gate/tree/revive-SUNWdhcpd can be re-introduced into illumos-gate (or arranged as a standalone project easily), but at least now we know how people can fix their setups in place :)

Also note that for the GUI tools you would need an Oracle Java (with better X11 integration than openjdk), and a 32-bit capable build at that (it uses JNI to get into shared objects, so the bitness has to match). This means that you need an Oracle Java 6 or 7 (I tested both), plus the -d32 command-line option if your JRE/JDK directory includes both bitnesses. To run that JVM, also make sure that you have the SUNWlibC package installed.

The tools are wrapped by scripts which hardcode the use of /usr/java/bin/java, so if your system is updated to use a newer java by default, you may have to tweak the scripts for dhcpmgr, dhtadm, pntadm, and dhcpconfig. On a side note, a similar fix may be needed for printmgr and slp.
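To see which wrapper scripts need such a tweak, one can search for the hardcoded path. Here is a small sketch, assuming you pass it the directory (or directories) where these wrappers live on your system; the `find_hardcoded_java` name is mine, and I deliberately do not guess the install locations here.

```shell
#!/bin/sh
# List regular files under the given directories that mention the
# hardcoded interpreter path /usr/java/bin/java.
find_hardcoded_java() {
    for dir in "$@"; do
        find "$dir" -type f -exec grep -l '/usr/java/bin/java' {} +
    done
}
```

Calling it as `find_hardcoded_java /some/dir` prints the matching files, which can then be edited to point at the JVM you want.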