Saturday, February 2, 2019

Long live the SUNWdhcpd!

I was updating a very aged and venerable infrastructure server from OpenIndiana 151a8 (the last "dev" release) to modern-day OpenIndiana hipster. This caused a number of disruptions due to changes in the configuration of various daemons and in their handling of old data files, which was not surprising given that half a decade of software versions was skipped overnight.

But the more complicated part was that this server was also providing DHCP to the network, using the SUNWdhcp server and relying heavily on its macros. SUNWdhcp macros are snippets which group a few DHCP options relevant for this or that client profile, such as subnet addressing options, preferred "nearer" DNS and NTP servers, etc. They can be combined into further macros to build up complete configurations applicable to various subnets and hardware types, and IP addresses can be reserved (with or without a particular device's MAC address) to map a macro configuration to a device and set it up thoroughly, perhaps differently from its neighbor. The server also allows programmatic reconfiguration of DHCP settings and reservations, using CLI and GUI tools. In particular, this is heavily used by the Sun Ray Server software to set up address management for its clients.

In short, migrating to some other respectable solution like ISC dhcpd was complicated, as it does not offer similar concepts. It was probably possible to generate one configuration file from another with some smart script, but then updating the configs (say, adding a new DNS server replica for clients to talk to) would require coordinated changes in many places, rather than changing one line in a macro. So this was never pursued in the past years, as the server ticked along and its software grew older and older.

But now the upgrade came... and the service just disappeared, because long ago, somewhere between OI 151a8 and the recommended interim step of OI hipster-2015, the SUNWdhcp server stopped working in new illumos-gate builds... nobody looked much into why... and then it was ripped out.

I went into a snapshot of the older deployment, tarballed the files delivered by the "dhcp" and "dhcpmgr" packages, and unpacked them into the new root. The service still did not work, which was sadly expected.

Finally, I had a big enough kick to take a harder look. If the issue turned out to be something simple, fixing it would be "cheaper" than migrating those server configs, and would let me retain the other benefits of keeping SUNWdhcp.

And indeed, with the help of Andy Fiddaman, whom I met at FOSDEM, we traced the issue to the following chain of events: the dhcp-server service starts the in.dhcpd daemon; the daemon starts a helper, dsvclockd, which finds and loads shared-object modules that handle the different formats of data files with DHCP configurations (there are plain-text files and binary databases). However, the helper exited very quickly, and the in.dhcpd daemon did not find a "door" to interact with the helper, so it also exited. We ultimately found that the helper claimed there were no modules to use, meaning that either none were found, or none of those found contained the expected symbols. But just like in the years before, the service was looking in the correct directory, and the libraries there did contain the symbol... so why?
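To make the helper's job more concrete, here is a rough Python sketch of the general idea: match files by a pattern, try to load each one as a shared object, and keep only those that export an expected symbol. It is not the dsvclockd code; the directory, file pattern and symbol name are hypothetical placeholders, and Python's glob module is its own implementation rather than the C library's glob().

```python
import ctypes
import glob


def find_modules(pattern, symbol):
    """Return the paths matching `pattern` that load and export `symbol`."""
    usable = []
    for path in sorted(glob.glob(pattern)):
        try:
            lib = ctypes.CDLL(path)   # roughly what dlopen() does in C
        except OSError:
            continue                  # not a loadable shared object
        if hasattr(lib, symbol):      # roughly what a dlsym() check does
            usable.append(path)
    return usable


if __name__ == "__main__":
    # Hypothetical directory, pattern and symbol name, for illustration only.
    modules = find_modules("/path/to/modules/*.so", "example_symbol")
    if not modules:
        print("no usable modules: a helper like this would give up here")
```

An empty result can mean either that the pattern matched nothing or that no match passed the symbol check, which is the same ambiguity we faced while tracing.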

Then it caught our attention that the trace of the binary did refer to the directory, but not to any actual shared object files. Andy's magic with dtrace and mdb confirmed that the glob() call returned GLOB_NOMATCH, so for some reason it did not like the pattern it was given. But the same pattern string did find the libraries when used e.g. on a shell command line...

The git history of glob.c showed that it was not very turbulent, with a screenful of the most recent commits dating from 2008, 2013, 2015 and 2017. But the timeframe of a very big change in 2013 did match the breakage of the DHCP software. Given the scale of that glob code change, it is reasonable to assume that some fringe behaviors changed, even if we can't quickly point to where exactly for this particular issue.

What's better, this bit of data pushed us to good experiments: the pattern that the DHCP code was looking for involved an escaped period character before the extension, which could be quite an edge case. Adding symlinks that would have the backslash (or two) in the filename did not resolve the problem. The next idea was to remove that backslash from the pattern, by monkey-patching the binary /usr/lib/libdhcpsvc.so.1 with Midnight Commander: remove the backslash character and add a zero character at the end of the string, so that the file size stays the same. And this was a hit!
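For those who prefer a script to a hex editor, the same kind of byte-level edit can be sketched in Python. This is only an illustration under my own assumptions: the pattern bytes below are a made-up stand-in (not the real string from the library), and the function works on a bytes object, so you would read a copy of the file, patch it, and write it back out yourself.

```python
def strip_backslashes_keep_length(blob: bytes, old: bytes) -> bytes:
    """Replace `old` with a backslash-free copy, NUL-padded to the same length."""
    new = old.replace(b"\\", b"")
    # Pad at the end of the string so every later byte keeps its offset.
    padded = new + b"\x00" * (len(old) - len(new))
    if blob.count(old) != 1:
        raise ValueError("expected exactly one occurrence of the pattern")
    return blob.replace(old, padded)


if __name__ == "__main__":
    # Made-up stand-in for the library contents, for illustration only.
    fake_library = b"\x7fELF...\x00example_*\\.so\x00...rest"
    patched = strip_backslashes_keep_length(fake_library, b"example_*\\.so")
    assert len(patched) == len(fake_library)
    print(patched)
```

Because C strings end at the first zero byte, the extra padding byte is harmless, and nothing after the string moves.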

The SUNWdhcp server starts again and hands out addresses, and its management tools work again!

I am not quite holding my breath that the fixed version at https://github.com/jimklimov/illumos-gate/tree/revive-SUNWdhcpd can be re-introduced into illumos-gate (or arranged as a standalone project easily), but at least now we know how people can fix their setups in place :)

Also note that for the GUI tools you would need an Oracle Java (it has better X11 integration than openjdk), and a 32-bit capable build at that: the tools use JNI to get into shared objects, so the bitness has to match. In practice this means an Oracle Java 6 or 7 (I tested both), plus the -d32 command-line option if your JRE/JDK directory includes both bitnesses. To run that JVM, also make sure that you have the SUNWlibC package installed.

The tools are wrapped by scripts which hardcode the use of /usr/java/bin/java, so if your system has been updated to use a newer Java by default, you may have to tweak the scripts for dhcpmgr, dhtadm, pntadm, and dhcpconfig. On a side note, a similar fix may be needed for printmgr and slp.
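If there are several wrappers to tweak, the substitution itself is easy to script. Here is a minimal sketch, assuming the wrapper text simply mentions /usr/java/bin/java somewhere; the sample wrapper line and the replacement JVM path are hypothetical, and the real scripts may look different.

```python
import re

HARDCODED_JAVA = "/usr/java/bin/java"


def repoint_java(script_text, new_java, old_java=HARDCODED_JAVA):
    """Return the script text with the hardcoded Java path replaced."""
    # The lookahead leaves longer names such as .../bin/javac untouched.
    pattern = re.escape(old_java) + r"(?![\w.-])"
    return re.sub(pattern, lambda _match: new_java, script_text)


if __name__ == "__main__":
    # Hypothetical wrapper line and JVM location, for illustration only.
    sample = 'exec /usr/java/bin/java -d32 "$@"\n'
    print(repoint_java(sample, "/opt/example-jdk7/bin/java"), end="")
```

Keep a backup of each original script before writing the changed text back.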
