Tuesday, April 13, 2021

A safe worker on main Jenkins node

A CI farm using Jenkins can start small and grow big. It can be a PoC or a multi-config builder running on a laptop for a single developer's fun, or it can be a dispatcher of jobs commanding numerous machines and swarms of VMs and containers. In the latter case, when a Jenkins instance grows big (and public), security considerations come into view. One of these is the safety of the Jenkins controller (née master) at the OS level against the payload of the jobs it runs: how safe are its configuration files and running processes against the arbitrary scripts someone puts into Git?

Documents like https://wiki.jenkins.io/display/JENKINS/Security+implication+of+building+on+master or https://www.jenkins.io/redirect/building-on-controller/ and plugins like https://wiki.jenkins.io/display/JENKINS/Job+Restrictions+Plugin address this situation, effectively by reducing the list of jobs that can run on the controller.

In many cases, for a modern deployment, you would not need to have jobs running on the controller; you would have a pipeline using some agent definition (maybe as a docker, maybe as a match to some label expression) and its payload would run there. The "master" node can then serve zero executors and not pick any work up, except for some system tasks like initial parsing of the pipeline scripts or SCM scanning (which do not consume executor count).

However, more likely in legacy but possibly in new deployments tailored to the physical layout of some build farm, you may need "infrastructural" jobs, whether for grooming your hypervisor, or orchestrating integration tests of product images, or collecting health stats from non-Jenkins players in your farm.

Quite likely, such jobs need a worker on a predictable host, or even a persistent workspace to pass some state between runs. Reasons may include using certain files you leave in the FS (though that is an anti-pattern for pure Jenkins setups - credentials may be better, including a text-file "credential"), using NFS shares, etc. This is where the solution below can help:

My recurring pattern for avoiding that "insecure" setup while providing the equivalent type of worker is to just create a persistent agent (SSH, Swarm...) running with a different Unix/Linux account on the same machine as the Jenkins controller, labeled e.g. "master-worker" and limited in node configuration to only run jobs that match by label. The few infra jobs which need it explicitly ask to run on that node via a label expression - by agent definition in pipelines, or "Restrict where this project can be run" in legacy job types (e.g. Freestyle). The original "(master)" node is then limited to 0 executors, so effectively it only processes pipeline start-ups; you can manage it at your $JENKINS_URL/computer/(master)/configure/ (parentheses included).

So this way such "master-worker" is just another worker not endangering the Jenkins master (as far as messing with FS and processes is concerned - agent.jar runs under a different account which just happens to be on the "localhost" relative to controller), but it is persistent unlike containers, dockers, etc. and runs on a predictable machine which may be an advantage.
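
One quick sanity check of such a setup is to run a small script as the agent's account (for example from a throwaway job restricted to "master-worker") and see whether it can touch the controller's files. This is only a minimal sketch: the function name is made up for this example, you have to pass in the controller's home directory yourself, and secrets/master.key is just one well-known sensitive file to probe.

```shell
#!/bin/sh
# Hypothetical helper, not part of Jenkins: run it as the agent's account.
# $1: path to the controller's JENKINS_HOME
check_controller_exposure() {
    home="$1"
    # A writable home means jobs could edit the controller configuration
    if [ -w "$home" ]; then
        echo "WRITABLE: $home"
        return 1
    fi
    # A readable master key means jobs could decrypt stored secrets
    if [ -r "$home/secrets/master.key" ]; then
        echo "READABLE: $home/secrets/master.key"
        return 1
    fi
    echo "OK: no write access to $home, master key not readable"
}
```

For instance, `check_controller_exposure /var/lib/jenkins` in a job's shell step (the path is an example, use your own) should print the OK line if the accounts are separated as intended.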

Monday, February 4, 2019

Debugging Jenkins plugins and core with an IDE

I'll collect notes here as I go along, to make an article later where it belongs better - like https://jenkins.io/doc/developer/building/ or https://wiki.jenkins.io/display/JENKINS/Building+Jenkins which contains a somewhat less detailed version of this information :)

Debugging Jenkins plugins and core with an IDE, feeling like you're 5
Based on a true story :)

While developing some fixes to plugins that would improve their behavior for our use-cases (and not really being a Java developer), at some point the cycle of editing a bit of code and running `mvn package` and deploying it to a Jenkins server instance to see how that goes (and maybe read some logs or System.err.println() lines) somehow did not scale well :)

Subsequently I found ways to extend the existing self-tests in plugins, mostly by looking at existing precedent code and working my way up from there, discovering what can be done at all and how -- and essentially anything you can do interactively, there is a Java command for that in the background... though finding ways to get it called right can be tricky for a newcomer to the ecosystem, thus reading into existing tests of this or other plugins is a really good way to get ideas.

But still, even if the testing became faster and more reliable and I could more quickly (with just an `mvn test`) see that my further changes to code do not break the expected behaviors that I defined with tests for the new or modified feature, it had a big overhead of running hundreds of tests for minutes sometimes, with my new baby lost in the middle of the haystack. And if something failed, this actually was harder to debug, because I did not find a way to read the Logger or System.*.println messages that would tell me what is happening in my code. And if it misbehaved, I had little idea why, with no details to theorize on.

So it was time to use a real debugger, to step through the code and see the variables as they change.

For historic reasons, my IDE of choice was Netbeans (the details below apply to others, modulo their GUI nuances), which is not a frequent choice nowadays, but was used by some Jenkins developers who even made various plugins to integrate with a Jenkins server like http://wiki.netbeans.org/HudsonInNetBeans etc. The release I used was the last Oracle Netbeans development version (8.2+), so note the current ones are from Apache, but are being released in phases (9, 10, 11...) during 2018-2019, with 8.2 plugins being the recommended choice for additional features that were not re-released yet (such as C/C++ support).

So, Netbeans supports maven well, and Jenkins and its plugins use maven to configure the projects. So it looks like a good match.

In my case however, the Maven provided as part of the older Netbeans distribution was older than what Jenkins requires, so I installed the OS package which is new enough (or I could download the newest from Apache) and went to Tools/Options/Java/Maven to specify the "Maven home" -- the base directory for the installation to use (/usr for the package).

Then I opened an existing project (from the git-checked-out workspace), and set breakpoints all over the place where I think my code could fail or otherwise be in an interesting state. Then I right-clicked the sources, and there was a Debug Test File option, with the latter starting `mvn test` and eventually hitting my breakpoints.

That simple thing was a huge step forward, I eventually found that the code was actually doing what I told it to do, by coding, though not what I intended it to do, by thinking ;)

And then I stumbled on something even better: when I went to Test Packages in the Navigator pane, and to the source of the test I am interested in, and to a test routine that I was tweaking, the context menu offered to Debug Focused Test Method. Now this was a big optimization - while the Debug Test File run of the project still did a lot of tests before it struck what I was interested in, this newly found mode actually started the test suite right from this routine, or close to that, shrinking my change-test iterations from minutes to seconds!

(Note that while I did not yet try this, the docs imply that Java code can actually be changed, recompiled and reloaded on the fly during a debugging session, as long as the method signatures are not changed... this might work for plugin code/test development, but less likely for jenkins core where the WAR file is debugged).

Next, my interest turned toward improvements in the jenkins-core. Now this is a separate beast, compared to plugin debugging. When you test a plugin, it downloads a lot of maven artifacts, including some old Jenkins binary build (the minimal compatible version, as declared by the plugin manifest), and starts it as a server with your plugin loaded into it (and on a side note, I could not run the tests or debug sessions on an illumos system, because those old artifacts do not contain a fixed libzfs4j version and crash - so I had to test in a Debian VM). By contrast, the Jenkins repository actually contains a "Jenkins main module" with nested maven projects, for jenkins-core, jenkins-cli, jenkins-war and several variants of tests for jenkins-core.

Normally, with the old approach of code peppered with logging statements, I would go to the main module project, run `mvn package`, wait for a while, and find the `war/target/jenkins.war` that I can run with a specified JENKINS_HOME on our server or my laptop.

But in the IDE, there was no good option to really debug such a build. If I trigger a debug of the main module, the server starts but tracing into the code does not happen. If I trigger the jenkins-core, it asks for what classes I want to execute... and I have no idea, neither of the suggested options succeeded. And trying the jenkins war project, it referred to jenkins-cli and jenkins-core artifacts by versions that try to pull in something that I did not edit and build -- a binary from the internet, often something that does not exist there (an X.Y-SNAPSHOT version) so the attempt fails. And also in some of those cases where the server at least started, some resources were not found, so the web interface was unusable after logging in.

On another side note, loading the project into the IDE also began its indexing, and my development VM ran out of temporary space. As I found, indexing produces and later removes huge amounts of data, peaking around 3 GB. I restarted Netbeans with `netbeans -J-Djava.io.tmpdir=$HOME/tmp` so it used the big home dataset rather than the smaller tmpfs.

So at the Jenkins hackathon today, I asked around and found that what works for people is to start maven separately in debug mode, and then attach to it from the IDE. So:
* in your home directory, create or edit the `$HOME/.m2/settings.xml` and put in the pluginGroup tag for "org.jenkins-ci.tools" as follows:

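<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
  <pluginGroups><pluginGroup>org.jenkins-ci.tools</pluginGroup></pluginGroups>
</settings>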
* start a terminal window from the IDE (in case of Netbeans, go to root of the workspace and click Tools / Open in terminal, or go to Window / IDE Tools / Terminal), especially if your test is on another machine (such as it happens to be in my case);
* build the edited code, e.g. in the main module do: `mvn install -DskipTests=true` (note that `mvn package` would not do the job)
* run mvnDebug with the newly made WAR file: `(cd war ; mvnDebug -DskipTests=true jenkins-dev:run)`
** the default JENKINS_HOME for the test would be a `war/work` subdirectory in your workspace
** the jenkins-dev mode launches it without the installation wizard, so when you get into the server, you instantly have the user interface with no login
** to optionally run with a JENKINS_HOME of your choice, e.g. with test jobs and user accounts previously set-up, you can `(cd war ; mvnDebug -DJENKINS_HOME=$HOME/jenkinstest -DskipTests=true jenkins-dev:run)` instead
* in the IDE, set your breakpoints and go to Debug / Attach to debugger / JPDA / SocketAttach / select the port number from terminal (e.g. 8000)
** make sure your internet connection works at this point, as maven would pull a number of dependencies
** the IDE can show your breakpoint as a torn red box, with a pop-up that these sources are not the ones preferred (e.g. if you have maven artifacts downloaded by earlier plugin builds, Netbeans likes them better as they are found earlier alphabetically by default, I guess). In this case follow the pop-up's suggestion to open Window / Debugging / Sources and there right-click to Add Source Root and select your checked-out jenkins core workspace. Then remove the checkbox at an .m2/repository/org/jenkins-ci/main/jenkins-core/X.Y.Z/jenkins-core-X.Y.Z-sources.jar. Alternately, the context menu after that right-click also offered to just move items (e.g. directories from your workspace) up in priority.
* when the console says that your "Jenkins is fully up and running", open a browser and go to http://localhost:8080/jenkins/
** be careful to not press ENTER in the terminal that your mvnDebug runs in, by default that restarts the tested program instance ;)
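
The terminal part of the steps above can be sketched as one small function. This is an illustration rather than an official recipe: the function name and the DRY_RUN switch are made up here, and it assumes you run it from the root of a jenkins core checkout with mvn and mvnDebug in PATH.

```shell
#!/bin/sh
# Hypothetical wrapper around the two commands from the list above.
# $1 (optional): a JENKINS_HOME to use instead of the default war/work
# DRY_RUN=yes only prints the commands instead of running them.
jenkins_debug_run() {
    run=""
    [ "${DRY_RUN:-no}" = yes ] && run="echo"
    # "install" rather than "package", as noted in the list above
    $run mvn install -DskipTests=true || return
    if [ -n "${1:-}" ]; then
        (cd war && $run mvnDebug "-DJENKINS_HOME=$1" -DskipTests=true jenkins-dev:run)
    else
        (cd war && $run mvnDebug -DskipTests=true jenkins-dev:run)
    fi
}
```

Calling `jenkins_debug_run "$HOME/jenkinstest"` then leaves mvnDebug waiting for the IDE to attach, same as when typing the commands by hand.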

Great thanks to Daniel Beck for sharing the magic bits and helping me move forward with this.

Saturday, February 2, 2019

Long live the SUNWdhcpd!

I was updating a very aged and venerable infrastructure server from OpenIndiana 151a8 (the last "dev" release) into modern-day OpenIndiana hipster. This caused a number of disruptions due to changes in various daemons' configuration and old data file handling, which was not surprising given that half a decade was skipped overnight as far as installed software versions were concerned.

But the more complicated part was that this server was also providing DHCP to the network, using the SUNWdhcp server with heavy use of macros. SUNWdhcp macros are snippets which group a few DHCP options as relevant for this or that client profile, such as subnet-addressing related options, and preferred "nearer" DNS and NTP servers, etc. They can be combined into further macros to build up ultimate configurations applicable to various subnets and hardware types, and IP addresses can be reserved (with or without a particular device's MAC address) to map a macro configuration to the device and thoroughly set it up, maybe not in the same way as its neighbor. Also the server allows for programmatic reconfiguration of DHCP settings and reservations, using CLI and GUI tools. In particular, this is heavily used by the Sun Ray Server software to set up address management for its clients.

In short, upgrading into some other respectable solution like ISC dhcpd was complicated, as it does not offer similar concepts. It was probably possible to generate one configuration file from another with some smart script, but then updating the configs (say, adding a new DNS server replica for clients to talk to) would require many coordinated changes in many places, rather than changing one line in a macro. So this was never pursued in the past years, as the server ticked and its software grew older and older.

But now, the upgrades came... And the service just disappeared, because long ago between OI 151a8 and the recommended interim step of OI hipster-2015, the SUNWdhcp server just stopped working in new illumos-gate builds... and nobody looked much at the why's... and then it was ripped away.

I went into the snapshot of older version deployment and tarballed the files which were the content of "dhcp" and "dhcpmgr" packages, and unpacked that into the new root - but the service did not work, which was sort of sadly expected.

Finally, I had the big-kick incentive to take a harder look. If the issue turned out to be something simple, fixing it would be "cheaper" than migrating those server configs, and would let us retain the other benefits of having SUNWdhcp there.

And indeed, with the help of Andy Fiddaman as we met up at FOSDEM, we traced the issue into the following chain of events: the dhcp-server service starts the in.dhcpd daemon; the daemon starts a helper dsvclockd that finds and loads shared-object modules to operate with different formats of data files with DHCP configurations (there are plain-text files and binary databases). However, the helper exited very quickly, and the in.dhcpd daemon did not find a "door" to interact with the helper, so it also exited. We ultimately traced that the helper claims that no modules are to be used, meaning that either none were found, or of those found none contained the expected symbols. But like in the years before, the service was looking in the correct directory, and the libraries there did contain the symbol... so why?

Then it caught our attention that the trace of the binary did refer to the directory, but not to actual shared object files. Andy's magic with dtrace and mdb confirmed that the glob() command returns GLOB_NOMATCH and so it does not like the pattern it searches by for some reason. But the same pattern string did find the libraries when used e.g. in shell command line...

The git history of glob.c showed that it was not very turbulent, with a screenful of most recent commits being in 2008, 2013, 2015 and 2017. But the timeframe of a very big change in 2013 did match the breakage of the DHCP software. Given the scale of that glob code change, it is reasonable to assume that some fringe behaviors changed, even if we can't quickly point where for this particular issue.

What's better, this bit of data pushed us to good experiments: the pattern that DHCP code was looking for involved an escaped period character before the extension, which could be quite an edge case. Adding symlinks that would have the backslash (or two) in the filename did not resolve the problem. The next idea was to remove that backslash from the pattern (monkey-patching the binary /usr/lib/libdhcpsvc.so.1 with Midnight Commander, to remove the backslash character and add a zero-character at the end of the string, so the file size stays the same). And this was a hit!

The SUNWdhcp server has again started and hands out addresses, and its management tools work again!

I am not quite holding my breath that the fixed version at https://github.com/jimklimov/illumos-gate/tree/revive-SUNWdhcpd can be re-introduced into illumos-gate (or arranged as a standalone project easily), but at least now we know how people can fix their setups in place :)

Also note that for the GUI tools, you would need an Oracle Java (with better X11 integration than openjdk) and at that, a 32-bit capable build (it uses JNI to get into shared objects, so bitness has to match) meaning that you need an Oracle Java 6 or 7 (tested both) and a -d32 command-line option if your JRE/JDK directory includes both bitnesses. To run that JVM, make sure also that you have the SUNWlibC package installed.

The tools are wrapped by scripts which hardcode use of /usr/java/bin/java, so if your system is updated to use the newer java by default, you may have to tweak the scripts for dhcpmgr, dhtadm, pntadm, and dhcpconfig. On a side note, a similar fix may be needed by printmgr and slp.
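
Tweaking those wrappers can be scripted. The following is a minimal sketch and not something shipped with the packages: the function name is made up, it assumes the wrapper really contains the literal string /usr/java/bin/java, and that the replacement path has no "|" or "&" characters in it.

```shell
#!/bin/sh
# Hypothetical helper: point one wrapper script at another JVM.
# $1: wrapper script to edit, $2: full path of the java binary to use
repoint_java() {
    script="$1" ; newjava="$2"
    # Keep the original next to the edited copy
    cp -p "$script" "$script.orig" || return
    # Write through a redirect instead of "sed -i", which not every sed has;
    # this also keeps the permissions of the existing script file
    sed "s|/usr/java/bin/java|$newjava|g" "$script.orig" > "$script"
}
```

It would be called once per tool, e.g. `repoint_java /path/to/dhtadm /path/to/jdk7/bin/java`, with both paths adapted to where the scripts and the 32-bit capable JVM live on your system.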

Monday, March 12, 2018

Moving and renaming projects in OBS

For some part of our software project lifecycle with http://42ity.org/ we use an on-premise setup of OBS aka OpenSUSE Build Service, at least initially that was the name (not to be confused with Open Broadcaster Software) ;) Having a local setup allows us to modify and manage it in ways the common cloud service can not be changed to our whim.

One such issue we were facing recently was that a few packages (ultimate recipes) were placed into wrong projects or sub-projects (a scope to group packages for the purposes of dependencies, upgrades, etc.). This caused a mess, because beside the pedantic "it is not clean" complaints, packages at different levels ended up building against different dependencies - and in this case it was not intended (sometimes it is, hence the scoping). Packages have a history that tracks evolution of a recipe, and sometimes it is important to keep it for development reference, or even as part of the workflow in our case (determine whether *this* version of the Git sources was packaged earlier).

Unfortunately, OBS does not directly support moving such data around nor even renaming packages. You can "branch" packages (to track an original package's changes and merge them with customized bits of the clone), but that's about it. Thanks to help on the IRC channel, however, we found a way to do it on the backend, fiddling under the hood.

Take backups or snapshots before following these notes!

Now, it is important to keep in mind that OBS is a wad of scripts written in several languages for different purposes, such as a Ruby web-frontend, a perl backend for scheduling etc., shell scripts to deal with OS nuances such as build-root setup... and the backend involves (and represents) a database to keep some info and claim unique IDs among other stuff, while a lot is kept in arcane directory structures as files placed in expected locations with expected names and contents and role in the overall solution. So fiddling under the hood is tweaking the implementation detail, and might not work the same way in all versions (FWIW, ours is based on a Nov 2015 release) and might have poorly traceable consequences. Still, it worked for us once so I decided to record the experience :)

1. Identify the poorly named package. For the recent example's sake, it will be a "fty:master:/appliance:/fosspkg" which used a third-party FOSS package "fosspkg" almost verbatim - just added the pkgconfig files that configure scripts in some of our components relied on and the original code did not provide. The problem here was that the top-level project "fty:master" ended up using the upstream distributions' version of the package for its builds and tests, while its sub-project "appliance" used another - so preinstall images (made in top-level project) did not match, and our own common components subsequently placed into the top-level project did not even build (needed those pkgconfig files). We want this "fosspkg" recipe moved upwards, to the top-level project.

2. Create a new empty "fty:master:/fosspkg" recipe, using the common web-gui "Create package" link (or CLI, or REST API) so the system properly assigns the new IDs and other resources it wants.

3. Fire up an SSH session to the OBS master server, where the fun happens henceforth. If you used packaged setup initially, the data to modify will be in several locations under "/srv/obs" directory:
  • /srv/obs/sources/fosspkg contains the source code (reused by all packages with the same name located under different projects), and likely remains unchanged.
  • /srv/obs/trees/fty:master:appliance/fosspkg and /srv/obs/trees/fty:master/fosspkg contain files with references to metadata for each commit into the repo (list of source and other files with their hashes that comprise this or that revision of the recipe) - just copy the files from original version into the newly created one.
  • /srv/obs/projects/fty:master:appliance.pkg/ and /srv/obs/projects/fty:master.pkg/ directories contain some *.xml, *.del, *.rev and *.mrev files for each package in this project (e.g. fosspkg.xml and fosspkg.mrev for initial state). Copy the original package's REV file into the new location, and carefully migrate the XML file contents. TODO: Not sure what should be done with the MREV file - is it safe to replace the new one with the old one (moved away into "fosspkg.mrev.del")?
    • The XML file contains the description, title, enabled build targets (if customized compared to the project level) and such - you might transplant the description for example, if you haven't done so with web-gui already; note that COPYING the XML file verbatim is not a good idea, as it references the "project" in its top tag (can copy and edit, though, if there were many custom settings);
    • The REV text-table file contains the actual history of the package - which revision was done when and which number it was in the order of succession, as well as other revision-specific details; is not initially there before the first upload of actual contents into the package;
    • The MREV file seems to track the initial creation of the component? :)
    • The DEL files track components that were recipe'd before, but have since been deleted.
  • /srv/obs/repos/fty:/master:/appliance/Distro_X.Y/ and /srv/obs/repos/fty:/master/Distro_X.Y/ contain build products of the project (including package sources and architecture-specific subdirectories). To avoid rebuilds, you can copy over the products from the old location into the new one (this might make sense if builds are costly, but if the actual dependencies have changed due to relocation - you might miss out on something here). To free up resources of the server, you can otherwise remove the build products of the original package you are essentially removing.
  • /srv/obs/build/fty:master:appliance/Distro_X.Y/ARCH/ and /srv/obs/build/fty:master/Distro_X.Y/ARCH/ contain the latest build results (binaries, logfile, reason) as well as the history of builds (when, what, how long) - move it over as well, to keep the history tracking. Beside the "fosspkg" package subdirectory under ARCH, note also the ":full", ":repo", ":logfiles.*" and ":meta" ones - move over their contents for "*fosspkg*" matching files as well.
  • Verify with a shell command like:
    # find /srv/obs -name '*fosspkg*' 2>/dev/null | grep master
    that there is nothing unexpected under any "appliance" related locations.
  • Refresh the new package's page in OBS web-gui. It may display that the builds are "broken" since it has no metadata about last build in its binary tables. Go to Repositories tab to "Explicitly disable" and then "Take default" the Build Flag - if all went well, the system should discover that it was built, and the status will become "succeeded" with the old build's logs seen upon clicking.
  • If the original recipe was "Branched" into other projects, branch the new one into desired locations; carry over customizations that the branch might have (if any) - the web-gui might help with its show of differences.
  • Semi-Finally, disable all the flags in Repository tab of the original obsoleted recipe (so it is in fact not built nor used), and perhaps use
    osc wipebinaries --build-disabled fty:master:appliance fosspkg
    - hopefully this would e.g. rectify the binary metadata we did not touch under the "build" directories.
  • Note that as far as OBS recalculating the dependencies, nothing happened (via web-gui) to trigger rebuilds of stuff that is impacted by your changed package. Probably it is up to you to determine the scope of fallout and trigger the rebuilds.
    • For greater consistency, and if your time/cost constraints permit (this sort of thing is always done in a rush, right?) go into the new component's build states (in the Web-GUI) and click Trigger Rebuild. This would ensure the build products are re-done honestly, and downstream stuff is triggered honestly.
    • You might have some luck with e.g.
      # osc whatdependson fty:master fosspkg Debian_8.0 x86_64
      at least for dependent packages right inside the same project level.
    • Also a direct search in the filesystem like
      # grep fosspkg /srv/obs/build/fty\:master/Debian_8.0/x86_64/*/.meta.success
      might help.
  • Finally, when you're sure the new one works well - delete the original with web-gui.
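
To keep track of which backend locations belong to the old and the new package, the layout listed above can be printed by a tiny helper. This sketch only restates the paths from these notes (it reads and changes nothing), the function name is made up, and other OBS versions may well keep data elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: print where a package's backend data lives.
# $1: project (e.g. fty:master:appliance), $2: package,
# $3 (optional): OBS data root, /srv/obs by default
obs_pkg_paths() {
    prj="$1" ; pkg="$2" ; root="${3:-/srv/obs}"
    # Under repos/ each colon in the project name is followed by a slash
    repoprj=$(printf '%s' "$prj" | sed 's|:|:/|g')
    echo "$root/sources/$pkg"
    echo "$root/trees/$prj/$pkg"
    echo "$root/projects/$prj.pkg/$pkg.xml"
    echo "$root/projects/$prj.pkg/$pkg.rev"
    echo "$root/projects/$prj.pkg/$pkg.mrev"
    echo "$root/repos/$repoprj"
    echo "$root/build/$prj"
}
```

Running it for both `fty:master:appliance fosspkg` and `fty:master fosspkg` gives two lists to compare side by side (for example with `ls -ld` on each line) before and after copying things over.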

Friday, June 23, 2017

rsync from PC to Android to pull data as root

During my backup adventures, I came across many nice tools, such as SSHDroid and SimpleSSHD that allow me to SSH from my computer to the Android device. They even include niceties like rsync and ssh binaries that are quite useful. Alas, this team failed for me on a particular older phone that I wanted to slurp to the PC "as is" for all accessible files (including system ones) keeping their original permissions and owners, which of course needs root on both sides. That should have been easy to do with rooting support in both Android toolkits, but on this phone the SimpleSSHD did allow me to connect (and with public-key auth available for free, unlike SSHDroid) - but only as a "user", and SSHDroid could not start at all due to some execution errors.

Well, I thought, if I can log in by SSH and do su easily, I can just run the utilities from command line to start the rsync session from the phone to PC's SSH server. Alas, this idea got broken on several counts:
  • First of all, even though I could run both /data/data/berserker.android.apps.sshdroid/dropbear/rsync and /data/data/berserker.android.apps.sshdroid/dropbear/ssh individually, running the common rsync -avPHK /data/ root@mypcIPaddr:/androidbackup/data/ failed again due to permissions error:
    rsync: Failed to exec ssh: Permission denied (13)
    rsync error: error in IPC code (code 14) at jni/pipe.c(84) [sender=3.0.8]
    Segmentation fault
    I tried to wiggle around this by e.g. passing a -e '/data/data/berserker.android.apps.sshdroid/dropbear/ssh' argument, or adding the directory to beginning of PATH - but got the same results.
    I finally got these to work using the system path (and note these must be copies with root ownership - not symlinks to original files). Also mind that the Android/su shell syntax is quirky, to say the least:
    # mount -o remount,rw /system
    # cp /data/data/berserker.android.apps.sshdroid/dropbear/ssh /system/xbin/ssh
    # cp /data/data/berserker.android.apps.sshdroid/dropbear/rsync /system/xbin/rsync
    # chmod 755 /system/xbin/rsync /system/xbin/ssh
    # chown 0:0 /system/xbin/rsync /system/xbin/ssh
    # mount -o remount,ro /system
    This got me pretty far, now running just rsync (without prefixing the long path to app instance) became possible, and it could call ssh at least as:
    # rsync -e /system/xbin/ssh -avPHK \
        /data/ root@mypcIPaddr:/androidbackup/files/data/
    /system/xbin/ssh: Exited: Error connecting: Connection timed out
    Segmentation fault
    So this was better, but not enough.
  • Trying to connect using ssh, or even netcat, to any port of my PC seems impossible :\ Even after I disabled firewalls the best I could.
  • Trying to circumvent the firewalls issue, I made an SSH session from the PC to Android with a TCP tunnel, so I could rsync back from the phone through it:
    root@pc# ssh -R 22222:localhost:22
    root@phone# rsync -e '/system/xbin/ssh -p 22222' -avPHK \
        /data/ root@localhost:/androidbackup/files/data/
    Login for root@localhost
    Segmentation fault 
    So this was close - the SSH session from phone to PC got established this way, but rsync still crashed before doing anything meaningful. Similar experiment with netcat tunnels also failed.
Going back to the idea of initiating SSH connections from the PC, which at least works, I tried to make the rsync binary setuid as root:
# chmod 106755 /system/xbin/rsync
But connections from PC to the android, using rsync --rsync-path=/system/xbin/rsync ... failed with permissions error accessing system dirs.

Finally, I made a wrapper script that calls su and this panned out:
su -c /system/xbin/rsync "$@"
I saved it as /system/xbin/rsync-su (and also chmodded it like the original programs above), and it worked!

Unfortunately, it seems that the Android-side rsync still crashes after some time or amount of transfers - though not quite repeatably (it has long stretches of running well, too), so I wrapped it in a loop and excluded some files it had most problems with (passing over another time, without exclusions, allows copying the few files by ssh+tar):
root@pc# while ! rsync --timeout=5 \
    --exclude='*.db' --exclude='*.db-*' \
    --partial-dir=.partial -e 'ssh -p 2222' \
    --rsync-path=/system/xbin/rsync-su \
    -avPHK root@phoneIPaddr:/data/ ./files/data/ \
    ; do echo "`date` : RETRY"; sleep 1; done

The sleep 1 is needed to reliably abort the loop by Ctrl+C if I want to stop it and tweak something.
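
The same retry pattern can be pulled out into a small reusable function. This is just a sketch of the loop above in generic form, the function name is my own choice here:

```shell
#!/bin/sh
# Re-run the given command until it exits successfully, logging each retry.
retry() {
    until "$@"; do
        echo "$(date) : RETRY" >&2
        # Same short pause as in the loop above, so Ctrl+C can stop the loop
        sleep 1
    done
}
```

The rsync call would then read `retry rsync --timeout=5 ... root@phoneIPaddr:/data/ ./files/data/` with the same options as before.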

UPDATE: in hindsight, maybe I should have looked for luckier builds of the tools as well. The Apps2SD project includes a formidable collection of programs, delivering busybox and rsync in particular. There are also binary builds for various platforms at https://github.com/floriandejonckheere/rsync-android. But since my immediate pain has been resolved by a stone-hammer described in this post, I did not look into other possibly finer tools, at least not on this phone (this is recovery after all - even adding SW there is problematic).

For completeness, the target directory on the PC contains a few scripts I conjured up in haste, and a files/ subdirectory into which Android content lands.
One is dubbed tarcp, which does the copying with tar and ssh:

set -o pipefail


# A files/ should be under CWD
ssh -p "$PHONE_SSHPORT" "root@$PHONE_IP" \
  'su -c "cd / && tar cvf - \"'"$@"'\" "' \
  | (cd files/ && tar xvf - )
This is sort of wasteful, because when retries are needed (wifi flakiness, etc.) it re-copies the whole set of arguments.
Another is loopcp which calls the first one in a loop - to do those retries if needed:
for D in "$@" ; do while ! time ./tarcp "$D" ; do \
  echo "`date`: RETRY" ; sleep 1; done; done
Finally, there is a tool to help determine directory sizes on Android from PC, so I could copy-paste a lot of relatively small targets for loopcp to run over, and retries are relatively cheap (not like re-pulling several gigabytes over and over, after getting a last-minute hiccup):


#echo \
ssh -p $PHONE_SSHPORT root@$PHONE_IP 'su -c "cd / && \
   du -ks '"$1"'/* | while read S D \
      ; do [ \"\${S}\" -lt 100000 ] && \
        printf \"\\\"\$D\\\" \" ; done ; \
   echo ; echo ; \
   du -ks '"$1"'/* | sort -n | while read S D \
      ; do [ \"\${S}\" -ge 100000 ] && \
        printf \"\$S\t\$D\n\" ; done ; echo"'
This outputs a long string of double-quoted names of subdirectories (or files) that are present in the argument directory and are under 100 MB in size, and then prints a list of the larger objects sorted by size. The long string can be copy-pasted as the argument list for loopcp; the list of larger objects is for drilling into them and breaking their contents into another string of small chunks of work.
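To try the splitting logic without a phone attached, here is a rough local variant of the same idea. It is a sketch under assumptions: the function name split_by_size and the optional threshold argument are not from the original script, and it runs du on a local directory instead of going through ssh and su.

```shell
#!/bin/sh
# Hypothetical local variant of the size-splitting idea (names are assumptions).
# Usage: split_by_size DIR [LIMIT_KB]
split_by_size() {
    _dir="$1"
    _limit="${2:-100000}"    # in KB, the unit reported by "du -ks"

    # First pass: small entries, double-quoted on one line for copy-pasting
    du -ks "$_dir"/* | while read -r S D ; do
        if [ "$S" -lt "$_limit" ] ; then printf '"%s" ' "$D" ; fi
    done
    echo ; echo

    # Second pass: large entries sorted by size, one "SIZE<TAB>NAME" per line
    du -ks "$_dir"/* | sort -n | while read -r S D ; do
        if [ "$S" -ge "$_limit" ] ; then printf '%s\t%s\n' "$S" "$D" ; fi
    done
    echo
}
```

Running split_by_size files/data would print the small entries first and then the bigger ones to drill into, in the same two-part layout as described above.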

Actually, this set of scripts is what I started with after the initial failures with rsync. In the end, the tedious breaking down of the argument list for "loopcp", and the uncertainty whether I got all the bits and FS rights correct in the copy, pushed me to find a way to run rsync after all - though most of the file content had been copied by that time, using the scripts.

Wednesday, June 21, 2017

Tarballing Android files and partitions over adb

Some time back I was recovering a phone, and before I made something irreversible, I opted for a low-level backup. This copy of old data helped somewhat in migration of program configs to my next phone, too -- although changes between Android 4 and Android 6 did not make this an easy task, clobbering everything with security features ;)

Of course, this adventure required a lot of googling, and posting back as I learned, so I want to save as a cheat-sheet a solution I used: https://android.stackexchange.com/questions/85564/need-one-line-adb-shell-su-push-pull-to-access-data-from-windows-batch-file

Note that this was done using a Windows PC and corresponding tools and drivers, and the Linux side might have different caveats. And of course Android side was rooted, USB debugging allowed, busybox installed, etc. - thanks to a handy habit to do this early, while deploying a new phone.

So, a slightly edited copy of my post from SO follows:

First thing to note is that the adb shell generally sets up a text terminal (so it can convert single end-of-line characters to CRLF, mangling binary data like partition images or TAR archives). While you can work around this in Unix/Linux versions of adb (e.g. add stty raw to your shelled command) or use some newer adb with the exec-out option, on Windows it still writes CRLF to its output. The neat trick is to pass data through base64 encoding and decoding (binaries are available for Windows en masse, google them). Also note that errors or verbose messages printed to stderr in the shell end up on stdout of the adb shell program in the host system - so you want to discard those after the inevitable initial experimentation.
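The effect can be reproduced locally without any phone. The sketch below is a simulation only: the function names are made up, the "text channel" is just awk appending CR to every line, and a base64 tool that accepts -d for decoding is assumed. It shows that base64 text survives the CRLF conversion as long as the CR characters are stripped (or ignored by the decoder) on the receiving side.

```shell
#!/bin/sh
# Simulation only: no phone or adb involved. All function names are made up.

# Mimic a text-mode channel that turns every LF line ending into CRLF
text_channel() {
    awk '{ printf "%s\r\n", $0 }'
}

# Send stdin through base64, the simulated channel, and back to binary.
# Stripping CR before decoding keeps stricter base64 decoders happy.
roundtrip_base64() {
    base64 | text_channel | tr -d '\r' | base64 -d
}

# Checksum helper: prints the CRC and the byte count of stdin
sum_of() {
    cksum | awk '{ print $1, $2 }'
}

# Sample binary payload containing NUL, LF and CR bytes
sample() {
    printf 'A\000B\nC\r\nD\n\n'
}
```

Comparing sample | sum_of with sample | roundtrip_base64 | sum_of should give identical results, while pushing plain text like printf 'A\nB\n' straight through text_channel changes its checksum.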

Here goes (note that in command examples below, long command lines are wrapped for readability - you will need to unwrap them back while copy-pasting... not that my examples would match your phones exactly anyway):

adb shell "su -c 'cat /dev/block/mmcblk0p25 | base64' 2>/dev/null" \
  | base64 -d > s3-mmcblk0p25.img

This can be easily scripted in the Windows cmd.exe shell to cover all the partitions (one can look the list up by ls -la /dev/block/ and/or by cat /proc/diskstats or by cat /proc/partitions), e.g.:

for /L %P in (1,1,25) do ..\platform-tools\adb shell \
  "su -c 'cat /dev/block/mmcblk0p%P | base64' 2>/dev/null" \
  | base64 -d > s3-mmcblk0p%P.img

(Note: use %%P in saved CMD batch files, or %P in an interactive shell.)

Don't forget that there are also mmcblk0boot[01] partitions, and that the mmcblk0 overall contains all those partitions in a GPT wrapping, just like any other harddisk or impersonator of one :)

To estimate individual partition sizes, you can look at the output of:

fdisk -u -l /dev/block/mmcblk0*

Unfortunately, I did not quickly and easily manage to tar cf - mmcblk0p* and get the partition contents, so that I could pipe it to e.g. 7z x -si and get the data out as multiple files in a portable one-liner as well.

To tar some files you can:

adb shell "su -c 'cd /mnt/data && tar czf - ./ | base64' \
  2>/dev/null" | base64 -d > s3-mmcblk0p25-userdata.tar.gz

Of course this solution is not perfect, since the encoding increases the transfer time by about one third; but for recovery purposes, grasping at any straw that helps is good enough... well... :) In any case, if I send a whole partition for export like this, I do not really care if it takes an hour vs. an hour and a half, as long as it reliably does the job. Also if, as some others have documented, the Windows variant of adb.exe does effectively always output CRLF-separated text, there seems little we can do but abuse it with ASCII-friendly encapsulation. Alternatively, adb exec-out might be the solution - but for some reason it did not work well for me.
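The "one third" figure follows from base64 carrying every 3 bytes of input as 4 ASCII characters. A small sketch to check the arithmetic locally (the helper name is made up, and head -c and base64 are assumed to be available on the host):

```shell
#!/bin/sh
# Hypothetical helper: count the base64 characters needed to carry N zero bytes,
# ignoring the line breaks that the encoder inserts for wrapping.
base64_encoded_size() {
    head -c "$1" /dev/zero | base64 | tr -d '\r\n' | wc -c | tr -d ' '
}

base64_encoded_size 3000    # prints 4000
```

The line breaks (CRLF on Windows) add slightly more on top of that, which is consistent with the hour vs. hour and a half estimate being on the pessimistic side.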

Hope this helps someone else, Jim Klimov

Wednesday, March 15, 2017

Digging in the Git history: cheatsheet

I've recently had to rewrite git history of some internal projects going opensource, and wanted to eliminate issues that should not be seen in the wild - like hardcoded testing passwords.

For a task like this it does not suffice to just add a commit that replaces those values with expansible variables populated from elsewhere (e.g. a testbed-local config file). One should also dig through all commits to make sure the string does not pop up anywhere in the history over the years (and git rebase / edit / continue rebase the offending commits as if we were smart enough to do this initially years ago).

There is quite a bit of mixed documentation on this, both in official docs and good suggestions in some blogs and stackexchange forums, but it took considerable time to end up with a few one-liners to do the job, so lest I forget -- I'd just post them here. After all, they can be useful to just find needles in a haystack too, such as finding which commits had to do with a certain keyword.

Finally, note that this script is not optimized for performance etc., but rather for readability and debugging of the procedure ;)
#! /bin/sh

### Copyright (C) 2017 by Jim Klimov

usage() {
    cat << EOF
You can call this script as
    $0 find_commits_contains [PATTERN]
    $0 find_commits_intro [PATTERN]
    $0 find_commits_drop [PATTERN]
    $0 fix_history [PATTERN] [REPLACEMENT]

Note that PATTERN and REPLACEMENT are fixed strings (not regexes) in this context
For the SED usage there is also a PATTERN_SED that you can optionally export
EOF
}

[ -n "${PATTERN-}" ] || PATTERN="needle"
### This script runs from the basedir of a git repo (or a few)
### Logs and other work files are commonly stored in the parent dir by default
[ -n "${LOGDIR-}" ] || LOGDIR="`pwd`/.."
[ -n "${LOGTAG-}" ] || LOGTAG="$(basename `pwd`)"

find_commits_contains() {
    ### This lists which "$COMMITID:$PATHNAME:$LINETEXT" contain the PATTERN in the LINETEXT
    ### Note that this lists all commits whose checked-out workspace would have the pattern
    [ -n "${WORKSPACE_MATCHES_FILE-}" ] || WORKSPACE_MATCHES_FILE="${LOGDIR}/gitdig_commits-workspace-contains__${LOGTAG}.txt"
    git grep "${PATTERN}" $(git rev-list --all --remotes) \
    | tee "${WORKSPACE_MATCHES_FILE}"
}

show_commits_color() {
    ### This finds which commits dealt with the pattern (whose diff adds or removes a line with it, or has it in context)
    ### Note that this starts with a colorful depiction of first-pass diffs, for esthetic viewing pleasure
    ### and is parsed including color markup by other tools below
    git rev-list --all --remotes | \
        while read CMT ; do ( \
            git show --color --pretty='format:%b' "$CMT" | \
            egrep "${PATTERN}" && echo "^^^ $CMT" \
        ) ; done
}

find_commits_intro() {
    ### This finds which commits INTRODUCE the pattern (whose diff adds a line with it)
    ### Note that this starts with a colorful depiction of first-pass diffs, for esthetic viewing pleasure
    [ -n "${COMMIT_INTRODUCES_FILE-}" ] || COMMIT_INTRODUCES_FILE="${LOGDIR}/gitdig_commits-intro__${LOGTAG}.txt"
    show_commits_color | tee "${COMMIT_INTRODUCES_FILE}".tmp

    ### ...and this picks out the lines for commits which actually add the PATTERN
    ### (because there are also not interesting context and removal lines as well,
    ### which should disappear after the rebase)
    cat "${COMMIT_INTRODUCES_FILE}".tmp | \
    egrep '^([\^]|.*32m\+)' | \
    ggrep -A1 '32m\+' | \
    grep '^\^' \
    | tee "${COMMIT_INTRODUCES_FILE}"
}

find_commits_drop() {
    ### This finds which commits REMOVE the pattern (whose diff drops a line with it)
    ### Note that this starts with a colorful depiction of first-pass diffs, for esthetic viewing pleasure
    [ -n "${COMMIT_DROPS_FILE-}" ] || COMMIT_DROPS_FILE="${LOGDIR}/gitdig_commits-drop__${LOGTAG}.txt"
    show_commits_color | tee "${COMMIT_DROPS_FILE}".tmp

    ### ...and this picks out the lines for commits which actually remove the PATTERN
    ### (because there are also not interesting context and addition lines as well,
    ### which should disappear after the rebase)
    cat "${COMMIT_DROPS_FILE}".tmp | \
    egrep '^([\^]|.*31m\-)' | \
    ggrep -A1 '31m\-' | \
    grep '^\^' \
    | tee "${COMMIT_DROPS_FILE}"
}

fix_history() {
    ### When the inspections are done with, we want to clean up the history
    ### Note that as part of this, we would drop "unreachable" commits that
    ### are not part of any branch's history (because these would still contain
    ### the original offending pattern).
    ### WARNING: This is the one destructive operation in this suite.
    ### You are advised to run it in a scratch full copy of your git repo.
    ### After it is successfully done, you are advised to destroy the published
    ### copy of the repo completely (e.g. on github) and force-push the cleaned
    ### one into a newly created instance of the repo, and have your team members
    ### destroy and re-fork their clones both on cloud platform and their local
    ### workspaces, so that the destroyed offending commits do not resurface.

    ### Note that you can stack more sed '-e ...' blocks below, e.g. to rewrite
    ### more patterns in one shot. Also note that the pattern and replacement
    ### representation for simple grep and regex in sed may vary... you may want
    ### to automate escaping of special chars.
    [ -n "${PATTERN_SED-}" ] || PATTERN_SED="${PATTERN}"
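    ### Note: the file name is passed to sed as \"\$F\" so that it gets expanded
    ### when the filter runs (in single quotes it would stay a literal '$F')
    git filter-branch --tree-filter "git grep '${PATTERN}' | sed 's,:.*\$,,' | sort | uniq | while read F ; do sed -e 's,${PATTERN_SED},${REPLACEMENT},g' -i \"\$F\"; done"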
    git reflog expire --expire-unreachable=now --all
    git gc --prune=now
}

[ -n "$2" ] && PATTERN="$2"
[ -n "$3" ] && REPLACEMENT="$3"

case "$1" in
    -h|--help) usage; exit 0 ;;
    find|grep) ACTION="find_commits_intro"; shift ;;
    show|diff) ACTION="show_commits_color"; shift ;;
    drop|remove|del|delete) ACTION="find_commits_drop"; shift ;;
    fix_history|find_commits_contains|find_commits_intro|find_commits_drop) ACTION="$1"; shift ;;
    *) usage; exit 1 ;;
esac

echo "Running routine: '$ACTION'"
"$ACTION"

I'll also post a copy of this to my github collection of git scripts, to track further possible evolution.
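As a cross-check for the find_commits_* routines, git also has a built-in "pickaxe" search (git log -S), which lists commits whose diff changes the number of occurrences of a given string. The sketch below is not what the script above does (that one parses colored diffs), and the function name is made up; it is just a shorter way to get a list of suspect commits to compare against.

```shell
#!/bin/sh
# Sketch using git's built-in pickaxe search instead of parsing colored diffs.
# The function name is made up; run it from inside a git work tree.
# Prints "<hash> <subject>" for every commit reachable from any ref whose diff
# changes the number of occurrences of the fixed string given as $1.
find_commits_pickaxe() {
    git log --all --format='%H %s' -S"$1"
}
```

Calling find_commits_pickaxe needle lists both the commits that introduce the string and those that drop it, without telling them apart; adding -p to the git log call shows the diffs themselves for manual inspection.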
A similar operation to extract just a history of certain file(s) into a new repository can be done through a series of mail-patches, e.g.:

oldrepo$ rm -rf /tmp/ggg ; mkdir -p /tmp/ggg

oldrepo$ FILES="some-script.sh"

### NOTE: Maybe factor $FILES (hash of?) into PRJ to allow different exports
### from same project to a new repo. Or just tag this manually :)

oldrepo$ PRJ="$(basename "`pwd`" | sed 's,[^A-Za-z\-_0-9\.],_,g')"

### The following results in a patch-file series ordered by UTC Unix epoch
### timestamp of the original commit, allowing to mix similar export series
### from different files or projects, and then importing as one monotonic
### history in a new repository. Note that git log tends to not end the last
### line well so we help it:

oldrepo$ ( git log --pretty='format:%H %ad' --date=unix $FILES ; echo "" ) | \
   ( A=0; while read HASH TIME ; do \
     [ -n "$HASH" ] && git format-patch -1 --no-numbered --stdout "$HASH" \
     > /tmp/ggg/"$TIME-$PRJ-`printf '%04d' "$A"`".patch ; A=$(($A+1)); \
     done )

### Can filter the patch files further, e.g. update commit message formatting...

newrepo$ git am /tmp/ggg/*.patch