Sometimes when you go hunting for bugs, it’s pretty mundane. Other times, you strike gold…
On Monday night, Jan, Liane and I stayed late at work to help with some maintenance on our building’s file and mail server, which we affectionately know as Jurassic. The point of Jurassic is to run the latest Solaris Nevada builds in a production environment. The system’s regular admin is on vacation, and Jurassic was experiencing some unusual problems, and so a group of kernel engineers volunteered to help out. It’s our data, and our code, after all…
Jurassic had experienced two failed disks, and we really wanted to replace those. For some complicated reasons, we needed to reboot the system, which was fine with us anyway, because we wanted to see firsthand a problem which had been reported but not diagnosed: why was the svc:/system/filesystem/root:default service experiencing timeouts on boot? This service, it turns out, doesn’t do much (despite its rather grand name): it takes care of finding and mounting a performance-optimized copy of libc, then runs a devfsadm invocation which ensures that the kernel has the latest copies of various driver.conf files. This made the timeout we saw all the more puzzling: why would a five-minute timeout expire for this service? To make matters worse, SMF tries three times to start this service, so the aggregate time was 3 × 5 = 15 minutes.
Once we had waited it out, what we found was pretty surprising: the "hang" we were seeing was due to seemingly stuck devfsadm processes, three in fact:
```
# pgrep -lf devfsadm
100015 /usr/sbin/devfsadm -I -P
100050 /usr/sbin/devfsadm -I -P
100054 /usr/sbin/devfsadm -I -P
```
The next step I usually take in a case like this is to use pstack to see what the processes are doing. However, in this case, that wasn’t working:
```
# pstack 100050
pstack: cannot examine 100050: unanticipated system error
```
Hmm. Next we tried mdb -k, to look at the kernel:
```
# mdb -k
mdb: failed to open /dev/ksyms: No such file or directory
```
This was one of those "oh shit…" moments. This is not a failure mode I’ve seen before. Jan and I speculated that because devfsadm was running, perhaps it was blocking device drivers from loading or being looked up. Similarly:
```
# mpstat 1
mpstat: kstat_open failed: No such file or directory
```
This left us with one recourse: kmdb, the in situ kernel debugger. We dropped in and looked up the first devfsadm process, walked its threads (it had only one) and printed the kernel stack trace for that thread:
```
# mdb -K
kmdb: target stopped at:
kmdb_enter+0xb:  movq   %rax,%rdi
> 0t100015::pid2proc | ::walk thread | ::findstack -v
stack pointer for thread ffffff0934cbce20: ffffff003c57dc80
[ ffffff003c57dc80 _resume_from_idle+0xf1() ]
  ffffff003c57dcc0 swtch+0x221()
  ffffff003c57dd00 sema_p+0x26f(ffffff0935930648)
  ffffff003c57dd60 hwc_parse+0x88(ffffff09357aeb80, ffffff003c57dd90, ffffff003c57dd98)
  ffffff003c57ddb0 impl_make_parlist+0xab(e4)
  ffffff003c57de00 i_ddi_load_drvconf+0x72(ffffffff)
  ffffff003c57de30 modctl_load_drvconf+0x37(ffffffff)
  ffffff003c57deb0 modctl+0x1ba(12, ffffffff, 3, 8079000, 8047e14, 0)
  ffffff003c57df00 sys_syscall32+0x1fc()
```
The i_ddi_load_drvconf() frame in that stack was certainly worth checking out: you can tell just from the name that we’re in a function which loads a driver.conf file. We looked at the source code for i_ddi_load_drvconf(). The argument to this function, which you can see in the stack trace, is ffffffff, or -1. You can see from the code that this value tells the routine to "load driver.conf files for all drivers." This ultimately results in a call to impl_make_parlist(m), where m is the major number of the device. So what’s the argument to impl_make_parlist()? You can see it above: it’s e4 (in hexadecimal). Back in kmdb:
```
> e4::major2name
ii
```
This whole situation was odd: why would the kernel be (seemingly) stuck parsing ii.conf? Normally, driver.conf files are a few lines of text, and should parse in a fraction of a second. We thought that perhaps the parser had a bug and was stuck in an infinite loop. We figured that the driver’s .conf file might have gotten corrupted, perhaps with garbage data. Then the payoff:
```
$ ls -l /usr/kernel/drv/ii.conf
-rw-r--r--   1 root     sys      3410823 May 12 17:39 /usr/kernel/drv/ii.conf
$ wc -l ii.conf
  262237 ii.conf
```
Wait, what? ii.conf is 262,237 lines long? What’s in it?
```
...
# 2 indicates that if FWC is present strategy 1 is used, otherwise strategy 0.
ii_bitmap=1;
#
#
# 2 indicates that if FWC is present strategy 1 is used, otherwise strategy 0.
ii_bitmap=1;
ii_bitmap=1;
#
#
# 2 indicates that if FWC is present strategy 1 is used, otherwise strategy 0.
ii_bitmap=1;
ii_bitmap=1;
ii_bitmap=1;
ii_bitmap=1;
#
...
```
This pattern repeats, over and over, with the number of `ii_bitmap=1;` lines doubling, for 18 doublings! We quickly concluded that some script or program had badly mangled this file. We don’t use this driver on Jurassic, so we simply moved the .conf file aside. After that, we were able to re-run the devfsadm command without problems.
Dan Mick later tracked down the offending script, the i.preserve packaging script for ii.conf, and filed a bug. Excerpting from Dan’s analysis:
Investigating, it appears that SUNWiiu's i.preserve script, used as a class-action script for the editable file ii.conf, will:

1) copy the entire file to the new version of the file each time it's run (grep -v -w with the pattern "ii_bitmap=" essentially copies the whole file, because lines consisting of "ii_bitmap=1;" are not matched in 'word' mode; the '=' is a word separator, not part of the word pattern itself)

2) add two blank comment lines and one substantive comment each time it's run, despite presence of that line in the file (it should be tested before being added)

3) add in every line containing an instance of "ii_bitmap=" to the new file (which grows as a power of two; this also results from the grep -v -w above)

Because jurassic is consistently live-upgraded rather than fresh-installed, the error in the class-action script has multiplied immensely over time.
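To see how Dan's observations compound, here is a rough Python model of the broken copy step. This is a sketch based on the analysis above, not the actual packaging script; upgrade_once() and the variable names are mine.

```python
# Model of the broken i.preserve behavior (a sketch, not the real script):
# each run keeps every existing line, re-appends the comment block, and
# then re-adds every "ii_bitmap=" line, doubling the settings.

COMMENT = "# 2 indicates that if FWC is present strategy 1 is used, otherwise strategy 0."

def upgrade_once(lines):
    new = list(lines)                     # (1) grep -v -w strips nothing
    new += ["#", "#", COMMENT]            # (2) comment block added again
    new += [l for l in lines if "ii_bitmap=" in l]   # (3) settings duplicated
    return new

conf = ["ii_bitmap=1;"]
for gen in range(1, 6):
    conf = upgrade_once(conf)
    count = sum(1 for l in conf if l == "ii_bitmap=1;")
    print(f"after upgrade {gen}: {count} ii_bitmap lines")  # 2, 4, 8, 16, 32
```

Run the model out to 18 generations and the `ii_bitmap=1;` count reaches 2^18 = 262144, right in the neighborhood of the 262,237-line file we found.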
This was all compounded by the fact that driver properties (like `ii_bitmap=1;`) are stored at the end of a linked list, which means that the entire list must be traversed prior to each insertion. This essentially turns the whole thing into an n(n+1)/2 pathology, where n is something like 2^(LU+1) − 1 (LU here is the number of times the system has been live-upgraded). Plugging in the numbers, we see:
(2^18 × 2^18) / 2 = 2^35 ≈ 34 billion
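A quick sanity check of that arithmetic (plain Python, nothing Solaris-specific):

```python
# Back-of-the-envelope check of the n(n+1)/2 tail-insertion cost
# for n = 2^18 properties, roughly one per ii_bitmap line.
n = 2 ** 18
ops = n * (n + 1) // 2
print(f"{n} properties -> {ops:,} list operations")  # about 34 billion
```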
I wrote a quick simulation of this algorithm and ran it on Jurassic. It is just an approximation, but it’s amazing to watch this become an explosive problem, especially as the workload gets large enough to fall out of the CPU’s caches.
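The original simulator isn't reproduced here, but a minimal sketch of the idea (tail insertion into a singly linked list that walks from the head every time, with my own names throughout) might look like:

```python
# Sketch of a simulation like the one described above (my own code, not
# the original): each property is appended at the tail of a singly
# linked list, and every insert walks the list from the head, so
# building an n-item list costs roughly n(n-1)/2 link traversals.

class Node:
    __slots__ = ("next",)
    def __init__(self):
        self.next = None

def build_parlist(n_props):
    """Append n_props nodes at the tail, counting link traversals."""
    head = None
    traversals = 0
    for _ in range(n_props):
        node = Node()
        if head is None:
            head = node
            continue
        cur = head
        traversals += 1
        while cur.next is not None:
            cur = cur.next
            traversals += 1
        cur.next = node
    return traversals

print(build_parlist(8192))  # 33550336, about 2^25: the generation-13 row
```

Counting this way gives exactly n(n−1)/2 traversals, the same quadratic ballpark as the n(n+1)/2 figure above.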
| LU Generation | List Items | List Operations | Time (simulated) |
|---------------|------------|-----------------|------------------|
| 13 | 2^13 = 8192 | 2^25 = 33554432 | 0.23s |
| 14 | 2^14 = 16384 | 2^27 = 134217728 | 0.97s |
| 15 | 2^15 = 32768 | 2^29 = 536870912 | 5.7s |
| 16 | 2^16 = 65536 | 2^31 = 2147483648 | 18.8s |
| 17 | 2^17 = 131072 | 2^33 = 8589934592 | 2m06s |
| 18 | 2^18 = 262144 | 2^35 = 34359738368 | 8m33s |
Rest assured that we’ll get this bug squashed. As you can see, you’re safe unless you’ve done 16 or more live upgrades of your Nevada system!