2010 California Propositions Guide

Back in 2008, the last time we had a general election in California, I posted a simple Guide to 2008 California Ballot Propositions.  For the 2010 election (coming this Tuesday) I have moved this information to a new website, www.propgrok.org.  I also added some features to allow you to share your stances on the propositions via facebook or twitter.  If you find the site useful, please mention the site on facebook or twitter– I’d really appreciate it.

(Please note that propgrok.org is not affiliated with Oracle; views expressed therein are my own).

An irritating bug in bash

Here’s what kernel developers talk about at coffee: bugs in our command shells.  At least, that was the topic the other day when Stephen, Dave and I were complaining about the various troubles we’d had with different command shells.

While others moved to zsh some years ago, I have been a bash user since kicking the tcsh habit.  But for years I have been plagued by a subtle nuisance in bash: sometimes it doesn’t catch terminal window resizes properly.  The result is that command line editing works very poorly until bash finally figures this out. After a while, I worked out that this behavior happens only when the window size change happens while you’re in some application spawned by bash.  So if you’re in an editor like vim, change the window size, and then exit (or suspend) the editor, bash will be confused about the terminal size.

While this has always annoyed me, it never quite reached the threshold for me to do anything about it.  But recently it has been bugging me more and more.  After we returned from coffee, I dug into the bash manual and discovered a little-known shell option, checkwinsize.  In a nutshell, you can set it as follows:

    shopt -s checkwinsize

about which the bash manual says: "If set, Bash checks the window size after each command and, if necessary, updates the values of LINES and COLUMNS."  Sounds great!  As an aside, I think that as a modern shell, bash should set this option by default.  (Others think so too).

With much self-satisfaction I set this option and got ready for line editing bliss.  But, no joy.  I checked and rechecked, and finally started using truss, and then DTrace, to try to understand the problem.  After some digging I eventually discovered the following bug in the shell.  Here’s the meat of the writeup I submitted to the bash-bug list:

On Solaris/OpenSolaris platforms, I have discovered what I believe is a
bug in lib/sh/winsize.c.
I discovered with a debugger that the get_new_window_size() function
has no effect on Solaris.  In fact, here is what this file looks like if
you compile it:
$ dis winsize.o
disassembly for winsize.o
section .text
get_new_window_size:     c3                 ret
That's it-- an empty function.  The problem is that the appropriate header
file is not getting pulled in, in order to #define TIOCGWINSZ.
As a result, even with 'shopt -s checkwinsize' set on Solaris, bash
does not check the win size on suspend of a program, or on program
exit.  This is massively frustrating, and I know of several Solaris
users who have switched to zsh as a result of this bug.
I have not tried bash 4.0, but looking at the source code, it appears
that the bug is present there as well.
I added an ifdef clause which looks to see if the HAVE_TERMIOS_H define
is set, after the #include of config.h.  If it is, then I #include the
termios.h header file.  This solves the problem, which I confirmed by
rebuilding and dis'ing the function.  I also ran my recompiled bash
and confirmed that it now worked correctly.

Hopefully the bash maintainers will take note and fix this bug.  In the meantime, I’m going to see if we can get the fix for this applied to the Nevada (and hence, OpenSolaris) version of bash.

Update: The bash maintainers have fixed this bug in the following patch to bash 4.x.  Hurray!

Speeding to a Halt

On Sunday I committed changes into Nevada build 110 (and hence, into OpenSolaris 2009.next) to improve the speed at which OpenSolaris systems shut down.  On a generic x86 test system, it used to take about 41 seconds to shut the system down starting from the time you typed init 6, shutdown(1m), or pressed the shutdown button in GNOME.  This form of shutdown is known as gentle shutdown because SMF takes care to stop each software service in dependency order.  In contrast, when you use reboot, halt, or poweroff, you’re using what I’ll call violent shutdown.  In the latter, the boot archive is brought up to date and that’s about it.  It has traditionally been much, much faster than gentle shutdown.

This relatively long shutdown time has had an interesting effect: Solaris developers almost universally cheat, and use violent shutdown.  Typing reboot is terribly ingrained in my fingertips.  This is bad, because it means less test coverage for the shutdown method which actually matters to customers.  I recently began to be bothered by this because the GNOME gui uses init 6, and so shutting down via the nice shiny “Shut Down” button is also terribly slow.

On something of a whim, I dusted off our variant of the bootchart project which we ported in-house several years ago to get some initial information on what was happening during system shutdown.  Click here to see a graphical representation of a typical system shutting down (note: big image).  To read the image, note that time is on the X axis, from left to right.  Processes are represented as horizontal bars representing their duration.  At the rightmost side of the chart, the system has stopped.

In the image I’ve highlighted a couple of points of interest:

  • The pppd shutdown script seems to sleep 1, always, even if you aren’t using ppp; since pppd isn’t converted to SMF (bug 6310547), we will try to stop it on all systems on every shutdown.
  • The wbem service seems to sleep 1 while shutting down, and the webconsole service takes a while to shut down.  However, these services are present only on Nevada, and not on OpenSolaris, so I chose not to pursue trying to fix them.
  • The deferred patching script, installupdates, is really slow.  And needlessly so– it can run in a few milliseconds with a simple fix; I filed a bug.
  • There are some long calls to /usr/bin/sleep.  In the chart linked above, you can see svc-autofs, rpc-bind, and svc-syseventd each taking five seconds to stop.  Five seconds is a really long time!
  • There’s a call to something called killall near the end of shutdown.  Then, 5 seconds later, another.  Then, 10 seconds later, things proceed again.  I wondered what the killall was all about.  Did it really need 15 seconds to do its work?

After a bit of effort (ok, a lot of effort), I’ve cleaned these up, along with some other problems I spotted along the way.  It turns out that the five-second delays are from some poor quality shell code in /lib/svc/share/smf_include.sh:

smf_kill_contract() {
        # Kill contract.
        /usr/bin/pkill -$2 -c $1

        # If contract does not empty, keep killing the contract to catch
        # any child processes missed because they were forking
        /usr/bin/sleep 5
        /usr/bin/pgrep -c $1 > /dev/null 2>&1
        while [ $? -eq 0 ] ; do
                /usr/bin/pkill -$2 -c $1
                /usr/bin/sleep 5
                /usr/bin/pgrep -c $1 > /dev/null 2>&1
        done
}

Ugh. So this shell function rather bizarrely always waits five seconds, even if the contract empties out in 1/10th of a second!  I fixed this to have a smarter algorithm, and to keep checking at more frequent intervals (every 0.2 seconds).
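
Here is a sketch of the revised approach (my reconstruction, not the code I actually putback): poll the contract at 0.2-second intervals until it empties or a deadline passes, instead of sleeping an unconditional five seconds.  The contract_is_empty helper is a hypothetical stand-in for the real pgrep check, and fractional sleep is assumed to be available:

```shell
# Sketch of the smarter wait loop.  "contract_is_empty" is a hypothetical
# stand-in for the real check (pgrep -c "$ctid" failing once the contract
# has no members); substitute the real command on Solaris.
wait_for_contract_empty() {
    ctid=$1
    deadline=$2                      # max seconds to wait (was a flat 5)
    polls=0
    until contract_is_empty "$ctid"; do
        # Poll every 0.2 seconds rather than sleeping 5s unconditionally.
        sleep 0.2
        polls=$((polls + 1))
        [ "$polls" -ge $((deadline * 5)) ] && return 1
    done
    return 0
}
```

With this shape, a contract that empties in a tenth of a second costs one 0.2-second poll instead of a guaranteed five seconds.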

I discovered that the calls to killall(1m) were really ancient history, and probably did not need to occupy 15 seconds worth of system shutdown.  I have shortened the interval substantially.

Another problem we faced was that, in the last moments before shutdown, startd runs some commands using system(3c).  This can be a problem if one of those commands, for some reason, wedges up.  So, I’ve replaced the calls to system with calls which time out after a set number of seconds.  This is some nice insurance that the system continues to make progress as it shuts down.  Since I wound up with so much extra time available at shutdown, I’ve taken the chance to add a call to lockfs(1m) in an effort to get as much on-disk UFS consistency as possible.
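
The startd change itself is in C, but the idea can be sketched in shell: run the command in the background and arm a watchdog that kills it after a deadline.  This is a concept sketch only (run_with_timeout is my own illustrative helper, not part of svc.startd):

```shell
# Run "$@", but kill it if it runs longer than $1 seconds, so that one
# wedged command cannot stall shutdown indefinitely.
run_with_timeout() {
    limit=$1; shift
    "$@" &
    cmd=$!
    ( sleep "$limit"; kill "$cmd" 2>/dev/null ) &
    watchdog=$!
    wait "$cmd"
    rc=$?
    kill "$watchdog" 2>/dev/null     # command finished; disarm watchdog
    return $rc
}
```

If the command finishes in time you get its exit status back; if not, it is killed and you see the signal status instead.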

So, here is the intermediate picture.  I’ve slightly revised the shutdown messaging, as well, to include a datestamp and a measurement of how long the shutdown took:

svc.startd: The system is coming down.  Please wait.
svc.startd: 83 system services are now being stopped.
Mar  5 19:43:34 soe-x4100-3 syslogd: going down on signal 15
svc.startd: Killing user processes.
Mar  5 19:43:40 The system is down.  Shutdown took 17 seconds.

But wait, there’s more!  On OpenSolaris, we don’t have the time-consuming wbem or webconsole services.  So, we can disable those and try again.  And, we use ZFS, for which the time-consuming lockfs call at the end of shutdown is a no-op (on UFS, it takes at least two seconds).  This slimmed down stack results in an impressive shutdown time:

Mar  6 02:51:51 The system is down.  Shutdown took 7 seconds.

And here is what it looks like.  If you want to see the complete set of changes, the codereview is also available.  As you can see, revising the way we kill off processes at the end of the life of the system is the big unrealized win.  And doing so would likely shave about 3 more seconds off, for a gentle shutdown of 4-5 seconds.  I ran out of time to do that this time around.

Some caveats:

  • Your mileage may vary: you might run a different mix of services on your system, and perhaps one of those has a slow shutdown method which will gum up the works.  If you want to test how long a service takes to stop, try ptime svcadm disable -s <servicename>.
  • Your performance improvement is likely to be less dramatic on systems with less available parallelism.  Most of my test systems have two or four CPUs.

I should add a coda here: this work is greatly improved by recent bootadm performance work by Enrico.  While building the boot archive is still sometimes triggered on shutdown, it takes a lot less time than it did previously.

I had a good time working on this project; I hope you’ll enjoy using it.

California Proposition Guide

I grew up in Pennsylvania, a state without voter propositions.  Upon arriving in California some years ago, I found the "prop" system to be complicated and kind of a big pain.  Over the years I’ve taken to making myself a chart of what the various useful (to me) sources are saying about the propositions.  This time I am sharing it here as I hope it will be useful to others.  Any mistakes are probably the result of transcription error on my part.  I’ll be happy to correct them.  In compiling this, I picked my region’s two largest papers, plus the LA Times.  You’ll also find the recommendation of the California League of Women Voters and the "party lines" of the two major parties.  Whatever you do: go out and vote!

Useful Links:

Prop 1A: High Speed Rail (www.californiahighspeedtrains.com)
    CA LWV: Yes | SF Chron: Yes | SJ Merc: Yes | LA Times: Yes | CA Dem: Yes | CA GOP: No

Prop 2: Standards for Confining Farm Animals
    CA LWV: No Stance | SF Chron: No | SJ Merc: Yes | LA Times: No | CA Dem: Yes | CA GOP: No

Prop 3: Children’s Hospital Bond Act. Grant Program. (www.imaginewithus.org)
    CA LWV: Yes | SF Chron: Yes | SJ Merc: Yes | LA Times: Yes | CA Dem: Yes | CA GOP: No

Prop 4: Abortion Waiting Period/Parental Notification (www.yeson4.net)
    CA LWV: No | SF Chron: No | SJ Merc: No | LA Times: No | CA Dem: No | CA GOP: Yes

Prop 5: Nonviolent Drug Offenses. Sentencing, Parole and Rehabilitation. (www.prop5yes.com)
    CA LWV: Yes | SF Chron: No | SJ Merc: No | LA Times: No | CA Dem: Yes | CA GOP: No

Prop 6: Police and Law Enforcement Funding. Criminal Penalties and Laws. (“Runner Initiative”) (www.safeneighborhoodsact.com)
    CA LWV: No | SF Chron: No | SJ Merc: No | LA Times: No | CA Dem: No | CA GOP: Yes

Prop 7: Renewable Energy Generation (www.yeson7.net)
    CA LWV: No | SF Chron: No | SJ Merc: No | LA Times: No | CA Dem: No | CA GOP: No

Prop 8: Eliminates Right of Same-Sex Couples to Marry (www.protectmarriage.com)
    CA LWV: No | SF Chron: No | SJ Merc: No | LA Times: No | CA Dem: No | CA GOP: Yes

Prop 9: Criminal Justice System. Victims’ Rights. Parole. (www.friendsofmarsyslaw.org)
    CA LWV: No | SF Chron: No | SJ Merc: No | LA Times: No | CA Dem: No | CA GOP: Yes

Prop 10: Alternative Fuel Vehicles and Renewable Energy. Bonds. (www.prop10yes.com)
    CA LWV: No | SF Chron: No | SJ Merc: No | LA Times: No | CA Dem: Neutral | CA GOP: No

Prop 11: Redistricting Reform: California Voters FIRST Act (yesprop11.org)
    CA LWV: Yes | SF Chron: Yes | SJ Merc: Yes | LA Times: Yes | CA Dem: No | CA GOP: Neutral

Prop 12: Veterans’ Bond Act of 2008 (???)
    CA LWV: No Stance | SF Chron: Yes | SJ Merc: Yes | LA Times: Yes | CA Dem: Yes | CA GOP: Yes

BigAdmin Updates Zones Page

The zones team recently identified that our web presence has been lacking.  A quick survey surprised us: the relatively unloved BigAdmin: Solaris Containers page generates a lot of daily visits.  We realized that this is because it is the top search result if you google for "Solaris Zones".  So Penny and Robert (one of the BigAdmin maintainers) set out to make improvements.  The new page is now posted.  Take a look!  Also, if you have suggestions for further improvements, or pointers to materials we should add, please leave them in my blog’s comments section.

Blastwave on Solaris 8 Containers

I just got word of a great new Solaris 8 Containers success story on sun.com, from blastwave.org.  I think my favorite quote from Dennis Clarke (Founder of Blastwave) is:

"I virtualized critical Solaris 8 production servers and nobody
noticed.  I literally shut the server down, backed it up, created
a Solaris 8 Container, restored the environment, and brought the server
back up. The process was simple, transparent, and completely flawless."

Our team did a lot of challenging work to ensure this sort of customer experience.  It’s gratifying to see it pay off for an important contributor to the Solaris ecosystem.

So if you’ve been trying to persuade your PHB that you should give Solaris 8 Containers a try, here is a great reference!

Zones (Containers) Hosting Providers

I’ve been keeping track of a number of companies who provide virtual server hosting based on Solaris Zones.  Since we occasionally get asked about this, I thought I’d share my personal list.  This is not an endorsement of these businesses.  In the interest of full disclosure, I was once taken to lunch by David and Jason from Joyent.  Sorted alphabetically:

I’m pretty sure this list is incomplete.  I’ll try to keep it updated as I learn more.  Feel free to post (or send me) corrections and additions, and I’ll add them.

Finally, if you are a hosting provider, and want to consider adding zones to your suite of offerings, we’re more than willing to talk, anytime.

Update #1: Added Gangus and Layered Technologies.
Update #2: Removed Layered Technologies (we’re not sure); added Stability Hosting
Update #3 (6/30/2008): Added Beacon VPS (confirmed by email with Beacon)

ii_bitmap=1; ii_bitmap=1; ii_bitmap=1; ii_bitmap=1; …

Sometimes when you go hunting for bugs, it’s pretty mundane.  Other times, you strike gold… 

On Monday night, Jan, Liane and I stayed late at work to help with some maintenance on our building’s file and mail server, which we affectionately know as Jurassic.  The point of Jurassic is to run the latest Solaris Nevada builds in a production environment.  The system’s regular admin is on vacation, and Jurassic was experiencing some unusual problems, and so a group of kernel engineers volunteered to help out.  It’s our data, and our code, after all…

Jurassic had experienced two failed disks, and we really wanted wanted to replace those.  For some complicated reasons, we needed to reboot the system, which was fine with us anyway, because we wanted to see firsthand a problem which had been reported but not diagnosed: why was the svc:/system/filesystem/root:default service experiencing timeouts on boot? This service, it turns out, doesn’t do much (despite its rather grand name): it takes care of finding and mounting a performance-optimized copy of libc, then runs a devfsadm invocation which ensures that the kernel has the latest copies of various driver.conf files.  This made the timeout we saw all the more puzzling: why would a five minute timeout expire for this service? To make matters worse, SMF tries three times to start this service, and so the aggregate time was 3 × 5 = 15 minutes.

Once we waited, what we found was pretty surprising: the "hang" we were seeing was due to seemingly stuck devfsadm processes– three in fact:

# pgrep -lf devfsadm
100015 /usr/sbin/devfsadm -I -P
100050 /usr/sbin/devfsadm -I -P
100054 /usr/sbin/devfsadm -I -P

The next step I usually take in a case like this is to use pstack to see what the processes are doing.  However, in this case, that wasn’t working:

# pstack 100050
pstack: cannot examine 100050: unanticipated system error

Hmm. Next we tried mdb -k, to look at the kernel:

# mdb -k
mdb: failed to open /dev/ksyms: No such file or directory

This was one of those "oh shit…" moments.  This is not a failure mode I’ve seen before.  Jan and I speculated that because devfsadm was running, perhaps it was blocking device drivers from loading, or being looked up. Similarly:

# mpstat 1
mpstat: kstat_open failed: No such file or directory

This left us with one recourse: kmdb, the in situ kernel debugger.  We dropped in and looked up the first devfsadm process, walked its threads (it had only one) and printed the kernel stack trace for that thread:

# mdb -K
kmdb: target stopped at:
kmdb_enter+0xb: movq   %rax,%rdi
> 0t100015::pid2proc | ::walk thread | ::findstack -v
stack pointer for thread ffffff0934cbce20: ffffff003c57dc80
[ ffffff003c57dc80 _resume_from_idle+0xf1() ]
ffffff003c57dcc0 swtch+0x221()
ffffff003c57dd00 sema_p+0x26f(ffffff0935930648)
ffffff003c57dd60 hwc_parse+0x88(ffffff09357aeb80, ffffff003c57dd90,
ffffff003c57ddb0 impl_make_parlist+0xab(e4)
  ffffff003c57de00 i_ddi_load_drvconf+0x72(ffffffff)
ffffff003c57de30 modctl_load_drvconf+0x37(ffffffff)
ffffff003c57deb0 modctl+0x1ba(12, ffffffff, 3, 8079000, 8047e14, 0)
ffffff003c57df00 sys_syscall32+0x1fc()

The highlighted line above, i_ddi_load_drvconf(), was certainly worth checking out– you can tell just from the name that we’re in a function which loads a driver.conf file.  We looked at the source code, here: i_ddi_load_drvconf().  The argument to this function, which you can see from the stack trace, is ffffffff, or -1.  You can see from the code that this indicates "load driver.conf files for all drivers" to the routine.  This ultimately results in a call to impl_make_parlist(m), where ‘m’ is the major number of the device. So what’s the argument to impl_make_parlist()?  You can see it above: it’s ‘e4’ (in hexadecimal).  Back in kmdb:

[12]> e4::major2name
ii

This whole situation was odd– why would the kernel be (seemingly) stuck, parsing ii.conf?  Normally, driver.conf files are a few lines of text, and should parse in a fraction of a second.  We thought that perhaps the parser had a bug, and was in an infinite loop.  We figured that the driver’s .conf file might have gotten corrupted, perhaps with garbage data.  Then the payoff:

$ ls -l /usr/kernel/drv/ii.conf
-rw-r--r--   1 root    sys     3410823  May 12 17:39  /usr/kernel/drv/ii.conf
$ wc -l ii.conf
262237 ii.conf

Wait, what?  ii.conf is 262,237 lines long?  What’s in it?

# 2 indicates that if FWC is present strategy 1 is used, otherwise strategy 0.
ii_bitmap=1;
# 2 indicates that if FWC is present strategy 1 is used, otherwise strategy 0.
ii_bitmap=1;
ii_bitmap=1;
# 2 indicates that if FWC is present strategy 1 is used, otherwise strategy 0.
ii_bitmap=1;
ii_bitmap=1;
ii_bitmap=1;
ii_bitmap=1;

This pattern repeats, over and over, with the number of ii_bitmap=1; lines doubling, for 18 doublings!  We quickly concluded that some script or program had badly mangled this file.  We don’t use this driver on Jurassic, so we simply moved the .conf file aside.  After that, we were able to re-run the devfsadm command without problems.

Dan Mick later tracked down the offending script, the i.preserve packaging script for ii.conf, and filed a bug.  Excerpting from Dan’s analysis:

Investigating, it appears that SUNWiiu's i.preserve script, used as a class-action
script for the editable file ii.conf, will:
1) copy the entire file to the new version of the file each time it's run
   (grep -v -w with the pattern "ii_bitmap=" essentially copies the whole file,
   because lines consisting of "ii_bitmap=1;" are not matched in 'word' mode;
   the '=' is a word separator, not part of the word pattern itself)
2) add two blank comment lines and one substantive comment each time it's run,
   despite presence of that line in the file (it should be tested for before
   being added)
3) add in every line containing an instance of "ii_bitmap=" to the new file,
   which grows as a power of two (this also results from the grep -v -w
   problem above)
Because jurassic is consistently live-upgraded rather than fresh-installed,
the error in the class-action script has multiplied immensely over time.
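
The grep misbehavior is easy to demonstrate with a throwaway file (a small repro of my own, not the actual i.preserve script):

```shell
# "grep -v -w" is meant to drop old ii_bitmap lines, but -w requires the
# match to end at a word boundary, and in "ii_bitmap=1;" the '=' of the
# pattern is followed by the word character '1' -- so the line never
# matches, and grep -v copies it through untouched.
printf '# comment\nii_bitmap=1;\n' > /tmp/ii.conf.demo
grep -v -w 'ii_bitmap=' /tmp/ii.conf.demo
```

Both lines survive the filter, so each run of the script re-appends the ii_bitmap lines it failed to strip, and the file doubles with every live upgrade.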

This was all compounded by the fact that driver properties (like ii_bitmap=1;) are stored at the end of a linked list, which means that the entire list must be traversed prior to each insertion. This essentially turns the rebuild into an n(n+1)/2 pathology, where n is something like 2^(LU+1) - 1 (LU here is the number of times the system has been live-upgraded).  Plugging in the numbers we see:

(2^18 * 2^18) / 2 = 2^35 ≈ 34 billion

I wrote a quick simulation of this algorithm, and ran it on Jurassic.  It is just an approximation, but it’s amazing to watch this become an explosive problem, especially as the workload gets large enough to fall out of the cpu’s caches.

LU Generation   List Items        List Operations         Time (simulated)
13              2^13 = 8192       2^25 = 33554432         0.23s
14              2^14 = 16384      2^27 = 134217728        0.97s
15              2^15 = 32768      2^29 = 536870912        5.7s
16              2^16 = 65536      2^31 = 2147483648       18.8s
17              2^17 = 131072     2^33 = 8589934592       2m06s
18              2^18 = 262144     2^35 = 34359738368      8m33s
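
The arithmetic behind the table can be reproduced with a quick loop (a back-of-the-envelope sketch of my own, assuming the item count doubles each generation and each tail insertion walks the whole list):

```shell
# Items double each LU generation; appending item k to a tail-insert
# linked list traverses ~k nodes, so a full rebuild costs about n*n/2
# traversal steps.
for lu in 13 14 15 16 17 18; do
    items=$((1 << lu))
    ops=$((items * items / 2))
    echo "LU $lu: items=$items ops=$ops"
done
```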

Rest assured that we’ll get this bug squashed.  As you can see, you’re safe unless you’ve done 16 or more live upgrades of your Nevada system!

Tiny Robot, Nice Lens

Here’s a toy I picked up recently in New York at the MoMA Store:

It’s a little robot you build yourself from cleverly designed rolls of paper. He’s a Piperoid (this one is apparently named "Jet Jonathan") from Piperoid Japan.  In the US, Piperoids seem to be available only from the MoMA Store.

This was a fun toy to build: the build process was logical, didn’t take too long, and required just enough effort to feel satisfying.  No glue is needed.  The joints are movable and we’ve had fun with that.  He’s about 5 inches tall.

This shot also demonstrates another toy, the Nikon AF Nikkor 50mm f/1.8D lens that Danek lent me to try out on my Nikon D40.   This picture was shot in dim incandescent light, although I did add a white LED point source to throw shadows– due to color correction it appears somewhat blue in this picture.  I originally tried to shoot at f/1.8 (wide open, with the shallowest depth of field, and allowing the most light) but the depth of field is really narrow and I couldn’t get enough of him in focus.  Focus was also tough because this lens is manual focus on the D40.  In the end I had to stop down to f/2.0 and settle for a slightly longer exposure time.  I needed to use my Gorillapod wrapped over the back of a chair because the exposure was 1/5 sec.  To post-process I simply did an auto-normalize and auto-white-balance in The Gimp.  For kicks, I tried the same shot on the kit lens (The AF-S DX 18-55mm f/3.5-5.6G) that comes with the D40: to get a similar shot I needed about a 1.6 second exposure, and the results weren’t nearly as nice.  For just over $100 USD, the 50mm lens is a good deal, but I sure wish Nikon would bring out some auto-focus prime lenses fully suited to the D40.  Thanks Danek!