OpenSolaris has arrived. I’m really
happy to be able to show off the OS. If you couldn’t tell from our blogs over
the past year, we’ve been itching– aching– to let people join in our fun. In
a way, a lot of us have seen our ‘blog entries as a walk on the shores of a very
deep pool of knowledge, most of which is in the source code itself. But code
alone can be terribly obscure. Cruise on over to
tbl(1) if you doubt. To summarize, knowledge in code can be
useless from a maintenance, diagnosability and reusability perspective. But
well documented code can be enlightening and useful. I’ll try to show you what
I mean, by giving you a tour of some of the comments I’ve written in the
source base, and a taste of the kinds of bugs which have cropped up thus far.
I thought I would get started by talking about a subsystem I
developed for Solaris 10. By now you’ve heard of Solaris Zones– and if not,
Solaris Zones: Operating System Support for Consolidating Commercial
Workloads is, I think, a good introduction (but I’m a coauthor, so I
admit bias). One aspect of Zones I care a lot about is the Zones console.
I’m particularly proud of this subsystem because I designed and implemented it
entirely from scratch myself, and the design blends a range of techniques
I’ve picked up over the years: dynamic device instances, a modular design,
reuse of existing facilities and a familiar interaction model for users.
Since we believe in “big theory” comments explaining the overall design
of code, usr/src/cmd/zoneadmd/zcons.c
speaks for itself:
    /*
     * Console support for zones requires a significant infrastructure.  The
     * core pieces are contained in this file, but other portions of note
     * are in the zlogin(1M) command, the zcons(7D) driver, and in the
     * devfsadm(1M) misc_link generator.
     *
     * Care is taken to make the console behave in an "intuitive" fashion for
     * administrators.  Essentially, we try as much as possible to mimic the
     * experience of using a system via a tip line and system controller.
     *
     * The zone console architecture looks like this:
     *
     *                                      Global Zone | Non-Global Zone
     *                        .--------------.          |
     *        .-----------.   | zoneadmd -z  |          | .--------. .---------.
     *        | zlogin -C |   |     myzone   |          | | ttymon | | syslogd |
     *        `-----------'   `--------------'          | `--------' `---------'
     *                  |       |       | |             |      |       |
     *  User            |       |       | |             |      V       V
     * - - - - - - - - -|- - - -|- - - -|-|- - - - - - -|- - /dev/zconsole - - -
     *  Kernel          V       V       | |                        |
     *               [AF_UNIX Socket]   | `--------. .-------------'
     *                                  |          | |
     *                                  |          V V
     *                                  |     +-----------+
     *                                  |     |  ldterm,  |
     *                                  |     |   etc.    |
     *                                  |     +-----------+
     *                                  |     +-[Anchor]--+
     *                                  |     |   ptem    |
     *                                  V     +-----------+
     *                           +---master---+---slave---+
     *                           |                        |
     *                           |      zcons driver      |
     *                           |    zonename="myzone"   |
     *                           +------------------------+
     *
     * There are basically three major tasks which the console subsystem in
     * zoneadmd accomplishes:
     *
     * - Setup and teardown of zcons driver instances.  One zcons instance
     *   is maintained per zone; we take advantage of the libdevice APIs
     *   to online new instances of zcons as needed.  Care is taken to
     *   prune and manage these appropriately; see init_console_dev() and
     *   destroy_console_dev().  The end result is the creation of the
     *   zcons(7D) instance and an open file descriptor to the master side.
     *   zcons instances are associated with zones via their zonename device
     *   property.  This allows the console instance to persist across
     *   reboots, and while the zone is halted.
     *
     * - Initialization of the slave side of the console.  This is
     *   accomplished by pushing various STREAMS modules onto the console.
     *   The ptem(7M) module gets special treatment, and is anchored into
     *   place using the I_ANCHOR facility.  This is so that the zcons driver
     *   always has terminal semantics, as would a real hardware terminal.
     *   This means that ttymon(1M) works unmodified; at boot time, ttymon
     *   will do its own plumbing of the console stream, and will even
     *   I_POP modules off.  Hence the anchor, which assures that ptem will
     *   never be I_POP'd.
     *
     * - Acting as a server for 'zlogin -C' instances.  When zlogin -C is
     *   run, zlogin connects to zoneadmd via unix domain socket.  zoneadmd
     *   functions as a two-way proxy for console I/O, relaying user input
     *   to the master side of the console, and relaying output from the
     *   zone to the user.
     */
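The third task, the two-way proxy, is easy to picture in code. What follows is only a rough sketch of the idea in portable C, not the code from zoneadmd: the names relay_once() and proxy_console() are made up for illustration, and two ordinary file descriptors stand in for the zlogin socket and the master side of the console.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one chunk from 'from' and write all of it
 * to 'to'.  Returns the number of bytes relayed, 0 on EOF, -1 on error.
 */
static ssize_t
relay_once(int from, int to)
{
    char buf[1024];
    ssize_t n, off = 0;

    do {
        n = read(from, buf, sizeof (buf));
    } while (n < 0 && errno == EINTR);
    if (n <= 0)
        return (n);

    /* write() may be partial, so loop until the whole chunk is out. */
    while (off < n) {
        ssize_t w = write(to, buf + off, (size_t)(n - off));
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return (-1);
        }
        off += w;
    }
    return (n);
}

/*
 * Hypothetical proxy loop: client_fd stands in for the 'zlogin -C'
 * connection and console_fd for the master side of the console.
 * Returns 0 when either side reaches EOF, -1 on error.
 */
int
proxy_console(int client_fd, int console_fd)
{
    struct pollfd pfd[2];
    int i;

    assert(client_fd >= 0 && console_fd >= 0);
    pfd[0].fd = client_fd;
    pfd[1].fd = console_fd;
    pfd[0].events = pfd[1].events = POLLIN;

    for (;;) {
        if (poll(pfd, 2, -1) < 0) {
            if (errno == EINTR)
                continue;
            return (-1);
        }
        for (i = 0; i < 2; i++) {
            ssize_t n;

            if (pfd[i].revents & POLLNVAL)
                return (-1);
            if (!(pfd[i].revents & (POLLIN | POLLHUP | POLLERR)))
                continue;
            /* Input on one side is output for the other. */
            n = relay_once(pfd[i].fd, pfd[1 - i].fd);
            if (n <= 0)
                return (n == 0 ? 0 : -1);
        }
    }
}
```

A single poll() loop keeps the relay in one thread, so user input and zone output cannot be reordered relative to themselves. The real daemon has more to worry about (accepting connections, handshakes, a console with nobody attached), none of which is modeled here.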
One of the (in my opinion) elegant attributes of this design is that it defers
as much as possible to userland. The zcons(7D) driver
is only 707 lines of code (of which 155, or 22%, are comments).
And happily, this code has mostly “just worked” since I integrated it, and
has needed little maintenance. One bug in the zcons driver (4983336)
was the result of changes I made following code review feedback (always a
danger), causing messages in the stream to occasionally arrive out of order:
messages could “pass” each other.
The only other kernel changes I made for the console were
to pseudonex itself– I needed to bring it into compliance with the
interfaces (part of the interface family we use for
hotplug). That was so that we could dynamically instantiate new console
nodes when we need them. When we online new zcons nodes, we make them
children of a new zconsnex node, like this:
    $ prtconf -P
    ...
        pseudo, instance #0
            zconsnex, instance #1
                zcons, instance #0
                zcons, instance #1
                zcons, instance #2
                zcons, instance #3

    $ prtconf -v /devices/pseudo/zconsnex@1/zcons@0
    zcons, instance #0
        Hardware properties:
            name='ddi-no-autodetach' type=int items=1
                value=00000001
            name='auto-assign-instance' type=int items=1
                value=00000001
            name='zonename' type=string items=1
                value='xanadu'
        Device Minor Nodes:
            dev=(227,1)
                dev_path=/pseudo/zconsnex@1/zcons@0:zoneconsole
                    spectype=chr type=minor
                    dev_link=/dev/zcons/xanadu/zoneconsole
            dev=(227,0)
                dev_path=/pseudo/zconsnex@1/zcons@0:masterconsole
                    spectype=chr type=minor
                    dev_link=/dev/zcons/xanadu/masterconsole
In the above, you can see that you can easily relate a zone console to
the zone using it, via the ‘zonename’ property on the device node.
To my annoyance, when I started to work on the pseudonex,
there was no header comment at all about how it worked! This was a real
nuisance, as the pseudonex has some subtle behavior in its device instance
number assignment. I left behind an improved header comment, but it could
probably still use more work:
    /*
     * Pseudo devices are devices implemented entirely in software; pseudonex
     * (pseudo) is the traditional nexus for pseudodevices.  Instances are
     * typically specified via driver.conf files; e.g. a leaf device which
     * should be attached below pseudonex will have an entry like:
     *
     *      name="foo" parent="/pseudo" instance=0;
     *
     * pseudonex also supports the devctl (see ) interface via
     * its :devctl minor node.  This allows privileged userland applications to
     * online/offline children of pseudo as needed.
     *
     * In general, we discourage widespread use of this tactic, as it may lead to a
     * proliferation of nodes in /pseudo.  It is preferred that implementors update
     * pseudo.conf, adding another 'pseudo' nexus child of /pseudo, and then use
     * that for their collection of device nodes.  To do so, add a driver alias
     * for the name of the nexus child and a line in pseudo.conf such as:
     *
     *      name="foo" parent="/pseudo" instance= valid-children="bar","baz";
     *
     * Setting 'valid-children' is important because we have an annoying problem;
     * we need to prevent pseudo devices with 'parent="pseudo"' set from binding
     * to our new pseudonex child node.  A better way might be to teach the
     * spec-node code to understand that parent="pseudo" really means
     * parent="/pseudo".
     *
     * At some point in the future, it would be desirable to extend the instance
     * database to include nexus children of pseudo.  Then we could use devctl
     * or devfs to online nexus children of pseudo, auto-selecting an instance #,
     * and the instance number selected would be preserved across reboot in
     * path_to_inst.
     */
This much should have given you sufficient context to understand the
code at the top half of usr/src/cmd/zoneadmd/zcons.c,
which takes care of managing the zcons pseudo children. I was particularly
happy with this code. There’s something cool about specifying a new zones
console device, and zapping it into existence all on the fly. This leads to
the more subtle bug I faced: 4981626, which was reported just before one of the
S10 beta releases, and so rapidly put me in the hot seat to root cause and fix
it. The symptom was vexing: infrequently, multiprocessor systems would
see one of their several zones fail to start up at boot time. Worse yet,
the problem could only be seen on non-DEBUG systems, making the problem
potentially even harder to track down. We had only some messages on the
console to work from:
    Jan 21 14:33:01 xanadu devfsadmd: driver failed to attach: zcons
    failed to create devlinks: No such device or address
    console setup: device initialization failed
    zoneadm: zone 'xanadu-z2': could not start zoneadmd
    zoneadm: zone 'xanadu-z2': call to zoneadmd failed
After some head scratching and basic investigation with DTrace (which, sadly,
I’ve lost), I arrived at a hypothesis: we had a race condition in which the
zcons device node (such as /devices/pseudo/zconsnex@1/zcons@7)
was getting automatically torn down before the system had a chance to make the
device sufficiently “busy” that the system would leave it alone. The
sequence of events I had in mind looked promising:
    zoneadmd:   devctl() to create zone console node
    rc3:        modunload -i 0, (which tears down the zcons node)
    zoneadmd:   ask devfsadmd to make links for the zcons driver
    devfsadmd:  no such device as 'zcons' attached, fail.
    zoneadmd:   call to devfsadmd fail!  fail to start up the zone.
You can see that this interaction is pretty complex. A little more digging
revealed that this hypothesis was wrong, and that things were much worse.
The first step was to isolate the problem, and get boot out of the way. I wrote
a simple program called zcons_test. In one window, I ran zcons_test in
a loop. It is a simple C program which does little more than:
    if ((hdl = di_devlink_init("zcons", DI_MAKE_LINK)) == NULL)
            perror("di_devlink_init");
(I’ll leave as an exercise to the reader to track down what di_devlink_init
actually does). In another window, I ran ‘modunload -i 146’ in a loop. (n.b.
using whatever module number corresponded to zcons on that machine).
This was run in a rigged-up environment in which nothing was holding the
zcons driver busy. What I saw on occasion (every minute or so) was:
    # while :; do ./zcons_test; done
    di_devlink_init: No such device or address
    di_devlink_init: No such device or address
At this point, it was easy to use DTrace to narrow down where the ENXIO
was coming from. I found the following snippet in di_ioctl, which is where we
actually do the device online:
    modunload_disable();
    (void) i_ddi_load_drvconf(i);
    (void) ndi_devi_config_driver(ddi_root_node(), ndi_flags, i);
    kmem_free(drv_name, MAXNAMELEN);
    ddi_rele_driver(i);
    rv = i_ddi_devs_attached(i);
    modunload_enable();
    return ((rv == DDI_SUCCESS)? 0 : ENXIO);
Progress: It is this ENXIO which is causing the “No such device or address”
message. But what the hell? Module unloading is disabled during this sequence
of events, right? So how could the modunload loop be affecting this? I used
DTrace to track down the call chain which was triggering the unload.
It looked like this:
    modctl(2)->
      modctl_modunload()->
        mod_uninstall_all()
          mod_uninstall()
            ...
The problem here would seem to be that the call to modunload_disable()
doesn’t in fact disable the modunload! It does manage to block some
automatic, periodic module unloads (the mod_uninstall_daemon).
Even more insidious is that the code for modctl_modunload()
makes it appear that the “at-bootup” incarnation of this bug can *only*
appear on non-DEBUG systems! In the end, I worked around the problem
by adding the ddi-no-autodetach property to the zcons device nodes.
I also filed 4988141 modunload(1M) can race with di_ioctl(DINFOLODRV),
which will hopefully soon be fixed.
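The shape of this bug is easier to see in miniature. Here is a toy model in plain C; to be clear, none of this is kernel code, and every toy_* name is invented for illustration. It assumes, as the story above suggests, that only the periodic unload path checks the disable count, and that a node flagged no-autodetach is skipped by the explicit path as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a device node; not a real kernel structure. */
struct toy_node {
    bool attached;       /* is the instance currently attached? */
    bool busy;           /* is anything holding the device open? */
    bool no_autodetach;  /* models the ddi-no-autodetach property */
};

/* Models the count bumped by modunload_disable(). */
static int unload_disabled;

static void
toy_unload_disable(void)
{
    unload_disabled++;
}

static void
toy_unload_enable(void)
{
    assert(unload_disabled > 0);
    unload_disabled--;
}

/* Detach an idle node unless it has opted out of automatic detach. */
static void
toy_try_detach(struct toy_node *n)
{
    if (!n->busy && !n->no_autodetach)
        n->attached = false;
}

/* Periodic daemon path: honors the disable count. */
static void
toy_daemon_unload(struct toy_node *n)
{
    if (unload_disabled > 0)
        return;
    toy_try_detach(n);
}

/* Explicit modunload path: never looks at the disable count. */
static void
toy_explicit_unload(struct toy_node *n)
{
    toy_try_detach(n);
}

/*
 * Models the di_ioctl() snippet: attach the node inside the "protected"
 * window, let a racing unload request ('racer', may be NULL) land in the
 * middle, then report whether the node is still attached.
 */
static bool
toy_online(struct toy_node *n, void (*racer)(struct toy_node *))
{
    bool ok;

    toy_unload_disable();
    n->attached = true;
    if (racer != NULL)
        racer(n);
    ok = n->attached;    /* false here corresponds to ENXIO */
    toy_unload_enable();
    return (ok);
}
```

In this model, toy_online() survives a racing toy_daemon_unload() but fails against toy_explicit_unload(), and succeeds again once no_autodetach is set on the node. That mirrors the workaround I used, without claiming anything about how the kernel implements it.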
So that is, as they say, the nickel tour!