Vacation…

I’m taking a few days off following the OpenSolaris 2008.05 release to attend a wedding in New York.  Here are a couple of the better shots from Manhattan, where I spent the weekend:

The Roasting Plant, an innovative coffee shop/roastery; everything is automated, and all of the machinery was designed by the owners.  Beans swirl through the store in pneumatic tubes:

Four seconds at Grand Central:

MOMA (from the Eliasson retrospective):

The view from Central Park:

I haven’t yet had time to do even a basic editing pass over these.  The MOMA shot needs a bit of work, I think.  I like the Grand Central shot quite a bit, but I really wish I’d had image stabilization: at 100%, the fine details are blurry.  I had a Gorillapod, but the shot was tough to get, as people’s footsteps on the stairway I was shooting from caused a fair amount of jiggle.

A field guide to Zones in OpenSolaris 2008.05

I have had a busy couple of months. After wrapping up work on
Solaris 8 Containers (my teammate Steve ran the Solaris 9 Containers
effort), I turned my attention to helping the Image Packaging team
(rogue’s gallery) with their efforts to get OpenSolaris 2008.05 out
the door.

Among other things, I have been working hard to provide a basic level
of zones functionality for OpenSolaris 2008.05. I wish I could have
gotten more done, but today I want to cover what does and does not
work. I want to be clear that Zones support in OpenSolaris 2008.05 and
beyond will evolve substantially. To start, here’s an example of
configuring a zone on 2008.05:

# zonecfg -z donutshop
donutshop: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:donutshop> create
zonecfg:donutshop> set zonepath=/zones/donutshop
zonecfg:donutshop> add net
zonecfg:donutshop:net> set physical=e1000g0
zonecfg:donutshop:net> set address=129.146.228.5/23
zonecfg:donutshop:net> end
zonecfg:donutshop> add capped-cpu
zonecfg:donutshop:capped-cpu> set ncpus=1.5
zonecfg:donutshop:capped-cpu> end
zonecfg:donutshop> commit
zonecfg:donutshop> exit
# zoneadm list -vc
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              native   shared
   - donutshop        configured /zones/donutshop               ipkg     shared

If you’re familiar with deploying zones, much of this will look
familiar. But you can also see that donutshop isn’t using the native
brand, as you would normally expect; here we’re using the ipkg brand.
The reason is that commands like zoneadm and zonecfg have some special
behaviors for native zones which presume an OS based on System V
packaging. In the future, we’ll make native less magical, and the
zones you install will be branded native, as you would expect; Jerry
is actually working on that right now. Note also that I used the
relatively new CPU Caps resource management feature to put some
resource limits on the zone– it’s easy to do!
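
If you want to double-check the cap once the zone is up, it surfaces as
the zone.cpu-cap resource control, which prctl(1) can query from the
global zone. A quick check (a sketch; ncpus=1.5 should appear as a cap
of 150, i.e. 150% of one CPU):

# prctl -n zone.cpu-cap -i zone donutshop

Now let’s install the zone: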

# zoneadm -z donutshop install
A ZFS file system has been created for this zone.
Image: Preparing at /zones/donutshop/root ... done.
Catalog: Retrieving from http://pkg.opensolaris.org:80/ ... done.
Installing: (output follows)
DOWNLOAD                                    PKGS       FILES     XFER (MB)
Completed                                  49/49   7634/7634 206.85/206.85
PHASE                                        ACTIONS
Install Phase                            12602/12602
Note: Man pages can be obtained by installing SUNWman
Postinstall: Copying SMF seed repository ... done.
Postinstall: Working around http://defect.opensolaris.org/bz/show_bug.cgi?id=681
Postinstall: Working around http://defect.opensolaris.org/bz/show_bug.cgi?id=741
Done: Installation completed in 208.535 seconds.
Next Steps: Boot the zone, then log into the zone console
(zlogin -C) to complete the configuration process
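
To follow those steps with our example zone, boot it and then attach
to its console (both from the global zone):

# zoneadm -z donutshop boot
# zlogin -C donutshop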

There are a couple of things to notice, both in the configuration
and in the install:

Non-global zones are not sparse, for now
Zones are said to be sparse if /usr, /lib,
/platform, /sbin and optionally /opt are
looped back, read-only, from the global zone. This allows substantial
disk space savings in the traditional zones model (in which the zones
have the same software installed as the global zone).

Whether we will ultimately implement sparse zones is an open question. I plan to bring this question to the Zones community, and to some key customers, in the near future.

Zones are installed from a network repository
Unlike with traditional zones, which are sourced by copying bits from the global
zone, here we simply spool the contents from the network repository.
The upside is that this was easy to implement; the downside is that
you must be connected to the network to deploy a zone. Getting the bits
from the global zone is still desirable, but we don’t have that implemented
yet.

By default, zones are installed using the system’s
preferred authority (use pkg authority to see what
that is set to). The preferred authority is then propagated into the
zone. If you want to override that, you can specify a different
repository using the new -a argument to zoneadm install:

# zoneadm -z donutshop install -a ipkg=http://ipkg.eng:80
Non-global zones are small
Traditionally, zones are installed with all of the same software
that the global zone contains. In the case of "whole root" zones
(the opposite of sparse), this means that non-global zones are about
the same size as global zones– easily a gigabyte or more.

Since we’re not supporting sparse zones, I decided to pare down
the install as much as I could, within reason: the default zone
installation is just 206MB, and has a decent set of basic tools,
but you’ll have to add anything else you need. And we can do even
more: some package refactoring should yield another 30-40MB
of savings, as packages like Tcl and Tk should not be needed
by default. For example, Tk (5MB) gets dragged in as a dependency
of Python (the packaging system is written in Python); Tcl (another
5MB) is dragged in by Tk; Tk then pulls in parts of X11.
Smallness yields speed: when connected to a fast package repository
server, I can install a zone in just 24 seconds!
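
If you’re curious, you can sanity-check the footprint from the global
zone; the path follows the zonepath we set earlier, and your numbers
may differ a bit from the 206MB quoted above:

# du -sh /zones/donutshop/root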

I’m really curious to know what reaction people will have to such
minimalist environments. What do you think?

Once you start thinking about such small environments, some new concerns surface: vim (which in 2008.05 we’re using as our vi implementation)
is 17MB, or almost 9% of the disk space used by the zone!

Non-global zones are independent of the global zone
Because ipkg zones are branded, they exist independently
of the global zone. This means that if you do an image-update
of the global zone, you’ll also need to update each of your zones,
and ensure that they are kept in sync. For now this is a manual
process– in the future we’ll make it less so.
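
Until we automate that, the manual step after an image-update of the
global zone is, roughly, to run the same update inside each zone; a
sketch for our example zone (which must be booted):

# zlogin donutshop pkg image-update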
ZFS support notes
OpenSolaris 2008.05 makes extensive use of ZFS, and requires ZFS
as the root filesystem. Additional filesystems are created for
/export, /export/home and /opt. Non-global zones don’t yet follow this convention.
Additionally, I have sometimes seen our automatic ZFS file system
creation fail to work (you can see it working properly in the example above). We haven’t
yet tracked down that problem– my suspicion is that there is a bad interaction
with the 2008.05 filesystem layout’s use of ZFS legacy mounts.

As a result of this (and for other reasons too, probably), zones don’t
participate in the boot-environment subsystem. This means that you
won’t get an automatic snapshot when you image-update your
zone or install packages. That means no automatic rollback for zones.
Again, this is something we will endeavor to fix.

Beware of bug 6684810
You may see a message like the following when you boot your zone:

zoneadm: zone 'donutshop': Unable to set route for interface lo0 to éÞùÞ$
zoneadm: zone 'donutshop':

This is a known bug (6684810); fortunately the message is harmless.

In the next month, I hope to take a vacation, launch a discussion with
our community about sparse root zones, and make a solid plan for
the overall support of zones on OpenSolaris. I’ve got a lot to do,
but that’s easily balanced by the fact that I’ve been having a blast
working on this project…

Songbird for Solaris

Looks like Alfred’s hard work has paid off.  You can pull down a package of Songbird for OpenSolaris (see Alfred’s blog entry for the links).  Songbird is a next-gen media player built atop the Mozilla platform.   Although I’ve had it crash once, on the whole it has worked quite well.  SteveL’s mashtape extension is really neat, and you can see it in action in the screenshot below (it’s the thing offering pictures, youtube videos, etc. at the bottom of the window).

Next steps would be to get this into the OpenSolaris package repository– I hope that someday soon you will be able to pkg install songbird.

Nice work guys!

Solaris 8 Containers, Solaris 9 Containers

In the flurry of today’s launch event, we’ve launched Solaris 8 Containers (which was previously called Solaris 8 Migration Assistant, or Project Etude).  Here is the datasheet about the product.  Even better: We’ve also announced that Solaris 9 Containers will be available soon!  Jerry and Steve on the containers team have been toiling away like mad to make this possible.

Why the rename?  Well, for one thing, it’s easier to say 🙂  It also signals a shift in the way Sun will offer this technology to customers:

  • Professional Services Engagement: No longer required, now recommended.  It’s also simpler to order a SunPS engagement for this product.
  • Partners: (Some of) Sun’s partners are now ready to deliver this solution to customers.  Talk to your partner for more information.
  • Right to Use: Previously, we provided a 90-day evaluation RTU.  Now, the RTU is unlimited.  However, you must still pay for support.

I invite you to download Solaris 8 Containers and give it a try!  And as always, talk to your local SE or sales rep if you’re interested in obtaining support licenses for (or any kind of help with) your Solaris 8 (or 9) containers.

Here’s Joost, our fearless marketing leader, with an informative talk about the why and how of Solaris 8 Containers. 

Finding bugs in python code, using DTrace

I have spent a lot of time in recent months helping the IPS project.  This week I integrated usability improvements, and tracked down two different performance problems.  The first was notable because I think it’s the kind of thing that would have been hard to find without DTrace.   Here’s a writeup I did, cribbed from the bug report:

While doing work on IPS I discovered a large number of system calls being made
during the "creating plan" phase of a fresh installation of the
‘redistributable’ cluster of packages.

I used the following crude dtrace script to track down the problem:

#!/usr/sbin/dtrace -Fs

/*
 * $1 is the pid of the python process to watch; $$2 is the name of
 * the python function of interest, passed as a string.
 */
python$1:::function-entry
/copyinstr(arg1) == $$2/
{
        self->t = 1;    /* we are now inside the target function */
        calls++;        /* count total calls to it */
}

python$1:::function-return
/copyinstr(arg1) == $$2 && self->t/
{
        self->t = 0;
}

/* while inside the target function, tally every syscall by name */
syscall:::entry
/self->t/
{
        @a[probefunc] = count();
}

END
{
        printf("calls: %d", calls);
}

No points for elegance but it works.

Successive guesses allowed me to narrow this down to the get_link_actions()
routine. Here’s an example of its system call impact during the following
sequence:

# rm -fr /tmp/foo
# export PKG_IMAGE=/tmp/foo
# pkg image-create -a foo=http://ipkg.eng /tmp/foo
# pkg install -n redistributable

I ran the D script as follows:

$ dtrace -s /s4pyfunc.d `pgrep client` get_link_actions

And once the command was done, got this output:

CPU     ID                    FUNCTION:NAME
  0      2                       :END calls: 880

  brk                                                              26
  stat64                                                       774400
  close                                                        775280
  fcntl                                                        775280
  fsat                                                         775280
  fstat64                                                      775280
  getdents64                                                  1553200

So here we can see that something is amiss– this routine is the source
of some 5.4 million syscalls.
Here’s the code:

        def get_link_actions(self):
                """ return a dictionary of hardlink action lists indexed by
                target """
                if self.link_actions:                           # (a)
                        return self.link_actions

                d = {}
                for act in self.gen_installed_actions():
                        if act.name == "hardlink":
                                t = act.get_target_path()
                                if t in d:
                                        d[t].append(act)
                                else:
                                        d[t] = [act]

                self.link_actions = d
                return d

When Bart and I first looked at this, it seemed by inspection to be working as
designed– the first time the routine is called, we populate a cache, and
subsequently the cache is returned. At some point Bart went "Aha!" and we
realized that the comparison at (a) wasn’t working right: for a
not-yet-installed image, the link_actions dictionary is exactly equal to {},
the empty dictionary, which evaluates to false in Python. So we always think
that we haven’t yet filled out the link_actions cache, and go to fill it.
That sends us into gen_installed_actions(), which is expensive; it seems to
get more and more expensive as we accumulate manifests, as well.

So, while there are exactly zero link actions we need to worry about, we rescan
all manifests for every new manifest we evaluate.

This is a situation which will crop up most often with zones, since that’s the
place we’re most likely to create new images from scratch.

The fix is very simple: set self.link_actions to None initially (in the object
constructor), and change the comparison at (a) to test against None explicitly.
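
In Python terms, the change looks something like the following (a
sketch of the idea, not the exact IPS change):

        def __init__(self):
                # None means "cache not yet populated"; {} is a valid,
                # populated-but-empty cache, so the two must be
                # distinguishable.
                self.link_actions = None

        def get_link_actions(self):
                """ return a dictionary of hardlink action lists indexed by
                target """
                if self.link_actions is not None:       # (a) now explicit
                        return self.link_actions

                d = {}
                for act in self.gen_installed_actions():
                        if act.name == "hardlink":
                                t = act.get_target_path()
                                if t in d:
                                        d[t].append(act)
                                else:
                                        d[t] = [act]

                self.link_actions = d
                return d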

Having fixed this, the impact of get_link_actions is:

CPU     ID                    FUNCTION:NAME
  0      2                       :END calls: 880

  stat64                                                            3
  close                                                             4
  fcntl                                                             4
  fsat                                                              4
  fstat64                                                           4
  getdents64                                                        8

Or, basically, nothing. In some informal testing I got these numbers
with ptime(1):
Before:

real     1:26.730
user     1:10.988
sys        11.624

After:

real     1:12.932
user     1:02.475
sys         6.361

Or about a 19% improvement in real time, and nearly half the sys time. Not a huge win, but not too shabby either.

I used the same technique to find a similar bug in another part of the code, but alas, Danek had already found the same bug, so I can’t claim it.

OpenSolaris Elections Disappointment

The results of the OpenSolaris Elections are in.  Congratulations to the new board members!

So why did I entitle this post Elections Disappointment?  My complaint is with the "community priorities" question which was asked; this was a chance for core contributors to indicate what they felt were pressing priorities.  I was not happy with the OGB’s choice (which I didn’t know about until I voted) to reuse the same set of priorities questions from last year, one of which was:

  • Deploy a public code review facility on opensolaris.org

That item was subsequently identified by voters in this election as the #3 priority.  Did the OGB believe that we did not accomplish this goal in the past year?  I took the results of the last poll (it was voted #3 in that poll, too) to heart: I worked hard to make that goal a reality, publicized it using the official channels, and have been enhancing it and taking care of it since then, mostly on my own time.  I felt that my contribution was really undermined by the question.

Second: Who voted for this item as a priority?  If you voted for this as a priority, I would like to hear why you did (anonymously is fine).  Are you unaware that cr.opensolaris.org exists?  Are you unsatisfied with the service it provides?

I hope the new OGB will find a way to reformulate a more cogent poll question about priorities.

cr.opensolaris.org gets an ATOM feed

For the past couple of weeks, I have been working late at night and on weekends to add an ATOM feed (i.e. blog feed) to cr.opensolaris.org, so that as people post new code reviews, they are automatically discovered and published.  Stephen has been heckling me to do this work for more than a year.  This weekend I managed to finish it, despite the incredibly nice weather in the bay area: I was stuck inside with a nasty cold.

As an aside, I’m looking for help with cr.opensolaris.org.  This is a great opportunity for someone to step up and help out with an important part of our community infrastructure.  Send me mail.

You can check out the results of my hacking on cr.opensolaris.org.  Or you can subscribe to the feed.  If you want to opt out of having a review published, create a file called "opt-out" in the same directory as your webrev’s index.html file; to opt out of all of your reviews, create an "opt-out" file in your home directory.
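
For example (the webrev path here is hypothetical; use wherever your
review actually lives):

$ touch ~/my-webrev/opt-out     # hide just this review
$ touch ~/opt-out               # hide all of your reviews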

Implementation Notes 

This was an interesting learning experience for me, since I had to
learn a lot about ATOM in the process.  I also picked up the XSLT
language along the way, and learned how to process HTML using Python.
All in all, I’d say this project took about 20 hours of effort, and
resulted in about 500 lines of Python code.  The most difficult
problems to solve were:

  • I wanted the feed to include some meaningful information about the codereview.  If you subscribe to the feed using your favorite reader, you’ll see that a portion of the "index.html" file from each webrev is included.  This is done using a somewhat tricky piece of Python code.  In retrospect, using XSL for this might have been a better choice, although I’ve found that people have a tendency to introduce non-standard HTML artifacts into their webrev index.html files, and I don’t know how well XSL would cope with that.
  • ATOM has some rules about generating unique and lasting IDs for things– this is the contents of the <id> tag in the ATOM specification.  I found a lot of valuable information on dive-into-mark.  For cr.opensolaris.org, this was complicated by the fact that the user might log in and move their codereview around, or might copy one review over another.  In the end, I solved this by remembering the <id> tag in a dot-file which rides along with the codereview.  A cronjob roves around the filesystem looking for new reviews, and adds the special tag-file.  By storing the original <id> tag value, and looking at the modtime of the index.html file, I can correctly compute both the value of the <id> and <updated> fields for each entry.  If a user deletes a codereview, the dot-file will go away with it.
  • Once I had an ATOM feed I needed to transform it back into HTML for display on the home page.  The only problem was that there aren’t a lot of good examples of this on the web– many of the ATOM-to-HTML conversions only work with ATOM 0.3, not the 1.0 specification, and I didn’t know the first thing about XPATH or XSL.  In the end, I only needed 25 lines or so of XSLT code.

Future Work 

I think of the current implementation as a "1.0"– it’ll probably last us pretty well for a while.  One thing I’d like to research for a future revision is actually placing the entries into a lightweight blog engine, and letting it do the rest of the work: Using an excellent list from Social Desire I took a quick look at Blosxom, Flatpress, Nanoblogger, and some others.

The joy of ‘zpool scrub’

Some days, when it’s cold and you’re not feeling very motivated (like me, today), it’s nice to run a zpool scrub on the machines you manage.
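
Kicking one off is a single (privileged) command per pool; the pool
name below matches the status output that follows:

# zpool scrub aux

Then, once it’s done: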

$ zpool status
  pool: aux
 state: ONLINE
 scrub: scrub completed with 0 errors on Tue Jan 29 15:52:38 2008
config:

        NAME          STATE     READ WRITE CKSUM
        aux           ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t0d0s7  ONLINE       0     0     0
            c1t1d0s7  ONLINE       0     0     0

errors: No known data errors

And then relax, knowing that your data is safe.

A big mess…

Recently I’ve been thinking about, learning about, and contributing to IPS, the Image Packaging System, a next-generation package management project happening in the OpenSolaris community.  IPS also happens to be the packaging system that project Indiana has elected to use.  Stephen has written extensively about his thoughts on his blog.  And Bart has too.  IPS subsumes a lot of existing functionality which appears in various parts of the system today.  But a lot of people seem willing to look at it only as a package manager, in the sense that "pkgadd" is a package manager.

My problem with this is that "pkgadd" is only a small part of a larger problem.  So, to explain that, I want to distill a series of email posts I made to pkg-discuss last month into a coherent blog entry, since I’ve referred back to them a couple of times.

My feelings on this topic are pretty nicely summarized by an article by Peter J. Denning which recently appeared in Communications of the ACM, entitled "Mastering the Mess."  The whole article is instructive, but see in particular: "Signs of Mess."

If one accepts Dr. Denning’s "mess" framework, then the next question is whether we are in what he dubs "a mess."  I personally think the answer is "yes."  In no particular order (apologies for anything I left out), as a community, we have:

  • SVR4 package creation tools
  • SVR4 package deployment tools
  • Sun’s patch creation tools
  • Sun’s patch application and inventory tools (patchadd, showrev -p)
  • PCA (Patch Check Advanced, a nice open source tool I use)
  • Solaris Patch Manager (smpatch)
  • pfinstall
  • Live Upgrade
  • flash archive creation and deployment tools
  • graphical install (old and dwarf caiman, etc)
  • ttinstall
  • Jumpstart
  • virt-install (from xVM)
  • zones install
  • zones attach/detach logic (which knows how to parse various patch and packaging databases)
  • So-called "toxic" patching for zones
  • Zones support for live upgrade (zulu)
  • BFU/ACR (updates part of the system, but violates package metadata)
  • IDR (patches the system, but renders the system subsequently unpatchable until the IDR is removed and a "real patch" is applied)
  • Solaris Product Registry (I’ve never really understood what this was for, but you can try it via prodreg(1))
  • Service Tags — a layer which adds "software RFID tags" in a sense: UUIDs backed by a Sun-maintained ontology; helps to inventory what is on your system.
  • pkg-get
  • Network Repositories (like Blastwave)
  • DVD media & CD media construction tools (several of these, I think)
  • Various other unbundled products which promise to ease "patching pain"
  • Various system minimization tools
  • Layered inventory management tools
  • Numerous hand-rolled and home-grown solutions built on some or all of the above.                                            

Some parts of the mess represent great technologies (from the perspective of those caught up in the mess) which people have spent a lot of time and effort building.  But a lot of the above represent accreted layers with duplicated functionality.  In some cases, the various layers interact in complex, subtle, and perhaps interface-violating ways.  To people outside of the mess (i.e. new users we would like to entice), the mess looks bizarre and terrible.  Another sign of a "big mess": in several cases, huge engineering efforts have resulted in only modest improvements.  In some cases, huge engineering efforts have been total failures: Sun attempted a rewrite of the SVR4 packaging system in the early part of this decade, and the project basically failed.

It’s easy to look at the above list and feel a sense of hopelessness– how will we *ever* improve upon this situation?  Will people keep creating new and different tools which add more layers?

I’ll cite a second source which has helped guide my thinking on this topic: Jeff Bonwick.  Jeff spent years relentlessly seeking out and blowing up duplicated and broken kernel functionality, and then took on the storage stack.  The result was ZFS, which was recently labelled "a rampant layering violation" by a detractor.  Jeff responded this way.  In particular, Jeff said:

"We found that by refactoring the problem a bit — that is, changing           
where the boundaries are between layers — we could make the whole thing
much simpler."

That, to me, summarizes what the *opportunity* is here: to rethink the
layers, and to merge and unmerge them to come to a more complete,
efficient, modern solution.

IPS is heading in this direction: Packaging, patching, upgrade, live upgrade, the mechanisms for Software Delivery, the toolset for delivering packages/patches, and the software-enforced policy decisions seem to be condensing here into a coherent entity– which means we’ll have many fewer layers.  And because the system will be fast, lightweight, redistributable and shared, we should also be able to discard artifacts such as BFU and ACR (in other words, OpenSolaris developers will use the same tools our customers use to update systems).  The huge amount of code which handles zones patch and packaging should be greatly reduced.  Package dependencies will be far more accurate and minimization will be easier, and diskless client support should be far more robust.

What I see with Caiman, IPS and Distro Constructor is the opportunity to do for software delivery and update to OpenSolaris systems what ZFS did for storage management.  I do not think we have all the answers just yet, but I think we can get there.