Zones (Containers) Hosting Providers

I’ve been keeping track of a number of companies who provide virtual server hosting based on Solaris Zones.  Since we occasionally get asked about this, I thought I’d share my personal list.  This is not an endorsement of these businesses.  In the interest of full disclosure, I was once taken to lunch by David and Jason from Joyent.  Sorted alphabetically:

I’m pretty sure this list is incomplete.  I’ll try to keep it updated as I learn more.  Feel free to post (or send me) corrections and additions, and I’ll add them.

Finally, if you are a hosting provider, and want to consider adding zones to your suite of offerings, we’re more than willing to talk, anytime.

Update #1: Added Gangus and Layered Technologies.
Update #2: Removed Layered Technologies (we’re not sure); added Stability Hosting
Update #3 (6/30/2008): Added Beacon VPS (confirmed by email with Beacon)

Advertisements

ii_bitmap=1; ii_bitmap=1; ii_bitmap=1; ii_bitmap=1; …

Sometimes when you go hunting for bugs, it’s pretty mundane.  Other times, you strike gold… 

On Monday night, Jan, Liane and I stayed late at work to help with some maintenance on our building’s file and mail server, which we affectionately know as Jurassic.  The point of Jurassic is to run the latest Solaris Nevada builds in a production environment.  The system’s regular admin is on vacation, and Jurassic was experiencing some unusual problems, and so a group of kernel engineers volunteered to help out.  It’s our data, and our code, after all…

Jurassic had experienced two failed disks, and we really wanted wanted to replace those.  For some complicated reasons, we needed to reboot the system, which was fine with us anyway, because we wanted to see firsthand a problem which had been reported but not diagnosed: why was the svc:/system/filesystem/root:default service experiencing timeouts on boot? This service, it turns out, doesn’t do much (despite its rather grand name): it takes care of finding and mounting a performance-optimized copy of libc, then runs a devfsadm invocation which ensures that the kernel has the latest copies of various driver.conf files.  This made the timeout we saw all the more puzzling: why would a five minute timeout expire for this service? To make matters worse, SMF tries three times to start this service, and so the aggregate time was 3 × 5 = 15 minutes.

Once we waited, what we found was pretty surprising: the "hang" we were seeing was due to seemingly stuck devfsadm processes– three in fact:

# pgrep -lf devfsadm
100015 /usr/sbin/devfsadm -I -P
100050 /usr/sbin/devfsadm -I -P
100054 /usr/sbin/devfsadm -I -P

The next step I usually take in a case like this is to use pstack to see what the processes are doing.  However, in this case, that wasn’t working:

# pstack 100050
pstack: cannot examine 100050: unanticipated system error

Hmm. Next we tried mdb -k, to look at the kernel:

# mdb -k
mdb: failed to open /dev/ksyms: No such file or directory

This was one of those "oh shit…" moment.  This is not a failure mode I’ve seen before.  Jan and I speculated that because devfsadm was running, perhaps it was blocking device drivers from loading, or being looked up. Similarly:

# mpstat 1
mpstat: kstat_open failed: No such file or directory

This left us with one recourse: kmdb, the in situ kernel debugger.  We dropped in and looked up the first devfsadm process, walked its threads (it had only one) and printed the kernel stack trace for that thread:

# mdb -K
kmdb: target stopped at:
kmdb_enter+0xb: movq   %rax,%rdi
> 0t100015::pid2proc | ::walk thread | ::findstack -v
stack pointer for thread ffffff0934cbce20: ffffff003c57dc80
[ ffffff003c57dc80 _resume_from_idle+0xf1() ]
ffffff003c57dcc0 swtch+0x221()
ffffff003c57dd00 sema_p+0x26f(ffffff0935930648)
ffffff003c57dd60 hwc_parse+0x88(ffffff09357aeb80, ffffff003c57dd90,
ffffff003c57dd98)
ffffff003c57ddb0 impl_make_parlist+0xab(e4)
  ffffff003c57de00 i_ddi_load_drvconf+0x72(ffffffff)
ffffff003c57de30 modctl_load_drvconf+0x37(ffffffff)
ffffff003c57deb0 modctl+0x1ba(12, ffffffff, 3, 8079000, 8047e14, 0)
ffffff003c57df00 sys_syscall32+0x1fc()

The line highlighted above was certainly worth checking out– you can tell just from the name that we’re in a function which loads a driver.conf file.  We looked at the source code, here: i_ddi_load_drvconf().  The argument to this function, which you can see from the stack trace, is ffffffff, or -1.  You can see from the code that this indicates "load driver.conf files for all drivers" to the routine.  This ultimately results in a call to impl_make_parlist(m), where ‘m’ is the major number of the device. So what’s the argument to impl_make_parlist()?  You can see it above, it’s ‘e4’ (in hexadecimal).  Back in kmdb:

[12]> e4::major2name
ii

This whole situation was odd– why would the kernel be (seemingly) stuck, parsing ii.conf?  Normally, driver.conf files are a few lines of text, and should parse in a fraction of a second. ⁞ We thought that perhaps the parser had a bug, and was in an infinite loop.  We figured that the driver’s .conf file might have gotten corrupted, perhaps with garbage data.  Then the payoff:

$ ls -l /usr/kernel/drv/ii.conf
-rw-r--r--   1 root    sys     3410823  May 12 17:39  /usr/kernel/drv/ii.conf
$ wc -l ii.conf
262237 ii.conf

Wait, what?  ii.conf is 262,237 lines long?  What’s in it?

...
# 2 indicates that if FWC is present strategy 1 is used, otherwise strategy 0.
ii_bitmap=1;
#
#
# 2 indicates that if FWC is present strategy 1 is used, otherwise strategy 0.
ii_bitmap=1;
ii_bitmap=1;
#
#
# 2 indicates that if FWC is present strategy 1 is used, otherwise strategy 0.
ii_bitmap=1;
ii_bitmap=1;
ii_bitmap=1;
ii_bitmap=1;
#
...

This pattern repeats, over and over, with the number of ii_bitmap=1; lines doubling, for 18 doublings!  We quickly concluded that some script or program had badly mangled this file.  We don’t use this driver on Jurassic, so we simply moved the .conf file aside.  After that, we were able to re-run the devfsadm command without problems.

Dan Mick later tracked down the offending script, the i.preserve packaging script for ii.conf, and filed a bug.  Excerpting from Dan’s analysis:

 Investigating, it appears that SUNWiiu's i.preserve script, used as a class-action
script for the editable file ii.conf, will:
1) copy the entire file to the new version of the file each time it's run (grep -v -w
with the pattern "ii_bitmap=" essentially copies the whole file, because lines
consisting of "ii_bitmap=1;" are not matched in 'word' mode; the '=' is a word
separator, not part of the word pattern itself)
2) add two blank comment lines and one substantive comment each time it's run,
despite presence of that line in the file (it should be tested before being
added)
3) add in every line containing an instance of "ii_bitmap=" to the new file
(which grows as a power of two).  (this also results from the grep -v -w
above.)
Because jurassic is consistently live-upgraded rather than fresh-installed,
the error in the class-action script has multiplied immensely over time.

This was all compounded by the fact that driver properties (like ii_bitmap=1;) are stored at the end of a linked list, which means that the entire list must be traversed prior to insertion. This essentially turns this into a (n+1)\*n/2 pathology, where n is something like: (2LU+1)-1 (LU here is the number of times the system has been live-upgraded).  Plugging in the numbers we see:

(218 \* 218) / 2 = 2\^35 = 34 Billion

I wrote a quick simulation of this algorithm, and ran it on Jurassic.  It is just an approximation, but it’s amazing to watch this become an explosive problem, especially as the workload gets large enough to fall out of the cpu’s caches.

LU Generation  List Items List Operations Time (simulated) 
13 215 = 8192 225 = 33554432 .23s
14 216 = 16384 227 = 134217728 .97s
15 217 = 32768 229 = 536870912 5.7s
16 218 = 65536 231 = 2147483648 18.8s
17 219 = 131072 233 = 8589934592 2m06s
18 220 = 262144 235 = 34359738368 8m33s

Rest assured that we’ll get this bug squashed.  As you can see, you’re safe unless you’ve done 16 or more live upgrades of your Nevada system!

Tiny Robot, Nice Lens

Here’s a toy I picked up recently in New York at the Moma Store:

It’s a little robot you build yourself from cleverly designed rolls of paper. He’s called a Piperoid (he’s apparently called "Jet Jonathan") and is from Piperoid Japan.  In the US, Piperoids seem to be available only from the MOMA Store.

This was a fun toy to build: the build process was logical, didn’t take too long, and required just enough effort to feel satisfying.  No glue is needed.  The joints are movable and we’ve had fun with that.  He’s about 5 inches tall.

This shot also demonstrates another toy, the Nikon AF Nikkor 50mm f/1.8D lens that Danek lent me to try out on my Nikon D40.   This picture was shot in dim incandescent light, although I did add a white LED point source to throw shadows– due to color correction it appears somewhat blue in this picture.  I originally tried to shoot at f/1.8 (wide open, with the shallowest depth of field, and allowing the most light) but the depth of field is really narrow and I couldn’t get enough of him in focus.  Focus was also tough because this lens is manual focus on the D40.  In the end I had to stop down to f/2.0 and settle for a slightly longer exposure time.  I needed to use my gorillapod wrapped over the back of a chair because the exposure was 1/5 sec.  To post-process I simply did an auto-normalize and and auto-white-balance in The Gimp.  For kicks, I tried the same shot on the kit lens (The AF-S DX 18-55mm f/3.5-5.6G) that comes with the D40: to get a similar shot I needed about a 1.6 second exposure, and the results weren’t nearly as nice.  For just over $100 USD, the 50mm lens is a good deal, but I sure wish Nikon would bring out some auto-focus prime lenses fully suited to the D40.  Thanks Danek!

Vacation…

I’m taking a few days off following the OpenSolaris 2008.05 release to attend a wedding in New York.  Here are a couple of better shots from Manhattan, where I spent the weekend:

The Roasting Plant, an innovative coffee shop/roastery; everything is automated and all of the machinery was designed by the owners.  Beans swirl through the store in pneumatic tubes (click for larger versions):

Four seconds at Grand Central:

MOMA (from the Eliasson retrospective):

The view from Central Park:

I haven’t yet had time to even do a basic editing pass over these.  The MOMA shot needs a bit of effort, I think.  I like the Grand Central shot quite a bit, but I really wish I’d had image stabilization.  At 100% the fine details are blurry.  I had a gorillapod, but the shot was tough to get, as people’s footsteps on the stairway I was shooting from caused a fair amount of jiggle.

A field guide to Zones in OpenSolaris 2008.05

I have had a busy couple of months. After wrapping up work on
Solaris 8 Containers (my teammate Steve ran the Solaris 9 Containers effort),
I turned my attention to helping the Image Packaging
team (rogue’s
gallery
) with their efforts to get OpenSolaris 2008.05 out the door.

Among
other things
, I have been working hard to provide a basic level of
zones functionality for OpenSolaris 2008.05. I wish I could have gotten
more done, but today I want to cover what does and does not work. I
want to be clear that Zones support in OpenSolaris 2008.05 and beyond will evolve
substantially
. To start, here’s an example of configuring a zone on
2008.05:

# zonecfg -z donutshop
donutshop: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:donutshop> create
zonecfg:donutshop> set zonepath=/zones/donutshop
zonecfg:donutshop> add net
zonecfg:donutshop:net> set physical=e1000g0
zonecfg:donutshop:net> set address=129.146.228.5/23
zonecfg:donutshop:net> end
zonecfg:donutshop> add capped-cpu
zonecfg:donutshop:capped-cpu> set ncpus=1.5
zonecfg:donutshop:capped-cpu> end
zonecfg:donutshop> commit
zonecfg:donutshop> exit
# zoneadm list -vc
ID NAME             STATUS     PATH                           BRAND    IP
0 global           running    /                              native   shared
- donutshop        configured /zones/donutshop               ipkg     shared

If you’re familiar with deploying zones, you can see that there is a lot which is familiar here.  But you can also see that donutshop isn’t, as you would normally expect, using
the native brand. Here we’re using the ipkg brand. The
reason is that commands like zoneadm and zonecfg have
some special behaviors for native zones which presume that you’re using
a SystemV Packaging based OS. In the future, we’ll make native less
magical, and the zones you install will be branded native as you would expect. Jerry is actually
working on that right now. Note also that I used the relatively new
CPU
Caps
resource management feature to put some resource limits on the
zone– it’s easy to do!. Now let’s install the zone:

# zoneadm -z donutshop install
A ZFS file system has been created for this zone.
Image: Preparing at /zones/donutshop/root ... done.
Catalog: Retrieving from http://pkg.opensolaris.org:80/ ... done.
Installing: (output follows)
DOWNLOAD                                    PKGS       FILES     XFER (MB)
Completed                                  49/49   7634/7634 206.85/206.85
PHASE                                        ACTIONS
Install Phase                            12602/12602
Note: Man pages can be obtained by installing SUNWman
Postinstall: Copying SMF seed repository ... done.
Postinstall: Working around http://defect.opensolaris.org/bz/show_bug.cgi?id=681
Postinstall: Working around http://defect.opensolaris.org/bz/show_bug.cgi?id=741
Done: Installation completed in 208.535 seconds.
Next Steps: Boot the zone, then log into the zone console
(zlogin -C) to complete the configuration process

There are a couple of things to notice, both in the configuration
and in the install:

Non-global zones are not sparse, for now
Zones are said to be sparse if /usr, /lib,
/platform, /sbin and optionally /opt are
looped back, read-only, from the global zone. This allows a substantial
disk space savings in the traditional zones model (which is that the zones
have the same software installed as the global zone).

Whether we will ultimately choose to implement
sparse zones, or not, is an open question. I plan to bring this question to the Zones community, and to some key customers, in the near future.

Zones are installed from a network repository
Unlike with traditional zones, which are sourced by copying bits from the global
zone, here we simply spool the contents from the network repository.
The upside is that this was easy to implement; the downside is that
you must be connected to the network to deploy a zone. Getting the bits
from the global zone is still desirable, but we don’t have that implemented
yet.

By default, zones are installed using the system’s
preferred authority (use pkg authority to see what
that is set to). The preferred authority is the propagated into the
zone. If you want to override that, you can specify a different
repository using the new -a argument to zoneadm install:

# zoneadm -z donutshop install -a ipkg=http://ipkg.eng:80
Non-global zones are small
Traditionally, zones are installed with all of the same software
that the global zone contains. In the case of "whole root" zones
(the opposite of sparse), this means that non-global zones are about
the same size as global zones– easily at least a gigabyte in size.

Since we’re not supporting sparse zones, I decided to pare down
the install as much as I could, within reason: the default zone
installation is just 206MB, and has a decent set of basic tools.
But you have to add other stuff you might need. And we can even
do more: some package refactoring should yield another 30-40MB
of savings, as packagings like Tcl and Tk should not be needed
by default. For example, Tk (5MB) gets dragged in as a dependency
of python (the packaging system is written in python); Tcl (another
5MB) is dragged in by Tk. Tk then pulls in parts of X11.
Smallness yields speed: when connected to a fast package repository
server, I can install a zone in just 24 seconds!.

I’m really curious to know what reaction people will have to such
minimalist environments. What do you think?

Once you start thinking about such small environments, some new concerns surface: vim (which in 2008.05 we’re using as our vi implementation)
is 17MB, or almost 9% of the disk space used by the zone!

Non-global zones are independent of the global zone
Because ipkg zones are branded, they exist independently
of the global zone. This means that if you do an image-update
of the global zone, you’ll also need to update each of your zones,
and ensure that they are kept in sync. For now this is a manual
process– in the future we’ll make it less so.
ZFS support notes
OpenSolaris 2008.05 makes extensive use of ZFS, and enforces ZFS
as the root filesystem. Additional filesystems are created for
/export, /export/home and /opt. Non-global zones don’t yet follow this convention.
Additionally, I have sometimes seen our auto-zfs file system
creation fail to work (you can see it working properly in the example above). We haven’t
yet tracked down that problem– my suspicion is that there is a bad interaction
with the 2008.05 filesystem layout’s use of ZFS legacy mounts.

As a result of this (and for other reasons too, probably), zones don’t
participate in the boot-environment subsystem. This means that you
won’t get an automatic snapshot when you image-update your
zone or install packages. That means no automatic rollback for zones.
Again, this is something we will endeavor to fix.

Beware of bug 6684810
You may see a message like the following when you boot your zone:

zoneadm: zone 'donutshop': Unable to set route for interface lo0 to éÞùÞ$
zoneadm: zone 'donutshop':

This is a known bug (6684810); fortunately the message is harmless.

In the next month, I hope to: take a vacation, launch a discussion with
our community about sparse root zones, and to make a solid plan for
the overall support of zones on OpenSolaris. I’ve got a lot to do,
but that’s easily balanced by the fact that I’ve been having a blast
working on this project…

Songbird for Solaris

Looks like Alfred’s hard work has paid off.  You can pull down a package of Songbird for OpenSolaris (see Alfred’s blog entry for the links).  Songbird is a next-gen media player built atop the Mozilla platform.   Although I’ve had it crash once, on the whole it has worked quite well.  SteveL’s mashtape extension is really neat, and you can see it in action in the screenshot below (it’s the thing offering pictures, youtube videos, etc. at the bottom of the window).

Next steps would be to get this into the OpenSolaris package repository– I hope that someday soon you will be able to pkg install songbird.

Nice work guys!

Solaris 8 Containers, Solaris 9 Containers

In the flurry of today’s launch event, we’ve launched Solaris 8 Containers (which was previously called Solaris 8 Migration Assistant, or Project Etude).  Here is the datasheet about the product.  Even better: We’ve also announced that Solaris 9 Containers will be available soon!  Jerry and Steve on the containers team have been toiling away like mad to make this possible.

Why the rename?  Well, for one thing, it’s easier to say 🙂  It also signals a shift in the way Sun will offer this technology to customers:

  • Professional Services Engagement: No longer required, now recommend.  It’s also simpler to order a SunPS engagement for this product.
  • Partners: (Some of) Sun’s partners are now ready to deliver this solution to customers.  Talk to your partner for more information.
  • Right to Use: Previously, we provided a 90 day evaluation RTU.  Now, the RTU is unlimited.  However, you must still pay for support.

I invite you to download Solaris 8 Containers, and give it a try! And as always, talk to your local SE or Sales Rep if you’re interested in obtaining support licenses (or any kind of help with) your Solaris 8 (or 9) containers.

Here’s Joost, our fearless marketing leader, with an informative talk about the why and how of Solaris 8 Containers.