QEMU for Solaris x86

QEMU is a CPU emulator. While it can do a lot of different things, for now, think of it as a virtual x86 box which runs as a user process. The kernel group in Menlo Park hosted an intern this summer who worked on (among other things) getting QEMU up and running well on Solaris, and getting Solaris to install and boot inside of QEMU.

We need to do some more work on this before it is ready for “prime time,” but it looks promising. Eric pointed out to me that QEMU and technologies like it will more easily enable folks to contribute to the Open Source Solaris effort, and that’s important. On the other hand, it won’t help people trying to write drivers (where I hope we’ll see a lot of community activity), as the emulated x86 system is pretty minimal.

The great news is that we’re not alone in believing in QEMU: ThinkSolaris.com [2/2006: Now defunct] has posted a QEMU binary for x86, and some screenshots. Andrei showed up in my office today with a demo of running Windows XP inside of QEMU using the packages he downloaded from ThinkSolaris. However, there is more to do. We need to get networking up and running before QEMU will be really useful, and our intern identified the interface between Solaris (running inside QEMU) and the QEMU IDE emulation as a major performance problem. As a kernel developer I’m looking forward to being able to boot Solaris atop Solaris using an open emulation technology…

UPDATE: To be clear, I’m not announcing any policy that we’re planning to ship QEMU, nor is it part of our virtualization strategy at this time. For more about our approach to OS virtualization, please take a look at Solaris Zones. I am interested in QEMU because it offers an interesting possibility for doing routine kernel development work on a subset of kernel subsystems.

UPDATE 2: Thanks to the indefatigable Jurgen Keil, we now have working networking! Hooray! A suggestion to Martin et al: Packaging this for blastwave.org (/opt/csw) would be fantastic.

What’s New in Solaris Express 8/04 (Build 63)

Solaris Express 8/04 (or Build 63, as we call it) is now
available for download.

For one of our Beta programs, I’ve been authoring a “newsletter” of sorts for the past 5 months; this describes new features in the S10 Express releases. Partly, this is a rollup of information that can be found in the What’s New doc, but it also features other stuff that I have spotted that might not otherwise be publicized.

Bart suggested that I share the Beta6 version of the newsletter on my weblog. And so here I am; it’s more fun than doing laundry, at least. Remember: this is what’s new since the last express build. It’s a huge list nonetheless. Who says we’re slow?

[Now that I’ve gone to all the trouble to write this, I see that Alan Hargreaves has beaten me to the punch. Hey, no fair announcing the features before we ship the thing! (Heck, I thought pre-announcing stuff would get me into trouble.) And Alan Coopersmith has excellent coverage of X11-related (and more) changes. I’ve shamelessly stolen some of the highlights that Alan (C, not H) mentions which I missed. Anyway, I hope that my descriptions will fill in some extra detail, and I’ll try to update this entry with links to the documentation as I have the time. Plus, there is some exclusive content you’ll find only here. Hmph.]

Networking and Sharing

  • NFSv4 is on by default. The NFS client will attempt to use
    version 4 for all mounts; the NFS server will offer version 4 automatically,
    along with version 3 and version 2.
  • DHCP Event Scripting allows you to trigger shell scripts when
    various DHCP client events occur.
  • Stream Control Transmission Protocol (SCTP) is now
    implemented. SCTP is a networking technology akin to TCP that is popular in the Telco
    space.
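
To make the DHCP event scripting item a bit more concrete, here is a minimal sketch of a hook script. It assumes, per my reading of dhcpagent(1M), that the agent runs /etc/dhcp/eventhook with the interface name and an event name (such as BOUND, EXTEND, EXPIRE, DROP or RELEASE) as its two arguments. The function name and the messages are placeholders of mine, not anything that ships with the system.

```shell
#!/bin/sh
# Sketch of an event hook. Assumption: the DHCP agent invokes this script
# as "eventhook <interface> <event>".

# Decide what to do for one event; here we only print a message.
handle_dhcp_event() {
    iface=$1
    event=$2
    case "$event" in
        BOUND|EXTEND)
            echo "lease active on $iface ($event)" ;;
        EXPIRE|DROP|RELEASE)
            echo "lease gone on $iface ($event)" ;;
        *)
            echo "ignoring event $event on $iface" ;;
    esac
}

# Dispatch only when both arguments were supplied.
if [ $# -ge 2 ]; then
    handle_dhcp_event "$1" "$2"
fi
```

In real use you would swap the echo lines for whatever you actually want to happen, such as restarting a service when the address changes.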

Debugging and Observability

  • kmdb, the replacement for kadb (the kernel debugger), is
    available on all platforms and kadb is gone. Good riddance. This brings most of
    the functionality of mdb to an in situ debugger.
    The joy of ::ps is now available all of the time. Another cool feature of kmdb is
    that you don’t need to remember to ‘boot to the debugger’ as you once did. Running
    mdb -K will cause kmdb to carjack the running OS.
  • DTrace MIB (networking) and fpuinfo providers.
  • dtrace(1M) now sports -c and -p options; -c makes it easier to start up a process
    and trace it, and -p lets you attach dtrace to a running process. In both cases, dtrace
    exits when the target process exits, and the PID of the target process becomes available
    as $target. So now it’s trivial to use the PID provider to trace a process
    from its earliest moments of life:

    dtrace -n 'pid$target:::entry{@[probefunc]=count()}' -c /usr/bin/id
    
  • Per-thread modes for truss, pstack, and pflags.

x86 Platform

  • The SAN Foundation Software (the fibre channel stack) is now available
    on x86 systems. Hook your Maserati up to your Stinger and kick some ass!
  • x86 Basic SATA Support provides support for using Serial ATA on
    motherboards using ICH5 & ICH5R hubs and/or Silicon Image SATA
    controllers 3112, 3114 and 3512, operating in IDE mode.
  • SunVTS is now available for Solaris x86,
    although as on SPARC, it’s not installed by default. This is definitely on my list of
    things to try.

Security Features

  • OpenSSL is now supported, and available
    in /usr/sfw. This version of OpenSSL is integrated with the
    Solaris Cryptographic Framework (which in Solaris 10 unifies all of the crypto on the system)
    via PKCS#11. Secure and fast… Who knew?
  • IPSec/IKE NAT-Traversal. You can now use IPSec and IKE from behind
    a NAT (Network Address Translation) box, making IPSec more useful
    from (e.g.) DSL setups.

Open Source Integration

  • The Zebra Multiprotocol Routing Suite is now supported, and
    available in /usr/sfw.
  • Sendmail has been upgraded to 8.13.
  • BIND 9 (the DNS server) is now available,
    and supported. It is in /usr/sfw. /usr/sfw/doc includes a BIND 8.x to 9.x transition guide.
  • Samba has been upgraded to 3.0.4.
  • libusb 0.1.8 is now available in /usr/sfw.
    [I’m really excited to have this. I’m going to try
    to see if I can get gphoto to talk to my Canon Powershot S45]
  • Perl 5.8.4 is now available, and PerlGcc makes it easy to build Perl modules using GCC (we use Sun’s compilers to build the OS, including Perl, and so special effort is required to bridge to Perl modules built with GCC).

Zones

  • Security auditing has been extended to work with Zones.
  • New project.max-lwps, project.max-tasks and
    zone.max-lwps resource controls help you better contain workloads.
  • CPU visibility inside zones is now restricted to the CPUs assigned to
    the resource pool, if you bind the zone to a pool. This means that
    you can configure a zone to only “know about” 2 CPUs on your 12-CPU server.
    This can be very useful for per-CPU licensing schemes.

Other Stuff

  • The default depth for Xsun is now 24-bit on all frame buffers that
    support it (it’s in the /usr/dt/config/Xservers shipped with the
    system now instead of forcing everyone to edit by hand). Hooray!
  • Sun’s OpenGL implementation is now bundled in with the OS; it will no longer
    require a separate download.
  • SVM’s metaimport(1M) command now allows a user to import replicated
    SVM disksets (replicated via Hitachi TruCopy/ShadowImage or the like).

Chargeback? Or, a Long Answer to a Short Question

I spoke to an account manager (I’ll call him “Sales Guy”) at Sun today regarding a large financial customer; the conversation is a good illustration of an oft-cited problem at Sun: we (and our customers) often are not aware that the solution they need is already lurking in the product somewhere! As you’ll see from the subsequent discussion, this is also an illustration of how a great feature could benefit from being a little bit easier to use.

Sales Guy: My customer is consolidating lots of Oracle databases (more than 100!) onto a single Sun Fire 6800 server; different departments at the company “own” the content in those databases, and so the IT group would really like to do charge-back to the departments for their usage. But of course all of these Oracle databases run as orauser and so traditional accounting methods (such as charging the usage to the user ID) don’t work for us. I know there’s a new accounting feature starting in Solaris 9; could that work for them?

Me: Yes! I’ll try to step you through the process of setting this up; I’ll borrow (steal) some of the techniques Liane explained in her blog, so you might wish to read that first.
The recommended method is to use projects (see project(4)) to put a workload label on each database. Once we’ve done that, we’ll convince the extended accounting subsystem to reflect that workload tag when it produces accounting records.

[“Extended accounting” in some senses means “new accounting”; we’re not talking about the old SystemV accounting subsystem].

Let’s say these 100 databases were divided between the finance, marketing and trading groups. First, you would define three projects in /etc/project (predictably: finance, marketing and trading). Then, each database belonging to the various departments would be launched using newtask(1) to label it. It might look something like this:

# newtask -p finance /usr/local/bin/start_oracle_db_widget_sales

Subsequently, every process which makes up this database instance (in this case, the “widget sales” database) would be labelled as part of the finance workload. It turns out that there are a number of good reasons to categorize your processes in this way. Foremost is prstat(1M), which has a handy -J option, breaking down your processes by project (and hence, by workload!). Marketing’s three databases are using 87% of the total CPU? Now you’ll know. Another good reason is that you can use this tag to attach a variety of resource controls to the workload (a topic best left for another day).
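
Going back to the first step for a moment: here is a rough sketch of what the three /etc/project entries could look like. It follows the colon-separated layout described in project(4) (name, project ID, comment, user list, group list, attributes). The project IDs and comments are made up for illustration, and I’m assuming you want orauser listed as a member of each project. The helper function is mine and only prints the lines; it doesn’t touch /etc/project.

```shell
#!/bin/sh
# Print one project(4)-style entry:
#   name:projid:comment:user-list:group-list:attributes
# The group list and attributes are left empty in this sketch.
project_entry() {
    printf '%s:%s:%s:%s::\n' "$1" "$2" "$3" "$4"
}

# Hypothetical IDs and comments; pick values that suit your site.
project_entry finance   100 'Finance databases'   orauser
project_entry marketing 101 'Marketing databases' orauser
project_entry trading   102 'Trading databases'   orauser
```

You could paste the printed lines into /etc/project by hand, or create the same projects with projadd(1M) if you would rather not edit the file directly.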

On to step two: how do we get the workload tags into something we can use for chargeback? To start with, we enable the accounting facility for tasks using acctadm(1M). (A task is a tree of processes, all of which are part of a particular project.) As with process accounting, records for the task will be written out when the task exits:

# acctadm -f /var/task_accounting task
# acctadm -e extended task
# acctadm -E task

The first command tells the kernel where to write the accounting records; the second tells the kernel what information to write, and the third immediately enables task accounting. The System Administration Guide documents how to make these settings persist across reboot.

[Hopefully we’ll soon be able to move this configuration under the aegis of smf(5) and make this procedure simpler.]

Of course, the databases in this case are going to be long running; it could be that no tasks will exit for a very long time (or might never exit and write out accounting records if the system lost power or crashed). This can be solved using the wracct(1M) command (pronounced “racket”) to flush out partial accounting records at our leisure. Something like this in a daily cron job would do the trick:

# procs=`pgrep -J marketing,finance,trading`
# tasks_to_wracct=`ps -o taskid= -p $procs | sort -u`
# /usr/bin/wracct -t interval -i "$tasks_to_wracct" task

The first command makes a list of all of the processes associated with the marketing, finance and trading projects. The second converts that list of processes into a list of task IDs. Finally, this list is sent to wracct, which writes an interval record; this basically means that the accounting subsystem’s bean counters for the tasks in question are reset to zero following the call to wracct. As a result, when we walk through the accounting records later, a simple sum is all we need to determine total resource usage.
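
If you want to drop this into cron, a slightly more defensive sketch might look like the following. The function names are mine. I’m assuming that pgrep’s -d option can be used to get a comma-separated PID list, and that wracct accepts a space-separated ID list as a single argument, which is how I read the man pages; check both on your build. The Solaris-specific part is wrapped in a function that is never called here, so nothing runs until you uncomment the last line.

```shell
#!/bin/sh
# Reduce "ps -o taskid=" output (one ID per line, possibly padded with
# spaces and repeated) to one space-separated list of unique task IDs.
unique_task_ids() {
    awk 'NF { print $1 }' | sort -un | tr '\n' ' ' | sed 's/ $//'
}

# Write interval records for every task in the given projects.
# $1 is a comma-separated project list, e.g. marketing,finance,trading
flush_interval_records() {
    procs=`pgrep -d, -J "$1"`       # assumed: -d sets the output delimiter
    [ -n "$procs" ] || return 0     # nothing running, nothing to flush
    tasks=`ps -o taskid= -p "$procs" | unique_task_ids`
    /usr/bin/wracct -t interval -i "$tasks" task
}

# Uncomment on the Solaris box that runs the databases:
# flush_interval_records marketing,finance,trading
```

The early return matters more than it looks: without it, a day on which no database happens to be running would hand wracct an empty ID list.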

The key insight is that when the kernel produces the stream of accounting records, every record will be tagged by project. And so we can simply add up the CPU and other usage counters and produce a nice report of the activity of the marketing, finance, and trading departments, because we have projects corresponding to them. To do so, we’ll need some software to extract this data from the accounting records. We’ve provided some nice APIs which let you roll your own: libexacct(3LIB) allows you to read accounting data from C or C++, and Sun::Solaris::Exacct provides a Perl API. There’s also some example source code in /usr/demo/libexacct. There is also some third-party software available to provide nice reporting from the extended accounting records, but I don’t have a list at the moment.
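
Once the records are out of the accounting file, the “simple sum” really is simple. Here is a rough sketch, assuming you have already used one of the APIs above to dump each record as a line of text holding a project name and a CPU-seconds figure. That two-column format is my invention for illustration, not something a shipped tool produces, and the sample numbers are made up.

```shell
#!/bin/sh
# Sum CPU seconds per project.
# Input (hypothetical format): one record per line, "<project> <cpu_seconds>".
# Output: one line per project, "<project> <total>", sorted by project name.
sum_cpu_by_project() {
    awk '{ total[$1] += $2 }
         END { for (p in total) printf "%s %.2f\n", p, total[p] }' | sort
}

# Made-up sample records: two intervals for finance, one for marketing.
printf '%s\n' 'finance 12.5' 'marketing 3.0' 'finance 7.5' | sum_cpu_by_project
```

Because each interval record only covers usage since the previous one, adding the lines up per project gives the total for the period without any subtraction.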

[And obviously, “write your own” isn’t a great solution; we need to get cracking and provide a reporting solution right out of the box; minimally, I think we should be able to transform the accounting data into XML for further processing].

Sales Guy (yes, he’s still here): Great! I think they’ll be able to set this up. You know, the other problem they’re having… the customer is running four different versions of WebLogic on this server, and they have to go through all kinds of gyrations to get WebLogic to bind to different ports…

Me: Zones to the rescue!

Sales Guy: Yeah, that’s what I thought; they are excited to have Zones.

I’ll talk more about how Zones solves this aspect of consolidation real soon now.

[Note that I’ve made some minor updates based on feedback from Stephen; 8/4/04]

The Test Begins…

I’ve been a kernel engineer at Sun working on Solaris since I graduated from Brown University in 1998; I sit down the hall from Bryan, Stephen and others here in sunny Menlo Park, CA. Like almost everyone else in Silicon Valley, I’m not from around here.

So why yet another Blog? After all, Sun apparently has more than 500 going at this point, and the Solaris group is already heavily represented. Someone recently (and politely) described me as a “contrarian.” Perhaps that means I can express a minority perspective in my writing (this is, at least in part, the inspiration for the title of this blog; I will leave it as an exercise to the reader to guess or suggest the others). I’ve also been inspired by the excellent articles which Eric has written, and I hope to follow his example in writing useful, informative pieces. On the other hand, I regard the prospect of keeping a blog to be somewhat tedious, vaguely narcissistic, and certainly exhibitionist; as Homer Simpson recently said, “Instead of one big-shot controlling all the media, now there’s a thousand freaks Xeroxing their worthless opinions.” I’ll give it a try and see what happens.

I am also motivated by the chance to talk about some of the technology I have worked on, or have used, or have built and then misplaced somewhere in my home directory. In summary, for the last two and a half years I have primarily been involved in the Solaris Zones project, led by Andy Tucker. Zones has built a server consolidation facility directly into Solaris 10, and we think that’s unique among commodity operating systems. The combination of Zones and the Solaris Resource Manager (also built directly into the OS) adds up to a powerful solution which our marketing department has dubbed N1 Grid Containers. More recently, I spent several months ensuring that Zones and Solaris 10’s new Service Management Facility (smf(5)) interoperate seamlessly.