Chargeback? Or, a Long Answer to a Short Question

I spoke to an account manager (I’ll call him “Sales Guy”) at Sun today regarding a large financial customer. The conversation is a good illustration of an oft-cited problem at Sun: we (and our customers) are often not aware that the solution they need is already lurking in the product somewhere! As you’ll see from the subsequent discussion, this is also an illustration of how a great feature could benefit from being a little bit easier to use.

Sales Guy: My customer is consolidating lots of Oracle databases (more than 100!) onto a single Sun Fire 6800 server; different departments at the company “own” the content in those databases, and so the IT group would really like to do charge-back to the departments for their usage. But of course all of these Oracle databases run as orauser, and so traditional accounting methods (such as charging the usage to the user ID) don’t work for us. I know there’s a new accounting feature starting in Solaris 9; could that work for them?

Me: Yes! I’ll try to step you through the process of setting this up; I’ll borrow (steal) some of the techniques Liane explained in her blog, so you might wish to read that first.
The recommended method is to use projects (see project(4)) to put a workload label on each database. Once we’ve done that, we’ll convince the extended accounting subsystem to reflect that workload tag when it produces accounting records.

[“Extended accounting” in some senses means “new accounting”; we’re not talking about the old System V accounting subsystem.]

Let’s say these 100 databases were divided between the finance, marketing and trading groups. First, you would define three projects in /etc/project (predictably: finance, marketing and trading). Then, each database belonging to the various departments would be launched using newtask(1) to label it.
It might look something like this:

# newtask -p finance /usr/local/bin/start_oracle_db_widget_sales

Subsequently, every process which makes up this database instance (in this case, the “widget sales” database) would be labelled as part of the finance workload. It turns out that there are a number of good reasons to categorize your processes in this way. Foremost is prstat(1M), which has a handy -J option that breaks down your processes by project (and hence, by workload!). Marketing’s three databases are using 87% of the total CPU? Now you’ll know. Another good reason is that you can use this tag to attach a variety of resource controls to the workload (a topic best left for another day).
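
For the project definitions themselves, here’s a rough sketch that prints three entries in the project(4) format (name:projid:comment:user-list:group-list:attributes). The project IDs, the comments, and the helper name project_entries are all made up for illustration, and putting orauser in the user list is my assumption about how the databases get started; review the output before adding anything to /etc/project.

```shell
# Print entries suitable for /etc/project, in project(4) format:
#   name:projid:comment:user-list:group-list:attributes
# The IDs (100-102) and comments are invented for illustration; orauser is
# placed in the user list on the assumption that the databases are started
# as that user. project_entries is a hypothetical helper name.
project_entries() {
    id=100
    for dept in finance marketing trading; do
        printf '%s:%s:%s databases:orauser::\n' "$dept" "$id" "$dept"
        id=`expr $id + 1`
    done
}

project_entries
```

Each line it prints is one project; the two trailing empty fields are the group list and the attributes, which we don’t need yet.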

On to step two: how do we get the workload tags into something we can use for chargeback? To start with, we enable the accounting facility for tasks using acctadm(1M). (A task is a tree of processes, all of which are part of a particular project.) As with process accounting, records for the task will be written out when the task exits:

# acctadm -f /var/task_accounting task
# acctadm -e extended task
# acctadm -E task

The first command tells the kernel where to write the accounting records; the second tells the kernel what information to write, and the third immediately enables task accounting. The System Administration Guide documents how to make these settings persist across reboot.

[Hopefully we’ll soon be able to move this configuration under the aegis of smf(5) and make this procedure simpler.]

Of course, the databases in this case are going to be long running; it could be that no tasks will exit for a very long time (or might never exit and write out accounting records if the system lost power or crashed). This can be solved using the wracct(1M) command (pronounced “racket”) to flush out partial accounting records at our leisure. Something like this in a daily cron job would do the trick:

# procs=`pgrep -J marketing,finance,trading`
# tasks_to_wracct=`ps -o taskid= -p $procs | sort -u`
# /usr/bin/wracct -t interval -i "$tasks_to_wracct" task

The first command makes a list of all of the processes associated with the marketing, finance and trading projects. The second converts that list of processes into a list of task IDs. Finally, this list is sent to wracct, which writes an interval record. This basically means that the accounting subsystem’s bean counters for the tasks in question are reset to zero following the call to wracct, so when we walk through the accounting records later, a simple sum is all we need to determine total resource usage.
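
If you’d rather wrap those three steps in a script for cron, a slightly more defensive sketch might look like the following. The function name flush_task_records and the PROJECTS variable are mine, not part of Solaris. It assumes pgrep’s -J (project) and -d (delimiter) options so that ps receives a comma-separated PID list, and it skips the wracct call entirely when nothing in those projects is running.

```shell
# Projects whose running tasks should get an interval record (illustrative).
PROJECTS="marketing,finance,trading"

# flush_task_records is a hypothetical helper, not a Solaris command.
flush_task_records() {
    # Comma-separated PIDs of all processes in the listed projects.
    procs=`pgrep -d, -J "$PROJECTS"`
    # Nothing running in these projects: nothing to flush.
    [ -n "$procs" ] || return 0

    # Unique task IDs for those PIDs; the unquoted echo joins them with spaces.
    tasks=`ps -o taskid= -p "$procs" | sort -u`
    tasks=`echo $tasks`

    # One interval record per task; the ID list is passed as a single argument.
    wracct -t interval -i "$tasks" task
}
```

A script run daily from root’s crontab would simply define this function and call flush_task_records once.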

The key insight is that when the kernel produces the stream of accounting records, every record will be tagged by project. And so we can simply add up the CPU and other usage counters and produce a nice report of the activity of the marketing, finance, and trading departments, because we have projects corresponding to them. To do so, we’ll need some software to extract this data from the accounting records. We’ve provided some nice APIs which let you roll your own: libexacct(3LIB) allows you to read accounting data from C or C++, and Sun::Solaris::Exacct provides a Perl API. There’s also some example source code in /usr/demo/libexacct. There is also some third-party software available to provide nice reporting from the extended accounting records, but I don’t have a list at the moment.
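
Whatever ends up reading the records, the chargeback arithmetic itself is small. As a rough sketch, suppose a reader built on one of those APIs has already dumped the interval records to text, one per line, as “project taskid cpu_seconds”. That layout, the sample figures, and the function name chargeback_report are all invented here (the real exacct file is binary), so treat this as an illustration of the “simple sum” rather than a reporting tool.

```shell
# Hypothetical text dump of interval records, one per line:
#   project  taskid  cpu_seconds
# The layout and figures are invented; real exacct files are binary.
records='finance 71 120
finance 71 80
marketing 72 300
finance 73 50
marketing 72 200
trading 74 250'

# Usage: chargeback_report TOTAL_COST < records
# Sums CPU seconds per project, then prints each project's total, its
# percentage of all CPU used, and its share of TOTAL_COST.
# (awk -v needs nawk or /usr/xpg4/bin/awk on Solaris.)
chargeback_report() {
    awk -v cost="$1" '
        { cpu[$1] += $3; total += $3 }
        END {
            if (total == 0) exit
            for (p in cpu)
                printf "%s %d %.1f%% %.2f\n", p, cpu[p], 100 * cpu[p] / total, cost * cpu[p] / total
        }' | sort
}

printf '%s\n' "$records" | chargeback_report 10000
```

With the sample figures above, finance and trading each come out at 25% of the CPU and marketing at 50%, so a bill of 10000 splits 2500 / 5000 / 2500. Because every record covers only its own interval, no differencing of cumulative counters is needed.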

[And obviously, “write your own” isn’t a great solution; we need to get cracking and provide a reporting solution right out of the box; minimally, I think we should be able to transform the accounting data into XML for further processing].

Sales Guy (yes, he’s still here): Great! I think they’ll be able to set this up. You know, the other problem they’re having… the customer is running four different versions of WebLogic on this server, and they have to go through all kinds of gyrations to get WebLogic to bind to different ports…

Me: Zones to the rescue!

Sales Guy: Yeah, that’s what I thought; they are excited to have Zones.

I’ll talk more about how Zones solves this aspect of consolidation real soon now.

[Note that I’ve made some minor updates based on feedback from Stephen; 8/4/04]