Finding bugs in python code, using DTrace

I have spent a lot of time in recent months helping the IPS project.  This week I integrated usability improvements, and tracked down two different performance problems.  The first was notable because I think it’s the kind of thing that would have been hard to find without DTrace.   Here’s a writeup I did, cribbed from the bug report:

While doing work on IPS I discovered a large number of system calls being made
during the "creating plan" phase of a fresh installation of the
‘redistributable’ cluster of packages.

I used the following crude dtrace script to track down the problem:

#!/usr/sbin/dtrace -Fs

/copyinstr(arg1) == $$2/

/copyinstr(arg1) == $$2 && self->t/


printf("calls: %d", calls);

No points for elegance but it works.

Successive guesses allowed me to narrow this down to the get_link_actions()
routine. Here’s an example of its system call impact during the following

# rm -fr /tmp/foo
# export PKG_IMAGE=/tmp/foo
# pkg image-create -a foo=http://ipkg.eng /tmp/foo
# pkg install -n redistributable

I ran the D script as follows:

$ dtrace -s /s4pyfunc.d `pgrep client` get_link_actions

And once the command was done, got this output:

CPU     ID                    FUNCTION:NAME
0 2 :END calls: 880

brk 26
stat64 774400
close 775280
fcntl 775280
fsat 775280
fstat64 775280
getdents64 1553200

So here we can see that something is amiss– this routine is the source
of 4.7 million syscalls.
Here’s the code:

        def get_link_actions(self):
""" return a dictionary of hardlink action lists indexed by
target """
(a) if self.link_actions:
return self.link_actions

d = {}
for act in self.gen_installed_actions():
if == "hardlink":
t = act.get_target_path()
if t in d:
d[t] = [act]

self.link_actions = d
return d

When Bart and I first looked at this, it seemed by inspection to be working as
designed– the first time the routine is called, we populate a cache, and
subsequently the cache is returned. At some point Bart went "Aha!" and we
realized that the problem here was that the comparison at (a) wasn’t working
right, and that, for a not-yet-installed image, the link_actions dictionary is
exactly equal to {}, the empty dictionary. That evaluates to "false" in
python. So we always think that we haven’t yet filled out the link_actions
cache, and go to fill it. That sends us into gen_installed_actions(), which is
expensive. It seems to get more and more expensive as we accumulate more
manifests, as well.

So, while there are exactly zero link actions we need to worry about, we rescan
all manifests for every new manifest we evaluate.

This is a situation which will crop up most often with zones, since that’s the
place we’re most likely to create new images from scratch.

The fix is very simple: set self->link_actions to None initially (in the object
constructor), and change the comparison at (a) to test against None explictly.

Having fixed this, the impact of get_link_actions is:

CPU     ID                    FUNCTION:NAME
0 2 :END calls: 880

stat64 3
close 4
fcntl 4
fsat 4
fstat64 4
getdents64 8

Or, basically, nothing. In some informal testing I got these numbers
with ptime(1):

real     1:26.730
user 1:10.988
sys 11.624


real     1:12.932
user 1:02.475
sys 6.361

Or about a 19% improvement in real time. And quite a bit in sys time. Not a huge win, but not too shabby either.

I used the same technique to find a similar bug in another part of the code, but alas, Danek had already found the same bug, so I can’t claim it.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s