Performance Monitoring with OProfile
oprofile is a low overhead, open-source tool that hooks into Linux and can keep track of CPU event monitoring information. This is a fairly general statement and for this post I'll be using the Intel Penryn microarchitecture, which should have similar event counters to most recent Intel processors. You can get the canonical list of event counters from Intel's own documentation in Chapter 30, Performance Monitoring, of Volume 3B, System Programming Guide (available from Intel's site). Alternatively, the Japan Advanced Institute of Science and Technology have an interactive version with all the events for most Intel processors.
Event Counters
If you are unaware, almost every processor manufactured in recent history has some collection of event counters that are incremented when some processor event occurs. These events can range from clock cycles ticking by, instructions being retired, thermal thresholds being passed, or second level cache misses.
So far, I've only really used the CPU clock cycles, level 1 cache line replacement, and instructions retired event counters. Your needs might not match mine, so venture over to the Programmer Manual when you need something else!
Event Ratios
Related to the event counters are event ratios. These simple ratios can help you find specific performance issues in your program. For example, if your program does a lot of memory accesses, the processor may need to replaced cache lines frequently. But cache line replacements are naturally occurring in programs, how do we find excessive? Simple! We can just use the ratio of L1 cache replacements to the number of instructions retired. Then we'll have an idea of how many times per instruction an L1 cache line is replaced.
Using oprofile
First, you'll have to be running Linux, then you'll want to install the
"oprofile" package. Since this software installs kernel modules for
monitoring, you'll also need root/sudo access to allow the module to be
loaded and unloaded for monitoring sessions. Here, I'll be running as a
user and using the sudo
command when needed.
opcontrol
opcontrol
is main program that lets you interact with the kernel. If you
need a down-and-dirty list of the events available for monitoring,
opcontrol --list-events
will show you all the event counters at your
disposal.
On my processor, the default event to monitor is CPU_CLK_UNHALTED which
will tell me where the processor spent most of the time executing. If you
want to monitor different events, you can specify what event(s) to monitor
at the command line. The separate
flag simply tells the profiler to
separate traces for each individual CPU (there are other separation levels,
but we won't use them here).
$ sudo opcontrol --event L1D_REPL:10000 --event INST_RETIRED:10000 --separate=cpu
The :10000
after each counter simply specifies what the trigger threshold
is for raising processor exception. In other words, every 10,000
instructions retired the processor raises an exception that the oprofile
daemon will catch and then increment the sample counter for that event.
So, if you see that oprofile has 1 sample of the INST_RETIRED counter then
the processor has seen 10,000 such events.
Now that we have the event counters configured, we can start the monitoring.
$ sudo opcontrol --start
Since the system is doing other activities, it is best if what you want to
monitor can monopolize the system for the while. In my case I build a
simple program that purposefully causes the L1 cache to have a lot of
misses (l1thrash
source
code). I'll also set
the program to execute on one processor (CPU 1).
$ taskset 02 ./l1thrash
After it finishes executing, stop oprofile from running and save the profile session on the disk.
$ sudo opcontrol --stop
$ sudo opcontrol --save l1thrash
Now we have our profile saved to disk and we can view it with opreport
.
opreport
Finally, we get to see how the program handled! Since we were smart and saved our profile to a session, we'll have to specify that at the command line. You might want to pipe the output to less since it can be long at times. On my eight core system the output looks ugly.
$ opreport session:l1thrash
CPU: Core 2, speed 2494.04 MHz (estimated)
Counted L1D_REPL events (Cache lines allocated in the L1 data cache) with a unit mask of 0x0f (No unit mask) count 10000
Samples on CPU 0
Samples on CPU 1
Samples on CPU 2
Samples on CPU 3
Samples on CPU 4
Samples on CPU 5
Samples on CPU 6
Samples on CPU 7
cpu:0| cpu:1| cpu:2| cpu:3| cpu:4| cpu:5| cpu:6| cpu:7|
samples| %| samples| %| samples| %| samples| %| samples| %| samples| %| samples| %| samples| %|
------------------------------------------------------------------------------------------------------------------------------------------------
541 95.7522 2969 0.9630 301 92.6154 484 69.9422 797 92.6744 707 88.2647 675 90.3614 707 89.8348 vmlinux
7 1.2389 21 0.0068 6 1.8462 6 0.8671 9 1.0465 6 0.7491 6 0.8032 5 0.6353 oprofile
6 1.0619 3 9.7e-04 6 1.8462 1 0.1445 3 0.3488 2 0.2497 1 0.1339 4 0.5083 nf_ses_watch
5 0.8850 16 0.0052 6 1.8462 7 1.0116 25 2.9070 23 2.8714 30 4.0161 27 3.4307 libc-2.5.so
3 0.5310 2 6.5e-04 1 0.3077 1 0.1445 2 0.2326 4 0.4994 0 0 1 0.1271 libpython2.4.so.1.0
1 0.1770 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e1000e
1 0.1770 0 0 0 0 0 0 0 0 0 0 0 0 0 0 irqbalance
1 0.1770 1 3.2e-04 1 0.3077 0 0 0 0 0 0 0 0 0 0 sshd
0 0 2 6.5e-04 0 0 0 0 6 0.6977 8 0.9988 8 1.0710 18 2.2872 bash
0 0 0 0 0 0 0 0 1 0.1163 0 0 0 0 1 0.1271 gawk
0 0 0 0 0 0 3 0.4335 1 0.1163 2 0.2497 1 0.1339 0 0 bnx2
0 0 0 0 3 0.9231 0 0 0 0 0 0 0 0 0 0 ehci_hcd
0 0 305283 99.0179 0 0 0 0 1 0.1163 4 0.4994 1 0.1339 2 0.2541 l1thrash
0 0 10 0.0032 0 0 0 0 14 1.6279 12 1.4981 19 2.5435 13 1.6518 ld-2.5.so
0 0 3 9.7e-04 0 0 0 0 1 0.1163 1 0.1248 2 0.2677 2 0.2541 libcrypto.so.0.9.8b
0 0 0 0 1 0.3077 0 0 0 0 0 0 0 0 0 0 libm-2.5.so
0 0 0 0 0 0 0 0 0 0 0 0 1 0.1339 0 0 libpthread-2.5.so
0 0 0 0 0 0 0 0 0 0 0 0 1 0.1339 0 0 syslogd
0 0 0 0 0 0 0 0 0 0 1 0.1248 0 0 0 0 which
0 0 0 0 0 0 1 0.1445 0 0 0 0 0 0 0 0 libcups.so.2
0 0 0 0 0 0 0 0 0 0 0 0 2 0.2677 0 0 libusb-0.1.so.4.4.4
0 0 0 0 0 0 189 27.3121 0 0 30 3.7453 0 0 7 0.8895 oprofiled
0 0 1 3.2e-04 0 0 0 0 0 0 1 0.1248 0 0 0 0 cupsd
You may notice that the columns try to be sorted in descending order by the
number of samples taken for a specific process. However, on CPU 1 (where
we ran l1thrash
) the sorted order isn't close to correct. Luckily, we
know that the bulk of our program only ran on CPU 1, so we can reissue the
opreport
command specifying that we only care about that processor.
$ opreport session:l1thrash cpu:1
CPU: Core 2, speed 2494.04 MHz (estimated)
Counted INST_RETIRED.ANY_P events (number of instructions retired) with a unit mask of 0x00 (No unit mask) count 10000
Counted L1D_REPL events (Cache lines allocated in the L1 data cache) with a unit mask of 0x0f (No unit mask) count 10000
INST_RETIRED:1...| L1D_REPL:10000|
samples| %| samples| %|
------------------------------------
1834500 91.0882 305283 99.0179 l1thrash
154499 7.6713 2969 0.9630 vmlinux
21655 1.0752 21 0.0068 oprofile
2176 0.1080 16 0.0052 libc-2.5.so
442 0.0219 10 0.0032 ld-2.5.so
435 0.0216 2 6.5e-04 bash
108 0.0054 3 9.7e-04 libcrypto.so.0.9.8b
47 0.0023 3 9.7e-04 nf_ses_watch
43 0.0021 1 3.2e-04 sshd
35 0.0017 2 6.5e-04 libpython2.4.so.1.0
10 5.0e-04 0 0 libavahi-common.so.3.4.3
10 5.0e-04 1 3.2e-04 cupsd
9 4.5e-04 0 0 libcups.so.2
7 3.5e-04 0 0 bnx2
3 1.5e-04 0 0 libavahi-core.so.4.0.5
1 5.0e-05 0 0 libpthread-2.5.so
1 5.0e-05 0 0 timemodule.so
That looks better! Since we've narrowed down the output to one CPU, we now
get to see both events that we monitored too. You can see that the
majority of the time was spent in our l1thrash
program, but how did it
do?
We know that the number of samples is the number of times that the event
counter on the processor hit 10,000 for both counters. So, we find that
our l1thrash
program caused (305,283)(10,000) = 3,052,830,000 level 1 cache
replacements and retired (1,834,500)(10,000) = 18,345,000,000 instructions.
Egads! Is that good or bad? Well, now we can throw in our ratio
calculation for the L1 data cache miss:
That seems pretty bad to me! We can also see that the Linux kernel
(vmlinux
) had a ratio of 2,969:154,499 or about 1.9%, that is
a fairly typical miss ratio.
A Second Example
This is a real example of a program I am actively trying to improve. The
program is a kernel module (nf_ses_watch
) designed to intercept packets
at a decent rate, it is not performing well. Here I'll use the default
CPU_CLK_UNHALTED event monitor to see where the processor spends most of
its time.
$ # I've already loaded the kernel module and started my packet generator
$ sudo opcontrol --event default
$ sudo opcontrol --start
$ # I'll wait about 30 seconds so there are enough samples to be meaningful
$ sudo opcontrol --stop
$ sudo opcontrol --save bombard
Now I have my saved session and can look at the profile. I've also taken the time to set the interrupt affinity of the Ethernet device to a specific processor (CPU 7), so now we can see if all the time was spent in my code of Linux code.
$ opreport session:bombard cpu:7
CPU: Core 2, speed 2494.04 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 10000
CPU_CLK_UNHALT...|
samples| %|
------------------
737746 86.6169 nf_ses_watch
88183 10.3533 vmlinux
16810 1.9736 e1000e
3680 0.4321 oprofiled
2594 0.3046 oprofile
1578 0.1853 libc-2.5.so
900 0.1057 bash
78 0.0092 ld-2.5.so
52 0.0061 ophelp
26 0.0031 libavahi-common.so.3.4.3
22 0.0026 libavahi-core.so.4.0.5
13 0.0015 gawk
9 0.0011 libcrypto.so.0.9.8b
9 0.0011 libpython2.4.so.1.0
9 0.0011 sshd
8 9.4e-04 bnx2
4 4.7e-04 libpthread-2.5.so
3 3.5e-04 grep
2 2.3e-04 ipv6
2 2.3e-04 auditd
1 1.2e-04 cat
1 1.2e-04 libdl-2.5.so
1 1.2e-04 libm-2.5.so
1 1.2e-04 libpcre.so.0.0.1
1 1.2e-04 dirname
1 1.2e-04 automount
Wow! Over 86% of the time we were executing code in the nf_ses_watch
kernel module (my code)! Let's see if we can dig a little deeper. First,
oprofile has already done the work for us and tracks the specific symbol
name within a piece of code that was active when the sample was taken with
the --symbols
option (this results in a very long list of kernel
symbols). But, in the case of a kernel module, opreport
doesn't know
where to find the symbol names so we have to tell it where the kernel
module lives with --image-path
.
$ opreport session:bombard cpu:7 --symbols --image-path ~/nf_ses_watch/kmod | head
warning: /bnx2 could not be found.
warning: /e1000e could not be found.
warning: /ipv6 could not be found.
warning: /oprofile could not be found.
warning: /sbin/auditd could not be read.
CPU: Core 2, speed 2494.04 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 10000
warning: could not check that the binary file /home/mjschultz/mon/module/kmod/nf_ses_watch.ko has not been modified since the profile was taken. Results may be inaccurate.
samples % image name app name symbol name
733996 86.1767 nf_ses_watch.ko nf_ses_watch do_rip_entry
16810 1.9736 e1000e e1000e (no symbols)
10308 1.2102 vmlinux vmlinux rb_get_reader_page
9785 1.1488 vmlinux vmlinux read_hpet
8701 1.0216 vmlinux vmlinux ring_buffer_consume
3606 0.4234 vmlinux vmlinux netif_receive_skb
3530 0.4144 vmlinux vmlinux kfree
(I've piped the output through head
to keep it reasonable.) We can see
the real dirt here! By a huge margin, the do_rip_entry
symbol in my
nf_ses_watch
module executes more than the Ethernet driver that is
handling the raw packets. So that is where I'll be looking when I try to
resolve my bug.
Conclusions
If you are looking to optimize your program, oprofile is a great tool to use. The default event monitor (CPU clock cycles on most processors), can give you an idea of what part of your program is using most of the processor time. Once you know that, you can focus your efforts on reducing the number of cycles spent in that function. But don't forget about all those other events too. If you have a memory intensive application, maybe you could reduce the memory contention and get an effective speedup with almost no refactoring!
(I've tried my best to be accurate with this information and I welcome any explicit corrections or clarifications.)