Tips 6

Why Profile?

Profiling is essential. Algorithmic analysis and run-time complexity are good first approximations for how code will perform, but optimization has to start with measurement. Accurate data about how the code actually runs is the only reliable benchmark.

Otherwise, how do you even know if your optimization worked or not?

The standard way to optimize is to profile your code, find a portion (usually a single function) that is taking far too long, rewrite it, and then profile once more. You repeat this cycle until you are satisfied with the performance.

I’ll say it again: theory alone is meaningless. Until you profile your code, a claim about performance carries no weight.

With that in mind, we’ll be studying GProf, a well-known C profiler.

Mac Users

GProf does not work on Apple platforms. Apple strongly prefers that you use Shark instead, which comes built in with the Developer Tools.

GProf

How do we use gprof?

You have to compile your source code with the -pg option.

-pg : Generate extra code to write profile information suitable for the analysis program gprof. You must use this option when compiling the source files you want data about, and you must also use it when linking. – GCC man page

I used gcc -pg -Wall -pedantic -std=c99 blah.c -o blah for this example. After running the ./blah executable, a new file gmon.out will be created (regardless of your program’s name, the file is always called gmon.out).

Time to use GProf:

gprof [executable name] gmon.out > [output file name].txt
gprof sleep gmon.out > analysis.txt

The analysis will be long. One important aspect is the flat profile. It will look something like this:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 33.34      0.02     0.02     7208     0.00     0.00  open
 16.67      0.03     0.01      244     0.04     0.12  offtime
 16.67      0.04     0.01        8     1.25     1.25  memccpy
 16.67      0.05     0.01        7     1.43     1.43  write
 16.67      0.06     0.01                             mcount
  0.00      0.06     0.00      236     0.00     0.00  tzset
  0.00      0.06     0.00      192     0.00     0.00  tolower
  0.00      0.06     0.00       47     0.00     0.00  strlen
  0.00      0.06     0.00       45     0.00     0.00  strchr
  0.00      0.06     0.00        1     0.00    50.00  main
  0.00      0.06     0.00        1     0.00     0.00  memcpy
  0.00      0.06     0.00        1     0.00    10.11  print
  0.00      0.06     0.00        1     0.00     0.00  profil
  0.00      0.06     0.00        1     0.00    50.00  report

The time column is a percentage, and the column should sum to 100%. As you can imagine, you want to target the big-ticket items. For example, the program spends a third of its time in the open function; if you wrote this program, you would focus on reducing that time and ignore something like report, which takes up virtually no time at all. The key is focusing on the functions that yield the greatest marginal speedup.

Another useful component is the call graph, which shows how much time was spent in each function and its children. Even if a function did not take much time by itself, it will show up as a culprit in the call graph if its children were very inefficient.

Here is a sample call graph.

granularity: each sample hit covers 2 byte(s) for 20.00% of 0.05 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]    100.0    0.00    0.05                 start [1]
                0.00    0.05       1/1           main [2]
                0.00    0.00       1/2           on_exit [28]
                0.00    0.00       1/1           exit [59]
-----------------------------------------------
                0.00    0.05       1/1           start [1]
[2]    100.0    0.00    0.05       1         main [2]
                0.00    0.05       1/1           report [3]
-----------------------------------------------
                0.00    0.05       1/1           main [2]
[3]    100.0    0.00    0.05       1         report [3]
                0.00    0.03       8/8           timelocal [6]
                0.00    0.01       1/1           print [9]
                0.00    0.01       9/9           fgets [12]
                0.00    0.00      12/34          strncmp <cycle 1> [40]
                0.00    0.00       8/8           lookup [20]
                0.00    0.00       1/1           fopen [21]
                0.00    0.00       8/8           chewtime [24]
                0.00    0.00       8/16          skipspace [44]
-----------------------------------------------
[4]     59.8    0.01    0.02       8+472      <cycle 2 as a whole> [4]
                0.01    0.02     244+260          offtime <cycle 2> [7]
                0.00    0.00       236+1          tzset <cycle 2> [26]
-----------------------------------------------

The line at the top, index % time self children called name, is the column header. Within each entry, the line that begins with the bracketed index number (for example, [2] 100.0 0.00 0.05 1 main [2]) is known as the primary line; it describes the function that the entry is about, while the lines above and below it list that function’s callers and children.

There are many resources that go through all the intricacies of interpreting gprof’s output; I’ll leave that to the resources section below.

The last thing I want to cover is running GProf multiple times (especially for CPU-bound programs). Because GProf works by statistical sampling, any single run is subject to inaccuracy; aggregating several runs gives more trustworthy numbers.

Once you’ve produced a gmon.out file, move it to gmon.sum; then, run your program again and merge the new gmon.out using gprof -s [executable-name] gmon.out gmon.sum.

Do this several times until you are satisfied, and then you can produce the final text file with gprof [executable-name] gmon.sum > [textfile name].

Resources

Official GProf documentation

Handy tutorial

Very short tutorial

Lengthy tutorial