CSE 522S - Advanced Operating Systems

CSE 522S: Studio 4

Tracing the Linux Kernel

"Yes," he said, "they are quite plain: a hobbit's footprints. Pippin's I think. He is smaller than the other. And look at this!" He held up a thing that glittered in the sunlight. It looked like the new-opened leaf of a beech-tree, fair and strange in that treeless plain.

"The brooch of an elven-cloak!" cried Legolas and Gimli together.

"Not idly do the leaves of Lórien fall," said Aragorn. "This did not drop by chance: it was cast away as a token to any that might follow."

—The Two Towers, Book III, Chapter 2

One reality of working with the Linux kernel is that traditional troubleshooting tools such as debuggers are much less accessible. Even the time honored printk function may be too heavyweight to be used in time sensitive code paths such as interrupt handlers and scheduler code. However, there are very robust tracing utilities available to the kernel developer that allow you to record system behavior and review it after execution.

In this studio, you will:

Use the Linux system call tracer strace
Use the Linux kernel tracer (ftrace) via two utilities called trace-cmd and kernelshark

Please complete the required exercises below, as well as any optional enrichment exercises that you wish to complete.

As you work through these exercises, please record your answers, and when finished email your results to dferry@email.wustl.edu with the phrase Kernel Tracing in the subject line.

Make sure that the name of each person who worked on these exercises is listed in the first answer, and make sure you number each of your responses so it is easy to match your responses with each exercise.

Required Exercises

As the answer to the first exercise, list the names of the people who worked together on this studio.
First, write a C program that only returns zero (i.e. does nothing). We'll also be using the test programs we used in previous studios. You can download these programs here. Leave this answer blank.
Execute the system call tracer on your do-nothing program with a command such as "strace ./do_nothing". This shows you all of the system calls that are made in the process of executing the program (in this case, all the effort it takes to simply do nothing). You can get information about interpreting the output of strace on the manual page.

Pick out two system calls and describe what you think they are accomplishing in this context. Every system call listed here has a manual page entry that will help you decode what is going on. In this case, you should specify the second chapter of the man pages, which covers system calls, otherwise it's possible you might get the wrong page. For example, "man write" will bring up the documentation for the program called write, while "man 2 write" gives you the documentation for the system call named write. See the manual pages documentation (i.e. "man man") for details.
The system call tracer lets us see how our program's commands are eventually implemented by intermediate libraries such as the C standard library. Use strace to trace the sing program from last time. What system call is invoked by the call to printf?
This tracer can also help you understand implementation details of libraries (as well as program binaries that you lack the source to). Look at the trace results of "strace ./dense_mm 100" and compare that to the results of "strace ./dense_mm 300". What is different about the execution of the program? Which two system calls are involved? What do you think is the motivation for using one over the other?
Write a short program that uses the C standard library function malloc. Allocate ten memory regions of 100 bytes each. Profile this program with strace, how are those ten calls to malloc translated to system calls?
Modify your program to allocate ten memory regions of 1,000,000 (one million) bytes each. Profile it with strace, how are these calls translated to system calls?
What do the previous two answers tell you about the implementation of malloc in the standard C library?
Now we're going to switch to a different tracing utility. The tracer itself is called ftrace (short for Function Tracer) and is a part of the Linux kernel. We will be using two utilities called trace-cmd and Kernelshark to access the function tracer.

Install these utilites with the command "sudo apt-get install trace-cmd kernelshark" and leave this answer blank.
This tracer is extraordinarily powerful, and a full description is beyond the scope of this assignment. However, to generate your first trace, execute the command "sudo trace-cmd record -e sched_switch ./dense_mm 100". Make sure that in the output for each CPU there are zero dropped events.

To explain that command a little further, trace-cmd is the tracing utility that lets us work with traces. The record token specifies that we want to record a new trace. The option -e sched_switch specifies that the event we want to trace is invocations to the kernel function sched_switch (which is the function responsible for switching processes). Finally, the program to record is given at the end. As a result, a file called "trace.dat" will be generated. For more information, you can see the man pages at "man trace-cmd" and "man trace-cmd-record".

Leave this question blank.
Now we will use Kernelshark to inspect our trace. Note: the questions from here on require a GUI. Open the trace with the command "kernelshark trace.dat". By default you will be looking at a timeline of each CPU in the system. Each process in the system will be given a unique color so you can track individual processes as they are scheduled on and off of processors as well as migrated between processors.

Start by zooming in on the trace until you can make out discrete events. To zoom in: press and hold the left mouse button; drag the cursor to the right; and then release to define a zoom window. Zooming out is the reverse: press and hold the left mouse button; drag the cursor to the left; and then release the mouse button.

Try zooming around your trace. Leave this answer blank.
We can also enable a process-centric view rather than a CPU-centric view. Go to the Plots menu, select Tasks, and then find the process dense_mm and activate it. Scroll down or enlarge the viewing window until you see the process timeline at the bottom. This timeline only shows the activity of this process, and different colors represent execution time on different processors. Red boxes on this timeline represent time where this task was not scheduled on any processor. Leave this answer blank.
You can use the CPU and task timelines to see exactly how your process executed over its lifetime. What tasks preempt your process (dense_mm)? If you zoom in to where you can see discrete events, you can mouse over those events to see exactly which processes are switched.
One of the CPUs is likely to have spent most of its time servicing your process. Look at the other three CPUs. What do they spend their time doing?
Is the frequency of sched_switch events the same across all processors at all points in time? What do you think is happening?
Odds are that dense_mm doesn't create a really interesting trace because it can only use one processor at a time. Generate a new parallel program trace with the command "sudo trace-cmd record -e sched_switch ./parallel_dense_mm 200" and open it with Kernelshark. Identify a spot in the timeline where you feel the system is not using CPU resources efficiently, what do you think is happening?

Optional Enrichment Exercises

The system call tracer strace can be used on any binary. Try using it on common programs such as ls, touch, or su. You can even attach it to running programs such as the ssh authentication daemon with the -p option.

This is a great first step in reverse engineering a binary or looking for security vulnerabilities as well, though we don't cover that in this course.
The list option can be used with trace-cmd to query ftrace about what kinds of tracers are currently available on the system. The command "sudo trace-cmd list -e" will print out all available events. You can narrow the possible events with simple pattern matching: the command "sudo trace-cmd list -e sched" will list all events with "sched" in their name, which includes every event in the scheduler subsystem as well as a few others. See the man page at "man trace-cmd-list" for more info.
Ftrace contains many different tracers that let you measure different aspects of your system. For example, you can measure total kernel stack size, how long it takes interrupts to run, how long interrupts and kernel preemptions delay your processes, and more. The combination of Ftrace, trace-cmd, and Kernelshark is extremely powerful. Try searching the documentation or online to find examples for doing all of the above, as well as other neat things people have done.