Demonstrations of execsnoop, the Linux ftrace version.


Here's execsnoop showing what's really executed by "man ls":

# ./execsnoop 
Tracing exec()s. Ctrl-C to end.
   PID   PPID ARGS
 22898  22004 man ls
 22905  22898 preconv -e UTF-8
 22908  22898 pager -s
 22907  22898 nroff -mandoc -rLL=164n -rLT=164n -Tutf8
 22906  22898 tbl
 22911  22910 locale charmap
 22912  22907 groff -mtty-char -Tutf8 -mandoc -rLL=164n -rLT=164n
 22913  22912 troff -mtty-char -mandoc -rLL=164n -rLT=164n -Tutf8
 22914  22912 grotty

Many commands. This is particularly useful for understanding application
startup.


Another use for execsnoop is identifying short-lived processes. Eg, with the -t
option to see timestamps:

# ./execsnoop -t
Tracing exec()s. Ctrl-C to end.
TIMEs               PID   PPID ARGS
7419756.154031     8185   8181 mawk -W interactive -v o=1 -v opt_name=0 -v name= [...]
7419756.154131     8186   8184 cat -v trace_pipe
7419756.245264     8188   1698 ./run
7419756.245691     8189   1696 ./run
7419756.246212     8187   1689 ./run
7419756.278993     8190   1693 ./run
7419756.278996     8191   1692 ./run
7419756.288430     8192   1695 ./run
7419756.290115     8193   1691 ./run
7419756.292406     8194   1699 ./run
7419756.293986     8195   1690 ./run
7419756.294149     8196   1686 ./run
7419756.296527     8197   1687 ./run
7419756.296973     8198   1697 ./run
7419756.298356     8200   1685 ./run
7419756.298683     8199   1688 ./run
7419757.269883     8201   1696 ./run
[...]

So we're running many "run" commands every second. The PPID is included, so I
can debug this further (they are "supervise" processes).

Short-lived processes can consume CPU and not be visible from top(1), and can
be the source of hidden performance issues.


Here's another example: I noticed CPU usage was high in top(1), but couldn't
see the responsible process:

$ top
top - 00:04:32 up 78 days, 15:41,  3 users,  load average: 0.85, 0.29, 0.14
Tasks: 123 total,   1 running, 121 sleeping,   0 stopped,   1 zombie
Cpu(s): 15.7%us, 34.9%sy,  0.0%ni, 49.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Mem:   7629464k total,  7537216k used,    92248k free,  1376492k buffers
Swap:        0k total,        0k used,        0k free,  5432356k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7225 bgregg-t  20   0 29480 6196 2128 S    3  0.1   0:02.64 ec2rotatelogs
    1 root      20   0 24320 2256 1340 S    0  0.0   0:01.23 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      20   0     0    0    0 S    0  0.0   1:19.61 ksoftirqd/0
    4 root      20   0     0    0    0 S    0  0.0   0:00.00 kworker/0:0
    5 root      20   0     0    0    0 S    0  0.0   0:00.01 kworker/u:0
    6 root      RT   0     0    0    0 S    0  0.0   0:16.00 migration/0
    7 root      RT   0     0    0    0 S    0  0.0   0:17.29 watchdog/0
    8 root      RT   0     0    0    0 S    0  0.0   0:15.85 migration/1
    9 root      20   0     0    0    0 S    0  0.0   0:00.00 kworker/1:0
[...]

See the line starting with "Cpu(s):". So there's about 50% CPU utilized (this
is a two CPU server, so that's equivalent to one full CPU), but this CPU usage
isn't visible from the process listing.

vmstat agreed, showing the same average CPU usage statistics:
 
# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0      0  92816 1376476 5432188    0    0     0     3    2    1  0  1 99  0
 1  0      0  92676 1376484 5432264    0    0     0    24 6573 6130 12 38 49  0
 1  0      0  91964 1376484 5432272    0    0     0     0 6529 6097 16 35 49  0
 1  0      0  92692 1376484 5432272    0    0     0     0 6192 5775 17 35 49  0
 1  0      0  92692 1376484 5432272    0    0     0     0 6554 6121 14 36 50  0
 1  0      0  91940 1376484 5432272    0    0     0    12 6546 6101 13 38 49  0
 1  0      0  92560 1376484 5432272    0    0     0     0 6201 5769 15 35 49  0
 1  0      0  92676 1376484 5432272    0    0     0     0 6524 6123 17 34 49  0
 1  0      0  91932 1376484 5432272    0    0     0     0 6546 6107 10 40 49  0
 1  0      0  92832 1376484 5432272    0    0     0     0 6057 5710 13 38 49  0
 1  0      0  92248 1376484 5432272    0    0    84    28 6592 6183 16 36 48  1
 1  0      0  91504 1376492 5432348    0    0     0    12 6540 6098 18 33 49  1
[...]

So this could be caused by short-lived processes, who vanish before they are
seen by top(1). Do I have my execsnoop handy? Yes:

# ~/perf-tools/bin/execsnoop 
Tracing exec()s. Ctrl-C to end.
   PID   PPID ARGS
 10239  10229 gawk -v o=0 -v opt_name=0 -v name= -v opt_duration=0 [...]
 10240  10238 cat -v trace_pipe
 10242   7225 sh [?]
 10243  10242 /usr/sbin/lsof -X /logs/tomcat/cores/threaddump.20141215.201201.3122.txt
 10245   7225 sh [?]
 10246  10245 /usr/sbin/lsof -X /logs/tomcat/cores/threaddump.20141215.202201.3122.txt
 10248   7225 sh [?]
 10249  10248 /usr/sbin/lsof -X /logs/tomcat/cores/threaddump.20141215.203201.3122.txt
 10251   7225 sh [?]
 10252  10251 /usr/sbin/lsof -X /logs/tomcat/cores/threaddump.20141215.204201.3122.txt
 10254   7225 sh [?]
 10255  10254 /usr/sbin/lsof -X /logs/tomcat/cores/threaddump.20141215.205201.3122.txt
 10257   7225 sh [?]
 10258  10257 /usr/sbin/lsof -X /logs/tomcat/cores/threaddump.20141215.210201.3122.txt
 10260   7225 sh [?]
 10261  10260 /usr/sbin/lsof -X /logs/tomcat/cores/threaddump.20141215.211201.3122.txt
 10263   7225 sh [?]
 10264  10263 /usr/sbin/lsof -X /logs/tomcat/cores/threaddump.20141215.212201.3122.txt
 10266   7225 sh [?]
 10267  10266 /usr/sbin/lsof -X /logs/tomcat/cores/threaddump.20141215.213201.3122.txt
[...]

The output scrolled quickly, showing that many shell and lsof processes were
being launched. If you check the PID and PPID columns carefully, you can see that
these are ultimately all from PID 7225. We saw that earlier in the top output:
ec2rotatelogs, at 3% CPU. I now know the culprit.

I should have used "-t" to show the timestamps with this example.


Run -h to print the USAGE message:

# ./execsnoop -h
USAGE: execsnoop [-hrt] [-a argc] [-d secs] [name]
                 -d seconds      # trace duration, and use buffers
                 -a argc         # max args to show (default 8)
                 -r              # include re-execs
                 -t              # include time (seconds)
                 -h              # this usage message
                 name            # process name to match (REs allowed)
  eg,
       execsnoop                 # watch exec()s live (unbuffered)
       execsnoop -d 1            # trace 1 sec (buffered)
       execsnoop grep            # trace process names containing grep
       execsnoop 'log$'          # filenames ending in "log"

See the man page and example file for more info.