As part of the Seer research project, I collected a large body of traces of user behavior on mobile machines. The traces consist of about half a gigabyte of compressed (gzipped) trace files that record all file activity, except reads and writes, of nine different users who were working both connected and disconnected on Linux-based laptop computers. The trace lengths range from about 1 to about 6 months.
This Web page provides access to the individual traces, and describes the trace format. All traces have been "sanitized" by removing user names and private pathnames, but most pathnames have been left untouched (it turns out to be important for many research purposes to know what program a user was running, or to have an idea of the type of file being accessed).
The traces themselves are currently available in a gzipped binary format. The binary format is not yet fully documented (I'm working on it, but it's a slow process).
In the meantime, Linux/i386 users can download a program called dumpobs that will read the compressed binary format and produce ASCII output. This output can then be processed to your heart's content. (If the version above doesn't work, try this alternate version. The alternate is known to be required on RedHat 7.1. I don't know if it runs on other systems.)
Dumpobs is a normal Unix program. That means that it can read a file from standard input or accept multiple filenames on the command line. Files provided on stdin must be uncompressed, but if a file specified on the command line ends with ".gz" then it will be uncompressed automatically.
Two switches are accepted. The -n switch causes the name of the file being dumped to be prepended to each output line, followed by a colon in the manner of grep. This can be useful if you are searching for a particular record in a lot of files. The -o n switch specifies a starting value other than zero for the record offset (see below).
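As a rough illustration of the invocation rules above, here is a small Python sketch that builds a dumpobs command line. It assumes that dumpobs is installed somewhere on your PATH and that the -o value is passed as a separate argument; the helper names dumpobs_argv and run_dumpobs are hypothetical and not part of the distribution.

```python
import subprocess


def dumpobs_argv(files, show_names=False, start_offset=None):
    """Build an argument list for dumpobs (assumed to be on PATH)."""
    argv = ["dumpobs"]
    if show_names:
        # -n prepends "filename:" to each output line, grep-style.
        argv.append("-n")
    if start_offset is not None:
        # -o n starts the record offsets at n instead of zero.
        # Assumption: the value is given as a separate argument.
        argv += ["-o", str(start_offset)]
    # Names ending in ".gz" are uncompressed by dumpobs itself.
    argv += list(files)
    return argv


def run_dumpobs(files, **options):
    """Run dumpobs and return its ASCII output as a list of lines."""
    result = subprocess.run(
        dumpobs_argv(files, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

Keeping the argument construction separate from the actual run makes it easy to check the command line before launching anything.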
The ASCII output of dumpobs contains three types of lines. All three types begin with a decimal number that represents the offset of the record from the start of the input file.
The record type is identified by the second field, which will be one of RESTART, TIMESTAMP, or UID.
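Since every line starts with the offset followed by the record type, a first processing pass can simply split off those two fields. The following Python sketch does that; it assumes dumpobs was run without -n (so there is no "filename:" prefix), and the name split_record is made up for illustration.

```python
RECORD_TYPES = ("RESTART", "TIMESTAMP", "UID")


def split_record(line):
    """Return (offset, record_type, rest) for one dumpobs output line."""
    # Split at most twice so the remainder of the line stays intact.
    offset_text, record_type, *rest = line.strip().split(None, 2)
    if record_type not in RECORD_TYPES:
        raise ValueError(f"unknown record type: {record_type!r}")
    return int(offset_text), record_type, rest[0] if rest else ""
```

The remainder is returned unparsed because its layout depends on the record type.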
The RESTART record occurs whenever the Seer system was restarted, usually due to a reboot of the machine being traced. It is formatted as follows:
offset RESTART (type) at timestamp ascii-time
where type is either scratch or reexec (in practice, the latter never appears in the trace files), and ascii-time is the timestamp in ctime format. If you want the timestamp expressed in the timezone used to collect the traces, run dumpobs with the TZ environment variable set to PST8PDT.
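A RESTART line can be picked apart with a regular expression that follows the format shown above. This is only a sketch: the names parse_restart and trace_timezone_env are hypothetical, the timestamp is kept as text rather than interpreted, and the second helper merely prepares an environment with TZ set to PST8PDT for launching dumpobs.

```python
import os
import re

# offset RESTART (type) at timestamp ascii-time
RESTART_RE = re.compile(
    r"^(?P<offset>\d+)\s+RESTART\s+\((?P<type>\w+)\)\s+at\s+"
    r"(?P<timestamp>\S+)\s+(?P<ascii_time>.+)$"
)


def parse_restart(line):
    """Return the fields of a RESTART line as a dict, or None if no match."""
    match = RESTART_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["offset"] = int(record["offset"])
    return record


def trace_timezone_env():
    """Copy the current environment with TZ set to the collection timezone."""
    return dict(os.environ, TZ="PST8PDT")
```

The dictionary from trace_timezone_env can be passed as the env argument of subprocess.run when starting dumpobs.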
The TIMESTAMP record is generally produced once every ten minutes while the traced computer is up and running. It also (usually) appears at the beginning of every trace file. It is formatted as follows:
offset TIMESTAMP timestamp ascii-time
where offset, timestamp, and ascii-time are all interpreted as for the RESTART record above.
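Because TIMESTAMP records normally appear every ten minutes, an unusually long interval between two consecutive ones may indicate a period when the machine was not running. Here is a minimal sketch for spotting such intervals. It assumes the timestamp field is an integer number of seconds, which this page does not state outright; the function name and the 60-second slack are arbitrary choices of mine.

```python
def timestamp_gaps(lines, expected=600, slack=60):
    """Yield (previous, current) TIMESTAMP values that are too far apart."""
    previous = None
    for line in lines:
        fields = line.split()
        # Only lines of the form "offset TIMESTAMP timestamp ..." matter here.
        if len(fields) < 3 or fields[1] != "TIMESTAMP":
            continue
        current = int(fields[2])  # assumption: integer seconds
        if previous is not None and current - previous > expected + slack:
            yield previous, current
        previous = current
```

Other record types are skipped, so the function can be fed the complete dumpobs output.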
The final record type is the actual trace record, identified by a second field equal to UID. It is formatted as follows:
offset UID uid PID pid program flag timestamp.microseconds call(args) = result (error)
where
exit, rename, unlink, and rmdir were traced.
dumpobs was unable to determine the program name.
flag is one of B, A, S, or G. B and A indicate that the trace was taken before or after actual execution of the system call, respectively. Most traces are taken after the call completes, but exec calls are traced both before and after the call. S indicates that the call was traced through certain special hooks in the kernel, and applies only to fork and exit calls; this is of limited value to most users. Finally, G indicates that the system call is a "fake" one generated internally by the Seer system. This is used to generate chdir indications that will correctly reflect the working directory of a process when it first appears in the trace.
open are given symbolically. Note that sometimes arguments are abbreviated, truncated, or "faked"; see below for more information.
errno.h). Note that in some of the early trace files, failed system calls were not recorded, so this information is not always available.
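The UID record format above can be matched with a single regular expression. The sketch below is my own guess at a workable pattern, not a specification: it assumes the program name contains no whitespace, that uid and pid are printed in decimal, and that the parenthesized error is simply absent when the call succeeded. The name parse_uid is hypothetical.

```python
import re

# offset UID uid PID pid program flag timestamp.microseconds call(args) = result (error)
UID_RE = re.compile(
    r"^(?P<offset>\d+)\s+UID\s+(?P<uid>\d+)\s+PID\s+(?P<pid>\d+)\s+"
    r"(?P<program>\S+)\s+(?P<flag>[BASG])\s+"
    r"(?P<seconds>\d+)\.(?P<microseconds>\d+)\s+"
    r"(?P<call>\w+)\((?P<args>.*)\)\s+=\s+(?P<result>\S+)"
    r"(?:\s+\((?P<error>[^)]*)\))?\s*$"
)

INT_FIELDS = ("offset", "uid", "pid", "seconds", "microseconds")


def parse_uid(line):
    """Return the fields of a UID trace line as a dict, or None if no match."""
    match = UID_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    for name in INT_FIELDS:
        record[name] = int(record[name])
    # args, result and error stay as text; their form depends on the call.
    return record
```

Lines that do not fit the pattern yield None rather than an exception, so oddly formatted records can be collected and inspected separately.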
As mentioned above, some system calls are traced with only limited argument information, and others are traced with "fake" arguments or return values. These include:
O_CREAT, O_TRUNC, and O_EXCL flags are displayed in addition to the open mode. The full flags are available in the binary trace file, however.
stat (e.g., file size and times) are not available even in the binary trace file.
F_DUPFD, F_GETFD, F_SETFD, F_GETOWN, F_SETOWN, the file descriptor involved is shown in decimal as the third argument. For other calls, the third argument is given in hex.
Except as noted above, the omitted arguments are missing from the binary trace file as well as the dumpobs output.
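The decimal-versus-hex rule for the third argument can be captured in a few lines. This sketch assumes the commands listed above are the symbolic fcntl command names as they appear in the output, and that "other calls" means the remaining fcntl commands; the function name is hypothetical.

```python
DECIMAL_COMMANDS = {"F_DUPFD", "F_GETFD", "F_SETFD", "F_GETOWN", "F_SETOWN"}


def third_argument(command, text):
    """Convert the third argument to an int using the documented base."""
    # Listed commands show the value in decimal; all others use hex.
    base = 10 if command in DECIMAL_COMMANDS else 16
    return int(text, base)
```

Python's int accepts hex text with or without a leading "0x" when the base is 16, so either spelling is handled.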
The traces are stored in a gzipped binary format. Each trace record is described by the following "C" structure:
typedef struct {
    unsigned char   callcode;   /* System call code */
    unsigned char   errcode;    /* Error return (errno) from system call */
    unsigned short  flags;      /* Flags associated with entry */
    short           argsize;    /* Size of arguments, ints */
    uid_t           uid;        /* UID of process */
    pid_t           pid;        /* PID being traced */
    int             retval;     /* Return value from call, if success */
    struct timeval  calltime;   /* Time call was recorded */
    int             args[1];    /* Argument[s] to system call */
} seerbuf_t;
With certain exceptions, each record represents a system call by one user. The records appear in the order they were seen by the Linux kernel.
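For readers who want to experiment with the binary files directly, here is a tentative Python sketch that decodes one record with the struct module. The layout is a guess based on the structure above: little-endian i386, 32-bit uid_t, pid_t, int and timeval members, two padding bytes after argsize, and argsize counting 4-byte ints. None of this is confirmed by this page, so check the results against dumpobs output before relying on them. The names HEADER, FIELDS and unpack_record are my own.

```python
import struct

# Assumed layout (unverified): little-endian, 32-bit uid_t/pid_t/int/timeval
# members, and two padding bytes after argsize for 4-byte alignment.
HEADER = struct.Struct("<BBHh2xIiiii")
FIELDS = ("callcode", "errcode", "flags", "argsize",
          "uid", "pid", "retval", "tv_sec", "tv_usec")


def unpack_record(data, position=0):
    """Decode one seerbuf_t at `position`; return (record, next_position)."""
    record = dict(zip(FIELDS, HEADER.unpack_from(data, position)))
    args_start = position + HEADER.size
    # argsize is described as the argument size in ints (assumed 4 bytes each).
    args_end = args_start + 4 * record["argsize"]
    # Arguments are left as raw bytes; their meaning depends on callcode.
    record["args"] = data[args_start:args_end]
    return record, args_end
```

To walk a whole trace, one could read the file with gzip.open(path, "rb").read() and call unpack_record repeatedly, feeding each returned position back in until the end of the data is reached.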
The fields in the structure have the following meaning:
callcode
errcode
flags
field below.
This page maintained by Geoff Kuenning.