A Fast File System for UNIX*
Marshall Kirk McKusick, William N. Joy+,
Samuel J. Leffler*, Robert S. Fabry
Computer Systems Research Group
Computer Science Division
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, CA 94720
ABSTRACT
A reimplementation of the UNIX file system is
described. The reimplementation provides substan-
tially higher throughput rates by using more flex-
ible allocation policies that allow better local-
ity of reference and can be adapted to a wide
range of peripheral and processor characteristics.
The new file system clusters data that is sequen-
tially accessed and provides two block sizes to
allow fast access to large files while not wasting
large amounts of space for small files. File
access rates of up to ten times faster than the
traditional UNIX file system are experienced.
Long needed enhancements to the programmers'
interface are discussed. These include a mecha-
nism to place advisory locks on files, extensions
of the name space across file systems, the ability
to use long file names, and provisions for admin-
istrative control of resource usage.
Revised February 18, 1984
----------- * UNIX is a trademark of Bell Laboratories. +
William N. Joy is currently employed by: Sun Microsystems,
Inc, 2550 Garcia Avenue, Mountain View, CA 94043 * Samuel J.
Leffler is currently employed by: Lucasfilm Ltd., PO Box 2009,
San Rafael, CA 94912 This work was done under grants from the
National Science Foundation under grant MCS80-05144, and the
Defense Advance Research Projects Agency (DoD) under ARPA Order
No. 4031 monitored by Naval Elec- tronic System Command un-
der Contract No. N00039-82-C-0235.
SMM:05-2 A Fast File System for UNIX
CR Categories and Subject Descriptors: D.4.3 [Operating Sys-
tems]: File Systems Management - file organization, direc- tory
structures, access methods; D.4.2 [Operating Systems]: Storage
Management - allocation/deallocation strategies, secondary
storage devices; D.4.8 [Operating Systems]: Per- formance -
measurements, operational analysis; H.3.2 [Infor- mation
Systems]: Information Storage - file organization
Additional Keywords and Phrases: UNIX, file system organiza-
tion, file system performance, file system design, applica- tion
program interface.
General Terms: file system, measurement, performance.
A Fast File System for UNIX SMM:05-3
TABLE OF CONTENTS
1. Introduction
2. Old file system
3. New file system organization 3.1. Optimizing storage uti-
lization 3.2. File system parameterization 3.3. Layout
policies
4. Performance
5. File system functional enhancements 5.1. Long file names
5.2. File locking 5.3. Symbolic links 5.4. Rename
5.5. Quotas
Acknowledgements
References
1. Introduction
This paper describes the changes from the original 512 byte
UNIX file system to the new one released with the 4.2 Berkeley
Software Distribution. It presents the motivations for the
changes, the methods used to effect these changes, the rationale
behind the design decisions, and a description of the new imple-
mentation. This discussion is followed by a summary of the re-
sults that have been obtained, directions for future work,
and the additions and changes that have been made to the facili-
ties that are available to program- mers.
The original UNIX system that runs on the PDP-11+ has sim-
ple and elegant file system facilities. File system in-
put/output is buffered by the kernel; there are no align- ment
constraints on data transfers and all operations are made to
appear synchronous. All transfers to the disk are in 512
byte blocks, which can be placed arbitrarily within the data area
of the file system. Virtually no constraints other than
available disk space are placed on file growth [Ritchie74],
[Thompson78].*
----------- + DEC, PDP, VAX, MASSBUS, and UNIBUS are trade-
marks of Digital Equipment Corporation. * In practice, a file's
size is constrained to be less than about one gigabyte.
SMM:05-4 A Fast File System for UNIX
When used on the VAX-11 together with other UNIX en-
hancements, the original 512 byte UNIX file system is inca-
pable of providing the data throughput rates that many applica-
tions require. For example, applications such as VLSI de-
sign and image processing do a small amount of pro- cessing on a
large quantities of data and need to have a high throughput
from the file system. High throughput rates are also needed by
programs that map files from the file system into large
virtual address spaces. Paging data in and out of the file sys-
tem is likely to occur frequently [Ferrin82b]. This re-
quires a file system providing higher bandwidth than the original
512 byte UNIX one that provides only about two percent of
the maximum disk bandwidth or about 20 kilobytes per second per
arm [White80], [Smith81b].
Modifications have been made to the UNIX file system to im-
prove its performance. Since the UNIX file system inter- face
is well understood and not inherently slow, this devel- opment
retained the abstraction and simply changed the underlying
implementation to increase its throughput. Con- sequently, users
of the system have not been faced with mas- sive software conver-
sion.
Problems with file system performance have been dealt with
extensively in the literature; see [Smith81a] for a survey.
Previous work to improve the UNIX file system per- formance has
been done by [Ferrin82a]. The UNIX operating system drew
many of its ideas from Multics, a large, high performance oper-
ating system [Feiertag71]. Other work includes Hydra
[Almes78], Spice [Thompson80], and a file system for a LISP en-
vironment [Symbolics81]. A good intro- duction to the physi-
cal latencies of disks is described in [Pechura83].
2. Old File System
In the file system developed at Bell Laboratories (the
``traditional'' file system), each disk drive is divided into
one or more partitions. Each of these disk partitions may con-
tain one file system. A file system never spans mul- tiple
partitions.+ A file system is described by its super- block,
which contains the basic parameters of the file sys- tem.
These include the number of data blocks in the file -----------
+ By ``partition'' here we refer to the subdivi- sion of
physical space on a disk drive. In the traditional file system,
as in the new file sys- tem, file systems are really located
in logical disk partitions that may overlap. This overlap-
ping is made available, for example, to allow pro- grams to copy
entire disk drives containing multi- ple file systems.
A Fast File System for UNIX SMM:05-5
system, a count of the maximum number of files, and a
pointer to the free list, a linked list of all the free
blocks in the file system.
Within the file system are files. Certain files are dis-
tinguished as directories and contain pointers to files that
may themselves be directories. Every file has a descrip-
tor associated with it called an inode. An inode contains
information describing ownership of the file, time stamps mark-
ing last modification and access times for the file, and an ar-
ray of indices that point to the data blocks for the file.
For the purposes of this section, we assume that the first 8
blocks of the file are directly referenced by values stored in
an inode itself*. An inode may also contain references to
indirect blocks containing further data block indices. In a
file system with a 512 byte block size, a singly indirect
block contains 128 further block addresses, a doubly indirect
block contains 128 addresses of further singly indirect blocks,
and a triply indirect block contains 128 addresses of further
doubly indirect blocks.
A 150 megabyte traditional UNIX file system consists of 4
megabytes of inodes followed by 146 megabytes of data. This
organization segregates the inode information from the data;
thus accessing a file normally incurs a long seek from the file's
inode to its data. Files in a single directory are not typi-
cally allocated consecutive slots in the 4 megabytes of in-
odes, causing many non-consecutive blocks of inodes to be ac-
cessed when executing operations on the inodes of several
files in a directory.
The allocation of data blocks to files is also subopti- mum.
The traditional file system never transfers more than 512 bytes
per disk transaction and often finds that the next sequential
data block is not on the same cylinder, forcing seeks between
512 byte transfers. The combination of the small block size,
limited read-ahead in the system, and many seeks severely limits
file system throughput.
The first work at Berkeley on the UNIX file system at-
tempted to improve both reliability and throughput. The relia-
bility was improved by staging modifications to criti- cal file
system information so that they could either be completed or
repaired cleanly by a program after a crash [Kowalski78].
The file system performance was improved by a factor of more than
two by changing the basic block size from 512 to 1024
bytes. The increase was because of two factors: each disk trans-
fer accessed twice as much data, and most files could be de-
scribed without need to access indi- rect blocks since the direct
blocks contained twice as much data. The file system with these
changes will henceforth be ----------- * The actual number may
vary from system to sys- tem, but is usually in the range 5-13.
SMM:05-6 A Fast File System for UNIX
referred to as the old file system.
This performance improvement gave a strong indication that
increasing the block size was a good method for improv- ing
throughput. Although the throughput had doubled, the old
file system was still using only about four percent of the disk
bandwidth. The main problem was that although the free list
was initially ordered for optimal access, it quickly became
scrambled as files were created and removed. Eventually the
free list became entirely random, causing files to have their
blocks allocated randomly over the disk. This forced a seek be-
fore every block access. Although old file systems provided
transfer rates of up to 175 kilobytes per second when they were
first created, this rate deterio- rated to 30 kilobytes per sec-
ond after a few weeks of moder- ate use because of this random-
ization of data block place- ment. There was no way of restoring
the performance of an old file system except to dump, re-
build, and restore the file system. Another possibility,
as suggested by [Maruyama76], would be to have a process
that periodically reorganized the data on the disk to restore lo-
cality.
3. New file system organization
In the new file system organization (as in the old file sys-
tem organization), each disk drive contains one or more file
systems. A file system is described by its super- block,
located at the beginning of the file system's disk partition.
Because the super-block contains critical data, it is replicated
to protect against catastrophic loss. This is done when the file
system is created; since the super- block data does not
change, the copies need not be refer- enced unless a head crash
or other hard disk error causes the default super-block to be
unusable.
To insure that it is possible to create files as large as
232 bytes with only two levels of indirection, the mini- mum
size of a file system block is 4096 bytes. The size of file sys-
tem blocks can be any power of two greater than or equal to
4096. The block size of a file system is recorded in the file
system's super-block so it is possible for file systems with
different block sizes to be simultaneously accessible on the
same system. The block size must be decided at the time
that the file system is created; it can- not be subsequently
changed without rebuilding the file sys- tem.
The new file system organization divides a disk parti- tion
into one or more areas called cylinder groups. A cylinder
group is comprised of one or more consecutive cylinders on a
disk. Associated with each cylinder group is some bookkeeping
information that includes a redundant copy
A Fast File System for UNIX SMM:05-7
of the super-block, space for inodes, a bit map describing
available blocks in the cylinder group, and summary informa- tion
describing the usage of data blocks within the cylinder group.
The bit map of available blocks in the cylinder group re-
places the traditional file system's free list. For each cylin-
der group a static number of inodes is allocated at file system
creation time. The default policy is to allocate one in-
ode for each 2048 bytes of space in the cylinder group, ex-
pecting this to be far more than will ever be needed.
All the cylinder group bookkeeping information could be
placed at the beginning of each cylinder group. However if this
approach were used, all the redundant information would be on the
top platter. A single hardware failure that destroyed the
top platter could cause the loss of all redun- dant copies of the
super-block. Thus the cylinder group bookkeeping informa-
tion begins at a varying offset from the beginning of the cylin-
der group. The offset for each suc- cessive cylinder group
is calculated to be about one track further from the beginning of
the cylinder group than the preceding cylinder group. In
this way the redundant infor- mation spirals down into the pack
so that any single track, cylinder, or platter can be lost
without losing all copies of the super-block. Except for the
first cylinder group, the space between the beginning of the
cylinder group and the beginning of the cylinder group informa-
tion is used for data blocks.+
3.1. Optimizing storage utilization
Data is laid out so that larger blocks can be trans-
ferred in a single disk transaction, greatly increasing file sys-
tem throughput. As an example, consider a file in the new
file system composed of 4096 byte data blocks. In the old
file system this file would be composed of 1024 byte
----------- + While it appears that the first cylinder group
could be laid out with its super-block at the ``known'' lo-
cation, this would not work for file systems with blocks sizes
of 16 kilobytes or greater. This is because of a require-
ment that the first 8 kilobytes of the disk be reserved for a
bootstrap program and a separate requirement that the cylinder
group information begin on a file system block boundary. To
start the cylinder group on a file system block boundary, file
sys- tems with block sizes larger than 8 kilobytes would
have to leave an empty space between the end of the boot block
and the beginning of the cylin- der group. Without knowing the
size of the file system blocks, the system would not know
what roundup function to use to find the beginning of the first
cylinder group.
SMM:05-8 A Fast File System for UNIX
blocks. By increasing the block size, disk accesses in the new
file system may transfer up to four times as much infor- mation
per disk transaction. In large files, several 4096 byte
blocks may be allocated from the same cylinder so that even larg-
er data transfers are possible before requiring a seek.
The main problem with larger blocks is that most UNIX file
systems are composed of many small files. A uniformly large
block size wastes space. Table 1 shows the effect of file system
block size on the amount of wasted space in the file system.
The files measured to obtain these figures reside on one of our
time sharing systems that has roughly 1.2 gigabytes of on-
line storage. The measurements are based on the active user
file systems containing about 920 megabytes of formatted space.
+------------+---------+--------------------------------------------------+
|Space used | % waste | Organization
|
+------------+---------+--------------------------------------------------+
| 775.2 Mb | 0.0 | Data only, no separation between files
| | 807.8 Mb | 4.2 | Data only, each file starts on 512
byte boundary | | 828.7 Mb | 6.9 | Data + inodes, 512 byte
block UNIX file system | | 866.5 Mb | 11.8 | Data + in-
odes, 1024 byte block UNIX file system | | 948.5 Mb | 22.4
| Data + inodes, 2048 byte block UNIX file system | | 1128.3 Mb
| 45.6 | Data + inodes, 4096 byte block UNIX file system |
+------------+---------+--------------------------------------------------+
Table 1 - Amount of wasted space as a function of block size.
The space wasted is calculated to be the percentage of space on
the disk not containing user data. As the block size on the
disk increases, the waste rises quickly, to an intolera- ble
45.6% waste with 4096 byte file system blocks.
To be able to use large blocks without undue waste,
small files must be stored in a more efficient way. The new file
system accomplishes this goal by allowing the division of a
single file system block into one or more fragments. The file
system fragment size is specified at the time that the file
system is created; each file system block can optionally be
broken into 2, 4, or 8 fragments, each of which is address-
able. The lower bound on the size of these fragments is con-
strained by the disk sector size, typically 512 bytes. The
block map associated with each cylinder group records the space
available in a cylinder group at the fragment level; to deter-
mine if a block is available, aligned fragments are examined.
Figure 1 shows a piece of a map from a 4096/1024 file sys-
tem. Each bit in the map records the status of a fragment; an
``X'' shows that the fragment is in use, while a ``O'' shows
that the fragment is available for allocation. In this example,
fragments 0-5, 10, and 11 are in use, while fragments 6-9,
and 12-15 are free. Fragments of adjoining blocks cannot be
used as a
A Fast File System for UNIX SMM:05-9
+-----------------+----------------------------+ |Bits in map
| XXXX XXOO OOXX OOOO | |Fragment numbers | 0-3 4-7
8-11 12-15 | |Block numbers | 0 1 2 3 |
+-----------------+----------------------------+ Figure 1 - Exam-
ple layout of blocks and fragments in a 4096/1024 file system.
full block, even if they are large enough. In this example,
fragments 6-9 cannot be allocated as a full block; only
fragments 12-15 can be coalesced into a full block.
On a file system with a block size of 4096 bytes and a
fragment size of 1024 bytes, a file is represented by zero or
more 4096 byte blocks of data, and possibly a single frag-
mented block. If a file system block must be fragmented to ob-
tain space for a small amount of data, the remaining fragments
of the block are made available for allocation to other files.
As an example consider an 11000 byte file stored on a
4096/1024 byte file system. This file would uses two full
size blocks and one three fragment portion of another block. If
no block with three aligned fragments is available at the time
the file is created, a full size block is split yielding the
necessary fragments and a single unused fragment. This re-
maining fragment can be allocated to another file as needed.
Space is allocated to a file when a program does a
write system call. Each time data is written to a file, the sys-
tem checks to see if the size of the file has increased*. If
the file needs to be expanded to hold the new data, one of three
conditions exists:
1) There is enough space left in an already allocated
block or fragment to hold the new data. The new data
is written into the available space.
2) The file contains no fragmented blocks (and the last
block in the file contains insufficient space to hold
the new data). If space exists in a block already
allocated, the space is filled with new data. If the
remainder of the new data contains more than a full
block of data, a full block is allocated and the first
full block of new data is written there. This process
is repeated until less than a full block of new data
remains. If the remaining new data to be written will
fit in less than a full block, a block with the neces-
sary fragments is located, otherwise a full block is
located. The remaining new data is written into the
----------- * A program may be overwriting data in the middle of
an existing file in which case space would already have been
allocated.
SMM:05-10 A Fast File System for UNIX
located space.
3) The file contains one or more fragments (and the frag-
ments contain insufficient space to hold the new data).
If the size of the new data plus the size of the data
already in the fragments exceeds the size of a full
block, a new block is allocated. The contents of the
fragments are copied to the beginning of the block and
the remainder of the block is filled with new data.
The process then continues as in (2) above. Otherwise,
if the new data to be written will fit in less than a
full block, a block with the necessary fragments is
located, otherwise a full block is located. The con-
tents of the existing fragments appended with the new
data are written into the allocated space.
The problem with expanding a file one fragment at a a time
is that data may be copied many times as a fragmented block ex-
pands to a full block. Fragment reallocation can be minimized if
the user program writes a full block at a time, except for a par-
tial block at the end of the file. Since file systems with
different block sizes may reside on the same system, the file
system interface has been extended to provide application pro-
grams the optimal size for a read or write. For files the opti-
mal size is the block size of the file system on which the file
is being accessed. For other objects, such as pipes and sockets,
the optimal size is the underlying buffer size. This feature
is used by the Stan- dard Input/Output Library, a package used by
most user pro- grams. This feature is also used by certain
system utili- ties such as archivers and loaders that do their
own input and output management and need the highest possi-
ble file system bandwidth.
The amount of wasted space in the 4096/1024 byte new file
system organization is empirically observed to be about the same
as in the 1024 byte old file system organization. A file
system with 4096 byte blocks and 512 byte fragments has about the
same amount of wasted space as the 512 byte block UNIX file
system. The new file system uses less space than the 512 byte or
1024 byte file systems for indexing information for large
files and the same amount of space for small files. These sav-
ings are offset by the need to use more space for keeping
track of available free blocks. The net result is about the same
disk utilization when a new file system's fragment size
equals an old file system's block size.
In order for the layout policies to be effective, a file
system cannot be kept completely full. For each file system
there is a parameter, termed the free space reserve, that gives
the minimum acceptable percentage of file system blocks that
should be free. If the number of free blocks drops below
this level only the system administrator can
A Fast File System for UNIX SMM:05-11
continue to allocate blocks. The value of this parameter may
be changed at any time, even when the file system is mounted
and active. The transfer rates that appear in sec- tion 4 were
measured on file systems kept less than 90% full (a reserve of
10%). If the number of free blocks falls to zero, the file
system throughput tends to be cut in half, because of the inabil-
ity of the file system to localize blocks in a file. If a
file system's performance degrades because of overfilling, it may
be restored by removing files until the amount of free space
once again reaches the mini- mum acceptable level. Access rates
for files created during periods of little free space may be re-
stored by moving their data once enough space is available. The
free space reserve must be added to the percentage of waste when
comparing the organizations given in Table 1. Thus, the per-
centage of waste in an old 1024 byte UNIX file system is roughly
compa- rable to a new 4096/512 byte file system with the free
space reserve set at 5%. (Compare 11.8% wasted with the old
file system to 6.9% waste + 5% reserved space in the new file
system.)
3.2. File system parameterization
Except for the initial creation of the free list, the old
file system ignores the parameters of the underlying hard-
ware. It has no information about either the physical charac-
teristics of the mass storage device, or the hardware that in-
teracts with it. A goal of the new file system is to parameter-
ize the processor capabilities and mass storage characteris-
tics so that blocks can be allocated in an opti- mum configura-
tion-dependent way. Parameters used include the speed of
the processor, the hardware support for mass storage transfers,
and the characteristics of the mass stor- age devices. Disk
technology is constantly improving and a given installation can
have several different disk technolo- gies running on a sin-
gle processor. Each file system is parameterized so that it can
be adapted to the characteris- tics of the disk on which it is
placed.
For mass storage devices such as disks, the new file sys-
tem tries to allocate new blocks on the same cylinder as the
previous block in the same file. Optimally, these new blocks
will also be rotationally well positioned. The dis- tance be-
tween ``rotationally optimal'' blocks varies greatly; it
can be a consecutive block or a rotationally delayed block
depending on system characteristics. On a processor with an
input/output channel that does not require any processor inter-
vention between mass storage transfer requests, two consecutive
disk blocks can often be accessed without suffering lost time
because of an intervening disk revolution. For processors with-
out input/output channels, the main processor must field an in-
terrupt and prepare for a new disk transfer. The expected time
to service this inter- rupt and schedule a new disk transfer
depends on the speed
SMM:05-12 A Fast File System for UNIX
of the main processor.
The physical characteristics of each disk include the num-
ber of blocks per track and the rate at which the disk spins.
The allocation routines use this information to cal- culate the
number of milliseconds required to skip over a block. The char-
acteristics of the processor include the expected time to
service an interrupt and schedule a new disk transfer. Given a
block allocated to a file, the allo- cation routines calculate
the number of blocks to skip over so that the next block in the
file will come into position under the disk head in the ex-
pected amount of time that it takes to start a new disk transfer
operation. For programs that sequentially access large amounts
of data, this strat- egy minimizes the amount of time spent wait-
ing for the disk to position itself.
To ease the calculation of finding rotationally optimal
blocks, the cylinder group summary information includes a
count of the available blocks in a cylinder group at differ- ent
rotational positions. Eight rotational positions are distin-
guished, so the resolution of the summary information is 2 mil-
liseconds for a typical 3600 revolution per minute drive. The
super-block contains a vector of lists called rotational layout
tables. The vector is indexed by rota- tional position.
Each component of the vector lists the index into the block map
for every data block contained in its rotational position.
When looking for an allocatable block, the system first looks
through the summary counts for a rotational position with a non-
zero block count. It then uses the index of the rotational posi-
tion to find the appro- priate list to use to index through only
the relevant parts of the block map to find a free block.
The parameter that defines the minimum number of mil-
liseconds between the completion of a data transfer and the ini-
tiation of another data transfer on the same cylinder can be
changed at any time, even when the file system is mounted and ac-
tive. If a file system is parameterized to lay out blocks
with a rotational separation of 2 milliseconds, and the disk
pack is then moved to a system that has a processor requiring 4
milliseconds to schedule a disk operation, the throughput will
drop precipitously because of lost disk rev- olutions on nearly
every block. If the eventual target machine is known, the file
system can be parameterized for it even though it is initial-
ly created on a different pro- cessor. Even if the move is not
known in advance, the rota- tional layout delay can be recon-
figured after the disk is moved so that all further allocation is
done based on the characteristics of the new host.
A Fast File System for UNIX SMM:05-13
3.3. Layout policies
The file system layout policies are divided into two dis-
tinct parts. At the top level are global policies that use
file system wide summary information to make decisions regarding
the placement of new inodes and data blocks. These rou-
tines are responsible for deciding the placement of new directo-
ries and files. They also calculate rotationally optimal block
layouts, and decide when to force a long seek to a new cylinder
group because there are insufficient blocks left in the
current cylinder group to do reasonable layouts. Below the glob-
al policy routines are the local allocation routines that use
a locally optimal scheme to lay out data blocks.
Two methods for improving file system performance are to
increase the locality of reference to minimize seek latency
as described by [Trivedi80], and to improve the lay- out of da-
ta to make larger transfers possible as described by
[Nevalainen77]. The global layout policies try to improve
performance by clustering related information. They cannot at-
tempt to localize all data references, but must also try to
spread unrelated data among different cylinder groups. If too
much localization is attempted, the local cylinder group may
run out of space forcing the data to be scattered to non-local
cylinder groups. Taken to an extreme, total localization
can result in a single huge cluster of data resembling the old
file system. The global policies try to balance the two con-
flicting goals of local- izing data that is concurrently accessed
while spreading out unrelated data.
One allocatable resource is inodes. Inodes are used to de-
scribe both files and directories. Inodes of files in the same
directory are frequently accessed together. For exam- ple, the
``list directory'' command often accesses the inode for each
file in a directory. The layout policy tries to place all the
inodes of files in a directory in the same cylinder group.
To ensure that files are distributed throughout the disk, a
different policy is used for direc- tory allocation. A new
directory is placed in a cylinder group that has a greater than
average number of free inodes, and the smallest number of di-
rectories already in it. The intent of this policy is to allow
the inode clustering pol- icy to succeed most of the time.
The allocation of inodes within a cylinder group is done using a
next free strategy. Although this allocates the inodes random-
ly within a cylin- der group, all the inodes for a particular
cylinder group can be read with 8 to 16 disk transfers. (At
most 16 disk transfers are required because a cylinder group may
have no more than 2048 inodes.) This puts a small and con-
stant upper bound on the number of disk transfers required
to access the inodes for all the files in a directory. In con-
trast, the old file system typically requires one disk
SMM:05-14 A Fast File System for UNIX
transfer to fetch the inode for each file in a directory.
The other major resource is data blocks. Since data
blocks for a file are typically accessed together, the pol- icy
routines try to place all data blocks for a file in the same
cylinder group, preferably at rotationally optimal positions
in the same cylinder. The problem with allocating all the data
blocks in the same cylinder group is that large files will
quickly use up available space in the cylinder group, forcing a
spill over to other areas. Further, using all the space in a
cylinder group causes future allocations for any file in the
cylinder group to also spill to other areas. Ideally none
of the cylinder groups should ever become completely full. The
heuristic solution chosen is to redirect block allocation to a
different cylinder group when a file exceeds 48 kilobytes, and at
every megabyte there- after.* The newly chosen cylinder group
is selected from those cylinder groups that have a greater
than average num- ber of free blocks left. Although big files
tend to be spread out over the disk, a megabyte of data is
typically accessible before a long seek must be performed, and
the cost of one long seek per megabyte is small.
The global policy routines call local allocation rou-
tines with requests for specific blocks. The local alloca- tion
routines will always allocate the requested block if it is free,
otherwise it allocates a free block of the requested size
that is rotationally closest to the requested block. If the
global layout policies had complete informa- tion, they could
always request unused blocks and the allo- cation routines would
be reduced to simple bookkeeping. However, maintaining
complete information is costly; thus the implementation of the
global layout policy uses heuris- tics that employ only partial
information.
If a requested block is not available, the local allo-
cator uses a four level allocation strategy:
1) Use the next available block rotationally closest to
the requested block on the same cylinder. It is
assumed here that head switching time is zero. On disk
----------- * The first spill over point at 48 kilobytes is
the point at which a file on a 4096 byte block file system
first requires a single indirect block. This appears to be a
natural first point at which to redirect block allocation.
The other spillover points are chosen with the intent of
forcing block allocation to be redirected when a file has used
about 25% of the data blocks in a cylinder group. In observ-
ing the new file system in day to day use, the heuristics appear
to work well in minimizing the number of completely filled
cylinder groups.
A Fast File System for UNIX SMM:05-15
controllers where this is not the case, it may be pos-
sible to incorporate the time required to switch
between disk platters when constructing the rotational
layout tables. This, however, has not yet been tried.
2) If there are no blocks available on the same cylinder,
use a block within the same cylinder group.
3) If that cylinder group is entirely full, quadratically
hash the cylinder group number to choose another cylin-
der group to look for a free block.
4) Finally if the hash fails, apply an exhaustive search
to all cylinder groups.
Quadratic hash is used because of its speed in finding un-
used slots in nearly full hash tables [Knuth75]. File sys-
tems that are parameterized to maintain at least 10% free space
rarely use this strategy. File systems that are run without
maintaining any free space typically have so few free blocks
that almost any allocation is random; the most important char-
acteristic of the strategy used under such conditions is that
the strategy be fast.
4. Performance
Ultimately, the proof of the effectiveness of the algo-
rithms described in the previous section is the long term per-
formance of the new file system.
Our empirical studies have shown that the inode layout pol-
icy has been effective. When running the ``list direc- tory''
command on a large directory that itself contains many di-
rectories (to force the system to access inodes in multiple
cylinder groups), the number of disk accesses for inodes is cut
by a factor of two. The improvements are even more dramatic for
large directories containing only files, disk accesses for
inodes being cut by a factor of eight. This is most encouraging
for programs such as spooling dae- mons that access many small
files, since these programs tend to flood the disk request queue
on the old file system.
Table 2 summarizes the measured throughput of the new file
system. Several comments need to be made about the conditions
under which these tests were run. The test pro- grams measure
the rate at which user programs can transfer data to or from a
file without performing any processing on it. These programs
must read and write enough data to insure that buffering in
the operating system does not affect the results. They are
also run at least three times in succession; the first to get
the system into a known state and the second two to insure
that the experiment has
SMM:05-16 A Fast File System for UNIX
stabilized and is repeatable. The tests used and their re-
sults are discussed in detail in [Kridle83]+. The systems were
running multi-user but were otherwise quiescent. There was no
contention for either the CPU or the disk arm. The only differ-
ence between the UNIBUS and MASSBUS tests was the controller.
All tests used an AMPEX Capricorn 330 megabyte Winchester disk.
As Table 2 shows, all file system test runs were on a VAX
11/750. All file systems had been in production use for at least
a month before being measured. The same number of system calls
were performed in all tests; the basic system call overhead was a
negligible portion of the total running time of the tests.
+------------------------------+--------------------------------------+
| Type of Processor and | Read
| | File System Bus Measured | Speed Bandwidth
% CPU |
+------------------------------+--------------------------------------+
| old 1024 750/UNIBUS | 29 Kbytes/sec 29/983 3%
11% | |new 4096/1024 750/UNIBUS | 221 Kbytes/sec 221/983
22% 43% | |new 8192/1024 750/UNIBUS | 233 Kbytes/sec
233/983 24% 29% | |new 4096/1024 750/MASSBUS | 466
Kbytes/sec 466/983 47% 73% | |new 8192/1024 750/MASSBUS
| 466 Kbytes/sec 466/983 47% 54% |
+------------------------------+--------------------------------------+
Table 2a - Reading rates of the old and new UNIX file systems.
+------------------------------+--------------------------------------+
| Type of Processor and | Write
| | File System Bus Measured | Speed Bandwidth
% CPU |
+------------------------------+--------------------------------------+
| old 1024 750/UNIBUS | 48 Kbytes/sec 48/983 5%
29% | |new 4096/1024 750/UNIBUS | 142 Kbytes/sec 142/983
14% 43% | |new 8192/1024 750/UNIBUS | 215 Kbytes/sec
215/983 22% 46% | |new 4096/1024 750/MASSBUS | 323
Kbytes/sec 323/983 33% 94% | |new 8192/1024 750/MASSBUS
| 466 Kbytes/sec 466/983 47% 95% |
+------------------------------+--------------------------------------+
Table 2b - Writing rates of the old and new UNIX file systems.
Unlike the old file system, the transfer rates for the new
file system do not appear to change over time. The through-
put rate is tied much more strongly to the amount of free space
that is maintained. The measurements in Table 2 were based on
a file system with a 10% free space reserve. Synthetic work
loads suggest that throughput deteriorates to about half the
rates given in Table 2 when the file systems are full.
The percentage of bandwidth given in Table 2 is a mea- sure
of the effective utilization of the disk by the file
----------- + A UNIX command that is similar to the reading
test that we used is ``cp file /dev/null'', where ``file'' is
eight megabytes long.
A Fast File System for UNIX SMM:05-17
system. An upper bound on the transfer rate from the disk is
calculated by multiplying the number of bytes on a track by the
number of revolutions of the disk per second. The bandwidth
is calculated by comparing the data rates the file system is able
to achieve as a percentage of this rate. Using this met-
ric, the old file system is only able to use about 3-5% of the
disk bandwidth, while the new file system uses up to 47% of the
bandwidth.
Both reads and writes are faster in the new system than in
the old system. The biggest factor in this speedup is because
of the larger block size used by the new file sys- tem. The
overhead of allocating blocks in the new system is greater than
the overhead of allocating blocks in the old system, however
fewer blocks need to be allocated in the new system because they
are bigger. The net effect is that the cost per byte allocated
is about the same for both systems.
In the new file system, the reading rate is always at
least as fast as the writing rate. This is to be expected
since the kernel must do more work when allocating blocks than
when simply reading them. Note that the write rates are
about the same as the read rates in the 8192 byte block file sys-
tem; the write rates are slower than the read rates in the 4096
byte block file system. The slower write rates occur because the
kernel has to do twice as many disk allo- cations per second,
making the processor unable to keep up with the disk transfer
rate.
In contrast the old file system is about 50% faster at
writing files than reading them. This is because the write sys-
tem call is asynchronous and the kernel can generate disk trans-
fer requests much faster than they can be serviced, hence
disk transfers queue up in the disk buffer cache. Because
the disk buffer cache is sorted by minimum seek dis- tance, the
average seek between the scheduled disk writes is much less
than it would be if the data blocks were written out in the ran-
dom disk order in which they are generated. However when the
file is read, the read system call is pro- cessed synchronously
so the disk blocks must be retrieved from the disk in the
non-optimal seek order in which they are requested. This forces
the disk scheduler to do long seeks resulting in a lower
throughput rate.
In the new system the blocks of a file are more opti- mal-
ly ordered on the disk. Even though reads are still syn-
chronous, the requests are presented to the disk in a much bet-
ter order. Even though the writes are still asyn-
chronous, they are already presented to the disk in minimum seek
order so there is no gain to be had by reordering them. Hence
the disk seek latencies that limited the old file sys- tem have
little effect in the new file system. The cost of allocation
is the factor in the new system that causes writes to be
slower than reads.
SMM:05-18 A Fast File System for UNIX
The performance of the new file system is currently lim-
ited by memory to memory copy operations required to move data
from disk buffers in the system's address space to data buffers
in the user's address space. These copy operations account for
about 40% of the time spent performing an input/output
operation. If the buffers in both address spaces were prop-
erly aligned, this transfer could be per- formed without
copying by using the VAX virtual memory man- agement hardware.
This would be especially desirable when transferring large
amounts of data. We did not implement this because it would
change the user interface to the file system in two major ways:
user programs would be required to allocate buffers on page
boundaries, and data would disap- pear from buffers after being
written.
Greater disk throughput could be achieved by rewriting the
disk drivers to chain together kernel buffers. This would
allow contiguous disk blocks to be read in a single disk trans-
action. Many disks used with UNIX systems contain either 32 or
48 512 byte sectors per track. Each track holds exactly two or
three 8192 byte file system blocks, or four or six 4096 byte
file system blocks. The inability to use contiguous disk blocks
effectively limits the perfor- mance on these disks to less
than 50% of the available band- width. If the next block for a
file cannot be laid out con- tiguously, then the minimum spacing
to the next allocatable block on any platter is between a sixth
and a half a revolu- tion. The implication of this is that
the best possible layout without contiguous blocks uses only half
of the band- width of any given track. If each track con-
tains an odd number of sectors, then it is possible to resolve
the rota- tional delay to any number of sectors by finding a
block that begins at the desired rotational position on anoth-
er track. The reason that block chaining has not been imple-
mented is because it would require rewriting all the disk
drivers in the system, and the current throughput rates are al-
ready limited by the speed of the available processors.
Currently only one block is allocated to a file at a
time. A technique used by the DEMOS file system when it
finds that a file is growing rapidly, is to preallocate sev- eral
blocks at once, releasing them when the file is closed if they
remain unused. By batching up allocations, the sys- tem can re-
duce the overhead of allocating at each write, and it can cut
down on the number of disk writes needed to keep the block
pointers on the disk synchronized with the block allocation [Pow-
ell79]. This technique was not included because block al-
location currently accounts for less than 10% of the time spent
in a write system call and, once again, the current
throughput rates are already limited by the speed of the avail-
able processors.
A Fast File System for UNIX SMM:05-19
5. File system functional enhancements
The performance enhancements to the UNIX file system did
not require any changes to the semantics or data struc- tures
visible to application programs. However, several changes
had been generally desired for some time but had not been intro-
duced because they would require users to dump and restore all
their file systems. Since the new file system already required
all existing file systems to be dumped and restored, these
functional enhancements were introduced at this time.
5.1. Long file names
File names can now be of nearly arbitrary length. Only pro-
grams that read directories are affected by this change. To
promote portability to UNIX systems that are not running the new
file system, a set of directory access routines have been intro-
duced to provide a consistent interface to direc- tories on both
old and new systems.
Directories are allocated in 512 byte units called
chunks. This size is chosen so that each allocation can be
transferred to disk in a single operation. Chunks are bro- ken
up into variable length records termed directory entries.
A directory entry contains the information neces- sary to map the
name of a file to its associated inode. No directory entry is
allowed to span multiple chunks. The first three fields of a
directory entry are fixed length and contain: an inode number,
the size of the entry, and the length of the file name contained
in the entry. The remain- der of an entry is variable length
and contains a null ter- minated file name, padded to a 4 byte
boundary. The maximum length of a file name in a directory is
currently 255 char- acters.
Available space in a directory is recorded by having one
or more entries accumulate the free space in their entry size
fields. This results in directory entries that are larger
than required to hold the entry name plus fixed length
fields. Space allocated to a directory should always be com-
pletely accounted for by totaling up the sizes of its entries.
When an entry is deleted from a directory, its space is re-
turned to a previous entry in the same directory chunk by in-
creasing the size of the previous entry by the size of the
deleted entry. If the first entry of a direc- tory chunk is
free, then the entry's inode number is set to zero to indicate
that it is unallocated.
5.2. File locking
The old file system had no provision for locking files.
Processes that needed to synchronize the updates of a file had
to use a separate ``lock'' file. A process would try to
SMM:05-20 A Fast File System for UNIX
create a ``lock'' file. If the creation succeeded, then the pro-
cess could proceed with its update; if the creation failed,
then the process would wait and try again. This mechanism
had three drawbacks. Processes consumed CPU time by looping over
attempts to create locks. Locks left lying around because of
system crashes had to be manually removed (normally in a system
startup command script). Finally, processes running as system
administrator are always permit- ted to create files, so were
forced to use a different mech- anism. While it is possible to
get around all these prob- lems, the solutions are not straight
forward, so a mechanism for locking files has been added.
The most general schemes allow multiple processes to con-
currently update a file. Several of these techniques are dis-
cussed in [Peterson83]. A simpler technique is to seri- alize
access to a file with locks. To attain reasonable efficien-
cy, certain applications require the ability to lock pieces of a
file. Locking down to the byte level has been implemented in
the Onyx file system by [Bass81]. However, for the standard sys-
tem applications, a mechanism that locks at the granularity of a
file is sufficient.
Locking schemes fall into two classes, those using hard
locks and those using advisory locks. The primary differ- ence
between advisory locks and hard locks is the extent of enforce-
ment. A hard lock is always enforced when a program tries to
access a file; an advisory lock is only applied when it is re-
quested by a program. Thus advisory locks are only effective
when all programs accessing a file use the locking scheme. With
hard locks there must be some override policy implemented in
the kernel. With advisory locks the policy is left to the user
programs. In the UNIX system, programs with system admin-
istrator privilege are allowed override any protection scheme.
Because many of the pro- grams that need to use locks must
also run as the system administrator, we chose to implement advi-
sory locks rather than create an additional protection scheme
that was incon- sistent with the UNIX philosophy or could not
be used by system administration programs.
The file locking facilities allow cooperating programs to
apply advisory shared or exclusive locks on files. Only one
process may have an exclusive lock on a file while mul- tiple
shared locks may be present. Both shared and exclu- sive locks
cannot be present on a file at the same time. If any lock is re-
quested when another process holds an exclu- sive lock, or
an exclusive lock is requested when another process holds any
lock, the lock request will block until the lock can be
obtained. Because shared and exclusive locks are advisory only,
even if a process has obtained a lock on a file, another pro-
cess may access the file.
A Fast File System for UNIX SMM:05-21
Locks are applied or removed only on open files. This
means that locks can be manipulated without needing to close and
reopen a file. This is useful, for example, when a pro- cess
wishes to apply a shared lock, read some information and de-
termine whether an update is required, then apply an exclusive
lock and update the file.
A request for a lock will cause a process to block if the
lock can not be immediately obtained. In certain instances
this is unsatisfactory. For example, a process that wants
only to check if a lock is present would require a separate mech-
anism to find out this information. Conse- quently, a pro-
cess may specify that its locking request should return with
an error if a lock can not be immediately obtained. Being able
to conditionally request a lock is useful to ``daemon'' process-
es that wish to service a spool- ing area. If the first in-
stance of the daemon locks the directory where spooling takes
place, later daemon processes can easily check to see if an ac-
tive daemon exists. Since locks exist only while the locking
processes exist, lock files can never be left active after
the processes exit or if the system crashes.
Almost no deadlock detection is attempted. The only
deadlock detection done by the system is that the file to
which a lock is applied must not already have a lock of the same
type (i.e. the second of two successive calls to apply a lock of
the same type will fail).
5.3. Symbolic links
The traditional UNIX file system allows multiple direc- tory
entries in the same file system to reference a single file.
Each directory entry ``links'' a file's name to an inode and
its contents. The link concept is fundamental; inodes do not
reside in directories, but exist separately and are referenced
by links. When all the links to an inode are removed, the inode
is deallocated. This style of refer- encing an inode does not
allow references across physical file systems, nor does it sup-
port inter-machine linkage. To avoid these limitations symbolic
links similar to the scheme used by Multics [Feiertag71] have
been added.
A symbolic link is implemented as a file that contains a
pathname. When the system encounters a symbolic link while
interpreting a component of a pathname, the contents of the
symbolic link is prepended to the rest of the path- name, and
this name is interpreted to yield the resulting pathname. In
UNIX, pathnames are specified relative to the root of the file
system hierarchy, or relative to a pro- cess's current work-
ing directory. Pathnames specified rela- tive to the root are
called absolute pathnames. Pathnames specified relative to
the current working directory are termed relative pathnames.
If a symbolic link contains an
SMM:05-22 A Fast File System for UNIX
absolute pathname, the absolute pathname is used, otherwise the
contents of the symbolic link is evaluated relative to the lo-
cation of the link in the file hierarchy.
Normally programs do not want to be aware that there is a
symbolic link in a pathname that they are using. However cer-
tain system utilities must be able to detect and manipu- late
symbolic links. Three new system calls provide the ability
to detect, read, and write symbolic links; seven system utili-
ties required changes to use these calls.
In future Berkeley software distributions it may be pos-
sible to reference file systems located on remote ma-
chines using pathnames. When this occurs, it will be pos- sible
to create symbolic links that span machines.
5.4. Rename
Programs that create a new version of an existing file typ-
ically create the new version as a temporary file and then
rename the temporary file with the name of the target file. In
the old UNIX file system renaming required three calls to the
system. If a program were interrupted or the system crashed be-
tween these calls, the target file could be left with only its
temporary name. To eliminate this possi- bility the rename sys-
tem call has been added. The rename call does the rename op-
eration in a fashion that guarantees the existence of the target
name.
Rename works both on data files and directories. When re-
naming directories, the system must do special validation checks
to insure that the directory tree structure is not corrupted
by the creation of loops or inaccessible directo- ries. Such
corruption would occur if a parent directory were moved in-
to one of its descendants. The validation check requires
tracing the descendents of the target direc- tory to insure
that it does not include the directory being moved.
5.5. Quotas
The UNIX system has traditionally attempted to share all
available resources to the greatest extent possible. Thus any
single user can allocate all the available space in the file
system. In certain environments this is unaccept- able. Conse-
quently, a quota mechanism has been added for restricting the
amount of file system resources that a user can obtain. The quo-
ta mechanism sets limits on both the number of inodes and
the number of disk blocks that a user may allocate. A separate
quota can be set for each user on each file system. Re-
sources are given both a hard and a soft limit. When a program
exceeds a soft limit, a warning is printed on the users termi-
nal; the offending program is not terminated unless it exceeds
its hard limit. The idea
A Fast File System for UNIX SMM:05-23
is that users should stay below their soft limit between lo-
gin sessions, but they may use more resources while they are
actively working. To encourage this behavior, users are warned
when logging in if they are over any of their soft limits.
If users fails to correct the problem for too many login ses-
sions, they are eventually reprimanded by having their soft
limit enforced as their hard limit.
Acknowledgements
We thank Robert Elz for his ongoing interest in the new file
system, and for adding disk quotas in a rational and efficient
manner. We also acknowledge Dennis Ritchie for his suggestions
on the appropriate modifications to the user interface. We ap-
preciate Michael Powell's explanations on how the DEMOS file sys-
tem worked; many of his ideas were used in this implementa-
tion. Special commendation goes to Peter Kessler and Robert Hen-
ry for acting like real users during the early debugging stage
when file systems were less stable than they should have been.
The criticisms and sug- gestions by the reviews contributed
significantly to the coherence of the paper. Finally we thank
our sponsors, the National Science Foundation under grant
MCS80-05144, and the Defense Advance Research Projects Agency
(DoD) under ARPA Order No. 4031 monitored by Naval Electronic
System Command under Contract No. N00039-82-C-0235.
References
[Almes78] Almes, G., and Robertson, G. "An Exten-
sible File System for Hydra" Proceedings
of the Third International Conference on
Software Engineering, IEEE, May 1978.
[Bass81] Bass, J. "Implementation Description
for File Locking", Onyx Systems Inc, 73
E. Trimble Rd, San Jose, CA 95131 Jan
1981.
[Feiertag71] Feiertag, R. J. and Organick, E. I.,
"The Multics Input-Output System", Pro-
ceedings of the Third Symposium on Oper-
ating Systems Principles, ACM, Oct 1971.
pp 35-41
[Ferrin82a] Ferrin, T.E., "Performance and Robust-
ness Improvements in Version 7 UNIX",
Computer Graphics Laboratory Technical
Report 2, School of Pharmacy, University
SMM:05-24 A Fast File System for UNIX
of California, San Francisco, January
1982. Presented at the 1982 Winter
Usenix Conference, Santa Monica, Cali-
fornia.
[Ferrin82b] Ferrin, T.E., "Performance Issuses of
VMUNIX Revisited", ;login: (The Usenix
Association Newsletter), Vol 7, #5,
November 1982. pp 3-6
[Kridle83] Kridle, R., and McKusick, M., "Perfor-
mance Effects of Disk Subsystem Choices
for VAX Systems Running 4.2BSD UNIX",
Computer Systems Research Group, Dept of
EECS, Berkeley, CA 94720, Technical
Report #8.
[Kowalski78] Kowalski, T. "FSCK - The UNIX System
Check Program", Bell Laboratory, Murray
Hill, NJ 07974. March 1978
[Knuth75] Kunth, D. "The Art of Computer Program-
ming", Volume 3 - Sorting and Searching,
Addison-Wesley Publishing Company Inc,
Reading, Mass, 1975. pp 506-549
[Maruyama76] Maruyama, K., and Smith, S. "Optimal
reorganization of Distributed Space Disk
Files", CACM, 19, 11. Nov 1976. pp
634-642
[Nevalainen77] Nevalainen, O., Vesterinen, M. "Deter-
mining Blocking Factors for Sequential
Files by Heuristic Methods", The Com-
puter Journal, 20, 3. Aug 1977. pp
245-247
[Pechura83] Pechura, M., and Schoeffler, J. "Esti-
mating File Access Time of Floppy
Disks", CACM, 26, 10. Oct 1983. pp
754-763
[Peterson83] Peterson, G. "Concurrent Reading While
Writing", ACM Transactions on Program-
ming Languages and Systems, ACM, 5, 1.
Jan 1983. pp 46-55
[Powell79] Powell, M. "The DEMOS File System",
Proceedings of the Sixth Symposium on
Operating Systems Principles, ACM, Nov
1977. pp 33-42
[Ritchie74] Ritchie, D. M. and Thompson, K., "The
UNIX Time-Sharing System", CACM 17, 7.
A Fast File System for UNIX SMM:05-25
July 1974. pp 365-375
[Smith81a] Smith, A. "Input/Output Optimization
and Disk Architectures: A Survey", Per-
formance and Evaluation 1. Jan 1981. pp
104-117
[Smith81b] Smith, A. "Bibliography on File and I/O
System Optimization and Related Topics",
Operating Systems Review, 15, 4. Oct
1981. pp 39-54
[Symbolics81] "Symbolics File System", Symbolics Inc,
9600 DeSoto Ave, Chatsworth, CA 91311
Aug 1981.
[Thompson78] Thompson, K. "UNIX Implementation",
Bell System Technical Journal, 57, 6,
part 2. pp 1931-1946 July-August 1978.
[Thompson80] Thompson, M. "Spice File System",
Carnegie-Mellon University, Department
of Computer Science, Pittsburg, PA 15213
#CMU-CS-80, Sept 1980.
[Trivedi80] Trivedi, K. "Optimal Selection of CPU
Speed, Device Capabilities, and File
Assignments", Journal of the ACM, 27, 3.
July 1980. pp 457-473
[White80] White, R. M. "Disk Storage Technology",
Scientific American, 243(2), August
1980.