

7. File Systems

This topic is also discussed frequently in comp.os.research. See www.maths.tcd.ie.


7.1. NFS {Brief}

The Network File System was originally developed by Sun Microsystems and is now pretty standard in the Unix world; clients also exist for PC, Mac, VMS, and other non-Unix OSes. V2, the common version, uses 32-bit file offsets, so a single file can be at most 2^32 bytes (4GB); in practice many implementations treat the offset as signed and stop at 2GB. I'm not sure if there are any limits to a file system size under NFS, other than those imposed by the client and server OSes (SHMO).

NFS V2 is defined in RFC 1094. V3 is defined in RFC 1813.

There is at least one newsgroup devoted specifically to NFS: comp.protocols.nfs.


7.1.1. NFS V3

NFS V3 supports 64-bit files and write caching.

The first implementation was from Digital with DEC OSF/1 V3.0 for Alpha AXP. Silicon Graphics supports it on IRIX 5.3. Cray will support it on UNICOS 9. I don't know about other vendors but I have heard rumours that the releases coming in the second half of 1995 will support it.

Further information on NFS V3 can be found at gatekeeper.dec.com.

(jmaki@csc.fi, 95/1/22)

Solaris 2.5, available Nov. 95, is reported to have V3 support. Network Appliance has had it since release 3.0, Sept. 95. (guy@netapp.com (Guy Harris), 95/10/6)


7.2. AFS {Brief}

The Andrew File System (SHMO). Allows naming of files worldwide as if they were a locally-mounted FS (from cooperating clients, of course).

There's an "alt" group for AFS - "alt.filesystems.afs". Available commercially from Transarc.


7.3. DFS {Brief}

Another remote file system protocol that supports large files. I don't know anything about it, or if any implementations really exist yet.


7.4. Log based file systems


Further Information:
    %z InProceedings
    %K hpdb:Rosenblum91
    %s golding@cis.ucsc.edu (Thu Oct 17 11:12:07 1991)
    %A Mendel Rosenblum
    %A John K. Ousterhout
    %y UCBCS.
    %T The design and implementation of a log-structured file system
    %C Proc. 13th SOSP.
    %c Asilomar, Pacific Grove, CA
    %p ACM. SIGOPS
    %D 13 Oct. 1991
    %P 1 15
    %x This paper presents a new technique for disk storage management
    %x called a log-structured file system.  A log-structured file system
    %x writes all modifications to disk sequentially in a log-like
    %x structure, thereby speeding up both file writing and crash
    %x recovery.  The log is the only structure on disk; it contains
    %x indexing information so that files can be read back from the log
    %x efficiently.  In order to maintain large free areas on disk for
    %x fast writing, we divide the log into segments and use a segment
    %x cleaner to compress the live information from heavily fragmented
    %x segments.  We present a series of simulations that demonstrate the
    %x efficiency of a simple cleaning policy based on cost and benefit.
    %x We have implemented a prototype log-structured file system called
    %x Sprite LFS; it outperforms current Unix file systems by an order of
    %x magnitude for small-file writes while matching or exceeding Unix
    %x performance for reads and large writes.  Even when the overhead for
    %x cleaning is included, Sprite LFS can use 70% of the disk bandwidth
    %x for writing, whereas Unix file systems typically can use only
    %x 5--10%.

(tage@cs.utwente.nl)
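
The abstract above mentions a cleaning policy "based on cost and benefit" without stating it. As a rough sketch only: the Rosenblum and Ousterhout paper rates each segment by benefit/cost = (1 - u) * age / (1 + u), where u is the fraction of the segment still live and age is the age of its youngest data, and cleans the segment with the highest ratio. The struct and function names below are made up for illustration and are not taken from Sprite LFS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-segment summary; field names are illustrative. */
struct segment {
    double utilization; /* u: fraction of the segment still live, 0.0..1.0 */
    double age;         /* age of the youngest live data, any consistent unit */
};

/* Benefit-to-cost ratio. Cleaning reads the whole segment (cost 1) and
 * rewrites the live part (cost u); it frees (1 - u) of a segment. */
double cost_benefit(const struct segment *s)
{
    assert(s->utilization >= 0.0 && s->utilization <= 1.0);
    return (1.0 - s->utilization) * s->age / (1.0 + s->utilization);
}

/* Return the index of the segment with the highest ratio, or -1 if n == 0. */
int pick_segment_to_clean(const struct segment *segs, size_t n)
{
    int best = -1;
    double best_ratio = -1.0;
    for (size_t i = 0; i < n; i++) {
        double r = cost_benefit(&segs[i]);
        if (r > best_ratio) {
            best_ratio = r;
            best = (int)i;
        }
    }
    return best;
}
```

With segments at (u = 0.9, age = 100), (u = 0.5, age = 10) and (u = 0.2, age = 1), the first is chosen: old, stable data is worth cleaning even at high utilization, because the space it frees tends to stay free.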

Also, these papers:

Ousterhout and Douglis, "Beating the I/O Bottleneck: A Case for Log-structured File Systems", Operating Systems Review, No. 1, Vol. 23, pp. 11-27, 1989, also available as Technical Report UCB/CSD 88/467.

Rosenblum and Ousterhout, "The Design and Implementation of a Log-Structured File System", ACM SIGOPS Operating Systems Review, No. 5, Vol. 25, 1991.

Seltzer, "File System Performance and Transaction Support", PhD Thesis, University of California, Berkeley, 1992, also available as Technical Report UCB/ERL M92.

Seltzer, Bostic, McKusick and Staelin, "An Implementation of a Log-Structured File System for UNIX", Proc. of the Winter 1993 USENIX Conf., pp. 315-331, 1993.

These are listed in the man page for mount_lfs under FreeBSD-2.1.5. (rdv, 97/1/17)


7.5. Mainframe File Systems

A brief description of mainframe file systems (as well as CKD (Count, Key, Data) disks) by Dick Wilmot is available.


7.6. Parallel System File Systems

This discussion comes up occasionally on comp.arch and comp.os.research. I don't know which newsgroups/mailing lists the PIO (Parallel I/O) people hang out in, but it doesn't seem to be here. They show up occasionally in comp.sys.super and comp.parallel. They do have their own conferences, though.

The important work seems to be going on with the supercomputing gang -- LLNL, CMU, Caltech, UIUC, Dartmouth, ORNL, SNL, etc. Work is also being done by the parallel database community, including vendors such as Teradata.

A paper presented at the ACM International Supercomputing Conference in 1993 showed what seemed to me pretty appalling performance for reading data and distributing it to multiple processors on an Intel Delta supercomputer (sorry, I don't have the reference in front of me). (rdv, 94/8/12) The paper is old now, and the Intel guys say they have improved performance to as much as 130 MB/sec on the new Paragon using their Parallel File System (PFS).

There is an excellent web site on parallel I/O at Dartmouth: www.cs.dartmouth.edu

There is also a mailing list housed at Dartmouth, parallel-io@dartmouth.edu.

The annual conference is I/O in Parallel and Distributed Systems (IOPADS); 1997's is co-located with Supercomputing '97 in San Jose, Nov. 17. Papers are due March 25, 1997. See www.cs.dartmouth.edu.


7.7. Microsoft Windows NT {Brief}

I seem to recall that NT's own native file systems are all 64-bit. Anybody know for sure (SHMO)? (rdv, 94/8/24)

From *Inside the Windows NT(TM) File System*, by Helen Custer:

"NTFS allocates clusters and uses 64 bits to number them, which results in a possible 2^64 clusters, each up to 4KB. Each file can be of virtually infinite size, that is, 2^64 bytes long."

"Clusters" can be between 512 and 4K bytes.

The Win32 API supports 64-bit file sizes, albeit in a cheesy fashion reminiscent of V6 UNIX - no 64-bit integral types used, just pairs of 32-bit integral types. (guy@netapp.com (Guy Harris), 95/10/6)
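
To make the "pairs of 32-bit integral types" point concrete, here is a minimal, portable sketch of joining and splitting such a pair. It does not call Win32 at all; join_halves and split_halves are names invented for this example. A call such as GetFileSize, which hands back the low and high halves separately, would need the same kind of arithmetic on the caller's side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Join the two 32-bit halves of a file size into one 64-bit value. */
uint64_t join_halves(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | low;
}

/* Split a 64-bit file size back into its 32-bit halves. */
void split_halves(uint64_t size, uint32_t *high, uint32_t *low)
{
    assert(high != NULL && low != NULL);
    *high = (uint32_t)(size >> 32);
    *low = (uint32_t)(size & 0xFFFFFFFFu);
}
```

For example, a 5,000,000,000-byte file splits into high = 1 and low = 705032704. Forgetting the high half silently reports it as a 705 MB file, which is the usual failure mode of this style of interface.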


7.8. Large Unix File Systems

There is now an industry group working on standardizing an API for files larger than 2 GB (the max size normally supported on most Unix systems). More info as I get it. The WWW-enabled can have a look at www.sas.com:80 and see the various proposals on the table.

Note that it is VERY easy to confuse whether an OS supports _files_ larger than 2 GB or _file systems_ (partitions) larger than 2 GB. My table lists some of both (thanks to ben@rex.uokhsc.edu (Benjamin Z. Goldsteen), Ed Hamrick (EdHamrick@aol.com) and Peter Poorman (poorman@convex.com) for much of this information).

It is straightforward for systems with 64-bit integers to support 64-bit files; for systems with 32-bit integers it is more complex. On most 32-bit systems, the file offsets and sizes passed around inside the kernel (most importantly, at the VFS layer) tend to be 32-bit signed integers, so files cannot exceed 2^31 bytes (2 GB).

On most systems, the argument to lseek is of type off_t, which (on SunOS and Linux, and plausibly on OSF/1 and others) is declared in a header file as "typedef long off_t;".
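
The 2 GB ceiling follows directly from that typedef when long is 32 bits. The sketch below simulates such an off_t with int32_t so the limit shows up on any machine; off32_t, OFF32_MAX, fits_in_off32 and to_off32 are invented names, not part of any system header.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for "typedef long off_t;" on a machine where long is 32 bits.
 * The name off32_t is made up for this sketch. */
typedef int32_t off32_t;

/* Largest file position such an off_t can hold: 2^31 - 1 bytes. */
#define OFF32_MAX INT32_MAX

/* Nonzero if an absolute file position fits in a 32-bit signed off_t. */
int fits_in_off32(int64_t offset)
{
    return offset >= 0 && offset <= OFF32_MAX;
}

/* Narrow a 64-bit position, refusing values that would wrap around. */
off32_t to_off32(int64_t offset)
{
    assert(fits_in_off32(offset));
    return (off32_t)offset;
}
```

Position 2147483647 is the last one that fits; 2147483648 (exactly 2 GB) does not, and without a check like this it would typically wrap to a negative offset.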

For clients to really have access to large files, three pieces are required: local FS support on the client, an appropriate network protocol, and server support for 64-bit FSes. For FTP access, I believe _literally_ infinitely large files are possible, but I'm not sure (SHMO). For NFS access, NFS V2 supports only 2GB files. NFS V3, just becoming available now, supports full 64-bit files (it is specified in RFC 1813). With the notable exception of Unitree (which does not use, depend on, or appear as, a local FS on the server), server support for 64-bit files is provided only when the server's own local FSes are 64-bit.

Even among the systems that _do_ support large files, not all do so transparently to the programmer or user. UNICOS does, OSF/1 does, ConvexOS does not: it has two system calls, lseek and lseek64, with 32-bit and 64-bit file offsets respectively, so programs have to be updated to use large files (though the Convex Fortran interface is transparent).

This brings up some related issues. A complete large-files implementation needs not only the system calls, but also the stdio library and the runtime libraries for the languages (Fortran, Cobol, ...). Further, system utilities (sed, dd, etc.) need to be capable of dealing with large files.

(It has been pointed out that the GNU C compiler runs on most of these machines, so it is possible to use "long long" as a 64-bit int on them, but what matters for file systems is the system compiler.)

Here's the start of a table on these. Really such a simple table can't do the problem justice, but it'll give you an idea. Keep in mind that many of these systems support many file system types; I've listed only the most interesting so far from this point of view. I'd like to flesh it out more completely, though.

#include "bigunixfs.table"

A slightly more detailed description of certain implementations is available with the WWW version.

In addition, the HPSS (see above) supports large files, as does Unitree (though the Unitree interface to them is limited).


7.9. Non-Unix Large File Systems

(info about non-Unix large FSes also welcome; SHMO)

Digital's OpenVMS (any version) supports 2TB files (32-bit unsigned block number, 9-bit offset) through its RMS interface, though files are still limited to 2GB through the C run-time library. File systems were limited to ~7GB; as of OpenVMS AXP 1.5 and OpenVMS VAX 6.0 the max volume size has been bumped to 1 TB. (from a friend, rdv, 94/8/26, and Rod Widdowson, Filesystems group, OpenVMS engineering, Scotland).
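
The 2TB figure is just the product of the two address fields: 2^32 blocks times 2^9 (512) bytes per block is 2^41 bytes. A small sketch of that arithmetic, using a helper name (max_file_bytes) invented for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Largest file size addressable with `block_bits` bits of block number
 * and `offset_bits` bits of byte offset within a block. */
uint64_t max_file_bytes(unsigned block_bits, unsigned offset_bits)
{
    assert(block_bits + offset_bits < 64); /* keep the shift well defined */
    return (uint64_t)1 << (block_bits + offset_bits);
}
```

max_file_bytes(32, 9) gives 2199023255552 bytes, i.e. 2 TB, while a 31-bit signed byte offset, max_file_bytes(31, 0), gives the familiar 2 GB limit of the C run-time library.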


My Home Page at Caltech

email me at rdv@isi.edu

Copyright 1996 Rod Van Meter