PVFS was first developed in 1993 by Walt Ligon and Eric Blumer as a parallel file system for Parallel Virtual Machine as part of a NASA grant to study the I/O patterns of parallel programs. PVFS version 0 was based on Vesta, a parallel file system developed at IBM T. J. Watson Research Center. Starting in 1994 Rob Ross re-wrote PVFS to use TCP/IP and departed from many of the original Vesta design points. PVFS version 1 was targeted to a cluster of DEC Alpha workstations networked using switched FDDI. Like Vesta, PVFS striped data across multiple servers and allowed I/O requests based on a file view that described a strided access pattern. Unlike Vesta, the striping and view were not dependent on a common record size. Ross' research focused on scheduling of disk I/O when multiple clients were accessing the same file. Previous results had shown that scheduling according to the best possible disk access pattern was preferable. Ross showed that this depended on a number of factors including the relative speed of the network and the details of the file view. In some cases a scheduling based on network traffic was preferable, thus a dynamically adaptable schedule provided the best overall performance. In late 1994 Ligon met with Thomas Sterling and John Dorband at Goddard Space Flight Center and discussed their plans to build the first Beowulf computer. It was agreed that PVFS would be ported to Linux and be featured on the new machine. Over the next several years Ligon and Ross worked with the GSFC group including Donald Becker, Dan Ridge, and Eric Hendricks. In 1997 at a cluster meeting in Pasadena, CA Sterling asked that PVFS be released as an open source package.
PVFS2
In 1999 Ligon proposed the development of a new version of PVFS initially dubbed PVFS2000 and later PVFS2. The design was initially developed by Ligon, Ross, and Phil Carns. Ross completed his PhD in 2000 and moved to Argonne National Laboratory and the design and implementation was carried out by Ligon, Carns, Dale Witchurch, and Harish Ramachandran at Clemson University, Ross, Neil Miller, and Rob Latham at Argonne National Laboratory, and Pete Wyckoff at Ohio Supercomputer Center. The new file system was released in 2003. The new design featured object servers, distributed metadata, views based on MPI, support for multiple network types, and a software architecture for easy experimentation and extensibility. PVFS version 1 was retired in 2005. PVFS version 2 is still supported by Clemson and Argonne. Carns completed his PhD in 2006 and joined Axicom, Inc. where PVFS was deployed on several thousand nodes for data mining. In 2008 Carns moved to Argonne and continues to work on PVFS along with Ross, Latham, and Sam Lang. Brad Settlemyer developed a mirroring subsystem at Clemson, and later a detailed simulation of PVFS used for researching new developments. Settlemyer is now at Oak Ridge National Laboratory. in 2007 Argonne began porting PVFS for use on an IBM Blue Gene/P. In 2008 Clemson began developing extensions for supporting large directories of small files, security enhancements, and redundancy capabilities. As many of these goals conflicted with development for Blue Gene, a second branch of the CVS source tree was created and dubbed "Orange" and the original branch was dubbed "Blue." PVFS and OrangeFS track each other very closely, but represent two different groups of user requirements. Most patches and upgrades are applied to both branches. As of 2011 OrangeFS is the main development line.
Features
In a cluster using PVFS nodes are designated as one or more of: client, data server, metadata server. Data servers hold file data, metadata servers hold metadata include stat-info, attributes, and datafile-handles as well as directory-entries. Clients run applications that utilize the file system by sending requests to the servers over the network.
Object-based design
PVFS has an object based design, which is to say all PVFS server requests involved objects called dataspaces. A dataspace can be used to hold file data, file metadata, directory metadata, directory entries, or symbolic links. Every dataspace in a file system has a unique handle. Any client or server can look up which server holds the dataspace based on the handle. A dataspace has two components: a bytestream and a set of key/value pairs. The bytestream is an ordered sequence of bytes, typically used to hold file data, and the key/value pairs are typically used to hold metadata. The object-based design has become typical of many distributed file systems including Lustre, Panasas, and pNFS.
Separation of data and metadata
PVFS is designed so that a client can access a server for metadata once, and then can access the data servers without further interaction with the metadata servers. This removes a critical bottleneck from the system and allows much greater performance.
MPI-based requests
When a client program requests data from PVFS it can supply a description of the data that is based on MPI_Datatypes. This facility allows MPI file views to be directly implemented by the file system. MPI_Datatypes can describe complex non-contiguous patterns of data. The PVFS server and data codes implement data flows that efficiently transfer data between multiple servers and clients.
Multiple network support
PVFS uses a networking layer named BMI which provides a non-blocking message interface designed specifically for file systems. BMI has multiple implementation modules for a number of different networks used in high performance computing including TCP/IP, Myrinet, Infiniband, and Portals.
Stateless (lockless) servers
PVFS servers are designed so that they do not share any state with each other or with clients. If a server crashes another can easily be restarted in its place. Updates are performed without using locks.
User-level implementation
PVFS clients and servers run at user level. Kernel modifications are not needed. There is an optional kernel module that allows a PVFS file system to be mounted like any other file system, or programs can link directly to a user interface such as MPI-IO or a Posix-like interface. This features makes PVFS easy to install and less prone to causing system crashes.
System-level interface
The PVFS interface is designed to integrate at the system level. It has similarities with the Linux VFS, this making it easy to implement as a mountable file system, but is equally adaptable to user level interfaces such as MPI-IO or Posix-like interfaces. It exposes many of the features of the underlying file system so that interfaces can take advantage of them if desired.
Architecture
PVFS consists of 4 main components, and a number of utility programs. The components are the PVFS2-server, the pvfslib, the PVFS-client-core, and the PVFS kernel module. Utilities include the karma management tool, utilities such as pvfs-ping, pvfs-ls, pvfs-cp, etc. that all operate directly on the file system without using the kernel module. Another key design point is the PVFS protocol which describes the messages passed between client and server, though this is not strictly a component.
PVFS2-server
The PVFS server runs as a process on a node designated as an I/O node. I/O nodes are often dedicated nodes but can be regular nodes that run application tasks as well. The PVFS server usually runs as root, but can be run as a user if preferred. Each server can manage multiple distinct file systems and is designated to run as a metadata server, data server, or both. All configuration is controlled by a configuration file specified on the command line, and all servers managing a given file system use the same configuration file. The server receives requests over the network, carries out the request which may involve disk I/O and responds back to the original requester. Requests normally come from client nodes running application tasks but can come from other servers. The server is composed of the request processor, the job layer, Trove, BMI, and flow layers.
Request processor
The request processor consists of the server process' main loop and a number of state machines. State machines are based on a simple language developed for PVFS that manage concurrency within the server and client. A state machine consists of a number of states, each of which either runs a C state action function or calls a nested state machine. In either case return codes select which state to go to next. State action functions typically submit a job via the job layer which performs some kind of I/O via Trove or BMI. Jobs are non-blocking, so that once a job is issued the state machine's execution is deferred so that another state machine can run servicing another request. When Jobs are completed the main loop restarts the associated state machine. The request processor has state machines for each of the various request types defined in the PVFS request protocol plus a number of nested state machines used internally. The state machine architecture makes it relatively easy to add new requests to the server in order to add features or optimize for specific situations.
Job layer
The Job layer provides a common interface for submitting Trove, BMI, and flow jobs and reporting their completion. It also implements the request scheduler as a non-blocking job that records what kind of requests are in progress on which objects and prevents consistency errors due to simultaneously operating on the same file data.
Trove
Trove manages I/O to the objects stored on the local server. Trove operates on collections of data spaces. A collection has its own independent handle space and is used to implement distinct PVFS file systems, A data space is a PVFS object and has its own unique handle and is stored on one server. Handles are mapped to servers through a table in the configuration file. A data space consists of two parts: a bytestream, and a set of key/value pairs. A bytestream is sequence of bytes of indeterminate length and is used to store file data, typically in a file on the local file system. Key/value pairs are used to store metadata, attributes, and directory entries. Trove has a well defined interface and can be implemented in various ways. To date the only implementation has been the Trove-dbfs implementation that stores bytestreams in files and key/value pairs in a Berkeley DB database. Trove operations are non-blocking, the API provides post functions to read or write the various components and functions to check or wait for completion.