Mbox


Mbox is a generic term for a family of related file formats used for holding collections of email messages, first implemented for Fifth Edition Unix.
All messages in an mbox mailbox are concatenated and stored as plain text in a single file.
Each message starts with the four characters "From" followed by a space and the sender's email address.
RFC 4155 specifies that a UTC timestamp follows, separated from the address by another space character.
Unlike the Internet protocols used for the exchange of email, the format used for the storage of email has never been formally defined through the RFC standardization mechanism and has been entirely left to the developer of an email client.
However, the POSIX standard defined a loose frame in conjunction with the mailx program.
Finally, in 2005, the application/mbox media type was standardized as RFC 4155, which hints that mbox stores mailbox messages in their original Internet Message format, except for the newline character used, seven-bit clean data storage, and the requirement that each newly added message be terminated with a completely empty line within the mbox database.
A format similar to mbox is the MH Message Handling System. Other systems, such as Microsoft Exchange Server and the Cyrus IMAP server, store mailboxes in centralised databases that are managed by the mail system and are not directly accessible by individual users.
The maildir mailbox format is often cited as an alternative to the mbox format for network email storage systems.
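
The separator line described above can be taken apart with a few lines of Python. This is a minimal sketch, assuming the RFC 4155 layout ("From ", the sender address, a space, then an asctime-style UTC timestamp); the function name parse_from_line is hypothetical, and real mailboxes may carry separator lines that deviate from this layout.

```python
from datetime import datetime, timezone


def parse_from_line(line):
    """Split an mbox separator line into (sender, timestamp).

    Assumes the layout 'From <sender> <asctime timestamp>' and treats
    the timestamp as UTC, as RFC 4155 describes.
    """
    if not line.startswith("From "):
        raise ValueError("not an mbox separator line")
    # Split into the literal "From", the sender, and the rest (timestamp).
    _, sender, stamp = line.rstrip("\r\n").split(" ", 2)
    when = datetime.strptime(stamp.strip(), "%a %b %d %H:%M:%S %Y")
    return sender, when.replace(tzinfo=timezone.utc)


sender, when = parse_from_line("From MAILER-DAEMON Fri Jul  8 12:08:34 2011\n")
print(sender, when.isoformat())
```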

Family

The mbox format uses a single blank line followed by the string 'From ' to delimit messages; this can create ambiguities if a message contains the same sequence in the message text.
Over the decades that followed, four popular but incompatible variants arose: mboxo, mboxrd, mboxcl, and mboxcl2. The naming scheme was developed by Daniel J. Bernstein, Rahul Dhesi, and others in 1996. Each originated from a different version of Unix. mboxcl and mboxcl2 originated from the file format used by Unix System V Release 4 mail tools. mboxrd was invented by Rahul Dhesi et al. as a rationalisation of mboxo and subsequently adopted by some Unix mail tools including qmail.
All these variants have the problem that the content of the message is modified in order to remove the ambiguities, as shown below, so that applications have to know which quoting rule has been used in order to perform the correct reversion, which turned out to be impractical.
Using MIME and choosing a content-transfer-encoding that quotes "From_" lines in a standard-compliant fashion ensures that the message content does not need to be changed, only its MIME representation.
Therefore checksums remain constant, a necessary precondition for supporting S/MIME and Pretty Good Privacy.
Applications which newly create messages and store them in mbox database files will likely use this approach to detach message content from database storage format.
mboxo and mboxrd locate the message start by scanning for From lines that are found before the email message headers. If a "From " string occurs at the beginning of a line in either the header or the body of a message, the email message must be modified before the message is stored in an mbox mailbox file or the line will be taken as a message boundary.
To avoid misinterpreting a "From " string at the beginning of the line in the email body as the beginning of a new email, some systems "From-munge" the message, typically by prepending a greater-than sign:
>From my point of view...
In the mboxo format, such lines are irreversibly ambiguous, which can lead to corruption of the message. If a line already began with >From, it is left unchanged when written; when the mail software subsequently reads it, the leading > is erroneously removed. The mboxrd format solves this by converting From to >From, >From to >>From, and so on, so that the transformation is always reversible.
Example:
From MAILER-DAEMON Fri Jul 8 12:08:34 2011
This is the body.
>From.
There are 3 lines.
From MAILER-DAEMON Fri Jul 8 12:08:34 2011
This is the second body.
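
The difference between the two quoting rules can be sketched in Python as follows. This is an illustrative sketch rather than the code of any particular mail tool: the function names are hypothetical, and message bodies are modelled simply as lists of lines without line terminators.

```python
import re

# mboxrd treats any run of ">" followed by "From " as a line to quote.
_RD_PLAIN = re.compile(r">*From ")
_RD_STORED = re.compile(r">+From ")


def mboxo_escape(lines):
    """mboxo: quote only lines that begin with 'From '."""
    return [">" + l if l.startswith("From ") else l for l in lines]


def mboxo_unescape(lines):
    """mboxo: strip one '>' from every line that begins with '>From '."""
    return [l[1:] if l.startswith(">From ") else l for l in lines]


def mboxrd_escape(lines):
    """mboxrd: add one '>' to 'From ', '>From ', '>>From ', and so on."""
    return [">" + l if _RD_PLAIN.match(l) else l for l in lines]


def mboxrd_unescape(lines):
    """mboxrd: remove exactly one '>' from each quoted line."""
    return [l[1:] if _RD_STORED.match(l) else l for l in lines]


body = ["From here on", ">From a quoted reply"]
print(mboxo_unescape(mboxo_escape(body)) == body)    # False
print(mboxrd_unescape(mboxrd_escape(body)) == body)  # True
```

The mboxo round trip fails because the line that originally began with >From cannot be told apart from one that was quoted on writing.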
The mboxcl and mboxcl2 formats use a Content-Length: header to determine the messages’ lengths and thereby the next real From line. mboxcl still quotes From lines in the messages themselves as mboxrd does, while mboxcl2 doesn’t.
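
Reading by length rather than by scanning for From lines can be sketched as below. The function name split_mboxcl2 is hypothetical, and the sketch makes simplifying assumptions that real tools cannot: LF line endings, a correct Content-Length: header on every message, and headers that end at the first empty line.

```python
import email


def split_mboxcl2(data):
    """Yield (headers, body) pairs from mboxcl2-style bytes.

    The body is taken by its declared length, so unquoted 'From '
    lines inside it are never mistaken for message boundaries.
    """
    pos = 0
    while pos < len(data):
        if not data.startswith(b"From ", pos):
            raise ValueError(f"expected separator line at offset {pos}")
        header_start = data.index(b"\n", pos) + 1        # skip separator line
        header_end = data.index(b"\n\n", header_start)   # empty line ends headers
        headers = email.message_from_bytes(data[header_start:header_end + 1])
        length = int(headers["Content-Length"])
        body_start = header_end + 2
        yield headers, data[body_start:body_start + length]
        pos = body_start + length
        # Skip the blank line(s) that terminate the message.
        while data.startswith(b"\n", pos):
            pos += 1
```

A wrong or missing Content-Length: value makes this approach lose track of the message boundaries, so the sketch is only as reliable as the header it trusts.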

Modified mbox

Some email clients use a modification of the mbox format for their mail folders.
Various mutually incompatible mechanisms have been used by different mbox formats to enable message file locking, including fcntl and lockf.
These mechanisms do not work well with network-mounted file systems, such as the Network File System (NFS), which is why Unix traditionally used additional "dot lock" files, which could be created atomically even over NFS.
Because more than one message is stored in a single file, some form of file locking is needed to avoid the corruption that can result from two or more processes modifying the mailbox simultaneously. This could happen if a network email delivery program delivers a new message at the same time as a mail reader is deleting an existing message.
Mbox files should also be locked while they are being read. Otherwise the reader may see corrupted message contents if another process is modifying the mbox at the same time, even though no actual file corruption occurs.
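
A dot lock of the kind mentioned above might be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure of any specific mail program: the name dot_lock, the ".lock" suffix, and the retries and delay parameters are hypothetical choices, and the sketch assumes a Unix-like system whose file system supports hard links. It uses the common technique of creating a uniquely named file and hard-linking it to the lock name, since the link fails if the lock already exists.

```python
import os
import socket
import time
from contextlib import contextmanager


@contextmanager
def dot_lock(mbox_path, retries=10, delay=0.5):
    """Hold '<mbox_path>.lock' for the duration of a with-block."""
    lock_path = mbox_path + ".lock"
    # A name that is unique per host and process (an assumed scheme).
    unique = f"{lock_path}.{socket.gethostname()}.{os.getpid()}"
    with open(unique, "w"):
        pass
    try:
        for _ in range(retries):
            try:
                os.link(unique, lock_path)  # fails if the lock exists
                break
            except FileExistsError:
                time.sleep(delay)
        else:
            raise TimeoutError(f"could not lock {mbox_path}")
    finally:
        os.unlink(unique)  # the lock name keeps the file alive
    try:
        yield lock_path
    finally:
        os.unlink(lock_path)  # release the lock
```

A reader or writer would wrap its access in `with dot_lock("/path/to/mbox"):`, where the path is a placeholder. The sketch does not handle stale locks left behind by crashed processes.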

As a patch format

In open source development, it is common to send patches in the diff format to a mailing list for discussion. The diff format allows for irrelevant "headers", such as mbox data, to be added. Version control systems like Git have support for generating mbox-formatted patches and for sending them to the list as emails in a thread.
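
Because such patch series are ordinary mbox files, they can be inspected with generic mbox tooling. The sketch below uses the mailbox module from Python's standard library to list the subject line of each patch; the function name patch_subjects and the file name in the usage note are hypothetical.

```python
import mailbox


def patch_subjects(mbox_path):
    """Return the Subject header of every message in an mbox file."""
    # create=False: raise an error instead of creating a missing file.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [msg["Subject"] for msg in box]
    finally:
        box.close()
```

For example, `patch_subjects("series.mbox")` would return one subject string per patch email stored in the assumed file series.mbox.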