A distributed mailserver: requirements and design

With contributions from Matthew O'Connor and Paul Visscher.

Building a reliable email system is expensive; at present, it requires massive investment in multiple redundant links to the rest of the Internet, system administrators on call 24x7, network-attached RAID, multiple redundant server machines, big UPSes, expensive proprietary mail-server software that supports clustering, and (as of 2001, since the rolling blackouts started) all of this somewhere besides California. Worse, it's hard; it's likely that there will still be failures.

Backup MX records reduce the visibility of your mail-server unreliability from the outside world, but they don't do anything to keep your mail available when your main mail server is down.

This paper describes a mail server architecture that can provide better reliability more inexpensively than the above-described system, although it hasn't yet been implemented.

Correctness criteria

In order of importance, it is desirable that email not be delivered incorrectly, corrupted, lost, spuriously bounced, unduly delayed, or duplicated. It should be promptly and reliably delivered to the correct mailbox or mailboxes.

The problem to solve

In an ideal world, qmail or Postfix on a single good-quality computer can do this quite well. In the real world, the following problems can arise:

operational errors, leading to misdelivered, corrupted, lost, spuriously bounced, delayed, or duplicated email
disk system failures, such as disk crashes or failing disk controllers, leading to lost or delayed email
machine crashes due to hardware failure, leading to delayed email
machine crashes due to poor choice of operating system, leading to delayed email
machine power failures, leading to lost or delayed email
network link degradation (e.g. packet loss), leading to lost or delayed email
loss of connection to the Internet, leading to lost or delayed email
network partitions, leading to lost or delayed email
flaky memory corrupting messages in memory, leading to misdelivered, corrupted, lost, spuriously bounced, or delayed email
flaky disk I/O corrupting messages en route to or from the disk, with the same results

Any of the problems that cause email to be delayed can cause it to be spuriously bounced or duplicated under some circumstances, assuming correct software.

If you are using buggy mail server software, it may introduce failure modes of its own.

The distributed solution

The causes of failure described in the previous section fall into three categories: machine-specific, site-specific, and global. Operational errors and buggy mail server software can be any of the three; disk crashes, machine crashes, and flaky hardware are usually machine-specific; and network link problems and power failures are usually site-specific.

(Shared-storage clusters convert disk crashes into global problems; RAID with disks from different vendors generally converts disk crashes into a very rare problem.)

There are two ways to reduce the probability of failure caused by any one of these: by directly attacking the problem and by adding redundancy to tolerate failures.

To directly attack the problem of machine crashes due to hardware failure, you might use hardware monitoring, have sysadmins on-call 24x7, and use high-quality components. To add redundancy, you might cluster multiple mailservers, so that if one fails, the other will continue to work.

To directly attack the problem of flaky memory, you might buy ECC RAM, use machines from a high-quality vendor, not overclock your system bus, do a thorough burn-in test with memory testing software before putting a machine into service as a mail server, and add checksums to messages to detect message corruption. There doesn't seem to be any practical way to add redundancy to mitigate this problem, though.

In general, to mitigate a machine-specific cause of failure by adding redundancy, you need to add more machines; to mitigate a site-specific cause of failure by adding redundancy, you need to add more sites; and you can't mitigate a global problem by adding redundancy.

Cross-site clusters are not well-supported by existing mail-server software; the result is that sites that require high availability spend a lot of money on directly attacking site-specific causes of failure, by buying UPSes, arranging for redundant network links, and arranging for 24x7 system administrator support for this infrastructure.

We propose that software for shared-nothing cross-site clustering is a more cost-effective way to mitigate site-specific and machine-specific causes of failure than the traditional approach.

The cross-site cluster architecture proposed in this paper provides the following guarantees:

mail, incoming, outgoing, or stored, is never lost unless the majority of the cluster nodes go down at once and never come back up (for example, because they've been destroyed) or a node has a failure that corrupts data without being detected
mail, incoming, outgoing, or stored, is never delayed unless the majority of the cluster nodes are down or disconnected from the Internet or a node has a failure that corrupts data without being detected, even if several nodes go down at once

It has little effect on misdelivery, mail corruption, and duplication, and it affects spurious bounces only by preventing mail delays.

Therefore, it can provide better reliability and availability than any existing mail-server system, and its deployment and management costs are expected to be a fraction of those of existing highly-available mail-server systems. It should make it possible to operate a mail service off of five residential DSL lines (in different residences, using different DSL providers) that is more reliable than the email service for any Fortune 500 company today.

System Design Overview

A geographically-distributed mail server cluster consists of at least three nodes at geographically distinct sites, at least two of which are mail servers. The mail servers in the cluster are equal-precedence highest-precedence MX records for the domains they serve, using round-robin DNS to balance out the load, and they all know about each of the others.

All of the nodes in the network run special cluster software written for this project; the nodes that are mail servers run a special MTA.

For maximal reliability, the mail servers should also be the DNS servers for the domains they serve, and their software (other than the MTA software) should be as dissimilar as possible to minimize common-mode software failures.

A quorum is a simple majority of the nodes in the cluster.

The nodes in the cluster maintain a distributed transactional store, of which each has a full replica, in which they keep their mail queues. The consent of a quorum of the cluster nodes is needed to commit transactions to this store, and the entire transaction is replicated to each node that consents; this keeps the replicas up to date and ensures that, in case of network partition, the store does not fork. The configuration data for the cluster, describing what nodes exist, what domains the cluster handles mail for, and other rules, is also kept in the distributed store; each node may have additional configuration data that is not expected to require similar changes at different nodes. This prevents configuration desynchronization between the nodes.

The nodes talk to each other over ssh connections to maintain this store, because ssh offers encryption, authentication, and remote server starting, and is already installed on all the machines where I want to run the prototype.

Communication need not be direct; there are times on the Internet when node A can talk with node B, and node B with node C, but node A and node C cannot talk directly over IP. The transactional store protocol makes this condition transparent; it appears to node A and node C that they can see one another.

The system has several mail queues:

mail is committed to the incoming queue before its receipt is acknowledged over SMTP.
mail is atomically deleted from the incoming queue and written as zero or more copies into the outgoing queue and the local delivery queue depending on its destination addresses, and possibly also copied as a bounce message into the incoming queue.
mail being locally delivered on a particular machine is deleted from the local delivery queue when it has been handled in a locally-appropriate way, such as being appended to a mailbox. (System-dependent things like mailbox formats are outside the purview of the cluster system and cannot benefit from its reliability unless the mailboxes are actually stored in the transactional store.)
mail in the remote delivery queue is deleted from the remote delivery queue when some machine in the cluster has received notification of its acceptance from a receiver-SMTP elsewhere.

In general, the presence of mail in a queue triggers some processing which, if successful, will eventually remove it from the queue. Some of these triggers can happen on any one of several machines; others can happen only on one particular machine.

In this basic form, the system doesn't provide any reliability benefit over just having some secondary MXes, because mail doesn't get delivered to your personal mailbox until the single machine your personal mailbox lives on is online. The following two sections explain how to actually get better reliability for mail applications on top of this substrate.

Distributed mailboxes

If users' mailboxes are stored in the distributed transactional store along with the mail queues, then users will have access to their email as long as a majority of the cluster is up. This probably requires either a POP or IMAP program with access to the transactional store.

Doing this makes it possible for users to continue to receive mail and to access their mail as long as the majority of the cluster is online and connected.

IMAP and POP programs need to be careful not to delete undelivered mail in case of broken connections.

Mailing list managers

If mailing list subscription lists are stored in the distributed transactional store, then delivery of a mailing-list message can be a transaction within the transactional store: given a message in the local delivery queue and a list of email addresses, put several copies of the message into the incoming queue and delete the message from the local delivery queue. The transaction can be aborted if connectivity to the rest of the cluster is interrupted; it will then be retried on some other cluster node that has the same mailing list software installed.

This also allows subscription and unsubscription to happen within the transactional store.

Distributed store guarantees

Any up-to-date node has a full replica of the distributed store. Even if half of the cluster but one machine goes down at once, and half of the cluster but one machine comes up at the same time, the one machine in common between the two halves will reliably replicate all the information in the distributed store to all the new machines that have come up.

Inactivity and recovery

When a node can no longer contact a quorum of the cluster, it becomes inactive; it ceases to listen for SMTP connections and participate in transactions, and if possible, the DNS server should remove its MX record. (This is an optimization that will prevent incoming mailers from wasting their time trying to talk to inactive nodes; it is not essential to the correct functioning of the system. If it is applied, DNS TTLs should be kept low to make it effective.) The inactive node continues to update its local replica of the data store with transactions from whatever nodes it is able to contact, active or otherwise. It cannot become active unless its replica is up-to-date (has been updated with all the transactions that have committed) and it can again contact a quorum of the cluster.

When a cluster node becomes inactive, it must abort any SMTP conversations in which it is participating on behalf of the cluster.

If it is receiving mail over SMTP and hasn't committed the mail before becoming inactive, it must respond to the CRLF.CRLF terminating the DATA command with a temporary failure error code or by closing the connection, causing the sender-SMTP to retry the mail later.

If it is sending mail over SMTP, it should RSET and QUIT unless it's currently in the middle of sending DATA; if it's in the middle of sending DATA, it should send the rest of the message, but then it should close the connection without sending the terminating CRLF.CRLF. The active nodes in the cluster will retry sending later. (In fact, they may already be retrying.) I believe there may exist misguided MTAs that will deliver mail whose transmission is aborted, and while this measure can't prevent mail duplication when dealing with these MTAs, it can at least reduce the incidence of mail truncation.

Inactive cluster nodes should not listen on port 25 in order to direct incoming mail to active cluster nodes. At least Postfix, and I believe all MTAs, treat TCP connection refusal as a very temporary error condition and immediately try other MX records.

I don't yet know how to tell if a node's replica is up-to-date, and I don't know how the cluster can initially become active or reactivate after a major catastrophe that caused all nodes to become inactive simultaneously.

Nodes rapidly cycling between being active and being inactive do not cause any special problems.

Inactive nodes are included in the cluster size for the purpose of counting quora; this means that if more than half of the nodes in the cluster are ever disconnected, then the whole cluster becomes inactive.

Generally processes on the cluster nodes will be able to submit mail for later delivery even when the node they are running on is inactive, both for the sake of backward compatibility and because refusing the mail probably wouldn't make anything more reliable; they probably don't have any safer place to keep the mail than in the maildrop or any more effective way of getting it sent than putting it in the maildrop.

Adding and removing nodes

Adding a new node requires telling the cluster about the new node and turning it on. It joins as an inactive node. Once it is up-to-date, it can become active.

Removing a node from the cluster is straightforward as long as at least one other node in the cluster is up-to-date. You turn it off and tell the cluster that it is no longer a member.

Commit protocol

The system uses simple two-phase commit to commit transactions to the data store. Two-phase commit works as follows: in the first phase, the transaction manager contacts each of the parties in the transaction to ask whether they are willing to commit the transaction or not. In the second phase, if all of them are willing, it then tells them all to do so; but if one or more of them are not willing, it tells them all not to do so.

Two-phase commit has some problems:

if the transaction manager loses contact with one of the participants during phase 1, it won't know whether that participant was willing or not (or even got the message), leaving the transaction stuck in limbo for everybody.
if the transaction manager loses contact with one of the participants during phase 2, that participant won't know if the transaction committed or aborted, leaving the transaction stuck in limbo for that participant.

Any application of two-phase commit must therefore have a limbo-handling strategy. The limbo-handling strategy the distributed mailserver uses is as follows:

In the first case of limbo, after a short timeout (a couple of minutes), the transaction is aborted and no changes are made to anyone's replica; the lost participant will eventually reconnect and find out that it was aborted.

In the second case of limbo, after the same short timeout, the transaction is partly aborted. Any insertions made during the transaction are committed, but any deletions are aborted. This is informed by the principle that duplicate mail is better than delayed mail.

Security

Since any active node in the cluster can mutate any data in the transactional store, all mail is vulnerable to being maliciously corrupted, deleted, or misdelivered by any of them.

For maximum security, it is recommended that all of the nodes run on software configurations as similar as possible; three instances of the same vulnerability are no worse than one, but three different vulnerabilities are far worse than one.

Of course, this conflicts with the recommendation for maximal resilience against unintentional failure. Some clusters will need to be optimized for maximal security, and others will need to be optimized for resilience against unintentional failure.

In the common event that the nodes in the cluster are under separate administrative control, the administrators of the nodes should be aware they are trusting one another with their email.

Performance

The system hasn't yet been implemented, so its performance is unknown. I expect that it will perform substantially worse than comparable systems that do not provide geographically-distributed clustering; it will, of necessity, use at least twice the Internet-link bandwidth.

Operational issues

A network-distributed mailserver will necessarily be more complex to diagnose problems in and maintain than a single-machine mailserver. In environments where the majority of failures are due to operational problems (not operational failure to prevent hardware problems, but operational problems like setting the mailserver to bounce mail to your domain), using a clustered mailserver will make your email less reliable, not more, particularly if the cluster nodes are in different administrative domains. Storing the cluster's configuration information in the distributed store will keep this from becoming a big problem, but it can still be a problem.

DNS administrators have been familiar with these problems for a long time, but administering DNS is simpler than administering email, so I expect the problems to be worse.

Implementation

The distributed mailserver is not yet implemented. The planned implementation consists of the distributed store, a modification of Postfix to store its queues in the store, and a modification of Mailman that stores its membership data in the store.

Related Work

Stalker's CommuniGate Pro supports clustering for high availability and load-sharing, in two different configurations; "static clusters" divide the accounts in a domain between servers, and "dynamic clusters" use shared storage. A large CommuniGate Pro license costs US$30 000 as of this writing.

iPlanet's Sun Internet Mail Server likewise supports high-availability failover with shared storage. An enterprise SIMS 3.5 license cost US$3500 in 1999, but I can't find more recent pricing information.

A review of Cobalt's StaQWare says it adds system-level hot backup and failover with peer-to-peer disk mirroring over a private LAN for applications on Cobalt RaQ3is, including mail server applications; you can do similar things with Linux's NBD and kernel software RAID-1 support. This is a shared-nothing failover solution. The review also mentions Network Disk Mirror. Linux-HA.org also mentions some other generic hot-backup and failover solutions.

None of these products, as far as I know, support geographically-distributed clusters.

The Transis research group built a weakly-consistent distributed mailserver in 1993 and wrote a paper on it entitled A Highly Available Application in the Transis Environment (alternate link).