livebackup is a partial backup system. I call it `partial' because it addresses only one of the desires backups typically address. I know of three reasons backups are typically kept:

- "Oops, I didn't mean to delete that!"
- "What was last month's version of this file?"
- "Oh no, this disk just died!"

livebackup addresses only the last of those, though, with a little external help, it can to some extent address the other two. (If you are already using RAID for disk redundancy, you may have little to no use for livebackup. But you also may find it useful, especially the network-remote aspects.)

It works in practice, on at least a geek's-home-network or small-office-LAN scale. I use it for my own machines' backups (currently somewhere between ten and twenty machines, depending on how live a machine needs to be to count). I've had multiple disks die since I've been using it, and, in every case, all the bits have been safe.

It works at the disk-partition level. In particular, it backs up the whole partition. This has good aspects and bad aspects, of course. Good aspects: it is filesystem-agnostic; you can use it to back up any filesystem type, or even special-purpose partitions like databases, and it won't know or care. And the amount of space used by the backup is predictable and won't change with writes to the disk - it takes a partition resize to change it. Bad aspects: it is filesystem-agnostic; it backs up even space not currently in use by the filesystem, because it doesn't know the difference, and to get something out of the backup you have to get the backup onto the same machine as (or at least accessible from the same machine as) software that understands it. (For example, if you're backing up an FFS partition, you need a machine with FFS support in order to read the backup.)

It's essentially a live mirror of the affected disk partition over the network. On most machines, the network is slower than the disk; that is, the throughput of writing to disk is substantially higher than the throughput of sending data over the network. This means that we have a classic "fast producer, slow consumer" problem, at least in the short term. (Most machines' writes to disk are very bursty, with very short, very fast bursts mixed with comparatively long periods of inactivity. livebackup is not suitable for a partition that is being run full throttle for writes all, or most, of the time; if the long-term mean write throughput is higher than network throughput, livebackup will work poorly to not at all.)

There are basically three ways of handling the overruns that a fast producer and a slow consumer combine to produce: you can throttle, exerting back-pressure that forces the producer to wait; you can drop data; or you can figure out some way to shrink each write. Dropping data is obviously undesirable for a backup system. Throttling would work, but I don't like the idea of slowing my disks down to the speed of my network. So I found a way to write less.

livebackup handles this with buffering and two levels of behaviour. In normal operation, writes are mirrored live in the order they occur, with buffering at various places, adding up to a few megabytes total. (This means, in particular, that filesystem consistency promises are just as valid for the backup as for the original.)
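To make the two-level behaviour concrete, here is a minimal sketch of the idea in C. It is not livebackup's actual code: the names, the block and buffer sizes, and the notify_write()/consumer_step() entry points are my own assumptions, and it ignores locking and the kernel/userland split between diskwatch and lb. In normal mode each write is queued, contents and all, in order; on overflow it falls back to remembering only which blocks were touched (the degraded mode described in more detail just below), and the catch-up pass re-reads those blocks from the live partition.

    /* A minimal sketch, not livebackup's actual code: names, sizes, and
     * entry points here are illustrative assumptions. */
    #include <stdint.h>
    #include <string.h>

    #define BLKSIZE  65536                /* assumed unit of mirroring */
    #define NREC     64                   /* normal-mode queue: a few MB of buffering */
    #define NBLOCKS  (1u << 20)           /* example partition size, in blocks */

    struct rec {                          /* one normal-mode write record */
        uint64_t blkno;
        unsigned char data[BLKSIZE];
    };

    static struct rec queue[NREC];        /* in-order queue of pending writes */
    static unsigned qhead, qtail;         /* consumer / producer indices */
    static unsigned char dirty[NBLOCKS/8];/* degraded mode: which blocks changed */
    static int degraded;                  /* 0 = normal, 1 = degraded */

    /* Stand-ins for the real disk and network I/O. */
    static void read_live_block(uint64_t blkno, unsigned char *buf)
    { (void)blkno; memset(buf, 0, BLKSIZE); }
    static void ship_block(uint64_t blkno, const unsigned char *buf)
    { (void)blkno; (void)buf; }

    /* Called on every write to the watched partition. */
    void notify_write(uint64_t blkno, const unsigned char *data)
    {
        if (!degraded) {
            unsigned next = (qtail + 1) % NREC;
            if (next != qhead) {          /* room: keep ordering and contents */
                queue[qtail].blkno = blkno;
                memcpy(queue[qtail].data, data, BLKSIZE);
                qtail = next;
                return;
            }
            degraded = 1;                 /* overflow: fall back to degraded mode */
        }
        dirty[blkno/8] |= 1u << (blkno%8);/* remember only *which* block was written */
    }

    /* Consumer side: ship queued records in order; after a burst, copy the
     * dirty blocks from the live partition and return to normal operation. */
    void consumer_step(void)
    {
        unsigned char buf[BLKSIZE];
        uint64_t b;

        while (qhead != qtail) {          /* normal mode: contents, in order */
            ship_block(queue[qhead].blkno, queue[qhead].data);
            qhead = (qhead + 1) % NREC;
        }
        if (degraded) {                   /* catch-up: write ordering not preserved */
            for (b = 0; b < NBLOCKS; b++) {
                if (dirty[b/8] & (1u << (b%8))) {
                    read_live_block(b, buf);
                    ship_block(b, buf);
                    dirty[b/8] &= ~(1u << (b%8));
                }
            }
            degraded = 0;
        }
    }

This is the sense in which degraded mode "writes less": a record of which blocks changed is bounded by the size of the partition, no matter how large the burst.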
If too much is written too fast, diskwatch falls back to a degraded mode, wherein it just records which blocks are getting written, but not their contents; once the burst ends, it copies those blocks from the live system to the backup and, once that completes, returns to normal operation. For the duration of the catch-up operation - from the time of going to degraded mode to the time the catch-up completes - filesystem consistency promises are weakened, since the ordering of writes to the original is not, in general, preserved when copying to the backup. (In a really extreme burst, when even degraded mode's buffers get overrun, diskwatch backs off to sending nothing to userland and livebackup starts over, doing a full rescan of the partition. The commonest case for this in my experience - and it's not really very common - is a large process dumping core.)

Personally, I have never had trouble with this, but I run FFS, which is particularly nice in these respects; even with write ordering lost, FFS will, in general, not corrupt anything not being actively written to. You should consider how the filesystem(s) in question respond to loss of write ordering when deciding whether livebackup is a good fit for your use case.

There are three major pieces to livebackup: diskwatch, lb, and lbd.

diskwatch is the kernel piece. It consists of a pseudo-device driver (diskwatch proper) and hooks in the various disk drivers to call into diskwatch as needed. diskwatch then handles interfacing to lb. I have put this into the kernels for the versions I run. It probably would be relatively simple for someone familiar with a different kernel to adapt it to that kernel; most Unix variants with monolithic kernels would probably be amenable to it. It might be adaptable to other paradigms; for example, it might be possible to insert it as a shim layer in an OS with stacking filesystems. I am not aware of any such attempts, though.

lb is the client. It runs on the machine being backed up, one instance per affected partition. It talks to diskwatch, in the kernel, to get updates as disk writes happen, and talks over the network, to lbd, to actually write the backup. lb must be run as root, or at least as a user who has access to the raw disk devices and the diskwatch devices.

lbd is the server. It runs on the machine where the backups are kept. It talks over the network to lb (potentially, multiple lb instances), managing the backup images. lbd does not depend on any particular kernel support; I have on a few occasions run lbd instances on entirely stock systems. lbd also does not need any particular privilege, at least unless it is configured to listen on a privileged port.

All communication between lb and lbd is normally encrypted; the protocol is designed for use over untrusted long-haul networks such as the open Internet. Encryption is based on a shared secret, which is leveraged both to derive nonce encryption keys and to verify that each end is talking to whom it thinks it is. (I say it is `normally' encrypted because there is a mode, designed for use over private or otherwise secure networks, that doesn't bother with encryption.) Everything is carried over a single TCP connection, initiated by lb. It is NAT-tolerant; neither end cares whether its idea of the other end's address matches the other end's idea of it. lbd has to be able to receive incoming connections, but they can be NAT-mapped. If lb loses its connection to lbd, it retries every minute or two until it succeeds in re-connecting.
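For a picture of how a shared secret can serve both purposes mentioned above - deriving per-connection keys and verifying that each end is talking to whom it thinks it is - here is a sketch in C using OpenSSL. To be clear, this is not livebackup's actual wire protocol: the use of HMAC-SHA256, the challenge sizes, the role labels, and the key-derivation layout are illustrative assumptions only.

    /* Sketch only: the primitives and message layout are assumptions, not
     * livebackup's actual protocol. Requires OpenSSL (link with -lcrypto). */
    #include <string.h>
    #include <openssl/crypto.h>   /* CRYPTO_memcmp */
    #include <openssl/evp.h>      /* EVP_sha256 */
    #include <openssl/hmac.h>     /* HMAC */
    #include <openssl/rand.h>     /* RAND_bytes */

    #define KEYLEN 32             /* shared secret; length assumed here */
    #define CHLEN  32             /* random per-connection challenge */
    #define MACLEN 32             /* HMAC-SHA256 output */

    /* Generate a fresh random challenge for this connection. */
    int new_challenge(unsigned char *ch)
    {
        return RAND_bytes(ch, CHLEN) == 1;
    }

    /* Prove knowledge of the shared secret: MAC the peer's challenge plus a
     * role label ("lb" or "lbd"), so a response can't just be reflected back. */
    static void prove(const unsigned char *key, const unsigned char *challenge,
                      const char *role, unsigned char *mac)
    {
        unsigned char msg[CHLEN + 8];
        unsigned int maclen;

        memcpy(msg, challenge, CHLEN);
        memcpy(msg + CHLEN, role, strlen(role));
        HMAC(EVP_sha256(), key, KEYLEN, msg, CHLEN + strlen(role), mac, &maclen);
    }

    /* Check the peer's proof: it should have MAC'd the challenge we sent it. */
    int verify_peer(const unsigned char *key, const unsigned char *my_challenge,
                    const char *peer_role, const unsigned char *peer_mac)
    {
        unsigned char expect[MACLEN];

        prove(key, my_challenge, peer_role, expect);
        return CRYPTO_memcmp(expect, peer_mac, MACLEN) == 0;
    }

    /* Derive a per-connection ("nonce") key from the secret and both challenges. */
    void session_key(const unsigned char *key, const unsigned char *ch_client,
                     const unsigned char *ch_server, unsigned char *out)
    {
        unsigned char msg[2 * CHLEN];
        unsigned int outlen;

        memcpy(msg, ch_client, CHLEN);
        memcpy(msg + CHLEN, ch_server, CHLEN);
        HMAC(EVP_sha256(), key, KEYLEN, msg, sizeof(msg), out, &outlen);
    }

The role label in the proof is there so that one side's response cannot simply be replayed back at it; any real protocol has to deal with that sort of reflection, but the details shown here are mine, not livebackup's.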
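On the server side, one way to picture "managing the backup images" is as a flat image file per client partition, the same size as that partition, with each arriving block written at its natural offset - consistent with the earlier point that the space a backup occupies is predictable and changes only with a partition resize. The sketch below shows that picture; the record layout and function names are assumptions, not lbd's actual image or wire format.

    /* Sketch only: keeping a partition-sized image file up to date from
     * (block number, contents) records; not lbd's actual format. */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    #define BLKSIZE 65536                    /* must match the sender's idea */

    /* Open (or create) the image file backing one client partition. */
    int image_open(const char *path, off_t partition_size)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0600);

        if (fd >= 0 && ftruncate(fd, partition_size) < 0) {
            close(fd);
            return -1;
        }
        return fd;                           /* image is as big as the partition */
    }

    /* Apply one record: write the block's contents at its natural offset. */
    int image_apply(int fd, uint64_t blkno, const unsigned char *data)
    {
        off_t off = (off_t)blkno * BLKSIZE;

        return pwrite(fd, data, BLKSIZE, off) == BLKSIZE ? 0 : -1;
    }

Getting data back out is then a matter of putting that image in front of something that understands the filesystem it contains, per the filesystem-agnostic caveat above.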
Normally, lbd and lb are started at boot time on the relevant machines, but this does not have to be so. lb does need to be told what port it should connect to in order to reach lbd; how such ports are selected is out of scope. In lb's case, the port is specified on the command line; in lbd's case, in its config file. A typical lb command line would look like

    lb 1 /dev/rwd0e 10.0.0.17 15440 /etc/backup.key

whereas an lbd run would look like

    lbd /etc/lbd.conf

with /etc/lbd.conf containing lines like

    client file=client1/wd0e key=client1/key listen=10.0.0.17/15440 type=simple

(/etc/backup.key on the client machine and client1/key on the server machine need to be set up with identical contents.) See the lb and lbd manpages for more.

Personally, for my own backups, I have a setup where most partitions are backed up to a central backup server, which keeps all its backup images in a filesystem occupying approximately all of a 1.8T disk ("2T", two artificially-shrunk disk-maker terabytes). I then have another lb/lbd setup backing that partition up onto another machine, to a filesystem on a "4T" disk; once a month, I bring that machine down and replace that drive. I keep the first-of-January drives forever (or, at least, I have so far); the rest I rotate through, with each month one drive going to storage and one returning from storage (or, for January, from new stock) to live use.

This is why I call livebackup a "partial" backup system: it provides an underlying mechanism, but it needs scripting and procedures wrapped around it to be a full backup system, and, arguably, it isn't even then because it doesn't really address those other two desires except at a fairly coarse temporal granularity.