livebackup is a partial backup system. I call it `partial' because it addresses only one of the desires backups typically address. I know of three reasons backups are typically kept:

- "Oops, I didn't mean to delete that!"
- "What was last month's version of this file?"
- "Oh no, this disk just died!"

livebackup addresses only the last of those, though, with a little external help, it can to some extent address the other two. (If you are already using RAID for disk redundancy, you may have little to no use for livebackup. But you also may find it useful, especially the network-remote aspects.)

It works in practice, on at least a geek's-home-network or small-office-LAN scale. I use it for my own machines' backups (currently somewhere between ten and twenty machines, depending on how live a machine needs to be to count). I've had multiple disks die since I've been using it, and, in every case, all the bits have been safe.

It works at the disk-partition level. In particular, it backs up the whole partition. This has good aspects and bad aspects, of course. Good aspects: it is filesystem-agnostic; you can use it to back up any filesystem type, or even special-purpose partitions like databases, and it won't know or care. And the amount of space used by the backup is predictable and won't change with writes to the disk - it takes a partition resize to change it. Bad aspects: it is filesystem-agnostic; it backs up even space not currently in use by the filesystem, because it doesn't know the difference, and to get something out of the backup you have to get the backup onto the same machine as (or at least accessible from the same machine as) software that understands it. (For example, if you're backing up an FFS partition, you need a machine with FFS support in order to read the backup.)

It's essentially a live mirror of the affected disk partition over the network. On most machines, the network is slower than the disk; that is, the throughput of writing to disk is substantially higher than the throughput of sending data over the network. This means that we have a classic "fast producer, slow consumer" problem, at least in the short term. (Most machines' writes to disk are very bursty, with very short, very fast bursts mixed with comparatively long periods of inactivity. livebackup is not suitable for a partition that is being run full throttle for writes all, or most, of the time; if the long-term mean write throughput is higher than network throughput, livebackup will work poorly to not at all.)

There are basically three ways of handling the overruns that a fast producer and a slow consumer combine to produce: you can throttle, exerting back-pressure that forces the producer to wait; you can drop data; or you can figure out some way to shrink each write. Dropping data is obviously undesirable for a backup system. Throttling would work, but I don't like the idea of slowing my disks down to the speed of my network. So I found a way to write less.

livebackup handles this with buffering and two levels of behaviour. In normal operation, writes are mirrored live in the order they occur, with buffering at various places, adding up to a few megabytes total. (This means, in particular, that filesystem consistency promises are just as valid for the backup as for the original.)
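To make the two-level behaviour concrete, here is a minimal sketch of the idea in C. It is not livebackup's actual code: the names, the block and buffer sizes, and the notify_write()/consumer_step() entry points are my own assumptions, and it ignores locking and the kernel/userland split between diskwatch and lb. In normal mode each write is queued, contents and all, in order; on overflow it falls back to remembering only which blocks were touched (the degraded mode described in more detail just below), and the catch-up pass re-reads those blocks from the live partition.

    /* A minimal sketch, not livebackup's actual code: names, sizes, and
     * entry points here are illustrative assumptions. */
    #include <stdint.h>
    #include <string.h>

    #define BLKSIZE  65536                /* assumed unit of mirroring */
    #define NREC     64                   /* normal-mode queue: a few MB of buffering */
    #define NBLOCKS  (1u << 20)           /* example partition size, in blocks */

    struct rec {                          /* one normal-mode write record */
        uint64_t blkno;
        unsigned char data[BLKSIZE];
    };

    static struct rec queue[NREC];        /* in-order queue of pending writes */
    static unsigned qhead, qtail;         /* consumer / producer indices */
    static unsigned char dirty[NBLOCKS/8];/* degraded mode: which blocks changed */
    static int degraded;                  /* 0 = normal, 1 = degraded */

    /* Stand-ins for the real disk and network I/O. */
    static void read_live_block(uint64_t blkno, unsigned char *buf)
    { (void)blkno; memset(buf, 0, BLKSIZE); }
    static void ship_block(uint64_t blkno, const unsigned char *buf)
    { (void)blkno; (void)buf; }

    /* Called on every write to the watched partition. */
    void notify_write(uint64_t blkno, const unsigned char *data)
    {
        if (!degraded) {
            unsigned next = (qtail + 1) % NREC;
            if (next != qhead) {          /* room: keep ordering and contents */
                queue[qtail].blkno = blkno;
                memcpy(queue[qtail].data, data, BLKSIZE);
                qtail = next;
                return;
            }
            degraded = 1;                 /* overflow: fall back to degraded mode */
        }
        dirty[blkno/8] |= 1u << (blkno%8);/* remember only *which* block was written */
    }

    /* Consumer side: ship queued records in order; after a burst, copy the
     * dirty blocks from the live partition and return to normal operation. */
    void consumer_step(void)
    {
        unsigned char buf[BLKSIZE];
        uint64_t b;

        while (qhead != qtail) {          /* normal mode: contents, in order */
            ship_block(queue[qhead].blkno, queue[qhead].data);
            qhead = (qhead + 1) % NREC;
        }
        if (degraded) {                   /* catch-up: write ordering not preserved */
            for (b = 0; b < NBLOCKS; b++) {
                if (dirty[b/8] & (1u << (b%8))) {
                    read_live_block(b, buf);
                    ship_block(b, buf);
                    dirty[b/8] &= ~(1u << (b%8));
                }
            }
            degraded = 0;
        }
    }

This is the sense in which degraded mode "writes less": a record of which blocks changed is bounded by the size of the partition, no matter how large the burst.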
If too much is written too fast, diskwatch falls back to a degraded mode, wherein it just records which blocks are getting written, but not their contents; once the burst ends, it copies those blocks from the live system to the backup and, once that completes, returns to normal operation. For the duration of the catch-up operation - from the time of going to degraded mode to the time the catch-up completes - filesystem consistency promises are weakened, since the ordering of writes to the original is not, in general, preserved when copying to the backup. (In a really extreme burst, when even degraded mode's buffers get overrun, diskwatch backs off to sending nothing to userland and livebackup starts over, doing a full rescan of the partition. The commonest case for this in my experience - and it's not really very common - is a large process dumping core.)

Personally, I have never had trouble with this, but I run FFS, which is particularly nice in these respects; even with write ordering lost, FFS will, in general, not corrupt anything not being actively written to. You should consider how the filesystem(s) in question respond to loss of write ordering when deciding whether livebackup is a good fit for your use case.

There are three major pieces to livebackup: diskwatch, lb, and lbd.

diskwatch is the kernel piece. It consists of a pseudo-device driver (diskwatch proper) and hooks in the various disk drivers to call into diskwatch as needed. diskwatch then handles interfacing to lb. I have put this into the kernels for the versions I run. It probably would be relatively simple for someone familiar with a different kernel to adapt it to that kernel; most Unix variants with monolithic kernels would probably be amenable to it. It might be adaptable to other paradigms; for example, it might be possible to insert it as a shim layer in an OS with stacking filesystems. I am not aware of any such attempts, though.

lb is the client. It runs on the machine being backed up, one instance per affected partition. It talks to diskwatch, in the kernel, to get updates as disk writes happen, and talks over the network, to lbd, to actually write the backup. lb must be run as root, or at least as a user who has access to the raw disk devices and the diskwatch devices.

lbd is the server. It runs on the machine where the backups are kept. It talks over the network to lb (potentially, multiple lb instances), managing the backup images. lbd does not depend on any particular kernel support; I have on a few occasions run lbd instances on entirely stock systems. lbd also does not need any particular privilege, at least unless it is configured to listen on a privileged port.

All communication between lb and lbd is normally encrypted; the protocol is designed for use over untrusted long-haul networks such as the open Internet. Encryption is based on a shared secret, which is leveraged both to derive nonce encryption keys and to verify that each end is talking to whom it thinks it is. (I say it is `normally' encrypted because there is a mode, designed for use over private or otherwise secure networks, that doesn't bother with encryption.) Everything is carried over a single TCP connection, initiated by lb. It is NAT-tolerant; neither end cares whether its idea of the other end's address matches the other end's idea of it. lbd has to be able to receive incoming connections, but they can be NAT-mapped. If lb loses its connection to lbd, it retries every minute or two until it succeeds in re-connecting.
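For a picture of how a shared secret can serve both purposes mentioned above - deriving per-connection keys and verifying that each end is talking to whom it thinks it is - here is a sketch in C using OpenSSL. To be clear, this is not livebackup's actual wire protocol: the use of HMAC-SHA256, the challenge sizes, the role labels, and the key-derivation layout are illustrative assumptions only.

    /* Sketch only: the primitives and message layout are assumptions, not
     * livebackup's actual protocol. Requires OpenSSL (link with -lcrypto). */
    #include <string.h>
    #include <openssl/crypto.h>   /* CRYPTO_memcmp */
    #include <openssl/evp.h>      /* EVP_sha256 */
    #include <openssl/hmac.h>     /* HMAC */
    #include <openssl/rand.h>     /* RAND_bytes */

    #define KEYLEN 32             /* shared secret; length assumed here */
    #define CHLEN  32             /* random per-connection challenge */
    #define MACLEN 32             /* HMAC-SHA256 output */

    /* Generate a fresh random challenge for this connection. */
    int new_challenge(unsigned char *ch)
    {
        return RAND_bytes(ch, CHLEN) == 1;
    }

    /* Prove knowledge of the shared secret: MAC the peer's challenge plus a
     * role label ("lb" or "lbd"), so a response can't just be reflected back. */
    static void prove(const unsigned char *key, const unsigned char *challenge,
                      const char *role, unsigned char *mac)
    {
        unsigned char msg[CHLEN + 8];
        unsigned int maclen;

        memcpy(msg, challenge, CHLEN);
        memcpy(msg + CHLEN, role, strlen(role));
        HMAC(EVP_sha256(), key, KEYLEN, msg, CHLEN + strlen(role), mac, &maclen);
    }

    /* Check the peer's proof: it should have MAC'd the challenge we sent it. */
    int verify_peer(const unsigned char *key, const unsigned char *my_challenge,
                    const char *peer_role, const unsigned char *peer_mac)
    {
        unsigned char expect[MACLEN];

        prove(key, my_challenge, peer_role, expect);
        return CRYPTO_memcmp(expect, peer_mac, MACLEN) == 0;
    }

    /* Derive a per-connection ("nonce") key from the secret and both challenges. */
    void session_key(const unsigned char *key, const unsigned char *ch_client,
                     const unsigned char *ch_server, unsigned char *out)
    {
        unsigned char msg[2 * CHLEN];
        unsigned int outlen;

        memcpy(msg, ch_client, CHLEN);
        memcpy(msg + CHLEN, ch_server, CHLEN);
        HMAC(EVP_sha256(), key, KEYLEN, msg, sizeof(msg), out, &outlen);
    }

The role label in the proof is there so that one side's response cannot simply be replayed back at it; any real protocol has to deal with that sort of reflection, but the details shown here are mine, not livebackup's.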
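On the server side, one way to picture "managing the backup images" is as a flat image file per client partition, the same size as that partition, with each arriving block written at its natural offset - consistent with the earlier point that the space a backup occupies is predictable and changes only with a partition resize. The sketch below shows that picture; the record layout and function names are assumptions, not lbd's actual image or wire format.

    /* Sketch only: keeping a partition-sized image file up to date from
     * (block number, contents) records; not lbd's actual format. */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    #define BLKSIZE 65536                    /* must match the sender's idea */

    /* Open (or create) the image file backing one client partition. */
    int image_open(const char *path, off_t partition_size)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0600);

        if (fd >= 0 && ftruncate(fd, partition_size) < 0) {
            close(fd);
            return -1;
        }
        return fd;                           /* image is as big as the partition */
    }

    /* Apply one record: write the block's contents at its natural offset. */
    int image_apply(int fd, uint64_t blkno, const unsigned char *data)
    {
        off_t off = (off_t)blkno * BLKSIZE;

        return pwrite(fd, data, BLKSIZE, off) == BLKSIZE ? 0 : -1;
    }

Getting data back out is then a matter of putting that image in front of something that understands the filesystem it contains, per the filesystem-agnostic caveat above.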
Normally, lbd and lb are started at boot time on the relevant machines, but this does not have to be so. lb does need to be told what port it should connect to in order to reach lbd; how such ports are selected is out of scope. In lb's case, the port is specified on the command line; in lbd's case, in its config file. A typical lb command line would look like

    lb 1 /dev/rwd0e 10.0.0.17 15440 /etc/backup.key

whereas an lbd run would look like

    lbd /etc/lbd.conf

with /etc/lbd.conf containing lines like

    client file=client1/wd0e key=client1/key listen=10.0.0.17/15440 type=simple

(/etc/backup.key on the client machine and client1/key on the server machine need to be set up with identical contents.) See the lb and lbd manpages for more.

Personally, for my own backups, I have a setup where most partitions are backed up to a central backup server, which keeps all its backup images in a filesystem occupying approximately all of a 1.8T disk ("2T", two artificially-shrunk disk-maker terabytes). I then have another lb/lbd setup backing that partition up onto another machine, to a filesystem on a "4T" disk; once a month, I bring that machine down and replace that drive. I keep the first-of-January drives forever (or, at least, I have so far); the rest I rotate through, with each month one drive going to storage and one returning from storage (or, for January, from new stock) to live use.

This is why I call livebackup a "partial" backup system: it provides an underlying mechanism, but it needs scripting and procedures wrapped around it to be a full backup system, and, arguably, it isn't even then because it doesn't really address those other two desires except at a fairly coarse temporal granularity.