GitTogether10/bup
OBSOLETE CONTENT
This wiki has been archived and the content is no longer updated. Please visit git-scm.com/doc for up-to-date documentation.
Bup - Git-based backup system
by Avery
Backup in O(n) where n = # of bytes changed
Backup entire filesystems
Direct backups to remotes with metadata
Rename handling
Deduplication
git gc explodes, git bigfiles was a lame solution and is now a dead fork
zlib's default window size is useless for large files, such as VM images and media files
bupsplit is a rolling checksum similar to the rsync algorithm
There is a 60-page PhD thesis describing the algorithm
gzip --rsyncable
Reset the gzip buffer at a well-known place
Every time we see a certain bit sequence, restart the window
Avg chunk size ~ 8kb
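A minimal sketch of the content-defined chunking idea behind bupsplit. This uses a simple additive rolling sum over a sliding window — bup's real rolling checksum differs — but the boundary rule is the same idea: cut a chunk whenever the low bits of the checksum hit a fixed pattern. The window size and bit count below are illustrative, chosen so the average chunk is ~8 KB.

```python
# Content-defined chunking sketch in the spirit of bupsplit (not bup's
# actual checksum). Boundaries depend only on nearby bytes, so inserting
# data near the front of a file only changes the chunks it touches.

WINDOW = 64          # bytes in the rolling window (assumed value)
CHUNK_BITS = 13      # cut when the low 13 bits are all ones -> avg ~8 KiB

def split_chunks(data: bytes):
    """Split `data` into chunks; boundaries depend only on local content."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]   # slide the window forward
        # Chunk boundary: low CHUNK_BITS of the rolling sum are all ones.
        if rolling & ((1 << CHUNK_BITS) - 1) == (1 << CHUNK_BITS) - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because boundaries are decided by local content rather than fixed offsets, appending or inserting bytes leaves all earlier chunks byte-identical — they hash to the same objects and deduplicate for free.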
Idea of nested chunks and super-chunks, named after their offset within the parent chunk
Tree of trees of blobs -> Tree of (Tree of ...) ... of blobs
We get a O(log n) tree of smallish objects
No delta-ification
Doesn't care about file formats
Diffs are O(log n)
Seek is O(log n)
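The tree-of-trees structure can be sketched as follows — group chunks into super-chunks recursively, recording each child's starting offset, so a byte offset is found by O(log n) descents. The fanout and node representation here are illustrative, not bup's on-disk format.

```python
# Sketch of the "tree of (tree of ...) of blobs" idea: each node is
# (start_offset, body), where body is either a chunk (bytes) or a list
# of child nodes. Fanout is an assumption for the demo.

FANOUT = 4  # children per node (bup derives grouping from checksum bits)

def build_tree(chunks):
    """chunks: list of byte strings -> nested (offset, node) tree."""
    nodes = []
    off = 0
    for c in chunks:                      # leaves carry their byte offset
        nodes.append((off, c))
        off += len(c)
    while len(nodes) > 1:                 # group until a single root remains
        nodes = [(group[0][0], group)
                 for group in (nodes[i:i + FANOUT]
                               for i in range(0, len(nodes), FANOUT))]
    return nodes[0]

def seek(node, offset):
    """Return (chunk, offset_within_chunk): O(log n) descents."""
    start, body = node
    while isinstance(body, list):
        # Descend into the last child starting at or before `offset`.
        start, body = max((c for c in body if c[0] <= offset),
                          key=lambda c: c[0])
    return body, offset - start
```

Diffing two such trees is also O(log n): identical subtrees share hashes and are skipped without descending.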
bupfuse is a readonly FUSE module to mount the bup index
When you have lots of files, the .git/index is dog slow, since it is a linear set of files + hashes and has no binary search. Git rewrites the file for every added file.
.bup/bupindex:
- Tree structure with O(log n) seeks
- Can update entries without rewriting
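A sketch of why a tree-friendly, fixed-record layout helps: fixed-size records let one entry be updated with a seek+write instead of rewriting the whole file, and a sorted layout gives O(log n) lookups. The record format below is invented for illustration; it is not the real bupindex format.

```python
# Fixed-size, sorted records: in-place updates + binary-searchable,
# unlike a linear index that must be rewritten on every change.
import io
import struct

REC = struct.Struct("<20sQ")   # 20-byte hash + 64-bit metadata (assumed)

def write_index(f, entries):
    for sha, meta in sorted(entries):        # sorted -> binary searchable
        f.write(REC.pack(sha, meta))

def update_entry(f, index_pos, sha, meta):
    """Overwrite record #index_pos in place; nothing else is rewritten."""
    f.seek(index_pos * REC.size)
    f.write(REC.pack(sha, meta))

def lookup(f, nrecords, sha):
    """Binary search the sorted records for `sha`: O(log n) seeks."""
    lo, hi = 0, nrecords
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * REC.size)
        rec_sha, meta = REC.unpack(f.read(REC.size))
        if rec_sha == sha:
            return mid, meta
        if rec_sha < sha:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Contrast with .git/index: a variable-length linear list forces a full rewrite per added file, which is what makes it dog slow at millions of entries.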
Problem: Millions of Objects
- An 8-bit prefix lookup table alone is not enough -> very slow
.midx - merged idx files
log2(2.4G) ≈ 31 bits
Only has to binary search the last 7 bits, dirties far fewer pages
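The lookup strategy can be sketched like this: a prefix fan-out table (git-style cumulative counts) narrows a sorted hash list to one small bucket, so the binary search only resolves the remaining bits and touches far fewer pages. An 8-bit prefix is used here to keep the demo small; real .midx files use a wider one.

```python
# Prefix fan-out + binary search over a sorted hash list, in the style
# of git/.midx index files. Prefix width here is illustrative.
import bisect

def build_fanout(sorted_hashes):
    """fanout[p] = number of hashes whose first byte is <= p."""
    fanout = [0] * 256
    for h in sorted_hashes:
        fanout[h[0]] += 1
    total = 0
    for p in range(256):
        total += fanout[p]
        fanout[p] = total
    return fanout

def lookup(sorted_hashes, fanout, target):
    """Binary search only within the target's prefix bucket."""
    p = target[0]
    lo = fanout[p - 1] if p > 0 else 0
    hi = fanout[p]
    i = bisect.bisect_left(sorted_hashes, target, lo, hi)
    if i < hi and sorted_hashes[i] == target:
        return i
    return None
```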
Bloom filters could be used to optimize this even further. Bloom filters are probabilistic but have much better average complexity. They are used by the Plan 9 OS in its filesystem
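A minimal Bloom filter sketch: k hash probes into a bit array answer "definitely absent" or "probably present", so most lookups for missing objects skip the binary search entirely. The sizes and the probe derivation (slicing one SHA-1 digest) are illustrative choices, not bup's parameters.

```python
# Minimal Bloom filter: set k bits per key on add; membership test says
# "no" with certainty or "yes" with a small false-positive probability.
import hashlib

class BloomFilter:
    def __init__(self, nbits=1 << 16, nhashes=4):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = bytearray(nbits // 8)

    def _probes(self, key: bytes):
        # Derive k bit positions by slicing one SHA-1 digest (assumption).
        digest = hashlib.sha1(key).digest()
        for i in range(self.nhashes):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.nbits

    def add(self, key: bytes):
        for pos in self._probes(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: bytes):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._probes(key))
```

The trade-off: a false positive only costs a redundant index lookup, while a true negative (the common case when checking whether an object already exists) is answered from a few bit tests.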
Problem: Direct remote backup
The git transfer protocol is very inefficient for large files; bup only transmits what is necessary.
tar -cvf - mydata | bup split -n mybackup
bup join mybackup | tar -xf -
bup index + bup save are roughly similar to git add
bup restore ~ git checkout
Pruning is a future optimization
BitTorrent-like algorithms could greatly benefit from something like bup -> use already downloaded files to seed larger collections that contain those files