GitTogether10/bup



Bup - Git-based backup system

by Avery Pennarun

Backup in O(n) where n = # of bytes changed
Backup entire filesystems
Direct backups to remotes with metadata
Rename handling
Deduplication

git gc explodes on big files; git-bigfiles was a lame solution and is now a dead fork
zlib's default window size is useless for large files, such as VM images and media files

bupsplit is a rolling checksum similar to the one used by the rsync algorithm
There is a 60-page PhD thesis describing the algorithm
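In outline: the checksum is a pair of running sums over a fixed window, updated in O(1) as the window slides one byte. A minimal Python sketch; the window size and 16-bit sums are illustrative choices, not bup's exact code:

  WINDOW = 64  # bytes kept in the rolling window (illustrative constant)

  class Rollsum:
      def __init__(self):
          self.window = bytearray(WINDOW)
          self.pos = 0   # slot holding the oldest byte
          self.s1 = 0    # plain sum of the bytes in the window
          self.s2 = 0    # position-weighted sum, adler32-style

      def roll(self, byte):
          old = self.window[self.pos]
          self.window[self.pos] = byte
          self.pos = (self.pos + 1) % WINDOW
          # O(1) update: drop the oldest byte, add the newest one
          self.s1 = (self.s1 + byte - old) & 0xffff
          self.s2 = (self.s2 + self.s1 - WINDOW * old) & 0xffff

      def digest(self):
          return (self.s2 << 16) | self.s1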

gzip --rsyncable
Reset the gzip buffer at a well-known place
Every time we see a certain bit sequence, restart the window
Avg chunk size ~8 KB
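bup picks chunk boundaries the same way: cut wherever the low bits of the rolling digest match a fixed pattern, so boundaries depend only on nearby content and an insertion early in a file can't shift every later chunk. A sketch reusing the Rollsum class above (the 13-bit all-ones cut rule is an assumption chosen to match the ~8 KB average):

  CHUNK_BITS = 13                      # 2**13 = 8192 -> ~8 KB average chunk
  CHUNK_MASK = (1 << CHUNK_BITS) - 1

  def split_chunks(data):
      """Yield content-defined chunks: cut whenever the low CHUNK_BITS
      of the rolling digest are all ones."""
      rs = Rollsum()
      start = 0
      for i, byte in enumerate(data):
          rs.roll(byte)
          if rs.digest() & CHUNK_MASK == CHUNK_MASK:
              yield data[start:i + 1]
              start = i + 1
      if start < len(data):
          yield data[start:]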

Idea of nested chunks and super-chunks, which are named after their offset within the parent chunk
Tree of trees of blobs -> Tree of (Tree of ...) ... of blobs

We get an O(log n) tree of smallish objects
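Roughly how the levels collapse to a single root; a toy sketch (bup reportedly re-splits the stream of chunk ids itself rather than using a fixed fanout, which this simplifies away):

  import hashlib

  def build_tree(chunks, fanout=128):
      """Hash each chunk, then repeatedly group ids into parent nodes
      until one root remains: depth is O(log n) in the file size.
      Assumes at least one chunk."""
      level = [hashlib.sha1(c).digest() for c in chunks]
      while len(level) > 1:
          level = [hashlib.sha1(b''.join(level[i:i + fanout])).digest()
                   for i in range(0, len(level), fanout)]
      return level[0]  # root id of the tree of (tree of ...) of blobs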

No delta-ification
Doesn't care about file formats
Diffs are O(log n)
Seek is O(log n) (see the sketch below)
bupfuse is a read-only FUSE module to mount the bup index
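A seek only has to walk one path from the root. Sketch under an assumed in-memory layout (inner nodes as lists of (size, child) pairs, leaves as bytes; not bup's on-disk format):

  def seek(node, offset):
      """Descend one tree level per step: O(log n) nodes touched.
      Assumes 0 <= offset < total file size."""
      while not isinstance(node, bytes):   # inner node: [(size, child), ...]
          for size, child in node:
              if offset < size:
                  node = child
                  break
              offset -= size
      return node, offset                  # leaf chunk + offset inside it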

When you have lots of files, the .git/index is dog slow: it is a linear list of paths plus hashes with no binary search, and git rewrites the whole file for every added file.

.bup/bupindex:

  • Tree structure with O(log n) seeks
  • Can update entries without rewriting
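The win is O(1) in-place updates. A hypothetical fixed-width record format to show the mechanics (not the real bupindex layout):

  import struct

  # Hypothetical fixed-width entry: 20-byte sha1 + 8-byte mtime.
  ENTRY = struct.Struct('>20sQ')

  def update_entry(index_path, entry_no, sha1, mtime):
      """Overwrite one entry in place: O(1) I/O, no full-file rewrite,
      unlike .git/index which is rewritten on every change."""
      with open(index_path, 'r+b') as f:
          f.seek(entry_no * ENTRY.size)
          f.write(ENTRY.pack(sha1, mtime))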

Problem: Millions of Objects

  • Only having an 8-bit prefix lookup table is not enough -> very slow

.midx - merged .idx files
log2(2.4 GB) ≈ 31 bits
A wider prefix lookup table covers the top bits, so only the last ~7 bits have to be binary-searched, which dirties far fewer pages
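What the wider table buys: with ~31 bits of key space, a 24-bit prefix table (an assumed width, not bup's exact format) narrows each lookup to a ~7-bit range before the binary search starts:

  import bisect

  PREFIX_BITS = 24  # assumed table width; 31 - 24 leaves ~7 bits to search

  def lookup(hashes, fanout, h):
      """hashes: sorted list of 20-byte ids; fanout[p]: count of hashes
      whose top PREFIX_BITS are <= p. Binary search only the small
      [lo, hi) slice, touching far fewer pages than a full-range search."""
      p = int.from_bytes(h[:3], 'big')          # top 24 bits of the id
      lo = fanout[p - 1] if p > 0 else 0
      hi = fanout[p]
      i = bisect.bisect_left(hashes, h, lo, hi)
      return i if i < hi and hashes[i] == h else None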

Bloom filters could be used to optimize this even further. Bloom filters are probabilistic but have much better average complexity. They are used by the Plan 9 OS in its filesystem.
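The basic structure, as a minimal sketch (bit and hash counts here are arbitrary choices): a Bloom filter can say "definitely not present" without touching the index, and only false positives cost a real lookup.

  import hashlib

  class Bloom:
      """Probabilistic set: may say 'present' for an absent key (false
      positive) but never misses a key that was added."""
      def __init__(self, nbits=1 << 20, nhashes=4):
          self.nbits, self.nhashes = nbits, nhashes
          self.bits = bytearray(nbits // 8)

      def _positions(self, key):
          digest = hashlib.sha1(key).digest()
          for i in range(self.nhashes):
              yield int.from_bytes(digest[4*i:4*i + 4], 'big') % self.nbits

      def add(self, key):
          for p in self._positions(key):
              self.bits[p // 8] |= 1 << (p % 8)

      def __contains__(self, key):
          return all(self.bits[p // 8] & (1 << (p % 8))
                     for p in self._positions(key))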

Problem: Direct remote backup
The git transfer protocol is very inefficient for large files; bup only transmits what is necessary.

tar -cvf - /path/to/files | bup split -n mybackup
bup join mybackup | tar -xf -

bup index + bup save are roughly similar to git add

bup restore ~ git checkout

Pruning is a future optimization

BitTorrent-like algorithms could greatly benefit from something like bup -> use already-downloaded files to seed larger collections that contain those files
