Git

From Git SCM Wiki
Jump to: navigation, search

OBSOLETE CONTENT

This wiki has been archived and the content is no longer updated. Please visit git-scm.com/doc for up-to-date documentation.

Git is a modern distributed version control system focused on speed, efficiency and real-world usability on large projects. More generally, Git can serve as a general tool for directory content tracking.

Git was created by Linus Torvalds with assistance from a loosely-knit team of hackers across the Net. It is currently maintained by Junio C. Hamano. While Git has no official logo the community has adopted a few favorites.


Table of contents:

Contents


Design

Git was designed to be extremely fast (seconds per patch), distributed SCM with good support for nonlinear history (including working with multiple branches).

  • Pure content tracker. The main design idea of Git is "pure content tracking". Regardless of how you get to a particular state, git will consider that state 100% identical. Consequently, git separates the notion of history (which is purely a list of commits and their relationships), and the notion of place and name of file in filesystem (structure), from the notion of "content", which has no history nor structure in itself.
    • content is stored in blob objects.
    • history is stored in commit objects.
    • structure is stored in tree objects.
  • Storing states. Closely tied with "pure content tracking" is that Git versions whole contents of a project, i.e. revision/commit refers to the snapshot of the top directory tree. Git does not explicitly record file revision relationships at any level below the source code tree; content copying and moving between files (so called "rename detection") is done at runtime.
  • Index. In addition to object database (repository) Git has a mutable index that caches information about the working directory and the next revision to be committed, storing the information about "interesting" files. It serves as staging area to prepare commits, helps with merge conflicts resolving, and improves performance (and speed of operations is one of driving ideas of Git design).
  • Explicit packing. Deltafying in Git occurs at low level (plumbing level), and occurs on demand rather than implicitly (although porcelains may choose to do packing implicitly).

Git Features

Advantages

  • Strong support for non-linear development. Git supports rapid and convenient branching and merging, and includes powerful tools for visualizing and navigating a non-linear development history.
  • Distributed development. Like BitKeeper and SVK, Git gives each developer a local copy of the entire development history, and changes are copied from one such repository to another. These changes are imported as additional development branches, and can be merged in the same way as a locally developed branch. Repositories can be easily accessed via the efficient Git protocol (optionally wrapped in ssh) or simply using HTTP - you can publish your repository anywhere without any special webserver configuration required.
  • Efficient handling of large projects. Git is very fast and scales well even when working with large projects and long histories. It is commonly an order of magnitude faster than most other revision control systems, and several orders of magnitude faster on some operations. It also uses an extremely efficient packed format for long-term revision storage that currently tops any other open source version control system.
  • Cryptographic authentication of history. The Git history is stored in such a way that the name of a particular revision (a "commit" in Git terms) depends upon the complete development history leading up to that commit. Once it is published, it is not possible to change the old versions without it being noticed. Also, tags can be cryptographically signed.
  • Toolkit design. Following the Unix tradition, Git is a collection of many small tools written in C, and a number of scripts that provide convenient wrappers. It is easy to chain the components together to do other clever things.
  • Pluggable merge strategies. As part of its toolkit design, git has a well-defined model of an incomplete merge, and it has multiple algorithms for completing it, culminating in telling the user that it is unable to complete the merge automatically and manual editing is required. It is thus easy to experiment with new merge algorithms.

Besides providing a version control system, the Git project provides a generic low-level toolkit for tree history storage and directory content management. Traditionally, the toolkit is called the plumbing. Several other projects (so-called porcelains) offer compatible version control interfaces -- see the related tools page on Git Homepage and InterfacesFrontendsAndTools page on this wiki.

Design Consequences

Some might say that these are disadvantages

  • Renames are handled implicitly rather than explicitly. A major complaint with CVS is that it uses the name of a file to identify its revision history, so moving or renaming a file is not possible without either interrupting its history, or renaming the history and thereby making the history inaccurate. Most post-CVS revision control systems solve this by giving a file a unique long-lived name (a sort of inode number) that survives renaming. Git does not record such an identifier, and this is claimed as an advantage. Source code files are sometimes split or merged as well as simply renamed, and recording this as a simple rename would freeze an inaccurate description of what happened in the (immutable) history. Git addresses the issue by detecting renames while browsing the history of snapshots rather than recording it when making the snapshot. (Briefly, given a file in revision N, a file of the same name in revision N-1 is its default ancestor. However, when there is no like-named file in revision N-1, Git searches for a file that existed only in revision N-1 and is very similar to the new file.) However, it does require more CPU-intensive work every time history is reviewed, and a number of options to adjust the heuristics.
  • Periodic explicit object packing. Git stores each newly created object as a separate file. Although individually compressed, this takes a great deal of space and is inefficient. This is solved by the use of "packs" that store a large number of objects in a single file (or network byte stream), delta-compressed among themselves. Packs are compressed using the heuristic that files with the same name are probably similar, but do not depend on it for correctness. Newly created objects (newly added history) are still stored singly, and periodic repacking is required to maintain space efficiency.
  • Garbage accumulates unless collected. Aborting operations or backing out changes will leave useless dangling objects in the database. These are generally a small fraction of the continuously growing history of wanted objects, but checking for dangling objects using git-fsck-objects and reclaiming the space using git-prune can be slow. Pruning also breaks immutability of the database. Newer version of git feature a git-gc command which will combine the pruning and packing.

Implementation

Git has two data structures, a mutable index that caches information about the working directory and the next revision to be committed, and an immutable, append-only object database (repository) containing four types of objects:

  • A blob object is the content of a file. Blob objects have no names, timestamps, or other metadata.
  • A tree object is the equivalent of a (sub)directory: it contains a list of filenames, each with some type bits and the name of a blob or tree object that is that file, symbolic link, or directory's contents. This object describes a snapshot of the source tree.
  • A commit object links tree objects together into a history. It contains the name of a tree object (of the top-level source directory), a timestamp, a log message, and the names of zero or more parent commit objects.
  • A tag object is a container that contains reference to another object and can hold additional meta-data related to another object. Most commonly it is used to store a digital signature of a commit object corresponding to a particular release of the data being tracked by Git.

The object database can hold any kind of object. An intermediate layer, the index, serves as connection point between the object database and the working tree.

Each object is identified by a SHA1 hash of its contents. Git computes the hash, and uses this value for the object's name. The object is put into a directory matching the first two characters of its hash. The rest of the hash is used as the file name for that object.

Git stores each revision of a file as a unique blob object. The relationships between the blobs can be found through examining the tree and commit objects. Newly added objects are stored in their entirety using zlib compression. This can consume a large amount of hard disk space quickly, so objects can be combined into packs, which use delta compression to save space, storing blobs as their changes relative to other blobs.

References


(About)


Personal tools