From Git SCM Wiki

Git was originally developed to manage the sources of the Linux kernel, a monolithic code base. It has since developed many features that make it attractive as a general purpose SCM, and is now used by all sorts of projects, some of which are developed in a 'modular' style.

The rule of thumb for determining whether a source tree is monolithic or modular is the correlation of subdirectories to release 'tarballs'. The monolithic Linux kernel is released as a single tarball, whereas the modular Xorg project releases dozens of such packages, with each one potentially having its own release schedule.

Other reasons to use subprojects include: distinct pieces of code that are shared and developed by multiple projects; large collections of separate projects (Android, Gentoo); a single application with many sub-libraries, possibly customized, that come from different upstream repositories and are heavily reused (Ruby on Rails); projects with huge media files (e.g. game development); and projects with so many files that git bogs down. (These reasons are taken from the git submodule usability notes.)

Non-distributed revision control systems such as CVS or SVN do not enforce this distinction; large projects like KDE and GNOME host hundreds of subprojects in a single repository and developers only ever have to check out the module they are working on. Sometimes, release managers, documenters or translators will make single commits that modify files across several modules. Similarly, if a change is made to the public API of a shared library in one module, the developer may update applications in other modules to use the new API as part of the same commit.

Such flexibility is an implicit feature of centralized SCMs, but it is much more difficult to get it right in a distributed system like git.

In git, there are currently three main techniques for providing subproject support: submodules, subtrees, and wrappers.

  • Submodules provide semi-fixed references from the superproject into subprojects and are integrated into git (see the git submodule command and the GitSubmoduleTutorial). They are best used when the subproject:
    • is developed by someone else, is not under the administrative control of the superproject and follows a different release cycle.
    • contains code shared between superprojects (especially when the intention is to propagate bugfixes and new features back to other superprojects).
    • holds huge files, or a very large number of files, that would hurt the performance of everyday git commands if kept in the superproject.
  • Subtrees import the subproject's repository into the superproject's repository, where it becomes a native part of that repository with full history, typically in a specific subdirectory of the superproject. You can update the imported copy as the subproject changes, but exporting changes back to the subproject is more challenging. See git merge subtree for more information.
  • Wrappers provide multi-repository management functionality for a superproject with associated subprojects. Several wrappers exist. The best advertised on this page, and arguably the simplest, is gitslave, which provides a command called `gits` that under most circumstances takes the same arguments as `git` and performs the same job, except over all repositories registered with the superproject. Another tool is Google's repo, which is used for Android development but does not appear to be well (or at least obviously) documented for non-Android purposes.
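
As a rough illustration of what a "semi-fixed reference" means for submodules, the following sketch builds two throwaway repositories and registers one as a submodule of the other. The names `sublib`, `super` and `vendor/sublib` are invented for the example, and the `protocol.file.allow` override is an assumption that is only needed because the sketch clones from a local path.

```shell
#!/bin/sh
# Sketch: record a subproject as a submodule and inspect the pinned commit.
# All repository names and paths are hypothetical.
set -eu

# Identity for the throwaway commits made in this sketch.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
cd "$work"

# A stand-in for a subproject that is maintained elsewhere.
git init -q sublib
( cd sublib && echo 'lib v1' > lib.txt && git add lib.txt && git commit -q -m 'lib: initial' )

# The superproject that will reference it.
git init -q super
cd super
echo 'top-level build files' > README
git add README
git commit -q -m 'super: initial'

# Recent git releases refuse file-based submodule clones by default, so the
# local path is allowed explicitly here; a network URL would not need this.
git -c protocol.file.allow=always submodule --quiet add "$work/sublib" vendor/sublib
git commit -q -m 'super: add sublib as a submodule'

# The superproject tree stores a "gitlink" entry (mode 160000) that names one
# exact subproject commit instead of the subproject's files.
entry=$(git ls-tree HEAD vendor/sublib)
pinned=$(git rev-parse HEAD:vendor/sublib)
sub_head=$(git -C "$work/sublib" rev-parse HEAD)
echo "$entry"
```

The `160000 commit` entry is the whole reference: the superproject records a single commit ID for the subproject, and moving to a newer subproject version means committing a new ID in the superproject.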

The remainder of this page is mostly historical.


Submodules have been part of core git since version 1.5.3. Current work (as of October 2010) is focused on improving the usability of commands such as git diff, git checkout, and gitk in projects with submodules.


Use cases

There are a number of goals which come under the general heading of subprojects, and have been requested by different people at different times. These are listed in rough order of the size of the subproject.

Separable parts of normal projects

Often a project will contain some parts which are either available elsewhere as separate projects or could be used elsewhere. For example, the Linux kernel contains a version of zlib (imported from the external project), and Kbuild (developed in the kernel tree, but adopted by other projects). The git tree itself contains xdiff, an externally-developed comparison engine.

Centralized version control cannot handle these cases at all, because the code is shared between development groups that use different central servers. Existing users in this situation therefore rely on informal methods, generally copying particular revisions by hand at arbitrary times.

Users in this case want to fetch all of the reachable files, since they are integral to the superproject, and may want to fetch all of the subproject history, since the subprojects are generally relatively small.
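
For a small embedded library of this kind, one way to get both the files and the full history into the superproject is a subtree merge. The sketch below is a minimal version of that technique under stated assumptions: the repositories `extlib` and `super`, the branch name `main`, and the target directory `xdiff-demo/` are all made up for illustration, and the later update uses the `subtree=<path>` merge option rather than the path-guessing `subtree` strategy.

```shell
#!/bin/sh
# Sketch: import a small external library with its history via a subtree merge.
# Repository names, the branch name and the target directory are hypothetical.
set -eu

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
cd "$work"

# Stand-in for an externally developed library with its own history.
git init -q extlib
git -C extlib symbolic-ref HEAD refs/heads/main
( cd extlib
  echo 'v1' > diff.c && git add diff.c && git commit -q -m 'extlib: v1'
  echo 'v2' > diff.c && git commit -q -am 'extlib: v2' )

git init -q super
cd super
echo 'superproject' > README
git add README
git commit -q -m 'super: initial'

# Initial import: join the two histories, then place the library's tree
# under xdiff-demo/ before concluding the merge commit.
git remote add extlib "$work/extlib"
git fetch -q extlib
git merge -q -s ours --no-commit --allow-unrelated-histories extlib/main
git read-tree --prefix=xdiff-demo/ -u extlib/main
git commit -q -m 'Merge extlib as a subtree under xdiff-demo/'

# Later upstream work can be merged into the same subdirectory.
( cd "$work/extlib" && echo 'v3' > diff.c && git commit -q -am 'extlib: v3' )
git fetch -q extlib
git merge -q --no-edit -X subtree=xdiff-demo extlib/main

imported=$(cat xdiff-demo/diff.c)
commits=$(git rev-list --count HEAD)
```

After these steps the library's commits are ordinary ancestors of the superproject's history, which is what makes updating easy and exporting changes back to the library comparatively awkward.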

Weakly dependent portions of a large project

Large projects like KDE or GNOME are often segmented into different modules which share some higher-level structure. Users want to be able to check out some of the modules without checking out all of them. It is often the case that the top level contains some general infrastructure (such as the build system) that any checkout would need.

This case could be supported by path-split partial fetches and checkouts. That is, there could be a single large repository, and the user could ask for only certain paths. When requesting support for this use case, some people have asked for it in this form, but this design is generally seen as having problems.
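
To make the path-split idea concrete, here is a sketch that uses git's sparse checkout in cone mode, a feature added to git well after this page was written (it assumes git 2.25 or newer). It only limits which paths appear in the working tree; the clone still fetches the history of every path, so it illustrates the "checkout" half of the idea and not the partial fetch. The repository `desktop` and its directories are invented for the example.

```shell
#!/bin/sh
# Sketch: one large repository, with only selected paths in the working tree.
# Assumes git >= 2.25; repository and directory names are hypothetical.
set -eu

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
cd "$work"

# A single repository holding several modules plus shared top-level files.
git init -q desktop
git -C desktop symbolic-ref HEAD refs/heads/main
( cd desktop
  mkdir -p libs apps docs
  echo 'build rules' > Makefile
  echo 'lib' > libs/lib.c
  echo 'app' > apps/app.c
  echo 'doc' > docs/guide.txt
  git add . && git commit -q -m 'desktop: initial' )

# Clone everything, then restrict the working tree to the top-level files
# and the one module of interest.
git clone -q "$work/desktop" partial
cd partial
git sparse-checkout init --cone
git sparse-checkout set libs
ls
```

In cone mode the files at the top level stay checked out alongside the selected directory, which matches the case where the top level carries build infrastructure that every checkout needs.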

Users in this case want to fetch and check out the superproject, and some submodules with their entire history. Development mostly happens in the submodules, with every submodule commit appearing in a superproject commit, so that the part of the superproject history that changes a submodule exactly matches that submodule's own history.

Vast superprojects

Some projects (often embedded Linux distributions, but also many *BSDs) consist of a superproject that contains every piece of software in the distribution as a component. This raises obvious scalability concerns: because such a project could include thousands of subprojects of various sizes, its total reachable content could be hundreds of times larger than that of the largest normal project.

This case is also not handled by traditional version control, since all of the subprojects are treated by almost all of their developers as completely independent projects. Version control is used to track distribution URLs of the original sources and local patch sets.

Some users in such projects will be doing development of the subprojects, but in the context of the superproject (e.g., people making distro-specific kernel patches). Others will be working at the superproject level, deciding which subprojects to update, and to what versions (without necessarily making any changes to those versions).
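
The second kind of work, choosing which version of a subproject the superproject ships without changing the subproject itself, can be sketched with submodules as follows. This is only one possible arrangement; the names `kernel-demo`, `distro` and `pkgs/kernel` and the tags `v1.0` and `v2.0` are assumptions made for the example, and the `protocol.file.allow` override is needed only because the sketch clones from a local path.

```shell
#!/bin/sh
# Sketch: a superproject maintainer pins a subproject to a chosen release.
# Repository names, paths and tags are hypothetical.
set -eu

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
cd "$work"

# Stand-in for an independent upstream project with two tagged releases.
git init -q kernel-demo
( cd kernel-demo
  echo 'release 1' > VERSION && git add VERSION && git commit -q -m 'release 1'
  git tag v1.0
  echo 'release 2' > VERSION && git commit -q -am 'release 2'
  git tag v2.0 )

# The distribution superproject initially tracks upstream's latest commit.
git init -q distro
cd distro
git -c protocol.file.allow=always submodule --quiet add "$work/kernel-demo" pkgs/kernel
git commit -q -m 'distro: track kernel-demo'

# Decide to ship v1.0 instead: check it out inside the submodule, then
# record the new pinned commit in the superproject.
git -C pkgs/kernel checkout -q v1.0
git add pkgs/kernel
git commit -q -m 'distro: pin kernel-demo to v1.0'

pinned=$(git rev-parse HEAD:pkgs/kernel)
wanted=$(git -C "$work/kernel-demo" rev-parse 'v1.0^{commit}')
shipped=$(cat pkgs/kernel/VERSION)
```

No commit is made in the subproject here; the only thing that changes is the commit ID recorded by the superproject.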

Most likely, no users will ever want to check out the entire superproject (since it is likely to include mutually-exclusive packages to provide the same functionality). Even for users who want to check out a particular subproject, many of them will only want to check out versions which are included in superproject commits, not intermediate versions that were never adopted by the superproject. Fetching and storing the complete reachable history of the superproject is impossible for practically all developers.

Unhandled desires

There is an implicit assumption in previous use cases that a subproject is contained in a directory in the superproject and not mixed with other files from the superproject or other subprojects. Furthermore, the superproject subdirectory contains the subproject's root directory. People have discussed allowing more flexibility, but not in particular detail. In general, using symlinks into subprojects takes care of all of the cases that don't also require other features that git doesn't presently have.
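
A minimal sketch of the symlink approach, assuming the subproject is checked out in its own directory and the superproject wants one of its files to appear elsewhere in the tree. The names `sublib`, `super`, `vendor/sublib` and `include/sublib.h` are invented, and how the subproject checkout itself is managed (submodule, wrapper, plain clone) is left out; a plain clone stands in for it here.

```shell
#!/bin/sh
# Sketch: expose a file from a subproject elsewhere in the superproject tree
# through a tracked symlink. All names and paths are hypothetical.
set -eu

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
cd "$work"

# Stand-in subproject that ships a header.
git init -q sublib
( cd sublib
  mkdir include
  echo '/* api */' > include/sublib.h
  git add . && git commit -q -m 'sublib: initial' )

git init -q super
cd super

# The subproject lives in its own directory; a plain clone stands in for
# whatever mechanism actually manages it.
git clone -q "$work/sublib" vendor/sublib

# The superproject tracks only a relative symlink that points into it.
mkdir include
ln -s ../vendor/sublib/include/sublib.h include/sublib.h
git add include/sublib.h
git commit -q -m 'super: expose sublib header via symlink'

# Git stores the link itself (mode 120000), not the file it points to.
mode=$(git ls-files -s include/sublib.h | cut -d' ' -f1)
content=$(cat include/sublib.h)
```

Because only the link is tracked, it dangles in any checkout where the subproject directory is absent.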

