SubprojectSupport

From Git SCM Wiki
Revision as of 04:50, 16 October 2010 by Jrn (Talk | contribs)

Jump to: navigation, search

Git was originally developed to manage the sources of the Linux kernel, a monolithic code base. It has since developed many features that make it attractive as a general purpose SCM, and is now used by all sorts of projects, some of which are developed in a 'modular' style.

The rule of thumb for determining whether a source tree is monolithic or modular is the correlation of subdirectories to release 'tarballs'. The monolithic Linux kernel is released as a single tarball, whereas the modular Xorg project releases dozens of such packages, with each one potentially having its own release schedule.

Non-distributed revision control systems such as CVS or SVN do not enforce this distinction; large projects like KDE and GNOME host hundreds of subprojects in a single repository and developers only ever have to check out the module they are working on. Sometimes, release managers, documenters or translators will make single commits that modify files across several modules. Similarly, if a change is made to the public API of a shared library in one module, the developer may update applications in other modules to use the new API as part of the same commit.

Such flexibility is an implicit feature of centralized SCMs, but it is much more difficult to get it right in a distributed system like git.

Status

Submodules have been a part of core git since version 1.5.3. Current work (as of October, 2010) is focused on improving the usability of commands such as git diff, git checkout, and gitk in projects with submodules.

See also:

Use cases

There are a number of goals which come under the general heading of subprojects, and have been requested by different people at different times. These are listed in rough order of the size of the subproject.

Separable parts of normal projects

Often a project will contain some parts which are either available elsewhere as separate projects or could be used elsewhere. For example, the Linux kernel contains a version of zlib (imported from the external project), and Kbuild (developed in the kernel tree, but adopted by other projects). The git tree itself contains xdiff, an externally-developed comparison engine.

Centralized version control isn't going to handle these at all because they are shared between different development groups with different central servers. So existing users for this case are using informal methods (generally, copying particular revisions by hand at arbitrary times).

Users in this case want to fetch all of the reachable files, since they are integral to the superproject, and may want to fetch all of the subproject history, since the subprojects are generally relatively small.

Weakly dependent portions of a large project

Large projects like KDE or GNOME are often segmented into different modules which share some higher-level structure. Users want to be able to check out some of the modules without checking out all of them. It is often the case that the top level contains some general infrastructure (such as the build system) that any checkout would need.

This case could be supported by path-split partial fetches and checkouts. That is, there could be a single large repository, and the user could ask for only certain paths. When requesting support for this use case, some people have asked for it in this form, but this design is generally seen as having problems.

Users in this case want to fetch and check out the superproject, and some submodules with their entire history. Development mostly happens in the submodules, with every submodule commit appearing in a superproject commit, where the subtree of the superproject history which changes the submodule exactly matches the submodule's history.

Vast superprojects

Some projects (often embedded Linux distributions, but also many *BSDs) consist of a superproject that contains every piece of software in the distribution as a component. This has obvious scalability concerns, since the total reachable content in such a project could be hundreds of times bigger than the largest normal project, since it could include thousands of projects of various sizes.

This case is also not handled by traditional version control, since all of the subprojects are treated by almost all of their developers as completely independent projects. Version control is used to track distribution URLs of the original sources and local patch sets.

Some users in such projects will be doing development of the subprojects, but in the context of the superproject (e.g., people making distro-specific kernel patches). Others will be working at the superproject level, deciding which subprojects to update, and to what versions (without necessarily making any changes to those versions).

Most likely, no users will ever want to check out the entire superproject (since it is likely to include mutually-exclusive packages to provide the same functionality). Even for users who want to check out a particular subproject, many of them will only want to check out versions which are included in superproject commits, not intermediate versions that were never adopted by the superproject. Fetching and storing the complete reachable history of the superproject is impossible for practically all developers.

Unhandled desires

There is an implicit assumption in previous use cases that a subproject is contained in a directory in the superproject and not mixed with other files from the superproject or other subprojects. Furthermore, the superproject subdirectory contains the subproject's root directory. People have discussed allowing more flexibility, but not in particular detail. In general, using symlinks into subprojects takes care of all of the cases that don't also require other features that git doesn't presently have.

References


Personal tools