SoC2011Ideas

From Git SCM Wiki

OBSOLETE CONTENT

This wiki has been archived and the content is no longer updated. Please visit git-scm.com/doc for up-to-date documentation.

Welcome

Git has been accepted to the Google Summer of Code 2011 program!

This page contains project ideas culled from the Git user and developer community. You can get started by reading some project descriptions, and the mailing list thread(s) that spawned them. If you have another idea, add it to this page and start a discussion on the mailing list.

Preparation: Read advice for GSoC students. Try sending a few simple patches to the mailing list as soon as possible to familiarize yourself with the general patch workflow. Details can be found in Submitting Patches.
Priorities: The top priority for most projects is getting code merged into Git upstream. However, there are two extremes to avoid. The first is working in isolation, at the risk of not being able to integrate the work into Git upstream at the end of the summer term. The other is worrying so much about the integration of each little bit that the project keeps getting sidetracked and eventually loses focus. To strike a balance, post progress reports to the mailing list at least once a week, and keep a public development branch. Occasionally, it might help to post patches for small components of the project, with unit tests, to get a wider test audience.

If you are interested, you can also find projects which got accepted previously in 2007, 2008, 2009 and 2010.


Projects touching the core of Git

Git Submodules Enhancements

Git submodule usability has increased in recent releases, but there is still much room for improvement. Tasks carried out by tools like repo (used by Android) could be accommodated by core Git.

Major topics:

  • Place the repository for each populated submodule in the superproject's $GIT_DIR/modules, and use .git as a file (not a directory) to point at it. This resolves issues related to switching between versions of the superproject which contain/do-not-contain the submodule, as the submodule working directory can be safely created/deleted at will and its objects are safe somewhere within $GIT_DIR/modules.
  • It's far too easy to create a commit in a superproject which references a submodule commit that does not yet exist in the submodule repo. This should require forcing.
  • Enhancements to tools and CLIs to make working with submodules easier.
  • Let patches be created and applied across submodules.
  • Enable git mv to work on submodules (unless this should be considered more of a bug fix)
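
The ".git as a file" idea from the first topic can be illustrated with a small sketch. It assumes the gitfile format of a single "gitdir: <path>" line; the function name read_gitfile and the directory layout in demo() are made up for this illustration and are not part of Git.

```python
import os
import tempfile

GITFILE_PREFIX = "gitdir: "


def read_gitfile(dot_git_path):
    """Return the repository directory that a .git *file* points at."""
    with open(dot_git_path, "r", encoding="utf-8") as f:
        line = f.readline().rstrip("\n")
    if not line.startswith(GITFILE_PREFIX):
        raise ValueError("not a gitfile: %r" % dot_git_path)
    target = line[len(GITFILE_PREFIX):]
    # A relative target is resolved against the directory holding the .git file.
    base = os.path.dirname(os.path.abspath(dot_git_path))
    return os.path.normpath(os.path.join(base, target))


def demo():
    """Build a toy superproject layout and resolve the submodule's gitfile."""
    with tempfile.TemporaryDirectory() as top:
        modules = os.path.join(top, ".git", "modules", "lib")  # hypothetical submodule "lib"
        worktree = os.path.join(top, "lib")
        os.makedirs(modules)
        os.makedirs(worktree)
        with open(os.path.join(worktree, ".git"), "w", encoding="utf-8") as f:
            f.write("gitdir: ../.git/modules/lib\n")
        return read_gitfile(os.path.join(worktree, ".git")) == os.path.normpath(modules)
```

Because the submodule's objects live under the superproject's $GIT_DIR/modules in this layout, deleting the "lib" working directory loses only the small pointer file.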

Goal: Get one or more of the above features accepted into git core.
Language: C, Bourne shell
See: Repo sources, git-submod-enhancements wiki.
Possible mentor: Jens Lehmann

Better git log --follow support

When showing the history of a subset of a project (e.g., "git log -- foo.c"), git shows only the changes that affect that single pathname. Because of this, changes made to the content currently at "foo.c" while it was still called "bar.c" will not be shown.

We have the "--follow" option that switches the path to follow to "bar.c" by following renames, but it has some deficiencies.

For example, it follows only a single path, and the path it follows is global. This means it cannot follow more than one line of development merged together, one with the original pathname and the other with the renamed pathname. It also does not interact well with git's usual history simplification (which displays a connected subgraph of the history that pertains to "foo.c").
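
The single-global-path limitation can be seen in a toy model. This is only a sketch: it flattens history into a newest-first list, and the commit dictionaries and the follow function are invented for the illustration, not taken from git's revision walker.

```python
def follow(history, path):
    """Walk commits newest-first, tracking ONE path across renames.

    history: list of dicts with "id", "touched" (set of changed paths)
    and "renames" (dict mapping new name -> old name).
    Returns the ids of the commits that touched the followed path.
    """
    shown = []
    for commit in history:
        if path in commit["touched"]:
            shown.append(commit["id"])
        # The followed path is a single global value: after a rename
        # is seen, only the old name is tracked from here on.
        path = commit["renames"].get(path, path)
    return shown


# Linear history: bar.c was renamed to foo.c in C2.
LINEAR = [
    {"id": "C3", "touched": {"foo.c"}, "renames": {}},
    {"id": "C2", "touched": {"foo.c"}, "renames": {"foo.c": "bar.c"}},
    {"id": "C1", "touched": {"bar.c"}, "renames": {}},
]

# Two merged lines: A1 (on a side branch that still uses bar.c) is
# visited before the rename commit B1 from the other branch.
MERGED = [
    {"id": "A1", "touched": {"bar.c"}, "renames": {}},
    {"id": "B1", "touched": {"foo.c"}, "renames": {"foo.c": "bar.c"}},
    {"id": "C1", "touched": {"bar.c"}, "renames": {}},
]
```

In this model follow(LINEAR, "foo.c") reports all three commits, while follow(MERGED, "foo.c") misses A1 because the tracked name has not yet switched to "bar.c" when A1 is visited.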

Major topics:

  • Expand --follow to handle arbitrary pathspecs
  • Design and implement a new architecture for --follow that will allow it to mark uninteresting commits as part of the usual history simplification process. Note that care must be taken not to impact the performance of non-follow codepaths.

Goal: Get the above features accepted into git core.
Language: C
Proposed mentor: Jeff King

Better big-file support

Git generally assumes that the content being stored in it is source code, or some other form of text approximately the same size. While git can handle arbitrary-sized binary content, its base assumptions sometimes mean some operations are slow or unnecessary space is consumed for large binary files (e.g., videos or other media).

Major topics:

  • Examine the behavior of git when using large media files. Identify areas where performance problems lead to a poor user experience. Develop a set of test cases that highlight these problems.
  • Design solutions to mitigate the performance issues. In some cases, this may be as simple as having a "large file" codepath that avoids pulling whole files into memory (e.g., during "git add"). For other cases, it may involve new features (e.g., on-demand fetching of large blob objects from a central repository).
  • Implement these solutions.
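
As a rough illustration of a "large file" codepath, the sketch below computes a blob's object name without loading the whole file into memory. It relies on the blob name being the SHA-1 of the header "blob <size>" plus a NUL byte, followed by the content. The function name and the chunk size are assumptions made for this example.

```python
import hashlib
import io


def blob_id_streaming(stream, size, chunk_size=1 << 20):
    """Hash a blob of known size in fixed-size chunks."""
    h = hashlib.sha1()
    h.update(b"blob %d\0" % size)  # object header comes first
    remaining = size
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            raise ValueError("stream ended before the declared size")
        h.update(chunk)  # only one chunk is held in memory at a time
        remaining -= len(chunk)
    return h.hexdigest()


def blob_id_of_bytes(data, chunk_size=1 << 20):
    """Convenience wrapper for in-memory data, used for testing."""
    return blob_id_streaming(io.BytesIO(data), len(data), chunk_size)
```

The result does not depend on the chunk size, so memory use can be bounded independently of the file size.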

Goal: The ultimate goal is getting these implemented features into upstream git. However, because part of the task is identifying issues and solutions, it may turn out that implementing all solutions is too large for a GSoC project. The student will need to work with the mentor to establish the scope of the implementation during the course of the project.
Languages: C, Bourne shell
Proposed mentor: Jeff King

Better big-file support by not storing objects in git

An alternate method for achieving the same goal is to use a git filter that stores a reduced representation (like the sha1) in git, but stores the actual media files elsewhere. Similar systems already exist:

Goal: Identify shortcomings in current systems, both in the systems themselves and in the support that git provides for them. Design and implement solutions that address these shortcomings.
Languages: C (for git interaction), other languages for examining existing systems and/or designing your own
Reference: http://thread.gmane.org/gmane.comp.version-control.git/165389/focus=165399
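
The filter approach described above can be sketched as a pair of functions in the spirit of git's clean and smudge filters. Everything specific here is an assumption for illustration: the "external-sha1:" pointer format, the flat store directory, and the function names do not come from any existing system.

```python
import hashlib
import os

POINTER_PREFIX = b"external-sha1:"  # hypothetical pointer format


def clean(content, store_dir):
    """Store the real content outside git; return the small pointer to commit."""
    digest = hashlib.sha1(content).hexdigest()
    path = os.path.join(store_dir, digest)
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(content)
    return POINTER_PREFIX + digest.encode("ascii") + b"\n"


def smudge(pointer, store_dir):
    """Turn a committed pointer back into the real content on checkout."""
    if not pointer.startswith(POINTER_PREFIX):
        return pointer  # ordinary content passes through unchanged
    digest = pointer[len(POINTER_PREFIX):].strip().decode("ascii")
    with open(os.path.join(store_dir, digest), "rb") as f:
        return f.read()
```

A real filter would read from standard input and write to standard output, and would need a way to fetch content that is missing from the local store, which is one of the shortcomings a project could examine.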

Resumable clone

Currently, cloning a remote repository has to be done in one session. If the process fails or is aborted for any reason, all data downloaded so far is lost and one has to start from scratch.

There is also currently a bug where, after all data has been downloaded successfully during cloning, a failure in checking the data out to the working directory leaves the repository in an unusable state. In this respect a normal clone behaves differently from a clone --no-checkout followed by a checkout. Fixing this bug would also be part of this project.

While not necessarily part of this project, fetch might also benefit from a resume mechanism.

Goal: Allow Git to resume a cloning process that has been aborted for any reason.
Languages: C
See: Cached packs

Port histogram diff from jgit

Git internally uses a stripped down version of the xdiff library to generate patches and perform merges. In addition, we have an implementation of patience diff.

It has been observed that the "histogram diff" algorithm used in jgit runs much faster than xdiff, which implements the traditional Myers diff algorithm.

Port the implementation and make it available via a command line option, in a way similar to the existing "patience diff", so that people can benchmark the result to see if it performs better.

Major topics:

  • Learn how the main part of git interacts with the xdiff library.
  • Learn how "histogram diff" works by studying the jgit implementation.
  • Implement "histogram diff" as the third option in the xdiff library.
  • Benchmark the result.
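
For orientation only, here is a heavily simplified sketch of the general idea usually associated with histogram diff: count how often each line occurs, anchor on a common line with the lowest count, and recurse on both sides. It is not a port of the jgit code and omits its refinements; the function name is invented for this example.

```python
from collections import Counter


def histogram_matches(a, b):
    """Return matched (index_in_a, index_in_b) pairs, in increasing order."""
    if not a or not b:
        return []
    counts = Counter(a)  # the "histogram": occurrences of each line in a
    first_index = {}
    for i, line in enumerate(a):
        first_index.setdefault(line, i)
    best = None  # (occurrence count, index in a, index in b)
    for j, line in enumerate(b):
        if line in counts and (best is None or counts[line] < best[0]):
            best = (counts[line], first_index[line], j)
    if best is None:
        return []  # no common line: everything in this region differs
    _, i, j = best
    before = histogram_matches(a[:i], b[:j])
    after = histogram_matches(a[i + 1:], b[j + 1:])
    # Shift the indices of the right-hand recursion back into place.
    return before + [(i, j)] + [(i + 1 + x, j + 1 + y) for x, y in after]
```

Lines of a and b that appear in no pair are the deletions and insertions of the resulting diff. The sketch always yields a valid common subsequence, though not necessarily the longest one.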

Goal: Get this feature merged to the upstream git as an optional algorithm.
Language: C

Multiple work trees

The git-new-workdir helper (from 2007) for working on multiple branches at once in a single repository is very handy. Three problems (that are well known and are probably the reason git new-workdir is still in contrib) with workdirs are:

  • the HEAD reflogs aren't shared, which means that pruning one work tree may trash accessible stuff from the reflog of another one.
  • if two working trees are on the same branch at the same time, changes in one worktree cause the HEAD ref to move from under the feet of the other.
  • it relies on symbolic links, which are not available on all platforms

Pierre Habouzit outlined an approach to fixing them here: [1]

Major topics:

  • Learn how reflogs, reachability analysis, atomic ref updates, and index locking work by studying git's current implementation.
  • Design and implement a mechanism to inform an object store about ref-like and reflog-like things that live outside of the repository itself (so the objects these refs point to will not be pruned).
    • teach "git clone --shared" to use it so rewinding refs and running gc in the forkee repository won't remove objects still needed by the forker
    • teach "git new-workdir" to use it for HEAD reflogs
  • Design and implement mechanism for a repository to be aware of branches it should not check out, and teach "git new-workdir" to use it
  • Design and implement an analog to the ".git file" mechanism to emulate refs/ and logs/ symlinks on systems without symlink support

Goal: Get git new-workdir in good enough shape that it is ready to move out of contrib/.
Languages: C, Bourne shell

git cherry-pick --continue/--abort/--skip and git sequencer

There was a "git sequencer" GSoC 2008 project. The student was Stephan Beyer and the mentors were Christian Couder and Daniel Barkalow. This project came up with some working code, but most of it did not get merged. After the end of the project, a few reworked parts of the code were slowly integrated into "git reset" and "git cherry-pick". Some RFC WIP patches were also posted last November to implement "git cherry-pick --continue": http://thread.gmane.org/gmane.comp.version-control.git/162183/focus=162197

Now the following steps should be done:

  • Update and rework the patch series to properly implement the "--continue" option in "git cherry-pick", and then the "--abort" and "--skip" options too,
  • Refactor code from the previous step and the old sequencer project into a proper "git sequencer" command,
  • Use the git sequencer command to simplify and streamline the "git rebase", "git am" and perhaps other commands,
  • For bonus points, port "git rebase" and "git am" to C.
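
The intended behaviour of the three options can be modelled with a small in-memory state machine. This is a sketch under strong assumptions: the Sequencer class, its method names, and the apply_fn callback are hypothetical, and a real implementation would keep its state on disk between command invocations.

```python
class Sequencer:
    """Toy model of a todo list with continue/skip/abort."""

    def __init__(self, todo, apply_fn):
        self.todo = list(todo)    # commits still to be picked
        self.done = []            # commits picked so far
        self.apply_fn = apply_fn  # returns True on success, False on conflict
        self.stopped = False

    def run(self):
        while self.todo:
            if not self.apply_fn(self.todo[0]):
                self.stopped = True  # stop and wait for the user
                return False
            self.done.append(self.todo.pop(0))
        self.stopped = False
        return True

    def _require_stopped(self):
        if not self.stopped:
            raise RuntimeError("no operation in progress")

    def cont(self):
        """--continue: the user resolved the conflict; record it and go on."""
        self._require_stopped()
        self.done.append(self.todo.pop(0))
        return self.run()

    def skip(self):
        """--skip: drop the conflicting commit and go on."""
        self._require_stopped()
        self.todo.pop(0)
        return self.run()

    def abort(self):
        """--abort: forget the whole operation (models a reset to the start)."""
        self._require_stopped()
        self.todo, self.done, self.stopped = [], [], False
```

For example, with apply_fn reporting a conflict on the second of three commits, run() stops after the first pick, and skip() then finishes with only the first and third commits applied.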

Goal: Implement "--continue", "--abort" and "--skip" options in "git cherry-pick" and then a "git sequencer" command.
Language: C

Clean Up and Improve git add -p

git-add--interactive.perl has become a bit of a mess. Partly due to {checkout,stash,...} -p, it has bolted-on interfaces to other commands. There are some UI issues that simply fall out of its design, e.g., you cannot go back from one file to another, Ctrl-C stops applying to the current file but does not discard earlier files, etc.

While arguably not a very "cool" project, a proper redesign would open up many possibilities that the current infrastructure does not support, such as (in addition to fixing the above problems):

  • Doing something useful about binary file differences (such as: file X differs, should I add it?)
  • Doing something useful during a conflicted merge, such as providing interactive add/review of the --cc hunks
  • On-the-fly reversion of the diff direction (currently the direction is predetermined at launch)

Goal: Clean up and then extend. Get at least the cleanup part merged.
Languages: Perl (and a tiny bit of C if the external interface changes)

Build in more external commands

There are still commands implemented in shell script or Perl. The goal is to rewrite the major ones in C.
Goal: Write selected commands in C. Not all commands need to be rewritten though. Some commands to consider are:

  • git-add--interactive.perl
  • git-rebase*.sh
  • git-am.sh
  • git-stash.sh
  • git-rebase.sh
  • git-repack.sh
  • git-pull.sh
  • git-send-email.perl

Depending on whether the sequencer is implemented, am and rebase may be left out.
Languages: C, Perl, Shell script

Better option parsing for "git log" and "git diff"

Most git commands use the parse-options API (see Documentation/technical/api-parse-options.txt) to provide some 'enhanced option parsing' facilities, as described in gitcli(7).

Unfortunately, basic commands like "git diff" and "git log" cannot because of two features the current parse-options library does not know how to deal with: (1) Option sets are shared between multiple commands. For example, "git log" and "git rev-list" both support revision listing options and "git log" and "git diff-tree" both support diff-formatting options. (2) In the case of "git log", the fields in struct rev_info for flags like --first-parent are bitfields, and C does not support pointers to bitfields.

Your mission, should you choose to accept it:

  • Convert the revision traversal bitfields to explicit flag words with mask #defines (or find another way to point to them)
  • Design and implement a way for the parse-options facility and its callers to represent option sets shared by several commands
  • Use this facility for diff options and rev-list options, so "git log", "git show", and so on can act more like typical git commands.
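
The two ideas in this list, flag words with masks and option sets shared between commands, can be sketched together. The sketch is in Python rather than C and is only a model: the RevFlag and DiffFlag names, the table layout, and the parse function are assumptions for illustration and do not reflect the parse-options API.

```python
from enum import IntFlag


class RevFlag(IntFlag):   # hypothetical masks for a revision flag word
    FIRST_PARENT = 1 << 0
    NO_MERGES = 1 << 1


class DiffFlag(IntFlag):  # hypothetical masks for a diff flag word
    STAT = 1 << 0
    NAME_ONLY = 1 << 1


# Each option set maps an option to (flag word name, mask). A table entry
# can name a word and a mask, which it could not do for a bitfield member.
REVISION_OPTIONS = {
    "--first-parent": ("rev", RevFlag.FIRST_PARENT),
    "--no-merges": ("rev", RevFlag.NO_MERGES),
}
DIFF_OPTIONS = {
    "--stat": ("diff", DiffFlag.STAT),
    "--name-only": ("diff", DiffFlag.NAME_ONLY),
}

# Commands share option sets instead of duplicating them.
COMMANDS = {
    "log": [REVISION_OPTIONS, DIFF_OPTIONS],
    "rev-list": [REVISION_OPTIONS],
    "diff-tree": [DIFF_OPTIONS],
}


def parse(command, argv):
    """Return (flag words, unrecognized arguments) for a command line."""
    table = {}
    for option_set in COMMANDS[command]:
        table.update(option_set)
    words = {"rev": RevFlag(0), "diff": DiffFlag(0)}
    rest = []
    for arg in argv:
        if arg in table:
            word, mask = table[arg]
            words[word] |= mask
        else:
            rest.append(arg)
    return words, rest
```

With these tables, parse("log", ["--first-parent", "--stat", "HEAD"]) sets one bit in each flag word and leaves "HEAD" over, while "rev-list" does not recognize "--stat" because it only includes the revision option set.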

This would make the interface more consistent between commands. Longer term, it could pave the way to the bash tab-completion script having reliable information about the usage of each command produced by git automatically. See the thread surrounding [2] for some pointers.

Goal: migrate "git diff" and "git log" to use the parse-options API
Languages: C

Add-ons to git

Word-based Merge Helper

The existing merge algorithms are all tailored to line-based formats such as code. Writing, e.g., LaTeX or even asciidoc requires sticking to a strict word-wrapped format. Even worse, re-wrapping leads to headaches if people work on the same areas a lot, much like the effects of code reindents.

This is similar to last year's idea of a LaTeX merge helper: Merge helper for LaTeX files (2010).

One possible angle of attack given --word-diff=porcelain would be:

  1. Fix --word-diff to properly represent both sides of the diff at least optionally. (It has been observed on the list that it does not even represent either side faithfully.)
  2. Use --word-diff=porcelain as input to some to-be-written merge algorithm.

Goal: Design and implement a merge algorithm that works for formats where word-diff is more suitable than line-diff.
Languages: C, Perl

Remote helper for Subversion

Write a remote helper for Subversion. While a lot of the underlying infrastructure work was completed last year, the remote helper itself is essentially incomplete. Major work includes:

  • Understanding revision mapping and building a revision-commit mapper.
  • Working through transport and fast-import related plumbing, changing whatever is necessary.
  • Getting a Git-to-SVN converter merged.
  • Building the remote helper itself.

Goal: Build a full-featured bi-directional `git-remote-svn` and get it merged into upstream Git.
Languages: C
See: A note on SVN history, svnrdump
Possible mentors: Jonathan Nieder, Sverre Rabbelier, David Barr

libgit2

Complete some libgit2 features

libgit2 is a portable implementation of Git core methods provided as a re-entrant linkable C library. It is already awesome, but some key features need to be completed before it can be used by 3rd party applications.

Major topics:

  • Config file parsing
  • The git network protocol (push, fetch)
  • Diffs
  • Merges

Goal: Implement some missing features.
Language: C
See: http://libgit2.github.com and https://github.com/libgit2
Possible mentor: Vicent Marti

Build a minimal Git client based on libgit2

Write a minimal Git client using libgit2. It has to be something small and 100% self contained in a C executable that runs everywhere with 0 dependencies -- don't aim for full feature completion, just the basic stuff to interoperate with a Git repository. Clone, checkout, branch, commit, push, pull, log. Use the new client to test libgit2 for compatibility with the original Git with the same unit tests that Git uses.

Goal: Implement a minimal CLI Git client.
Language: C
See: http://libgit2.github.com and https://github.com/libgit2
Possible mentors: Jeff King, Vicent Marti

gitweb

Graphical history view in gitweb

Implement a graphical log (like `git log --graph`, gitk, qgit, tig, or git-browser) in gitweb, and perhaps also a graphical view of forks like the one in GitHub. There are many possible ways to generate such a graph: generate an image, generate SVG, use the canvas element if the web browser supports it, use images and perhaps transparency, use Unicode a la git-forest, or use ASCII art like `git log --graph`. One can take inspiration from how other web interfaces do it.

If drawing a graphical history requires extra Perl modules to be installed, it should be possible to run gitweb without them, perhaps without graphical history (there can be ASCII-art fallback).

Goal: Graph of history in gitweb 'log', 'shortlog' and 'history' views, similar to what e.g. gitk offers.
Languages: Perl, HTML/CSS/JavaScript


git-gui and gitk

Embedding graphical diff and merge tool in git-gui

Embed a graphical diff tool (and optionally a graphical merge tool) in git-gui, so that one would be able to use `git gui diff <file>` to get a graphical diff. One can use e.g. TkDiff as a source of ideas, algorithms and code; being GPLv2+ licensed, it has a license compatible with Git's.

"git gui diff" should use git plumbing if possible, rather than reimplementing the algorithm in Tcl. git-gui and/or gitk could have a menu entry that runs the graphical diff or merge, but it is not strictly necessary.

Goal: git gui diff <file> generates graphical diff.
Languages: Tcl/Tk
Possible mentor: Heiko Voigt


Common library for git-gui and gitk

git-gui nowadays consists of many modules, but gitk is still a monolithic script. Much of the code is probably duplicated between git-gui and gitk. The goal of this project is to split gitk into smaller modules, and to extract the common parts of git-gui and gitk into a common Tcl/Tk library (Tcl bindings).

See also this thread.

Goal: split gitk, common library (Tcl/Tk bindings) for gitk and git-gui
Languages: Tcl/Tk
Possible mentor: Heiko Voigt

Platform-specific projects

Port Git to Android

While the Android project uses Git for its development, there is currently no Git client that runs on Android.

Goal: Provide a Git client for Android that is as complete as possible.
Languages: C, Perl, Bourne shell

Other sources of ideas

  • Ideas from previous Google Summer of Code. Note that some of the ideas got implemented since then.
    • SoC2007Ideas (e.g. .gitlink, lazy clone, GitTorrent, blame merge strategy, git-svnserver)
    • SoC2008Ideas (e.g. resumable clone/fetch, pack v4/v5)
    • SoC2009Ideas (e.g. packfile caching for git-daemon, git mirror-sync, libgit2, directory renames)
    • SoC2010Ideas (e.g. mirror support, git sequencer (again), etc)