SoC2009Ideas

From Git SCM Wiki
Jump to: navigation, search

Welcome!

Git has been accepted to the Google Summer of Code 2009 program!

Google will be accepting student applications March 23 ~12 noon PDT/19:00 UTC through April 3 12 noon PDT/19:00 UTC.

This page contains project ideas culled from the Git user and developer community. You can get started by reading some project descriptions, and the mailing list thread(s) that spawned them. Also consider joining the developer mailing list, or finding us on IRC. Details can be found in GitCommunity.

Don't like a project you see here? Toss your own idea out there, it may be just the thing we have been looking for!

If you are interested, you can also find projects which got accepted previously in 2007 and 2008.

Contents


General Requirements

All projects have the following basic requirements:

  • Unless otherwise stated, projects will require programming in C, as nearly all of Git is written in C, for maximum speed and portability.
  • All materials must be released under the GNU General Public License (GPL), version 2.
  • Individual students shall retain copyright on their works.
  • Projects must be tracked and managed in Git, and published on http://repo.or.cz.
  • Weekly project status reports should be sent to the project's mentors. Each status report should outline what was accomplished that week, any issues that prevented that week's goals from being completed, and your goals for the next week. This will help you to break your project down into manageable chunks, and will also help the project's mentors to better support your efforts.

Interested students are encouraged to read the Advice for GSoC Students Page, as it has excellent suggestions that might help you to pick a project and shape your proposal.

If your proposal is accepted by the Git Development Community you will be expected to work on it full time during the summer. Its cool if you want to take a week off for vacation, but remember that Google is hiring you for the summer to help us improve Git. That should be your focus. Don't expect that you will be able to work on your project for just 10 hours a week and then collect at the end.

If your original proposal doesn't pan out or becomes too much of a challenge, you should work with your mentor to help redefine it. We really want to see every project succeed this summer, as there is a great deal of interest in these projects from within the user community.

Students can apply for the program at the Google Summer of Code website. Please consider reviewing our SoC2008Template and answering its questions as part of your application.

New To Git? New To Open Source Development?

A collection of smaller projects. Pick one or two and get introduced to the open source community!

Line-level history browser

Output from "git blame" is often not useful, as you do not see the actual changes. It could have been an indentation change, a typo fix, a code movement, or really the origin of the code that you are looking for.

What is needed, then, is a browser similar to "git log", except that it should be able to follow the history of particular line ranges, not whole files.

Like "git log", it should be possible to see the commits, the diffs (restricted to the interesting area), and code moves. It would be mighty cool if it could see code copies, too.

Some of above functionality can be found in graphical blame browsers such like sample program blameview from `contrib/` (Perl, Gtk2), or "git gui blame" (Tcl/Tk), or git-age 'blame visualizer' (PyGTK).

For an alternate solution, utilizing pickaxe search rather than git-blame, see Giddy, Git Pickaxe Frontend by Petr Baudis (Perl, Gtk2), presented at Git Together '08 as the secret pickaxe project.

Goal: A program that can follow history of line ranges.
Language: C
Mentor: Johannes Schindelin (johannes.schindelin@gmx.de)

Comparative revision history in Gitweb

It is tedious to quickly compare differences between non-immediate pushes in git-web. Implement a selectable, comparative history viewer like on Wikipedia: http://en.wikipedia.org/w/index.php?title=Git_(software)&action=history, http://en.wikipedia.org/w/index.php?title=Git_(software)&diff=279286585&oldid=278280915

Goal: A way to select and compare different versions on gitweb
Language: Perl
See: Gitweb!

Implement git-submodule using .git file

Rewrite git-submodule, placing the repository for each referenced submodules in the superproject's $GIT_DIR/modules, and use .git as a file (not a directory), which is currently in 'pu', to point at it. This resolves issues related to switching between versions of the superproject which contain/do-not-contain the submodule, as the submodule working directory can be safely created/deleted at will and its objects are safe somewhere within $GIT_DIR/modules.

Goal: Rework git-submodule, keeping submodule repositories safe under $GIT_DIR.
Language: C, Bourne shell
See: gitster's recent remarks on #git on gmane.

Medium Sized Projects

Teach git-apply the 3-way merge fallback git-am knows

When git-apply patches file(s) the files may not patch cleanly. In such a case git-apply currently rejects the patch. If it is called from git-am then a 3-way merge is attempted by taking the "index ..." data from the patch and constructing a temporary tree, applying the patch to that, then merging the temporary tree to the current branch.

For various reasons it would be nice to do the entire temporary tree dance directly inside of git-apply, as it can benefit say the git-sequence's "apply" command, speed up git-am processing by forking less, etc.

Goal: Apply a patch that git-apply rejects today, but that "git am -3" can apply cleanly.
Language: C
See: gitster's recent remarks on #git on gmane.

Lazy clone / remote alternates

The idea here is to be able to remotely access objects from a network based object server as needed, rather than having them all local. This would make Git more approachable for very large projects, especially in a more corporate LAN-user setting.

Note: After having implemented the proof-of-concept, I don't think the lazy clone/remote alternates is a good idea; in most of the scenarios it leads to far too many roundtrips which makes it unusable even on LANs. [[[#narrowclone|Narrow and Sparse clone support]]] is the way to go, if you considered subscribing to the lazy clone idea, please choose 'Narrow and Sparse clone' instead :-) -- Jan Holesovsky

Goal: A working lazy clone prototype implementation that could be considered for inclusion, in a nice series of commits (separate branch/fork)
Language: C
Suggested mentor: Jan Holesovsky, who submitted proof-of-concept patch for lazy clone
See also: [[[#narrowclone|Narrow and Sparse clone support]]] below, as it is quite related.

Packfile caching for git-daemon

Even with delta reuse, enumerating objects to be present in packfile generates significant load on server for pack-generating protocols, such as git:// protocol used by git-daemon. Many of requests result in the same packfile to be generated and sent; examples include full clone, or fetch of all branches since last update. It would make sense then to save (cache) packfiles, and if possible avoid regenerating packfiles by sending them from cache. (Possible extension would be to send slightly larger pack than needed if one can reuse cached packfile instead).

The goal is for git-daemon to cache packfiles, use cached packfiles if possible, and to manage packfile cache. Note that one would need in the final version some way to specify upper limit on packfile cache size and some cache entry expire policy.

Caching packfiles is a good solution for infrequently updated repositories (e.g. once per day), so at least full clone requests, and fetch since last update would generate the same packfile. An alternate solution, explored in this post (for JGit), is to cache object list instead, speeding up the most costly part of operation. This might work even for very busy repositories (used as exchange, not only distribution point).

Goal: Support for packfile cache in git-daemon, benchmark server load
Language: C

git mirror-sync

An extension, rebirth and grand simplification of the key features of the (incomplete) GitTorrent projects from previous years.

Includes extensions to the git daemon protocol to handle automatic registration and verification of mirrors, expedient distribution of updates to the mirrors, mirror selection, and for extra points, download spreading (pulling fractions of bundles from a number of nodes) and/or "mob" fork support. Thus yielding the major P2P benefits of GitTorrent in simple, incremental steps. The result would be faster downloads, lower startup overheads, restartable cloning, a more resilient server network, push to any point in the cloud, etc.

For use for sites such as git.kernel.org, repo.or.cz, github.com, etc. Implementations of GitTorrent exist at http://jonas.nitro.dk/gittorrent/libgtp.git/ and http://git.utsl.gen.nz/VCS-Git-Torrent, but the MirrorSync proposal is a simpler way forward, integrates better with git and would be preferred to proposals that would attempt to finish the GitTorrent proofs of concept.

Goal: to implement a useful amount of the MirrorSync proposal (http://code.google.com/p/gittorrent/wiki/MirrorSync, also http://github.com/samv/git/tree/mirror-sync - see Documentation/git-mirror-sync.txt and Documentation/git-mirror-sync-impl.txt in that tree)
Language: Open for proposal.
Suggested mentors:

  • Jonas Fonseca (designed the original protocol)
  • Johannes Schindelin (interested party)
  • Sam Vilain (worked on GitTorrent, designed MirrorSync)


See: 2008 GSoC project

Domain specific merge helpers

Git adds conflict markers to text files when there where conflicts while merging. This is pretty useful if your text file is line based source code. But it can be pretty difficult to use, say, with LaTeX files (where the lines typically reflect whole paragraphs, and a single typo fix results in the same conflict size as if the whole paragraph were rewritten) or XML/SVG/build.xml files (which are not commonly edited with text editors, but other programs that expect valid XML as input -- which might not be true when the conflict markers are added).

To be useful with other file types than plain text or (line based) source code, domain specific merge helpers are needed which present the merge conflict in a reasonable form, helping the user to resolve the conflicts.

Goal: to implement a merge helper for at least one additional file type (depending on the file type, one domain specific merge helper might not fill a whole GSoC project)
Language: Open for proposal.
Suggested mentors:

  • Johannes Schindelin

Apply sparse To Fix Errors

Teach sparse how to fix common errors, and then use sparse to actually fix them. Perhaps more of a sparse project than a Git project. The community just wants to see Git improved, if sparse is improved at the same time, double bonus points!  :-)

Goal: Fix existing errors in Git.
Language: C
Mentor: Johannes Schindelin (johannes.schindelin@gmx.de)
Suggested by: Johannes Schindelin on gmane

Larger Projects

"Smart" HTTP based GIT Transport

Hosting a publicly available Git repository currently requires a server where a TCP port based service may be operated. This not permitted by most of the web site hosting companies except as part of their "dedicated" server offerings. This project is to implement a CGI based "smart" Git transport for Git's fetch and push operations, and the necessary client code. The completed project should also include support both authenticated and anonymous clients.

Most of what is required for the "smart" Git transport for this project is already implemented by the bundle facilities that Git already supports but changes to the existing bundle code may be necessary. The CGI component should be implemented in in a language that a majority of web site hosting companies permit to be used with their product offerings and should rely the Git suite of programs for most of the repository manipulation.

Language: Perl, C, shell, other
Goal: Server (CGI) and client that supports fetch & push operation for both authenticated & anonymous clients.
See: RFC 2/2| Add Git-aware CGI for Git-aware smart HTTP transport thread on git mailing list (GMane), http-protocol.txt RFC

Proper git-svn support on Windows

While we have a git-svn in msysGit (that was backed out for some time due to dependency problems), it is very slow, as it is based on an MSys perl (using a POSIX emulation layer) rather than a MinGW perl (using the Win32 API directly). Also, the interaction between CR/LF handling (core.autocrlf = true) and git-svn seems to have issues.

This project requires knowledge of Perl, and persistence in dealing with the Win32 API.

Language: Perl, C, shell
Goal: Substitute msysGit's Perl with a MinGW based one, and profile git-svn.
Mentor:

  • Johannes Schindelin (johannes.schindelin@gmx.de)

Improve git-cheetah, a TortoiseSVN lookalike

git-cheetah is pretty rudimentary at the moment, and its development speed is glacial, even if it is basically working (but lacks a few functions). However, it is much, much smaller and less complicated than, say, TortoiseSVN, and the goal is to provide most functionality by enhancing and calling git-gui, so it should be pretty easy to get a truly competitive Explorer extension.

Further, git-cheetah was started with X11 and MacOSX support in mind, so at least one of the two should be supported by the end of the project -- again, this should be relatively easy since the GUI part will be offloaded to git-gui.

Language: C, Tcl/Tk
Goal: Make git-cheetah an Explorer plugin everybody wants to have.
Mentor:

  • Johannes Schindelin (johannes.schindelin@gmx.de)

Narrow and Sparse clone support

One of the things that Git is not currently very good at is handling repositories with large media in them. There is no way to checkout a single file or directory, and the shallow clone support does not allow you to make changes and push back.

This project would implement the shallow, narrow and sparse checkout mechanisms to both the client and server side of Git, allowing you to clone:

  • the last revision (shallow)
  • a single directory (narrow)
  • a single file (sparse)

It would accomplish this by downloading all of the commit and tree objects, which are generally small, but only the blobs that you ask for (or just getting the blobs when you need them). It would also allow you to do pushes back.

Goal: Enable narrow and sparse cloning options in Git.
Language: C
Suggested by:

  • Scott Chacon (schacon@gmail.com)


See: "Git and Media repositories" on Nabble

Directory renames

Git deals quite well with renames when merging. One of the corner cases is when one side renamed some directory, and other side created new files in the old-name directory. Git currently creates new files in resurrected old-name directory, while it could create new files under new-name directory instead.

There is a bit of controversy about this feature, as for example in some programming languages (e.g. Java) or in some project build tool info it is not posible to simply move a file (or create new file in different directory) without changing file contents. Some say that is better to fail than to do wrongly clean merge.

Goal: At minimum option enabling wholesame directory rename detection. Preferred to add dealing with directory renames also to merge. At last, one can try to implement "git log --follow" for directories.
Language: C
Suggested mentor: Yann Dirson
See: RFC PATCH v2 0/2| Detection of directory renames thread on git mailing list (via GMane)
See also:

Projects So Large Your Head Will Spin

Implement pack v4/v5 for higher compression

Although Git's packfile format offers incredible disk space savings over other VCS systems, we already know we can still do better. A prototype design dubbed 'packv4' has been floating around the Git world for almost two years, but has not been fully implemented for inclusion in a shipping release.

At this point the implementation for 'packv4' would need to be rewritten from scratch, as much of the affected parts of the core Git code have been modified to support other enhancements. A student with a strong interest in data compression could really get into this project and take it to places we have never thought possible.

Goal: Save at least 10% on current packv2 for benchmark repositories (linux-2.6, git, mozilla, gcc) and boost runtime ~10%.
Language: C
See: on kerneltrap and the entire "understanding version 4 packs" thread on gmane

libgit2

Git's source code was not written to be reentrant, but to execute a program doing a specific task, and then exit. This is not really useful if you want to access a Git repository using a library rather than executing a program, catching and parsing its output. Further, some companies seem to be put off by the GPL that applies to Git's source code, and would prefer a BSD licensed library to base their product on.

Shawn Pearce already started libgit2, but as of now most functions are not implemented. This project's purpose is to implement most of these functions, and maybe provide language bindings for Python, Ruby, etc.

Language: C, doxygen
Goal: Implement enough of libgit2 to access a Git repository from a C program and/or a Python/Ruby/Whatever script.
Mentor:

  • Andreas Ericsson (ae@op5.com)
  • Johannes Schindelin (johannes.schindelin@gmx.de)

git-svnserver

The idea here is implement something in Git that speaks the Subversion protocol on the wire, but uses Git as the backend storage. (This would be like the existing git-cvsserver.)

Goal: To be able to access git repository, at minimum read-only, from a Subversion client, at least svn CLI.
Language: Open for proposal.
Suggested mentors:

Restartable Clone

When cloning a large repository (such as KDE, Open Office, Linux kernel) there is currently no way to restart an interrupted clone. It may take considerable time for a user on the end of a small pipe to download the data, and if the clone is interrupted in the middle the user currently needs to start over from the beginning and try again. For some users this may make it impossible to clone a large repository.

The difficulty of this task is under a dispute: it should be fairly easy to make clone restartable for (deprecated) rsync:// protocol and for "dumb" protocols using commit walker, like http://. It shouldn't be too difficult to design user interface for this feature. Adding resume support to git-clone over "smart" protocols using packfile generation like git:// and ssh:// would require (most probably) good knowledge of pack protocol and packfiles. It would require most probably new extension to git protocol.

Goal: Allow git-clone to automatically resume a previously failed download over the native git:// protocol.
Language: C
Suggested by: Shawn Pearce on gmane

Other Resources

Other links


Personal tools