If you need to filter the commit history of a Git repository down to the history of certain files, and performance is an issue, consider the following suggestions:
- Use BFG Repo-Cleaner where possible. It’s quite fast.
- Otherwise, use the `--subdirectory-filter` option of `git filter-branch`, where appropriate.
- Otherwise, use the `--index-filter` option of `git filter-branch` and specify the desired files as arguments.
Listing the files as arguments to `filter-branch` was not obvious to me, but it
makes a huge difference when filtering for a small subset of the commits. As
an example, consider extracting the history for getopt from the FreeBSD src
repo:
```sh
git filter-branch --prune-empty \
    --index-filter 'git ls-files -s | \
        sed -n "s/\tlib\/libc\/stdlib\/getopt/\tgetopt/p" | \
        GIT_INDEX_FILE=$GIT_INDEX_FILE.new git update-index --index-info && \
        if test -f "$GIT_INDEX_FILE.new" ; then \
            mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE" ; \
        else \
            rm "$GIT_INDEX_FILE" ; \
        fi' \
    HEAD -- lib/libc/stdlib/getopt*
```
On my laptop, without the file arguments (the `-- lib/libc/stdlib/getopt*`
part), after 5 minutes git estimates that the command will take about 4 more
hours, and the estimate keeps increasing. With the file arguments, it
completes in about 30 seconds. When the file arguments are passed, git
applies the filter only to commits that touch those files. Since git can
determine this efficiently, the amount of processing and the resulting
run time drop significantly.
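
A quick way to see how much work the file arguments save is to count the commits git would visit with and without them. The sketch below assumes the arguments after `HEAD` are handed to `git rev-list`, as the `git filter-branch` documentation describes. `count_commits` is a hypothetical helper name of my own, not something git provides.

```shell
# count_commits REV [PATHSPEC...]
# Hypothetical helper: prints how many commits reachable from REV touch the
# given pathspecs. With no pathspecs, every reachable commit is counted.
count_commits() {
    rev=$1
    shift
    # Everything after "--" is treated as a pathspec, never as a revision.
    git rev-list --count "$rev" -- "$@"
}
```

Run from the top of the repository, `count_commits HEAD` gives the number of commits the filter would otherwise have to process, and `count_commits HEAD 'lib/libc/stdlib/getopt*'` gives the number it processes once the files are listed. Quoting the pattern leaves the glob for git to expand instead of the shell.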