#2678 Identify binary blobs in pkgs git repos
Closed: Fixed None Opened 13 years ago by kevin.

= phenomenon =

Some maintainers have checked in binary blobs to git.

= background analysis =

This makes those packages slow to clone and harder to work with.

= implementation recommendation =

We need some kind of script to identify common binary blobs checked into git and what commit they were added in so we can take further action.

Common blobs would be .tar.gz, .tar.bz2, .tar.xz and the like.
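As a rough illustration of the kind of script meant here, the sketch below asks git for every commit that added a file and keeps the ones whose name ends in a blob-like suffix. It is only a sketch under assumptions: the names `BLOB_SUFFIXES`, `parse_added_blobs` and `find_added_blobs` are made up for this example, and `.tgz` and `.zip` are suffixes I added to the list above.

```python
import subprocess

# Suffixes treated as "common blobs". The first three come from the ticket;
# .tgz and .zip are assumed additions.
BLOB_SUFFIXES = (".tar.gz", ".tar.bz2", ".tar.xz", ".tgz", ".zip")

# Each commit header is printed as byte 0x01 followed by the full hash, so a
# header line cannot be mistaken for a path line.
LOG_FORMAT = "--format=%x01%H"


def parse_added_blobs(log_text, suffixes=BLOB_SUFFIXES):
    """Return (commit, path) pairs for added files ending in a blob suffix."""
    found = []
    commit = None
    for line in log_text.splitlines():
        if line.startswith("\x01"):
            commit = line[1:].strip()          # start of a new commit
        elif line.strip() and commit is not None:
            if line.lower().endswith(suffixes):
                found.append((commit, line))   # a path added in that commit
    return found


def find_added_blobs(repo_dir="."):
    """Run git log in repo_dir and list commits that added blob-like files."""
    out = subprocess.run(
        ["git", "log", "--all", "--diff-filter=A", "--name-only", LOG_FORMAT],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    return parse_added_blobs(out)
```

Matching on the file name alone is cheap but will miss blobs with misleading names, so it is probably best seen as a first pass.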


I have whipped up a commit hook that can prevent checkins of large files or those of certain types. It's in the infrastructure git repo under scripts/pkgs-update-hook.

Currently it exempts files ending in .patch and prevents the checkin of anything the file command calls 'compressed data' as well as naked tarballs and zip files. It's kind of trivial to add additional checks or to change the size limit (which is currently absurdly low for testing).
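For readers without the infrastructure repo at hand, the policy described above could look roughly like this. This is not the code in scripts/pkgs-update-hook; `check_file`, `MAX_SIZE` and the marker strings are assumptions for illustration, and I am assuming that .patch files skip every check, including the size limit.

```python
# Hypothetical size limit in bytes; the real hook's limit is not given here.
MAX_SIZE = 1_000_000

# Substrings looked for in the description printed by the file command.
# These are assumed to match typical output such as "gzip compressed data",
# "POSIX tar archive" and "Zip archive data".
REJECTED_MARKERS = ("compressed data", "tar archive", "zip archive")


def check_file(name, size, file_description, max_size=MAX_SIZE):
    """Return None if the file may be checked in, else a rejection reason."""
    if name.endswith(".patch"):
        return None  # exempt from all checks (assumption, see above)
    description = file_description.lower()
    for marker in REJECTED_MARKERS:
        if marker in description:
            return f"{name}: rejected, file reports '{marker}'"
    if size > max_size:
        return f"{name}: rejected, size {size} exceeds limit {max_size}"
    return None
```

Adding another exemption or another rejected type is then a one-line change to the suffix test or to `REJECTED_MARKERS`.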

It just occurred to me that it would be wise to exempt spec files as well. (The kernel spec is 104K as it is, and it's not the largest I happen to have checked out.)

Do we need to consider what a binary blob is?

From http://docs.freebsd.org/info/diff/diff.info.Binary.html:
"diff determines whether a file is text or binary by checking the first few bytes in the file; the exact number of bytes is system dependent, but it is typically several thousand. If every character in that part of the file is non-null, diff considers the file to be text; otherwise it considers the file to be binary."
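The heuristic in that quote is simple enough to sketch. The function name `looks_binary` and the default of 8000 bytes are my own choices; the quote only says "several thousand".

```python
def looks_binary(data, sample_size=8000):
    """Null-byte heuristic: binary if the first sample_size bytes contain NUL."""
    return b"\x00" in data[:sample_size]
```

One caveat: text encoded as UTF-16 contains null bytes for ordinary ASCII characters, so this test would class it as binary.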

Seems to me that we definitely want to target large binary blobs, and definitely exclude .spec and .patch files.

Anything that looks like text in any encoding should be allowed, IMHO -- git will be able to efficiently pack them if they can be diff-ed.

That leaves small binary blobs (e.g. image files). Any script that tries to detect existing such files should probably report on any binary blob (i.e. non-textual) as well as the filename, extension, size and checksum; so that any clean-up script could make use of this information.
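To make the proposed report concrete, here is a minimal sketch that walks a checked-out tree and emits one record per non-textual file. Everything in it is an assumption for illustration: the names `describe_file` and `report_blobs`, the null-byte test over the first 8000 bytes, and SHA-256 as the checksum.

```python
import hashlib
import os


def describe_file(path):
    """Return a report record for path, or None if the file looks like text."""
    with open(path, "rb") as fh:
        data = fh.read()
    # Assumed text/binary test: text has no NUL byte near the start.
    if b"\x00" not in data[:8000]:
        return None
    return {
        "filename": path,
        "extension": os.path.splitext(path)[1],  # last suffix only, e.g. ".gz"
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def report_blobs(root):
    """Walk a checked-out tree and collect a record for every binary file."""
    records = []
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")  # skip git's own object store
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue  # ignore symlinks and special files
            record = describe_file(full)
            if record is not None:
                records.append(record)
    return records
```

This only looks at the current checkout; blobs that were added and later removed would still be in the history and need a separate pass over the git objects.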

Thoughts on this? I might try to whip something up this weekend. Thanks!

Well, someone might have mistakenly committed a binary named 'foo.patch' :)

Yeah, outputting filename, extension, size, etc. would be great.

A script to find all blobs in a repo. This version only reports them; the delete function does not work yet.
find-and-report.tar.gz

This updated version reports by default, but can also delete blobs (with --delete or -D), which rewrites the git history.
find-report-delete.tar.gz

Many thanks to Nick (nb) for reminding me of git filter-branch.
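For anyone who has not used git filter-branch, a minimal sketch of how paths can be dropped from every commit follows. I do not know how the attached script invokes it; `filter_branch_command` and `run_filter_branch` are hypothetical helpers, and the index-filter form with `git rm --cached --ignore-unmatch` is simply the usual pattern for this job.

```python
import shlex
import subprocess


def filter_branch_command(paths):
    """Build a git filter-branch command that removes paths from all history."""
    if not paths:
        raise ValueError("no paths given")
    # The index filter runs once per commit; --ignore-unmatch keeps it from
    # failing on commits that do not contain the path.
    index_filter = "git rm --cached --ignore-unmatch -- " + " ".join(
        shlex.quote(p) for p in paths
    )
    return ["git", "filter-branch", "--index-filter", index_filter, "--", "--all"]


def run_filter_branch(repo_dir, paths):
    """Rewrite history in repo_dir. Destructive: try it on a fresh clone first."""
    subprocess.run(filter_branch_command(paths), cwd=repo_dir, check=True)
```

Because every rewritten commit gets a new hash, existing clones will no longer match the repo afterwards and would need to be re-cloned.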
