Scripting fun

My music (mp3, ogg) collection is in desperate need of cleaning up. Over the past several years I’ve moved it, copied it, and merged it, and the end result is that I have a ton of duplicate files filling up my hard disk, and it’s nearly impossible to track them all down manually. Add to that the fact that iTunes has, in the past, been less than stellar about keeping obvious duplicates out, so I have a lot of Some Track Name.mp3, Some Track Name 2.mp3, etc.

To weed out the most obvious, I’ve taken the following approach:

Pass 1:

find iTunes/ -type f -iname '*.mp3' -print0 | xargs -0 -n1 md5sum > audio.iTunes.md5

That calculates an MD5 hash of every single mp3 file in the iTunes folder (which is NFS-mounted, so I could do this on the actual Linux server without slowing down the OS X desktop noticeably).
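
Since the collection also contains ogg files, the same pass could cover both extensions. Here is a minimal sketch, not what I actually ran: the function name hash_audio is made up for illustration, and it swaps the xargs pipeline for find’s own -exec ... + form, which batches file names in much the same way. It assumes a find that understands -iname (GNU and BSD find both do) and that md5sum is installed.

```shell
# hash_audio DIR (hypothetical helper): print one md5sum line for every
# .mp3 or .ogg file under DIR, matching the extension case-insensitively.
hash_audio() {
    find "$1" -type f \( -iname '*.mp3' -o -iname '*.ogg' \) -exec md5sum {} +
}
```

Used like this: hash_audio iTunes/ > audio.iTunes.md5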

Pass 2:

cut -d' '   -f1 audio.iTunes.md5 | sort | uniq -c | sort -nr > audio.iTunes.md5.dupes

This takes the first field of each line in the output of the previous pass (the checksum), sorts the checksums so that identical entries end up next to each other, counts how many there are of each, and finally sorts the result so that the ones with the most dupes are at the top.

That gives you a file that looks like this:

      3 ff836b651908cced484d7b1da5d3257f
      3 f85e0cc4268fd65688f3cbfed683fe2a
      3 f5b0f3b3595b56285e60a4182e6c47c4
      3 f4bcde670ab14abaa2836e4b562f8a0f
...
      1 0023360976f48c6d71fd6e9e988d37a1
      1 00222cdd9cd524655caad5ad99976d0f
      1 001aa7e984e0f75d9fef22fa76ea3008
      1 0004e4b2dd2534a971314e1da21a5bf2

Exciting, huh!

Pass 3:

grep -v '^      1 ' audio.iTunes.md5.dupes | cut -c9- > audio.iTunes.md5.dupes.2

This first filters out the lines that start with “ 1”, which aren’t actually duplicates, and at the same time narrows the output back down to just the MD5 checksums.
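
If the counts themselves aren’t interesting, the counting and filtering can probably be collapsed into a single step: uniq -d prints only the lines that occur more than once, one copy of each. A minimal sketch, with dupe_hashes as a made-up function name; note that the result comes out sorted by checksum rather than by number of dupes.

```shell
# dupe_hashes FILE (hypothetical helper): print each checksum that appears
# on more than one line of an md5sum listing, once per checksum.
dupe_hashes() {
    cut -d' ' -f1 "$1" | sort | uniq -d
}
```

Used like this: dupe_hashes audio.iTunes.md5 > audio.iTunes.md5.dupes.2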

Now the last pass:

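# For each duplicated checksum, list the files that share it.
while read -r md5; do
    echo "$md5:"
    # Strip the 32-character checksum, leaving the separator and file name.
    grep -F "$md5" audio.iTunes.md5 | cut -c33-
    echo ''
done < audio.iTunes.md5.dupes.2 > audio.iTunes.md5.dupes.list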

I actually did this on one line originally, but for readability (ahem) I spread it out a bit. It takes the list of MD5 checksums of actual duplicates, then for each one of those finds the files that matched.
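
The loop runs grep over the whole checksum file once per duplicated checksum, which is fine for a one-off but gets slow on a big collection. Here is a sketch of a single-pass alternative that works directly from the md5sum listing. The function name group_dupes is made up, and it assumes md5sum’s usual line layout: a 32-character checksum, two separator characters, then the file name. File names containing backslashes or newlines are escaped by md5sum and would need extra care.

```shell
# group_dupes FILE (hypothetical helper): read an md5sum listing and print
# each checksum shared by two or more files, followed by those file names
# and a blank line.
group_dupes() {
    sort "$1" | awk '
        { hash = $1; name = substr($0, 35) }   # name starts after hash + 2 chars
        hash == prev {
            # Second or later file with this checksum: print the header and
            # the first file once, then every further file.
            if (!shown) { print hash ":"; print "  " first; shown = 1 }
            print "  " name
            next
        }
        {
            # New checksum: close the previous group if it had duplicates.
            if (shown) print ""
            prev = hash; first = name; shown = 0
        }
        END { if (shown) print "" }
    '
}
```

Used like this: group_dupes audio.iTunes.md5 > audio.iTunes.md5.dupes.list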

This is where I stop scripting things because I don’t want to accidentally remove the wrong duplicate, but I could imagine that someone would take the above while-loop and instead of creating an overview file, remove all but the first grep result for each MD5. I’ll leave that as an exercise to the reader. ;-)
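
For anyone who does want to try that exercise, a cautious first step might be to only list the candidates rather than delete anything. The sketch below prints every file whose checksum has already appeared earlier in the listing, so the first file of each group is kept off the list. The function name extra_copies is made up, and it assumes the same md5sum layout as above (32-character checksum, two separator characters, file name). Which copy counts as “first” is simply whichever find happened to report first, so the list deserves a look before anything is removed.

```shell
# extra_copies FILE (hypothetical helper): print the name of every file
# whose checksum has already been seen on an earlier line of FILE.
# This only prints names; it does not delete anything.
extra_copies() {
    awk 'seen[$1]++ { print substr($0, 35) }' "$1"
}
```

Used like this: extra_copies audio.iTunes.md5 > audio.iTunes.md5.dupes.extra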