Python to help find duplicates in audio and video files

Overview

PyPi module N/A
git repository https://bitbucket.org/arrizza-public/fix-my-media
git command git clone git@bitbucket.org:arrizza-public/fix-my-media.git
Verification Report https://arrizza.com/web-ver/python-fix-my-media-report.html
Version Info
  • Ubuntu 22.04 jammy, Python 3.10

Summary

Going through my audio files, I noticed that I had some duplicates. Cleaning them up was tedious: I had about 3500 files, and going through them by hand took a lot of time. To help with that, I wrote a simple app in Python that scans all of those files and reports any that match.

Phase 1

The first pass was to use md5sum on each file. Two identical files have exactly the same md5sum, so finding those was easy: use a dict that maps each md5sum to a list of the file paths that have it, then go through that dict and report any md5sum with 2 or more file paths.

One nice thing was that I knew the files were identical even if the file names were different.
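The dict-of-lists idea can be sketched as follows. This is a minimal illustration, not the code from the repository; the function names md5_of and find_md5_duplicates are made up for the example, and it uses Python's hashlib rather than calling the md5sum command.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_md5_duplicates(root):
    """Map each md5sum to the files that share it, keeping only groups of 2 or more."""
    by_md5 = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_md5[md5_of(path)].append(str(path))
    return {md5: paths for md5, paths in by_md5.items() if len(paths) >= 2}
```

Calling find_md5_duplicates on a music directory returns only the md5sums that occur more than once, so printing each key with its paths gives a report similar in spirit to the one shown in Phase 2.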

Phase 2

After getting rid of those identical duplicates, it was clear that I still had multiple files that seemed to be duplicates but had different contents (i.e. md5sum didn't match).

One possibility was that the files had different audio formats. I didn't want to trust the file extension, and that turned out to be a good decision.

I installed python-magic, which gives detailed media content types. On Ubuntu, it requires the libmagic1 package; see tools/install/do_install_ubu:

sudo apt install -y libmagic1
# note: automatically installed by "./do_install full"

The duplicates are reported in out/report_md5sum_summary.txt.

---- md5sum duplicates:
  >> ~/Music-dups

         0104e2a90f2457c8d74844c366e936ce ~/Music-dups/19 - Immigrant Song (Album Version) [Album Version].mp3
         0104e2a90f2457c8d74844c366e936ce ~/Music-dups/Led Zeppelin - Immigrant Song (Album Version) [Album Version].mp3

         01c1230af4ea971886b5cc13cc1b00b0 ~/Music-dups/Echo  the Bunnymen - Ticket To Ride.mp3
         01c1230af4ea971886b5cc13cc1b00b0 ~/Music/Echo and the Bunnymen - Unknown Album/06 - Ticket To Ride.mp3
<snip>
     found 168 duplicates

More audio detail is available in out/report_md5sum_duplicates.txt.

Search that file for WARN to find warnings about any issues or checks that seem to be out of order.

  >> 2dff5cb15b0d1edbd3a0456f2bb80623: found 2 duplicate files
         ~/Music-dups/38 - The Rain Song (Album Version) [Album Version].mp3
         Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 256 kbps, 44.1 kHz, Stereo
         ---
         ~/Music-dups/Led Zeppelin - The Rain Song (Album Version) [Album Version].mp3
         Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 256 kbps, 44.1 kHz, Stereo
         ---

And for fun, I report the number of extensions and audio formats found:

---- File extensions:
           11: .flac
          308: .m4a
           10: .m4p
         2840: .mp3
           42: .ogg
           47: .wa
          290: .wma
     #extensions    : 7

---- Audio Types:
           17: ['.mp3']             Audio file with ID3 version 2.3.0
            4: ['.mp3']             Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 112 kbps, 44.1 kHz, JntStereo
         2133: ['.mp3']             Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo
          226: ['.mp3']             Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 256 kbps, 44.1 kHz, Stereo
            2: ['.mp3']             Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 32 kbps, 44.1 kHz, JntStereo
          220: ['.mp3']             Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 320 kbps, 44.1 kHz, JntStereo
          121: ['.mp3']             Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 320 kbps, 44.1 kHz, Stereo
          <snip>
           11: ['.flac']            FLAC audio bitstream data, 16 bit, stereo, 44.1 kHz
          199: ['.m4p', '.m4a']     ISO Media, Apple iTunes ALAC/AAC-LC (.M4A) Audio
            1: ['.m4p']             ISO Media, Apple iTunes ALAC/AAC-LC (.M4P) AES Protected Audio
           59: ['.m4a']             ISO Media, MP4 Base Media v1 [ISO 14496-12:2003]
           59: ['.m4a']             ISO Media, MP4 v2 [ISO 14496-14]
            1: ['.mp3']             MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo
          285: ['.wma']             Microsoft ASF
            5: ['.wma']             Microsoft ASF ASF_Extended_Content_Description_Object
           42: ['.ogg']             Ogg data, Vorbis audio, stereo, 44100 Hz, ~192000 bps, created by: Xiph.Org libVorbis I (1.3.2)
           47: ['.wa']              RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 44100 Hz
     #audio types   : 38
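A tally like the two above could be produced along these lines. This is a rough sketch under assumptions: the names describe_with_magic and tally are hypothetical, and the only python-magic call used is magic.from_file, which is imported lazily so the counting logic can be tried without the library installed.

```python
from collections import Counter, defaultdict
from pathlib import Path


def describe_with_magic(path):
    """Return libmagic's description of a file (needs python-magic and libmagic1)."""
    import magic  # third-party; imported lazily so the rest runs without it
    return magic.from_file(str(path))


def tally(paths, describe=describe_with_magic):
    """Count files per extension and per content description.

    Also records which extensions were seen for each description, which is
    how a line such as "['.m4p', '.m4a']  ISO Media, ..." could be built.
    """
    ext_counts = Counter()
    type_counts = Counter()
    type_exts = defaultdict(set)
    for path in map(Path, paths):
        ext = path.suffix.lower()
        desc = describe(path)
        ext_counts[ext] += 1
        type_counts[desc] += 1
        type_exts[desc].add(ext)
    return ext_counts, type_counts, type_exts
```

Passing the describe function as a parameter is a design choice for this sketch only; it makes it easy to substitute a stub when trying the counting logic on its own.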

Oddly enough, this report showed a number of files with ".mp3" and other audio extensions that were actually JPEGs internally. I double-checked, and each was only 20K or less, so I got rid of all of them.
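One way to flag such files automatically is to ask libmagic for the MIME type and look for an "image/" prefix. The sketch below assumes that approach; find_images_posing_as_audio and mime_with_magic are hypothetical names, not functions from the repository.

```python
def mime_with_magic(path):
    """Return the MIME type libmagic detects for a file (needs python-magic)."""
    import magic  # third-party; imported lazily so the rest runs without it
    return magic.from_file(str(path), mime=True)


def find_images_posing_as_audio(paths, mime_of=mime_with_magic):
    """Return (path, mime) pairs whose content is an image, whatever the extension."""
    suspects = []
    for path in paths:
        mime = mime_of(path)
        if mime.startswith("image/"):
            suspects.append((str(path), mime))
    return suspects
```

Checking only for "image/" is deliberately narrow: some real audio containers are reported with non-audio MIME types, so flagging everything that is not "audio/" would likely produce false alarms.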

I also found out that some audio types are not playable on my Ubuntu machine: '.wma' and '.m4p'. Eventually I also found that some .m4a files were not playable by the Rhythmbox, audio, or play apps. Over time, I moved those out of my ~/Music directory as well.

Phase 3

At this point, I still suspected I had duplicates: files in acceptable formats, with different content (for some reason), that were nevertheless the same song. To find these, all I had left was the file name.

To match two file names as "close enough" I used BM25. This is a ranking function used by search engines to score how well a page (via its word content) matches the words used in a search.

In this case I gathered the word corpus for all the file names and then went through them one by one looking for a high BM25 score.

I started with a fairly high threshold: a score of 25.0 or above meant a high likelihood that the filenames matched.

    if score[0] < 25.000:
        continue
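To make the scoring concrete, here is a small self-contained sketch of Okapi BM25 applied to file names. It is an illustration under assumptions, not the repository's code: the tokenizer, the class name BM25, the function similar_names, the parameters k1=1.5 and b=0.75, and the "+1" form of the IDF term are all choices made for this example, and it expects a non-empty list of names.

```python
import math
import re
from collections import Counter


def tokenize(name):
    """Lower-case a file name and split it into words (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", name.lower())


class BM25:
    """Okapi BM25 over a corpus of tokenized documents (here: file names)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(doc) for doc in docs]
        self.lengths = [len(doc) for doc in docs]
        self.avg_len = sum(self.lengths) / len(docs)
        # document frequency: in how many documents each word appears
        doc_freq = Counter(word for doc in self.docs for word in doc)
        n = len(docs)
        self.idf = {w: math.log((n - df + 0.5) / (df + 0.5) + 1.0)
                    for w, df in doc_freq.items()}

    def score(self, query, index):
        """Score the document at `index` against a list of query words."""
        doc, length = self.docs[index], self.lengths[index]
        norm = self.k1 * (1.0 - self.b + self.b * length / self.avg_len)
        total = 0.0
        for word in query:
            freq = doc.get(word, 0)
            if freq:
                total += self.idf[word] * freq * (self.k1 + 1.0) / (freq + norm)
        return total


def similar_names(names, threshold=25.0):
    """Return (name, other, score) for every pair scoring at or above threshold."""
    corpus = [tokenize(name) for name in names]
    bm25 = BM25(corpus)
    hits = []
    for i, query in enumerate(corpus):
        for j in range(len(corpus)):
            if i != j:
                score = bm25.score(query, j)
                if score >= threshold:
                    hits.append((names[i], names[j], score))
    return hits
```

BM25 scores depend on the corpus size, the tokenizer, and the k1 and b parameters, so the 25.0 limit used in the text will not carry over directly to this sketch; a suitable threshold has to be found by looking at the scores it produces on real data.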

See out/report_filenames_similar for these kinds of matches:

---- paths with similar filenames:

<snip>
  >> ~/Music-dups/02 - I Still Haven't Found What I'm Looking For.mp3
           0]   40.814 ~/Music-dups/U2 - The Joshua Tree 2/02 - I Still Haven't Found What I'm Looking For.mp3
           1]   40.814 ~/Music/U2 - Joshua Tree/02 - I Still Haven't Found What I'm Looking For.mp3
           2]   39.596 ~/Music-drm/U2 - I Still Haven't Found What I'm Looking For.wma
         ----
<snip>
  >> ~/Music/coolio - gangsta's paradise/17 - get up get down.mp3
           0]   37.052 ~/Music/In Yo' Face - Vol1/06 - Get Up And Get Down.mp3
         ----

     found 484 similar filenames

I found many duplicates using this report. I then lowered the score limit from 25.000 to 15.000 and then down to 8.000. Each step found more hits, but the lower the score, the less likely it was that a hit was a real duplicate.

Also, I noticed that two file names can share common words, and that causes a high score. For example, see the "gangsta's paradise" file "get up get down" and the "In Yo' Face" file "Get Up And Get Down". These are not the same song at all, but they do use common words.

As well, jazz albums tend to have covers of the same song, but played differently, so they are not really the same song.

A lot of classical music uses song names that are nearly the same except for a number or a key, e.g. "B Flat Major" vs "F Major":

  >> ~/Music/Wendy Carlos - Switched-On Bach/04 - Two-Part Invention In B Flat Major.mp3
           0]   27.038 ~/Music/Wendy Carlos - Switched-On Bach/03 - Two-Part Invention In F Major.mp3
         ----

Phase 4

I knew there was media content that gave album name, artist, track number, and song title inside the file. I figured that could help find any files that were badly named on the outside, but had good info on the inside.

I used the Python module TinyTag to get the media info.

See out/report_file_media_info.txt for detailed content of each file.

---- file media info:
  >> --- act:                                 Music/War - Unknown Album/04 - Four Cornered Room
  >> ch2 exp:                             Music/War - World Is A Ghetto/04 - Four Cornered Room
         album     : The World Is A Ghetto
         track     : 4
         title     : Four Cornered Room
         artist    : War
<snip>         
  >> --- act:                              Music/War - Unknown Album/07 - Low Rider
  >> ch1 exp:                               Music/War - Unknown Album/07 - LowRider
         album     : Unknown Album
         track     : 7
         title     : LowRider
         artist    : War

<snip>         
>> --- act:                             Music/coolio - gangsta's paradise/17 - get up get down
>> ok  exp:                             Music/coolio - gangsta's paradise/17 - get up get down
     album     : gangsta's paradise
     track     : 17
     title     : get up get down
     artist    : coolio

reported 3548 media info

The tags mean:

  • ">> ok " indicates that the path matches "<artist - album>/<track - title>".
  • ">> ch1 " indicates that the song name does not match "track - title" from the media info.
  • ">> ch2 " indicates that the song name matches, but the directory does not match "artist - album".
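The three tags can be sketched as a comparison between the actual path and a path built from the media info. This is an illustration only: expected_path and classify are hypothetical names, the tags are passed in as a plain dict (which could be filled from the album, artist, title, and track attributes that TinyTag.get returns), the track number is assumed to be present and zero-padded to two digits, and the actual path is assumed to have its extension already removed, as in the report above.

```python
from pathlib import PurePosixPath


def expected_path(tags):
    """Build '<artist> - <album>/<track> - <title>' from a dict of tag values."""
    directory = f"{tags['artist']} - {tags['album']}"
    filename = f"{int(tags['track']):02d} - {tags['title']}"
    return PurePosixPath(directory) / filename


def classify(actual, tags):
    """Return 'ok', 'ch1' (file name differs) or 'ch2' (only the directory differs)."""
    actual = PurePosixPath(actual)
    expected = expected_path(tags)
    if actual.name != expected.name:
        return "ch1"
    if actual.parent.name != expected.parent.name:
        return "ch2"
    return "ok"
```

The sketch compares names literally, so it does not reproduce any cleanup the real tool may apply to tag values, and a title containing "/" would need extra handling.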

I updated many songs and directories to match the content of the media info. Note that some media info content was itself wrong, e.g. "title : LowRider" from the audio file is missing a space.

I have some directories named "artist - Unknown Album" that hold one-off songs from many different albums by a given artist. I also have a catch-all directory, "Various - Unknown Album", for any remaining one-off songs. That's handy, but those directories can cause spurious warnings.

In the end, I found that some media info was repetitive, just wrong, or so verbose that it was useless:

album     : Mambo No. 5 (A Little Bit of .
track     : 1
title     : Mambo No. 5 (A little bit of .
<snip>
title     : Itchycoo Park (Mono Version) (2012 Remaster) 2012 Remaster
<snip>
album     : Sultans Of Swing - The Very Best Of Dire Straits
track     : 9
title     : Money For Nothing (Album Version-Very Best Of...)
artist    : Dire Straits

Even so, this did clean up the song titles and made it easier to search them for duplicates.

Phase 5

At this point, I knew I still had duplicates. The only recourse I had left was to search for "matching" filenames, where "matching" on the filename (no directory) was defined as follows:

  • remove the leading track number "track - "
  • "exact": character by character match of the file name (the track number may be different)
  • "trim": filename with common version removed e.g. "Bonus Track", "Explicit", etc.
  • "case": trimmed filename converted to lower case e.g. "The" vs "the"
  • "spaces": trimmed, lower filename with spaces removed e.g. "sevennationarmy"
  • "punc": trimmed, lower, spaceless filename with common punctuation removed e.g. "Hello It's Me"
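The levels in the list above can be sketched as a series of progressively looser keys. This is not the repository's implementation: match_keys, match_level, and the VERSION_WORDS tuple are hypothetical, and the real list of "common version" phrases is not given in the text, so the three phrases here are only examples.

```python
import re
import string

# Hypothetical list of version phrases to trim; the real tool's list is not shown.
VERSION_WORDS = ("Album Version", "Bonus Track", "Explicit")
LEVELS = ("exact", "trim", "case", "spaces", "punc")


def match_keys(filename):
    """Return one comparison key per level for a file name (no directory)."""
    stem = re.sub(r"\.[A-Za-z0-9]{2,4}$", "", filename)   # drop the extension
    exact = re.sub(r"^\d+\s*-\s*", "", stem)              # drop leading "track - "
    trim = exact
    for word in VERSION_WORDS:
        # remove the phrase plus any brackets or spaces wrapped around it
        pattern = rf"[\s\(\[]*{re.escape(word)}[\)\]]*"
        trim = re.sub(pattern, "", trim, flags=re.IGNORECASE)
    trim = trim.strip()
    case = trim.lower()
    spaces = case.replace(" ", "")
    punc = spaces.translate(str.maketrans("", "", string.punctuation))
    return {"exact": exact, "trim": trim, "case": case, "spaces": spaces, "punc": punc}


def match_level(name_a, name_b):
    """Return the first (strictest) level at which two file names match, else None."""
    keys_a, keys_b = match_keys(name_a), match_keys(name_b)
    for level in LEVELS:
        if keys_a[level] == keys_b[level]:
            return level
    return None
```

Reporting the strictest level that matches keeps the strongest evidence visible: an "exact" hit is more likely to be a real duplicate than a "punc" hit.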

See out/report_filename_matches for the report.

---- filename matches:
<snip>
         exact:  A Remark You Made:
                 prev: /home/arrizza/Music/Weather Report - Heavy Weather/02 - A Remark You Made.mp3
                 curr: /home/arrizza/Music/Weather Report - Jaco Years/03 - A Remark You Made.mp3

         trim:   Deacon Blues: "Deacon Blues"
                 prev: /home/arrizza/Music-dups/04 - Deacon Blues Album Version.mp3
                 curr: /home/arrizza/Music-dups/04 - Deacon Blues.mp3

         case :  Sunshine Of Your Love: "sunshine of your love"
                 prev: /home/arrizza/Music/Bobby McFerrin - Simple Pleasures/10 - Sunshine of Your Love.mp3
                 curr: /home/arrizza/Music/Eric Clapton - Cream Of Clapton/02 - Sunshine Of Your Love.mp3

         spaces: Give Peace A Chance: "givepeaceachance"
                 prev: /home/arrizza/Music-dups/GivePeaceAChance.mp3
                 curr: /home/arrizza/Music/John Lennon - Rock 'N' Roll/11 - Give Peace A Chance.mp3

         punc:   Ruby, My Dear: "rubymydear"
                 prev: /home/arrizza/Music/Thelonious Monk - Best Of Thelonious Monk/02 - Ruby My Dear.mp3
                 curr: /home/arrizza/Music/Thelonious Monk - This is Jazz 5/04 - Ruby, My Dear.mp3
<snip>
     reported 826 filename matches

Note that two songs can have exactly the same name and still be different. Typically, one is a cover of a song by another artist, but they can also be completely different songs for which both artists chose the same name. In short, you have to confirm they are identical by playing them.

exact:  Norwegian Wood (This Bird Has Flown):
        prev: /home/arrizza/Music/Beatles - Unknown Album/13 - Norwegian Wood (This Bird Has Flown).mp3
        curr: /home/arrizza/Music/Herbie Hancock - New Standard/03 - Norwegian Wood (This Bird Has Flown).mp3

exact:  On The Run:
        prev: /home/arrizza/Music/OMC - How Bizarre/01 - On The Run.mp3
        curr: /home/arrizza/Music/Pink Floyd - Dark Side Of The Moon/02 - On The Run.mp3

Herbie Hancock's version is a jazz cover of the Beatles' song. OMC's and Pink Floyd's songs are completely different; the artists simply chose the same name.

Finally...

After all of this, I ended up with 2400 or so files. Many of the duplicates had a simple cause: I must have temporarily created a new directory for an album and not cleaned it up when I was done. Many were multiple albums with the same songs. Some of these were remastered and sounded better, so I kept those; the rest were duplicates.

Quite a few were the same song in a different media format. I must have bought them multiple times without checking my current inventory - oops!

- John Arrizza