Overview
PyPi module | N/A |
git repository | https://bitbucket.org/arrizza-public/fix-my-media |
git command | git clone git@bitbucket.org:arrizza-public/fix-my-media.git |
Verification Report | https://arrizza.com/web-ver/python-fix-my-media-report.html |
Version Info | installation: see https://arrizza.com/setup-common |
Summary
Going through my audio files I noticed that I had some duplicates. Cleaning those up was tedious: I had about 3500 files, and going through them by hand took a lot of time. To help with that, I wrote a simple app in Python to scan all those files and report any that matched.
Phase 1
The first pass was to use md5sum on each file. Two identical files have exactly the same md5sum, so finding those was easy: use a dict keyed by md5sum, where each value is the list of file paths that have that md5sum. Then go through that dict and report any entry that has 2 or more file paths.
One nice thing was that I knew the files were identical even if the file names were different.
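As a rough illustration of that dict-of-lists approach, here is a minimal sketch. It is not the app's actual code; the function names `md5_of` and `find_md5_duplicates` are made up for this example, and it uses Python's hashlib instead of calling the md5sum command.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_md5_duplicates(root):
    """Map md5 digest -> sorted list of paths, keeping only digests seen 2+ times."""
    by_digest = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_digest[md5_of(path)].append(str(path))
    # hypothetical helper: only digests shared by two or more files are duplicates
    return {d: sorted(paths) for d, paths in by_digest.items() if len(paths) >= 2}
```

Because the key is the content digest, the file names never enter into the comparison, which is why differently named copies still show up together.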
Phase 2
After getting rid of those identical duplicates, it was clear that I still had multiple files that seemed to be duplicates but had different contents (i.e. md5sum didn't match).
One possibility is that the files had different audio formats. I didn't want to trust the file extension for some reason. That turned out to be a good thing.
I installed python-magic, which gives detailed media content types. On Ubuntu, it requires the libmagic1 package, see tools/install/do_install_ubu:
sudo apt install -y libmagic1
# note: automatically installed by "./do_install full"
The duplicates are reported in out/report_md5sum_summary.txt.
---- md5sum duplicates:
>> ~/Music-dups
0104e2a90f2457c8d74844c366e936ce ~/Music-dups/19 - Immigrant Song (Album Version) [Album Version].mp3
0104e2a90f2457c8d74844c366e936ce ~/Music-dups/Led Zeppelin - Immigrant Song (Album Version) [Album Version].mp3
01c1230af4ea971886b5cc13cc1b00b0 ~/Music-dups/Echo the Bunnymen - Ticket To Ride.mp3
01c1230af4ea971886b5cc13cc1b00b0 ~/Music/Echo and the Bunnymen - Unknown Album/06 - Ticket To Ride.mp3
<snip>
found 168 duplicates
More audio detail is available in out/report_md5sum_duplicates.txt. Search it for WARN to find warnings about any issues or checks that seem to be out of order.
>> 2dff5cb15b0d1edbd3a0456f2bb80623: found 2 duplicate files
~/Music-dups/38 - The Rain Song (Album Version) [Album Version].mp3
Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 256 kbps, 44.1 kHz, Stereo
---
~/Music-dups/Led Zeppelin - The Rain Song (Album Version) [Album Version].mp3
Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 256 kbps, 44.1 kHz, Stereo
---
And for fun, I report the number of extensions and audio formats found:
---- File extensions:
11: .flac
308: .m4a
10: .m4p
2840: .mp3
42: .ogg
47: .wa
290: .wma
#extensions : 7
---- Audio Types:
17: ['.mp3'] Audio file with ID3 version 2.3.0
4: ['.mp3'] Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 112 kbps, 44.1 kHz, JntStereo
2133: ['.mp3'] Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo
226: ['.mp3'] Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 256 kbps, 44.1 kHz, Stereo
2: ['.mp3'] Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 32 kbps, 44.1 kHz, JntStereo
220: ['.mp3'] Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 320 kbps, 44.1 kHz, JntStereo
121: ['.mp3'] Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 320 kbps, 44.1 kHz, Stereo
<snip>
11: ['.flac'] FLAC audio bitstream data, 16 bit, stereo, 44.1 kHz
199: ['.m4p', '.m4a'] ISO Media, Apple iTunes ALAC/AAC-LC (.M4A) Audio
1: ['.m4p'] ISO Media, Apple iTunes ALAC/AAC-LC (.M4P) AES Protected Audio
59: ['.m4a'] ISO Media, MP4 Base Media v1 [ISO 14496-12:2003]
59: ['.m4a'] ISO Media, MP4 v2 [ISO 14496-14]
1: ['.mp3'] MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo
285: ['.wma'] Microsoft ASF
5: ['.wma'] Microsoft ASF ASF_Extended_Content_Description_Object
42: ['.ogg'] Ogg data, Vorbis audio, stereo, 44100 Hz, ~192000 bps, created by: Xiph.Org libVorbis I (1.3.2)
47: ['.wa'] RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 44100 Hz
#audio types : 38
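The counts above could be gathered along these lines. This is only a sketch under assumptions: `summarize` is a hypothetical name, and the content description comes from whatever callable is passed in as `describe`. With python-magic installed, `magic.from_file` could be passed for that argument.

```python
from collections import Counter, defaultdict
from pathlib import Path


def summarize(paths, describe):
    """Count files per extension and per content description.

    `describe` is any callable that maps a path to a description string
    (an assumption of this sketch; python-magic's magic.from_file would fit).
    """
    ext_counts = Counter()         # extension -> number of files
    type_counts = Counter()        # description -> number of files
    type_exts = defaultdict(set)   # description -> extensions seen with it
    for path in paths:
        ext = Path(path).suffix.lower()
        desc = describe(path)
        ext_counts[ext] += 1
        type_counts[desc] += 1
        type_exts[desc].add(ext)
    return ext_counts, type_counts, type_exts
```

Keeping the set of extensions per description is what makes mismatches visible, such as one audio type that shows up under both '.m4p' and '.m4a'.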
Oddly enough, using this I found a bunch of ".mp3" and other extensions that were actually JPEGs internally. I double-checked and they were only 20K or less. I got rid of all of these.
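A check like that can also be done without python-magic for the JPEG case, since JPEG files start with the bytes FF D8 FF. The sketch below assumes that signature test is good enough; `find_jpegs_posing_as_audio` and the `AUDIO_EXTENSIONS` set are hypothetical names, and this is not how the app itself detects them.

```python
from pathlib import Path

JPEG_SIGNATURE = b"\xff\xd8\xff"  # first three bytes of a JPEG file
# hypothetical list of extensions to treat as audio
AUDIO_EXTENSIONS = {".mp3", ".m4a", ".wma", ".ogg", ".flac"}


def find_jpegs_posing_as_audio(root):
    """Return paths that have an audio extension but start with the JPEG signature."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        with open(path, "rb") as fp:
            header = fp.read(len(JPEG_SIGNATURE))
        if header == JPEG_SIGNATURE:
            hits.append(str(path))
    return hits
```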
I also found out that some audio types are not playable on my Ubuntu machine: '.wma', '.m4p'. Eventually I also found that some .m4a files were not playable by Rhythmbox, audio, or play apps. Over time, I moved those out of my ~/Music directory as well.
Phase 3
At this point, I still suspected I had duplicates. These files were in playable formats and had different content (for some reason), yet were still the same song. To find these, all I had left was the file name.
To match two file names as "close enough" I used BM25. This is a ranking function used by search engines to score how well a page (via its word content) matches the words used in a search.
In this case I gathered the word corpus for all file names and then went through them one by one looking for a high BM25 ranking.
I started with a fairly high threshold: if the score was 25.0 or above, then there was a high likelihood of a filename match.
if score[0] < 25.000:
    continue
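To show roughly what a BM25 score over file names involves, here is a small self-contained sketch. It is an illustration, not the app's code: the `tokenize` function and `BM25` class are hypothetical, the parameters k1=1.5 and b=0.75 are common defaults rather than values from this project, and the IDF uses the variant with 1 added inside the logarithm. Absolute scores depend on these choices and on the corpus, so the 25.0 threshold above will not necessarily carry over to this sketch.

```python
import math
import re
from collections import Counter


def tokenize(name):
    """Lower-case a file name and split it into words (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", name.lower())


class BM25:
    """Minimal Okapi BM25 over a non-empty list of tokenized documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.docs = [Counter(d) for d in docs]      # term frequencies per document
        self.lengths = [len(d) for d in docs]
        self.avg_len = sum(self.lengths) / len(docs)
        n_docs = len(docs)
        # document frequency: in how many documents each term occurs
        doc_freq = Counter(term for d in self.docs for term in d)
        self.idf = {
            term: math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)
            for term, n in doc_freq.items()
        }

    def score(self, query, index):
        """Score the document at `index` against a tokenized query."""
        tf = self.docs[index]
        norm = self.k1 * (1 - self.b + self.b * self.lengths[index] / self.avg_len)
        total = 0.0
        for term in query:
            freq = tf.get(term, 0)
            if freq:
                total += self.idf[term] * freq * (self.k1 + 1) / (freq + norm)
        return total
```

Each file name is tokenized and becomes one document. Scoring a name against every other name and keeping the scores above the threshold gives a ranked list like the one in the report.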
See out/report_filenames_similar for these kinds of matches:
---- paths with similar filenames:
<snip>
>> ~/Music-dups/02 - I Still Haven't Found What I'm Looking For.mp3
0] 40.814 ~/Music-dups/U2 - The Joshua Tree 2/02 - I Still Haven't Found What I'm Looking For.mp3
1] 40.814 ~/Music/U2 - Joshua Tree/02 - I Still Haven't Found What I'm Looking For.mp3
2] 39.596 ~/Music-drm/U2 - I Still Haven't Found What I'm Looking For.wma
----
<snip>
>> ~/Music/coolio - gangsta's paradise/17 - get up get down.mp3
0] 37.052 ~/Music/In Yo' Face - Vol1/06 - Get Up And Get Down.mp3
----
found 484 similar filenames
I found many duplicates using this report. I then lowered the score limit from 25.000 to 15.000 and then down to 8.000. This found more hits, but the lower the limit, the less likely it was that a hit was a real duplicate.
Also, I noticed that two files can have similar words in them, and that caused a high ranking. For example, see "get up get down" in the "gangsta's paradise" directory and "Get Up And Get Down" in the "In Yo' Face" directory. These are not the same song at all, but they do use common words.
As well, Jazz albums tend to have covers of the same song, but played differently, so they are not really the same song.
A lot of Classical music uses song names that are nearly the same except for a number or a key in them, e.g. "B Flat Major" vs "F Major":
>> ~/Music/Wendy Carlos - Switched-On Bach/04 - Two-Part Invention In B Flat Major.mp3
0] 27.038 ~/Music/Wendy Carlos - Switched-On Bach/03 - Two-Part Invention In F Major.mp3
----
Phase 4
I knew there was media content that gave album name, artist, track number, and song title inside the file. I figured that could help find any files that were badly named on the outside, but had good info on the inside.
I used the Python module TinyTag to get the media info.
See out/report_file_media_info.txt for detailed content of each file.
---- file media info:
>> --- act: Music/War - Unknown Album/04 - Four Cornered Room
>> ch2 exp: Music/War - World Is A Ghetto/04 - Four Cornered Room
album : The World Is A Ghetto
track : 4
title : Four Cornered Room
artist : War
<snip>
>> --- act: Music/War - Unknown Album/07 - Low Rider
>> ch1 exp: Music/War - Unknown Album/07 - LowRider
album : Unknown Album
track : 7
title : LowRider
artist : War
<snip>
>> --- act: Music/coolio - gangsta's paradise/17 - get up get down
>> ok exp: Music/coolio - gangsta's paradise/17 - get up get down
album : gangsta's paradise
track : 17
title : get up get down
artist : coolio
reported 3548 media info
The tags mean:
- ">> ok " indicates that the path matches "<artist - album>/<track - title>".
- ">> ch1 " indicates that the song name does not match "track - title" from the media info.
- ">> ch2 " indicates that the song name matches okay, but the directory does not match "artist - album".
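The three tags can be thought of as the result of a comparison like the one sketched here. This is an assumption-laden illustration: `classify` is a hypothetical function, it expects the path to include the file extension, it assumes track numbers are zero-padded to two digits in file names, and it does a strict string comparison (the real report appears to be more forgiving, e.g. about a leading "The" in album names). The tag values would come from something like `TinyTag.get(path)`.

```python
from pathlib import PurePath


def classify(path, artist, album, track, title):
    """Return "ok", "ch1" or "ch2" for a path compared against its media info."""
    actual = PurePath(path)
    actual_dir = actual.parent.name     # e.g. "War - Unknown Album"
    actual_song = actual.stem           # e.g. "07 - Low Rider"

    expected_dir = f"{artist} - {album}"
    expected_song = f"{int(track):02d} - {title}"  # assumes 2-digit track prefix

    if actual_song != expected_song:
        return "ch1"   # song name differs from "track - title"
    if actual_dir != expected_dir:
        return "ch2"   # song name is fine, directory differs from "artist - album"
    return "ok"
```

Checking the song name before the directory mirrors the tag definitions: "ch2" is only reported when the song name already matches.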
I updated many songs and directories to match the content of the media info. Note, I did find that some media info content was wrong, e.g. "title : LowRider" from the audio file itself is missing a space.
I have some directories "artist - Unknown Album" that can have one-off songs from many different albums for a given artist. And I have a catch-all directory "Various - Unknown Album" for any remaining one-off songs. That's handy, but there can be spurious warnings because of those directories.
On top of that, I found some media info was repetitive, just wrong, or very verbose and therefore useless:
album : Mambo No. 5 (A Little Bit of .
track : 1
title : Mambo No. 5 (A little bit of .
<snip>
title : Itchycoo Park (Mono Version) (2012 Remaster) 2012 Remaster
<snip>
album : Sultans Of Swing - The Very Best Of Dire Straits
track : 9
title : Money For Nothing (Album Version-Very Best Of...)
artist : Dire Straits
But even so, this did clean up the song titles and made it easier to search them for duplicates.
Phase 5
At this point, I knew I still had duplicates. The only recourse I had left was to search for "matching" filenames, where "matching" on the filename (no directory) was defined as:
- remove the leading track number "track - "
- "exact": character by character match of the file name (the track number may be different)
- "trim": filename with common version removed e.g. "Bonus Track", "Explicit", etc.
- "case": trimmed filename converted to lower case e.g. "The" vs "the"
- "spaces": trimmed, lower filename with spaces removed e.g. "sevennationarmy"
- "punc": trimmed, lower, spaceless filename with common punctuation removed e.g. "Hello It's Me"
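The levels above build on each other, which can be sketched as a chain of normalizations. This is a rough illustration only: `normalize`, `match_level`, the `VERSION_WORDS` tuple and the punctuation set are all hypothetical choices for this example, not the lists the app actually uses. Names are expected without directory and without extension.

```python
import re

# hypothetical list of "version" phrases to trim
VERSION_WORDS = ("Album Version", "Bonus Track", "Explicit")
LEVELS = ("exact", "trim", "case", "spaces", "punc")


def normalize(name):
    """Return the comparison key for each level, from strictest to loosest."""
    exact = re.sub(r"^\d+\s*-\s*", "", name)      # drop leading "track - "
    trim = exact
    for word in VERSION_WORDS:
        # remove the phrase, with optional surrounding brackets
        pattern = r"\s*[\(\[]?" + re.escape(word) + r"[\)\]]?"
        trim = re.sub(pattern, "", trim, flags=re.IGNORECASE)
    trim = trim.strip()
    case = trim.lower()
    spaces = case.replace(" ", "")
    punc = re.sub(r"[,.'!?\-()]", "", spaces)     # hypothetical punctuation set
    return {"exact": exact, "trim": trim, "case": case, "spaces": spaces, "punc": punc}


def match_level(name_a, name_b):
    """Return the strictest level at which two names match, or None."""
    keys_a, keys_b = normalize(name_a), normalize(name_b)
    for level in LEVELS:
        if keys_a[level] == keys_b[level]:
            return level
    return None
```

Reporting the strictest matching level is useful when reviewing hits: an "exact" match is more likely to be a real duplicate than one that only matches at "punc".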
See out/report_filename_matches for the report.
---- filename matches:
<snip>
exact: A Remark You Made:
prev: /home/arrizza/Music/Weather Report - Heavy Weather/02 - A Remark You Made.mp3
curr: /home/arrizza/Music/Weather Report - Jaco Years/03 - A Remark You Made.mp3
trim: Deacon Blues: "Deacon Blues"
prev: /home/arrizza/Music-dups/04 - Deacon Blues Album Version.mp3
curr: /home/arrizza/Music-dups/04 - Deacon Blues.mp3
case : Sunshine Of Your Love: "sunshine of your love"
prev: /home/arrizza/Music/Bobby McFerrin - Simple Pleasures/10 - Sunshine of Your Love.mp3
curr: /home/arrizza/Music/Eric Clapton - Cream Of Clapton/02 - Sunshine Of Your Love.mp3
spaces: Give Peace A Chance: "givepeaceachance"
prev: /home/arrizza/Music-dups/GivePeaceAChance.mp3
curr: /home/arrizza/Music/John Lennon - Rock 'N' Roll/11 - Give Peace A Chance.mp3
punc: Ruby, My Dear: "rubymydear"
prev: /home/arrizza/Music/Thelonious Monk - Best Of Thelonious Monk/02 - Ruby My Dear.mp3
curr: /home/arrizza/Music/Thelonious Monk - This is Jazz 5/04 - Ruby, My Dear.mp3
<snip>
reported 826 filename matches
Note that there can be two songs with the exact same name that are in fact different. Typically, one is a cover of the song by another artist, but they can also be entirely different songs for which both artists happened to choose the same name. In short, you have to confirm they are identical by playing them.
exact: Norwegian Wood (This Bird Has Flown):
prev: /home/arrizza/Music/Beatles - Unknown Album/13 - Norwegian Wood (This Bird Has Flown).mp3
curr: /home/arrizza/Music/Herbie Hancock - New Standard/03 - Norwegian Wood (This Bird Has Flown).mp3
exact: On The Run:
prev: /home/arrizza/Music/OMC - How Bizarre/01 - On The Run.mp3
curr: /home/arrizza/Music/Pink Floyd - Dark Side Of The Moon/02 - On The Run.mp3
Herbie Hancock's version is a Jazz cover of the Beatles' song. And OMC's and Pink Floyd's songs are completely different, but they chose the same name.
Finally...
After all of this, I ended up with 2400 or so files. Many of the duplicates had a simple cause: I must have temporarily created a new directory for an album and not cleaned it up when I was done. Many were multiple albums with the same songs. Some of these were remastered and sounded better, so I kept those; the rest were duplicates.
Quite a few were the same song, but in a different media format. I must have bought them multiple times without checking my current inventory - oops!