File hash database field

Post by **DeusExMachina** » Wed Feb 11, 2026 4:32 am

If kJams could calculate the hash of one file from the unzipped Audio+G pair and store that value in the database record, it could be used in place of the "Song ID", or at least paired with it.
The advantage: the tracks in a user's playlists are currently stored by the arbitrary (based on import order?) Song ID assigned by kJams. If playlists stored this hash as the ID instead (or paired with it), the identifier would remain constant even if the host has to rebuild or reimport their library. All users' playlists would remain intact, even if the library has to be reimported due to corruption of the original library, hard drive damage, loss, etc.
Using the hash instead of, say, the file name also insulates this from filename edits, both inside and outside the .zip file.
Hashing a file inside the zip, rather than the zip file itself, also protects this from file name changes, which would change the hash of the zip file.
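To illustrate the point about hashing the contents rather than the container, here is a minimal Python sketch. It is not kJams code; the function name `audio_hash_from_zip` and the extension list are assumptions made up for the example. It hashes the bytes of the audio entry inside the zip, so renaming the zip or the entry does not change the result.

```python
import hashlib
import zipfile

# Assumed list of audio extensions, for illustration only.
AUDIO_EXTENSIONS = (".mp3", ".m4a", ".wav")

def audio_hash_from_zip(zip_source):
    """Return the MD5 hex digest of the first audio entry in a zip, or None.

    zip_source may be a path or a file-like object. Only the entry's
    uncompressed bytes are hashed, never its name or the zip container.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.filename.lower().endswith(AUDIO_EXTENSIONS):
                digest = hashlib.md5()
                with archive.open(info) as member:
                    # Read in chunks so large tracks are not loaded at once.
                    for chunk in iter(lambda: member.read(65536), b""):
                        digest.update(chunk)
                return digest.hexdigest()
    return None
```

Two zips holding the same audio bytes under different entry names would produce the same value, which is the property the playlist ID needs.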

As a final alternative, a different unique identifier could be calculated on import and stored in the XML file, zipped as a file into the audio+G.zip file, or even just stored as a separate file in the Library, and this value used (or paired) as the new Song ID, and read on reimport if present.
Mixing these methods would also be possible, and would have its own advantages (protection from data loss, for example, by having the value pair database in multiple formats).

Alternatively, kJams could remain as is, with a new export function that generates these unique identifiers prior to any kind of migration. A matching import function would be aware of these hashes and employ them to recreate singers' playlists on reimport, pairing the hash of each imported file with the original song identifier used by kJams in the previous library.
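The export/import pairing described above could look roughly like this in Python. Everything here is hypothetical: the function name `remap_playlists`, the dictionary shapes, and the idea that playlists are simple lists of Song IDs are assumptions for the sketch, not how kJams stores anything.

```python
def remap_playlists(playlists, old_id_to_hash, hash_to_new_id):
    """Translate playlists from old Song IDs to new ones via file hashes.

    playlists:       {singer_name: [old_song_id, ...]}
    old_id_to_hash:  {old_song_id: file_hash}  (written at export time)
    hash_to_new_id:  {file_hash: new_song_id}  (built during reimport)

    Returns (remapped_playlists, missing), where missing lists the
    (singer, old_song_id) pairs that could not be matched.
    """
    remapped = {}
    missing = []
    for singer, song_ids in playlists.items():
        new_ids = []
        for old_id in song_ids:
            file_hash = old_id_to_hash.get(old_id)
            new_id = hash_to_new_id.get(file_hash)
            if new_id is None:
                # Track was not exported, or not found in the new library.
                missing.append((singer, old_id))
            else:
                new_ids.append(new_id)
        remapped[singer] = new_ids
    return remapped, missing
```

Reporting the unmatched entries instead of silently dropping them lets the host see which tracks changed or went missing between libraries.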

Best of all, a standard MD5 hash function is built into both Windows and macOS. It generates a small, 128-bit value (32 hexadecimal characters), which is sufficient for this non-security, file-identifier function. It would add a negligible amount to library size, maybe 5MB for a 50,000-track MP3+G library, and it is blazingly fast on even marginally modern hardware, adding at most a few seconds to an export of even thousands of tracks.

Post by **dave** » Wed Feb 11, 2026 2:19 pm

what would you use for the hash input?
can't use the path, or the song name/album/artist....
possibly could use bytes in files, but that's not guaranteed to be unique. some songs are zipped, some are not, eg: mp4, and byte size isn't a reliable marker for uniqueness.
don't get me wrong this is a good idea, just that HOW to implement it remains elusive

Post by **dave** » Wed Feb 11, 2026 3:14 pm

oh! i just re-read my reply above.

i said "use bytes in files", and what I meant by that at the time was use the number of Bytes. Like if there are 2,157,329 BYTES in a file, use that number as input for the hash.

But then when I re-read it, I thought "oh that could be ambiguous, it could also mean actually read some of the bytes (content) out of the file and use that as hash input". And that actually would be very unique! We could just take the first 1K of the file and hash that! And that would be pretty unique! Oh my God, this is a really great idea.

-dave

Post by **DeusExMachina** » Wed Feb 11, 2026 3:23 pm

The entire file. Using the md5sum command in a Bash terminal, it can generate a 128-bit hash from a 4MB MP3 data file in 0.01 seconds (similar speeds in Windows with "Get-FileHash" in PowerShell), so it can process an entire half-terabyte library in like 15 minutes, plus file zipping and rewriting overhead.
Well worth the value it affords in making a library truly recoverable (from a singer's playlist perspective) and reimportable!
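For comparison with the command-line tools, a whole-file MD5 can be sketched in a few lines of Python using the standard hashlib module. The function name `md5_of_file` and the 1 MiB chunk size are arbitrary choices for this example.

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest (32 characters) of an entire file.

    The file is read in chunks so memory use stays flat regardless of
    how large the track is.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result should match what md5sum prints for the same file, since both hash the raw file bytes.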
Start from scratch and retain all your singers’ data!

(You don't want to use the file name or the zip itself because that will change if you edit metadata. But the hash of the MP3 file itself should remain unchanged. Also, I figure it's more likely someone will use Producer to fix errors in the CDG than will use something like Fission to alter an MP3, but if they do, oh well, that's the price they pay for whatever it is they're doing to that audio!)

Update: apparently it is possible to hash just the audio data portion of the file, ignoring the ID3 tag header/footer of the file. If kJams got even fancier and did it this way, the unique hash identifier would even survive metadata editing that affected the file contents, including things like some kinds of level normalization!!!

Post by **dave** » Wed Feb 11, 2026 4:44 pm

note that MP3 DOES change the file data when you edit meta data, but luckily it does that at the end of the file.
so i do think that reading just the first 1k of a file is sufficient?
note this is using classic code determinism.

a better approach would be something you hinted at: audio fingerprint. like what Shazam does to sample live music and match it to meta data. but that's a bit more intensive.

i think we can get 95% of the way there by hashing the first 1k of the audio / video file, plus the entire CDG if there is one
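A rough Python sketch of that "first 1k plus the whole CDG" idea follows. The name `song_fingerprint` and the constant `PREFIX_BYTES` are invented for the example, and it assumes the audio and CDG are already available as separate files on disk. Note that it hashes whatever sits in the first 1K of the audio file, including any tag bytes stored there.

```python
import hashlib

PREFIX_BYTES = 1024  # "first 1k" of the audio / video file

def song_fingerprint(audio_path, cdg_path=None):
    """Hash the first 1K of the audio file plus the entire CDG, if any."""
    digest = hashlib.md5()
    with open(audio_path, "rb") as audio:
        # Only the leading bytes are read; the rest of the file is ignored.
        digest.update(audio.read(PREFIX_BYTES))
    if cdg_path is not None:
        with open(cdg_path, "rb") as cdg:
            for chunk in iter(lambda: cdg.read(65536), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Because only a prefix of the audio is read, changes near the end of the audio file do not affect the value, while any change to the CDG does.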

Post by **DeusExMachina** » Wed Feb 11, 2026 5:08 pm

You know better than me. I thought the ID3 portion was often at both the beginning and the end, requiring an offset.
I wrote more about this on the Trello (with code).
The problem I see with using audio fingerprints is if two karaoke versions are very similar. Especially versions with duets or backing vocals that don't kick in until later in the song.
I am doing more research on how to isolate just the audio portion and pipe that to the hash function, or, alternatively, how to determine the size of the ID3 tag header and use that as the offset for the hash function.

Post by **DeusExMachina** » Wed Feb 11, 2026 5:09 pm

I'm just worried that hashing the CDG will break if someone uses Producer to fix, alter, personalize, etc., the file.

Post by **dave** » Thu Feb 12, 2026 1:08 am

i think if you're using producer you're one of the 5% that gets lost in the cracks?
CDG pretty reliably never changes. UNLESS you're using producer. hmmm. well we can DETECT producer files, because kJams, you know, PRODUCES them, so ... yeah. we could reliably SKIP hashing.

okay so, maybe we take further discussion to email or something?

-dave

Post by **DeusExMachina** » Thu Feb 12, 2026 2:43 am

I posted a bunch of stuff, including pseudo code, to the Trello in comments.
Doing the MP3 rather than the CDG is very straightforward since the offset data is stored in the ID3 header and footer.
And this will also have to address the separate methods that may be necessary for other file types (WAV, AIFF, FLAC, MP4, AAC, etc.), assuming kJams even pushes metadata into these file types. If kJams relies on the .xml file alone for metadata, then, of course, the entire file can just be hashed, making them actually easier than MP3s.

Is e-mail better than Trello?

Post by **dave** » Thu Feb 26, 2026 1:57 pm

trello is best.

thanks. noted.

kJams by Slithy Toves Media
