Taking the FLAC

One of the criticisms often made of MAME / MESS’s CHD format is that it doesn’t actually provide very efficient compression, especially when it comes to CD AUDIO data. I’ve had a number of people ask me if I can look into improving this, especially when you consider that with the current format a complete Saturn set is almost 1TB, with a large portion of that being AUDIO.

It’s inefficient because it uses zlib’s deflate algorithm on the blocks, which are kept rather small to ensure that data is decompressed quickly. While this is fine for DATA (it’s the same thing that ZIP files use) it’s absolutely hopeless for AUDIO.
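As a rough illustration of why a general-purpose compressor struggles with PCM audio, the following Python sketch deflates a synthetic tone one small hunk at a time, first as raw samples and then after a simple straight-line prediction step. The signal, the hunk size and the `second_difference` predictor are all assumptions made for the example; FLAC's real prediction and residual coding are considerably more sophisticated.

```python
import math
import random
import struct
import zlib

HUNK_SAMPLES = (4 * 2352) // 2   # one illustrative hunk of 16-bit mono samples
RATE = 44100

# Synthetic "audio": two low tones plus a little deterministic noise.
rng = random.Random(0)
pcm = [
    int(6000 * math.sin(2 * math.pi * 220 * n / RATE)
        + 3000 * math.sin(2 * math.pi * 330 * n / RATE))
    + rng.randint(-2, 2)
    for n in range(RATE)
]


def second_difference(samples):
    """Replace each sample with its error against a straight-line prediction."""
    out, prev1, prev2 = [], 0, 0
    for s in samples:
        out.append(s - (2 * prev1 - prev2))
        prev2, prev1 = prev1, s
    return out


def zlib_size_per_hunk(samples, transform=None):
    """Total compressed size when every hunk is deflated on its own."""
    total = 0
    for start in range(0, len(samples), HUNK_SAMPLES):
        hunk = samples[start:start + HUNK_SAMPLES]
        if transform is not None:
            hunk = transform(hunk)
        packed = struct.pack("<%dh" % len(hunk), *hunk)
        total += len(zlib.compress(packed, 9))
    return total


raw_bytes = 2 * len(pcm)
plain = zlib_size_per_hunk(pcm)
predicted = zlib_size_per_hunk(pcm, second_difference)
print("deflate only:         %.1f%% of original" % (100.0 * plain / raw_bytes))
print("predict then deflate: %.1f%% of original" % (100.0 * predicted / raw_bytes))
```

The raw samples look close to random at the byte level, so deflate finds little to match, while the prediction errors are small numbers that compress far more readily.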

There are dedicated lossless audio compressors out there; FLAC is a popular one.

I’ve spent the last 4-5 days solid integrating support for this into the MAME / MESS tree, and extending the CHD format to support not only its native blocks (hunks), but also references to embedded streams via ‘virtual hunks’ which point at a stream and allow the actual FLAC codec to do the seeking and decoding work.

By doing this I can achieve a good level of compression with FLAC, far better than trying to split it into CHD hunks, due to the lower overhead and the improved ability of the compression algorithms to predict how data best compresses. I also still get good decoding speed, as the FLAC format is designed to be quick to seek, and has built-in seektable support of its own, which I’m leveraging.

I have to say FLAC is an absolute joy to work with: the API does everything you can expect, the documentation is great, and it’s very good at letting you know if something is wrong. (The only issue I had with the documentation / API was with the seektables, whereby calling things in the wrong order or at the wrong time during encoding could cause data to be overwritten without throwing an error.)

I’ve also added support to the MAME SAMPLE interface to play back files from FLAC sources. This should allow the recently dumped tape loops to be compressed much better than they are now (they’re uncompressed PCM .wav files).

The other possibilities for this are endless: -wavwrite could also output FLAC data if support was added, and MESS could potentially load cassette based software from FLAC images. It’s an incredibly useful codec to have around.

I’ve uploaded my first pass of this code Here (link offline for the time being, there is definitely still an error). This should be considered ALPHA SOFTWARE and I won’t be held responsible if you end up destroying your CHDs with it. I’m currently in the process of batch converting many images and haven’t found a broken case yet, but still, it’s in testing. While I’m happy with the current format extensions and CHD format created it could change in a final version, you have been warned.

This code has been submitted to R.Belmont, who is currently making some portability fixes. FLAC is designed to be portable, so this shouldn’t be too much of a problem, so fingers crossed it can be sorted out soon.

Usage is simple, I’ve added an additional -createcdflac commandline option which will use the FLAC routines when compressing AUDIO. If you already know how to use CHDMAN then it’s simple enough.

Have fun :-)

 

70 Responses

  1. Haze says:

    and if new CHDs decompressing with older builds is such a big problem then just call the compression format with the half block markers something else, so you get an ‘unsupported’ warning if you try and use such a CHD on an old version.

    I’d say a small workaround, for a problem which I didn’t even create, is far less of an issue than changing the checksums for everything. It’s storing the exact same ‘real’ data, the checksum should not change.

  2. Haze says:

    or, if the FLAC mode isn’t used for a CHD, just switch the hunk size back down, that would require a variable rather than a define on the hunk size, but it would give you full compatibility still with older versions as long as FLAC was turned off.

    there are multiple ‘good’ solutions to this.

  3. Haze says:

    (and that’s the other thing I can’t understand, if the half hunk padding checksum thing is what Aaron is taking offense to, why is it the only thing he’s left enabled, ie, by not reverting the hunk size)

    The newly created (non-flac) files work fine with older versions anyway, they just won’t verify as a result of the bigger hunk size, but again, that’s not my flaw in the first place and is a far lesser evil than breaking compatibility completely IMHO.

  4. ben says:

    Can’t you talk to him to find out?

    FLAC is a drop-in alternative to zlib that happens to work well on different types of data. Any argument for increasing the block size to improve compression applies just as well to zlib as to FLAC. Combining these two unrelated things in one patch — adding FLAC and bumping the block size — was probably a mistake. Especially since the benefit from FLAC is huge, up to 40-50%, while the benefit from the larger block size is a few measly percentage points at most.

  5. Haze says:

    The existing block size was too small for FLAC to be effective (it actually failed to effectively compress most blocks) so had to be boosted. (Optimal is actually double what I’m currently using)

    There were actually good benefits to boosting it with zlib as well tho, yes.

    The problem was, in the infinite wisdom of whoever wrote the original CD code, the CD data/audio tracks were padded to the hunk size used (with code to ignore the padding in the CD code). In the even further infinite wisdom, the padding data was checksummed, meaning if you change the hunk size, you change the data checksum, which is a terrible thing, because you’re actually still representing the same data.

    Now, in an ideal world, that would have never happened, but, it did, and all the existing lists have been built around the CRCs you get with padded data for the old hunk size. Therefore, I had to deal with it, _without_ invalidating all those lists. There were a number of possible approaches; I simply chose to mark the last block so that it knew not to checksum the excess padding and the CRC remained equal. The alternative would be to only compress partial data (padded to the old hunk size), and check if the data being decompressed decompressed to less than the expected hunk size. Note, this applies to both the zlib compressed hunks and the FLAC ones, which is why Aaron’s simple disabling of the FLAC bit makes no sense, it was done to support the larger hunks, not FLAC.

    I consider the problem to be a bug in the original CHDMAN implementation, but short of breaking compatibility with the existing lists (forcing all the CRCs to change, and old CHDs failing to identify because their CRCs are no longer listed) I had to pick a suitable workaround.
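The padding problem and the workaround described in this comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hunk sizes, the use of CRC32, and the function names `crc_all_hunks` and `crc_legacy_compatible` are hypothetical and are not taken from the CHDMAN source.

```python
import zlib

OLD_HUNK = 4 * 2352    # assumed "old" hunk size, for illustration only
NEW_HUNK = 16 * 2352   # assumed larger hunk size, for illustration only


def pad_to(data, hunk_size):
    """Zero-pad data up to the next multiple of hunk_size."""
    return data + bytes(-len(data) % hunk_size)


def crc_all_hunks(data, hunk_size):
    """Checksum every hunk in full, padding included (the original behaviour)."""
    return zlib.crc32(pad_to(data, hunk_size))


def crc_legacy_compatible(data, hunk_size, legacy_hunk_size):
    """Checksum hunk by hunk, but stop where the legacy padding would have ended."""
    padded = pad_to(data, hunk_size)
    legacy_len = len(pad_to(data, legacy_hunk_size))
    crc = 0
    for offset in range(0, len(padded), hunk_size):
        chunk = padded[offset:offset + hunk_size]
        keep = max(0, min(len(chunk), legacy_len - offset))
        crc = zlib.crc32(chunk[:keep], crc)   # excess padding is skipped
    return crc


# Five sectors of made-up track data: not a multiple of either hunk size.
track = bytes(i % 251 for i in range(5 * 2352))

print(hex(crc_all_hunks(track, OLD_HUNK)))
print(hex(crc_all_hunks(track, NEW_HUNK)))
print(hex(crc_legacy_compatible(track, NEW_HUNK, OLD_HUNK)))
```

The first two values differ even though the track data is identical, while the third matches the first because the extra padding introduced by the larger hunk never reaches the checksum.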

    I know the checksums will change eventually, once more ‘raw’ scrambled CD dumps are used, but that will happen slowly, one game at a time, if ever, but the checksums will be changing because we’re actually representing different data, not due to a silly padding issue. Hopefully when support is added for multisession discs we don’t find other similar issues :/

    Adding the FLAC support was meant to be a nice user friendly option, which people could make use of if they desired. Not something forced upon people.

    If the change becomes too aggressive, not just ‘convert/use if you want’ I think the public reception will be more negative.

    It’s out of my hands now anyway, but I fear for the worst given MAME’s general ‘F**k You, Deal with it’ attitude towards both code and users. Coders might tolerate it, forced changes to all the code standards, interfaces, names, the code changing all the time, but at least it’s usually clear why. Users won’t tolerate being messed about like that. Hopefully a good balance can be found, and as much as I hate to say it, maybe listing both the legacy and ‘new’ CRCs if they do change (for -ident purposes) would be the most user friendly way to go (and allow getting rid of the initial padding problem for good while retaining the ability to properly identify things) but IMHO is uglier than the solution I came up with.

    As I’ve said tho, I can but watch now, I’ve given my input on the issues, given my code, the changes from this point forward are down to the rest of Mamedev, but if they’re expecting all the existing softlists and CRCs to be updated they’re going to have to do that themselves too, I don’t even have the material anymore to help there. I’m done with that side of things now, and I’m going to look at improving the sample interface instead so that it supports more than single channel wavs (the tape loops are stereo..) and also so that it doesn’t expect the file extensions hardcoded in the drivers.

    I guess I’m also not appreciating some of the aggression being thrown my way, as if it was my fault. I’ve said for years I’m not sure the CHDCD standard is currently good enough, but been told basically I was rude and disrespectful for saying that. Now I’ve made a few (large) software lists using CHDs I’m being told I’m stupid and wrong for making those, because the CHDCD isn’t currently good enough. Welcome to trying to work with Mamedev…..

  6. Haze says:

    Right.. I’m told now that the CRCs / SHA1s will not be changing, that’s good news at least.

    The block size will also be bumped back down for the time being, which makes more sense until the FLAC stuff is turned on for real.

  7. ben says:

    FLAC failed to effectively compress blocks of 2352*4 bytes? That makes no sense, given how FLAC works, and it doesn’t happen in my tests.

    Q&D compression ratio test of Ana Ng by TMBG:

      2352-byte blocks: 63.8%
      4704-byte blocks: 62.8%
      9408-byte blocks: 62.5%
      18816-byte blocks: 62.6%
      37632-byte blocks: 63.1%
      flac.exe -8 -P0: 62.6%

    Four-sector blocks are the best for this song, better even than the single-FLAC-stream approach you originally tried.

    Looking at your code I do see one bug: you’re passing the block size in bytes to FLAC__stream_encoder_set_blocksize, instead of the size in samples. When I tried that it worsened the compression by about 3%, so it should be fixed, but it’s not enough to explain your problem. The code is so messy (sorry, but it’s true) that there could be other bugs in there. Maybe you accidentally compressed every sample twice when testing the smaller block size, or something like that.

  8. Haze says:

    the old hunk size is 4 * 2352 bytes, and at least on the track I was testing with at the time that didn’t seem to produce good results at all… even just using 8 * 2352 instead of 16*2352 is costing a good 4gb across the PCE set. Maybe it just wasn’t a good test case I was using, or I had another error at the time. Either way, the smaller hunk sizes don’t give as good results, and are less desirable.

    I can’t say I know why, the initial tests done there were just part of my feasibility study, to see if the idea was going to work at all, they weren’t even based in the MAME code, but a hacked up version of the standalone encoder / decoder, which I could easily have misunderstood / made a mistake somewhere with.

    And yeah, you’re right, block size should be in samples, I misread that as bytes somewhere along the line. That’s an easy enough fix, surprised it has an impact tho, but I guess it must set up some default assumptions based on it. Could even explain the 4gb loss I’m seeing with the lower size anyway.

    I wouldn’t really call the code that messy, I’ve had to work with far worse from other projects, including the stuff in the actual FLAC library which is horrendous in places (the stream_decoder stuff is just plain spaghetti code in places, and even RB who fixed it up to compile / link on Linux found some of it questionable). Maybe it’s a bit overly verbose (or at least was) but the majority is just based on the ‘this is how you use FLAC’ examples anyway. I know some of the memory copying could be stripped out, but again I was getting the functionality there before optimizing things and risking breaking it. Personal opinion I guess, I’m used to working out how things work, and making them work, and ensuring the code that makes them work is as straightforward as possible for if somebody else wants to pick it up and clean it up, IMHO it fills that criteria.

    Personally I’d rather stick to emulation tasks than core work, but getting things done these days seems to require a hands on approach, even outside of the drivers. Given the constant code churn and change in MAME as a whole, I’ve no idea what the expectations are, people seem to have made a living out of rewriting what’s already there, so they’re welcome to update the code if they don’t like it from a presentation point of view, heck I even did the initial .mak file to be in its own sub-folder because I thought that was the new standard, turns out it wasn’t….. It’s not necessarily an arrangement I dislike, I enjoy figuring things out, putting them in code, and having them there for future generations, giving the functionality.. if others want to alter the form that complements what I’m doing nicely. I think one of the big problems MAME has right now tho is that there are too many major players *only* caring about the form, and doing absolutely nothing in terms of the functionality, hence why I’m having to work in areas of the core in the first place.

    Ultimately MAME will be judged by users on what it does. If we can get the likes of Raiden 2 or Space Lords working it will mean a lot more to people, and add a lot more value to the project (how they work is then written in stone) than changes which might allow running 2 copies of Pacman side-by-side, because even if that would be a great achievement it adds little value unless you want to open an arcade with a single MAME box driving 16 Pacman cabinets or something ;-)

  9. etabeta says:

    Haze, no offense meant, but the “F**k You, Deal with it attitude” has never had anything to do with the FLAC code submission and inclusion
    I had already written that Aaron was trying not to change CRC to make life easier to the users even before you started complaining, so it’s not that he changed his plans because of your comments

    my original point was: if there is any specific detail of your implementation that you fear it gets lost, it would be more effective if you drop a line about it to Arbee or Aaron, instead of commenting about it here.

    no more. no less.

    p.s. and updates to xml lists are better sent by mail to me, than linked in the shoutbox, where usually they get scrolled away pretty fast ;)

  10. Haze says:

    No, I’m saying the “F**k You, Deal with it attitude” is more one of MAME in general*, and if it ended up being applied here it would piss a lot of people off. Again tho, I’m getting mixed messages from different people, on one hand I’m being told the CRCs will *definitely* change, on the other I’m told they won’t. It’s incredibly frustrating.

    It was bad enough last time all the CHDs changed checksums, but then it was needed, because the old design was inherently insecure (no metadata checksumming, so you could produce broken images which said they were the expected one)

    As for email, I didn’t have your address handy, I’ve closed the email account I usually use because of repeated hack attempts / spam, and your PM box was full.

    * and it’s not ALWAYS a bad thing, but when you’re talking about something people have reservations over anyway, such as the CHDs, it’s best to tread carefully.

  11. dave says:

    I was able to get the createcdflac working again (it was just commented out), and ran a little test on my CHD collection. After making copies of all CHDs that had an audio track, I extracted and flac created them. For 33 items I went from 9.4GB down to 9.1GB. Not huge but still some savings. Given I don’t have a full set, and I don’t have any of the “dance” ones (I believe those are cds), which I believe would compress quite well, it shows the potential of the flac stuff. I’m pretty sure I was using the larger chunk size (though not positive on how to check that), and I didn’t encounter any errors. The hardest bit was creating a script to locate/duplicate/process the CHDs and it was pretty good return for the effort.

  12. Haze says:

    The non-digital dance ones would probably show a good saving, yes, the digital ones are already MPEG on the CD so wouldn’t.

    The PCE set, as I’ve said, is where you see the biggest benefit because most of the games are tiny data tracks + audio.

    A few things need fixing, as pointed out (mainly a /4 on the blocksize passed to the encoder, because it’s meant to be in samples, not bytes) These will further improve compression, and if Aaron hasn’t modified those once he’s done with it then I’ll send a quick patch to do so.
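The "/4" mentioned here follows from CD audio being 16-bit stereo: each sample across both channels occupies 4 bytes, and FLAC's block size counts samples per channel rather than bytes. A minimal Python sketch of the conversion, with a hypothetical helper name `blocksize_in_samples`:

```python
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 2              # stereo
CD_SECTOR_BYTES = 2352    # audio payload of one CD sector


def blocksize_in_samples(audio_bytes):
    """Convert a byte count of 16-bit stereo PCM into samples per channel."""
    bytes_per_frame = BYTES_PER_SAMPLE * CHANNELS   # 4 bytes per stereo sample
    if audio_bytes % bytes_per_frame:
        raise ValueError("not a whole number of stereo samples")
    return audio_bytes // bytes_per_frame


print(blocksize_in_samples(CD_SECTOR_BYTES))        # one sector
print(blocksize_in_samples(4 * CD_SECTOR_BYTES))    # four sectors
```

Passing the byte count instead would ask the encoder for a block four times larger than the audio actually available in the hunk.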

  13. Haze says:

    fwiw with the encoder blocksize fixed (samples, not bytes) the test case I’m using ended up 2meg smaller with the smaller hunk size, rather than significantly bigger.. so yeah, that was a pretty nasty bug :-) It effectively renders the small hunk size currently used to be better, at least on audio tracks…

  14. me says:

    So now there is more than one compression type does it dynamically try more than one or is it the same across the whole archive?

  15. dave says:

    I just ran the updated chdman over my PCE CD stuff (not a complete set) and I saw a huge savings, in the 6 GB range. I reran my chdman flac convert script with the updated chdman over my mame chds and in total they came out 100MB larger than last time, which is kind of strange.

  16. Haze says:

    which version? Maybe the larger block size on zip is actually worse for stuff in the MAME set, in which case sticking with the current size makes more sense.

    the *latest* code should be the best, old block size, with bug fix described in previous posts.

    should never come out larger with the same block size tho, it attempts both types of compression and picks the best one, so if the old zip blocks were smaller, those get used.

    (or are you talking about old flac code with larger blocks vs. new flac code with smaller?)
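The "attempts both types of compression and picks the best one" behaviour described in this comment can be sketched as a per-hunk selection loop. In this Python illustration the codec table, the tag names and `compress_hunk` are all hypothetical, and a byte-wise delta followed by deflate stands in for FLAC, which is not available in the standard library.

```python
import zlib


def encode_zlib(hunk):
    """Plain deflate, as used for ordinary data hunks."""
    return zlib.compress(hunk, 9)


def encode_delta_zlib(hunk):
    """Stand-in for an audio codec: byte-wise delta, then deflate."""
    delta = bytearray(len(hunk))
    prev = 0
    for i, value in enumerate(hunk):
        delta[i] = (value - prev) & 0xFF
        prev = value
    return zlib.compress(bytes(delta), 9)


CODECS = {"zlib": encode_zlib, "delta+zlib": encode_delta_zlib}


def compress_hunk(hunk):
    """Try every codec and keep the smallest result, tagged with its name."""
    candidates = [(tag, encode(hunk)) for tag, encode in CODECS.items()]
    return min(candidates, key=lambda item: len(item[1]))


hunks = [
    b"plain data track contents " * 200,
    bytes((i * 3) & 0xFF for i in range(4 * 2352)),
]
for hunk in hunks:
    tag, payload = compress_hunk(hunk)
    print(tag, len(payload))
```

Because the plain zlib result is always one of the candidates, a hunk compressed this way can never come out larger than it would with zlib alone at the same hunk size.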

  17. dave says:

    So I did a git clone today of mame to make sure I had the most up-to-date code (proper blocksize and smaller hunk size) and none of my changes were causing an issue. I turned the createcdflac on, and ran my conversion script over the mame chds I have that also have audio tracks. These I will call “new FLACs” (hunk size of 9792). I also have my FLACs from back on 1/30, which have a larger hunk size (19584) but also with the improper blocksize (i.e. using bytes instead of samples), and those will be “old FLACs”.

    The old FLACs as a whole were smaller than the new FLACs (in total they were 100MB smaller). The new FLACs were the same size as the original CHDs in most cases (some of the new FLACs were 4 to 12 kilobytes smaller than the original CHDs, and I am lumping those in as being “the same size”). Of 33 CHDs only a few (3 or 4) had significant size savings with either the new or old FLACs.

    So it seems that the larger hunk size, even with improper block size, had the better compression. I guess my next step is to see if I can figure out where to change the hunk size in the code up to the larger value, and then re-run the conversion script. In that way the blocksize will be correct, and the hunk size will be “better” (at least from my testing so far).

    Anyway, interesting stuff (I’m finding it fascinating at least), and keep up the good work.

  18. Haze says:

    Hmm.. I guess I should look at the MAME CHDs.

    It’s possible they were ripped with audio sub-data (which isn’t a bad thing) but given the majority of images used in MESS weren’t, I fall back to zip mode for hunks with sub-data. If you’re getting sizes close to the original CHDs that seems the most likely explanation. If that’s the case I can shuffle the sub-data to the end when encoding instead of not bothering with tracks containing it (encoding the blank data inline throws off the algorithm too much in normal no-subdata cases). I could also encode the sub-data part with zlib as a sort of hybrid hunk.
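The idea of shuffling the sub-data to the end of a hunk can be sketched as follows. The layout is an assumption for illustration: each stored frame is taken to be 2352 bytes of audio followed by 96 bytes of sub-data (2448 bytes in total, which fits the 9792-byte hunks of four frames mentioned earlier in the thread), and `split_hunk` and `join_hunk` are hypothetical names.

```python
AUDIO_BYTES = 2352                           # audio payload per frame
SUBCODE_BYTES = 96                           # sub-data per frame (assumed layout)
FRAME_BYTES = AUDIO_BYTES + SUBCODE_BYTES    # 2448 bytes per stored frame


def split_hunk(hunk):
    """Gather all audio first and all sub-data last, so each can be coded apart."""
    if len(hunk) % FRAME_BYTES:
        raise ValueError("hunk is not a whole number of frames")
    audio, sub = bytearray(), bytearray()
    for offset in range(0, len(hunk), FRAME_BYTES):
        frame = hunk[offset:offset + FRAME_BYTES]
        audio += frame[:AUDIO_BYTES]
        sub += frame[AUDIO_BYTES:]
    return bytes(audio), bytes(sub)


def join_hunk(audio, sub):
    """Inverse of split_hunk: re-interleave audio and sub-data frame by frame."""
    frames = len(audio) // AUDIO_BYTES
    out = bytearray()
    for i in range(frames):
        out += audio[i * AUDIO_BYTES:(i + 1) * AUDIO_BYTES]
        out += sub[i * SUBCODE_BYTES:(i + 1) * SUBCODE_BYTES]
    return bytes(out)


# Four frames of placeholder content: 0x11 for audio, 0x22 for sub-data.
hunk = (b"\x11" * AUDIO_BYTES + b"\x22" * SUBCODE_BYTES) * 4
audio, sub = split_hunk(hunk)
print(len(hunk), len(audio), len(sub))
```

With the two parts separated, the audio portion could go to the audio codec and the sub-data portion to zlib, which is the "hybrid hunk" arrangement floated above.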

    I won’t lie, it was developed more with MESS in mind than MAME, because MESS is where you really have huge numbers of CD based games. I don’t actually keep a complete set of the CHDs needed for MAME around, but I’ll take a look at some point.

    The larger block size does work better on (some) zip data at least, which is probably where your 100meg is coming from.

    It’s in Aaron’s hands now tho… so without permission to make further tweaks to my code there isn’t actually much I can do until it gets enabled officially, at which point it might be too late to make further improvements.

  19. dave says:

    And that would be the answer. I just checked (with chdman -info) the three MAME CHDs that had a file size savings with “new FLAC” (as referenced in my earlier post), and all of them had SUBTYPE:NONE. I did a random check of three MAME CHDs that did not have a file savings with “new FLAC” and all had SUBTYPE:RW_RAW for their tracks.

    Nonetheless the savings dealing with PCE CDs have been significant, and greatly appreciated, and I assume the same can be said of numerous other MESS CD based systems.