Spectral flatness calculation not working at higher sample rates?

As before, posting this here as I don’t know if this is a meatspace error or an algorithm error, but when working on an update to SP-Tools that will play nice with arbitrary sample rates, I’ve run into a weird issue where some descriptors report radically different values at different sample rates.

In order to test things I’ve resampled jongly.aif to 176.4k such that I should be able to run the respective analysis with 4x the @fftsize and @window / @hop size and expect (largely) the same results.

This is more-or-less the case for most of the descriptors, but spectral flatness, in particular, is massively off.

For jongly.aif @ 44.1k / @fftsettings 1024 256 1024 I get the following as centroid, derivative of centroid, flatness, and derivative of flatness:
67.937912 0.143962 -41.245346 0.159988

For jongly.aif @ 176.4k / @fftsettings 4096 1024 4096 I get:
67.919655 0.142653 -114.411133 0.009569

The centroid isn’t exactly the same, but I can chalk that up to less than optimal upsampling (I used Audacity in a pinch), but the flatness is ballparks off, as is its derivative.

Here’s an example patch: (along with the 4x jongly)


jongly_4x.wav.zip (874.7 KB)

I’m on version 1.0.5+sha.9db1bdd.core.sha.001df55a on Max 8.5.3.

So is there something weird in the math/code, or am I overlooking something here?

p.s. I only had a quick look at the other spectral moments and they looked alright, but it’s possible others are off too. I just have spectral flatness as one of my “main” descriptor types, so I see it at the bottom of my patches.

Actually this makes sense I think. @a.harker or @weefuzzy will confirm, but your spectrum is no longer flat at all…

Think of a 44k white noise as full spectrum block


if you upsample, you are not filling the spectrum, you are just smoothing. you don’t add. so if you go 4x, your block of samples will now all be in the bottom quarter. so you went from




Do you see how unflat that is?
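To put numbers on that point, here’s a minimal numpy sketch (a hypothetical `flatness_db` helper for illustration, not the FluCoMa implementation): flatness is the ratio of the geometric to the arithmetic mean of the power spectrum, so a spectrum whose energy all sits in the bottom quarter of the bins scores drastically lower than a truly full-band one.

```python
import numpy as np

def flatness_db(mags):
    """Spectral flatness: geometric mean / arithmetic mean of power, in dB."""
    power = np.asarray(mags, dtype=float) ** 2
    geo = np.exp(np.mean(np.log(power)))
    arith = np.mean(power)
    return 10 * np.log10(geo / arith)

n_bins = 512
full = np.ones(n_bins)              # white up to Nyquist: perfectly flat
upsampled = np.full(n_bins, 1e-6)   # near-silent top three quarters...
upsampled[: n_bins // 4] = 1.0      # ...energy only in the bottom quarter

print(flatness_db(full))       # 0 dB: perfectly flat
print(flatness_db(upsampled))  # around -84 dB: very unflat
```

Same signal content, but measured against a 4x-wider (mostly empty) frequency range, the flatness collapses.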


so to test if it is a real bug and if I’m right, try flatness with 192k white noise (not cut at 22k but at its 96k Nyquist)

that should give you what you actually ask for.

I’d need to do some testing to see whether I fully agree with the logic here, but indeed the point at which the resampling filter kicks in is very much not flat, and that may just be biasing the whole thing. A more correct approach is to start with audio in the higher SR and downsample - if the values calculated in each case then change radically, there might be an issue.

In general I would not see the reason for the difference at different SRs as being the SRC either - to my mind it is more realistic to see such descriptors as estimates: numerical error and the consequences of the additional bands existing at the higher SR are more likely explanations for the difference than the SRC used.


BTW - the other obvious test to do is to limit the freq bounds in both cases (within the original representable freq range) in the same way. In this case I’d expect (given the simple 4x multiple on the SR) for the results to be very close to identical.
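To illustrate that freq-bounds test with a quick numpy sketch (hypothetical helper names, and an idealised spectrum rather than a real STFT): bounding the 4x analysis to the bins below the original 22.05k Nyquist recovers exactly the original flatness, while the unbounded value is wildly different.

```python
import numpy as np

rng = np.random.default_rng(0)

def flatness_db(power):
    # spectral flatness: geometric over arithmetic mean of power, in dB
    return 10 * np.log10(np.exp(np.mean(np.log(power))) / np.mean(power))

# 44.1k analysis: 512 bins of noisy power up to the 22.05k Nyquist
base = rng.uniform(0.5, 1.5, 512)
# 4x-upsampled analysis: 2048 bins, energy only below the original Nyquist
up = np.concatenate([base, np.full(3 * 512, 1e-12)])

print(flatness_db(base))      # modestly negative (noisy but full-band)
print(flatness_db(up))        # wildly negative over the full 0-88.2k range
print(flatness_db(up[:512]))  # bounded to <= 22.05k: identical to the 44.1k value
```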


@tremblap 's explanation makes sense to me. Here’s what librosa’s spectral flatness does under these circumstances:



import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# make noise data
x44 = np.random.normal(0, 1, size=44100)  # 1" @ 44k
# upsample to 176k, this uses a bandlimited sinc interpolator
x44_176 = librosa.resample(x44, orig_sr=44100, target_sr=176000)
x176 = np.random.normal(0, 1, size=176000)  # 1" @ 176k

# do STFTs and get flatness
fft_size = 1024
hop_size = 512

specs = [
    librosa.stft(x, n_fft=fft_size, hop_length=hop_size)
    for x in [x44, x44_176, x176]
]
flatnesses = [librosa.feature.spectral_flatness(S=np.abs(X)) for X in specs]

# plot stuff
f, ax = plt.subplots(3, 2)
for a, X, flat, sr, tag in zip(
    ax, specs, flatnesses, [44100, 176000, 176000], ["44k", "upsampled", "176k"]
):
    S_db = librosa.amplitude_to_db(np.abs(X), ref=np.max)
    librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="linear", ax=a[0])
    librosa.display.waveshow(
        20 * np.log10(flat[0]),  # flatness per frame, in dB
        sr=sr // hop_size,
        ax=a[1],
    )
    a[0].set_title(f"spectrogram for {tag}")
    a[1].set_title(f"flatness for {tag}")
    a[1].set_ylim(None, 0)

indeed if maxFreq is set to 20k it should be the same… @rodrigo.constanzo do the explanations help, and can you confirm that if you set the maxFreq to non-bat-freqs it works?

Ah - yes - I was thinking about this in slightly the wrong way but @weefuzzy has nailed it - basically the empty (larger) part of the spectrum is almost totally flat at 176kHz.

Ok, that all makes sense.

I won’t be doing any funny upsample/downsampling as part of the code, it was more to test to see if my plumbing was correct when switching between modes.

The use case here is for offline corpus analysis, so when it’s matched against realtime analysis, the periods of time need to be compared correctly. So if someone analyzes a file at 192k (or whatever), a 256-sample analysis window doesn’t make sense any more. Or more importantly, it won’t correspond with a realtime analysis that may be running at a different sample rate altogether. So I’m trying to keep parity of ~5.6ms = 256 samps and upscale the fftsettings/numframes to stay close to the same time windows.
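The scaling itself is simple enough to sketch (a hypothetical helper in Python for illustration, not actual SP-Tools code): multiply the 44.1k-referenced settings by sr/44100 and snap to powers of two, so the analysis window still covers roughly the same span of time.

```python
import math

# hypothetical helper, not part of SP-Tools: scale 44.1k-referenced
# fftsettings to another sample rate, snapping to powers of two so the
# window covers roughly the same span of time (~5.6ms for 256 samps @ 44.1k)
def scale_fftsettings(fftsize, hopsize, windowsize, sr, ref_sr=44100):
    factor = sr / ref_sr
    nearest_pow2 = lambda n: 2 ** round(math.log2(n))
    return tuple(nearest_pow2(v * factor) for v in (fftsize, hopsize, windowsize))

print(scale_fftsettings(1024, 256, 1024, 176400))  # (4096, 1024, 4096)
print(scale_fftsettings(256, 64, 512, 88200))      # (512, 128, 1024)
```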

I did some quick recording at 176.4k, and then downsampled that to 44.1k and it’s still not the same, but I guess that also makes sense given the relative frequency spectrum.



In terms of trying to create a way to test whether my code/plumbing is correct (so without setting capping attributes), is there a way to apply a filter to one (or both) files such that I can expect the same results at multiple sample rates? Or rather, should I just test with white/pink noise or something like that?

Basically there’s a lot of nested abstractions and I’d like to be sure the coding is correct as I go.

It’s probably sensible to just bound the analysis as there’s unlikely to be that much HF content anyway.

I wouldn’t want to preclude people loading 192k recordings of tinfoil and bats though.

recording at 176k just means less steep anti-aliasing filters, so you don’t have a flat top… just less steep filters above 20k-ish

try the range thing - that would make your object care about human range, which is what machine listening is about in your case, right?

I guess if my other spectral moments (centroid/deriv) and pitch/loudness are the same, then I likely coded it all correctly. It’s just that flatness, specifically, would make my testing/edge cases broken.

None of my samples go that high, mainly trying to make the library sample rate agnostic. So don’t want to force artificial caps where not necessary.

then you’ll have to educate your users that an upsampled signal will have a different flatness if one doesn’t match heard range / recorded range… and that upsampling is not a spectral filling device either. Pick your challenge :smiley:

There’s no upsampling anywhere in the library. It’s only if people have some 192k sample libraries already or whatever. When they analyze those, it should still correspond to ~5.6ms (or whatever it is per process).

In the end, I’ve added @maxfreq 20000 to the spectral moments analysis. I didn’t want to impact other descriptors, but centroid (the only other one I’m using at the moment) is unbothered by this either way, and it gives me consistent results when doing this kind of shitty upsampling, which will make troubleshooting easier too.


Ok, to make sure I’m not going crazy here, when playing back a 44.1k audio file in Max running at a higher sample rate, Max (dynamically?) plays that back at a higher sample rate too yes?

I’m now doing the real-time equivalent of what I was doing earlier in this thread where if you change your sample rate, the JIT buffer stuff scales up to a bigger buffer, and all the JIT “realtime” analysis processes work on larger fftsettings / numframes (e.g. @fftsettings 256 64 512 @numframes 256 becomes @fftsettings 512 128 1024 @numframes 512 when moving from 44.1k to 88.2k).

I noticed I was getting different matching/behavior, and when narrowing things down and testing with the same exact hit at different sample rates, I noticed that centroid/flatness and pitch seem to drop as the sample rate goes up. Surprisingly, the derivatives stay largely the same.

This is (the core) analysis chunk upstream:

You can see with the attruis that things are scaled up a bit here. I believe this was running at 176.4k in this screenshot.

Here are the resulting descriptors at each sample rate.

(Screenshots, 2023-03-19: the resulting descriptor values at each of the sample rates)


I’ve set the @maxfreq of fluid.bufspectralshape~ to 20000, so perhaps that is too modest. Or perhaps this is a function of the fact that I’m testing with a 44.1k audio file, and if instead I had a 192k file that it would then be the same each time.

Actually, a mirror of the first question. If I have a 192k audio file and play it in Max running at 44.1k sample rate, does it dynamically downsample? Or rather, can/should I repeat these tests with a file initially recorded at 192k to test with?

Either way, is there some technical reason why this would be happening (in other words, “working as intended”), or have I messed stuff up elsewhere in the chain?

These numbers are very near each other, considering what you are doing to the samples under the hood. Oversampling is far from an artefact-less process… you could try to hear and see the difference in various ways, for instance by trying to null-sum an original 192k file against various down/up-sampled versions of it…
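A crude numpy-only sketch of that null test (idealising the 192k → 44.1k → 192k round trip as a perfect brickwall at 22.05k, which real SRC filters are not): for white noise recorded at 192k, the residual after the round trip is nowhere near a null.

```python
import numpy as np

rng = np.random.default_rng(1)
sr, cutoff = 192000, 22050
x = rng.normal(0, 1, sr)  # 1" of white noise "recorded" at 192k

# idealised stand-in for a 192k -> 44.1k -> 192k round trip:
# a perfect brickwall keeping only content below the 44.1k Nyquist
X = np.fft.rfft(x)
X[np.fft.rfftfreq(sr, 1 / sr) > cutoff] = 0
round_trip = np.fft.irfft(X, n=sr)

rms = lambda s: np.sqrt(np.mean(s ** 2))
residual_db = 20 * np.log10(rms(x - round_trip) / rms(x))
print(residual_db)  # roughly -1 dB below the original: nowhere near a null
```

Most of a 192k white-noise signal’s energy sits above 22.05k, so the round trip throws the bulk of it away; with real resampling filters there are additional passband artefacts on top of this.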

have you made sure the resampled buffers have the same length?

They aren’t too far off, but it results in fairly different matching, as something else in the corpus is now nearer. It’s super noticeable in the mosaicking, but that happens too quickly to be able to compare stuff.

The playback going into this upstream is from a playlist~ object, so there isn’t an actual buffer loaded for them (above the hood at least). The JIT buffers are indeed resizing (I’ve checked that). I had to get quite surgical around that as it would, on occasion, try to send a presently zero-length buffer down the analysis stream, so when changing SRs that whole part of the patch freezes up until it’s all done.

The use case for all this is if someone is running their entire system at a higher sample rate, rather than up/downsampling files, so outside of my typical use cases (for now at least). The offline analysis stuff I think should be fine in that it’s analyzing (roughly) the same period of time and storing the numbers, with no context as to whatever the original SR was (though I do save that too).

I just want to make it behave as similar as possible across SRs, where possible.

If recording some 192k audio will help test this, I’ll give that a spin. I just don’t want to add further variables and complications if I can’t make sense of what’s already happening.

Ok tested with a new file recorded at 192k and the results are the same (perhaps more pronounced). The spectral stuff drops significantly (or goes up significantly in this case as it was recorded at a higher SR) when moving from 44.1k → 192k.

Screenshot 2023-03-19 at 3.27.49 PM

What’s especially odd is that I don’t remember centroid getting messed around with in the offline stuff. Since the average energy would still be within the audible range, I’d think that that would stay the same (assuming the same fftsize to numframes to ms ratio). But here centroid is changing radically too.