Null values from fluid.pca~

In trying to figure out why I can’t get the same results as @tedmoore in this thread I may have found a bug, or at least very weird behavior.

Here’s a dead simple patch:


----------begin_max5_patcher----------
3094.3oc0as0aaajE9Y6eEDB6SKj0N2uzmZ51cSA1j1fMYQvhhEFzRicXLEo
.Ikuzhle66bgjRThRjTZjrqQfyXJRNmy27ctNi98KuXzMoOoxGE7cA+ZvEW7
6WdwE1KYtvEk+8EilG9zz3vb6sMZZ574pjhQiceVg5oB60kSBRVN+FUVdPwW
BKBtSUDjoBiCxmGFGWc6wQIpooKSrOCp7hKxT45WYXQTZx0MtCR4cbaZRQR3
bkcl9IU7CphnogUuzEgES+RTxcWmolV3zEJFLgNNf.ESPDFhyA1efR33.HgL
An+LzDPv+asI31vo1I.VdMs5DkDqJx23hoKKptJn7pQyrRV5Me8JHczZuy7n
ey8NES.lq9GWdo4WiORvVLIX1x4KBhJFcnHDRPLHDFnACADyfbDkKfLNabfD
a.HD4z.Pjy..wmDbaTQvCgIQwwgAoIpQcR1vGJTBERCThXHCrURuNUnG5Lfd
rIAEg2qBBChSSueRvcYo44Zy6337.0Cprme7KpL0AZSyNTXFiX61lVJOk1zf
y.nimXQ67fz6OXiZFjYfAZoMMBAfTFlvkZDBfNgzRwY.fP5HLoY5nI5W+Zd9
NAFzHtvZP6rmQxSHvw8HvMWkmGdmZKfCLQJELLK.LQauPgb2.R0.nPTN.RcC
.hxaFfK+eX4+aGr9+plMmZV77BkCCGMpFsZCgoVhJWX8UBATmOy0MeafmnVw
S313Iazg.bKiKhxiilox1s5LdcUJWKAEOG2f.ra0T6rx5cB3zVrXWpY0qplH
PZUuQaq2nUz4LMUuPkcsJI7FmDBVI05ISq5NoCXkFir3QxlIr6UWcUQVXRdj
1tbRsM6gwTzbCK0v4YChodhnPki7qVa03a0JaC02qXQoeIhNN3lt2QDtu.Fl
OAlEYQqbkOLskJP6JXFF4KkEeHJah5Q8CuMIHdYzrIyBKB0FZeanq72Dlbmw
IyswogEM81ravQ1VjdYGTg1CNg2FcHLuiN03v21IPDGkWzefnzlnMfnrLliG
GvBOhC+ltj3ky6YblVTXBP5LKv5v5HJk3RCFKPZOC3Cwt.0VFu.OpxOndZQV
ve4VXve076iJkAJ.rc5+B1grR2h+.o7jn1+M8uQAee9zv3vr4oyTAGIHHZAD
zUv6IuhBeZ2OKZZwjkIKBmdutd73kp7u6nzcBiLgrIwGBgdhAvkdV22oxZ9P
cMKgYOas4G2Ca+JH.0RIvTPUMKCY0mzBBT2Aq7vGTytVeMs.bcnNCmnaVV3Z
W3E0.wEiTyuQMas7L2aRnq+gyCWrn9isepE18UAiziuKYks14EnKY3yPE0DW
WxNltHRbs9xkcNCdJQD3Y.QfSLMsdVPTRPwWTA2E8fJIXxWySSNXHBCp7LYv
oSYuDjmgdIbD4yyoUoy6wD3I9rXEyR+goatE4RcC6Ic6LUahmpHwAARWKNHH
Ok5MziXPldpT1PBGd12kQfwtLNYROkuM0q738qh6saKtJHZMdGA5qdzAFcBa
3zw0lI2d3vk9xDV5ca3ESC+Vv2qm4YQ5PX45LHyCndpX5RT.S.szVANvWkS2
n7B6KytIRarOzV4yb8lHUd5xroUpWosy3fUxzLUdQThc6.V6lLI0FrqNc12Y
RWWT2yj4df02SZlooxU42dZmYTi6odlgG0La5DP2SsnTqOpoB2iIB5gkQrQX
QcLQUrpialj8PkvLOLSjyD1Q6yhDwGZDsOTdrvGyzYaUhANWyjrO5zl11dxY
Uul6M7n4IuU79L0Ux2wMSz9D0Y6YxEtKbwhGTY4k2scRzo.70TKDHFa+ynD2
eZiqpqM4gnp6GZK9aTXlN9cgN38xLW4nOwbmslQlVLlkrLpDP05mdNsaX4Fw
Xqqj8MKmEk9whvhk4W+dUxRWBEZ051vkwEMwhat61n33oowN4a8VAUk4wH2m
V0xm560roiHIABEZbYBFh4Hlcjd.ktVfyxmAV8PDJQBPla0z2KB0NRfvTrYD
XiGCsZt.PoaF.RAf3FouDrbmOqdLcoNt9Pgro4T0pprzEoYUKp5GUVe+KKRu
KKblMiysZe03RpTl9Sq2m3QkXpc2lqdl0od0KFuaYUGE10BPC.UZ1JeoQ0DR
JgysiHHI2hRTVMIMb5Ts.03YwHMvZQEBUiP1Qb20fazjhJwSys+kEpjfOpS3
N3ip4Q2jFOaUZkpX07MmEtfwbRivbZCbizSHB0bZLo5ZevqiRL1Gp50QrDAs
qinZh.hKH3lqiMIZDNEhsyERh0Pi6wg.1FJWe3y0q06fR2SZVSRsnlJK3.pP
3FskVsIolUsRwL+vcibuoSBmt9zCDqrcrt4RKRimVoQWkHxYMJ..HowQK3HM
HdeXQw9LHVuGZ+5ZYeudivLU23MA5Cuo3Z6iAFjcpiQz1n04isXl1uk7MMSe
SVTX7ntst.iW6W6zbRRMdtctTHbfktaX63MLmlFqzQlV8jP2qVp8LQc5flsy
w+41Fr2OlmMB2fQi7Fi9ye5eN.pLjg0b.qUOmvjxxQ.YmLYJfyrqM.FECcuB
HBh1e.me54YYo2oR9jkKruPMDFDBboGHQLhkPPkP8ZSyInNOuljHSJIVmXZY
hxbQFMqu1HNx1MLflzXbwZMVC1akWkPyqANd+Bx+hm6T6Tbg2n31FrEMc.zb
AVxcl1XYouOykn3No4jRGisMZO772pRTODNpCV5N8Wejd36qPefE..GFrdv7
wMWBKybb6QGDKdejUou4p+vxhhp8drcF61AZE03DnbDz2h0OZxq30lP81vnj
u0SoRuPV6pd6Q9WzxeLRaHuOgaKeHvRS0U+5jHSnWUB0+Z8SgcubQuSOaslb
POcv0gFu2Bh5bN7Id8N0r8gUsgArdDJ.1Br5SwVWOWVzSSKxFnWjSm.YNORe
a.ztNcebLrPepY+b5LU9P4HGfcRuizdpCY9yc6BoCGHdUZremX2mvzlWjsCZ
dBDsODMsnar5Em.+Ay2uqhWcYa7wooKTCwiwAYSspY2Rr4mSEW3Sg2LTRPeM
2e4XNeRWMzMcl.cKEuA5aHxSXlPeJ8NS8HCbQAyYDNzELhfg1ZjHM1gjWvrf
+Oy6XikNiv6mCePY9xZ4wBFN3rBASndR4lElc+UIQ28khqrcNXf7991g216J
Ao1M.udKZp2rFg80rciJFX+TOhdnA6vFw6cdv68Oymaw3WsqgpL.n7XU6iWE
pq8nY88l3lXk1yraWD+gzz66poWq1BjsFr68dXu89D3wVZ0uo40MwZG4m6Km
SecQTx8Co4q8zkPaaWlFL.h5n3XFP+bcWXVOdrcwMG2326L+mdLCa4ZrGOy1
z9U6Ig81wqOxCsxs+x1Yz66YY+t2mF6Eyj6c0j8YyQ64kNYoXSpDcmB8l8FA
HIkm1BoDK3ki3BYOLAqyXfV69rhx3O8Zd3SLyQvtu5zF.rMgsc68QzrnHdIq
8.2nFuov1SpcuS2kfvTjrYbLi+XuuNz3b82tfM7s0xCoIJpOoEaOZOdq3Byo
RabqiNigzOlyoVSqi9sJ5Nw+Cqoaqcl97dK.beEJ9g3kpqfCrJHLjWd1FV4L
R5cWPkBXV5iICVBOONIcR3e+4vgKf.jj5N5EbNFib1NHnNoD+KfuMSoN.Izh
d08TCxLYa5eg68ZOaIEgCV7DD.i451GCnENKTxjPI0+x3+VMavxGuN2MXc8N
fxHF9V99up33zGGNDhntiWDW.E1DNATMqjtuMivbdpnR.PbJT.zeFU.61W71
z3CfhT2BIpNSHp6jdsdHMuHfKVlsX+sIcqjSI.WdMHJfIsbXjfxPhdlgFu9r
aRfTsaCquMi2CuZZlolphdX+6bWabl5JkWIbf0Onn9Q3bknXnEC.46jPzZAy
DsePYY7Cd4YaWaDH8ZjjbUxr7WaQPVAx921q0y2KfCAb2dMQX.Y4Wi.Nzqbm
hvECnKguKrHM3clpPG4s4OuPc6x33hN2fkMOyqBJE5ZqMCywtuMBLs0091vK
y20Et8VY0o3a5fP8wQ0qJ0f0IjfKb4YfqYxXjDKEdnT8WP7Pk+556Kvilt47
7.H9+XTVwyA+i6R6ps3PoD3xOTyDIka5MC.vvlmX4cFOa.e0BVAMXVm63Evr
PaK+EWt9p8KUtASG3Js66NmcG7LPU9hxkP6Wm9K+iK++.klL8J.
-----------end_max5_patcher-----------

It loads a dataset, normalizes it, then runs PCA on it. Except when I do that, I get a PCA output full of nulls, even though the numbers in the dataset look OK (as far as I can tell).

Attached are two datasets, a 14d made up of loudness with all stats and one deriv, and the other is 98d of spectralshape with all stats and one deriv, both of the same sample set.

datasets.zip (1.5 MB)

Should fluid.pca~ be returning nulls for this kind of dataset?

Both datasets contain columns that have a range of 0 (and therefore a variance of 0). Unfortunately this will result in NaNs cropping up in a great many cases. In the 14D case, it’s column 5. In the bigger one, it’s column 48.
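To make the mechanism concrete, here is a minimal Python sketch of min-max scaling on a single column. It is an illustration only, not the FluCoMa implementation: `minmax_scale` is a made-up helper, and the usual `(x - min) / (max - min)` form is assumed. Python raises an error on `0.0 / 0.0` where IEEE float maths gives NaN, so the sketch returns NaN explicitly in that case.

```python
import math

def minmax_scale(column):
    """Scale a column to 0..1; a constant column has no range to scale by."""
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        # (x - lo) / span would be 0/0 here, which is NaN in IEEE float maths
        return [math.nan for _ in column]
    return [(x - lo) / span for x in column]

varying = minmax_scale([-80.0, -60.0, -40.0])       # ordinary feature
constant = minmax_scale([-157.2, -157.2, -157.2])   # same value in every point
```

Once a column is all NaN, anything computed from it downstream (covariance, PCA projections) inherits the NaNs.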

In the spirit of working out which features to throw away, checking for 0 range is a good first step, because such a feature is never going to contribute anything useful to the algorithm and may actively screw things up (as with PCA, which is fundamentally about maximizing the variance across dimensions). I checked lazily using fluid.standardize, but you could equally check the difference between min and max with fluid.normalize.
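If you'd rather check a dumped dataset outside Max, something like this Python sketch would flag the offending columns. It assumes the JSON has a `{"cols": N, "data": {id: [values]}}` layout, and `zero_range_columns` is a name invented for this post. Columns are counted from 1.

```python
import json

def zero_range_columns(dataset):
    """Return 1-based indices of columns whose value is identical in every point."""
    points = list(dataset["data"].values())
    flagged = []
    for col in range(dataset["cols"]):
        values = [point[col] for point in points]
        if max(values) == min(values):
            flagged.append(col + 1)
    return flagged

# Tiny stand-in for a dumped dataset: 3 points, 3 features.
# The second feature never changes; the third has zeros but still varies.
example = json.loads(
    '{"cols": 3, "data": {"1": [0.5, -157.2, 0.0],'
    ' "2": [0.7, -157.2, 1.3], "3": [0.1, -157.2, 0.0]}}'
)
flagged = zero_range_columns(example)
print(flagged)
```

For a real file you would load the dumped JSON with `json.load` instead of the inline example.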

Having mess in the JSON isn’t ideal though, and we have an issue logged already reminding us to think through what to do about NaNs in each case.

Hmm, yeah that’s definitely unpleasant downstream. I initially thought I found a different bug as Max was crashing when I dump-ed, but I guess it was nulls wreaking havoc in jitter/js-land.

By “range of 0” do you mean once it’s been PCA’d, or in the dataset itself?

From the looks of the 14D one, column 13 has a load of zeros in it (but many non-zeros):

"1": [
      -79.96369171142578,
      35.59560012817383,
      -1.3295470476150513,
      3.5680458545684814,
      -157.22659301757813,
      -69.97750091552734,
      -40.8841552734375,
      -0.013661191798746586,
      1.1971098184585571,
      -2.230168104171753,
      1272.0557861328125,
      -42.5906982421875,
      0.0,
      45.94984436035156
    ],

With that column corresponding to the mid of the first derivative (of loudness), which I guess seems like a sensible value to return there.

The 98D one is too long to eyeball my way through, but I definitely see a bunch of zeros in there, and if it’s the 48th column that would be the mid of (spectral) kurtosis. (It wasn’t until I was putting together a coll with all the column names that I realized that you can have kurtosis of (spectral) kurtosis. Fun stuff!)

With both of these potentially being mid values, and from my (quite likely incomplete) understanding potentially having a small variance, I guess it’s possible to have this cropping up in what would otherwise appear to be “normal” data.

No clue what makes sense to do in each of these cases since I’m sure there are pretty hearty implications, but I wouldn’t have realized these datasets were “busted” unless I ran into this problem doing something else.

You’re just showing a single sample there (with fourteen features in it). What’s at issue here is the range of each feature across all the points, so if column N is the same value in all samples, then you’ll have a range of 0 for that feature…

Aaaaah.

Ok, the errant issue here is digital silence, which gives a consistent low for loudness that I presume is getting clamped to -157.22659301757813.

So in that case I can easily see how there would be a range of zero for potentially a bunch of “extreme” parameters in this way (min/max of certain values in particular, but I guess all those nyquist/2 vibe-ing as well).
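One way to deal with that before normalizing or running PCA would be to prune the constant columns from the dumped dataset. A rough Python sketch, assuming the dump has a `{"cols": N, "data": {id: [values]}}` layout; `drop_constant_columns` is a hypothetical helper, not anything in FluCoMa:

```python
def drop_constant_columns(dataset):
    """Return (pruned dataset, kept 0-based column indices)."""
    points = dataset["data"]
    # Keep a column only if it takes more than one distinct value across points
    keep = [
        col for col in range(dataset["cols"])
        if len({point[col] for point in points.values()}) > 1
    ]
    pruned = {key: [point[col] for col in keep] for key, point in points.items()}
    return {"cols": len(keep), "data": pruned}, keep

# Made-up example: the first column is a clamped minimum, the last never changes
silence = {
    "cols": 3,
    "data": {
        "a": [-157.2, 0.2, 1.0],
        "b": [-157.2, 0.4, 1.0],
        "c": [-157.2, 0.9, 1.0],
    },
}
cleaned, kept = drop_constant_columns(silence)
```

Returning the kept indices alongside the pruned data makes it possible to map the surviving columns back to their original feature names.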