PCA bug or misunderstanding?

tutschku · October 24, 2024, 11:32pm

I have a dataset with 160 dimensions and want to reduce it with PCA while retaining 0.95 of the total variance.
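For readers unfamiliar with the 0.95 criterion, here is a minimal sketch of how a component count can be derived from a variance spectrum. The function name `components_for_variance` and the example spectrum are hypothetical; this is not the FluCoMa implementation and the numbers are not from the dataset in this thread.

```python
from itertools import accumulate

def components_for_variance(eigenvalues, fraction=0.95):
    """Smallest number of leading components whose variance share reaches `fraction`."""
    # PCA orders components by explained variance, largest first.
    values = sorted(eigenvalues, reverse=True)
    total = sum(values)
    if total <= 0:
        raise ValueError("total variance must be positive")
    # Walk the running sum until the requested share is reached.
    for k, running in enumerate(accumulate(values), start=1):
        if running / total >= fraction:
            return k
    return len(values)

# Hypothetical spectrum, chosen only to illustrate the idea.
example = [60, 30, 6, 3, 1]
print(components_for_variance(example, 0.95))
```

With this toy spectrum the first three components carry 96 of 100 units of variance, so three components are enough to pass the 0.95 threshold.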

But I’m running into a problem when the number of rows is smaller than the number of dimensions. In that case, fittransform fills the new dataset with zeros.
I’m not sure if this is a bug or me not understanding the concept. This is the first time I’m working with such a large number of dimensions.

The original dataset, with 99 rows and 160 cols:

fluid.dataset~: DataSet pca.std:
rows: 99 cols: 160
000001 0.29054 -0.33205 -0.14171 … 0.58806 -0.0076716 0.14521
000002 0.35523 -0.55267 0.10501 … -1.0461 0.51816 0.44641
000003 2.7544 -2.5785 -2.6737 … -0.1726 1.1397 0.2607
…
000097 -0.18875 -0.64774 -0.75295 … -0.4076 3.6126 0.34052
000098 -0.62454 1.2527 1.3371 … 0.013334 -0.90388 -1.3311
000099 -0.37633 0.94823 0.46386 … 0.69903 -0.44015 0.80934

If I limit the number of dimensions to 99, the reduced table contains values:

PCA.dimensions.for.95%: 76
fluid.dataset~: DataSet pca.reduced:
rows: 99 cols: 99
000001 4.5537 -1.2961 3.6897 … -0.07921 0.031032 6.2655e-16
000002 7.9774 -1.2558 -1.6491 … 0.046491 -0.22899 -3.8173e-15
000003 3.9425 -14.258 -3.059 … 0.00081779 0.031055 -5.6719e-15
…
000097 -7.3595 -13.407 -4.8174 … 0.049414 0.0034762 1.9137e-15
000098 -0.28963 4.7732 -3.8851 … -0.062684 -0.0117 -9.7668e-16
000099 -3.3875 5.8917 1.93 … -0.18516 -0.03085 2.4009e-15
PCA.dimensions.for.95%: 76

BUT if I ask for more dimensions than the number of rows, the resulting reduced dataset contains only zeros:

PCA.dimensions.for.95%: 76
fluid.dataset~: DataSet pca.reduced:
rows: 99 cols: 100
000001 0 0 0 … 0 0 0
000002 0 0 0 … 0 0 0
000003 0 0 0 … 0 0 0
…
000097 0 0 0 … 0 0 0
000098 0 0 0 … 0 0 0
000099 0 0 0 … 0 0 0

Anybody with a brilliant explanation/solution?

Thanks, Hans

weefuzzy · October 25, 2024, 2:34pm

Bug or misunderstanding?

A bit of each. PCA, like many of these things, doesn’t work when the number of data points is less than the number of dimensions: I don’t think you could get more non-zero principal components than you have samples.
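A small sketch of why the sample count caps the number of useful components, assuming the usual PCA step of subtracting each column's mean first. The helper names and the toy data are hypothetical, and exact fractions are used so that rank is not blurred by rounding. After centering, the rows sum to zero, so N samples span at most N-1 directions; that would be consistent with the near-zero last column in the 99-column output above.

```python
from fractions import Fraction

def center_columns(rows):
    """Subtract each column's mean, as PCA does before decomposing."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def rank(rows):
    """Matrix rank via Gaussian elimination on exact fractions."""
    m = [list(r) for r in rows]
    found = 0  # number of pivots found so far
    for col in range(len(m[0])):
        # Look for a row at or below the current pivot row with a non-zero entry.
        pivot = next((r for r in range(found, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[found], m[pivot] = m[pivot], m[found]
        # Eliminate this column from all rows below the pivot.
        for r in range(found + 1, len(m)):
            factor = m[r][col] / m[found][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[found])]
        found += 1
        if found == len(m):
            break
    return found

# Hypothetical toy data: 3 samples, 5 dimensions.
data = [[Fraction(v) for v in row] for row in [
    [1, 0, 2, 5, 3],
    [0, 4, 1, 1, 7],
    [2, 1, 0, 3, 1],
]]

print(rank(data))                  # rank of the raw data
print(rank(center_columns(data)))  # rank after mean-centering
```

For this toy matrix the rank drops from 3 to 2 once the column means are removed, even though there are 5 dimensions available.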

That said, getting just 0s is a bummer. We should either document, throw an error or ‘help’ by clipping the number of components to min(num samples, num dimensions).
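The clipping option could look roughly like this. It is a sketch only: `clip_components` and its parameters are hypothetical names, not part of the FluCoMa API, and whether to warn or stay silent is a design choice the thread leaves open.

```python
import warnings

def clip_components(requested, n_samples, n_dims):
    """Clamp a requested component count to min(n_samples, n_dims)."""
    limit = min(n_samples, n_dims)
    if requested > limit:
        # Hypothetical behaviour: tell the user instead of silently returning zeros.
        warnings.warn(
            f"requested {requested} components, but only {limit} are available "
            f"for {n_samples} samples in {n_dims} dimensions; clipping"
        )
        return limit
    return requested

# The case from this thread: 99 rows, 160 columns, 100 components requested.
print(clip_components(100, n_samples=99, n_dims=160))
```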

tutschku · October 25, 2024, 3:19pm

That’s good to know. Yes, some form of feedback might be helpful.

tremblap · October 30, 2024, 3:14pm

This is strange, I thought that we had a warning for that. It used to crash (bad!), but returning just 0s is not useful either. @weefuzzy I’m surprised that at line 34.5 of PCA.hpp we don’t do any checks… should we bail there, or at line 67.5 of PCAClient.hpp?

weefuzzy · November 1, 2024, 9:36pm

If it crashed then that was us rather than the underlying Eigen algo. You’ll never get more components out of that than you had data points when you called fit, so the default behaviour would be to surprise people by returning fewer components than they asked for (and the results themselves might not be all that good).

If we were to put an error or warning in, then it could be in transform (if an error), and optionally a warning on fit that n_samples < dimensions.
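As a rough illustration of that split (warn on fit, error on transform), here is a sketch of the guard logic alone. The class `PCAChecks` and its methods are hypothetical and perform no decomposition; the real checks would live in the C++ sources mentioned above, whose contents are not shown in this thread.

```python
import warnings

class PCAChecks:
    """Hypothetical guard logic only; no actual PCA is computed."""

    def __init__(self):
        self.max_components = None

    def fit(self, n_samples, n_dims):
        # Warning on fit: the data cannot support a full set of components.
        if n_samples < n_dims:
            warnings.warn(
                f"n_samples ({n_samples}) < dimensions ({n_dims}): "
                f"at most {n_samples} components can be returned"
            )
        self.max_components = min(n_samples, n_dims)

    def transform(self, requested):
        # Error on transform: refuse instead of filling the output with zeros.
        if self.max_components is None:
            raise RuntimeError("fit must be called before transform")
        if requested > self.max_components:
            raise ValueError(
                f"requested {requested} components, "
                f"but the fit supports at most {self.max_components}"
            )
        return requested

checks = PCAChecks()
checks.fit(n_samples=99, n_dims=160)  # emits a warning
print(checks.transform(99))           # accepted
```

Asking this sketch for 100 components after the same fit raises a `ValueError`, which is the kind of feedback requested earlier in the thread.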

tremblap · November 3, 2024, 9:27am

I’m happy to implement if you think that it makes sense. If it helps PCA users to understand the intrinsic limits of the algo, then I’m all for it.