Identical slices remove redundancy

Hello,

In some corpora that have grown huge, I would like to remove slices that are too similar to each other, to reduce redundancy. For example, I have two well-identified clusters: one has 200 points of the same-ish sound, the other has 20 points. I'd like to find an efficient way to bring both down to 20 points, by removing 180 points that are too close to each other.

(I think I read about a clever way of doing it in the forum before, but I could not find the posts again…)

Would it be this thread? "Are there ways to speed up bufnmfcross' performance?" (#31, by tremblap)

Yes! Cheers!

This may also be of particular relevance (by @jamesbradbury)

I have not explored "Novelty" algorithms much yet, but I knew this video.
For some reason, the "grid decimation" method described in the link in the post above is easier for me to understand at the moment; it gives me hints on how to ignore or delete material from a corpus.
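
To make the idea concrete, here is a rough Python sketch of how I picture grid decimation. It is my own reading of the idea, not code from the linked thread and not a FluCoMa API: the function name `grid_decimate`, the `cell_size` parameter, and the 2D toy data are all made up for illustration. The assumption is that the points live in a reduced descriptor space (for example 2D), that you overlay a regular grid on it, and that you keep only one point per occupied cell.

```python
import random

def grid_decimate(points, cell_size):
    """Return the indices of the points to keep: the first point found in each grid cell.

    `points` is a list of coordinate tuples; `cell_size` is the grid spacing
    (a hypothetical tuning parameter, in the same units as the coordinates).
    """
    kept = []      # indices of the points we keep
    seen = set()   # grid cells that already hold a kept point
    for i, p in enumerate(points):
        # Quantise each coordinate to the integer index of its grid cell.
        cell = tuple(int(c // cell_size) for c in p)
        if cell not in seen:
            seen.add(cell)
            kept.append(i)
    return kept

# Toy corpus: a dense cluster of 200 points and a sparser one of 20 points.
random.seed(0)
dense = [(random.gauss(0.2, 0.02), random.gauss(0.2, 0.02)) for _ in range(200)]
sparse = [(random.gauss(0.8, 0.1), random.gauss(0.8, 0.1)) for _ in range(20)]
points = dense + sparse

kept = grid_decimate(points, cell_size=0.05)
print(len(points), "->", len(kept))
```

The dense cluster occupies only a few cells, so most of its points are dropped, while the spread-out cluster loses far fewer. A larger `cell_size` removes more. Keeping the first point per cell is an arbitrary choice; keeping the one closest to the cell centre would be another option.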
