@jamesbradbury's work with FTIS may be of interest here:
https://phd.jamesbradbury.net/tech/ftis/
This as well:
https://discourse.flucoma.org/t/segmentation-by-clustering
Ultimately you’ll have to decide which criteria (and thresholds) to use for similarity. I would imagine morphology, duration, and the time series of each sound are important factors, especially if you want to remove sounds that sound identical.
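To make the "criteria and thresholds" idea concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the file names, the three descriptors (duration, spectral centroid, loudness), their scaling, and the threshold value are all made up, and the greedy filter is just one simple way to do this, not what FTIS or FluCoMa does.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def remove_near_duplicates(features, threshold):
    """Greedy filter: keep a file only if it is farther than `threshold`
    from every file already kept. Order of `features` decides which copy survives."""
    kept = []
    for name, vec in features.items():
        if all(euclidean(vec, features[k]) > threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical per-file descriptors: [duration (s), spectral centroid (kHz), loudness (0-1)]
features = {
    "kick_01.wav":      [0.50, 0.20, 0.90],
    "kick_01_copy.wav": [0.50, 0.21, 0.90],  # almost identical to kick_01
    "snare_01.wav":     [0.30, 2.50, 0.80],
    "kick_long.wav":    [2.00, 0.20, 0.90],  # same timbre, much longer
}

kept = remove_near_duplicates(features, threshold=0.05)
print(kept)
```

With duration included as a descriptor, the long kick is kept even though its timbre matches the short one. Drop that dimension (or raise the threshold) and it would be treated as a duplicate, which is exactly the kind of decision you'd need to make for your corpus.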
When you mention the variations in duration, do you mean that a short file might sound identical to a long one, or is that just the lay of the land in the corpus, and you will be comparing files of similar duration regardless?