This is essentially a question about what data is represented and how it is weighted.
With zero-padding, the values at the edges are biased by the added zeros, and how they are biased depends on the descriptor.
- For spectral shape, as long as you have enough data in the window, the bias will come from the sharp onset that has been induced in the sample.
- For pitch, it can really mess up the measurement, and it’s hard to predict how.
- For loudness etc., it may be less problematic, but you are biasing towards zero.
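
As a rough illustration of the loudness case, here is a minimal sketch assuming NumPy and a made-up constant-amplitude signal. `frame_rms` is a hypothetical helper written for this example, not a function from any particular analysis library, and the frame size is arbitrary.

```python
import numpy as np

def frame_rms(frame):
    """Root-mean-square level of one analysis frame (hypothetical helper)."""
    return float(np.sqrt(np.mean(frame ** 2)))

frame_size = 8
signal = np.ones(16)  # assumed test signal: constant amplitude, true RMS = 1.0

# First frame centred on sample 0: half of it lies before the audio starts
# and is filled with zeros.
half = frame_size // 2
padded = np.concatenate([np.zeros(half), signal])
edge_frame = padded[:frame_size]   # [0 0 0 0 1 1 1 1]
full_frame = signal[:frame_size]   # [1 1 1 1 1 1 1 1]

rms_edge = frame_rms(edge_frame)   # sqrt(0.5), about 0.707
rms_full = frame_rms(full_frame)   # 1.0

print(rms_edge, rms_full)
```

The half-empty edge frame reads roughly 3 dB low even though the signal level never changed, which is the "biasing towards zero" in practice.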
On the other hand, if you have a segment of audio that is very loud in its first few samples, windowing may significantly affect the result, so if you only start your first window at the edge of the audio you are effectively discarding useful data. These cases might be treated differently for streaming (analysing in chunks) than for a file, where the start is meaningful. Either way, you have to be careful that you are not throwing data away, or biasing it in a way that induces error.
Mirroring or folding (I’m drawing a distinction between repeating the edge sample or not) or wrapping (reading modulo) can reduce some of these issues, in that all frames are now full, and full of the original data. However, you’ve now repeated samples to do so, and have therefore weighted those samples more highly in your estimates.
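
To make the options concrete, here is a small sketch using NumPy's `np.pad`. This is an assumption for illustration only: mapping NumPy's `reflect` (edge sample not repeated), `symmetric` (edge sample repeated) and `wrap` modes onto the mirroring, folding and wrapping described above is my reading, and the five-sample input is made up. The counts show how often each original sample ends up in the padded data.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])  # toy "audio" with distinct sample values
pad = 2                        # samples added at each end

zero_pad = np.pad(x, pad, mode="constant")        # [0 0 1 2 3 4 5 0 0]
reflect_pad = np.pad(x, pad, mode="reflect")      # [3 2 1 2 3 4 5 4 3]
symmetric_pad = np.pad(x, pad, mode="symmetric")  # [2 1 1 2 3 4 5 5 4]
wrap_pad = np.pad(x, pad, mode="wrap")            # [4 5 1 2 3 4 5 1 2]

def sample_counts(padded):
    """How many times each original value 1..5 appears after padding."""
    return np.bincount(padded, minlength=6)[1:]

reflect_counts = sample_counts(reflect_pad)      # [1 2 3 2 1]
symmetric_counts = sample_counts(symmetric_pad)  # [2 2 1 2 2]
wrap_counts = sample_counts(wrap_pad)            # [2 2 1 2 2]

print(reflect_counts, symmetric_counts, wrap_counts)
```

None of the three modes introduces zeros, but none weights the samples evenly either, so which one is least harmful depends on the descriptor and on where the interesting data sits.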