Indeed, it’s quite a pickle of a situation.
Based on my (fairly superficial) understanding, I like the idea of zero-padding for loudness. It does create “fake” data by isolating a window, but for my purposes that’s probably the better choice. The spectral descriptors are where I was hoping to improve things, and it looks like in my example I mirrored, i.e. duplicated the start/end frame (or is that folding?).
As you point out, however, that over-represents those frames in the analysis, but perhaps that’s less problematic with loudness-weighted descriptors.
I hadn’t really considered what it would mean for pitch though, as that seems like a weirder one in terms of behavior.
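To keep the terminology straight for myself, here is a rough sketch of the padding options on a toy series of per-frame values, using NumPy's `np.pad`. The `pad_frames` helper, the frame values, and the loudness weights are all made up for illustration, and none of this is meant to describe what any particular analysis tool does internally. As far as I can tell, "duplicating the start/end frame" corresponds to NumPy's `edge` mode, while mirroring/folding is closer to `reflect` or `symmetric`.

```python
import numpy as np

def pad_frames(frames, n, mode):
    """Pad a 1-D series of per-frame values with n extra frames on each side.

    Hypothetical helper: `mode` is passed straight to np.pad.
    """
    return np.pad(np.asarray(frames, dtype=float), n, mode=mode)

# Toy per-frame descriptor values (e.g. a spectral descriptor), made up.
frames = [1.0, 2.0, 3.0, 4.0]

zero = pad_frames(frames, 2, "constant")        # zeros added at both ends
edge = pad_frames(frames, 2, "edge")            # first/last frame duplicated
reflect = pad_frames(frames, 2, "reflect")      # mirrored, edge frame not repeated
symmetric = pad_frames(frames, 2, "symmetric")  # mirrored, edge frame repeated

# Assumption: loudness is zero-padded and used as the weights for the
# (edge-padded) spectral values. The padded frames then get zero weight,
# so they no longer skew the weighted mean.
loudness = [0.2, 1.0, 0.8, 0.1]                 # made-up linear weights
weights = pad_frames(loudness, 2, "constant")

plain_mean = edge.mean()                        # pulled around by duplicated frames
weighted_mean = np.average(edge, weights=weights)
unpadded_weighted_mean = np.average(frames, weights=loudness)
```

If that assumption holds, the weighted mean of the padded series comes out the same as the weighted mean of the unpadded one, which is the sense in which the over-representation might matter less for loudness-weighted descriptors.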