It actually starts before sample 0. I made this diagram for a different thread, but the analysis starts such that your first actual bit of audio is covered by each bit of the analysis frame/hop:
I think you’re using a default hop of fft/2, so you won’t have as much overlap as what I have here.
The stuff in light green is also zero-padding generally speaking. I made the diagram there as I was trying to do something else that would incorporate audio from before the analysis window, but that’s not what generally happens.
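If it helps to see it as numbers, here is a rough Python sketch of where the frames land. It assumes the zero-padding before sample 0 is the window size minus the hop size, which is my assumption for illustration and may not match your object's padding setting. `frame_starts` is a made-up helper, not something from a library.

```python
def frame_starts(num_samples, window_size, hop_size):
    """Start index of every analysis frame.

    Negative indices mean the frame begins in the zero-padding before sample 0.
    Assumption: padding before the audio is (window_size - hop_size) samples.
    """
    first_start = -(window_size - hop_size)
    return list(range(first_start, num_samples, hop_size))


# Hop of fft/2 (overlap of 2): one frame starts before the audio does.
print(frame_starts(2048, 1024, 512))

# Hop of fft/4 (overlap of 4): three frames start before the audio does.
print(frame_starts(2048, 1024, 256))
```

With a 1024 window and a hop of 512, the starts come out as -512, 0, 512, 1024, 1536, so the first frame is half zeros and half audio.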
There’s no data in between the analysis frames. You’ll get a single value (per band) for each frame, depending on your fft settings. So the more overlap you have, the more temporal resolution you’ll have (at other expenses obviously, like more frames to compute). I tend to use an overlap of 4, but 2 is “normal”.
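To put a number on the temporal resolution, a quick sketch of the time between successive values for a given fft size and overlap. The 44100 sample rate is just an assumed default and `frame_spacing_ms` is a hypothetical helper name.

```python
def frame_spacing_ms(fft_size, overlap, sample_rate=44100):
    """Time between successive analysis frames, in milliseconds.

    Assumption: hop size is fft_size divided by the overlap factor.
    """
    hop_size = fft_size // overlap
    return 1000.0 * hop_size / sample_rate


for overlap in (2, 4):
    spacing = frame_spacing_ms(1024, overlap)
    print(f"overlap {overlap}: one value per band every {spacing:.2f} ms")
```

At 1024 samples that is roughly 11.6 ms between values with an overlap of 2, and roughly 5.8 ms with an overlap of 4, so doubling the overlap doubles the number of values for the same stretch of audio.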