This is odd; I have a few relatively small intermediate buffers that I used to create on the device only.
I now need to process these buffers on the host, so I tried creating them in pinned host memory instead,
in preparation for mapping them.
For some strange reason, performance has improved significantly.
I'm not one to look a gift horse in the mouth, but can anyone think of an explanation for this phenomenon?