Hi,
CodeXL seems to be reporting correctly: when the cache hit rate drops, fetch size and mem unit busy both increase, which can only be explained by more cache misses. I also ran the kernel 10 times in a loop on the host side, and the performance counters were almost identical for each kernel call. This looks like a compiler problem to me.
Regards,
Sayantan