Thanks.
The 3/4 discrepancy was my mistake in the message. In the real kernel the loop does 4 steps.
I tried setting the global size to a multiple of 1024, but it had no effect. Originally it was a multiple of the work-group size (256).
P.S.
In the end I managed to decrease VGPR usage to 103 and gain 20% in speed (but I think this is mostly due to LDS usage).
I also found that these actions lower VGPR usage:
1) packing data into preferred-size vectors (does not always help)
2) using scalar operations even on vectors:
a.x + b.x ... a.z + b.z instead of a + b
3) not unrolling code manually,
even if the loop then has to contain something like
loop (i = 0..3) {
do something with D
A = (i==0) ? D : A;
B = (i==1) ? D : B;
C = (i==2) ? D : C;
}
it works a little faster (and uses fewer VGPRs)
than
do something with A
do something with B
do something with C
do something with D
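
To make point 3 concrete, here is a minimal sketch of the two forms written as plain C so it compiles on the host. The function `process`, the names `run_unrolled` / `run_looped`, and the input values are hypothetical stand-ins for the real per-step work, not the actual kernel. The sketch only shows that both forms produce the same results; whether the looped form really needs fewer VGPRs has to be checked in the compiler output for the real kernel.

```c
#include <assert.h>

/* Hypothetical per-step work; stands in for "do something with D". */
static float process(float v, int step) {
    return v * 2.0f + (float)step;
}

/* Manually unrolled form: one copy of the work per variable. */
static void run_unrolled(const float in[4], float out[4]) {
    float A = process(in[0], 0);
    float B = process(in[1], 1);
    float C = process(in[2], 2);
    float D = process(in[3], 3);
    out[0] = A; out[1] = B; out[2] = C; out[3] = D;
}

/* Looped form: the work is done once per iteration on D, and the
 * result is routed to A, B or C with conditional selects. After the
 * last iteration D itself holds the fourth result. */
static void run_looped(const float in[4], float out[4]) {
    float A = 0.0f, B = 0.0f, C = 0.0f, D = 0.0f;
    for (int i = 0; i < 4; ++i) {
        D = process(in[i], i);
        A = (i == 0) ? D : A;
        B = (i == 1) ? D : B;
        C = (i == 2) ? D : C;
    }
    out[0] = A; out[1] = B; out[2] = C; out[3] = D;
}

/* Sanity check (hypothetical helper): both forms must agree. */
static void check_forms_agree(const float in[4]) {
    float u[4], l[4];
    run_unrolled(in, u);
    run_looped(in, l);
    for (int i = 0; i < 4; ++i) {
        assert(u[i] == l[i]);
    }
}
```

In an OpenCL kernel the loop body would look the same, with A..D as private variables.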