Does writing to local memory in OpenCL slow the kernel down by 400%?

I have a pretty complicated kernel that I've been optimizing. Without going through all the code, there is one kernel that writes some values to global memory.

Then a second kernel fires and does billions of computations on that data, all in local memory. I've optimized the code over and over, getting the kernel run time down to about 275ms.

The final part of the kernel loops over an array of data processed in local memory and searches for a matching string. Obviously, if it finds a match it needs to let the host program know. I accomplished this by setting global_array[0].x to 999 and global_array[0].y to the found result.

After the kernel finishes, the host reads the first element of global_array and checks whether .x == 999; if so, we know we found our target.

While optimizing further, I found that if I commented out the global_array[0] = lines, the kernel ran 4x as fast, at about 62ms. Knowing global memory is slow, I started testing various things. I thought, hey, maybe if I change the LOCAL array and then do a work_group_copy back to global at the very end, I'd get a bit of a speed increase.

But no... I don't. And it's confusing as heck. If at the end of the kernel I write anything to seemingly any position in global or local memory, my kernel runs at 270ms. If I write the same data to a private variable, or just do other unrelated code, it's 62ms.

I need to return a result from the kernel somehow - but for some reason, writing to a local variable, something the kernel does 50x before it reaches the end without slowdown, seems to slow it down like crazy when the write is at the end.

Can anyone explain why this would happen? I'm stumped.

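When you don't write anything out to global memory, the results of your computation are never observable, so the JIT compiler is most likely detecting most of your code as dead code and eliminating it. If that is what's happening, the 62ms figure measures a kernel that skips most of the work, and the 270ms figure is the real cost of the computation rather than the cost of the final write.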
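
As a rough illustration of that effect, here is a minimal sketch with two kernels that differ only in whether the result is stored. None of it comes from the original code: `heavy_compute`, `search_no_store`, `search_with_store`, and the flat `result` buffer (standing in for `global_array[0].x` and `.y`) are hypothetical names. The block under `#ifndef __OPENCL_VERSION__` exists only so the sketch can also be compiled and checked as plain C; an OpenCL compiler skips it.

```c
#ifndef __OPENCL_VERSION__
/* Plain-C stand-ins (assumptions for host-side checking only). */
#include <stddef.h>
#define __kernel
#define __global
static size_t fake_gid = 0;
static size_t get_global_id(unsigned dim) { (void)dim; return fake_gid; }
#endif

/* Hypothetical placeholder for the expensive per-item computation. */
unsigned heavy_compute(unsigned seed)
{
    unsigned h = seed;
    for (int i = 0; i < 1000; ++i)
        h = h * 1664525u + 1013904223u;
    return h;
}

/* Nothing computed here ever leaves the kernel, so an optimizing compiler
   is allowed to remove the call to heavy_compute and the comparison. */
__kernel void search_no_store(__global const unsigned *in, unsigned target)
{
    size_t gid = get_global_id(0);
    unsigned h = heavy_compute(in[gid]);
    unsigned found = (h == target);   /* private value, never read again */
    (void)found;
}

/* The conditional store makes the result observable from the host, so the
   computation feeding it has to be kept. */
__kernel void search_with_store(__global const unsigned *in, unsigned target,
                                __global unsigned *result)
{
    size_t gid = get_global_id(0);
    unsigned h = heavy_compute(in[gid]);
    if (h == target) {
        result[0] = 999u;             /* "found" marker, as in the question */
        result[1] = (unsigned)gid;    /* which work-item matched */
    }
}
```

If the first variant runs several times faster than the second, that would be consistent with the work having been optimized away rather than with the store itself being expensive. Whether a particular driver actually removes the code depends on its compiler, so treat this as something to test, not a guarantee.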