Why is arrayfun much faster than a for loop on the GPU?

Could someone explain why arrayfun is much faster than a for loop on the GPU? (On the CPU it is the other way around: the for loop is faster.)

Arrayfun:

```
x = parallel.gpu.GPUArray(rand(512,512,64));
count = arrayfun(@(x) x^2, x);
```

And the equivalent for loop:

```
for i=1:size(x,1)*size(x,2)*size(x,3)
    z(i)=x(i).^2;
end
```

Is it because a for loop is not parallelized on the GPU? Thanks.

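I don't think the naming in your two snippets matches what they do. The name `count` suggests some sort of count, but `arrayfun(@(x) x^2, x)` applies the function to each element, so both versions square every element of the array; the only difference is that the loop stores its result in `z`.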

Regardless, I think the explanation you're looking for is as follows:

When run on the GPU, your code can be decomposed by data, one array cell at a time in this case, with each cell squared separately. This is safe because, for a given `i`, the value of `[cell_i]^2` does not depend on the values in any other cell. What most likely happens is that the array gets split into S buffers, where S is the number of stream processing units your GPU has. Each unit then computes the square of every cell in its buffer. The per-buffer results are gathered into one output array, which is returned as `count`.
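To make the decomposition concrete, here is a minimal sketch that emulates it on the CPU. The chunk count `S`, the `edges` boundaries, and the small array size are illustrative assumptions; this is not how MATLAB's GPU `arrayfun` is actually implemented, only a picture of why independent cells can be handed to separate units.

```matlab
% Sketch: split an elementwise square across S independent "units".
% S and the chunking scheme are hypothetical, chosen for illustration only.
x = rand(8, 8, 4);                      % small stand-in for the 512x512x64 array
S = 4;                                  % assumed number of processing units
n = numel(x);
edges = round(linspace(0, n, S + 1));   % chunk boundaries over linear indices
z = zeros(size(x));                     % preallocate the output
for s = 1:S                             % each iteration could run on its own unit
    idx = edges(s) + 1 : edges(s + 1);  % linear indices owned by chunk s
    z(idx) = x(idx) .^ 2;               % no chunk reads another chunk's data
end
isequal(z, x .^ 2)                      % compare against the vectorized result
```

Because no chunk reads or writes another chunk's cells, the order in which the chunks run does not matter, and the final check should report that the chunked result matches `x .^ 2`.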

The same idea carries over if you ever do need a reduction, such as a count or a sum over the array. The algorithm most likely partitions the array into similar buffers and, instead of squaring each cell, adds the values in each buffer together. The result of this first step is a smaller array of partial sums, and the same process can be applied to it recursively until a single total remains.
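The recursive reduction can be sketched the same way. This is a CPU-only illustration under assumed parameters: the buffer size `B` and the zero padding are my own choices, not something taken from MATLAB's GPU implementation.

```matlab
% Sketch: sum a vector by repeatedly reducing fixed-size buffers.
% B (the buffer size) is a hypothetical parameter chosen for illustration.
v = rand(1, 1000);                       % example data
B = 8;                                   % assumed number of values per buffer
partial = v;
while numel(partial) > 1
    pad = mod(-numel(partial), B);       % zeros needed to fill the last buffer
    partial = [partial, zeros(1, pad)];  % padding with 0 leaves the sum unchanged
    % Each column is one buffer; the columns are independent of each other,
    % so all of them could be summed at the same time.
    partial = sum(reshape(partial, B, []), 1);
end
total = partial;                         % single remaining value
abs(total - sum(v)) < 1e-9               % compare against the built-in sum
```

Each pass shrinks the array by a factor of `B`, so the number of passes grows only logarithmically with the array length. The comparison uses a tolerance because the additions happen in a different order than in `sum`, which can change the last few bits of a floating-point result.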