OpenCL: Get device INVALID_COMMAND_QUEUE for host memory transfer after the kernel has been looped multiple times


I'm working on an OpenCL program that calls the same kernel several times in a loop. When I use clEnqueueReadBuffer to transfer device memory back to the host it reports that the command queue is invalid.

Below is a function that is called to initiate a bitonic sort, it's shortened to make it more readable. The device list, context, command queue and kernel are created outside and passed to this function. list contains the list to be sorted, size is the number of elements in list.

cl_int OpenCLBitonicSort(cl_device_id device, cl_context context,
    cl_command_queue commandQueue, cl_kernel bitonicSortKernel,
    unsigned int * list, unsigned int size){

    //create OpenCL specific variables
    cl_int error = CL_SUCCESS;
    size_t maximum_local_ws;
    size_t local_ws;
    size_t global_ws;

    //create variables that keep track of bitonic sorting progress
    unsigned int stage = 0;
    unsigned int subStage;
    unsigned int numberOfStages = 0;

    //get maximum work group size
    clGetKernelWorkGroupInfo(bitonicSortKernel, device,
        CL_KERNEL_WORK_GROUP_SIZE, sizeof(maximum_local_ws),
         &maximum_local_ws, NULL);

    //make local_ws the largest perfect square allowed by OpenCL
    for(i = 1; i <= maximum_local_ws; i *= 2){
        local_ws = (size_t) i;
    //total number of comparators will be half the items in the list
    global_ws = (size_t) size/2;

    //transfer list to the device
    cl_mem list_d = clCreateBuffer(context, CL_MEM_COPY_HOST_PTR,
        size * sizeof(unsigned int), list, &error);

    //find the number of stages needed (numberOfStages = ln(size))
    for(numberOfStages = 0; (1 << numberOfStages ^ size); numberOfStages++){

    //loop through all stages
    for(stage = 0; stage < numberOfStages; stage++){
        //loop through all substages in each stage
        for(subStage = stage, i = 0; i <= stage; subStage--, i++){
            //add kernel parameters
            error = clSetKernelArg(bitonicSortKernel, 0,
                sizeof(cl_mem), &list_d);
            error = clSetKernelArg(bitonicSortKernel, 1,
                sizeof(unsigned int), &size);
            error = clSetKernelArg(bitonicSortKernel, 2,
                sizeof(unsigned int), &stage);
            error = clSetKernelArg(bitonicSortKernel, 3,
                sizeof(unsigned int), &subStage);

            //call the kernel
            error = clEnqueueNDRangeKernel(commandQueue, bitonicSortKernel, 1,
                NULL, &global_ws, &local_ws, 0, NULL, NULL);

            //wait for the kernel to stop executing
            error = clEnqueueBarrier(commandQueue);

    //read the result back to the host
    error = clEnqueueReadBuffer(commandQueue, list_d, CL_TRUE, 0,
        size * sizeof(unsigned int), list, 0, NULL, NULL);

    //free the list on the device

    return error;

In this code: clEnqueueReadBuffer says that the commandQueue is invalid. However it was valid when I called clEnqueueNDRangeKernel and clEnqueueBarrier.

When I set numberOfStages to just be 1 and stage to just be 0, so that clEnqueueNDRangeKernel is only called once, the code worked without returning errors (although the result was not correct). There is a problem with calling clEnqueueNDRangeKernel more than once (which I really need to do).

I'm on Mac OS 10.6 Snow Leopard and I'm using Apple's OpenCL 1.0 platform with an NVidia GeForce 9600m. Is running kernels within loops possible in OpenCL on other platforms? Has anyone had problems like this with OpenCL on OS X? What could be causing the command queue to be going invalid?

Could have multiple causes, like (global or local) memory overrun in kernels.

Also global_ws should be multiple of local_ws.

Have a look at the SortingNetworks sample