I have an OpenCL program that copies a bunch of values into an input buffer, processes these values, and copies the results back.

    // Map the input data buffer, which has CL_MEM_ALLOC_HOST_PTR
    cl_float* data = clEnqueueMapBuffer(queue, data_buffer, CL_TRUE, CL_MAP_WRITE, 0, data_size, 0, NULL, NULL, NULL);

    // Set the input values
    for(size_t i = 0; i < n; ++i)
        data[i] = values[i];

    // Unmap the input buffer
    clEnqueueUnmapMemObject(queue, data_buffer, data, 0, NULL, NULL);

    // Run the kernels
    ...

    // Map the results buffer, which has CL_MEM_ALLOC_HOST_PTR
    cl_float* results = clEnqueueMapBuffer(queue, results_buffer, CL_TRUE, CL_MAP_READ, 0, results_size, 0, NULL, NULL, NULL);

    // Processing
    ...

    // Unmap the results buffer
    clEnqueueUnmapMemObject(queue, results_buffer, results, 0, NULL, NULL);

(In the real code, I check for errors etc.)

This works great on AMD and Intel architectures (both CPU and GPU). On NVIDIA GPUs, however, this code is incredibly slow. A program that takes 10 seconds to run (5 seconds host, 5 seconds device) takes more than two and a half minutes to run on NVIDIA ca...