gpu - Nvidia OpenCL hangs on blocking buffer access


I have an OpenCL program that copies a bunch of values into an input buffer, processes these values, and copies the results back.

    // Map the input data buffer, which has CL_MEM_ALLOC_HOST_PTR
    cl_float* data = clEnqueueMapBuffer(queue, data_buffer, CL_TRUE, CL_MAP_WRITE,
                                        0, data_size, 0, NULL, NULL, NULL);

    // Set the input values
    for(size_t i = 0; i < n; ++i)
        data[i] = values[i];

    // Unmap the input buffer
    clEnqueueUnmapMemObject(queue, data_buffer, data, 0, NULL, NULL);

    // Run the kernels
    ...

    // Map the results buffer, which has CL_MEM_ALLOC_HOST_PTR
    cl_float* results = clEnqueueMapBuffer(queue, results_buffer, CL_TRUE, CL_MAP_READ,
                                           0, results_size, 0, NULL, NULL, NULL);

    // Processing
    ...

    // Unmap the results buffer
    clEnqueueUnmapMemObject(queue, results_buffer, results, 0, NULL, NULL);

(In the real code, I check for errors etc.)

This works great on AMD and Intel architectures (both CPU and GPU). On Nvidia GPUs, however, this code is incredibly slow. A program that normally takes 10 seconds to run (5 seconds host, 5 seconds device) takes more than two and a half minutes on Nvidia cards.

However, I have found that this is not a straightforward optimisation problem or a zero-copy speed difference. Using a profiler, I can see that the host time of the program is 5 seconds, as in the normal case. And using OpenCL profiling events, I can see that the device time is also 5 seconds, as in the normal case!

So I used the poor man's profiler trick to figure out where the program spends its time on Nvidia GPUs. It shows that the program waits idly on both of the clEnqueueMapBuffer calls. I find this especially incomprehensible in the first instance, as the queue is empty at that point.

I repeat: I have profiled every map/unmap and kernel invocation, and the time does not show up there. It is spent neither on the device nor on the host. I can see from the stack profile that the program is waiting on a semaphore instead. Does anyone know what's causing this hang?
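One way to confirm that a blocking call is idling rather than working is to compare wall-clock time against process CPU time around it. The following is only a minimal sketch under stated assumptions: it needs a POSIX system, the names `read_clock_seconds`, `call_timing`, `time_call` and `idle_wait_ms` are hypothetical and not part of the program above, and `idle_wait_ms` is merely a sleeping stand-in for the blocking `clEnqueueMapBuffer` call.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime and nanosleep */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: read a POSIX clock and return its value in seconds. */
static double read_clock_seconds(clockid_t id)
{
    struct timespec ts;
    int rc = clock_gettime(id, &ts);
    assert(rc == 0);
    (void)rc;  /* silence unused warning when NDEBUG is defined */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Wall-clock and CPU time consumed by one call. */
typedef struct {
    double wall_seconds;  /* elapsed real time */
    double cpu_seconds;   /* CPU time used by the whole process */
} call_timing;

typedef void (*timed_fn)(void *arg);

/* Run fn(arg) and measure it with both clocks. A large gap between
 * wall_seconds and cpu_seconds suggests the call was blocked
 * (e.g. sleeping on a semaphore) instead of computing. */
static call_timing time_call(timed_fn fn, void *arg)
{
    call_timing t;
    double wall0 = read_clock_seconds(CLOCK_MONOTONIC);
    double cpu0  = read_clock_seconds(CLOCK_PROCESS_CPUTIME_ID);
    fn(arg);
    t.wall_seconds = read_clock_seconds(CLOCK_MONOTONIC) - wall0;
    t.cpu_seconds  = read_clock_seconds(CLOCK_PROCESS_CPUTIME_ID) - cpu0;
    return t;
}

/* Stand-in for a blocking call: sleeps for *arg milliseconds
 * without consuming CPU. In the real program this would wrap
 * the clEnqueueMapBuffer call instead. */
static void idle_wait_ms(void *arg)
{
    long ms = *(long *)arg;
    struct timespec req;
    req.tv_sec  = ms / 1000;
    req.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&req, NULL);
}
```

Calling `time_call(idle_wait_ms, &ms)` with `long ms = 50;` should report a wall-clock time of roughly 50 ms and a much smaller CPU time. Note that `CLOCK_PROCESS_CPUTIME_ID` counts all threads of the process, so a driver thread that spins while waiting would still show up as CPU time.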

