We’re implementing the OpenACC Profiling Interface as defined by the OpenACC 2.6 specification. We’re clarifying some aspects here as implementation-defined behavior, while they’re still under discussion within the OpenACC Technical Committee.
This implementation is tuned to keep the performance impact as low as possible for the (very common) case that the Profiling Interface is not enabled. This matters because the Profiling Interface affects all the hot code paths (in the target code, not in the offloaded code). Users of the OpenACC Profiling Interface can be expected to understand that performance will be impacted to some degree once the Profiling Interface is enabled: for example, because the runtime (libgomp) calls into a third-party library for every event that has been registered.
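The cost model described here can be pictured with a minimal sketch. It is not libgomp's actual code: the flag prof_enabled, the functions enable_profiling and dispatch_event, and the one-argument callback type are hypothetical names chosen for illustration only. The point is that a disabled interface costs a single flag test on the hot path, while an enabled one pays for an indirect call per dispatched event.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type; the real interface passes three info
   structures instead of a plain event number.  */
typedef void (*event_callback) (int event);

/* Hypothetical global state: false until a profiling library registers.  */
static bool prof_enabled = false;
static event_callback registered_cb = NULL;

/* Number of callbacks actually invoked, kept only for demonstration.  */
static int dispatch_count = 0;

static void
enable_profiling (event_callback cb)
{
  registered_cb = cb;
  prof_enabled = (cb != NULL);
}

/* Hot-path hook: returns immediately when profiling is disabled.  */
static inline void
dispatch_event (int event)
{
  if (!prof_enabled)        /* Common case: one predictable branch.  */
    return;
  ++dispatch_count;
  registered_cb (event);    /* Slow path: call into the profiling library.  */
}

/* Example callback that just remembers the last event it saw.  */
static int last_event = -1;

static void
record_event (int event)
{
  last_event = event;
}
```

Calling dispatch_event before enable_profiling does nothing beyond the flag test; after enable_profiling (record_event), every call reaches the callback.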
We’re not yet accounting for the fact that OpenACC events may occur during event processing. We just handle one case specially, as required by CUDA 9.0 nvprof: acc_get_device_type (acc_get_device_type – Get type of device accelerator to be used) may be called from acc_ev_device_init_start, acc_ev_device_init_end callbacks.
We’re not yet implementing initialization via an acc_register_library function that is either statically linked in, or dynamically loaded via LD_PRELOAD. Initialization via acc_register_library functions dynamically loaded via the ACC_PROFLIB environment variable does work, as does directly calling acc_prof_register, acc_prof_unregister, acc_prof_lookup.
As currently there are no inquiry functions defined, calls to
acc_prof_lookup will always return NULL.
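A minimal sketch of a profiling library of the kind that could be loaded via ACC_PROFLIB follows. To keep it compilable without an OpenACC-enabled toolchain, the type declarations are simplified stand-ins (an assumption of this sketch); a real library would include <acc_prof.h> and use the definitions from there. The callback count_launches, the counter launches, and the test double fake_reg are hypothetical names.

```c
#include <stddef.h>

/* Simplified stand-ins for declarations that <acc_prof.h> normally
   provides.  The enumerator values are placeholders, not the real ones.  */
typedef enum { acc_ev_enqueue_launch_start, acc_ev_enqueue_launch_end } acc_event_t;
typedef enum { acc_reg } acc_register_t;
typedef struct acc_prof_info acc_prof_info;
typedef union acc_event_info acc_event_info;
typedef struct acc_api_info acc_api_info;
typedef void (*acc_prof_callback) (acc_prof_info *, acc_event_info *,
                                   acc_api_info *);
typedef void (*acc_prof_reg) (acc_event_t, acc_prof_callback, acc_register_t);
typedef void (*acc_query_fn) (void);
typedef acc_query_fn (*acc_prof_lookup_func) (const char *);

/* Hypothetical tool state: number of kernel launches observed.  */
static unsigned long launches;

static void
count_launches (acc_prof_info *prof_info, acc_event_info *event_info,
                acc_api_info *api_info)
{
  (void) prof_info;
  (void) event_info;
  (void) api_info;
  ++launches;
}

/* Entry point that the runtime calls after loading the library.  */
void
acc_register_library (acc_prof_reg reg, acc_prof_reg unreg,
                      acc_prof_lookup_func lookup)
{
  (void) unreg;
  (void) lookup;  /* No inquiry functions are defined, so nothing to look up.  */
  reg (acc_ev_enqueue_launch_start, count_launches, acc_reg);
}

/* Test double standing in for the runtime's registration function.  */
static acc_event_t installed_event = acc_ev_enqueue_launch_end;
static acc_prof_callback installed_cb;

static void
fake_reg (acc_event_t ev, acc_prof_callback cb, acc_register_t info)
{
  (void) info;
  installed_event = ev;
  installed_cb = cb;
}
```

The library never calls the lookup argument, which matches the situation described above: with no inquiry functions defined, a lookup could only yield NULL.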
There aren’t separate start, stop events defined for the
event types acc_ev_create, acc_ev_delete,
acc_ev_alloc, acc_ev_free. It’s not clear if these
should be triggered before or after the actual device-specific call is
made. We trigger them after.
Remarks about data provided to callbacks:
acc_prof_info.event_type: It’s not clear if for nested event callbacks (for example, acc_ev_enqueue_launch_start as part of a parent compute construct), this should be set for the nested event (acc_ev_enqueue_launch_start), or if the value of the parent construct should remain (acc_ev_compute_construct_start). In this implementation, the value will generally correspond to the innermost nested event type.
acc_prof_info.device_type:
For acc_ev_compute_construct_start, and in presence of an if clause with false argument, this will still refer to the offloading device type. It’s not clear if that’s the expected behavior.
For acc_ev_compute_construct_end, this is set to acc_device_host in presence of an if clause with false argument. It’s not clear if that’s the expected behavior.
acc_prof_info.thread_id: Always -1; not yet implemented.
acc_prof_info.async:
Not yet implemented correctly for acc_ev_compute_construct_start.
For acc_device_host, it will always be acc_async_sync. It’s not clear if that’s the expected behavior.
For acc_ev_device_init_start and acc_ev_device_init_end, it will always be acc_async_sync. It’s not clear if that’s the expected behavior.
acc_prof_info.async_queue: libgomp does not limit the number of asynchronous queues. This will always have the same value as acc_prof_info.async.
acc_prof_info.src_file: Always NULL; not yet implemented.
acc_prof_info.func_name: Always NULL; not yet implemented.
acc_prof_info.line_no: Always -1; not yet implemented.
acc_prof_info.end_line_no: Always -1; not yet implemented.
acc_prof_info.func_line_no: Always -1; not yet implemented.
acc_prof_info.func_end_line_no: Always -1; not yet implemented.
acc_event_info.event_type, acc_event_info.*.event_type: Relating to acc_prof_info.event_type discussed above, in this implementation, this will always be the same value as acc_prof_info.event_type.
acc_event_info.*.parent_construct:
Will be acc_construct_parallel for all OpenACC compute constructs as well as many OpenACC Runtime API calls; should be the one matching the actual construct, or acc_construct_runtime_api, respectively.
Will be acc_construct_enter_data or acc_construct_exit_data when processing variable mappings specified in OpenACC declare directives; should be acc_construct_declare.
For implicit acc_ev_device_init_start, acc_ev_device_init_end, and explicit as well as implicit acc_ev_alloc, acc_ev_free, acc_ev_enqueue_upload_start, acc_ev_enqueue_upload_end, acc_ev_enqueue_download_start, and acc_ev_enqueue_download_end, will be acc_construct_parallel; should reflect the real parent construct.
acc_event_info.*.implicit: For acc_ev_alloc, acc_ev_free, acc_ev_enqueue_upload_start, acc_ev_enqueue_upload_end, acc_ev_enqueue_download_start, and acc_ev_enqueue_download_end, this currently will be 1 also for explicit usage.
acc_event_info.data_event.var_name: Always NULL; not yet implemented.
acc_event_info.data_event.host_ptr: For acc_ev_alloc and acc_ev_free, this is always NULL.
typedef union acc_api_info: … as printed in 5.2.3. Third Argument: API-Specific Information. This should obviously be typedef struct acc_api_info.
acc_api_info.device_api: Possibly not yet implemented correctly for acc_ev_compute_construct_start, acc_ev_device_init_start, acc_ev_device_init_end: will always be acc_device_api_none for these event types. For acc_ev_enter_data_start, it will be acc_device_api_none in some cases.
acc_api_info.device_type: Always the same as acc_prof_info.device_type.
acc_api_info.vendor: Always -1; not yet implemented.
acc_api_info.device_handle: Always NULL; not yet implemented.
acc_api_info.context_handle: Always NULL; not yet implemented.
acc_api_info.async_handle: Always NULL; not yet implemented.
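Because several of these fields are placeholders in this implementation, a tool's callback is better off treating them as optional. The sketch below formats a source location defensively. The structure prof_info_subset is a cut-down stand-in carrying only the fields it uses (an assumption made so that the sketch compiles without <acc_prof.h>), and format_location is a hypothetical helper.

```c
#include <stdio.h>
#include <stddef.h>

/* Cut-down stand-in for acc_prof_info; field types are simplified.  */
struct prof_info_subset
{
  const char *src_file;   /* Currently always NULL.  */
  const char *func_name;  /* Currently always NULL.  */
  int line_no;            /* Currently always -1.  */
};

/* Write a printable location into BUF, substituting "?" for every
   field that is not set.  Returns BUF.  */
static const char *
format_location (const struct prof_info_subset *pi, char *buf, size_t size)
{
  const char *file = pi->src_file ? pi->src_file : "?";
  const char *func = pi->func_name ? pi->func_name : "?";

  if (pi->line_no >= 0)
    snprintf (buf, size, "%s:%d (%s)", file, pi->line_no, func);
  else
    snprintf (buf, size, "%s:? (%s)", file, func);
  return buf;
}
```

Written this way, the same callback keeps working unchanged if the fields are filled in by a later implementation.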
Remarks about certain event types:
acc_ev_device_init_start, acc_ev_device_init_end:
When a compute construct triggers implicit acc_ev_device_init_start and acc_ev_device_init_end events, they currently aren’t nested within the corresponding acc_ev_compute_construct_start and acc_ev_compute_construct_end, but they’re currently observed before acc_ev_compute_construct_start. It’s not clear what to do: the standard asks us to provide a lot of details to the acc_ev_compute_construct_start callback, without (implicitly) initializing a device before?
Callbacks for these event types will not be invoked for calls to the acc_set_device_type and acc_set_device_num functions. It’s not clear if they should be.
acc_ev_enter_data_start, acc_ev_enter_data_end, acc_ev_exit_data_start, acc_ev_exit_data_end:

Callbacks for the following event types will be invoked, but dispatch and the information provided therein have not yet been thoroughly reviewed:
acc_ev_alloc
acc_ev_free
acc_ev_update_start, acc_ev_update_end
acc_ev_enqueue_upload_start, acc_ev_enqueue_upload_end
acc_ev_enqueue_download_start, acc_ev_enqueue_download_end
During device initialization, and finalization, respectively, callbacks for the following event types will not yet be invoked:
acc_ev_alloc
acc_ev_free
Callbacks for the following event types have not yet been implemented, so currently won’t be invoked:
acc_ev_device_shutdown_start, acc_ev_device_shutdown_end
acc_ev_runtime_shutdown
acc_ev_create, acc_ev_delete
acc_ev_wait_start, acc_ev_wait_end
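Since acc_ev_runtime_shutdown and the device shutdown events are not delivered, a profiling library cannot rely on them to flush its results. One possible workaround, sketched here with hypothetical names (tool_init, tool_finalize, events_seen), is to register an atexit handler when the library is initialized and to make finalization idempotent, so that it stays correct if shutdown callbacks are implemented later and call the same function.

```c
#include <stdio.h>
#include <stdlib.h>

static unsigned long events_seen;  /* Updated by the tool's callbacks.  */
static int finalized;              /* Guards against running twice.  */
static int reports_written;        /* Kept only for demonstration.  */

/* Emit the summary once, no matter how many times this is called.  */
static void
tool_finalize (void)
{
  if (finalized)
    return;
  finalized = 1;
  fprintf (stderr, "OpenACC events seen: %lu\n", events_seen);
  ++reports_written;
}

/* Intended to be called from acc_register_library.  Returns 0 on
   success, like atexit.  */
static int
tool_init (void)
{
  return atexit (tool_finalize);
}
```

The guard variable means an explicit call followed by the atexit invocation produces a single report.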
For the following runtime library functions, not all expected callbacks will be invoked (mostly concerning implicit device initialization):
acc_get_num_devices
acc_set_device_type
acc_get_device_type
acc_set_device_num
acc_get_device_num
acc_init
acc_shutdown
Aside from implicit device initialization, for the following runtime library functions, no callbacks will be invoked for shared-memory offloading devices (it’s not clear if they should be):
acc_malloc
acc_free
acc_copyin, acc_present_or_copyin, acc_copyin_async
acc_create, acc_present_or_create, acc_create_async
acc_copyout, acc_copyout_async, acc_copyout_finalize, acc_copyout_finalize_async
acc_delete, acc_delete_async, acc_delete_finalize, acc_delete_finalize_async
acc_update_device, acc_update_device_async
acc_update_self, acc_update_self_async
acc_map_data, acc_unmap_data
acc_memcpy_to_device, acc_memcpy_to_device_async
acc_memcpy_from_device, acc_memcpy_from_device_async