Scope
This note summarizes debugging on LUMI with AMD ROCm, especially when using Kokkos rather than writing HIP directly.
Typical failure
A representative runtime error was:
```
Memory access fault by GPU node-4 (Agent handle: 0x514a400) on address 0x14f8105fa000. Reason: Unknown.
srun: error: nid005001: task 0: Aborted (core dumped)
srun: Terminating StepId=17146122.0
```
This indicates a GPU-side illegal memory access (a memory violation). The host process then aborts through the HSA runtime, so in many cases the CPU backtrace is not the interesting part.
Common root causes include:
- out-of-bounds access
- invalid device pointer
- use after free
- stale pointer after reallocation
- missing synchronization
- race or lifetime issue
- host pointer accidentally used on device
Core files: core and gpucore.<pid>
ROCm produces a GPU core file such as `gpucore.<pid>`. Linux may also produce a regular host core, typically named `core`.
The GPU core is not meant to be opened by itself like a normal CPU core. ROCgdb expects a merged heterogeneous core.
Correct workflow
Enable core dumps and run outside ROCgdb:
```bash
ulimit -c unlimited
export HSA_ENABLE_DEBUG=1
srun -n 1 ./your_exe ...
```
If you get both a host core and a GPU core, merge them:
```bash
roccoremerge combined core gpucore.<pid>
rocgdb ./your_exe -core combined
```
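In batch scripts it can help to guard the merge so it only runs when both cores actually exist. A minimal sketch (`merge_cores` is a hypothetical helper name; it only prints the commands as a dry run, so you can splice in the real calls once the output looks right):

```shell
#!/bin/sh
# Hypothetical helper: print the merge + debug commands only when
# both the host core and the GPU core are present (dry run).
merge_cores() {
    host_core="$1"
    gpu_core="$2"
    if [ -f "$host_core" ] && [ -f "$gpu_core" ]; then
        echo "roccoremerge combined $host_core $gpu_core"
        echo "rocgdb ./your_exe -core combined"
    else
        echo "missing $host_core or $gpu_core; nothing to merge" >&2
        return 1
    fi
}
```

Call it as, for example, `merge_cores core gpucore.12345` at the end of the job script.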
Notes:
- The host core may simply be called `core`, without a pid suffix.
- ROCgdb can load the merged core even if it prints warnings about `/dev/kfd`, `/dev/dri/…`, or similar mappings.
- A host backtrace that stops in `libhsa-runtime64.so.1` is normal after a GPU queue error.
Why we need merged cores
When opening the merged core, it is common to see only the host side abort path, for example:
- `abort()`
- `libhsa-runtime64.so.1`
- `__pthread_kill_implementation`
That usually means you are looking at the CPU thread that noticed the GPU error, not the GPU wavefront thread that caused it.
Important consequence
`where` (or `bt`) on the current thread is often not enough.
Instead, in ROCgdb inspect all threads:
```gdb
info threads
thread apply all bt
```
AMD GPU wavefronts are represented as threads. The useful information is often in one of those GPU threads, not in thread 1.
Then select the interesting GPU thread and inspect it:
```gdb
thread <n>
bt
frame 0
info registers
info registers system
info registers scalar
info registers vector
```
Interactive ROCgdb versus post mortem cores
For memory violations, interactive ROCgdb is often more useful than post mortem inspection of a merged core.
Start ROCgdb directly:
```bash
rocgdb ./your_exe
```
Inside ROCgdb:
```gdb
set amdgpu precise-memory on
run ...
```
This enables more precise reporting for GPU memory violations.
Important caveat
If ROCgdb is attached during execution, ROCm may not generate the usual AMD GPU core dump in the same way. So:
- interactive debugging, and
- post mortem GPU core generation
are not always compatible workflows.
ROCr Debug Agent
A very useful alternative is AMD’s ROCr Debug Agent, which can report faulting wavefronts and kernel names.
Use:
```bash
export HSA_ENABLE_DEBUG=1
export HSA_TOOLS_LIB=/opt/rocm/lib/librocm-debug-agent.so.2
export ROCM_DEBUG_AGENT_OPTIONS="--all --save-code-objects"
srun -n 1 ./your_exe ...
```
This can help identify:
- the faulting kernel
- wavefront state
- code objects for later inspection
Important observation from the discussion
With these environment variables enabled, the crash disappeared.
This should not be treated as a fix. It strongly suggests a debug-sensitive bug (a Heisenbug), for example:
- race condition
- lifetime issue
- missing fence
- uninitialized data
- asynchronous ordering bug
- out-of-bounds access that depends on timing
The debug agent and HSA debug mode change runtime behavior enough that some bugs disappear.
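One way to probe such timing sensitivity more directly is to serialize GPU work. The sketch below uses HIP's serialization environment variables; to the best of my knowledge `AMD_SERIALIZE_KERNEL` and `AMD_SERIALIZE_COPY` exist in recent ROCm releases, but check the HIP debugging documentation for your version:

```bash
# If the crash pattern changes when kernels and copies are forced to
# run synchronously, the bug is likely ordering- or race-related.
export AMD_SERIALIZE_KERNEL=3   # wait before and after each kernel launch
export AMD_SERIALIZE_COPY=3     # wait before and after each memory copy
srun -n 1 ./your_exe ...
```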
Kokkos specific translation
Build with Kokkos debug options
Recommended options for debugging:
- `CMAKE_BUILD_TYPE=Debug`
- `Kokkos_ENABLE_DEBUG=ON`
- `Kokkos_ENABLE_DEBUG_BOUNDS_CHECK=ON`
- `Kokkos_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK=ON`
If full Debug is too intrusive, RelWithDebInfo is a reasonable compromise.
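Concretely, a debug configure for a HIP build could look like the sketch below. The architecture flag and directory names are assumptions for LUMI's MI250X nodes; the `Kokkos_ENABLE_DEBUG*` options are the ones listed above:

```bash
# Configure and build with Kokkos debug checks enabled
# (HIP backend; gfx90a is the MI250X architecture).
cmake -S . -B build-debug \
  -DCMAKE_BUILD_TYPE=Debug \
  -DKokkos_ENABLE_HIP=ON \
  -DKokkos_ARCH_AMD_GFX90A=ON \
  -DKokkos_ENABLE_DEBUG=ON \
  -DKokkos_ENABLE_DEBUG_BOUNDS_CHECK=ON \
  -DKokkos_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK=ON
cmake --build build-debug -j
```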
Translating backtrace addresses to lines
Given a file containing lines such as:
```
[0xc846e8]
[0xc63efd]
...
```
you can resolve them with:
```bash
grep -o '\[0x[0-9a-fA-F]\+\]' your_file | tr -d '[]' | xargs llvm-addr2line -e /path/to/your/exe -f -C
```
Or with GNU addr2line:
```bash
grep -o '\[0x[0-9a-fA-F]\+\]' your_file | tr -d '[]' | xargs addr2line -e /path/to/your/exe -f -C
```
This helps map Kokkos bounds check backtraces to source locations.
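To sanity-check the extraction half of the pipeline without the executable at hand, you can run it on a small sample file (the addresses are the ones from above; the `addr2line` step itself still needs your real binary):

```shell
#!/bin/sh
# Reproduce the address-extraction part of the pipeline on sample input:
# pull out the bracketed hex addresses and strip the brackets.
cat > sample_bt.txt <<'EOF'
Kokkos bounds check backtrace:
[0xc846e8]
[0xc63efd]
EOF
grep -o '\[0x[0-9a-fA-F]\+\]' sample_bt.txt | tr -d '[]'
```

This prints `0xc846e8` and `0xc63efd` on separate lines, ready to pipe into `xargs addr2line`.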
Practical debugging checklist
ROCm and LUMI side
- Enable core dumps:

  ```bash
  ulimit -c unlimited
  ```

- Run once outside ROCgdb:

  ```bash
  export HSA_ENABLE_DEBUG=1
  srun -n 1 ./your_exe …
  ```

- If both files exist, merge:

  ```bash
  roccoremerge combined core gpucore.<pid>
  rocgdb ./your_exe -core combined
  ```

- In ROCgdb, inspect GPU threads:

  ```gdb
  info threads
  thread apply all bt
  ```

- For better memory fault locations, use interactive ROCgdb:

  ```gdb
  set amdgpu precise-memory on
  run …
  ```

- If needed, run with the ROCr Debug Agent:

  ```bash
  export HSA_ENABLE_DEBUG=1
  export HSA_TOOLS_LIB=/opt/rocm/lib/librocm-debug-agent.so.2
  export ROCM_DEBUG_AGENT_OPTIONS="--all --save-code-objects"
  srun -n 1 ./your_exe …
  ```

- Narrow the problem to a single kernel with fences and labels.
- Treat “bug disappears under debug agent” as evidence of a timing-sensitive bug, not a fix.