Scope
This note summarizes debugging on LUMI with AMD ROCm, especially when using Kokkos rather than writing HIP directly.
Typical failure
A representative runtime error was:
```
Memory access fault by GPU node-4 (Agent handle: 0x514a400) on address 0x14f8105fa000. Reason: Unknown.
srun: error: nid005001: task 0: Aborted (core dumped)
srun: Terminating StepId=17146122.0
```
This indicates a GPU-side illegal memory access (a memory violation). The host process then aborts through the HSA runtime, so in many cases the CPU backtrace is not the interesting part.
Common root causes include:
- out-of-bounds access
- invalid device pointer
- use after free
- stale pointer after reallocation
- missing synchronization
- race or lifetime issue
- host pointer accidentally used on device
Core files: core and gpucore.<pid>
ROCm produces a GPU core file such as `gpucore.<pid>`. Linux may also produce a regular host core, typically named `core`.
The GPU core is not meant to be opened by itself like a normal CPU core. ROCgdb expects a merged heterogeneous core.
Correct workflow
Enable core dumps and run outside ROCgdb:
```bash
ulimit -c unlimited
export HSA_ENABLE_DEBUG=1
srun -n 1 ./your_exe ...
```
If you get both a host core and a GPU core, merge them:
```bash
roccoremerge combined core gpucore.<pid>
rocgdb ./your_exe -core combined
```
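In batch scripts it can help to guard the merge so it only runs when both cores actually exist. A minimal sketch (`merge_cores` is a hypothetical helper name; it only prints the commands as a dry run, so you can splice in the real calls once the output looks right):

```shell
#!/bin/sh
# Hypothetical helper: print the merge + debug commands only when
# both the host core and the GPU core are present (dry run).
merge_cores() {
    host_core="$1"
    gpu_core="$2"
    if [ -f "$host_core" ] && [ -f "$gpu_core" ]; then
        echo "roccoremerge combined $host_core $gpu_core"
        echo "rocgdb ./your_exe -core combined"
    else
        echo "missing $host_core or $gpu_core; nothing to merge" >&2
        return 1
    fi
}
```

Call it as, for example, `merge_cores core gpucore.12345` at the end of the job script.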
Notes:
- The host core may simply be called `core`, without a pid suffix.
- ROCgdb can load the merged core even if it prints warnings about `/dev/kfd`, `/dev/dri/…`, or similar mappings.
- A host backtrace that stops in `libhsa-runtime64.so.1` is normal after a GPU queue error.
Why we need merged cores
When opening the merged core, it is common to see only the host side abort path, for example:
- `abort()`
- `libhsa-runtime64.so.1`
- `__pthread_kill_implementation`
That usually means you are looking at the CPU thread that noticed the GPU error, not the GPU wavefront thread that caused it.
Important consequence
`where` (or `bt`) on the current thread is often not enough.
Instead, in ROCgdb inspect all threads:
```gdb
info threads
thread apply all bt
```
AMD GPU wavefronts are represented as threads. The useful information is often in one of those GPU threads, not in thread 1.
Then select the interesting GPU thread and inspect it:
```gdb
thread <n>
bt
frame 0
info registers
info registers system
info registers scalar
info registers vector
```
Interactive ROCgdb versus post mortem cores
For memory violations, interactive ROCgdb is often more useful than post mortem inspection of a merged core.
Start ROCgdb directly:
```bash
rocgdb ./your_exe
```
Inside ROCgdb:
```gdb
set amdgpu precise-memory on
run ...
```
This enables more precise reporting for GPU memory violations.
Important caveat
If ROCgdb is attached during execution, ROCm may not generate the usual AMD GPU core dump in the same way. So:
- interactive debugging, and
- post mortem GPU core generation
are not always compatible workflows.
ROCr Debug Agent
A very useful alternative is AMD’s ROCr Debug Agent, which can report faulting wavefronts and kernel names.
Use:
```bash
export HSA_ENABLE_DEBUG=1
export HSA_TOOLS_LIB=/opt/rocm/lib/librocm-debug-agent.so.2
export ROCM_DEBUG_AGENT_OPTIONS="--all --save-code-objects"
srun -n 1 ./your_exe ...
```
This can help identify:
- the faulting kernel
- wavefront state
- code objects for later inspection
Important observation from the discussion
With these environment variables enabled, the crash disappeared.
This should not be treated as a fix. It strongly suggests a debug-sensitive bug (a Heisenbug), for example:
- race condition
- lifetime issue
- missing fence
- uninitialized data
- asynchronous ordering bug
- out-of-bounds access that depends on timing
The debug agent and HSA debug mode change runtime behavior enough that some bugs disappear.
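One way to probe such timing sensitivity more directly is to serialize GPU work. The sketch below uses HIP's serialization environment variables; to the best of my knowledge `AMD_SERIALIZE_KERNEL` and `AMD_SERIALIZE_COPY` exist in recent ROCm releases, but check the HIP debugging documentation for your version:

```bash
# If the crash pattern changes when kernels and copies are forced to
# run synchronously, the bug is likely ordering- or race-related.
export AMD_SERIALIZE_KERNEL=3   # wait before and after each kernel launch
export AMD_SERIALIZE_COPY=3     # wait before and after each memory copy
srun -n 1 ./your_exe ...
```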
Kokkos specific translation
Build with Kokkos debug options
Recommended options for debugging:
- `CMAKE_BUILD_TYPE=Debug`
- `Kokkos_ENABLE_DEBUG=ON`
- `Kokkos_ENABLE_DEBUG_BOUNDS_CHECK=ON`
- `Kokkos_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK=ON`
If full Debug is too intrusive, RelWithDebInfo is a reasonable compromise.
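Concretely, a debug configure for a HIP build could look like the sketch below. The architecture flag and directory names are assumptions for LUMI's MI250X nodes; the `Kokkos_ENABLE_DEBUG*` options are the ones listed above:

```bash
# Configure and build with Kokkos debug checks enabled
# (HIP backend; gfx90a is the MI250X architecture).
cmake -S . -B build-debug \
  -DCMAKE_BUILD_TYPE=Debug \
  -DKokkos_ENABLE_HIP=ON \
  -DKokkos_ARCH_AMD_GFX90A=ON \
  -DKokkos_ENABLE_DEBUG=ON \
  -DKokkos_ENABLE_DEBUG_BOUNDS_CHECK=ON \
  -DKokkos_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK=ON
cmake --build build-debug -j
```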
Translating backtrace addresses to lines
Given a file containing lines such as:
```
[0xc846e8]
[0xc63efd]
...
```
you can resolve them with:
```bash
grep -o '\[0x[0-9a-fA-F]\+\]' your_file | tr -d '[]' | xargs llvm-addr2line -e /path/to/your/exe -f -C
```
Or with GNU addr2line:
```bash
grep -o '\[0x[0-9a-fA-F]\+\]' your_file | tr -d '[]' | xargs addr2line -e /path/to/your/exe -f -C
```
This helps map Kokkos bounds check backtraces to source locations.
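To sanity-check the extraction half of the pipeline without the executable at hand, you can run it on a small sample file (the addresses are the ones from above; the `addr2line` step itself still needs your real binary):

```shell
#!/bin/sh
# Reproduce the address-extraction part of the pipeline on sample input:
# pull out the bracketed hex addresses and strip the brackets.
cat > sample_bt.txt <<'EOF'
Kokkos bounds check backtrace:
[0xc846e8]
[0xc63efd]
EOF
grep -o '\[0x[0-9a-fA-F]\+\]' sample_bt.txt | tr -d '[]'
```

This prints `0xc846e8` and `0xc63efd` on separate lines, ready to pipe into `xargs addr2line`.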
Practical debugging checklist
ROCm and LUMI side
- Enable core dumps:

  ```bash
  ulimit -c unlimited
  ```

- Run once outside ROCgdb:

  ```bash
  export HSA_ENABLE_DEBUG=1
  srun -n 1 ./your_exe …
  ```

- If both files exist, merge:

  ```bash
  roccoremerge combined core gpucore.<pid>
  rocgdb ./your_exe -core combined
  ```

- In ROCgdb, inspect GPU threads:

  ```gdb
  info threads
  thread apply all bt
  ```

- For better memory fault locations, use interactive ROCgdb:

  ```gdb
  set amdgpu precise-memory on
  run …
  ```

- If needed, run with the ROCr Debug Agent:

  ```bash
  export HSA_ENABLE_DEBUG=1
  export HSA_TOOLS_LIB=/opt/rocm/lib/librocm-debug-agent.so.2
  export ROCM_DEBUG_AGENT_OPTIONS="--all --save-code-objects"
  srun -n 1 ./your_exe …
  ```

- Narrow the problem to a single kernel with fences and labels.
- Treat “bug disappears under debug agent” as evidence of a timing-sensitive bug, not a fix.