GPUDirect RDMA (Remote Direct Memory Access) is a technology that enables a direct path for data exchange between the GPU and a third-party peer device using standard features of PCI Express.
The NVIDIA GPU driver package includes a kernel module, nvidia-peermem, which gives Mellanox InfiniBand-based HCAs (Host Channel Adapters) direct peer-to-peer read and write access to the NVIDIA GPU's video memory. This lets GPUDirect RDMA-based applications combine GPU computing power with the RDMA interconnect without copying data to host memory.
This capability is supported with Mellanox ConnectX-3 VPI or newer adapters. It works with both InfiniBand and RoCE (RDMA over Converged Ethernet) technologies.
Mellanox OFED (OpenFabrics Enterprise Distribution), or MOFED, introduces an API between the InfiniBand Core and peer memory clients such as NVIDIA GPUs. The nvidia-peermem module registers the NVIDIA GPU with the InfiniBand subsystem by using peer-to-peer APIs provided by the NVIDIA GPU driver.
This module, originally maintained by Mellanox on GitHub, is now included with the NVIDIA Linux GPU driver. The original GitHub project at https://github.com/Mellanox/nv_peer_memory should be considered deprecated; only critical bugs will be addressed for existing installations.
As a prerequisite for loading and using nvidia-peermem, the kernel must support RDMA peer memory, either through additional kernel patches or through the Mellanox OFED package (MOFED).
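One way to check this prerequisite is to look for the peer memory registration symbol in the running kernel's symbol table. The following is only a minimal sketch: the symbol name ib_register_peer_memory_client is an assumption based on the MOFED peer memory API and is not stated in this document, and KALLSYMS is a hypothetical override variable introduced for illustration. If the InfiniBand core is built as a module, the symbol will likely only be visible once that module is loaded.

```shell
#!/bin/sh
# Sketch: look for the peer memory registration symbol in the kernel symbol table.
# ASSUMPTION: the symbol is named ib_register_peer_memory_client.
# ASSUMPTION: KALLSYMS is a hypothetical override; the default is /proc/kallsyms.
KALLSYMS="${KALLSYMS:-/proc/kallsyms}"

has_peer_memory_support() {
    # Each line looks like: "<address> <type> <symbol> [module]"
    awk '$3 == "ib_register_peer_memory_client" { found = 1; exit }
         END { exit !found }' "$KALLSYMS"
}

if has_peer_memory_support; then
    echo "peer memory API found"
else
    echo "peer memory API not found (is the InfiniBand core loaded?)"
fi
```

A negative result does not prove that support is missing, only that the symbol was not visible when the check ran.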
The nv_peer_mem module from the GitHub project may already be installed and loaded on the system. Installing nvidia-peermem does not affect the functionality of an existing nv_peer_mem module. However, only one of the two modules can be loaded at any time, so users must disable the nv_peer_mem service before loading and using nvidia-peermem. Uninstalling the nv_peer_mem package is also encouraged, to avoid any conflict with nvidia-peermem.
Stop the nv_peer_mem service:
# service nv_peer_mem stop
Check if nv_peer_mem.ko is still loaded after stopping the service:
# lsmod | grep nv_peer_mem
If nv_peer_mem.ko is still loaded, unload it with:
# rmmod nv_peer_mem
Uninstall the nv_peer_mem package:
For DEB-based operating systems:
# dpkg -P nvidia-peer-memory
# dpkg -P nvidia-peer-memory-dkms
For RPM-based operating systems:
# rpm -e nvidia_peer_memory
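The check-and-unload step above can be scripted as a dry run that only reports what needs to be done. This is a sketch under stated assumptions, not part of the driver package: MODULES_FILE is a hypothetical override variable, and the default /proc/modules is the kernel's list of loaded modules, the same data lsmod displays.

```shell
#!/bin/sh
# Sketch: report whether the legacy nv_peer_mem module still needs unloading.
# Nothing is unloaded here; the function only prints the suggested command.
# ASSUMPTION: MODULES_FILE is a hypothetical override; default is /proc/modules.
MODULES_FILE="${MODULES_FILE:-/proc/modules}"

legacy_module_action() {
    # The first field of each /proc/modules line is the module name.
    if grep -q '^nv_peer_mem ' "$MODULES_FILE"; then
        echo "rmmod nv_peer_mem"
    else
        echo "nothing to unload"
    fi
}

legacy_module_action
```

Printing the command instead of running it keeps the sketch safe to run without root privileges.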
After ensuring kernel support and installing the GPU driver, load nvidia-peermem by running the following command as root:
# modprobe nvidia-peermem
Note: If the NVIDIA GPU driver is installed before MOFED, the GPU driver must be uninstalled and installed again so that nvidia-peermem is compiled with the RDMA APIs that MOFED provides.
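After running modprobe, it can be useful to confirm that the module is actually loaded. The kernel lists module names with underscores in place of hyphens, so nvidia-peermem appears as nvidia_peermem in lsmod and /proc/modules. The sketch below illustrates that check. The helper names and the MODULES_FILE override are hypothetical, introduced only for this example.

```shell
#!/bin/sh
# Sketch: confirm that nvidia-peermem is loaded after modprobe.
# ASSUMPTION: MODULES_FILE is a hypothetical override; default is /proc/modules.
MODULES_FILE="${MODULES_FILE:-/proc/modules}"

listed_name() {
    # Loaded modules are listed with '-' replaced by '_'.
    printf '%s\n' "$1" | tr '-' '_'
}

peermem_loaded() {
    # Match the first field exactly so that "nvidia" alone does not count.
    awk -v m="$(listed_name nvidia-peermem)" \
        '$1 == m { found = 1; exit } END { exit !found }' "$MODULES_FILE"
}

if peermem_loaded; then
    echo "nvidia-peermem is loaded"
else
    echo "nvidia-peermem is not loaded"
fi
```

If the module is reported as not loaded, the kernel log (for example, via dmesg) is a reasonable next place to look.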