Day 46: Peer-to-Peer Communication (Multi-GPU)
Objective:
Learn how to enable Peer-to-Peer (P2P) access between multiple GPUs on the same system, allowing one GPU to read/write data located in another GPU’s memory without routing through the host. This direct data transfer can significantly reduce overhead if the GPUs are P2P-compatible and share a high-speed interconnect (like NVLink or PCIe with direct peer support). However, P2P is blocked when the GPUs do not support direct peer access or sit on different PCIe segments.
Key Reference: CUDA C Programming Guide – Peer-to-Peer Memory Access (see References below).
Table of Contents
- Overview
- What Is Peer-to-Peer (P2P) Communication?
- P2P Requirements & Checking Compatibility
- Enabling P2P in Practice
- Conceptual Diagrams
- Common Pitfalls & Best Practices
- References & Further Reading
- Conclusion
- Next Steps
1. Overview
On multi-GPU systems, data often must be moved between GPUs. Without P2P, the data typically travels through host memory. With peer-to-peer enabled, GPUs can directly access each other’s memory if supported, drastically reducing latency and overhead.
However:
- Not all GPU pairs are P2P-compatible. On some systems or older GPUs, direct peer access isn’t possible.
- Even if supported, large data transfers should be tested to confirm actual performance benefits.
- We typically call `cudaDeviceEnablePeerAccess()` from the host to turn it on.
2. What Is Peer-to-Peer (P2P) Communication?
- Definition: P2P memory access means GPU A can read/write GPU B’s global memory directly over the GPU interconnect (NVLink or PCIe peer pathway).
- Host Bypass: No need to copy from GPU A -> host -> GPU B. Instead, a kernel on GPU A can run with a pointer referencing GPU B’s memory, or `cudaMemcpyPeerAsync()` can copy directly between device IDs.
- Conditions:
  - GPUs must reside in the same “peer group” or have a direct interconnect.
  - The system must have no OS or driver restrictions that disable P2P.
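As a minimal sketch of the asynchronous variant (assuming peer access between devices 0 and 1 is already enabled, and `d_dst`/`d_src` are valid device allocations of at least `bytes` on devices 1 and 0 respectively):

```cpp
#include <cuda_runtime.h>

// Minimal sketch: asynchronously copy `bytes` from a device-0 buffer to a
// device-1 buffer, assuming peer access between the two is already enabled.
void peerCopyAsync(float* d_dst, const float* d_src, size_t bytes) {
    cudaStream_t stream;
    cudaSetDevice(1);                  // create the stream on the destination device
    cudaStreamCreate(&stream);
    cudaMemcpyPeerAsync(d_dst, 1,      // dst pointer, dst device
                        d_src, 0,      // src pointer, src device
                        bytes, stream);
    cudaStreamSynchronize(stream);     // wait for the copy to complete
    cudaStreamDestroy(stream);
}
```

The stream argument lets the copy overlap with other work instead of blocking; synchronizing immediately, as above, just keeps the sketch simple.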
3. P2P Requirements & Checking Compatibility
- Hardware: GPU architecture that supports UVA (Unified Virtual Addressing) and P2P bridging.
- System: GPUs typically must share a single PCIe root complex, unless an NVLink bridge connects them across roots.
- API:
  - Call `cudaDeviceCanAccessPeer(&canAccess, devA, devB)`.
  - If `canAccess` is 1, call `cudaSetDevice(devA)` then `cudaDeviceEnablePeerAccess(devB, 0)`.
  - For bidirectional access, do the reverse as well: `cudaSetDevice(devB)` then `cudaDeviceEnablePeerAccess(devA, 0)`.

If either step fails, that means the pair is not P2P-compatible.
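A compact helper wrapping this sequence might look like the following sketch (`enableBidirectionalP2P` is a hypothetical name, not a CUDA API):

```cpp
#include <cuda_runtime.h>

// Hypothetical helper: returns true only if P2P was enabled in both directions.
bool enableBidirectionalP2P(int devA, int devB) {
    int abOk = 0, baOk = 0;
    cudaDeviceCanAccessPeer(&abOk, devA, devB);
    cudaDeviceCanAccessPeer(&baOk, devB, devA);
    if (!abOk || !baOk) return false;   // pair is not P2P-compatible

    cudaSetDevice(devA);                // enable devA -> devB access
    if (cudaDeviceEnablePeerAccess(devB, 0) != cudaSuccess) return false;
    cudaSetDevice(devB);                // enable devB -> devA access
    if (cudaDeviceEnablePeerAccess(devA, 0) != cudaSuccess) return false;
    return true;
}
```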
4. Enabling P2P in Practice
a) Example Code Snippet
```cpp
/**** day46_P2PExample.cu ****/
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    int devCount;
    cudaGetDeviceCount(&devCount);
    if (devCount < 2) {
        printf("Need at least 2 GPUs for P2P example!\n");
        return 0;
    }

    // Check whether each device can access the other's memory.
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    printf("canAccess01=%d, canAccess10=%d\n", canAccess01, canAccess10);

    if (canAccess01 && canAccess10) {
        // Enable peer access in both directions.
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);

        // Now dev0 & dev1 can do direct reads/writes or use cudaMemcpyPeer.
        // Example: allocate on dev0, copy to dev1.
        cudaSetDevice(0);
        float *d_data0;
        cudaMalloc(&d_data0, 100 * sizeof(float));
        // (fill d_data0 here, e.g. with a kernel or a cudaMemcpy from the host)

        cudaSetDevice(1);
        float *d_data1;
        cudaMalloc(&d_data1, 100 * sizeof(float));

        // Direct peer copy from dev0 memory to dev1 memory, bypassing the host.
        cudaMemcpyPeer(d_data1, 1, d_data0, 0, 100 * sizeof(float));
        // Alternatively, a kernel on dev1 can read d_data0 directly if the
        // pointer is passed in (UVA makes the address valid on both devices).
        printf("P2P enabled between dev0 & dev1.\n");

        cudaFree(d_data1);
        cudaSetDevice(0);
        cudaFree(d_data0);
    } else {
        printf("Peer access not available between dev0 & dev1.\n");
    }
    return 0;
}
```
Explanation:
- We assume at least 2 GPUs.
- We check if device 0 can access device 1, and vice versa. If yes, we enable peer access from each side.
- Then we do a direct `cudaMemcpyPeer(...)` from dev0 memory to dev1 memory, bypassing the host.
b) Observing Direct GPU-GPU Copies
- If your GPUs are P2P-compatible, `cudaMemcpyPeer()` (or a kernel on dev1 referencing dev0’s pointer) stays on the GPU interconnect, never touching host memory.
- Tools like Nsight Systems can confirm whether transfers are GPU↔GPU vs. GPU↔Host↔GPU.
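For the direct-pointer case, a sketch along these lines (assuming `d_data0`/`d_data1` from the example above, `n` floats each, with UVA and bidirectional peer access already enabled) launches a kernel on device 1 that reads device 0’s buffer directly:

```cpp
// Sketch: kernel launched on device 1 reads memory allocated on device 0.
// Assumes UVA and peer access are already enabled, and that d_data0
// (device 0) and d_data1 (device 1) each hold n floats.
__global__ void copyFromPeer(float* dst, const float* src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];   // src dereference travels over NVLink/PCIe
}

// Host side: launch on device 1; d_data0 is legal to dereference
// there thanks to P2P + UVA.
// cudaSetDevice(1);
// copyFromPeer<<<(n + 255) / 256, 256>>>(d_data1, d_data0, n);
// cudaDeviceSynchronize();
```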
5. Conceptual Diagrams
Diagram 1: Peer-to-Peer Setup Flow
```mermaid
flowchart TD
    A["cudaDeviceCanAccessPeer(&canAccess, devA, devB)"] --> B["If canAccess == 1: cudaDeviceEnablePeerAccess(devB, 0)"]
    B --> C["devA can now directly read devB memory or call cudaMemcpyPeer"]
```
Diagram 2: GPU↔GPU Communication Paths
```mermaid
flowchart LR
    subgraph GPU0
        D0[d_data0]
    end
    subgraph GPU1
        D1[d_data1]
    end
    D0 <-->|"P2P path (NVLink/PCIe peer)"| D1
    style D0 fill:#cfc,stroke:#363
    style D1 fill:#cfc,stroke:#363
```
Explanation:
- The “peer path” can be direct if supported by hardware. Otherwise, the fallback is GPU0→Host→GPU1.
6. Common Pitfalls & Best Practices
- Incompatible GPUs: Even if you have multiple GPUs, they might not share a root complex or support direct P2P, so `cudaDeviceCanAccessPeer` can report 0.
- Must Enable Peer Access: Even if `canAccessPeer=1`, you still must call `cudaDeviceEnablePeerAccess(...)` from each device that needs direct access (see the error-handling sketch after this list).
- 2-Way or 1-Way: If you need both directions, do both: `cudaSetDevice(A)` + `cudaDeviceEnablePeerAccess(B, 0)`, and vice versa.
- Performance Gains: P2P can drastically reduce overhead for frequent GPU↔GPU transfers, but large data sets must be tested; if the bus is saturated, you might not see a big improvement.
- Memory Allocation: If dev1 is to read dev0’s pointer directly, dev0’s pointer must be allocated in standard device memory. Passing raw device pointers across GPUs also relies on UVA (Unified Virtual Addressing), which is enabled by default on modern 64-bit systems.
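One subtlety worth handling explicitly: `cudaDeviceEnablePeerAccess` returns `cudaErrorPeerAccessAlreadyEnabled` if access was already turned on, which is usually benign. A minimal error-handling sketch:

```cpp
#include <cuda_runtime.h>
#include <stdio.h>

// Sketch: tolerate "already enabled" while still catching real failures.
void enablePeerOrWarn(int fromDev, int toDev) {
    cudaSetDevice(fromDev);
    cudaError_t err = cudaDeviceEnablePeerAccess(toDev, 0);
    if (err == cudaErrorPeerAccessAlreadyEnabled) {
        cudaGetLastError();   // clear the sticky error; access is already on
    } else if (err != cudaSuccess) {
        fprintf(stderr, "P2P enable %d->%d failed: %s\n",
                fromDev, toDev, cudaGetErrorString(err));
    }
}
```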
7. References & Further Reading
- CUDA C Programming Guide – Peer-to-Peer Memory Access
- NVIDIA Developer Blog – articles about multi-GPU communication patterns, NVLink usage, etc.
- “Programming Massively Parallel Processors” by Kirk & Hwu – Multi-GPU and advanced HPC sections.
8. Conclusion
Day 46 covered Peer-to-Peer Communication (Multi-GPU):
- What: Direct GPU↔GPU data movement without host involvement, if hardware supports it.
- API: `cudaDeviceCanAccessPeer` + `cudaDeviceEnablePeerAccess` + `cudaMemcpyPeer`, or direct pointer usage.
- Benefits: Potentially lower latency and higher throughput, especially if GPUs are connected by NVLink.
- Caveats: Not all GPU combos are P2P-compatible; enabling peer access incorrectly or for large data sets might saturate the interconnect. Must measure real performance gains.
Key Takeaway:
Multi-GPU HPC code can exploit P2P for direct data exchange, but ensure your GPUs support it, and watch for PCIe or NVLink constraints.
9. Next Steps
- Check your system’s GPU arrangement (`nvidia-smi topo -m`) to see whether GPUs are connected by NVLink or share a PCIe root complex.
- Benchmark: Compare P2P transfer times vs. host-staged transfers for various data sizes (a starting-point sketch follows this list).
- Integrate: If doing multi-GPU algorithms (like multi-GPU domain decomposition), pass pointers or use `cudaMemcpyPeerAsync()` for direct GPU↔GPU transfers.
- Profile concurrency if both GPUs do computations concurrently and share data via P2P.