Day 24: Pinned (Page-Locked) Memory
Table of Contents
- Overview
- What is Pinned (Page-Locked) Memory?
- Benefits of Pinned Memory
- Drawbacks of Excessive Pinned Memory
- Practical Exercise: Comparing Pinned vs. Pageable Memory Transfers
- Conceptual Diagrams
- References & Further Reading
- Conclusion
- Next Steps
1. Overview
Memory transfers between the host (CPU) and device (GPU) are a critical performance factor in CUDA applications. Traditionally, host memory is pageable, meaning it can be moved (paged out) by the operating system. In contrast, pinned (page-locked) memory is not swappable, which allows for higher throughput in data transfers.
In this lesson, we will learn how to allocate pinned memory, compare its performance with pageable memory, and understand the trade-offs involved.
2. What is Pinned (Page-Locked) Memory?
**Pinned memory** is host memory that is locked in physical RAM, meaning the OS is prevented from paging it out to disk. It is allocated with the CUDA API calls `cudaMallocHost()` or `cudaHostAlloc()`. Because the memory is fixed in place:
- Data transfers between host and device are faster.
- Asynchronous memory transfers with `cudaMemcpyAsync()` become possible, and these can overlap with kernel execution.

**Pageable memory**, on the other hand, is allocated with standard functions such as `malloc()` and can be paged out by the OS, which slows down host-device transfers because the driver must first stage the data through an internal pinned buffer.
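For concreteness, here is a minimal sketch of the two pinned-allocation calls; the buffer size and the `cudaHostAllocDefault` flag are illustrative choices, not part of the exercises later in this lesson.

```cpp
// Minimal sketch: allocating and freeing pinned host memory.
// The size and flags below are illustrative, not taken from the exercises later on.
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    size_t size = 1 << 20;              // 1 MiB, arbitrary example size
    float *p1 = NULL, *p2 = NULL;

    // cudaMallocHost(): pinned allocation with default behavior.
    cudaMallocHost((void**)&p1, size);

    // cudaHostAlloc(): same idea, but takes a flags argument
    // (cudaHostAllocDefault here; other flags such as cudaHostAllocMapped exist).
    cudaHostAlloc((void**)&p2, size, cudaHostAllocDefault);

    printf("Pinned buffers at %p and %p\n", (void*)p1, (void*)p2);

    // Pinned memory must be released with cudaFreeHost(), not free().
    cudaFreeHost(p1);
    cudaFreeHost(p2);
    return 0;
}
```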
3. Benefits of Pinned Memory
- **Faster Transfers:** Pinned memory enables higher bandwidth for data transfers between the host and the GPU.
- **Asynchronous Transfers:** Allows the use of `cudaMemcpyAsync()`, which does not block the CPU and can overlap with kernel execution, improving overall performance (see the sketch after this list).
- **Deterministic Behavior:** Memory addresses remain fixed, reducing address-translation overhead during transfers.
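The overlap mentioned above only materializes when the asynchronous copy and the kernel run in different (non-default) streams and touch independent data. The following is a minimal sketch of that pattern; the stream names, sizes, and dummy kernel are illustrative assumptions, not part of the exercises below.

```cpp
// Minimal sketch: overlapping an async H2D copy with an independent kernel.
// Names, sizes, and the dummy kernel are illustrative assumptions.
#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    float *h_in;                              // pinned: required for truly async copies
    cudaMallocHost((void**)&h_in, bytes);
    for (int i = 0; i < N; i++) h_in[i] = 1.0f;

    float *d_in, *d_other;
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_other, bytes);
    cudaMemset(d_other, 0, bytes);

    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    // The copy in copyStream and the kernel in computeStream touch different
    // buffers, so the hardware is free to run them concurrently.
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, copyStream);
    busyKernel<<<(N + 255) / 256, 256, 0, computeStream>>>(d_other, N);

    cudaDeviceSynchronize();                  // wait for both streams to finish

    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(computeStream);
    cudaFree(d_in);
    cudaFree(d_other);
    cudaFreeHost(h_in);
    return 0;
}
```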
4. Drawbacks of Excessive Pinned Memory
- **System Performance Impact:** Allocating too much pinned memory can degrade overall system performance because it reduces the amount of pageable memory available to the OS and other applications.
- **Limited Resources:** Pinned memory is a limited resource; overuse can cause memory fragmentation and reduce the effectiveness of memory transfers. One way to limit how long memory stays pinned is shown in the sketch after this list.
- **Allocation Overhead:** Allocating and freeing pinned memory is generally slower than pageable memory allocation.
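One way to keep the pinned footprint small, shown here only as an assumption-laden sketch (it is not used in the exercises below), is to pin an existing `malloc()` buffer with `cudaHostRegister()` for just the duration of the transfers and then unpin it.

```cpp
// Minimal sketch: temporarily pinning an existing buffer with cudaHostRegister().
// This API is not used in the exercises below; the size and names are illustrative.
#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
    size_t bytes = (size_t)64 << 20;             // 64 MiB, arbitrary example size
    float *h_buf = (float*)malloc(bytes);        // ordinary pageable allocation

    float *d_buf;
    cudaMalloc((void**)&d_buf, bytes);

    // Pin the buffer only while it is needed for fast / async transfers.
    cudaHostRegister(h_buf, bytes, cudaHostRegisterDefault);
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    // Unpin so the OS may page the buffer again; it remains a valid malloc() buffer.
    cudaHostUnregister(h_buf);

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}
```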
5. Practical Exercise: Comparing Pinned vs. Pageable Memory Transfers
In this exercise, we will:
- Implement a simple vector addition using pageable memory.
- Implement the same vector addition using pinned memory.
- Measure and compare the transfer times and overall performance.
a) Code Example Using Pageable Memory
```cpp
// pageableVectorAdd.cu
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Simple vector addition kernel.
__global__ void vectorAddKernel(const float *A, const float *B, float *C, int N) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int N = 1 << 20; // 1M elements
    size_t size = N * sizeof(float);

    // Allocate pageable host memory.
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Initialize host arrays.
    srand(time(NULL));
    for (int i = 0; i < N; i++) {
        h_A[i] = (float)(rand() % 100) / 10.0f;
        h_B[i] = (float)(rand() % 100) / 10.0f;
    }

    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Copy data to device (synchronously).
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    cudaDeviceSynchronize();

    // Copy result back to host (synchronously).
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Print first 5 results.
    printf("Pageable Memory - First 5 Results:\n");
    for (int i = 0; i < 5; i++) {
        printf("%f ", h_C[i]);
    }
    printf("\n");

    // Free memory.
    free(h_A);
    free(h_B);
    free(h_C);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
```
Comments:
- **Pageable Memory:** Uses standard `malloc()` for host memory allocation.
- **Synchronous Transfers:** Data is copied with blocking `cudaMemcpy()` calls, so each transfer stalls the host until it completes.
b) Code Example Using Pinned Memory
```cpp
// pinnedVectorAdd.cu
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Simple vector addition kernel.
__global__ void vectorAddKernel(const float *A, const float *B, float *C, int N) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int N = 1 << 20; // 1M elements
    size_t size = N * sizeof(float);

    // Allocate pinned host memory.
    float *h_A, *h_B, *h_C;
    cudaMallocHost((void**)&h_A, size); // Pinned memory allocation
    cudaMallocHost((void**)&h_B, size);
    cudaMallocHost((void**)&h_C, size);

    // Initialize host arrays with random values.
    srand(time(NULL));
    for (int i = 0; i < N; i++) {
        h_A[i] = (float)(rand() % 100) / 10.0f;
        h_B[i] = (float)(rand() % 100) / 10.0f;
    }

    // Allocate device memory.
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Use asynchronous memory copy to transfer data from pinned host memory to device memory.
    cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpyAsync(d_B, h_B, size, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    // Launch kernel.
    vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    cudaDeviceSynchronize();

    // Asynchronously copy the result from device back to host.
    cudaMemcpyAsync(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();

    // Print first 5 results.
    printf("Pinned Memory - First 5 Results:\n");
    for (int i = 0; i < 5; i++) {
        printf("%f ", h_C[i]);
    }
    printf("\n");

    // Free device and pinned host memory.
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    cudaFreeHost(h_A);
    cudaFreeHost(h_B);
    cudaFreeHost(h_C);
    return 0;
}
```
Comments:
- **Pinned Memory Allocation:** Uses `cudaMallocHost()` to allocate pinned host memory, which speeds up data transfers.
- **Asynchronous Transfers:** Uses `cudaMemcpyAsync()` for the host-device copies. Note that these calls are issued into the default stream here, so they still execute in order with the kernel; to actually overlap copies with independent computation, issue them into non-default streams (as in the sketch in Section 3).
- **Synchronization:** `cudaDeviceSynchronize()` ensures that both the kernel and the memory transfers complete before the data is accessed.
c) Timing and Performance Comparison
To compare the performance of pinned and pageable memory transfers, insert timing code (using CUDA events) around the `cudaMemcpy()` and `cudaMemcpyAsync()` calls. The program below reports the transfer time for each case.
```cpp
// performanceComparison.cu
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Simple kernel remains the same.
__global__ void vectorAddKernel(const float *A, const float *B, float *C, int N) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

#define CUDA_CHECK(call) { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        printf("CUDA Error at %s:%d - %s\n", __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(EXIT_FAILURE); \
    } \
}

int main() {
    int N = 1 << 20; // 1M elements
    size_t size = N * sizeof(float);
    float *h_A, *h_B, *h_C;

    // Allocate pageable memory.
    h_A = (float*)malloc(size);
    h_B = (float*)malloc(size);
    h_C = (float*)malloc(size);

    // Initialize host arrays.
    srand(time(NULL));
    for (int i = 0; i < N; i++) {
        h_A[i] = (float)(rand() % 100) / 10.0f;
        h_B[i] = (float)(rand() % 100) / 10.0f;
    }

    float *d_A, *d_B, *d_C;
    CUDA_CHECK(cudaMalloc((void**)&d_A, size));
    CUDA_CHECK(cudaMalloc((void**)&d_B, size));
    CUDA_CHECK(cudaMalloc((void**)&d_C, size));

    // Measure pageable memory transfer time.
    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));
    CUDA_CHECK(cudaEventRecord(start, 0));
    CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaEventRecord(stop, 0));
    CUDA_CHECK(cudaEventSynchronize(stop));
    float pageableTime = 0;
    CUDA_CHECK(cudaEventElapsedTime(&pageableTime, start, stop));
    printf("Pageable Memory Transfer Time: %f ms\n", pageableTime);

    // Now free pageable host memory.
    free(h_A);
    free(h_B);
    free(h_C);

    // Allocate pinned host memory.
    CUDA_CHECK(cudaMallocHost((void**)&h_A, size));
    CUDA_CHECK(cudaMallocHost((void**)&h_B, size));
    CUDA_CHECK(cudaMallocHost((void**)&h_C, size));

    // Reinitialize host arrays.
    srand(time(NULL));
    for (int i = 0; i < N; i++) {
        h_A[i] = (float)(rand() % 100) / 10.0f;
        h_B[i] = (float)(rand() % 100) / 10.0f;
    }

    // Measure pinned memory transfer time.
    CUDA_CHECK(cudaEventRecord(start, 0));
    CUDA_CHECK(cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpyAsync(d_B, h_B, size, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaEventRecord(stop, 0));
    CUDA_CHECK(cudaEventSynchronize(stop));
    float pinnedTime = 0;
    CUDA_CHECK(cudaEventElapsedTime(&pinnedTime, start, stop));
    printf("Pinned Memory Transfer Time: %f ms\n", pinnedTime);

    // Cleanup.
    CUDA_CHECK(cudaFree(d_A));
    CUDA_CHECK(cudaFree(d_B));
    CUDA_CHECK(cudaFree(d_C));
    CUDA_CHECK(cudaFreeHost(h_A));
    CUDA_CHECK(cudaFreeHost(h_B));
    CUDA_CHECK(cudaFreeHost(h_C));
    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    return 0;
}
```
Detailed Comments:
- **Pageable Memory Section:** Allocates standard host memory with `malloc()` and measures the transfer time of the synchronous `cudaMemcpy()` calls.
- **Pinned Memory Section:** Allocates host memory with `cudaMallocHost()` and measures the transfer time of the `cudaMemcpyAsync()` calls.
- **Timing:** CUDA events bracket the transfers; the elapsed time is computed with `cudaEventElapsedTime()` and printed, which typically shows that pinned transfers are faster.
- **Resource Cleanup:** All allocated memory and the CUDA events are freed.
6. Conceptual Diagrams
Diagram 1: Pageable vs. Pinned Memory Flow
```mermaid
flowchart TD
    A[Allocate Host Memory]
    B{Memory Type?}
    C[Pageable Memory]
    D["Pinned Memory (cudaMallocHost)"]
    E["Copy Data to Device using cudaMemcpy() (Blocking)"]
    F["Copy Data to Device using cudaMemcpyAsync() (Asynchronous)"]
    A --> B
    B -- Pageable --> C
    B -- Pinned --> D
    C --> E
    D --> F
```
Explanation:
- The diagram compares the two types of host memory allocation.
- Pageable memory is allocated with standard `malloc()` and uses blocking memory copies.
- Pinned memory is allocated with `cudaMallocHost()` and allows asynchronous copies.
Diagram 2: Asynchronous Memory Copy Timing Workflow
```mermaid
sequenceDiagram
    participant Host
    participant GPU
    participant EventStart as Start Event
    participant EventStop as Stop Event
    Host->>EventStart: Record start timestamp
    Host->>GPU: cudaMemcpyAsync() copies data from pinned host memory to device
    GPU->>Host: Data transfer proceeds concurrently with host execution
    Host->>EventStop: Record stop timestamp after transfers
    Host->>Host: Compute elapsed time using cudaEventElapsedTime()
```
Explanation:
- The sequence diagram shows how an asynchronous memory copy is timed using CUDA events.
- The host records start and stop timestamps around the asynchronous copy, then computes the elapsed time.
7. References & Further Reading
- CUDA C Programming Guide – Pinned Memory: detailed discussion of pinned memory and its benefits.
- CUDA C Best Practices Guide – Asynchronous Transfers.
- NVIDIA Developer Blog: post on pinned memory.
- NVIDIA Nsight Compute Documentation.
8. Conclusion
Today, we have:
- Explored the differences between pageable and pinned host memory.
- Understood how pinned memory enables faster asynchronous memory transfers.
- Implemented and measured the performance benefits of using pinned memory versus pageable memory.
- Learned to use CUDA events for accurate timing of memory transfers.
- Identified common pitfalls, such as excessive pinned memory allocation potentially impacting system performance.
9. Next Steps
- **Experiment:** Modify the examples to handle larger data sizes and test the limits of pinned memory.
- **Profile:** Use NVIDIA Nsight Systems to analyze the asynchronous transfers.
- **Optimize:** Integrate asynchronous transfers with kernel execution to overlap computation and communication (a chunked, multi-stream sketch follows this list).
- **Expand:** Apply these techniques in larger projects such as image processing or deep learning pipelines.
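As a starting point for the overlap experiment above, here is a minimal sketch that splits one pinned buffer into chunks and pipelines copies and kernels across two streams; the chunk count, sizes, and scaling kernel are illustrative assumptions.

```cpp
// Minimal sketch: pipelining chunked H2D/D2H copies with kernel execution.
// Chunk count, sizes, and the scaling kernel are illustrative assumptions.
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int N = 1 << 22;                     // total elements
    const int nChunks = 4;
    const int chunk = N / nChunks;             // elements per chunk
    size_t chunkBytes = chunk * sizeof(float);

    float *h_data, *d_data;
    cudaMallocHost((void**)&h_data, N * sizeof(float));   // pinned: required for async copies
    cudaMalloc((void**)&d_data, N * sizeof(float));
    for (int i = 0; i < N; i++) h_data[i] = 1.0f;

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    // Each chunk's copy and kernel go into the same stream, so they stay ordered;
    // alternating streams lets chunk k+1's copy overlap with chunk k's kernel.
    for (int c = 0; c < nChunks; c++) {
        cudaStream_t s = streams[c % 2];
        int offset = c * chunk;
        cudaMemcpyAsync(d_data + offset, h_data + offset, chunkBytes,
                        cudaMemcpyHostToDevice, s);
        scaleKernel<<<(chunk + 255) / 256, 256, 0, s>>>(d_data + offset, chunk);
        cudaMemcpyAsync(h_data + offset, d_data + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, s);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```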