Day 17: Host-Device Synchronization Patterns
CUDA kernels are launched asynchronously by default, meaning that the host (CPU) does not wait for the device (GPU) to complete execution before moving to the next instruction. While this is great for overlapping computations, it can lead to incorrect results if memory is accessed before the kernel finishes.
Today's lesson is a deep dive into Host-Device Synchronization Patterns, focusing on `cudaDeviceSynchronize()` and other synchronization strategies to:
- Ensure correctness of kernel results.
- Measure kernel execution time using CUDA events.
- Identify and fix common synchronization pitfalls.
Table of Contents
- Overview
- Understanding Host-Device Synchronization
- Synchronization Methods
- Practical Exercise: Measuring Kernel Duration
- Conceptual Diagrams
- References & Further Reading
- Conclusion
- Next Steps
1. Overview
CUDA kernels are executed asynchronously, meaning:
- The CPU does not wait for the GPU to complete a kernel launch.
- Asynchronous memory transfers may begin before the kernel finishes, leading to partial or incorrect data.
- Explicit synchronization is required to ensure that the CPU only accesses results when computations are finished.
Example: Reading Results Too Soon (Incorrect Code)

```cpp
vectorAddKernel<<<numBlocks, numThreads>>>(d_A, d_B, d_C, N);
cudaMemcpyAsync(h_C, d_C, size, cudaMemcpyDeviceToHost); // Incorrect: h_C may be read before the kernel and copy finish
```

(Note: a plain, blocking `cudaMemcpy()` on the default stream implicitly waits for preceding kernels, so the race only becomes real with asynchronous copies or non-default streams.)
Corrected with cudaDeviceSynchronize()

```cpp
vectorAddKernel<<<numBlocks, numThreads>>>(d_A, d_B, d_C, N);
cudaDeviceSynchronize(); // Ensures GPU computations are done
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
```
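A useful side effect of `cudaDeviceSynchronize()` is error reporting: because kernel launches are asynchronous, launch and execution errors often surface only at the next synchronization point. A minimal sketch of the usual checking pattern (the error messages are illustrative):

```cpp
// Check for launch errors (bad grid/block configuration, etc.)
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    fprintf(stderr, "Launch failed: %s\n", cudaGetErrorString(err));
}

// cudaDeviceSynchronize() returns errors raised during kernel execution
err = cudaDeviceSynchronize();
if (err != cudaSuccess) {
    fprintf(stderr, "Kernel execution failed: %s\n", cudaGetErrorString(err));
}
```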
2. Understanding Host-Device Synchronization
Key Problems with Asynchronous Execution
Problem | Impact |
---|---|
CPU continues execution without waiting for GPU | Results may be read before computations complete. |
Multiple kernels launch asynchronously | Can lead to overlapping execution where order of execution is uncertain. |
Asynchronous memory transfers overlap with kernel execution | Results in incomplete or corrupt data on the host. |
Solution: Explicit Synchronization
CUDA provides synchronization methods to control execution order and ensure correctness.
3. Synchronization Methods
Method | Description | Use Case |
---|---|---|
`cudaDeviceSynchronize()` | Blocks CPU execution until all previously launched CUDA work completes. | Ensuring kernel completion before host execution continues. |
`cudaStreamSynchronize(stream)` | Blocks the CPU until all work in the given stream completes. | Waiting on one stream without stalling others. |
`cudaMemcpy()` | A blocking memory copy that implicitly synchronizes with prior work in its stream. | Ensures data transfers complete before execution continues. |
`cudaEventSynchronize(event)` | Blocks the CPU until the given event has completed. | Waiting for a specific point in a stream of asynchronous operations. |
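To illustrate the stream-scoped variant, here is a minimal sketch (the stream name, launch configuration, and buffers are placeholders borrowed from the examples below) that queues work on a non-default stream and waits only for that stream; for the copies to be truly asynchronous, the host buffers would need to be pinned with `cudaMallocHost()`:

```cpp
cudaStream_t stream;
cudaStreamCreate(&stream);

// Work issued to the same stream executes in issue order
cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, stream);
vectorAddKernel<<<numBlocks, numThreads, 0, stream>>>(d_A, d_B, d_C, N);
cudaMemcpyAsync(h_C, d_C, size, cudaMemcpyDeviceToHost, stream);

// Block only until this stream's work is done; other streams keep running
cudaStreamSynchronize(stream);

cudaStreamDestroy(stream);
```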
Note: Excessive use of `cudaDeviceSynchronize()` may hurt performance by introducing unnecessary CPU-GPU blocking.
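To make that concrete, here is a sketch (assuming an `iterations` count and the kernel from the examples below) contrasting a synchronization inside a launch loop with a single synchronization after it; the former drains the GPU pipeline on every iteration:

```cpp
// Anti-pattern: the GPU goes idle between iterations while the CPU waits
for (int i = 0; i < iterations; ++i) {
    vectorAddKernel<<<numBlocks, numThreads>>>(d_A, d_B, d_C, N);
    cudaDeviceSynchronize(); // Unnecessary per-iteration stall
}

// Better: let launches queue up in the stream, then wait once
for (int i = 0; i < iterations; ++i) {
    vectorAddKernel<<<numBlocks, numThreads>>>(d_A, d_B, d_C, N);
}
cudaDeviceSynchronize();
```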
4. Practical Exercise: Measuring Kernel Duration
In this exercise, we explore three different approaches:
- No Synchronization – incorrect and unsafe.
- Using `cudaDeviceSynchronize()` – ensures correctness.
- Using CUDA Events – measures execution time accurately.
a) Incorrect Execution Without Synchronization
Here, we launch a kernel and immediately read back the result with an asynchronous copy, without waiting for the GPU to finish. (The result buffer is pinned with `cudaMallocHost()` so that the copy really is asynchronous.)
```cpp
// incorrectSync.cu
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void vectorAddKernel(const float *A, const float *B, float *C, int N) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int N = 1 << 20; // 1M elements
    size_t size = N * sizeof(float);
    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

    // Allocate host memory; h_C is pinned so the async copy below is truly asynchronous
    h_A = (float*)malloc(size);
    h_B = (float*)malloc(size);
    cudaMallocHost((void**)&h_C, size);

    // Initialize inputs so the expected result (3.0f) is known
    for (int i = 0; i < N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Allocate device memory
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy data to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch kernel without synchronization
    vectorAddKernel<<<(N + 255) / 256, 256>>>(d_A, d_B, d_C, N);

    // Asynchronous copy back to host (incorrect: nothing waits before h_C is read)
    cudaMemcpyAsync(h_C, d_C, size, cudaMemcpyDeviceToHost);

    printf("Result: %f\n", h_C[0]); // May print stale data: the copy may still be in flight

    // Cleanup
    free(h_A);
    free(h_B);
    cudaFreeHost(h_C);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
```
Expected Outcome:
The printed value is unreliable: the host reads h_C while the kernel and the asynchronous copy may still be in flight.
b) Correct Execution Using cudaDeviceSynchronize()
Here, we fix the issue by adding `cudaDeviceSynchronize()` and using a plain, blocking `cudaMemcpy()` for the copy back.
```cpp
// correctSync.cu
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void vectorAddKernel(const float *A, const float *B, float *C, int N) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int N = 1 << 20;
    size_t size = N * sizeof(float);
    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

    // Allocate host memory
    h_A = (float*)malloc(size);
    h_B = (float*)malloc(size);
    h_C = (float*)malloc(size);

    // Initialize inputs so the expected result (3.0f) is known
    for (int i = 0; i < N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Allocate device memory
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy data to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch kernel
    vectorAddKernel<<<(N + 255) / 256, 256>>>(d_A, d_B, d_C, N);

    // Synchronize to ensure kernel execution is complete (also surfaces any kernel errors)
    cudaDeviceSynchronize();

    // Now safely copy data back to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    printf("Result: %f\n", h_C[0]); // Prints 3.000000

    // Cleanup
    free(h_A);
    free(h_B);
    free(h_C);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
```
Expected Outcome:
Results will be correct because the host waits for the kernel to complete before copying data.
c) Measuring Kernel Execution Time Using CUDA Events
We now use CUDA events for precise kernel timing.
```cpp
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
vectorAddKernel<<<(N + 255) / 256, 256>>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);

// Wait until the stop event (and thus the kernel) has completed
cudaEventSynchronize(stop);

float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel Execution Time: %f ms\n", milliseconds);
```
5. Conceptual Diagrams
```mermaid
flowchart TD
    A[Launch Kernel] --> B[Compute Execution]
    B -->|No Synchronization| C[Host Reads Incomplete Data]
    B -->|"With cudaDeviceSynchronize()"| D[Kernel Completes Execution]
    D --> E[Host Reads Correct Data]
```
6. References & Further Reading
7. Conclusion
Today, we explored Host-Device Synchronization Patterns, covering:
- Correct results with `cudaDeviceSynchronize()`.
- Accurate execution timing using CUDA events.
- Debugging best practices for asynchronous execution.
8. Next Steps
- Optimize event-based synchronization.
- Profile kernel execution using Nsight Compute.