Day 17: Host-Device Synchronization Patterns

CUDA kernels are launched asynchronously by default: the host (CPU) does not wait for the device (GPU) to finish before moving on to the next statement. This is great for overlapping work, but it can produce incorrect results if the host reads data that the GPU has not finished producing.

Today's lesson is a deep dive into Host-Device Synchronization Patterns, focusing on cudaDeviceSynchronize() and other synchronization strategies to:

  • Ensure correctness of kernel results.
  • Measure kernel execution time using CUDA events.
  • Identify and fix common synchronization pitfalls.

Table of Contents

  1. Overview
  2. Understanding Host-Device Synchronization
  3. Synchronization Methods
  4. Practical Exercise: Measuring Kernel Duration
  5. Conceptual Diagrams
  6. References & Further Reading
  7. Conclusion
  8. Next Steps

1. Overview

CUDA kernels are executed asynchronously, meaning:

  • The CPU does not wait for the GPU to complete a kernel launch.
  • Asynchronous memory transfers can overlap kernel execution, so the host may observe partial or incorrect data.
  • Explicit synchronization is required to ensure that the CPU only accesses results when computations are finished.

Example: Reading Results Too Soon (Incorrect Code)

Note that a blocking cudaMemcpy issued on the same (default) stream is ordered after the kernel, so it does not exhibit this bug; the race appears once the copy itself is asynchronous:

```cpp
vectorAddKernel<<<numBlocks, numThreads>>>(d_A, d_B, d_C, N);
cudaMemcpyAsync(h_C, d_C, size, cudaMemcpyDeviceToHost);  // Returns before the copy finishes
// Incorrect: h_C may still be changing while the host reads it
```

Corrected with cudaDeviceSynchronize()

```cpp
vectorAddKernel<<<numBlocks, numThreads>>>(d_A, d_B, d_C, N);
cudaMemcpyAsync(h_C, d_C, size, cudaMemcpyDeviceToHost);
cudaDeviceSynchronize();  // Blocks until the kernel and the copy are done
// h_C is now safe to read
```

2. Understanding Host-Device Synchronization

Key Problems with Asynchronous Execution

| Problem | Impact |
| --- | --- |
| The CPU continues execution without waiting for the GPU | Results may be read before computations complete. |
| Multiple kernels launch asynchronously | Execution may overlap, and the order of completion is uncertain. |
| Asynchronous memory transfers overlap kernel execution | The host may observe incomplete or corrupt data. |

Solution: Explicit Synchronization

CUDA provides synchronization methods to control execution order and ensure correctness.
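
As a minimal sketch of the basic pattern (reusing the vectorAddKernel and buffers from the exercises below), launch, check the launch status, then synchronize and check the execution status:

```cpp
// Launch, then check both launch-time and execution-time errors.
vectorAddKernel<<<(N + 255) / 256, 256>>>(d_A, d_B, d_C, N);

// cudaGetLastError() reports launch errors (e.g., invalid configuration).
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess)
    printf("Launch failed: %s\n", cudaGetErrorString(launchErr));

// cudaDeviceSynchronize() blocks the host and returns errors raised
// while the kernel was executing (e.g., out-of-bounds accesses).
cudaError_t syncErr = cudaDeviceSynchronize();
if (syncErr != cudaSuccess)
    printf("Kernel failed: %s\n", cudaGetErrorString(syncErr));
```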


3. Synchronization Methods

| Method | Description | Use Case |
| --- | --- | --- |
| cudaDeviceSynchronize() | Blocks the CPU until all previously launched CUDA work completes. | Ensuring kernel completion before host execution continues. |
| cudaMemcpy() | A blocking copy that is ordered after prior work in the same stream, so it implicitly synchronizes. | Ensuring data transfers (and the work before them) complete before execution continues. |
| cudaEventSynchronize(event) | Blocks the CPU until the given event has been reached on the GPU. | Waiting for a specific point in a stream when multiple operations run asynchronously. |

Note: Excessive use of cudaDeviceSynchronize() may hurt performance by introducing unnecessary CPU-GPU blocking.
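
When only one stream of work has to finish, cudaStreamSynchronize() offers a narrower wait than stalling the entire device. A minimal sketch (the explicit stream here is illustrative and not part of the lesson's examples):

```cpp
// Wait on a single stream instead of the whole device.
cudaStream_t stream;
cudaStreamCreate(&stream);

vectorAddKernel<<<(N + 255) / 256, 256, 0, stream>>>(d_A, d_B, d_C, N);
cudaMemcpyAsync(h_C, d_C, size, cudaMemcpyDeviceToHost, stream);

// Blocks only until the work queued in this stream is done;
// other streams and the rest of the device are unaffected.
cudaStreamSynchronize(stream);

cudaStreamDestroy(stream);
```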


4. Practical Exercise: Measuring Kernel Duration

In this exercise, we explore three different approaches:

  1. No Synchronization – Incorrect and unsafe.
  2. Using cudaDeviceSynchronize() – Ensures correctness.
  3. Using CUDA Events – Measures execution time accurately.

a) Incorrect Execution Without Synchronization

Here, we launch a kernel and immediately read the result back without waiting for the GPU to finish. A blocking cudaMemcpy would implicitly wait, so we use cudaMemcpyAsync into pinned host memory, which is where the race actually appears.

```cpp
// incorrectSync.cu
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void vectorAddKernel(const float *A, const float *B, float *C, int N) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int N = 1 << 20; // 1M elements
    size_t size = N * sizeof(float);
    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

    // Allocate host memory (h_C is pinned so cudaMemcpyAsync is truly asynchronous)
    h_A = (float*)malloc(size);
    h_B = (float*)malloc(size);
    cudaMallocHost(&h_C, size);

    // Initialize inputs so the expected result is known (1.0f + 2.0f = 3.0f)
    for (int i = 0; i < N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Allocate device memory
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy data to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch kernel and start an asynchronous copy back -- no synchronization
    vectorAddKernel<<<(N + 255) / 256, 256>>>(d_A, d_B, d_C, N);
    cudaMemcpyAsync(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Incorrect: the copy may still be in flight when the host reads h_C
    printf("Result: %f\n", h_C[0]); // May print stale or partial data

    // Cleanup
    free(h_A);
    free(h_B);
    cudaFreeHost(h_C);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
```

Expected Outcome:
The printed value may be stale or partial because the host reads h_C before the kernel and the asynchronous copy have completed.


b) Correct Execution Using cudaDeviceSynchronize()

Here, we fix the issue by adding cudaDeviceSynchronize() before the host reads the results.

```cpp
// correctSync.cu
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void vectorAddKernel(const float *A, const float *B, float *C, int N) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int N = 1 << 20;
    size_t size = N * sizeof(float);
    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

    // Allocate host memory (h_C pinned, as in the incorrect version)
    h_A = (float*)malloc(size);
    h_B = (float*)malloc(size);
    cudaMallocHost(&h_C, size);

    // Initialize inputs (expected result: 3.0f)
    for (int i = 0; i < N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Allocate device memory
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy data to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch kernel and start the asynchronous copy back
    vectorAddKernel<<<(N + 255) / 256, 256>>>(d_A, d_B, d_C, N);
    cudaMemcpyAsync(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Synchronize: blocks until the kernel AND the copy are complete
    cudaDeviceSynchronize();

    // Now h_C is safe to read
    printf("Result: %f\n", h_C[0]); // Prints 3.000000

    // Cleanup
    free(h_A);
    free(h_B);
    cudaFreeHost(h_C);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
```

Expected Outcome:
The printed value is correct (3.000000) because the host waits for the kernel and the copy to complete before reading the data.
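
Tip: When hunting down which launch misbehaves, setting the environment variable CUDA_LAUNCH_BLOCKING=1 forces every kernel launch to run synchronously, so errors surface at the offending call site instead of at a later synchronization point.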


c) Measuring Kernel Execution Time Using CUDA Events

We now use CUDA events for precise kernel timing.

```cpp
// Timing with CUDA events (recorded into the same stream as the kernel)
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
vectorAddKernel<<<(N + 255) / 256, 256>>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);

cudaEventSynchronize(stop);  // Wait until the stop event has been reached
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);

printf("Kernel Execution Time: %f ms\n", milliseconds);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

5. Conceptual Diagrams

```mermaid
flowchart TD
    A[Launch Kernel] --> B[Kernel Executes on GPU]
    B -->|No synchronization| C[Host reads incomplete data]
    B -->|"With cudaDeviceSynchronize()"| D[Kernel completes execution]
    D --> E[Host reads correct data]
```

6. References & Further Reading

  1. CUDA C Programming Guide – Device Synchronization
  2. CUDA Best Practices Guide

7. Conclusion

Today, we explored Host-Device Synchronization Patterns, covering:

  • Guaranteeing correct results with cudaDeviceSynchronize().
  • Measuring execution time accurately with CUDA events.
  • Debugging practices for asynchronous execution.

8. Next Steps

  • Optimize event-based synchronization.
  • Profile kernel execution using Nsight Compute.

