Day 17: Host-Device Synchronization Patterns
CUDA kernels are launched asynchronously by default, meaning that the host (CPU) does not wait for the device (GPU) to complete execution before moving to the next instruction. While this is great for overlapping computations, it can lead to incorrect results if memory is accessed before the kernel finishes.
Today's lesson is a deep dive into Host-Device Synchronization Patterns, focusing on `cudaDeviceSynchronize()` and other synchronization strategies to:
- Ensure correctness of kernel results.
- Measure kernel execution time using CUDA events.
- Identify and fix common synchronization pitfalls.
Table of Contents
- Overview
- Understanding Host-Device Synchronization
- Synchronization Methods
- Practical Exercise: Measuring Kernel Duration
- Conceptual Diagrams
- References & Further Reading
- Conclusion
- Next Steps
1. Overview
CUDA kernels are executed asynchronously, meaning:
- The CPU does not wait for the GPU to complete a kernel launch.
- Asynchronous memory transfers may begin before the kernel finishes, leading to partial or incorrect data.
- Explicit synchronization is required to ensure that the CPU only accesses results when computations are finished.
Example: Reading Results Too Soon (Incorrect Code)

```cpp
vectorAddKernel<<<numBlocks, numThreads>>>(d_A, d_B, d_C, N);
cudaMemcpyAsync(h_C, d_C, size, cudaMemcpyDeviceToHost); // Incorrect: h_C may be read before the kernel and copy finish
```

(Note: a plain, blocking `cudaMemcpy()` on the default stream implicitly waits for preceding kernels, so the race only becomes real with asynchronous copies or non-default streams.)
Corrected with cudaDeviceSynchronize()

```cpp
vectorAddKernel<<<numBlocks, numThreads>>>(d_A, d_B, d_C, N);
cudaDeviceSynchronize(); // Ensures GPU computations are done
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
```
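A useful side effect of `cudaDeviceSynchronize()` is error reporting: because kernel launches are asynchronous, launch and execution errors often surface only at the next synchronization point. A minimal sketch of the usual checking pattern (the error messages are illustrative):

```cpp
// Check for launch errors (bad grid/block configuration, etc.)
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    fprintf(stderr, "Launch failed: %s\n", cudaGetErrorString(err));
}

// cudaDeviceSynchronize() returns errors raised during kernel execution
err = cudaDeviceSynchronize();
if (err != cudaSuccess) {
    fprintf(stderr, "Kernel execution failed: %s\n", cudaGetErrorString(err));
}
```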
2. Understanding Host-Device Synchronization
Key Problems with Asynchronous Execution
Problem | Impact |
---|---|
CPU continues execution without waiting for GPU | Results may be read before computations complete. |
Multiple kernels launch asynchronously | Can lead to overlapping execution where order of execution is uncertain. |
Asynchronous memory transfers overlap with kernel execution | Results in incomplete or corrupt data on the host. |
Solution: Explicit Synchronization
CUDA provides synchronization methods to control execution order and ensure correctness.
3. Synchronization Methods
Method | Description | Use Case |
---|---|---|
`cudaDeviceSynchronize()` | Blocks CPU execution until all previously launched CUDA work completes. | Ensuring kernel completion before host execution continues. |
`cudaStreamSynchronize(stream)` | Blocks the CPU until all work in the given stream completes. | Waiting on one stream without stalling others. |
`cudaMemcpy()` | A blocking memory copy that implicitly synchronizes with prior work in its stream. | Ensures data transfers complete before execution continues. |
`cudaEventSynchronize(event)` | Blocks the CPU until the given event has completed. | Waiting for a specific point in a stream of asynchronous operations. |
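To illustrate the stream-scoped variant, here is a minimal sketch (the stream name, launch configuration, and buffers are placeholders borrowed from the examples below) that queues work on a non-default stream and waits only for that stream; for the copies to be truly asynchronous, the host buffers would need to be pinned with `cudaMallocHost()`:

```cpp
cudaStream_t stream;
cudaStreamCreate(&stream);

// Work issued to the same stream executes in issue order
cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, stream);
vectorAddKernel<<<numBlocks, numThreads, 0, stream>>>(d_A, d_B, d_C, N);
cudaMemcpyAsync(h_C, d_C, size, cudaMemcpyDeviceToHost, stream);

// Block only until this stream's work is done; other streams keep running
cudaStreamSynchronize(stream);

cudaStreamDestroy(stream);
```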
Note: Excessive use of `cudaDeviceSynchronize()` may hurt performance by introducing unnecessary CPU-GPU blocking.
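To make that concrete, here is a sketch (assuming an `iterations` count and the kernel from the examples below) contrasting a synchronization inside a launch loop with a single synchronization after it; the former drains the GPU pipeline on every iteration:

```cpp
// Anti-pattern: the GPU goes idle between iterations while the CPU waits
for (int i = 0; i < iterations; ++i) {
    vectorAddKernel<<<numBlocks, numThreads>>>(d_A, d_B, d_C, N);
    cudaDeviceSynchronize(); // Unnecessary per-iteration stall
}

// Better: let launches queue up in the stream, then wait once
for (int i = 0; i < iterations; ++i) {
    vectorAddKernel<<<numBlocks, numThreads>>>(d_A, d_B, d_C, N);
}
cudaDeviceSynchronize();
```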
4. Practical Exercise: Measuring Kernel Duration
In this exercise, we explore three different approaches:
- No Synchronization – incorrect and unsafe.
- Using `cudaDeviceSynchronize()` – ensures correctness.
- Using CUDA Events – measures execution time accurately.
a) Incorrect Execution Without Synchronization
Here, we launch a kernel and immediately read back the result with an asynchronous copy, without waiting for the GPU to finish. (The result buffer is pinned with `cudaMallocHost()` so that the copy really is asynchronous.)
```cpp
// incorrectSync.cu
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void vectorAddKernel(const float *A, const float *B, float *C, int N) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int N = 1 << 20; // 1M elements
    size_t size = N * sizeof(float);
    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

    // Allocate host memory; h_C is pinned so the async copy below is truly asynchronous
    h_A = (float*)malloc(size);
    h_B = (float*)malloc(size);
    cudaMallocHost((void**)&h_C, size);

    // Initialize inputs so the expected result (3.0f) is known
    for (int i = 0; i < N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Allocate device memory
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy data to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch kernel without synchronization
    vectorAddKernel<<<(N + 255) / 256, 256>>>(d_A, d_B, d_C, N);

    // Asynchronous copy back to host (incorrect: nothing waits before h_C is read)
    cudaMemcpyAsync(h_C, d_C, size, cudaMemcpyDeviceToHost);

    printf("Result: %f\n", h_C[0]); // May print stale data: the copy may still be in flight

    // Cleanup
    free(h_A);
    free(h_B);
    cudaFreeHost(h_C);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
```
Expected Outcome:
The printed value is unreliable: the host reads h_C while the kernel and the asynchronous copy may still be in flight.
b) Correct Execution Using cudaDeviceSynchronize()
Here, we fix the issue by adding `cudaDeviceSynchronize()` and using a plain, blocking `cudaMemcpy()` for the copy back.
```cpp
// correctSync.cu
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void vectorAddKernel(const float *A, const float *B, float *C, int N) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int N = 1 << 20;
    size_t size = N * sizeof(float);
    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

    // Allocate host memory
    h_A = (float*)malloc(size);
    h_B = (float*)malloc(size);
    h_C = (float*)malloc(size);

    // Initialize inputs so the expected result (3.0f) is known
    for (int i = 0; i < N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Allocate device memory
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy data to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch kernel
    vectorAddKernel<<<(N + 255) / 256, 256>>>(d_A, d_B, d_C, N);

    // Synchronize to ensure kernel execution is complete (also surfaces any kernel errors)
    cudaDeviceSynchronize();

    // Now safely copy data back to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    printf("Result: %f\n", h_C[0]); // Prints 3.000000

    // Cleanup
    free(h_A);
    free(h_B);
    free(h_C);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
```
Expected Outcome:
Results will be correct because the host waits for the kernel to complete before copying data.
c) Measuring Kernel Execution Time Using CUDA Events
We now use CUDA events for precise kernel timing.
```cpp
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
vectorAddKernel<<<(N + 255) / 256, 256>>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);

// Wait until the stop event (and thus the kernel) has completed
cudaEventSynchronize(stop);

float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel Execution Time: %f ms\n", milliseconds);
```
5. Conceptual Diagrams
```mermaid
flowchart TD
    A[Launch Kernel] --> B[Compute Execution]
    B -->|No Synchronization| C[Host Reads Incomplete Data]
    B -->|"With cudaDeviceSynchronize()"| D[Kernel Completes Execution]
    D --> E[Host Reads Correct Data]
```
6. References & Further Reading
7. Conclusion
Today, we explored Host-Device Synchronization Patterns, covering:
- Correct results with `cudaDeviceSynchronize()`.
- Accurate execution timing using CUDA events.
- Debugging best practices for asynchronous execution.
8. Next Steps
- Optimize event-based synchronization.
- Profile kernel execution using Nsight Compute.