Day 71: Advanced Streams & Overlapping
When working on complex HPC pipelines or real-time data feeds, multiple streams let you overlap kernels, data transfers, and even CPU-side work to maximize concurrency. Day 71 focuses on advanced usage of CUDA streams: how to chain multiple operations into a pipeline while avoiding race conditions by synchronizing at critical points. We also explore how missing or incorrect synchronization can lead to silent data corruption or partial results, emphasizing the need for diligent concurrency management.
Table of Contents
- Overview
- Why Overlapping Matters
- Advanced Stream Concepts
- Implementation Approach
- Code Example: Multi-Stream Overlap
- Common Pitfalls & Synchronization Issues
- Conceptual Diagram
- References & Further Reading
- Conclusion
- Next Steps
1. Overview
By default, CUDA operations in a single stream are serialized. However, by creating multiple streams, you can overlap:
- Kernel launches so that multiple kernels run concurrently if the GPU’s hardware resources allow.
- Host↔Device transfers with kernel execution, hiding data transfer time behind GPU compute.
- CPU tasks in parallel with GPU tasks, if the CPU code doesn’t block or require the same data that’s in use by the GPU.
Yet concurrency is only beneficial if carefully planned. Missing sync calls, incorrect usage of events, or oversubscription can degrade performance or produce data races. This lesson digs deeper into advanced overlap scenarios that push GPU concurrency to its limits.
2. Why Overlapping Matters
- Latency Hiding: If a kernel in one stream is waiting on memory, the scheduler can run a kernel from another stream.
- Efficiency: In real-time or batch pipelines, streaming input data while the GPU processes the previous batch can significantly boost throughput.
- CPU-GPU Parallelism: The CPU can post-process the results from the last iteration while the GPU tackles the next iteration.
3. Advanced Stream Concepts
a) Concurrent Kernel Execution
- Modern GPUs can schedule multiple kernels in parallel if resources remain available. Streams with independent data let the device run them simultaneously.
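A minimal sketch of this idea, with illustrative kernel names and sizes: the two launches touch different buffers and carry no ordering dependency, so the device is free to run them at the same time if resources allow.

```cpp
#include <cuda_runtime.h>

// Two unrelated kernels; names and the work they do are placeholders.
__global__ void scaleKernel(float* a, int n)  { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) a[i] *= 3.0f; }
__global__ void offsetKernel(float* b, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) b[i] += 7.0f; }

void launchConcurrent(float* d_a, float* d_b, int n) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Independent data, independent streams: no event or sync is needed between them,
    // so the hardware scheduler may execute the two kernels concurrently.
    scaleKernel<<<(n + 255) / 256, 256, 0, s1>>>(d_a, n);
    offsetKernel<<<(n + 255) / 256, 256, 0, s2>>>(d_b, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```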
b) Overlapping Data Transfers and Kernels
- By calling `cudaMemcpyAsync(..., streamX)`, you can copy data in parallel with kernel execution in other streams (see the sketch below).
- Use pinned host memory to maximize transfer bandwidth and maintain concurrency.
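A minimal sketch of this copy/compute overlap, assuming two already-created streams; the kernel, buffer names, and chunk size are illustrative.

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for whatever work runs on the previous chunk.
__global__ void processKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

// Stage the next chunk while computing on the previous one.
void copyWhileComputing(float* d_prev, float* d_next, int n,
                        cudaStream_t computeStream, cudaStream_t copyStream) {
    float* h_next;
    // Pinned (page-locked) host memory is what allows cudaMemcpyAsync to overlap
    // with kernel execution; with pageable memory the copy may effectively serialize.
    cudaMallocHost((void**)&h_next, n * sizeof(float));
    // ... fill h_next with the next chunk of input on the CPU ...

    processKernel<<<(n + 255) / 256, 256, 0, computeStream>>>(d_prev, n);   // compute on previous chunk
    cudaMemcpyAsync(d_next, h_next, n * sizeof(float),
                    cudaMemcpyHostToDevice, copyStream);                    // stage next chunk concurrently

    cudaStreamSynchronize(computeStream);
    cudaStreamSynchronize(copyStream);
    cudaFreeHost(h_next);
}
```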
c) Integrating CPU-Side Tasks
- The CPU can run computations or manage I/O while the GPU processes data. This synergy is especially valuable if your workflow is partially CPU-bound or requires specialized CPU tasks like compression or network I/O.
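The example later in this lesson uses a plain `std::thread` for the CPU work, but another option (not shown in the main example) is `cudaLaunchHostFunc` (available since CUDA 10), which enqueues a host function into a stream so CPU work is ordered relative to GPU work without manual events. A minimal sketch, with illustrative names:

```cpp
#include <cuda_runtime.h>
#include <stdio.h>

// Host callback: runs on a CUDA-internal thread once all prior work in the
// stream has finished. It must not call CUDA API functions.
void CUDART_CB notifyHost(void* userData) {
    printf("GPU work before this point in the stream is done (%s).\n", (const char*)userData);
}

void enqueueCpuNotification(cudaStream_t stream) {
    static const char* tag = "batch 0";
    // Enqueued like any other stream operation, so it preserves stream order.
    cudaLaunchHostFunc(stream, notifyHost, (void*)tag);
}
```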
4. Implementation Approach
a) Stream Creation and Event Usage
- Multiple Streams: Create separate streams for data transfers, kernel A, kernel B, etc.
- Events: Use `cudaEventRecord()` to mark a stream’s completion point. Another stream or the CPU can wait on that event via `cudaStreamWaitEvent()` or `cudaEventSynchronize()`, ensuring correct ordering.
b) Chaining Operations
- Producer Stream: Copies new data to the device and launches a production kernel.
- Consumer Stream: Waits on an event signifying the production kernel is done, then processes that data or merges partial results.
- CPU Tasks: Potentially run concurrently if your code does not require immediate results from the GPU or pinned memory.
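A minimal sketch of such a producer/consumer chain, with illustrative kernel names, assuming two preallocated device buffers: double buffering lets batch k+1 be produced while batch k is consumed, and one event pair per buffer keeps the two streams from racing on the same data.

```cpp
#include <cuda_runtime.h>

// Placeholder "production" and "consumption" kernels; real pipelines would do useful work here.
__global__ void produceKernel(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * 0.5f + 1.0f;
}
__global__ void consumeKernel(const float* buf, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] += buf[i];
}

// d_buf[0] and d_buf[1] are two preallocated device buffers; names and batch count are illustrative.
void runPipeline(float* d_buf[2], float* d_out, int n, int numBatches) {
    cudaStream_t producer, consumer;
    cudaStreamCreate(&producer);
    cudaStreamCreate(&consumer);

    // One pair of events per buffer; synchronization-only events keep overhead low.
    cudaEvent_t produced[2], consumed[2];
    for (int i = 0; i < 2; ++i) {
        cudaEventCreateWithFlags(&produced[i], cudaEventDisableTiming);
        cudaEventCreateWithFlags(&consumed[i], cudaEventDisableTiming);
    }

    dim3 block(256), grid((n + 255) / 256);
    for (int b = 0; b < numBatches; ++b) {
        int cur = b & 1;  // alternate between the two buffers

        // Don't overwrite a buffer the consumer may still be reading (two batches ago).
        // Waiting on an event that has never been recorded completes immediately.
        cudaStreamWaitEvent(producer, consumed[cur], 0);
        produceKernel<<<grid, block, 0, producer>>>(d_buf[cur], n);
        cudaEventRecord(produced[cur], producer);

        // The consumer waits only for this batch's production, not the whole producer stream.
        cudaStreamWaitEvent(consumer, produced[cur], 0);
        consumeKernel<<<grid, block, 0, consumer>>>(d_buf[cur], d_out, n);
        cudaEventRecord(consumed[cur], consumer);
    }
    cudaStreamSynchronize(consumer);

    for (int i = 0; i < 2; ++i) {
        cudaEventDestroy(produced[i]);
        cudaEventDestroy(consumed[i]);
    }
    cudaStreamDestroy(producer);
    cudaStreamDestroy(consumer);
}
```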
5. Code Example: Multi-Stream Overlap
Below is a multi-stream snippet demonstrating how to overlap kernel execution, data transfers, and CPU tasks. We have:
- Stream for kernelA.
- Stream for data copy.
- A CPU function that runs concurrently.
- Another stream for kernelB which depends on the result from kernelA.
```cpp
// File: multi_stream_overlap.cu
#include <cuda_runtime.h>
#include <stdio.h>
#include <thread>
#include <chrono>
// Simple kernel that increments each element
__global__ void kernelA(float* data, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
data[idx] += 1.0f;
}
}
// Another kernel that doubles each element
__global__ void kernelB(float* data, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
data[idx] *= 2.0f;
}
}
void cpuSideTask() {
// Simulate a CPU task that runs concurrently with GPU
printf("CPU task started...\n");
std::this_thread::sleep_for(std::chrono::milliseconds(100));
printf("CPU task completed.\n");
}
int main() {
int N = 1 << 20;
size_t size = N * sizeof(float);
// Allocate pinned host buffer
float* h_data;
cudaMallocHost((void**)&h_data, size);
for (int i = 0; i < N; i++) {
h_data[i] = (float)i;
}
// Allocate device buffer
float* d_data;
cudaMalloc(&d_data, size);
// Create streams
cudaStream_t streamA, streamCopy, streamB;
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamCopy);
cudaStreamCreate(&streamB);
// 1) Asynchronous copy host -> device in streamCopy
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, streamCopy);
// Launch a thread for CPU side task concurrently
std::thread cpuThread(cpuSideTask);
// Wait for copy in streamCopy to complete, then launch kernelA in streamA
// Use an event to chain these streams
cudaEvent_t copyDone;
cudaEventCreate(&copyDone);
cudaEventRecord(copyDone, streamCopy);
cudaStreamWaitEvent(streamA, copyDone, 0);
kernelA<<<(N+255)/256, 256, 0, streamA>>>(d_data, N);
// Another event after kernelA completes
cudaEvent_t kernelA_done;
cudaEventCreate(&kernelA_done);
cudaEventRecord(kernelA_done, streamA);
// kernelB in streamB waits for kernelA_done
cudaStreamWaitEvent(streamB, kernelA_done, 0);
kernelB<<<(N+255)/256, 256, 0, streamB>>>(d_data, N);
// Wait for all GPU ops
cudaStreamSynchronize(streamCopy);
cudaStreamSynchronize(streamA);
cudaStreamSynchronize(streamB);
// Join CPU thread
cpuThread.join();
// Copy result back in same or new stream (for brevity, do synchronous copy)
cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);
printf("Sample result: h_data[0] = %f\n", h_data[0]);
// Cleanup
cudaFree(d_data);
cudaFreeHost(h_data);
cudaStreamDestroy(streamA);
cudaStreamDestroy(streamCopy);
cudaStreamDestroy(streamB);
cudaEventDestroy(copyDone);
cudaEventDestroy(kernelA_done);
return 0;
}
```
Explanation & Comments
- Asynchronous Copy: `cudaMemcpyAsync` in `streamCopy` loads data to the GPU while the CPU does a separate task.
- CPU Thread: A simple CPU function runs concurrently with GPU actions.
- Event Chaining: `copyDone` ensures kernelA waits for the data; `kernelA_done` ensures kernelB only starts after kernelA.
- Streams: `streamA` for kernelA, `streamB` for kernelB, `streamCopy` for data transfers.
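To try the snippet, a plain nvcc build should suffice (for example `nvcc -O2 -o multi_stream_overlap multi_stream_overlap.cu`; depending on your host toolchain you may also need `-Xcompiler -pthread` for `std::thread`). Capturing a timeline with Nsight Systems (e.g. `nsys profile ./multi_stream_overlap`) is the easiest way to see which operations actually overlap and which are serialized by the event dependencies.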
6. Common Pitfalls & Synchronization Issues
- Overlapping CPU Code: The CPU might attempt to read pinned memory before the GPU finishes writing. Use events and synchronization if that data is needed by the CPU.
- Misused Streams: Launching kernels that rely on the same data in separate streams without an event-based sync can cause data hazards.
- Oversubscription: Creating too many concurrent streams can degrade performance if the GPU is saturated.
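A minimal sketch addressing the first pitfall, with illustrative names: record an event after the device-to-host copy and have the CPU synchronize on it only when the pinned buffer is actually needed.

```cpp
#include <cuda_runtime.h>
#include <stdio.h>

// The CPU must not read a pinned buffer until the copy that fills it has completed.
void safeReadback(const float* d_result, float* h_pinned, int n, cudaStream_t stream) {
    cudaEvent_t resultReady;
    cudaEventCreate(&resultReady);

    cudaMemcpyAsync(h_pinned, d_result, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(resultReady, stream);     // marks the point where h_pinned becomes valid

    // ... other CPU work can run here while the copy is in flight ...

    cudaEventSynchronize(resultReady);        // block the CPU only when the data is needed
    printf("first element: %f\n", h_pinned[0]);

    cudaEventDestroy(resultReady);
}
```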
7. Conceptual Diagram
```mermaid
flowchart TD
    subgraph Host
        H1[Start CPU task in separate thread]
    end
    subgraph StreamCopy
        C1["cudaMemcpyAsync Host->Device"]
        C2[Event copyDone]
    end
    subgraph StreamA
        A1[Wait event copyDone]
        A2["kernelA(d_data)"]
        A3[Event kernelA_done]
    end
    subgraph StreamB
        B1[Wait event kernelA_done]
        B2["kernelB(d_data)"]
    end

    H1 --- C1
    C1 --> C2 --> A1 --> A2 --> A3 --> B1 --> B2
```
Explanation:
- StreamCopy transfers data in parallel with a CPU task.
- kernelA in StreamA waits for the `copyDone` event.
- kernelB in StreamB waits for the `kernelA_done` event, forming a pipeline.
8. References & Further Reading
- Nsight Systems Documentation – Profiling concurrency timelines.
- CUDA C Programming Guide – Streams & Concurrency
- NVIDIA Developer Blog – Advanced Multi-Stream Techniques
9. Conclusion
Day 71 spotlights advanced stream usage where multiple kernels, data transfers, and even CPU tasks overlap for maximal concurrency. Proper event chaining is essential to avoid data hazards. By dedicating separate streams to distinct operations (data copy, kernel stages), and optionally letting the CPU do concurrent tasks, HPC pipelines can hide latencies and boost throughput. Yet, the synergy among streams only succeeds if events are used to tie their data dependencies together.
10. Next Steps
- Profile Overlap: Use Nsight Systems to confirm actual concurrency and identify potential serialization points.
- Experiment: Try rearranging streams for data staging, kernel chaining, or CPU tasks, measuring real speedups.
- Refine: If concurrency is not giving expected gains, check if the GPU or memory bus is saturated or if sync calls are too frequent.
- Apply: Extend multi-stream concurrency to multi-GPU or MPS scenarios for more advanced HPC pipelines.