Check for the GPU device by running nvidia-smi.
@zizkaea:~$ nvidia-smi Wed Aug 28 19:35:09 2024 +-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 560.28.03 Driver Version: 560.28.03 CUDA Version: 12.6 | |-----------------------------------------+------------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+========================+======================| | 0 NVIDIA GeForce RTX 3070 Off | 00000000:01:00.0 Off | N/A | | 30% 28C P8 16W / 220W | 15MiB / 8192MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+
The hello.cu program of project 1 demonstrated how a kernel is written and launched. The vector-mult.cu program demonstrates how data is transfered between the host to the device (that is, between the CPU to the GPU). The CUDA library as a cudaMalloc, cudaFree, and cudaMemcpy functions.
Consider this section of the code, run on the host:
cudaMemcpy(d_c, h_c, n_bytes, cudaMemcpyHostToDevice) ; prod_array <<<1,n>>> ( d_a, d_b, d_c ) ; cudaMemcpy(h_a, d_a, n_bytes, cudaMemcpyDeviceToHost) ;
Reference: Streams and Concurrency, Steve Rennich at NVIDIA.
initial array: +---+---+---+---+---+---+---+---+ | 0 1 2 3 4 5 6 7 | +---+---+---+---+---+---+---+---+ fold in half and add +---+---+---+---+ | 0 1 2 3 | +---+---+---+---+ +---+---+---+---+ + | 4 5 6 7 | +---+---+---+---+ ==================== +---+---+---+---+ | 4 6 8 10 | +---+---+---+---+ fold in half and add +---+---+ | 4 6 | +---+---+ +---+---+ + | 8 10 | +---+---+ ==================== +---+---+ |12 16 | +---+---+ one last time +---+ |12 | +---+ +---+ + |16 | +---+ ==================== +---+ |28 | +---+
The task is given an array a[n] to calculate A = ∑ia[i].
Consider if the array size n is a power of two. Then the array can be "folded" by the addition,
for i in range(n/2): a[i] += a[i+n/2]so that the sum A is now equal to the sum just over the first n/2 elements, the remaining elements having been added into those.
The folding continues in the next step, where n is replaced by n/2; after which the sum over the first n/4 elements equals A, the sum of all the elements originally.
This sketch assumes n is over the form, 2k. Two methods to handle general values of n are given in total-sums.ipynb.
This algorithm uses n/2 threads on the first iteration, n/4 on the second, and so forth. There are log(n) iterations. And the memory access pattern are completely independent reads and writes. By and large, the original values of the array are overwritten, but the algorithm can be run backwards, to recover the original values.
The kernel will be called for each round of the folding. There will be a line by the CPU for each round with d increasing by one each round. The kernel will be called with that d and its unique per thread tread ID to complete the inputs needed for each thread to act is it should in that round.
The kernel launches are asynchronous from the point of view of the CPU. The call returns immediately. But the calls are queued up on the GPU and will run start to finish one by one in the order queued. There will not be a situation where two threads are running at the same time with different values of d.
Note that, each round launches a different number of threads. the number decreasing to one half on each round. Each thread runs exactly the code, hence all threads in a round is started together will end together.
prepare to run the kernel // the initial assert assert loop invariant while (condition): update parameters run the kernel // check that the loop invariant has been reestablished assert loop invariant // on exit (1) the L.I. is true and (2) the while loop has exited // these two things should imply the goal was reached.
Initially it asserts the obvious: each array location holds its original value.
On the next step it asserts that locations in the first half of the array hold the sum of the original value of that location plus that of some one location in the second half of the array.
And so forth on each subsequent round until the loop ends. The final assertion should be exactly the statement that what we wished to have computed, we did compute.
LOOP INVARIANT FOR THE FOLDING ALGORITHM Let a[0], a[1], ... , a[n-1] be the n values initially in the array A[]. Let d be the phase of the algorithm, beginning with phase 0, and D=2**d. Loop Invariant: After phase d, for i=0,...,n/D, A[i] = sumj a[j] for all j=(i mod n/D) Initialization: with d:=0 the invariants states that A[i]=a[i] for i=0,..,n, which is the starting condition. Update: the update equation is A[i]:=A[i]+A[i+n/D'] for D':=2*D before the kernel run we had values in array locations [0,n/D), the step breaks this into [0,n/D'), [n/D',n/D), and the second interval is folded onto the first. Final: A[0] = sum a[j] because D=n, so the set to sum over is all j such j==0 mod 1, which is all j in [0,n).
d=0 (initial): A[0] = sum[0,1) a[i] A[1] = sum[1,2) a[i] ... A[n-1] = sum[n-1,n) a[i] d=1: A[0] = sum[0,2) a[i] A[2] = sum[2,4) a[i] A[4] = sum[4,6) a[i] ... A[n-2] = sum[n-2,n) a[i] ... d=log2n: A[0] = sum[0,n) a[i]
You should also implement another method to do the summing, that resembles and algorithm tree. This method works in blocks that begin with n blocks of size one, representing the initial condition that A[i] contains the initial value the array, a[i], and ends with one block of size n, representing that A[0] contains the sum of all a[i].
First the sum of neighbors is taken a placed in the leftmost cell. That is, A[0]:=A[0]+A[1], and so on.
The block size is double and neighbors of the previous "winners" of the tournament are added. In the second round: A[0]:=A[0]+A[2], A[4]:=A[4]+A[6], and so on.
LOOP INVARIANT FOR THE TOURNAMENT ALGORITHM Loop invariant: with D = 2**d, for all i a multiple of D, A[i] == sum[i,i+D) a[i] Initialization: d=0, D=1, for all i A[i]==a[i] Update: A[i]:=a[i,D)+a[i+D,i+2*D) = a[i,2*D) Final: A[0] = sum[0,n) a[j]
author: burton rosenberg
created: 30 aug 2022
update: 28 nov 2024