Mr. Peabody and Sherman

Project Dot Product

by: burt rosenberg
at: university of miami
date: 22 aug 2024

Goals

Getting Started

Check for the GPU device by running nvidia-smi.

@zizkaea:~$ nvidia-smi
Wed Aug 28 19:35:09 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070        Off |   00000000:01:00.0 Off |                  N/A |
| 30%   28C    P8             16W /  220W |      15MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
Cloning the class repo and running make in the proj1 and proj2 directories should give you a start on this project.

The sample program: vector-mult.cu

The hello.cu program of project 1 demonstrated how a kernel is written and launched. The vector-mult.cu program demonstrates how data is transferred between the host and the device (that is, between the CPU and the GPU). The CUDA library provides the cudaMalloc, cudaFree, and cudaMemcpy functions for this.

Consider this section of the code, run on the host:

    cudaMemcpy(d_a, h_a, n_bytes, cudaMemcpyHostToDevice) ;
    prod_array <<<1,n>>> ( d_a, d_b, d_c ) ;
    cudaMemcpy(h_c, d_c, n_bytes, cudaMemcpyDeviceToHost) ;
In general, CUDA makes asynchronous calls to the GPU. The call initiates an action and returns immediately, without waiting for completion. However, in this case the requested activities are placed on the GPU's default stream, which implies certain default waits: the copy to the device completes before the kernel starts, and the kernel completes before the copy back to the host begins. Advanced GPU programming will define non-NULL streams and launch both execution and data transfer operations into specific streams, which have programmer-defined blocking behavior.

Reference: Streams and Concurrency, Steve Rennich at NVIDIA.
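
For orientation, here is a minimal sketch of the full host-side pattern around that excerpt. It assumes the prod_array kernel sets c[i] = a[i]*b[i] for thread i; the names mirror the excerpt, and vector-mult.cu itself may differ in the details.

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void prod_array( float * a, float * b, float * c ) {
        int i = threadIdx.x ;
        c[i] = a[i] * b[i] ;
    }

    int main() {
        int n = 8 ;
        size_t n_bytes = n * sizeof(float) ;
        float h_a[8], h_b[8], h_c[8] ;
        float * d_a, * d_b, * d_c ;

        for ( int i=0 ; i<n ; i++ ) { h_a[i] = i ; h_b[i] = 1.0 ; }

        cudaMalloc(&d_a, n_bytes) ;      // allocate device memory
        cudaMalloc(&d_b, n_bytes) ;
        cudaMalloc(&d_c, n_bytes) ;

        cudaMemcpy(d_a, h_a, n_bytes, cudaMemcpyHostToDevice) ;  // inputs to the device
        cudaMemcpy(d_b, h_b, n_bytes, cudaMemcpyHostToDevice) ;

        prod_array <<<1,n>>> ( d_a, d_b, d_c ) ;                 // launch on the default stream

        cudaMemcpy(h_c, d_c, n_bytes, cudaMemcpyDeviceToHost) ;  // result back; waits for the kernel

        for ( int i=0 ; i<n ; i++ ) printf("%g ", h_c[i]) ;
        printf("\n") ;

        cudaFree(d_a) ;                  // release device memory
        cudaFree(d_b) ;
        cudaFree(d_c) ;
        return 0 ;
    }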


The folding algorithm described below is illustrated here on an array of eight elements.

initial array:

   +---+---+---+---+---+---+---+---+
   | 0   1   2   3   4   5   6   7 |
   +---+---+---+---+---+---+---+---+

fold in half and add

   +---+---+---+---+
   | 0   1   2   3 |
   +---+---+---+---+
   +---+---+---+---+
+  | 4   5   6   7 |
   +---+---+---+---+
====================
   +---+---+---+---+
   | 4   6   8  10 |
   +---+---+---+---+

fold in half and add

   +---+---+
   | 4   6 |
   +---+---+
   +---+---+
+  | 8  10 |
   +---+---+
====================
   +---+---+
   |12  16 |
   +---+---+
   
one last time

   +---+
   |12 |
   +---+
   +---+
+  |16 |
   +---+
====================
   +---+
   |28 |
   +---+

Algorithm: ArrayFolding

The task is, given an array a[n], to calculate the sum of its elements, A = a[0] + a[1] + ... + a[n-1].

Suppose the array size n is a power of two. Then the array can be "folded" by the addition,

   for i in range(n//2): a[i] += a[i+n//2]
so that the sum A is now equal to the sum just over the first n/2 elements, the remaining elements having been added into those.

The folding continues in the next step, where n is replaced by n/2; after which the sum over the first n/4 elements equals A, the sum of all the elements originally.

This sketch assumes n is of the form 2^k. Two methods to handle general values of n are given in total-sums.ipynb.

This algorithm uses n/2 threads on the first iteration, n/4 on the second, and so forth. There are log(n) iterations, and the memory access pattern consists of completely independent reads and writes. By and large, the original values of the array are overwritten, but the algorithm can be run backwards to recover the original values.
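
As a sketch (the name and arguments here are illustrative, not prescribed by the project), one round of the fold can be written as a kernel that receives the size of the surviving prefix:

    // One round of folding: launched with "half" threads, thread i performs
    // a[i] += a[i+half], where half is n/2 on the first round, n/4 on the
    // second, and so on.
    __global__ void fold_kernel( float * a, int half ) {
        int i = blockIdx.x*blockDim.x + threadIdx.x ;
        if ( i < half ) a[i] += a[i+half] ;
    }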

Thread Launch and Synchronization

The kernel will be called for each round of the folding. There will be a launch by the CPU for each round, with d increasing by one each round. The kernel will be called with that d, and together with its unique per-thread thread ID this completes the inputs needed for each thread to act as it should in that round.

The kernel launches are asynchronous from the point of view of the CPU. The call returns immediately. But the calls are queued up on the GPU and will run start to finish one by one in the order queued. There will not be a situation where two threads are running at the same time with different values of d.

Note that each round launches a different number of threads, the number decreasing by half each round. Each thread runs exactly the same code, hence all threads in a round, started together, will end together.
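
Here is a sketch of the host-side loop, assuming the fold_kernel above, an array of n floats already on the device at d_a, and n a power of two small enough for a single thread block:

    // One launch per round; the offset halves with each round. The launches
    // queue on the default stream, so a round cannot begin until the
    // previous round has completely finished.
    float sum ;
    for ( int d = 1 ; (n>>d) > 0 ; d++ ) {
        int half = n>>d ;                          // n/2, n/4, ..., 1
        fold_kernel <<<1,half>>> ( d_a, half ) ;
    }
    cudaMemcpy(&sum, d_a, sizeof(float), cudaMemcpyDeviceToHost) ;  // waits for the last round
    // sum now equals the total of the original n elements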

Loop Invariants: How to code correctly

    prepare to run the kernel
    // the initial assert
    assert loop invariant
    
    while (condition):
        update parameters
        run the kernel
        
        // check that the loop invariant has been reestablished
        assert loop invariant
    
    // on exit (1) the L.I. is true and (2) the while loop has exited
    // these two things should imply the goal was reached.
A loop invariant is a logical statement that yields true before and after the running of the kernel. In the case of folding, the invariant will assert that certain array locations hold certain sums.

Initially it asserts the obvious: each array location holds its original value.

On the next step it asserts that each location in the first half of the array holds the sum of its original value plus that of one location in the second half of the array.

And so forth on each subsequent round until the loop ends. The final assertion should be exactly the statement that what we wished to have computed, we did compute.

LOOP INVARIANT FOR THE FOLDING ALGORITHM

Let a[0], a[1], ... , a[n-1] be the n values initially in the array A[].
Let d be the phase of the algorithm, beginning with phase 0, and D=2**d.

Loop Invariant:
    After phase d, for i = 0, ..., n/D - 1, A[i] = sum of a[j] over all j in [0,n) with j == i (mod n/D)

Initialization:
    with d:=0 the invariant states that A[i]=a[i] for i=0,...,n-1,
    which is the starting condition.
Update:
    the update equation is A[i] := A[i] + A[i+n/D'], for i in [0,n/D'), where D' := 2*D.
    Before the kernel run we had values in array locations [0,n/D);
    the step breaks this into [0,n/D') and [n/D',n/D), and the second
    interval is folded onto the first.
Final:
    A[0] = sum a[j] because D=n, so the set to sum over is 
    all j such that j == 0 (mod 1), which is all j in [0,n).
d=0 (initial): 
    A[0] = sum[0,1) a[i]
    A[1] = sum[1,2) a[i]
    ...
    A[n-1] = sum[n-1,n) a[i]

d=1: 
    A[0] = a[0] + a[n/2]
    A[1] = a[1] + a[1+n/2]
    A[2] = a[2] + a[2+n/2]
    ...
    A[n/2-1] = a[n/2-1] + a[n-1]
...

d = log2(n): 
    A[0] = sum[0,n) a[i]
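
For illustration only (not required by the project), the invariant can be checked on the host after each phase: copy the working array back with cudaMemcpy and compare it against sums recomputed from a saved copy of the original values. The function name and arguments here are hypothetical, and exact float comparison is fine only for small integer-valued test data.

    // Hypothetical host-side check of the folding invariant after phase d.
    // A[] is the working array copied back from the device, orig[] is a saved
    // copy of the original values, and D = 2^d.
    int check_fold_invariant( float * A, float * orig, int n, int d ) {
        int stride = n >> d ;                        // n/D, the surviving prefix length
        for ( int i = 0 ; i < stride ; i++ ) {
            float s = 0.0 ;
            for ( int j = i ; j < n ; j += stride )  // all j with j == i (mod n/D)
                s += orig[j] ;
            if ( s != A[i] ) return 0 ;              // invariant violated
        }
        return 1 ;                                   // invariant holds
    }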

Algorithm: Tournament Tree

You should also implement another method to do the summing, one that resembles a tournament tree. This method works in blocks that begin with n blocks of size one, representing the initial condition that A[i] contains the initial value of the array, a[i], and ends with one block of size n, representing that A[0] contains the sum of all a[i].

First the sum of neighbors is taken and placed in the leftmost cell of the pair. That is, A[0]:=A[0]+A[1], A[2]:=A[2]+A[3], and so on.

The block size is doubled and the neighbors among the previous "winners" of the tournament are added. In the second round: A[0]:=A[0]+A[2], A[4]:=A[4]+A[6], and so on.
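
A sketch of one round of the tournament as a kernel, with a hypothetical host loop; in round d, with D = 2^d, thread i owns the block of size 2*D that starts at i*2*D. The names are illustrative, and n is assumed to be a power of two already on the device at d_a.

    // One round of the tournament: thread i adds the winner of the right
    // half-block onto the winner of the left half-block.
    __global__ void tourney_kernel( float * a, int D ) {
        int i = blockIdx.x*blockDim.x + threadIdx.x ;
        a[ i*2*D ] += a[ i*2*D + D ] ;
    }

    // hypothetical host loop: n/(2*D) threads per round, D doubling each round
    for ( int D = 1 ; D < n ; D *= 2 ) {
        tourney_kernel <<<1, n/(2*D)>>> ( d_a, D ) ;
    }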

LOOP INVARIANT FOR THE TOURNAMENT ALGORITHM

Loop invariant:
    with D = 2**d, for all i a multiple of D, A[i] == sum[i,i+D) a[j]

Initialization:
    d=0, D=1, for all i A[i]==a[i] 
Update:
    A[i] := A[i] + A[i+D], which by the invariant is sum[i,i+D) a[j] + sum[i+D,i+2*D) a[j] = sum[i,i+2*D) a[j]
Final:
    A[0] = sum[0,n) a[j]
Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

author: burton rosenberg
created: 30 aug 2022
update: 28 nov 2024