Parallel Conway’s Game of Life Using 1-D Arrays in CUDA
Christopher D. Carothers
Department of Computer Science Rensselaer Polytechnic Institute
110 8th Street
Troy, New York U.S.A. 12180-3590
1 Overview
As promised in Assignment 1, you are to construct a parallel C program that specifically implements Conway's Game of Life but with the twist of using ***only*** 1-D arrays in CUDA. Additionally, you will run this C program on the AiMOS supercomputer at the CCI in parallel using a single CUDA-enabled GPU.
1.1 For Review – Game of Life Specification
The Game of Life is an example of a Cellular Automaton where the universe is a two-dimensional orthogonal grid of square cells (with WRAP AROUND FOR THIS ASSIGNMENT), each of which is in one of two possible states, ALIVE or DEAD. Every cell interacts with its eight neighbors, which are the cells that are horizontally, vertically, or diagonally adjacent. At each step in time, the following transitions occur at each and every cell:
• Any live cell with fewer than two live neighbors dies, as if caused by under-population.
• Any live cell with two or three live neighbors lives on to the next generation.
• Any live cell with more than three live neighbors dies, as if by over-population.
• Any dead cell with exactly three live neighbors becomes a live cell, as if by reproduction.
Re-using the C template from Assignment 1: Recall from Assignment 1 that the world size and initial pattern are determined by arguments to your program. A template for your program will be provided; more details are below. The first generation is created by applying the above rules to every cell in the seed: births and deaths occur simultaneously, and the discrete moment at which this happens is sometimes called a "tick" or iteration. The rules continue to be applied repeatedly to create further generations. The number of generations will also be an argument to your program.
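The four rules above reduce to a small pure function. A minimal host-side sketch (the function name is illustrative, not part of the template):

```c
/* Next state of one cell from its current state and its live-neighbor
 * count, per the four rules above. */
unsigned char gol_nextState(unsigned char alive, unsigned int liveNeighbors)
{
    if (alive)
        return (liveNeighbors == 2 || liveNeighbors == 3) ? 1 : 0; /* survival */
    return (liveNeighbors == 3) ? 1 : 0;                           /* reproduction */
}
```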
1.2 Template and Implementation Details
Re-use the provided template from Assignment 1, gol.c, or your own solution to Assignment 1. Here, you must re-write/extend those implementations in gol.c as gol.cu and support parallel CUDA processing of the world grid. Here is what you should look out for:
• new headers:
#include <cuda.h>
#include <cuda_runtime.h>
• init routines: Replace all the calloc calls with cudaMallocManaged. For example:
cudaMallocManaged( &g_data, (g_dataLength * sizeof(unsigned char)));
will allocate the right amount of space for the g_data pointer in the template. Also, make sure you correctly ZERO out all the data elements, as cudaMallocManaged will not do this for you the way calloc does.
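A minimal sketch of such an init routine, assuming the template's g_data / g_dataLength names (managed memory is host-accessible, so a plain memset suffices for the calloc-style zeroing):

```cuda
#include <cuda.h>
#include <cuda_runtime.h>
#include <string.h>

static unsigned char *g_data = NULL;
static size_t g_dataLength = 0;

void gol_allocWorld(size_t worldWidth, size_t worldHeight)
{
    g_dataLength = worldWidth * worldHeight;
    /* replaces: g_data = calloc(g_dataLength, sizeof(unsigned char)); */
    cudaMallocManaged(&g_data, g_dataLength * sizeof(unsigned char));
    /* cudaMallocManaged does not zero memory, so do it explicitly */
    memset(g_data, 0, g_dataLength * sizeof(unsigned char));
}
```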
• gol kernelLaunch: this function is called from main and looks like:
bool gol_kernelLaunch(unsigned char** d_data, unsigned char** d_resultData, size_t worldWidth, size_t worldHeight, size_t iterationsCount, ushort threadsCount)
This function computes the world via a CUDA kernel (described below) and swaps the new world with the previous world to be ready for the next iteration. It should invoke the gol kernel CUDA kernel function, which will advance the whole world one step or iteration. Last, it will need to invoke cudaDeviceSynchronize() prior to returning to the main routine.
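One possible sketch of this launch function; the grid size chosen here is an assumption (any grid that, together with the kernel's grid-stride loop, covers all worldWidth*worldHeight cells works):

```cuda
#include <cuda.h>
#include <cuda_runtime.h>

__global__ void gol_kernel(const unsigned char *d_data, unsigned int worldWidth,
                           unsigned int worldHeight, unsigned char *d_resultData);

bool gol_kernelLaunch(unsigned char **d_data, unsigned char **d_resultData,
                      size_t worldWidth, size_t worldHeight,
                      size_t iterationsCount, ushort threadsCount)
{
    size_t cells  = worldWidth * worldHeight;
    size_t blocks = (cells + threadsCount - 1) / threadsCount; /* round up */

    for (size_t i = 0; i < iterationsCount; i++) {
        /* advance the whole world one iteration */
        gol_kernel<<<blocks, threadsCount>>>(*d_data, worldWidth,
                                             worldHeight, *d_resultData);
        /* swap: the result becomes the input of the next iteration */
        unsigned char *tmp = *d_data;
        *d_data       = *d_resultData;
        *d_resultData = tmp;
    }
    cudaDeviceSynchronize(); /* required before returning to main */
    return true;
}
```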
• gol kernel: is the main CUDA kernel function. This function will look like:
__global__ void gol_kernel(const unsigned char* d_data, unsigned int worldWidth, unsigned int worldHeight, unsigned char* d_resultData)
This function will iterate using a grid-stride loop, starting with
index = blockIdx.x * blockDim.x + threadIdx.x
and incrementing by
index += blockDim.x * gridDim.x
provided index < worldWidth*worldHeight.
Using the current thread index, you'll need to compute the x0, x1, x2, y0, y1 and y2 offsets as you did in Assignment 1 using the worldWidth and worldHeight variables. Note, you don't need to use any shared memory. Once you have those offset values, count the number of alive cells among the 8 neighbors in the world.
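Putting the pieces together, a sketch of the kernel (the offset scheme mirrors Assignment 1; the wrap-around arithmetic shown is one way to implement the toroidal world):

```cuda
#include <cuda.h>
#include <cuda_runtime.h>

__global__ void gol_kernel(const unsigned char *d_data, unsigned int worldWidth,
                           unsigned int worldHeight, unsigned char *d_resultData)
{
    for (size_t index = blockIdx.x * blockDim.x + threadIdx.x;
         index < (size_t)worldWidth * worldHeight;
         index += blockDim.x * gridDim.x) {
        size_t x = index % worldWidth;
        size_t y = index / worldWidth;

        /* row offsets of the rows above, at, and below, with wrap-around */
        size_t y0 = ((y + worldHeight - 1) % worldHeight) * worldWidth;
        size_t y1 = y * worldWidth;
        size_t y2 = ((y + 1) % worldHeight) * worldWidth;

        /* column indices left of, at, and right of the cell, with wrap-around */
        size_t x0 = (x + worldWidth - 1) % worldWidth;
        size_t x1 = x;
        size_t x2 = (x + 1) % worldWidth;

        /* count alive cells among the 8 neighbors */
        unsigned int aliveCells =
            d_data[y0 + x0] + d_data[y0 + x1] + d_data[y0 + x2] +
            d_data[y1 + x0] +                   d_data[y1 + x2] +
            d_data[y2 + x0] + d_data[y2 + x1] + d_data[y2 + x2];

        /* apply the four rules */
        d_resultData[y1 + x1] =
            (aliveCells == 3 || (aliveCells == 2 && d_data[y1 + x1])) ? 1 : 0;
    }
}
```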
To compile and execute the GOL program, do:
• modules: module load gcc/7.4.0/1 spectrum-mpi cuda
• compile: nvcc -g -G -gencode arch=compute_70,code=sm_70 gol.cu -o gol – this invokes the nvcc compiler (which uses GCC as its host compiler) with host (-g) and device (-G) debugging enabled.
• run: ./gol 4 64 2 – this will execute the GOL program using pattern 4 for a world size of 64×64 and perform two iterations. Pattern 4 is a “spinner” pattern that is at the corners of the grid.
For this assignment, there are the following 5 patterns:
• Pattern 0: World is ALL zeros.
• Pattern 1: World is ALL ones.
• Pattern 2: Streak of 10 ones in about the middle of the World.
• Pattern 3: Ones at the corners of the World.
• Pattern 4: “Spinner” pattern at corners of the World.
2 Running on AiMOS
Please follow the steps below:
1. Login to the CCI landing pad (lp01.ccni.rpi.edu) using SSH and your CCI account and password information. For example, ssh SPNRcaro@lp03.ccni.rpi.edu and type in your password at the prompt.
2. Login to AiMOS front end by doing ssh PCP9yourlogin@dcsfen01.
3. (Do one time only if you did not do this for Assignment 1). Setup ssh-keys for passwordless login between compute nodes via ssh-keygen -t rsa and then cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys.
4. Load modules: run the module load gcc/7.4.0/1 spectrum-mpi cuda command. This puts the correct GNU C compiler along with MPI (not used) and CUDA in your path correctly as well as all needed libraries, etc.
5. Compile code on the front end by issuing nvcc -g -gencode arch=compute_70,code=sm_70 gol.cu -o gol for debug or nvcc -O3 -gencode arch=compute_70,code=sm_70 gol.cu -o gol for optimized code/performance.
6. Use the squeue command to find the dcsXYZ node name (e.g., dcs24).
7. SSH to that compute node, for example, ssh dcs24. You should be at a Linux prompt on that compute node.
8. Issue the run command for GOL. For example, ./gol 4 64 2, which will run GOL using pattern 4 with a world size of 64×64 for 2 iterations.
9. If you are done with a node early, please exit the node and cancel the job with scancel JOBID, where the JOBID can be found using the squeue command.
3 Parallel Performance Analysis and Report
First, make sure you disable any "world" output from your program. The worlds produced in these performance tests will be too large for effective human analysis without additional post-processing of the data. Using the time command, execute your program across the following configurations and collect the total execution time for each run.
• Blocksize is 64 threads. Execute over world height/width of 1024, 2048, 4096, 8192, 16384, 32768 and 65536 for 1024 iterations each.
• Blocksize is 128 threads. Execute over world height/width of 1024, 2048, 4096, 8192, 16384, 32768 and 65536 for 1024 iterations each.
• Blocksize is 256 threads. Execute over world height/width of 1024, 2048, 4096, 8192, 16384, 32768 and 65536 for 1024 iterations each.
• Blocksize is 512 threads. Execute over world height/width of 1024, 2048, 4096, 8192, 16384, 32768 and 65536 for 1024 iterations each.
• Blocksize is 1024 threads. Execute over world height/width of 1024, 2048, 4096, 8192, 16384, 32768 and 65536 for 1024 iterations each.
For each world height/width, plot the execution time in seconds (y-axis) against each different thread blocksize (x-axis). Note you will have 7 plots. For each world size, determine which thread configuration yields the fastest execution time. Why do you think that is the case? Use the nvprof command to support your reasoning. Include the output from an nvprof run in your report. How much of your program's total runtime is really spent in the CUDA gol kernel function according to nvprof?
Last, determine which thread and worldsize configuration yields the fastest "cell updates per second" rate. For example, a world of 1024×1024 cells that runs for 1024 iterations will have 1024³ cell updates. So the "cell updates per second" is 1024³ divided by the total execution time in seconds for that configuration. For this, please create a table of your data – there will be 35 entries – one for each thread blocksize / worldsize configuration with its corresponding "cell updates per second" rate.
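The arithmetic for the rate, as a small illustrative helper (names are hypothetical, not part of the template):

```c
#include <stddef.h>

/* "Cell updates per second" for one configuration:
 * (width * height * iterations) / total runtime in seconds. */
double cellUpdatesPerSec(size_t worldWidth, size_t worldHeight,
                         size_t iterations, double runtimeSeconds)
{
    double updates = (double)worldWidth * (double)worldHeight
                   * (double)iterations;
    return updates / runtimeSeconds;
}
```

For instance, a 1024×1024 world run for 1024 iterations that takes 2.0 seconds yields 1024³ / 2.0 ≈ 5.37 × 10⁸ cell updates per second.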
Finally, please provide instructions on how to run your gol program if it differs from using 3 arguments. For example, you might find it easier to run all your jobs if you provide the thread blocksize as a 4th argument and modify the argc/argv processing code in main accordingly.
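If you do add that 4th argument, the argument handling in main might look like this sketch (the argument order and variable names are assumptions, not part of the template):

```cuda
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    if (argc != 5) {
        fprintf(stderr,
                "usage: %s pattern worldSize iterations threadsCount\n",
                argv[0]);
        exit(EXIT_FAILURE);
    }
    unsigned int   pattern      = (unsigned int)atoi(argv[1]);
    size_t         worldSize    = (size_t)atoi(argv[2]);
    size_t         iterations   = (size_t)atoi(argv[3]);
    unsigned short threadsCount = (unsigned short)atoi(argv[4]); /* new 4th argument */

    /* ... initialize the world, call gol_kernelLaunch(...), etc. ... */
    return EXIT_SUCCESS;
}
```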
You can use any graphing software so long as your report is in a PDF file. Please make sure to label each figure and put a proper caption that describes what the figure/plot is.
4 HAND-IN and GRADING INSTRUCTIONS
Please submit your C code and PDF report with performance plots to the submitty.cs.rpi.edu grading system. All grading will be done manually because Submitty currently does not support GPU programs. A rubric will be posted which describes the grading elements of both the program and report in Submitty. Also, please make sure you document the code you write for this assignment. That is, say what you are doing and why.