When using these examples to create your own job scripts to run on Derecho, remember to substitute your own job name and project code, and customize the other directives and commands as necessary. That includes the commands shown for setting your TMPDIR environment variable as recommended here: Storing temporary files with TMPDIR.
Running a hybrid CPU program with MPI and OpenMP

In this example, we run a hybrid application that uses both MPI tasks and OpenMP threads. The executable was compiled using default modules (Intel compilers and MPI). We use 2 nodes with 32 MPI ranks on each node and 4 OpenMP threads per MPI rank. Whenever you run a program that was compiled with OpenMP support, it is important to provide a value for ompthreads in the select statement; PBS uses that value to define the OMP_NUM_THREADS environment variable.
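A minimal job script sketch for this case is shown below. The job name, project code, queue, walltime, and executable name are placeholders to replace with your own, and the TMPDIR path and mpiexec options reflect common Derecho usage rather than required values (check man mpiexec on the system to confirm the launcher options).

#!/bin/bash
#PBS -N hybrid_job
#PBS -A PROJECT_CODE
#PBS -q main
#PBS -l walltime=01:00:00
#PBS -l select=2:ncpus=128:mpiprocs=32:ompthreads=4

# Set TMPDIR as recommended (the path shown is an assumption; see the
# TMPDIR documentation referenced above)
export TMPDIR=/glade/derecho/scratch/$USER/temp
mkdir -p $TMPDIR

# 2 nodes x 32 ranks/node = 64 MPI ranks; PBS sets OMP_NUM_THREADS=4
# from the ompthreads value in the select statement
mpiexec -n 64 --ppn 32 ./executable_name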
Running an MPI-enabled GPU application

In this example, we run an MPI CUDA program. The application was compiled using the NVIDIA HPC SDK compilers, the CUDA toolkit, and cray-mpich MPI. We request all four GPUs on each of two nodes. Please ensure that you have the cuda module loaded as shown below when attempting to run GPU applications, or nodes may lock up and become unresponsive.
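A job script sketch for this case follows. The module names match the build environment described above (NVIDIA HPC SDK, CUDA, and cray-mpich), but treat the specific names, queue, and executable as placeholders to adapt to your own setup.

#!/bin/bash
#PBS -N gpu_job
#PBS -A PROJECT_CODE
#PBS -q main
#PBS -l walltime=01:00:00
#PBS -l select=2:ncpus=64:mpiprocs=4:ngpus=4

# Set TMPDIR as recommended (the path shown is an assumption)
export TMPDIR=/glade/derecho/scratch/$USER/temp
mkdir -p $TMPDIR

# Load the same compilers and MPI used to build the code, plus cuda
module load nvhpc cuda cray-mpich

# 2 nodes x 4 ranks/node = 8 MPI ranks, one GPU available per rank
mpiexec -n 8 --ppn 4 ./executable_name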
Binding MPI ranks to CPU cores and GPU devices

For some GPU applications, you may need to explicitly control the mapping between MPI ranks and GPU devices (see man mpi). One approach is to manually control the CUDA_VISIBLE_DEVICES environment variable so a given MPI rank only “sees” a subset of the GPU devices on a node. Consider the following shell script:
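The script itself is not reproduced in this text, so the following is an illustrative sketch: a small wrapper (called set_gpu_rank here purely for illustration) that assigns each rank one GPU based on its node-local rank index. The PMI_LOCAL_RANK variable is an assumption about what the launcher provides; confirm the correct local-rank variable for your MPI environment.

#!/bin/bash
# set_gpu_rank (illustrative name): restrict each MPI rank to one GPU by
# mapping its node-local rank to a device index. PMI_LOCAL_RANK is assumed
# to be set by the MPI launcher.
export CUDA_VISIBLE_DEVICES=${PMI_LOCAL_RANK}
exec "$@"

The wrapper is placed between mpiexec and the application on the launch line, for example:

mpiexec -n 8 --ppn 4 ./set_gpu_rank ./executable_name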
The command above will launch a total of 8 MPI ranks across 2 nodes, using 4 MPI ranks per node, and each rank will have dedicated access to one of the 4 GPUs on the node. Again, see man mpi for other examples and scenarios. Binding MPI ranks to CPU cores can also be an important performance consideration for GPU-enabled codes, and can be done with the --cpu-bind option to mpiexec. For the above example using 2 nodes, 4 MPI ranks per node, and 1 GPU per MPI rank, binding each of the MPI ranks to one of the four separate NUMA domains within a node is likely to be optimal for performance. This could be done as follows:
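One possible form of that launch line is sketched below, assuming the illustrative set_gpu_rank wrapper from above and the numa keyword of the --cpu-bind option (check man mpiexec on the system for the binding keywords it actually supports):

# Bind each of the 4 ranks per node to its own NUMA domain, one GPU per rank
mpiexec -n 8 --ppn 4 --cpu-bind numa ./set_gpu_rank ./executable_name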