# **CUDA** Optimizations

WS 2014-15 Intelligent Robotics Seminar

## Table of contents

- Why GPUs?
- Your PC with GPU
- Understanding SM and memory hierarchies
- Understanding CUDA kernel launch
- Questions?

### It's all about real time

- Motion compliance < 1 ms
- Vision (30fps) < 33 ms
- Vision (60fps) < 16 ms

#### **Neural Networks**

- Each neuron in a neural network computes its own activation based on local information
- Learning algorithms continuously adapt the strength of the connections between neurons

#### Pre-processing

- The GPU accelerates some of the required pre-processing (e.g. vision processing)




## Global memory (off-chip GDDR5 RAM)

Figure: GPU block diagram with the PCI Express 3.0 host interface, the GigaThread engine, the memory controllers, and the multiprocessors.

- Global memory
  - Off-chip memory
  - Constant and texture memory are also allocated here
- SM (streaming multiprocessor)
  - Blocks of threads are scheduled on an SM (e.g. a group of 512 threads)
  - Shared memory can be shared between the threads of a block




## Fastest memory in rough order





| Memory   | Location (on/off chip) | Cached | Access | Scope                | Lifetime        |
|----------|------------------------|--------|--------|----------------------|-----------------|
| Register | On                     | n/a    | R/W    | 1 thread             | Thread          |
| Local    | Off                    | †      | R/W    | 1 thread             | Thread          |
| Shared   | On                     | n/a    | R/W    | All threads in block | Block           |
| Global   | Off                    | †      | R/W    | All threads + host   | Host allocation |
| Constant | Off                    | Yes    | R      | All threads + host   | Host allocation |
| Texture  | Off                    | Yes    | R      | All threads + host   | Host allocation |

† Cached only on devices of compute capability 2.x
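
The rows of the table map onto CUDA declarations roughly as follows. This is a minimal sketch, not code from the seminar: the kernel `scale_and_reverse`, the symbol `scale`, the helper `run_scale_and_reverse`, and the block size of 256 are all illustrative assumptions. Local memory does not appear because the compiler decides when to use it.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256  // assumed block size for this sketch

// Constant memory: read-only inside kernels, written by the host.
__constant__ float scale;

// Reverses each BLOCK-sized chunk of `in` and scales it.
// Assumes the launch uses BLOCK threads per block and n is a multiple of BLOCK.
__global__ void scale_and_reverse(const float *in, float *out)  // in/out: global memory
{
    __shared__ float tile[BLOCK];      // shared memory: one copy per block
    int t = threadIdx.x;               // register: private to the thread
    int base = blockIdx.x * BLOCK;     // register

    tile[t] = in[base + t];            // global -> shared
    __syncthreads();                   // make the whole tile visible to the block
    out[base + t] = scale * tile[BLOCK - 1 - t];
}

// Hypothetical host helper: n must be a multiple of BLOCK.
void run_scale_and_reverse(const float *h_in, float *h_out, int n, float s)
{
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));   // global memory, lifetime set by the host
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(scale, &s, sizeof(float));
    scale_and_reverse<<<n / BLOCK, BLOCK>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

The shared array lives only as long as its block, while the two `cudaMalloc` buffers persist until the host frees them, which matches the Lifetime column.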




**Which of these are on-chip memory on the GPU?**

- Host memory (RAM)
- Registers ✓
- Shared memory ✓
- Global memory

**Can threads in different blocks access the same shared memory?**

- Yes
- No ✓

**Order the memories by speed (1 = fastest)**

- Host memory (RAM): 5
- Registers: 1
- Global memory (GPU memory): 4
- Constant and texture memory: 3
- Shared memory: 2

**Which of these memories are persistent?**

- Registers
- Global memory (GPU memory) ✓
- Constant and texture memory ✓
- Shared memory

**Except for constant and texture memory, all other memories are R/W**

- Yes ✓
- No




### Categorized optimization strategies



- Data Transfer Between Host and Device
  - Pinned Memory
  - Asynchronous and Overlapping Transfers with Computation
  - Unified Virtual Addressing
- Device Memory Spaces
  - Coalesced Access to Global Memory
  - Shared Memory
  - Local Memory
  - Texture Memory
  - Constant Memory
  - Registers

## Goal for memory optimizations

- Maximize the use of the hardware by maximizing bandwidth
- Bandwidth is best served by using as much fast memory and as little slow-access memory as possible

What follows next

- A discussion of the various kinds of memory on the host and device, and of how best to set up data items to use the memory effectively






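## Data Transfer Between Host and Device

- Peak bandwidth between device memory and the GPU (177.6 GB/s) is far higher than peak bandwidth between host memory and device memory (8 GB/s over PCIe)
- To minimize host-device transfers, it is fine to run some kernels on the GPU even if they do not demonstrate any speedup by themselves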






## Pinned Memory

- Page-locked (pinned) memory transfers attain the highest bandwidth between the host and the device
- Overusing pinned memory can reduce overall system performance, since it is a scarce resource
- Pinning memory is a heavyweight operation
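
As a rough illustration of these points, the sketch below stages a transfer through page-locked buffers allocated with `cudaMallocHost`. The helper name `pinned_round_trip` is hypothetical, and the function only checks that the data survives the round trip; it does not measure bandwidth.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Copies n floats host -> device -> host through pinned staging buffers.
// Returns true if every call succeeded and the data came back unchanged.
bool pinned_round_trip(const float *src, int n)
{
    size_t bytes = n * sizeof(float);
    float *h_in = nullptr, *h_out = nullptr, *d_buf = nullptr;

    // Page-locked host allocations: these are the heavyweight calls.
    bool ok = cudaMallocHost(&h_in, bytes) == cudaSuccess
           && cudaMallocHost(&h_out, bytes) == cudaSuccess
           && cudaMalloc(&d_buf, bytes) == cudaSuccess;

    if (ok) {
        std::memcpy(h_in, src, bytes);
        ok = cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess
          && cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
          && std::memcmp(h_in, h_out, bytes) == 0;
    }

    if (d_buf) cudaFree(d_buf);
    if (h_out) cudaFreeHost(h_out);
    if (h_in)  cudaFreeHost(h_in);
    return ok;
}
```

Because pinning is expensive, a real application would normally allocate such buffers once and reuse them for many transfers rather than allocate them per call as this sketch does.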


Let's assume that you are doing some processing on an image.

In which scenarios would you use pinned memory?

- A] You have very limited host memory (RAM)
- B] The image processing algorithm running on the GPU has many steps to perform on the image
- C] Your application requires the processed image to always be available to the host CPU


## Unified Virtual Addressing

- Internally manages the address spaces and does the necessary memory transfers
- Coding simplicity and rapid prototyping
- Future compatibility
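
One concrete convenience of a unified address space is that `cudaMemcpy` can infer the copy direction from the pointers when given `cudaMemcpyDefault`. The sketch below assumes device 0 and uses a hypothetical helper name, `uva_round_trip`; it illustrates that single feature and is not a full account of what unified addressing offers.

```cuda
#include <cuda_runtime.h>

// Copies n floats from src (host) to the device and back into dst (host),
// letting the runtime infer both directions from the pointer values.
bool uva_round_trip(const float *src, float *dst, int n)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || !prop.unifiedAddressing)
        return false;  // device 0 does not share an address space with the host

    size_t bytes = n * sizeof(float);
    float *d_buf = nullptr;
    if (cudaMalloc(&d_buf, bytes) != cudaSuccess)
        return false;

    // No explicit HostToDevice / DeviceToHost: the direction is inferred.
    bool ok = cudaMemcpy(d_buf, src, bytes, cudaMemcpyDefault) == cudaSuccess
           && cudaMemcpy(dst, d_buf, bytes, cudaMemcpyDefault) == cudaSuccess;

    cudaFree(d_buf);
    return ok;
}
```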


## **Coalesced Access to Global Memory**

- Memory loads and stores by the threads of a warp are coalesced into as few transactions as possible
- RAM is designed for batch access, and we can take advantage of that in programming
- We will see what happens to coalesced access to global memory when
  - 1] we change the offset
  - 2] we change the stride




## 1] With different offset



Figure: Coalesced access, where all threads of a warp access one cache line (addresses from a warp, axis from 0 to 384 bytes).

Figure: Unaligned sequential addresses that fit into two 128-byte L1 cache lines.
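
The offset experiment can be sketched as a copy kernel whose index is shifted by `offset` elements. The names `offset_copy`, `lines_per_warp`, and `time_offset_copy` are illustrative assumptions, as are the 4-byte floats, 32-thread warps, 128-byte lines, and the 128-byte-aligned base pointer that the line count relies on.

```cuda
#include <cuda_runtime.h>

// Each thread copies one float, shifted by `offset` elements.
// With offset == 0, a warp reads 32 consecutive floats = one aligned 128-byte line.
__global__ void offset_copy(float *out, const float *in, int offset, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    if (i < n) out[i] = in[i];
}

// Number of 128-byte lines touched by the first warp for a given offset.
int lines_per_warp(int offset)
{
    int first_byte = offset * 4;
    int last_byte  = (offset + 31) * 4 + 3;
    return last_byte / 128 - first_byte / 128 + 1;
}

// Hypothetical helper: time one launch over n floats, in milliseconds.
float time_offset_copy(int offset, int n)
{
    size_t bytes = (size_t)(n + offset) * sizeof(float);
    float *in, *out;
    cudaMalloc(&in, bytes);
    cudaMalloc(&out, bytes);
    cudaMemset(in, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256, blocks = (n + threads - 1) / threads;
    offset_copy<<<blocks, threads>>>(out, in, offset, n + offset);  // warm-up
    cudaEventRecord(start);
    offset_copy<<<blocks, threads>>>(out, in, offset, n + offset);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return ms;
}
```

Calling `time_offset_copy` for offsets 0 to 32 and comparing the times is one way to reproduce the experiment; how large the penalty is depends on the device and its caches.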




Assume we are working on a float image of size 500 × 500.

Will it be a problem? If yes, what is the solution?
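
One way to read the question: a row of 500 floats occupies 2000 bytes, which is not a multiple of 128, so every row after the first starts in the middle of a 128-byte line. A common remedy is to pad each row, for example with `cudaMallocPitch`. The sketch below assumes that remedy; the kernel `add_to_image` and the helper `pitched_image_demo` are hypothetical names.

```cuda
#include <cuda_runtime.h>

// Adds `delta` to every pixel of a pitched width x height float image.
__global__ void add_to_image(float *img, size_t pitch, int width, int height, float delta)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // pitch is in bytes, so step between rows through a char pointer
        float *row = (float *)((char *)img + y * pitch);
        row[x] += delta;
    }
}

// Runs the kernel on a padded copy of h_img and copies the result back.
// Returns the row pitch in bytes chosen by the runtime, or 0 on failure.
size_t pitched_image_demo(float *h_img, int width, int height, float delta)
{
    float *d_img = nullptr;
    size_t pitch = 0;
    size_t row_bytes = width * sizeof(float);
    if (cudaMallocPitch(&d_img, &pitch, row_bytes, height) != cudaSuccess)
        return 0;

    cudaMemcpy2D(d_img, pitch, h_img, row_bytes, row_bytes, height,
                 cudaMemcpyHostToDevice);

    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    add_to_image<<<grid, block>>>(d_img, pitch, width, height, delta);

    cudaMemcpy2D(h_img, row_bytes, d_img, pitch, row_bytes, height,
                 cudaMemcpyDeviceToHost);
    cudaFree(d_img);
    return pitch;
}
```

The returned pitch is at least the 2000 bytes of real data per row; the exact padding is chosen by the runtime and differs between devices.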




## 2] With different stride



Figure: Adjacent threads accessing memory with a stride of 2.
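
The strided pattern can be sketched in the same way as a copy kernel. The names `stride_copy`, `lines_per_warp_strided`, and `run_stride_copy` are illustrative, and the line count again assumes 4-byte floats, 32-thread warps, 128-byte lines, an aligned base pointer, and a stride between 1 and 32.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Thread k copies element k * stride, so neighbouring threads are `stride` floats apart.
__global__ void stride_copy(float *out, const float *in, int stride, int n)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}

// 128-byte lines touched by the first warp for strides 1..32.
int lines_per_warp_strided(int stride)
{
    int span = (31 * stride * 4 + 3) / 128 + 1;  // lines between first and last byte
    return span < 32 ? span : 32;                // never more than one line per thread
}

// Hypothetical helper: runs the kernel and checks that exactly the strided
// elements were copied. Returns true on success.
bool run_stride_copy(int stride, int threads_total)
{
    int n = threads_total * stride;
    size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = (float)(i + 1);

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, bytes);

    int threads = 256, blocks = (threads_total + threads - 1) / threads;
    stride_copy<<<blocks, threads>>>(d_out, d_in, stride, n);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; i < n; ++i) {
        float expected = (i % stride == 0) ? h_in[i] : 0.0f;
        if (h_out[i] != expected) return false;
    }
    return true;
}
```

With a stride of 2 the warp touches two lines but uses only half of each, so half of the fetched bandwidth is wasted; with a stride of 32 every thread touches its own line.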




Assume we are working on a float image of size 512 × 512. Will the access pattern below pose a problem? What is the stride in this case?






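## Shared Memory

How many global memory transactions will be needed?

- Total pixels to be calculated = 9
- Transactions per pixel = 9
- Hence total transactions = 9 × 9 = 81

How many global transactions with shared memory?

- 25, because the 3 × 3 block of output pixels together with its one-pixel border is a 5 × 5 tile, and each of its pixels is loaded only once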
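
The count of 25 corresponds to a kernel in which a 5 × 5 block of threads loads one pixel each into shared memory and the inner 3 × 3 threads compute the outputs. The sketch below assumes a 3 × 3 mean filter and clamped borders, neither of which is stated on the slide; `box3x3_shared` and `run_box3x3` are hypothetical names.

```cuda
#include <cuda_runtime.h>

#define TILE 3                      // output pixels per block side, as in the quiz
#define RADIUS 1                    // 3 x 3 neighbourhood
#define SPAN (TILE + 2 * RADIUS)    // 5: tile side including the border

__global__ void box3x3_shared(const float *in, float *out, int width, int height)
{
    __shared__ float tile[SPAN][SPAN];

    int tx = threadIdx.x, ty = threadIdx.y;
    int x = blockIdx.x * TILE + tx - RADIUS;
    int y = blockIdx.y * TILE + ty - RADIUS;

    // Clamp to the image (an assumption; border handling is not specified).
    int cx = min(max(x, 0), width - 1);
    int cy = min(max(y, 0), height - 1);

    tile[ty][tx] = in[cy * width + cx];   // the only global read: one per thread, 25 per block
    __syncthreads();

    bool inner = tx >= RADIUS && tx < SPAN - RADIUS
              && ty >= RADIUS && ty < SPAN - RADIUS;
    if (inner && x < width && y < height) {
        float sum = 0.0f;
        for (int dy = -RADIUS; dy <= RADIUS; ++dy)
            for (int dx = -RADIUS; dx <= RADIUS; ++dx)
                sum += tile[ty + dy][tx + dx];   // 9 reads, all from shared memory
        out[y * width + x] = sum / 9.0f;
    }
}

// Hypothetical host helper that filters a width x height image.
void run_box3x3(const float *h_in, float *h_out, int width, int height)
{
    size_t bytes = (size_t)width * height * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    dim3 block(SPAN, SPAN);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    box3x3_shared<<<grid, block>>>(d_in, d_out, width, height);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

A block of 25 threads is far too small for real use; the tile size here only mirrors the numbers in the quiz, and practical kernels use larger tiles such as 16 × 16.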




## Local memory

- An SM has limited register space
- Automatic variables (e.g. local variables) are allocated in registers
- If there are not enough registers, local memory is used instead. This is called **register spilling**.
- Local memory resides in global memory and hence is slow
- After compilation, the nvcc compiler can report local memory usage (e.g. with `--ptxas-options=-v`). Avoid local memory if possible.


## **Texture memory**

- The read-only texture memory space is cached
- The texture cache is optimized for 2D spatial locality
- In some cases it is an advantageous alternative to reading device memory from global or constant memory
- The hardware provides other capabilities when textures are fetched using tex1D(), tex2D(), or tex3D() rather than tex1Dfetch()
  - Filtering
  - Normalized texture coordinates
  - Automatic handling of boundary cases


## **Constant memory**

- There is a total of 64 KB of constant memory on a device
- The constant memory space is cached
- In some cases it is an advantageous alternative to reading from global memory
- The constant cache works best when the threads of a warp access only a few distinct locations
- If all threads of a warp access the same location, constant memory can be as fast as a register access
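
A typical fit for this access pattern is a small table of filter coefficients: in each loop iteration every thread of a warp reads the same `coeff[k]`. The sketch below assumes a 5-tap filter with zero padding at the ends; `coeff`, `fir5`, and `run_fir5` are hypothetical names.

```cuda
#include <cuda_runtime.h>

#define TAPS 5

// Filter coefficients in constant memory, written by the host.
__constant__ float coeff[TAPS];

__global__ void fir5(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < TAPS; ++k) {
        int j = i + k - TAPS / 2;
        if (j >= 0 && j < n)
            acc += coeff[k] * in[j];  // all threads of the warp read the same coeff[k]
    }
    out[i] = acc;
}

// Hypothetical host helper: uploads the coefficients and filters n samples.
void run_fir5(const float *h_coeff, const float *h_in, float *h_out, int n)
{
    size_t bytes = n * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(coeff, h_coeff, TAPS * sizeof(float));

    int threads = 256, blocks = (n + threads - 1) / threads;
    fir5<<<blocks, threads>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

If the index into `coeff` depended on the thread index instead, the threads of a warp would read different constant addresses and the accesses would be serialized.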


## Registers

- Registers are the fastest memory space
- CUDA provides the capability to use small constant arrays
- Hardware instructions (shuffle) support sharing registers between the threads of a warp


#### \_\_shfl(var,1)



Shuffle instruction with constant srcLane broadcasts the value in a register from one thread to all threads in a warp


Figure: Shuffle up and shuffle down instructions illustrated.

The shuffle instruction is an advantage for the moving average filter algorithm:

$$v[i] = \frac{1}{5}\sum_{j=i-2}^{i+2} x[j]$$
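
A minimal sketch of this filter with shuffles follows. It uses the `__shfl_up_sync` and `__shfl_down_sync` forms, since the unsuffixed `__shfl` family was deprecated in CUDA 9. The zero padding at the ends, the one-warp-per-block layout, and the names `moving_average5` and `run_moving_average5` are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

#define FULL_MASK 0xffffffffu
#define OUT_PER_WARP 28   // lanes 2..29 write output; lanes 0, 1, 30, 31 only supply neighbours

// Launch with 32 threads per block (one warp per block).
__global__ void moving_average5(const float *x, float *v, int n)
{
    int lane = threadIdx.x;
    int j = blockIdx.x * OUT_PER_WARP + lane - 2;   // input index held by this lane
    float xj = (j >= 0 && j < n) ? x[j] : 0.0f;     // zero padding (assumption)

    // Every lane takes part in every shuffle; neighbours come from registers.
    float sum = xj;
    sum += __shfl_up_sync(FULL_MASK, xj, 1);        // x[j-1]
    sum += __shfl_up_sync(FULL_MASK, xj, 2);        // x[j-2]
    sum += __shfl_down_sync(FULL_MASK, xj, 1);      // x[j+1]
    sum += __shfl_down_sync(FULL_MASK, xj, 2);      // x[j+2]

    if (lane >= 2 && lane < 30 && j < n)
        v[j] = sum / 5.0f;
}

// Hypothetical host helper that filters n samples.
void run_moving_average5(const float *h_x, float *h_v, int n)
{
    size_t bytes = n * sizeof(float);
    float *d_x, *d_v;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_v, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

    int blocks = (n + OUT_PER_WARP - 1) / OUT_PER_WARP;
    moving_average5<<<blocks, 32>>>(d_x, d_v, n);

    cudaMemcpy(h_v, d_v, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_v);
}
```

Each warp loads 32 inputs once and produces 28 outputs, so apart from the four overlapping elements per warp every input is read from global memory a single time, without staging the data in shared memory.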

Universität Hamburg WS 2014-15 Intelligent Robotics Seminar Praveen Kulkarni






