A Cyberinfrastructure platform to meet the needs of data-intensive radio astronomy en route to the SKA

ADASS 2011 - Day 0

A few notes from the GPU tutorial held on the Sunday before the start of the conference.

GPUs in astronomy

- better FLOP/$
- better FLOP/Watt

Motivation comes from Moore's Law - i.e. more transistors - but around 2000 single cores stopped getting faster, so we now have multi-core chips. GPUs evolved from the computer games industry: RGB + alpha. Expectation of 10x-100x speed-up compared to a (single core) CPU.

Best results for: massive parallelism (needs a very big problem to reach peak), data parallelism (same computation over many values), high arithmetic intensity (many floating-point calculations per memory access - otherwise the code is memory-bound and slow).
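As a rough worked example of arithmetic intensity, we can count FLOPs per byte of memory traffic. The operation and numbers below are illustrative (a SAXPY-style `y = a*x + y` in single precision), not from the tutorial:

```python
# Rough arithmetic-intensity estimate for SAXPY: y = a*x + y.
# Per element: 1 multiply + 1 add = 2 FLOPs;
# memory traffic: read x (4 B), read y (4 B), write y (4 B) = 12 B.
def arithmetic_intensity(flops_per_elem, bytes_per_elem):
    """FLOPs per byte of memory traffic (higher is better on GPUs)."""
    return flops_per_elem / bytes_per_elem

print(round(arithmetic_intensity(2, 12), 3))  # ~0.167 FLOP/byte: memory-bound
```

At ~0.17 FLOP/byte, SAXPY is firmly memory-bound, which is why such simple transforms see modest GPU gains.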

Flynn's Taxonomy classifies architectures by their instruction and data streams: single instruction, single data (SISD) for a single core; single instruction, multiple data (SIMD) for a GPU; multiple instruction, multiple data (MIMD) for a distributed cluster.


Amdahl's Law: the expected speed-up of a parallelized implementation relative to the serial algorithm, assuming the problem size stays the same. Max speed-up = 1 / ((1-P) + P/N), where N is the number of processors, P is the parallel portion, and (1-P) is the serial portion.
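A quick sanity check of the formula in plain Python (numbers are illustrative only):

```python
def amdahl_speedup(P, N):
    """Maximum speed-up for parallel fraction P on N processors."""
    return 1.0 / ((1.0 - P) + P / N)

# Even with 95% of the work parallelized, 1000 processors give < 20x,
# because the 5% serial portion dominates.
print(round(amdahl_speedup(0.95, 1000), 1))  # 19.6
```

This is why the serial fraction, not the processor count, usually sets the ceiling on GPU gains.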

GPU jargon: streaming multiprocessors (SMs); inside these are the stream processors ("cores"). Quoted speed-ups are quite often for single precision - though nowadays this can mean double precision too.

Have to think in terms of threads. A multi-core CPU runs O(8) threads - one per core - while GPUs work best with O(10^{4}) threads; the hardware is optimized for creating and scheduling them cheaply. Threads are organised hierarchically: a grid is the set of threads created during a kernel invocation - a 3D array of blocks. A block is a 3D array of threads. A warp is a subset of threads with contiguous thread indices, assigned to a multiprocessor for execution.
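The index arithmetic a kernel uses to map this thread hierarchy onto data can be sketched in plain Python (1-D case; the names mirror CUDA's blockIdx, blockDim and threadIdx):

```python
# How a CUDA thread finds its data element (1-D case), sketched in Python.
# On the GPU this is: idx = blockIdx.x * blockDim.x + threadIdx.x
def global_index(block_idx, block_dim, thread_idx):
    """Global thread index from block/thread coordinates."""
    return block_idx * block_dim + thread_idx

# With blocks of 256 threads, thread 3 of block 2 handles element 515:
print(global_index(2, 256, 3))  # 515
```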

Code that runs on the CPU is the host code; the device code is the kernel (kernel function) that runs on the GPU. Before a kernel executes, memory has to be allocated on both the CPU and the GPU, and data transferred across. Transfer bandwidth, latency and memory space are fundamental limitations on GPU speed-up.
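A toy model (illustrative numbers, not benchmarks) of how transfer time erodes the raw kernel speed-up:

```python
def effective_speedup(cpu_time, gpu_compute_time, transfer_time):
    """Observed speed-up once host<->device transfers are counted."""
    return cpu_time / (gpu_compute_time + transfer_time)

# Hypothetical numbers: a 50x kernel speed-up collapses if transfers dominate.
cpu = 10.0   # seconds on the CPU
gpu = 0.2    # raw kernel time (50x faster)
xfer = 1.8   # time to move data over PCI Express
print(effective_speedup(cpu, gpu, xfer))  # 5.0: transfers ate most of the gain
```

This is the same structure as Amdahl's Law, with the transfer playing the role of the serial portion.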

Lots of different types of memory - managing them is the big issue when optimizing the code: global, constant, shared and register memory spaces. Use tools like the CUDA occupancy calculator.

Parallel programming is not as simple as serial programming. Not all computational problems are suited to GPUs.

Lots of if/then/elses is a disaster: GPUs are designed for throughput, not decision-making. See Barsdell et al. 2010.

Can GPUs help me?

No point diving in too quickly: look at existing libraries before writing code, and check whether a CPU speed-up might be possible first.

How to analyze a problem:
outline each step in the problem, identify well-known algorithms, analyze unique algorithms separately, identify bottlenecks, consider CPU-GPU memory transfers, decide on an implementation approach.

Proven GPU algorithms:


Transform, reduce (e.g. summing an image array, or element-wise multiplication) - speed-up of ~15x, but the arithmetic intensity (compute speed versus PCI Express transfer; higher = better) is only ~1/15x.

Interact - n-body, gravitational microlensing: ~50x speed-up and huge arithmetic intensity.

FFT - ~8x speed-up but only ~1x arithmetic intensity.

RNGs, Monte Carlo sims? Not sure.

Sorting - very high speed-ups.
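The transform and reduce patterns above can be sketched on the host side in plain Python; Thrust provides the GPU equivalents (thrust::transform, thrust::reduce). Data and gain value here are made up for illustration:

```python
# Transform (element-wise multiply) then reduce (sum): the host-side
# analogue of what thrust::transform / thrust::reduce do on the GPU.
image = [0.5, 1.5, 2.0, 4.0]          # toy "image" pixel values
gain = 2.0
scaled = [gain * px for px in image]   # transform step
total = sum(scaled)                    # reduce step
print(total)  # 16.0
```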

Note: you can of course transfer data at the same time as running kernels - transfers to and from the device can overlap with computation.

example: fitting galaxies.

Neighbouring threads need to access neighbouring memory (coalesced access) - one way to achieve this is to pre-sort the data.
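A minimal Python sketch of the pre-sorting idea (toy data; in a real code the reordering would be done once on the host before uploading to the device):

```python
# If thread i must read element order[i], scattered indices mean
# uncoalesced loads. Pre-sorting the access pattern (and reordering
# results to match) lets neighbouring threads hit neighbouring memory.
order = [7, 2, 5, 0]                        # scattered gather indices
data = [10, 11, 12, 13, 14, 15, 16, 17]
perm = sorted(range(len(order)), key=lambda i: order[i])
sorted_order = [order[i] for i in perm]     # monotonic indices: [0, 2, 5, 7]
gathered = [data[j] for j in sorted_order]  # reads now move forward in memory
print(sorted_order, gathered)
```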

List of CUDA libraries (proprietary unless noted as free/open source):

cuBLAS - dense linear algebra
Thrust (open source)
Magma - LAPACK (open source)
CULA - LAPACK, expensive
libJacket - ~all of the above, expensive (though an academic licence is ~$350)

80+ code examples in the SDK.

OpenCL or CUDA?

More libraries are available for CUDA. CUDA kernels support full C++; OpenCL kernels are C99 only. OpenCL can run on a broad range of hardware. CUDA must be installed manually, whereas OpenCL comes built in on OS X (everywhere else it must be installed). CUDA is proprietary. Note that C++ constructs do not help with some GPU processing.

CUDA is not significantly faster, but comparisons are only possible on NVIDIA hardware. OpenCL is portable to any hardware, though in practice this means a re-tune and possibly a re-write.

CUDA is better for developers; OpenCL is better for users.

Language options:

Apart from C or C++, there are bindings/libraries/tools now appearing in other languages:

python: pycuda, pyopencl, copperhead
matlab: Jacket, Parallel Computing Toolbox
java: jcuda, jocl
IDL: GPULib
perl: CUDA::Minimal