CYBERSKA
A cyberinfrastructure platform to meet the needs of data-intensive radio astronomy en route to the SKA
Accelerating the rate of astronomical discovery with GPU-enabled clusters
An incentive to keep playing computer games: the games market drives the GPU hardware astronomy now exploits.
Programmable pipeline - shaders via OpenGL - application programming interfaces: CUDA, OpenCL
First uses in astronomy: N-body forces, high arithmetic intensity - Nyland, Harris & Prins (2005), using Cg/OpenGL on an NVIDIA GPU (see the sketch below)
Adaptive optics wavefront reconstruction - Rosa et al. 2004
Commercial Off-The-Shelf (COTS) correlator, Schaaf & Overeem - ~5x better performance for a 16x bigger problem.
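A minimal sketch of the O(N^2) force sum behind that early N-body work (the sketch promised above): the original used Cg/OpenGL, so this is a modern CUDA-style equivalent with illustrative names. Each body's position is reused N times, giving the high arithmetic intensity noted above:

```cuda
#include <cuda_runtime.h>

// One thread per body i: accumulate softened gravitational acceleration
// from all other bodies. pos.w carries the mass.
__global__ void nbody_forces(const float4 *pos, float3 *acc, int n, float eps2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float3 a = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {
        float dx = pos[j].x - pos[i].x;
        float dy = pos[j].y - pos[i].y;
        float dz = pos[j].z - pos[i].z;
        float inv = rsqrtf(dx*dx + dy*dy + dz*dz + eps2);  // softening avoids r=0
        float s = pos[j].w * inv * inv * inv;              // m_j / r^3
        a.x += dx * s;  a.y += dy * s;  a.z += dz * s;
    }
    acc[i] = a;  // in units with G = 1
}
```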
An online abstract search found 94 papers with "GPU" in the abstract - the true number is surely much higher. Three classes of papers: methods, science results, and philosophy - most are methods papers. All report good things - is this a self-selecting bias?
The big journals are New Astronomy and MNRAS - encouragement to publish more of these.
NVIDIA cards dominate in the papers mentioned above.
Caution when quoting speed-ups: who spends time optimizing the CPU code used in a performance test? i.e. is your CPU code actually optimized? Is the speed-up single precision vs double precision? Did you use OpenMP on a multicore CPU?
Green500: the top 10 includes 4 GPU clusters.
At Swinburne: gSTAR - 130 Tflops, 123 GPUs.
Solve a bigger problem in the same wall time as a smaller problem on a CPU - terascale (petascale?) image processing/analysis. However, do we have enough memory? Typically only ~6 GB per GPU - is data transfer the bottleneck?
A GPU-based survey for millisecond radio transients (see: OERC)
Fermi cards: ~7 GFLOPS/Watt
Aim to process a 3.2 Gb/s radio telescope data stream on a single GPU, de-dispersing and detecting transient events.
Use a piece of software called MDSM as a wrapper around our GPU kernel...
Have to take a brute-force approach and calculate every dispersion measure. This means many data points in the (f,t) domain are reused many times - which is good for a GPU (see the sketch below).
The GPU code allows the dispersion measures to be calculated in real time. To match this with CPU code you need an 8x Xeon E7-8860 system with SSE (80 cores) - and you have to hand-write the assembly, since relying on the compiler to vectorise leaves it running ~80x slower. Cost: GPU ~£2,000, CPUs ~£55,000.
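A minimal sketch of that brute-force de-dispersion, assuming time-major data and a precomputed per-(trial, channel) delay table in samples; the layout and names are illustrative, not MDSM's actual kernel. Every input sample is read once per dispersion trial, which is exactly the data reuse that favours the GPU:

```cuda
// One thread per output time sample, one block row (blockIdx.y) per DM trial:
// sum each frequency channel at its dispersion-delayed time index.
__global__ void dedisperse(const float *data,  // [nsamp][nchan], time-major
                           float *out,         // [ndm][nsamp_out]
                           const int *delay,   // [ndm][nchan], in samples
                           int nchan, int nsamp_out)
{
    int t  = blockIdx.x * blockDim.x + threadIdx.x;  // output sample index
    int dm = blockIdx.y;                             // dispersion-measure trial
    if (t >= nsamp_out) return;

    float sum = 0.0f;
    for (int c = 0; c < nchan; ++c)
        sum += data[(t + delay[dm * nchan + c]) * nchan + c];
    out[dm * nsamp_out + t] = sum;
}
```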
GPUs win hands-down at the moment; AVX puts up a good fight; watch out for the Intel MIC chip.
Their GPU code runs at about 40% efficiency.
UDP stream to handle the data going in.
Spotting radio transients with the help of GPUs (Ben Barsdell)
High Time Resolution Universe (HTRU) survey - uses the Parkes 64 m radio telescope. The goal is to discover new pulsars and radio transients: 400 MHz bandwidth, 1024 frequency channels, 64 μs time resolution. Parkes has 13 independent beams.
Data goes to Swinburne, where they do RFI removal, then incoherent de-dispersion, then a signal search; the resulting list of candidates is sent back to the telescope for follow-up observations. Currently this takes >30 minutes per 10-minute observation. We would like to speed things up to real time: instant feedback and instant follow-up observations, plus the ability to trigger baseband data dumps for better insight. The plan is to remove the transfer step, so it is all done at the telescope.
No easy solution to RFI mitigation. Some things can be done: sigma clipping, spectral kurtosis, coincidence rejection.
Coincidence rejection: use the multi-beam receiver as reference antennas - assume the RFI is not localized on the sky, so expect to see it in all beams. A simple rule: e.g. >3 sigma in 4+ beams = RFI (see the sketch below). Or do a full eigen-decomposition across all beams. Embarrassingly parallel - it is just a transform.
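A minimal sketch of that simple coincidence-rejection rule, with one thread per time sample; the data layout, per-beam statistics, and all names here are assumptions, not the survey's actual code:

```cuda
#include <cuda_runtime.h>

// Flag a sample as RFI if it exceeds mean + thresh*sigma in >= minbeams beams:
// a celestial signal should appear in one beam, terrestrial RFI in all of them.
__global__ void coincidence_reject(const float *power,   // [nbeam][nsamp]
                                   const float *mean,    // [nbeam]
                                   const float *sigma,   // [nbeam]
                                   unsigned char *flag,  // [nsamp]
                                   int nbeam, int nsamp,
                                   float thresh, int minbeams)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nsamp) return;

    int hits = 0;
    for (int b = 0; b < nbeam; ++b)
        if (power[b * nsamp + t] > mean[b] + thresh * sigma[b])
            ++hits;
    flag[t] = (hits >= minbeams);   // e.g. thresh = 3, minbeams = 4
}
```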
De-dispersion works well on a GPU: high arithmetic intensity, good memory access patterns, and no branching.
Baseline removal can be ported to the GPU using a parallel prefix sum - there is a library for this (see the sketch below).
Matched filtering can also use a parallel prefix sum.
Sigma cut + peak finding can use a segmented-reduction algorithm, again from a library.
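A minimal sketch of the prefix-sum approach to baseline removal mentioned above: once a running sum is available, any window sum is two lookups, so subtracting a sliding-window mean costs O(1) per sample. Thrust stands in for "the library"; the window size and all names are assumptions:

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Subtract the running mean over [t-w, t+w] using the prefix sum s.
__global__ void subtract_mean(const float *x, const float *s, float *y,
                              int n, int w)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    int lo = max(t - w, 0), hi = min(t + w, n - 1);
    float win = s[hi] - (lo > 0 ? s[lo - 1] : 0.0f);  // window sum in O(1)
    y[t] = x[t] - win / (hi - lo + 1);                // remove local baseline
}

void remove_baseline(thrust::device_vector<float> &x, int w)
{
    int n = x.size();
    thrust::device_vector<float> s(n), y(n);
    thrust::inclusive_scan(x.begin(), x.end(), s.begin());  // S[t] = sum x[0..t]
    subtract_mean<<<(n + 255) / 256, 256>>>(
        thrust::raw_pointer_cast(x.data()),
        thrust::raw_pointer_cast(s.data()),
        thrust::raw_pointer_cast(y.data()), n, w);
    x = y;
}
```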
De-dispersion was sped up from 20 min to 2.5 min using the direct method on one Tesla C2050 GPU. The 10-minute goal is well within reach.
ESA's cloudscape (William O'Mullane)
US government: a big "cloud first" push.
Why cloud? Look at the commercial world: Amazon (AWS). Netflix has moved to Amazon.
Not much different from a grid, but:
- no gridware: I can just have the machine, hence no messing with security in my application. I can have any machine (within reason), i.e. Linux, Windows, other OSes. I pay per hour (cents per machine-hour).
Clouds come broadly in 3 forms:
platform as a service (e.g. Google)
infrastructure as a service (e.g. Amazon)
software as a service (e.g. Microsoft offering Office)
Take a look at: Eucalyptus
How long does it take you to procure a machine? Probably 6 months. With Amazon you can have one in 1 minute - and you get root access!
JackOfAllClouds.com:
Edge caching... Akamai, Highwinds - for getting lots of data around the world.
G-POD (Grid Processing on Demand) - can we put this on the Amazon cloud instead? They did, and it was very successful. They didn't put all the data on Amazon, as that costs a lot of money.
Corporate IT collaboration tools - "the yearly cost of WebEx is offset if 500 staff use WebEx instead of traveling once a year".
High-Performance computer infrastructure
How do we survive the data tsunami?
Delivering Montage - portable by design: ANSI-compliant C - an image mosaic engine workflow.
Pegasus workflow management system:
uses DAGMan as the workflow engine, which talks to the Condor schedd - the task manager.
Comparing Grids with Clouds.
Mass storage is very expensive on Amazon EC2.
Digging out Exoplanets on Academic clouds - using Kepler data and futuregrid.org
Remote Visualization of Large Multidimensional Radioastronomy Data Sets
CyberSKA project: to develop an infrastructure for the SKA. Collaboration portal...
Existing solution: transfer the file. Not easy for very large files, so use remote X11 and VNC - but this comes with disadvantages: permissions and security, resource allocation, integration with the web, and interactivity (latency).
So have a portal with a distributed file system, and the clients connect to the portal.
Last year at ADASS they showed a client-side visualization tool.
Movie showing all the features and fitting.
Questions from the audience: can it make slices or region files?
Can it update FITS files?
My thoughts: what about CASA integration? Could it be part of a pipeline? (This was raised by others too.)
ESO Data flow
Followed the Object Modeling Technique (OMT), developed by Rumbaugh around 1991.
Spitzer
exoplanets
The project followed a community observing programme.
Recently all reviews are done over the phone to save money.
Still get approximately 200 proposals a year.
Staffing has dropped from 200 people to 55.
Observing used to cost $5,000 per hour; now it is $3,000 per hour.
Will continue running until 2012 - they have applied for longer.
LOFAR scheduling (Alwin de Jong)
41 sensor stations. 2 PB temporary (~2 weeks) storage.
Need to do storage and computing production planning ~1 week in advance.
Supports parallel observations.
Capability to reserve system resources.
Does constraint checking + conflict resolving.
Gives early feedback on wrong specifications (i.e. systems that aren't working).
Does hardware failure monitoring and adaptation.
Publishes the schedule on the web.
Links to PostgreSQL and MySQL.
Qt4 for the GUI.
Object-oriented C++.
Uses Eclipse.
GPU computing applications (Dr Gernot Ziegler, NVIDIA UK)
3 types:
Visualization - Quadro
Parallel computing - Tesla
Personal computing - GeForce, Tegra
2006: first GPU with built-in compute features, 128 cores
2007: Tesla 8-series, 128 cores
2008: Tesla 10-series, 240 cores
2009: Tesla 20-series, "Fermi" - up to 512 cores
Either get the Tesla M-series as integrated CPU-GPU servers and blades (most interesting is the M2090: 512 cores, 6 GB, 177.6 GB/s)
or workstations with 2 to 4 Tesla GPUs.
up to 30K threads!
higher throughput = less power.
Tesla enables medical scans to be performed faster - and can reduce the CT radiation dose by up to 70x, i.e. the speed gain can instead be traded for a lower dose.
See the CUDA Zone.
Cross-correlation on GPUs - complexity is O(N^2) while memory traffic is O(N); achieves 79% of peak performance on Fermi - 1 Tflops on an M2090 (sketch below).
see github.com/mikeaclark/xGPU
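A minimal sketch of the cross-multiply step that gives those numbers: every antenna pair (O(N^2) work) is formed from the same O(N) input samples. This is illustrative only; xGPU's real kernel is heavily tiled and register-blocked to reach that fraction of peak:

```cuda
#include <cuComplex.h>

// One thread per baseline (i, j), upper triangle only: accumulate the
// visibility V_ij = sum over time of x_i * conj(x_j) for one channel.
__global__ void cross_correlate(const cuFloatComplex *x,  // [ntime][nant]
                                cuFloatComplex *vis,      // [nant][nant]
                                int nant, int ntime)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nant || j >= nant || j < i) return;

    cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
    for (int t = 0; t < ntime; ++t)
        acc = cuCaddf(acc, cuCmulf(x[t * nant + i],
                                   cuConjf(x[t * nant + j])));
    vis[i * nant + j] = acc;
}
```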
GPU tree-code - galaxy merger simulations.
GPU as a signal processor: massive memory bandwidth, 150 GB/s.
PCI Express can be a bottleneck, but computation and transfers can be overlapped.
CUDA C: you write the program (kernel) for a single data element; parallelization happens "in the background" (e.g. branching divergences are handled by the hardware). A minimal example follows below.
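A minimal example of that single-element programming model, using the classic SAXPY kernel: the code is written for one array element and the launch configuration spreads it over as many threads as there are elements:

```cuda
// y[i] = a*x[i] + y[i], one element per thread.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard: the grid may overshoot n
        y[i] = a * x[i] + y[i];
}

// Launch with enough 256-thread blocks to cover all n elements:
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```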
Scales with future GPUs (design intention: 2x speed-up of existing code per architecture generation).
Textures: GPU-specific technology that accelerates image and volume processing.
CUDA Toolkit: cuBLAS, cuFFT.
Many webinars available.
Quasi-real-time adaptive optics
How adaptive optics works - corrects for atmospheric turbulence in real time.
They use a language called Yorick - a bit like MATLAB or IDL, but open source.
Developed YOGA - works via dynamic linking of CUDA-C libraries (see the sketch below).
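A minimal sketch of that dynamic-linking pattern: wrap a CUDA-C kernel in an extern "C" entry point and build it as a shared object that an interpreted language like Yorick can load at run time. The function names and build line are assumptions, not the actual YOGA API:

```cuda
#include <cuda_runtime.h>

// Trivial kernel: scale an array in place on the device.
__global__ void scale_kernel(float *d, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= s;
}

// extern "C" keeps the symbol unmangled so dlopen()/dlsym() can find it.
extern "C" void yoga_scale(float *host, int n, float s)
{
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, host, n * sizeof(float), cudaMemcpyHostToDevice);
    scale_kernel<<<(n + 255) / 256, 256>>>(d, n, s);
    cudaMemcpy(host, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
}

// Build as a shared library, e.g.:
//   nvcc -shared -Xcompiler -fPIC -o libyoga.so yoga.cu
```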