Poster: A Distributed Memory GPU Implementation of the Boris Particle Pusher Algorithm
Paulo Tribolet Abreu (GoLP/Centro de Física dos Plasmas, Instituto Superior Tecnico, Lisbon, Portugal)
The Boris pusher is a numerical algorithm for advancing charged particles in an electromagnetic field. It is widely used in numerical simulations in plasma physics. This poster illustrates the implementation of the Boris pusher algorithm on a modern Graphics Processing Unit (GPU) with programmable shading capabilities, and explores the parallelization of the code on several GPUs.
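As an illustration of what a single Boris update involves, here is a minimal sketch in plain Python. It assumes the textbook non-relativistic form of the scheme (half electric kick, magnetic rotation, half electric kick) and runs on the CPU; the function names `cross` and `boris_push` and the argument layout are hypothetical and are not taken from the poster's GPU code.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def boris_push(x, v, E, B, q_over_m, dt):
    """Advance one particle by one time step (non-relativistic Boris scheme).

    x, v, E, B are 3-tuples; q_over_m is the charge-to-mass ratio.
    The fields are assumed to be already evaluated at the particle position.
    """
    h = 0.5 * q_over_m * dt
    # First half of the electric acceleration.
    v_minus = tuple(vi + h * Ei for vi, Ei in zip(v, E))
    # Rotation around the magnetic field.
    t = tuple(h * Bi for Bi in B)
    t2 = sum(ti * ti for ti in t)
    s = tuple(2.0 * ti / (1.0 + t2) for ti in t)
    v_prime = tuple(a + b for a, b in zip(v_minus, cross(v_minus, t)))
    v_plus = tuple(a + b for a, b in zip(v_minus, cross(v_prime, s)))
    # Second half of the electric acceleration.
    v_new = tuple(vi + h * Ei for vi, Ei in zip(v_plus, E))
    # Position update with the new velocity.
    x_new = tuple(xi + dt * vi for xi, vi in zip(x, v_new))
    return x_new, v_new
```

Each particle is updated independently of all others, which is why the scheme maps naturally onto one GPU thread per particle. In a pure magnetic field the rotation step leaves the particle speed unchanged up to rounding.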
Poster: SAR Signal Processing Using Graphic Processors
Jesus Alonso-Segoviano (Technical University of Madrid)
We propose a SAR data resolution model based on graphics processors, intended to improve on the runtime of conventional multiprocessor platforms by at least an order of magnitude.
Poster: The Influence of Binary Stars on Post-Collapse Evolution
Rosemary Apple (School of Mathematics, University of Edinburgh)
The results of the N-body simulations in Giersz and Heggie (1996) show that although the masses segregate as expected during core collapse, after core collapse there is self-similar evolution with very little further evidence of mass segregation, even though the system has not reached equipartition. Binary stars halt core collapse. It is possible that binary stars could also be the cause of the self-similar post-collapse evolution. To investigate this problem, we construct two models. One model contains mass segregation, but does not have a realistic approximation for the binary heating term. The other model has a more realistic binary heating term, but does not include mass segregation. We have examined these models in two cases: the case where there are assumed to be no binary stars and the case that includes a binary heating term. In both models, when binary stars are included we find the post-collapse evolution to be self-similar. The aim of our work is to combine these two models to form a new model which has both mass segregation and a realistic binary heating term.
Poster: Merging Neutron Stars and Black Holes
Richard Archibald (School of Mathematics, University of Edinburgh)
We present the results of fully three-dimensional, post-Newtonian simulations of the dynamical evolution of mergers between both neutron star binary systems and systems with a black hole and a neutron star. The hydrodynamical equations are integrated using the piecewise parabolic method (PPM) (Colella & Woodward, 1984) and the neutron star matter is described by the equation of state (EoS) of Shen et al. (1998a,b). We compare the results for various physical quantities using this EoS with previous simulations using the EoS of Lattimer & Swesty (1991) and investigate the implications for short period gamma-ray burst models.
Poster: When is Single Precision Good Enough?
Peter Behroozi (Stanford)
Current GPUs bring a whole new level of performance to astrophysical applications, but with the limit of single-precision (FP32) operations. Future GPUs will have double-precision (FP64) support, but will likely offer double performance on FP32 operations. We examine a range of astrophysical simulations (single-galaxy hydrodynamics, galaxy mergers, dark matter n-body simulations) and derive estimates for how long, how precise, and how stable FP32 simulations can be. For verification, we compare our theoretical estimates with a GPU port of a section of the ART/Hydro code.
Poster: Special, hardware accelerated, parallel SPH code for galaxy evolution.
Peter Berczik (Astronomisches Rechen-Institut, Univ. Heidelberg)
We present our first results from the recently developed parallel 3D SPH dynamical code for galaxy evolution. It follows the evolution of all basic components of a galaxy, such as dark matter, stars, and diffuse interstellar matter (ISM). Dark matter and stars are treated as collisionless N-body systems. The ISM is numerically described by a smoothed particle hydrodynamics (SPH) approach.
We perform our simulations on the recently built 32-node GRACE cluster at the Astronomisches Rechen-Institut (ZAH). This system is a new type of supercomputer based on standard PCs with GRAPE and a new kind of programmable special hardware (FPGA) cards called MPRACE [http://www.ari.uni-heidelberg.de/grace/].
The gravitational forces are calculated using the combined parallel TREE-GRAPE algorithms, which give us the expected speed of ~15 Gflop/s per node. Pipelines and pipeline tools for MPRACE have been developed for SPH forces and, as planned, are performing at the expected speed of ~4 Gflop/s per board.
Poster: Simulating MHD Turbulence on GPUs
Chi-kwan Chan (Harvard)
Shearing box simulations have been used extensively in the past two decades to study turbulent flows in astrophysics. These studies not only show that the magnetorotational instability (MRI) is able to drive turbulence in accretion disks, but also demonstrate the possible breakdown of the Boussinesq assumption in MRI-driven turbulence. In order to study the detailed properties of MHD turbulence, a large number of numerical experiments need to be carried out. These three-dimensional simulations are still computationally very expensive. GPUs have evolved much more rapidly than CPUs in the past few years and deliver an enormous amount of computing power. The development of CUDA makes programming GPUs more straightforward. The computationally intensive nature of hydrodynamics and MHD simulations makes them an ideal testbed for high performance computing on GPUs. I will present a newly developed pseudo-spectral algorithm for incompressible MHD and its implementation on GPUs. Preliminary results from the algorithm are also presented.
Talk: A Graphics Hardware-Accelerated Real-Time Processing Pipeline for Radio Astronomy
Kevin Dale (Harvard University)
(with Daniel Mitchell, Randall Wayth, Lincoln Greenhill, David Luebke, and Hanspeter Pfister)
This work explores the suitability of graphics processing units (GPUs) for real-time data processing for the Murchison Widefield Array (MWA) radio telescope. In this talk, I'll provide an overview of our GPU-based implementation of the major stages of array calibration and image formation for the MWA Real-Time System (RTS). Across the various stages, our single-GPU RTS implementation provides an average speedup over a single CPU of about 10x, with more than a 60x speedup for the most improved stage. In addition to performance, I will also discuss trade-offs between CPU- and GPU-based solutions for the MWA RTS in terms of hardware cost, scalability, and power consumption.
Talk: Real-time Digital Signal Processing for Radio Astronomy using GPUs
Paul Demorest (National Radio Astronomy Observatory)
I will describe results from a recent investigation into the use of GPUs (and CUDA) for real-time signal processing. This work focuses on coherent dedispersion for radio pulsar observations, but the results and benchmarks presented should be relevant to other radio/DSP applications as well. The GPU performance and development experience will also be compared with competing solutions such as fast multi-core CPUs and custom FPGA-based hardware.
Talk: GPU Applications at the University of Maryland
William Dorland (University of Maryland)
For a few years now, there has been an active research group in computer science at the University of Maryland focused on ideas for Tabletop TeraFlop computing with GPUs. We have an NSF-funded Beowulf cluster that has programmable GPUs on every node, available either for driving a wall of monitors (for high-end visualization ideas) or for computing. Over the last year, and largely due to successes with NVIDIA's CUDA library, this project has expanded to include scientists from non-computer science application areas, including scientists working in astrophysics. There are currently six faculty members + postdocs + graduate students meeting regularly to talk about GPU programming, and a novel GPU productivity assessment beginning (with DARPA funding), which involves 3 additional faculty + graduate students. In this talk, I will report on successful GPU implementations of algorithms in three areas of interest for the meeting: nonlinear pseudo-spectral PDE solvers commonly used for MHD/plasma turbulence; particle-in-cell simulations of kinetic turbulence; and an N-body solver based on the Fast Multipole Method, specially adapted for the NVIDIA GPU. In each case, correct solutions are produced with significant speedups compared to Intel Xeon processors. A working turbulence code runs at 25x speedup on real problems. The PIC code, which is our latest effort, shows a more modest 7x speedup. The FMM N-body solver accurately calculates the forces among 1M masses in 1 second on the NVIDIA 8800 GTX.
Poster: Middleware for Fortran 9X Programming on NVIDIA CUDA
Ramani Duraiswami (University of Maryland, College Park & Fantalgo, LLC)
Although the NVCC compiler provided with CUDA does compile host C code, it is mostly focused on producing software that runs on the GPU. While this is useful for developing small programs to run on the GPU, when GPUs are used for high performance computing they are more properly viewed as compute coprocessors, to which data from a large program running on the CPU host/cluster is farmed out. Of course, since host-GPU communication is relatively slow, back and forth data exchange should be avoided. Instead, our approach to GPU programming is to provide a high level language such as Fortran 9X with a set of functions that give it the ability to manipulate data on the GPU via a middleware library, and to augment the middleware functions with a small number of problem specific functions written in CU. We provide a Fortran module that allows manipulation of device variables, which have sizes and allocations on the GPU to provide high performance operations. We implement device variables as structures, which encapsulate information about the pointer, size, and other parameters, e.g., the type, dimension, leading dims, allocation status, etc. The module also allows wrapping of function calls. Overloaded functions suitable for use with different types, shapes, and optional parameters are developed. Several device functions, callable via wrappers, are also provided. These are for initializing variables, copying them, and performing other operations. The NVIDIA-provided CUBLAS/CUFFT functions are also encapsulated in a convenient overloaded syntax, which avoids bugs due to calling errors.
Two sample scientific computing applications were accelerated. In each case, we had original Fortran 90 code available, and we translated this original code to run on the GPU. The first application was from plasma turbulence, using a simplified but relevant 2D pseudospectral code that makes use of the wrapped CUFFT library. Computationally, the most important part of this code is the evaluation of the nonlinear evolution terms in a time stepping loop. A speedup of about 25 is achieved vis-à-vis the serial CPU code, executed on an Intel QX6700 CPU. The second application is from the fitting of radial basis functions to scattered data, using an iterative algorithm. This is representative of many applications in iterative methods that should see significant speedups. Here a speedup of 662 times over a serial CPU code is seen.
(joint work with Nail Gumerov & Bill Dorland)
Poster: Memory Layout in GPU Computing
Richard Edgar (University of Rochester)
(with A.C. Quillen & A. Moore)
We describe our early experiments in GPU computing. Starting with a very simple NBODY code, we made some simple modifications, and quickly obtained a 20x speed up for the acceleration evaluation. We then made some more changes, to optimise the usage of the GPU’s memory. This sped the code up by another factor of twenty, for an overall speed up of almost 400x the CPU.
Poster: GPU-based Integration of Planetary Systems
Eric B. Ford (University of Florida)
-
Talk: Fast Multipole Methods on Graphics Processors
Nail A. Gumerov (Fantalgo LLC, and University of Maryland, College Park)
Graphics Processors (GPUs) provide access to significant computational processing resources. They contain a large number of processing units with access to local and shared memory, and achieve significant speedups vis-à-vis CPUs on problems that can be mapped to their SPMD architecture. Many applications in molecular dynamics, astrophysics and other areas require the O(N^2) computation of mutual Coulombic potentials and forces among N particles. The FMM provides a hierarchical approximate algorithm to compute these quantities to a specified error ε at O(N log N) cost and memory. More generally, FMM-like algorithms are used to accelerate matrix-vector products in applications such as the solution of integral equations, radial basis function interpolation/evaluation and machine learning. On an NVIDIA 8800 GTX installed on a PC, our FMM algorithm achieves timings that, if computed using an O(N^2) algorithm, correspond to speeds of 25-45 Tflops (for achieved L2 errors of ~10^-6 to 2×10^-4).
(joint work with Ramani Duraiswami)
Talk: Internals of the CUNBODY-1 library: particle/force decomposition and reduction
Tsuyoshi Hamada (RIKEN)
The CUNBODY-1 (CUda N-BODY version 1) library is a C/C++/Fortran library that accelerates N-body interactions using NVIDIA's GPUs (GeForce 8800, etc.). Our library is implemented using an optimized algorithm which we call the Chamomile Scheme. In this talk, I present the internals of our library: how we implement particle decomposition, force decomposition and reduction. These are details, but they are important for practical AstroGPU applications.
Poster: GPU FX Spectrometer using CUDA
Chris Harris (The University of Western Australia)
The next generation of radio telescopes, such as the Square Kilometer Array and the associated "Pathfinder" arrays, require vast amounts of computation due to the extremely large number of interferometers and the imaging requirements. The hardware for this computation is becoming a significant consideration in array design, both in terms of initial cost and power consumption. Graphics Processing Units provide power efficiency and affordability as well as the flexibility of general purpose hardware. This work implements a GPU-based FX spectrometer, which processes four streams of 8-bit interferometer data, for a variable number of frequency channels. This approach scales well with the number of frequency channels, with a computation speed from 2 to 18 times faster than a CPU implementation. Further work is in progress to scale the algorithm with the number of interferometer streams, and to investigate optimisation of the GPU algorithm.
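The two stages of an FX spectrometer can be sketched as follows: the F stage Fourier-transforms each stream block by block into frequency channels, and the X stage cross-multiplies the spectra of every pair of streams (one of them conjugated) and accumulates the products. This is a minimal CPU sketch in plain Python with a naive DFT; the names `dft` and `fx_correlate` and the data layout are hypothetical and do not describe the poster's CUDA implementation.

```python
import cmath


def dft(block):
    """Naive discrete Fourier transform of one block of real samples."""
    n = len(block)
    return [sum(block[k] * cmath.exp(-2j * cmath.pi * f * k / n)
                for k in range(n))
            for f in range(n)]


def fx_correlate(streams, n_chan):
    """F stage then X stage for a list of equally long sample streams.

    Returns a dict mapping each stream pair (i, j) with i <= j to its
    accumulated cross-power spectrum (a list of n_chan complex values).
    """
    n_streams = len(streams)
    n_blocks = len(streams[0]) // n_chan
    acc = {(i, j): [0j] * n_chan
           for i in range(n_streams) for j in range(i, n_streams)}
    for b in range(n_blocks):
        # F stage: one spectrum per stream for this block.
        spectra = [dft(s[b * n_chan:(b + 1) * n_chan]) for s in streams]
        # X stage: multiply each pair of spectra channel by channel.
        for (i, j), spectrum in acc.items():
            for f in range(n_chan):
                spectrum[f] += spectra[i][f] * spectra[j][f].conjugate()
    return acc
```

The number of stream pairs grows quadratically with the number of streams, while the cost of the F stage grows only linearly, which is one reason scaling with stream count is a separate problem from scaling with channel count.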
Talk: GPUs in astronomical image processing
Robert Lupton (Princeton University)
-
Talk: GRAPE-DR
Junichiro Makino (National Astronomical Observatory of Japan)
GRAPE-DR (Greatly Reduced Array of Processor Elements with Data Reduction) is a single-chip SIMD massively parallel processor. It has evolved from GRAPE, a special-purpose computer for N-body simulation, in a way very similar to that followed by GPUs: hardwired pipelines are replaced by simple but programmable units. In this talk, I'll give an overview of the GRAPE-DR processor, its development status and programming environment. I will also briefly compare GRAPE-DR and GPGPU.
Poster: MHD and Gravity Algorithms
Jason Maron (American Museum of Natural History)
We present an assortment of GPU extensions for MHD and gravity algorithms.
Poster: GPU computing in IDL
Peter Messmer (Tech-X Corporation)
The Interactive Data Language (IDL) is a standard tool used by many researchers in astronomy and other observational fields. Present day solar missions like RHESSI or SOHO, or future missions, including the Solar Dynamics Observatory (SDO), almost exclusively analyze their data in IDL. However, the increasing complexity of image processing algorithms and the increasing size of datasets require higher computing power than offered by single processor workstations. The data-parallel nature of IDL programs is well suited for GPU computing. In this poster, we present a library to facilitate harvesting the computing power of GPUs from within IDL. In addition to IDL intrinsics on GPUs, this library offers routines for data transfer to and from IDL. This enables IDL users to quickly write and experiment with algorithms running on GPUs and to integrate them into existing IDL code. We will present the API, as well as benchmarks for solar data analysis codes.
Work supported by NASA SBIR Phase II Grant #NNG06CA13C.
Poster: High-Accuracy N-body Simulations on GPU
Keigo Nitadori (University of Tokyo)
Some CUDA implementations of the direct N-body calculation have demonstrated that a GPU can perform near its peak. However, the accuracy of these calculations has been limited to that of the single-precision floating-point operation. We have developed a new implementation which uses two single-precision words to represent the coordinates and the accumulators for acceleration and potential. All other operations are performed in single precision. In this way, we can achieve the same level of accuracy as GRAPE-4 or -6.
The new implementation supports both the standard 4th-order Hermite integrator and the recently developed 6th-order Hermite integrator. The latter allows more than double the step-size of the former with a small (~60%) extra calculation cost.
Combined with the block timestep algorithm, the 6th-order implementation achieved 120 Gflops on a GeForce 8800 GTX card. A variant implementation using SSE and OpenMP achieved 35 Gflops on a Core2 Quad Q6600 (2.4 GHz) processor. Here, we count 97 floating-point operations for one pairwise calculation of acceleration and its 1st and 2nd derivatives, which is not exactly equal to the real operation count.
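The idea of carrying an accumulator in two single-precision words can be illustrated with a small sketch. It emulates single precision on the CPU by rounding through the `struct` module and uses the standard error-free "two-sum" transformation to keep the rounding error of each addition in a second word. This is an assumption about how such a scheme can be built, not the poster's actual CUDA code; the names `f32`, `two_sum`, `ds_add` and `accumulate` are hypothetical.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a, b):
    """Single-precision sum of a and b plus the rounding error of that sum."""
    s = f32(a + b)
    bb = f32(s - a)
    err = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, err


def ds_add(hi, lo, x):
    """Add the single-precision value x to the two-word accumulator (hi, lo)."""
    s, e = two_sum(hi, x)
    e = f32(e + lo)
    hi_new = f32(s + e)
    lo_new = f32(e - f32(hi_new - s))  # what did not fit into hi_new
    return hi_new, lo_new


def accumulate(values):
    """Sum values with a plain single-precision and a two-word accumulator."""
    naive = 0.0
    hi, lo = 0.0, 0.0
    for x in values:
        x = f32(x)
        naive = f32(naive + x)
        hi, lo = ds_add(hi, lo, x)
    return naive, hi, lo
```

Summing 1.0 followed by many contributions far below single-precision resolution shows the difference: the plain accumulator never moves away from 1.0, while `hi + lo` tracks the small contributions. This mirrors the situation of accumulating many small pairwise accelerations onto a large running total.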
Poster: Direct N-body Computations on a GPU
Lars Nyland (NVIDIA Ltd.)
In this poster, we will present the performance results of implementing a direct n-body algorithm on an NVIDIA G80 GPU. We have achieved 240 GFLOPS of performance, calculating 12 billion pairwise interactions per second. We will show how we map the problem to the GPU, and discuss several optimizations that yielded a 50% improvement over our initial implementation.
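The two figures quoted above can be related with a line of arithmetic: 240 GFLOPS at 12 billion interactions per second corresponds to a count of 20 floating-point operations per pairwise interaction. The sketch below makes that explicit and estimates the time of one full force evaluation. The helper names are hypothetical, and counting all N × N pairs (including self-interactions) is an assumption made for simplicity.

```python
def flops_per_interaction(flops_per_second, interactions_per_second):
    """Operation count per pairwise interaction implied by two quoted rates."""
    return flops_per_second / interactions_per_second


def seconds_per_force_pass(n_bodies, interactions_per_second):
    """Time for one all-pairs force evaluation, assuming N * N interactions."""
    return n_bodies * n_bodies / interactions_per_second


implied_count = flops_per_interaction(240e9, 12e9)
pass_time = seconds_per_force_pass(16384, 12e9)  # N = 16384 is an example value
```

Quoted Gflops figures for direct N-body codes depend on the operation count adopted per interaction, so comparisons between implementations are more robust when made in interactions per second.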
Poster: GraCCA -- a Graphic Card Cluster for Astrophysics
Hsi-Yu Schive (National Taiwan University)
We have built a GPU cluster named GraCCA, which consists of 18 nodes, each equipped with two modern graphics cards, the NVIDIA GeForce 8800 GTX. The data transfer between different GPUs is accomplished by using the MPI library through a gigabit Ethernet switch. To demonstrate its performance in astrophysics computation, we have implemented a parallel N-body simulation with both shared and individual time-step algorithms on this system. The maximum measured performance is 7.1 TFLOPS for simulating a globular cluster of 1 million particles using 32 GPUs.
Poster: High-Performance Computation and Visualization of Plasma Turbulence on the GPU
George Stantchev (University of Maryland)
Direct numerical simulation (DNS) of turbulence is computationally very intensive and typically relies on some form of parallel processing. Spectral kernels used for spatial discretization are a common computational bottleneck on distributed memory architectures. One way to increase the efficiency of DNS algorithms is to parallelize spectral kernels using tightly-coupled SPMD multiprocessor units with minimal inter-processor communication latency. We present techniques to map moderately sized DNS calculations to modern Graphics Processing Units (GPUs), which are characterized by a very high memory bandwidth and hundreds of SPMD processors. We use the Hasegawa-Mima model to contrast and compare our GPU vs the associated CPU implementation of a basic plasma turbulence solver. We also demonstrate a prototype of a scalable computational steering framework based on turbulence simulation and visualization coupling on the GPU.
Talk: GPU Acceleration of Scientific Applications Using CUDA
John E. Stone (University of Illinois at Urbana-Champaign)
For many years graphics processing units (GPUs) have been an untapped computational resource for scientific computations due to limitations in the hardware and programming interfaces they provided. State-of-the-art GPUs and software development tools have begun to address these problems, expanding their applicability to scientific computation and easing integration with existing applications. We present an overview of several GPU-accelerated applications based on CUDA, performance results, and the key performance optimization techniques used in each case.
Poster: Using CUDA for Monte-Carlo simulations - Two Examples
Stefan Umbreit (Northwestern University)
The Monte-Carlo method for star cluster simulations has been established as a robust and fast alternative to direct N-body methods. It is especially suitable for studying the evolution of stellar systems with a large number of stars, such as globular clusters. This efficiency is due to the fact that the position of a star is not followed explicitly, but, instead, is randomly sampled according to the star's current orbital elements. This sampling procedure involves bracketing the root of a certain function on a grid, and this is the computationally most expensive step. Fortunately, this step can be parallelized, making it suitable for calculation on massively parallel hardware, such as the newest NVIDIA GPUs. Here we present an example of that root-bracketing algorithm implemented in CUDA. A second example shows how to generate 2D projections of 3D data produced with our Monte-Carlo simulations for comparison with observations. This procedure utilizes the CUDA BLAS library for the generation of 3D positions and velocities of the stars from their Monte-Carlo variables.
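Bracketing a root on a grid amounts to finding two neighbouring grid points between which the function changes sign. A minimal Python sketch follows; the abstract does not specify the function or the grid, so the function `f`, the grid, and the name `bracket_root` are placeholders chosen for illustration.

```python
def bracket_root(f, grid):
    """Return the index k of the first grid interval [grid[k], grid[k+1]]
    on which f changes sign (or touches zero), or None if there is none.
    """
    # The function evaluations are independent of each other, which is
    # the part that lends itself to parallel execution.
    values = [f(r) for r in grid]
    for k in range(len(grid) - 1):
        if values[k] * values[k + 1] <= 0.0:
            return k
    return None


# Example with a placeholder function whose root is sqrt(2).
example_grid = [0.5 * i for i in range(7)]  # 0.0, 0.5, ..., 3.0
example_k = bracket_root(lambda r: r * r - 2.0, example_grid)
```

Once the bracketing interval is known, any standard one-dimensional root finder can refine the root inside it.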
Poster: Next-Generation Radio Astronomy for Macintosh Workstations
Boyd WATERS (National Radio Astronomy Observatory)
The Common Astronomy Software Applications (CASA) are a suite of tools for the reduction and analysis of radio-astronomy data via a Python (IPython) interface. World-class radio telescopes currently under construction will challenge the data-processing capabilities of high-end workstations. Although we are exploring high-performance computing clusters, our users tell us they want to process data on personal workstations. Multiple-core optimization will be required, and utilization of the graphics processors is of significant interest.
Poster: Real-time calibration and imaging for the MWA
Randall Wayth (Harvard-Smithsonian Center for Astrophysics)
(with K. Dale, L. Greenhill, D. Mitchell, S. Ord and H. Pfister)
The Murchison Widefield Array (MWA) is a new 80-300 MHz synthesis radio telescope under construction in Western Australia. The telescope will generate 16 GB/s of raw visibilities, which must be processed in real time, including calibration of the instrument and ionosphere.
We describe the instrument, science goals and the main data processing steps of the real-time system. Much of the processing is particularly well suited to GPUs. We estimate that a GPU-based real-time supercomputer will be able to process the data with an order of magnitude fewer nodes than a traditional cluster.
Talk: Transforming Scientific Codes to Execute Efficiently on the IBM Cell Processor
Paul Woodward (LCSE, University of Minnesota)
(with Jagan Jayaraj, Pei-Hung Lin, and Pen-Chung Yew)
The Cell processor represents an extreme example of a multicore CPU and a very simple example of a GPU. The main features of interest for porting scientific codes to such devices are the number of independent cores on the chip, 8 for Cell, and the size of the private memory for each core, 256 KB for Cell. The ease and speed with which the work of these cores may be coordinated is an issue that we will ignore in this discussion. We choose to implement our multifluid PPM gas dynamics code so that it not only executes entirely on the Cell processor core, but the code itself resides permanently in the core’s local store on chip. This is accomplished by streaming data into and out of the local store asynchronously, using tiny data records corresponding to chunks of 8 (multifluid) or 64 (single fluid) grid cells. The computation of the entire code has been pipelined. Once the pipe is full, each new grid plane of 4, 8, 12, or 16 grid cells, depending upon the code version, is unpacked from its data record or records and completely processed. The result is a new such grid plane offset by 9 (single fluid) or 11 (multifluid) planes from the one just ingested. On the relatively rare occasions that archivable or visible output is requested, this too is generated within this pipeline while all the necessary data is on chip. For the single fluid PPM code, 62 flops are performed for each 32-bit word that is read into or written out of the local store, and each of the 8 Cell processor SPU cores delivers 5.23 Gflop/s at 3.2 GHz (83.7 Gflop/s aggregate performance for a dual-Cell blade). For two-fluid PPM, 39 flops are performed per main memory word accessed, and 3.43 Gflop/s is delivered by the SPU core. For the Intel Clovertown CPU core running at 3.0 GHz, 6.29 Gflop/s and 4.84 Gflop/s are obtained with these same two codes expressed in a highly transformed Fortran. 
The fully pipelined code expression is suitable for implementation on a series of cores, each performing only part of the code’s calculation, but this would require 10 to 20 times the data bandwidth between neighboring cores in the sequence as is needed from the main memory to the first one or from the last one back to the main memory. The code transformations into the fast Fortran expression and from there into Cell-specific C are being automated via simplified, domain-specific code translator utilities.
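The performance figures quoted for the Cell can be cross-checked with simple arithmetic. The sketch below reproduces the aggregate rate for a dual-Cell blade from the per-core rate and converts the quoted flops-per-word ratios into the data traffic each core needs to sustain. Treating every word as 4 bytes is taken from the "32-bit word" statement for the single fluid code and is an assumption for the two-fluid code; the constant and function names are hypothetical.

```python
SPU_CORES_PER_CELL = 8
CELLS_PER_BLADE = 2
BYTES_PER_WORD = 4  # 32-bit words (assumed for the two-fluid case as well)


def blade_gflops(per_core_gflops):
    """Aggregate rate of a dual-Cell blade from the per-SPU rate."""
    return per_core_gflops * SPU_CORES_PER_CELL * CELLS_PER_BLADE


def traffic_gb_per_s(per_core_gflops, flops_per_word):
    """Local-store traffic per core implied by a flops-per-word ratio."""
    return per_core_gflops / flops_per_word * BYTES_PER_WORD


single_fluid_blade = blade_gflops(5.23)           # quoted as 83.7 Gflop/s
single_fluid_traffic = traffic_gb_per_s(5.23, 62)
two_fluid_traffic = traffic_gb_per_s(3.43, 39)
```

Under these assumptions each SPU core needs only a few hundred MB/s of traffic to and from its local store, which illustrates why a high flops-per-word ratio lets the code run without being limited by main memory bandwidth.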
Talk: High Performance Direct Gravitational N-body Simulations on Graphics Processing Units
Simon Portegies Zwart (University of Amsterdam)
We present the results of gravitational direct $N$-body simulations using the Graphics Processing Unit (GPU) on a commercial NVIDIA GeForce 8800GTX designed for gaming computers. The force evaluation of the $N$-body problem is implemented in Compute Unified Device Architecture (CUDA), using the GPU to speed up the calculations. We tested the implementation on three different $N$-body codes: two direct $N$-body integration codes, using the 4th order predictor-corrector Hermite integrator with block time-steps, and one Barnes-Hut treecode, which uses a 2nd order leapfrog integration scheme. The integration of the equations of motion for all codes is performed on the host CPU. We find that for $N > 512$ particles the GPU outperforms the GRAPE-6Af, if some softening in the force calculation is accepted. Without softening and for very small integration time steps the GRAPE still outperforms the GPU. We conclude that modern GPUs offer an attractive alternative to GRAPE-6Af special purpose hardware. Using the same time-step criterion, the total energy of the $N$-body system was conserved better than to one in $10^6$ on the GPU, only about an order of magnitude worse than obtained with GRAPE-6Af. For $N \gtrsim 10^5$ the 8800GTX outperforms the host CPU by a factor of about 100 and runs at about the same speed as the GRAPE-6Af.
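The force evaluation that is offloaded to the GPU is, in essence, a double loop over all particle pairs. The following plain Python sketch shows that loop with a softening length added to the pair separation. The Plummer form of the softening, the unit choice G = 1, and the name `accelerations` are assumptions made for illustration; the abstract does not state which softening was used.

```python
import math


def accelerations(pos, mass, eps, G=1.0):
    """Direct-summation gravitational accelerations with softening eps.

    pos is a list of (x, y, z) tuples and mass a list of masses.
    Each particle i sums the contributions of all other particles j:
        a_i = G * sum_j m_j * (r_j - r_i) / (|r_j - r_i|^2 + eps^2)^(3/2)
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = dx[0] * dx[0] + dx[1] * dx[1] + dx[2] * dx[2] + eps * eps
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * mass[j] * dx[k] * inv_r3
    return acc
```

The outer loop over i carries no dependencies between iterations, so each particle's acceleration can be computed by a separate GPU thread, while the integrator that uses these accelerations stays on the host as described above.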