Challenging applications
Video stabilization, compression, decompression and processing; target tracking; image enhancement; sensor fusion; radar and lidar; encryption, decryption and cryptanalysis; software defined radio; and simulation. All of these are sophisticated military/aerospace embedded computing applications that challenge developers, not least because they require significant amounts of computing horsepower (Figure 1). More than that: increasingly, these applications must be deployed in confined spaces – such as unmanned aerial vehicles – where weight must be minimal, power consumption low, and environmental conditions harsh.
Figure 1: Video tracking is a typical military/aerospace application where GPGPU technology can deliver significant benefit.
Multicore/many core processors
General purpose processors have typically lacked the computational power to address this type of application. Indeed, an interesting phenomenon has emerged over the past few years: mainstream processor manufacturers have implicitly acknowledged that they are beginning to reach the limits of clock speed. In response, they have altered their architectural course, pursuing multicore/many-core architectures – beginning with devices such as Intel's Core 2 Duo – as a way of increasing performance within the constraints of clock speed. Intel dual core processors have been with us since 2006, quad core processors are becoming mainstream, and eight core devices are just around the corner. Using multiple cores is rapidly becoming the de facto approach to achieving optimum performance.
Historically, given the limitations of general purpose processors, applications such as those described previously have been addressed by two alternative processor technologies: FPGA and DSP. A third technology, however, is now becoming available that has three important advantages. GPGPU – general purpose computing on a graphics processing unit – provides the kind of massively parallel computing capability that challenging applications demand. It can deliver this leading edge computing performance in small size, low weight, low power consumption environments. And it does not suffer many of the drawbacks historically associated with FPGA and DSP technology.
The acknowledged leader in GPGPU is NVIDIA with its CUDA technology, and GE Intelligent Platforms is working with NVIDIA to bring CUDA capability to the military/aerospace market with a range of 6U and 3U VPX and OpenVPX products.
Reducing size, weight and power (SWaP)
A typical target application for these products can be found in a major US military program: a platform comprising 72 conventional processors on 18 6U boards that occupies four cubic feet, weighs over 100 lbs, consumes 2,000 watts of power and delivers 576 GFLOPS of computational performance. Such a system could reasonably be deployed in a warship or a large airframe – but not in a small, autonomous, long range unmanned vehicle. A CUDA-based system of like-for-like computational performance would realistically occupy 20% of the space, carry 10% of the weight and consume 10% of the power of that 18-board solution (Figure 2). Similarly, GE has demonstrated how four CUDA-enabled 3U VPX boards with a floating point performance of 766 GFLOPS can be deployed in less than 0.4 cubic feet.
Figure 2: A CUDA-based system like the GE MAGIC-1 (right) can occupy a fraction of the space of a traditional system and consume significantly less power, yet deliver comparable computational performance.
As an example of raw computing capability, a quad core Intel i7 processor would typically be benchmarked at around 50 GFLOPS. The NVIDIA GTX 480 – widely acknowledged as the highest performance GPU available today – benchmarks at around 1,300 GFLOPS. From that simple comparison, it is easy to see how one major military prime contractor managed to achieve a 15x increase in throughput by using CUDA technology to implement a radar application.
In fact, that 15x increase is modest in comparison to performance improvements that have been demonstrated elsewhere. Improvements of 140x have been obtained in a signal processing application by High Performance Computing of Sweden; of 300x in the signal processing element of a synthetic aperture radar application at Brigham Young University; and of over 250x in an image processing application by the Rochester Institute of Technology. CUDA is widely used in a broad range of commercial applications, including medical imaging, finance, mapping and vehicle telemetry.
Typical GPGPU applications
GPU technology is designed for applications that lend themselves to parallel processing – applications in which data can realistically be processed in parallel, as opposed to the sequential processing for which general purpose processors are designed. Because of their origins as devices designed to render images, GPGPUs are particularly well equipped to process single precision (SP) floating point data: the throughput of single precision floating point add, multiply and multiply-add operations is typically eight operations per clock cycle. Some integer operations can occur at the same rate, but others, such as multiplies, occur at two operations per cycle. GPGPU applications generally fall into a category known as Single Instruction, Multiple Threads (SIMT), in which an algorithm must be decomposed into an implementation that can run on hundreds to thousands of concurrent threads, executed one instruction at a time in lockstep across dozens to hundreds of processor cores, with each thread operating on a different piece of the data.
As such, GPU technology is not a replacement for today’s mainstream processors – rather, it is a complement to it. Typically – on a platform such as GE’s IPN250 (Figure 3), for example – the Intel host processor would offload parallel data processing to an onboard NVIDIA GPU. In fact, GE’s product line is designed to allow offload onto multiple GPUs, all operating in parallel (on a number of NPN240 boards, for example, each featuring two CUDA-enabled GPUs) – giving scope for hugely powerful solutions to be designed and implemented.
Figure 3: The GE IPN250 features an Intel host processor and an NVIDIA CUDA co-processor
GPGPUs excel with computationally-intensive applications
The reason that GPUs can offer so many processing cores – from 16 to several hundred – is that they trade functionality for quantity: die real estate that would otherwise be given over to more complex functions is instead used to implement many more cores, each capable of only relatively simple operations. General purpose processors, by definition, have an architecture designed to help solve as many different types of problem as possible, whereas GPUs are designed to solve, in effect, only one type of problem. The GPU core is simplified, and does not include the attributes normally associated with general purpose processors, such as large multi-level caches, out of order execution and branch prediction. If an application requires such functionality, it is ill-suited to implementation on a GPU.
Identifying appropriate applications is key to maximizing the benefit to be gained from GPGPU technology. For example, GPGPUs are very efficient at floating point calculation, provided that the problem being solved has certain characteristics. Broadly speaking, these are that the data sets are reasonably large, and that the computational intensity (i.e. the number of operations applied to each data point) is high. The pulse compression and Doppler processing typical of radar applications usually rely on Fast Fourier Transforms (FFTs), which meet the criteria of high computational intensity and thus make GPGPU an ideal technology solution (Figure 4).
Figure 4: Because radar relies heavily on FFTs, it is the kind of computationally intensive application for which GPGPU technology is ideally suited.
The CUDA development environment
The CUDA GPU architecture supports a particularly simple programming model. It allows the developer to write one source file, with code for the host processor and the GPU included in-line. C for CUDA extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed n times in parallel by n different CUDA threads, as opposed to only once like regular C functions. The toolchain splits the host and GPU code, compiles and links the parts separately, and then combines them into a single executable file. The result is code that is more easily constructed and analyzed, minimizing programming and maintenance cost. The toolchains for these many-core devices are now mature, having been in use by large communities of developers for several years. This relative ease and low cost of code development is one of the key advantages that CUDA technology has over FPGA technology: code can be developed on an inexpensive CUDA-enabled PC, for example, before being deployed on a rugged military platform. Over 100 million CUDA-enabled GPUs have been shipped to date, and the NVIDIA CUDA programming environment is now taught in over 120 universities throughout the world.
Other developers are using OpenCL, the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems. It provides a uniform programming environment in which software developers can write efficient, portable code for high performance compute servers, desktop computer systems and handheld devices, targeting a diverse mix of multi-core CPUs, GPUs, Cell-type architectures and other parallel processors such as DSPs.
The case for GPGPU technology
The case for GPGPU technology in demanding, sophisticated mil/aero applications is a persuasive one, and one which many defense organizations are actively evaluating.
What has been missing until now is the capability to transfer this powerful technology from commercial grade components in the lab to the rugged, extended temperature range, conduction cooled units that deployed mil/aero applications demand. With the availability of a broad range of rugged CUDA-enabled platforms from GE, that vital capability is now in place.
Using many-core processors can allow system designers to fit an unprecedented amount of processing power into a very compact package. The processing power, system size and power consumption advantages of GPGPU technology are compelling enough, but when combined with the ease of programming in the CUDA environment, particularly when compared with FPGAs, and the significantly lower costs that are involved, the benefits become overwhelming.
www.edco.co.il