One teraflop in either hand

This is crazy! When I went off to Sandia Labs with a fresh new PhD in my pocket, my main experience with scientific computing was the Commodore PET I had interfaced to my thesis project. I then had the opportunity to use Sandia’s four-processor Cray X-MP, probably the world’s fastest computer at the time, to carry out numerical simulations for my projects. The peak performance was (wait for it…) 800 megaflops! Of course, that level of performance was restricted to properly programmed problems that fit comfortably into a vectorized environment, but still…800 megaflops!

Well, time passed and things changed. The petaflop barrier was crossed three years ago, and the present supercomputer record is held by Fujitsu’s K computer at just over 10 petaflops, requiring nearly 90 thousand parallel-linked 8-core CPUs. That’s an increase of seven orders of magnitude in computational power, and I’m not even 60 yet! However, despite the incursion of multi-core configurations, the computational power of single chips has not grown at quite the same rate.

This situation has been considerably alleviated by Intel’s recent announcement of its Knights Corner coprocessor, a spin-off of Intel’s doomed Larrabee GPGPU project. This 50-core (native 80 cores) chip is fabricated using Intel’s 3D tri-gate 22nm process, and is capable of sustained double-precision performance past the 1 teraflop level. In comparison, the Intel six-core Core i7 processor can hit about 160 gigaflops. ASCI Red also provided a teraflop 15 years ago, but was the size of a house, and required about a megawatt for routine operations. Knights Corner as packaged is about four cm x six cm in size, and reading between the lines of several presentations, the clock rate appears to be a bit over one GHz. Its power requirement has not yet been revealed, but should be some tens of watts owing to the 22nm fabrication process.

Each of the 50 cores in Knights Corner is based on an early Pentium design with a 512-bit vector processor added to provide massive computational speed. There is no special-purpose graphics processing capability. The cores run an extended x86 instruction set, with the extensions primarily intended to operate the vector processor. The result should be straightforward programming of HPC applications as well as the ability to port PC Windows and Linux programs with a minimum of difficulty.
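For a rough sense of how 50 cores with 512-bit vector units add up to teraflop territory, here is a back-of-the-envelope sketch in Python. The core count and vector width come from the announcement; the clock rates are guesses around the "bit over one GHz" figure, and the two floating-point operations per lane per cycle (one fused multiply-add) are purely my assumption, not anything Intel has published. The function name `peak_gflops` is made up for this illustration.

```python
# Back-of-the-envelope peak double-precision rate for a many-core vector chip.
# Core count and vector width are from the announcement; everything else
# (clock rates, flops per lane per cycle) is an illustrative assumption.

def peak_gflops(cores, vector_bits, clock_ghz, flops_per_lane_per_cycle=2):
    """Return the theoretical peak double-precision rate in gigaflops.

    flops_per_lane_per_cycle=2 assumes one fused multiply-add per vector
    lane per cycle, which is an assumption rather than a published figure.
    """
    lanes = vector_bits // 64  # number of 64-bit doubles per vector register
    return cores * lanes * flops_per_lane_per_cycle * clock_ghz


# Try a few hypothetical clock rates around "a bit over one GHz".
for clock in (1.0, 1.25, 1.5):
    rate = peak_gflops(cores=50, vector_bits=512, clock_ghz=clock)
    print(f"{clock:.2f} GHz -> {rate:.0f} gigaflops peak")
```

Under these assumptions, the peak rate reaches one teraflop at 1.25 GHz. Sustained rates on real problems would of course depend on memory bandwidth and on how well the code vectorizes, which this sketch ignores.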

Not intended as a standalone chip, Knights Corner is positioned to function as a coprocessor to a conventional x86 processor (probably a Sandy Bridge Xeon E5). The Xeon processor and the Knights Corner coprocessor will communicate through PCI-Express 3.0 controllers. Having a fast pipe between processor and coprocessor is the key to high performance. This two-chip combination would have made the Top500 supercomputer list in 2005. Intel plans to market Knights Corner on a PCI-Express 3.0 card, among other approaches.

What has kept 50-80 core coprocessors off the commercial market to this point is power and heat. The initial goal of Intel’s Larrabee project was a 45nm graphics chip with 16 cores running at 2GHz, with a power requirement of 150W. At that power per core, an 80-core Knights Corner chip would have required about 750 watts. Not a pretty situation for designing desktop PCs or servers! Now that the 22nm process is available, the power requirement (at 1GHz) is down to a practical level of less than 100 watts.
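The 750 watt figure follows from scaling the Larrabee target linearly with core count. A minimal sketch of that arithmetic, assuming power grows in direct proportion to the number of cores at a fixed clock rate and process (a simplification, since shared logic and memory interfaces do not scale this way), with `scaled_power_watts` being a name invented for this example:

```python
def scaled_power_watts(ref_watts, ref_cores, target_cores):
    """Scale a reference chip's power linearly with core count.

    Assumes the same power per core, i.e. the same clock rate and
    fabrication process as the reference design.
    """
    watts_per_core = ref_watts / ref_cores
    return watts_per_core * target_cores


# Larrabee target: 16 cores at 150 W on the 45nm process.
print(scaled_power_watts(150, 16, 80))  # full 80-core die
print(scaled_power_watts(150, 16, 50))  # 50 active cores
```

The 80-core case gives 750 watts, and even with only 50 cores active the same scaling gives well over 400 watts, which is why the move to 22nm matters so much.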

It is worth taking a look at how NVIDIA-equipped systems and AMD systems may compete with the Xeon-Knights Corner (XKC) combination. First, remember that despite its origins, Knights Corner is not a GPGPU, nor does it have any dedicated graphics hardware. The arena for this competition is high-performance computing.

The XKC combo outperforms the current generation of NVIDIA systems: the Fermi-based Tesla M2090 has a peak computational rate of 0.65 teraflops, and sustained rates for many problems are considerably smaller. AMD has just introduced a line of 16-core Opteron chips operating at up to 2.6GHz that give a peak computational rate of about 0.17 teraflops, while their FireStream 9370 compute accelerator provides over half a teraflop. Note that this comparison is between systems on the market and a newly announced system which may be marketed in 2012.
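To put those numbers side by side, here is a small sketch that expresses each competitor's quoted rate as a fraction of a nominal 1 teraflop for Knights Corner. The figures are the ones quoted above; the 0.5 teraflop entry for the FireStream is a lower bound ("over half a teraflop"), and the comparison mixes Intel's sustained figure with the others' peak figures, so treat the ratios as rough.

```python
# Rates in teraflops, as quoted in the text above.
# The FireStream value is a lower bound ("over half a teraflop").
KNIGHTS_CORNER_TFLOPS = 1.0

competitors = {
    "NVIDIA Tesla (Fermi)": 0.65,
    "AMD 16-core Opteron": 0.17,
    "AMD FireStream 9370": 0.5,
}


def speedup_over(rate_tflops, reference_tflops=KNIGHTS_CORNER_TFLOPS):
    """Return how many times faster the reference is than the given rate."""
    return reference_tflops / rate_tflops


for name, rate in competitors.items():
    print(f"{name}: {rate:.2f} teraflops, "
          f"Knights Corner is about {speedup_over(rate):.1f}x faster")
```

The gap over the GPU-style accelerators works out to a factor of about 1.5 to 2 on these figures, a lead but hardly an insurmountable one.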

In terms of pure processing power, then, while Intel has passed a barrier that still blocks its competitors, it does not have anything like an insurmountable lead. Rather, it appears that Intel has been playing catch-up with NVIDIA and AMD. The main advantage offered by the XKC and its descendants is ease of programming. Researchers active in scientific HPC have suggested that programming their massively parallel computations in an expanded x86 instruction set would take one-tenth the time it does in CUDA.

In summary, while Intel’s Knights Corner coprocessor chip is a welcome addition to HPC capabilities, it is primarily a reflection of earlier movements by Intel’s competition. For better or worse, the computer hardware manufacturers have decided on a CPU-HPC processor-coprocessor model as the path into the foreseeable future.