Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).

Hello to our dear heterogeneous computing friends! Today we will discuss the latest news from the battlefield, and unfortunately it is not cheerful news.

For almost two years we have been obtaining all our results with a dozen trusty old AMD Radeon 7970 GHz Ed. cards. New scientific plans led us to choose a hardware platform for the next 1 to 1.5 years. In addition, the new HAWAII cards seemed inspiring. Let's run the benchmarks, do some calculations and draw conclusions!

The soldiers of Applied Science are interested in at least two major properties of a device: peak double-precision performance and global memory bandwidth.

We have tested the following accelerators (with approximate prices in Moscow, in US dollars):

- Radeon 7970 GHz Ed. (1010 MHz) (~$350)

- Radeon 7990 Reference (~$650)

- Radeon R9 290X Reference (~$700)

- GeForce TITAN Reference (~$1000)

All AMD cards were tested with the 12.6 and 14.1b drivers; the TITAN was tested with driver 331.38.

Here are the theoretical peaks of these devices and the best real performance (we used the synthetic OpenCL kernels from the clpeak project: https://github.com/krrishnarraj/clpeak):

After that, we ran a mini stress test: 100 iterations with 10 kernel launches in each iteration. The reported performance of an iteration is the average over its 10 launches. That's what Marsellus Wallace, er, the performance degradation looks like. Isn't it beautiful?
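The measurement loop described above can be sketched as follows. This is a minimal sketch, not the harness actually used for the post: `launch_kernel` is a hypothetical callable that is assumed to enqueue one kernel and block until it has finished, `flops_per_launch` is the known operation count of that kernel, and the per-iteration figure is taken as total work divided by total time over the 10 launches, which is one possible reading of "average of 10 launches".

```python
import time


def stress_test(launch_kernel, flops_per_launch, iterations=100, launches=10):
    """Return one GFlop/s figure per iteration, averaged over `launches`."""
    results = []
    for _ in range(iterations):
        start = time.perf_counter()
        for _ in range(launches):
            launch_kernel()  # assumed to block until the kernel has finished
        elapsed = time.perf_counter() - start
        results.append(launches * flops_per_launch / elapsed / 1e9)
    return results


def relative_drop(results):
    """Fraction of performance lost between the first and last iteration."""
    return 1.0 - results[-1] / results[0]


if __name__ == "__main__":
    # Stand-in CPU workload so the sketch runs without a GPU (hypothetical).
    def fake_kernel():
        sum(range(10_000))

    gflops = stress_test(fake_kernel, flops_per_launch=10_000, iterations=5)
    print(len(gflops), "iterations measured")
```

Plotting the returned list against the iteration index gives the kind of degradation curve discussed here, and `relative_drop` reduces it to a single number for comparing drivers.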

After looking at these diagrams, several questions arise.

1. Why does one half of a 7990 reach only 733 GFlop/s while a 7970 reaches 1007 GFlop/s? As far as we know, these chips are identical, with the same frequencies and the same number of SPUs.

2. Why does the kernel performance decrease on Tahiti as time goes by? This issue has appeared with every driver after 12.6, and the driver developers seem to be doing nothing about it.

3. Why is the HAWAII card so slow? Its DP/SP ratio of 1/8 (compared with 1/4 on Tahiti) left us frustrated. In addition, our non-optimized DGEMM kernels reach 650 out of 704 GFlop/s, which is 92%, an unrealistic figure for non-optimized code. We therefore hypothesize that the chip is capable of the full 1.4 TFlop/s and that this is limited artificially (by software?).
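The arithmetic behind question 3 can be written out explicitly. The sketch below assumes the usual peak formula (stream processors, times 2 flops per fused multiply-add, times clock, times the DP/SP ratio) and assumes 2816 stream processors at 1.0 GHz for HAWAII. Those two figures come from public spec sheets, not from this post, but with them the formula reproduces the 704 GFlop/s and 1.4 TFlop/s quoted above.

```python
def peak_dp_gflops(stream_processors, clock_ghz, dp_ratio):
    """Theoretical double-precision peak: one FMA (2 flops) per SP per cycle."""
    return stream_processors * 2 * clock_ghz * dp_ratio


# Assumed HAWAII (R9 290X) figures: 2816 SPs at 1.0 GHz.
hawaii_limited = peak_dp_gflops(2816, 1.0, 1 / 8)  # DP/SP = 1/8
hawaii_full = peak_dp_gflops(2816, 1.0, 1 / 4)     # hypothetical Tahiti-like 1/4

measured_dgemm = 650.0  # GFlop/s, from the post

print(f"peak at 1/8: {hawaii_limited:.0f} GFlop/s")
print(f"peak at 1/4: {hawaii_full:.0f} GFlop/s")
print(f"DGEMM efficiency vs 1/8 peak: {measured_dgemm / hawaii_limited:.0%}")
print(f"DGEMM efficiency vs 1/4 peak: {measured_dgemm / hawaii_full:.0%}")
```

Against a 1/4-rate peak the same 650 GFlop/s would be roughly 46%, which would fit non-optimized code better than 92% and is consistent with the suspicion voiced above. It does not prove it.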

The next chart contains theoretical peaks of global memory bandwidth:
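Theoretical bandwidth peaks of this kind are normally derived from the memory bus width and the effective per-pin data rate. A minimal sketch, assuming spec-sheet values that are not stated in this post (a 384-bit bus at 6.0 Gbit/s for the 7970 GHz Ed. and a 512-bit bus at 5.0 Gbit/s for the R9 290X):

```python
def peak_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Theoretical global memory bandwidth in GB/s."""
    return bus_width_bits / 8 * data_rate_gbps


# Assumed spec-sheet values, not taken from the post.
cards = {
    "Radeon 7970 GHz Ed.": (384, 6.0),
    "Radeon R9 290X": (512, 5.0),
}

for name, (bus_bits, rate_gbps) in cards.items():
    print(f"{name}: {peak_bandwidth_gbs(bus_bits, rate_gbps):.0f} GB/s")
```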

The real bandwidth of all devices was measured with GlobalMemoryBandwidth and MemoryOptimization tests from AMD APP SDK:

Important points:

1. In some respects the AMD drivers keep getting better, and these improvements greatly affect overall performance.

2. All AMD GPUs show excellent memory subsystem performance.

Unfortunately, we have to conclude that HAWAII is less suitable for scientific computing than Tahiti, and that the era of cheap GPGPUs for scientists has ended.

In our opinion, up to this point AMD hardware has been much better than hardware from NVidia. However, these advantages were neutralized by unstable drivers. The AMD drivers have now improved, but some old bugs remain (such as the performance degradation) and some new ones have appeared, so some codes that worked with 12.6 do not work with the new drivers.

We have also tested a FirePro card (based on the Tahiti architecture), and the situation was the same: unstable drivers and a lot of difficulties with multi-GPU systems.

So, the question is: what does AMD plan to do in the computational sector? Previously we had cheap and fast hardware, which could be used despite the awful drivers. At the moment the drivers are still not fully operational, and soon the FirePro cards will become very expensive (who will buy them?).

On the other hand, NVidia devices are not as good as AMD ones in some performance metrics, but they have brilliant software support. We have tested a lot of different NVidia GPUs (GeForce and Tesla) and never encountered serious problems.

Almost all computational problems we investigate are memory-bound, and in this respect all these GPUs are approximately equal. In real runs the AMD 7970s are usually better than the TITANs, so we decided to stay with the 7970. But we expect that NVidia will start selling something faster in a year or so.
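The point about memory-bound problems can be illustrated with a simple roofline estimate: attainable performance is the smaller of the compute peak and the bandwidth multiplied by the arithmetic intensity. The sketch below is illustrative only. The device names, peaks, bandwidths and the example intensity are assumptions, not measurements from this post.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline estimate: limited by compute peak or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Hypothetical inputs: (DP peak in GFlop/s, bandwidth in GB/s).
devices = {
    "Tahiti-like": (1000.0, 288.0),
    "HAWAII-like": (704.0, 320.0),
}
intensity = 0.25  # flops per byte for a streaming-type kernel (assumed)

for name, (peak, bandwidth) in devices.items():
    estimate = attainable_gflops(peak, bandwidth, intensity)
    print(f"{name}: {estimate:.0f} GFlop/s")
```

At such a low intensity both hypothetical devices sit far below their compute peaks, so the DP/SP ratio hardly matters and memory bandwidth decides the result.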

There is an opinion (not only ours) that if AMD does not change its dismissive attitude toward the GPGPU field, a lot of scientific researchers will start choosing NVidia.

From Russia with love,

Pavel,

Anton.
