ADDENDUM:

(Sun May 18 15:50:28 EDT 2003)

Bruce Dawes of Cygnus Software has written a detailed and ultimately more believeable account of the P4's problems with floating point exceptions. His paper is x86 Processors and Infinity. Note that Cygnus Software is not related to the former Cygnus Solutions, now owned by Red Hat, that works on GCC and Cygwin. Cygnus Software makes some apparently cool fractal software.

Doh! ERRATA:

This web page originally claimed that the timings were for a 500x500 random matrix. Someone pointed out that the supplied test code used a 200x200 matrix, so I guessed that I had done the timings on a 200x200 matrix. However, the timings were truly done on a 500x500 matrix. I've updated the test code to do 500x500 matrices, and re-corrected this webpage.

Robust Cholesky Performance on Intel's P4

"Robust Cholesky" is a small modification of a standard Cholesky decomposition. It was created to deal with numerically nasty matrices. In short, it finds a symmetric positive definite submatrix in any symmetric matrix and factors that. If none of this made sense to you, no problem: this page is about the horrendous performance of this algorithm on Intel's P4 cpu. The "bad" part of the algorithm, as far as the P4 is concerned, is the inside of a triply-nested loop that works something like

      for j in [0,i-1]:
      index = ivec_ref()
      val += dyv_ref(a,index) * dyv_ref(b,index)
    

where an ivec is a dynamic (length) integer vector and a dyv is a dynamic double vector.

Here are the timings for Robust Cholesky on a 500x500 randomly generated matrix. Note that the rng is seeded so each test uses the same matrix. I'll add other people's results as the come in, and provide their contact info when allowed. I used the "time" command, while others used other methods (including wall clocks); where you see a time with ".00" after it, it is very likely the decimal places are added for proper justification. After a run is completed, the test code should report "size of newgoodlist is 8".

Please be intelligent about how you interpret these results. In particular, note the compiler flags. Some of these tests use SSE and 3DNow!, both of which employ single-precision arithemtic. SSE2 uses double precision arithmetic, the usual for numerical codes. SSE, 3DNow!, and SSE2 are extensions to the i386 instruction set and are not portable. SS2 is, at present, only available on the P4. AMD states the upcoming Opteron processors will support Intel's SSE2 instructions.

If you're feeling overwhelmed by the table, just focus on the highlighted rows.

Arch Compiler Flags Time (s) Contact
P-II 266 MHz gcc 3.1 -O2 (implied) 710.93 mark7@andrew.cmu.edu
P-II 266 MHz gcc 3.1 -O6 -march=pentium2 -mcpu=pentium2 -ffast-math -fstrict-aliasing -fexpensive-optimizations -pedantic -DNDEBUG -pipe 212.18 mark7@andrew.cmu.edu
P-II 266 MHz gcc 3.1 -O6 -march=pentium2 -mcpu=pentium2 -ffast-math -fstrict-aliasing -fexpensive-optimizations -pedantic -fomit-frame-pointer -DNDEBUG -pipe 209.62 mark7@andrew.cmu.edu
P4 Xeon 1.7 GHz gcc 3.0.1 -O2 82.00 komarek.paul@gmail.com
P4 1.7 GHz MSVC 6 best of "fastest" and "smallest" 82.00 komarek.paul@gmail.com
Pentium MMX 200 MHz gcc 3.2 -O3 -fexpensive-optimizations 62.00 cleberlr@axpfep1.if.usp.br
P4 2.4 GHz gcc 3.2 -O2 54.00 cleberlr@axpfep1.if.usp.br
P4 2.6 GHz gcc 3.2 -O2 47.26 bstoneaz@cox.net
P-III 1.0 GHz MSVC 6 best of "fastest" and "smallest" 18.00 komarek.paul@gmail.com
P-III 1.0 GHz MSVC 6 Best of "fastest" and "smallest". Uses floats and SSE. 2.00 komarek.paul@gmail.com
G3 300MHz (Apple iBook) gcc 2.95.4 -O2 6.23 pgunn@cs.cmu.edu
Athlon C 600 MHz MSVC 6 best of "fastest" and "smallest" 2.00 komarek.paul@gmail.com
Alpha 21264 EV67 667 MHz gcc 2.96 -O2 2.00 komarek.paul@gmail.com
P4 1.7 GHz MSVC 7 (.NET) best of "fastest" and "smallest" with SSE2 < 1.00 komarek.paul@gmail.com
Athlon XP 1900+ gcc 2.95.3 -O2 0.88 komarek.paul@gmail.com
Athlon XP 1900+ gcc 3.2 -m3dnow -march=athlon-xp (uses floats and 3DNow!) 0.830 komarek.paul@gmail.com
Athlon XP at 10x166 MHz gcc 3.2 -s -static -O3 -fomit-frame-pointer -fexpensive-optimizations -Wall -falign -functions=4 -funroll-loops -malign-double -fschedule-insns2 -fprefetch-loop -arrays -march=athlon-xp --msse --mfpmath=sse 0.6 cleberlr@axpfep1.if.usp.br
P4 Xeon 1.7 GHz gcc 3.2 -static -O2 -mfpmath=sse -msse2 0.5 komarek.paul@gmail.com
P4 2.4 GHz gcc 3.2 -s -static -O3 -fomit-frame-pointer -fexpensive-optimizations -Wall -falign -functions=4 -funroll-loops -malign-double -fschedule-insns2 -fprefetch-loop -arrays -march=pentium4 --msse --mfpmath=sse 0.468 cleberlr@axpfep1.if.usp.br
P4 2.6 GHz gcc 3.2 -O2 -march=pentium4 -mfpmath=sse 0.35 bstoneaz@cox.net

At this point, it seems reasonable to draw some conlusions. The P4 fp unit just plain stinks (see next paragraph for clarification), but for simple multiply and add instructions you can work around it by using SSE2 instructions.

Anwar Rohillah and I have indepently tried catching NaNs and infs in the computations. Anwar converted them to zeros, and I treated them as bad rows. Both approaches created very fast code that produced "incorrect" answers. By incorrect I mean that our codes do not produce the same results as all of the other runs posted. Neither of us has checked the correctness of the result in terms of having found Cholesky factors of a postive definite submatrix. These tests suggest that the P4 fpu's Achille's heel is fp exception handling. Some people have suggested that the programmer should bear the burden here, but IEEE 754 and the adequate performance of everything except Intel's recent Pentiums suggests this isn't necessary.

Robin Humble performed an experiment on an unspecified MIPS core running GNU/Linux. His results show an alarming amount of system time consumed by the test code:

    % time rc.test
    size of newgoodlist is 7
    8.810u 41.140s 1:10.77 70.5%    0+0k 0+0io 121pf+0w
  

All other experiments submitted with system and user time, as well as those that I performed, show trivial amounts of system time relative to the user time.

Cleber Rodrigues did some experiments with the Intel C++ compiler 6.0 (demo). He reports that it produce incorrect results, in effect treating the entire random symmetric matrix as if it were positive definite with 500 "good" rows. At one time, you could find Cleber's binaries at http://www.cleberlr.hpg.com.br/p4test-icl.zip. However, that page is 404 now.

My personal conclusion is that the P4 is poorly-behaved technology that requires extra work from the programmer to achieve adequate performance. Higher performance and efficiency are available with less effort and, in some cases, less cost from other architectures.


My test code is rather long for test code, around 300 lines. I haven't had time to shorten it, but would appreciate a smaller example that distilled the miserable P4 performance we see on this test code. The test code is linked below. I'm also providing asm dumps for some platforms. I'll add asm dumps to this list if anyone sends one. Despite appearances to the contrary, the asm I show for the P4 had the same compile line (gcc -O2 -Wall -fverbose-asm -S rc.test.c) as for the Athlon and Alpha, with the only difference that the Alpha also needed -mieee when (eventually) linked. On no platform did we throw arch-specific optimization flags; in general, we try to avoid this sort of thing for (binary) portability reasons. This means that the rc.test.athlon.s and rc.test.p4.s are likely fairly similar, with differences coming from the compiler version.

File Notes Contact
rc.test.c Test code in ANSI C. komarek.paul@gmail.com
rc.test.alpha.s gcc 2.96 assembly dump on Alpha 21264 EV67. komarek.paul@gmail.com
rc.test.athlon.s gcc 2.95.3 assembly dump on Athlon XP. komarek.paul@gmail.com
rc.test.p4.s gcc 3.0.1 assembly dump on P4. komarek.paul@gmail.com

Please let me know if you find any inconsistencies, errors, or lies in this webpage. If you have sent results to me and I have failed to include them, email me again. =-)


Up to Computers and related stuff Home (komarix.org)
Created by Paul Komarek, komarek.paul@gmail.com