ICC and Mandelbrot
2010.01.04 in code
About a year ago, I implemented a smooth-coloring Mandelbrot generator in C on top of GMP (for high precision and speed) and GD (for image output). I spent a lot of time choosing GCC optimization options and hand-optimizing the code (let's play how-few-GMP-instructions-can-we-use), primarily as an exercise. It ended up many times faster than the original, and I eventually ran out of ideas and got bored with the optimization and set it down. It's still the most carefully crafted and most commented piece of code I've ever written, and produces output like this:
I've been hearing a lot about ICC, the Intel C/C++ Compiler, recently — especially how much better it is at optimization of math-heavy routines when compared to GCC. I generally refused to believe people when they claimed 30-50% speedups, if only because it seemed like simple hyperbole.
I downloaded an entire gigabyte worth of compiler (WHAT!) yesterday and set to do some tests using my Mandelbrot generator. I took some time to rebuild GD and GMP with each of my test compilers (ICC 11.1, GCC 4.2, LLVM-GCC 4.2, and GCC 4.5 — unfortunately I couldn't get GMP to build with clang), and then sat around with the manuals correcting optimization and build arguments for each compiler. I then took the average of 10 runs of the program (all runs ended up being ±2% for any given compiler), and came up with the following:
Holy crap... a 2x speedup! Crazy crazy stuff! I realize that this is a completely unreasonable benchmark, performed under less than perfect benchmarking situations (though they were pretty good), and is not representative of "normal" code, and should not be used to pass judgement against any of the involved compilers, but it's relatively representative of the kind of code that I want to run fast. I've had long-running math-heavy programs in the past which I would have been very happy to have been able to speed up anywhere near 2x (in order to meet class project deadlines...)!
I also find it funny that ICC produces both the fastest and smallest code, and that turning on profile-guided-optimization had no perceptible effect (except, obviously, to make the compile time ridiculous). I presume this is not the case in other scenarios.
ICC seems to also have better static analysis (it's pointed out a few legitimate warnings in the code that neither GCC nor LLVM-GCC nor clang picked out). But I guess that's what happens when you've got billions of dollars to throw around...
I should also note that all tests were done on the 2.4 GHz Core 2 Duo T7700 in my 2007-era MacBook Pro. A chip that Intel knows how to optimize for, apparently!
One more for your viewing pleasure (this is a zoom quite deep into the set, Wikipedia copy):
NOTE: Dad suggested that the 2x speedup might be due to autoparallelization or some such. However, luckily for my benchmark, this is not the case, as the program already takes advantage of both cores (spawning 10 threads each generating a tenth of the image), and manages to consume ~197% of the total CPU time (across two cores) over its lifetime.