“Ruby faster than Python and Perl!” cries the headline. This is based on a benchmark that tests `i = i + 1` in a loop, so it’s a particularly useless benchmark, even in a world of benchmarks designed to test unrealistic scenarios that make the benchmark author’s product look good.
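For concreteness, here is a minimal sketch, in Python, of the kind of loop being timed. This is not the original benchmark script, and the loop count is an arbitrary stand-in (remember, the real scripts don’t even agree on it):

```python
# A minimal sketch (NOT the actual benchmark script) of the kind of
# "microbenchmark" under discussion: increment a counter in a loop and
# report wall-clock time.
import time

ITERATIONS = 10_000_000  # hypothetical loop count; the real scripts differ per language

start = time.perf_counter()
i = 0
while i < ITERATIONS:
    i = i + 1
elapsed = time.perf_counter() - start

print(f"{ITERATIONS} increments in {elapsed:.3f}s "
      f"({ITERATIONS / elapsed:,.0f} iterations/sec)")
```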
But wait! A commenter accuses the poster of cheating! (On a benchmark? No!)
> Ummm…. Why did you test Ruby with less data than you tested Python and Perl? You cheated.
As it turns out, the “microbenchmark” scripts for different languages have differing loop counts, so the total run time is super extra especially meaningless as a way to compare language performance.
When I was in 5th grade (age 9) learning AppleSoft BASIC on the Apple ][+ in math class, we wrote programs that did this:
```basic
10 X = 1
20 PRINT X
30 X = X + 1
40 GOTO 20
```
And we would race each other, starting at the same moment and seeing whose column of increasing numbers on the screen scrolled faster. This was of course stupid because all the computers in the lab were chip-for-chip identical to each other, and probably were all made on the same production run on a single day.
We learned that we could cheat by adding a number larger than one in each loop iteration, which was quickly detected and outlawed. Far more cleverly, someone figured out that you could do something like this:
```basic
20 PRINT X: X = X + 1: GOTO 20
```
The same algorithm yielded better performance if it was all written on one line of code, presumably because the interpreter had fewer program lines to fetch and fewer GOTO targets to look up. I could be misremembering (it has been 25 years and I don’t have a ][+ handy to verify this on), but it was something like that. Anyway, we learned that the same language runtime on identical hardware running the same algorithm could be made to run faster or slower with simple formatting changes.
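The modern CPython analogue of that trick, as a sketch: the identical `x = x + 1` loop benchmarks differently depending on whether it happens to sit at module level (global name lookups) or inside a function (fast local slots). Nothing about the algorithm changes, only where you wrote it:

```python
import time

N = 5_000_000

# The loop at module level: x and i are global names, so every access
# is a dictionary lookup.
start = time.perf_counter()
x = 0
i = 0
while i < N:
    x = x + 1
    i = i + 1
module_time = time.perf_counter() - start

# The identical loop inside a function: x and i are local names,
# stored in fast indexed slots.
def count(n):
    x = 0
    i = 0
    while i < n:
        x = x + 1
        i = i + 1

start = time.perf_counter()
count(N)
function_time = time.perf_counter() - start

print(f"module level: {module_time:.3f}s  inside a function: {function_time:.3f}s")
```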
So does it make sense to compare different languages this way, which may mean favoring one language’s idiomatic code structure while hitting a weak spot of another? This is a common, and in my opinion valid, critique of apples-vs-oranges benchmarks: how do we know that the performance difference isn’t due to naive coding or configuration on one side and expert tuning on the other side? For that matter, do we know that the benchmark design isn’t selected specifically to highlight exceptionally high performance in one area of a product, to the exclusion of embarrassingly slow areas that the benchmark designer would prefer that you not consider?
Thus I claim that this benchmark is approximately as valuable as my 5th grade silly hacks. `X=X+1`, change the number of iterations to suit your bias, or perhaps just don’t bother making them the same because it’s meaningless anyway. (`Z=X*Y` and a matrix multiplication are other parts of this benchmark, but they too are so trivial in concept and implementation as to be equally pointless.)
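To make the naive-coding-versus-expert-tuning point concrete within a single language, here is a sketch (not part of the original benchmark) comparing a hand-rolled triple-loop matrix multiply against the same operation done through NumPy’s tuned routines, if NumPy happens to be installed:

```python
# Even within one language, a "matrix multiplication" benchmark mostly
# measures how it was coded, not the language: a naive triple loop and a
# call into an optimized library differ by orders of magnitude. NumPy is
# used here purely as the "expert tuning" stand-in; it is an assumption,
# not something the original benchmark used.
import time
import random

N = 200
a = [[random.random() for _ in range(N)] for _ in range(N)]
b = [[random.random() for _ in range(N)] for _ in range(N)]

start = time.perf_counter()
c = [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)] for i in range(N)]
print(f"naive triple loop: {time.perf_counter() - start:.3f}s")

try:
    import numpy as np
    na, nb = np.array(a), np.array(b)
    start = time.perf_counter()
    nc = na @ nb
    print(f"numpy (optimized library): {time.perf_counter() - start:.3f}s")
except ImportError:
    print("numpy not installed; skipping the tuned version")
```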
I’m going to guess that the author of the blog post didn’t notice the difference in loop counts, or was looking at the per-second values rather than the total run time. But if we’re looking at average performance over time, then how long does it take for the performance to stabilize? Stabi-whatchamaha? Ask Zed Shaw: look at his list of pet peeves, #3.
Do we know that 0.142 seconds is enough to measure “language performance” (really, it’s the performance of a particular runtime environment being measured) including stuff like garbage collection and JIT compilation overhead? If one language’s runtime waits for N iterations before JIT-compiling the code, whereas another runtime waits for 5N, how many total iterations do you need to minimize the effect of that?
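One way to at least see the warm-up problem: run the same trivial workload in successive batches and watch whether per-batch throughput settles down. A sketch, with arbitrary batch size and count; on a JIT runtime (JRuby, PyPy, and so on) the early batches are typically slower than the later ones, and a 0.142-second total run may never get past warm-up at all:

```python
# Run the same trivial loop in successive batches and report per-batch
# throughput, to see whether the runtime has "stabilized" yet.
import time

BATCH = 1_000_000   # hypothetical batch size
BATCHES = 10        # hypothetical number of batches

for batch in range(BATCHES):
    start = time.perf_counter()
    i = 0
    while i < BATCH:
        i = i + 1
    elapsed = time.perf_counter() - start
    print(f"batch {batch}: {BATCH / elapsed:,.0f} iterations/sec")
```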
What happens in JRuby, Ruby2C, etc.? The poster says the tests were run on a MacBook Pro, but what architecture (PPC vs. Intel) were these language runtimes compiled for, with what compiler, which GCC version, optimized for which CPU models, and so on? This stuff can make a big difference in CPU benchmarks, which is why proper benchmarks include this kind of configuration information.
Or are you really measuring small-script execution time, where runtimes over a second are meaningless for your needs, in which case startup overhead dominates and Java seems painfully slow while Bash looks lightning fast?
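If startup cost is what actually matters, measure it directly instead. A sketch that times how long each runtime takes to execute a do-nothing program; the command names are assumptions and require those interpreters to be installed and on the PATH:

```python
# Time interpreter startup rather than loop speed: run a do-nothing
# program in each runtime and measure wall-clock time. For sub-second
# "benchmarks", this fixed overhead can dominate the result.
import subprocess
import time

commands = {
    "python3": ["python3", "-c", "pass"],
    "perl": ["perl", "-e", "1"],
    "ruby": ["ruby", "-e", "nil"],
}

for name, cmd in commands.items():
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    print(f"{name}: {time.perf_counter() - start:.3f}s to start and exit")
```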
What the heck is being measured by this “microbenchmark”? Language fanboy gullibility?
For a less awful benchmark, have a look at the Computer Language Benchmarks Game: “What fun! Can you manipulate the multipliers and weights to make your favourite language the best programming language in the Benchmarks Game?” At least they realize how not-terribly-useful synthetic CPU benchmarks of language runtimes are.
Anyway, for most applications, if you’re choosing your language based on runtime performance, you’re choosing very poorly. If you’re choosing your language based on a really awful “microbenchmark” comparable in accuracy to the first toy hack of a room full of 9-year-olds, well…