Intel Benchmarks
How do they vary from others? (Cheat)

By: David K. Every
©Copyright 1999


Performance is a complex issue because it is a composite of many different things. Every company finds ways of looking at the things that make it appear better than the competition -- the trick is to figure out "who is lying" (see marketing) the most.

Intel has a long, proud history of being the best at "stretching" the truth -- in a very big way. They aren't lying, exactly, but users are highly unlikely to get anywhere near the performance Intel claims in the real world. Other companies are not saints, and use some of these same techniques -- but to a far lesser degree.

So read on to understand how benchmarks can get so blown out of proportion.

Hot-boxes

The easiest way to exaggerate your performance is to test your machine on a hot box. A hot box is a super-fast, super-custom machine that no one in the real world can afford, but which makes your processor look faster than the competition's.

All companies benchmark with hot boxes (high-end machines); there are just differences in degree. IBM and Motorola usually use high-end machines that they actually sell. Intel usually makes custom motherboards (or even controller chips) that never seem to be sold, or at best are never mainstream. While this does mean that the processor can theoretically perform that fast, it does not mean that yours (of the same speed) will ever come close.

I've seen Intel benchmarks that use SRAM (high-speed, expensive cache RAM) as main memory, or 4-way interleaved SDRAM (a way to speed up memory access) -- long before you could buy an off-the-shelf memory controller that supported such things. The P6 (Pentium Pro) was also benchmarked using a version of the chip that had a larger integrated cache than users could buy (almost a custom version of the processor itself). Intel did sell a few of them (just to be in compliance with regulations), but that flavor of P6 (with the 1 MB integrated cache) was never produced in volume, cost nearly twice as much as the ones users got, and was mostly delivered late in the product's life-cycle (but benchmarked at the beginning). Basically, it was not the processor that 99% of users bought (they had either 256K or 512K of cache) -- and so not the performance that users were going to get. There are lots of little Intel tricks like that.

So most companies may exaggerate their performance (by 5% or 10% over the norm) by using hot-boxes, or boxes that are high-end hardware (but that people can actually buy) -- at least they are fairly close to their claims. Intel, on the other hand, is often off by 10%, 20% or even 30% -- easily. There are rumors of them tuning their processors to run benchmarks faster (which wouldn't be a surprise). Intel definitely does not test with off-the-shelf Dell or Compaq machines (even high-end ones) -- those machines would rate far below what Intel claims. So in some ways users are defrauded into believing that their machines run far faster than they actually do. But since users are only comparing performance to an older machine, which also had exaggerated claims, they notice the improvement and never know the absolute truth.

For example:

1995 - *Microprocessor Report, Vol. 9 No. 13, Oct. 2, 1995

Intel claimed a Pentium-120 scored 140/105 on Spec92 (Int/FP), using their XX (Xtended Xpress) motherboard with 1 MB of L2 cache and EDO memory. Note: this product was never available, nor was anything close.

The best actual performance was 120/90 on Spec92 (Int/FP), or about 17% slower. That was on the fastest PC in production: a Dell using Intel's Triton chipset, a 256K cache, and EDO memory. This was still an expensive box, and not something most consumers used.

The average user, at that time, could not afford an L2 cache at all (20-30% slower) or, if they had one, it was slower than the test machine's (10-15%). Average users (even businesses) were also not using EDO memory but FPM memory (10-15% slower), and most probably ended up with a less impressive I/O chipset (meaning slower bus speed/memory access). So the real-world performance of their Pentium-120s would be more like 107/78.

Note: these final numbers are speculation -- I couldn't find Spec92 numbers for this machine, but I can mathematically approximate them, based on other benchmark differences -- and I'm being very generous.

*I used a 1995 machine because it was one case where I actually had numbers for both an Intel hot-box and a more commercial hot-box. I hope to collect more figures and more support -- but normal boxes aren't usually used for benchmarks, so it is hard to compare.

So I always reduce Intel's Spec numbers by 25-30% just for their hardware exaggerations alone. And this is only one small technique they use to trick people into thinking they are faster than they are.

Most Apple machines seem to be below the top by closer to 5% -- rather than the 20-40% on the Intel side. This is because Apple produces the whole system, and most of its users are buying quality rather than the cheapest thing available. (The exceptions were some older low-end Performas, which were designed for price more than performance.) IBM and Motorola don't want to ruin their reputations with fraudulent claims, and must assume that their customers are smarter than Intel's customers or PC magazine reporters (who have a vested interest in perpetuating the fraud).

Hot-Compilers

Intel uses a special compiler called the "Intel Reference Compiler". This is a compiler that no one really uses for writing commercial software -- mainly because it is "tuned" for Spec (a standard benchmark suite). Now, Intel would go ballistic at that wording (it might violate the rules for Spec and imply fraud) -- so let's just say that it is "tuned" for things that are done in the Spec benchmarks, and for little else -- but the point is that this compiler is really only good for making Intel processors look better than they really are. One of the techniques they used was a sophisticated multi-pass dynamic profiling (tuning) technique (2); basically, this gave Intel at least a 10-15% boost in performance that most other people can't afford, or won't spend the time, to achieve. Partly, they don't need to do it, because the complexity of writing really fast x86 code may have surpassed RISC code when it comes to scheduling and so on -- in other words, you are more likely to need such things in the Intel world.

(2) Basically, they compile the code once with lots of special stubs -- the compiler runs the code and profiles (analyzes) it as it is running, to see if it is doing things that are used in the Spec suite of benchmarks. Where it sees those things, it places a marker, and then the code is compiled again. On the second pass the compiler replaces your "tagged" code with highly optimized, "hand-tuned" assembly code to make it faster (basically, it rewrites your code for you if you are doing things that are in the Spec suite). This takes a ton of time, and is only useful for a few things (like cheating at benchmarks), but hey, that's Intel.
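
As a rough illustration of the general idea (a sketch in C, not Intel's actual tool chain -- the compiler names and flags in the comments are hypothetical), a profile-driven build works something like this:

    /* Sketch of a two-pass, profile-driven build (hypothetical tool names):
     *   1. cc -profile-generate bench.c -o bench   (instrumented build)
     *   2. ./bench                                 (run it; a profile is written)
     *   3. cc -profile-use bench.c -o bench        (recompile; hot spots that match
     *      known patterns can be swapped for hand-tuned code)
     */
    #include <stdio.h>

    /* a loop a profiler would flag as "hot"; on the second pass, a
       benchmark-tuned compiler could replace it with specialized code */
    static double dot(const double *a, const double *b, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

    int main(void)
    {
        double a[1000], b[1000];
        for (int i = 0; i < 1000; i++) { a[i] = i; b[i] = 1000 - i; }
        printf("dot = %f\n", dot(a, b, 1000));
        return 0;
    }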

While it is technically "legal" to have that kind of optimization, it is completely impractical to use it for almost anything other than specialized benchmark tuning -- it takes a lot of work (cost) to create (so no commercial App writers can afford to do it), and it only seems to work well for very specific things (like benchmarks). This is why almost no one uses the Intel Reference Compiler for writing apps -- not to mention its lousy interface, horrid usability, and very, very slow compile times. Since Application writers don't use it, you will never see that "potential" performance. So while it is true that the processor is theoretically that fast, it is completely false that your applications will ever run that fast. Using traditional compilers that are less optimized in general (like Microsoft's), and given average development budgets (which don't usually include lots of time for specialized "tuning"), I would expect to see 15-20% less performance than Intel claims (at least).

Apple, IBM and Motorola all have high-performance compilers as well -- but theirs are far more likely to be used commercially. Motorola's and Apple's compilers are plug-ins for CodeWarrior, and they are used in a lot of programs. IBM, and its clients, use IBM's compiler as well (though most Mac developers don't). Motorola also released a bunch of optimized math libraries, so that ALL Application developers can use those tuned routines. Those libraries are more generic, so they speed up more than just Spec. The point being that PowerPCs are far closer to achieving their theoretical maximum than Intel processors are.

Due to the complexities of VLIW (EPIC) and Merced, the difference between theoretical performance (which Intel quotes) and the actual performance (which Intel users see) is going to get larger -- in fact, far larger. The complexity of making a compiler that uses VLIW (EPIC) well goes up by as much as an order of magnitude. But I have no doubt that Intel will tune the first compilers for Spec.

Pentium Optimized Code

Each processor requires the code to be "optimized" (tuned) for that machine. If you do not do this, you can take a performance hit (loss) of anywhere between 5-30% (with most cases in the 10-15% range).

But herein lies the problem -- you have to tune for each processor, but which processor do you tune for? The problem is made worse by the size of the PC market and the number of different processors (which vary greatly in architecture) -- legacy is a millstone that holds you back.

If you have 100 million users on pre-Pentium machines, when do you come out with Pentium-optimized code? Do you wait until ten thousand Pentiums are sold, or 10 million, or until they are half of your user base (years after introduction)? When the App does change, all of the older 486 users are going to see a performance drop (say a penalty of 5-15%) -- and they will not be happy with your "upgrade" -- and the magazines will show big charts of how much slower your new version of your application is, and so on. So App writers are slow to change optimizations.

The size of the PC marketplace (legacy) means that developers move far more slowly (to wait for their user base to catch up to any processor). So there is probably at least a 2-3 YEAR lag between a processor's release and when many Apps are optimized for it. But by then Intel has moved on to the next flavor of processor (like the Pentium Pro, Pentium with MMX, or Pentium II). And Intel has to keep moving and changing things, or the clone-chip makers will catch them. So almost every program is really optimized for the previous version of the processor -- meaning that new processors are really going to be running Applications well below their rated performance. Maybe by the time a processor is becoming outdated (and replaced), new Apps will finally be tuned for it -- meaning that things are always slower than they could be on the newest processors.

But wait, there's more. Because there are 400 flavors of x86 processor, it is even harder to pick well. You can't just choose the Pentium II, because its scheduling may cause a performance hit on Cyrix, AMD, or other Intel processors. And imagine trying to tune for something like MMX, when Intel is going to have multiple flavors of that (there are already two slightly different flavors, a bigger change is coming soon, and more after that). The number of choices works against developers achieving theoretical performance. So most don't really tune for any one processor at all -- they go for a lowest common denominator (some compromise that is sort of half-tuned for a few different processors).
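
One common compromise (my own sketch below, in C -- not any particular vendor's code) is run-time dispatch: detect the processor family once, then call through a function pointer to either a generic routine or a tuned one. The detection function here is a stand-in for CPUID-style checks.

    #include <stdio.h>

    /* stand-in for real detection (CPUID on x86, Gestalt on the Mac, etc.) */
    enum cpu_family { CPU_GENERIC, CPU_PENTIUM, CPU_PENTIUM_II };

    static enum cpu_family detect_cpu(void)
    {
        return CPU_GENERIC;   /* assume the lowest common denominator */
    }

    /* the same operation, which could be scheduled differently per family */
    static void blend_generic(unsigned char *dst, const unsigned char *src, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = (unsigned char)((dst[i] + src[i]) / 2);
    }

    static void blend_tuned(unsigned char *dst, const unsigned char *src, int n)
    {
        blend_generic(dst, src, n);   /* placeholder for a hand-scheduled version */
    }

    /* pick one implementation at startup instead of shipping per-CPU binaries */
    static void (*blend)(unsigned char *, const unsigned char *, int);

    int main(void)
    {
        unsigned char dst[4] = { 0, 64, 128, 255 };
        unsigned char src[4] = { 255, 255, 255, 255 };

        blend = (detect_cpu() == CPU_GENERIC) ? blend_generic : blend_tuned;
        blend(dst, src, 4);
        printf("%d %d %d %d\n", dst[0], dst[1], dst[2], dst[3]);
        return 0;
    }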

Remember, this doesn't only apply to Applications, but to Microsoft's Windows Operating System itself. Some of the performance drops can be additive (Apps slower + OS slower + drivers slower = a computer that is far slower). So once again, the performance that Intel (or others) claim assumes "optimized" code -- which programmers can't deliver as well in the PC world.
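
To put rough, back-of-the-envelope numbers on that compounding (my own illustration, not measured data): three independent 10% losses multiply out to about 1 - (0.9 x 0.9 x 0.9), or roughly a 27% overall hit -- so a few "small" penalties in the Apps, the OS, and the drivers add up to a machine that is noticeably slower than any single number suggests.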

While this optimization (instruction scheduling/tuning) problem exists in the PowerPC world as well, it is far less of a problem for a variety of reasons:
  1. The whole architecture is newer (cleaner) -- designed just a few years ago -- and it was designed to be more scalable from the start. So the changes from processor to processor have been less dramatic, and the performance hits smaller. (The x86 design is based on a 30-year-old architecture.)
  2. There are far fewer choices. Basically, right now there are two primary scheduling models (603/G3 or 604) -- with probably more like a 3-5% difference in most cases (though there are a few cases where it can make a large difference). There is also 601 scheduling, which was a little different, but no one tunes for that (so 601s are probably running about 5% below peak).
  3. Apple's marketplace is smaller, so that fewer people have to move from one product to another before Apple and the Application writers can take advantage of something new.
  4. Apple made a complete jump to PowerPC (from the 680x0). This means that all PowerPC-native Apps are optimized for RISC. Many (most?) x86 Apps are not tuned for the Pentium (let alone your model of Pentium).
  5. Because there is more market control by Apple, there is more ability to predict what will be a success or what will happen in the future. So when Apple says that G3 is the future, people can jump sooner, knowing that it will be the future.

Some of this App-tuning problem can be reduced by using Unix, if you are compiling your own Apps for each processor. But then Unix compilers may not have the best optimizers either (which can mean a pretty big performance hit as well), and many people are getting pre-compiled Unix code nowadays (with compromise optimizations).

Hot-Operating Systems

Have you ever noticed that the benchmarks are not run on Microsoft Windows 95 or 98? Most of the time they use a nice, clean, stripped-down version of Unix, to make things go as fast as possible. This is partly because Windows is a big, slow, bloated pig of an Operating System -- but it is also the Operating System most people are using. So the theoretical performance of "Spec" only applies to the 2-4% of PC users who are running Unix. Most will probably see anywhere from a 5-15% performance drop for using Windows (and it is probably worse with Win95).

PowerPCs are also usually benchmarked on Unix. The difference is that Mac OS X is actually going to be built on Unix, and the flavor of Unix it is being built on is likely to be faster than the ones being benchmarked on right now (since Apple is tuning the scheduling). There also seems to be less of a performance difference between Unix and the Mac OS than between Unix and Windows when running benchmarks (the Mac is a tighter OS). So Apple is not only closer to actual performance today, but likely to be right on -- or even faster -- tomorrow.

16-Bit vs. 32-Bit

Remember that legacy nightmare on PCs? It is much worse than you think. That backwards-compatibility thing holds everything back. Most people are using Windows 95 (or Windows 98). Windows 9x still has a whole lot of 16-bit stuff buried in there, and there are 16-bit drivers and many 16-bit Applications. The performance ratings all assume nice, clean 32-bit programs and Operating Systems. So once again, the real world and the benchmarks are two different things. So much so, in fact, that while the Pentium Pro (P6) was rated at up to twice as fast as the Pentium, it was actually the same speed (or possibly slower) when running 16-bit Apps. And remember, that ugliness and performance drop can get you even when running 32-bit Apps on Win95 (because Win95 itself is not 32-bit clean). So for most people the Pentium Pro was actually SLOWER than the regular Pentium -- despite performance claims that it was up to twice as fast.

So if you are not running WinNT with Win32 Apps only, you can expect another performance penalty of 5-50% (usually toward the lower end of that scale).

There is no 16-bit (segmented) mode on the PowerPC, nor any of the other quirky modes of the x86. So there is no way for the PowerPC to take that performance hit.

The Mac fixed its much less serious 24-bit addressing limitation about a decade ago, and it never affected performance the way the PC's addressing/mode problems do.

There are some "emulated" Apps on the Mac, which can take an even bigger hit than 16-bit Apps on PCs do. But the difference was so dramatic that everyone who needed performance recompiled their Apps for PPC pretty quickly. Because the Mac's market is smaller, it is faster and more adaptable (it doesn't take as much momentum to effect change). So Mac problems get fixed quickly, while PC problems can drag on for decades (PCs started fixing Plug & Play in '92; I believe in another 3-5 years they will have it licked).

Hot-Applications

Application benchmarks are where you compare one Application on one Platform to the same Application on another. These are the most important comparisons since they are how users work, and what users are trying to do.

Intel got tired of being whipped in these Application benchmarks -- so they paid to have some key "benchmarking" programs hand-tuned, with key parts written in assembly language (faster, but something most companies can't afford to do any more). Intel tuned just for MMX, all so that they would look better compared to PowerPCs. They did this with Adobe Photoshop (and, I believe, one or two other Apps).

This means that Adobe Photoshop is anomalously fast compared to other PC programs because of that tuning. So it is a hot-app for the following reasons:

  1. Apple hasn't paid for (or done on Adobe's behalf) custom tuning of Photoshop.
  2. Adobe didn't do this for the Mac.
  3. Most companies are not going to take that huge burden (for Pentiums) on themselves.
  4. Most Applications on the PC are not hand-tuned, and do not use MMX.
  5. Companies are more likely to tune on the Mac, since it isn't as hard as doing it on the Pentium, and since the work is more likely to pay off on more than one model of processor.

So this is just another case of Intel tilting the table in their favor (and slightly defrauding the public). Yet we all know the results. Even with Photoshop biased in favor of the Pentium, it still runs twice as fast (or more) on PowerMacs. There are a few small places where that investment paid off and the Pentium is as fast, or even a tad faster, but those are rare indeed. In fact, when Intel saw the "Toasted Bunnies" commercial, they got mad about getting crushed in ByteMarks, so they tried to use their tuned Photoshop filters as a benchmark. Even using specially tuned routines, at special sizes, the Mac was more than twice as fast in the composite tests. And if it weren't for Intel's biasing, the Pentium would be running even slower.

Of course, Photoshop isn't the only Application they've done this with -- they used to compare Office on the Mac to Office on the PC, because Office on the Mac actually ran with a big, ugly, fat Windows emulator underneath it (thanks, Microsoft) -- even Office '98 still has some of that bloated pseudo-emulator in there.

MMX

MMX itself is somewhat fraudulent. It does help performance for certain things -- that performance is real. But the way it is presented may be more fraud than truth. Remember, most Apps don't use MMX. So while it may make some Apps seem faster (according to certain benchmarks), it may not apply across the board (only working in certain conditions).

Remember that Photoshop cook-off between the Mac and MMX-tuned Pentium code? Not only is the Mac twice as fast, but the PC is only as fast as it pretends to be on a few specific filters, and only if those filters are run with very specific parameters. For example, if you do a Gaussian Blur with less than a 4-pixel diameter (fairly common), it runs as fast as demonstrated. But if you do a Gaussian Blur with a diameter of more than 4 pixels, the MMX code won't be used (it only works with small sizes) -- and then the Pentium is less than half as fast as it was before. The Mac doesn't take that performance hit. It is like this with most of the "filters" and specialty things that Intel demonstrates MMX with -- sure, it works for SOME things. Most users will see little if any difference with MMX, except in rare cases -- which Intel plays up and tries to sell as "the norm".
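
The pattern described above looks roughly like this sketch (mine, in C -- not Adobe's or Intel's actual code): the tuned path only fires for small diameters, and everything else quietly falls back to the ordinary routine.

    #include <stdio.h>

    /* ordinary scalar blur -- what every large blur falls back to */
    static void blur_scalar(float *img, int w, int h, float diameter)
    {
        (void)img; (void)w; (void)h;
        printf("scalar blur, diameter %.1f (no MMX speedup)\n", diameter);
    }

    /* "MMX-tuned" blur -- only valid for small diameters (the demo case) */
    static void blur_tuned(float *img, int w, int h, float diameter)
    {
        (void)img; (void)w; (void)h;
        printf("tuned blur, diameter %.1f (the benchmarked case)\n", diameter);
    }

    /* the dispatch: the impressive path covers only a narrow parameter range */
    static void gaussian_blur(float *img, int w, int h, float diameter)
    {
        if (diameter < 4.0f)
            blur_tuned(img, w, h, diameter);
        else
            blur_scalar(img, w, h, diameter);
    }

    int main(void)
    {
        float img[16] = { 0 };
        gaussian_blur(img, 4, 4, 2.0f);    /* the demo-friendly case */
        gaussian_blur(img, 4, 4, 10.0f);   /* the everyday case: falls back */
        return 0;
    }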

ByteMarks vs. SpecMarks

This is a big debate among geeks. Basically, Spec is the better benchmark -- in theory. Both are just collections of sample "things" to do, but Spec does more things. It is also more "controlled" in how the tests are run, and so on. But it may be those very "controls" that allow for the impractical results. So in theory Spec should be more reflective of overall performance -- too bad the theory doesn't hold water.

Whether it is a better benchmark in theory doesn't seem to matter, for all the reasons mentioned throughout this article. Intel taints the benchmark by only running it under completely impractical conditions: with special compilers, on special hardware, on OSes that people aren't using, and so on. Intel may even have tuned their processors a little to make them faster specifically for Spec. So while Spec shows theoretical processor performance (in a perfect world), that is irrelevant to users -- what users care about is how a processor performs in their world. Compare the following:

  • You can't get the Spec suite without paying large sums of money (or pirating it).
  • ByteMarks are free and publicly available, so you can test with them (and compare to the real world).

  • The Spec suite is run on proprietary hardware (that you probably can't get).
  • ByteMarks are run on YOUR hardware (or can be), to show your performance relative to the competition.

  • Intel has tuned their compiler for Spec, and you can't get it (without lots of money).
  • ByteMarks are run with the compiler you and developers are most likely to use -- the one your Apps are most likely to be compiled with.

  • Intel runs Spec on an OS you probably aren't using.
  • ByteMarks are run on the OS you are using.

So even though Spec is theoretically a more full-featured and purer testing suite, it isn't one in real life. The only thing Spec seems to show is relative performance -- or how well Intel can taint benchmarks. At best, Spec is a theoretical measurement of how fast Intel can make their processor go if it weren't in a system you actually had to use -- a purely academic exercise that is completely unreflective of how fast your machine will run.

ByteMarks, while a smaller and simpler suite, is a better measurement -- probably just because it has not been the focus of Intel's tuning efforts (it is lower key)... at least up until recently. I don't doubt that Intel will figure out ways to start polluting ByteMarks as well. So when the rubber hits the road (and I am working in the real world), I can compile a test with ByteMarks and see how things work in the real world. I've done a lot of comparative compiles and benchmarks (with our own apps, and many application fragments), and the tests almost always reflect a 2x Mac advantage (for processor-intensive stuff). Commercial Apps that I run reflect a 2x Mac advantage (for processor-intensive stuff). My Mac feels far faster than my PC (for processor-intensive stuff). ByteMarks reflect a 2x Mac advantage (for processor-intensive stuff). The only test that does NOT reflect a 2x Mac advantage is the one benchmark I have no reason to run in the real world. So what should we believe? That everything else is wrong, and Spec is the only true test? Or that, despite being a well-done and more thorough benchmark, Spec has been so polluted by Intel's marketing techniques, and its measurements are so impractically specialized, that ByteMarks are a better general measure of real-world performance?
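
For what it's worth, the kind of real-world check I mean is nothing exotic -- a small, portable timing harness compiled with the same compiler and settings you actually use, along the lines of this sketch (the workload is just a stand-in for a fragment of your own application):

    #include <stdio.h>
    #include <time.h>

    /* stand-in workload; substitute a fragment of your own application */
    static double work(int n)
    {
        double sum = 0.0;
        for (int i = 1; i <= n; i++)
            sum += 1.0 / (double)i;
        return sum;
    }

    int main(void)
    {
        clock_t start = clock();
        double result = work(10 * 1000 * 1000);
        clock_t end = clock();

        /* same source, same compiler settings you really ship with -- then
           compare these numbers across machines, not the marketing numbers */
        printf("result = %f   time = %.2f s\n", result,
               (double)(end - start) / CLOCKS_PER_SEC);
        return 0;
    }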


Created: 07/14/98
Updated: 11/09/02

