AMD K7 : Athlon
Great chip for an x86

By: David K. Every
©Copyright 1999

AMD has come out with a new processor -- the K7, as it was known, or Athlon, as the new marketing tongue twister has it.

Athlon means contest in Greek (as in biathlon, triathlon, heptathlon, and so on) -- I assume the name has something to do with some marketing goon's idea of a pithy reference to the contest between the K7 and the P6 (Pentium III) core. But if this is a contest, it isn't much of one -- the K7 kicks some P6 ass.

I, and many others, have been warning for some time, in articles like "Is Intel Doomed?", that Intel is going to be seriously pressured by AMD. AMD has a history of sometimes missing on deliveries -- but it looks like they are hitting this one -- and this one is a killer. AMD also brought this processor in at very aggressive pricing (much lower than I thought) and they are going to push this processor into the mainstream quickly.

Killer Architecture!

In just about every way you can measure a processor architecture, the K7 sounds great.

  1. Transistor count helps us understand complexity, and the Athlon is huge at 22 million transistors. The Pentium III is about 9.5 million, the G3 is 6.5 million, and the G4 is 10.5 million. This lets you know right off the bat that Athlon is the big bad boy on the block.
  2. Athlon has 128K of L1 cache -- 4 times the Pentium III -- and double that of the G3 or G4. And it has onboard L2 cache tags like the PPC.
  3. The real processor work is done in execution units -- and the Athlon again is a monster. Athlon has 3 integer units and 3 floating point units. Compared to the anemic 2 integer and 1 floating point units of the P3 and G3 (PPC 750), this is impressive indeed. The G4 (PPC 7400) has two AltiVec units as well, which should help.
  4. Pipelines allow you to break instructions into parts, and work on many parts at the same time. (You can read the article about Pipelines.) The PowerPCs have a simple pipe -- 4 stages. The Pentium III needs to break things into about 14 stages to keep up with the PPC. Again, the Athlon chose to be the monster with 18 integer and 36 floating point stages.
  5. Reorder buffers allow the computer to execute instructions in an "out of order" way -- so the computer doesn't stall as often. When one instruction can't be executed, the computer just sort of executes the instructions after it (to get ahead) -- so as soon as the stalled instruction is finished, the computer reorders everything and gets them all completed. The PowerPC (G3) has a 6 entry reorder buffer compared to the Pentium III's 40 entry, and the Athlon's 72 entry reorder buffer.
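To make the reorder-buffer idea in item 5 concrete, here is a minimal toy sketch in Python. It is not a model of any real chip: the one-issue-per-cycle rule, the latencies, and the `window` parameter (a loose stand-in for how far ahead the processor can look) are all assumptions made up for illustration.

```python
def run(program, window):
    """Return the cycle at which the last result is ready.

    program: list of (latency, deps) tuples, where deps holds the indexes
             of earlier instructions whose results are needed first.
    window:  how many of the oldest unissued instructions the toy machine
             may look at each cycle (1 behaves like strict in-order issue).
    """
    done_at = {}                      # instruction index -> cycle result is ready
    pending = list(range(len(program)))
    cycle = 0
    while pending:
        # Issue at most one instruction per cycle: the oldest ready one
        # that falls inside the look-ahead window.
        for idx in pending[:window]:
            latency, deps = program[idx]
            if all(d in done_at and done_at[d] <= cycle for d in deps):
                done_at[idx] = cycle + latency
                pending.remove(idx)
                break
        cycle += 1
    return max(done_at.values())


# Hypothetical program: a slow load, one instruction that needs the load,
# and three independent instructions stuck behind it.
program = [
    (10, ()),    # 0: slow load
    (1, (0,)),   # 1: needs the load result
    (1, ()),     # 2: independent
    (1, ()),     # 3: independent
    (1, ()),     # 4: independent
]

in_order_cycles = run(program, window=1)
out_of_order_cycles = run(program, window=4)
```

With the wider window the independent instructions slip past the stalled one, so the toy machine finishes sooner. Note that the gain is bounded by how much independent work is actually available, which is the diminishing-returns point made later in this article.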

In every way that we talk about processor architecture the Athlon is a monster. It has aggressive branch prediction, an aggressive MP design, an aggressive 200 MHz bus (actually a 100 MHz bus that transfers on both the rising and falling edges of the clock), and so on. However you measure it, or talk about it, the Athlon is a serious contender.

Still an x86!

So the Athlon is a monster processor. It has lots of everything. The chip was designed by many of the people who designed the DEC (now Compaq) ALPHA chip -- and it looks exactly like that. Architecturally it is an Alpha with an x86 front end on it to decode things into RISC-like instructions (and then do all the real work as RISC).

This isn't really RISC, as explained in the "PentiumII is RISC?" article -- for the same reasons that you can't just bolt a bigger motor in a pickup and call it a sports car -- but it is close. In the real world, it gives them most of the speed advantages of RISC, but still drags around the size and complexity baggage of CISC.

Of course there are always issues (and tradeoffs). AMD decided to (and had to) use a new type of bus, I/O chips, and motherboard. It isn't one of the standards: it uses the ALPHA-BUS and a PentiumII-style slot. It is an interesting approach that gives them a lot of speed potential, but puts them in their own little world. They are hoping that Compaq (Alpha) will follow them, and others will make support chips for them -- but for a while they are going to be nonstandard*.

* This isn't that big a deal; almost all the PC standards are non-standards (or start that way). They only become de facto standards after enough companies start using them. So if some of the other x86 processor makers start following them, then they will become the standard. If not, AMD still sells enough processors on their own to make their new wannabe-standard viable.

This monster also eats power like a direct short: 54 watts at top speed. A little easy-bake oven doesn't put out that kind of power. This is for the .25 micron version -- and the .18 should cut that down -- but you aren't going to see any portables in the near future. Size and heat also mean cost (to manufacture) -- for both the chip and the systems. AMD and PC manufacturers can trim their margins to remain competitive, but that isn't always healthy for a company long term (where do they get their R&D money for the next version?). You also aren't going to see PCs as simple and elegant as an iBook or iMac -- an iMac with a 300w power supply is not an iMac, it would be a 50lb iToasterOven. The power requirements also mean that the Athlon systems will be loud (have a big fan). In a desktop workstation this stuff doesn't matter too much (other than higher failure rates and the like), and many people will trade off the heat / power for the performance -- but it does go to show more of the tradeoffs and some of the shortcomings of the design.

In performance numbers, there is no doubt that this chip will run. The Athlon is faster than a Pentium III at the same clock rate, and is available at faster clock rates. It is the bad boy of the x86 world and is going to remain that way for quite some time. And in many areas the Athlon is going to beat the PowerPCs.

Of course I'm not completely convinced that the right choices have been made in the Athlon when it comes to good architecture or industry decisions. What I've been trying to explain in articles like Parallelism or others is that there are diminishing returns on how you use gates (transistors). But I'll get into this more later.


Let's review the issues that are "advantages". Athlon is complex (huge) with 22 million transistors -- that is twice the transistors of a G4 or Pentium3 and over 3 times the transistor count of the G3 -- but what difference in performance does it get?

Performance is complex; it is not any one metric. For now, let's think of it as the sum of Integer, Floating Point and Vector units. Each has its own cases where it is used -- and it gets complex to explain what matters more (and to whom) -- in cars this would be like torque versus horsepower: which is better? It depends what you are doing. I'm not going to pick nits in this and am roughing the numbers -- in some areas there are going to be advantages either way. So I'm being a little loose -- partly because I don't have all the specifics, and mostly because unless it is a significant difference (like above 50 - 100%) it just doesn't matter -- a 10%, 20% or even 30% advantage is still just yawn material. With all this in mind, let's look at real world results (approximations as to what you are going to see in application performance).

Scalar just means a quantity -- or in our usage "one" or "single". The vector unit (AltiVec on the G4, or MMX/SSE on the Athlon/Pentium3) works with more than one integer or floating point element at once. Scalar is the normal way of dealing with these elements (one at a time). Some things are very conducive to being "vectorized" -- some things aren't. If you want to compare the PentiumIII's MMX and SSE to the G4's AltiVec, then read How does MMX work, What is AltiVec, and AltiVec vs KNI / SSE. For practical purposes, the Athlon's MMX + SSE is the same as the PentiumIII's.

In scalar integer the Athlon is probably about 10-20% better than a Pentium3 at only twice the size. The Athlon and the G3 probably have the same integer performance (within a few percent at the same clock rate) -- with the Athlon well over 3 times the size. Do you think one Athlon can compete with 3 x G3 (or 2 x G4) in the real world (which would have the same number of transistors)? Part of the reason for this lack of a significant advantage for the Athlon (despite huge investment) is the baggage of CISC that the Athlon carries around -- but most of the reason is just that a lot of this stuff done to improve speed doesn't matter much (at least not relative to the costs).
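The "3 x G3 (or 2 x G4)" comparison is just transistor-budget arithmetic. A quick sketch using only the counts quoted in this article (the dictionary and function names are made up for illustration):

```python
# Transistor counts in millions, as quoted earlier in this article.
TRANSISTORS_M = {
    "Athlon": 22.0,
    "Pentium III": 9.5,
    "G3": 6.5,
    "G4": 10.5,
}


def chips_per_athlon(name):
    """How many copies of `name` fit in one Athlon's transistor budget."""
    return TRANSISTORS_M["Athlon"] / TRANSISTORS_M[name]


for chip in ("Pentium III", "G3", "G4"):
    print(f"{chip}: {chips_per_athlon(chip):.2f} per Athlon")
```

This ignores everything that isn't a raw transistor count (die area per transistor, cache versus logic, packaging), so treat it as a rough sanity check on the ratios and nothing more.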

The G4 should help modestly in Integer performance, because you can do things like cache touch ops (warm up the cache ahead of time), a better cache architecture (128 Bit path and larger L2), you can use the vector unit to do a lot of integer streaming (and processing -- even for scalar(1) stuff), and lastly some bus improvements (MaxBus). So I think the G4 will be a tad better than the G3, and probably the Athlon, at most integer stuff (at the same MHz). But not significantly. And I think the Athlon is going to maintain a MHz advantage until at least the G5 processor (at least a year out). So I think the PPC will be battling to maintain integer parity overall (and might fall a little behind). But I still doubt that the Athlon will be able to turn a modest advantage into a significant one (in integer).

In scalar floating point the Athlon is even better. Let's say up to 25-50% better than a Pentium3 and the PPC (at the same MHz). And remember, it will be able to run at more MHz. This is certainly pushing "significant". Of course it still can't compete with multiple PPCs (assuming we create transistor-count parity) -- but that only matters if we see MP machines go mainstream (which I suspect and hope for, but one should still be skeptical). The 604e had a good FP unit, and the Power3 and other PowerPC implementations have some killer FP -- but the G3 used a more lightweight FP-ALU that is only competitive with the Pentium (P6) core. The G4 will get better (say a 30-50% improvement over the G3 in Spec) -- but I don't think it will catch up to the Athlon in scalar floating point. Of course, the G4 sort of cheats by having a massive Vector Unit (that can do many floating point ops at once), so this may be moot.

In Vector Operations a single G4 (AltiVec) outclasses the Athlon (by more than the G3 or G4 is outclassed in FP). Basically the AltiVec will allow up to four (32 bit) floating point ops to happen at once. The Athlon's 3 special purpose floating point ALUs (Arithmetic Logic Units) will not be able to keep up with the 4 general purpose units built into the AltiVec. And the G4's standard (64 bit) FP unit is another whole execution unit as well (I believe the Athlon shares its ALUs between scalar and vector work). So if you can interleave 32 and 64 bit FP, the G4's 5 (or 6) general purpose ALUs will significantly outclass the Athlon's 3 special purpose ALUs (one ALU can do adds, another ALU can do multiplies, and another stores).

The same concept of superscalar and size can be applied to integer. You can be doing up to 8 x 16 bit, or 16 x 8 bit integer ops with the G4 (versus less than half that with Athlon or Pentium) -- and on the AltiVec you can still interleave those actions with your normal integer units (again, the Athlon and Pentium share units and are less superscalar). Again the G4 outclasses Athlon (and Pentium) for this type of stuff. And this is only a single G4 -- what do you think a couple of them will do (again, normalizing for transistor-count)? Many standard integer and floating point operations can be "vectorized" and done in the AltiVec -- and in some limited ways, compilers will do so automatically. And the G4's AltiVec unit itself is superscalar (unlike Athlon or Pentium III) -- so the G4 can be doing Permute operations (rearranging data) at the same time it is doing math operations. This is another advantage for Vectors. These differences are significantly in the PowerPC's favor -- let's assume about 100% better (very conservatively) -- but that advantage must be used to be significant in the real world.

The question then becomes "can Apple and Mac programmers use the Vector unit (AltiVec) enough to make this significant technical advantage a significant user advantage?". I think for some things, like Photoshop filters, QuickTime, Graphics and so on, there will be a big difference IMMEDIATELY and there will be no doubt. 3D and networking, audio (and speech recognition), quite probably. But overall? Over time I believe AltiVec will be exploited to the users' advantage -- but it remains to be seen how long this adoption will take.
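The "4 x 32 bit, 8 x 16 bit, 16 x 8 bit" figures all fall out of one number: an AltiVec register is 128 bits wide, so the element size decides how many values ride along per instruction. The sketch below only illustrates that lane arithmetic; plain Python lists stand in for vector registers, and the function names are invented for this example.

```python
VECTOR_BITS = 128  # width of one AltiVec register


def lanes(element_bits):
    """Number of elements of the given size that fit in one vector register."""
    return VECTOR_BITS // element_bits


def vector_add(a, b):
    """Toy stand-in for one vector add: every lane is added 'at once'."""
    if len(a) != len(b):
        raise ValueError("both operands must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


# Four 32-bit values handled by a single (simulated) instruction.
result = vector_add([1, 2, 3, 4], [10, 20, 30, 40])
```

A scalar unit would need one add per element, so the lane count is the best-case multiplier for code that vectorizes cleanly. Code that doesn't fit this shape sees none of it.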


So the Athlon has some very nice advantages -- but the question is "are they significant?" It isn't that they haven't made a good chip -- they have -- but at what cost? Engineering is always about tradeoffs. I want bigger, better and faster -- and the Athlon is that. But let's break down the costs / tradeoffs.

Athlon has double the L1 cache of the PPC, or 4 times the Pentium3 -- and that gives it something like a 5 or 10% jump in performance overall. Hmmm... lots of power, a small gain for most things. Athlon adds things like huge reorder buffers -- another few percent in performance (but lots more space and complexity). The huge pipelines allow it to run faster (more MHz), which is a significant gain -- but they also increase the penalties for stalls and misses -- and in a few cases can decrease performance -- and they cost in design complexity and chip size. The Athlon doubles its I/O bus speed, but this probably only gives people another 5% increase in real-world performance (overall). Athlon has lots of execution units (much more complexity), but each has its own little pipes and queues, and this makes effective scheduling more complex and offers sporadic returns on performance -- most of the time you can't use all the units and they just sit idle. All of these small improvements are additive -- but still the processor is only nominally faster overall.

I'd take an Athlon chip over a Pentium3, not only because it is slightly better, but also for philosophical reasons (holding Intel accountable for the damage they've done to the computer industry). Yet in all practicality Athlon really doesn't matter much over Pentium3, unless you are using the machine full time and doing lots of Floating Point and not using the Pentium3's Vector Unit (SSE) (which can make up for some of the deficiencies of the Scalar Floating Point).
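The pipeline tradeoff can be sketched with a back-of-the-envelope model. Everything here is an assumption for illustration: the clock rates, the branch fraction, the misprediction rate, and the simplification that a mispredicted branch costs one cycle per pipeline stage. None of these numbers are measurements of a real chip.

```python
def instructions_per_second(clock_mhz, pipeline_depth,
                            branch_fraction=0.2, mispredict_rate=0.1):
    """Very rough throughput model, in millions of instructions per second.

    Assumes an ideal one instruction per cycle, plus a full pipeline flush
    (pipeline_depth cycles) on every mispredicted branch.
    """
    flush_cycles_per_instruction = (
        branch_fraction * mispredict_rate * pipeline_depth
    )
    cycles_per_instruction = 1.0 + flush_cycles_per_instruction
    return clock_mhz / cycles_per_instruction


# Hypothetical chips: a short pipe at a lower clock, a deep pipe at a higher one.
shallow = instructions_per_second(clock_mhz=400, pipeline_depth=4)
deep = instructions_per_second(clock_mhz=600, pipeline_depth=18)

clock_ratio = 600 / 400
real_speedup = deep / shallow
```

Under these made-up numbers the deep pipeline still comes out ahead, but by noticeably less than its clock advantage would suggest, because each flush throws away more work. That is the shape of the argument above: more MHz is a real gain, but part of it is paid back in penalties.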

Compared to a G3 the Athlon performs quite well -- but the 4.5 Million Transistors used in the G4 for AltiVec may be a far better use of space than the extra 10 or more Million Transistors used to make the Athlon different from its predecessors.

Athlon can get great best-case numbers -- but specs and best case benchmarks aren't the real world. Most real world numbers aren't that big a deal. What is important (to AMD) is that these numbers lure the techno-weenies into their FUD and hype. People are buying into the specs and not what they mean. Remember that a computer (or any complex system) is usually a victim of its worst component / bottleneck, and not just the best. PC people and many techno-weenies just don't get that or fail to pay attention to it. You can make something that is not the bottleneck a hundred times faster, and users still won't see a difference. Geeks get so enamored of the individual specs that they forget that it is the system that matters -- and they get so excited over MHz and speed that they forget about work and productivity. In some cases of course speed does matter -- like rendering and serving -- but still it has to be a significant speed advantage.

There are going to be a few cases where there is some significant advantage that the G3/G4, Pentium3 or Athlon will have over the others. But more important than performance is productivity. I know that I wouldn't trade a 300 MHz G3 for a 1 GHz Athlon for getting my work done, because a faster PC doesn't necessarily make me more productive. Computer systems have to offer compelling solutions and not just speed -- they have to use their advantages (speed) and turn them into MY advantages (productivity) for me to care. That is something that many techno-centric people don't pay attention to. Two more frames per second on a game just doesn't matter. A hundred more frames per second on a game doesn't matter if the game is already past a certain threshold (of say about 20 - 24 fps).
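The bottleneck point is what is usually called Amdahl's law: the overall speedup is limited by the share of time spent in the part you improved. A small sketch, where the 5% and 80% time shares are made-up examples and not measurements:

```python
def overall_speedup(fraction_improved, component_speedup):
    """Amdahl's law: whole-system speedup when only part of the work gets faster.

    fraction_improved: share of total run time spent in the improved part (0..1).
    component_speedup: how many times faster that part becomes.
    """
    unchanged = 1.0 - fraction_improved
    return 1.0 / (unchanged + fraction_improved / component_speedup)


# A component that is only 5% of the run time, made 100 times faster.
non_bottleneck = overall_speedup(0.05, 100)

# The actual bottleneck (80% of the run time), made merely 2 times faster.
bottleneck = overall_speedup(0.80, 2)
```

Making the 5% component a hundred times faster buys roughly a 5% improvement overall, while a modest 2x on the real bottleneck buys far more. A spec sheet that only lists the hundred-fold number tells you very little about what the user will see.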


The x86 camp, with Intel and now AMD, has turned an inferior architecture into incredible implementations. The inferiority of the x86 front end costs them much more to design each generation of chip -- but they just put their heads down and do it, and do it well, in spite of their handicap. They keep pushing the envelope forward. I can only drool over what IBM or Motorola could have done if they had put the same amount of money and effort into the PPC. I see multicore and deeper piped (higher MHz) G4s (or G5s). Heck, just stuffing a lot of 1:1 L2 cache on chip could help (and will happen). But Moto and IBM seem to be striving to be just a little better, instead of really trying to make a difference. <Grumble> They are keeping up by putting in only a fraction of the resources and with a much simpler design -- I just wish they would push harder.

IBM and Motorola have been screwing up a good technological lead for quite some time. They haven't been falling behind, just failing to pull ahead. The PPCs will still have many advantages -- especially in areas of simplicity of architecture, size, and cost to design chips. But they have mostly been failing to turn those advantages into significantly more speed -- though Apple has converted those advantages into better systems (like better portables, more reliable systems, and so on). Speed alone doesn't matter as much as people think -- but it damn well does matter as an enabler (it can empower new capabilities). And it is far harder to sell people on architectural advantages when you don't turn them into implementation advantages. I'm sure that if IBM and Motorola had wanted, they could have had multicore chips completed 2 years ago. Some of the delay is probably because Apple doesn't have OS X client in place -- and I'm hoping that changes the whole game next year.

The K7 is a good chip. It will be fast. It will sell well. It takes the approach of throwing everything they can think of into a very complex chip to make things go a little faster. It will probably be outclassed in a year or two by the IA64 (Merced or McKinley) variations of the even "bigger" and "more units" kind of design -- but I think IA32 (x86) will be around a lot longer than people think, so for the most part that IA64 advantage won't matter. Intel should be worried, and will likely see their x86 (IA32) market share erode towards Athlon -- all while Intel will be struggling to erode x86 (IA32) market share away towards Merced (IA64). The next few years will be interesting.

But I still think the Alpha, Merced, and Athlon approach to processing is the wrong way to go for the industry (in the mid term). I'd rather see simpler, cheaper and more elegant designs that do things like put multiple cores on a single chip, or can add in functionality (like AltiVec) that adds value to the machine. Instead of just doubling the size every 18 months to give us a 15% performance increase I want to see multiple cores enable the machine to do more. Speed alone gives me little -- that speed has to be usable for something to have value. Cheap MP (or multi-core chips) will give us all lots of speed to burn -- but more importantly, it should enable new ways of using that speed (in programs) to make us more productive. I like faster processors -- and there is no doubt that the Athlon is a faster processor -- but it is time to go wider (do more and design better) and not just go faster.

Created: 08/11/99
Updated: 11/09/02
