The Future of Processors
Parallelism!

By: David K. Every
©Copyright 1999

Designing processors sounds easy -- just make it faster and make it do more in each cycle. It isn't easy -- it is a complex balancing act of many things, and the sum is dependent on how well you balance all the options. There are differing views as to where processors are going to go in the future -- but basically there seem to be two camps -- more parallelism in a single core, and adding more cores. I'll explain where I think the industry is going to go, and why.

Speed = faster + wider

At first, processor design was just making processors faster and faster (MHz). The smaller the chip (better production technology) the faster the processor can run. So all that designers had to do was keep making the same chip smaller and smaller (newer and newer fabrication), and things would continue running faster and faster (and use less power, and cost less, etc.). Fortunately we keep improving the photographic (lithographic) techniques we use to make and produce chips -- so things keep getting faster and cheaper, and use less power (without too much work on the designers' part).

Unfortunately, MHz and size reduction weren't keeping up with the demand for more performance. So designers decided to add more "parallelism" -- this allows a processor to do more in each cycle. Normally you could do one instruction per cycle (like a load or store or multiply, and so on). Well, what if you made a processor that could do two things at once (do things in parallel)? Then it could do a load instruction AND a multiply instruction at the same time -- twice as fast (in theory). It is a little more complex to make compilers that can schedule things correctly (so that it always puts the multiply after the load, to see the potential performance gains), but it can be done. And it doesn't just stop with two instructions. Some machines issue (or retire) 3 or 4 or more instructions at once -- parallelism.
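As a rough illustration of why scheduling matters, here is a toy model of a machine that can issue two instructions per cycle. It is only a sketch under stated assumptions: the `Instr` record, the register names, and the rule that an instruction cannot issue in the same cycle as one whose result it reads are all invented for this example and do not describe any real processor.

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    name: str                 # label only, e.g. "load" or "mul"
    dest: str                 # register written
    srcs: Tuple[str, ...]     # registers read

def cycles_needed(program, width):
    """Toy in-order issue: up to `width` instructions per cycle.

    Assumption: an instruction must wait for the next cycle if it reads a
    register written by an instruction issued earlier in the same cycle.
    """
    cycles = 0
    i = 0
    while i < len(program):
        written = set()       # registers written so far in this cycle
        issued = 0
        while i < len(program) and issued < width:
            ins = program[i]
            if written & set(ins.srcs):
                break         # dependency: stall until the next cycle
            written.add(ins.dest)
            issued += 1
            i += 1
        cycles += 1
    return cycles

# Each multiply sits next to a load it does not depend on.
well_scheduled = [
    Instr("load", "r1", ()),
    Instr("mul", "r2", ("r3", "r4")),
    Instr("load", "r5", ()),
    Instr("mul", "r6", ("r1", "r2")),
]

# The first multiply needs the load right before it, so they cannot pair.
badly_scheduled = [
    Instr("load", "r1", ()),
    Instr("mul", "r2", ("r1", "r3")),
    Instr("load", "r5", ()),
    Instr("mul", "r6", ("r5", "r2")),
]

print(cycles_needed(well_scheduled, 1))   # 4 cycles, one at a time
print(cycles_needed(well_scheduled, 2))   # 2 cycles, fully paired
print(cycles_needed(badly_scheduled, 2))  # 3 cycles, one pairing lost
```

The same four kinds of work take 2 or 3 cycles on the two-wide machine depending only on instruction order, which is the job the compiler has to get right.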

Smaller and smaller

Think of a computer chip as a very small photograph. You put an image on a negative, and then shoot that down (reduce it down) on to a piece of silicon. The exposed (or non-exposed) areas get a layer of metal (or insulation) deposited, and through a series of layers you build up small little electronic switches (called gates, transistors, etc.) -- all of them work together to become your chip (processor).

The more you can reduce your image, the more individual chips you can get on the same wafer of silicon. The more chips you get on the same wafer, then the more your costs go down. Think of wafers as having a fixed cost -- if you can produce 100 chips on a wafer for $X, and in a year you can double that number to 200 chips on the same wafer, then your cost per chip is about 1/2 as much. That's basically how it works.
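The arithmetic here is simple enough to sketch. The wafer price below is a made-up placeholder for the "$X" in the text, and the model assumes every chip on the wafer works (no yield loss) and ignores wasted area at the wafer edge. The 0.7 linear shrink is also an assumption, chosen because it roughly halves the area of each chip.

```python
def cost_per_chip(wafer_cost, chips_per_wafer):
    """Fixed wafer cost spread evenly over every chip on the wafer."""
    return wafer_cost / chips_per_wafer

def chips_after_shrink(chips_per_wafer, linear_scale):
    """Chips per wafer after shrinking each chip edge by `linear_scale`.

    Area scales with the square of the edge length, so a 0.7x linear
    shrink gives about 0.49x the area and about 2x the chips.
    """
    return chips_per_wafer / (linear_scale ** 2)

WAFER_COST = 1000.0  # hypothetical stand-in for "$X"

print(cost_per_chip(WAFER_COST, 100))   # 10.0
print(cost_per_chip(WAFER_COST, 200))   # 5.0
print(round(chips_after_shrink(100, 0.7)))  # roughly 200 chips
```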

Since the lithography techniques used to make chips keep getting better and better, the cost of chips (including processors) keeps getting less and less. Smaller also means more than just cost: it means that things are closer together -- so it takes less time for each switch to talk to the ones next to it (the chip can run faster). Because things are closer and smaller, it can take less power (which means less heat, or faster at the same power). So simple lithography improvements keep remaking the computer chip industry every 18 - 24 months. We still have a long way to go before we max-out. This is the magic formula of computers.

The problem is that chip makers don't just want to make a processor that keeps costing less and less (though that is a nice benefit) -- chip makers want to have many processors at various price points. Let's say one price point is $500. When the process technology doubles they may sell the older version of that chip at $250 -- but they also want a new chip that is twice as complex to fill in the old $500 gap (and that will stay on top of the performance heap). So chip designers keep adding things to chips to fill up the newly created "free" space (or roughly twice as many switches every 2 years). How each company uses that space will determine how fast their chip runs (at a given clock speed) -- and that will help them sell more chips. So it is a complex balancing act.

CISC versus RISC

The history of CISC and RISC is interesting, and relevant. We had these big old chips that had tons of leftover crap (instructions that were seldom used because compilers only used a small fraction of total instructions). All this under-utilized instruction set complexity made it harder to design each version of processor. This meant that the teams required to create each generation were getting larger and larger -- and things were getting more and more complex, which meant more and more potential for errors, and so on. CISC was having problems keeping up with Moore's law.

Some researchers figured out that if only 10% of the instructions get used 90% of the time, then why not throw out the other 90%? That gives you a lot more time to optimize the heck out of the important 10%, and it gives you cleared out space (on chip) to make the processor go fast. Since the processor is much simpler in design, it is easier to make subsequent versions of the chips (with fewer designers) -- and so it is easier to keep up with each generation. The only question was, "what to add in all that free space that would make the processor go faster" (not just cost less)? They decided to add a few things:

Cache -- since memory isn't able to keep up with processors (and is falling behind more and more with each generation), they would use some of the space for cache (high-speed on-processor memory), so the processor doesn't have to go to slower main memory as often.

More execution units (more inline or single-stream parallelism). This way the computer can try to do more than one instruction at once (and since it is getting most of those instructions from cache, this doesn't even strain memory too much).

Pipelines -- each instruction would be broken into simpler stages. Since each stage is simpler than trying to do a whole instruction at once, you can make the main processor clock go faster (more MHz)! [Read the Pipelines article for more on how they work]

More registers (where the computer remembers programmers' variables). The more registers you have, the less often you stomp on registers that already have something important in them (and have to reload them).
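The pipeline idea in the list above can be put into numbers with an idealized model. This is a sketch, not a description of any real chip: it assumes every stage takes exactly the same time, there are no stalls or branch penalties, and splitting an instruction into stages adds no overhead. The function names and the sample figures are invented for illustration.

```python
def unpipelined_time(n_instructions, instr_time):
    """Each instruction runs start to finish before the next begins."""
    return n_instructions * instr_time

def pipelined_time(n_instructions, stages, instr_time):
    """Ideal pipeline: fill the pipe once, then finish one instruction per cycle."""
    clock_period = instr_time / stages       # shorter stages allow a faster clock
    cycles = stages + (n_instructions - 1)   # fill latency plus one cycle per extra instruction
    return cycles * clock_period

n = 1000
t = 10.0  # hypothetical time units for one whole instruction

print(unpipelined_time(n, t))      # 10000.0
print(pipelined_time(n, 5, t))     # 2008.0, close to 5x faster
```

With a long run of instructions the speedup approaches the number of stages, which is the best case; stalls and branches are what pull real pipelines below it.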

It worked and was a brilliant success. Anything you could do with CISC, you could do with RISC better (simpler design, fewer gates, fewer designers, and so on). It was just a matter of tradeoffs -- and RISC made much better tradeoffs. Sounds great, and clear cut -- it isn't! The real benefits of RISC only worked for a few generations. More than that, with enough wasted money and wasted chip space, you could add all the same things from RISC to CISC.

RISC IS DEAD!

RISC had real advantages in the 80's and 90's -- but that was in the days when processors only had hundreds of thousands of transistors (switches). Nowadays we've flown past millions, and are heading into hundreds of millions of transistors. Think about what that means to processors and cores.

A "core" is the part of the processor that does the work. Usually the "cache" is not considered part of the core -- but "core" is a loose term that can vary by implementation or company.

Let's assume it takes a million switches (transistors) to make a good RISC core (PPC), and it takes two to three times as many switches to make a CISC core (x86/Pentium).

In the 80's, when chip designers only had a few hundred thousand or a million transistors, RISC was far better -- since the CISC core had to keep making more sacrifices to work at all (like not doing floating point, not having a memory manager, having to use microcode, and so on).

In the early/mid 90's, when chip designers only had 3 Million total switches to play with, the PowerPC had a lot more "left over" switches to use for other things (like cache or parallelism, etc.) compared to the CISC chip. The RISC chip could be far better -- and the PowerPC was far superior at the same clock rate or given size. The Pentium (P5 and P55C) had many more compromises and was far inferior to RISC chips of that era (like the 603's and 604's).

But then a few more years went by (mid/late 90's) and chip designers got 6 Million transistors on a single chip (or you could cheat and make a processor that was 3 chips set in a single package -- with a processor bonded to a large pair of external caches -- like the Pentium Pro / P6). With far more cost in design, size and heat the CISC could almost keep up. Sure, CISC took twice as many switches, 10 times as many designers, and was far more complex to design (and buggy), and it cost more to make, and so on -- but it worked nearly as well. So the advantages of RISC that made sense in a processor of a few million transistors (or fewer) suddenly didn't matter as much on a 6 Million transistor processor.

Time keeps marching on! Now we are looking at 10, 20 and 50 Million transistors, and heading into hundreds of millions. Who cares if the core takes 1 or 3 million transistors when your total transistor budget is heading towards 50 Million! RISC's advantages matter less in high end processors -- though they still matter a lot in low-end embedded controllers, portables, and so on. RISC's time was the 80's and early to mid 90's... but it is mattering less and less every day. But don't count RISC out just yet -- I'll explain why later.
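The argument of the last few paragraphs is really just subtraction, and a short sketch makes the trend visible. It uses the round numbers assumed above (a 1 Million transistor RISC core, and the upper figure of 3 Million for a CISC core) against the 3, 6 and 50 Million budgets mentioned in the text. The constant and function names are invented for this example.

```python
RISC_CORE = 1_000_000   # assumed core size from the text
CISC_CORE = 3_000_000   # upper end of "two to three times as many"

def leftover(budget, core):
    """Transistors left for cache, extra units, etc. after the core is built."""
    return max(budget - core, 0)

for budget in (3_000_000, 6_000_000, 50_000_000):
    risc = leftover(budget, RISC_CORE) / budget
    cisc = leftover(budget, CISC_CORE) / budget
    print(f"{budget // 1_000_000:>2}M budget: "
          f"RISC has {risc:.0%} left over, CISC has {cisc:.0%}")
```

At 3 Million the CISC core has nothing left over while RISC keeps two thirds of the chip free; at 50 Million the two are within a few percent of each other.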

What do you do with all those gates?

The problem is that no one can effectively use 50 Million transistors yet. It takes 1 - 4 Million to make a good core. And there are diminishing returns on all the techniques used by RISC to go faster.

Cache has big rewards for the first 16K, then the next 64K probably only doubles the speed again. The next 256K probably doubles performance, and then another 1 or 2 Meg may double again, and so on. It takes far more transistors to make a small difference in speed for each subsequent step. Even staging with onboard L1, L2 and L3 caches keeps getting less and less return for each generation. It keeps getting better by getting larger and more complex, but it keeps making less of a difference than previous revisions.

It is easy to write code that keeps 2 or 3 execution units filled (2 or 3 things running at once) -- it gets much harder to keep 5 or 6... and more than that almost never sees any real world gains (unless you go into vector units like SIMD). Not only that, but you have to add fancy stuff (like out-of-order execution units and reorder buffers, etc.) to allow more units to actually work -- and those add a lot to your size and development time (complexity and potential bugs) all for an ever decreasing amount of performance improvement.

Pipeline stages can only be simplified to a point -- the pipelines themselves get more and more complex (overall) as they get deeper. There is a delay (latency) while the pipe is being filled, and you have to be concerned about stalls -- so you keep having to make things more complex to avoid stalls, and so on. You need to do much more with branches (to avoid stalls) and add things like predication (to avoid branching) and so on. Things get a lot more complex, for not much return in real world performance. A 4 or 5 stage pipe (like many of the PPCs) makes a big difference over none -- but a 10 or 14 stage pipe (like the Pentium III) has almost no real world advantages over the smaller pipe (say a few percent and many times the size and complexity). Beyond here is complexity insanity (and Merced?) with little returns.

How many registers does a program need? After a while it takes a lot of time to do a context switch (when one program gets control, and all the old registers have to be saved). Registers basically behave like memory -- 32 registers (PowerPC) is probably twice as good as about 8 (Pentium). 128 registers may be 50% better than 32 registers (despite taking more like 5 - 8 times as much space on chip) and so on. You are gaining a little, and spending a lot in design time, area, transistor count, and so on.

So it is getting very hard to effectively use all the space the designers have, and see any significant real world results. A big, hot, complex, super RISC machine like Alpha or HP-PA really doesn't have a large performance advantage (MHz for MHz) over a smaller simple RISC chip like PPC. They see most of their performance gains by using lower yield (higher MHz) manufacturing techniques (and staying on the bleeding edge, which you can't do in higher volume manufacturing), or by having expensive and fast memory and system busses. Twice the size and complexity (and more like 4 - 10 times the amount of design effort) is resulting in performance increases on the order of 10 - 15%. Frankly, technology and complexity are just maxing us out.

LONG LIVE RISC!

Remember what the goals of RISC were? Simplify, reduce complexity, and then use that space for other things. All the techniques used in the first RISC were to get more parallelism and performance out of a single stream of execution. It was all about how to make one sequence of instructions run through the processor faster -- but the important part was "how to use the space better".

Multi-Threading and Multi-Processing are where programs are broken up into many parts that can all be working at once, and run on different processors at the same time. So a Photoshop filter gets broken into 20 blocks (chunks) and both processors keep grinding on different blocks at the same time until each finishes roughly half the blocks. Two processors are nearly twice as fast as one. In your system, one processor can handle downloading from the Internet (or checking your email) while the other is paying attention to your foreground app. One can be doing speech recognition, while another is playing a video stream from the Internet, and so on. This type of multi-stream parallelism (loose-grained) is far easier and better understood than single-stream parallelism (fine-grained) -- especially beyond a few execution units. It is also far easier to scale up -- so you can go to 4, 8, 16, or more processors with multiprocessing, and see a pretty linear performance growth -- while going from 4 to 16 execution units is unlikely to show a significant performance increase (say a 10% difference instead of 4 times faster like in loose-grained).
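The block-splitting idea can be sketched in a few lines. This is an illustration only, not how Photoshop or any real filter is implemented: the "image" is a list of rows of 8-bit values, the "filter" is a simple invert, and the function names, the 20-block split and the 2 workers are assumptions taken from the example in the text. It uses a thread pool from Python's standard `concurrent.futures` module so it runs anywhere; in standard CPython, threads do not run pure Python code on two processors at once, so a real speedup on CPU-bound work would more likely need `ProcessPoolExecutor` instead.

```python
from concurrent.futures import ThreadPoolExecutor

def invert_block(block):
    """Stand-in filter: invert every 8-bit pixel in one block of rows."""
    return [[255 - pixel for pixel in row] for row in block]

def split_into_blocks(rows, n_blocks):
    """Cut the image into at most `n_blocks` contiguous runs of rows."""
    size = max(1, -(-len(rows) // n_blocks))   # ceiling division, at least 1 row
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def filter_image(rows, n_blocks=20, n_workers=2):
    """Hand blocks to a pool of workers; each worker grabs the next free block."""
    blocks = split_into_blocks(rows, n_blocks)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        finished = pool.map(invert_block, blocks)   # results come back in block order
        return [row for block in finished for row in block]

# A small hypothetical 40-row by 8-column test image.
image = [[(x + y) % 256 for x in range(8)] for y in range(40)]
result = filter_image(image)
print(len(split_into_blocks(image, 20)), "blocks,", len(result), "rows out")
```

Because each block is independent, nothing in `invert_block` has to know how many workers exist, which is why this style scales so much more easily than adding execution units.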

So has the question, "What to do with all those gates?" been answered yet? Multiple cores (sub-processors) on a single processor, and some cache, and you can easily fill 40 Million transistors. Not only that, it is far easier to design this type of chip than it is to add more parallelism to a single stream. For multi-stream parallelism, you just make a core, then mirror it a few times, and then just add some connection logic to allow the different cores to talk and share cache and so on. (Nothing is quite as easy in implementation as the theory says -- but you get the idea). Designers can scale Multi-Core design to dozens of processors before they will max out and start bumping their head (get diminishing returns) -- unlike single-stream parallelism which is already getting diminishing returns. So multi-cored processors are the future!

Of course multi-stream parallelism isn't completely linear growth -- just more linear than adding more execution units to the back end. And you can't go on forever either (you saturate the memory bus, and not every problem breaks down well for multiple threads and processors) -- but it will help us get through the next 4 - 8 generations of processor design -- which is a decade of improvements. Right now, MP starts having issues after about 4 or 8 processors, but that may improve as we make subsequent generations. Also we can increase the complexity of the Hardware (have multiple busses) and do some large scale AMP (Asymmetric MP) systems or "clusters", and take that number up to many dozens of processors before you start to max out. So we have a while.
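One common way to put a number on "not completely linear" is Amdahl's law, which is a standard textbook model and not something taken from this article. It assumes a fixed fraction of the work can be split across processors and the rest cannot; it does not model memory bus saturation at all. The 95% parallel fraction below is a made-up figure for illustration.

```python
def amdahl_speedup(n_processors, parallel_fraction):
    """Amdahl's law: speedup when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)

PARALLEL_FRACTION = 0.95  # hypothetical: 95% of the work splits cleanly

for n in (1, 2, 4, 8, 16):
    print(f"{n:>2} processors: {amdahl_speedup(n, PARALLEL_FRACTION):.2f}x")
```

Under this assumption the gain from each doubling shrinks as the processor count grows, which fits the picture of MP running into trouble somewhere past 4 or 8 processors.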

Remember how RISC has an advantage in simplicity (for the basics) -- a simple core (like the PowerPC) is about 1/2 - 1/3 the size of a CISC design (like P6 core) at the same performance. The simplicity doesn't matter much right now on a single processor desktop machine -- because power and size for a single core desktop computer just isn't that important. Now make a 4-Core RISC chip and put it on an SMP based OS (like OS X). How will Intel compete? They can't just add more cores to their processor because each core takes so much more space, heat, cost, power, and so on. They have to wait until the process technology advances further -- but by then, the RISC chip just implements two or three times as many processors on their new flavor, and so on. For the next few generations, RISC has the potential to keep outperforming CISC by at least 2 x for any given price-point, complexity, heat, etc. -- and CISC just can't get around the physics of its design flaws!

Single Stream Parallelism

So CISC is about to have its design flaw exploited seriously by nice, simple, multi-core RISC implementations -- like PowerPC or MIPS. The problem is that most other RISCs (other than PowerPC) lost sight of the original goal (processor simplicity). They kept making their RISCs more and more complex, to try to increase their single-stream parallelism and use all those "free" transistors (and stay on the cutting edge). Now those other mega-cores are so big (hot, complex, etc.) that they can't take advantage of multi-stream parallelism! Alpha, HP-PA, SPARC and others are big, hot, RISC chips that will require major reworks (simplifications) to shoehorn them down into multiprocessor versions -- and if they do that, then all their code that was designed for lots of execution units (and that complexity) will not run as well in the simplified core. They can wait years for Moore's law (and transistor counts) to go up high enough where it won't matter -- but I'm not sure they can handle a few generations with nothing significant new.

As if that wasn't bad enough, Intel decided to out-complexify these complex-RISCs. They are making a new Uber-RISC processor (that uses VLIW and EPIC) called Merced (IA64). This design is where each instruction is even simpler than normal RISC, but is going to have many more execution units, far more registers, and far more complexity of design -- all to try to deal with all that single-stream parallelism. The RISC instructions are grouped into bundles of 3, each instruction is larger to describe the single stream parallelism (causing more load on the memory subsystems), and the compiler has to be far, far, far, far more complex to try to keep things scheduled for all those units (and working near peak), and the compiler is far more dependent on a chip's implementation to create code (meaning it is harder to keep older code working at full speed on subsequent versions of the processor). So far, no one has gotten this level of parallelism and complexity to show any significant performance returns in the real world -- but Intel seems to think they can. It is all a nightmare of complexity, for something that will make a single stream run probably 30% faster than current RISC code, at a cost of 20-30% more space in memory and probably 4 times larger chip area (more transistors). It is sort of trying to out-RISC RISC, by making things much more complex and using all the transistors they can.

IA64 / EPIC is not likely going to be a big win in performance -- let's say 20 - 30% in the real world over other RISC ISA's. But Intel would love it if other chip designers tried to follow them. Intel's implementation is so complex that you need to throw hundreds of designers at making an EPIC / Merced core, and it will still take many years (Intel started in like 1994 or 1995 with a goal of 1997 -- it will ship in the end of 2000 and get a revision in 2001 that may deliver on the first one's promises). The complexity is so huge, and the costs so much, that only a company like Intel can afford it -- and Intel can afford to bankrupt anyone that tries to follow in their footsteps. Of course they want to follow the path of maximum complexity -- because anyone that tries has to try to outspend Intel! I wouldn't want to play by those rules. The problem (for Intel) is that you don't have to -- they didn't anticipate the path of minimum complexity.

Of course you can build multi-processor systems without having multiple cores -- but the costs are much, much higher, the performance gains are lower, the system complexity is higher, and you can't bring this performance down into portables and the low-end. So this pushes your system and maintenance costs up, when the rest of the market trends are down (into cheaper).

Conclusion

We are at a major crossroads in processor design -- are we going to go for more and more complex single-stream processors, or go for simpler multi-stream processors?

Intel definitely wants to head towards the path of more complexity, and away from nice clean multi-core chips. If other chip designers (or systems designers) follow Intel down the IA64 (Merced) path, then they are like cattle being led to the slaughterhouse. Intel can't go multi-core in the IA32 (x86 / Pentium) arena for a few more generations (until the process technology gives them enough gates to make it affordable) -- and most Intel chips are sold for Win98, which doesn't support MP at all -- and it is likely 2 to 4 more years until it will. So Intel has far more to worry about MP (for the next 3 or 4 generations) than to gain from trying to use it -- as such I think they will avoid it like the plague.

AMD is headed towards a big mondo Alpha-like super-chip with the K7 -- so they too are headed towards very complex single-stream processors. It is possible that some of the simpler x86 clone-chips could do a multi-core chip, and they could compete pretty well in performance with a simple x86 with multi cores against a hairy do-everything x86 chip -- but the PC market is all about following. Historically the PC market will follow Intel, even when there are better choices elsewhere, so I hold out very little hope for change in that market.

The PowerPC and MIPS could easily go for the multi-stream / multi-core versions of their chips NOW! NOW IS THE TIME! They can have a 3 or 4 generation head-start to exploit their advantage, and really compete in the high end. The only question is will the companies have enough of a clue to do it?

Since SGI spun off MIPS it is hard to say where they are going -- they've done a little low-end stuff, and a little high end stuff. I doubt they have the funding to follow through on a vision right now, and it is hard to get investors to finance anything that competes with Microsoft or Intel, despite the huge amount of potential gains it would have. So despite the MIPS being a nice chip, I just feel like the odds are against them.

Motorola management is not known for their vision -- usually relying on Apple or IBM for that. But I can't believe they are so stupid that they won't realize the potential of this path for them -- not only in the mainstream market, but in all their high-end embedded markets as well. IBM is pursuing a multi-core version of the PowerPC chip, and has even shown one -- but so far it is more towards their server side of the market and they aren't really "mainstreaming" it yet. Let's hope they go towards the straight PPC side as well. Apple may or may not get it -- many of their engineers already get it -- the question is does this vision go to the top (where all decisions are made)? If IBM or Motorola drop the ball, will Apple try to pick it up and run with it? Apple can produce their own PPC chips -- and could have an MP version of the G4 fabbed or made either by Motorola or IBM if they really wanted -- but if IBM and Motorola won't pick up that expense (like they should), will Apple? Or can Apple exert enough pressure to get IBM and Motorola to do what is in all their best interests? I think they can.

If IBM, Motorola and/or Apple do follow this path and push hard, they have the potential to make a serious dent in the Wintel mindshare -- marketshare comes with that (just slower). This could be timed perfectly: just as Intel is trying to push people to change their code to run on IA64 (to see a 30% performance improvement), Apple/IBM/Motorola could push people to change their code to run on MP-PPC (to see a 200% - 400% improvement). And there is lots of code that already runs on PPC so this could be a very easy task. But we shouldn't underestimate the hegemony that Wintel exerts, and how slow the market is to change (even when it makes a positive difference on their bottom line). But a good Multi-core chip, combined with a good MP OS (like Darwin, Linux, or OS X), could result in serious competitive advantages that even the most conservative group-thinkers could not ignore. That should affect stock prices, marketshare, mindshare, and momentum -- everything important to companies. In fact, these advantages could bring in innovation that others (Intel implementers) could not easily compete with, like higher quality speech recognition (by devoting a core to that task), and so on. These are serious value-adds that this free processing power would enable. AIM (Apple, IBM and Motorola) could deliver on their promise of "twice the performance of anything Intel can field" -- and that mindshare could cause an upward trend in marketshare (in mainstream processors as well as embedded controllers) for the next few years. That could result in really, really big money for Apple, Motorola, IBM, and anyone willing to diverge from blindly following Wintel.

Created: 04/07/99
Updated: 11/09/02
