People keep discussing what is and what is not RISC without really understanding it. RISC is just a name we've given to a certain design philosophy (one that came from a collection of techniques to make processors better). So this will be a crash course in what exactly RISC is.
What is RISC
RISC stands for Reduced Instruction Set Computing. It started in the 70s (though a few of the concepts predate that time) as a few techniques to make a processor simpler (in design). Why simpler? Well, the processor being "simple" (in size and complexity) affects quite a few things in a processor design, like the time it takes to design, the size of the chip, the power it uses and what it costs.
Some parts of the design (and tradeoffs) feed into other parts -- and making the right tradeoffs enables competitive advantages (over competing processors) that defined the likely success or failure of a processor.
Over time, the individual techniques for making a processor "simpler" or "faster" have all changed (evolved). The first generation RISC processors were substantially different architectures from the second and third generation ones (and so on) -- but the philosophy of why and where tradeoffs were made didn't change much at all. The techniques used to make RISC faster are not exclusive to RISC -- CISC can use those same techniques to make its designs go faster as well. It is just that doing so with CISC takes more time, size, power and cost. Yet the design strategy (philosophy) of RISC remained constant -- it was the idea of trading off transistors on the processor (and design complexity in the instruction set) for other ways to make the processor faster/better. So the techniques aren't RISC -- the philosophy of where to make tradeoffs (and how) is more what defined RISC. So RISC is just a design philosophy.
RISC is a philosophy with the goal to reduce the COMPLEXITY in one area of the processor (the instruction set complexity) -- then use that saved space and design time in other areas of the processor that will matter more.
RISC is not JUST about instruction count
What RISC does NOT stand for is reduced instruction set COUNT. Cutting the number of instructions has almost nothing to do with RISC -- remember, it is about complexity! Reducing the number of instructions is one way to pull some complexity and size out of the instruction decode logic, so eliminating instructions is one technique for reducing the instruction set complexity -- but it is only one technique among many. Why you are pulling instructions out (to reduce complexity) is far more important than what you are pulling out. Too many people don't know this, and think that RISC is just about instruction count.
Before RISC many processors had so many instructions on board that designers couldn't keep up. The designers had given up on adding each instruction into hardware -- and usually just broke down the complex instructions (the assembly or machine code) into a series of simpler instructions called microinstructions or micro-operations. The processor would then run many of these microinstructions (microcode) to complete a single real instruction. It was like having an emulator for your instruction set. It cost processors in performance, but was the only way they could afford to add all the instructions in. It is important to remember that complex instructions decoded to simpler backend instructions is a CISC concept. Simpler instructions that don't need to be decoded is a RISC concept.
RISC designers profiled instruction sets to figure out exactly which instructions (and modes) were actually being used. They learned that compilers (and most programmers) only used a small minority of the instructions -- and the rest were very rarely used. That was a lot of design effort (transistors and complexity) to drag around for not much reward. It made sense to eliminate the rarely-used instructions and spend your design time (and space) making sure the fewer remaining instructions ran in hardware (not microcode) and that they ran really fast. This was primordial-RISC and the only area where they actually focused on reducing the instruction set count -- but it was only the start of the process. The reason they were doing it was the important thing (to reduce complexity and simplify the design).
After trimming the deadwood instructions, RISC designers still wanted to reduce the complexity more (not just the count). They found out that many instructions were very complex to implement. Removing that added complexity was actually more significant to RISC, and defined the philosophy (and techniques) more.
Older CISC processors had an instruction that moved data around -- the MOV(e) instruction. It MOV'ed from registers to memory, from memory to registers, from registers to registers, and from memory to memory. This MOV instruction also had many different addressing modes, and worked on different data sizes, and it was complex (it took many pages just to document all the ways that it could be used). This one instruction alone was a nightmare to implement, and the RISC people thought that this one instruction was really trying to do many different things -- so it was not really factored (broken down to its core). This was the first thing they wanted to improve.
Breaking down the MOV instruction became one of the core concepts of RISC. Instead of MOV, RISC used two separate instructions LOAD and STORE. They used one instruction to LOAD a register from memory, and another to STORE a register to memory. All other operations are register to register. No more memory to memory operations. This limited access to memory to two control points -- which made the instruction set (and memory traffic) much easier to optimize. The designers fixed the data size of the load and store to always be the same size (the entire register) -- so they didn't have any more complex sizing issues. There were fewer addressing modes (complex calculations to figure out where to get something), and so on. This one instructional change alone dramatically reduced the complexity of the instruction set, and reflected the philosophy.
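To make that split concrete, here is a minimal sketch in Python (a toy model, not any real instruction set). The function names, the dictionaries standing in for memory and registers, and the register name r1 are all made up for illustration. It just shows one memory-to-memory MOV doing the same job as a LOAD followed by a STORE, with memory only ever touched at those two control points.

```python
# Toy model (hypothetical): memory and registers are plain dictionaries.

def cisc_mov_mem_to_mem(memory, dst_addr, src_addr):
    """One complex instruction: reads memory and writes memory in one step."""
    memory[dst_addr] = memory[src_addr]

def risc_load(registers, memory, reg, addr):
    """LOAD: the only instruction that reads memory."""
    registers[reg] = memory[addr]

def risc_store(registers, memory, reg, addr):
    """STORE: the only instruction that writes memory."""
    memory[addr] = registers[reg]

# CISC style: a single memory-to-memory MOV.
memory_cisc = {0x10: 42, 0x20: 0}
cisc_mov_mem_to_mem(memory_cisc, 0x20, 0x10)

# RISC style: the same copy as two simpler instructions through a register.
memory_risc = {0x10: 42, 0x20: 0}
registers = {"r1": 0}
risc_load(registers, memory_risc, "r1", 0x10)
risc_store(registers, memory_risc, "r1", 0x20)

print(memory_cisc == memory_risc)  # True: same result, two simpler steps
```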
RISC machines used to be called Load/Store architectures for this reason!
In this case (and a few others) the RISC architecture actually increased the number of instructions -- but it reduced the complexity of creating each instruction (and decoding it). One MOV instruction became two simpler instructions (Load and Store). And in many other cases one complex CISC instruction (with lots of attributes and "options") became a few simpler RISC instructions.
In CISC, instructions were considered one instruction even when they had many modifiers (like the size of data, the addressing mode, the status register results, and so on) -- while in RISC the same basic instruction working with different data size, addressing modes, or status results were all considered discrete instructions. This dramatically alters the total instruction count for each processor. So people that say that a particular RISC processor has more instructions than a CISC processor aren't usually counting variations in CISC (or normalizing them for RISC).
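Here is a rough sketch of that counting problem. The modifier lists and the naming scheme (ADD.word.indexed.S and so on) are invented for illustration and don't come from any real processor manual. Counted the CISC way this is one ADD instruction; counted the RISC way, every combination of modifiers is its own instruction.

```python
from itertools import product

# Hypothetical modifiers for a single "ADD" instruction.
sizes = ["byte", "half", "word"]
addressing_modes = ["register", "immediate", "indexed"]
sets_status_flags = [False, True]

# CISC-style counting: one mnemonic, no matter how many modifiers it has.
cisc_count = 1

# RISC-style counting: each combination is a discrete instruction.
variants = [
    f"ADD.{size}.{mode}" + (".S" if flags else "")
    for size, mode, flags in product(sizes, addressing_modes, sets_status_flags)
]
risc_count = len(variants)

print(cisc_count, risc_count)  # 1 18
```

Same functionality, wildly different "instruction count" -- which is why the raw count tells you so little.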
Again, RISC may have more instructions depending on how you count -- but the important thing, the complexity of the instruction set (and the complexity of each instruction) is what is reduced. This should help people understand that RISC is not about the number of instructions, but instead was about the complexity of each instruction and the overall complexity of the entire instruction set.
Processors are faster than memory -- so memory accesses like MOV, LOAD or STORE can take a lot of time. Prefetching data is a way to "guess" at what the processor is going to need, and to load it ahead of time. This prevents those memory accesses from slowing the processor down -- and keeps the processor fed (running much faster).
With a MOV instruction, the processor could not easily predict what data it was going to need ahead of time. In order to figure out what data a MOV was going to want, you had to decode the entire instruction first -- meaning its addressing mode, source, destination, etc. -- and you had to have decoded all instructions that came before it. That makes prefetching a very hairy process (hard to do), and it will take a lot of space.
Since most RISC instructions were fixed size (or simpler) it was easier to make prefetching cache logic. With a load/store architecture, a simple part of the processor (prefetch logic) can peek ahead and preload instructions. This little snooper will see a LOAD instruction coming up (with a fixed size and simple address) and just pull that location into a buffer (cache) ahead of time. If something you wanted was already in cache, then you didn't have to waste time going and getting it from memory. Because it is in cache, you aren't going out to memory to get that information as much -- which leaves memory (and the memory bus) more "open" for other parts of the computer to use. So it is not only faster for the processor, but for the entire system.
The key to why RISC chips got prefetching is that it was easier to do because the instructions were simpler. So it was easier to do with RISC, and doing it wouldn't take up as much space (design time) as it would with CISC. This made it a good idea to do with RISC.
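The "little snooper" idea can be sketched in a few lines of Python. This is a toy model under big assumptions: the program is a list of made-up (op, address) pairs, every LOAD has a simple address that is known up front, and a prefetch is treated as finishing instantly (real hardware has to start it early enough). The function name and the lookahead parameter are hypothetical.

```python
def slow_loads(program, memory, lookahead):
    """Count LOADs that had to wait on memory because nothing prefetched them."""
    buffer = {}  # stands in for the prefetch buffer / cache
    waits = 0
    for i, (op, addr) in enumerate(program):
        # The snooper peeks at the next few instructions and preloads any LOADs.
        for next_op, next_addr in program[i + 1 : i + 1 + lookahead]:
            if next_op == "load":
                buffer.setdefault(next_addr, memory[next_addr])
        # Now the current instruction runs.
        if op == "load" and addr not in buffer:
            waits += 1
            buffer[addr] = memory[addr]
    return waits

memory = {100: 7, 104: 8, 108: 9}
program = [("load", 100), ("add", None), ("load", 104), ("add", None), ("load", 108)]

print(slow_loads(program, memory, lookahead=0))  # 3: every LOAD waits
print(slow_loads(program, memory, lookahead=2))  # 1: only the very first LOAD waits
```

The peeking loop is only this simple because each instruction is one fixed-size entry with a plain address, which is the point being made about RISC.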
Cache didn't just apply to fetching (LOADs) -- it also applied to writing data (STOREs). On a STORE you just store to a buffer/cache, which will get written to memory by another simple part of the processor whenever there is some free time. So without a cache, everything stalls and waits for the write (STORE) to complete -- but with a cache, you just go on without holding everything else up.
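A deferred write can be sketched like this. The WriteBuffer class and its method names are made up for illustration; it is a toy, not how any particular chip does it. A STORE just drops the value in a queue and returns, loads check the queue first so they never see stale data, and something else drains the queue to memory when there is free time.

```python
from collections import deque

class WriteBuffer:
    """Toy deferred-write buffer (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory
        self.pending = deque()  # stores waiting to be written to memory

    def store(self, addr, value):
        # Returns right away; the slow memory write happens later.
        self.pending.append((addr, value))

    def load(self, addr):
        # The newest pending store to this address wins over memory.
        for pending_addr, value in reversed(self.pending):
            if pending_addr == addr:
                return value
        return self.memory[addr]

    def drain_one(self):
        # Called "whenever there is some free time".
        if self.pending:
            addr, value = self.pending.popleft()
            self.memory[addr] = value

memory = {0x10: 1}
buffer = WriteBuffer(memory)
buffer.store(0x10, 99)
print(memory[0x10], buffer.load(0x10))  # 1 99: memory is stale, the load is not
buffer.drain_one()
print(memory[0x10])                     # 99
```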
Cache was also a temporary storage place for commonly used items (locations, registers, etc.). If you kept using these pools, then reusing some data was very fast -- and you didn't have to go to slow memory to get them. It turned out that in processors, a small amount of data is used a whole lot -- so cache can really matter a lot -- especially the difference between no cache and a little cache.
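To see how much "a little cache" matters when a small amount of data is used a lot, here is a minimal simulation. The access trace is invented, and the least-recently-used replacement policy is my assumption for the sketch (real caches vary).

```python
from collections import OrderedDict

def hit_rate(addresses, cache_lines):
    """Fraction of accesses served from a tiny LRU cache with cache_lines entries."""
    cache = OrderedDict()
    hits = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(addresses)

# Hypothetical trace: the same 4 locations used over and over (100 accesses).
trace = [0, 1, 2, 3] * 25

print(hit_rate(trace, cache_lines=0))  # 0.0  (no cache: every access goes to memory)
print(hit_rate(trace, cache_lines=4))  # 0.96 (only the first 4 accesses miss)
```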
Since RISC chips were simpler, they got caches first. RISC designs took less space, which meant more free space on the chip -- so even when CISC machines started getting cache, they couldn't have caches as big. RISC still had more space to devote to cache (since the instruction set was simpler). And since RISC has simpler instructions, it is easier to allow caches to do more (smarter logic to help the cache).
So cache could improve speed on loading (prefetch), reloading (cache) and storing (deferred writing). Caches helped RISC more than CISC because there is more time to add them, they are easier to add (since they are filled with fixed size instructions) and because there is room for more cache. RISC almost always comes out ahead since more cache, or a better designed cache, means more performance. RISC's simpler architecture allowed for more space to do other things, and made it easier to do them as well. This too was the philosophy of RISC -- trade off your design time (costs and space) in ways that will give you more bang for the buck (and return the biggest rewards).
Other things RISC
The Load-Store design decision of RISC kept leading to many other decisions in RISC philosophy. Once they had decided to sacrifice area in the instruction set and use it elsewhere for other things, the issue became: what would they use that area for? What would help make the processor go faster? What were the best returns on investment? Many techniques were created and helped define early RISC -- but it was always about the philosophy of where to use the space better. Since they had more area on the chip, they kept adding things like more registers, more execution units (superscalar), pipelines, out-of-order execution and branch prediction.
Each of these techniques was just an outgrowth of the philosophy of RISC. Make the instruction set simpler, then use the space saved for better things. Since the instructions were simpler, you could more easily break them down and do things with them that were way too complex to do with CISC. Let's look at some of the techniques and what they do (and why).
One thing to help with performance was to add in more registers. Registers are named/tagged locations for working with variables. Programs are often loading, unloading and overwriting these locations -- sometimes overwriting them too much. Loading and unloading takes time -- so if you have more registers, then you don't have to keep loading and unloading them, which means increased performance. In human terms this is like giving you a bigger desk and office -- with more space, you can do more at once, without having to keep filing and re-filing things. Many CISC processors have too few registers, like the Pentium's (x86's) 8 addressable registers (plus stack-based floating point and SSE registers). RISC added more registers because they had more room for them. Addressable registers were a simple way to make things go faster, and a better way to trade off space (than complex but rarely used instructions).
Since RISC has more registers it could do things that CISC couldn't. CISC didn't have enough registers, so the designers were very tight with how they used the registers they had (in creating the instruction set). When programmers (or compilers) do some operation, they have to put the results somewhere, and most CISC instruction sets just stomped one of their source registers. This would read something like A = A + B -- the results of A+B are stuffed into A. A is replaced with the new value and the original A is stomped. If you needed the old A, then you are out of luck -- you either reload it, or should have saved a copy (done a register to register MOV) ahead of time. Doing things this way only required 2 registers to do an operation. With only 8 registers it is important to keep down the number of registers in use.
RISC has enough registers that they aren't concerned with that. So they create the superior (and more normal) 3 operand instructions. RISC operations would read more like C = A + B -- the results of A+B are stuffed into C -- so A is not stomped on. This means that later you can reuse A for something else -- which in many algorithms may be important. In many ways this makes things more efficient in RISC than in the older CISC type instruction sets since you aren't always reloading some stomped on value, or having to save that value off ahead of time.
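The difference is easy to show with a toy register file. The function and register names below are invented for the sketch, and a dictionary stands in for the registers.

```python
def add_two_operand(regs, dst_and_src, src):
    """CISC style: A = A + B. The destination is also a source, so it gets stomped."""
    regs[dst_and_src] = regs[dst_and_src] + regs[src]

def add_three_operand(regs, dst, src1, src2):
    """RISC style: C = A + B. Both sources survive."""
    regs[dst] = regs[src1] + regs[src2]

two_op = {"A": 5, "B": 7}
add_two_operand(two_op, "A", "B")
print(two_op)    # {'A': 12, 'B': 7}: the original A (5) is gone

three_op = {"A": 5, "B": 7, "C": 0}
add_three_operand(three_op, "C", "A", "B")
print(three_op)  # {'A': 5, 'B': 7, 'C': 12}: A is still available for reuse
```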
RISC freed up enough space that they could occasionally add complexity and more (new or compound) instructions. Some RISCs (like POWER and PowerPC) can actually do a single instruction like D = (A * B) + C, or use one register as a mask or modifier on a value. While these (4 operand) instructions are technically more complex than CISC's simpler two operand ones, they may still be easier (simpler) to implement. If these operations are done often enough (like multiply-adds) this can be a big win.
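As a rough illustration of why multiply-add can be a big win, here is a dot product counted both ways. This is only an instruction-count sketch with made-up function names; it uses integers and ignores loads, loop overhead, and the rounding differences a real floating point multiply-add has.

```python
def dot_separate(a, b):
    """Dot product using a separate multiply and add per element."""
    total, instructions = 0, 0
    for x, y in zip(a, b):
        product = x * y          # MUL
        total = total + product  # ADD
        instructions += 2
    return total, instructions

def dot_multiply_add(a, b):
    """Dot product using one compound D = (A * B) + C per element."""
    total, instructions = 0, 0
    for x, y in zip(a, b):
        total = (x * y) + total  # single multiply-add
        instructions += 1
    return total, instructions

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
print(dot_separate(a, b))      # (70, 8)
print(dot_multiply_add(a, b))  # (70, 4)
```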
More Complex or less?
Now another way to make things go faster, was to allow more things to happen at once. Imagine you have to do a big report -- and you have to add all of column A and separately you need to add all of column B together -- then at the end, you may need to add the two totals. You can do it all yourself -- but what if you had another helper? Why should you do all the work while he waits around? You could divide the load amongst two people. You could add column A, while your friend was adding up column B and you could get your results in nearly half the time. In processors that would basically be superscalar -- you have two execution units, that can be working at the same time (on independent things) -- and for many cases it can get nearly twice as much work done. The simpler the operations, often the easier it is to divide the task among many units -- so RISC seems to have an advantage in utilizing superscalar in the design.
Well RISC processors had lots of extra space (because they reduced the complexity of the instructions and had more time and space, etc.) -- so they added extra execution units. In many cases the instructions could be interwoven so that both units could be kept busy at the same time. Doing this puts more load on the compiler, since it had to know enough about the processor to keep both sides busy (be a better project manager). Or the processor itself can occasionally help and divide tasks while it was running (but that takes more space in hardware). Either way, superscalar means that more work can get done at the same time.
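The two-columns example can be sketched as a tiny scheduler. Everything here is an assumption for illustration: every instruction takes one cycle, the instruction names are made up, and the "hardware" simply issues up to width ready instructions per cycle in program order.

```python
def cycles_needed(instructions, width):
    """instructions maps a name to the set of names it depends on."""
    done = set()
    remaining = dict(instructions)
    cycles = 0
    while remaining:
        # An instruction is ready once everything it depends on has finished.
        ready = [name for name, deps in remaining.items() if deps <= done]
        if not ready:
            raise ValueError("circular dependency in program")
        issued = ready[:width]  # one instruction per execution unit
        for name in issued:
            del remaining[name]
        done.update(issued)
        cycles += 1
    return cycles

# Add up column A and column B (independent chains), then add the two totals.
program = {
    "a1": set(), "a2": {"a1"}, "a3": {"a2"},
    "b1": set(), "b2": {"b1"}, "b3": {"b2"},
    "total": {"a3", "b3"},
}

print(cycles_needed(program, width=1))  # 7: one unit does everything
print(cycles_needed(program, width=2))  # 4: two units split the columns
```

Note that the final add can't be split, which is why two units get you "nearly" half the time rather than exactly half.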
Pipeline theory is pretty easy -- but all that it entails can get very complex. The basics of a pipeline is to break some operation into simpler stages, and have individuals each working on a single stage (and passing their results to the next in line). Think of an assembly line. Each person does their part, and then goes on. It allows specialization of skills, localizes logic, and is basically Object-Oriented Design of the hardware world -- while breaking functions into even simpler parts. When you start the assembly line, it takes a little while before everyone gets working, but once they do, they get a whole lot done. Everyone does a simpler task, but that simpler task can be done faster than doing many complex tasks -- and the results are more productivity overall.
I have written far more on how pipelining works, and what it is good for (and not). So if you want to know more, read: Pipelines.
Some CISC instructions (like MOV) really were doing many things. Imagine MOV'ing (swapping) one register for another. What that breaks down into is that you have to store the first register (in some temporary space), move the second register into the first, then move the stored version of the first into the second. If that sounds like 3 discrete actions, it is -- and it also required processor designers to stall the entire processor while it was doing these three things. Making sure that instructions were factored (down to their simplest form) allowed them to run faster -- and meant that most of them required about the same amount of work to get done. Then you could more easily break those simple instructions into even simpler stages. This resulted in a processor that could be doing different stages of many things (like an assembly line) at the same time -- and a processor that is faster overall. This is pipelining.
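The assembly line arithmetic is simple enough to write down. This is an idealized sketch: it assumes every stage takes exactly one cycle and that nothing ever stalls, and the function names are made up.

```python
def cycles_unpipelined(n_instructions, n_stages):
    """Each instruction goes through every stage before the next one starts."""
    return n_instructions * n_stages

def cycles_pipelined(n_instructions, n_stages):
    """Fill the pipe once, then one instruction finishes every cycle."""
    return n_stages + (n_instructions - 1)

print(cycles_unpipelined(100, 5))  # 500
print(cycles_pipelined(100, 5))    # 104
```

The first instruction still takes all 5 cycles (it takes a little while before everyone gets working), but after that the line turns out one result per cycle.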
The problem is that there are complexity and tradeoffs in pipelines. The deeper the pipes, the more MHz can be run -- but also the bigger your penalties are in other ways, the more size it takes to implement, and so on. CISC can do pipelining -- but the pipes have to be much deeper (because different instructions take dramatically different amounts of time) -- which results in far more complexity for the same results. Then if you have deep pipelines, you need to make other things more complex (like out-of-order execution units, reorder buffers, complex branch prediction and so on), all to get the same results. This means far more cost, design time and space are required to get the same results.
Another thing the simpler instructions of RISC allowed was something called out-of-order execution. Instructions could stall the pipeline (or execution unit) for a variety of reasons. Sometimes the processor just hit a LOAD (of something that wasn't in the cache) or a STORE (and had to flush the cache before it could go on), and sometimes an instruction hit a branch or an instruction that was dependent on another instruction's results (that hadn't been completed yet). The processor would waste time (stall) and wait for things to catch up, until the reason for the stall finally got resolved. Stalls are a waste of precious potential -- time the processor could be using to grind away doing work -- so designers really don't like stalls.
One way to avoid stalls is to make compilers smarter. This allows instructions to be reordered at compile time -- moving loads as high up the chain as possible. The earlier you've done your LOADs (or STOREs), the more likely they are to be complete when you need that data. So if you just reorder the code through the compiler, then you get fewer stalls. In early RISC, compilers did most of the reordering (scheduling) and tried to predict every potential stall, and avoid it (by reordering the code). But the compilers could only predict one processor version's behavior (and know how it worked) -- when the hardware changed, the compiled code could not adapt (not without recompiling).
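Here is a small sketch of what that reordering buys you. It models a strictly in-order processor that issues one instruction per cycle and stalls until an instruction's source registers are ready. The 3-cycle LOAD latency, the register names and the tuple format are all assumptions for illustration.

```python
# Hypothetical latencies: cycles until the result can be used.
LATENCY = {"load": 3, "add": 1}

def run_in_order(program):
    """Return the cycle at which the last result is ready on an in-order machine."""
    ready_at = {}  # register -> cycle its value becomes available
    cycle = 0      # earliest cycle the next instruction may issue
    for op, dst, srcs in program:
        # Stall until every source register is ready.
        start = max([cycle] + [ready_at.get(src, 0) for src in srcs])
        ready_at[dst] = start + LATENCY[op]
        cycle = start + 1
    return max(ready_at.values())

# As written by the programmer: each LOAD is used right away.
naive = [
    ("load", "r1", []),
    ("add", "r2", ["r1", "r1"]),
    ("load", "r3", []),
    ("add", "r4", ["r3", "r3"]),
]

# As reordered by the compiler: both LOADs hoisted to the top.
scheduled = [
    ("load", "r1", []),
    ("load", "r3", []),
    ("add", "r2", ["r1", "r1"]),
    ("add", "r4", ["r3", "r3"]),
]

print(run_in_order(naive))      # 8
print(run_in_order(scheduled))  # 5
```

Same instructions, same results, fewer stalls. But notice the schedule only makes sense for this one set of latencies, which is exactly the recompiling problem.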
As RISC evolved, some of the reorder logic got moved into the hardware. The other technique was to allow out-of-order execution. This means the processor itself, at runtime, would just reorder instructions to avoid stalls. Basically the hardware would run into some instruction that was stalled, but it could see that the next few instructions were not dependent on that instruction's results. If those next few instructions weren't waiting for this one's results, then why stop? The processor would just continue executing the other instructions and only really stall when there was an absolute dependency. Eventually, the offending instruction would get done -- just not in the order the code had specified -- but the processor had first made sure that this didn't matter (and nothing was dependent on that result). For simple things this worked great -- and reduced the effects of a stall. And this technique allows some adaptation from processor to processor (it is designed to match the hardware -- and the software doesn't have to be recompiled).
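A minimal sketch of the hardware version: each cycle the processor looks at the next few waiting instructions (the window) and issues the oldest one whose inputs are ready. The latencies, register names and window size are made up, each register is written only once (so there is no renaming to worry about), and every source register is assumed to be produced somewhere in the program.

```python
# Hypothetical latencies: cycles until the result can be used.
LATENCY = {"load": 3, "add": 1}

def run(program, window):
    """Issue at most one instruction per cycle, chosen from the first `window` waiting."""
    ready_at = {}  # register -> cycle its value becomes available
    pending = list(program)
    cycle = 0
    finish = 0
    while pending:
        for i, (op, dst, srcs) in enumerate(pending[:window]):
            if all(src in ready_at and ready_at[src] <= cycle for src in srcs):
                ready_at[dst] = cycle + LATENCY[op]
                finish = max(finish, ready_at[dst])
                del pending[i]
                break  # one issue per cycle
        cycle += 1
    return finish

program = [
    ("load", "r1", []),
    ("add", "r2", ["r1", "r1"]),  # stalls waiting on r1
    ("load", "r3", []),           # independent: can go while the add waits
    ("add", "r4", ["r3", "r3"]),
]

print(run(program, window=1))  # 8: strictly in order, stalls behind the add
print(run(program, window=4))  # 5: the second LOAD slips past the stalled add
```

The code itself was never recompiled; the hardware found the better order on its own.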
Both techniques have advantages, and they actually work well together. All things are "hotly" debated. Out-of-Order was considered by a few to be a violation of RISC -- since they felt that this should be done by the compilers only (they didn't want to add more logic to the processor). I'm more in the pragmatic camp, and believe that it is all about tradeoffs -- as we are given more and more space to burn in a processor, then it is fine to use it for things (like reorder buffers) which allow the processor to go faster. Of course there are degrees -- and reorder buffers and complexity of logic for OOO is tied to the depth of the pipelines. And you can definitely go overboard in a design.
Over time, out-of-order execution got a bit more sophisticated (and took more space). It could look ahead further, and keep shifting instructions around (changing order), or feed instructions to the right units (to only stall one of the units) and so on. It couldn't eliminate all stalls, and the compiler had the potential to avoid some (because it knew more about the code at compile time) -- but the processor did know more about its own hardware, and so the hardware out-of-order units could offload the compiler and were better for some things. It was all about balancing the bottlenecks and finding out what was slowing the machine down the most.
Well, one of the things that causes the most stalls is branches (conditionals). Basically this is an instruction that says, "if this is true, then do this one thing, otherwise do this other thing", or "if this is true, continue from here, otherwise continue from that place over there". Well, you can't know what is actually going to be true until you get there -- so the pipeline stalls until the processor knows the dependent result. However, what if we could predict which way the branch was going to go ahead of time? Then there wouldn't be a stall.
Branch prediction was a way that the compiler (or hardware) can just skip over the stall. It works with out-of-order execution and just takes its best guess at which way things will go -- and it keeps executing the instructions (in the pipe). By the time the actual branch gets decided, the processor has "pre-completed" many other instructions. If the processor (branch prediction) guessed correctly, then the instructions get finished (completed) and there was no stall. If it guessed wrong, then there is a stall, it clears its buffer of pre-finished instructions, and is no worse off than if it hadn't tried to guess in the first place.
At first, hinting (branch prediction) was done by the compiler -- it would look at the code, and figure out what was the most often executed path, and then set a bit in the instruction to tell the processor which way to guess. Later, some designers decided to make the processor smarter about guessing -- it would assume that whichever way a branch was taken before would be the way that it would go the next time (this helped all subsequent passes through a conditional). And finally, the hardware evolved even more -- to where there was just enough space and logic that some processors would just execute the next few instructions on both sides of a branch, and whichever path that became the proper one would be used, and the wrong path would be thrown away.
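The "assume it goes the same way as last time" predictor is tiny when sketched in code. The branch histories below are invented, and the function name and initial guess are assumptions for the sketch.

```python
def last_outcome_accuracy(outcomes, initial_guess=True):
    """Predict each branch goes the way it went last time; return fraction correct."""
    prediction = initial_guess
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # remember what actually happened
    return correct / len(outcomes)

# A loop branch: taken 9 times, then falls through when the loop ends.
loop_branch = [True] * 9 + [False]
print(last_outcome_accuracy(loop_branch))  # 0.9

# A branch that flips every time is the worst case for this scheme.
alternating = [True, False] * 5
print(last_outcome_accuracy(alternating))  # 0.1
```

Loops are extremely common in real code, which is why even a guess this simple pays off.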
In some later designs there is something called predication which is sort of a sophisticated form of branch-prediction or branch hinting. Basically it makes it easier to do both sides of a path. I consider that more a post-RISC design, and will get into what it is in an article on post-RISC.
Code Creep / Code-bloat
Most of the balances of RISC were positive. You simplified the instruction set, saved space, then tried to use that space more wisely. It sounds like all wins -- how could you go wrong? Well everything in engineering is about tradeoffs. The one big tradeoff for RISC is code-creep.
CISC instructions are variable in size, and work with different sized data. This makes decode logic and prefetch logic very tricky -- however, instructions don't take any more memory than they absolutely have to. RISC instructions are all the same size -- and everything is aligned -- but not all data sizes and instructions need the same amount of data (or instruction complexity). So in some of the simpler RISC instructions you have to pad (or waste space) -- or you may load more data than you need (load 32 bits, even when you may only be looking at the lower few). So the data and instructions aren't as packed (efficient).
RISC often took more instructions to get the same work done -- the instructions were simpler, so you may need more of them. The more complex instructions of CISC could often get things done in fewer instructions. Of course this goes both ways, and RISC often worked on more data at once, didn't destroy registers (and require reloads) and so on -- but it generally comes out in CISC's favor. This "inefficiency" of not being packed as tightly, and not getting as much done in each instruction (which meant more instructions), caused code creep. How much code grew for RISC chips over CISC chips varied in degree and implementation -- 10-20% larger was not uncommon, and in some extreme cases it could be more like 30%.
RAM is constantly becoming cheaper over time, so the issue is not the memory costs. What you have to remember is that processors are much faster than memory. So loading memory takes time -- so code bloat means that you are getting a performance penalty as well as a size penalty. And remember, you had to complete more instructions to get the same work done. So far, the tradeoffs have worked out in RISC's favor. It turns out that the improvements in cache (and the fact that 98% of the time code is actually running out of cache) more than make up for the performance loss of code creep -- and the ease of creating superscalar designs (more units) meant that it was easier for a RISC to be doing more at the same time than CISC. So RISC has taken over (for now) -- and almost every new design in the last 15 years has been RISC -- but there are tradeoffs.
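A quick back-of-the-envelope sketch shows why running out of cache 98% of the time matters so much. The 98% figure is from the text above; the 1-cycle cache and 50-cycle memory costs are hypothetical numbers picked only to make the arithmetic visible, and the function name is made up.

```python
def average_access_cycles(hit_percent, cache_cycles, memory_cycles):
    """Average cost of one access, weighting cache hits against trips to memory."""
    miss_percent = 100 - hit_percent
    return (hit_percent * cache_cycles + miss_percent * memory_cycles) / 100

# Hypothetical costs: 1 cycle from cache, 50 cycles from memory.
print(average_access_cycles(98, cache_cycles=1, memory_cycles=50))  # 1.98
print(average_access_cycles(0, cache_cycles=1, memory_cycles=50))   # 50.0
```

With numbers like these, having to fetch somewhat more code costs far less than what the cache gives back.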
Memory continues to get faster -- but it does so at a slower rate than processors. So the difference in performance (between processor and memory) continues to grow. For now, cache advances, out-of-order execution and other tricks are keeping this problem at bay -- and RISC is still the way to go. But this may not go on forever. Some companies are working on "packed" RISC instructions -- which means instructions and data that are compressed to reduce the memory overhead (and the performance bottleneck of pulling them in and out of RAM). This sounds suspiciously like a RISC-CISC hybrid... that doesn't mean CISC will return (most of the design concepts and philosophy of RISC are here to stay) but it may mean a partial return, or a further blurring of the lines.
So RISC is many things. RISC is a design philosophy that started as a few improvements to the ISA (Instruction Set Architecture) -- like factoring the instructions down (like LOAD/STORE instead of MOV), and RISC just exploded from there. There are many techniques for improving processor performance by simplifying the instruction set design, and then adding complexity (and logic) back in other ways to make a processor go faster. RISC was about the change from just adding more instructions, to adding better instructions, making a simpler (architectural) chip, and then using your space more wisely.
The RISC philosophy started very much with the attitude of pushing more and more smarts into the compiler. Let the compiler do it all. But that purist (and most extreme) RISC philosophy didn't last long or hold up in the real world (some designs just went too far). Designers pulled back, and started allowing the philosophy to evolve -- and they allowed the hardware to do more. They added reorder logic, branch prediction, and smarter cache snooping. They started allowing compound instructions, as long as they weren't too complex and could easily justify their existence (with frequency of use and significant performance increases). RISC started as all fixed size and perfectly aligned data -- but working with legacy data was near impossible (since it wasn't perfectly aligned) -- so they added in support for misaligned data. Many techniques for adding speed got added in -- and some others were dropped or changed. But the one thing that remained constant was the philosophy of trading off gates in the instruction set for gates in other areas to make the processor faster overall, easier to design, and able to use more "modern" techniques.
The rules of physics have not changed. The way you implement the better chip is that you get the chip out on a more "modern" manufacturing process sooner (which means faster and lower power), and you don't waste space and time on dead (rarely used and complex) instructions. This means the RISC chip will produce less heat, take less space, cost less to design and manufacture, and so on. If you reduce the amount of space (and time) it takes to implement the instruction set (not the count) -- then you can spend that saved time, space, money and so on in other areas to make a better design overall. Engineering is about balancing these tradeoffs -- and RISC is about knowing what to leave out.