Both KNI and AltiVec are SIMD (Single Instruction, Multiple Data) implementations, also called (short) Vector Processors. They allow a single instruction to work on multiple pieces of data at once (instead of one at a time), so they can do 8 things (or sometimes as many as 32 things) at once. Each piece of data, or path through an instruction, is called a vector.
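To make the "one instruction, many pieces of data" idea concrete, here is a minimal Python sketch. It only models the behaviour; it is not real KNI or AltiVec code. The function names are made up for illustration, and the four-lane register (4 x 32 bit floats in 128 bits) is an assumption taken from the float format discussed later in this article.

```python
# Scalar model of one SIMD instruction: a 128-bit register is treated as
# four lanes, and a single "instruction" touches every lane at once.

def simd_add(a, b):
    """Lane-wise add: one call stands in for one SIMD instruction."""
    if len(a) != len(b):
        raise ValueError("both registers must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


def scalar_add(a, b):
    """The non-SIMD way: one add instruction per element.

    Returns (result, number_of_add_instructions).
    """
    result, instructions = [], 0
    for x, y in zip(a, b):
        result.append(x + y)
        instructions += 1
    return result, instructions


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

print(simd_add(a, b))    # [11.0, 22.0, 33.0, 44.0] from one "instruction"
print(scalar_add(a, b))  # ([11.0, 22.0, 33.0, 44.0], 4): four instructions
```

The results are identical; the only difference is how many instructions had to be issued to get them.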
If you still don't understand the basics of SIMD or Vector Processing (what MMX, KNI, and AltiVec are), then read What is AltiVec or the older MMX/VMX (from before the specifics of AltiVec were known). They can give you a pretty good understanding of SIMD and design decisions and tradeoffs.
Why MMX2 (KNI)?
Intel rushed MMX to market with many very poor design tradeoffs (1), and they had some inherent problems making it work with the x86/Pentium's 20+ year old design. So Intel had to make many compromises because of the legacy of the x86. Rushing MMX to market got more acceptance early on, but it allowed others to edge in with their own implementations in the x86 arena (like AMD's 3D-Now), which confuses and fragments the marketplace and means things have to be fixed later. Rushing also meant even more design compromises. KNI (Katmai's New Instructions), also known as MMX2, is just a way for Intel to retroactively fix and improve the bad design of the original MMX. But fitting in with the older MMX (and older x86 architecture) requires even more compromises -- so MMX2 can't quite live up to its full potential either (or at least not to that of a cleanly designed RISC based SIMD architecture like AltiVec). To make matters worse, KNI (in its first implementation) isn't clearly superior to 3D-Now -- and 3D-Now will have tens of millions of adopters before KNI ships, making the decision of which to support harder for ISVs.
(1) Remember, design tradeoffs are different from marketing tradeoffs. While design tradeoffs may make for an ugly design and some problems later (and make programmers' lives worse, and performance worse) -- the realities of the marketing tradeoffs may force companies to do bad design anyway. Intel was losing market share to other chipmakers, so they had to do something to make their chips "new" and incompatible (on some levels) as a way to differentiate and add value; otherwise they would have continued to bleed market share. So I'm not saying that Intel made the right or wrong decision from a business view -- just that from a design view it limited capabilities and was guaranteed to hurt them in the future, as we are seeing today. In engineering it is always, "you can pay me now, or pay me later" -- and it will almost always cost more later. Intel rushed to get something out, and so they pay for that hurried design later on.
So what are these issues that make MMX2 an inferior design to AltiVec for doing serious SIMD and signal processing?
The biggest problem with the x86 design (and MMX and MMX2) is that the x86 only supports 8 registers. So KNI can only support 8 x 128 bit registers (as compared to AltiVec's 32 x 128 bit registers). While 8 registers was fine for a 1970s computer, that is way too few for modern compilers and algorithms, which can use more registers to optimize and generate more speed. Algorithms that need more than 8 registers force the x86 to unload and reload registers more often, which wastes time -- time that the processor could be using to do other things. Each load/store is even more dangerous (time wise) because of things like cache stalls that can burn dozens of cycles (or more), and basically leave the CPU twiddling its anthropomorphized thumbs, waiting for memory. Since load/stores and being register starved are not things you can just work around, this can become a really big issue -- and there are many cases where algorithms will be seriously penalized by KNI's fewer registers as compared to AltiVec.
MMX2 instructions (KNI) use the same 2 source registers and 1 destination register format as the x86, but that destination is one of your sources -- this means that your original data is destroyed. Destroyed source-data means that you often have to either reload that data or make a copy of your original before you execute an instruction (both of which waste time). In many cases 3 of your 8 registers are already filled for any one thing that you want to do. AltiVec instructions have a 2 source, 1 filter/modifier and 1 destination register format (and the destination is separate from the sources). The 3 -> 1 format of AltiVec makes the instruction set more versatile and doesn't cause source-data destruction (and reloads). But Intel cannot use the 3 -> 1 format because they don't have enough registers, and because it is different from the way they encode their other instructions (because of the 20+ year old design).
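The cost of a destructive destination can be sketched with a toy register file. This is a simplified Python model, not real instruction encodings: the register names (`r0`, `v0`, and so on) and the helper functions are hypothetical, and only the "destination is also a source" behaviour is taken from the text above.

```python
def two_operand_mul(regs, dst, src):
    """Destructive form: dst = dst * src, so the old value of dst is lost."""
    regs[dst] = [x * y for x, y in zip(regs[dst], regs[src])]


def separate_dest_mul(regs, dst, a, b):
    """Non-destructive form: dst = a * b, and both sources survive."""
    regs[dst] = [x * y for x, y in zip(regs[a], regs[b])]


# Goal: compute the product of two registers while keeping both inputs
# around for a later stage of the algorithm.

kni = {"r0": [1, 2, 3, 4], "r1": [5, 6, 7, 8]}
kni["r2"] = list(kni["r0"])        # instruction 1: copy r0 so it is not destroyed
two_operand_mul(kni, "r2", "r1")   # instruction 2: r2 = r2 * r1
kni_instructions = 2

altivec = {"v0": [1, 2, 3, 4], "v1": [5, 6, 7, 8]}
separate_dest_mul(altivec, "v2", "v0", "v1")   # the only instruction needed
altivec_instructions = 1

print(kni["r2"], kni_instructions)          # [5, 12, 21, 32] 2
print(altivec["v2"], altivec_instructions)  # [5, 12, 21, 32] 1
```

Both versions end up holding three live registers, but the destructive form spends an extra instruction on the copy (or an extra load, if the original has to come back from memory).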
But it gets worse for KNI still -- many algorithms build on previous registers results (and have multiple stages). Programs care more about how many registers we still have free after doing something (not how many registers they used) -- after many simple one-pass algorithms AltiVec has 28 free registers as compared to only 5 registers left when using KNI. But most algorithms and signal processing aren't just single stage -- 3 or 4 stage algorithms are common (and many want to go much deeper). The second stage may require 2 - 3 more registers, and the 3rd stage requires 2 or 3 more... uh wait, KNI is out of registers, while AltiVec still has more registers available than KNI had available to begin with. This means that you can do far more complex algorithms, far quicker with AltiVec.
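The register arithmetic above can be checked with a few lines of Python. This is only back-of-the-envelope bookkeeping under stated assumptions: every stage keeps its results live, the first stage uses 3 registers on KNI and 4 on AltiVec (matching the 5 and 28 free registers quoted above), and each later stage needs 3 more (the upper end of the "2 or 3" mentioned). The function name is made up for illustration.

```python
def free_after_stages(total, first_stage, per_later_stage, stages):
    """Free registers remaining after each stage.

    A negative entry means the algorithm has run out of registers and
    would have to spill values to memory.
    """
    free, history = total, []
    for stage in range(stages):
        free -= first_stage if stage == 0 else per_later_stage
        history.append(free)
    return history


print(free_after_stages(8, 3, 3, 3))    # KNI:     [5, 2, -1]
print(free_after_stages(32, 4, 3, 3))   # AltiVec: [28, 25, 22]
```

Under these assumptions KNI runs out during the third stage, while AltiVec still has 22 registers free, which is more than the 8 KNI started with.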
For the record, there are some things that can help KNI hide some of its problems (like register renaming) -- but that is just a little perfume to cover the stench; it can't get rid of the odor. The reality is that KNI will stall more often, can't do things that are as complex, can't do as many things at once, and will be slower.
Efficiency of Instructions
Performance is not only achieved by how many registers you have -- but by how efficient the instruction set is, and how well it is tailored toward your goals. SIMD is being used a lot for Signal Processing. Signal Processing is basically taking a stream of data (also called a signal) as it is coming through, modifying that data (processing it), then outputting it. The faster you can do this, and the more you can do, the better Signal Processor you have. A PowerPC with AltiVec is the best Signal Processor on the planet (according to Motorola). Intel does not make that claim for KNI because there are many little gotchas that make the MMX2 instructions far inferior to AltiVec. Let's look at a few.
"Multiply and Add" are two operations that are very often done together in signal processing. All sorts of algorithms use them. In AltiVec "Multiply and Add" is a single instruction -- you can multiply one thing by another, and add the result to another register (or the original) -- one instruction, one cycle, done. KNI doesn't have a multiply and add (partly because it couldn't do the 3 -> 1 instruction encoding), so with KNI you do a multiply and then another instruction to do the add (another cycle). This two step process may require an extra register (which the processor is already starved for), and it may force another load/store (and remember, those are dangerous and can cause a stall). So in one of the more common things in signal processing, KNI is already half as fast as AltiVec.
But even that isn't the whole story. Katmai (the first Intel processor that implements KNI) can't multiply all 4 vectors (an entire register) at once (nor add them) -- it can only do 2 vectors (half the register) at a time. Then there is the large latency of KNI's multiplies (5 cycles), and the long latency of its adds as well (3 cycles). So it takes many cycles to do the multiply and more to do the adds, it has to make two passes for both, and it has fewer registers with which to unroll loops (which would make things more efficient and hide some of the other instruction set deficiencies). Add the lack of complex reordering capabilities and other features, and KNI ends up far slower than AltiVec over the whole algorithm -- by something on the order of 8 times (or more).
"Multiply and Add" is not just some rare case of an AltiVec instruction being better designed; this seems to be the norm. Look for things like estimations of Log, conditional moves, and other instructions: AltiVec has them but KNI does not. The instructions that they both do have are often implemented better in AltiVec (it is just more versatile). There are about 127 New Instructions for MMX and MMX2 (KNI) combined (some may overlap in functionality) -- while there are 162 New Instructions for AltiVec -- and while more isn't always better, it usually is if you have equally good designs -- and in this case the AltiVec design is even better. Many of Katmai's New Instructions may be wasted just to fix other problems (2). Then to compound AltiVec's advantages even more, the 3 -> 1 instruction format of AltiVec allows many more variants of instructions (and more usefulness) than the 2 -> 1 instruction format of KNI and MMX.
(2) Some instructions are sort of "wasted" (or at least not very important) because KNI is using them to get data back and forth between the old MMX registers (or x86 registers) and the new KNI registers -- so those instructions aren't doing useful work on algorithms, just helping you to work around the poorer design. More than that, some of KNI's instructions are spent on just doing regular Floating Point Math on a single word at a time. The 80x87 (math unit) used a brain-dead stack-processing architecture that no programmer on earth actually likes (and that can reduce performance in many different ways) -- it is not only awkward but different from the way the rest of the x86 architecture works. To get around this, KNI can now deal with scalar floating point registers more like all other intelligently designed processors. So a portion of KNI's new instructions (and die space) is there just to fix old bad instructions.
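The "Multiply and Add" comparison discussed above can be sketched in Python. This is a behavioural model only: the function names are invented for illustration, the instruction counts are simple tallies of issued operations (not cycle counts), and the model does not attempt to capture Katmai's half-register passes or the latencies quoted in the text.

```python
def fused_multiply_add(a, b, c):
    """Single-instruction model: returns a * b + c for every lane."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def multiply_then_add(a, b, c, keep_a=True):
    """Two-step model with a destructive destination register.

    Returns (result, instructions_issued). When keep_a is True an extra
    register copy is counted so that the original a is not overwritten.
    """
    instructions = 0
    work = a
    if keep_a:
        work = list(a)                           # copy so a survives
        instructions += 1
    work = [x * y for x, y in zip(work, b)]      # work = work * b
    instructions += 1
    work = [x + z for x, z in zip(work, c)]      # work = work + c
    instructions += 1
    return work, instructions


samples = [1.0, 2.0, 3.0, 4.0]
coefficients = [0.5, 0.5, 0.5, 0.5]
accumulator = [10.0, 10.0, 10.0, 10.0]

print(fused_multiply_add(samples, coefficients, accumulator))
# [10.5, 11.0, 11.5, 12.0] from one instruction
print(multiply_then_add(samples, coefficients, accumulator))
# ([10.5, 11.0, 11.5, 12.0], 3): copy, multiply, add
```

With `keep_a=False` the tally drops to 2, which is the "half as fast" case; the third instruction appears whenever the source data is still needed afterwards.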
A very important part of signal processing is being able to reorder (move and copy) data quickly. AltiVec has an incredibly well designed instruction for this called "Permute". It can take any value (of differing data sizes) from 2 source registers, and move and duplicate them to a destination register in any order I want. This is incredibly versatile for many algorithms. Remember, this reordering and filtering is pretty common and important in many types of signal processing, and for regular computing as well. This one instruction can be 30 times as efficient as the old PowerPC way of doing things (which in general was more efficient than the x86 way of doing things). Since the control vector for Permute can also be data (a table index), Permute can be a great way to accelerate table lookups as well (doing up to 16 at once). This is another thing that is allowed because of the 3 -> 1 register format of AltiVec. These types of table lookups can be useful for a variety of things including nonlinear DSP algorithms. I could give Motorola/IBM a big sloppy kiss for adding this one instruction alone, as I can think of a dozen ways I can use it. This one instruction is probably the single most significant addition to the PowerPC instruction set since its inception, and a great contribution to Microprocessor design in general.
Intel on the other hand has a few different ways to rearrange data with KNI, but they are nowhere near as versatile as Permute. Many of the ways to "Swizzle" data are fixed in size and ordering, and they too will destroy some of your source data -- which requires an extra instruction if you want to keep your source register intact (as you often do). So while KNI may have some data reordering improvements over x86, it doesn't seem to be able to hold a candle to AltiVec's Permute instruction.
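A byte-level model shows why Permute is so flexible. The Python sketch below follows my understanding of the instruction: the two 16-byte source registers are treated as one 32-byte pool, and each byte of the control register picks one byte from that pool using its low 5 bits. The function name `vperm` and the sample data are illustrative assumptions, not code from any AltiVec toolchain.

```python
def vperm(a, b, control):
    """Model of a byte permute over two 16-byte source registers.

    Each control byte selects one byte from the 32-byte pool a + b;
    only the low 5 bits of the control byte are used as the index.
    """
    if not (len(a) == len(b) == len(control) == 16):
        raise ValueError("each operand models a 16-byte register")
    pool = list(a) + list(b)
    return [pool[c & 0x1F] for c in control]


# 1) Arbitrary reordering: reverse the 16 bytes of a in one step.
a = list(range(16))
b = list(range(100, 116))
reverse = list(range(15, -1, -1))
print(vperm(a, b, reverse))   # [15, 14, 13, ..., 1, 0]

# 2) Table lookup: the two sources hold a 32-entry table, and the
#    control register holds 16 indices that are all looked up at once.
table = [(i * 7) % 256 for i in range(32)]
indices = [0, 31, 5, 16, 2, 2, 30, 1, 9, 8, 7, 6, 15, 17, 20, 3]
looked_up = vperm(table[:16], table[16:], indices)
print(looked_up == [table[i] for i in indices])   # True
```

Because the control register is ordinary data, the same instruction covers byte swaps, broadcasts, merges of two registers, and small table lookups without any special-purpose variants.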
It is easy for me to create algorithms and scenarios where what would take 2 cycles with AltiVec would take more like 4 - 16 cycles (or more) with KNI, or might not be doable in KNI at all and you'd have to do them using MMX (which would be slower still since MMX can only work on half the data that KNI can at one time). I expect that almost no instruction set designed after this point in time will be designed without having a Permute Instruction, or something like it (except KNI of course).
Does not play well with others (Ints and Floats)
One of the problems with Intel's design philosophy (slap something in, fix it later, make it compatible with everything that came before) is that it comes back to haunt you. MMX and MMX2 are a perfect example. MMX was made compatible with processors without MMX -- to do this, Intel couldn't add new registers (that would have been useful) -- instead they "shared" registers with the floating point unit. This means that every time programmers change between floating point "mode" and MMX "mode" there is a performance "hit" (the computer stalls while saving the state of the mode). Many times programmers want to go back and forth quickly -- but with MMX they are out of luck (later implementations of MMX have reduced this hit). This "mode" thing sort of makes Floating Point and MMX exclusive (they don't share data well). That isn't so commonly needed, but it is still a little ugly. MMX2 (KNI) makes it worse by bolting on yet another set of instructions. These instructions don't replace the older MMX; a few New Instructions replace or improve the older ones -- but KNI's 4 x Floats are pretty isolated (and different) from MMX's integers. KNI has new registers (128 bits), while MMX uses the old registers (64 bits). MMX and MMX2 are almost as incompatible with each other as MMX and Floating Point are. What if you want to mix data sizes or instructions? More work. All this can make some algorithms slower, and uglier. All these problems can be worked around -- but why should you have to? It certainly makes code look uglier, and harder to write or maintain.
AltiVec has a single set of registers for both types (Ints and Floats) -- so with AltiVec you can work with Ints and Floats at the same time (on the same data). AltiVec handles more data types, and more data at once. KNI can use 128 bits at a time for Floating Point, as 4 x 32 bit floats (just like AltiVec). But KNI can't work with Ints (only MMX can), and MMX can only deal with 64 bits at a time (8 bytes, 4 words, 2 long words or 1 quadword) -- while AltiVec can work with 128 bits at a time for either floats or ints, or twice as much as MMX. Not only that, AltiVec can handle some special types like a 1/5/5/5 x 4 data type that is designed for dealing with 16 bit graphics (thousands of colors on a monitor). AltiVec also handles Fixed Point math (and conversions between floats and fixed), which can be good for games, graphics, and other things.
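For readers who have not met the 1/5/5/5 pixel format, here is a small Python sketch of what it packs into 16 bits. The field order used here (1 bit on top, then three 5-bit color channels, which I label alpha, red, green and blue) is an assumption for illustration, and the helper names are made up; a real AltiVec unpack instruction does this for several pixels in one step.

```python
def unpack_1555(pixel):
    """Split a 16-bit 1/5/5/5 pixel into its four fields.

    Assumed layout, from the top bit down: 1-bit alpha, 5-bit red,
    5-bit green, 5-bit blue.
    """
    if not 0 <= pixel <= 0xFFFF:
        raise ValueError("pixel must fit in 16 bits")
    alpha = (pixel >> 15) & 0x1
    red = (pixel >> 10) & 0x1F
    green = (pixel >> 5) & 0x1F
    blue = pixel & 0x1F
    return alpha, red, green, blue


def pack_1555(alpha, red, green, blue):
    """Inverse of unpack_1555 for in-range field values."""
    return (alpha << 15) | (red << 10) | (green << 5) | blue


def unpack_pixels(pixels):
    """Apply unpack_1555 to a whole list of pixels (one 'register' worth)."""
    return [unpack_1555(p) for p in pixels]


magenta = pack_1555(1, 31, 0, 31)
print(hex(magenta))            # 0xfc1f
print(unpack_1555(magenta))    # (1, 31, 0, 31)
print(unpack_pixels([0x7FFF, 0x0000]))   # [(0, 31, 31, 31), (0, 0, 0, 0)]
```

Three 5-bit channels give 32 x 32 x 32 = 32,768 colors, which is where the "thousands of colors" setting comes from.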
So Integers are at least twice as fast with AltiVec, there are more data types (meaning AltiVec is faster or can be used for more algorithms), and you can mix and match data types together better.
The AltiVec based PowerPC's can also handle a 2 Meg L2 cache, and interposers (processor + cache cards) are easy to design and make (and sub-manufacture). Katmai will likely only handle 1 Meg L2 cache. For smaller streams this won't matter that much, but for some larger things (like emulation, video processing, image processing) it is more likely to make a difference. Furthermore some AltiVec designs allow for 128 bit access to L2 cache (and main memory), while Katmai is limited to 64 bits -- so accessing L2 cache should be faster on AltiVec (wider is better).
One of the tougher things about SIMD designs (and all future processor designs) is keeping the cache fed with valuable data -- since misses cause serious performance hits. Newer instruction sets (and instruction set add-ons) are including "hinting" and tips for preloading the cache before you use it (so you don't stall waiting for it to load). As is normal for the two designs, AltiVec is far more aggressive and clean in how it "hints" the cache. While you can preload one cache line for KNI, with AltiVec you can have 4 independent streams (threads) used for hinting and prefetching. With KNI you have to prefetch each line of cache -- but the AltiVec streams allow you to tell the cache where to load (and how much) and then you leave it alone, and don't have to waste more instruction bandwidth on prefetching. Better use of cache means more performance.
But not only is AltiVec better in what it loads, it is better in what it does not load. Sometimes you are working on data that you will never have to see again. Bringing it into the cache, processing it, and then flushing it immediately can actually cause a performance hit (due to cache design issues). AltiVec also has the ability to load transient (temporary) data (data that will clear itself out once it is used). This transient data goes around the cache, and so keeps you from filling your cache with useless data that just needs to be purged. Again, AltiVec makes better use of cache, which means more performance.
AltiVec (PowerPC's) also have all the cache line information tags on chip -- so they can peek ahead and work a little faster (by not having to go to cache as often, or by knowing when something is going to miss without having to ask the cache first). So not only is there more cache with AltiVec, there is more hinting and more control, you don't have to ask for everything explicitly, the hinting will likely keep the cache more available for what you need it for (so you get more hits), and the cache design means that you get smaller miss penalties.
For embedded systems Intel and KNI aren't even in the game. The AltiVec powered G3 (or G4) is likely to be in the 10 W range (max), while that power-sucking heat-generating monster Katmai is more likely to be putting out 30-40 W (easily generating 3 times as much heat, or more). That drives up costs, can decrease reliability, you have to heat and cool those things, often have to add fans to your designs, and so on. Then there is the space required -- physically a G3 and Cache is 1 1/2" by 2", while there is probably 8 times the volume (or more) required for the Pentiums (with heat sinks) -- more if you add in the larger power supply. Of course KNI should cost more as well, since the G4 is a much less expensive chip to make (smaller = less expensive).
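The difference between per-line prefetching and a described stream can be sketched as an instruction tally. This Python model is hypothetical: the 32-byte cache line, the parameter names (start address, block size, block count, stride) and the helper function are all assumptions made for illustration, not the encoding of any real prefetch instruction.

```python
LINE_BYTES = 32  # assumed cache line size, for illustration only


def lines_touched(start, block_bytes, block_count, stride):
    """Cache-line addresses covered by one strided stream description.

    The stream is block_count blocks of block_bytes each, with
    consecutive blocks starting stride bytes apart.
    """
    lines = []
    for i in range(block_count):
        base = start + i * stride
        end = base + block_bytes - 1
        first_line = base - base % LINE_BYTES
        last_line = end - end % LINE_BYTES
        for address in range(first_line, last_line + 1, LINE_BYTES):
            if address not in lines:
                lines.append(address)
    return lines


# Example: four 64-byte rows of an image, with rows 256 bytes apart.
needed = lines_touched(start=0, block_bytes=64, block_count=4, stride=256)

per_line_prefetches = len(needed)   # one prefetch instruction per cache line
stream_setups = 1                   # one instruction describes the whole stream

print(needed)                # [0, 32, 256, 288, 512, 544, 768, 800]
print(per_line_prefetches)   # 8
print(stream_setups)         # 1
```

The gap widens with the size of the data set: the per-line count grows with every extra row, while the stream description stays a single instruction.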
Katmai will likely only handle 1 Meg L2 cache. Intel will likely allow larger L2 caches in subsequent designs (like for Xeon variants of KNI) but those are at least 6 months to a year behind Katmai, and those are very expensive processors (5 to 10 times the cost) that require more expensive support chips, even bigger power supplies and so on. Intel has also been trying, and succeeding, at closing off their system designs to outside designers -- you can buy motherboards or at least processors and chipsets from Intel, but Intel is getting a pretty big lock on support chips for their processors because of their proprietary connector designs (Slot-1, Slit-1, Slot-2, Slot-M). And there is not even one form factor for how to connect the processor to your system -- for now there is Slot-1 (and Slit-1 for Celeron variants), but Intel is talking about Slot-2 and Slot-M, so there is serious concern in form factor and future changes. You can't just design your own interposer with Intel chips (or it is far harder to do) and Intel is likely to require licensing fees (though I have not yet heard of any licenses being granted). In fact, Intel is being legally aggressive at defending their Slot-X designs -- so if Intel doesn't offer a flavor you want, then you can't get it (nor do it yourself).
The AltiVec based PowerPC's can also handle a 2 Meg L2 cache, and the interposers (processor + cache cards) are easy to design and make (and sub-manufacture). These are basically open designs, and IBM and Motorola make their designs available to their customers. Both are encouraging embedded use of their processors. Motorola and IBM are dual sources for processors and offer different types of processor variants and support chips. It is also easier to deal with either Motorola or IBM to have custom variants of the processors, embedded controllers, or support chips made.
For embedded systems it isn't even close. Katmai isn't being used as an embedded controller because it doesn't make a good one. Intel will not be able to bring KNI into the embedded market (and so far they aren't even trying), while Motorola already has many companies lined up and excited about AltiVec as the world's fastest DSP.
Remember, all the things that make Katmai and Pentium IIs worse for the embedded controller market also make them the worse choice for low end computers, and are going to be barriers to bringing the price down in those areas as well.
In a normal home computer, only a small percentage of the things done can be accelerated by SIMD. But those things will be accelerated by so much that it makes a real overall system difference. Imagine opening 8 QuickTime films at once, instead of 2 that bog your machine down, and still having more of the processor free to do other things. Imagine 3D games getting to be as good without specialized dedicated Video Cards as they used to be with them. Imagine networking requiring far less of the processor's attention, emulators running faster, and being able to do more things at once. Speech recognition that can afford to be better. Filters and video processing running much faster. This is going to make a real difference in processor performance even if it only applies to a few things at a time.
There are certainly a lot of areas I did not go into. And for many things KNI is better than nothing, and better than MMX, and MMX is better than not having MMX. But for everything you want to do SIMD for, AltiVec is a far faster design and implementation than KNI with MMX. In fact for many things it looks like AltiVec will be 8 times faster, with a few things being about the same speed and a few things going dozens of times faster on AltiVec. Of course AltiVec makes the machine faster still compared to machines that don't have SIMD at all.
Whatever the percentage of things that can be accelerated with KNI, that percentage is larger for AltiVec. Whatever amount of acceleration you can get out of KNI, it looks from the designs that AltiVec will be far faster. If you think AltiVec is going to be fast in a generalized computer, you can't imagine this thing in a nice specialized embedded application. And all this extra computing power available is going to open up whole new applications and uses for computers -- imagine high quality video conferencing, enabled because SIMD allows for better Video compression algorithms and lower overhead processing the networking streams. There are whole new markets to be opened because of AltiVec.
But not only is AltiVec far faster, and applicable to more things, it is far easier to program and develop for. Intel dumped MMX and KNI on the market without any tools and with an uglier design, so programmers are forced to slog through assembly by hand -- which is slow and complex. Motorola made a C-like syntax for programming AltiVec that is far easier to develop with, on a cleaner design to begin with. They created compilers, emulators to test programs with, and profiling tools; they are being much more open with the design as well as the tools; Motorola is even giving away great source code libraries (to give developers a head start); AltiVec has longer lead times for developers (to let them adopt it); and it is more likely to get OS support sooner. The end result is that it costs far less to develop for AltiVec than for KNI, and you are far more likely to see a larger return on your investment. Plus the poor Intel guys are forced to choose between supporting MMX, MMX2 (KNI), 3D-Now, or trying to support all of them (with all of them supporting slightly different features). So for mainstream PCs you have three dissimilar and somewhat incompatible standards, all with less performance return on investment than if you just use AltiVec. All these things mean that there will be more adoption of AltiVec -- which means that not only will AltiVec be faster, but it will be used to speed up more things (making applied performance faster still).
So in summary, for mainstream applications AltiVec should be more prominently used in the OS and Applications, for more things, and offer a better return on development dollars, and have far higher performance. In the embedded world the advantages are even greater. Intel did not have the on-chip real-estate they needed to do a good SIMD design, while the cleaner RISC architecture of the PowerPC gave Motorola and IBM the space they needed to implement SIMD right. While KNI is a nice hack on top of a 20 year old instruction set, it is still a register starved implementation that can't hold a candle to a well designed SIMD implementation (like AltiVec) with a clean instruction set in a RISC chip (like the PowerPC).
Special thanks to: Keith Diefendorff, editor in chief of Microprocessor Report.
His article for MPR on KNI, and his answers to my questions, aided me in the research of this article.