How does MMX/VMX work?
How to inject FUD into your computer...

By: David K. Every
©Copyright 1999

Intel has been creating advertisements with dancing manufacturers in Day-Glo clean suits -- implying that MMX will give you rhythm, dance moves, and blazing performance. The truth (for Intel) is far less impressive than the hype. MMX did briefly decrease the performance lead that the PowerPC's have on the Pentiums. However, the PowerPC's are now running so much faster (in both MHz and performance per MHz) that the lead is once again stretching out in favor of the PPC's. Whew, I was worried too -- after all, this year's Intel processors had almost caught up with last year's PPC's. Now the PowerPC's are not only faster, but they are also about to add their own version of MMX, called VMX, that will take them much further ahead in the future.

I know that some people are going to claim that AIM and the PowerPC's are copying Intel. However, the irony is that IBM was one of the original researchers in this field (MMX-like functions). In fact, IBM had a VMX-enabled version of the POWER Architecture (which is the chip set that was used as a design foundation for the PowerPC) long before Intel started on MMX. It was likely IBM's papers and chips that inspired Intel to do MMX, but I imagine that will be lost in the marketing wars (the truth seems to be the first casualty).

I expect that VMX will far outperform MMX, for various reasons that I will get into later. But first let me explain what MMX really is, and is not. It is not piping 70's DISCO music into the fabrication facilities for Intel's chips -- which is what we've been led to believe. It is a few instructions that improve a small set of functions by a large amount (two to eight TIMES faster). But to understand how these instructions work, and why, I must first review how traditional processors (instructions) work.

Traditional Instructions

A computer executes a sequence of very simple instructions, very quickly. To the computer these instructions are represented as nothing but patterns of "on" and "off" (1's and 0's) -- known as binary numbers. These bits are grouped together into larger units to make an instruction (usually groups of 32 bits, on modern processors). These bit patterns are either data or instructions (or both). (Data is just a value -- a number, character, pattern, etc.)

The larger the grouping of bits, the more unique possibilities the grouping can represent. A group of 8 bits is called a byte, and can have a value from 0 to 255; a group of 16 bits, called a word, can hold a value from -32768 to 32767 (or 0 to 65535 if it is treated as unsigned); 32 bits is called a long word or double (two words); and a group of 64 bits, called a quad (word), often represents a real number (not whole) and can be very very large (basically it is large enough and accurate enough to be used by most scientists). Quad words are often "floating point" values, because a few bits are used to represent where in the number the decimal point is -- the decimal point "floats". This allows a much wider range of values (with more accuracy).
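The ranges quoted above follow directly from the number of bits. As a small sketch (Python is used purely for illustration, and the function names are made up for this example), the byte range is the unsigned one and the word range is the signed, two's-complement one:

```python
def unsigned_range(bits):
    """Smallest and largest value an unsigned group of `bits` bits can hold."""
    return 0, (1 << bits) - 1


def signed_range(bits):
    """Range of a two's-complement signed group of `bits` bits."""
    half = 1 << (bits - 1)
    return -half, half - 1


for name, bits in [("byte", 8), ("word", 16), ("long word", 32)]:
    print(name, unsigned_range(bits), signed_range(bits))
```

Running it prints 0 to 255 as the unsigned range of a byte and -32768 to 32767 as the signed range of a word, matching the figures in the text.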

Careful with these terms. A "Byte" is always 8 bits, but just to be confusing, a "word" did not ALWAYS mean 16 bits. "Word" is commonly used the way I described (as 16 bits) -- but the term originally had to do with the size of the computer (not an absolute bit count). An 8 bit computer had an 8 bit word, and a 32 bit computer had a 32 bit word. But so many computers were 16 bit, for so long, that almost everyone uses "word" as 16 bits -- even on 8, 32 or 64 bit computers.

Computers hold the values they are working with in very-temporary, super-fast memory locations called registers. (The PowerPC's also have far more registers than the Pentiums, which helps with performance.) The computer's size is named for the size of its registers. The more data the computer (processor) can work with at one time, the faster it is. So the modern 32 bit CPU's (processors) are faster than the older 8 bit ones. However, there are always tradeoffs. For this article I will discuss 32 bit machines. Just for fun, 32 bit computers often work with 64 bits of data in their floating point registers (a separate part of the processor) but are still called 32 bit computers (go figure).

If I am working with data that is the size of the machine, there is no problem -- I just load in all 32 bits of the data. However, if I need to work with data that only needs to be 8 bits wide (on a 32 bit computer) then I (or the computer) have to ignore 24 bits. See the following figure --

| Unused Space (24 bits) | Used Data (8 bits) |
<----- ALL 32 BITS (what the CPU works with) ------>

FIGURE - A

Notice that 75% of the potential of the machine is not being used. The computer loads (accesses) 32 bits at a time (the processor's size), but then only uses 8 bits (which is the data's size). So in some cases a 32 bit computer really just works like an 8 bit computer, if that is the size of the data; or it works like a 16 bit computer, if the size of the data is 16 bits.

Using a lot of 8 bit values in a 32 bit computer (this way) is inefficient and would leave lots of leftover space. Memory is just a sequence of all these instructions and data. If you had a stream of 8 bit data, in a 32 bit computer, memory would look like the following figure --

| Unused Space | Used Data 1 |
| Unused Space | Used Data 2 |

FIGURE - B

So programmers code around this issue and instead they often pack the data in like this --

| Data 1 | Data 2 | Data 3 | Data 4 |
| Data 5 | Data 6 | Data 7 | Data 8 |

FIGURE - C
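The packing in Figure C can be sketched in a few lines. This is only an illustration in Python; the helper name `pack_bytes` is hypothetical, and putting the first value in the highest byte is an assumption made for the example, not something the figure specifies.

```python
def pack_bytes(b0, b1, b2, b3):
    """Pack four 8-bit values into one 32-bit long word, b0 in the highest byte."""
    for b in (b0, b1, b2, b3):
        if not 0 <= b <= 0xFF:
            raise ValueError("each value must fit in 8 bits")
    # Shift each byte into its own 8-bit slot and combine the slots.
    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3


packed = pack_bytes(1, 2, 3, 4)
print(hex(packed))  # 0x1020304
```

Four values now occupy one long word instead of four, which is the space saving the figure is showing.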

Well that's great for space efficiency. Now we have the data as a series of bytes, even in a 32 bit computer. But now we have a new issue (engineering is often about tradeoffs).

When we want to access the data (one byte) out of the 32 bit chunk of data (the long word), we have to make sure to mask off (ignore) the other data when we work with it. In other words, we have to be careful not to harm the data that is next to (or around) our data, because that data may only be unused for THIS calculation.

If I want to work with all of those values at once, then there is not much of a problem. Imagine each byte in our long-word represents a character in the word "TEST". If I want to move "TEST" around (as in copying and pasting in a word processing document), it's no big deal -- I just grab the whole long-word. So 32 bit processors are usually faster than 8 bit ones -- it is only when you want to work with the letter "S" alone (one part of a packed structure) that we have to worry about this "masking" stuff. Sometimes the hardware (the processor itself) can do the masking, but that only makes the performance hit smaller, and does not eliminate it.
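To make the "masking" concrete, here is a minimal sketch using the "TEST" example. It is written in Python for illustration only; `get_byte` and `set_byte` are hypothetical helper names, and numbering the bytes from the left is an assumption of this example.

```python
# The four characters of "TEST" packed into one 32-bit long word.
word = int.from_bytes(b"TEST", "big")


def get_byte(word, index):
    """Return byte `index` (0 = leftmost) of a 32-bit long word."""
    shift = 8 * (3 - index)
    return (word >> shift) & 0xFF          # shift it down, mask off the rest


def set_byte(word, index, value):
    """Return `word` with byte `index` replaced, leaving its neighbours untouched."""
    shift = 8 * (3 - index)
    cleared = word & ~(0xFF << shift) & 0xFFFFFFFF   # punch a hole in the word
    return cleared | ((value & 0xFF) << shift)       # drop the new byte in


print(chr(get_byte(word, 2)))                 # S
new_word = set_byte(word, 2, ord("X"))
print(new_word.to_bytes(4, "big"))            # b'TEXT'
```

Moving the whole word is a single operation, while touching only the "S" takes a shift and a mask to read it, and a clear, a shift and a merge to write it back. Those are the extra steps that cost time.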

So different instructions have to know where their data is in a long word, and also not to harm the data around it. They mask out the data they need, do what they want with it, and carefully place it back. All this gobbledygook can be a royal pain in the butt, and the extra steps make things slower. Often programmers just don't pack the data -- wasting a little space is not as important as speed in some cases. But sometimes spacing the data out is not possible. In life, there are always exceptions. Graphics, video, sound and network streams are all continuous packed streams of data. In those cases, not only do we want to work on bytes or words inside of longer words, but we often want to do complex math on the streams in real time (or near real-time).

Real-time means as fast as the data is coming in. Usually computers can work with data as fast as they can -- if it takes 2 seconds, then so be it. But if you start falling behind on a stream of data, then you get further and further behind until you run out of memory and have to start forgetting things. It's like the "in-box" at your job. If you can't keep up, then you are screwed.

Processors (or functions) that are designed to work with streams of data (like bytes) and alter them in real time (very quickly) are called either DSP (Digital Signal Processors) or NSP (Native Signal Processors). The PowerPC was far superior to Pentiums at NSP/DSP functions -- this is why the Pentium with MMX just catches up to the PowerPC without it.
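The "in-box" problem with real-time streams is easy to put into numbers. The sketch below is a toy model in Python, with made-up rates and a hypothetical function name; it only shows how a backlog grows when data arrives faster than it can be processed.

```python
def backlog_over_time(arrival_rate, service_rate, seconds):
    """Items left waiting after each second, given arrival and processing rates
    in items per second."""
    backlog = 0
    history = []
    for _ in range(seconds):
        # New data arrives, some gets processed; the backlog can't go below zero.
        backlog = max(0, backlog + arrival_rate - service_rate)
        history.append(backlog)
    return history


print(backlog_over_time(arrival_rate=100, service_rate=90, seconds=5))
# [10, 20, 30, 40, 50]
print(backlog_over_time(arrival_rate=100, service_rate=120, seconds=5))
# [0, 0, 0, 0, 0]
```

Being only 10% too slow is enough for the backlog to grow without limit, which is why stream processing has to keep up rather than merely be "fast on average".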

MMX

So here we are. Normal instructions do not work very well with small data inside of larger data. In fact I simplified the problem a bit -- to actually do something, like adding one (+1) to each byte in the long-word, is more tricky than you might think.

Adding 1 to the whole long word would only really add 1 to the lowest byte, and all the bytes in the structure are independent. So that doesn't work. Adding a value that has a 1 in each byte position won't work either (though it seems like it would) because of overflow -- when one byte overflows, it can "overflow" into the next byte (like your sink onto your floor). More complex math (multiplication/division/etc.) is even worse. So math functions that work with packed data have to be designed specifically FOR working with packed data. That is all that MMX is -- a few functions that can do math on packed data very quickly. Instead of treating a long word (or quad word) as an individual piece of data, it breaks the instruction down into smaller components, and works on the packed data in parallel.
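The overflow problem is easy to see with real numbers. The sketch below (Python, for illustration; the function names are hypothetical) first does the naive add, then a well-known software workaround that keeps carries from crossing byte boundaries. The workaround is only a stand-in to show what "packed" addition means; it is not a description of how MMX is built in hardware.

```python
def naive_add(word, addend):
    """Ordinary 32-bit add: a carry out of one byte spills into the next."""
    return (word + addend) & 0xFFFFFFFF


def packed_add_bytes(a, b):
    """Add four packed bytes independently; each byte wraps around on its own."""
    # Add only the low 7 bits of every byte, so no carry can leave a byte...
    low7 = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F)
    # ...then fold in the top bit of each byte without generating a carry.
    return low7 ^ ((a ^ b) & 0x80808080)


word = 0x0102FF04   # four bytes: 01 02 FF 04
ones = 0x01010101   # a 1 in every byte position

print(hex(naive_add(word, ones)))         # 0x2040005  (02 04 00 05)
print(hex(packed_add_bytes(word, ones)))  # 0x2030005  (02 03 00 05)
```

In the naive result the FF byte overflowed and bumped its neighbour from 03 to 04. The packed version gives each byte its own independent +1, at the cost of several extra operations per add. Doing that in a single instruction is the whole point of MMX-style packed math.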

Because the MMX instructions use the floating point registers, they have 64 bits to work with at one time. That means that they can work with 8 bytes at one time, or 4 words -- there are also a few cases where MMX can work with 2 long words at once. So instead of a 32 bit processor behaving like an 8 bit processor (when working with 8 bit data), it can behave as 8 x 8 bit processors at the same time. There are also a few instructions that do specialized math, instead of building those instructions out of many simpler ones (this increases speed further).
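One way to picture "8 x 8 bit processors at the same time" is to treat a 64 bit value as a row of independent lanes. The following is a software simulation in Python, not MMX code; the function names and the sample values are invented for the example, and it handles the lanes one after another where the hardware would handle them all at once.

```python
def split_lanes(quad, lane_bits):
    """Split a 64-bit value into lanes of `lane_bits` bits, lowest lane first."""
    mask = (1 << lane_bits) - 1
    return [(quad >> shift) & mask for shift in range(0, 64, lane_bits)]


def join_lanes(lanes, lane_bits):
    """Reassemble lanes (lowest first) into one 64-bit value."""
    quad = 0
    for i, lane in enumerate(lanes):
        quad |= lane << (i * lane_bits)
    return quad


def packed_add(a, b, lane_bits):
    """Add two 64-bit values lane by lane; each lane wraps around on its own."""
    mask = (1 << lane_bits) - 1
    sums = [(x + y) & mask
            for x, y in zip(split_lanes(a, lane_bits), split_lanes(b, lane_bits))]
    return join_lanes(sums, lane_bits)


a = 0x00FF00FF00FF00FF
b = 0x0001000100010001

print(hex(packed_add(a, b, 8)))    # 0x0: as 8 bytes, every FF wraps to 00
print(hex(packed_add(a, b, 16)))   # 0x100010001000100: as 4 words, 00FF + 0001 = 0100
```

The same 64 bits give different answers depending on whether they are treated as 8 bytes or 4 words, which is why packed instructions come in separate byte, word and long word flavors.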

Because MMX shares its registers with the FP unit, all the registers have to be completely unloaded and reloaded whenever the programmer wants to change from MMX mode to floating point mode (or vice versa). This causes some real performance issues with MMX. If you don't want to mix MMX and FP, then things are fine. In the real world, though, the programmer is not really in control of this; the OS itself will often switch in and out of MMX and FP mode on its own schedule. Also, programmers DO want to mix them in some cases. So in some ways MMX is faster, and in others it can slow things down (mode switches). Intel is trying to improve MMX with the PentiumII's and later, because the first versions had this weakness. Even so, they can only do so much with their design without breaking programs already written for MMX.

VMX (AltiVec)

This design is not written in stone; in fact, I have yet to find it in ink or pencil. So this is speculation, but educated speculation. We do know that VMX is likely to be similar to MMX in that it will be some specialized parallel instructions that work on many pieces of data at the same time. However, the PowerPC's have more space on the chip to do their work. So chances are that the PPC's VMX will have big advantages over Intel's MMX.

VMX will not have to share its registers with the FP unit. (This is almost guaranteed). This alone guarantees better performance because there will be no mode switching. It also means that the registers may be larger than 64 bits -- like possibly 128 bits or more. This could increase performance even more.

PowerPC's are more superscalar than Pentiums; PPC's can complete more instructions at the same time (4 or 6 at a time, compared to the Pentiums' 2 or 3). This makes the PPC faster than the Pentium at the same clock rate. The PPC's have more registers (so they don't have to unload and reload them as often), so they are faster there as well. The PPC's also have more special fast memory (cache) than Pentiums. All of those things will likely apply to VMX over MMX -- with possibly more registers, more superscalar functionality and a larger cache.

MMX had a few basic instructions that were good for some sorts of Multimedia -- but not that many. Since the PowerPC has more free space and power, it may add in more specialized instructions -- which can be a big deal in performance. So there may be more or better instructions for VMX (PPC's).

There is so much more free space in PowerPC's, that in many ways the VMX design may be included as a separate processor inside a processor -- sort of on-board MP (Multi-Processing). This means it might run simultaneously with your regular code, without having to stop regular work to do VMX instructions (which is what happens in the Pentiums).

Also, the MacOS and OpenStep are better designed, and more likely than Windows (95 or NT respectively) to take advantage of functions like VMX/MMX -- without requiring every programmer to recompile every program to use it. QuickTime, Sound, Networking, and other libraries are all more able to take immediate advantage of this added functionality -- and Macs are better at using these libraries and standards.

Lastly, AIM (Apple-IBM-Motorola) have not "rushed" to get something out. They have taken their time to do things right. This alone means they are more likely to learn from others' mistakes (and avoid them), and create a superior product from the start.

It seems highly likely that VMX will be far superior to MMX in both design and implementation. But Intel and Microsoft have never been known for superior engineering, just superior marketing. Hopefully the industry is waking up to these facts.

Conclusion

I hope you now have an understanding of what MMX is, and is not. It cannot cure world hunger -- and does not even make that big a difference for most programs. Some programs can take advantage of it, and see dramatic performance differences -- so much so, that for some limited things, the Pentiums almost catch up to today's PowerPC's. Before MMX the PPC's were beating the Pentiums by 2:1 or 3:1.

VMX is likely to be the next generation in this area of functionality. The PowerPC's superior RISC design gives it advantages that just can't be marketed around (despite great effort by Intel to do so). The G3's (the new PowerPC's) are beating the PentiumII's (with MMX) at the same clock rate (by a large margin) without VMX. The G3's are going to be available at much higher clock rates, and they use only 20% of the power (switches and area) to do so. Since area, power, switches and cost are all related, the designers of the PowerPC just have far more room to work with than Intel does. This means that no matter what Intel tries to do, AIM (the PPC's) can respond, and far "outdo" them. Of course Intel can't sell that -- so they will resort to more ads with dancing techs in iridescent suits trying to sell you an image or name recognition -- while staying as far away from the facts as they can.

Created: 06/16/97
Updated: 11/09/02
