Computer: CPU
12864   (+25, -20)  [vote for, against]
A microprocessor description I created in early 1991

The 12864 Microprocessor

This is a daydream...let me call it the 12864....

Let me start by saying I am prejudiced in favor of the 6809 microprocessor, created by Motorola. That it was the best of its day was confirmed when NASA decided to use it in the Space Shuttle's main computers. I personally feel that the 6809 should have had wider acceptance in the personal-computer market, and that Motorola snubbed its potential by introducing the 68000 too quickly. Only recently, with the widespread use of the 32-bit microprocessors, has the 6809 really become outclassed. So it is time to move on, time to create a new best microprocessor of the day. Since this is currently only my own dream, it has been greatly influenced by what I know of the 6809, and also what I have learned about the 68020. I do not mean to ignore any worthy contributions from other microprocessors; that is in fact the main reason for this essay! I am sharing my dream in hopes that it may be catching....

The 12864 is a 128/64-bit microprocessor. It has 64 address lines, and all registers are 64 bits wide. But it also has 128 data lines, and this is why: First, being able to handle this many bits at once means that the 12864 doesn't need a coprocessor; most coprocessors only handle 80 bits or so. Therefore the 12864 also doesn't need a secondary instruction set telling it how to talk to a coprocessor. A second reason for having a 128-bit data path leads to further simplification of the microprocessor: All its instructions have been carefully designed to fit within 128 bits, so that a single memory-access can provide the 12864 with a whole instruction. To make this still more efficient, the computer that incorporates a 12864 will be required to have 128-bit-wide memory, and not the common 8-bit-wide or 9-bit-wide memory of most of today's microcomputers. This means that the 64-bit Program Counter or PC register is always incremented just once for each instruction pulled from the memory. The 12864 is not much of an evolutionary offshoot from previous microprocessors; it's a radical mutation. Only in the efficiency of its instruction set does it relate to the 6809....

With 128-bit memory, design decisions made in the 6809 and 68020 are greatly simplified in the 12864. Example: Because the 6809 fetched instructions only 8 bits at a time, there were two distinct groups of Branch instructions: an 8-bit branch and a 16-bit branch. Machine code that used 8-bit branches as often as possible was both shorter and faster than code that always used 16-bit branches, because only one byte of memory and 1 clock-cycle of time were needed for 8-bit branching-data, while 2 bytes and 2 cycles were needed for 16-bit data. (Not to mention that 8-bit-branch INSTRUCTIONS were themselves only 8 bits, while most 16-bit branches also had 16-bit opcodes.) And in the 68020 processor, although there are 8-bit, 16-bit, and 32-bit branch instructions, the last, 32-bit type requires an extra fetch of data from the memory. But the 12864 processor needs only one size of branch instruction, because any 64-bit branch-distance will always fit into a one-clock-cycle 128-bit opcode+data fetch.

Likewise, because any 64-bit address in the memory can be part of a 128-bit fetch, there is no longer any need for a special Direct Page or DP register. In the 6809 the DP register offered an 8-bit way to access part of the memory; thus the longer and slower 16-bit way of specifying memory locations did not always have to be used. This is not a problem in the 12864.

Now what about the choice to use 64-bit-addressing? This represents about 18.4 quintillion addresses (18,446,744,073,709,551,616 addresses, to be exact), far beyond any reasonable projection of any computer's memory needs -- including virtual memory! Not to mention that since each address holds 128 bits of data, we are actually talking about 295 quintillion (8-bit) bytes of memory!
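The arithmetic above is easy to check for yourself. A few lines of Python (purely a convenience for verifying the numbers; the 12864 itself exists only on paper):

```python
# 64 address lines -> 2**64 addresses; each address holds 128 bits (16 bytes).
ADDRESS_LINES = 64
BITS_PER_ADDRESS = 128

num_addresses = 2 ** ADDRESS_LINES
total_bytes = num_addresses * (BITS_PER_ADDRESS // 8)

print(num_addresses)  # 18446744073709551616 -- about 18.4 quintillion
print(total_bytes)    # 295147905179352825856 -- about 295 quintillion bytes
```

Sixteen bytes per address is what turns 18.4 quintillion addresses into 295 quintillion bytes.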

Nevertheless, there are some possibly valid reasons for this choice: First, since the design of this processor is not yet completely fixed, and belongs to nobody, it might be that it could tickle the fancy of a number of different chip manufacturers, and lead to an Industry-Wide Standard Design. Naturally, it makes sense for the 12864 assembly language instruction set to become standardized and non-proprietary, also. Therefore a second reason for choosing 64-bit addressing is simply that it would take longer to put this complex chip into production -- and that hopefully gives the software developers plenty of time to convert their existing software to run on this admittedly incompatible processor. Thus, both the new computers and their software could arrive at the same time! Finally, a third reason for jumping straight to 64-bit addressing is that the architecture of the new computers can be designed with that in mind. Simply because 64 bits represents such a tremendous enhancement, making it the immediate goal means it can remain a standard far into the future....

Now let's get into some of the details of the 12864. The total number of registers of all types will be about 45, give or take a few. This number can be decided after the Condition-Code/Status Register has had its bits defined. As stated earlier, every register is 64 bits wide, including CCS. In the CCS register a number of bits are necessary for various processor functions; just how many depends on the total list of functions that will be designed. For the purposes of this essay, let us examine the CCS register of the 68020: It is 16 bits wide, of which 12 are defined and 4 are undefined. If we start with a 64-bit CCS and only use 12 of them for such things as result-of-instruction flags, interrupt masks, etc., then that leaves 52 bits that can be equated to the entire register set of the 12864 microprocessor. However, it is certain that some of those 52 bits will be dedicated to other processor functions (though I don't know what other dreamers will add to this), and so the number of registers is yet unknown.

In case you are wondering why match the bits of CCS with the register set, the answer involves the interrupt system. Whenever an interrupt or exception or other special event occurs, the processor can automatically save on a stack all the registers that are specified in the CCS register. The processor saves time because none of those interrupt-type handling routines need include instructions to specifically save and recover the registers they use. In fact, if the 12864 computer system's main power-up/initialization routine includes defining such a list of registers in CCS, then all interrupt-type routines can be written using only those registers. Different boot software, different registers. Note that 2 registers, the Program Counter and CCS, which ALWAYS are saved, do NOT need to be matched to bits in CCS, and so the 12864 can have 2 more registers than the simple count of available bits in CCS implies.
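The save-on-interrupt scheme just described can be sketched in Python. Everything here beyond the essay's own description is an assumption: the registers are modeled as a simple list indexed 0 to 44, and the CCS bits that match registers are assumed (hypothetically) to start at bit 12:

```python
def save_on_interrupt(registers, ccs, pc, stack, reg_bit_base=12, num_regs=45):
    """Push PC and CCS (always saved), then every register whose
    matching CCS bit is set. reg_bit_base is an illustrative guess at
    where the register-matching bits begin inside CCS."""
    stack.append(pc)
    stack.append(ccs)
    for reg_num in range(num_regs):
        if (ccs >> (reg_bit_base + reg_num)) & 1:
            stack.append(registers[reg_num])
    return stack
```

Different boot software sets different CCS bits, and the same routine saves a different register list -- exactly the flexibility described above.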

The next thing to discuss is the actual list of registers. A major element in the design of the 12864 is that as far as the programming instruction set is concerned, all registers are treated equal. But as far as the microcode and the hardware is concerned, some are more equal than others.... For the sake of this discussion, let us assume that there are 45 registers, numbered from 0 to 44. Suppose that Register 33 is the Program Counter, while Register 17 is just an ordinary general-purpose register. The hardware will always use 33 as a pointer to the current instruction about to be executed, and the hardware will always adjust 33 to point it at the appropriate next instruction. But the instruction set will not distinguish 33 from 17! A Logical-OR instruction that manipulates a group of bits inside 17 can just as easily manipulate bits inside 33, simply by specifying 33 instead of 17, in the Logical-Or instruction. Just because this is something that might be disastrous to the program is no reason to keep it from being possible! Let the assembly-language programming tool be written so that it catches such dubious instructions, and warns the programmer! The big advantage of this scheme is that it leads to an extremely significant reduction in the total complexity of both the instruction set and the microcode. Examples later on in this essay may make this more clear.

Let us now examine the bit-format of some of the instructions. By far the majority of the instructions will have a single format that offers astounding programming potential (well, what do you expect with 128 bits to play with!).... Actually, most of this instruction-group format fits into 64 bits, numbered 0 to 63, and defined as follows:

Bits 63-58: These 6 bits hold the actual generic instruction. Of course this means that there are only 64 such instructions, but if you have any doubts about this being enough, you don't yet realize how generic they are!

Bits 57-46 are divided into three groups of 4 bits each, hereinafter to be referred to as 'admode fields', short for 'addressing mode'. Since these fields have 4 bits, it follows that there are 16 different addressing modes. They will be explained shortly. The first admode field, bits 57-54, tells the processor where to find the first chunk of data needed for some instruction, say a SUB. The second admode field, bits 53-50, tells the 12864 where to find the second chunk of data; obviously a SUB instruction needs data that can be subtracted. And the third admode field, bits 49-46, tells the processor where to put the result of the SUB. Perhaps you now see that with 16 addressing modes for each admode field, a simple generic SUB instruction can encompass both registers and memory in quite a few different combinations!
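The field layout just described (6 opcode bits, then three 4-bit admode fields) can be sketched as a decoder. The field names GET1, GET2, and PUT are the essay's; the Python is just a way of restating the bit positions:

```python
def decode(word):
    """Split a 64-bit generic instruction word into its top-level fields."""
    return {
        "opcode": (word >> 58) & 0x3F,  # bits 63-58: one of 64 generics
        "get1":   (word >> 54) & 0xF,   # bits 57-54: first admode field
        "get2":   (word >> 50) & 0xF,   # bits 53-50: second admode field
        "put":    (word >> 46) & 0xF,   # bits 49-46: result admode field
    }
```

With 16 addressing modes available in each of the three fields, one generic SUB opcode covers a great many register/memory combinations.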

For convenience, let us call the admode fields GET1, GET2, and PUT. A list of proposed addressing modes follows, and if it is adopted, there will be a few restrictions on the use of two of them. Should the list be modified during later design stages of the 12864, these restrictions may still apply. The modes subject to restriction are marked with * symbols; the limitations are detailed at the end of the list. The admodes are numbered 0 to 15 in binary.


Direct Modes 0 to 3                         Semi-Direct Modes 4 to 7
 0000 Register Data+16\9\10\3bit Offset      0100 Reg Address + 16\9bit Offset
 0001 Register Data+16\9\10\3bit Adjust      0101 Reg Address + 16\9bit Adjust
 0010 GET1=PUTmode, or GET2 or PUT=NONE      0110 Reg Addr+(Reg+10\3bit Adj) Offset
*0011 Immediate 64bit Data                  *0111 Absolute 64bit Address

                          Indirect Modes 8 to 15
 1000 [Reg + 16\9bit Offset],LSig64bits      1100 [Reg + 16\9bit Offset],MSig64bits
 1001 [Reg + 16\9bit Adjust],LSig64bits      1101 [Reg + 16\9bit Adjust],MSig64bits
 1010 [Reg+(Reg+10\3bit Adj)Offst],LS64      1110 [Reg+(Reg+10\3bit Adj)Offst],MS64
 1011 [Reg],LS64+(Reg+10\3bit Adj)Offst      1111 [Reg],MS64+(Reg+10\3bit Adj)Offst

Explanations

Register Data: The value in a register is considered to be data.

Reg Address: The value in a register is considered to be an address.

16\9bit, 10\3bit: 16 or 9 or 10 or 3 bits of twos-complement information, sign-extended to 64 bits (internally) by the 12864 processor.
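That sign extension is a standard twos-complement trick; here is a small sketch of what the processor would do internally with a 16-, 10-, 9-, or 3-bit field:

```python
def sign_extend(value, width):
    """Interpret the low `width` bits of `value` as twos-complement
    and return the sign-extended result (as a Python int)."""
    sign_bit = 1 << (width - 1)
    value &= (1 << width) - 1        # keep only the field's bits
    return (value ^ sign_bit) - sign_bit
```

So a 9-bit field of all ones comes out as -1, ready to be applied as a 64-bit Offset or Adjust.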

64bit: 64 bits of information fetched with the 64-bit generic instruction.

Adjust: Value in register is modified, using twos-complement information. If info is negative, register adjusted BEFORE instruction executed. If info is positive, register adjusted AFTER instruction executed.

Offset: Similar to Adjust, but register not modified. Computation of the Offset is always performed before instruction is executed.

NONE: No data or address at all.

[]: Value inside brackets is an address. Information at that address is in turn used as an address.

,LSig64bits ,LS64: An address holds 128 bits of data, of which the Least Significant 64 bits are selected for the instruction.

,MSig64bits ,MS64: The Most Significant 64 bits at an address.

(Reg): Distinguishes a second register that this addressing mode uses.

* Recall the design decision that limits instructions to 128 bits, including 64 bits of Immediate Data or Absolute Address. It's quite obvious that only one Admode Field can use those 64 bits. It also works out that if different admodes exist in all three Admode Fields, then none of the *-marked admodes may be placed in any Admode Field. And in any instruction that uses data acquired through the GET2 field, or in which GET1 differs from PUT, the Absolute-64bit-Address mode is excluded from the PUT field. (Of course, Immediate Data mode is always excluded from the PUT field.) More details of these limits will be provided later; for now it might be noted that the reason no 64bit-Offset modes exist is to avoid a lot of trouble. It makes the programmer use more registers for indexing, but eliminates much competition between the Admode Fields for the use of the 64 bits that accompany the instruction. Besides, you might be surprised by how well other instructions can replace any 64-bit Offset modes! Anyway, the 12864 processor will probably have 30 or more general-purpose registers (registers the hardware doesn't always modify for specific purposes, like CCS or the Stacks -- or have their contents used for other purposes, like pointing at cache or program data). It should be easy to find enough available registers for most address-pointing.

Now for some descriptions of the 16 admodes and their consequences:

Direct Modes 0 to 3 all specify data the 12864 processor has on hand, in a register, or just loaded along-with or as-part-of the instruction. Obviously, these modes can be executed more quickly than the Semi-Direct or Indirect Modes.

(0) In this admode the data needed by the current instruction is in one or two of the registers. The 12864 processor has 128 data lines; just because every register is only 64 bits wide is no reason to limit its ability to process 128 bits. ANY TWO registers may be put together, in any order, to make a place that holds 128 bits! (Okay, I exaggerated; the 12864 will have both a 'Boss' mode and a 'Peon' mode. Only the Boss mode can put ANY two registers together; in the Peon mode a lot of combinations will be illegal. And, even in the Boss mode, a lot of combinations will be undesirable, like using a Stack pointer with a Cache-pointing register; the 12864 Assembler would warn the programmer.) Note that admode 0 merely declares that one or two registers will be used; the actual register(s) specified are elsewhere among the many bits of this generic format. After the processor identifies the register(s) holding the data, an offset will be applied to that data. The offset quantity shall be used by the 12864 in its implementation of the instruction; the register(s) holding the data will not be affected by the offset. The maximum size of the offset is affected by how many registers are used, and by the type of generic instruction being performed; the details of this will be provided later. The main purpose of admode 0 is to let us eliminate the LEA (load effective address) instructions from the processor's list of 64 generics--but certainly other uses will be found for it.

(1) This admode is very much like admode 0. The only real difference is that the content of the register(s) IS affected by this admode, which makes the mode useful in counting loops. One thing to keep in mind is that any negative adjustment is performed before the whole instruction is implemented, while any positive adjustment is performed after the overall instruction is implemented. This admode also helps us eliminate LEA instructions (details later).

(2) This is the only admode with a double meaning. If admode 2 is used within the GET1 field, then it means that the first chunk of data, needed by the instruction, is currently in the place specified by the PUT field. Thus data at some location, after manipulation, will return to that location. If we exactly specify the same admode in both the GET1 and PUT fields (instead of using admode 2 in GET1), we end up being unable to use Immediate Data at all--you'll see! If admode 2 is used in the GET2 field, then it means NONE, no data for that part of the instruction. Operations like LSH (logical shift) use admode 2 in GET2; they need only one main data chunk since any other data is part of the instruction's definition. In fact, if any admode besides 2 is in GET2 during a LSH or similar instruction, then the admode should be ignored, or declared illegal. If admode 2 is in GET2 during a SUB or similar instruction, then the net effect of the SUB will be equivalent to a TST instruction. (With lots of TST-equivalents, there need not be a specific TST among the 64 generics. But the 12864 Assembler may include a TST, and translate it into an equivalent.) Finally, admode 2 in the PUT field also means NONE, no address. The computed result of the SUB or other manipulation is not put anywhere, and this is useful, too! The definition of a CMP (compare) is exactly a SUB that doesn't save the result! So the CMP becomes another common instruction that the 12864 processor excludes from its list of 64 generics.... Like TST, the 12864 Assembler can include CMP, and translate it to an equivalent: a destinationless SUB. Similarly, the 6809 BIT operation is an AND instruction with no destination. Designers unite! The 12864 has a full set of destinationless instructions--and no extra complexity! Moving on, suppose admode 2 is in both GET1 and PUT: This is basically a no-operation, NOP. Lots of ways exist to do a NOP; the 12864 Assembler can include NOP, and translate.
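The "destinationless SUB equals CMP" idea deserves a sketch: the subtraction happens only to set flags, and nothing is written back when the PUT field says NONE. Flag names (Z, N, C) follow 6809/68020 convention; a 64-bit width is assumed here for simplicity:

```python
MASK64 = (1 << 64) - 1

def sub_flags(a, b, put_none=True):
    """Compute a - b and derive Z/N/C flags; discard the result if PUT
    is NONE (admode 2), which is exactly a CMP."""
    result = (a - b) & MASK64
    flags = {
        "Z": result == 0,
        "N": bool(result >> 63),  # sign bit of the 64-bit result
        "C": b > a,               # borrow out of the subtraction
    }
    return (None if put_none else result), flags
```

TST falls out the same way: a SUB with admode 2 in GET2 (nothing subtracted) and in PUT (nothing stored) just reads a value and sets flags on it.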

(3) This is the Immediate Data admode. Since the instruction is often 64 bits long, while 128 bits are always fetched from memory, admode 3 tells the instruction to use as data the group of 64 bits fetched with the instruction.

Semi-Direct Modes 4 to 7 all specify that the data the processor has on hand are memory-addresses of the data needed by the instruction. Admodes 4 to 7 are slower than the Direct Modes because the 12864 has to go fetch the data from the memory, but this process is still faster than the Indirect Modes.

(4) This admode is like admode 0 in operation. The main difference is that only one register is ever specified, since one register holds 64 bits and the memory addressing range is 64 bits. But the offset is figured the same way as admode 0, and the value in the register is not changed. As mentioned, the result of the offset computation is a memory address; the data at that location is fetched for use by the instruction.

(5) This admode combines features of admode 1 and admode 4. Again only one register is specified as an address-pointer, or index (4). An adjustment of the value in that register will be applied, pre-decrement or post-increment (1). If you review admodes 1 and 4, this one should be pretty obvious.

(6) The basic addressing mode for doing 64-bit (or any size larger than 16-bit) offsets is admode 6. One register is specified as a pointer (index) to the general region of memory; a second register is specified that will hold the offset from the general place to any specific place. Furthermore, this second register can be given a predecrement or postincrement adjustment, which makes it easy to skip through tables of data. Note that although the second register is adjustable, its value is only an offset; the first register remains unchanged.

(7) This admode specifies that the 64 bits fetched along with the 64-bit generic instruction are the absolute memory address of data the instruction needs.

Indirect Modes 8 to 11 are quite like Indirect Modes 12 to 15: They are computed the same way, but at some point the data at an address is used as an address. Now since the data is always 128 bits and addresses are only 64 bits, which 64 of the 128 do we use? Thus admodes 8-11 use the Least Significant 64 bits of the 128, while admodes 12-15 use the Most Significant 64 bits.
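Picking the address half out of a 128-bit memory word is simple to state in code -- admodes 8-11 take the least significant 64 bits, admodes 12-15 the most significant:

```python
MASK64 = (1 << 64) - 1

def indirect_address(word128, admode):
    """Select the 64-bit address half of a 128-bit word, per admode group."""
    if 8 <= admode <= 11:
        return word128 & MASK64          # ,LSig64bits
    if 12 <= admode <= 15:
        return (word128 >> 64) & MASK64  # ,MSig64bits
    raise ValueError("not an indirect admode")
```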

Note that all the admodes that use registers as indexes let the Program Counter be used as easily as any other register. The 12864 processor needs no special microcode to provide a host of Program-Counter-Relative admodes, due to basic design decision making the instruction set handle all registers equally. The trick to consistency is for the processor to apply any adjustment or offset to chosen index register AFTER incrementing PC past the current operation This in turn works due to design choice to make ALL the instructions fit in 128 bits. Nevertheless, the 12864 Assembler may specifically distinguish Program-Counter- Relative admodes from the other admodes, and translate appropriately. Finally, note that it may be undesirable to use the PC register as a data-pointer in any admode that will adjust the value of the index!

(8) This admode first computes an address in exactly the same way as admode 4. The 12864 processor then fetches the lowest 64 bits from the memory at that address, and uses this information as another address. The instruction will use the data in the memory at the second address.

(9) This admode first computes an address in exactly the same way as admode 5. Then an address is fetched, and then data, as just described.

(10) This admode first computes an address in exactly the same way as admode 6. Then an address is fetched, and then data, as just described.

(11) This addressing mode starts by using the value in a register as an address. The 64 lowest bits in the memory at that address are fetched; they will be used as a second address. However, before they are used, an offset will be applied to that second address. A second register is specified, along with an adjustment. The value in this predecremented/postincremented register is the 64-bit offset that is applied to the second address; the first register's value, and the memory that held the second address, are not changed by this process. After computing the new, offset address, the 12864 processor fetches 128 bits of data from that location in the memory, for the current instruction.
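Admode 11 is the most involved of the lot, so here is a sketch of its whole sequence. Memory is modeled as a dict of 128-bit words keyed by address, and registers as a dict; both are modeling conveniences, not part of the essay's design:

```python
MASK64 = (1 << 64) - 1

def admode11_fetch(memory, regs, first, second, adjust):
    """[Reg],LS64+(Reg+Adj)Offst: return the 128 bits of data admode 11
    would hand to the instruction. Neither the first register nor the
    memory holding the second address is modified."""
    if adjust < 0:                       # negative info: adjust BEFORE use
        regs[second] = (regs[second] + adjust) & MASK64
    second_addr = memory[regs[first]] & MASK64   # LS64 at [first register]
    data = memory[(second_addr + regs[second]) & MASK64]
    if adjust > 0:                       # positive info: adjust AFTER use
        regs[second] = (regs[second] + adjust) & MASK64
    return data
```

Admode 15 would differ only in taking the most significant 64 bits at the first address.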

(12) This admode first computes an address in exactly the same way as admode 4. The 12864 processor then fetches the highest 64 bits from the memory at that address, and uses this information as another address. The instruction will use the data in the memory at the second address.

(13) This admode first computes an address in exactly the same way as admode 5. Then an address is fetched, and then data, as just described.

(14) This admode first computes an address in exactly the same way as admode 6. Then an address is fetched, and then data, as just described.

(15) This addressing mode starts by using the value in a register as an address. The 64 highest bits in the memory at that address are fetched; they will be used as a second address. Then everything proceeds just like admode 11.

Now to show how LEA (load effective address) needn't be included among the 64 generics. Consider admode 15: At the end of its computations the 12864 processor has an address which it normally uses, right now, to fetch data, after which the address is not saved. LEA creates that address and saves it for later use and re-use (it doesn't use it now). Suppose admode 15 specifies register 10 (first), register 7 (second), and an adjustment of -58. The Assembler translates LEA (with syntax specifying admode 15 and the register info) into an ADD: The GET1 field is given admode 4, register 10, and a 0 offset; the processor fetches 128 bits from the address (a part of the generic instruction we haven't got to yet lets us select the correct 64 bits). The GET2 field is given admode 1, register 7, and an adjust of -58; the processor modifies the register and gives its content to the ADD. Then the PUT field specifies where to save the result. (GET2 might instead have admode 0, to get the same result without modifying register 7.) Any LEA can be translated!

At last we can continue the bit-designations of the generic instruction format. Been about 1000 bytes per bit of explanation, so far...!

Bits 45-39 specify a Bitfield Size for the instruction. These 7 bits can hold any number from 0 to 127, and with 0 being interpreted by the processor as 128, it becomes possible for the instruction to operate on any data size from 1 to 128 bits. Even though the registers of the 12864 microprocessor are only 64 bits wide, its Arithmetic/Logic Unit is 128 bits wide, and is able to handle any data size smoothly. So if the Bitfield Size is 79, then 79 bits will be taken from the place specified via GET1, manipulated (if the instruction requires it) with 79 bits from the place that GET2 indicates, and finally a 79-bit result is sent to the place described by PUT. The 12864 Assembler considers the Bitfield Size to be optional information; if it is not provided by the programmer, a size of 128 bits will be assumed. Some Assembler instructions, like LEA, default to 64 bits due to the nature of the instruction (LEA computes a 64-bit address). MUL will always have two 64-bit inputs and one 128-bit output; DIV will always have a 128-bit dividend, a 64-bit divisor, and a 128-bit quotient. And whenever Immediate Data is specified, then either the whole instruction must be limited to 64 bits, or the processor must allow 64 bits to be used in the manipulation of 128 bits. (Perhaps we can have both: The processor can have the ability to do the latter, while the Assembler lets the programmer decide the former.) One way the programmer can set the Assembler's default to 64 bits would be to simply specify only one data-holding register in an instruction's syntax.
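The "0 means 128" rule is easy to pin down in a couple of lines:

```python
def bitfield_size(field7):
    """Decode the 7-bit Bitfield Size field: 1..127 literally, 0 -> 128."""
    return 128 if field7 == 0 else field7

def take_bitfield(value, field7):
    """Take the low `size` bits of a (up to) 128-bit value, per the rule."""
    size = bitfield_size(field7)
    return value & ((1 << size) - 1)
```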

Bit 38 of the generic instruction is the Signed Extension Flag. It tells the processor to treat the result of an operation as a twos-complement number, if this bit is set. When the result is PUT into its destination, its negative-ness or positive-ness, as it exists within the Bitfield Size, is extended out to the Bit-127-mark (the Most Significant Bit is numbered 127; the Least is 0). If only one register is specified, then sign-extending the result out to the Bit-63-mark is the thing to do. If the Signed Extension Flag is not set, the result of the instruction is simply PUT into its destination, and nothing else is done.
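A sketch of that flag in action, with the destination width (128 or 64 bits) passed in as a parameter -- the function name and signature are illustrative, not the essay's:

```python
def put_result(result, size, signed, dest_bits=128):
    """Return the value actually written to the destination: the low
    `size` bits, sign-extended to dest_bits when the flag is set."""
    mask = (1 << size) - 1
    result &= mask
    if signed and result >> (size - 1):            # bitfield's sign bit set
        result |= ((1 << dest_bits) - 1) & ~mask   # extend ones to the top
    return result
```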

Bits 37-34 contain the Do-If condition. Practically the whole instruction set of the 12864 processor is conditional. This lets the programmer avoid a lot of conditional-Branches that only skip past a few instructions. Where formerly some code might have: BCS (branch if carry set) followed by a ROT (rotate) that would be executed if the carry flag was clear, now we can specify Do-the-ROT-If Carry Clear, and delete the Branch entirely. In fact, with these 4 bits we can delete the entire collection of Branch operations from the generic instruction set of the 12864! The Assembler simply translates any Branch to ADD Immediate Data to the Program Counter, and sets the appropriate Do-If condition-bits. Of course, most of the time, most instructions will set the Do-If to ALWAYS. With only 4 bits, only 16 conditions are allowed. This is enough for Motorola's 6809 and 68020; I hope the final design of the 12864 processor won't require more.
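Here is how a deleted Branch comes back as a predicated ADD to the PC. The numeric condition encodings (ALWAYS, CARRY_SET, CARRY_CLEAR) are illustrative assumptions; the essay doesn't fix a Do-If table:

```python
ALWAYS, CARRY_SET, CARRY_CLEAR = 0, 1, 2  # hypothetical Do-If encodings

def condition_met(do_if, flags):
    if do_if == ALWAYS:
        return True
    if do_if == CARRY_SET:
        return flags["C"]
    if do_if == CARRY_CLEAR:
        return not flags["C"]
    raise ValueError("unknown Do-If condition")

def branch_as_add(regs, pc_reg, displacement, do_if, flags):
    """A BCC-style branch: ADD Immediate Data to the PC register,
    executed only if the Do-If condition holds."""
    if condition_met(do_if, flags):
        regs[pc_reg] = (regs[pc_reg] + displacement) & ((1 << 64) - 1)
```

And since every instruction carries a Do-If field, the same predication applies to a ROT or a SUB just as well as to this synthesized branch.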

Bits 33-29 are the Flag Mask bits, the other side of the coin from the Do-If conditions. If every instruction can be controlled by the flags in the CCS register, it follows that every instruction should be able to specify which CCS flags, if any, will be affected as a result of its implementation. In fact, for the Branch instructions to be properly deleted from the generic instruction set, it is essential that flag-masking be possible. Traditionally, Branch operations never affect any flags; translating them into ADD instructions makes it obvious why we require flag-masking. Now consider again the Do-the-ROT-If Carry Clear that was previously described: What if the instruction after the ROT is also to be executed only if the Carry flag is clear? A ROT normally affects the Carry flag! So we mask the flag; the next instruction can also Do-If Carry Clear. In the 6809 and the 68020 there are only 5 conditions-of-results flags; I hope the final design of the 12864 processor won't require more.
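The masking itself is one line of bit arithmetic. The convention assumed here (a set mask bit means the flag is protected, i.e. NOT affected) is my reading of the description, not a fixed part of the design:

```python
def update_flags(ccs, new_flags, mask):
    """Apply an instruction's new flag bits to CCS, except where `mask`
    protects the old bits (set mask bit = flag unaffected)."""
    return (ccs & mask) | (new_flags & ~mask)
```

So a ROT that masks Carry leaves the old Carry bit in place for the next Do-If Carry Clear instruction.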

Bits 28-0 (yes, all the rest) are devoted to the details of the PUT field. However, the highest seven of them, Bits 28-22, can have another purpose. There is a group of operations that perform what we might call 'minor manipulations', and which may need some minor data. The generic instructions of this class that I have so far identified are, in alphabetical order: ASL and ASR (arithmetic shift left and right), COPY, INIT (initialize), ISUB (subtract from an initial value), LSR (logical shift right; LSL = ASL), ROL and ROR (rotations), and SWAP. ASL, ASR, LSR, ROL, and ROR need data ranging from 1 to 128; INIT, ISUB, and sometimes COPY, need twos-complement numbers ranging from -64 to +63. The specification of 7 bits was decided by the needs of ASL, ASR, LSR, ROL and ROR; other instructions are merely taking advantage of what is already there. Only SWAP does not need any of those seven data bits. We could have assigned eight bits to ASL, etc.; twos-complement numbers from -128 to +127 (with zero = +128) would let us reduce the list of generic instructions even more. Unfortunately, we are running out of bits! So we can either assign 5 of 64 generic operations to various kinds of bit-shift, and use 7 bits to describe the size of the shift -- or we can have 3 generic shift operations and use 8 bits to describe the size of the shift. But ONLY those 3 generic instructions ever really need that 8th bit! It seems more reasonable to use an extra 2 of the 64 generic instructions.

Let's examine some of the capabilities of these 'minor' manipulators:

ASL (and the identical LSL) merely shift bits from Least Significant to Most Significant. The Bitfield Size determines how many bit-positions will be involved in the shift. There is also some Bitfield Start data (which we haven't got to yet, but has to be mentioned NOW) that specifies exactly where among the 128 bits the Bitfield Size is located. The 12864 Assembler needs to scrutinize these things carefully; we can't let Bit 100 be the Start while the Size is 34 bits, nor let the Size be 52 bits while the Shift is 73 bits! One final thing about ASL and LSL: Perhaps they shouldn't be so identical. The 6809 processor defines them so that there's no reasonable difference between an ASL and an LSL. But the 68020 places a new flag in the CCS register, an eXtend flag designed to hold a bit of data specifically for arithmetic operations. The Carry flag holds data for both arithmetic and logical operations. Yet LSL and ASL both affect the X flag! So perhaps a distinction can reasonably be made: Only ASL should affect X. (ASR and LSR also have this small irrationality.)

ASR and LSR are similar to ASL, of course, their main difference being that these instructions shift bits from Most Significant to Least. More details of what they do need not be presented here; they all are common instructions. But we might note that the power of the 12864 processor lets us get data from just about any place in the computer (using GET1), shift or otherwise manipulate any part of that data, and then PUT the result almost anywhere else, all in just one instruction. The mundane turns into the extraordinary.

INIT lets us initialize a register, or registers, or data at any memory location, such that it becomes a 64-bit or a 128-bit expansion of any number in the 7-bit range of -64 to +63. INIT replaces CLR (clear), which initializes a data-storage place to zero only; now we can initialize to 1, or -1, or to any of more than a hundred possibilities. Note that INIT never needs any GET1 info.

ISUB replaces both NEG (negate) and NOT, which respectively subtract a number from 0 or -1. ISUB subtracts numbers from anything in the Initial Value range of -64 to +63. Finding other uses for this operation is not so important as consolidating NEG and NOT into one generic instruction. The Assembler will, of course, retain both NEG and NOT, and translate appropriately.
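A quick Python sketch of the consolidation, assuming 64-bit two's-complement arithmetic; a constant of 0 gives NEG and a constant of -1 gives NOT, exactly as described:

```python
# Sketch of the generic ISUB: subtract the operand from a 7-bit
# constant k. The translations of NEG and NOT are my illustrations
# of what the Assembler might emit.

MASK64 = (1 << 64) - 1

def isub(k, operand):
    assert -64 <= k <= 63
    return (k - operand) & MASK64

def neg(x):
    return isub(0, x)     # NEG: 0 - x

def not_(x):
    return isub(-1, x)    # NOT: -1 - x, i.e. one's complement
```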

ROL and ROR are pretty much like the shift instructions. The 68020 has another sort of rotation called ROXL and ROXR, but the 12864 may not need them. First examine the rotation operation of the 6809: The Carry flag is always part of the rotation; a bit coming off one end of a byte is moved to Carry by one ROT and moved out of Carry back into a byte by another ROT. In the 68020 a simple ROT moves a bit from one end of a location directly to the other end; a copy of that bit is placed in the Carry Flag. ROX, on the other hand, uses the eXtend flag the same way the 6809 uses the Carry. In the 12864 processor we can mask flags that an instruction would normally affect. Suppose the 12864 rotation is designed to normally flag both X and Carry: If we mask Carry, no copy is sent there; if we mask X, the bit that normally moves through it simply bypasses it. (Similarly, the 12864 can have one generic ASL/LSL operation, but the Assembler can mask the X flag for LSL--if the notion proposed earlier is adopted.)
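Here is one way (my reading) to model that maskable rotation in Python: the eXtend flag is the through-bit, so masking X turns the 6809-style rotate-through into a 68020-style simple rotate, and masking Carry suppresses the copy sent there. The 8-bit width is just for readability.

```python
# Hypothetical model of the maskable-flag ROL described above.
# Returns (result, new_x_flag, new_carry); new_carry is None when
# the Carry flag is masked and receives no copy.

def rol(value, x_flag, width=8, mask_x=False, mask_c=False):
    top = (value >> (width - 1)) & 1      # bit leaving the high end
    if mask_x:
        # X masked: the bit bypasses X and wraps straight around
        result = ((value << 1) | top) & ((1 << width) - 1)
        new_x = x_flag                    # X untouched
    else:
        # X unmasked: rotate THROUGH the flag, 6809-style
        result = ((value << 1) | x_flag) & ((1 << width) - 1)
        new_x = top
    new_c = None if mask_c else top       # masked Carry gets no copy
    return result, new_x, new_c
```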

The COPY instruction replaces LoaD, STore, TransFeR, INC, DEC, JMP from the 6809, and COPY also replaces MOVE from the 68020. Even some LEA operations can be translated to COPY. The GET1 admode field lets us specify any place in a 12864-based computer from which to fetch data (and any number of bits from 1 to 128); the PUT field lets us specify almost any other place to receive a copy of those bits of data. What could be simpler and more powerful? To replace INC and DEC, GET1 can specify admode 2 -- same as PUT. When GET1 holds 2 while COPY is being processed, the 7-bit Initialize-data will be used to modify the place specified by PUT. Instead of only -1 or +1, the INC/DEC can now range from -64 to +63 -- even to +64 if the value of zero is interpreted thus (it's no good for anything else!). Some LEA instructions that the Assembler translates into COPYs will have admode 0 in GET1, a specified register, and a 16-bit offset ranging from -32768 to +32767. PUT would specify the same admode and register, and an offset of zero. Masking the flags is normal for LEA. Larger offsets can become ADD Immediate Data to a register, with the flags masked. JMP instructions are translated into COPY to the PC register, with masked flags--and remember that any JMP can now be conditional! Load and store and transfer and MOVE operations become COPY memory to register, reg. to mem., reg. to reg., and mem. to mem. Another 68020 instruction, PEA (push effective address), may be unneeded in the 12864. It has the effect of computing an address and saving it in a place that is NOT a register, for later use (most likely by the Program Counter, since there isn't a LEA-to-PC instruction in the 68020). In the 12864 processor, we simply specify the Program Counter's register-number in the PUT-field data if we want to LEA-to-PC. Otherwise we can PUT the EA almost anywhere else, for later use.
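To make the "one generic COPY" idea concrete, here is a toy register/memory model in Python. Everything in it is my own invention (including using register number 44 for the PC); it only shows how a single move primitive covers load, store, transfer, and jump.

```python
# Toy model, not the spec: one COPY primitive standing in for load,
# store, transfer, and jump. Register number 44 for the PC is a
# hypothetical choice.

PC = 44

def copy(regs, mem, src, dst):
    """src and dst are ('reg', number) or ('mem', address) pairs."""
    value = regs[src[1]] if src[0] == 'reg' else mem[src[1]]
    if dst[0] == 'reg':
        regs[dst[1]] = value    # a COPY into the PC register is a jump
    else:
        mem[dst[1]] = value
    return value
```

With a Do-If condition attached to the generic instruction, the same primitive gives conditional jumps for free.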

SWAP is similar to COPY, in that the GET1 data specifies one place while the PUT data specifies another. However, as the names imply, they do different things: The 12864 SWAP replaces both the 68020 SWAP and EXG (exchange); data in the PUT place is sent to the GET1 place, as well as the usual GET1-to-PUT. Two things to note about SWAP are that register-adjustments of zero, in the specified admodes, will probably be common, and the CCS flags will usually be masked. But consider that if the GET1 admode is 2 (same as PUT), then nothing happens. This may be the ideal thing for the Assembler to translate a NOP into. And if the flags are NOT masked while the GET1 admode is 2, during the generic SWAP, then this may be the ideal thing for the Assembler to translate a TST into. (If the flags aren't masked during a normal SWAP, then they will be affected only by the data going from the GET1 place to the PUT place.)

Now back to Bits 28-0 of the generic instruction; as mentioned, they hold the details of the PUT field data; we shall begin with Bits 0-6. These specify the Bitfield Start for the PUT field, from 0 to 127. After the 12864 processor analyzes the identity of the place where the result of an instruction is to be PUT, the Bitfield Start tells it exactly where in that location the result goes. For most instructions, most of the time, the value here will be Zero.

Bits 7-12 specify the number of the first register needed to identify the place where the result is PUT. In other words, if Register 7 is the destination of the data, then a 7 will be here (admode 1 in the PUT field). To modify flag bits in the CCS register, simply set a Bitfield Size of 5 (for 5 flags), the CCS register's number here, and a Bitfield Start of zero (assuming the designers put the CCS flags in the lowest bit-positions of the register). If a memory address indexed by Register 15 is the data's destination (admode 4 or 5 in PUT), then 15 will be the number placed here. Bits 7-12 can hold any number from 0 to 63, and as mentioned early in this essay, the 12864 will probably only have 45 registers or so, total. Anything more than the highest register number would be illegal, of course, even in the Boss mode! If admode 2 or admode 7 is specified in the PUT field, then the processor would ignore any register-number in these bits. Admode 3 would be another such case, except that it is illegal in the PUT field.

Bits 13-28 specify the offset or adjustment to be applied to the register indicated in Bits 7-12. At least this could be true for instructions OTHER than ASL, ASR, COPY, etc., because only those OTHER instructions leave the 7 bits from 22-28 free. An index register being used with a ROL instruction can only have a nine-bit offset or adjustment applied to it (in Bits 13-21, of course). HOWEVER, it can be worse! Bits 13-18 may specify a second register altogether! For admodes 0 and 1 with any Bitfield Size of 65 or more, we must specify 2 registers. For admodes 6, 10, 11, 14, and 15, a second register is a normal part of address-indexing. (At least those admodes get a 64-bit offset from the second register, applied to the first register.) After a second register has been specified, only the bits from 19-28, or from 19-21, can be used as an offset or adjustment to the second register (a 10-bit or a 3-bit modification, respectively). Here is a chart:

___________|2___________2|2___1|1_________1|1__________|_____________|
___________|8|_|_|_|_|_|2|1|_|9|8|_|_|_|_|3|2|_|_|_|_|7|6|_|_|_|_|_|0|
___________|__16-bit_offset/adjust_________|_First_____|__Bitfield___|
___________|__applied_to_first_Register____|_Register*_|__Start______|
___________|-------------------------------|___________|__for_PUT____|
___________|_ASL,_ASR,___|_9-bit_off/adj___|___________|__only_______|
___________|_COPY,_INIT,_|_to_1st_Register_|___________|_____________|
___________|_ISUB,_LSL,__|-----------------|___________|_____________|
___________|_LSR,_ROL,___|3-bit|_Second____|___________|_____________|
___________|_ROR__data___|_ad-_|_Register*_|_*will_be__|_____________|
___________|_____________|_just|_admode_0,_|_ignored___|_____________|
___________|_____________|_to__|_1,_6,_10,_|_in_admode_|_____________|
___________|_____________|_2nd_|_11,_14,___|_2,_7______|_____________|
___________|_____________|_Reg.|__and_15___|___________|_____________|
___________|-------------------|___________|___________|_____________|
___________|_10-bit_offset_or__|___________|___________|_____________|
___________|_adjust_to_1st_or__|___________|___________|_____________|
___________|_2nd_reg,_depending|___________|___________|_____________|
___________|_on_the_admode.____|___________|___________|_____________|
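Assuming exactly the bit positions charted above, a decoder for the PUT field's low 29 bits might look like this in Python (names mine). It handles the simple case -- the non-shift instructions, which get the full 16-bit offset -- and treats that offset as two's complement:

```python
# Sketch decoder for the 29-bit PUT field charted above.
# Bits 0-6: Bitfield Start; Bits 7-12: first register number;
# Bits 13-28: 16-bit signed offset/adjustment (non-shift instructions).

def decode_put(bits29):
    start = bits29 & 0x7F            # Bits 0-6
    reg = (bits29 >> 7) & 0x3F       # Bits 7-12
    off = (bits29 >> 13) & 0xFFFF    # Bits 13-28
    if off & 0x8000:                 # sign-extend the 16-bit offset
        off -= 0x10000
    return start, reg, off
```

For the ASL/COPY group, Bits 22-28 would instead be split off as the 7-bit data, leaving a 9-bit offset in Bits 13-21; that variation is omitted here.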

______And_just_to_be_complete:
____|6_________5|5_____5|5_____5|4_____4|4___________3|3|3_____3|3_______2|
____|3|_|_|_|_|8|7|_|_|4|3|_|_|0|9|_|_|6|5|_|_|_|_|_|9|8|7|_|_|4|3|_|_|_|9|
____|__12864____|_GET1__|_GET2__|__PUT__|__Bitfield___|_|_Do-If_|__CCS____|
____|__Instruc._|_admode|_admode|_admode|__Size,_for__|_|_Con-__|__Flag___|
____|__Code_____|_______|_______|_______|__entire_____|_|_dition|__Masks__|
____|___________|_______|_______|_______|__operation__|_|_______|_________|
_______________________________________________________^
________________________________________________Sign-Extension

Having used up 64 bits of the normal 128-bit fetch by the 12864 processor, it's obvious that to provide details of the admodes specified for GET1 and GET2, we will need to use the other 64 bits. Now it has already been stated that they are supposed to hold Immediate Data or an Absolute Address; the potential for conflict is obvious! This conflict is the main reason admode 2 was created: It makes the GET1 field use the admode in the PUT field, thereby eliminating any need for any specific GET1 information among the second 64 bits of the operation fetch. And if the GET2 field specifies admode 3 (Immediate Data) or 7 (Absolute Address), then THAT is all the GET2 information needed, and the instruction can be properly executed. So the main restrictions of limiting the GET1/GET2/PUT system to a total of 128 bits are these: (1) We can't combine Immediate Data with more Immediate Data; (2) We can't combine Immediate Data with data at an Absolute Address; (3) We can't combine the data at two Absolute Addresses; and (4) We can't use Immediate Data or an Absolute Address in any instruction where the GET1 admode is different from PUT. How much does it matter that we can't do these things? We already can't do them with any current processor, right? What we CAN do is far more important: Not only can we combine Immediate Data or the content at an Absolute Address with the content of any register (normal for any processor), we can also combine our Immediate/Absolute information with the data at any place in the memory that can be index-referenced -- and save it too! The typical 12864 program will probably be position-independent, anyway, and seldom need Absolute Addressing. It likely will start by loading several registers with the addresses of a number of data tables, all relative to the PC register. No Immediate Data there! Then the remaining registers will become variable-holders, and use Immediate Data as needed, just like any other program.
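The four restrictions condense into a tiny legality check. This is my own condensation, in Python: each GET field is classified as Immediate, Absolute, or plain, and a flag records whether GET1 uses admode 2 (same-as-PUT), since that is what frees the second 64 bits.

```python
# One reading (mine) of restrictions (1)-(4): the second 64 bits can
# be claimed by at most one Immediate/Absolute operand, and when they
# are claimed, GET1 must borrow the PUT admode so it needs no
# specific information of its own.

def combo_is_legal(get1, get2, get1_same_as_put):
    """get1 and get2 are 'imm', 'abs', or 'plain'."""
    claims = [m for m in (get1, get2) if m in ('imm', 'abs')]
    if len(claims) == 2:
        return False        # restrictions (1), (2), and (3)
    if claims and not get1_same_as_put:
        return False        # restriction (4)
    return True
```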

So to be a little more specific about how GET1 and GET2 information is set among the second group of 64 bits, let's first note that it took all of 29 bits for the PUT information. Keeping that the same for GET1 and GET2 means that 58 of the 64 bits get assigned real quick! Suppose we assign the Least Significant 32 bits to the GET1 information, and the Most Sig. 32 to the GET2 information. This leaves 3 bits extra for GET1 and 3 extra bits for GET2. The most obvious thing to do with the extra bits is to expand the offset/adjustment data (from 16 to 19 bits, for example), but perhaps they can be used for something else. Note that the ASL, ROL, etc. data takes space away ONLY from the PUT information. A possible use for one of the extra bits is that of being a flag controlling the Bitfield Start data: If the flag is zero, then the seven bits hold the number of the starting bit; if the flag is one, then six of the seven bits specify a register-number where the information on the starting bit is to be found. It would have been nice to have had enough bits to do this to the PUT field, but it may not be missed too much, since the PUT field's Bitfield Start is likely to be zero most of the time, anyway. So here is one more chart:

__The_GET2_info_duplicates_this_GET1_info,_except_Bit-numbers_range_from_32-63.
_______|3|3_____________________1|1_________1|1__________|_|___________|
_______|1|0|_|_|_|_|_|_|_|_|_|_|9|8|_|_|_|_|3|2|_|_|_|_|7|6|5|_|_|_|_|0|
_______|_|__18-bit_offset/adjust,_applied____|_First_____|_|_Register__|
_______|_|__to_First_Register________________|_Register__|_|_holding___|
_______|_|-----------------------------------|___________|_|-----------|
_______|_|_12-bit_off/adj_to_____|_Second____|___________|_Bitfield____|
_______|_|_1st_or_2nd_Register,__|_Register__|___________|_Start_data__|
_______|_|_depending_on_admode___|___________|___________|_for_GET1____|
________^
______Flag_determining_use_of_register_to_hold_Bitfield_Start_data

Now let's consider a few things about the 12864 Assembler. Obviously it's going to recognize many common assembly instructions, and translate them into the far smaller set of generic instructions recognized by the processor. The set of 12864 instructions may be enlarged, simply to take advantage of the possible list of 64. Ordinary instructions like ADD, SUB, ADDC (add with carry), SUBC, ABCD (add Binary Coded Decimal), SBCD, OR, EOR (exclusive or), and AND may be supplemented with NOR, ENOR, and NAND. I don't propose to offer a complete list here; let the Industry decide all the final details. The main thing that needs some attention right now is the format of the Assembler instructions; each will occupy a fair amount of space! But this is reasonable, considering that a 12864 instruction will usually equal 2, and often 3, regular-processor instructions. All the information from 3 regular lines of Assembly code, plus some new stuff, has to fit on 1 line in this proposed Assembler format:

|Label|Instruc_|Bitfield|ASL,_etc|_GET1_|_GET2_|_GET3_|Do-If|Flags_|Comment
|field|Mnemonic|_Size___|7bt_data|admode|admode|admode|cond.|masked|_field

The Label field gives this place in a program an optional name, so that it can be referred to from other places in the program, if desired.

The Instruction Mnemonic is, of course, the name of the instruction.

Bitfield Size (BfSz) is simply a number 1-128; if this part of the Assembly format is blank, a 128 Size is assumed -- but Admode data may change it to 64.

7-bit data is required in this area whenever the Mnemonic is ASL, ROL, etc. The nature of this data has already been described. Note exceptions like SWAP and COPY, which the Assembler knows never need this data. The Assembler offers INC and DEC instructions that will require 7-bit data; the programmer never need see this get translated to COPY. Exceptions are peculiar, aren't they?!

Below are examples of the syntax for the addressing modes:

Admode__Syntax____________________Explanation

0000____16;+20.33______Register 16 has data. A Bitfield Start (BfSt) of 33 is specified, so data extracted from 16 starts at Bit 33 (BfSz specifies how many bits). Extracted data will have 20 added to it (register not affected), before being given to the current instruction. An assumed BfSz of 128 would change to 64 (one register specified) minus 33 (due to the BfSt), leaving 31 bits. Conflicts cause Assembly errors.

0000____6 9.18_________Two registers have data: 6 is Most significant; 9 is Least significant. 9 is the First Register in Bits 7-12 of the Specific Information area; recall charts. Data extracted from registers starts at Bit 18.

0000____20:10__________Data in register 20. BfSt in register 10. Note that a period denotes exact BfSt data; a colon means a register has the data. BfSt-in-a-register is illegal in PUT.

0000____10 11__________Two registers have data. BfSt is assumed to be zero. Note spaces denoting registers. Other admodes will use commas, and admode fields must be tabulation-separated.

0000____7 3;-123_______128 bits of data available in registers 7 and 3. BfSz determines how many are extracted. BfSt assumed to be 0. 123 subtracted from it before instruction gets it.

0000____URHERE PC:12___The Assembler will accept either 'PC' or the actual number of the PC register (as yet unknown!). Assembler computes offset between content of PC register (what it will be at end of instruction) and place in memory that is designated by label 'URHERE'. Suppose PC register is

0000____PC;-87:12____\_____34, and the programmer knows that the offset is -87:
0000____URHERE 34:12__>__would be identical to URHERE PC:12. PC is the only
0000____34;-87:12____/_____register to which labels can be referenced, because it is the only register that has its value known at all times by the Assembler (relative to Origin of program). Note the :12 means BfSt is in register 12 (no, I don't know why the programmer wants that in this example!).

0001____20+20.33_______Check first example; note lack of semicolon here. Plus or minus sign mandatory for all offsets and adjustments. With plus adjust, data first goes from register to the instruction. At END of instruction, data in register is adjusted. With minus adjust, register is adjusted before data extracted from it for use by instruction.

0010___________________To specify admode 2, simply leave the field BLANK!

0011____#123456________Immediate data preceded by #. A + or - is optional.

0100____,20;+20.33_____Check first example; note extra comma here. Register 20 now being used as index, with an offset of 20 applied to it. The sum (index+offset) is the address from which data is removed, starting at Bit 33. The data may extend to Bit 127, depending on the BfSz. Note that initial index is always just ONE register.

0100____,14:2__________Register 14 is index of address holding data; register 2 has data on where is the BfSt. Specifying offsets, adjustments, or bitfield starts is always optional.

0100____URHERE,PC______Note use of comma. Assembler computes offset between PC and URHERE, as before. In previous example the ADDRESS of URHERE was the information (ignoring fact that BfSt specified in that example made the address useless!); now the memory content at that address is the data.

0101____,20+20.33______Check similar examples. Here register 20 is an index which is used to fetch data. Afterwards, an adjustment of +20 is applied to the register. If the adjustment is negative, it is applied to the index before the index is used as an address-pointer that tells us where data is.

0110____,10 12-14:5____Register 10 is the index, holding an address. Register 12 has a 64-bit offset to that address. Before offset is applied, register 12 receives adjustment of -14. The address thus found (by applying adjusted offset to register 10) is the address of the data, which will be accessed using the BfSt data in register 5.

0111____>123456________Absolute Address always preceded by > symbol. Out of 18.4 quintillion possibilities, this one is pretty low!

1000____[,20;+20]L.33__See admode 0100; the part inside brackets is figured in exactly the same way, resulting in an address. Exactly 64 bits are always extracted from that address in this admode. They are the Least Significant 64 bits, as the L indicates. The 64 bits are then used as the address of the data, which starts at Bit 33.

1000____[URHERE,PC]L___The lowest 64 bits of the data at address URHERE are extracted and used as an address. The instruction gets its data from the address thus found. The L (or M, in admodes 12-15) is a mandatory part of the syntax. All programmer does is provide correct syntax; the Assembler will deduce from that syntax the admode number, and the specific info, that are built into the instruction.

1001____[,25-2222]L:13_The value in register 25 is adjusted by -2222 (maximum can be -32768 in PUT before an assembly error occurs, or -131072 in GET1 or GET2), and then the adjusted index is used to fetch an address (least significant 64 bits). In turn the fetched address is used to fetch the needed data, using the BfSt in register 13.

1010____[,5 9-873]L____See the example for admode 6 (0110); bracketed syntax is analyzed the same way, this time using register 5 as the basic index, register 9 as holding the 64-bit offset, and -873 as the adjustment applied to the offset, before the offset is applied to the index. The address thusly computed is the place from which the Least 64 bits are taken and, in turn, are used as the address to fetch the data. Note that -873 is too big for PUT information, but would work as GET1 or GET2 information.

1011____[,18]L 6+3_____Value in register 18 is used as an address to fetch an address from the memory. Least significant 64 bits are taken from memory to become an address. Register 6 has a 64-bit offset, which is applied to the extracted address. The thusly-computed new address is the place where data will be found. (I say 'found' or 'fetched', but address is also a possible place to PUT the data.) Afterwards, register 6 is adjusted by +3. This example, if in the PUT admode field, and if the instruction is LSL or one of that group, is using the largest positive allowable adjustment (3 bits, twos-complement). What's the chance of having only 32 generic instructions, so we can move a bit to the PUT information field?

I don't think I need to provide any examples for admodes 12-15; they are identical to admodes 8-11, with the sole exception that the letter L in the syntax is replaced by M. The Assembler uses L and M to determine the correct admode; the 12864 processor uses the admode to determine that either the Least Significant or the Most Significant 64 bits are to be taken from the memory and used as an address. This process has absolutely nothing to do with Bitfield Sizes and Bitfield Starts.

It should be repeated that these examples are only a proposal; thinking about them is bound to lead to speculation about how easily the programmer can make a mistake by forgetting a comma. A whole different syntax might be created just to reduce the chance of such accidents, perhaps one where mnemonic letters replace the commas, periods, colons, and semicolons -- even lower-case letters, to prevent confusion between O/offset and 0/Zero. This syntax simply attempts to make the admode-field information compact.

The next field of the Assembler format, after the PUT admode field, is the Do-If condition. Two letters suffice to abbreviate the possible conditions (at least only 2 letters if Motorola's list is used): HI (higher); LS (lower or same); CC (carry flag clear); CS (carry set); NE (not equal to zero; zero flag clear); EQ (equal; zero flag set); VC (oVerflow flag clear); VS (oVerflow set); PL (plus); MI (minus); GE (greater than or equal to zero); LT (less than zero); GT (greater than zero); and LE (less than or equal to zero). This list totals 14 possible Do-If conditions; with a maximum of 16 allowed, the last two are usually Do Always and Do Never. For the purpose of the Assembler format, the Do Always condition can be the default if the Do-If field is simply left blank, but it wouldn't hurt to allow a DA abbreviation. A DN abbreviation is logically sensible, but practically almost useless -- a NOP for sure! (If the Assembler converts NOP to SWAP, as proposed, obviously the Do-If would be Never!). Maybe some other Do-If condition can be created, just to use that 16th possibility.
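The whole Do-If list can be written out as boolean formulas over the flags. These are the standard Motorola 680x0 condition definitions; applying them unchanged to the 12864 is my assumption, and the Python wrapper is only a sketch.

```python
# Evaluating the Do-If conditions from the N, Z, V, and C flags.
# The formulas are the standard 680x0 condition-code definitions.

def do_if(cond, n, z, v, c):
    table = {
        'HI': not c and not z,    'LS': c or z,
        'CC': not c,              'CS': c,
        'NE': not z,              'EQ': z,
        'VC': not v,              'VS': v,
        'PL': not n,              'MI': n,
        'GE': n == v,             'LT': n != v,
        'GT': n == v and not z,   'LE': n != v or z,
        'DA': True,               'DN': False,   # Do Always / Do Never
    }
    return bool(table[cond])
```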

After the Do-If field in the Assembler instruction format is the Flag Mask field. Motorola's flags are abbreviated X, N, Z, V, and C, so simply putting an appropriate letter (or letters) in this field should tell the Assembler that you don't want a particular flag to be affected by the current instruction. Simply entering ZCN without any punctuation should be adequate to specify the Carry, Zero, and Negative-sign flags, for example. Now consider the opposite notion: Some Assembler instructions, like LEA, will be translated into other operations, and the flags will automatically be masked by the Assembler during translation. In the 6809 processor there are two registers Y and U, which are not treated the same by LEA instructions. LEAY will affect the Zero flag, while LEAU will not. The idea is to let register Y be used in counting loops, and it works fine. The 12864 Assembler could allow the same sort of thing: If the programmer specifies the Z flag in the Mask field during an LEA instruction, then the Assembler WON'T mask the flag! More precisely, what is happening is the programmer telling the Assembler to reverse its normal handling of the 12864 flagmask bits. If the Assembler usually doesn't mask a flag, then it will be masked -- and vice-versa.

The last field of the Assembler format is the Comment field, in which the programmer is supposed to explain the purpose of the instruction. This field is completely ignored by the Assembler, of course, during the task of creating the machine code for the 12864 processor from the assembly source listing.

And now my two-cents-worth on the hardware of the 12864 computer; if what I am about to say is really worth as much as two cents, I'll be surprised! The average computer has a System Clock that controls the timing of everything that goes on in the computer. The average microprocessor accesses the memory every (fill in blank) cycles of the System Clock, on the average. The remaining clock cycles are spent by the processor processing the data it has accessed. Some of the newer processors have 'preprocessors' built into them, so they can access the memory significantly more often. The preprocessors begin working on future instructions before the main processor finishes the current instruction; it is known as 'pipelining', I believe. The 12864 will be both similar to and different from this scheme. It'll likely have one main processor for the main instruction, and 3 subprocessors to handle the data represented by GET1, GET2, and PUT. It figures that if the average 12864 instruction is as complex as 2 or 3 regular-processor instructions, the 12864 may have to do as many memory-accesses as 2 or 3 'regulars'. Yet by processing GET1, GET2, and PUT simultaneously, the 12864 is essentially doing the work of the 'pipeliners'. Whether or not pipelining of the current sort is actually built into the 12864 remains to be seen. In the meantime, though, the 12864 is still going to spend a number of clock cycles in between memory-accesses, during which it is processing the accessed data. Since it is fairly obvious that the more often a processor can access the memory, the greater the performance of the computer, the standard trick is to increase the speed of the System Clock, and to build both processors and memory chips that keep up. Nevertheless, this does not change the fact that the processor spends many clock-cycles NOT accessing the memory! And I get the impression that the memory chips are not keeping up with the processors, in the speed race.
So here is my suggestion: Build the 12864 with a faster clock than the System Clock. It will have to hold its outside lines open for more than one internal clock cycle each time its subprocessors access memory (to stay in sync with the System Clock), but while it is doing that, its main processor can be manipulating previously-accessed data. With proper planning the 12864 should be able to access memory almost every cycle of the System Clock, at the memory's maximum possible speed.

I have been saving the thorniest problem for last (at least I think the end of this essay is approaching!), and it concerns the hardware's management of the data. The first part of the problem is this: While most 12864 instructions are 128 bits long, many will be fully described in only 64 bits. So do we make the processor skip the other 64 bits, and move on to the next memory location, or do we scheme to fit another whole instruction in those 64 bits? My inclination is to ignore the 64 bits, UNLESS it 'just happens' that two adjacent instructions in the assembly source listing can both be reduced to 64 bits. In other words, what the processor would do is load 128 bits, discover that the first 64 of them comprise a complete instruction, execute that instruction, and test the next 64 bits to see if they also comprise a complete instruction. If they don't, they will be ignored, and the processor will load 128 bits from the next address. It would be worth having this scheme just to give the programmers a chance to prove they are clever enough to always make full use of it. Any programmer who NEVER attempts to conserve memory should be fired! (And so what if there are more than 18.4 quintillion memory locations -- waste is waste.)
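The skip-or-pack scheme amounts to a small decode loop. Here is a Python sketch of it (my names throughout); the test deciding whether 64 bits form a complete instruction is left as an abstract predicate, since that encoding detail is not pinned down above.

```python
# Toy fetch loop for the skip-or-pack scheme: each 128-bit location
# holds either one long instruction, one short instruction plus
# padding, or two packed short instructions. 'is_short' is a
# placeholder predicate supplied by the caller.

def fetch_stream(memory, is_short):
    """memory is a list of (lo64, hi64) pairs; yields instructions."""
    for lo, hi in memory:
        if is_short(lo):
            yield ('short', lo)         # first half stands alone
            if is_short(hi):
                yield ('short', hi)     # two instructions packed together
            # otherwise the high half is padding and is skipped
        else:
            yield ('long', (lo, hi))    # one full 128-bit instruction
```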

The other aspect of the memory management problem concerns the Stacks, which are places where random numbers of registers are temporarily stored. If each address a Stack register points at holds 128 bits, and each register being saved is only 64 bits wide, then it seems at first obvious to always put 2 registers at the Stack address. But many times an odd number of registers will be saved; what then? The very simplest answer is to always only store 1 register at each Stack address, and ignore the obvious waste, because this way the processor can never get confused. The next-simplest answer may be to REQUIRE the programmer to always PUSH or PULL an even number of registers when using the stack -- even a JSR (jump to subroutine) instruction would have to save another register with the Program Counter, just to keep the total even. I think I may recommend this particular solution (would you believe I have been worrying about this since the middle of this essay, and just now have come up with the idea?).
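The even-count rule is easy to model: registers go onto the stack two per 128-bit location, and an odd request is rejected the way the Assembler presumably would reject it. A sketch under those assumptions, with invented names:

```python
# Sketch of the even-count PUSH rule proposed above: two 64-bit
# registers fill each 128-bit stack location, so odd counts are
# refused outright.

def push_registers(stack, regs, numbers):
    if len(numbers) % 2 != 0:
        raise ValueError("must PUSH an even number of registers")
    for a, b in zip(numbers[::2], numbers[1::2]):
        stack.append((regs[a], regs[b]))   # one 128-bit location each
```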

The bit-code format of instructions like JSR, BSR, PSH, PUL, and MOVEM can't be the same as the format for most 12864 instructions. The main reason is, as mentioned, that the instruction has to incorporate a list of registers -- but it works out OK, because much of the instruction is predefined. Before we get into any details of that, though, let us examine the Stacking system a little closer. In the 6809 there are two Stack registers, one of which is always used by the hardware to save JSR and interrupt information, and one of which the programmer can use for other things. There are occasions when having two Stacks is really convenient, notably when moving large blocks of data around. In the 68020 there are three Stack registers, one for the Boss mode, one for the Interrupt mode, and one for the Peon. Two bits in the CCS register are devoted to keeping track of which Stack the hardware is using at the moment, so if it had been wanted, a fourth Stack could exist in the 68020. This seems worth putting in the 12864. And another thing: TWO CCS registers! One would be a Boss mode CCS that keeps track of things like the current Stack being used and interrupt-control flags, as well as the list of registers to be saved during an Interrupt, as proposed at the beginning of this essay. The other would have the instruction-result flags in it and some other stuff. MOST of that other stuff is another register list, like that in the Boss CCS. Thus when a GSR instruction is used (generic for JSR and BSR: go to subroutine) a list of registers could be specified that would be saved in the Peon CCS. Here is a proposed bit-map for GSR:

_____________|6_________5|5_____5|5_____5|4_____4|4________________|
_____________|3|_|_|_|_|8|7|_|_|4|3|_|_|0|9|_|_|6|5|_|_|.....|_|_|0|
_____________|_Code_For__|_GET1__|_GET2__|_Do-If_|_Register_List___|
_____________|___GSR_____|cannot_|_______|_(PUT__|_Note_Peon_CCS___|
_____________|Instruction|__be___|_______|_is_PC_|_and_PC_registers|
_____________|___________|admode_|_______|always)|_not_on_list;____|
_____________|___________|___2___|_______|_______|_always_saved.___|
_____________________________________^
________If GET2 is admode 2 then data specified by GET1 is copied to PC -- equivalent to JSR. If GET2 is any other admode then the data it specifies is added to the data GET1 specifies, and the result is copied to PC. If GET1 specifies PC then we have a BSR equivalent. The CCS instruction-result flags are NEVER affected by this one. Normal limitations: No adding Immediate Data to Absolute Address!

It is worth noting that the Register List, from 0 to 45, is in agreement with the early estimate of approximately 45 registers total for the 12864. If there are any registers that we can be sure NEVER need to be saved during a GSR, even during the Boss mode, then we can have a few more than the 46 implied here. When executing a GSR, the processor would copy the specified register list to the Peon CCS register, save them all on the current Stack, THEN save both the PC and Peon CCS registers. When an Interrupt occurs, the last two registers saved would always be PC and the Boss CCS (although the Peon CCS would be saved just before then). One bit in the same place in the two CCS registers would serve to identify which is which; this bit cannot be allowed to be changed by anything. Then when the generic RTN (return) instruction is executed, 128 bits of PC and CCS data would be taken from the memory; the correct CCS would be identified, and the correct way of returning would follow. One thing to note about RTN from a subroutine: The instruction is almost completely pre-defined. The only odd thing is that the values of the instruction-result flags in CCS BEFORE the RTN occurs have to be preserved while CCS data is being loaded from the Stack during the actual RTN operation. Unless various flag-masks are set by the programmer! The bit-coding of RTN only needs 6 bits for the instruction, 4 bits for Do-If, and 5 bits for flag-masks (flags the programmer does not want preserved during the RTN from a subroutine); the rest of the 64 bits can be ignored. Programmers should be wary of specifying any flag-masks for RTN at the end of an Interrupt handling routine, since here the normal thing for the processor to do is to NOT preserve the flags, as they exist at the end of the Interrupt handler. Masking them would mean transferring Interrupt data to the interrupted program. This would be OK if the interrupted program was specifically waiting for such....

PSH, PUL, and MOVEM-type instructions can all be combined into one generic, I think, that we can call STAK. The bit-coding for it might be like this:

_______|6_________5|5_____5|5_____5|4_____4|4____________________|
_______|3|_|_|_|_|8|7|_|_|4|3|_|_|0|9|_|_|6|5|_|_|_|.....|_|_|_|0|
_______|___STAK____|Control|_Do-If_|__PUT__|__Register_List,_to__|
_______|instruction|_Bits__|_______|_______|____be_stacked_or____|
_______|___code____|_______|_______|_______|______unstacked______|
_______|___________|_______|_______|_______|_____________________|
_______PUT specifies the address where the stack is to start. If LOCATION OF ADDRESS is in memory elsewhere, one Control Bit denotes L or M for the 128 bits at that location, from which the stack's address will be fetched -- no bitfield specs! After STAK is finished, the PUT place is given a new value, indicating the new start of the stack. (Immediate Data is still forbidden in PUT, of course.) One Control Bit specifies top or bottom of stack; another Control Bit specifies whether data is being added to or removed from the stack. As always, only an EVEN total number of registers may be specified. Bit 54 means that the Peon CCS register is part of the stack operation. STAK never affects flags, except when loading CCS from this kind of stack. (I forgot to say, the details of PUT can be in the other 64 bits of the instruction fetch.)
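The STAK bookkeeping above can be sketched as follows; the register numbering 0-45 matches the Register List field, but the direction conventions and the function shape are invented for illustration:

```python
# Rough sketch of STAK bookkeeping. Because two 64-bit registers share
# each 128-bit memory location, the stack pointer moves by half the
# register count (addresses here count 128-bit locations, as in the
# essay's memory model). Direction conventions are assumptions.

def stak_plan(register_list_bits, push=True, grow_down=True, sp=0x1000):
    """Decode a STAK register list and compute the new stack start.

    register_list_bits -- bits 45..0 of the instruction; each set bit
                          names one 64-bit register to (un)stack.
    Returns (registers, new_sp).
    """
    regs = [r for r in range(46) if register_list_bits >> r & 1]
    if len(regs) % 2:  # only an EVEN total number of registers is legal
        raise ValueError("STAK requires an even number of registers")
    locations = len(regs) // 2          # two registers per 128-bit word
    step = -locations if grow_down else locations
    new_sp = sp + (step if push else -step)
    return regs, new_sp

regs, sp = stak_plan(0b1111, push=True, sp=0x1000)
# four registers -> two 128-bit locations; sp moves from 0x1000 to 0x0FFE
```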

That about wraps it up, I guess. Any inconsistencies you may have noticed are due to the fact that this is only a proposal, and therefore does not need to be perfect. Only if the Industry decides to get together to create a standard microprocessor along these lines would it be necessary to get really finicky on all the details. And what do I want out of this? First of all, I want to beat the NIH Syndrome: 'If it is Not Invented Here, we are not interested!' Except for the fact that computers I own and know well happen to have 6809s in them, I am not associated in any significant way with any company in the entire computer industry. I will claim the credit for dreaming up this thing, just to prevent anyone else from doing so -- and just to prevent any person or any company from claiming ownership of it, I am quite deliberately placing this whole concept in the public domain, as of NOW. Thus the whole industry starts off on an equal basis with respect to the proposed 12864 microprocessor, and there should now be no barrier to creating an industry-wide standard. I am knowingly forfeiting all legal claim to any compensation for these ideas, just to prove I seriously want the Industry to get its act together. On the other hand, any 'royalties of conscience' that might come my way will be gladly accepted!

Vernon Nemitz

March 17, 1991

NOTE ADDED JULY 11, 2001: It wouldn't be so tough to build one of these processors today. The Industry is as non-uniform as ever. The first 80486 DX2 chips began appearing in late 1991. Very Long Instruction Word processors are also being designed and built, to different specs than those described here. But the Industry is STILL putting only 8 bits of data at one address, even as it ramps up the production of 64-bit processors. What a mismatch! Meanwhile, 64-bit addressing looks to be a stable quantity for 20 or 30 more years. Maybe I'll try to promote a 25664 microprocessor...128 bits wasn't really QUITE enough, for all the variety of instructions that I had in mind, and considering all the new multimedia instructions, well, why not!
-- Vernon, Jul 11 2001

For 6809 lovers... http://www.vavasour.ca/jeff/trs80.html
CoCo and Dragon computers can be emulated on Pentiums [Vernon, Jul 11 2001, last modified Oct 04 2004]

Homebrew FPGA CPU http://www.fpgacpu.org/
Design your own 32-bit (or more) RISC cpu, implement it on a gate-array, and share your results with other hobbyist CPU designers. [wiml, Jul 11 2001, last modified Oct 04 2004]

One bit processor in use http://srtm.die.uni...vers/comunicati.htm
by NASA no less... [phoenix, Jul 11 2001, last modified Oct 04 2004]

(?) Athlon electrical & mechanical data sheet http://www.amd.com/...hdocs/pdf/23792.pdf
What are all those pins for? Find out here. [wiml, Jul 11 2001, last modified Oct 04 2004]

Return to Top http://www.halfbakery.com/idea/12864
[The Military, Jul 11 2001, last modified Oct 04 2004]

(?) The totally, like bitchin' 12864. http://www.80s.com/...tainment/ValleyURL/
Enter the URL of this idea. [angel, Jul 11 2001, last modified Oct 04 2004]

(?) Makes technical jargon even more revolting, er more accessible to word on the street http://www.pornolize.com/
copy/paste URL from here... [thumbwax, Jul 11 2001, last modified Oct 04 2004]

(?) For [POYF], one of 3,510 hits on Yahoo. http://www.11a.nu/lempelziv.htm
I'm surprised you don't know of this. It's used in GIF encoding. [angel, Jul 11 2001, last modified Oct 04 2004]

(?) CodePack Compression for PowerPC http://www.chips.ib...pc/cores/cdpak.html
On-chip compression for PowerPC cores. [JKew, Jul 11 2001, last modified Oct 04 2004]

(?) More ancient technology revisited http://www.shouldex.../6805;mode=moderate
In this case, 6502-inspired. Probably relevant to 6809 issues. [LoriZ, Mar 01 2002, last modified Oct 04 2004]

AMD Hammer http://www6.tomshar...1/020227/index.html
Looks like (roughly) a 31x31 grid of pins... [Vernon, Mar 02 2002, last modified Oct 04 2004]

(?) Transmeta Announcement http://www.pcworld....0,aid,101516,00.asp
Transmeta announces plans for a 256-bit wide Very Long Instruction Word processor. I wonder.... [Vernon, May 31 2002, last modified Oct 04 2004]

12864 (again) http://www.nemitz.net/vernon/12864.htm
In theory, same essay, in unmangled format [Vernon, Oct 04 2004]

Transmeta http://www.transmeta.com/
Looks like the 256-bit processor is well into production [5th Earth, Feb 16 2005]

Intel fully embracing 128-bit data crunching http://arstechnica....paedia/cpu/core.ars
Now all they need is an efficient instruction set, to access multiple instructions per data-fetch. [Vernon, Apr 14 2006]

Hack-a-Day: Minecraft articles http://hackaday.com/?s=minecraft
Mostly building processor simulators in a games engine [Dub, Nov 14 2010]

Um, okay, fishbone for wibni.
-- phoenix, Jul 11 2001, last modified Jul 12 2001


Wow. After scrolling for three or four minutes, I was thinking, "Hey, Vernon really has some competition now!" Then I got to the bottom after another three or four minutes of scrolling (just scrolling, mind you, not reading much) and found out that no, there is no competition, Vernon is still the reigning champ of verbosity.

Time for character limits on the idea textarea, maybe?
-- PotatoStew, Jul 11 2001


I had wondered why we had not heard from Vernon for what seems like months...evidently he's spent the entire time typing.
-- beauxeault, Jul 11 2001


FOR ANYONE starting to read the essay, I recommend going down to the end of it, and clicking on the link "12864 (again)". You will find the essay in its intended format to be MUCH more clear about certain things. Enjoy!

phoenix, what is "wibini"? I don't recall encountering the term before. If it's an acronym, I can assume it is derogatory ('fishbone' clue), but I'm used to that, so let 'er rip!

PotatoStew, as thorough as this microprocessor description tries to be, it is still only half-baked. I only wish I could have posted it on the Internet 10 years ago, when it was written. If actually baked, the idea will require a whole book for all the details.

Rods Tiger, we really do need more than mere 16-bit processors. The maximum ordinary number you can fit into 16 bits is only 65535, and how often do you need to manipulate numbers larger than that? To do so efficiently means using a processor that handles more bits! Since the proposed 12864 handles 128 bits of data at a time, ITS maximum number is 18-quintillion SQUARED. Only mathematicians and physicists regularly need numbers anywhere near that large (or larger). Next, with respect to floating-point calculations, the average 80-bit coprocessor (mentioned in the essay) has a 64-bit mantissa and a 16-bit exponent. This HAS proved to be sufficient for most purposes, most of the time. Floating point calculations on the 12864 could quite naturally still use a 64-bit mantissa (in one register), and a 64-bit exponent (in another register). The accuracy of the calculations would remain the same; only the overall size of the maximum number that could be manipulated would increase. Next, I tend to think that if the 6809 had been equipped with 16 data lines, and Motorola had provided a decent large-memory-management chip, then Intel's 8086 wouldn't have stood a chance, because the instruction set really is that efficient and versatile. A 1-Mhz 6809 could stay neck-and-neck with a 4-Mhz 8086. So, if you study the essay, you will find that I kept just about all of that versatility, and made it even more efficient. (For those who don't know, Motorola commissioned a special operating system, "OS9", which is sufficiently UNIX-like to be able to do very good multitasking. So NASA put 6809-based computers in the Space Shuttle.)
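The number-range claims here are easy to verify; a quick Python check (not part of the original annotation):

```python
# Checking the maximum-value claims: 16-bit, 64-bit, and 128-bit
# unsigned ranges. "18 quintillion squared" is shorthand for
# (2**64)**2 == 2**128.

max16 = 2**16 - 1    # 65,535 -- the 16-bit ceiling complained about
max64 = 2**64 - 1    # about 1.8 * 10**19, i.e. roughly 18 quintillion
max128 = 2**128 - 1  # about 3.4 * 10**38

assert max16 == 65535
assert (max64 + 1) ** 2 == max128 + 1   # "18-quintillion SQUARED"
assert 1.8e19 < max64 < 1.9e19
```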

I see nothing wrong with your suggestion that a Digital Signal Processor have Direct Memory Access capabilities, similar to hard disk and video controllers. I don't see any reason to lessen the capabilities of the processor, though. One of the elements of many modern games is a large amount of PHYSICS calculations (so that when large objects fly through the air and impact massively, both the flight and the results of the impact look realistic). NO mere 16-bit processor can keep up with all those calculations, pertaining to all those objects moving around on the screen, these days!

beauxeault, I really have lots of things on my agenda. The more I want to sleep, the less I can play on the HalfBakery. And, this idea really was written 10 years ago. There was a recession, and I was experiencing unemployment, and had time on my hands....

waugsqueke, the essay is still there. Feel free to read it as many times as it takes to sink in. Even I had a hard time digesting it, when I discovered it on a disk that has been sitting in storage for 10 years, and reread it. I haven't been mentally in the Assembly Language groove in a long time. But after I saw what angel wrote in the "Restore Duelling" area, I had two reasons for posting this ;)

UnaBubba, you have the date wrong: it was 1991 when I wrote what you quoted. Some of the rationale for that statement is explained in my reply to Rods Tiger; the only real mistake is that phrase "any computer". Even back then I knew that NO computer will EVER be powerful enough for mathematicians and physicists! More of the rationale is this: If I recall correctly, in 1991 a decently-sized hard-disk drive had maybe 30 megabytes of storage space. Now it is ten years later, and it takes 1000 times that, 30 gigabytes, to be a decent size.

Please note that the maximum number of distinct memory locations that a 32-bit processor can directly access is 4,294,967,296. Because RAM is more expensive than disk space, a "virtual disk" is a piece of special software that sits between the processor and the disk; when the processor wants memory outside of the RAM, the special software accesses the disk and fetches data and feeds it to the processor, AS IF the RAM had actually been there. Today's hard disk drives can easily provide more simulated RAM than a 32-bit processor can pseudo-directly use, and this fact actually should be interpreted to mean that the processors are behind the technology-development curve!

If the above disk-storage trend continues, then in 10 more years a decent hard drive will store 30 trillion bytes, and 10 years after that, 30 quadrillion bytes -- and 10 years after THAT (30 years from now), 30 quintillion bytes. WOULD THIS HAVE BEEN A REASONABLE PROJECTION, TEN YEARS AGO? If you say NO, then what you quoted is almost perfectly accurate!!! Because, remember, a 12864 processor will expect to find 128 bits of data (not the current standard of merely 8 bits) at every numerical address -- so even a 30-quintillion-byte hard disk drive can only provide about a tenth of the "data space" that this processor could work with!
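That projection, and the "about a tenth" claim, check out as plain arithmetic. A rough Python sketch, taking the x1000-per-decade trend and the 128-bits-per-address memory model straight from the paragraph above:

```python
# Verify the disk-trend projection and the "about a tenth" claim.
# The 12864's data space: 2**64 addresses of 16 bytes (128 bits) each.

drive_2001 = 30 * 10**9               # 30 gigabytes, as stated
drive_2031 = drive_2001 * 1000**3     # three more decades at x1000 each
data_space = 2**64 * 16               # bytes addressable by a 12864

assert drive_2031 == 30 * 10**18      # 30 quintillion bytes
assert 0.10 < drive_2031 / data_space < 0.11   # indeed "about a tenth"
```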
-- Vernon, Jul 12 2001, last modified Jul 19 2001


[Vernon]: Are you saying it's *my* fault that you posted this?
'wibini' - WIB incredibly NI
I also have a soft spot for the 6809; it was the heart, soul and guts of my first computer, the Dragon 32. Further to what [Vernon] suggests, it could knock the socks off a 4MHz Z-80, not just the 8086. As for the rest, I'll leave it to you guys.
-- angel, Jul 12 2001


UnaBubba, thanks for the acronym. I see that angel provided the variant that includes that middle 'i'.

angel, when I discovered the 10-year-old file that I posted here, at first I wondered if I should, because it was indeed so long. But I had two recent reminders (yours was one) that technological notions are generally the preferred thing to post on the HalfBakery. OK -- take THIS, heh heh. Besides, it IS a half-baked idea, and even after ten years, it still isn't outdated!

Rods Tiger, my first computer was a CoCo, and I also have seen Motorola's reference sheets for the 6809. Did you know that one of the unusual video modes, in all the TRS-80 CoCos, has a hardware bug that is directly derived from those design sheets? Makes you wonder about how many other so-called tech companies (like Tandy) are really just marketing machines....
-- Vernon, Jul 12 2001


Wow, Vernon, are you an interesting person....
-- PotatoPete, Jul 12 2001


[Vernon] Wasn't doggin' you with the wibni, buddy. Wibni's are discouraged @.5B

Anyone ever see a one bit computer? (No, really)
-- phoenix, Jul 12 2001


Whew. Vernon, I think you're halfway to inventing RISC processors. You should take a look at the design of the MIPS architecture and then, perhaps, the SPARC and PowerPC architectures.

One thing you should consider is that one of the bottlenecks on modern processors is the transfer rate between main memory and the CPU. By making all instructions 64/128 bits long, you've greatly reduced the code density (in terms of useful operations per bit). Sure, you can make the data path to main memory wider, but this makes the machine much more expensive (more pins, more circuit traces, more sockets, more fiddly bits of metal to manufacture, test, and have go wrong). On modern machines, the core CPU talks to a cache controller (which is on the same chip), and the cache controller talks to memory via a data path of whatever width provides the best performance/cost, regardless of the width of the instruction set.
-- wiml, Jul 12 2001


wiml, I'm pretty sure that RISC processors were around in 1991, and that I knew about them back then. In one sense, it depends on how you define RISC. The 6809 had its working registers named A, B, D, X, Y, U, and S, and so the assembly-language included instructions (example, to stow contents of a register into the memory) that were literally named STA, STB, STD, STX, STY, STU, and STS. Now, if I decide to merely NAME things differently, so that I have a generic STO instruction, then the Assembler still needs to know which register and what place in the memory. If the original syntax was STA >10000 and the new syntax is STO A,10000 ---well, somebody might SAY they have reduced the instruction set, but in actuality the opcodes to the processor haven't been changed a bit. What I have described in the essay is simply alternate ways to do some of the same long list of instructions as before. What I really was independently inventing was Very Long Word Instructions, because I thought it would be more efficient to fetch the equivalent of three normal instructions (get something, do something to it, put result somewhere) all at once. Have you noticed that recently, with processors like the Athlon and Pentium 4, when the manufacturers bump up the CPU clock speed from (say) a multiple of 13 to a multiple of 14 times the system bus, the overall performance doesn't usually go up very much? The reason is that most of the instructions that the CPU loads don't take 13 or 14 clock cycles to process! ESPECIALLY with the built-in pipelining and duplicate internal pathways that let the CPU process MORE than one instruction in a single CPU clock cycle. SO, I SAY, FETCH MORE INSTRUCTIONS AT ONCE, to keep the CPU busier!!! This means we NEED a wide path to the memory. And as for the number of pins, well, the 6809 had 40, of which 8 were data lines and 16 were address lines. The other 16 were things like power, ground, hardware-interrupt signalling, clock signalling, etc.
Okay, granting that a powerful processor needs extra power and ground and clock-signal pins, I hear that the Pentium 4 has more than 400 total! Only 32 of them are address lines, and somewhere I gathered the impression that maybe 64 of them were for data, but 300+ pins for all the other overhead seems ridiculous. The proposed 12864 processor would need 192 pins for data and addresses. In 1991 I WAS concerned about such a large number, but I am not bothered about it nowadays, that's for sure!

Concerning the PowerPC architecture, in 1991 one of the companies to which I sent a copy of this essay was Motorola. If they borrowed something from the essay, they never told me. Nor did they need to, since I declared it public-domain.

I think my biggest gripe with respect to putting only 8 bits of data at one memory location is this: BOUNDARIES. The memory management circuits will gather 8 bits from each of 4 addresses, in parallel, to supply a single 32-bit block to most of today's processors. In such a computer, every 4 addresses (such as from 000000 to 000003) constitutes a bounded group. If I wanted to save 32 bits starting at Address Number 000001, the hardware won't let me! --because it would cross the boundary between Address 000003 and 000004. THEREFORE, IF ONLY 32-BIT DATA IS ALWAYS USED, the net effect is AS IF the total address space had been reduced to 1/4 of its advertised size, from 32-bit- to 30-bit-addressing, that is. I think this fact reveals a total waste of potential! A 16-bit processor should have AT LEAST 16 bits of data at every address, a 32-bit processor should have AT LEAST 32 bits at every address, and so on! Only then will ALL the advertised number of addresses be fully useful.
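The boundary argument can be sketched numerically. In this illustrative Python model (function name and interface invented for the example), byte addresses are served in aligned groups of four:

```python
# Sketch of the alignment argument: with 8 bits per address feeding a
# 32-bit bus, memory is served in aligned groups of four addresses,
# so an access starting at address 1 straddles two groups.

def groups_touched(addr, nbytes=4, group=4):
    """How many aligned memory groups a byte-addressed access must read."""
    first = addr // group
    last = (addr + nbytes - 1) // group
    return last - first + 1

assert groups_touched(0) == 1   # aligned: one memory cycle
assert groups_touched(1) == 2   # crosses the 000003/000004 boundary
# If only aligned 32-bit accesses are allowed, just 1 address in 4 is
# usable as a start address: 32-bit addressing behaves like 30-bit.
```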

About the bottleneck between the processor and the memory, I knew about that back in 1991, and if you go down to near the end of the essay, you will find a paragraph that begins with "And now my two cents worth about the hardware..." In there I describe my independently-invented notion of clocking the CPU faster than the rest of the system, to compensate. Yet today that bottleneck STILL exists, in spite of large clock-multipliers, RDRAM, DDRSDRAM, and even QDRSDRAM. Sometimes I think they should use SRAM for ALL the memory, and not merely the cache, heh heh. Well, there ARE some new memory technologies in the pipeline, which may offer the performance of SRAM, the cost of DRAM, and even the nonvolatility of ROM. I can wait for one of those.
-- Vernon, Jul 13 2001


"What I really was independently inventing was Very Long Word Instructions..."

Understatement of a Lifetime.

"I hear that the Pentium 4 has more than 400 total! Only 32 of them are address lines, and somewhere I gathered the impression that maybe 64 of them were for data, but 300+ pins for all the other overhead seems ridiculous. The proposed 12864 processor would need 192 pins for data and addresses."

Vernon, would all 192 pins be solely dedicated to data and addresses, and/or would they absorb all the other overhead as well? If so, what proportions do you assume would be logical in completed form?

RE: "the thorniest problem"... would stack addresses run parallel - virtually or otherwise - on both 64 bit 'divisions' or is that a nonissue?

Given that no one has taken advantage of the 10 year old public domain offer to the fullest meaning, I would hope in all sincerity you are instead able to see this through to your financial benefit or to live long enough to see it done. Fate has its own method to the madness.
-- thumbwax, Jul 13 2001


I am withholding commitment pending further discussion.
-- The Military, Jul 13 2001


Rods Tiger, you are reiterating stuff that's already in the essay. Of each 128-bit data fetch, some are defined as "bitfield size" and "bitfield start" information. So that any portion of any group of 128 bits can be used for whatever. There are probably fewer technical difficulties with this than one might first imagine. For example, consider a relatively standard "3-state signal", on one of the data lines between the CPU and the memory: One state specifies Read, one specifies Write, and one specifies Do Nothing. So if one wanted to fetch only 53 bits from somewhere among the 128 at a particular memory location, only 53 wires would be set to Read state, and all the others would be set to Do Nothing.
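That selective-Read scheme behaves like a shift-and-mask; a Python sketch (the function name and interface are invented for illustration, with "bitfield start" and "bitfield size" as the parameters):

```python
# Sketch of a partial fetch from a 128-bit memory location: model each
# data line as either Read or Do-Nothing, and extract only the bits
# whose lines are set to Read.

def partial_fetch(word128, start, size):
    """Read `size` bits beginning at bit `start` of a 128-bit location;
    all other data lines stay in the Do-Nothing state."""
    mask = (1 << size) - 1
    return (word128 >> start) & mask

value = partial_fetch((1 << 128) - 1, 10, 53)
assert value == (1 << 53) - 1   # exactly 53 lines were set to Read
```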

Next, you have also described existing SIMD instructions (Single Instruction Multiple Data), which are used to process multimedia data streams. I didn't happen to think of that idea, but I did mention at the very bottom of the essay, that they should be included.

About distributing the idea, I originally did mail off copies to several places, and then shelved it. Later, when I wanted to start thinking about bit patterns for a 256-bit processor, I couldn't find it! Only a few days ago did it surface by accident, in a very unexpected place. Now it is posted where anybody can find it. Hmmmm. In my various searches of the Web for pages containing whatever-I-might-be-seeking, pages from the HalfBakery never seem to come up. I wonder if Jutta can do something about that...if she wants an even busier place, that is.

thumbwax, NO, I did not mean to imply that this proposed microprocessor would ONLY have 192 pins. I don't claim to know enough about all the overhead-pins to offer a guess as to how many total this processor would be expected to have.

I'm not quite sure what you are asking with "would stack addresses run parallel?", when I THOUGHT the essay fairly clearly states that when stacking the data in the 64-bit registers, two registers would be saved at each memory location, an even total number of registers being required for stacking/unstacking. Since the list of stacked registers corresponds to which-bits-are-set in the stacking instruction, one way to think of the stacking process is as examining a stream of bits from first to last when stacking, and from last to first when unstacking. Each set bit means the associated register has its data stacked, in left64/right64, next-address, left64/right64, next-address, etc. fashion (and right64/left64 when unstacking). (Or call them hi64/lo64, if you prefer.)
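The pairing and the first-to-last/last-to-first ordering can be sketched like so (the register names are placeholders):

```python
# Sketch of the stacking order described above: registers named by set
# bits are saved two per 128-bit location, first-to-last when stacking
# and last-to-first when unstacking, so unstacking restores everything.

def stack_pairs(reg_names):
    """Group an even-length register list into (left64, right64) pairs,
    one pair per 128-bit memory location."""
    assert len(reg_names) % 2 == 0
    return [(reg_names[i], reg_names[i + 1])
            for i in range(0, len(reg_names), 2)]

def unstack_order(pairs):
    """Walk the locations back in reverse, right64 before left64."""
    return [reg for left, right in reversed(pairs) for reg in (right, left)]

pairs = stack_pairs(["A", "B", "C", "D"])     # [("A","B"), ("C","D")]
assert unstack_order(pairs) == ["D", "C", "B", "A"]
```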

And thank you! for the kind remarks about potential future compensation.
-- Vernon, Jul 13 2001


There are 11,913 words in this idea; 67,283 characters (including spaces).
Just thought you'd like to know.
-- angel, Jul 13 2001


The most important thing I learnt when doing academic research was: If you've got a complex technical or scientific idea and you can't express it in a single, simple sentence then you haven't thought it through properly and you don't yet know what you're talking about.
So how about it: Can you give us one short sentence which expresses why this is an important idea?
-- hippo, Jul 13 2001


[Rods]: No, do you think I've got nothing better to do than count carriage returns?
-- angel, Jul 13 2001


angel, unfortunately when something is posted on the HalfBakery, it undergoes a text-massaging process that tries to eliminate excess carriage returns and spaces. For purposes strictly pertaining to posting the idea here, I deliberately added quite a few extra carriage returns, to keep certain lines from becoming run-together. It didn't do enough good. Some of those lines had precisely-spaced characters in them, using a fixed font, and they have all been messed up royally. (What I am trying to say, your count of characters is rather different from what the true count was, before mine and the HalfBakery's text-massaging.)

hippo, since this is the place for half-baked ideas, why are short descriptions automatically therefore important? Not to mention that this particular "idea" is actually a whole raft of them. Each had to be explained, and organized in relation to the others. And then there is that phrase "microprocessor description" in the subtitle! Why isn't that good enough for you? Finally, since different people have different standards as to what qualifies as an important idea, all that can be done is let an idea speak for itself. Individuals can only read and decide.
-- Vernon, Jul 13 2001


// I think it would be an achievement to get the annotations to wrap past the end of this particular idea! //

we're about a third of the way there. anyone else have anything interesting to say?
-- mihali, Jul 13 2001


If I'd known my text could be massaged for free I wouldn't have been paying for it this whole time. Sorry.
-- thumbwax, Jul 13 2001


I'm afraid I can't really contribute to a discussion on processors, as I don't really know anything about them and quickly got lost in all the technical details, but I am having a good go at helping to extend the annotations...
-- RobertKidney, Jul 13 2001


half way there...
-- NHstud1216, Jul 15 2001


I give up...
-- RobertKidney, Jul 15 2001


this idea is beyond my attention span
-- krispykremednut, Jul 15 2001


I was going to add an annotation to this, just to contribute in the long, slow crawl towards the bottom of the Vernon's idea. For a second, Rods Tiger, your comment almost made me lose faith. Closing the History window on Internet Explorer, suddenly the browser window widened and the distance to the bottom doubled. But then I thought, if it works one way, it works in reverse. So I just re-opened the History window and widened it to take up half the screen. By the time the annotation column is only about 20 characters wide the notes not only reach the bottom of Vernon's idea, they actually pass it... at which point I realised that I didn't have to add this note. Oh well.
-- Guy Fox, Jul 15 2001


Returning to the beginning of the annotations I'm just wondering how phoenix can WIBNI what is effectively a brief design spec. Bad phoenix, naughty phoenix - you're grounded for a week and no firelighters for a month.

I don't pretend to understand the detail, but what this looks like is a revolution-over-evolution redesign of the microchip. Unfortunately this will need its own operating system and the necessary "killer app" to make people buy it. Would this be any use as a games console, as these are designed more from the ground up than general purpose computers?
-- st3f, Jul 16 2001


Can someone summarize this article for those of us who, well, you know?
-- egnor, Jul 16 2001


[st3f] "I'm just wondering how phoenix can WIBNI what is effectively a brief design spec"

Brief?

And it's not that this is a bad idea but the idea of enhancing a processor by widening its data buses and/or taking advantage of data caches is baked. I believe you'll find that there is currently a trend in computing towards smaller data buses which are more easily manufactured and can be more efficiently clocked (e.g. Serial ATA, USB, FireWire, etc).

That said, I hope [Vernon] finds a buyer for his CPU, makes millions, buys a tropical island and has us all over for tequila.
-- phoenix, Jul 16 2001


Erm... I'm with egnor, I really need a summary. I bet the reason no-one has written one is that they still haven't read the whole idea...
-- RobertKidney, Jul 16 2001


Man, I love the 6809. In fact, I'm coding on one right now with my current project. LEAX 1,X FOREVER!
-- koz, Jul 16 2001


st3f, with respect to the notion of needing a "killer app", the essay tries to explain how that is not essential: "...the whole industry starts off on an equal basis with respect to the proposed 12864 microprocessor, and there should now be no barrier to creating an industry wide standard." Remember that this was written in 1991, before anybody could consider making this. However, they COULD have said, "Yes, let's create an industry-wide standard CPU, based on something that nobody has any legal control over." Which is why I declared the essay public domain. Then, by the time anyone could have made superprocessors like this one, all the software developers would have had years to prepare for conversion, no matter how incompatible it might be. The basic idea was to design a processor SO superior that everyone would want it, AND provide the time frame needed for acceptance/preparation. It's kind of late for that now, but, as noted at the very bottom, there's no reason why we couldn't set out to design a 256/64, or maybe even a 1024/128 processor....

egnor, the preceding may perhaps qualify as a summary. All the details merely show that I was willing to give the notion reasonable consistency, and bring it to the actual half-baked point. Even as we write, consider recent Intel and AMD history: A couple years ago they decided that they would make 64-bit processors that would obey certain instructions in specific ways, and THEREBY they created "instruction sets". An Assembler lets a programmer use that instruction set to create programs that get things done; an Emulator is a program for Computer A that can take those program-instructions and run them AS IF Computer A was actually Computer B (the computer for which the instructions are actually intended). So, while programmers had a chance to pretend they were using 64-bit processors (and modify existing software to take advantage of their features), Intel and AMD could develop the actual 64-bit hardware that would supposedly function according to the instruction-set specifications. THAT IS THE KEY: It was by specifying much of the instruction set for the 12864 that I hoped other things could follow.

Rods Tiger, you got that right.

phoenix, the essay is indeed only a brief design spec. I mentioned in a prior annotation that the FULL specifications would require an entire book, and that remains utter truth. Also, you are missing the point about the high-speed narrow-channel serial connections like FireWire: They are used for PERIPHERALS only, and they are used to reduce the cost of connecting those same peripherals. It is simply cheaper to make a cable with 2 wires in it than a cable with 128 wires in it. But, since a cable with 128 wires can carry 64 bits of data simultaneously (each distortion-resistant signal requires a separate pair of wires), the 2-wire cable has to operate at 64 times the speed, to be equal. THIS translates as more-expensive hardware on each end of the cable. So trade-offs are inevitable.... So, for dealing with communication between CPU and main memory, you MAY be able to condense it some (let's say 1/4): 128 bits goes out of the memory to a communications controller at 1Ghz; the controller sends 32 bits at 4Ghz to a second controller at the CPU; and the second controller downshifts it back to 1Ghz and 128 bits for the CPU. Now, HOW MUCH POWER will those communications controllers consume, and is this a worthy trade-off to merely finding space among the layers of the motherboard for 128 data lines, connecting CPU directly to memory? I SHOULD NOTE that the preceding is hypothetical, a more correct description of what is actually done involves ONE intermediary chip between CPU and memory. This chip deals with the fact that the main memory DOESN'T run at the same clock speed as the CPU -- the "bottleneck" mentioned in several places above. Nevertheless, for today's 32-bit systems, the intermediary chip can gather 64 bits at once from the main memory at the max speed that the memory can handle, and feed it to the processor at a faster speed, for the CPU's benefit. 
[No conflict here: if the memory runs at 200 MHz and the CPU runs at 1.6 GHz (8 times faster), then the intermediary chip has to spend 8 CPU cycles talking to the memory, and THEN spend one CPU cycle talking to the CPU.] Well, if the memory instead communicated 1024 bits at once to the intermediary chip, THAT chip could divide the 1024 bits into eighths, and feed the CPU 128 bits on each one of its cycles! NO SYSTEM TODAY does anything like that, to the best of my knowledge. I didn't even broach the notion in the essay. I DID want the wider data path to try to keep a faster-clocked CPU busier, but ten years after thinking it up, the fact is, as previously stated, that the memory still lags behind the processor. Barring the introduction of faster RAMs into the market, THIS is the sort of thing that should be proposed for a far-superior next-generation processor, that everyone would want, and that the Industry might decide to settle on as a standard. Wider data paths really are better!
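To make that arithmetic concrete, here is a little Python sketch of the bandwidth balance (the clock speeds and widths are just the illustrative numbers above, not any real chip's):

```python
# Hypothetical bandwidth balance between main memory and CPU.
# Assumption: memory runs at 200 MHz, CPU at 1.6 GHz (8x faster),
# and the CPU wants 128 bits delivered on every one of its cycles.

mem_clock_hz = 200e6
cpu_clock_hz = 1.6e9
cpu_bits_per_cycle = 128

cpu_demand = cpu_clock_hz * cpu_bits_per_cycle   # bits/second the CPU can consume
mem_width_needed = cpu_demand / mem_clock_hz     # bits per memory access to keep up

print(mem_width_needed)  # -> 1024.0: the intermediary chip would slice each
                         #    1024-bit fetch into eight 128-bit pieces
```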

The tropical island party sounds like a pretty good idea, though, but don't expect tequila. Didn't you know that the leading cause of stupidity is alcohol? I don't want to become more stupid than I already am, so I leave the stuff alone. And won't serve it, either. People who want to poison their brains will have to bring their own.

RobertKidney, perhaps you will find the above notes to st3f and egnor adequate.

koz, if I could have found a decently paying job working with 6809's in 1991, I'd likely be there yet, and perhaps never have gotten around to writing this essay.

waugsqueke, good one!
-- Vernon, Jul 16 2001, last modified Jul 17 2001


Please can I adopt the word 'verbose'?
-- DrBob, Jul 17 2001


Sorry, [Vernon]'s already cornered the market on that one.
-- angel, Jul 17 2001


[DrBob] Be my guest. <change of topic> Are you going away on holiday at all this year?
-- hippo, Jul 17 2001


Rods, Vernon: It's been so long since I used a unix that the idea of cross-compiling didn't even occur. I guess all you need to get this going is a kernel, a simple C compiler and a shell? (along with a motherboard to support your processor).
-- st3f, Jul 17 2001


Quote from POYF's link: "...it is fully compatible with x86 architectural platform...".

So that would be a *different* architecture then.

Normally I'd look deeper in the site and search even a tenuous link for your assertion, but with a name like that you're just asking to be flamed.
-- st3f, Jul 17 2001


I must be missing something. How is a computer with 10 processors (ELBRUS) the same as this proposed single microprocessor?

And while I HAVE had some notions of "shared memory" architecture, they were based on dual-port RAM, which I'm pretty sure did not exist even in Russia in 1970.
-- Vernon, Jul 17 2001


So can anyone tell me where I could go or what I would have to do to gain the appropriate knowledge to be able to begin to understand the specifics of what this idea is saying? Is there a class I can take?
-- PotatoStew, Jul 17 2001


POYF: Fine, but if a massively wide memory bus is hooked to 10 processors TODAY, the memory will STILL be unable to keep up with the demands of the CPUs. I think that keeping one CPU well-fed should be sufficient. Not to mention that the deficiencies of merely 8-bit-wide memory should now be so obvious that who cares if the Russians thought of it first, that long ago? Any patent will have expired!

More about my shared-memory notion using dual-port RAM. Let me say right away that this is NOT about numbers of data lines at each address; it is only about connecting the address lines. Suppose you had 4 processors, each of which had 32-bit address space. Now imagine a memory array arranged AS IF it had 33-bit addressing (8 gigabytes total):

[#4:2gigs:#1][#1:2gigs:#2][#2:2gigs:#3][#3:2gigs:#4]

I have broken up the totality into 4 groups of 2 gigabytes. Each processor has full access to 4 gigabytes of RAM (normal for 32-bit machines), but every block of 2gigs is shared between two of the processors.

Next, if I recall the general specifications of dual-port RAM correctly, each memory cell is designed so that one 'port' can be used either for Read or Write, while the second port is Read-Only. Dual-port RAM was originally used in some video cards: the onboard graphics processor could manipulate data in the RAM freely, while a separate chip for scanning the RAM could equally freely translate the data into graphics images for a monitor, with no worry that any given memory cell might be in use by the graphics processor.

[#4RO:#1RW][#1RO:#2RW][#2RO:#3RW][#3RO:#4RW]

Assuming that the entire 8 gigabytes described above is dual-port RAM, we arrange things so that each processor can Read or Write to 2 gigs, but can Read Only from 2 other gigs. Each processor can place and modify any data it wants one of the other processors to have access to; that other processor can ONLY read it.

Now, the above described scheme offers simplicity in understanding the basic idea. But it offers difficulties if Processor #1 needs to give some data to Processor #2. (The data would have to be passed through Processors 4 and 3, in that order, first.) Certainly a more complicated variation of this idea would allow every processor a decent amount of Read/Write memory space, and access to a sufficient amount of the Read-Only memory space of ALL the other processors, for maximal efficiency.
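For clarity, here is a toy Python model of that ring layout (the numbering convention is my own illustration, matching the [#4RO:#1RW][#1RO:#2RW][#2RO:#3RW][#3RO:#4RW] diagram above):

```python
# Toy model of the 4-processor dual-port-RAM ring: block i is Read/Write
# for processor i+1 and Read-Only for processor i (wrapping around).

N = 4  # processors, numbered 1..4

def ro_reader_of_rw_block(writer):
    """One write by `writer` is visible (read-only) to this neighbour."""
    return (writer - 2) % N + 1   # 1 -> 4, 2 -> 1, 3 -> 2, 4 -> 3

def hops(src, dst):
    """How many write/read handoffs to move data from src to dst."""
    steps, p = 0, src
    while p != dst:
        p = ro_reader_of_rw_block(p)
        steps += 1
    return steps

print(hops(2, 1))  # -> 1: #2 writes a block that #1 can read directly
print(hops(1, 2))  # -> 3: the data passes through #4 and #3 first
```

Which shows exactly the difficulty mentioned: a neighbour gets data in one hop, but going the "wrong way" around the ring takes three.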
-- Vernon, Jul 17 2001


Vernon: If you (or anyone else) wishes to discuss micros with me, email me at 12864@casperkitty.com
-- supercat, Jul 18 2001


[supercat]: You might want to check the maximum file size for your e-mail account.
[Vernon]: What [waugs] said!
-- angel, Jul 18 2001


hippo: Nope!
-- DrBob, Jul 18 2001


POYF, obviously I disagree, and this is why: First of all, on what grounds do you assume that a wide memory path automatically means more expensive memory? In a prior annotation I describe how 4 ordinary (8-bit) bytes at 4 different addresses in a bounded group are currently wired together to yield 1 block of 32 bits for a processor. I say that to get 128 bits at any one address, it is ONLY a wiring issue, not an issue inherently related to the type of memory. Now I do recognize that larger QUANTITIES of memory will be required, and that translates as more expensive -- but that is normal! -- while the way you phrased it implies that an inherently more expensive type of memory must be used, and that is not true.

Next, if you had actually read the specs for this proposed processor, you would know that while it has a 128-bit data path, it only uses a 64-bit address bus and 64-bit internal registers. Many of the actual instructions are fully described in 64 bits, so when 128 are fetched, the processor gets both instruction and data at the same time. If that data happens to be address information, IT IS SUFFICIENT. Please make sure you actually know what you are dismissing, before you dismiss it.

Next, if 128 data lines effectively connect the CPU to the memory (ignoring intermediary chips for the moment), on what grounds do you assume that the data-transfer process must be slower than when only 32 data lines are connected? I agree that more power will be necessary, simply because 128 circuits and not 32 are being used, but the speed should be exactly the same.

Next, only in one respect will programs written for this processor likely take up more memory space than programs written for an ordinary processor. That respect involves actual data storage: If I want to save an ordinary 8-bit byte somewhere today, it takes one byte, whereas on this computer design at least 8-bytes-worth (64 bits) will be consumed. (Since every memory location would be able to hold all the data of 2 registers, I shall assume that any halfway intelligent programmer, when saving data to memory, will always try to save 2 registers -- holding two data items -- at a time.) Nevertheless, it is hoped that the efficiency of the instruction set will make up somewhat for that wasted space -- there is a place in the essay which indicates that if a DATA-LESS 64-bit instruction is pulled, what should the other 64 bits contain? A second data-less instruction! Read it and see.

And that brings me back to the quantities-of-memory issue: If I have an existing program that covers 1 million addresses (8 bits each), then IF PERFECTLY rewritten for this processor, it would still occupy 8 million bits, but the layout would be 1/16-of-a-million addresses (128 bits each). How much more it would actually take depends on what percentage of the program consists of data that MUST waste space, as described in the previous paragraph. Let me assume that the program ends up using 1/8-million addresses, due to such wastage. If every average program converted like that, and if your current computer has 128 megabytes of RAM, then your new computer, running the converted software and a 12864 processor, would only need 256MB of RAM (rearranged in terms of number of bits per address, of course). That's a price differential, for ordinary SDRAM, of less than $75 these days. What will the price differential be by the time any 12864-based computers actually hit the market, eh?
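That paragraph's arithmetic, worked through in Python (illustrative numbers only):

```python
# An existing program covering 1 million addresses of 8 bits each,
# repacked for a machine with 128 bits at each address.

old_addresses = 1_000_000           # 8 bits at each address
old_bits = old_addresses * 8        # 8 million bits total

perfect_new_addresses = old_bits // 128      # 62,500 = 1/16 million
assumed_new_addresses = old_addresses // 8   # 125,000 = 1/8 million, allowing wastage

growth = (assumed_new_addresses * 128) / old_bits
print(growth)   # -> 2.0: a 128-megabyte machine would need about 256 MB
```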

And as for implying that nobody really needs high-performance computers like this one, there are two things you are ignoring: (1) People have unlimited wants (fundamental basis of capitalism); and (2) Lazy programmers write more and more bloatware. Give 'em enough time, and you wouldn't be able to play Tetris without a supercomputer!
-- Vernon, Jul 18 2001


And the baby spoke his first word: dhuh? and after two weeks: 110011101011100011110101010011101

I think I'm stuck in the first week..
-- BartJan, Jul 18 2001


Comrade Poyf, are you saying this has been baked in the form described within these chambers?
-- thumbwax, Jul 19 2001


POYF, thank you for the more detailed information regarding your position. I did indeed write the essay from a programmer's perspective. After all, I had heard that when Motorola designed the 6809, they first interviewed a bunch of programmers to find out what they wanted most, and then gave it to them (easy handling of position-independent code, for example). Why shouldn't what worked so well be tried again? "Practical engineering," eh? Ah, but the art does improve as time goes by....

I do see what you are getting at about the *native* wide memory -- the memory has to run at the same speed as the CPU, and only expensive SRAM can currently do that. But the "intermediate chip" that I mentioned was supposed to be understood as simply an appropriately updated version of an ordinary "North Bridge" chip. Perhaps I am basically misunderstanding exactly where the timing-synchronization occurs, that lets the CPU run at a multiple of the main memory speed, yet they still communicate successfully? (In already-existing computers, that is.)

Yes, of course you need a very long word to store each very long instruction. However, what does the instruction DO? Fetch or manipulate or store data? The long instructions for the proposed 12864 processor would be capable of specifying all three tasks, every time. Thus all the bytes associated with storing any grouping of three ordinary instructions will be consolidated into just one of these long instructions. I would expect the total number of bits that hold instruction-information to end up being roughly the same, when comparing ordinary processors to this one. The major exceptions are those instructions that do simple things to the contents of data already in a register, such as flip all the bits. The more that some program uses those types of instructions, the more wastage there will be when the program is converted for the proposed 12864. Because, as you said, existing processors do tend to have commonly-used instructions fitting in less space than other instructions. Yet, as previously stated, memory doesn't cost so much these days, that the difference can't be accommodated reasonably easily.
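To illustrate the consolidation (with a field layout invented purely for this sketch, NOT the essay's actual encoding), here is how a fetch/manipulate/store triple could pack into one 128-bit word:

```python
# Hypothetical packing of "fetch + manipulate + store" into a single
# 128-bit instruction word. The bit positions here are made up for
# illustration; the essay's real encodings differ.

def pack(fetch_op, alu_op, store_op, operand):
    word = (fetch_op & 0xFFFF)             # bits 0-15: fetch specifier
    word |= (alu_op & 0xFFFF) << 16        # bits 16-31: manipulate specifier
    word |= (store_op & 0xFFFF) << 32      # bits 32-47: store specifier
    word |= (operand & (2**64 - 1)) << 48  # upper bits: immediate data/address
    return word

w = pack(0x0001, 0x0203, 0x0405, 0xDEADBEEF)
print(hex(w))
print(w < 2**128)  # -> True: the whole triple fits in one 128-bit fetch
```

Three ordinary instructions' worth of work, one memory access.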

With respect to compression and other intermediate tricks for avoiding a 128-bit data bus, I tend to think that the complexity they introduce isn't worth the trade-off. In the long run the capacitance issue you mention will be solved by designing shorter signal paths (or moving to optical paths), among other things. Not to mention that there is one aspect of that issue about which I need clarification. Consider a reasonably standard memory chip, which is wired internally to yield 4 bits per address. Two such chips provide a full byte when requested, four of them provide a pair of bytes when that is requested, and so on. EACH of those chips is independently powered! Each is responsible for meeting the capacitance needs of the data lines that it outputs to! So, if I want to fetch 128 bits at one address, there could be 32 of these chips all working together, and each one is still independently powered. WHY, then, is there really the capacitance issue that you described? (Now I do recognize that there is another issue with physically mounting so many chips just to obtain 128 bits per address, but we can save that for later.)

Finally, with respect to wasting processor power: "It is better to have and not need, than to need and not have." Uses for such power will inevitably arise, if for no other reason than because it is there.

P.S. it does take a certain amount of time to craft a worthy reply.
-- Vernon, Jul 19 2001


Hmmm... for some reason, I have a funny feeling we may revisit this idea in the future as evidence of prior art in a patent law suit.
-- quam, Jul 20 2001


POYF, it seems you have raised an interesting engineering challenge. Deserves thought. Probably has a solution. And I, too, need to get some actual work done at my job.

Rods Tiger, you might look up "magnetic bubble memory", because it holds a serialized bitstream.

angel and thumbwax, those two links are hilarious!
-- Vernon, Jul 20 2001


Tip o' the hat to [Piss On Your Fire] for breaking the [Vernon] idea barrier. I can rest easily now that the annotations have passed the idea itself.
-- phoenix, Jul 20 2001


Not on my display, they haven't. I'm only as far as 'I don't propose to offer a complete list here'.
-- angel, Jul 21 2001


We could get to the bottom quickly by saying, '[Vernon], would you please clarify ...' and quoting a whole paragraph. He would then quote our entire annotation before clarifying. Of course, we still wouldn't understand.
-- angel, Jul 23 2001


Hey! No fair! I've read all the way through the words (most of which was well over my head - God forgive you if you're bluffing Vernon, Rods et al) and the bloody firewall denies me my reward by stopping access to Thumb & angel's links. Is there no justice?
-- DrBob, Jul 23 2001


Heh heh. I've got "And just to be complete:

|6 5|5 5|5 5|4 4|4 3|3|3 3|3 2|

|3| | | | |8|7| | |4|3| | |0|9| | |6|5| | | | | |9|8|7| | |4|3| | | |9|"
And the scroll thingy is only just over halfway down the page. I guess I can't complain about being given a big monitor.
-- lewisgirl, Jul 23 2001


DrBob, this essay was no bluff in 1991, and I'm not bluffing now. I do confess to enjoying how well most of the ideas have held up through 10 years of computer progress, but I STILL think that the Industry should design some sort of very powerful processor, that could handle any average person's demands for a whole lotta years, and GO FOR IT. Resolving all those incompatibility issues would be worth it!

Rods Tiger, hopefully you now know that instructions ARE a variety of data, which is why they are loaded on the data bus.

But speaking of extra busses, I have a few more notions to spout....

Suppose we take that dual-port RAM described earlier, and make it the main memory for a single processor. If I recall right, this type of RAM has two address busses and two data busses (so that two different accesses to it can occur at the same time). Therefore the processor should ALSO have two address busses and two data busses.

See, one of the most annoying things a processor has to deal with is the conditional branch instruction. IF such-and-such, GOTO some address where appropriate instructions can be found. Otherwise it continues processing the current group of instructions.

Now, instructions and data are loaded 'en masse' into a cache, for fast access. Let's say that some block of information has been loaded from Main Memory Address #1,000,000. If the processor encounters an instruction to start manipulating data at Address #2,000,000, it has to re-fill the cache, which slows the system down. EVERY large-enough jump to some other Main Memory Address will have the effect of slowing the system down.

These days the processors have special hardware to speculate in advance about which branch is likely, so that the cache can be filled in advance of the need for the data. However, while such speculations are right maybe 95% of the time, it seems to me that there is another way that may be better.

For example, what if there were two caches? These days a processor can just about always compute the destination-address of a jump well in advance of knowing whether or not the jump will actually be taken. So, while processing instructions from the current cache, the other cache is being filled with info in case the jump occurs. Then implementing the jump is as easy as switching to the other cache, as the source of instructions and data.

Which is why two address busses and two data busses would be needed by the processor! The Read-Only bus would be used to fill the 'just-in-case' cache, and it could be done at the same time that the main Read/Write bus is being used to stow processed data into the Main Memory.

(The preceding is not especially related to the size of the data or address busses, but obviously, the more signal lines, the more pins are needed on the CPU, and the harder it becomes to lay out the circuit-traces on a motherboard.)
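A toy Python sketch of the two-cache switcheroo (not cycle-accurate, just the idea):

```python
# Minimal model of the two-cache scheme: while instructions run from the
# active cache, the branch target's block is loaded into the standby cache,
# so a taken jump is just a pointer swap instead of a refill stall.

class DualCache:
    def __init__(self, memory, block_size=16):
        self.memory = memory
        self.block_size = block_size
        self.active = self._fill(0)    # (base address, cached block)
        self.standby = None

    def _fill(self, addr):
        base = addr - addr % self.block_size
        return (base, self.memory[base:base + self.block_size])

    def prefetch_branch_target(self, addr):
        # done in parallel with normal execution, over the second bus
        self.standby = self._fill(addr)

    def take_branch(self):
        # the "jump" costs only a swap, not a cache refill
        self.active, self.standby = self.standby, self.active

memory = list(range(1000))
c = DualCache(memory)
c.prefetch_branch_target(512)
c.take_branch()
print(c.active[0])   # -> 512: execution resumes at the prefetched block
```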

And another alternate idea simply involves ultra-fast Main Memory. SOME day they will probably succeed at making memory fast enough to keep up with the CPU, and cheap enough to make in vast quantity. Then there will be no need for any sort of cache, because whenever the CPU needs to switch to some other address, the data will be there as fast as the switching occurs.
-- Vernon, Jul 25 2001


[RodsTiger] "If, (while you're dreaming up a new cpu anyway) you incorporated a microcoded-in-kernel (or even auxiliary kernel) hardware Gzip routine"

Baked as an option for PowerPC cores (see link) although as far as I can see for code only, not data.
-- JKew, Aug 26 2001


It's disconcerting to open up a HalfBakery idea which is about as long as my dissertation (which I'm meant to be writing RIGHT NOW) is supposed to be.
-- -alx, Aug 26 2001


[Vernon] Two caches? I dunno. Sounds as if my defragger will be working overtime. I'm convinced M$oft has already created a windows version for this 12864 hardware and will soon weigh it down to run on my platform. Call it maybe WD^40; assuming it is crashproof?
-- reensure, Aug 26 2001


Right Brain: "Make it stop! Aaaah!"
Left Brain: "Interesting."

3 minutes later
Right Brain: <commits suicide>
Left Brain: "Make it stop! Aaaah!"
-- AfroAssault, Oct 13 2001


Following on from AA's admirable critique:

Right brain: Please tell me why I should want one of these at the start before you list the 914 registers, addressing modes etc.

Left brain: What're the power dissipation figures for this? Could I use it in a laptop? A mobile phone? What's memory running at 2 GHz going to do to that? Isn't there a reason we have 1000 different microprocessors out there? What's wrong with RISC (Small core size, low power requirements, easy to write code for)? Many DSPs already access memory in 48/32 bit not 8 bit chunks - this is very annoying if you want to program them for smaller data sizes. And the ARM7 uses a 32 bit word size but can also use a series of 16 bit instructions for simple operations to save on code size, so that bit's baked. Many processors access the memory almost every cycle due to pipelining; in some processors like the ADSP-2106x you can access memory twice every cycle, because it's divided into 2 separate blocks, and well-pipelined, with delayed branch instructions (branch in 2 instructions' time), so even when you're branching you can execute more instructions while the PC adjusts.

I think most of this is baked, but i don't want to read it all.
-- pottedstu, Oct 13 2001


<Head uncovered> Wow. </>

This is the stuff of which legends are made.
I thought ideas at the halfbakery had to fit in one screen or less. Not any more, though.

I'm curious. Just by how much does the modern P4 or its peers approach this?
-- neelandan, Dec 22 2001


Everyone who understands all this deserves a croissant. I have no idea what it's all about.
-- TeaTotal, Dec 22 2001


Amen TeaTotal!!!
-- NeverDie, Dec 22 2001


[neelandan] Check out some of Vernon's other stuff, some (most?) of it's just as bad.

[TeaTotal] Yes, anyone who bothered to read the whole lot (myself excluded) would certainly be in need of nourishment, whether they understood the idea or not.
-- cp, Dec 22 2001


Text Massage, anyone?
-- thumbwax, Dec 22 2001


I can't even manage to read all the annotations.

I really need to change some settings because on my screen the annotations still don't reach the bottom of the idea text... (6 screens to go)... can we have marked-for-reduction or something?

That's an idea - someone who understands the idea in the first place should put it into Word and try out the auto-summary and see if the amount of sense made changes... I wonder what would happen if we got it down to a page...
-- RobertKidney, Dec 22 2001


Just scrolled down thinking this was very Vernon-esque. 1/2 an hour later I find out it was by Vernon. Some things on 1/2B are just too predictable.
-- CoolerKing, Dec 23 2001


WOOOOOOOOOOOO!!!!! WOOOOOOOOOOOOOO!!!!
-- entremanure, Dec 24 2001


The new processors are, I believe, faster at such things as compressing video (from avi to mpeg) and sound (from wav to mp3). I have not done the comparison, but that is what the literature available has made me believe.

But my question: do modern µPs approach (or, heaven forbid! surpass) Vernon's 12864?
-- neelandan, Dec 24 2001


neelandan, when the essay was written in 1991, about the only 64-bit processors around were mainframes. I am not sure if any 128-bit central processing units exist today, but wouldn't be surprised if mainframes and supercomputers have been at that level for years (I do know that graphics processing units are up to 256 bits). Certainly for the desktop computer market, there are not any 128-bit CPUs, and only a few 64-bit CPUs (such as the Alpha). AMD's new Hammer will join the fray later this year, with 64 bits for anyone who wants it on their desktop. Intel's current 64-bit offering is being directed to the server market (and no doubt the workstation market), rather than the ordinary desktop market, but no doubt their tactics will change if AMD has a lot of success.

So processors with 64 address lines will soon be common. Intel's instruction set for its processor is probably comparable to the one proposed here (both are Very Long Instruction Word types).

Since all those 64-bit processors also gobble data in only 64-bit chunks, they would not match the 128-bit specification of my proposal. But I may be mistaken: As mentioned in a prior annotation, while the ordinary Pentium processes data 32 bits at a time, it actually pulls 64 bits at a time from the memory (most of the time for cache-fills, I expect). So perhaps something similar is about to happen for the new AMD and Intel processors.

So while my proposal is not outdated now, it probably will be in another 5 years (and possibly less).
-- Vernon, Feb 27 2002


Vernon: The term "reduced instruction set computer" does not refer to the set of instructions being reduced, but rather to the fact that the computer uses a set of reduced instructions: each instruction does 'one thing', with no side-effects. Things like pre-decrement addressing are great in non-pipelined machines on which instructions are inherently guaranteed to run atomically, but would pose two major problems on RISC machines: (1) they introduce additional addressing and data dependencies which must be tracked by the compiler and/or the hardware; (2) on any processor which allows virtual memory, there must be a means by which a program that triggers a page fault can be resumed. On a RISC machine, there is no such thing as a 'partially executed' instruction; either an instruction has completed or it has not. By contrast, an instruction which uses pre-decrement addressing may be in various levels of completion when a page fault occurs. Recovering from a page fault requires capturing the internal state of the processor, and thus requires the existence of much special circuitry precisely for that purpose.
-- supercat, Feb 27 2002


supercat, I did not mean to imply anywhere here that the proposed 12864 microprocessor qualifies as a RISC device. Its instructions are complicated because they are versatile; each is a starting point for a lot of programming possibilities. And certainly appropriate recovery circuitry will be required. Its primary simplification and source of efficiency comes from having as few as possible different KINDS of internal registers, by letting almost any register be used as part of any task.
-- Vernon, Feb 28 2002


Folks, AMD's new Hammer processors (64 bits) are now in actual computers being tested. There are currently two variants, and both have ENORMOUS numbers of pins. See the link I added.
-- Vernon, Mar 02 2002


Rods Tiger and quam, see newly added link about Transmeta. Heh heh.
-- Vernon, May 31 2002


OH. MY. GOD. I am humbled. And nowhere near topic-anno parity. So I thought I'd jump in the shallow end of this here ocean & add my little dribble of piss.
-- General Washington, Sep 08 2002


I find it fascinating that Vernon's last name is Nemitz. Almost like the USS Nimitz.
-- RayfordSteele, Sep 09 2002


Yes, very apt. The Nimitz Class are the massivest vessels in the history of the universe.
-- General Washington, Sep 09 2002


Do you have any proposed instructions as an example to be used with the 128-bit data fetch and bit set?
-- pashute, Sep 30 2002


Thanks in any case. It got me running on an old idea of mine, which I'll put in right now... (See Plug and Play Distributed Processors)
-- pashute, Sep 30 2002


One day, when my children's children are half-baking, they may see the annotations for this idea be longer than the idea.
-- dare99, Oct 16 2002


Yes, this idea must be the longest one ever in the history of the HalfBakery!
-- XSarenkaX, Oct 16 2002


Hi folks! It's been a while since I checked this, but I haven't forgotten; just been real busy.

pashute, I'm not quite sure what you were asking about. Perhaps you asked it because of the way the main text became jumbled? It needs to be viewed in a fixed font. The link "12864(again)" will take you to another posting of the idea; if your browser doesn't show it in a fixed font, you can copy/paste it into a text editor like WordPad and force a fixed font. Many of the 12864 instructions are mentioned in a generic way, and I tried to present enough detail that specific versions of them could be deduced.

Regarding my last name of Nemitz, I'm told that this and other similar versions may have a common root, some centuries back. None are considered "close" nowadays, so I don't think of myself as having any significant genealogical relationship to Chester Nimitz.

Regarding the crawl of annotations, I might mention that my monitor resolution is normally set to 1152x864 pixels, and without modifying any browser settings, I'm now seeing the annotations exceed the text. On this annotation page, though, they have a little ways to go yet :)
-- Vernon, Jan 27 2003


I've just finished reading Vernon's anno. It's taken me 1 year 10 months and 3 days. Can anybody explain it to me?
-- sufc, May 13 2003


Bender from the TV cartoon Futurama, set in the year 3000, is powered by a 6502. Hence that should do for all our next millennium's needs.
-- FloridaManatee, May 14 2003


on my apple g4 laptop, at 1152 pixels wide, on safari 1.0 beta 2, the anno's reach about 75 percent of the way down the main idea.

i propose that this idea be required reading. having been a 1/2baker since 2000, but having taken a hiatus for most of 2002, it's refreshing to go back and read through ideas like this one, which predated the misconception that the 1/2b is a forum for bad puns and cleverly worded shutdowns.
-- urbanmatador, May 15 2003


For anyone interested, and too lazy to go to the 12864(again) link, I've done some editing, to try to make the tabulated portions of this Idea look a little more like they do when viewed in a fixed font.
-- Vernon, Nov 26 2003


Hey Vernon. Having the document in better format improves things considerably. At a quick glance, though, I don't quite see what's particularly novel about it. Yeah, it uses big instructions, but since larger instructions take longer to fetch than small ones and/or require more cache to store, the benefits of using such large instructions must be substantial to justify them.

BTW, it should be noted that many processors are in fact tending toward VLIW designs when executing code within the internal cache. Basically what happens is that the code stored in RAM is 'uncompressed' when it's read into cache, and the uncompressed code is then executed. This combines the advantage of a smaller main-memory bandwidth requirement with the advantages of a larger, easier-to-decode instruction.

Also BTW, are you familiar with the ARM processor?

Finally, on a parting note, I found it interesting that Intel has finally come around to the idea of a processor that executes two simultaneous virtual machines. I'll admit I'm surprised it's taken this long, since the idea seemed obvious to me back in 1992 or so when I was learning about pipelined processor designs. To be sure, if I'd implemented such a design back in 1992 it would have been much simpler than Intel's. On the other hand, much of the complexity in Intel's processors stems from a need to have them be able to execute a single instruction stream quickly. Executing more instruction streams slowly would simplify things considerably.

The key issues are pipelining and data dependence. With pipelining, a subsystem can accept multiple inputs in less time than it takes to fully process one. As an example, consider a pipelined multiplier: it can accept one input on every clock cycle, and will produce a result four cycles after it was given the appropriate inputs. If a "multiply" instruction is executed and the result is not needed until many instructions later, other instructions can execute while the data makes its way through the multiplier. If the result of the multiply is needed before it's done, however, then it's necessary to either have hardware to stall the processor until the necessary result is available, or have lots of hardware to allow the execution of instructions that relied upon the multiply to be delayed while other instructions continue to execute. This sort of thing gets horrendously complicated, but is necessary in order for systems to execute code quickly.

The key to making a multi-VM processor simple would be to arrange all the subsystems so that all data paths had a uniform pipeline delay (add latches as needed to make the delay longer on subsystems whose data path was shorter than the maximum). Rather than using direct feedback to have registers hold their values, replace each register with a series of latches and a 'write mux' [which determines whether the register should accept new data or simply recycle the old]. Under such a system, much of the complexity associated with pipelining and data dependencies would vanish. If all data paths were normalized to six cycles, the processor would execute one instruction per cycle without any data dependency issues other than the external memory bus (since by the time any instruction executed, the previous instruction would have run to completion). The latches would add some overhead to be sure, but it would be minor compared to the complexity required to handle register scheduling etc.
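A toy schedule illustrating the idea (Python, purely illustrative): with six VMs issued round-robin into a uniform six-cycle pipeline, each VM's previous instruction has fully completed before its next one issues.

```python
# "Barrel" scheduler sketch: six virtual machines, six-cycle uniform
# pipeline delay, round-robin issue -- so no data-dependency interlocks
# are needed. (Toy model, not any real design.)

DEPTH = 6   # uniform pipeline delay, in cycles

def issue_schedule(n_cycles, n_vms=DEPTH):
    """Return (cycle, vm) issue pairs for round-robin issue."""
    return [(cycle, cycle % n_vms) for cycle in range(n_cycles)]

for cycle, vm in issue_schedule(12):
    # VM `vm` issues again at cycle + 6 -- exactly when this result is ready
    print(f"cycle {cycle}: VM{vm} issues; result ready at cycle {cycle + DEPTH}")
```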
-- supercat, Dec 06 2003


A few comments: HOLY SHIT, that is a fucking long post, and...fishbone. I can't stand to read that much text on a computer screen.
-- Eugene, Dec 06 2003


jutta, I'll see what I can do, but can't make a promise about when I can find time to do it. The changes I did to the tables were to replace spaces with underscores, and to add HTML line-breaks, so the double-carriage-returns that had been previously used to separate the lines could be removed. That was all.

supercat, thanks. While it is true that for ordinary processors it takes several clock cycles to load a VLIW, simply because that Very Long Instruction Word occupies a lot of bytes of memory, in this processor the goal was to fit every instruction into 16 bytes or less (128 bits) AND specify a data path of 128-bits/16-bytes at EACH memory address (not the usual mere 8-bits/1-byte). Thus every instruction would load in one clock cycle.
-- Vernon, Dec 07 2003


by far... the longest, most thorough explanation of an idea I've ever come across... I can't bone or croissant it since I can't understand/finish it, but... wow....
-- sheep, Dec 07 2003


Vernon: On modern PC's, it takes a number of cycles (6 to 12 depending upon CPU and bus speed) to read a single word from main memory. Having small instructions means that several instructions can be executed in the time required to fetch the next word from main memory. To be sure, running code from cache (multiple instructions per cycle) would still be faster than running from main memory (which would still be multiple cycles per instruction even if 4 instructions fit in each code word), but the penalty for running from main memory on a machine with smaller instructions is less than on a machine with large instructions. And that penalty matters bigtime.

Consider two machines: (1) 90% of the instructions average 0.5 cycles each, while the remaining 10% average 2 cycles each. (2) 90% of the instructions average 0.1 cycles each, while the remaining 10% average 6 cycles each. Which machine is faster?

The first machine will take 65 cycles to execute each hundred instructions; the latter will take 69. This despite the fact that the second machine executes 90% of the code five times as fast as the former.
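The arithmetic above can be checked directly:

```python
# Cycles per 100 instructions for the two hypothetical machines above.
machine1 = 90 * 0.5 + 10 * 2    # 45 + 20 = 65 cycles
machine2 = 90 * 0.1 + 10 * 6    #  9 + 60 = 69 cycles
print(machine1, machine2)       # 65.0 69.0
```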
-- supercat, Dec 08 2003


Thanks supercat, but somehow I still think you aren't quite grasping what I've been trying to describe here. Before I get to that, though, I'd like to mention that in the main body of this Idea there are only a few references to such things as caching and pipelining. I mentioned them because they seemed worthy, but remember that in 1991, while caching was in wide use, pipelining in ordinary PC processors was just getting started. You should assume that the cache in this processor will be 128 bits wide at each address in the cache, and that appropriate pipelines will exist.

Regarding the fact that several clock cycles are taken to load a single instruction word in today's processors, sure, I do know that. The major cause is simply that while a data word may be 32 bits wide, processor instructions are frequently longer than 32 bits. I seem to recall that even on the ancient 8/16-bit 6809 processor, five-byte instructions were not uncommon. Since that CPU could only load 8 bits at a time, it took 5 cycles just to load the 5 bytes of such an instruction. Today, CPUs have a wider data path, accessing multiple memory addresses simultaneously, but memory is still laid out with only 8 bits of data at each address. (I mentioned in a prior annotation something about the wastefulness of having 4 billion ADDRESSES of which only 1 billion could actually be used, since 4-at-a-time always had to be accessed, because the processor wanted 4 BYTES at a time.)

Fancier CPUs come with fancier instructions, and so it STILL takes multiple cycles to load the longer instructions.

In the 12864, I am specifying that each memory address should have 128 bits of data, so that when the processor wants a full "chunk", it can get the whole thing from just ONE memory address. The next memory address will have another full 128-bit "chunk" -- so every one of those 18 quintillion addresses can be efficiently used. I have also attempted to design the instruction set to fit in those 128 bits. I make no bones about the fact that frequently the data that must be loaded, to be processed by an instruction, will require a separate loading cycle. But I have also tried to indicate that many lesser instructions will fit in 64 bits, so that some appropriate 64-bit data can be simultaneously loaded, all in just one cycle. ALSO, however, because I have designed this instruction set to mostly incorporate THREE "old" instructions per 12864 instruction (GET1 data, GET2 data & operate (perhaps ADD), and PUT result), there should be some savings of cycles from that alone. LOADING a full instruction plus data may often take only 2 cycles, simply because the REGISTERS here are only 64 bits wide. That is, the efficient programmer would put GET1 data in half the bits at a particular memory address, and GET2 data at the other half of the same address (if that data wasn't already in a register, left over from some prior instruction). So, one cycle would load the full instruction, and one cycle would load the full data. Processing takes 1-to-many additional cycles, and saving the result, IF it goes to main memory, takes one final cycle.
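The "64-bit instruction plus 64-bit data riding in one 128-bit word" scheme can be sketched as follows (the field layout is purely illustrative; the idea specifies only that lesser instructions fit in 64 bits, leaving the other half of the word free for operand data):

```python
# Hypothetical packing of a 64-bit instruction and 64 bits of operand
# data into one 128-bit memory word, fetched in a single cycle.

MASK64 = (1 << 64) - 1

def pack_word(instr64, data64):
    """Place the instruction in the high half and the data in the low
    half of a 128-bit memory word."""
    assert 0 <= instr64 <= MASK64 and 0 <= data64 <= MASK64
    return (instr64 << 64) | data64

def unpack_word(word128):
    """Split a fetched 128-bit word back into (instruction, data)."""
    return word128 >> 64, word128 & MASK64

word = pack_word(0xDEADBEEF, 0x1234)
print(unpack_word(word))   # (3735928559, 4660)
```

One fetch thus delivers both halves; the second cycle Vernon describes is only needed when the operands don't fit alongside the instruction.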

The intermediate processing is the thing that gets most affected by pipelining, and I don't know enough to say how much effect there will be. I'd like to think that all the hardware pipelining efficiencies that have been learned over the past decade can be applied to the 12864, such that the ".1 cycle per instruction" that you mentioned would indeed be the norm, and for much more than merely 90% of instructions.
-- Vernon, Dec 09 2003


Main memory fetches on modern computers are always done at least 64 bits at a time. Code bandwidth is not as relevant to CPU speed as it used to be, but it's still significant.
-- supercat, Feb 16 2005


what?
-- jaksplat, Feb 16 2005


it makes more sense if you read it again. all of it. out loud. in a pirate voice.
-- benfrost, Feb 16 2005


I knew that page bottom button would come in handy.
-- skinflaps, Feb 16 2005


Hah! Synchronicity - it is precisely 20 years TO THE DAY since I handed in my resignation from my full-time job programming a dual-processor 6809 system. My next job involved programming bit-slice machines with > 512 bits per instruction - VLIW is not new!
<FYI> In 1987, a decent sized disk was in the 100-200 Mbytes range, a bit ahead of your 30 Mbyte in 1991.
-- AbsintheWithoutLeave, Feb 16 2005


[AbsintheWithoutLeave], thanks for the info. Regarding those hard drives, though, are you talking top-of-the-line models obtained by Big Business, or the mainstream models that the average PC owner might be able to afford? I'm pretty sure my 30MB figure more accurately reflects the latter group. (I distinctly recall, about 1996, an acquaintance getting a 500MB drive and shaking his head, saying, "I can't believe how big that is!" He had replaced a crammed-full 20megger, if I recall right.)
-- Vernon, Feb 23 2005


I will obviously be corrected if I am wrong, but wouldn't the power draw of so many more gates being opened at once, instead of in rapid succession as is the norm, be a problem?

If the power is a problem, it could quite severely limit the size of the chip.
-- Giblet, Feb 23 2005


Wow. I can't believe I read the anno's - I didn't even try the idea! But I was darn sure going to leave my name on this edifice, after having a cramp in my arm from scrolling to the bottom.

"My daddy read through the anno's of one of Vernon's longest ideas and all I got was this stupid T-shirt"
-- Ichthus, Feb 23 2005


[vernon] The disk in question in late 1987 was a 115Mbyte ESDI (full height 5.25") in a 386-based IBM PS/2 server (forget the model, possibly an 80) with 1Mbyte of RAM, running the Novell admin network for a company of 25-30 people, so yes, not your average home user. 20Mbyte drives were not uncommon in PCs at the time though.
-- AbsintheWithoutLeave, Feb 23 2005


A BIT OFF TOPIC- but quite entertaining:

12864 is the American ZIP code for Sabael, New York

The word Sabael is a Hebrew word, meaning "the Burden", the Latin Onus. The name of the sixth step of the mystic ladder of Kadosh of the Ancient and Accepted Scottish Rite.

12864 = The Burden

No negative or positive implications implied. (implied implications...hmmmm.)
-- macncheesy, Feb 23 2005


I could see this as a movie.
-- RayfordSteele, Jul 28 2006


Vernon, you have, including annos, written 18,420 words. All I can say is WOW and congratulations and bun.
-- Voice, Sep 29 2008


Oh my... this seriously needs a tl;dr on it. I have a case of don't know what the heck it is!
-- mofosyne, Nov 13 2010


Why not build one in Minecraft? {Linky}... You appear to have the patience to!

tl;dr, BTW - Can I suggest adding Histogram Optimization?
-- Dub, Nov 14 2010


[Dub], I don't know anything about Minecraft. From the quick look I just took, though, it appears that creating a 12864 emulator would be a whole lot easier just using some ordinary computer language like C.
-- Vernon, Nov 14 2010


<rolls eyes>
No dedication, [Vernon]. No dedication...
-- Jinbish, Nov 14 2010


{Starts porting Linux to run on said Minecraft-simulated 12864 }
-- Dub, Nov 14 2010


+
-- VJW, Feb 20 2011
