12864
A microprocessor description I created in early 1991
The 12864 Microprocessor
This is a daydream...let me call it the 12864....
Let me start by saying I am prejudiced in favor of the 6809 microprocessor,
created by Motorola. That it was the best of its day was confirmed when NASA
decided to use it in the Space Shuttle's main computers.
I personally feel that
the 6809 should have had wider acceptance in the personal-computer market, and
that Motorola snubbed its potential by introducing the 68000 too quickly. Only
recently, with the widespread use of the 32-bit microprocessors, has the 6809
really become outclassed. So it is time to move on, time to create a new best
microprocessor of the day. Since this is currently only my own dream, it has
been greatly influenced by what I know of the 6809, and also what I have learned
about the 68020. I do not mean to ignore any worthy contributions from other
microprocessors; that is in fact the main reason for this essay! I am sharing
my dream in hopes that it may be catching....
The 12864 is a 128/64-bit microprocessor. It has 64 address lines, and all
registers are 64 bits wide. But it also has 128 data lines, and this is why:
First, being able to handle this many bits at once means that the 12864 doesn't
need a coprocessor; most coprocessors only handle 80 bits or so. Therefore the
12864 also doesn't need a secondary instruction set telling it how to talk to a
coprocessor. A second reason for having a 128-bit data path leads to further
simplification of the microprocessor: All its instructions have been carefully
designed to fit within 128 bits, so that a single memory-access can provide the
12864 with a whole instruction. To make this still more efficient, the computer
that incorporates a 12864 will be required to have 128-bit-wide memory, and not
the common 8-bit-wide or 9-bit-wide memory of most of today's microcomputers.
This means that the 64-bit Program Counter or PC register is always incremented
just once for each instruction pulled from the memory. The 12864 is not much of
an evolutionary offshoot from previous microprocessors; it's a radical mutation.
Only in the efficiency of its instruction set does it relate to the 6809....
With 128-bit memory, design decisions made in the 6809 and 68020 are greatly
simplified in the 12864. Example: Because the 6809 fetched instructions only 8
bits at a time, there were two distinct groups of Branch instructions: an 8-bit
branch and a 16-bit branch. Machine code that used 8-bit branches as often as
possible was both shorter and faster than code that always used 16-bit branches,
because only one byte of memory and 1 clock-cycle of time was needed for 8-bit
branching-data, while 2 bytes and 2 cycles were needed for 16-bit data. (Not to
mention that 8-bit-branch INSTRUCTIONS were themselves only 8 bits, while most
16-bit branches also had 16-bit opcodes.) And while the 68020 processor has
8-bit, 16-bit, and 32-bit branch instructions, the last, 32-bit type
requires an extra fetch of data from the memory. But the 12864 processor needs
only one size of branch instruction, because any 64-bit branch-distance will
always fit into a one-clock-cycle 128-bit opcode+data fetch.
Likewise, because any 64-bit address in the memory can be part of a 128-bit
fetch, there is no longer any need for a special Direct Page or DP register. In
the 6809 the DP register offered an 8-bit way to access part of the memory; thus
the longer and slower 16-bit way of specifying memory locations did not always
have to be used. This is not a problem in the 12864.
Now what about the choice to use 64-bit-addressing? This represents about
18.4 quintillion addresses (18,446,744,073,709,551,616 addresses, to be exact),
far beyond any reasonable projection of any computer's memory needs -- including
virtual memory! Not to mention that since each address holds 128 bits of data,
we are actually talking about 295 quintillion (8-bit) bytes of memory!
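Those figures are easy to check (a quick Python sketch, purely illustrative):

```python
# 64 address lines -> 2**64 distinct addresses; each address
# holds 128 bits = 16 bytes, giving 2**68 bytes in all.
addresses = 2 ** 64
total_bytes = addresses * 16

assert addresses == 18_446_744_073_709_551_616
assert total_bytes == 295_147_905_179_352_825_856   # ~295 quintillion
```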
Nevertheless, there are some possibly valid reasons for this choice: First,
since the design of this processor is not yet completely fixed, and belongs to
nobody, it might be that it could tickle the fancy of a number of different chip
manufacturers, and lead to an Industry-Wide Standard Design. Naturally, it makes
sense for the 12864 assembly language instruction set to become standardized and
non-proprietary, also. Therefore a second reason for choosing 64-bit addressing
is simply that it would take longer to put this complex chip into production --
and that hopefully gives the software developers plenty of time to convert their
existing software to run on this admittedly incompatible processor. Thus, both
the new computers and their software could arrive at the same time! Finally, a
third reason for jumping straight to 64-bit addressing is that the architecture
of the new computers can be designed with that in mind. Simply because 64 bits
represents such a tremendous enhancement, making it the immediate goal means it
can remain a standard far into the future....
Now let's get into some of the details of the 12864. The total number of
registers of all types will be about 45, give or take a few. This number can
be decided after the Condition-Code/Status Register has had its bits defined.
As stated earlier, every register is 64 bits wide, including CCS. In the CCS
register a number of bits are necessary for various processor functions; just
how many depends on the total list of functions that will be designed. For the
purposes of this essay, let us examine the CCS register of the 68020: It is 16
bits wide, of which 12 are defined and 4 are undefined. If we start with a 64-
bit CCS and only use 12 of them for such things as result-of-instruction flags,
interrupt masks, etc, then that leaves 52 bits that can be equated to the entire
register set of the 12864 microprocessor. However, it is certain that some of
those 52 bits will be dedicated to other processor functions (but I don't know
what other dreamers will add to this), and so the number of registers is yet unknown.
In case you are wondering why match the bits of CCS with the register set,
the answer involves the interrupt system. Whenever an interrupt or exception or
other special event occurs, the processor can automatically save on a stack all
the registers that are specified in the CCS register. The processor saves time
because none of those interrupt-type handling routines need include instructions
to specifically save and recover the registers they use. In fact, if the 12864
computer system's main power-up/initialization routine includes defining such a
list of registers in CCS, then all interrupt-type routines can be written using
only those registers. Different boot software, different registers. Note that
2 registers, the Program Counter and CCS, which ALWAYS are saved, do NOT need to
be matched to bits in CCS, and so the 12864 can have 2 more registers than the
simple count of available bits in CCS implies.
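The automatic save might be sketched like this (a rough illustration; the mask-to-register mapping and all the names are my own assumptions, not part of the design above):

```python
# Each (hypothetical) register-list bit in CCS selects one register
# to be pushed automatically on an interrupt.  PC and CCS are always
# saved regardless, so they need no bits of their own.
def registers_to_save(ccs_mask, mapped_regs):
    saved = ["PC", "CCS"]                    # always saved
    for bit, reg in enumerate(mapped_regs):
        if ccs_mask & (1 << bit):
            saved.append(reg)
    return saved

regs = ["R0", "R1", "R2", "R3"]
assert registers_to_save(0b0101, regs) == ["PC", "CCS", "R0", "R2"]
```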
The next thing to discuss is the actual list of registers. A major element
in the design of the 12864 is that as far as the programming instruction set is
concerned, all registers are treated equal. But as far as the microcode and the
hardware is concerned, some are more equal than others.... For the sake of this
discussion, let us assume that there are 45 registers, numbered from 0 to 44.
Suppose that Register 33 is the Program Counter, while Register 17 is just an
ordinary general-purpose register. The hardware will always use 33 as a pointer
to the current instruction about to be executed, and the hardware will always
adjust 33 to point it at the appropriate next instruction. But the instruction
set will not distinguish 33 from 17! A Logical-OR instruction that manipulates
a group of bits inside 17 can just as easily manipulate bits inside 33, simply
by specifying 33 instead of 17, in the Logical-Or instruction. Just because
this is something that might be disastrous to the program is no reason to keep
it from being possible! Let the assembly-language programming tool be written
so that it catches such dubious instructions, and warns the programmer! The big
advantage of this scheme is that it leads to an extremely significant reduction
in the total complexity of both the instruction set and the microcode. Examples
later on in this essay may make this more clear.
Let us now examine the bit-format of some of the instructions. By far the
majority of the instructions will have a single format that offers astounding
programming potential (well, what do you expect with 128 bits to play with!)....
Actually, most of this instruction-group format fits into 64 bits, numbered 0 to
63, and defined as follows:
Bits 63-58: These 6 bits hold the actual generic instruction. Of course
this means that there are only 64 such instructions, but if you have any doubts
about this being enough, you don't yet realize how generic they are!
Bits 57-46 are divided into three groups of 4 bits each, hereinafter to be
referred to as 'admode fields', short for 'addressing mode'. Since these fields
have 4 bits, it follows that there are 16 different addressing modes. They will
be explained shortly. The first admode field, bits 57-54, tells the processor
where to find the first chunk of data needed for some instruction, say a SUB.
The second admode field, bits 53-50, tells the 12864 where to find the second
chunk of data; obviously a SUB instruction needs data that can be subtracted.
And the third admode field, bits 49-46, tells the processor where to put the
result of the SUB. Perhaps you now see that with 16 addressing modes for each
admode field, a simple generic SUB instruction can encompass both registers and
memory in quite a few different combinations!
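To make the field layout concrete, here is an illustrative Python sketch of packing and unpacking just those top 18 bits (the helper names and the sample opcode value are my own assumptions, not part of the design):

```python
# Bits 63-58: opcode; 57-54: GET1; 53-50: GET2; 49-46: PUT,
# exactly as laid out above.
def pack_top_fields(opcode, get1, get2, put):
    assert 0 <= opcode < 64 and all(0 <= f < 16 for f in (get1, get2, put))
    return (opcode << 58) | (get1 << 54) | (get2 << 50) | (put << 46)

def unpack_top_fields(word):
    return ((word >> 58) & 0x3F, (word >> 54) & 0xF,
            (word >> 50) & 0xF, (word >> 46) & 0xF)

word = pack_top_fields(0b000101, 0, 3, 2)   # hypothetical SUB encoding
assert unpack_top_fields(word) == (0b000101, 0, 3, 2)
```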
For convenience, let us call the admode fields GET1, GET2, and PUT. A
list of proposed addressing modes follows, and if it is adopted, there will be a
few restrictions on the use of two of them. Should the list be modified during
later design stages of the 12864, these restrictions may still apply. The modes
subject to restriction are marked with * symbols; the limitations are detailed
at the end of the list. The admodes are numbered 0 to 15 in binary.
Direct Modes 0 to 3
 0000  Register Data + 16\9\10\3bit Offset
 0001  Register Data + 16\9\10\3bit Adjust
 0010  GET1=PUTmode, or GET2 or PUT=NONE
*0011  Immediate 64bit Data

Semi-Direct Modes 4 to 7
 0100  Reg Address + 16\9bit Offset
 0101  Reg Address + 16\9bit Adjust
 0110  Reg Addr + (Reg+10\3bit Adj) Offset
*0111  Absolute 64bit Address

Indirect Modes 8 to 15
 1000  [Reg + 16\9bit Offset], LSig64bits
 1001  [Reg + 16\9bit Adjust], LSig64bits
 1010  [Reg + (Reg+10\3bit Adj) Offst], LS64
 1011  [Reg], LS64 + (Reg+10\3bit Adj) Offst
 1100  [Reg + 16\9bit Offset], MSig64bits
 1101  [Reg + 16\9bit Adjust], MSig64bits
 1110  [Reg + (Reg+10\3bit Adj) Offst], MS64
 1111  [Reg], MS64 + (Reg+10\3bit Adj) Offst
Explanations
Register Data: The value in a register is considered to be data.
Reg Address: The value in a register is considered to be an address.
16\9bit, 10\3bit: 16 or 9 or 10 or 3 bits of twos-complement information, sign-extended to 64 bits (internally) by 12864 processor.
64bit: 64 bits of information fetched with the 64-bit generic instruction.
Adjust: Value in register is modified, using twos-complement information. If info is negative, register adjusted BEFORE instruction executed. If info is positive, register adjusted AFTER instruction executed.
Offset: Similar to Adjust, but register not modified. Computation of the Offset is always performed before instruction is executed.
NONE: No data or address at all.
[]: Value inside brackets is an address. Information at that address is in turn used as an address.
,LSig64bits ,LS64: An address holds 128 bits of data, of which the Least Significant 64 bits are selected for the instruction.
,MSig64bits ,MS64: The Most Significant 64 bits at an address.
(Reg): Distinguishes a second register that this addressing mode uses.
* Recall the design decision that limits instructions to 128 bits, including 64 bits of Immediate Data or Absolute Address. It's quite obvious that only one Admode Field can get to use those 64 bits. It also works out that if different admodes exist in all three Admode Fields, then none of the *-marked admodes may be placed in any Admode Field. And in any instruction that uses data acquired through the GET2 field, or in which GET1 is different from PUT...such instructions exclude the Absolute-64bit-Address mode from the PUT field. (Of course, Immediate Data mode is always excluded from the PUT field.)
More details of these limits will be provided later; for now it might be noted that the reason that no 64bit-Offset modes exist is to avoid a lot of trouble. It makes the programmer use more registers for indexing, but eliminates much competition between the Admode Fields for the use of the 64 bits that accompany the instruction. Besides, you might be surprised by how well other instructions can replace any 64-bit Offset modes!
Anyway, the 12864 processor will probably have 30 or more general-purpose registers (registers the hardware doesn't always modify for specific purposes, like CCS or the Stacks---or have their contents used for other purposes, like pointing at cache or program data). It may be easy to find enough available registers for most address-pointing.
Now for some descriptions of the 16 admodes and their consequences:
Direct Modes 0 to 3 all specify data the 12864 processor has on hand, in
a register, or just loaded along-with or as-part-of the instruction. Obviously,
these modes can be executed more quickly than the Semi-Direct or Indirect Modes.
(0) In this admode the data needed by the current instruction is in one
or two of the registers. The 12864 processor has 128 data lines; just because
every register is only 64 bits wide is no reason to limit its ability to process
128 bits. ANY TWO registers may be put together, in any order, to make a place
that holds 128 bits! (Okay, I exaggerated; the 12864 will have both a 'Boss'
mode and a 'Peon' mode. Only the Boss mode can put ANY two registers together;
in the Peon mode a lot of combinations will be illegal. And, even in the Boss
mode, a lot of combinations will be undesirable, like using a Stack pointer with
a Cache-pointing register; the 12864 Assembler would warn the programmer.) Note
that admode 0 merely declares that one or two registers will be used; the actual
register(s) specified are elsewhere among the many bits of this generic format.
After the processor identifies the register(s) holding the data, an offset will
be applied to that data. The offset quantity shall be used by the 12864 in its
implementation of the instruction; the register(s) holding the data will not be
affected by the offset. The maximum size of the offset is affected by how many
registers are used, and by the type of generic instruction being performed; the
details of this will be provided later. The main purpose of admode 0 is to let
us eliminate the LEA (load effective address) instructions from the processor's
list of 64 generics--but certainly other uses will be found for it.
(1) This admode is very much like admode 0. The only real difference is
that the content of the register(s) IS affected by this admode, which makes the
mode useful in counting loops. One thing to keep in mind is that any negative
adjustment is performed before the whole instruction is implemented, while any
positive adjustment is performed after the overall instruction is implemented.
This admode also helps us eliminate LEA instructions (details later).
(2) This is the only admode with a double meaning. If admode 2 is used
within the GET1 field, then it means that the first chunk of data, needed by the
instruction, is currently in the place specified by the PUT field. Thus data at
some location, after manipulation, will return to that location. If we exactly
specify the same admode in both the GET1 and PUT fields (instead of using admode
2 in GET1), we end up being unable to use Immediate Data at all--you'll see! If
admode 2 is used in the GET2 field, then it means NONE, no data for that part of
the instruction. Operations like LSH (logical shift) use admode 2 in GET2; they
need only one main data chunk since any other data is part of the instruction's
definition. In fact, if any admode besides 2 is in GET2 during a LSH or similar
instruction, then the admode should be ignored, or declared illegal. If admode
2 is in GET2 during a SUB or similar instruction, then the net effect of the SUB
will be equivalent to a TST instruction. (With lots of TST-equivalents, there
need not be a specific TST among the 64 generics. But the 12864 Assembler may
include a TST, and translate it into an equivalent.) Finally, admode 2 in the
PUT field also means NONE, no address. The computed result of the SUB or other
manipulation is not put anywhere, and this is useful, too! The definition of a
CMP (compare) is exactly a SUB that doesn't save the result! So the CMP becomes
another common instruction that the 12864 processor excludes from its list of 64
generics.... Like TST, the 12864 Assembler can include CMP, and translate it to
an equivalent: a destinationless SUB. Similarly, the 6809 BIT operation is an
AND instruction with no destination. Designers unite! The 12864 has a full set
of destinationless instructions--and no extra complexity! Moving on, suppose
admode 2 is in both GET1 and PUT: This is basically a no-operation, NOP. Lots
of ways exist to do a NOP; the 12864 Assembler can include NOP, and translate.
(3) This is the Immediate Data admode. Since the instruction is often
64 bits long, while 128 bits are always fetched from memory, admode 3 tells the
instruction to use as data the group of 64 bits fetched with the instruction.
Semi-Direct Modes 4 to 7 all specify that the data the processor has on
hand are memory-addresses of the data needed by the instruction. Admodes 4 to 7
are slower than the Direct Modes because the 12864 has to go fetch the data from
the memory, but this process is still faster than the Indirect Modes.
(4) This admode is like admode 0 in operation. The main difference is
that only one register is ever specified, since one register holds 64 bits and
the memory addressing range is 64 bits. But the offset is figured the same way
as admode 0, and the value in the register is not changed. As mentioned, the
result of the offset computation is a memory address; the data at that location
is fetched for use by the instruction.
(5) This admode combines features of admode 1 and admode 4. Again only
one register is specified as an address-pointer, or index (4). An adjustment of
the value in that register will be applied, pre-decrement or post-increment (1).
If you review admodes 1 and 4, this one should be pretty obvious.
(6) The basic addressing mode for doing 64-bit (or any size larger than
16-bit) offsets is admode 6. One register is specified as a pointer (index) to
the general region of memory; a second register is specified that will hold the
offset from the general place to any specific place. Furthermore, this second
register can be given a predecrement or postincrement adjustment, which makes it
easy to skip through tables of data. Note that although the second register is
adjustable, its value is only an offset; the first register remains unchanged.
(7) This admode specifies that the 64 bits fetched along with the 64-bit
generic instruction are the absolute memory address of data the instruction needs.
Indirect Modes 8 to 11 are quite like Indirect Modes 12 to 15: They are
computed the same way, but at some point the data at an address is used as an
address. Now since the data is always 128 bits and addresses are only 64 bits,
which 64 of the 128 do we use? Thus admodes 8-11 use the Least Significant 64
bits of the 128, while admodes 12-15 use the Most Significant 64 bits.
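Selecting a 64-bit half of a 128-bit memory word is simple masking and shifting (an illustrative sketch):

```python
# Admodes 8-11 take the Least Significant 64 bits of the fetched
# 128-bit word; admodes 12-15 take the Most Significant 64 bits.
MASK64 = (1 << 64) - 1

def ls64(word128):
    return word128 & MASK64

def ms64(word128):
    return (word128 >> 64) & MASK64

word = (0xABCD << 64) | 0x1234
assert ls64(word) == 0x1234
assert ms64(word) == 0xABCD
```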
Note that all the admodes that use registers as indexes let the Program
Counter be used as easily as any other register. The 12864 processor needs no
special microcode to provide a host of Program-Counter-Relative admodes, due to
basic design decision making the instruction set handle all registers equally.
The trick to consistency is for the processor to apply any adjustment or offset
to the chosen index register AFTER incrementing the PC past the current operation.
This in turn works due to the design choice to make ALL the instructions fit in 128 bits.
Nevertheless, the 12864 Assembler may specifically distinguish Program-Counter-
Relative admodes from the other admodes, and translate appropriately. Finally,
note that it may be undesirable to use the PC register as a data-pointer in any
admode that will adjust the value of the index!
(8) This admode first computes an address in exactly the same way as
admode 4. The 12864 processor then fetches the lowest 64 bits from the memory
at that address, and uses this information as another address. The instruction
will use the data in the memory at the second address.
(9) This admode first computes an address in exactly the same way as
admode 5. Then an address is fetched, and then data, as just described.
(10) This admode first computes an address in exactly the same way as
admode 6. Then an address is fetched, and then data, as just described.
(11) This addressing mode starts by using the value in a register as
an address. The 64 lowest bits in the memory at that address are fetched; they
will be used as a second address. However, before they are used, an offset will
be applied to that second address. A second register is specified, along with
an adjustment. The value in this predecremented/postincremented register is the
64-bit offset that is applied to the second address; the first register's value,
and the memory that held the second address, are not changed by this process.
After computing the new, offset address, the 12864 processor fetches 128 bits of
data from that location in the memory, for the current instruction.
(12) This admode first computes an address in exactly the same way as
admode 4. The 12864 processor then fetches the highest 64 bits from the memory
at that address, and uses this information as another address. The instruction
will use the data in the memory at the second address.
(13) This admode first computes an address in exactly the same way as
admode 5. Then an address is fetched, and then data, as just described.
(14) This admode first computes an address in exactly the same way as
admode 6. Then an address is fetched, and then data, as just described.
(15) This addressing mode starts by using the value in a register as
an address. The 64 highest bits in the memory at that address are fetched; they
will be used as a second address. Then everything proceeds just like admode 11.
Now to show how LEA (load effective address) needn't be included among
the 64 generics. Consider admode 15: At the end of its computations the 12864
processor has an address which it normally uses, right now, to fetch data, after
which the address is not saved. LEA creates that address and saves it for later
use and re-use (doesn't use it now). Suppose admode 15 specifies register 10
(first), register 7 (second), and an adjustment of -58. The Assembler translates
LEA (with syntax specifying admode 15 and the register info) into an ADD: The
GET1 field is given admode 4, register 10, and a 0 offset; the processor fetches
128 bits from the address (part of generic instruction we haven't got to lets us
select correct 64 bits). GET2 field is given admode 1, register 7, and adjust
of -58; the processor modifies the register and gives its content to the ADD.
Then the PUT field specifies where to save result. GET2 might have admode 0 and
get result without modifying register 7. Any LEA can be translated!
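The translation just described can be written out as data (a Python sketch; the record format and function name are mine, while the field values come straight from the example above):

```python
# LEA via admode 15 (base register, adjusted offset register)
# becomes an ADD whose GET1 uses admode 4 with a zero offset and
# whose GET2 uses admode 1 with the adjustment.  The PUT field is
# whatever the LEA's destination was, so it is left to the caller.
def translate_lea(base_reg, adj_reg, adjust):
    return {
        "op":   "ADD",
        "GET1": {"admode": 4, "reg": base_reg, "offset": 0},
        "GET2": {"admode": 1, "reg": adj_reg, "adjust": adjust},
    }

insn = translate_lea(10, 7, -58)
assert insn["GET1"] == {"admode": 4, "reg": 10, "offset": 0}
assert insn["GET2"]["adjust"] == -58
```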
At last we can continue the bit-designations of the generic instruction
format. Been about 1000 bytes per bit of explanation, so far...!
Bits 45-39 specify a Bitfield Size for the instruction. These 7 bits can
hold any number from 0 to 127, and with 0 being interpreted by the processor as
128, it becomes possible for the instruction to operate on any data size from 1
to 128 bits. Even though the registers of the 12864 microprocessor are only 64
bits wide, its Arithmetic/Logic Unit is 128 bits wide, and is able to handle any
data size smoothly. So if the Bitfield Size is 79, then 79 bits will be taken
from the place specified via GET1, manipulated (if the instruction requires it)
with 79 bits from the place that GET2 indicates, and finally a 79-bit result is
sent to the place described by PUT. The 12864 Assembler considers the Bitfield
Size to be optional information; if it is not provided by the programmer, a size
of 128 bits will be assumed. Some Assembler instructions, like LEA, default to
64 bits due to the nature of the instruction (LEA computes a 64-bit address).
MUL will always have two 64-bit inputs and one 128-bit output; DIV will always
have a 128-bit dividend, a 64-bit divisor, and a 128-bit quotient. And whenever
Immediate Data is specified, then either the whole instruction must be limited
to 64 bits, or the processor must allow 64 bits to be used in the manipulation
of 128 bits. (Perhaps we can have both: The processor can have the ability to
do the latter, while the Assembler lets the programmer decide the former.) One
way the programmer can set the Assembler's default to 64 bits would be to simply
specify only one data-holding register in an instruction's syntax.
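The 0-means-128 convention for the Bitfield Size field is a one-line rule (illustrative Python):

```python
# The 7-bit Bitfield Size field, with 0 interpreted as 128 --
# exactly the rule stated above.
def bitfield_size(field):
    assert 0 <= field <= 127
    return 128 if field == 0 else field

assert bitfield_size(0) == 128
assert bitfield_size(79) == 79
```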
Bit 38 of the generic instruction is the Signed Extension Flag. It tells
the processor to treat the result of an operation as a twos-complement number,
if this bit is set. When the result is PUT into its destination, its negative-
ness or positive-ness, as it exists within the Bitfield Size, is extended out to
the Bit-127-mark (the Most Significant Bit is numbered 127; the Least is 0). If
only one register is specified, then sign-extending the result out to the Bit-63
-mark is the thing to do. If the Signed Extension Flag is not set, the result
of the instruction is simply PUT into its destination, and nothing else is done.
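The extension rule amounts to ordinary sign extension of the bitfield-sized result (an illustrative sketch; the function name is mine):

```python
# Extend the sign of a `size`-bit result out to the Bit-127 mark
# (or Bit-63, if only one register is the destination).
def sign_extend(value, size, width=128):
    value &= (1 << size) - 1           # keep only the bitfield
    if value & (1 << (size - 1)):      # sign bit set?
        value -= 1 << size             # extend the negative-ness upward
    return value & ((1 << width) - 1)

assert sign_extend(0b101, 3) == (1 << 128) - 3   # -3, extended to 128 bits
assert sign_extend(0b011, 3) == 3                # positive: unchanged
```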
Bits 37-34 contain the Do-If condition. Practically the whole instruction
set of the 12864 processor is conditional. This lets the programmer avoid a lot
of conditional-Branches that only skip past a few instructions. Where formerly
some code might have: BCS (branch if carry set) followed by a ROT (rotate) that
would be executed if the carry flag was clear, now we can specify Do-the-ROT-If
Carry Clear, and delete the Branch entirely. In fact, with these 4 bits we can
delete the entire collection of Branch operations from the generic instruction
set of the 12864! The Assembler simply translates any Branch to ADD Immediate
Data to the Program Counter, and sets the appropriate Do-If condition-bits. Of
course, most of the time, most instructions will set the Do-If to ALWAYS. With
only 4 bits, only 16 conditions are allowed. This is enough for Motorola's 6809
and 68020; I hope the final design of the 12864 processor won't require more.
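The Do-If idea can be sketched in a few lines (Python; the condition names and the flag dictionary are illustrative assumptions):

```python
# A conditional branch becomes an unconditional "ADD Immediate to PC"
# whose Do-If field gates whether it executes at all.
def do_if(condition, flags):
    table = {
        "ALWAYS":      lambda f: True,
        "CARRY_SET":   lambda f: f["C"],
        "CARRY_CLEAR": lambda f: not f["C"],
        # ...up to 16 conditions fit in the 4-bit field
    }
    return table[condition](flags)

pc, flags = 1000, {"C": False}
# BCC +5 translated to: Do-If CARRY_CLEAR, ADD 5 to PC
if do_if("CARRY_CLEAR", flags):
    pc += 5
assert pc == 1005
```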
Bits 33-29 are the Flag Mask bits, the other side of the coin from the Do-
If conditions. If every instruction can be controlled by the flags in the CCS
register, it follows that every instruction should be able to specify which CCS
flags, if any, will be affected as a result of its implementation. In fact, for
the Branch instructions to be properly deleted from the generic instruction set,
it is essential that flag-masking be possible. Traditionally, Branch operations
never affect any flags; translating them into ADD instructions makes it obvious
why we require flag-masking. Now consider again the Do-the-ROT-If Carry Clear
that was previously described: What if the instruction after the ROT is also to
be executed only if the Carry flag is clear? A ROT normally affects the Carry
flag! So we mask the flag; the next instruction can also Do-If Carry Clear. In
the 6809 and the 68020 there are only 5 conditions-of-results flags; I hope the
final design of the 12864 processor won't require more.
Bits 28-0 (yes, all the rest) are devoted to the details of the PUT field.
However, the highest seven of them, Bits 28-22, can have another purpose. There
is a group of operations that perform what we might call 'minor manipulations',
and which may need some minor data. The generic instructions of this class that
I have so far identified are, in alphabetical order: ASL and ASR (arithmetic
shift left and right), COPY, INIT (initialize), ISUB (subtract from an initial
value), LSR (logical shift right; LSL = ASL), ROL and ROR (rotations), and
SWAP. ASL, ASR, LSR, ROL, and ROR need data ranging from 1 to 128; INIT, ISUB,
and sometimes COPY, need twos-complement numbers ranging from -64 to +63. The
specification of 7 bits was decided by the needs of ASL, ASR, LSR, ROL, and ROR;
other instructions are merely taking advantage of what is already there. Only
SWAP does not need any of those seven data bits. We could have assigned eight
bits to ASL, etc.; twos-complement numbers from -128 to +127 (with zero = +128)
would let us reduce the list of generic instructions even more. Unfortunately,
we are running out of bits! So we can either assign 5 of 64 generic operations
to various kinds of bit-shift, and use 7 bits to describe the size of the shift
--or we can have 3 generic shift operations and use 8 bits to describe the size
of the shift. But ONLY those 3 generic instructions ever really need that 8th
bit! It seems more reasonable to use an extra 2 of the 64 generic instructions.
Let's examine some of the capabilities of these 'minor' manipulators:
ASL (and the identical LSL) merely shift bits from Least Significant to
Most Significant. The Bitfield Size determines how many bit-positions will be
involved in the shift. There is also some Bitfield Start data (which we haven't
got to yet, but has to be mentioned NOW) that specifies exactly where among the
128 bits the Bitfield Size is located. The 12864 Assembler needs to scrutinize
these things carefully; we can't let Bit 100 be the Start while the Size is 34
bits, nor let the Size be 52 bits while the Shift is 73 bits! One final thing
about ASL and LSL: Perhaps they shouldn't be so identical. The 6809 processor
defines them so that there's no reasonable difference between an ASL and an LSL.
But the 68020 places a new flag in the CCS register, an eXtend flag designed to
hold a bit of data specifically for arithmetic operations. The Carry flag holds
data for both arithmetic and logical operations. Yet LSL and ASL both affect
the X flag! So perhaps a distinction can reasonably be made: Only ASL should
affect X. (ASR and LSR also have this small irrationality.)
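The Assembler checks just mentioned amount to two comparisons (an illustrative sketch; the function name is mine):

```python
# The bitfield must fit inside the 128 bits, and the shift count
# cannot exceed the size of the bitfield being shifted.
def check_shift(start, size, shift):
    if start + size > 128:
        raise ValueError("bitfield runs past bit 127")
    if shift > size:
        raise ValueError("shift larger than the bitfield")

check_shift(start=0, size=52, shift=40)       # fine
try:
    check_shift(start=100, size=34, shift=1)  # the bad example above
    raise AssertionError("should have been rejected")
except ValueError:
    pass
```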
ASR and LSR are similar to ASL, of course, their main difference being
that these instructions shift bits from Most Significant to Least. More details
of what they do need not be presented here; they all are common instructions.
But we might note that the power of the 12864 processor lets us get data from
just about any place in the computer (using GET1), shift or otherwise manipulate
any part of that data, and then PUT the result almost anywhere else, all in just
one instruction. The mundane turns into the extraordinary.
INIT lets us initialize a register, or registers, or data at any memory
location, such that it becomes a 64-bit or a 128-bit expansion of any number in
the 7-bit range of -64 to +63. INIT replaces CLR (clear), which initializes a
data-storage place to zero only; now we can initialize to 1, or -1, or to any of
more than a hundred possibilities. Note that INIT never needs any GET1 info.
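A sketch of INIT's job in Python (the 7-bit two's-complement encoding of the Initial Value field is my assumption about how the bits would be read):

```python
def init_value(imm7, width=64):
    """Expand a 7-bit two's-complement immediate (-64..+63) into the
    width-bit value that INIT would store at the destination."""
    assert 0 <= imm7 < 128
    value = imm7 - 128 if imm7 >= 64 else imm7   # decode the 7 bits
    return value & ((1 << width) - 1)            # sign-extend to width bits

assert init_value(0) == 0                   # the old CLR
assert init_value(0x7F) == (1 << 64) - 1    # -1: all 64 bits set
```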
ISUB replaces both NEG (negate) and NOT, which respectively subtract a
number from 0 or -1. ISUB subtracts numbers from anything in the Initial Value
range of -64 to +63. Finding other uses for this operation is not so important
as consolidating NEG and NOT into one generic instruction. The Assembler will,
of course, retain both NEG and NOT, and translate appropriately.
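The consolidation works because NEG x is 0 - x and NOT x is -1 - x; a quick sketch, assuming 64-bit two's-complement arithmetic:

```python
MASK64 = (1 << 64) - 1

def isub(init, x):
    """Subtract x from a small Initial Value init (-64..+63),
    keeping the result in 64-bit two's complement."""
    assert -64 <= init <= 63
    return (init - x) & MASK64

x = 0x1234
assert isub(0, x) == (-x) & MASK64     # NEG: subtract from zero
assert isub(-1, x) == (~x) & MASK64    # NOT: subtract from minus one
```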
ROL and ROR are pretty much like the shift instructions. The 68020 has
another sort of rotation called ROXL and ROXR, but the 12864 may not need them.
First examine the rotation operation of the 6809: The Carry flag is always part
of the rotation; a bit coming off one end of a byte is moved to Carry by one ROT
and moved out of Carry back into a byte by another ROT. In the 68020 a simple
ROT moves a bit from one end of a location directly to the other end; a copy of
that bit is placed in the Carry Flag. ROX, on the other hand, uses the eXtend
flag the same way the 6809 uses the Carry. In the 12864 processor we can mask
flags that an instruction would normally affect. Suppose the 12864 rotation is
designed to normally flag both X and Carry: If we mask Carry, no copy is sent
there; if we mask X, the bit that normally moves through it simply bypasses it.
(Similarly, the 12864 can have one generic ASL/LSL operation, but the Assembler
can mask the X flag for LSL--if the notion proposed earlier is adopted.)
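Here is a sketch of the two rotation styles, and of how masking the X flag would turn the 6809-style rotate-through-flag into the 68020-style simple rotate (the width parameter and the return convention are my own):

```python
def rol1(value, flag, width, through_flag):
    """Rotate left by one bit.  With through_flag True, the flag bit
    is part of the loop (6809 style: the old flag enters Bit 0, the
    top bit exits into the flag).  With through_flag False -- the
    masked-X case -- the top bit wraps straight around to Bit 0, and
    the flag just receives a copy.  Returns (new_value, new_flag)."""
    top = (value >> (width - 1)) & 1
    rest = (value << 1) & ((1 << width) - 1)
    if through_flag:
        return rest | flag, top
    return rest | top, top

assert rol1(0b1000_0000, 0, 8, through_flag=True) == (0, 1)
assert rol1(0b1000_0000, 0, 8, through_flag=False) == (1, 1)
```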
The COPY instruction replaces LoaD, STore, TransFeR, INC, DEC, JMP from
the 6809, and COPY also replaces MOVE from the 68020. Even some LEA operations
can be translated to COPY. The GET1 admode field lets us specify any place in a
12864-based computer from which to fetch data (and any number of bits from 1 to
128); the PUT field lets us specify almost any other place to receive a copy of
those bits of data. What could be simpler and more powerful? To replace INC
and DEC, GET1 can specify admode 2 -- same as PUT. When GET1 holds 2 while COPY
is being processed, the 7-bit Initialize-data will be used to modify the place
specified by PUT. Instead of only -1 or +1, the INC/DEC can now range from -64
to +63 -- even to +64 if the value of zero is interpreted thus (it's no good for
anything else!). Some LEA instructions that the Assembler translates into COPYs
will have admode 0 in GET1, a specified register, and a 16-bit offset ranging
from -32768 to +32767. PUT would specify the same admode and register, and an
offset of zero. Masking the flags is normal for LEA. Larger offsets can become
ADD Immediate Data to a register, with the flags masked. JMP instructions are
translated into COPY to the PC register, with masked flags--and remember that
any JMP can now be conditional! Load and store and transfer and MOVE operations
become COPY memory to register, reg. to mem., reg. to reg., and mem. to mem.
Another 68020 instruction, PEA (push effective address), may be unneeded in the
12864. It has the effect of computing an address and saving it in a place that is
NOT a register, for later use (most likely by the Program Counter, since there
isn't a LEA-to-PC instruction in the 68020). In the 12864 processor, we simply
specify the Program Counter's register-number in the PUT-field data if we want
to LEA-to-PC. Otherwise we can PUT the EA almost anywhere else, for later use.
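The zero-means-+64 trick for the INC/DEC step might decode like this (my assumption of how that special case would be wired):

```python
def incdec_step(imm7):
    """Decode the 7-bit INC/DEC amount: 1..63 and -64..-1 come out
    as ordinary two's complement, and the otherwise-useless encoding
    of zero is reinterpreted as +64."""
    assert 0 <= imm7 < 128
    if imm7 == 0:
        return 64                              # reclaim the wasted code
    return imm7 - 128 if imm7 >= 64 else imm7

assert incdec_step(1) == 1        # a plain INC
assert incdec_step(0x7F) == -1    # a plain DEC
assert incdec_step(0) == 64       # the reclaimed encoding
```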
SWAP is similar to COPY, in that the GET1 data specifies one place while
the PUT data specifies another. However, as the names imply, they do different
things: The 12864 SWAP replaces both the 68020 SWAP and EXG (exchange); data in
the PUT place is sent to the GET1 place, as well as the usual GET1-to-PUT. Two
things to note about SWAP are that register-adjustments of zero, in the specified
admodes, will probably be common, and the CCS flags will usually be masked. But
consider that if the GET1 admode is 2 (same as PUT), then nothing happens. This
may be the ideal thing for the Assembler to translate a NOP into. And if the
flags are NOT masked while the GET1 admode is 2, during the generic SWAP, then
this may be the ideal thing for the Assembler to translate a TST into. (If the
flags aren't masked during a normal SWAP, then they will be affected only by the
data going from the GET1 place to the PUT place.)
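The SWAP-as-NOP and SWAP-as-TST idea can be sketched like this (the dictionary-as-storage model and the flags returned are just illustration):

```python
def swap(store, get1, put, set_flags):
    """Exchange store[get1] and store[put].  When get1 == put
    (admode 2) nothing moves: with flags masked it is a NOP, with
    flags live it is a TST of the data.  If flags are live, only the
    data going from the GET1 place to the PUT place affects them."""
    store[get1], store[put] = store[put], store[get1]
    if not set_flags:
        return None                        # the NOP translation
    v = store[put]                         # the old GET1 data
    return {'Z': v == 0, 'N': v < 0}       # the TST translation

m = {'a': 7, 'b': 9}
assert swap(m, 'a', 'b', set_flags=False) is None and m == {'a': 9, 'b': 7}
assert swap(m, 'a', 'a', set_flags=True) == {'Z': False, 'N': False}
```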
Now back to Bits 28-0 of the generic instruction; as mentioned, they hold
the details of the PUT field data; we shall begin with Bits 0-6. These specify
the Bitfield Start for the PUT field, from 0 to 127. After the 12864 processor
analyzes the identity of the place where the result of an instruction is to be
PUT, the Bitfield Start tells it exactly where in that location the result goes.
For most instructions, most of the time, the value here will be Zero.
Bits 7-12 specify the number of the first register needed to identify the
place where the result is PUT. In other words, if Register 7 is the destination
of the data, then a 7 will be here (admode 1 in the PUT field). To modify flag
bits in the CCS register, simply set a Bitfield Size of 5 (for 5 flags), the CCS
register's number here, and a Bitfield Start of zero (assuming the designers put
the CCS flags in the lowest bit-positions of the register). If a memory address
indexed by Register 15 is the data's destination (admode 4 or 5 in PUT), then 15
will be the number placed here. Bits 7-12 can hold any number from 0 to 63, and
as mentioned early in this essay, the 12864 will probably only have 45 registers
or so, total. Anything more than the highest register number would be illegal,
of course, even in the Boss mode! If admode 2 or admode 7 is specified in the
PUT field, then the processor would ignore any register-number in these bits.
Admode 3 would be another such case, except that it is illegal in the PUT field.
Bits 13-28 specify the offset or adjustment to be applied to the register
indicated in Bits 7-12. At least this is true for instructions OTHER than
ASL, ASR, COPY, etc., since only those OTHER instructions leave the 7 bits from
22-28 free. An index register being used with a ROL instruction can only have a
nine-bit offset or adjustment applied to it (in Bits 13-21, of course). HOWEVER, it
can be worse! Bits 13-18 may specify a second register altogether! For admodes
0 and 1, with any Bitfield Size of 65 or more, we must specify 2 registers. For admodes
6, 10, 11, 14, and 15, a second register is a normal part of address-indexing.
(At least those admodes get a 64-bit offset from the second register, applied to
the first register.) After a second register has been specified, only the bits
from 19-28, or from 19-21, can be used as an offset or adjustment to the second
register (a 10-bit or a 3-bit modification, respectively). Here is a chart:
___________|2___________2|2___1|1_________1|1__________|_____________|
___________|8|_|_|_|_|_|2|1|_|9|8|_|_|_|_|3|2|_|_|_|_|7|6|_|_|_|_|_|0|
___________|__16-bit_offset/adjust_________|_First_____|__Bitfield___|
___________|__applied_to_first_Register____|_Register*_|__Start______|
___________|-------------------------------|___________|__for_PUT____|
___________|_ASL,_ASR,___|_9-bit_off/adj___|___________|__only_______|
___________|_COPY,_INIT,_|_to_1st_Register_|___________|_____________|
___________|_ISUB,_LSL,__|-----------------|___________|_____________|
___________|_LSR,_ROL,___|3-bit|_Second____|___________|_____________|
___________|_ROR__data___|_ad-_|_Register*_|_*will_be__|_____________|
___________|_____________|_just|_admode_0,_|_ignored___|_____________|
___________|_____________|_to__|_1,_6,_10,_|_in_admode_|_____________|
___________|_____________|_2nd_|_11,_14,___|_2,_7______|_____________|
___________|_____________|_Reg.|__and_15___|___________|_____________|
___________|-------------------|___________|___________|_____________|
___________|_10-bit_offset_or__|___________|___________|_____________|
___________|_adjust_to_1st_or__|___________|___________|_____________|
___________|_2nd_reg,_depending|___________|___________|_____________|
___________|_on_the_admode.____|___________|___________|_____________|
______And_just_to_be_complete:
____|6_________5|5_____5|5_____5|4_____4|4___________3|3|3_____3|3_______2|
____|3|_|_|_|_|8|7|_|_|4|3|_|_|0|9|_|_|6|5|_|_|_|_|_|9|8|7|_|_|4|3|_|_|_|9|
____|__12864____|_GET1__|_GET2__|__PUT__|__Bitfield___|_|_Do-If_|__CCS____|
____|__Instruc._|_admode|_admode|_admode|__Size,_for__|_|_Con-__|__Flag___|
____|__Code_____|_______|_______|_______|__entire_____|_|_dition|__Masks__|
____|___________|_______|_______|_______|__operation__|_|_______|_________|
_______________________________________________________^
________________________________________________Sign-Extension
Having used up 64 bits of the normal 128-bit fetch by the 12864 processor,
it's obvious that to provide details of the admodes specified for GET1 and GET2,
we will need to use the other 64 bits. Now it has already been stated that they
are supposed to hold Immediate Data or an Absolute Address; the potential for
conflict is obvious! This conflict is the main reason admode 2 was created: It
makes the GET1 field use the admode in the PUT field, thereby eliminating any
need for any specific GET1 information among the second 64 bits of the operation
fetch. And if the GET2 field specifies admode 3 (Immediate Data) or 7 (Absolute
Address), then THAT is all the GET2 information needed, and the instruction can
be properly executed. So the main restrictions of limiting the GET1/GET2/PUT
system to a total of 128 bits are these: (1) We can't combine Immediate Data
with more Immediate Data; (2) We can't combine Immediate Data with data at an
Absolute Address; (3) We can't combine the data at two Absolute Addresses; and
(4) We can't use Immediate Data or an Absolute Address in any instruction where
the GET1 admode is different from PUT. How much does it matter that we can't do
these things? We already can't do them with any current processor, right? What
we CAN do is far more important: Not only can we combine Immediate Data or the
content at an Absolute Address with the content of any register (normal for any
processor), we can also combine our Immediate/Absolute information with the data
at any place in the memory that can be index-referenced -- and save it too! The
typical 12864 program will probably be position-independent, anyway, and seldom
need Absolute Addressing. It likely will start by loading several registers
with the addresses of a number of data tables, all relative to the PC register.
No Immediate Data there! Then the remaining registers will become variable-
holders, and use Immediate Data as needed, just like any other program.
So to be a little more specific about how GET1 and GET2 information is set
among the second group of 64 bits, let's first note that it took all of 29 bits
for the PUT information. Keeping that the same for GET1 and GET2 means that 58
of the 64 bits get assigned real quick! Suppose we assign the Least Significant
32 bits to the GET1 information, and the Most Sig. 32 to the GET2 information.
This leaves 3 bits extra for GET1 and 3 extra bits for GET2. The most obvious
thing to do with the extra bits is to expand the offset/adjustment data (from 16
to 19 bits, for example), but perhaps they can be used for something else. Note
that the ASL, ROL, etc. data takes space away ONLY from the PUT information. A
possible use for one of the extra bits is that of being a flag controlling the
Bitfield Start data: If the flag is zero, then the seven bits hold the number
of the starting bit; if the flag is one, then six of the seven bits specify a
register-number where the information on the starting bit is to be found. It
would have been nice to have had enough bits to do this to the PUT field, but it
may not be missed too much, since the PUT field's Bitfield Start is likely to be
zero most of the time, anyway. So here is one more chart:
__The_GET2_info_duplicates_this_GET1_info,_except_Bit-numbers_range_from_32-63.
_______|3|3_____________________1|1_________1|1__________|_|___________|
_______|1|0|_|_|_|_|_|_|_|_|_|_|9|8|_|_|_|_|3|2|_|_|_|_|7|6|5|_|_|_|_|0|
_______|_|__18-bit_offset/adjust,_applied____|_First_____|_|_Register__|
_______|_|__to_First_Register________________|_Register__|_|_holding___|
_______|_|-----------------------------------|___________|_|-----------|
_______|_|_12-bit_off/adj_to_____|_Second____|___________|_Bitfield____|
_______|_|_1st_or_2nd_Register,__|_Register__|___________|_Start_data__|
_______|_|_depending_on_admode___|___________|___________|_for_GET1____|
________^
______Flag_determining_use_of_register_to_hold_Bitfield_Start_data
Now let's consider a few things about the 12864 Assembler. Obviously it's
going to recognize many common assembly instructions, and translate them into
the far smaller set of generic instructions recognized by the processor. The set
of 12864 instructions may be enlarged, simply to take advantage of the possible
list of 64. Ordinary instructions like ADD, SUB, ADDC (add with carry), SUBC,
ABCD (add Binary Coded Decimal), SBCD, OR, EOR (exclusive or), and AND may be
supplemented with NOR, ENOR, and NAND. I don't propose to offer a complete list
here; let the Industry decide all the final details. The main thing that needs
some attention right now is the format of the Assembler instructions; each will
occupy a fair amount of space! But this is reasonable, considering that a 12864
instruction will usually equal 2, and often 3, regular-processor instructions.
All the information from 3 regular lines of Assembly code, plus some new stuff,
has to fit on 1 line in this proposed Assembler format:
|Label|Instruc_|Bitfield|ASL,_etc|_GET1_|_GET2_|_PUT__|Do-If|Flags_|Comment
|field|Mnemonic|_Size___|7bt_data|admode|admode|admode|cond.|masked|_field
The Label field gives this place in a program an optional name, so that it can
be referred to from other places in the program, if desired.
The Instruction Mnemonic is, of course, the name of the instruction.
Bitfield Size (BfSz) is simply a number 1-128; if this part of the Assembly
format is blank, a 128 Size is assumed -- but Admode data may change it to 64.
7-bit data is required in this area whenever the Mnemonic is ASL, ROL, etc.
The nature of this data has already been described. Note exceptions like SWAP
and COPY, which the Assembler knows never need this data. The Assembler offers
INC and DEC instructions that will require 7-bit data; the programmer never need
see this get translated to COPY. Exceptions are peculiar, aren't they?!
Below are examples of the syntax for the addressing modes:
Admode__Syntax____________________Explanation
0000____16;+20.33______Register 16 has data. A Bitfield Start (BfSt) of 33 is specified, so data extracted from 16 starts at Bit 33 (BfSz specifies how many bits). Extracted data will have 20 added to it (register not affected), before being given to the current instruction. An assumed BfSz of 128 would change to 64 (one register specified) minus 32 (due to the BfSt). Conflicts cause Assembly errors.
0000____6 9.18_________Two registers have data: 6 is Most significant; 9 is Least significant. 9 is the First Register in Bits 7-12 of the Specific Information area; recall charts. Data extracted from registers starts at Bit 18.
0000____20:10__________Data in register 20. BfSt in register 10. Note that a period denotes exact BfSt data; a colon means a register has the data. BfSt-in-a-register is illegal in PUT.
0000____10 11__________Two registers have data. BfSt is assumed to be zero. Note spaces denoting registers. Other admodes will use commas, and admode fields must be tabulation-separated.
0000____7 3;-123_______128 bits of data available in registers 7 and 3. BfSz determines how many are extracted. BfSt assumed to be 0. 123 subtracted from it before instruction gets it.
0000____URHERE PC:12___The Assembler will accept either 'PC' or the actual number of the PC register (as yet unknown!). Assembler computes offset between content of PC register (what it will be at end of instruction) and place in memory that is designated by label 'URHERE'. Suppose PC register is
0000____PC;-87:12____\_____34, and the programmer knows that the offset is -87:
0000____URHERE 34:12__>__would be identical to URHERE PC:12. PC is the only
0000____34;-87:12____/_____register to which labels can be referenced, because it is the only register whose value is known at all times by the Assembler (relative to Origin of program). Note the :12 means BfSt is in register 12 (no, I don't know why the programmer wants that in this example!).
0001____20+20.33_______Check first example; note lack of semicolon here. Plus or minus sign mandatory for all offsets and adjustments. With plus adjust, data first goes from register to the instruction. At END of instruction, data in register is adjusted. With minus adjust, register is adjusted before data extracted from it for use by instruction.
0010___________________To specify admode 2, simply leave the field BLANK!
0011____#123456________Immediate data preceded by #. A + or - is optional.
0100____,20;+20.33_____Check first example; note extra comma here. Register 20 now being used as index, with an offset of 20 applied to it. The offset value (index+offset) is the address from which data is removed, starting at Bit 33. The data may extend to Bit 127, depending on the BfSz. Note that initial index is always just ONE register.
0100____,14:2__________Register 14 is index of address holding data; register 2 has data on where is the BfSt. Specifying offsets, adjustments, or bitfield starts is always optional.
0100____URHERE,PC______Note use of comma. Assembler computes offset between PC and URHERE, as before. In previous example the ADDRESS of URHERE was the information (ignoring fact that BfSt specified in that example made the address useless!); now the memory content at that address is the data.
0101____,20+20.33______Check similar examples. Here register 20 is an index which is used to fetch data. Afterwards, an adjustment of +20 is applied to the register. If the adjustment is negative, it is applied to the index before the index is used as an address-pointer that tells us where data is.
0110____,10 12-14:5____Register 10 is the index, holding an address. Register 12 has a 64-bit offset to that address. Before offset is applied, register 12 receives adjustment of -14. The address thus found (by applying adjusted offset to register 10) is the address of the data, which will be accessed using the BfSt data in register 5.
0111____>123456________Absolute Address always preceded by > symbol. Out of 18.4 quintillion possibilities, this one is pretty low!
1000____[,20;+20]L.33__See admode 0100; the part inside brackets is figured in exactly the same way, resulting in an address. Exactly 64 bits are always extracted from that address in this admode. They are the Least Significant 64 bits, as the L indicates. The 64 bits are then used as the address of the data, which starts at Bit 33.
1000____[URHERE,PC]L___The lowest 64 bits of the data at address URHERE are extracted and used as an address. The instruction gets its data from the address thus found. The L (or M, in admodes 12-15) is a mandatory part of the syntax. All programmer does is provide correct syntax; the Assembler will deduce from that syntax the admode number, and the specific info, that are built into the instruction.
1001____[,25-2222]L:13_The value in register 25 is adjusted by -2222 (maximum can be -32768 in PUT before an assembly error occurs, or -131072 in GET1 or GET2), and then the adjusted index is used to fetch an address (least significant 64 bits). In turn the fetched address is used to fetch the needed data, using the BfSt in register 13.
1010____[,5 9-873]L____See the example for admode 6 (0110); bracketed syntax is analyzed the same way, this time using register 5 as the basic index, register 9 as holding the 64-bit offset, and -873 as the adjustment applied to the offset, before the offset is applied to the index. The address thusly computed is the place from which the Least 64 bits are taken and, in turn, are used as the address to fetch the data. Note that -873 is too big for PUT information, but would work as GET1 or GET2 information.
1011____[,18]L 6+3_____Value in register 18 is used as an address to fetch an address from the memory. Least significant 64 bits are taken from memory to become an address. Register 6 has a 64-bit offset, which is applied to the extracted address. The thusly-computed new address is the place where data will be found. (I say 'found' or 'fetched', but address is also a possible place to PUT the data.) Afterwards, register 6 is adjusted by +3. This example, if in the PUT admode field, and if the instruction is LSL or one of that group, is using the largest positive allowable adjustment (3 bits, twos-complement). What's the chance of having only 32 generic instructions, so we can move a bit to the PUT information field?
I don't think I need to provide any examples for admodes 12-15; they are identical to admodes 8-11, with the sole exception that the letter L in the syntax is replaced by M. The Assembler uses L and M to determine the correct admode; the 12864 processor uses the admode to determine that either the Least Significant or the Most Significant 64 bits are to be taken from the memory and used as an address. This process has absolutely nothing to do with Bitfield Sizes and Bitfield Starts.
It should be repeated that these examples are only a proposal; thinking about them is bound to lead to speculation about how easily the programmer can make a mistake by forgetting a comma. A whole different syntax might be created just to reduce the chance of such accidents, perhaps one where mnemonic letters replace the commas, periods, colons, and semicolons -- even lower-case letters, to prevent confusion between O/offset and 0/Zero. This syntax simply attempts to make the admode-field information compact.
The next field of the Assembler format, after the PUT admode field, is the
Do-If condition. Two letters suffice to abbreviate the possible conditions (at
least only 2 letters if Motorola's list is used): HI (higher); LS (lower or
same); CC (carry flag clear); CS (carry set); NE (not equal to zero; zero flag
clear); EQ (equal; zero flag set); VC (oVerflow flag clear); VS (oVerflow set);
PL (plus); MI (minus); GE (greater than or equal to zero); LT (less than zero);
GT (greater than zero); and LE (less than or equal to zero). This list totals
14 possible Do-If conditions; with a maximum of 16 allowed, the last two are
usually Do Always and Do Never. For the purpose of the Assembler format, the
Do Always condition can be the default if the Do-If field is simply left blank,
but it wouldn't hurt to allow a DA abbreviation. A DN abbreviation is logically
sensible, but practically almost useless -- a NOP for sure! (If the Assembler
converts NOP to SWAP, as proposed, obviously the Do-If would be Never!). Maybe
some other Do-If condition can be created, just to use that 16th possibility.
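These fourteen conditions test the flags exactly as in Motorola's 68000 family; here is the list as Python predicates over the N, Z, V, and C flags (the formulas are the standard 68000 definitions; DA and DN are the fill-ins proposed above):

```python
# Each Do-If condition is a predicate over the flags N, Z, V, C.
DO_IF = {
    'HI': lambda n, z, v, c: not c and not z,   # higher (unsigned)
    'LS': lambda n, z, v, c: c or z,            # lower or same
    'CC': lambda n, z, v, c: not c,
    'CS': lambda n, z, v, c: c,
    'NE': lambda n, z, v, c: not z,
    'EQ': lambda n, z, v, c: z,
    'VC': lambda n, z, v, c: not v,
    'VS': lambda n, z, v, c: v,
    'PL': lambda n, z, v, c: not n,
    'MI': lambda n, z, v, c: n,
    'GE': lambda n, z, v, c: n == v,            # signed comparisons
    'LT': lambda n, z, v, c: n != v,
    'GT': lambda n, z, v, c: not z and n == v,
    'LE': lambda n, z, v, c: z or n != v,
    'DA': lambda n, z, v, c: True,              # Do Always (blank field)
    'DN': lambda n, z, v, c: False,             # Do Never (a sure NOP)
}

# After comparing equal numbers (Z set), EQ passes and GT fails:
assert DO_IF['EQ'](False, True, False, False)
assert not DO_IF['GT'](False, True, False, False)
```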
After the Do-If field in the Assembler instruction format is the Flag Mask
field. Motorola's flags are abbreviated X, N, Z, V, and C, so simply putting an
appropriate letter (or letters) in this field should tell the Assembler that you
don't want a particular flag to be affected by the current instruction. Simply
entering ZCN without any punctuation should be adequate to specify the Carry,
Zero, and Negative-sign flags, for example. Now consider the opposite notion:
Some Assembler instructions, like LEA, will be translated into other operations,
and the flags will automatically be masked by the Assembler during translation.
In the 6809 processor there are two registers Y and U, which are not treated the
same by LEA instructions. LEAY will affect the Zero flag, while LEAU will not.
The idea is to let register Y be used in counting loops, and it works fine. The
12864 Assembler could allow the same sort of thing: If the programmer specifies
the Z flag in the Mask field during an LEA instruction, then the Assembler WON'T
mask the flag! More precisely, what is happening is the programmer telling the
Assembler to reverse its normal handling of the 12864 flagmask bits. If the
Assembler usually doesn't mask a flag, then it will be masked -- and vice-versa.
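Flag masking amounts to this (a sketch; the five-flag dictionary is just a model of the CCS bits):

```python
FLAGS = ('X', 'N', 'Z', 'V', 'C')

def apply_flags(old, new, masked):
    """Update the CCS flags after an instruction: every flag named in
    masked (e.g. 'ZCN'; order is irrelevant) keeps its old value, and
    the rest take the values the instruction computed."""
    return {f: (old[f] if f in masked else new[f]) for f in FLAGS}

old = {'X': 0, 'N': 0, 'Z': 1, 'V': 0, 'C': 0}
new = {'X': 1, 'N': 1, 'Z': 0, 'V': 1, 'C': 1}
assert apply_flags(old, new, 'ZCN') == {'X': 1, 'N': 0, 'Z': 1, 'V': 1, 'C': 0}
```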
The last field of the Assembler format is the Comment field, in which the
programmer is supposed to explain the purpose of the instruction. This field is
completely ignored by the Assembler, of course, during the task of creating the
machine code for the 12864 processor from the assembly source listing.
And now my two-cents-worth on the hardware of the 12864 computer; if what I
am about to say is really worth as much as two cents, I'll be surprised! The
average computer has a System Clock that controls the timing of everything that
goes on in the computer. The average microprocessor accesses the memory every
(fill in blank) cycles of the System Clock, on the average. The remaining clock
cycles are spent by the processor processing the data it has accessed. Some of
the newer processors have 'preprocessors' built into them, so they can access
the memory significantly more often. The preprocessors begin working on future
instructions before the main processor finishes the current instruction; it is
known as 'pipelining', I believe. The 12864 will be both similar to and
different from this scheme. It'll likely have one main processor for the main
instruction,
and 3 subprocessors to handle the data represented by GET1, GET2, and PUT. It
figures that if the average 12864 instruction is as complex as 2 or 3 regular-
processor instructions, the 12864 may have to do as many memory-accesses as 2 or
3 'regulars'. Yet by processing GET1, GET2, and PUT simultaneously, the 12864
is essentially doing the work of the 'pipeliners'. Whether or not pipelining of
the current sort is actually built into the 12864 remains to be seen. In the
meantime, though, the 12864 is still going to spend a number of clock cycles in-
between memory-accesses, during which it is processing the accessed data. Since
it is fairly obvious that the more often a processor can access the memory, the
greater the performance of the computer, the standard trick is to increase the
speed of the System Clock, and to build both processors and memory chips to keep
up. Nevertheless, this does not change the fact that the processor spends many
clock-cycles NOT accessing the memory! And I get the impression that the memory
chips are not keeping up with the processors, in the speed race. So here is my
suggestion: Build the 12864 with a faster clock than the System Clock. It will
have to hold its outside lines open for more than one internal clock cycle each
time its subprocessors access memory (to stay in sync with the System Clock),
but while it is doing that, its main processor can be manipulating previously-
accessed data. With proper planning the 12864 should be able to access memory
almost every cycle of the System Clock, at the memory's maximum possible speed.
I have been saving the thorniest problem for last (at least I think the end
of this essay is approaching!), and it concerns the hardware's management of the
data. The first part of the problem is this: While most 12864 instructions are
128 bits long, many will be fully described in only 64 bits. So do we make the
processor skip the other 64 bits, and move on to the next memory location, or do
we scheme to fit another whole instruction in those 64 bits? My inclination is
to ignore the 64 bits, UNLESS it 'just happens' that two adjacent instructions
in the assembly source listing can both be reduced to 64 bits. In other words,
what the processor would do is load 128 bits, discover that the first 64 of them
comprise a complete instruction, execute that instruction, and test the next 64
bits to see if they also comprise a complete instruction. If they don't, they
will be ignored, and the processor will load 128 bits from the next address. It
would be worth having this scheme just to give the programmers a chance to prove
they are clever enough to always make full use of it. Any programmer who NEVER
attempts to conserve memory should be fired! (And so what if there are more
than 18.4 quintillion memory locations -- waste is waste.)
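The proposed fetch logic -- execute the first 64 bits if they form a complete instruction, then execute the second 64 only if they do too -- might run like this (how the processor recognizes a complete 64-bit instruction is left open above, so this sketch just takes that test as a given flag):

```python
def instructions_per_fetch(low64_complete, high64_complete):
    """For one 128-bit fetch: if the low 64 bits are not a complete
    instruction, the whole 128 bits are one instruction.  Otherwise
    the low half executes, and the high half executes too only if it
    is also a complete 64-bit instruction; else it is ignored."""
    if not low64_complete:
        return 1                 # one full 128-bit instruction
    if high64_complete:
        return 2                 # two packed 64-bit instructions
    return 1                     # 64-bit instruction, upper half wasted

assert instructions_per_fetch(False, False) == 1
assert instructions_per_fetch(True, True) == 2
assert instructions_per_fetch(True, False) == 1
```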
The other aspect of the memory management problem concerns the Stacks, which
are places where random numbers of registers are temporarily stored. If each
address a Stack register points at holds 128 bits, and each register being saved
is only 64 bits wide, then it seems at first obvious to always put 2 registers
at the Stack address. But many times an odd number of registers will be saved;
what then? The very simplest answer is to always only store 1 register at each
Stack address, and ignore the obvious waste, because this way the processor can
never get confused. The next-simplest answer may be to REQUIRE the programmer
to always PUSH or PULL an even number of registers when using the stack -- even
a JSR (jump to subroutine) instruction would have to save another register with
the Program Counter, just to keep the total even. I think I may recommend this
particular solution (would you believe I have been worrying about this since the
middle of this essay, and just now have come up with the idea?).
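Under the even-count rule, stacking packs two 64-bit registers into each 128-bit Stack slot; a sketch (the pairing order within a slot is my own choice):

```python
def push_registers(stack, regs):
    """Push 64-bit register values onto a Stack of 128-bit slots,
    two registers per slot.  The count must be even -- the rule
    proposed above -- so even a JSR saves a second register along
    with the Program Counter, just to keep the total even."""
    assert len(regs) % 2 == 0, "odd register counts are disallowed"
    for i in range(0, len(regs), 2):
        stack.append((regs[i], regs[i + 1]))   # one 128-bit slot
    return stack

assert push_registers([], [11, 22, 33, 44]) == [(11, 22), (33, 44)]
```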
The bit-code format of instructions like JSR, BSR, PSH, PUL, and MOVEM can't
be the same as the format for most 12864 instructions. The main reason is, as
mentioned, that the instruction has to incorporate a list of registers -- but it
works out OK, because much of the instruction is predefined. Before we get into
any details of that, though, let us examine the Stacking system a little closer.
In the 6809 there are two Stack registers, one of which is always used by the
hardware to save JSR and interrupt information, and one of which the programmer
can use for other things. There are occasions when having two Stacks is really
convenient, notably when moving large blocks of data around. In the 68020 there
are three Stack registers, one for the Boss mode, one for the Interrupt mode,
and one for the Peon. Two bits in the CCS register are devoted to keeping track
of which Stack the hardware is using at the moment, so if it had been wanted, a
fourth Stack could exist in the 68020. This seems worth putting in the 12864.
And another thing: TWO CCS registers! One would be a Boss mode CCS that keeps
track of things like the current Stack being used and interrupt-control flags,
as well as the list of registers to be saved during an Interrupt, as proposed at
the beginning of this essay. The other would have the instruction-result flags
in it and some other stuff. MOST of that other stuff is another register list,
like that in the Boss CCS. Thus when a GSR instruction is used (generic for JSR
and BSR: go to subroutine) a list of registers could be specified that would be
saved in the Peon CCS. Here is a proposed bit-map for GSR:
_____________|6_________5|5_____5|5_____5|4_____4|4________________|
_____________|3|_|_|_|_|8|7|_|_|4|3|_|_|0|9|_|_|6|5|_|_|.....|_|_|0|
_____________|_Code_For__|_GET1__|_GET2__|_Do-If_|_Register_List___|
_____________|___GSR_____|cannot_|_______|_(PUT__|_Note_Peon_CCS___|
_____________|Instruction|__be___|_______|_is_PC_|_and_PC_registers|
_____________|___________|admode_|_______|always)|_not_on_list;____|
_____________|___________|___2___|_______|_______|_always_saved.___|
_____________________________________^
________If GET2 is admode 2 then data specified by GET1 is copied to PC -- equivalent to JSR. If GET2 is any other admode then the data it specifies is added to the data GET1 specifies, and the result is copied to PC. If GET1 specifies PC then we have a BSR equivalent. The CCS instruction-result flags are NEVER affected by this one. Normal limitations: No adding Immediate Data to Absolute Address!
It is worth noting that the Register List, from 0 to 45, is in agreement
with the early estimate of approximately 45 registers total for the 12864. If
there are any registers that we can be sure NEVER need to be saved during a GSR,
even during the Boss mode, then we can have a few more than the 48 implied here.
When executing a GSR, the processor would copy the specified register list to
the Peon CCS register, save them all on the current Stack, THEN save both the PC
and Peon CCS registers. When an Interrupt occurs, the last two registers saved
would always be PC and the Boss CCS (although the Peon CCS would be saved just
before then). One bit in the same place in the two CCS registers would serve to
identify which is which; this bit cannot be allowed to be changed by anything.
Then when the generic RTN (return) instruction is executed, 128 bits of PC and
CCS data would be taken from the memory; the correct CCS would be identified,
and the correct way of returning would follow. One thing to note about RTN from
a subroutine: The instruction is almost completely pre-defined. The only odd
thing is that the values of the instruction-result flags in CCS BEFORE the RTN
occurs have to be preserved while CCS data is being loaded from the Stack during
the actual RTN operation. Unless various flag-masks are set by the programmer!
The bit-coding of RTN only needs 6 bits for the instruction, 4 bits for Do-If,
and 5 bits for flag-masks (flags the programmer does not want preserved during
the RTN from a subroutine); the rest of the 64 bits can be ignored. Programmers
should be wary of specifying any flag-masks for RTN at the end of an Interrupt
handling routine, since here the normal thing for the processor to do is to NOT
preserve the flags, as they exist at the end of the Interrupt handler. Masking
them would mean transferring Interrupt data to the interrupted program. This
would be OK if the interrupted program was specifically waiting for such....
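The preserve-unless-masked behaviour of RTN described above can be sketched as follows. The flag positions and mask convention are assumptions (the essay does not fix the CCS layout); only the rule itself comes from the text.

```python
# A minimal sketch of the proposed RTN flag-preservation rule for a
# subroutine return.  Assumption (not in the essay): the five
# instruction-result flags occupy the low 5 bits of the CCS register.

FLAG_BITS = 0x1F

def rtn_ccs(current_ccs, stacked_ccs, flag_mask):
    # flag_mask bit = 1 -> do NOT preserve that flag (load it from the stack)
    keep = FLAG_BITS & ~flag_mask
    load = FLAG_BITS & flag_mask
    upper = stacked_ccs & ~FLAG_BITS   # non-flag CCS bits come from the stack
    return upper | (current_ccs & keep) | (stacked_ccs & load)
```

With an all-zero mask this reproduces the default: the pre-RTN result flags survive the CCS reload.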
PSH, PUL, and MOVEM-type instructions can all be combined into one generic,
I think, that we can call STAK. The bit-coding for it might be like this:
       |63        58|57 55|54|53    50|49    46|45               0|
       |    STAK    |Con- |  |  Do-If |   PUT  |  Register List,  |
       |instruction |trol |  |        |        |  to be stacked   |
       |    code    |Bits |  |        |        |  or unstacked    |
       PUT specifies the address where the stack is to start. If LOCATION OF ADDRESS is in memory elsewhere, one Control Bit denotes L or M for the 128 bits at that location, from which the stack's address will be fetched -- no bitfield specs! After STAK is finished, the PUT place is given a new value, indicating the new start of the stack. (Immediate Data is still forbidden in PUT, of course.) One Control Bit specifies top or bottom of stack; another specifies whether data is being added to or removed from the stack. As always, only an EVEN total number of registers may be specified. Bit 54 means that the Peon CCS register is part of the stack operation. STAK never affects flags, except when loading CCS from this kind of stack. (I forgot to say, details of PUT can be in the other 64 bits of the instruction fetch.)
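The even-register rule falls out naturally if STAK packs two 64-bit registers into each 128-bit memory word. Here is a sketch of that pairing; the function names and data shapes are invented, and only the push/pull behaviour comes from the essay.

```python
# A sketch of the proposed STAK behaviour: an even number of 64-bit
# registers is pushed two at a time, one pair per 128-bit memory
# location, hi64/lo64, and unstacked in reverse order.

def stak_push(regs, values):
    # regs: register numbers taken from the instruction's register-list
    # bits, first to last; must be even in length.
    assert len(regs) % 2 == 0, "only an EVEN total number of registers"
    words = []
    for i in range(0, len(regs), 2):
        hi, lo = values[regs[i]], values[regs[i + 1]]
        words.append((hi << 64) | lo)   # one 128-bit memory word per pair
    return words

def stak_pull(words, regs, values):
    # unstack in reverse: last word first, back to the paired registers
    for i in range(len(regs) - 2, -2, -2):
        word = words.pop()
        values[regs[i]] = word >> 64
        values[regs[i + 1]] = word & ((1 << 64) - 1)
```

Pushing then pulling the same register list restores every register, which matches the "examine the bit stream first-to-last stacking, last-to-first unstacking" description later in the annotations.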
That about wraps it up, I guess. Any inconsistencies you may have noticed
are due to the fact that this is only a proposal, and therefore does not need to
be perfect. Only if the Industry decides to get together to create a standard
microprocessor along these lines would it be necessary to get really finicky on
all the details. And what do I want out of this? First of all, I want to beat
the NIH Syndrome: 'If it is Not Invented Here, we are not interested!' Except
for the fact that computers I own and know well happen to have 6809s in them, I
am not associated in any significant way with any company in the entire computer
industry. I will claim the credit for dreaming up this thing, just to prevent
anyone else from doing so -- and just to prevent any person or any company from
claiming ownership of it, I am quite deliberately placing this whole concept in
the public domain, as of NOW. Thus the whole industry starts off on an equal
basis with respect to the proposed 12864 microprocessor, and there should now be
no barrier to creating an industry-wide standard. I am knowingly forfeiting all
legal claim to any compensation for these ideas, just to prove I seriously want
the Industry to get its act together. On the other hand, any 'royalties of
conscience' that might come my way will be gladly accepted!
Vernon Nemitz
March 17, 1991
NOTE ADDED JULY 11, 2001: It wouldn't be so tough to build one of these
processors today. The Industry is as non-uniform as ever. The first 80486
DX2 chips began appearing in late 1991. Very Long Instruction Word processors
are also being designed and built, to different specs than those described
here. But the Industry is STILL putting only 8 bits of data at one address,
even as it ramps up the production of 64-bit processors. What a mismatch!
Meanwhile, 64-bit addressing looks to be a stable quantity for 20 or 30 more
years. Maybe I'll try to promote a 25664 microprocessor...128 bits wasn't
really QUITE enough, for all the variety of instructions that I had in mind,
and considering all the new multimedia instructions, well, why not!
For 6809 lovers...
http://www.vavasour.ca/jeff/trs80.html CoCo and Dragon computers can be emulated on Pentiums [Vernon, Jul 11 2001, last modified Oct 04 2004]
Homebrew FPGA CPU
http://www.fpgacpu.org/ Design your own 32-bit (or more) RISC cpu, implement it on a gate-array, and share your results with other hobbyist CPU designers. [wiml, Jul 11 2001, last modified Oct 04 2004]
One bit processor in use
http://srtm.die.uni...vers/comunicati.htm by NASA no less... [phoenix, Jul 11 2001, last modified Oct 04 2004]
(?) Athlon electrical & mechanical data sheet
http://www.amd.com/...hdocs/pdf/23792.pdf What are all those pins for? Find out here. [wiml, Jul 11 2001, last modified Oct 04 2004]
Return to Top
http://www.halfbakery.com/idea/12864 [The Military, Jul 11 2001, last modified Oct 04 2004]
(?) The totally, like bitchin' 12864.
http://www.80s.com/...tainment/ValleyURL/ Enter the URL of this idea. [angel, Jul 11 2001, last modified Oct 04 2004]
(?) Makes technical jargon even more revolting, er more accessible to word on the street
http://www.pornolize.com/ copy/paste URL from here... [thumbwax, Jul 11 2001, last modified Oct 04 2004]
(?) For [POYF], one of 3,510 hits on Yahoo.
http://www.11a.nu/lempelziv.htm I'm surprised you don't know of this. It's used in GIF encoding. [angel, Jul 11 2001, last modified Oct 04 2004]
(?) CodePack Compression for PowerPC
http://www.chips.ib...pc/cores/cdpak.html On-chip compression for PowerPC cores. [JKew, Jul 11 2001, last modified Oct 04 2004]
(?) More ancient technology revisited
http://www.shouldex.../6805;mode=moderate In this case, 6502-inspired. Probably relevant to 6809 issues. [LoriZ, Mar 01 2002, last modified Oct 04 2004]
AMD Hammer
http://www6.tomshar...1/020227/index.html Looks like (roughly) a 31x31 grid of pins... [Vernon, Mar 02 2002, last modified Oct 04 2004]
(?) Transmeta Announcement
http://www.pcworld....0,aid,101516,00.asp Transmeta announces plans for a 256-bit wide Very Long Instruction Word processor. I wonder.... [Vernon, May 31 2002, last modified Oct 04 2004]
12864 (again)
http://www.nemitz.net/vernon/12864.htm In theory, same essay, in unmangled format [Vernon, Oct 04 2004]
Transmeta
http://www.transmeta.com/ Looks like the 256-bit processor is well into production [5th Earth, Feb 16 2005]
Intel fully embracing 128-bit data crunching
http://arstechnica....paedia/cpu/core.ars Now all they need is an efficient instruction set, to access multiple instructions per data-fetch. [Vernon, Apr 14 2006]
Hack-a-Day: Minecraft articles
http://hackaday.com/?s=minecraft Mostly building processor simulators in a games engine [Dub, Nov 14 2010]
Annotation:
|
|
Um, okay, fishbone for wibni.— | phoenix,
Jul 11 2001, last modified Jul 12 2001 |
|
|
|
Wow. After scrolling for three or four minutes, I was
thinking, "Hey, Vernon really has some competition now!"
Then I got to the bottom after another three or four
minutes of scrolling (just scrolling, mind you, not reading
much) and found out that no, there is no competition,
Vernon is still the reigning champ of verbosity. |
|
|
Time for character limits on the idea textarea, maybe? |
|
|
I had wondered why we had not heard from Vernon for what seems like months...evidently he's spent the entire time typing. |
|
|
FOR ANYONE starting to read the essay, I recommend going down to the end of it, and clicking on the link "12864 (again)". You will find the essay in its intended format to be MUCH more clear about certain things. Enjoy! |
|
|
phoenix, what is "wibini"? I don't recall encountering the term before. If it's an acronym, I can assume it is derogatory ('fishbone' clue), but I'm used to that, so let 'er rip! |
|
|
PotatoStew, as thorough as this microprocessor description tries to be, it is still only half-baked. I only wish I could have posted it on the Internet 10 years ago, when it was
written. If actually baked, the idea will require a whole book for all the details. |
|
|
Rods Tiger, we really do need more than mere 16-bit processors. The maximum ordinary number you can fit into 16 bits is only 65535, and how often do you need to manipulate numbers larger than that? To do so efficiently means using a processor that handles more bits! Since the proposed 12864 handles 128 bits of data at a time, ITS maximum number is 18-quintillion SQUARED. Only mathematicians and physicists regularly need numbers anywhere near that large (or larger). Next, with respect to floating-point calculations, the average 80-bit coprocessor (mentioned in the essay) has a 64-bit mantissa and a 16-bit exponent. This HAS proved to be sufficient for most purposes, most of the time. Floating-point calculations on the 12864 could quite naturally still use a 64-bit mantissa (in one register), and a 64-bit exponent (in another register). The accuracy of the calculations would remain the same; only the overall size of the maximum number that could be manipulated would increase. Next, I tend to think that if the 6809 had been equipped with 16 data lines, and Motorola had provided a decent large-memory-management chip, then Intel's 8086 wouldn't have stood a chance, because the instruction set really is that efficient and versatile. A 1-MHz 6809 could stay neck-and-neck with a 4-MHz 8086. So, if you study the essay, you will find that I kept just about all of that versatility, and made it even more efficient. (For those who don't know, Motorola commissioned a special operating system, "OS9", which is sufficiently UNIX-like to be able to do very good multitasking. So NASA put 6809-based computers in the Space Shuttle.) |
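The arithmetic in that annotation checks out, and can be verified directly: the 16-bit maximum is 65535, the 64-bit maximum is about 18 quintillion, and the 128-bit maximum is one less than that range squared.

```python
# Checking the annotation's figures: 16-bit, 64-bit, and 128-bit
# unsigned maxima, and the "18 quintillion squared" claim.

max16 = 2**16 - 1
max64 = 2**64 - 1
max128 = 2**128 - 1

assert max16 == 65535
assert 1.8e19 < max64 + 1 <= 1.9e19       # about 18 quintillion
assert max128 == (max64 + 1)**2 - 1       # 128-bit max = (2^64)^2 - 1
```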
|
|
I see nothing wrong with your suggestion that a Digital Signal Processor have Direct Memory Access capabilities, similar to hard disk and video controllers. I don't see any reason to lessen the capabilities of the processor, though. One of the elements of many modern games is a large amount of PHYSICS calculations (so that when large objects fly through the air and impact massively, both the flight and the results of the impact look realistic). NO mere 16-bit processor can keep up with all those calculations, pertaining to all those objects moving around on the screen, these days! |
|
|
beauxeault, I really have lots of things on my agenda. The more I want to sleep, the less I can play on the HalfBakery. And, this idea really was written 10 years ago. There was a recession, and I was experiencing unemployment, and had time on my hands.... |
|
|
waugsqueke, the essay is still there. Feel free to read it as many times as it takes to sink in. Even I had a hard time digesting it, when I discovered it on a disk that had been sitting in storage for 10 years, and reread it. I haven't been mentally in the Assembly Language groove in a long time. But after I saw what angel wrote in the "Restore Duelling" area, I had two reasons for posting this ;) |
|
|
UnaBubba, you have the date wrong: it was 1991 when I wrote what you quoted. Some of the rationale for that statement is explained in my reply to Rods Tiger; the only real mistake is that phrase "any computer". Even back then I knew that NO computer will EVER be powerful enough for mathematicians and physicists! More of the rationale is this: If I recall correctly, in 1991 a decently-sized hard-disk drive had maybe 30 megabytes of storage space. Now it is ten years later, and it takes 1000 times that, 30 gigabytes, to be a decent size. |
|
|
Please note that the maximum number of distinct memory locations that a 32-bit processor can directly access is 4,294,967,296. Because RAM is more expensive than disk space, "virtual memory" is a piece of special software that sits between the processor and the disk; when the processor wants memory outside of the RAM, the special software accesses the disk and fetches data and feeds it to the processor, AS IF the RAM had actually been there. Today's hard disk drives can easily provide more simulated RAM than a 32-bit processor can pseudo-directly use, and this fact actually should be interpreted to mean that the processors are behind the technology-development curve! |
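That address-space figure, and what it implies under the 12864's 128-bits-per-address convention, is quick arithmetic:

```python
# A 32-bit processor directly addresses 2**32 distinct locations.
# With the usual one byte per address that is 4 GiB; with the 12864's
# 128 bits (16 bytes) per address, the same count of addresses would
# span 64 GiB.

locations = 2**32
assert locations == 4_294_967_296
assert locations * 1 == 4 * 2**30     # one byte per address: 4 GiB
assert locations * 16 == 64 * 2**30   # sixteen bytes per address: 64 GiB
```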
|
|
If the above disk-storage trend continues, then in 10 more years a decent hard drive will store 30 trillion bytes, and 10 years after that, 30 quadrillion bytes -- and 10 years after THAT (30 years from now), 30 quintillion bytes. WOULD THIS HAVE BEEN A REASONABLE PROJECTION, TEN YEARS AGO? If you say NO, then what you quoted is almost perfectly accurate!!! Because, remember, a 12864 processor will expect to find 128 bits of data (not the current standard of merely 8 bits) at every numerical address -- so even a 30-quintillion-byte hard disk drive can only provide about a tenth of the "data space" that this processor could work with!— | Vernon,
Jul 12 2001, last modified Jul 19 2001 |
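Taking the disk-growth projection in the annotation above literally (1000x per decade from 30 GB in 2001) and comparing it against the 12864's full data space of 2^64 addresses times 16 bytes:

```python
# The annotation's projection as arithmetic.  "About a tenth" refers
# to a 30-quintillion-byte drive versus the 12864's 2**64 addresses
# of 16 bytes each.

decade_growth = 1000
gb = 10**9
sizes = [30 * gb * decade_growth**n for n in range(4)]   # 2001..2031
data_space = 2**64 * 16                                  # bytes

assert sizes[3] == 30 * 10**18              # 30 quintillion bytes by ~2031
assert 0.09 < sizes[3] / data_space < 0.11  # about a tenth of the data space
```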
|
|
|
[Vernon]: Are you saying it's *my* fault that you posted this? 'wibini' - WIB incredibly NI I also have a soft spot for the 6809; it was the heart, soul and guts of my first computer, the Dragon 32. Further to what [Vernon] suggests, it could knock the socks off a 4MHz Z-80, not just the 8086. As for the rest, I'll leave it to you guys. |
|
|
UnaBubba, thanks for the acronym. I see that angel provided the variant that includes that middle 'i'. |
|
|
angel, when I discovered the 10-year-old file that I posted here, at first I wondered if I should, because it was indeed so long. But I had two recent reminders (yours was one) that technological notions are generally the preferred thing to post on the HalfBakery. OK -- take THIS, heh heh. Besides, it IS a half-baked idea, and even after ten years, it still isn't outdated! |
|
|
Rods Tiger, my first computer was a CoCo, and I also have seen Motorola's reference sheets for the 6809. Did you know that one of the unusual video modes, in all the TRS-80 CoCos, has a hardware bug that is directly derived from those design sheets? Makes you wonder about how many other so-called tech companies (like Tandy) are really just marketing machines.... |
|
|
Wow, Vernon, are you an interesting person.... |
|
|
[Vernon] Wasn't doggin' you
with the wibni, buddy. Wibni's
are discouraged @.5B |
|
|
Anyone ever see a one bit
computer? (No, really) |
|
|
Whew. Vernon, I think you're halfway to inventing RISC processors. You should take a look at the design of the MIPS architecture and then, perhaps, the SPARC and PowerPC architectures. |
|
|
One thing you should consider is that one of the bottlenecks on modern processors is the transfer rate between main memory and the CPU. By making all instructions 64/128 bits long, you've greatly reduced the code density (in terms of useful operations per bit). Sure, you can make the data path to main memory wider, but this makes the machine much more expensive (more pins, more circuit traces, more sockets, more fiddly bits of metal to manufacture, test, and have go wrong). On modern machines, the core CPU talks to a cache controller (which is on the same chip), and the cache controller talks to memory via a data path of whatever width provides the best performance/cost, regardless of the width of the instruction set. |
|
|
wiml, I'm pretty sure that RISC processors were around in 1991, and that I knew about them back then. In one sense, it depends on how you define RISC. The 6809 had its working registers named A, B, D, X, Y, U, and S, and so the assembly-language included instructions (example, to stow contents of a register into the memory) that were literally named STA, STB, STD, STX, STY, STU, and STS.
Now, if I decide to merely NAME things differently, so that I have a generic STO instruction, then the Assembler still needs to know which register and what place in the memory. If the original syntax was STA >10000 and the new syntax is STO A,10000 -- well, somebody might SAY they have reduced the instruction set, but in actuality the opcodes to the processor haven't been changed a bit. What I have described in the essay is simply alternate ways to do some of the same long list of instructions as before. What I really was independently inventing was Very Long Word Instructions, because I thought it would be more efficient to fetch the equivalent of three normal instructions (get something, do something to it, put result somewhere) all at once. Have you noticed that recently, when the manufacturers of processors like the Athlon and Pentium 4 bump up the CPU clock speed from (say) a multiple of 13 to a multiple of 14 times the system bus, the overall performance doesn't usually go up very much? The reason is that most of the instructions that the CPU loads don't take 13 or 14 clock cycles to process! ESPECIALLY with the built-in pipelining and duplicate internal pathways that let the CPU process MORE than one instruction in a single CPU clock cycle. SO, I SAY, FETCH MORE INSTRUCTIONS AT ONCE, to keep the CPU busier!!! This means we NEED a wide path to the memory. And as for the number of pins, well, the 6809 had 40, of which 8 were data lines and 16 were address lines. The other 16 were things like power, ground, hardware-interrupt signalling, clock signalling, etc. Okay, granting that a powerful processor needs extra power and ground and clock-signal pins, I hear that the Pentium 4 has more than 400 total! Only 32 of them are address lines, and somewhere I gathered the impression that maybe 64 of them were for data, but 300+ pins for all the other overhead seems ridiculous. The proposed 12864 processor would need 192 pins for data and addresses.
In 1991 I WAS concerned about such a large number, but I am not bothered about it nowadays, that's for sure! |
|
|
Concerning the PowerPC architecture, in 1991 one of the companies to which I sent a copy of this essay was Motorola. If they borrowed something from the essay, they never told me. Nor did they need to, since I declared it public-domain. |
|
|
I think my biggest gripe with respect to putting only 8 bits of data at one memory location is this: BOUNDARIES. The memory management circuits will gather 8 bits from each of 4 addresses, in parallel, to supply a single 32-bit block to most of today's processors. In such a computer, every 4 addresses (such as from 000000 to 000003) constitutes a bounded group. If I wanted to save 32 bits starting at Address Number 000001, the hardware won't let me! --because it would cross the boundary between Address 000003 and 000004. THEREFORE, IF ONLY 32-BIT DATA IS ALWAYS USED, the net effect is AS IF the total address space had been reduced to 1/4 of its advertised size -- from 32-bit to 30-bit addressing, that is. I think this fact reveals a total waste of potential! A 16-bit processor should have AT LEAST 16 bits of data at every address, a 32-bit processor should have AT LEAST 32 bits at every address, and so on! Only then will ALL the advertised number of addresses be fully useful. |
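The alignment restriction in that gripe can be sketched in a few lines; the helper name is invented, but the rule (a 32-bit store may only begin at every fourth byte address) is the one described above.

```python
# A sketch of the alignment restriction: with 8 bits per address, a
# 32-bit bus gathers bytes in aligned groups of four, so a 32-bit
# store may only begin at every fourth address.

def aligned_32bit_store_ok(addr):
    return addr % 4 == 0   # must not cross a 4-byte group boundary

assert aligned_32bit_store_ok(0x000000)
assert not aligned_32bit_store_ok(0x000001)   # the store the hardware forbids
# Only 1/4 of all byte addresses can begin an aligned 32-bit store --
# the "32-bit- to 30-bit-addressing" reduction:
assert sum(aligned_32bit_store_ok(a) for a in range(16)) == 16 // 4
```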
|
|
About the bottleneck between the processor and the memory, I knew about that back in 1991, and if you go down to near the end of the essay, you will find a paragraph that begins with "And now my two cents worth about the hardware..." In there I describe my independently-invented notion of clocking the CPU faster than the rest of the system, to compensate. Yet today that bottleneck STILL exists, in spite of large clock-multipliers, RDRAM, DDRSDRAM, and even QDRSDRAM. Sometimes I think they should use SRAM for ALL the memory, and not merely the cache, heh heh. Well, there ARE some new memory technologies in the pipeline, which may offer the performance of SRAM, the cost of DRAM, <and even> the nonvolatility of ROM. I can wait for one of those. |
|
|
"What I really was independently inventing was Very Long Word Instructions..."
Understatement of a Lifetime.
"I hear that the Pentium 4 has more than 400 total! Only 32 of them are address lines, and somewhere I gathered the impression that maybe 64 of them were for data, but 300+ pins for all the other overhead seems ridiculous. The proposed 12864 processor would need 192 pins for data and addresses."
Vernon, would all 192 pins be solely dedicated to data and addresses, and/or would they absorb all the other overhead as well? If so, what proportions do you assume would be logical in completed form?
RE: "the thorniest problem"... would stack addresses run parallel - virtually or otherwise - on both 64 bit 'divisions' or is that a nonissue?
Given that no one has taken advantage of the 10 year old public domain offer to the fullest meaning, I would hope in all sincerity you are instead able to see this through to your financial benefit or to live long enough to see it done. Fate has its own method to the madness. |
|
|
I am withholding commitment pending further discussion. |
|
|
Rods Tiger, you are reiterating stuff that's already in the essay. Of each 128-bit data fetch, some are defined as "bitfield size" and "bitfield start" information. So that any portion of any group of 128 bits can be used for whatever. There are probably fewer technical difficulties with this than one might first imagine. For example, consider a relatively standard "3-state signal", on one of the data lines between the CPU and the memory: One state specifies Read, one specifies Write, and one specifies Do Nothing. So if one wanted to fetch only 53 bits from somewhere among the 128 at a particular memory location, only 53 wires would be set to Read state, and all the others would be set to Do Nothing. |
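Modelling each data line's Read/Do-Nothing state as a bit in a mask, the partial fetch described above looks like this (the function and mask names are invented for illustration):

```python
# A sketch of the per-line read: to fetch only 53 of the 128 bits at
# one location, drive 53 data lines to the Read state and leave the
# rest at Do Nothing (modelled here as a bit-mask; masked-off lines
# simply contribute 0).

def partial_read(memory_word, read_mask):
    return memory_word & read_mask

mask_53 = (1 << 53) - 1                 # read only the low 53 lines
word = (0xABC << 100) | 0x1F2F3        # junk above bit 53, data below
assert partial_read(word, mask_53) == 0x1F2F3
```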
|
|
Next, you have also described existing SIMD instructions (Single Instruction Multiple Data), which
are used to process multimedia data streams. I didn't happen to think of that idea, but I did mention at the very bottom of the essay, that they should be included. |
|
|
About distributing the idea, I originally did mail off copies to several places, and then shelved it. Later, when I wanted to start thinking about bit patterns for a 256-bit processor, I couldn't find it! Only a few days ago did it surface by accident, in a very unexpected place. Now it is posted where anybody can find it. Hmmmm. In my various searches of the Web for pages containing whatever-I-might-be-seeking, pages from the HalfBakery never seem to come up. I wonder if Jutta can do something about that...if she wants an even busier place, that is. |
|
|
thumbwax, NO, I did not mean to imply that this proposed microprocessor would ONLY have 192 pins. I don't claim to know enough about all the overhead-pins to offer a guess as to how many total this processor would be expected to have. |
|
|
I'm not quite sure what you are asking with "would stack addresses run parallel?", when I THOUGHT the essay fairly clearly states that when stacking the data in the 64-bit registers, two registers would be saved at each memory location, an even total number of registers being required for stacking/unstacking. Since the list of stacked registers corresponds to which-bits-are-set in the stacking instruction, one way to think of the stacking process is as examining a stream of bits from first to last when stacking, and from last to first when unstacking. Each set bit means the associated register has its data stacked, in left64/right64, next-address, left64/right64, next-address, etc. fashion (and right64/left64 when unstacking). (Or call them hi64/lo64, if you prefer.) |
|
|
And thank you! for the kind remarks about potential future compensation. |
|
|
There are 11,913 words in this idea; 67,283 characters (including spaces). Just thought you'd like to know. |
|
|
The most important thing I learnt when doing academic research was: If you've got a complex technical or scientific idea and you can't express it in a single, simple sentence then you haven't thought it through properly and you don't yet know what you're talking about. So how about it: Can you give us one short sentence which expresses why this is an important idea? |
|
|
[Rods]: No, do you think I've got nothing better to do than count carriage returns? |
|
|
angel, unfortunately when something is posted on the HalfBakery, it undergoes a text-massaging process that tries to eliminate excess carriage returns and spaces. For purposes strictly pertaining to posting the idea here, I deliberately added quite a few extra carriage returns, to keep certain lines from becoming run-together. It didn't do enough good. Some of those lines had precisely-spaced characters in them, using a fixed font, and they have all been messed up royally. (What I am trying to say is, your count of characters is rather different from what the true count was, before mine and the HalfBakery's text-massaging.) |
|
|
hippo, since this is the place for half-baked ideas, why are short descriptions automatically therefore important? Not to mention that this particular "idea" is actually a whole raft of them. Each had to be explained, and organized in relation to the others. And then there is that phrase "microprocessor description" in the subtitle! Why isn't that good enough for you? Finally, since different people have different standards as to what qualifies as an important idea, all that can be done is let an idea speak for itself. Individuals can only read and decide. |
|
|
// I think it would be an achievement to get the annotations to wrap past the end of this particular idea! // |
|
|
we're about a third of the way there. anyone else have anything interesting to say? |
|
|
If I'd known my text could be massaged for free I wouldn't have been paying for it this whole time. Sorry. |
|
|
I'm afraid I can't really contribute to a discussion on processors as I don't really know anything about them and quickly got lost in all the technical details, but I am having a good go at helping to extend the annotations... |
|
|
this idea is beyond my attention span |
|
|
I was going to add an annotation to this, just to contribute in the long, slow crawl towards the bottom of the Vernon's idea. For a second, Rods Tiger, your comment almost made me lose faith. Closing the History window on Internet Explorer, suddenly the browser window widened and the distance to the bottom doubled. But then I thought, if it works one way, it works in reverse. So I just re-opened the History window and widened it to take up half the screen. By the time the annotation column is only about 20 characters wide the notes not only reach the bottom of Vernon's idea, they actually pass it... at which point I realised that I didn't have to add this note. Oh well. |
|
|
Returning to the beginning of the annotations I'm just wondering how phoenix can WIBNI what is effectively a brief design spec. Bad phoenix, naughty phoenix - you're grounded for a week and no firelighters for a month. |
|
|
I don't pretend to understand the detail, but what this looks like is a revolution over evolution redesign of the microchip. Unfortunately this will need it's own operating system and the necessary "killer app" to make people buy it. Would this be any use as a games console as these are designed more from the ground up than general purpose computers? |
|
|
Can someone summarize this article for those of us who, well, you know? |
|
|
[st3f] "I'm just wondering how
phoenix can WIBNI what is
effectively a brief design spec" |
|
|
And it's not that this is a bad
idea but the idea of enhancing
a processor by widening its
data buses and/or taking
advantage of data caches is
baked. I believe you'll find
that there is currently a trend
in computing towards smaller
data buses which are more
easily manufactured and can be
more efficiently clocked (e.g.
Serial ATA, USB, FireWire, etc). |
|
|
That said, I hope [Vernon]
finds a buyer for his CPU,
makes millions, buys a tropical
island and has us all over for
tequila. |
|
|
Erm... I'm with egnor, I really need a summary. I bet the reason no-one has written one is that they still haven't read the whole idea... |
|
|
Man, I love the 6809. In fact, I'm coding on one right now with my current project. LEAX 1,X FOREVER! |
|
|
st3f, with respect to the notion of needing a "killer app", the essay tries to explain how that is not essential: "...the whole industry starts off on an equal basis with respect to the proposed 12864 microprocessor, and there should now be no barrier to creating an industry-wide standard." Remember that this was written in 1991, before anybody could consider making this. However, they COULD have said, "Yes, let's create an industry-wide standard CPU, based on something that nobody has any legal control over." Which is why I declared the essay public domain. Then, by the time anyone could have made superprocessors like this one, all the software developers would have had years to prepare for conversion, no matter how incompatible it might be. The basic idea was to design a processor SO superior that everyone would want it, AND provide the time frame needed for acceptance/preparation. It's kind of late for that now, but, as noted at the very bottom, there's no reason why we couldn't set out to design a 256/64, or maybe even a 1024/128 processor.... |
|
|
egnor, the preceding may perhaps qualify as a summary. All the details merely show that I was willing to give the notion reasonable consistency, and bring it to the actual half-baked point. Even as we write, consider recent Intel and AMD history: A couple years ago they decided that they would make 64-bit processors that would obey certain instructions in specific ways, and THEREBY they created "instruction sets". An Assembler lets a programmer use that instruction set to create programs that get things done; an Emulator is a program for Computer A that can take those program-instructions and run them AS IF Computer A was actually Computer B (the computer for which the instructions are actually intended). So, while programmers had a chance to pretend they were using 64-bit processors (and modify existing software to take advantage of their features), Intel and AMD could develop the actual 64-bit hardware that would supposedly function according to the instruction-set specifications. THAT IS THE KEY: It was by specifying much of the instruction set for the 12864 that I hoped other things could follow. |
|
|
Rods Tiger, you got that right. |
|
|
phoenix, the essay is indeed only a brief design spec. I mentioned in a prior annotation that the FULL specifications would require an entire book, and that remains utter truth. Also, you are missing the point about the high-speed narrow-channel serial connections like FireWire: They are used for PERIPHERALS only, and they are used to reduce the cost of connecting those same peripherals. It is simply cheaper to make a cable with 2 wires in it than a cable with 128 wires in it. But, since a cable with 128 wires can carry 64 bits of data simultaneously (each distortion-resistant signal requires a separate pair of wires), the 2-wire cable has to operate at 64 times the speed, to be equal. THIS translates as more-expensive hardware on each end of the cable. So trade-offs are inevitable.... So, for dealing with communication between CPU and main memory, you MAY be able to condense it some (let's say 1/4): 128 bits goes out of the memory to a communications controller at 1Ghz; the controller sends 32 bits at 4Ghz to a second controller at the CPU; and the second controller downshifts it back to 1Ghz and 128 bits for the CPU. Now, HOW MUCH POWER will those communications controllers consume, and is this a worthy trade-off to merely finding space among the layers of the motherboard for 128 data lines, connecting CPU directly to memory? I SHOULD NOTE that the preceding is hypothetical, a more correct description of what is actually done involves ONE intermediary chip between CPU and memory. This chip deals with the fact that the main memory DOESN'T run at the same clock speed as the CPU -- the "bottleneck" mentioned in several places above. Nevertheless, for today's 32-bit systems, the intermediary chip can gather 64 bits at once from the main memory at the max speed that the memory can handle, and feed it to the processor at a faster speed, for the CPU's benefit. 
[No conflict here: if the memory runs at 200 MHz and the CPU runs at 1.6 GHz (8 times faster), then the intermediary chip has to spend 8 CPU cycles talking to the memory, and THEN spend one CPU cycle talking to the CPU.] Well, if the memory instead communicated 1024 bits at once to the intermediary chip, THAT chip could divide the 1024 bits into eighths, and feed the CPU 128 bits on each one of its cycles! NO SYSTEM TODAY does anything like that, to the best of my knowledge. I didn't even broach the notion in the essay. I DID want the wider data path to try to keep a faster-clocked CPU busier, but ten years after thinking it up, the fact is, as previously stated, the memory still lags behind the processor. Barring the introduction of faster RAMs into the market, THIS is the sort of thing that should be proposed for a far-superior next-generation processor, that everyone would want, and that the Industry might decide to settle on as a standard. Wider data paths really are better! |
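The arithmetic behind that "feed the CPU 128 bits on each of its cycles" claim can be sketched in a few lines. The figures below are the annotation's own hypothetical numbers, not specs for any real part:

```python
# Bandwidth check for the intermediary-chip scheme sketched above.
# All figures (64-bit memory at 200 MHz, 128-bit CPU at 1.6 GHz, 1024-bit
# memory path) come from the annotation's examples.

def words_per_cpu_cycle(mem_bits, mem_mhz, cpu_bits, cpu_mhz):
    """How many CPU-sized words the memory can supply per CPU cycle."""
    mem_bandwidth = mem_bits * mem_mhz   # bits delivered per microsecond
    cpu_demand = cpu_bits * cpu_mhz      # bits wanted per microsecond
    return mem_bandwidth / cpu_demand

# Today's situation: 64-bit memory at 200 MHz feeding a 128-bit CPU at 1.6 GHz
print(words_per_cpu_cycle(64, 200, 128, 1600))    # 0.0625 -- the CPU starves

# The proposed fix: 1024-bit memory path, divided into eighths of 128 bits
print(words_per_cpu_cycle(1024, 200, 128, 1600))  # 1.0 -- one word per cycle
```

This is only the raw-bandwidth view, of course; it ignores latency and the cost of the wider memory bus, which is exactly what the rest of the thread argues about.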
|
|
The tropical island party sounds like a pretty good idea, though, but don't expect tequila. Didn't you know that the leading cause of stupidity is alcohol? I don't want to become more stupid than I already am, so I leave the stuff alone. And won't serve it, either. People who want to poison their brains will have to bring their own. |
|
|
RobertKidney, perhaps you will find the above notes to st3f and egnor adequate. |
|
|
koz, if I could have found a decently paying job working with 6809's in 1991, I'd likely be there yet, and perhaps never have gotten around to writing this essay. |
|
|
waugsqueke, good one!— | Vernon,
Jul 16 2001, last modified Jul 17 2001 |
|
|
|
Please can I adopt the word 'verbose'? |
|
|
Sorry, [Vernon]'s already cornered the market on that one. |
|
|
[DrBob] Be my guest. <change of topic> Are you going away on holiday at all this year? |
|
|
Rods, Vernon: It's been so long since I used a unix that the idea of cross-compiling didn't even occur. I guess all you need to get this going is a kernel, a simple C compiler and a shell? (along with a motherboard to support your processor). |
|
|
Quote from POYF's link: "...it is fully compatible with x86 architectural platform...". |
|
|
So that would be a *different* architecture then. |
|
|
Normally I'd look deeper in the site and search even a tenuous link for your assertion, but with a name like that you're just asking to be flamed. |
|
|
I must be missing something. How is a computer with 10 processors (ELBRUS) the same as this proposed single microprocessor? |
|
|
And while I HAVE had some notions of "shared memory" architecture, they were based on dual-port RAM, which I'm pretty sure did not exist even in Russia in 1970. |
|
|
So can anyone tell me where I could go or what I would
have to do to gain the appropriate knowledge to be able
to begin to understand the specifics of what this idea is
saying? Is there a class I can take? |
|
|
POYF: Fine, but if a massively wide memory bus is hooked to 10 processors TODAY, the memory will STILL be unable to keep up with the demands of the CPUs. I think that keeping one CPU well-fed should be sufficient. Not to mention that the deficiencies of merely 8-bit-wide memory should now be so obvious that who cares if the Russians thought of it first, that long ago? Any patent will have expired! |
|
|
More about my shared-memory notion using dual-port RAM. Let me say right away that this is NOT about numbers of data lines at each address; it is only about connecting the address lines. Suppose you had 4 processors, each of which had 32-bit address space. Now imagine a memory array arranged AS IF it had 33-bit addressing (8 gigabytes total): |
|
|
[#4:2gigs:#1][#1:2gigs:#2][#2:2gigs:#3][#3:2gigs:#4] |
|
|
I have broken up the totality into 4 groups of 2 gigabytes. Each processor has full access to 4 gigabytes of RAM (normal for 32-bit machines), but every block of 2gigs is shared between two of the processors. |
|
|
Next, if I recall the general specifications of dual-port RAM correctly, each memory cell is designed so that one 'port' can be used either for Read or Write, while the second port is Read-Only. Originally used in some video cards, the onboard graphics processor could manipulate data in the RAM freely, and a separate chip for scanning the RAM could equally freely translate the data into graphics images for a monitor, with no worry that any given memory cell might be in use by the graphics processor. |
|
|
[#4RO:#1RW][#1RO:#2RW][#2RO:#3RW][#3RO:#4RW] |
|
|
Assuming that the entire 8 gigabytes described above is dual-port RAM, we arrange things so that each processor can Read or Write to 2 gigs, but can Read Only from 2 other gigs. Each processor can place and modify any data it wants one of the other processors to have access to; that other processor can only READ it, not alter it. |
|
|
Now, the above described scheme offers simplicity in understanding the basic idea. But it offers difficulties if Processor #1 needs to give some data to Processor #2. (The data would have to be passed through Processors 4 and 3, in that order, first.) Certainly a more complicated variation of this idea would allow every processor a decent amount of Read/Write memory space, and access to sufficient ReadOnly memory space of ALL the other processors, for maximal efficiency. |
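The ring arrangement in those two diagrams can be modeled in a few lines. This is a toy sketch only; the function names and 1-based processor numbering are mine, but the RW/RO assignments follow the [#4RO:#1RW][#1RO:#2RW]... diagram exactly:

```python
# Toy model of the ring-shared dual-port RAM layout diagrammed above.

NUM_CPUS = 4

def memory_map(cpu):
    """Return (rw_owner, ro_owner) for processor `cpu` (1-based).

    Each processor Read/Writes its own 2 GB block, and may Read Only
    the block owned by the next processor around the ring.
    """
    rw_owner = cpu
    ro_owner = cpu % NUM_CPUS + 1   # per the diagram: #4 RO-reads #1's block, etc.
    return rw_owner, ro_owner

def relay_path(src, dst):
    """Processors that must copy data onward before `dst` can see what
    `src` wrote -- illustrating the annotation's #1-to-#2 difficulty."""
    # The direct reader of src's block is the processor 'behind' it in the ring.
    reader = src - 1 if src > 1 else NUM_CPUS
    path = []
    while reader != dst:
        path.append(reader)
        reader = reader - 1 if reader > 1 else NUM_CPUS
    return path

for cpu in range(1, NUM_CPUS + 1):
    rw, ro = memory_map(cpu)
    print(f"CPU#{cpu}: RW block #{rw}, RO block #{ro}")

print(relay_path(1, 2))   # [4, 3] -- "passed through Processors 4 and 3"
```

The `relay_path` output reproduces the annotation's complaint: getting data from #1 to #2 takes two intermediate copies, which is why the more complicated variation is suggested.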
|
|
Vernon: If you (or anyone else) wishes to discuss micros with me, email me at 12864@casperkitty.com |
|
|
[supercat]: You might want to check the maximum file size for your e-mail account. [Vernon]: What [waugs] said! |
|
|
POYF, obviously I disagree, and this is why: First of all, on what grounds do you assume that a wide memory path automatically means more expensive memory? In a prior annotation I describe how 4 ordinary (8-bit) bytes at 4 different addresses in a bounded group are currently wired together to yield 1 block of 32 bits for a processor. I say that to get 128 bits at any one address, it is ONLY a wiring issue, not an issue inherently related to the type of memory. Now I do recognize that larger QUANTITIES of memory will be required, and that translates as more expensive -- but that is normal! -- while the way you phrased it implies that an inherently more expensive type of memory must be used, and that is not true. |
|
|
Next, if you had actually read the specs for this proposed processor, you would know that while it has a 128-bit data path, it only uses a 64-bit address bus and 64-bit internal registers. Many of the actual instructions are fully described in 64 bits, so when 128 are fetched, the processor gets both instruction and data at the same time. If that data happens to be address information, IT IS SUFFICIENT. Please make sure you actually know what you are dismissing, before you dismiss it. |
|
|
Next, if 128 data lines effectively connect the CPU to the memory (ignoring intermediary chips for the moment), on what grounds do you assume that the data-transfer process must be slower than when only 32 data lines are connected? I agree that more power will be necessary, simply because 128 circuits and not 32 are being used, but the speed should be exactly the same. |
|
|
Next, only in one respect will programs written for this processor likely take up more memory space than programs written for an ordinary processor. That respect involves actual data storage: If I want to save an ordinary 8-bit byte somewhere today, it takes one byte, whereas on this computer design at least 8-bytes-worth (64 bits) will be consumed. (Since every memory location would be able to hold all the data of 2 registers, I shall assume that any halfway intelligent programmer, when saving data to memory, will always try to save 2 registers -- holding two data items -- at a time.) Nevertheless, it is hoped that the efficiency of the instruction set will make up somewhat for that wasted space -- there is a place in the essay which indicates that if a DATA-LESS 64-bit instruction is pulled, what should the other 64 bits contain? A second data-less instruction! Read it and see. |
|
|
And that brings me back to the quantities-of-memory issue: If I have an existing program that covers 1 million addresses (8 bits each), then IF PERFECTLY rewritten for this processor, it would still occupy 8 million bits, but the layout would be 1/16-of-a-million addresses (128 bits each). How much more it would actually take depends on what percentage of the program consists of data that MUST waste space, as described in the previous paragraph. Let me assume that the program ends up using 1/8-million addresses, due to such wastage. If every average program converted like that, and if your current computer has 128 megabytes of RAM, then your new computer, running the converted software and a 12864 processor, would only need 256MB of RAM (rearranged in terms of number of bits per address, of course). That's a price differential, for ordinary SDRAM, of less than $75 these days. What will the price differential be by the time any 12864-based computers actually hit the market, eh? |
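The space arithmetic in that paragraph checks out, and can be verified directly. Note the factor-of-2 "wastage" below is the annotation's own assumption, not a measurement:

```python
# Verifying the memory-conversion arithmetic above.

old_addresses = 1_000_000                   # existing program: 8 bits per address
total_bits = old_addresses * 8
perfect_new_addresses = total_bits // 128   # if PERFECTLY repacked at 128 bits each
print(perfect_new_addresses)                # 62500, i.e. 1/16 of a million

wastage_factor = 2        # ASSUMED, per the annotation's "1/8-million" guess
assumed_new_addresses = perfect_new_addresses * wastage_factor
print(assumed_new_addresses)                # 125000, i.e. 1/8 of a million

# Scaling a 128 MB machine by the same factor:
old_ram_mb = 128
new_ram_mb = old_ram_mb * (assumed_new_addresses * 128) // total_bits
print(new_ram_mb)                           # 256
```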
|
|
And as for implying that nobody really needs high-performance computers like this one, there are two things you are ignoring: (1) People have unlimited wants (fundamental basis of capitalism); and (2) Lazy programmers write more and more bloatware. Give 'em enough time, and you wouldn't be able to play Tetris without a supercomputer! |
|
|
And the baby spoke his first word: dhuh? and after two weeks: 110011101011100011110101010011101 |
|
|
I think I'm stuck in the first week.. |
|
|
Comrade Poyf, are you saying this has been baked in the form described within these chambers? |
|
|
POYF, thank you for the more detailed information regarding your position. I did indeed write the essay from a programmer's perspective. After all, I had heard that when Motorola designed the 6809, they
first interviewed a bunch of programmers to find out what they wanted most, and then gave it to them (easy handling of position-independent code, for example). Why shouldn't what worked so well be tried again? "Practical engineering," eh? Ah, but the art does improve as time goes by.... |
|
|
I do see what you are getting at about the *native* wide memory -- the memory has to run at the same speed as the CPU, and only expensive SRAM can currently do that. But the "intermediate chip" that I mentioned was supposed to be understood as simply an appropriately updated version of an ordinary "North Bridge" chip. Perhaps I am basically misunderstanding exactly where the timing-synchronization occurs, that lets the CPU run at a multiple of the main memory speed, yet they still communicate successfully? (In already-existing computers, that is.) |
|
|
Yes, of course you need a very long word to store each very long instruction. However, what does the instruction DO? Fetch or manipulate or store data? The long instructions for the proposed 12864 processor would be capable of specifying all three tasks, every time. Thus all the bytes associated with storing any grouping of three ordinary instructions will be consolidated into just one of these long instructions. I would expect the total number of bits that hold instruction-information to end up being roughly the same, when comparing ordinary processors to this one. The major exceptions are those instructions that do simple things to the contents of data already in a register, such as flip all the bits. The more that some program uses those types of instructions, the more wastage there will be when the program is converted for the proposed 12864. Because, as you said, existing processors do tend to have commonly-used instructions fitting in less space than other instructions. Yet, as previously stated, memory doesn't cost so much these days, that the difference can't be accommodated reasonably easily. |
|
|
With respect to compression and other intermediate tricks for avoiding a 128-bit data bus, I tend to think that the complexity they introduce isn't worth the trade-off. In the long run the capacitance issue you mention will be solved by designing shorter signal paths (or moving to optical paths), among other things. Not to mention that there is one aspect of that issue about which I need clarification. Consider a reasonably standard memory chip, which is wired internally to yield 4 bits per address. Two such chips
provide a full byte when requested, four of them provide a pair of bytes when that is requested, and so on. EACH of those chips is independently powered! Each is responsible for meeting the capacitance needs of the data lines that it outputs to! So, if I want to fetch 128 bits at one address, there could be 32 of these chips all working together, and each one is still independently powered. WHY, then, is there really the capacitance issue that you described? (Now I do recognize that there is another issue with physically mounting so many chips just to obtain 128 bits per address, but we can save that for later.) |
|
|
Finally, with respect to wasting processor power: "It is better to have and not need, than to need and not have." Uses for such power will inevitably arise, if for no other reason than because it is there. |
|
|
P.S. it does take a certain amount of time to craft a worthy reply. |
|
|
Hmmm... for some reason, I have a funny feeling we may revisit this idea in the future as evidence of prior art in a patent law suit. |
|
|
POYF, it seems you have raised an interesting engineering challenge. Deserves thought. Probably has a solution. And I, too, need to get some actual work done at my job. |
|
|
Rods Tiger, you might look up "magnetic bubble memory", because it holds a serialized bitstream. |
|
|
angel and thumbwax, those two links are hilarious! |
|
|
Tip o' the hat to [Piss On Your
Fire] for breaking the [Vernon]
idea barrier. I can rest
easily now that the annotations
have passed the idea itself. |
|
|
Not on my display, they haven't. I'm only as far as 'I don't propose to offer a complete list here'. |
|
|
We could get to the bottom quickly by saying, '[Vernon], would you please clarify ...' and quoting a whole paragraph. He would then quote our entire annotation before clarifying. Of course, we still wouldn't understand. |
|
|
Hey! No fair! I've read all the way through the words (most of which was well over my head - God forgive you if you're bluffing Vernon, Rods et al) and the bloody firewall denies me my reward by stopping access to Thumb & angel's links. Is there no justice? |
|
|
Heh heh. I've got "And just to be complete: |
|
|
|6 5|5 5|5 5|4 4|4 3|3|3 3|3 2| |
|
|
|3| | | | |8|7| | |4|3| | |0|9| | |6|5| | | | | |9|8|7| | |4|3| | | |9|" And the scroll thingy is only just over halfway down the page. I guess I can't complain about being given a big monitor. |
|
|
DrBob, this essay was no bluff in 1991, and I'm not bluffing now. I do confess to enjoying how well most of the ideas have held up through 10 years of computer progress, but I STILL think that the Industry should design some sort of very powerful processor, that could handle any average person's demands for a whole lotta years, and GO FOR IT. Resolving all those incompatibility issues would be worth it! |
|
|
Rods Tiger, hopefully you now know that instructions ARE a variety of data, which is why they are loaded on the data bus. |
|
|
But speaking of extra busses, I have a few more notions to spout.... |
|
|
Suppose we take that dual-port RAM described earlier, and make it the main memory for a single processor. If I recall right, this type of RAM has two address busses and two data busses (so that two different accesses to it can occur at the same time). Therefore the processor should ALSO have two address busses and two data busses. |
|
|
See, one of the most annoying things a processor has to deal with is the conditional branch instruction: IF such-and-such, GOTO some address where appropriate instructions can be found; otherwise it continues processing the current group of instructions. |
|
|
Now, instructions and data are loaded 'en masse' into a cache, for fast access. Let's say that some block of information has been loaded from Main Memory Address #1,000,000. If the processor encounters an instruction to start manipulating data at Address #2,000,000, it has to re-fill the cache, which slows the system down. EVERY large-enough jump to some other Main Memory Address will have the effect of slowing the system down. |
|
|
These days the processors have special hardware to speculate in advance about which branch is likely, so that the cache can be filled in advance of the need for the data. However, while such speculations are right maybe 95% of the time, it seems to me that there is another way that may be better. |
|
|
For example, what if there were two caches? These days a processor can just about always compute the destination-address of a jump well in advance of knowing whether or not the jump will actually be taken. So, while processing instructions from the current cache, the other cache is being filled with info in case the jump occurs. Then implementing the jump is as easy as switching to the other cache, as the source of instructions and data. |
|
|
Which is why two address busses and two data busses would be needed by the processor! The Read-Only bus would be used to fill the 'just-in-case' cache, and it could be done at the same time that the main Read/Write bus is being used to stow processed data into the Main Memory. |
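The two-cache branch scheme can be sketched as a toy model. Everything here is illustrative (the class, method names, and the line-count of a cache fill are all made up); it only shows the control flow the annotation describes, where a taken branch becomes a cache swap instead of a refill stall:

```python
# Toy model of the dual-cache branch scheme described above.

def memory(addr):
    """Stand-in for main memory: returns the instruction word at an address."""
    return f"insn@{addr}"

class DualCacheCPU:
    def __init__(self):
        self.active = {}   # cache currently feeding the pipeline
        self.spare = {}    # cache being pre-filled with the branch target

    def prefetch_branch_target(self, target, lines=4):
        # Filled over the second (Read-Only) bus, while the main
        # Read/Write bus remains free for storing processed data.
        self.spare = {target + i: memory(target + i) for i in range(lines)}

    def take_branch(self):
        # A taken branch is just a swap -- no refill stall.
        self.active, self.spare = self.spare, {}

cpu = DualCacheCPU()
cpu.prefetch_branch_target(2_000_000)   # destination computed well in advance
cpu.take_branch()
print(cpu.active[2_000_000])            # available immediately after the swap
```

A real design would also have to handle the not-taken case (discard the spare fill and start the next speculative fill), but the sketch shows why the scheme wants that second address bus and data bus.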
|
|
(The preceding is not especially related to the size of the data or address busses, but obviously, the more signal lines, the more pins are needed on the CPU, and the harder it becomes to lay out the circuit-traces on a motherboard.) |
|
|
And another alternate idea simply involves ultra-fast Main Memory. SOME day they will probably succeed at making memory fast enough to keep up with the CPU, and cheap enough to make in vast quantity. Then there will be no need for any sort of cache, because whenever the CPU needs to switch to some other address, the data will be there as fast as the switching occurs. |
|
|
[RodsTiger] "If, (while you're dreaming up a new cpu anyway) you incorporated a microcoded-in-kernel (or even auxiliary kernel) hardware Gzip routine" |
|
|
Baked as an option for PowerPC cores (see link) although as far as I can see for code only, not data. |
|
|
It's disconcerting to open up a HalfBakery idea which is about as long as my dissertation (which I'm meant to be writing RIGHT NOW) is supposed to be. |
|
|
[Vernon] Two caches? I dunno. Sounds as if my defragger will be working overtime. I'm convinced M$oft has already created a windows version for this 12864 hardware and will soon weight it down to run on my platform. Call it maybe WD^40; assuming it is crashproof? |
|
|
Right Brain: "Make it stop! Aaaah!"
Left Brain: "Interesting."
3 minutes later
Right Brain: <commits suicide>
Left Brain: "Make it stop! Aaaah!" |
|
|
Following on from AA's admirable critique: |
|
|
Right brain: Please tell me why I should want one of these at the start before you list the 914 registers, addressing modes etc. |
|
|
Left brain: What're the power dissipation figures for this? Could I use it in a laptop? A mobile phone? What's memory running at 2 GHz going to do to that? Isn't there a reason we have 1000 different microprocessors out there? What's wrong with RISC (Small core size, low power requirements, easy to write code for)? Many DSPs already access memory in 48/32 bit not 8 bit chunks - this is very annoying if you want to program them for smaller data sizes. And the ARM7 uses a 32 bit word size but can also use a series of 16 bit instructions for simple operations to save on code size, so that bit's baked. Many processors access the memory almost every cycle due to pipelining; in some processors like the ADSP-2106x you can access memory twice every cycle, because it's divided into 2 separate blocks, and well-pipelined, with delayed branch instructions (branch in 2 instructions' time), so even when you're branching you can execute more instructions while the PC adjusts. |
|
|
I think most of this is baked, but i don't want to read it all. |
|
|
<Head uncovered> Wow. </> |
|
|
This is the stuff of which legends are made.
I thought ideas at the halfbakery had to fit in one screen or less. Not any more, though. |
|
|
I'm curious. Just by how much does the modern P4 or its peers approach this? |
|
|
Everyone who understands all this deserves a croissant. I have no idea what it's all about. |
|
|
[neelandan] Check out some of Vernon's other stuff, some (most?) of it's just as bad. |
|
|
[TeaTotal] Yes, anyone who bothered to read the whole lot (myself excluded) would certainly be in need of nourishment, whether they understood the idea or not. |
|
|
I can't even manage to read all the anotaitions. |
|
|
I really need to change some settings because on my screen the annotations still don't reach the bottom of the idea text... (6 screens to go)... can we have marked-for-reduction or something? |
|
|
That's an idea - someone who understands the idea in the first place should put it into Word and try out the auto-summary and see if the amount of sense made changes... I wonder what would happen if we got it down to a page... |
|
|
Just scrolled down thinking this was very Vernon-esque. 1/2 an hour later I find out it was by Vernon. Some things on 1/2B are just too predictable. |
|
|
WOOOOOOOOOOOO!!!!! WOOOOOOOOOOOOOO!!!! |
|
|
The new processors are, I believe, faster at such things as compressing video (from avi to mpeg) and sound (from wav to mp3). I have not done the comparison, but that is what the literature available has made me believe. |
|
|
But my question: do modern µPs approach (or, heaven forbid! surpass) Vernon's 12864? |
|
|
neelandan, when the essay was written in 1991, about the only 64-bit processors around were mainframes. I am not sure if any 128-bit central processing units exist today, but wouldn't be surprised if mainframes and supercomputers have been at that level for years (I do know that graphics processing units are up to 256 bits). Certainly for the desktop computer market, there are not any 128-bit CPUs, and only a few 64-bit CPUs (such as the Alpha). AMD's new Hammer will join the fray later this year, with 64 bits for anyone who wants it on their desktop. Intel's current 64-bit offering is being directed to the server market (and no doubt the workstation market), rather than the ordinary desktop market, but no doubt their tactics will change if AMD has a lot of success. |
|
|
So processors with 64 address lines will soon be common. Intel's instruction set for its processor is probably comparable to the one proposed here (both are Very Long Instruction Word types). |
|
|
Since all those 64-bit processors also gobble data in only 64-bit chunks, they would not match the 128-bit specification of my proposal. But I may be mistaken: As mentioned in a prior annotation, while the ordinary Pentium processes data 32 bits at a time, it actually pulls 64 bits at a time from the memory (most of the time for cache-fills, I expect). So perhaps something similar is about to happen for the new AMD and Intel processors. |
|
|
So while my proposal is not outdated now, it probably will be in another 5 years (and possibly less). |
|
|
Vernon: The term "reduced instruction set computer" does not refer to the set of instructions being reduced, but rather to the fact that the computer uses a set of reduced instructions: each instruction does 'one thing', with no side-effects. Things like pre-decrement addressing are great in non-pipelined machines on which instructions are inherently guaranteed to run atomically, but would pose two major problems on RISC machines: (1) they introduce additional addressing and data dependencies which must be tracked by the compiler and/or the hardware; (2) on any processor which allows virtual memory, there must be a means by which a program that triggers a page fault can be resumed. On a RISC machine, there is no such thing as a 'partially executed' instruction; either an instruction has completed or it has not. By contrast, an instruction which uses pre-decrement addressing may be in various levels of completion when a page fault occurs. Recovering from a page fault requires capturing the internal state of the processor, and thus requires the existence of much special circuitry precisely for that purpose. |
|
|
supercat, I did not mean to imply anywhere here that the proposed 12864 microprocessor qualifies as a RISC device. Its instructions are complicated because they are versatile; each is a starting point for a lot of programming possibilities. And certainly appropriate recovery circuitry will be required. Its primary simplification and source of efficiency comes from having as few as possible different KINDS of internal registers, by letting almost any register be used as part of any task. |
|
|
Folks, AMD's new Hammer processors (64 bits) are now in actual computers being tested. There are currently two variants, and both have ENORMOUS numbers of pins. See the link I added. |
|
|
Rods Tiger and quam, see newly added link about Transmeta. Heh heh. |
|
|
OH. MY. GOD. I am humbled. And nowhere near topic-anno parity. So I thought I'd jump in the shallow end of this here ocean & add my little dribble of piss. |
|
|
I find it fascinating that Vernon's last name is Nemitz. Almost like the USS Nimitz. |
|
|
Yes, very apt. The Nimitz Class are the massivest vessels in the history of the universe. |
|
|
Do you have any proposed instructions as an example to be used with the 128-bit data fetch and bit set? |
|
|
Thanks in any case. It got me running on an old idea of mine, which I'll put in right now... (See Plug and Play Distributed Processors) |
|
|
One day, when my children's children are half-baking, they may see the annotations for this idea grow longer than the idea itself. |
|
|
Yes, this idea must be the longest one ever in the history of the HalfBakery! |
|
|
Hi folks! It's been a while since I checked this, but I haven't forgot; just been real busy. |
|
|
pashute, I'm not quite sure what you were asking about.
Perhaps you asked it because of the way the main text
became jumbled? It needs to be viewed in a fixed font.
The link "12864(again)" will take you to another posting
of the idea, which, if your browser doesn't show it in
fixed font, you can copy/paste it to a text editor like
WordPad, and force a fixed font. Many of the 12864
instructions are mentioned in a generic way, and I tried
to present enough detail that specific versions of them
could be deduced. |
|
|
Regarding my last name of Nemitz, I'm told that this
and other similar versions may have a common root,
some centuries back. None are considered "close",
nowadays, so I don't think of myself as having any
significant genealogical relationship to Chester Nimitz. |
|
|
Regarding the crawl of annotations, I might mention that
my monitor resolution is normally set to 1152x864 pixels,
and without modifying any browser settings, I'm now
seeing the annotations exceed the text. On this annotation
page, though, they have a little ways to go yet :) |
|
|
I've just finished reading Vernon's anno. It's taken me 1 year 10 months and 3 days. Can anybody explain it to me? |
|
|
Bender from the TV cartoon Futurama, set in the year 3000, is powered by a 6502. Hence that should do for all our next millennium's needs. |
|
|
on my apple g4 laptop, at 1152
pixels wide, on safari 1.0 beta 2,
the anno's reach about 75 percent
of the way down the main idea. |
|
|
i propose that this idea be
required reading. having been a 1/
2baker since 2000, but having
taken a hiatus for most of 2002,
it's refreshing to go back and read
through ideas like this one, which
predated the misconception that
the 1/2b is a forum for bad puns
and cleverly worded shutdowns. |
|
|
For anyone interested, and too lazy to go to the 12864(again) link, I've done some editing, to try to make the tabulated portions of this Idea look a little more like they do when viewed in a fixed font. |
|
|
Hey Vernon. Having the document in better format improves things considerably. At a quick glance, though, I don't quite see what's particularly novel about it. Yeah, it uses big instructions, but since larger instructions take longer to fetch than small ones and/or require more cache to store, the benefits of using such large instructions must be substantial to justify them. |
|
|
BTW, it should be noted that many processors are in fact tending toward VLIW designs when executing code within the internal cache. Basically what happens is that the code stored in RAM is 'uncompressed' when it's read into cache, and the uncompressed code is then executed. This combines the advantage of a smaller main-memory bandwidth requirement with that of a larger, easier-to-decode instruction. |
|
|
Also BTW, are you familiar with the ARM processor? |
|
|
Finally, on a parting note, I found it interesting that Intel has finally come around to the idea of a processor that executes two simultaneous virtual machines. I'll admit I'm surprised it's taken this long, since the idea seemed obvious to me back in 1992 or so when I was learning about pipelined processor designs. To be sure, if I'd implemented such a design back in 1992 it would have been much simpler than Intel's. On the other hand, much of the complexity in Intel's processors stems from a need to have them be able to execute a single instruction stream quickly. Executing more instruction streams slowly would simplify things considerably. |
|
|
The key issues are pipelining and data dependence. With pipelining, a subsystem can accept multiple inputs in less time than it takes to fully process one. As an example, consider a pipelined multiplier: it can accept one input on every clock cycle, and will produce a result four cycles after it was given the appropriate inputs. If a "multiply" instruction is executed and the result is not needed until many instructions later, other instructions can execute while the data makes its way through the multiplier. If the result of the multiply is needed before it's done, however, then it's necessary to either have hardware to stall the processor until the necessary result is available, or have lots of hardware to allow the execution of instructions that relied upon the multiply to be delayed while other instructions continue to execute. This sort of thing gets horrendously complicated, but is necessary in order for systems to execute code quickly. |
|
|
The key to making a multi-VM processor simple would be to arrange all the subsystems so that all data paths had a uniform pipeline delay (add latches as needed to make the delay longer on subsystems whose data path was shorter than the maximum). Rather than using direct feedback to have registers hold their values, replace each register with a series of latches and a 'write mux' [which determines whether the register should accept new data or simply recycle the old]. Under such a system, much of the complexity associated with pipelining and data dependencies would vanish. If all data paths were normalized to six cycles, the processor would execute one instruction per cycle without any data dependency issues other than the external memory bus (since by the time any instruction executed, the previous instruction would have run to completion). The latches would add some overhead to be sure, but it would be minor compared to the complexity required to handle register scheduling etc. |
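That uniform-delay scheme can be sketched as a toy simulation: with a 6-stage pipeline and 6 virtual machines issuing strictly round-robin, each VM's previous instruction has always retired by the time its next one issues, so no hazard-tracking logic is needed. The 6-cycle depth comes from the annotation; everything else here is illustrative:

```python
# Toy simulation of round-robin multi-VM issue through a fixed-depth pipeline.

from collections import deque

PIPELINE_DEPTH = 6
NUM_VMS = 6   # one VM per pipeline stage keeps every data path exactly 6 cycles

# Leftmost entry just issued; rightmost is the one about to retire.
pipeline = deque([None] * PIPELINE_DEPTH, maxlen=PIPELINE_DEPTH)
issued = {vm: 0 for vm in range(NUM_VMS)}
retired = {vm: 0 for vm in range(NUM_VMS)}

for cycle in range(24):
    vm = cycle % NUM_VMS             # strict round-robin issue
    oldest = pipeline[-1]
    if oldest is not None:
        retired[oldest] += 1         # that instruction has run to completion
    # No hazard logic needed: this VM's previous instruction has already
    # retired, so nothing it depends on is still in flight.
    assert issued[vm] == retired[vm]
    pipeline.appendleft(vm)          # maxlen evicts the just-retired entry
    issued[vm] += 1

print(issued)    # each VM has issued 4 instructions
print(retired)   # each has retired 3; the fourth is still in the pipeline
```

The inner `assert` is the whole point: it never fires, because the round-robin spacing equals the pipeline depth, which is exactly the property supercat describes.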
|
|
A few comments: HOLY SHIT, that is a fucking long post, and...fishbone. I can't stand to read that much text on a computer screen. |
|
|
jutta, I'll see what I can do, but can't make a promise about when I can find time to do it. The changes I did to the tables were to replace spaces with underscores, and to add HTML line-breaks, so the double-carriage-returns that had been previously used to separate the lines could be removed. That was all. |
|
|
supercat, thanks. While it is true that for ordinary processors it takes several clock cycles to load a VLIW, simply because that Very Long Instruction Word occupies a lot of bytes of memory, in this processor the goal was to fit every instruction into 16 bytes or less (128 bits) AND specify a data path of 128-bits/16-bytes at EACH memory address (not the usual mere 8-bits/1-byte). Thus every instruction would load in one clock cycle. |
|
|
by far... the longest, most thorough explanation of an idea I've ever come across... I can't bone or croissant it since I can't understand/finish it, but... wow.... |
|
|
Vernon: On modern PCs, it takes a number of cycles (6 to 12 depending upon CPU and bus speed) to read a single word from main memory. Having small instructions means that several instructions can be executed in the time required to fetch the next word from main memory. To be sure, running code from cache (multiple instructions per cycle) would still be faster than running from main memory (which would still be multiple cycles per instruction even if 4 instructions fit in each code word), but the penalty for running from main memory on a machine with smaller instructions is less than on a machine with large instructions. And that penalty matters bigtime. |
|
|
Consider two machines: (1) 90% of the instructions average 0.5 cycles each, while the remaining 10% average 2 cycles each. (2) 90% of the instructions average 0.1 cycles each, while the remaining 10% average 6 cycles each. Which machine is faster? |
|
|
The first machine will take 65 cycles to execute each hundred instructions; the latter will take 69. This despite the fact that the second machine executes 90% of the code five times as fast as the former. |
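The arithmetic behind those totals is easy to check. Here the per-instruction cycle counts are kept in tenths of a cycle so the computation stays exact (the helper function is just for illustration):

```python
# Per-100-instruction cycle counts for the two hypothetical machines,
# with CPIs expressed in tenths of a cycle to avoid float rounding.
def cycles_tenths(fast_count, fast_cpi_tenths, slow_count, slow_cpi_tenths):
    return fast_count * fast_cpi_tenths + slow_count * slow_cpi_tenths

m1 = cycles_tenths(90, 5, 10, 20) / 10   # machine 1: 0.5 CPI and 2 CPI
m2 = cycles_tenths(90, 1, 10, 60) / 10   # machine 2: 0.1 CPI and 6 CPI
print(m1, m2)   # → 65.0 69.0
```

So machine 1 wins overall even though machine 2 runs 90% of the code five times as fast: the slow 10% dominates.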
|
|
Thanks supercat, but somehow I still think you aren't quite grasping what I've been trying to describe here. Before I get to that, though, I'd like to mention that in the main body of this Idea are only a few references to such things as caching and pipelining. I mentioned them because they seemed worthy, but remember that in 1991 while caching was in wide use, pipelining in ordinary PC processors was just getting started. You should assume that the cache in this processor will be 128 bits wide, at each address in the cache, and that appropriate pipelines will exist. |
|
|
Regarding the fact that several clock cycles are taken to load a single instruction word in today's processors, sure, I do know that. The major cause is simply that while a data word may be 32 bits wide, processor instructions are frequently longer than 32 bits. I seem to recall that even on the ancient 8/16-bit 6809 processor, five-byte instructions were not uncommon. Since that CPU could only load 8 bits at a time, it took 5 cycles just to load the 5 bytes of such an instruction. Today, CPUs have a wider data path, accessing multiple memory addresses simultaneously, but memory is still laid out with only 8 bits of data at each address. (I mentioned in a prior annotation something about the wastefulness of having 4 billion ADDRESSES, but only 1 billion of them could actually be used, since 4-at-a-time always had to be accessed, because the processor wanted 4 BYTES at a time.) |
|
|
Fancier CPUs come with fancier instructions, and so it STILL takes multiple cycles to load the longer instructions. |
|
|
In the 12864, I am specifying that each memory address should have 128 bits of data, so that when the processor wants a full "chunk", it can get the whole thing from just ONE memory address. The next memory address will have another full 128-bit "chunk" -- so every one of those 18 quintillion addresses can be efficiently used. I have also attempted to design the instruction set to fit in those 128 bits. I make no bones about the fact that frequently the data that must be loaded, to be processed by an instruction, will require a separate loading cycle. But I have also tried to indicate that many lesser instructions will fit in 64 bits, so that some appropriate 64-bit data can be simultaneously loaded, all in just one cycle. ALSO, however, because I have designed this instruction set to mostly incorporate THREE "old" instructions per 12864 instruction (GET1 data, GET2 data & operate (perhaps ADD), and PUT result), there should be some savings of cycles from that alone. LOADING a full instruction plus data may often take only 2 cycles, simply because the REGISTERS here are only 64 bits wide. That is, the efficient programmer would put GET1 data in half the bits at a particular memory address, and GET2 data at the other half of the same address (if that data wasn't already in a register, left over from some prior instruction). So, one cycle would load the full instruction, and one cycle would load the full data. Processing takes 1-to-many additional cycles, and saving the result, IF it goes to main memory, takes one final cycle. |
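To make the three-operations-per-word idea concrete, here is a sketch of packing a 128-bit instruction word. The field layout (three 42-bit slots plus a 2-bit tag, which happens to total exactly 128 bits) is entirely my own assumption for illustration; the annotation does not specify an encoding:

```python
# Illustrative packing of a 128-bit instruction word holding the three
# 'old' operations described above (GET1, GET2 & operate, PUT).
# Layout assumption: 2-bit tag + three 42-bit slots = 128 bits.
SLOT_BITS = 42

def pack(slots, tag=0):
    """Pack up to three 42-bit slots and a 2-bit tag into one word."""
    word = tag & 0b11
    for i, s in enumerate(slots):
        assert 0 <= s < (1 << SLOT_BITS), "slot overflows 42 bits"
        word |= s << (2 + i * SLOT_BITS)
    return word

def unpack(word):
    tag = word & 0b11
    slots = [(word >> (2 + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
             for i in range(3)]
    return tag, slots

w = pack([0x123, 0x456, 0x789], tag=2)
assert w < (1 << 128)            # fits one 128-bit memory word
assert unpack(w) == (2, [0x123, 0x456, 0x789])
```

The payoff is the one described in the annotation: since the whole word comes from a single 128-bit-wide memory address, all three packed operations arrive in one fetch cycle.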
|
|
The intermediate processing is the thing that gets most affected by pipelining, and I don't know enough to say how much effect there will be. I'd like to think that all the hardware pipelining efficiencies that have been learned over the past decade can be applied to the 12864, such that that ".1 cycle per instruction" that you mentioned would indeed be the norm, and for much more than merely 90% of instructions. |
|
|
Main memory fetches on modern computers are always done at least 64 bits at a time. Code bandwidth is not as relevant to CPU speed as it used to be, but it's still significant. |
|
|
it makes more sense if you read it again. all of it. out loud. in a pirate voice. |
|
|
I knew that page bottom button would come in handy. |
|
|
Hah! Synchronicity - it is precisely 20 years TO THE DAY since I handed in my resignation from my full time job programming a dual-processor 6809 system. My next job involved programming bit-slice machines with > 512 bits per instruction - VLIW is not new! <FYI> In 1987, a decent sized disk was in the 100-200 Mbytes range, a bit ahead of your 30 Mbyte in 1991. |
|
|
[AbsintheWithoutLeave], thanks for the info. Regarding those hard drives, though, are you talking top-of-the-line models obtained by Big Business, or the mainstream models that the average PC owner might be able to afford? I'm pretty sure my 30MB figure more accurately reflects the latter group. (I distinctly recall, about 1996, an acquaintance getting a 500MB drive and shaking his head saying, "I can't believe how big that is!" (He had replaced a crammed-full 20megger, if I recall right.)) |
|
|
I will obviously be corrected if I am wrong, but wouldn't the power draw of so many more gates switching at once, instead of in rapid succession as is the norm, be a problem? |
|
|
If the power is a problem, it could quite severely limit the size of the chip. |
|
|
Wow. I can't believe I read the anno's - I didn't even try the idea! But I was darn sure going to leave my name on this edifice, after having a cramp in my arm from scrolling to the bottom. |
|
|
"My daddy read through the anno's of one of Vernon's longest ideas and all I got was this stupid T-shirt" |
|
|
[vernon] The disk in question in late 1987 was a 115Mbyte ESDI (full height 5.25") in a 386-based IBM PS/2 server (forget the model, possibly an 80) with 1Mbyte of RAM, running the Novell admin network for a company of 25-30 people, so yes, not your average home user. 20Mbyte drives were not uncommon in PCs at the time though. |
|
|
A BIT OFF TOPIC- but quite entertaining: |
|
|
12864 is the American ZIP code for Sabael, New York |
|
|
The word Sabael is a Hebrew word, meaning "the Burden", the Latin Onus. The name of the sixth step of the mystic ladder of Kadosh of the Ancient and Accepted Scottish Rite. Sometimes spelled Sabael. |
|
|
No negative or positive implications implied. (implied implications...hmmmm.) |
|
|
I could see this as a movie. |
|
|
Vernon, you have, including annos, written 18,420 words. All I can say is WOW and congratulations and bun. |
|
|
Oh my... this seriously needs a tl;dr on it. I have a case of don't know what the heck it is! |
|
|
Why not build one in Minecraft?{Linky}... You appear to have the patience to! |
|
|
tl;dr, BTW - Can I suggest adding Histogram Optimization? |
|
|
[Dub], I don't know anything about Minecraft. From the quick look I just took, though, it appears that creating a 12864 emulator would be a whole lot easier just using some ordinary computer language like C. |
|
|
<rolls eyes> No dedication, [Vernon]. No dedication... |
|
|
{Starts porting Linux to run on said Minecraft-simulated 12864 } |
|
| |