Why are 2 clocks the minimum clock cycle for the fastest ins

Weng Tianxiang

Hi,
Recently I read the Intel 486 instruction set and found that the minimum
execution time for an instruction is 2 clocks.

In ASIC and FPGA designs we usually need only 1 clock for a register
exchange instruction. What is the reason for the 2-clock minimum?

Weng
 
Weng Tianxiang wrote:

> Recently I read Intel 486 instruction set and found that the minimum
> clock cycle for an instruction is 2 clocks.
>
> Usually we design in ASIC and FPGA using 1 clock for a register
> exchange instruction. What is the reason for 2 clocks of minimum clock
> cycles?

I don't know the exact answer, but here are some ideas:

* The instruction has to be fetched (read from memory). With a cache
hit this is fast; otherwise it is very slow.
* The instruction has to be decoded and all operands have to be fed to
the ALU, which calculates the result. The result then has to be stored.
All of this takes time.

Yes, it is possible to do all of these things in one clock cycle (the
MSP430 microcontroller from TI does so), but memory access (or cache
access) is time-consuming, and high clock frequencies are hard to
achieve if you put everything into one clock cycle. (The MSP430
reaches only 8 MHz.)

That is why pipelines are used: the necessary steps are split up, and a
higher clock frequency can be reached because some tasks can be
overlapped. So my guess: the 486 has a 2-stage pipeline (or at least
2 stages are needed for the shortest instruction)?

Ralf
 
Hi Ralf,
I disagree with your opinion.
Why does the following instruction need 2 clocks?
move ax, bx; -- from register bx to ax.

A pipeline means there is a middle register that receives the data from
bx on the first clock and passes it to ax on the second clock, giving
this data flow:
bx --> Middle register --> ax.

In any situation such a middle register seems ridiculous to me.

I guess there is a data bus shared by all registers, and a wait clock is
inserted between two adjacent instructions to switch the bus direction.

Weng

 
Weng Tianxiang wrote:

> Why does the following instruction need 2 clocks?

As Ralf explained, this was a design choice made a long time ago.
Do you have a VHDL question?

-- Mike Treseler
 
It's fine to disagree, but why ask questions in the wrong newsgroup if
you don't like the answers?

In some architectures a "MOV Bx, Ax" is realized as an ADD Bx, 0, Ax.
This simplifies the datapath and the instruction set, because the MOV
actually exists only in the assembler and not in the machine code.
If you have a bit of knowledge about digital hardware (as I assume),
you know several reasons why an ADD needs more than one cycle.
Maybe you will find a better answer for this particular architecture in
comp.arch.486 (I didn't check whether that is the correct name of the
newsgroup).

bye Thomas
 
Weng Tianxiang wrote:

> I disagree with your opinion.
> Why does the following instruction need 2 clocks?
> move ax, bx; -- from register bx to ax.
>
> Pipeline means there is middle register to receive data from bx at
> first clock, then at second clock to ax, like this data flow:
> bx --> Middle register --> ax.

I did not talk about a middle register. I guess (!) that the following
pipeline is used:

   --------------     -----------
 _|instruction   |___|instruction|_
  |fetch & decode|   |execution  |
   --------------     -----------

_You_ know _now_ that this instruction only moves a value from one
register to another, but the CPU first has to read the instruction,
decode it, and then act on what is encoded in it. It is like reading
these lines: you don't know _now_ which words I will use in the next
sentence. Once you have read the posting, you do.

To come back on-topic: reading the instruction from memory (which in the
best case is shortcut to a read from the cache) takes some time. The
clock cycle would become too long (the frequency too slow) if more than
reading and decoding the instruction were done in it. Therefore the
execution (the move reg,reg) is done in the next cycle.

The pipeline has one advantage: the next instruction can be read while
the current one is executed. So with every clock you get a result, which
means doubled throughput! But note that the latency of each instruction
does not change: it is 2 clocks, the same as without pipelining.
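
To make the latency/throughput distinction concrete, here is a minimal Python sketch of the guessed 2-stage pipeline. It is a toy model, not a description of the real 486: the function name `run_pipeline`, the instruction strings, and the assumption of no stalls or cache misses are all illustrative.

```python
def run_pipeline(program):
    """Clock a hypothetical 2-stage pipeline (fetch+decode, then execute).

    Returns a list of (instruction, fetch_cycle, finish_cycle) tuples.
    Assumes no stalls, cache misses, or branches.
    """
    timeline = []
    decoded = None            # pipeline register between the two stages
    pending = list(program)
    cycle = 0
    while pending or decoded is not None:
        cycle += 1
        # Stage 2: execute what was fetched and decoded in the previous cycle.
        if decoded is not None:
            name, fetched_at = decoded
            timeline.append((name, fetched_at, cycle))
        # Stage 1: fetch and decode the next instruction in the same cycle.
        decoded = (pending.pop(0), cycle) if pending else None
    return timeline


program = ["mov ax, bx", "mov cx, dx", "mov bx, cx", "mov dx, ax"]
timeline = run_pipeline(program)
for name, fetched_at, done_at in timeline:
    print(f"{name:12s} fetched in cycle {fetched_at}, finished in cycle {done_at}")

# Each instruction occupies 2 cycles (latency), yet one finishes per cycle.
latencies = [done_at - fetched_at + 1 for _, fetched_at, done_at in timeline]
total_cycles = timeline[-1][2]
print("latencies:", latencies, "total cycles:", total_cycles)
```

In this model four instructions finish after 5 cycles instead of the 8 cycles that four non-overlapped 2-clock instructions would take, while each individual instruction still needs 2 clocks.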

Pipelining increases the throughput and shortens the necessary clock
cycle length, because each step is simpler. But there are limits, as
everybody can see with the very long pipeline of the Pentium 4 NetBurst
architecture: if there is a conditional jump and the wrong instructions
are in the pipeline, the pipeline has to be flushed (the half-computed
results are thrown away) and loaded again. This is very time-consuming
if the pipeline is long.
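
A rough way to see why the flush hurts more as the pipeline gets longer is the back-of-the-envelope Python sketch below. Everything in it is an assumption for illustration: the function `average_cpi`, the simplification that a wrong jump costs `depth - 1` refill cycles, and the branch and misprediction rates are made-up numbers, not measurements of any real CPU.

```python
def average_cpi(depth, branch_fraction, mispredict_rate):
    """Average clocks per instruction for an idealized pipeline.

    Assumes one instruction completes per clock, except that each wrongly
    fetched jump path costs (depth - 1) extra clocks to refill the pipeline.
    """
    flush_penalty = depth - 1
    return 1.0 + branch_fraction * mispredict_rate * flush_penalty


# Hypothetical workload: 20% jumps, 10% of them go the unexpected way.
for depth in (2, 5, 20):
    cpi = average_cpi(depth, branch_fraction=0.2, mispredict_rate=0.1)
    print(f"pipeline depth {depth:2d}: average CPI = {cpi:.2f}")
```

With these invented rates the 2-stage pipeline loses about 2% to flushes, while the 20-stage pipeline loses about 38%, which is the trend described above.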


I am explaining this in detail because pipelining is a general technique
in hardware design. The 486 you mention is only an example.

Ralf
 
"Weng Tianxiang" <wtxwtx@gmail.com> wrote in message
news:1132010532.780874.251900@g14g2000cwa.googlegroups.com...

> Hi Ralf,
> I disagree with your opinion.
> Why does the following instruction need 2 clocks?
> move ax, bx; -- from register bx to ax.

I doubt many people know exactly how the 486 works internally at the
micro-instruction level, probably nobody outside the Intel (and AMD!)
design teams. All we can do is return to basics. Implementing a simple
register-to-register MOV instruction in only 1 clock does not seem
straightforward. Consider this: what steps does our 486 need to take to
execute this instruction?

1- Read the instruction code from the memory
2- Decode the instruction
3- Write the content of BX into AX

So in this case at least three phases are needed. I don't think using 2
clock cycles for these 3 phases is ridiculous at all. I can imagine that
the 486 does the first two phases in the first clock (which is quite a
big challenge by itself) and the last phase in the second clock.
Pipelining does not reduce the number of phases (in fact, sometimes you
need to increase the number of phases for better pipelining!). It gives
you the opportunity to start the next instruction before the current one
has finished. Internally, while the CPU is executing instruction n, it
can decode instruction n+1 and read instruction n+2, all at the same
time.
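
The overlap of the three phases can be tabulated with a small Python sketch. This assumes an ideal pipeline with one phase per clock and no stalls, which is not necessarily how the 486 groups its phases; `STAGES` and `occupancy` are names invented for this illustration.

```python
STAGES = ["fetch", "decode", "execute"]


def occupancy(n_instructions, stages=STAGES):
    """Return {cycle: {stage: instruction_index}} for an ideal pipeline.

    Instruction k enters stage number s (counting from 0) in cycle k + s + 1.
    """
    table = {}
    for instr in range(n_instructions):
        for offset, stage in enumerate(stages):
            cycle = instr + offset + 1
            table.setdefault(cycle, {})[stage] = instr
    return table


table = occupancy(4)
for cycle in sorted(table):
    # Show which instruction sits in which stage; "-" marks an idle stage.
    cells = [f"{stage}: {table[cycle].get(stage, '-')}" for stage in STAGES]
    print(f"cycle {cycle}: " + ", ".join(cells))
```

In cycle 3 of this table instruction 0 is executing while instruction 1 is decoded and instruction 2 is fetched, matching the n, n+1, n+2 description.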

Regards
Arash
 
Ralf Hildebrandt wrote:

> Pipelining enlarges the throughput and shortens the necessary clock cycle
> length, because each step is simpler. But there are limits as everybody
> can see with the ultra long pipeline of the Pentium 4 Netburst
> architecture: If there is a conditional jump, and the wrong instruction
> is in the pipeline, the pipeline has to be flushed (the half-computed
> results have to be wasted) and loaded again. This is very
> time-consuming, if the pipeline is long.

That's why MIPS processors always execute the instruction following a
jump: they do not flush their pipeline.

Nicolas
 
Hi Ralf,

>    --------------     -----------
>  _|instruction   |___|instruction|_
>   |fetch & decode|   |execution  |
>    --------------     -----------

Instruction fetch and decode can be implemented separately in two
procedures, and instruction execution always starts after the next
instruction is available.

Then the instruction
move ax, bx;

can be executed in 1 clock.

Is anything wrong with this?

Weng


 
Weng Tianxiang wrote:

> Instruction fetch and decode can be separately implemented in two
> procedures

There are many ways to design it; using procedures is only one. But this
has nothing to do with the decision to use 2 clocks for this
instruction.

> , and instruction execution always starts after next
> instruction is available.
>
> Then instruction
> move ax, bx;
>
> can be executed in 1 clock.

It can, yes. The MSP430 does it in one clock, without any pipeline. But
the MSP430 reaches only 8 MHz (and only now are there faster versions
reaching 16 MHz). The 486 reaches 40 MHz, in a much older technology
(ignoring the DX2 and DX4 for the moment).

So I guess (!) that using a pipeline and splitting this instruction into
two steps is what enabled the clock frequency of 40 MHz.
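
The trade-off can be put into numbers with a short Python sketch. The delays below are invented purely for illustration (they are not 486 or MSP430 data), register setup overhead is ignored, and `max_clock_mhz` is a hypothetical helper: the point is only that the slowest stage sets the clock period.

```python
def max_clock_mhz(stage_delays_ns):
    """Highest clock frequency (MHz) if the slowest stage sets the period."""
    return 1000.0 / max(stage_delays_ns)


# Hypothetical combinational delays, not taken from any datasheet.
fetch_decode_ns = 60.0
execute_ns = 40.0

# Everything in one clock: the period must cover both steps in sequence.
single_cycle_mhz = max_clock_mhz([fetch_decode_ns + execute_ns])
# Two pipeline stages: the period only has to cover the slower step.
two_stage_mhz = max_clock_mhz([fetch_decode_ns, execute_ns])

print(f"single-cycle design: {single_cycle_mhz:.2f} MHz")
print(f"two-stage pipeline:  {two_stage_mhz:.2f} MHz")
```

Under these assumed delays the single-cycle design tops out at 10 MHz, while the two-stage split allows roughly 16.7 MHz at the price of a 2-clock latency per instruction.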

There are often many design alternatives, but each of them has an impact
on other things, such as the clock frequency. A decision always has to
be made. Sometimes there are good reasons for a decision, sometimes it
is politics, and sometimes the designers simply did not know better at
the time of the design.

Ralf
 
Nicolas Matringe wrote:

> That's why MIPS processors always execute the instruction following a
> jump: they do not flush their pipeline.

For an unconditional jump this seems clear to me, but what about
conditional jumps? There are two alternatives: the instruction
following the jump (jump not taken) or the instruction at the jump
target (jump taken). How is this handled in this CPU?

Ralf
 
> For an unconditional jump this seems to be clear to me, but what about
> conditional jumps? There are two alternatives (the next instruction
> following the jump instruction (no jump) or the instruction, where the
> jump may go to (jump)). How is it realized at this CPU?

You insert a NOP after the jump (sometimes done automatically by your
assembler).
I've seen some code that makes use of this execution after a jump by
reordering lines.

a=b+4
b=b-1
jnz
nop
further code

could be written as
b=b-1
jnz
a=b+5
further code
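
The reordering above can be checked with a small Python sketch that models both instruction orders. It assumes, as in the listing, that `jnz` tests the result of `b=b-1` and that the instruction in the delay slot runs whether or not the jump is taken; the function names are made up for this illustration.

```python
def with_nop(b):
    """Original order: a=b+4, b=b-1, jnz, nop in the delay slot."""
    a = b + 4
    b = b - 1
    taken = b != 0        # jnz decides from the result of b=b-1
    # delay slot: nop, nothing happens here
    return a, b, taken


def with_filled_slot(b):
    """Reordered: b=b-1, jnz, then a=b+5 in the delay slot."""
    b = b - 1
    taken = b != 0        # jnz decides before the delay slot executes
    a = b + 5             # delay slot: runs whether or not the jump is taken
    return a, b, taken


# Both orderings leave the same values in a and b and jump identically.
same = all(with_nop(b) == with_filled_slot(b) for b in range(-5, 6))
print("orderings equivalent:", same)
```

The constant changes from 4 to 5 because `a` is now computed after `b` has been decremented.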
 
usenet_10@stanka-web.de wrote:

> You insert a NOP after a jump (sometimes automatically done by your
> Assembler).
> I've seen some code that uses this execution after an jump by line
> reorder.
>
> a=b+4
> b=b-1
> jnz
> nop
> further code
>
> could be written as
> b=b-1
> jnz
> a=b+5
> further code

OK, thanks. This means you either waste the instruction (like a pipeline
flush) or take advantage of this feature by moving another instruction
behind the jump, if possible.

Ralf
 
The instruction following the jump is always executed, whether the jump
is actually taken or not, because the instruction has already been
fetched into the pipeline.
That's also why it is not a very good idea to write programs in
assembler for MIPS processors: compilers usually handle these
peculiarities much better than humans.

Nicolas
 
An FPGA or ASIC is customized hardware doing exactly what you want. But
a microprocessor is stupid and needs to read a program that tells it
what to do. And that takes extra overhead (cycles).

I would recommend a course in basic microprocessor architecture.

Regards,
Mats
 
Hi,
I had a book about basic microprocessor architecture, with a design
simpler than an 8051 implementation. But it didn't satisfy my curiosity
about the internal structure of the Intel chip and why it needs 2 clocks
for the fastest instructions.

What I am interested in is not a concept, like a pipeline, but its real
VHDL structure.

Intel chips starting with the 80486 have at least 4 registers: AX,
BX, CX, DX (I limit the discussion to the 32-bit environment). And all
data registers can be exchanged or moved from one to another.

To meet these requirements, the best structure for connecting all the
data may be a data bus, with all registers coupled to the bus along with
the read data from the cache or memory system.

So there must be a BUS SWITCH WAIT CLOCK between 2 consecutive
instructions to switch the bus from one data source to another.

That is the real reason for 2 clocks for even the fastest data
instructions:
1. bus switch clock;
2. data transferred on the bus.

A pipeline in an FPGA means AX-->C-->BX. That is not the case for the
Intel CPU.

Weng
 
Weng Tianxiang wrote:

> For Intel chip starting with 80486, there are at least 4 registers: AX,
> BX, CX, DX (I limit our discussion in 32-bit environment). And all data
> registers can be exchanged, moved from one to another.
>
> To meet the above requirements, the best data structure to connect all
> data may be a data bus with all registers coupled to the bus with read
> data from cache or memory system.

In my opinion this is not the best structure.

Wouldn't it be better to use the ALU for the move, too? All registers
have to be connected to the ALU anyway because of the other
instructions. Several muxes are needed for this, but everything you need
is already there.

So a move through the ALU is no different from other instructions,
except that the ALU ignores the destination operand and connects the
source operand straight to the ALU result output, from where it can be
fed to the destination register.

The major advantage is that you don't need extra muxes connecting the
registers (a crossbar) for the move operation. This simple instruction
only has to be implemented in the ALU, like the other instructions.
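
The idea of routing the move through the ALU can be sketched behaviourally in Python. This is only a model of the suggestion, not of the real 486 datapath: the operation names, the 32-bit mask, and the helpers `alu` and `execute` are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF


def alu(op, dst_value, src_value):
    """Toy ALU: every instruction, including the move, produces one result."""
    if op == "mov":
        # Pass-through: ignore the destination operand, forward the source.
        return src_value
    if op == "add":
        return (dst_value + src_value) & MASK32
    if op == "sub":
        return (dst_value - src_value) & MASK32
    raise ValueError(f"unsupported operation: {op}")


def execute(regs, op, dst, src):
    """Shared datapath: read both operands, run the ALU, write back to dst."""
    regs[dst] = alu(op, regs[dst], regs[src])


regs = {"ax": 1, "bx": 7, "cx": 0, "dx": 0}
execute(regs, "mov", "ax", "bx")   # same path as any other instruction
execute(regs, "add", "cx", "bx")
print(regs)
```

Because `execute` is identical for every operation, the move needs no register-to-register connection of its own; it is just one more case inside the ALU.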

OK, a crossbar between the registers may be faster and would allow
higher clock frequencies, but the other instructions limit the clock
frequency too, so speeding up the move alone is useless.

Feeding the operands to the ALU, computing the ALU operation, and
feeding the result back to the destination register takes a little
time. This time plus the time for instruction fetch + decode was too
long for the desired clock frequency of this CPU, and therefore I guess
that the designers chose a 2-stage pipeline to solve this problem.

> Pipeline in FPGA means AX-->C-->BX. It is not the case for Intel CPU.

A pipeline may be a chain of registers, but that is the simplest case.
A CPU pipeline is something in which several tasks are computed step by
step. It is more like a state machine that controls many tasks in every
state. When the state machine steps to its next state, many tasks are
computed again. And because one group of tasks (instruction fetch +
decode) is independent of another group (instruction execution), they
can be computed in parallel, which results in the pipeline.

Again: I am only guessing how it is implemented in the 486. I just want
to give some ideas about CPU data paths and their control.

Ralf
 
