Branch prediction using counters.

In article <ddbhf9$g1l$4@blue.rahul.net>,
Ken Smith <kensmith@green.rahul.net> wrote:
|
|> Yes, there are lots of other examples, but I think the Booth's case is the
|> simplest real case to see the problem with.

It does, however, give the impression that the problem applies only
to very esoteric codes. It doesn't.

Are you suggesting that Booth's divide is "esoteric"? It seems to be a
very common bit of code.
The mind boggles! I have never seen it used, but I can believe that
it is. The most usual software divide is Newton-Raphson. Whatever
the case, only a few systems lack hardware division for the standard
types, very few applications use it for non-standard types, and the
proportion of time spent in such code is minuscule!


Regards,
Nick Maclaren.
 
Ken Smith wrote:
In article <ddafb9$ir7$1@gemini.csx.cam.ac.uk>,
Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
What is needed is a compiler-generated instruction that sets up a
datum to be used for branch prediction, which is then quoted in
the branch, and used by the hardware.

This could sort of work, but it adds an instruction to every branch.
The added instruction time will eat away at any advantage gained.
You don't have to do it for all branches, just the ones where doing so
is a good idea. (That's one argument against "quoting" an explicit
prediction datum in the branch instruction.)

The cost of a branch mispredict varies with implementation, but 10
cycles is not unknown and more than 5 cycles is reasonably common in
the high-performance world. Since each cycle can be 3-5 instructions,
using a couple of instructions to avoid a mispredict is usually a good
trade (as long as they don't stall).

Remember that some of the RISC and DSP processors duck the problem by
letting some small number of instructions after the branch happen before
the branch takes effect. This only adds instructions to the code in cases
where nothing can be done before the branch happens. I think this is a
much better way to go if you want a lot of speed per transistor.
A small number of instructions isn't enough and it's usually very hard
to find more than 1-2 (on average). You need to find 20-50 if you're
actually trying to go fast. (Yes, you can pick an implementation where
1-2 is enough, but those implementations are slower in general.)

The instruction history
alone is useless for such codes, but the object address is
(obviously) excellent.

No, the object's address is not obviously "excellent", at least to me.
Please explain what you mean.
The object at a given address tends to have the same type for "a
while".

-andy
 
"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
news:ddafb9$ir7$1@gemini.csx.cam.ac.uk...
In article <ddadvn$g56$3@blue.rahul.net>,
kensmith@green.rahul.net (Ken Smith) writes:
|
|> Yes, there are lots of other examples, but I think the Booth's case is
|> the simplest real case to see the problem with.

It does, however, give the impression that the problem applies only
to very esoteric codes. It doesn't.

|> BTW: I disagree with the suggestion that jumps in OO programs are by
|> their nature harder to predict. Virtual method dispatching is a hard
|> type of branch to predict, but in non-OO code, the virtual method call
|> would be replaced with a long switch statement. In good compilers,
|> long switches are implemented with jump tables. "goto table" is very
|> hard to predict.

Neither is hard to predict, if you do it right. Few, if any, modern
systems do.

What is needed is a compiler-generated instruction that sets up a
datum to be used for branch prediction, which is then quoted in
the branch, and used by the hardware. The instruction history
alone is useless for such codes, but the object address is
(obviously) excellent. This has the advantage that information
from a previous branch in the same function could be used.

For example, consider the following instructions:

Set predictor. It takes up to 2 registers, and sets one of a
few (4-16) predictors based on their contents (in an unspecified
fashion).

Predict branch. It takes a predictor and also specifies a
preferred direction, a resulting direction or none. It must
immediately precede a branch.

This would allow a compiler to take the following:

IF a > b THEN ... FI;
. . .
IF b < a THEN ... FI;

and issue:

Set_predictor 1 a b
. . .
Predict_branch 1 +1 [ indicating a resulting direction ]
. . .
Predict_branch 1 -2 [ indicating a preferred direction ]

I have little idea how effective this would be, but my gut feeling
is that it would be better than current approaches for such codes.
As far as I know, I have rendered this unpatentable by posting :)

Dude, if this stuff is encoded into the jump instruction, then it falls under
my description of inserting counters into jump instructions, and therefore I
should still be able to patent it and not you :p

Bye,
Skybuck :)
 
"Paul Hovnanian P.E." <Paul@Hovnanian.com> wrote in message
news:42F97744.959EEC35@Hovnanian.com...
This sort of thing is done at higher levels in various search
algorithms. Far more advanced techniques than just counters are used.

Building this sort of thing into processor firmware or general purpose
library routines might impose an inordinate burden on execution time if
Bullshit... the CPU has extra gates and logic and (internal) memory to
handle it; the only thing required is more bandwidth for the extra counter
data :)

it was employed for every branch. For cases where the branch execution
probabilities don't change much, just profiling the code and adjusting
the design will get you just as much gain. In the few cases where a
search can be optimized, the designer should be able to apply the
correct heuristics based on knowledge of the problem domain.
Again the programmer makes the decision. If he is wrong or the usage
pattern changes, these decisions are no longer valid :D

Have a nice day =D

Bye,
Skybuck.
 
In article <ddd6ou$6q0$1@news3.zwoll1.ov.home.nl>,
"Skybuck Flying" <nospam@hotmail.com> writes:
|>
|> Dude if this stuff is encoded into the jump instruction then it falls under
|> my description of inserting counters into jump instructions and therefore I
|> should still be able to patent it and not you :p

Look, Bubba, no counters! Try rereading what I posted.


Regards,
Nick Maclaren.
 
Andrew Reilly wrote:

Remember that some of the RISC and DSP processors duck the problem by
letting some small number of instructions after the branch happen before
the branch takes effect. This only adds instructions to the code in cases
where nothing can be done before the branch happens. I think this is a
much better way to go if you want a lot of speed per transistor.

Up to 40 instructions, in the case of the TI C6000 series: 5 instruction
issue slots in the branch shadow, up to eight instructions per slot.
Makes loop pre-amble and termination of otherwise single-cycle loops
amusing, but do-able.
Isn't it nearly as effective to put these instructions in front of the
branch? This means that if later incarnations need a shorter slot,
everything will stay compatible.

For extra points auto-insert NOPs if the loop is too small.

This will make things more complex but perhaps not slower.


Thomas
 
In comp.arch Skybuck Flying <nospam@hotmail.com> wrote:
"Paul Hovnanian P.E." <Paul@Hovnanian.com> wrote in message
news:42F97744.959EEC35@Hovnanian.com...
This sort of thing is done at higher levels in various search
algorithms. Far more advanced techniques than just counters are used.

Building this sort of thing into processor firmware or general purpose
library routines might impose an inordinate burden on execution time if

Bullshit... the CPU has extra gates and logic and (internal) memory to
handle it; the only thing required is more bandwidth for the extra counter
data :)
Except that these may be put to a much better use.

Bye,
Skybuck.
--
Sander

+++ Out of cheese error +++
 
"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
news:ddd7j5$l6e$1@gemini.csx.cam.ac.uk...
In article <ddd6ou$6q0$1@news3.zwoll1.ov.home.nl>,
"Skybuck Flying" <nospam@hotmail.com> writes:
|
|> Dude if this stuff is encoded into the jump instruction then it falls
|> under my description of inserting counters into jump instructions and
|> therefore I should still be able to patent it and not you :p

Look, Bubba, no counters! Try rereading what I posted.
Ok dinky, then it's aaaalll gooooddddd :)

Regards,
Nick Maclaren.
 
In article <pan.2005.08.10.01.52.31.292001@areilly.bpc-users.org>,
Andrew Reilly <andrew-newspost@areilly.bpc-users.org> wrote:
On Wed, 10 Aug 2005 00:24:09 +0000, Ken Smith wrote:
Remember that some of the RISC and DSP processors duck the problem by
letting some small number of instructions after the branch happen before
the branch takes effect. This only adds instructions to the code in cases
where nothing can be done before the branch happens. I think this is a
much better way to go if you want a lot of speed per transistor.

Up to 40 instructions, in the case of the TI C6000 series: 5 instruction
issue slots in the branch shadow, up to eight instructions per slot.
Makes loop pre-amble and termination of otherwise single-cycle loops
amusing, but do-able. There are various multi-cycle NOP instructions to
fiddle the other cases without taking up too much code space.
The old AT&T one had only one extra instruction per branch and also had
conditional math operations. The result was you could do all sorts of
useful and very fast things.

--
--
kensmith@rahul.net forging knowledge
 
In article <ddccfs$lgp$1@gemini.csx.cam.ac.uk>,
Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:

[... booth's divide ..]
The mind boggles! I have never seen it used, but I can believe that
it is. The most usual software divide is Newton-Raphson.
Are you referring to the usual "find 1/X and then multiply by that" method?

Booth's divide is more common in microcontroller land. If you have a micro
that doesn't natively do 32 bit divides and an integer math library,
chances are you are using a Booth's divide.

Whatever
the case, only a few systems lack hardware division for the standard
types, very few applications use it for non-standard types, and the
proportion of time spent in such code is minuscule!
Actually this isn't really true. There are a huge number of processors
out there that don't do the divides in hardware for at least some of the
types that are commonly used on them. These are not desktop PCs. They
are systems that contain a processor but are not general purpose
computers.



--
--
kensmith@rahul.net forging knowledge
 
In article <1123684376.739768.282190@g14g2000cwa.googlegroups.com>,
Andy Freeman <anamax@earthlink.net> wrote:
[...]
The instruction history
alone is useless for such codes, but the object address is
(obviously) excellent.

No, the object's address is not obviously "excellent", at least to me.
Please explain what you mean.

The object at a given address tends to have the same type for "a
while".
Yes, and once you have XXXXYYYYed that variable, you aren't likely to do
it again. You will XXXXYYYY most of the variables of that same type,
though. This, I say, makes the address of the object a poor input to the
prediction and the VMT a good one.

--
--
kensmith@rahul.net forging knowledge
 
In article <ddfn06$7oh$2@blue.rahul.net>,
kensmith@green.rahul.net (Ken Smith) writes:
|> In article <ddccfs$lgp$1@gemini.csx.cam.ac.uk>,
|> Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
|>
|> [... booth's divide ..]
|> >The mind boggles! I have never seen it used, but I can believe that
|> >it is. The most usual software divide is Newton-Raphson.
|>
|> Are you referring to the usual "find 1/X and then multiply by that" method?

No. That is a component of the method, but you need to do a bit
more to get it quite right.

|> Booth's divide is more common in microcontroller land. If you have a micro
|> that doesn't natively do 32 bit divides and an integer math library,
|> chances are you are using a Booth's divide.

Boggle. Well, if performance isn't a major issue, I suppose that it
doesn't matter what you use.

|> > Whatever
|> >the case, only a few systems lack hardware division for the standard
|> >types, very few applications use it for non-standard types, and the
|> >proportion of time spent in such code is minuscule!
|>
|> Actually this isn't really true. There are a huge number of processors
|> out there that don't do the divides in hardware for at least some of the
|> types that are commonly used on them. These are not desktop PCs. They
|> are systems that contain a processor but are not general purpose
|> computers.

The context was where branch prediction of the division operation
is critical for performance, as the above paragraph makes clear.
If performance of the division IS a major issue, can you explain
why people use Booth's algorithm IN SOFTWARE? As I said, the mind
boggles!

[ All right, I do know of one case. If you have NEITHER hardware
division NOR reasonable hardware multiplication, it is a real pain
to code Newton-Raphson. But anyone who runs codes where division
is a bottleneck on such a system is off his tiny mind. ]


Regards,
Nick Maclaren.
 
Ken Smith wrote:
In article <1123684376.739768.282190@g14g2000cwa.googlegroups.com>,
Andy Freeman <anamax@earthlink.net> wrote:
The object at a given address tends to have the same type for "a
while".

Yes, and once you have XXXXYYYYed that variable, you aren't likely to do
it again. You will XXXXYYYY most of the variables of that same type
though.
Guess why I wrote "object at a given address" and not "variable".

Lots of variables may refer to the same object - the object's address
captures this. The same variable may refer to different objects at
different times; again, object address differences tell you that type
may be different.

-andy
 
In article <ddfnlk$6pq$1@gemini.csx.cam.ac.uk>,
Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
In article <ddfn06$7oh$2@blue.rahul.net>,
kensmith@green.rahul.net (Ken Smith) writes:
|> In article <ddccfs$lgp$1@gemini.csx.cam.ac.uk>,
|> Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
|
|> [... booth's divide ..]
|> >The mind boggles! I have never seen it used, but I can believe that
|> >it is. The most usual software divide is Newton-Raphson.
|
|> Are you referring to the usual "find 1/X and then multiply by that" method?

No. That is a component of the method, but you need to do a bit
more to get it quite right.
So you say "no" and then "yes".

Boggle. Well, if performance isn't a major issue, I suppose that it
doesn't matter what you use.
Performance can be a major issue along with other limitations like low
power or small size.

|> Actually this isn't really true. There are a huge number of processors
|> out there that don't do the divides in hardware for at least some of the
|> types that are commonly used on them. These are not desk top PCs. They
|> are systems that contain a processor but are not general purpose
|> computers.

The context was where branch prediction of the division operation
is critical for performance,
No, that is not the context at all. The context was showing a simple
example of code where earlier branches are a very bad predictor of later
branches. The Booth's method only appeared in the discussion because it
is so simple and easy to understand. Other examples were cited.


--
--
kensmith@rahul.net forging knowledge
 
In article <1123779133.510860.321150@g44g2000cwa.googlegroups.com>,
Andy Freeman <anamax@earthlink.net> wrote:
Ken Smith wrote:
In article <1123684376.739768.282190@g14g2000cwa.googlegroups.com>,
Andy Freeman <anamax@earthlink.net> wrote:
The object at a given address tends to have the same type for "a
while".

Yes, and once you have XXXXYYYYed that variable, you aren't likely to do
it again. You will XXXXYYYY most of the variables of that same type
though.

Guess why I wrote "object at a given address" and not "variable".
You wrote "object" because you couldn't spell "variable". Did I get it
right? :)

[...]
Lots of variables may refer to the same object - the object's address
captures this. The same variable may refer to different objects at
different times; again, object address differences tell you that type
may be different.
That has no effect on the argument. Once you have XXXXYYYYed a given
object you are not likely to have to XXXXYYYY it again. This is why the
VMT is the better way to predict.

--
--
kensmith@rahul.net forging knowledge
 
Ken Smith wrote:
In article <1123779133.510860.321150@g44g2000cwa.googlegroups.com>,
Andy Freeman <anamax@earthlink.net> wrote:
Guess why I wrote "object at a given address" and not "variable".

You wrote "object" because you couldn't spell "variable". Did I get it
right? :)
Snark and error in the same sentence.... (Yes, objects aren't
variables.)

Lots of variables may refer to the same object - the object's address
captures this. The same variable may refer to different objects at
different times; again, object address differences tell you that type
may be different.

That has no effect on the argument. Once you have XXXXYYYYed a given
object you are not likely to have to XXXXYYYY it again. This is why the
VMT is the better way to predict.
Actually, it does, as my explanation showed. XXXXYYYYing an object's
address does the right thing - it even handles other references to the
same object. XXXXYYYYing a variable that refers to an object doesn't
work nearly as well because that variable may be used to refer to a
different object at some other time and other references don't get the
benefit of any caching you might do.

-andy
 
In article <1124115964.383868.151910@g14g2000cwa.googlegroups.com>,
Andy Freeman <anamax@earthlink.net> wrote:
Ken Smith wrote:

Lots of variables may refer to the same object - the object's address
[...]
That has no effect on the argument. Once you have XXXXYYYYed a given
object you are not likely to have to XXXXYYYY it again. This is why the
VMT is the better way to predict.

Actually, it does, as my explanation showed. XXXXYYYYing an object's
address does the right thing - it even handles other references to the
same object.
You are assuming that you need to XXXXYYYY the object more than once.
This is not a good thing to assume. You are very likely to XXXXYYYY many
objects of the same type but much less likely to XXXXYYYY the same
variable more than once.

A good example is some sort of widget that is part of a GUI. The widget
gets constructed, displayed, removed from the display and destroyed. Each
of these is done only once to that object. Others of that same type would
get the same things done to it so optimizing based on the VMT address
would work.


--
--
kensmith@rahul.net forging knowledge
 
