Branch prediction using counters.

In article <dfydncj3QpTKOmrfRVn-oA@comcast.com>,
Bob Monsen <rcsurname@comcast.net> wrote:
[...]
GCC has a profiling mode that will allow one to capture this
information, for optimizing MIPS output (where it is quite useful, since
they have alternate opcodes depending on which branch is more likely).
As I recall (I've never used it) you run your code for a while, and then
get a dump of the counters, which you then feed back into GCC.
Profiling can be useful on nearly any modern processor if you can't see by
inspection which way the conditionals land. If you are hand tuning some
ASM code for the maximum speed, knowing which jumps are likely helps to
minimize the out of line jumps.

BTW: You can pulse a port bit high if you jump and another if you don't
and use a frequency counter to find the odds. This works well on
microcontrollers.

--
kensmith@rahul.net forging knowledge
 
Bob Monsen wrote:
(snip)

When I am coding assembler, I find it's almost always obvious which way
is more likely. Unfortunately, compilers don't usually understand what
you are trying to do all that well, and thus often get it wrong. The
profile to feedback scheme might be useful because the compiler can then
do the proper optimization by itself. However, any time I care that
much, I profile the code and rewrite any expensive chunks in assembler.
It isn't always so obvious in assembler.

The ESA/390 instructions BXH and BXLE, branch on index high, and
branch on index low or equal, have opposite branch prediction logic.

There was a post once from someone who had coded a loop with both,
and found very different execution times. With the suggestion that
it might be branch prediction, the two were coded with the opposite
branch logic, and the timing difference was reversed, as would be
expected from branch prediction.

It is, though, not so obvious to an assembly programmer which
way the branch prediction logic is set up.

-- glen
 
"Ken Smith" <kensmith@green.rahul.net> wrote in message
news:dd909m$ru3$2@blue.rahul.net...
In article <dd7sjv$1vs$1@news1.zwoll1.ov.home.nl>,
Skybuck Flying <nospam@hotmail.com> wrote:
[... me ..]
No, it is not common. For that matter, I've never heard of the idea
being used. There are problems with the counter. It should not be part
of the instruction, because that would mean that code space has to be
written into as the program runs. This will slow things down.

If it slows down the system then that would mean the bottleneck is in the
bandwidth/speed of the bus/whatever used to read/write instructions
from/to main memory.

It may not be the bus itself that is the limiting factor. The memory
device that holds the code may have a longer write time than read time.
This may be a physical limitation of the storage technology. To give an
extreme example, if the code is stored on a drum memory, this will reduce
the speed to one instruction per rotation.

Also, a counter is by its nature a slow operation because the highest bit
depends on all the bits below it. There are ways around this.


How much memory bandwidth/speed is available to transfer instructions
from/to main memory, where program code is located?

Never enough.

;) And how much of that bandwidth is generally used for an application,
or maybe even games? ;)

All of it.

In other words... is bandwidth from main memory to CPU and back truly
the limiting factor? ;)

No, it is but one limiting factor among many.

BTW: If the code in question is a Booth's divide routine, the branch
prediction you suggest may actually slow the operation down instead of
speeding it up.
For god's sake, what does Booth's algorithm have to do with jump
instructions?

http://jingwei.eng.hmc.edu/~rwang/e85/lectures/arithmetic_html/node10.html

See ya.
 
glen herrmannsfeldt wrote:
Bob Monsen wrote:
(snip)

When I am coding assembler, I find it's almost always obvious which
way is more likely. Unfortunately, compilers don't usually understand
what you are trying to do all that well, and thus often get it wrong.
The profile to feedback scheme might be useful because the compiler
can then do the proper optimization by itself. However, any time I
care that much, I profile the code and rewrite any expensive chunks in
assembler.


It isn't always so obvious in assembler.

The ESA/390 instructions BXH and BXLE, branch on index high, and
branch on index low or equal, have opposite branch prediction logic.

There was a post once from someone who had coded a loop with both,
and found very different execution times. With the suggestion that
it might be branch prediction, the two were coded with the opposite
branch logic, and the timing difference was reversed, as would be
expected from branch prediction.

It is, though, not so obvious to an assembly programmer which
way the branch prediction logic is set up.

-- glen
Actually, what I meant is that it is usually obvious to an assembly
coder which path through a branch is usually taken. Note that the MIPS
architecture has different opcodes for both cases; for example, it has a
BLTZAL (Branch on Less Than Zero And Link) and BLTZALL (Branch On Less
Than Zero And Link Likely), which do exactly the same thing, but the
second one is presumably somewhat faster if the branch is more likely.

I guess they took your example to heart when designing the instruction set.

--
Regards,
Bob Monsen

If a little knowledge is dangerous, where is the man who has
so much as to be out of danger?
Thomas Henry Huxley, 1877
 
In article <42f7e3eb$1@news.meer.net>, dankoren@yahoo.com says...
"Ben Bradley" <ben_nospam_bradley@frontiernet.net> wrote in message
news:jicff1hrkkkucio1ljkdfugtrm4sfkuvld@4ax.com...
In comp.arch,sci.electronics.design, On Mon, 08 Aug 2005 16:38:23 GMT,
forbin@dev.nul (Colonel Forbin) wrote:

In article <dd756m$38b$1@news3.zwoll1.ov.home.nl>,
Skybuck Flying <nospam@hotmail.com> wrote:

Some time ago I already had this idea of inserting counters into if
statements to predict which branch is most likely to be executed. At the
lower levels this would mean inserting counters into jump instructions
and recording which branch is executed the most, by incrementing or
maybe even decrementing counters every time the branch gets executed.
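The proposal can be sketched as a per-branch counter that is bumped each time the branch resolves and consulted to guess the next outcome. This is a toy simulation with the counters kept in a side table rather than in the instruction stream; all names are illustrative and map onto no real ISA:

```python
# Sketch of the proposed scheme: one signed counter per branch site.
# Non-negative counter -> predict taken; negative -> predict not taken.

from collections import defaultdict

counters = defaultdict(int)  # branch site -> running taken/not-taken tally

def predict(site):
    return counters[site] >= 0  # guess "taken" while the tally is non-negative

def resolve(site, taken):
    counters[site] += 1 if taken else -1  # record the actual outcome

# A branch that is taken 9 times out of 10 quickly becomes predictable.
hits = 0
outcomes = [True] * 9 + [False]
for i in range(100):
    taken = outcomes[i % 10]
    if predict("loop_exit") == taken:
        hits += 1
    resolve("loop_exit", taken)
print(hits)  # 90: only the one not-taken outcome per cycle is missed
```

The biased branch settles immediately; the cost, as discussed below, is the overhead of maintaining the counters.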

Actually this is sort of analogous to runtime profiling feedback to
compilers, which is used on a number of systems.

The problem is that it is often impossible to predict an event until
the outcome is observed (think quantum mechanics).

If we knew the outcome of each branch, we could simply eliminate the
branch in most cases and still have a reasonable approximation.

Inserting counters into production code typically comes at a great
price in terms of overhead.

We typically do this in development in order to optimize the most
costly sections of code for the most common outcomes, but trying
to put some sort of low level adaptive optimization metrics into
production code would consume more in overhead than the cost of
doing the task to begin with.

I recall reading of a processor that does pipelining through
branches (I don't remember which, but it might be a TI DSP) by using
several execution units. One executes the instruction(s) where the
branch is taken, the other executes the code where the branch is not
taken. When the branch is decided, the results from the corresponding
execution unit are taken and execution continues from there, and the
other results are 'thrown away' unused. This method takes a lot more
silicon space for the processor, but results in a speed increase that
apparently can't be done otherwise.



This is usually referred to as "speculative execution".

Yes, but speculative execution doesn't usually include following both
paths. Going both ways after branches gets ugly on the second and
subsequent branches. ;-). The PPC-970 can have ~200 instructions in
flight (half in the prefetch/decode stages and half between issue and
completion). To keep the pipe full, it will speculatively fetch/execute
up to sixteen branches deep, but will speculatively execute only one
path of the 65K possible. There are optimization rules for branches
(branches backwards are usually taken, forwards usually not, etc.) and
the programmer can drop hints to override these defaults.
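The static rules Keith mentions can be written down directly: predict backward branches taken (they are usually loop back-edges) and forward branches not taken. A minimal sketch, with invented example addresses:

```python
# Static "backward taken, forward not taken" (BTFN) rule: a pure function
# of branch address and target address, with no history involved.

def btfn_predict(branch_addr, target_addr):
    # A target at a lower address is a backward branch (typically a loop
    # back-edge), so predict taken; a forward target is predicted not taken.
    return target_addr < branch_addr

assert btfn_predict(0x1040, 0x1000)      # loop back-edge: predict taken
assert not btfn_predict(0x1040, 0x1080)  # forward skip: predict not taken
```

Programmer hints of the kind Keith describes simply override this default for a particular branch.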

--
Keith
 
In article <MPG.1d627c16fde3672b989b65@news.individual.net>,
Keith Williams <krw@att.bizzzz> writes:
|>
|> Yes, but speculative execution doesn't usually include following both
|> paths. Going both ways after branches gets ugly on the second and
|> subsequent branches. ;-). The PPC-970 can have ~200 instructions in
|> flight (half in the prefetch/decode stages and half between issue and
|> completion). To keep the pipe full, it will speculatively fetch/execute
|> up to sixteen branches deep, but will speculatively execute only one
|> path of the 65K possible. There are optimization rules for branches
|> (branches backwards are usually taken, forwards usually not, etc.) and
|> the programmer can drop hints to override these defaults.

Which is fine for the simplest codes, but useless for ones which
are hard to predict. An old rule is that 20% of instructions
are branches, but that might drop to 10% in some ISAs. Anyway,
unless you can predict close to 95% correctly, speculatively
executing 16 branches ahead isn't going to help.
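Nick's arithmetic is easy to check: assuming independent predictions with per-branch accuracy p, a 16-deep speculated path is entirely correct with probability p^16:

```python
# Probability that all 16 speculated branches were predicted correctly,
# assuming independent predictions with per-branch accuracy p.

for p in (0.90, 0.95, 0.99):
    print(f"p={p}: {p ** 16:.2f}")
# p=0.9: 0.19
# p=0.95: 0.44
# p=0.99: 0.85
```

Even at 95% accuracy the deep speculation is wasted more often than not, which is the point being made.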

The thing that CAN be done both ways is cache preloading, though
it can increase the memory bandwidth unacceptably. Something
that could be done is a priority based cache preloader, fed with
data from the speculative execution. But even that doesn't help
all that much for difficult codes.


Regards,
Nick Maclaren.
 
In article <ddad0d$g56$2@blue.rahul.net>, kensmith@green.rahul.net (Ken Smith) writes:
|> In article <dd9fvt$s9b$1@news5.zwoll1.ov.home.nl>,
|> Skybuck Flying <nospam@hotmail.com> wrote:
|> [...]
|> >For god's sake, what does Booth's algorithm have to do with jump
|> >instructions?
|>
|> Take a look at what is suggest be done and you should be able to see that,
|> when coded for speed, other than the loop jumps, the jumps in the first
|> few passes through the code are worse than useless for predicting the
|> jumps later.

The same applies to much of the code in many graph theoretic
algorithms (VERY common and often dominating the time in some
fields) and the "object oriented" programming paradigm favoured
by many C++ and GUI programmers. In the latter case, the correct
key for prediction is the contents of a register and not what
the branch did last time.


Regards,
Nick Maclaren.
 
In article <dd9fvt$s9b$1@news5.zwoll1.ov.home.nl>,
Skybuck Flying <nospam@hotmail.com> wrote:
[...]
For god's sake, what does Booth's algorithm have to do with jump
instructions?
Take a look at what is being suggested and you should be able to see
that, when coded for speed, other than the loop jumps, the jumps in the
first few passes through the code are worse than useless for predicting
the jumps later.

For the loop instructions, the branch condition is known well in advance.
In the fastest RISC machines, this allows the loop jumps to happen with no
lost cycles.
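Ken's contrast can be illustrated: a history-based predictor is near-perfect on a loop back-edge but no better than chance on data-dependent jumps, such as those steered by operand bits in a multiply/divide inner loop. A toy simulation (random bits stand in for operand bits; this is not Booth's algorithm itself):

```python
# One saturating counter per branch, trained on two kinds of outcome
# streams: a loop back-edge vs. jumps decided by (random) operand bits.

import random

def run(outcomes):
    counter, hits = 0, 0
    for taken in outcomes:
        if (counter >= 0) == taken:  # non-negative counter predicts taken
            hits += 1
        counter = max(-2, min(2, counter + (1 if taken else -1)))
    return hits / len(outcomes)

random.seed(1)
loop_branch = [True] * 99 + [False]          # back-edge of a 100-pass loop
data_branch = [random.random() < 0.5 for _ in range(1000)]  # operand bits

print(run(loop_branch))   # 0.99: only the final exit is missed
print(run(data_branch))   # close to 0.5: history carries no information
```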

--
kensmith@rahul.net forging knowledge
 
In article <ddad91$ei4$1@gemini.csx.cam.ac.uk>,
Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
[... me ...]
|> Take a look at what is suggest be done and you should be able to see that,
|> when coded for speed, other than the loop jumps, the jumps in the first
|> few passes through the code are worse than useless for predicting the
|> jumps later.

The same applies to much of the code in many graph theoretic
algorithms (VERY common and often dominating the time in some
fields) and the "object oriented" programming paradigm favoured
by many C++ and GUI programmers. In the latter case, the correct
key for prediction is the contents of a register and not what
the branch did last time.
Yes, there are lots of other examples, but I think the Booth's case is the
simplest real case to see the problem with.

BTW: I disagree with the suggestion that jumps in OO programs are by
their nature harder to predict. Virtual method dispatching is a hard type
of branch to predict, but in non-OO code the virtual method call would be
replaced with a long switch statement. In good compilers, long switches
are implemented with jump tables. A "goto table" is very hard to
predict.
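Why a "goto table" resists prediction can be shown with a last-target predictor, the usual hardware approach of remembering the one previous destination seen at an indirect-branch site (a toy model, not any specific BTB design):

```python
# Last-target predictor for an indirect jump: remember only the previous
# destination seen at this site, as a simple BTB entry would.

def last_target_accuracy(targets):
    last, hits = None, 0
    for t in targets:
        if t == last:
            hits += 1
        last = t
    return hits / len(targets)

# A dispatch that keeps landing on the same case predicts well...
print(last_target_accuracy(["edit"] * 10))                  # 0.9
# ...but one that cycles through cases misses every single time.
print(last_target_accuracy(["edit", "draw", "save"] * 10))  # 0.0
```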

--
kensmith@rahul.net forging knowledge
 
In article <ddadvn$g56$3@blue.rahul.net>,
kensmith@green.rahul.net (Ken Smith) writes:
|>
|> Yes, there are lots of other examples, but I think the Booth's case is the
|> simplest real case to see the problem with.

It does, however, give the impression that the problem applies only
to very esoteric codes. It doesn't.

|> BTW: I disagree with the suggestion that jumps in OO programs are by
|> there nature harder to predict. Virtual method dispatching is a hard type
|> of branch to predict but in non-OO code, the virtual method call would be
|> replaced with a long switch statement. In good compilers, long switches
|> are implemented with jump tables. "goto table" is very hard to
|> predict.

Neither is hard to predict, if you do it right. Few, if any, modern
systems do.

What is needed is a compiler-generated instruction that sets up a
datum to be used for branch prediction, which is then quoted in
the branch, and used by the hardware. The instruction history
alone is useless for such codes, but the object address is
(obviously) excellent. This has the advantage that information
from a previous branch in the same function could be used.

For example, consider the following instructions:

Set predictor. It takes up to 2 registers, and sets one of a
few (4-16) predictors based on their contents (in an unspecified
fashion).

Predict branch. It takes a predictor and also specifies a
preferred direction, a resulting direction or none. It must
immediately precede a branch.

This would allow a compiler to take the following:

IF a > b THEN ... FI;
. . .
IF b < a THEN ... FI;

and issue:

Set_predictor 1 a b
. . .
Predict_branch 1 +1 [ indicating a resulting direction ]
. . .
Predict_branch 1 -2 [ indicating a preferred direction ]

I have little idea how effective this would be, but my gut feeling
is that it would be better than current approaches for such codes.
As far as I know, I have rendered this unpatentable by posting :)
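Nick's two hypothetical instructions can be modelled in a few lines. This toy model ignores the preferred/resulting-direction field and the "unspecified fashion" in which the registers are combined; the instruction names follow his sketch, not any real ISA:

```python
# Toy model of the Set_predictor / Predict_branch proposal. 'predictors'
# stands in for the small file of 4-16 hardware predictors.

predictors = {}

def set_predictor(n, a, b):
    predictors[n] = (a, b)  # capture the operands the later branches depend on

def predict_branch(n):
    a, b = predictors[n]
    return a > b  # both "IF a > b" and "IF b < a" resolve from this one datum

a, b = 7, 3
set_predictor(1, a, b)
print(predict_branch(1))  # True: IF a > b THEN ... predicted taken
print(predict_branch(1))  # True: IF b < a THEN ... same datum, same answer
```

The point of the scheme is visible here: once the datum is captured, every later branch keyed on it is predicted from the data values, not from instruction history.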



Regards,
Nick Maclaren.
 
On Mon, 8 Aug 2005 10:32:18 +0200, "Skybuck Flying" <nospam@hotmail.com>
wrote, in part:

1. Is it common to use counters as described above to do branch prediction
or is this something novel ? ;)

2. Suppose it's not novel... then why only use 1 bit to do branch prediction?
Actually, it's standard practice now to use a counter. But the counter
is inside the chip, not in RAM or in a higher-level language. It is
usually a saturating two-bit-wide counter.

And usually the counters are used in conjunction with a branch history
table (not to be confused with butylated hydroxytoluene).
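The scheme John describes is easy to sketch: a table of two-bit saturating counters indexed by branch address. This is a simulation; the table size and indexing here are arbitrary choices, not any particular chip's:

```python
# Two-bit saturating counters in a branch history table (BHT), indexed by
# branch address. States 0-1 predict not taken, 2-3 predict taken;
# updates saturate at the ends.

BHT_SIZE = 1024
bht = [1] * BHT_SIZE  # start every entry "weakly not taken"

def predict(pc):
    return bht[pc % BHT_SIZE] >= 2

def update(pc, taken):
    i = pc % BHT_SIZE
    bht[i] = min(3, bht[i] + 1) if taken else max(0, bht[i] - 1)

# The two-bit hysteresis survives a single loop exit without flipping:
misses = 0
for taken in [True, True, True, False, True]:
    if predict(0x400) != taken:
        misses += 1
    update(0x400, taken)
print(misses)  # 2: the cold start and the single loop exit
```

This is why two bits beat one: a lone not-taken outcome drops the counter from "strongly taken" to "weakly taken", so the next prediction is still taken.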

John Savard
http://www.quadibloc.com/index.html
 
On Tue, 9 Aug 2005, Ken Smith wrote:

there nature harder to predict. Virtual method dispatching is a hard type
of branch to predict but in non-OO code, the virtual method call would be
No, virtual methods are not that expensive (*). It turns out that each
call site usually has very few targets at a time, often only one.

You can use the call site address to predict which of a number of saved
target addresses (or even saved target instructions) to go to.

You can also use software methods to achieve the same thing: your VM (or
whatever) arranges a compare instruction that checks that the target is
what you think it is + a conditional jump + a call with a hardcoded target
address. If the target is different from what you thought, then the
conditional jump will jump to code that handles the new target. Usually,
the jump will be predicted to be a fall-through so this is a cheap way of
handling typical OO calls.

*) It depends on what resources the CPU designers had available and what
kinds of code it was deemed necessary to make fast. The software method
is nice in that it only requires simple branch prediction (which is
relatively cheap) but it can be hard to make your
compiler/linker/loader/VM work that way. The hardware method requires
more resources for caching addresses/instructions but it will work with
more naive toolchains/languages.

-Peter
 
On 9 Aug 2005 14:41:45 GMT, Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:

This would allow a compiler to take the following:
IF a > b THEN ... FI;
. . .
IF b < a THEN ... FI;
and issue:
Set_predictor 1 a b
. . .
Predict_branch 1 +1 [ indicating a resulting direction ]
. . .
Predict_branch 1 -2 [ indicating a preferred direction ]
I have little idea how effective this would be, but my gut feeling
is that it would be better than correct approaches for such codes.
As far as I know, I have rendered this unpatentable by posting
As an aside, the Regulus architecture, which was the follow-on to the
Power 6 from Computer Consoles, had a cute feature. If the value of a < b
could be determined in advance, a bit was set, and when execution reached
that point only the appropriate branch made it into the pipeline.
Multiflow tried executing in parallel, selecting the right branch when
the truth value was known, but I think it became a bit unwieldy at that
time (ca. 1984).
 
In article <Pine.LNX.4.61.0508091729380.9761@tyr.diku.dk>,
"Peter \"Firefly\" Lund" <firefly@diku.dk> writes:
|> On Tue, 9 Aug 2005, Ken Smith wrote:
|>
|> > there nature harder to predict. Virtual method dispatching is a hard type
|> > of branch to predict but in non-OO code, the virtual method call would be
|>
|> No, virtual methods are not that expensive (*). It turns out that each
|> call site usually have very few targets at a time, often only one.
|> You can use the call site address to predict which of a number of saved
|> target addresses (or even saved target instructions) to go to.

Oh, really? That doesn't apply to non-trivial GUI codes.

A very large number of component widgets are very complex, with
a lot of object-dependent branching internally, and are used in
very different ways from higher-level widgets. Now, this is NOT
how to do it, but it is the programming paradigm that is most
common in that area.

|> You can also use software methods to achieve the same thing: your VM (or
|> whatever) arranges a compare instruction that checks that the target is
|> what you think it is + a conditional jump + a call with a hardcoded target
|> address. If the target is different from what you thought, then the
|> conditional jump will jump to code that handles the new target. Usually,
|> the jump will be predicted to be a fall-through so this is a cheap way of
|> handling typical OO calls.

Regrettably not, in the case I mention above. Each component
widget may have a dozen major attributes, which interact in
truly horrible ways. That would increase the code size by
a large factor.

I agree that nobody in his right mind would ever design code
like that, but that isn't the point. That is how it has been
designed.


Regards,
Nick Maclaren.
 
Nick Maclaren wrote:
What is needed is a compiler-generated instruction that sets up a
datum to be used for branch prediction, which is then quoted in
the branch, and used by the hardware. The instruction history
alone is useless for such codes, but the object address is
(obviously) excellent. This has the advantage that information
from a previous branch in the same function could be used.
....
As far as I know, I have rendered this unpatentable by posting :)
US Patent 5,949,995.

You can use an instruction that references the branch or its prediction
instead of a branch that references the prediction datum. The
advantage of the former is that it doesn't require any additional
information in branch instructions and the prediction instructions are
a lot like stores.

-andy
 
In article <1123603712.829819.88980@g43g2000cwa.googlegroups.com>,
"Andy Freeman" <anamax@earthlink.net> writes:
|> Nick Maclaren wrote:
|> > What is needed is a compiler-generated instruction that sets up a
|> > datum to be used for branch prediction, which is then quoted in
|> > the branch, and used by the hardware. The instruction history
|> > alone is useless for such codes, but the object address is
|> > (obviously) excellent. This has the advantage that information
|> > from a previous branch in the same function could be used.
|> ...
|> > As far as I know, I have rendered this unpatentable by posting :)
|>
|> US Patent 5,949,995.

Thanks. I think that my posting includes an aspect that isn't
in that, but I can't be bothered to check. It certainly covers
the basic principle.

I did a search on Google groups to see if I had posted the
method before the patent, but it seems not. It is pretty obvious,
in my view.


Regards,
Nick Maclaren.
 
In article <ddafb9$ir7$1@gemini.csx.cam.ac.uk>,
Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
In article <ddadvn$g56$3@blue.rahul.net>,
kensmith@green.rahul.net (Ken Smith) writes:
|
|> Yes, there are lots of other examples, but I think the Booth's case is the
|> simplest real case to see the problem with.

It does, however, give the impression that the problem applies only
to very esoteric codes. It doesn't.
Are you suggesting that Booth's divide is "esoteric"? It seems to be a
very common bit of code.


|> BTW: I disagree with the suggestion that jumps in OO programs are by
|> there nature harder to predict. Virtual method dispatching is a hard type
|> of branch to predict but in non-OO code, the virtual method call would be
|> replaced with a long switch statement. In good compilers, long switches
|> are implemented with jump tables. "goto table" is very hard to
|> predict.
[...]


What is needed is a compiler-generated instruction that sets up a
datum to be used for branch prediction, which is then quoted in
the branch, and used by the hardware.
This could sort of work, but it adds an instruction to the handling of
every branch. The added instruction time will eat away at any advantage
gained.

As described, it also would only work if the same routine is called for
the same object. You would be better off feeding the address of the
Virtual Method Table into this added instruction. In most C++ programs,
all objects of a given type share the same VMT, and hence any use of an
object of a given type would give you the information needed to predict
jumps.


Remember that some of the RISC and DSP processors duck the problem by
letting some small number of instructions after the branch happen before
the branch takes effect. This only adds instructions to the code in cases
where nothing can be done before the branch happens. I think this is a
much better way to go if you want a lot of speed per transistor.

The instruction history
alone is useless for such codes, but the object address is
(obviously) excellent.
No, the object's address is not obviously "excellent", at least to me.
Please explain what you mean.


[..long description removed..]
I have little idea how effective this would be, but my gut feeling
is that it would be better than correct approaches for such codes.
As far as I know, I have rendered this unpatentable by posting :)
Maybe they'll name it after you. You'll be famous!

--
kensmith@rahul.net forging knowledge
 
In article <Pine.LNX.4.61.0508091729380.9761@tyr.diku.dk>,
Peter \"Firefly\" Lund <firefly@diku.dk> wrote:
On Tue, 9 Aug 2005, Ken Smith wrote:

there nature harder to predict. Virtual method dispatching is a hard type
of branch to predict but in non-OO code, the virtual method call would be

No, virtual methods are not that expensive (*). It turns out that each
call site usually have very few targets at a time, often only one.
Thinking of some OO code I've done:

The typical case is that a given call could dispatch the, let's say, "edit"
method for some 25 different types. Do you consider 25 a "small" number?

BTW: There are quite a few cases over 25 in the code I was thinking of.


You can use the call site address to predict which of a number of saved
target addresses (or even saved target instructions) to go to.
Are you intending to save this information on the CPU chip? If so,
couldn't the same number of transistors be put to much better use?


You can also use software methods to achieve the same thing: your VM (or
whatever) arranges a compare instruction that checks that the target is
what you think it is + a conditional jump + a call with a hardcoded target
address. If the target is different from what you thought, then the
conditional jump will jump to code that handles the new target. Usually,
the jump will be predicted to be a fall-through so this is a cheap way of
handling typical OO calls.
No, this is not a cheap way of doing VMTs. The Virtual Method Table is
the fast way of doing it. There is no point in optimizing a slower
method if you want speed. The only case where your method would be faster
is if, the majority of times you come to the call, the same object type is
in use. I can't think of a program I've written where that was true.




--
kensmith@rahul.net forging knowledge
 
On Wed, 10 Aug 2005 00:24:09 +0000, Ken Smith wrote:
Remember that some of the RISC and DSP processors duck the problem by
letting some small number of instructions after the branch happen before
the branch takes effect. This only adds instructions to the code in cases
were nothing can be done before the branch happens. I think this is a
much better way to go if you want a lot of speed per transistor.
Up to 40 instructions, in the case of the TI C6000 series: 5 instruction
issue slots in the branch shadow, up to eight instructions per slot.
Makes loop pre-amble and termination of otherwise single-cycle loops
amusing, but do-able. There are various multi-cycle NOP instructions to
fiddle the other cases without taking up too much code space.

As well as good speed/transistor, this does help with predictability too.

Cheers,

--
Andrew
 
This sort of thing is done at higher levels in various search
algorithms. Far more advanced techniques than just counters are used.

Building this sort of thing into processor firmware or general purpose
library routines might impose an inordinate burden on execution time if
it was employed for every branch. For cases where the branch execution
probabilities don't change much, just profiling the code and adjusting
the design will get you just as much gain. In the few cases where a
search can be optimized, the designer should be able to apply the
correct heuristics based on knowledge of the problem domain.

--
Paul Hovnanian mailto:paul@Hovnanian.com
------------------------------------------------------------------
If you can't beat them, arrange to have them beaten.
-- George Carlin
 
