Can't get Quartus to do what I want

Chris Hinsley

Folks, I'm having a hard time getting this bit of code to do what I want.

I have an execution unit I've done that has N results that can be
written back per clock to the register file, and I don't want a ladder
of multiplexers working out what value gets written to each register,
it's too serial and takes too long for my liking, and gets slower as I
add functional units.

So I thought 'aha, I'll put all the results onto a single bus to each
register using a demuz (demux with z on non-written paths) and just
write the values off the bus into the registers that are getting
updated'. Easy ?

Not wanting to put my entire source here I'll show the relevant bits.
And I should add that the design guarantees not to write to the same
register twice from any of the inputs.

This is how I prepare the inputs.

tri [(NUM_REGS - 1):0] [(REG_WIDTH - 1):0] val;
wire [(NUM_FUS - 1):0] [(NUM_REGS - 1):0] fuwe;
genvar u;
generate
    for (u = 0; u < NUM_FUS; u++)
    begin : G1
        demuz #(.SEL_BITS(REG_SEL_BITS), .WIDTH(REG_WIDTH))
            U(.i_sel(i_c[u]), .i(i_valc[u]), .o_d(fuwe[u]), .o(val));
    end
endgenerate

That puts all the inputs onto a shared bus to each register, and keeps
the decoder bits so I can work out what registers have to be written to
later on.
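
For readers without the full source, here is a minimal sketch of what a demuz along these lines might look like. Only the parameter and port names (SEL_BITS, WIDTH, i_sel, i, o_d, o) come from the instantiation above; the module body, and the assumption that there are exactly 2**SEL_BITS registers, are guesses for illustration and not the poster's actual code.

```systemverilog
// Hypothetical demuz: one-hot decode of i_sel, with the input value
// driven onto the selected output slice and 'z on all the others.
module demuz #(parameter SEL_BITS = 4, parameter WIDTH = 32)
(
    input  wire [SEL_BITS - 1:0]                      i_sel, // target register
    input  wire [WIDTH - 1:0]                         i,     // value to write
    output wire [(1 << SEL_BITS) - 1:0]               o_d,   // one-hot decode
    output tri  [(1 << SEL_BITS) - 1:0] [WIDTH - 1:0] o      // per-register bus
);
    genvar k;
    generate
        for (k = 0; k < (1 << SEL_BITS); k++)
        begin : G
            assign o_d[k] = (i_sel == k);
            // Drive only the selected slice; leave the rest undriven.
            assign o[k] = o_d[k] ? i : {WIDTH{1'bz}};
        end
    endgenerate
endmodule
```

With several of these tied to the same `val` net, each register slice has one potential tri-state driver per functional unit, which is exactly the structure the tool has to resolve.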

reg [(NUM_REGS - 1):0] [(REG_WIDTH - 1):0] r;
reg [(NUM_REGS - 1):0] p;
reg [(NUM_REGS - 1):0] we;

// calculate write enable bits for the registers
integer i, j;
we[0] = 0;
for (i = 1; i < NUM_REGS; i++)
begin
    we[i] = 0;
    for (j = 0; j < NUM_FUS; j++)
    begin
        we[i] |= fuwe[j][i];
    end
end

// write the registers and clear pending bits for those that get written
r[0] = 0;
p[0] = 0;
for (i = 1; i < NUM_REGS; i++)
begin
    if (we[i])
    begin
        r[i] = val[i];
        p[i] = 0;
    end
end

This bit of code is in an 'always@(posedge i_clk)' and writes the new
values off the bus into any register that gets written to.

My problem is Quartus keeps turning the tri-state bus into selector
logic, warning me that it's done so, and creates the piles of
multiplexers I'm desperately seeking to remove. Unless I've somehow got it
all wrong (which I might) I should be able to have a single shared bus
to each register that contains a correct value for the case where it
gets written.

What I'm trying to do is produce a multiport register write (from
multiple functional units) that doesn't get slower and slower the more
functional units I add. Which it won't if the shared bus idea is
correct. I know I'm going to take a hit on wires, but at least all the
values come through in parallel.

Anyone care to take a look ? Or have the key to getting Quartus to do
what I want ?

Best regards

Chris
 
This is the actual Quartus warning:

Warning: Tri-state node(s) do not directly drive top-level pin(s)
Warning: Converted tri-state node "execute:u1|val[15][31]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][30]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][29]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][28]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][27]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][26]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][25]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][24]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][23]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][22]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][21]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][20]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][19]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][18]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][17]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][16]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][15]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][14]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][13]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][12]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][11]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][10]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][9]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][8]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][7]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][6]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][5]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][4]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][3]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][2]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][1]" into a selector
Warning: Converted tri-state node "execute:u1|val[15][0]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][31]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][30]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][29]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][28]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][27]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][26]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][25]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][24]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][23]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][22]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][21]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][20]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][19]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][18]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][17]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][16]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][15]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][14]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][13]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][12]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][11]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][10]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][9]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][8]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][7]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][6]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][5]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][4]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][3]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][2]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][1]" into a selector
Warning: Converted tri-state node "execute:u1|val[14][0]" into a selector
Warning: Converted tri-state node "execute:u1|val[13][31]" into a selector
Warning: Converted tri-state node "execute:u1|val[13][30]" into a selector
Warning: Converted tri-state node "execute:u1|val[13][29]" into a selector
Warning: Converted tri-state node "execute:u1|val[13][28]" into a selector
Warning: Converted tri-state node "execute:u1|val[13][27]" into a selector
Warning: Converted tri-state node "execute:u1|val[13][26]" into a selector
Warning: Converted tri-state node "execute:u1|val[13][25]" into a selector
Warning: Converted tri-state node "execute:u1|val[13][24]" into a selector
Warning: Converted tri-state node "execute:u1|val[13][23]" into a selector
etc
 
In article <2011042818341646599-chrishinsley@gmailcom>,
Chris Hinsley <chris.hinsley@gmail.com> wrote:
<snip>
Chris,

First things first - internal tri-states are pretty much gone. They
don't exist in ASICs or FPGAs anymore. So a "shared bus" in the
old sense (many masters/slaves, only one master driving at a time)
can't be implemented.

So, most FPGA tools map any tri-states to combinational logic - in your
case putting all those muxes back which you tried so hard to remove.

A solution we've used is a wired-or structure, where
any master NOT driving assigns its output of the "shared bus"
to "0" (instead of Z). All masters' outputs are then OR'd together.
With only one master driving its wanted value - all the others
driving "0" - the one master's driving value shows up on the
"shared bus". For us, instead of actually instantiating "OR" gates,
we use the Verilog "wor" declaration for all our bus wires,
and then tie all the outputs together (implementing an "OR" via
the net resolution). All the FPGA tools that we use seem to
be happy with this. "Contention" issues are still a problem,
but that's not much different than before.
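
A minimal sketch of the wired-or idea described above, assuming the enables are one-hot (or all zero). The module name wor_bus_demo and its ports are hypothetical, made up for illustration; the only language feature relied on is that multiple continuous assignments to a "wor" net resolve by OR.

```systemverilog
// Hypothetical demo: NUM_MASTERS drivers share one "wor" bus.
// A master that is not driving contributes zeros instead of 'z.
module wor_bus_demo #(parameter WIDTH = 32, parameter NUM_MASTERS = 4)
(
    input  wire [NUM_MASTERS - 1:0]               i_en,  // assumed one-hot or zero
    input  wire [NUM_MASTERS - 1:0] [WIDTH - 1:0] i_val, // one value per master
    output wire [WIDTH - 1:0]                     o_bus
);
    wor [WIDTH - 1:0] bus; // multiple drivers on this net are OR'd together

    genvar m;
    generate
        for (m = 0; m < NUM_MASTERS; m++)
        begin : G
            // Each iteration adds one driver to the shared net.
            assign bus = i_en[m] ? i_val[m] : {WIDTH{1'b0}};
        end
    endgenerate

    assign o_bus = bus;
endmodule
```

If two enables are ever active together the bus carries the bitwise OR of both values, which is the "contention" case: nothing breaks electrically, but the result is wrong.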

You don't get what you want - you're still getting "slower and slower
the more functional units" you add. Just the rate at which you're
getting slower is less since an "OR" gate is less complex than a mux.

You really can't get what you want. For a shared tri-state bus, for
every new load, you'd add capacitance/resistance to the bus - making
it harder to drive. How to solve? Increase drive strength. But wait -
that also increases the loads that all the other drivers "see". Split
up the bus - ok how? All slaves are still a function of all masters.
It becomes an intractable problem. I think this is one of the reasons
tri-states are gone.

Regards,

Mark
 
First things first - internal tri-states are pretty much gone. They
don't exist in ASICs or FPGAs anymore. So a "shared bus" in the
old sense (many masters/slaves, only one master driving at a time)
can't be implemented.
Drat.

So, most FPGA tools map any tri-states to combinational logic - in your
case putting all those muxes back which you tried so hard to remove.
Double drat. :)

A solution we've used is a wired-or structure, where
any master NOT driving assigns its output of the "shared bus"
to "0" (instead of Z). All masters' outputs are then OR'd together.
With only one master driving its wanted value - all the others
driving "0" - the one master's driving value shows up on the
"shared bus". For us, instead of actually instantiating "OR" gates,
we use the Verilog "wor" declaration for all our bus wires,
and then tie all the outputs together (implementing an "OR" via
the net resolution). All the FPGA tools that we use seem to
be happy with this. "Contention" issues are still a problem,
but that's not much different than before.
Yes OK.

You don't get what you want - you're still getting "slower and slower
the more functional units" you add. Just the rate at which you're
getting slower is less since an "OR" gate is less complex than a mux.
I just knew someone would say that. :)

You really can't get what you want. For a shared tri-state bus, for
every new load, you'd add capacitance/resistance to the bus - making
it harder to drive. How to solve? Increase drive strength. But wait -
that also increases the loads that all the other drivers "see". Split
up the bus - ok how? All slaves are still a function of all masters.
It becomes an intractable problem. I think this is one of the reasons
tri-states are gone.

Regards,

Mark
Yeah, I knew that deep inside, I just couldn't let go of it once I'd
seen the RTL producing all those muxes again. :(

Chris
 
A solution we've used is a wired-or structure, where
any master NOT driving assigns its output of the "shared bus"
to "0" (instead of Z). All masters' outputs are then OR'd together.
With only one master driving its wanted value - all the others
driving "0" - the one master's driving value shows up on the
"shared bus".
Switched over to this code. Thanks.

wire [(NUM_FUS - 1):0] [(NUM_REGS - 1):0] [(REG_WIDTH - 1):0] fuval;
wire [(NUM_FUS - 1):0] [(NUM_REGS - 1):0] fuwe;
genvar u;
generate
    for (u = 0; u < NUM_FUS; u++)
    begin : G1
        demux #(.SEL_BITS(REG_SEL_BITS), .WIDTH(REG_WIDTH))
            U(.i_sel(i_c[u]), .i(i_valc[u]), .o_d(fuwe[u]), .o(fuval[u]));
    end
endgenerate

// calculate write enable bits for the registers
// and OR the buses together
reg [(NUM_REGS - 1):0] [(REG_WIDTH - 1):0] val;
reg [(NUM_REGS - 1):0] we;
always @(*)
begin
    integer i, j;
    we[0] = 0;
    for (i = 1; i < NUM_REGS; i++)
    begin
        we[i] = 0;
        for (j = 0; j < NUM_FUS; j++)
        begin
            we[i] |= fuwe[j][i];
        end
    end
    val[0] = 0;
    for (i = 1; i < NUM_REGS; i++)
    begin
        val[i] = 0;
        for (j = 0; j < NUM_FUS; j++)
        begin
            val[i] |= fuval[j][i];
        end
    end
end

Regards

Chris
 
Chris Hinsley <chris.hinsley@gmail.com> wrote:

First things first - internal tri-states are pretty much gone. They
don't exist in ASICs or FPGAs anymore. So a "shared bus" in the
old sense (many masters/slaves, only one master driving at a time)
can't be implemented.

Drat.
(snip)
A solution we've used is a wired-or structure, where
any master NOT driving assigns its output of the "shared bus"
to "0" (instead of Z). All masters' outputs are then OR'd together.
With only one master driving its wanted value - all the others
driving "0" - the one master's driving value shows up on the
"shared bus".
Not having actually looked, this seems like the most likely way
the tools emulate tristate drivers. (I suppose wired-and makes
about as much sense.) While often described as multiplexers,
the logic would be much more complicated that way. You would
need a priority encoder, then multiplexer. That doesn't make sense.
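
To illustrate the distinction, here is a sketch of the two forms side by side. The module select_styles and its ports are hypothetical, and this is not a claim about what Quartus actually emits; it only shows that with one-hot enables the plain AND-OR form gives the same answer as a priority chain, without needing the priority logic.

```systemverilog
// Hypothetical comparison of a priority select and an AND-OR select.
module select_styles #(parameter WIDTH = 8, parameter N = 4)
(
    input  wire [N - 1:0]               i_en,    // assumed one-hot or zero
    input  wire [N - 1:0] [WIDTH - 1:0] i_val,
    output reg  [WIDTH - 1:0]           o_prio,  // priority chain result
    output reg  [WIDTH - 1:0]           o_andor  // AND-OR result
);
    integer k;
    always @(*)
    begin
        // Priority form: walk from high index to low so that the
        // lowest enabled index is assigned last and wins.
        o_prio = 0;
        for (k = N - 1; k >= 0; k--)
            if (i_en[k])
                o_prio = i_val[k];

        // AND-OR form: mask each value with its enable, OR them all.
        o_andor = 0;
        for (k = 0; k < N; k++)
            o_andor |= i_val[k] & {WIDTH{i_en[k]}};
    end
endmodule
```

The two outputs only differ when more than one enable is set: the priority form picks one value, while the AND-OR form mixes them.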

For us, instead of actually instantiating "OR" gates,
we use the Verilog "wor" declaration for all our bus wires,
and then tie all the outputs together (implementing an "OR" via
the net resolution). All the FPGA tools that we use seem to
be happy with this. "Contention" issues are still a problem,
but that's not much different than before.

Yes OK.
And the FPGA can implement actual OR gates instead.

As I understand it, the change in FPGA logic comes from the
requirement, as lines get longer, to buffer them internally.
That only works if you know the direction.

You don't get what you want - you're still getting "slower and slower
the more functional units" you add. Just the rate at which you're
getting slower is less since an "OR" gate is less complex than a mux.

I just knew someone would say that. :)

You really can't get what you want. For a shared tri-state bus, for
every new load, you'd add capacitance/resistance to the bus - making
it harder to drive. How to solve? Increase drive strength. But wait -
that also increases the loads that all the other drivers "see". Split
up the bus - ok how? All slaves are still a function of all masters.
It becomes an intractable problem. I think this is one of the reasons
tri-states are gone.
Pretty much. In addition, for a long time now the wires have been small
enough that you can't treat a line as a lumped resistance driving a
capacitive load. The only solution, to keep up the speed, is
internal buffers on longer lines.

-- glen
 
On 2011-04-28 19:49:10 +0100, Chris Hinsley said:

A solution we've used is a wired-or structure, where
any master NOT driving assigns its output of the "shared bus"
to "0" (instead of Z). All masters' outputs are then OR'd together.
With only one master driving its wanted value - all the others
driving "0" - the one master's driving value shows up on the
"shared bus".

Switched over to this code. Thanks.
<snip>

God damb it. Yet again, even though I thought this way was going to be
faster than the ladder of muxes it doesn't turn out to be quicker.

Is it always like this guys ? You try your dambdest to 'do it the best
way' and your synth tool putting out shocking RTL beats you all the
time. :(

<depressed> :(

Chris
 
In article <2011042822375646394-chrishinsley@gmailcom>,
Chris Hinsley <chris.hinsley@gmail.com> wrote:
God damb it. Yet again, even though I thought this way was going to be
faster than the ladder of muxes it doesn't turn out to be quicker.

Is it always like this guys ? You try your dambdest to 'do it the best
way' and your synth tool putting out shocking RTL beats you all the
time. :(

<depressed> :(
Actually, it's a good thing! Trust your synthesis tool. 95% of the time it is
going to do a good enough job without issue. Could you do better? Probably,
given a lot more time. That other 5%? You may have to beat it into submission -
just pick that 5% wisely. :).

--Mark


 
On 2011-05-04 15:31:05 +0100, Andy said:

On Apr 28, 4:37 pm, Chris Hinsley <chris.hins...@gmail.com> wrote:
Is it always like this guys ? You try your dambdest to 'do it the best
way' and your synth tool putting out shocking RTL beats you all the
time. :(
In your original post, you said:

"it's too serial and takes too long for my liking"

This is the worst possible reason for obfuscating code: to try to make
it run as fast/small as you think it should, when it already meets
resource and timing requirements.

Does it fit? Does it meet timing? Done! Stop right there. Why go to
all sorts of trouble, making your code harder to write, read, review
and maintain, when the way you originally wrote it meets its
requirements?
I know, but there's always that nagging feeling that you could do
better. I have to learn to resist.

Chris
 
On Apr 28, 4:37 pm, Chris Hinsley <chris.hins...@gmail.com> wrote:
Is it always like this guys ? You try your dambdest to 'do it the best
way' and your synth tool putting out shocking RTL beats you all the
time. :(
In your original post, you said:

"it's too serial and takes too long for my liking"

This is the worst possible reason for obfuscating code: to try to make
it run as fast/small as you think it should, when it already meets
resource and timing requirements.

Does it fit? Does it meet timing? Done! Stop right there. Why go to
all sorts of trouble, making your code harder to write, read, review
and maintain, when the way you originally wrote it meets its
requirements?

I understand your wanting to make this code extendable, but you also
have to understand that the synthesis tool probably has more tricks
and transformations that it can try WHEN THEY ARE NEEDED. When you add
enough to it, it may well come up with a better way on its own to meet
timing.

You are not alone in your quest to find "the best" way to code
something; I see this all the time. Given the size/cost/schedule of
most projects, we cannot afford to design at the gate level, so don't
code like it unless you absolutely have to. Change your definition of
"the best way to code something" to be more writeable, readable,
maintainable, etc. You will be happier, and the poor fellow that gets
to review, reuse or maintain your code (which just might be you in
another couple of weeks/months/years) will appreciate it too.

Andy
 
