EDK : FSL macros defined by Xilinx are wrong

v_mirgorodsky@yahoo.com wrote:

Thanks for your response. So it seems that I will not have any
advantage using those pins in my design :( I was hoping to avoid
any floorplanning work during the migration to PCI66, as I did for
PCI33. Is there any possibility that Xilinx will disclose the
functionality behind those pins in the future? Are there any recommended
layouts available for PCI33/66?
Search Google Groups for that subject. There was an interesting
discussion in November 2005.

Bye

--
Uwe Bonnes bon@elektron.ikp.physik.tu-darmstadt.de

Institut fuer Kernphysik Schlossgartenstrasse 9 64289 Darmstadt
--------- Tel. 06151 162516 -------- Fax. 06151 164321 ----------
 
Vladimir,

The PCI application has odd timing requirements. As a consequence, to
be sure we can always meet the requirements, Xilinx has at various times
had hardened bits of logic "just in case" the final silicon was not
capable of meeting a particular PCI requirement.

If you would like more information, please email me directly.

Since these pins are undocumented, it looks like they were not needed.

I am not on the Spartan design team, so I will go research it further if
you need. The Virtex design team has done this also (at various times),
so it is nothing new.

Austin

v_mirgorodsky@yahoo.com wrote:

Hello, ALL!

In our design we are planning to use a Spartan-3E in a PCI 33/66
environment. We have developed our own PCI core. Since the code is
completely RTL and does not have any platform-specific features, we were
able to test it with an existing Altera ACEX-1K PCI33 board. Running on
a speed grade 2 Altera ACEX-1K device, our core has about 1.5-2 ns of
margin out of the 7 ns Tsu budget, and even more for Tout. Now, for the
production design, we are planning to move to a Xilinx Spartan-3E 500 FPGA.
During a detailed investigation of the FT256 package we found several
strange pins, marked as IRDY1, TRDY1, IRDY2 and TRDY2. Do these pins have
any significant meaning for PCI designs? Unfortunately, I did not find any
explanation in the Spartan-3E datasheet, nor in the accompanying application notes.

With best regards,
Vladimir S. Mirgorodsky
 
I am not sure how it is done for Nios II, but in general:

For static branch prediction, the CPU makes a fixed prediction when it
encounters a conditional branch in the instruction stream. For example,
if the CPU encounters a jump on zero flag set, a static branch
predictor will assume that the CPU will take the branch. It does not
use the history of the instruction to determine whether it will branch or
not. This, I believe, gives pretty good results.

A dynamic branch predictor works on the history of the instruction
stream to determine whether a jump will take place. For example, if the
CPU encounters the same aforementioned instruction, it will look into its
internal state to determine the history of that instruction and make a
prediction based on that. The function used to keep track of this varies
widely. A good example to look at would be the original Intel Pentium's
BTB (Branch Target Buffer), or the way the P2 or P3 do it.
The methods these chips use are ingenious and have very good prediction
rates, but by the looks of it, they require some complicated
hardware.

Obviously the dynamic predictor will have a better prediction record
than the static one, but the static one is FAR easier to implement.
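
As a rough illustration of the dynamic case (this is not Nios II specific, and the entity name and ports below are made up), here is a minimal VHDL sketch of the 2-bit saturating counter that most dynamic predictors build their tables from: it takes two mispredictions in a row to flip the prediction. A real predictor keeps an array of these counters indexed by low-order PC bits, BTB-style.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity two_bit_predictor is
  port (
    clk     : in  std_logic;
    update  : in  std_logic;  -- pulse when the branch actually resolves
    taken   : in  std_logic;  -- actual outcome: '1' = taken
    predict : out std_logic   -- '1' = predict taken
  );
end two_bit_predictor;

architecture rtl of two_bit_predictor is
  -- 00 strongly not taken, 01 weakly not taken, 10 weakly taken, 11 strongly taken
  signal state : unsigned(1 downto 0) := "01";
begin
  predict <= state(1);  -- MSB is the prediction

  process(clk)
  begin
    if rising_edge(clk) then
      if update = '1' then
        if taken = '1' and state /= "11" then
          state <= state + 1;    -- saturate at strongly taken
        elsif taken = '0' and state /= "00" then
          state <= state - 1;    -- saturate at strongly not taken
        end if;
      end if;
    end if;
  end process;
end rtl;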

Here is a link:
http://www.x86.org/articles/branch/branchprediction.htm

I was reading this article some time ago and was very impressed with the
method Intel used. Perhaps somebody could post a link to a better
method than this; then I will faint :).

-Isaac
 
Hi Udo,

Your chances of getting a reply from an expert are much better if you
post Nios questions on www.niosforum.com .

- Subroto Datta
Altera Corp.



"Udo" <WeikEngOff@aol.com> wrote in message
news:1143324190.583196.306100@g10g2000cwb.googlegroups.com...
Hello,

Does somebody know more details about the
Nios II branch prediction? I know only that
two types are supported - static and dynamic.
Any more details?

Thanks.
Udo
 
Pablo Bleyer Kocik wrote:
Thanks for the pointers. I will try that.

I couldn't turn up that other post that I recalled, but I dug
up a code snippet of the conditional signed skips from my own
homebrew processor (no mid-chain split, but the overflow logic
is coded with pad bits).

Basically, the copy of the MSB input bits at bit position
MSB+1 lets you indirectly look for a difference between the carries
into and out of the MSB position in the inferred RTL adder.

gen_sgbt: if CFG_SKIP_GROUP_B = TRUE generate

  skip_b: block
    signal wide_diff : std_logic_vector( ALU_MSB+2 downto 0);
    signal pad_ar    : std_logic_vector( ALU_MSB+2 downto 0);
    signal pad_br    : std_logic_vector( ALU_MSB+2 downto 0);

  begin
    pad_ar <= ( '0' & ar(ALU_MSB) & ar );
    pad_br <= ( '0' & br(ALU_MSB) & br );

    wide_diff <= pad_ar - pad_br;

    -- sign, carry, overflow, zero bits
    cb_n <= wide_diff(ALU_MSB);
    cb_c <= wide_diff(ALU_MSB+2);
    cb_v <= wide_diff(ALU_MSB+1) XOR wide_diff(ALU_MSB);
    cb_z <= '1' when ( wide_diff(ALU_MSB downto 0) = ALU_ZERO ) else '0';

  end block skip_b;

  --
  -- mux for skip_b condition decoding
  --
  skip_decode_b: process(skip_sense, skip_type, cb_z, cb_n, cb_c, cb_v)
    variable skip_mux_b : std_logic;
  begin

    -- mux condition sources
    case skip_type is
      when CND_LO => skip_mux_b := cb_c;
      when CND_LS => skip_mux_b := cb_z OR cb_c;
      when CND_LT => skip_mux_b := cb_n XOR cb_v;
      when CND_LE => skip_mux_b := (cb_n XOR cb_v) OR cb_z;
      when others => skip_mux_b := '1';
    end case;

    if skip_sense = '0' then
      skip_cond_b <= skip_mux_b;
    else
      skip_cond_b <= NOT skip_mux_b;
    end if;

  end process skip_decode_b;

end generate gen_sgbt;


Which implements:

SCCB : skip conditions, group B

000 0 skip.lo lower unsigned, RA < RB
100 0 skip.hs higher or same unsigned, RA >= RB

001 0 skip.ls lower or same unsigned, RA <= RB
101 0 skip.hi higher unsigned, RA > RB

010 0 skip.lt less than signed, RA < RB
110 0 skip.ge greater than or equal signed, RA >= RB

011 0 skip.le less than or equal signed, RA <= RB
111 0 skip.gt greater than signed, RA > RB

There's also a great explanation of generating conditionals and
overflows in sections 2-11 through 2-13 of "Hacker's Delight", Warren,
Addison-Wesley, 2003.

Brian
 
The older code was getting synthesized in around 20 minutes, but the new
code takes hours to synthesize, and so does the PAR.
How can we reduce the synthesis time? Why is it that the code which
took less time to synthesize is now taking longer?
One thing I've noticed with XST and state machines is that past a
certain complexity, XST run time goes up exponentially when adding
more states (quickly changing from minutes to hours).

Although since you describe only four states, that may not be your
exact problem.

You might try turning off/changing the state machine inference and
encoding settings.
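
For example, in XST the encoding or extraction can be overridden right on the state signal with synthesis attributes. This is a hedged sketch from memory of the XST user guide ("state" stands for whatever your state signal is called); double-check the exact attribute values against your version:

-- force a particular encoding for the state signal
attribute fsm_encoding : string;
attribute fsm_encoding of state : signal is "one-hot";

-- or disable FSM extraction for it altogether
attribute fsm_extract : string;
attribute fsm_extract of state : signal is "no";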

What version of XST?
Any weird warnings?
Has the state machine been tested in simulation for init/recovery on
unexpected inputs?

See also Answer Records 20366, 22654, 22637, 21682

Brian
 
On 2006-03-25, leaf <adventleaf@gmail.com> wrote:
Yes, your minipci design worked on my PC; I did use it as a
reference, and somehow I managed to port it into my new design.
Unfortunately, the former didn't work on all PCs in our lab.
I'm not sure which design you had problems with -- mine or your modified
version. One thing to be aware of is that modern PCI bridge chipsets
have configurable clocks and the BIOS will disable a slot entirely if
it's not happy with the initial configuration requests to the card. If
your code works at the first boot, you can often re-flash the card while
booted (provided the CPLD or FPGA tristates during the programming and
assuming your code comes up in a sane state without seeing a PCI RST).
However, if your card was not detected at boot, you probably can't reload
it and see it work unless you also restart the PC.

Also, my core did not generate PAR at all because the fan-in for the
necessary XOR wasn't possible in the small CPLD I was using.

A note with regard to the screenshot:
Will the bus master work with this kind of target (i.e. one with Medium
DEVSEL timing)?
The devsel timing has to meet the spec you put in your config space.
If you say "fast" and don't respond on the clock after the address
cycle, then the initiator could assume that no one will ever reply.
I'm not sure if (or even how) that would apply to config cycles.

Also, in your simulation you have a config space read at address '1'
which should really be a read at 0 with byte enables. And you return
all '1's which would not be valid and could cause the problem I described
above with your BIOS.

--
Ben Jackson
<ben@ben.com>
http://www.ben.com/
 
Hi Isaac,

There are far fancier ones than this. The DEC Alpha 21264 had a snazzy
scheme in which it had two different branch predictors and picked the
better of the two dynamically. A global predictor used the results
(taken/not taken) of the last 12 branches to predict what the next
branch would do. A local predictor tracked, for a set of branches,
whether each was taken or not taken the past few times it was
evaluated, and predicted accordingly.

The 21264 would then store, on a per-branch basis, which predictor was
doing a better job. Some branches (say, the end of a for loop with a
lot of iterations) are well predicted by a local predictor. Some
branches (for example, a for loop that iterates exactly 4 times,
resulting in a taken-taken-taken-not-taken pattern) are better predicted
by a global predictor. So it would get the best of both worlds.
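
For anyone curious what the "chooser" part amounts to, here is a minimal VHDL sketch of that piece only; the entity and signal names are mine and have nothing to do with the actual 21264 implementation. A 2-bit preference counter per branch entry is nudged toward whichever predictor was right whenever the two disagree:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical tournament "chooser" for a single branch entry.
entity tournament_chooser is
  port (
    clk          : in  std_logic;
    resolve      : in  std_logic;  -- pulse when the branch outcome is known
    actual_taken : in  std_logic;
    global_pred  : in  std_logic;  -- prediction from the global predictor
    local_pred   : in  std_logic;  -- prediction from the local predictor
    final_pred   : out std_logic
  );
end tournament_chooser;

architecture rtl of tournament_chooser is
  -- 2-bit preference counter: MSB = '1' means trust the global predictor
  signal choice : unsigned(1 downto 0) := "10";
begin
  final_pred <= global_pred when choice(1) = '1' else local_pred;

  process(clk)
  begin
    if rising_edge(clk) then
      -- only learn when the two predictors disagreed
      if resolve = '1' and global_pred /= local_pred then
        if global_pred = actual_taken and choice /= "11" then
          choice <= choice + 1;   -- global was right: lean global
        elsif local_pred = actual_taken and choice /= "00" then
          choice <= choice - 1;   -- local was right: lean local
        end if;
      end if;
    end if;
  end process;
end rtl;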

I did a school project on 21264 branch prediction, but can no longer
find the source material on the web. However, I did come across this
fairly good presentation that describes branch prediction methods and
which processors use what.

http://meseec.ce.rit.edu/eecc551-winter2003/551-12-17-2003.pdf

Enjoy,

Paul Leventis
Altera Corp.
 
<snip>
The 8.1i
release of ISE has support for "synchronously controlled initialization
of the RAM data outputs" (p. 218 of the XST User Guide) and for RAM
initialization (p. 226). Thinking I could be clever about it, I coded up a
RAM with initial values and also with output initialization:

type ram_type is array (0 to 255)
  of std_logic_vector(7 downto 0);
signal RAM : ram_type := (
  X"12", X"34", yadda, yadda, yadda,

constant Init_Value : std_logic_vector( 7 downto 0 )
  := x"FE";
...
process( Clk ) begin
  if RISING_EDGE( Clk ) then
    if CE = '1' then
      if WE = '1' then
        RAM (TO_INTEGER(Addr)) <= DI;
      end if;
      if rst = '1' then -- optional reset
        D_Out <= Init_Value;
      else
        D_Out <= RAM(TO_INTEGER(addr));
      end if;
    end if;
  end if;
end process;

and then just tie off WE:

WE <= '0';

What does XST do with this? It sees WE tied off, realizes this is a
ROM, not a RAM, and spits out the message:

INFO:Xst:1649 - Currently unable to implement this ROM as a read-only
block RAM. Please look for coming software updates.
Hi John,
I have seen this message before, when I successfully inferred a 256x72
block ROM for an instruction decoder. This message usually pops up when
you don't follow the template carefully. Why don't you use the template for
a ROM and just add the preset? Something like this
(warning: copied from the manual with some modifications):

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;  -- for conv_integer

entity rominfr is
  port (
    clk  : in  std_logic;
    en   : in  std_logic;
    rst  : in  std_logic;
    addr : in  std_logic_vector(4 downto 0);
    data : out std_logic_vector(3 downto 0));
end rominfr;

architecture syn of rominfr is
  type rom_type is array (31 downto 0) of std_logic_vector (3 downto 0);
  constant Init_Value : std_logic_vector(3 downto 0) := "0000";  -- your preset value
  constant ROM : rom_type :=(
    "0001","0010","0011","0100","0101","0110","0111","1000","1001",
    "1010","1011","1100","1101","1110","1111","0001","0010","0011",
    "0100","0101","0110","0111","1000","1001","1010","1011","1100",
    "1101","1110","1111","0001","0010");  -- 32 entries to fill the array
begin
  process (clk)
  begin
    if (clk'event and clk = '1') then
      if (en = '1') then
        if rst = '1' then -- optional reset
          data <= Init_Value;
        else
          data <= ROM(conv_integer(addr));
        end if;
      end if;
    end if;
  end process;
end syn;

Best regards,
M.
 
preet wrote:

As a school project I am designing a Viterbi decoder, and now I have to test it on an FPGA.
Can you please give me guidelines or pointers on how I can actually test the code?
I test code by writing a testbench
and running a simulation to
stimulate inputs and verify outputs.
This might take a long time, but
trial and error synthesis
on the bench will take longer.
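
For what it's worth, a minimal self-checking testbench skeleton looks something like the sketch below. The entity name viterbi_dec and its ports are placeholders for whatever your decoder's interface actually is, and the stimulus and expected values are made up:

library ieee;
use ieee.std_logic_1164.all;

entity tb_viterbi is
end tb_viterbi;

architecture sim of tb_viterbi is
  -- placeholder signals; replace with the real decoder interface
  signal clk  : std_logic := '0';
  signal rst  : std_logic := '1';
  signal din  : std_logic_vector(1 downto 0) := (others => '0');
  signal dout : std_logic;
begin
  clk <= not clk after 10 ns;  -- free-running 50 MHz clock

  -- device under test (hypothetical entity name and ports)
  dut: entity work.viterbi_dec
    port map (clk => clk, rst => rst, din => din, dout => dout);

  stim: process
  begin
    wait for 40 ns;
    rst <= '0';
    -- drive a known encoded sequence...
    din <= "11"; wait until rising_edge(clk);
    din <= "01"; wait until rising_edge(clk);
    din <= "10"; wait until rising_edge(clk);
    -- ...and check the decoded output against the value you expect
    assert dout = '1'   -- '1' is a placeholder expected value
      report "decoded bit mismatch" severity error;
    wait;               -- done; stop stimulating
  end process;
end sim;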

Someone gave me a clue that using RS-232 will help, but I have no clue how to transfer the data from my PC to the FPGA and how to receive the output. In which format am I supposed to send the data, and which data file? What steps are needed?
Once the code is working in simulation,
you need some way to stimulate inputs
and verify outputs in hardware.
The best way to do this depends on what devices
are already connected to the fpga on your board.
Using a terminal emulator program and a serial port
is one way to interface to a PC.
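
If you do go the serial route, the transmit side is only a few dozen lines. Below is a hedged sketch of an 8N1 UART transmitter (my own entity and port names; the divisor assumes a 50 MHz clock at 115200 baud, so adjust it for your board) that you could hook the decoder output to and watch in a terminal emulator. A matching receiver is similar, shifting bits in at the same rate after the falling edge of the start bit.

library ieee;
use ieee.std_logic_1164.all;

entity uart_tx is
  generic (
    CLK_DIV : integer := 434  -- 50 MHz / 115200 baud; adjust for your clock
  );
  port (
    clk   : in  std_logic;
    start : in  std_logic;                     -- pulse high to send one byte
    data  : in  std_logic_vector(7 downto 0);
    txd   : out std_logic;
    busy  : out std_logic
  );
end uart_tx;

architecture rtl of uart_tx is
  signal shreg   : std_logic_vector(9 downto 0) := (others => '1');
  signal bit_cnt : integer range 0 to 10 := 0;
  signal div_cnt : integer range 0 to CLK_DIV-1 := 0;
begin
  busy <= '1' when bit_cnt /= 0 else '0';

  process(clk)
  begin
    if rising_edge(clk) then
      if bit_cnt = 0 then
        txd <= '1';                            -- idle line is high
        if start = '1' then
          shreg   <= '1' & data & '0';         -- stop, data (LSB first), start
          bit_cnt <= 10;
          div_cnt <= 0;
        end if;
      elsif div_cnt = CLK_DIV-1 then
        div_cnt <= 0;
        txd     <= shreg(0);                   -- shift out LSB first
        shreg   <= '1' & shreg(9 downto 1);
        bit_cnt <= bit_cnt - 1;
      else
        div_cnt <= div_cnt + 1;
      end if;
    end if;
  end process;
end rtl;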

-- Mike Treseler
 
Maki wrote:
snip
snip
constant ROM : rom_type :=(
"0001","0010","0011","0100","0101","0110","0111","1000","1001",
etc.
Ahhh Maki,
A thousand thanks to you! This does indeed synth down to a BRAM,
with the desired synchronous reset. Of course I'd tried variations from
the XST user guide, but it turned out that I'd blundered and used a
formulation of:

signal ROM : rom_type :=(
"0001","0010","0011","0100","0101","0110","0111","1000","1001",

instead of "constant ROM", which was what impeded XST from doing its
impressive job.
So my apologies to the XST developers,
and I remain,
Just John
 
JustJohn wrote:
Maki wrote:
snip
snip
constant ROM : rom_type :=(
"0001","0010","0011","0100","0101","0110","0111","1000","1001",
etc.
Going back to my code, I actually did use the correct construction. It
turns out that my PROM is 4K deep, and that produces the annoying
message:

"Currently unable to implement this ROM as a read-only block RAM."

while Maki's example, which is only 32 words deep, produces the unhappy
result:

"The register is removed and the ROM is implemented as read-only block
RAM."

Listen up, Xilinx: this is the EXACT OPPOSITE of what almost any
designer would want. Big PROMs should go into BRAM; small PROMs should
go into distributed RAM.
My apologies for being irked by this behavior, but it seems like such a
simple thing...

While my dander is up, I'll also take this moment to gripe about the
newest 8.1i ISE GUI. Whose idea was it to add the explicit left-click
"Set as top module"? This is as buggy as anything in the GUI. Has
anybody else out there noticed that sometimes (often!) it gets
recalcitrant and refuses to put the Implement Design selection into
the Process window? I just wasted 10 minutes trying to get
the system to accept that I wanted to take a test design "Test_Maki"
through to PAR. I could click on other modules and set them as top, but
the particular one I wanted, it refused. Really bothersome. I far
preferred the way 7.1 worked: single click - set as top; double click -
open for edit. Was there a study group that decided on this particular
change? How do I get into one of those?

(Always, my apologies for complaining, but it's Sunday and here I am at
the office, instead of being out sailing on a nice sunny SoCal day).

On to Antti's suggestion.
 
Antti wrote:
Oh, there are zillions of quirks like that; you have to get used to it.

For your case you may try using a "fake" 0 or 1, e.g. a signal that is
constant but is not recognized as constant by the Xilinx flow. This is a
clever trick that helps in many different cases :)

Antti
Hey Antti,

(Thanks for your time!) Success at last. Sure, I'd thought of that,
or of tying off to an unused I/O pin, but it seemed so painfully
convoluted. After mucking about with Maki's suggestion (see the other
branch) I came back to yours. I used the RAM construct as posted at the
start of this topic, and then produced a Dummy_Zero as follows:

signal Dummy_Ctr  : UNSIGNED( 0 downto 0 );
signal Dummy_Sum  : UNSIGNED( 0 downto 0 );
signal Dummy_Zero : std_logic;
....
process( Clk ) begin
  if RISING_EDGE( Clk ) then
    Dummy_Ctr <= Dummy_Ctr + 1;
  end if;
end process;
Dummy_Sum  <= Dummy_Ctr + 1;
Dummy_Zero <= '1' when Dummy_Sum = Dummy_Ctr else '0';

One last question I'll leave the group with: does anyone know a simpler
way to produce a Dummy_Zero?

Finally, to the Xilinx folks: I know the GUI garbage is much more
Microsoft's fault than yours (they keep fixing things that aren't as
badly broken as the fix is). Keep up your noble efforts!

Regards All,
John
 
I believe that the MAXDELAY constraint applies to the actual net delay
and does not include clock->Q propagation delay or setup time. So
if your target is 400 MHz, 2.5 ns is NOT the right value.

This trick may have worked for me, but because PAR sometimes does OK
and sometimes doesn't, it's hard to say for sure that MAXDELAY is a
sure-fire work-around to force PAR to do the right thing.

It is very annoying that, having been told how to place two blocks so that
it can succeed, PAR then messes up a very simple route and misses timing.

This sure seems like low-hanging fruit that the Xilinx s/w folks could
pick.

John Providenza
 
JustJohn wrote: (See Google Groups if you want the history)

Back to being a cautionary tale... I thought I'd be clever and fake a 4K
x 8 PROM by coding a 4Kx8 RAM with initial values _and_ a set/reset
value (code below). It looked like it worked, but then I opened FPGA
Editor, and it reports that the S/R Val A is 0, rather than the value I
coded. I guess that you only get either initial values or an S/R value,
but not both. I look forward to service pack 4...

Just John

This code produces an erroneous bitstream with XST:

constant Init_Val : std_logic_vector( 7 downto 0 ) := x"06";
type RAM_Type is array ( 0 to 4095 )
  of std_logic_vector (7 downto 0);
signal RAM : RAM_Type := (x"01", x"02", x"03", 4K of yadda,
signal Dummy_Ctr   : UNSIGNED( 0 downto 0 );
signal Dummy_Sum   : UNSIGNED( 0 downto 0 );
signal Dummy_Zero  : std_logic;
signal Dummy_Input : UNSIGNED( 7 downto 0 );
....
begin -- Architecture
  process( Clk ) begin
    if RISING_EDGE( Clk ) then
      Dummy_Ctr   <= Dummy_Ctr + 1;
      Dummy_Input <= Dummy_Input + 1;
    end if;
  end process;
  Dummy_Sum  <= Dummy_Ctr + 1;
  Dummy_Zero <= '1' when Dummy_Sum = Dummy_Ctr else '0';
  process ( Clk ) begin
    if RISING_EDGE( Clk ) then
      if CE = '1' then
        if Dummy_Zero = '1' then
          RAM( TO_INTEGER( UNSIGNED( A ))) <=
            std_logic_vector( Dummy_Input );
        end if;
        if Rst_n = '0' then
          O <= Init_Val;
        else
          O <= RAM( TO_INTEGER( UNSIGNED( A )));
        end if;
      end if;
    end if;
  end process;
 
dotnetters wrote:

We're working with a Xilinx Virtex-II Pro board. As a part of our
project, we had to write a hardware stack. After having made it work,
we thought of optimizing the design and hence removed a few states,
reducing the number of states from 8 to 4. The older code was getting
synthesized in around 20 minutes, but the new code takes hours to
synthesize, and so does the PAR. How can we reduce the synthesis
time? Why is it that the code which took less time to synthesize is
now taking longer?
Try Synplicity. You may well get some insight from using another compiler -
different warnings, unexpected resource usage, and so on.
 
Austin Lesea wrote:
I wonder if there is any reason why it would be useful to compile the
Verilog for an FPGA?
I don't understand... What do you mean?
- Verilog is the problem (VHDL is better), or
- the "OpenSPARC Verilog" is the problem (another "soft CPU" would be better)?

Sandro
 
On 27 Mar 2006 06:14:16 -0800, "Sandro" <sdroamt@netscape.net> wrote:

Austin Lesea wrote:
I wonder if there is any reason why it would be useful to compile the
Verilog for an FPGA?

I don't understand... What do you mean?
- Verilog is the problem (VHDL is better), or
- the "OpenSPARC Verilog" is the problem (another "soft CPU" would be better)?
It appears the OpenSPARC Verilog is written to target an ASIC, not an
FPGA. Whilst it might be possible to get it to compile and even fit
into an FPGA, the performance would probably not be stunning.

In that sense, a different soft-cpu designed to be used on an FPGA
would probably be better.

Regards,
Allan
 
Allan Herriman wrote:

It appears the OpenSPARC Verilog is written to target an ASIC, not an
FPGA. Whilst it might be possible to get it to compile and even fit
into an FPGA, the performance would probably not be stunning.

In that sense, a different soft-cpu designed to be used on an FPGA
would probably be better.
It's interesting to see "SoftCores for Multicore FPGA
implementations" listed as an example research area that can be
explored with OpenSPARC technology, at
http://opensparc.sunsource.net/nonav/research.html. I am not sure what
area/delay/power one would end up with if this core were implemented on
an FPGA as-is. Perhaps certain enhancements/simplifications could be
made to the present core to make it useful within an FPGA. Since they
have released a variety of hardware/software tools (and their sources),
I guess it becomes possible to study the performance impact of any
architectural modifications.

Has anybody already started working on implementing this on an FPGA? I
would be very interested to know the results. I want to try to do this
but am presently hampered because I don't have Synopsys DC (which is
the recommended synthesis environment) appropriately set up.

Thanks,
Shyam
 
"John Larkin" <jjlarkin@highNOTlandTHIStechnologyPART.com> schrieb im
Newsbeitrag news:6jgg221p6iuffrbbb6dtml39fn3u9sdu4k@4ax.com...
We have a perfect-storm clock problem. A stock 16 MHz crystal
oscillator drives a CPU and two Spartan3 FPGAs. The chips are arranged
linearly in that order (xo, cpu, Fpga1, Fpga2), spaced about 1.5"
apart. The clock trace is 8 mils wide, mostly on layer 6 of the board,
the bottom layer. We did put footprints for a series RC at the end (at
Fpga2) as terminators, just in case.

Now it gets nasty: for other reasons, the ground plane was moved to
layer 5, so we have about 7 mils of dielectric under the clock
microstrip, which calcs to roughly 60 ohms. Add the chips, a couple of
tiny stubs, and a couple of vias, and we're at 50 ohms, or likely
less.

And the crystal oscillator turns out to be both fast and weak. On its
rise, it puts a step into the line of about 1.2 volts in well under 1
ns, and doesn't drive to the Vcc rail until many ns later. At Fpga1,
the clock has a nasty flat spot on its rising edge, just about halfway
up. And it screws up, of course. The last FPGA, at the termination, is
fine, and the CPU is ancient 99-micron technology or something and
couldn't care less.

Adding termination at Fpga2 helps a little, but Fpga1 still glitches
now and then. If it's not truly double-clocking then the noise margin
must be zilch during the plateau, and the termination can't help that.

One fix is to replace the xo with something slower, or to kluge a series
inductor (150 nH works) just at the xo output pin to slow the rise.
Unappealing, as some boards are in the field; they tested fine, but we're
concerned they may be marginal.

So we want to deglitch the clock edges *in* the FPGAs, so we can just
send the customers an upgrade rom chip, and not have to kluge any
boards.

Some ideas:

1. Use the DCM to multiply the clock by, say, 8. Run the 16 MHz clock
as data through a dual-rank d-flop resynchronizer, clocked at 128 MHz
maybe, and use the second flop's output as the new clock source. A
Xilinx fae claims this won't work. As far as we can interpret his
English, the DCM is not a true PLL (ok, then what is it?) and will
propagate the glitches, too. He claims there *is* no solution inside
the chip.

2. Run the clock in as a regular logic pin. That drives a delay chain,
a string of buffers, maybe 4 or 5 ns worth; call the input and output
of the string A and B. Next, set up an RS flipflop; set it if A and B
are both high, and clear it if both are low. Drive the new clock net
from that flop. Maybe include a midpoint tap or two in the logic, just
for fun.

3. Program the clock logic threshold to be lower. It's not clear to us
if this is possible without changing Vccio on the FPGAs. Marginal at
best.


Any other thoughts/ideas? Has anybody else fixed clock glitches inside
an FPGA?

John
You can run a genlocked NCO clocked from an in-fabric, on-chip ring
oscillator. Your internal recovered clock will have jitter of +/- 1 clock
period of the ring oscillator (which could be as high as about 370 MHz in
Spartan-3); you might need some sync logic to ensure the 16 MHz clock
edges are only used to adjust the NCO.

Using the DLL/DCM probably won't work, as the FAE said; DCMs require a
decent clock to operate.

Antti
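
To make idea 2 concrete in a purely synchronous form: assuming you have some fast, clean internal clock available (Antti's ring oscillator, a DCM-multiplied copy of another clean oscillator on the board, or anything else), you can treat the dirty 16 MHz line as ordinary data, oversample it, and only let a filtered copy change after several identical samples in a row, so the half-way plateau can never double-clock anything. A hedged sketch, with names and the value of N being mine:

library ieee;
use ieee.std_logic_1164.all;

entity clk_deglitch is
  generic ( N : integer := 4 );    -- consecutive disagreeing samples before switching
  port (
    fast_clk  : in  std_logic;     -- fast, clean sampling clock (assumed available)
    dirty_clk : in  std_logic;     -- the glitchy 16 MHz input, treated as data
    clean_clk : out std_logic      -- filtered copy; route via a BUFG internally
  );
end clk_deglitch;

architecture rtl of clk_deglitch is
  signal sync  : std_logic_vector(1 downto 0) := "00";  -- 2-flop resynchronizer
  signal count : integer range 0 to N := 0;
  signal filt  : std_logic := '0';
begin
  clean_clk <= filt;

  process(fast_clk)
  begin
    if rising_edge(fast_clk) then
      sync <= sync(0) & dirty_clk;   -- double-register the asynchronous input
      if sync(1) = filt then
        count <= 0;                  -- input agrees with output: nothing to do
      elsif count = N then
        filt  <= sync(1);            -- stable long enough: follow the input
        count <= 0;
      else
        count <= count + 1;
      end if;
    end if;
  end process;
end rtl;

The filtered clock picks up roughly one fast-clock period of jitter plus N cycles of delay on every edge, so this only makes sense if the fast clock is many times faster than 16 MHz.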
 
