EDK : FSL macros defined by Xilinx are wrong

ModelSim is the best simulator you can buy; it's the industry standard.
Unfortunately it's very expensive.
 
pei@uwiep.com wrote:
Hello,

Just wondering if anyone can let me know if I'm going about this the
right way -

I'm trying to implement the opencores Ethernet MAC on a Xilinx FPGA,
but the board I have has too few I/Os. So, I want to reduce the width
of the 32bit data inputs and outputs in the wishbone interface to
accommodate (I'm about 70 IOBs short). It seems like this should be
feasible since the data is delivered to the interface in 1 byte chunks.
If I work around the byte counter and write the data to the FIFO right
away without assembling the bytes into 32bits...

I'm not so keen on Verilog, so feel free to let me know if I'm going
about this all wrong, or if I'm forgetting something. Or you can tell
me I'm smoking something and that I should just not be cheap and get a
board that has enough I/Os.

thanks!

pei
You're putting the Wishbone interface externally? To connect it to what?



Sylvain
 
kha59@student.canterbury.ac.nz wrote:
Thanks Alvin,

The problem is to assign one of 5 colours to each of the integers from
1 upwards, such that if any two integers have the same colour, the
integer that they sum to must have a different colour:

eg.
If we have
1-green 2-blue 3-green

then 4 cannot be green or blue as 1+3 = 4 and 2+2 = 4.
I can follow <> green, but <> blue seems to extend your rule ?

A correct sequence of 160 digits for 5 colours is known. I wish to find
a sequence of 162 digits.
So that's a string of ~160 characters, where each character can be one
of 5 values ?

I'm doing an exhaustive search on a severely restricted subset of all
the possible sequences. The sequence is built up one colour at a time
until you get to the point where you know that somewhere down the track
there will be no possible colour, then you go back one in the sequence
and try a different colour etc.
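To make that concrete, here is a minimal, unoptimised C sketch of the
"build one colour at a time, backtrack on a dead end" search described
above. It is my own illustration, not the poster's actual program: the
real search restricts the candidate set and looks further ahead, so do
not expect this naive version to reach 162 in any sensible time.

#include <stdio.h>

#define COLOURS 5
#define TARGET  162   /* set much smaller for a quick demo run */

static int colour[TARGET + 1];   /* colour[1..n], values 1..COLOURS, 0 = unset */

/* Is colour c legal at position n, given colour[1..n-1]? */
static int legal(int n, int c)
{
    for (int i = 1; 2 * i <= n; i++) {
        int j = n - i;                 /* i + j == n, with i <= j          */
        if (colour[i] == c && colour[j] == c)
            return 0;                  /* i and j share c, so n must not   */
    }
    return 1;
}

static int extend(int n)
{
    if (n > TARGET)
        return 1;                      /* reached the target length        */
    for (int c = 1; c <= COLOURS; c++) {
        if (legal(n, c)) {
            colour[n] = c;
            if (extend(n + 1))
                return 1;
        }
    }
    colour[n] = 0;                     /* dead end: go back one and retry  */
    return 0;
}

int main(void)
{
    if (extend(1)) {
        for (int n = 1; n <= TARGET; n++)
            printf("%d", colour[n]);
        printf("\n");
    } else {
        printf("no sequence of length %d found\n", TARGET);
    }
    return 0;
}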

This is quite easy to split up: each chip should only increase the
sequence by a few digits at a time (due to memory constraints) and report
each possible final sequence back to the coordinating chip, which dishes
each of these out to other chips when they request more work. Of course
there needs to be a prioritisation of sequences such that the sequence
queue doesn't get too big (i.e. always dish out the longest sequences,
which will be exhaustively searched and therefore removed quickest).
All of this stuff isn't too hard.

To the points you made,
1) efficient communication. Each chip needs to get a sequence per unit
of work which is no biggy, but it will need to report back each
sequence it ends up at..... This could in fact be quite a few - maximum
5^(#of digits sequence was increased by) but usually a lot less. For
each of these it needs to return (sequence length)/2 bytes. I think I
will need to consider this point in some more detail...
2) fault tolerance. I wish to find a single correct sequence and
believe (hope!) that there are many of these (the expected running time is
the time to the first sequence, not the completion time for the exhaustive
search, which would in fact take 100s of years!!!), so whilst missing one
sequence due to a faulty device or communications would be bad, it wouldn't
be disastrous. I'm not trying to prove that no sequence of 162 digits
exists.
This reads a little like sorting primes.
The data set would certainly fit into a (very) small microcontroller -
you can even pack it into nibbles and consume just 80 bytes - but
the problem with many small uCs will be ensuring there are no overlaps,
or holes, in their scan coverage.
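As an aside, that nibble packing is straightforward: a colour in 1..5 fits
in 4 bits, so two colours per byte gives 160/2 = 80 bytes. A small C sketch
(function names are my own, purely for illustration):

#include <stdio.h>
#include <stdint.h>

/* Pack two 5-valued colours (1..5, each fits in 4 bits) per byte:
 * a 160-digit sequence then occupies 160/2 = 80 bytes. */
static void pack_nibbles(const uint8_t *colours, int n, uint8_t *out)
{
    for (int i = 0; i < n; i += 2) {
        uint8_t hi = colours[i] & 0x0F;
        uint8_t lo = (i + 1 < n) ? (colours[i + 1] & 0x0F) : 0;
        out[i / 2] = (uint8_t)((hi << 4) | lo);
    }
}

static uint8_t unpack_nibble(const uint8_t *packed, int i)
{
    uint8_t b = packed[i / 2];
    return (i % 2 == 0) ? (uint8_t)(b >> 4) : (uint8_t)(b & 0x0F);
}

int main(void)
{
    uint8_t colours[6] = { 1, 5, 3, 2, 4, 1 };
    uint8_t packed[3];
    pack_nibbles(colours, 6, packed);
    for (int i = 0; i < 6; i++)
        printf("%d ", unpack_nibble(packed, i));   /* prints 1 5 3 2 4 1 */
    printf("\n");
    return 0;
}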

ie the task is simple enough, but multi-uC management is likely to
be a nightmare.

Something like i2c for the backplane is also likely to be a serious
bottleneck.


I don't know anything about FPGAs or how these would apply, do you
happen to have some useful links?
Look at Altera, Lattice, Xilinx - there are many demo/eval boards and
tool sets.
Also look at the Soft CPUs : Xilinx PicoBlaze, and Lattice Mico8

FPGAs can do hugely parallel tasks, and on a small data set like this,
you have no memory bandwidth issues.

With an FPGA, you could do exclusion mapping - that is, do not store the
Colour@integer, but instead have an array of N x 5 booleans, which are
the excluded colours. [ALL 5 => Whoops, go back!]
An FPGA could scan ahead for all exclusions very efficiently indeed.
One of the small soft CPUs could manage the re-seed process.
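A rough software model of that exclusion-mapping idea (in C only for
clarity; in an FPGA the N x 5 exclusion bits and the scan would be parallel
hardware, and every name below is my own illustration):

#include <stdio.h>

#define COLOURS 5
#define MAXLEN  200

static int colour[MAXLEN + 1];                 /* 1..5, 0 = unset            */
static int excl[2 * MAXLEN + 2][COLOURS + 1];  /* excl[n][c] > 0 => c banned */

/* Assign colour c at position p and record the exclusions it creates:
 * every earlier k with colour[k] == c (including k == p, since p + p = 2p)
 * forbids c at position p + k. */
static void assign(int p, int c)
{
    colour[p] = c;
    for (int k = 1; k <= p; k++)
        if (colour[k] == c)
            excl[p + k][c]++;
}

static void retract(int p)
{
    int c = colour[p];
    for (int k = 1; k <= p; k++)
        if (colour[k] == c)
            excl[p + k][c]--;
    colour[p] = 0;
}

/* All five colours excluded at p?  "Whoops, go back!"
 * In hardware this check could run over all look-ahead positions at once. */
static int dead_end(int p)
{
    for (int c = 1; c <= COLOURS; c++)
        if (excl[p][c] == 0)
            return 0;
    return 1;
}

int main(void)
{
    assign(1, 1);              /* 1 - green                               */
    assign(2, 2);              /* 2 - blue                                */
    assign(3, 1);              /* 3 - green                               */
    /* position 4 now has green excluded (1+3) and blue excluded (2+2)    */
    printf("green banned at 4: %d, blue banned at 4: %d, dead end: %d\n",
           excl[4][1] > 0, excl[4][2] > 0, dead_end(4));
    (void)retract;             /* retract() would undo the last assign    */
    return 0;
}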

<paste>
I've got a pure math problem implemented in C that will take about 3
years to solve using all 5pcs available to me (the algorithm is about
as efficient as it will get without some major mathematical insights).
The algorithm is always where the biggest speed gains can be made,
especially in efficiently mapping the algorithm to the hardware it runs
on.

In an FPGA you could set up 'algorithm races', where (e.g.) you code
4 algorithms in ~1/4 of the chip each, run them for a couple of days,
and compare their Attained String Lengths.

If the present best is a length of 160, don't just think about 162, look
to smash it ! :)

I've added comp.arch.fpga, as this really sounds more like a FPGA+smart
algorithm, than a "sea of uC" problem.


-jg
 
Peter Alfke wrote:
Hi, Jim..
We stopped after a week because we were satisfied. In one week, we
proved 10e14, it would take 10 weeks to prove 10e15, and 2 years to
prove 10e16. Diminishing returns...But we definitely did NOT stop
because we found an error. No cheating on my watch!

For some strange reason (fixed in "Virtex-5") there is a
one-clock-pulse latency for FULL. I suggest using ALMOST FULL instead.
FULL is not as important as EMPTY, since a properly designed system
should never overflow the FIFO, whereas it might be nice to empty it
completely. (I often use the savings-account analogy).
Wow! They pay you so much you have to worry about overflow of
your saving account ?! ;)

-jg
 
That's exactly the point. You want to be able to move to another bank
or leave town, and get your last cent or penny out of the account. But
you really don't worry about overflow.
I know you understood...
And you are right: the pay is OK, considering the fun I am still
having...
Peter
 
"Jim Granville" <no.spam@designtools.co.nz> wrote in message
news:434ee49c$1@clear.net.nz...
<full quote of the kha59 / Jim Granville exchange above snipped>
Hi,

The blue rule seems ok: 2 + 2 == 4.

I'd certainly favour FPGAs for this. The biggest issue would still be the
memory bandwidth: each time you split some work over different processing
nodes, you need to communicate the present N characters, while testing for
an expansion could easily take less than O(N) time (depending on how the
algorithm behaves), possibly leaving some processing nodes starved for data
while others spend more time on communicating than on processing.

My first guess is that a single dual-ported RAM per processing node should
suffice. That allows many processing nodes on a modern FPGA. Maybe if I
change the memory architecture that I have in mind, some communication
overhead could be eliminated. Oops ... I'm trying to solve it! Can I? :)

Alvin.
 
Dave Pollum wrote:
Jim Granville wrote:
Peter's stuff snipped

Interesting - so for sustained thru-put on these, you are best to avoid
going empty, which probably means two operating modes : fastest, and
clean-out-the-last-byte(s)
I see some UARTs have WDOGs in their FIFOs, which allow simpler streaming
code, and they generate a time-out interrupt, as well as the normal
threshold one.
The timeout is normally some multiple of CHAR times, so the end-of-message
chars are dealt with without needing polling.

-jg

We upgraded a Zilog Z8530 serial port to a Z85230 serial port, because
the Z85230 has deeper FIFOs. The 85230's recv port has an 8-byte FIFO
vs the 8530's 3-byte FIFO. I had hoped that the CPU would be
interrupted a lot less using the 85230. The 85230 can be set-up to
interrupt when 1 char is recv'd or when there are 4 bytes in the recv
FIFO (half full). This sounded really great and even worked quite
well, until I discovered that when there are _3_ bytes in the FIFO and the
chip is set up to interrupt when half-full, the chip does not
interrupt until another byte is received, even if that's minutes later.
ARGH, WIPA!

BTW what's a "WDOG"?
Sorry, cryptic mode... WDOG = WatchDog = a monostable timer that
retriggers on every incoming CHAR, and times out after some user-defined
number of CHAR times.
Purpose is to catch exactly the PITA you describe :)

-jg
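To illustrate how the two interrupt sources are typically used together,
here is a sketch of a receive ISR. Every register address and bit name
below is hypothetical, invented for the example; it is not the Zilog part
or any particular UART, just the threshold + character-timeout scheme:

#include <stdint.h>

/* Hypothetical UART registers -- addresses and bit names are made up. */
#define UART_STATUS        (*(volatile uint8_t *)0x40001000u)
#define UART_DATA          (*(volatile uint8_t *)0x40001004u)
#define STAT_RX_THRESHOLD  0x01u   /* FIFO reached the half-full mark     */
#define STAT_RX_TIMEOUT    0x02u   /* RX idle for N character times       */
#define STAT_RX_NOT_EMPTY  0x04u

extern void push_rx_byte(uint8_t b);   /* application buffer, assumed     */

void uart_rx_isr(void)
{
    uint8_t status = UART_STATUS;

    /* Both the threshold interrupt (bulk traffic) and the watchdog/timeout
     * interrupt (trailing bytes of a message) are serviced the same way:
     * drain whatever the FIFO currently holds. The timeout source is what
     * rescues the "3 bytes sitting there for minutes" case. */
    if (status & (STAT_RX_THRESHOLD | STAT_RX_TIMEOUT)) {
        while (UART_STATUS & STAT_RX_NOT_EMPTY)
            push_rx_byte(UART_DATA);
    }
}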
 
Jim, you wrote:
"Interesting - so for sustained thru-put on these, you are best to
avoid
going empty, which probably means two operating modes : fastest, and
clean-out-the-last-byte(s)"

I don't get it. Are you referring to the fact that you effectively lose
a few read clock cycles after having gone EMPTY? Why would that be a
problem? I think it helps the state of the FIFO, so that it does not go
empty so often....
Although that is no problem, as demonstrated by our 200 MHz/500 MHz
test.

I do not see any problem. You seem to. Let me know the reason.
Peter Alfke
 
Peter,
I'm just trying to understand you. You are right - the FIFO's width is in
groups of bits and not bytes - but let me ask you what kind of problem
you can see. For example:
1. the memory size is 2048 words of n bits
2. the read and write addresses are 12 bits wide
3. the full indication is generated when the FIFO is filled up to 2047 (or
2048, in the case where it is generated with a delay of 1 write clock cycle)
4. Now, suppose the FIFO is filled up to 2048 (not 2047) and the full
indication is activated - no write cycles are allowed
   i. during the next read operation, the full indication stays high and
   the FIFO is filled down to 2047
   ii. after an additional read operation the full indication is cleared

Why do you think it is better to use ALMOST FULL, with a latency of 1
to 3 clock cycles, rather than the real FULL indication, which works properly?
 
Let me explain.
Suppose you have written enough data to fill the FIFO completely (2048
words in your case).
You will not yet see the FULL flag go active immediately, but rather
see it go active one write clock tick later.
If (and only if) you use this clock to write more data into the FIFO,
you will "write into mid-air", the FIFO will never receive this entry.
So this problem only occurs when you write on the two consecutive
clocks.

Applications work-around:
Run the write clock at twice the write rate, and assert WE only on every
other clock tick.
Or, as I wrote before, use ALMOST FULL as a write inhibitor.

FULL then goes inactive a few write clock ticks after something has
been read out of the FIFO ( just like on the read side).
Also remember: The flag manipulation uses every one of its clock ticks
(read for EMPTY, write for FULL), whether enabled or not. That improves
the flag response time. And the clocks can be up to 500 MHz.
I hope this makes it clear.

Peter Alfke, Xilinx Applications
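A tiny cycle-by-cycle C model of what Peter describes may help. This is a
toy, not the real FIFO16 logic: the user-visible FULL flag lags the true
occupancy by one extra write clock, so a writer that writes on consecutive
clocks pushes one word "into mid-air", whereas gating writes with an
ALMOST_FULL threshold (strategy B) avoids the window:

#include <stdio.h>

#define DEPTH 16

int main(void)
{
    int count = 0, lost = 0;
    int full_flag = 0;      /* user-visible FULL: one write clock late      */
    int full_d1 = 0;        /* where an "on time" FULL would have been      */
    int almost_full = 0;    /* ordinary registered flag, DEPTH-1 threshold  */

    for (int cycle = 0; cycle < DEPTH + 4; cycle++) {
        int we = !full_flag;          /* strategy A: gate the write with FULL */
        /* int we = !almost_full; */  /* strategy B: ALMOST FULL as inhibitor */

        if (we) {
            if (count < DEPTH)
                count++;              /* word accepted                        */
            else
                lost++;               /* FIFO already full: word is dropped   */
        }

        /* Flag registers, updated on every write clock, enabled or not.      */
        full_flag   = full_d1;
        full_d1     = (count == DEPTH);
        almost_full = (count >= DEPTH - 1);
    }

    printf("words lost when gating with FULL: %d\n", lost);  /* prints 1 */
    return 0;
}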
 
Tim,

http://klabs.org/richcontent/MAPLDCon02/abstracts/ahlquist_a.pdf

is just one of thousands of ways to design a multiplier. This one is
interesting, as they do it in (our) FPGA.

Google: designing multipliers

Depending on what you want (widths, latencies, etc.) you can go from a
serial implementation (extremely cheap, but takes a huge number of
cycles), to a massively parallel, with critical stage pipeline registers
(very expensive, but also very fast, with a low latency).
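For the curious, the cheap serial scheme is essentially shift-and-add, one
partial product per cycle, so an N-bit multiply takes N iterations; the
fully parallel array evaluates all N partial products at once. A C model of
the serial idea (my own sketch, not anyone's actual hardware):

#include <stdint.h>
#include <stdio.h>
#include <assert.h>

/* Bit-serial shift-and-add: one multiplier bit examined per "cycle". */
static uint32_t serial_mul16(uint16_t a, uint16_t b)
{
    uint32_t acc = 0;
    uint32_t addend = a;              /* shifted copy of the multiplicand */

    for (int cycle = 0; cycle < 16; cycle++) {
        if (b & 1u)                   /* add the partial product, if any  */
            acc += addend;
        addend <<= 1;
        b >>= 1;
    }
    return acc;
}

int main(void)
{
    assert(serial_mul16(1234, 5678) == 1234u * 5678u);
    printf("%u\n", serial_mul16(1234, 5678));
    return 0;
}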

And, basically, if you can think of it, it has probably been done, more
than once, with at least four or five papers written on it (with
Master's and PhD degrees trailing behind) in each technology generation.

Xilinx chose to implement a simple 18X18 multiplier starting in Virtex
II to facilitate our customers' designs. As we have gone on since then,
we have graduated to a more useful MAC in Virtex 4, but still keeping
its function generic so it is useful across the widest range of
applications.

There are many on this group who are experts in this area (both building
multipliers in ASIC form, and building them using LUTs in FPGAs).

I am sure they will offer up their comments.

Austin

Tim Wescott wrote:

Jeorg's question on sci.electronics.design for an under $2 DSP chip got
me to thinking:

How are 1-cycle multipliers implemented in silicon? My understanding is
that when you go buy a DSP chip a good part of the real estate is taken
up by the multiplier, and this is a good part of the reason that DSPs
cost so much. I can't see it being a big gawdaful batch of
combinatorial logic that has the multiply rippling through 16 32-bit
adders, so I assume there's a big table look up involved, but that's as
far as my knowledge extends.

Yet the reason that you go shell out all the $$ for a DSP chip is to get
a 1-cycle MAC that you have to bury in a few (or several) tens of cycles
worth of housekeeping code to set up the pointers, counters, modes &c --
so you never get to multiply numbers in one cycle, really.
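For reference, the kernel being paid for here is just the FIR MAC loop
below (a generic C sketch, not any particular DSP's code); the pointer,
counter and mode setup Tim mentions is everything that has to happen around
this loop for each block of samples:

#include <stdint.h>

#define TAPS 128

/* Generic 128-tap FIR kernel: one MAC per tap. On a DSP with a 1-cycle MAC
 * the inner loop ideally costs TAPS cycles; the housekeeping overhead is
 * whatever surrounds it. */
static int32_t fir_sample(const int16_t coeff[TAPS],
                          const int16_t delay[TAPS])
{
    int32_t acc = 0;
    for (int k = 0; k < TAPS; k++)
        acc += (int32_t)coeff[k] * delay[k];
    return acc;
}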

How much less silicon would you use if an n-bit multiplier were
implemented as an n-stage pipelined device? If I wanted to implement a
128-tap FIR filter and could live with 160 ticks instead of 140 would
the chip be much smaller?

Or is the space consumed by the separate data spaces and buses needed to
move all the data to and from the MAC? If you pipelined the multiplier
_and_ made it a two- or three- cycle MAC (to allow time to shove data
around) could you reduce the chip cost much? Would the amount of area
savings you get allow you to push the clock up enough to still do audio
applications for less money?

Obviously any answers will be useless unless somebody wants to run out
and start a chip company, but I'm still curious about it.
 
Peter,
Let me correct you. The FIFO depth is defined as 2047. The memory depth
is 2048. If the full indication goes high with a delay of 1 write clock, the
new word is stored into the extra cell (let's say 2048). No data is lost.
 
Kolja Sulimma wrote:
Bevan Weiss schrieb:
Kolja Sulimma wrote:

Bevan Weiss wrote:

Getting single cycle high speed multipliers is a very challenging
prospect, and one on which much research is still ongoing.
Actually, if you cannot do full custom circuit optimizations
(e.g. because you do standard cell design or because you are using
LUTs in an FPGA) swapping wires is the only possible structural
optimization. All other multiplier transformations can be reduced to
swaps.

An extremely nice property of swapping wires is, that it can be done
after placement. This is such a huge advantage that we were able to beat
sophisticated multiplier generators with a simple greedy algorithm when
applying it after placement:
http://eis.eit.uni-kl.de/eis/research/publications/papers/iccd04.pdf

I was referring to custom design, not the use of standard cells or
FPGAs. It is certainly obvious that if you can't design your cells from
scratch then you're just arranging the cells that you have available.

What is that supposed to mean?
Even if your standard cell library consists of only a NAND-gate in one
size there are still many degrees of freedom in circuit design.
For many design problems there are architectures that trade off the
number of cells for power or speed.
Not so for single cycle multipliers. For any practical multiplier size
the number of 1-bit adders is fixed and there exists a complete set
of transformations to automatically reach all possible setups even after
placement.
So you're saying it makes no difference if Booth encoding is used, or
any form of carry ripple reduction? That it's all just a rearranging of
wires? Surely not; using a Booth encoder requires different components
to a simple ripple counter, and so breaks that theory.
 
Bevan Weiss schrieb:
<quote of the exchange above snipped>
You are right, my definition was not exact enough. What I wrote applies
to anything that happens after partial product generation.

Carry ripple reduction does not apply to single cycle multipliers. You
need to sum up all carries at the end.
Producing a redundant number representation at the output does not
count, because now you changed the function computed by the circuit.

Kolja Sulimma
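For readers unfamiliar with the term, a redundant number representation
here typically means a carry-save (sum, carry) pair. A small C illustration
of a 3:2 compressor and the carry-propagate addition it defers (my own
example, not taken from the paper):

#include <stdint.h>
#include <stdio.h>
#include <assert.h>

/* 3:2 carry-save compressor: reduces three operands to a redundant
 * (sum, carry) pair without propagating carries. The final result still
 * needs one ordinary carry-propagate addition -- stopping at the redundant
 * form changes the function the circuit computes. */
static void csa(uint32_t x, uint32_t y, uint32_t z,
                uint32_t *sum, uint32_t *carry)
{
    *sum   = x ^ y ^ z;                           /* bitwise sum, no carries  */
    *carry = ((x & y) | (x & z) | (y & z)) << 1;  /* carries, shifted up      */
}

int main(void)
{
    uint32_t s, c;
    csa(1000, 2000, 3000, &s, &c);
    assert(s + c == 1000u + 2000u + 3000u);       /* carry-propagate add last */
    printf("%u\n", s + c);
    return 0;
}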
 
Weng Tianxiang schrieb:

I found that in Xilinx patents, all lookup table equations are
described as AND/OR/multiplexer circuits in the claims. Describing a
logic connection for a lookup table in claims is much more complex in
English than presenting an equivalent logic equation.

For example, a lookup table has the equation:
Out <= (A*B) + (C*D);
It is much more concise and simpler than describing the circuit in
AND/OR gate circuits.
After all it is a LUT, so why not describe it as a LUT? List the output
for all 16 input combinations.
It is not more concise, but it is simpler than doing it in English.

Kolja Sulimma
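To make that suggestion concrete, here is the 16-entry listing for the
example equation, generated by a trivial C loop (just an illustration of
"describe the LUT by its contents"):

#include <stdio.h>

/* Enumerate all 16 input combinations of the 4-input LUT whose equation is
 * Out = (A AND B) OR (C AND D); a claim could simply list this truth table
 * instead of an AND/OR/multiplexer circuit. */
int main(void)
{
    printf(" A B C D | Out\n");
    for (int i = 0; i < 16; i++) {
        int a = (i >> 3) & 1, b = (i >> 2) & 1;
        int c = (i >> 1) & 1, d = i & 1;
        int out = (a & b) | (c & d);
        printf(" %d %d %d %d |  %d\n", a, b, c, d, out);
    }
    return 0;
}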
 
porterboy76@yahoo.com wrote:
glen herrmannsfeldt wrote:

porterboy76@yahoo.com wrote:


I am looking for the homepage of Xilinx Research Labs, but Google is
not helping me. Does anybody know if they even have a homepage.
Hi,

They used to have one (well, Satnam Singh had his page until he left
Xilinx).

I'd like to know what type of research they do at Xilinx, whether it is all
at the solid state and IC level, or whether they undertake higher level
algorithmic research as well.
From what I know (i.e. from an academic perspective), Xilinx folks mostly
focus on higher level problems (system level design, hardware compilation,
runtime reconfiguration, etc.).

If you want more details, have a look at some FPGA-related
academic conference proceedings such as FPL or FCCM; you will
probably find some papers by people from Xilinx.

Besides, I am sure that Peter Alfke and Austin Lesea will be glad to
answer your questions.


Post to comp.arch.fpga and ask there.

-- glen
 
jaxato@gmail.com> wrote in message
news:1133913172.512557.30990@o13g2000cwo.googlegroups.com...
Hi, I was just wondering some technicalities about a board.

I've got the XUP Virtex-II Pro from Digilent and I believe it was
designed by the XRL.
Well, I was wondering what kind of CAD (schematic capture + PCB layout)
software the team used. And also, out of curiosity, what PCB
manufacturer they use and the number of layers involved. A friend of
mine thinks it's around 12, which I doubt. So to make things clear, I
am asking the experts here.

Thanks for the tip, but I'm impressed by the job and I wanna know what
it takes to do that.

JA
if you want to 'see' the technical details then take a look at the ML300
on the Xilinx website; I think there are Gerber files provided for it, and
you can use some Gerber viewer to look at the layers :)
but the ML300 was designed by SEG, not XRL

Antti
 
Stephen wrote:
Ps we don't have an external web page, yet, but we are evaluating
options, so please send us suggestions if there's something specific
you'd like to see.
My 2 cents,

Maybe electronic versions of Xilinx publications in academic
conferences, for those who are not registered with the ACM/IEEE digital
libraries.

Regards,

Steven

 
Thank you Steven,
I will make sure this happens, assuming we post an external web site
and that the papers aren't copyrighted by any ACM/IEEE conference. Good
suggestion for us to follow up on.
Regards,
Stephen


 
Stephen wrote:
<Stephen's reply above snipped>
IEEE have no problem with their articles being made available provided
a covering copyright document is included (electronically!), available
from their website. Xilinx's IEEE documents would be very useful in a
centralised location. The website might also provide the names of
Xilinx's research interests/projects/topics, even if it does not
provide exact details... obviously you don't want to give the game away,
but it would be nice to know what Xilinx is interested in, in general.

Cheers
Porterboy
 
