same RTL on two same boards giving different behaviour

salimbaba

Guest
Hi,
I am using spartan3 xc3s4000 custom board in my design interfaced with a
national PHY DP83865, xilinx 12.3 for synthesis and implementation and i'm
facing a strange problem. I run the same RTL on two boards and it behaves
differently on both of them. Has anyone faced this issue before ? Does
anyone know why it's happening ?



Thanks

Regards
Salimbaba

---------------------------------------
Posted through http://www.FPGARelated.com
 
On Apr 17, 11:06 am, "salimbaba"
<a1234573@n_o_s_p_a_m.n_o_s_p_a_m.owlpic.com> wrote:
Hi,
I am using spartan3 xc3s4000 custom board in my design interfaced with a
national PHY DP83865, xilinx 12.3 for synthesis and implementation and i'm
facing a strange problem. I run the same RTL on two boards and it behaves
differently on both of them. Has anyone faced this issue before ? Does
anyone know why it's happening ?

Thanks

Regards
Salimbaba          

The most common reason is timing failures in the design due to:
- Lack of any timing constraints
- Missing IO timing constraints for device-to-device timing
- Data transfers between asynchronous clocks without synchronization
- Use of asynchronous resets
- Latches in the design due to HDL errors

I would start with reviewing all of the WARNINGs created by the
synthesizer and ISE tools and then move on to the unconstrained paths
reported by the timing analyzer.
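As a rough aid for the warning review suggested above, here is a minimal sketch that tallies warnings in a saved report and pulls out any that mention latches. It assumes each warning sits on one line shaped like `WARNING:<Tool>:<id> - <message>`; check that against your own report files. The sample text and the numeric IDs in it are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape: "WARNING:<Tool>:<id> - <message>". Adjust if your reports differ.
WARNING_RE = re.compile(r"^WARNING:(\w+):(\d+)\s*-\s*(.*)$")

def tally_warnings(report_text):
    """Count warnings per (tool, id) and collect the ones that mention latches."""
    counts = Counter()
    latch_lines = []
    for line in report_text.splitlines():
        match = WARNING_RE.match(line.strip())
        if not match:
            continue
        tool, warning_id, message = match.groups()
        counts[(tool, warning_id)] += 1
        if "latch" in message.lower():
            latch_lines.append(line.strip())
    return counts, latch_lines

# Made-up sample text, for illustration only (IDs are not real message numbers).
SAMPLE = """\
INFO:Xst:1 - something informational
WARNING:Xst:100 - Found 1-bit latch for signal <rx_state>
WARNING:Xst:200 - Signal <dbg> is assigned but never used
WARNING:Xst:200 - Signal <tmp> is assigned but never used
"""

counts, latches = tally_warnings(SAMPLE)
for (tool, warning_id), n in counts.most_common():
    print(f"{tool}:{warning_id} x{n}")
print("latch warnings:", len(latches))
```

To use it on a real run, read the report file into a string and pass it to `tally_warnings` in place of `SAMPLE`.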

Ed McGettigan
--
Xilinx Inc.
 
Do you know that both boards are built correctly?
Could it be a board hardware fault?
Does either board behave 'correctly'?


 

Yes, the boards are built correctly i suppose. I haven't been able to find
the fault so far. I have changed the timing constraints, as tight as they
can get, and as loose as they can get, still no success. The RTL works
smoothly on one board, and my packets are transferred fine. Whereas same
RTL doesn't give the same behavior on other 3 boards. It is basically a
Gigabit MAC RTL i have written.

What happens on other boards is that packets are transferred smoothly and
after a minute or so, they start dropping. Had it been buffer overflow
issue, it should have occurred on the 1st board as well since i have tested
the first board with full load.

I changed the PHYs today, but to no avail. I am clueless now. Any pointers
will be of great help.

And about timing constraints, i have done many iterations and almost all of
them work on the FIRST board and fail to deliver same performance on other
boards.

Kindly help me out.
Thanks..

Regards


 
On Mon, 25 Apr 2011 15:26:25 -0500, salimbaba wrote:



Yes, the boards are built correctly i suppose. I haven't been able to
find the fault so far. I have changed the timing constraints, as tight
as they can get, and as loose as they can get, still no success. The RTL
works smoothly on one board, and my packets are transferred fine.
Whereas same RTL doesn't give the same behavior on other 3 boards. It is
basically a Gigabit MAC RTL i have written.

What happens on other boards is that packets are transferred smoothly
and after a minute or so, they start dropping. Had it been buffer
overflow issue, it should have occurred on the 1st board as well since i
have tested the first board with full load.

I changed the PHYs today, but to no avail. I am clueless now. Any
pointers will be of great help.

And about timing constraints, i have done many iterations and almost all
of them work on the FIRST board and fail to deliver same performance on
other boards.

Kindly help me out.
Thanks..
A Gb Ethernet MAC/PCS/PHY combination is likely to involve clock domain
crossings. The PHYs negotiate with their link partner to decide on one
end as being the master clock. The MACs, OTOH, will usually be clocked
by the local source. There will be typically a few (+/-200 in the worst
case) ppm difference between the clocks. This will be compensated for by
either adding or deleting words of interframe gap. IIRC this happens in
the PCS.
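To put numbers on the ppm figure, a back-of-envelope sketch. It assumes a 125 MHz byte clock (1000 Mb/s over an 8-bit interface) and a 1518-byte maximum frame; the 1024-byte margin is just an example value, not anything from the poster's design.

```python
def slip_bytes_per_second(byte_clock_hz, ppm):
    """Bytes gained or lost per second between two clocks that differ by 'ppm'."""
    return byte_clock_hz * ppm / 1e6

def seconds_until_slip(margin_bytes, byte_clock_hz, ppm):
    """Time for an uninterrupted stream to drift by 'margin_bytes'."""
    return margin_bytes / slip_bytes_per_second(byte_clock_hz, ppm)

GMII_BYTE_CLOCK_HZ = 125_000_000   # assumed: 1000 Mb/s = 125 Mbyte/s
WORST_CASE_PPM = 200               # worst-case figure quoted above
MAX_FRAME_BYTES = 1518             # assumed: maximum frame without VLAN tag

rate = slip_bytes_per_second(GMII_BYTE_CLOCK_HZ, WORST_CASE_PPM)
per_frame = MAX_FRAME_BYTES * WORST_CASE_PPM / 1e6

print("drift, bytes per second:", rate)
print("drift over one max-length frame, bytes:", per_frame)
print("seconds to drift 1024 bytes:",
      seconds_until_slip(1024, GMII_BYTE_CLOCK_HZ, WORST_CASE_PPM))
```

The point of the arithmetic: within a single frame the drift is a fraction of a byte, so a small elastic buffer is enough, but only if the slack is recovered in the interframe gap. If nothing is ever added or deleted, the drift accumulates at 25,000 bytes per second in the worst case.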

The boards are supposedly identical, but the clock frequencies will be
slightly different. I suggest looking for a bug in your rate adaptation
circuit. Alternatively, you could add frequency counters to your design
(that measure the PHY clock w.r.t. the local clock) and see if the
direction of clock error correlates with the errors you see.
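A small model of the frequency-counter idea, to show how a raw count turns into a signed ppm figure. It assumes both clocks are nominally the same frequency and that the hardware counts PHY clock edges during a gate of a fixed number of local clock cycles; the function names and the example frequencies are hypothetical.

```python
def counter_reading(measured_hz, ref_hz, gate_ref_cycles):
    """Edges of the measured clock seen while the reference counts 'gate_ref_cycles'."""
    return measured_hz * gate_ref_cycles // ref_hz

def ppm_offset(count, gate_ref_cycles):
    """Signed offset of the measured clock relative to the reference, in ppm.

    Positive means the measured (PHY) clock is faster than the local clock.
    """
    return (count - gate_ref_cycles) / gate_ref_cycles * 1e6

REF_HZ = 125_000_000          # assumed local clock
GATE = 125_000_000            # one-second gate at the assumed local clock

fast_count = counter_reading(125_006_250, REF_HZ, GATE)   # PHY clock 50 ppm fast
slow_count = counter_reading(124_993_750, REF_HZ, GATE)   # PHY clock 50 ppm slow

print("fast board offset, ppm:", ppm_offset(fast_count, GATE))
print("slow board offset, ppm:", ppm_offset(slow_count, GATE))
```

Reading the sign on each board and comparing it with which boards drop packets is the correlation check described above.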

Regards,
Allan
 
On Apr 25, 1:26 pm, "salimbaba"
<a1234573@n_o_s_p_a_m.n_o_s_p_a_m.owlpic.com> wrote:

Yes, the boards are built correctly i suppose. I haven't been able to find
the fault so far. I have changed the timing constraints, as tight as they
can get, and as loose as they can get, still no success. The RTL works
smoothly on one board, and my packets are transferred fine. Whereas same
RTL doesn't give the same behavior on other 3 boards. It is basically a
Gigabit MAC RTL i have written.

What happens on other boards is that packets are transferred smoothly and
after a minute or so, they start dropping. Had it been buffer overflow
issue, it should have occurred on the 1st board as well since i have tested
the first board with full load.

I changed the PHYs today, but to no avail. I am clueless now. Any pointers
will be of great help.

And about timing constraints, i have done many iterations and almost all of
them work on the FIRST board and fail to deliver same performance on other
boards.

Kindly help me out.
Thanks..

Regards

It could be due to PPM differences between the clock sources on your
boards where the two that are exhibiting a problem are either faster
or slower than the one that works well generating overruns or
underruns. If your code isn't sending idles or recognizing idles and
removing them this could be a problem.

Or it could be due to two other issues that I described earlier that
you haven't mentioned addressing:

- Data transfers between asynchronous clocks without synchronization
- Latches in the design due to HDL errors

Ed McGettigan
--
Xilinx Inc.
 
Hey, data transfers between asynchronous clocks are synced and there are no
latches in the design. The behavior of the RTL is a little weird on the
boards on which it is not working, let me explain if i may, i program a
fresh FPGA (FPGA that hasn't been turned on for a while), packets seem to
go smoothly, but after few seconds, packets start to drop, and eventually
within a minute from start, packets stop going altogether =\

Since i have 2 interfaces on the board (interface A and interface B), and
whatever i receive on A, i transmit it on B, this thing works fine. But the
packets i receive on B and transmit on A, that's where the problem exists.

kindly, give me some pointers so that i can debug the problem asap as it
has started to get on my nerves =\


Thanks a lot

regards,

 
On Apr 27, 11:59 am, "salimbaba"
<a1234573@n_o_s_p_a_m.n_o_s_p_a_m.owlpic.com> wrote:

Hey, data transfers between asynchronous clocks are synced and there are no
latches in the design. The behavior of the RTL is a little weird on the
boards on which it is not working, let me explain if i may, i program a
fresh FPGA (FPGA that hasn't been turned on for a while), packets seem to
go smoothly, but after few seconds, packets start to drop, and eventually
within a minute from start, packets stop going altogether =\

Since i have 2 interfaces on the board (interface A and interface B), and
whatever i receive on A, i transmit it on B, this thing works fine. But the
packets i receive on B and transmit on A, that's where the problem exists..

kindly, give me some pointers so that i can debug the problem asap as it
has started to get on my nerves =\

Thanks a lot

regards,          

How many levels of register synchronization did you use for the
asynchronous signals?
Please post the Device Utilization Summary portion of the PAR report
file.
How many WARNING messages are present in your synthesis report file?


You said that a "fresh FPGA" works at the beginning and then starts to
fail. Does this mean that if you reprogram the part after it has been
running for a while it fails immediately?
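On why the number of synchronizer stages matters: the usual first-order estimate is MTBF = exp(t_resolve / tau) / (T0 * f_clk * f_data), where t_resolve is the time the first flop has to settle before its output is used. A minimal sketch follows. The tau and T0 values are placeholders chosen only to make the arithmetic visible; they are NOT Spartan-3 characterization data, and the clock and data rates are likewise assumed.

```python
import math

def synchronizer_mtbf(t_resolve_s, tau_s, t0_s, f_clk_hz, f_data_hz):
    """First-order mean time between synchronizer failures, in seconds."""
    return math.exp(t_resolve_s / tau_s) / (t0_s * f_clk_hz * f_data_hz)

TAU = 100e-12        # placeholder metastability time constant, not device data
T0 = 1e-9            # placeholder metastability window, not device data
F_CLK = 125e6        # assumed sampling clock
F_DATA = 1e6         # assumed toggle rate of the asynchronous signal
PERIOD = 1.0 / F_CLK

# One flop: only the slack left after routing and logic (assumed 2 ns).
one_stage = synchronizer_mtbf(2e-9, TAU, T0, F_CLK, F_DATA)
# Two flops: roughly one extra clock period of settling time.
two_stage = synchronizer_mtbf(2e-9 + PERIOD, TAU, T0, F_CLK, F_DATA)

print("one stage MTBF, s:", one_stage)
print("two stage MTBF, s:", two_stage)
```

Whatever the real constants are, each added stage multiplies the estimate by exp(period / tau), and anything that shrinks t_resolve (slower routing when hot, for example) cuts it exponentially.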

Ed McGettigan
--
Xilinx Inc.
 
On Apr 27, 2:59 pm, "salimbaba"
<a1234573@n_o_s_p_a_m.n_o_s_p_a_m.owlpic.com> wrote:

Hey, data transfers between asynchronous clocks are synced and there are no
latches in the design.

- Are the clocks all directly from either the input pin receiving an
oscillator or the output of an internal PLL/DLL? (The only correct
answer is 'yes')
- Are any clocks generated internally to the design using either logic
or flip flops? (The only correct answer is 'no')

If your design does not have the correct answers as indicated above,
then that is the source of your timing problem, you must find and fix
it.

The behavior of the RTL is a little weird on the
boards on which it is not working, let me explain if i may, i program a
fresh FPGA (FPGA that hasn't been turned on for a while), packets seem to
go smoothly, but after few seconds, packets start to drop, and eventually
within a minute from start, packets stop going altogether =\


When the story begins 'It works for a while and then stops working' or
'When I spray cold spray (or maybe a heat gun) it starts (or stops)
working' then the story will always end the same...you have a timing
problem, end of story.

If you want to prolong the drama, try spraying your FPGA with cold
spray to cool it off and watch it start working again...for the same
few seconds (maybe longer)...then watch it stop working again. If you
don't believe me, try it. At that point, you can continue the
curiosity, or simply accept that the root cause is indeed a timing
problem.

Since i have 2 interfaces on the board (interface A and interface B), and
whatever i receive on A, i transmit it on B, this thing works fine. But the
packets i receive on B and transmit on A, that's where the problem exists.


Nothing that you've described will give anyone enough information to
debug your problem for you so only general guidelines can be
suggested...here are mine

- Say to yourself, there is a timing problem with the design.
- Say it again...and again if necessary until you fully accept that
the problem is timing and you have to find it.
- You have not correctly performed static timing analysis
- Look at the timing report for the list of clock signals. Get rid of
any internally generated clocks if there are any. Do not proceed any
further until you have completed this task.
- Look at the report for signals that asynchronously reset a flip
flop. For each such signal, verify that the reset signal is the
output of a flip flop that is clocked by the same clock signal that
clocks the flip flop that is getting asynchronously reset. Repeat
this until you have rid the design of all such reset signals. Note:
This is not to say that you cannot use async reset signals, what it
says is that the reset signal must first be synchronized to the clock
that is used with the flop. That also implies that you cannot have an
async reset signal that is used to clear flip flops in more than one
clock domain.
- Verify that the static timing analysis *is* analyzing signals that
cross clock domains. If not enabled, it can't report such errors.
Verify that each and every clock domain transfer is handled
correctly. Do these transfers happen through a dual clock fifo part
that you did *not* write the code for and is widely used? Or do some
(all?) transfers happen through code that you wrote? If there are any
that happen through code that you wrote...that is a likely source of
the design error.
- Verify that there are no signals that cross clock domains and end up
at two different flip flops. The outputs of those flip flops will at
some point be different even though you think they are both sampling
the 'same' signal (which they're not). Rid your design of any such
signals.
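The last guideline (one asynchronous signal landing on two flip flops) can be illustrated with a toy timing model. It is only a sketch under stated assumptions: the signal changes at a uniformly random time within a clock period, each flop sees it after its own fixed routing delay, and metastability is ignored entirely. The period and delays are made-up numbers.

```python
import random

def capture_edge(t_change, route_delay, period):
    """Index of the first clock edge at which a flop sees the new value."""
    arrival = t_change + route_delay
    return int(arrival // period) + 1

def count_disagreements(trials, period, delay_a, delay_b, seed=0):
    """How often two flops capture the same transition on different edges."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        t_change = rng.uniform(0.0, period)
        edge_a = capture_edge(t_change, delay_a, period)
        edge_b = capture_edge(t_change, delay_b, period)
        if edge_a != edge_b:
            bad += 1
    return bad

TRIALS = 100_000
PERIOD_NS = 8.0                      # assumed 125 MHz clock
skewed = count_disagreements(TRIALS, PERIOD_NS, delay_a=1.0, delay_b=1.5)
matched = count_disagreements(TRIALS, PERIOD_NS, delay_a=1.0, delay_b=1.0)

print("disagreements with 0.5 ns skew:", skewed, "of", TRIALS)
print("disagreements with no skew:", matched, "of", TRIALS)
```

In this model the two flops disagree for one cycle whenever a clock edge falls between the two arrival times, so the rate is roughly the delay difference divided by the period. Since routing delays move with temperature and from part to part, the usual fix is to synchronize the signal once and fan out the synchronized copy.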

kindly, give me some pointers so that i can debug the problem asap as it
has started to get on my nerves =\


Good luck on your search. Doggedly follow the guidelines posted above
and you'll find the error.

Kevin Jennings
Hi Kevin,
That's exactly what's happening, when the FPGA cools down, it again starts
to transmit packets for a while and then eventually stops when the FPGA is
hot.

One thing i forgot to mention was that the packets that are sent out of the
interface, when i capture them on wireshark, it shows some of them in CRC
errors, some in receive errors and the rest nowhere. I tried to capture the
packets on CHIPSCOPE, like on the transmitting end just before the packets
exits the FPGA, the packet is fine till that point. So i placed another
FPGA to capture it and it showed me a garbage packet, like data_valid
signal going low right after SFD.

Anyway that's just some information. I will follow the guidelines as there
were NO in them.

Thanks a lot

 
On Apr 28, 12:03 am, "salimbaba"
<a1234573@n_o_s_p_a_m.n_o_s_p_a_m.owlpic.com> wrote:
I tried to capture the
packets on CHIPSCOPE, like on the transmitting end just before the packets
exits the FPGA, the packet is fine till that point. So i placed another
FPGA to capture it and it showed me a garbage packet, like data_valid
signal going low right after SFD.
Chipscope will not find the problem, it is not the right tool. Your
best bet is to turn off the board, walk away from it, lock it up, do
something to get it out of the picture but do not attempt to debug
this problem on the bench with the board. You have a timing problem,
and you'll never be able to find it via debug of the hardware so any
time you spend trying to solve this problem by looking at the physical
board is wasted time.

The correct tool to use is static timing analysis which, if you follow
the guidelines I suggested, will solve the problem if you apply them
correctly.

KJ
 
ok thanks a lot =)

 
In article <8c640be7-ed31-4ebe-901d-2f7aedd67fce@d28g2000yqc.googlegroups.com>,
KJ <kkjennings@sbcglobal.net> writes:

The correct tool to use is static timing analysis which, if you follow
the guidelines I suggested, will solve the problem if you apply them
correctly.
It might be a logic bug. He said packets coming in one side
and going out the other. For that to work without a common clock,
you have to have occasional idle time on the fast links so the slow
links can keep up. Maybe it's something like he isn't discarding
the idle markers on the way into the FIFO and recreating them as
needed on the output side. That would break if the input side
is faster.
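A toy simulation of that failure mode, in case it helps to see it. Everything here is an assumption for illustration: one symbol arrives per write-clock tick, frames are 1000 symbols with a 20-symbol gap, and the write side is a deliberately exaggerated 10,000 ppm faster than the read side so the effect shows up in a short run.

```python
def peak_occupancy(ticks, frame_len, gap_len, ppm_fast, store_idles):
    """Peak FIFO depth when the write clock is 'ppm_fast' faster than the read clock."""
    occupancy = 0
    peak = 0
    read_credit = 0.0
    read_per_tick = 1.0 / (1.0 + ppm_fast * 1e-6)   # reads per write-clock tick
    cycle = frame_len + gap_len
    for tick in range(ticks):
        is_data = (tick % cycle) < frame_len
        if is_data or store_idles:
            occupancy += 1                # symbol written into the FIFO
        read_credit += read_per_tick
        if read_credit >= 1.0:
            read_credit -= 1.0
            if occupancy > 0:
                occupancy -= 1            # reader takes a stored symbol
            # otherwise the reader regenerates an idle on its own
        peak = max(peak, occupancy)
    return peak

TICKS = 200_000
peak_store = peak_occupancy(TICKS, 1000, 20, 10_000, store_idles=True)
peak_discard = peak_occupancy(TICKS, 1000, 20, 10_000, store_idles=False)

print("peak depth when idles are stored:", peak_store)
print("peak depth when idles are discarded:", peak_discard)
```

With idles stored, the depth grows without bound, so any finite FIFO eventually overflows; with idles discarded on the way in and regenerated on the way out, the depth stays at about one frame's worth of drift. At realistic ppm values the growth is much slower, which would fit a failure that takes a while to appear.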

--
These are my opinions, not necessarily my employer's. I hate spam.
 
I don't disagree with combing the STA reports, the synth/map/route
warnings, and the async hand-offs for clues or even smoking guns. That
needs to be done. But using chipscope could be fruitful if you are
able to isolate where errors are first identified, such as where the
first packet gets dropped. I imagine that would be an easy condition
to trigger on, and your appropriate choice of signals to monitor might
tell you exactly where the faulty hand-off occurs. Won't hurt to
monitor all appropriate FIFO flags along with any other error
detection logic. Of course, be careful not to create new problems. Use
the correct (sync) clock for the signals you are monitoring. Also,
hopefully your performance margins have enough headroom to absorb the
added chipscope logic. It would be especially wasteful if you had to
jump through hoops just to meet timing with the chipscope logic.
- John
 
On Apr 29, 7:29 am, jc <jcappe...@optimal-design.com> wrote:

<snip>

added chipscope logic. It would be especially wasteful if you had to
jump through hoops just to meet timing with the chipscope logic.
- John

I find that it frequently helps to use SmartGuide when adding
ChipScope to a design. Start with the design that is meeting timing
and is already fully implemented. Then add ChipScope to the design.
Then turn on SmartGuide and finish the build. To turn on SmartGuide,
in the Hierarchy window with the implementation view selected, right
click on the top level of your design and select SmartGuide from the
pop up menu. Do the same to turn it back off later. This will both
reduce the impact on timing of adding ChipScope, and when I use this
on my designs it reduces the implementation times to about 1/3 of what
they normally are.

That said, this does sound like a timing problem.

Regards,

John McCaskill
www.FasterTechnology.com
 
Hi salimbaba

....
That's exactly what's happening, when the FPGA cools down, it again starts
to transmit packets for a while and then eventually stops when the FPGA is
hot.
....

Assuming your timing constraints are correctly defined, the implementation reports don't show errors, and the silicon is OK, this sounds like an asynchronism in the design.

Walter
 
Hey,
Thanks a lot everyone for the help, it surely taught me a lot of new things
in the domain of debugging. The problem wasn't with the RTL or anything, it
was a problem with the PHY. It was getting very hot, still in the operating
range according to the datasheet but i don't know why it wasn't transmitting
any packets. So now i have attached a heat sink n a fan with it and
everything is back to normal =) .

Thanks a lot for your help guys.


Regards

