ML300 and GigE Experiences

Tony

Guest
I am curious if anyone here has had success maintaining a very low BER
link using the fiber connections on the ML300 boards.

We have implemented an Aurora protocol PLB core for the ML300 (adding
interface FIFOs and FSMs to the Aurora CoreGen v2 core). It is
currently a single-lane system using GigE-0 on the ML300 board (MGT
X3Y1). We had some small issues using the 156.25 MHz BREF clock, so we
are currently using a 100 MHz clock (the PLB clock plb_clk out of the
Clock0 module on the EDK2 reference system). Clock compensation occurs
about every 2500 reference clocks (we tried 5000, with the same, if
not worse, problems). Best results were with Diffswing = 800 mV,
Pre-Em = 33%.

Unfortunately, our link has problems staying up for more than 20
minutes (it will spontaneously lose link and channel until an
MGT reset on both partners kicks them off again). Additionally, the
Aurora core reports large numbers of HARD and SOFT errors. I do not
send any data; I just let the Aurora core auto-idle. Here are the
results:

DIFFSW=800 mV, PREEM=33%: stays up 30+ minutes, ~5 soft errors/sec
DIFFSW=700 mV, PREEM=33%: stays up 30+ minutes, ~10 soft errors/sec
DIFFSW=600 mV, PREEM=33%: uptime not tested, ~20 soft errors/sec
(explodes to 200-300 errors/sec at about 13 minutes)
DIFFSW=500 mV, PREEM=33%: uptime not tested, ~30 soft errors/sec
(explodes to 200-300 errors/sec at about 13 minutes)

DIFFSW=800 mV, PREEM=25%: uptime not tested, ~200-300 soft errors/sec
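For scale, these soft-error rates can be turned into a rough bit-error-rate figure. This is only a back-of-the-envelope sketch: the 2.0 Gb/s line rate is an assumption (a 100 MHz reference clock with the MGT's 20x multiplication), and each soft error is counted as at least one bit error.

```python
# Rough lower-bound BER estimate from observed soft-error rates.
# ASSUMPTION: 2.0 Gb/s serial line rate (100 MHz refclk x 20 in the MGT);
# each soft error counts as at least one bit in error.

LINE_RATE_BPS = 2.0e9  # assumed serial line rate, bits per second

def ber_from_error_rate(errors_per_sec):
    """Lower-bound BER given a soft-error rate in errors/second."""
    return errors_per_sec / LINE_RATE_BPS

for rate in (5, 10, 20, 30, 300):
    print(f"{rate:4d} errors/s -> BER >= {ber_from_error_rate(rate):.1e}")
```

Even the best case here (~5 errors/sec) is many orders of magnitude worse than a healthy serial link, which should run error free for hours.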

- In loopback mode (serial or parallel) the channel/lane are crisp and
clean as ever.

- When the boards start up, the errors in each configuration begin at a
few per second but grow over time. I don't know if this is a function
of board/chip temperature (I put a heat sink on, and it seems to slow
the increase of the error rate), or if for some reason the Aurora core
cannot compensate for some clock skew and jitter.

Could any of you guys steer me in the right direction?

Could the heavily loaded plb_clk that I use as my ref_clk be a source
of the problem? Has anybody been able to get low error rates?

Thanks,
Tony
 
Tony,

A well-designed link should be error free (i.e., many, many hours
without a single bit in error). Contact the hotline for details about
MGT support on specific ML300-series boards: some early versions were
not designed to support links above 1 Gb/s, as they were designed to
show off the IBM PowerPC(tm) 405 instead.

So, there are a hundred things to check once you find out whether your
board was built for MGT usage, but you have to start somewhere:

1) Is your refclk meeting the jitter spec? The MGTs require a very
low-jitter refclk. You can check this by observing a 1,0 pattern at the
outputs of the MGTs and seeing how much jitter is there. It should be
much less than 10% of a unit interval (bit period). If it is more than
this, you have a TX jitter problem. If you loop back with a bad-jitter
RX clock, everything looks OK, because the receiver is getting exactly
the same bad clock to work with.
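The 10%-of-UI rule of thumb is easy to apply once you have a jitter number off the scope. A minimal sketch; the 2.0 Gb/s line rate and 40 ps jitter value are illustrative assumptions, not measurements from this board:

```python
# Check measured TX jitter against the ~10%-of-UI rule of thumb.
# Values below are illustrative, not measured.

def jitter_fraction_of_ui(jitter_ps, line_rate_bps):
    """Return peak-to-peak jitter as a fraction of one unit interval."""
    ui_ps = 1e12 / line_rate_bps  # one bit period, in picoseconds
    return jitter_ps / ui_ps

# Example: 40 ps of jitter on an assumed 2.0 Gb/s link (UI = 500 ps)
frac = jitter_fraction_of_ui(40, 2.0e9)
print(f"jitter = {frac:.0%} of UI")  # 8% -- under the 10% guideline
```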

2) Is your logic error free when looped back? I think you said yes, but
timing constraints are often missing, and the fabric is the source of
errors.

3) Are your errors in bursts, or single? Bursts may indicate FIFO
overflow/underflow (refclks far apart in frequency, with no means to
deal with it, or the means is not working in logic -- when looped back,
the same clock is used, so there is no problem).
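The refclk-offset arithmetic behind point 3 can be sketched numerically. Assuming each reference clock is accurate to ±100 ppm (an illustrative figure, not the ML300's measured oscillator spec), the worst-case offset between the two link partners is 200 ppm, i.e. one symbol of slip every 5000 symbols, so clock compensation must run at least that often per symbol of elastic-buffer slack:

```python
# How often clock-compensation sequences must occur so the receiver's
# elastic buffer never over/underflows. Illustrative sketch: with two
# independent oscillators at +/-100 ppm each, the worst-case offset is
# 200 ppm, i.e. one symbol of slip every 1e6/200 = 5000 symbols.

def max_cc_interval_symbols(ppm_each_side, slack_symbols=1):
    """Max symbols between compensation events before `slack_symbols`
    of slip accumulate in the elastic buffer."""
    worst_case_ppm = 2 * ppm_each_side
    # one symbol of slip accumulates every 1e6 / worst_case_ppm symbols
    return slack_symbols * 1_000_000 // worst_case_ppm

print(max_cc_interval_symbols(100))     # 5000 symbols for 1 symbol of slack
print(max_cc_interval_symbols(100, 2))  # 10000 symbols for 2 symbols
```

Under these assumptions, a compensation interval of 2500 reference clocks sits well inside the bound, which would point away from compensation spacing and toward jitter as the error source.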

4) What is the channel? Coax cables are not a differential channel;
common-mode noise will roar right into the receiver if the channel is
not differential. Usually the coax cables are used to connect the TX
and RX pairs to a XAUI adapter module and on to the actual backplane
(still not ideal, but at least most of the channel is differential).

5) What does the received eye pattern look like? This will tell you
whether you have a jitter problem or an amplitude/loss problem. If the
eye looks fantastic, that takes you right back to the digital
processing and rules out the analog side of things.

6) Have you tried a far-end loopback? Loop the digital data at the far
end directly from its RX back to its TX, so it returns to the near end.

7) Contact an FAE and arrange to visit one of our 15 worldwide
RocketLabs(tm) locations, where we have all of the equipment and
resources to debug your board and compare it with our own boards and
designs in the labs.

Austin

Tony wrote:

<snip>
 
Matt,

Do you have a case number?

I'd like to follow up on any less-than-happy experiences so that we can
do better.

Did you have an FAE visit? Did you visit a RocketLab?

Please reply to me directly, (austin (at) xilinx.com)

There is a case now open for Tony (yup, it took that long), and we are
zeroing in on his issue.

Thank you,

Austin
 
Thanks Austin!

The hotline is getting back to me today or Monday regarding the
MGT Gb/s capability of our boards.

1) Our clock is probably dirty. It is the initial DCM output that
goes to the plb_clk of the reference design. I noticed the DDR clock
is fed from another DCM that de-skews and cleans up the first DCM, so
I will do a quick switch to that to see if there is improvement. I am
more and more convinced that the dedicated 156.25 MHz BREF clock going
straight to the MGT is the cleanest signal, and I will also give that a
try. I have to get a scope from the other lab to test the 0->1 jitter
characteristics.

2) I am using the Aurora core v2 from CoreGen, so I am comfortable
saying the fabric is stable. These errors occur when idling (no
PDU/NFC/UFCs), so it is not a synchronization problem with the Aurora
core.

3) I haven't yet developed a test for this. Right now we are picking
off falling-edge HARD_ERROR, SOFT_ERROR, and FRAME_ERROR signals from
the Aurora core and generating interrupts to the PPC405 core, which
then prints to the screen every 100 interrupts, so there is
significant delay, but it is more than sufficient to gather error-rate
statistics in the ~100/sec range.
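To answer the bursts-versus-singles question from the same interrupt stream, timestamping each error event and thresholding the gaps between events is enough. A sketch with a hypothetical helper; the 1 ms burst window is an arbitrary assumption:

```python
# Classify error events as bursty vs. isolated from their timestamps.
# Hypothetical helper; timestamps are in seconds, and the 1 ms burst
# window is an arbitrary choice, not an Aurora-defined threshold.

def classify_errors(timestamps, burst_window=1e-3):
    """Count errors arriving within `burst_window` of the previous
    error (bursty) versus isolated singles."""
    bursts = singles = 0
    prev = None
    for t in sorted(timestamps):
        if prev is not None and t - prev <= burst_window:
            bursts += 1
        else:
            singles += 1
        prev = t
    return bursts, singles

# Three errors clustered near t=0, then two isolated ones
print(classify_errors([0.0, 0.0005, 0.0007, 0.5, 1.2]))  # (2, 3)
```

A strongly bursty result would support the FIFO overflow/underflow explanation; evenly spaced singles would point more toward random bit errors from jitter or amplitude problems.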

4) Fiber -- the cables that come with the ML300 kit.

5) I have to get a scope from the other lab to test this.

6) Far-end loopback? Do you mean the serial-mode loopback where it
goes back at the pads? Yes, that works flawlessly.

7) I was planning a trip just to check out the labs anyway, should be
fun!

I'll reply with the result of switching to the DDR 100 MHz clock, and
the 156.25 MHz clock.

Regards,
Tony

On Fri, 02 Apr 2004 08:03:20 -0800, Austin Lesea <austin@xilinx.com>
wrote:

<snip>
 
I should say the response time has been extremely fast, and the people
I spoke with were great to work with. I called the hotline and they
opened a case. (Austin, I am not sure if this is the same case, but I
left your email and name with them.) I haven't used the GigE core, but
the PLB interface version seems very clean-cut.

Regards,
Tony

On Fri, 02 Apr 2004 10:10:58 -0800, Austin Lesea <austin@xilinx.com>
wrote:

<snip>
 
Matt,

The ML300 also supplies a 156.25 MHz differential clock, but if that
gives problems, the direct differential clock at 125 MHz would indeed
be a step in the right direction. Thanks for the info!

Tony

On Fri, 2 Apr 2004 12:36:32 -0500 (EST), Matthew E Rosenthal
<mer2@andrew.cmu.edu> wrote:

Tony,
I used a hardware GMAC core on the ML300. I believe we used a
differential clock input (62.5 x 2 = 125 MHz). Maybe you can use this
clock instead. This signal is provided on the ML300 board. I don't have
the docs in front of me, but I believe it comes in on either pins
B13/C13 or B14/C14.

My other experiences with the GMAC core and corresponding reference
designs were VERY bad at best, and Xilinx support in that area is no
better. Maybe using the gig ports with the PPC is a little better,
but...

Matt

On Fri, 2 Apr 2004, Tony wrote:

<snip>
 
 
Tony,

See my comments below:

Austin

Tony wrote:

> Thanks Austin!

You're welcome. Happy to help out.

> The hotline is getting back to me today or Monday regarding the
> MGT Gb/s capability of our boards.

They contacted me, and I gave them the names of the right people who
know this stuff. I may not be that smart, but at least I know who is!

> 1) Our clock is probably dirty. It is the initial DCM output that
> goes to the plb_clk of the reference design. I noticed the DDR clock
> is fed from another DCM that de-skews and cleans up the first DCM, so
> I will do a quick switch to that to see if there is improvement. I am
> more and more convinced that the dedicated 156.25 MHz BREF clock going
> straight to the MGT is the cleanest signal, and I will also give that
> a try. I have to get a scope from the other lab to test the 0->1
> jitter characteristics.

Driving the MGT from the DCM definitely does not meet the MGT
clock-input specifications at 2.5 Gb/s and higher. We have heard that
some folks can do this without trouble at 622 Mb/s and 1 Gb/s, but it
still is not recommended. Driving it from two DCMs in tandem is even
worse.

> 2) I am using the Aurora core v2 from CoreGen, so I am comfortable
> saying the fabric is stable. These errors occur when idling (no
> PDU/NFC/UFCs), so it is not a synchronization problem with the Aurora
> core.

Sounds good.

> 3) I haven't yet developed a test for this. Right now we are picking
> off falling-edge HARD_ERROR, SOFT_ERROR, and FRAME_ERROR signals from
> the Aurora core and generating interrupts to the PPC405 core, which
> then prints to the screen every 100 interrupts, so there is
> significant delay, but it is more than sufficient to gather error-rate
> statistics in the ~100/sec range.

Have you thought of using the XBERT design for link characterization?
If you are getting lost-frame indications, that is something far worse
than a few bit errors...

> 4) Fiber -- the cables that come with the ML300 kit.

OK.

> 5) I have to get a scope from the other lab to test this.

OK.

> 6) Far-end loopback? Do you mean the serial-mode loopback where it
> goes back at the pads? Yes, that works flawlessly.

No, I was thinking of looping it back at the far end's digital receive
side to head back towards the near end, but I do not think you need to
do this.

> 7) I was planning a trip just to check out the labs anyway; should be
> fun!

Yes, we have a lot of fun. Since the equipment is there 24x7, the FAEs
get to play with it, and they get proficient with it. Time gets saved,
because setup is sometimes the hardest part of any verification or
measurement. Knowing the equipment and the setup, and using it,
benefits everyone.

> I'll reply with the result of switching to the DDR 100 MHz clock, and
> the 156.25 MHz clock.

If that is easy, it might have a real big benefit.

> Regards,
> Tony

On Fri, 02 Apr 2004 08:03:20 -0800, Austin Lesea <austin@xilinx.com>
wrote:

<snip>
 
