PCI-X DMA problem w/ Xeon?

Mark Schellhorn
If anyone can offer help with this (or a pointer to a more suitable forum) it
would be greatly appreciated. I am posting here because I know there are some
folks with expertise who frequent this group (and we're using a Xilinx core
:)).

We are attempting to perform 64B burst PCI-X DMA write transfers from our add-in
card into host memory on a dual Xeon system.

Our Linux device driver (kernel 2.4.x/2.6.x) is notified via an interrupt and a
single-dword "doorbell" write that the larger 64B entry is available for processing.

The order of operation on the PCI-X bus is as follows:

64B data write --> 4B doorbell write --> interrupt.

Upon receiving the interrupt, the device driver polls the location in memory
where the 4B doorbell write is expected to show up. Once it sees the doorbell
location written, it reads the 64B data location. PCI ordering should guarantee
that the 64B data location is written to system memory before the 4B doorbell
write is.
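
Roughly, the driver side of that sequence looks like the sketch below. The
names and the DOORBELL_MAGIC value are illustrative only (not our actual
driver), and the 64B entry and 4B doorbell are assumed to sit in one
dma_alloc_coherent() buffer:

    #include <linux/types.h>
    #include <linux/interrupt.h>

    #define DOORBELL_MAGIC 0x1             /* illustrative doorbell value */

    struct card_dev {                      /* hypothetical per-device state */
        volatile u32 *doorbell;            /* 4B doorbell in coherent DMA memory */
        volatile u32 *entry;               /* 64B (16 dword) entry, same buffer */
        u32 copy[16];                      /* local copy of the entry */
    };

    static irqreturn_t card_isr(int irq, void *dev_id)
    {
        struct card_dev *cd = dev_id;
        int i;

        /* Poll the doorbell that the FPGA writes *after* the 64B entry. */
        while (cd->doorbell[0] != DOORBELL_MAGIC)
            cpu_relax();

        /* Keep the doorbell read ordered before the data reads. */
        rmb();

        /* Only now should the 64B entry be valid in host memory. */
        for (i = 0; i < 16; i++)
            cd->copy[i] = cd->entry[i];

        cd->doorbell[0] = 0;               /* re-arm for the next transfer */
        return IRQ_HANDLED;
    }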

The above writes are performed as Memory Write Block transactions (we have also
tried Memory Write transactions), the No Snoop bit is cleared, and the Relaxed
Ordering bit is cleared.
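
Since the device's permission to set the Relaxed Ordering attribute is also
gated by the Enable Relaxed Ordering bit in its PCI-X command register, the
driver can double-check that this bit is clear as well. A minimal sketch
(the function name is hypothetical; the PCI_X_* constants are the standard
Linux ones):

    #include <linux/pci.h>

    static int card_clear_relaxed_ordering(struct pci_dev *pdev)
    {
        int pos = pci_find_capability(pdev, PCI_CAP_ID_PCIX);
        u16 cmd;

        if (!pos)
            return -ENODEV;                /* no PCI-X capability found */

        pci_read_config_word(pdev, pos + PCI_X_CMD, &cmd);
        if (cmd & PCI_X_CMD_ERO) {         /* Enable Relaxed Ordering set? */
            cmd &= ~PCI_X_CMD_ERO;         /* forbid relaxed ordering */
            pci_write_config_word(pdev, pos + PCI_X_CMD, cmd);
        }
        return 0;
    }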

We consistently encounter a situation where the device driver correctly receives
the interrupt & single dword doorbell write, but the 64B write fails to appear
in memory. Instead, the device driver reads a stale 64B block of data (data from
the last successful DMA write).

As a debug measure, we had the FPGA on our add-in card perform a readback
(Memory Read Block) of the 64B entry immediately after writing it. We observed
that the data read back was stale and matched the stale data that the device
driver saw. E.g.:

0) Location 0xABCDEF00 is known to contain stale 64B data 0xAAAA....AAAA.
1) FPGA does a 64B Memory Write Block of 0xBBBB....BBBB to address 0xABCDEF00.
2) FPGA does a 64B Memory Read Block at address 0xABCDEF00 (Split Response).
3) Split Completion is returned by the bridge with data 0xAAAA....AAAA.

This appears to be a violation of PCI ordering rules. Again, the No Snoop and
Relaxed Ordering bits are cleared for all of these transactions.

The device driver *never* writes to the 64B location, so there should be no
possibility of a collision where it writes/flushes stale data that overwrites
the incoming DMA write.

This tells me that the location is NOT getting written because, according to PCI
ordering rules, the FPGA's read *must* push the Memory Write Block into system
memory before the read completion is returned.

We observe this behaviour in dual Xeon systems with both the Intel E7501 chipset
and the Broadcom Serverworks GC-LE chipset.

We observe this in SMP and single processor configurations.

When bus traffic is light at 133MHz, or whenever the bus is running at 66MHz, we
do *not* observe this problem. We occasionally observe the problem when the bus
is running at 100MHz with heavy traffic. This suggests that we are hitting a
narrow timing window at higher bus speeds.

We suspect we may be hitting a cache erratum in the Xeon, and are wondering if
anyone can confirm this and possibly provide a workaround?

We've been banging our heads on this for a couple of weeks now.

Mark
 
On the scope I observed that the non-DMA, from-device writes were taking
anywhere between 30 and 40 bus clock cycles just for the lower dword to
transfer. I would recommend a handshake along these lines (sketched below):

host writes to DMA trigger register ==>
device performs the DMA transaction ==>
on completion, device raises an interrupt ==>
driver reads a doorbell register on the device to check the transaction
completion size, etc.
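
Something like this on the driver side (register offsets and names are made
up; cd->regs is the ioremap()ed BAR, and a 32-bit DMA address is assumed):

    #include <linux/interrupt.h>
    #include <linux/dma-mapping.h>
    #include <linux/io.h>

    #define REG_DMA_ADDR     0x00          /* hypothetical register offsets */
    #define REG_DMA_LEN      0x04
    #define REG_DMA_TRIGGER  0x08
    #define REG_DMA_DONE     0x0C

    struct card_dev {
        void __iomem *regs;                /* ioremap()ed BAR */
    };

    static void card_start_dma(struct card_dev *cd, dma_addr_t dst, u32 len)
    {
        writel((u32)dst, cd->regs + REG_DMA_ADDR);
        writel(len, cd->regs + REG_DMA_LEN);
        writel(1, cd->regs + REG_DMA_TRIGGER);   /* host kicks off the DMA */
    }

    static irqreturn_t card_dma_done_isr(int irq, void *dev_id)
    {
        struct card_dev *cd = dev_id;

        /* Reading a register on the device both returns the completion size
         * and, by PCI ordering, forces any still-posted DMA writes into host
         * memory before this read completion comes back. */
        u32 done = readl(cd->regs + REG_DMA_DONE);

        if (done == 0)
            return IRQ_NONE;               /* nothing completed / not ours */

        /* ... pass the 'done' bytes on to the rest of the driver ... */
        return IRQ_HANDLED;
    }

The point of having the driver read a device register (rather than poll host
memory) is that the read itself flushes any posted writes ahead of it.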

"Mark Schellhorn" <mark@seawaynetworks.com> wrote in message
news:SOAEc.175874$207.1241277@news20.bellglobal.com...
If anyone can offer help with this (or a pointer to a more suitable forum)
it
would be greatly appreciated. I am posting here because I know there is
some
some folks with expertise who frequent this group (and we're using a
Xilinx core
:)).

We are attempting to perform 64B burst PCI-X DMA write transfers from our
add-in
card into host memory on a dual Xeon system.

Our linux device driver (kernel 2.24.x/2.26.x) is notified via an
interrupt &
single dword "doorbell" write that the larger 64B entry is available for
processing.

The order of operation on the PCI-X bus is as follows:

64B data write --> 4B doorbell write --> interrupt.

Upon receiving the interrupt, the device driver polls the location in
memory
where the 4B doorbell write is expected to show up. Once he sees the
doorbell
location written, he reads the 64B data location. PCI ordering should
guarantee
that the 64B data location is written to system memory before the 4B
doorbell
write is.

The above writes are performed as Memory Write Block transactions (we have
also
tried Memory Write transactions), the No Snoop bit is cleared, and the
Relaxed
Ordering bit is cleared.

We consistently encounter a situation where the device driver correctly
receives
the interrupt & single dword doorbell write, but the 64B write fails to
appear
in memory. Instead, the device driver reads a stale 64B block of data
(data from
the last successful DMA write).

As a debug measure, we had the FPGA on our add-in card perform a readback
(Memory Read Block) of the 64B entry immediately after writing it. We
obeserved
that the data read back was stale and matched the stale data that the
device
driver saw. Eg:

1) Location 0xABCDEF00 is known to contain stale 64B data
0xAAAA....AAAA.
1) FPGA does Memory Write Block 64B 0xBBBB....BBBB at address
0xABCDEF00.
2) FPGA does Memory Read Block 64B at address 0xABCDEF00 (Split
esponse).
3) Split Completion is returned by bridge with data 0xAAAA....AAAA.

This appears to be a violation of PCI ordering rules. Again, the No Snoop
and
Relaxed Order bits are cleared for all of these transactions.

The device driver *never* writes to the 64B location, so there should be
no
possibility of a collision occurring where he writes/flushes stale data
that
overwrites the incoming DMA write.

This tells me that the location is NOT getting written because, according
to PCI
ordering rules, the FPGA read *must* push the Memory Write Block into
system
memory before reading back the location.

We observe this behaviour in dual Xeon systems with both the Intel E7501
chipset
and the Broadcom Serverworks GC-LE chipset.

We observe this in SMP and single processor configurations.

When bus traffic is light at 133MHz, or whenever the bus is running at
66MHz, we
do *not* observe this problem. We occasionally observe the problem when
the bus
is running at 100MHz with heavy traffic. This suggests that we are hitting
a
narrow timing window at higher bus speeds.

We are suspicious that we might be encountering a cache errata in the
Xeon, and
are wondering if anyone can confirm this and possibly provide a
workaround?

We've been banging our heads on this for a couple of weeks now.

Mark
 
I have a similar application. We perform PCI-X DMA with 4 Kbyte bursts from our
add-in card into the host's memory (Memory Write = command x"07"). This works
well in a dual Xeon system with an Intel E7501 chipset @ 133 MHz; we get a data
rate of about 420 MB/s.

It took us a while before the handshake between the driver and the hardware
worked properly. We do it as follows (a driver-side sketch follows the list):
--> add-in card raises an interrupt and sets an interrupt bit
--> the device driver writes a valid destination address for the DMA into a
    BAR register
--> the device driver clears the interrupt bit!
--> the user application in the FPGA starts the DMA and attempts to write the
    desired burst length to this address. The user application keeps track of
    the address; in case of a DMA abort it issues a new transaction request
    until the 4 Kbyte are finished.
--> add-in card raises an interrupt
and so on ....
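
On the driver side that looks roughly like this (register offsets are made up;
cd->regs is the ioremap()ed BAR, cd->buf_dma the DMA address of the 4 Kbyte
destination buffer):

    #include <linux/interrupt.h>
    #include <linux/dma-mapping.h>
    #include <linux/io.h>

    #define REG_DMA_DEST   0x10            /* hypothetical register offsets */
    #define REG_IRQ_CLEAR  0x14

    struct card_dev {
        void __iomem *regs;                /* ioremap()ed BAR */
        dma_addr_t buf_dma;                /* 4 Kbyte destination buffer */
    };

    static irqreturn_t card_isr(int irq, void *dev_id)
    {
        struct card_dev *cd = dev_id;

        /* ... consume the data from the previous 4 Kbyte burst here ... */

        /* Hand the FPGA the next valid destination address ... */
        writel((u32)cd->buf_dma, cd->regs + REG_DMA_DEST);

        /* ... and only then clear the interrupt bit, which tells the user
         * application in the FPGA that it may start the next DMA. */
        writel(0, cd->regs + REG_IRQ_CLEAR);

        return IRQ_HANDLED;
    }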

Within a 4 Kbyte burst we see about 10 aborts, depending on bus traffic and on
the system. I have also seen aborts right at the start of a DMA, before any
data had been written to the host's memory! If you don't handle aborts, this
could also be your problem!

Matthias





--
Matthias Müller
Fraunhofer Institut Integrierte Schaltungen IIS
-Bildsensorik-
Am Wolfsmantel 33
D-91058 Erlangen
Tel: +49 (0)9131-776-554
Fax: +49 (0)9131-776-598
mailto:mur@iis.fhg.de http://www.iis.fhg.de
 
