DMA w/ Xilinx PCIX core: speed results and question

Brannon King
Params:
Xilinx's PCIX core for PCI64/PCIX at 66MHz
2v4000-4 running the controller core with 40 Fifos (10 targets, 2 channels,
r/w) and a busmaster wrapper
Tyan 2721 MB w/Xeon 2.6GHz w/ 4GB RAM
Win2k Server sp4
No scatter/gather support in driver
Exact same software and hardware for both reads and writes
Bus commands 1110 and 1111

Results:
Max host write speed: 70MB/s
Max host read speed: 230MB/s
Development time: six months w/ two engineers for both driver and core
wrapper


The timer does not include the memory allocations. Any ideas why the write
speed is so much slower? Would it be the latency parameters in the core? An
OS issue?
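
(For reference, a user-mode timing loop of the sort that could produce figures like these might look like the hedged sketch below. The device name and the IOCTL code are made-up placeholders rather than the poster's actual driver interface; the buffer is allocated before the timer starts, matching the note that allocations are excluded.)

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const DWORD len = 16 * 1024 * 1024;   /* 16 MB test transfer, allocated before timing begins */
    void *buf = VirtualAlloc(NULL, len, MEM_COMMIT, PAGE_READWRITE);

    /* "\\.\MyDmaDevice" and the IOCTL value below are hypothetical placeholders,
       not the driver interface from the post. */
    HANDLE dev = CreateFileA("\\\\.\\MyDmaDevice", GENERIC_READ | GENERIC_WRITE,
                             0, NULL, OPEN_EXISTING, 0, NULL);

    LARGE_INTEGER freq, t0, t1;
    DWORD got = 0;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    /* placeholder IOCTL standing in for "start the busmaster transfer and wait for it" */
    DeviceIoControl(dev, 0x222000 /* hypothetical code */, NULL, 0, buf, len, &got, NULL);
    QueryPerformanceCounter(&t1);

    {
        double sec = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
        printf("%.1f MB/s\n", got / sec / (1024.0 * 1024.0));
    }
    return 0;
}
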
 
Hi,

Results:
Max host write speed: 70MB/s
Max host read speed: 230MB/s

The timer does not include the memory allocations.
Any ideas why the write speed is so much slower?
Would it be the latency parameters in the core? An
OS issue?
When you say "write speed" do you refer to your device
becoming bus master and doing memory writes to the
system RAM behind the host bridge? Likewise, by the
term "read speed" do you refer to your device becoming
bus master and doing memory reads of the system RAM
behind the host bridge?

I just want to make sure I didn't mis-interpret your
question before I try to answer it. Or did I get it
backwards?

Eric
 
Is the bus operating in PCI or PCIX mode? If it's in PCI mode then you are
seeing the disadvantage of not being able to post read requests. Your device is
getting told to retry while the chipset fetches the read data.

If it's in PCIX mode then you should make sure that your DMA engine is issuing
as many posted read requests as possible of as large a size as possible.

Mark


Brannon King wrote:
To clarify one issue, host write refers to DMA busmaster read (the busmaster
is on my device and is actually reading the data in from the host.)

"Brannon King" <bking@starbridgesystems.com> wrote in message
news:bu6q76$4s4@dispatch.concentric.net...

Params:
Xilinx's PCIX core for PCI64/PCIX at 66MHz
2v4000-4 running the controller core with 40 Fifos (10 targets, 2 channels, r/w) and a busmaster wrapper
Tyan 2721 MB w/Xeon 2.6GHz w/ 4GB RAM
Win2k Server sp4
No scatter/gather support in driver
Exact same software and hardware for both reads and writes
Bus commands 1110 and 1111

Results:
Max host write speed: 70MB/s
Max host read speed: 230MB/s
Development time: six months w/ two engineers for both driver and core
wrapper


The timer does not include the memory allocations. Any ideas why the write speed is so much slower? Would it be the latency parameters in the core? An OS issue?
 
Hello,

Brannon King wrote:

"Host write" refers to busmaster read.
Max host write speed: 70MB/s
Max host read speed: 230MB/s
I think Mark described it well in his post. If
this is PCI mode, it isn't entirely surprising.
If this is in PCI-X mode, and you are using split
transactions (supporting multiple outstanding is
best) then you may need to do some hunting.

The best tool for this is a bus analyzer, if you
have one (or maybe can borrow one from a vendor
to "evaluate" it?) There could be all manner of
secondary issues that cause problems:

* bus traffic from other agents
* you are behind a bridge
* your byte counts are small

Sorry I don't have a more specific answer for you.
Eric
 
To clarify one issue, host write refers to DMA busmaster read (the busmaster
is on my device and is actually reading the data in from the host.)

"Brannon King" <bking@starbridgesystems.com> wrote in message
news:bu6q76$4s4@dispatch.concentric.net...
Params:
Xilinx's PCIX core for PCI64/PCIX at 66MHz
2v4000-4 running the controller core with 40 Fifos (10 targets, 2
channels,
r/w) and a busmaster wrapper
Tyan 2721 MB w/Xeon 2.6GHz w/ 4GB RAM
Win2k Server sp4
No scatter/gather support in driver
Exact same software and hardware for both reads and writes
Bus commands 1110 and 1111

Results:
Max host write speed: 70MB/s
Max host read speed: 230MB/s
Development time: six months w/ two engineers for both driver and core
wrapper


The timer does not include the memory allocations. Any ideas why the write speed is so much slower? Would it be the latency parameters in the core? An OS issue?
 
For those speed tests the device was in PCI mode. I was assuming it would be
the same speed as PCIX (at the same bus speed) because the timing diagrams
all looked compatible between the two. Please explain what you mean by "post
read requests". Is there some workaround for this to make the PCI mode
handle this better?


"Mark Schellhorn" <mark@seawaynetworks.com> wrote in message
news:JlDNb.1075$XZ.151148@news20.bellglobal.com...
Is the bus operating in PCI or PCIX mode? If it's in PCI mode then you are
seeing the disadvantage of not being able to post read requests. Your
device is
getting told to retry while the chipset fetches the read data.

If it's in PCIX mode then you should make sure that your DMA engine is
issuing
as many posted read requests as possible of as large a size as possible.

Mark


Brannon King wrote:
To clarify one issue, host write refers to DMA busmaster read (the
busmaster
is on my device and is actually reading the data in from the host.)

"Brannon King" <bking@starbridgesystems.com> wrote in message
news:bu6q76$4s4@dispatch.concentric.net...

Params:
Xilinx's PCIX core for PCI64/PCIX at 66MHz
2v4000-4 running the controller core with 40 Fifos (10 targets, 2 channels, r/w) and a busmaster wrapper
Tyan 2721 MB w/Xeon 2.6GHz w/ 4GB RAM
Win2k Server sp4
No scatter/gather support in driver
Exact same software and hardware for both reads and writes
Bus commands 1110 and 1111

Results:
Max host write speed: 70MB/s
Max host read speed: 230MB/s
Development time: six months w/ two engineers for both driver and core
wrapper


The timer does not include the memory allocations. Any ideas why the write speed is so much slower? Would it be the latency parameters in the core? An OS issue?
 
"Brannon King" <bking@starbridgesystems.com> wrote in message news:<bu6q76$4s4@dispatch.concentric.net>...
Params:
Xilinx's PCIX core for PCI64/PCIX at 66MHz
2v4000-4 running the controller core with 40 Fifos (10 targets, 2 channels,
r/w) and a busmaster wrapper
Tyan 2721 MB w/Xeon 2.6GHz w/ 4GB RAM
Win2k Server sp4
No scatter/gather support in driver
Exact same software and hardware for both reads and writes
Bus commands 1110 and 1111

Results:
Max host write speed: 70MB/s
Max host read speed: 230MB/s
Development time: six months w/ two engineers for both driver and core
wrapper


The timer does not include the memory allocations. Any ideas why the write
speed is so much slower? Would it be the latency parameters in the core? An
OS issue?
Have you used a PCI bus analyzer to see the bus traffic?

Is the write data sourced from cache, or is it being fetched from main memory?

--a
 
Actually I shouldn't have called them "posted reads". Posting a transaction
means that the initiator never gets an explicit acknowledgement that the
transaction reached its destination (like posting a letter in the mail). PCI
writes are posted. A PCI read by definition is non-posted because the initiator
must receive an acknowledgement (the read data).

What I should have said was that the PCI-X protocol allows the initiator to
pipeline reads. If you have a copy, the PCI-X spec explains it pretty well.
Here's the short version:

In PCI-X, the target of a transaction can terminate the transaction with a split
response, which tells the initiator that the target will get back to him later
with a completion transaction (data if it's a read). The request is tagged with
a 5-bit number that will come back with the completion so that the initiator can
match completions to outstanding requests. The initiator is allowed to have up
to 32 split requests outstanding in the pipeline at any one time. Each read
request can be for up to 4kB of data. The throughput of a system that takes full
advantage of split transactions is highest when the amount of data being
transferred is large and the latency is small enough that 32 tags can keep the
pipeline full.
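
To make that concrete, here is a hedged sketch of the bookkeeping a read DMA engine (or its driver) needs in order to keep up to 32 split read requests in flight and match completions back to requests by tag. This is not the Xilinx core's actual interface; hw_issue_split_read and the other names are invented for illustration.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_TAGS     32      /* 5-bit tag field -> up to 32 requests in flight */
#define MAX_READ_LEN 4096    /* largest single PCI-X read request, in bytes */

typedef struct {
    uint64_t addr;           /* host address being read */
    uint32_t len;            /* bytes requested, <= MAX_READ_LEN */
    int      in_use;
} split_req_t;

static split_req_t req_table[MAX_TAGS];

/* Issue read requests until the buffer is covered or all tags are busy. */
size_t issue_reads(uint64_t host_addr, size_t total_len)
{
    size_t issued = 0;
    int tag;
    for (tag = 0; tag < MAX_TAGS && issued < total_len; tag++) {
        uint32_t chunk;
        if (req_table[tag].in_use)
            continue;
        chunk = (total_len - issued > MAX_READ_LEN)
                    ? MAX_READ_LEN : (uint32_t)(total_len - issued);
        req_table[tag].addr   = host_addr + issued;
        req_table[tag].len    = chunk;
        req_table[tag].in_use = 1;
        /* hw_issue_split_read(tag, req_table[tag].addr, chunk);  hypothetical hook
           into the DMA engine, not a real core interface */
        issued += chunk;
    }
    return issued;           /* the caller re-enters as completions free up tags */
}

/* Called when a split completion arrives; the tag identifies the request it answers. */
void on_split_completion(int tag)
{
    req_table[tag].in_use = 0;   /* data has landed; the tag can be reused */
}

int main(void)
{
    /* Ask for 1 MB; only 32 x 4 KB = 128 KB can be outstanding at once. */
    size_t n = issue_reads(0x100000, 1024 * 1024);
    printf("issued %zu bytes before running out of tags\n", n);
    return 0;
}
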

In PCI, the target of a read transaction must either respond with data
immediately, or repeatedly terminate the read attempts with retry while he goes
off and fetches the data. Once he's fetched it, he will be able to respond
immediately to the initiator on the initiator's next attempt. This is very
inefficient because there is only one transaction in the pipeline at a time. If
the latency is large (the initiator has to retry many times), the throughput is
much lower than when pipelined reads are used.
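
A quick back-of-the-envelope shows why this hurts so much. The numbers below are assumptions for illustration (a 64-byte prefetch per delayed read and roughly 1 microsecond of retry/prefetch latency), not measurements:

#include <stdio.h>

int main(void)
{
    const double bus_bw   = 528e6;   /* 64-bit PCI at 66 MHz, theoretical peak, bytes/s */
    const double prefetch = 64.0;    /* assumed bytes the host prefetches per delayed read */
    const double latency  = 1e-6;    /* assumed retry/prefetch latency per transaction, s */

    /* only one prefetch buffer moves per (latency + transfer time) */
    double per_xfer = latency + prefetch / bus_bw;
    printf("effective read bandwidth ~ %.0f MB/s\n", prefetch / per_xfer / 1e6);
    /* prints roughly 57 MB/s with these assumed numbers -- the same order of
       magnitude as the 70 MB/s reported at the top of the thread */
    return 0;
}
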

If PCI-X mode is available, use it. Or, there may be chipset settings that you
can use to improve PCI mode performance. The chipset may be able to do
pre-fetching of data in anticipation of you reading it. There may also be burst
length settings that allow you to increase the amount of data transferred in a
single transaction. You need to read the specs for the chipset you are using and
figure out what can be tweaked.

Mark

Brannon King wrote:
For those speed tests the device was in PCI mode. I was assuming it would be
the same speed as PCIX (at the same bus speed) because the timing diagrams
all looked compatible between the two. Please explain what you mean by "post
read requests". Is there some workaround for this to make the PCI mode
handle this better?
 
Since it seems like a valuable response, here is Eric's answer:

Hi,

In PCI mode, when you try to "read" the host, most hosts will immediately issue retry. However, they have gleaned some valuable information -- the starting address. That is called a "delayed read request".

Then, the host goes off and prefetches data from that starting address. How much it prefetches is up to the person that designed the host device. Probably 64 bytes or something small like that.

While it is prefetching, if your device retries the read, you'll keep getting retry termination. Time is passing. Eventually, when the host is finished prefetching however much it is going to prefetch, and you return to retry the transaction (for the millionth attempt) it will this time NOT retry you but will give you some data (from one DWORD up to however much it prefetched...) That is called a "delayed read completion".

If that satisfied your device, the "transaction" is over. If you actually wanted more data (the host has no idea how much data you wanted, since there are no attributes in PCI mode) your device will get disconnected. Then, your device will start a new "transaction" with a new starting address, and this horrible process repeats.

It is terribly inefficient (but supposedly better than having the host insert thousands of wait states, which keeps the bus locked up so everyone else is not getting a turn...)

This is replaced by something called split transactions in PCI-X mode, which is more efficient. It is a bit more complicated to explain, though. If you want me to give that a stab, write back and I'll give it a shot tomorrow.

Eric


"Eric Crabill" <eric.crabill@xilinx.com> wrote in message
news:4006FF0D.6262552F@xilinx.com...
Hi,

Results:
Max host write speed: 70MB/s
Max host read speed: 230MB/s

The timer does not include the memory allocations.
Any ideas why the write speed is so much slower?
Would it be the latency parameters in the core? An
OS issue?

When you say "write speed" do you refer to your device
becoming bus master and doing memory writes to the
system RAM behind the host bridge? Likewise, by the
term "read speed" do you refer to your device becoming
bus master and doing memory reads of the system RAM
behind the host bridge?

I just want to make sure I didn't mis-interpret your
question before I try to answer it. Or did I get it
backwards?

Eric
 
