Guaranteeing switch bandwidth...

Don Y

Is there any way to guarantee the bandwidth available to a
subgroup of ports on a (managed/L3?) switch?

Or, do I have to resort to a separate switch for this?
 
On 2023-08-24, Don Y wrote:
Is there any way to guarantee the bandwidth available to a
subgroup of ports on a (managed/L3?) switch?

Not really. You can try implementing QoS on the switch to tag priority,
but that doesn't necessarily "guarantee" bandwidth; noting, of course,
that most (all) decent switches have enough capacity in their switch
fabric to run all ports at line rate simultaneously.

It's usually just easier to have bigger uplink connections (e.g.
10/100G) making up the backbone.
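
For completeness, a scheme like that also needs the hosts to mark their
traffic in the first place. A minimal sketch (Python; the DSCP value,
address, and port are illustrative assumptions) of tagging a socket's
traffic as Expedited Forwarding so a DSCP-aware switch can prioritize
it -- note that this prioritizes, it does not guarantee bandwidth:

    import socket

    # DSCP EF (46) occupies the upper six bits of the IP TOS byte.
    EF_TOS = 46 << 2  # 0xB8

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Mark outgoing packets; a QoS-enabled switch can map this DSCP
    # value to a higher-priority egress queue.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, EF_TOS)
    sock.sendto(b"latency-sensitive payload", ("192.0.2.10", 5000))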

--
|_|O|_|
|_|_|O| Github: https://github.com/dpurgert
|O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860
 
On 8/24/2023 8:30 AM, Dan Purgert wrote:
On 2023-08-24, Don Y wrote:
Is there any way to guarantee the bandwidth available to a
subgroup of ports on a (managed/L3?) switch?

Not really. You can try implementing QoS on the switch to tag priority,
but that doesn't necessarily "guarantee" bandwidth; noting, of course,
that most (all) decent switches have enough capacity in their switch
fabric to run all ports at line rate simultaneously.

But that assumes the destination port is always available.
E.g., if N-1 ports all try to deliver packets to the Nth
port, then the packets are queued in the switch, using switch
resources.

For what 'n' (!= N) and packet size/traffic pattern, will the
switch add latency to other packets (effectively reducing bandwidth
for that link)?

It's usually just easier to have bigger uplink connections (e.g.
10/100G) making up the backbone.

I'm not leaving the switch.

The switch is an approximation of a mesh. Under what conditions
does that approximation fall flat?
 
On 2023-08-24, Don Y wrote:
On 8/24/2023 8:30 AM, Dan Purgert wrote:
On 2023-08-24, Don Y wrote:
Is there any way to guarantee the bandwidth available to a
subgroup of ports on a (managed/L3?) switch?

Not really. You can try implementing QoS on the switch to tag priority,
but that doesn't necessarily "guarantee" bandwidth; noting, of course,
that most (all) decent switches have enough capacity in their switch
fabric to run all ports at line rate simultaneously.

But that assumes the destination port is always available.
E.g., if N-1 ports all try to deliver packets to the Nth
port, then the packets are queued in the switch, using switch
resources.

Yes and no. Switches have *very* small buffers (your typical "decent
biz" rack-mount option having somewhere in the realm of 16 MiB ...
maybe).

If a link gets so congested that the buffers fill, the switch will just
start dropping traffic on the floor. This in turn will typically act as
a signal for the sending parties to retry (and slow down).

Bear in mind that "filling buffers" only happens in cases like you've
described -- if you've got 10 hosts all talking amongst themselves
(rather than 9 trying to slam the 10th), a decent switch will keep up
with that forever.

For what 'n' (!= N) and packet size/traffic pattern, will the
switch add latency to other packets (effectively reducing bandwidth
for that link)?

Basically ONLY if your "N" hosts are all trying to compete for a single
link (or some "M" links that don't allow for sufficient breaks in
traffic to let all the data out the door). But again, this is a problem
of your uplinks, and not the switch.

A good switch will happily switch all 24 / 48 ports at line rate all day
every day.

It's usually just easier to have bigger uplink connections (e.g.
10/100G) making up the backbone.

I'm not leaving the switch.

Then I'm not really sure why you're asking about N hosts talking to an
M'th host ... as all that traffic will enter (and exit) your switch ...

The switch is an approximation of a mesh. Under what conditions
does that approximation fall flat?

Decent manuals will provide three pieces of data for the switch:

- Non-Blocking Throughput -- should be equal to the number of ports.
How much can the switch transmit before it bogs down.
- Switching Capacity -- should be 2x the number of ports. How much
total traffic the fabric can handle before it bogs down.
- Forwarding Rate -- should be about 1.5x the number of ports. How
many frames the switch can process before it bogs down.

As long as you're within these specs, "the switch" is not impacting the
traffic at all.
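
As a rough back-of-the-envelope check of where those rules of thumb
come from (Python sketch; a hypothetical 48-port gigabit switch and
worst-case 64-byte frames are assumed, not any particular datasheet):

    # Hypothetical 48-port, 1 Gb/s switch; illustrative arithmetic only.
    ports = 48
    line_rate_bps = 1e9

    # Non-blocking throughput: every port transmitting at line rate.
    non_blocking_gbps = ports * line_rate_bps / 1e9             # 48 Gb/s

    # Switching capacity: each port receiving *and* transmitting.
    switching_capacity_gbps = 2 * ports * line_rate_bps / 1e9   # 96 Gb/s

    # Forwarding rate: worst case is 64-byte frames, which occupy
    # 84 bytes on the wire (preamble + frame + inter-frame gap).
    pps_per_port = line_rate_bps / (84 * 8)                     # ~1.488 Mpps
    forwarding_rate_mpps = ports * pps_per_port / 1e6           # ~71.4 Mpps

    print(non_blocking_gbps, switching_capacity_gbps, forwarding_rate_mpps)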

--
|_|O|_|
|_|_|O| Github: https://github.com/dpurgert
|O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860
 
On Thursday, 24 August 2023 at 17:40:12 UTC+1, Dan Purgert wrote:
On 2023-08-24, Don Y wrote:
On 8/24/2023 8:30 AM, Dan Purgert wrote:
On 2023-08-24, Don Y wrote:
Is there any way to guarantee the bandwidth available to a
subgroup of ports on a (managed/L3?) switch?

Not really. You can try implementing QoS on the switch to tag priority,
but that doesn\'t necessarily \"guarantee\" bandwidth; noting, of course,
that most (all) decent switches have enough capacity in their switch
fabric to run all ports at line rate simultaneously.

But that assumes the destination port is always available.
E.g., if N-1 ports all try to deliver packets to the Nth
port, then the packets are queued in the switch, using switch
resources.
Yes and no. Switches have *very* small buffers (your typical \"decent
biz\" rack-mount option having somewhere in the realm of 16 MiB ...
maybe).

If a link gets so congested that the buffers fill, the switch will just
start dropping traffic on the floor. This in turn will typically act as
a signal for the sending parties to retry (and slow down).

Bear in mind that \"filling buffers\" only happens in cases like you\'ve
described -- if you\'ve got 10 hosts all talking amongst themselves
(rather than 9 trying to slam the 10th), a decent switch will keep up
with that forever.

Some switches implement flow control: rather than dropping packets on
the floor, they first ask the sender to slow down to keep the buffers
from filling completely, and only drop packets if the sender ignores
the request and the buffers do fill up. This is usually a configurable
option in managed switches. Some unmanaged switches use flow control,
others don't.
John
For what \'n\' (!= N) and packet size/traffic pattern, will the
switch add latency to other packets (effectively reducing bandwidth
for that link)?
Basically ONLY if your \"N\" hosts are all trying to compete for a single
link (or some \"M\" links that don\'t allow for sufficient breaks in
traffic to let all the data out the door). But again, this is a problem
of your uplinks, and not the switch.

A good switch will happily switch all 24 / 48 ports at line rate all day
every day.

It\'s usually just easier to have bigger uplink connections (e.g.
10/100G) making up the backbone.

I\'m not leaving the switch.
Then I\'m not really sure why you\'re asking about N hosts talking to an
M\'th host ... as all that traffic will enter (and exit) your switch ...

The switch is an approximation of a mesh. Under what conditions
does that approximation fall flat?
Decent manuals will provide three pieces of data for the switch:

- Non-Blocking Throughput -- should be equal to the number of ports.
How much can the switch transmit before it bogs down.
- Switching Capacity -- should be 2x the number of ports. How much
total traffic the fabric can handle before it bogs down.
- Forwarding Rate -- should be about 1.5x the number of ports. How
many frames the switch can process before it bogs down.

As long as you\'re within these specs, \"the switch\" is not impacting the
traffic at all.
--
|_|O|_|
|_|_|O| Github: https://github.com/dpurgert
|O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860
 
On 8/24/2023 9:40 AM, Dan Purgert wrote:
Is there any way to guarantee the bandwidth available to a
subgroup of ports on a (managed/L3?) switch?

Not really. You can try implementing QoS on the switch to tag priority,
but that doesn\'t necessarily \"guarantee\" bandwidth; noting, of course,
that most (all) decent switches have enough capacity in their switch
fabric to run all ports at line rate simultaneously.

But that assumes the destination port is always available.
E.g., if N-1 ports all try to deliver packets to the Nth
port, then the packets are queued in the switch, using switch
resources.

Yes and no. Switches have *very* small buffers (your typical \"decent
biz\" rack-mount option having somewhere in the realm of 16 MiB ...
maybe).

If a link gets so congested that the buffers fill, the switch will just
start dropping traffic on the floor. This in turn will typically act as
a signal for the sending parties to retry (and slow down).

This would be A Bad Thing as it would have to rely on timeouts
to detect that packets are missing.

Bear in mind that \"filling buffers\" only happens in cases like you\'ve
described -- if you\'ve got 10 hosts all talking amongst themselves
(rather than 9 trying to slam the 10th), a decent switch will keep up
with that forever.

There are multiple executing threads on each host. Each host
\"serves\" some number of objects -- that can be referenced by
clients on other hosts (those other hosts serving objects
of their own). Additionally, the objects can be migrated to other
hosts as can the servers that back them. (so, traffic patterns
are highly dynamic)

The network (as manifested in the switch) plays the role of
the "PCI bus" in a multimaster design. Any time anything wants
to talk to anything else (things being objects -- or, more accurately,
the servers *backing* those objects) -- that transaction happens
on the wire.

Transactions aren't (necessarily) short-lived (like a DNS resolution)
or infrequent (like a web page lookup).

E.g., a client on one host may push a video stream to an object
(e.g., a "motion detector") on another host -- while other
hosts are similarly passing streams to *other* motion detectors
on that same host. And, at the same time, other objects on
each of the mentioned hosts may be initiating actions of
other objects on other hosts *or* be called upon to act FBO
(for the benefit of) some other host's object(s).

Imagine a web server (object) being asked to serve up a
particular page. But, *it* has to contact a resolver
to determine where the page exists -- and, then contact
the host that serves the page\'s contents. Which may
necessitate a (server-side) reference to yet another
object hosted somewhere else.

The initial web client is still waiting on his results.
As *other* requests are coming in behind it. Each with
their own set of object dependencies...

For what \'n\' (!= N) and packet size/traffic pattern, will the
switch add latency to other packets (effectively reducing bandwidth
for that link)?

Basically ONLY if your \"N\" hosts are all trying to compete for a single
link (or some \"M\" links that don\'t allow for sufficient breaks in
traffic to let all the data out the door). But again, this is a problem
of your uplinks, and not the switch.

Think back to my PCI bus analogy. How often do you think comms
stop in an ongoing process? Is the (PCI bus) ever *idle*?
Your goal is to maximize communications within a design.
In a closed box, that's set by the bandwidth (and arbitration
scheme for multimasters) of the "PCI bus".

I.e., even if a process is blocking waiting on I/O, some other
process will use the available bandwidth.

A good switch will happily switch all 24 / 48 ports at line rate all day
every day.

It\'s usually just easier to have bigger uplink connections (e.g.
10/100G) making up the backbone.

I\'m not leaving the switch.

Then I\'m not really sure why you\'re asking about N hosts talking to an
M\'th host ... as all that traffic will enter (and exit) your switch ...

Any host (or hosts) on the switch can make a request (RPC)
of any other host at any time. Even for synchronous requests,
having *made* a request doesn't mean that other requests
(from other threads on your host) can't *also* be tickling the
switch -- possibly the same host that the first RPC targeted,
or possibly some other.

Nor does it mean that your host can't, at the same time, be
tickled by other hosts.

[Every box is a server for *something*; there are no
clean client-server distinctions like in typical network
services]

The switch is an approximation of a mesh. Under what conditions
does that approximation fall flat?

Decent manuals will provide three pieces of data for the switch:

- Non-Blocking Throughput -- should be equal to the number of ports.
How much can the switch transmit before it bogs down.

That assumes every destination port is "available".

- Switching Capacity -- should be 2x the number of ports. How much
total traffic the fabric can handle before it bogs down.

How does this differ from the above? Or, is this a measure of
how deep the internal store is?

- Forwarding Rate -- should be about 1.5x the number of ports. How
many frames the switch can process before it bogs down.

As long as you\'re within these specs, \"the switch\" is not impacting the
traffic at all.

If the switch is owned ENTIRELY by the application, then
these limits can be evaluated.

But, if other applications are also sharing the switch,
then you (me) have to be able to quantify *their* impact on
YOUR application.

So, either be able to create a "virtual switch" with guaranteed
performance for the hosts serviced by that "virtual switch"

OR

A separate PHYSICAL switch that has no other traffic to
contend with.

Imagine serving iSCSI on a switch intended to support
a certain type of "application traffic". Suddenly,
there\'s all of this (near continuous) traffic as
the fabric tries to absorb the raw disk I/O.

In conventional services, things just slow down. You
may wait many seconds before a request times out. And,
you may abandon that request.

But, if the application expects a certain type of performance
from the communication subsystem and something is acting as
a parasite, there, then what recourse does the application
have? It can't tell the switch "disable all other ports
because their activities are interfering with my expected
performance".

(this is the nature of what I want to be able to do in the
switch during development -- instead of isolating the applicable
hosts on another SEPARATE switch dedicated to their needs)
 
On 8/24/2023 11:17 AM, John Walliker wrote:
On Thursday, 24 August 2023 at 17:40:12 UTC+1, Dan Purgert wrote:
On 2023-08-24, Don Y wrote:
On 8/24/2023 8:30 AM, Dan Purgert wrote:
On 2023-08-24, Don Y wrote:
Is there any way to guarantee the bandwidth available to a
subgroup of ports on a (managed/L3?) switch?

Not really. You can try implementing QoS on the switch to tag priority,
but that doesn\'t necessarily \"guarantee\" bandwidth; noting, of course,
that most (all) decent switches have enough capacity in their switch
fabric to run all ports at line rate simultaneously.

But that assumes the destination port is always available.
E.g., if N-1 ports all try to deliver packets to the Nth
port, then the packets are queued in the switch, using switch
resources.
Yes and no. Switches have *very* small buffers (your typical \"decent
biz\" rack-mount option having somewhere in the realm of 16 MiB ...
maybe).

If a link gets so congested that the buffers fill, the switch will just
start dropping traffic on the floor. This in turn will typically act as
a signal for the sending parties to retry (and slow down).

Bear in mind that \"filling buffers\" only happens in cases like you\'ve
described -- if you\'ve got 10 hosts all talking amongst themselves
(rather than 9 trying to slam the 10th), a decent switch will keep up
with that forever.

Some switches implement flow control so that rather than dropping
packets on the floor they first ask the sender to slow down to prevent
the buffers from filling completely and they only drop packets if the
sender ignores the request and the buffers do fill up. This is usually
a configurable option in managed switches. Some unmanaged switches
use flow control, other don\'t.

But that's only marginally better than shitting yourself.
It lets the hosts know that something is wrong. But, there's
nothing they can do to *fix* it.

If the application/clients/servers have been designed with
the expectation of a certain level of performance ON A DEDICATED
SWITCH, failing to provide that (on a "shared" switch) just
doesn't work. You'd have to manually shut down other traffic
that could potentially compete with this application traffic.

Which decreases the value of keeping those \"other things\"
on that same switch.

I.e., if you can't make a switch-within-a-switch with
defined performance, then your best approach will be to
put a separate switch OUTSIDE it. (and, if traffic
needs to travel between the two -- e.g., debugging
information -- then tie them together for *just*
that role, isolating all of the other traffic)
 
On 2023-08-24, Don Y wrote:
On 8/24/2023 9:40 AM, Dan Purgert wrote:
Is there any way to guarantee the bandwidth available to a
subgroup of ports on a (managed/L3?) switch?

Not really. You can try implementing QoS on the switch to tag priority,
but that doesn\'t necessarily \"guarantee\" bandwidth; noting, of course,
that most (all) decent switches have enough capacity in their switch
fabric to run all ports at line rate simultaneously.

But that assumes the destination port is always available.
E.g., if N-1 ports all try to deliver packets to the Nth
port, then the packets are queued in the switch, using switch
resources.

Yes and no. Switches have *very* small buffers (your typical \"decent
biz\" rack-mount option having somewhere in the realm of 16 MiB ...
maybe).

If a link gets so congested that the buffers fill, the switch will just
start dropping traffic on the floor. This in turn will typically act as
a signal for the sending parties to retry (and slow down).

This would be A Bad Thing as it would have to rely on timeouts
to detect that packets are missing.

The whole idea is to drop frames early, so that you don't bog down other
parts of the network.

Note too that ethernet does include provisions for signalling frame
errors.

Bear in mind that \"filling buffers\" only happens in cases like you\'ve
described -- if you\'ve got 10 hosts all talking amongst themselves
(rather than 9 trying to slam the 10th), a decent switch will keep up
with that forever.

There are multiple executing threads on each host. Each host
\"serves\" some number of objects -- that can be referenced by
clients on other hosts (those other hosts serving objects
of their own). Additionally, the objects can be migrated to other
hosts as can the servers that back them. (so, traffic patterns
are highly dynamic)

Okay, so you've got basically a bog-standard network design ...

[...]
A good switch will happily switch all 24 / 48 ports at line rate all day
every day.

It\'s usually just easier to have bigger uplink connections (e.g.
10/100G) making up the backbone.

I\'m not leaving the switch.

Then I\'m not really sure why you\'re asking about N hosts talking to an
M\'th host ... as all that traffic will enter (and exit) your switch ...

Any host (or hosts) on the switch can make a request (RPC)
of any other host at any time. Even for synchronous requests,
having *made* a request doesn\'t mean that other requests
(from other threads on your host) can\'t *also* be tickling the
switch -- possibly the same host that the first RPC targeted,
or possibly some other.

This description still has your data "leaving" the switch.


The switch is an approximation of a mesh. Under what conditions
does that approximation fall flat?

Decent manuals will provide three pieces of data for the switch:

- Non-Blocking Throughput -- should be equal to the number of ports.
How much can the switch transmit before it bogs down.

That assumes every destination port is \"available\".

Well, yeah, a switch can't magically talk out a port that's not
connected to anything :)

But it's good to know that the switch can constantly transmit at the
combined line rate of all ports.

- Switching Capacity -- should be 2x the number of ports. How much
total traffic the fabric can handle before it bogs down.

How does this differ from the above? Or, is this a measure of
how deep the internal store is?

The switching capacity is "how fast can the switch shuffle stuff around
its ASIC". Slight correction to my initial statement: switching capacity
should be the sum of 2x the ports at all supported bandwidths.

Consider those super-cheap 5 port things you'll find at your local
big-box electronics store. They (might) only have a switching capacity of
5 gbps ... which cannot support the potential 10 gbps of traffic 5 hosts
connected to it could generate. BUT, well, it's not meant for a scenario
where you have 5 hosts talking amongst themselves at line rate.

As another example, I have a 48 port switch that includes 4x SFP cages
(2x 1G SFP + 2x 1/10G SFP+). As I recall, its switching capacity is on
the order of 104 gbps (i.e. 2x 52 [1 gbps] ports). So I know it'll never
become the bottleneck if I don't use the 10gbit SFP cages ... or if I do
need 10g switching, I have to give up a bit on the copper port capacity,
OR just accept that the switch WILL be a bottleneck if I'm trying to use
it at port capacity with at least one 10g card.
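
Putting numbers to that correction (a rough Python sketch; the port
counts are from the example above, the rest is illustrative arithmetic):

    # The 48-port example above, recomputed (illustrative only).
    copper_1g = 48
    sfp_1g    = 2
    sfp_10g   = 2   # SFP+ cages that can run at 1 G or 10 G

    # Figure quoted above: all 52 ports treated as 1 Gb/s.
    quoted_capacity_gbps = 2 * (copper_1g + sfp_1g + sfp_10g) * 1        # 104

    # What full line rate needs if both SFP+ cages actually run at 10 Gb/s.
    needed_gbps = 2 * ((copper_1g + sfp_1g) * 1 + sfp_10g * 10)          # 140

    print(quoted_capacity_gbps, needed_gbps)  # 104 < 140: potential bottleneck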

- Forwarding Rate -- should be about 1.5x the number of ports. How
many frames the switch can process before it bogs down.

As long as you\'re within these specs, \"the switch\" is not impacting the
traffic at all.

If the switch is owned ENTIRELY by the application, then
these limits can be evaluated.

Evaluated? They're right there in the datasheet, the work's been done
for you.


But, if other applications are also sharing the switch,
then you (me) have to be able to quantify *their* impact on
YOUR application.

Application? Like "program"? Switches don't operate with "applications".
They operate on ethernet frames.

Imagine serving iSCSI on a switch intended to support
a certain type of \"application traffic\". Suddenly,
there\'s all of this (near continuous) traffic as
the fabric tries to absorb the raw disk I/O.

iSCSI isn't served by a switch ... it's just SCSI commands from an
initiator to a target, wrapped in TCP. The ultimate bulk data transfer
on the network looks effectively like any other (TCP-based) data
transfer.

Target can only serve it back to the initiator as fast as it can upload
(e.g. 1gbps, although that's quite likely limited by disk read speed).
Likewise, initiator can only accept it as fast as it can download (e.g.
1gbps). And, well, a halfway decent switch can handle that all day every
day. If either "Target" or "Initiator" bogs down (because their 1gbps
link can only move 1gbps, and they want to do more than just transfer
block storage data back and forth), then frames start getting dropped,
and TCP starts backing off ... and the switch is not the bottleneck.

In conventional services, things just slow down. You
may wait many seconds before a request times out. And,
you may abandon that request.

But, if the application expects a certain type of performance
from the communication subsystem and something is acting as
a parasite, there, then what recourse does the application
have? It can\'t tell the switch \"disable all other ports
because their activities are interfering with my expected
performance\".

Your \"application\" is bottlenecked by your PC\'s network stack (and
ability to kick data onto the wire) before your theoretical switch gets
involved. If your \"application\" needs more throughput, you\'ll need to
handle it at the host it\'s running on. I mean, if we have a theoretical
switch with 500gpbs capacity, that\'s all going to waste if the 48 hosts
connected to it only have gbit ports ...



--
|_|O|_|
|_|_|O| Github: https://github.com/dpurgert
|O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860
 
On 2023-08-24, John Walliker wrote:
On Thursday, 24 August 2023 at 17:40:12 UTC+1, Dan Purgert wrote:
On 2023-08-24, Don Y wrote:
On 8/24/2023 8:30 AM, Dan Purgert wrote:
On 2023-08-24, Don Y wrote:
Is there any way to guarantee the bandwidth available to a
subgroup of ports on a (managed/L3?) switch?

Not really. You can try implementing QoS on the switch to tag priority,
but that doesn\'t necessarily \"guarantee\" bandwidth; noting, of course,
that most (all) decent switches have enough capacity in their switch
fabric to run all ports at line rate simultaneously.

But that assumes the destination port is always available.
E.g., if N-1 ports all try to deliver packets to the Nth
port, then the packets are queued in the switch, using switch
resources.
Yes and no. Switches have *very* small buffers (your typical \"decent
biz\" rack-mount option having somewhere in the realm of 16 MiB ...
maybe).

If a link gets so congested that the buffers fill, the switch will just
start dropping traffic on the floor. This in turn will typically act as
a signal for the sending parties to retry (and slow down).

Bear in mind that \"filling buffers\" only happens in cases like you\'ve
described -- if you\'ve got 10 hosts all talking amongst themselves
(rather than 9 trying to slam the 10th), a decent switch will keep up
with that forever.

Some switches implement flow control so that rather than dropping
packets on the floor they first ask the sender to slow down to prevent
the buffers from filling completely and they only drop packets if the
sender ignores the request and the buffers do fill up. This is usually
a configurable option in managed switches. Some unmanaged switches
use flow control, other don\'t.

Right. I'm trying to just keep to generalities, since the whole thing
seems like some big theorycraft with no real defined goal ...


--
|_|O|_|
|_|_|O| Github: https://github.com/dpurgert
|O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860
 
On 8/25/2023 4:35 AM, Dan Purgert wrote:
On 2023-08-24, Don Y wrote:
On 8/24/2023 9:40 AM, Dan Purgert wrote:
Is there any way to guarantee the bandwidth available to a
subgroup of ports on a (managed/L3?) switch?

Not really. You can try implementing QoS on the switch to tag priority,
but that doesn\'t necessarily \"guarantee\" bandwidth; noting, of course,
that most (all) decent switches have enough capacity in their switch
fabric to run all ports at line rate simultaneously.

But that assumes the destination port is always available.
E.g., if N-1 ports all try to deliver packets to the Nth
port, then the packets are queued in the switch, using switch
resources.

Yes and no. Switches have *very* small buffers (your typical \"decent
biz\" rack-mount option having somewhere in the realm of 16 MiB ...
maybe).

If a link gets so congested that the buffers fill, the switch will just
start dropping traffic on the floor. This in turn will typically act as
a signal for the sending parties to retry (and slow down).

This would be A Bad Thing as it would have to rely on timeouts
to detect that packets are missing.

The whole idea is to drop frames early, so that you don\'t bog down other
parts of the network.

You are missing the point. The network is used as part of the "memory bus"
in the system.

Simple program/expression/line-of-code:
    hypot = sqrt(sideA^2 + sideB^2)
But, sqrt() is an RPC. So, what you don't see in the statement
is the marshalling of the argument to sqrt(): its being wrapped
with appropriate controls to indicate to the stub on the receiving
host exactly which function is being invoked, along with the
format of the argument(s) and where the result should be "returned".

THEN, those are passed to the network "stack" for delivery to the
correct remote host.

And, the result awaited.

If packets are dropped, the sqrt() isn't executed so the code fails.
If dropped packets are DETECTED (and retransmitted), the sqrt()
works -- but at a presumably much slower rate.

A rate that can't be predicted for any particular "sqrt()" instance.

[But, you'd not "remote" something as trivial as a sqrt() function]
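
A minimal sketch of the kind of client-side stub being described
(Python, over UDP; the method id, wire format, and server address are
invented for illustration, not the system's actual protocol):

    import socket
    import struct

    SQRT_METHOD = 7                 # invented method identifier
    SERVER = ("10.0.0.7", 9000)     # invented address of the host backing sqrt()

    def remote_sqrt(x, timeout=0.5):
        """Client-side stub: marshal the call, ship it, await the result."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(timeout)
        # Marshal: method id + argument, packed in network byte order.
        sock.sendto(struct.pack("!Id", SQRT_METHOD, x), SERVER)
        # The caller blocks here just as if sqrt() were local -- except that
        # a congested or lossy fabric shows up as a timeout, not a result.
        reply, _ = sock.recvfrom(512)
        (result,) = struct.unpack("!d", reply)
        return result

    # hypot = sqrt(sideA^2 + sideB^2), with sqrt() actually remote:
    # hypot = remote_sqrt(sideA**2 + sideB**2)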

A more practical line of code:
speaker := recognizer(audio_source)
where audio_source might be the (stream) output of a BT microphone,
a podcast, the soundtrack from a movie, etc. And, recognizer is a
process that examines a set of templates that characterize particular
voices to be used to identify the speaker.

In a conventional uniprocessor, there is no possibility of
sqrt() not being executed -- unless the processor has crashed
or there is a memory fault. Memory is *directly* connected
to the CPU.

In a distributed multiprocessor, the interconnect medium is
often exposed/vulnerable.

You likely can't remove the DIMM that has the sqrt() code
in it from your PC and expect the PC to continue to function.
But, you *can* unplug the host (from the network) that implements
the sqrt() procedure.

Or, have other things happening on the network that interfere with
the EXPECTED performance of that remote procedure!

Note too that ethernet does include provisions for signalling frame
errors.

Bear in mind that \"filling buffers\" only happens in cases like you\'ve
described -- if you\'ve got 10 hosts all talking amongst themselves
(rather than 9 trying to slam the 10th), a decent switch will keep up
with that forever.

There are multiple executing threads on each host. Each host
\"serves\" some number of objects -- that can be referenced by
clients on other hosts (those other hosts serving objects
of their own). Additionally, the objects can be migrated to other
hosts as can the servers that back them. (so, traffic patterns
are highly dynamic)

Okay, so you\'ve got basically a bog-standard network design ...

The fabric is not exceptional. The system/application running atop
it, however, is far from typical. So, the traffic is atypical.

E.g., each process has its own namespace. In most systems, the
filesystem acts as the UNIFIED namespace for all processes
running on that system.

So, each process can have a (trivial) namespace defined as:
    /inputs
        /1
        /2
        /n
    /outputs
        /A
        /B
        /C
    /error
/inputs/1 for process X has nothing to do with /inputs/1 for
process Y. (just like stdin and stdout are each specific to
a particular process)

There is no way for process X to even *see* the namespace for
process Y as there is no global context that makes all of them
visible (even if only to a "superuser") -- unlike the shared
filesystem mentioned above.

When a process wants to resolve a name, it passes that identifier
to a local resolver bound to the namespace (object!) for its process.
The resolver parses, e.g., "/inputs/1" in the "top level Context"
and finds an object called "inputs". It then asks the inputs
object to resolve "1", ultimately yielding the object referenced.

[Sort of like a recursive DNS]
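
A toy sketch of that resolution walk (Python; the Context interface and
the in-process dictionary are stand-ins -- in the system being described
each resolve step is a remote method invocation on whichever host backs
that Context):

    class Context:
        """A namespace node; each binding may itself be a Context."""
        def __init__(self, bindings=None):
            self.bindings = bindings or {}

        def resolve_one(self, name):
            # In the real system this would be an RMI on the (possibly
            # remote) server backing this Context.
            return self.bindings[name]

    def resolve(root, path):
        """Walk '/inputs/1' one component at a time, like a recursive DNS."""
        node = root
        for component in path.strip("/").split("/"):
            node = node.resolve_one(component)
        return node

    # Process X's private namespace; process Y's "/inputs/1" is unrelated.
    ns_x = Context({"inputs": Context({"1": "some-sensor-object"}),
                    "outputs": Context({"A": "some-actuator-object"})})
    print(resolve(ns_x, "/inputs/1"))   # -> some-sensor-object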

The namespace object can reside anywhere in the system.
The top level Context object can also reside anywhere.
The \"inputs\" object (itself a Context) can reside anywhere.
The \"1\" object can reside anywhere.

Each object is accessed via an RPC (RMI, to be pedantic).

So, when process X on host Alpha tries to resolve "/inputs/1",
it may send a message to host Beta because Beta has the top
level context for process X's namespace. The referenced
object (the "inputs" Context) may, in turn, exist on host Gamma.
And, the "1" object referenced in that Context may exist on
host Delta.

*Or*, they might all (or some combination of) exist on
Alpha; or Beta; or...

[Expanding on the DNS analogy, you wouldn't, typically, have a DNS
server on Alpha redirecting you to another on Beta which redirects
to yet another on Gamma and eventually authoritative on Delta.
AND, wouldn't have a completely different set of servers involved
in process Y's name resolution! Instead, all DNS traffic would
be directed to one/few hosts for the entire set of clients.]

I dynamically monitor the location of objects (their servers)
to try to minimize the amount of network traffic and offset
the processing/storage needs of individual processes. So, I
may choose to migrate the namespace objects for a group of
processes (possibly residing on a dozen different hosts) to
one particular host that is lightly loaded *if* the requests
on those objects are infrequent enough that I can absorb the
cost of those RPCs. Processes that hammer on their namespaces
(cuz you can also modify a namespace) might be better served
if the namespace object (and the service backing it) resided
on the same host.
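
A sketch of the kind of placement heuristic that implies (Python; the
traffic matrix, threshold, and the naive "pull B onto A's host" rule
are all invented for illustration):

    # Toy placement heuristic: colocate chatty process pairs so their
    # traffic never touches the switch.  All numbers are invented.
    traffic_mbps = {                 # observed inter-process bandwidth
        ("X", "Y"): 600.0,
        ("X", "Z"): 2.0,
        ("Y", "Z"): 5.0,
    }
    placement = {"X": "alpha", "Y": "beta", "Z": "gamma"}
    COLOCATE_THRESHOLD_MBPS = 100.0  # above this, a migration is worth its cost

    def plan_migrations(traffic, placement, threshold):
        moves = []
        for (a, b), mbps in sorted(traffic.items(), key=lambda kv: -kv[1]):
            if mbps >= threshold and placement[a] != placement[b]:
                moves.append((b, placement[a]))   # naive: pull b onto a's host
                placement[b] = placement[a]
        return moves

    print(plan_migrations(traffic_mbps, placement, COLOCATE_THRESHOLD_MBPS))
    # -> [('Y', 'alpha')] : X and Y now share a host; their 600 Mb/s stays local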

[...]
A good switch will happily switch all 24 / 48 ports at line rate all day
every day.

It\'s usually just easier to have bigger uplink connections (e.g.
10/100G) making up the backbone.

I\'m not leaving the switch.

Then I\'m not really sure why you\'re asking about N hosts talking to an
M\'th host ... as all that traffic will enter (and exit) your switch ...

Any host (or hosts) on the switch can make a request (RPC)
of any other host at any time. Even for synchronous requests,
having *made* a request doesn\'t mean that other requests
(from other threads on your host) can\'t *also* be tickling the
switch -- possibly the same host that the first RPC targeted,
or possibly some other.

This description still has your data \"leaving\" the switch.

AFTER it has entered and possibly been queued *in* the switch.

[My \"I\'m not leaving the switch\" comment was meant as \"I\'m not
pushing packets to ANOTHER switch\"; all of my hosts are served
by \"the\" switch]

The switch is an approximation of a mesh. Under what conditions
does that approximation fall flat?

Decent manuals will provide three pieces of data for the switch:

- Non-Blocking Throughput -- should be equal to the number of ports.
How much can the switch transmit before it bogs down.

That assumes every destination port is \"available\".

Well, yeah, a switch can\'t magically talk out a port that\'s not
connected to anything :)

No, I meant "ready to accept NEW frames". If it is busy receiving
frames that were initiated "a bit sooner" (from some other host/port)
OR internally queued because multiple hosts (ports) tried to send
packets to it while it was busy, then that port is overloaded,
regardless of the (lack of?) activity on other ports.

But it\'s good to know that the switch can constantly transmit at the
combined line rate of all ports.

- Switching Capacity -- should be 2x the number of ports. How much
total traffic the fabric can handle before it bogs down.

How does this differ from the above? Or, is this a measure of
how deep the internal store is?

The switching capacity is \"how fast can the switch shuffle stuff around
its ASIC\". Slight correction to my initial statement, switching capacity
should be the sum of 2x the ports of all supported bandwidths.

Consider those super-cheap 5 port things you\'ll find at your local
big-box electronics store. They (might) only have a switching capacity of
5 gbps ... which cannot support the potential 10gbps of traffic 5 hosts
connected to it could generate. BUT, well, it\'s not meant for a scenario
where you have 5 hosts talking amongst themselves at line rate.

Wouldn't that depend on the internal architecture of the switch?

As another example, I have a 48 port switch that includes 4x SFP cages
(2x 1G SFP + 2x 1/10G SFP+). As I recall, its switching capacity is on
the order of 104 gbps (i.e. 2x 52 [1gpbs] ports). So I know it\'ll never
become the bottleneck if I don\'t use 10gbit SFP cages ... or if I do
need 10g switching, I have to give up a bit on the copper port capacity,
OR just accept that the switch WILL be a bottleneck if I\'m trying to use
it at port-capacity with at least 1x 10g card).

My goal is to size the switch as small as possible (to keep costs
low -- hundreds of PoE ports in even the smallest of installations;
I have 240, here) by dynamically exploiting traffic patterns. E.g.,
if the required (observed) bandwidth between process X and process Y
is \"high\", then colocating them on the same physical host is worth
the effort of migrating one (or both) of them. Even if that means
powering up another host just to have a place for them to co-exist
without consuming switch resources.

- Forwarding Rate -- should be about 1.5x the number of ports. How
many frames the switch can process before it bogs down.

As long as you\'re within these specs, \"the switch\" is not impacting the
traffic at all.

If the switch is owned ENTIRELY by the application, then
these limits can be evaluated.

Evaluated? They\'re right there in the datasheet, the work\'s been done
for you.

The switch's specs tell me nothing about how the *application* can
avail itself of its abilities/limitations. I have to evaluate the
performance of the *application* in the constraints of the *switch*.

But, if other applications are also sharing the switch,
then you (me) have to be able to quantify *their* impact on
YOUR application.

Appliation? like \"program\"? Switches don\'t operate with \"applications\".
They operate on ethernet frames.

The switch is an integral part of the applications that
rely on it to exchange data. Running a DBMS where the clients
can't talk to the DBMS server is kind of meaningless, eh?

Imagine serving iSCSI on a switch intended to support
a certain type of \"application traffic\". Suddenly,
there\'s all of this (near continuous) traffic as
the fabric tries to absorb the raw disk I/O.

iSCSI isn\'t served by a switch ... it\'s just SCSI commands from an
initiator to a target, wrapped in TCP. The ultimate bulk data transfer
on the network looks effectively like any other (TCP-based) data
transfer.

iSCSI relies on the switch to connect the initiators and
targets. The commands are insignificant packets; the payloads
are large and often \"ongoing\". Not short transactions that
engage and then release the resources of the switch.

Target can only serve it back to the initiator as fast as it can upload
(e.g. 1gpbs, although that\'s quite likely limited by disk read speed).

You can *write* as fast as the initiator can synthesize packets.
There is often a cache on the target so you can write faster than the
disk's write rate. Likewise, a read-ahead cache to read faster than
the disk can source data.

And, targets often have multiple spindles (striped or individual)
so the actual rate can easily exceed the rate of the individual
drives within. (I can keep four Gbe ports saturated with relative
ease -- and that\'s just one SAN)

Likewise, initiator can only accept it as fast as it can download (e.g.
1gpbs). And, well, a halfway decent switch can handle that all day every
day. If either \"Target\" or \"Initiator\" bogs down (because their 1gbps
link can only move 1gbps, and they want to do more than just transfer
block storage data back and forth), then frames start getting dropped,
and TCP starts backing off ... and the switch is not the bottleneck.

In conventional services, things just slow down. You
may wait many seconds before a request times out. And,
you may abandon that request.

But, if the application expects a certain type of performance
from the communication subsystem and something is acting as
a parasite, there, then what recourse does the application
have? It can\'t tell the switch \"disable all other ports
because their activities are interfering with my expected
performance\".

Your \"application\" is bottlenecked by your PC\'s network stack (and
ability to kick data onto the wire) before your theoretical switch gets

You're thinking of bog-standard hardware and OSs. No PCs involved, here.

I move memory pages directly onto the wire by prepending an appropriate
header and letting the NIC pull the data straight out of the page
(jumbo frames, no packet reassembly required, etc.).

Similarly, on the receiving end, the packet payload is dropped
into a page and the page then mapped into the receiving process's
address space.

Why pass the page -- or, fragments of the page -- up and down the stack
when you can just put it where it belongs?

I.e., I can keep every link running flat out continuously while
the hosts are busy doing "real work"; there's no *host* processing
involved in moving the data -- other than passing a pointer to
an ISR so it knows which page(s) to move, next.

Typical network stacks are too slow and bloated. This is also the case
with filesystem interfaces, etc. (e.g., if you want a general solution
to a general problem, you get "meh" performance)
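
There's no way to reproduce that from a stock user-space stack, but the
shape of the idea -- prepend a small header and hand the page to the NIC
without first copying it into a combined buffer -- looks roughly like
scatter-gather I/O. A Python sketch (header layout, addresses, and page
size are invented):

    import socket
    import struct

    PAGE_SIZE = 4096
    page = bytearray(PAGE_SIZE)      # stand-in for the memory page to ship

    def send_page(sock, dest, obj_id, seq, page):
        # Small header naming the destination object and sequence number
        # (layout invented for illustration), followed by the page itself.
        header = struct.pack("!IIH", obj_id, seq, len(page))
        # Scatter-gather send: the kernel gathers both buffers; the page is
        # never copied into a combined user-space buffer first.
        sock.sendmsg([header, memoryview(page)], [], 0, dest)

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    send_page(sock, ("10.0.0.9", 9000), obj_id=42, seq=0, page=page)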

involved. If your \"application\" needs more throughput, you\'ll need to
handle it at the host it\'s running on. I mean, if we have a theoretical
switch with 500gpbs capacity, that\'s all going to waste if the 48 hosts
connected to it only have gbit ports ...

Think: many hundreds of ports. Would you want many Tbps of capacity?
Or, would you want a smarter way of (re)shaping the traffic to give
you the performance that would be available, there?

(e.g., if you have to leave A switch to get to ANOTHER switch, then
that becomes a scarce resource -- 48 ports on switch #1 each wanting
to talk AT LINE SPEED to 48 other ports on switch #2)

Most classical networks don't have such traffic patterns.
There are servers and there are clients.

With IoT, every server is likely also a client. And, it's
easy to have ten-fold more "devices" than "computers" in an
organization! Welcome to the future.
 
....
You are missing the point.  The network is used as part of the \"memory bus\"
in the system.

Simple program/expression/line-of-code:
    hypot = sqrt(sideA^2 + sideB^2)
But, sqrt() is an RPC.  So, what you don\'t see in the statement
is the marshalling of the argument to sqrt(), it\'s being wrapped
with appropriate controls to indicate to the stub on the receiving
host exactly which function is being invoked along with the
format of the argument(s) and where the result should be \"returned\".

THEN, those are passed to the network \"stack\" for delivery to the
correct remote host.

And, the result awaited.

I am not following the switch etc. but I would strongly advise against
going into these waters.
All of it is meant to speed things up, *not* as a tcp replacement, i.e.
it remains a "best effort delivery".
The only thing you can do is to put some time limit to a response you
get on a tcp connection and retry the whole transaction if it fails,
after closing it (so it will be reset on a subsequent attempt to be
talked to by the peer if it is not aware it has been closed).

Just let the switch people care about their switches' speed, packet
losses etc.; I would choose on a "that one works fine, stick to
it for now" basis.

I know well you are aware of these basics, just thought a look
from "outside" of it might be useful.

======================================================
Dimiter Popoff, TGI http://www.tgi-sci.com
======================================================
http://www.flickr.com/photos/didi_tgi/
 
On 8/25/2023 6:34 AM, Dimiter_Popoff wrote:
....
You are missing the point.  The network is used as part of the \"memory bus\"
in the system.

Simple program/expression/line-of-code:
     hypot = sqrt(sideA^2 + sideB^2)
But, sqrt() is an RPC.  So, what you don\'t see in the statement
is the marshalling of the argument to sqrt(), it\'s being wrapped
with appropriate controls to indicate to the stub on the receiving
host exactly which function is being invoked along with the
format of the argument(s) and where the result should be \"returned\".

THEN, those are passed to the network \"stack\" for delivery to the
correct remote host.

And, the result awaited.

I am not following the switch etc. but I would strongly advise against
going into these waters.

I'm not sure I understand which "waters" you're referencing?

All of it is meant to speed things up, *not* as a tcp replacement, i.e.
it remains a \"best effort delivery\".

In a generic network, yes. Because you have to accommodate "foreign"
devices intermixed with your own. And, failures/faults/misuses.

But, if YOU control all of the devices AND can shape the traffic,
then you can get predictable performance even with protocols
that would be "unreliable" in generic deployments.

If you wanted to design an appliance with performance X, you can
do so as long as you don't let anything else compete for the resources
that you RELY ON. If I added a "foreign" bus mastering
device to your netmca, you'd not expect it to perform as advertised
without knowing how my device was competing for its resources.
Instead, you'd likely lock the design so that such interference
wasn't possible.

*Generic* networks are designed /ad hoc/ and your expectation
from such a design is "whatever I can get -- NOW!" But, that
doesn't mean all networks have to be so accommodating! E.g., you
pass a CAN message on an automotive network and EXPECT it to arrive
promptly and intact. Because the owner isn't going to ADD devices
to that network. Ditto industrial automation networks, etc.

You design the network -- and protocols -- to meet the needs
of the application that it hosts and the level of reliability
you expect.

[E.g., CAN networks aren't secure from tampering -- regardless
of whether or not malevolent. But, the designers assume that
one can't tamper with a vehicle in motion and won't tamper with
a vehicle that's parked. ooops!]

The only thing you can do is to put some time limit to a response you
get on a tcp connection and retry the whole transaction if it fails,
after closing it (so it will be reset on a subsequent attempt to be
talked to by the peer if it is not aware it has been closed).

TCP is a heavyweight protocol. It's costly to set up connections
(so, you'd want to tunnel under a semi-persistent connection, instead)
and consumes a lot of resources relative to something as trivial
as UDP. If you aren't worried about other hosts jabbering or
the target being "over busy" (and missing the packet), then why
go to the expense of setting up a connection for a short transaction?
(esp if transactions are frequent)

E.g., a TCP connection per RPC/RMI would dramatically impact the
cost of each operation!
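
For the short-transaction case, the usual middle ground is UDP with an
explicit timeout/retry and idempotent (or sequence-numbered) requests.
A minimal sketch (Python; the port, framing, and retry counts are
illustrative assumptions):

    import socket
    import struct

    def rpc_call(payload, server=("10.0.0.7", 9000), retries=3, timeout=0.25):
        """Best-effort request/response over UDP: no connection setup or
        teardown; the request must be idempotent (or carry a sequence
        number the server can use to suppress duplicates)."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(timeout)
        seq = 1
        request = struct.pack("!I", seq) + payload
        for _ in range(retries):
            sock.sendto(request, server)
            try:
                reply, _ = sock.recvfrom(1500)
            except socket.timeout:
                continue                  # lost request or lost reply: retry
            if struct.unpack("!I", reply[:4])[0] == seq:
                return reply[4:]
        raise TimeoutError("RPC failed after %d attempts" % retries)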

Just let the switch people care about their switches speed, packet
losses etc., I would choose on a basis \"that one works fine, stick to
it for now\" basis.

My immediate need is to decide if I should add another 24-port switch
to the office (for development) and squeeze the test platform nodes
onto unused ports, there. Or, add a smaller switch PLUS another
switch dedicated to just the test platform nodes, linked into
the other switch with a single connection that I can deliberately
choose to idle when the test platform is operating (and not have
that switch aware of or affected by any of the traffic on the other
~60 ports in the office)

In a real deployment (no development hardware involved), there are
agents (AIs) in the codebase that watch the system\'s performance
and interactions and invoke reconfiguration actions to move
processes to hosts where their intercommunications will have less
impact on the switch (and likely incur less latency).

In some cases, this requires informing the owner/operator/user
that cables need to be rerouted because I/Os associated with
those hosts demand certain processes reside on them (and, thus,
can\'t migrate without likely taking on a big communications hit).
E.g., I wouldn't want to run the DBMS on a different node than
the node that has the mass storage IOs as that's a lot of "disk"
traffic to route over the network needlessly.

I know well you are aware of these basics, just thought like a look
from \"outside\" of it might be useful.
 
On 8/25/2023 17:08, Don Y wrote:
On 8/25/2023 6:34 AM, Dimiter_Popoff wrote:
....
You are missing the point.  The network is used as part of the
\"memory bus\"
in the system.

Simple program/expression/line-of-code:
     hypot = sqrt(sideA^2 + sideB^2)
But, sqrt() is an RPC.  So, what you don\'t see in the statement
is the marshalling of the argument to sqrt(), it\'s being wrapped
with appropriate controls to indicate to the stub on the receiving
host exactly which function is being invoked along with the
format of the argument(s) and where the result should be \"returned\".

THEN, those are passed to the network \"stack\" for delivery to the
correct remote host.

And, the result awaited.

I am not following the switch etc. but I would strongly advise against
going into these waters.

I\'m not sure I understand which \"waters\" you\'re referencing?

I meant reinventing tcp; I have seen attempts at it time and again.

All of it is meant to speed things up, *not* as a tcp replacement, i.e.
it remains a \"best effort delivery\".

In a generic network, yes.  Because you have to accommodate \"foreign\"
devices intermixed with your own.  And, failures/faults/misuses.

But, if YOU control all of the devices AND can shape the traffic,
then you can get predictable performance even with protocols
that would be \"unreliable\" in generic deployments.

But by giving up on tcp or a tcp replacement you have already given
up on "safe" functionality. Even an rs-232 link is not guaranteed
to be safe unless you implement a "tcp replacement", i.e.
some request-response-ack/nak mechanism.


If you wanted to design an appliance with performance X, you can
do so as long as you don\'t let anything else compete for the resources
that you RELY ON.  If I added a \"foreign\" bus mastering
device to your netmca, you\'d not expect it to perform as advertised
without knowing how my device was competing for its resources.

Not sure I would want to know anything beyond the "bandwidth" I can
get. I have had customers in a South Africa facility who were not
allowed to use a good link they had, so I had to support them over
something like 30 kbps or so - slowly, but it worked eventually
(countless RFB reconnections etc. of course).

TCP is a heavyweight protocol.  It\'s costly to set up connections
(so, you\'d want to tunnel under a semi-persistent connection, instead)
and consumes a lot of resources relative to something as trivial
as UDP.

Omitting that is what I am advising against (obviously not knowing
enough details). A tcp connection takes a syn-synack-ack and
eventually a fin-finack; if you want to make sure a button was
pressed at the other side, I know of no simpler way of doing it
safely...
If you don't have the muscle on your mcu I know how this goes
all too well, but if you use udp you will only *add* work if
you are after having a tcp replacement. If you don't need more
than a few bytes, just write a small/limited tcp stack; doing it
over udp will only mean more work.

If you aren\'t worried about other hosts jabbering or
the target being \"over busy\" (and missing the packet), then why
go to the expense of setting up a connection for a short transaction?
(esp if transactions are frequent)

E.g., a TCP connection per RPC/RMI would dramatically impact the
cost of each operation!

Not that much, I think. You have to choose between
reliable or best effort, this is the bottom line. Your RPC/RMI
will have to deal with dropped/reinitiated/reset connections
just like anybody else on the network; there is no middle ground,
this is my point.
 
On 2023-08-25, Don Y wrote:
On 8/25/2023 4:35 AM, Dan Purgert wrote:
On 2023-08-24, Don Y wrote:
On 8/24/2023 9:40 AM, Dan Purgert wrote:
Is there any way to guarantee the bandwidth available to a
subgroup of ports on a (managed/L3?) switch?

Not really. You can try implementing QoS on the switch to tag priority,
but that doesn\'t necessarily \"guarantee\" bandwidth; noting, of course,
that most (all) decent switches have enough capacity in their switch
fabric to run all ports at line rate simultaneously.

But that assumes the destination port is always available.
E.g., if N-1 ports all try to deliver packets to the Nth
port, then the packets are queued in the switch, using switch
resources.

Yes and no. Switches have *very* small buffers (your typical \"decent
biz\" rack-mount option having somewhere in the realm of 16 MiB ...
maybe).

If a link gets so congested that the buffers fill, the switch will just
start dropping traffic on the floor. This in turn will typically act as
a signal for the sending parties to retry (and slow down).

This would be A Bad Thing as it would have to rely on timeouts
to detect that packets are missing.

The whole idea is to drop frames early, so that you don\'t bog down other
parts of the network.

You are missing the point. The network is used as part of the \"memory
bus\" in the system.

With the assumption of a suitably capable switch, and "N" hosts
connected to it, any communications are inherently limited by the link
between the switch and the host(s).

Any given host cannot shove data up to the switch any faster than its
interface speed (e.g. 1gbps).

Likewise, the switch cannot shove data down to any given host faster
than its interface speed (1gbps).

If 5 hosts are all shoving data out to a 6th host at 1gbps
simultaneously, then either

(A) Host_6 needs a 10gbps link
(B) Switch (eventually) starts dropping traffic, ideally while
telling Hosts{1..5} to slow down (but flow-control is an
optional feature on switches).

That's it, there's nothing else network-side you can do. The potential
for dropped / lost traffic is going to be handled in your host's network
stack (e.g. TCP resends), or in your application directly (or both).

Note too that ethernet does include provisions for signalling frame
errors.

Bear in mind that \"filling buffers\" only happens in cases like you\'ve
described -- if you\'ve got 10 hosts all talking amongst themselves
(rather than 9 trying to slam the 10th), a decent switch will keep up
with that forever.

There are multiple executing threads on each host. Each host
\"serves\" some number of objects -- that can be referenced by
clients on other hosts (those other hosts serving objects
of their own). Additionally, the objects can be migrated to other
hosts as can the servers that back them. (so, traffic patterns
are highly dynamic)

Okay, so you\'ve got basically a bog-standard network design ...

The fabric is not exceptional. The system/application running atop
it, however, is far from typical. So, the traffic is atypical.

E.g., each process has its own namespace. In most systems, the
filesystem acts as the UNIFIED namespace for all processes
running on that system.

So, each process can have a (trivial) namespace defined as:
/inputs
/1
/2
/n
/outputs
/A
/B
/C
/error
/inputs/1 for process X has nothing to do with /inputs/1 for
process Y. (just like stdin and stdout are each specific to
a particular process)

There is no way for process X to even *see* the namespace for
process Y as there is no global context that makes all of them
visible (even if only to a \"superuser\") -- unlike the shared
filesystem mentioned above.

And this is exactly how inter-host communication works in a standard
ethernet network. No one host knows what any other host is doing at any
particular time before it starts transmitting ...

[Expanding on the DNS analogy, you wouldn\'t, typically, have a DNS
server on Alpha redirecting you to another on Beta which redirects
to yet another on Gamma and eventually authoritative on Delta.
AND, wouldn\'t have a completely different set of servers involved
in process Y\'s name resolution! Instead, all DNS traffic would
be directed to one/few hosts for the entire set of clients.]

That's exactly how DNS works though, assuming that Alpha/Beta/Gamma
haven't already cached the answers received from Delta (or Epsilon,
etc.).

[...]
A good switch will happily switch all 24 / 48 ports at line rate all day
every day.

It\'s usually just easier to have bigger uplink connections (e.g.
10/100G) making up the backbone.

I\'m not leaving the switch.

Then I\'m not really sure why you\'re asking about N hosts talking to an
M\'th host ... as all that traffic will enter (and exit) your switch ...

Any host (or hosts) on the switch can make a request (RPC)
of any other host at any time. Even for synchronous requests,
having *made* a request doesn\'t mean that other requests
(from other threads on your host) can\'t *also* be tickling the
switch -- possibly the same host that the first RPC targeted,
or possibly some other.

This description still has your data \"leaving\" the switch.

AFTER it has entered and possibly been queued *in* the switch.

[My \"I\'m not leaving the switch\" comment was meant as \"I\'m not
pushing packets to ANOTHER switch\"; all of my hosts are served
by \"the\" switch]

The only time data will be queued "in" the switch is on an egress
buffer, but at that point it's already exited the switch ASIC (and all
the "important work" is done).

Think of it kinda like an amusement park ride that has a photobooth at
the exit -- the fact that the group you got off the ride with is waiting
for their photo to show up on the TV doesn't slow the ride down any.

The switch is an approximation of a mesh. Under what conditions
does that approximation fall flat?

Decent manuals will provide three pieces of data for the switch:

- Non-Blocking Throughput -- should be equal to the number of ports.
How much can the switch transmit before it bogs down.

That assumes every destination port is \"available\".

Well, yeah, a switch can\'t magically talk out a port that\'s not
connected to anything :)

No, I meant \"ready to accept NEW frames\". If it is busy receiving
frames that were initiated \"a bit sooner\" (from some other host/port)
OR internally queued because multiple hosts (ports) tried to send
packets to it while it was busy, then that port is overloaded,
regardless of the (lack of?) activity on other ports

Non-Blocking Throughput is *only* concerned with transmission (i.e.
egress). Receipt is nonblocking (the blocker would be the host at the
other end not being able to sustain line rate).

But it\'s good to know that the switch can constantly transmit at the
combined line rate of all ports.

- Switching Capacity -- should be 2x the number of ports. How much
total traffic the fabric can handle before it bogs down.

How does this differ from the above? Or, is this a measure of
how deep the internal store is?

The switching capacity is \"how fast can the switch shuffle stuff around
its ASIC\". Slight correction to my initial statement: switching capacity
should be 2x the sum of the line rates of all the ports, across all
supported speeds.

Consider those super-cheap 5 port things you\'ll find at your local
big-box electronics store. They (might) only have a switching capacity of
5 gbps ... which cannot support the potential 10gbps of traffic 5 hosts
connected to it could generate. BUT, well, it\'s not meant for a scenario
where you have 5 hosts talking amongst themselves at line rate.

Wouldn\'t that depend on the internal architecture of the switch?

In some sense, I mean you *can* get cheapo switches that don\'t have
enough switching capacity to sustain line rate ...

As another example, I have a 48 port switch that includes 4x SFP cages
(2x 1G SFP + 2x 1/10G SFP+). As I recall, its switching capacity is on
the order of 104 gbps (i.e. 2x 52 [1gbps] ports). So I know it\'ll never
become the bottleneck if I don\'t use 10gbit SFP cages ... or if I do
need 10g switching, I have to give up a bit on the copper port capacity,
OR just accept that the switch WILL be a bottleneck if I\'m trying to use
it at port-capacity with at least 1x 10g card.

My goal is to size the switch as small as possible (to keep costs
low -- hundreds of PoE ports in even the smallest of installations;
I have 240, here) by dynamically exploiting traffic patterns. E.g.,
if the required (observed) bandwidth between process X and process Y
is \"high\", then colocating them on the same physical host is worth
the effort of migrating one (or both) of them. Even if that means
powering up another host just to have a place for them to co-exist
without consuming switch resources.

So you don\'t want a switch, you want Software Defined Networking (SDN),
which tends to lean \"not cheap\".


- Forwarding Rate -- should be about 1.5x the number of ports. How
many frames the switch can process before it bogs down.

As long as you\'re within these specs, \"the switch\" is not impacting the
traffic at all.
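
[To put rough numbers on those three figures for a hypothetical 48-port
gigabit switch: non-blocking throughput would be ~48 gbps (every port
transmitting at line rate at once), switching capacity ~96 gbps (2x,
since each port can be receiving 1 gbps while also transmitting 1 gbps),
and forwarding rate ~71 Mpps, since a gigabit port carries at most about
1.488 million minimum-size 64-byte frames per second and 48 x 1.488 is
roughly 71. Figures for a real switch will differ; the datasheet is
authoritative.]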

If the switch is owned ENTIRELY by the application, then
these limits can be evaluated.

Evaluated? They\'re right there in the datasheet, the work\'s been done
for you.

The switch\'s specs tell me nothing about how the *application* can
avail itself of its abilities/limitations. I have to evaluate the
performance of the *application* in the constraints of the *switch*.

\"Applications\" hand off to the networking stack, which ultimately talks
in packets, frames, and bits on the wire.

So, if you want to know if \"an application\" will not starve from the
network, you need to know what kind of data rates \"an application\" needs
to sustain itself ... and provide \"the host\" with a sufficiently fast
network card (or use link aggregation).


But, if other applications are also sharing the switch,
then you (me) have to be able to quantify *their* impact on
YOUR application.

Application? Like \"program\"? Switches don\'t operate with \"applications\".
They operate on ethernet frames.

The switch is an integral part of the applications that
rely on it to exchange data. Running a DBMS where the clients
can\'t talk to the DBMS server is kind of meaningless, eh?

But \"the application\" doesn\'t talk on the wire. \"the network card\"
does, and \"the application\" is constrained by the host\'s network card.

\"An Application\" can\'t suck down 10gbps if the host only has a 1gbps
link...

Imagine serving iSCSI on a switch intended to support
a certain type of \"application traffic\". Suddenly,
there\'s all of this (near continuous) traffic as
the fabric tries to absorb the raw disk I/O.

iSCSI isn\'t served by a switch ... it\'s just SCSI commands from an
initiator to a target, wrapped in TCP. The ultimate bulk data transfer
on the network looks effectively like any other (TCP-based) data
transfer.

iSCSI relies on the switch to connect the initiators and
targets. The commands are insignificant packets; the payloads
are large and often \"ongoing\". Not short transactions that
engage and then release the resources of the switch.

They certainly are. Nothing more than a buncha frames whizzing past at
1gbps.

Maybe my Target has a 10g link ... but it can still only send at 1gbps
to any one initiator because that initiator only has a 1g link (and when
it tries sending faster, frames start getting dropped, and it backs
off).

Target can only serve it back to the initiator as fast as it can upload
(e.g. 1gbps, although that\'s quite likely limited by disk read speed).

You can *write* as fast as the initiator can synthesize packets.
There is often a cache on the target so you can write faster than the
disk\'s write rate. Likewise, a read-ahead cache to read faster than
the disk can source data.

And those are both only so deep. The point here is that there are
limits outside of the switch in the middle that have just as much (if
not more) impact on things than the switch does.

In conventional services, things just slow down. You
may wait many seconds before a request times out. And,
you may abandon that request.

But, if the application expects a certain type of performance
from the communication subsystem and something is acting as
a parasite, there, then what recourse does the application
have? It can\'t tell the switch \"disable all other ports
because their activities are interfering with my expected
performance\".

Your \"application\" is bottlenecked by your PC\'s network stack (and
ability to kick data onto the wire) before your theoretical switch gets

You\'re thinking of bog-standard hardware and OSs. No PCs involved, here.

I move memory pages directly onto the wire by prepending an appropriate
header and letting the NIC pull the data straight out of the page
(jumbo frames, no packet reassembly required, etc.).

Fine, s/PC/NIC/ then.

In either case, frames can only be kicked over the wire as fast as the
card\'s PHY / MII is capable of sustaining (plus upstream CPU dumping
data to the card, etc).

[...]
involved. If your \"application\" needs more throughput, you\'ll need to
handle it at the host it\'s running on. I mean, if we have a theoretical
switch with 500gbps capacity, that\'s all going to waste if the 48 hosts
connected to it only have gbit ports ...

Think: many hundreds of ports. Would you want many tbps of capacity?
Or, would you want a smarter way of (re)shaping the traffic to give
you the performance that would be available, there?

If the network warranted \"many hundreds of ports\", then most certainly I
would want switching capacity of \"2x many hundreds of ports\" (at a
minimum, and probably higher, to account for using 10/100g links where
necessary).


It might not be \"common\" for the smallish networks (~200 hosts max) I
work with to *NEED* to move 100gbps / switch around .. but I\'m also not
going to pretend that they\'ll be better off if they just reshape their
traffic patterns. Network\'s there to take whatever abuse they wanna
throw at it, and I don\'t have to worry.

(e.g., if you have to leave A switch to get to ANOTHER switch, then
that becomes a scarce resource -- 48 ports on switch #1 each wanting
to talk AT LINE SPEED to 48 other ports on switch #2)

Pretty standard stuff. Either use a 100g port as the uplink between the two
switches, or interconnect their backplanes. Really depends on what the
switch\'s capabilities are. The backplane ones are really nice, because
after you define the master, the rest just look like daughtercards
(rather than N individual switches as with 10/100g uplinks)... but they
also cost an arm and a leg.

--
|_|O|_|
|_|_|O| Github: https://github.com/dpurgert
|O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860
 
On Fri, 25 Aug 2023 17:45:24 +0300, Dimiter_Popoff <dp@tgi-sci.com>
wrote:

On 8/25/2023 17:08, Don Y wrote:
On 8/25/2023 6:34 AM, Dimiter_Popoff wrote:
....
You are missing the point.  The network is used as part of the
\"memory bus\"
in the system.

Simple program/expression/line-of-code:
     hypot = sqrt(sideA^2 + sideB^2)
But, sqrt() is an RPC.  So, what you don\'t see in the statement
is the marshalling of the argument to sqrt(), it\'s being wrapped
with appropriate controls to indicate to the stub on the receiving
host exactly which function is being invoked along with the
format of the argument(s) and where the result should be \"returned\".

THEN, those are passed to the network \"stack\" for delivery to the
correct remote host.

And, the result awaited.

I am not following the switch etc. but I would strongly advise against
going into these waters.

I\'m not sure I understand which \"waters\" you\'re referencing?

I meant reinventing tcp, I have seen attempts on it time and again.


All of it is meant to speed things up, *not* as a tcp replacement, i.e.
it remains a \"best effort delivery\".

In a generic network, yes.  Because you have to accommodate \"foreign\"
devices intermixed with your own.  And, failures/faults/misuses.

But, if YOU control all of the devices AND can shape the traffic,
then you can get predictable performance even with protocols
that would be \"unreliable\" in generic deployments.

But by giving up on tcp (or a tcp replacement) you have already given
up on \"safe\" functionality. Even an rs-232 link is not guaranteed
to be safe unless you implement a \"tcp replacement\", i.e.
some request-response-ack/nak mechanism.


If you wanted to design an appliance with performance X, you can
do so as long as you don\'t let anything else compete for the resources
that you RELY ON.  If I added a \"foreign\" bus mastering
device to your netmca, you\'d not expect it to perform as advertised
without knowing how my device was competing for its resources.

Not sure I would want to know anything beyond \"bandwidth\" I can
get. I have had customers in a South Africa facility who were not
allowed to use a good link they had so I had to support them over
something about 30 kbps or so - slowly but it worked eventually
(countless RFB reconnections etc. of course).

TCP is a heavyweight protocol.  It\'s costly to set up connections
(so, you\'d want to tunnel under a semi-persistent connection, instead)
and consumes a lot of resources relative to something as trivial
as UDP.

Omitting that is what I am advising against (obviously not knowing
enough details). A tcp connection takes a syn-synack-ack and
eventually a fin-finack; if you want to make sure a button was
pressed at the other side I know of no simpler way of doing it
safely...
If you don\'t have the muscle on your mcu I know how this goes
all too well, but if you use udp you will only *add* work if
what you are after is a tcp replacement. If you don\'t need more
than a few bytes, just write a small/limited tcp stack; doing it
over udp will only mean more work.

If you aren\'t worried about other hosts jabbering or
the target being \"over busy\" (and missing the packet), then why
go to the expense of setting up a connection for a short transaction?
(esp if transactions are frequent)

E.g., a TCP connection per RPC/RMI would dramatically impact the
cost of each operation!

Not that much I think. You have to either choose between
reliable or best effort, this is the bottom line. Your RPC/RMI
will have to deal with dropped/reinitiated/reset connections
just like anybody else on the network, there is no middle ground,
this is my point.

TCP has a very specific niche, one that does not fit all uses, and TCP
has all manner of interesting failure modes. Nor does TCP support
multicast.

UDP plus some application code can actually be far simpler than TCP.

And for \"safety\" applications, such as controlling a dangerous machine
of some kind, an interlocked handshake protocol based on UDP is far
far cleaner and simpler than anything else, and can be proven safe.
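
A minimal sketch of what such an interlock can look like, with the frame
layout, timeout, and retry policy purely illustrative (sender side shown;
the receiver acks every frame it sees and acts on a given sequence number
only once, so duplicates are harmless):

    import socket
    import struct

    # Illustrative frame: 4-byte sequence number + 1-byte command code.
    def send_command(sock, dest, seq, cmd, retries=5, timeout=0.2):
        frame = struct.pack("!IB", seq, cmd)
        sock.settimeout(timeout)
        for _ in range(retries):
            sock.sendto(frame, dest)
            try:
                ack, _ = sock.recvfrom(16)
            except socket.timeout:
                continue              # lost request or lost ack: resend
            if len(ack) >= 4 and struct.unpack("!I", ack[:4])[0] == seq:
                return True           # peer has definitely seen (and acted on) seq
        return False                  # no confirmation: fail safe, e.g. stop the machine

    # usage (address invented):
    # sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # ok = send_command(sock, ("192.168.1.50", 9999), seq=1, cmd=0x01)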

Joe Gwinn
 
On 8/25/2023 8:03 AM, Dan Purgert wrote:
Yes and no. Switches have *very* small buffers (your typical \"decent
biz\" rack-mount option having somewhere in the realm of 16 MiB ...
maybe).

If a link gets so congested that the buffers fill, the switch will just
start dropping traffic on the floor. This in turn will typically act as
a signal for the sending parties to retry (and slow down).

This would be A Bad Thing as it would have to rely on timeouts
to detect that packets are missing.

The whole idea is to drop frames early, so that you don\'t bog down other
parts of the network.

You are missing the point. The network is used as part of the \"memory
bus\" in the system.

With the assumption of a suitably capable switch, and \"N\" hosts
connected to it, any communications are inherently limited by the link
between the switch and the host(s).

The *average* communication rate is thusly limited. The elastic
store in the switch allows that to be exceeded, depending on how
deep the store and the nature of the traffic hammering on it.

Any given host cannot shove data up to the switch any faster than its
interface speed (e.g. 1gbps).

Likewise, the switch cannot shove data down to any given host faster
than its interface speed (1gbps).

If 5 hosts are all shoving data out to a 6th host at 1gbps
simultaneously, then either

(A) Host_6 needs a 10gbps link
(B) Switch (eventually) starts dropping traffic, ideally while
telling Hosts{1..5} to slow down (but flow-control is an
optional feature on switches).

(C) You\'ve codified the characteristics of the switch and know
that it can sustain an overload of X on Y ports for Z seconds.
Else, why have buffers IN switches?
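
[Rough numbers for that case, assuming the whole ~16 MiB of shared buffer
mentioned earlier could be devoted to that one egress port: 5 ports in at
1 gbps, 1 port out at 1 gbps, so the backlog grows at ~4 gbps, i.e. about
500 MB/s. 16 MiB at 500 MB/s is on the order of 30 ms of sustained
overload before frames start hitting the floor, and real switches give
each port far less than the full buffer, so that is an upper bound.]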

That\'s it, there\'s nothing else network-side you can do. The potential
for dropped / lost traffic is going to be handled in your host\'s network
stack (e.g. TCP resends), or in your application directly (or both).

Okay, so you\'ve got basically a bog-standard network design ...

The fabric is not exceptional. The system/application running atop
it, however, is far from typical. So, the traffic is atypical.

E.g., each process has its own namespace. In most systems, the
filesystem acts as the UNIFIED namespace for all processes
running on that system.

So, each process can have a (trivial) namespace defined as:
/inputs
    /1
    /2
    /n
/outputs
    /A
    /B
    /C
/error
/inputs/1 for process X has nothing to do with /inputs/1 for
process Y. (just like stdin and stdout are each specific to
a particular process)

There is no way for process X to even *see* the namespace for
process Y as there is no global context that makes all of them
visible (even if only to a \"superuser\") -- unlike the shared
filesystem mentioned above.

And this is exactly how inter-host communication works in a standard
ethernet network. No one host knows what any other host is doing at any
particular time before it starts transmitting ...

And hosts don\'t tend to chatter amongst themselves that much.
I.e., limit the bandwidth of your databus inside your CPU
to 1Gb/s and tell me how that affects performance?

[Expanding on the DNS analogy, you wouldn\'t, typically, have a DNS
server on Alpha redirecting you to another on Beta which redirects
to yet another on Gamma and eventually authoritative on Delta.
AND, wouldn\'t have a completely different set of servers involved
in process Y\'s name resolution! Instead, all DNS traffic would
be directed to one/few hosts for the entire set of clients.]

That\'s exactly how DNS works though, assuming that Alpha/Beta/Gamma
haven\'t already cached the answers received from Delta (or Epsilon,
etc.).

You\'ve missed the \"every process has its own namespace\" bit.

Four hosts. 400 processes (that\'s a typical number).
400 namespaces.

There can be as few as one \"namespace server\" (the piece of code
that \"backs\" each namespace object) in the collection of hosts.
Or, as many as 400. (there can be even more that just don\'t have
any namespace objects to manage, presently).

These can be distributed randomly among the 4 hosts. And, the
namespace objects AND the namespace servers (themselves objects)
can also migrate between the hosts.

So, process A1 (#1 on alpha) has a namespace object that may be backed
by a namespace server on beta. (i.e., an instance of BIND running on
beta to which A1 directs its requests). The remaining 399 processes
can have their namespace objects backed by servers on any of the
4 hosts -- including alpha!

An object \"named\" in A1\'s namespace (backed on beta) may resolve to an
object existing on Gamma. So, the Gamma object (a \"Context\" -- a subdomain,
if you will), can in turn reference an object on Delta... which may
eventually reference the final object residing on Alpha! (or anywhere else).

Meanwhile, the second (of 400) processes -- also on alpha -- needs to resolve
a name against its namespace -- which may be backed by a server also on
beta -- or any other host (including alpha).

This is (or can be) a different server instance. Like running 400 copies
of BIND each with a configuration for JUST the names that their ONE
client wants resolved among the 4 hosts.

And, unlike BIND where you\'re trying to resolve a hostname that you
will likely communicate with for some time, this is an *object* that
you are trying to resolve. ANYTHING THAT EXISTS OUTSIDE OF YOUR MEMORY
SPACE (even if you happen to be the server for that object!).

Imagine using DNS to lookup filenames in your LOCAL filesystem.
Or, wanting to get the current time of day -- \"Where\'s the clock???\"

The value of separate namespaces is that it restricts your code from
accessing things that it\'s not supposed to know even exist!
I.e., if your only interface to a filesystem is via a set of
file descriptors passed to you on startup, then I have no worry
that you will ever be able to read \"MySecrets\" -- because there
is no way for you to access it!

[This is profoundly different from IP addressing where I can
access the entire IP space just by wandering through
successive IP addresses and have to rely on something -- which
could be misconfigured -- to BLOCK those accesses]
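
A toy sketch of the namespace idea (the names, hosts, and resolution
chain below are all invented; the real thing also handles capabilities,
migration, etc.):

    # Each process gets its own namespace; a name binds either to a local
    # object or to a Context backed by some (possibly remote) server, so
    # resolution can hop from host to host.
    class Context:
        def __init__(self, server_host, entries):
            self.server_host = server_host   # which host backs this subtree
            self.entries = entries           # name -> object or Context

        def resolve(self, path):
            head, _, rest = path.lstrip("/").partition("/")
            target = self.entries[head]      # unknown name -> KeyError
            if rest:
                return target.resolve("/" + rest)   # hop to the next Context
            return target

    # Process A1: its /inputs/1 has nothing to do with the /inputs/1 of
    # any other process, and it holds no handle to any other namespace.
    a1_ns = Context("beta", {
        "inputs": Context("gamma", {"1": "sensor-42", "2": "sensor-17"}),
        "outputs": Context("delta", {"A": "valve-7"}),
        "error": "A1-error-log",
    })

    print(a1_ns.resolve("/inputs/1"))    # -> sensor-42
    print(a1_ns.resolve("/outputs/A"))   # -> valve-7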

[...]
A good switch will happily switch all 24 / 48 ports at line rate all day
every day.

It\'s usually just easier to have bigger uplink connections (e.g.
10/100G) making up the backbone.

I\'m not leaving the switch.

Then I\'m not really sure why you\'re asking about N hosts talking to an
M\'th host ... as all that traffic will enter (and exit) your switch ...

Any host (or hosts) on the switch can make a request (RPC)
of any other host at any time. Even for synchronous requests,
having *made* a request doesn\'t mean that other requests
(from other threads on your host) can\'t *also* be tickling the
switch -- possibly the same host that the first RPC targeted,
or possibly some other.

This description still has your data \"leaving\" the switch.

AFTER it has entered and possibly been queued *in* the switch.

[My \"I\'m not leaving the switch\" comment was meant as \"I\'m not
pushing packets to ANOTHER switch\"; all of my hosts are served
by \"the\" switch]

The only time data will be queued \"in\" the switch is on an egress
buffer, but at that point it\'s already exited the switch ASIC (and all
the \"important work\" is done).

But anything wanting to use that wire is still waiting.
It doesn\'t matter where it waits; it\'s still waiting.
And, the wait is proportional to the depth of the queue
and the policy for how it is filled.

E.g., an M byte queue per port is different than an N*M
byte buffer memory that can be allocated, as needed.

Think of it kinda like an amusement park ride that has a photobooth at
the exit -- the fact that the group you got off the ride is waiting for
their photo to show up on the TV doesn\'t slow the ride down any.

The switch is an approximation of a mesh. Under what conditions
does that approximation fall flat?

Decent manuals will provide three pieces of data for the switch:

- Non-Blocking Throughput -- should be equal to the number of ports.
How much can the switch transmit before it bogs down.

That assumes every destination port is \"available\".

Well, yeah, a switch can\'t magically talk out a port that\'s not
connected to anything :)

No, I meant \"ready to accept NEW frames\". If it is busy receiving
frames that were initiated \"a bit sooner\" (from some other host/port)
OR internally queued because multiple hosts (ports) tried to send
packets to it while it was busy, then that port is overloaded,
regardless of the (lack of?) activity on other ports

Non-Blocking Throughput is *only* concerned with transmission (i.e.
egress). Receipt is nonblocking (the blocker would be the host at the
other end not being able to sustain line rate).

But it\'s good to know that the switch can constantly transmit at the
combined line rate of all ports.

- Switching Capacity -- should be 2x the number of ports. How much
total traffic the fabric can handle before it bogs down.

How does this differ from the above? Or, is this a measure of
how deep the internal store is?

The switching capacity is \"how fast can the switch shuffle stuff around
its ASIC\". Slight correction to my initial statement: switching capacity
should be 2x the sum of the line rates of all the ports, across all
supported speeds.

Consider those super-cheap 5 port things you\'ll find at your local
big-box electronics store. They (might) only have a switching capacity of
5 gbps ... which cannot support the potential 10gbps of traffic 5 hosts
connected to it could generate. BUT, well, it\'s not meant for a scenario
where you have 5 hosts talking amongst themselves at line rate.

Wouldn\'t that depend on the internal architecture of the switch?

In some sense, I mean you *can* get cheapo switches that don\'t have
enough switching capacity to sustain line rate ...

Of course. There\'s a reason different manufacturers make different
switches for different markets.

As another example, I have a 48 port switch that includes 4x SFP cages
(2x 1G SFP + 2x 1/10G SFP+). As I recall, its switching capacity is on
the order of 104 gbps (i.e. 2x 52 [1gbps] ports). So I know it\'ll never
become the bottleneck if I don\'t use 10gbit SFP cages ... or if I do
need 10g switching, I have to give up a bit on the copper port capacity,
OR just accept that the switch WILL be a bottleneck if I\'m trying to use
it at port-capacity with at least 1x 10g card.

My goal is to size the switch as small as possible (to keep costs
low -- hundreds of PoE ports in even the smallest of installations;
I have 240, here) by dynamically exploiting traffic patterns. E.g.,
if the required (observed) bandwidth between process X and process Y
is \"high\", then colocating them on the same physical host is worth
the effort of migrating one (or both) of them. Even if that means
powering up another host just to have a place for them to co-exist
without consuming switch resources.

So you don\'t want a switch, you want Software Defined Networking (SDN),
which tends to lean \"not cheap\".

In my *application* (product), yes. But, the point of my question
was considerably simpler -- before we got onto the topic of how
switches work.

Namely, could I *emulate* a dedicated switch *in* a larger
switch without concern for the other traffic that could
be operating in that larger switch -- or, would I need to
have a physical, separate switch for any performance
guarantees (given that I can shape the traffic on the
hosts for that smaller switch but can\'t shape the traffic
for the \"other\" hosts on the larger switch)
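
The \"dynamically exploiting traffic patterns\" part can be sketched as a
greedy pass over observed pairwise rates; the threshold, the rates, and
the migrate() hook below are all placeholders:

    # observed Mb/s between process pairs, and current placement
    rates = {("X", "Y"): 600, ("X", "Z"): 40, ("Y", "Z"): 15}
    host_of = {"X": "alpha", "Y": "beta", "Z": "gamma"}

    HOT = 300   # Mb/s above which a pair is worth colocating (made-up figure)

    def migrate(proc, host):
        # placeholder for the real object/server migration machinery
        print("migrate", proc, "->", host)
        host_of[proc] = host

    # Walk the hottest pairs first; pull one side onto the host of the other.
    for (a, b), mbps in sorted(rates.items(), key=lambda kv: -kv[1]):
        if mbps >= HOT and host_of[a] != host_of[b]:
            migrate(b, host_of[a])   # real code would pick the cheaper move
                                     # and check CPU/RAM headroom first

After the pass, X and Y share alpha and their traffic never touches the
switch at all.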

- Forwarding Rate -- should be about 1.5x the number of ports. How
many frames the switch can process before it bogs down.

As long as you\'re within these specs, \"the switch\" is not impacting the
traffic at all.

If the switch is owned ENTIRELY by the application, then
these limits can be evaluated.

Evaluated? They\'re right there in the datasheet, the work\'s been done
for you.

The switch\'s specs tell me nothing about how the *application* can
avail itself of its abilities/limitations. I have to evaluate the
performance of the *application* in the constraints of the *switch*.

\"Applications\" hand off to the networking stack, which ultimately talks
in packets, frames, and bits on the wire.

So, if you want to know if \"an application\" will not starve from the
network, you need to know what kind of data rates \"an application\" needs
to sustain itself ... and provide \"the host\" with a sufficiently fast
network card (or use link aggregation).

Or, shape the traffic so it fits within the capabilities present.

\"Build a bigger power supply or use less power\" -- two approaches
to the same problem of \"fit\"

But, if other applications are also sharing the switch,
then you (me) have to be able to quantify *their* impact on
YOUR application.

Application? Like \"program\"? Switches don\'t operate with \"applications\".
They operate on ethernet frames.

The switch is an integral part of the applications that
rely on it to exchange data. Running a DBMS where the clients
can\'t talk to the DBMS server is kind of meaningless, eh?

But \"the application\" doesn\'t talk on the wire. \"the network card\"
does, and \"the application\" is constrained by the host\'s network card.

Semantics. The application decides what will get sent and
what will be expected based on the code that it executes.
The network card is just the physical manifestation of the
interface; the virtual manifestation is the set of RPCs that
the developer invokes (often without realizing that it *is*
an RPC)

\"An Application\" can\'t suck down 10gbps if the host only has a 1gbps
link...

Of course. Except if the application spans multiple hosts
(and, thus, NICs).

Don\'t think of an application as *a* program running on *a*
host but, rather, a collection of processes operating on a
set of processors to address a need.

When you DL something in Firefox, there are many processes that
are active. And, one that just handles the download (i.e.,
the browser window can close while the download continues).
We\'d all consider that *an* application, despite the fact that
there are multiple processes -- some potentially different
*programs* (separate EXEs on your disk) -- that co-operate to
meet that need.

Imagine serving iSCSI on a switch intended to support
a certain type of \"application traffic\". Suddenly,
there\'s all of this (near continuous) traffic as
the fabric tries to absorb the raw disk I/O.

iSCSI isn\'t served by a switch ... it\'s just SCSI commands from an
initiator to a target, wrapped in TCP. The ultimate bulk data transfer
on the network looks effectively like any other (TCP-based) data
transfer.

iSCSI relies on the switch to connect the initiators and
targets. The commands are insignificant packets; the payloads
are large and often \"ongoing\". Not short transactions that
engage and then release the resources of the switch.

They certainly are. Nothing more than a buncha frames whizzing past at
1gbps.

And, \"occupying\" the switch ports while they are on-the-wire.
Competing with other traffic that wants to access that same port(s).

Maybe my Target has a 10g link ... but it can still only send at 1gbps
to any one initiator because that initiator only has a 1g link (and when
it tries sending faster, frames start getting dropped, and it backs
off).

Different issue.

Target can only serve it back to the initiator as fast as it can upload
(e.g. 1gbps, although that\'s quite likely limited by disk read speed).

You can *write* as fast as the initiator can synthesize packets.
There is often a cache on the target so you can write faster than the
disk\'s write rate. Likewise, a read-ahead cache to read faster than
the disk can source data.

And those are both only so deep. The point here is that there are
limits outside of the switch in the middle that have just as much (if
not more) impact on things than the switch does.

But you would qualify those just as you would the switch.
You don\'t just throw things together and \"hope\" -- unless
you don\'t really care about performance: \"It\'ll get
done, eventually\"

In conventional services, things just slow down. You
may wait many seconds before a request times out. And,
you may abandon that request.

But, if the application expects a certain type of performance
from the communication subsystem and something is acting as
a parasite, there, then what recourse does the application
have? It can\'t tell the switch \"disable all other ports
because their activities are interfering with my expected
performance\".

Your \"application\" is bottlenecked by your PC\'s network stack (and
ability to kick data onto the wire) before your theoretical switch gets

You\'re thinking of bog-standard hardware and OSs. No PCs involved, here.

I move memory pages directly onto the wire by prepending an appropriate
header and letting the NIC pull the data straight out of the page
(jumbo frames, no packet reassembly required, etc.).

Fine, s/PC/NIC/ then.

In either case, frames can only be kicked over the wire as fast as the
card\'s PHY / MII is capable of sustaining (plus upstream CPU dumping
data to the card, etc).

Yes. But, I can keep all 240 NICs running at link speeds (Gbe)
all day long.

[I actually intentionally do that with \"dummy\" traffic injected
to ensure any eavesdropping doesn\'t leak information]

This isn\'t common in most businesses, offices, etc. Network
traffic is bursty and intermittent. And, usually doesn\'t have
any service guarantees beyond \"let it get there\".
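
(Re the \"memory pages directly onto the wire\" bit above: the shape of it
can be sketched with ordinary scatter/gather I/O, here a POSIX sendmsg()
over UDP with a made-up header. The real thing uses its own protocol and
jumbo frames so an 8 KiB page plus header fits in a single frame.)

    import socket
    import struct

    PAGE = bytearray(8192)          # stand-in for a pinned memory page

    def send_page(sock, dest, obj_id, page):
        # Made-up 12-byte header: object id + payload length.
        hdr = struct.pack("!QI", obj_id, len(page))
        # Scatter/gather send: the stack gathers header + page, so the page
        # is never concatenated into an application-level buffer first.
        sock.sendmsg([hdr, memoryview(page)], [], 0, dest)

    # usage (address invented; path MTU must be ~9000 for one frame):
    # s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # send_page(s, ("10.0.0.7", 7000), obj_id=42, page=PAGE)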

involved. If your \"application\" needs more throughput, you\'ll need to
handle it at the host it\'s running on. I mean, if we have a theoretical
switch with 500gbps capacity, that\'s all going to waste if the 48 hosts
connected to it only have gbit ports ...

Think: many hundreds of ports. Would you want many tbps of capacity?
Or, would you want a smarter way of (re)shaping the traffic to give
you the performance that would be available, there?

If the network warranted \"many hundreds of ports\", then most certainly I
would want switching capacity of \"2x many hundreds of ports\" (at a
minimum, and probably higher, to account for using 10/100g links where
necessary).

It might not be \"common\" for the smallish networks (~200 hosts max) I
work with to *NEED* to move 100gbps / switch around .. but I\'m also not
going to pretend that they\'ll be better off if they just reshape their
traffic patterns. Network\'s there to take whatever abuse they wanna
throw at it, and I don\'t have to worry.

*They* aren\'t predictable clients. My clients are processes with
well defined objectives. I can watch their performance and traffic
to determine how to economize on the hardware. Just like you
can shift your electric load throughout the day to take advantage
of cheaper rates overnight.

You can\'t tell employees that <something> has decided that
they should come in at 6:25P and work for 4 hours, then go
home and return at 2:10A for the balance of their shift.

(e.g., if you have to leave A switch to get to ANOTHER switch, then
that becomes a scarce resource -- 48 ports on switch #1 each wanting
to talk AT LINE SPEED to 48 other ports on switch #2)

Pretty standard stuff. Either use a 100g port as the uplink between the two
switches, or interconnect their backplanes. Really depends on what the
switch\'s capabilities are. The backplane ones are really nice, because
after you define the master, the rest just look like daughtercards
(rather than N individual switches as with 10/100g uplinks)... but they
also cost an arm and a leg.

The latter is the rub. A node costs me about $15. How much do I
want to spend to make it connect to other nodes?
 
On 8/25/2023 18:29, Joe Gwinn wrote:
On Fri, 25 Aug 2023 17:45:24 +0300, Dimiter_Popoff <dp@tgi-sci.com>
wrote:

On 8/25/2023 17:08, Don Y wrote:
On 8/25/2023 6:34 AM, Dimiter_Popoff wrote:
....
You are missing the point.  The network is used as part of the
\"memory bus\"
in the system.

Simple program/expression/line-of-code:
     hypot = sqrt(sideA^2 + sideB^2)
But, sqrt() is an RPC.  So, what you don\'t see in the statement
is the marshalling of the argument to sqrt(), it\'s being wrapped
with appropriate controls to indicate to the stub on the receiving
host exactly which function is being invoked along with the
format of the argument(s) and where the result should be \"returned\".

THEN, those are passed to the network \"stack\" for delivery to the
correct remote host.

And, the result awaited.

I am not following the switch etc. but I would strongly advise against
going into these waters.

I\'m not sure I understand which \"waters\" you\'re referencing?

I meant reinventing tcp, I have seen attempts on it time and again.


All of it is meant to speed things up, *not* as a tcp replacement, i.e.
it remains a \"best effort delivery\".

In a generic network, yes.  Because you have to accommodate \"foreign\"
devices intermixed with your own.  And, failures/faults/misuses.

But, if YOU control all of the devices AND can shape the traffic,
then you can get predictable performance even with protocols
that would be \"unreliable\" in generic deployments.

But by giving up on tcp (or a tcp replacement) you have already given
up on \"safe\" functionality. Even an rs-232 link is not guaranteed
to be safe unless you implement a \"tcp replacement\", i.e.
some request-response-ack/nak mechanism.


If you wanted to design an appliance with performance X, you can
do so as long as you don\'t let anything else compete for the resources
that you RELY ON.  If I added a \"foreign\" bus mastering
device to your netmca, you\'d not expect it to perform as advertised
without knowing how my device was competing for its resources.

Not sure I would want to know anything beyond \"bandwidth\" I can
get. I have had customers in a South Africa facility who were not
allowed to use a good link they had so I had to support them over
something about 30 kbps or so - slowly but it worked eventually
(countless RFB reconnections etc. of course).

TCP is a heavyweight protocol.  It\'s costly to set up connections
(so, you\'d want to tunnel under a semi-persistent connection, instead)
and consumes a lot of resources relative to something as trivial
as UDP.

Omitting that is what I am advising against (obviously not knowing
enough details). A tcp connection takes a syn-synack-ack and
eventually a fin-finack; if you want to make sure a button was
pressed at the other side I know of no simpler way of doing it
safely...
If you don\'t have the muscle on your mcu I know how this goes
all too well, but if you use udp you will only *add* work if
what you are after is a tcp replacement. If you don\'t need more
than a few bytes, just write a small/limited tcp stack; doing it
over udp will only mean more work.

If you aren\'t worried about other hosts jabbering or
the target being \"over busy\" (and missing the packet), then why
go to the expense of setting up a connection for a short transaction?
(esp if transactions are frequent)

E.g., a TCP connection per RPC/RMI would dramatically impact the
cost of each operation!

Not that much I think. You have to either choose between
reliable or best effort, this is the bottom line. Your RPC/RMI
will have to deal with dropped/reinitiated/reset connections
just like anybody else on the network, there is no middle ground,
this is my point.

TCP has a very specific niche, one that does not fit all uses, and TCP
has all manner of interesting failure modes. Nor does TCP support
multicast.

Niche as in safe bidirectional link - OK, if that is a niche
then it is a niche. It does not do multicast; if you need that
*and* a safe transaction to all nodes.... well, you will have to do
a much more complex thing than tcp.

UDP plus some application code can actually be far simpler than TCP.

It cannot be made simpler than tcp if you need a two-way safe
communication protocol. You can write an excessively complex
tcp, obviously, but writing *the part of tcp* that you need
is simpler than doing what you need under udp; at least you
will not have to deal with the udp part of it, while you *will*
still have to do all of the tcp work you need.
The idea of doing a tcp replacement under udp stems from the
understanding that tcp must be implemented in all its varieties
and thus be complex and difficult to do - which it is not,
it just is (perceived as?) not readily available.

And for \"safety\" applications, such as controlling a dangerous machine
of some kind, an interlocked handshake protocol based on UDP is far
far cleaner and simpler than anything else, and can be proven safe.

I don\'t know how you make a tcp replacement under udp which is
cleaner. I can only see how if it is one way (that is, one of the
sides won\'t be aware that the other side has finished the
transaction).
 
On 8/25/2023 7:45 AM, Dimiter_Popoff wrote:
I am not following the switch etc. but I would strongly advise against
going into these waters.

I\'m not sure I understand which \"waters\" you\'re referencing?

I meant reinventing tcp, I have seen attempts on it time and again.

I don\'t need TCP\'s functionality. Remember, it is intended to solve
a need in an unreliable interconnect medium with unknown clients
and unknown traffic.

When these are known, there is less reliance on the protocol.

E.g., SunRPC could run over UDP in a benign network environment.
(of course, it couldn\'t leave the local subnet).

If you have to constrain yourself to getting all of your
reassurances from the protocol, then you have to build all of
those reassurances *into* the protocol.

But, if you take a more INTEGRATED view (i.e. a protocol in
a context) then you can leverage other mechanisms to
provide the reassurances that the protocol lacks.

All of it is meant to speed things up, *not* as a tcp replacement, i.e.
it remains a \"best effort delivery\".

In a generic network, yes.  Because you have to accommodate \"foreign\"
devices intermixed with your own.  And, failures/faults/misuses.

But, if YOU control all of the devices AND can shape the traffic,
then you can get predictable performance even with protocols
that would be \"unreliable\" in generic deployments.

But by giving up on tcp (or a tcp replacement) you have already given
up on \"safe\" functionality. Even an rs-232 link is not guaranteed
to be safe unless you implement a \"tcp replacement\", i.e.
some request-response-ack/nak mechanism.

If I put a UART on a PCB and have it talk to another UART
on the same PCB, would you expect it to be unreliable?
I.e., context determines the risks you have to address.

If you wanted to design an appliance with performance X, you can
do so as long as you don\'t let anything else compete for the resources
that you RELY ON.  If I added a \"foreign\" bus mastering
device to your netmca, you\'d not expect it to perform as advertised
without knowing how my device was competing for its resources.

Not sure I would want to know anything beyond \"bandwidth\" I can
get. I have had customers in a South Africa facility who were not
allowed to use a good link they had so I had to support them over
something about 30 kbps or so - slowly but it worked eventually
(countless RFB reconnections etc. of course).

I\'m speaking at a much more \"intimate\" level. Imagine I
open the box and cobble some additional hardware onto
your address/data busses so they can share the bus with
your host processor. Would you expect your product to
behave as intended?

This is the risk of mainstream networking -- you have no control
over the \"other\" devices on the network, the quality of the cabling,
the design of the fabric, etc.

But, if you control that AS IF it was \"inside your box\", then
you can reasonably make expectations of it.

TCP is a heavyweight protocol.  It\'s costly to set up connections
(so, you\'d want to tunnel under a semi-persistent connection, instead)
and consumes a lot of resources relative to something as trivial
as UDP.

Omitting that is what I am advising against (obviously not knowing
enough details). A tcp connection takes a syn-synack-ack and
eventually a fin-finack; if you want to make sure a button was
pressed at the other side I know of no simpler way of doing it
safely...

But that means lots of overhead just to set up a connection
that may only persist for a few dozen microseconds -- and
then have another/other connections (to that or other hosts)
competing for those same resources (bandwidth, latency,
memory, MIPS, etc.)

Imagine sqrt() was implemented in a TCP session created JUST
for that invocation. Create the connection. Send \"sqrt, value\".
await reply. Teardown connection. Repeat for next (remote)
function invocation. The overhead quickly discourages the
use of RPCs. Which means processes become more bloated as
they try to reinvent services that already exist elsewhere.
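
To put rough numbers on that: the setup/teardown alone is on the order of
seven extra minimum-size frames (SYN, SYN-ACK, ACK, then FIN/ACK in both
directions), something like 450-500 bytes on the wire, plus a full round
trip before the argument can even be sent, all to move a handful of
payload bytes each way. The same exchange done connectionless is one
request datagram and one reply.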

I\'m taking the other approach: make it easy to reuse
a service (in the form of an instantiated object) on
some other node by keeping the invocation cost low.
(recall, you are competing with the cost of setting up
a stack frame and a BAL into the function)

If you don\'t have the muscle on your mcu I know how this goes
all too well, but if you use udp you will only *add* work if
what you are after is a tcp replacement. If you don\'t need more
than a few bytes, just write a small/limited tcp stack; doing it
over udp will only mean more work.

If you aren\'t worried about other hosts jabbering or
the target being \"over busy\" (and missing the packet), then why
go to the expense of setting up a connection for a short transaction?
(esp if transactions are frequent)

E.g., a TCP connection per RPC/RMI would dramatically impact the
cost of each operation!

Not that much I think.

It\'s actually a fair amount -- in time and bandwidth overhead.
Try to open (and then close) TCP connections as fast as possible
to get an idea for how much they cost. And, remember, the
target plays a role in that calculation as well. If it is
slow to complete its end of the handshake, then the
\"caller\" has to pause.

You have to either choose between
reliable or best effort, this is the bottom line. Your RPC/RMI
will have to deal with dropped/reinitiated/reset connections
just like anybody else on the network, there is no middle ground,
this is my point.

But I can get those reassurances from other mechanisms that are in place
without burdening each transaction with the cost of setting up a TCP
connection. Esp if dropped packets are the exception and not the
rule!
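
(A quick way to see the connection cost mentioned above on an ordinary
box; loopback only shows protocol/stack overhead, not wire or switch
effects, and the port numbers and counts are arbitrary:)

    import socket
    import threading
    import time

    HOST = "127.0.0.1"
    N = 2000

    # TCP side: a listener that just accepts and immediately closes.
    tcp_srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    tcp_srv.bind((HOST, 0))
    tcp_srv.listen(128)
    tcp_port = tcp_srv.getsockname()[1]

    def acceptor():
        while True:
            try:
                conn, _ = tcp_srv.accept()
                conn.close()
            except OSError:
                return

    threading.Thread(target=acceptor, daemon=True).start()

    t0 = time.perf_counter()
    for _ in range(N):
        socket.create_connection((HOST, tcp_port)).close()   # setup + teardown only
    tcp_us = (time.perf_counter() - t0) / N * 1e6

    # UDP side: a trivial echo, one datagram each way per call.
    udp_srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    udp_srv.bind((HOST, 0))
    udp_port = udp_srv.getsockname()[1]

    def echo():
        while True:
            try:
                data, addr = udp_srv.recvfrom(64)
                udp_srv.sendto(data, addr)
            except OSError:
                return

    threading.Thread(target=echo, daemon=True).start()

    cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    t0 = time.perf_counter()
    for _ in range(N):
        cli.sendto(b"ping", (HOST, udp_port))
        cli.recvfrom(64)
    udp_us = (time.perf_counter() - t0) / N * 1e6

    print("TCP connect/close :", round(tcp_us, 1), "us per call")
    print("UDP round trip    :", round(udp_us, 1), "us per call")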
 
On 8/25/2023 9:22 AM, Dimiter_Popoff wrote:
UDP plus some application code can actually be far simpler than TCP.

It cannot be made simpler than tcp if you need a two-way safe
communication protocol.

But you don\'t need \"two way safe\". You can push a packet at the
target (the RMI request -- an identifier for the object to be
referenced, the member function to be invoked, the arguments that
it requires along with their types, and directions on where to
return the reply/result). Then, wait for the target to push the
reply back to you -- wherever you specified.

This is also a win if the host that will eventually reply is not
the host to which you directed the request! (imagine \"breaking\"
a TCP connection and reconnecting it to some other host without
mucking up either end)
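
The request itself can be one self-describing datagram; the field layout
and addresses below are invented, the point being that the reply-to
endpoint is just data, so whichever host currently holds the object can
send the result straight back:

    import json
    import socket

    def rmi(sock, target, obj_id, method, args, reply_to):
        request = {
            "obj": obj_id,          # which object is being referenced
            "method": method,       # which member function to invoke
            "args": args,           # marshalled arguments
            "reply_to": reply_to,   # [host, port] where the result should land
        }
        sock.sendto(json.dumps(request).encode(), target)

    # usage: ask whoever backs object 17 for sqrt(2.0), with the answer
    # delivered to port 5001 on this host, even if the object has since
    # migrated and a different host ends up sending the reply.
    # s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # rmi(s, ("10.0.0.9", 5000), 17, "sqrt", [2.0], ["10.0.0.2", 5001])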

You can write an excessively complex
tcp, obviously, but writing *the part of tcp* that you need
is simpler than doing what you need under udp; at least you
will not have to deal with the udp part of it, while you *will*
still have to do all of the tcp work you need.
The idea of doing a tcp replacement under udp stems from the
understanding that tcp must be implemented in all its varieties
and thus be complex and difficult to do - which it is not,
it just is (perceived as?) not readily available.

Look at Sun\'s RPC to see how UDP can easily be used
for this.
 
On 8/25/2023 19:31, Don Y wrote:
...

If I put a UART on a PCB and have it talk to another UART
on the same PCB, would you expect it to be unreliable?
I.e., context determines the risks you have to address.

While I would do that with no second thoughts in plenty of
cases, it is not as safe as \"tcp\" is, i.e. two-way safe like
in syn-synack-ack etc.


Imagine sqrt() was implemented in a TCP session created JUST
for that invocation.  Create the connection.  Send \"sqrt, value\".
await reply.  Teardown connection.  Repeat for next (remote)
function invocation.  The overhead quickly discourages the
use of RPCs.  Which means processes become more bloated as
they try to reinvent services that already exist elsewhere.

Well this is clearly a one-way safe operation, if you need it
safe at all. Assuming you want it as safe as a \"mul\" opcode
on the core you run on, you will be fine with, say, just two crc-s
or so (one for the request and one for the reply). You won\'t need
more than a single rtt.

E.g., a TCP connection per RPC/RMI would dramatically impact the
cost of each operation!

Not that much I think.

It\'s actually a fair amount -- in time and bandwidth overhead.
Try to open (and then close) TCP connections as fast as possible
to get an idea for how much they cost.  And, remember, the
target plays a role in that calculation as well.  If it is
slow to complete its end of the handshake, then the
\"caller\" has to pause.

As long as you don\'t need it two-way safe you can obviously
go faster. However, you cannot guard against trivial stuff like
races etc. without doing extra work (say you issue two \"mul\"
requests and the replies come out of order; to wrestle with that
you will have to do some of the work tcp would otherwise do
for you).
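
That sequencing work is small but real; in its simplest form it is just a
transaction id carried in both directions, along these lines (id width
and framing made up):

    import itertools
    import struct

    _next_id = itertools.count(1)
    _pending = {}                  # transaction id -> result slot

    def send_request(sock, dest, payload):
        tid = next(_next_id)
        _pending[tid] = None
        sock.sendto(struct.pack("!I", tid) + payload, dest)
        return tid

    def handle_reply(datagram):
        tid = struct.unpack("!I", datagram[:4])[0]
        if tid in _pending:                # unknown/duplicate ids just fall through
            _pending[tid] = datagram[4:]   # matched, regardless of arrival order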

But I can get those reassurances from other mechanisms that are in place
without burdening each transaction with the cost of setting up a TCP
connection.  Esp if dropped packets are the exception and not the
rule!

Allowing for an exception in a networking context - even if it is the
exception to the rule - changes the game altogether. If you don\'t
need a safe delivery mechanism then you don\'t need it of course.
 
