Guaranteeing switch bandwidth...

On 8/25/2023 9:58 AM, Dimiter_Popoff wrote:
On 8/25/2023 19:31, Don Y wrote:
...

If I put a UART on a PCB and have it talk to another UART
on the same PCB, would you expect it to be unreliable?
I.e., context determines the risks you have to address.

While I would do that with no second thoughts in plenty of
cases, it is not as safe as "tcp" is, i.e. two-way safe as
in syn-ack-synack etc.

If I put a NIC on a PCB and have it talk to another NIC
on the same PCB, would you expect it to be unreliable? :>

Imagine sqrt() was implemented in a TCP session created JUST
for that invocation.  Create the connection.  Send "sqrt, value".
Await reply.  Teardown connection.  Repeat for next (remote)
function invocation.  The overhead quickly discourages the
use of RPCs.  Which means processes become more bloated as
they try to reinvent services that already exist elsewhere.

Well, this is clearly a one-way-safe operation, if you need it
safe at all. Assuming you want it as safe as a "mul" opcode
on the core you run on, you will be fine with, say, just two CRCs
or so (one for the request and one for the reply). You won't need
more than a single rtt.
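
Concretely, a minimal sketch of that framing - one request, one reply,
each carrying its own CRC-32 so either end can discard a corrupted
datagram. All field names here are illustrative, not from any particular
stack:

#include <stddef.h>
#include <stdint.h>

/* Standard reflected CRC-32 (poly 0xEDB88320), bitwise for brevity. */
static uint32_t crc32(const uint8_t *p, size_t n)
{
    uint32_t c = 0xFFFFFFFFu;
    while (n--) {
        c ^= *p++;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1)));
    }
    return ~c;
}

struct rpc_request {
    uint32_t seq;         /* request id, echoed in the reply   */
    uint16_t opcode;      /* e.g. OP_SQRT                      */
    uint16_t len;         /* length of payload[]               */
    uint8_t  payload[56]; /* marshalled argument(s)            */
    uint32_t crc;         /* CRC-32 over everything above      */
};

struct rpc_reply {
    uint32_t seq;         /* copied from the request           */
    uint16_t status;      /* 0 = ok                            */
    uint16_t len;
    uint8_t  payload[56]; /* marshalled return value(s)        */
    uint32_t crc;
};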

You don\'t even have to address data corruption.
If you have determined the media to be unreliable, then
all bets are off.

[What do you do when your memory starts throwing errors?]

E.g., a TCP connection per RPC/RMI would dramatically impact the
cost of each operation!

Not that much I think.

It's actually a fair amount -- in time and bandwidth overhead.
Try to open (and then close) TCP connections as fast as possible
to get an idea for how much they cost.  And, remember, the
target plays a role in that calculation as well.  If it is
slow to complete its end of the handshake, then the
"caller" has to pause.

As long as you don't need it two-way safe you can obviously
go faster. However, you cannot guard against trivial stuff like
races etc. without doing extra work (say you issue two "mul"
requests and the replies come out of order; to handle that
you will have to do some of the work tcp requires you
to do).

Each request is serialized. Replies bear the id of the
request that initiated them. So, it's a simple matter for the
receiving stub to locate the correct caller and propagate
the return value(s) back to it.

Even if requests come in to a server out-of-order,
the replies will eventually meet their correct initiators.
Stateful objects inherently have race issues, so you
have to address those explicitly -- either with semaphores
or some other synchronization primitive.
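
A rough sketch of that id-matching, under the assumption of a small
table of pending requests (types and names invented for illustration):

#include <stddef.h>
#include <stdint.h>

struct pending {
    uint32_t seq;                                 /* id carried by request and reply */
    void   (*complete)(void *result, void *ctx);  /* wake the waiting caller         */
    void    *ctx;
    int      in_use;
};

#define MAX_PENDING 64
static struct pending table[MAX_PENDING];

/* Called by the receive path for every reply; replies may arrive in any
 * order, the id alone decides which caller gets woken. */
static int dispatch_reply(uint32_t seq, void *result)
{
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (table[i].in_use && table[i].seq == seq) {
            table[i].in_use = 0;
            table[i].complete(result, table[i].ctx);
            return 0;
        }
    }
    return -1;   /* unknown or duplicate reply: ignore it */
}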

But I can get those reassurances from other mechanisms that are in place
without burdening each transaction with the cost of setting up a TCP
connection.  Esp if dropped packets are the exception and not the
rule!

Allowing for an exception in a networking context - even if it is the
exception to the rule - changes the game altogether. If you don't
need a safe delivery mechanism then you don't need it, of course.

If you can get \"safe\" without the added costs of protocol
changes, then it\'s \"free\".

Remember, all of the legacy protocols were designed when networking
was far more of a crap shoot.
 
On Fri, 25 Aug 2023 19:22:03 +0300, Dimiter_Popoff <dp@tgi-sci.com>
wrote:

On 8/25/2023 18:29, Joe Gwinn wrote:
On Fri, 25 Aug 2023 17:45:24 +0300, Dimiter_Popoff <dp@tgi-sci.com
wrote:

On 8/25/2023 17:08, Don Y wrote:
On 8/25/2023 6:34 AM, Dimiter_Popoff wrote:
....
You are missing the point.  The network is used as part of the
\"memory bus\"
in the system.

Simple program/expression/line-of-code:
     hypot = sqrt(sideA^2 + sideB^2)
But, sqrt() is an RPC.  So, what you don\'t see in the statement
is the marshalling of the argument to sqrt(), it\'s being wrapped
with appropriate controls to indicate to the stub on the receiving
host exactly which function is being invoked along with the
format of the argument(s) and where the result should be \"returned\".

THEN, those are passed to the network \"stack\" for delivery to the
correct remote host.

And, the result awaited.
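
Roughly what such a generated client-side stub might look like - a
hypothetical sketch only; the transport calls (rpc_send, rpc_wait_reply)
and the opcode are invented for illustration, not the actual mechanism
described here:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OP_SQRT 0x0001

/* Assumed transport interface, provided by the OS/runtime. */
extern int rpc_send(uint16_t opcode, uint32_t seq,
                    const void *args, size_t len);
extern int rpc_wait_reply(uint32_t seq, void *result, size_t len);

static uint32_t next_seq;

/* What the developer writes as "hypot = sqrt(x)" expands to, roughly. */
int remote_sqrt(double x, double *result)
{
    uint32_t seq = ++next_seq;
    uint8_t  buf[sizeof x];

    memcpy(buf, &x, sizeof x);                  /* marshal the argument */
    if (rpc_send(OP_SQRT, seq, buf, sizeof buf) < 0)
        return -1;                              /* delivery failure     */
    return rpc_wait_reply(seq, result, sizeof *result);  /* block on it */
}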

I am not following the switch etc. but I would strongly advise against
going into these waters.

I\'m not sure I understand which \"waters\" you\'re referencing?

I meant reinventing tcp, I have seen attempts on it time and again.


All of it is meant to speed things up, *not* as a tcp replacement, i.e.
it remains a \"best effort delivery\".

In a generic network, yes.  Because you have to accommodate \"foreign\"
devices intermixed with your own.  And, failures/faults/misuses.

But, if YOU control all of the devices AND can shape the traffic,
then you can get predictable performance even with protocols
that would be \"unreliable\" in generic deployments.

But by giving up on tcp or a tcp replacement you have already given
up on "safe" functionality. Even an rs-232 link is not guaranteed
to be safe as long as you don't implement a "tcp replacement", i.e.
some request-response-ack/nak mechanism.


If you wanted to design an appliance with performance X, you can
do so as long as you don\'t let anything else compete for the resources
that you RELY ON.  If I added a \"foreign\" bus mastering
device to your netmca, you\'d not expect it to perform as advertised
without knowing how my device was competing for its resources.

Not sure I would want to know anything beyond \"bandwidth\" I can
get. I have had customers in a South Africa facility who were not
allowed to use a good link they had so I had to support them over
something about 30 kbps or so - slowly but it worked eventually
(countless RFB reconnections etc. of course).

TCP is a heavyweight protocol.  It\'s costly to set up connections
(so, you\'d want to tunnel under a semi-persistent connection, instead)
and consumes a lot of resources relative to something as trivial
as UDP.

Omitting that is what I am advising against (obviously not knowing
enough details). A tcp connection takes a syn-synack-ack and
eventually a fin-finack; if you want to make sure a button was
pressed at the other side I know of no simpler way of doing it
safely...
If you don't have the muscle on your mcu I know how this goes
all too well, but if you use udp you will only *add* work if
you are after having a tcp replacement. If you don't need more
than a few bytes just write a small/limited tcp stack; doing it
over udp will only mean more work.

If you aren\'t worried about other hosts jabbering or
the target being \"over busy\" (and missing the packet), then why
go to the expense of setting up a connection for a short transaction?
(esp if transactions are frequent)

E.g., a TCP connection per RPC/RMI would dramatically impact the
cost of each operation!

Not that much I think. You have to either choose between
reliable or best effort, this is the bottom line. Your RPC/RMI
will have to deal with dropped/reinitiated/reset connections
just like anybody else on the network, there is no middle ground,
this is my point.

TCP has a very specific niche, one that does not fit all uses, and TCP
has all manner of interesting failure modes. Nor does TCP support
multicast.

Niche as in safe bidirectional link - OK, if that is a niche
then it is a niche. It does not do multicast; if you need that
*and* a safe transaction to all nodes... well, you will have to do
something much more complex than tcp.


UDP plus some application code can actually be far simpler than TCP.

It cannot be made simpler than tcp if you need a two-way safe
communication protocol. You can write an excessively complex
tcp, obviously, but writing *the part of tcp* that you need
is simpler than doing what you need under udp, at least you
will not have to deal with the udp part of it while you *will*
have to do all of the tcp you need to do.
The idea of doing a tcp replacement under udp stems from the
assumption that tcp must be implemented in all its varieties
and thus be complex and difficult to do - which it is not;
it just is (perceived as?) not readily available.

Well the classic use case in radar is where individual radar detection
reports are _not_ precious, and so it\'s OK if one loses a few here and
there. So just fire away, but count packets sent versus received, and
complain if too many are being lost. With TCP, the link would hang up
because it assumes that all packets are precious.

For those few UDP messages that are precious, implement a retry and
timeout mechanism, one that gives up and complains after three
attempts. This being so the code does not lose control should a
packet fail to arrive. Here, TCP will just hang, waiting for Godot.
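
A minimal sketch of that retry/timeout pattern over a connected UDP
socket (POSIX sockets assumed; the attempt count and timeout values are
arbitrary):

#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>

#define ATTEMPTS   3
#define TIMEOUT_MS 200

/* sock is a UDP socket already connect()ed to the peer. Returns 0 once an
 * acknowledgement arrives, -1 after ATTEMPTS tries so the caller can
 * complain instead of waiting forever. */
int send_precious(int sock, const void *msg, size_t len,
                  void *ack, size_t acklen)
{
    for (int attempt = 0; attempt < ATTEMPTS; attempt++) {
        if (send(sock, msg, len, 0) < 0)
            return -1;

        struct pollfd pfd = { .fd = sock, .events = POLLIN };
        if (poll(&pfd, 1, TIMEOUT_MS) > 0 &&
            recv(sock, ack, acklen, 0) > 0)
            return 0;                 /* acknowledged                      */
        /* timed out: fall through and retransmit */
    }
    return -1;                        /* three strikes: report it upstream */
}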


And for \"safety\" applications, such as controlling a dangerous machine
of some kind, an interlocked handshake protocol based on UDP is far
far cleaner and simpler than anything else, and can be proven safe.

I don\'t know how you make a tcp replacement under udp which is
cleaner. I can only see how if it is one way (that is, one of the
sides won\'t be aware that the other side has finished the
transaction).

Depends on one\'s definition of \"safety\".

In this case, TCP is out of the question, as it does not implement a
safety-critical interlocked handshake adequate to control say a
weapon. TCP was not designed for anything of the kind.

More generally, which is better, TCP or UDP, has no single universal
answer - it depends on the application.

Joe Gwinn
 
On 2023-08-25, Don Y wrote:
On 8/25/2023 8:03 AM, Dan Purgert wrote:
Yes and no. Switches have *very* small buffers (your typical \"decent
biz\" rack-mount option having somewhere in the realm of 16 MiB ...
maybe).

If a link gets so congested that the buffers fill, the switch will just
start dropping traffic on the floor. This in turn will typically act as
a signal for the sending parties to retry (and slow down).

This would be A Bad Thing as it would have to rely on timeouts
to detect that packets are missing.

The whole idea is to drop frames early, so that you don\'t bog down other
parts of the network.

You are missing the point. The network is used as part of the \"memory
bus\" in the system.

With the assumption of a suitably capable switch, and \"N\" hosts
connected to it; any communications are inherently limited by the link
between the switch and the host(s).

The *average* communication rate is thusly limited. The elastic
store in the switch allows that to be exceeded, depending on how
deep the store and the nature of the traffic hammering on it.

No. Any given host can only transmit to the switch at link speed (e.g.
1gbps). Therefore, 5 hosts communicating via a switch can NEVER push
more than 5gbps into the switch, and there will NEVER be more than
5gbps that the switch needs to kick out "some interface".

If the interface cannot sustain the traffic (e.g. 4x hosts all trying to
hammer the 5th), frames get dropped, flow control (if available) kicks
in, and the sending hosts slow down such that the inbound rate doesn\'t
exceed the capacity for the target host to consume data (i.e. each
inbound host transmits at ~250mbps). Otherwise, the switch happily
shuffles 10gbps of traffic around, and everybody is shoving data out at
1gbps.


Likewise, the switch cannot shove data down to any given host faster
than its interface speed (1gbps).

If 5 hosts are all shoving data out to a 6th host at 1gbps
simultaneously, then either

(A) Host_6 needs a 10gbps link
(B) Switch (eventually) starts dropping traffic, ideally while
telling Hosts{1..5} to slow down (but flow-control is an
optional feature on switches).

(C) You\'ve codified the characteristics of the switch and know
that it can sustain an overload of X on Y ports for Z seconds.
Else, why have buffers IN switches?

This point (C) is merely re-wording (B).

If your traffic spike doesn\'t overfill a buffer, nobody\'s the wiser, and
nothing gets sent back telling anyone to slow down.
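
For a sense of scale, assuming the 16 MiB of shared buffer mentioned
earlier and four hosts bursting at line rate toward a single 1 Gbps
egress port, the buffer hides a burst of only a few tens of
milliseconds:

#include <stdio.h>

int main(void)
{
    double buffer_bytes = 16.0 * 1024 * 1024;   /* 16 MiB shared buffer */
    double ingress_bps  = 4.0 * 1e9;            /* 4 x 1 Gbps senders   */
    double egress_bps   = 1.0 * 1e9;            /* one 1 Gbps port      */
    double excess_Bps   = (ingress_bps - egress_bps) / 8.0;

    printf("buffer absorbs a burst of about %.1f ms\n",
           1000.0 * buffer_bytes / excess_Bps); /* ~44.7 ms             */
    return 0;
}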

That\'s it, there\'s nothing else network-side you can do. The potential
for dropped / lost traffic is going to be handled in your host\'s network
stack (e.g. TCP resends), or in your application directly (or both).

Okay, so you\'ve got basically a bog-standard network design ...

The fabric is not exceptional. The system/application running atop
it, however, is far from typical. So, the traffic is atypical.

E.g., each process has its own namespace. In most systems, the
filesystem acts as the UNIFIED namespace for all processes
running on that system.

So, each process can have a (trivial) namespace defined as:
/inputs
    /1
    /2
    /n
/outputs
    /A
    /B
    /C
/error
/inputs/1 for process X has nothing to do with /inputs/1 for
process Y. (just like stdin and stdout are each specific to
a particular process)

There is no way for process X to even *see* the namespace for
process Y as there is no global context that makes all of them
visible (even if only to a \"superuser\") -- unlike the shared
filesystem mentioned above.
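
A minimal sketch of what a per-process namespace can amount to - each
process resolves names against its own private table, so there is
nothing global to leak through (types and names are illustrative):

#include <stddef.h>
#include <string.h>

typedef int handle_t;                 /* opaque reference to some object */

struct binding   { const char *name; handle_t h; };
struct namespace {
    struct binding map[16];
    size_t         count;
};

/* Resolution consults only the caller's own table; "/inputs/1" here has
 * no relationship to "/inputs/1" in any other process's table. */
static handle_t ns_resolve(const struct namespace *ns, const char *name)
{
    for (size_t i = 0; i < ns->count; i++)
        if (strcmp(ns->map[i].name, name) == 0)
            return ns->map[i].h;
    return -1;                        /* not visible to this process */
}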

And this is exactly how inter-host communication works in a standard
ethernet network. No one host knows what any other host is doing at any
particular time before it starts transmitting ...

And hosts don't tend to chatter amongst themselves that much.
I.e., limit the bandwidth of your databus inside your CPU
to 1Gb/s and tell me how that affects performance?

That makes no sense whatsoever. If you\'re using 10g or 100g links, just
say so (but let\'s face it, the only difference is that they get
\"faster\", it\'s not like the actual behavior I\'m describing changes).



[...]
A good switch will happily switch all 24 / 48 ports at line rate all day
every day.

It\'s usually just easier to have bigger uplink connections (e.g.
10/100G) making up the backbone.

I\'m not leaving the switch.

Then I\'m not really sure why you\'re asking about N hosts talking to an
M\'th host ... as all that traffic will enter (and exit) your switch ...

Any host (or hosts) on the switch can make a request (RPC)
of any other host at any time. Even for synchronous requests,
having *made* a request doesn\'t mean that other requests
(from other threads on your host) can\'t *also* be tickling the
switch -- possibly the same host that the first RPC targeted,
or possibly some other.

This description still has your data \"leaving\" the switch.

AFTER it has entered and possibly been queued *in* the switch.

[My \"I\'m not leaving the switch\" comment was meant as \"I\'m not
pushing packets to ANOTHER switch\"; all of my hosts are served
by \"the\" switch]

The only time data will be queued \"in\" the switch is on an egress
buffer, but at that point it\'s already exited the switch ASIC (and all
the \"important work\" is done).

But anything wanting to use that wire is still waiting.
It doesn\'t matter where it waits; it\'s still waiting.
And, the wait is proportional to the depth of the queue
and the policy for how it is filled.

E.g., an M byte queue per port is different than an N*M
byte buffer memory that can be allocated, as needed.

Except that there is no queue, until you\'re trying to shove more data
across a port than the link can sustain (in which case, get faster
links, or slow down the transmit...)

[...]
So you don\'t want a switch, you want Software Defined Networking (SDN),
which tends to lean \"not cheap\".

In my *application* (product), yes. But, the point of my question
was considerably simpler -- before we got onto the topic of how
switches work.

Namely, could I *emulate* a dedicated switch *in* a larger
switch without concern for the other traffic that could
be operating in that larger switch -- or, would I need to
have a physical, separate switch for any performance
guarantees (given that I can shape the traffic on the
hosts for that smaller switch but can't shape the traffic
for the "other" hosts on the larger switch)?

No, switches don\'t \"emulate switches inside themselves\". A switch has
an ASIC (that can shove data around at up to N gbps OR M pps; whichever
limit you hit first) and a buncha ports connected to that ASIC.

They are dead simple devices in that regard.

Fancy ones might have fancy features like QoS or flow control or
similar; but your bog standard switch will just hang out and shuffle
data between the ports at line rate.

- Forwarding Rate -- should be about 1.5 Mpps times the number of
(gigabit) ports; how many frames per second the switch can process
before it bogs down.

As long as you\'re within these specs, \"the switch\" is not impacting the
traffic at all.
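
For reference, that "~1.5 Mpps per gigabit port" figure falls straight
out of minimum-size Ethernet frames; a quick sanity check, assuming the
usual 64-byte frame plus preamble/SFD and inter-frame gap:

#include <stdio.h>

int main(void)
{
    double wire_bytes = 64 + 8 + 12;            /* 84 B per minimum frame */
    double pps_1g     = 1e9 / (wire_bytes * 8); /* ~1.488 Mpps            */

    printf("per 1G port: %.3f Mpps\n", pps_1g / 1e6);
    printf("48-port wire speed: %.1f Mpps\n", 48 * pps_1g / 1e6); /* ~71.4 */
    return 0;
}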

If the switch is owned ENTIRELY by the application, then
these limits can be evaluated.

Evaluated? They\'re right there in the datasheet, the work\'s been done
for you.

The switch\'s specs tell me nothing about how the *application* can
avail itself of its abilities/limitations. I have to evaluate the
performance of the *application* in the constraints of the *switch*.

\"Applications\" hand off to the networking stack, which ultimately talks
in packets, frames, and bits on the wire.

So, if you want to know if \"an application\" will not starve from the
network, you need to know what kind of data rates \"an application\" needs
to sustain itself ... and provide \"the host\" with a sufficiently fast
network card (or use link aggregation).

Or, shape the traffic so it fits within the capabilities present.

\"Build a bigger power supply or use less power\" -- two approaches
to the same problem of \"fit\"

Well, yeah, but when I said that in like my first or second post
(something along the lines of \"if 1g isn\'t fast enough, get faster
interfaces\") you balked at that too ...


But, if other applications are also sharing the switch,
then you (me) have to be able to quantify *their* impact on
YOUR application.

Application? Like "program"? Switches don't operate with "applications".
They operate on ethernet frames.

The switch is an integral part of the applications that
rely on it to exchange data. Running a DBMS where the clients
can\'t talk to the DBMS server is kind of meaningless, eh?

But \"the application\" doesn\'t talk on the wire. \"the network card\"
does, and \"the application\" is constrained by the host\'s network card.

Semantics. The application decides what will get sent and
what will be expected based on the code that it executes.
The network card is just the physical manifestation of the
interface; the virtual manifestation is the set of RPCs that
the developer invokes (often without realizing that it *is*
an RPC).

\"An Application\" can\'t suck down 10gbps if the host only has a 1gbps
link...

Of course. Except if the application spans multiple hosts
(and, thus, NICs).

Don't think of an application as *a* program running on *a*
host but, rather, a collection of processes operating on a
set of processors to address a need.

More that I\'m saying \"an application\" doesn\'t matter (or many
applications). You want to talk about the capabilities of a switch; and
that discussion is well below the level of \"application\".

iSCSI relies on the switch to connect the initiators and
targets. The commands are insignificant packets; the payloads
are large and often \"ongoing\". Not short transactions that
engage and then release the resources of the switch.

They certainly are. Nothing more than a buncha frames whizzing past at
1gbps.

And, \"occupying\" the switch ports while they are on-the-wire.
Competing with other traffic that wants to access that same port(s).

Sure, but it\'s not the switch\'s problem that the devices can only talk
at 1gbps ... in fact, the stuff on the wire is long past the switch
ASIC, and completely forgotten about. (in a manner of speaking)

Maybe my Target has a 10g link ... but it can still only send at 1gbps
to any one initiator because that initiator only has a 1g link (and when
it tries sending faster, frames start getting dropped, and it backs
off).

Different issue.

Target can only serve it back to the initiator as fast as it can upload
(e.g. 1gbps, although that's quite likely limited by disk read speed).

You can *write* as fast as the initiator can synthesize packets.
There is often a cache on the target so you can write faster than the
disk\'s write rate. Likewise, a read-ahead cache to read faster than
the disk can source data.

And those are both only so deep. The point here is that there are
limits outside of the switch in the middle that have just as much (if
not more) impact on things than the switch does.

But you would qualify those just as you would the switch.
You don\'t just throw things together and \"hope\" -- unless
you don\'t really care about performance: \"It\'ll get
done, eventually\"

no, I just throw things together and know it\'ll work, because I know
what the link speeds are, and I know what jobs are trying to be done.

In conventional services, things just slow down. You
may wait many seconds before a request times out. And,
you may abandon that request.

But, if the application expects a certain type of performance
from the communication subsystem and something is acting as
a parasite, there, then what recourse does the application
have? It can\'t tell the switch \"disable all other ports
because their activities are interfering with my expected
performance\".

Your \"application\" is bottlenecked by your PC\'s network stack (and
ability to kick data onto the wire) before your theoretical switch gets

You\'re thinking of bog-standard hardware and OSs. No PCs involved, here.

I move memory pages directly onto the wire by prepending an appropriate
header and letting the NIC pull the data straight out of the page
(jumbo frames, no packet reassembly required, etc.).
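
A rough POSIX analogue of that "header + page" send, using
scatter/gather so the page is pulled in place rather than copied into a
packet buffer (the header layout is invented; the mechanism described
above is OS-internal, not this API):

#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/uio.h>

struct page_hdr { uint32_t seq; uint32_t len; };

int send_page(int sock, struct page_hdr *hdr, void *page, size_t page_len)
{
    struct iovec iov[2] = {
        { .iov_base = hdr,  .iov_len = sizeof *hdr },
        { .iov_base = page, .iov_len = page_len    },  /* pulled in place */
    };
    struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };

    return (sendmsg(sock, &msg, 0) < 0) ? -1 : 0;
}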

Fine, s/PC/NIC/ then.

In either case, frames can only be kicked over the wire as fast as the
card\'s PHY / MII is capable of sustaining (plus upstream CPU dumping
data to the card, etc).

Yes. But, I can keep all 240 NICs running at link speeds (Gbe)
all day long.

So can a handful of 48-port switches... as I\'ve said countless times.

[...]
*They* aren\'t predictable clients. My clients are processes with
well defined objectives. I can watch their performance and traffic
to determine how to economize on the hardware. Just like you
can shift your electric load throughout the day to take advantage
of cheaper rates overnight.

Good, you have well-defined bandwidth needs then? What are they? is
1gbps / port sufficient? Or do your devices need 10g?

[...]
The latter is the rub. A node costs me about $15. How much do I
want to spend to make it connect to other nodes?

How much bandwidth does a node need? PoE? Because honestly, 48x 1g
copper with PoE will run like I dunno, $400? If you don\'t need PoE,
then maybe $200? If you NEED backplane-grade interconnects, you\'re
looking at 2 grand / switch.

But if gbit to a node isn\'t fast enough, then you\'re gonna start paying
a whole lot more ...

--
|_|O|_|
|_|_|O| Github: https://github.com/dpurgert
|O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860
 
On 8/25/2023 20:16, Joe Gwinn wrote:
On Fri, 25 Aug 2023 19:22:03 +0300, Dimiter_Popoff <dp@tgi-sci.com
wrote:

On 8/25/2023 18:29, Joe Gwinn wrote:
On Fri, 25 Aug 2023 17:45:24 +0300, Dimiter_Popoff <dp@tgi-sci.com
wrote:

On 8/25/2023 17:08, Don Y wrote:
On 8/25/2023 6:34 AM, Dimiter_Popoff wrote:
....
You are missing the point.  The network is used as part of the
\"memory bus\"
in the system.

Simple program/expression/line-of-code:
     hypot = sqrt(sideA^2 + sideB^2)
But, sqrt() is an RPC.  So, what you don\'t see in the statement
is the marshalling of the argument to sqrt(), it\'s being wrapped
with appropriate controls to indicate to the stub on the receiving
host exactly which function is being invoked along with the
format of the argument(s) and where the result should be \"returned\".

THEN, those are passed to the network \"stack\" for delivery to the
correct remote host.

And, the result awaited.

I am not following the switch etc. but I would strongly advise against
going into these waters.

I\'m not sure I understand which \"waters\" you\'re referencing?

I meant reinventing tcp, I have seen attempts on it time and again.


All of it is meant to speed things up, *not* as a tcp replacement, i.e.
it remains a \"best effort delivery\".

In a generic network, yes.  Because you have to accommodate \"foreign\"
devices intermixed with your own.  And, failures/faults/misuses.

But, if YOU control all of the devices AND can shape the traffic,
then you can get predictable performance even with protocols
that would be \"unreliable\" in generic deployments.

But by giving up on tcp or a tcp replacement you have already given
up on "safe" functionality. Even an rs-232 link is not guaranteed
to be safe as long as you don't implement a "tcp replacement", i.e.
some request-response-ack/nak mechanism.


If you wanted to design an appliance with performance X, you can
do so as long as you don\'t let anything else compete for the resources
that you RELY ON.  If I added a \"foreign\" bus mastering
device to your netmca, you\'d not expect it to perform as advertised
without knowing how my device was competing for its resources.

Not sure I would want to know anything beyond \"bandwidth\" I can
get. I have had customers in a South Africa facility who were not
allowed to use a good link they had so I had to support them over
something about 30 kbps or so - slowly but it worked eventually
(countless RFB reconnections etc. of course).

TCP is a heavyweight protocol.  It\'s costly to set up connections
(so, you\'d want to tunnel under a semi-persistent connection, instead)
and consumes a lot of resources relative to something as trivial
as UDP.

Omitting that is what I am advising against (obviously not knowing
enough details). A tcp connection takes a syn-synack-ack and
eventually a fin-finack; if you want to make sure a button was
pressed at the other side I know of no simpler way of doing it
safely...
If you don\'t have the muscle on your mcu I know how this goes
all too well, but if you use udp you will *add* only work if
you are after having a tcp replacement. If you don\'t need more
than a few bytes just write a small/limited tcp stack, doing it
over udp will only mean more work.

If you aren\'t worried about other hosts jabbering or
the target being \"over busy\" (and missing the packet), then why
go to the expense of setting up a connection for a short transaction?
(esp if transactions are frequent)

E.g., a TCP connection per RPC/RMI would dramatically impact the
cost of each operation!

Not that much I think. You have to either choose between
reliable or best effort, this is the bottom line. Your RPC/RMI
will have to deal with dropped/reinitiated/reset connections
just like anybody else on the network, there is no middle ground,
this is my point.

TCP has a very specific niche, one that does not fit all uses, and TCP
has all manner of interesting failure modes. Nor does TCP support
multicast.

Niche as in safe bidirectional link - OK, if that is a niche
then it is a niche. It does not do multicast; if you need that
*and* a safe transaction to all nodes... well, you will have to do
something much more complex than tcp.


UDP plus some application code can actually be far simpler than TCP.

It cannot be made simpler than tcp if you need a two-way safe
communication protocol. You can write an excessively complex
tcp, obviously, but writing *the part of tcp* that you need
is simpler than doing what you need under udp, at least you
will not have to deal with the udp part of it while you *will*
have to do all of the tcp you need to do.
The idea of doing a tcp replacement under udp stems from the
assumption that tcp must be implemented in all its varieties
and thus be complex and difficult to do - which it is not;
it just is (perceived as?) not readily available.

Well the classic use case in radar is where individual radar detection
reports are _not_ precious, and so it\'s OK if one loses a few here and
there. So just fire away, but count packets sent versus received, and
complain if too many are being lost. With TCP, the link would hang up
because it assumes that all packets are precious.

For those few UDP messages that are precious, implement a retry and
timeout mechanism, one that gives up and complains after three
attempts. This being so the code does not lose control should a
packet fail to arrive. Here, TCP will just hang, waiting for Godot.
If you don't need two-way safety then you don't need it, like I said.
Obviously it makes no sense to wait for expendable data, I am not
talking about that at all.
A more everyday example would be to trigger an alarm clock remotely.
The consequences of not having it acknowledge that it was triggered
within some sane time frame can vary from benign to lethal.

And for \"safety\" applications, such as controlling a dangerous machine
of some kind, an interlocked handshake protocol based on UDP is far
far cleaner and simpler than anything else, and can be proven safe.

I don\'t know how you make a tcp replacement under udp which is
cleaner. I can only see how if it is one way (that is, one of the
sides won\'t be aware that the other side has finished the
transaction).

Depends on one\'s definition of \"safety\".

In this case, TCP is out of the question, as it does not implement a
safety-critical interlocked handshake adequate to control say a
weapon. TCP was not designed for anything of the kind.

That would depend on both context and latency requirements.
I can\'t see why a protocol other than tcp can be that much
better on interlocking, though I can see how taking just any
tcp off the shelf would not be usable generally speaking.

More generally, which is better, TCP or UDP, has no single universal
answer - it depends on the application.

There is no \"better\" - these are two very different protocols.
I happen to have implemented both and of course I know tcp takes
a lot more work to implement, just as I know how it could be done
such to be not as heavy and slow like people think \"from outside\".
 
On 8/25/2023 10:35 AM, Dimiter_Popoff wrote:
More generally, which is better, TCP or UDP, has no single universal
answer - it depends on the application.

There is no \"better\" - these are two very different protocols.
I happen to have implemented both and of course I know tcp takes
a lot more work to implement, just as I know how it could be done
such to be not as heavy and slow like people think \"from outside\".

But you likely *expose* those protocols to the developer.
The application layer has no knowledge of the network, at all.
No IP addresses. No hostnames. No sockets. etc.

A process is just given a set of \"handles\" and invokes the
methods that those handles support on them. The OS sorts
out where the targeted object resides and arranges for
the method to be applied to the referenced object wherever
it is.

Why should the programmer have to be aware of the network?
It\'s possible that every object with which it interacts
\"resides\" on the same host that *it* does!

[Only OS-level services are aware of different hosts
and even they are oblivious to the lower level mechanics
of how the network works. You can just say (assuming
you have privilege to do so) \"ThatHost.power_off\"
and whatever host is referenced by the ThatHost handle
will be powered down. Or, ThisProcess.Migrate(ThatHost)
to cause the process referenced by ThisProcess to be
migrated to the host referenced by ThatHost -- regardless
of which host currently has ThatProcess!]

One can potentially determine if an object is local or
remote by timing the operation. But, you can't rule out
that its process might be swapped out during the
invocation (which would explain a longer than expected
time). Or, that the object won't *move* between now
and the next time it is accessed. So, there is little
value to this effort beyond curiosity.
 
On 8/25/2023 22:01, Don Y wrote:
On 8/25/2023 10:35 AM, Dimiter_Popoff wrote:
More generally, which is better, TCP or UDP, has no single universal
answer - it depends on the application.

There is no \"better\" - these are two very different protocols.
I happen to have implemented both and of course I know tcp takes
a lot more work to implement, just as I know how it could be done
such to be not as heavy and slow like people think \"from outside\".

But you likely *expose* those protocols to the developer.
The application layer has no knowledge of the network, at all.
No IP addresses.  No hostnames.  No sockets.  etc.

I can only speak for the tcp I have implemented for dps.
It is heavily reliant on the inherent, run-time object system,
which has made it very flexible and expandable.
You can (even from a dps script!) "make" a tcp connection,
talking to the tcp/ip_subsystem (possibly more than one in a system);
you get what you call a handle, which is a dps object.
All it takes is to type in a command line:
op $[ts <address> <destination port> <source port>.
You can even copy data (I wrote a device driver for that)
to/from a tcp connection; I needed that initially to make things
work so I still have it. Obviously, from code you compile
(an "app" as they have it) things are not harder to do.
Once you deal with a connection you have access to a variety
of parameters you may or may not choose to get into; like
rtt, try count, time until first retry etc., these are
also available via the dps script command line, rp/sp
(as in getxpar/setxpar) returning/checking for parameter
type, unit, sign if applicable etc.
If you just \"make\" a connection it will adhere to the
defaults; if you want it to open with a sliding window
size you can do so (at open time), passively open etc.

Basically you can communicate over a tcp_connection
in the plainest way to dealing with it as much as the
code I have written to allow you to do (which is all I have
ever needed to do when I needed to).

What I was getting at, knowing a little of what you are
doing, is that you might be better off if you wrote your own
simplest tcp implementation (not like the dps one, which is
full blown, although, being written in vpa, it takes just about
300k on disk - good luck getting those figures from your C
vendors, I guess).


A process is just given a set of \"handles\" and invokes the
methods that those handles support on them.  The OS sorts
out where the targeted object resides and arranges for
the method to be applied to the referenced object wherever
it is.

Why should the programmer have to be aware of the network?
It\'s possible that every object with which it interacts
\"resides\" on the same host that *it* does!

[Only OS-level services are aware of different hosts
and even they are oblivious to the lower level mechanics
of how the network works.  You can just say (assuming
you have privilege to do so) \"ThatHost.power_off\"
and whatever host is referenced by the ThatHost handle
will be powered down.  Or, ThisProcess.Migrate(ThatHost)
to cause the process referenced by ThisProcess to be
migrated to the host referenced by ThatHost -- regardless
of which host currently has ThatProcess!]

I don't see how udp in the context of using a NIC and even
involving switches makes any difference instead of using
tcp, that would take a lot of detail. You know how friendly
suggestions are, "watch at your own risk" :)
 
On 8/25/2023 10:24 AM, Dan Purgert wrote:
Likewise, the switch cannot shove data down to any given host faster
than its interface speed (1gbps).

If 5 hosts are all shoving data out to a 6th host at 1gbps
simultaneously, then either

(A) Host_6 needs a 10gbps link
(B) Switch (eventually) starts dropping traffic, ideally while
telling Hosts{1..5} to slow down (but flow-control is an
optional feature on switches).

(C) You\'ve codified the characteristics of the switch and know
that it can sustain an overload of X on Y ports for Z seconds.
Else, why have buffers IN switches?

This point (C) is merely re-wording (B).

No. C avoids the possibility of packets being dropped entirely.
No need to tell anyone anything. No degradation of performance.
Each process\'s planned deadlines are always met.

If your traffic spike doesn\'t overfill a buffer, nobody\'s the wiser, and
nothing gets sent back telling anyone to slow down.

That\'s C.

That\'s it, there\'s nothing else network-side you can do. The potential
for dropped / lost traffic is going to be handled in your host\'s network
stack (e.g. TCP resends), or in your application directly (or both).

Okay, so you\'ve got basically a bog-standard network design ...

The fabric is not exceptional. The system/application running atop
it, however, is far from typical. So, the traffic is atypical.

E.g., each process has its own namespace. In most systems, the
filesystem acts as the UNIFIED namespace for all processes
running on that system.

So, each process can have a (trivial) namespace defined as:
/inputs
    /1
    /2
    /n
/outputs
    /A
    /B
    /C
/error
/inputs/1 for process X has nothing to do with /inputs/1 for
process Y. (just like stdin and stdout are each specific to
a particular process)

There is no way for process X to even *see* the namespace for
process Y as there is no global context that makes all of them
visible (even if only to a \"superuser\") -- unlike the shared
filesystem mentioned above.

And this is exactly how inter-host communication works in a standard
ethernet network. No one host knows what any other host is doing at any
particular time before it starts transmitting ...

And hosts don't tend to chatter amongst themselves that much.
I.e., limit the bandwidth of your databus inside your CPU
to 1Gb/s and tell me how that affects performance?

That makes no sense whatsoever. If you\'re using 10g or 100g links, just
say so (but let\'s face it, the only difference is that they get
\"faster\", it\'s not like the actual behavior I\'m describing changes).

The point is that every host is a client and a server to some
number (thousands) of objects at the same time. Remote operations
on those objects can easily queue on the port (input or output
doesn\'t matter as in either case they consume resources).

You don\'t run a DNS server on every host in your network, do you?
Let alone one for each process?

So, you don\'t see that sort of traffic IN JUST THE NAME RESOLVER
(we haven\'t even talked about *using* objects other than
namespaces!)

A good switch will happily switch all 24 / 48 ports at line rate all day
every day.

It\'s usually just easier to have bigger uplink connections (e.g.
10/100G) making up the backbone.

I\'m not leaving the switch.

Then I\'m not really sure why you\'re asking about N hosts talking to an
M\'th host ... as all that traffic will enter (and exit) your switch ...

Any host (or hosts) on the switch can make a request (RPC)
of any other host at any time. Even for synchronous requests,
having *made* a request doesn\'t mean that other requests
(from other threads on your host) can\'t *also* be tickling the
switch -- possibly the same host that the first RPC targeted,
or possibly some other.

This description still has your data \"leaving\" the switch.

AFTER it has entered and possibly been queued *in* the switch.

[My \"I\'m not leaving the switch\" comment was meant as \"I\'m not
pushing packets to ANOTHER switch\"; all of my hosts are served
by \"the\" switch]

The only time data will be queued \"in\" the switch is on an egress
buffer, but at that point it\'s already exited the switch ASIC (and all
the \"important work\" is done).

But anything wanting to use that wire is still waiting.
It doesn\'t matter where it waits; it\'s still waiting.
And, the wait is proportional to the depth of the queue
and the policy for how it is filled.

E.g., an M byte queue per port is different than an N*M
byte buffer memory that can be allocated, as needed.

Except that there is no queue, until you\'re trying to shove more data
across a port than the link can sustain (in which case, get faster
links, or slow down the transmit...)

Until you are trying to shove more data across a port
than the link can sustain IN A GIVEN TIMEFRAME. Otherwise,
why have *any* buffer?

- Forwarding Rate -- should be about 1.5 Mpps times the number of
(gigabit) ports; how many frames per second the switch can process
before it bogs down.

As long as you\'re within these specs, \"the switch\" is not impacting the
traffic at all.

If the switch is owned ENTIRELY by the application, then
these limits can be evaluated.

Evaluated? They\'re right there in the datasheet, the work\'s been done
for you.

The switch\'s specs tell me nothing about how the *application* can
avail itself of its abilities/limitations. I have to evaluate the
performance of the *application* in the constraints of the *switch*.

\"Applications\" hand off to the networking stack, which ultimately talks
in packets, frames, and bits on the wire.

So, if you want to know if \"an application\" will not starve from the
network, you need to know what kind of data rates \"an application\" needs
to sustain itself ... and provide \"the host\" with a sufficiently fast
network card (or use link aggregation).

Or, shape the traffic so it fits within the capabilities present.

\"Build a bigger power supply or use less power\" -- two approaches
to the same problem of \"fit\"

Well, yeah, but when I said that in like my first or second post
(something along the lines of \"if 1g isn\'t fast enough, get faster
interfaces\") you balked at that too ...

Faster interfaces cost more money. Smarter approach is to shape the
traffic so a slower interface (and attendant fabric) can meet your
needs. This lets you have the \"faster fabric\" option to address future
needs, instead of using up all of that capacity with a silly initial
design.
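
One hedged sketch of what "shaping the traffic" can mean at a sending
host: a token bucket that caps the sustained rate while still allowing
short bursts (parameters and names are illustrative):

#include <stdint.h>

struct shaper {
    double tokens;       /* bytes currently allowed to go out */
    double rate_Bps;     /* sustained rate, bytes/second      */
    double burst_B;      /* bucket depth = maximum burst size */
    double last_s;       /* timestamp of last refill          */
};

/* now_s: caller-supplied monotonic time in seconds.
 * Returns 1 if a frame of 'len' bytes may be sent now, 0 if it must wait. */
static int shaper_allow(struct shaper *s, double now_s, uint32_t len)
{
    s->tokens += (now_s - s->last_s) * s->rate_Bps;   /* refill            */
    if (s->tokens > s->burst_B)
        s->tokens = s->burst_B;                       /* cap at burst size */
    s->last_s = now_s;

    if (s->tokens >= len) {
        s->tokens -= len;
        return 1;
    }
    return 0;
}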

\"An Application\" can\'t suck down 10gbps if the host only has a 1gbps
link...

Of course. Except if the application spans multiple hosts
(and, thus, NICs).

Don't think of an application as *a* program running on *a*
host but, rather, a collection of processes operating on a
set of processors to address a need.

More that I\'m saying \"an application\" doesn\'t matter (or many
applications). You want to talk about the capabilities of a switch; and
that discussion is well below the level of \"application\".

You fail to see how the switch is an integral part of the application.
Like "let's not talk about the address/data bus that connects to
the processor".

The switch is the means by which the application communicates with itself.
It delivers power to the hosts. It ensures a synchronized timebase
is available across the range of hosts so a process operating on
one node has the same notion of \"now\" as processes operating on
other nodes. Or, worse, that a process migrating from one node
to another doesn\'t see a discontinuity (possibly non-monotonic) in
its notion of time.

Without the switch, the application can't work. Its performance
and characteristics are as important to the application's design
as the bus speed in a "PC" or other appliance.

Target can only serve it back to the initiator as fast as it can upload
(e.g. 1gbps, although that's quite likely limited by disk read speed).

You can *write* as fast as the initiator can synthesize packets.
There is often a cache on the target so you can write faster than the
disk\'s write rate. Likewise, a read-ahead cache to read faster than
the disk can source data.

And those are both only so deep. The point here is that there are
limits outside of the switch in the middle that have just as much (if
not more) impact on things than the switch does.

But you would qualify those just as you would the switch.
You don\'t just throw things together and \"hope\" -- unless
you don\'t really care about performance: \"It\'ll get
done, eventually\"

no, I just throw things together and know it\'ll work, because I know
what the link speeds are, and I know what jobs are trying to be done.

What do you do when the jobs change?
Or, the same jobs but reassigned to different hosts?
Rewire the switch?

The switch is an integral part of the application.

In conventional services, things just slow down. You
may wait many seconds before a request times out. And,
you may abandon that request.

But, if the application expects a certain type of performance
from the communication subsystem and something is acting as
a parasite, there, then what recourse does the application
have? It can\'t tell the switch \"disable all other ports
because their activities are interfering with my expected
performance\".

Your \"application\" is bottlenecked by your PC\'s network stack (and
ability to kick data onto the wire) before your theoretical switch gets

You\'re thinking of bog-standard hardware and OSs. No PCs involved, here.

I move memory pages directly onto the wire by prepending an appropriate
header and letting the NIC pull the data straight out of the page
(jumbo frames, no packet reassembly required, etc.).

Fine, s/PC/NIC/ then.

In either case, frames can only be kicked over the wire as fast as the
card\'s PHY / MII is capable of sustaining (plus upstream CPU dumping
data to the card, etc).

Yes. But, I can keep all 240 NICs running at link speeds (Gbe)
all day long.

So can a handful of 48-port switches... as I\'ve said countless times.

No, they can\'t. Unless you restrict the processes that are tied
to switch #1 so that they don\'t need to access \"too many\" of the
processes on switch #2. E.g., 48 ports operating at line speed
on switch #1 wanting to talk to 48 ports on switch #2.

[Keeping in mind that there are costs to money]

*They* aren\'t predictable clients. My clients are processes with
well defined objectives. I can watch their performance and traffic
to determine how to economize on the hardware. Just like you
can shift your electric load throughout the day to take advantage
of cheaper rates overnight.

Good, you have well-defined bandwidth needs then? What are they? is
1gbps / port sufficient? Or do your devices need 10g?

I\'ve deployed Gbe links throughout. That lets me stream most
practical things and still keep latency low -- for a given
cable/fabric/NIC cost.

The latter is the rub. A node costs me about $15. How much do I
want to spend to make it connect to other nodes?

How much bandwidth does a node need? PoE? Because honestly, 48x 1g
copper with PoE will run like I dunno, $400? If you don\'t need PoE,
then maybe $200? If you NEED backplane-grade interconnects, you\'re
looking at 2 grand / switch.

But if gbit to a node isn\'t fast enough, then you\'re gonna start paying
a whole lot more ...

Switch needs to be able to deliver a full 15W to each and all of the nodes.

Power has to be controllable so the system can shed loads when
power is scarce (when operating on its own backup power). Switch must
be able to report the power delivered to each node and limit it
when directed.

Switch has to implement Transparent Clocks or Boundary Clocks
for time protocol.

Switch has to be inexpensive. (i.e., a 48 port switch serves
$720 worth of hosts, so a switch of comparable cost represents
half of the hardware cost)

(Yes, you can do this. But, you have to be creative and not just resort
to off-the-shelf solutions -- this is s.e.d not lets.buy.some.kit)
 
On 8/25/2023 12:42 PM, Dimiter_Popoff wrote:
On 8/25/2023 22:01, Don Y wrote:
On 8/25/2023 10:35 AM, Dimiter_Popoff wrote:
More generally, which is better, TCP or UDP, has no single universal
answer - it depends on the application.

There is no \"better\" - these are two very different protocols.
I happen to have implemented both and of course I know tcp takes
a lot more work to implement, just as I know how it could be done
such to be not as heavy and slow like people think \"from outside\".

But you likely *expose* those protocols to the developer.
The application layer has no knowledge of the network, at all.
No IP addresses.  No hostnames.  No sockets.  etc.

I can only speak for the tcp I have implemented for dps.
It is heavily reliant on the inherent, run-time object system,
which has made it very flexible and expandable.
You can (even from a dps script!) "make" a tcp connection,
talking to the tcp/ip_subsystem (possibly more than one in a system);
you get what you call a handle, which is a dps object.
All it takes is to type in a command line:
op $[ts <address> <destination port> <source port>.
You can even copy data (I wrote a device driver for that)
to/from a tcp connection; I needed that initially to make things
work so I still have it. Obviously, from code you compile
(an "app" as they have it) things are not harder to do.
Once you deal with a connection you have access to a variety
of parameters you may or may not choose to get into; like
rtt, try count, time until first retry etc., these are
also available via the dps script command line, rp/sp
(as in getxpar/setxpar) returning/checking for parameter
type, unit, sign if applicable etc.
If you just \"make\" a connection it will adhere to the
defaults; if you want it to open with a sliding window
size you can do so (at open time), passively open etc.

But, that\'s for your device to talk to the outside world.
There\'s nothing comparable for your device to talk to
*itself* (i.e., if it was built of hundreds of CPUs).

When you want to access something in your box, you just
put the correct address on the bus and wait one access
time for the data to appear.

When you want to call a subroutine, you just execute a \"CALL\"
(JSR/BAL) opcode and pass along the target address.

When the \"subroutine\" is located on another host,
there\'s more mechanism required. If you require the
developer to manually create that scaffolding for each
such invocation, he\'s going to screw up. So, you fold
it into the OS where you can ensure its correctness
as well as protect the system from a bug (or rogue agent)
accessing something that it \"shouldn\'t\".

In an HLL, the compiler would build a stack frame and arrange
for the arguments to be transferred to the targeted subr/ftn.
I do that with a preprocessor (to give me a friendlier, custom
syntax) and an IDL compiler to generate client and server-side
stubs for each method.

The developer thinks he is writing code for a uniprocessor
with two exceptions:
- each function/subr returns an additional argument that
indicates the effectiveness of the invocation mechanism
(e.g., object doesn\'t exist, insufficient privilege,
transaction not completed, etc.) instead of just the
expected return value
OR
- the stubs throw exceptions that the developer has
to handle, explicitly.

[I\'ve not settled on which is the \"better\" approach as
both are foreign to most developers. As even error
handling is anathema to some! :> (this leads me to want
to use exceptions universally as I can always install a
default exception handler that kills their process if
they fail to deal with an exception)]
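
For what the first convention might look like at the stub level - a
hypothetical sketch, not actual IDL output:

/* Sketch of the "extra status return" convention: every generated stub
 * returns an rpc_status_t describing the *invocation*, separate from the
 * method's own return value. Names are illustrative. */
typedef int handle_t;             /* opaque object reference */

typedef enum {
    RPC_OK = 0,
    RPC_NO_OBJECT,                /* referenced object no longer exists   */
    RPC_NO_PRIVILEGE,             /* caller lacks the required capability */
    RPC_INCOMPLETE,               /* transport gave up; result undefined  */
} rpc_status_t;

/* IDL: double sqrt(double x);  -->  generated client stub: */
rpc_status_t Math_sqrt(handle_t target, double x, double *result);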

The OS handles the case of sorting out where the object
resides (i.e., which server backs it), *now*. And, if
an object happens to be in the process of migrating,
ensures the message gets redirected to the object\'s
\"new\" home. So, the developer doesn\'t have to sweat
those details.

Basically you can communicate over a tcp_connection
in the plainest way to dealing with it as much as the
code I have written to allow you to do (which is all I have
ever needed to do when I needed to).

What I was getting at knowing a little of what you are
doing is that I suggested that you might be better off
if you wrote your simplest tcp implementations (not like
the dps one, it is full blown, although being vpa written
it takes just about 300k on disk, good luck with these figures
with your C vendors I guess).

There are lots of things happening on the network at all times
(think of it as a bus). The OS keeps track of all of these
so it has "other" information that the (ahem) "network stack"
lacks. I leverage this knowledge to know when transmissions
are faulty and diagnose how I should address those problems
(is a node defective? switch port? are the failures
related to a particular process or object? etc.). These are
all things that the "protocol" can't keep track of but that
are important to the application's proper operation.

A process is just given a set of \"handles\" and invokes the
methods that those handles support on them.  The OS sorts
out where the targeted object resides and arranges for
the method to be applied to the referenced object wherever
it is.

Why should the programmer have to be aware of the network?
It\'s possible that every object with which it interacts
\"resides\" on the same host that *it* does!

[Only OS-level services are aware of different hosts
and even they are oblivious to the lower level mechanics
of how the network works.  You can just say (assuming
you have privilege to do so) \"ThatHost.power_off\"
and whatever host is referenced by the ThatHost handle
will be powered down.  Or, ThisProcess.Migrate(ThatHost)
to cause the process referenced by ThisProcess to be
migrated to the host referenced by ThatHost -- regardless
of which host currently has ThatProcess!]

I don't see how udp in the context of using a NIC and even
involving switches makes any difference instead of using
tcp, that would take a lot of detail. You know how friendly
suggestions are, "watch at your own risk" :)

I prototyped my system atop a traditional network stack using
TCP as a tunnel for each RPC. Performance sucked. I lumped
all of the problems into stack+protocol instead of looking into
ways of refining it.

As I (the OS) have control over everything that gets exchanged
over the network, it was easiest to design a protocol that
lets me move what I want as efficiently as possible, given
the characteristics of the hardware hosting me.

E.g., I originally used \"tiny pages\" to move parameters
onto the network. But, they aren\'t present on newer
CPU offerings so I now have to use bigger pages.

But, regardless, I can do things like pass a parameter
block to a remote procedure and let the contents of the
block (page!) be altered after the call -- but possibly
before the parameters have been put onto the wire -- which
would normally be a nasty race and likely impossible to
diagnose due to its likely intermittent nature.
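
The brute-force way to get that same guarantee, for comparison -
snapshot the parameter block at call time so the caller is free to
scribble on it immediately afterward (the transport call here is
assumed, not a real API):

#include <stdlib.h>
#include <string.h>

extern int enqueue_for_wire(void *buf, size_t len);   /* assumed transport,
                                                         frees buf when sent */

int rpc_call_with_params(const void *params, size_t len)
{
    void *snap = malloc(len);
    if (!snap)
        return -1;
    memcpy(snap, params, len);    /* caller's copy is now free to change */
    return enqueue_for_wire(snap, len);
}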

Spend CPU resources to make the product more robust.
CPUs/memory are cheap.
 
On 2023-08-25, Don Y wrote:
On 8/25/2023 9:58 AM, Dimiter_Popoff wrote:
On 8/25/2023 19:31, Don Y wrote:
...

If I put a UART on a PCB and have it talk to another UART
on the same PCB, would you expect it to be unreliable?
I.e., context determines the risks you have to address.

While I would do that with no second thoughts in plenty of
cases it is not as safe as \"tcp\" is, i.e. two-way safe like
in syn-ack-synack etc.

If I put a NIC on a PCB and have it talk to another NIC
on the same PCB, would you expect it to be unreliable? :>

Yes, same as RS485 or raw UART to UART. It\'s why we have CRC32 / parity
bits / etc.

[...]
Remember, all of the legacy protocols were designed when networking
was far more of a crap shoot.

And the only reason it isn\'t a crapshoot is because of those protocols /
safeties.

--
|_|O|_|
|_|_|O| Github: https://github.com/dpurgert
|O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860
 
On 2023-08-25, Don Y wrote:
On 8/25/2023 10:24 AM, Dan Purgert wrote:
Likewise, the switch cannot shove data down to any given host faster
than its interface speed (1gbps).

If 5 hosts are all shoving data out to a 6th host at 1gbps
simultaneously, then either

(A) Host_6 needs a 10gbps link
(B) Switch (eventually) starts dropping traffic, ideally while
telling Hosts{1..5} to slow down (but flow-control is an
optional feature on switches).

(C) You\'ve codified the characteristics of the switch and know
that it can sustain an overload of X on Y ports for Z seconds.
Else, why have buffers IN switches?

This point (C) is merely re-wording (B).

No. C avoids the possibility of packets being dropped entirely.
No need to tell anyone anything. No degradation of performance.
Each process\'s planned deadlines are always met.

well, if someone ever develops a switch that works like that, they\'ll
probably corner the market in a heartbeat. Granted it\'ll probably cost
a small country\'s GDP ...
If your traffic spike doesn\'t overfill a buffer, nobody\'s the wiser, and
nothing gets sent back telling anyone to slow down.

That\'s C.

That\'s B. You just re-worded it to make yourself happy.

I.e., limit the bandwidth of your databus inside your CPU
to 1Gb/s and tell me how that affects performance?

That makes no sense whatsoever. If you\'re using 10g or 100g links, just
say so (but let\'s face it, the only difference is that they get
\"faster\", it\'s not like the actual behavior I\'m describing changes).

The point is that every host is a client and a server to some
number (thousands) of objects at the same time. Remote operations
on those objects can easily queue on the port (input or output
doesn\'t matter as in either case they consume resources).

You don\'t run a DNS server on every host in your network, do you?
Let alone one for each process?

Yes. All the machines keep local caches, because then they don\'t need
to wait on a (potentially bogged down) upstream device.

This has been a standard feature for the last 2 decades or so (15 years?
When did Win7 come out again?) Mac / Linux / phones had it earlier, but
only by a short margin.

You can, of course, disable this, but then you\'re (potentially) waiting
on the upstream DNS cache on your router (or whatever\'s configured as
your closest upstream DNS resolver).

[...]
E.g., an M byte queue per port is different than an N*M
byte buffer memory that can be allocated, as needed.

Except that there is no queue, until you\'re trying to shove more data
across a port than the link can sustain (in which case, get faster
links, or slow down the transmit...)

Until you are trying to shove more data across a port
than the link can sustain IN A GIVEN TIMEFRAME. Otherwise,
why have *any* buffer?

To absorb an insanely small spike in data egress. You do understand how
little 1 megabyte is when data is flying around at >1gbps, right?
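
To put numbers on the M-vs-N*M distinction above, a toy admission check
(sizes and names invented): with a dedicated queue, a burst aimed at one port
can only use that port's M bytes; a shared pool lets the busy port borrow the
headroom that idle ports aren't using.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define NPORTS      48
    #define M_PER_PORT  (1u << 20)            /* 1 MB per port, illustrative */

    /* Dedicated scheme: each port may buffer at most M bytes. */
    static size_t dedicated_used[NPORTS];

    static bool dedicated_admit(int port, size_t len)
    {
        if (dedicated_used[port] + len > M_PER_PORT)
            return false;                  /* drop: this port's queue is full */
        dedicated_used[port] += len;
        return true;
    }

    /* Shared scheme: one N*M pool, handed out to whichever port needs it. */
    static size_t pool_used;

    static bool shared_admit(int port, size_t len)
    {
        (void)port;
        if (pool_used + len > (size_t)NPORTS * M_PER_PORT)
            return false;                  /* drop only when the pool is gone */
        pool_used += len;
        return true;
    }

    int main(void)
    {
        /* A 3 MB burst aimed at one egress port: */
        printf("dedicated: %s\n",
               dedicated_admit(5, 3u << 20) ? "absorbed" : "dropped");
        printf("shared:    %s\n",
               shared_admit(5, 3u << 20) ? "absorbed" : "dropped");
        return 0;
    }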

[...]
Or, shape the traffic so it fits within the capabilities present.

\"Build a bigger power supply or use less power\" -- two approaches
to the same problem of \"fit\"

Well, yeah, but when I said that in like my first or second post
(something along the lines of \"if 1g isn\'t fast enough, get faster
interfaces\") you balked at that too ...

Faster interfaces cost more money. Smarter approach is to shape the
traffic so a slower interface (and attendant fabric) can meet your
needs. This lets you have the \"faster fabric\" option to address future
needs, instead of using up all of that capacity with a silly initial
design.

10/100G is \"standard\" these days. But if you want to limit your link
to 1gbps, then any \"N\" hosts can only ever talk to an M\'th host at a
combined rate of 1gbps, and there\'s nothing more to really discuss. You
know your limits.

\"An Application\" can\'t suck down 10gbps if the host only has a 1gbps
link...

Of course. Except if the application spans multiple hosts
(and, thus, NICs).

Don\'t think of an application as *a* program running on *a*
host but, rather, a collection of processes operating on a
set of processors to address a need.

More that I\'m saying \"an application\" doesn\'t matter (or many
applications). You want to talk about the capabilities of a switch; and
that discussion is well below the level of \"application\".

You fail to see how the switch is an integral part of the application.
Like \"lets not talk about the address/data bus that connects to
the processor\".

No, you\'re just trying to make things more complex than they really are.

Let\'s take the switch out of the equation, and just have 2 hosts
interconnected via their gbit interfaces for this \"application\". How
fast can they share data?

[...]
no, I just throw things together and know it\'ll work, because I know
what the link speeds are, and I know what jobs are trying to be done.

What do you do when the jobs change?

Does this change in jobs come with a commensurate change in link speed
requirements? No? Well, nothing to change then, since the existing
1gbps links are sufficient.

Do the hosts performing the job have capacities for faster link speeds?
No? Well, nothing to change then, since even though you want more speed,
the host is the physical bottleneck.

If the answer to either question is \"yes\", then what\'s done depends on
what options I have to hand. Maybe I use Link Aggregation, maybe I move
a host to 10g.

But before I do anything, there needs to be some form of a concrete \"hey
we need to upgrade HostA from 1gbps to Xgbps \".


[...]
Yes. But, I can keep all 240 NICs running at link speeds (Gbe)
all day long.

So can a handful of 48-port switches... as I\'ve said countless times.

No, they can\'t. Unless you restrict the processes that are tied
to switch #1 so that they don\'t need to access \"too many\" of the
processes on switch #2. E.g., 48 ports operating at line speed
on switch #1 wanting to talk to 48 ports on switch #2.

So you either

(a) use 100gbit uplinks between them OR
(b) use backplane interconnects

Either approach works fine; though with 240 downstream hosts, I\'d
probably go for a backplane interconnect.

Cue your new complaint that it\'s expensive... well no duh, but you\'re
the one who has decided that 240 hosts need to communicate at 1gbps at
all times regardless of which switch the traffic has to cross.

How much bandwidth does a node need? PoE? Because honestly, 48x 1g
copper with PoE will run like I dunno, $400? If you don\'t need PoE,
then maybe $200? If you NEED backplane-grade interconnects, you\'re
looking at 2 grand / switch.

But if gbit to a node isn\'t fast enough, then you\'re gonna start paying
a whole lot more ...

Switch needs to be able to deliver a full 15W to each and all of the nodes.

meh, that\'s any 750 watt PoE switch on the market (for 48 ports). Might
run you $500 - $700 these days (they used to be $400, but, well 2020
killed that)

Power has to be controllable so the system can shed loads when
power is scarce (when operating on its own backup power). Switch must
be able to report the power delivered to each node and limit on
direction.

\"limit on direction\" ?

Switches aren\'t smart though, they don\'t know what they\'re plugged into
(BUT there are some neat WISP ones that might actually break into this
territory ... if you buy into their full power delivery system too).
Switch has to implement Transparent Clocks or Boundary Clocks
for time protocol.

Switch has to be inexpensive. (i.e., a 48 port switch serves
$720 worth of hosts, so a switch costing a comparable amount represents
half of the hardware cost)

Yep, but that\'s what they cost. Better $700 than $7000 though.

(Yes, you can do this. But, you have to be creative and not just resort
to off-the-shelf solutions -- this is s.e.d not lets.buy.some.kit)

Well, if you think you can develop a switch ASIC that meets your
requirements, AND undercuts the existing vendors on price... lemme know when
you\'re putting product to market. I could use another option.


--
|_|O|_|
|_|_|O| Github: https://github.com/dpurgert
|O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860
 
On 8/25/2023 5:13 PM, Dan Purgert wrote:
On 2023-08-25, Don Y wrote:
On 8/25/2023 9:58 AM, Dimiter_Popoff wrote:
On 8/25/2023 19:31, Don Y wrote:
...

If I put a UART on a PCB and have it talk to another UART
on the same PCB, would you expect it to be unreliable?
I.e., context determines the risks you have to address.

While I would do that with no second thoughts in plenty of
cases it is not as safe as \"tcp\" is, i.e. two-way safe like
in syn-ack-synack etc.

If I put a NIC on a PCB and have it talk to another NIC
on the same PCB, would you expect it to be unreliable? :>

Yes, same as RS485 or raw UART to UART. It\'s why we have CRC32 / parity
bits / etc.

[...]
Remember, all of the legacy protocols were designed when networking
was far more of a crap shoot.


And the only reason it isn\'t a crapshoot is because of those protocols /
safeties.

(sigh) Wow, you REALLY are trapped in conventional thinking!
Best let others do it for you!
 
On 8/25/2023 5:54 PM, Dan Purgert wrote:
On 2023-08-25, Don Y wrote:
On 8/25/2023 10:24 AM, Dan Purgert wrote:
Likewise, the switch cannot shove data down to any given host faster
than its interface speed (1gbps).

If 5 hosts are all shoving data out to a 6th host at 1gbps
simultaneously, then either

(A) Host_6 needs a 10gbps link
(B) Switch (eventually) starts dropping traffic, ideally while
telling Hosts{1..5} to slow down (but flow-control is an
optional feature on switches).

(C) You\'ve codified the characteristics of the switch and know
that it can sustain an overload of X on Y ports for Z seconds.
Else, why have buffers IN switches?

This point (C) is merely re-wording (B).

No. C avoids the possibility of packets being dropped entirely.
No need to tell anyone anything. No degradation of performance.
Each process\'s planned deadlines are always met.

well, if someone ever develops a switch that works like that, they\'ll
probably corner the market in a heartbeat. Granted it\'ll probably cost
a small country\'s GDP ...

So, you\'re saying switches have no predictable characteristics.
Makes you wonder why they specify any, then...

If your traffic spike doesn\'t overfill a buffer, nobody\'s the wiser, and
nothing gets sent back telling anyone to slow down.

That\'s C.

That\'s B. You just re-worded it to make yourself happy.

No. B allows for the possibility of dropped packets:
\"(eventually) starts dropping traffic\". This condition
doesn\'t exist in C.

I.e., limit the bandwidth of your databus inside your CPU
to 1Gb/s and tell me how that affects performance?

That makes no sense whatsoever. If you\'re using 10g or 100g links, just
say so (but let\'s face it, the only difference is that they get
\"faster\", it\'s not like the actual behavior I\'m describing changes).

The point is that every host is a client and a server to some
number (thousands) of objects at the same time. Remote operations
on those objects can easily queue on the port (input or output
doesn\'t matter as in either case they consume resources).

You don\'t run a DNS server on every host in your network, do you?
Let alone one for each process?

Yes. All the machines keep local caches, because then they don\'t need
to wait on a (potentially bogged down) upstream device.

That\'s a RESOLVER not a DNS *server*.

This has been a standard feature for the last 2 decades or so (15 years?
When did Win7 come out again?) Mac / Linux / phones had it earlier, but
only by a short margin.

You can, of course, disable this, but then you\'re (potentially) waiting
on the upstream DNS cache on your router (or whatever\'s configured as
your closest upstream DNS resolver).

And, in my application, you ALWAYS want to resolve a name with
the latest binding (e.g., as if TTL=0) because bindings can
be changed by privileged processes. So, \"/error\" may be changed
by a diagnostic process that wants to examine your error stream
NOW, but not THEN. Because you likely will run 24/7/365 and
we don\'t want to have to signal you to force a reexamination
of the name bindings that your creator thought were important
to you at the time of creation.

[I am only using DNS as an example as I strongly suspect you\'ve
never written code to run in an environment without a filesystem
to act as the sole *shared* namespace.]
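
A toy of what resolving with the latest binding (TTL=0, in effect) looks
like; the table, handles and names here are invented for illustration: every
lookup consults the live table, so a rebinding of \"/error\" by a privileged
process is seen on the very next resolution, with no cached answer left to
age out.

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define NBINDINGS 4

    struct binding { const char *name; int handle; };

    /* Live table; a privileged process may rewrite entries at any time. */
    static struct binding table[NBINDINGS] = {
        { "/error", 10 }, { "/log", 11 }, { "/clock", 12 }, { "/dev/audio", 13 },
    };
    static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Resolve by consulting the live table on every call -- effectively TTL=0. */
    static int resolve(const char *name)
    {
        int handle = -1;
        pthread_mutex_lock(&table_lock);
        for (int i = 0; i < NBINDINGS; i++)
            if (strcmp(table[i].name, name) == 0) { handle = table[i].handle; break; }
        pthread_mutex_unlock(&table_lock);
        return handle;
    }

    /* A privileged caller redirects a name; later resolutions see the change. */
    static void rebind(const char *name, int new_handle)
    {
        pthread_mutex_lock(&table_lock);
        for (int i = 0; i < NBINDINGS; i++)
            if (strcmp(table[i].name, name) == 0) { table[i].handle = new_handle; break; }
        pthread_mutex_unlock(&table_lock);
    }

    int main(void)
    {
        printf("/error -> %d\n", resolve("/error"));   /* original binding      */
        rebind("/error", 99);                          /* diagnostic takes over */
        printf("/error -> %d\n", resolve("/error"));   /* next lookup sees 99   */
        return 0;
    }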

[...]
E.g., an M byte queue per port is different than an N*M
byte buffer memory that can be allocated, as needed.

Except that there is no queue, until you\'re trying to shove more data
across a port than the link can sustain (in which case, get faster
links, or slow down the transmit...)

Until you are trying to shove more data across a port
than the link can sustain IN A GIVEN TIMEFRAME. Otherwise,
why have *any* buffer?

To absorb an insanely small spike in data egress. You do understand how
little 1 megabyte is when data is flying around at >1gbps, right?

Do you understand how LARGE a time interval that is in a real-time
system? 8 million bits is 8ms. I use Gbe because the time required
to transmit 4KB at 100Mbs is ghastly long when a typical ftn call
may execute in a handful of MICROseconds.
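
The back-of-envelope numbers behind that, ignoring framing and protocol
overhead:

    #include <stdio.h>

    int main(void)
    {
        double buffer_bits = 8e6;         /* the "1 megabyte" buffer: 8 million bits */
        double page_bits   = 4096.0 * 8;  /* a 4KB parameter block                   */

        printf("1 MB at 1 Gb/s:   %.1f ms\n", buffer_bits / 1e9 * 1e3);   /* 8.0 ms  */
        printf("4 KB at 100 Mb/s: %.0f us\n", page_bits / 100e6 * 1e6);   /* ~328 us */
        printf("4 KB at 1 Gb/s:   %.0f us\n", page_bits / 1e9 * 1e6);     /* ~33 us  */
        return 0;
    }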

You\'re thinking like a datacenter where the client is a human being
who won\'t notice delays and/or latency.

Or, shape the traffic so it fits within the capabilities present.

\"Build a bigger power supply or use less power\" -- two approaches
to the same problem of \"fit\"

Well, yeah, but when I said that in like my first or second post
(something along the lines of \"if 1g isn\'t fast enough, get faster
interfaces\") you balked at that too ...

Faster interfaces cost more money. Smarter approach is to shape the
traffic so a slower interface (and attendant fabric) can meet your
needs. This lets you have the \"faster fabric\" option to address future
needs, instead of using up all of that capacity with a silly initial
design.

10/100G is \"standard\" these days. But if you want to limit your link
to 1gbps, then any \"N\" hosts can only ever talk to an M\'th host at a
combined rate of 1gbps, and there\'s nothing more to really discuss. You
know your limits.

10/100G is NOT \"standard\" in $15 embedded devices. Witness the dearth
of MCUs and SoCs with 10G NICs.

Again, data center...

\"An Application\" can\'t suck down 10gbps if the host only has a 1gbps
link...

Of course. Except if the application spans multiple hosts
(and, thus, NICs).

Don\'t think of an application as *a* program running on *a*
host but, rather, a collection of processes operating on a
set of processors to address a need.

More that I\'m saying \"an application\" doesn\'t matter (or many
applications). You want to talk about the capabilities of a switch; and
that discussion is well below the level of \"application\".

You fail to see how the switch is an integral part of the application.
Like \"lets not talk about the address/data bus that connects to
the processor\".

No, you\'re just trying to make things more complex than they really are.

Let\'s take the switch out of the equation, and just have 2 hosts
interconnected via their gbit interfaces for this \"application\". How
fast can they share data?

At 1Gbps. Add 2 more hosts and THEY will saturate their links.
And 2 more after that. etc. Because each additional processor brings
additional *processes* and abilities to the party. The alternative is
to have dumb devices and put all the effort into some big central
processor and have to periodically update it as the number of
\"peripherals\" increases.

The links are the bottleneck, just like memory is the bottleneck
in a SoC.

no, I just throw things together and know it\'ll work, because I know
what the link speeds are, and I know what jobs are trying to be done.

What do you do when the jobs change?

Does this change in jobs come with a commensurate change in link speed
requirements? No? Well, nothing to change then, since the existing
1gbps links are sufficient.

Of course it does! Because you\'ve told me I can\'t MOVE the jobs to reshape
the traffic. *YOU* have to alter the network configuration to accommodate
the different traffic. I alter the workload to make it fit in the
existing resource constraints.

Do the hosts performing the job have capacities for faster link speeds?
No? Well, nothing to change then, since even though you want more speed,
the host is the physical bottleneck.

Read the above. You keep thinking conventionally.

If the answer to either question is \"yes\", then what\'s done depends on
what options I have to hand. Maybe I use Link Aggregation, maybe I move
a host to 10g.

And, \"you\" are shipped with every system sold? \"Dear homeowner, here is
your complimentary technician to redesign the network in your home.\"
(Gee, I didn\'t realize adding a few security cameras was going to be
such a hassle!)

But before I do anything, there needs to be some form of a concrete \"hey
we need to upgrade HostA from 1gbps to Xgbps \".

So, the homeowner has to have his prior collection of kit reassessed
in order to ADD anything to it? I\'m sure that\'s a great selling point
(NOT!).

Or the school district. Or the business owner. Or...

No one wants to upgrade existing kit that has been working
(and is still working) just because they want to add to the
mix. That\'s the flaw of the \"big central processor\"
approach -- it doesn\'t SCALE.

Yes. But, I can keep all 240 NICs running at link speeds (Gbe)
all day long.

So can a handful of 48-port switches... as I\'ve said countless times.

No, they can\'t. Unless you restrict the processes that are tied
to switch #1 so that they don\'t need to access \"too many\" of the
processes on switch #2. E.g., 48 ports operating at line speed
on switch #1 wanting to talk to 48 ports on switch #2.

So you either

(a) use 100gbit uplinks between them OR
(b) use backplane interconnects

Either approach works fine; though with 240 downstream hosts, I\'d
probably go for a backplane interconnect.

And the associated cost?

Again, datacenter...

I.e., *you* are limited to that type of solution -- because you
have no control of the code executing on each host and the code
executing wasn\'t written with this inevitability in mind. I\'m not
because I\'ve addressed this issue in the initial design of the
system.

Cue your new complaint that it\'s expensive... well no duh, but you\'re
the one who has decided that 240 hosts need to communicate at 1gbps at
all times regardless of which switch the traffic has to cross.

Yet again data center.

(Have you ever designed any consumer kit?)

How much bandwidth does a node need? PoE? Because honestly, 48x 1g
copper with PoE will run like I dunno, $400? If you don\'t need PoE,
then maybe $200? If you NEED backplane-grade interconnects, you\'re
looking at 2 grand / switch.

But if gbit to a node isn\'t fast enough, then you\'re gonna start paying
a whole lot more ...

Switch needs to be able to deliver a full 15W to each and all of the nodes.

meh, that\'s any 750 watt PoE switch on the market (for 48 ports). Might
run you $500 - $700 these days (they used to be $400, but, well 2020
killed that)

So, we\'re already making the switch half of the cost of the system...

Power has to be controllable so the system can shed loads when
power is scarce (when operating on its own backup power). Switch must
be able to report the power delivered to each node and limit on
direction.

\"limit on direction\" ?

When directed by the application. E.g., if power fails, then
I won\'t want to waste battery on powering extra nodes whose
workload can be deferred (to a time when power returns).
Or, when a workload is removed, being able to power down nodes
and consolidate their workloads on the remaining hosts.
(Or, power UP nodes to avail yourself of the additional
compute power they provide)

Switches aren\'t smart though, they don\'t know what they\'re plugged into
(BUT there are some neat WISP ones that might actually break into this
territory ... if you buy into their full power delivery system too).

Switch doesn\'t have to know when to make these decisions.
But, has to provide the data to the application (how much
power each node is *currently* using not just how much
it has negotiated in its most recent PoE negotiation)
and the ability for the application to reconstrain the
power delivered to the node (increase or decrease).
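
A sketch of the sort of interface that implies, with entirely invented names
(real PoE controller registers and MIBs look nothing like this): per-port
instantaneous draw, plus the ability for the application to raise, lower or
zero each port's budget when shedding load.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NPORTS 48

    struct port_power {
        uint16_t limit_mw;    /* budget the application has granted this port  */
        uint16_t draw_mw;     /* what the attached node is drawing *right now* */
        bool     enabled;     /* false = port powered down (load shed)         */
    };

    static struct port_power ports[NPORTS];

    /* Application raises or lowers a node's budget (0 = shed it entirely). */
    static void set_port_limit(int port, uint16_t limit_mw)
    {
        ports[port].limit_mw = limit_mw;
        ports[port].enabled  = (limit_mw > 0);
    }

    /* Report total draw so the system can decide what to shed on battery. */
    static uint32_t total_draw_mw(void)
    {
        uint32_t sum = 0;
        for (int i = 0; i < NPORTS; i++)
            if (ports[i].enabled)
                sum += ports[i].draw_mw;
        return sum;
    }

    int main(void)
    {
        for (int i = 0; i < NPORTS; i++) {     /* pretend every node idles at 4 W */
            ports[i] = (struct port_power){
                .limit_mw = 15000, .draw_mw = 4000, .enabled = true
            };
        }
        printf("total draw: %u mW\n", (unsigned)total_draw_mw());
        set_port_limit(7, 0);                  /* power fails: shed node 7 */
        printf("after shedding port 7: %u mW\n", (unsigned)total_draw_mw());
        return 0;
    }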

Switch has to implement Transparent Clocks or Boundary Clocks
for time protocol.

Switch has to be inexpensive. (i.e., a 48 port switch serves
$720 worth of hosts, so a switch costing a comparable amount represents
half of the hardware cost)

Yep, but that\'s what they cost. Better $700 than $7000 though.

So, considerably more than I\'ve spent in my approach.
Market advantage!

(Yes, you can do this. But, you have to be creative and not just resort
to off-the-shelf solutions -- this is s.e.d not lets.buy.some.kit)

Well, if you think you can develop a switch ASIC that meets your
requirements, AND undercuts the existing vendors on price... lemme know when
you\'re putting product to market. I could use another option.

You\'ve got such a conventional outlook on how to address the problem!
You need to learn to ignore the box you\'re living in! :>
 
On 2023-08-26, Don Y wrote:
On 8/25/2023 5:54 PM, Dan Purgert wrote:
On 2023-08-25, Don Y wrote:
On 8/25/2023 10:24 AM, Dan Purgert wrote:
Likewise, the switch cannot shove data down to any given host faster
than its interface speed (1gbps).

If 5 hosts are all shoving data out to a 6th host at 1gbps
simultaneously, then either

(A) Host_6 needs a 10gbps link
(B) Switch (eventually) starts dropping traffic, ideally while
telling Hosts{1..5} to slow down (but flow-control is an
optional feature on switches).

(C) You\'ve codified the characteristics of the switch and know
that it can sustain an overload of X on Y ports for Z seconds.
Else, why have buffers IN switches?

This point (C) is merely re-wording (B).

No. C avoids the possibility of packets being dropped entirely.
No need to tell anyone anything. No degradation of performance.
Each process\'s planned deadlines are always met.

well, if someone ever develops a switch that works like that, they\'ll
probably corner the market in a heartbeat. Granted it\'ll probably cost
a small country\'s GDP ...

So, you\'re saying switches have no predictable characteristics.
Makes you wonder why they specify any, then...

No, I\'m saying that they don\'t work the way you think they do.

If your traffic spike doesn\'t overfill a buffer, nobody\'s the wiser, and
nothing gets sent back telling anyone to slow down.

That\'s C.

That\'s B. You just re-worded it to make yourself happy.

No. B allows for the possibility of dropped packets:
\"(eventually) starts dropping traffic\". This condition
doesn\'t exist in C.

And the removal of that condition doesn\'t exist here in the real world.

[...]
Yes. All the machines keep local caches, because then they don\'t need
to wait on a (potentially bogged down) upstream device.

That\'s a RESOLVER not a DNS *server*.

dnsmasq is, in fact, a dns server.

[...]
[I am only using DNS as an example as I strongly suspect you\'ve
never written code to run in an environment without a filesystem
to act as the sole *shared* namespace.]

I\'m a networking guy. You tell me what bandwidth you need, I design the
infra that gets it to you ..

[...]
E.g., an M byte queue per port is different than an N*M
byte buffer memory that can be allocated, as needed.

Except that there is no queue, until you\'re trying to shove more data
across a port than the link can sustain (in which case, get faster
links, or slow down the transmit...)

Until you are trying to shove more data across a port
than the link can sustain IN A GIVEN TIMEFRAME. Otherwise,
why have *any* buffer?

To absorb an insanely small spike in data egress. You do understand how
little 1 megabyte is when data is flying around at >1gbps, right?

Do you understand how LARGE a time interval that is in a real-time
system? 8 million bits is 8ms. I use Gbe because the time required

Aren\'t you the one who wants \"zero delay\" in his network? Can\'t be
touching these buffers if that\'s what you want.

You\'re thinking like a datacenter where the client is a human being
who won\'t notice delays and/or latency.

No. I\'m trying to get you to forget about the buffers. Ideally you
*never* touch them, because as soon as you do, you\'re introducing delay.

[...]
10/100G is NOT \"standard\" in $15 embedded devices. Witness the dearth
of MCUs and SoCs with 10G NICs.

Nope, but it\'s industry standard in switches (which is what we\'re
talking about).

But now that you\'ve finally said you\'re limited to 1g, a lot of the
theorycraft can be thrown out the window.

In order to have N hosts simultaneously communicate with an M\'th host,
they MUST NOT exceed a combined data rate of >1gbps.

Let\'s take the switch out of the equation, and just have 2 hosts
interconnected via their gbit interfaces for this \"application\". How
fast can they share data?

At 1Gbps.

Right. And with a switch hosting 48 of these 1gbps devices ... they can
all communicate simultaneously at line rate. The switch won\'t even
exist as far as they\'re concerned.

UNTIL such time that you try to send >1gbps worth of traffic out any one
port (because it\'s physically impossible).

The links are the bottleneck, just like memory is the bottleneck
in a SoC.

Yes, which has nothing to do with \"the switch\".


[...]
What do you do when the jobs change?

Does this change in jobs come with a commensurate change in link speed
requirements? No? Well, nothing to change then, since the existing
1gbps links are sufficient.

Of course it does! Because you\'ve told me I can\'t MOVE the jobs to reshape
the traffic. *YOU* have to alter the network configuration to accommodate
the different traffic. I alter the workload to make it fit in the
existing resource constraints.

So you\'re changing your link speed requirements, you need faster links
on your hosts. There\'s no way around it. Can\'t fit 10 pounds of
potatoes in a 5 pound bag and all that.

If the answer to either question is \"yes\", then what\'s done depends on
what options I have to hand. Maybe I use Link Aggregation, maybe I move
a host to 10g.

And, \"you\" are shipped with every system sold? \"Dear homeowner, here is
your complimentary technician to redesign the network in your home.\"
(Gee, I didn\'t realize adding a few security cameras was going to be
such a hassle!)

I mean, I was under the impression we were talking about \"me, as a
network professional\" talking to \"you, a design professional\" hammering
out concrete requirements for some new network.

Your whole argument seems to center around this idea that your hosts
with gbit links can somehow magically process data faster than 1gbit /
second.

But before I do anything, there needs to be some form of a concrete \"hey
we need to upgrade HostA from 1gbps to Xgbps \".

So, the homeowner has to have his prior collection of kit reassessed
in order to ADD anything to it? I\'m sure that\'s a great selling point
(NOT!).

I mean, you gotta know if you have a free port available.

Or the school district. Or the business owner. Or...

No one wants to upgrade existing kit that has been working
(and is still working) just because they want to add to the
mix. That\'s the flaw of the \"big central processor\"
approach -- it doesn\'t SCALE.

The funny thing is, networks scale extremely well without agonizing over
their \"real time\" performance. Monitor things, and if we notice links
getting bogged down, then we consider increasing capacity.

But it\'s also pretty trivial to generalize traffic flow (for example,
your bog standard \"writing documents\" type office work can still be handled
over fast ethernet without too much pain. We don\'t *do* that, but... )

Yes. But, I can keep all 240 NICs running at link speeds (Gbe)
all day long.

So can a handful of 48-port switches... as I\'ve said countless times.

No, they can\'t. Unless you restrict the processes that are tied
to switch #1 so that they don\'t need to access \"too many\" of the
processes on switch #2. E.g., 48 ports operating at line speed
on switch #1 wanting to talk to 48 ports on switch #2.

So you either

(a) use 100gbit uplinks between them OR
(b) use backplane interconnects

Either approach works fine; though with 240 downstream hosts, I\'d
probably go for a backplane interconnect.

And the associated cost?

Again, datacenter...

Again, \"reality\". But hey, if you can build a switch that\'ll move
500gbps around for $25 ... more power to you.

Cue your new complaint that it\'s expensive... well no duh, but you\'re
the one who has decided that 240 hosts need to communicate at 1gbps at
all times regardless of which switch the traffic has to cross.

Yet again data center.

Stuff costs money... sometimes a lot of it.

How much bandwidth does a node need? PoE? Because honestly, 48x 1g
copper with PoE will run like I dunno, $400? If you don\'t need PoE,
then maybe $200? If you NEED backplane-grade interconnects, you\'re
looking at 2 grand / switch.

But if gbit to a node isn\'t fast enough, then you\'re gonna start paying
a whole lot more ...

Switch needs to be able to deliver a full 15W to each and all of the nodes.

meh, that\'s any 750 watt PoE switch on the market (for 48 ports). Might
run you $500 - $700 these days (they used to be $400, but, well 2020
killed that)

So, we\'re already making the switch half of the cost of the system...

I mean, if that\'s what you need because of your other requirements (PoE,
etc). Note that there is a point where chassis models start to win on
per-port costs, but those options are usually only available from the
\"2k for a basic 48 port switch\" type outfits (and, well, I can get 2x
PoE switches for that from my smaller vendors...)

Switch has to be inexpensive. (i.e., a 48 port switch serves
$720 worth of hosts, so a switch costing a comparable amount represents
half of the hardware cost)

Yep, but that\'s what they cost. Better $700 than $7000 though.

So, considerably more than I\'ve spent in my approach.

Well, if you already have switches, why\'d you even start this thread?




--
|_|O|_|
|_|_|O| Github: https://github.com/dpurgert
|O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860
 
Don Y <blockedofcourse@foo.invalid> wrote:
On 8/25/2023 5:54 PM, Dan Purgert wrote:
On 2023-08-25, Don Y wrote:
On 8/25/2023 10:24 AM, Dan Purgert wrote:
Likewise, the switch cannot shove data down to any given host faster
than its interface speed (1gbps).

If 5 hosts are all shoving data out to a 6th host at 1gbps
simultaneously, then either

(A) Host_6 needs a 10gbps link
(B) Switch (eventually) starts dropping traffic, ideally while
telling Hosts{1..5} to slow down (but flow-control is an
optional feature on switches).

(C) You\'ve codified the characteristics of the switch and
know that it can sustain an overload of X on Y ports
for Z seconds. Else, why have buffers IN switches?

This point (C) is merely re-wording (B).

No. C avoids the possibility of packets being dropped entirely.
No need to tell anyone anything. No degradation of performance.
Each process\'s planned deadlines are always met.

well, if someone ever develops a switch that works like that,
they\'ll probably corner the market in a heartbeat. Granted it\'ll
probably cost a small country\'s GDP ...

So, you\'re saying switches have no predictable characteristics.
Makes you wonder why they specify any, then...

If your traffic spike doesn\'t overfill a buffer, nobody\'s the
wiser, and nothing gets sent back telling anyone to slow down.

That\'s C.

That\'s B. You just re-worded it to make yourself happy.

No. B allows for the possibility of dropped packets: \"(eventually)
starts dropping traffic\". This condition doesn\'t exist in C.

Only because you\'ve removed those actual words and then applied your
personal reality distortion field to ignore that the condition must, in
a real physical world, in fact exist.

Your C says that \"[the switch] can sustain an overload of X on Y ports
for Z seconds\".

So what has to happen in the inevitable situation where one, or both,
of the following events occur?:

The switch incurs an overload of X+B on Y ports for Z seconds.

The switch incurs an overload of X on Y ports for Z+B seconds.

What has to happen in either situation is the switch has to start
dropping packets.

The fact that you erased the words from B does not mean that reality
won\'t puncture your distortion field and cause your switch to have to
drop packets.
 
On 8/25/2023 8:30 PM, Bertrand Sindri wrote:
Don Y <blockedofcourse@foo.invalid> wrote:
On 8/25/2023 5:54 PM, Dan Purgert wrote:
On 2023-08-25, Don Y wrote:
On 8/25/2023 10:24 AM, Dan Purgert wrote:
Likewise, the switch cannot shove data down to any given host faster
than its interface speed (1gbps).

If 5 hosts are all shoving data out to a 6th host at 1gbps
simultaneously, then either

(A) Host_6 needs a 10gbps link
(B) Switch (eventually) starts dropping traffic, ideally while
telling Hosts{1..5} to slow down (but flow-control is an
optional feature on switches).

(C) You\'ve codified the characteristics of the switch and
know that it can sustain an overload of X on Y ports
for Z seconds. Else, why have buffers IN switches?

This point (C) is merely re-wording (B).

No. C avoids the possibility of packets being dropped entirely.
No need to tell anyone anything. No degradation of performance.
Each process\'s planned deadlines are always met.

well, if someone ever develops a switch that works like that,
they\'ll probably corner the market in a heartbeat. Granted it\'ll
probably cost a small country\'s GDP ...

So, you\'re saying switches have no predictable characteristics.
Makes you wonder why they specify any, then...

If your traffic spike doesn\'t overfill a buffer, nobody\'s the
wiser, and nothing gets sent back telling anyone to slow down.

That\'s C.

That\'s B. You just re-worded it to make yourself happy.

No. B allows for the possibility of dropped packets: \"(eventually)
starts dropping traffic\". This condition doesn\'t exist in C.

Only because you\'ve removed those actual words and then applied your
personal reality distortion field to ignore that the condition must, in
a real physical world, in fact exist.

No, the condition doesn\'t exist. I control (and have designed) EVERYTHING
connected to the switch. *I* define reality.

Your C says that \"[the switch] can sustain an overload of X on Y ports
for Z seconds\".

So what has to happen in the inevitable situation where one, or both,
of the following events occur?:

The switch incurs an overload of X+B on Y ports for Z seconds.

The switch incurs an overload of X on Y ports for Z+B seconds.

What has to happen in either situation is the switch has to start
dropping packets.

Can\'t happen. It\'s like saying \"what happens when a diode sees
a PRV exceeding its design limits?\" Can\'t happen, because the
circuit wasn\'t designed to allow for that possibility.

Or, do you exist in a reality where such things CAN happen?
(then how does anyone ever design ANYTHING that works?)

The fact that you erased the words from B does not mean that reality
won\'t puncture your distortion field and cause your switch to have to
drop packets.

Just like designing a diode with a 50V PRV in a circuit that
never has a potential of more than 20V would have to worry about
the diode failing?

This is engineering, not playing the lottery!
 
On 8/25/2023 8:08 PM, Dan Purgert wrote:
No. C avoids the possibility of packets being dropped entirely.
No need to tell anyone anything. No degradation of performance.
Each process\'s planned deadlines are always met.

well, if someone ever develops a switch that works like that, they\'ll
probably corner the market in a heartbeat. Granted it\'ll probably cost
a small country\'s GDP ...

So, you\'re saying switches have no predictable characteristics.
Makes you wonder why they specify any, then...

No, I\'m saying that they don\'t work the way you think they do.

Yes, they are totally random devices that can\'t be characterized
beyond price and weight.

Really? Then how can *you* pick The Right Switch?

If your traffic spike doesn\'t overfill a buffer, nobody\'s the wiser, and
nothing gets sent back telling anyone to slow down.

That\'s C.

That\'s B. You just re-worded it to make yourself happy.

No. B allows for the possibility of dropped packets:
\"(eventually) starts dropping traffic\". This condition
doesn\'t exist in C.

And the removal of that condition doesn\'t exist here in the real world.

Sure it does. You are just stuck thinking of the real world in a way
that you\'re accustomed to thinking. That doesn\'t make it a universal truth.

You don\'t see bit rates of UARTs like 7302.6 baud -- but that doesn\'t
prevent them from being used. You\'re assuming \"reality\" can only
support those devices clocked at those rates (as if the silicon would
magically stop working at any other rate).

You only have to comply with standards when interfacing to other
devices that comply with them. There is nothing to stop you from
using your own standard if you control all of the devices with
which you will interact.

24 bit floating point? No problem. 9 bit UARTs? No problem.
As long as your use is \"contained\", you can do whatever you
want (as long as you are self-consistent).

Yes. All the machines keep local caches, because then they don\'t need
to wait on a (potentially bogged down) upstream device.

That\'s a RESOLVER not a DNS *server*.

dnsmasq is, in fact, a dns server.

libbind is, in fact, a library of routines to interface to a DNS service.

[I am only using DNS as an example as I strongly suspect you\'ve
never written code to run in an environment without a filesystem
to act as the sole *shared* namespace.]

I\'m a networking guy. You tell me what bandwidth you need, I design the
infra that gets it to you ..

I\'m a *systems* guy. You tell me what problem you want to solve
and I design the system that gets it to you ..

E.g., an M byte queue per port is different than an N*M
byte buffer memory that can be allocated, as needed.

Except that there is no queue, until you\'re trying to shove more data
across a port than the link can sustain (in which case, get faster
links, or slow down the transmit...)

Until you are trying to shove more data across a port
than the link can sustain IN A GIVEN TIMEFRAME. Otherwise,
why have *any* buffer?

To absorb an insanely small spike in data egress. You do understand how
little 1 megabyte is when data is flying around at >1gbps, right?

Do you understand how LARGE a time interval that is in a real-time
system? 8 million bits is 8ms. I use Gbe because the time required

Aren\'t you the one who wants \"zero delay\" in his network? Can\'t be
touching these buffers if that\'s what you want.

No, you want bounded performance. In real-time systems you
want to be able to predict the temporal behavior of your
system (which is the set of all components that must cooperate to
solve the problem at hand).

This is why the OS supports task and object migration, deadline
scheduling, deadline handlers, etc. -- so the developer doesn\'t
have to manually make all of those assurances.
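
Roughly what \"the developer states a deadline and a handler fires on
overrun\" means, with invented names and a purely after-the-fact check (a
real kernel would detect and react to the overrun asynchronously):

    #include <stdio.h>
    #include <time.h>

    typedef void (*deadline_handler)(const char *what, double overrun_ms);

    static void log_overrun(const char *what, double overrun_ms)
    {
        fprintf(stderr, "%s missed its deadline by %.2f ms\n", what, overrun_ms);
    }

    /* Run 'work', then report to 'handler' if it took longer than 'deadline_ms'.
       This after-the-fact check just shows the shape of the contract. */
    static void run_with_deadline(const char *what, void (*work)(void),
                                  double deadline_ms, deadline_handler handler)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        work();
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double elapsed_ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                            (t1.tv_nsec - t0.tv_nsec) / 1e6;
        if (elapsed_ms > deadline_ms)
            handler(what, elapsed_ms - deadline_ms);
    }

    static void some_remote_operation(void)
    {
        struct timespec nap = { 0, 5 * 1000 * 1000 };  /* pretend the RPC took 5 ms */
        nanosleep(&nap, NULL);
    }

    int main(void)
    {
        run_with_deadline("remote call", some_remote_operation, 1.0, log_overrun);
        return 0;
    }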

You\'re thinking like a datacenter where the client is a human being
who won\'t notice delays and/or latency.

No. I\'m trying to get you to forget about the buffers. Ideally you
*never* touch them, because as soon as you do, you\'re introducing delay.

In most switch deployments, the delay is tolerable and preferable to the
time required to retransmit for a lost packet. Otherwise, why put them
into the switch?

10/100G is NOT \"standard\" in $15 embedded devices. Witness the dearth
of MCUs and SoCs with 10G NICs.

Nope, but it\'s industry standard in switches (which is what we\'re
talking about).

Again, only in datacenters. How many SOHO routers do you see with
10G ports? How many SOHO routers do you think there are deployed vs.
data center switches?

How in tune are you with the consumer market? How many of your
friends, relatives, neighbors have any 10G (ignoring 100G) kit
on hand?

How many office PCs?

But now that you\'ve finally said you\'re limited to 1g, a lot of the
theorycraft can be thrown out the window.

In order to have N hosts simultaneously communicate with an M\'th host,
they MUST NOT exceed a combined data rate of >1gbps.

But that\'s an instantaneous assessment that (over short periods)
can be exceeded (by relying on buffering in the switch) -- esp if
one or more ports are idle.

Watching what every host is doing in the system lets you reshape
traffic to keep within this bound. That\'s why I can move
\"programs\" from one host to another -- to control their
resource usage (memory, MIPS, power, network bandwidth, latency).

This is atypical for most networked systems. Largely because the
apps that run there tend to be big and not finely divisible.
I can run a \"task\" on an entirely different host from the
I/Os that it uses. E.g., the telephone interface is implemented
on one host yet the audio recording, speaker diarisation, speaker
identification and identification/recognizer model retraining can
happen on any number of other hosts if they have more \"spare\"
resources. Ditto for scene analysis for the cameras, motion
detection, etc. Just because you THINK of them as camera
related functions doesn\'t mean that the camera has to perform
them!
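
A cartoon of that placement decision (the numbers and names are invented):
estimate how much traffic a task would pull across the fabric from each
node's resident data, then run it where the heaviest flow stays local.

    #include <stdio.h>

    #define NNODES 4

    /* Estimated Mb/s a given task would exchange with data resident on each node. */
    static const double traffic_mbps[NNODES] = { 40.0, 900.0, 15.0, 5.0 };

    /* Place the task on the node holding the data it talks to most,
       so that traffic never touches the switch at all. */
    static int place_task(void)
    {
        int best = 0;
        for (int n = 1; n < NNODES; n++)
            if (traffic_mbps[n] > traffic_mbps[best])
                best = n;
        return best;
    }

    int main(void)
    {
        printf("run the task on node %d\n", place_task()); /* node 1: 900 Mb/s stays local */
        return 0;
    }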

Let\'s take the switch out of the equation, and just have 2 hosts
interconnected via their gbit interfaces for this \"application\". How
fast can they share data?

At 1Gbps.

Right. And with a switch hosting 48 of these 1gbps devices ... they can
all communicate simultaneously at line rate. The switch won\'t even
exist as far as they\'re concerned.

UNTIL such time that you try to send >1gbps worth of traffic out any one
port (because it\'s physically impossible).

Or to another switch to get to one of the other ~190 hosts.

The links are the bottleneck, just like memory is the bottleneck
in a SoC.

Yes, which has nothing to do with \"the switch\".

The links are the switch. I can run every link at 1Gbps and be guaranteed
to drop packets if I blindly deliver them to any-old host. The LINKS
clearly wouldn\'t be the problem but, rather, the switch.

What do you do when the jobs change?

Does this change in jobs come with a commensurate change in link speed
requirements? No? Well, nothing to change then, since the existing
1gbps links are sufficient.

Of course it does! Because you\'ve told me I can\'t MOVE the jobs to reshape
the traffic. *YOU* have to alter the network configuration to accommodate
the different traffic. I alter the workload to make it fit in the
existing resource constraints.

So you\'re changing your link speed requirements, you need faster links
on your hosts. There\'s no way around it. Can\'t fit 10 pounds of
potatoes in a 5 pound bag and all that.

No, you really don\'t seem to understand. I am changing the mix of
processes hosted on a particular node to ensure their network
requirements fit within the capabilities of the connected fabric.
The fabric can\'t be changed -- there\'s no \"staff\" available to
upgrade the switch. So, you have to change the usage of those
resources to fit within their constraints.

I don\'t locate the DBMS on one node and the disk subsystem on
another because the interconnect fabric limits the RDBMS\'s performance
artificially.

If the answer to either question is \"yes\", then what\'s done depends on
what options I have to hand. Maybe I use Link Aggregation, maybe I move
a host to 10g.

And, \"you\" are shipped with every system sold? \"Dear homeowner, here is
your complimentary technician to redesign the network in your home.\"
(Gee, I didn\'t realize adding a few security cameras was going to be
such a hassle!)

I mean, I was under the impression we were talking about \"me, as a
network professional\" talking to \"you, a design professional\" hammering
out concrete requirements for some new network.

The production design already works. The reason for my *post*
was to see if I could accommodate a *model*/emulation of that
switch within a stock set of switches that are used to connect
a set of \"desktop\" devices.

Your whole argument seems to center around this idea that your hosts
with gbit links can somehow magically process data faster than 1gbit /
second.

They *can* process data faster than gigabit rates. But, are intentionally
reshuffled to not require the use of the interconnect fabric to do so.

I have a process that just scrubs \"dirty\" memory pages. I can locate the
scrubbing process on any node that I want and access the pages to be
scrubbed on any other node(s). I can write 4KB of data into a page
much faster than the network could transport those data to a REMOTE
page (GHz processors). So, locating the process on a remote node
is a serious waste of resources -- it moves a shitload of data
across the network that it shouldn\'t have to (stealing bandwidth
from other processes that would like to use that bandwidth)
AND limits the rate at which pages can be scrubbed and returned to
the free pool.

So, while there is only a need for one scrubbing process in the entire
system, *having* just one is a poor use of resources. Instead,
have one on each node so LOCAL pages are scrubbed at the processor\'s
speed without bothering the network.
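
A cartoon of that scrubber trade-off (page counts and names invented; real
free-list handling is more involved): zeroing a local 4KB page is a
memory-bandwidth operation, while doing it from another node means wire time
first.

    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096
    #define NDIRTY    1024                        /* pretend 4 MB of dirty pages */

    static unsigned char dirty_pages[NDIRTY][PAGE_SIZE];
    static int free_list[NDIRTY];
    static int nfree;

    /* Local scrub: wipe each dirty page at memory speed and return it
       to the free pool -- the network never sees any of it. */
    static void scrub_local(void)
    {
        for (int i = 0; i < NDIRTY; i++) {
            memset(dirty_pages[i], 0, PAGE_SIZE);
            free_list[nfree++] = i;
        }
    }

    int main(void)
    {
        scrub_local();
        printf("scrubbed %d pages locally\n", nfree);

        /* For contrast: writing those 4 MB of zeros into *remote* pages over a
           1 Gb/s link is ~34 ms of wire time before the pages are usable. */
        printf("wire time if scrubbed remotely: ~%.0f ms\n",
               (double)NDIRTY * PAGE_SIZE * 8 / 1e9 * 1e3);
        return 0;
    }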

OTOH, recording audio (phone calls, in-person interactions) and
trying to analyze it on the same node overloads the real-time
abilities of a processor. So, ship the audio over to some
other node(s) and let them do some of the analysis *for* you
with surplus resources available, there. It imposes a hit on
the network (but a modest one as audio is a low bandwidth
signal) but is then able to leverage a lot of free resources
elsewhere (e.g., let the node controlling the feeder for the
tablet press do this work for you as it is typically lightly
loaded -- and can be verified as being so, NOW!)

But before I do anything, there needs to be some form of a concrete \"hey
we need to upgrade HostA from 1gbps to Xgbps \".

So, the homeowner has to have his prior collection of kit reassessed
in order to ADD anything to it? I\'m sure that\'s a great selling point
(NOT!).

I mean, you gotta know if you have a free port available.

He sees that by looking at the switch. He introduces a new
device to the system (using a dedicated, physically secured
port for key initialization) and then is told where to
connect the device to the switch (flash indicator).

In most cases, this requires running a new cable. If the
switch is located in a utility space (like a basement), this
is a piece of cake. If in an existing structure, not so much.

Eventually, if the system determines (from an analysis of traffic)
that the device should be connected to a different port,
it tells the user to swap specific cables to achieve the
desired cabling pattern (because it\'s not one monolithic 240
port -- or 3000 port -- switch).

But, that\'s just because the system doesn\'t have \"arms\".

Or the school district. Or the business owner. Or...

No one wants to upgrade existing kit that has been working
(and is still working) just because they want to add to the
mix. That\'s the flaw of the \"big central processor\"
approach -- it doesn\'t SCALE.

The funny thing is, networks scale extremely well without agonizing over
their \"real time\" performance. Monitor things, and if we notice links
getting bogged down, then we consider increasing capacity.

You are paid to babysit installations. Homeowners don\'t want
to pay for babysitters for their kit -- it makes the kit too
costly (how many folks do you know who just upgrade kit rather
than figure out how to *fix* it)

If a person interacts with the system, it must respond within
~250ms in order for the person not to get annoyed. That
means the input has to be processed (voice, gesture, keystrokes,
mechanisms, etc.), recognized, reacted to and \"signaled\"
in that time.

Do you hit \"refresh page\" on your browser if you don\'t see the
desired page appear in 250ms? 500? 2000?

I love listening to Ring doorbells -- you hear the LOCAL response
to their button presses and a fraction of a second later, the
same signal as relayed from their network connection to the
occupants.

What if the sound delivered to the speakers is out of sync
with the video on the display? Or, the display on one \"TV\"
out of sync with the same display on another?

But it\'s also pretty trivial to generalize traffic flow (for example,
your bog standard \"writing documents\" type office work can still be handled
over fast ethernet without too much pain. We don\'t *do* that, but... )

These aren\'t real-time in any sense other than \"people don\'t like to
wait\". The correctness of the application doesn\'t rely on how
quickly the interaction takes place.

Yes. But, I can keep all 240 NICs running at link speeds (Gbe)
all day long.

So can a handful of 48-port switches... as I\'ve said countless times.

No, they can\'t. Unless you restrict the processes that are tied
to switch #1 so that they don\'t need to access \"too many\" of the
processes on switch #2. E.g., 48 ports operating at line speed
on switch #1 wanting to talk to 48 ports on switch #2.

So you either

(a) use 100gbit uplinks between them OR
(b) use backplane interconnects

Either approach works fine; though with 240 downstream hosts, I\'d
probably go for a backplane interconnect.

And the associated cost?

Again, datacenter...

Again, \"reality\". But hey, if you can build a switch that\'ll move
500gbps around for $25 ... more power to you.

You don\'t move 500gbps. *You* are stuck with whatever constraints the
folks specifying the datacenter hosts impose on you. You likely
can\'t tell them they can\'t do what they want -- it\'s your
job to make it happen.

I\'m not under those constraints. I decide what I need to solve a problem
and then how I can get there most economically and reliably.

If a competitor comes up with a better/cheaper implementation, then
I risk losing market share. Too bad, so sad. If he thinks
conventionally, then he\'s at a disadvantage (because *anyone*
can think conventionally)

How often does another network tech walk into your data center
and say they could have configured it differently (depriving
you of your job)?

Cue your new complaint that it\'s expensive... well no duh, but you\'re
the one who has decided that 240 hosts need to communicate at 1gbps at
all times regardless of which switch the traffic has to cross.

Yet again data center.

Stuff costs money... sometimes a lot of it.

Yes. But if you can arrange for your requirements to be met with
LESS expensive approaches, you gain a competitive advantage.

One of my first commercial products did a lot of floating point math.
Folks always think of 32b math libraries (now 64 or 80). But,
a 32b library uses 33% more data per value than a 24b library
would. And, takes proportionately longer to execute operations.
When you have a certain amount of real time available to complete
a given set of operations, you don\'t want to waste it if you
aren\'t gaining anything *practical* from the added expense.

So, it\'s worth the effort to redesign the floating point library
to support a narrower data type. This translates to better performance
and reduced product cost. A competitor who felt that floating
point operations \"should\" be 32b (why??) would be at a disadvantage
on both counts.
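
A quick check on the storage side of that trade (the 24-bit layout itself --
the sign/exponent/mantissa split -- is not specified here; the figures below
only assume the widths):

    #include <stdio.h>

    int main(void)
    {
        const double n_values = 1000.0;   /* some table of stored coefficients */

        printf("24-bit storage: %.0f bytes\n", n_values * 24 / 8);        /* 3000 */
        printf("32-bit storage: %.0f bytes\n", n_values * 32 / 8);        /* 4000 */
        printf("32-bit overhead: %.0f%%\n", (32.0 - 24.0) / 24.0 * 100);  /* 33%  */

        /* On a byte-wide CPU a software mantissa multiply scales roughly with
           the square of the mantissa width in bytes, so the speed win from the
           narrower format can exceed the 33% storage figure. */
        return 0;
    }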

How much bandwidth does a node need? PoE? Because honestly, 48x 1g
copper with PoE will run like I dunno, $400? If you don\'t need PoE,
then maybe $200? If you NEED backplane-grade interconnects, you\'re
looking at 2 grand / switch.

But if gbit to a node isn\'t fast enough, then you\'re gonna start paying
a whole lot more ...

Switch needs to be able to deliver a full 15W to each and all of the nodes.

meh, that\'s any 750 watt PoE switch on the market (for 48 ports). Might
run you $500 - $700 these days (they used to be $400, but, well 2020
killed that)

So, we\'re already making the switch half of the cost of the system...

I mean, if that\'s what you need because of your other requirements (PoE,
etc). Note that there is a point where chassis models start to win on
per-port costs, but those options are usually only available from the
\"2k for a basic 48 port switch\" type outfits (and, well, I can get 2x
PoE switches for that from my smaller vendors...)

The switch has always been the most cost-intensive component
(beyond installation labor). But, there is no way around that
unless you resort to wireless (which has numerous vulnerabilities
as well as a shorter obsolescence path).

So, it has to offer lots of value, relatively speaking, to the
system -- beyond just \"moving packets\". E.g., clock synchronization,
power distribution and control (how would you power 240 wireless
devices? Batteries? Wall warts??), management functions, etc.
(and, ideally, not throw off any BTUs as 3KW, even at 90%
efficiency, throws off a shitload of heat that has to go
somewhere -- no cold aisle, here!)

Switch has to be inexpensive. (i.e., a 48 port switch serves
$720 worth of hosts, so a switch costing a comparable amount represents
half of the hardware cost)

Yep, but that\'s what they cost. Better $700 than $7000 though.

So, considerably more than I\'ve spent in my approach.

Well, if you already have switches, why\'d you even start this thread?

Because I don\'t use these switches in my *office* (development
environment). Those devices are standard network peripherals:
PCs, SANs, NASs, printers, scanners, etc. So, no need to
use a special switch with characteristics that aren\'t applicable
to those devices.

If I had oodles of space in the office, I would just add another
switch or bigger switches. But, that\'s not an easy solution (space,
power and thermal), hence the desire to find an alternative that
could exploit my existing switches without fear of introducing
\"unexpected behavior\" -- or, worse, \"intermittent quirks\" -- in
the prototype platform.
 
On 8/25/2023 1:10 PM, Don Y wrote:

When you want to access something in your box, you just
put the correct address on the bus and wait one access
time for the data to appear.

When you want to call a subroutine, you just execute a \"CALL\"
(JSR/BAL) opcode and pass along the target address.

When the \"subroutine\" is located on another host,
there\'s more mechanism required.  If you require the
developer to manually create that scaffolding for each
such invocation, he\'s going to screw up.  So, you fold
it into the OS where you can ensure its correctness
as well as protect the system from a bug (or rogue agent)
accessing something that it \"shouldn\'t\".

The OS handles the case of sorting out where the object
resides (i.e., which server backs it), *now*.  And, if
an object happens to be in the process of migrating,
ensures the message gets redirected to the object\'s
\"new\" home.  So, the developer doesn\'t have to sweat
those details.

This is how Amoeba (another distributed OS) handled RPCs
many years ago:
<https://dl.acm.org/doi/pdf/10.1145/151250.151253>

I went a different direction because:
- my system is closed
- I\'m not internetworking PCs but, rather, appliances
- relying on one-way hashes to obfuscate services is a dodge
(the traffic is still cleartext!)
- they relied a lot on multicast -- which bothers every
host in the system each time it is invoked.

I opted for a distributed kernel so each physical node
acts as if part of a single unified node with knowledge of
all the others.

I prototyped my system atop a traditional network stack using
TCP as a tunnel for each RPC.  Performance sucked.  I lumped
all of the problems into stack+protocol instead of looking into
ways of refining it.
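
For reference, the per-call \"TCP tunnel\" shape that prototype used looks
roughly like this (address, port and message layout invented; the real stub
code is not shown): every invocation pays for connection setup and teardown
around a payload of a few bytes.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* One remote call == one complete TCP connection: the SYN/SYN-ACK/ACK
       handshake, a tiny request and reply, then the FIN teardown.  For small
       arguments the round trips dominate, not the payload. */
    static int rpc_over_tcp(const char *server_ip, uint16_t port,
                            const void *req, size_t req_len,
                            void *reply, size_t reply_len)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(port) };
        inet_pton(AF_INET, server_ip, &addr.sin_addr);

        ssize_t got = -1;
        if (connect(fd, (struct sockaddr *)&addr, sizeof addr) == 0 &&
            send(fd, req, req_len, 0) == (ssize_t)req_len)
            got = recv(fd, reply, reply_len, 0);

        close(fd);                    /* ...and tear the whole connection down */
        return got < 0 ? -1 : 0;
    }

    int main(void)
    {
        double arg = 2.0, result = 0.0;
        /* a single small call to a hypothetical service at 192.0.2.1:9999 */
        if (rpc_over_tcp("192.0.2.1", 9999, &arg, sizeof arg,
                         &result, sizeof result) == 0)
            printf("result = %f\n", result);
        return 0;
    }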

Note the (dated) performance figures of different approaches in
the cited article. If an RPC is a \"big performance hit\", then
you won\'t leverage services that reside on other hosts. Instead,
you will reimplement them (in full or partially) and with different
sets of bugs and performance quirks.

Ever notice how Windows reports file sizes inconsistently in
various contexts? Is this \"1K\" file actually *0* bytes long?
And, is this 48K file really the same as the 47K file of
the same name mentioned elsewhere?

Each instance came up with their own rules for presenting
the size to the user so the user is never sure if \'A\' is
really \'a\' or just something coincidentally of a similar
size.

Or, how an application will hit the maxpathlen limit if
invoked one way but you can work-around it if invoked a
different way? The file system doesn\'t care -- yet the
tools that access it behave inconsistently when accessing it!

[How can this be A Good Thing?]

As I (the OS) have control over everything that gets exchanged
over the network, it was easiest to design a protocol that
lets me move what I want as efficiently as possible, given
the characteristics of the hardware hosting me.

E.g., I originally used \"tiny pages\" to move parameters
onto the network.  But, they aren\'t present on newer
CPU offerings so I now have to use bigger pages.

But, regardless, I can do things like pass a parameter
block to a remote procedure and let the contents of the
block (page!) be altered after the call -- but possibly
before the parameters have been put onto the wire -- which
would normally be a nasty race and likely impossible to
diagnose due to its likely intermittent nature.

Spend CPU resources to make the product more robust.
CPUs/memory are cheap.
 
On 2023-08-26, Don Y wrote:
On 8/25/2023 8:08 PM, Dan Purgert wrote:
[...]
No. I\'m trying to get you to forget about the buffers. Ideally you
*never* touch them, because as soon as you do, you\'re introducing delay.

In most switch deployments, the delay is tolerable and preferable to the
time required to retransmit for a lost packet. Otherwise, why put them
into the switch?

In most switch deployments, you size the links to buffer as little as
possible (ideally, none). And you buffer as little as possible, so the
clients can start throttling back their transmission rate faster.

Can\'t let host_A saturate the uplink, when Hosts_{B...F} also want to
use it. Everyone\'s gotta share (even though that means they\'re all
getting less than line rate would indicate they \"should\" get).

10/100G is NOT \"standard\" in $15 embedded devices. Witness the dearth
of MCUs and SoCs with 10G NICs.

Nope, but it\'s industry standard in switches (which is what we\'re
talking about).

Again, only in datacenters. How many SOHO routers do you see with
10G ports? How many SOHO routers do you think there are deployed vs.
data center switches?

I\'ve installed them. They\'re \"only\" about 1.5 - 2x the price of a
multiport router that\'d have comparable inter-vlan routing capacity.

But then again, neither option is really something joe-consumer would
pick up from bestbuy.

[...]
But now that you\'ve finally said you\'re limited to 1g, a lot of the
theorycraft can be thrown out the window.

In order to have N hosts simultaneously communicate with an M\'th host,
they MUST NOT exceed a combined data rate of >1gbps.

But that\'s an instantaneous assessment that (over short periods)
can be exceeded (by relying on buffering in the switch) -- esp if
one or more ports are idle.

Huh? no. The wire can carry data at 1gbps. That\'s it, you can\'t go any
faster.

Idle ports don\'t come into play at all. Why do you keep talking about
them as if they matter?

Let\'s take the switch out of the equation, and just have 2 hosts
interconnected via their gbit interfaces for this \"application\". How
fast can they share data?

At 1Gbps.

Right. And with a switch hosting 48 of these 1gbps devices ... they can
all communicate simultaneously at line rate. The switch won\'t even
exist as far as they\'re concerned.

UNTIL such time that you try to send >1gbps worth of traffic out any one
port (because it\'s physically impossible).

Or to another switch to get to one of the other ~190 hosts.

You use a faster inter-switch link. We\'ve been over this already.

The links are the bottleneck, just like memory is the bottleneck
in a SoC.

Yes, which has nothing to do with \"the switch\".

The links are the switch. I can run every link at 1Gbps and be guaranteed
to drop packets if I blindly deliver them to any-old host. The LINKS
clearly wouldn\'t be the problem but, rather, the switch.

No, \"the switch\'s capabilities\" are not bound to what\'s plugged into it
in that manner. It depends where the traffic is going.

If your contrived scenario was 1 -> 2 -> 3 -> ... -> 48 -> 1 ; then all
of the traffic would fly around the switch with zero loss (and zero
buffer usage). Because the switch can internally sling data around at
100gbps (or faster).

However, if this contrived scenario is {1..4} -> 5, then the supposed
1MB egress buffer on port 5 will fill in ~2 milliseconds ... and frames
will start getting dropped. Hosts {1-4} will throttle back to ~250 Mbps
each, and host 5 will consume data at 1gbps.
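The arithmetic behind those figures, assuming an ideal 1 MB buffer and exact
line rates:

# Four 1 Gb/s senders feeding one 1 Gb/s egress port.
ingress_bps = 4 * 1e9
egress_bps = 1e9
fill_rate_bytes = (ingress_bps - egress_bps) / 8   # net 3 Gb/s = 375 MB/s
buffer_bytes = 1e6
print(buffer_bytes / fill_rate_bytes * 1e3, "ms")  # ~2.7 ms to fill 1 MB

Once things settle, fair sharing of the single 1 Gb/s egress works out to
roughly 250 Mb/s per sender.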

[...]
So you\'re changing your link speed requirements, you need faster links
on your hosts. There\'s no way around it. Can\'t fit 10 pounds of
potatoes in a 5 pound bag and all that.

No, you really don\'t seem to understand. I am changing the mix of
processes hosted on a particular node to ensure their network
requirements fit within the capabilities of the connected fabric.

If you have a host that needs to send data out (or receive data in) at
over 1gbps ... how is it ever going to do this *unless* you upgrade its
link? Doesn\'t matter how good the switch is if you can\'t even get the
data into it at the rate you need.


This flow:
host_a <--> (1gbps) <--> switch <--> (1gbps) <--> host_b

is just as infinitely sustainable as this flow:
host_a -\\ 1g
host_b --\\ 1g
host_c ---> 1g <--> (5x 1g) <--> sw <--> (5x 1g) <--> host_f
host_d --/ 1g
host_e --/ 1g


But this one is impossible with a single 1gbps link between the host and
the switch:
host_a (req 2gbps traffic) -> (1g link) -> switch

Therefore, you need a faster link for host_a.

If the answer to either question is \"yes\", then what\'s done depends on
what options I have to hand. Maybe I use Link Aggregation, maybe I move
a host to 10g.

And, \"you\" are shipped with every system sold? \"Dear homeowner, here is
your complimentary technician to redesign the network in your home.\"
(Gee, I didn\'t realize adding a few security cameras was going to be
such a hassle!)

I mean, I was under the impression we were talking about \"me, as a
network professional\" talking to \"you, a design professional\" hammering
out concrete requirements for some new network.

The production design already works. The reason for my *post*
was to see if I could accommodate a *model*/emulation of that
switch within a stock set of switches that are used to connect
a set of \"desktop\" devices.

Oh. \"No\" then. Switch ASICs are very much \"Do one thing, and do it
well.\"

Your whole argument seems to center around this idea that your hosts
with gbit links can somehow magically process data faster than 1gbit /
second.

They *can* process data faster than gigabit rates. But, they are
intentionally reshuffled to not require the use of the interconnect
fabric to do so.

Wait, you can move 10k bits in one second over a 9600 baud link? Do
please explain how.

Processing data locally doesn\'t matter at all to your questions about
switching (the switch doesn\'t come into play ... )

[...]
How often does another network tech walk into your data center
and say they could have configured it differently (depriving
you of your job)?

\"Ask 10 programmers how to do something, get 11 different answers\"

--
|_|O|_|
|_|_|O| Github: https://github.com/dpurgert
|O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860
 
On 8/26/2023 6:29 AM, Dan Purgert wrote:
But now that you\'ve finally said you\'re limited to 1g, a lot of the
theorycraft can be thrown out the window.

In order to have N hosts simultaneously communicate with an M\'th host,
they MUST NOT exceed a combined data rate of >1gbps.

But that\'s an instantaneous assessment that (over short periods)
can be exceeded (by relying on buffering in the switch) -- esp if
one or more ports are idle.

Huh? no. The wire can carry data at 1gbps. That\'s it, you can\'t go any
faster.

But the port limits (potentially) the throughput through the switch if
other traffic is queued on the same output port desired by the traffic
coming IN on the wire.

If the incoming traffic is blocked (like in a hub), then the sending
device pauses unnecessarily, lowering its EFFECTIVE throughput.
The buffer in the switch allows the sending device to move on to
sending the next packet (to some other port), effectively hiding the
momentary overload inside the switch.

Idle ports don\'t come into play at all. Why do you keep talking about
them as if they matter?

Let\'s take the switch out of the equation, and just have 2 hosts
interconnected via their gbit interfaces for this \"application\". How
fast can they share data?

At 1Gbps.

Right. And with a switch hosting 48 of these 1gbps devices ... they can
all communicate simultaneously at line rate. The switch won\'t even
exist as far as they\'re concerned.

UNTIL such time that you try to send >1gbps worth of traffic out any one
port (because it\'s physically impossible).

Or to another switch to get to one of the other ~190 hosts.

You use a faster inter-switch link. We\'ve been over this already.

So, interconnect the 48-port switches to each other with 50Gb
links? That will be cheap, right?

The links are the bottleneck, just like memory is the bottleneck
in a SoC.

Yes, which has nothing to do with \"the switch\".

The links are the switch. I can run every link at 1Gbps and be guaranteed
to drop packets if I blindly deliver them to any-old host. The LINKS
clearly wouldn\'t be the problem but, rather, the switch.

No, \"the switch\'s capabilities\" are not bound to what\'s plugged into it
in that manner. It depends where the traffic is going.

The switch gets the traffic where it is intended to go.
If the traffic can\'t be reshaped, then the switch becomes the
bottleneck. Putting 10G links on a 1G switch just means
the buffers in the switch fill quicker. The switch sets the
overall throughput for ALL transactions.

If your contrived scenario was 1 -> 2 -> 3 -> ... -> 48 -> 1 ; then all
of the traffic would fly around the switch with zero loss (and zero
buffer usage). Because the switch can internally sling data around at
100gbps (or faster).

However, if this contrived scenario is {1..4} -> 5, then the supposed
1MB egress buffer on port 5 will fill in ~2 milliseconds ... and frames
will start getting dropped. Hosts {1-4} will throttle back to ~250 Mbps
each, and host 5 will consume data at 1gbps.

And, if every device potentially wants to interact with every other
device, then it\'s only a matter of time before things get dropped.

If, OTOH, you reshape the traffic so that no (outgoing) link is ever
taxed beyond its capacity, then data sources can freely push data
without fear of any of it being lost.

This is just common sense.

Datacenters tend not to have \"everything talking to everything
else\" at high packet rates. Host 1 doesn\'t have 100 name resolvers
running (each for a different top level domain) while hosts 2, 3
and 4 each support 100 different TLDs of their own. It\'s not
a common occurrence in a network\'s design.

But, in a distributed operating system, clients on one host will
almost certainly be wanting to access services on other hosts while
clients on those other hosts will simultaneously want to access
services on the first host.

Think about it in terms of memory. Draw arbitrary boundaries in
physical memory and see how often they are crossed. Redraw
them and reassess. You will discover that interactions have
particular patterns. If you partition in a way that allows
those patterns to be \"localized\", you get more performance.
This is why caches were created and why ignorance of the underlying
hardware can cause \"unfortunately written\" programs to perform
incredibly poorly.

The fabric is the boundaries drawn between the memories of the
individual hosts. If an access pattern crosses a boundary needlessly,
it adds traffic to the network. If an access pattern can be localized
*to* a node, then the network drops out of the equation.
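A toy illustration of that boundary-drawing point, with made-up processes
and traffic figures; the same set of interactions costs wildly different
amounts of fabric bandwidth depending on how you partition:

# Bits/second exchanged between (hypothetical) process pairs.
traffic = {
    ("web", "files"): 800e6,
    ("web", "dns"):    10e6,
    ("dns", "files"):   5e6,
}

def cross_node_traffic(placement):
    # Only pairs split across nodes put load on the network.
    return sum(rate for (a, b), rate in traffic.items()
               if placement[a] != placement[b])

bad  = {"web": "node1", "files": "node2", "dns": "node1"}
good = {"web": "node1", "files": "node1", "dns": "node2"}
print(cross_node_traffic(bad))    # 805 Mb/s crosses the fabric
print(cross_node_traffic(good))   # 15 Mb/s crosses the fabric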

Most programs are built as static objects. They assume they can
be loaded in their entirety (or, faulted in as needed). They
have no concern over their interactions with other programs
(or peripherals) and figure \"something else\" will just make it
work.

In a distributed OS, you focus on the design of those \"programs\"
and their interactions because small changes can make big differences
in performance up to -- and including -- failure to perform.

So you\'re changing your link speed requirements, you need faster links
on your hosts. There\'s no way around it. Can\'t fit 10 pounds of
potatoes in a 5 pound bag and all that.

No, you really don\'t seem to understand. I am changing the mix of
processes hosted on a particular node to ensure their network
requirements fit within the capabilities of the connected fabric.

If you have a host that needs to send data out (or receive data in) at
over 1gbps ... how is it ever going to do this *unless* you upgrade its
link? Doesn\'t matter how good the switch is if you can\'t even get the
data into it at the rate you need.

How come you keep missing this point?

I *move* -- \"physically\" remove the bits of the program that
need to access some \"remote\" resource and TRANSPORT THEM to
the other host. Or, to a host that has a \"less used\"
connection to that host. In doing so, I have effectively
increased the bandwidth of the communications path.

And, removed the switch from the equation. Thus making
more NETWORK bandwidth available to other hosts/clients.

This flow:
host_a <--> (1gbps) <--> switch <--> (1gbps) <--> host_b

is just as infinitely sustainable as this flow:
host_a -\\ 1g
host_b --\\ 1g
host_c ---> 1g <--> (5x 1g) <--> sw <--> (5x 1g) <--> host_f
host_d --/ 1g
host_e --/ 1g

But this one is impossible with a single 1gbps link between the host and
the switch:
host_a (req 2gbps traffic) -> (1g link) -> switch

Therefore, you need a faster link for host_a.

No, you DYNAMICALLY reconfigure the software so it is:

[host_a (req 2gbps traffic) -> target] -> (1g link) -> switch -> other targets

The \"2gbps\" traffic is now happening INSIDE host_a.

Most systems (collections of networked processors) can\'t do this.
Or, can only do it if manually directed to do so (e.g., \"let\'s move the
web server to host_a so it\'s closer to the file server also hosted there\")

I watch everything that is happening in the system and dynamically
rejigger where code executes, when it executes, what it accesses,
etc. to keep the *system* operating within its current resource
constraints.

[It should be noted that this is an NP-complete problem akin
to the knapsack problem. But, if you are running \"forever\"
and have storage available, you can \"notice\" how each attempted
configuration performs and store that information while you look
for a better configuration. If the new configuration yields worse
performance (resource utilization, etc.) then you can try another
*or* return to the previously noted configuration. Again, you don\'t
do this in datacenter environments because it costs manpower to move
applications from one host to another -- along with the time
lost providing those services to their clients.]
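In sketch form, that search amounts to hill-climbing with a memo of
configurations already tried; propose() and measure() below are hypothetical
hooks standing in for moving a process and observing the live system for a
while:

def search_placements(initial, propose, measure, iterations=50):
    # Configurations must be hashable, e.g. a tuple of (process, node) pairs.
    seen = {initial: measure(initial)}             # config -> observed cost
    best = initial
    current = initial
    for _ in range(iterations):
        candidate = propose(current)
        if candidate not in seen:
            seen[candidate] = measure(candidate)   # deploy it and watch for a while
        if seen[candidate] < seen[best]:
            best = candidate                       # keep the improvement
        current = best                             # continue from (or revert to) the best known
    return best

Because the system runs \"forever\", the memo keeps a bad configuration from
being re-tried and lets a regression be rolled back to the best configuration
observed so far.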

E.g., if the I/Os on a node are not being used, now, then I
try to migrate the compute resources that are being used,
there, onto some other node -- so I can power down that node.
This tailors the available resources to fit the current NEED
for resources (i.e., I\'m eliminating excess compute resources
and the power required to use them).

[Otherwise, you\'d just keep everything powered up at all times so you
always have an abundance of resources -- at the expense of power
consumption and possibly some performance penalty due to communication
costs]

Likewise, if power fails, there\'s no need for me to spend time
stripping commercials from the broadcast videos I recorded
earlier today. Kill off those processes. Along with other
\"less important\" processes. Then, consolidate the \"essential\"
processes onto as few nodes as possible with a goal of powering
down nodes that are no longer needed -- and thus maximizing the
\"power available\" (battery) resource.


Your whole argument seems to center around this idea that your hosts
with gbit links can somehow magically process data faster than 1gbit /
second.

They *can* process data faster than gigabit rates. But, they are
intentionally reshuffled to not require the use of the interconnect
fabric to do so.

Wait, you can move 10k bits in one second over a 9600 baud link? Do
please explain how.

Processing data locally doesn\'t matter at all to your questions about
switching (the switch doesn\'t come into play ... )

The switch deliberately is taken out of the equation BECAUSE it represents
a limitation.

Also remember that there are \"needs\" and \"desires\". If firefox ran 5% slower,
would you kill it off because it wasn\'t meeting your NEEDS -- despite the
fact that it may be meeting your desires LESS than ideally?

I mentioned remotely scrubbing dirty memory pages. They are going to get
scrubbed \"eventually\" so who cares how long it takes?

Ah, but there WILL be clients (programs/processes) that will have to stall
because they are waiting for a LOCALLY AVAILABLE clean page in order to
continue processing. In a traditional, non-real-time environment,
the system would just feel slow. Maybe your web page doesn\'t render
quite as fast.

But, in a real-time environment, time keeps flowing even if computation
pauses. Deadlines exist at fixed points in time. So, if the process
can\'t get back to work soon enough to meet its deadline, then the
missed deadline will cause the missed-deadline-handler to be invoked.
In addition to not being desirable, it can also result in the process
being aborted -- because the designer had determined that if the
deadline wasn\'t met, there is no point in continuing to work on that
problem!
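In sketch form (grossly simplified; a real RTOS does this in the scheduler
rather than with Python threads):

import threading
import time

def run_with_deadline(work, deadline_s, on_miss):
    # Start the work; if the deadline passes before it finishes,
    # invoke the missed-deadline handler instead of waiting forever.
    done = threading.Event()

    def _worker():
        work()
        done.set()

    threading.Thread(target=_worker, daemon=True).start()
    if not done.wait(timeout=deadline_s):
        on_miss()            # e.g. abort the task, degrade, or just log it

run_with_deadline(lambda: time.sleep(0.2),          # work that takes too long
                  deadline_s=0.05,
                  on_miss=lambda: print("deadline missed -- aborting"))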

The consequences of such a decision can vary. If you don\'t get the
commercials removed from a broadcast video before the user wants
to watch it, he may be annoyed with the system\'s performance. But,
he will likely still watch the program, grumbling about the commercials.

OTOH, if the process was monitoring the progress of your vehicle as
it enters the garage and the deadline was missed causing the process
to abort before detecting that you were going to crash into the back
of the garage, the user might have a more volatile opinion!

How often does another network tech walk into your data center
and say they could have configured it differently (depriving
you of your job)?

\"Ask 10 programmers how to do something, get 11 different answers\"

As long as at least one is correct, it doesn\'t matter, does it?
 
On 8/25/23 11:41, Dan Purgert wrote:
On 2023-08-24, John Walliker wrote:
On Thursday, 24 August 2023 at 17:40:12 UTC+1, Dan Purgert wrote:
On 2023-08-24, Don Y wrote:
On 8/24/2023 8:30 AM, Dan Purgert wrote:
On 2023-08-24, Don Y wrote:
Is there any way to guarantee the bandwidth available to a
subgroup of ports on a (managed/L3?) switch?

Not really. You can try implementing QoS on the switch to tag priority,
but that doesn\'t necessarily \"guarantee\" bandwidth; noting, of course,
that most (all) decent switches have enough capacity in their switch
fabric to run all ports at line rate simultaneously.

But that assumes the destination port is always available.
E.g., if N-1 ports all try to deliver packets to the Nth
port, then the packets are queued in the switch, using switch
resources.
Yes and no. Switches have *very* small buffers (your typical \"decent
biz\" rack-mount option having somewhere in the realm of 16 MiB ...
maybe).

If a link gets so congested that the buffers fill, the switch will just
start dropping traffic on the floor. This in turn will typically act as
a signal for the sending parties to retry (and slow down).

Bear in mind that \"filling buffers\" only happens in cases like you\'ve
described -- if you\'ve got 10 hosts all talking amongst themselves
(rather than 9 trying to slam the 10th), a decent switch will keep up
with that forever.

Some switches implement flow control so that rather than dropping
packets on the floor they first ask the sender to slow down to prevent
the buffers from filling completely and they only drop packets if the
sender ignores the request and the buffers do fill up. This is usually
a configurable option in managed switches. Some unmanaged switches
use flow control, others don\'t.
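A toy model of the difference (made-up frame counts, not any particular
switch): with flow control the sender is paused when the buffer crosses a
high-water mark; without it, whatever overflows the buffer is dropped.

def run(frames_in_per_tick, frames_out_per_tick, buf_size, ticks, flow_control):
    buf, lost, paused_ticks = 0, 0, 0
    for _ in range(ticks):
        if flow_control and buf > 0.8 * buf_size:
            paused_ticks += 1                   # "pause" asks the sender to hold off
        else:
            buf += frames_in_per_tick
        if buf > buf_size:
            lost += buf - buf_size              # tail-drop when the buffer is full
            buf = buf_size
        buf = max(0, buf - frames_out_per_tick)
    return lost, paused_ticks

print(run(4, 1, 100, 1000, flow_control=False))  # many drops, no pauses
print(run(4, 1, 100, 1000, flow_control=True))   # no drops, sender pauses instead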

Right. I\'m trying to just keep to generalities, since the whole thing
seems some big theorycraft with no real defined goal ...

If you can define what bandwidth you need, and response time, you can
then compare that to the various switch specs. As you say, anything else
is just imagination...


 
