Driver to drive?

On Saturday, 12 May 2018 21:15:20 UTC+1, Tom Gardner wrote:

For an amusing speed-read, have a look at
"C Is Not a Low-level Language. Your computer
is not a fast PDP-11."
https://queue.acm.org/detail.cfm?id=3212479

What!! The seller assured me it was a genuine fast PDP-11. I've been had!

If it's not a fast PDP-11, what's this paper tape reader for?


NT
 
On 05/12/18 20:51, Tom Gardner wrote:
On 12/05/18 22:42, Phil Hobbs wrote:
On 05/12/18 17:08, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 2:57:21 PM UTC-4, Phil Hobbs wrote:
On 05/12/18 14:38, Tom Gardner wrote:
On 12/05/18 18:51, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 12:16:56 PM UTC-4, Tom Gardner wrote:
On 12/05/18 16:53, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 11:23:22 AM UTC-4, Tom Gardner wrote:
On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare: example, dedicate
a few rows of cores as data routing (which can be any
combination you like: duplexed, synchronous, packetized,
variable length, addressed...), and use the remaining
perimeter
as your parallel processor.  You might still get dozens of
cores in parallel that way, even if the utilization is
effectively maybe 20%.

Each core can "know" what the load is for the cores on its
perimeter; if some core needs to send a message to the
perimeter
a best-first/greedy-search algorithm like beam search might be
used to find the path where passing would cause the
least disruption.

https://en.wikipedia.org/wiki/Beam_search
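
(A minimal sketch of that routing idea, assuming a hypothetical N x N core
array with a known per-core load figure; the beam keeps only the few cheapest
partial paths at each step. Illustrative C++ only, not any vendor's API.)

#include <algorithm>
#include <array>
#include <cstdio>
#include <utility>
#include <vector>

constexpr int N = 8;                        // hypothetical 8x8 core array
using Pos = std::pair<int, int>;
struct Path { std::vector<Pos> nodes; double cost; };

static bool on_perimeter(Pos p) {
    return p.first == 0 || p.second == 0 || p.first == N - 1 || p.second == N - 1;
}

// Beam search: expand the cheapest partial paths, keep only 'beam_width'
// of them, stop when one reaches the perimeter.  "Disruption" is modelled
// as the summed load of the cores a message passes through.
static Path beam_route(const std::array<std::array<double, N>, N>& load,
                       Pos start, int beam_width = 4) {
    std::vector<Path> beam;
    beam.push_back({{start}, load[start.first][start.second]});
    for (int step = 0; step < N * N; ++step) {
        std::vector<Path> next;
        for (const Path& p : beam) {
            Pos c = p.nodes.back();
            if (on_perimeter(c)) return p;              // found a way out
            const Pos nbrs[4] = {{c.first + 1, c.second}, {c.first - 1, c.second},
                                 {c.first, c.second + 1}, {c.first, c.second - 1}};
            for (Pos nb : nbrs) {
                if (nb.first < 0 || nb.second < 0 || nb.first >= N || nb.second >= N)
                    continue;
                if (std::find(p.nodes.begin(), p.nodes.end(), nb) != p.nodes.end())
                    continue;                            // no revisits
                Path q = p;
                q.nodes.push_back(nb);
                q.cost += load[nb.first][nb.second];
                next.push_back(std::move(q));
            }
        }
        std::sort(next.begin(), next.end(),
                  [](const Path& a, const Path& b) { return a.cost < b.cost; });
        if ((int)next.size() > beam_width) next.resize(beam_width);
        beam = std::move(next);
        if (beam.empty()) break;
    }
    return {};                                           // no route found
}

int main() {
    std::array<std::array<double, N>, N> load{};         // all cores idle
    load[3][3] = 5.0;                                    // one busy core to avoid
    Path p = beam_route(load, {4, 4});
    std::printf("route length %zu, cost %g\n", p.nodes.size(), p.cost);
}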

I know this is an old thread, but I came across it when looking
for something else and had a few brain cells wakened.

Not directly related to this issue, but some recollections of
articles in electronic journals from the early 1970's.

In one article it was speculated that one day in the far future
the processor clock speed would be over 1 GHz and calculating
sin/cos
etc. would be hyper fast. Apparently the editor assumed that
those
functions were calculated using Taylor series. The article
suggested
using the trigonometric equation

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

and doing sin(a) and cos(a) with table lookup.
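
(A minimal sketch of that scheme: a coarse sine/cosine table supplies sin(a)
and cos(a), and the small residual b is handled with cos(b)~=1, sin(b)~=b.
The table size below is invented purely for illustration.)

#include <cmath>
#include <cstdio>

constexpr double kTwoPi = 6.283185307179586;
constexpr int    kTableSize = 256;              // coarse steps per revolution
constexpr double kStep = kTwoPi / kTableSize;

double sin_tab[kTableSize], cos_tab[kTableSize];

void init_tables() {
    for (int i = 0; i < kTableSize; ++i) {
        sin_tab[i] = std::sin(i * kStep);
        cos_tab[i] = std::cos(i * kStep);
    }
}

// sin(a+b) = sin(a)cos(b) + cos(a)sin(b): 'a' comes from the table, 'b' is
// the small residual, approximated with cos(b) ~= 1 and sin(b) ~= b.
double fast_sin(double x) {
    x = std::fmod(x, kTwoPi);
    if (x < 0) x += kTwoPi;
    int    i = static_cast<int>(x / kStep);     // table index -> a
    double b = x - i * kStep;                   // residual    -> b
    return sin_tab[i] + cos_tab[i] * b;
}

int main() {
    init_tables();
    std::printf("fast_sin(1.0) = %.6f, std::sin(1.0) = %.6f\n",
                fast_sin(1.0), std::sin(1.0));
}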

The CORDIC algorithm was invented in 1956.

Two decades later implementing floating point plus CORDIC on a
6800
was
my first paid job as a vacation student. I still have the code :)

CORDIC was also easily implementable in hardware.
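
(For reference, a minimal rotation-mode CORDIC sketch, in double precision for
clarity; a 6800 or hardware version would use scaled integers, shifts and a
small precomputed table of atan(2^-i) constants.)

#include <cmath>
#include <cstdio>

// Rotation-mode CORDIC: start at (1/K, 0) and rotate through the target
// angle using only shift-and-add style steps; x ends up as cos, y as sin.
void cordic_sincos(double angle, double& s, double& c, int iters = 24) {
    double x = 0.6072529350088813;   // 1/K, the CORDIC gain correction
    double y = 0.0;
    double z = angle;                // works for |angle| up to about pi/2
    double pow2 = 1.0;               // 2^-i
    for (int i = 0; i < iters; ++i) {
        double d  = (z >= 0.0) ? 1.0 : -1.0;
        double xn = x - d * y * pow2;
        double yn = y + d * x * pow2;
        z -= d * std::atan(pow2);    // in real code: a precomputed atan table
        x = xn;
        y = yn;
        pow2 *= 0.5;
    }
    c = x;
    s = y;
}

int main() {
    double s, c;
    cordic_sincos(0.5, s, c);
    std::printf("CORDIC: %.6f %.6f  libm: %.6f %.6f\n",
                s, c, std::sin(0.5), std::cos(0.5));
}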


With sufficiently small steps cos(b)=1 and sin(b) = b/2pi. In reality,
the trigonometrical functions in those days were calculated using 3rd-
or 4th-order polynomials for single precision, so not much advantage.

On the Sinclair Scientific calculator, maybe.


In another article an Intel representative was asked what would happen
when it became possible to integrate a million transistors on a single
chip. He could not give a clear answer.

My guess is that the situation today is similar with 1000 or a
million processor cores available.

As always, the hardware is easy, but programming techniques for
multiprocessor systems are still in their infancy.

The *only* processors that have a decent hardware+software
story are
the XMOS xCORE processors with xC. They have been around in
several
generations for over a decade, and are *very* usable for *hard*
realtime applications.

The xC+xCORE has a good pedigree: the key concepts have been
around
since the 70s (xC/CSP) and 80s (xCORE/Transputer).

Oh, yeah, you are *that* guy.  Yes, we've had this discussion
before.
They are good for a small class of applications which they are
optimized
for. Otherwise the advantages of other processors or even FPGAs
make
them
more suitable.

Obviously.

But I haven't seen *any* application of the GA144 you mentioned. As
for the
1000-processor chips, I have seen zero implementations, zero
programming
techniques and zero applications for them. The XMOS devices are
therefore
(currently) infinitely superior.

Yes, the GA144 was designed as an experiment rather than with any
application
in mind.  However, I'm told they have customers, one application in
particular was for an advanced hearing aid using something much more
advanced
than a graphic equalizer type of filter.  Seems the original
prototype
required the use of a pair of TMS320C6xxx devices which use around
a watt
each if I remember correctly.  This type of signal processing app has
potential for a multiprocessor with little memory, as long as it
doesn't get
too complex or if it needs external memory the bandwidth isn't too
high.


Of those, the programming techniques are the most difficult; get
those
right and implementations and applications may *follow*.

I strongly suspect that anything with 1000s of processors will
have no
significant advantages over an FPGA.

I don't see that.  The supercomputers of today use standard processors, not
FPGAs.  "Tianhe-2 uses over 3 million Intel Xeon E5-2692v2 12C cores",
although that is yesterday's fastest computer surpassed by "The
Sunway
TaihuLight uses a total of 40,960 Chinese-designed SW26010 manycore
64-bit
RISC processors based on the Sunway architecture.[6] Each
processor chip
contains 256 processing cores, and an additional four auxiliary
cores for
system management (also RISC cores, just more fully featured) for a
total of
10,649,600 CPU cores across the entire system."  This monster uses 15
MW...
yes, 15 MW!  A professor in college used to kid about throwing the
power
switch on his CPUMOB machine and watching the lights dim in
College Park.
The Sunway would do it for sure!

A new machine is due out any time now running at 130 PFLOPS.  Called
the AI
Bridging Cloud Infrastructure, "The system will consist of 1,088
Primergy
CX2570 M4 servers, mounted in Fujitsu's Primergy CX400 M4 multi-node
servers,
with each server featuring components such as two Intel Xeon Gold
processor
CPUs, four NVIDIA Tesla V100 GPU computing cards, and Intel SSD DC
P4600
storage."  Expected power consumption will be down to only 3 MW.

I was assuming a comparison of FPGAs with something /vaguely/
similar to the GA144, or that Sony Cell processor, or
that Intel experiment with 80 cores on a chip.

Comparing something that, as I intended, fits on a large
PCB and plugs into the mains with something that fits on a
tennis court and plugs into a substation isn't really very
enlightening.

There's a fundamental problem with parallel computing
without shared memory[1]: choosing the granularity of
parallelism. Too small and the comms costs dominate,
too large and you have to copy too much context and
wait a long time for other computations to complete.

That has been an unsolved problem since the 1960s,
except for applications that are known as "embarrassingly
parallel". Canonical examples of embarrassingly parallel
computations are Monte Carlo simulations, telecom/financial
systems, some place/route algorithms, some simulated
annealing algorithms.
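
(A back-of-envelope way to see that granularity trade-off: model the run time
for N workers as serial part + work/N + per-worker communication cost, and
look for the minimum. The numbers below are invented purely for illustration.)

#include <cstdio>

// t(N) = serial + work/N + comms*N : too few workers and the work term
// dominates, too many and the communication term does.
double run_time(double serial, double work, double comms, int n) {
    return serial + work / n + comms * n;
}

int main() {
    const double serial = 1.0, work = 1000.0, comms = 0.5;   // made-up units
    int    best_n = 1;
    double best_t = run_time(serial, work, comms, 1);
    for (int n = 2; n <= 4096; ++n) {
        double t = run_time(serial, work, comms, n);
        if (t < best_t) { best_t = t; best_n = n; }
    }
    std::printf("best N = %d, time = %.2f (1 core: %.2f)\n",
                best_n, best_t, run_time(serial, work, comms, 1));
}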

Cache coherence traffic grows faster than the number of cores, and
eventually dominates unless you go to a NUMA architecture.

Not if you don't use cache.

Rick C.


It's got nothing to do with you, it's the processor architecture.  If
you want global cache coherence in a highly multicore processor, it
has to be architected to support that.  If you do your own custom CPU
in FPGA, you can do whatever you want, but general purpose means
general purpose.

Caches are a red herring. If you get rid of them
(hello xCORE!) then you still have global memory
coherence to screw you up.

IMNSHO you /don't/ want global memory coherence in
a highly multicore system, because the penalties
for achieving that are unacceptable.

Ditto global clocks.
Ditto global time.

(Read Leslie Lamport for the latter!)

SMPs are a lot more generally useful than less-symmetric systems. The
more you relax the coherence guarantees the more you restrict the range
of problems you can address efficiently.

A dozen years or so ago, I wrote a clusterized 3D FDTD electromagnetic
solver with advanced optimization capability to support my work in
optical antennas, and I still use it. Even though it's almost in the
embarrassingly-parallelizable category, it runs better on a big SMP than
on a cluster of smaller ones on account of cache coherence.
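
(A toy illustration of the coherence traffic under discussion: two threads
each increment their own counter, but when the two counters share a cache
line every increment drags the line back and forth between cores. The 64-byte
line size is an assumption; absolute timings vary by machine.)

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct alignas(64) PaddedCounter { std::atomic<long> v{0}; };  // one line each
struct PackedCounters { std::atomic<long> a{0}, b{0}; };       // share a line

template <typename F>
double time_ms(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const long kIters = 10000000;

    PackedCounters packed;
    double shared_ms = time_ms([&] {
        std::thread t1([&] { for (long i = 0; i < kIters; ++i) packed.a++; });
        std::thread t2([&] { for (long i = 0; i < kIters; ++i) packed.b++; });
        t1.join(); t2.join();
    });

    PaddedCounter pa, pb;
    double padded_ms = time_ms([&] {
        std::thread t1([&] { for (long i = 0; i < kIters; ++i) pa.v++; });
        std::thread t2([&] { for (long i = 0; i < kIters; ++i) pb.v++; });
        t1.join(); t2.join();
    });

    std::printf("same cache line: %.1f ms, separate lines: %.1f ms\n",
                shared_ms, padded_ms);
}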

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com
 
On 05/12/18 20:47, Tom Gardner wrote:
On 12/05/18 22:40, Phil Hobbs wrote:
On 05/12/18 16:15, Tom Gardner wrote:
On 12/05/18 19:57, Phil Hobbs wrote:
On 05/12/18 14:38, Tom Gardner wrote:
On 12/05/18 18:51, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 12:16:56 PM UTC-4, Tom Gardner wrote:
On 12/05/18 16:53, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 11:23:22 AM UTC-4, Tom Gardner wrote:
On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare: example, dedicate
a few rows of cores as data routing (which can be any
combination you like: duplexed, synchronous, packetized,
variable length, addressed...), and use the remaining
perimeter
as your parallel processor.  You might still get dozens of
cores in parallel that way, even if the utilization is
effectively maybe 20%.

Each core can "know" what the load is for the cores on its
perimeter; if some core needs to send a message to the
perimeter
a best-first/greedy-search algorithm like beam search might be
used to find the path where passing would cause the
least disruption.

https://en.wikipedia.org/wiki/Beam_search

I know this is an old thread, but I came across it when looking
for something else and had a few brain cells wakened.

Not directly related to this issue, but some recollections of
articles in electronic journals from the early 1970's.

In one article it was speculated that one day in the far
future the processor clock speed would be over 1 GHz and
calculating sin/cos
etc. would be hyper fast. Apparently the editor assumed that
those
functions were calculated using Taylor series. The article
suggested
using the trigonometric equation

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

and doing sin(a) and cos(a) with table lookup.

The CORDIC algorithm was invented in 1956.

Two decades later implementing floating point plus CORDIC on a
6800 was
my first paid job as a vacation student. I still have the code :)

CORDIC was also easily implementable in hardware.


With sufficiently small steps cos(b)=1 and sin(b) = b/2pi. In reality,
the trigonometrical functions in those days were calculated using 3rd-
or 4th-order polynomials for single precision, so not much advantage.

On the Sinclair Scientific calculator, maybe.


In another article an Intel representative was asked what would happen
when it became possible to integrate a million transistors on a single
chip. He could not give a clear answer.

My guess is that the situation today is similar with 1000 or a
million processor cores available.

As always, the hardware is easy, but programming techniques for
multiprocessor systems are still in their infancy.

The *only* processors that have a decent hardware+software
story are
the XMOS xCORE processors with xC. They have been around in
several
generations for over a decade, and are *very* usable for *hard*
realtime applications.

The xC+xCORE has a good pedigree: the key concepts have been
around
since the 70s (xC/CSP) and 80s (xCORE/Transputer).

Oh, yeah, you are *that* guy.  Yes, we've had this discussion
before.
They are good for a small class of applications which they are
optimized
for. Otherwise the advantages of other processors or even FPGAs
make them
more suitable.

Obviously.

But I haven't seen *any* application of the GA144 you mentioned.
As for the
1000-processor chips, I have seen zero implementations, zero
programming
techniques and zero applications for them. The XMOS devices are
therefore
(currently) infinitely superior.

Yes, the GA144 was designed as an experiment rather than with any
application
in mind.  However, I'm told they have customers, one application in
particular was for an advanced hearing aid using something much
more advanced
than a graphic equalizer type of filter.  Seems the original
prototype
required the use of a pair of TMS320C6xxx devices which use around
a watt
each if I remember correctly.  This type of signal processing app has
potential for a multiprocessor with little memory, as long as it
doesn't get
too complex or if it needs external memory the bandwidth isn't too
high.


Of those, the programming techniques are the most difficult; get
those
right and implementations and applications may *follow*.

I strongly suspect that anything with 1000s of processors will
have no
significant advantages over an FPGA.

I don't see that.  The supercomputers of today use standard processors, not
FPGAs.  "Tianhe-2 uses over 3 million Intel Xeon E5-2692v2 12C cores",
although that is yesterday's fastest computer surpassed by "The
Sunway
TaihuLight uses a total of 40,960 Chinese-designed SW26010
manycore 64-bit
RISC processors based on the Sunway architecture.[6] Each
processor chip
contains 256 processing cores, and an additional four auxiliary
cores for
system management (also RISC cores, just more fully featured) for
a total of
10,649,600 CPU cores across the entire system."  This monster uses
15 MW...
yes, 15 MW!  A professor in college used to kid about throwing the
power
switch on his CPUMOB machine and watching the lights dim in
College Park.
The Sunway would do it for sure!

A new machine is due out any time now running at 130 PFLOPS.
Called the AI
Bridging Cloud Infrastructure, "The system will consist of 1,088
Primergy
CX2570 M4 servers, mounted in Fujitsu's Primergy CX400 M4
multi-node servers,
with each server featuring components such as two Intel Xeon Gold
processor
CPUs, four NVIDIA Tesla V100 GPU computing cards, and Intel SSD DC
P4600
storage."  Expected power consumption will be down to only 3 MW.

I was assuming a comparison of FPGAs with something /vaguely/
similar to the GA144, or that Sony Cell processor, or
that Intel experiment with 80 cores on a chip.

Comparing something that, as I intended, fits on a large
PCB and plugs into the mains with something that fits on a
tennis court and plugs into a substation isn't really very
enlightening.

There's a fundamental problem with parallel computing
without shared memory[1]: choosing the granularity of
parallelism. Too small and the comms costs dominate,
too large and you have to copy too much context and
wait a long time for other computations to complete.

That has been an unsolved problem since the 1960s,
except for applications that are known as "embarrassingly
parallel". Canonical examples of embarrassingly parallel
computations are Monte Carlo simulations, telecom/financial
systems, some place/route algorithms, some simulated
annealing algorithms.

Cache coherence traffic grows faster than the number of cores, and
eventually dominates unless you go to a NUMA architecture.

Oh, indeed, that is inherently non-scalable, with
/any/ memory architecture. Sure people have pushed
the limit back a bit, but not cleanly.

The fundamental problem is that low-level softies would
like to think that they are still programming PDP11s in
K&R C. They aren't, and there are an appalling number
of poorly understood band-aids in modern C that attempt
to preserve that illusion. (And may preserve it if you
get all the incantations for your compiler+version
exactly right)

The sooner the flat-memory-space single-instruction-
at-a-time mirage is superseded, the better. Currently
the best option, according to the HPC mob that
traditionally push all computational boundaries, is
message-passing. But that is poorly supported by C.
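
(For concreteness: in HPC that message-passing style usually means MPI, which
is bolted onto C as a library rather than being part of the language. A
minimal sketch using the standard MPI C calls, nothing vendor-specific
assumed; built and launched with the usual mpicc/mpirun wrappers.)

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        // Rank 0 sends a value to each other rank and collects the replies.
        for (int r = 1; r < size; ++r) {
            int work = 42 + r;
            MPI_Send(&work, 1, MPI_INT, r, 0, MPI_COMM_WORLD);
        }
        for (int r = 1; r < size; ++r) {
            int result = 0;
            MPI_Recv(&result, 1, MPI_INT, r, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 0 got %d back from rank %d\n", result, r);
        }
    } else {
        int work = 0;
        MPI_Recv(&work, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        work *= 2;                               // the "computation"
        MPI_Send(&work, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}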

For an amusing speed-read, have a look at
"C Is Not a Low-level Language. Your computer
is not a fast PDP-11."
https://queue.acm.org/detail.cfm?id=3212479

It has examples of where people in the trenches
don't understand what's going on. For example:
  "A 2015 survey of C programmers, compiler writers,
  and standards committee members raised several
  issues about the comprehensibility of C. For
  example, C permits an implementation to insert
  padding into structures (but not into arrays)
  to ensure that all fields have a useful alignment
  for the target. If you zero a structure and then
  set some of the fields, will the padding bits
  all be zero? According to the results of the
  survey, 36 percent were sure that they would
  be, and 29 percent didn't know. Depending on
  the compiler (and optimization level), it may
  or may not be."
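
(That padding question is easy to demonstrate. In the sketch below, nothing in
the standard guarantees that the pad bytes between the members still hold zero
after the assignments; what gets printed depends on the compiler and
optimisation level.)

#include <cstdio>
#include <cstring>

struct Msg {
    char  c;     // typically followed by padding up to the int's alignment
    int   i;
    short s;     // typically followed by trailing padding
};

int main() {
    Msg m;
    std::memset(&m, 0, sizeof m);   // zero everything, padding included
    m.c = 'x';                      // writing a member may leave adjacent
    m.i = 1234;                     // padding bytes with unspecified values
    m.s = 7;

    // Dump the raw bytes: whether the pad bytes are still zero is exactly
    // the question the survey asked.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&m);
    for (unsigned k = 0; k < sizeof m; ++k)
        std::printf("%02x ", p[k]);
    std::printf("\n");
    return 0;
}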

Fun.  He's a CS academic, so he probably doesn't write programs and so
doesn't realize how much stuff can't reasonably be done in his
favourite toy language, but it's a good read anyhow.

Two choices: either the toy language is C,
or it can't reasonably be done in C :)

Your point being? ;)

In his case, the toy language is Erlang.
There are too many poorly understood nooks
and crannies in C; we need better.

Simple example: if we flip to C++, ahem, even
the language /designers/ didn't understand what
they were creating. They refused to believe the
complexity of the template language made it Turing
complete - until someone forcibly rubbed their
noses in it. He wrote a short legal C++ program
that never finished compiling - because the compiler
was emitting the sequence of prime numbers during
compilation[1].
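
(The incident referred to is Erwin Unruh's 1994 program, which abused template
instantiation diagnostics to emit primes during compilation. A tamer
illustration of the same point, that the template system is a compile-time
computer, looks like this; all of the "work" happens in the compiler.)

#include <cstdio>

template <int N, int D>
struct HasDivisor {
    static const bool value = (N % D == 0) || HasDivisor<N, D - 1>::value;
};
template <int N>
struct HasDivisor<N, 1> { static const bool value = false; };

template <int N>   // valid for N >= 2
struct IsPrime { static const bool value = !HasDivisor<N, N - 1>::value; };

int main() {
    // The compiler has already evaluated these; the binary prints constants.
    const bool a = IsPrime<13>::value;
    const bool b = IsPrime<15>::value;
    const bool c = IsPrime<97>::value;
    std::printf("13 prime? %d   15 prime? %d   97 prime? %d\n", a, b, c);
}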

There's a latin tag, "Abusus non tollit usus", i.e. the abuse doesn't
abolish the use.

Now if even the language designers don't understand
basic principles of their language, what chance have
mere mortals.

That's a Big Clue that the tool is part of the
problem. I prefer tools that are part of the solution.

As a chemist friend of mine says, "If you aren't part of the solution,
you're part of the precipitate."

But I've been using C for 35 years, and I imagine
I'll continue to use it.

Why on earth not? Just pissing off CS academics is reason enough, not
to mention that it gets the job done. ;)

Cheers

Phil Hobbs

(Who has had enough of "pure science" types to last him a lifetime even
in physics and math. CS types pulling the same crap is just silly.)


--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com
 
On 12/05/18 22:42, Phil Hobbs wrote:
On 05/12/18 17:08, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 2:57:21 PM UTC-4, Phil Hobbs wrote:
On 05/12/18 14:38, Tom Gardner wrote:
On 12/05/18 18:51, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 12:16:56 PM UTC-4, Tom Gardner wrote:
On 12/05/18 16:53, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 11:23:22 AM UTC-4, Tom Gardner wrote:
On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare: example, dedicate
a few rows of cores as data routing (which can be any
combination you like: duplexed, synchronous, packetized,
variable length, addressed...), and use the remaining perimeter
as your parallel processor.  You might still get dozens of
cores in parallel that way, even if the utilization is
effectively maybe 20%.

Each core can "know" what the load is for the cores on its
perimeter; if some core needs to send a message to the perimeter
a best-first/greedy-search algorithm like beam search might be
used to find the path where passing would cause the
least disruption.

https://en.wikipedia.org/wiki/Beam_search

I know this is an old thread, but I came across it when looking
for something else and had a few brain cells wakened.

Not directly related to this issue, but some recollections of
articles in electronic journals from the early 1970's.

In one article it was speculated that one day in the far future
the processor clock speed would be over 1 GHz and calculating sin/cos
etc. would be hyper fast. Apparently the editor assumed that those
functions were calculated using Taylor series. The article suggested
using the trigonometric equation

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

and doing sin(a) and cos(a) with table lookup.

The CORDIC algorithm was invented in 1956.

Two decades later implementing floating point plus CORDIC on a 6800
was
my first paid job as a vacation student. I still have the code :)

CORDIC was also easily implementable in hardware.


With sufficiently small steps cos(b)=1 and sin(b) = b/2pi. In reality,
the trigonometrical functions in those days were calculated using 3rd-
or 4th-order polynomials for single precision, so not much advantage.

On the Sinclair Scientific calculator, maybe.


In another article an Intel representative was asked what would happen
when it became possible to integrate a million transistors on a single
chip. He could not give a clear answer.

My guess is that the situation today is similar with 1000 or a
million processor cores available.

As always, the hardware is easy, but programming techniques for
multiprocessor systems are still in their infancy.

The *only* processors that have a decent hardware+software story are
the XMOS xCORE processors with xC. They have been around in several
generations for over a decade, and are *very* usable for *hard*
realtime applications.

The xC+xCORE has a good pedigree: the key concepts have been around
since the 70s (xC/CSP) and 80s (xCORE/Transputer).

Oh, yeah, you are *that* guy.  Yes, we've had this discussion before.
They are good for a small class of applications which they are
optimized
for. Otherwise the advantages of other processors or even FPGAs make
them
more suitable.

Obviously.

But I haven't seen *any* application of the GA144 you mentioned. As
for the
1000-processor chips, I have seen zero implementations, zero
programming
techniques and zero applications for them. The XMOS devices are
therefore
(currently) infinitely superior.

Yes, the GA144 was designed as an experiment rather than with any
application
in mind.  However, I'm told they have customers, one application in
particular was for an advanced hearing aid using something much more
advanced
than a graphic equalizer type of filter.  Seems the original prototype
required the use of a pair of TMS320C6xxx devices which use around a watt
each if I remember correctly.  This type of signal processing app has
potential for a multiprocessor with little memory, as long as it
doesn't get
too complex or if it needs external memory the bandwidth isn't too high.


Of those, the programming techniques are the most difficult; get those
right and implementations and applications may *follow*.

I strongly suspect that anything with 1000s of processors will have no
significant advantages over an FPGA.

I don't see that.  The supercomputers of today use standard processors, not
FPGAs.  "Tianhe-2 uses over 3 million Intel Xeon E5-2692v2 12C cores",
although that is yesterday's fastest computer surpassed by "The Sunway
TaihuLight uses a total of 40,960 Chinese-designed SW26010 manycore
64-bit
RISC processors based on the Sunway architecture.[6] Each processor chip
contains 256 processing cores, and an additional four auxiliary cores for
system management (also RISC cores, just more fully featured) for a
total of
10,649,600 CPU cores across the entire system."  This monster uses 15
MW...
yes, 15 MW!  A professor in college used to kid about throwing the power
switch on his CPUMOB machine and watching the lights dim in College Park.
The Sunway would do it for sure!

A new machine is due out any time now running at 130 PFLOPS.  Called
the AI
Bridging Cloud Infrastructure, "The system will consist of 1,088 Primergy
CX2570 M4 servers, mounted in Fujitsu's Primergy CX400 M4 multi-node
servers,
with each server featuring components such as two Intel Xeon Gold
processor
CPUs, four NVIDIA Tesla V100 GPU computing cards, and Intel SSD DC P4600
storage."  Expected power consumption will be down to only 3 MW.

I was assuming a comparison of FPGAs with something /vaguely/
similar to the GA144, or that Sony Cell processor, or
that Intel experiment with 80 cores on a chip.

Comparing something that, as I intended, fits on a large
PCB and plugs into the mains with something that fits on a
tennis court and plugs into a substation isn't really very
enlightening.

There's a fundamental problem with parallel computing
without shared memory[1]: choosing the granularity of
parallelism. Too small and the comms costs dominate,
too large and you have to copy too much context and
wait a long time for other computations to complete.

That has been an unsolved problem since the 1960s,
except for applications that are known as "embarrassingly
parallel". Canonical examples of embarrassingly parallel
computations are Monte Carlo simulations, telecom/financial
systems, some place/route algorithms, some simulated
annealing algorithms.

Cache coherence traffic grows faster than the number of cores, and
eventually dominates unless you go to a NUMA architecture.

Not if you don't use cache.

Rick C.


It's got nothing to do with you, it's the processor architecture.  If you want
global cache coherence in a highly multicore processor, it has to be architected
to support that.  If you do your own custom CPU in FPGA, you can do whatever you
want, but general purpose means general purpose.

Caches are a red herring. If you get rid of them
(hello xCORE!) then you still have global memory
coherence to screw you up.

IMNSHO you /don't/ want global memory coherence in
a highly multicore system, because the penalties
for achieving that are unacceptable.

Ditto global clocks.
Ditto global time.

(Read Leslie Lamport for the latter!)
 
On 12/05/18 22:40, Phil Hobbs wrote:
On 05/12/18 16:15, Tom Gardner wrote:
On 12/05/18 19:57, Phil Hobbs wrote:
On 05/12/18 14:38, Tom Gardner wrote:
On 12/05/18 18:51, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 12:16:56 PM UTC-4, Tom Gardner wrote:
On 12/05/18 16:53, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 11:23:22 AM UTC-4, Tom Gardner wrote:
On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare: example, dedicate
a few rows of cores as data routing (which can be any
combination you like: duplexed, synchronous, packetized,
variable length, addressed...), and use the remaining perimeter
as your parallel processor.  You might still get dozens of
cores in parallel that way, even if the utilization is
effectively maybe 20%.

Each core can "know" what the load is for the cores on its
perimeter; if some core needs to send a message to the perimeter
a best-first/greedy-search algorithm like beam search might be
used to find the path where passing would cause the
least disruption.

https://en.wikipedia.org/wiki/Beam_search

I know this is an old thread, but I came across it when looking
for something else and had a few brain cells wakened.

Not directly related to this issue, but some recollections of
articles in electronic journals from the early 1970's.

In one article it was speculated that one day in the far future the
processor clock speed would be over 1 GHz and calculating sin/cos
etc. would be hyper fast. Apparently the editor assumed that those
functions were calculated using Taylor series. The article suggested
using the trigonometric equation

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

and doing sin(a) and cos(a) with table lookup.

The CORDIC algorithm was invented in 1956.

Two decades later implementing floating point plus CORDIC on a 6800 was
my first paid job as a vacation student. I still have the code :)

CORDIC was also easily implementable in hardware.


With sufficiently small steps cos(b)=1 and sin(b) = b/2pi. In reality,
the trigonometrical functions in those days were calculated using 3rd-
or 4th-order polynomials for single precision, so not much advantage.

On the Sinclair Scientific calculator, maybe.


In another article an Intel representative was asked what would happen
when it became possible to integrate a million transistors on a single
chip. He could not give a clear answer.

My guess is that the situation today is similar with 1000 or a
million processor cores available.

As always, the hardware is easy, but programming techniques for
multiprocessor systems are still in their infancy.

The *only* processors that have a decent hardware+software story are
the XMOS xCORE processors with xC. They have been around in several
generations for over a decade, and are *very* usable for *hard*
realtime applications.

The xC+xCORE has a good pedigree: the key concepts have been around
since the 70s (xC/CSP) and 80s (xCORE/Transputer).

Oh, yeah, you are *that* guy.  Yes, we've had this discussion before.
They are good for a small class of applications which they are optimized
for. Otherwise the advantages of other processors or even FPGAs make them
more suitable.

Obviously.

But I haven't seen *any* application of the GA144 you mentioned. As for the
1000-processor chips, I have seen zero implementations, zero programming
techniques and zero applications for them. The XMOS devices are therefore
(currently) infinitely superior.

Yes, the GA144 was designed as an experiment rather than with any application
in mind.  However, I'm told they have customers, one application in
particular was for an advanced hearing aid using something much more advanced
than a graphic equalizer type of filter.  Seems the original prototype
required the use of a pair of TMS320C6xxx devices which use around a watt
each if I remember correctly.  This type of signal processing app has
potential for a multiprocessor with little memory, as long as it doesn't get
too complex or if it needs external memory the bandwidth isn't too high.


Of those, the programming techniques are the most difficult; get those
right and implementations and applications may *follow*.

I strongly suspect that anything with 1000s of processors will have no
significant advantages over an FPGA.

I don't see that.  The supercomputers of today use standard processors, not
FPGAs.  "Tianhe-2 uses over 3 million Intel Xeon E5-2692v2 12C cores",
although that is yesterday's fastest computer surpassed by "The Sunway
TaihuLight uses a total of 40,960 Chinese-designed SW26010 manycore 64-bit
RISC processors based on the Sunway architecture.[6] Each processor chip
contains 256 processing cores, and an additional four auxiliary cores for
system management (also RISC cores, just more fully featured) for a total of
10,649,600 CPU cores across the entire system."  This monster uses 15 MW...
yes, 15 MW!  A professor in college used to kid about throwing the power
switch on his CPUMOB machine and watching the lights dim in College Park.
The Sunway would do it for sure!

A new machine is due out any time now running at 130 PFLOPS.  Called the AI
Bridging Cloud Infrastructure, "The system will consist of 1,088 Primergy
CX2570 M4 servers, mounted in Fujitsu's Primergy CX400 M4 multi-node servers,
with each server featuring components such as two Intel Xeon Gold processor
CPUs, four NVIDIA Tesla V100 GPU computing cards, and Intel SSD DC P4600
storage."  Expected power consumption will be down to only 3 MW.

I was assuming a comparison of FPGAs with something /vaguely/
similar to the GA144, or that Sony Cell processor, or
that Intel experiment with 80 cores on a chip.

Comparing something that, as I intended, fits on a large
PCB and plugs into the mains with something that fits on a
tennis court and plugs into a substation isn't really very
enlightening.

There's a fundamental problem with parallel computing
without shared memory[1]: choosing the granularity of
parallelism. Too small and the comms costs dominate,
too large and you have to copy too much context and
wait a long time for other computations to complete.

That has been an unsolved problem since the 1960s,
except for applications that are known as "embarrassingly
parallel". Canonical examples of embarrassingly parallel
computations are Monte Carlo simulations, telecom/financial
systems, some place/route algorithms, some simulated
annealing algorithms.

Cache coherence traffic grows faster than the number of cores, and eventually
dominates unless you go to a NUMA architecture.

Oh, indeed, that is inherently non-scalable, with
/any/ memory architecture. Sure people have pushed
the limit back a bit, but not cleanly.

The fundamental problem is that low-level softies would
like to think that they are still programming PDP11s in
K&R C. They aren't, and there are an appalling number
of poorly understood band-aids in modern C that attempt
to preserve that illusion. (And may preserve it if you
get all the incantations for your compiler+version
exactly right)

The sooner the flat-memory-space single-instruction-
at-a-time mirage is superseded, the better. Currently
the best option, according to the HPC mob that
traditionally push all computational boundaries, is
message-passing. But that is poorly supported by C.

For an amusing speed-read, have a look at
"C Is Not a Low-level Language. Your computer
is not a fast PDP-11."
https://queue.acm.org/detail.cfm?id=3212479

It has examples of where people in the trenches
don't understand what's going on. For example:
  "A 2015 survey of C programmers, compiler writers,
  and standards committee members raised several
  issues about the comprehensibility of C. For
  example, C permits an implementation to insert
  padding into structures (but not into arrays)
  to ensure that all fields have a useful alignment
  for the target. If you zero a structure and then
  set some of the fields, will the padding bits
  all be zero? According to the results of the
  survey, 36 percent were sure that they would
  be, and 29 percent didn't know. Depending on
  the compiler (and optimization level), it may
  or may not be."

Fun.  He's a CS academic, so he probably doesn't write programs and so doesn't
realize how much stuff can't reasonably be done in his favourite toy language,
but it's a good read anyhow.

Two choices: either the toy language is C,
or it can't reasonably be done in C :)

There are too many poorly understood nooks
and crannies in C; we need better.

Simple example: if we flip to C++, ahem, even
the language /designers/ didn't understand what
they were creating. They refused to believe the
complexity of the template language made it Turing
complete - until someone forcibly rubbed their
noses in it. He wrote a short legal C++ program
that never finished compiling - because the compiler
was emitting the sequence of prime numbers during
compilation[1].

Now if even the language designers don't understand
basic principles of their language, what chance have
mere mortals.

That's a Big Clue that the tool is part of the
problem. I prefer tools that are part of the solution.

But I've been using C for 35 years, and I imagine
I'll continue to use it.

[1]
https://en.wikibooks.org/wiki/C%2B%2B_Programming/Templates/Template_Meta-Programming#History_of_TMP
 
On 13/05/18 01:58, Phil Hobbs wrote:
On 05/12/18 20:47, Tom Gardner wrote:
On 12/05/18 22:40, Phil Hobbs wrote:
On 05/12/18 16:15, Tom Gardner wrote:
On 12/05/18 19:57, Phil Hobbs wrote:
On 05/12/18 14:38, Tom Gardner wrote:
On 12/05/18 18:51, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 12:16:56 PM UTC-4, Tom Gardner wrote:
On 12/05/18 16:53, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 11:23:22 AM UTC-4, Tom Gardner wrote:
On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare: example, dedicate
a few rows of cores as data routing (which can be any
combination you like: duplexed, synchronous, packetized,
variable length, addressed...), and use the remaining perimeter
as your parallel processor.  You might still get dozens of
cores in parallel that way, even if the utilization is
effectively maybe 20%.

Each core can "know" what the load is for the cores on its
perimeter; if some core needs to send a message to the perimeter
a best-first/greedy-search algorithm like beam search might be
used to find the path where passing would cause the
least disruption.

https://en.wikipedia.org/wiki/Beam_search

I know this is an old thread, but I came across it when looking
for something else and had a few brain cells wakened.

Not directly related to this issue, but some recollections of
articles in electronic journals from the early 1970's.

In one article it was speculated that one day in the far future the
processor clock speed would be over 1 GHz and calculating sin/cos
etc. would be hyper fast. Apparently the editor assumed that those
functions were calculated using Taylor series. The article suggested
using the trigonometric equation

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

and doing sin(a) and cos(a) with table lookup.

The CORDIC algorithm was invented in 1956.

Two decades later implementing floating point plus CORDIC on a 6800 was
my first paid job as a vacation student. I still have the code :)

CORDIC was also easily implementable in hardware.


With sufficiently small steps cos(b)=1 and sin(b) = b/2pi. In reality,
the trigonometrical functions in those days were calculated using 3rd-
or 4th-order polynomials for single precision, so not much advantage.

On the Sinclair Scientific calculator, maybe.


In another article an Intel representative was asked what would happen
when it became possible to integrate a million transistors on a single
chip. He could not give a clear answer.

My guess is that the situation today is similar with 1000 or a
million processor cores available.

As always, the hardware is easy, but programming techniques for
multiprocessor systems are still in their infancy.

The *only* processors that have a decent hardware+software story are
the XMOS xCORE processors with xC. They have been around in several
generations for over a decade, and are *very* usable for *hard*
realtime applications.

The xC+xCORE has a good pedigree: the key concepts have been around
since the 70s (xC/CSP) and 80s (xCORE/Transputer).

Oh, yeah, you are *that* guy.  Yes, we've had this discussion before.
They are good for a small class of applications which they are optimized
for. Otherwise the advantages of other processors or even FPGAs make them
more suitable.

Obviously.

But I haven't seen *any* application of the GA144 you mentioned. As for the
1000-processor chips, I have seen zero implementations, zero programming
techniques and zero applications for them. The XMOS devices are therefore
(currently) infinitely superior.

Yes, the GA144 was designed as an experiment rather than with any
application
in mind.  However, I'm told they have customers, one application in
particular was for an advanced hearing aid using something much more
advanced
than a graphic equalizer type of filter.  Seems the original prototype
required the use of a pair of TMS320C6xxx devices which use around a watt
each if I remember correctly.  This type of signal processing app has
potential for a multiprocessor with little memory, as long as it doesn't get
too complex or if it needs external memory the bandwidth isn't too high.


Of those, the programming techniques are the most difficult; get those
right and implementations and applications may *follow*.

I strongly suspect that anything with 1000s of processors will have no
significant advantages over an FPGA.

I don't see that.  The supercomputers of today use standard processors, not
FPGAs.  "Tianhe-2 uses over 3 million Intel Xeon E5-2692v2 12C cores",
although that is yesterday's fastest computer surpassed by "The Sunway
TaihuLight uses a total of 40,960 Chinese-designed SW26010 manycore 64-bit
RISC processors based on the Sunway architecture.[6] Each processor chip
contains 256 processing cores, and an additional four auxiliary cores for
system management (also RISC cores, just more fully featured) for a total of
10,649,600 CPU cores across the entire system."  This monster uses 15 MW...
yes, 15 MW!  A professor in college used to kid about throwing the power
switch on his CPUMOB machine and watching the lights dim in College Park.
The Sunway would do it for sure!

A new machine is due out any time now running at 130 PFLOPS. Called the AI
Bridging Cloud Infrastructure, "The system will consist of 1,088 Primergy
CX2570 M4 servers, mounted in Fujitsu's Primergy CX400 M4 multi-node
servers,
with each server featuring components such as two Intel Xeon Gold processor
CPUs, four NVIDIA Tesla V100 GPU computing cards, and Intel SSD DC P4600
storage."  Expected power consumption will be down to only 3 MW.

I was assuming a comparison of FPGAs with something /vaguely/
similar to the GA144, or that Sony Cell processor, or
that Intel experiment with 80 cores on a chip.

Comparing something that, as I intended, fits on a large
PCB and plugs into the mains with something that fits on a
tennis court and plugs into a substation isn't really very
enlightening.

There's a fundamental problem with parallel computing
without shared memory[1]: choosing the granularity of
parallelism. Too small and the comms costs dominate,
too large and you have to copy too much context and
wait a long time for other computations to complete.

That has been an unsolved problem since the 1960s,
except for applications that are known as "embarrassingly
parallel". Canonical examples of embarrassingly parallel
computations are Monte Carlo simulations, telecom/financial
systems, some place/route algorithms, some simulated
annealing algorithms.

Cache coherence traffic grows faster than the number of cores, and
eventually dominates unless you go to a NUMA architecture.

Oh, indeed, that is inherently non-scalable, with
/any/ memory architecture. Sure people have pushed
the limit back a bit, but not cleanly.

The fundamental problem is that low-level softies would
like to think that they are still programming PDP11s in
K&R C. They aren't, and there are an appalling number
of poorly understood band-aids in modern C that attempt
to preserve that illusion. (And may preserve it if you
get all the incantations for your compiler+version
exactly right)

The sooner the flat-memory-space single-instruction-
at-a-time mirage is superseded, the better. Currently
the best option, according to the HPC mob that
traditionally push all computational boundaries, is
message-passing. But that is poorly supported by C.

For an amusing speed-read, have a look at
"C Is Not a Low-level Language. Your computer
is not a fast PDP-11."
https://queue.acm.org/detail.cfm?id=3212479

It has examples of where people in the trenches
don't understand what's going on. For example:
  "A 2015 survey of C programmers, compiler writers,
  and standards committee members raised several
  issues about the comprehensibility of C. For
  example, C permits an implementation to insert
  padding into structures (but not into arrays)
  to ensure that all fields have a useful alignment
  for the target. If you zero a structure and then
  set some of the fields, will the padding bits
  all be zero? According to the results of the
  survey, 36 percent were sure that they would
  be, and 29 percent didn't know. Depending on
  the compiler (and optimization level), it may
  or may not be."

Fun.  He's a CS academic, so he probably doesn't write programs and so
doesn't realize how much stuff can't reasonably be done in his favourite toy
language, but it's a good read anyhow.

Two choices: either the toy language is C,
or it can't reasonably be done in C :)

Your point being?  ;)

In his case, the toy language is Erlang.

Erlang is definitely not a toy language.

It is the implementation language used for
a significant proportion of the world's telco
systems.

It has many very significant advantages in systems
that you can't turn off, e.g. gradually
upgrading /all/ parts of a /running/ system.

Its philosophy and techniques are being incorporated
into modern real-world languages.

Is Erlang suited for low-level code? I'm skeptical
and would need convincing.

Is it suitable for HPC? More likely, if
you can get the right granularity.


There are too many poorly understood nooks
and crannies in C; we need better.

Simple example: if we flip to C++, ahem, even
the language /designers/ didn't understand what
they were creating. They refused to believe the
complexity of the template language made it Turing
complete - until someone forcibly rubbed their
noses in it. He wrote a short legal C++ program
that never finished compiling - because the compiler
was emitting the sequence of prime numbers during
compilation[1].

There's a latin tag, "Abusus non tollit usus", i.e. the abuse doesn't abolish
the use.

Now if even the language designers don't understand
basic principles of their language, what chance have
mere mortals.

That's a Big Clue that the tool is part of the
problem. I prefer tools that are part of the solution.

As a chemist friend of mine says, "If you aren't part of the solution, you're
part of the precipitate."

Ah, the old ones are the good ones :)


But I've been using C for 35 years, and I imagine
I'll continue to use it.

Why on earth not?  Just pissing off CS academics is reason enough, not to
mention that it gets the job done. ;)

Because too often it only /looks/ like it gets the
job done, sufficiently to pass the unit tests. Put
it in a real-world environment and you start to find
all sorts of /subtle unrepeatable/ errors that /very/
few people can even /begin/ to diagnose. And most
of those people are retiring and being replaced by
people that don't know that L3 caches even exist :(
(Been there, seen that :( )

If you want to see the accumulated experience of
someone that has been at the sharp end of HPC
and C/C++ (and many other languages) since the 60s,
have a look at
http://people.ds.cam.ac.uk/nmm1/

Only a quarter of a century late, C/C++ is apparently
incorporating the concept of a memory model. It remains
to be seen whether it can be used correctly by compiler
writers and typical practitioners.
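
(The memory-model work referred to landed as the C11/C++11 atomics. A minimal
sketch of the release/acquire idiom they finally made portable, instead of
relying on compiler- and CPU-specific incantations.)

#include <atomic>
#include <cstdio>
#include <thread>

int               payload = 0;                   // ordinary data
std::atomic<bool> ready{false};                  // flag with defined ordering

void producer() {
    payload = 42;                                // write the data first...
    ready.store(true, std::memory_order_release);    // ...then publish it
}

void consumer() {
    while (!ready.load(std::memory_order_acquire))   // wait for the flag
        ;                                        // (spin; fine for a demo)
    // The acquire/release pair guarantees the payload write is visible here.
    std::printf("payload = %d\n", payload);
}

int main() {
    std::thread c(consumer), p(producer);
    p.join();
    c.join();
}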


Phil Hobbs

(Who has had enough of "pure science" types to last him a lifetime even in
physics and math.  CS types pulling the same crap is just silly.)

Oh indeed. Theory without practice is mental
masturbation, but equally practice without
theory is blind fumbling.
 
On 13/05/18 02:04, Phil Hobbs wrote:
On 05/12/18 20:51, Tom Gardner wrote:
On 12/05/18 22:42, Phil Hobbs wrote:
On 05/12/18 17:08, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 2:57:21 PM UTC-4, Phil Hobbs wrote:
On 05/12/18 14:38, Tom Gardner wrote:
On 12/05/18 18:51, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 12:16:56 PM UTC-4, Tom Gardner
wrote:
On 12/05/18 16:53, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 11:23:22 AM UTC-4, Tom Gardner
wrote:
On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex
wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare:
example, dedicate a few rows of cores as data
routing (which can be any combination you like:
duplexed, synchronous, packetized, variable length,
addressed...), and use the remaining perimeter as
your parallel processor. You might still get
dozens of cores in parallel that way, even if the
utilization is effectively maybe 20%.

Each core can "know" what the load is for the cores
on its perimeter; if some core needs to send a
message to the perimeter a best-first/greedy-search
algorithm like beam search might be used
to find the path where passing would cause the least
disruption.

https://en.wikipedia.org/wiki/Beam_search

I know this is an old thread, but I came across it when
looking for something else and had a few brain cells
wakened.

Not directly related to this issue, but some
recollections of articles in electronic journals from the
early 1970's.

In one article it was speculated that one day in the far
future the processor clock speed would be over 1 GHz and
calculating sin/cos etc. would be hyper fast. Apparently
the editor assumed that those functions were calculated
using Taylor series. The article suggested using the
trigonometric equation

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

and doing sin(a) and cos(a) with table lookup.

The CORDIC algorithm was invented in 1956.

Two decades later implementing floating point plus CORDIC
on a 6800 was my first paid job as a vacation student. I
still have the code :)

CORDIC was also easily implementable in hardware.


With sufficiently small steps cos(b)=1 and sin(b) = b/2pi.
In reality, the trigonometrical functions in those days
were calculated using 3rd- or 4th-order polynomials for
single precision, so not much advantage.

On the Sinclair Scientific calculator, maybe.


In another article an Intel representative was asked
what would happen when it became possible to integrate
a million transistors on a single chip. He could not
give a clear answer.

My guess is that the situation today is similar with 1000
or a million processor cores available.

As always, the hardware is easy, but programming techniques
for multiprocessor systems are still in their infancy.

The *only* processors that have a decent hardware+software
story are the XMOS xCORE processors with xC. They have been
around in several generations for over a decade, and are
*very* usable for *hard* realtime applications.

The xC+xCORE has a good pedigree: the key concepts have
been around since the 70s (xC/CSP) and 80s
(xCORE/Transputer).

Oh, yeah, you are *that* guy. Yes, we've had this discussion
before. They are good for a small class of applications which
they are optimized for. Otherwise the advantages of other
processors or even FPGAs make them more suitable.

Obviously.

But I haven't seen *any* application of the GA144 you
mentioned. As for the 1000-processor chips, I have seen zero
implementations, zero programming techniques and zero
applications for them. The XMOS devices are therefore
(currently) infinitely superior.

Yes, the GA144 was designed as an experiment rather than with
any application in mind. However, I'm told they have customers,
one application in particular was for an advanced hearing aid
using something much more advanced than a graphic equalizer type
of filter. Seems the original prototype required the use of a
pair of TMS320C6xxx devices which use around a watt each if I
remember correctly. This type of signal processing app has
potential for a multiprocessor with little memory, as long as it
doesn't get too complex or if it needs external memory the
bandwidth isn't too high.


Of those, the programming techniques are the most difficult;
get those right and implementations and applications may
*follow*.

I strongly suspect that anything with 1000s of processors will
have no significant advantages over an FPGA.

I don't see that. The supercomputers of today use standard
processors, not FPGAs. "Tianhe-2 uses over 3 million Intel Xeon
E5-2692v2 12C cores", although that is yesterday's fastest
computer surpassed by "The Sunway TaihuLight uses a total of
40,960 Chinese-designed SW26010 manycore 64-bit RISC processors
based on the Sunway architecture.[6] Each processor chip contains
256 processing cores, and an additional four auxiliary cores for
system management (also RISC cores, just more fully featured) for
a total of 10,649,600 CPU cores across the entire system." This
monster uses 15 MW... yes, 15 MW! A professor in college used to
kid about throwing the power switch on his CPUMOB machine and
watching the lights dim in College Park. The Sunway would do it
for sure!

A new machine is due out any time now running at 130 PFLOPS.
Called the AI Bridging Cloud Infrastructure, "The system will
consist of 1,088 Primergy CX2570 M4 servers, mounted in Fujitsu's
Primergy CX400 M4 multi-node servers, with each server featuring
components such as two Intel Xeon Gold processor CPUs, four
NVIDIA Tesla V100 GPU computing cards, and Intel SSD DC P4600
storage." Expected power consumption will be down to only 3 MW.

I was assuming a comparison of FPGAs with something /vaguely/ similar
to the GA144, or that Sony Cell processor, or that Intel experiment
with 80 cores on a chip.

Comparing something that, as I intended, fits on a large PCB and
plugs into the mains with something that fits on a tennis court and
plugs into a substation isn't really very enlightening.

There's a fundamental problem with parallel computing without
shared memory[1]: choosing the granularity of parallelism. Too
small and the comms costs dominate, too large and you have to copy
too much context and wait a long time for other computations to
complete.

That has been an unsolved problem since the 1960s, except for
applications that are known as "embarrassingly parallel". Canonical
examples of embarrassingly parallel computations are Monte Carlo
simulations, telecom/financial systems, some place/route
algorithms, some simulated annealing algorithms.

Cache coherence traffic grows faster than the number of cores, and
eventually dominates unless you go to a NUMA architecture.

Not if you don't use cache.

Rick C.


It's got nothing to do with you, it's the processor architecture. If you
want global cache coherence in a highly multicore processor, it has to
be architected to support that. If you do your own custom CPU in FPGA,
you can do whatever you want, but general purpose means general purpose.

Caches are a red herring. If you get rid of them (hello xCORE!) then you
still have global memory coherence to screw you up.

IMNSHO you /don't/ want global memory coherence in a highly multicore
system, because the penalties for achieving that are unacceptable.

Ditto global clocks. Ditto global time.

(Read Leslie Lamport for the latter!)

SMPs are a lot more generally useful than less-symmetric systems. The more
you relax the coherence guarantees the more you restrict the range of
problems you can address efficiently.

Yup; that's the dilemma - and there is little
(if any) sign of it being resolved.

The other side is, of course, that SMP restricts
the /scale/ of problems that can be computed in
a reasonable time.


A dozen years or so ago, I wrote a clusterized 3D FDTD electromagnetic solver
with advanced optimization capability to support my work in optical
antennas, and I still use it. Even though it's almost in the
embarrassingly-parallelizable category, it runs better on a big SMP than on a
cluster of smaller ones on account of cache coherence.

I'm sure it does. There are sweet spots for most
technologies.

I've seen, assessed (and avoided) a telco system
that runs on SMPs and makes full use of their features.
The assessment concluded that there was a hard,
unbreakable limit on the scale of the system that
could be implemented. The PHBs decreed it would be
implemented, and guess what happened.

That was 18 months of someone's life down the
drain; he wasn't happy, since he had predicted it.
 
On Sat, 12 May 2018 16:23:18 +0100, Tom Gardner
<spamjunk@blueyonder.co.uk> wrote:

On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare: example, dedicate a few
rows of cores as data routing (which can be any combination you like:
duplexed, synchronous, packetized, variable length, addressed...), and
use the remaining perimeter as your parallel processor. You might still
get dozens of cores in parallel that way, even if the utilization is
effectively maybe 20%.

Each core can "know" what the load is for the cores on its perimeter; if
some core needs to send a message to the perimeter a
best-first/greedy-search algorithm like beam search might be able to be
used to find the path where passing would cause the least disruption.

https://en.wikipedia.org/wiki/Beam_search

I know this is an old thread, but I came across it when looking for something else and had a few brain cells wakened.

Not directly related to these issues, but some recollections of
articles in electronic journals from the early 1970's.

In one article it was speculated that one day in the far future the
processor clock speed would be over 1 GHz and calculating sin/cos etc.
would be hyper fast. Apparently the editor assumed that those
functions were calculated using Taylor series. The article suggested
using the trigonometric equation

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

and doing sin(a) and cos(a) with table lookup.

The CORDIC algorithm was invented in 1956.

Two decades later implementing floating point plus
CORDIC on a 6800 was my first paid job as a
vacation student. I still have the code :)

CORDIC was also easily implementable in hardware.


With sufficient small steps cos(b)=1 and sin(b) = b/2pi. In reality,
the trigonometrical functions in those days were calculated using 3rd
or 4th order polynomials for single precision, so not much advantage.

On the Sinclair Scientific calculator, maybe.

Polynomials were used on many DEC systems. The PDP-11 FORTRAN IV+ used
fourth order for single and 8th degree for double. On VAX/VMS, the arc
was divided into more sectors and single precision was 3rd order and
double 7th order. The VAX had the POLY instruction, essentially a
repeated MAC (multiply-accumulate), so after you had split the argument
into the correct sector, with the correct coefficients for each sector,
a single POLY instruction was needed for sin, log, square root etc.
calculations.
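
What POLY does per coefficient is just a multiply-accumulate, i.e. Horner's
rule. A hedged C sketch of the same scheme, assuming the argument has
already been reduced to |x| <= pi/4; the coefficients are plain Taylor
terms, so this is illustrative rather than a tuned minimax fit:

#include <stdio.h>

/* Horner's rule: one multiply-accumulate per coefficient, which is what
   the VAX POLY instruction iterated internally. */
static double poly(double x, const double *c, int n)
{
    double acc = c[n - 1];
    for (int i = n - 2; i >= 0; i--)
        acc = acc * x + c[i];
    return acc;
}

/* sin(x) for |x| <= pi/4 as an odd polynomial: x * P(x^2). */
static double sin_reduced(double x)
{
    static const double c[] = { 1.0, -1.0 / 6.0, 1.0 / 120.0, -1.0 / 5040.0 };
    double x2 = x * x;
    return x * poly(x2, c, 4);
}

int main(void)
{
    printf("%f\n", sin_reduced(0.5));   /* ~0.479426 */
    return 0;
}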


In another article an Intel representative was asked what happens
when it becomes possible to integrate a million transistors on a
single chip. He could not give a clear answer.

My guess is that the situation today is similar with 1000 or a million
processor cores available.

As always, the hardware is easy, but programming techniques
for multiprocessor systems are still in their infancy.

The *only* processors that have a decent hardware+software
story are the XMOS xCORE processors with xC. They have been
around in several generations for over a decade, and are
*very* usable for *hard* realtime applications.

The xC+xCORE has a good pedigree: the key concepts have been
around since the 70s (xC/CSP) and 80s (xCORE/Transputer).

With a large number of identical processors, fast DSP algorithms could
be implemented by splitting a complex algorithm into a pipeline, each
processor handling only one stage of the algorithm.

Also some PLC (Programmable Logic Controller) type operations can
easily be implemented.

I agree that trying to convert any random Windows application to a
large number of processors is a big challenge.
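
A hedged sketch of the pipeline idea in ordinary C with pthreads (on an
xCORE you would use xC channels instead; the two "stages" and the queue
size here are placeholders invented for illustration). Each stage would sit
on its own core, and the only shared data is the small bounded channel
between them; compile with -pthread:

#include <pthread.h>
#include <stdio.h>

#define QSIZE 4
#define NSAMPLES 16

/* One bounded channel between two pipeline stages. */
struct chan {
    double buf[QSIZE];
    int head, tail, count;
    pthread_mutex_t m;
    pthread_cond_t not_full, not_empty;
};

static void chan_put(struct chan *c, double v)
{
    pthread_mutex_lock(&c->m);
    while (c->count == QSIZE)
        pthread_cond_wait(&c->not_full, &c->m);
    c->buf[c->tail] = v;
    c->tail = (c->tail + 1) % QSIZE;
    c->count++;
    pthread_cond_signal(&c->not_empty);
    pthread_mutex_unlock(&c->m);
}

static double chan_get(struct chan *c)
{
    pthread_mutex_lock(&c->m);
    while (c->count == 0)
        pthread_cond_wait(&c->not_empty, &c->m);
    double v = c->buf[c->head];
    c->head = (c->head + 1) % QSIZE;
    c->count--;
    pthread_cond_signal(&c->not_full);
    pthread_mutex_unlock(&c->m);
    return v;
}

static struct chan q = {
    .m = PTHREAD_MUTEX_INITIALIZER,
    .not_full = PTHREAD_COND_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER,
};

/* Stage 1: stand-in "filter" -- scales each sample and passes it on. */
static void *stage1(void *arg)
{
    (void)arg;
    for (int i = 0; i < NSAMPLES; i++)
        chan_put(&q, i * 0.5);
    return NULL;
}

/* Stage 2: stand-in "detector" -- accumulates the filtered samples. */
static void *stage2(void *arg)
{
    double sum = 0.0;
    (void)arg;
    for (int i = 0; i < NSAMPLES; i++)
        sum += chan_get(&q);
    printf("sum = %f\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}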
 
On Sunday, 13 May 2018 10:27:50 UTC+1, upsid...@downunder.com wrote:

With a large number of identical processors, fast DSP algorithms could
be implemented by splitting a complex algorithm into a pipeline, each
processor handling only one stage of the algorithm.

Also some PLC (Programmable Logic Controller) type operations can
easily be implemented.

I agree that trying to convert any random Windows application to a
large number of processors is a big challenge.

It should suit web browsers well, though, and those are the biggest CPU hogs on a lot of desktop systems.


NT
 
On Sat, 12 May 2018 16:23:18 +0100, Tom Gardner
<spamjunk@blueyonder.co.uk> wrote:

<snip>
The CORDIC algorithm was invented in 1956.

Two decades later implementing floating point plus
CORDIC on a 6800 was my first paid job as a
vacation student. I still have the code :)

I wrote a preemptive multitasking RTOS for the 6800. In longhand. In
Juneau Alaska. I mailed a few sheets a day back to New Orleans, where
people punched cards for assembly. It had one bug.

It was a nuisance. A NOP took two microseconds. There wasn't an
instruction to push the index register onto the stack, and it wouldn't
multiply, among other things.


--

John Larkin Highland Technology, Inc

lunatic fringe electronics
 
On 13/05/18 21:19, John Larkin wrote:
On Sat, 12 May 2018 16:23:18 +0100, Tom Gardner
spamjunk@blueyonder.co.uk> wrote:

<snip>

Two decades later implementing floating point plus
CORDIC on a 6800 was my first paid job as a vacation student. I still have the code :)

That also taught me the reason why semaphores are
necessary; my university project supervisor was
pleasantly surprised when I knew about them before
the course had started :)


I wrote a preemptive multitasking RTOS for the 6800. In longhand. In
Juneau Alaska. I mailed a few sheets a day back to New Orleans, where
people punched cards for assembly. It had one bug.

My final year project was a 4TTY -> computer multiplexer,
both hardware and 6800 asm software.

I wrote an exceptionally simple "executive" that
did little more than save/restore each task's PC
and SP. It was only a few instructions.
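
The same idea in modern terms: all a minimal executive has to do is swap
stack pointer and program counter per task. A toy round-robin sketch using
the (obsolescent, but still available on glibc/Linux) ucontext calls; the
task bodies and counts are invented for illustration:

#include <stdio.h>
#include <ucontext.h>

#define STACK_SIZE 16384

static ucontext_t main_ctx, task_ctx[2];
static char stacks[2][STACK_SIZE];

/* Each task does a little work, then yields back to the executive. */
static void task(int id)
{
    for (int step = 0; step < 3; step++) {
        printf("task %d, step %d\n", id, step);
        swapcontext(&task_ctx[id], &main_ctx);   /* save PC/SP, resume exec */
    }
}

int main(void)
{
    for (int i = 0; i < 2; i++) {
        getcontext(&task_ctx[i]);
        task_ctx[i].uc_stack.ss_sp = stacks[i];
        task_ctx[i].uc_stack.ss_size = STACK_SIZE;
        task_ctx[i].uc_link = &main_ctx;         /* fall back here on return */
        makecontext(&task_ctx[i], (void (*)(void))task, 1, i);
    }

    /* The whole "executive": resume each task in turn. */
    for (int round = 0; round < 3; round++)
        for (int i = 0; i < 2; i++)
            swapcontext(&main_ctx, &task_ctx[i]);

    return 0;
}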


It was a nuisance. A NOP took two microseconds. There wasn't an
instruction to push the index register onto the stack,

That was used infrequently in "user code", unlike loading
the index register from a location defined by the index
register. I used that very frequently to chain along
linked lists. When I tried to do that in the Z80 with its
IX and IY registers, it was seriously painful; so painful
that it was better to use plain old 8080 instructions
with the HL register.
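
The chaining trick in C terms, for anyone who never wrote it in 6800
assembler: each step is one load-of-a-pointer-through-a-pointer, roughly
what an indexed "LDX 0,X" bought you in a single instruction (the struct
layout here is invented for illustration):

#include <stdio.h>
#include <stddef.h>

struct node {
    struct node *next;   /* link word first, so the link sits at offset 0 */
    int payload;
};

/* Walk the list: each iteration is one pointer loaded through a pointer. */
static struct node *find(struct node *p, int wanted)
{
    while (p != NULL && p->payload != wanted)
        p = p->next;
    return p;
}

int main(void)
{
    struct node c = { NULL, 3 }, b = { &c, 2 }, a = { &b, 1 };
    printf("%s\n", find(&a, 3) ? "found" : "missing");
    return 0;
}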

My theory is that the Z80 won because of a better hardware
interface and the beguiling possibility of doing a print
statement in a single instruction. Beguiling to hardware
engineers that is, until they tried to do it in a real
system, and found all its inadequacies.


and it wouldn't
multiply, among other things.

That was rare then, but surprisingly doing it in software
was little slower than having an external hardware
multiplier chip.
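
The usual software substitute was shift-and-add: one conditional add and
one shift per multiplier bit, eight steps for an 8x8 multiply. A generic C
sketch of the scheme (not a transcription of any particular 6800 routine):

#include <stdio.h>
#include <stdint.h>

/* 8 x 8 -> 16 bit multiply by shift-and-add. */
static uint16_t mul8(uint8_t a, uint8_t b)
{
    uint16_t acc = 0;
    uint16_t addend = a;

    for (int i = 0; i < 8; i++) {
        if (b & 1)                /* low bit of multiplier set? */
            acc += addend;
        addend <<= 1;             /* shift multiplicand up */
        b >>= 1;                  /* next multiplier bit */
    }
    return acc;
}

int main(void)
{
    printf("%u\n", (unsigned)mul8(25, 17));   /* 425 */
    return 0;
}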

Next Sunday I expect to see the world's oldest operating
computer, based on Dekatrons, plus a working example of
the first computer I used in anger, complete with 39-bit
words and a magnetic-film peripheral with sprocket
holes. At the last two visits the attendant whipped out
the circuit diagrams and we pored over them together :)
That's my kind of museum :)

Plus the clunk of ASR33s is always /extremely/ evocative.
 
On 14.5.18 00:50, Tom Gardner wrote:
On 13/05/18 21:19, John Larkin wrote:
On Sat, 12 May 2018 16:23:18 +0100, Tom Gardner
spamjunk@blueyonder.co.uk> wrote:

<snip>
Next Sunday I expect to see the world's oldest operating
computer, based on Dekatrons, plus a working example of
the first computer I used in anger, complete with 39-bit
words and a magnetic-film peripheral with sprocket
holes. At the last two visits the attendant whipped out
the circuit diagrams and we pored over them together :)
That's my kind of museum :)

Plus the clunk of ASR33s is always /extremely/ evocative.

Elliott 803 / 503?

--

-TV
 
On 14/05/18 07:17, Tauno Voipio wrote:
On 14.5.18 00:50, Tom Gardner wrote:
<snip>

Elliott 803 / 503?

Elliott 803 :)
 
On 12/05/18 23:52, Lasse Langwadt Christensen wrote:
<snip 230 lines>

Folks, how about snipping a little here? 230 lines of quotation for a
smiley? What is this, AOL?
 
On Monday, May 14, 2018 at 7:54:15 AM UTC-4, David Brown wrote:

Folks, how about snipping a little here? 230 lines of quotation for a
smiley? What is this, AOL?

+1

Dan
 
On 5/22/2018 8:39 PM, Phil Hobbs wrote:
On 10/07/2008 12:25 PM, Palindrome wrote:
j wrote:
Hello,

I'm using the TI TLV431 in the shunt configuration with two
external resistors (see schematic figure link below) to output +2V
(vO), and an input of +5V. I've determined that the R1/R2 ratio
needs to be 0.613 (given that VREF=1.24V) for this to happen.

However, I'd like to determine sink resistance (output
resistance?) of this circuit but I'm not 100% sure that I'm doing
it correctly.

My first take on this is that the output resistance would be
determined by shorting the output to ground and determining the
short-circuit current (isc), which would be completely driven by
the +5V input and the resistor Rin, thus the output resistance
being: Rin. Is this the correct approach?

Schematic figure:
http://img115.imageshack.us/my.php?image=12965654es6.png

Datasheet for the TLV431 can be found here:
http://focus.ti.com/lit/ds/symlink/tlv431.pdf

Any help would be appreciated.

It really depends on why you want the "output resistance".

Normally, you will use this circuit to supply a range of output
current to other circuitry. I would measure the output voltage when
sourcing the min and max designed output current and derive the
"output resistance" from that.

I don't seem to be able to look at that image without signing up at
Imageshack, which I have no desire to do. However, I assume it's the
usual adjustable shunt regulator circuit, with a resistor Rs from supply
to cathode, anode grounded, and a voltage divider between them to drive
the feedback pin.

Output resistance is a small-signal parameter, so shorting the output
(and so turning the TLV431 off completely) will give the wrong answer,
because the output resistance is Rs when the regulator isn't running,
but under an ohm when it is. (The datasheet says 0.25 ohm typical, 0.4
ohm max.)

If you have a good voltmeter, you can measure the change in the output
voltage when you change Rs, and do the math.

Cheers

Phil Hobbs

Phil, you're replying to a 10-year-old thread.

BTW, Imageshack was free then.
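
For anyone doing the sums: the divider sets Vout = Vref * (1 + R1/R2),
which is where the 0.613 ratio in the original post comes from, and the
small-signal output resistance falls out of two (Rs, Vout) measurements,
as Phil describes. A hedged sketch of the arithmetic in C; the two
"measured" readings below are invented purely to show the calculation:

#include <stdio.h>

int main(void)
{
    /* Divider ratio for the quoted target: Vout = Vref * (1 + R1/R2). */
    double vref = 1.24, vout_target = 2.0;
    printf("R1/R2 = %.3f\n", vout_target / vref - 1.0);   /* ~0.613 */

    /* Output resistance from two measurements with different series
       resistors Rs (hypothetical readings). */
    double vin = 5.0;
    double rs1 = 330.0, vout1 = 2.0005;
    double rs2 = 220.0, vout2 = 2.0022;

    double i1 = (vin - vout1) / rs1;    /* current into the shunt node */
    double i2 = (vin - vout2) / rs2;
    printf("Rout ~= %.3f ohm\n", (vout2 - vout1) / (i2 - i1));
    return 0;
}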
 
On 05/22/2018 11:09 AM, Phil Hobbs wrote:
<snip>

Weird. The referenced post showed up as new in Thunderbird today--I
just noticed the 2008 date. I hope the OP figured out the problem!


Cheers

Phil Hobbs
--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
https://hobbs-eo.com
 
On Tue, 22 May 2018 11:12:01 -0400, Phil Hobbs
<pcdhSpamMeSenseless@electrooptical.net> wrote:

Weird. The referenced post showed up as new in Thunderbird today--I
just noticed the 2008 date. I hope the OP figured out the problem!
Cheers
Phil Hobbs

My guess(tm) is that your Thunderbirdie global messages database is
mangled. Easy fix:
<https://support.mozilla.org/en-US/kb/rebuilding-global-database>
For a while last year, Thunderbirdie was delivering indexing failures,
searches returning blank messages, and tangled threading. Rebuilding
global-messages-db.sqlite fixes the problem.

You may also want to rebuild some of the associated MSF message
indexes:
<https://www.lifewire.com/repair-folders-thunderbird-1173102>
I'm not sure if this part is really necessary, but it doesn't hurt to
rebuild everything.

You can watch the action by monitoring:
Tools -> Activity Monitor



--
Jeff Liebermann jeffl@cruzio.com
150 Felker St #D http://www.LearnByDestroying.com
Santa Cruz CA 95060 http://802.11junk.com
Skype: JeffLiebermann AE6KS 831-336-2558
 
On Tuesday, October 7, 2008 at 9:10:17 AM UTC-7, j wrote:

I'm using the TI TLV431 in the shunt configuration...

However, I'd like to determine sink resistance (output resistance?) of
this circuit but I'm not 100% sure that I'm doing it correctly.

My first take on this is that the output resistance would be
determined by shorting the output to ground and determining the short-
circuit current ...

No, of course that's wrong. Shorting the 'output' means powering off
the feedback amplifier in the TLV431, which is NOT normal operation.
Testing for a linear property (like output impedance) makes sense only
with small-signal perturbations of output current that do not
exceed normal operating conditions.
 
