Driver to drive?

Klaus Kragelund <klauskvik@hotmail.com> wrote in
news:ff9ad6df-ad19-4d9f-b134-2896ed0c8df9@googlegroups.com:
snip

I have a relay that has a specified must-operate voltage of 9V (12V
relay, so 75% of nominal)

The holding voltage is lower than the must-operate voltage, since the
must-operate voltage is what is needed to pull in the armature. When the
armature is pulled in, the magnetic path is shorter, and thus less
current is needed for the same force on the armature to withstand
vibration etc.

I do not have a holding spec, but I need it and would like to deduce
it from the must operate voltage

So I was thinking about opening up the relay, and mounting it
downwards, so I could place a weight on the armature, to measure
precisely the force needed to keep it in equilibrium (no movement
of the armature), both for the pulled-in case and for the
released state, so I can calculate the needed holding voltage

(and yes, I have contacted Omron, but so far they cannot provide
holding voltage specs)

I could also measure the inductance of the coil, in both the pulled-in
state and the released state, which should also give me the difference
in current needed to obtain the same magnetic field (the magnetic path
in the pulled-in state is shorter)
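
As a back-of-envelope check (assuming equal armature force needs equal
flux, the reluctance is dominated by the air gap, and ignoring the change
in spring force between positions, which is what the weight measurement
would capture), L is proportional to 1/reluctance and flux = N*I/reluctance,
so the hold current should scale as L_released/L_pulled-in times the
operate current. In Python, with made-up inductances just to show the
arithmetic:

    # Rough holding-voltage estimate from the two coil inductances.
    # Assumptions as above; inductance values below are invented.
    V_MUST_OPERATE = 9.0          # V, from the datasheet

    def holding_voltage(v_operate, l_released, l_pulled_in):
        # equal flux => I_hold / I_operate = L_released / L_pulled_in
        return v_operate * (l_released / l_pulled_in)

    est = holding_voltage(V_MUST_OPERATE,
                          l_released=80e-3, l_pulled_in=180e-3)
    print(est)    # -> 4.0 V, before adding any margin for vibration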

Anyone got ideas?

Cheers

Klaus

I would make a circuit board that charges up a supercap, and wire the
relay to that. The trigger fires it. Just like a camera flash unit, you
have to wait till the cap is charged before you can fire it. Your
trigger would be electronic.

Your cycle repeat rate takes the hit, but the charge time of the
store-and-fire PCB sets that. The relay would have to be able to hold or
latch on its own until released, however.
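
A rough way to size that storage cap (Python; the start voltage, coil
resistance and pull-in time below are hypothetical, plug in the real
ones):

    # The cap must keep the coil above the must-operate voltage for the
    # pull-in time while discharging into the coil resistance.
    import math

    V_START        = 12.0    # V on the cap when the trigger fires
    V_MUST_OPERATE = 9.0     # V the coil must still see after pull-in
    R_COIL         = 360.0   # ohms, example coil
    T_PULL_IN      = 10e-3   # s, typical small-relay operate time

    # V(t) = V_START * exp(-t / (R*C))  ->  solve for C
    C = T_PULL_IN / (R_COIL * math.log(V_START / V_MUST_OPERATE))
    print(f"{C*1e6:.0f} uF")  # ~97 uF for these numbers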

You could simply use an SSR too, or make one.

Or make a direct point-of-use DC-DC converter that converts your
sagged voltage to that required by the relay.
 
On Tuesday, April 17, 2018 at 3:30:14 PM UTC+2, John Devereux wrote:
tabbypurr@gmail.com writes:

On Monday, 16 April 2018 18:25:29 UTC+1, Klaus Kragelund wrote:
On Monday, April 16, 2018 at 2:47:11 PM UTC+2, TTman wrote:

SNIP
So I was thinking about opening up the relay, and mounting it
downwards, so I could place a weight on the armature, to measure
precisely the force needed to keep it in equilibrium (no
movement of the armature), both for the pulled-in case and
for the released state, so I can calculate the needed holding
voltage

(and yes, I have contacted Omron, but so far they cannot provide holding voltage specs)

I could also measure the inductance of the coil, in both the
pulled-in state and the released state, which should also give
me the difference in current needed to obtain the same magnetic field
(the magnetic path in the pulled-in state is shorter)

Anyone got ideas?

Cheers

Klaus

Drive it with a variable voltage PSU ???? Start with 9V to operate, then
reduce the voltage until the relay drops out... then add some margin % ?
Seems simple to me.

I know how to do that, but I need the robustness margin (for vibration etc.) to be valid

Cheers

Klaus

The only way to ensure it's valid is to test it, using frequency-swept vibration. Reducing coil power is nice, but it does erode that margin.

Even then I think there is still the possibility of variations in the
springs (when new, and also perhaps with aging). The manufacturer must
also allow for this in the spec. So maybe determine what the pull-in
margin actually is and then use the same margin for the hold?
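
Something like this (Python, with made-up bench numbers) is how I read
that suggestion; it only gives a candidate spec, not a substitute for
the vibration test:

    # Reuse the pull-in margin for the hold spec.  The 6.5 V and 1.2 V
    # "measurements" below are hypothetical.
    V_SPEC_OPERATE     = 9.0   # V, datasheet must-operate
    v_measured_pull_in = 6.5   # V, bench measurement on a sample
    v_measured_dropout = 1.2   # V, bench measurement on a sample

    margin = V_SPEC_OPERATE / v_measured_pull_in
    v_hold_candidate = v_measured_dropout * margin
    print(f"margin {margin:.2f}, "
          f"suggested hold >= {v_hold_candidate:.2f} V")
    # -> margin 1.38, suggested hold >= 1.66 V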

Good points. I am just amazed that Omron, which makes relays for a living, doesn't have an answer for this important spec

Seems like some other workaround would be more practical here.

There are latching relays; maybe one of those would be better for the
application?

They are too expensive; I need bottom dollar here, as usual

Cheers

Klaus
 
On Thursday, January 25, 2018 at 3:13:29 PM UTC-5, gnuarm.del...@gmail.com wrote:
My Win 8 machine is having convulsions and I figured it is time to get a new
one.

I'm moving Eudora over to a Win 10 machine and I am getting a strange error message,

"Can't start Eudora the first time with attachments

Please start Eudora by double clicking the icon or by the command line"


I'm not even sure what this means. What attachments are there for Eudora?
I'm pretty sure I don't have any. I'm guessing it is not finding the .ini file?

Win 10 sets up a bit differently than any other Windows I've used. It asked
me to set up an email account and then used that as my user name. So the
user name is not the same between the two machines. I had to edit the
deudora.ini file line from this

DataFolder=C:\Users\Rick\AppData\Eudora

to

DataFolder=C:\Users\ColdW\AppData\Eudora

But that's where all the mail folders are, so it should be right.

I'm not sure what else there is to change. Any clues?

I'm wondering if this is an issue with Windows faking out the location of read/write data files when a program is installed in the Program Files directory? That's why there is a deudora.ini file: it is read to find the location of the eudora.ini file, which is where all the real info is stored. Could Windows be messing with this and not showing Eudora *any* .ini file, so it thinks it is starting cold?

Rick C.

I am moving Eudora to another machine and I couldn't seem to get it to work; it failed with this same error. A Google search didn't find the answer... again, but eventually I remembered that there is a Eudora newsgroup, where I found the answer. So that I won't have to rummage around so much if I do this again... here are all the steps I needed to take to port Eudora between PCs.

1. Copy contents of C:\Program Files (x86)\Eudora\ directory to same place on new machine.

2. Copy contents of C:\Users\<user name>\AppData\Eudora directory to same place on new machine, adjusted for new user name if needed.

3. If user name has changed, edit deudora.ini changing path in "DataFolder=" to point to the directory in step 2 above.

4. This is the tricky one I keep forgetting and causes the error "Cannot start Eudora the first time with attachments." The shortcut has to be edited to add the data directory as a parameter to the command like this...

"C:\Program Files (x86)\Eudora\Eudora.exe" "C:\Users\<user name>\AppData\Eudora"

No need to change the "Start in" directory, as that will point to the deudora.ini file, although I'm not sure what good that does since it isn't sufficient to point Eudora to the data directory.
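
A rough automation of steps 2-4 (Python, run as admin on the new
machine; the user names and the backup path are just the examples from
this thread, and step 1 plus actually editing the shortcut are still
manual):

    import shutil
    from pathlib import Path

    OLD_USER, NEW_USER = "Rick", "ColdW"
    backup   = Path(r"D:\eudora_backup\AppData\Eudora")  # copied off old PC
    data_dir = Path(rf"C:\Users\{NEW_USER}\AppData\Eudora")
    ini      = Path(r"C:\Program Files (x86)\Eudora\deudora.ini")

    # step 2: mail folders into the new user's AppData
    shutil.copytree(backup, data_dir, dirs_exist_ok=True)

    # step 3: point DataFolder= at the new location
    text = ini.read_text()
    ini.write_text(text.replace(
        rf"DataFolder=C:\Users\{OLD_USER}\AppData\Eudora",
        rf"DataFolder={data_dir}"))

    # step 4: print the shortcut Target line to paste in by hand
    print(rf'"C:\Program Files (x86)\Eudora\Eudora.exe" "{data_dir}"')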

Anyway, these steps got my Eudora working under Win 10. I'm posting this here and in the Eudora newsgroup for my own use if nothing else. This should show up in Google searches in the future.

Rick C.
 
Finally, I can get a mobile data 4G connection! I got a new SIM card and
that may have fixed it, but if so there was a delay because about the time
it started working the tech at Consumer Cellular came back on the phone line
and told me that even though all the settings on her end looked good she had
reset the account anyway. She changed my plan to voice only, then changed
it back to voice and data so everything got reset in the "new" account
setup. The new SIM had been in and phone on a couple of minutes before the
4G symbol lit up, right as she came back, so who knows if it just took the
phone that long to make a connection or if it was the reset, but my guess is
that the reset was the fix. I wonder if Samsung will ever send me back my
phone? They have had it for about six weeks now after saying 7 to 10 or 14
days to repair it; I'm using a second S5 I bought on eBay. Anyway, thanks
everyone for the suggestions and encouragement.

--
Regards,
Carl Ijames
 
On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare: example, dedicate a few
rows of cores as data routing (which can be any combination you like:
duplexed, synchronous, packetized, variable length, addressed...), and
use the remaining perimeter as your parallel processor. You might still
get dozens of cores in parallel that way, even if the utilization is
effectively maybe 20%.

Each core can "know" what the load is for the cores on its perimeter; if
some core needs to send a message to the perimeter a
best-first/greedy-search algorithm like beam search might be able to be
used to find the path where passing would cause the least disruption.

https://en.wikipedia.org/wiki/Beam_search
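
A toy version of that search in Python (the grid size and the random
per-core loads are just stand-ins, and this is only the search part, not
anything resembling real routing firmware):

    import random

    ROWS, COLS, BEAM = 8, 18, 4
    load = {(r, c): random.random()
            for r in range(ROWS) for c in range(COLS)}

    def neighbours(cell):
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if 0 <= r + dr < ROWS and 0 <= c + dc < COLS:
                yield (r + dr, c + dc)

    def route_to_edge(start):
        # beam search: keep only the BEAM cheapest partial paths per step
        frontier = [(load[start], [start])]
        while frontier:
            candidates = []
            for cost, path in frontier:
                r, c = path[-1]
                if r in (0, ROWS - 1) or c in (0, COLS - 1):
                    return path                 # reached the perimeter
                for nxt in neighbours(path[-1]):
                    if nxt not in path:         # no revisits within a path
                        candidates.append((cost + load[nxt], path + [nxt]))
            candidates.sort(key=lambda x: x[0])
            frontier = candidates[:BEAM]        # the "beam"
        return None

    print(route_to_edge((4, 9)))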

I know this is an old thread, but I came across it when looking for something else and had a few brain cells wakened.

The difference between the GA144 and any other processor is not so much the way the processors are connected, or nearly any other aspect that folks focus on when learning about it. The real difference is not the various strong points; it is the weakness of having only 64 words of RAM, which is combined data and program space, on each processor. So programs have to be small (though it does have 5-bit instructions, so up to 4 per word), so small you shouldn't think of them as programs anymore. The GA144 core processor (the F18A) is a bit of programmable logic that is configured not with individual bits but with instruction streams.

I have not designed any apps for this CPU myself, other than paper studies to see what will fit. But in the Forth language the user is encouraged to write in a very modular fashion with a lot of nesting. I think software design on the GA144 would need to do the same thing: consider the F18A as an element on which to implement a few words, and structure them in a hierarchy to accomplish a task.
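
To put rough numbers on how little fits, here is a trivial check
(Python; it ignores the slots real F18A code loses to calls, literals
and alignment, so treat it as a ceiling, not a promise):

    RAM_WORDS, SLOTS_PER_WORD = 64, 4   # per the description above

    def fits(n_instructions, data_words):
        code_words = -(-n_instructions // SLOTS_PER_WORD)  # ceiling div
        return code_words + data_words <= RAM_WORDS

    print(fits(180, 12))   # True:  45 code words + 12 data words <= 64
    print(fits(300, 20))   # False: 75 + 20 > 64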

Rick C.
 
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare: example, dedicate a few
rows of cores as data routing (which can be any combination you like:
duplexed, synchronous, packetized, variable length, addressed...), and
use the remaining perimeter as your parallel processor. You might still
get dozens of cores in parallel that way, even if the utilization is
effectively maybe 20%.

Each core can "know" what the load is for the cores on its perimeter; if
some core needs to send a message to the perimeter a
best-first/greedy-search algorithm like beam search might be able to be
used to find the path where passing would cause the least disruption.

https://en.wikipedia.org/wiki/Beam_search

I know this is an old thread, but I came across it when looking for something else and had a few brain cells wakened.

Not directly related to these issues, but some recollections of
articles in electronics journals from the early 1970s.

In one article it was speculated that one day in the far future the
processor clock speed would be over 1 GHz and calculating sin/cos etc.
would be hyper fast. Apparently the editor assumed that those
functions were calculated using Taylor series. The article suggested
using the trigonometric equation

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

and doing sin(a) and cos(a) with table lookup.

With sufficiently small steps, cos(b) ~ 1 and sin(b) ~ b (b in radians).
In reality, the trigonometric functions in those days were calculated
using 3rd or 4th order polynomials for single precision, so not much
advantage.
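
The scheme from that article is easy to sketch today (Python, floating
point just to show the idea):

    import math

    N = 256                              # table size over one full turn
    STEP = 2 * math.pi / N
    SIN = [math.sin(i * STEP) for i in range(N)]
    COS = [math.cos(i * STEP) for i in range(N)]

    def fast_sin(x):
        i = int(x // STEP) % N           # table entry for coarse angle a
        b = x - (x // STEP) * STEP       # small remainder, 0 <= b < STEP
        # sin(a+b) = sin(a)cos(b) + cos(a)sin(b), with cos(b)~1, sin(b)~b
        return SIN[i] + COS[i] * b

    x = 1.234
    print(fast_sin(x), math.sin(x))      # worst-case error ~ STEP**2 / 2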

In an other article an Intel representative was interviewed what
happens, when it is possible to integrate a million transistors on a
single chip. He could not give a clear answer.

My guess is that the situation today is similar with 1000 or a million
processor cores available.
 
On Saturday, May 12, 2018 at 10:43:43 AM UTC-4, upsid...@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare: example, dedicate a few
rows of cores as data routing (which can be any combination you like:
duplexed, synchronous, packetized, variable length, addressed...), and
use the remaining perimeter as your parallel processor. You might still
get dozens of cores in parallel that way, even if the utilization is
effectively maybe 20%.

Each core can "know" what the load is for the cores on its perimeter; if
some core needs to send a message to the perimeter a
best-first/greedy-search algorithm like beam search might be able to be
used to find the path where passing would cause the least disruption.

https://en.wikipedia.org/wiki/Beam_search

I know this is an old thread, but I came across it when looking for something else and had a few brain cells wakened.

Not directly related to this issues, but some recollections of
articles in electronic journals from the early 1970's.

In one article it was speculated that one day in the far future the
processor clock speed would be over 1 GHz and calculating sin/cos etc.
would be hyper fast. Apparently the editor assumed that those
functions were calculated using Taylor series. The article suggested
using the trigonometric equation

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

and doing sin(a) and cos(a) with table lookup.

With sufficient small steps cos(b)=1 and sin(b) = b/2pi. In reality,
the trigonometrical functions in those days were calculated using 3rd
or 4th order polynomials for single precession, so not much advantage.

In an other article an Intel representative was interviewed what
happens, when it is possible to integrate a million transistors on a
single chip. He could not give a clear answer.

My guess is that the situation today is similar with 1000 or a million
processor cores available.

Certainly the solution space has not been adequately explored. I wonder if some sort of neural net would be practical if you could dedicate a processor to each neuron? You could simulate a very fast thinking cockroach with 1000 nodes.

Rick C.
 
On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare: example, dedicate a few
rows of cores as data routing (which can be any combination you like:
duplexed, synchronous, packetized, variable length, addressed...), and
use the remaining perimeter as your parallel processor. You might still
get dozens of cores in parallel that way, even if the utilization is
effectively maybe 20%.

Each core can "know" what the load is for the cores on its perimeter; if
some core needs to send a message to the perimeter a
best-first/greedy-search algorithm like beam search might be able to be
used to find the path where passing would cause the least disruption.

https://en.wikipedia.org/wiki/Beam_search

I know this is an old thread, but I came across it when looking for something else and had a few brain cells wakened.

Not directly related to this issues, but some recollections of
articles in electronic journals from the early 1970's.

In one article it was speculated that one day in the far future the
processor clock speed would be over 1 GHz and calculating sin/cos etc.
would be hyper fast. Apparently the editor assumed that those
functions were calculated using Taylor series. The article suggested
using the trigonometric equation

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

and doing sin(a) and cos(a) with table lookup.

The CORDIC algorithm was invented in 1956.

Two decades later implementing floating point plus
CORDIC on a 6800 was my first paid job as a
vacation student. I still have the code :)

CORDIC was also easily implementable in hardware.
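
For anyone who hasn't seen it, the rotation-mode iteration looks roughly
like this (Python, floating point for clarity; the 6800 version was
fixed point, and this is only a sketch, not that original code):

    import math

    ITERATIONS = 24
    ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
    GAIN = 1.0
    for a in ANGLES:                     # accumulate the CORDIC scale factor
        GAIN *= math.cos(a)

    def cordic_sin_cos(theta):           # theta in [-pi/2, pi/2]
        x, y, z = 1.0, 0.0, theta
        for i in range(ITERATIONS):
            d = 1.0 if z >= 0 else -1.0  # rotate residual angle z toward 0
            x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
            z -= d * ANGLES[i]
        return y * GAIN, x * GAIN        # (sin, cos)

    s, c = cordic_sin_cos(0.5)
    print(s, math.sin(0.5))              # ~0.4794 for both
    print(c, math.cos(0.5))              # ~0.8776 for both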


With sufficient small steps cos(b)=1 and sin(b) = b/2pi. In reality,
the trigonometrical functions in those days were calculated using 3rd
or 4th order polynomials for single precession, so not much advantage.

On the Sinclair Scientific calculator, maybe.


In an other article an Intel representative was interviewed what
happens, when it is possible to integrate a million transistors on a
single chip. He could not give a clear answer.

My guess is that the situation today is similar with 1000 or a million
processor cores available.

As always, the hardware is easy, but programming techniques
for multiprocessor systems are still in their infancy.

The *only* processors that have a decent hardware+software
story are the XMOS xCORE processors with xC. They have been
around in several generations for over a decade, and are
*very* usable for *hard* realtime applications.

The xC+xCORE has a good pedigree: the key concepts have been
around since the 70s (xC/CSP) and 80s (xCORE/Transputer).
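
Not xC, but the CSP idea behind it fits in a few lines of any language:
tasks that share nothing and talk only over blocking channels. A Python
analogy (threads and a queue standing in for xCORE hardware threads and
chanends; the real thing is compiled and statically scheduled, this is
just the shape of it):

    import threading, queue

    chan = queue.Queue(maxsize=1)        # sender blocks while it is full

    def producer():
        for sample in range(5):
            chan.put(sample)             # "output" on the channel
        chan.put(None)                   # end-of-stream marker

    def consumer():
        while True:
            sample = chan.get()          # "input" blocks until data arrives
            if sample is None:
                break
            print("got", sample)

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()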
 
On 12/05/18 16:53, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 11:23:22 AM UTC-4, Tom Gardner wrote:
On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT), gnuarm.deletethisbit@gmail.com
wrote:

On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare: example, dedicate a
few rows of cores as data routing (which can be any combination you
like: duplexed, synchronous, packetized, variable length,
addressed...), and use the remaining perimeter as your parallel
processor. You might still get dozens of cores in parallel that
way, even if the utilization is effectively maybe 20%.

Each core can "know" what the load is for the cores on its perimeter;
if some core needs to send a message to the perimeter a
best-first/greedy-search algorithm like beam search might be able to
be used to find the path where passing would cause the least
disruption.

https://en.wikipedia.org/wiki/Beam_search

I know this is an old thread, but I came across it when looking for
something else and had a few brain cells wakened.

Not directly related to this issues, but some recollections of articles
in electronic journals from the early 1970's.

In one article it was speculated that one day in the far future the
processor clock speed would be over 1 GHz and calculating sin/cos etc.
would be hyper fast. Apparently the editor assumed that those functions
were calculated using Taylor series. The article suggested using the
trigonometric equation

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

and doing sin(a) and cos(a) with table lookup.

The CORDIC algorithm was invented in 1956.

Two decades later implementing floating point plus CORDIC on a 6800 was my
first paid job as a vacation student. I still have the code :)

CORDIC was also easily implementable in hardware.


With sufficient small steps cos(b)=1 and sin(b) = b/2pi. In reality, the
trigonometrical functions in those days were calculated using 3rd or 4th
order polynomials for single precession, so not much advantage.

On the Sinclair Scientific calculator, maybe.


In an other article an Intel representative was interviewed what happens,
when it is possible to integrate a million transistors on a single chip.
He could not give a clear answer.

My guess is that the situation today is similar with 1000 or a million
processor cores available.

As always, the hardware is easy, but programming techniques for
multiprocessor systems are still in their infancy.

The *only* processors that have a decent hardware+software story are the
XMOS xCORE processors with xC. They have been around in several generations
for over a decade, and are *very* usable for *hard* realtime applications.

The xC+xCORE has a good pedigree: the key concepts have been around since
the 70s (xC/CSP) and 80s (xCORE/Treansputer).

Oh, yeah, you are *that* guy. Yes, we've had this discussion before. They
are good for a small class of applications which they are optimized for.
Otherwise the advantages of other processors or even FPGAs make them more
suitable.

Obviously.

But I haven't seen *any* application of the GA144 you mentioned.
As for the 1000-processor chips, I have seen zero implementations,
zero programming techniques and zero applications for them. The
XMOS devices are therefore (currently) infinitely superior.

Of those, the programming techniques are the most difficult;
get those right and implementations and applications may
*follow*.

I strongly suspect that anything with 1000s of processors
will have no significant advantages over an FPGA.
 
On Saturday, May 12, 2018 at 11:23:22 AM UTC-4, Tom Gardner wrote:
On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare: example, dedicate a few
rows of cores as data routing (which can be any combination you like:
duplexed, synchronous, packetized, variable length, addressed...), and
use the remaining perimeter as your parallel processor. You might still
get dozens of cores in parallel that way, even if the utilization is
effectively maybe 20%.

Each core can "know" what the load is for the cores on its perimeter; if
some core needs to send a message to the perimeter a
best-first/greedy-search algorithm like beam search might be able to be
used to find the path where passing would cause the least disruption.

https://en.wikipedia.org/wiki/Beam_search

I know this is an old thread, but I came across it when looking for something else and had a few brain cells wakened.

Not directly related to this issues, but some recollections of
articles in electronic journals from the early 1970's.

In one article it was speculated that one day in the far future the
processor clock speed would be over 1 GHz and calculating sin/cos etc.
would be hyper fast. Apparently the editor assumed that those
functions were calculated using Taylor series. The article suggested
using the trigonometric equation

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

and doing sin(a) and cos(a) with table lookup.

The CORDIC algorithm was invented in 1956.

Two decades later implementing floating point plus
CORDIC on a 6800 was my first paid job as a
vacation student. I still have the code :)

CORDIC was also easily implementable in hardware.


With sufficient small steps cos(b)=1 and sin(b) = b/2pi. In reality,
the trigonometrical functions in those days were calculated using 3rd
or 4th order polynomials for single precession, so not much advantage.

On the Sinclair Scientific calculator, maybe.


In an other article an Intel representative was interviewed what
happens, when it is possible to integrate a million transistors on a
single chip. He could not give a clear answer.

My guess is that the situation today is similar with 1000 or a million
processor cores available.

As always, the hardware is easy, but programming techniques
for multiprocessor systems are still in their infancy.

The *only* processors that have a decent hardware+software
story are the XMOS xCORE processors with xC. They have been
around in several generations for over a decade, and are
*very* usable for *hard* realtime applications.

The xC+xCORE has a good pedigree: the key concepts have been
around since the 70s (xC/CSP) and 80s (xCORE/Treansputer).

Oh, yeah, you are *that* guy. Yes, we've had this discussion before. They are good for a small class of applications which they are optimized for. Otherwise the advantages of other processors or even FPGAs make them more suitable.

I was just browsing FPGA prices and I can get a small FPGA for $2.50 qty 1. Unfortunately volume FPGA prices are seldom advertised, so it's hard to tell what the 1000-piece price would be.

Rick C.
 
On Saturday, May 12, 2018 at 12:16:56 PM UTC-4, Tom Gardner wrote:
On 12/05/18 16:53, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 11:23:22 AM UTC-4, Tom Gardner wrote:
On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT), gnuarm.deletethisbit@gmail.com
wrote:

On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare: example, dedicate a
few rows of cores as data routing (which can be any combination you
like: duplexed, synchronous, packetized, variable length,
addressed...), and use the remaining perimeter as your parallel
processor. You might still get dozens of cores in parallel that
way, even if the utilization is effectively maybe 20%.

Each core can "know" what the load is for the cores on its perimeter;
if some core needs to send a message to the perimeter a
best-first/greedy-search algorithm like beam search might be able to
be used to find the path where passing would cause the least
disruption.

https://en.wikipedia.org/wiki/Beam_search

I know this is an old thread, but I came across it when looking for
something else and had a few brain cells wakened.

Not directly related to this issues, but some recollections of articles
in electronic journals from the early 1970's.

In one article it was speculated that one day in the far future the
processor clock speed would be over 1 GHz and calculating sin/cos etc..
would be hyper fast. Apparently the editor assumed that those functions
were calculated using Taylor series. The article suggested using the
trigonometric equation

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

and doing sin(a) and cos(a) with table lookup.

The CORDIC algorithm was invented in 1956.

Two decades later implementing floating point plus CORDIC on a 6800 was my
first paid job as a vacation student. I still have the code :)

CORDIC was also easily implementable in hardware.


With sufficient small steps cos(b)=1 and sin(b) = b/2pi. In reality, the
trigonometrical functions in those days were calculated using 3rd or 4th
order polynomials for single precession, so not much advantage.

On the Sinclair Scientific calculator, maybe.


In an other article an Intel representative was interviewed what happens,
when it is possible to integrate a million transistors on a single chip.
He could not give a clear answer.

My guess is that the situation today is similar with 1000 or a million
processor cores available.

As always, the hardware is easy, but programming techniques for
multiprocessor systems are still in their infancy.

The *only* processors that have a decent hardware+software story are the
XMOS xCORE processors with xC. They have been around in several generations
for over a decade, and are *very* usable for *hard* realtime applications.

The xC+xCORE has a good pedigree: the key concepts have been around since
the 70s (xC/CSP) and 80s (xCORE/Treansputer).

Oh, yeah, you are *that* guy. Yes, we've had this discussion before. They
are good for a small class of applications which they are optimized for..
Otherwise the advantages of other processors or even FPGAs make them more
suitable.

Obviously.

But I haven't seen *any* application of the GA144 you mentioned.
As for the 1000 processors chips, I have seen zero implementations,
zero programming techniques and zero applications for them. The
XMOS devices are therefore (currently) infinitely superior.

Yes, the GA144 was designed as an experiment rather than with any application in mind. However, I'm told they have customers; one application in particular was for an advanced hearing aid using something much more advanced than a graphic-equalizer type of filter. It seems the original prototype required a pair of TMS320C6xxx devices, which use around a watt each if I remember correctly. This type of signal-processing app has potential for a multiprocessor with little memory, as long as it doesn't get too complex, or, if it does need external memory, the bandwidth isn't too high.


Of those, the programming techniques are the most difficult;
get those right and implementations and applications may
*follow*.

I strongly suspect that anything with 1000s of processors
will have no significant advantages over an FPGA.

I don't see that. The supercomputers of today use standard processors, not FPGAs. "Tianhe-2 uses over 3 million Intel Xeon E5-2692v2 12C cores", although that is yesterday's fastest computer, surpassed by "The Sunway TaihuLight uses a total of 40,960 Chinese-designed SW26010 manycore 64-bit RISC processors based on the Sunway architecture.[6] Each processor chip contains 256 processing cores, and an additional four auxiliary cores for system management (also RISC cores, just more fully featured) for a total of 10,649,600 CPU cores across the entire system." This monster uses 15 MW... yes, 15 MW! A professor in college used to kid about throwing the power switch on his CPUMOB machine and watching the lights dim in College Park. The Sunway would do it for sure!

A new machine is due out any time now running at 130 PFLOPS. Called the AI Bridging Cloud Infrastructure, "The system will consist of 1,088 Primergy CX2570 M4 servers, mounted in Fujitsu's Primergy CX400 M4 multi-node servers, with each server featuring components such as two Intel Xeon Gold processor CPUs, four NVIDIA Tesla V100 GPU computing cards, and Intel SSD DC P4600 storage." Expected power consumption will be down to only 3 MW.

Rick C.
 
On 12/05/18 18:51, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 12:16:56 PM UTC-4, Tom Gardner wrote:
On 12/05/18 16:53, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 11:23:22 AM UTC-4, Tom Gardner wrote:
On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare: example, dedicate
a few rows of cores as data routing (which can be any
combination you like: duplexed, synchronous, packetized,
variable length, addressed...), and use the remaining perimeter
as your parallel processor. You might still get dozens of
cores in parallel that way, even if the utilization is
effectively maybe 20%.

Each core can "know" what the load is for the cores on its
perimeter; if some core needs to send a message to the perimeter
a best-first/greedy-search algorithm like beam search might be
able to be used to find the path where passing would cause the
least disruption.

https://en.wikipedia.org/wiki/Beam_search

I know this is an old thread, but I came across it when looking
for something else and had a few brain cells wakened.

Not directly related to this issues, but some recollections of
articles in electronic journals from the early 1970's.

In one article it was speculated that one day in the far future the
processor clock speed would be over 1 GHz and calculating sin/cos
etc. would be hyper fast. Apparently the editor assumed that those
functions were calculated using Taylor series. The article suggested
using the trigonometric equation

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

and doing sin(a) and cos(a) with table lookup.

The CORDIC algorithm was invented in 1956.

Two decades later implementing floating point plus CORDIC on a 6800 was
my first paid job as a vacation student. I still have the code :)

CORDIC was also easily implementable in hardware.


With sufficient small steps cos(b)=1 and sin(b) = b/2pi. In reality,
the trigonometrical functions in those days were calculated using 3rd
or 4th order polynomials for single precession, so not much
advantage.

On the Sinclair Scientific calculator, maybe.


In an other article an Intel representative was interviewed what
happens, when it is possible to integrate a million transistors on a
single chip. He could not give a clear answer.

My guess is that the situation today is similar with 1000 or a
million processor cores available.

As always, the hardware is easy, but programming techniques for
multiprocessor systems are still in their infancy.

The *only* processors that have a decent hardware+software story are
the XMOS xCORE processors with xC. They have been around in several
generations for over a decade, and are *very* usable for *hard*
realtime applications.

The xC+xCORE has a good pedigree: the key concepts have been around
since the 70s (xC/CSP) and 80s (xCORE/Treansputer).

Oh, yeah, you are *that* guy. Yes, we've had this discussion before.
They are good for a small class of applications which they are optimized
for. Otherwise the advantages of other processors or even FPGAs make them
more suitable.

Obviously.

But I haven't seen *any* application of the GA144 you mentioned. As for the
1000 processors chips, I have seen zero implementations, zero programming
techniques and zero applications for them. The XMOS devices are therefore
(currently) infinitely superior.

Yes, the GA144 was designed as an experiment rather than with any application
in mind. However, I'm told they have customers, one application in
particular was for an advanced hearing aid using something much more advanced
than a graphic equalizer type of filter. Seems the original prototype
required the use of a pair of TMS320C6xxx devices which use around a watt
each if I remember correctly. This type of signal processing app has
potential for a multiprocessor with little memory, as long as it doesn't get
too complex or if it needs external memory the bandwidth isn't too high.


Of those, the programming techniques are the most difficult; get those
right and implementations and applications may *follow*.

I strongly suspect that anything with 1000s of processors will have no
significant advantages over an FPGA.

I don't see that. The super computers of today use standard processors, not
FPGAs. "Tianhee-2 uses over 3 million Intel Xeon E5-2692v2 12C cores",
although that is yesterday's fastest computer surpassed by "The Sunway
TaihuLight uses a total of 40,960 Chinese-designed SW26010 manycore 64-bit
RISC processors based on the Sunway architecture.[6] Each processor chip
contains 256 processing cores, and an additional four auxiliary cores for
system management (also RISC cores, just more fully featured) for a total of
10,649,600 CPU cores across the entire system." This monster uses 15 MW...
yes, 15 MW! A professor in college used to kid about throwing the power
switch on his CPUMOB machine and watching the lights dim in College Park.
The Sunway would do it for sure!

A new machine is due out any time now running at 130 PFLOPS. Called the AI
Bridging Cloud Infrastructure, "The system will consist of 1,088 Primergy
CX2570 M4 servers, mounted in Fujitsu's Primergy CX400 M4 multi-node servers,
with each server featuring components such as two Intel Xeon Gold processor
CPUs, four NVIDIA Tesla V100 GPU computing cards, and Intel SSD DC P4600
storage." Expected power consumption will be down to only 3 MW.

I was assuming a comparison of FPGAs with something /vaguely/
similar to the GA144, or the Sony Cell processor, or
that Intel experiment with 80 cores on a chip.

Comparing something that, as I intended, fitted on a large
PCB and plugs into the mains with something that fits on a
tennis court and plugs into a substation isn't really very
enlightening.

There's a fundamental problem with parallel computing
without shared memory[1]: choosing the granularity of
parallelism. Too small and the comms costs dominate,
too large and you have to copy too much context and
wait a long time for other computations to complete.

That has been an unsolved problem since the 1960s,
except for applications that are known as "embarrassingly
parallel". Canonical examples of embarrassingly parallel
computations are Monte Carlo simulations, telecom/financial
systems, some place/route algorithms, and some simulated-annealing
algorithms.
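
A toy model of the trade-off (Python, with purely illustrative numbers):
each chunk of g work units costs g*T_UNIT of compute plus a fixed
C_COMMS for shipping context and results, spread across P workers:

    import math

    N, P = 100_000, 64                   # work units, workers
    T_UNIT, C_COMMS = 1e-6, 1e-3         # per-unit compute, per-chunk comms

    def run_time(g):
        chunks = math.ceil(N / g)
        rounds = math.ceil(chunks / P)   # waves of chunks across P workers
        return rounds * (g * T_UNIT + C_COMMS)

    for g in (10, 100, 1_000, 10_000, 100_000):
        print(f"grain {g:>6}: {run_time(g) * 1e3:8.2f} ms")
    # tiny grains drown in comms, one huge grain serialises everything;
    # the sweet spot here is somewhere around N / P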
 
On 05/12/18 14:38, Tom Gardner wrote:
On 12/05/18 18:51, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 12:16:56 PM UTC-4, Tom Gardner wrote:
On 12/05/18 16:53, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 11:23:22 AM UTC-4, Tom Gardner wrote:
On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare: example, dedicate
a few rows of cores as data routing (which can be any
combination you like: duplexed, synchronous, packetized,
variable length, addressed...), and use the remaining perimeter
as your parallel processor.  You might still get dozens of
cores in parallel that way, even if the utilization is
effectively maybe 20%.

Each core can "know" what the load is for the cores on its
perimeter; if some core needs to send a message to the perimeter
a best-first/greedy-search algorithm like beam search might be
able to be used to find the path where passing would cause the
least disruption.

https://en.wikipedia.org/wiki/Beam_search

I know this is an old thread, but I came across it when looking
for something else and had a few brain cells wakened.

Not directly related to this issues, but some recollections of
articles in electronic journals from the early 1970's.

In one article it was speculated that one day in the far future
the processor clock speed would be over 1 GHz and calculating sin/cos
etc. would be hyper fast. Apparently the editor assumed that those
functions were calculated using Taylor series. The article suggested
using the trigonometric equation

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

and doing sin(a) and cos(a) with table lookup.

The CORDIC algorithm was invented in 1956.

Two decades later implementing floating point plus CORDIC on a 6800
was
my first paid job as a vacation student. I still have the code :)

CORDIC was also easily implementable in hardware.


With sufficient small steps cos(b)=1 and sin(b) = b/2pi. In reality,
the trigonometrical functions in those days were calculated using 3rd
or 4th order polynomials for single precession, so not much
advantage.

On the Sinclair Scientific calculator, maybe.


In an other article an Intel representative was interviewed what
happens, when it is possible to integrate a million transistors on a
single chip. He could not give a clear answer.

My guess is that the situation today is similar with 1000 or a
million processor cores available.

As always, the hardware is easy, but programming techniques for
multiprocessor systems are still in their infancy.

The *only* processors that have a decent hardware+software story are
the XMOS xCORE processors with xC. They have been around in several
generations for over a decade, and are *very* usable for *hard*
realtime applications.

The xC+xCORE has a good pedigree: the key concepts have been around
since the 70s (xC/CSP) and 80s (xCORE/Treansputer).

Oh, yeah, you are *that* guy.  Yes, we've had this discussion before.
They are good for a small class of applications which they are
optimized
for. Otherwise the advantages of other processors or even FPGAs make
them
more suitable.

Obviously.

But I haven't seen *any* application of the GA144 you mentioned. As
for the
1000 processors chips, I have seen zero implementations, zero
programming
techniques and zero applications for them. The XMOS devices are
therefore
(currently) infinitely superior.

Yes, the GA144 was designed as an experiment rather than with any
application
in mind.  However, I'm told they have customers, one application in
particular was for an advanced hearing aid using something much more
advanced
than a graphic equalizer type of filter.  Seems the original prototype
required the use of a pair of TMS320C6xxx devices which use around a watt
each if I remember correctly.  This type of signal processing app has
potential for a multiprocessor with little memory, as long as it
doesn't get
too complex or if it needs external memory the bandwidth isn't too high.


Of those, the programming techniques are the most difficult; get those
right and implementations and applications may *follow*.

I strongly suspect that anything with 1000s of processors will have no
significant advantages over an FPGA.

I don't see that.  The super computers of today use standard
processors, not
FPGAs.  "Tianhee-2 uses over 3 million Intel Xeon E5-2692v2 12C cores",
although that is yesterday's fastest computer surpassed by "The Sunway
TaihuLight uses a total of 40,960 Chinese-designed SW26010 manycore
64-bit
RISC processors based on the Sunway architecture.[6] Each processor chip
contains 256 processing cores, and an additional four auxiliary cores for
system management (also RISC cores, just more fully featured) for a
total of
10,649,600 CPU cores across the entire system."  This monster uses 15
MW...
yes, 15 MW!  A professor in college used to kid about throwing the power
switch on his CPUMOB machine and watching the lights dim in College Park.
The Sunway would do it for sure!

A new machine is due out any time now running at 130 PFLOPS.  Called
the AI
Bridging Cloud Infrastructure, "The system will consist of 1,088 Primergy
CX2570 M4 servers, mounted in Fujitsu's Primergy CX400 M4 multi-node
servers,
with each server featuring components such as two Intel Xeon Gold
processor
CPUs, four NVIDIA Tesla V100 GPU computing cards, and Intel SSD DC P4600
storage."  Expected power consumption will be down to only 3 MW.

I was assuming comparing FPGAs with something /vaguely/
similar to the GA144, or that Sony cell processor, or
that intel experiment with 80 cores on a chip.

Comparing something that, as I intended, fitted on a large
PCB and plus into the mains with something that fits on a
tennis court and plugs into a substation isn't really very
enlightening.

There's a fundamental problem with parallel computing
without shared memory[1]: choosing the granularity of
parallelism. Too small and the comms costs dominate,
too large and you have to copy too much context and
wait a long time for other computations to complete.

That has been an unsolved problem since the 1960s,
except for applications that are known as "embarrassingly
parallel". Canonical examples of embarrassingly parallel
computations are monte carlo simulations, telecom/financial
systems, some place/route algorithms, some simulated
annealing algorithms.

Cache coherence traffic grows faster than the number of cores, and
eventually dominates unless you go to a NUMA architecture.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com
 
On 12/05/18 19:57, Phil Hobbs wrote:
On 05/12/18 14:38, Tom Gardner wrote:
On 12/05/18 18:51, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 12:16:56 PM UTC-4, Tom Gardner wrote:
On 12/05/18 16:53, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 11:23:22 AM UTC-4, Tom Gardner wrote:
On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare: example, dedicate
a few rows of cores as data routing (which can be any
combination you like: duplexed, synchronous, packetized,
variable length, addressed...), and use the remaining perimeter
as your parallel processor.  You might still get dozens of
cores in parallel that way, even if the utilization is
effectively maybe 20%.

Each core can "know" what the load is for the cores on its
perimeter; if some core needs to send a message to the perimeter
a best-first/greedy-search algorithm like beam search might be
able to be used to find the path where passing would cause the
least disruption.

https://en.wikipedia.org/wiki/Beam_search

I know this is an old thread, but I came across it when looking
for something else and had a few brain cells wakened.

Not directly related to this issues, but some recollections of
articles in electronic journals from the early 1970's.

In one article it was speculated that one day in the far future the
processor clock speed would be over 1 GHz and calculating sin/cos
etc. would be hyper fast. Apparently the editor assumed that those
functions were calculated using Taylor series. The article suggested
using the trigonometric equation

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

and doing sin(a) and cos(a) with table lookup.

The CORDIC algorithm was invented in 1956.

Two decades later implementing floating point plus CORDIC on a 6800 was
my first paid job as a vacation student. I still have the code :)

CORDIC was also easily implementable in hardware.


With sufficient small steps cos(b)=1 and sin(b) = b/2pi. In reality,
the trigonometrical functions in those days were calculated using 3rd
or 4th order polynomials for single precession, so not much
advantage.

On the Sinclair Scientific calculator, maybe.


In an other article an Intel representative was interviewed what
happens, when it is possible to integrate a million transistors on a
single chip. He could not give a clear answer.

My guess is that the situation today is similar with 1000 or a
million processor cores available.

As always, the hardware is easy, but programming techniques for
multiprocessor systems are still in their infancy.

The *only* processors that have a decent hardware+software story are
the XMOS xCORE processors with xC. They have been around in several
generations for over a decade, and are *very* usable for *hard*
realtime applications.

The xC+xCORE has a good pedigree: the key concepts have been around
since the 70s (xC/CSP) and 80s (xCORE/Treansputer).

Oh, yeah, you are *that* guy.  Yes, we've had this discussion before.
They are good for a small class of applications which they are optimized
for. Otherwise the advantages of other processors or even FPGAs make them
more suitable.

Obviously.

But I haven't seen *any* application of the GA144 you mentioned. As for the
1000 processors chips, I have seen zero implementations, zero programming
techniques and zero applications for them. The XMOS devices are therefore
(currently) infinitely superior.

Yes, the GA144 was designed as an experiment rather than with any application
in mind.  However, I'm told they have customers, one application in
particular was for an advanced hearing aid using something much more advanced
than a graphic equalizer type of filter.  Seems the original prototype
required the use of a pair of TMS320C6xxx devices which use around a watt
each if I remember correctly.  This type of signal processing app has
potential for a multiprocessor with little memory, as long as it doesn't get
too complex or if it needs external memory the bandwidth isn't too high.


Of those, the programming techniques are the most difficult; get those
right and implementations and applications may *follow*.

I strongly suspect that anything with 1000s of processors will have no
significant advantages over an FPGA.

I don't see that.  The super computers of today use standard processors, not
FPGAs.  "Tianhee-2 uses over 3 million Intel Xeon E5-2692v2 12C cores",
although that is yesterday's fastest computer surpassed by "The Sunway
TaihuLight uses a total of 40,960 Chinese-designed SW26010 manycore 64-bit
RISC processors based on the Sunway architecture.[6] Each processor chip
contains 256 processing cores, and an additional four auxiliary cores for
system management (also RISC cores, just more fully featured) for a total of
10,649,600 CPU cores across the entire system."  This monster uses 15 MW...
yes, 15 MW!  A professor in college used to kid about throwing the power
switch on his CPUMOB machine and watching the lights dim in College Park.
The Sunway would do it for sure!

A new machine is due out any time now running at 130 PFLOPS.  Called the AI
Bridging Cloud Infrastructure, "The system will consist of 1,088 Primergy
CX2570 M4 servers, mounted in Fujitsu's Primergy CX400 M4 multi-node servers,
with each server featuring components such as two Intel Xeon Gold processor
CPUs, four NVIDIA Tesla V100 GPU computing cards, and Intel SSD DC P4600
storage."  Expected power consumption will be down to only 3 MW.

I was assuming comparing FPGAs with something /vaguely/
similar to the GA144, or that Sony cell processor, or
that intel experiment with 80 cores on a chip.

Comparing something that, as I intended, fitted on a large
PCB and plus into the mains with something that fits on a
tennis court and plugs into a substation isn't really very
enlightening.

There's a fundamental problem with parallel computing
without shared memory[1]: choosing the granularity of
parallelism. Too small and the comms costs dominate,
too large and you have to copy too much context and
wait a long time for other computations to complete.

That has been an unsolved problem since the 1960s,
except for applications that are known as "embarrassingly
parallel". Canonical examples of embarrassingly parallel
computations are monte carlo simulations, telecom/financial
systems, some place/route algorithms, some simulated
annealing algorithms.

Cache coherence traffic grows faster than the number of cores, and eventually
dominates unless you go to a NUMA architecture.

Oh, indeed, that is inherently non-scalable, with
/any/ memory architecture. Sure people have pushed
the limit back a bit, but not cleanly.

The fundamental problem is that low-level softies would
like to think that they are still programming PDP11s in
K&R C. They aren't, and there are an appalling number
of poorly understood band-aids in modern C that attempt
to preserve that illusion. (And may preserve it if you
get all the incantations for your compiler+version
exactly right)

The sooner the flat-memory-space single-instruction-
at-a-time mirage is superseded, the better. Currently
the best option, according to the HPC mob that
traditionally push all computational boundaries, is
message-passing. But that is poorly supported by C.

For an amusing speed-read, have a look at
"C Is Not a Low-level Language. Your computer
is not a fast PDP-11."
https://queue.acm.org/detail.cfm?id=3212479

It has examples of where people in the trenches
don't understand what's going on. For example:
"A 2015 survey of C programmers, compiler writers,
and standards committee members raised several
issues about the comprehensibility of C. For
example, C permits an implementation to insert
padding into structures (but not into arrays)
to ensure that all fields have a useful alignment
for the target. If you zero a structure and then
set some of the fields, will the padding bits
all be zero? According to the results of the
survey, 36 percent were sure that they would
be, and 29 percent didn't know. Depending on
the compiler (and optimization level), it may
or may not be."
 
On 05/12/18 16:15, Tom Gardner wrote:
On 12/05/18 19:57, Phil Hobbs wrote:
On 05/12/18 14:38, Tom Gardner wrote:
On 12/05/18 18:51, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 12:16:56 PM UTC-4, Tom Gardner wrote:
On 12/05/18 16:53, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 11:23:22 AM UTC-4, Tom Gardner wrote:
On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare: example, dedicate
a few rows of cores as data routing (which can be any
combination you like: duplexed, synchronous, packetized,
variable length, addressed...), and use the remaining perimeter
as your parallel processor.  You might still get dozens of
cores in parallel that way, even if the utilization is
effectively maybe 20%.

Each core can "know" what the load is for the cores on its
perimeter; if some core needs to send a message to the perimeter
a best-first/greedy-search algorithm like beam search might be
able to be used to find the path where passing would cause the
least disruption.

https://en.wikipedia.org/wiki/Beam_search

I know this is an old thread, but I came across it when looking
for something else and had a few brain cells wakened.

Not directly related to this issues, but some recollections of
articles in electronic journals from the early 1970's.

In one article it was speculated that one day in the far future
the processor clock speed would be over 1 GHz and calculating
sin/cos
etc. would be hyper fast. Apparently the editor assumed that those
functions were calculated using Taylor series. The article
suggested
using the trigonometric equation

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

and doing sin(a) and cos(a) with table lookup.

The CORDIC algorithm was invented in 1956.

Two decades later implementing floating point plus CORDIC on a
6800 was
my first paid job as a vacation student. I still have the code :)

CORDIC was also easily implementable in hardware.


With sufficient small steps cos(b)=1 and sin(b) = b/2pi. In
reality,
the trigonometrical functions in those days were calculated
using 3rd
or 4th order polynomials for single precession, so not much
advantage.

On the Sinclair Scientific calculator, maybe.


In an other article an Intel representative was interviewed what
happens, when it is possible to integrate a million transistors
on a
single chip. He could not give a clear answer.

My guess is that the situation today is similar with 1000 or a
million processor cores available.

As always, the hardware is easy, but programming techniques for
multiprocessor systems are still in their infancy.

The *only* processors that have a decent hardware+software story are
the XMOS xCORE processors with xC. They have been around in several
generations for over a decade, and are *very* usable for *hard*
realtime applications.

The xC+xCORE has a good pedigree: the key concepts have been around
since the 70s (xC/CSP) and 80s (xCORE/Treansputer).

Oh, yeah, you are *that* guy.  Yes, we've had this discussion before.
They are good for a small class of applications which they are
optimized
for. Otherwise the advantages of other processors or even FPGAs
make them
more suitable.

Obviously.

But I haven't seen *any* application of the GA144 you mentioned. As
for the
1000 processors chips, I have seen zero implementations, zero
programming
techniques and zero applications for them. The XMOS devices are
therefore
(currently) infinitely superior.

Yes, the GA144 was designed as an experiment rather than with any
application
in mind.  However, I'm told they have customers, one application in
particular was for an advanced hearing aid using something much more
advanced
than a graphic equalizer type of filter.  Seems the original prototype
required the use of a pair of TMS320C6xxx devices which use around a
watt
each if I remember correctly.  This type of signal processing app has
potential for a multiprocessor with little memory, as long as it
doesn't get
too complex or if it needs external memory the bandwidth isn't too
high.


Of those, the programming techniques are the most difficult; get those
right and implementations and applications may *follow*.

I strongly suspect that anything with 1000s of processors will have no
significant advantages over an FPGA.

I don't see that.  The super computers of today use standard
processors, not
FPGAs.  "Tianhee-2 uses over 3 million Intel Xeon E5-2692v2 12C cores",
although that is yesterday's fastest computer surpassed by "The Sunway
TaihuLight uses a total of 40,960 Chinese-designed SW26010 manycore
64-bit
RISC processors based on the Sunway architecture.[6] Each processor
chip
contains 256 processing cores, and an additional four auxiliary
cores for
system management (also RISC cores, just more fully featured) for a
total of
10,649,600 CPU cores across the entire system."  This monster uses
15 MW...
yes, 15 MW!  A professor in college used to kid about throwing the
power
switch on his CPUMOB machine and watching the lights dim in College
Park.
The Sunway would do it for sure!

A new machine is due out any time now running at 130 PFLOPS.  Called
the AI
Bridging Cloud Infrastructure, "The system will consist of 1,088
Primergy
CX2570 M4 servers, mounted in Fujitsu's Primergy CX400 M4 multi-node
servers,
with each server featuring components such as two Intel Xeon Gold
processor
CPUs, four NVIDIA Tesla V100 GPU computing cards, and Intel SSD DC
P4600
storage."  Expected power consumption will be down to only 3 MW.

I was assuming comparing FPGAs with something /vaguely/
similar to the GA144, or that Sony cell processor, or
that intel experiment with 80 cores on a chip.

Comparing something that, as I intended, fitted on a large
PCB and plus into the mains with something that fits on a
tennis court and plugs into a substation isn't really very
enlightening.

There's a fundamental problem with parallel computing
without shared memory[1]: choosing the granularity of
parallelism. Too small and the comms costs dominate,
too large and you have to copy too much context and
wait a long time for other computations to complete.

That has been an unsolved problem since the 1960s,
except for applications that are known as "embarrassingly
parallel". Canonical examples of embarrassingly parallel
computations are Monte Carlo simulations, telecom/financial
systems, some place/route algorithms, and some simulated
annealing algorithms.
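
(To make "embarrassingly parallel" concrete, here's a minimal C sketch
of my own - not anything from those machines: each worker estimates pi
from its own independent trials, and the only communication needed is
one partial count per worker at the very end. Run the workers on
separate cores and it scales almost perfectly.)

/* Embarrassingly parallel Monte Carlo: each worker is independent.
   rand_r() is POSIX; any per-worker RNG would do. */
#include <stdio.h>
#include <stdlib.h>

static long worker(unsigned seed, long trials)
{
    long hits = 0;
    for (long i = 0; i < trials; i++) {
        double x = (double)rand_r(&seed) / RAND_MAX;
        double y = (double)rand_r(&seed) / RAND_MAX;
        if (x * x + y * y <= 1.0)
            hits++;
    }
    return hits;                 /* the worker's entire contribution */
}

int main(void)
{
    const int  workers = 8;      /* imagine one per core */
    const long trials  = 1000000;
    long total = 0;

    for (int w = 0; w < workers; w++)   /* run in parallel in real life */
        total += worker(1234u + (unsigned)w, trials);

    printf("pi ~= %f\n", 4.0 * total / ((double)workers * trials));
    return 0;
}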

Cache coherence traffic grows faster than the number of cores, and
eventually dominates unless you go to a NUMA architecture.

Oh, indeed, that is inherently non-scalable, with
/any/ memory architecture. Sure people have pushed
the limit back a bit, but not cleanly.

The fundamental problem is that low-level softies would
like to think that they are still programming PDP11s in
K&R C. They aren't, and there are an appalling number
of poorly understood band-aids in modern C that attempt
to preserve that illusion. (And may preserve it if you
get all the incantations for your compiler+version
exactly right)

The sooner the flat-memory-space single-instruction-
at-a-time mirage is superseded, the better. Currently
the best option, according to the HPC mob that
traditionally push all computational boundaries, is
message-passing. But that is poorly supported by C.
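
(For a concrete idea of what the HPC crowd means by message-passing,
a minimal sketch using MPI from C - MPI being their de-facto standard,
and notably a bolted-on library rather than anything the C language
itself knows about:)

/* Minimal MPI sketch: rank 0 sends one number to rank 1.
   Everything is an explicit copy and an explicit message -
   no shared memory anywhere. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double x = 42.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 got %f\n", x);
    }

    MPI_Finalize();
    return 0;
}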

For an amusing speed-read, have a look at
"C Is Not a Low-level Language. Your computer
is not a fast PDP-11."
https://queue.acm.org/detail.cfm?id=3212479

It has examples of where people in the trenches
don't understand what's going on. For example:
 "A 2015 survey of C programmers, compiler writers,
 and standards committee members raised several
 issues about the comprehensibility of C. For
 example, C permits an implementation to insert
 padding into structures (but not into arrays)
 to ensure that all fields have a useful alignment
 for the target. If you zero a structure and then
 set some of the fields, will the padding bits
 all be zero? According to the results of the
 survey, 36 percent were sure that they would
 be, and 29 percent didn't know. Depending on
 the compiler (and optimization level), it may
 or may not be."
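
(A concrete C illustration of that padding question - my example, not
the survey's:)

/* Whether the padding bytes after 'c' stay zero after the field
   assignments is up to the implementation: memset() zeroes every
   byte, but later stores to the fields may legally leave the
   padding indeterminate again. */
#include <string.h>
#include <stdio.h>

struct s {
    char c;      /* likely followed by 3 or 7 padding bytes */
    long n;
};

int main(void)
{
    struct s a;
    memset(&a, 0, sizeof a);   /* every byte, padding included, is 0 */
    a.c = 1;
    a.n = 2;                   /* padding *may* still be 0 - or not */

    unsigned char raw[sizeof a];
    memcpy(raw, &a, sizeof a);
    for (size_t i = 0; i < sizeof a; i++)
        printf("%02x ", raw[i]);
    printf("\n");
    return 0;
}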

Fun. He's a CS academic, so he probably doesn't write programs and so
doesn't realize how much stuff can't reasonably be done in his favourite
toy language, but it's a good read anyhow.

Cheers

Phil Hobbs


--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com
 
On Saturday, May 12, 2018 at 2:57:21 PM UTC-4, Phil Hobbs wrote:
On 05/12/18 14:38, Tom Gardner wrote:
On 12/05/18 18:51, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 12:16:56 PM UTC-4, Tom Gardner wrote:
On 12/05/18 16:53, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 11:23:22 AM UTC-4, Tom Gardner wrote:
On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

On Sunday, August 27, 2017 at 5:43:56 PM UTC-4, bitrex wrote:
On 08/27/2017 03:40 PM, Tim Williams wrote:

Not so bad when you've got so many to spare: example, dedicate
a few rows of cores as data routing (which can be any
combination you like: duplexed, synchronous, packetized,
variable length, addressed...), and use the remaining perimeter
as your parallel processor.  You might still get dozens of
cores in parallel that way, even if the utilization is
effectively maybe 20%.

Each core can "know" what the load is for the cores on its
perimeter; if some core needs to send a message to the perimeter
a best-first/greedy-search algorithm like beam search might be
able to be used to find the path where passing would cause the
least disruption.

https://en.wikipedia.org/wiki/Beam_search

I know this is an old thread, but I came across it when looking
for something else and had a few brain cells wakened.

Not directly related to these issues, but some recollections of
articles in electronics journals from the early 1970s.

In one article it was speculated that one day in the far future
the processor clock speed would be over 1 GHz and calculating sin/cos
etc. would be hyper fast. Apparently the editor assumed that those
functions were calculated using Taylor series. The article suggested
using the trigonometric equation

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

and doing sin(a) and cos(a) with table lookup.
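
(A rough C sketch of that table-plus-identity idea - my illustration,
not the article's code. A coarse table supplies sin(a) and cos(a); the
leftover b is so small that a two-term series for sin(b) and cos(b)
is plenty.)

/* sin(x) via a coarse lookup table plus the angle-sum identity. */
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define TBL 256
static double sin_tbl[TBL], cos_tbl[TBL];

static void init_tbl(void)
{
    for (int i = 0; i < TBL; i++) {
        sin_tbl[i] = sin(2.0 * M_PI * i / TBL);
        cos_tbl[i] = cos(2.0 * M_PI * i / TBL);
    }
}

static double fast_sin(double x)
{
    double t = x / (2.0 * M_PI);               /* angle in turns      */
    t -= floor(t);                             /* wrap into [0,1)     */
    int i = (int)(t * TBL);                    /* table entry a       */
    double b = (t - (double)i / TBL) * 2.0 * M_PI;  /* small remainder */
    double sb = b - b * b * b / 6.0;           /* sin(b), short series */
    double cb = 1.0 - b * b / 2.0;             /* cos(b), short series */
    /* sin(a + b) = sin(a)cos(b) + cos(a)sin(b) */
    return sin_tbl[i] * cb + cos_tbl[i] * sb;
}

int main(void)
{
    init_tbl();
    printf("%f %f\n", fast_sin(1.0), sin(1.0));  /* should agree closely */
    return 0;
}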

The CORDIC algorithm was invented in 1956.

Two decades later implementing floating point plus CORDIC on a 6800
was my first paid job as a vacation student. I still have the code :)

CORDIC was also easily implementable in hardware.
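
(For completeness, a minimal sketch of the CORDIC rotation loop, in
double-precision C for readability - my sketch, not anyone's shipping
code. The real attraction is that the same loop runs in fixed point
with only shifts and adds, the gain K being a precomputed constant
rather than accumulated as it is here.)

/* CORDIC rotation mode: compute sin and cos of z (radians, |z| < ~1.7). */
#include <math.h>
#include <stdio.h>

static void cordic_sincos(double z, double *s, double *c)
{
    double x = 1.0, y = 0.0;
    double k = 1.0;               /* accumulated gain, removed at the end */
    double pow2 = 1.0;            /* 2^-i                                 */
    double angle = atan(1.0);     /* atan(2^0)                            */

    for (int i = 0; i < 32; i++) {
        double d  = (z >= 0.0) ? 1.0 : -1.0;   /* rotate toward z = 0 */
        double nx = x - d * y * pow2;
        double ny = y + d * x * pow2;
        x = nx;
        y = ny;
        z -= d * angle;
        k *= sqrt(1.0 + pow2 * pow2);
        pow2 *= 0.5;
        angle = atan(pow2);
    }
    *c = x / k;
    *s = y / k;
}

int main(void)
{
    double s, c;
    cordic_sincos(1.0, &s, &c);
    printf("%f %f (vs %f %f)\n", s, c, sin(1.0), cos(1.0));
    return 0;
}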


With sufficiently small steps, cos(b) ~= 1 and sin(b) ~= b (radians).
In reality, the trigonometric functions in those days were calculated
using 3rd- or 4th-order polynomials for single precision, so not much
advantage.

On the Sinclair Scientific calculator, maybe.


[snip]

Cache coherence traffic grows faster than the number of cores, and
eventually dominates unless you go to a NUMA architecture.

Not if you don't use cache.

Rick C.
 
On Saturday, May 12, 2018 at 2:38:24 PM UTC-4, Tom Gardner wrote:
On 12/05/18 18:51, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 12:16:56 PM UTC-4, Tom Gardner wrote:
On 12/05/18 16:53, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 11:23:22 AM UTC-4, Tom Gardner wrote:
On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

[snip]

There's a fundamental problem with parallel computing
without shared memory[1]: choosing the granularity of
parallelism. Too small and the comms costs dominate,
too large and you have to copy too much context and
wait a long time for other computations to complete.

Your explanation is not of value.  The GA144 is a very odd duck, but
the objections most people have with it are pointless.  They complain
you can't utilize the full MIPS available in the processors because of
the limitations... so?  A chip isn't about one number.  It's about
getting a job done.  Your hand-waving above is not at all useful when
comparing FPGAs to multicore processors.  So what is different between
the two in the way you analyze above?


That has been an unsolved problem since the 1960s,
except for applications that are known as "embarrassingly
parallel". Canonical examples of embarrassingly parallel
computations are monte carlo simulations, telecom/financial
systems, some place/route algorithms, some simulated
annealing algorithms.

None of this is useful in comparing FPGAs to multiprocessors.

Rick C.
 
On 05/12/18 17:08, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 2:57:21 PM UTC-4, Phil Hobbs wrote:
On 05/12/18 14:38, Tom Gardner wrote:
On 12/05/18 18:51, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 12:16:56 PM UTC-4, Tom Gardner wrote:
On 12/05/18 16:53, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 11:23:22 AM UTC-4, Tom Gardner wrote:
On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

[snip]

Cache coherence traffic grows faster than the number of cores, and
eventually dominates unless you go to a NUMA architecture.

Not if you don't use cache.

Rick C.

It's got nothing to do with you, it's the processor architecture. If
you want global cache coherence in a highly multicore processor, it has
to be architected to support that. If you do your own custom CPU in
FPGA, you can do whatever you want, but general purpose means general
purpose.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com
 
On 08/24/17 22:49, bitrex wrote:
IMHO the best bang-for-buck uP available right now is the ATtiny85.
Luxurious 8K of program memory, 512 bytes of RAM, 512 bytes of EEPROM,
4 ADC channels, 2 hardware PWM outputs, 20 MHz clock @ 5 volts.

Under a buck in quantity at Mouser:

http://www.mouser.com/ProductDetail/Microchip-Technology-Atmel/ATtiny85-20SU/?qs=sGAEpiMZZMtkfMPOFRTOlzm3F7l5sNgt

Of course the NXP LPC802M001 is a 32-bit chip with a 12-channel 12-bit
ADC, 16k of flash, 2k of RAM, 3 PWM outputs, all for 58 cents in
thousands.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com
 
On Saturday, May 12, 2018 at 11:39:39 PM UTC+2, Phil Hobbs wrote:
On 05/12/18 16:15, Tom Gardner wrote:
On 12/05/18 19:57, Phil Hobbs wrote:
On 05/12/18 14:38, Tom Gardner wrote:
On 12/05/18 18:51, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 12:16:56 PM UTC-4, Tom Gardner wrote:
On 12/05/18 16:53, gnuarm.deletethisbit@gmail.com wrote:
On Saturday, May 12, 2018 at 11:23:22 AM UTC-4, Tom Gardner wrote:
On 12/05/18 15:43, upsidedown@downunder.com wrote:
On Sat, 12 May 2018 06:42:19 -0700 (PDT),
gnuarm.deletethisbit@gmail.com wrote:

[snip]

Fun. He's a CS academic, so he probably doesn't write programs and so
doesn't realize how much stuff can't reasonably be done in his favourite
toy language, but it's a good read anyhow.

:)
 
