Voice recognition: directly on the mobile phone, or on a cloud/server?

Skybuck Flying wrote:
Question for you to look into:

1. Would it be more power-efficient to capture the voice/speech as a waveform, transmit it to a server, have the server's voice/AI processing recognize the spoken words/commands, and then transmit the result back to the mobile phone?

or

2. Would it be more power-efficient to perform the voice/AI processing/recognition directly on the mobile phone's CPU/processing units?

Bye,
Skybuck.
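
The tradeoff between the two options can be put into rough numbers. The sketch below is a back-of-envelope comparison in Python; every figure (audio bitrate, uplink rate, radio power, SoC power, real-time factor) is an illustrative assumption, not a measurement, and real values vary enormously by phone, radio conditions and recognition model.

```python
# Back-of-envelope energy comparison: ship audio to a server vs. run the
# recogniser on-device. All numbers below are assumptions for illustration.

def transmit_energy_j(audio_seconds, bitrate_bps=16_000,
                      uplink_bps=1_000_000, radio_power_w=0.8):
    """Energy to upload compressed audio: radio power times time on air."""
    time_on_air = audio_seconds * bitrate_bps / uplink_bps  # seconds
    return radio_power_w * time_on_air                      # joules

def local_inference_energy_j(audio_seconds, soc_power_w=2.0,
                             realtime_factor=0.3):
    """Energy for an assumed on-device recogniser running at 0.3x real time."""
    return soc_power_w * audio_seconds * realtime_factor    # joules

clip = 3.0  # a three-second voice command
print(f"server path : {transmit_energy_j(clip):.3f} J (radio only)")
print(f"on-device   : {local_inference_energy_j(clip):.3f} J")
```

Under these assumptions the radio upload costs far less energy than local inference, but the conclusion flips easily if the radio link is poor (higher transmit power, retries) or the on-device model is small.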
 
On Wednesday, August 31, 2022 at 10:42:18 PM UTC-7, Skybuck Flying wrote:

Dunno about wisdom, but... ask Siri. Or Alexa. They'd know...
 
On 8/31/22 22:42, Skybuck Flying wrote:
Question for you to look into:

1. Would it be more power-efficient to capture the voice/speech as a waveform, transmit it to a server, have the server's voice/AI processing recognize the spoken words/commands, and then transmit the result back to the mobile phone?

There are phones for the hearing-impaired that do that.

or

2. Would it be more power-efficient to perform the voice/AI processing/recognition directly on the mobile phone's CPU/processing units?


3. There are gadgets for the mentally-impaired that will process your
voice so you can sound like an alien.

https://www.amazon.com/Amscan-Halloween-Voice-Changer-Supplies/dp/B000W4RHAK/ref=sr_1_10?keywords=Voice+Changer+Device&qid=1662053326&sr=8-10
 
On Thursday, September 1, 2022 at 1:42:18 AM UTC-4, Skybuck Flying wrote:
Question for you to look into:

1. Would it be more power-efficient to capture the voice/speech as a waveform, transmit it to a server, have the server's voice/AI processing recognize the spoken words/commands, and then transmit the result back to the mobile phone?

or

2. Would it be more power-efficient to perform the voice/AI processing/recognition directly on the mobile phone's CPU/processing units?

I'm pretty sure they do the former, rather than the latter. If my phone, computer or car has no network connectivity, voice commands will not work. I've told my Android to turn off airplane mode and it tells me, "I can't do that, airplane mode is enabled." So it can at least recognize that much on the phone without the network. But it gives the same message for any other command, so it's only recognizing that I've tried to speak to it with "Hey, Google".

--

Rick C.

- Get 1,000 miles of free Supercharging
- Tesla referral code - https://ts.la/richard11209
 
On Thursday, 1 September 2022 at 19:50:53 UTC+1, Ricky wrote:
There was an automotive hands-free telephone system around 20 years ago that used voice recognition for dialing telephone numbers. The voice recognition was done in a central server, for what at the time seemed to be good reasons. Unfortunately it was a disaster, because the handover gaps between cell sites, typically 10 to 100 ms long, meant that occasional digits were lost. In some city centres there could be around ten handovers per minute while driving past successive microcells. I suspect the system architects in the USA were familiar with CDMA phones, which have soft handover between cells, and did not do enough research before committing to a GSM system.

Nowadays this problem could of course be avoided by sending the voice signal as data with error
correction. That aside, I strongly suspect that the phone would be more power efficient
than a central server but probably not as accurate. I think many modern systems use
a hybrid approach where a limited vocabulary is handled locally and more difficult recognition
tasks are offloaded to a central server.
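
The hybrid approach described above can be sketched as follows. The vocabulary, function names and server stub are hypothetical stand-ins for illustration, not a real speech engine: a cheap on-device pass handles a fixed command set, and anything it cannot match is shipped off.

```python
# Sketch of a hybrid recogniser: small fixed vocabulary handled locally,
# everything else offloaded to a (stubbed) server round trip.
from typing import Optional

LOCAL_VOCAB = {"call", "answer", "hang up", "volume up", "volume down"}

def recognise_locally(utterance: str) -> Optional[str]:
    """Cheap on-device pass: exact match against a small command vocabulary."""
    return utterance if utterance in LOCAL_VOCAB else None

def recognise_on_server(utterance: str) -> str:
    """Stub for the expensive large-vocabulary server round trip."""
    return f"<server result for {utterance!r}>"

def recognise(utterance: str) -> str:
    """Try locally first; fall back to the server only when needed."""
    return recognise_locally(utterance) or recognise_on_server(utterance)

print(recognise("volume up"))                     # handled on the phone
print(recognise("navigate to the nearest cafe"))  # offloaded to the server
```

The design point is that the common, latency-sensitive commands never touch the radio at all, which is where the power saving comes from.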

John

 
On a sunny day (Thu, 1 Sep 2022 10:34:45 -0700) it happened corvid
<bl@ckb.ird> wrote in <teqqfm$15ra$1@gioia.aioe.org>:


The problem with that is that the server owner will politically modify the text.

Original:
"I really do not like this president"

Returned text:
"This president is the greatest of them all"

So who or what (AI that wants to replace humans, "Skynet") controls "the server"?
Already we see twatter and facesbook and guggle bring their own agenda into play.

https://www.rt.com/news/562012-google-employees-condemn-nimbus-israel/
Don't know how true it is, but Big Brothel will have its say in such a system for sure!

OTOH Google produces a reasonable voice; I have some scripts that use it for text-to-speech.
Not sure if they'd modify my source text; I have not tried anything political yet.
When we (humming beans) all get chipped at birth (or later) by decree
we will be a slave to Skynet AI...
Maybe we alrea.. beep beep beep server error link lost beep!
The works!!

Interesting on Smithsonian channel how a high altitude US nuke test destroyed the Telstar satellite.
I remember working with that sat, very expensive per minute.

So there is still hope! Back to the smoke signals!

Days of future passed

I should not get up so early.. look now.. what I have done...
 
On Thursday, September 1, 2022 at 6:43:53 PM UTC-4, John Walliker wrote:

I'm not sure what you mean by "sending the voice signal as data". Both CDMA and GSM send everything as data. The last analog cell phone system was AMPS, which sent control as data with analog voice.

 
On Saturday, 3 September 2022 at 23:41:31 UTC+1, Ricky wrote:
I'm not sure what you mean by "sending the voice signal as data". Both CDMA and GSM send everything as data. The last analog cell phone system was AMPS, which sent control as data with analog voice.

I wasn't suggesting that any of the phones are analogue. With GSM, the voice channel does some forward error correction, as the more significant bits are error-protected. The details are all published in the GSM standards, which are free to download. However, if there are short communication breaks, there is some attempt at concealment by repeating the last received frame. Longer gaps in reception just result in a gap of silence. This happens on nearly every cell handover, often for tens of milliseconds, sometimes for hundreds. Handing over from one sector of a base station to a different sector is very fast. Handing over to a different base station is slower, and moving from one location area code to another is relatively slow.
The alternative is to accumulate a block of digitised voice data and send it using
a protocol that includes full error correction and retransmission such as TCP/IP.
That way, if there are gaps in transmission all the information will still get through
eventually.
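
The contrast can be shown with a toy model (frame labels and loss pattern are made up): a live voice channel conceals a single lost frame by repeating the last good one and turns longer gaps into silence, while a reliable transport delivers every frame intact, just later.

```python
# Toy model of the two transport behaviours: GSM-style concealment on a
# live voice channel vs. a fully reliable (retransmitting) data channel.

def live_channel(frames, lost):
    """Conceal one lost frame by repeating the last good one, then go silent."""
    out, last, run = [], None, 0
    for i, f in enumerate(frames):
        if i in lost:
            run += 1
            out.append(last if run == 1 and last is not None else "silence")
        else:
            out.append(f)
            last, run = f, 0
    return out

def reliable_channel(frames, lost):
    """TCP-style delivery: losses are retransmitted, so content is intact."""
    return list(frames)  # loss pattern only affects latency, not content

frames = ["f0", "f1", "f2", "f3", "f4"]
print(live_channel(frames, lost={2, 3}))      # ['f0', 'f1', 'f1', 'silence', 'f4']
print(reliable_channel(frames, lost={2, 3}))  # all five frames, delivered late
```

For a human listener the first behaviour is usually fine; for a recogniser parsing digits, a concealed or silent frame is exactly how a dialled digit goes missing.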

Another way of putting it is that phones can send data in more than one mode.
One of these is for live speech, where the objective is good intelligibility to a human listener. Bandwidth is traded against the number of simultaneous calls, so the voice quality varies; the adaptive multi-rate (AMR) codec is an example. Humans can cope with short gaps: often they don't matter, and if they do, the listener will ask the speaker to repeat what they said.
Another mode is bit-exact data transmission, where the objective is to have no errors at the receiving end. The tradeoff here is that the latency is variable, because there may be a need for retransmissions. That makes a "data" channel less suitable for live voice but probably much better for a speech recogniser to process.

John
 
On Sunday, September 4, 2022 at 3:40:25 AM UTC-4, John Walliker wrote:

What makes you think they aren't sending this as data already? The sort of cell voice connection you seem to be talking about is a phone call. Google isn't placing phone calls to their servers. The phone app handles all the details of turning the voice into data, compressing it as they see fit, and getting that data over the network without dropping anything in handovers. But then, how would you know? The voice recognition I've seen is not reliable enough to tell when there was data loss.

\"Say 1 to save your data, say 2 to delete all data\".

\"One\"

\"Ok, data deleted\".

I\'ve had crap like that happen.

 
On Sunday, 4 September 2022 at 19:30:33 UTC+1, Ricky wrote:

The OP didn't specify very much in his question, so any answers are going to be fairly generic. The automotive system I mentioned earlier that got it badly wrong did exist, and it did use phone calls even though "data" transmission would have been an option even then. I was brought in as a contractor to help debug it. I am sure that a well-designed modern system will avoid repeating the mistakes of the past, but there are plenty of instances where developers manage to reinvent previously solved problems. Also, the tradeoffs between latency in a real-time application and reliability over a poor-quality network are tricky.

John
 
