Halfbakery: Phonetic Compression

Computer: Compression
Phonetic Compression (+3, -1)
Convert Audio (voice) data to Phonetic Representation

In all cellphones, the sound data is compressed to reduce the number of bytes that need to be sent.

Modern voice data compression takes advantage of the fact that people only need to hear sounds in a particular frequency range, and only with a finite amount of detail, for speech to be understandable, but the compressed data still takes a fairly large number of bytes. If the number of bytes needed could be reduced further (and perhaps some error detection/correction redundancy added), you could carry on a conversation even when you're someplace with very low signal quality.

The most extreme form of compression would be voice to text, followed by text compression, but voice to text is far far far from perfect, and furthermore it would discard all sense of intonation.

Less extreme than voice to text would be conversion of voice to four streams of data: 1) a phonetic representation, which would result in each syllable becoming one byte in the stream. 2) an indication of the loudness of each syllable (each one being compared to the loudness of the previous one), probably measured in decibels. 3) relative pitch (how much higher or lower the frequency of this syllable is than the prior one), probably measured in twelfths of an octave. 4) relative duration (how long this syllable was pronounced, relative to the prior syllable *of this type*).

Then apply conventional adaptive compression to each stream.

Before applying the conventional compression, each spoken syllable would become approximately 4 bytes, and far fewer after.
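
The four streams described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the syllable codes are a hypothetical 0-255 index (the idea does not define a table), the starting references BASE_DB, BASE_HZ and BASE_MS are invented so the first syllable has something to be relative to, duration is stored as a percentage, and zlib stands in for "conventional adaptive compression".

```python
import math
import zlib

# Assumed starting references for the first syllable (not part of the original idea).
BASE_DB, BASE_HZ, BASE_MS = 60.0, 120.0, 200.0

def to_byte(x, lo, hi):
    """Round x, clamp it to [lo, hi], and wrap it into one unsigned byte."""
    return max(lo, min(hi, round(x))) & 0xFF

def encode(syllables):
    """Split (code, loudness_db, pitch_hz, duration_ms) tuples into four byte streams."""
    codes, loud, pitch, dur = bytearray(), bytearray(), bytearray(), bytearray()
    prev_db, prev_hz = BASE_DB, BASE_HZ
    last_ms = {}  # duration of the most recent syllable of each type
    for code, db, hz, ms in syllables:
        # Stream 1: hypothetical one-byte syllable code.
        codes.append(code)
        # Stream 2: loudness change in dB relative to the previous syllable.
        loud.append(to_byte(db - prev_db, -128, 127))
        # Stream 3: pitch change in twelfths of an octave.
        pitch.append(to_byte(12 * math.log2(hz / prev_hz), -128, 127))
        # Stream 4: duration as a percentage of the last syllable of the same type.
        dur.append(to_byte(100 * ms / last_ms.get(code, BASE_MS), 0, 255))
        prev_db, prev_hz = db, hz
        last_ms[code] = ms
    return [bytes(codes), bytes(loud), bytes(pitch), bytes(dur)]

def compress(streams):
    """Compress each stream separately, as the idea suggests."""
    return [zlib.compress(s, 9) for s in streams]

sample = [(7, 60.0, 120.0, 200.0), (9, 66.0, 240.0, 100.0), (7, 63.0, 120.0, 300.0)]
streams = encode(sample)
print([len(s) for s in streams])  # one byte per syllable in each stream
```

A real codec would take each difference against the value the decoder reconstructs rather than the true previous value, otherwise rounding errors accumulate along the stream.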

Unlike voice to text, no dictionary is needed to convert syllables to English (or other language) text, so fewer mistakes will be made. Also, intonation is retained, not discarded. And if you "compress" music with this, you'll get something which sounds vaguely music-like, which you wouldn't get with something that produces text :).

Also, as an added feature, it's easy to convert this to actual text at a later time, since all relevant information is retained.

Playing back sound that has been compressed in this manner should also produce much more realistic speech than what you normally get from computer speech synthesis. It might even sound a bit like the person who spoke the words originally :).
-- goldbb, Mar 04 2009

Vocoder http://en.wikipedia.org/wiki/Vocoder
Grind sound and pour hot phonemes [csea, Mar 04 2009]

CELP http://en.wikipedia.org/wiki/CELP
Code Excited Linear Prediction [csea, Mar 04 2009]

Vocal Vowels http://www.explorat...s/vocal_vowels.html
Vocal tract modelling [csea, Mar 04 2009]

History of Speech Synthesis http://www.acoustic...etty_mst/chap2.html
Some history [csea, Mar 04 2009]

it does supply compression http://www.freepate...ne.com/7124082.html
that still does not solve the problem of channel encoding... [4whom, Mar 04 2009]

once again fewer characters, still conveying meaning greater than pure text. http://esl.about.co...lphontranscript.htm
compression above still less than legacy LPC and its children. [4whom, Mar 04 2009]

LPC (linear predictive coding) http://www.data-com...ion.com/speech.html
the father of our current telephony speech converters. [4whom, Mar 04 2009]

Existing voice compression algos like Speex do pretty damn good already. I think this would sound too robotic.
-- Spacecoyote, Mar 04 2009

Yikes, way over my head, (+) for effort.
psst in the fourth line you've got 'heat sounds' instead of hear sounds, and you have 'awhich' just after 1)
:]

<this message will self destruct>
-- 2 fries shy of a happy meal, Mar 04 2009

wouldn't work for music (except an unaccompanied melody line). I don't think you'd need as much as 4 whole bytes per sound-bite to get a meaning across.[+]
-- FlyingToaster, Mar 04 2009

Syllables and phonemes are not the same thing. For example, the word "thing" has one syllable, but three phonemes.

People hear pitch differences much smaller than the 1/12th of an octave.

Phoneme sets are language-dependent.

I like the idea of storing relative differences to the last use of the same word.
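
The point about pitch resolution can be made concrete with a small sketch. It measures how far a pitch change lands from the nearest step when an octave is divided into a given number of steps, with the error expressed in cents (1200 per octave). The function name and the 96-step alternative are assumptions chosen purely for illustration.

```python
import math

def quantize_pitch(ratio, steps_per_octave):
    """Quantize a frequency ratio to the nearest step; return (steps, error_in_cents)."""
    exact = steps_per_octave * math.log2(ratio)
    steps = round(exact)
    error_cents = 1200 * (exact - steps) / steps_per_octave
    return steps, error_cents

# A 3% rise in pitch is about half of a twelfth of an octave.
for steps_per_octave in (12, 96):
    steps, err = quantize_pitch(1.03, steps_per_octave)
    print(steps_per_octave, steps, round(err, 1))
```

The worst-case error is half a step, so twelfths of an octave can be off by up to 50 cents, while a finer grid costs only a few more bits per syllable.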
-- jutta, Mar 04 2009

I wonder if you'd be able to assemble something akin to a Midi library of voice sounds. In fact, you might be able to literally do that. In which case, you've already got the whole tried-and-true midi compression standards to use. In addition to which, you can have music in the stream as well...
-- Custardguts, Mar 04 2009

//the midi compression standards//
the what ?
-- FlyingToaster, Mar 04 2009

Apart from [jutta]'s concerns, I still don't see where any compression, that is better than legacy compression, lies. You propose four channels, 4 bytes (your words, I think some of those channels might not need a full byte). That is 32 bit audio. That's 2^32 different bit streams (at whatever sample rate). A bit of overkill for the spoken language, shirley?

//If the number of bytes needed could be reduced further (and perhaps some error detection/correction redundancy added), you could carry on a conversation even when you're someplace with very low signal quality.//

Let's examine why signal strength would detract from audio quality. Weak channels don't necessarily carry less data, they just have more noise. Claude Shannon illuminated the path around this one. Although noisy channels *sometimes* have less carrying capacity, it is not as precipitous a drop as you would think, and not for the reasons you think either. Also, weak signals are more susceptible to interference. This is the kicker. No matter your compression, nor how little data you want to send, when sending over a weak signal interference can fuck it up.
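
For reference, the Shannon-Hartley theorem gives the capacity of a noisy channel as C = B * log2(1 + S/N). A minimal sketch, where the 200 kHz bandwidth and the list of signal-to-noise ratios are example values only and not taken from any particular phone system:

```python
import math

def capacity_bps(bandwidth_hz, snr_db):
    """Shannon-Hartley capacity in bits per second for a given SNR in decibels."""
    snr = 10 ** (snr_db / 10)  # convert dB to a linear power ratio
    return bandwidth_hz * math.log2(1 + snr)

# Example values only: capacity falls with SNR, but roughly logarithmically.
for snr_db in (30, 20, 10, 0):
    print(snr_db, "dB ->", round(capacity_bps(200_000, snr_db)), "bit/s")
```

Cutting the signal-to-noise ratio by a factor of a thousand (30 dB down to 0 dB) reduces the theoretical capacity by only about a factor of ten.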

Getting around these issues is called channel encoding and differs completely from the speech encoding you want to do. Point being, it doesn't matter about the data, but rather about the signal, only very seldom is it the data and the signal (usually time modulated protocols).
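
As a toy illustration of channel encoding, here is a triple-repetition code with majority-vote decoding. It is a deliberately simple stand-in, not what real phone networks use, and the function names are made up for this sketch. Note that it knows nothing about what the bits mean, which is the point being made above.

```python
def encode_repeat(bits, n=3):
    """Send every bit n times."""
    return [b for b in bits for _ in range(n)]

def decode_repeat(coded, n=3):
    """Majority vote over each group of n received bits."""
    return [1 if 2 * sum(coded[i:i + n]) > n else 0
            for i in range(0, len(coded), n)]

message = [1, 0, 1, 1]
sent = encode_repeat(message)
received = list(sent)
received[1] ^= 1   # interference flips one bit in the first group
received[9] ^= 1   # and one bit in the last group
print(decode_repeat(received) == message)
```

The price is three times the bandwidth, and two flipped bits in the same group still defeat it.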

Speech encoding, on the other hand, is a form of modulation-demodulation, the basic tenets of which are intelligibility and QoS. Forgoing the latter, as you have done, has been tried before and never really took off.

I was really hoping this would be a phonetic library (like a midi library), converting audio into phonetic text (which already contains long and short syllables, phonemes and intonations), compressing this losslessly and transmitting across a channel, either coming out as voice (although not the original speaker's voice) or text, or both. I actually think this is what you want to propose, and perhaps I misunderstood your explanation. Unfortunately, even this amount of data will get skewed by interference across a weak channel.
-- 4whom, Mar 04 2009

You could have a program that models the original speaker's sound mechanisms (mouth parts, nasal passages, vocal cords), transmit those parameters to the receiver, then operate that: a virtual mouth.
-- FlyingToaster, Mar 04 2009

See [links] for some relevant reading.
-- csea, Mar 04 2009

@[4whom]: I think [goldbb] is talking about 'is' being coded thus: two bytes saying "there was an 'i' and an 's'", one byte saying "it was middle-loud", one byte saying "the 's' was 5 times louder than the 'i'", and one byte saying "the 's' was 56 times longer than the last diphthong". So there would be about 4 bytes per 'syllable' (or whatever smallest denominator). With a typical syllable being, say, 100 ms long, that is 40 B/s, and at that rate you do not need parallel bitstreams.
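
The arithmetic can be checked in a few lines. The 4 bytes and 100 ms are the figures from this discussion; the 13 kbit/s constant is the commonly quoted rate of the GSM full-rate speech codec, added here only as an outside point of comparison.

```python
def bytes_per_second(bytes_per_syllable, syllable_ms):
    """Data rate before any further compression of the streams."""
    return bytes_per_syllable * 1000 / syllable_ms

GSM_FULL_RATE_BPS = 13_000  # comparison figure, not from this thread

rate_bytes = bytes_per_second(4, 100)
rate_bits = 8 * rate_bytes
print(rate_bytes, "B/s =", rate_bits, "bit/s")
print("GSM full rate is about", round(GSM_FULL_RATE_BPS / rate_bits), "times that")
```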

So the idea is to have a speech-recognition program running on the phone that adds a little extra information to the pure text, such as the duration and loudness of individual speech particles.

I like the idea very much. For desktop computers via the internet the task should already be solvable in subjective real time; I do not know how fast speech recognition is on mobiles, though.
-- loonquawl, Mar 04 2009

RP English (and presumably General American, Austral or whatever) has around forty-four phonemes, many of which have allophones. /t/ is affricate, fortis and aspirated at the beginning of a syllable but plosive, lenis and unaspirated as the second element in a consonant cluster. We have fourteen vowels. Assuming all syllables are CV, which they aren't because there are closed syllables and consonant clusters as well as diphthongs in English, that makes four hundred and twenty possible syllables. Other languages and accents have different phonologies. You can't have one byte per syllable for that reason. It would also be necessary to lump allophones together, which means they'd have to be recognised.
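
The counting argument above, using the figures given (forty-four phonemes, fourteen of them vowels), works out as follows. The CVC line is an extra assumption added to show how quickly closed syllables inflate the count.

```python
def bits_needed(n_symbols):
    """Smallest number of bits that can give each of n_symbols a distinct code."""
    return (n_symbols - 1).bit_length()

phonemes, vowels = 44, 14
consonants = phonemes - vowels          # 30

cv = consonants * vowels                # open syllables only
cvc = consonants * vowels * consonants  # add a closing consonant

print(cv, bits_needed(cv))    # more than the 256 values one byte can hold
print(cvc, bits_needed(cvc))
```
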
Vowels are chords. They can't be transposed easily into a different pitch because they aren't pure tones. Many consonants have strong elements of white noise in them and are unaffected by pitch. However, in ordinary speech intonation even in a language without lexical tones such as ours, the pitch of each syllable isn't steady and doesn't even change in a linear way most of the time and it does convey linguistic and other information, for instance apology, enquiry and the end of a statement or imperative. With the likes of Swedish, Chinese or Yoruba, you have even bigger problems.

This is going to need a completely different approach, but i don't know what. I imagine you could do something like ignore sounds which are out of the range of that particular human's speech and record changes from moment to moment, but i think mobiles compress speech specifically because background noise sounds like speech. I would also expect you could take quieter sounds out if they occur immediately before louder ones and represent sound below a certain volume as silence compressed by, um, is it called "run-length encoding"?
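
For what it is worth, run-length encoding of near-silence could look something like this minimal sketch. The threshold and the ("silence", n) token format are assumptions made for the example, not part of any real codec.

```python
def rle_silence(samples, threshold):
    """Replace each run of quiet samples with a ('silence', run_length) token."""
    out = []
    run = 0
    for s in samples:
        if abs(s) < threshold:
            run += 1          # extend the current quiet run
        else:
            if run:
                out.append(("silence", run))
                run = 0
            out.append(s)     # loud samples pass through unchanged
    if run:
        out.append(("silence", run))
    return out

print(rle_silence([0, 1, -1, 50, 0, 0, -60, 2], threshold=5))
```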
-- nineteenthly, Mar 04 2009

//is it called "run-length encoding"?// No, it is called linear predictive coding. And its most current commercial iteration is called CELP (code excited LP), see links. Run-length coding, or its derivatives, runs on the channel encoding side. That is, it gets the same data from point A to point B without loss. What data, and how it is already compressed, is inconsequential.

There are two, no three, data reduction systems here. The first takes an analog signal and converts it to digital. The next would take the digital and refine it further for digital transmission. At this point we are trying to reduce the digital representation of the analog system to the bare minimum. It is at this point where we make all the important decisions around intelligibility and QoS. The next would be "lossless" compression for transmission. We can't fuck with the first, we can't fuck with the last. So we fuck with the middle. Just by counting bits (nothing fancy) there are systems that do this middle part better than the proposed mechanism.
-- 4whom, Mar 04 2009

Right, so now i start wondering what run-length coding actually is. I'll Google it, thanks.

Would there be any mileage in the amount of white noise which seems to be present in speech? It seems that various sounds include silence and/or something like white noise, though clearly they're more than that or they'd be the same phoneme. Any voiceless aspirated plosive would include a silence followed by something like white noise. Voiceless fricatives are all close to it but not quite the same. If there's a way of quantifying the difference between truly white noise and the sound of those phonemes, it seems that they could be represented by some code meaning "white noise" plus some description of the difference. Many phonemes must also consist largely of a period during which the sound is similar or changes in a predictable way, as with a diphthong. There's got to be some scope there. Also, the fact that phonemes can be confused due to similar sounds could be exploited if a way to quantify the difference between the two was found. /m/, /n/ and the velar nasal, for example, and voiced and voiceless versions.
Now to read your link.
-- nineteenthly, Mar 04 2009

You could look up Stephen Hawking's doohickey: that's a vocal synthesizer (though I think it simply plays recorded samples) but you can just as easily make a dictionary-style phonetics player and add pitch and duration information... probably about the same data density order of magnitude as 8-bit "texting" for approximately the same information density.
-- FlyingToaster, Mar 04 2009

[19thly] you've used enough multisyllabic words that you sound like you know what you're talking about :) ... but how is a vowel a "chord" ? (in musical terms a chord is a number of separate notes played simultaneously), a "pure tone" is a sinewave. For a given vowel the human voice will produce the same formants regardless of pitch, just that some of the partials will be attenuated.

Re: your "noise complaint" I'm pretty sure it's not "white", but running unvoiced consonants through a spectrum analyser would give you an indication of what category they fit into (if any).
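
One way to put a number on how "white" a sound is would be spectral flatness: the geometric mean of the power spectrum divided by its arithmetic mean, which is near 1 for white noise and near 0 for a pure tone. The sketch below uses a naive DFT and synthetic signals (Gaussian noise and a sine wave), not recordings of actual consonants, so it only shows the measure itself.

```python
import cmath
import math
import random

def power_spectrum(x):
    """Naive DFT power spectrum over the positive-frequency bins (DC excluded)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
            for k in range(1, n // 2)]

def spectral_flatness(x, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum (between 0 and 1)."""
    p = [v + eps for v in power_spectrum(x)]  # eps avoids log(0) on empty bins
    geometric = math.exp(sum(math.log(v) for v in p) / len(p))
    arithmetic = sum(p) / len(p)
    return geometric / arithmetic

random.seed(0)
n = 256
noise = [random.gauss(0.0, 1.0) for _ in range(n)]
tone = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]

print("noise:", spectral_flatness(noise))
print("tone: ", spectral_flatness(tone))
```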
-- FlyingToaster, Mar 19 2009
