h a l f b a k e r y
With moderate power, comes moderate responsibility.
First, create a MIDI sound bank containing all 50 or so phonemes used in the English language, plus enough additional phonemes to cover as many other (popular) languages as possible. Since different languages have many sounds in common, it should be possible to provide enough sounds for a lot of languages.
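As a rough illustration of the sound bank, the sketch below assigns one MIDI note number to each phoneme. The ARPAbet symbol set (39 phonemes, fewer than the 50 or so estimated above), the `build_bank` helper, and the starting note are all assumptions made for illustration, not part of the idea as posted.

```python
# Hypothetical phoneme bank: one MIDI note number per phoneme.
# ARPAbet symbols are assumed here; any phoneme inventory would do.
ARPABET = ("AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
           "OW OY P R S SH T TH UH UW V W Y Z ZH").split()

MAX_NOTES = 128  # MIDI note numbers run from 0 to 127


def build_bank(phonemes, first_note=0):
    """Map each phoneme to its own note number, starting at first_note."""
    if first_note + len(phonemes) > MAX_NOTES:
        raise ValueError("too many phonemes for a single 128-note program")
    return {p: first_note + i for i, p in enumerate(phonemes)}


ENGLISH_BANK = build_bank(ARPABET)
```

Since a single MIDI program offers only 128 note numbers, a bank covering many languages would presumably have to spread its phonemes over several programs or banks.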
Next, take a voice-to-text program and discard the portion that takes the analysed phonemes and looks them up in a phonetic dictionary. Enhance the part that converts acoustic data into a sequence of phonemes, and have it additionally identify volume, pitch, and duration.
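The extra measurements could be sketched as follows, assuming the recogniser hands over one frame of mono samples per phoneme. The function name, the 60 to 500 Hz search range, and the use of RMS and plain autocorrelation are illustrative choices, not something the idea prescribes; phoneme recognition itself is not shown.

```python
import math


def analyse_frame(samples, sample_rate, fmin=60.0, fmax=500.0):
    """Return (volume, pitch_hz, duration_s) for one frame of mono samples."""
    n = len(samples)
    duration = n / sample_rate
    # Volume as root-mean-square amplitude.
    volume = math.sqrt(sum(s * s for s in samples) / n)
    # Pitch as the lag with the strongest autocorrelation in the search range.
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), n - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    pitch = sample_rate / best_lag if best_lag else None
    return volume, pitch, duration


# Synthetic test frame: 0.1 s of a 200 Hz sine wave at 8 kHz.
RATE = 8000
frame = [0.5 * math.sin(2 * math.pi * 200 * i / RATE) for i in range(800)]
volume, pitch, duration = analyse_frame(frame, RATE)
```

Real speech is far messier than a sine wave, so a practical analyser would need windowing and some protection against octave errors.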
Then convert the phoneme, volume, pitch, and duration data into MIDI format data.
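One possible encoding of a single phoneme, as a minimal sketch: the phoneme becomes the note number, volume becomes note-on velocity, pitch becomes a 14-bit pitch-bend value, and duration becomes the delay before the note-off. The function name and this particular mapping are assumptions; the status bytes (0x90 note on, 0x80 note off, 0xE0 pitch bend) are standard MIDI.

```python
def phoneme_to_midi(note, volume, bend, duration_ticks, channel=0):
    """Return (delta_ticks, message_bytes) pairs for one phoneme.

    note: note number of the phoneme in the (hypothetical) sound bank
    volume: loudness from 0.0 to 1.0, mapped onto note-on velocity
    bend: 14-bit pitch-bend value, 8192 meaning "no bend"
    """
    if not 0 <= note <= 127:
        raise ValueError("note must be in 0..127")
    # Velocity 0 would be read as a note-off, so clamp to at least 1.
    velocity = max(1, min(127, round(volume * 127)))
    bend = max(0, min(16383, bend))
    return [
        (0, bytes([0xE0 | channel, bend & 0x7F, bend >> 7])),  # pitch bend
        (0, bytes([0x90 | channel, note, velocity])),          # note on
        (duration_ticks, bytes([0x80 | channel, note, 0])),    # note off
    ]


events = phoneme_to_midi(note=16, volume=1.0, bend=8192, duration_ticks=120)
```

Writing the events into a standard MIDI file would additionally need a header chunk and variable-length delta times, which are left out here.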
Provided the receiver has the appropriate sound bank, he can play the data using a standard MIDI player.
(Yes, I know that most phonemes don't consist of a single pitch, but rather several; still, it should be possible to identify the most prominent pitch in a piece of sound. Alternatively, one could split each phoneme which sounds like a chord into two phonemes, on two channels, and specify the volume, pitch, and duration for each channel.)
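Both options in the parenthetical can be sketched with one helper, assuming an earlier analysis step has already produced a list of spectral peaks as (frequency, amplitude) pairs. The helper name and the peak values are made up for illustration.

```python
def split_into_channels(peaks, max_channels=2):
    """Assign the strongest spectral peaks to MIDI channels 0, 1, ...

    peaks: list of (frequency_hz, amplitude) pairs for one phoneme.
    Returns (channel, frequency_hz, amplitude) tuples, strongest first.
    """
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:max_channels]
    return [(ch, freq, amp) for ch, (freq, amp) in enumerate(strongest)]


# Illustrative peaks for a vowel; the numbers are invented for this example.
vowel_peaks = [(2290.0, 0.6), (270.0, 0.9), (3010.0, 0.2)]

most_prominent = split_into_channels(vowel_peaks, max_channels=1)
two_channels = split_into_channels(vowel_peaks)
```

With `max_channels=1` this reduces to picking the most prominent pitch; with the default it yields the two-channel "chord" variant.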
Speech to text to Speech (halfbakery)
Speech_20To_20Text_...0Speech_20Converter Same idea without technical details [knowtion, Mar 18 2009]
Voice Translator (halfbakery)
Voice_20Translator Again, same kind of idea with no technical details [knowtion, Mar 18 2009]
Is this voice to text to voice?
Sounds more like a way of compressing speech, using MIDI as a backend. This is similar to how the Voder worked: an early, manually operated speech synthesizer closely related to the vocoder.
@[bigsleep] Those links are "audio" to MIDI. They will (probably) translate the pitch and tempo of music and produce something resembling the original song. **I think** that goldbb's idea is to be able to speak into a microphone and have the computer say the same thing in its own voice (not simply modulated, but re-created). But I'm not sure... vote pending.
Assuming this is what I think it is (voice to text to voice), it could become the platform for a real-time audio translator: voice - text - translated text - voice.
Bob's computer translator: Hola.
Pedro: Hola. ¿Cómo está usted?
Pedro's computer translator: Hello, how are you?
This technology is widely known to Trekkies. It is known as a Universal Translator.
The second part is quite doable; in fact, current applications which control single-patch "drumsets" and "fx sets" do just that. Each sample is assigned its own MIDI note, and you'd use the pitch-bend controller to control the pitch as an exponential offset of the sample. You could cobble one together in a few minutes (give or take).
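The "exponential offset" can be made concrete with a small sketch: the bend needed to move a sample from its recorded pitch to a target pitch depends on the frequency ratio on a log scale. The function name is hypothetical, and the default bend range of plus or minus 2 semitones is an assumption (it is a common synthesizer default, but the receiver may be set differently).

```python
import math

BEND_CENTER = 8192  # 14-bit pitch-bend value meaning "no bend"
BEND_MAX = 16383


def bend_for_frequency(target_hz, sample_hz, bend_range_semitones=2.0):
    """Pitch-bend value that shifts a sample recorded at sample_hz to target_hz."""
    # Distance in semitones is logarithmic in the frequency ratio.
    semitones = 12.0 * math.log2(target_hz / sample_hz)
    value = BEND_CENTER + round(semitones / bend_range_semitones * BEND_CENTER)
    # Clamp to the 14-bit range; targets beyond the bend range cannot be reached.
    return max(0, min(BEND_MAX, value))


no_bend = bend_for_frequency(100.0, 100.0)
one_semitone_up = bend_for_frequency(100.0 * 2 ** (1 / 12), 100.0)
```

Speech intonation easily moves more than two semitones, so the receiver's bend range would probably have to be widened for this to work.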
However, to create the monophonic MIDI sequence in the first place you'd need to write something new. A pitch-to-MIDI converter converts a pitch (sound frequency) to a fixed-table MIDI note and is meant for an "instrument" soundbank, which contains all pitches already mapped to individual notes, not for a "drumset"-type sample set. In your case the table entries wouldn't be pitches of the same phoneme (instrument); they'd be a set of different phonemes (drumset or fx type).
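For comparison, this is roughly what the fixed table of a pitch-to-MIDI converter computes, using the standard equal-temperament relation with A4 = 440 Hz as note 69. The phoneme table next to it is a made-up three-entry example showing why the same note numbers mean something entirely different in a "drumset"-style bank.

```python
import math


def frequency_to_note(freq_hz):
    """Nearest MIDI note number for a frequency (A4 = 440 Hz = note 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))


# "Instrument" style: the note number encodes pitch.
middle_c = frequency_to_note(261.63)

# "Drumset" style (hypothetical entries): the note number only selects a
# sample, so pitch has to travel separately, e.g. as pitch bend.
PHONEME_NOTES = {"AA": 0, "IY": 1, "S": 2}
selected_sample = PHONEME_NOTES["IY"]
```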
Your last parenthesized paragraph is [edit] rife with semantic misapprehensions.
This...is the number...8...bus.. to... Victoria...calling at.... Bethnal...Green ...Liverpool...Street ...Station .....Bank.... Holborn.... Tottenham Court....Road.... Oxford Street.... and.... Victoria.
Don't the automated voices you hear on various forms of public transport, and on telephone machines, do this to some extent already? BT had that thing that would read out your text messages in the voice of Dr Who.
Tom Baker, the one with the hat and very long scarf.
Spacecoyote, yes, this is a bit like a vocoder using the MIDI data format.
knowtion, your first annotation is correct; you'd speak to the computer, and it would say the same thing back in its own voice... but with the same tone and speech mannerisms as you spoke those words, so it would sound as much like you as possible. Your second annotation, not so much :), since the program explicitly *doesn't* have a phonetic dictionary... if you say to your computer "Mumblefuzzlewhat," it will say "mumblefuzzlewhat" right back to you, even though that's not a word in any language. If it were real voice to text to voice, then it would try to look up mumblefuzzlewhat in the dictionary, fail to find it, substitute something "close", and say that back to you.
bigsleep, people playing real (non-digital) musical instruments can make noises which sound like speech, but it sounds like very strange speech; using digital musical instruments to produce speech would surely sound equally strange. As for the audio-to-MIDI links, they only analyse the pitch of what's said or sung; they don't generate something which could be played as speech.
FlyingToaster, I'm not suggesting using a "pitch to MIDI converter", for exactly the reason you've given: such a program can, at best, digitize the sound of a single instrument playing a solo. If I wanted that, I wouldn't even bother posting the idea, since it would be baked. This idea is more of a "symphony to MIDI converter," where the converter program detects not only pitch (and volume and duration), but also which musical instrument made which sound.
As for the part in (), I'll quote nineteenthly's anno to my earlier phonetic compression idea:
/Vowels are chords. They can't be transposed easily into a different pitch because they aren't pure tones./
If you think it's bollocks, don't complain to me :), since it hadn't occurred to me until I saw that anno.
[goldbb] sorry for rudeness: I just find myself growing increasingly pedantic with every level of detail :/ ... long story short, I like the idea [+], but I think the default MIDI device on M$-equipped PCs is just a very basic sample player (i.e. it plays .wav files, that type of thing). However, I really don't know that much about the PC implementation.
In 1995 Ron Hoory and Dr. Aharon Satt, along with Dr. Hazzan, worked on this. (Dr. Hazzan owned the largest soap company in Israel but came to work every day for the fun of it. He owned a nicer car than the late Prof. Raviv, the IBM manager of our research facility, who was later killed in a car crash in Australia.)
We had algorithms for breaking speech down into phonemes, then adding the prosody (the ups and downs in speech typical of each person) and some vocal info, to reconstruct any person's voice. We had a hilarious demo of Ariel Sharon saying that he wants peace and doesn't want the "territories". Turns out life is less hilarious than researchers would think.
I know two people from my team who are still working on this stuff, and will point them to it...
It's "almost" speech to text, but without the "need" for checking out the "text".
Yude hevtoo gedalongiffatevver kombz.
Your human speech consists of several components:
1. Your typical voice - various frequencies, each with a different 'volume', created by the shape of your mouth and your bone consistency. So that should be in the header.
2. Your typical consonant noises - the "typical noise" for the plosives and fricatives H, K, S, T or P, and for the G, J, V and B sounds as well.
3. Your typical prosody - how your voice intonation works for questions and for regular speech.
4. Speech "feeling" - typical prosody changes for different modes of talk.
Once each of these is set out in advance as "notes", text could be played out with MIDI or any other control stream.
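The four components could be held in a per-speaker header along these lines. This is only a sketch: the class name, the field layout, and the idea of modelling "feeling" as a simple scaling of the pitch contour are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceHeader:
    """Hypothetical per-speaker header sent once, ahead of the control stream."""
    timbre: dict           # 1. frequency in Hz -> relative volume
    consonant_noise: dict  # 2. consonant symbol -> id of its stored noise sample
    prosody: dict          # 3. sentence type -> pitch contour in semitone offsets
    moods: dict = field(default_factory=dict)  # 4. mood -> scaling of the contour


def contour_for(header, sentence_type, mood="neutral"):
    """Pitch contour for a sentence type, exaggerated or flattened by mood."""
    scale = header.moods.get(mood, 1.0)
    return [offset * scale for offset in header.prosody[sentence_type]]


header = VoiceHeader(
    timbre={120.0: 1.0, 240.0: 0.5},
    consonant_noise={"T": 0, "S": 1},
    prosody={"question": [0.0, 1.0, 4.0], "statement": [0.0, -1.0, -3.0]},
    moods={"excited": 2.0},
)
excited_question = contour_for(header, "question", "excited")
```

With the header in place, the stream itself would only need to carry phoneme identities and timing, plus any deviations from the speaker's usual contour.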