h a l f b a k e r y
With moderate power, comes moderate responsibility.


Voice to MIDI

Phonetic Compression, Take 2
  (+6)

First, create a MIDI sound bank containing all 50 or so different phonemes used in the English language, plus enough additional phonemes to represent as many other (popular) languages as possible. Since different languages have many sounds in common, it should be possible to provide enough sounds for a lot of languages.

Next, take a voice-to-text program and discard the portion of it which takes the analysed phonemes and looks them up in a phonetic dictionary. Enhance the part which converts acoustic data to a sequence of phonemes, and have it additionally identify volume, pitch, and duration.

Then convert the phoneme, volume, pitch, and duration data into MIDI format data.
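A minimal sketch of that conversion step, assuming a drum-kit style bank in which each phoneme has its own note number and pitch is carried by a pitch-bend message sent before each note. The table `PHONEME_NOTES`, the constants, and the names `PhonemeEvent`, `pitch_bend_value` and `to_midi_messages` are all hypothetical and not part of the original idea; only the MIDI status bytes (0x90 note on, 0x80 note off, 0xE0 pitch bend) are standard.

```python
import math
from typing import NamedTuple

# Hypothetical sound-bank layout: one MIDI note number per phoneme.
PHONEME_NOTES = {"HH": 36, "EH": 37, "L": 38, "OW": 39}

TICKS_PER_SECOND = 960      # assumed: 480 ticks per beat at 120 beats per minute
REFERENCE_HZ = 220.0        # assumed pitch at which every sample was recorded
BEND_RANGE_SEMITONES = 12   # assumed; the receiver must be set to the same range


class PhonemeEvent(NamedTuple):
    phoneme: str     # symbol produced by the (hypothetical) analyser
    volume: float    # 0.0 (silent) to 1.0 (loudest)
    pitch_hz: float  # most prominent pitch of the sound
    duration: float  # seconds


def pitch_bend_value(pitch_hz):
    """Map a pitch to a 14-bit pitch-bend value (8192 means no bend)."""
    semitones = 12.0 * math.log2(pitch_hz / REFERENCE_HZ)
    value = 8192 + round(8192 * semitones / BEND_RANGE_SEMITONES)
    return max(0, min(16383, value))  # clamp to the 14-bit range


def to_midi_messages(events, channel=0):
    """Return (delta_ticks, message_bytes) pairs for a monophonic sequence."""
    out = []
    for ev in events:
        note = PHONEME_NOTES[ev.phoneme]
        velocity = max(1, min(127, round(ev.volume * 127)))
        bend = pitch_bend_value(ev.pitch_hz)
        ticks = max(1, round(ev.duration * TICKS_PER_SECOND))
        # Pitch bend: status byte, then low 7 bits, then high 7 bits.
        out.append((0, bytes([0xE0 | channel, bend & 0x7F, bend >> 7])))
        out.append((0, bytes([0x90 | channel, note, velocity])))  # note on
        out.append((ticks, bytes([0x80 | channel, note, 0])))     # note off
    return out
```

Each phoneme becomes three channel messages, so the result is still a stream of ordinary MIDI events; writing them into a file would additionally need the usual header and track chunks.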

Provided the receiver has the appropriate sound bank, he can play the data using a standard MIDI player.

(Yes, I know that most phonemes don't consist of a single pitch, but rather several; but it should be possible to identify the most prominent pitch of a piece of sound. Alternatively, one could split each phoneme which sounds like a chord into two phonemes, on two channels, and specify the volume, pitch, and duration for each channel.)
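One rough way to pick out "the most prominent pitch" of a short piece of sound is to scan a range of candidate frequencies and keep the one with the most spectral energy. The sketch below does this with a plain, slow Fourier-style correlation in pure Python; the function name `dominant_frequency` and its parameters are assumptions for illustration, and the strongest spectral component is not always the pitch a listener would report.

```python
import math


def dominant_frequency(samples, sample_rate, f_min=80.0, f_max=1000.0, step=1.0):
    """Return the candidate frequency (Hz) with the largest spectral power."""
    best_freq, best_power = f_min, -1.0
    steps = int((f_max - f_min) / step)
    for i in range(steps + 1):
        freq = f_min + i * step
        w = 2.0 * math.pi * freq / sample_rate
        # Correlate the frame with a cosine and a sine at this frequency.
        re = sum(s * math.cos(w * n) for n, s in enumerate(samples))
        im = sum(s * math.sin(w * n) for n, s in enumerate(samples))
        power = re * re + im * im
        if power > best_power:
            best_freq, best_power = freq, power
    return best_freq
```

For the two-channel alternative, the same scan could in principle be repeated after removing the strongest component, with the second result sent on the other channel.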

goldbb, Mar 16 2009

Speech to text to Speech (halfbakery) Speech_20To_20Text_...0Speech_20Converter
Same idea without technical details [knowtion, Mar 18 2009]

Voice Translator (halfbakery) Voice_20Translator
Again, same kind of idea with no technical details [knowtion, Mar 18 2009]







       Is this voice to text to voice?
knowtion, Mar 16 2009
  

       Sounds more like a way of speech compression by using MIDI as a backend. This is similar to how the first vocoder, the Voder, which was manually operated, worked.
Spacecoyote, Mar 16 2009
  

       @[bigsleep] Those links are "audio" to midi. They will (probably) translate the pitch and tempo of music and produce something resembling the original song. **I Think** that goldbb's idea is to be able to speak into a microphone and have the computer say the same thing in its own voice (not simply modulated, but re-created). But I'm not sure... vote pending.
knowtion, Mar 16 2009
  

       assuming this is what I think it is... (voice to text to voice)... Then it could become the platform for a real time audio translator. Voice - text - translated text - voice   

       This will be basic...   

       Bob: Hello   

       Bob's Computer translator: Hola   

       Pedro: Hola. ¿Cómo está usted?   

       Pedro's Computer translator: Hello, how are you?   

       This technology is widely known to Trekkies. It is known as a Universal Translator.
knowtion, Mar 16 2009
  

       [ ]...   

       The second part is quite doable, in fact current applications which control single-patch "drumsets" and "fx sets" do just that: each sample is assigned its own MIDI note; you'd use the pitch-bend controller to control the pitch as an exponential offset of the sample. You could cobble one together in a few minutes (give or take).   
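To make the "exponential offset" concrete: a sample player typically turns a pitch-bend value into a playback speed multiplier of two to the power of (semitones / 12). The sketch below assumes the 14-bit bend value with 8192 as centre; the function name `playback_rate` is hypothetical, and any bend range wider than the usual default of two semitones is an assumption that sender and receiver would have to share.

```python
def playback_rate(bend, bend_range_semitones=2.0):
    """Speed multiplier for a sample, given a 14-bit pitch-bend value.

    8192 is the centre position (no bend); 0 and 16383 are the extremes.
    """
    semitones = (bend - 8192) / 8192 * bend_range_semitones
    return 2.0 ** (semitones / 12.0)
```

With a range of twelve semitones, a bend of 0 halves the playback speed (one octave down) and a bend of 16383 very nearly doubles it.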

       However to create the monophonic MIDI sequence in the first place you'd need to write something new. A pitch-to-MIDI converter converts a pitch (sound frequency) to a fixed-table MIDI note and is meant for an "instrument" soundbank which contains all pitches already mapped to individual notes, not a "drumset" sample set type; in your case the table wouldn't be pitches of the same phoneme(instrument), they'd be a set of different phonemes(drumset or fx type).   
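The difference between the two mappings can be sketched as follows. The instrument-style formula is the standard equal-tempered one (A4 = 440 Hz = note 69); the phoneme table and its note numbers are purely hypothetical placeholders for whatever layout the sound bank would use.

```python
import math


def pitch_to_note(freq_hz):
    """Instrument-style mapping: nearest equal-tempered MIDI note number."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))


# Drumset-style mapping: the note number selects a sample, not a pitch.
# These symbols and numbers are made up for illustration.
PHONEME_TO_NOTE = {"AA": 36, "B": 37, "S": 38, "T": 39}


def phoneme_to_note(symbol):
    """Drumset-style mapping: look the phoneme up in a fixed table."""
    return PHONEME_TO_NOTE[symbol]
```

In the first function the note number encodes how high the sound is; in the second it only encodes which sound it is, so pitch has to travel some other way.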

       Your last ()'d paragraph is [edit] rife with semantic misapprehensions.
FlyingToaster, Mar 17 2009
  

       This...is the number...8...bus.. to... Victoria...calling at.... Bethnal...Green ...Liverpool...Street ...Station .....Bank.... Holborn.... Tottenham Court....Road.... Oxford Street.... and.... Victoria.   

       Don't the automated voices you hear on various forms of public transport, and on telephone machines, do this to some extent already? BT had that thing that would read out your text messages in the voice of Dr Who.
zen_tom, Mar 17 2009
  

       Zen_tom. Which Dr Who?
eight_nine_tortoise, Mar 17 2009
  

       Tom Baker, the one with the hat and very long scarf.
Aristotle, Mar 17 2009
  

       Spacecoyote, yes, this is a bit like a vocoder using the midi data format.   

       knowtion, your first annotation is correct; you'd speak to the computer, and it would say the same thing back in its own voice... but with the same tone and speech mannerisms as you spoke those words, so it would sound as much like you as possible. Your second annotation, not so much :), since the program explicitly *doesn't* have a phonetic dictionary... if you say to your computer "Mumblefuzzlewhat," it will say "mumblefuzzlewhat" right back to you, even though that's not a word in any language. If it were real voice to text to voice, then it would try and look up mumblefuzzlewhat in the dictionary, fail to find it, substitute something "close", and say that back to you.   

       bigsleep, people playing real (non digital) musical instruments can make noises which sound like speech, but it sounds like very strange speech; digitally using musical instruments to produce speech would surely sound equally strange. As for the audio-to-midi links, they only analyse the pitch of what's said/sung; they don't generate something which could be played as speech.   

       FlyingToaster, I'm not suggesting using a "pitch to midi converter" for exactly the reason you've said, since such a program can, at best, digitize the sounds of a single instrument playing a solo ... if I wanted that, I wouldn't even bother posting the idea, since it would be baked. This idea is more of a "symphony to midi converter," where the converter program not only detects pitch (and volume and duration), but also which musical instrument made which sound.   

       As for the part in (), I'll quote nineteenthly's anno to my earlier phonetic compression idea:   

       /Vowels are chords. They can't be transposed easily into a different pitch because they aren't pure tones./   

       If you think it's bollux, don't complain to me :), since it hadn't occurred to me until I saw that anno.
goldbb, Mar 17 2009
  

       [goldbb] sorry for rudeness: I just find myself growing increasingly pedantic with every level of detail :/ ... long story short, I like the idea [+] but I think the default MIDI devices on M$ equipped PCs are just very basic sample players (ie: they play .wav's type of thing); however, I really don't know that much about the PC implementation.
FlyingToaster, Mar 18 2009
  

       In 1995 Ron Hoory and Dr. Aharon Satt, along with Dr. Hazzan (owner of the largest soap company in Israel, but coming to work every day for the fun of it. He owned a nicer car than the late Prof. Raviv, the IBM manager of our research facility, who was later killed in a car crash in Australia), worked on this.   

       We had algorithms for breaking down speech into phonemes, then adding the prosodies (typical ups and downs in speech to each person) and some vocal info, to reconstruct any person. We had a hilarious demo of Ariel Sharon saying that he wants peace and doesn't want the "territories". Turns out life is less hilarious than researchers would think.   

       I know two people from my team still working on this stuff, and will point them to it...   

       It's "almost" speech to text, but without the "need" for checking out the "text".   

       Yude hevtoo gedalongiffatevver kombz.
pashute, Aug 15 2011
  

       Your human speech consists of several components:   

       1. Your typical voice - various frequencies each with a different 'volume', created by your shape of mouth and bone consistency. So that should be in the header.   

       2. Your typical plosives - "typical noise" for the H, K, S, T or P, and for the G, J, V and B sounds as well.   

       3. Your typical prosody - how your voice intonation works for questions and regular speech.   

       4. Speech "feeling" - typical prosody changes for different modes of talk.   

       Once each of these is set out in advance as "notes", text could be played out with MIDI or any other control stream.
pashute, Dec 16 2014
  


 
