halfbakery: Contrary to popular belief
There is a problem with speech synthesis, and also with the synthesis of imaginary alien languages mediated via sound: a popular model is crude and inflexible. It assumes at most two sources of sound modified by the filter of the vocal tract, involving tongue, lips, nasal cavity or the absence thereof and so forth. Not good enough, for two reasons. Firstly, it assumes the organs producing the sound are the same for everyone - nobody has a cleft palate or hare lip, there are no functional speech impediments, no hoarse voices, no "cri du chat", everyone has the same teeth, a hard palate of the same height, has the same lungs, a body of the same build, is of the same gender and so on. Secondly, humans are not the only source of sound and the voice is not the only source of human sound. We stamp, clap, speak while running or jumping or lying down, and in echoing rooms, open spaces, inside anechoic chambers and the like. Grasshoppers chirp, dolphins do their thing, insects buzz, birds sing and so on. A speech synthesiser just doesn't cut it, even for the human voice.
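For reference, the "popular model" criticised above is usually called the source-filter model. A minimal sketch of it, assuming a pulse train as the single sound source and two fixed resonators standing in for the vocal tract; the function names, the 120 Hz pitch and the formant frequencies and bandwidths are illustrative assumptions, not taken from any particular synthesiser:

```python
import math


def impulse_train(f0_hz, duration_s, sample_rate):
    """Source: one unit pulse per pitch period (a crude stand-in for the glottis)."""
    period = int(round(sample_rate / f0_hz))
    n = int(round(duration_s * sample_rate))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter: a two-pole resonator, i.e. one formant of the vocal tract."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, < 1 so stable
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0  # previous two output samples
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synth_vowel(f0_hz=120.0,
                formants=((700.0, 80.0), (1200.0, 90.0)),  # assumed (Hz, bandwidth Hz)
                duration_s=0.1,
                sample_rate=16000):
    """Run the source through each formant resonator in series."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal
```

Note how little of the speaker is in there: one pitch and a handful of resonance numbers. Nothing in it knows about teeth, lungs, a cleft palate or the room.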
Therefore, instead of all that, i suggest this. There are already two tools used for visual purposes: raytracing and physics engines. Raytracing deals with light of different colours and intensities passing through a number of processes which involve scattering, attenuation and a lot of the kinds of things which happen to another kind of wave, namely sound. This would be adequate to produce maybe a hundred milliseconds of audible sound if you're lucky. However, for longer periods, a physics engine would help. In the case of the human voice, we have for instance the sound of a trilled R coming out of the mouth of a male with thick lips and a cleft palate. Model the shapes and vibrations of the speech organs over a period of time, sonically raytrace each stage and you have a much more realistic human voice, of course at the cost of considerably more processing. Moreover, you can then model that voice in whatever space you like, with a bee buzzing round the bloke's head, give him a cold, make him heavier or lighter, have him talking while running, introduce a cricket and a sparrow, fly a plane overhead and have a visiting alien over with five mouths each containing a forked tongue and three larynxes each wearing a scarf over two of them and a series of elephant-like trunks.
Thanks to [Vernon] for the inspiration.
Vernon's idea
Random_20Word_20Generator Random word generator - thanks [nineteenthly, Apr 08 2011]
"Survey of Methods for Modeling Sound in Interactive Virtual Environment Systems"
http://www-sop.inri...blis/presence03.pdf A nice survey paper that covers elements of modelling sound. [Jinbish, Apr 08 2011]
Physical Modelling Synthesis
http://en.wikipedia...modelling_synthesis The same idea applied to musical instruments [iaoth, Apr 09 2011]
Titanian _Smoke_
http://www.peppermi....net/res_andi.html? Of course, after [19thly] mentioned it, I had to go find it. [mouseposture, Apr 09 2011]
real-time_20rendered_20audio
Similar. [spidermother, Apr 16 2011]
Thanks [Jinbish], i'll take a butcher's.

Ray tracing doesn't model the wave characteristics of light directly. When the wavelength approaches or exceeds the scale of the space in which the wave exists, it becomes necessary to model the wave directly, rather than abstracting it as rays.

Your voice echoing in a canyon behaves much like rays, but modelling its production in your vocal tract using ray tracing would be unproductive.

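To put rough numbers on the wavelength objection, here is a small sketch. The speed of sound is the usual figure for air at about 20 °C; the 0.17 m vocal tract length, the 100 m canyon and the "ten wavelengths" threshold are rule-of-thumb assumptions for illustration, not hard limits:

```python
SPEED_OF_SOUND_M_S = 343.0  # air at roughly 20 degrees C


def wavelength_m(freq_hz):
    """Wavelength of a sound wave in air at the given frequency."""
    return SPEED_OF_SOUND_M_S / freq_hz


def rays_plausible(freq_hz, feature_size_m, ratio=10.0):
    """Assumed rule of thumb: treat sound as rays only when the geometry
    is at least `ratio` wavelengths across."""
    return feature_size_m >= ratio * wavelength_m(freq_hz)


if __name__ == "__main__":
    # Hypothetical sizes: an adult vocal tract and a canyon, at 200 Hz.
    for name, size_m in (("vocal tract", 0.17), ("canyon", 100.0)):
        print(name, wavelength_m(200.0), rays_plausible(200.0, size_m))
```

At 200 Hz the wavelength is about 1.7 m, roughly ten times the length of the vocal tract, so rays are the wrong abstraction there, while the canyon is dozens of wavelengths across.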
For a proper ray-tracing, you'd need a canonical first scene - say, Julie Andrews singing "Do-Re-Mi" over an infinite checkered plane of edelweiss and granite?

OK, fine, but that can still be modelled. Maybe not raytracing then, but the methods i'm aware of, and yes i'm ignorant, seem not to be up to much at all.

Oh yes, forgot about that, [bigsleep]. Also, there are those guys who altered 'Smoke On The Water' to see what it would sound like on Titan.

1) Start with existing standard acoustic modeling software as used by e.g. theater architects.

2) Merge it with existing finite-element modeling software, so that the simple surfaces (ceiling, balcony, acoustic tiles, auditorium chairs, etc.) are replaced with 3-D finite element volumes with those surfaces. (Computational demands would increase enormously at this point.) Integrating these two software packages might be a pretty interesting project, amounting to, maybe, one doctoral dissertation's worth of work (one DD, in SI units).

3) Build a finite element model of the larynx, pharynx, tongue, and lips. This is at least another DD, but it may already have been done (the closest I can find is a finite element model of the soft palate, but that was in 1999).

Considerable work would be required, in practice, to achieve this, but, in principle, all the pieces are there, and they just need to be assembled.

The only gap I can see is that the mechanical properties of tongue, pharynx & lips may not be well-studied enough for modeling. (I'm guessing the larynx is easier and has been done already.) So:

2.5) Quantify mechanical characteristics of pharynx, tongue, etc, in enough detail to permit modeling. This includes active elements (i.e. muscles) controlled by the nervous system, complex geometry, and many, many degrees of freedom (though you might achieve dimensional reduction by abandoning the naive, general approach, and exploiting the existing literature on speech production). Optimistically, 2-3 DDs.

In short, a suitable long-term project for a large, well-funded lab (in ENT or Speech/Communications, say) at a university with a strong bioengineering program.

Once you've finally built the model, the sky's the limit: you can do Hausa clicks, throat singing, grasshopper stridulations, finger-snapping ... anything you like or can conceive of.

Sounds feasible but big. I would expect the properties of the tongue to be somewhat similar to those of muscle, and i'd expect that to be modelled somewhere.

Oh, and yes, sorry, i should've found that link myself.

The properties of tongue are probably identical to muscle, but modeling muscle isn't straightforward. It's been done for individual sarcomeres, and I think, for geometrically simple arrangements of sarcomeres, like a pinnate muscle. And for assemblages of muscles with well-defined origins & insertions on rigid bones whose motion is constrained by joints. That's as of several years ago; maybe by now someone's done the tongue, but I doubt it: not impossible, just too difficult to be an attractive proposition for anyone with the capability. The tongue's a snarl of muscle fibers going in different directions, following curved paths, with no rigid elements, origins, or joints. Also, it can push by generating internal pressure, which isn't going to be part of a standard muscle model.

In short, a model of the tongue would get really hairy.

(I think this is a good idea, or at least a cool and feasible one, but feel it's irreducibly "big," i.e. can't be done without lots of time, people, and money.)

Doesn't actually sound that difficult.

The windpipe/vocal cords already have a physical-model algorithm used in PM synthesis for flutes etc.

Then it's mostly a matter of reverberation/resonance formulae, with maybe a bit more PM modelling to handle the tongue and lips' [edit: and epiglottis'] ability to constrict the openings.

PM modelling generally sounds as realistic as CGI animation looks, and people are "tuned in" to voices, so it will be easily distinguishable from the real thing. Pretty neat though.

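For a flavour of what PM synthesis looks like in code, here is Karplus-Strong, about the simplest physical model going: a delay line with an averaging filter, which behaves like a plucked string. It is not the flute or vocal-cord algorithm mentioned above, just an illustration of the approach, and the function name, damping value and seed are assumptions:

```python
import random


def karplus_strong(freq_hz, duration_s, sample_rate=44100, damping=0.996, seed=0):
    """Plucked-string sketch: noise burst circulating in a damped delay line."""
    rng = random.Random(seed)  # seeded so the output is repeatable
    # Delay-line length sets the pitch: one trip round the loop per period.
    delay = max(2, int(round(sample_rate / freq_hz)))
    # The "pluck": fill the string with random displacement.
    buf = [rng.uniform(-1.0, 1.0) for _ in range(delay)]
    n = int(round(duration_s * sample_rate))
    out = []
    for i in range(n):
        idx = i % delay
        nxt = (idx + 1) % delay
        out.append(buf[idx])
        # Average neighbouring samples (a gentle low-pass) and lose a little
        # energy each pass, as a real string would.
        buf[idx] = damping * 0.5 * (buf[idx] + buf[nxt])
    return out


if __name__ == "__main__":
    note = karplus_strong(220.0, 0.5)
    print(len(note), max(abs(s) for s in note))
```

The point is that pitch and decay fall out of the "instrument" (delay length and losses) rather than being painted onto a waveform. A voice model in this style would need the delay line's shape to change over time.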
// feel it's irreducibly "big," //

Well, maybe think of this as the gold standard. After all, raytracing isn't the only way scenes are rendered in CGI and there could be ways of simplifying the tongue. For instance, some kind of in-betweening type approach could be taken there - model a tongue, which is indeed going to be difficult, but rather than doing it from moment to moment, identify key moments in its behaviour and simulate those, then just sort of join the dots. The kind of thing i mean is, suppose you're modelling a trilled R. You might not need to generate _every_ tongue position in the lower-frequency vibrations of the tongue, partly because they repeat and, other than the first and last, are probably effectively identical, and partly because there won't be an audible difference between parts of the tongue moving in a straight line and moving in an arc at all times.

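The "join the dots" idea is essentially keyframe interpolation. A minimal sketch, assuming each tongue pose can be boiled down to a list of numbers (here, hypothetical cross-sectional areas along the tract) and that straight-line blending between key poses is good enough; all names and values are made up for illustration:

```python
def lerp_shape(shape_a, shape_b, t):
    """Blend two poses point by point; t=0 gives shape_a, t=1 gives shape_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(shape_a, shape_b)]


def inbetween(keyframes, time_s):
    """keyframes: list of (time_s, shape) pairs sorted by time.
    Returns the interpolated shape at time_s, clamped at both ends."""
    if time_s <= keyframes[0][0]:
        return list(keyframes[0][1])
    if time_s >= keyframes[-1][0]:
        return list(keyframes[-1][1])
    for (t0, s0), (t1, s1) in zip(keyframes, keyframes[1:]):
        if t0 <= time_s <= t1:
            t = (time_s - t0) / (t1 - t0)
            return lerp_shape(s0, s1, t)


if __name__ == "__main__":
    # Two hypothetical key poses of a trill: tongue tip open, then nearly closed.
    keys = [(0.0, [3.0, 2.0, 1.0]), (0.02, [3.0, 2.0, 0.1])]
    print(inbetween(keys, 0.01))
```

Only the key poses would need the expensive simulation; everything between them is cheap arithmetic. For a trill, one cycle's worth of keyframes could simply be repeated.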
[FT], no comment at the moment though it will come and thanks.

OK, [FT], looks like it'd work, thanks. It has occurred to me that the output is in a sense one-dimensional, and i wonder if this simplifies the process.

Modern synthesis touched on PM in the '90s, but these days the large manufacturers concentrate on modelling the sound rather than the instrument: figure out what sounds different parts of the instrument make, e.g. a piano's strings, harp, soundboard, hammer-hits and key-returns, damper noises... quite silly, some of them. I knew a (seriously good) pianist who, if you got up real close, could be heard humming off-key and making "wheee" noises... maybe they should include stuff like that too.

I steered you a bit wrong there on the reverb formulae though: state of the art "convolution reverb" starts off with putting speakers and mics into the space-to-be-modelled and seeing what the space does to the sound. So, it too is a "passive" technique rather than actually modelling the space.

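Convolution reverb comes down to one operation: convolving the dry signal with the impulse response recorded in the space. A minimal sketch using direct convolution; the tiny impulse response is made up, and real implementations generally use FFT-based convolution for speed:

```python
def convolve(dry, impulse_response):
    """Direct convolution: every input sample triggers a scaled copy of the
    room's impulse response, and the copies are summed."""
    if not dry or not impulse_response:
        return []
    out = [0.0] * (len(dry) + len(impulse_response) - 1)
    for i, x in enumerate(dry):
        if x == 0.0:
            continue  # silent samples contribute nothing
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out


if __name__ == "__main__":
    # Hypothetical "room": the direct sound plus one quieter echo two samples later.
    room = [1.0, 0.0, 0.25]
    dry = [1.0, 0.0, 0.5]
    print(convolve(dry, room))
```

This is why the technique counts as "passive": the room appears only as a measured list of numbers, so you cannot move a wall or add a bee without going back and re-measuring.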
[FlyingToaster] Glenn Gould?

[MP]... some orchestra player whose solos were accompanied by the string section trying not to giggle... dunno if it was a relaxation technique, or a Tourette's variant, or if he was simply trying to see if he could crack up everybody else by working his way through an impressive serious classical piece, straightfaced, while making zoom-zoom noises.

That's a little disappointing, but i suppose a sufficient number of samples could be sort of interpolated and you'd get a similar result.

That's what they (large synth mfrs) count on. Have you done any research on current methods of vocal emulation?

A little bit, but this is more a spin-off from [Vernon]'s random word idea. By interpolation, i mean between different arrangements of surfaces at different angles and so on. I don't mean bits of waves.

Regarding modelling the tongue, isn't there a rather eerie talking robotic head that, rather than simulating speech by calculation and reproduction through a speaker, actually has a complex arrangement of vibrating cords, dynamically alterable resonators modelling the nasal and mouthal cavities, a tongue, teeth and lips? It sounds as though it's had an awful lot to drink.