The problem: current audio interfaces are either
* an IVR ("press one for this, press two for that"), or
* speech recognition wizards that try to guess what you
meant.
*- If they think they succeeded, they ask whether that's what
you meant and then carry out some predefined action.
*- If they don't, they give you a menu to choose from or go
back to the beginning.
This is quite frustrating compared with the interaction we
have with visual interfaces on screens.
If you didn't catch something on a screen, you simply read it
again. There's always enough info to look around and choose
what you want to do, and if there isn't, you ask for a bit
more and get tons more.
The solution should mimic the way we naturally hold a
conversation:
We (A) ask for something (Q1). The listener (B) hears what
we ask and understands some of it, usually not every single
word, but gets the gist of what we are saying, and then comes
up with a question (RQ2) in response.
This response (RQ2, given by B), which is itself another
question, contains at least three new "sub-questions", three
new bits of extra info (RQ2a, RQ2b, RQ2c...), so that once
RQ2 is answered (by A, with the response R3), B can be sure
she understood A correctly, and everything A asked at the
beginning (in Q1) is now clear. B can now take action, or ask
another question if there are still any pending issues.
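To make the Q1 / RQ2 / R3 loop a bit more concrete, here is a minimal Python sketch of B's side of the exchange. Everything in it is an assumption made for illustration: the `Interpretation` class, the three function names and the crude word-overlap matching are all made up, and a real system would need a much better way to match A's reply against the guesses.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Interpretation:
    """One guess at what A meant, plus the details B still lacks."""
    gist: str
    missing: List[str] = field(default_factory=list)


def build_clarifying_question(candidates, max_options=3):
    """Fold the top guesses into a single question (RQ2) with labelled
    sub-questions (RQ2a, RQ2b, RQ2c)."""
    parts = [
        f"({chr(ord('a') + i)}) {c.gist}"
        for i, c in enumerate(candidates[:max_options])
    ]
    if not parts:
        return "Sorry, I did not catch that. Could you say it again?"
    return "Did you mean " + ", ".join(parts) + ", or something else?"


def narrow(candidates, reply):
    """Use A's reply (R3) to keep the guesses sharing the most words with it."""
    words = set(reply.lower().split())
    scores = [len(words & set(c.gist.lower().split())) for c in candidates]
    best = max(scores, default=0)
    if best == 0:
        return list(candidates)  # the reply did not help; keep every guess
    return [c for c, s in zip(candidates, scores) if s == best]


def next_move(candidates):
    """Act when one complete guess remains, otherwise ask again."""
    if len(candidates) == 1:
        only = candidates[0]
        if not only.missing:
            return "act: " + only.gist
        return "ask: Could you tell me the " + " and the ".join(only.missing) + "?"
    return "ask: " + build_clarifying_question(candidates)


if __name__ == "__main__":
    guesses = [
        Interpretation("the book by Louisa May Alcott"),
        Interpretation("a news item about a small woman"),
        Interpretation("the phrase itself"),
    ]
    print(next_move(guesses))                      # B asks RQ2
    print(next_move(narrow(guesses, "the book")))  # A's R3 narrows it down
```

The point of the sketch is only the shape of the loop: B offers several guesses at once instead of one yes/no confirmation, and a reply that matches nothing leaves all guesses open rather than sending A back to the start.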
Example: Excuse me, what is the cheapest flight to New York?
Exasperating computer-aided personnel answering: When exactly
do you want to fly, sir? (I don't know, some time within the
next month, as long as it's in my budget!)
Typical smart human answer (after I told her about the
discussion with the computer-aided personnel, including my
afterthoughts):
[Summing up her thoughts, which were: OK, let me see what
kind of packages we have coming up within the next month
which may help you decide. I can then do a detailed
comparison of the days and get back to you. OK, here
goes...]
I can give you some details about the cheapest flights in the
next month, but that will take me some time. Do you want me
to check it out for you now, and get back to you when I have
an answer?
Or: Typical Google answer to a search for "Litle Woman": Did
you mean "Little Women"? Showing results 1-97875 for "Little
Women". Search instead for "Litle Woman".
Human researcher: Are you referring to the book by Louisa May
Alcott, to something in the news about a small woman, or to
something else?
You: What was that?
HR: Louisa May Alcott, the author of the book "Little Women"?
You: You say Loo ee zah, not louse like house!
HR: OK, thank you. I'll remember that: Loo ee za. So is it
*that* book you were asking about?
You: No. What was your other suggestion?
HR: Oh, there's a news item going around now about a woman, a
medical doctor named Thor Dou, who married a young man whose
name is Mark Little. She had adopted his family name and
called herself Doc Thor Due Little, but now she has divorced
and there's a whole issue in the news about it. Is that what
you were looking for?
You: No.
HR: Oh. But you're looking for the term "Little Women",
right?
(short pause) Did you mean women in the plural, or woman as
in a single woman? And does Little mean small, or is it a
family name?
Or am I totally off?
--------------------------
What programming technologies will I need to develop this
kind of system?
Is there any open source software that will help me?
Do you think it's doable with today's technologies to get
something at least close to this "intelligent" discussion,
where the computer narrows down the possibilities without
asking me in a mechanical "press one for this, press two for
that" way?
---------------------
The thing is that I don't want to have to understand anything
about phonemes and audio. I want a stream of text, like:
"What is the cheapest flight to New York"
Then I get a list of possible hearing mistakes and meanings:
Watt ease the chipset
fight / light / test jetflight
2 knew rock.
NYC / NYS
And:
Asking about a plane ticket...
Asking about a ticket for a boxing match
Probably gave location
Did not give time
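As a rough illustration of the kind of output described above, here is a small Python sketch that takes the recognized text and returns (1) sound-alike alternatives per word and (2) candidate intents with the slots that were and were not given. All the tables in it (`SOUND_ALIKES`, `INTENTS`, `SLOT_PATTERNS`) are hand-made, hypothetical toy data; presumably a real system would get the alternatives from the speech recognizer itself or from a pronunciation dictionary, and would use something far better than keyword matching for the intents.

```python
import re

# Hypothetical sound-alike table, hand-written for this one example.
SOUND_ALIKES = {
    "flight": ["fight", "light"],
    "to": ["2", "two"],
    "new": ["knew"],
    "york": ["rock"],
}

# Hypothetical intents: trigger words plus the slots each intent needs.
INTENTS = {
    "plane ticket": {"triggers": {"flight", "fly", "plane"},
                     "slots": ["location", "time"]},
    "boxing match ticket": {"triggers": {"fight", "boxing"},
                            "slots": ["location", "time"]},
}

# Toy slot detectors, keyed by slot name.
SLOT_PATTERNS = {
    "location": re.compile(r"\bto\s+(new york|nyc|boston)\b"),
    "time": re.compile(r"\b(today|tomorrow|next (week|month)|on \w+day)\b"),
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def hearing_alternatives(text):
    """Map each word that has known sound-alikes to its alternatives."""
    return {w: SOUND_ALIKES[w] for w in tokenize(text) if w in SOUND_ALIKES}


def interpret(text):
    """List every intent the text could mean, with given and missing slots."""
    words = tokenize(text)
    # Include sound-alikes so a misheard "flight" can still suggest "fight".
    heard = set(words)
    for w in words:
        heard.update(SOUND_ALIKES.get(w, []))
    normalized = " ".join(words)
    results = []
    for name, spec in INTENTS.items():
        if heard & spec["triggers"]:
            given = [s for s in spec["slots"]
                     if SLOT_PATTERNS[s].search(normalized)]
            missing = [s for s in spec["slots"] if s not in given]
            results.append({"intent": name, "given": given, "missing": missing})
    return results


if __name__ == "__main__":
    sentence = "What is the cheapest flight to New York"
    print(hearing_alternatives(sentence))
    for guess in interpret(sentence):
        print(guess)
```

The "missing" list is what a follow-up question would be built from: for the sentence above, both candidate intents come back with the location given and the time missing.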