[Tfug] Smallest vocabulary for conversing with user

Mr Brevity Bexley410 at aim.com
Fri May 16 13:37:49 MST 2014


Hi,

I have a device that uses speech as *one* of its output modalities.
Normally, the device *forwards* "audio" created by a remote speech
synthesizer (i.e., one that has gobs of resources at its disposal
to produce "quality" speech).

When the primary synthesizer is unavailable (offline, comms failure,
etc.), a *local* fallback synthesizer is required to, at the very
least, tell the user "The speech synthesizer is off-line" (etc.)

In this degraded mode, resources are *very* heavily constrained.
E.g., the code must run *in* a BT earpiece (alongside the rest of
the -- perhaps undocumented -- code that implements the earpiece).

I can dramatically reduce the size/capabilities of the synthesizer
by shrinking it to fit the needs of that reduced application domain!
For example, it's unlikely that the device would ever need to know
how to speak currency amounts, pronounce surnames, accurately
resolve capitonyms, etc.

And, as the user isn't expected to be listening to the device in
this degraded (i.e., nonfunctioning) form for long periods of time,
the overall quality of the speech -- as well as the user's ability
to tailor it to his listening preferences -- can also suffer.

But, I don't want to push the user *too* hard and further hinder
comprehension, etc.  So, I wouldn't want to take a naive (but effective)
approach of *spelling* everything, declaring punctuation marks 
encountered, etc.  IMO, that just makes a bad situation... impossible!

Given that *most* of the messages routed through this synthesizer will
be under my control (i.e., originate from *within* the local device
that embodies it), I can impose some self-discipline to ensure I don't
call on it to speak things that are likely to be mispronounced or
difficult/tedious to understand.  Ideally, those messages would *just*
bump up against the capabilities of the synthesizer (i.e., no
excess/wasted capability).

However, there are also cases where the device may be called upon to
"forward" a message from an external agency (i.e., one that is not
thusly constrained by this "self-discipline") while in this degraded
mode.  But, in those cases, it is still safe to assume that the
synthesizer would NOT have to accommodate completely unconstrained
input (e.g., the "forwarded message" would probably be something along
the lines of "Scheduled server maintenance until 3:00PM" or "Invalid
access credential" and not something like "The Polish cleaning woman
put cheap furniture polish on Dr. Jones's oak credenza in their house
on Phoenix Dr. in Chicago, IL.")

I've built a set of constraints that *should* give me enough flexibility
in "message creation" -- acknowledging that it will never be perfect
(but, infinitely better than a set of coded beeps and bops that require
some sort of never-available cheat sheet to decode!).

However, numbers are giving me all sorts of problems!

The lame approach is just to read out digits.  But, this is tedious
for most listeners -- who will already be having to cope with lower
quality speech, system unavailability, etc.

    "The volume is set at 25%"
    "Unable to connect to server after 13 attempts.  Shall I continue?"

Beyond reading off digits, often context requires embellishing those
values with additional words not explicitly present in the message.

For example, "Battery life remaining:  12:34".  How do you speak this
to the user?
- "Battery life remaining: 12 hours and 34 minutes"
- "Battery life remaining: 12 minutes and 34 seconds"
- "Battery life remaining: twelve thirty-four" (i.e., just after noon)
- "Battery life remaining: one two colon three four"
- "Battery life remaining colon one two colon three four"

[Remember, the synthesizer doesn't *understand* what it is saying!]
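A minimal sketch of one way to sidestep the guessing for messages that
originate *within* the device: the sender attaches a hint saying what
the "12:34" means, and the synthesizer only expands it.  Everything
here (say_small, unit, say_pair, and the hint names) is hypothetical
and for illustration only; it is not taken from any existing
synthesizer.

```python
# Illustrative sketch only: all names and the "kind" hints are hypothetical.
ONES = ("zero one two three four five six seven eight nine ten eleven "
        "twelve thirteen fourteen fifteen sixteen seventeen eighteen "
        "nineteen").split()
TENS = "- - twenty thirty forty fifty sixty seventy eighty ninety".split()


def say_small(n):
    """Spell out an integer in the range 0..99."""
    if not 0 <= n <= 99:
        raise ValueError("sketch only handles 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def unit(n, name):
    """Number plus a unit word, pluralized unless the count is one."""
    return f"{say_small(n)} {name}{'' if n == 1 else 's'}"


def say_pair(text, kind):
    """Speak an 'a:b' field according to the sender-supplied hint."""
    a, b = (int(part) for part in text.split(":"))
    if kind == "hours_minutes":
        return f"{unit(a, 'hour')} and {unit(b, 'minute')}"
    if kind == "minutes_seconds":
        return f"{unit(a, 'minute')} and {unit(b, 'second')}"
    if kind == "clock":
        if b == 0:
            return f"{say_small(a)} o'clock"
        return f"{say_small(a)} {'oh ' if b < 10 else ''}{say_small(b)}"
    raise ValueError(f"unknown hint: {kind}")


print("Battery life remaining:", say_pair("12:34", "hours_minutes"))
```

This does nothing for "forwarded" messages, which arrive without any
hint; those would still need a default reading.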

Similarly:  "Your IP address is 10.1.2.240"
- "Your ihp address is ten one two two-hundred-and-forty"
- "Your eye pee address is ten point one point two point two four zero"
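The second reading above can be sketched as a simple rule.  The rule
itself is an assumption inferred from that one example: fields below
100 are spoken as a number, larger fields digit by digit, with "point"
between fields.  The function names (say_small, say_field, say_ip) are
hypothetical.

```python
# Illustrative sketch only: the "below 100" rule and all names are assumptions.
ONES = ("zero one two three four five six seven eight nine ten eleven "
        "twelve thirteen fourteen fifteen sixteen seventeen eighteen "
        "nineteen").split()
TENS = "- - twenty thirty forty fifty sixty seventy eighty ninety".split()


def say_small(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def say_field(n):
    """One dotted field: a number if below 100, otherwise digit by digit."""
    if n < 100:
        return say_small(n)
    return " ".join(ONES[int(d)] for d in str(n))


def say_ip(text):
    """Speak a dotted-quad address with 'point' between the fields."""
    return " point ".join(say_field(int(p)) for p in text.split("."))


print("Your eye pee address is", say_ip("10.1.2.240"))
```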

Or:
"Your MAC address is C0:00:45:23:14:33"
"Today is 5/15" (i.e., 15th of May)
"Warranty expires 5/15" (i.e., May of 2015)
"Battery voltage:  1.225"
"Beacon signal strengths:  12.3, 26.9, 18.7, 30.2"
"Contact Support at (520) 555-1212 between the hours of 9:00 and 5:00"
etc.

Additionally, even simple issues are hard to resolve well.  E.g.,
what do you do with a leading zero in a numeric value?  "Absorb"
it (silent)?  Or, assume it is there for a *reason* and, thus,
draw attention to it (by pronouncing it)?

Digits to the right of a decimal seem to want to be read off
individually.  E.g., "twelve point oh one four five".

Values having magnitudes less than one (i.e., having an integer
component of '0') can be spoken with or without acknowledging that
"leading zero" ("zero point five" vs. "point five").
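The two choices above (digit-by-digit fractions, and whether to speak
the leading zero) can be sketched as switches so both readings can be
auditioned side by side.  The names say_decimal, zero_word and
speak_leading_zero are hypothetical, and the sketch only handles
integer parts from 0 to 99.

```python
# Illustrative sketch only: all names and defaults here are assumptions.
ONES = ("zero one two three four five six seven eight nine ten eleven "
        "twelve thirteen fourteen fifteen sixteen seventeen eighteen "
        "nineteen").split()
TENS = "- - twenty thirty forty fifty sixty seventy eighty ninety".split()


def say_small(n):
    """Spell out an integer in the range 0..99."""
    if not 0 <= n <= 99:
        raise ValueError("sketch only handles 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def say_decimal(text, zero_word="oh", speak_leading_zero=True):
    """Integer part as a number; fractional digits read off one at a time."""
    whole, _, frac = text.partition(".")
    words = []
    # A whole part of zero is spoken only if the caller asks for it.
    if whole and (int(whole) != 0 or speak_leading_zero):
        words.append(say_small(int(whole)))
    if frac:
        words.append("point")
        words.extend(zero_word if d == "0" else ONES[int(d)] for d in frac)
    return " ".join(words)


print(say_decimal("12.0145"))
print(say_decimal("0.5"))
print(say_decimal("0.5", speak_leading_zero=False))
```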

In some cases, "oh" seems preferable to "zero"; in other cases, the
reverse is true.

E.g., I have very different ways of pronouncing "the" based on the
context in which it is encountered ("thuh" vs. "thee").  Likewise,
"oh" vs. "zero".

Please, no pointers to code samples.  I've got *lots* of implementations
showing how to convert numerics to text and/or speech.  What I'm
looking for, instead, is *opinion* as to what sounds right FOR CASUAL
USERS.  E.g., imagine your grandmother using this device.  How happy
is she going to be listening to a string of digits rattled off in
a very mechanical voice?

Thanx!

[BTW, this addy is probably going away, RSN]


