
Automatic User-Adaptive Speaking Rate Selection


Summary

- Humans giving information over the telephone, however, tend to adapt the speed of their presentation to suit the needs of the listener.
- In a corpus of simulated directory assistance dialogs, the operator’s speed in number-giving correlates with the speed of the user’s initial response and with the user’s speaking rate.
- Keywords: rate, speed, pace, adaptation, number-giving.
- Many commercial telephone dialogs include an information delivery phase, in which the system gives the user information such as a time, a price, a password, directions, a confirmation number, etc.
- If the speed is too fast, there is again a time loss as the user waits for a repetition.
- Common practice in interface design is to produce an interface that meets the needs of all members of the target user population.
- One solution to this is to allow the user to personalize or customize the system’s behavior to some degree, by explicitly stating his interests and preferences or by explicitly setting system parameters.
- A second solution is for the system to adapt itself, based on experience with the user.
- This can be a very knowledge-intensive process, especially if the system aims to adapt after only a brief initial interaction with the user (Langley 1999).
- The second approach adapts without maintaining an explicit user model by keeping comparable information implicitly in the state of the dialog manager.
- The field of natural language generation includes a body of work which is concerned with the problem of expressing a message using words and syntactic structures that the user can easily understand (Reiter.
- All of this work, however, addresses adaptation at a fairly coarse level of granularity, at best that of the word, but more commonly that of the proposition or speech act.
- Ward 1998) shows how simple prosodic properties of the user’s utterances can be used to decide when to repeat, wait, or play the next sentence in a sequence of directions.
- Tsukahara 2003) showed that it is possible to detect the “ephemeral emotional state” of the user from the timing and prosody of his utterances, and that this information can be used to adapt the system’s utterances to make them more pleasing to the user.
- For example, if the user is pleased with himself, the system can produce a congratulatory acknowledgement; if the user is proceeding swiftly without problems, a short businesslike acknowledgement; and if the user is unsure, a firmly reassuring acknowledgement.
- As such, the corpus is narrowly useful for our purpose, correlation finding, and is not likely to be useful for other purposes, such as determining which groups of users would in fact benefit from speaking-rate adaptation.
- As we were gathering the corpus in Japan, we chose to mimic the format of the most popular Japanese directory-assistance service, namely NTT’s 104.
- This follows the same pattern seen in Figure 1, except that the number reading, line 6 of the figure, is done not by the operator but mechanically.
- We recorded users’ sex, age decade, language and accent history, occupation, presence of hearing impairments, degree of experience with NTT’s 104 service, and a rating of the acceptability of the automatic number-giving phase of this service versus human number-giving.
- This was intended to give some variety in terms of users’ alertness level and degree of busyness or haste.
- However, call times were at the user’s choice, when he had free time, and thus there were probably no truly rushed calls.
- The city name, and thus the exchange number of the number given, were always the same for each user.
- Next to each listing was a blank for the user to write down the number given by the operator.
- There were also fields for the user to record, for each dialog, the telephone type used (using the rough classification of PHS, portable, normal landline, and public telephone), the location (home, office, outdoors, taxi, train station, etc.), and the time.
- Finally there was a space for the user to mark his impression of the operator’s performance, with the suggested responses being “good”, “normal”, and.
- This was detected after the dialogs were collated, as the user had not realized it at the time.
- Each dialog was recorded onto two DAT tapes: one directly from the telephone line and one from a microphone on the operator’s side.
- The operator’s microphone also picked up some of the user’s voice.
- This was convenient to set up, and it allows, at least in principle, use of the correlations between the two channels to automatically synchronize them, and use of the volume differences to automatically identify who was speaking when.
- The operator’s task was to behave like a normal directory assistance operator, with the main difference being that the number for the listing requested was found by scanning a short list, rather than searching in a large database.
- Some operators later reported that, after a few calls, they started to recognize the voices of some of the users.
- Neither the operators nor the users were told the purpose of the experiment.
- There was some noise in some of the dialogs in the corpus, but not enough to have a noticeable effect on intelligibility.
- The correlation between the signal-noise ratio for the user’s voice and the operator’s number-giving duration over a roughly labeled 289 dialog subset was significant but very low, 0.014 (r .
- Slower number-giving is preferable for users who speak more slowly, and faster number-giving for faster speakers.
- Slower number-giving is preferable for those who react to the operator’s greeting after a delay, and faster number-giving for users who respond more swiftly.
- The significance of a “good” rating is open to question, as about a third of the users rated all of their dialogs the same, and there was clearly no consistency across users.
- Despite these limitations of the ratings, we chose to use them and analyze only the dialogs rated “good”.
- Filled and non-filled pauses, although clearly significant indicators of the speaker’s state, were excluded from the computation, as they probably do not affect perceived speaking rate in any simple way.
- We defined the “user’s initial response time” to be the delay between the end of the operator’s greeting (utterance 1 in Figure 1) and the start of the user’s first utterance (utterance 2 in Figure 1).
- Measuring operators’ number-giving times was complicated by the fact that there were various patterns.
- The most common was where the user produced an acknowledgement after each group of digits (75 dialogs).
- There were also 38 dialogs where the user repeated back each group of digits, 20 dialogs where the user listened to the number in silence, and 9 dialogs where the user repeated back some but not all of the.
- Our metric of information-delivery slowness was then simply the overall duration of each number-giving (utterance 6 in Figure 1), including internal pauses and the user’s interleaved acknowledgements.
- Figure 2: Correlation between the user’s speaking rate (measured from the transcription) and the duration of the operator’s number-giving.
- Figure 3: Relation between subjective judgment of the user’s speaking rate and the duration of the operator’s number-giving.
- There was a significant negative correlation between the user’s speaking rate and the operator’s number-giving duration, –.25 (r² = .06), as seen in Figure 2.
- The correlation between the user’s initial reaction time and the operator’s number-giving duration was positive and somewhat stronger, .32 (r² = .10), as seen in Figure 4.
- Figure 4: Relation between the user’s initial reaction time and the duration of the operator’s number-giving.
- In the prediction formula, R is the user’s speaking rate in morae/sec, D is his initial reaction time in msec, and L is the operator’s number-giving duration in msec; the parameters were obtained by multiple regression.
- For example, if the user’s speaking rate is 8.25 morae/sec and his initial reaction time is 600 ms, then the predicted number-giving duration is 7.0 seconds.
- 0.01) with the actual operators’ number-giving durations.
- Thus information delivery was indeed slower to the extent that the user spoke slowly and to the extent that he was slow to respond to the initial greeting.
- To find out what other factors are involved, we listened to all cases where the number-giving duration predicted by the formula differed by more than 2 seconds from the actual duration in the corpus.
- First, in some dialogs the operator seemed to actively solicit acknowledgements, prosodically (Ward &.
- Second, in some dialogs the operator paused after every digit.
- Third, in some dialogs the user’s acknowledgements came slowly; sometimes it seemed that he had not intended to produce acknowledgements, but the operator had waited.
- To see whether users would actually prefer speaking rate adaptation, we built a semi-automated directory assistance system.
- In this system, as in most directory assistance systems today, a human operator handles the call up to the point of the final information delivery.
- The novel aspect of our set-up is that the system listens in on the user-operator interaction (Figure 5) to compute the user’s initial response time and his speaking rate, and then uses these to give the user the number at an appropriate rate.
- Fosler-Lussier 1998) to estimate the user’s speaking rate.
- mrate is known to correlate well (.67) with the transcribed speaking rate in English.
- Figure 7: Cumulative Duration of Pauses between Digit Groups as a function of Overall Number-Giving Duration.
- For example, if mrate is 5, the inferred speaking rate is 8.25 morae/sec.
- To generate a number-giving voice of the duration given by the formula, we needed to determine the duration of each digit group and the duration of the pauses.
- The duration of the pauses, although known to be important (Ishizaki &.
- 1998), however in these dialogs we opted for the simple rule of using 40% of the total duration for the pauses between digit groups, slightly higher than the corpus average (Figure 7).
- The duration of the digit groups was set by selecting the “speed parameter” of the synthesizer according to a formula obtained by regression.
- For example, if the desired number-giving duration was 7 seconds, the total pause length would be 2.8 sec.
- 0.01) between the predicted values and operators’ actual number-giving durations.
- Overall number-giving durations varied from 4.3 to 8.2 seconds (Figure 7).
- Ultimately speaking rate adaptation should be tested in the context of use, by real users.
- First, the system needs a sanity check so that it backs off to a standard speaking rate if the computed parameters are implausible.
- Thus the actual use of the system will not exactly match the conditions under which the corpus was gathered.
- However, this may not be a major problem, given that the major determinants of desired speaking rate are probably the times needed to hear and write down the information, which should not depend much on whether the user speaks or is silent.
- Subjectively, even naive implementation of the equations still gives roughly appropriate speaking rates even in dialogs where the users do not repeat or acknowledge (Ward &.
- Although we have addressed speaking rate adaptation in the context of information delivery, it may also be useful in other contexts, such as prompting and audio browsing.
- Virzi 1992), rate adaptation allows a factor of 2 speed-up with a simple implementation and without requiring the user to do anything special.
- We have only looked at the most obvious factors of the user’s speech.
- Ward 2004) may indicate the user’s degree of understanding or cognitive load.
- The user’s vocabulary, dialect or accent, or inferred age, and also extra-dialog factors, such as time of day and originating exchange, may also be informative.
- Our system uses the information in the user’s speech, but for a pure interactive voice response (IVR) system it may be possible to do similar adaptation by considering the timing and rate of the user’s keypad input.
- Users familiar with the system, for example, often press keys immediately after, or even during, the system prompt; such users would probably also welcome a faster speaking rate from the system.
- Speaking rate adaptation is probably not universally a good thing, especially if taken to extremes (Suzuki 2001).
- It would be interesting to explore automatic speaking-rate adaptation for other languages.
- We have shown that simple, easily computable features of the user’s voice can be used to adapt the system’s speaking rate to be more appropriate.
- We expect that speaking rate adaptation will find utility in many automated and semi-automated IVR and spoken dialog systems.
- In International Congress of the Phonetic Sciences, pp.
- In Proceedings of the Seventh International Conference on User Modeling, pp.
- Combining Multiple Estimators of Speaking Rate.
- Adaptive Number-giving for Directory Assistance (in Japanese).
- Journal of the Acoustical Society of America.
- In Proceedings of the 4th Annual Meeting of the (Japanese) Association for Natural Language Processing, pp.
- Automatic User-Adaptive Speaking Rate Selection for Information Delivery
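
The summary describes predicting the operator’s number-giving duration L from the user’s speaking rate R and initial reaction time D by multiple regression, but the fitted parameters did not survive in this extract. The following is only a minimal sketch of that kind of fit, assuming a linear form L = a·R + b·D + c. The coefficients in `HYPOTHETICAL`, the sample data, and all function names are invented for illustration and are not the paper’s values.

```python
def solve_linear(matrix, rhs):
    """Solve a small square system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_duration_model(rates, delays, durations):
    """Ordinary least squares for L = a*R + b*D + c via the normal equations."""
    rows = [[r, d, 1.0] for r, d in zip(rates, delays)]
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(3)] for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(rows, durations)) for i in range(3)]
    return tuple(solve_linear(xtx, xty))


def predict_duration(model, rate, delay):
    """Predicted number-giving duration (msec) for rate (morae/sec) and delay (msec)."""
    a, b, c = model
    return a * rate + b * delay + c


# Hypothetical coefficients, used only to generate noise-free synthetic data.
HYPOTHETICAL = (-400.0, 2.0, 9000.0)
rates = [6.0, 7.0, 8.0, 9.0, 10.0, 7.5]                 # R, morae/sec
delays = [900.0, 400.0, 700.0, 300.0, 500.0, 1200.0]    # D, msec
durations = [predict_duration(HYPOTHETICAL, r, d) for r, d in zip(rates, delays)]

# The fit should recover the hypothetical coefficients up to rounding error.
model = fit_duration_model(rates, delays, durations)
```

The signs of the hypothetical coefficients follow the reported correlations: a negative weight on speaking rate and a positive weight on reaction time. With real, noisy data the same fit would give estimates rather than exact recovery.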

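
The summary states two rules for turning a predicted duration into output: use 40% of the total duration for the pauses between digit groups, and back off to a standard rate when the computed parameters are implausible. The sketch below combines the two under stated assumptions. Only the 40% share comes from the text. The plausible range reuses the 4.3 to 8.2 second span reported for the system's number-giving durations, while the fallback duration, the three digit groups, the equal split of pauses, and the function name are hypothetical.

```python
def plan_number_giving(predicted_sec, n_groups=3, pause_share=0.40,
                       plausible_range=(4.3, 8.2), fallback_sec=6.0):
    """Split a target number-giving duration into speech time and pauses.

    pause_share=0.40 is the rule stated in the text. plausible_range,
    fallback_sec, n_groups and the equal pause split are assumptions.
    """
    if n_groups < 2:
        raise ValueError("need at least two digit groups to have a pause")
    low, high = plausible_range
    # Back off to a standard duration when the prediction is implausible
    # (this comparison is also False for NaN, so NaN falls back too).
    total = predicted_sec if low <= predicted_sec <= high else fallback_sec
    total_pause = pause_share * total
    return {
        "total_sec": total,
        "total_pause_sec": total_pause,
        "pause_each_sec": total_pause / (n_groups - 1),
        "speech_sec": total - total_pause,
    }


plan = plan_number_giving(7.0)
print(round(plan["total_pause_sec"], 2))  # 2.8, matching the example in the text
```

The remaining `speech_sec` is the time budget that the synthesizer's speed parameter would have to meet; the regression that maps a duration to that parameter is not reproduced in this extract, so it is left out here.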