The first problem faced when investigating emotions in speech is the choice of a valid database, which forms the basis of the subsequent research work. Unfortunately, the scarcity of available emotional databases makes the recording stage an almost unavoidable task within the process. The performance of an emotion recognizer can differ clearly depending on which kind of speech is used. Generally, three different categories of emotional speech are considered: spontaneous speech, acted speech and elicited speech.
All three groups have both advantages and drawbacks, and none of them can be pointed out as generally optimal. The selection of the database depends strongly on the application in which the emotion recognizer is going to be employed. The categories that are relevant when trying to establish the correspondences between emotions and speech also depend, to a certain extent, on the task, i.e. different applications may benefit from different categorizations. In the framework of this work, the target scenario is the Sony entertainment robot AIBO. To that end, an emotional database comprising all the desired emotions has been recorded, simulating different possible situations. From the application point of view, five different emotional states are of interest: angry, happy, sad, bored and neutral.
Another problem that has to be taken into account when a database is chosen is the cultural dependency found in the way emotions are expressed. Many studies try to find out the extent to which emotional expression is psycho-biologically or culturally determined. For example, Scherer [Sch00] explored the existence of a universal psychobiological mechanism of emotion in speech across languages and cultures by studying the recognition of five emotions in nine languages, obtaining 66% accuracy. In [Abe01], recordings of a Swedish speaker uttering a phrase while expressing different emotions were interpreted by listeners with different native languages: Swedish, English, Finnish and Spanish. The results show that the native listeners were the most successful in recognizing the emotions correctly. Nevertheless, another study [Tic00], which compared the cross-cultural decoding of emotions between Japanese and English subjects, suggests that the vocal effects of possibly quasi-universal psycho-biological response mechanisms may be present.
The following subsections summarize the benefits and drawbacks of the three main kinds of speech.
Spontaneous speech is often argued to contain the most direct and authentic emotions, but the difficulties in collecting this kind of speech are also extensive. Ideally, speakers should be recorded without knowing about it, so that they behave completely naturally, but this kind of data collection raises difficulties, since such a procedure is ethically problematic (see [Cam00, Cam01]). Recordings of spontaneous speech can also raise legal problems such as copyright. Although this kind of data is difficult to collect, corpora of spontaneous speech do exist, mainly consisting of clips from different television programs, but with significant distribution limitations.
Another weakness of this kind of speech is that the data in the corpus must be categorized. Emotional categories are quite fuzzy in their definitions, and different researchers use different sets. Systematic and careful evaluations of the tagsets used for labeling emotions are generally lacking, and the labeling process becomes hard and expensive.
Examples of available natural databases are the Belfast database, which contains audiovisual recordings of 100 English speakers exhibiting relatively spontaneous emotion and is used in e.g. [Scö01, Cow00]; the Leeds-Reading Emotion in Speech Corpus, e.g. [Gre95]; the JST database, e.g. [Cam01]; and the SUSAS corpus [Han99], which consists of air force pilots' conversations and therefore covers situations that are less common than most everyday ones. Aviation data, i.e. crew conversations in cases where the aircraft is crashing, has also been used (e.g. by [Bre83] or [Wil69]), as well as the radio recordings of the reporting of the Hindenburg catastrophe (used e.g. by [Wil69]). Other researchers have also used spontaneous speech, but all of them have faced ethical criticism.
Given the difficulty of inducing or observing naturally occurring vocal expressions of emotion, most researchers in this area have used actors as subjects, asking them to vocally portray different emotions, and have analyzed the acoustic features of the recorded portrayals.
Acted speech does not have the ethical problems that arise when collecting spontaneous speech; however, its degree of naturalness is often questioned. Acted speech can be recorded from different sources: sometimes professional actors are employed (see [Ban96]); in other cases non-professional actors, drama students or even other students are asked to utter an emotional corpus. The quality of the acting can of course be expected to differ between recordings, and these differences have to be taken into account as well.
In the first place, the quality of acted speech is a function of the quality of the acting, which might affect the manifestations of the emotions. But there are further uncertainties in using acted speech; the most important is whether acted speech can really be said to reflect authentic emotions. Some authors [Gus01] believe that, due to the exaggerated nature of acted speech, it is not possible to generalize from acted emotional speech to natural speech, even though high recognition rates are often found in experiments on acted speech. Evidently there is an inverse relation between naturalness and ease of acquisition. Acted speech is an indication of how people believe that emotions should be expressed in speech, not of how emotions are actually expressed [Sti01]. This indicates that acted speech is more stereotypical, and that the expression of emotions is more extreme than in spontaneous speech. For a speech synthesis application this might not be a problem; it may rather be an advantage to use stereotypical emotional expressions. Providing the most prototypical and easily interpretable emotive correlates, instead of real ones, could even be beneficial in synthesized speech. These stereotypes could be universally understood in spite of their lack of spontaneity. In speech recognition, by contrast, this mismatch between ideal and reality gives rise to problems. Since there is no single way to express emotions, because expression depends strongly on many factors such as social environment or the speaker’s personality, automatic recognition systems should be capable of interpreting a wide range of variations in emotional expression. In other words, in recognizing speech we have to cope with the complexity of reality.
Elicited speech is based on emotion induction. One of the major requirements for the empirical study of the effects of the speaker’s emotional state on acoustic voice parameters is the ability to induce affective and attitudinal states in a reliable and realistic fashion. Several techniques for inducing affective states in a controlled way have been developed in the literature. These range from the reading of positive or negative self-statements, through the use of music and the presentation of films, to the threat of having to speak in public. For instance, subjects watch a film that should evoke specific emotions and then have to retell the film to the experimenter. The idea is that the speech will be colored by the induced emotion. It is also possible to put a subject into a situation meant to evoke a specific emotion and then record his speech. However, this method suffers from ethical problems, e.g. it is not fully ethical to scare someone and then record his speech. In [Gus01] it is questioned whether doing this is really more unethical than just recording someone who is already scared. As a result of this problem, the induced or elicited emotions are often too mild, as if there were an inverse relation between the strength of the induction and its ethical acceptability.
Various techniques using mental imagery have been used effectively to induce affective states in which physiological, vocal and facial reactions congruent with the target states could be elicited for a range of emotions and attitudinal states. Finally, within the fields of speech science and human factors, interactive tasks and computer games have been used to induce states of high cognitive load and stress, as well as a number of emotional states. This technique seems particularly relevant to research involving automatic and computer-controlled speech interfaces. Wizard of Oz (WoZ) techniques are also widely employed. There, a real situation is presented to the subject and his emotional reactions are captured. This technique is used in the present research through the scenario “one day with AIBO”.
A wide range of procedures has been attempted to provoke emotions in an artificial way. The induction method has the positive feature that it gives control over the stimulus; on the other hand, different subjects may react differently to the same stimulus. The validity of such elicited, or induced, emotional speech depends to a large extent on how successful the induction process is.
Studies that have used induced emotional speech include, e.g., [Ski35], [Fri62], [Hec68] and [Iid98].
The research carried out in this thesis is oriented to the AIBO entertainment robot, developed by Sony, which has the capability to communicate with the world around it through the senses of sight, sound and touch. In order to obtain relevant results, a speech database as close as possible to spontaneous emotional speech in the target scenario is desired.
With that purpose in mind, different stories in the context “one day with AIBO” have been designed, taking into account that approximately 30 commands in five emotions (angry, happy, sad, bored and neutral) should be included. These stories were recorded by a professional speaker, with the aim of leading subsequent speakers into the intended emotion.
Recordings of the database are thus focused on the commands to which AIBO usually attends. Before further details of the database are given, one observation must be made: in order to obtain enough data for the speaker-dependent experiments, two subjects, one male and one female, have been selected from the database and a larger amount of data has been recorded from them:
Speaker A: One male native German speaker. AIBO stories are recorded twice, corresponding to the speaker ids id0013 and id0014 (see table 4.1). AIBO commands are recorded twice.
Speaker B: One female non-native English speaker. AIBO stories are recorded twice, corresponding to the speaker ids id0029 and id0030 (see table 4.1). AIBO commands are recorded twice.
Collection of the database was completed in the recording studio of the Advanced Technology Centre of Stuttgart (ATCS), property of Sony International Europe GmbH.
The software used in the recording process was implemented by Sony at the same location, i.e. ATCS, and is called Speech Recording System Program V. 22.214.171.124. Recordings were made with two different microphones. A Sony C38B high-quality microphone was placed close to the speaker and forms the left channel. In addition, a Sony WM4108B microphone was placed 30 cm in front of the speaker as a far-field microphone, and its signal was assigned to the right channel. Both channels were recorded at a sampling frequency of 48 kHz and then converted to a sampling frequency of 16 kHz. The present work only considers the input from the closer microphone, whereas the far-field signal is kept for further research.
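As an illustration of the channel selection and rate conversion described above, the following is a minimal sketch and not the tool actually used for the database, which is not specified here. It assumes that the stereo signal is already available as a sequence of (left, right) sample pairs at 48 kHz. The function names `lowpass_taps` and `close_mic_16k`, the 61-tap Hamming-windowed sinc filter and the cutoff slightly below 8 kHz are illustrative assumptions.

```python
import math


def lowpass_taps(num_taps, cutoff):
    """Hamming-windowed sinc low-pass filter (illustrative design).

    cutoff is given as a fraction of the input sampling rate (0 < cutoff < 0.5).
    num_taps is assumed to be odd so that the filter has a centre tap.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2 * cutoff
        else:
            ideal = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalise to unity gain at DC


def close_mic_16k(stereo_frames, factor=3, num_taps=61):
    """Keep the left (close microphone) channel and reduce 48 kHz to 16 kHz.

    stereo_frames: sequence of (left, right) samples recorded at 48 kHz.
    The signal is low-pass filtered before every third sample is kept,
    so that content above the new Nyquist frequency (8 kHz) is attenuated.
    """
    left = [l for l, _ in stereo_frames]
    # Cutoff at 90 % of the new Nyquist frequency: an assumed safety margin.
    taps = lowpass_taps(num_taps, cutoff=0.9 * 0.5 / factor)
    half = num_taps // 2
    out = []
    for i in range(0, len(left), factor):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + k - half  # filter centred on sample i (no added delay)
            if 0 <= j < len(left):
                acc += t * left[j]
        out.append(acc)
    return out
```

The far-field (right) channel is simply discarded in this sketch; the same function could be applied to it by swapping the tuple element that is kept.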
As introduced in section 4.2, two different kinds of recordings are performed:
AIBO commands is a dataset of read speech consisting of the AIBO commands read one after another in each of the five emotional states considered in this work. For the AIBO commands data acquisition, utterances are recorded as read speech and therefore no story is performed; only the commands are prompted. Speakers are simply asked to utter the commands with a certain emotional content. Since these commands were initially recorded in order to increase the amount of data for the speaker-dependent experiments, recordings of this nature only exist for speakers A and B. A database of only neutral commands uttered by 7 male and 6 female speakers is also used for the purposes of experiment 126.96.36.199, whose findings call into question the absence of emotional content in the neutral utterances resulting from the AIBO stories. This stems from the intrinsic emotional meaning of the commands, e.g. “Let’s play” has a propensity to be uttered as happy, and “Be quiet!” tends to convey anger.
One day with AIBO database contains emotional samples obtained as elicited (WoZ) speech. Subjects are put in an emotional state by some context action and then asked to read the commands. They are asked to sit in front of a screen and to listen to a recording through headphones. This recording, designed to supply the emotional context, was previously recorded by a professional speaker. While they listen to the story, they can also read it on the screen. When they are required to utter a command, it is prompted on the screen. The emotion in which the command should be uttered is given unequivocally by the story context; nevertheless, an icon is presented on the screen next to the sentence to confirm it.
Speech files are automatically labelled with the corresponding emotion during the recording session. The story was designed taking into account that at least the 26 AIBO commands uttered in 5 different emotional states should be included. The speaker follows all the situations until the end of the story, which is, to add some non-technical information, a happy ending. The database is labelled according to the emotion that is supposed to be expressed in each situation. This means that this work deals with intended emotional expression, without re-labelling through listening tests. This position defends the idea that emotions should be recognised from the natural expression of the speakers, instead of restricting the study to “exaggerated” ways of emotional expression. Nevertheless, it would be interesting to compare the results with those obtained on an appropriately labelled database, which is proposed for further work.
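The labelling by intended emotion can be pictured with the following minimal sketch. It is not the recording software itself: the record layout, the names `Utterance`, `label_session` and `coverage`, and the representation of the story as a list of (command, emotion) prompts are all hypothetical and serve only to illustrate that each utterance inherits its label from the prompt shown at recording time.

```python
from dataclasses import dataclass

# The five emotional states considered in this work.
EMOTIONS = ("angry", "happy", "sad", "bored", "neutral")


@dataclass(frozen=True)
class Utterance:
    """One recorded command together with its intended-emotion label (hypothetical layout)."""
    speaker_id: str
    command: str
    intended_emotion: str


def label_session(speaker_id, prompts):
    """Label every prompted command with the emotion given by the story context.

    prompts: iterable of (command, emotion) pairs in story order.
    No listening test is involved; the label is the intended emotion.
    """
    labelled = []
    for command, emotion in prompts:
        if emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion: {emotion!r}")
        labelled.append(Utterance(speaker_id, command, emotion))
    return labelled


def coverage(utterances):
    """Map each command to the set of emotions in which it was recorded."""
    seen = {}
    for u in utterances:
        seen.setdefault(u.command, set()).add(u.intended_emotion)
    return seen
```

A helper such as `coverage` could be used to check that a story prompts every command in all five emotional states, which is the design requirement stated above.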
Since the recording sessions have taken place simultaneously with the thesis development, the amount of data has increased successively. The following database matches the data available at the closing stage of this work.
One day with AIBO
Following the procedure described above, through which the subjects are put into an emotional context, 30 speakers are recorded. Information about the speakers is given in table 4.1.
The labels “good emotional performer” and “one of the best emotional performers” result from the criteria of the recording staff, who attended all the sessions. However, it must be noted that this selection is based exclusively on general performance of the speakers at the recording time and not on later listening tests. The use of different sets of speakers to carry out the experimental enquiries is detailed in chapters 8 and 9.
Table 4.1. Database recorded by means of the AIBO scenarios.
These recordings are the result of reading the commands without an emotional context. Speakers A and B are recorded in five different emotions in order to obtain a larger amount of data for speaker-dependent classification tasks. The remaining speakers in table 4.2, on the other hand, are only recorded in the neutral emotion. All the utterances correspond to commands that AIBO is capable of recognizing and “understanding”.