Keynote Speeches

Masato Akagi: Speech communication with affective speech-to-speech translation


      Widespread use of the network has ushered in a new era of communication. For example, communication can now be carried out instantaneously regardless of the distance between the two parties, even if the other party is on the other side of the world. However, although spoken language is the most direct means of communication among human beings, it is not yet possible to communicate directly with others across the network when a common language is not shared. This makes it challenging to construct universal speech communication environments on the network. One approach to this challenge is to construct a speech-to-speech translation (S2ST) system. S2ST is the process by which a spoken utterance in one language (e.g., Japanese) is used to produce a spoken output in another language (e.g., Chinese): the speech is converted into text by automatic speech recognition, the text is translated into the target language by machine translation, and the translated text is synthesized into speech by a text-to-speech synthesizer in the target language.
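      The cascade just described can be pictured as three interchangeable components wired in sequence. The following is a minimal sketch of that pipeline, assuming purely hypothetical component interfaces: the recognize, translate, and synthesize callables stand in for real ASR, MT, and TTS engines and are not the API of any particular toolkit.

from typing import Callable

def speech_to_speech_translate(
    source_audio: bytes,
    recognize: Callable[[bytes], str],    # ASR: source speech -> source-language text
    translate: Callable[[str], str],      # MT: source-language text -> target-language text
    synthesize: Callable[[str], bytes],   # TTS: target-language text -> target speech
) -> bytes:
    """Cascade S2ST: recognize, translate, then synthesize."""
    source_text = recognize(source_audio)
    target_text = translate(source_text)
    return synthesize(target_text)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; a real system would plug in
    # actual ASR, MT, and TTS engines for, e.g., Japanese-to-Chinese translation.
    result = speech_to_speech_translate(
        b"<source waveform bytes>",
        recognize=lambda audio: "konnichiwa",
        translate=lambda text: "ni hao",
        synthesize=lambda text: text.encode("utf-8"),
    )
    print(result)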
      Speech contains a variety of information, including linguistic, paralinguistic, and nonlinguistic information. However, conventional S2ST focuses on processing linguistic information only: it directly translates the spoken utterance from the source language into the target language and does not consider paralinguistic and nonlinguistic information, such as speaker individuality and emotions, at play in the source language. For natural speech communication, it is crucial to preserve the speaker individuality expressed in the source language.
      This talk introduces activities of the Acoustic Information Science Laboratory, Human Life Design Area, Japan Advanced Institute of Science and Technology (JAIST) that explore how to deal with para- and non-linguistic information across multiple languages in an S2ST application called “affective S2ST.” In our efforts to construct such a system, we discuss (1) how to describe para- and non-linguistic information in speech and how to model its perception and production, and (2) the commonalities and differences among multiple languages in the proposed model. We then use these discussions as context for (3) an examination of our “affective S2ST” system in operation.
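      As a purely illustrative sketch, and not JAIST's published design, one way to carry para- and non-linguistic information alongside the linguistic cascade is to estimate an affect representation from the source speech and condition the target-language synthesizer on it. The valence/arousal descriptor and all component interfaces below are assumptions made for this example.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Affect:
    """Hypothetical para-/non-linguistic descriptor (dimensional emotion)."""
    valence: float   # pleasant (+) vs. unpleasant (-)
    arousal: float   # active (+) vs. calm (-)

def affective_s2st(
    source_audio: bytes,
    recognize: Callable[[bytes], str],            # ASR
    translate: Callable[[str], str],              # MT
    estimate_affect: Callable[[bytes], Affect],   # perception model applied to the source speech
    synthesize: Callable[[str, Affect], bytes],   # TTS conditioned on the estimated affect
) -> bytes:
    """Translate the words and re-express the estimated affect in the target language."""
    target_text = translate(recognize(source_audio))   # linguistic path (conventional S2ST)
    affect = estimate_affect(source_audio)             # para-/non-linguistic path
    return synthesize(target_text, affect)             # production in the target language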


Bio:

      Masato Akagi received his B.E. from the Nagoya Institute of Technology in 1979, and his M.E. and Ph.D. Eng. from the Tokyo Institute of Technology in 1981 and 1984, respectively. He joined the Electrical Communication Laboratories of Nippon Telegraph and Telephone Corporation (NTT) in 1984. From 1986 to 1990, he worked at the ATR Auditory and Visual Perception Research Laboratories. Since 1992 he has been on the faculty of the School of Information Science at JAIST, where he is now a full professor. His research interests include speech perception, the modeling of speech perception mechanisms in human beings, and the signal processing of speech. He is a member of the Institute of Electronics, Information and Communication Engineers (IEICE) of Japan, the Acoustical Society of Japan (ASJ), the IEEE, the Acoustical Society of America (ASA), and the International Speech Communication Association (ISCA), and he served as president of the ASJ from 2011 to 2013. Dr. Akagi received the IEICE Excellent Paper Award from the IEICE in 1987, the Best Paper Award from the Research Institute of Signal Processing in 2009, and the Sato Prize for Outstanding Papers from the ASJ in 1998, 2005, 2010, and 2011.