On December 27, the 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) announced the acceptance of Qifu Technology's latest research on speech emotion computing, "MS-SENET: Enhancing Speech Emotion Recognition Through Multi-Scale Feature Fusion with Squeeze-and-Excitation Blocks".
The team proposed a new network structure, MS-SENET, which efficiently extracts, selects, and weights multi-scale spatial and temporal features, then fuses them with the original information to obtain stronger speech emotion representation vectors.
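The idea of extracting features at several temporal scales at once can be illustrated with a minimal NumPy sketch. The averaging kernels below are illustrative stand-ins for the learned convolutional kernels described in the paper; the function name and kernel sizes are assumptions, not the authors' code.

```python
import numpy as np

def multi_scale_features(signal, kernel_sizes=(3, 5, 9)):
    """Convolve one 1-D signal with kernels of several widths and stack the
    results, so that short-range and long-range patterns coexist in one
    feature map (toy stand-in for learned multi-scale conv kernels)."""
    feats = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k  # placeholder for a learned kernel
        feats.append(np.convolve(signal, kernel, mode="same"))
    return np.stack(feats)  # shape: (num_scales, T)

# Toy input: 64 samples of a sine wave
x = np.sin(np.linspace(0, 4 * np.pi, 64))
f = multi_scale_features(x)
print(f.shape)  # (3, 64)
```

In the actual network these per-scale feature maps would be produced by trainable convolutions and then fused with the original input, rather than simply stacked.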
Affective computing is an interdisciplinary research field spanning computer science, psychology, and linguistics; its main goal is to enable computers to recognize and understand human emotional states by analyzing the emotional information in speech signals. The mainstream industry practice is to classify emotions from multimodal information such as audio and text, but the team believes that the underlying emotional characteristics of human speech are universal and can transcend specific languages and text content.
Qifu Technology has developed its own MS-SENET audio affective computing network framework.
Based on this, the team proposed the MS-SENET framework, which improves emotional representation learning on speech signals by suppressing the extraction of large numbers of irrelevant acoustic features and by fusing local frequency features with long-term temporal features. MS-SENET extracts multi-scale spatiotemporal features using convolutional kernels of different sizes and introduces squeeze-and-excitation (SE) modules to capture these multi-scale features effectively. At the same time, skip connections and spatial dropout layers allow the model's depth to be increased while preventing overfitting, further improving the expressive power of the affective computing model.
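The squeeze-and-excitation mechanism named in the paper's title can be sketched in a few lines of NumPy: the "squeeze" pools each channel to a single descriptor, the "excitation" passes those descriptors through a small bottleneck network to produce per-channel weights in (0, 1), and the original features are rescaled by those weights. The weight matrices here are random for illustration; in the real model they are learned, and this is a simplified sketch rather than the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(feature_map, w1, w2):
    """Squeeze-and-excitation on a (channels, time) feature map.
    Squeeze: global average pooling over time -> one value per channel.
    Excitation: bottleneck MLP (ReLU then sigmoid) -> channel weights in (0, 1).
    Scale: reweight each channel of the original features."""
    z = feature_map.mean(axis=1)                  # squeeze: (C, T) -> (C,)
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))     # excitation: C -> C/r -> C
    return feature_map * s[:, None]               # scale: broadcast over time

# Toy example: 8 channels, 16 time steps, reduction ratio 4
rng = np.random.default_rng(0)
C, T, r = 8, 16, 4
x = rng.standard_normal((C, T))
w1 = rng.standard_normal((C // r, C)) * 0.1  # illustrative "learned" weights
w2 = rng.standard_normal((C, C // r)) * 0.1
y = se_block(x, w1, w2)
print(y.shape)  # (8, 16)
```

Because the channel weights are squashed into (0, 1) by the sigmoid, the block can only attenuate channels, which is how it selects informative feature scales and suppresses irrelevant ones.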
In addition, the team evaluated MS-SENET on six multilingual datasets covering different scenarios: the speech emotion dataset of the Institute of Automation, Chinese Academy of Sciences (CASIA), the Berlin Database of Emotional Speech (EMO-DB), the Italian emotional speech dataset (EMOVO), the Interactive Emotional Dyadic Motion Capture database (IEMOCAP), the Surrey Audio-Visual Expressed Emotion dataset (SAVEE), and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Compared with the previous state of the art (SOTA, the best-performing method on a given task), MS-SENET improves UA and WA by 1.31% and 1.61%, and it maintains excellent emotion recognition performance even with more emotion categories and smaller data volumes.
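UA (unweighted accuracy) and WA (weighted accuracy) are the standard metrics in speech emotion recognition; the distinction matters because emotion datasets are usually class-imbalanced. A short sketch of how they differ (illustrative, not the paper's evaluation code):

```python
import numpy as np

def wa_ua(y_true, y_pred):
    """WA (weighted accuracy): overall fraction correct, so frequent classes
    dominate. UA (unweighted accuracy): mean of per-class recalls, so each
    emotion category counts equally regardless of its size."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    wa = float(np.mean(y_true == y_pred))
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    ua = float(np.mean(recalls))
    return wa, ua

# Imbalanced toy labels: class 0 is frequent, class 1 is rare
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
wa, ua = wa_ua(y_true, y_pred)
print(round(wa, 3), round(ua, 3))  # 0.833 0.75
```

Here the single error on the rare class barely moves WA but pulls UA down sharply, which is why both numbers are reported.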
According to reports, Qifu Technology's research on speech emotion computing is not only a theoretical breakthrough but has also been put into practical use. In the post-loan complaint project, for example, abnormal-emotion monitoring was applied to real business operations for the first time: by analyzing the recorded calls of high-risk customers one by one, customers showing abnormal emotions were flagged in time so that the relevant staff could intervene promptly. Experimental results showed that the complaint rate of the model group was 4 percentage points (absolute) lower than that of the control group.