Another important task of the H2020 SPRING project is to advance what we call “Multi-modal Affect and Robot Acceptance Analysis”. In a recently released document we present frameworks for emotion recognition, gaze target detection, and automatic social acceptance detection, with qualitative and quantitative results (see Deliverable D4.5 for details).

Facial expressions are essential to nonverbal communication and are major indicators of human emotions. Effective automatic Facial Emotion Recognition (FER) systems can facilitate the comprehension of an individual’s intentions and prospective behaviors in Human-Computer and Human-Robot Interaction. Facial masks exacerbate the occlusion issue, since they cover a significant portion of the face, including the highly informative mouth area from which positive and negative emotions can be differentiated. Moreover, the efficacy of FER is largely contingent upon the supervised learning paradigm, which necessitates costly and laborious data annotation. Our study centers on using the reconstruction capability of a Convolutional Residual Autoencoder to differentiate between positive and negative emotions. The proposed approach employs Unsupervised Feature Learning and takes facial images of individuals with and without masks as input. Our study emphasizes the transferability of the proposed approach across domains, compared with current state-of-the-art fully supervised methods. A comprehensive experimental evaluation demonstrates the superior transferability of the proposed approach, highlighting the effectiveness of unsupervised feature learning. While outperforming more complex methods in some scenarios, the proposed approach retains a relatively low computational cost.
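
As a rough illustration of this idea, the sketch below builds a small convolutional residual autoencoder in PyTorch and trains it with a pure reconstruction loss on face crops, so that the bottleneck features can later feed a lightweight positive/negative classifier. The layer sizes, residual-block design, and 64x64 input resolution are illustrative assumptions, not the architecture from the deliverable.

```python
# Minimal sketch of a convolutional residual autoencoder for unsupervised
# feature learning on (masked or unmasked) face crops. Layer sizes and the
# residual-block design are assumptions for illustration only.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with an identity skip connection."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))


class ConvResidualAutoencoder(nn.Module):
    """Encoder-decoder trained only with a reconstruction loss (no labels)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(inplace=True),
            ResidualBlock(32),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(inplace=True),
            ResidualBlock(64),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 16 -> 32
            nn.ReLU(inplace=True),
            ResidualBlock(32),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),   # 32 -> 64
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)              # latent features, reusable downstream
        return self.decoder(z), z


# Unsupervised training step: reconstruct face crops, no emotion labels needed.
model = ConvResidualAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

faces = torch.rand(8, 3, 64, 64)         # stand-in batch of 64x64 RGB face crops
reconstruction, features = model(faces)
loss = criterion(reconstruction, faces)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In a sketch like this, either the bottleneck features or the per-image reconstruction error could serve as the unsupervised representation from which positive and negative emotions are separated.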

Furthermore, our framework for emotion recognition also incorporates information from audio-based emotion recognition, where a single-microphone speech emotion recognition algorithm estimates the emotions of people talking with ARI. Gaze target detection aims to predict the location in the image that a person is looking at, together with the probability that the gaze falls outside the scene. Several works have tackled this task by regressing a gaze heatmap centered on the gaze location; however, they overlooked decoding the relationship between the people and the gazed objects. We propose a Transformer-based architecture that automatically detects objects in the scene and builds associations between every head and the gazed head or object, resulting in a comprehensive, explainable gaze analysis composed of the gaze target area, the gaze pixel point, and the class and image location of the gazed object. Evaluated on in-the-wild benchmarks, our method achieves state-of-the-art results on all metrics.
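
To make the idea of query-based joint prediction concrete, here is a loose, simplified sketch of a DETR-style decoder in which learned queries attend to image tokens and each query predicts an object class and box, a gaze point, a coarse gaze heatmap, and an in/out-of-frame probability. The toy patch-embedding backbone, query count, and head dimensions are assumptions for illustration and do not reproduce the proposed architecture.

```python
# Loose sketch of a DETR-style gaze target detector: learned queries attend to
# image tokens, and each query jointly predicts an object box/class, a gaze
# point, a gaze heatmap, and an in/out-of-frame probability.
import torch
import torch.nn as nn


class GazeTargetTransformer(nn.Module):
    def __init__(self, d_model=256, num_queries=20, num_classes=80):
        super().__init__()
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)  # toy patch embedding
        self.queries = nn.Embedding(num_queries, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
        # Per-query prediction heads
        self.class_head = nn.Linear(d_model, num_classes + 1)   # object class (+ "no object")
        self.box_head = nn.Linear(d_model, 4)                   # gazed-object box (cx, cy, w, h)
        self.gaze_point_head = nn.Linear(d_model, 2)            # gaze pixel point, normalized
        self.inout_head = nn.Linear(d_model, 1)                 # probability gaze is inside the frame
        self.heatmap_head = nn.Linear(d_model, 64 * 64)         # coarse gaze-target heatmap

    def forward(self, images):
        feats = self.backbone(images)                  # (B, C, H/16, W/16)
        tokens = feats.flatten(2).transpose(1, 2)      # (B, N, C) image tokens
        q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.decoder(q, tokens)                   # queries attend to image tokens
        return {
            "classes": self.class_head(hs),
            "boxes": self.box_head(hs).sigmoid(),
            "gaze_points": self.gaze_point_head(hs).sigmoid(),
            "inout": self.inout_head(hs).sigmoid(),
            "heatmaps": self.heatmap_head(hs).view(hs.size(0), hs.size(1), 64, 64),
        }


model = GazeTargetTransformer()
out = model(torch.rand(2, 3, 224, 224))   # stand-in batch of RGB frames
print(out["gaze_points"].shape)           # (2, 20, 2): one gaze point per query
```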

Social acceptance detection involves identifying and comprehending the level of interaction, involvement, or connection between humans and robots within a specific context. This encompasses the analysis of diverse cues, including verbal and non-verbal communication, gestures, facial expressions, and other social signals, to assess the extent to which a person actively engages with or responds to a robot. To tackle this objective, our proposed method focuses on scrutinizing the gaze behavior of human agents: we use ARI’s gaze target detection module to extract handcrafted features, an approach that has shown promising results in analyzing multi-party conversations.
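
As an illustrative sketch (not the exact features from the deliverable), the snippet below shows how handcrafted engagement features could be computed from per-frame gaze-target outputs over a short time window; the feature names and the GazeFrame structure are assumptions.

```python
# Illustrative handcrafted engagement features derived from per-frame
# gaze-target outputs (which target a person looks at, plus the in-frame
# probability). Feature names and windowing are assumptions, not the
# deliverable's feature set.
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class GazeFrame:
    target: str        # e.g. "robot", "person_1", "object", "out_of_frame"
    inout_prob: float  # probability that the gaze falls inside the frame


def gaze_features(frames: List[GazeFrame]) -> dict:
    """Summarize a person's gaze behavior over a time window."""
    n = len(frames)
    targets = [f.target for f in frames]
    counts = Counter(targets)
    switches = sum(1 for a, b in zip(targets, targets[1:]) if a != b)
    return {
        "robot_gaze_ratio": counts["robot"] / n,        # share of time looking at the robot
        "out_of_frame_ratio": counts["out_of_frame"] / n,
        "gaze_switch_rate": switches / n,               # proxy for attention shifts
        "mean_inout_prob": sum(f.inout_prob for f in frames) / n,
        "distinct_targets": len(counts),
    }


# Toy window: mostly looking at the robot, with one glance at another person.
window = [GazeFrame("robot", 0.95)] * 8 + [GazeFrame("person_1", 0.9)] * 2
print(gaze_features(window))
```

Features of this kind, aggregated over conversation windows, are what a downstream classifier could use to rate how engaged each participant is with the robot.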

In the last Deliverable of WP4 (focused on automated human behavior understanding), we describe:

  •  the framework for emotion recognition using image and audio signals;
  •  the proposed module for gaze target detection;
  •  the social acceptance module.