Modules

Recordings, features, annotations, event timings (face and speech detection), and metadata are provided for the 23 participants of the training and development partitions, organised into modules. There are 12 modules in total, split over 5 main folders: Annotation (1), Audio (4), Biosignals (2), Metadata (1) and Video (4). The reader is referred to the following publications for full details: introduction of the database (Face & Gestures 2013), asynchronous prediction with LSTM-RNN (Pattern Recognition Letters, 2015), and the baseline system of the Audio/Visual Emotion Challenge (ACM MM 2016).

Annotation

This module contains the annotations (socio-affective behaviours) performed by six assistants (three male, three female) using the ANNEMO web-based annotation tool. Data are provided separately for each participant and each assistant, at a frame period of 40 ms (25 Hz), for the affective behaviours (arousal and valence).
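
As a minimal sketch of how the per-rater traces could be combined into a single gold-standard trace (the file names and CSV layout below are assumptions; adapt them to the actual files in this module):

    # Minimal sketch: frame-wise mean of the six raters' arousal traces for
    # one participant. File names and the CSV layout (header row, then
    # time;value) are assumptions -- adapt to the actual Annotation files.
    import csv
    import statistics

    def load_trace(path):
        """Read one rater's trace as a list of floats (one value per 40 ms frame)."""
        with open(path, newline="") as f:
            reader = csv.reader(f, delimiter=";")
            next(reader)  # skip the header row
            return [float(row[1]) for row in reader]

    raters = [load_trace(f"P16_arousal_rater{i}.csv") for i in range(1, 7)]  # hypothetical names
    gold = [statistics.mean(vals) for vals in zip(*raters)]  # average over the six raters
    print(f"{len(gold)} frames at 25 Hz = {len(gold) * 0.04:.1f} s of annotation")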

Audio

This folder contains 4 modules: (1) audio recordings, (2) timings (start/stop) of utterances, (3) probability of voice activity detection (VAD), and (4) acoustic features.
(1) Unidirectional headset microphone, external sound card (44.1 kHz, 16 bits), stored as WAV files (PCM).
(2) Timings of the spoken utterances (1308 in total), provided as start and stop timecodes in a CSV file.
(3) VAD probabilities stored in an ARFF file at a frame period of 40 ms (25 Hz).
(4) Acoustic features (ComParE and eGeMAPS sets, extracted with openSMILE), provided as ARFF files; see the loading sketch below.
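
The ARFF files in modules (3) and (4) can be read with any standard ARFF parser; a minimal sketch using scipy (the file name is hypothetical):

    # Minimal sketch: load an ARFF feature file into a numeric matrix.
    # The file name is hypothetical; scipy reads the dense ARFF format
    # that openSMILE writes.
    import numpy as np
    from scipy.io import arff

    data, meta = arff.loadarff("P16_audio_features.arff")  # hypothetical name
    numeric = [n for n in meta.names() if meta[n][0] == "numeric"]  # skip string/nominal columns
    X = np.array([[row[n] for n in numeric] for row in data], dtype=float)
    print(X.shape)  # (frames, features)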

Biosignals

This folder contains 2 modules: (1) physiological recordings, and (2) features.
(1) Electrocardiogram (ECG) and electrodermal activity (EDA), recorded with a Biopac MP150 unit and stored in a CSV file.
(2) Features given separately for the ECG and the EDA signals; an illustrative sketch follows this list.
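
The distributed feature files already contain derived descriptors; purely as an illustration, a mean heart rate could be estimated from the raw ECG column as follows (the CSV layout, delimiter and 1000 Hz sampling rate are assumptions, not documented specifics):

    # Illustrative only: estimate mean heart rate from a raw ECG trace.
    # The CSV layout (delimiter ';', ECG in column 1) and the 1000 Hz
    # sampling rate are assumptions.
    import numpy as np
    from scipy.signal import find_peaks

    fs = 1000  # assumed ECG sampling rate (Hz)
    ecg = np.genfromtxt("P16_biosignals.csv", delimiter=";", skip_header=1, usecols=1)  # hypothetical
    peaks, _ = find_peaks(ecg, distance=int(0.4 * fs), prominence=np.std(ecg))  # R-peak candidates
    rr = np.diff(peaks) / fs  # inter-beat intervals (s)
    print(f"mean heart rate: {60.0 / rr.mean():.1f} bpm")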

Metadata

Various data are provided in this module: age, gender and mother tongue of the participants; the communication tool used (with or without emotional feedback from the teammate); self-reports (mood and social behaviour); and the questionnaire given to the participants.

Video

This folder contains 4 modules: (1) video recordings, (2) timing of each video frame, (3) probability of face detection, and (4) visual features.
(1) Logitech webcam; 1080x720 pixels; YUV; variable 30 fps capture; MPEG-4 compression (H.264, q=25) at a constant 25 fps.
(2) Timecode of each video frame given in a CSV file (frame number; time in seconds); see the alignment sketch after this list.
(3) Probability of face detection provided as an ARFF file for each frame.
(4) Probability of 15 emotion-related facial action units, movements of the face along the X, Y and Z axes, and the mean and standard deviation of the optical flow in the face region, provided for each frame in an ARFF file.
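
Because the capture rate is variable, the timecode file from module (2) is needed to align video frames with the 25 Hz annotation grid; a minimal sketch (the file name is hypothetical):

    # Minimal sketch: for each 40 ms annotation frame, find the video frame
    # whose timecode is at or just after it. The file name is hypothetical;
    # the timecode file stores 'frame number ; time in seconds'.
    import numpy as np

    tc = np.genfromtxt("P16_video_timecodes.csv", delimiter=";", skip_header=1)
    frame_no, t = tc[:, 0].astype(int), tc[:, 1]
    ann_times = np.arange(0.0, t[-1], 0.04)  # 25 Hz annotation grid (40 ms step)
    idx = np.clip(np.searchsorted(t, ann_times), 0, len(t) - 1)
    print(frame_no[idx][:10])  # video frame matched to the first ten annotation frames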