The ChallengAV corpus is an audiovisual corpus designed for multimodal speech filtering, developed to provide a wider variety of data than is present in many other corpora. Many corpora are limited to consistently high-quality audio and visual data, which makes it difficult to test scenarios involving inconsistent data, such as a speaker turning their head or natural conversational speech.
This corpus was recorded specifically to test the performance of a fuzzy logic based speech processing system. The system was intended to work in scenarios with noisy audio data and with visual noise, such as that caused by speakers turning their heads, covering their mouths, and moving naturally, as well as camera glitches. It was felt that existing corpora did not cover this requirement adequately. In addition, many speech corpora do not consider the Lombard Effect in natural speech, which is very important in audiovisual communication.
To provide a diverse range of audiovisual speech data, and to provide challenging data that the pre-existing corpora used in previous work (GRID and VidTIMIT) fail to supply, volunteers were asked to perform two tasks. The first was a reading task, in which they read either a short story or a news article and were recorded reading for one minute in a quiet environment. The second was a conversational task, in which volunteers were encouraged to speak in a more natural manner. Volunteers were recorded in pairs at a table, facing each other, with one speaker recorded at a time for one minute. That is, while the speakers were facing each other and making conversation, the camera was pointed at only one of them. This allowed more natural and relaxed speech, and the volunteers were told that they could move freely and did not have to look directly into the camera at all times. This resulted in noisier visual data, including head turning, speakers placing their hands over their mouths, and blurring in individual frames due to motion. As this was a conversation rather than continuous speech from a single recorded speaker, there were occasional silences, as well as speech from the other participant in the conversation. This provided challenging data of a kind on which the system had not been trained.
The data was filmed in a quiet audio environment, with the speakers sitting across the table from each other. However, to take account of other potential uses, the same two tasks were also recorded for each speaker with a mix of music played intermittently at different volumes. This was done to take account of the Lombard Effect, the tendency of speakers to change how they speak (for example, by raising their voice) in the presence of background noise.
To record volunteers carrying out the tasks described above, a single camera with an integrated microphone was used. Due to equipment limitations, the visual data was recorded at 15 fps at a resolution of 640 x 480 pixels. For each speaker, there were two minutes of initial raw data available. The final corpus contained data from eight speakers, four male and four female. Six of the eight speakers spoke English (five with a Scottish accent and one with an English accent), and two were recorded speaking Bulgarian. For each speaker, four minutes of raw data were theoretically available: one minute each of conversation and of reading, in both quiet and noisy audio environments.
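As a rough guide to the size of the corpus, the recording parameters above can be turned into nominal frame and clip counts. The following is a minimal sketch, not part of any corpus tooling: the variable names are illustrative, and the totals are upper bounds that assume every speaker has all four one-minute recordings, which the text describes only as "theoretically available".

```python
# Nominal corpus size implied by the recording parameters stated in the text.
# All names are illustrative; real clips may be shorter or longer than
# exactly one minute, so treat these figures as approximate upper bounds.
FPS = 15                 # video frame rate (frames per second)
CLIP_SECONDS = 60        # each task was recorded for about one minute
NUM_SPEAKERS = 8
CLIPS_PER_SPEAKER = 4    # reading and conversation, each quiet and noisy

frames_per_clip = FPS * CLIP_SECONDS
total_clips = NUM_SPEAKERS * CLIPS_PER_SPEAKER
total_minutes = total_clips * CLIP_SECONDS // 60
total_frames = total_clips * frames_per_clip

print(f"{frames_per_clip} frames per clip")
print(f"{total_clips} clips, about {total_minutes} minutes, {total_frames} frames")
```

Under these assumptions, each one-minute clip contains 900 frames, and the full corpus comes to 32 clips and 28,800 frames.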
Good quality example frames from all speakers in the corpus.
Examples of blurring and head turning, producing variable visual data.
Examples of blurring and head turning, and also the face being covered, producing variable visual data.
The corpus is now available for download (updated 13/06/13). To gain access, please email Dr Andrew Abel at aka (at) cs.stir.ac.uk, using the subject line "ChallengAV Download". The corpus consists of a number of large *.avi files, one for each reading or conversational task per speaker.
|Speaker 1||Reading Quiet||Reading Noisy||Conversation Quiet||Conversation Noisy|
|Speaker 2||Reading Quiet||Reading Noisy||Conversation Quiet||Conversation Noisy|
|Speaker 3||Reading Quiet||Reading Noisy||Conversation Quiet||Conversation Noisy|
|Speaker 4||Reading Quiet||Reading Noisy||Conversation Quiet||Conversation Noisy|
|Speaker 5||Reading Quiet||Reading Noisy||Conversation Quiet||Conversation Noisy|
|Speaker 6||Reading Quiet||Reading Noisy||Conversation Quiet||Conversation Noisy|
|Speaker 7||Reading Quiet||Reading Noisy||Conversation Quiet||Conversation Noisy|
|Speaker 8||Reading Quiet||Reading Noisy||Conversation Quiet||Conversation Noisy|
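When working through the downloads, it can help to enumerate the expected recordings so that none are missed. The sketch below rebuilds the labels in the table above; it deliberately produces labels only, since the actual file names are not given here, and the function name `recording_labels` is a hypothetical helper introduced for illustration.

```python
from itertools import product

# Labels taken from the download table; actual file names are not specified
# in the text, so this sketch enumerates labels only.
SPEAKERS = range(1, 9)
TASKS = ("Reading", "Conversation")
ENVIRONMENTS = ("Quiet", "Noisy")


def recording_labels():
    """Return (speaker, recording) label pairs in the same order as the table."""
    return [
        (f"Speaker {speaker}", f"{task} {environment}")
        for speaker, task, environment in product(SPEAKERS, TASKS, ENVIRONMENTS)
    ]


labels = recording_labels()
for speaker, recording in labels:
    print(f"{speaker}: {recording}")
```

A list like this can be ticked off against the files received, for example by pairing each label with the path of the corresponding downloaded file.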