Audiovisual data for Sound-to-Image (S2I) translation experiments.
1-s Mel spectrogram segments and their respective central frames, extracted from a video track of the class Rail transport.
This repository provides the audiovisual data required for training and testing the models available at https://github.com/leofanzeres/s2i.git. The datasets consist of log-Mel spectrograms (or the extracted audio embeddings) computed from 1-s audio segments, together with the respective frames from the original video tracks. The complete VEGAS dataset was made available by Zhou et al. as part of their study on crossmodal translation. It consists of videos of up to 10 s in duration, distributed across 10 sound classes, of which we use five: Baby crying, Dog, Rail transport, Fireworks, and Water flowing.
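As a rough illustration of how 1-s audio segments can be paired with the central frames of the corresponding video, the following minimal Python sketch splits a waveform into fixed-length segments and computes the frame index at each segment midpoint. The function names, the non-overlapping segmentation, the 16 kHz sample rate, and the 30 fps frame rate are all assumptions made for this example; they are not taken from the s2i code and may differ from the actual preprocessing. The log-Mel computation itself is not shown.

```python
def split_into_segments(samples, sample_rate, segment_seconds=1.0):
    """Split a mono waveform (a sequence of samples) into non-overlapping
    segments of fixed duration. A trailing remainder shorter than one
    segment is dropped. Non-overlapping windows are an assumption."""
    seg_len = int(round(sample_rate * segment_seconds))
    n_segments = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]


def central_frame_indices(n_segments, fps, segment_seconds=1.0):
    """Return, for each segment, the zero-based index of the video frame
    that is on screen at the temporal midpoint of that segment."""
    return [int((i + 0.5) * segment_seconds * fps) for i in range(n_segments)]


# Hypothetical example: a 10-s clip sampled at 16 kHz with 30 fps video.
sample_rate = 16000  # assumed value, for illustration only
fps = 30             # assumed value, for illustration only
waveform = [0.0] * (10 * sample_rate)

segments = split_into_segments(waveform, sample_rate)
frames = central_frame_indices(len(segments), fps)
```

Each entry of `segments` would then be converted to a log-Mel spectrogram and paired with the video frame at the matching index in `frames`.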
This dataset is made available under a Creative Commons Attribution 4.0 International License.