DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation
Abstract
Audio generation has attracted significant attention. Despite remarkable improvements in audio quality, existing models largely overlook the evaluation of diversity, partly because there is no systematic framework for sound-class diversity and no matching dataset. To address these issues, we propose DiveSound, a novel framework for constructing multimodal datasets with an in-class diversified taxonomy, assisted by large language models. Since both textual and visual information can guide diverse generation, DiveSound leverages multimodal contrastive representations during data construction. The framework is highly autonomous and easy to scale up. We provide a text-audio-image aligned diversity dataset in which each sound event class tag has, on average, 2.42 subcategories. Text-to-audio experiments on the constructed dataset show a substantial increase in diversity when generation is guided by visual information.
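As a rough illustration of the LLM-assisted taxonomy construction described above, the sketch below asks a chat-completion model to propose in-class subcategories for a sound event class. The prompt wording, the model name, and the propose_subcategories helper are our own assumptions for illustration, not the exact prompts or configuration used by DiveSound.

```python
# Minimal sketch: asking an LLM to propose subcategories for a sound class.
# Prompt text and model name are illustrative assumptions, not DiveSound's
# actual configuration. Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def propose_subcategories(sound_class: str, n: int = 3) -> list[str]:
    """Ask the LLM for n acoustically distinct subcategories of a sound class."""
    prompt = (
        f"List {n} acoustically distinct subcategories of the sound class "
        f"'{sound_class}', one per line, without explanations."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    return [line.strip("-• ").strip() for line in text.splitlines() if line.strip()]

print(propose_subcategories("dog"))
# e.g. ['Small dog like Chihuahua', 'Medium dog like Bulldog', 'Large dog like German Shepherd']
```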
Figure 1: Overview of the automated matching process for text-audio data pairs. The example uses the new class "dog" to demonstrate how an audio clip is matched with its corresponding text.
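The matching step in Figure 1 can be pictured as nearest-neighbor assignment in a shared audio-text embedding space. The snippet below is a minimal sketch of that idea using cosine similarity; the embed_audio and embed_text stubs stand in for a real audio-text contrastive encoder (e.g. a CLAP-style model) and are assumptions on our part so the example runs standalone.

```python
# Minimal sketch of automated audio-to-subcategory matching.
# embed_audio / embed_text are placeholders for a real contrastive encoder;
# here they return random vectors so the snippet is self-contained.
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # embedding dimensionality (assumed)

def embed_audio(path: str) -> np.ndarray:
    return rng.normal(size=DIM)  # placeholder for a contrastive audio embedding

def embed_text(caption: str) -> np.ndarray:
    return rng.normal(size=DIM)  # placeholder for a contrastive text embedding

def match_subcategory(audio_path: str, subcategories: list[str]) -> str:
    """Assign an audio clip to the subcategory with the highest cosine similarity."""
    a = embed_audio(audio_path)
    a = a / np.linalg.norm(a)
    sims = [float(a @ (t := embed_text(c)) / np.linalg.norm(t)) for c in subcategories]
    return subcategories[int(np.argmax(sims))]

dog_subcategories = [
    "Small dog like Chihuahua",
    "Medium dog like Bulldog",
    "Large dog like German Shepherd",
]
print(match_subcategory("dog_bark_001.wav", dog_subcategories))
```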
Description of taxonomy
Below are the dataset statistics obtained with our methodology, along with the name of each category.
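As a concrete reading of the reported statistic (an average of 2.42 subcategories per class tag), the toy snippet below shows how such an average is computed from a taxonomy stored as a class-to-subcategory mapping. The example classes mirror the demo entries further down this page; the resulting value is illustrative only and does not reproduce the full DiveSound taxonomy.

```python
# Toy taxonomy keyed by sound event class; subcategory lists mirror the demo
# entries on this page and are illustrative, not the full DiveSound taxonomy.
taxonomy = {
    "dog":   ["Small dog like Chihuahua", "Medium dog like Bulldog", "Large dog like German Shepherd"],
    "bird":  ["Sparrow chirping", "Parrot squawking", "Eagle screeching"],
    "cat":   ["Domestic short hair cat", "Siberian cat"],
    "goose": ["Canada goose", "Domestic goose"],
}

avg_subcategories = sum(len(v) for v in taxonomy.values()) / len(taxonomy)
print(f"average subcategories per class: {avg_subcategories:.2f}")  # 2.50 for this toy subset
```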
Comparison between different generation systems
Dog
Baseline system (without subcategory)
Text-Guided
Small dog like Chihuahua
Medium dog like Bulldog
Large dog like German Shepherd
Image-Guided



Bird
Baseline system (without subcategory)
Text-Guided
Sparrow chirping
Parrot squawking
Eagle screeching
Image-Guided



Cat
Baseline system (without subcategory)
Text-Guided
Domestic short hair cat
Siberian cat
Image-Guided


Goose
Baseline system (without subcategory)
Text-Guided
Canada goose
Domestic goose
Image-Guided


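The three rows under each class above correspond to three ways of conditioning the same text-to-audio model: the bare class tag (baseline), a subcategory caption (text-guided), and a reference image (image-guided). The sketch below is a rough illustration of how such conditioning could be set up; generate_audio, embed_image, and their signatures are placeholders we introduce for illustration, not the DiveSound implementation.

```python
# Sketch of the three conditioning modes compared above, using placeholder
# functions; generate_audio / embed_image are illustrative assumptions.
from typing import Optional
import numpy as np

rng = np.random.default_rng(0)

def embed_image(path: str) -> np.ndarray:
    return rng.normal(size=512)  # stand-in for a contrastive image embedding (e.g. CLIP-style)

def generate_audio(prompt: str, visual_embedding: Optional[np.ndarray] = None) -> np.ndarray:
    return rng.normal(size=16000)  # stand-in for a text-to-audio model; returns a dummy waveform

# Baseline: the class tag alone, with no subcategory information.
baseline = generate_audio("dog")

# Text-guided: the subcategory caption narrows the prompt.
text_guided = generate_audio("Small dog like Chihuahua")

# Image-guided: a reference image supplies the subcategory via its embedding.
image_guided = generate_audio("dog", visual_embedding=embed_image("chihuahua.jpg"))
```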