DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation



Abstract

Audio generation has attracted significant attention. Despite remarkable improvements in audio quality, existing models overlook diversity evaluation. This is partly due to the lack of a systematic framework for sound class diversity and of a matching dataset. To address these issues, we propose DiveSound, a novel framework for constructing multimodal datasets with an in-class diversified taxonomy, assisted by large language models. As both textual and visual information can be used to guide diverse generation, DiveSound leverages multimodal contrastive representations in data construction. Our framework is highly autonomous and can be easily scaled up. We provide a text-audio-image aligned diversity dataset whose sound event class tags have an average of 2.42 subcategories. Text-to-audio experiments on the constructed dataset show that guidance from visual information substantially increases diversity.

Tips

  • We propose DiveSound, a novel taxonomy that properly defines sound diversity and sub-categorizes within-class variety, accompanied by an automatic pipeline that aligns high-quality text-audio-image data.
  • Both subjective and objective evaluations demonstrate that incorporating such a taxonomy-based dataset can enhance generated sound quality and diversity, with the visual modality being the most helpful in guiding audio generation.


  • Figure 1: Overview of the automated matching process for text-audio data pairs. The example uses the new class dog to show how an audio clip is matched with its corresponding text.
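
The matching of an audio clip to a subcategory text can be illustrated with a minimal sketch. It assumes the audio clip and each subcategory description have already been embedded in a shared contrastive space; the toy 3-dimensional vectors, the `match_subcategory` helper, and the `threshold` value are hypothetical and are not taken from the DiveSound pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def match_subcategory(audio_emb, text_embs, threshold=0.3):
    """Return (label, score) for the most similar subcategory text.

    If even the best score falls below `threshold` (an assumed,
    hypothetical cutoff), the clip is left unmatched: (None, score).
    """
    scores = {label: cosine(audio_emb, emb) for label, emb in text_embs.items()}
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None, scores[best]
    return best, scores[best]

# Toy embeddings standing in for real contrastive text embeddings (assumption).
text_embs = {
    "Small dog like Chihuahua": [1.0, 0.0, 0.0],
    "Medium dog like Bulldog": [0.0, 1.0, 0.0],
    "Large dog like German Shepherd": [0.0, 0.0, 1.0],
}
audio_emb = [0.1, 0.2, 0.9]  # toy embedding of one dog audio clip

label, score = match_subcategory(audio_emb, text_embs)
print(label)  # Large dog like German Shepherd
```

Rejecting low-similarity clips instead of forcing a match is one plausible way to keep the aligned pairs clean; whether the actual pipeline filters in this way is not stated on this page.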



    Description of taxonomy

    The dataset statistics obtained with our methodology, along with the name of each category, are listed here.
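
    As a rough illustration of how the subcategory statistic can be computed, the sketch below stores a taxonomy as a mapping from class to subcategories and averages the subcategory counts. The `taxonomy` dictionary holds only the four classes demonstrated on this page, so its average differs from the 2.42 reported for the full dataset; the variable and function names are hypothetical.

```python
# Only the four classes shown in the demo below; the full dataset has more.
taxonomy = {
    "Dog": [
        "Small dog like Chihuahua",
        "Medium dog like Bulldog",
        "Large dog like German Shepherd",
    ],
    "Bird": ["Sparrow chirping", "Parrot squawking", "Eagle screeching"],
    "Cat": ["Domestic short hair cat", "Siberian cat"],
    "Goose": ["Canada goose", "Domestic goose"],
}

def average_subcategories(tax):
    """Mean number of subcategories per sound event class."""
    return sum(len(subs) for subs in tax.values()) / len(tax)

avg = average_subcategories(taxonomy)
print(avg)  # 2.5 for this four-class subset
```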



    Comparison between different generation systems


    Dog

    Baseline system-without subcategory

    Text-Guided

    Small dog like Chihuahua

    Medium dog like Bulldog

    Large dog like German Shepherd


    Image-Guided
    Image Guided 1
    Image Guided 2
    Image Guided 3

    Bird

    Baseline system-without subcategory

    Text-Guided

    Sparrow chirping

    Parrot squawking

    Eagle screeching


    Image-Guided
    Image Guided 1
    Image Guided 2
    Image Guided 3

    Cat

    Baseline system-without subcategory

    Text-Guided

    Domestic short hair cat

    Siberian cat


    Image-Guided
    Image Guided 1
    Image Guided 2

    Goose

    Baseline system-without subcategory

    Text-Guided

    Canada goose

    Domestic goose


    Image-Guided
    Image Guided 1
    Image Guided 2

    Page updated on 18 March 2024