𝄢 BASS

Benchmarking Audio LMs for Musical Structure and Semantic Reasoning

Min Jang* Orevaghene Ahia* Nazif Tamer Sachin Kumar Yulia Tsvetkov Noah A. Smith

* Equal contribution.

University of Washington, Ohio State University, Allen Institute for AI

Main Figure

Overview of the tasks included in BASS, designed to evaluate audio LMs on musical understanding requiring long-term structure or hierarchical reasoning. It contains twelve tasks across four categories: structural segmentation, structural lyrics transcription, musicological analysis, and artist collaboration.

Abstract

Music understanding is a complex task that often requires reasoning over both structural and semantic elements of audio. We introduce BASS, a benchmark designed to evaluate music understanding and reasoning in audio language models across four broad categories: structural segmentation, structural lyrics transcription, musicological analysis, and artist collaboration. BASS comprises 2,658 questions spanning 12 tasks and 1,993 unique songs, covering over 138 hours of music from a wide range of genres, crafted to assess musicological knowledge and reasoning in real-world scenarios. We evaluate 14 open-source and frontier multimodal LMs, finding that even state-of-the-art models struggle on higher-level reasoning tasks such as structural segmentation and artist collaboration, while performing best on lyrics transcription. Our analysis reveals that current models leverage linguistic priors effectively but remain limited in reasoning over structural, vocal, and musicological attributes of music. BASS provides an evaluation framework with broad applications in music recommendation and search, and it has the potential to guide the development of audio LMs.

Metrics

The metrics displayed in the leaderboard and results table are normalized to lie between 0 and 100%. For the Artist Collaboration and Musicological Analysis tasks, exact match accuracy (EMA) is normalized against random chance, so the score reflects improvement over random guessing. For the Structural Lyrics Transcription tasks, we report the Inverted Word Error Rate (IWER), which bounds the score between 0 and 100%; higher scores indicate better performance. The metric for Structural Segmentation, Intersection over Union (IoU), already lies between 0 and 100%, so no further normalization is applied.
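For concreteness, here is a minimal Python sketch of these three normalizations. The page does not spell out the exact formulas, so the chance-normalization form (accuracy minus chance, rescaled), the clipping inside IWER, and the interval-based IoU below are plausible conventions assumed for illustration, not the benchmark's confirmed implementation; all function names are hypothetical.

```python
def chance_normalized_ema(accuracy: float, chance: float) -> float:
    """EMA rescaled so 0 = random guessing and 100 = perfect.

    Assumes the common normalization (acc - chance) / (1 - chance);
    the page does not state the exact formula used.
    """
    return max(0.0, (accuracy - chance) / (1.0 - chance)) * 100.0


def inverted_wer(wer: float) -> float:
    """Inverted WER: clips WER (which can exceed 1.0) into [0, 1]
    and flips it so that higher scores are better."""
    return max(0.0, 1.0 - wer) * 100.0


def interval_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """IoU of two time intervals (start, end) in seconds, one plausible
    way to score predicted vs. reference section boundaries."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return (inter / union) * 100.0 if union > 0 else 0.0


# Example: a 4-way multiple-choice task has chance = 0.25.
print(chance_normalized_ema(0.62, 0.25))          # ~49.3
print(inverted_wer(0.35))                         # 65.0
print(interval_iou((30.0, 75.0), (28.0, 70.0)))   # ~85.1
```

Clipping at zero ensures that below-chance accuracy or a WER above 100% yields a score of 0 rather than a negative leaderboard entry.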

Model Performance — Leaderboard

Models are sorted by overall average score, in descending order.


# Model Size Average (%)

Full Results Table

# Model Structural Segmentation Structural Lyrics Transcription Musicological Analysis Artist Collaboration Average

Data Statistics

Question Distribution

Distribution of 2,658 questions across task categories and individual tasks

Audio Statistics

Number of songs and per-song audio duration (average, minimum, and maximum, in minutes) for each task

Category Task # Songs Avg. Min. Max.
Structural Lyrics Transcription Full Structural Lyrics Transcription 665 3.73 1.60 11.72
Section Structural Lyrics Transcription 170 3.97 2.17 8.44
Overall 884 3.78 1.60 11.72
Structural Segmentation Full Structural Segmentation 74 3.06 1.70 10.78
Section Structural Segmentation 224 3.47 1.93 9.31
Overall 298 3.37 1.70 10.78
Artist Collaboration Artist Counting 229 3.92 2.37 6.97
Artist Duration 327 3.83 1.34 6.97
Artist Localization 247 3.83 2.37 6.13
Artist Attribution 64 3.88 1.34 6.97
Overall 327 3.83 1.34 6.97
Musicological Analysis Single-Genre Detection 335 3.15 1.86 9.23
Pairwise-Genre Detection 147 3.77 0.82 6.37
Genre Dominance Ranking 33 3.60 2.58 4.95
Genre Attribution 94 14.14 9.77 19.44
Overall 484 5.57 0.82 19.44