𝄢 BASS
Benchmarking Audio LMs for Musical Structure and Semantic Reasoning
* Equal contribution.
University of Washington, Ohio State University, Allen Institute for AI
Overview of the tasks included in BASS, designed to evaluate audio LMs on musical understanding that requires long-term structural or hierarchical reasoning. It contains twelve tasks across four categories: structural segmentation, structural lyrics transcription, musicological analysis, and artist collaboration.
Abstract
Music understanding is a complex task that often requires reasoning over both structural and semantic elements of audio. We introduce BASS, a benchmark designed to evaluate music understanding and reasoning in audio language models across four broad categories: structural segmentation, lyric transcription, musicological analysis, and artist collaboration. BASS comprises 2,658 questions spanning 12 tasks and 1,993 unique songs, covering over 138 hours of music from a wide range of genres and tracks, crafted to assess musicological knowledge and reasoning in real-world scenarios. We evaluate 14 open-source and frontier multimodal LMs, finding that even state-of-the-art models struggle on higher-level reasoning tasks such as structural segmentation and artist collaboration, while performing best on lyric transcription. Our analysis reveals that current models leverage linguistic priors effectively but remain limited in reasoning over structural, vocal, and musicological attributes. BASS provides an evaluation framework with widespread applications in music recommendation and search, and has the potential to guide the development of audio LMs.
Metrics
The metrics displayed in the leaderboard and table are normalized to the range 0-100%. For the Artist Collaboration and Musicological Analysis tasks, exact match accuracies (EMA) are random-chance normalized, so that scores reflect the improvement over random guessing. For the Structural Lyrics Transcription tasks, we use the Inverted Word Error Rate (IWER), which bounds the WER to 0-100% so that a higher score indicates better performance. The metric for Structural Segmentation, Intersection over Union (IoU), already lies between 0-100%, so no further normalization is applied.
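The two normalizations above can be sketched as follows. Note that the exact formulas are assumptions based on the description, not taken from the BASS paper: we assume chance normalization maps random-chance accuracy to 0 and perfect accuracy to 100, and that IWER clips WER values above 1.0 (possible when insertion errors dominate) to a floor of 0.

```python
def chance_normalized_ema(accuracy: float, chance: float) -> float:
    """Rescale exact-match accuracy so that random chance maps to 0
    and perfect accuracy maps to 100 (assumed formula).
    Scores below chance are clipped to 0."""
    return max(0.0, (accuracy - chance) / (1.0 - chance)) * 100.0

def inverted_wer(wer: float) -> float:
    """Invert word error rate so higher is better (assumed formula).
    WER can exceed 1.0, so the result is clipped to 0 from below."""
    return max(0.0, 1.0 - wer) * 100.0
```

For example, an accuracy of 0.25 on a four-way multiple-choice question (chance = 0.25) normalizes to 0%, while a WER of 0.20 yields an IWER of 80%.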
Model Performance — Leaderboard
Sorted by overall average (desc)
Medals: 🥇 🥈 🥉 for top 3.
| # | Model | Size | Average (%) |
|---|---|---|---|
Full Results Table
| # | Model | Structural Segmentation | Structural Lyrics Transcription | Musicological Analysis | Artist Collaboration | Average |
|---|---|---|---|---|---|---|
Data Statistics
Question Distribution
Distribution of 2,658 questions across task categories and individual tasks
Audio Statistics
| Category | Task | # Songs | Avg. Dur. (min) | Min. (min) | Max. (min) |
|---|---|---|---|---|---|
| Structural Lyrics Transcription | Full Structural Lyrics Transcription | 665 | 3.73 | 1.60 | 11.72 |
| | Section Structural Lyrics Transcription | 170 | 3.97 | 2.17 | 8.44 |
| | Overall | 884 | 3.78 | 1.60 | 11.72 |
| Structural Segmentation | Full Structural Segmentation | 74 | 3.06 | 1.70 | 10.78 |
| | Section Structural Segmentation | 224 | 3.47 | 1.93 | 9.31 |
| | Overall | 298 | 3.37 | 1.70 | 10.78 |
| Artist Collaboration | Artist Counting | 229 | 3.92 | 2.37 | 6.97 |
| | Artist Duration | 327 | 3.83 | 1.34 | 6.97 |
| | Artist Localization | 247 | 3.83 | 2.37 | 6.13 |
| | Artist Attribution | 64 | 3.88 | 1.34 | 6.97 |
| | Overall | 327 | 3.83 | 1.34 | 6.97 |
| Musicological Analysis | Single-Genre Detection | 335 | 3.15 | 1.86 | 9.23 |
| | Pairwise-Genre Detection | 147 | 3.77 | 0.82 | 6.37 |
| | Genre Dominance Ranking | 33 | 3.60 | 2.58 | 4.95 |
| | Genre Attribution | 94 | 14.14 | 9.77 | 19.44 |
| | Overall | 484 | 5.57 | 0.82 | 19.44 |