TinyGiantALM

A Compact Audio-Language Model for Intent-Aware Reasoning under Resource Constraints

Interspeech 2026

Vinh-Thuan Ly


  • University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
    Vietnam National University, Ho Chi Minh City, Vietnam

Abstract

Recent advances in Audio Reasoning rely on massive Large Audio-Language Models (LALMs), which hinders deployment in resource-constrained environments. We introduce TinyGiantALM, a compact, efficiency-oriented 1.5B-parameter alternative. Instead of brute-force scaling, we propose an Instruction-Aware Feature Refinement framework that uses a Query-guided Projector and Semantic Gating to filter acoustic signals based on user intent.

On the MMAR benchmark, TinyGiantALM achieves 46.4% zero-shot accuracy, well above the 7B-13B LALM and LARM baselines (16.0-35.4%). A reasoning gap in logical narrative remains versus 30B+ models, and certain trade-offs exist in overly dense or spatial scenes, but our approach surpasses models up to 8x larger in disentangling mixed-modality environments.

These findings suggest that architectural precision offers a tangible pathway to robust perception capabilities at edge-friendly scales.

Audio Reasoning with Compact Models

Figure: A sample output from TinyGiantALM.

The success of Chain-of-Thought (CoT) has shifted audio research from simple perception toward cognitive reasoning. Current state-of-the-art LALMs rely on massive parameter scaling (7B-30B+) and computationally expensive Reinforcement Learning to bridge the semantic gap. However, these brute-force approaches often falter in complex scenes, where they struggle to disentangle acoustic events without explicit reasoning, and they create significant barriers to deployment on edge devices.

TinyGiantALM correctly infers complex scenarios by synthesizing high-level semantics and successfully applying logic. Instead of passively processing all audio tokens, the model dynamically modulates acoustic features using the textual query, ensuring the Small Language Model (SLM) attends only to task-relevant signals.

Methodology

Figure: Overall architecture of TinyGiantALM.

Triple-stream Frontend

To capture the full spectrum of auditory information, we employ a multi-rate strategy that integrates three frozen pre-trained encoders: a Fine-grained Temporal Stream (Whisper) for linguistic details, an Event-level Stream (HTS-AT) for short-duration acoustics, and a Global Semantic Stream (CLAP) that provides a holistic summary and acts as a global prior.

Query-guided Projector

Our projector maps acoustic features to the LLM space via E-Branchformer blocks that model local-global dependencies. We derive a global User Intent vector from the text instruction to drive a Multi-Head Cross-Attention mechanism, ensuring the model aligns the audio representation with the specific user instruction.
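The intent-driven attention described above can be illustrated with a minimal sketch. It assumes the intent vector acts as the query and the audio frames act as keys and values, uses a single head rather than the multi-head mechanism, and omits the learned projections and the E-Branchformer blocks entirely. The names `query_guided_pool`, `intent` and `audio_frames` are hypothetical and the numbers are toy values, not the model's features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def query_guided_pool(intent, audio_frames):
    """Single-head scaled dot-product cross-attention (illustrative only).

    intent       -- query vector of length d (assumed: derived from the text)
    audio_frames -- list of T frame vectors of length d (keys and values)
    Returns the attention weights over frames and the weighted audio summary.
    """
    d = len(intent)
    scores = [dot(intent, frame) / math.sqrt(d) for frame in audio_frames]
    weights = softmax(scores)
    summary = [
        sum(w * frame[i] for w, frame in zip(weights, audio_frames))
        for i in range(d)
    ]
    return weights, summary


# Toy example: only the first frame is aligned with the intent vector.
intent = [1.0, 0.0, 0.0, 0.0]
audio_frames = [
    [4.0, 0.0, 0.0, 0.0],  # aligned with the intent
    [0.0, 4.0, 0.0, 0.0],  # orthogonal to the intent
    [0.0, 0.0, 4.0, 0.0],  # orthogonal to the intent
]
weights, summary = query_guided_pool(intent, audio_frames)
print([round(w, 3) for w in weights])
```

In this toy setting the frame aligned with the intent receives most of the attention mass, which is the behaviour the projector is meant to encourage: frames irrelevant to the instruction contribute less to what the language model sees.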

Semantic Gating & LLM

To ground features in the global context, we compute a soft gate via a sigmoid projection of the CLAP anchor. The refined embeddings are inserted into the Qwen3 (0.6B) backbone, whose reasoning is organized into explicit steps (Plan, Audio Analysis, Logic, Summary) inside a structured thought block to foster deep logical inference.
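As a rough illustration of the soft gate, the sketch below assumes a channel-wise gate `sigmoid(W c + b)` computed from the CLAP anchor `c` and multiplied element-wise into every frame. The gate granularity, the way it is applied, and the names `semantic_gate` and `apply_gate` are assumptions made for this example; the weights are hand-picked toy values rather than learned parameters.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def semantic_gate(clap_anchor, weight, bias):
    """Compute one gate value per feature channel: sigmoid(W @ c + b)."""
    return [
        sigmoid(sum(w * c for w, c in zip(row, clap_anchor)) + b)
        for row, b in zip(weight, bias)
    ]


def apply_gate(frames, gate):
    """Scale each frame channel-wise; the same gate is reused for all frames."""
    return [[g * x for g, x in zip(gate, frame)] for frame in frames]


# Toy values: a 2-dim anchor gating 3 feature channels.
clap_anchor = [1.0, -1.0]
weight = [
    [5.0, 0.0],  # channel 0 opens when anchor[0] is large
    [0.0, 5.0],  # channel 1 closes when anchor[1] is negative
    [0.0, 0.0],  # channel 2 ignores the anchor (gate stays at 0.5)
]
bias = [0.0, 0.0, 0.0]

gate = semantic_gate(clap_anchor, weight, bias)
frames = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
gated = apply_gate(frames, gate)
print([round(g, 3) for g in gate])
```

Because the gate is soft, channels are attenuated rather than dropped outright, so the downstream model still receives a (scaled) version of every feature.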

Performance vs. Parameters

Evaluating our architecture across multiple paradigms reveals critical advantages. Traditional LALMs (e.g., Qwen2-Audio, SALMONN) exhibit a severe performance collapse in complex scenarios like Mix-Sound-Music tasks. In contrast, TinyGiantALM (1.5B) maintains robust accuracy, outperforming LALM baselines roughly 5x-8x larger and suggesting that architectural priors can compensate for reduced scale.

Figure: Zero-shot MMAR accuracy vs. model size.

Zero-Shot MMAR Benchmark

Table 1: Zero-shot accuracy (%) on MMAR. S: Sound, M: Music, Sp: Speech. TinyGiantALM outperforms the traditional LALMs and the specialized Audio Reasoning models on average accuracy and remains competitive with much larger Omni & Large Language Models (OLMs), trailing only Qwen2.5-Omni (50.4%).

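Single Modality: S, M, Sp. Mixed Modality: S-M, S-Sp, M-Sp.

| Model | Size | S | M | Sp | S-M | S-Sp | M-Sp | Avg (All) |
|---|---|---|---|---|---|---|---|---|
| Large Audio-Language Models (LALMs) | | | | | | | | |
| Flamingo 2 | 7B | 24.9 | 20.8 | 18.2 | 26.6 | 21.9 | 21.9 | 22.4 |
| LTU-AS | 7B | 17.5 | 19.1 | 9.1 | 23.2 | 8.3 | 19.0 | 16.0 |
| GAMA | 8B | 24.3 | 27.9 | 24.8 | 20.8 | 26.5 | 28.1 | 25.4 |
| Qwen2-Audio | 8.4B | 33.3 | 37.3 | 9.1 | 30.5 | 30.0 | 33.2 | 28.9 |
| SALMONN | 13B | 24.3 | 34.7 | 9.1 | 31.2 | 25.0 | 50.0 | 29.0 |
| Large Audio Reasoning Models (LARMs) | | | | | | | | |
| Audio-CoT | 8.4B | 25.2 | 34.0 | 9.1 | 30.7 | 37.5 | 25.0 | 26.9 |
| Audio-Reasoner | 8.4B | 35.8 | 33.0 | 45.5 | 30.5 | 31.3 | 36.8 | 35.4 |
| Omni & Large Language Models (OLMs) | | | | | | | | |
| Baichuan-Omni-1.5 | 11B | 41.2 | 40.5 | 36.4 | 39.0 | 40.7 | 47.6 | 40.9 |
| Cap+DeepSeek-V3 | 671B | 33.0 | 56.1 | 18.2 | 48.6 | 41.7 | 56.7 | 42.4 |
| Qwen2.5-Omni | 7B | 42.4 | 59.9 | 54.6 | 50.0 | 37.5 | 58.3 | 50.4 |
| TinyGiantALM (Ours) | 1.5B | 47.3 | 37.9 | 46.9 | 45.5 | 49.1 | 58.5 | 46.4 |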
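To make the size comparison concrete, the short script below recomputes the parameter ratio and the average-accuracy gap between TinyGiantALM and four of the LALM rows in Table 1. The sizes and averages are copied from the table; the dictionary layout and variable names are only illustrative.

```python
# (size in billions of parameters, average MMAR accuracy in %) from Table 1.
tiny_size, tiny_avg = 1.5, 46.4
lalm_baselines = {
    "LTU-AS": (7.0, 16.0),
    "GAMA": (8.0, 25.4),
    "Qwen2-Audio": (8.4, 28.9),
    "SALMONN": (13.0, 29.0),
}

# For each baseline: how many times larger it is, and how far it trails.
comparison = {
    name: {
        "size_ratio": size / tiny_size,
        "accuracy_gap": tiny_avg - avg,
    }
    for name, (size, avg) in lalm_baselines.items()
}

for name, row in comparison.items():
    print(
        f"{name}: {row['size_ratio']:.1f}x larger, "
        f"TinyGiantALM ahead by {row['accuracy_gap']:.1f} points"
    )
```

The ratios span roughly 4.7x (LTU-AS) to 8.7x (SALMONN), which is where the "5x-8x larger" figure comes from.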

Ablation Study

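We isolate the contributions of the Inference Query (IQ) and the CLAP-based Global Gate via ablation. Adding IQ or the CLAP gate individually yields moderate gains in overall accuracy (from 38.0% to 42.7% and 39.7%, respectively), but their combination unlocks a non-linear boost to 46.4%, more than the sum of the individual gains. The effect is largest in mixed environments, the "Cocktail Party Problem" setting, where Mix S-M rises from 9.09% to 45.45%. The one regression is Mix All (-4.16), in line with the trade-off in overly dense scenes.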

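| Modality | Vanilla | w/o IQ | w/o CLAP | Full Model | Δ |
|---|---|---|---|---|---|
| Single Modality | | | | | |
| Sound (S) | 33.94 | 38.79 | 38.79 | 47.27 | +13.33 |
| Music (M) | 35.92 | 33.50 | 35.44 | 37.86 | +1.94 |
| Speech (Sp) | 39.80 | 44.56 | 43.54 | 46.94 | +7.14 |
| Mixed Modality | | | | | |
| Mix S-M | 9.09 | 18.18 | 36.36 | 45.45 | +36.36 |
| Mix S-Sp | 38.99 | 40.37 | 47.71 | 49.08 | +10.09 |
| Mix M-Sp | 43.90 | 36.59 | 50.00 | 58.54 | +14.64 |
| Mix All | 45.83 | 41.67 | 54.17 | 41.67 | -4.16 |
| Overall Accuracy | 38.00 | 39.70 | 42.70 | 46.40 | +8.40 |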
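The arithmetic behind the ablation can be checked with a few lines. The script below assumes that Δ is Full Model minus Vanilla, and that "w/o IQ" corresponds to adding only the CLAP gate to the vanilla model while "w/o CLAP" corresponds to adding only IQ. All numbers are copied from the table above; the variable names are illustrative.

```python
# Overall accuracy (%) from the last row of the ablation table.
overall = {"vanilla": 38.0, "w/o IQ": 39.7, "w/o CLAP": 42.7, "full": 46.4}

gain_gate_only = overall["w/o IQ"] - overall["vanilla"]   # CLAP gate, no IQ
gain_iq_only = overall["w/o CLAP"] - overall["vanilla"]   # IQ, no CLAP gate
gain_full = overall["full"] - overall["vanilla"]          # both components

# Positive interaction means the combined gain exceeds the sum of the parts.
interaction = gain_full - (gain_gate_only + gain_iq_only)

# Per-modality rows: (vanilla, full model, reported delta).
rows = {
    "Sound (S)": (33.94, 47.27, 13.33),
    "Music (M)": (35.92, 37.86, 1.94),
    "Speech (Sp)": (39.80, 46.94, 7.14),
    "Mix S-M": (9.09, 45.45, 36.36),
    "Mix S-Sp": (38.99, 49.08, 10.09),
    "Mix M-Sp": (43.90, 58.54, 14.64),
    "Mix All": (45.83, 41.67, -4.16),
}

# The reported delta should match full - vanilla up to rounding.
deltas_consistent = all(
    abs((full - vanilla) - delta) < 0.006
    for vanilla, full, delta in rows.values()
)

print(f"gate only: +{gain_gate_only:.1f}, IQ only: +{gain_iq_only:.1f}")
print(f"combined: +{gain_full:.1f}, interaction: +{interaction:.1f}")
print("per-modality deltas consistent:", deltas_consistent)
```

Under these assumptions the individual gains are +1.7 and +4.7 points, while the combined gain is +8.4, leaving about 2 points attributable to the interaction between the two components.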