TinyGiantALM: A Compact Audio-Language Model for Intent-Aware Reasoning

Abstract

Recent advances in audio reasoning rely on massive Large Audio-Language Models (LALMs), which hinders deployment in resource-constrained environments. We introduce TinyGiantALM, a compact, efficiency-oriented 1.5B-parameter alternative. Instead of brute-force scaling, we propose an Instruction-Aware Feature Refinement framework that uses a Query-guided Projector and Semantic Gating to filter acoustic signals according to user intent.

On the MMAR benchmark, TinyGiantALM achieves 46.4% zero-shot accuracy, outperforming the 7B-13B LALM and audio-reasoning baselines (16.0-35.4%). A reasoning gap in logical narrative remains versus 30B+ models, and trade-offs exist in overly dense or spatial scenes; nevertheless, our approach surpasses models up to 8x larger in disentangling mixed-modality environments.

These findings suggest that architectural precision offers a practical path to robust perception at edge-friendly scales.

Audio Reasoning with Compact Models

The success of Chain-of-Thought (CoT) has shifted audio research from simple perception toward cognitive reasoning. Current state-of-the-art LALMs rely on massive parameter scaling (7B-30B+) and computationally expensive Reinforcement Learning to bridge the semantic gap. However, these brute-force approaches often falter in complex scenes, struggle to disentangle acoustic events without explicit reasoning, and create significant barriers to deployment on edge devices.

TinyGiantALM instead handles complex scenes by combining high-level semantics with explicit reasoning steps. Rather than passively processing all audio tokens, the model dynamically modulates acoustic features using the textual query, so that the Small Language Model (SLM) attends only to task-relevant signals.

Methodology

Triple-stream Frontend

To capture the full spectrum of auditory information, we employ a multi-rate strategy integrating three frozen pre-trained encoders: a Fine-grained Temporal Stream (Whisper) for linguistic details, an Event-level Stream (HTS-AT) for short-duration acoustics, and a Global Semantic Stream (CLAP) providing a holistic summary acting as a global prior.

Query-guided Projector

Our projector maps acoustic features to the LLM space via E-Branchformer blocks that model local-global dependencies. We derive a global User Intent vector from the text instruction to drive a Multi-Head Cross-Attention mechanism, ensuring the model aligns the audio representation with the specific user instruction.
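As a rough illustration of the query-guided step, the sketch below lets a single intent vector act as the attention query over a sequence of audio frames, split across several heads. It is a minimal reading of the description above, not the authors' implementation: the function name `intent_cross_attention`, the toy dimensions, and the omission of learned query/key/value projections (treated as identity) are all assumptions, and plain Python lists are used instead of a tensor library to keep the example self-contained.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def intent_cross_attention(intent, frames, num_heads):
    """Intent vector (length D) attends over audio frames (T x D).

    Hypothetical sketch: learned Q/K/V projections are omitted
    (identity), which a real projector would include.
    Returns (per-head attention weights, pooled vector of length D).
    """
    dim = len(intent)
    assert dim % num_heads == 0, "dim must be divisible by num_heads"
    head_dim = dim // num_heads
    weights, pooled = [], []
    for h in range(num_heads):
        start, stop = h * head_dim, (h + 1) * head_dim
        q = intent[start:stop]
        # Scaled dot-product score between the intent query and each frame.
        scores = [
            sum(qi * ki for qi, ki in zip(q, frame[start:stop]))
            / math.sqrt(head_dim)
            for frame in frames
        ]
        attn = softmax(scores)
        weights.append(attn)
        # Weighted sum of the frame slices gives this head's output.
        for d in range(start, stop):
            pooled.append(sum(a * frame[d] for a, frame in zip(attn, frames)))
    return weights, pooled


# Toy input: 3 audio frames, 4-dim features, 2 heads (made-up values).
frames = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]
intent = [2.0, 0.0, 0.0, 2.0]
weights, pooled = intent_cross_attention(intent, frames, num_heads=2)
```

In this toy case the first frame is most similar to the intent vector in both heads, so it receives the largest attention weight; frames unrelated to the instruction are down-weighted rather than discarded.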

Semantic Gating & LLM

To ground features in the global context, we compute a soft gate via a sigmoid projection from the CLAP anchor. The refined embeddings are fed to the Qwen3 (0.6B) backbone, whose output is organized into reasoning steps (Plan, Audio Analysis, Logic, Summary) inside a structured thought block to encourage deeper logical inference.
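The gating step can be sketched as a sigmoid of a linear projection of the CLAP anchor, applied to the audio features. The text does not state how the gate is applied, so the element-wise multiplication below is an assumption, as are the names `semantic_gate` and `apply_gate` and the hand-picked weights; in a trained model the projection would be learned.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def semantic_gate(clap_anchor, weight, bias):
    """gate = sigmoid(W @ clap_anchor + b), one value per feature dim."""
    return [
        sigmoid(sum(w * c for w, c in zip(row, clap_anchor)) + b)
        for row, b in zip(weight, bias)
    ]


def apply_gate(frames, gate):
    """Assumed usage: scale every frame element-wise by the gate vector."""
    return [[g * x for g, x in zip(gate, frame)] for frame in frames]


# Hypothetical 2-dim CLAP anchor and a 3x2 projection (illustrative values).
clap_anchor = [1.0, -1.0]
weight = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
bias = [0.0, 0.0, 0.0]

gate = semantic_gate(clap_anchor, weight, bias)
frames = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
gated = apply_gate(frames, gate)
```

Because the gate lies strictly between 0 and 1, features are attenuated softly rather than masked outright: here the first dimension passes almost unchanged, the second is mostly suppressed, and the third is halved.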

Performance vs. Parameters

Evaluating our architecture across multiple paradigms reveals clear advantages. Traditional LALMs (e.g., Qwen2-Audio, SALMONN) degrade in complex scenarios such as mixed Sound-Music tasks, where they score between 20.8% and 31.2%. In contrast, TinyGiantALM (1.5B) reaches 45.5% on the same subset and 46.4% overall, outperforming LALM baselines roughly 5x-8x larger and indicating that architectural priors can compensate for reduced scale.

Zero-Shot MMAR Benchmark

Table 1: Zero-shot accuracy (%) on MMAR. S: Sound, M: Music, Sp: Speech. TinyGiantALM outperforms the listed LALMs and specialized audio reasoning models and remains competitive with the Omni & Large Language Models (OLMs).

		Single Modality			Mixed Modality			
Model	Size	S	M	Sp	S-M	S-Sp	M-Sp	Avg All
Large Audio-Language Models (LALMs)
Flamingo 2	7B	24.9	20.8	18.2	26.6	21.9	21.9	22.4
LTU-AS	7B	17.5	19.1	9.1	23.2	8.3	19.0	16.0
GAMA	8B	24.3	27.9	24.8	20.8	26.5	28.1	25.4
Qwen2-Audio	8.4B	33.3	37.3	9.1	30.5	30.0	33.2	28.9
SALMONN	13B	24.3	34.7	9.1	31.2	25.0	50.0	29.0
Large Audio Reasoning Models (LARMs)
Audio-CoT	8.4B	25.2	34.0	9.1	30.7	37.5	25.0	26.9
Audio-Reasoner	8.4B	35.8	33.0	45.5	30.5	31.3	36.8	35.4
Omni & Large Language Models (OLMs)
Baichuan-Omni-1.5	11B	41.2	40.5	36.4	39.0	40.7	47.6	40.9
Cap+DeepSeek-V3	671B	33.0	56.1	18.2	48.6	41.7	56.7	42.4
Qwen2.5-Omni	7B	42.4	59.9	54.6	50.0	37.5	58.3	50.4
TinyGiantALM (Ours)	1.5B	47.3	37.9	46.9	45.5	49.1	58.5	46.4
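The size comparison can be checked directly from the Size and Avg All columns of Table 1. The short script below is only a convenience sketch: the dictionary layout and variable names are assumptions, while the numbers are copied from the table.

```python
# (size in billions of parameters, Avg All accuracy in %), from Table 1.
baselines = {
    "Flamingo 2": (7.0, 22.4),
    "LTU-AS": (7.0, 16.0),
    "GAMA": (8.0, 25.4),
    "Qwen2-Audio": (8.4, 28.9),
    "SALMONN": (13.0, 29.0),
    "Audio-CoT": (8.4, 26.9),
    "Audio-Reasoner": (8.4, 35.4),
    "Baichuan-Omni-1.5": (11.0, 40.9),
    "Cap+DeepSeek-V3": (671.0, 42.4),
    "Qwen2.5-Omni": (7.0, 50.4),
}
ours_size, ours_avg = 1.5, 46.4

# How many times larger each baseline is than the 1.5B model.
size_ratio = {name: size / ours_size for name, (size, _) in baselines.items()}

# Baselines whose average accuracy is still above TinyGiantALM's.
ahead_of_ours = [name for name, (_, avg) in baselines.items() if avg > ours_avg]

for name, (size, avg) in baselines.items():
    print(f"{name:20s} {size_ratio[name]:7.1f}x larger  gap {ours_avg - avg:+.1f}")
```

The 7B and 13B baselines come out at about 4.7x and 8.7x the size of TinyGiantALM, and Qwen2.5-Omni is the only listed model with a higher average.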

Ablation Study

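We isolate the contributions of the Inference Query (IQ) and the CLAP-based Global Gate via ablation. Each component alone yields a moderate gain over the vanilla model (38.00% overall): 39.70% without IQ and 42.70% without the CLAP gate. The full model reaches 46.40%, more than the sum of the individual gains. The largest improvement is on Mix S-M (+36.36 points), which suggests the combination helps with "Cocktail Party"-style mixed scenes, although accuracy on Mix All drops by 4.16 points.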

Modality	Vanilla	w/o IQ	w/o CLAP	Full Model	Δ (Full - Vanilla)
Single Modality
Sound (S)	33.94	38.79	38.79	47.27	+13.33
Music (M)	35.92	33.50	35.44	37.86	+1.94
Speech (Sp)	39.80	44.56	43.54	46.94	+7.14
Mixed Modality
Mix S-M	9.09	18.18	36.36	45.45	+36.36
Mix S-Sp	38.99	40.37	47.71	49.08	+10.09
Mix M-Sp	43.90	36.59	50.00	58.54	+14.64
Mix All	45.83	41.67	54.17	41.67	-4.16
Overall Accuracy	38.00	39.70	42.70	46.40	+8.40
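The interaction between the two components can be quantified from the Overall Accuracy row. The calculation below assumes that the "w/o IQ" column corresponds to a model with only the CLAP gate and "w/o CLAP" to a model with only IQ; the variable names are illustrative.

```python
# Overall accuracy (%) from the ablation table.
overall = {
    "vanilla": 38.00,
    "without_iq": 39.70,    # assumed: CLAP gate only
    "without_clap": 42.70,  # assumed: IQ only
    "full": 46.40,
}

gain_clap_only = round(overall["without_iq"] - overall["vanilla"], 2)
gain_iq_only = round(overall["without_clap"] - overall["vanilla"], 2)
gain_joint = round(overall["full"] - overall["vanilla"], 2)

# Positive interaction means the joint gain exceeds the sum of the parts.
interaction = round(gain_joint - (gain_clap_only + gain_iq_only), 2)

print(f"CLAP gate only: +{gain_clap_only:.2f}")
print(f"IQ only:        +{gain_iq_only:.2f}")
print(f"Both:           +{gain_joint:.2f} (interaction {interaction:+.2f})")
```

Under that reading, the individual gains are 1.70 and 4.70 points, the joint gain is 8.40 points, and the remaining 2.00 points are attributable to the interaction between the two components.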

Interspeech 2026