🚀 The Better You Learn, The Smarter You Prune:
Towards Efficient Vision-language-action Models
via Differentiable Token Pruning

Titong Jiang1,2*, Xuefeng Jiang1,3*, Yuan Ma1†, Xin Wen1, Bailin Li1,
Kun Zhan1, Peng Jia1, Yahui Liu2, Sheng Sun3, Xianpeng Lang1‡
1LiAuto Inc.    2Tsinghua University    3Chinese Academy of Sciences
*Equal Contribution    †Project Lead    ‡Corresponding Author

📝 Abstract

We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capabilities in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by heavy attention-based computation over large sets of visual tokens.

LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: it generates dynamic queries to evaluate visual token importance and adopts Gumbel softmax to enable differentiable token selection. Through fine-tuning, LightVLA learns to preserve the most informative visual tokens while pruning those that do not contribute to task execution, thereby improving efficiency and performance simultaneously.
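As background, the standard Gumbel-softmax relaxation (our notation, not taken from the paper) turns a hard categorical choice over token importance scores $\pi_1,\dots,\pi_N$ into a differentiable one:

$$
y_i = \frac{\exp\!\big((\log \pi_i + g_i)/\tau\big)}{\sum_{j=1}^{N} \exp\!\big((\log \pi_j + g_j)/\tau\big)},
\qquad g_i = -\log(-\log u_i),\quad u_i \sim \mathrm{Uniform}(0,1),
$$

where $\tau$ is a temperature: as $\tau \to 0$ the weights $y$ approach a one-hot selection, yet gradients still flow through the soft relaxation, which is what makes token selection trainable end to end.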

Key Results: On the LIBERO benchmark, LightVLA reduces FLOPs by 59.1% and latency by 38.2% while improving task success rate by 2.9%.

🔧 Approach

Overview of the LightVLA framework

Our method adaptively prunes visual tokens with task performance as the sole objective: through differentiable selection, the framework learns to identify and retain task-relevant visual information while discarding redundant tokens, gaining efficiency without sacrificing accuracy. A minimal sketch of the selection step is given below.
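The sketch below illustrates one plausible reading of this step in PyTorch. It is an illustration under our own assumptions, not the released LightVLA implementation: names such as `GumbelTokenSelector`, `query_gen`, and `num_keep` are hypothetical placeholders, and the number of kept tokens is fixed here for simplicity, whereas LightVLA's pruning is adaptive.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GumbelTokenSelector(nn.Module):
    """Hypothetical sketch: K dynamic queries each pick one visual token
    via straight-through Gumbel softmax, keeping selection differentiable."""

    def __init__(self, dim: int, num_keep: int, tau: float = 1.0):
        super().__init__()
        # generate num_keep dynamic query vectors from pooled token context
        self.query_gen = nn.Linear(dim, num_keep * dim)
        self.num_keep, self.tau = num_keep, tau

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) visual token embeddings
        B, N, D = tokens.shape
        ctx = tokens.mean(dim=1)                                  # (B, D)
        queries = self.query_gen(ctx).view(B, self.num_keep, D)   # (B, K, D)
        scores = queries @ tokens.transpose(1, 2) / D ** 0.5      # (B, K, N)
        if self.training:
            # hard one-hot picks in the forward pass, soft
            # Gumbel-softmax gradients in the backward pass
            weights = F.gumbel_softmax(scores, tau=self.tau, hard=True, dim=-1)
        else:
            weights = F.one_hot(scores.argmax(dim=-1), N).to(tokens.dtype)
        return weights @ tokens                                   # (B, K, D) kept tokens


# usage: keep 64 of 256 tokens (shapes are illustrative)
selector = GumbelTokenSelector(dim=1024, num_keep=64)
kept = selector(torch.randn(2, 256, 1024))   # -> (2, 64, 1024)
```

The straight-through estimator (`hard=True`) is the key design choice in this sketch: the forward pass commits to discrete token picks, while the backward pass sends gradients through the soft Gumbel weights to the query generator, so the fine-tuning loss itself decides which tokens survive.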

📊 Experimental Results

LightVLA achieves better performance with fewer visual tokens

Superior efficiency and accuracy compared to existing VLA models and acceleration methods

Visual Tokens vs. Success Rate Analysis

🎬 Visual Token Pruning Demonstrations

Adaptive token pruning across four LIBERO benchmark categories (videos at 0.2x speed)

Spatial

- Pick up the black bowl between the plate and the ramekin and place it on the plate
- Pick up the black bowl next to the cookie box and place it on the plate
- Pick up the black bowl next to the plate and place it on the plate
- Pick up the black bowl on the wooden cabinet and place it on the plate

Object

- Pick up the alphabet soup and place it in the basket
- Pick up the BBQ sauce and place it in the basket
- Pick up the chocolate pudding and place it in the basket
- Pick up the orange juice and place it in the basket

Goal

- Put the wine bottle on top of the cabinet
- Open the top drawer and put the bowl inside
- Put the bowl on top of the cabinet
- Put the wine bottle on the rack

Long

- Put both the alphabet soup and the tomato sauce in the basket
- Turn on the stove and put the moka pot on it
- Put the white mug on the plate and put the chocolate pudding to the right of the plate
- Put the yellow and white mug in the microwave and close it

⚠️ Failure Cases

Understanding failure modes helps improve future iterations. One failed attempt per category:

- Spatial: Pick up the black bowl next to the plate
- Object: Pick up the chocolate pudding
- Goal: Open the drawer and put the bowl inside
- Long: Put the mug in the microwave and close it

📚 Citation

@misc{jiang2025betterlearnsmarterprune,
      title={The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning}, 
      author={Titong Jiang and Xuefeng Jiang and Yuan Ma and Xin Wen and Bailin Li and Kun Zhan and Peng Jia and Yahui Liu and Sheng Sun and Xianpeng Lang},
      year={2025},
      eprint={2509.12594},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2509.12594}, 
}