We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models.
While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens.
LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: it generates dynamic queries to evaluate visual token importance and adopts the Gumbel-softmax trick to enable differentiable token selection. Through fine-tuning, LightVLA learns to preserve the most informative visual tokens while pruning those that do not contribute to task execution, thereby improving efficiency and performance simultaneously.
Key Results: LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.9% improvement in task success rate on the LIBERO benchmark.
Our method adaptively prunes visual tokens guided solely by task performance, so efficiency gains come without sacrificing accuracy. Through differentiable selection, the framework learns to identify and retain task-relevant visual information while discarding redundant tokens.
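To make the selection mechanism concrete, below is a minimal PyTorch sketch of query-based, differentiable visual-token pruning with straight-through Gumbel softmax. It is an illustration under stated assumptions, not the paper's exact implementation: the module name `TokenPruner`, the fixed number of queries, and the temperature `tau` are hypothetical, and the paper's adaptive scheme retains a variable number of tokens rather than one token per query.

```python
# Minimal sketch (assumed names/shapes): learnable queries score visual tokens,
# and Gumbel softmax makes the hard selection differentiable during fine-tuning.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenPruner(nn.Module):
    def __init__(self, dim: int, num_queries: int = 64, tau: float = 1.0):
        super().__init__()
        # Learnable queries used to evaluate visual-token importance.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.tau = tau

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) visual tokens from the vision encoder.
        # Each query produces selection logits over all N tokens.
        logits = torch.einsum("qd,bnd->bqn", self.queries, tokens)  # (B, Q, N)

        if self.training:
            # Straight-through Gumbel softmax: hard one-hot selection in the
            # forward pass, soft gradients in the backward pass.
            select = F.gumbel_softmax(logits, tau=self.tau, hard=True, dim=-1)
        else:
            # At inference, deterministically pick the argmax token per query.
            idx = logits.argmax(dim=-1)
            select = F.one_hot(idx, tokens.size(1)).to(tokens.dtype)

        # Gather selected tokens: (B, Q, D). Duplicate picks are possible in
        # this simplified sketch; deduplication is omitted for brevity.
        return torch.einsum("bqn,bnd->bqd", select, tokens)


# Usage: prune 256 visual tokens down to (at most) 64 before the LLM backbone.
pruner = TokenPruner(dim=1024, num_queries=64)
visual_tokens = torch.randn(2, 256, 1024)
pruned = pruner(visual_tokens)  # (2, 64, 1024)
```

Because `hard=True` keeps the forward pass discrete while gradients flow through the soft relaxation, the pruner can be trained end-to-end with the VLA policy loss, letting task performance alone decide which tokens survive.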
LightVLA achieves better performance with fewer visual tokens
Superior efficiency and accuracy compared to existing VLA models and acceleration methods
Adaptive token pruning across four LIBERO benchmark categories (videos at 0.2x speed)
Pick up the black bowl between the plate and the ramekin and place it on the plate
Pick up the black bowl next to the cookie box and place it on the plate
Pick up the black bowl next to the plate and place it on the plate
Pick up the black bowl on the wooden cabinet and place it on the plate
Pick up the alphabet soup and place it in the basket
Pick up the BBQ sauce and place it in the basket
Pick up the chocolate pudding and place it in the basket
Pick up the orange juice and place it in the basket
Put the wine bottle on top of the cabinet
Open the top drawer and put the bowl inside
Put the bowl on top of the cabinet
Put the wine bottle on the rack
Put both the alphabet soup and the tomato sauce in the basket
Turn on the stove and put the moka pot on it
Put the white mug on the plate and put the chocolate pudding to the right of the plate
Put the yellow and white mug in the microwave and close it
Understanding failure modes helps improve future iterations
Failed: Pick up the black bowl next to the plate
Failed: Pick up the chocolate pudding
Failed: Open drawer and put bowl inside
Failed: Put mug in microwave and close
@misc{jiang2025betterlearnsmarterprune,
  title={The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning},
  author={Titong Jiang and Xuefeng Jiang and Yuan Ma and Xin Wen and Bailin Li and Kun Zhan and Peng Jia and Yahui Liu and Sheng Sun and Xianpeng Lang},
  year={2025},
  eprint={2509.12594},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2509.12594},
}