Robotic Manipulation · VLA · Chain-of-Thought

DualCoT-VLA: Visual-Linguistic Chain of Thought
via Parallel Reasoning for VLA Models

* Equal contribution    † Corresponding author

1 The Hong Kong University of Science and Technology (Guangzhou)
2 Huawei Foundation Model Department

Abstract

Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks that require logical planning, as well as precise manipulations that demand fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a "thinking before acting" capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning, due to their reliance on isolated, single-modal CoT; 2) high inference latency and compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as on real-world platforms.

Method Overview

DualCoT-VLA Pipeline

Overview of DualCoT-VLA. The VLM backbone processes visual observations, language instructions, and two sets of learnable query tokens. Visual CoT aligns query hidden states with spatial features from Depth Anything 3; Linguistic CoT uses query hidden states as conditioning prefixes for an auxiliary LLM. During inference, auxiliary modules are discarded for efficient single-forward-pass action generation via a Flow-Matching DiT Action Expert.
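The single-pass idea can be sketched in a few lines: the two query sets are simply appended to the image and text tokens, and one forward pass yields hidden states for both CoT branches. All sizes and the single attention layer below are illustrative stand-ins for the actual VLM backbone, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32                           # hidden size (illustrative)
n_img, n_txt = 16, 8             # visual / language token counts
n_vq, n_lq = 4, 4                # visual-CoT / linguistic-CoT query tokens

# Learnable query embeddings and input tokens (random stand-ins here)
visual_queries = rng.normal(size=(n_vq, d))
ling_queries = rng.normal(size=(n_lq, d))
img_tokens = rng.normal(size=(n_img, d))
txt_tokens = rng.normal(size=(n_txt, d))

Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def self_attention(x):
    # One unmasked self-attention layer: every query token attends to all
    # image/text tokens at once, so both CoTs are produced in a single pass
    # rather than decoded token by token.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

tokens = np.concatenate([img_tokens, txt_tokens, visual_queries, ling_queries])
hidden = self_attention(tokens)                            # one forward pass
vq_hidden = hidden[n_img + n_txt : n_img + n_txt + n_vq]   # visual-CoT branch
lq_hidden = hidden[-n_lq:]                                 # linguistic-CoT branch
print(vq_hidden.shape, lq_hidden.shape)
```

During training, `vq_hidden` would feed the spatial-alignment objective and `lq_hidden` the planning objective; at inference both simply condition the action expert.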

Key Contributions

Visual CoT

Distills dense 3D spatial features from a frozen Vision Foundation Model (Depth Anything 3), enabling low-level spatial understanding without explicit decoding.
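A minimal sketch of such a distillation objective, assuming a cosine-alignment loss between the visual-CoT query hidden states and the frozen foundation-model features (the paper's exact loss is not specified here; the random features stand in for Depth Anything 3 outputs):

```python
import numpy as np

def cosine_align_loss(query_hidden, vfm_feats):
    """1 - mean cosine similarity between visual-CoT query states and
    frozen spatial features (stand-in for Depth Anything 3 outputs)."""
    q = query_hidden / np.linalg.norm(query_hidden, axis=-1, keepdims=True)
    f = vfm_feats / np.linalg.norm(vfm_feats, axis=-1, keepdims=True)
    return 1.0 - float(np.mean(np.sum(q * f, axis=-1)))

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 32))
print(cosine_align_loss(feats, feats))   # perfectly aligned features -> 0.0
```

Minimizing this pulls the query states toward the frozen 3D features, so spatial structure is internalized without ever decoding depth maps explicitly.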

Linguistic CoT

Aligns VLM representations with a frozen auxiliary LLM to internalize compressed logical planning within the continuous latent space for high-level task reasoning.
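One plausible way to realize this prefix conditioning (the function name and mask convention below are assumptions for illustration, not the paper's code): the linguistic-CoT query states are prepended as a continuous soft prefix to the auxiliary LLM's input embeddings, and the language-modeling loss is applied only to the plan tokens.

```python
import numpy as np

def prefix_inputs_and_mask(query_hidden, plan_embeds):
    """Prepend linguistic-CoT query states as a continuous prefix to the
    auxiliary LLM's input embeddings; supervise only the plan tokens."""
    inputs = np.concatenate([query_hidden, plan_embeds], axis=0)
    loss_mask = np.concatenate([
        np.zeros(len(query_hidden), dtype=bool),   # soft prefix: no LM loss
        np.ones(len(plan_embeds), dtype=bool),     # plan text: LM loss
    ])
    return inputs, loss_mask

rng = np.random.default_rng(0)
inputs, mask = prefix_inputs_and_mask(rng.normal(size=(4, 32)),
                                      rng.normal(size=(10, 32)))
print(inputs.shape, int(mask.sum()))
```

Because the prefix lives in the LLM's continuous embedding space, the VLM is pushed to compress a full plan into a handful of latent tokens rather than emitting it as text.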

Parallel Reasoning

Two sets of learnable query tokens replace slow autoregressive decoding, enabling single-step forward reasoning that significantly reduces both inference latency and compounding errors.
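The downstream action head can likewise run without token-by-token decoding. A toy sketch of flow-matching-style sampling, assuming fixed-step Euler integration of a velocity field from noise (t=0) toward an action (t=1); the simple linear field below stands in for the conditioned DiT action expert:

```python
import numpy as np

def flow_matching_sample(velocity_fn, action_dim, n_steps=10, seed=0):
    """Integrate a learned velocity field from Gaussian noise toward an
    action with fixed-step Euler; no autoregressive decoding loop."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=action_dim)          # start from noise at t = 0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        a = a + dt * velocity_fn(a, i * dt)  # Euler step along the field
    return a

target = np.array([0.3, -0.1, 0.5])          # stand-in "ground-truth" action
sample = flow_matching_sample(lambda a, t: target - a, action_dim=3)
print(np.round(sample, 3))
```

With this toy field each Euler step shrinks the deviation from `target` by a factor of (1 - dt), so the sample drifts steadily from noise toward the action; a trained DiT would predict the velocity from the VLM and query hidden states instead.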

Qualitative Results

Visualization of Visual CoT and Linguistic CoT reasoning on manipulation tasks.

Qualitative Visualization

Quantitative Results

DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks.

LIBERO Benchmark  ·  Success Rate (%) over 500 episodes

| Method | Spatial | Object | Goal | Long | Average |
| --- | --- | --- | --- | --- | --- |
| Diffusion Policy | 78.5 | 87.5 | 73.5 | 64.8 | 76.1 |
| OpenVLA | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| PD-VLA | 95.5 | 96.7 | 94.9 | 91.7 | 94.7 |
| π₀ | 98.0 | 96.8 | 94.4 | 88.4 | 94.4 |
| π₀-Fast | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 |
| π₀.₅ | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 |
| GR00T-N1 | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
| GR00T-N1.6 | 97.7 | 98.5 | 97.5 | 94.4 | 97.0 |
| OpenVLA-OFT | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |
| CoT-VLA | 87.5 | 91.6 | 87.6 | 69.0 | 83.9 |
| ThinkAct | 88.3 | 91.4 | 87.1 | 70.9 | 84.4 |
| DeepThinkVLA | 96.6 | 99.0 | 96.4 | 96.2 | 97.0 |
| Fast-ThinkAct | 92.0 | 97.2 | 90.2 | 79.4 | 89.7 |
| LaRA-VLA | 96.4 | 98.6 | 99.8 | 96.6 | 97.9 |
| DualCoT-VLA (Ours) | 99.4 | 99.8 | 97.8 | 98.2 | 98.8 |

RoboCasa GR1 Tabletop Tasks  ·  Success Rate (%) over 50 rollouts / task

| Task | GR00T-N1.5 | GR00T-N1.6 | Qwen3GR00T | Qwen3PI | Qwen3OFT | Qwen3FAST | DualCoT-VLA (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| *Cabinet / Microwave Tasks* |  |  |  |  |  |  |  |
| BottleToCabinetClose | 54.0 | 51.5 | 46.0 | 26.0 | 30.0 | 38.0 | 66.0 |
| CanToDrawerClose | 50.0 | 13.0 | 80.0 | 62.0 | 76.0 | 44.0 | 64.0 |
| CupToDrawerClose | 38.0 | 8.5 | 54.0 | 42.0 | 44.0 | 56.0 | 46.0 |
| MilkToMicrowaveClose | 60.0 | 14.0 | 48.0 | 50.0 | 44.0 | 44.0 | 58.0 |
| PotatoToMicrowaveClose | 32.0 | 41.5 | 28.0 | 42.0 | 32.0 | 14.0 | 30.0 |
| WineToCabinetClose | 38.0 | 16.5 | 46.0 | 32.0 | 36.0 | 14.0 | 38.0 |
| *Cuttingboard Tasks* |  |  |  |  |  |  |  |
| CuttingboardToBasket | 38.0 | 58.0 | 48.0 | 40.0 | 50.0 | 54.0 | 44.0 |
| CuttingboardToCardboardbox | 46.0 | 46.5 | 40.0 | 46.0 | 40.0 | 42.0 | 54.0 |
| CuttingboardToPan | 58.0 | 68.5 | 68.0 | 60.0 | 70.0 | 58.0 | 80.0 |
| CuttingboardToPot | 62.0 | 65.0 | 52.0 | 40.0 | 54.0 | 58.0 | 64.0 |
| CuttingboardToTieredbasket | 28.0 | 46.5 | 56.0 | 44.0 | 38.0 | 40.0 | 46.0 |
| *Placemat Tasks* |  |  |  |  |  |  |  |
| PlacematToBasket | 30.0 | 58.5 | 42.0 | 44.0 | 32.0 | 36.0 | 48.0 |
| PlacematToBowl | 60.0 | 57.5 | 44.0 | 52.0 | 58.0 | 38.0 | 58.0 |
| PlacematToPlate | 56.0 | 63.0 | 48.0 | 50.0 | 52.0 | 42.0 | 74.0 |
| PlacematToTieredshelf | 36.0 | 28.5 | 18.0 | 28.0 | 24.0 | 18.0 | 26.0 |
| *Plate Tasks* |  |  |  |  |  |  |  |
| PlateToBowl | 52.0 | 57.0 | 60.0 | 52.0 | 60.0 | 52.0 | 50.0 |
| PlateToCardboardbox | 48.0 | 43.5 | 50.0 | 40.0 | 50.0 | 30.0 | 56.0 |
| PlateToPan | 60.0 | 51.0 | 54.0 | 36.0 | 66.0 | 48.0 | 70.0 |
| PlateToPlate | 52.0 | 78.7 | 70.0 | 48.0 | 68.0 | 50.0 | 76.0 |
| *Tray Tasks* |  |  |  |  |  |  |  |
| TrayToCardboardbox | 32.0 | 51.5 | 38.0 | 34.0 | 44.0 | 28.0 | 52.0 |
| TrayToPlate | 58.0 | 71.0 | 56.0 | 64.0 | 56.0 | 34.0 | 64.0 |
| TrayToPot | 44.0 | 64.5 | 50.0 | 44.0 | 62.0 | 46.0 | 70.0 |
| TrayToTieredbasket | 60.0 | 57.0 | 36.0 | 50.0 | 54.0 | 36.0 | 60.0 |
| TrayToTieredshelf | 64.0 | 31.5 | 16.0 | 28.0 | 30.0 | 16.0 | 28.0 |
| **Average** | 48.2 | 47.6 | 47.8 | 43.9 | 48.8 | 39.0 | 55.1 |


Real-World Experiments

DualCoT-VLA seamlessly transfers its robust task planning and 3D spatial perception to real-world environments.

Real World Experiment 1

BibTeX

@misc{zhong2026dualcotvlavisuallinguisticchainthought,
  title         = {DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models},
  author        = {Zhide Zhong and Junfeng Li and Junjie He and Haodong Yan and Xin Gong and Guanyi Zhao and Yingjie Cai and Jiantao Gao and Xu Yan and Bingbing Liu and Yingcong Chen and Liuqing Yang and Haoang Li},
  year          = {2026},
  eprint        = {2603.22280},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2603.22280}
}