PanoVine: Whole-Body Visuomotor Control for Soft Growing Vine Robot

PanoVine system overview — **PanoVine** features (A) a 6 m soft growing vine robot with (B) 19 cameras distributed along its body. (C) Identical action commands lead to drastically different configurations due to unpredictable buckling, hysteresis, and environment interaction. (D) A learned whole-body visuomotor policy enables diverse navigation and manipulation skills.

Abstract

Vine robots, a class of soft, growing robots, are well suited to navigating complex and confined environments thanks to their compliant bodies and self-supporting growth mechanism. However, hysteresis, tether interactions, and deformations make them difficult to predict and model, which limits conventional planning and control. In this work we present a data-driven, vision-based control framework for the first autonomous vine robot system. Our system integrates 19 cameras distributed along the robot's body to provide comprehensive feedback of both the robot state and the surrounding environment. Using this rich whole-body vision feedback, we train an end-to-end visuomotor policy from demonstrations for closed-loop autonomous control. The policy aggregates information from distributed sensing while remaining robust to inaccurate robot states and actuation. Experiments demonstrate robust navigation and manipulation in challenging scenarios—steering through branched structures, climbing slopes, traversing unsupported terrain, reaching objects precisely, and maneuvering through confined spaces and obstacles.

Whole-Body Vision

19 body-mounted RGB cameras are gradually revealed as the robot grows, collectively providing multi-perspective feedback of both the robot body and its environment.

Camera views during course navigation — **Course navigation.** Cameras are progressively revealed during growth and observe the branch, the obstacles, and the target.

Camera views during object reaching — **Object reaching.** The object becomes visible to successively more body cameras as the robot extends and steers toward it.

Whole-Body Visuomotor Policy

We learn an end-to-end visuomotor policy from teleoperated demonstrations. At each step it maps a history of multi-view images and proprioception to an action chunk.
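The mapping from an observation history to an action chunk can be pictured as a receding-horizon loop. The sketch below is only an illustration under stated assumptions: the history length, chunk length, number of executed steps, and the names `run_closed_loop`, `dummy_policy`, `get_observation`, and `send_action` are hypothetical and do not come from the PanoVine implementation. Only the camera count and the seven-dimensional action (six steering plus one growing) are taken from the text.

```python
from collections import deque

NUM_CAMERAS = 19   # from the text
ACTION_DIM = 7     # six steering actions + one growing action (from the text)
HISTORY_LEN = 2    # assumed; the source does not state the history length
CHUNK_LEN = 8      # assumed length of a predicted action chunk
EXECUTE_LEN = 4    # assumed number of chunk steps executed before re-planning


def dummy_policy(history):
    """Hypothetical stand-in for the learned policy: returns a chunk of zero actions."""
    assert len(history) == HISTORY_LEN
    return [[0.0] * ACTION_DIM for _ in range(CHUNK_LEN)]


def run_closed_loop(get_observation, send_action, policy, num_steps):
    """Run `policy` in closed loop for `num_steps` actions.

    get_observation: callable returning the current observation
                     (e.g. multi-view images plus proprioception).
    send_action:     callable that sends one action to the robot.
    policy:          callable mapping a list of HISTORY_LEN observations
                     to a list of actions (an action chunk).
    """
    history = deque(maxlen=HISTORY_LEN)
    first = get_observation()
    for _ in range(HISTORY_LEN):
        history.append(first)  # pad the history with the first observation

    executed = 0
    while executed < num_steps:
        chunk = policy(list(history))        # predict a chunk from the history
        for action in chunk[:EXECUTE_LEN]:   # execute only its first few steps
            if executed >= num_steps:
                break
            send_action(action)
            history.append(get_observation())  # refresh the observation history
            executed += 1
    return executed
```

Executing only part of each chunk before re-querying the policy is one common way to keep chunked predictions reactive; whether PanoVine does this, and with which lengths, is not stated in the source.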

PanoVine policy architecture — Environment and robot states are observed through 19 cameras plus growth/steering sensors. Each image is encoded into a ViT class-token feature; the vision tokens and proprioception are cross-attended by a diffusion-transformer policy that predicts six steering actions and one growing action.
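To make the cross-attention step concrete, here is a minimal single-head, scaled dot-product sketch in plain Python. It is a simplification under assumptions: it omits the learned query/key/value projections, multiple heads, and the diffusion denoising loop, and it assumes that the policy's action tokens act as queries while the 19 camera tokens plus one proprioception token form the context. The function names and the toy feature dimension are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention without learned projections.

    queries: list of query vectors (assumed here to be action tokens).
    context: list of context vectors (assumed here to be one token per
             camera plus one proprioception token), same dimension as queries.
    Returns (outputs, weights): one output vector and one list of
    attention weights per query.
    """
    dim = len(context[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every context token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        # Weighted sum of the context tokens (used as values here).
        out = [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights
```

Because the attention weights are computed from feature similarity alone, nothing in this operation requires knowing where each camera sits on the body.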

Multi-view correspondence. Cross-attention in the diffusion-transformer policy learns implicit correspondences across multi-view visual features without relying on extrinsic calibration, which is unreliable for a continuously deforming body.
Relative proprioception & actions. Expressing observations and actions relative to the latest frame improves robustness to actuation uncertainty and hysteresis.
Steering/growing rebalancing. Demonstrations are dominated by long stretches of near-pure growth, with steering actions being comparatively sparse. We label each training window by whether its joint angles change over the action horizon, then resample the dataset to balance steering against pure-growth windows—so the policy learns decisive, reactive steering for turning, climbing, and bending instead of overfitting to growing.
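The relative representation and the rebalancing step above can be sketched as follows. This is an illustrative sketch, not the authors' code: the change threshold `eps`, the 50/50 target `steering_fraction`, sampling with replacement, and all function names are assumptions. The labeling rule (a window counts as steering if its joint angles change over the action horizon) follows the description in the text.

```python
import random


def to_relative(sequence, reference):
    """Express each vector in `sequence` relative to `reference`
    (e.g. the proprioception reading of the latest frame)."""
    return [[x - r for x, r in zip(vec, reference)] for vec in sequence]


def is_steering_window(joint_angles, eps=1e-3):
    """Label a window as steering if any joint angle deviates from its
    value at the first step by more than `eps` (assumed threshold)."""
    first = joint_angles[0]
    return any(
        abs(x - f) > eps
        for step in joint_angles
        for x, f in zip(step, first)
    )


def rebalanced_indices(labels, num_samples, steering_fraction=0.5, seed=0):
    """Draw window indices so that roughly `steering_fraction` of the
    samples are steering windows, sampling with replacement.

    labels: list of booleans, True for steering windows.
    """
    rng = random.Random(seed)
    steering = [i for i, is_steer in enumerate(labels) if is_steer]
    growth = [i for i, is_steer in enumerate(labels) if not is_steer]
    if not steering or not growth:
        # Only one class present: fall back to uniform sampling.
        return [rng.randrange(len(labels)) for _ in range(num_samples)]
    samples = []
    for _ in range(num_samples):
        pool = steering if rng.random() < steering_fraction else growth
        samples.append(rng.choice(pool))
    return samples
```

In a dataset where one window in ten involves steering, uniform sampling would show the policy steering about 10% of the time, whereas this resampling raises that share to about the chosen fraction.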

Complex Course Navigation

A 6 m long, 1.5 m tall course chains five skills: branch selection, slope climbing, unsupported-gap traversal, obstacle avoidance, and a sharp final turn. PanoVine reaches 80% success.

Ours · autonomous policy

Autonomous rollout. The policy reactively steers through the branch, climbs the 45° slope, bends across the unsupported gap, avoids obstacles, and makes the final sharp turn to the exit.

Baseline · open-loop trajectory replay

Replay baseline (0% success). When a successful demonstration is replayed open-loop, the robot collides with obstacles and falls short of the goal, confirming that the course cannot be solved without closed-loop visual feedback.

Precise Object Reaching

After 2 m of growth the robot must align its tip with an object to within a small angular tolerance, across seen and unseen objects at five locations. PanoVine reaches 85% success.

Ours · multi-camera policy

Multi-camera reaching. The policy grounds its steering on the object's visual appearance across multiple body cameras, incrementally adjusting its bend as the object comes into view.

Baseline · single-camera policy

Single-camera baseline (0% success). With only the base camera, the object is occluded or leaves the field of view; the policy fails to turn toward it and grows past it.

Acknowledgments

The authors would like to thank the CHARM Lab and REALab members for their helpful discussions and feedback on the manuscript. Xiaomeng Xu is supported by the Stanford Interdisciplinary Graduate Fellowship, and Yimeng Qin is supported by the Stanford Woods Institute for the Environment. This work was supported in part by NSF Awards #2143601, #2037101, and #2132519, an Amazon Research Gift, Stanford System-X, the Stanford Woods Institute for the Environment, and the Stanford University Sustainability Accelerator. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.

Citation

@misc{qin2026panovinewholebodyvisuomotorcontrol,
      title={PanoVine: Whole-Body Visuomotor Control for Soft Growing Vine Robot}, 
      author={Yimeng Qin and Xiaomeng Xu and William Heap and Aditi Oak and Shuran Song and Allison Okamura},
      year={2026},
      eprint={2606.22923},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.22923}, 
}