This project is adapted from tatsu-lab/gpt_paper_assistant. Its source code is at Variante/gpt_paper_assistant.
Paper selection prompt and criteria (jump to the section by clicking the link):
1. Application of diffusion models and vision-language models (VLMs) to robot manipulation.
3. Video segmentation using unsupervised or self-supervised methods.
5. Advances in 3D generation using generative models, including image-to-3D and text-to-3D.
1000. T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation [more]
Authors: Jingkun Feng, Reza Sabzevari
1001. Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models [more]
Authors: Yitong Chen, Shiduo Zhang, Jingjing Gong, Xipeng Qiu
1002. DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models [more]
Authors: Zhuoming Liu, Jinhong Lin, Kwan Man Cheng, Lin Zhang, Shayok Bagchi, Yin Li
1004. Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators [more]
Authors: Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao, Tai Wang, Jiangmiao Pang, Xihui Liu
1005. MoDex: A Diffusion Policy for Sequential Multi-Object Dexterous Grasping [more]
Authors: Haofei Lu, Hongjia Liu, Yifei Dong, Florian T. Pokorny, Jens Lundell, Danica Kragic
1006. World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis [more]
Authors: Yi Yang, Zhihong Liu, Siqi Kou, Yiyang Chen, Yanzhe Hu, Jianbo Zhou, Boyuan Zhao, Zhijie Wei, Xiao Xia, Xueqi Li, Pengfei Liu, Zhijie Deng
1007. FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization [more]
Authors: Yihao Wu, He Zhang, Junbo Tan, Xueqian Wang, Zhengyou Zhang
1008. PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation [more]
Authors: Chong Ma, Taiyi Su, Jian Zhu, Jianjun Zhang, Zitai Huang, Yi Xu, Hanli Wang
1009. MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action [more]
Authors: Boyang Zhang, Lianlei Shan
1010. DexFuture: Hierarchical Future-State Visuomotor Targeting for Bimanual Dexterous Tool Use [more]
Authors: Runfa Blark Li, Kuang-Ting Tu, Nikola Raicevic, Dwait Bhatt, Xinshuang Liu, Keito Suzuki, Ki Myung Brian Lee, Nikolay Atanasov, Truong Nguyen
1011. VOLT: Vision and Language Trajectory Segmentation for Faster-than-Demonstration Policies [more]
Authors: Robert Ramirez Sanchez, Daniel J. Evans, Dylan P. Losey, Siddarth Jain
1012. TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies [more]
Authors: Dong Jing, Jingchen Nie, Tianqi Zhang, Jiaqi Liu, Huaxiu Yao, Zhiwu Lu, Mingyu Ding
1013. L-SDPPO: Policy Optimization of Spiking Diffusion Policy for Intra-vehicular Robotic Manipulation [more]
Authors: Liwen Zhang, Dong Zhou, Guanghui Sun, Yifei Zheng, Yuhui Hu, Kaihong Ouyang, Zuoquan Zhao
1014. AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding [more]
Authors: Qize Yu, Jiadi You, Yuran Wang, Jiaqi Liang, Bowen Ping, Yang Tian, Yue Chen, Minghong Cai, Zeying Gong, Ruihai Wu, Yinchuan Li, Junwei Liang, Yingcong Chen
1021. Towards a Data Flywheel for Embodied Intelligence in Logistics [more]
Authors: Anlan Yu, Zaishu Chen, Zhiqing Hong, Daqing Zhang
1022. HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers [more]
Authors: Lizhi Yang, Junheng Li, Nehar Poddar, Yiling Hou, Gio Huh, Robert Griffin, Georgia Gkioxari, Aaron Ames
1023. A Conversational Framework for Human-Robot Collaborative Manipulation with Distributed Generative AI models [more]
Authors: Arash Ghasemzadeh Kakroudi, Roel Pieters
1027. UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching [more]
Authors: Qilin Huang, Quynh Anh Huynh, Long Le, Chen Wang, Chuhao Chen, Ryan Lucas, Eric Eaton, Lingjie Liu
1029. HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes [more]
Authors: Wenbo Li, Xiaoliang Ju, Zipeng Qin, Rongyao Fang, Hongsheng Li
Back to [top]
2003. The Loss Is Not Enough: Sampling Conditions and Inductive Bias in Contrastive Representation Learning [more]
Authors: Justinas Zaliaduonis, Patrick Putzky, Till Richter, Sergios Gatidis
2015. ActiveMimic: Egocentric Video Pretraining with Active Perception [more]
Authors: Xingyao Lin, Guojin Zhong, Tianyi Lu, Ziyi Ye, Yichen Zhu, Zuxuan Wu, Yu-Gang Jiang
2024. Robust Scene Transfer for PointGoal Navigation via Privileged Sensor Guided Contrastive Learning [more]
Authors: Amirhossein Zhalehmehrabi, Tiziano Tezze, Alberto Castelini, Alessandro Farinelli
2026. T-SAR-JEPA: Self-Supervised Temporal Anomaly Detection in SAR Amplitude Stacks via Latent Prediction [more]
Authors: Kerod Woldesenbet, Abem Woldesenbet
2031. RePercENT: Scaling Disentangled Representation Learning Beyond Two Modalities [more]
Authors: Vasiliki Rizou, Pascal Frossard, Dorina Thanou
2032. Episodic Memory Temporal Consistency for Cooperative Multi-Agent Reinforcement Learning [more]
Authors: Zicheng Zhao, Yu Lan, Chengzhengxu Li, Zhaohan Zhang, Xiaoming Liu
Back to [top]
4018. Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction [more]
Authors: Tianxiang Jiang, Linquan Wu, Sheng Xia, Songze Li, Ziang Yan, Haoyu Yang, Yu Qiao, Yi Wang
4019. Towards One-to-Many Temporal Grounding [more]
Authors: Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li
4020. Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models [more]
Authors: Haibo Wang, Lifu Huang
4028. VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning [more]
Authors: Shufan Zhang, Ziyue Lin, Bairun Wang, Lei Jin, Xuanding Ding, Xinzhu Ma, Kunlin Yang
4030. Building The Ph(ysical)AI Layer Of Machine Intelligence [more]
Authors: Ulbert Jose Botero, Liam Smith, Brooks Olney, Pooya Khorrami, Steven Kusiak, Watson Jia, Sage Trudeau, Daniel Capecci
Back to [top]
5017. Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation [more]
Authors: Mengshi Qi, Wei Deng, Xianlin Zhang, Huadong Ma
Back to [top]
6016. Unpaired RGB-Thermal Gaussian-Splatting Using Visual Geometric Transformers [more]
Authors: Jean Cordonnier, Chenghao Xu, Olga Fink, Malcolm Mielle
6025. Self-Learning Expression Deformations for Data-Efficient Gaussian Avatars [more]
Authors: Jiahao Yang, Xiaohang Yang, Qing Wang, Yilan Dong, Gregory Slabaugh, Shanxin Yuan
Back to [top]
ArXiv: 2606.05975 [page] [pdf]
Authors: Jingkun Feng, Reza Sabzevari
Abstract: Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D scenes. It is a challenging task that requires spatial understanding and task interpretation. Current open-vocabulary 3D segmentation methods primarily focus on object-level recognition, while scene-wide part segmentation methods attempt to segment the entire scene exhaustively, making them highly resource-intensive and time-consuming. Balancing segmentation performance in terms of granularity, accuracy, and speed remains a challenge. As one step towards alleviating this, we introduce T-FunS3D, a task-driven hierarchical open-vocabulary 3D functionality segmentation method that provides actionable perception for robotic applications. Our method takes as input the 3D point cloud and posed RGB-D images of an indoor scene. We construct an open-vocabulary scene graph by extracting instances and their visual embeddings in the environment. Given a task description, T-FunS3D identifies the most relevant instances in the scene graph and locates their functional components leveraging a vision-language model. Experiments on the SceneFun3D dataset demonstrate that T-FunS3D is comparable to state-of-the-art in open-vocabulary 3D functionality segmentation, while achieving faster runtime and reduced memory usage.
Comment: Criterion 1: uses a vision-language model to perform task-driven hierarchical open-vocabulary 3D functionality segmentation for robotic perception, evaluated on the SceneFun3D dataset.
Relevance: 10 Back to [topic] [top]
ArXiv: 2606.05737 [page] [pdf]
Authors: Yitong Chen, Shiduo Zhang, Jingjing Gong, Xipeng Qiu
Abstract: Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and state, but predicts only a compact, low-dimensional action chunk. Under this asymmetry, strong one-step action generation should not necessarily require the advanced one-step methods developed for image synthesis. We keep standard velocity prediction and add no teacher model, distillation stage, or auxiliary objective; in our main recipe, we simply bias the training time distribution toward high-noise states. We first isolate the effect in a controlled MNIST grid-to-sequence task, then test it with extensive robot-policy experiments. Across standard LIBERO, LIBERO-Plus, and LIBERO-Pro, one-step policies trained with high-noise biased schedules generally match ten-step decoding under the same recipe, and on standard LIBERO can exceed ten-step policies trained with a uniform time distribution. A real-robot bimanual YAM RSS evaluation gives a small-sample cross-architecture check of the same sampler trend. On a 1.4B VLM model with a 30M action head, one-step decoding reaches 95.6% on LIBERO-Long. These results show that strong one-step VLA action generation can emerge from standard diffusion training, without importing the full few-step diffusion machinery developed for image generation.
Comment: Matches criterion 1: proposes a diffusion-based vision-language-action (VLA) approach for one-step action generation for robotic policies, evaluated on LIBERO / LIBERO-Plus / LIBERO-Pro and real-robot YAM RSS (one-step decoding reaches 95.6% on LIBERO-Long).
Relevance: 10 Back to [topic] [top]
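The training recipe described in the abstract above (standard velocity prediction, a training-time distribution biased toward high-noise states, and one-step decoding) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not give the time distribution, so a Beta(alpha, 1) draw stands in for "high-noise biased", the interpolation convention (t = 1 is pure noise) is a common flow-matching choice rather than the paper's stated one, and `sample_time`, `make_training_pair`, and `one_step_decode` are hypothetical names.

```python
import random


def sample_time(alpha=4.0, rng=random):
    """Draw t in [0, 1], where t = 1 is pure noise.

    Beta(alpha, 1) with alpha > 1 puts most mass near t = 1 (assumed
    stand-in for a high-noise-biased schedule; alpha = 1 is uniform).
    """
    return rng.betavariate(alpha, 1.0)


def make_training_pair(action, rng=random):
    """Build one velocity-prediction example for a flat action chunk."""
    t = sample_time(rng=rng)
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    # Linear path between the clean action (t = 0) and noise (t = 1).
    x_t = [(1.0 - t) * a + t * n for a, n in zip(action, noise)]
    # Velocity target along that path: d x_t / d t = noise - action.
    v_target = [n - a for a, n in zip(action, noise)]
    return t, x_t, v_target


def one_step_decode(velocity_fn, noise):
    """Single Euler step from t = 1 (noise) to t = 0 (action)."""
    v = velocity_fn(noise, 1.0)
    return [n - vi for n, vi in zip(noise, v)]
```

With a perfect velocity predictor, `one_step_decode` recovers the action exactly, which is why training emphasis on the t close to 1 regime matters for single-step use in this toy setting.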
ArXiv: 2606.05758 [page] [pdf]
Authors: Zhuoming Liu, Jinhong Lin, Kwan Man Cheng, Lin Zhang, Shayok Bagchi, Yin Li
Abstract: Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited for problems that require precise continuous outputs, such as localizing temporal boundaries of events or generating robotic control actions. To address this challenge, we propose DRIFT, a general framework for adapting pretrained VLMs to continuous decoding tasks. DRIFT combines a base predictor, which provides a coarse estimate of the target output, with a generative refinement module based on flow matching that iteratively improves the prediction. This residual formulation transforms the generative modeling problem from learning a global output distribution to modeling a localized residual distribution around a strong prior, substantially simplifying optimization. We evaluate DRIFT on both perception and planning tasks, including visual grounding and robotic control. Across multiple tasks and architectures spanning MLLMs, VLAs, and WAMs, DRIFT consistently outperforms a strong set of regression- and generative-based solutions.
Comment: Matches criterion 1: introduces DRIFT, a residual flow adapter to adapt pretrained VLMs for continuous outputs (e.g., robotic control and temporal grounding) and shows consistent gains on visual grounding and robotic control tasks over regression/generative baselines.
Relevance: 10 Back to [topic] [top]
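The residual formulation in the DRIFT abstract above (a coarse base prediction plus a flow-matching module that refines it) might look roughly like the sketch below. This is not the authors' implementation: the function names, the Euler integrator, the step count, and the convention that t = 1 is pure noise are all assumptions chosen to keep the example small.

```python
import random


def residual_target(target, coarse):
    """What the generative module models: the gap left by the base predictor."""
    return [y - c for y, c in zip(target, coarse)]


def refine(coarse, velocity_fn, steps=4, rng=random):
    """Integrate a learned velocity field from noise (t = 1) to a residual (t = 0).

    velocity_fn(x, t, coarse) is a hypothetical learned model conditioned on
    the coarse estimate; the refined output is coarse + predicted residual.
    """
    x = [rng.gauss(0.0, 1.0) for _ in coarse]
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t, coarse)
        x = [xi - dt * vi for xi, vi in zip(x, v)]  # Euler step toward t = 0
    return [c + r for c, r in zip(coarse, x)]
```

Because the flow only has to cover the residual around the coarse estimate, its target distribution is narrower than the full output distribution, which is the simplification the abstract points to.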
ArXiv: 2606.06476 [page] [pdf]
Authors: Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao, Tai Wang, Jiangmiao Pang, Xihui Liu
Abstract: While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints when only limited egocentric observations are available. In this work, we study this problem as thinking with imagination, where a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that empowers VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. To provide reliable imagined evidence, Astra-WM is trained with view consistency tuning to improve pose and content consistency across views. In the RL stage, we propose a world-simulator-in-the-loop two-phase RL curriculum to stabilize tool-use exploration and advance the model's ability to invoke the simulator only when imagined observations improve over direct answering. Experiments demonstrate that both the world simulator and the agentic policy are necessary: Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, while Astra-VL improves the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results show that imagined observations can provide useful spatial evidence, but effective world-model-augmented reasoning requires learning when, where, and how to imagine.
Comment: Criterion 1: couples a Vision-Language Model policy (Astra-VL) with a Bagel-based world simulator (Astra-WM) to provide imagined novel-view observations for agentic spatial reasoning, yielding concrete gains (e.g., MMSI-Bench 45.1→49.5 with simulator-augmented Gemini-3-Flash and Qwen3-VL improvements 29.8→38.8 / 36.8→42.7).
Relevance: 10 Back to [topic] [top]
ArXiv: 2606.05407 [page] [pdf]
Authors: Haofei Lu, Hongjia Liu, Yifei Dong, Florian T. Pokorny, Jens Lundell, Danica Kragic
Abstract: This work addresses sequentially grasping multiple objects with a single dexterous hand without releasing those already held. Most dexterous grasping methods commit all of the hand's degrees of freedom to a single object, underutilizing its dexterity and leaving no redundancy for subsequent grasps. The proposed solution, MoDex, is a diffusion policy that predicts the next gripper pose directly from observations, conditioned on an opposition space and point cloud. The opposition space condition specifies which fingers participate in the current grasp, enabling the gripper to use only a subset of its available degrees of freedom while reserving the remaining degrees of freedom for subsequent grasps. To facilitate sim-to-real transfer, MoDex is trained in two stages: first through imitation learning on expert demonstrations, and subsequently through reinforcement learning fine-tuning, which consistently improves success rates over the pre-trained policy. We evaluate MoDex in simulation on a MuJoCo-based Franka Emika Panda robot equipped with an Allegro Hand and on the corresponding real-world hardware platform. Across both simulation and real-world experiments, MoDex achieves higher success rates than the evaluated learning-based baselines, improving performance by 2.92-17.92% and 6.67-17.78%, respectively. Project page: https://modex2026.github.io/.
Comment: Matches criterion 1: proposes MoDex, a diffusion policy for sequential multi-object dexterous grasping conditioned on point clouds and an opposition space, evaluated in MuJoCo and on a real Franka+Allegro Hand with reported success-rate improvements (2.92–17.92% in sim, 6.67–17.78% real).
Relevance: 10 Back to [topic] [top]
ArXiv: 2606.05979 [page] [pdf]
Authors: Yi Yang, Zhihong Liu, Siqi Kou, Yiyang Chen, Yanzhe Hu, Jianbo Zhou, Boyuan Zhao, Zhijie Wei, Xiao Xia, Xueqi Li, Pengfei Liu, Zhijie Deng
Abstract: We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the world modeling interface to learn from extensive egocentric videos as in the world-action model (WAM) and the language reasoning capacities to solve complex long-horizon tasks as in vision-language-action (VLA) models. At the core of WLA lies an autoregressive (AR) Transformer backbone, instead of a bidirectional diffusion Transformer as in WAMs, to predict the next state, comprising the semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by the world modeling objective based on a dedicated World Expert, and are leveraged to ease the characterization of the state-action correlation for the Action Expert. WLA leverages meta-queries to make the world prediction implicitly impact the action generation so that the former can be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., 92.94% success rate on RoboTwin2.0 Clean and 56.5% success rate on RMBench. WLA-0 also holds the promise to learn novel tasks directly from cross-embodiment robot videos without action annotations.
Comment: Criterion 1: a vision-language-action/world-modeling approach (WLA autoregressive Transformer) that jointly predicts textual subtasks, subgoal images and robot actions for manipulation, reporting 92.94% success on RoboTwin2.0 Clean and 56.5% on RMBench.
Relevance: 10 Back to [topic] [top]
ArXiv: 2606.05468 [page] [pdf]
Authors: Yihao Wu, He Zhang, Junbo Tan, Xueqian Wang, Zhengyou Zhang
Abstract: Post-training Vision-Language-Action (VLA) models into policies that can be reliably deployed on real robots remains a major bottleneck. SFT and DAgger exploit failure signals only indirectly, and reward-based RL is bottlenecked by the difficulty of real-world reward design and of training reliable critics. We present FlowPRO, a reward-free offline reinforced fine-tuning framework for flow-matching VLAs. Algorithmically, we propose RPRO (Robotic Flow-matching Proximalized Preference Optimization), a preference-optimization objective tailored to the flow-matching action head of VLA models. RPRO pairs a contrastive optimizer with an explicit proximal regularizer that anchors the absolute magnitude of the implicit reward, thereby eliminating the reward-hacking failure mode of plain Flow-DPO. On the data side, a teleoperated intervention-and-rollback paradigm produces naturally paired positive and negative trajectories $(\tau^w, \tau^l)$ on a real robot from a single operator action; a Smooth Interpolation procedure, combined with batch mixing, then converts these sparse corrections into dense per-state supervision while preserving the base policy's capabilities. On four long-horizon bimanual tasks, FlowPRO attains the highest success rate, outperforming four representative baselines, and ablations confirm the contribution of each loss component.
Comment: Criterion 1: direct application to robot policy learning via Vision-Language-Action models — proposes FlowPRO with RPRO (Robotic Flow-matching Proximalized Preference Optimization) for flow-matching VLAs and reports highest success rates on four long-horizon bimanual tasks using teleoperated intervention data.
Relevance: 10 Back to [topic] [top]
ArXiv: 2606.05773 [page] [pdf]
Authors: Chong Ma, Taiyi Su, Jian Zhu, Jianjun Zhang, Zitai Huang, Yi Xu, Hanli Wang
Abstract: Vision-language-action (VLA) policies operate in a closed loop in real-world robot tasks: a robot observes the scene, executes an action chunk, and conditions its next decision on the resulting observation. However, most existing world models for robot action evaluation are limited to open-loop prediction along pre-collected action trajectories. This prevents them from supporting closed-loop VLA evaluation, where each action chunk must be conditioned on the observation generated by the previous execution. To address this gap, we propose PiL-World, a chunk-wise world model designed for policy-in-the-loop VLA evaluation. Given the current observation and the action trajectory rolled out by a VLA policy, PiL-World generates multi-view future observations that are consistent with the VLA rollout and match the image inputs required by the policy. By alternating between VLA inference and world-model prediction, PiL-World enables closed-loop evaluation without real robot execution at every step. To improve rollout fidelity, PiL-World conditions video generation on action-derived visual control from head-view robot motion and latent histories that encode task execution context, while jointly predicting complementary multi-view observations. Beyond successful teleoperated demonstrations, it also learns from failed execution trajectories, helping the imagined rollouts better match the distribution of real policy executions. We evaluate PiL-World on three real dual-arm manipulation tasks. PiL-World generates imagined rollouts that are highly consistent with real robot executions. More importantly, compared with the baseline, it reduces the error between VLA success rates measured in real-world rollouts and those estimated through closed-loop world-model evaluation from 63.2% to 12.0%.
Comment: Criterion 1: PiL-World is a chunk-wise world model for policy-in-the-loop Vision-Language-Action (VLA) evaluation that alternates VLA inference and world-model prediction to generate multi-view imagined rollouts, reducing the error between real and estimated VLA success rates from 63.2% to 12.0%.
Relevance: 10 Back to [topic] [top]
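The policy-in-the-loop evaluation described for PiL-World above alternates between policy inference and world-model prediction. A minimal sketch of that control flow follows. The callables `policy`, `world_model`, and `is_success` are hypothetical placeholders (the real system uses a VLA and a multi-view video model), and the scalar toy observation in the usage note is only there to make the loop checkable.

```python
def closed_loop_rollout(policy, world_model, first_obs, num_chunks, is_success):
    """Roll a policy forward inside a world model, one action chunk at a time.

    Each chunk is conditioned on the observation the world model generated
    for the previous chunk, which is what makes the evaluation closed-loop.
    """
    obs = first_obs
    history = [obs]
    for _ in range(num_chunks):
        action_chunk = policy(obs)             # policy inference
        obs = world_model(obs, action_chunk)   # imagined next observation
        history.append(obs)
        if is_success(obs):
            return True, history
    return False, history


def estimated_success_rate(policy, world_model, start_observations,
                           num_chunks, is_success):
    """Fraction of imagined rollouts that end in success."""
    wins = sum(
        closed_loop_rollout(policy, world_model, o, num_chunks, is_success)[0]
        for o in start_observations
    )
    return wins / len(start_observations)
```

As a toy usage, a policy that always returns the chunk `[1, 1]` and a world model that adds the chunk's sum to a scalar observation reach a goal of 5 from 0 in three chunks; an open-loop evaluator would instead replay a fixed, pre-collected action sequence regardless of what was observed.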
ArXiv: 2606.06245 [page] [pdf]
Authors: Boyang Zhang, Lianlei Shan
Abstract: Vision-Language-Action (VLA) policies remain brittle in long-horizon and high-uncertainty control, where one-pass action decoding provides limited inference-time deliberation. Explicit chain-of-thought can increase reasoning depth, but introduces token latency and an indirect text-to-action interface. We propose MPCoT, a reward-guided multi-path latent reasoning framework that initializes $M$ hypotheses, refines them for $K$ weight-tied steps, and softly aggregates them before action decoding. A training-only path-preference objective evaluates candidate action branches with expert-action consistency, world-model/VLM-based progress, and success feedback to align the latent path scorer with downstream execution quality. MPCoT preserves the original 8-step action interface, generates zero reasoning tokens, and exposes configurable inference controls $(K, M)$. Under matched protocols on LIBERO and CALVIN, MPCoT improves long-horizon performance, with ablations confirming depth-width effects, confidence-weighted aggregation, and reward-guided path supervision.
Comment: Matches criterion 1: applies a Vision-Language-Action (VLA) approach (MPCoT reward-guided multi-path latent reasoning) to robot manipulation and shows improved long-horizon performance on robotics benchmarks LIBERO and CALVIN.
Relevance: 10 Back to [topic] [top]
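The inference procedure summarized in the MPCoT abstract above (initialize M hypotheses, refine them for K weight-tied steps, softly aggregate) can be sketched as follows. The helper names, the softmax aggregation, and the toy refinement and scoring functions in the usage note are assumptions for illustration; the abstract says only that aggregation is soft and confidence-weighted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_path_reason(init_hypotheses, refine_step, score, k_steps):
    """Refine M latent hypotheses for K steps, then blend them.

    The same refine_step is reused at every step (weight tying), so K can
    be changed at inference time without adding parameters.
    """
    paths = [list(h) for h in init_hypotheses]  # M = len(init_hypotheses)
    for _ in range(k_steps):
        paths = [refine_step(p) for p in paths]
    weights = softmax([score(p) for p in paths])
    dim = len(paths[0])
    blended = [sum(w * p[d] for w, p in zip(weights, paths)) for d in range(dim)]
    return blended, weights
```

For example, with `refine_step` halving each coordinate and `score` returning the negative squared norm, two hypotheses `[4.0]` and `[8.0]` refined for two steps become `[1.0]` and `[2.0]`, and the blend leans toward the first because it scores higher.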
ArXiv: 2606.05699 [page] [pdf]
Authors: Runfa Blark Li, Kuang-Ting Tu, Nikola Raicevic, Dwait Bhatt, Xinshuang Liu, Keito Suzuki, Ki Myung Brian Lee, Nikolay Atanasov, Truong Nguyen
Abstract: Bimanual dexterous tool use remains challenging for robots due to high-dimensional hand configurations and complex hand-tool-object dynamics and contact. Most existing control policies depend on future configuration references provided from demonstrations, while future action-conditioned world models require slow online planning over high-dimensional action sequences. A significant challenge is generating a dynamically consistent future reference trajectory without relying on privileged states from demonstrations or slow counterfactual planning. We propose DexFuture, a hierarchical system that couples a high-level Future-State Visuomotor Target Predictor with a low-level Target-Conditioned Structured Dexterous Policy. Conditioned on egocentric RGB, proprioceptive and geometric history, the high-level predictor constructs structured hand-tool-object visuomotor embeddings and uses a horizon-conditioned transformer to generate a multi-step future target trajectory. Then, the low-level policy tracks them with a target-conditioned per-link transformer. This hierarchy decouples coarse future reference generation from fine-grained action control, and slow long-horizon semantic prediction from high-frequency execution. On OakInk2 bimanual tool-use tasks, DexFuture achieves 90% of the privileged-oracle performance, compared to 7% for a no-reference policy. DexFuture operates at 60 Hz, approximately 250 times faster than DexWM-style Cross-Entropy Method (CEM) planning with a future action-conditioned world model.
Comment: Matches criterion 1: proposes DexFuture, which pairs a learned high-level Future-State Visuomotor Target Predictor (a horizon-conditioned transformer) with a target-conditioned low-level policy for bimanual dexterous manipulation, and reports 90% of privileged-oracle performance on OakInk2 tool-use tasks.
Relevance: 10 Back to [topic] [top]
ArXiv: 2606.06323 [page] [pdf]
Authors: Robert Ramirez Sanchez, Daniel J. Evans, Dylan P. Losey, Siddarth Jain
Abstract: Humans often take longer to demonstrate a task than a robot would need to execute it. Rather than learning to replicate the demonstration at the same pace, many industrial and practical applications require robots to perform tasks as quickly as possible. In this paper, we investigate several hypotheses for learning policies that operate faster-than-demonstrations. Our experiments show that the most effective strategy is to downsample recorded demonstrations and train the robot's policy on this accelerated data. However, uniformly downsampling an entire trajectory can be problematic. Some parts of a task can be safely sped up (e.g., unconstrained motion), while others demand slower, more precise motion (e.g., object interactions or fine manipulation). To address this challenge, we introduce VOLT, a vision-and-language trajectory segmentation method that reasons over video demonstrations, and leverages contextual cues to determine when acceleration is appropriate and when careful precision is required. VOLT identifies segments where slow, deliberate motion is necessary, then selectively downsamples the remaining segments. The resulting reformatted trajectories can be used with standard imitation learning approaches, such as diffusion policies. Our results highlight that segmentation quality is critical -- baseline methods often misidentify when acceleration is possible, leading to overly cautious or unreliable policies. Compared to state-of-the-art alternatives, VOLT allows robots to execute tasks faster while maintaining strong performance.
Comment: Matches criterion 1: introduces VOLT, a vision-and-language trajectory segmentation method that reformats demos for imitation learning and is evaluated with diffusion policies to produce faster-than-demonstration manipulation.
Relevance: 10 Back to [topic] [top]
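The selective downsampling step described in the VOLT abstract above (keep precision-critical segments at full rate, thin out the rest) is sketched below. It assumes the segmentation has already been produced, here as a per-step boolean mask, and that the trajectory stores absolute waypoints so that dropping steps shortens execution; `selective_downsample` and its `factor` parameter are hypothetical names, not the paper's API.

```python
def selective_downsample(trajectory, precise_mask, factor=2):
    """Keep every step flagged as precise; keep every `factor`-th step elsewhere.

    trajectory:   list of waypoints (any type), one per recorded timestep.
    precise_mask: list of bools, True where slow, deliberate motion is needed.
    """
    if len(trajectory) != len(precise_mask):
        raise ValueError("trajectory and precise_mask must have equal length")
    if factor < 1:
        raise ValueError("factor must be at least 1")
    kept = []
    run_index = 0  # position inside the current fast (non-precise) run
    for step, precise in zip(trajectory, precise_mask):
        if precise:
            kept.append(step)
            run_index = 0  # a precise step ends the current fast run
        else:
            if run_index % factor == 0:
                kept.append(step)
            run_index += 1
    return kept
```

For a ten-step trajectory `0..9` with steps 4 to 6 marked precise and `factor=2`, the result is `[0, 2, 4, 5, 6, 7, 9]`: the contact-rich middle is untouched while the free-space motion on either side is halved.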
ArXiv: 2606.06491 [page] [pdf]
Authors: Dong Jing, Jingchen Nie, Tianqi Zhang, Jiaqi Liu, Huaxiu Yao, Zhiwu Lu, Mingyu Ding
Abstract: Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior efforts to accelerate VLAs through model compression, KV-cache reuse, or reinforcement learning only shift the policy from one fixed speed to another, and leave deceleration almost unexplored. We observe that the magnitude of each predicted action already governs how fast the robot moves, opening a direct route to controllable execution speed. We turn this observation into TempoVLA, a single VLA whose execution speed is controlled by an explicit condition. TempoVLA combines two coupled components. (1) A data-side Variable-Speed Trajectory Augmentation (VSTA) that re-times a demonstration to any target speed by merging or splitting actions while preserving its motion semantics. (2) A model-side conditioning mechanism that feeds the speed to the policy. Statistics show that VSTA reaches the requested speed with negligible motion error. Experiments in simulation and on real-world tasks demonstrate that TempoVLA achieves flexible speed control in both directions, while VSTA additionally boosts the default $1\times$ performance via better data utilization. Furthermore, by cooperating with a large multimodal model, TempoVLA realizes dynamic speed control, accelerating through low-risk phases and decelerating for high-risk ones.
Comment: Matches criterion 1: TempoVLA is a Vision-Language-Action (VLA) policy with a Variable-Speed Trajectory Augmentation (VSTA) and speed conditioning, enabling controllable execution speeds and integration with a large multimodal model for dynamic speed control.
Relevance: 10 Back to [topic] [top]
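The idea behind VSTA in the TempoVLA abstract above (re-timing a demonstration by merging or splitting actions) can be illustrated with a simplified sketch. It assumes actions are per-step displacement vectors, so that summing neighbours speeds the motion up and dividing a step slows it down, and it handles only integer speed factors; the abstract states that any target speed is supported, so this is a reduced illustration with hypothetical function names.

```python
def speed_up(deltas, factor):
    """Merge each run of `factor` consecutive delta actions into one larger action."""
    merged = []
    for start in range(0, len(deltas), factor):
        group = deltas[start:start + factor]
        # Sum the group coordinate-wise; a short final group is kept as-is.
        merged.append([sum(coords) for coords in zip(*group)])
    return merged


def slow_down(deltas, factor):
    """Split each delta action into `factor` equal smaller actions."""
    return [[d / factor for d in delta] for delta in deltas for _ in range(factor)]


def total_displacement(deltas):
    """Net motion of the whole sequence, used to check that re-timing preserves it."""
    return [sum(coords) for coords in zip(*deltas)]
```

Both operations change the number of steps while leaving the net displacement unchanged, which matches the abstract's observation that action magnitude governs how fast the robot moves.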
ArXiv: 2606.06049 [page] [pdf]
Authors: Liwen Zhang, Dong Zhou, Guanghui Sun, Yifei Zheng, Yuhui Hu, Kaihong Ouyang, Zuoquan Zhao
Abstract: Intra-vehicular robots in spacecraft help reduce astronaut workload and improve mission efficiency. Recent research focuses on using deep learning methods to achieve the acute control required for operations in these complex environments. However, objects exhibit unpredictable, unconstrained drift without gravitational damping. These factors demand robustness against complex multimodal action distributions. Diffusion policies (DP) can model these complex actions, but their iterative sampling process consumes too much energy for the limited power budgets of spacecraft. We therefore propose a low-energy intra-vehicular robotic manipulation framework, L-SDPPO, in which the Spiking Diffusion Policy (SDP) is optimized with a reinforcement learning (RL) algorithm. Furthermore, to address the insufficient perception of dynamic spatiotemporal features in microgravity, we propose the state-dependent latency injection (SDLI) mechanism, which mimics biological neural delays to dynamically regulate the timing of input information. Evaluation on five representative intra-vehicular daily tasks (e.g., hatch opening and precision container capping) shows that our method consistently achieves higher success rates and lower energy consumption, compared to the state-of-the-art robotic manipulation methods. These results demonstrate our method is a viable intra-vehicular robotic manipulation method.
Comment: Criterion 1: applies diffusion policies (Spiking Diffusion Policy, SDP) optimized with RL for intra-vehicular robotic manipulation and reports higher success rates and lower energy consumption on five representative manipulation tasks.
Relevance: 10
ArXiv: 2606.06155 [page] [pdf]
Authors: Qize Yu, Jiadi You, Yuran Wang, Jiaqi Liang, Bowen Ping, Yang Tian, Yue Chen, Minghong Cai, Zeying Gong, Ruihai Wu, Yinchuan Li, Junwei Liang, Yingcong Chen
Abstract: Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception-action mappings. To address this challenge, we propose AffordanceVLA, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception-action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) Which2Act for object-centric grounding via visual latent prediction to suppress distractions; 2) Where2Act for 2D interaction localization via affordance map estimation; and 3) How2Act for 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model using a three-stage training strategy with a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments in simulation and the real world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.
Comment: Criterion 1: introduces AffordanceVLA, a Vision-Language-Action model that leverages VLMs and structured affordance modules (Which2Act, Where2Act, How2Act) plus a Mixture-of-Transformer to bridge vision, language, and action for manipulation, with simulation and real-world results.
Relevance: 10
ArXiv: 2606.05960 [page] [pdf]
Authors: Anlan Yu, Zaishu Chen, Zhiqing Hong, Daqing Zhang
Abstract: Embodied intelligence is moving from laboratory demonstrations toward industrial deployment, with the logistics industry serving as a key application scenario. Learning-based policies offer a promising path beyond traditional perception-planning-control pipelines, but their scalability depends on how embodied data can be collected, organized, and reused. This research studies a data-centric framework for industrial embodied intelligence by constructing a logistics data flywheel. Our framework converts daily operations into reusable data assets, uses World Models to generate reliable supervision for long-tail parcel manipulation, and feeds deployment feedback back into policy improvement. As an initial result, WM-DAgger introduces a World-Model-based data aggregation framework that synthesizes out-of-distribution recovery data for robust imitation learning. Building on this result, ongoing work explores how large-scale in-the-wild multimodal data, including labeled human demonstrations, unlabeled operational videos, and system-level robot logs, can be aligned for policy learning and transformed into feedback for continual system improvement.
Comment: Criterion 1: proposes a logistics data flywheel that explicitly uses World Models and introduces WM-DAgger, a world-model-based data aggregation method that synthesizes out-of-distribution recovery data for robust imitation learning of parcel manipulation.
Relevance: 9
ArXiv: 2606.06493 [page] [pdf]
Authors: Lizhi Yang, Junheng Li, Nehar Poddar, Yiling Hou, Gio Huh, Robert Griffin, Georgia Gkioxari, Aaron Ames
Abstract: For a humanoid robot to be deployed in the real world, the choice of command space (i.e., the interface between task planning and whole-body control) is crucial. Existing whole-body controllers typically demand dense kinematic or spatial references that planners struggle to synthesize from task semantics. We instead propose a compact, explicit interface that is intuitive, general, modular, and expressive enough for diverse manipulation skills. To this end, we introduce HANDOFF, a single humanoid whole-body controller that follows this interface and is distilled via multi-teacher KL distillation under a context-conditioned gating scheme into a mixture-of-experts student from three complementary specialists: whole-body motion tracking with safety-filtered data, locomotion, and fall-recovery. On the Unitree G1, HANDOFF matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces. We further demonstrate hardware feasibility through multiple natural-language-driven task roll-outs, powered by a VLM-driven agentic planner with no task-specific data or controller fine-tuning.
Comment: Matches criterion 1: HANDOFF demonstrates a VLM-driven agentic planner coupled to a distilled whole-body controller for natural-language-driven humanoid task roll-outs, showing VLMs used in the robot control/planning loop.
Relevance: 9
ArXiv: 2606.06061 [page] [pdf]
Authors: Arash Ghasemzadeh Kakroudi, Roel Pieters
Abstract: This paper presents a distributed conversational framework for human-robot collaborative manipulation that integrates local language and vision-language models (VLMs) with a Robot Operating System 2 (ROS 2)-based execution stack. Language understanding, visual grounding, orchestration, and motion execution run as separate ROS 2 nodes, enabling flexible deployment across distributed hardware while maintaining a responsive control loop. From free-form user commands, the system generates structured action requests for pick, place, and handover. It uses a VLM to return image-space targets, which are converted into metric robot-frame goals using depth and calibration. A web dashboard exposes intermediate intent and grounding overlays (pixel, depth, and robot-frame) and requires explicit operator confirmation before any motion is executed. Experiments on a Franka FR3 platform evaluate end-to-end task reliability and latency under increasing working table scene ambiguity and compare alternative LLM/VLM configurations in the same pipeline. Code and full documentation are available at github.com/cogrob-tuni/franka-llm.
Comment: Criterion 1: integrates a vision-language model into a ROS2 control loop to ground free-form user commands to image-space targets converted to metric robot-frame goals and evaluates end-to-end on a Franka FR3 platform.
Relevance: 9
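Illustration: the abstract above mentions converting VLM image-space targets into metric robot-frame goals using depth and calibration. A minimal sketch of that kind of conversion, assuming a pinhole camera model and a known 4x4 camera-to-robot transform. The function name, intrinsics, and transform values are hypothetical and are not taken from the paper or its repository.

```python
def pixel_to_robot_frame(u, v, depth_m, fx, fy, cx, cy, T_robot_cam):
    """Back-project pixel (u, v) with metric depth into the robot base frame.

    Assumes a pinhole camera (intrinsics fx, fy, cx, cy) and a 4x4 homogeneous
    camera-to-robot transform given as nested lists. All names are illustrative.
    """
    # Pinhole back-projection into the camera frame (metres)
    x_cam = (u - cx) * depth_m / fx
    y_cam = (v - cy) * depth_m / fy
    p_cam = (x_cam, y_cam, depth_m, 1.0)
    # Apply the extrinsic calibration: first three rows of T_robot_cam @ p_cam
    return tuple(
        sum(T_robot_cam[r][c] * p_cam[c] for c in range(4)) for r in range(3)
    )


# Hypothetical calibration: camera axes aligned with the robot base,
# camera origin offset 0.5 m along the base z-axis.
T = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]
goal = pixel_to_robot_frame(
    u=420, v=240, depth_m=1.0,
    fx=600.0, fy=600.0, cx=320.0, cy=240.0,
    T_robot_cam=T,
)
```

In a real system the intrinsics would come from the camera driver and the transform from hand-eye calibration; the sketch only shows the arithmetic.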
ArXiv: 2606.05399 [page] [pdf]
Authors: Qilin Huang, Quynh Anh Huynh, Long Le, Chen Wang, Chuhao Chen, Ryan Lucas, Eric Eaton, Lingjie Liu
Abstract: Existing feed-forward networks excel at predicting a single set of physical properties from visual appearance, but this point-estimate paradigm fundamentally fails to capture the real world's inherent physical ambiguity. We address this by reframing physics prediction as a task of learning a controllable, continuous distribution of material properties. We introduce UNIPIXIE, a framework trained to predict a continuous and parameterized path of physically plausible material properties from a single visual input. By learning a direct mapping along an object's softest-to-stiffest spectrum on our PIXIEMULTIVERSE dataset, UNIPIXIE allows for controllable generation of diverse, physically valid material fields via a single intuitive parameter. Crucially, UNIPIXIE introduces a novel unified architecture to produce simulation-ready parameters for diverse physics solvers, including continuum-based Material Point Method (MPM), reduced-order deformation based on Linear Blend Skinning (LBS), and anchor-based Spring-Mass systems, addressing a key portability issue in prior work. Experiments show our approach not only generates a rich variety of plausible dynamics but also reduces Young's Modulus prediction error by over 50% against the strongest deterministic baseline, bridging the gap between static point estimates and the continuous nature of physical reality. Project page: https://unipixie.github.io/
Comment: Matches criterion 1: a generative model (UNIPIXIE) that learns a continuous, controllable distribution of material properties from images and outputs simulation-ready parameters for diverse physics solvers (MPM, LBS, Spring-Mass), reducing Young's Modulus prediction error by over 50% against the strongest deterministic baseline; the generated material fields are potentially useful for robot simulation and data augmentation.
Relevance: 7
ArXiv: 2606.06390 [page] [pdf]
Authors: Wenbo Li, Xiaoliang Ju, Zipeng Qin, Rongyao Fang, Hongsheng Li
Abstract: Indoor scene generation is crucial for robot simulation and modern interior design. However, complex layouts together with scarce 3D scene data make learning-based generation challenging. Existing methods often rely on hand-crafted rules or focus on isolated sub-tasks (e.g., floorplan synthesis or single-room furnishing), producing whole-home scenes that lack global coherence, realism, and simulation readiness. To mitigate these limitations, we propose a unified hierarchical framework that decomposes indoor scene synthesis into controllable stages. First, we curate a large-scale dataset of 300K real residential floorplans to train a large language model for whole-home floorplan generation. With detailed descriptions and a K-D tree-based representation, our method enables fine-grained, controllable whole-home floorplan generation. Building upon the generated whole-home floorplan, we leverage image generation models to draft furniture layouts from multi-level roaming viewpoints, and then generate the layouts of small manipulable objects on different supporting surfaces (e.g., cabinets, desks, and dining tables) for embodied AI simulation. During furniture and object layout generation, a VLM-based refiner iteratively corrects furniture and object placement, and a 3D generative model enables flexible replacement of individual assets. We further attach basic physical attributes and simple surface texture and lighting setups to complete the pipeline for embodied AI use. Experiments and user studies demonstrate that our pipeline produces indoor spaces with greater layout diversity and stronger 3D design appeal, outperforming prior methods on both quantitative and qualitative metrics. Finally, alongside our generation pipeline, we will release the floorplan dataset and 5K fully furnished scenes to the community. Project Page: https://kairos-homeworld.github.io/
Comment: Criterion 1 — uses a VLM-based refiner and a 3D generative model to produce simulation-ready whole-home scenes for embodied AI/robot simulation (trained on a curated 300K floorplan dataset and releasing 5K fully furnished scenes).
Relevance: 7
ArXiv: 2606.04280 [page] [pdf]
Authors: Justinas Zaliaduonis, Patrick Putzky, Till Richter, Sergios Gatidis
Abstract: Contrastive learning has become a leading paradigm for self-supervised representation learning, yet the conditions under which it recovers meaningful latent geometry remain incompletely understood. We develop a measure-theoretic framework formalizing the diversity condition, a support requirement on positive-pair sampling that is necessary for isometric latent recovery. We show that the standard full-support von Mises-Fisher setting implies the satisfaction of the diversity condition and as a consequence global contrastive loss minimizers recover latent geometry up to orthogonal transformation, while restricted conditionals can make non-orthogonal maps attain strictly lower asymptotic contrastive loss. We introduce a support-corrected Information Noise Contrastive Estimation (InfoNCE) variant as a theoretical fix: this correction makes orthogonal latent space recovery achievable but does not uniquely select it. Experiments on synthetic benchmarks validate the identifiability predictions, and CIFAR-10 experiments are consistent with the qualitative prediction that architectural inductive bias becomes more important when sampling diversity is limited. Together, our results clarify how sampling mechanisms and encoder inductive bias interact in contrastive representation learning.
Comment: Matches criterion 2: provides a theoretical/methodological advance in contrastive self-supervised learning (a support-corrected InfoNCE variant) with synthetic validation and CIFAR-10 experiments.
Relevance: 10
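Illustration: for readers unfamiliar with the objective discussed in the abstract above, here is a minimal sketch of the standard InfoNCE loss for a single anchor. It is not the paper's support-corrected variant, and the cosine similarity, temperature value, and toy vectors are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Standard InfoNCE for one anchor.

    loss = -log( exp(s_pos / t) / sum_k exp(s_k / t) ),
    where the sum runs over the positive and all negatives.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


anchor = [1.0, 0.0]
# Positive close to the anchor: the loss should be small.
loss_aligned = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# Positive orthogonal to the anchor while a negative is close: the loss is large.
loss_misaligned = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

The loss is non-negative because the positive term always appears in the denominator, and it shrinks as the positive pair becomes more similar than the negatives.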
ArXiv: 2606.06194 [page] [pdf]
Authors: Xingyao Lin, Guojin Zhong, Tianyi Lu, Ziyi Ye, Yichen Zhu, Zuxuan Wu, Yu-Gang Jiang
Abstract: Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently underperform those pretrained on robot data. We attribute this gap to a missing signal, the active perception behavior in egocentric videos, where humans continuously reposition their viewpoint during manipulation, inducing camera motion that standard pipelines treat as noise. To address this, we present ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, models camera motion as a viewpoint action, and jointly learns active perception and manipulation from in-the-wild egocentric human video before adapting to a target robot. Empirically, real-world experiments across tasks with diverse active perception demands show that ActiveMimic consistently surpasses baselines pretrained on human video and matches state-of-the-art models pretrained on robot data. Further analysis provides evidence that active perception capability originates from egocentric human video pretraining rather than robot-specific fine-tuning, confirming active perception as the key to unlocking egocentric human video for robot pretraining.
Comment: Criterion 2: proposes ActiveMimic, a self-supervised egocentric video pretraining method that models camera motion as a viewpoint action by recovering synchronized camera and wrist trajectories to improve robot transfer and match SOTA robot-pretrained models.
Relevance: 10
ArXiv: 2606.05506 [page] [pdf]
Authors: Amirhossein Zhalehmehrabi, Tiziano Tezze, Alberto Castelini, Alessandro Farinelli
Abstract: We propose a sensor-guided adaptive contrastive learning framework for visual representation learning in PointGoal navigation. During training, privileged LiDAR sensing guides the contrastive objective through a geometry-aware similarity metric and adaptive temperature scaling, encouraging visual embeddings to capture navigation-relevant structure rather than scene-specific appearance. The resulting encoder is pretrained independently, frozen, and used as the perceptual backbone for reinforcement learning, decoupling representation learning from policy optimization. We further introduce a cross-stage domain mismatch between representation pretraining and policy learning to suppress environment-specific shortcuts and promote reliance on task-relevant features. Extensive experiments in high-fidelity simulation demonstrate that our approach significantly improves policy-level scene transfer across diverse indoor and outdoor environments. At deployment, the agent relies only on monocular RGB observations together with standard task-related inputs such as goal position and proprioceptive signals, without access to LiDAR or other privileged sensors. Our method outperforms large pretrained vision models and standard contrastive baselines under severe appearance and semantic shifts. We also release a multimodal dataset to support future research on privileged-guided visual representation learning for navigation. The code is available at:
Comment: Matches criterion 2: proposes a new self-supervised contrastive method—sensor-guided adaptive contrastive learning—where privileged LiDAR guides a geometry-aware similarity metric and adaptive temperature scaling to learn visual representations that improve PointGoal navigation transfer.
Relevance: 8
ArXiv: 2606.05700 [page] [pdf]
Authors: Kerod Woldesenbet, Abem Woldesenbet
Abstract: We present T-SAR-JEPA, a self-supervised framework for temporal anomaly detection in SAR amplitude stacks via latent prediction. A ViT-Base/16 encoder from SAR-JEPA is domain-adapted on 39,300 Capella patches using local masked reconstruction with gradient feature prediction. A temporal transformer with sinusoidal time encoding forecasts future latent states from K=7 acquisitions, with progressive unfreezing substantially reducing validation loss. The model operates on amplitude alone; InSAR coherence serves exclusively as independent pseudo-ground-truth. On the DFC 2026 dataset (300 time-series, three AOIs), T-SAR-JEPA achieves ROC-AUC of 77.0% on the Hawaii eruption window, outperforming RX, PaDiM, Linear AR, and LSTM baselines (~50%). Spatial coherence of 99.9% (p < 0.001, permutation test) confirms structured detections. Code: https://github.com/TerraLatent/t-sar-jepa
Comment: Matches criterion 2: proposes a self-supervised latent-prediction framework (T-SAR-JEPA) using a ViT-Base/16 with local masked reconstruction and gradient feature prediction plus a temporal transformer, evaluated on the DFC 2026 SAR time-series (ROC-AUC 77.0%).
Relevance: 7
ArXiv: 2606.05109 [page] [pdf]
Authors: Vasiliki Rizou, Pascal Frossard, Dorina Thanou
Abstract: To leverage the full potential of multimodal data, we need representations that go beyond the state-of-the-art alignment and fusion approaches and exploit all cross-modal interactions without sacrificing modality-specific information. Learning disentangled representations is a principled way to identify these underlying shared and unique factors that are hidden in observational data. However, while multimodal disentanglement is a compelling paradigm, existing methods are largely confined to the two-modality regime due to its inherent scalability bottleneck. To address this, we propose RePercENT, a self-supervised framework designed to surpass these limitations and unlock scalable pairwise disentanglement beyond two modalities. Through a multimodal 'plug-and-play' architecture, our approach operates directly on pre-extracted embeddings, eliminating the need for extensive joint pre-training while making no assumptions regarding the underlying modalities or foundation model backbones. Moreover, we introduce a joint optimization objective for simultaneously deriving the shared and unique components, and provide formal theoretical guarantees that characterize the optimality of our solution. Across diverse modalities and tasks, RePercENT successfully recovers disentangled components while maintaining competitive performance and significantly reducing computational complexity.
Comment: Matches criterion 2: RePercENT is a self-supervised methodological contribution for scalable multimodal disentangled representation learning (operating on pre-extracted embeddings with a joint optimization objective and formal theoretical guarantees for pairwise disentanglement beyond two modalities).
Relevance: 7
ArXiv: 2606.04492 [page] [pdf]
Authors: Zicheng Zhao, Yu Lan, Chengzhengxu Li, Zhaohan Zhang, Xiaoming Liu
Abstract: Cooperative Multi-Agent Reinforcement Learning (MARL) frequently suffers from severe reward sparsity and exploration bottlenecks. While episodic memory mechanisms mitigate these issues by reusing high-return trajectories, they often trap agents in local optima due to unconstrained incentive distribution and semantic representation collapse. To address this, we propose Episodic Memory Temporal Consistency (EMTC), a framework that robustly constructs and selectively leverages historical experiences. EMTC introduces two synergistic components: (1) a Temporally Consistent Semantic Embedder that integrates contrastive learning with time-conditioned state reconstruction, preventing representation collapse and enabling precise memory retrieval; and (2) a Temporal Consistency Gating Mechanism that dynamically modulates episodic incentives based on temporal consistency error. This adaptive gate filters misleading signals from pseudo-successful trajectories, effectively mitigating Q-value overestimation. We provide theoretical guarantees, establishing a strict error bound that directly links the observable temporal consistency error to the underlying trajectory optimality and representation quality. Extensive evaluations on the SMAC and GRF benchmarks demonstrate that EMTC consistently outperforms state-of-the-art baselines. Notably, compared to the strongest episodic baseline, EMTC achieves absolute win-rate improvements of up to 24% in super-hard SMAC scenarios and an average improvement of 28% across GRF tasks.
Comment: Matches criterion 2 (self-supervised learning): proposes a Temporally Consistent Semantic Embedder that integrates contrastive learning with time-conditioned state reconstruction as a new SSL-style method for robust state/representation learning, with theoretical guarantees and strong empirical gains on SMAC and GRF (up to +24% win rate).
Relevance: 6
ArXiv: 2606.05769 [page] [pdf]
Authors: Tianxiang Jiang, Linquan Wu, Sheng Xia, Songze Li, Ziang Yan, Haoyu Yang, Yu Qiao, Yi Wang
Abstract: Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues can be lost, leading to plausible but visually ungrounded hallucinations. We introduce Future-L1, an interleaved latent visual reasoning framework that lets an MLLM alternate between language tokens and continuous latent visual spans during autoregressive decoding. To train this capability, we construct Future-L1-50K by selecting examples where future visual hints help prediction and align latent states to future-frame embeddings, then further optimize sampled latent trajectories with LA-DAPO, a latent-aware RL objective with outcome-contrastive and temporal-diversity rewards. Future-L1 achieves new state-of-the-art results on both benchmarks: on FutureBench, it improves Qwen3-VL-8B from 61.0 to 85.4 and exceeds the previous best Video-CoE by 10.4 points; on TwiFF-Bench, it improves the average score from 2.44 to 3.04. These results suggest that future-oriented video reasoning benefits from preserving intermediate visual semantics in latent space rather than translating every reasoning step into text.
Comment: Matches criterion 4: presents Future-L1 which interleaves language tokens and continuous latent visual spans for video event prediction, aligning latents to future-frame embeddings and yielding large gains (e.g., Qwen3-VL-8B from 61.0 to 85.4 on FutureBench).
Relevance: 9
ArXiv: 2606.06294 [page] [pdf]
Authors: Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li
Abstract: Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query -- a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.
Comment: Criterion 4 — language-to-video transfer for improved temporal reasoning: introduces the OMTG benchmark and new caption-based Chain-of-Thought rewards, achieving EtF1 43.65% and outperforming Gemini 2.5 Pro and Seed-1.8.
Relevance: 9
ArXiv: 2606.05833 [page] [pdf]
Authors: Haibo Wang, Lifu Huang
Abstract: Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.
Comment: Criterion 4 — GeoVR distills geometry from pre-trained 3D foundation models into MLLMs via four geometric targets (inter-frame camera poses, dense depth, metric scale, and multi-scale 3D feature distillation) and reports state-of-the-art gains on spatial reasoning benchmarks for video-level spatial understanding.
Relevance: 9
ArXiv: 2606.05736 [page] [pdf]
Authors: Shufan Zhang, Ziyue Lin, Bairun Wang, Lei Jin, Xuanding Ding, Xinzhu Ma, Kunlin Yang
Abstract: Video reasoning aims to understand complex temporal events and causal relationships within videos. Recently, Chain-of-Thought (CoT) has been introduced to this field to enhance reasoning accuracy. However, existing CoT-based video reasoning methods primarily rely on text-only information for logical deduction, overlooking critical visual information during the inference process. Inspired by the human cognitive mechanism of reviewing visual segments during inference, we propose VTI-CoT, a Visual-Textual Interleaved CoT framework. VTI-CoT integrates textual reasoning steps with corresponding visual frames. Given the scarcity of visual-textual interleaved CoT in existing datasets, we develop an automated annotation pipeline to construct high-quality multimodal CoT data. Further, reasoning over long-form videos entails increasingly long CoT token sequences, which severely hinders training convergence and efficiency. To address this, we employ Optical Character Recognition (OCR)-based compression techniques to compress CoT supervision signals into a single canvas. Experimental results demonstrate that VTI-CoT achieves state-of-the-art performance among models of the same parameter scale while significantly improving training efficiency.
Comment: Matches criterion 4: proposes VTI-CoT (Visual-Textual Interleaved Chain-of-Thought) that integrates textual reasoning steps with corresponding visual frames for improved video reasoning and constructs an automated multimodal CoT annotation pipeline.
Relevance: 7
ArXiv: 2606.04106 [page] [pdf]
Authors: Ulbert Jose Botero, Liam Smith, Brooks Olney, Pooya Khorrami, Steven Kusiak, Watson Jia, Sage Trudeau, Daniel Capecci
Abstract: Foundation models achieve generalization through massive-scale training on diverse data, but have limitations with transfer to truly unseen domains without paired training data. We propose principle-driven foundation models that encode signal-theoretic principles (Fourier decomposition, energy conservation, symmetry) rather than learn untethered statistical correlations. We hypothesize that domains differ not in fundamental physics, but in learnable transformations in time, frequency, magnitude, or phase. Training exclusively on radio-frequency (RF) data with co-designed architecture and losses incorporating these principles, we achieve cross-modal transfer to audio, images, text, and video using only frozen representations learned from RF data, requiring no fine-tuning of the encoder on target domains. Our 1.99M parameter frozen encoder achieves 77.7% average accuracy (91.9% top-3) across 15 diverse tasks via linear probing, with systematic variation: 84.5% on physically-grounded tasks (speaker recognition, seismology, RF fingerprinting) versus 70.0% on semantic tasks (music genre, language recognition). This reveals that principle-driven and scale-driven approaches offer complementary paths: physical principles enable efficient cross-modal transfer while naturally establishing the boundary between physical and semantic understanding.
Comment: Criterion 4 — cross-modal transfer: proposes a principle-driven frozen encoder (1.99M params) trained on RF data that transfers without fine-tuning to multiple modalities including video, achieving 77.7% average accuracy (91.9% top-3) across 15 diverse tasks via linear probing.
Relevance: 7
ArXiv: 2606.06002 [page] [pdf]
Authors: Mengshi Qi, Wei Deng, Xianlin Zhang, Huadong Ma
Abstract: Large Vision-Language Models have achieved significant reasoning performance in various tasks. However, there are few studies on text-to-3D indoor scene generation with LVLMs. The main challenge is that prevailing LVLM-based methods employ chain-of-thought sequential decision mechanisms that cannot revise earlier decisions, causing error propagation. In this paper, we consider the task as a planning problem constrained by spatial and layout commonsense. To solve this problem, we model it as a tree search problem with global and local trees, which differs from existing sequential decision-making approaches. In the global tree, we place each object iteratively and explore multiple attempts like humans furnishing a room, where the problem space is represented as a tree. To effectively search the tree, we propose a hierarchical scene representation and a PRM-guided MCTS method. The hierarchical representation abstracts a scene into room level, region level, floor object level, and supported object level. The PRM-guided MCTS method uses the PRM to prune unnecessary branches and the MCTS algorithm to balance exploration and exploitation to get an optimal solution with fewer attempts. In the local tree, it further decomposes the placement of each object into finer sub-steps, including the specific placement parameters. To make the whole appearance of the scene consistent, we leverage pre-trained diffusion image generative models to predict textures for all the objects in the scene. As existing benchmarks for text-to-3D indoor scene generation remain limited in scale and diversity, we collect a new large-scale diverse dataset that contains 65 scene types and 3,250 instructions with diverse sizes, layouts, and styles, named 3DTindo-bench, to better assess the capability of the state-of-the-art models. Our experiments show that our method generates more realistic 3D scenes than state-of-the-art approaches.
Comment: Criterion 5: proposes text-to-3D indoor scene generation with LVLMs using a hierarchical global-local tree and PRM-guided MCTS, plus diffusion-based texture synthesis and a new 3DTindo-bench dataset.
Relevance: 9 Back to [topic] [top]
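Illustration (not from the paper): the abstract above describes a PRM that prunes branches and an MCTS that balances exploration and exploitation. The sketch below shows that combination on a toy problem, assuming a hypothetical 1D "room" with four cells and three objects. `prm_score`, `CELLS`, `OBJECTS`, and the pruning threshold are stand-ins invented for this example; the authors' hierarchical scene representation and actual reward model are not reproduced here.

```python
import math
import random

CELLS = 4    # hypothetical 1D room with four cells
OBJECTS = 3  # objects to place, one per tree level


def prm_score(state):
    """Stand-in for a process reward model: 0.0 if two objects share a cell."""
    return 1.0 if len(set(state)) == len(state) else 0.0


class Node:
    def __init__(self, state, parent=None):
        self.state = state      # tuple of cell indices chosen so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0
        self.expanded = False


def expand(node, threshold=0.5):
    """Create one child per placement, pruning those scored below threshold."""
    for cell in range(CELLS):
        child_state = node.state + (cell,)
        if prm_score(child_state) >= threshold:
            node.children.append(Node(child_state, node))
    node.expanded = True


def ucb1(child, parent_visits, c=1.4):
    """Mean reward plus an exploration bonus; unvisited children go first."""
    if child.visits == 0:
        return math.inf
    bonus = c * math.sqrt(math.log(parent_visits) / child.visits)
    return child.value / child.visits + bonus


def rollout(state, rng):
    """Finish the layout with random placements and score the result."""
    state = list(state)
    while len(state) < OBJECTS:
        state.append(rng.randrange(CELLS))
    return prm_score(tuple(state))


def mcts(iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iterations):
        node = root
        # Selection: descend through expanded nodes using UCB1.
        while node.expanded and node.children:
            parent = node
            node = max(parent.children, key=lambda ch: ucb1(ch, parent.visits))
        # Expansion: grow the tree unless the node is a complete layout.
        if len(node.state) < OBJECTS and not node.expanded:
            expand(node)
            if node.children:
                node = rng.choice(node.children)
        # Simulation and backpropagation.
        reward = rollout(node.state, rng)
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read out the layout along the most-visited path.
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
    return node.state
```

Calling `mcts()` returns a tuple of three distinct cell indices. Because colliding placements are pruned at expansion time, the search never spends visits on branches the stand-in PRM already rejects.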
ArXiv: 2606.05491 [page] [pdf]
Authors: Jean Cordonnier, Chenghao Xu, Olga Fink, Malcolm Mielle
Abstract: arXiv:2606.05491v1 Announce Type: new Abstract: Multi-modal novel view synthesis (NVS) combining RGB and thermal imagery enables precise 3D scene reconstruction with visual and thermal information. However, existing methods typically rely on precisely calibrated RGB-thermal image pairs or stereo setups, limiting scalability and practical deployment. To address this, we introduce a framework for unpaired RGB-thermal NVS that leverages VGGT, a 3D feed-forward transformer architecture, to independently estimate camera poses for each modality. The pose sets are then aligned using the Procrustes algorithm with a cross-modal feature matcher, enabling joint registration without paired calibration. Building on this alignment, we further propose a multi-modal 3D Gaussian Splatting approach that learns directly from unpaired RGB and thermal images. Experiments on diverse scenes demonstrate that our method achieves competitive performance in thermal view synthesis while maintaining RGB fidelity. Moreover, we show that existing reconstruction approaches can produce modality-specific reconstructions that lack cross-modal consistency. We thus introduce a benchmarking framework to rigorously evaluate both per-modality image synthesis and the multi-modal coherence of reconstructed scenes.
Comment: Matches criterion 6: introduces an unpaired RGB–thermal multi-modal 3D reconstruction method that uses the Visual Geometry Grounded Transformer (VGGT) to estimate camera poses independently for each modality, aligns the two pose sets (Procrustes + cross-modal matcher), and trains a multi-modal 3D Gaussian Splatting model for joint novel-view synthesis.
Relevance: 9 Back to [topic] [top]
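Illustration (not from the paper): the abstract above aligns two independently estimated pose sets with the Procrustes algorithm. A minimal sketch of similarity Procrustes (the closed-form Umeyama solution) on 3D points follows, assuming point correspondences between the two sets are already known. The function name `similarity_procrustes` is hypothetical, and how the paper obtains correspondences from its cross-modal feature matcher is not shown.

```python
import numpy as np


def similarity_procrustes(src, dst):
    """Return (s, R, t) minimizing sum ||s * R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) arrays of corresponding points, e.g. camera centers
    estimated separately for two modalities (an assumption of this sketch).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so R is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(U @ Vt) >= 0 else -1.0
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt

    # Scale from the singular values and the source variance.
    var_src = (src_c ** 2).sum() / len(src)
    s = (S * np.diag(D)).sum() / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

With noise-free correspondences, applying `s * src @ R.T + t` reproduces `dst` up to floating-point error. Real pose estimates are noisy, so a robust wrapper such as RANSAC is often placed around this step; that choice is an assumption here, not something the abstract states.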
ArXiv: 2606.05912 [page] [pdf]
Authors: Jiahao Yang, Xiaohang Yang, Qing Wang, Yilan Dong, Gregory Slabaugh, Shanxin Yuan
Abstract: arXiv:2606.05912v1 Announce Type: new Abstract: Modeling dynamic facial expressions using 3D Gaussian representations remains challenging due to their unstructured nature. Conventional Gaussian avatar pipelines require extensive multiview and sequential expression data, limiting scalability and accessibility. In this work, we introduce Self-Adaptive Gaussian Expression (SAGE), a framework for self-learning expression-induced Gaussian deformations that enables high-fidelity, animatable avatars from minimal input data. Our method jointly optimizes 2D Gaussian surfels and a Signed Distance Field (SDF) to enforce compact, surface-aligned Gaussian distributions, while a self-supervised expression learning phase replaces long training sequences with geometric and appearance consistency constraints. This design allows flexible deployment across multiple reconstruction regimes: in the multiview setting, only a single frame (timestep) is required instead of thousands; in the monocular setting, only head rotations are needed without expression sequences; and in the one-shot setting, no pretraining or priors are necessary. Experiments demonstrate that our approach achieves reconstruction and animation quality comparable to state-of-the-art methods, while reducing data requirements by several orders of magnitude. Our results highlight the potential of self-supervised Gaussian deformation learning as a step toward accessible, data-efficient avatar creation.
Comment: Criterion 6: SAGE is a self-supervised method for data-efficient 3D Gaussian avatar reconstruction that jointly optimizes 2D Gaussian surfels and an SDF to learn expression-induced Gaussian deformations, achieving animatable high-fidelity results from minimal multiview or monocular inputs.