This project is adapted from tatsu-lab/gpt_paper_assistant. The source code for this adaptation is at Variante/gpt_paper_assistant.
About me on Bilibili. Help keep the website running.
Paper selection prompt and criteria (jump to the section by clicking the link):
3. Works on video segmentation that rely on unsupervised and self-supervised learning methods.
1004. Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos [more]
Authors: Matthew Strong, Wei-Jer Chang, Quentin Herau, Jiezhi Yang, Yihan Hu, Chensheng Peng, Wei Zhan
1048. SigVLP: Sigmoid Volume-Language Pre-Training for Self-Supervised CT-Volume Adaptive Representation Learning [more]
Authors: Jiayi Wang, Hadrien Reynaud, Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Bjoern Menze, Bernhard Kainz
1073. ECHOSAT: Estimating Canopy Height Over Space And Time [more]
Authors: Jan Pauls, Karsten Schrödter, Sven Ligensa, Martin Schwartz, Berkant Turan, Max Zimmer, Sassan Saatchi, Sebastian Pokutta, Philippe Ciais, Fabian Gieseke
Back to [top]
2000. World Guidance: World Modeling in Condition Space for Action Generation [more]
Authors: Yue Su, Sijin Chen, Haixin Shi, Mingyu Liu, Zhengshen Zhang, Ningyuan Huang, Weiheng Zhong, Zhengbang Zhu, Yuxiao Liu, Xihui Liu
2002. LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies [more]
Authors: Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding, Gedas Bertasius, Daniel Szafir
2005. Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [more]
Authors: Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li, Guoli Yang, Heng Tao Shen
2006. Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild [more]
Authors: Hao Luo, Ye Wang, Wanpeng Zhang, Haoqi Yuan, Yicheng Feng, Haiweng Xu, Sipeng Zheng, Zongqing Lu
2009. PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning [more]
Authors: Zekai Lin, Xu Zheng
2010. VLA Knows Its Limits [more]
Authors: Haoxuan Wang, Gengyu Zhang, Yan Yan, Ramana Rao Kompella, Gaowen Liu
2013. Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning [more]
Authors: Tomoya Kawabe (NEC Corporation), Rin Takano (NEC Corporation)
2015. MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving [more]
Authors: Lingjun Zhang, Yujian Yuan, Changjie Wu, Xinyuan Chang, Xin Cai, Shuang Zeng, Linzhe Shi, Sijin Wang, Hang Zhang, Mu Xu
2022. Humanizing Robot Gaze Shifts: A Framework for Natural Gaze Shifts in Humanoid Robots [more]
Authors: Jingchao Wei, Jingkai Qin, Yuxiao Cao, Jingcheng Huang, Xiangrui Zeng, Min Li, Zhouping Yin
2032. Are Foundation Models the Route to Full-Stack Transfer in Robotics? [more]
Authors: Freek Stulp, Samuel Bustamante, João Silvério, Alin Albu-Schäffer, Jeannette Bohg, Shuran Song
2041. Tacmap: Bridging the Tactile Sim-to-Real Gap via Geometry-Consistent Penetration Depth Map [more]
Authors: Lei Su, Zhijie Peng, Renyuan Ren, Shengping Mao, Juan Du, Kaifeng Zhang, Xuezhou Zhu
2045. Causal Decoding for Hallucination-Resistant Multimodal Large Language Models [more]
Authors: Shiwei Tan, Hengyi Wang, Weiyi Qin, Qi Xu, Zhigang Hua, Hao Wang
2046. Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models [more]
Authors: Jianghao Yin, Qin Chen, Kedi Chen, Jie Zhou, Xingjiao Wu, Liang He
2047. Primary-Fine Decoupling for Action Generation in Robotic Imitation [more]
Authors: Xiaohan Lei, Min Wang, Wengang Zhou, Xingyu Lu, Houqiang Li
2050. RT-RMOT: A Dataset and Framework for RGB-Thermal Referring Multi-Object Tracking [more]
Authors: Yanqiu Yu, Zhifan Jin, Sijia Chen, Tongfei Chu, En Yu, Liman Liu, Wenbing Tao
2052. Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data [more]
Authors: Emre Can Acikgoz, Cheng Qian, Jonas Hübotter, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur
2055. Learning Deformable Object Manipulation Using Task-Level Iterative Learning Control [more]
Authors: Krishna Suresh, Chris Atkeson
2056. See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs [more]
Authors: Yongchang Zhang, Xianzheng Ma, Tianyi Liu, Guangquan Zhou, Yang Chen
2058. Environment-Aware Learning of Smooth GNSS Covariance Dynamics for Autonomous Racing [more]
Authors: Y. Deemo Chen, Arion Zimmermann, Thomas A. Berrueta, Soon-Jo Chung
2059. Force Policy: Learning Hybrid Force-Position Control Policy under Interaction Frame for Contact-Rich Manipulation [more]
Authors: Hongjie Fang, Shirun Tang, Mingyu Mei, Haoxiang Qin, Zihao He, Jingjing Chen, Ying Feng, Chenxi Wang, Wanxi Liu, Zaixing He, Cewu Lu, Shiquan Wang
2063. ADM-DP: Adaptive Dynamic Modality Diffusion Policy through Vision-Tactile-Graph Fusion for Multi-Agent Manipulation [more]
Authors: Enyi Wang, Wen Fan, Dandan Zhang
2067. Therapist-Robot-Patient Physical Interaction is Worth a Thousand Words: Enabling Intuitive Therapist Guidance via Remote Haptic Control [more]
Authors: Beatrice Luciani, Alex van den Berg, Matti Lang, Alexandre L. Ratschat, Laura Marchal-Crespo
2068. Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models [more]
Authors: Zheyuan Gu, Qingsong Zhao, Yusong Wang, Zhaohong Huang, Xinqi Li, Cheng Yuan, Jiaowei Shao, Chi Zhang, Xuelong Li
2075. NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors [more]
Authors: Lingfeng Ren, Weihao Yu, Runpeng Yu, Xinchao Wang
2080. LessMimic: Long-Horizon Humanoid Interaction with Unified Distance Field Representations [more]
Authors: Yutang Lin, Jieming Cui, Yixuan Li, Baoxiong Jia, Yixin Zhu, Siyuan Huang
2084. Self-Curriculum Model-based Reinforcement Learning for Shape Control of Deformable Linear Objects [more]
Authors: Zhaowei Liang, Song Wang, Zhao Jin, Shirui Wu, Dan Wu
2087. FlowCorrect: Efficient Interactive Correction of Generative Flow Policies for Robotic Manipulation [more]
Authors: Edgar Welte, Yitian Shi, Rosa Wolf, Maximillian Gilles, Rania Rayyes
2090. CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning [more]
Authors: Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou, Anxiang Zeng, Jianqiang Huang
2096. SPOC: Safety-Aware Planning Under Partial Observability And Physical Constraints [more]
Authors: Hyungmin Kim, Hobeom Jeon, Dohyung Kim, Minsu Jang, Jeahong Kim
2099. Behavioral Cloning for Robotic Connector Assembly: An Empirical Study [more]
Authors: Andreas Kernbach, Daniel Bargmann, Werner Kraus, Marco F. Huber
Back to [top]
3003. Exploring Vision-Language Models for Open-Vocabulary Zero-Shot Action Segmentation [more]
Authors: Asim Unmesh, Kaki Ramesh, Mayank Patel, Rahul Jain, Karthik Ramani
3034. GeoMotion: Rethinking Motion Segmentation via Latent 4D Geometry [more]
Authors: Xiankang He, Peile Lin, Ying Cui, Dongyan Guo, Chunhua Shen, Xiaoqin Zhang
3035. Asymptotically Fast Clebsch-Gordan Tensor Products with Vector Spherical Harmonics [more]
Authors: YuQing Xie, Ameya Daigavane, Mit Kotak, Tess Smidt
3061. UniVBench: Towards Unified Evaluation for Video Foundation Models [more]
Authors: Jianhui Wei, Xiaotian Zhang, Yichen Li, Yuan Wang, Yan Zhang, Ziyi Chen, Zhihang Tang, Wei Xu, Zuozhu Liu
3070. MMLoP: Multi-Modal Low-Rank Prompting for Efficient Vision-Language Adaptation [more]
Authors: Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Mahnoosh Alizadeh, Ramtin Pedarsani
3071. AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting [more]
Authors: Artur Xarles, Sergio Escalera, Thomas B. Moeslund, Albert Clapés
3077. Training Generalizable Collaborative Agents via Strategic Risk Aversion [more]
Authors: Chengrui Qu, Yizhou Zhang, Nicholas Lanzetti, Eric Mazumdar
3078. Solaris: Building a Multiplayer Video World Model in Minecraft [more]
Authors: Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, Saining Xie
3081. Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking [more]
Authors: Shengqiong Wu, Bobo Li, Xinkai Wang, Xiangtai Li, Lei Cui, Furu Wei, Shuicheng Yan, Hao Fei, Tat-seng Chua
3088. Trajectory Generation with Endpoint Regulation and Momentum-Aware Dynamics for Visually Impaired Scenarios [more]
Authors: Yuting Zeng, Manping Fan, You Zhou, Yongbin Yu, Zhiwen Zheng, Jingtao Zhang, Liyong Ren, Zhenglin Yang
3097. Autonomous Sea Turtle Robot for Marine Fieldwork [more]
Authors: Zach J. Patterson, Emily Sologuren, Levi Cai, Daniel Kim, Alaa Maalouf, Pascal Spino, Daniela Rus
3100. UNet-Based Keypoint Regression for 3D Cone Localization in Autonomous Racing [more]
Authors: Mariia Baidachna, James Carty, Aidan Ferguson, Joseph Agrane, Varad Kulkarni, Aubrey Agub, Michael Baxendale, Aaron David, Rachel Horton, Elliott Atkinson
Back to [top]
4024. Unified Unsupervised and Sparsely-Supervised 3D Object Detection by Semantic Pseudo-Labeling and Prototype Learning [more]
Authors: Yushen He
4031. Tokenizing Semantic Segmentation with RLE [more]
Authors: Abhineet Singh, Justin Rozeboom, Nilanjan Ray
4042. Cross domain Persistent Monitoring for Hybrid Aerial Underwater Vehicles [more]
Authors: Ricardo B. Grando, Victor A. Kich, Alisson H. Kolling, Junior C. D. Jesus, Rodrigo S. Guerra, Paulo L. J. Drews-Jr
4049. SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance [more]
Authors: Minghan Yang, Lan Yang, Ke Li, Honggang Zhang, Kaiyue Pang, Yizhe Song
4054. SEF-MAP: Subspace-Decomposed Expert Fusion for Robust Multimodal HD Map Prediction [more]
Authors: Haoxiang Fu, Lingfeng Zhang, Hao Li, Ruibing Hu, Zhengrong Li, Guanjing Liu, Zimu Tan, Long Chen, Hangjun Ye, Xiaoshuai Hao
4086. LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration [more]
Authors: Aditya Ranjan Dash, Ramy Battrawy, René Schuster, Didier Stricker
4089. SF3D-RGB: Scene Flow Estimation from Monocular Camera and Sparse LiDAR [more]
Authors: Rajai Alhimdiat, Ramy Battrawy, René Schuster, Didier Stricker, Wesam Ashour
4098. RGB-Event HyperGraph Prompt for Kilometer Marker Recognition based on Pre-trained Foundation Models [more]
Authors: Xiaoyu Xian, Shiao Wang, Xiao Wang, Daxin Tian, Yan Tian
Back to [top]
5011. SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model [more]
Authors: Guibin Chen, Dixuan Lin, Jiangping Yang, Youqiang Zhang, Zhengcong Fei, Debang Li, Sheng Chen, Chaofeng Ao, Nuo Pang, Yiming Wang, Yikun Dou, Zheng Chen, Mingyuan Fan, Tuanhui Li, Mingshan Chang, Hao Zhang, Xiaopeng Sun, Jingtao Xu, Yuqiang Xie, Jiahua Wang, Zhiheng Xu, Weiming Xiong, Yuzhe Jin, Baoxuan Gu, Binjie Mao, Yunjie Yu, Jujie He, Yuhao Feng, Shiwen Tu, Chaojie Wang, Rui Yan, Wei Shen, Jingchen Wu, Peng Zhao, Xuanyue Zhong, Zhuangzhuang Liu, Kaifei Wang, Fuxiang Zhang, Weikai Xu, Wenyan Liu, Binglu Zhang, Yu Shen, Tianhui Xiong, Bin Peng, Liang Zeng, Xuchen Song, Haoxiang Guo, Peiyu Wang, Yahui Zhou
5012. The Design Space of Tri-Modal Masked Diffusion Models [more]
Authors: Louis Bethune, Victor Turrisi, Bruno Kacper Mlodozeniec, Pau Rodriguez Lopez, Lokesh Boominathan, Nikhil Bhendawade, Amitis Shidani, Joris Pelemans, Theo X. Olausson, Devon Hjelm, Paul Dixon, Joao Monteiro, Pierre Ablin, Vishnu Banna, Arno Blaas, Nick Henderson, Kari Noriy, Dan Busbridge, Josh Susskind, Marco Cuturi, Irina Belousova, Luca Zappella, Russ Webb, Jason Ramapuram
5014. Provably Safe Generative Sampling with Constricting Barrier Functions [more]
Authors: Darshan Gadginmath, Ahmed Allibhoy, Fabio Pasqualetti
5018. Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling [more]
Authors: Euisoo Jung, Byunghyun Kim, Hyunjin Kim, Seonghye Cho, Jae-Gil Lee
5020. A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers [more]
Authors: Trung X. Pham, Kang Zhang, Ji Woo Hong, Chang D. Yoo
5023. CADC: Content Adaptive Diffusion-Based Generative Image Compression [more]
Authors: Xihua Sheng, Lingyu Zhu, Tianyu Zhang, Dong Liu, Shiqi Wang, Jing Wang
5033. MultiAnimate: Pose-Guided Image Animation Made Extensible [more]
Authors: Yingcheng Hu, Haowen Gong, Chuanguang Yang, Zhulin An, Yongjun Xu, Songhua Liu
5037. CoLoGen: Progressive Learning of Concept-Localization Duality for Unified Image Generation [more]
Authors: YuXin Song, Yu Lu, Haoyuan Sun, Huanjin Yao, Fanglong Liu, Yifan Sun, Haocheng Feng, Hang Zhou, Jingdong Wang
5038. FlowFixer: Towards Detail-Preserving Subject-Driven Generation [more]
Authors: Jinyoung Jun, Won-Dong Jang, Wenbin Ouyang, Raghudeep Gadde, Jungbeom Lee
5039. UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling [more]
Authors: Zhihao Sun, Tong Wu, Ruirui Tu, Daoguo Dong, Zuxuan Wu
5043. Uncertainty-Aware Diffusion Model for Multimodal Highway Trajectory Prediction via DDIM Sampling [more]
Authors: Marion Neumeier, Niklas Roßberg, Michael Botsch, Wolfgang Utschick
5079. Towards Controllable Video Synthesis of Routine and Rare OR Events [more]
Authors: Dominik Schneider, Lalithkumar Seenivasan, Sampath Rapuri, Vishalroshan Anil, Aiza Maksutova, Yiqing Shen, Jan Emily Mangulabnan, Hao Ding, Jose L. Porras, Masaru Ishii, Mathias Unberath
5085. When LoRA Betrays: Backdooring Text-to-Image Models by Masquerading as Benign Adapters [more]
Authors: Liangwei Lyu, Jiaqi Xu, Jianwei Ding, Qiyao Deng
Back to [top]
6027. Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow [more]
Authors: Shimin Hu, Yuanyi Wei, Fei Zha, Yudong Guo, Juyong Zhang
Back to [top]
7001. Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping [more]
Authors: Junmyeong Lee, Hoseung Choi, Minsu Cho
7007. HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles [more]
Authors: Yifan Wang, Francesco Pittaluga, Zaid Tasneem, Chenyu You, Manmohan Chandraker, Ziyu Jiang
7008. DAGS-SLAM: Dynamic-Aware 3DGS SLAM via Spatiotemporal Motion Probability and Uncertainty-Aware Scheduling [more]
Authors: Li Zhang, Yu-An Liu, Xijia Jiang, Conghao Huang, Danyang Li, Yanyong Zhang
7016. Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction [more]
Authors: Changqing Zhou, Yueru Luo, Changhao Chen
7017. Pseudo-View Enhancement via Confidence Fusion for Unposed Sparse-View Reconstruction [more]
Authors: Beizhen Zhao, Sicheng Yu, Guanzhi Ding, Yu Hu, Hao Wang
7019. Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences [more]
Authors: Julian Kaltheuner, Hannah Dröge, Markus Plack, Patrick Stotko, Reinhard Klein
7021. Lie Flow: Video Dynamic Fields Modeling and Predicting with Lie Algebra as Geometric Physics Principle [more]
Authors: Weidong Qiao, Wangmeng Zuo, Hui Li
7026. WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos [more]
Authors: Yufei Ye, Jiaman Li, Ryan Rong, C. Karen Liu
7028. Global-Aware Edge Prioritization for Pose Graph Initialization [more]
Authors: Tong Wei, Giorgos Tolias, Jiri Matas, Daniel Barath
7029. Scaling View Synthesis Transformers [more]
Authors: Evan Kim, Hyunwoo Ryu, Thomas W. Mitchel, Vincent Sitzmann
7030. XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression [more]
Authors: Zunhai Su, Weihao Ye, Hansen Feng, Keyu Fan, Jing Zhang, Dahai Yu, Zhengwu Liu, Ngai Wong
7036. Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context [more]
Authors: JiaKui Hu, Jialun Liu, Liying Yang, Xinliang Zhang, Kaiwen Li, Shuang Zeng, Yuanwei Li, Haibin Huang, Chi Zhang, Yanye Lu
Back to [top]
25. Dream-SLAM: Dreaming the Unseen for Active SLAM in Dynamic Environments [more]
Authors: Xiangqi Meng, Pengxu Hou, Zhenjun Zhao, Javier Civera, Daniel Cremers, Hesheng Wang, Haoang Li
40. Parallel Continuous-Time Relative Localization with Augmented Clamped Non-Uniform B-Splines [more]
Authors: Jiadong Lu, Zhehan Li, Tao Han, Miao Xu, Chao Xu, Yanjun Cao
44. RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations [more]
Authors: I-Hsiang Chen, Yu-Wei Liu, Tse-Yu Wu, Yu-Chien Chiang, Jen-Chien Yang, Wei-Ting Chen
51. Unified Complementarity-Based Contact Modeling and Planning for Soft Robots [more]
Authors: Milad Azizkhani, Yue Chen
53. Geometric Priors for Generalizable World Models via Vector Symbolic Architecture [more]
Authors: William Youngwoo Chung, Calvin Yeung, Hansen Jin Lillemark, Zhuowen Zou, Xiangjian Liu, Mohsen Imani
57. Compact Circulant Layers with Spectral Priors [more]
Authors: Joseph Margaryan, Thomas Hamelryck
60. Iterative Closed-Loop Motion Synthesis for Scaling the Capabilities of Humanoid Control [more]
Authors: Weisheng Xu, Qiwei Wu, Jiaxi Zhang, Tan Jing, Yangfan Li, Yuetong Fang, Jiaqi Xiong, Kai Wu, Rong Ou, Renjing Xu
62. StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives [more]
Authors: Jinghao Hu, Yuhe Zhang, GuoHua Geng, Kang Li, Han Zhang
64. Jumping Control for a Quadrupedal Wheeled-Legged Robot via NMPC and DE Optimization [more]
Authors: Xuanqi Zeng, Lingwei Zhang, Linzhu Yue, Zhitao Song, Hongbo Zhang, Tianlin Zhang, Yun-Hui Liu
65. Automatic Map Density Selection for Locally-Performant Visual Place Recognition [more]
Authors: Somayeh Hussaini, Tobias Fischer, Michael Milford
66. CableRobotGraphSim: A Graph Neural Network for Modeling Partially Observable Cable-Driven Robot Dynamics [more]
Authors: Nelson Chen, William R. Johnson III, Rebecca Kramer-Bottiglio, Kostas Bekris, Mridul Aanjaneya
69. WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs [more]
Authors: Yulin Zhang, Cheng Shi, Sibei Yang
72. Learning Agile and Robust Omnidirectional Aerial Motion on Overactuated Tiltable-Quadrotors [more]
Authors: Wentao Zhang, Zhaoqi Ma, Jinjie Li, Huayi Wang, Haokun Liu, Junichiro Sugihara, Chen Chen, Yicheng Chen, Moju Zhao
74. Hierarchical Lead Critic based Multi-Agent Reinforcement Learning [more]
Authors: David Eckel, Henri Meeß
76. Scan Clusters, Not Pixels: A Cluster-Centric Paradigm for Efficient Ultra-high-definition Image Restoration [more]
Authors: Chen Wu, Ling Wang, Zhuoran Zheng, Yuning Cui, Zhixiong Yang, Xiangyu Chen, Yue Zhang, Weidong Jiang, Jingyuan Xia
82. Position-Based Flocking for Persistent Alignment without Velocity Sensing [more]
Authors: Hossein B. Jond, Veli Bakırcıoğlu, Logan E. Beaver, Nejat Tükenmez, Adel Akbarimajd, Martin Saska
83. System Design of the Ultra Mobility Vehicle: A Driving, Balancing, and Jumping Bicycle Robot [more]
Authors: Benjamin Bokser, Daniel Gonzalez, Surya Singh, Aaron Preston, Alex Bahner, Annika Wollschläger, Arianna Ilvonen, Asa Eckert-Erdheim, Ashwin Khadke, Bilal Hammoud, Dean Molinaro, Fabian Jenelten, Henry Mayne, Howie Choset, Igor Bogoslavskyi, Itic Tinman, James Tigue, Jan Preisig, Kaiyu Zheng, Kenny Sharma, Kim Ang, Laura Lee, Liana Margolese, Nicole Lin, Oscar Frias, Paul Drews, Ravi Boggavarapu, Rick Burnham, Samuel Zapolsky, Sangbae Kim, Scott Biddlestone, Sean Mayorga, Shamel Fahmi, Tyler McCollum, Velin Dimitrov, William Moyne, Yu-Ming Chen, Farbod Farshidian, Marco Hutter, David Perry, Al Rizzi, Gabe Nelson
91. AngelSlim: A more accessible, comprehensive, and efficient toolkit for large model compression [more]
Authors: Rui Cen, QiangQiang Hu, Hong Huang, Hong Liu, Song Liu, Xin Luo, Lin Niu, Yifan Tan, Decheng Wu, Linchuan Xie, Rubing Yang, Guanghua Yu, Jianchen Zhu
92. Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps [more]
Authors: Shan Wang, Peixia Li, Chenchen Xu, Ziang Cheng, Jiayu Yang, Hongdong Li, Pulak Purkait
93. StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles [more]
Authors: Daniel Oliveira, David Martins de Matos
94. DexRepNet++: Learning Dexterous Robotic Manipulation with Geometric and Spatial Hand-Object Representations [more]
Authors: Qingtao Liu, Zhengnan Sun, Yu Cui, Haoming Li, Gaofeng Li, Lin Shao, Jiming Chen, Qi Ye
95. SAPNet++: Evolving Point-Prompted Instance Segmentation with Semantic and Spatial Awareness [more]
Authors: Zhaoyang Wei, Xumeng Han, Xuehui Yu, Xue Yang, Guorong Li, Zhenjun Han, Jianbin Jiao
Back to [top]
ArXiv: 2602.22091 [page] [pdf]
Authors: Matthew Strong, Wei-Jer Chang, Quentin Herau, Jiezhi Yang, Yihan Hu, Chensheng Peng, Wei Zhan
Abstract: arXiv:2602.22091v1 Announce Type: new Abstract: Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals that guide the model to jointly predict current and future point maps, camera poses, semantic segmentation, and motion masks. Multi-modal teachers provide sequence-level pseudo-supervision, enabling LFG to learn a unified pseudo-4D representation from raw YouTube videos without poses, labels, or LiDAR. The resulting encoder not only transfers effectively to downstream autonomous driving planning on the NAVSIM benchmark, surpassing multi-camera and LiDAR baselines with only a single monocular camera, but also yields strong performance when evaluated on a range of semantic, geometric, and qualitative motion prediction tasks. These geometry and motion-aware features position LFG as a compelling video-centric foundation model for autonomous driving.
Comment: Matches criterion 1. Proposes large-scale label-free (teacher-guided pseudo-supervision) pretraining from unposed in-the-wild driving videos to learn unified 4D/video representations (point maps, poses, semantics, motion) — a self-supervised/video-representation methodological advance for driving.
Relevance: 9 Novelty: 7 Back to [topic] [top]
ArXiv: 2602.21735 [page] [pdf]
Authors: Jiayi Wang, Hadrien Reynaud, Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Bjoern Menze, Bernhard Kainz
Abstract: arXiv:2602.21735v1 Announce Type: new Abstract: Large-scale, volumetric medical imaging datasets typically aggregate scans from different vendors and devices, resulting in highly variable resolution, slice thicknesses, and numbers of slices per study. Consequently, training representation models usually requires cropping or interpolating along the z-axis to obtain fixed-size blocks, which inevitably causes information loss. We propose a new training approach to overcome this limitation. Instead of absolute position embeddings, we interpret volumes as sequences of 3D chunks and adopt Rotary Position Embeddings, allowing us to treat the z-axis as an unconstrained temporal dimension. Building on this idea, we introduce a new vision-language model: SigVLP. In SigVLP, we implement Rotary Position Embedding as the positional encoding method, which is applied directly within the attention operation, generating input-conditioned sine and cosine weights on the fly. This design ensures consistent alignment between query and key projections and adapts to any input size. To allow for variable input size during training, we sample Computed Tomography volumes in chunks and pair them with localized organ-wise textual observations. Compared to using entire reports for conditioning, chunkwise alignment provides finer-grained supervision, enabling the model to establish stronger correlations between the text and volume representations, thereby improving the precision of text-to-volume alignment. Our models are trained with the Muon optimizer and evaluated on a diverse set of downstream tasks, including zero-shot abnormality and organ classification, segmentation, and retrieval tasks.
Comment: Matches criterion 1: proposes self‑supervised CT-volume representation learning (SigVLP) with Rotary Position Embeddings and chunkwise text–volume alignment to handle variable z-axis sizes.
Relevance: 4 Novelty: 5 Back to [topic] [top]
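The rotary encoding SigVLP builds on (input-conditioned sine/cosine weights applied inside the attention operation) is the standard RoPE construction. A minimal NumPy sketch — not the paper's code — showing why arbitrary, even non-uniform, slice positions work: after rotation, query–key dot products depend only on relative position, so the z-axis needs no fixed size.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding along the last dim of x.

    x: (..., seq, dim) with even dim; positions: (seq,) arbitrary
    (possibly non-uniform) coordinates, e.g. CT slice locations.
    Each adjacent feature pair is rotated by position * frequency.
    """
    d = x.shape[-1]
    assert d % 2 == 0, "feature dim must be even for pairwise rotation"
    freqs = base ** (-np.arange(0, d, 2) / d)      # (d/2,) geometric frequencies
    angles = positions[:, None] * freqs[None, :]   # (seq, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin           # 2D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because the rotation is norm-preserving and the angle difference between two tokens depends only on the gap between their positions, attention scores are invariant to shifting every slice coordinate by a constant.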
ArXiv: 2602.21421 [page] [pdf]
Authors: Jan Pauls, Karsten Schrödter, Sven Ligensa, Martin Schwartz, Berkant Turan, Max Zimmer, Sassan Saatchi, Sebastian Pokutta, Philippe Ciais, Fabian Gieseke
Abstract: arXiv:2602.21421v1 Announce Type: new Abstract: Forest monitoring is critical for climate change mitigation. However, existing global tree height maps provide only static snapshots and do not capture temporal forest dynamics, which are essential for accurate carbon accounting. We introduce ECHOSAT, a global and temporally consistent tree height map at 10 m resolution spanning multiple years. To this end, we resort to multi-sensor satellite data to train a specialized vision transformer model, which performs pixel-level temporal regression. A self-supervised growth loss regularizes the predictions to follow growth curves that are in line with natural tree development, including gradual height increases over time, but also abrupt declines due to forest loss events such as fires. Our experimental evaluation shows that our model improves state-of-the-art accuracies in the context of single-year predictions. We also provide the first global-scale height map that accurately quantifies tree growth and disturbances over time. We expect ECHOSAT to advance global efforts in carbon monitoring and disturbance assessment. The maps can be accessed at https://github.com/ai4forest/echosat.
Comment: Partial topical overlap but not a close match. Uses a vision transformer with a self‑supervised 'growth' loss for temporally consistent pixel‑level tree height estimation — touches self‑supervision for pixel‑level regression (criterion 1) but is primarily a domain application (remote sensing/forestry) rather than a general methodological advance.
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.22010 [page] [pdf]
Authors: Yue Su, Sijin Chen, Haixin Shi, Mingyu Liu, Zhengshen Zhang, Ningyuan Huang, Weiheng Zhong, Zhengbang Zhu, Yuxiao Liu, Xihui Liu
Abstract: arXiv:2602.22010v1 Announce Type: new Abstract: Leveraging future observation modeling to facilitate action generation presents a promising avenue for enhancing the capabilities of Vision-Language-Action (VLA) models. However, existing approaches struggle to strike a balance between maintaining efficient, predictable future representations and preserving sufficient fine-grained information to guide precise action generation. To address this limitation, we propose WoG (World Guidance), a framework that maps future observations into compact conditions by injecting them into the action inference pipeline. The VLA is then trained to simultaneously predict these compressed conditions alongside future actions, thereby achieving effective world modeling within the condition space for action inference. We demonstrate that modeling and predicting this condition space not only facilitates fine-grained action generation but also exhibits superior generalization capabilities. Moreover, it learns effectively from substantial human manipulation videos. Extensive experiments across both simulation and real-world environments validate that our method significantly outperforms existing methods based on future prediction. Project page is available at: https://selen-suyue.github.io/WoGNet/
Comment: Matches criterion 2 (VLMs in robotics): proposes WoG (World Guidance) for Vision-Language-Action models by mapping future observations into compact condition space to improve action generation from manipulation videos, with experiments in simulation and real-world robotics.
Relevance: 10 Novelty: 7 Back to [topic] [top]
ArXiv: 2602.21531 [page] [pdf]
Authors: Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding, Gedas Bertasius, Daniel Szafir
Abstract: arXiv:2602.21531v1 Announce Type: new Abstract: General-purpose robots must master long-horizon manipulation, defined as tasks involving multiple kinematic structure changes (e.g., attaching or detaching objects) in unstructured environments. While Vision-Language-Action (VLA) models offer the potential to master diverse atomic skills, they struggle with the combinatorial complexity of sequencing them and are prone to cascading failures due to environmental sensitivity. To address these challenges, we propose LiLo-VLA (Linked Local VLA), a modular framework capable of zero-shot generalization to novel long-horizon tasks without ever being trained on them. Our approach decouples transport from interaction: a Reaching Module handles global motion, while an Interaction Module employs an object-centric VLA to process isolated objects of interest, ensuring robustness against irrelevant visual features and invariance to spatial configurations. Crucially, this modularity facilitates robust failure recovery through dynamic replanning and skill reuse, effectively mitigating the cascading errors common in end-to-end approaches. We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long. In these simulations, LiLo-VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%. Furthermore, real-world evaluations across 8 long-horizon tasks demonstrate an average success rate of 85%. Project page: https://yy-gx.github.io/LiLo-VLA/.
Comment: Matches criterion 2. Presents LiLo-VLA, a modular vision-language-action approach for long-horizon robot manipulation that uses object-centric VLA policies and modular replanning to improve robustness and zero-shot generalization in robotic tasks.
Relevance: 9 Novelty: 7 Back to [topic] [top]
ArXiv: 2602.21633 [page] [pdf]
Authors: Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li, Guoli Yang, Heng Tao Shen
Abstract: arXiv:2602.21633v1 Announce Type: new Abstract: Standard vision-language-action (VLA) models rely on fitting statistical data priors, limiting their robust understanding of underlying physical dynamics. Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states. World action models have emerged as a promising paradigm that integrates imagination and control to enable predictive planning. However, they rely on implicit context modeling, lacking explicit mechanisms for self-improvement. To solve these problems, we propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. We first design sparse world imagination by integrating auxiliary predictive heads to forecast current task progress and future trajectory trends, thereby constraining the policy to encode short-term physical evolution. Then we introduce the online action refinement module to reshape progress-dependent dense rewards, adjusting trajectory orientation based on the predicted sparse future states. Evaluations on challenging robot manipulation tasks from simulation benchmarks and real-world settings demonstrate that SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines, alongside a 14% gain in real-world experiments. Code is available at https://github.com/Kisaragi0/SC-VLA.
Comment: Matches criterion 2: Vision-Language-Action (VLA) work using imagination and online action refinement for robot manipulation (VLM-style methods applied to robotics).
Relevance: 9 Novelty: 7 Back to [topic] [top]
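As a rough illustration of the progress-dependent reward reshaping described in the SC-VLA abstract above, the sketch below rewards predicted task-progress gains plus alignment of the executed action with an imagined future-trajectory direction. The function names, the cosine-alignment term, and the weighting are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def shaped_reward(progress_prev, progress_curr, action_dir, predicted_dir, w=0.5):
    """Toy progress-dependent reward: credit gains in predicted task progress,
    plus alignment of the executed action direction with a predicted
    future-trajectory trend. All names and weights are illustrative."""
    progress_gain = progress_curr - progress_prev
    # cosine alignment between executed and imagined trajectory directions
    align = float(np.dot(action_dir, predicted_dir) /
                  (np.linalg.norm(action_dir) * np.linalg.norm(predicted_dir) + 1e-8))
    return progress_gain + w * align
```

An action that makes progress while following the imagined trajectory scores highest; one that regresses or fights the predicted trend is penalized.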
ArXiv: 2602.21736 [page] [pdf]
Authors: Hao Luo, Ye Wang, Wanpeng Zhang, Haoqi Yuan, Yicheng Feng, Haiweng Xu, Sipeng Zheng, Zongqing Lu
Abstract: arXiv:2602.21736v1 Announce Type: new Abstract: Despite progress, Vision-Language-Action models (VLAs) are limited by a scarcity of large-scale, diverse robot data. While human manipulation videos offer a rich alternative, existing methods are forced to choose between small, precisely-labeled datasets and vast in-the-wild footage with unreliable hand tracking labels. We present JALA, a pretraining framework that learns Jointly-Aligned Latent Actions. JALA bypasses full visual-dynamics reconstruction and instead learns a predictive action embedding aligned with both inverse dynamics and real actions. This yields a transition-aware, behavior-centric latent space for learning from heterogeneous human data. We scale this approach with UniHand-Mix, a 7.5M video corpus (>2,000 hours) blending laboratory and in-the-wild footage. Experiments demonstrate that JALA generates more realistic hand motions in both controlled and unconstrained scenarios, significantly improving downstream robot manipulation performance in both simulation and real-world tasks. These results indicate that jointly-aligned latent actions offer a scalable pathway for VLA pretraining from human data.
Comment: Matches criterion 2: uses Vision-Language-Action (VLA/VLM-style) pretraining from large-scale human manipulation video for robot manipulation.
Relevance: 9 Novelty: 7 Back to [topic] [top]
ArXiv: 2602.21992 [page] [pdf]
Authors: Zekai Lin, Xu Zheng
Abstract: arXiv:2602.21992v1 Announce Type: new Abstract: 360 panoramic images are increasingly used in virtual reality, autonomous driving, and robotics for holistic scene understanding. However, current Vision-Language Models (VLMs) struggle with 3D spatial reasoning on Equirectangular Projection (ERP) images due to geometric distortion and limited 3D supervision. We introduce PanoEnv, a large-scale VQA benchmark built from synthetic 3D environments, containing 14.8K questions across five categories (e.g., relative position, volume comparison) grounded in accurate 3D annotations including depth, segmentation, and bounding boxes. Benchmarking 14 state-of-the-art VLMs reveals limited 3D understanding, with models achieving only 49.34% overall accuracy and 8.36% on open-ended (OE) questions. To enhance 3D reasoning, we propose a reinforcement learning post-training framework based on Group Relative Policy Optimization (GRPO) with a ground-truth-guided reward that incorporates five geometry-aware strategies such as distance tolerance and spatial consistency. A two-stage curriculum further mitigates catastrophic forgetting: Stage 1 trains on structured tasks (true/false and multiple choice), and Stage 2 fine-tunes on mixed open-ended data to improve generalization. Our 7B model achieves new state-of-the-art performance, improving overall accuracy to 52.93% (+3.59%) and open-ended accuracy to 14.83% while maintaining structured-task performance. It also achieves top semantic evaluation scores (Q-Score 6.24, P-Score 5.95), surpassing 32B models. These results demonstrate that PanoEnv-QA and our curriculum-based RL framework effectively instill 3D spatial intelligence in VLMs for omnidirectional perception.
Comment: Matches criterion 2. Implements a reinforcement-learning post-training framework to imbue vision-language models with 3D spatial intelligence in panoramic (ERP) images (Group Relative Policy Optimization + geometry-aware rewards and curriculum).
Relevance: 9 Novelty: 6 Back to [topic] [top]
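One of the geometry-aware reward strategies named in the PanoEnv abstract above is distance tolerance. A plausible toy form, assuming nothing beyond the strategy's name, is full reward inside a relative tolerance band around the ground-truth value with a linear falloff outside it; the band width and decay rate here are illustrative constants, not the paper's.

```python
def distance_reward(pred, gt, tol=0.10):
    """Toy distance-tolerance reward for RL post-training on metric answers:
    full credit when the prediction is within a relative tolerance band of
    the ground truth, linearly decaying credit outside it.
    The 10% band and linear decay are assumptions for illustration."""
    rel_err = abs(pred - gt) / max(abs(gt), 1e-8)
    if rel_err <= tol:
        return 1.0
    return max(0.0, 1.0 - (rel_err - tol))
```

A soft band like this avoids the all-or-nothing signal an exact-match reward would give on continuous 3D quantities such as distances or volumes.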
ArXiv: 2602.21445 [page] [pdf]
Authors: Haoxuan Wang, Gengyu Zhang, Yan Yan, Ramana Rao Kompella, Gaowen Liu
Abstract: arXiv:2602.21445v1 Announce Type: new Abstract: Action chunking has recently emerged as a standard practice in flow-based Vision-Language-Action (VLA) models. However, the effect and choice of the execution horizon - the number of actions to be executed from each predicted chunk - remains underexplored. In this work, we first show that varying the execution horizon leads to substantial performance deviations, with performance initially improving and then declining as the horizon increases. To uncover the reasons, we analyze the cross- and self-attention weights in flow-based VLAs and reveal two key phenomena: (i) intra-chunk actions attend invariantly to vision-language tokens, limiting adaptability to environmental changes; and (ii) the initial and terminal action tokens serve as stable anchors, forming latent centers around which intermediate actions are organized. Motivated by these insights, we interpret action self-attention weights as a proxy for the model's predictive limit and propose AutoHorizon, the first test-time method that dynamically estimates the execution horizon for each predicted action chunk to adapt to changing perceptual conditions. Across simulated and real-world robotic manipulation tasks, AutoHorizon is performant, incurs negligible computational overhead, and generalizes across diverse tasks and flow-based models.
Comment: Matches criterion 2: applies vision-language-action (VLA) models to robotic manipulation and introduces AutoHorizon, a test-time method that dynamically estimates execution horizon for action chunks in flow-based VLAs.
Relevance: 9 Novelty: 6 Back to [topic] [top]
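The AutoHorizon abstract above treats action self-attention as a proxy for the model's predictive limit, with the initial and terminal action tokens acting as anchors. A minimal sketch of that idea, under the assumption (ours, not the paper's) that a step remains trustworthy while it still attends strongly to the initial anchor token:

```python
import numpy as np

def select_horizon(attn, anchor_weight_thresh=0.2, max_h=None):
    """Toy attention-proxy horizon selection: execute chunk steps while each
    intermediate action token still places substantial attention weight on
    the initial anchor token; stop once that weight decays below a threshold.
    The anchor choice and threshold are illustrative assumptions."""
    # attn: (chunk_len, chunk_len) self-attention over action tokens
    chunk_len = attn.shape[0]
    max_h = max_h or chunk_len
    h = 1
    for t in range(1, max_h):
        if attn[t, 0] < anchor_weight_thresh:  # weight on the initial anchor
            break
        h = t + 1
    return h
```

At test time one would recompute this per predicted chunk, executing only the first `h` actions before replanning.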
ArXiv: 2602.21670 [page] [pdf]
Authors: Tomoya Kawabe (NEC Corporation), Rin Takano (NEC Corporation)
Abstract: arXiv:2602.21670v1 Announce Type: new Abstract: Multi-robot task planning requires decomposing natural-language instructions into executable actions for heterogeneous robot teams. Conventional Planning Domain Definition Language (PDDL) planners provide rigorous guarantees but struggle to handle ambiguous or long-horizon missions, while large language models (LLMs) can interpret instructions and propose plans but may hallucinate or produce infeasible actions. We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner. When plans fail, the system applies TextGrad-inspired textual-gradient updates to optimize each agent's prompt and thereby improve planning accuracy. In addition, meta-prompts are learned and shared across agents within the same layer, enabling efficient prompt optimization in multi-agent settings. On the MAT-THOR benchmark, our planner achieves success rates of 0.95 on compound tasks, 0.84 on complex tasks, and 0.60 on vague tasks, improving over the previous state-of-the-art LaMMA-P by 2, 7, and 15 percentage points respectively. An ablation study shows that the hierarchical structure, prompt optimization, and meta-prompt sharing contribute roughly +59, +37, and +4 percentage points to the overall success rate.
Comment: Direct match to criterion 2: LLM-based multi-robot task planning. Introduces a hierarchical multi-agent LLM planner with prompt optimization and meta-prompt sharing integrated with classical PDDL solvers for multi-robot task planning (applies LLMs to robotics planning).
Relevance: 8 Novelty: 7 Back to [topic] [top]
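The TextGrad-inspired prompt optimization loop described in the abstract above can be sketched as two LLM calls: one produces a textual critique of the failed plan (the "textual gradient"), and a second rewrites the agent's prompt against that critique. Both callables below are stand-ins for real LLM APIs; the prompt wording is purely illustrative.

```python
def refine_prompt(prompt, failure_log, critique_llm, rewrite_llm):
    """Sketch of a textual-gradient update for a planning agent's prompt.
    critique_llm and rewrite_llm are hypothetical text-in/text-out callables."""
    # "Textual gradient": an LLM explains why the prompt led to a failed plan
    gradient = critique_llm(
        f"The plan produced by this prompt failed:\n{prompt}\n"
        f"Failure log:\n{failure_log}\n"
        f"Explain what to change in the prompt.")
    # Apply the gradient: rewrite the prompt to address the critique
    return rewrite_llm(
        f"Rewrite the prompt below to address the critique.\n"
        f"Prompt:\n{prompt}\nCritique:\n{gradient}")
```

In a multi-agent setting, the meta-prompt sharing the paper describes would amount to reusing the learned rewrite instructions across agents in the same layer.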
ArXiv: 2602.21952 [page] [pdf]
Authors: Lingjun Zhang, Yujian Yuan, Changjie Wu, Xinyuan Chang, Xin Cai, Shuang Zeng, Linzhe Shi, Sijin Wang, Hang Zhang, Mu Xu
Abstract: arXiv:2602.21952v1 Announce Type: new Abstract: Vision-Language Models (VLMs) exhibit strong reasoning capabilities, showing promise for end-to-end autonomous driving systems. Chain-of-Thought (CoT), VLMs' widely used reasoning strategy, faces critical challenges. Existing textual CoT has a large gap between text semantic space and trajectory physical space. Although a recent approach uses future images in place of text for the CoT process, it lacks clear planning-oriented objective guidance to generate images with accurate scene evolution. To address these, we propose MindDriver, a progressive multimodal reasoning framework that enables VLMs to imitate human-like progressive thinking for autonomous driving. MindDriver proceeds through semantic understanding, semantic-to-physical-space imagination, and physical-space trajectory planning. To achieve aligned reasoning processes in MindDriver, we develop a feedback-guided automatic data annotation pipeline to generate aligned multimodal reasoning training data. Furthermore, we develop a progressive reinforcement fine-tuning method to optimize the alignment through progressive high-level reward-based learning. MindDriver demonstrates superior performance in both nuScenes open-loop and Bench2Drive closed-loop evaluation. Code is available at https://github.com/hotdogcheesewhite/MindDriver.
Comment: Matches criterion 2: applies vision-language models to robotics/autonomous driving with a progressive multimodal reasoning pipeline and reinforcement fine-tuning for trajectory planning.
Relevance: 9 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.21983 [page] [pdf]
Authors: Jingchao Wei, Jingkai Qin, Yuxiao Cao, Jingcheng Huang, Xiangrui Zeng, Min Li, Zhouping Yin
Abstract: arXiv:2602.21983v1 Announce Type: new Abstract: Leveraging auditory and visual feedback for attention reorientation is essential for natural gaze shifts in social interaction. However, enabling humanoid robots to perform natural and context-appropriate gaze shifts in unconstrained human-robot interaction (HRI) remains challenging, as it requires the coupling of cognitive attention mechanisms and biomimetic motion generation. In this work, we propose the Robot Gaze-Shift (RGS) framework, which integrates these two components into a unified pipeline. First, RGS employs a vision-language model (VLM)-based gaze reasoning pipeline to infer context-appropriate gaze targets from multimodal interaction cues, ensuring consistency with human gaze-orienting regularities. Second, RGS introduces a conditional Vector Quantized-Variational Autoencoder (VQ-VAE) model for eye-head coordinated gaze-shift motion generation, producing diverse and human-like gaze-shift behaviors. Experiments validate that RGS effectively replicates human-like target selection and generates realistic, diverse gaze-shift motions.
Comment: Matches criterion 2 (VLMs in robotics): uses a vision-language model for gaze-target reasoning in humanoid robots and a conditional VQ-VAE for generating coordinated eye-head gaze-shift motions for human-like behavior.
Relevance: 8 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.22001 [page] [pdf]
Authors: Freek Stulp, Samuel Bustamante, João Silvério, Alin Albu-Schäffer, Jeannette Bohg, Shuran Song
Abstract: arXiv:2602.22001v1 Announce Type: new Abstract: In humans and robots alike, transfer learning occurs at different levels of abstraction, from high-level linguistic transfer to low-level transfer of motor skills. In this article, we provide an overview of the impact that foundation models and transformer networks have had on these different levels, bringing robots closer than ever to "full-stack transfer". Considering LLMs, VLMs and VLAs from a robotic transfer learning perspective allows us to highlight recurring concepts for transfer, beyond specific implementations. We also consider the challenges of data collection and transfer benchmarks for robotics in the age of foundation models. Are foundation models the route to full-stack transfer in robotics? Our expectation is that they will certainly stay on this route as a key technology.
Comment: Partial match to criterion 2. Survey/perspective on foundation models (LLMs, VLMs, VLAs) and their role in transfer for robotics; discusses VLMs in robotic transfer but is a high-level overview rather than a new application or method in RL/imitation learning.
Relevance: 7 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.21625 [page] [pdf]
Authors: Lei Su, Zhijie Peng, Renyuan Ren, Shengping Mao, Juan Du, Kaifeng Zhang, Xuezhou Zhu
Abstract: arXiv:2602.21625v1 Announce Type: new Abstract: Vision-Based Tactile Sensors (VBTS) are essential for achieving dexterous robotic manipulation, yet the tactile sim-to-real gap remains a fundamental bottleneck. Current tactile simulations suffer from a persistent dilemma: simplified geometric projections lack physical authenticity, while high-fidelity Finite Element Methods (FEM) are too computationally prohibitive for large-scale reinforcement learning. In this work, we present Tacmap, a high-fidelity, computationally efficient tactile simulation framework anchored in volumetric penetration depth. Our key insight is to bridge the tactile sim-to-real gap by unifying both domains through a shared deform map representation. Specifically, we compute 3D intersection volumes as depth maps in simulation, while in the real world, we employ an automated data-collection rig to learn a robust mapping from raw tactile images to ground-truth depth maps. By aligning simulation and real-world in this unified geometric space, Tacmap minimizes domain shift while maintaining physical consistency. Quantitative evaluations across diverse contact scenarios demonstrate that Tacmap's deform maps closely mirror real-world measurements. Moreover, we validate the utility of Tacmap through an in-hand rotation task, where a policy trained exclusively in simulation achieves zero-shot transfer to a physical robot.
Comment: Matches none of the requested criteria exactly. Tacmap is a tactile sim-to-real method for robotics (tactile depth maps and zero-shot sim-to-real), relevant to robotics but not to VLMs (criterion 2) or the specified multimodal/SSL/video segmentation criteria.
Relevance: 4 Novelty: 6 Back to [topic] [top]
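The volumetric-penetration-depth idea at the core of the Tacmap abstract above reduces, in its simplest per-pixel form, to the positive part of how far the object surface dips below the undeformed gel surface. The sketch below is a stand-in for the paper's 3D intersection-volume rendering; array names and conventions are illustrative assumptions.

```python
import numpy as np

def penetration_depth_map(gel_surface, object_surface):
    """Toy deform map: per-pixel penetration depth, clamped at zero where
    the object does not touch the gel. Heights are measured along the
    sensor normal; gel_surface is the undeformed gel height field."""
    return np.maximum(gel_surface - object_surface, 0.0)
```

Both simulation and real images would then be mapped into this shared depth-map space, which is what lets the two domains be aligned.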
ArXiv: 2602.21441 [page] [pdf]
Authors: Shiwei Tan, Hengyi Wang, Weiyi Qin, Qi Xu, Zhigang Hua, Hao Wang
Abstract: arXiv:2602.21441v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) deliver detailed responses on vision-language tasks, yet remain susceptible to object hallucination (introducing objects not present in the image), undermining reliability in practice. Prior efforts often rely on heuristic penalties, post-hoc correction, or generic decoding tweaks, which do not directly intervene in the mechanisms that trigger object hallucination and thus yield limited gains. To address this challenge, we propose a causal decoding framework that applies targeted causal interventions during generation to curb spurious object mentions. By reshaping the decoding dynamics to attenuate spurious dependencies, our approach reduces false object tokens while maintaining descriptive quality. Across captioning and QA benchmarks, our framework substantially lowers object-hallucination rates and achieves state-of-the-art faithfulness without degrading overall output quality.
Comment: No criteria matched. This paper targets hallucination reduction in multimodal LLM decoding (vision-language faithfulness) but does not address VLMs applied to robotics (criterion 2) or the other listed CV/SSL/video criteria.
Relevance: 3 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.21704 [page] [pdf]
Authors: Jianghao Yin, Qin Chen, Kedi Chen, Jie Zhou, Xingjiao Wu, Liang He
Abstract: arXiv:2602.21704v1 Announce Type: new Abstract: Large Vision-Language Models (LVLMs) exhibit outstanding performance on vision-language tasks but struggle with hallucination problems. Through in-depth analysis of LVLM activation patterns, we reveal two key findings: 1) truthfulness and visual perception capabilities predominantly engage different subsets of attention heads within the model architecture; and 2) truthfulness steering vectors vary significantly across different semantic contexts. Based on these observations, we propose Dynamic Multimodal Activation Steering, a training-free approach for hallucination mitigation. Our method constructs a semantic-based truthfulness steering vector database and computes visual perception steering vectors, enabling context-aware interventions during inference by dynamically selecting the most relevant steering vectors based on input semantic similarity and applying them to the most influential attention heads. We conduct comprehensive experiments across multiple models and datasets, demonstrating that our approach significantly enhances model performance, outperforming existing state-of-the-art methods.
Comment: No matching criterion. Training-free activation-steering to mitigate LVLM hallucination — about vision-language model reliability but not applied to robotics (does not meet criterion 2).
Relevance: 3 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.21684 [page] [pdf]
Authors: Xiaohan Lei, Min Wang, Wengang Zhou, Xingyu Lu, Houqiang Li
Abstract: arXiv:2602.21684v1 Announce Type: new Abstract: Multi-modal distribution in robotic manipulation action sequences poses critical challenges for imitation learning. To this end, existing approaches often model the action space as either a discrete set of tokens or a continuous, latent-variable distribution. However, both approaches present trade-offs: some methods discretize actions into tokens and therefore lose fine-grained action variations, while others that generate continuous actions in a single stage tend to produce unstable mode transitions. To address these limitations, we propose Primary-Fine Decoupling for Action Generation (PF-DAG), a two-stage framework that decouples coarse action consistency from fine-grained variations. First, we compress action chunks into a small set of discrete modes, enabling a lightweight policy to select consistent coarse modes and avoid mode bouncing. Second, a mode-conditioned MeanFlow policy is learned to generate high-fidelity continuous actions. Theoretically, we prove PF-DAG's two-stage design achieves a strictly lower MSE bound than single-stage generative policies. Empirically, PF-DAG outperforms state-of-the-art baselines across 56 tasks from Adroit, DexArt, and MetaWorld benchmarks. It further generalizes to real-world tactile dexterous manipulation tasks. Our work demonstrates that explicit mode-level decoupling enables both robust multi-modal modeling and reactive closed-loop control for robotic manipulation.
Comment: No criteria matched: robotics imitation-learning contribution (Primary-Fine Decoupling) but does not involve vision-language models (criterion 2) or SSL/vision transfer; relevant to robotics generally but not the specified criteria.
Relevance: 3 Novelty: 6 Back to [topic] [top]
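The two-stage decoupling in the PF-DAG abstract above can be sketched as: a lightweight head picks a discrete coarse mode, then a mode-conditioned generator produces the continuous action. The callables and their signatures below are hypothetical stand-ins, not the paper's interfaces.

```python
import numpy as np

def two_stage_action(obs_feat, mode_logits_fn, mode_policies):
    """Sketch of a primary-fine decoupled policy.
    Stage 1: a lightweight head scores discrete coarse modes.
    Stage 2: a mode-conditioned generator emits the continuous action.
    mode_logits_fn and mode_policies are illustrative placeholders."""
    logits = mode_logits_fn(obs_feat)
    mode = int(np.argmax(logits))               # coarse, consistent mode choice
    action = mode_policies[mode](obs_feat)      # fine continuous generation
    return mode, action
```

Keeping the mode decision discrete is what prevents "mode bouncing": the continuous generator only ever refines within the chosen mode.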
ArXiv: 2602.22033 [page] [pdf]
Authors: Yanqiu Yu, Zhifan Jin, Sijia Chen, Tongfei Chu, En Yu, Liman Liu, Wenbing Tao
Abstract: arXiv:2602.22033v1 Announce Type: new Abstract: Referring Multi-Object Tracking has attracted increasing attention due to its human-friendly interactive characteristics, yet it exhibits limitations in low-visibility conditions, such as nighttime, smoke, and other challenging scenarios. To overcome this limitation, we propose a new RGB-Thermal RMOT task, named RT-RMOT, which aims to fuse RGB appearance features with the illumination robustness of the thermal modality to enable all-day referring multi-object tracking. To promote research on RT-RMOT, we construct the first Referring Multi-Object Tracking dataset under RGB-Thermal modality, named RefRT. It contains 388 language descriptions, 1,250 tracked targets, and 166,147 Language-RGB-Thermal (L-RGB-T) triplets. Furthermore, we propose RTrack, a framework built upon a multimodal large language model (MLLM) that integrates RGB, thermal, and textual features. Since the initial framework still leaves room for improvement, we introduce a Group Sequence Policy Optimization (GSPO) strategy to further exploit the model's potential. To alleviate training instability during RL fine-tuning, we introduce a Clipped Advantage Scaling (CAS) strategy to suppress gradient explosion. In addition, we design Structured Output Reward and Comprehensive Detection Reward to balance exploration and exploitation, thereby improving the completeness and accuracy of target perception. Extensive experiments on the RefRT dataset demonstrate the effectiveness of the proposed RTrack framework.
Comment: Partial/near match to criterion 2 — the work uses a multimodal large language model (MLLM) integrating RGB, thermal and text and applies RL-style fine-tuning (GSPO, CAS) for referring multi-object tracking; this leverages VLM/MLLM+RL ideas but is applied to tracking (not explicitly a robotics control or imitation task).
Relevance: 4 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.21320 [page] [pdf]
Authors: Emre Can Acikgoz, Cheng Qian, Jonas Hübotter, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur
Abstract: arXiv:2602.21320v1 Announce Type: new Abstract: Large language models (LLMs) are becoming the foundation for autonomous agents that can use tools to solve complex tasks. Reinforcement learning (RL) has emerged as a common approach for injecting such agentic capabilities, but typically under tightly controlled training setups. It often depends on carefully constructed task-solution pairs and substantial human supervision, which creates a fundamental obstacle to open-ended self-evolution toward superintelligent systems. In this paper, we propose the Tool-R0 framework for training general-purpose tool-calling agents from scratch with self-play RL, under a zero-data assumption. Initialized from the same base LLM, Tool-R0 co-evolves a Generator and a Solver with complementary rewards: one proposes targeted challenging tasks at the other's competence frontier and the other learns to solve them with real-world tool calls. This creates a self-evolving cycle that requires no pre-existing tasks or datasets. Evaluation on different tool-use benchmarks shows that Tool-R0 yields 92.5 relative improvement over the base model and surpasses fully supervised tool-calling baselines under the same setting. Our work further provides empirical insights into self-play LLM agents by analyzing co-evolution, curriculum dynamics, and scaling behavior.
Comment: No close match to the numbered criteria. Proposes self-evolving LLM agents for tool use via self-play RL under a zero-data assumption — interesting for autonomous agents and tool use but does not use vision-language models in robotics (criterion 2) or the other specified criteria.
Relevance: 3 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.21302 [page] [pdf]
Authors: Krishna Suresh, Chris Atkeson
Abstract: arXiv:2602.21302v1 Announce Type: new Abstract: Dynamic manipulation of deformable objects is challenging for humans and robots because they have infinite degrees of freedom and exhibit underactuated dynamics. We introduce a Task-Level Iterative Learning Control method for dynamic manipulation of deformable objects. We demonstrate this method on a non-planar rope manipulation task called the flying knot. Using a single human demonstration and a simplified rope model, the method learns directly on hardware without reliance on large amounts of demonstration data or massive amounts of simulation. At each iteration, the algorithm constructs a local inverse model of the robot and rope by solving a quadratic program to propagate task-space errors into action updates. We evaluate performance across 7 different kinds of ropes, including chain, latex surgical tubing, and braided and twisted ropes, ranging in thicknesses of 7--25mm and densities of 0.013--0.5 kg/m. Learning achieves a 100% success rate within 10 trials on all ropes. Furthermore, the method can successfully transfer between most rope types in approximately 2--5 trials. https://flying-knots.github.io
Comment: Closest to your interests only as a robotics manipulation paper (iterative learning control for deformable object manipulation). It does not use self-supervised learning or language/vision models, so it does not match the specific criteria (1 or 2).
Relevance: 3 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.21497 [page] [pdf]
Authors: Yongchang Zhang, Xianzheng Ma, Tianyi Liu, Guangquan Zhou, Yang Chen
Abstract: arXiv:2602.21497v1 Announce Type: new Abstract: Recent large vision-language models (LVLMs) have demonstrated impressive reasoning ability by generating long chain-of-thought (CoT) responses. However, CoT reasoning in multimodal contexts is highly vulnerable to visual hallucination propagation: once an intermediate reasoning step becomes inconsistent with the visual evidence, subsequent steps, even if logically valid, can still lead to incorrect final answers. Existing solutions attempt to mitigate this issue by training models to "think with images" via reinforcement learning (RL). While effective, these methods are costly, model-specific, and difficult to generalize across architectures. Differently, we present a lightweight method that bypasses RL training and provides an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence, ensuring that every decoded token is justified by corresponding visual cues. Concretely, we construct a textual visual-evidence pool that guides the model's reasoning generation. When existing evidence is insufficient, a visual decider module dynamically extracts additional relevant evidence from the image based on the ongoing reasoning context, expanding the pool until the model achieves sufficient visual certainty to terminate reasoning and produce the final answer. Extensive experiments on multiple LVLM backbones and benchmarks demonstrate the effectiveness of our approach. Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench, substantially reducing hallucination rates while improving reasoning accuracy without additional training.
Comment: No match to the listed criteria — proposes a training-free iterative framework to reduce visual hallucination in LVLMs (multimodal reasoning), relevant to vision-language modeling but not to VLMs applied to robotics (criterion 2) or other listed criteria.
Relevance: 3 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.21366 [page] [pdf]
Authors: Y. Deemo Chen, Arion Zimmermann, Thomas A. Berrueta, Soon-Jo Chung
Abstract: arXiv:2602.21366v1 Announce Type: new Abstract: Ensuring accurate and stable state estimation is a challenging task crucial to safety-critical domains such as high-speed autonomous racing, where measurement uncertainty must be both adaptive to the environment and temporally smooth for control. In this work, we develop a learning-based framework, LACE, capable of directly modeling the temporal dynamics of GNSS measurement covariance. We model the covariance evolution as an exponentially stable dynamical system where a deep neural network (DNN) learns to predict the system's process noise from environmental features through an attention mechanism. By using contraction-based stability and systematically imposing spectral constraints, we formally provide guarantees of exponential stability and smoothness for the resulting covariance dynamics. We validate our approach on an AV-24 autonomous racecar, demonstrating improved localization performance and smoother covariance estimates in challenging, GNSS-degraded environments. Our results highlight the promise of dynamically modeling the perceived uncertainty in state estimation problems that are tightly coupled with control sensitivity.
Comment: No listed criterion matches closely. The work learns environment-aware GNSS covariance dynamics for autonomous racing with stability guarantees — interesting for robotics/state-estimation but not about VLMs (criterion 2), SSL, or the other listed topics.
Relevance: 3 Novelty: 6 Back to [topic] [top]
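The stability argument in the LACE abstract above rests on contraction: if the covariance recursion has a contraction factor strictly inside the unit interval, initial conditions are forgotten geometrically and any bounded network prediction yields bounded, smooth covariance estimates. A scalar first-order sketch of that property (our simplification, not the paper's DNN-parameterized dynamics):

```python
def covariance_step(R, q_pred, a=0.9):
    """Toy exponentially stable covariance recursion: R_{t+1} = a*R_t + (1-a)*q_pred.
    With 0 < a < 1 the map is a contraction, so iterates converge geometrically
    toward the (possibly time-varying) prediction q_pred. The scalar form and
    the value of a are illustrative assumptions."""
    return a * R + (1.0 - a) * q_pred
```

Iterating from a badly initialized covariance converges to the predicted level at rate a per step, which is the smoothness-with-stability behavior the paper's spectral constraints are designed to guarantee in the matrix-valued case.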
ArXiv: 2602.22088 [page] [pdf]
Authors: Hongjie Fang, Shirun Tang, Mingyu Mei, Haoxiang Qin, Zihao He, Jingjing Chen, Ying Feng, Chenxi Wang, Wanxi Liu, Zaixing He, Cewu Lu, Shiquan Wang
Abstract: arXiv:2602.22088v1 Announce Type: new Abstract: Contact-rich manipulation demands human-like integration of perception and force feedback: vision should guide task progress, while high-frequency interaction control must stabilize contact under uncertainty. Existing learning-based policies often entangle these roles in a monolithic network, trading off global generalization against stable local refinement, while control-centric approaches typically assume a known task structure or learn only controller parameters rather than the structure itself. In this paper, we formalize a physically grounded interaction frame, an instantaneous local basis that decouples force regulation from motion execution, and propose a method to recover it from demonstrations. Based on this, we address both issues by proposing Force Policy, a global-local vision-force policy in which a global policy guides free-space actions using vision, and upon contact, a high-frequency local policy with force feedback estimates the interaction frame and executes hybrid force-position control for stable interaction. Real-world experiments across diverse contact-rich tasks show consistent gains over strong baselines, with more robust contact establishment, more accurate force regulation, and reliable generalization to novel objects with varied geometries and physical properties, ultimately improving both contact stability and execution quality. Project page: https://force-policy.github.io/
Comment: Does not match any numbered criterion exactly. It is a robotics paper proposing a global-local vision+force policy for contact-rich manipulation (vision and force control), but it does not use vision-language models (criterion 2) nor introduce SSL/representation, video segmentation, or cross-modal transfer methods.
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.21622 [page] [pdf]
Authors: Enyi Wang, Wen Fan, Dandan Zhang
Abstract: arXiv:2602.21622v1 Announce Type: new Abstract: Multi-agent robotic manipulation remains challenging due to the combined demands of coordination, grasp stability, and collision avoidance in shared workspaces. To address these challenges, we propose the Adaptive Dynamic Modality Diffusion Policy (ADM-DP), a framework that integrates vision, tactile, and graph-based (multi-agent pose) modalities for coordinated control. ADM-DP introduces four key innovations. First, an enhanced visual encoder merges RGB and point-cloud features via Feature-wise Linear Modulation (FiLM) to enrich perception. Second, a tactile-guided grasping strategy uses Force-Sensitive Resistor (FSR) feedback to detect insufficient contact and trigger corrective grasp refinement, improving grasp stability. Third, a graph-based collision encoder leverages shared tool center point (TCP) positions of multiple agents as structured kinematic context to maintain spatial awareness and reduce inter-agent interference. Fourth, an Adaptive Modality Attention Mechanism (AMAM) dynamically re-weights modalities according to task context, enabling flexible fusion. For scalability and modularity, a decoupled training paradigm is employed in which agents learn independent policies while sharing spatial information. This maintains low interdependence between agents while retaining collective awareness. Across seven multi-agent tasks, ADM-DP achieves 12-25% performance gains over state-of-the-art baselines. Ablation studies show the greatest improvements in tasks requiring multiple sensory modalities, validating our adaptive fusion strategy and demonstrating its robustness for diverse manipulation scenarios.
Comment: No matching criterion. Multimodal vision–tactile–graph fusion for multi-agent manipulation (policy architecture and adaptive modality attention) — relevant to robotics generally but not to VLMs-in-robotics (criterion 2) or cross-modal video transfer (criterion 4).
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.21783 [page] [pdf]
Authors: Beatrice Luciani, Alex van den Berg, Matti Lang, Alexandre L. Ratschat, Laura Marchal-Crespo
Abstract: arXiv:2602.21783v1 Announce Type: new Abstract: Robotic systems can enhance the amount and repeatability of physically guided motor training. Yet their real-world adoption is limited, partly due to non-intuitive trainer/therapist-trainee/patient interactions. To address this gap, we present a haptic teleoperation system for trainers to remotely guide and monitor the movements of a trainee wearing an arm exoskeleton. The trainer can physically interact with the exoskeleton through a commercial handheld haptic device via virtual contact points at the exoskeleton's elbow and wrist, allowing intuitive guidance. Thirty-two participants tested the system in a trainer-trainee paradigm, comparing our haptic demonstration system with conventional visual demonstration in guiding trainees in executing arm poses. Quantitative analyses showed that haptic demonstration significantly reduced movement completion time and improved smoothness, while speech analysis using large language models for automated transcription and categorization of verbal commands revealed fewer verbal instructions. The haptic demonstration did not result in higher reported mental and physical effort by trainers compared to the visual demonstration, while trainers reported greater competence and trainees lower physical demand. These findings support the feasibility of our proposed interface for effective remote human-robot physical interaction. Future work should assess its usability and efficacy for clinical populations in restoring clinicians' sense of agency during robot-assisted therapy.
Comment: No match to criterion 2: it is a robotics/human–robot interaction paper (haptic teleoperation for therapist guidance) but does not use vision-language models for robotics as required by criterion 2.
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.21779 [page] [pdf]
Authors: Zheyuan Gu, Qingsong Zhao, Yusong Wang, Zhaohong Huang, Xinqi Li, Cheng Yuan, Jiaowei Shao, Chi Zhang, Xuelong Li
Abstract: arXiv:2602.21779v1 Announce Type: new Abstract: Current Vision-Language Models (VLMs) for deepfake detection excel at identifying spatial artifacts but overlook a critical dimension: temporal inconsistencies in video forgeries. Adapting VLMs to reason about these dynamic cues remains a distinct challenge. To bridge this gap, we propose Forensic Answer-Questioning (FAQ), a large-scale benchmark that formulates temporal deepfake analysis as a multiple-choice task. FAQ introduces a three-level hierarchy to progressively evaluate and equip VLMs with forensic capabilities: (1) Facial Perception, testing the ability to identify static visual artifacts; (2) Temporal Deepfake Grounding, requiring the localization of dynamic forgery artifacts across frames; and (3) Forensic Reasoning, challenging models to synthesize evidence for final authenticity verdicts. We evaluate a range of VLMs on FAQ and generate a corresponding instruction-tuning set, FAQ-IT. Extensive experiments show that models fine-tuned on FAQ-IT achieve advanced performance on both in-domain and cross-dataset detection benchmarks. Ablation studies further validate the impact of our key design choices, confirming that FAQ is the driving force behind the temporal reasoning capabilities of these VLMs.
Comment: Matches none of the requested criteria. Work on VLM temporal reasoning for video deepfake forensics — relevant to Vision-Language Models broadly but not to VLMs-in-robotics (criterion 2) or the other listed research foci.
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.22144 [page] [pdf]
Authors: Lingfeng Ren, Weihao Yu, Runpeng Yu, Xinchao Wang
Abstract: arXiv:2602.22144v1 Announce Type: new Abstract: Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image. A natural question arises from this phenomenon: Which component of the LVLM pipeline primarily contributes to object hallucinations? The vision encoder to perceive visual information, or the language decoder to generate text responses? In this work, we strive to answer this question through designing a systematic experiment to analyze the roles of the vision encoder and the language decoder in hallucination generation. Our observations reveal that object hallucinations are predominantly associated with the strong priors from the language decoder. Based on this finding, we propose a simple and training-free framework, No-Language-Hallucination Decoding, NoLan, which refines the output distribution by dynamically suppressing language priors, modulated based on the output distribution difference between multimodal and text-only inputs. Experimental results demonstrate that NoLan effectively reduces object hallucinations across various LVLMs on different tasks. For instance, NoLan achieves substantial improvements on POPE, enhancing the accuracy of LLaVA-1.5 7B and Qwen-VL 7B by up to 6.45 and 7.21, respectively. The code is publicly available at: https://github.com/lingfengren/NoLan.
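The core contrast NoLan draws between the multimodal and text-only output distributions can be sketched as follows. The function name, the fixed `alpha` weight, and the toy logits are illustrative assumptions; the paper modulates the suppression dynamically rather than with a constant:

```python
import math

def prior_suppressed_decode(multimodal_logits, text_only_logits, alpha=1.0):
    """Sketch: push probability mass away from tokens favored mainly by
    the language prior, by contrasting the two logit vectors."""
    adjusted = [(1 + alpha) * m - alpha * t
                for m, t in zip(multimodal_logits, text_only_logits)]
    # Numerically stable softmax over the adjusted logits.
    mx = max(adjusted)
    exps = [math.exp(a - mx) for a in adjusted]
    s = sum(exps)
    return [e / s for e in exps]

# Toy example: token 2 scores high in both views, i.e. the language
# prior alone already predicts it; the contrast demotes it.
mm = [2.0, 1.0, 3.0]
txt = [0.0, 0.0, 3.0]
p = prior_suppressed_decode(mm, txt, alpha=1.0)
```

With these toy numbers the adjusted logits become [4, 2, 3], so the visually grounded token 0 wins over the prior-driven token 2.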
Comment: No matching criteria — addresses object hallucinations in large vision-language models (dynamic suppression of language priors). Related to VLM robustness but not to criterion 2 (VLMs applied to robotics) or the other listed criteria.
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.21723 [page] [pdf]
Authors: Yutang Lin, Jieming Cui, Yixuan Li, Baoxiong Jia, Yixin Zhu, Siyuan Huang
Abstract: arXiv:2602.21723v1 Announce Type: new Abstract: Humanoid robots that autonomously interact with physical environments over extended horizons represent a central goal of embodied intelligence. Existing approaches rely on reference motions or task-specific rewards, tightly coupling policies to particular object geometries and precluding multi-skill generalization within a single framework. A unified interaction representation enabling reference-free inference, geometric generalization, and long-horizon skill composition within one policy remains an open challenge. Here we show that Distance Field (DF) provides such a representation: LessMimic conditions a single whole-body policy on DF-derived geometric cues--surface distances, gradients, and velocity decompositions--removing the need for motion references, with interaction latents encoded via a Variational Auto-Encoder (VAE) and post-trained using Adversarial Interaction Priors (AIP) under Reinforcement Learning (RL). Through DAgger-style distillation that aligns DF latents with egocentric depth features, LessMimic further transfers seamlessly to vision-only deployment without motion capture (MoCap) infrastructure. A single LessMimic policy achieves 80--100% success across object scales from 0.4x to 1.6x on PickUp and SitStand where baselines degrade sharply, attains 62.1% success on 5 task instances trajectories, and remains viable up to 40 sequentially composed tasks. By grounding interaction in local geometry rather than demonstrations, LessMimic offers a scalable path toward humanoid robots that generalize, compose skills, and recover from failures in unstructured environments.
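The distance-field cues LessMimic conditions on (surface distance and gradient) can be illustrated on an analytic shape. The sphere, its parameters, and the function name are assumptions for illustration, not the paper's scene geometry:

```python
import math

def sphere_distance_field(p, center=(0.0, 0.0, 0.0), radius=0.5):
    """Toy distance field for a sphere: returns the signed distance to
    the surface and the (outward-pointing) unit gradient at point p."""
    d = [pi - ci for pi, ci in zip(p, center)]
    n = math.sqrt(sum(x * x for x in d))
    dist = n - radius          # negative inside, zero on the surface
    grad = [x / n for x in d]  # direction of steepest distance increase
    return dist, grad

dist, grad = sphere_distance_field((1.0, 0.0, 0.0))
```

A policy conditioned on such cues knows how far a body part is from the object surface and in which direction the surface lies, without needing a reference motion.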
Comment: No close match to the numbered criteria. This is a robotics/long-horizon humanoid control paper (distance-field conditioned policy, VAE latents, RL + DAgger distillation) — good for general robotics interest but it does not use vision-language models (criterion 2) or SSL/video/transfer criteria.
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.21816 [page] [pdf]
Authors: Zhaowei Liang, Song Wang, Zhao Jin, Shirui Wu, Dan Wu
Abstract: arXiv:2602.21816v1 Announce Type: new Abstract: Precise shape control of Deformable Linear Objects (DLOs) is crucial in robotic applications such as industrial and medical fields. However, existing methods face challenges in handling complex large deformation tasks, especially those involving opposite curvatures, and lack efficiency and precision. To address this, we propose a two-stage framework combining Reinforcement Learning (RL) and online visual servoing. In the large-deformation stage, a model-based reinforcement learning approach using an ensemble of dynamics models is introduced to significantly improve sample efficiency. Additionally, we design a self-curriculum goal generation mechanism that dynamically selects intermediate-difficulty goals with high diversity through imagined evaluations, thereby optimizing the policy learning process. In the small-deformation stage, a Jacobian-based visual servo controller is deployed to ensure high-precision convergence. Simulation results show that the proposed method enables efficient policy learning and significantly outperforms mainstream baselines in shape control success rate and precision. Furthermore, the framework effectively transfers the policy trained in simulation to real-world tasks with zero-shot adaptation. It successfully completes all 30 cases with diverse initial and target shapes across DLOs of different sizes and materials. The project website is available at: https://anonymous.4open.science/w/sc-mbrl-dlo-EB48/
Comment: No listed criterion matches closely. This is a model-based RL + self-curriculum approach for deformable linear object shape control — relevant to robotics generally but it does not use vision-language models (criterion 2) nor propose SSL/representation advances (criterion 1).
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.22056 [page] [pdf]
Authors: Edgar Welte, Yitian Shi, Rosa Wolf, Maximillian Gilles, Rania Rayyes
Abstract: arXiv:2602.22056v1 Announce Type: new Abstract: Generative manipulation policies can fail catastrophically under deployment-time distribution shift, yet many failures are near-misses: the robot reaches almost-correct poses and would succeed with a small corrective motion. We present FlowCorrect, a deployment-time correction framework that converts near-miss failures into successes using sparse human nudges, without full policy retraining. During execution, a human provides brief corrective pose nudges via a lightweight VR interface. FlowCorrect uses these sparse corrections to locally adapt the policy, improving actions without retraining the backbone while preserving the model performance on previously learned scenarios. We evaluate on a real-world robot across three tabletop tasks: pick-and-place, pouring, and cup uprighting. With a low correction budget, FlowCorrect improves success on hard cases by 85% while preserving performance on previously solved scenarios. The results demonstrate clearly that FlowCorrect learns only with very few demonstrations and enables fast and sample-efficient incremental, human-in-the-loop corrections of generative visuomotor policies at deployment time in real-world robotics.
Comment: No specific criterion matched. The paper is a robotics human-in-the-loop correction method but does not use vision-language models (criterion 2) nor focus on SSL/representation learning (criterion 1).
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2602.21655 [page] [pdf]
Authors: Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou, Anxiang Zeng, Jianqiang Huang
Abstract: arXiv:2602.21655v1 Announce Type: new Abstract: Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references. Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits caption models. We argue that caption quality should be assessed by two objective aspects: completeness (does the caption cover all salient visual facts?) and correctness (are the descriptions true with respect to the image?). To this end, we introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus that explicitly optimizes these properties to generate Complete and Correct Captions. For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries, with a dynamic query sampling strategy to improve training efficiency. For correctness, we penalize captions that contain hallucinations by validating the authenticity of sub-caption queries, which are derived from the caption decomposition. Our symmetric dual-reward optimization jointly maximizes completeness and correctness, guiding models toward captions that better satisfy these objective criteria. Extensive experiments across standard captioning benchmarks show consistent improvements, offering a principled path to training caption models beyond human-annotation imitation.
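A minimal sketch of how the two reward signals might be combined, assuming a simple weighted sum; the weight `w` and the functional form are illustrative, not CCCaption's actual objective:

```python
def dual_reward(answered, total_queries, verified, total_claims, w=0.5):
    """Combine the two signals from the abstract: completeness (fraction
    of image-derived visual queries the caption answers) and correctness
    (fraction of caption-derived claims that verify against the image)."""
    completeness = answered / total_queries
    correctness = verified / total_claims
    return w * completeness + (1 - w) * correctness

# A caption answering 3 of 4 queries with all 2 of its claims verified.
r = dual_reward(answered=3, total_queries=4, verified=2, total_claims=2)
```

The symmetry of the two terms matters: rewarding only completeness encourages verbose, hallucination-prone captions, while rewarding only correctness encourages terse, safe ones.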
Comment: No close match to the requested criteria. Paper is about image captioning using LVLMs + dual-reward RL to optimize completeness and correctness (not VLMs-in-robotics (criterion 2), SSL, video segmentation, or cross-modal transfer).
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2602.21595 [page] [pdf]
Authors: Hyungmin Kim, Hobeom Jeon, Dohyung Kim, Minsu Jang, Jeahong Kim
Abstract: arXiv:2602.21595v1 Announce Type: new Abstract: Embodied Task Planning with large language models faces safety challenges in real-world environments, where partial observability and physical constraints must be respected. Existing benchmarks often overlook these critical factors, limiting their ability to evaluate both feasibility and safety. We introduce SPOC, a benchmark for safety-aware embodied task planning, which integrates strict partial observability, physical constraints, step-by-step planning, and goal-condition-based evaluation. Covering diverse household hazards such as fire, fluid, injury, object damage, and pollution, SPOC enables rigorous assessment through both state and constraint-based online metrics. Experiments with state-of-the-art LLMs reveal that current models struggle to ensure safety-aware planning, particularly under implicit constraints. Code and dataset are available at https://github.com/khm159/SPOC
Comment: No close match to the numbered criteria. Introduces SPOC, a safety-aware embodied planning benchmark emphasizing partial observability and physical constraints — relevant to embodied robotics and planning evaluation but it does not use VLMs (criterion 2) or the other listed research directions.
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2602.22100 [page] [pdf]
Authors: Andreas Kernbach, Daniel Bargmann, Werner Kraus, Marco F. Huber
Abstract: arXiv:2602.22100v1 Announce Type: new Abstract: Automating the assembly of wire harnesses is challenging in automotive, electrical cabinet, and aircraft production, particularly due to deformable cables and a high variance in connector geometries. In addition, connectors must be inserted with limited force to avoid damage, while their poses can vary significantly. While humans can do this task intuitively by combining visual and haptic feedback, programming an industrial robot for such a task in an adaptable manner remains difficult. This work presents an empirical study investigating the suitability of behavioral cloning for learning an action prediction model for connector insertion that fuses force-torque sensing with a fixed position camera. We compare several network architectures and other design choices using a dataset of up to 300 successful human demonstrations collected via teleoperation of a UR5e robot with a SpaceMouse under varying connector poses. The resulting system is then evaluated against five different connector geometries under varying connector poses, achieving an overall insertion success rate of over 90 %.
Comment: Related to robotics and imitation learning (behavioral cloning for connector assembly) but does not use vision-language models or new SSL methods, so it matches neither criterion 1 nor criterion 2. It is an application-focused empirical study of a specific assembly domain.
Relevance: 3 Novelty: 3 Back to [topic] [top]
ArXiv: 2602.21406 [page] [pdf]
Authors: Asim Unmesh, Kaki Ramesh, Mayank Patel, Rahul Jain, Karthik Ramani
Abstract: arXiv:2602.21406v1 Announce Type: new Abstract: Temporal Action Segmentation (TAS) requires dividing videos into action segments, yet the vast space of activities and alternative breakdowns makes collecting comprehensive datasets infeasible. Existing methods remain limited to closed vocabularies and fixed label sets. In this work, we explore the largely unexplored problem of Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) by leveraging the strong zero-shot capabilities of Vision-Language Models (VLMs). We introduce a training-free pipeline that follows a segmentation-by-classification design: Frame-Action Embedding Similarity (FAES) matches video frames to candidate action labels, and Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency. Beyond proposing OVTAS, we present a systematic study across 14 diverse VLMs, providing the first broad analysis of their suitability for open-vocabulary action segmentation. Experiments on standard benchmarks show that OVTAS achieves strong results without task-specific supervision, underscoring the potential of VLMs for structured temporal understanding.
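The segmentation-by-classification idea can be sketched with plain dot-product similarities: assign each frame its best-matching label, then merge consecutive runs into segments. The function and the run-merging step are illustrative stand-ins for the paper's FAES + SMTS pipeline, which operates on VLM embeddings and a full similarity matrix:

```python
def segment_by_classification(frame_embs, label_embs):
    """Match each frame embedding to the most similar candidate action
    label, then collapse identical consecutive labels into
    (label, start_frame, end_frame) segments."""
    def sim(a, b):
        return sum(x * y for x, y in zip(a, b))

    labels = [max(range(len(label_embs)), key=lambda k: sim(f, label_embs[k]))
              for f in frame_embs]
    segments = []
    for i, lab in enumerate(labels):
        if segments and segments[-1][0] == lab:
            segments[-1] = (lab, segments[-1][1], i)  # extend current run
        else:
            segments.append((lab, i, i))              # start a new run
    return segments

# Toy 2-D embeddings: two frames of action 0, then one frame of action 1.
segs = segment_by_classification(
    frame_embs=[[1, 0], [1, 0], [0, 1]],
    label_embs=[[1, 0], [0, 1]],
)
```

In the real pipeline the temporal-consistency step (SMTS) also suppresses spurious single-frame label flips, which naive per-frame argmax cannot.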
Comment: Matches criterion 3: tackles open‑vocabulary, zero‑shot temporal action segmentation using Vision‑Language Models in a training‑free, segmentation‑by‑classification pipeline (FAES + SMTS) — a VLM‑based approach to video segmentation.
Relevance: 9 Novelty: 7 Back to [topic] [top]
ArXiv: 2602.21810 [page] [pdf]
Authors: Xiankang He, Peile Lin, Ying Cui, Dongyan Guo, Chunhua Shen, Xiaoqin Zhang
Abstract: arXiv:2602.21810v1 Announce Type: new Abstract: Motion segmentation in dynamic scenes is highly challenging, as conventional methods heavily rely on estimating camera poses and point correspondences from inherently noisy motion cues. Existing statistical inference or iterative optimization techniques that struggle to mitigate the cumulative errors in multi-stage pipelines often lead to limited performance or high computational cost. In contrast, we propose a fully learning-based approach that directly infers moving objects from latent feature representations via attention mechanisms, thus enabling end-to-end feed-forward motion segmentation. Our key insight is to bypass explicit correspondence estimation and instead let the model learn to implicitly disentangle object and camera motion. Supported by recent advances in 4D scene geometry reconstruction (e.g., $\pi^3$), the proposed method leverages reliable camera poses and rich spatial-temporal priors, which ensure stable training and robust inference for the model. Extensive experiments demonstrate that by eliminating complex pre-processing and iterative refinement, our approach achieves state-of-the-art motion segmentation performance with high efficiency. The code is available at:https://github.com/zjutcvg/GeoMotion.
Comment: Related to criterion 3. A learning-based motion segmentation approach that leverages latent 4D geometry and attention to infer moving objects end-to-end (video/motion segmentation). Note: the abstract does not explicitly claim unsupervised or self-supervised training, but it is directly relevant to video segmentation research.
Relevance: 6 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.21466 [page] [pdf]
Authors: YuQing Xie, Ameya Daigavane, Mit Kotak, Tess Smidt
Abstract: arXiv:2602.21466v1 Announce Type: new Abstract: $E(3)$-equivariant neural networks have proven to be effective in a wide range of 3D modeling tasks. A fundamental operation of such networks is the tensor product, which allows interaction between different feature types. Because this operation scales poorly, there has been considerable work towards accelerating this interaction. However, recent work [xieprice] has pointed out that most speedups come from a reduction in expressivity rather than true algorithmic improvements on computing Clebsch-Gordan tensor products. A modification of the Gaunt tensor product [gaunt] can give a true asymptotic speedup but is incomplete and misses many interactions. In this work, we provide the first complete algorithm that provides true asymptotic benefits for Clebsch-Gordan tensor products. For full CGTP, our algorithm brings runtime complexity from the naive $O(L^6)$ to $O(L^4\log^2 L)$, close to the lower bound of $O(L^4)$. We first show how generalizing fast Fourier based convolution naturally leads to the previously proposed Gaunt tensor product [gaunt]. To remedy antisymmetry issues, we generalize from scalar signals to irrep valued signals, giving us tensor spherical harmonics. We prove a generalized Gaunt formula for the tensor harmonics. Finally, we show that we only need up to vector valued signals to recover the missing interactions of Gaunt tensor product.
Comment: No direct match to the listed criteria (this is a theoretical/algorithmic speedup for Clebsch–Gordan tensor products in E(3)-equivariant networks; relevant to 3D modeling methods but not specifically about generative 3D, NeRF/GS, or the other requested topics).
Relevance: 3 Novelty: 8 Back to [topic] [top]
ArXiv: 2602.21835 [page] [pdf]
Authors: Jianhui Wei, Xiaotian Zhang, Yichen Li, Yuan Wang, Yan Zhang, Ziyi Chen, Zhihang Tang, Wei Xu, Zuozhu Liu
Abstract: arXiv:2602.21835v1 Announce Type: new Abstract: Video foundation models aim to integrate video understanding, generation, editing, and instruction following within a single framework, making them a central direction for next-generation multimodal systems. However, existing evaluation benchmarks remain fragmented and limited in scope, as they each target a single task, rely on task-specific metrics, and typically use short or simple video clips. As a result, they do not capture the unified capabilities that these models are designed to deliver. To address this gap, we introduce UniVBench, a benchmark purpose-built for evaluating video foundation models across four core abilities: video understanding, video generation, video editing, and a newly proposed task, video reconstruction, which assesses how faithfully a model can reproduce video content it has encountered. Our benchmark substantially expands the complexity of evaluation by incorporating 200 high-quality, diverse and multi-shot videos, each paired with detailed captions, multi-format editing instructions, and reference images. All videos are human-created and carefully validated, offering richer cinematic information than prior benchmarks. In addition, we develop a unified agentic evaluation system (UniV-Eval) that standardizes prompting, instruction parsing, and scoring across all tasks, enabling fair, scalable, and reproducible comparisons of unified video models. By grounding evaluation in instruction-based multi-shot video tasks, UniVBench provides the first framework for measuring the integrated capabilities that video foundation models aim to achieve. Extensive human annotations ensure our evaluation aligns with human judgment, enabling rigorous assessment and accelerating progress toward robust video intelligence.
Comment: No close match to any of the specified criteria (1-7). UniVBench is a benchmark/evaluation suite for video foundation models (understanding/generation/editing/reconstruction) rather than a new method in SSL, VLMs-for-robotics, unsupervised video segmentation, cross-modal transfer, diffusion model advances, or 3D generation.
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.21397 [page] [pdf]
Authors: Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Mahnoosh Alizadeh, Ramtin Pedarsani
Abstract: arXiv:2602.21397v1 Announce Type: new Abstract: Prompt learning has become a dominant paradigm for adapting vision-language models (VLMs) such as CLIP to downstream tasks without modifying pretrained weights. While extending prompts to both vision and text encoders across multiple transformer layers significantly boosts performance, it dramatically increases the number of trainable parameters, with state-of-the-art methods requiring millions of parameters and abandoning the parameter efficiency that makes prompt tuning attractive. In this work, we propose MMLoP (Multi-Modal Low-Rank Prompting), a framework that achieves deep multi-modal prompting with only 11.5K trainable parameters, comparable to early text-only methods like CoOp. MMLoP parameterizes vision and text prompts at each transformer layer through a low-rank factorization, which serves as an implicit regularizer against overfitting on few-shot training data. To further close the accuracy gap with state-of-the-art methods, we introduce three complementary components: a self-regulating consistency loss that anchors prompted representations to frozen zero-shot CLIP features at both the feature and logit levels, a uniform drift correction that removes the global embedding shift induced by prompt tuning to preserve class-discriminative structure, and a shared up-projection that couples vision and text prompts through a common low-rank factor to enforce cross-modal alignment. Extensive experiments across three benchmarks and 11 diverse datasets demonstrate that MMLoP achieves a highly favorable accuracy-efficiency tradeoff, outperforming the majority of existing methods including those with orders of magnitude more parameters, while achieving a harmonic mean of 79.70% on base-to-novel generalization.
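To see why low-rank factorization saves parameters, here is a rough count comparing full deep prompts with a factorized version. The layer/token/dimension numbers and the exact per-layer factorization are assumptions for illustration, not MMLoP's reported configuration:

```python
def prompt_params(layers, tokens, dim, rank=None):
    """Parameter count for deep prompting in one modality.
    Full prompting learns a (tokens x dim) prompt per layer; a low-rank
    variant learns two thin factors (tokens x rank) and (rank x dim)."""
    if rank is None:
        return layers * tokens * dim
    return layers * (tokens * rank + rank * dim)

# Illustrative CLIP-like setting: 12 layers, 4 prompt tokens, dim 512.
full = prompt_params(layers=12, tokens=4, dim=512)
low = prompt_params(layers=12, tokens=4, dim=512, rank=2)
```

Even at this toy scale the factorized count is roughly half the full one, and the gap widens with more tokens; the rank also acts as the implicit regularizer the abstract mentions.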
Comment: No close match to the numbered criteria. This is an efficient multi-modal prompt-tuning method for CLIP-like VLMs (low-rank prompting, consistency losses, drift correction) but not about SSL, robotics, video segmentation, cross-modal transfer for dense prediction, diffusion, or 3D reconstruction.
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.22073 [page] [pdf]
Authors: Artur Xarles, Sergio Escalera, Thomas B. Moeslund, Albert Clapés
Abstract: arXiv:2602.22073v1 Announce Type: new Abstract: Precise Event Spotting aims to localize fast-paced actions or events in videos with high temporal precision, a key task for applications in sports analytics, robotics, and autonomous systems. Existing methods typically process all frames uniformly, overlooking the inherent spatio-temporal redundancy in video data. This leads to redundant computation on non-informative regions while limiting overall efficiency. To remain tractable, they often spatially downsample inputs, losing fine-grained details crucial for precise localization. To address these limitations, we propose AdaSpot, a simple yet effective framework that processes low-resolution videos to extract global task-relevant features while adaptively selecting the most informative region-of-interest in each frame for high-resolution processing. The selection is performed via an unsupervised, task-aware strategy that maintains spatio-temporal consistency across frames and avoids the training instability of learnable alternatives. This design preserves essential fine-grained visual cues with a marginal computational overhead compared to low-resolution-only baselines, while remaining far more efficient than uniform high-resolution processing. Experiments on standard PES benchmarks demonstrate that AdaSpot achieves state-of-the-art performance under strict evaluation metrics (e.g., +3.96 and +2.26 mAP@0 frames on Tennis and FineDiving), while also maintaining strong results under looser metrics. Code is available at: https://github.com/arturxe2/AdaSpot
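A toy stand-in for the adaptive region selection: score patches of a low-resolution frame and return the top-scoring one for high-resolution processing. Scoring by summed intensity is an assumption for illustration (AdaSpot's actual strategy is task-aware and temporally consistent):

```python
def select_roi(frame, patch=2):
    """Scan a 2-D frame (list of rows) in patch-sized tiles and return
    the (row, col) origin of the tile with the highest summed score."""
    h, w = len(frame), len(frame[0])
    best, best_score = None, float("-inf")
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            score = sum(frame[y][x]
                        for y in range(i, min(i + patch, h))
                        for x in range(j, min(j + patch, w)))
            if score > best_score:
                best, best_score = (i, j), score
    return best

# A 4x4 frame whose bottom-right 2x2 patch carries all the activity.
roi = select_roi([[0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 5, 5],
                  [0, 0, 5, 5]])
```

Only the selected patch would then be re-processed at full resolution, which is where the compute savings over uniform high-resolution processing come from.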
Comment: Of general interest but does not satisfy criterion 3. It is a video event-spotting method with unsupervised RoI selection for high-resolution processing; useful for video/robotics applications, but it does not present unsupervised/self-supervised video segmentation as criterion 3 requires.
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.21515 [page] [pdf]
Authors: Chengrui Qu, Yizhou Zhang, Nicholas Lanzetti, Eric Mazumdar
Abstract: arXiv:2602.21515v1 Announce Type: new Abstract: Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals. Unfortunately, existing approaches to learning policies for such collaborative problems produce brittle solutions that fail when paired with new partners. We attribute these failures to a combination of free-riding during training and a lack of strategic robustness. To address these problems, we study the concept of strategic risk aversion and interpret it as a principled inductive bias for generalizable cooperation with unseen partners. While strategically risk-averse players are robust to deviations in their partner's behavior by design, we show that, in collaborative games, they also (1) can have better equilibrium outcomes than those at classical game-theoretic concepts like Nash, and (2) exhibit less or no free-riding. Inspired by these insights, we develop a multi-agent reinforcement learning (MARL) algorithm that integrates strategic risk aversion into standard policy optimization methods. Our empirical results across collaborative benchmarks (including an LLM collaboration task) validate our theory and demonstrate that our approach consistently achieves reliable collaboration with heterogeneous and previously unseen partners across collaborative tasks.
Comment: Does not match the specified criteria: focuses on multi-agent reinforcement learning and strategic risk aversion for collaborative agents (not on SSL, VLMs in robotics, video segmentation, modality transfer, diffusion or 3D generation).
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.22208 [page] [pdf]
Authors: Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, Saining Xie
Abstract: arXiv:2602.22208v1 Announce Type: new Abstract: Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives, failing to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection on video games such as Minecraft. Unlike prior platforms built for single-player settings, our system supports coordinated multi-agent interaction and synchronized videos + actions capture. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency. We train Solaris using a staged pipeline that progressively transitions from single-player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. Results show our architecture and training design outperform existing baselines. Through open-sourcing our system and models, we hope to lay the groundwork for a new generation of multi-agent world models.
Comment: No close match to the listed criteria — a multi-agent/multiplayer video world model for Minecraft (multi-view video generation/world models), not focused on SSL, VLMs for robotics, video segmentation, modality transfer, diffusion, or 3D generative reconstruction as requested.
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.21435 [page] [pdf]
Authors: Shengqiong Wu, Bobo Li, Xinkai Wang, Xiangtai Li, Lei Cui, Furu Wei, Shuicheng Yan, Hao Fei, Tat-seng Chua
Abstract: arXiv:2602.21435v1 Announce Type: new Abstract: Unified Vision-Language Models (UVLMs) aim to advance multimodal learning by supporting both understanding and generation within a single framework. However, existing approaches largely focus on architectural unification while overlooking the need for explicit interaction between the two capabilities during task solving. As a result, current models treat understanding and generation as parallel skills rather than synergistic processes. To achieve real synergy, we introduce the interleaved Analyzing-Drafting problem-solving loop (AD-Loop), a new thinking paradigm that dynamically alternates between analytic and drafting operations. By interleaving textual thoughts with visual thoughts, AD-Loop enables models to iteratively refine both comprehension and outputs, fostering genuine synergy. To train this mechanism, we design a two-stage strategy: supervised learning on interleaved thought data to initialize alternation, followed by reinforcement learning to promote adaptive and autonomous control. Extensive experiments demonstrate that AD-Loop consistently improves performance across standard benchmarks for both understanding and generation, with strong transferability to various UVLM architectures. Visual analyses further validate the effectiveness of implicit visual thoughts. These results highlight AD-Loop as a principled and broadly applicable strategy for synergizing comprehension and creation. The project page is at https://sqwu.top/AD-Loop.
Comment: No listed criterion matched closely — paper proposes an interleaved analyzing-drafting loop for unified vision-language models (UVLMs) but does not target self-supervised representation learning, VLMs in robotics, video segmentation, cross-modal transfer for video, diffusion/3D generation, or NeRF/Gaussian-splatting style 3D reconstruction.
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.21691 [page] [pdf]
Authors: Yuting Zeng, Manping Fan, You Zhou, Yongbin Yu, Zhiwen Zheng, Jingtao Zhang, Liyong Ren, Zhenglin Yang
Abstract: arXiv:2602.21691v1 Announce Type: new Abstract: Trajectory generation for visually impaired scenarios requires smooth and temporally consistent states in structured, low-speed dynamic environments. However, traditional jerk-based heuristic trajectory sampling with independent segment generation and conventional smoothness penalties often leads to unstable terminal behavior and state discontinuities under frequent regeneration. This paper proposes a trajectory generation approach that integrates endpoint regulation to stabilize terminal states within each segment and momentum-aware dynamics to regularize the evolution of velocity and acceleration for segment consistency. Endpoint regulation is incorporated into trajectory sampling to stabilize terminal behavior, while momentum-aware dynamics enforce consistent velocity and acceleration evolution across consecutive trajectory segments. Experimental results demonstrate reduced acceleration peaks and lower jerk levels with decreased dispersion, smoother velocity and acceleration profiles, more stable endpoint distributions, and fewer infeasible trajectory candidates compared with a baseline planner.
Comment: No close match to any of the specified criteria (1-7). The paper proposes trajectory generation improvements for visually impaired scenarios (endpoint regulation and momentum-aware dynamics) — a robotics/planning contribution but not about VLMs in robotics, SSL for vision, unsupervised video segmentation, cross-modal transfer to video, diffusion models, or 3D reconstruction.
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2602.21389 [page] [pdf]
Authors: Zach J. Patterson, Emily Sologuren, Levi Cai, Daniel Kim, Alaa Maalouf, Pascal Spino, Daniela Rus
Abstract: arXiv:2602.21389v1 Announce Type: new Abstract: Autonomous robots can transform how we observe marine ecosystems, but close-range operation in reefs and other cluttered habitats remains difficult. Vehicles must maneuver safely near animals and fragile structures while coping with currents, variable illumination and limited sensing. Previous approaches simplify these problems by leveraging soft materials and bioinspired swimming designs, but such platforms remain limited in terms of deployable autonomy. Here we present a sea turtle-inspired autonomous underwater robot that closes the gap between bioinspired locomotion and field-ready autonomy through a tightly integrated, vision-driven control stack. The robot combines robust depth-heading stabilization with obstacle avoidance and target-centric control, enabling it to track and interact with moving objects in complex terrain. We validate the robot in controlled pool experiments and in a live coral reef exhibit at the New England Aquarium, demonstrating stable operation and reliable tracking of fast-moving marine animals and human divers. To the best of our knowledge, this is the first integrated biomimetic robotic system, combining novel hardware, control, and field experiments, deployed to track and monitor real marine animals in their natural environment. During off-tether experiments, we demonstrate safe navigation around obstacles (91% success rate in the aquarium exhibit) and introduce a low-compute onboard tracking mode. Together, these results establish a practical route toward soft-rigid hybrid, bioinspired underwater robots capable of minimally disruptive exploration and close-range monitoring in sensitive ecosystems.
Comment: No specific criterion match — interesting robotics field deployment (bioinspired underwater robot) but not using VLMs, SSL, video segmentation, cross-modal transfer, diffusion or 3D generative methods.
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2602.21904 [page] [pdf]
Authors: Mariia Baidachna, James Carty, Aidan Ferguson, Joseph Agrane, Varad Kulkarni, Aubrey Agub, Michael Baxendale, Aaron David, Rachel Horton, Elliott Atkinson
Abstract: arXiv:2602.21904v1 Announce Type: new Abstract: Accurate cone localization in 3D space is essential in autonomous racing for precise navigation around the track. Approaches that rely on traditional computer vision algorithms are sensitive to environmental variations, and neural networks are often trained on limited data and are infeasible to run in real time. We present a UNet-based neural network for keypoint detection on cones, leveraging the largest custom-labeled dataset we have assembled. Our approach enables accurate cone position estimation and the potential for color prediction. Our model achieves substantial improvements in keypoint accuracy over conventional methods. Furthermore, we leverage our predicted keypoints in the perception pipeline and evaluate the end-to-end autonomous system. Our results show high-quality performance across all metrics, highlighting the effectiveness of this approach and its potential for adoption in competitive autonomous racing systems.
Comment: Does not match the requested criteria. This is an application-focused UNet keypoint regression method for 3D cone localization in autonomous racing (task-specific perception pipeline), not a new SSL/VLM/transfer/segmentation methodology of the types requested.
Relevance: 3 Novelty: 3 Back to [topic] [top]
ArXiv: 2602.21484 [page] [pdf]
Authors: Yushen He
Abstract: arXiv:2602.21484v1 Announce Type: new Abstract: 3D object detection is essential for autonomous driving and robotic perception, yet its reliance on large-scale manually annotated data limits scalability and adaptability. To reduce annotation dependency, unsupervised and sparsely-supervised paradigms have emerged. However, they face intertwined challenges: low-quality pseudo-labels, unstable feature mining, and a lack of a unified training framework. This paper proposes SPL, a unified training framework for both Unsupervised and Sparsely-Supervised 3D Object Detection via Semantic Pseudo-labeling and prototype Learning. SPL first generates high-quality pseudo-labels by integrating image semantics, point cloud geometry, and temporal cues, producing both 3D bounding boxes for dense objects and 3D point labels for sparse ones. These pseudo-labels are not used directly but as probabilistic priors within a novel, multi-stage prototype learning strategy. This strategy stabilizes feature representation learning through memory-based initialization and momentum-based prototype updating, effectively mining features from both labeled and unlabeled data. Extensive experiments on KITTI and nuScenes datasets demonstrate that SPL significantly outperforms state-of-the-art methods in both settings. Our work provides a robust and generalizable solution for learning 3D object detectors with minimal or no manual annotations.
Comment: Matches criterion 4. Proposes a unified unsupervised/sparsely-supervised 3D object detection framework that fuses image semantics, point-cloud geometry, and temporal cues (semantic pseudo-labeling + prototype learning) — an approach to leverage multimodal data for dense 3D detection.
Relevance: 7 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.21627 [page] [pdf]
Authors: Abhineet Singh, Justin Rozeboom, Nilanjan Ray
Abstract: arXiv:2602.21627v1 Announce Type: new Abstract: This paper presents a new unified approach to semantic segmentation in both images and videos by using language modeling to output the masks as sequences of discrete tokens. We use run length encoding (RLE) to discretize the segmentation masks and then train a modified version of Pix2Seq to output these RLE tokens through autoregression. We propose novel tokenization strategies to compress the length of the token sequence to make it practicable to extend this approach to videos. We also show how instance information can be incorporated into the tokenization process to perform panoptic segmentation. We evaluate our proposed models on two datasets to show that they are competitive with the state of the art in spite of being bottlenecked by our limited computational resources.
Comment: Matches criterion 4: cross-modal/tokenization approach for dense prediction. They convert segmentation masks to RLE tokens and train a sequence (language) model to output masks for images and videos — a modality-transfer/tokenization strategy for dense video/image segmentation.
Relevance: 7 Novelty: 6 Back to [topic] [top]
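The run-length tokenization idea in the entry above can be illustrated in a few lines: flatten a binary mask and emit alternating run lengths as discrete tokens, then invert the sequence to recover the mask. This is a minimal sketch of generic RLE tokenization under stated assumptions, not the paper's actual vocabulary or compression strategy; the function names are hypothetical.

```python
import numpy as np

def mask_to_rle_tokens(mask):
    """Flatten a binary mask and run-length encode it as a token sequence.

    Tokens alternate run lengths of 0s and 1s, starting with the 0-run
    (a 0-length first run is emitted if the mask starts with 1).
    """
    flat = np.asarray(mask, dtype=np.uint8).ravel()
    tokens, current, run = [], 0, 0
    for v in flat:
        if v == current:
            run += 1
        else:
            tokens.append(run)   # close the current run
            current, run = int(v), 1
    tokens.append(run)           # close the final run
    return tokens

def rle_tokens_to_mask(tokens, shape):
    """Invert the encoding back to a binary mask of the given shape."""
    out, value = [], 0
    for run in tokens:
        out.extend([value] * run)
        value ^= 1               # runs alternate between 0 and 1
    return np.array(out, dtype=np.uint8).reshape(shape)
```

A 2x3 mask such as [[0,0,1],[1,1,0]] encodes to [2, 3, 1] and round-trips exactly; the paper's contribution is in compressing such sequences enough to make autoregressive decoding practical for video.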
ArXiv: 2602.21259 [page] [pdf]
Authors: Ricardo B. Grando, Victor A. Kich, Alisson H. Kolling, Junior C. D. Jesus, Rodrigo S. Guerra, Paulo L. J. Drews-Jr
Abstract: arXiv:2602.21259v1 Announce Type: new Abstract: Hybrid Unmanned Aerial Underwater Vehicles (HUAUVs) have emerged as platforms capable of operating in both aerial and underwater environments, enabling applications such as inspection, mapping, search, and rescue in challenging scenarios. However, the development of novel methodologies poses significant challenges due to the distinct dynamics and constraints of the air and water domains. In this work, we present persistent monitoring tasks for HUAUVs by combining Deep Reinforcement Learning (DRL) and Transfer Learning to enable cross-domain adaptability. Our approach employs a shared DRL architecture trained on Lidar sensor data (on air) and Sonar data (underwater), demonstrating the feasibility of a unified policy for both environments. We further show that the methodology presents promising results, taking into account the uncertainty of the environment and the dynamics of multiple mobile targets. The proposed framework lays the groundwork for scalable autonomous persistent monitoring solutions based on DRL for hybrid aerial-underwater vehicles.
Comment: Matches criterion 4 — applies transfer learning across sensor modalities (Lidar in air to Sonar underwater) with DRL for cross-domain robotics persistent monitoring.
Relevance: 6 Novelty: 4 Back to [topic] [top]
ArXiv: 2602.21819 [page] [pdf]
Authors: Minghan Yang, Lan Yang, Ke Li, Honggang Zhang, Kaiyue Pang, Yizhe Song
Abstract: arXiv:2602.21819v1 Announce Type: new Abstract: Reconstructing dynamic visual experiences from brain activity provides a compelling avenue for exploring the neural mechanisms of human visual perception. While recent progress in fMRI-based image reconstruction has been notable, extending this success to video reconstruction remains a significant challenge. Current fMRI-to-video reconstruction approaches consistently encounter two major shortcomings: (i) inconsistent visual representations of salient objects across frames, leading to appearance mismatches; (ii) poor temporal coherence, resulting in motion misalignment or abrupt frame transitions. To address these limitations, we introduce SemVideo, a novel fMRI-to-video reconstruction framework guided by hierarchical semantic information. At the core of SemVideo is SemMiner, a hierarchical guidance module that constructs three levels of semantic cues from the original video stimulus: static anchor descriptions, motion-oriented narratives, and holistic summaries. Leveraging this semantic guidance, SemVideo comprises three key components: a Semantic Alignment Decoder that aligns fMRI signals with CLIP-style embeddings derived from SemMiner, a Motion Adaptation Decoder that reconstructs dynamic motion patterns using a novel tripartite attention fusion architecture, and a Conditional Video Render that leverages hierarchical semantic guidance for video reconstruction. Experiments conducted on the CC2017 and HCP datasets demonstrate that SemVideo achieves superior performance in both semantic alignment and temporal consistency, setting a new state-of-the-art in fMRI-to-video reconstruction.
Comment: Not a close match. It uses CLIP‑style (vision–language) embeddings to guide fMRI‑to‑video reconstruction, so there is superficial cross‑modal use of VLM features, but it is a neuroscience‑focused, specialized fMRI→video reconstruction paper rather than a paper about cross‑modal transfer to improve video models (criterion 4) or VLMs applied to robotics (criterion 2).
Relevance: 3 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.21589 [page] [pdf]
Authors: Haoxiang Fu, Lingfeng Zhang, Hao Li, Ruibing Hu, Zhengrong Li, Guanjing Liu, Zimu Tan, Long Chen, Hangjun Ye, Xiaoshuai Hao
Abstract: arXiv:2602.21589v1 Announce Type: new Abstract: High-definition (HD) maps are essential for autonomous driving, yet multi-modal fusion often suffers from inconsistency between camera and LiDAR modalities, leading to performance degradation under low-light conditions, occlusions, or sparse point clouds. To address this, we propose SEFMAP, a Subspace-Expert Fusion framework for robust multimodal HD map prediction. The key idea is to explicitly disentangle BEV features into four semantic subspaces: LiDAR-private, Image-private, Shared, and Interaction. Each subspace is assigned a dedicated expert, thereby preserving modality-specific cues while capturing cross-modal consensus. To adaptively combine expert outputs, we introduce an uncertainty-aware gating mechanism at the BEV-cell level, where unreliable experts are down-weighted based on predictive variance, complemented by a usage balance regularizer to prevent expert collapse. To enhance robustness in degraded conditions and promote role specialization, we further propose distribution-aware masking: during training, modality-drop scenarios are simulated using EMA-statistical surrogate features, and a specialization loss enforces distinct behaviors of private, shared, and interaction experts across complete and masked inputs. Experiments on nuScenes and Argoverse2 benchmarks demonstrate that SEFMAP achieves state-of-the-art performance, surpassing prior methods by +4.2% and +4.8% in mAP, respectively. SEFMAP provides a robust and effective solution for multi-modal HD map prediction under diverse and degraded conditions.
Comment: Partially relevant to criterion 4 in that it studies multimodal (LiDAR+image) fusion for dense BEV/HD-map prediction (subspace experts, uncertainty gating). However, it is an autonomous-driving application (domain-specific) rather than a general cross-modal transfer-for-video paper.
Relevance: 3 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.21754 [page] [pdf]
Authors: Aditya Ranjan Dash, Ramy Battrawy, René Schuster, Didier Stricker
Abstract: arXiv:2602.21754v1 Announce Type: new Abstract: Advanced autonomous systems rely on multi-sensor fusion for safer and more robust perception. To enable effective fusion, calibrating directly from natural driving scenes (i.e., target-free) with high accuracy is crucial for precise multi-sensor alignment. Existing learning-based calibration methods are typically designed for only a single pair of sensor modalities (i.e., a bi-modal setup). Unlike these methods, we propose LiREC-Net, a target-free, learning-based calibration network that jointly calibrates multiple sensor modality pairs, including LiDAR, RGB, and event data, within a unified framework. To reduce redundant computation and improve efficiency, we introduce a shared LiDAR representation that leverages features from both its 3D nature and projected depth map, ensuring better consistency across modalities. Trained and evaluated on established datasets, such as KITTI and DSEC, our LiREC-Net achieves competitive performance to bi-modal models and sets a new strong baseline for the tri-modal use case.
Comment: No specific criterion matched. This is a tri-modal LiDAR/RGB/event calibration method (multi-sensor calibration) but does not target transfer learning between modalities for video dense prediction (criterion 4).
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2602.21699 [page] [pdf]
Authors: Rajai Alhimdiat, Ramy Battrawy, René Schuster, Didier Stricker, Wesam Ashour
Abstract: arXiv:2602.21699v1 Announce Type: new Abstract: Scene flow estimation is an extremely important task in computer vision to support the perception of dynamic changes in the scene. For robust scene flow, learning-based approaches have recently achieved impressive results using either image-based or LiDAR-based modalities. However, these methods have tended to focus on the use of a single modality. To tackle these problems, we present a deep learning architecture, SF3D-RGB, that enables sparse scene flow estimation using 2D monocular images and 3D point clouds (e.g., acquired by LiDAR) as inputs. Our architecture is an end-to-end model that first encodes information from each modality into features and fuses them together. Then, the fused features enhance a graph matching module for better and more robust mapping matrix computation to generate an initial scene flow. Finally, a residual scene flow module further refines the initial scene flow. Our model is designed to strike a balance between accuracy and efficiency. Furthermore, experiments show that our proposed method outperforms single-modality methods and achieves better scene flow accuracy on real-world datasets while using fewer parameters compared to other state-of-the-art methods with fusion.
Comment: No criteria matched closely. It is a supervised multimodal (image+LiDAR) scene-flow estimator — related to multimodal fusion for dense motion but not framed as transfer-learning between modalities for video dense prediction (criterion 4) nor as self-supervised work.
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2602.22026 [page] [pdf]
Authors: Xiaoyu Xian, Shiao Wang, Xiao Wang, Daxin Tian, Yan Tian
Abstract: arXiv:2602.22026v1 Announce Type: new Abstract: Metro trains often operate in highly complex environments, characterized by illumination variations, high-speed motion, and adverse weather conditions. These factors pose significant challenges for visual perception systems, especially those relying solely on conventional RGB cameras. To tackle these difficulties, we explore the integration of event cameras into the perception system, leveraging their advantages in low-light conditions, high-speed scenarios, and low power consumption. Specifically, we focus on Kilometer Marker Recognition (KMR), a critical task for autonomous metro localization under GNSS-denied conditions. In this context, we propose a robust baseline method based on a pre-trained RGB OCR foundation model, enhanced through multi-modal adaptation. Furthermore, we construct the first large-scale RGB-Event dataset, EvMetro5K, containing 5,599 pairs of synchronized RGB-Event samples, split into 4,479 training and 1,120 testing samples. Extensive experiments on EvMetro5K and other widely used benchmarks demonstrate the effectiveness of our approach for KMR. Both the dataset and source code will be released on https://github.com/Event-AHU/EvMetro5K_benchmark
Comment: Partial overlap with criterion 4 (multi-modal transfer between RGB and event cameras), but primarily a domain-specific application (kilometer marker recognition) rather than a general cross-modal transfer method.
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2602.21818 [page] [pdf]
Authors: Guibin Chen, Dixuan Lin, Jiangping Yang, Youqiang Zhang, Zhengcong Fei, Debang Li, Sheng Chen, Chaofeng Ao, Nuo Pang, Yiming Wang, Yikun Dou, Zheng Chen, Mingyuan Fan, Tuanhui Li, Mingshan Chang, Hao Zhang, Xiaopeng Sun, Jingtao Xu, Yuqiang Xie, Jiahua Wang, Zhiheng Xu, Weiming Xiong, Yuzhe Jin, Baoxuan Gu, Binjie Mao, Yunjie Yu, Jujie He, Yuhao Feng, Shiwen Tu, Chaojie Wang, Rui Yan, Wei Shen, Jingchen Wu, Peng Zhao, Xuanyue Zhong, Zhuangzhuang Liu, Kaifei Wang, Fuxiang Zhang, Weikai Xu, Wenyan Liu, Binglu Zhang, Yu Shen, Tianhui Xiong, Bin Peng, Liang Zeng, Xuchen Song, Haoxiang Guo, Peiyu Wang, Yahui Zhou
Abstract: arXiv:2602.21818v1 Announce Type: new Abstract: SkyReels V4 is a unified multi-modal video foundation model for joint video-audio generation, inpainting, and editing. The model adopts a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture, where one branch synthesizes video and the other generates temporally aligned audio, while sharing a powerful text encoder based on a Multimodal Large Language Model (MMLM). SkyReels V4 accepts rich multi-modal instructions, including text, images, video clips, masks, and audio references. By combining the MMLM's multi-modal instruction-following capability with in-context learning in the video-branch MMDiT, the model can inject fine-grained visual guidance under complex conditioning, while the audio-branch MMDiT simultaneously leverages audio references to guide sound generation. On the video side, we adopt a channel-concatenation formulation that unifies a wide range of inpainting-style tasks, such as image-to-video, video extension, and video editing, under a single interface, and naturally extends to vision-referenced inpainting and editing via multi-modal prompts. SkyReels V4 supports up to 1080p resolution, 32 FPS, and 15-second duration, enabling high-fidelity, multi-shot, cinema-level video generation with synchronized audio. To make such high-resolution, long-duration generation computationally feasible, we introduce an efficiency strategy: joint generation of low-resolution full sequences and high-resolution keyframes, followed by dedicated super-resolution and frame-interpolation models. To our knowledge, SkyReels V4 is the first video foundation model that simultaneously supports multi-modal input, joint video-audio generation, and a unified treatment of generation, inpainting, and editing, while maintaining strong efficiency and quality at cinematic resolutions and durations.
Comment: Matches criterion 5. SkyReels-V4 is a multi-modal video+audio diffusion-based foundation model that targets high-resolution, long-duration video generation, inpainting and editing and introduces efficiency strategies (low-res full sequence + high-res keyframes) to make cinematic video diffusion feasible.
Relevance: 8 Novelty: 7 Back to [topic] [top]
ArXiv: 2602.21472 [page] [pdf]
Authors: Louis Bethune, Victor Turrisi, Bruno Kacper Mlodozeniec, Pau Rodriguez Lopez, Lokesh Boominathan, Nikhil Bhendawade, Amitis Shidani, Joris Pelemans, Theo X. Olausson, Devon Hjelm, Paul Dixon, Joao Monteiro, Pierre Ablin, Vishnu Banna, Arno Blaas, Nick Henderson, Kari Noriy, Dan Busbridge, Josh Susskind, Marco Cuturi, Irina Belousova, Luca Zappella, Russ Webb, Jason Ramapuram
Abstract: arXiv:2602.21472v1 Announce Type: new Abstract: Discrete diffusion models have emerged as strong alternatives to autoregressive language models, with recent work initializing and fine-tuning a base unimodal model for bimodal generation. Diverging from previous approaches, we introduce the first tri-modal masked diffusion model pretrained from scratch on text, image-text, and audio-text data. We systematically analyze multimodal scaling laws, modality mixing ratios, noise schedules, and batch-size effects, and we provide optimized inference sampling defaults. Our batch-size analysis yields a novel stochastic differential equation (SDE)-based reparameterization that eliminates the need for tuning the optimal batch size as reported in recent work. This reparameterization decouples the physical batch size, often chosen based on compute constraints (GPU saturation, FLOP efficiency, wall-clock time), from the logical batch size, chosen to balance gradient variance during stochastic optimization. Finally, we pretrain a preliminary 3B-parameter tri-modal model on 6.4T tokens, demonstrating the capabilities of a unified design and achieving strong results in text generation, text-to-image tasks, and text-to-speech tasks. Our work represents the largest-scale systematic open study of multimodal discrete diffusion models conducted to date, providing insights into scaling behaviors across multiple modalities.
Comment: Matches criterion 5 — presents a tri-modal masked discrete diffusion model pretrained from scratch on text, image-text, and audio-text, with analysis of noise schedules, multimodal scaling, and an SDE-based reparameterization to decouple logical/physical batch sizes; relevant to advances in diffusion generative models.
Relevance: 8 Novelty: 7 Back to [topic] [top]
ArXiv: 2602.21429 [page] [pdf]
Authors: Darshan Gadginmath, Ahmed Allibhoy, Fabio Pasqualetti
Abstract: arXiv:2602.21429v1 Announce Type: new Abstract: Flow-based generative models, such as diffusion models and flow matching models, have achieved remarkable success in learning complex data distributions. However, a critical gap remains for their deployment in safety-critical domains: the lack of formal guarantees that generated samples will satisfy hard constraints. We address this by proposing a safety filtering framework that acts as an online shield for any pre-trained generative model. Our key insight is to cooperate with the generative process rather than override it. We define a constricting safety tube that is relaxed at the initial noise distribution and progressively tightens to the target safe set at the final data distribution, mirroring the coarse-to-fine structure of the generative process itself. By characterizing this tube via Control Barrier Functions (CBFs), we synthesize a feedback control input through a convex Quadratic Program (QP) at each sampling step. As the tube is loosest when noise is high and intervention is cheapest in terms of control energy, most constraint enforcement occurs when it least disrupts the model's learned structure. We prove that this mechanism guarantees safe sampling while minimizing the distributional shift from the original model at each sampling step, as quantified by the KL divergence. Our framework applies to any pre-trained flow-based generative scheme requiring no retraining or architectural modifications. We validate the approach across constrained image generation, physically-consistent trajectory sampling, and safe robotic manipulation policies, achieving 100% constraint satisfaction while preserving semantic fidelity.
Comment: Matches criterion 5: proposes a sampling-time safety filter for flow-based generative models (diffusion/flow-matching) using Control Barrier Functions and per-step QP interventions to guarantee constraint satisfaction during generation (including robotic trajectories/manipulation).
Relevance: 8 Novelty: 7 Back to [topic] [top]
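The per-step filtering described in the entry above, minimally perturbing each sampling update so it satisfies a safety constraint, can be sketched for the special case of a single linear constraint, where the QP has a closed form. This is an illustrative stand-in (the name `safety_filter_step` and the linear constraint are assumptions), not the paper's CBF synthesis, which characterizes a tightening safety tube and solves a general QP at every denoising step.

```python
import numpy as np

def safety_filter_step(u_nom, a, b):
    """Minimally perturb a nominal update so that a @ u >= b holds.

    Closed-form solution of the projection QP
        min ||u - u_nom||^2   s.t.   a @ u >= b,
    i.e. project u_nom onto the half-space when it is infeasible,
    and leave it untouched otherwise (zero intervention cost).
    """
    u_nom = np.asarray(u_nom, dtype=float)
    a = np.asarray(a, dtype=float)
    slack = float(a @ u_nom - b)
    if slack >= 0.0:
        return u_nom                       # already safe: no correction
    return u_nom - (slack / float(a @ a)) * a  # move onto the boundary
```

In the paper's scheme, the bound b would tighten over the sampling trajectory (loose near pure noise, tight at the data distribution), so most of the correction happens early, where it disturbs the learned generative structure least.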
ArXiv: 2602.21760 [page] [pdf]
Authors: Euisoo Jung, Byunghyun Kim, Hyunjin Kim, Seonghye Cho, Jae-Gil Lee
Abstract: arXiv:2602.21760v1 Announce Type: new Abstract: Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation, yet inference remains computationally expensive. Nevertheless, current diffusion acceleration methods based on distributed parallelism suffer from noticeable generation artifacts and fail to achieve substantial acceleration proportional to the number of GPUs. Therefore, we propose a hybrid parallelism framework that combines a novel data parallel strategy, condition-based partitioning, with an optimal pipeline scheduling method, adaptive parallelism switching, to reduce generation latency and achieve high generation quality in conditional diffusion models. The key ideas are to (i) leverage the conditional and unconditional denoising paths as a new data-partitioning perspective and (ii) adaptively enable optimal pipeline parallelism according to the denoising discrepancy between these two paths. Our framework achieves 2.31× and 2.07× latency reductions on SDXL and SD3, respectively, using two NVIDIA RTX 3090 GPUs, while preserving image quality. This result confirms the generality of our approach across U-Net-based diffusion models and DiT-based flow-matching architectures. Our approach also outperforms existing methods in acceleration under high-resolution synthesis settings. Code is available at https://github.com/kaist-dmlab/Hybridiff.
Comment: Matches criterion 5. Proposes system-level hybrid data-pipeline parallelism and conditional guidance scheduling to accelerate conditional diffusion model inference while preserving image quality, directly addressing reduction of computational cost for diffusion models.
Relevance: 8 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.21596 [page] [pdf]
Authors: Trung X. Pham, Kang Zhang, Ji Woo Hong, Chang D. Yoo
Abstract: arXiv:2602.21596v1 Announce Type: new Abstract: Diffusion Transformers have achieved state-of-the-art performance in class-conditional and multimodal generation, yet the structure of their learned conditional embeddings remains poorly understood. In this work, we present the first systematic study of these embeddings and uncover a notable redundancy: class-conditioned embeddings exhibit extreme angular similarity, exceeding 99% on ImageNet-1K, while continuous-condition tasks such as pose-guided image generation and video-to-audio generation reach over 99.9%. We further find that semantic information is concentrated in a small subset of dimensions, with head dimensions carrying the dominant signal and tail dimensions contributing minimally. By pruning low-magnitude dimensions--removing up to two-thirds of the embedding space--we show that generation quality and fidelity remain largely unaffected, and in some cases improve. These results reveal a semantic bottleneck in Transformer-based diffusion models, providing new insights into how semantics are encoded and suggesting opportunities for more efficient conditioning mechanisms.
Comment: Matches criterion 5. Systematic analysis of conditional embeddings in Transformer-based diffusion models and shows aggressive pruning of embedding dimensions (up to ~2/3) can retain or improve generation quality—relevant to improving diffusion-model efficiency.
Relevance: 8 Novelty: 6 Back to [topic] [top]
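The pruning experiment in the entry above can be mimicked with a toy helper that ranks embedding dimensions by mean absolute magnitude across classes and keeps only the top fraction. The helper `prune_embedding_dims` is a hypothetical illustration of magnitude-based dimension pruning, not the authors' procedure or thresholds.

```python
import numpy as np

def prune_embedding_dims(E, keep_frac=1.0 / 3.0):
    """Keep only the highest-magnitude dimensions of an embedding table.

    E: (num_classes, dim) conditional-embedding matrix. Dimensions are
    ranked by mean absolute activation across classes, mirroring the
    observation that a small "head" subset of dimensions carries the
    dominant semantic signal while tail dimensions contribute little.
    Returns the pruned table and the kept dimension indices.
    """
    E = np.asarray(E, dtype=float)
    importance = np.abs(E).mean(axis=0)          # per-dimension magnitude
    k = max(1, int(E.shape[1] * keep_frac))      # how many dims to keep
    keep = np.sort(np.argsort(importance)[-k:])  # top-k, original order
    return E[:, keep], keep
```

Under the paper's finding, generating from the pruned table (here removing up to two-thirds of the dimensions) should leave output quality largely intact, which is the evidence for the claimed semantic bottleneck.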
ArXiv: 2602.21591 [page] [pdf]
Authors: Xihua Sheng, Lingyu Zhu, Tianyu Zhang, Dong Liu, Shiqi Wang, Jing Wang
Abstract: arXiv:2602.21591v1 Announce Type: new Abstract: Diffusion-based generative image compression has demonstrated remarkable potential for achieving realistic reconstruction at ultra-low bitrates. The key to unlocking this potential lies in making the entire compression process content-adaptive, ensuring that the encoder's representation and the decoder's generative prior are dynamically aligned with the semantic and structural characteristics of the input image. However, existing methods suffer from three critical limitations that prevent effective content adaptation. First, isotropic quantization applies a uniform quantization step, failing to adapt to the spatially varying complexity of image content and creating a misalignment with the diffusion model's noise-dependent prior. Second, the information concentration bottleneck -- arising from the dimensional mismatch between the high-dimensional noisy latent and the diffusion decoder's fixed input -- prevents the model from adaptively preserving essential semantic information in the primary channels. Third, existing textual conditioning strategies either need significant textual bitrate overhead or rely on generic, content-agnostic textual prompts, thereby failing to provide adaptive semantic guidance efficiently. To overcome these limitations, we propose a content-adaptive diffusion-based image codec with three technical innovations: 1) an Uncertainty-Guided Adaptive Quantization method that learns spatial uncertainty maps to adaptively align quantization distortion with content characteristics; 2) an Auxiliary Decoder-Guided Information Concentration method that uses a lightweight auxiliary decoder to enforce content-aware information preservation in the primary latent channels; and 3) a Bitrate-Free Adaptive Textual Conditioning method that derives content-aware textual descriptions from the auxiliary reconstructed image, enabling semantic guidance without bitrate cost.
Comment: Matches criterion 5: proposes methodological advances in diffusion-based generative image compression (uncertainty-guided adaptive quantization, information-concentration mechanism, bitrate-free adaptive textual conditioning).
Relevance: 8 Novelty: 6 Back to [topic] [top]
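The uncertainty-guided quantization idea can be pictured generically: scale the quantization step by a per-location uncertainty value, so complex regions are coded more finely and smooth regions more coarsely. A minimal sketch of that general idea, not the paper's learned method; `base_step` and `sensitivity` are illustrative parameters.

```python
def adaptive_quantize(latent, uncertainty, base_step=1.0, sensitivity=2.0):
    """Quantize each latent element with a step size scaled by local uncertainty.

    High-uncertainty (complex) regions get a finer step; smooth regions get a
    coarser one. A generic stand-in for learned spatial uncertainty maps.
    """
    out = []
    for x, u in zip(latent, uncertainty):
        step = base_step / (1.0 + sensitivity * u)  # finer step where u is high
        out.append(round(x / step) * step)
    return out
```

With zero uncertainty the value 0.3 collapses to 0.0 under a unit step, while at high uncertainty the finer step preserves it near 1/3, illustrating the content-adaptive distortion allocation.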
ArXiv: 2602.21581 [page] [pdf]
Authors: Yingcheng Hu, Haowen Gong, Chuanguang Yang, Zhulin An, Yongjun Xu, Songhua Liu
Abstract: arXiv:2602.21581v1 Announce Type: new Abstract: Pose-guided human image animation aims to synthesize realistic videos of a reference character driven by a sequence of poses. While diffusion-based methods have achieved remarkable success, most existing approaches are limited to single-character animation. We observe that naively extending these methods to multi-character scenarios often leads to identity confusion and implausible occlusions between characters. To address these challenges, in this paper, we propose an extensible multi-character image animation framework built upon modern Diffusion Transformers (DiTs) for video generation. At its core, our framework introduces two novel components, the Identifier Assigner and the Identifier Adapter, which collaboratively capture per-person positional cues and inter-person spatial relationships. This mask-driven scheme, along with a scalable training strategy, not only enhances flexibility but also enables generalization to scenarios with more characters than those seen during training. Remarkably, trained on only a two-character dataset, our model generalizes to multi-character animation while maintaining compatibility with single-character cases. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in multi-character image animation, surpassing existing diffusion-based baselines.
Comment: Matches criterion 5: improves diffusion-based video generation/animation. The paper extends diffusion transformers for pose-guided image-to-video animation to multi-character scenarios, introducing Identifier Assigner and Identifier Adapter to handle per-person positional cues and inter-person occlusion in diffusion-based generation.
Relevance: 6 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.22150 [page] [pdf]
Authors: YuXin Song, Yu Lu, Haoyuan Sun, Huanjin Yao, Fanglong Liu, Yifan Sun, Haocheng Feng, Hang Zhou, Jingdong Wang
Abstract: arXiv:2602.22150v1 Announce Type: new Abstract: Unified conditional image generation remains difficult because different tasks depend on fundamentally different internal representations. Some require conceptual understanding for semantic synthesis, while others rely on localization cues for spatial precision. Forcing these heterogeneous tasks to share a single representation leads to concept-localization representational conflict. To address this issue, we propose CoLoGen, a unified diffusion framework that progressively learns and reconciles this concept-localization duality. CoLoGen uses a staged curriculum that first builds core conceptual and localization abilities, then adapts them to diverse visual conditions, and finally refines their synergy for complex instruction-driven tasks. Central to this process is the Progressive Representation Weaving (PRW) module, which dynamically routes features to specialized experts and stably integrates their outputs across stages. Experiments on editing, controllable generation, and customized generation show that CoLoGen achieves competitive or superior performance, offering a principled representational perspective for unified image generation.
Comment: Matches criterion 5: proposes a diffusion-based unified image generation method (CoLoGen) addressing concept vs localization trade-offs via a Progressive Representation Weaving module—an explicit diffusion-model improvement.
Relevance: 6 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.21402 [page] [pdf]
Authors: Jinyoung Jun, Won-Dong Jang, Wenbin Ouyang, Raghudeep Gadde, Jungbeom Lee
Abstract: arXiv:2602.21402v1 Announce Type: new Abstract: We present FlowFixer, a refinement framework for subject-driven generation (SDG) that restores fine details lost during generation caused by changes in scale and perspective of a subject. FlowFixer proposes direct image-to-image translation from visual references, avoiding ambiguities in language prompts. To enable image-to-image training, we introduce a one-step denoising scheme to generate self-supervised training data, which automatically removes high-frequency details while preserving global structure, effectively simulating real-world SDG errors. We further propose a keypoint matching-based metric to properly assess fidelity in details beyond semantic similarities usually measured by CLIP or DINO. Experimental results demonstrate that FlowFixer outperforms state-of-the-art SDG methods in both qualitative and quantitative evaluations, setting a new benchmark for high-fidelity subject-driven generation.
Comment: Matches criterion 5 (advances in image generation): proposes a refinement framework (FlowFixer) and a self-supervised one-step denoising scheme to restore fine details in subject-driven image generation and introduces a new keypoint-matching fidelity metric.
Relevance: 6 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.21631 [page] [pdf]
Authors: Zhihao Sun, Tong Wu, Ruirui Tu, Daoguo Dong, Zuxuan Wu
Abstract: arXiv:2602.21631v1 Announce Type: new Abstract: Hand motion plays a central role in human interaction, yet modeling realistic 4D hand motion (i.e., 3D hand pose sequences over time) remains challenging. Research in this area is typically divided into two tasks: (1) Estimation approaches reconstruct precise motion from visual observations, but often fail under hand occlusion or absence; (2) Generation approaches focus on synthesizing hand poses by exploiting generative priors under multi-modal structured inputs and infilling motion from incomplete sequences. However, this separation not only limits the effective use of heterogeneous condition signals that frequently arise in practice, but also prevents knowledge transfer between the two tasks. We present UniHand, a unified diffusion-based framework that formulates both estimation and generation as conditional motion synthesis. UniHand integrates heterogeneous inputs by embedding structured signals into a shared latent space through a joint variational autoencoder, which aligns conditions such as MANO parameters and 2D skeletons. Visual observations are encoded with a frozen vision backbone, while a dedicated hand perceptron extracts hand-specific cues directly from image features, removing the need for complex detection and cropping pipelines. A latent diffusion model then synthesizes consistent motion sequences from these diverse conditions. Extensive experiments across multiple benchmarks demonstrate that UniHand delivers robust and accurate hand motion modeling, maintaining performance under severe occlusions and temporally incomplete inputs.
Comment: Partial match to criterion 5: uses a latent diffusion model for conditional 4D (spatio-temporal) hand motion synthesis, unifying estimation and generation; relevant as a diffusion-based generative approach for motion sequences rather than text→image/video diffusion advances per se.
Relevance: 5 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.21319 [page] [pdf]
Authors: Marion Neumeier, Niklas Roßberg, Michael Botsch, Wolfgang Utschick
Abstract: arXiv:2602.21319v1 Announce Type: new Abstract: Accurate and uncertainty-aware trajectory prediction remains a core challenge for autonomous driving, driven by complex multi-agent interactions, diverse scene contexts and the inherently stochastic nature of future motion. Diffusion-based generative models have recently shown strong potential for capturing multimodal futures, yet existing approaches such as cVMD suffer from slow sampling, limited exploitation of generative diversity and brittle scenario encodings. This work introduces cVMDx, an enhanced diffusion-based trajectory prediction framework that improves efficiency, robustness and multimodal predictive capability. Through DDIM sampling, cVMDx achieves up to a 100x reduction in inference time, enabling practical multi-sample generation for uncertainty estimation. A fitted Gaussian Mixture Model further provides tractable multimodal predictions from the generated trajectories. In addition, a CVQ-VAE variant is evaluated for scenario encoding. Experiments on the publicly available highD dataset show that cVMDx achieves higher accuracy and significantly improved efficiency over cVMD, enabling fully stochastic, multimodal trajectory prediction.
Comment: Matches criterion 5: a diffusion-based generative model improvement. The paper applies DDIM sampling to diffusion-based trajectory prediction to dramatically speed up sampling and improve multimodal/uncertainty outputs.
Relevance: 6 Novelty: 4 Back to [topic] [top]
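The 100x speedup comes from DDIM's deterministic (eta = 0) update applied over a strided subset of the timesteps, so a 1000-step diffusion model needs only a handful of network calls per sample. A scalar toy sketch using standard DDIM notation (`alphas_bar` is the cumulative noise schedule); it is not the cVMDx code.

```python
import math

def ddim_sample(eps_model, x_T, alphas_bar, steps):
    """Deterministic DDIM sampling over a strided subset of timesteps (scalar toy)."""
    T = len(alphas_bar) - 1
    timesteps = list(range(T, 0, -max(1, T // steps)))  # e.g. 1000 steps -> 10 calls
    x = x_T
    for i, t in enumerate(timesteps):
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else 0
        a_t, a_prev = alphas_bar[t], alphas_bar[t_prev]
        eps = eps_model(x, t)                                   # noise prediction
        x0 = (x - math.sqrt(1 - a_t) * eps) / math.sqrt(a_t)    # predicted clean sample
        x = math.sqrt(a_prev) * x0 + math.sqrt(1 - a_prev) * eps  # eta = 0 step
    return x
```

Drawing many such cheap samples is what makes fitting a Gaussian mixture to the generated trajectories practical for multimodal uncertainty estimates.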
ArXiv: 2602.21365 [page] [pdf]
Authors: Dominik Schneider, Lalithkumar Seenivasan, Sampath Rapuri, Vishalroshan Anil, Aiza Maksutova, Yiqing Shen, Jan Emily Mangulabnan, Hao Ding, Jose L. Porras, Masaru Ishii, Mathias Unberath
Abstract: arXiv:2602.21365v1 Announce Type: new Abstract: Purpose: Curating large-scale datasets of operating room (OR) workflow, encompassing rare, safety-critical, or atypical events, remains operationally and ethically challenging. This data bottleneck complicates the development of ambient intelligence for detecting, understanding, and mitigating rare or safety-critical events in the OR. Methods: This work presents an OR video diffusion framework that enables controlled synthesis of rare and safety-critical events. The framework integrates a geometric abstraction module, a conditioning module, and a fine-tuned diffusion model to first transform OR scenes into abstract geometric representations, then condition the synthesis process, and finally generate realistic OR event videos. Using this framework, we also curate a synthetic dataset to train and validate AI models for detecting near-misses of sterile-field violations. Results: In synthesizing routine OR events, our method outperforms off-the-shelf video diffusion baselines, achieving lower FVD/LPIPS and higher SSIM/PSNR in both in- and out-of-domain datasets. Through qualitative results, we illustrate its ability for controlled video synthesis of counterfactual events. An AI model trained and validated on the generated synthetic data achieved a RECALL of 70.13% in detecting near safety-critical events. Finally, we conduct an ablation study to quantify performance gains from key design choices. Conclusion: Our solution enables controlled synthesis of routine and rare OR events from abstract geometric representations. Beyond demonstrating its capability to generate rare and safety-critical scenarios, we show its potential to support the development of ambient intelligence models.
Comment: Partially aligns with criterion 5 (video diffusion/modeling improvements) because it proposes a geometric abstraction + conditioning + fine-tuned video diffusion pipeline, but it is primarily an application to operating room (medical) data, a domain-specific application the selection criteria ask to avoid.
Relevance: 4 Novelty: 4 Back to [topic] [top]
ArXiv: 2602.21977 [page] [pdf]
Authors: Liangwei Lyu, Jiaqi Xu, Jianwei Ding, Qiyao Deng
Abstract: arXiv:2602.21977v1 Announce Type: new Abstract: Low-Rank Adaptation (LoRA) has emerged as a leading technique for efficiently fine-tuning text-to-image diffusion models, and its widespread adoption on open-source platforms has fostered a vibrant culture of model sharing and customization. However, the same modular and plug-and-play flexibility that makes LoRA appealing also introduces a broader attack surface. To highlight this risk, we propose Masquerade-LoRA (MasqLoRA), the first systematic attack framework that leverages an independent LoRA module as the attack vehicle to stealthily inject malicious behavior into text-to-image diffusion models. MasqLoRA operates by freezing the base model parameters and updating only the low-rank adapter weights using a small number of "trigger word-target image" pairs. This enables the attacker to train a standalone backdoor LoRA module that embeds a hidden cross-modal mapping: when the module is loaded and a specific textual trigger is provided, the model produces a predefined visual output; otherwise, it behaves indistinguishably from the benign model, ensuring the stealthiness of the attack. Experimental results demonstrate that MasqLoRA can be trained with minimal resource overhead and achieves a high attack success rate of 99.8%. MasqLoRA reveals a severe and unique threat in the AI supply chain, underscoring the urgent need for dedicated defense mechanisms for the LoRA-centric sharing ecosystem.
Comment: Does not match the requested criteria closely. Tangential to criterion 5 only in that it targets text-to-image diffusion models, but this work focuses on stealthy backdoor attacks on LoRA adapters rather than methodological advances in generation or efficiency.
Relevance: 3 Novelty: 5 Back to [topic] [top]
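For context on the attack surface: a LoRA module is just a pair of low-rank matrices that get folded into the frozen base weights at load time, which is why a standalone adapter file can silently change model behavior. A minimal pure-Python sketch of the standard merge rule (not the attack itself); the `alpha`/`rank` scaling follows common LoRA conventions.

```python
def merge_lora(W, A, B, alpha=16, rank=4):
    """Merge a low-rank adapter into base weights: W' = W + (alpha/rank) * B @ A.

    W is d_out x d_in, B is d_out x r, A is r x d_in. Only the tiny A/B
    matrices are shared; the base model file is untouched.
    """
    scale = alpha / rank
    d_out, d_in, r = len(W), len(W[0]), len(A)
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(d_in)] for i in range(d_out)]
```

Because the low-rank update touches every forward pass once merged, a trigger-conditioned behavior trained into A and B rides along invisibly with otherwise benign weights.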
ArXiv: 2602.21499 [page] [pdf]
Authors: Shimin Hu, Yuanyi Wei, Fei Zha, Yudong Guo, Juyong Zhang
Abstract: arXiv:2602.21499v1 Announce Type: new Abstract: Existing 3D editing methods rely on computationally intensive scene-by-scene iterative optimization and suffer from multi-view inconsistency. We propose an effective and fully feedforward 3D editing framework based on the TRELLIS generative backbone, capable of modifying 3D models from a single editing view. Our framework addresses two key issues: adapting training-free 2D editing to structured 3D representations, and overcoming the bottleneck of appearance fidelity in compressed 3D features. To ensure geometric consistency, we introduce Voxel FlowEdit, an edit-driven flow in the sparse voxel latent space that achieves globally consistent 3D deformation in a single pass. To restore high-fidelity details, we develop a normal-guided single to multi-view generation module as an external appearance prior, successfully recovering high-frequency textures. Experiments demonstrate that our method enables fast, globally consistent, and high-fidelity 3D model editing.
Comment: Matches criterion 6: advances 3D generative/editing workflows. Easy3E proposes a feed-forward 3D asset editing pipeline using a TRELLIS generative backbone and a Voxel FlowEdit in sparse voxel latent space for globally consistent single-pass 3D deformations plus a normal-guided module to restore high-frequency appearance — relevant to generative 3D editing and efficient 3D asset manipulation.
Relevance: 7 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.21668 [page] [pdf]
Authors: Junmyeong Lee, Hoseung Choi, Minsu Cho
Abstract: arXiv:2602.21668v1 Announce Type: new Abstract: Forecasting dynamic scenes remains a fundamental challenge in computer vision, as limited observations make it difficult to capture coherent object-level motion and long-term temporal evolution. We present Motion Group-aware Gaussian Forecasting (MoGaF), a framework for long-term scene extrapolation built upon the 4D Gaussian Splatting representation. MoGaF introduces motion-aware Gaussian grouping and group-wise optimization to enforce physically consistent motion across both rigid and non-rigid regions, yielding spatially coherent dynamic representations. Leveraging this structured space-time representation, a lightweight forecasting module predicts future motion, enabling realistic and temporally stable scene evolution. Experiments on synthetic and real-world datasets demonstrate that MoGaF consistently outperforms existing baselines in rendering quality, motion plausibility, and long-term forecasting stability. Our project page is available at https://slime0519.github.io/mogaf
Comment: Direct match to criterion 7. Builds on 4D Gaussian Splatting for space-time forecasting with motion-aware Gaussian grouping and group-wise optimization for dynamic scene reconstruction/forecasting.
Relevance: 9 Novelty: 7 Back to [topic] [top]
ArXiv: 2602.21333 [page] [pdf]
Authors: Yifan Wang, Francesco Pittaluga, Zaid Tasneem, Chenyu You, Manmohan Chandraker, Ziyu Jiang
Abstract: arXiv:2602.21333v1 Announce Type: new Abstract: Controllable driving scene generation is critical for realistic and scalable autonomous driving simulation, yet existing approaches struggle to jointly achieve photorealism and precise control. We introduce HorizonForge, a unified framework that reconstructs scenes as editable Gaussian Splats and Meshes, enabling fine-grained 3D manipulation and language-driven vehicle insertion. Edits are rendered through a noise-aware video diffusion process that enforces spatial and temporal consistency, producing diverse scene variations in a single feed-forward pass without per-trajectory optimization. To standardize evaluation, we further propose HorizonSuite, a comprehensive benchmark spanning ego- and agent-level editing tasks such as trajectory modifications and object manipulation. Extensive experiments show that Gaussian-Mesh representation delivers substantially higher fidelity than alternative 3D representations, and that temporal priors from video diffusion are essential for coherent synthesis. Combining these findings, HorizonForge establishes a simple yet powerful paradigm for photorealistic, controllable driving simulation, achieving an 83.4% user-preference gain and a 25.19% FID improvement over the second best state-of-the-art method. Project page: https://horizonforge.github.io/ .
Comment: Matches criterion 7. Uses Gaussian Splats and meshes for editable 3D scene reconstruction plus a noise-aware video diffusion renderer for temporally-consistent driving-scene edits — directly relevant to Gaussian Splatting / NeRF-style 3D reconstruction and editing.
Relevance: 9 Novelty: 7 Back to [topic] [top]
ArXiv: 2602.21644 [page] [pdf]
Authors: Li Zhang, Yu-An Liu, Xijia Jiang, Conghao Huang, Danyang Li, Yanyong Zhang
Abstract: arXiv:2602.21644v1 Announce Type: new Abstract: Mobile robots and IoT devices demand real-time localization and dense reconstruction under tight compute and energy budgets. While 3D Gaussian Splatting (3DGS) enables efficient dense SLAM, dynamic objects and occlusions still degrade tracking and mapping. Existing dynamic 3DGS-SLAM often relies on heavy optical flow and per-frame segmentation, which is costly for mobile deployment and brittle under challenging illumination. We present DAGS-SLAM, a dynamic-aware 3DGS-SLAM system that maintains a spatiotemporal motion probability (MP) state per Gaussian and triggers semantics on demand via an uncertainty-aware scheduler. DAGS-SLAM fuses lightweight YOLO instance priors with geometric cues to estimate and temporally update MP, propagates MP to the front-end for dynamic-aware correspondence selection, and suppresses dynamic artifacts in the back-end via MP-guided optimization. Experiments on public dynamic RGB-D benchmarks show improved reconstruction and robust tracking while sustaining real-time throughput on a commodity GPU, demonstrating a practical speed-accuracy tradeoff with reduced semantic invocations toward mobile deployment.
Comment: Matches criterion 7: advances in 3D reconstruction using Gaussian Splatting (DAGS‑SLAM is a dynamic-aware 3DGS-SLAM system addressing dynamic objects and occlusions).
Relevance: 9 Novelty: 6 Back to [topic] [top]
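The per-Gaussian motion probability (MP) can be pictured as a small temporal filter: blend the previous MP with the current geometric cue, and fold in a semantic prior only when the scheduler actually invokes it. A hypothetical sketch of the idea; `decay` and the 50/50 semantic blend are illustrative, not the paper's update rule.

```python
def update_motion_prob(prev_mp, geom_cue, sem_prior=None, decay=0.8):
    """Temporally update a per-Gaussian motion probability in [0, 1].

    Exponential blending of the previous state with the current geometric cue;
    the semantic prior is mixed in only when it was computed on demand.
    """
    mp = decay * prev_mp + (1.0 - decay) * geom_cue
    if sem_prior is not None:               # semantics triggered by the scheduler
        mp = 0.5 * (mp + sem_prior)
    return min(1.0, max(0.0, mp))
```

Keeping the state cheap to update per frame, and invoking semantics sparsely, is what gives the speed-accuracy tradeoff the abstract describes.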
ArXiv: 2602.21552 [page] [pdf]
Authors: Changqing Zhou, Yueru Luo, Changhao Chen
Abstract: arXiv:2602.21552v1 Announce Type: new Abstract: Accurate 3D scene understanding is essential for embodied intelligence, with occupancy prediction emerging as a key task for reasoning about both objects and free space. Existing approaches largely rely on depth priors (e.g., DepthAnything) but make only limited use of 3D cues, restricting performance and generalization. Recently, visual geometry models such as VGGT have shown strong capability in providing rich 3D priors, but similar to monocular depth foundation models, they still operate at the level of visible surfaces rather than volumetric interiors, motivating us to explore how to more effectively leverage these increasingly powerful geometry priors for 3D occupancy prediction. We present GPOcc, a framework that leverages generalizable visual geometry priors (GPs) for monocular occupancy prediction. Our method extends surface points inward along camera rays to generate volumetric samples, which are represented as Gaussian primitives for probabilistic occupancy inference. To handle streaming input, we further design a training-free incremental update strategy that fuses per-frame Gaussians into a unified global representation. Experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate significant gains: GPOcc improves mIoU by +9.99 in the monocular setting and +11.79 in the streaming setting over prior state of the art. Under the same depth prior, it achieves +6.73 mIoU while running 2.65$\times$ faster. These results highlight that GPOcc leverages geometry priors more effectively and efficiently. Code will be released at https://github.com/JuIvyy/GPOcc.
Comment: Matches criterion 7. Uses Gaussian primitives (sparse Gaussians) and visual geometry priors to do volumetric/occupancy prediction and streaming fusion — relevant to 3D reconstruction/generation with Gaussian-like representations and multi-view fusion.
Relevance: 7 Novelty: 7 Back to [topic] [top]
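The core geometric trick, extending visible surface points inward along their camera rays to obtain volumetric samples, can be sketched in a few lines. `num` and `max_depth` are illustrative parameters; the paper places Gaussian primitives at such samples rather than using raw points.

```python
def inward_samples(cam_origin, surface_point, num=4, max_depth=0.5):
    """Extend a visible surface point inward along its camera ray.

    Returns candidate points behind the surface at increasing depth offsets,
    a simplified sketch of turning surface priors into volume samples.
    """
    # unit direction from the camera through the surface point
    d = [s - o for s, o in zip(surface_point, cam_origin)]
    norm = sum(c * c for c in d) ** 0.5
    d = [c / norm for c in d]
    return [[s + di * max_depth * (k + 1) / num for s, di in zip(surface_point, d)]
            for k in range(num)]
```

This is why surface-level geometry priors (depth or point maps) can still drive volumetric occupancy inference: the interior is hypothesized along the ray rather than observed.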
ArXiv: 2602.21535 [page] [pdf]
Authors: Beizhen Zhao, Sicheng Yu, Guanzhi Ding, Yu Hu, Hao Wang
Abstract: arXiv:2602.21535v1 Announce Type: new Abstract: 3D scene reconstruction under unposed sparse viewpoints is a highly challenging yet practically important problem, especially in outdoor scenes due to complex lighting and scale variation. With extremely limited input views, directly utilizing a diffusion model to synthesize pseudo frames will introduce unreasonable geometry, which will harm the final reconstruction quality. To address these issues, we propose a novel framework for sparse-view outdoor reconstruction that achieves high-quality results through bidirectional pseudo frame restoration and scene perception Gaussian management. Specifically, we introduce a bidirectional pseudo frame restoration method that restores missing content by diffusion-based synthesis guided by adjacent frames with a lightweight pseudo-view deblur model and confidence mask inference algorithm. Then we propose a scene perception Gaussian management strategy that optimizes Gaussians based on joint depth-density information. These designs significantly enhance reconstruction completeness, suppress floating artifacts and improve overall geometric consistency under extreme view sparsity. Experiments on outdoor benchmarks demonstrate substantial gains over existing methods in both fidelity and stability.
Comment: Matches criterion 7: tackles unposed sparse-view 3D reconstruction using diffusion-synthesized pseudo-frames and Gaussian management (scene-perception Gaussian optimization) — aligns with work on Gaussian Splatting / pose-free sparse-view reconstruction.
Relevance: 9 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.22212 [page] [pdf]
Authors: Julian Kaltheuner, Hannah Dröge, Markus Plack, Patrick Stotko, Reinhard Klein
Abstract: arXiv:2602.22212v1 Announce Type: new Abstract: Temporally consistent surface reconstruction of dynamic 3D objects from unstructured point cloud data remains challenging, especially for very long sequences. Existing methods either optimize deformations incrementally, risking drift and requiring long runtimes, or rely on complex learned models that demand category-specific training. We present Neu-PiG, a fast deformation optimization method based on a novel preconditioned latent-grid encoding that distributes spatial features parameterized on the position and normal direction of a keyframe surface. Our method encodes entire deformations across all time steps at various spatial scales into a multi-resolution latent grid, parameterized by the position and normal direction of a reference surface from a single keyframe. This latent representation is then augmented for time modulation and decoded into per-frame 6-DoF deformations via a lightweight multilayer perceptron (MLP). To achieve high-fidelity, drift-free surface reconstructions in seconds, we employ Sobolev preconditioning during gradient-based training of the latent space, completely avoiding the need for any explicit correspondences or further priors. Experiments across diverse human and animal datasets demonstrate that Neu-PiG outperforms state-of-the-art approaches, offering both superior accuracy and scalability to long sequences while running at least 60x faster than existing training-free methods and achieving inference speeds on the same order as heavy pretrained models.
Comment: Matches criterion 7. Proposes Neu-PiG, a preconditioned latent-grid encoding and Sobolev preconditioning for fast, temporally-consistent dynamic surface reconstruction from unstructured point clouds — relevant to 3D reconstruction/generation and long-sequence dynamic scenes.
Relevance: 7 Novelty: 7 Back to [topic] [top]
ArXiv: 2602.21645 [page] [pdf]
Authors: Weidong Qiao, Wangmeng Zuo, Hui Li
Abstract: arXiv:2602.21645v1 Announce Type: new Abstract: Modeling 4D scenes requires capturing both spatial structure and temporal motion, which is challenging due to the need for physically consistent representations of complex rigid and non-rigid motions. Existing approaches mainly rely on translational displacements, which struggle to represent rotations and articulated transformations, often leading to spatial inconsistency and physically implausible motion. We present LieFlow, a dynamic radiance representation framework that explicitly models motion within the SE(3) Lie group, enabling coherent learning of translation and rotation in a unified geometric space. The SE(3) transformation field enforces physically inspired constraints to maintain motion continuity and geometric consistency. The evaluation includes a synthetic dataset with rigid-body trajectories and two real-world datasets capturing complex motion under natural lighting and occlusions. Across all datasets, LieFlow consistently improves view-synthesis fidelity, temporal coherence, and physical realism over NeRF-based baselines. These results confirm that SE(3)-based motion modeling offers a robust and physically grounded framework for representing dynamic 4D scenes.
Comment: Matches criterion 7: advances in 3D reconstruction/NeRF-style dynamic scene modeling — proposes an SE(3) Lie-group-based dynamic radiance field (LieFlow) to model coherent translation+rotation and improve NeRF-like view synthesis and temporal coherence.
Relevance: 8 Novelty: 6 Back to [topic] [top]
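The contrast with displacement fields can be made concrete: an SE(3) motion moves a point by a rotation plus a translation, which a purely translational field cannot represent for spinning or articulated parts. A minimal sketch using Rodrigues' rotation formula (standard math, not LieFlow's parameterization):

```python
import math

def se3_apply(axis, angle, translation, point):
    """Apply a rigid motion: rotate `point` about unit `axis` by `angle`, then translate.

    Rodrigues' formula: R v = v cos(t) + (k x v) sin(t) + k (k . v)(1 - cos(t)).
    """
    k, v = axis, point
    cross = [k[1] * v[2] - k[2] * v[1],   # k x v
             k[2] * v[0] - k[0] * v[2],
             k[0] * v[1] - k[1] * v[0]]
    dot = sum(ki * vi for ki, vi in zip(k, v))
    c, s = math.cos(angle), math.sin(angle)
    rotated = [v[i] * c + cross[i] * s + k[i] * dot * (1 - c) for i in range(3)]
    return [r + t for r, t in zip(rotated, translation)]
```

A per-point 6-DoF transform like this keeps nearby points on a rigid part moving consistently, which is the physical-plausibility argument the abstract makes against raw displacements.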
ArXiv: 2602.22209 [page] [pdf]
Authors: Yufei Ye, Jiaman Li, Ryan Rong, C. Karen Liu
Abstract: arXiv:2602.22209v1 Announce Type: new Abstract: Egocentric manipulation videos are highly challenging due to severe occlusions during interactions and frequent object entries and exits from the camera view as the person moves. Current methods typically focus on recovering either hand or object pose in isolation, but both struggle during interactions and fail to handle out-of-sight cases. Moreover, their independent predictions often lead to inconsistent hand-object relations. We introduce WHOLE, a method that holistically reconstructs hand and object motion in world space from egocentric videos given object templates. Our key insight is to learn a generative prior over hand-object motion to jointly reason about their interactions. At test time, the pretrained prior is guided to generate trajectories that conform to the video observations. This joint generative reconstruction substantially outperforms approaches that process hands and objects separately followed by post-processing. WHOLE achieves state-of-the-art performance on hand motion estimation, 6D object pose estimation, and their relative interaction reconstruction. Project website: https://judyye.github.io/whole-www
Comment: Closely matches criterion 7. WHOLE performs joint 3D/world-space reconstruction of hand+object trajectories from egocentric videos (simultaneous hand/object pose and interaction reconstruction), which falls under 3D reconstruction/recovery from video.
Relevance: 7 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.21963 [page] [pdf]
Authors: Tong Wei, Giorgos Tolias, Jiri Matas, Daniel Barath
Abstract: arXiv:2602.21963v1 Announce Type: new Abstract: The pose graph is a core component of Structure-from-Motion (SfM), where images act as nodes and edges encode relative poses. Since geometric verification is expensive, SfM pipelines restrict the pose graph to a sparse set of candidate edges, making initialization critical. Existing methods rely on image retrieval to connect each image to its $k$ nearest neighbors, treating pairs independently and ignoring global consistency. We address this limitation through the concept of edge prioritization, ranking candidate edges by their utility for SfM. Our approach has three components: (1) a GNN trained with SfM-derived supervision to predict globally consistent edge reliability; (2) multi-minimal-spanning-tree-based pose graph construction guided by these ranks; and (3) connectivity-aware score modulation that reinforces weak regions and reduces graph diameter. This globally informed initialization yields more reliable and compact pose graphs, improving reconstruction accuracy in sparse and high-speed settings and outperforming SOTA retrieval methods on ambiguous scenes. The code and trained models are available at https://github.com/weitong8591/global_edge_prior.
Comment: Matches criterion 7. Improves pose-graph initialization for SfM using a GNN to rank candidate edges and global-aware graph construction, which is directly relevant to multi-view 3D reconstruction and camera-pose estimation pipelines.
Relevance: 7 Novelty: 6 Back to [topic] [top]
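The spanning-tree step is essentially Kruskal's algorithm run over edges sorted by predicted reliability rather than retrieval similarity. A bare-bones sketch (greedy selection with union-find); the multi-tree construction and score modulation from the paper are omitted.

```python
def mst_edges(num_nodes, scored_edges):
    """Build a spanning tree from (score, u, v) edges, most reliable first."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    chosen = []
    for score, u, v in sorted(scored_edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:                        # skip edges that would form a cycle
            parent[ru] = rv
            chosen.append((u, v))
    return chosen
```

Prioritizing high-reliability edges this way yields a connected backbone before any expensive geometric verification is spent on low-utility pairs.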
ArXiv: 2602.21341 [page] [pdf]
Authors: Evan Kim, Hyunwoo Ryu, Thomas W. Mitchel, Vincent Sitzmann
Abstract: arXiv:2602.21341v1 Announce Type: new Abstract: Geometry-free view synthesis transformers have recently achieved state-of-the-art performance in Novel View Synthesis (NVS), outperforming traditional approaches that rely on explicit geometry modeling. Yet the factors governing their scaling with compute remain unclear. We present a systematic study of scaling laws for view synthesis transformers and derive design principles for training compute-optimal NVS models. Contrary to prior findings, we show that encoder-decoder architectures can be compute-optimal; we trace earlier negative results to suboptimal architectural choices and comparisons across unequal training compute budgets. Across several compute levels, we demonstrate that our encoder-decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance-compute Pareto frontier, and surpasses the previous state-of-the-art on real-world NVS benchmarks with substantially reduced training compute.
Comment: Matches criterion 7: studies novel-view-synthesis (geometry-free view synthesis transformers) and derives scaling/design principles for NVS models (relevant to 3D reconstruction/generation lines of work).
Relevance: 7 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.21780 [page] [pdf]
Authors: Zunhai Su, Weihao Ye, Hansen Feng, Keyu Fan, Jing Zhang, Dahai Yu, Zhengwu Liu, Ngai Wong
Abstract: arXiv:2602.21780v1 Announce Type: new Abstract: Learning-based 3D visual geometry models have significantly advanced with the advent of large-scale transformers. Among these, StreamVGGT leverages frame-wise causal attention to deliver robust and efficient streaming 3D reconstruction. However, it suffers from unbounded growth in the Key-Value (KV) cache due to the massive influx of vision tokens from multi-image and long-video inputs, leading to increased memory consumption and inference latency as input frames accumulate. This ultimately limits its scalability for long-horizon applications. To address this gap, we propose XStreamVGGT, a tuning-free approach that seamlessly integrates pruning and quantization to systematically compress the KV cache, enabling extremely memory-efficient streaming inference. Specifically, redundant KVs generated from multi-frame inputs are initially pruned to conform to a fixed KV memory budget using an efficient token-importance identification mechanism that maintains full compatibility with high-performance attention kernels (e.g., FlashAttention). Additionally, leveraging the inherent distribution patterns of KV tensors, we apply dimension-adaptive KV quantization within the pruning pipeline to further minimize memory overhead while preserving numerical accuracy. Extensive evaluations show that XStreamVGGT achieves mostly negligible performance degradation while substantially reducing memory usage by 4.42$\times$ and accelerating inference by 5.48$\times$, enabling practical and scalable streaming 3D applications. The code is available at https://github.com/ywh187/XStreamVGGT/.
Comment: Matches criterion 7 — focuses on streaming 3D visual-geometry reconstruction with transformer KV-cache compression (pruning + quantization) to enable long-horizon/streaming 3D reconstruction.
Relevance: 7 Novelty: 6 Back to [topic] [top]
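The budgeted KV-cache idea in the XStreamVGGT abstract can be sketched in a few lines. This is illustrative only: the importance score here (a generic per-token score array) and the symmetric per-tensor int8 scheme are assumptions, not the paper's token-importance mechanism or its dimension-adaptive quantization.

```python
import numpy as np

def prune_and_quantize_kv(keys, values, scores, budget):
    """Keep the `budget` highest-scoring KV pairs, then quantize to int8.

    keys, values: (n_tokens, d) float arrays; scores: (n_tokens,) importance
    values (e.g. accumulated attention mass -- an illustrative choice).
    Returns int8 tensors plus per-tensor scales for dequantization.
    """
    keep = np.argsort(scores)[-budget:]          # token-importance pruning
    keep.sort()                                  # preserve temporal order
    k, v = keys[keep], values[keep]

    def quantize(x):                             # symmetric per-tensor int8
        scale = max(np.abs(x).max() / 127.0, 1e-12)
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    (qk, sk), (qv, sv) = quantize(k), quantize(v)
    return (qk, sk), (qv, sv), keep
```

The fixed budget bounds cache memory regardless of how many frames stream in; dequantization is just `qk * sk`.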
ArXiv: 2602.21929 [page] [pdf]
Authors: JiaKui Hu, Jialun Liu, Liying Yang, Xinliang Zhang, Kaiwen Li, Shuang Zeng, Yuanwei Li, Haibin Huang, Chi Zhang, Yanye Lu
Abstract: arXiv:2602.21929v1 Announce Type: new Abstract: Scene-consistent video generation aims to create videos that explore 3D scenes based on a camera trajectory. Previous methods rely on video generation models with external memory for consistency, or iterative 3D reconstruction and inpainting, which accumulate errors during inference due to incorrect intermediary outputs, non-differentiable processes, and separate models. To overcome these limitations, we introduce "geometry-as-context". It iteratively completes the following steps using an autoregressive camera-controlled video generation model: (1) estimates the geometry of the current view necessary for 3D reconstruction, and (2) simulates and restores novel view images rendered by the 3D scene. Under this multi-task framework, we develop the camera gated attention module to enhance the model's capability to effectively leverage camera poses. During the training phase, text contexts are utilized to ascertain whether geometric or RGB images should be generated. To ensure that the model can generate RGB-only outputs during inference, the geometry context is randomly dropped from the interleaved text-image-geometry training sequence. The method has been tested on scene video generation with one-direction and back-and-forth trajectories. The results show its superiority over previous approaches in maintaining scene consistency and camera control.
Comment: Matches criterion 7. Proposes geometry-as-context for scene-consistent camera-controlled video generation by iteratively estimating geometry and rendering novel views — a 3D-aware video generation approach tied to pose and geometry estimation.
Relevance: 6 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.21967 [page] [pdf]
Authors: Xiangqi Meng, Pengxu Hou, Zhenjun Zhao, Javier Civera, Daniel Cremers, Hesheng Wang, Haoang Li
Abstract: arXiv:2602.21967v1 Announce Type: new Abstract: In addition to the core tasks of simultaneous localization and mapping (SLAM), active SLAM additionally involves generating robot actions that enable effective and efficient exploration of unknown environments. However, existing active SLAM pipelines are limited by three main factors. First, they inherit the restrictions of the underlying SLAM modules that they may be using. Second, their motion planning strategies are typically shortsighted and lack long-term vision. Third, most approaches struggle to handle dynamic scenes. To address these limitations, we propose a novel monocular active SLAM method, Dream-SLAM, which is based on dreaming cross-spatio-temporal images and semantically plausible structures of partially observed dynamic environments. The generated cross-spatio-temporal images are fused with real observations to mitigate noise and data incompleteness, leading to more accurate camera pose estimation and a more coherent 3D scene representation. Furthermore, we integrate dreamed and observed scene structures to enable long-horizon planning, producing farsighted trajectories that promote efficient and thorough exploration. Extensive experiments on both public and self-collected datasets demonstrate that Dream-SLAM outperforms state-of-the-art methods in localization accuracy, mapping quality, and exploration efficiency. Source code will be publicly available upon paper acceptance.
Comment: Not a direct match to a numbered criterion but highly relevant to your interests in robotics+computer vision: an active SLAM method (Dream-SLAM) that 'dreams' cross-spatio-temporal images and semantic scene structure to improve localization, mapping, and long-horizon planning in dynamic scenes.
Relevance: 7 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.22006 [page] [pdf]
Authors: Jiadong Lu, Zhehan Li, Tao Han, Miao Xu, Chao Xu, Yanjun Cao
Abstract: arXiv:2602.22006v1 Announce Type: new Abstract: Accurate relative localization is critical for multi-robot cooperation. In robot swarms, measurements from different robots arrive asynchronously and with clock time-offsets. Although Continuous-Time (CT) formulations have proved effective for handling asynchronous measurements in single-robot SLAM and calibration, extending CT methods to multi-robot settings faces great challenges to achieve high-accuracy, low-latency, and high-frequency performance. Especially, existing CT methods suffer from the inherent query-time delay of unclamped B-splines and high computational cost. This paper proposes CT-RIO, a novel Continuous-Time Relative-Inertial Odometry framework. We employ Clamped Non-Uniform B-splines (C-NUBS) to represent robot states for the first time, eliminating the query-time delay. We further augment C-NUBS with closed-form extension and shrinkage operations that preserve the spline shape, making it suitable for online estimation and enabling flexible knot management. This flexibility leads to the concept of knot-keyknot strategy, which supports spline extension at high-frequency while retaining sparse keyknots for adaptive relative-motion modeling. We then formulate a sliding-window relative localization problem that operates purely on relative kinematics and inter-robot constraints. To meet the demanding computation required at swarm scale, we decompose the tightly-coupled optimization into robot-wise sub-problems and solve them in parallel using incremental asynchronous block coordinate descent. Extensive experiments show that CT-RIO converges from time-offsets as large as 263 ms to sub-millisecond within 3 s, and achieves RMSEs of 0.046 m and 1.8°. It consistently outperforms state-of-the-art methods, with improvements of up to 60% under high-speed motion.
Comment: No close match to the listed criteria (addresses continuous-time multi-robot relative localization with clamped non-uniform B-splines; strong robotics work, but not using VLMs for RL/IL or the other specified criteria).
Relevance: 3 Novelty: 7 Back to [topic] [top]
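The clamping property CT-RIO exploits is standard B-spline machinery and easy to demonstrate. The sketch below is plain Cox-de Boor evaluation with a clamped (endpoint-repeated) knot vector, not the paper's C-NUBS implementation: a clamped spline interpolates its last control point at the final knot, so the newest state can be queried immediately instead of waiting degree-many knots into the future (the "query-time delay" of unclamped splines).

```python
import numpy as np

def bspline_basis(i, p, t, knots):
    """Cox-de Boor recursion: i-th degree-p basis function at parameter t."""
    if p == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + p] > knots[i]:
        left = (t - knots[i]) / (knots[i + p] - knots[i]) \
            * bspline_basis(i, p - 1, t, knots)
    if knots[i + p + 1] > knots[i + 1]:
        right = (knots[i + p + 1] - t) / (knots[i + p + 1] - knots[i + 1]) \
            * bspline_basis(i + 1, p - 1, t, knots)
    return left + right

def eval_clamped(ctrl, t, p=3):
    """Evaluate a clamped degree-p B-spline with uniform interior knots."""
    n = len(ctrl)
    interior = np.linspace(0.0, 1.0, n - p + 1)
    knots = np.concatenate([[0.0] * p, interior, [1.0] * p])  # endpoints repeated
    t = min(t, 1.0 - 1e-12)   # basis support is half-open; handle t == 1
    return sum(bspline_basis(i, p, t, knots) * ctrl[i] for i in range(n))
```

With control values 1..5, the curve evaluates to 1 at t=0 and 5 at t=1, i.e. the boundary state is directly queryable.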
ArXiv: 2602.22013 [page] [pdf]
Authors: I-Hsiang Chen, Yu-Wei Liu, Tse-Yu Wu, Yu-Chien Chiang, Jen-Chien Yang, Wei-Ting Chen
Abstract: arXiv:2602.22013v1 Announce Type: new Abstract: Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. However, existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders, leading to errors in both retrieval and generation stages. To address this limitation, we introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization. RobustVisRAG uses a non-causal path to capture degradation signals through unidirectional attention and a causal path to learn purified semantics guided by these signals. Together with the proposed Non-Causal Distortion Modeling and Causal Semantic Alignment objectives, the framework enforces a clear separation between semantics and degradations, enabling stable retrieval and generation under challenging visual conditions. To evaluate robustness under realistic conditions, we introduce the Distortion-VisRAG dataset, a large-scale benchmark containing both synthetic and real-world degraded documents across seven domains, with 12 synthetic and 5 real distortion types that comprehensively reflect practical visual degradations. Experimental results show that RobustVisRAG improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40%, respectively, on real-world degradations, while maintaining comparable accuracy on clean inputs.
Comment: Does not match the robotics+VLM criterion (2). The paper is about robustness of Vision-based Retrieval-Augmented Generation using VLMs under visual degradations (useful for VLM applications generally) but it does not apply VLMs to robotics tasks.
Relevance: 3 Novelty: 6 Back to [topic] [top]
ArXiv: 2602.21316 [page] [pdf]
Authors: Milad Azizkhani, Yue Chen
Abstract: arXiv:2602.21316v1 Announce Type: new Abstract: Soft robots were introduced in large part to enable safe, adaptive interaction with the environment, and this interaction relies fundamentally on contact. However, modeling and planning contact-rich interactions for soft robots remain challenging: dense contact candidates along the body create redundant constraints and rank-deficient LCPs, while the disparity between high stiffness and low friction introduces severe ill-conditioning. Existing approaches rely on problem-specific approximations or penalty-based treatments. This letter presents a unified complementarity-based framework for soft-robot contact modeling and planning that brings contact modeling, manipulation, and planning into a unified, physically consistent formulation. We develop a robust Linear Complementarity Problem (LCP) model tailored to discretized soft robots and address these challenges with a three-stage conditioning pipeline: inertial rank selection to remove redundant contacts, Ruiz equilibration to correct scale disparity and ill-conditioning, and lightweight Tikhonov regularization on normal blocks. Building on the same formulation, we introduce a kinematically guided warm-start strategy that enables dynamic trajectory optimization through contact using Mathematical Programs with Complementarity Constraints (MPCC) and demonstrate its effectiveness on contact-rich ball manipulation tasks. In conclusion, CUSP provides a new foundation for unifying contact modeling, simulation, and planning in soft robotics.
Comment: No matching criteria — unified complementarity-based contact modeling and planning for soft robots; strong robotics relevance but does not use VLMs, SSL, or cross-modal transfer as requested by the criteria.
Relevance: 3 Novelty: 6 Back to [topic] [top]
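One concrete piece of the paper's three-stage conditioning pipeline, Ruiz equilibration, is a standard scaling technique and can be sketched directly; the inertial rank-selection and Tikhonov stages are not reproduced here, and this symmetric variant is an assumption about how it is applied to their LCP matrices.

```python
import numpy as np

def ruiz_equilibrate(M, iters=30):
    """Symmetric Ruiz equilibration: repeatedly scale rows/columns of M by
    the inverse square root of their infinity norms, driving all row norms
    toward 1 and curing scale disparity. Returns (D, Ms) with Ms = D @ M @ D
    for symmetric M, where D is diagonal."""
    n = M.shape[0]
    d = np.ones(n)
    Ms = M.astype(float).copy()
    for _ in range(iters):
        norms = np.max(np.abs(Ms), axis=1)
        norms[norms == 0] = 1.0                  # leave empty rows untouched
        s = 1.0 / np.sqrt(norms)
        Ms = Ms * s[:, None] * s[None, :]        # symmetric scaling step
        d *= s
    return np.diag(d), Ms
```

After equilibration one solves the scaled system and maps the solution back through D, which is what makes the subsequent LCP solve far better conditioned.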
ArXiv: 2602.21467 [page] [pdf]
Authors: William Youngwoo Chung, Calvin Yeung, Hansen Jin Lillemark, Zhuowen Zou, Xiangjian Liu, Mohsen Imani
Abstract: arXiv:2602.21467v1 Announce Type: new Abstract: A key challenge in artificial intelligence and neuroscience is understanding how neural systems learn representations that capture the underlying dynamics of the world. Most world models represent the transition function with unstructured neural networks, limiting interpretability, sample efficiency, and generalization to unseen states or action compositions. We address these issues with a generalizable world model grounded in Vector Symbolic Architecture (VSA) principles as geometric priors. Our approach utilizes learnable Fourier Holographic Reduced Representation (FHRR) encoders to map states and actions into a high dimensional complex vector space with learned group structure and models transitions with element-wise complex multiplication. We formalize the framework's group theoretic foundation and show how training such structured representations to be approximately invariant enables strong multi-step composition directly in latent space and generalization performances over various experiments. On a discrete grid world environment, our model achieves 87.5% zero shot accuracy to unseen state-action pairs, obtains 53.6% higher accuracy on 20-timestep horizon rollouts, and demonstrates 4x higher robustness to noise relative to an MLP baseline. These results highlight how training to have latent group structure yields generalizable, data-efficient, and interpretable world models, providing a principled pathway toward structured models for real-world planning and reasoning.
Comment: No close match to the numbered criteria. Focuses on structured world models via Vector Symbolic Architecture and learnable FHRR encoders for compositional latent transitions — of interest for representation learning and model-based RL but not self-supervised image/video SSL, VLMs in robotics, or the other listed criteria.
Relevance: 3 Novelty: 6 Back to [topic] [top]
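The FHRR algebra the abstract builds on is simple to show in toy form: vectors are unit-modulus complex phasors, binding is element-wise complex multiplication, and unbinding is exact via the conjugate. The learned encoders and group structure are the paper's contribution and are not reproduced; this is just the underlying vector-symbolic primitive.

```python
import numpy as np

def random_phasor(d, rng):
    """FHRR hypervector: unit-modulus complex phasors with random phases."""
    return np.exp(1j * rng.uniform(-np.pi, np.pi, d))

def bind(a, b):
    """Binding = element-wise complex multiplication (adds phases)."""
    return a * b

def unbind(c, b):
    """Exact inverse of bind for unit-modulus vectors: multiply by conjugate."""
    return c * np.conj(b)

def sim(a, b):
    """Cosine-like similarity in [-1, 1]."""
    return float(np.real(np.mean(a * np.conj(b))))
```

Because element-wise multiplication is associative, multi-step transitions compose directly in latent space: applying actions a1 then a2 equals applying their bound composite, which is the compositionality property the paper leverages.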
ArXiv: 2602.21965 [page] [pdf]
Authors: Joseph Margaryan, Thomas Hamelryck
Abstract: arXiv:2602.21965v1 Announce Type: new Abstract: Critical applications in areas such as medicine, robotics and autonomous systems require compact (i.e., memory efficient), uncertainty-aware neural networks suitable for edge and other resource-constrained deployments. We study compact spectral circulant and block-circulant-with-circulant-blocks (BCCB) layers: FFT-diagonalizable circular convolutions whose weights live directly in the real FFT (RFFT) half (1D) or half-plane (2D). Parameterizing filters in the frequency domain lets us impose simple spectral structure, perform structured variational inference in a low-dimensional weight space, and calculate exact layer spectral norms, enabling inexpensive global Lipschitz bounds and margin-based robustness diagnostics. By placing independent complex Gaussians on the Hermitian support we obtain a discrete instance of the spectral representation of stationary kernels, inducing an exact stationary Gaussian-process prior over filters on the discrete circle/torus. We exploit this to define a practical spectral prior and a Hermitian-aware low-rank-plus-diagonal variational posterior in real coordinates. Empirically, spectral circulant/BCCB layers are effective compact building blocks in both (variational) Bayesian and point estimate regimes: compact Bayesian neural networks on MNIST->Fashion-MNIST, variational heads on frozen CIFAR-10 features, and deterministic ViT projections on CIFAR-10/Tiny ImageNet; spectral layers match strong baselines while using substantially fewer parameters and with tighter Lipschitz certificates.
Comment: No listed criterion matched closely — focuses on compact spectral circulant/BCCB layers and spectral priors for compact/uncertainty-aware networks (useful for resource-constrained robotics) but not on SSL, VLMs in robotics, video segmentation, cross-modal video transfer, generative diffusion/3D advances, or NeRF-style reconstruction.
Relevance: 3 Novelty: 6 Back to [topic] [top]
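The core 1D primitive this abstract describes can be sketched directly: parameterize a circular convolution by its RFFT-half weights, apply it by frequency-domain multiplication, and read off the exact layer spectral norm as the largest frequency-response magnitude. The Bayesian prior/posterior machinery is omitted; this is only the deterministic spectral layer.

```python
import numpy as np

def circulant_apply(w_hat, x):
    """Circular convolution whose weights live in the RFFT half:
    y = irfft(rfft(x) * w_hat). Stores n//2+1 complex entries instead of
    the n^2 weights of a dense layer."""
    n = x.shape[-1]
    return np.fft.irfft(np.fft.rfft(x) * w_hat, n=n)

def spectral_norm(w_hat):
    """Exact operator norm of the circulant layer: the eigenvalues are the
    frequency responses, so the spectral norm is their max magnitude."""
    return float(np.abs(w_hat).max())
```

The exact norm makes global Lipschitz bounds cheap: a product of per-layer `spectral_norm` values certifies the whole network, with no power iteration needed.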
ArXiv: 2602.21599 [page] [pdf]
Authors: Weisheng Xu, Qiwei Wu, Jiaxi Zhang, Tan Jing, Yangfan Li, Yuetong Fang, Jiaqi Xiong, Kai Wu, Rong Ou, Renjing Xu
Abstract: arXiv:2602.21599v1 Announce Type: new Abstract: Physics-based humanoid control relies on training with motion datasets that have diverse data distributions. However, the fixed difficulty distribution of datasets limits the performance ceiling of the trained control policies. Additionally, the method of acquiring high-quality data through professional motion capture systems is constrained by costs, making it difficult to achieve large-scale scalability. To address these issues, we propose a closed-loop automated motion data generation and iterative framework. It can generate high-quality motion data with rich action semantics, including martial arts, dance, combat, sports, gymnastics, and more. Furthermore, our framework enables difficulty iteration of policies and data through physical metrics and objective evaluations, allowing the trained tracker to break through its original difficulty limits. On the PHC single-primitive tracker, using only approximately 1/10 of the AMASS dataset size, the average failure rate on the test set (2201 clips) is reduced by 45% compared to the baseline. Finally, we conduct comprehensive ablation and comparative experiments to highlight the rationality and advantages of our framework.
Comment: No match to any of the requested criteria (robotics/humanoid control and automated motion-data generation but not using VLMs, SSL, or the other specified topics).
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.21273 [page] [pdf]
Authors: Jinghao Hu, Yuhe Zhang, GuoHua Geng, Kang Li, Han Zhang
Abstract: arXiv:2602.21273v1 Announce Type: new Abstract: Generating multi-frame, action-rich visual narratives without fine-tuning faces a threefold tension: action text faithfulness, subject identity fidelity, and cross-frame background continuity. We propose StoryTailor, a zero-shot pipeline that runs on a single RTX 4090 (24 GB) and produces temporally coherent, identity-preserving image sequences from a long narrative prompt, per-subject references, and grounding boxes. Three synergistic modules drive the system: Gaussian-Centered Attention (GCA) to dynamically focus on each subject core and ease grounding-box overlaps; Action-Boost Singular Value Reweighting (AB-SVR) to amplify action-related directions in the text embedding space; and Selective Forgetting Cache (SFC) that retains transferable background cues, forgets nonessential history, and selectively surfaces retained cues to build cross-scene semantic ties. Compared with baseline methods, experiments show that CLIP-T improves by up to 10-15%, with DreamSim lower than strong baselines, while CLIP-I stays in a visually acceptable, competitive range. With matched resolution and steps on a 24 GB GPU, inference is faster than FluxKontext. Qualitatively, StoryTailor delivers expressive interactions and evolving yet stable scenes.
Comment: No close match to the requested criteria. A zero-shot pipeline for multi-frame visual narratives / image-sequence generation; not explicitly about text-to-video diffusion improvements, SSL, cross-modal transfer for dense video tasks, or VLMs in robotics.
Relevance: 4 Novelty: 4 Back to [topic] [top]
ArXiv: 2602.21612 [page] [pdf]
Authors: Xuanqi Zeng, Lingwei Zhang, Linzhu Yue, Zhitao Song, Hongbo Zhang, Tianlin Zhang, Yun-Hui Liu
Abstract: arXiv:2602.21612v1 Announce Type: new Abstract: Quadrupedal wheeled-legged robots combine the advantages of legged and wheeled locomotion to achieve superior mobility, but executing dynamic jumps remains a significant challenge due to the additional degrees of freedom introduced by wheeled legs. This paper develops a mini-sized wheeled-legged robot for agile motion and presents a novel motion control framework that integrates the Nonlinear Model Predictive Control (NMPC) for locomotion and the Differential Evolution (DE) based trajectory optimization for jumping in quadrupedal wheeled-legged robots. The proposed controller utilizes wheel motion and locomotion to enhance jumping performance, achieving versatile maneuvers such as vertical jumping, forward jumping, and backflips. Extensive simulations and real-world experiments validate the effectiveness of the framework, demonstrating a forward jump over a 0.12 m obstacle and a vertical jump reaching 0.5 m.
Comment: No close match to the specified criteria; this is a control/locomotion paper (NMPC + DE) for a quadrupedal wheeled-legged robot — of interest for robotics but not using VLMs, SSL, or the other listed topics.
Relevance: 4 Novelty: 4 Back to [topic] [top]
ArXiv: 2602.21473 [page] [pdf]
Authors: Somayeh Hussaini, Tobias Fischer, Michael Milford
Abstract: arXiv:2602.21473v1 Announce Type: new Abstract: A key challenge in translating Visual Place Recognition (VPR) from the lab to long-term deployment is ensuring a priori that a system can meet user-specified performance requirements across different parts of an environment, rather than just on average globally. A critical mechanism for controlling local VPR performance is the density of the reference mapping database, yet this factor is largely neglected in existing work, where benchmark datasets with fixed, engineering-driven (sensors, storage, GPS frequency) sampling densities are typically used. In this paper, we propose a dynamic VPR mapping approach that uses pairs of reference traverses from the target environment to automatically select an appropriate map density to satisfy two user-defined requirements: (1) a target Local Recall@1 level, and (2) the proportion of the operational environment over which this requirement must be met or exceeded, which we term the Recall Achievement Rate (RAR). Our approach is based on the hypothesis that match patterns between multiple reference traverses, evaluated across different map densities, can be modelled to predict the density required to meet these performance targets on unseen deployment data. Through extensive experiments across multiple VPR methods and the Nordland and Oxford RobotCar benchmarks, we show that our system consistently achieves or exceeds the specified local recall level over at least the user-specified proportion of the environment. Comparisons with alternative baselines demonstrate that our approach reliably selects the correct operating point in map density, avoiding unnecessary over-densification. Finally, ablation studies and analysis evaluate sensitivity to reference map choice and local space definitions, and reveal that conventional global Recall@1 is a poor predictor of the often more operationally meaningful RAR metric.
Comment: No close match to the specified criteria; this is an applied visual place recognition (VPR) paper about automatic map-density selection for deployment rather than SSL, VLMs in robotics, video segmentation, cross-modal transfer, diffusion/3D generation, or NeRF/Gaussian splatting.
Relevance: 4 Novelty: 4 Back to [topic] [top]
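The RAR metric and the density-selection rule, as described in the abstract, reduce to simple computations; the helper names below are illustrative, and the paper's actual contribution (predicting RAR for unseen deployment data from reference-traverse match patterns) is not modeled here.

```python
import numpy as np

def recall_achievement_rate(local_recalls, target_recall):
    """RAR: fraction of local regions whose Local Recall@1 meets the target."""
    local_recalls = np.asarray(local_recalls, dtype=float)
    return float((local_recalls >= target_recall).mean())

def smallest_density_meeting(densities, rar_by_density, target_rar):
    """Pick the smallest map density whose (predicted) RAR meets the user's
    target, avoiding unnecessary over-densification. Returns None if no
    candidate density suffices."""
    ok = [d for d in sorted(densities) if rar_by_density[d] >= target_rar]
    return ok[0] if ok else None
```

This also makes the paper's closing point concrete: a high global Recall@1 can coexist with a low RAR if failures concentrate in a few regions.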
ArXiv: 2602.21331 [page] [pdf]
Authors: Nelson Chen, William R. Johnson III, Rebecca Kramer-Bottiglio, Kostas Bekris, Mridul Aanjaneya
Abstract: arXiv:2602.21331v1 Announce Type: new Abstract: General-purpose simulators have accelerated the development of robots. Traditional simulators based on first-principles, however, typically require full-state observability or depend on parameter search for system identification. This work presents CableRobotGraphSim, a novel Graph Neural Network (GNN) model for cable-driven robots that aims to address shortcomings of prior simulation solutions. By representing cable-driven robots as graphs, with the rigid-bodies as nodes and the cables and contacts as edges, this model can quickly and accurately match the properties of other simulation models and real robots, while ingesting only partially observable inputs. Accompanying the GNN model is a sim-and-real co-training procedure that promotes generalization and robustness to noisy real data. This model is further integrated with a Model Predictive Path Integral (MPPI) controller for closed-loop navigation, which showcases the model's speed and accuracy.
Comment: No match to any criterion. This proposes a GNN simulator for cable-driven robots with sim-and-real co-training and control integration — relevant to robotics but not to VLMs, SSL for vision, or the other listed topics.
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.22142 [page] [pdf]
Authors: Yulin Zhang, Cheng Shi, Sibei Yang
Abstract: arXiv:2602.22142v1 Announce Type: new Abstract: Recent advances in Multimodal Large Language Models have greatly improved visual understanding and reasoning, yet their quadratic attention and offline training protocols make them ill-suited for streaming settings where frames arrive sequentially and future observations are inaccessible. We diagnose a core limitation of current Video-LLMs, namely Time-Agnosticism, in which videos are treated as an unordered bag of evidence rather than a causally ordered sequence, yielding two failures in streams: temporal order ambiguity, in which the model cannot follow or reason over the correct chronological order, and past-current focus blindness, where it fails to distinguish present observations from accumulated history. We present WeaveTime, a simple, efficient, and model-agnostic framework that first teaches order and then uses order. We introduce a lightweight Temporal Reconstruction objective, our Streaming Order Perception enhancement, that instills order-aware representations with minimal finetuning and no specialized streaming data. At inference, a Past-Current Dynamic Focus Cache performs uncertainty-triggered, coarse-to-fine retrieval, expanding history only when needed. Plugged into existing Video-LLMs without architectural changes, WeaveTime delivers consistent gains on representative streaming benchmarks, improving accuracy while reducing latency. These results establish WeaveTime as a practical path toward time-aware streaming Video-LLMs under strict online, time-causal constraints. Code and weights will be made publicly available. Project Page: https://zhangyl4.github.io/publications/weavetime/
Comment: No matching criteria. This work improves Video-LLMs for streaming by adding order-aware training and a dynamic focus cache; relevant to video-VLMs and streaming but does not target robotics, unsupervised video segmentation, modality transfer for dense video tasks, or generative diffusion/3D advances as defined in the criteria.
Relevance: 3 Novelty: 5 Back to [topic] [top]
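The uncertainty-triggered retrieval idea behind the Past-Current Dynamic Focus Cache can be caricatured in a few lines. This is a loose sketch under heavy assumptions: entropy of the answer distribution as the uncertainty signal, and simple score pooling standing in for coarse-to-fine retrieval; none of this is WeaveTime's actual mechanism.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a (possibly unnormalized) score vector."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def answer_with_focus_cache(current_scores, history, threshold=1.0):
    """Answer from the current window alone; only when prediction entropy
    exceeds `threshold` (the model is uncertain) expand into cached history
    evidence. Pooling stands in for coarse-to-fine retrieval."""
    if entropy(current_scores) <= threshold:
        return "current", np.asarray(current_scores, dtype=float)
    pooled = np.asarray(current_scores, dtype=float) + sum(history)
    return "history", pooled / pooled.sum()
```

The point of the design is latency: confident predictions skip the history lookup entirely, so retrieval cost is paid only on hard frames.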
ArXiv: 2602.21583 [page] [pdf]
Authors: Wentao Zhang, Zhaoqi Ma, Jinjie Li, Huayi Wang, Haokun Liu, Junichiro Sugihara, Chen Chen, Yicheng Chen, Moju Zhao
Abstract: arXiv:2602.21583v1 Announce Type: new Abstract: Tilt-rotor aerial robots enable omnidirectional maneuvering through thrust vectoring, but introduce significant control challenges due to the strong coupling between joint and rotor dynamics. While model-based controllers can achieve high motion accuracy under nominal conditions, their robustness and responsiveness often degrade in the presence of disturbances and modeling uncertainties. This work investigates reinforcement learning for omnidirectional aerial motion control on over-actuated tiltable quadrotors that prioritizes robustness and agility. We present a learning-based control framework that enables efficient acquisition of coordinated rotor-joint behaviors for reaching target poses in the $SE(3)$ space. To achieve reliable sim-to-real transfer while preserving motion accuracy, we integrate system identification with minimal and physically consistent domain randomization. Compared with a state-of-the-art NMPC controller, the proposed method achieves comparable six-degree-of-freedom pose tracking accuracy, while demonstrating superior robustness and generalization across diverse tasks, enabling zero-shot deployment on real hardware.
Comment: No close match to the numbered criteria. This is a learning-based control (RL/system-identification/domain-randomization) study for tiltable quadrotors (omnidirectional aerial motion), not a VLM/SSL/vision-transfer or video-segmentation paper.
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.21680 [page] [pdf]
Authors: David Eckel, Henri Meeß
Abstract: arXiv:2602.21680v1 Announce Type: new Abstract: Cooperative Multi-Agent Reinforcement Learning (MARL) solves complex tasks that require coordination from multiple agents, but is often limited to either local (independent learning) or global (centralized learning) perspectives. In this paper, we introduce a novel sequential training scheme and MARL architecture, which learns from multiple perspectives on different hierarchy levels. We propose the Hierarchical Lead Critic (HLC) - inspired by natural emerging distributions in team structures, where following high-level objectives combines with low-level execution. HLC demonstrates that introducing multiple hierarchies, leveraging local and global perspectives, can lead to improved performance with high sample efficiency and robust policies. Experimental results conducted on cooperative, non-communicative, and partially observable MARL benchmarks demonstrate that HLC outperforms single hierarchy baselines and scales robustly with increasing amounts of agents and difficulty.
Comment: No matching criterion. Multi-agent RL (Hierarchical Lead Critic) — of general interest for someone into reinforcement learning/robotics, but does not use vision‑language models or meet the specified VLM+robotics criterion (2).
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.21917 [page] [pdf]
Authors: Chen Wu, Ling Wang, Zhuoran Zheng, Yuning Cui, Zhixiong Yang, Xiangyu Chen, Yue Zhang, Weidong Jiang, Jingyuan Xia
Abstract: arXiv:2602.21917v1 Announce Type: new Abstract: Ultra-High-Definition (UHD) image restoration is trapped in a scalability crisis: existing models, bound to pixel-wise operations, demand unsustainable computation. While state space models (SSMs) like Mamba promise linear complexity, their pixel-serial scanning remains a fundamental bottleneck for the millions of pixels in UHD content. We ask: must we process every pixel to understand the image? This paper introduces C$^2$SSM, a visual state space model that breaks this taboo by shifting from pixel-serial to cluster-serial scanning. Our core discovery is that the rich feature distribution of a UHD image can be distilled into a sparse set of semantic centroids via a neural-parameterized mixture model. C$^2$SSM leverages this to reformulate global modeling into a novel dual-path process: it scans and reasons over a handful of cluster centers, then diffuses the global context back to all pixels through a principled similarity distribution, all while a lightweight modulator preserves fine details. This cluster-centric paradigm achieves a decisive leap in efficiency, slashing computational costs while establishing new state-of-the-art results across five UHD restoration tasks. More than a solution, C$^2$SSM charts a new course for efficient large-scale vision: scan clusters, not pixels.
Comment: No matching criteria — proposes a cluster-centric state-space model for ultra-HD image restoration (image restoration / efficiency), not about SSL/contrastive/MAE, VLMs in robotics, video segmentation, or cross-modal transfer for video.
Relevance: 3 Novelty: 5 Back to [topic] [top]
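The "scan clusters, not pixels" pattern in the C$^2$SSM abstract can be sketched very loosely: distill features into a few centroids, process only the centroids, then diffuse their context back to all pixels through a similarity distribution. Everything below is a stand-in, with k-means replacing the neural-parameterized mixture model and tanh replacing the SSM scan.

```python
import numpy as np

def cluster_scan(features, k, iters=5, tau=0.5):
    """Cluster-centric global modeling (illustrative): (1) distill (n, d)
    pixel features into k centroids via farthest-point init + k-means steps,
    (2) transform the centroids (placeholder for the cluster-serial scan),
    (3) diffuse centroid context to every pixel via softmax similarity."""
    centers = [features[0]]                       # farthest-point init
    for _ in range(k - 1):
        d2 = np.min([((features - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(features[d2.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):                        # cheap k-means refinement
        d2 = ((features[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for j in range(k):
            pts = features[assign == j]
            if len(pts):
                centers[j] = pts.mean(0)
    context = np.tanh(centers)                    # placeholder "scan" output
    sim = -((features[:, None, :] - centers[None]) ** 2).sum(-1) / tau
    w = np.exp(sim - sim.max(1, keepdims=True))
    w /= w.sum(1, keepdims=True)
    return w @ context                            # per-pixel global context
```

The efficiency claim follows from the shapes: the expensive sequential step runs over k centroids instead of n pixels, and the diffusion back to pixels is a single (n, k) @ (k, d) product.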
ArXiv: 2602.22154 [page] [pdf]
Authors: Hossein B. Jond, Veli Bakırcıoğlu, Logan E. Beaver, Nejat Tükenmez, Adel Akbarimajd, Martin Saska
Abstract: arXiv:2602.22154v1 Announce Type: new Abstract: Coordinated collective motion in bird flocks and fish schools inspires algorithms for cohesive swarm robotics. This paper presents a position-based flocking model that achieves persistent velocity alignment without velocity sensing. By approximating relative velocity differences from changes between current and initial relative positions and incorporating a time- and density-dependent alignment gain with a non-zero minimum threshold to maintain persistent alignment, the model sustains coherent collective motion over extended periods. Simulations with a collective of 50 agents demonstrate that the position-based flocking model attains faster and more sustained directional alignment and results in more compact formations than a velocity-alignment-based baseline. This position-based flocking model is particularly well-suited for real-world robotic swarms, where velocity measurements are unreliable, noisy, or unavailable. Experimental results using a team of nine real wheeled mobile robots are also presented.
Comment: No specific criterion match — a robotics/swarm coordination method (position-based flocking) but does not involve VLMs, SSL, video segmentation, cross-modal transfer or generative models.
Relevance: 3 Novelty: 5 Back to [topic] [top]
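The position-only alignment trick in this abstract is concrete enough to sketch: approximate each pair's relative velocity from the change between current and initial relative positions, so no velocity sensing is needed. The gain schedule below (time-decaying with a non-zero floor) is a guess at the paper's "persistent alignment" gain, and the density dependence is omitted.

```python
import numpy as np

def position_based_alignment(pos, pos0, t, g0=1.0, g_min=0.2):
    """Per-agent alignment accelerations from positions only.

    pos, pos0: (n, 2) current and initial positions; t: elapsed time.
    Relative velocity is the finite difference of relative positions, and
    the alignment gain decays with time but is floored at g_min so
    alignment persists (an assumed schedule, not the paper's exact one).
    """
    gain = max(g0 / (1.0 + t), g_min)
    acc = np.zeros_like(pos)
    for i in range(len(pos)):
        rel_now = pos - pos[i]            # p_j - p_i for all j (zero at j=i)
        rel_0 = pos0 - pos0[i]
        v_rel = (rel_now - rel_0) / t     # finite-difference relative velocity
        acc[i] = gain * v_rel.mean(axis=0)
    return acc
```

For agents moving at constant velocity the finite difference recovers the true relative velocity exactly, so each agent accelerates toward the flock's mean velocity without ever measuring a velocity.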
ArXiv: 2602.22118 [page] [pdf]
Authors: Benjamin Bokser, Daniel Gonzalez, Surya Singh, Aaron Preston, Alex Bahner, Annika Wollschläger, Arianna Ilvonen, Asa Eckert-Erdheim, Ashwin Khadke, Bilal Hammoud, Dean Molinaro, Fabian Jenelten, Henry Mayne, Howie Choset, Igor Bogoslavskyi, Itic Tinman, James Tigue, Jan Preisig, Kaiyu Zheng, Kenny Sharma, Kim Ang, Laura Lee, Liana Margolese, Nicole Lin, Oscar Frias, Paul Drews, Ravi Boggavarapu, Rick Burnham, Samuel Zapolsky, Sangbae Kim, Scott Biddlestone, Sean Mayorga, Shamel Fahmi, Tyler McCollum, Velin Dimitrov, William Moyne, Yu-Ming Chen, Farbod Farshidian, Marco Hutter, David Perry, Al Rizzi, Gabe Nelson
Abstract: arXiv:2602.22118v1 Announce Type: new Abstract: Trials cyclists and mountain bike riders can hop, jump, balance, and drive on one or both wheels. This versatility allows them to achieve speed and energy-efficiency on smooth terrain and agility over rough terrain. Inspired by these athletes, we present the design and control of a robotic platform, Ultra Mobility Vehicle (UMV), which combines a bicycle and a reaction mass to move dynamically with minimal actuated degrees of freedom. We employ a simulation-driven design optimization process to synthesize a spatial linkage topology with a focus on vertical jump height and momentum-based balancing on a single wheel contact. Using a constrained Reinforcement Learning (RL) framework, we demonstrate zero-shot transfer of diverse athletic behaviors, including track-stands, jumps, wheelies, rear wheel hopping, and front flips. This 23.5 kg robot is capable of high speeds (8 m/s) and jumping on and over large obstacles (1 m tall, or 130% of the robot's nominal height).
Comment: Does not match the requested criteria. This is a hardware+control/robotics systems paper using constrained RL to synthesize agile bicycle-like behaviors — substantial engineering/controls but not VLM-in-robotics or the SSL/segmentation/transfer-learning topics listed.
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2602.21233 [page] [pdf]
Authors: Rui Cen, QiangQiang Hu, Hong Huang, Hong Liu, Song Liu, Xin Luo, Lin Niu, Yifan Tan, Decheng Wu, Linchuan Xie, Rubing Yang, Guanghua Yu, Jianchen Zhu
Abstract: arXiv:2602.21233v1 Announce Type: new Abstract: This technical report introduces AngelSlim, a comprehensive and versatile toolkit for large model compression developed by the Tencent Hunyuan team. By consolidating cutting-edge algorithms, including quantization, speculative decoding, token pruning, and distillation, AngelSlim provides a unified pipeline that streamlines the transition from model compression to industrial-scale deployment. To facilitate efficient acceleration, we integrate state-of-the-art FP8 and INT8 Post-Training Quantization (PTQ) algorithms alongside pioneering research in ultra-low-bit regimes, featuring HY-1.8B-int2 as the first industrially viable 2-bit large model. Beyond quantization, we propose a training-aligned speculative decoding framework compatible with multimodal architectures and modern inference engines, achieving 1.8x to 2.0x throughput gains without compromising output correctness. Furthermore, we develop a training-free sparse attention framework that reduces Time-to-First-Token (TTFT) in long-context scenarios by decoupling sparse kernels from model architectures through a hybrid of static patterns and dynamic token selection. For multimodal models, AngelSlim incorporates specialized pruning strategies, namely IDPruner for optimizing vision tokens via Maximal Marginal Relevance and Samp for adaptive audio token merging and pruning. By integrating these compression strategies from low-level implementations, AngelSlim enables algorithm-focused research and tool-assisted deployment.
Comment: No criteria matched: an engineering toolkit for large-model compression (quantization, pruning, speculative decoding) for multimodal models — useful but not directly matching SSL method advances, VLMs-in-robotics, video segmentation, cross-modal video transfer, diffusion/video/3D generation breakthroughs.
Relevance: 3 Novelty: 4 Back to [topic] [top]
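The INT8 PTQ mentioned in the abstract reduces, in its simplest form, to choosing a scale that maps a tensor's float range onto int8. The sketch below is a generic symmetric per-tensor quantizer for illustration only; it assumes nothing about AngelSlim's actual algorithms, and the function names are mine.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 post-training quantization:
    a single scale maps the max absolute weight to 127."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale
```

Real PTQ pipelines add calibration data, per-channel scales, and outlier handling, but the round-trip error bound of half a quantization step already follows from this basic scheme.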
ArXiv: 2602.21820 [page] [pdf]
Authors: Shan Wang, Peixia Li, Chenchen Xu, Ziang Cheng, Jiayu Yang, Hongdong Li, Pulak Purkait
Abstract: arXiv:2602.21820v1 Announce Type: new Abstract: We propose Light-Geometry Interaction (LGI) maps, a novel representation that encodes light-aware occlusion from monocular depth. Unlike ray tracing, which requires full 3D reconstruction, LGI captures essential light-shadow interactions reliably and accurately, computed from off-the-shelf 2.5D depth map predictions. LGI explicitly ties illumination direction to geometry, providing a physics-inspired prior that constrains generative models. Without such prior, these models often produce floating shadows, inconsistent illumination, and implausible shadow geometry. Building on this representation, we propose a unified pipeline for joint shadow generation and relighting - unlike prior methods that treat them as disjoint tasks - capturing the intrinsic coupling of illumination and shadowing essential for modeling indirect effects. By embedding LGI into a bridge-matching generative backbone, we reduce ambiguity and enforce physically consistent light-shadow reasoning. To enable effective training, we curated the first large-scale benchmark dataset for joint shadow and relighting, covering reflections, transparency, and complex interreflections. Experiments show significant gains in realism and consistency across synthetic and real images. LGI thus bridges geometry-inspired rendering with generative modeling, enabling efficient, physically consistent shadow generation and relighting.
Comment: No match to any of the requested criteria (not about SSL improvements, VLMs in robotics, unsupervised video segmentation, cross-modal transfer for video, diffusion/3D generative advances, or NeRF/Gaussian Splatting). This is an image relighting/shadow-generation method using monocular depth priors.
Relevance: 3 Novelty: 4 Back to [topic] [top]
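To make "light-aware occlusion from depth" concrete, here is a toy sketch of the underlying idea only, not the paper's LGI maps: for a 1-D height field, a point is in shadow if anything between it and a distant light rises above the line of sight at the light's elevation angle. All names and the 1-D simplification are mine.

```python
import numpy as np

def shadow_mask_1d(h, tan_elev=1.0):
    """Mark points of a 1-D height field h occluded from a distant
    light shining from the +x horizon at elevation angle atan(tan_elev).
    Point i is shadowed if some point j > i rises above the sight line
    from i toward the light."""
    n = len(h)
    mask = np.zeros(n, dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            if h[j] - h[i] > (j - i) * tan_elev:
                mask[i] = True
                break
    return mask
```

The same sight-line test extended to 2.5D depth maps, with the illumination direction tied explicitly to geometry, is the flavor of prior that constrains the generative model in the abstract above.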
ArXiv: 2602.21829 [page] [pdf]
Authors: Daniel Oliveira, David Martins de Matos
Abstract: arXiv:2602.21829v1 Announce Type: new Abstract: Visual storytelling models that correctly ground entities in images may still hallucinate semantic relationships, generating incorrect dialogue attribution, character interactions, or emotional states. We introduce StoryMovie, a dataset of 1,757 stories aligned with movie scripts and subtitles through LCS matching. Our alignment pipeline synchronizes screenplay dialogue with subtitle timestamps, enabling dialogue attribution by linking character names from scripts to temporal positions from subtitles. Using this aligned content, we generate stories that maintain visual grounding tags while incorporating authentic character names, dialogue, and relationship dynamics. We fine-tune Qwen Storyteller3 on this dataset, building on prior work in visual grounding and entity re-identification. Evaluation using DeepSeek V3 as judge shows that Storyteller3 achieves an 89.9% win rate against base Qwen2.5-VL 7B on subtitle alignment. Compared to Storyteller, trained without script grounding, Storyteller3 achieves 48.5% versus 38.0%, confirming that semantic alignment progressively improves dialogue attribution beyond visual grounding alone.
Comment: No close match to the numbered criteria. Introduces StoryMovie, a dataset and VLM fine-tuning for visually grounded storytelling and subtitle/script alignment—useful for VLM evaluation but not VLMs applied to robotics, SSL methods, or video segmentation.
Relevance: 3 Novelty: 4 Back to [topic] [top]
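The LCS matching used by the StoryMovie alignment pipeline is the classic dynamic-programming longest-common-subsequence. The sketch below shows token-level LCS returning matched index pairs, the kind of alignment one could use to link screenplay dialogue to subtitle timestamps; it is a generic textbook algorithm, not the paper's pipeline.

```python
def lcs_align(script_tokens, subtitle_tokens):
    """Longest-common-subsequence via dynamic programming, returning
    matched (script_index, subtitle_index) pairs in order."""
    m, n = len(script_tokens), len(subtitle_tokens)
    # dp[i][j] = LCS length of the first i script and first j subtitle tokens
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if script_tokens[i] == subtitle_tokens[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack to recover the matched pairs.
    pairs, i, j = [], m, n
    while i and j:
        if script_tokens[i - 1] == subtitle_tokens[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```

Given matched pairs, each matched subtitle token carries a timestamp, so the surrounding script dialogue (and its named speaker) inherits a temporal position.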
ArXiv: 2602.21811 [page] [pdf]
Authors: Qingtao Liu, Zhengnan Sun, Yu Cui, Haoming Li, Gaofeng Li, Lin Shao, Jiming Chen, Qi Ye
Abstract: arXiv:2602.21811v1 Announce Type: new Abstract: Robotic dexterous manipulation is a challenging problem due to high degrees of freedom (DoFs) and complex contacts of multi-fingered robotic hands. Many existing deep reinforcement learning (DRL) based methods aim at improving sample efficiency in high-dimensional output action spaces. However, existing works often overlook the role of representations in achieving generalization of a manipulation policy in the complex input space during the hand-object interaction. In this paper, we propose DexRep, a novel hand-object interaction representation to capture object surface features and spatial relations between hands and objects for dexterous manipulation skill learning. Based on DexRep, policies are learned for three dexterous manipulation tasks, i.e. grasping, in-hand reorientation, bimanual handover, and extensive experiments are conducted to verify the effectiveness. In simulation, for grasping, the policy learned with 40 objects achieves a success rate of 87.9% on more than 5000 unseen objects of diverse categories, significantly surpassing existing work trained with thousands of objects; for the in-hand reorientation and handover tasks, the policies also boost the success rates and other metrics of existing hand-object representations by 20% to 40%. The grasp policies with DexRep are deployed to the real world under multi-camera and single-camera setups and demonstrate a small sim-to-real gap.
Comment: No close match to the numbered criteria. This paper focuses on learned hand-object geometric/spatial representations for dexterous manipulation in an RL setting (sim-to-real), not on self-supervised representation methods, VLMs in robotics, or the other listed topics.
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2602.21762 [page] [pdf]
Authors: Zhaoyang Wei, Xumeng Han, Xuehui Yu, Xue Yang, Guorong Li, Zhenjun Han, Jianbin Jiao
Abstract: arXiv:2602.21762v1 Announce Type: new Abstract: Single-point annotation is increasingly prominent in visual tasks for labeling cost reduction. However, it challenges tasks requiring high precision, such as the point-prompted instance segmentation (PPIS) task, which aims to estimate precise masks using single-point prompts to train a segmentation network. The constraints of point annotations give rise to granularity ambiguity and boundary uncertainty: the difficulty of distinguishing between different levels of detail (e.g., whole object vs. parts) and the challenge of precisely delineating object boundaries. Previous works have usually inherited the paradigm of mask generation along with proposal selection to achieve PPIS. However, proposal selection relies solely on category information, failing to resolve the ambiguity of different granularity. Furthermore, mask generators offer only finite discrete solutions that often deviate from actual masks, particularly at boundaries. To address these issues, we propose the Semantic-Aware Point-Prompted Instance Segmentation Network (SAPNet). It integrates Point Distance Guidance and a Box Mining Strategy to tackle the group-level and local issues caused by the point's granularity ambiguity. Additionally, we incorporate completeness scores within proposals to add spatial granularity awareness, enhancing multiple instance learning (MIL) in proposal selection, termed S-MIL. Multi-level Affinity Refinement conveys pixel and semantic clues, narrowing boundary uncertainty during mask refinement. These modules culminate in SAPNet++, mitigating the point prompt's granularity ambiguity and boundary uncertainty and significantly improving segmentation performance. Extensive experiments on four challenging datasets validate the effectiveness of our methods, highlighting the potential to advance PPIS.
Comment: No close match to the listed criteria — this is a point-prompted instance segmentation method (image-level PPIS) addressing annotation efficiency and mask refinement, not self-supervised/video segmentation/modality-transfer/VLM-in-robotics work.