Personalized Daily arXiv Papers 07/15/2025

This project is adapted from tatsu-lab/gpt_paper_assistant; its source code is available at Variante/gpt_paper_assistant.

About me on Bilibili. Help keep the website running.

Topics

Paper selection prompt and criteria (jump to the section by clicking the link):

1. New methodological improvements to self-supervised learning (SSL) for better image or video representation learning.

2. Shows new applications of vision-language models (VLMs) in robotics, such as reinforcement learning or imitation learning.

3. Works on video segmentation that rely on unsupervised and self-supervised learning methods.

4. Works on transfer learning between modalities, such as audio-to-video, optical-flow-to-video, language-to-video, and image-to-video; these papers should consider how to use domain-agnostic data to improve performance on videos.

5. Shows a significant advance in the performance of text-to-image, text-to-video, and image-to-video diffusion models, among others.

6. Marks significant advancements in 3D generation using generative models, including applications in converting images to 3D models and generating 3D models from textual descriptions.

7. Shows recent progress on 3D reconstruction and generation with Gaussian Splatting, NeRF, and mesh generation.

Go beyond


Topic 1

1002. CLA: Latent Alignment for Online Continual Self-Supervised Learning [more]
Authors: Giacomo Cignoni, Andrea Cossu, Alexandra Gomez-Villa, Joost van de Weijer, Antonio Carta

1017. Self-supervised Pretraining for Integrated Prediction and Planning of Automated Vehicles [more]
Authors: Yangang Ren, Guojian Zhan, Chen Lv, Jun Li, Fenghua Liang, Keqiang Li

1021. Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning [more]
Authors: Yiyang Chen, Shanshan Zhao, Lunhao Duan, Changxing Ding, Dacheng Tao

Back to [top]


Topic 2

2014. AirScape: An Aerial Generative World Model with Motion Controllability [more]
Authors: Baining Zhao, Rongze Tang, Mingyuan Jia, Ziyou Wang, Fanghang Man, Xin Zhang, Yu Shang, Weichen Zhang, Chen Gao, Wei Wu, Xin Wang, Xinlei Chen, Yong Li

2016. Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization [more]
Authors: Jialei Huang, Shuo Wang, Fanqi Lin, Yihang Hu, Chuan Wen, Yang Gao

2020. OTAS: Open-vocabulary Token Alignment for Outdoor Segmentation [more]
Authors: Simon Schwaiger, Stefan Thalhammer, Wilfried Wöber, Gerald Steinbauer-Wagner

2022. EmbRACE-3K: Embodied Reasoning and Action in Complex Environments [more]
Authors: Mingxian Lin, Wei Huang, Yitang Li, Chengjie Jiang, Kui Wu, Fangwei Zhong, Shengju Qian, Xin Wang, Xiaojuan Qi

2027. Demonstrating the Octopi-1.5 Visual-Tactile-Language Model [more]
Authors: Samson Yu, Kelvin Lin, Harold Soh

Back to [top]


Topic 3

Back to [top]


Topic 4

4001. ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models [more]
Authors: Yongheng Zhang, Xu Liu, Ruihan Tao, Qiguang Chen, Hao Fei, Wanxiang Che, Libo Qin

4005. Taming generative video models for zero-shot optical flow extraction [more]
Authors: Seungwoo Kim, Khai Loong Aw, Klemen Kotar, Cristobal Eyzaguirre, Wanhee Lee, Yunong Liu, Jared Watrous, Stefan Stojanov, Juan Carlos Niebles, Jiajun Wu, Daniel L. K. Yamins

4008. Spatial Lifting for Dense Prediction [more]
Authors: Mingzhi Xu, Yizhe Zhang

Back to [top]


Topic 5

5000. Beyond Scores: Proximal Diffusion Models [more]
Authors: Zhenghan Fang, Mateo Díaz, Sam Buchanan, Jeremias Sulam

5003. Text Embedding Knows How to Quantize Text-Guided Diffusion Models [more]
Authors: Hongjae Lee, Myungjun Son, Dongjea Kang, Seung-Won Jung

5007. Learning Diffusion Models with Flexible Representation Guidance [more]
Authors: Chenyu Wang, Cai Zhou, Sharut Gupta, Zongyu Lin, Stefanie Jegelka, Stephen Bates, Tommi Jaakkola

5009. Memory-Efficient Personalization of Text-to-Image Diffusion Models via Selective Optimization Strategies [more]
Authors: Seokeon Choi, Sunghyun Park, Hyoungwoo Park, Jeongho Kim, Sungrack Yun

5018. Warm Starts Accelerate Generative Modelling [more]
Authors: Jonas Scholz, Richard E. Turner

5019. Can Contrastive Learning Improve Class-Imbalanced Diffusion Model? [more]
Authors: Fang Chen, Alex Villa, Gongbo Liang, Xiaoyi Lu, Meng Tang

Back to [top]


Topic 6

6006. Advancing Text-to-3D Generation with Linearized Lookahead Variational Score Distillation [more]
Authors: Yu Lei, Bingde Liu, Qingsong Xie, Haonan Lu, Zhijie Deng

Back to [top]


Topic 7

7004. 4D-Animal: Freely Reconstructing Animatable 3D Animals from Videos [more]
Authors: Shanshan Zhong, Jiawei Peng, Zehan Zheng, Zhongzhan Huang, Wufei Ma, Guofeng Zhang, Qihao Liu, Alan Yuille, Jieneng Chen

Back to [top]


Go beyond

10. Stable Score Distillation [more]
Authors: Haiming Zhu, Yangyang Xu, Chenshu Xu, Tingrui Shen, Wenxi Liu, Yong Du, Jun Yu, Shengfeng He

11. Theory-Informed Improvements to Classifier-Free Guidance for Discrete Diffusion Models [more]
Authors: Kevin Rojas, Ye He, Chieh-Hsin Lai, Yuta Takida, Yuki Mitsufuji, Molei Tao

12. Self-supervised Learning on Camera Trap Footage Yields a Strong Universal Face Embedder [more]
Authors: Vladimir Iashin, Horace Lee, Dan Schofield, Andrew Zisserman

13. Contrastive Pretraining with Dual Visual Encoders for Gloss-Free Sign Language Translation [more]
Authors: Ozge Mercanoglu Sincan, Richard Bowden

15. Latent Diffusion Models with Masked AutoEncoders [more]
Authors: Junho Lee, Jeongwoo Shin, Hyungwook Choi, Joonseok Lee

23. Meta-autoencoders: An approach to discovery and representation of relationships between dynamically evolving classes [more]
Authors: Assaf Marron, Smadar Szekely, Irun Cohen, David Harel

24. Dynamic Inter-Class Confusion-Aware Encoder for Audio-Visual Fusion in Human Activity Recognition [more]
Authors: Kaixuan Cong, Yifan Wang, Rongkun Xue, Yuyang Jiang, Yiming Feng, Jing Yang

25. Prompt Informed Reinforcement Learning for Visual Coverage Path Planning [more]
Authors: Venkat Margapuri

26. Recurrent Expansion: A Pathway Toward the Next Generation of Deep Learning [more]
Authors: Tarek Berghout

28. Crucial-Diff: A Unified Diffusion Model for Crucial Image and Annotation Synthesis in Data-scarce Scenarios [more]
Authors: Siyue Yao, Mingjie Sun, Eng Gee Lim, Ran Yi, Baojiang Zhong, Moncef Gabbouj

29. MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models [more]
Authors: Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Junjie Hu

30. Graph World Model [more]
Authors: Tao Feng, Yexin Wu, Guanyu Lin, Jiaxuan You

31. MP1: Mean Flow Tames Policy Learning in 1-step for Robotic Manipulation [more]
Authors: Juyi Sheng, Ziyi Wang, Peiming Li, Mengyuan Liu

32. Mind the Gap: Preserving and Compensating for the Modality Gap in CLIP-Based Continual Learning [more]
Authors: Linlan Huang, Xusheng Cao, Haori Lu, Yifan Meng, Fei Yang, Xialei Liu

33. Test-Time Canonicalization by Foundation Models for Robust Perception [more]
Authors: Utkarsh Singhal, Ryan Feng, Stella X. Yu, Atul Prakash

34. BlindSight: Harnessing Sparsity for Efficient VLMs [more]
Authors: Tharun Adithya Srikrishnan, Deval Shah, Steven K. Reinhardt

35. GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them? [more]
Authors: Yiyang Zhou, Linjie Li, Shi Qiu, Zhengyuan Yang, Yuyang Zhao, Siwei Han, Yangfan He, Kangqi Li, Haonian Ji, Zihao Zhao, Haibo Tong, Lijuan Wang, Huaxiu Yao

36. EgoAnimate: Generating Human Animations from Egocentric top-down Views [more]
Authors: G. Kutay Türkoglu, Julian Tanke, Iheb Belgacem, Lev Markhasin

37. DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs [more]
Authors: Jiahe Zhao, Rongkun Zheng, Yi Wang, Helin Wang, Hengshuang Zhao

38. DEARLi: Decoupled Enhancement of Recognition and Localization for Semi-supervised Panoptic Segmentation [more]
Authors: Ivan Martinović, Josip Šarić, Marin Oršić, Matej Kristan, Siniša Šegvić

39. Geometric Generative Modeling with Noise-Conditioned Graph Networks [more]
Authors: Peter Pao-Huang, Mitchell Black, Xiaojie Qiu

40. Imitation Learning in Continuous Action Spaces: Mitigating Compounding Error without Interaction [more]
Authors: Thomas T. Zhang, Daniel Pfrommer, Nikolai Matni, Max Simchowitz

41. Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities [more]
Authors: Shivam Chandhok, Wan-Cyuan Fan, Vered Shwartz, Vineeth N Balasubramanian, Leonid Sigal

42. Towards Human-level Dexterity via Robot Learning [more]
Authors: Gagan Khandate

43. Inter2Former: Dynamic Hybrid Attention for Efficient High-Precision Interactive [more]
Authors: You Huang, Lichao Chen, Jiayi Ji, Liujuan Cao, Shengchuan Zhang, Rongrong Ji

44. MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models [more]
Authors: Qiyan Zhao, Xiaofeng Zhang, Yiheng Li, Yun Xing, Xiaosong Yuan, Feilong Tang, Sinan Fan, Xuhang Chen, Xuyao Zhang, Dahan Wang

45. PRISM: Reducing Spurious Implicit Biases in Vision-Language Models with LLM-Guided Embedding Projection [more]
Authors: Mahdiyar Molahasani, Azadeh Motamedi, Michael Greenspan, Il-Min Kim, Ali Etemad

46. Contrastive Language-Image Pre-Training Model based Semantic Communication Performance Optimization [more]
Authors: Shaoran Yang, Dongyu Wei, Hanzhi Yu, Zhaohui Yang, Yuchen Liu, Mingzhe Chen

47. Cameras as Relative Positional Encoding [more]
Authors: Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, Angjoo Kanazawa

48. Scene-Aware Conversational ADAS with Generative AI for Real-Time Driver Assistance [more]
Authors: Kyungtae Han, Yitao Chen, Rohit Gupta, Onur Altintas

49. Continual Reinforcement Learning by Planning with Online World Models [more]
Authors: Zichen Liu, Guoji Fu, Chao Du, Wee Sun Lee, Min Lin

50. Raci-Net: Ego-vehicle Odometry Estimation in Adverse Weather Conditions [more]
Authors: Mohammadhossein Talebi, Pragyan Dahal, Davide Possenti, Stefano Arrigoni, Francesco Braghin

51. Show and Polish: Reference-Guided Identity Preservation in Face Video Restoration [more]
Authors: Wenkang Han, Wang Lin, Yiyun Zhou, Qi Liu, Shulei Wang, Chang Yao, Jingyuan Chen

52. Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score [more]
Authors: Eman Ali, Sathira Silva, Chetan Arora, Muhammad Haris Khan

53. QuarterMap: Efficient Post-Training Token Pruning for Visual State Space Models [more]
Authors: Tien-Yu Chi, Hung-Yueh Chiang, Diana Marculescu, Kai-Chiang Wu

54. VRU-Accident: A Vision-Language Benchmark for Video Question Answering and Dense Captioning for Accident Scene Understanding [more]
Authors: Younggun Kim, Ahmed S. Abdelrahman, Mohamed Abdel-Aty

55. Deep Recurrence for Dynamical Segmentation Models [more]
Authors: David Calhas, Arlindo L. Oliveira

56. PoseLLM: Enhancing Language-Guided Human Pose Estimation with MLP Alignment [more]
Authors: Dewen Zhang, Tahir Hussain, Wangpeng An, Hayaru Shouno

Back to [top]


Full paper list

ArXiv: 2507.10434 [page] [pdf]

Authors: Giacomo Cignoni, Andrea Cossu, Alexandra Gomez-Villa, Joost van de Weijer, Antonio Carta

Abstract: arXiv:2507.10434v1 Announce Type: new Abstract: Self-supervised learning (SSL) is able to build latent representations that generalize well to unseen data. However, only a few SSL techniques exist for the online CL setting, where data arrives in small minibatches, the model must comply with a fixed computational budget, and task boundaries are absent. We introduce Continual Latent Alignment (CLA), a novel SSL strategy for Online CL that aligns the representations learned by the current model with past representations to mitigate forgetting. We found that our CLA is able to speed up the convergence of the training process in the online scenario, outperforming state-of-the-art approaches under the same computational budget. Surprisingly, we also discovered that using CLA as a pretraining protocol in the early stages of pretraining leads to a better final performance when compared to a full i.i.d. pretraining.
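
The alignment idea described above can be sketched as an extra loss term: a frozen snapshot of an earlier encoder provides target latents that the current encoder is pulled toward while it still optimizes its SSL objective. This is a minimal illustration only; the MSE form of the alignment, the `align_weight` coefficient, and the toy modules are assumptions, not the paper's exact formulation.

```python
import copy
import torch
import torch.nn.functional as F

def cla_style_loss(encoder, past_encoder, batch, ssl_loss_fn, align_weight=1.0):
    """One online-CL step: any SSL objective plus alignment to past latents."""
    z_now = encoder(batch)                      # current representations
    with torch.no_grad():
        z_past = past_encoder(batch)            # frozen snapshot of an earlier model
    ssl = ssl_loss_fn(z_now)                    # e.g. a SimCLR/BYOL-style loss
    align = F.mse_loss(z_now, z_past)           # pull current latents toward past ones
    return ssl + align_weight * align

# Toy usage with stand-in modules.
encoder = torch.nn.Linear(32, 16)
past_encoder = copy.deepcopy(encoder)           # snapshot kept from earlier training
for p in past_encoder.parameters():
    p.requires_grad_(False)
loss = cla_style_loss(encoder, past_encoder, torch.randn(8, 32), lambda z: z.pow(2).mean())
loss.backward()
```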

Comment: Matches criterion 1

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2507.09537 [page] [pdf]

Authors: Yangang Ren, Guojian Zhan, Chen Lv, Jun Li, Fenghua Liang, Keqiang Li

Abstract: arXiv:2507.09537v1 Announce Type: new Abstract: Predicting the future of surrounding agents and accordingly planning a safe, goal-directed trajectory are crucial for automated vehicles. Current methods typically rely on imitation learning to optimize metrics against the ground truth, often overlooking how scene understanding could enable more holistic trajectories. In this paper, we propose Plan-MAE, a unified pretraining framework for prediction and planning that capitalizes on masked autoencoders. Plan-MAE fuses critical contextual understanding via three dedicated tasks: reconstructing masked road networks to learn spatial correlations, agent trajectories to model social interactions, and navigation routes to capture destination intents. To further align vehicle dynamics and safety constraints, we incorporate a local sub-planning task predicting the ego-vehicle's near-term trajectory segment conditioned on earlier segment. This pretrained model is subsequently fine-tuned on downstream tasks to jointly generate the prediction and planning trajectories. Experiments on large-scale datasets demonstrate that Plan-MAE outperforms current methods on the planning metrics by a large margin and can serve as an important pre-training step for learning-based motion planner.
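
The masked-reconstruction pretraining described above can be illustrated with a generic MAE-style step over a sequence of scene tokens (road segments, agent trajectories, route waypoints). The tokenization, masking ratio, and tiny transformer below are illustrative stand-ins, not Plan-MAE's architecture.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Toy masked autoencoder over a sequence of scene tokens (roads, agents, routes)."""
    def __init__(self, dim=32):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(dim, dim)

    def forward(self, tokens, mask_ratio=0.5):
        B, N, D = tokens.shape
        mask = torch.rand(B, N, device=tokens.device) < mask_ratio       # True = masked
        corrupted = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, D), tokens)
        recon = self.decoder(self.encoder(corrupted))
        return ((recon - tokens) ** 2)[mask].mean()   # loss only on masked positions

tokens = torch.randn(4, 64, 32)   # e.g. embedded road-segment / trajectory tokens
loss = TinyMAE()(tokens)
loss.backward()
```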

Comment: Matches criterion 1 as it discusses a new methodological improvement to self-supervised learning using masked autoencoders for prediction and planning in automated vehicles.

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2507.09102 [page] [pdf]

Authors: Yiyang Chen, Shanshan Zhao, Lunhao Duan, Changxing Ding, Dacheng Tao

Abstract: arXiv:2507.09102v1 Announce Type: new Abstract: Diffusion-based models, widely used in text-to-image generation, have proven effective in 2D representation learning. Recently, this framework has been extended to 3D self-supervised learning by constructing a conditional point generator for enhancing 3D representations. However, its performance remains constrained by the 3D diffusion model, which is trained on the available 3D datasets with limited size. We hypothesize that the robust capabilities of text-to-image diffusion models, particularly Stable Diffusion (SD), which is trained on large-scale datasets, can help overcome these limitations. To investigate this hypothesis, we propose PointSD, a framework that leverages the SD model for 3D self-supervised learning. By replacing the SD model's text encoder with a 3D encoder, we train a point-to-image diffusion model that allows point clouds to guide the denoising of rendered noisy images. With the trained point-to-image diffusion model, we use noise-free images as the input and point clouds as the condition to extract SD features. Next, we train a 3D backbone by aligning its features with these SD features, thereby facilitating direct semantic learning. Comprehensive experiments on downstream point cloud tasks and ablation studies demonstrate that the SD model can enhance point cloud self-supervised learning. Code is publicly available at https://github.com/wdttt/PointSD.
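
The final distillation stage described above (aligning a 3D backbone's features with features extracted from the frozen, point-conditioned Stable Diffusion model) can be sketched as a simple cosine alignment loss. The per-point MLP backbone, the projection layer, and the random placeholder `sd_feats` are assumptions for illustration; the paper's feature-extraction pipeline is more involved.

```python
import torch
import torch.nn.functional as F

def align_to_sd_features(backbone, proj, points, sd_feats):
    """Pull 3D backbone features toward (placeholder) per-point SD features."""
    z3d = F.normalize(proj(backbone(points)), dim=-1)
    zsd = F.normalize(sd_feats, dim=-1)
    return 1.0 - (z3d * zsd).sum(-1).mean()     # cosine-distance alignment

# Toy stand-ins: a per-point MLP backbone and random "SD" features.
backbone = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 64))
proj = torch.nn.Linear(64, 128)
points = torch.randn(2, 1024, 3)                # batch of point clouds
sd_feats = torch.randn(2, 1024, 128)            # placeholder for extracted SD features
loss = align_to_sd_features(backbone, proj, points, sd_feats)
loss.backward()
```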

Comment: Matches criterion 1 as it discusses self-supervised learning improvements using diffusion models.

Relevance: 5 Novelty: 5 Back to [topic] [top]


ArXiv: 2507.08885 [page] [pdf]

Authors: Baining Zhao, Rongze Tang, Mingyuan Jia, Ziyou Wang, Fanghang Man, Xin Zhang, Yu Shang, Weichen Zhang, Chen Gao, Wei Wu, Xin Wang, Xinlei Chen, Yong Li

Abstract: arXiv:2507.08885v1 Announce Type: new Abstract: How to enable robots to predict the outcomes of their own motion intentions in three-dimensional space has been a fundamental problem in embodied intelligence. To explore more general spatial imagination capabilities, here we present AirScape, the first world model designed for six-degree-of-freedom aerial agents. AirScape predicts future observation sequences based on current visual inputs and motion intentions. Specifically, we construct a dataset for aerial world model training and testing, which consists of 11k video-intention pairs. This dataset includes first-person-view videos capturing diverse drone actions across a wide range of scenarios, with over 1,000 hours spent annotating the corresponding motion intentions. Then we develop a two-phase training schedule to train a foundation model -- initially devoid of embodied spatial knowledge -- into a world model that is controllable by motion intentions and adheres to physical spatio-temporal constraints.

Comment: Matches criterion 2 as it discusses a world model for aerial agents, which could be relevant for robotics applications using vision language models.

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2507.09160 [page] [pdf]

Authors: Jialei Huang, Shuo Wang, Fanqi Lin, Yihang Hu, Chuan Wen, Yang Gao

Abstract: arXiv:2507.09160v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have shown remarkable achievements, driven by the rich implicit knowledge of their vision-language components. However, achieving generalist robotic agents demands precise grounding into physical interactions, especially in contact-rich scenarios where fine-grained force control is essential. We advance VLAs' implicit knowledge beyond identifying what to do, towards guiding how to physically interact with real world. This paper introduces Tactile-VLA, a novel framework that deeply fuses vision, language, action, and tactile sensing. This framework incorporates a hybrid position-force controller to translate the model's intentions into precise physical actions and a reasoning module that allows the robot to adapt its strategy based on tactile feedback. Experiments demonstrate Tactile-VLA's effectiveness and generalizability in three key aspects: (1) enabling tactile-aware instruction following, (2) utilizing tactile-relevant commonsense, and (3) facilitating adaptive tactile-involved reasoning. A key finding is that the VLM's prior knowledge already contains semantic understanding of physical interaction; by connecting it to the robot's tactile sensors with only a few demonstrations, we can activate this prior knowledge to achieve zero-shot generalization in contact-rich tasks.

Comment: Matches criterion 2 as it discusses a new application of vision language models in robotics, specifically in tactile generalization and physical interaction.

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2507.08851 [page] [pdf]

Authors: Simon Schwaiger, Stefan Thalhammer, Wilfried Wöber, Gerald Steinbauer-Wagner

Abstract: arXiv:2507.08851v1 Announce Type: new Abstract: Understanding open-world semantics is critical for robotic planning and control, particularly in unstructured outdoor environments. Current vision-language mapping approaches rely on object-centric segmentation priors, which often fail outdoors due to semantic ambiguities and indistinct semantic class boundaries. We propose OTAS - an Open-vocabulary Token Alignment method for Outdoor Segmentation. OTAS overcomes the limitations of open-vocabulary segmentation models by extracting semantic structure directly from the output tokens of pretrained vision models. By clustering semantically similar structures across single and multiple views and grounding them in language, OTAS reconstructs a geometrically consistent feature field that supports open-vocabulary segmentation queries. Our method operates zero-shot, without scene-specific fine-tuning, and runs at up to ~17 fps. OTAS provides a minor IoU improvement over fine-tuned and open-vocabulary 2D segmentation methods on the Off-Road Freespace Detection dataset. Our model achieves up to a 151% IoU improvement over open-vocabulary mapping methods in 3D segmentation on TartanAir. Real-world reconstructions demonstrate OTAS' applicability to robotic applications. The code and ROS node will be made publicly available upon paper acceptance.

Comment: Matches criterion 2

Relevance: 5 Novelty: 5 Back to [topic] [top]

ArXiv: 2507.10548 [page] [pdf]

Authors: Mingxian Lin, Wei Huang, Yitang Li, Chengjie Jiang, Kui Wu, Fangwei Zhong, Shengju Qian, Xin Wang, Xiaojuan Qi

Abstract: arXiv:2507.10548v1 Announce Type: new Abstract: Recent advanced vision-language models(VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent's intent at every step. Using EmRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset's effectiveness in enabling the development of embodied reasoning capabilities.

Comment: Matches criterion 2 as it discusses the application of vision-language models in robotics for embodied reasoning and action.

Relevance: 5 Novelty: 5 Back to [topic] [top]

ArXiv: 2507.09985 [page] [pdf]

Authors: Samson Yu, Kelvin Lin, Harold Soh

Abstract: arXiv:2507.09985v1 Announce Type: new Abstract: Touch is recognized as a vital sense for humans and an equally important modality for robots, especially for dexterous manipulation, material identification, and scenarios involving visual occlusion. Building upon very recent work in touch foundation models, this demonstration will feature Octopi-1.5, our latest visual-tactile-language model. Compared to its predecessor, Octopi-1.5 introduces the ability to process tactile signals from multiple object parts and employs a simple retrieval-augmented generation (RAG) module to improve performance on tasks and potentially learn new objects on-the-fly. The system can be experienced live through a new handheld tactile-enabled interface, the TMI, equipped with GelSight and TAC-02 tactile sensors. This convenient and accessible setup allows users to interact with Octopi-1.5 without requiring a robot. During the demonstration, we will showcase Octopi-1.5 solving tactile inference tasks by leveraging tactile inputs and commonsense knowledge. For example, in a Guessing Game, Octopi-1.5 will identify objects being grasped and respond to follow-up queries about how to handle it (e.g., recommending careful handling for soft fruits). We also plan to demonstrate Octopi-1.5's RAG capabilities by teaching it new items. With live interactions, this demonstration aims to highlight both the progress and limitations of VTLMs such as Octopi-1.5 and to foster further interest in this exciting field. Code for Octopi-1.5 and design files for the TMI gripper are available at https://github.com/clear-nus/octopi-1.5.

Comment: Matches criterion 2 as it demonstrates a vision-tactile-language model for robotics tasks.

Relevance: 5 Novelty: 5 Back to [topic] [top]


ArXiv: 2507.09876 [page] [pdf]

Authors: Yongheng Zhang, Xu Liu, Ruihan Tao, Qiguang Chen, Hao Fei, Wanxiang Che, Libo Qin

Abstract: arXiv:2507.09876v1 Announce Type: new Abstract: Video understanding plays a vital role in bridging low-level visual signals with high-level cognitive reasoning, and is fundamental to applications such as autonomous driving, embodied AI, and the broader pursuit of AGI. The rapid development of large language models (LLMs), particularly those utilizing Chain-of-Thought (CoT) technology, has significantly advanced video reasoning capabilities. However, current approaches primarily depend on textual information for reasoning, overlooking the visual modality in the actual video reasoning process. In contrast, humans naturally re-examine visual content while reasoning. Motivated by this, we introduce a novel video reasoning paradigm: Video-Text Interleaved CoT (ViTCoT), which facilitates more intuitive and cognitively aligned reasoning. To this end, first, we construct the Video-Text Interleaved Benchmark (ViTIB), which is created using MLLMs for key-video selection and manually verified. Furthermore, we extensively explore the potential of the ViTCoT paradigm in the video understanding field. Extensive experiments demonstrate that ViTCoT significantly enhances performance compared to the traditional text-only CoT paradigm and effectively activates more neuron values in MLLMs.

Comment: The paper introduces a novel video reasoning paradigm using video-text interleaved chain-of-thought, which aligns with criterion 4 on transfer learning between modalities.

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2507.09082 [page] [pdf]

Authors: Seungwoo Kim, Khai Loong Aw, Klemen Kotar, Cristobal Eyzaguirre, Wanhee Lee, Yunong Liu, Jared Watrous, Stefan Stojanov, Juan Carlos Niebles, Jiajun Wu, Daniel L. K. Yamins

Abstract: arXiv:2507.09082v1 Announce Type: new Abstract: Extracting optical flow from videos remains a core computer vision problem. Motivated by the success of large general-purpose models, we ask whether frozen self-supervised video models trained only for future frame prediction can be prompted, without fine-tuning, to output flow. Prior work reading out depth or illumination from video generators required fine-tuning, which is impractical for flow where labels are scarce and synthetic datasets suffer from a sim-to-real gap. Inspired by the Counterfactual World Model (CWM) paradigm, which can obtain point-wise correspondences by injecting a small tracer perturbation into a next-frame predictor and tracking its propagation, we extend this idea to generative video models. We explore several popular architectures and find that successful zero-shot flow extraction in this manner is aided by three model properties: (1) distributional prediction of future frames (avoiding blurry or noisy outputs); (2) factorized latents that treat each spatio-temporal patch independently; and (3) random-access decoding that can condition on any subset of future pixels. These properties are uniquely present in the recent Local Random Access Sequence (LRAS) architecture. Building on LRAS, we propose KL-tracing: a novel test-time procedure that injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback-Leibler divergence between perturbed and unperturbed predictive distributions. Without any flow-specific fine-tuning, our method outperforms state-of-the-art models on real-world TAP-Vid DAVIS dataset (16.6% relative improvement for endpoint error) and synthetic TAP-Vid Kubric (4.7% relative improvement). Our results indicate that counterfactual prompting of controllable generative video models is a scalable and effective alternative to supervised or photometric-loss approaches for high-quality flow.
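
KL-tracing as described above is concrete enough to sketch: perturb one location in the first frame, query the predictor twice, and take the per-pixel KL divergence between the two predictive distributions as evidence of where the tracer moved. The toy convolutional "predictor" and the single-pixel perturbation below are stand-ins; the actual method operates on an LRAS-style generative video model.

```python
import torch
import torch.nn.functional as F

def kl_trace_flow(predict_logits, frame0, frame1, y, x, eps=0.5):
    """Perturb one pixel of frame0, compare per-pixel predictive distributions with
    and without the perturbation, and read the most-affected location as the flow
    target. `predict_logits(frame0, frame1)` is assumed to return [B, K, H, W]."""
    clean = F.log_softmax(predict_logits(frame0, frame1), dim=1)
    perturbed = frame0.clone()
    perturbed[..., y, x] += eps                             # localized tracer perturbation
    pert = F.log_softmax(predict_logits(perturbed, frame1), dim=1)
    kl_map = (pert.exp() * (pert - clean)).sum(dim=1)       # per-pixel KL divergence
    flat = kl_map.flatten(1).argmax(dim=1)
    ty = torch.div(flat, kl_map.shape[-1], rounding_mode="floor")
    tx = flat % kl_map.shape[-1]
    return ty - y, tx - x                                   # displacement of the tracer

# Toy "predictor": a fixed random convolution standing in for a video model.
torch.manual_seed(0)
conv = torch.nn.Conv2d(2, 16, 3, padding=1)
predict = lambda f0, f1: conv(torch.cat([f0, f1], dim=1))
f0, f1 = torch.rand(1, 1, 32, 32), torch.rand(1, 1, 32, 32)
dy, dx = kl_trace_flow(predict, f0, f1, y=10, x=12)
```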

Comment: Matches criterion 4 as it discusses transfer learning between modalities, specifically using video models for optical flow extraction.

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2507.10222 [page] [pdf]

Authors: Mingzhi Xu, Yizhe Zhang

Abstract: arXiv:2507.10222v1 Announce Type: new Abstract: We present Spatial Lifting (SL), a novel methodology for dense prediction tasks. SL operates by lifting standard inputs, such as 2D images, into a higher-dimensional space and subsequently processing them using networks designed for that higher dimension, such as a 3D U-Net. Counterintuitively, this dimensionality lifting allows us to achieve good performance on benchmark tasks compared to conventional approaches, while reducing inference costs and significantly lowering the number of model parameters. The SL framework produces intrinsically structured outputs along the lifted dimension. This emergent structure facilitates dense supervision during training and enables robust, near-zero-additional-cost prediction quality assessment at test time. We validate our approach across 19 benchmark datasets (13 for semantic segmentation and 6 for depth estimation), demonstrating competitive dense prediction performance while reducing the model parameter count by over 98% (in the U-Net case) and lowering inference costs. Spatial Lifting introduces a new vision modeling paradigm that offers a promising path toward more efficient, accurate, and reliable deep networks for dense prediction tasks in vision.
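
The lifting operation itself is simple to sketch: replicate the 2D input along a new axis, run a 3D network over the lifted volume, and read out a dense 2D prediction. The depth size, the repeat-based lifting, and the averaging readout below are illustrative guesses, not the paper's design.

```python
import torch
import torch.nn as nn

class LiftedSegHead(nn.Module):
    """Toy spatial lifting: replicate a 2D image along a new depth axis, process the
    lifted volume with 3D convolutions (a stand-in for a 3D U-Net), then collapse
    the lifted dimension to obtain a dense 2D prediction."""
    def __init__(self, in_ch=3, num_classes=5, depth=8):
        super().__init__()
        self.depth = depth
        self.net3d = nn.Sequential(
            nn.Conv3d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv3d(16, num_classes, 3, padding=1),
        )

    def forward(self, img):                                        # img: [B, C, H, W]
        lifted = img.unsqueeze(2).repeat(1, 1, self.depth, 1, 1)   # [B, C, D, H, W]
        out3d = self.net3d(lifted)            # structured outputs along the lifted axis
        return out3d.mean(dim=2)              # collapse depth -> [B, num_classes, H, W]

logits = LiftedSegHead()(torch.randn(2, 3, 64, 64))
```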

Comment: Matches criterion 4: Transfer learning between modalities for dense prediction tasks.

Relevance: 5 Novelty: 6 Back to [topic] [top]


ArXiv: 2507.08956 [page] [pdf]

Authors: Zhenghan Fang, Mateo D'iaz, Sam Buchanan, Jeremias Sulam

Abstract: arXiv:2507.08956v1 Announce Type: new Abstract: Diffusion models have quickly become some of the most popular and powerful generative models for high-dimensional data. The key insight that enabled their development was the realization that access to the score -- the gradient of the log-density at different noise levels -- allows for sampling from data distributions by solving a reverse-time stochastic differential equation (SDE) via forward discretization, and that popular denoisers allow for unbiased estimators of this score. In this paper, we demonstrate that an alternative, backward discretization of these SDEs, using proximal maps in place of the score, leads to theoretical and practical benefits. We leverage recent results in proximal matching to learn proximal operators of the log-density and, with them, develop Proximal Diffusion Models (ProxDM). Theoretically, we prove that $\widetilde{O}(d/\sqrt{\varepsilon})$ steps suffice for the resulting discretization to generate an $\varepsilon$-accurate distribution w.r.t. the KL divergence. Empirically, we show that two variants of ProxDM achieve significantly faster convergence within just a few sampling steps compared to conventional score-matching methods.
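
The forward-vs-backward discretization contrast at the heart of the abstract can be stated schematically: an explicit step moves along the score, while an implicit (backward) step solves a small optimization problem, i.e. applies a proximal map of f = -log p_t. The display below only contrasts the deterministic parts of the two updates and omits the reverse-SDE noise term and schedule, so it should be read as background rather than ProxDM's exact sampler.

```latex
% Deterministic parts of the two discretizations, with f = -log p_t and step size h.
\begin{aligned}
\text{explicit (score) step:}\quad   & x_{k-1} = x_k - h\,\nabla f(x_k) \;=\; x_k + h\,\nabla \log p_t(x_k),\\
\text{implicit (proximal) step:}\quad & x_{k-1} = \operatorname{prox}_{h f}(x_k)
  \;:=\; \arg\min_{x}\Big\{ f(x) + \tfrac{1}{2h}\,\lVert x - x_k\rVert^2 \Big\}.
\end{aligned}
```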

Comment: Matches criterion 5 as it discusses improvements in diffusion models for better generation results and faster convergence.

Relevance: 5 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.10340 [page] [pdf]

Authors: Hongjae Lee, Myungjun Son, Dongjea Kang, Seung-Won Jung

Abstract: arXiv:2507.10340v1 Announce Type: new Abstract: Despite the success of diffusion models in image generation tasks such as text-to-image, the enormous computational complexity of diffusion models limits their use in resource-constrained environments. To address this, network quantization has emerged as a promising solution for designing efficient diffusion models. However, existing diffusion model quantization methods do not consider input conditions, such as text prompts, as an essential source of information for quantization. In this paper, we propose a novel quantization method dubbed Quantization of Language-to-Image diffusion models using text Prompts (QLIP). QLIP leverages text prompts to guide the selection of bit precision for every layer at each time step. In addition, QLIP can be seamlessly integrated into existing quantization methods to enhance quantization efficiency. Our extensive experiments demonstrate the effectiveness of QLIP in reducing computational complexity and improving the quality of the generated images across various datasets.

Comment: Matches criterion 5

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2507.08980 [page] [pdf]

Authors: Chenyu Wang, Cai Zhou, Sharut Gupta, Zongyu Lin, Stefanie Jegelka, Stephen Bates, Tommi Jaakkola

Abstract: arXiv:2507.08980v1 Announce Type: new Abstract: Diffusion models can be improved with additional guidance towards more effective representations of input. Indeed, prior empirical work has already shown that aligning internal representations of the diffusion model with those of pre-trained models improves generation quality. In this paper, we present a systematic framework for incorporating representation guidance into diffusion models. We provide alternative decompositions of denoising models along with their associated training criteria, where the decompositions determine when and how the auxiliary representations are incorporated. Guided by our theoretical insights, we introduce two new strategies for enhancing representation alignment in diffusion models. First, we pair examples with target representations either derived from themselves or arisen from different synthetic modalities, and subsequently learn a joint model over the multimodal pairs. Second, we design an optimal training curriculum that balances representation learning and data generation. Our experiments across image, protein sequence, and molecule generation tasks demonstrate superior performance as well as accelerated training. In particular, on the class-conditional ImageNet $256\times 256$ benchmark, our guidance results in $23.3$ times faster training than the original SiT-XL as well as four times speedup over the state-of-the-art method REPA. The code is available at https://github.com/ChenyuWang-Monica/REED.

Comment: Matches criterion 5: Shows a significant advance in the performance of diffusion models.

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2507.10029 [page] [pdf]

Authors: Seokeon Choi, Sunghyun Park, Hyoungwoo Park, Jeongho Kim, Sungrack Yun

Abstract: arXiv:2507.10029v1 Announce Type: new Abstract: Memory-efficient personalization is critical for adapting text-to-image diffusion models while preserving user privacy and operating within the limited computational resources of edge devices. To this end, we propose a selective optimization framework that adaptively chooses between backpropagation on low-resolution images (BP-low) and zeroth-order optimization on high-resolution images (ZO-high), guided by the characteristics of the diffusion process. As observed in our experiments, BP-low efficiently adapts the model to target-specific features, but suffers from structural distortions due to resolution mismatch. Conversely, ZO-high refines high-resolution details with minimal memory overhead but faces slow convergence when applied without prior adaptation. By complementing both methods, our framework leverages BP-low for effective personalization while using ZO-high to maintain structural consistency, achieving memory-efficient and high-quality fine-tuning. To maximize the efficacy of both BP-low and ZO-high, we introduce a timestep-aware probabilistic function that dynamically selects the appropriate optimization strategy based on diffusion timesteps. This function mitigates the overfitting from BP-low at high timesteps, where structural information is critical, while ensuring ZO-high is applied more effectively as training progresses. Experimental results demonstrate that our method achieves competitive performance while significantly reducing memory consumption, enabling scalable, high-quality on-device personalization without increasing inference latency.
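
The selective-optimization idea can be sketched with two ingredients: a timestep-dependent rule for choosing between a backprop branch and a zeroth-order branch, and a standard two-point zeroth-order gradient estimate that stores no activation memory. The sigmoid selector, step sizes, and the resolution-free toy objective below are assumptions; only the general BP-vs-ZO structure follows the abstract.

```python
import torch

def choose_branch(t, T, sharpness=5.0):
    """Timestep-aware selector (illustrative sigmoid): favour BP on low-resolution
    images at small timesteps and zeroth-order updates on high-resolution images at
    large timesteps, where structural information matters most."""
    p_bp_low = torch.sigmoid(torch.tensor(sharpness * (0.5 - t / T)))
    return "bp_low" if torch.rand(()) < p_bp_low else "zo_high"

def zo_gradient(loss_fn, params, mu=1e-3):
    """Two-point zeroth-order (SPSA-style) gradient estimate: no backprop graph is
    built, so memory overhead stays minimal."""
    u = torch.randn_like(params)
    return (loss_fn(params + mu * u) - loss_fn(params - mu * u)) / (2 * mu) * u

# Toy loop: pick a branch per diffusion timestep and update a parameter vector.
params = torch.zeros(10)
loss_fn = lambda p: (p - 1.0).pow(2).sum()      # resolution-free stand-in objective
for t in range(0, 1000, 100):
    if choose_branch(t, T=1000) == "zo_high":
        params -= 0.1 * zo_gradient(loss_fn, params)
    else:
        p = params.clone().requires_grad_(True)
        loss_fn(p).backward()
        params -= 0.1 * p.grad
```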

Comment: Matches criterion 5 for improving text-to-image diffusion models.

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2507.09212 [page] [pdf]

Authors: Jonas Scholz, Richard E. Turner

Abstract: arXiv:2507.09212v1 Announce Type: new Abstract: Iterative generative models, like diffusion and flow-matching, create high-fidelity samples by progressively refining a noise vector into data. However, this process is notoriously slow, often requiring hundreds of function evaluations. We introduce the warm-start model, a simple, deterministic model that dramatically accelerates conditional generation by providing a better starting point. Instead of starting generation from an uninformed N(0, I) prior, our warm-start model predicts an informed prior N(mu, sigma), whose moments are conditioned on the input context. This "warm start" substantially reduces the distance the generative process must traverse, particularly when the conditioning information is strongly informative. On tasks like image inpainting, our method achieves results competitive with a 1000-step DDPM baseline using only 11 total function evaluations (1 for the warm start, 10 for generation). A simple conditional normalization trick makes our method compatible with any standard generative model and sampler without modification, allowing it to be combined with other efficient sampling techniques for further acceleration. Our implementation is available at https://github.com/jonas-scholz123/warm-start-model.
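
The warm-start mechanism is easy to sketch: a small network predicts mu and sigma from the conditioning context, sampling starts from N(mu, sigma) instead of N(0, I), and only a few refinement steps follow. The toy `WarmStart` network and the placeholder `denoise_step` below are illustrative; the paper additionally relies on a conditional normalization trick to stay compatible with off-the-shelf samplers.

```python
import torch
import torch.nn as nn

class WarmStart(nn.Module):
    """Toy warm-start model: predict an informed prior N(mu, sigma) from the
    conditioning context (e.g. the visible part of an image to be inpainted)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 2 * dim))

    def forward(self, context):
        mu, log_sigma = self.net(context).chunk(2, dim=-1)
        return mu, log_sigma.exp()

def generate(warm_start, denoise_step, context, steps=10):
    """Start sampling from the predicted prior instead of N(0, I), then run only a
    few refinement steps of any pretrained sampler (`denoise_step` is a placeholder)."""
    mu, sigma = warm_start(context)
    x = mu + sigma * torch.randn_like(mu)       # informed starting point
    for t in reversed(range(steps)):
        x = denoise_step(x, t, context)
    return x

warm = WarmStart()
dummy_step = lambda x, t, c: 0.9 * x            # stand-in for a real sampler update
sample = generate(warm, dummy_step, torch.randn(4, 64))
```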

Comment: Matches criterion 5 as it proposes a method to accelerate generative models like diffusion models.

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2507.09052 [page] [pdf]

Authors: Fang Chen, Alex Villa, Gongbo Liang, Xiaoyi Lu, Meng Tang

Abstract: arXiv:2507.09052v1 Announce Type: new Abstract: Training data for class-conditional image synthesis often exhibit a long-tailed distribution with limited images for tail classes. Such an imbalance causes mode collapse and reduces the diversity of synthesized images for tail classes. For class-conditional diffusion models trained on imbalanced data, we aim to improve the diversity of tail class images without compromising the fidelity and diversity of head class images. We achieve this by introducing two deceptively simple but highly effective contrastive loss functions. Firstly, we employ an unsupervised InfoNCE loss utilizing negative samples to increase the distance/dissimilarity among synthetic images, particularly for tail classes. To further enhance the diversity of tail classes, our second loss is an MSE loss that contrasts class-conditional generation with unconditional generation at large timesteps. This second loss makes the denoising process insensitive to class conditions for the initial steps, which enriches tail classes through knowledge sharing from head classes. Conditional-unconditional alignment has been shown to enhance the performance of long-tailed GAN. We are the first to adapt such alignment to diffusion models. We successfully leveraged contrastive learning for class-imbalanced diffusion models. Our contrastive learning framework is easy to implement and outperforms standard DDPM and alternative methods for class-imbalanced diffusion models across various datasets, including CIFAR10/100-LT, PlacesLT, TinyImageNetLT, and ImageNetLT.
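
Both loss ideas can be sketched generically: a repulsion term that spreads synthetic-image features apart (in the spirit of the unsupervised InfoNCE loss), and an alignment term that pulls class-conditional noise predictions toward unconditional ones at large timesteps. The uniformity-style repulsion, the hard timestep threshold, and the toy tensors below are assumptions; the paper's exact positive/negative construction may differ.

```python
import torch
import torch.nn.functional as F

def diversity_repulsion(feats, temperature=0.1):
    """Uniformity-style repulsion among synthetic-image features: penalise high
    pairwise similarity so generations (especially for tail classes) spread apart."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / temperature
    off_diag = sim[~torch.eye(len(feats), dtype=torch.bool)]
    return torch.logsumexp(off_diag, dim=0) - torch.log(torch.tensor(float(off_diag.numel())))

def cond_uncond_alignment(eps_cond, eps_uncond, t, t_threshold=800):
    """At large timesteps, pull class-conditional noise predictions toward the
    unconditional ones so early denoising is insensitive to the class label."""
    weight = (t >= t_threshold).float()
    per_sample = (eps_cond - eps_uncond.detach()).pow(2).mean(dim=(1, 2, 3))
    return (weight * per_sample).mean()

# Toy tensors standing in for encoder features and U-Net noise predictions.
loss1 = diversity_repulsion(torch.randn(16, 128))
loss2 = cond_uncond_alignment(torch.randn(16, 3, 8, 8), torch.randn(16, 3, 8, 8),
                              t=torch.randint(0, 1000, (16,)))
```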

Comment: Matches criterion 5 as it discusses improving diffusion models for class-imbalanced data using contrastive learning.

Relevance: 5 Novelty: 6 Back to [topic] [top]


ArXiv: 2507.09748 [page] [pdf]

Authors: Yu Lei, Bingde Liu, Qingsong Xie, Haonan Lu, Zhijie Deng

Abstract: arXiv:2507.09748v1 Announce Type: new Abstract: Text-to-3D generation based on score distillation of pre-trained 2D diffusion models has gained increasing interest, with variational score distillation (VSD) as a remarkable example. VSD proves that vanilla score distillation can be improved by introducing an extra score-based model, which characterizes the distribution of images rendered from 3D models, to correct the distillation gradient. Despite the theoretical foundations, VSD, in practice, is likely to suffer from slow and sometimes ill-posed convergence. In this paper, we perform an in-depth investigation of the interplay between the introduced score model and the 3D model, and find that there exists a mismatching problem between LoRA and 3D distributions in practical implementation. We can simply adjust their optimization order to improve the generation quality. By doing so, the score model looks ahead to the current 3D state and hence yields more reasonable corrections. Nevertheless, naive lookahead VSD may suffer from unstable training in practice due to the potential over-fitting. To address this, we propose to use a linearized variant of the model for score distillation, giving rise to the Linearized Lookahead Variational Score Distillation ($L^2$-VSD). $L^2$-VSD can be realized efficiently with forward-mode autodiff functionalities of existing deep learning libraries. Extensive experiments validate the efficacy of $L^2$-VSD, revealing its clear superiority over prior score distillation-based methods. We also show that our method can be seamlessly incorporated into any other VSD-based text-to-3D framework.

Comment: Matches criterion 6 closely as it discusses advancements in text-to-3D generation using generative models.

Relevance: 5 Novelty: 6 Back to [topic] [top]


ArXiv: 2507.10437 [page] [pdf]

Authors: Shanshan Zhong, Jiawei Peng, Zehan Zheng, Zhongzhan Huang, Wufei Ma, Guofeng Zhang, Qihao Liu, Alan Yuille, Jieneng Chen

Abstract: arXiv:2507.10437v1 Announce Type: new Abstract: Existing methods for reconstructing animatable 3D animals from videos typically rely on sparse semantic keypoints to fit parametric models. However, obtaining such keypoints is labor-intensive, and keypoint detectors trained on limited animal data are often unreliable. To address this, we propose 4D-Animal, a novel framework that reconstructs animatable 3D animals from videos without requiring sparse keypoint annotations. Our approach introduces a dense feature network that maps 2D representations to SMAL parameters, enhancing both the efficiency and stability of the fitting process. Furthermore, we develop a hierarchical alignment strategy that integrates silhouette, part-level, pixel-level, and temporal cues from pre-trained 2D visual models to produce accurate and temporally coherent reconstructions across frames. Extensive experiments demonstrate that 4D-Animal outperforms both model-based and model-free baselines. Moreover, the high-quality 3D assets generated by our method can benefit other 3D tasks, underscoring its potential for large-scale applications. The code is released at https://github.com/zhongshsh/4D-Animal.

Comment: Matches criterion 7 closely.

Relevance: 5 Novelty: 6 Back to [topic] [top]


ArXiv: 2507.09168 [page] [pdf]

Authors: Haiming Zhu, Yangyang Xu, Chenshu Xu, Tingrui Shen, Wenxi Liu, Yong Du, Jun Yu, Shengfeng He

Abstract: arXiv:2507.09168v1 Announce Type: new Abstract: Text-guided image and 3D editing have advanced with diffusion-based models, yet methods like Delta Denoising Score often struggle with stability, spatial control, and editing strength. These limitations stem from reliance on complex auxiliary structures, which introduce conflicting optimization signals and restrict precise, localized edits. We introduce Stable Score Distillation (SSD), a streamlined framework that enhances stability and alignment in the editing process by anchoring a single classifier to the source prompt. Specifically, SSD utilizes the Classifier-Free Guidance (CFG) equation to achieve cross-prompt alignment, and introduces a constant term null-text branch to stabilize the optimization process. This approach preserves the original content's structure and ensures that editing trajectories are closely aligned with the source prompt, enabling smooth, prompt-specific modifications while maintaining coherence in surrounding regions. Additionally, SSD incorporates a prompt enhancement branch to boost editing strength, particularly for style transformations. Our method achieves state-of-the-art results in 2D and 3D editing tasks, including NeRF and text-driven style edits, with faster convergence and reduced complexity, providing a robust and efficient solution for text-guided editing.

Comment: Related to criterion 5.

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2507.08965 [page] [pdf]

Authors: Kevin Rojas, Ye He, Chieh-Hsin Lai, Yuta Takida, Yuki Mitsufuji, Molei Tao

Abstract: arXiv:2507.08965v1 Announce Type: new Abstract: Classifier-Free Guidance (CFG) is a widely used technique for conditional generation and improving sample quality in continuous diffusion models, and recent works have extended it to discrete diffusion. This paper theoretically analyzes CFG in the context of masked discrete diffusion, focusing on the role of guidance schedules. Our analysis shows that high guidance early in sampling (when inputs are heavily masked) harms generation quality, while late-stage guidance has a larger effect. These findings provide a theoretical explanation for empirical observations in recent studies on guidance schedules. The analysis also reveals an imperfection of the current CFG implementations. These implementations can unintentionally cause imbalanced transitions, such as unmasking too rapidly during the early stages of generation, which degrades the quality of the resulting samples. To address this, we draw insight from the analysis and propose a novel classifier-free guidance mechanism empirically applicable to any discrete diffusion. Intuitively, our method smoothens the transport between the data distribution and the initial (masked/uniform) distribution, which results in improved sample quality. Remarkably, our method is achievable via a simple one-line code change. The efficacy of our method is empirically demonstrated with experiments on ImageNet (masked discrete diffusion) and QM9 (uniform discrete diffusion).
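
The qualitative finding (weak guidance while most tokens are masked, stronger guidance late in sampling) can be sketched as a guidance schedule plugged into the usual CFG combination of conditional and unconditional logits. Both the linear ramp and the logit-space mixing below are illustrative assumptions and are not the specific one-line mechanism proposed in the paper.

```python
import torch

def scheduled_cfg_logits(cond_logits, uncond_logits, frac_unmasked, w_max=3.0):
    """CFG combination of conditional/unconditional logits with a guidance weight
    that stays near 1 while most tokens are masked and grows late in sampling."""
    scale = 1.0 + w_max * frac_unmasked          # weak early, strong late
    return uncond_logits + scale * (cond_logits - uncond_logits)

# Toy usage inside one sampling step over a sequence of categorical tokens.
seq_len, vocab = 16, 32
cond = torch.randn(seq_len, vocab)
uncond = torch.randn(seq_len, vocab)
probs = torch.softmax(scheduled_cfg_logits(cond, uncond, frac_unmasked=0.25), dim=-1)
```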

Comment: Related to criterion 5.

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2507.10552 [page] [pdf]

Authors: Vladimir Iashin, Horace Lee, Dan Schofield, Andrew Zisserman

Abstract: arXiv:2507.10552v1 Announce Type: new Abstract: Camera traps are revolutionising wildlife monitoring by capturing vast amounts of visual data; however, the manual identification of individual animals remains a significant bottleneck. This study introduces a fully self-supervised approach to learning robust chimpanzee face embeddings from unlabeled camera-trap footage. Leveraging the DINOv2 framework, we train Vision Transformers on automatically mined face crops, eliminating the need for identity labels. Our method demonstrates strong open-set re-identification performance, surpassing supervised baselines on challenging benchmarks such as Bossou, despite utilising no labelled data during training. This work underscores the potential of self-supervised learning in biodiversity monitoring and paves the way for scalable, non-invasive population studies.

Comment: The paper presents a self-supervised learning approach for face embedding in wildlife monitoring, which is relevant to self-supervised learning methods.

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2507.10306 [page] [pdf]

Authors: Ozge Mercanoglu Sincan, Richard Bowden

Abstract: arXiv:2507.10306v1 Announce Type: new Abstract: Sign Language Translation (SLT) aims to convert sign language videos into spoken or written text. While early systems relied on gloss annotations as an intermediate supervision, such annotations are costly to obtain and often fail to capture the full complexity of continuous signing. In this work, we propose a two-phase, dual visual encoder framework for gloss-free SLT, leveraging contrastive visual-language pretraining. During pretraining, our approach employs two complementary visual backbones whose outputs are jointly aligned with each other and with sentence-level text embeddings via a contrastive objective. During the downstream SLT task, we fuse the visual features and input them into an encoder-decoder model. On the Phoenix-2014T benchmark, our dual encoder architecture consistently outperforms its single stream variants and achieves the highest BLEU-4 score among existing gloss-free SLT approaches.

Comment: The paper discusses a contrastive pretraining method for sign language translation, which involves self-supervised learning.

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2507.09984 [page] [pdf]

Authors: Junho Lee, Jeongwoo Shin, Hyungwook Choi, Joonseok Lee

Abstract: arXiv:2507.09984v1 Announce Type: new Abstract: In spite of remarkable potential of the Latent Diffusion Models (LDMs) in image generation, the desired properties and optimal design of the autoencoders have been underexplored. In this work, we analyze the role of autoencoders in LDMs and identify three key properties: latent smoothness, perceptual compression quality, and reconstruction quality. We demonstrate that existing autoencoders fail to simultaneously satisfy all three properties, and propose Variational Masked AutoEncoders (VMAEs), taking advantage of the hierarchical features maintained by Masked AutoEncoder. We integrate VMAEs into the LDM framework, introducing Latent Diffusion Models with Masked AutoEncoders (LDMAEs). Through comprehensive experiments, we demonstrate significantly enhanced image generation quality and computational efficiency.

Comment: Related to criterion 1.

Relevance: 5 Novelty: 6 Back to [topic] [top]

ArXiv: 2507.09362 [page] [pdf]

Authors: Assaf Marron, Smadar Szekely, Irun Cohen, David Harel

Abstract: arXiv:2507.09362v1 Announce Type: new Abstract: An autoencoder (AE) is a neural network that, using self-supervised training, learns a succinct parameterized representation, and a corresponding encoding and decoding process, for all instances in a given class. Here, we introduce the concept of a meta-autoencoder (MAE): an AE for a collection of autoencoders. Given a family of classes that differ from each other by the values of some parameters, and a trained AE for each class, an MAE for the family is a neural net that has learned a compact representation and associated encoder and decoder for the class-specific AEs. One application of this general concept is in research and modeling of natural evolution -- capturing the defining and the distinguishing properties across multiple species that are dynamically evolving from each other and from common ancestors. In this interim report we provide a constructive definition of MAEs, initial examples, and the motivating research directions in machine learning and biology.
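
A literal minimal reading of the definition can be sketched directly: flatten the parameters of each class-specific autoencoder into a vector and train another autoencoder on those vectors. Everything below (compressing raw weights, the tiny architectures) is an assumption for illustration; the paper only gives a constructive definition and initial examples.

```python
import torch
import torch.nn as nn

def flatten_params(model):
    """Concatenate a model's parameters into one vector -- the 'instance' that a
    meta-autoencoder would compress."""
    return torch.cat([p.detach().flatten() for p in model.parameters()])

# A family of tiny class-specific autoencoders (stand-ins for trained AEs).
class_aes = [nn.Sequential(nn.Linear(8, 4), nn.Linear(4, 8)) for _ in range(20)]
weight_vectors = torch.stack([flatten_params(ae) for ae in class_aes])

# The meta-autoencoder: an AE whose training instances are the AEs' weight vectors.
dim = weight_vectors.shape[1]
meta_ae = nn.Sequential(nn.Linear(dim, 16), nn.ReLU(), nn.Linear(16, dim))
loss = (meta_ae(weight_vectors) - weight_vectors).pow(2).mean()
loss.backward()
```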

Comment: The paper introduces meta-autoencoders, which is a novel concept in self-supervised learning.

Relevance: 3 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09323 [page] [pdf]

Authors: Kaixuan Cong, Yifan Wang, Rongkun Xue, Yuyang Jiang, Yiming Feng, Jing Yang

Abstract: arXiv:2507.09323v1 Announce Type: new Abstract: Humans do not understand individual events in isolation; rather, they generalize concepts within classes and compare them to others. Existing audio-video pre-training paradigms only focus on the alignment of the overall audio-video modalities, without considering the reinforcement of distinguishing easily confused classes through cognitive induction and contrast during training. This paper proposes the Dynamic Inter-Class Confusion-Aware Encoder (DICCAE), an encoder that aligns audio-video representations at a fine-grained, category-level. DICCAE addresses category confusion by dynamically adjusting the confusion loss based on inter-class confusion degrees, thereby enhancing the model's ability to distinguish between similar activities. To further extend the application of DICCAE, we also introduce a novel training framework that incorporates both audio and video modalities, as well as their fusion. To mitigate the scarcity of audio-video data in the human activity recognition task, we propose a cluster-guided audio-video self-supervised pre-training strategy for DICCAE. DICCAE achieves near state-of-the-art performance on the VGGSound dataset, with a top-1 accuracy of 65.5%. We further evaluate its feature representation quality through extensive ablation studies, validating the necessity of each module.

Comment: The paper introduces a self-supervised pre-training strategy for audio-video fusion, which is relevant to self-supervised learning methods.

Relevance: 5 Novelty: 5 Back to [topic] [top]

ArXiv: 2507.10284 [page] [pdf]

Authors: Venkat Margapuri

Abstract: arXiv:2507.10284v1 Announce Type: new Abstract: Visual coverage path planning with unmanned aerial vehicles (UAVs) requires agents to strategically coordinate UAV motion and camera control to maximize coverage, minimize redundancy, and maintain battery efficiency. Traditional reinforcement learning (RL) methods rely on environment-specific reward formulations that lack semantic adaptability. This study proposes Prompt-Informed Reinforcement Learning (PIRL), a novel approach that integrates the zero-shot reasoning ability and in-context learning capability of large language models with curiosity-driven RL. PIRL leverages semantic feedback from an LLM, GPT-3.5, to dynamically shape the reward function of the Proximal Policy Optimization (PPO) RL policy guiding the agent in position and camera adjustments for optimal visual coverage. The PIRL agent is trained using OpenAI Gym and evaluated in various environments. Furthermore, the sim-to-real-like ability and zero-shot generalization of the agent are tested by operating the agent in Webots simulator which introduces realistic physical dynamics. Results show that PIRL outperforms multiple learning-based baselines such as PPO with static rewards, PPO with exploratory weight initialization, imitation learning, and an LLM-only controller. Across different environments, PIRL outperforms the best-performing baseline by achieving up to 14% higher visual coverage in OpenAI Gym and 27% higher in Webots, up to 25% higher battery efficiency, and up to 18% lower redundancy, depending on the environment. The results highlight the effectiveness of LLM-guided reward shaping in complex spatial exploration tasks and suggest a promising direction for integrating natural language priors into RL for robotics.

Comment: Related to criterion 2.

Relevance: 5 Novelty: 5 Back to [topic] [top]

ArXiv: 2507.08828 [page] [pdf]

Authors: Tarek Berghout

Abstract: arXiv:2507.08828v1 Announce Type: new Abstract: This paper introduces Recurrent Expansion (RE) as a new learning paradigm that advances beyond conventional Machine Learning (ML) and Deep Learning (DL). While DL focuses on learning from static data representations, RE proposes an additional dimension: learning from the evolving behavior of models themselves. RE emphasizes multiple mappings of data through identical deep architectures and analyzes their internal representations (i.e., feature maps) in conjunction with observed performance signals such as loss. By incorporating these behavioral traces, RE enables iterative self-improvement, allowing each model version to gain insight from its predecessors. The framework is extended through Multiverse RE (MVRE), which aggregates signals from parallel model instances, and further through Heterogeneous MVRE (HMVRE), where models of varying architectures contribute diverse perspectives. A scalable and adaptive variant, Sc-HMVRE, introduces selective mechanisms and scale diversity for real-world deployment. Altogether, RE presents a shift in DL: from purely representational learning to behavior-aware, self-evolving systems. It lays the groundwork for a new class of intelligent models capable of reasoning over their own learning dynamics, offering a path toward scalable, introspective, and adaptive artificial intelligence. A simple code example to support beginners in running their own experiments is provided in Code Availability Section of this paper.

Comment: Does not match any specific criteria but introduces a new learning paradigm.

Relevance: 3 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09915 [page] [pdf]

Authors: Siyue Yao, Mingjie Sun, Eng Gee Lim, Ran Yi, Baojiang Zhong, Moncef Gabbouj

Abstract: arXiv:2507.09915v1 Announce Type: new Abstract: The scarcity of data in various scenarios, such as medical, industry and autonomous driving, leads to model overfitting and dataset imbalance, thus hindering effective detection and segmentation performance. Existing studies employ the generative models to synthesize more training samples to mitigate data scarcity. However, these synthetic samples are repetitive or simplistic and fail to provide "crucial information" that targets the downstream model's weaknesses. Additionally, these methods typically require separate training for different objects, leading to computational inefficiencies. To address these issues, we propose Crucial-Diff, a domain-agnostic framework designed to synthesize crucial samples. Our method integrates two key modules. The Scene Agnostic Feature Extractor (SAFE) utilizes a unified feature extractor to capture target information. The Weakness Aware Sample Miner (WASM) generates hard-to-detect samples using feedback from the detection results of downstream model, which is then fused with the output of SAFE module. Together, our Crucial-Diff framework generates diverse, high-quality training data, achieving a pixel-level AP of 83.63% and an F1-MAX of 78.12% on MVTec. On polyp dataset, Crucial-Diff reaches an mIoU of 81.64% and an mDice of 87.69%. Code will be released after acceptance.

Comment: Matches criterion 5.

Relevance: 5 Novelty: 4 Back to [topic] [top]

ArXiv: 2507.09574 [page] [pdf]

Authors: Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Junjie Hu

Abstract: arXiv:2507.09574v1 Announce Type: new Abstract: Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: https://github.com/HaozheZhao/MENTOR

Comment:

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2507.10539 [page] [pdf]

Authors: Tao Feng, Yexin Wu, Guanyu Lin, Jiaxuan You

Abstract: arXiv:2507.10539v1 Announce Type: new Abstract: World models (WMs) demonstrate strong capabilities in prediction, generation, and planning tasks. Existing WMs primarily focus on unstructured data and cannot leverage the ubiquitous structured data, often represented as graphs, in the digital world. While multiple graph foundation models have been proposed, they focus on graph learning tasks and cannot extend to diverse multi-modal data and interdisciplinary tasks. To address these challenges, we propose the Graph World Model (GWM), a world model that supports both unstructured and graph-structured states with multi-modal information and represents diverse tasks as actions. The core of a GWM is a generic message-passing algorithm to aggregate structured information, either over a unified multi-modal token space by converting multi-modal data into text (GWM-T) or a unified multi-modal embedding space by modality-specific encoders (GWM-E). Notably, GWM introduces action nodes to support diverse tasks, where action nodes are linked to other nodes via direct reference or similarity computation. Extensive experiments on six tasks from diverse domains, including multi-modal generation and matching, recommendation, graph prediction, multi-agent, retrieval-augmented generation, and planning and optimization, show that the same GWM outperforms or matches domain-specific baselines' performance, benefits from multi-hop structures, and demonstrates strong zero-shot/few-shot capabilities on unseen new tasks. Our code for GWM is released at https://github.com/ulab-uiuc/GWM.
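
The abstract's core ingredient is a generic message-passing step over multi-modal node embeddings, with action nodes linked to the nodes they act on. The sketch below is a loose illustration of that aggregation under assumed shapes and update rule; it is not the GWM code.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """Each node mixes its own embedding with the mean of its neighbours."""

    def __init__(self, dim: int):
        super().__init__()
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, x, adj):
        # x: (num_nodes, dim) node embeddings; adj: (num_nodes, num_nodes) 0/1 edges
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neighbour_mean = adj @ x / deg  # aggregate structured context
        return self.update(torch.cat([x, neighbour_mean], dim=-1))

# Example: 4 content nodes plus 1 "action" node linked to nodes 0 and 2.
x = torch.randn(5, 64)
adj = torch.zeros(5, 5)
adj[4, [0, 2]] = 1.0
adj[[0, 2], 4] = 1.0
out = MessagePassingLayer(64)(x, adj)
```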

Comment: No criteria match closely.

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2507.10543 [page] [pdf]

Authors: Juyi Sheng, Ziyi Wang, Peiming Li, Mengyuan Liu

Abstract: arXiv:2507.10543v1 Announce Type: new Abstract: In robot manipulation, robot learning has become a prevailing approach. However, generative models within this field face a fundamental trade-off between the slow, iterative sampling of diffusion models and the architectural constraints of faster Flow-based methods, which often rely on explicit consistency losses. To address these limitations, we introduce MP1, which pairs 3D point-cloud inputs with the MeanFlow paradigm to generate action trajectories in one network function evaluation (1-NFE). By directly learning the interval-averaged velocity via the MeanFlow Identity, our policy avoids any additional consistency constraints. This formulation eliminates numerical ODE-solver errors during inference, yielding more precise trajectories. MP1 further incorporates CFG for improved trajectory controllability while retaining 1-NFE inference without reintroducing structural constraints. Because subtle scene-context variations are critical for robot learning, especially in few-shot learning, we introduce a lightweight Dispersive Loss that repels state embeddings during training, boosting generalization without slowing inference. We validate our method on the Adroit and Meta-World benchmarks, as well as in real-world scenarios. Experimental results show MP1 achieves superior average task success rates, outperforming DP3 by 10.2% and FlowPolicy by 7.3%. Its average inference time is only 6.8 ms, 19x faster than DP3 and nearly 2x faster than FlowPolicy. Our code is available at https://mp1-2254.github.io/.
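
The 1-NFE sampling step is simple enough to sketch: a network trained to predict the interval-averaged velocity maps a noise sample to an action chunk in a single call. The interface, the (r=0, t=1) convention, and the network shape below are assumptions for illustration, not the MP1 implementation.

```python
import torch
import torch.nn as nn

class MeanVelocityNet(nn.Module):
    """Predicts the velocity averaged over the interval [r, t] (assumed interface)."""

    def __init__(self, action_dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + cond_dim + 2, 256), nn.SiLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, z, cond, r, t):
        rt = torch.stack([r, t], dim=-1).expand(z.shape[0], 2)
        return self.net(torch.cat([z, cond, rt], dim=-1))

@torch.no_grad()
def sample_action_chunk(model, cond, action_dim):
    z = torch.randn(cond.shape[0], action_dim)  # noise sample at t = 1
    r, t = torch.zeros(1), torch.ones(1)
    u = model(z, cond, r, t)                    # averaged velocity over [0, 1]
    return z - u                                # single network function evaluation
```

For instance, `sample_action_chunk(MeanVelocityNet(7, 128), torch.randn(4, 128), 7)` draws four 7-dimensional actions in one forward pass.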

Comment: No specific criteria match.

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2507.09118 [page] [pdf]

Authors: Linlan Huang, Xusheng Cao, Haori Lu, Yifan Meng, Fei Yang, Xialei Liu

Abstract: arXiv:2507.09118v1 Announce Type: new Abstract: Continual learning aims to enable models to learn sequentially from continuously incoming data while retaining performance on previously learned tasks. With the Contrastive Language-Image Pre-trained model (CLIP) exhibiting strong capabilities across various downstream tasks, there has been growing interest in leveraging CLIP for continual learning in such scenarios. Most existing works overlook the inherent modality gap in CLIP, a key factor in its generalization and adaptability. In this paper, we analyze the variations in the modality gap during the fine-tuning of vision-language pre-trained models. Our observations reveal that the modality gap effectively reflects the extent to which pre-trained knowledge is preserved. Based on these insights, we propose a simple yet effective method, MG-CLIP, that improves CLIP's performance in class-incremental learning. Our approach leverages modality gap preservation to mitigate forgetting and modality gap compensation to enhance the capacity for new data, introducing a novel modality-gap-based perspective for continual learning. Extensive experiments on multiple benchmarks demonstrate that our method outperforms existing approaches without requiring additional replay data. Our code is available at https://github.com/linlany/MindtheGap.
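
The quantity the method builds on, the modality gap, can be monitored with a few lines: the offset between the centroids of normalized image and text embeddings. This is our reading of the statistic, not the authors' code.

```python
import torch
import torch.nn.functional as F

def modality_gap(image_embeds: torch.Tensor, text_embeds: torch.Tensor) -> float:
    """Distance between the image and text embedding centroids on the unit sphere."""
    img = F.normalize(image_embeds, dim=-1).mean(dim=0)
    txt = F.normalize(text_embeds, dim=-1).mean(dim=0)
    return (img - txt).norm().item()

# e.g. track modality_gap(img_feats, txt_feats) before and after each task increment
# as a cheap proxy for how much pre-trained structure is being preserved.
```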

Comment: No specific criteria match.

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2507.10375 [page] [pdf]

Authors: Utkarsh Singhal, Ryan Feng, Stella X. Yu, Atul Prakash

Abstract: arXiv:2507.10375v1 Announce Type: new Abstract: Real-world visual perception requires invariance to diverse transformations, yet current methods rely heavily on specialized architectures or training on predefined augmentations, limiting generalization. We propose FOCAL, a test-time, data-driven framework that achieves robust perception by leveraging internet-scale visual priors from foundation models. By generating and optimizing candidate transformations toward visually typical, "canonical" views, FOCAL enhances robustness without re-training or architectural changes. Our experiments demonstrate improved robustness of CLIP and SAM across challenging transformations, including 2D/3D rotations, illumination shifts (contrast and color), and day-night variations. We also highlight potential applications in active vision. Our approach challenges the assumption that transform-specific training is necessary, instead offering a scalable path to invariance. Our code is available at: https://github.com/sutkarsh/focal.
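
The test-time recipe amounts to searching over candidate transformations and keeping the most "canonical" view under a foundation-model score. A hedged sketch, where the candidate set (rotations) and the scoring function are placeholder assumptions rather than the paper's exact objective:

```python
import torch
from torchvision.transforms import functional as TF

def canonicalize(image, score_fn, angles=(0, 90, 180, 270)):
    """Return the candidate view that a foundation-model prior rates as most typical."""
    best_view, best_score = image, float("-inf")
    for angle in angles:
        view = TF.rotate(image, angle)
        s = score_fn(view)  # e.g. CLIP similarity to a generic prompt such as "a photo of an object"
        if s > best_score:
            best_view, best_score = view, s
    return best_view
```

The downstream model (CLIP, SAM, a classifier) then runs on the canonicalized view instead of the raw input.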

Comment: No specific criteria match.

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2507.09071 [page] [pdf]

Authors: Tharun Adithya Srikrishnan, Deval Shah, Steven K. Reinhardt

Abstract: arXiv:2507.09071v1 Announce Type: new Abstract: Large vision-language models (VLMs) enable the joint processing of text and images. However, the inclusion of vision data significantly expands the prompt length. Along with the quadratic complexity of the attention computation, this results in a longer prefill duration. An approach to mitigate this bottleneck is to leverage the inherent sparsity in the attention computation. In our analysis of attention patterns in VLMs, we observe that a substantial portion of layers exhibit minimal cross-image attention, except through attention-sink tokens per image. These sparse attention patterns fall into distinct categories: sink-only, document mask and a hybrid document-sink mask. Based on this, we propose BlindSight: a training-free approach to optimize VLM inference using an input-template-aware attention sparsity mask. We utilize samples from a dataset to derive a prompt-agnostic sparsity categorization for every attention head. We evaluate the proposed technique using VLMs such as Qwen2-VL, Qwen2.5-VL and Gemma-3. BlindSight results in a 32%-41% reduction in FLOPs on average with accuracy within -2% to +2% of the original model in most evaluated multi-image understanding benchmarks.
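
The sparsity patterns described above can be expressed directly as attention masks. The sketch below builds a combined document-plus-sink mask for a multi-image prompt under assumed token-span conventions; it illustrates the mask structure, not BlindSight's per-head categorization procedure.

```python
import torch

def document_sink_mask(image_spans, seq_len):
    """image_spans: list of (start, end) token ranges, one per image."""
    allowed = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for start, end in image_spans:
        allowed[start:end, start:end] = True  # within-image (document) block
        allowed[:, start] = True              # each image's first token acts as a sink
    allowed.fill_diagonal_(True)              # every token sees itself
    causal = torch.ones(seq_len, seq_len).tril().bool()
    return allowed & causal                   # keep causal ordering

# Two images of 100 tokens each, followed by 20 text tokens.
mask = document_sink_mask([(0, 100), (100, 200)], seq_len=220)
```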

Comment: No specific criteria match.

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2507.09491 [page] [pdf]

Authors: Yiyang Zhou, Linjie Li, Shi Qiu, Zhengyuan Yang, Yuyang Zhao, Siwei Han, Yangfan He, Kangqi Li, Haonian Ji, Zihao Zhao, Haibo Tong, Lijuan Wang, Huaxiu Yao

Abstract: arXiv:2507.09491v1 Announce Type: new Abstract: Existing video benchmarks often resemble image-based benchmarks, with question types like "What actions does the person perform throughout the video?" or "What color is the woman's dress in the video?" For these, models can often answer by scanning just a few key frames, without deep temporal reasoning. This limits our ability to assess whether large vision-language models (LVLMs) can truly think with videos rather than perform superficial frame-level analysis. To address this, we introduce GLIMPSE, a benchmark specifically designed to evaluate whether LVLMs can genuinely think with videos. Unlike prior benchmarks, GLIMPSE emphasizes comprehensive video understanding beyond static image cues. It consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories, including Trajectory Analysis, Temporal Reasoning, and Forensics Detection. All questions are carefully crafted by human annotators and require watching the entire video and reasoning over the full video context; this is what we mean by thinking with video. These questions cannot be answered by scanning selected frames or relying on text alone. In human evaluations, GLIMPSE achieves 94.82% accuracy, but current LVLMs face significant challenges. Even the best-performing model, GPT-o3, reaches only 66.43%, highlighting that LVLMs still struggle to move beyond surface-level reasoning to truly think with videos.

Comment: No specific criteria match.

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2507.09230 [page] [pdf]

Authors: G. Kutay Türkoglu, Julian Tanke, Iheb Belgacem, Lev Markhasin

Abstract: arXiv:2507.09230v1 Announce Type: new Abstract: An ideal digital telepresence experience requires accurate replication of a person's body, clothing, and movements. To capture and transfer these movements into virtual reality, the egocentric (first-person) perspective can be adopted, which enables the use of a portable and cost-effective device without front-view cameras. However, this viewpoint introduces challenges such as occlusions and distorted body proportions. There are few works reconstructing human appearance from egocentric views, and none use a generative prior-based approach. Some methods create avatars from a single egocentric image during inference, but still rely on multi-view datasets during training. To our knowledge, this is the first study using a generative backbone to reconstruct animatable avatars from egocentric inputs. Based on Stable Diffusion, our method reduces training burden and improves generalizability. Inspired by methods such as SiTH and MagicMan, which perform 360-degree reconstruction from a frontal image, we introduce a pipeline that generates realistic frontal views from occluded top-down images using ControlNet and a Stable Diffusion backbone. Our goal is to convert a single top-down egocentric image into a realistic frontal representation and feed it into an image-to-motion model. This enables generation of avatar motions from minimal input, paving the way for more accessible and generalizable telepresence systems.

Comment: No specific criteria match.

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2507.10302 [page] [pdf]

Authors: Jiahe Zhao, Rongkun Zheng, Yi Wang, Helin Wang, Hengshuang Zhao

Abstract: arXiv:2507.10302v1 Announce Type: new Abstract: In video Multimodal Large Language Models (video MLLMs), the visual encapsulation process plays a pivotal role in converting video contents into representative tokens for LLM input. While linear projectors are widely employed for encapsulation, they introduce semantic indistinctness and temporal incoherence when applied to videos. Conversely, the structure of resamplers shows promise in tackling these challenges, but an effective solution remains unexplored. Drawing inspiration from resampler structures, we introduce DisCo, a novel visual encapsulation method designed to yield semantically distinct and temporally coherent visual tokens for video MLLMs. DisCo integrates two key components: (1) A Visual Concept Discriminator (VCD) module, assigning unique semantics to visual tokens by associating them in pairs with discriminative concepts in the video. (2) A Temporal Focus Calibrator (TFC) module, ensuring consistent temporal focus of visual tokens to video elements across every video frame. Through extensive experiments on multiple video MLLM frameworks, we demonstrate that DisCo remarkably outperforms previous state-of-the-art methods across a variety of video understanding benchmarks, while also achieving higher token efficiency thanks to the reduction of semantic indistinctness. The code: https://github.com/ZJHTerry18/DisCo.

Comment:

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2507.10118 [page] [pdf]

Authors: Ivan Martinović, Josip Šarić, Marin Oršić, Matej Kristan, Siniša Šegvić

Abstract: arXiv:2507.10118v1 Announce Type: new Abstract: Pixel-level annotation is expensive and time-consuming. Semi-supervised segmentation methods address this challenge by learning models on few labeled images alongside a large corpus of unlabeled images. Although foundation models could further account for label scarcity, effective mechanisms for their exploitation remain underexplored. We address this by devising a novel semi-supervised panoptic approach fueled by two dedicated foundation models. We enhance recognition by complementing unsupervised mask-transformer consistency with zero-shot classification of CLIP features. We enhance localization by class-agnostic decoder warm-up with respect to SAM pseudo-labels. The resulting decoupled enhancement of recognition and localization (DEARLi) particularly excels in the most challenging semi-supervised scenarios with large taxonomies and limited labeled data. Moreover, DEARLi outperforms the state of the art in semi-supervised semantic segmentation by a large margin while requiring 8x less GPU memory, in spite of being trained only for the panoptic objective. We observe 29.9 PQ and 38.9 mIoU on ADE20K with only 158 labeled images. The source code is available at https://github.com/helen1c/DEARLi.
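
One of the two foundation-model ingredients, zero-shot classification of class-agnostic mask proposals with CLIP, is easy to sketch with the Hugging Face CLIP API; the checkpoint and prompt template below are placeholder choices, and this is a generic recipe rather than the DEARLi code.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def classify_crops(crops, class_names):
    """Assign each mask crop (a list of PIL images) the most similar class prompt."""
    prompts = [f"a photo of a {c}" for c in class_names]
    inputs = processor(text=prompts, images=crops, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image      # (num_crops, num_classes)
    return logits.softmax(dim=-1).argmax(dim=-1)   # pseudo-label per mask crop
```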

Comment: The paper discusses a semi-supervised approach for panoptic segmentation, which is not directly related to self-supervised learning or video segmentation.

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2507.09391 [page] [pdf]

Authors: Peter Pao-Huang, Mitchell Black, Xiaojie Qiu

Abstract: arXiv:2507.09391v1 Announce Type: new Abstract: Generative modeling of graphs with spatial structure is essential across many applications from computer graphics to spatial genomics. Recent flow-based generative models have achieved impressive results by gradually adding and then learning to remove noise from these graphs. Existing models, however, use graph neural network architectures that are independent of the noise level, limiting their expressiveness. To address this issue, we introduce Noise-Conditioned Graph Networks (NCGNs), a class of graph neural networks that dynamically modify their architecture according to the noise level during generation. Our theoretical and empirical analysis reveals that as noise increases, (1) graphs require information from increasingly distant neighbors and (2) graphs can be effectively represented at lower resolutions. Based on these insights, we develop Dynamic Message Passing (DMP), a specific instantiation of NCGNs that adapts both the range and resolution of message passing to the noise level. DMP consistently outperforms noise-independent architectures on a variety of domains including 3D point clouds, spatiotemporal transcriptomics, and images. Code is available at https://github.com/peterpaohuang/ncgn.
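
The first insight (longer-range neighbours at higher noise) suggests a graph construction whose radius grows with the noise level. The sketch below is an assumed instantiation of that idea, not the paper's DMP layer.

```python
import torch

def noise_conditioned_edges(pos, noise_level, base_radius=0.1, growth=1.0):
    """Build a radius graph whose connectivity range increases with the noise level."""
    # pos: (N, d) node coordinates; noise_level in [0, 1]
    radius = base_radius * (1.0 + growth * noise_level)
    dist = torch.cdist(pos, pos)                              # pairwise distances
    edge_index = (dist <= radius).nonzero(as_tuple=False).T   # (2, num_edges)
    return edge_index

pos = torch.rand(128, 3)
edges_clean = noise_conditioned_edges(pos, noise_level=0.1)
edges_noisy = noise_conditioned_edges(pos, noise_level=0.9)   # denser, longer-range graph
```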

Comment: No criteria match closely.

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2507.09061 [page] [pdf]

Authors: Thomas T. Zhang, Daniel Pfrommer, Nikolai Matni, Max Simchowitz

Abstract: arXiv:2507.09061v1 Announce Type: new Abstract: We study the problem of imitating an expert demonstrator in a continuous state-and-action dynamical system. While imitation learning in discrete settings such as autoregressive language modeling has seen immense success and popularity in recent years, imitation in physical settings such as autonomous driving and robot learning has proven comparably more complex due to the compounding errors problem, often requiring elaborate set-ups to perform stably. Recent work has demonstrated that even in benign settings, exponential compounding errors are unavoidable when learning solely from expert-controlled trajectories, suggesting the need for more advanced policy parameterizations or data augmentation. To this end, we present minimal interventions that provably mitigate compounding errors in continuous state-and-action imitation learning. When the system is open-loop stable, we prescribe "action chunking," i.e., predicting and playing sequences of actions in open-loop; when the system is possibly unstable, we prescribe "noise injection," i.e., adding noise during expert demonstrations. These interventions align with popular choices in modern robot learning, though the benefits we derive are distinct from the effects they were designed to target. Our results draw insights and tools from both control theory and reinforcement learning; however, our analysis reveals novel considerations that do not naturally arise when either literature is considered in isolation.
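
The "action chunking" intervention is easy to state in code: predict a short sequence of actions from the current observation and execute it open-loop before re-querying the policy. The policy and environment interfaces below are placeholders, shown only to make the control flow concrete.

```python
def rollout_with_chunking(env, policy, horizon=200, chunk_size=8):
    """Execute chunks of actions open-loop, re-planning only between chunks."""
    obs, _ = env.reset()
    total_reward, t = 0.0, 0
    while t < horizon:
        action_chunk = policy(obs)                 # shape: (chunk_size, action_dim)
        for action in action_chunk[:chunk_size]:   # play the chunk without feedback
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            t += 1
            if terminated or truncated or t >= horizon:
                return total_reward
    return total_reward
```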

Comment: Does not match any specific criteria but is related to robotics and imitation learning, which is a general interest area.

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2507.10442 [page] [pdf]

Authors: Shivam Chandhok, Wan-Cyuan Fan, Vered Shwartz, Vineeth N Balasubramanian, Leonid Sigal

Abstract: arXiv:2507.10442v1 Announce Type: new Abstract: Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. Such models have been shown to be highly capable, but, at the same time, lacking some basic visual understanding skills. In this paper, we set out to understand the limitations of SoTA VLMs on fundamental visual tasks by constructing a series of tests that probe which components of design, specifically, may be lacking. Importantly, we go significantly beyond the current benchmarks, which simply measure the final performance of VLM response, by also comparing and contrasting it to the performance of probes trained directly on features obtained from the visual encoder, intermediate vision-language projection and LLM-decoder output. In doing so, we uncover shortcomings in VLMs and make a number of important observations about their capabilities, robustness and how they process visual information. We hope our insights will guide progress in further improving VLMs.

Comment: Does not match any specific criteria. It focuses on understanding limitations of vision-language models, which is not directly related to the criteria.

Relevance: 3 Novelty: 5 Back to [topic] [top]

ArXiv: 2507.09117 [page] [pdf]

Authors: Gagan Khandate

Abstract: arXiv:2507.09117v1 Announce Type: new Abstract: Dexterous intelligence -- the ability to perform complex interactions with multi-fingered hands -- is a pinnacle of human physical intelligence and emergent higher-order cognitive skills. However, contrary to Moravec's paradox, dexterous intelligence in humans appears simple only superficially. Many million years were spent co-evolving the human brain and hands including rich tactile sensing. Achieving human-level dexterity with robotic hands has long been a fundamental goal in robotics and represents a critical milestone toward general embodied intelligence. In this pursuit, computational sensorimotor learning has made significant progress, enabling feats such as arbitrary in-hand object reorientation. However, we observe that achieving higher levels of dexterity requires overcoming very fundamental limitations of computational sensorimotor learning. I develop robot learning methods for highly dexterous multi-fingered manipulation by directly addressing these limitations at their root cause. Chiefly, through key studies, this dissertation progressively builds an effective framework for reinforcement learning of dexterous multi-fingered manipulation skills. These methods adopt structured exploration, effectively overcoming the limitations of random exploration in reinforcement learning. The insights gained culminate in a highly effective reinforcement learning method that incorporates sampling-based planning for direct exploration. Additionally, this thesis explores a new paradigm of using visuo-tactile human demonstrations for dexterity, introducing corresponding imitation learning techniques.

Comment: The paper discusses reinforcement learning and imitation learning for dexterous manipulation, which is related to robotics but does not specifically mention vision language models.

Relevance: 3 Novelty: 4 Back to [topic] [top]

ArXiv: 2507.09612 [page] [pdf]

Authors: You Huang, Lichao Chen, Jiayi Ji, Liujuan Cao, Shengchuan Zhang, Rongrong Ji

Abstract: arXiv:2507.09612v1 Announce Type: new Abstract: Interactive segmentation (IS) improves annotation efficiency by segmenting target regions from user prompts, with widespread applications in real-world scenarios. Current approaches face a critical trade-off: dense-token methods achieve superior accuracy and detail preservation but suffer from prohibitively slow processing on CPU devices, while the Segment Anything Model (SAM) advances the field with sparse prompt tokens for fast inference but compromises segmentation quality. In this paper, we propose Inter2Former to address this challenge by optimizing computation allocation in dense-token processing, which introduces four key enhancements. First, we propose Dynamic Prompt Embedding (DPE) that adaptively processes only regions of interest while avoiding additional overhead from background tokens. Second, we introduce Dynamic Hybrid Attention (DHA), which leverages previous segmentation masks to route tokens through either full attention (O(N^2)) for boundary regions or our proposed efficient BSQ attention (O(N)) for non-boundary regions. Third, we develop Hybrid Mixture of Experts (HMoE), which applies similar adaptive computation strategies in FFN modules with CPU-optimized parallel processing. Finally, we present Dynamic Local Upsampling (DLU), a reverse operation of DPE, which localizes objects with a lightweight MLP and performs fine-grained upsampling only in detected regions. Experimental results on high-precision IS benchmarks demonstrate that Inter2Former achieves SOTA performance with high efficiency on CPU devices.

Comment:

Relevance: 3 Novelty: 4 Back to [topic] [top]

ArXiv: 2507.09184 [page] [pdf]

Authors: Qiyan Zhao, Xiaofeng Zhang, Yiheng Li, Yun Xing, Xiaosong Yuan, Feilong Tang, Sinan Fan, Xuhang Chen, Xuyao Zhang, Dahan Wang

Abstract: arXiv:2507.09184v1 Announce Type: new Abstract: Hallucinations pose a significant challenge in Large Vision Language Models (LVLMs), with misalignment between multimodal features identified as a key contributing factor. This paper reveals the negative impact of the long-term decay in Rotary Position Encoding (RoPE), used for positional modeling in LVLMs, on multimodal alignment. Concretely, under long-term decay, instruction tokens exhibit uneven perception of image tokens located at different positions within the two-dimensional space: prioritizing image tokens from the bottom-right region since in the one-dimensional sequence, these tokens are positionally closer to the instruction tokens. This biased perception leads to insufficient image-instruction interaction and suboptimal multimodal alignment. We refer to this phenomenon as image alignment bias. To enhance instruction's perception of image tokens at different spatial locations, we propose MCA-LLaVA, based on Manhattan distance, which extends the long-term decay to a two-dimensional, multi-directional spatial decay. MCA-LLaVA integrates the one-dimensional sequence order and two-dimensional spatial position of image tokens for positional modeling, mitigating hallucinations by alleviating image alignment bias. Experimental results of MCA-LLaVA across various hallucination and general benchmarks demonstrate its effectiveness and generality. The code can be accessed in https://github.com/ErikZ719/MCA-LLaVA.
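
The geometric quantity behind the method, a Manhattan-distance decay over the 2D grid of image tokens, can be computed directly; the exponential form below is an assumption for illustration, not the exact MCA-LLaVA formulation.

```python
import torch

def manhattan_decay(grid_h, grid_w, ref_row, ref_col, gamma=0.1):
    """Decay weights for image tokens based on |dx| + |dy| from a reference position."""
    rows = torch.arange(grid_h).view(-1, 1).expand(grid_h, grid_w)
    cols = torch.arange(grid_w).view(1, -1).expand(grid_h, grid_w)
    dist = (rows - ref_row).abs() + (cols - ref_col).abs()   # Manhattan distance
    return torch.exp(-gamma * dist.float())                  # (grid_h, grid_w) weights

# Decay measured from a reference near the bottom-right, where instruction tokens
# sit closest in the flattened 1D order; distant tokens are attenuated in all
# directions rather than only along the raster order.
weights = manhattan_decay(24, 24, ref_row=23, ref_col=23)
```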

Comment: No criteria match closely.

Relevance: 3 Novelty: 4 Back to [topic] [top]

ArXiv: 2507.08979 [page] [pdf]

Authors: Mahdiyar Molahasani, Azadeh Motamedi, Michael Greenspan, Il-Min Kim, Ali Etemad

Abstract: arXiv:2507.08979v1 Announce Type: new Abstract: We introduce Projection-based Reduction of Implicit Spurious bias in vision-language Models (PRISM), a new data-free and task-agnostic solution for bias mitigation in VLMs like CLIP. VLMs often inherit and amplify biases in their training data, leading to skewed predictions. PRISM is designed to debias VLMs without relying on predefined bias categories or additional external data. It operates in two stages: first, an LLM is prompted with simple class prompts to generate scene descriptions that contain spurious correlations. Next, PRISM uses our novel contrastive-style debiasing loss to learn a projection that maps the embeddings onto a latent space that minimizes spurious correlations while preserving the alignment between image and text embeddings. Extensive experiments demonstrate that PRISM outperforms current debiasing methods on the commonly used Waterbirds and CelebA datasets. We make our code public at: https://github.com/MahdiyarMM/PRISM.

Comment: No criteria match closely.

Relevance: 3 Novelty: 4 Back to [topic] [top]

ArXiv: 2507.08873 [page] [pdf]

Authors: Shaoran Yang, Dongyu Wei, Hanzhi Yu, Zhaohui Yang, Yuchen Liu, Mingzhe Chen

Abstract: arXiv:2507.08873v1 Announce Type: new Abstract: In this paper, a novel contrastive language-image pre-training (CLIP) model based semantic communication framework is designed. Compared to standard neural network (e.g., convolutional neural network) based semantic encoders and decoders that require joint training over a common dataset, our CLIP model based method does not require any training procedures, thus enabling a transmitter to extract the meaning of the original data without neural network model training, and the receiver to train a neural network for follow-up task implementation without communicating with the transmitter. Next, we investigate the deployment of the CLIP model based semantic framework over a noisy wireless network. Since the semantic information generated by the CLIP model is susceptible to wireless noise and the spectrum used for semantic information transmission is limited, it is necessary to jointly optimize CLIP model architecture and spectrum resource block (RB) allocation to maximize semantic communication performance while considering wireless noise, the delay and energy used for semantic communication. To achieve this goal, we use a proximal policy optimization (PPO) based reinforcement learning (RL) algorithm to learn how wireless noise affects the semantic communication performance, thus finding the optimal CLIP model and RB allocation for each user. Simulation results show that our proposed method improves the convergence rate by up to 40%, and the accumulated reward by 4x compared to soft actor-critic.

Comment: No specific criteria match.

Relevance: 3 Novelty: 4 Back to [topic] [top]

ArXiv: 2507.10496 [page] [pdf]

Authors: Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, Angjoo Kanazawa

Abstract: arXiv:2507.10496v1 Announce Type: new Abstract: Transformers are increasingly prevalent for multi-view computer vision tasks, where geometric relationships between viewpoints are critical for 3D perception. To leverage these relationships, multi-view transformers must use camera geometry to ground visual tokens in 3D space. In this work, we compare techniques for conditioning transformers on cameras: token-level raymap encodings, attention-level relative pose encodings, and a new relative encoding we propose -- Projective Positional Encoding (PRoPE) -- that captures complete camera frustums, both intrinsics and extrinsics, as a relative positional encoding. Our experiments begin by showing how relative camera conditioning improves performance in feedforward novel view synthesis, with further gains from PRoPE. This holds across settings: scenes with both shared and varying intrinsics, when combining token- and attention-level conditioning, and for generalization to inputs with out-of-distribution sequence lengths and camera intrinsics. We then verify that these benefits persist for different tasks, stereo depth estimation and discriminative spatial cognition, as well as larger model sizes.

Comment: No specific criteria match.

Relevance: 3 Novelty: 4 Back to [topic] [top]

ArXiv: 2507.10500 [page] [pdf]

Authors: Kyungtae Han, Yitao Chen, Rohit Gupta, Onur Altintas

Abstract: arXiv:2507.10500v1 Announce Type: new Abstract: While autonomous driving technologies continue to advance, current Advanced Driver Assistance Systems (ADAS) remain limited in their ability to interpret scene context or engage with drivers through natural language. These systems typically rely on predefined logic and lack support for dialogue-based interaction, making them inflexible in dynamic environments or when adapting to driver intent. This paper presents Scene-Aware Conversational ADAS (SC-ADAS), a modular framework that integrates Generative AI components including large language models, vision-to-text interpretation, and structured function calling to enable real-time, interpretable, and adaptive driver assistance. SC-ADAS supports multi-turn dialogue grounded in visual and sensor context, allowing natural language recommendations and driver-confirmed ADAS control. Implemented in the CARLA simulator with cloud-based Generative AI, the system executes confirmed user intents as structured ADAS commands without requiring model fine-tuning. We evaluate SC-ADAS across scene-aware, conversational, and revisited multi-turn interactions, highlighting trade-offs such as increased latency from vision-based context retrieval and token growth from accumulated dialogue history. These results demonstrate the feasibility of combining conversational reasoning, scene perception, and modular ADAS control to support the next generation of intelligent driver assistance.

Comment: No specific criteria match.

Relevance: 3 Novelty: 4 Back to [topic] [top]

ArXiv: 2507.09177 [page] [pdf]

Authors: Zichen Liu, Guoji Fu, Chao Du, Wee Sun Lee, Min Lin

Abstract: arXiv:2507.09177v1 Announce Type: new Abstract: Continual reinforcement learning (CRL) refers to a naturalistic setting where an agent needs to endlessly evolve, by trial and error, to solve multiple tasks that are presented sequentially. One of the largest obstacles to CRL is that the agent may forget how to solve previous tasks when learning a new task, known as catastrophic forgetting. In this paper, we propose to address this challenge by planning with online world models. Specifically, we learn a Follow-The-Leader shallow model online to capture the world dynamics, in which we plan using model predictive control to solve a set of tasks specified by any reward functions. The online world model is immune to forgetting by construction with a proven regret bound of $\mathcal{O}(\sqrt{K^2D\log(T)})$ under mild assumptions. The planner searches actions solely based on the latest online model, thus forming a FTL Online Agent (OA) that updates incrementally. To assess OA, we further design Continual Bench, a dedicated environment for CRL, and compare with several strong baselines under the same model-planning algorithmic framework. The empirical results show that OA learns continuously to solve new tasks while not forgetting old skills, outperforming agents built on deep world models with various continual learning techniques.
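
The framework combines a Follow-The-Leader model, refit on all transitions seen so far, with model predictive control against the current task reward. The sketch below uses a linear least-squares model and random-shooting MPC as stand-ins; it illustrates the recipe, not the paper's implementation.

```python
import numpy as np

class FTLLinearModel:
    """Shallow dynamics model refit on the full history at every update (FTL)."""

    def __init__(self):
        self.X, self.Y, self.W = [], [], None

    def update(self, state, action, next_state):
        self.X.append(np.concatenate([state, action]))
        self.Y.append(next_state)
        X, Y = np.array(self.X), np.array(self.Y)
        self.W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # least-squares refit on all data

    def predict(self, state, action):
        return np.concatenate([state, action]) @ self.W

def plan(model, state, reward_fn, action_dim, horizon=10, samples=256):
    """Random-shooting MPC: score sampled action sequences under the current model."""
    best_action, best_ret = None, -np.inf
    for _ in range(samples):
        actions = np.random.uniform(-1, 1, size=(horizon, action_dim))
        s, ret = state.copy(), 0.0
        for a in actions:
            s = model.predict(s, a)
            ret += reward_fn(s, a)
        if ret > best_ret:
            best_action, best_ret = actions[0], ret
    return best_action
```

Because the planner only consults the latest refit model and any reward function can be plugged in, no task-specific parameters exist to be forgotten.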

Comment: No specific criteria match.

Relevance: 3 Novelty: 4 Back to [topic] [top]

ArXiv: 2507.10376 [page] [pdf]

Authors: Mohammadhossein Talebi, Pragyan Dahal, Davide Possenti, Stefano Arrigoni, Francesco Braghin

Abstract: arXiv:2507.10376v1 Announce Type: new Abstract: Autonomous driving systems are highly dependent on sensors like cameras, LiDAR, and inertial measurement units (IMU) to perceive the environment and estimate their motion. Among these sensors, perception-based sensors are not protected from harsh weather and technical failures. Although existing methods show robustness against common technical issues like rotational misalignment and disconnection, they often degrade when faced with dynamic environmental factors like weather conditions. To address these problems, this research introduces a novel deep learning-based motion estimator that integrates visual, inertial, and millimeter-wave radar data, utilizing each sensor's strengths to improve odometry estimation accuracy and reliability under adverse environmental conditions such as snow, rain, and varying light. The proposed model uses advanced sensor fusion techniques that dynamically adjust the contributions of each sensor based on the current environmental condition, with radar compensating for visual sensor limitations in poor visibility. This work explores recent advancements in radar-based odometry and highlights that radar's robustness in different weather conditions makes it a valuable component for pose estimation systems, specifically when visual sensors are degraded. Experimental results, conducted on the Boreas dataset, showcase the robustness and effectiveness of the model in both clear and degraded environments.

Comment: No specific criteria match.

Relevance: 3 Novelty: 4 Back to [topic] [top]

ArXiv: 2507.10293 [page] [pdf]

Authors: Wenkang Han, Wang Lin, Yiyun Zhou, Qi Liu, Shulei Wang, Chang Yao, Jingyuan Chen

Abstract: arXiv:2507.10293v1 Announce Type: new Abstract: Face Video Restoration (FVR) aims to recover high-quality face videos from degraded versions. Traditional methods struggle to preserve fine-grained, identity-specific features when degradation is severe, often producing average-looking faces that lack individual characteristics. To address these challenges, we introduce IP-FVR, a novel method that leverages a high-quality reference face image as a visual prompt to provide identity conditioning during the denoising process. IP-FVR incorporates semantically rich identity information from the reference image using decoupled cross-attention mechanisms, ensuring detailed and identity-consistent results. For intra-clip identity drift (within 24 frames), we introduce an identity-preserving feedback learning method that combines cosine similarity-based reward signals with suffix-weighted temporal aggregation. This approach effectively minimizes drift within sequences of frames. For inter-clip identity drift, we develop an exponential blending strategy that aligns identities across clips by iteratively blending frames from previous clips during the denoising process. This method ensures consistent identity representation across different clips. Additionally, we enhance the restoration process with a multi-stream negative prompt, guiding the model's attention to relevant facial attributes and minimizing the generation of low-quality or incorrect features. Extensive experiments on both synthetic and real-world datasets demonstrate that IP-FVR outperforms existing methods in both quality and identity preservation, showcasing its substantial potential for practical applications in face video restoration.
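
The inter-clip "exponential blending" idea reduces to a weighted mix of the previous clip's trailing latents into the new clip's leading latents, with exponentially decaying weights. The schedule below is an assumption for illustration, not the paper's exact blending rule.

```python
import torch

def exponential_blend(prev_tail: torch.Tensor, new_head: torch.Tensor, decay: float = 0.5):
    """Blend overlapping latents across a clip boundary with decaying influence."""
    # prev_tail, new_head: (num_overlap_frames, C, H, W) latents
    n = new_head.shape[0]
    w = decay ** torch.arange(1, n + 1, dtype=new_head.dtype)  # 0.5, 0.25, 0.125, ...
    w = w.view(-1, 1, 1, 1)
    return w * prev_tail + (1 - w) * new_head  # early frames lean on the previous clip
```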

Comment: No specific criteria match.

Relevance: 3 Novelty: 4 Back to [topic] [top]

ArXiv: 2507.09615 [page] [pdf]

Authors: Eman Ali, Sathira Silva, Chetan Arora, Muhammad Haris Khan

Abstract: arXiv:2507.09615v1 Announce Type: new Abstract: Vision-language models (VLMs) like CLIP excel in zero-shot learning by aligning image and text representations through contrastive pretraining. Existing approaches to unsupervised adaptation (UA) for fine-grained classification with VLMs either rely on fixed alignment scores that cannot capture evolving, subtle class distinctions or use computationally expensive pseudo-labeling strategies that limit scalability. In contrast, we show that modeling fine-grained cross-modal interactions during adaptation produces more accurate, class-discriminative pseudo-labels and substantially improves performance over state-of-the-art (SOTA) methods. We introduce Fine-grained Alignment and Interaction Refinement (FAIR), an innovative approach that dynamically aligns localized image features with descriptive language embeddings through a set of Class Description Anchors (CDA). This enables the definition of a Learned Alignment Score (LAS), which incorporates CDA as an adaptive classifier, facilitating cross-modal interactions to improve self-training in unsupervised adaptation. Furthermore, we propose a self-training weighting mechanism designed to refine pseudo-labels in the presence of inter-class ambiguities. Our approach, FAIR, delivers a substantial performance boost in fine-grained unsupervised adaptation, achieving a notable overall gain of 2.78% across 13 fine-grained datasets compared to SOTA methods.

Comment: No criteria match closely.

Relevance: 3 Novelty: 4 Back to [topic] [top]

ArXiv: 2507.09514 [page] [pdf]

Authors: Tien-Yu Chi, Hung-Yueh Chiang, Diana Marculescu, Kai-Chiang Wu

Abstract: arXiv:2507.09514v1 Announce Type: new Abstract: State space models (SSMs) reduce the quadratic complexity of transformers by leveraging linear recurrence. Recently, VMamba has emerged as a strong SSM-based vision backbone, yet remains bottlenecked by spatial redundancy in its four-directional scan. We propose QuarterMap, a post-training activation pruning method that removes redundant spatial activations before scanning and restores dimensions via nearest-neighbor upsampling. Our method improves throughput without retraining. On ImageNet-1K, QuarterMap achieves up to 11% speedup on VMamba with less than 0.9% accuracy drop, and yields similar gains on ADE20K segmentation. Beyond VMamba, we validate QuarterMap on MedMamba, a domain-specific model that shares the same four-directional scanning structure, where it consistently improves throughput while preserving accuracy across multiple medical imaging tasks. Compared to token merging methods like ToMe, QuarterMap is tailored for SSMs and avoids costly merge-unmerge operations. Our method offers a plug-and-play tool for deployment-time efficiency without compromising transferability.
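
The prune-then-restore pattern is straightforward to sketch: drop spatially redundant activations before the scan (here via simple 2x2 striding, an assumed stand-in for QuarterMap's pruning rule) and restore the resolution afterwards with nearest-neighbour upsampling, all post-training.

```python
import torch
import torch.nn.functional as F

def prune_activations(x: torch.Tensor, stride: int = 2) -> torch.Tensor:
    # x: (B, C, H, W) feature map -> keep one position per stride x stride block
    return x[:, :, ::stride, ::stride]

def restore_activations(x: torch.Tensor, out_hw) -> torch.Tensor:
    # Nearest-neighbour upsampling back to the original spatial resolution.
    return F.interpolate(x, size=out_hw, mode="nearest")

x = torch.randn(1, 96, 56, 56)
pruned = prune_activations(x)                         # (1, 96, 28, 28): ~4x fewer positions to scan
restored = restore_activations(pruned, x.shape[-2:])  # back to (1, 96, 56, 56)
```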

Comment: No criteria match closely.

Relevance: 3 Novelty: 4 Back to [topic] [top]

ArXiv: 2507.09815 [page] [pdf]

Authors: Younggun Kim, Ahmed S. Abdelrahman, Mohamed Abdel-Aty

Abstract: arXiv:2507.09815v1 Announce Type: new Abstract: Ensuring the safety of vulnerable road users (VRUs), such as pedestrians and cyclists, is a critical challenge for autonomous driving systems, as crashes involving VRUs often result in severe or fatal consequences. While multimodal large language models (MLLMs) have shown promise in enhancing scene understanding and decision making in autonomous vehicles, there is currently no standardized benchmark to quantitatively evaluate their reasoning abilities in complex, safety-critical scenarios involving VRUs. To address this gap, we present VRU-Accident, a large-scale vision-language benchmark designed to evaluate MLLMs in high-risk traffic scenarios involving VRUs. VRU-Accident comprises 1K real-world dashcam accident videos, annotated with 6K multiple-choice question-answer pairs across six safety-critical categories (with 24K candidate options and 3.4K unique answer choices), as well as 1K dense scene descriptions. Unlike prior works, our benchmark focuses explicitly on VRU-vehicle accidents, providing rich, fine-grained annotations that capture both spatial-temporal dynamics and causal semantics of accidents. To assess the current landscape of MLLMs, we conduct a comprehensive evaluation of 17 state-of-the-art models on the multiple-choice VQA task and on the dense captioning task. Our findings reveal that while MLLMs perform reasonably well on visually grounded attributes, they face significant challenges in reasoning and describing accident causes, types, and preventability.

Comment: No criteria match closely.

Relevance: 3 Novelty: 4 Back to [topic] [top]

ArXiv: 2507.10143 [page] [pdf]

Authors: David Calhas, Arlindo L. Oliveira

Abstract: arXiv:2507.10143v1 Announce Type: new Abstract: While biological vision systems rely heavily on feedback connections to iteratively refine perception, most artificial neural networks remain purely feedforward, processing input in a single static pass. In this work, we propose a predictive coding inspired feedback mechanism that introduces a recurrent loop from output to input, allowing the model to refine its internal state over time. We implement this mechanism within a standard U-Net architecture and introduce two biologically motivated operations, softmax projection and exponential decay, to ensure stability of the feedback loop. Through controlled experiments on a synthetic segmentation task, we show that the feedback model significantly outperforms its feedforward counterpart in noisy conditions and generalizes more effectively with limited supervision. Notably, feedback achieves above random performance with just two training examples, while the feedforward model requires at least four. Our findings demonstrate that feedback enhances robustness and data efficiency, and offer a path toward more adaptive and biologically inspired neural architectures. Code is available at: github.com/DCalhas/feedback_segmentation.
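
The feedback loop described above, softmax-projected outputs fed back to the input with an exponential decay, can be sketched as follows; the mixing scheme and the model interface are our assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def feedback_inference(model, image, num_classes, steps=4, decay=0.5):
    """model is assumed to map (B, C_in + num_classes, H, W) -> (B, num_classes, H, W)."""
    b, _, h, w = image.shape
    feedback = image.new_zeros(b, num_classes, h, w)
    logits = None
    for t in range(steps):
        logits = model(torch.cat([image, feedback], dim=1))
        # Softmax projection plus exponential decay keep the recurrent loop bounded.
        feedback = (decay ** (t + 1)) * F.softmax(logits, dim=1)
    return logits
```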

Comment: Does not match any specific criteria but is related to segmentation, which is a general interest area.

Relevance: 3 Novelty: 4 Back to [topic] [top]

ArXiv: 2507.09139 [page] [pdf]

Authors: Dewen Zhang, Tahir Hussain, Wangpeng An, Hayaru Shouno

Abstract: arXiv:2507.09139v1 Announce Type: new Abstract: Human pose estimation traditionally relies on architectures that encode keypoint priors, limiting their generalization to novel poses or unseen keypoints. Recent language-guided approaches like LocLLM reformulate keypoint localization as a vision-language task, enabling zero-shot generalization through textual descriptions. However, LocLLM's linear projector fails to capture complex spatial-textual interactions critical for high-precision localization. To address this, we propose PoseLLM, the first Large Language Model (LLM)-based pose estimation framework that replaces the linear projector with a nonlinear MLP vision-language connector. This lightweight two-layer MLP with GELU activation enables hierarchical cross-modal feature transformation, enhancing the fusion of visual patches and textual keypoint descriptions. Trained exclusively on COCO data, PoseLLM achieves 77.8 AP on the COCO validation set, outperforming LocLLM by +0.4 AP, while maintaining strong zero-shot generalization on Human-Art and MPII. Our work demonstrates that a simple yet powerful nonlinear connector significantly boosts localization accuracy without sacrificing generalization, advancing the state-of-the-art in language-guided pose estimation. Code is available at https://github.com/Ody-trek/PoseLLM.
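
The architectural change is small enough to show directly: swap a linear vision-to-language projector for a two-layer MLP with GELU, as the abstract describes. The dimensions below are placeholders.

```python
import torch.nn as nn

# LocLLM-style baseline: a single linear projection from vision to LLM space.
linear_projector = nn.Linear(1024, 4096)

# PoseLLM-style connector: lightweight two-layer MLP with GELU for nonlinear
# cross-modal feature transformation.
mlp_connector = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)
```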

Comment: Does not match any specific criteria but is related to vision-language models.

Relevance: 3 Novelty: 4 Back to [topic] [top]