This project is adapted from tatsu-lab/gpt_paper_assistant.
Paper selection prompt and criteria (jump to the section by clicking the link):
3. Works on video segmentation that rely on unsupervised and self-supervised learning methods.
1004. Adversarial Masked Autoencoder Purifier with Defense Transferability [more]
Authors: Yuan-Chih Chen, Chun-Shien Lu
1009. Momentum Contrastive Learning with Enhanced Negative Sampling and Hard Negative Filtering [more]
Authors: Duy Hoang, Huy Ngo, Khoi Pham, Tri Nguyen, Gia Bao, Huy Phan
1013. Self-supervised Graph Transformer with Contrastive Learning for Brain Connectivity Analysis towards Improving Autism Detection [more]
Authors: Yicheng Leng, Syed Muhammad Anwar, Islem Rekik, Sen He, Eung-Joo Lee
Back to [top]
2002. Improving Vision-Language-Action Model with Online Reinforcement Learning [more]
Authors: Yanjiang Guo, Jianke Zhang, Xiaoyu Chen, Xiang Ji, Yen-Jen Wang, Yucheng Hu, Jianyu Chen
2003. DIRIGENt: End-To-End Robotic Imitation of Human Demonstrations Based on a Diffusion Model [more]
Authors: Josua Spisak, Matthias Kerzel, Stefan Wermter
2007. BiFold: Bimanual Cloth Folding with Language Guidance [more]
Authors: Oriol Barbany, Adrià Colomé, Carme Torras
2012. Beyond-Labels: Advancing Open-Vocabulary Segmentation With Vision-Language Models [more]
Authors: Muhammad Atta ur Rahman
Back to [top]
5000. Separate Motion from Appearance: Customizing Motion via Customizing Text-to-Video Diffusion Models [more]
Authors: Huijie Liu, Jingyun Wang, Shuai Ma, Jie Hu, Xiaoming Wei, Guoliang Kang
5001. CascadeV: An Implementation of Wurstchen Architecture for Video Generation [more]
Authors: Wenfeng Lin, Jiangchuan Wei, Boyuan Liu, Yichen Zhang, Shiyue Yan, Mingyu Guo
5006. ITVTON: Virtual Try-On Diffusion Transformer Model Based on Integrated Image and Text [more]
Authors: Haifeng Ni
5008. CubeDiff: Repurposing Diffusion-Based Image Models for Panorama Generation [more]
Authors: Nikolai Kalischek, Michael Oechsle, Fabian Manhardt, Philipp Henzler, Konrad Schindler, Federico Tombari
Back to [top]
6005. DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation [more]
Authors: Chenguo Lin, Panwang Pan, Bangbang Yang, Zeming Li, Yadong Mu
6010. Synthesizing 3D Abstractions by Inverting Procedural Buildings with Transformers [more]
Authors: Max Dax, Jordi Berbel, Jan Stria, Leonidas Guibas, Urs Bergmann
6011. Consistency Diffusion Models for Single-Image 3D Reconstruction with Priors [more]
Authors: Chenru Jiang, Chengrui Zhang, Xi Yang, Jie Sun, Kaizhu Huang
Back to [top]
14. Overcoming Semantic Dilution in Transformer-Based Next Frame Prediction [more]
Authors: Hy Nguyen, Srikanth Thudumu, Hung Du, Rajesh Vasa, Kon Mouzakis
15. Predicting 3D representations for Dynamic Scenes [more]
Authors: Di Qi, Tong Yang, Beining Wang, Xiangyu Zhang, Wenqiang Zhang
16. DynAlign: Unsupervised Dynamic Taxonomy Alignment for Cross-Domain Segmentation [more]
Authors: Han Sun, Rui Gong, Ismail Nejjar, Olga Fink
17. Mobile Manipulation Instruction Generation from Multiple Images with Automatic Metric Enhancement [more]
Authors: Kei Katsumata, Motonari Kambara, Daichi Yashima, Ryosuke Korekata, Komei Sugiura
18. TopoNets: High Performing Vision and Language Models with Brain-Like Topography [more]
Authors: Mayukh Deb, Mainak Deb, N. Apurva Ratan Murty
19. PackDiT: Joint Human Motion and Text Generation via Mutual Prompting [more]
Authors: Zhongyu Jiang, Wenhao Chai, Zhuoran Zhou, Cheng-Yen Yang, Hsiang-Wei Huang, Jenq-Neng Hwang
20. MAUCell: An Adaptive Multi-Attention Framework for Video Frame Prediction [more]
Authors: Shreyam Gupta (Indian Institute of Technology), P. Agrawal (University of Colorado, Boulder, USA), Priyam Gupta (Intelligent Field Robotic Systems)
21. Cross-Domain Semantic Segmentation with Large Language Model-Assisted Descriptor Generation [more]
Authors: Philip Hughes, Larry Burns, Luke Adams
22. Unsupervised Domain Adaptation with Dynamic Clustering and Contrastive Refinement for Gait Recognition [more]
Authors: Xiaolei Liu, Yan Sun, Mark Nixon
23. Objects matter: object-centric world models improve reinforcement learning in visually complex environments [more]
Authors: Weipu Zhang, Adam Jelley, Trevor McInroe, Amos Storkey
24. One Head Eight Arms: Block Matrix based Low Rank Adaptation for CLIP-based Few-Shot Learning [more]
Authors: Chunpeng Zhou, Qianqian Shen, Zhi Yu, Jiajun Bu, Haishuai Wang
25. Image-based Geo-localization for Robotics: Are Black-box Vision-Language Models there yet? [more]
Authors: Sania Waheed, Bruno Ferrarini, Michael Milford, Sarvapali D. Ramchurn, Shoaib Ehsan
26. Giving Sense to Inputs: Toward an Accessible Control Framework for Shared Autonomy [more]
Authors: Shalutha Rajapakshe, Jean-Marc Odobez, Emmanuel Senft
27. On the Interplay Between Sparsity and Training in Deep Reinforcement Learning [more]
Authors: Fatima Davelouis, John D. Martin, Michael Bowling
28. RODEO: Robust Outlier Detection via Exposing Adaptive Out-of-Distribution Samples [more]
Authors: Hossein Mirzaei, Mohammad Jafari, Hamid Reza Dehbashi, Ali Ansari, Sepehr Ghobadi, Masoud Hadi, Arshia Soltani Moakhar, Mohammad Azizmalayeri, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
29. DINOSTAR: Deep Iterative Neural Object Detector Self-Supervised Training for Roadside LiDAR Applications [more]
Authors: Muhammad Shahbaz, Shaurya Agarwal
30. Transformer^-1: Input-Adaptive Computation for Resource-Constrained Deployment [more]
Authors: Lumen AI, Tengzhou No. 1 Middle School, Shihao Ji, Zihui Song, Fucheng Zhong, Jisen Jia, Zhaobo Wu, Zheyi Cao, Xu Tianhao
31. Extending Information Bottleneck Attribution to Video Sequences [more]
Authors: Veronika Solopova, Lucas Schmidt, Dorothea Kolossa
32. Dream to Drive with Predictive Individual World Model [more]
Authors: Yinfeng Gao, Qichao Zhang, Da-wei Ding, Dongbin Zhao
33. Modulating CNN Features with Pre-Trained ViT Representations for Open-Vocabulary Object Detection [more]
Authors: Xiangyu Gao, Yu Dai, Benliu Qiu, Hongliang Li
34. RDMM: Fine-Tuned LLM Models for On-Device Robotic Decision Making with Enhanced Contextual Awareness in Specific Domains [more]
Authors: Shady Nasrat, Myungsu Kim, Seonil Lee, Jiho Lee, Yeoncheol Jang, Seung-joon Yi
35. Blockchain-based Crowdsourced Deep Reinforcement Learning as a Service [more]
Authors: Ahmed Alagha, Hadi Otrok, Shakti Singh, Rabeb Mizouni, Jamal Bentahar
36. Scenario Understanding of Traffic Scenes Through Large Visual Language Models [more]
Authors: Rivera Esteban, Lübberstedt Jannik, Nico Uhlemann, Markus Lienkamp
Back to [top]
ArXiv: 2501.16904 [page] [pdf]
Authors: Yuan-Chih Chen, Chun-Shien Lu
Abstract: arXiv:2501.16904v1 Announce Type: new Abstract: The study of adversarial defense still struggles to combat advanced adversarial attacks. In contrast to most prior studies that rely on diffusion models for test-time defense at the cost of remarkably increased inference time, we propose Masked AutoEncoder Purifier (MAEP), which integrates Masked AutoEncoder (MAE) into an adversarial purifier framework for test-time purification. While MAEP achieves promising adversarial robustness, it particularly features model defense transferability and attack generalization without relying on additional data that differs from the training dataset. To our knowledge, MAEP is the first study of an adversarial purifier based on MAE. Extensive experimental results demonstrate that our method not only maintains clean accuracy with only a slight drop but also exhibits a close gap between clean and robust accuracy. Notably, MAEP trained on CIFAR10 achieves state-of-the-art performance even when tested directly on ImageNet, outperforming existing diffusion-based models trained specifically on ImageNet.
Comment: Matches criterion 1: New methodological improvements to self-supervised learning.
Relevance: 5 Novelty: 6 Back to [topic] [top]
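To make the purification idea in the abstract above concrete, here is a minimal, hypothetical sketch of MAE-based test-time purification: mask and reconstruct the (possibly adversarial) input, then classify the reconstruction. The `mae` and `classifier` callables and the 0.75 mask ratio are placeholders, not details taken from the paper.

```python
# Hedged sketch of MAE-based test-time purification (not the authors' MAEP code).
import torch

@torch.no_grad()
def purify_and_classify(x, mae, classifier, mask_ratio=0.75):
    """x: (B, C, H, W) images; `mae` masks patches and returns a full reconstruction."""
    x_purified = mae(x, mask_ratio=mask_ratio)  # reconstruction discards much of the
                                                # pixel-level adversarial perturbation
    return classifier(x_purified)               # classify the purified image
```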
ArXiv: 2501.16360 [page] [pdf]
Authors: Duy Hoang, Huy Ngo, Khoi Pham, Tri Nguyen, Gia Bao, Huy Phan
Abstract: arXiv:2501.16360v1 Announce Type: new Abstract: Contrastive learning has become pivotal in unsupervised representation learning, with frameworks like Momentum Contrast (MoCo) effectively utilizing large negative sample sets to extract discriminative features. However, traditional approaches often overlook the full potential of key embeddings and are susceptible to performance degradation from noisy negative samples in the memory bank. This study addresses these challenges by proposing an enhanced contrastive learning framework that incorporates two key innovations. First, we introduce a dual-view loss function, which ensures balanced optimization of both query and key embeddings, improving representation quality. Second, we develop a selective negative sampling strategy that emphasizes the most challenging negatives based on cosine similarity, mitigating the impact of noise and enhancing feature discrimination. Extensive experiments demonstrate that our framework achieves superior performance on downstream tasks, delivering robust and well-structured representations. These results highlight the potential of optimized contrastive mechanisms to advance unsupervised learning and extend its applicability across domains such as computer vision and natural language processing
Comment: Matches criterion 1.
Relevance: 5 Novelty: 5 Back to [topic] [top]
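As a rough illustration of the hard-negative filtering described in the abstract above, the snippet below keeps only the negatives most similar to each query under cosine similarity. This is one plausible reading of the abstract under stated assumptions (L2-normalized embeddings, a fixed memory bank), not the authors' implementation.

```python
# Hedged sketch of cosine-similarity-based hard-negative selection for a
# MoCo-style memory bank (assumes embeddings are already L2-normalized).
import torch

def select_hard_negatives(query, memory_bank, k=1024):
    """query: (B, D); memory_bank: (N, D) -> (B, k, D) hardest negatives per query."""
    sim = query @ memory_bank.t()        # (B, N) cosine similarities
    topk = sim.topk(k, dim=1).indices    # indices of the most challenging negatives
    return memory_bank[topk]             # gather per-query negative sets
```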
ArXiv: 2501.16346 [page] [pdf]
Authors: Yicheng Leng, Syed Muhammad Anwar, Islem Rekik, Sen He, Eung-Joo Lee
Abstract: arXiv:2501.16346v1 Announce Type: new Abstract: Functional Magnetic Resonance Imaging (fMRI) provides useful insights into brain function during both task and rest. Representing fMRI data using correlation matrices is found to be a reliable method of analyzing the inherent connectivity of the brain in the resting and active states. Graph Neural Networks (GNNs) have been widely used for brain network analysis due to their inherent explainability capability. In this work, we introduce a novel framework using contrastive self-supervised learning graph transformers, incorporating a brain network transformer encoder with random graph alterations. The proposed network leverages both contrastive learning and graph alterations to effectively train the graph transformer for autism detection. Our approach, tested on Autism Brain Imaging Data Exchange (ABIDE) data, demonstrates superior autism detection, achieving an AUROC of 82.6 and an accuracy of 74%, surpassing current state-of-the-art methods.
Comment: Matches criterion 1 closely as it discusses a novel self-supervised learning method using contrastive learning for brain connectivity analysis.
Relevance: 5 Novelty: 4 Back to [topic] [top]
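The "random graph alterations" mentioned above can be pictured as simple stochastic edits to the connectivity graph that serve as contrastive views. The function below (random symmetric edge dropping on a correlation matrix) is an assumed, generic augmentation for illustration, not the paper's exact scheme.

```python
# Hedged sketch of a random graph alteration on a symmetric connectivity matrix.
import torch

def random_edge_drop(adj, drop_prob=0.2):
    """adj: (N, N) symmetric correlation/adjacency matrix -> one augmented view."""
    keep = (torch.rand_like(adj) > drop_prob).float()
    keep = torch.triu(keep, diagonal=1)                                  # decide per upper-triangle edge
    keep = keep + keep.t() + torch.eye(adj.shape[0], device=adj.device)  # symmetrize, keep diagonal
    return adj * keep
```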
ArXiv: 2501.16664 [page] [pdf]
Authors: Yanjiang Guo, Jianke Zhang, Xiaoyu Chen, Xiang Ji, Yen-Jen Wang, Yucheng Hu, Jianyu Chen
Abstract: arXiv:2501.16664v1 Announce Type: new Abstract: Recent studies have successfully integrated large vision-language models (VLMs) into low-level robotic control by supervised fine-tuning (SFT) with expert robotic datasets, resulting in what we term vision-language-action (VLA) models. Although the VLA models are powerful, how to improve these large models during interaction with environments remains an open question. In this paper, we explore how to further improve these VLA models via Reinforcement Learning (RL), a commonly used fine-tuning technique for large models. However, we find that directly applying online RL to large VLA models presents significant challenges, including training instability that severely impacts the performance of large models, and computing burdens that exceed the capabilities of most local machines. To address these challenges, we propose iRe-VLA framework, which iterates between Reinforcement Learning and Supervised Learning to effectively improve VLA models, leveraging the exploratory benefits of RL while maintaining the stability of supervised learning. Experiments in two simulated benchmarks and a real-world manipulation suite validate the effectiveness of our method.
Comment: Matches criterion 2 as it explores the use of vision-language models in reinforcement learning for robotics.
Relevance: 5 Novelty: 6 Back to [topic] [top]
ArXiv: 2501.16800 [page] [pdf]
Authors: Josua Spisak, Matthias Kerzel, Stefan Wermter
Abstract: arXiv:2501.16800v1 Announce Type: new Abstract: There has been substantial progress in humanoid robots, with new skills continuously being taught, ranging from navigation to manipulation. While these abilities may seem impressive, the teaching methods often remain inefficient. To enhance the process of teaching robots, we propose leveraging a mechanism effectively used by humans: teaching by demonstrating. In this paper, we introduce DIRIGENt (DIrect Robotic Imitation GENeration model), a novel end-to-end diffusion approach that directly generates joint values from observing human demonstrations, enabling a robot to imitate these actions without any existing mapping between it and humans. We create a dataset in which humans imitate a robot and then use this collected data to train a diffusion model that enables a robot to imitate humans. The following three aspects are the core of our contribution. First is our novel dataset with natural pairs between human and robot poses, allowing our approach to imitate humans accurately despite the gap between their anatomies. Second, the diffusion input to our model alleviates the challenge of redundant joint configurations, limiting the search space. And finally, our end-to-end architecture from perception to action leads to an improved learning capability. Through our experimental analysis, we show that combining these three aspects allows DIRIGENt to outperform existing state-of-the-art approaches in the field of generating joint values from RGB images.
Comment: Matches criterion 2 closely as it discusses a novel application of diffusion models in robotic imitation learning.
Relevance: 5 Novelty: 6 Back to [topic] [top]
ArXiv: 2501.16458 [page] [pdf]
Authors: Oriol Barbany, Adrià Colomé, Carme Torras
Abstract: arXiv:2501.16458v1 Announce Type: new Abstract: Cloth folding is a complex task due to the inevitable self-occlusions of clothes, their complicated dynamics, and the disparate materials, geometries, and textures that garments can have. In this work, we learn folding actions conditioned on text commands. Translating high-level, abstract instructions into precise robotic actions requires sophisticated language understanding and manipulation capabilities. To do that, we leverage a pre-trained vision-language model and repurpose it to predict manipulation actions. Our model, BiFold, can take context into account and achieves state-of-the-art performance on an existing language-conditioned folding benchmark. Given the lack of annotated bimanual folding data, we devise a procedure to automatically parse actions of a simulated dataset and tag them with aligned text instructions. BiFold attains the best performance on our dataset and can transfer to new instructions, garments, and environments.
Comment: Matches criterion 2 closely as it discusses the use of vision-language models in robotics for cloth folding tasks.
Relevance: 5 Novelty: 6 Back to [topic] [top]
ArXiv: 2501.16769 [page] [pdf]
Authors: Muhammad Atta ur Rahman
Abstract: arXiv:2501.16769v1 Announce Type: new Abstract: Self-supervised learning can resolve numerous image or linguistic processing problems when effectively trained. This study investigated simple yet efficient methods for adapting previously learned foundation models for open-vocabulary semantic segmentation tasks. Our research proposed "Beyond-Labels," a lightweight transformer-based fusion module that uses a handful of image segmentation data to fuse frozen image representations with language concepts. Furthermore, we efficiently captured positional information in images using Fourier embeddings, thus improving the generalization across various image sizes. Extensive ablation tests were performed to investigate the important components of our proposed method; when tested against the common benchmark PASCAL-5i, it demonstrated superior performance despite being trained on frozen image and language characteristics.
Comment: Matches criterion 2 closely as it discusses vision-language models for segmentation tasks.
Relevance: 5 Novelty: 4 Back to [topic] [top]
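The Fourier positional embedding mentioned in the abstract above is, in its standard form, a sin/cos feature mapping of normalized pixel coordinates; a generic version is sketched below. The band count and frequency schedule are assumptions for illustration, not values from the paper, and the fusion module itself is not reproduced.

```python
# Hedged sketch of a standard Fourier positional embedding over (x, y) in [0, 1].
import torch

def fourier_embed(coords, num_bands=16):
    """coords: (..., 2) normalized positions -> (..., 4 * num_bands) features."""
    freqs = 2.0 ** torch.arange(num_bands, dtype=coords.dtype)   # geometric frequency ladder
    angles = 2 * torch.pi * coords[..., None] * freqs            # (..., 2, num_bands)
    feats = torch.cat([angles.sin(), angles.cos()], dim=-1)      # (..., 2, 2*num_bands)
    return feats.flatten(-2)                                     # flatten coordinate and band dims
```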
ArXiv: 2501.16714 [page] [pdf]
Authors: Huijie Liu, Jingyun Wang, Shuai Ma, Jie Hu, Xiaoming Wei, Guoliang Kang
Abstract: arXiv:2501.16714v1 Announce Type: new Abstract: Motion customization aims to adapt the diffusion model (DM) to generate videos with the motion specified by a set of video clips with the same motion concept. To realize this goal, the adaptation of DM should be possible to model the specified motion concept, without compromising the ability to generate diverse appearances. Thus, the key to solving this problem lies in how to separate the motion concept from the appearance in the adaptation process of DM. Typical previous works explore different ways to represent and insert a motion concept into large-scale pretrained text-to-video diffusion models, e.g., learning a motion LoRA, using latent noise residuals, etc. While those methods can encode the motion concept, they also inevitably encode the appearance in the reference videos, resulting in weakened appearance generation capability. In this paper, we follow the typical way to learn a motion LoRA to encode the motion concept, but propose two novel strategies to enhance motion-appearance separation, including temporal attention purification (TAP) and appearance highway (AH). Specifically, we assume that in the temporal attention module, the pretrained Value embeddings are sufficient to serve as basic components needed by producing a new motion. Thus, in TAP, we choose only to reshape the temporal attention with motion LoRAs so that Value embeddings can be reorganized to produce a new motion. Further, in AH, we alter the starting point of each skip connection in U-Net from the output of each temporal attention module to the output of each spatial attention module. Extensive experiments demonstrate that compared to previous works, our method can generate videos with appearance more aligned with the text descriptions and motion more consistent with the reference videos.
Comment: Matches criterion 5 as it discusses improvements in text-to-video diffusion models for motion customization.
Relevance: 5 Novelty: 7 Back to [topic] [top]
ArXiv: 2501.16612 [page] [pdf]
Authors: Wenfeng Lin, Jiangchuan Wei, Boyuan Liu, Yichen Zhang, Shiyue Yan, Mingyu Guo
Abstract: arXiv:2501.16612v1 Announce Type: new Abstract: Recently, with the tremendous success of diffusion models in the field of text-to-image (T2I) generation, increasing attention has been directed toward their potential in text-to-video (T2V) applications. However, the computational demands of diffusion models pose significant challenges, particularly in generating high-resolution videos with high frame rates. In this paper, we propose CascadeV, a cascaded latent diffusion model (LDM), that is capable of producing state-of-the-art 2K resolution videos. Experiments demonstrate that our cascaded model achieves a higher compression ratio, substantially reducing the computational challenges associated with high-quality video generation. We also implement a spatiotemporal alternating grid 3D attention mechanism, which effectively integrates spatial and temporal information, ensuring superior consistency across the generated video frames. Furthermore, our model can be cascaded with existing T2V models, theoretically enabling a 4$\times$ increase in resolution or frames per second without any fine-tuning. Our code is available at https://github.com/bytedance/CascadeV.
Comment: Matches criterion 5 for significant advance in text to video diffusion models.
Relevance: 5 Novelty: 7 Back to [topic] [top]
ArXiv: 2501.16757 [page] [pdf]
Authors: Haifeng Ni
Abstract: arXiv:2501.16757v1 Announce Type: new Abstract: Recent advancements in virtual fitting for characters and clothing have leveraged diffusion models to improve the realism of garment fitting. However, challenges remain in handling complex scenes and poses, which can result in unnatural garment fitting and poorly rendered intricate patterns. In this work, we introduce ITVTON, a novel method that enhances clothing-character interactions by combining clothing and character images along spatial channels as inputs, thereby improving fitting accuracy for the inpainting model. Additionally, we incorporate integrated textual descriptions from multiple images to boost the realism of the generated visual effects. To optimize computational efficiency, we limit training to the attention parameters within a single diffusion transformer (Single-DiT) block. To more rigorously address the complexities of real-world scenarios, we curated training samples from the IGPair dataset, thereby enhancing ITVTON's performance across diverse environments. Extensive experiments demonstrate that ITVTON outperforms baseline methods both qualitatively and quantitatively, setting a new standard for virtual fitting tasks.
Comment: Matches criterion 5 as it discusses improvements in diffusion models for virtual try-on applications.
Relevance: 5 Novelty: 6 Back to [topic] [top]
ArXiv: 2501.17162 [page] [pdf]
Authors: Nikolai Kalischek, Michael Oechsle, Fabian Manhardt, Philipp Henzler, Konrad Schindler, Federico Tombari
Abstract: arXiv:2501.17162v1 Announce Type: new Abstract: We introduce a novel method for generating 360° panoramas from text prompts or images. Our approach leverages recent advances in 3D generation by employing multi-view diffusion models to jointly synthesize the six faces of a cubemap. Unlike previous methods that rely on processing equirectangular projections or autoregressive generation, our method treats each face as a standard perspective image, simplifying the generation process and enabling the use of existing multi-view diffusion models. We demonstrate that these models can be adapted to produce high-quality cubemaps without requiring correspondence-aware attention layers. Our model allows for fine-grained text control, generates high resolution panorama images and generalizes well beyond its training set, whilst achieving state-of-the-art results, both qualitatively and quantitatively. Project page: https://cubediff.github.io/
Comment: Matches criterion 5 for significant advance in image to panorama diffusion models.
Relevance: 5 Novelty: 6 Back to [topic] [top]
ArXiv: 2501.16764 [page] [pdf]
Authors: Chenguo Lin, Panwang Pan, Bangbang Yang, Zeming Li, Yadong Mu
Abstract: arXiv:2501.16764v1 Announce Type: new Abstract: Recent advancements in 3D content generation from text or a single image struggle with limited high-quality 3D datasets and inconsistency from 2D multi-view generation. We introduce DiffSplat, a novel 3D generative framework that natively generates 3D Gaussian splats by taming large-scale text-to-image diffusion models. It differs from previous 3D generative models by effectively utilizing web-scale 2D priors while maintaining 3D consistency in a unified model. To bootstrap the training, a lightweight reconstruction model is proposed to instantly produce multi-view Gaussian splat grids for scalable dataset curation. In conjunction with the regular diffusion loss on these grids, a 3D rendering loss is introduced to facilitate 3D coherence across arbitrary views. The compatibility with image diffusion models enables seamless adaptions of numerous techniques for image generation to the 3D realm. Extensive experiments reveal the superiority of DiffSplat in text- and image-conditioned generation tasks and downstream applications. Thorough ablation studies validate the efficacy of each critical design choice and provide insights into the underlying mechanism.
Comment: Matches criterion 6.
Relevance: 5 Novelty: 6 Back to [topic] [top]
ArXiv: 2501.17044 [page] [pdf]
Authors: Max Dax, Jordi Berbel, Jan Stria, Leonidas Guibas, Urs Bergmann
Abstract: arXiv:2501.17044v1 Announce Type: new Abstract: We generate abstractions of buildings, reflecting the essential aspects of their geometry and structure, by learning to invert procedural models. We first build a dataset of abstract procedural building models paired with simulated point clouds and then learn the inverse mapping through a transformer. Given a point cloud, the trained transformer then infers the corresponding abstracted building in terms of a programmatic language description. This approach leverages expressive procedural models developed for gaming and animation, and thereby retains desirable properties such as efficient rendering of the inferred abstractions and strong priors for regularity and symmetry. Our approach achieves good reconstruction accuracy in terms of geometry and structure, as well as structurally consistent inpainting.
Comment: Matches criterion 6 closely as it discusses 3D generation using generative models.
Relevance: 5 Novelty: 5 Back to [topic] [top]
ArXiv: 2501.16737 [page] [pdf]
Authors: Chenru Jiang, Chengrui Zhang, Xi Yang, Jie Sun, Kaizhu Huang
Abstract: arXiv:2501.16737v1 Announce Type: new Abstract: This paper delves into the study of 3D point cloud reconstruction from a single image. Our objective is to develop the Consistency Diffusion Model, exploring synergistic 2D and 3D priors in the Bayesian framework to ensure superior consistency in the reconstruction process, a challenging yet critical requirement in this field. Specifically, we introduce a pioneering training framework under diffusion models that brings two key innovations. First, we convert 3D structural priors derived from the initial 3D point cloud as a bound term to increase evidence in the variational Bayesian framework, leveraging these robust intrinsic priors to tightly govern the diffusion training process and bolster consistency in reconstruction. Second, we extract and incorporate 2D priors from the single input image, projecting them onto the 3D point cloud to enrich the guidance for diffusion training. Our framework not only sidesteps potential model learning shifts that may arise from directly imposing additional constraints during training but also precisely transposes the 2D priors into the 3D domain. Extensive experimental evaluations reveal that our approach sets new benchmarks in both synthetic and real-world datasets. The code is included with the submission.
Comment: Matches criterion 6 as it discusses advancements in 3D generation using diffusion models for single-image 3D reconstruction.
Relevance: 5 Novelty: 5 Back to [topic] [top]
ArXiv: 2501.16753 [page] [pdf]
Authors: Hy Nguyen, Srikanth Thudumu, Hung Du, Rajesh Vasa, Kon Mouzakis
Abstract: arXiv:2501.16753v1 Announce Type: new Abstract: Next-frame prediction in videos is crucial for applications such as autonomous driving, object tracking, and motion prediction. The primary challenge in next-frame prediction lies in effectively capturing and processing both spatial and temporal information from previous video sequences. The transformer architecture, known for its prowess in handling sequence data, has made remarkable progress in this domain. However, transformer-based next-frame prediction models face notable issues: (a) The multi-head self-attention (MHSA) mechanism requires the input embedding to be split into $N$ chunks, where $N$ is the number of heads. Each segment captures only a fraction of the original embedding's information, which distorts the representation of the embedding in the latent space, resulting in a semantic dilution problem; (b) These models predict the embeddings of the next frames rather than the frames themselves, but the loss function is based on the errors of the reconstructed frames, not the predicted embeddings -- this creates a discrepancy between the training objective and the model output. We propose a Semantic Concentration Multi-Head Self-Attention (SCMHSA) architecture, which effectively mitigates semantic dilution in transformer-based next-frame prediction. Additionally, we introduce a loss function that optimizes SCMHSA in the latent space, aligning the training objective more closely with the model output. Our method demonstrates superior performance compared to the original transformer-based predictors.
Comment: No specific criteria match.
Relevance: 3 Novelty: 6 Back to [topic] [top]
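The "semantic dilution" issue described above stems from the standard multi-head split, in which each head attends over only d_model / N dimensions of the token embedding. The toy arithmetic below merely illustrates that split; it is not the paper's SCMHSA remedy.

```python
# Toy illustration of the per-head split behind "semantic dilution" in standard MHSA.
d_model, num_heads = 512, 8
d_head = d_model // num_heads
print(f"each head sees {d_head} of {d_model} embedding dimensions")  # 64 of 512
```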
ArXiv: 2501.16617 [page] [pdf]
Authors: Di Qi, Tong Yang, Beining Wang, Xiangyu Zhang, Wenqiang Zhang
Abstract: arXiv:2501.16617v1 Announce Type: new Abstract: We present a novel framework for dynamic radiance field prediction given monocular video streams. Unlike previous methods that primarily focus on predicting future frames, our method goes a step further by generating explicit 3D representations of the dynamic scene. The framework builds on two core designs. First, we adopt an ego-centric unbounded triplane to explicitly represent the dynamic physical world. Second, we develop a 4D-aware transformer to aggregate features from monocular videos to update the triplane. Coupling these two designs enables us to train the proposed model with large-scale monocular videos in a self-supervised manner. Our model achieves top results in dynamic radiance field prediction on NVIDIA dynamic scenes, demonstrating its strong performance on 4D physical world modeling. Besides, our model shows a superior generalizability to unseen scenarios. Notably, we find that our approach emerges capabilities for geometry and semantic learning.
Comment: No specific criteria match.
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2501.16410 [page] [pdf]
Authors: Han Sun, Rui Gong, Ismail Nejjar, Olga Fink
Abstract: arXiv:2501.16410v1 Announce Type: new Abstract: Current unsupervised domain adaptation (UDA) methods for semantic segmentation typically assume identical class labels between the source and target domains. This assumption ignores the label-level domain gap, which is common in real-world scenarios, thus limiting their ability to identify finer-grained or novel categories without requiring extensive manual annotation. A promising direction to address this limitation lies in recent advancements in foundation models, which exhibit strong generalization abilities due to their rich prior knowledge. However, these models often struggle with domain-specific nuances and underrepresented fine-grained categories. To address these challenges, we introduce DynAlign, a framework that integrates UDA with foundation models to bridge both the image-level and label-level domain gaps. Our approach leverages prior semantic knowledge to align source categories with target categories that can be novel, more fine-grained, or named differently (e.g., vehicle to {car, truck, bus}). Foundation models are then employed for precise segmentation and category reassignment. To further enhance accuracy, we propose a knowledge fusion approach that dynamically adapts to varying scene contexts. DynAlign generates accurate predictions in a new target label space without requiring any manual annotations, allowing seamless adaptation to new taxonomies through either model retraining or direct inference. Experiments on the street scene semantic segmentation benchmarks GTA to Mapillary Vistas and GTA to IDD validate the effectiveness of our approach, achieving a significant improvement over existing methods. Our code will be publicly available.
Comment: No specific criteria match.
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2501.17022 [page] [pdf]
Authors: Kei Katsumata, Motonari Kambara, Daichi Yashima, Ryosuke Korekata, Komei Sugiura
Abstract: arXiv:2501.17022v1 Announce Type: new Abstract: We consider the problem of generating free-form mobile manipulation instructions based on a target object image and receptacle image. Conventional image captioning models are not able to generate appropriate instructions because their architectures are typically optimized for single-image. In this study, we propose a model that handles both the target object and receptacle to generate free-form instruction sentences for mobile manipulation tasks. Moreover, we introduce a novel training method that effectively incorporates the scores from both learning-based and n-gram based automatic evaluation metrics as rewards. This method enables the model to learn the co-occurrence relationships between words and appropriate paraphrases. Results demonstrate that our proposed method outperforms baseline methods including representative multimodal large language models on standard automatic evaluation metrics. Moreover, physical experiments reveal that using our method to augment data on language instructions improves the performance of an existing multimodal language understanding model for mobile manipulation.
Comment: No specific criteria match.
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2501.16396 [page] [pdf]
Authors: Mayukh Deb, Mainak Deb, N. Apurva Ratan Murty
Abstract: arXiv:2501.16396v1 Announce Type: new Abstract: Neurons in the brain are organized such that nearby cells tend to share similar functions. AI models lack this organization, and past efforts to introduce topography have often led to trade-offs between topography and task performance. In this work, we present TopoLoss, a new loss function that promotes spatially organized topographic representations in AI models without significantly sacrificing task performance. TopoLoss is highly adaptable and can be seamlessly integrated into the training of leading model architectures. We validate our method on both vision (ResNet-18, ResNet-50, ViT) and language models (GPT-Neo-125M, NanoGPT), collectively TopoNets. TopoNets are the highest-performing supervised topographic models to date, exhibiting brain-like properties such as localized feature processing, lower dimensionality, and increased efficiency. TopoNets also predict responses in the brain and replicate the key topographic signatures observed in the brain's visual and language cortices. Together, this work establishes a robust and generalizable framework for integrating topography into leading model architectures, advancing the development of high-performing models that more closely emulate the computational strategies of the human brain.
Comment:
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2501.16551 [page] [pdf]
Authors: Zhongyu Jiang, Wenhao Chai, Zhuoran Zhou, Cheng-Yen Yang, Hsiang-Wei Huang, Jenq-Neng Hwang
Abstract: arXiv:2501.16551v1 Announce Type: new Abstract: Human motion generation has advanced markedly with the advent of diffusion models. Most recent studies have concentrated on generating motion sequences based on text prompts, commonly referred to as text-to-motion generation. However, the bidirectional generation of motion and text, enabling tasks such as motion-to-text alongside text-to-motion, has been largely unexplored. This capability is essential for aligning diverse modalities and supports unconditional generation. In this paper, we introduce PackDiT, the first diffusion-based generative model capable of performing various tasks simultaneously, including motion generation, motion prediction, text generation, text-to-motion, motion-to-text, and joint motion-text generation. Our core innovation leverages mutual blocks to integrate multiple diffusion transformers (DiTs) across different modalities seamlessly. We train PackDiT on the HumanML3D dataset, achieving state-of-the-art text-to-motion performance with an FID score of 0.106, along with superior results in motion prediction and in-between tasks. Our experiments further demonstrate that diffusion models are effective for motion-to-text generation, achieving performance comparable to that of autoregressive models.
Comment:
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2501.16997 [page] [pdf]
Authors: Shreyam Gupta (Indian Institute of Technology), P. Agrawal (University of Colorado, Boulder, USA), Priyam Gupta (Intelligent Field Robotic Systems)
Abstract: arXiv:2501.16997v1 Announce Type: new Abstract: Temporal sequence modeling is the foundation of video prediction systems, real-time forecasting, and anomaly detection applications. Achieving accurate predictions with efficient resource consumption remains an ongoing issue in contemporary temporal sequence modeling. We introduce the Multi-Attention Unit (MAUCell), which combines Generative Adversarial Networks (GANs) and spatio-temporal attention mechanisms to improve video frame prediction. Our approach implements three types of attention models to capture intricate motion sequences, and a dynamic combination of their outputs allows the model to reach high decision accuracy and output quality while remaining computationally efficient. The integration of GAN elements makes generated frames appear more true to life, so the framework creates output sequences that mimic real-world footage. The design maintains an equilibrium between temporal continuity and spatial accuracy to deliver reliable video prediction. In a comprehensive evaluation combining the perceptual LPIPS measure with the classic MSE, MAE, SSIM, and PSNR metrics, MAUCell shows stronger capabilities than contemporary approaches on benchmark tests of the Moving MNIST, KTH Action, and CASIA-B (Preprocessed) datasets. Our examination also indicates that MAUCell shows promise for meeting operational time requirements. The findings demonstrate that GANs combined with attention mechanisms yield better applications for predicting video sequences.
Comment:
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2501.16467 [page] [pdf]
Authors: Philip Hughes, Larry Burns, Luke Adams
Abstract: arXiv:2501.16467v1 Announce Type: new Abstract: Semantic segmentation plays a crucial role in enabling machines to understand and interpret visual scenes at a pixel level. While traditional segmentation methods have achieved remarkable success, their generalization to diverse scenes and unseen object categories remains limited. Recent advancements in large language models (LLMs) offer a promising avenue for bridging visual and textual modalities, providing a deeper understanding of semantic relationships. In this paper, we propose LangSeg, a novel LLM-guided semantic segmentation method that leverages context-sensitive, fine-grained subclass descriptors generated by LLMs. Our framework integrates these descriptors with a pre-trained Vision Transformer (ViT) to achieve superior segmentation performance without extensive model retraining. We evaluate LangSeg on two challenging datasets, ADE20K and COCO-Stuff, where it outperforms state-of-the-art models, achieving up to a 6.1% improvement in mean Intersection over Union (mIoU). Additionally, we conduct a comprehensive ablation study and human evaluation to validate the effectiveness of our method in real-world scenarios. The results demonstrate that LangSeg not only excels in semantic understanding and contextual alignment but also provides a flexible and efficient framework for language-guided segmentation tasks. This approach opens up new possibilities for interactive and domain-specific segmentation applications.
Comment: No specific criteria match.
Relevance: 3 Novelty: 5 Back to [topic] [top]
ArXiv: 2501.16608 [page] [pdf]
Authors: Xiaolei Liu, Yan Sun, Mark Nixon
Abstract: arXiv:2501.16608v1 Announce Type: new Abstract: Gait recognition is an emerging identification technology that distinguishes individuals at long distances by analyzing individual walking patterns. Traditional techniques rely heavily on large-scale labeled datasets, which incurs high costs and significant labeling challenges. Recently, researchers have explored unsupervised gait recognition with clustering-based unsupervised domain adaptation methods and achieved notable success. However, these methods directly use pseudo-labels generated by clustering and neglect pseudo-label noise caused by domain differences, which degrades the model training process. To mitigate these issues, we propose a novel model called GaitDCCR, which aims to reduce the influence of noisy pseudo labels on clustering and model training. Our approach can be divided into two main stages: a clustering stage and a training stage. In the clustering stage, we propose Dynamic Cluster Parameters (DCP) and Dynamic Weight Centroids (DWC) to improve the efficiency of clustering and obtain reliable cluster centroids. In the training stage, we employ the classical teacher-student structure and propose Confidence-based Pseudo-label Refinement (CPR) and Contrastive Teacher Module (CTM) to encourage noisy samples to converge towards clusters containing their true identities. Extensive experiments on public gait datasets have demonstrated that our simple and effective method significantly enhances the performance of unsupervised gait recognition, laying the foundation for its application in the real world. The code is available at https://github.com/YanSun-github/GaitDCCR
Comment: No specific criteria match.
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2501.16443 [page] [pdf]
Authors: Weipu Zhang, Adam Jelley, Trevor McInroe, Amos Storkey
Abstract: arXiv:2501.16443v1 Announce Type: new Abstract: Deep reinforcement learning has achieved remarkable success in learning control policies from pixels across a wide range of tasks, yet its application remains hindered by low sample efficiency, requiring significantly more environment interactions than humans to reach comparable performance. Model-based reinforcement learning (MBRL) offers a solution by leveraging learnt world models to generate simulated experience, thereby improving sample efficiency. However, in visually complex environments, small or dynamic elements can be critical for decision-making. Yet, traditional MBRL methods in pixel-based environments typically rely on auto-encoding with an $L_2$ loss, which is dominated by large areas and often fails to capture decision-relevant details. To address these limitations, we propose an object-centric MBRL pipeline, which integrates recent advances in computer vision to allow agents to focus on key decision-related elements. Our approach consists of four main steps: (1) annotating key objects related to rewards and goals with segmentation masks, (2) extracting object features using a pre-trained, frozen foundation vision model, (3) incorporating these object features with the raw observations to predict environmental dynamics, and (4) training the policy using imagined trajectories generated by this object-centric world model. Building on the efficient MBRL algorithm STORM, we call this pipeline OC-STORM. We demonstrate OC-STORM's practical value in overcoming the limitations of conventional MBRL approaches on both Atari games and the visually complex game Hollow Knight.
Comment: No specific criteria match.
Relevance: 3 Novelty: 4 Back to [topic] [top]
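To give a flavor of the object-centric encoding described in steps (2)-(3) of the abstract above, the sketch below extracts per-object features with a frozen vision backbone and concatenates them with the raw-observation encoding before the world model. All names (`frozen_backbone`, `obs_encoder`) and shapes are hypothetical placeholders, not the OC-STORM API.

```python
# Hedged sketch of object-centric observation encoding for a world model.
import torch

def encode_observation(obs, object_masks, frozen_backbone, obs_encoder):
    """obs: (B, C, H, W) frames; object_masks: (B, K, 1, H, W) masks of key objects."""
    with torch.no_grad():                                        # the foundation backbone stays frozen
        obj_feats = [frozen_backbone(obs * object_masks[:, k])   # (B, D) features per masked object
                     for k in range(object_masks.shape[1])]
    obj_feats = torch.stack(obj_feats, dim=1).flatten(1)         # (B, K * D)
    return torch.cat([obs_encoder(obs), obj_feats], dim=1)       # joint input to the world model
```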
ArXiv: 2501.16720 [page] [pdf]
Authors: Chunpeng Zhou, Qianqian Shen, Zhi Yu, Jiajun Bu, Haishuai Wang
Abstract: arXiv:2501.16720v1 Announce Type: new Abstract: Recent advancements in fine-tuning Vision-Language Foundation Models (VLMs) have garnered significant attention for their effectiveness in downstream few-shot learning tasks. While these recent approaches exhibit some performance improvements, they often suffer from excessive training parameters and high computational costs. To address these challenges, we propose a novel Block matrix-based low-rank adaptation framework, called Block-LoRA, for fine-tuning VLMs on downstream few-shot tasks. Inspired by recent work on Low-Rank Adaptation (LoRA), Block-LoRA partitions the original low-rank decomposition matrix of LoRA into a series of sub-matrices while sharing all down-projection sub-matrices. This structure not only reduces the number of training parameters, but also transforms certain complex matrix multiplication operations into simpler matrix addition, significantly lowering the computational cost of fine-tuning. Notably, Block-LoRA enables fine-tuning CLIP on the ImageNet few-shot benchmark using a single 24GB GPU. We also show that Block-LoRA has a tighter generalization error bound than vanilla LoRA. Without bells and whistles, extensive experiments demonstrate that Block-LoRA achieves competitive performance compared to state-of-the-art CLIP-based few-shot methods, while maintaining a low training parameter count and reduced computational overhead.
Comment: Does not match any specific criteria but is related to vision-language models and few-shot learning.
Relevance: 3 Novelty: 4 Back to [topic] [top]
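One way to picture the shared-down-projection block structure described above is a single down-projection feeding several up-projection sub-matrices whose outputs are summed. This is a hedged sketch of my reading of the abstract, not the released Block-LoRA code; the rank and block counts are arbitrary.

```python
# Hedged sketch of a block-partitioned low-rank adapter with a shared down-projection.
import torch
import torch.nn as nn

class BlockLoRASketch(nn.Module):
    def __init__(self, d_in, d_out, rank=8, n_blocks=4):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)             # shared down-projection
        self.ups = nn.ModuleList(
            [nn.Linear(rank, d_out, bias=False) for _ in range(n_blocks)]
        )                                                          # per-block up-projections

    def forward(self, x):
        z = self.down(x)                                           # computed once per input
        return sum(up(z) for up in self.ups)                       # blocks combined by addition
```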
ArXiv: 2501.16947 [page] [pdf]
Authors: Sania Waheed, Bruno Ferrarini, Michael Milford, Sarvapali D. Ramchurn, Shoaib Ehsan
Abstract: arXiv:2501.16947v1 Announce Type: new Abstract: The advances in Vision-Language models (VLMs) offer exciting opportunities for robotic applications involving image geo-localization, the problem of identifying the geo-coordinates of a place based on visual data only. Recent research works have focused on using a VLM as embeddings extractor for geo-localization, however, the most sophisticated VLMs may only be available as black boxes that are accessible through an API, and come with a number of limitations: there is no access to training data, model features and gradients; retraining is not possible; the number of predictions may be limited by the API; training on model outputs is often prohibited; and queries are open-ended. The utilization of a VLM as a stand-alone, zero-shot geo-localization system using a single text-based prompt is largely unexplored. To bridge this gap, this paper undertakes the first systematic study, to the best of our knowledge, to investigate the potential of some of the state-of-the-art VLMs as stand-alone, zero-shot geo-localization systems in a black-box setting with realistic constraints. We consider three main scenarios for this thorough investigation: a) fixed text-based prompt; b) semantically-equivalent text-based prompts; and c) semantically-equivalent query images. We also take into account the auto-regressive and probabilistic generation process of the VLMs when investigating their utility for geo-localization task by using model consistency as a metric in addition to traditional accuracy. Our work provides new insights in the capabilities of different VLMs for the above-mentioned scenarios.
Comment: Does not match any specific criteria but is related to vision-language models in robotics.
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2501.16929 [page] [pdf]
Authors: Shalutha Rajapakshe, Jean-Marc Odobez, Emmanuel Senft
Abstract: arXiv:2501.16929v1 Announce Type: new Abstract: While shared autonomy offers significant potential for assistive robotics, key questions remain about how to effectively map 2D control inputs to 6D robot motions. An intuitive framework should allow users to input commands effortlessly, with the robot responding as expected, without users needing to anticipate the impact of their inputs. In this article, we propose a dynamic input mapping framework that links joystick movements to motions on control frames defined along a trajectory encoded with canal surfaces. We evaluate our method in a user study with 20 participants, demonstrating that our input mapping framework reduces the workload and improves usability compared to a baseline mapping with similar motion encoding. To prepare for deployment in assistive scenarios, we built on the development from the accessible gaming community to select an accessible control interface. We then tested the system in an exploratory study, where three wheelchair users controlled the robot for both daily living activities and a creative painting task, demonstrating its feasibility for users closer to our target population.
Comment: No specific criteria match.
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2501.16729 [page] [pdf]
Authors: Fatima Davelouis, John D. Martin, Michael Bowling
Abstract: arXiv:2501.16729v1 Announce Type: new Abstract: We study the benefits of different sparse architectures for deep reinforcement learning. In particular, we focus on image-based domains where spatially-biased and fully-connected architectures are common. Using these and several other architectures of equal capacity, we show that sparse structure has a significant effect on learning performance. We also observe that choosing the best sparse architecture for a given domain depends on whether the hidden layer weights are fixed or learned.
Comment: No specific criteria match.
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2501.16971 [page] [pdf]
Authors: Hossein Mirzaei, Mohammad Jafari, Hamid Reza Dehbashi, Ali Ansari, Sepehr Ghobadi, Masoud Hadi, Arshia Soltani Moakhar, Mohammad Azizmalayeri, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
Abstract: arXiv:2501.16971v1 Announce Type: new Abstract: In recent years, there have been significant improvements in various forms of image outlier detection. However, outlier detection performance under adversarial settings lags far behind that in standard settings. This is due to the lack of effective exposure to adversarial scenarios during training, especially on unseen outliers, leading to detection models failing to learn robust features. To bridge this gap, we introduce RODEO, a data-centric approach that generates effective outliers for robust outlier detection. More specifically, we show that incorporating outlier exposure (OE) and adversarial training can be an effective strategy for this purpose, as long as the exposed training outliers meet certain characteristics, including diversity, and both conceptual differentiability and analogy to the inlier samples. We leverage a text-to-image model to achieve this goal. We demonstrate both quantitatively and qualitatively that our adaptive OE method effectively generates "diverse" and "near-distribution" outliers, leveraging information from both text and image domains. Moreover, our experimental results show that utilizing our synthesized outliers significantly enhances the performance of the outlier detector, particularly in adversarial settings.
Comment:
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2501.17076 [page] [pdf]
Authors: Muhammad Shahbaz, Shaurya Agarwal
Abstract: arXiv:2501.17076v1 Announce Type: new Abstract: Recent advancements in deep-learning methods for object detection in point-cloud data have enabled numerous roadside applications, fostering improvements in transportation safety and management. However, the intricate nature of point-cloud data poses significant challenges for human-supervised labeling, resulting in substantial expenditures of time and capital. This paper addresses the issue by developing an end-to-end, scalable, and self-supervised framework for training deep object detectors tailored for roadside point-cloud data. The proposed framework leverages self-supervised, statistically modeled teachers to train off-the-shelf deep object detectors, thus circumventing the need for human supervision. The teacher models follow a fine-tuned set of standard practices of background filtering, object clustering, bounding-box fitting, and classification to generate noisy labels. We show that training the student model on the combined noisy annotations from a multitude of teachers enhances its capacity to discern background from foreground more effectively and forces it to learn diverse point-cloud representations for object categories of interest. The evaluations, involving publicly available roadside datasets and state-of-the-art deep object detectors, demonstrate that the proposed framework achieves comparable performance to deep object detectors trained on human-annotated labels, despite not utilizing such human annotations in its training process.
Comment:
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2501.16394 [page] [pdf]
Authors: Lumen AI, Tengzhou No. 1 Middle School, Shihao Ji, Zihui Song, Fucheng Zhong, Jisen Jia, Zhaobo Wu, Zheyi Cao, Xu Tianhao
Abstract: arXiv:2501.16394v1 Announce Type: new Abstract: Addressing the resource waste caused by fixed computation paradigms in deep learning models under dynamic scenarios, this paper proposes a Transformer$^{-1}$ architecture based on the principle of deep adaptivity. This architecture achieves dynamic matching between input features and computational resources by establishing a joint optimization model for complexity and computation. Our core contributions include: (1) designing a two-layer control mechanism, composed of a complexity predictor and a reinforcement learning policy network, enabling end-to-end optimization of computation paths; (2) deriving a lower bound theory for dynamic computation, proving the system's theoretical reach to optimal efficiency; and (3) proposing a layer folding technique and a CUDA Graph pre-compilation scheme, overcoming the engineering bottlenecks of dynamic architectures. In the ImageNet-1K benchmark test, our method reduces FLOPs by 42.7% and peak memory usage by 34.1% compared to the standard Transformer, while maintaining comparable accuracy ($\pm$0.3%). Furthermore, we conducted practical deployment on the Jetson AGX Xavier platform, verifying the effectiveness and practical value of this method in resource-constrained environments. To further validate the generality of the method, we also conducted experiments on several natural language processing tasks and achieved significant improvements in resource efficiency.
Comment:
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2501.16889 [page] [pdf]
Authors: Veronika Solopova, Lucas Schmidt, Dorothea Kolossa
Abstract: arXiv:2501.16889v1 Announce Type: new Abstract: We introduce VIBA, a novel approach for explainable video classification by adapting Information Bottlenecks for Attribution (IBA) to video sequences. While most traditional explainability methods are designed for image models, our IBA framework addresses the need for explainability in temporal models used for video analysis. To demonstrate its effectiveness, we apply VIBA to video deepfake detection, testing it on two architectures: the Xception model for spatial features and a VGG11-based model for capturing motion dynamics through optical flow. Using a custom dataset that reflects recent deepfake generation techniques, we adapt IBA to create relevance and optical flow maps, visually highlighting manipulated regions and motion inconsistencies. Our results show that VIBA generates temporally and spatially consistent explanations, which align closely with human annotations, thus providing interpretability for video classification and particularly for deepfake detection.
Comment:
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2501.16733 [page] [pdf]
Authors: Yinfeng Gao, Qichao Zhang, Da-wei Ding, Dongbin Zhao
Abstract: arXiv:2501.16733v1 Announce Type: new Abstract: It is still a challenging topic to make reactive driving behaviors in complex urban environments as road users' intentions are unknown. Model-based reinforcement learning (MBRL) offers great potential to learn a reactive policy by constructing a world model that can provide informative states and imagination training. However, a critical limitation in relevant research lies in the scene-level reconstruction representation learning, which may overlook key interactive vehicles and hardly model the interactive features among vehicles and their long-term intentions. Therefore, this paper presents a novel MBRL method with a predictive individual world model (PIWM) for autonomous driving. PIWM describes the driving environment from an individual-level perspective and captures vehicles' interactive relations and their intentions via trajectory prediction task. Meanwhile, a behavior policy is learned jointly with PIWM. It is trained in PIWM's imagination and effectively navigates in the urban driving scenes leveraging intention-aware latent states. The proposed method is trained and evaluated on simulation environments built upon real-world challenging interactive scenarios. Compared with popular model-free and state-of-the-art model-based reinforcement learning methods, experimental results show that the proposed method achieves the best performance in terms of safety and efficiency.
Comment: Does not match any specific criteria closely. It discusses model-based reinforcement learning for autonomous driving, but not using vision language models.
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2501.16981 [page] [pdf]
Authors: Xiangyu Gao, Yu Dai, Benliu Qiu, Hongliang Li
Abstract: arXiv:2501.16981v1 Announce Type: new Abstract: Owing to large-scale image-text contrastive training, pre-trained vision language model (VLM) like CLIP shows superior open-vocabulary recognition ability. Most existing open-vocabulary object detectors attempt to utilize the pre-trained VLM to attain generative representation. F-ViT uses the pre-trained visual encoder as the backbone network and freezes it during training. However, the frozen backbone doesn't benefit from the labeled data to strengthen the representation. Therefore, we propose a novel two-branch backbone network design, named as ViT-Feature-Modulated Multi-Scale Convolutional network (VMCNet). VMCNet consists of a trainable convolutional branch, a frozen pre-trained ViT branch and a feature modulation module. The trainable CNN branch could be optimized with labeled data while the frozen pre-trained ViT branch could keep the representation ability derived from large-scale pre-training. Then, the proposed feature modulation module could modulate the multi-scale CNN features with the representations from ViT branch. With the proposed mixed structure, detector is more likely to discover novel categories. Evaluated on two popular benchmarks, our method boosts the detection performance on novel category and outperforms the baseline. On OV-COCO, the proposed method achieves 44.3 AP$_{50}^{\mathrm{novel}}$ with ViT-B/16 and 48.5 AP$_{50}^{\mathrm{novel}}$ with ViT-L/14. On OV-LVIS, VMCNet with ViT-B/16 and ViT-L/14 reaches 27.8 and 38.4 mAP$_{r}$.
Comment: Does not match any specific criteria closely. It discusses open-vocabulary object detection using vision language models, but not in the context of robotics or other specified tasks.
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2501.16899 [page] [pdf]
Authors: Shady Nasrat, Myungsu Kim, Seonil Lee, Jiho Lee, Yeoncheol Jang, Seung-joon Yi
Abstract: arXiv:2501.16899v1 Announce Type: new Abstract: Large language models (LLMs) represent a significant advancement in integrating physical robots with AI-driven systems. We showcase the capabilities of our framework within the context of the real-world household competition. This research introduces a framework that utilizes RDMM (Robotics Decision-Making Models), which possess the capacity for decision-making within domain-specific contexts, as well as an awareness of their personal knowledge and capabilities. The framework leverages information to enhance the autonomous decision-making of the system. In contrast to other approaches, our focus is on real-time, on-device solutions, successfully operating on hardware with as little as 8GB of memory. Our framework incorporates visual perception models equipping robots with understanding of their environment. Additionally, the framework has integrated real-time speech recognition capabilities, thus enhancing the human-robot interaction experience. Experimental results demonstrate that the RDMM framework can plan with an 93% accuracy. Furthermore, we introduce a new dataset consisting of 27k planning instances, as well as 1.3k text-image annotated samples derived from the competition. The framework, benchmarks, datasets, and models developed in this work are publicly available on our GitHub repository at https://github.com/shadynasrat/RDMM.
Comment: No specific criteria match.
Relevance: 3 Novelty: 4 Back to [topic] [top]
ArXiv: 2501.16369 [page] [pdf]
Authors: Ahmed Alagha, Hadi Otrok, Shakti Singh, Rabeb Mizouni, Jamal Bentahar
Abstract: arXiv:2501.16369v1 Announce Type: new Abstract: Deep Reinforcement Learning (DRL) has emerged as a powerful paradigm for solving complex problems. However, its full potential remains inaccessible to a broader audience due to its complexity, which requires expertise in training and designing DRL solutions, high computational capabilities, and sometimes access to pre-trained models. This necessitates the need for hassle-free services that increase the availability of DRL solutions to a variety of users. To enhance the accessibility to DRL services, this paper proposes a novel blockchain-based crowdsourced DRL as a Service (DRLaaS) framework. The framework provides DRL-related services to users, covering two types of tasks: DRL training and model sharing. Through crowdsourcing, users could benefit from the expertise and computational capabilities of workers to train DRL solutions. Model sharing could help users gain access to pre-trained models, shared by workers in return for incentives, which can help train new DRL solutions using methods in knowledge transfer. The DRLaaS framework is built on top of a Consortium Blockchain to enable traceable and autonomous execution. Smart Contracts are designed to manage worker and model allocation, which are stored using the InterPlanetary File System (IPFS) to ensure tamper-proof data distribution. The framework is tested on several DRL applications, proving its efficacy.
Comment:
Relevance: 3 Novelty: 3 Back to [topic] [top]
ArXiv: 2501.17131 [page] [pdf]
Authors: Rivera Esteban, Lübberstedt Jannik, Nico Uhlemann, Markus Lienkamp
Abstract: arXiv:2501.17131v1 Announce Type: new Abstract: Deep learning models for autonomous driving, encompassing perception, planning, and control, depend on vast datasets to achieve their high performance. However, their generalization often suffers due to domain-specific data distributions, making an effective scene-based categorization of samples necessary to improve their reliability across diverse domains. Manual captioning, though valuable, is both labor-intensive and time-consuming, creating a bottleneck in the data annotation process. Large Visual Language Models (LVLMs) present a compelling solution by automating image analysis and categorization through contextual queries, often without requiring retraining for new categories. In this study, we evaluate the capabilities of LVLMs, including GPT-4 and LLaVA, to understand and classify urban traffic scenes on both an in-house dataset and the BDD100K. We propose a scalable captioning pipeline that integrates state-of-the-art models, enabling a flexible deployment on new datasets. Our analysis, combining quantitative metrics with qualitative insights, demonstrates the effectiveness of LVLMs to understand urban traffic scenarios and highlights their potential as an efficient tool for data-driven advancements in autonomous driving.
Comment: Does not match any specific criteria closely. It discusses the use of vision language models for scene understanding in autonomous driving, but not in robotics tasks like reinforcement learning or imitation learning.