Personalized Daily Arxiv Papers 04/17/2026

This project is adapted from tatsu-lab/gpt_paper_assistant. The source code is at Variante/gpt_paper_assistant.

Topics

Paper selection prompt and criteria (jump to the section by clicking the link):

1. Application of diffusion models and vision-language models (VLMs) to robot manipulation.

2. New methodological improvements to self-supervised learning (SSL) for image or video representation.

3. Video segmentation using unsupervised or self-supervised methods.

4. Transfer learning across modalities (e.g., audio-to-video, optical flow, language-to-video) for improved video understanding.

5. Advances in 3D generation using generative models, including image-to-3D and text-to-3D.

6. Recent progress in 3D reconstruction and generation with Gaussian Splatting, NeRF, or mesh generation.

Go beyond

Topic 1

1006. SpaceMind: A Modular and Self-Evolving Embodied Vision-Language Agent Framework for Autonomous On-orbit Servicing [more]
Authors: Aodi Wu, Haodong Han, Xubo Luo, Ruisuo Wang, Shan He, Xue Wan

1007. World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems [more]
Authors: Runze Li, Hongyin Zhang, Junxi Jin, Qixin Zeng, Zifeng Zhuang, Yiqi Tang, Shangke Lyu, Donglin Wang

1009. R3D: Revisiting 3D Policy Learning [more]
Authors: Zhengdong Hong, Shenrui Wu, Haozhe Cui, Boyi Zhao, Ran Ji, Yiyang He, Hangxing Zhang, Zundong Ke, Jun Wang, Guofeng Zhang, Jiayuan Gu

1012. DockAnywhere: Data-Efficient Visuomotor Policy Learning for Mobile Manipulation via Novel Demonstration Generation [more]
Authors: Ziyu Shan, Yuheng Zhou, Gaoyuan Wu, Ziheng Ji, Zhenyu Wu, Ziwei Wang

1017. Mean Flow Policy Optimization [more]
Authors: Xiaoyi Dong, Xi Sheryl Zhang, Jian Cheng

Back to [top]

Topic 2

2010. Co-distilled attention guided masked image modeling with noisy teacher for self-supervised learning on medical images [more]
Authors: Jue Jiang, Aneesh Rangnekar, Harini Veeraraghavan

2011. Metric-Aware Principal Component Analysis (MAPCA): A Unified Framework for Scale-Invariant Representation Learning [more]
Authors: Michael Leznik

2015. Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers [more]
Authors: Felipe Parodi, Jordan Matelsky, Melanie Segado

2016. Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography [more]
Authors: Simon Böhi, Irene Cannistraci, Sergio Muñoz Gonzalez, Moritz Vandenhirtz, Sonia Laguna, Samuel Ruiperez-Campillo, Max Krähenmann, Andrea Agostini, Ece Ozkan, Thomas M. Sutter, Julia E. Vogt

Back to [top]

Topic 3

3000. CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation [more]
Authors: Inseok Jeon, Suhwan Cho, Minhyeok Lee, Seunghoon Lee, Minseok Kang, Jungho Lee, Chaewon Park, Donghyeong Kim, Sangyoun Lee

3019. Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization [more]
Authors: Umer Ahmed, Syed Ahmed Mahmood, Fawad Javed Fateh, M. Shaheer Luqman, M. Zeeshan Zia, Quoc-Huy Tran

Back to [top]

Topic 4

4018. FreqTrack: Frequency Learning based Vision Transformer for RGB-Event Object Tracking [more]
Authors: Jinlin You, Muyu Li, Xudong Zhao

4021. Scouting By Reward: VLM-TO-IRL-Driven Player Selection For Esports [more]
Authors: Qing Yan, Wenyu Yang, Yufei Wang, Wenhao Ma, Linchong Hu, Yifei Jin, Anton Dahbura

Back to [top]

Topic 5

5001. One-shot Compositional 3D Head Avatars with Deformable Hair [more]
Authors: Yuan Sun, Xuan Wang, WeiLi Zhang, Wenxuan Zhang, Yu Guo, Fei Wang

5003. Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes [more]
Authors: Victoria Yue Chen, Emery Pierson, Léopold Maillard, Maks Ovsjanikov

5008. TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens [more]
Authors: Jiawei Ren, Michal Jan Tyszkiewicz, Jiahui Huang, Zan Gojcic

5013. GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens [more]
Authors: Roni Itkin, Noam Issachar, Yehonatan Keypur, Anpei Chen, Sagie Benaim

5014. Giving Faces Their Feelings Back: Explicit Emotion Control for Feedforward Single-Image 3D Head Avatars [more]
Authors: Yicheng Gong, Jiawei Zhang, Liqiang Liu, Yanwen Wang, Lei Chu, Jiahao Li, Hao Pan, Hao Zhu, Yan Lu

Back to [top]

Topic 6

6002. NG-GS: NeRF-Guided 3D Gaussian Splatting Segmentation [more]
Authors: Yi He, Tao Wang, Yi Jin, Congyan Lang, Yidong Li, Haibin Ling

6004. Hybrid Latents -- Geometry-Appearance-Aware Surfel Splatting [more]
Authors: Neel Kelkar, Simon Niedermayr, Klaus Engel, Rüdiger Westermann

6005. HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds [more]
Authors: Team HY-World, Chenjie Cao, Xuhui Zuo, Zhenwei Wang, Yisu Zhang, Junta Wu, Zhenyang Liu, Yuning Gong, Yang Liu, Bo Yuan, Chao Zhang, Coopers Li, Dongyuan Guo, Fan Yang, Haiyu Zhang, Hang Cao, Jianchen Zhu, Jiaxin Lin, Jie Xiao, Jihong Zhang, Junlin Yu, Lei Wang, Lifu Wang, Lilin Wang, Linus, Minghui Chen, Peng He, Penghao Zhao, Qi Chen, Rui Chen, Rui Shao, Sicong Liu, Wangchen Qin, Xiaochuan Niu, Xiang Yuan, Yi Sun, Yifei Tang, Yifu Sun, Yihang Lian, Yonghao Tan, Yuhong Liu, Yuyang Yin, Zhiyuan Min, Tengfei Wang, Chunchao Guo

6020. StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression [more]
Authors: Xuanyi Liu, Deyi Ji, Chunan Yu, Qi Zhu, Xuanfu Li, Jin Ma, Tianrun Chen, Lanyun Zhu

Back to [top]

Go beyond

Back to [top]

Full paper list

1006. SpaceMind: A Modular and Self-Evolving Embodied Vision-Language Agent Framework for Autonomous On-orbit Servicing

ArXiv: 2604.14399 [page] [pdf]

Authors: Aodi Wu, Haodong Han, Xubo Luo, Ruisuo Wang, Shan He, Xue Wan

Abstract: Autonomous on-orbit servicing demands embodied agents that perceive through visual sensors, reason about 3D spatial situations, and execute multi-phase tasks over extended horizons. We present SpaceMind, a modular and self-evolving vision-language model (VLM) agent framework that decomposes knowledge, tools, and reasoning into three independently extensible dimensions: skill modules with dynamic routing, Model Context Protocol (MCP) tools with configurable profiles, and injectable reasoning-mode skills. An MCP-Redis interface layer enables the same codebase to operate across simulation and physical hardware without modification, and a Skill Self-Evolution mechanism distills operational experience into persistent skill files without model fine-tuning. We validate SpaceMind through 192 closed-loop runs across five satellites, three task types, and two environments, a UE5 simulation and a physical laboratory, deliberately including degraded conditions to stress-test robustness. Under nominal conditions all modes achieve 90–100% navigation success; under degradation, the Prospective mode uniquely succeeds in search-and-approach tasks where other modes fail. A self-evolution study shows that the agent recovers from failure in four of six groups from a single failed episode, including complete failure to 100% success and inspection scores improving from 12 to 59 out of 100. Real-world validation confirms zero-code-modification transfer to a physical robot with 100% rendezvous success. Code: https://github.com/wuaodi/SpaceMind

Comment: Matches criterion 1: introduces a modular vision-language model (VLM) agent (SpaceMind) for embodied robotic control and planning, validated in 192 closed-loop runs across UE5 simulation and physical hardware with reported 90–100% navigation success and 100% rendezvous success on a physical robot.

Relevance: 10 Back to [topic] [top]

1007. World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

ArXiv: 2604.14732 [page] [pdf]

Authors: Runze Li, Hongyin Zhang, Junxi Jin, Qixin Zeng, Zifeng Zhuang, Yiqi Tang, Shangke Lyu, Donglin Wang

Abstract: Vision-Language-Action (VLA) models have emerged as a promising paradigm for building embodied agents that ground perception and language into action. However, most existing approaches rely on direct action prediction, lacking the ability to reason over long-horizon trajectories and evaluate their consequences, which limits performance in complex decision-making tasks. In this work, we introduce the World-Value-Action (WAV) model, a unified framework that enables implicit planning in VLA systems. Rather than performing explicit trajectory optimization, the WAV model learns a structured latent representation of future trajectories conditioned on visual observations and language instructions. A learned world model predicts future states, while a trajectory value function evaluates their long-horizon utility. Action generation is then formulated as inference in this latent space, where the model progressively concentrates probability mass on high-value and dynamically feasible trajectories. We provide a theoretical perspective showing that planning directly in action space suffers from an exponential decay in the probability of feasible trajectories as the horizon increases. In contrast, latent-space inference reshapes the search distribution toward feasible regions, enabling efficient long-horizon decision making. Extensive simulations and real-world experiments demonstrate that the WAV model consistently outperforms state-of-the-art methods, achieving significant improvements in task success rate, generalization ability, and robustness, especially in long-horizon and compositional scenarios.

Comment: Matches criterion 1: introduces the World-Value-Action (WAV) model for Vision-Language-Action systems, which learns a latent world model and a trajectory value function for implicit planning, yielding improved task success rate, generalization, and robustness in long-horizon and real-world experiments.

Relevance: 10 Back to [topic] [top]

1009. R3D: Revisiting 3D Policy Learning

ArXiv: 2604.15281 [page] [pdf]

Authors: Zhengdong Hong, Shenrui Wu, Haozhe Cui, Boyi Zhao, Ran Ji, Yiyang He, Hangxing Zhang, Zundong Ke, Jun Wang, Guofeng Zhang, Jiayuan Gu

Abstract: 3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: https://r3d-policy.github.io/

Comment: Matches criterion 1: a 3D policy learning paper (R3D) that integrates a transformer-based 3D encoder with a diffusion decoder to improve stability and performance on manipulation benchmarks.

Relevance: 10 Back to [topic] [top]

1012. DockAnywhere: Data-Efficient Visuomotor Policy Learning for Mobile Manipulation via Novel Demonstration Generation

ArXiv: 2604.15023 [page] [pdf]

Authors: Ziyu Shan, Yuheng Zhou, Gaoyuan Wu, Ziheng Ji, Zhenyu Wu, Ziwei Wang

Abstract: Mobile manipulation is a fundamental capability that enables robots to interact in expansive environments such as homes and factories. Most existing approaches follow a two-stage paradigm, where the robot first navigates to a docking point and then performs fixed-base manipulation using powerful visuomotor policies. However, real-world mobile manipulation often suffers from the view generalization problem due to shifts of docking points. To address this issue, we propose a novel low-cost demonstration generation framework named DockAnywhere, which improves viewpoint generalization under docking variability by lifting a single demonstration to diverse feasible docking configurations. Specifically, DockAnywhere lifts a trajectory to any feasible docking points by decoupling docking-dependent base motions from contact-rich manipulation skills that remain invariant across viewpoints. Feasible docking proposals are sampled under feasibility constraints, and corresponding trajectories are generated via structure-preserving augmentation. Visual observations are synthesized in 3D space by representing the robot and objects as point clouds and applying point-level spatial editing to ensure the consistency of observation and action across viewpoints. Extensive experiments on ManiSkill and real-world platforms demonstrate that DockAnywhere substantially improves policy success rates and easily generalizes to novel viewpoints from unseen docking points during training, significantly enhancing the generalization capability of mobile manipulation policy in real-world deployment.

Comment: Criterion 1 — uses generative demonstration/data-augmentation for robot visuomotor policy learning: structure-preserving augmentation and point-level spatial editing to synthesize diverse docking viewpoints, improving policy success rates on ManiSkill and real-world platforms.

Relevance: 9 Back to [topic] [top]

1017. Mean Flow Policy Optimization

ArXiv: 2604.14698 [page] [pdf]

Authors: Xiaoyi Dong, Xi Sheryl Zhang, Jian Cheng

Abstract: Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome this limitation, we propose to represent policies using MeanFlow models, a class of few-step flow-based generative models, to improve training and inference efficiency over diffusion-based RL approaches. To promote exploration, we optimize MeanFlow policies under the maximum entropy RL framework via soft policy iteration, and address two key challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement. Experiments on MuJoCo and DeepMind Control Suite benchmarks demonstrate that our method, Mean Flow Policy Optimization (MFPO), achieves performance comparable to or exceeding current diffusion-based baselines while considerably reducing training and inference time. Our code is available at https://github.com/MFPolicy/MFPO.

Comment: Criterion 1: proposes MeanFlow few-step flow-based generative policies as a more efficient alternative to diffusion-based RL policies and demonstrates comparable or better performance on MuJoCo and DeepMind Control Suite benchmarks (directly about generative-policy RL).

Relevance: 8 Back to [topic] [top]

2010. Co-distilled attention guided masked image modeling with noisy teacher for self-supervised learning on medical images

ArXiv: 2604.14506 [page] [pdf]

Authors: Jue Jiang, Aneesh Rangnekar, Harini Veeraraghavan

Abstract: Masked image modeling (MIM) is a highly effective self-supervised learning (SSL) approach to extract useful feature representations from unannotated data. Predominantly used random masking methods make SSL less effective for medical images due to the contextual similarity of neighboring patches, leading to information leakage and SSL simplification. Hierarchical shifted window (Swin) transformer, a highly effective approach for medical images, cannot use advanced masking methods as it lacks a global [CLS] token. Hence, we introduced an attention guided masking mechanism for Swin within a co-distillation learning framework to selectively mask semantically co-occurring and discriminative patches, to reduce information leakage and increase the difficulty of SSL pretraining. However, attention guided masking inevitably reduces the diversity of attention heads, which negatively impacts downstream task performance. To address this, we, for the first time, integrate a noisy teacher into the co-distillation framework (termed DAGMaN) that performs attentive masking while preserving high attention head diversity. We demonstrate the capability of DAGMaN on multiple tasks including full- and few-shot lung nodule classification, immunotherapy outcome prediction, tumor segmentation, and unsupervised organ clustering.

Comment: Criterion 2: proposes a new SSL method DAGMaN (co-distilled attention-guided masking with a noisy teacher) for masked image modeling on Swin transformers, evaluated on medical tasks including lung nodule classification and tumor segmentation.

Relevance: 9 Back to [topic] [top]

2011. Metric-Aware Principal Component Analysis (MAPCA): A Unified Framework for Scale-Invariant Representation Learning

ArXiv: 2604.14249 [page] [pdf]

Authors: Michael Leznik

Abstract: We introduce Metric-Aware Principal Component Analysis (MAPCA), a unified framework for scale-invariant representation learning based on the generalised eigenproblem max Tr(W^T Sigma W) subject to W^T M W = I, where M is a symmetric positive definite metric matrix. The choice of M determines the representation geometry. The canonical beta-family M(beta) = Sigma^beta, beta in [0,1], provides continuous spectral bias control between standard PCA (beta=0) and output whitening (beta=1), with condition number kappa(beta) = (lambda_1/lambda_p)^(1-beta) decreasing monotonically to isotropy. The diagonal metric M = D = diag(Sigma) recovers Invariant PCA (IPCA), a method rooted in Frisch (1928) diagonal regression, as a distinct member of the broader framework. We prove that scale invariance holds if and only if the metric transforms as M_tilde = CMC under rescaling C, a condition satisfied exactly by IPCA but not by the general beta-family at intermediate values. Beyond its classical interpretation, MAPCA provides a geometric language that unifies several self-supervised learning objectives. Barlow Twins and ZCA whitening correspond to beta=1 (output whitening); VICReg's variance term corresponds to the diagonal metric. A key finding is that W-MSE, despite being described as a whitening-based method, corresponds to M = Sigma^{-1} (beta = -1), outside the spectral compression range entirely and in the opposite spectral direction to Barlow Twins. This distinction between input and output whitening is invisible at the level of loss functions and becomes precise only within the MAPCA framework.

Comment: Criterion 2 — proposes MAPCA, a unified framework that analytically links and explains self-supervised objectives (e.g., Barlow Twins, VICReg, ZCA whitening) and introduces a metric family M(beta) that controls spectral bias, providing a clear methodological contribution to SSL representation learning.

Relevance: 9 Back to [topic] [top]
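
As a rough illustration of the generalised eigenproblem quoted in the MAPCA abstract above (entry 2011), the following NumPy sketch solves max Tr(W^T Sigma W) subject to W^T M W = I for the beta-family M(beta) = Sigma^beta and for the diagonal metric M = diag(Sigma). It is not the author's code: the function names `beta_metric` and `mapca`, the toy data, and the rescaling matrix `C` are all assumptions made for illustration, and the Cholesky reduction is just one standard way to solve such a problem.

```python
import numpy as np


def beta_metric(sigma, beta):
    """Return M(beta) = Sigma^beta via the eigendecomposition of Sigma."""
    lam, V = np.linalg.eigh(sigma)
    M = (V * lam**beta) @ V.T
    return (M + M.T) / 2  # symmetrise against round-off


def mapca(sigma, metric):
    """Solve Sigma w = lam * M w; the returned W satisfies W^T M W = I."""
    L = np.linalg.cholesky(metric)      # M = L L^T
    L_inv = np.linalg.inv(L)
    A = L_inv @ sigma @ L_inv.T         # equivalent standard symmetric problem
    A = (A + A.T) / 2
    lam, U = np.linalg.eigh(A)
    order = np.argsort(lam)[::-1]       # largest eigenvalue first
    return lam[order], L_inv.T @ U[:, order]


# Toy data with very different feature scales (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) * np.array([1.0, 2.0, 5.0, 10.0])
sigma = np.cov(X, rowvar=False)
base = np.linalg.eigvalsh(sigma)
ratio = base[-1] / base[0]              # lambda_1 / lambda_p

# Condition number of the generalised spectrum versus (lambda_1/lambda_p)^(1-beta).
for beta in (0.0, 0.5, 1.0):
    lam, W = mapca(sigma, beta_metric(sigma, beta))
    print(beta, lam[0] / lam[-1], ratio ** (1 - beta))

# Diagonal metric (the IPCA member): the spectrum is unchanged by a rescaling x -> C x.
C = np.diag([3.0, 0.5, 2.0, 7.0])
sigma_scaled = C @ sigma @ C
lam_ipca, _ = mapca(sigma, np.diag(np.diag(sigma)))
lam_ipca_scaled, _ = mapca(sigma_scaled, np.diag(np.diag(sigma_scaled)))
print(np.allclose(lam_ipca, lam_ipca_scaled))
```

With the diagonal metric, the reduced problem is the eigendecomposition of the correlation matrix, which is why the spectrum in this sketch does not change under a diagonal rescaling.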

2015. Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers

ArXiv: 2604.14433 [page] [pdf]

Authors: Felipe Parodi, Jordan Matelsky, Melanie Segado

Abstract: Zero-ablation -- replacing token activations with zero vectors -- is widely used to probe token function in vision transformers. Register zeroing in DINOv2+registers and DINOv3 produces large drops (up to −36.6 pp classification, −30.9 pp segmentation), suggesting registers are functionally indispensable. However, three replacement controls -- mean-substitution, noise-substitution, and cross-image register-shuffling -- preserve performance across classification, correspondence, and segmentation, remaining within ~1 pp of the unmodified baseline. Per-patch cosine similarity shows these replacements genuinely perturb internal representations, while zeroing causes disproportionately large perturbations, consistent with why it alone degrades tasks. We conclude that zero-ablation overstates dependence on exact register content. In the frozen-feature evaluations we test, performance depends on plausible register-like activations rather than on exact image-specific values. Registers nevertheless buffer dense features from [CLS] dependence and are associated with compressed patch geometry. These findings, including the replacement-control results, replicate at ViT-B scale.

Comment: Criterion 2: a rigorous analysis of self-supervised DINO vision transformers showing that commonly-used zero-ablation overstates register content dependence (large drops up to −36.6 pp) while mean/noise/cross-image replacement controls preserve performance within ~1 pp across classification, correspondence, and segmentation.

Relevance: 8 Back to [topic] [top]
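
To make the four interventions named in the abstract above (entry 2015) concrete, here is a minimal NumPy sketch of zero-ablation alongside mean-substitution, noise-substitution, and cross-image shuffling of register tokens. It is a toy under stated assumptions, not the paper's protocol: the function `replace_registers`, the token layout (1 [CLS] + 4 registers + 16 patches), the exact definitions of the "mean" and "noise" controls, and the synthetic activations with registers offset from the origin are all hypothetical.

```python
import numpy as np


def replace_registers(tokens, reg_slice, mode, rng):
    """Return a copy of `tokens` (images, tokens, dim) with register tokens replaced."""
    out = tokens.copy()
    regs = tokens[:, reg_slice, :]
    if mode == "zero":
        out[:, reg_slice, :] = 0.0                               # zero-ablation
    elif mode == "mean":
        out[:, reg_slice, :] = regs.mean(axis=0, keepdims=True)  # mean over images
    elif mode == "noise":
        mu = regs.mean(axis=0, keepdims=True)
        sd = regs.std(axis=0, keepdims=True)
        out[:, reg_slice, :] = mu + sd * rng.standard_normal(regs.shape)
    elif mode == "shuffle":
        out[:, reg_slice, :] = regs[rng.permutation(len(tokens))]  # swap across images
    else:
        raise ValueError(mode)
    return out


rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 21, 32))  # toy sizes: 8 images, 21 tokens, dim 32
reg_slice = slice(1, 5)                    # assumed register positions
tokens[:, reg_slice, :] += 5.0             # toy assumption: registers sit far from the origin


def perturbation(mode):
    """Mean L2 distance between original and replaced register tokens."""
    new = replace_registers(tokens, reg_slice, mode, np.random.default_rng(1))
    return np.linalg.norm(new[:, reg_slice] - tokens[:, reg_slice], axis=-1).mean()


shifts = {m: perturbation(m) for m in ("zero", "mean", "noise", "shuffle")}
print(shifts)
```

On this synthetic data, zeroing moves the register tokens much further than the other three replacements, simply because the toy registers were given a non-zero mean. This only illustrates why zeroing can be an outlier among interventions; it says nothing about real DINO activations.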

2016. Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography

ArXiv: 2604.15096 [page] [pdf]

Authors: Simon Böhi, Irene Cannistraci, Sergio Muñoz Gonzalez, Moritz Vandenhirtz, Sonia Laguna, Samuel Ruiperez-Campillo, Max Krähenmann, Andrea Agostini, Ece Ozkan, Thomas M. Sutter, Julia E. Vogt

Abstract: Echocardiography is a widely used modality for cardiac assessment due to its non-invasive and cost-effective nature, but the sparse and heterogeneous spatiotemporal views of the heart pose distinct challenges. Existing masked autoencoder (MAE) approaches typically process images or short clips independently, failing to capture the inherent multi-view structure required for coherent cardiac representation. We introduce Latent Attention Masked Autoencoder (LAMAE), a foundation model architecture tailored to the multi-view nature of medical imaging. LAMAE augments the standard MAE with a latent attention module that enables information exchange across frames and views directly in latent space. This allows the model to aggregate variable-length sequences and distinct views, reconstructing a holistic representation of cardiac function from partial observations. We pretrain LAMAE on MIMIC-IV-ECHO, a large-scale, uncurated dataset reflecting real-world clinical variability. To the best of our knowledge, we present the first results for predicting ICD-10 codes from MIMIC-IV-ECHO videos. Furthermore, we empirically demonstrate that representations learned from adult data transfer effectively to pediatric cohorts despite substantial anatomical differences. These results provide evidence that incorporating structural priors, such as multi-view attention, yields significantly more robust and transferable representations.

Comment: Criterion 2 — proposes a new self-supervised masked autoencoder variant (Latent Attention Masked Autoencoder, LAMAE) that augments MAE with a latent attention module for multi-view video representation, pretrained on MIMIC-IV-ECHO and evaluated for ICD-10 prediction and transfer to pediatric cohorts.

Relevance: 8 Back to [topic] [top]

3000. CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

ArXiv: 2604.14630 [page] [pdf]

Authors: Inseok Jeon, Suhwan Cho, Minhyeok Lee, Seunghoon Lee, Minseok Kang, Jungho Lee, Chaewon Park, Donghyeong Kim, Sangyoun Lee

Abstract: Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.

Comment: Criterion 3: unsupervised video object segmentation — proposes Cross-Modal Token Modulation (dense inter-token connections, relation transformer blocks, token masking) and reports state-of-the-art performance across public benchmarks.

Relevance: 10 Back to [topic] [top]

3019. Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization

ArXiv: 2604.15196 [page] [pdf]

Authors: Umer Ahmed, Syed Ahmed Mahmood, Fawad Javed Fateh, M. Shaheer Luqman, M. Zeeshan Zia, Quoc-Huy Tran

Abstract: We propose a novel hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. We first introduce a hierarchical approach, which includes two consecutive levels of vector quantization. Specifically, the lower level associates skeletons with fine-grained subactions, while the higher level further aggregates subactions into action-level representations. Our hierarchical approach outperforms the non-hierarchical baseline, while primarily exploiting spatial cues by reconstructing input skeletons. Next, we extend our approach by leveraging both spatial and temporal information, yielding a hierarchical spatiotemporal vector quantization scheme. In particular, our hierarchical spatiotemporal approach performs multi-level clustering, while simultaneously recovering input skeletons and their corresponding timestamps. Lastly, extensive experiments on multiple benchmarks, including HuGaDB, LARa, and BABEL, demonstrate that our approach establishes a new state-of-the-art performance and reduces segment length bias in unsupervised skeleton-based temporal action segmentation.

Comment: Matches criterion 3: an unsupervised temporal action segmentation method (hierarchical spatiotemporal vector quantization) evaluated on standard skeleton benchmarks HuGaDB, LARa, and BABEL.

Relevance: 7 Back to [topic] [top]
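
The two consecutive quantization levels described in the abstract above (entry 3019) can be pictured with a small NumPy sketch: a lower codebook assigns each frame to a "subaction" code, and an upper codebook assigns each subaction code to an "action" code. This is only a schematic of generic two-level nearest-codeword lookup, not the authors' model: the codebooks here are random rather than learned, there is no reconstruction or timestamp objective, and the names `quantize`, `sub_codebook`, and `act_codebook` are hypothetical.

```python
import numpy as np


def quantize(x, codebook):
    """Nearest-codeword lookup: returns (indices, quantized vectors)."""
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # squared distances
    idx = d.argmin(axis=1)
    return idx, codebook[idx]


rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 16))       # toy per-frame skeleton features
sub_codebook = rng.standard_normal((12, 16))  # fine-grained "subaction" codes (random here)
act_codebook = rng.standard_normal((4, 16))   # coarse "action" codes (random here)

sub_idx, sub_vec = quantize(frames, sub_codebook)   # level 1: frame -> subaction
act_idx, act_vec = quantize(sub_vec, act_codebook)  # level 2: subaction -> action

# Each subaction code maps to exactly one action code, so the hierarchy is consistent.
sub_to_act, _ = quantize(sub_codebook, act_codebook)
assert np.array_equal(act_idx, sub_to_act[sub_idx])
print(sub_idx[:10], act_idx[:10])
```

Because the upper level quantizes the lower-level codewords rather than the raw frames, the per-frame action labels are a deterministic coarsening of the subaction labels.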

4018. FreqTrack: Frequency Learning based Vision Transformer for RGB-Event Object Tracking

ArXiv: 2604.14526 [page] [pdf]

Authors: Jinlin You, Muyu Li, Xudong Zhao

Abstract: Existing single-modal RGB trackers often face performance bottlenecks in complex dynamic scenes, while the introduction of event sensors offers new potential for enhancing tracking capabilities. However, most current RGB-event fusion methods, primarily designed in the spatial domain using convolutional, Transformer, or Mamba architectures, fail to fully exploit the unique temporal response and high-frequency characteristics of event data. To address this, we propose FreqTrack, a frequency-aware RGBE tracking framework that establishes complementary inter-modal correlations through frequency-domain transformations for more robust feature fusion. We design a Spectral Enhancement Transformer (SET) layer that incorporates multi-head dynamic Fourier filtering to adaptively enhance and select frequency-domain features. Additionally, we develop a Wavelet Edge Refinement (WER) module, which leverages learnable wavelet transforms to explicitly extract multi-scale edge structures from event data, effectively improving modeling capability in high-speed and low-light scenarios. Extensive experiments on the COESOT and FE108 datasets demonstrate that FreqTrack achieves highly competitive performance, particularly attaining leading precision of 76.6% on the COESOT benchmark, validating the effectiveness of frequency-domain modeling for RGBE tracking.

Comment: Criterion 4: cross-modal RGB–event fusion for temporal video understanding via FreqTrack's frequency-domain Spectral Enhancement Transformer and Wavelet Edge Refinement, reporting 76.6% precision on the COESOT tracking benchmark.

Relevance: 7 Back to [topic] [top]

4021. Scouting By Reward: VLM-TO-IRL-Driven Player Selection For Esports

ArXiv: 2604.14474 [page] [pdf]

Authors: Qing Yan, Wenyu Yang, Yufei Wang, Wenhao Ma, Linchong Hu, Yifei Jin, Anton Dahbura

Abstract: Traditional esports scouting workflows rely heavily on manual video review and aggregate performance metrics, which often fail to capture the nuanced decision-making patterns necessary to determine if a prospect fits a specific tactical archetype. To address this, we reframe style-based player evaluation in esports as an Inverse Reinforcement Learning (IRL) problem. In this paper, we introduce a novel player selection framework that learns professional-specific reward functions from logged gameplay demonstrations, allowing organizations to rank candidates by their stylistic alignment with a target star player. Our proposed architecture utilizes a multimodal, two-branch intake: one branch encodes structured state-action trajectories derived from high-resolution in-game telemetry, while the second encodes temporally aligned tactical pseudo-commentary generated by Vision-Language Models (VLMs) from broadcast footage. These representations are fused and evaluated via a Generative Adversarial Imitation Learning (GAIL) objective, where a discriminator learns to capture the unique mechanical and tactical signatures of elite professionals. By transitioning from generic skill estimation to scouting "by reward," this framework provides a scalable, workflow-aware digital twin system that enables data-driven roster construction and targeted talent discovery across massive candidate pools.

Comment: Matches criterion 4 (transfer learning across modalities): uses Vision-Language Models to generate temporally aligned tactical pseudo-commentary from broadcast video and fuses it with state-action telemetry for a GAIL/IRL objective to learn stylistic reward functions for player evaluation.

Relevance: 6 Back to [topic] [top]

5001. One-shot Compositional 3D Head Avatars with Deformable Hair

ArXiv: 2604.14782 [page] [pdf]

Authors: Yuan Sun, Xuan Wang, WeiLi Zhang, Wenxuan Zhang, Yu Guo, Fei Wang

Abstract: We propose a compositional method for constructing a complete 3D head avatar from a single image. Prior one-shot holistic approaches frequently fail to produce realistic hair dynamics during animation, largely due to inadequate decoupling of hair from the facial region, resulting in entangled geometry and unnatural deformations. Our method explicitly decouples hair from the face, modeling these components using distinct deformation paradigms while integrating them into a unified rendering pipeline. Furthermore, by leveraging image-to-3D lifting techniques, we preserve fine-grained textures from the input image to the greatest extent possible, effectively mitigating the common issue of high-frequency information loss in generalized models. Specifically, given a frontal portrait image, we first perform hair removal to obtain a bald image. Both the original image and the bald image are then lifted to dense, detail-rich 3D Gaussian Splatting (3DGS) representations. For the bald 3DGS, we rig it to a FLAME mesh via non-rigid registration with a prior model, enabling natural deformation that follows the mesh triangles during animation. For the hair component, we employ semantic label supervision combined with a boundary-aware reassignment strategy to extract a clean and isolated set of hair Gaussians. To control hair deformation, we introduce a cage structure that supports Position-Based Dynamics (PBD) simulation, allowing realistic and physically plausible transformations of the hair Gaussian primitives under head motion, gravity, and inertial effects. Striking qualitative results, including dynamic animations under diverse head motions, gravity effects, and expressions, showcase substantially more realistic hair behavior alongside faithfully preserved facial details, outperforming state-of-the-art one-shot methods in perceptual realism.

Comment: Matches criteria 5 and 6: one-shot image-to-3D generation using 3D Gaussian Splatting (3DGS) with explicit hair/face decoupling, FLAME rigging, and PBD-based deformable hair to produce realistic dynamic 3D head avatars.

Relevance: 10 Back to [topic] [top]

5003. Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes

ArXiv: 2604.14914 [page] [pdf]

Authors: Victoria Yue Chen, Emery Pierson, Léopold Maillard, Maks Ovsjanikov

Abstract: Text-driven inversion of generative models is a core paradigm for manipulating 2D or 3D content, unlocking numerous applications such as text-based editing, style transfer, or inverse problems. However, it relies on the assumption that generative models remain sensitive to natural language prompts. We demonstrate that for state-of-the-art native text-to-3D generative models, this assumption often collapses. We identify a critical failure mode where generation trajectories are drawn into latent "sink traps": regions where the model becomes insensitive to prompt modifications. In these regimes, changes to the input text fail to alter internal representations in a way that alters the output geometry. Crucially, we observe that this is not a limitation of the model's geometric expressivity; the same generative models possess the ability to produce a vast diversity of shapes but, as we demonstrate, become insensitive to out-of-distribution text guidance. We investigate this behavior by analyzing the sampling trajectories of the generative model, and find that complex geometries can still be represented and produced by leveraging the model's unconditional generative prior. This leads to a more robust framework for text-based 3D shape editing that bypasses latent sinks by decoupling a model's geometric representation power from its linguistic sensitivity. Our approach addresses the limitations of current 3D pipelines and enables high-fidelity semantic manipulation of out-of-distribution 3D shapes. Project webpage: https://daidedou.sorpi.fr/publication/beyondprompts

Comment: Matches criterion 5: advances in text-to-3D generative models — analyzes latent "sink traps" in text-driven 3D inversion and proposes an unconditional inversion framework to enable robust semantic manipulation of out-of-distribution 3D shapes.

Relevance: 10 Back to [topic] [top]

5008. TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens

ArXiv: 2604.15239 [page] [pdf]

Authors: Jiawei Ren, Michal Jan Tyszkiewicz, Jiahui Huang, Zan Gojcic

Abstract: In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss. This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby unbinding the number of predicted primitives from input image resolution and number of views. Our resulting method, TokenGS, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, producing more regularized geometry and more balanced 3DGS distribution, while seamlessly recovering emergent scene attributes such as static-dynamic decomposition and scene flow.

Comment: Matches criteria 5 and 6: proposes TokenGS for feed-forward 3D Gaussian Splatting (learnable Gaussian tokens, encoder–decoder) achieving state-of-the-art feed-forward reconstruction on static and dynamic scenes.

Relevance: 10 Back to [topic] [top]

5013. GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens

ArXiv: 2604.15284 [page] [pdf]

Authors: Roni Itkin, Noam Issachar, Yehonatan Keypur, Anpei Chen, Sagie Benaim

Abstract: The efficient spatial allocation of primitives serves as the foundation of 3D Gaussian Splatting, as it directly dictates the synergy between representation compactness, reconstruction speed, and rendering fidelity. Previous solutions, whether based on iterative optimization or feed-forward inference, suffer from significant trade-offs between these goals, mainly due to the reliance on local, heuristic-driven allocation strategies that lack global scene awareness. Specifically, current feed-forward methods are largely pixel-aligned or voxel-aligned. By unprojecting pixels into dense, view-aligned primitives, they bake redundancy into the 3D asset. As more input views are added, the representation size increases and global consistency becomes fragile. To this end, we introduce GlobalSplat, a framework built on the principle of align first, decode later. Our approach learns a compact, global, latent scene representation that encodes multi-view input and resolves cross-view correspondences before decoding any explicit 3D geometry. Crucially, this formulation enables compact, globally consistent reconstructions without relying on pretrained pixel-prediction backbones or reusing latent features from dense baselines. Utilizing a coarse-to-fine training curriculum that gradually increases decoded capacity, GlobalSplat natively prevents representation bloat. On RealEstate10K and ACID, our model achieves competitive novel-view synthesis performance while utilizing as few as 16K Gaussians, significantly less than required by dense pipelines, obtaining a light 4MB footprint. Further, GlobalSplat enables significantly faster inference than the baselines, operating under 78 milliseconds in a single forward pass. Project page is available at https://r-itk.github.io/globalsplat/

Comment: Matches criteria 5 and 6: GlobalSplat introduces global scene tokens for efficient feed-forward 3D Gaussian Splatting, achieving compact reconstructions (as few as 16K Gaussians) and competitive novel-view synthesis on RealEstate10K and ACID.

Relevance: 9 Back to [topic] [top]

5014. Giving Faces Their Feelings Back: Explicit Emotion Control for Feedforward Single-Image 3D Head Avatars

ArXiv: 2604.14541 [page] [pdf]

Authors: Yicheng Gong, Jiawei Zhang, Liqiang Liu, Yanwen Wang, Lei Chu, Jiahao Li, Hao Pan, Hao Zhu, Yan Lu

Abstract: We present a framework for explicit emotion control in feed-forward, single-image 3D head avatar reconstruction. Unlike existing pipelines where emotion is implicitly entangled with geometry or appearance, we treat emotion as a first-class control signal that can be manipulated independently and consistently across identities. Our method injects emotion into existing feed-forward architectures via a dual-path modulation mechanism without modifying their core design. Geometry modulation performs emotion-conditioned normalization in the original parametric space, disentangling emotional state from speech-driven articulation, while appearance modulation captures identity-aware, emotion-dependent visual cues beyond geometry. To enable learning under this setting, we construct a time-synchronized, emotion-consistent multi-identity dataset by transferring aligned emotional dynamics across identities. Integrated into multiple state-of-the-art backbones, our framework preserves reconstruction and reenactment fidelity while enabling controllable emotion transfer, disentangled manipulation, and smooth emotion interpolation, advancing expressive and scalable 3D head avatars.

Comment: Matches criterion 5 (image-to-3D generation): proposes a framework for explicit emotion control in feed-forward single-image 3D head avatar reconstruction, using dual-path modulation (geometry and appearance) and an emotion-consistent multi-identity dataset.

Relevance: 8 Back to [topic] [top]

6002. NG-GS: NeRF-Guided 3D Gaussian Splatting Segmentation

ArXiv: 2604.14706 [page] [pdf]

Authors: Yi He, Tao Wang, Yi Jin, Congyan Lang, Yidong Li, Haibin Ling

Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled highly efficient and photorealistic novel view synthesis. However, segmenting objects accurately in 3DGS remains challenging due to the discrete nature of Gaussian representations, which often leads to aliasing and artifacts at object boundaries. In this paper, we introduce NG-GS, a novel framework for high-quality object segmentation in 3DGS that explicitly addresses boundary discretization. Our approach begins by automatically identifying ambiguous Gaussians at object boundaries using mask variance analysis. We then apply radial basis function (RBF) interpolation to construct a spatially continuous feature field, enhanced by multi-resolution hash encoding for efficient multi-scale representation. A joint optimization strategy aligns 3DGS with a lightweight NeRF module through alignment and spatial continuity losses, ensuring smooth and consistent segmentation boundaries. Extensive experiments on NVOS, LERF-OVS, and ScanNet benchmarks demonstrate that our method achieves state-of-the-art performance, with significant gains in boundary mIoU. Code is available at https://github.com/BJTU-KD3D/NG-GS.

Comment: Matches criterion 6: proposes NG-GS for high-quality object segmentation in 3D Gaussian Splatting, jointly aligning 3DGS with a lightweight NeRF module and reporting state-of-the-art performance with significant gains in boundary mIoU on NVOS, LERF-OVS, and ScanNet.

Relevance: 10 Back to [topic] [top]

6004. Hybrid Latents -- Geometry-Appearance-Aware Surfel Splatting

ArXiv: 2604.14928 [page] [pdf]

Authors: Neel Kelkar, Simon Niedermayr, Klaus Engel, Rüdiger Westermann

Abstract: We introduce a hybrid Gaussian-hash-grid radiance representation for reconstructing 2D Gaussian scene models from multi-view images. Similar to NeST splatting, our approach reduces the entanglement between geometry and appearance common in NeRF-based models, but adds per-Gaussian latent features alongside hash-grid features to bias the optimizer toward a separation of low- and high-frequency scene components. This explicit frequency-based decomposition reduces the tendency of high-frequency texture to compensate for geometric errors. Encouraging Gaussians with hard opacity falloffs further strengthens the separation between geometry and appearance, improving both geometry reconstruction and rendering efficiency. Finally, probabilistic pruning combined with a sparsity-inducing BCE opacity loss allows redundant Gaussians to be turned off, yielding a minimal set of Gaussians sufficient to represent the scene. Using both synthetic and real-world datasets, we compare against the state of the art in Gaussian-based novel-view synthesis and demonstrate superior reconstruction fidelity with an order of magnitude fewer primitives.

Comment: Matches criterion 6 (Gaussian/surfel-splatting 3D reconstruction): introduces a hybrid Gaussian-hash-grid radiance representation with per-Gaussian latents and probabilistic pruning, reporting superior reconstruction fidelity with an order of magnitude fewer primitives.

Relevance: 10 Back to [topic] [top]

6005. HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

ArXiv: 2604.14268 [page] [pdf]

Authors: Team HY-World, Chenjie Cao, Xuhui Zuo, Zhenwei Wang, Yisu Zhang, Junta Wu, Zhenyang Liu, Yuning Gong, Yang Liu, Bo Yuan, Chao Zhang, Coopers Li, Dongyuan Guo, Fan Yang, Haiyu Zhang, Hang Cao, Jianchen Zhu, Jiaxin Lin, Jie Xiao, Jihong Zhang, Junlin Yu, Lei Wang, Lifu Wang, Lilin Wang, Linus, Minghui Chen, Peng He, Penghao Zhao, Qi Chen, Rui Chen, Rui Shao, Sicong Liu, Wangchen Qin, Xiaochuan Niu, Xiang Yuan, Yi Sun, Yifei Tang, Yifu Sun, Yihang Lian, Yonghao Tan, Yuhong Liu, Yuyang Yin, Zhiyuan Min, Tengfei Wang, Chunchao Guo

Abstract: We introduce HY-World 2.0, a multi-modal world model framework that advances our prior project HY-World 1.0. HY-World 2.0 accommodates diverse input modalities, including text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. With text or single-view image inputs, the model performs world generation, synthesizing high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes. This is achieved through a four-stage method: a) Panorama Generation with HY-Pano 2.0, b) Trajectory Planning with WorldNav, c) World Expansion with WorldStereo 2.0, and d) World Composition with WorldMirror 2.0. Specifically, we introduce key innovations to enhance panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe-based view generation model with consistent memory. We also upgrade WorldMirror, a feed-forward model for universal 3D prediction, by refining model architecture and learning strategy, enabling world reconstruction from multi-view images or videos. Also, we introduce WorldLens, a high-performance 3DGS rendering platform featuring a flexible engine-agnostic architecture, automatic IBL lighting, efficient collision detection, and training-rendering co-design, enabling interactive exploration of 3D worlds with character support. Extensive experiments demonstrate that HY-World 2.0 achieves state-of-the-art performance on several benchmarks among open-source approaches, delivering results comparable to the closed-source model Marble. We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.

Comment: Matches criterion 6: HY-World 2.0 is a multi-modal world model that reconstructs and generates navigable 3D Gaussian Splatting (3DGS) scenes (WorldStereo 2.0, WorldMirror 2.0, WorldLens). It reports state-of-the-art results among open-source approaches, comparable to the closed-source model Marble, and releases model weights and code.

Relevance: 10 Back to [topic] [top]

6020. StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

ArXiv: 2604.15237 [page] [pdf]

Authors: Xuanyi Liu, Deyi Ji, Chunan Yu, Qi Zhu, Xuanfu Li, Jin Ma, Tianrun Chen, Lanyun Zhu

Abstract: Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing O(1) frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state-of-the-art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.

Comment: Matches criterion 6: a method for reconstructing dense 3D geometry from continuous video streams (StreamCacheVGGT) with novel cache scoring and compression evaluated on 7-Scenes, NRGBD, ETH3D, Bonn, and KITTI.

Relevance: 7 Back to [topic] [top]