VIRD: View-invariant representation through dual-axis transformation for cross-view pose estimation

Published in CVPR-26, 2026

Accurate global localization is crucial for autonomous driving and robotics, especially in dense urban environments where GNSS is often unreliable due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF camera pose of a ground-view image with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views, mainly due to limited spatial correspondences. To address this challenge, we propose a novel cross-view pose estimation method that constructs a view-invariant representation through dual-axis transformation (VIRD). VIRD first applies a polar transformation to the satellite view to establish horizontal correspondence, then uses context-enhanced positional attention on the ground and polar-transformed satellite features to resolve vertical misalignment, explicitly mitigating the viewpoint gap. A view-reconstruction loss is introduced to further strengthen view invariance, encouraging the derived representations to reconstruct both the original and cross-view images. Experiments on the KITTI and VIGOR datasets demonstrate that VIRD outperforms state-of-the-art methods, reducing median position and orientation errors by 50.7% and 76.5% on KITTI, and by 18.0% and 46.8% on VIGOR, respectively.
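The polar transformation mentioned above can be illustrated with a minimal NumPy sketch. This follows the standard parameterization used in cross-view matching, where output columns index azimuth and output rows index radial distance from the satellite image center, so each column roughly aligns with one viewing direction of a ground panorama. The function name, output size, and nearest-neighbor sampling are illustrative assumptions; the paper's exact parameterization and interpolation may differ.

```python
import numpy as np

def polar_transform(sat, height, width):
    """Resample a square aerial image into polar coordinates.

    Illustrative sketch (not the paper's implementation):
    row i  -> radial distance from the image center (0 = outermost ring),
    col j  -> azimuth angle, measured clockwise from north.
    Nearest-neighbor sampling; out-of-bounds pixels stay zero.
    """
    S = sat.shape[0]                      # assume a square S x S input
    out = np.zeros((height, width, sat.shape[2]), dtype=sat.dtype)
    center = (S - 1) / 2.0
    for i in range(height):
        # radius shrinks linearly from S/2 (outer ring) toward the center
        r = (S / 2.0) * (height - i) / height
        for j in range(width):
            theta = 2.0 * np.pi * j / width
            x = center + r * np.sin(theta)   # column (east)
            y = center - r * np.cos(theta)   # row (north is up)
            xi, yi = int(round(x)), int(round(y))
            if 0 <= xi < S and 0 <= yi < S:
                out[i, j] = sat[yi, xi]
    return out
```

After this remapping, a rotation of the vehicle about the vertical axis becomes a horizontal shift of the polar image, which is what makes horizontal correspondence with the ground view tractable.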

Download paper here

Juhye Park, Wooju Lee, Dasol Hong, Changki Sung, Youngwoo Seo, Dongwan Kang, Hyun Myung, VIRD: View-invariant representation through dual-axis transformation for cross-view pose estimation, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR-2026), (to appear), 2026.