SAGA: Surface-Aligned Gaussian Avatar

Ronghan Chen1, Yang Cong2, Jiayue Liu2
1State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences
2South China University of Technology
TL;DR: Given a monocular video, SAGA reconstructs a Gaussian-based avatar by aligning the Gaussians with a mesh, which improves generalization under novel views and poses and enables direct mesh reconstruction.

Abstract

This paper presents a Surface-Aligned Gaussian representation for creating animatable human avatars from monocular video, aiming to improve novel view and pose synthesis while ensuring fast training and real-time rendering. Recently, 3D Gaussian Splatting (3DGS) has emerged as a more efficient and expressive alternative to neural radiance fields, and has been used to create dynamic human avatars. However, when applied to the severely ill-posed task of monocular reconstruction, transient regions such as clothing wrinkles and shadows change constantly and cannot provide consistent supervision for the Gaussians, resulting in noisy geometry and abrupt deformations that typically fail to generalize under novel views and poses. To address these limitations, we present SAGA, i.e., Surface-Aligned Gaussian Avatar, which aligns the Gaussians with a mesh to enforce well-defined geometry and consistent deformation, thereby improving generalization under novel views and poses. Unlike existing strict alignment methods that suffer from limited expressive power and low realism, SAGA employs a two-stage alignment strategy in which the Gaussians are first adhered to and then detached from the mesh, facilitating both good geometry and high expressivity. In the first stage, we improve the flexibility of Adhered-on-Mesh Gaussians by allowing them to flow on the mesh, in contrast to existing methods that rigidly bind Gaussians to fixed locations. In the second stage, we introduce a Gaussian-Mesh Alignment regularization that constrains the geometry and deformation of the detached Gaussians by minimizing their location and orientation offsets from the bound triangle. Finally, an efficient Walking-on-Mesh strategy is introduced to dynamically update the bound triangle as Gaussians drift outside it, ensuring accurate regularization even as the geometry evolves. Experiments on challenging datasets demonstrate that SAGA outperforms both NeRF-based and Gaussian-based methods on novel view and pose synthesis tasks, with a state-of-the-art training time of 12 minutes and real-time rendering at 60+ FPS. Additionally, we show that SAGA enables direct high-quality mesh extraction from the Gaussians, marking the first attempt at such extraction from deformable Gaussians learned from monocular human video.
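The core of the second stage is the Gaussian-Mesh Alignment regularization, which penalizes each detached Gaussian's location and orientation offset from its bound triangle. Below is a minimal PyTorch sketch of what such a regularizer could look like; the function name, the plane-distance proxy for the location offset, the choice of Gaussian axis to align with the surface normal, and the loss weights are our own illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a Gaussian-Mesh Alignment regularizer (not the
# paper's exact loss). Assumes each Gaussian is associated with one
# "bound" triangle of the mesh in canonical space.
import torch
import torch.nn.functional as F

def gaussian_mesh_alignment_loss(centers, normal_axes, tri_verts,
                                 w_loc=1.0, w_rot=0.1):
    """
    centers:     (N, 3) Gaussian means.
    normal_axes: (N, 3) the Gaussian axis expected to align with the surface
                 normal (e.g., the axis with the smallest scale).
    tri_verts:   (N, 3, 3) vertices of each Gaussian's bound triangle.
    """
    v0, v1, v2 = tri_verts.unbind(dim=1)
    # Triangle unit normal and centroid.
    n = F.normalize(torch.cross(v1 - v0, v2 - v0, dim=-1), dim=-1)
    c = (v0 + v1 + v2) / 3.0
    # Location offset: distance of the Gaussian center to the triangle
    # plane (a cheap proxy for the point-to-triangle distance).
    d = ((centers - c) * n).sum(dim=-1)
    loss_loc = d.abs().mean()
    # Orientation offset: penalize misalignment between the chosen
    # Gaussian axis and the triangle normal (sign-invariant).
    cos = (F.normalize(normal_axes, dim=-1) * n).sum(dim=-1)
    loss_rot = (1.0 - cos.abs()).mean()
    return w_loc * loss_loc + w_rot * loss_rot
```

In a full pipeline, the Walking-on-Mesh step would re-assign `tri_verts` whenever a Gaussian's closest surface point leaves its current triangle, e.g., by stepping across the exited edge to the adjacent face, so the regularization targets stay accurate as the geometry evolves.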