Neural Radiance Field (NeRF) is a representation for 3D reconstruction from multi-view images. Despite some recent work showing preliminary success in editing a reconstructed NeRF with diffusion priors, these methods still struggle to synthesize reasonable geometry in completely uncovered regions. One major reason is the high diversity of the contents synthesized by the diffusion model, which hinders the radiance field from converging to a crisp and deterministic geometry. Moreover, applying latent diffusion models to real data often yields a textural shift that is incoherent with the image condition due to auto-encoding errors. These two problems are further reinforced by the use of pixel-distance losses. To address these issues, we propose tempering the diffusion model's stochasticity with per-scene customization and mitigating the textural shift with masked adversarial training. During our analyses, we also found that the commonly used pixel and perceptual losses are harmful to the NeRF inpainting task. Through rigorous experiments, our framework yields state-of-the-art NeRF inpainting results on various real-world scenes.
Table 1: Quantitative comparisons. We present the results on the SPIn-NeRF and LLFF datasets. Note that the LLFF dataset does not have ground-truth views with the object physically removed; therefore, we only measure C-FID and C-KID on these scenes. The best performance is underscored.
Methods            SPIn-NeRF                                      LLFF
                   LPIPS (↓)  M-LPIPS (↓)  FID (↓)   KID (↓)      C-FID (↓)  C-KID (↓)
SPIn-NeRF          0.5356     0.4019       219.80    0.0616       231.91     0.0654
SPIn-NeRF (LDM)    0.5568     0.4284       227.87    0.0558       235.67     0.0642
Inpaint3d          0.5437     0.4374       271.66    0.0964       -          -
InpaintNeRF360     0.4694     0.3672       222.12    0.0544       174.55     0.0397
Ours               0.4345     0.3344       183.25    0.0397       171.89     0.0388

FID/KID: These metrics from the generative model literature quantify the distributional similarity between two sets of images and are sensitive to visual artifacts. For each evaluated method, we compute the scores using the NeRF-rendered and ground-truth images of all test views across all scenes in the dataset.

C-FID/C-KID: For the LLFF dataset, since it does not include test views with the object physically removed, we instead measure the visual quality near the inpainting border. More specifically, we find the four furthest corners of the inpainting mask and crop image patches centered at these corners. Finally, we compute the FID/KID scores between the real-image patches and those from the NeRF renderings after the object is removed and inpainted.
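To make the C-FID/C-KID protocol concrete, below is a minimal sketch of the corner-patch extraction, assuming the mask's bounding-box corners serve as the "four furthest corners" and a fixed 64-pixel patch size; the helper name, patch size, and the clean-fid calls in the comments are illustrative assumptions rather than the exact evaluation code.

```python
import numpy as np

def corner_patches(image, mask, patch=64):
    """Crop patches centered at the four corners of the inpainting mask.

    image: (H, W, 3) uint8 array; mask: (H, W) boolean inpainting mask.
    The "four furthest corners" are taken as the corners of the mask's
    bounding box (an assumption); patch windows are clamped to the image.
    """
    ys, xs = np.nonzero(mask)
    corners = [(ys.min(), xs.min()), (ys.min(), xs.max()),
               (ys.max(), xs.min()), (ys.max(), xs.max())]
    h, w = mask.shape
    patches = []
    for cy, cx in corners:
        # Keep the patch fully inside the image bounds.
        y0 = int(np.clip(cy - patch // 2, 0, h - patch))
        x0 = int(np.clip(cx - patch // 2, 0, w - patch))
        patches.append(image[y0:y0 + patch, x0:x0 + patch])
    return patches

# C-FID/C-KID: collect corner patches from the ground-truth views and from the
# inpainted NeRF renderings, save them to two folders, then score the two sets,
# e.g. with the clean-fid package:
#   from cleanfid import fid
#   c_fid = fid.compute_fid("patches_real/", "patches_rendered/")
#   c_kid = fid.compute_kid("patches_real/", "patches_rendered/")
```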
Evaluated methods. We compare our method with the following approaches:
- SPIn-NeRF: We use the official implementation. Note that the authors do not provide the evaluation implementation; therefore, the LPIPS scores reported in this paper differ from those presented in the SPIn-NeRF paper. Nevertheless, we have contacted the authors to ensure that our SPIn-NeRF results match the quality shown in the original paper.
- SPIn-NeRF (LDM): We replace the LaMa model with our latent diffusion model in the SPIn-NeRF approach while keeping all default hyperparameter settings.
- InpaintNeRF360: We implement the algorithm ourselves, as no source code is available. Specifically, we use our latent diffusion model for per-view inpainting and optimize a NeRF with the same network architecture devised in our approach, using the objectives proposed in the paper.
- Inpaint3d: We reached out to the authors and obtained the rendered images of all test views for evaluation on the SPIn-NeRF dataset.
4.2 Per-Scene Finetuning

In Figure 6, we qualitatively show the effectiveness of our per-scene finetuning of the latent diffusion model. Before the finetuning, our latent diffusion model inpaints arbitrary appearance, and often even creates arbitrary objects in the inpainting region. Such a high variation is a major issue that hinders the NeRF from converging to a crisp and deterministic geometry.

Fig. 5: Drawbacks of LPIPS. In some cases, the LPIPS score fails to indicate the visual quality. For example, generating a realistic baseball cap actually lowers the score, as there is no object in the inpainting area of the ground-truth image. (a) Ground truth; (b) NeRF A (masked LPIPS: 0.3675); (c) NeRF B (masked LPIPS: 0.3692).

Fig. 6: Per-scene customization. Our per-scene customization effectively forces the latent diffusion model to synthesize consistent and in-context contents across views (rows compare results w/o and w/ customization).
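For readers who want a concrete picture of what per-scene customization involves, the following is a minimal sketch that finetunes a latent diffusion inpainting model on the views of one scene with the standard epsilon-prediction loss, using Hugging Face diffusers. The base checkpoint, prompt, learning rate, and the `scene_views` loader (yielding image batches in [-1, 1] and binary inpainting masks) are assumptions for illustration; the paper's actual customization recipe may differ in which parameters are updated and how the objective is masked.

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

# Assumed base checkpoint of a latent diffusion inpainting model.
model_id = "runwayml/stable-diffusion-inpainting"
device = "cuda"

vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device)
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device)
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Only the UNet is updated during per-scene customization (an assumption).
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

# A fixed, generic prompt for conditioning (assumption).
ids = tokenizer("a photo of the scene", padding="max_length", truncation=True,
                max_length=tokenizer.model_max_length, return_tensors="pt").input_ids.to(device)
with torch.no_grad():
    text_emb = text_encoder(ids)[0]

for image, mask in scene_views:  # hypothetical loader: (B,3,H,W) in [-1,1], (B,1,H,W) in {0,1}
    image, mask = image.to(device), mask.to(device)
    with torch.no_grad():
        latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
        masked = vae.encode(image * (1 - mask)).latent_dist.sample() * vae.config.scaling_factor
    mask_lr = F.interpolate(mask, size=latents.shape[-2:])

    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],), device=device)
    noisy = scheduler.add_noise(latents, noise, t)

    # Inpainting UNet input: [noisy latents | downsampled mask | masked-image latents].
    model_in = torch.cat([noisy, mask_lr, masked], dim=1)
    pred = unet(model_in, t, encoder_hidden_states=text_emb.expand(latents.shape[0], -1, -1)).sample

    # Standard denoising objective; finetuning on a single scene tempers the model's diversity.
    loss = F.mse_loss(pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```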
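Relatedly, the masked LPIPS (M-LPIPS) numbers in Table 1 and Fig. 5 restrict the perceptual distance to the inpainting region. Below is a minimal sketch of one way to compute such a score, using the lpips package's spatial mode and averaging the distance map inside the mask; treating this as the paper's exact masking scheme would be an assumption.

```python
import torch
import lpips

# Spatial mode returns a per-pixel distance map instead of a single scalar.
lpips_fn = lpips.LPIPS(net="vgg", spatial=True)

def masked_lpips(pred, target, mask):
    """pred/target: (N,3,H,W) in [-1,1]; mask: (N,1,H,W) with 1 inside the inpainting region."""
    dist_map = lpips_fn(pred, target)  # (N,1,H,W) perceptual distance map
    return (dist_map * mask).sum() / mask.sum().clamp(min=1)
```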