LoMOE: Localized Multi-Object Editing via Multi-Diffusion

Goirik Chakrabarty, Aditya Chandrasekar, Ramya Hebbalaguppe, Prathosh AP
¹TCS Research, ²IISc Bangalore, ³IIT Delhi
Representative Examples of Editing

Abstract

Recent developments in diffusion models have demonstrated an exceptional capacity to generate high-quality, prompt-conditioned image edits. Nevertheless, previous approaches have primarily relied on textual prompts for image editing, which tend to be less effective when making precise edits to specific objects or fine-grained regions within a scene containing one or more objects.

We introduce a novel framework for zero-shot localized multi-object editing through a multi-diffusion process to overcome this challenge. This framework empowers users to perform various operations on objects within an image, such as adding, replacing, or editing many objects in a complex scene in a single pass. Our approach leverages foreground masks and corresponding simple text prompts that exert localized influence on the target regions, resulting in high-fidelity image editing. A combination of cross-attention and background preservation losses within the latent space ensures that the characteristics of the object being edited are preserved, while simultaneously achieving a high-quality, seamless reconstruction of the background with fewer artifacts compared to the state of the art (SOTA). We also curate and release a dataset dedicated to multi-object editing, named LoMOE-Bench. Our experiments against existing SOTA demonstrate the improved effectiveness of our approach in terms of both image-editing quality and inference speed.
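For intuition, the following is a minimal PyTorch sketch of how such a combined latent-space objective could look; the function name, tensor shapes, and the weights `lambda_xa` and `lambda_b` are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def preservation_loss(attn_maps, ref_attn_maps,
                      latents, ref_latents,
                      background_mask, lambda_xa=1.0, lambda_b=1.0):
    """Hypothetical combination of a cross-attention loss and a
    background-preservation loss, both computed in latent space."""
    # Cross-attention consistency: keep the edited image's attention
    # maps close to those of the reference (reconstruction) pass, so
    # the edited object's characteristics are preserved.
    l_xa = F.mse_loss(attn_maps, ref_attn_maps)
    # Background preservation: penalize drift of the latents outside
    # the edit masks relative to the reconstruction latents.
    l_b = F.mse_loss(latents * background_mask,
                     ref_latents * background_mask)
    return lambda_xa * l_xa + lambda_b * l_b
```

In this sketch, the two terms directly mirror the abstract's description: one anchors the edited object's attention structure, the other anchors the unmasked background.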

Schematic Diagram


LoMOE comprises three main steps: Inversion produces \(\mathbf{x}_{inv}\) and \(c_0\) corresponding to the input \(\mathbf{x}_0\). The multi-diffusion process restricts the edits to regions \(M_1, M_2\), guided by \(c_1, c_2\). Preservation of attributes is achieved via \(\mathcal{L}_{xa}\) and \(\mathcal{L}_b\), using reference cross-attention maps and background latents obtained through a reconstruction process.
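To make the mask-restricted step above concrete, here is a minimal sketch of a MultiDiffusion-style fused denoising step, assuming binary masks and a `denoise_step(latent, emb)` callable standing in for a single scheduler/UNet update; uniform averaging over overlapping masks is an illustrative choice, not necessarily LoMOE's exact rule.

```python
import torch

def multi_diffusion_step(latent, masks, prompt_embs, denoise_step):
    """One fused step: denoise once per (mask, prompt) pair, then blend
    so each prompt only influences its own masked region."""
    blended = torch.zeros_like(latent)
    weight = torch.zeros_like(latent)
    for mask, emb in zip(masks, prompt_embs):
        # Each region is denoised under its own localized prompt.
        region_latent = denoise_step(latent, emb)
        blended += mask * region_latent
        weight += mask
    # Keep the unedited latent where no mask applies (background);
    # average per-prompt predictions where masks overlap.
    return torch.where(weight > 0,
                       blended / weight.clamp(min=1e-8),
                       latent)
```

Restricting each prompt's influence to its mask is what lets several objects be edited in a single pass without the prompts interfering with one another.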

Comparisons

Single-Object Editing Comparison


Comparison among contemporary methods for Single-Object Edits: We observe that SDEdit and InstructP2P tend to modify the whole image. GLIDE often inpaints and removes the subject of the edit in cases where it fails to generate the edit. DiffEdit produces the same output as SDEdit while preserving the unmasked regions of the input image. BLD does not preserve the structure of the input and makes unintended attribute edits to the masked subject. Finally, we observe that our proposed LoMOE makes the intended edit, preserves the unmasked region, and avoids unintended attribute edits.


Multi-Object Editing Comparison


Comparison with contemporary methods for Multi-Object Edits: While the baselines variously fail to make the edit, accumulate artifacts, edit the unmasked region, or make unintended attribute edits, LoMOE faithfully edits in accordance with the target prompts.

BibTeX

@inproceedings{chakrabarty2024lomoe,
  author    = {Chakrabarty, Goirik and Chandrasekar, Aditya and Hebbalaguppe, Ramya and AP, Prathosh},
  title     = {LoMOE: Localized Multi-Object Editing via Multi-Diffusion},
  booktitle = {ACM Multimedia (ACMMM)},
  year      = {2024},
}