MaskFactory: Towards High-Quality Synthetic Data Generation for Dichotomous Image Segmentation

1 State Key Lab of CAD&CG, Zhejiang University, 2 VCIP&CS, Nankai University
3 MBZUAI, 4 Linköping University

* Indicates Equal Contribution. † Indicates Corresponding Author.

This figure shows the edited masks from the first stage and the corresponding images generated in the second stage. In the examples, we transformed the viewpoint of park benches and tables from a frontal view to a top-down view and edited their shapes, changing the park benches' edges from curved to square and the tables' shapes from square to circular.

Abstract

Dichotomous Image Segmentation (DIS) tasks require highly precise annotations, and traditional dataset creation methods are labor-intensive, costly, and demand extensive domain expertise. Although using synthetic data for DIS is a promising solution to these challenges, current generative models and techniques struggle with scene deviations, noise-induced errors, and limited training sample variability. To address these issues, we introduce MaskFactory, a novel approach that provides a scalable solution for generating diverse and precise datasets, markedly reducing preparation time and costs. We first introduce a general mask editing method that combines rigid and non-rigid editing techniques to generate high-quality synthetic masks. Specifically, rigid editing leverages geometric priors from diffusion models to achieve precise viewpoint transformations under zero-shot conditions, while non-rigid editing employs adversarial training and self-attention mechanisms for complex, topologically consistent modifications. We then generate pairs of high-resolution images and accurate segmentation masks using a multi-conditional control generation method. Finally, our experiments on the widely used DIS5K benchmark demonstrate superior performance in quality and efficiency compared to existing methods. Code will be publicly available.
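To make the notion of a rigid mask edit concrete: the paper performs viewpoint transformation with geometric priors from diffusion models, which cannot be reproduced in a few lines. The minimal sketch below instead uses a plain homography warp in OpenCV as a geometric stand-in, not the authors' method; the transform values and file names (gt_mask.png, edited_mask.png) are illustrative assumptions.

import cv2
import numpy as np

def rigid_edit_mask(mask: np.ndarray, homography: np.ndarray) -> np.ndarray:
    # Warp a binary mask with a 3x3 homography -- a plain-geometry stand-in
    # for the paper's diffusion-prior viewpoint transformation.
    h, w = mask.shape[:2]
    warped = cv2.warpPerspective(mask, homography, (w, h), flags=cv2.INTER_NEAREST)
    # Re-binarize so the edited mask stays strictly {0, 255}.
    return (warped > 127).astype(np.uint8) * 255

# Illustrative transform: tilt the top edge inward to mimic a frontal-to-top-down shift.
src = np.float32([[0, 0], [511, 0], [511, 511], [0, 511]])
dst = np.float32([[60, 40], [451, 40], [511, 511], [0, 511]])
H = cv2.getPerspectiveTransform(src, dst)

mask = cv2.imread("gt_mask.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input path
edited = rigid_edit_mask(mask, H)
cv2.imwrite("edited_mask.png", edited)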

Workflow


We propose a two-step method that synthesizes high-quality, diverse object masks via mask editing and then generates corresponding high-resolution images using a multi-conditional control generation method. In the first stage, we generate new masks by applying rigid and non-rigid editing to existing ground-truth masks. In the second stage, we use the generated masks and their corresponding Canny edges as conditions, along with a prompt naming the object category, to generate RGB images, as sketched below. Each generated image, paired with its edited mask, forms one synthetic training sample.
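As a minimal sketch of this second stage, the code below assumes a Stable Diffusion 1.5 backbone with two off-the-shelf ControlNets (segmentation and Canny) from Hugging Face diffusers, conditioned on the edited mask and its Canny edges plus a category prompt, as described above. The model IDs, conditioning scales, prompt, and file names are assumptions for illustration, not the authors' released implementation.

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Two conditions: the edited mask itself and its Canny edges.
mask = cv2.imread("edited_mask.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path
mask = cv2.resize(mask, (512, 512), interpolation=cv2.INTER_NEAREST)
edges = cv2.Canny(mask, 100, 200)
mask_cond = Image.fromarray(np.stack([mask] * 3, axis=-1))
edge_cond = Image.fromarray(np.stack([edges] * 3, axis=-1))

# Assumed backbone: SD 1.5 with seg + canny ControlNets (model IDs are assumptions).
controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16),
]
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnets, torch_dtype=torch.float16
).to("cuda")

# The prompt carries the object category, as in the workflow above.
image = pipe(
    "a photo of a park bench",
    image=[mask_cond, edge_cond],
    controlnet_conditioning_scale=[1.0, 0.8],
    num_inference_steps=30,
).images[0]
image.save("generated_rgb.png")  # paired with edited_mask.png as one training sample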

Experiment

BibTeX

@inproceedings{qianmaskfactory,
  title={MaskFactory: Towards High-quality Synthetic Data Generation for Dichotomous Image Segmentation},
  author={Qian, Haotian and Chen, Yinda and Lou, Shengtao and Khan, Fahad and Jin, Xiaogang and Fan, Deng-Ping},
  booktitle={NeurIPS},
  year={2024}
}