1. Introduction
Generative AI (GenAI) is revolutionizing complex industrial workflows. In the garment industry, the traditional pipeline—from customer needs to designer, pattern maker, tailor, and final delivery—is being augmented by Large Multimodal Models (LMMs). While current LMMs excel at analyzing customer preferences for item recommendation, a significant gap exists in enabling fine-grained, user-driven customization. Users increasingly wish to act as their own designers, creating and iterating on designs until satisfied. However, pure text-based prompts (e.g., "white blazer") suffer from ambiguity, lacking the professional detail (e.g., specific collar style) a designer would infer. This paper introduces the Better Understanding Generation (BUG) workflow, which leverages LMMs to interpret image-into-prompt inputs alongside text, enabling precise, iterative fashion design edits that bridge the gap between amateur user intent and professional-grade output.
2. Methodology
2.1 The BUG Workflow
The BUG workflow simulates a real-world design consultation. It begins with an initialization phase where a base garment image is generated from a user's text description (e.g., "a cotton blazer with fabric patterns"). Subsequently, the user can request edits through an iterative loop. Each iteration involves a text-as-prompt (e.g., "modify the collar") and, crucially, an image-into-prompt—a reference image illustrating the desired style element (e.g., a picture of a peaked lapel). The LMM processes this multimodal input to produce the edited design, which the user can accept or use as the base for the next refinement.
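As a rough illustration of this loop only (the paper does not publish code, so the callables below are placeholders for its unspecified components), the control flow can be sketched as:

```python
from typing import Any, Callable, Optional, Tuple

# Minimal sketch of the BUG consultation loop described above. Only the control
# flow comes from the text; the callables stand in for components the paper
# does not expose (base generation, multimodal editing, user interaction).
def bug_workflow(
    initial_description: str,
    generate_base: Callable[[str], Any],                    # text -> base garment image
    edit_design: Callable[[Any, str, Any], Any],            # (image, edit text, reference image) -> edited image
    get_edit_request: Callable[[Any], Optional[Tuple[str, Any]]],  # returns None once the user accepts
) -> Any:
    design = generate_base(initial_description)               # initialization phase
    while (request := get_edit_request(design)) is not None:  # iterative refinement loop
        edit_text, reference_image = request                  # text-as-prompt + image-into-prompt
        design = edit_design(design, edit_text, reference_image)
    return design
```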
2.2 Image-into-Prompt Mechanism
This is the core innovation. Instead of relying solely on textual descriptions of visual concepts, the system ingests a reference image. The LMM's vision encoder extracts visual features from this reference, which are then fused with the encoded text prompt. This fusion creates a richer, less ambiguous conditioning signal for the image generation/editing model, directly addressing the "text uncertainty" problem highlighted in the introduction.
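One plausible realization of this fusion, sketched here purely for illustration (the paper does not describe the exact module), is a cross-attention block in which encoded text tokens attend over the reference image's patch features:

```python
import torch
import torch.nn as nn

class ImageIntoPromptFusion(nn.Module):
    """Illustrative fusion module, not the paper's architecture: text tokens
    attend over reference-image patch features via cross-attention, yielding
    an enriched conditioning sequence for the downstream editor."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, ref_patches: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, L_text, dim) from the text encoder phi_text
        # ref_patches: (B, L_img, dim) from the vision encoder phi_vision
        attended, _ = self.cross_attn(query=text_tokens, key=ref_patches, value=ref_patches)
        return self.norm(text_tokens + attended)  # residual fusion -> conditioning signal c
```

The residual connection keeps the original text semantics while injecting the visual evidence that disambiguates them.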
2.3 LMM Architecture
The proposed system uses a dual-LMM setup, labeled eLMM and mLMM in Figure 2. The eLMM (Editor LMM) is responsible for understanding the multimodal edit request and planning the modification. The mLMM (Modifier LMM) executes the actual image editing, likely built on a diffusion-based architecture such as Stable Diffusion 3 and conditioned on the fused text-image representation. This separation allows for specialized reasoning and specialized execution.
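One way to read this separation, as a rough sketch only (the paper does not expose interfaces for either model, so the class and method names below are assumptions): the eLMM emits a structured edit plan, which the mLMM consumes together with the base image.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class EditPlan:
    """Hypothetical structured output of the eLMM; field names are illustrative."""
    target_region: str       # e.g. "lapel"
    action: str              # e.g. "change style"
    edit_text: str           # the user's textual instruction
    reference_features: Any  # visual features extracted from the reference image

def run_edit(base_image: Any, edit_text: str, reference_image: Any, elmm: Any, mlmm: Any) -> Any:
    """The eLMM plans the modification; the mLMM (e.g. a diffusion editor) executes it."""
    plan: EditPlan = elmm.plan(base_image, edit_text, reference_image)  # planning (assumed method name)
    return mlmm.apply(base_image, plan)                                 # execution (assumed method name)
```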
3. FashionEdit Dataset
3.1 Dataset Construction
To validate the BUG workflow, the authors introduce the FashionEdit dataset. This dataset is designed to simulate real-world clothing design workflows. It contains triplets: (1) a base garment image, (2) a textual edit instruction (e.g., "change to peaked lapel style"), and (3) a reference style image depicting the target attribute. The dataset covers fine-grained edits like collar style changes (peaked lapel), fastening modifications (4-button double-breasted), and accessory additions (adding a boutonniere).
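A minimal data model for these triplets, assuming a JSON-lines manifest purely for illustration (the dataset's actual release format is not described in the excerpt):

```python
from dataclasses import dataclass
from pathlib import Path
import json

@dataclass
class FashionEditSample:
    """One FashionEdit triplet as described in Sec. 3.1 (field names assumed)."""
    base_image: Path        # (1) base garment image
    instruction: str        # (2) textual edit instruction, e.g. "change to peaked lapel style"
    reference_image: Path   # (3) reference image depicting the target attribute

def load_samples(manifest_path: str) -> list:
    """Hypothetical loader for a JSON-lines manifest; the released format may differ."""
    samples = []
    with open(manifest_path) as f:
        for line in f:
            record = json.loads(line)
            samples.append(FashionEditSample(
                base_image=Path(record["base_image"]),
                instruction=record["instruction"],
                reference_image=Path(record["reference_image"]),
            ))
    return samples
```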
3.2 Evaluation Metrics
The proposed evaluation is threefold:
- Generation Similarity: Measures how closely the edited output matches the intended attribute from the reference image, using metrics such as LPIPS (Learned Perceptual Image Patch Similarity) and CLIP score; a computation sketch follows this list.
- User Satisfaction: Assessed via human evaluation or surveys to gauge the practical usefulness and alignment with user intent.
- Quality: Evaluates the overall visual fidelity and coherence of the generated image, free of artifacts.
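A hedged sketch of the generation-similarity computation, using the `lpips` package and a Hugging Face CLIP model; the paper's exact protocol (which image pairs are compared, which CLIP variant) is not specified, so this is one plausible reading:

```python
import torch
import lpips                                    # pip install lpips
from PIL import Image
from torchvision import transforms
from transformers import CLIPModel, CLIPProcessor

# Illustrative metric computation; the paper's exact evaluation code is not given.
lpips_fn = lpips.LPIPS(net="alex")              # LPIPS is a distance: lower = more similar
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
to_tensor = transforms.Compose([transforms.Resize((256, 256)), transforms.ToTensor()])

@torch.no_grad()
def generation_similarity(edited: Image.Image, reference: Image.Image) -> dict:
    """LPIPS distance and CLIP image-image cosine similarity between the edited
    output and the reference style image (one plausible reading of the metric)."""
    a = to_tensor(edited).unsqueeze(0) * 2 - 1   # lpips expects inputs in [-1, 1]
    b = to_tensor(reference).unsqueeze(0) * 2 - 1
    lpips_score = lpips_fn(a, b).item()

    inputs = clip_processor(images=[edited, reference], return_tensors="pt")
    feats = clip_model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    clip_score = (feats[0] @ feats[1]).item()    # cosine similarity: higher = more similar
    return {"lpips": lpips_score, "clip_similarity": clip_score}
```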
4. Experiments & Results
4.1 Experimental Setup
The BUG framework is benchmarked against baseline text-only editing methods (using models like Stable Diffusion 3 and DALL-E 2 with inpainting) on the FashionEdit dataset. The experiments test the system's ability to perform precise, attribute-specific edits guided by reference images.
4.2 Quantitative Results
The paper reports superior performance of the BUG workflow over text-only baselines across all three evaluation metrics. Key findings include:
- Better Similarity Scores: The edited images show greater perceptual similarity to the target attributes specified by the reference image (lower LPIPS distance, higher CLIP similarity).
- Increased User Satisfaction Rates: In human evaluations, outputs from the image-into-prompt method are consistently rated as more accurately fulfilling the edit request.
- Maintained Image Quality: The BUG workflow preserves the overall quality and coherence of the base garment while making the targeted edit.
4.3 Qualitative Analysis & Case Study
Figures 1 and 2 provide compelling qualitative evidence. Figure 1 illustrates the real-world scenario: a user provides an image of a person in a white blazer and a reference picture of a specific collar, asking for a modification; the text-only description "white blazer" is insufficient. Figure 2 visually contrasts the iterative BUG process (using both text and image prompts) with a text-only editing pipeline, showing how the former leads to correct designs while the latter often produces wrong or ambiguous results on fine-grained tasks such as adding a boutonniere or changing to a 4-button double-breasted style.
5. Technical Analysis & Framework
5.1 Mathematical Formulation
The core generation process can be framed as a conditional diffusion process. Let $I_0$ be the initial base image. An edit request is a pair $(T_{edit}, I_{ref})$, where $T_{edit}$ is the textual instruction and $I_{ref}$ is the reference image. The LMM encodes this into a combined conditioning vector $c = \mathcal{F}(\phi_{text}(T_{edit}), \phi_{vision}(I_{ref}))$, where $\mathcal{F}$ is a fusion network (e.g., cross-attention). The edited image $I_{edit}$ is then sampled from the reverse diffusion process conditioned on $c$: $$p_\theta(I_{edit} | I_0, c) = \prod_{t=1}^{T} p_\theta(I_{t-1} | I_t, c)$$ where $\theta$ are the parameters of the mLMM. The key differentiator from standard text-to-image diffusion is the enriched conditioning $c$ derived from multimodal fusion.
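Assuming a standard DDPM-style parameterization for the mLMM (the paper does not state one), each reverse step can be written as
$$p_\theta(I_{t-1} \mid I_t, c) = \mathcal{N}\!\big(I_{t-1};\ \mu_\theta(I_t, t, c),\ \Sigma_\theta(I_t, t)\big), \qquad \mu_\theta(I_t, t, c) = \frac{1}{\sqrt{\alpha_t}}\left(I_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(I_t, t, c)\right)$$
where $\epsilon_\theta$ is the noise predictor; the only departure from text-only editing is that $c$ carries fused visual evidence from $I_{ref}$ rather than text alone.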
5.2 Analysis Framework Example
Case: Editing a Blazer Lapel
- Input: Base image $I_0$: a woman in a notch-lapel blazer. Edit request: $T_{edit}$ = "change to peaked lapel style", $I_{ref}$ = an image of a peaked lapel.
- LMM Processing: The eLMM parses $T_{edit}$ to identify the target region ("lapel") and the action ("change style"). The vision encoder extracts features from $I_{ref}$ defining "peaked lapel" visually.
- Conditioning Fusion: Features for "lapel" from $I_0$, the textual concept "peaked", and the visual template from $I_{ref}$ are aligned and fused into a unified spatial-aware conditioning map for the mLMM.
- Execution: The mLMM (a diffusion model) performs inpainting/editing on the lapel region of $I_0$, guided by the fused conditioning, transforming the notch lapel into a peaked one while preserving the rest of the blazer and the model's pose (see the inpainting sketch after this list).
- Output: $I_{edit}$: the same base image, but with an accurately modified peaked lapel.
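To make the execution step concrete, the sketch below substitutes an off-the-shelf Stable Diffusion inpainting pipeline for the mLMM. This is not the paper's model: it folds the reference image into the text prompt, because the public API cannot ingest fused visual conditioning, and the mask file stands in for the region the eLMM would localize.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Stand-in for the mLMM execution step using an off-the-shelf inpainting model;
# the paper's actual mLMM and its conditioning interface are not specified.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

base = Image.open("blazer.png").convert("RGB")          # I_0: notch-lapel blazer (hypothetical file)
lapel_mask = Image.open("lapel_mask.png").convert("L")  # region the eLMM localized ("lapel")

# Here the reference image is approximated by a text prompt; the paper instead
# fuses phi_vision(I_ref) with the text embedding, which this API cannot express.
edited = pipe(
    prompt="blazer with a peaked lapel, tailored, studio photo",
    image=base,
    mask_image=lapel_mask,
).images[0]
edited.save("blazer_peaked_lapel.png")
```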
6. Future Applications & Directions
The BUG workflow has implications beyond fashion:
- Interior & Product Design: Users could show a reference image of a furniture leg or fabric texture to modify a 3D model or room rendering.
- Game Asset Creation: Rapid prototyping of character armor, weapons, or environments by combining base models with style references.
- Architectural Visualization: Modifying building facades or interior finishes based on example images.
- Future Research: Extending to video editing (changing an actor's costume across frames), 3D shape editing, and improving the compositionality of edits (handling multiple, potentially conflicting reference images). A major direction is enhancing the LMM's reasoning about spatial relationships and physics to ensure edits are not just visually correct but also plausible (e.g., a boutonniere is attached correctly to the lapel).
7. References
- Esser, P., et al. (2024). Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (Stable Diffusion 3). Stability AI. arXiv:2403.03206.
- Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- OpenAI. (2022). DALL-E 2. https://openai.com/dall-e-2
- Isola, P., et al. (2017). Image-to-Image Translation with Conditional Adversarial Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Zhu, J.-Y., et al. (2017). Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks (CycleGAN). Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- Liu, V., & Chilton, L. B. (2022). Design Guidelines for Prompt Engineering Text-to-Image Generative Models. CHI Conference on Human Factors in Computing Systems.
- Brooks, T., et al. (2023). InstructPix2Pix: Learning to Follow Image Editing Instructions. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Li, H., et al. (2025). Fine-Grained Customized Fashion Design with Image-into-Prompt Benchmark and Dataset from LMM. arXiv:2509.09324.
8. Original Analysis & Expert Commentary
Core Insight: This paper isn't just another incremental improvement in image editing; it's a strategic pivot towards multimodal intent disambiguation. The authors correctly identify that the next frontier for generative AI in creative domains isn't raw power, but precision communication. The real bottleneck isn't the model's ability to generate a "blazer," but its ability to understand which specific blazer the user has in mind. By formalizing the "image-as-reference" paradigm into an "image-into-prompt" benchmark (BUG), they are tackling the fundamental ambiguity problem that plagues human-AI co-creation. This moves beyond the well-trodden path of models like CycleGAN (which learn unpaired style transfer) or InstructPix2Pix (which relies solely on text) by explicitly requiring the AI to cross-reference visual exemplars, a cognitive step closer to how human designers work.
Logical Flow: The argument is compelling and well-structured. It starts with a clear industry pain point (the gap between amateur text prompts and professional design output), proposes a cognitively plausible solution (mimicking the designer's use of reference images), and then backs it up with a concrete technical workflow (BUG) and a bespoke evaluation dataset (FashionEdit). The use of a dual-LMM architecture (eLMM/mLMM) logically separates high-level planning from low-level execution, a design pattern gaining traction in agent-based AI systems, as seen in research from institutions like Google DeepMind on tool-use and planning.
Strengths & Flaws: The major strength is the problem framing and benchmark creation. The FashionEdit dataset, if made publicly available, could become a standard for evaluating fine-grained editing, much like MS-COCO for object detection. The integration of user satisfaction as a metric is also praiseworthy, acknowledging that technical scores alone are insufficient. However, the paper, as presented in the excerpt, has notable gaps. The technical details of the LMM fusion mechanism are scant. How exactly are visual features from $I_{ref}$ aligned with the spatial region in $I_0$? Is it through cross-attention, a dedicated spatial alignment module, or something else? Furthermore, the evaluation, while promising, needs more rigorous ablation studies. How much of the improvement comes from the reference image versus simply having a better-tuned base model? Comparisons to strong baselines like InstructPix2Pix or DragGAN-style point-based editing would provide stronger evidence.
Actionable Insights: For industry practitioners, this research signals a clear directive: invest in multimodal interaction layers for your generative AI products. A simple text box is no longer enough. The UI must allow users to drag, drop, or circle reference images. For researchers, the BUG benchmark opens several avenues: 1) Robustness testing—how does the model perform with low-quality or semantically distant reference images? 2) Compositionality—can it handle "make the collar from image A and the sleeves from image B"? 3) Generalization—can the principles be applied to non-fashion domains like graphic design or industrial CAD? The ultimate test will be whether this approach can move from controlled datasets to the messy, open-ended creativity of real users, a challenge that often separates academic prototypes from commercial breakthroughs, as history with earlier GAN-based creative tools has shown.