
IMAGGarment: Fine-Grained Garment Generation for Controllable Fashion Design

Analysis of IMAGGarment, a two-stage framework for high-fidelity garment synthesis with precise control over silhouette, color, and logo placement, including its technical contributions and the GarmentBench dataset.
diyshow.org | PDF Size: 1.7 MB

1. Introduction & Overview

Fine-Grained Garment Generation (FGG) represents a critical frontier in AI-driven fashion technology, aiming to synthesize high-quality digital garments with precise, multi-conditional control. The paper "IMAGGarment: Fine-Grained Garment Generation for Controllable Fashion Design" introduces a novel framework designed to overcome the limitations of existing single-condition generation methods. Traditional workflows in fashion design are manual, time-consuming, and prone to inconsistencies, especially when scaling for seasonal collections or multiple product views. IMAGGarment addresses this by enabling unified control over global attributes (silhouette, color) and local details (logo placement, content) through an innovative two-stage architecture, supported by a newly released large-scale dataset, GarmentBench.

2. Methodology & Technical Framework

IMAGGarment employs a two-stage training strategy that decouples the modeling of global appearance and local details, enabling end-to-end inference for controllable generation.

2.1. Global Appearance Modeling

The first stage focuses on capturing the overall garment structure and color scheme. It utilizes a Mixed Attention Module to jointly encode silhouette information (from sketches) and color references. A dedicated Color Adapter ensures high-fidelity color transfer and consistency across the generated garment, preventing the common issue of color bleeding or washout seen in simpler conditional GANs.

2.2. Local Enhancement Modeling

The second stage refines the output by injecting user-defined logos and adhering to spatial constraints. An Adaptive Appearance-Aware Module is key here. It uses the global features from the first stage as context to guide the precise placement, scaling, and visual integration of logos, ensuring they blend realistically with the garment's texture, folds, and lighting.

2.3. Two-Stage Training Strategy

This decoupled approach is the framework's core innovation. By training the global and local models separately, IMAGGarment avoids the "condition entanglement" problem where one control signal (e.g., a strong logo constraint) might degrade the quality of another (e.g., the overall silhouette). During inference, the stages work sequentially to produce a final, coherent image that satisfies all input conditions.
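The sequential data flow described above can be sketched in a few lines of Python. This is a deliberately toy illustration, not the paper's architecture: `global_stage` and `local_stage` are hypothetical stand-ins for the actual generative modules, reduced here to masking and compositing so that the hand-off between stages is visible.

```python
import numpy as np

def global_stage(sketch, color_ref):
    """Stage 1 (toy sketch): fuse silhouette and color into a base
    garment map by broadcasting the reference color over the sketch's
    foreground mask. The real module uses mixed attention."""
    mask = (sketch > 0).astype(float)        # silhouette foreground
    return mask[..., None] * color_ref       # (H, W, 3) base appearance

def local_stage(base, logo, position):
    """Stage 2 (toy sketch): composite a logo patch onto the global
    result at a user-specified (row, col) offset. The real module
    blends the logo with folds and lighting instead of pasting."""
    out = base.copy()
    r, c = position
    h, w, _ = logo.shape
    out[r:r + h, c:c + w] = logo
    return out

# Sequential inference: stage 2 consumes stage 1's output.
sketch = np.zeros((8, 8)); sketch[2:6, 2:6] = 1.0
color_ref = np.array([0.8, 0.1, 0.1])        # red color swatch
logo = np.ones((2, 2, 3)) * 0.5
result = local_stage(global_stage(sketch, color_ref), logo, (3, 3))
```

The point of the toy is the interface, not the internals: because stage 2 only ever sees stage 1's output, the logo constraint cannot interfere with silhouette or color modeling during training.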

3. The GarmentBench Dataset

To train and evaluate IMAGGarment, the authors introduce GarmentBench, a large-scale, multi-modal dataset. It contains over 180,000 garment samples, each annotated with four paired condition types: a silhouette sketch, a color reference, a logo image, and a text description.

This comprehensive dataset is a significant contribution, providing a benchmark for future research in multi-conditional fashion generation.

GarmentBench at a Glance

180,000+ Garment Samples

4 Paired Condition Types (Sketch, Color, Logo, Text)

Publicly available for research

4. Experimental Results & Evaluation

IMAGGarment was rigorously evaluated against several state-of-the-art baselines in conditional image generation.

4.1. Quantitative Metrics

The model was assessed using standard metrics like Fréchet Inception Distance (FID) for overall image quality, Structural Similarity Index (SSIM) for fidelity to the input sketch, and Color Consistency Error for adherence to the color reference. IMAGGarment consistently achieved lower FID scores and higher SSIM values than competitors like Pix2PixHD and SPADE, demonstrating superior performance in both realism and condition adherence.
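To make the color-adherence metric concrete, here is a minimal NumPy sketch of one plausible formulation of a color consistency error; the paper's exact definition may differ. It measures the mean Euclidean distance in RGB between the garment's masked pixels and the reference swatch.

```python
import numpy as np

def color_consistency_error(generated, reference_rgb, mask):
    """Assumed form of a color-consistency metric: mean Euclidean
    distance in RGB between masked garment pixels and the reference
    color. Lower is better; 0.0 means a perfect match."""
    pixels = generated[mask]                  # (N, 3) garment pixels
    return float(np.linalg.norm(pixels - reference_rgb, axis=1).mean())

img = np.zeros((4, 4, 3)); img[1:3, 1:3] = [0.8, 0.1, 0.1]
mask = np.zeros((4, 4), bool); mask[1:3, 1:3] = True
err = color_consistency_error(img, np.array([0.8, 0.1, 0.1]), mask)
# err == 0.0 for a perfect color match
```

FID and SSIM, by contrast, are standard off-the-shelf metrics; only the color term is task-specific.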

4.2. Qualitative Analysis

Visual comparisons show IMAGGarment's clear advantages:

Figure 1 (conceptual description): A side-by-side comparison shows baseline methods producing blurred logos or incorrect colors, while IMAGGarment generates a crisp T-shirt with a correctly positioned, perspectively accurate logo and perfect color match.

4.3. Ablation Studies

Ablation studies confirmed the necessity of each component. Removing the Color Adapter led to significant color drift. Disabling the Adaptive Appearance-Aware Module resulted in logos that looked "pasted on" and ignored garment geometry. The two-stage strategy itself was proven crucial; a single-stage model trained on all conditions simultaneously showed degraded performance across all metrics due to condition interference.

5. Technical Details & Mathematical Formulation

The core of the Mixed Attention Module can be conceptualized as learning a joint representation. Given a sketch feature map $F_s$ and a color feature map $F_c$, the module computes an attention map $A$ that governs their fusion:

$A = \text{softmax}(\frac{Q_s K_c^T}{\sqrt{d_k}})$

$F_{fusion} = A \cdot V_c + F_s$

where $Q_s$, $K_c$, $V_c$ are query, key, and value projections derived from $F_s$ and $F_c$, and $d_k$ is the dimension of the key vectors. This allows the model to dynamically decide which color information to apply to which part of the sketch. The training objective combines adversarial loss $\mathcal{L}_{GAN}$, reconstruction loss $\mathcal{L}_{recon}$ (e.g., L1), and a dedicated perceptual loss $\mathcal{L}_{perc}$ for style and content:

$\mathcal{L}_{total} = \lambda_{GAN}\mathcal{L}_{GAN} + \lambda_{recon}\mathcal{L}_{recon} + \lambda_{perc}\mathcal{L}_{perc}$
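The attention-based fusion above can be sketched directly in NumPy. The projection matrices `W_q`, `W_k`, `W_v` are random placeholders here (in the model they are learned), and the spatial feature maps are assumed to be flattened into token matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mixed_attention(F_s, F_c, W_q, W_k, W_v):
    """Sketch features query color features; the attended color values
    are added residually to F_s, matching F_fusion = A·V_c + F_s."""
    Q_s, K_c, V_c = F_s @ W_q, F_c @ W_k, F_c @ W_v
    d_k = K_c.shape[-1]
    A = softmax(Q_s @ K_c.T / np.sqrt(d_k))   # (N_s, N_c) attention map
    return A @ V_c + F_s

rng = np.random.default_rng(0)
d = 8                                          # feature dimension
F_s = rng.standard_normal((16, d))             # 16 flattened sketch tokens
F_c = rng.standard_normal((4, d))              # 4 flattened color tokens
W = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
F_fusion = mixed_attention(F_s, F_c, *W)       # (16, d) fused features
```

Note the residual `+ F_s`: the sketch structure is always preserved, and attention only decides how much color information each spatial location absorbs.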

6. Analysis Framework: Core Insight & Critique

Core Insight: IMAGGarment isn't just another image-to-image model; it's a pragmatic engineering solution to a specific industrial pain point—the disentanglement of multi-faceted design control. While models like CycleGAN (Zhu et al., 2017) revolutionized unpaired translation, and StyleGAN (Karras et al., 2019) mastered unconditional fidelity, the fashion industry's need is for precision editing, not just generation. IMAGGarment's two-stage pipeline is a direct, effective answer to the "condition collision" problem that plagues end-to-end multi-modal models.

Logical Flow: The logic is impeccably industrial: 1) Define the shape and base color (the "manufacturing" stage). 2) Apply the branding and fine details (the "customization" stage). This mirrors the actual apparel production pipeline, making the technology intuitively adoptable by designers. The release of GarmentBench is a strategic masterstroke, as it immediately establishes a benchmark and ecosystem around their proposed task definition.

Strengths & Flaws: Its greatest strength is its focused utility and demonstrated superiority in its niche. The separate training stages are a clever hack to ensure stability. However, the flaw lies in its potential rigidity. The pipeline is sequential; an error in the global stage (e.g., a mis-modeled fold) is irrevocably passed to the local stage. It lacks the iterative, holistic refinement capability of more recent diffusion-based architectures (e.g., Stable Diffusion). Furthermore, its control, while multi-conditional, is still based on pre-defined inputs (sketch, color swatch). It doesn't yet tackle the more ambiguous but powerful control offered by natural language prompts at the same granularity.

Actionable Insights: For researchers, the immediate next step is to integrate this two-stage philosophy into a diffusion framework, using the first stage to establish a strong prior and the second for detail-aware, noise-guided refinement. For industry adopters, the priority should be integrating IMAGGarment into existing CAD software (like Browzwear or CLO) as a plugin, focusing on real-time preview generation from rough sketches. The model's current success is on relatively clean, front-view garments; the next challenge is extending it to complex 3D draping, diverse body shapes, and dynamic poses—a necessity for true virtual try-on applications, an area heavily invested in by companies like Google (Search Generative Experience) and Meta.

7. Application Outlook & Future Directions

The applications of IMAGGarment are vast and align with key trends in digital fashion, from rapid design prototyping and multi-view product visualization to virtual try-on.

Future Directions:

  1. 3D Garment Generation: Extending the framework to generate consistent, textured 3D garment models from 2D conditions, a critical step for AR/VR.
  2. Dynamic Material Synthesis: Incorporating control over fabric type (denim, silk, knit) and physical properties, moving beyond just color and logo.
  3. Interactive Refinement: Developing models that allow for iterative, human-in-the-loop feedback ("make the collar wider," "move the logo left") beyond the initial conditions.
  4. Integration with Large Language/Vision Models: Using LLMs (like GPT-4) or LVMs to interpret high-level, textual design briefs and convert them into the precise condition maps (sketches, color palettes) that IMAGGarment requires.

8. References

  1. Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision (pp. 2223-2232).
  2. Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4401-4410).
  3. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10684-10695).
  4. Wang, T. C., Liu, M. Y., Zhu, J. Y., Tao, A., Kautz, J., & Catanzaro, B. (2018). High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8798-8807). (Pix2PixHD)
  5. Park, T., Liu, M. Y., Wang, T. C., & Zhu, J. Y. (2019). Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2337-2346). (SPADE)
  6. Shen, F., Yu, J., Wang, C., Jiang, X., Du, X., & Tang, J. (2021). IMAGGarment: Fine-Grained Garment Generation for Controllable Fashion Design. Journal of LaTeX Class Files, Vol. 14, No. 8.