ConsiStyle: Style Diversity in Training-Free Consistent T2I Generation

Tel Aviv University

Abstract

In text-to-image models, consistent character generation is the task of achieving text alignment while maintaining the subject's appearance across different prompts. However, since style and appearance are often entangled, existing methods struggle to preserve consistent subject characteristics while adhering to varying style prompts. Current approaches for consistent text-to-image generation typically rely on large-scale fine-tuning on curated image sets or on per-subject optimization, which either fail to generalize across prompts or do not align well with textual descriptions. Meanwhile, training-free methods often fail to maintain subject consistency across different styles. In this work, we introduce a training-free method that achieves both style alignment and subject consistency. The attention matrices are manipulated such that Queries and Keys are obtained from the anchor image(s) used to define the subject, while the Values are imported from a parallel copy that is not subject-anchored. Additionally, cross-image components are added to the self-attention mechanism by expanding the Key and Value matrices. To do so without drifting from the target style, we align the statistics of the Value matrices. As demonstrated by a comprehensive battery of qualitative and quantitative experiments, our method effectively decouples style from subject appearance and enables faithful generation of text-aligned images with consistent characters across diverse styles.

How does it work?

  • 📝 Given a list of prompts and concept tokens, ConsiStyle aims to generate diverse images that remain consistent in subject identity as defined by the concept tokens.
  • 🎨 It first runs a vanilla SDXL pass to extract style-specific Value matrices, capturing the visual attributes (color, texture, etc.) tied to each prompt’s style.
  • 🧠 During generation, attention crossing is enabled so that each image can access contextual signals from the others, improving subject alignment across the batch.
  • 🧼 To prevent style contamination during attention crossing, Adaptive Instance Normalization (AdaIN) is applied to align the statistics of the imported Values (see the attention sketch after this list).
  • 🔍 To improve the consistency of fine details, the method uses DIFT to compute semantic correspondences and injects aligned Query and Key components (see the correspondence sketch after this list).
  • 💾 It also re-injects the Values captured during the vanilla SDXL run, preserving each prompt's original style across all outputs (see the cache sketch after this list).
  • 🖼️ The final result is a set of images that are prompt-aligned, style-consistent, and visually coherent in depicting the same subject, without any subject-specific training.
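
The attention-crossing and AdaIN steps above can be illustrated with a minimal PyTorch sketch. The single-head formulation, the toy tensor shapes, and applying AdaIN toward the current image's own Values are assumptions made for illustration, not the released implementation: each image's self-attention also attends to the Keys and Values of the other images in the batch, and the imported Values are renormalized so that the other prompts' color and texture statistics do not leak into the current style.

import torch

def adain(v, v_ref, eps=1e-5):
    """Match the per-channel mean/std of Value matrix v (tokens x dim)
    to those of the reference Value matrix v_ref."""
    mu, std = v.mean(0, keepdim=True), v.std(0, keepdim=True) + eps
    mu_r, std_r = v_ref.mean(0, keepdim=True), v_ref.std(0, keepdim=True) + eps
    return (v - mu) / std * std_r + mu_r

def cross_image_attention(q, k, v):
    """Single-head self-attention over a batch of images generated in parallel.
    For image i, the Keys and Values of every other image are appended to its
    own (attention crossing); imported Values are AdaIN-aligned to image i's
    Values so their statistics do not contaminate image i's style.
    q, k, v: (batch, tokens, dim)."""
    batch, _, dim = q.shape
    outputs = []
    for i in range(batch):
        keys = [k[i]] + [k[j] for j in range(batch) if j != i]
        vals = [v[i]] + [adain(v[j], v[i]) for j in range(batch) if j != i]
        k_ext, v_ext = torch.cat(keys, dim=0), torch.cat(vals, dim=0)
        attn = torch.softmax(q[i] @ k_ext.T / dim ** 0.5, dim=-1)
        outputs.append(attn @ v_ext)
    return torch.stack(outputs)

# Toy example: a batch of 3 images, 16 tokens each, 64-dim features.
q = torch.randn(3, 16, 64)
k = torch.randn(3, 16, 64)
v = torch.randn(3, 16, 64)
print(cross_image_attention(q, k, v).shape)  # torch.Size([3, 16, 64])

In SDXL this logic would run inside each self-attention layer during denoising; the sketch only shows the tensor manipulation for one layer.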
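
The fine-detail step can be sketched as follows: diffusion features of the anchor and target images are matched by cosine similarity, and the anchor's Query/Key rows are blended into the target at the matched token positions. The argmax matching and the blending weight alpha are illustrative assumptions, and random tensors stand in for actual DIFT features in the toy example.

import torch
import torch.nn.functional as F

def inject_corresponding_qk(feat_anchor, feat_target, q_anchor, k_anchor,
                            q_target, k_target, alpha=0.5):
    """Blend the anchor's Query/Key rows into the target image at semantically
    corresponding tokens.

    feat_*: (tokens, feat_dim) diffusion features (e.g. DIFT) per spatial token.
    q_*, k_*: (tokens, dim) attention matrices of the same layer.
    alpha: hypothetical blending weight between target and anchor components.
    """
    # Cosine-similarity matching: for each target token, its best anchor token.
    sim = F.normalize(feat_target, dim=-1) @ F.normalize(feat_anchor, dim=-1).T
    match = sim.argmax(dim=-1)  # (tokens,)
    q_new = (1 - alpha) * q_target + alpha * q_anchor[match]
    k_new = (1 - alpha) * k_target + alpha * k_anchor[match]
    return q_new, k_new

# Toy example with random features standing in for DIFT activations.
feat_a, feat_t = torch.randn(16, 128), torch.randn(16, 128)
q_a, k_a = torch.randn(16, 64), torch.randn(16, 64)
q_t, k_t = torch.randn(16, 64), torch.randn(16, 64)
q_new, k_new = inject_corresponding_qk(feat_a, feat_t, q_a, k_a, q_t, k_t)
print(q_new.shape, k_new.shape)  # torch.Size([16, 64]) torch.Size([16, 64])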
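
Finally, the two-pass Value handling (capture during the vanilla SDXL run, re-injection during the consistent run) can be organized as a small cache keyed by layer name and denoising timestep. The ValueCache class and its keying are hypothetical; the page does not specify how the captured Values are stored.

from collections import defaultdict
import torch

class ValueCache:
    """Record Value matrices during the vanilla (style-only) SDXL pass and
    replay them during the consistent pass, keyed by (layer_name, timestep)."""

    def __init__(self):
        self.store = defaultdict(dict)
        self.recording = True  # True during the vanilla pass

    def __call__(self, layer_name, timestep, value):
        if self.recording:
            # Vanilla pass: remember the style-bearing Values of this layer/step.
            self.store[layer_name][timestep] = value.detach()
            return value
        # Consistent pass: re-inject the captured Values so each prompt keeps
        # its original style despite the cross-image attention.
        return self.store[layer_name].get(timestep, value)

# Usage sketch: called once per self-attention layer at every denoising step.
cache = ValueCache()
v = torch.randn(16, 64)
cache("up_blocks.0.attn1", timestep=981, value=v)  # vanilla pass
cache.recording = False
replayed = cache("up_blocks.0.attn1", timestep=981, value=torch.randn(16, 64))
assert torch.equal(replayed, v)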

Qualitative Comparison

Previous methods often fail to respect the style or the prompt, or they produce subjects that feel out of place due to color or texture leakage. For example, in the film noir dragon row, DB-LoRA and IP-Adapter ignore the black-and-white style entirely, and ConsiStory produces a blue dragon. ConsiStyle preserves the style while maintaining subject consistency.

Additional Examples

BibTeX

@INPROCEEDINGS{mazuz2025consistyle,
  author    = {Yohai Mazuz and Janna Bruner and Lior Wolf},
  title     = {ConsiStyle: Style Diversity in Training-Free Consistent T2I Generation},
  year      = {2025}
}