FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing

S-Lab, Nanyang Technological University    Microsoft GenAI
Teaser figure

Abstract

The integration of Rotary Position Embedding (RoPE) in Multimodal Diffusion Transformer (MMDiT) has significantly enhanced text-to-image generation quality. However, whether self-attention layers fundamentally rely on positional embeddings or on query-key content similarity during generation remains an open question.

We present the first mechanistic analysis of RoPE-based MMDiT models (e.g., FLUX), introducing an automated probing strategy that disentangles positional dependencies from content dependencies by strategically manipulating RoPE during generation. Our analysis reveals distinct dependency patterns that do not straightforwardly correlate with depth, offering new insights into the layer-specific roles in RoPE-based MMDiT. Based on these findings, we propose a training-free, task-specific image editing framework that categorizes editing tasks into three types: position-dependent editing (e.g., object addition), content similarity-dependent editing (e.g., non-rigid editing), and region-preserved editing (e.g., background replacement). For each type, we design tailored key-value injection strategies based on the characteristics of the editing task.

Extensive qualitative and quantitative evaluations demonstrate that our method outperforms state-of-the-art approaches, particularly in preserving original semantic content and achieving seamless modifications.


Method

Thanks to the explicit injection of positional information into queries and keys at each layer via RoPE, FLUX demonstrates significantly superior performance over SD3 in both generation quality and high-resolution synthesis, making it a focal point in the text-to-image domain. This also raises an intriguing question: During generation, does the RoPE-based MMDiT rely on positional embedding to retrieve information, or does it depend on the content similarity between query and key?
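As a minimal illustration of this mechanism, the sketch below shows how RoPE rotates pairs of query/key channels by an angle proportional to token position, so that attention scores become sensitive to relative positions. This is a simplified 1D sketch for intuition only, not FLUX's exact axis-wise 2D RoPE; the helper name rope_rotate is ours.

```python
# Minimal 1D sketch of rotary position embedding (RoPE); illustrative only,
# not FLUX's exact axis-wise 2D implementation.
import torch

def rope_rotate(x: torch.Tensor, pos: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    """Rotate pairs of channels of x by an angle proportional to token position.

    x:   (batch, tokens, dim) with dim even
    pos: (tokens,) integer token positions
    """
    dim = x.shape[-1]
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = pos[:, None].float() * freqs[None, :]   # (tokens, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]              # split channels into pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin             # 2D rotation per channel pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Inside a self-attention layer, RoPE is applied to both queries and keys
# before the dot-product, so attention scores depend on relative positions:
#   q = rope_rotate(q, positions); k = rope_rotate(k, positions)
```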

To answer this question, we design an automated probing strategy that measures how strongly each self-attention layer depends on positional information during generation. Building on the resulting observations, we then develop customized editing strategies tailored to the specific characteristics of different editing tasks.

(1) Probing Layer-wise Positional Dependency: During sampling, we manipulate each self-attention layer in FLUX by preserving the RoPE for queries while either removing or shifting the RoPE for keys, and generate an image under each manipulation. By measuring the similarity between the original sampled result and the modified outputs, we can infer the functional role of each layer: lower similarity indicates a stronger reliance on positional relationships, whereas higher similarity suggests a greater dependence on the content similarity between queries and keys.
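The sketch below illustrates this key-side manipulation at the attention level, reusing the rope_rotate helper from the sketch above. The function probed_attention, the mode argument, and the shift offset are illustrative names for the idea, not our exact probing code.

```python
# Sketch of the key-side RoPE manipulation used for probing; assumes the
# rope_rotate helper defined in the previous sketch.
import torch
import torch.nn.functional as F

def probed_attention(q, k, v, pos, mode: str = "keep", shift: int = 0):
    """q, k, v: (batch, tokens, dim); pos: (tokens,) token positions."""
    q = rope_rotate(q, pos)                  # queries always keep their RoPE
    if mode == "keep":
        k = rope_rotate(k, pos)              # unmodified baseline
    elif mode == "shift":
        k = rope_rotate(k, pos + shift)      # displace key positions
    elif mode == "remove":
        pass                                 # keys carry no positional signal
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

# During sampling, one self-attention layer at a time uses mode="remove" or
# mode="shift"; comparing the resulting image to the unmodified sample (e.g.
# with a perceptual or CLIP similarity) scores that layer's positional dependency.
```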

Method overview figure

(2) Customized Strategies for Versatile Editing: Versatile image editing tasks share a classic mechanism: the source image and the edited image are generated in parallel, and editing is achieved by injecting keys and/or values from the source branch into the edited branch during generation. Based on the nature of different editing tasks, we categorize versatile image editing into three types and design corresponding injection strategies for them: (1) Position-Dependent Editing, (2) Content Similarity-Dependent Editing, and (3) Region-Preserved Editing. A sketch of the shared injection mechanism is given below.
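The sketch shows the generic key-value injection step; kv_injected_attention, inject_layers, and inject_mask are illustrative names, and the per-task choice of layers and regions is precisely where our three strategies differ.

```python
# Minimal sketch of key-value injection from the source branch into the
# edited branch at selected self-attention layers; names are illustrative.
import torch
import torch.nn.functional as F

def kv_injected_attention(q_edit, k_edit, v_edit, k_src, v_src,
                          layer_idx: int, inject_layers: set,
                          inject_mask: torch.Tensor | None = None):
    """All feature tensors: (batch, tokens, dim); inject_mask: (tokens,) boolean."""
    if layer_idx in inject_layers:
        if inject_mask is None:                   # inject source keys/values everywhere
            k_edit, v_edit = k_src, v_src
        else:                                     # inject only inside the preserved region
            m = inject_mask[None, :, None]
            k_edit = torch.where(m, k_src, k_edit)
            v_edit = torch.where(m, v_src, v_edit)
    scores = q_edit @ k_edit.transpose(-1, -2) / q_edit.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v_edit
```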


Object Addition Results

Non-Rigid Editing Results

Background Replacement Results

Outpainting Results

Object Movement Results


BibTeX

@article{wei2025freeflux,
    title     = {FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing},
    author    = {Wei, Tianyi and Zhou, Yifan and Chen, Dongdong and Pan, Xingang},
    journal   = {arXiv preprint arXiv:2503.16153},
    year      = {2025},
}