Deeply Supervised Flow-Based Models

ByteDance Seed

Uncurated samples generated by DeepFlow.

Abstract

Flow-based generative models have charted an impressive path across multiple visual generation tasks by adhering to a simple principle: learning velocity representations of a linear interpolant. However, we observe that training velocity solely from the final layer’s output under-utilizes the rich inter-layer representations, potentially impeding model convergence. To address this limitation, we introduce DeepFlow, a novel framework that enhances velocity representation through inter-layer communication. DeepFlow partitions transformer layers into balanced branches with deep supervision and inserts a lightweight Velocity Refiner with Acceleration (VeRA) block between adjacent branches, which aligns the intermediate velocity features within transformer blocks. Powered by the improved deep supervision via the internal velocity alignment, DeepFlow converges 8x faster on ImageNet-256x256 with equivalent performance.
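The "simple principle" named in the abstract — learning the velocity of a linear interpolant — can be made concrete with a minimal numpy sketch. Function names here are illustrative, not from the DeepFlow codebase: the interpolant is x_t = (1 - t)·x0 + t·x1 between noise x0 and data x1, and its time derivative, the regression target, is the constant x1 - x0.

```python
import numpy as np

def linear_interpolant(x0, x1, t):
    # Linear path x_t = (1 - t) * x0 + t * x1 between noise x0 and data x1.
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    # Velocity of the linear interpolant, d x_t / d t = x1 - x0 (constant in t).
    return x1 - x0

def flow_matching_loss(pred_v, x0, x1):
    # Mean-squared error between the model's predicted velocity and the target.
    return float(np.mean((pred_v - target_velocity(x0, x1)) ** 2))
```

In a real model, `pred_v` would come from a transformer conditioned on x_t and t; the baseline supervises it only at the final layer, which is the under-utilization DeepFlow addresses.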

Advancing Deep Supervision

Importance of Internal Feature Alignment for Flow-Based Models. DeepFlow enhances the baseline flow-based model (a) by explicitly aligning intermediate velocity features with the final layer's features. As shown in (b), simply applying deep supervision reduces the feature distance between intermediate and final velocities, improving FID scores (light blue bars in (d, e)). To shrink this distance further, we introduce the VeRA block, which refines the deeply supervised intermediate features so that they align more closely with the final velocity features, yielding even better image generation quality (dark blue bars in (e)).
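Deep supervision as described above amounts to applying the velocity regression loss at every branch's head rather than only at the last layer. The sketch below, with illustrative names and an assumed per-branch weighting not specified on this page, shows that combination:

```python
import numpy as np

def deeply_supervised_loss(branch_velocities, x0, x1, weights=None):
    # branch_velocities: velocity predictions from each branch's head,
    # intermediate heads first, final layer last (names are illustrative).
    target = x1 - x0  # velocity of the linear interpolant
    if weights is None:
        weights = [1.0] * len(branch_velocities)  # assumed uniform weighting
    losses = [float(np.mean((v - target) ** 2)) for v in branch_velocities]
    return sum(w * l for w, l in zip(weights, losses))
```

Each intermediate head thus receives a direct gradient toward the same velocity target, which is what pulls its features closer to the final layer's.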

DeepFlow

DeepFlow Architecture. We introduce advanced deep supervision by partitioning the transformer blocks into equal-sized branches and employing multiple velocity layers (dark blue boxes), enabling each branch to predict velocity at a distinct time-step. A VeRA block is then inserted between adjacent branches for explicit feature refinement. It consists of three sub-blocks: (1) acceleration generation, (2) time-gap conditioning, and (3) cross-space attention.
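One plausible reading of the three sub-blocks, consistent with the caption but not a definitive account of the implementation: since adjacent branches predict velocity at distinct time-steps, a generated acceleration can refine the earlier branch's velocity across the time gap, roughly a first-order update v' = v + Δt·a. The sketch below assumes exactly that; the cross-space attention producing `acceleration` from the two branches' features is elided, and all names are hypothetical.

```python
import numpy as np

def refine_velocity(v_intermediate, acceleration, time_gap):
    # First-order refinement v' = v + Δt * a across the gap between the
    # time-steps of two adjacent branches (illustrative sketch only).
    return v_intermediate + time_gap * acceleration
```

The time-gap conditioning sub-block would supply Δt to the refinement, so that the same block works between any pair of adjacent branches.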

DeepFlow as Training Efficient Image Generator

Training Efficiency. (a) On the ImageNet-256 benchmark, DeepFlow consistently outperforms SiT in FID scores across various model sizes. (b) DeepFlow-XL achieves an 8x training efficiency improvement over SiT-XL.

Tradeoff between Efficiency and Performance of DeepFlow. All experiments, including the reproduced, stronger SiT baseline variants, use lognormal time sampling and 80 epochs of training on ImageNet-256×256, and are evaluated with 250 SDE sampling steps without CFG.

Comparison with Flow-Based Models. (Left) Quantitative Results using Base Model on ImageNet-256. (Right) Quantitative Results using XLarge Model on ImageNet-256. Both without CFG.

System-level Comparison. DeepFlow can achieve competitive image generation performance both in ImageNet-256 and ImageNet-512 benchmarks while maintaining high training efficiency.

DeepFlow generates higher-quality samples even with substantially less training.

Text-to-Image Generation

Performance of DeepFlow on a Text-to-Image Benchmark. Both the baseline and DeepFlow are trained on the MS-COCO dataset with uniform time sampling.

BibTeX

@article{shin2025deeply,
  author    = {Inkyu Shin and Chenglin Yang and Liang-Chieh Chen},
  title     = {Deeply Supervised Flow-Based Models},
  year      = {2025}
}