Worked FLOPs Calculations for Transformer / Swin-Transformer Application Scenarios

张开发
2026/4/4 17:10:43 · 15 min read


Readers interested in the application scenarios in the meteorology domain can explore them on their own; they are not covered in detail here. See DABench for reference.

## 4DVarformerV1

Each block consists of:

- Pre_CorssAttn_Norm → CrossAttention
- PreNorm → FeedForward
- PreNorm → Attention (standard self-attention)
- PreNorm → FeedForward

Layers 1 through 4 all follow the same pattern: CrossAtt → FFN → SelfAtt → FFN.

### FLOPs

$$FLOPs_{embed} = 2 \times (2 \times N \times in\_channels \times dim)$$

$$FLOPs_{CA/SA} = 8 \times N \times dim^2 + 4 \times heads \times N^2 \times dim_{head}$$

$$FLOPs_{ffn} = 2 \times N \times dim \times (dim \times mlp\_ratio)$$

$$FLOPs_{head} = 2 \times N \times dim \times (out\_channels \times patch\_size[0] \times patch\_size[1])$$

#### ① Global Self-Attention / Cross-Attention: O(n²d)

- QKV projections: $3 \times n \times dim^2 \approx 1.14 \times 10^8$
- Attention matrix: $n^2 \times dim \approx 5.50 \times 10^9$
- Output projection: $n \times dim^2 \approx 3.80 \times 10^7$
- One Self-Attention layer ≈ $5.65 \times 10^9$ FLOPs

#### ② FFN layer

- Two linear layers: $2 \times n \times dim \times (dim \times 4)$
- One FFN layer ≈ $3.03 \times 10^8$ FLOPs

#### ③ One Swin window attention (WindowAttention)

- Only the within-window complexity $n \times w^2 \times dim$ is counted, which is very small
- One WindowAttention ≈ $0.15 \times 10^9$ FLOPs

#### ④ One SwinBlock (W-MSA / SW-MSA)

- One SwinBlock = one WindowAttention + one FFN

## 4DVarformerV2

- Patch Embedding + positional encoding
- Gradient branch, repeated ×6: SwinAttn → FFN → CrossAttn → FFN
- Fusion of the main branch and the gradient branch (Add)
- Backbone: SwinLayer ×6, alternating W-MSA / SW-MSA
- Linear Head with upsampled output

### FLOPs

$$FLOPs_{swin} = 8 \times N \times dim^2 + 4 \times heads \times N_{win} \times w^2 \times dim_{head}$$

## Comparison

### V1

V1 uses 4 Cross-Attention layers, 4 FFN layers, 4 Self-Attention layers, and another 4 FFN layers:

- 4 CrossAtt: $4 \times 5.65 = 22.6 \times 10^9$
- 4 FFN: $4 \times 0.303 = 1.212 \times 10^9$
- 4 SelfAtt: $4 \times 5.65 = 22.6 \times 10^9$
- 4 FFN: $4 \times 0.303 = 1.212 \times 10^9$

Total: $22.6 + 1.212 + 22.6 + 1.212 = 47.624 \times 10^9$ FLOPs

### V2

V2 uses 6 Swin-Attention pairs, 6 Self-Attention layers, 6 FFN layers, and 6 SwinBlocks:

- 6 Swin Attention: $6 \times 2 \times 0.15 \times 10^9 = 1.8 \times 10^9$
- 6 SelfAtt: $6 \times 5.65 = 33.9 \times 10^9$
- 6 FFN: $6 \times 0.303 = 1.818 \times 10^9$
- 6 SwinBlock: $6 \times 0.453 = 2.718 \times 10^9$

Total: $1.8 + 33.9 + 1.818 + 2.718 = 40.236 \times 10^9$ FLOPs

| Architecture | Total FLOPs |
| --- | --- |
| 4DVarformer (V1) | 47.62 × 10⁹ |
| 4DVarformerV2 | 40.24 × 10⁹ |

## SwinV2 vs. V1/V2

In SwinV2 the block becomes: SwinAtt → Norm1 → Add → FFN → Norm2 → Add.

### Additional computation

Cosine-similarity attention (L2 normalization applied to q and k separately):

$$FLOPs_{norm} = 2 \times n \times dim = 2 \times 9270 \times 64 \approx 1.19 \times 10^6 \text{ FLOPs}$$

Log-CPB bias generation: a small MLP takes $(\Delta x, \Delta y)$ as input; assuming an MLP of shape $2 \to h \to 2$ with $h = 32$:

$$FLOPs_{log\text{-}cpb} = 2 \times w^2 \times (2 \times h + h \times 2) = 2 \times 64 \times (64 + 64) = 16384 \approx 1.64 \times 10^4 \text{ FLOPs}$$

Increment per WindowAttention:

$$\Delta_{per\_win} = 1.19 \times 10^6 + 1.64 \times 10^4 \approx 1.21 \times 10^6 \text{ FLOPs}$$

Relative overhead:

$$\frac{1.21 \times 10^6}{0.15 \times 10^9} \approx 0.81\%$$
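The per-layer numbers and the V1/V2 totals above can be reproduced with a few lines of plain Python. The sketch below is a minimal re-derivation, assuming n = 9270 tokens, dim = 64, an 8×8 window, mlp_ratio = 4, and a log-CPB hidden width h = 32; these values are inferred from the worked numbers in the text rather than taken from any released 4DVarformer code.

```python
# Minimal sketch reproducing the per-layer FLOPs arithmetic above.
# All sizes below are assumptions inferred from the worked numbers in the text.
n, dim = 9270, 64           # token count and embedding dim (assumed)
w, mlp_ratio, h = 8, 4, 32  # window size, FFN expansion, log-CPB hidden width (assumed)

# ① global Self-/Cross-Attention, O(n^2 d)
qkv      = 3 * n * dim**2   # Q, K, V projections          ~1.14e8
attn_mat = n**2 * dim       # full n×n attention term      ~5.50e9
out_proj = n * dim**2       # output projection            ~3.80e7
flops_sa = qkv + attn_mat + out_proj          # ~5.65e9 per attention layer

# ② FFN: two linear layers dim -> dim*mlp_ratio -> dim
flops_ffn = 2 * n * dim * (dim * mlp_ratio)   # ~3.03e8 per FFN

# ③ window attention: attention restricted to w×w windows
flops_win = 3 * n * dim**2 + n * w**2 * dim   # ~0.15e9 per WindowAttention

# ④ one SwinBlock = one WindowAttention + one FFN
flops_swin_block = flops_win + flops_ffn      # ~0.45e9

# V1: 4 × (CrossAtt + FFN + SelfAtt + FFN)
flops_v1 = 4 * (flops_sa + flops_ffn + flops_sa + flops_ffn)   # ~47.6e9

# V2: 6 Swin-Attention pairs + 6 Self-Attention + 6 FFN + 6 SwinBlocks
flops_v2 = (6 * 2 * flops_win + 6 * flops_sa
            + 6 * flops_ffn + 6 * flops_swin_block)            # ~40.2e9

# SwinV2 extras per WindowAttention: cosine (L2-normalised) attention + log-CPB MLP
flops_norm    = 2 * n * dim                   # ~1.19e6, L2 norm of q and k
flops_log_cpb = 2 * w**2 * (2 * h + h * 2)    # ~1.64e4, 2 -> h -> 2 MLP
delta_per_win = flops_norm + flops_log_cpb    # ~1.21e6, roughly 0.8% of one WindowAttention

print(f"V1 total: {flops_v1 / 1e9:.2f} GFLOPs, V2 total: {flops_v2 / 1e9:.2f} GFLOPs")
print(f"SwinV2 overhead per WindowAttention: {delta_per_win / flops_win:.2%}")
```

Running it gives roughly 47.6 GFLOPs for V1, 40.3 GFLOPs for V2, and a SwinV2 overhead of about 0.8% per WindowAttention, matching the figures above up to rounding.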
## net

Module hierarchy of the full model (parameter counts in millions):

```
net (Solver, 4.0M)
├── phi_r (FWNet)
├── model_H (Model_H)
├── m_Grad (FDVarFormerV2, core 4.0M)
│   ├── patch_embed_x (PatchEmbed → Conv2d)
│   ├── patch_embed_grad (PatchEmbed → Conv2d)
│   ├── dropout
│   ├── layers (ModuleList)
│   │   ├── layers.0 (Self-Attn → FFN → Cross-Attn → FFN)
│   │   └── layers.1 (same as layers.0)
│   ├── swin_blocks (SwinTransformer module)
│   │   └── swin_blocks.0 (SwinLayer ×2)
│   ├── norm (LayerNorm)
│   ├── head (Linear output layer)
│   └── rearrange
└── model_VarCost (Model_Var_Cost)
    └── m_NormObs (Model_WeightedL2Norm)
criterion (WeightedL1Loss)
```
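A module tree and parameter breakdown like the one above can be printed for any PyTorch model with the generic `named_modules()` / `parameters()` APIs. The sketch below is illustrative only; `build_net()` is a hypothetical stand-in for the actual Solver constructor, which is not shown in this article.

```python
import torch.nn as nn

def summarize(module: nn.Module, root_name: str = "net") -> None:
    """Print every sub-module with its class name and trainable parameter count."""
    for name, child in module.named_modules():
        n_params = sum(p.numel() for p in child.parameters() if p.requires_grad)
        full_name = root_name if name == "" else f"{root_name}.{name}"
        print(f"{full_name:<40s} {type(child).__name__:<25s} {n_params / 1e6:.2f}M")

# net = build_net()   # hypothetical factory for the Solver described above
# summarize(net)      # e.g. shows m_Grad (FDVarFormerV2) holding ~4.0M parameters
```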
