ConvNext学习

参考：
[1] LIU Z, MAO H, WU C Y, et al. A ConvNet for the 2020s[C/OL]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA. 2022. http://dx.doi.org/10.1109/cvpr52688.2022.01167. DOI:10.1109/cvpr52688.2022.01167.
[2] 薛导ConvNext博客
[3] DropPath理解
[4] 官方源码Facebook

1 模型学习
- 1.0 训练策略
- 1.1 Macro design
- 1.2 ResNeXt-ify
- 1.3 Inverted bottleneck
- 1.4 Large kernel size
- 1.5 Micro design
- 1.6 框架图
2 ConvNext代码实现
- 2.1 DropPath
- 2.2 GELU
- 2.3 LayerNorm
- 2.4 Block的实现
- 2.5 ConvNext网络搭建
- - 2.5.1 特征提取部分
  - 2.5.2 分类头部分
  - 2.5.3 初始化

1 模型学习

在这里插入图片描述

1.0 训练策略

作者以训练Vit的策略用来训练ResNet50，发现比原来要好，以此为基准。

1.1 Macro design

改变stage的计算比例

原ResNet50的stage的重复次数为[3,4,6,3]，而Swin-T的重复次数比例为1:1:3:1, Swin-L的重复次数比例为1:1:9:1，可见第三层stage的重复次数更多。
ConvNext将由原来的[3,4,6,4] 魔改为 [3,3,9,3]

改变stem(既开始的部分)

原ResNet的stem为卷积7x7,stride为2,padding为3，再经过stride为2的Maxpooling层，将原图像的高宽缩小4倍(224->56)。但在Swim-T中，卷积之间是没有重叠的，既stride=kernel_size
ConvNext仿照Swin-Transformer，用4x4Conv替代原来的7x7Conv，并将步长设置为4，不使用padding，这样直接将下采样四倍了。

1.2 ResNeXt-ify

替换GroupConv为DW Conv

将组卷积替换为depthwise卷积(DW卷积)，也就是将groups设置为channel，DW卷积最早出现于MobileNet中，也是GroupConv的一种特殊形式（groups = input channels）
然后使用1x1卷积去改变channel数（官方代码中说是使用pointwise 1x1 conv，并且就是一个nn.Linear）

改变conv的深度

原ResNet论文中，卷积的深度为(64,128,256,512)， ConvNext将其改为(96,192,384,768)

1.3 Inverted bottleneck

在这里插入图片描述

原来ResNet结构的Block都是呈现宽-窄-宽结构，在ConvNext中，变成窄-宽-窄结构,如图的(a)->(b)

1.4 Large kernel size

将DW Conv移到第一层

为了确保efficiency，large-kernel conv通常有较少的channels，而1x1 conv反而会去做繁杂的事情（比如升维、降维）。故ConvNext将DW Conv移动到第一层，并保持维度不变，而第二、三层的1x1 Conv负责升维降维。

增大kernel size

ConvNext将DW Conv的3x3 Conv变成7x7 Conv

1.5 Micro design

在这里插入图片描述

将ReLU换成GeLU，GELU可以看作是RELU的smooth版本
减小激活函数的使用，由Swim可知，只在最后一个1x1conv之前(降维之前）使用激活函数
在第二个1x1 Conv之前使用BN层，减小BN层的使用
将BN层换成LN层，BN模块有很多复杂的有害的影响
用2x2 Conv with stride 2替换原来的 1x1 Conv with stride 2进行残差下采样，也就是不使用残差下采样了，而是用 identity残差连接+2x2Conv下采样替代

1.6 框架图

在这里插入图片描述

2 ConvNext代码实现

2.1 DropPath

由 DropPath理解可知，droppath是随机失活样本中的一部分，而dropout是随机失活样本的一些权重。需要注意的是，Droppath只在training phase中使用,并且放在block的残差连接之前。
下图分别为DropPath和Dropout的输出。
在这里插入图片描述

具体实现

def droppath(x,drop_prob:float=0.,training:bool=False):if drop_prob == 0. or not training:return xkeep_prob = 1. - drop_probshape = (x.shape[0],) + (1,)*(x.dim -1)# 举个例子： # 如果x为[10,3,224,224],那么shape为[10,1,1,1]，只保留第0维，扩展后3维random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)# random_tensor介于[0,2)之间random_tensor = random_tensor.floor_() # random_tensor为0或1，表示失活或保留x = x.div(keep_prob)*randoom_tensor# 保持期望值不变，参考网址：https://www.cnblogs.com/dan-baishucaizi/p/14703263.html

官方源码调用库函数

from timm.models.layers import DropPath
...self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()

2.2 GELU

参考：https://alaaalatif.github.io/2019-04-11-gelu/
总的来说，GELU要比RELU的训练效果要好，并且GELU具有导数连续，比RELU更加平滑等优点，并且在0以下的值有一定概率不会变为0，避免了梯度消失。由下图可知，GELU安插在1x1conv升维之前。

代码:

self.gelu = nn.GELU()

2.3 LayerNorm

LayerNorm就是对每一个channel独立地进行求均值和方差的操作，然后再去对每一个channel独立地进行标准化。现在分channel维在第二位和第四位的情况，而pytorch官方可直接调用LayerNorm的情况是channel维在第四位的情况。

代码实现

class LayerNorm(nn.Module):def __init__(self,normalized_shape,eps=1e-6,data_format='channels_last')self.eps = epsself.weight = nn.Parameter(torch.ones(normalized_shape)) # learnable gammaself.bias = nn.Parameter(torch.zeros(normalized_shape)) # learnable betaself.data_format = data_formatif self.data_format not in ['channels_first','channels_last']:raise NotImplementedErrorself.normalized_shape = (normalized_shape,)def forward(self,x):if self.data_format == 'channels_first':return F.layer_norm(x,self.normalized_shape,self.weight,self.bias,self.eps)elif self.data_format == 'channels_last':u = x.mean(1,keepdim=True)s = (x-u).pow(2).mean(1,keepdim=True) # variancex = (x-u)/torch.sqrt(s+self.eps)x = self.weight[:,None,None]*x+self.bias[:,None,None]return x

2.4 Block的实现

Block的源码中有一个layer_scale的操作是原论文中没有提及的，源于改论文Going deeper with image transformers. ICCV, 2021, 简单来说就是每一个通道的值进行缩放，缩放的因子gamma是一个可学习参数。

官方源码中开头的注释：
ConvNeXt Block. There are two equivalent implementations:
(1) DwConv -> LayerNorm (channels_first) -> 1x1 Conv -> GELU -> 1x1 Conv; all in (N, C, H, W)
(2) DwConv -> Permute to (N, H, W, C); LayerNorm (channels_last) -> Linear -> GELU -> Linear; Permute back
We use (2) as we find it slightly faster in PyTorch
Args:
dim (int): Number of input channels.
drop_path (float): Stochastic depth rate. Default: 0.0
layer_scale_init_value (float): Init value for Layer Scale. Default: 1e-6.

代码实现：

class Block(nn.Module):def __init__(self,dim,drop_path:float=0.,layer_scale_init_value:float=1e-6):self.dwconv = nn.Conv2d(dim,dim,kernel_size=7,padding=3,groups=dim)self.norm = LayerNorm(dim,eps=1e-6)self.pwconv1 = nn.Linear(dim,dim*4)self.gelu = nn.GELU()self.pwconv2 = nn.Linear(dim*4,dim)self.gamma = nn.Parameter(layer_scale_init_value* torch.ones((dim)), require_grad=True) if layer_scale_init_value > 0 else Noneself.Droppath = DropPath(drop_path) if drop_path > 0. else nn.Identity()def forward(self,x):input_ = x x = self.dwconv(x)x = x.permute(0,2,3,1) # put channel dim into last dimx = self.norm(x)x = self.pwconv1(x)x = self.gelu(x)x = self.pwconv2(x)if self.gamma is not None:x *= self.gammax = x.permute(0,3,1,2)x = input_ + self.drop_path(x)return x

2.5 ConvNext网络搭建

2.5.1 特征提取部分

在这里插入图片描述

stem为 [conv(k=4,s=4) + Layer Norm]
downsample为 [Layer Norm + conv(dim1,dim2,k=2,s=2)]
stage部分为 [ 多个block ]

2.5.2 分类头部分

在这里插入图片描述

全局平均：相当于平均每一维，既[B,C,H,W] - > [B,C]
Layer Norm + Linear不介绍了

2.5.3 初始化

需要初始化的层包括Linear和Conv，分别初始化其weight和bias

分类头Linear： weight和bias都乘以1（保持不变），且这是一个in-place操作，意味着它会直接修改self.head.bias的值，而不是创建一个新的tensor。
其他Linear和conv：关于weight使用 truncated normal distribution（截断正态分布），关于bias使用常数constant为0

网络代码：

class ConvNext(nn.Module):def __init__(self, in_chans=3, num_classes=1000, depths=[3,3,9,3], dims=[96,192,384,768],drop_path_rate=0.,layer_scale_init_value=1e-6,head_init_scale=1.)super().__init__()# ------------------ 下采样部分 --------------------------------self.downsample_layers = nn.ModuleList() # 保存 stem和3个downsample_layerstem = nn.Sequential(nn.Conv2d(3,dims[0],kernel_size=4,stride=4),LayerNorm(dims[0],eps=1e-6,data_format='channels_first'))for i in range(3):downsample_layer = nn.Sequential(LayerNorm(dim[i],eps=1e-6,data_format='channels_first'),nn.Conv2d(dim[i],dim[i+1],kernel_size=2,stride=2))self.downsample_layers.append(downsample_layer)# ------------------- stage 部分 --------------------------------self.stages = nn.ModuleList()dp_rate = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths)]cur = 0 # 表示当前在第几层（深度）for i in range(4):stage = nn.Sequential(*[Block(dims[i], drop_path=dp_rate[cur+j],layer_scale_init_value=layer_scale_init_value) for j in range(depths[i]))self.stages.append(stage)cur += depths[i]# ------------------ 分类头和初始化部分 ----------------------------self.norm = LayerNorm(dim[-1],eps=1e-6)self.head = nn.Linear(dim[-1],num_classes)self.apply(self._init_weights)self.head.weight.data.mul_(head_init_scale)self.head.bias.data.mul_(head_init_scale)def _init_weight(self,m):if isinstance(m,(nn.Linear, nn.Conv2d)):trunc_normal(m.weight, std=.02),nn.init.constant_(m.bias,0)def forward_features(self,x):# features extraction part(stem, stage and downsample)for i in range(4):x = self.downsample_layers[i](x)x = self.stages[i](x)return self.norm(x.mean([-2,-1])) # GAP(global average pooling) [c,b,h,w] -> [c,b]def forward(self,x):x = self.forward_features(x)x = self.head(x)return x