PyTorch Review Summary 3

PyTorch review notes, intended for the author's own use. Reference materials:

Dive into Deep Learning (《动手学深度学习》)
Stanford University: Practical Machine Learning

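This post covers multilayer perceptrons in PyTorch.

It first shows how to build and train a multilayer perceptron, then presents remedies for the overfitting that often appears during training.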

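PyTorch syntax summary series:

Common tensor operations, linear algebra, calculus, and probability: see PyTorch Review Summary 1;
Linear neural networks: see PyTorch Review Summary 2;
Multilayer perceptrons: see PyTorch Review Summary 3;
Deep learning computation: see PyTorch Review Summary 4;
Convolutional neural networks: see PyTorch Review Summary 5;
Modern convolutional neural networks: see PyTorch Review Summary 6.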

I. Multilayer Perceptron

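Linear models are easy to implement and understand, cheap to compute, and tend to generalize well. However, they assume that the output changes monotonically with each input feature, and many nonlinear problems violate that assumption. The multilayer perceptron (MLP) overcomes this limitation by introducing hidden layers, and it applies an activation function after each hidden layer to give the network nonlinear modeling capacity.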
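The role of the activation function can be checked numerically. The sketch below uses arbitrary layer sizes (4, 8, 3) chosen only for illustration: two stacked linear layers without an activation collapse into a single affine map, while inserting a ReLU between them breaks that equivalence.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Two stacked linear layers with no activation in between (sizes are illustrative).
lin1, lin2 = nn.Linear(4, 8), nn.Linear(8, 3)
X = torch.randn(5, 4)

# Collapse them into one affine map: W = W2 W1, b = W2 b1 + b2.
W = lin2.weight @ lin1.weight
b = lin2.weight @ lin1.bias + lin2.bias

stacked = lin2(lin1(X))
collapsed = X @ W.T + b
same_without_activation = torch.allclose(stacked, collapsed, atol=1e-5)

# With a ReLU in between, the single affine map no longer reproduces the output.
with_relu = lin2(torch.relu(lin1(X)))
same_with_activation = torch.allclose(with_relu, collapsed, atol=1e-5)

print(same_without_activation, same_with_activation)
```

Without the nonlinearity, adding more linear layers adds parameters but no expressive power.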

1. Loading the dataset

As in the Softmax regression section of PyTorch Review Summary 2, we continue to use the Fashion-MNIST image classification dataset:

import torch
import torchvision
from torch.utils import data
from torchvision import transforms

def load_data_fashion_mnist(batch_size, resize=None):
    """Download the Fashion-MNIST dataset and load it into memory."""
    trans = [transforms.ToTensor()]
    if resize:
        trans.insert(0, transforms.Resize(resize))
    trans = transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(
        root="./data", train=True, transform=trans, download=True)
    mnist_test = torchvision.datasets.FashionMNIST(
        root="./data", train=False, transform=trans, download=True)
    return (data.DataLoader(mnist_train, batch_size, shuffle=True),
            data.DataLoader(mnist_test, batch_size, shuffle=False))

batch_size = 256
train_iter, test_iter = load_data_fashion_mnist(batch_size)
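To see what the iterators yield without downloading anything, here is a minimal sketch that swaps in a synthetic stand-in for Fashion-MNIST. The random tensors and the dataset size of 1000 are assumptions made for illustration, not real data.

```python
import torch
from torch.utils import data

# Synthetic stand-in: 1000 random 1x28x28 "images" with labels in 0..9.
fake_images = torch.rand(1000, 1, 28, 28)
fake_labels = torch.randint(0, 10, (1000,))
dataset = data.TensorDataset(fake_images, fake_labels)

loader = data.DataLoader(dataset, batch_size=256, shuffle=True)
X, y = next(iter(loader))
num_batches = len(loader)   # the last batch is smaller than 256

print(X.shape, y.shape, num_batches)
```

A real Fashion-MNIST batch from load_data_fashion_mnist has the same layout: images of shape (batch, 1, 28, 28) and integer class labels of shape (batch,).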

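2. Neural network model

First flatten the input image, then pass it through two fully connected layers. The hidden layer is followed by an activation function, and the final fully connected layer produces the output: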

from torch import nn

net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)
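As a quick sanity check, the sketch below pushes a dummy batch through the same architecture and counts its parameters. The random input is only a placeholder shaped like a Fashion-MNIST batch.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

X = torch.randn(2, 1, 28, 28)   # dummy batch: 2 images of 1x28x28
out = net(X)                    # Flatten turns each image into a 784-vector

# 784*256 + 256 (hidden layer) + 256*10 + 10 (output layer)
num_params = sum(p.numel() for p in net.parameters())
print(out.shape, num_params)
```

The output has one row per image and one column per class, which is the shape nn.CrossEntropyLoss expects for logits.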

As before, an init_weights() function initializes the weights of all fully connected layers from a normal distribution:

def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)

net.apply(init_weights)
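A small sketch to confirm the effect of the initialization: after net.apply(init_weights), the empirical standard deviation of the first layer's weights should be close to 0.01. Note that this function only touches the weights, so the biases keep PyTorch's default initialization.

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

def init_weights(m):
    # Only nn.Linear modules are re-initialized; Flatten and ReLU have no weights.
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)

net.apply(init_weights)

w_std = net[1].weight.std().item()     # empirical std of the hidden layer weights
w_mean = net[1].weight.mean().item()   # should be near 0
print(w_std, w_mean)
```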

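3. Activation functions

The previous section used ReLU as the activation. In practice, sigmoid, tanh, and other functions can be used as well. The gradients of ReLU, sigmoid, and tanh can be visualized as follows: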

import torch
from matplotlib import pyplot as plt

x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
# Uncomment one of the lines below to plot a different activation.
# y = torch.relu(x)
# y = torch.sigmoid(x)
y = torch.tanh(x)
y.backward(torch.ones_like(x), retain_graph=True)
plt.figure(figsize=(5, 2.5))
plt.plot(x.detach(), x.grad)
plt.show()
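The plotted curves can be cross-checked against the closed-form derivatives. The sketch below compares autograd with the analytic gradients; autograd_grad is a hypothetical helper name introduced here for illustration.

```python
import torch

def autograd_grad(fn, x):
    """Gradient of sum(fn(x)) with respect to x, i.e. the elementwise derivative."""
    x = x.clone().requires_grad_(True)
    fn(x).sum().backward()
    return x.grad

x = torch.arange(-8.0, 8.0, 0.1)
s, t = torch.sigmoid(x), torch.tanh(x)

# Analytic derivatives:
#   relu'(x)    = 1 if x > 0 else 0
#   sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
#   tanh'(x)    = 1 - tanh(x)^2
relu_ok = torch.allclose(autograd_grad(torch.relu, x), (x > 0).float())
sigmoid_ok = torch.allclose(autograd_grad(torch.sigmoid, x), s * (1 - s), atol=1e-6)
tanh_ok = torch.allclose(autograd_grad(torch.tanh, x), 1 - t ** 2, atol=1e-6)

# The sigmoid gradient peaks at 0.25 (at x = 0) and the tanh gradient at 1.
print(relu_ok, sigmoid_ok, tanh_ok)
```

Both sigmoid and tanh gradients vanish for large |x|, whereas the ReLU gradient stays at 1 for every positive input.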

4. Loss function

Same as for Softmax regression:

loss = nn.CrossEntropyLoss(reduction='none')
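With reduction='none' the loss returns one value per sample instead of a scalar, which is why the training loop later calls .mean() before backward() and .sum() when accumulating. A minimal sketch with made-up logits and labels:

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 10)            # illustrative batch of 4 samples, 10 classes
labels = torch.tensor([0, 3, 9, 1])

per_sample = nn.CrossEntropyLoss(reduction='none')(logits, labels)
mean_loss = nn.CrossEntropyLoss()(logits, labels)   # default reduction='mean'

print(per_sample.shape, per_sample.mean().item(), mean_loss.item())
```

Averaging the per-sample losses reproduces the default 'mean' reduction.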

5. Optimizer

Same as for Softmax regression:

trainer = torch.optim.SGD(net.parameters(), lr=0.1)

6. Training

As with Softmax regression, the training procedure can be wrapped in a function:

def accuracy(y_hat, y):
    """Count the number of correct predictions."""
    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
        y_hat = y_hat.argmax(axis=1)
    cmp = y_hat.type(y.dtype) == y
    return float(cmp.type(y.dtype).sum())

def train_net(net, train_iter, test_iter, loss, num_epochs, trainer):
    for epoch in range(num_epochs):     # loop over training epochs
        net.train()                     # put the model in training mode
        train_loss_sum = 0.0            # running sum of training loss
        train_acc_sum = 0.0             # running count of correct predictions
        sample_num = 0                  # number of samples seen
        for X, y in train_iter:
            y_hat = net(X)
            l = loss(y_hat, y)
            trainer.zero_grad()
            l.mean().backward()
            trainer.step()
            train_loss_sum += float(l.sum())   # convert to float so no tensor is accumulated
            train_acc_sum += accuracy(y_hat, y)
            sample_num += y.numel()
        train_loss = train_loss_sum / sample_num
        train_acc = train_acc_sum / sample_num

        net.eval()                      # put the model in evaluation mode
        test_acc_sum = 0.0
        test_sample_num = 0
        with torch.no_grad():           # no gradients are needed for evaluation
            for X, y in test_iter:
                test_acc_sum += accuracy(net(X), y)
                test_sample_num += y.numel()
        test_acc = test_acc_sum / test_sample_num
        print(f'epoch {epoch + 1}, '
              f'train loss {train_loss:.4f}, train acc {train_acc:.4f}, '
              f'test acc {test_acc:.4f}')

num_epochs = 10
train_net(net, train_iter, test_iter, loss, num_epochs, trainer)
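Note that accuracy() returns a count of correct predictions, not a ratio. The training loop divides by the number of samples afterwards. A small worked example with hand-picked scores:

```python
import torch

def accuracy(y_hat, y):
    """Count the number of correct predictions."""
    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
        y_hat = y_hat.argmax(axis=1)
    cmp = y_hat.type(y.dtype) == y
    return float(cmp.type(y.dtype).sum())

# Three samples, two classes; the predicted classes are 1, 0, 1.
y_hat = torch.tensor([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
y = torch.tensor([1, 0, 0])

num_correct = accuracy(y_hat, y)       # first two predictions match the labels
acc_ratio = num_correct / y.numel()
print(num_correct, acc_ratio)
```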

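II. Mitigating Overfitting

Overfitting tends to occur when the model is too complex, the training data are too scarce, or training runs for too many epochs. There are many ways to address it:

More data: additional training data help the model learn the true structure of the data and reduce overfitting;
Simpler model: lower the model's complexity, for example by reducing the number of parameters or by applying regularization;
Cross-validation: use cross-validation to estimate generalization performance and select the best model;
Early stopping: monitor the model's performance on a validation set during training and stop as soon as the validation error stops decreasing or starts to rise (this is distinct from Dropout, which is covered below);
Ensemble learning: use ensemble methods (such as random forests or gradient-boosted trees) to reduce the model's variance and improve generalization.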
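Early stopping from the list above can be sketched as a small helper that tracks the best validation loss. The class name EarlyStopping, the patience and min_delta parameters, and the list of validation losses are all hypothetical and chosen only for illustration.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses: improvement stalls after the third epoch.
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, v in enumerate(val_losses, start=1):
    if stopper.step(v):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```

In a real training loop, step() would be called once per epoch with the loss measured on a held-out validation set, and the parameters from the best epoch would typically be saved and restored.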

Two commonly used regularization methods are introduced below.

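1. Weight decay

Weight decay adds a penalty term to the loss function to reduce model complexity and thereby prevent overfitting. The penalty, also called the regularization term, is usually the sum of the squared weights (the squared L2 norm) or the sum of the absolute values of the weights (the L1 norm), multiplied by a regularization coefficient.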

Take the linear regression loss $L(\mathbf{w}, b)$ as an example. Adding the squared L2 norm penalty to $L(\mathbf{w}, b)$ gives the objective that the optimizer minimizes:
$L(\mathbf{w}, b)+\frac{\lambda}{2}\|\mathbf{w}\|^2 =\frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^{\top} \mathbf{x}^{(i)}+b-y^{(i)}\right)^2+\frac{\lambda}{2}\|\mathbf{w}\|^2$

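No penalty is added for the bias $b$, because the bias of a network's output layer usually does not need regularization. Substituting this objective into minibatch stochastic gradient descent, with learning rate $\eta$ and minibatch $\mathcal{B}$, gives the update rule for $\mathbf{w}$: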
$\mathbf{w} \leftarrow(1-\eta \lambda) \mathbf{w}-\frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)}\left(\mathbf{w}^{\top} \mathbf{x}^{(i)}+b-y^{(i)}\right)$
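The update rule can be verified numerically against torch.optim.SGD. The sketch below uses arbitrary values for $\eta$ and $\lambda$, random data, and omits the bias for brevity. It checks that one optimizer step with weight_decay matches $(1-\eta\lambda)\mathbf{w}$ minus $\eta$ times the minibatch gradient.

```python
import torch

torch.manual_seed(0)
lr, wd = 0.1, 0.01                      # eta and lambda (illustrative values)
w = torch.randn(3, requires_grad=True)
X, y = torch.randn(8, 3), torch.randn(8)

opt = torch.optim.SGD([w], lr=lr, weight_decay=wd)

# Squared loss averaged over the minibatch (bias omitted for brevity).
loss = 0.5 * ((X @ w - y) ** 2).mean()
opt.zero_grad()
loss.backward()

# Save the weights and the gradient of the unpenalized loss before stepping.
w_before, grad = w.detach().clone(), w.grad.detach().clone()
opt.step()

# w <- (1 - eta * lambda) * w - eta * grad
expected = (1 - lr * wd) * w_before - lr * grad
matches = torch.allclose(w.detach(), expected, atol=1e-6)
print(matches)
```

The penalty never appears in the loss that is computed explicitly: the optimizer adds $\lambda \mathbf{w}$ to the gradient internally, which is what shrinks the weights at every step.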

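To apply weight decay to a model, pass the weight_decay argument when instantiating the optimizer. When all parameters are passed in together this way, PyTorch decays both the weights and the biases:

trainer = torch.optim.SGD(net.parameters(), lr=lr, weight_decay=wd)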

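To decay only the weights, pass per-parameter option groups. The snippet assumes that net[0] is an nn.Linear layer, as in a single-layer linear regression model. In the MLP defined earlier, net[0] is nn.Flatten, so the indices would need to be adjusted:

params_to_optimize = [
    {"params": net[0].weight, "weight_decay": wd},   # decay the weight
    {"params": net[0].bias},                          # no decay for the bias
]
trainer = torch.optim.SGD(params_to_optimize, lr=lr)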
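For a network with several layers, one possible way to build the groups is to split parameters by name rather than by index. This is a sketch under the assumption that every bias parameter's name ends in "bias", which holds for nn.Linear. The lr and wd values are illustrative.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
lr, wd = 0.1, 1e-3   # illustrative values

# Split parameters: names look like "1.weight", "1.bias", "3.weight", "3.bias".
decay, no_decay = [], []
for name, param in net.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(param)

trainer = torch.optim.SGD(
    [{"params": decay, "weight_decay": wd},        # weights are decayed
     {"params": no_decay, "weight_decay": 0.0}],   # biases are not
    lr=lr,
)

group_decay = [g["weight_decay"] for g in trainer.param_groups]
print(len(decay), len(no_decay), group_decay)
```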

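2. Dropout

During training, Dropout randomly sets the outputs of a fraction of the network's internal neurons to zero, that is, it "drops" those neurons with a given probability. This keeps any neuron from relying too heavily on specific other neurons, which makes the network more robust.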

Dropout is normally applied only during training and not at inference time, because the model needs to produce deterministic outputs during inference.
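The difference between the two modes is easy to observe on a single nn.Dropout layer. In training mode each element is either zeroed or scaled by 1/(1-p) so that the expected value is unchanged. In evaluation mode the layer is the identity. The input of ones and p = 0.5 are arbitrary choices for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
p = 0.5
layer = nn.Dropout(p)
x = torch.ones(1000)

layer.train()                          # training mode: random masking is active
y_train = layer(x)
kept = y_train[y_train != 0]           # surviving elements, scaled by 1 / (1 - p)
zero_fraction = (y_train == 0).float().mean().item()

layer.eval()                           # evaluation mode: dropout is disabled
y_eval = layer(x)

print(kept.unique(), zero_fraction, torch.equal(y_eval, x))
```

This is why train_net-style loops call net.train() before the training batches and net.eval() before computing test accuracy.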

To use Dropout, add Dropout layers to the network, typically right after the activation function, and specify the dropout probability for each:

dropout1, dropout2 = 0.2, 0.5

net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(dropout1),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Dropout(dropout2),
    nn.Linear(256, 10)
)

def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)

net.apply(init_weights)