
Python Data Science 34: Hands-On Deep Learning with PyTorch

Building a Linear Regression Model with PyTorch

Building a linear regression model using the autograd mechanism and simple tensor operations

import torch

# Construct the data
x = torch.randn(100)
y = 3 * x + 2

# Initialize the parameters
k = torch.tensor([0.0], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

Simulating a Single Iteration

# Run one iteration
y_pre = k*x + b
y_pre
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0.], grad_fn=<AddBackward0>)
lr = 0.015  # set the learning rate

# Define the loss function (MSE)
def loss_fun(y_true, y_pre):
    return torch.sum(torch.square(y_pre - y_true)) / len(y_pre)
# Compute the loss
loss = loss_fun(y, y_pre)
loss
tensor(16.0647, grad_fn=<DivBackward0>)
# Backpropagate to compute the gradients
loss.backward()
# Inspect the gradients
print(k.grad)
print(b.grad)
tensor([-7.1164])
tensor([-5.3900])
# Update the parameters
k.data -= k.grad * lr  # k.data = k.data - k.grad * lr
b.data -= b.grad * lr
print(k)
print(b)
tensor([0.1067], requires_grad=True)
tensor([0.0809], requires_grad=True)
# Zero the gradients after finishing a batch
k.grad.data.zero_()
b.grad.data.zero_()
tensor([0.])

Iterating in a Loop

# Iterate many times
for i in range(1000):
    y_pre = k*x + b
    # print(y_pre)
    loss = loss_fun(y, y_pre)
    loss.backward()
    # Inspect the gradients
    # print(k.grad)
    # print(b.grad)
    # Update the parameters
    k.data -= k.grad * lr
    b.data -= b.grad * lr
    # Print a record every 100 iterations
    if i % 100 == 0:
        print("Iter: %d, k: %.4f, b: %.4f, training loss: %.4f" %
              (i, k.item(), b.item(), loss.item()))
    k.grad.data.zero_()
    b.grad.data.zero_()
Iter: 0, k: 0.2096, b: 0.1585, training loss: 14.8915
Iter: 100, k: 2.9116, b: 1.9883, training loss: 0.0093
Iter: 200, k: 2.9956, b: 2.0024, training loss: 0.0000
Iter: 300, k: 2.9997, b: 2.0003, training loss: 0.0000
Iter: 400, k: 3.0000, b: 2.0000, training loss: 0.0000
Iter: 500, k: 3.0000, b: 2.0000, training loss: 0.0000
Iter: 600, k: 3.0000, b: 2.0000, training loss: 0.0000
Iter: 700, k: 3.0000, b: 2.0000, training loss: 0.0000
Iter: 800, k: 3.0000, b: 2.0000, training loss: 0.0000
Iter: 900, k: 3.0000, b: 2.0000, training loss: 0.0000

The printed training log shows that the model has already converged by around iteration 400, with k = 3.0000 and b = 2.0000.

The approach above relies only on PyTorch's autograd mechanism to train the parameters. PyTorch provides higher-level replacements for much of this training loop.

For example:

  1. Parameter updates: instantiate an optimizer in advance, then call optimizer.step() after the backward pass to update the parameters directly.
  2. Zeroing gradients: call the optimizer's optimizer.zero_grad() method to zero the gradients of all parameters at once.
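The two replacements can be sketched in a minimal loop. This reuses the toy data from this section; the seed and iteration count are illustrative choices, not from the original run:

```python
import torch

torch.manual_seed(0)

# Toy data and parameters, mirroring the section above
x = torch.randn(100)
y = 3 * x + 2
k = torch.tensor([0.0], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

# The optimizer takes over the manual `param.data -= grad * lr` updates
optimizer = torch.optim.SGD([k, b], lr=0.015)

losses = []
for _ in range(100):
    y_pre = k * x + b
    loss = torch.sum(torch.square(y_pre - y)) / len(y_pre)
    loss.backward()
    optimizer.step()       # replaces the manual parameter updates
    optimizer.zero_grad()  # replaces k.grad.data.zero_() / b.grad.data.zero_()
    losses.append(loss.item())
```

The loss trajectory matches the hand-written loop, since SGD without momentum performs exactly the `param -= grad * lr` update.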

Building a Linear Regression Model with a PyTorch Optimizer

import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F

Defining the Model

class SimpleLinear(nn.Module):
    def __init__(self):
        super(SimpleLinear, self).__init__()
        self.lr = nn.Linear(1, 1)

    def forward(self, x):
        return self.lr(x)

Defining the Optimizer

linear = SimpleLinear()
optimizer = torch.optim.SGD(linear.parameters(), lr=0.015)

Defining the Loss Function

loss_fun = nn.MSELoss()
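nn.MSELoss with its default reduction='mean' computes the same quantity as the hand-written loss_fun from the previous section. A quick check on made-up numbers (the values here are illustrative):

```python
import torch
import torch.nn as nn

y_true = torch.tensor([1.0, 2.0, 3.0])
y_pre = torch.tensor([1.5, 2.0, 2.0])

# Hand-written MSE: mean of squared errors
manual = torch.sum(torch.square(y_pre - y_true)) / len(y_pre)
# Built-in MSE with reduction='mean' (the default)
builtin = nn.MSELoss()(y_pre, y_true)
```

Both evaluate to (0.25 + 0 + 1) / 3.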

Training the Model

for epoch in range(500):
    # Compute the model output
    output = linear(x.unsqueeze(-1))
    # Compute the loss
    loss = loss_fun(y.unsqueeze(-1), output)
    # Backpropagate: compute each parameter's gradient from the computation graph
    loss.backward()
    # Update the parameters
    optimizer.step()
    # Zero the parameter gradients
    optimizer.zero_grad()
    if epoch % 50 == 0:
        print('Epoch:{0}, k:{1:.4f}, b:{2:.4f}, loss:{3:.6f}'.format(epoch, linear.lr.weight.item(), linear.lr.bias.item(), loss))
Epoch:0, k:-0.1556, b:-0.5560, loss:16.833160
Epoch:50, k:2.3117, b:1.3544, loss:0.902168
Epoch:100, k:2.8475, b:1.8398, loss:0.049314
Epoch:150, k:2.9658, b:1.9607, loss:0.002728
Epoch:200, k:2.9922, b:1.9905, loss:0.000152
Epoch:250, k:2.9982, b:1.9977, loss:0.000009
Epoch:300, k:2.9996, b:1.9994, loss:0.000000
Epoch:350, k:2.9999, b:1.9999, loss:0.000000
Epoch:400, k:3.0000, b:2.0000, loss:0.000000
Epoch:450, k:3.0000, b:2.0000, loss:0.000000

Simulating PyTorch's Model Building and Training Pipeline

Defining the Model Class

class SimpleLinear:
    # Initialize the parameters
    def __init__(self):
        self.k = torch.tensor([0.0], requires_grad=True)
        self.b = torch.tensor([0.0], requires_grad=True)

    # Forward computation
    def forward(self, x):
        y = self.k * x + self.b
        return y

    # Model parameters
    def parameters(self):
        return [self.k, self.b]

    # Calling the model instance runs the forward computation
    def __call__(self, x):
        return self.forward(x)

Defining the Optimizer Class

class Optimizer:
    # Store the parameters and learning rate
    def __init__(self, parameters, lr):
        self.parameters = parameters
        self.lr = lr

    # Update the parameters
    def step(self):
        for para in self.parameters:
            para.data -= para.grad * self.lr

    # Zero the gradients
    def zero_grad(self):
        for para in self.parameters:
            para.grad.zero_()

Defining the Loss Function Class

# Define the loss function (MSE)
class LossFun:
    def __call__(self, y_true, y_pre):
        return torch.sum(torch.square(y_pre - y_true)) / len(y_pre)

loss_fun = LossFun()

Writing the Training Loop

# Instantiate the model
model = SimpleLinear()
# Instantiate the optimizer
opt = Optimizer(model.parameters(), lr=0.015)
for epoch in range(500):
    # Compute the model output
    output = model(x)
    # Compute the loss
    loss = loss_fun(y, output)
    # Backpropagate: compute each parameter's gradient from the computation graph
    loss.backward()
    # Update the parameters
    opt.step()
    # Zero the parameter gradients
    opt.zero_grad()
    if epoch % 50 == 0:
        print('Epoch:{0}, k:{1:.4f}, b:{2:.4f}, loss:{3:.6f}'.format(epoch, model.parameters()[0].item(), model.parameters()[1].item(), loss))
Epoch:0, k:0.0915, b:0.0525, loss:12.655186
Epoch:50, k:2.3765, b:1.4947, loss:0.657512
Epoch:100, k:2.8640, b:1.8724, loss:0.035242
Epoch:150, k:2.9699, b:1.9683, loss:0.001927
Epoch:200, k:2.9932, b:1.9922, loss:0.000107
Epoch:250, k:2.9985, b:1.9981, loss:0.000006
Epoch:300, k:2.9996, b:1.9995, loss:0.000000
Epoch:350, k:2.9999, b:1.9999, loss:0.000000
Epoch:400, k:3.0000, b:2.0000, loss:0.000000
Epoch:450, k:3.0000, b:2.0000, loss:0.000000

This achieves the same training result.

Implementing Handwritten Digit Recognition with PyTorch

import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchsummary import summary
from torchvision import datasets
from torchvision.transforms import v2 as transforms
from torch.utils.data import DataLoader, Dataset

Handwritten Digit Recognition with a Convolutional Neural Network

Defining the Dataset

Downloading Online

PyTorch's built-in datasets can be fetched online: https://pytorch.org/vision/stable/datasets.html

When the download parameter is True, the specified path is checked for the dataset files; if they are missing, the dataset is downloaded.

mnist_train_dataset = datasets.MNIST('../data/MNIST',
                                     train=True,
                                     download=True,
                                     transform=transforms.Compose([
                                         transforms.ToTensor(),
                                         transforms.Normalize((0.1307,), (0.3081,))
                                     ]))
mnist_test_dataset = datasets.MNIST('../data/MNIST',
                                    train=False,
                                    download=True,
                                    transform=transforms.Compose([
                                        transforms.ToTensor(),
                                        transforms.Normalize((0.1307,), (0.3081,))
                                    ]))
mnist_train_dataset[0][0].shape
torch.Size([1, 28, 28])
mnist_train_dataset[0][1]
5

Defining a Custom Dataset Class

# Basic structure
class MyDataset(Dataset):
    def __init__(self):
        pass

    def __len__(self):
        # Return the number of samples
        return len(img)

    def __getitem__(self, idx):
        # Return the sample and its label for index idx
        return image, label
import numpy as np
class MNIST_Dataset(Dataset):
    def __init__(self, mnist_npz_filepath, train=True, transforms=None):
        super(MNIST_Dataset, self).__init__()
        mnist_data = np.load(mnist_npz_filepath)
        x_train = mnist_data['x_train']
        x_test = mnist_data['x_test']
        self.x = np.expand_dims(x_train, -1)
        self.y = mnist_data['y_train']

        if not train:
            self.x = np.expand_dims(x_test, -1)
            self.y = mnist_data['y_test']
        self.transforms = transforms

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        x = self.x[index]
        y = self.y[index]
        if self.transforms:
            # Image preprocessing; ToTensor converts the (H, W, C) array to a (C, H, W) tensor
            x = self.transforms(x)
        return x, y
train_dataset = MNIST_Dataset('../data/self/mnist.npz',
                              train=True,
                              transforms=transforms.Compose([
                                  transforms.ToTensor(),
                                  transforms.Normalize((0.1307,), (0.3081,))
                              ]))
test_dataset = MNIST_Dataset('../data/self/mnist.npz',
                             train=False,
                             transforms=transforms.Compose([
                                 transforms.ToTensor(),
                                 transforms.Normalize((0.1307,), (0.3081,))
                             ]))
train_dataset[0][0].shape
torch.Size([1, 28, 28])
train_dataset[0][1]
5

Building the Convolutional Neural Network


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5, 1)
        self.pool = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(6, 16, 5, 1)
        self.dropout = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(256, 128)
        self.fc2 = nn.Linear(128, 10)

    # Override the forward method
    def forward(self, x):
        x = self.conv1(x)        # convolution
        x = self.pool(x)         # pooling
        x = F.relu(x)            # ReLU activation
        x = self.conv2(x)        # convolution
        x = self.pool(x)         # pooling
        x = F.relu(x)            # ReLU activation
        x = torch.flatten(x, 1)  # flatten
        x = self.fc1(x)          # fully connected
        x = F.relu(x)            # ReLU activation
        x = self.dropout(x)      # dropout
        x = self.fc2(x)          # fully connected (output logits)
        return x
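The 256 input features of fc1 come from the convolution/pooling size arithmetic. A quick sanity check of the shapes (the helper function is an illustration, not part of the original code):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output spatial size of a conv or pool layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

s = 28                        # input image is 1x28x28
s = conv_out(s, 5)            # conv1, 5x5 kernel, stride 1 -> 24
s = conv_out(s, 2, stride=2)  # 2x2 max pool -> 12
s = conv_out(s, 5)            # conv2 -> 8
s = conv_out(s, 2, stride=2)  # 2x2 max pool -> 4
flat = 16 * s * s             # 16 channels of 4x4 -> 256 = fc1 input size
```

These sizes also match the Output Shape column of the summary below.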
model = Net()
# Inspect the model structure
summary(model.to('cuda'), (1, 28, 28))
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 6, 24, 24] 156
MaxPool2d-2 [-1, 6, 12, 12] 0
Conv2d-3 [-1, 16, 8, 8] 2,416
MaxPool2d-4 [-1, 16, 4, 4] 0
Linear-5 [-1, 128] 32,896
Dropout2d-6 [-1, 128] 0
Linear-7 [-1, 10] 1,290
================================================================
Total params: 36,758
Trainable params: 36,758
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.04
Params size (MB): 0.14
Estimated Total Size (MB): 0.19
----------------------------------------------------------------


D:\ProgramSoftware\Python\Miniconda3\envs\torch2\lib\site-packages\torch\nn\functional.py:1345: UserWarning: dropout2d: Received a 2-D input to dropout2d, which is deprecated and will result in an error in a future release. To retain the behavior and silence this warning, please use dropout instead. Note that dropout2d exists to provide channel-wise dropout on inputs with 2 spatial dimensions, a channel dimension, and an optional batch dimension (i.e. 3D or 4D inputs).
warnings.warn(warn_msg)

Defining the Training Function

# model: the model    device: where training runs    optimizer: the optimizer    epoch: current training epoch
def train(model, device, train_loader, criterion, optimizer, epoch):
    model.train()  # training mode: enables layers like dropout; gradients will be updated
    total = 0      # number of samples trained so far
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        total += len(data)
        progress = math.ceil(batch_idx / len(train_loader) * 50)
        print("\rTrain epoch %d: %d/%d, [%-51s] %d%%" %
              (epoch, total, len(train_loader.dataset),
               '-' * progress + '>', progress * 2), end='')

Defining the Test Function

def test(model, device, test_loader, criterion):
    model.eval()  # evaluation mode: disables layers like dropout
    test_loss = 0
    correct = 0
    # Skip building the computation graph below to speed up testing
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += criterion(output, target).item()  # accumulate the loss over batches
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
    test_loss /= len(test_loader.dataset)

    print('\nTest: average loss: {:.4f}, accuracy: {}/{} ({:.0f}%)'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
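The accuracy bookkeeping in test() can be seen on a tiny made-up batch (the logits and targets here are invented for illustration): argmax picks the predicted class per row, and eq/sum counts matches against the targets.

```python
import torch

output = torch.tensor([[0.1, 0.9],   # predicts class 1
                       [0.8, 0.2],   # predicts class 0
                       [0.3, 0.7]])  # predicts class 1
target = torch.tensor([1, 0, 0])

pred = torch.argmax(output, dim=1, keepdim=True)       # shape (3, 1)
# view_as reshapes target to (3, 1) so eq compares element-wise
correct = pred.eq(target.view_as(pred)).sum().item()   # 2 of 3 correct
```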

Training the Model

epochs = 10  # number of epochs
batch_size = 256
torch.manual_seed(2021)

# Use the GPU if available, otherwise the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Training set loader
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Test set loader
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

# Define the model and move it to the GPU
model = Net().to(device)
# Stochastic gradient descent
optimizer = torch.optim.SGD(model.parameters(), lr=0.025, momentum=0.9)
criterion = nn.CrossEntropyLoss()

for epoch in range(1, epochs+1):
    train(model, device, train_loader, criterion, optimizer, epoch)
    test(model, device, test_loader, criterion)
    print('--------------------------')
Train epoch 1: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0003, accuracy: 9756/10000 (98%)
--------------------------
Train epoch 2: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0002, accuracy: 9803/10000 (98%)
--------------------------
Train epoch 3: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0002, accuracy: 9856/10000 (99%)
--------------------------
Train epoch 4: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9875/10000 (99%)
--------------------------
Train epoch 5: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9894/10000 (99%)
--------------------------
Train epoch 6: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9891/10000 (99%)
--------------------------
Train epoch 7: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9881/10000 (99%)
--------------------------
Train epoch 8: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9898/10000 (99%)
--------------------------
Train epoch 9: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9909/10000 (99%)
--------------------------
Train epoch 10: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9905/10000 (99%)
--------------------------

Inspecting the Model Parameters

for i, v in model.named_parameters():
    print(i, v.shape)
conv1.weight torch.Size([6, 1, 5, 5])
conv1.bias torch.Size([6])
conv2.weight torch.Size([16, 6, 5, 5])
conv2.bias torch.Size([16])
fc1.weight torch.Size([128, 256])
fc1.bias torch.Size([128])
fc2.weight torch.Size([10, 128])
fc2.bias torch.Size([10])

Saving Model Checkpoints

As model architectures and datasets grow more complex, training takes longer and longer. If training is interrupted halfway by uncontrollable factors (for example, the process being killed due to server resource contention, or an unexpected power outage), restarting from scratch wastes all the time and compute already spent.

Model checkpointing was developed to handle exactly this: it saves the intermediate state of the model during each training epoch.
A checkpoint typically stores the following:

  1. epoch: the current epoch number
  2. step: the batch index within the current epoch (rarely used)
  3. model_state_dict: the model parameters
  4. optimizer_state_dict: the optimizer state
  5. loss: the loss value for the current parameters
save_file = f'checkpoint_{epoch}.pt'
torch.save({
    'epoch': epoch,  # current epoch number; commonly stored
    'step': step,    # batch index within the epoch; rarely used, since saving per batch means comparing the model after every batch and keeping the better one
    'model_state_dict': model.state_dict(),          # model parameters; must be saved
    'optimizer_state_dict': optimizer.state_dict(),  # optimizer state; must be saved
    'loss': loss,    # loss for this epoch; also rarely used, since a new value is computed every epoch
}, save_file)

Loading a Model Checkpoint

import logging

resume = f'checkpoint_{epoch}.pt'  # checkpoint file to restore from
if resume != "":
    # Load the parameter file of the previously trained model
    logging.warning(f"loading from {resume}")
    checkpoint = torch.load(resume, map_location=torch.device("cuda:0"))  # can be cpu, cuda, or cuda:index
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    start_epoch = checkpoint['epoch']
    start_step = checkpoint['step']
    loss = checkpoint['loss']

Bringing Checkpoints into the Training Loop

Checkpoints are usually introduced in one of two ways:

  1. Save at a fixed interval, e.g. save a checkpoint every 5 training epochs.
  2. Save on a monitored metric, e.g. watch the loss and save whenever it decreases, i.e. keep the best model.
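The training loops below implement the second strategy. The first one only needs a fixed-interval condition; a minimal sketch of the scheduling logic (the interval and epoch range are illustrative, and the torch.save(...) call from the earlier block would go inside the if):

```python
# Strategy 1: save a checkpoint every `save_every` epochs
def should_save(epoch, save_every=5):
    return epoch % save_every == 0

# With 20 epochs and save_every=5, checkpoints land on epochs 5, 10, 15, 20
saved = [e for e in range(1, 21) if should_save(e)]
```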
def test(model, device, test_loader, criterion):
    model.eval()  # evaluation mode: disables layers like dropout
    test_loss = 0
    correct = 0
    # Skip building the computation graph below to speed up testing
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += criterion(output, target).item()  # accumulate the loss over batches
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
    test_loss /= len(test_loader.dataset)

    print('\nTest: average loss: {:.4f}, accuracy: {}/{} ({:.0f}%)'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
    return test_loss  # return the test loss
epochs = 10
batch_size = 256
torch.manual_seed(2021)

# Use the GPU if available, otherwise the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Training set loader
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Test set loader
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

# Define the model and move it to the GPU
model = Net().to(device)
# Stochastic gradient descent
optimizer = torch.optim.SGD(model.parameters(), lr=0.025, momentum=0.9)
criterion = nn.CrossEntropyLoss()

min_loss = torch.inf
start_epoch = 0
delta = 1e-4

for epoch in range(start_epoch + 1, start_epoch + epochs + 1):
    train(model, device, train_loader, criterion, optimizer, epoch)
    loss = test(model, device, test_loader, criterion)
    # Monitor the loss; save the model when it drops by more than the tolerance (rtol set to 1e-4, default is 1e-5)
    if loss < min_loss and not torch.isclose(torch.tensor([min_loss]), torch.tensor([loss]), delta):
        print(f'Loss Reduce {min_loss} to {loss}')
        min_loss = loss
        save_file = f'cnn_checkpoint_best.pt'
        torch.save({
            'epoch': epoch+1,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss
        }, save_file)
        print(f'Save checkpoint to {save_file}')
    print('--------------------------')
Train epoch 1: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0003, accuracy: 9757/10000 (98%)
Loss Reduce inf to 0.00031736000403761864
Save checkpoint to cnn_checkpoint_best.pt
--------------------------
Train epoch 2: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0002, accuracy: 9804/10000 (98%)
Loss Reduce 0.00031736000403761864 to 0.0002453371422481723
Save checkpoint to cnn_checkpoint_best.pt
--------------------------
Train epoch 3: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0002, accuracy: 9859/10000 (99%)
Loss Reduce 0.0002453371422481723 to 0.00018928093751892448
Save checkpoint to cnn_checkpoint_best.pt
--------------------------
Train epoch 4: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9873/10000 (99%)
Loss Reduce 0.00018928093751892448 to 0.00014866217281669377
Save checkpoint to cnn_checkpoint_best.pt
--------------------------
Train epoch 5: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9892/10000 (99%)
Loss Reduce 0.00014866217281669377 to 0.00013459483720362186
Save checkpoint to cnn_checkpoint_best.pt
--------------------------
Train epoch 6: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9881/10000 (99%)
--------------------------
Train epoch 7: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9885/10000 (99%)
--------------------------
Train epoch 8: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9896/10000 (99%)
Loss Reduce 0.00013459483720362186 to 0.0001215838277421426
Save checkpoint to cnn_checkpoint_best.pt
--------------------------
Train epoch 9: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9909/10000 (99%)
Loss Reduce 0.0001215838277421426 to 0.0001122944793663919
Save checkpoint to cnn_checkpoint_best.pt
--------------------------
Train epoch 10: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9893/10000 (99%)
--------------------------

Resuming Training from a Checkpoint

epochs = 10
batch_size = 256
torch.manual_seed(2021)

# Use the GPU if available, otherwise the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Training set loader
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Test set loader
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

# Define the model and move it to the GPU
model = Net().to(device)
# Stochastic gradient descent
optimizer = torch.optim.SGD(model.parameters(), lr=0.025, momentum=0.9)
criterion = nn.CrossEntropyLoss()

min_loss = torch.inf
start_epoch = 0
delta = 1e-4

# Specify the checkpoint
resume = f'cnn_checkpoint_best.pt'
# Load the checkpoint
if resume:
    print(f'loading from {resume}')
    checkpoint = torch.load(resume, map_location=torch.device("cuda:0"))
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    start_epoch = checkpoint['epoch']
    min_loss = checkpoint['loss']

for epoch in range(start_epoch + 1, start_epoch + epochs + 1):
    train(model, device, train_loader, criterion, optimizer, epoch)
    loss = test(model, device, test_loader, criterion)
    # Monitor the loss; save the model when it drops by more than the tolerance (rtol set to 1e-4, default is 1e-5)
    if loss < min_loss and not torch.isclose(torch.tensor([min_loss]), torch.tensor([loss]), delta):
        print(f'Loss Reduce {min_loss} to {loss}')
        min_loss = loss
        save_file = f'cnn_checkpoint_best.pt'
        torch.save({
            'epoch': epoch+1,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss
        }, save_file)
        print(f'Save checkpoint to {save_file}')
    print('--------------------------')
loading from cnn_checkpoint_best.pt
Train epoch 11: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9904/10000 (99%)
--------------------------
Train epoch 12: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9893/10000 (99%)
--------------------------
Train epoch 13: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9915/10000 (99%)
--------------------------
Train epoch 14: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9918/10000 (99%)
Loss Reduce 0.0001122944793663919 to 0.00010782728097401559
Save checkpoint to cnn_checkpoint_best.pt
--------------------------
Train epoch 15: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9908/10000 (99%)
--------------------------
Train epoch 16: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9915/10000 (99%)
--------------------------
Train epoch 17: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9905/10000 (99%)
--------------------------
Train epoch 18: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9928/10000 (99%)
Loss Reduce 0.00010782728097401559 to 0.00010413678948243614
Save checkpoint to cnn_checkpoint_best.pt
--------------------------
Train epoch 19: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9922/10000 (99%)
--------------------------
Train epoch 20: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0001, accuracy: 9919/10000 (99%)
--------------------------

Handwritten Digit Recognition with a Recurrent Neural Network

Reworking the Dataset Class

An RNN expects 3-dimensional input of shape [batch_size, sequence_length, input_size], so there is no need to add a channel dimension to the data.
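Concretely, each 28x28 MNIST image is treated as a sequence of 28 rows, each row a 28-dimensional input vector. A shape sketch with a made-up batch of 4 images:

```python
import torch

# (batch, seq_len, input_size): 4 images, 28 time steps, 28 features per step
batch = torch.rand(4, 28, 28)

# At time step t the RNN sees batch[:, t, :], one row of every image
step0 = batch[:, 0, :]  # shape (4, 28)
```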

class MNIST_Dataset(Dataset):
    def __init__(self, mnist_npz_filepath, train=True, transforms=None):
        super(MNIST_Dataset, self).__init__()
        mnist_data = np.load(mnist_npz_filepath)
        x_train = mnist_data['x_train']
        x_test = mnist_data['x_test']
        self.x = x_train
        self.y = mnist_data['y_train']

        if not train:
            self.x = x_test
            self.y = mnist_data['y_test']
        self.transforms = transforms

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        x = self.x[index]
        y = self.y[index]
        return torch.Tensor(x), y
train_dataset = MNIST_Dataset('../data/self/mnist.npz', train=True)
test_dataset = MNIST_Dataset('../data/self/mnist.npz', train=False)

Building the Recurrent Neural Network

class RNNNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(RNNNet, self).__init__()
        self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim, batch_first=True, nonlinearity='relu')
        self.fc = nn.Linear(hidden_dim, output_dim)

    # Override the forward method
    def forward(self, x):
        out, x = self.rnn(x)
        x = self.fc(x)
        return x.squeeze()
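The forward pass classifies from the final hidden state h_n, not from the per-step outputs. A quick shape check mirroring it with nn.RNN directly (batch size 5 and the untrained weights here are only for illustration):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(28, 64, 1, batch_first=True, nonlinearity='relu')
fc = nn.Linear(64, 10)

x = torch.rand(5, 28, 28)   # (batch, seq_len, input_size)
out, h_n = rnn(x)           # out: (5, 28, 64), h_n: (num_layers, 5, 64)
logits = fc(h_n).squeeze()  # (1, 5, 10) -> (5, 10): one score vector per image
```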

Defining the Model

input_dim = 28
hidden_dim = 64
layer_dim = 1
output_dim = 10
rnn_model = RNNNet(input_dim, hidden_dim, layer_dim, output_dim)
summary(rnn_model.to('cuda'), (28, 28))
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
RNN-1 [[-1, 28, 64], [-1, 2, 64]] 0
Linear-2 [-1, 2, 10] 650
================================================================
Total params: 650
Trainable params: 650
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 1.75
Params size (MB): 0.00
Estimated Total Size (MB): 1.76
----------------------------------------------------------------

Training the Model

input_dim = 28
hidden_dim = 128
layer_dim = 1
output_dim = 10
epochs = 10
batch_size = 256
torch.manual_seed(2021)

# Use the GPU if available, otherwise the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Training set loader
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Test set loader
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

# Define the model and move it to the GPU
rnn_model = RNNNet(input_dim, hidden_dim, layer_dim, output_dim).to(device)
# Stochastic gradient descent
optimizer = torch.optim.SGD(rnn_model.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()

min_loss = torch.inf
start_epoch = 0
delta = 1e-4

for epoch in range(start_epoch + 1, start_epoch + epochs + 1):
    train(rnn_model, device, train_loader, criterion, optimizer, epoch)
    loss = test(rnn_model, device, test_loader, criterion)
    # Monitor the loss; save the model when it drops by more than the tolerance (rtol set to 1e-4, default is 1e-5)
    if loss < min_loss and not torch.isclose(torch.tensor([min_loss]), torch.tensor([loss]), delta):
        print(f'Loss Reduce {min_loss} to {loss}')
        min_loss = loss
        save_file = f'rnn_checkpoint_best.pt'
        torch.save({
            'epoch': epoch+1,
            'model_state_dict': rnn_model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss
        }, save_file)
        print(f'Save checkpoint to {save_file}')
    print('--------------------------')
Train epoch 1: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0049, accuracy: 5402/10000 (54%)
Loss Reduce inf to 0.00494295916557312
Save checkpoint to rnn_checkpoint_best.pt
--------------------------
Train epoch 2: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0040, accuracy: 6309/10000 (63%)
Loss Reduce 0.00494295916557312 to 0.003991533029079437
Save checkpoint to rnn_checkpoint_best.pt
--------------------------
Train epoch 3: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0028, accuracy: 7532/10000 (75%)
Loss Reduce 0.003991533029079437 to 0.0027526606380939484
Save checkpoint to rnn_checkpoint_best.pt
--------------------------
Train epoch 4: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0020, accuracy: 8298/10000 (83%)
Loss Reduce 0.0027526606380939484 to 0.001968709048628807
Save checkpoint to rnn_checkpoint_best.pt
--------------------------
Train epoch 5: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0014, accuracy: 8915/10000 (89%)
Loss Reduce 0.001968709048628807 to 0.0014009273216128348
Save checkpoint to rnn_checkpoint_best.pt
--------------------------
Train epoch 6: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0012, accuracy: 9072/10000 (91%)
Loss Reduce 0.0014009273216128348 to 0.0012339959993958472
Save checkpoint to rnn_checkpoint_best.pt
--------------------------
Train epoch 7: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0010, accuracy: 9254/10000 (93%)
Loss Reduce 0.0012339959993958472 to 0.0009584869503974914
Save checkpoint to rnn_checkpoint_best.pt
--------------------------
Train epoch 8: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0010, accuracy: 9277/10000 (93%)
--------------------------
Train epoch 9: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0007, accuracy: 9485/10000 (95%)
Loss Reduce 0.0009584869503974914 to 0.0007289471238851547
Save checkpoint to rnn_checkpoint_best.pt
--------------------------
Train epoch 10: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0010, accuracy: 9282/10000 (93%)
--------------------------

Continuing Model Training

input_dim = 28
hidden_dim = 128
layer_dim = 1
output_dim = 10
epochs = 10
batch_size = 256
torch.manual_seed(2021)

# Use the GPU if available, otherwise the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Training set loader
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Test set loader
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

# Define the model and move it to the GPU
rnn_model = RNNNet(input_dim, hidden_dim, layer_dim, output_dim).to(device)
# Stochastic gradient descent
optimizer = torch.optim.SGD(rnn_model.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()

min_loss = torch.inf
start_epoch = 0
delta = 1e-4

# Specify the checkpoint
resume = f'rnn_checkpoint_best.pt'
if resume:
    print(f'loading from {resume}')
    checkpoint = torch.load(resume, map_location=torch.device("cuda:0"))
    rnn_model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    start_epoch = checkpoint['epoch']
    min_loss = checkpoint['loss']

for epoch in range(start_epoch + 1, start_epoch + epochs + 1):
    train(rnn_model, device, train_loader, criterion, optimizer, epoch)
    loss = test(rnn_model, device, test_loader, criterion)
    # Monitor the loss; save the model when it drops by more than the tolerance (rtol set to 1e-4, default is 1e-5)
    if loss < min_loss and not torch.isclose(torch.tensor([min_loss]), torch.tensor([loss]), delta):
        print(f'Loss Reduce {min_loss} to {loss}')
        min_loss = loss
        save_file = f'rnn_checkpoint_best.pt'
        torch.save({
            'epoch': epoch+1,
            'model_state_dict': rnn_model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss
        }, save_file)
        print(f'Save checkpoint to {save_file}')
    print('--------------------------')
loading from rnn_checkpoint_best.pt
Train epoch 11: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0007, accuracy: 9477/10000 (95%)
--------------------------
Train epoch 12: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0008, accuracy: 9425/10000 (94%)
--------------------------
Train epoch 13: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0006, accuracy: 9520/10000 (95%)
Loss Reduce 0.0007289471238851547 to 0.0006436171770095825
Save checkpoint to rnn_checkpoint_best.pt
--------------------------
Train epoch 14: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0006, accuracy: 9575/10000 (96%)
Loss Reduce 0.0006436171770095825 to 0.0005800850979983806
Save checkpoint to rnn_checkpoint_best.pt
--------------------------
Train epoch 15: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0007, accuracy: 9504/10000 (95%)
--------------------------
Train epoch 16: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0006, accuracy: 9533/10000 (95%)
--------------------------
Train epoch 17: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0006, accuracy: 9593/10000 (96%)
Loss Reduce 0.0005800850979983806 to 0.0005519135936163365
Save checkpoint to rnn_checkpoint_best.pt
--------------------------
Train epoch 18: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0005, accuracy: 9660/10000 (97%)
Loss Reduce 0.0005519135936163365 to 0.00047509705983102323
Save checkpoint to rnn_checkpoint_best.pt
--------------------------
Train epoch 19: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0005, accuracy: 9629/10000 (96%)
--------------------------
Train epoch 20: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0005, accuracy: 9639/10000 (96%)
--------------------------

Switching the model to an LSTM

Model definition

class LSTMNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(LSTMNet, self).__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, layer_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    # Override the forward method
    def forward(self, x):
        # out: per-step outputs; hn/cn: final hidden and cell states
        out, (hn, cn) = self.lstm(x)
        # Classify from the final hidden state; squeeze drops the layer dimension
        logits = self.fc(hn)
        return logits.squeeze()
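Because `forward` unpacks the final hidden state `hn` rather than the per-step outputs, it is worth confirming the tensor shapes that flow through it. A quick sketch with random data (no training, dimensions matching the configuration below):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(28, 128, 1, batch_first=True)  # input_dim, hidden_dim, layer_dim
fc = nn.Linear(128, 10)

# One MNIST image treated as a sequence: 28 rows of 28 pixels
x = torch.randn(32, 28, 28)        # (batch, seq_len, input_dim)
out, (hn, cn) = lstm(x)
print(out.shape)                   # (32, 28, 128): hidden state at every time step
print(hn.shape)                    # (1, 32, 128): final hidden state, one per layer
logits = fc(hn).squeeze()          # squeeze removes the leading layer dimension
print(logits.shape)                # (32, 10)
```

With `layer_dim > 1` the `squeeze()` would no longer produce a `(batch, output_dim)` tensor, since `hn` stacks one state per layer; in that case you would index `hn[-1]` instead.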
input_dim = 28
hidden_dim = 128
layer_dim = 1
output_dim = 10
lstm_model = LSTMNet(input_dim, hidden_dim, layer_dim, output_dim)
lstm_model
LSTMNet(
(lstm): LSTM(28, 128, batch_first=True)
(fc): Linear(in_features=128, out_features=10, bias=True)
)
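The printed summary can be cross-checked against the LSTM parameter formula: each of the four gates has an input weight, a recurrent weight, and (in PyTorch) two bias vectors. A hand computation, assuming the dimensions above:

```python
input_dim, hidden_dim, output_dim = 28, 128, 10

# One LSTM layer: 4 gates, each with W_ih (hidden x input), W_hh (hidden x hidden),
# plus two bias vectors b_ih and b_hh (PyTorch keeps both for CuDNN compatibility)
lstm_params = 4 * (hidden_dim * input_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
fc_params = hidden_dim * output_dim + output_dim  # weight + bias of the linear head

print(lstm_params)               # 80896
print(fc_params)                 # 1290
print(lstm_params + fc_params)   # 82186 trainable parameters in total
```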

Model training

input_dim = 28
hidden_dim = 128
layer_dim = 1
output_dim = 10
epochs = 10
batch_size = 256
torch.manual_seed(2021)

# Use the GPU if available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Training-set loader
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Test-set loader
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

# Build the model and move it to the GPU
lstm_model = LSTMNet(input_dim, hidden_dim, layer_dim, output_dim).to(device)
# Stochastic gradient descent with momentum
optimizer = torch.optim.SGD(lstm_model.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()

min_loss = torch.inf
start_epoch = 0
delta = 1e-4

for epoch in range(start_epoch + 1, start_epoch + epochs + 1):
    train(lstm_model, device, train_loader, criterion, optimizer, epoch)
    loss = test(lstm_model, device, test_loader, criterion)
    # Monitor the test loss and save the model when it drops by more than
    # the relative tolerance delta = 1e-4 (torch.isclose defaults to rtol=1e-5)
    if loss < min_loss and not torch.isclose(torch.tensor([min_loss]), torch.tensor([loss]), rtol=delta):
        print(f'Loss Reduce {min_loss} to {loss}')
        min_loss = loss
        save_file = 'lstm_checkpoint_best.pt'
        torch.save({
            'epoch': epoch + 1,
            'model_state_dict': lstm_model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss
        }, save_file)
        print(f'Save checkpoint to {save_file}')
    print('--------------------------')
Train epoch 1: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0084, accuracy: 3350/10000 (34%)
Loss Reduce inf to 0.008387945103645324
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 2: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0076, accuracy: 4894/10000 (49%)
Loss Reduce 0.008387945103645324 to 0.007566536009311676
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 3: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0067, accuracy: 5308/10000 (53%)
Loss Reduce 0.007566536009311676 to 0.0066949724435806275
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 4: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0059, accuracy: 5949/10000 (59%)
Loss Reduce 0.0066949724435806275 to 0.0058871085524559025
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 5: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0051, accuracy: 6538/10000 (65%)
Loss Reduce 0.0058871085524559025 to 0.005132971966266632
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 6: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0045, accuracy: 6939/10000 (69%)
Loss Reduce 0.005132971966266632 to 0.004458722734451294
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 7: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0039, accuracy: 7285/10000 (73%)
Loss Reduce 0.004458722734451294 to 0.003857710474729538
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 8: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0034, accuracy: 7544/10000 (75%)
Loss Reduce 0.003857710474729538 to 0.0033694057524204252
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 9: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0030, accuracy: 7753/10000 (78%)
Loss Reduce 0.0033694057524204252 to 0.0030294498264789582
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 10: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0028, accuracy: 7954/10000 (80%)
Loss Reduce 0.0030294498264789582 to 0.0027975483238697053
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
input_dim = 28
hidden_dim = 128
layer_dim = 1
output_dim = 10
epochs = 10
batch_size = 256
torch.manual_seed(2021)

# Use the GPU if available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Training-set loader
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Test-set loader
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

# Build the model and move it to the GPU
lstm_model = LSTMNet(input_dim, hidden_dim, layer_dim, output_dim).to(device)
# Stochastic gradient descent with momentum
optimizer = torch.optim.SGD(lstm_model.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()

min_loss = torch.inf
start_epoch = 0
delta = 1e-4

# Resume from the best checkpoint
resume = 'lstm_checkpoint_best.pt'
if resume:
    print(f'loading from {resume}')
    checkpoint = torch.load(resume, map_location=device)
    lstm_model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    start_epoch = checkpoint['epoch']
    min_loss = checkpoint['loss']

for epoch in range(start_epoch + 1, start_epoch + epochs + 1):
    train(lstm_model, device, train_loader, criterion, optimizer, epoch)
    loss = test(lstm_model, device, test_loader, criterion)
    # Monitor the test loss and save the model when it drops by more than
    # the relative tolerance delta = 1e-4 (torch.isclose defaults to rtol=1e-5)
    if loss < min_loss and not torch.isclose(torch.tensor([min_loss]), torch.tensor([loss]), rtol=delta):
        print(f'Loss Reduce {min_loss} to {loss}')
        min_loss = loss
        save_file = 'lstm_checkpoint_best.pt'
        torch.save({
            'epoch': epoch + 1,
            'model_state_dict': lstm_model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss
        }, save_file)
        print(f'Save checkpoint to {save_file}')
    print('--------------------------')
loading from lstm_checkpoint_best.pt
Train epoch 12: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0026, accuracy: 8073/10000 (81%)
Loss Reduce 0.0027975483238697053 to 0.0025662719130516054
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 13: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0024, accuracy: 8149/10000 (81%)
Loss Reduce 0.0025662719130516054 to 0.002429578536748886
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 14: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0022, accuracy: 8289/10000 (83%)
Loss Reduce 0.002429578536748886 to 0.0022109878718853
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 15: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0021, accuracy: 8319/10000 (83%)
Loss Reduce 0.0022109878718853 to 0.0021090875923633575
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 16: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0019, accuracy: 8462/10000 (85%)
Loss Reduce 0.0021090875923633575 to 0.0019428037196397782
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 17: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0019, accuracy: 8503/10000 (85%)
Loss Reduce 0.0019428037196397782 to 0.001924543511867523
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 18: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0019, accuracy: 8555/10000 (86%)
Loss Reduce 0.001924543511867523 to 0.001856973358988762
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 19: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0018, accuracy: 8592/10000 (86%)
Loss Reduce 0.001856973358988762 to 0.0017717314183712007
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 20: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0017, accuracy: 8631/10000 (86%)
Loss Reduce 0.0017717314183712007 to 0.001695616739988327
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 21: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0017, accuracy: 8645/10000 (86%)
--------------------------
input_dim = 28
hidden_dim = 128
layer_dim = 1
output_dim = 10
epochs = 10
batch_size = 256
torch.manual_seed(2021)

# Use the GPU if available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Training-set loader
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Test-set loader
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

# Build the model and move it to the GPU
lstm_model = LSTMNet(input_dim, hidden_dim, layer_dim, output_dim).to(device)
# Stochastic gradient descent with momentum
optimizer = torch.optim.SGD(lstm_model.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()

min_loss = torch.inf
start_epoch = 0
delta = 1e-4

# Resume from the best checkpoint
resume = 'lstm_checkpoint_best.pt'
if resume:
    print(f'loading from {resume}')
    checkpoint = torch.load(resume, map_location=device)
    lstm_model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    start_epoch = checkpoint['epoch']
    min_loss = checkpoint['loss']

for epoch in range(start_epoch + 1, start_epoch + epochs + 1):
    train(lstm_model, device, train_loader, criterion, optimizer, epoch)
    loss = test(lstm_model, device, test_loader, criterion)
    # Monitor the test loss and save the model when it drops by more than
    # the relative tolerance delta = 1e-4 (torch.isclose defaults to rtol=1e-5)
    if loss < min_loss and not torch.isclose(torch.tensor([min_loss]), torch.tensor([loss]), rtol=delta):
        print(f'Loss Reduce {min_loss} to {loss}')
        min_loss = loss
        save_file = 'lstm_checkpoint_best.pt'
        torch.save({
            'epoch': epoch + 1,
            'model_state_dict': lstm_model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss
        }, save_file)
        print(f'Save checkpoint to {save_file}')
    print('--------------------------')
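Because each checkpoint stores `epoch + 1`, the resumed loop picks up numbering right where the previous run's log left off. A small sketch of the range arithmetic with plain ints (no model needed):

```python
epochs = 10

# First run: no checkpoint, so training covers epochs 1..10
start_epoch = 0
first_run = list(range(start_epoch + 1, start_epoch + epochs + 1))
print(first_run[0], first_run[-1])  # 1 10

# Resume: if the best checkpoint was saved at epoch 20 it stored 'epoch': 21,
# so the next run trains epochs 22..31, matching the log above
start_epoch = 21
resumed = list(range(start_epoch + 1, start_epoch + epochs + 1))
print(resumed[0], resumed[-1])      # 22 31
```

Note that when the final epoch of a run is not the best one, one epoch number is skipped on resume; storing `'epoch': epoch` and restarting from `checkpoint['epoch'] + 1` would avoid that gap.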
loading from lstm_checkpoint_best.pt
Train epoch 22: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0017, accuracy: 8668/10000 (87%)
Loss Reduce 0.001695616739988327 to 0.001675336343050003
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 23: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0017, accuracy: 8676/10000 (87%)
--------------------------
Train epoch 24: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0016, accuracy: 8731/10000 (87%)
Loss Reduce 0.001675336343050003 to 0.0015879747480154038
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 25: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0016, accuracy: 8745/10000 (87%)
Loss Reduce 0.0015879747480154038 to 0.0015599892258644104
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 26: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0015, accuracy: 8749/10000 (87%)
Loss Reduce 0.0015599892258644104 to 0.001519762910157442
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 27: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0015, accuracy: 8782/10000 (88%)
Loss Reduce 0.001519762910157442 to 0.0015107614278793335
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 28: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0015, accuracy: 8813/10000 (88%)
Loss Reduce 0.0015107614278793335 to 0.0014844861656427384
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 29: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0014, accuracy: 8838/10000 (88%)
Loss Reduce 0.0014844861656427384 to 0.0014395290106534958
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 30: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0014, accuracy: 8854/10000 (89%)
Loss Reduce 0.0014395290106534958 to 0.001409248575568199
Save checkpoint to lstm_checkpoint_best.pt
--------------------------
Train epoch 31: 60000/60000, [-------------------------------------------------->] 100%
Test: average loss: 0.0014, accuracy: 8877/10000 (89%)
Loss Reduce 0.001409248575568199 to 0.001386603018641472
Save checkpoint to lstm_checkpoint_best.pt
--------------------------

-------------End of this article. Thank you for reading-------------