Freezing layers in PyTorch so they do not take part in training

2015/04/11

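In a deep learning network, gradients are computed during backpropagation and the optimizer uses them to update the parameters, which is how the network arrives at good parameter values. Sometimes, however, we want to keep the parameters of certain layers fixed so that they are never updated. For example, when fine-tuning, we may want to keep the parameters loaded from a pretrained model fixed and update only the final classifier layer. How should this be done?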

Defining the network

import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple network with two linear layers
class net(nn.Module):
    def __init__(self, num_class=10):
        super(net, self).__init__()
        self.fc1 = nn.Linear(8, 4)
        self.fc2 = nn.Linear(4, num_class)

    def forward(self, x):
        return self.fc2(self.fc1(x))

Case 1: no layers frozen

model = net()

# Case 1: no parameters frozen
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-2)  # all parameters are passed in

# Model parameters before training
print("model.fc1.weight", model.fc1.weight)
print("model.fc2.weight", model.fc2.weight)

for epoch in range(10):
    x = torch.randn((3, 8))
    label = torch.randint(0,10,[3]).long()
    output = model(x)
    
    loss = loss_fn(output, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Model parameters after training
print("model.fc1.weight", model.fc1.weight)
print("model.fc2.weight", model.fc2.weight)

Output:

(bbn) jyzhang@admin2-X10DAi:~/test$ python -u "/home/jyzhang/test/net.py"
model.fc1.weight Parameter containing:
tensor([[ 0.3362, -0.2676, -0.3497, -0.3009, -0.1013, -0.2316, -0.0189,  0.1430],
        [-0.2486,  0.2900, -0.1818, -0.0942,  0.1445,  0.2410, -0.1407, -0.3176],
        [-0.3198,  0.2039, -0.2249,  0.2819, -0.3136, -0.2794, -0.3011, -0.2270],
        [ 0.3376, -0.0842,  0.2747, -0.0232,  0.0768,  0.3160, -0.1185,  0.2911]],
       requires_grad=True)
model.fc2.weight Parameter containing:
tensor([[ 0.4277,  0.0945,  0.1768,  0.3773],
        [-0.4595, -0.2447,  0.4701,  0.2873],
        [ 0.3281, -0.1861, -0.2202,  0.4413],
        [-0.1053, -0.1238,  0.0275, -0.0072],
        [-0.4448, -0.2787, -0.0280,  0.4629],
        [ 0.4063, -0.2091,  0.0706,  0.3216],
        [-0.2287, -0.1352, -0.0502,  0.3434],
        [-0.2946, -0.4074,  0.4926, -0.0832],
        [-0.2608,  0.0165,  0.0501, -0.1673],
        [ 0.2507,  0.3006,  0.0481,  0.2257]], requires_grad=True)
model.fc1.weight Parameter containing:
tensor([[ 0.3316, -0.2628, -0.3391, -0.2989, -0.0981, -0.2178, -0.0056,  0.1410],
        [-0.2529,  0.2991, -0.1772, -0.0992,  0.1447,  0.2480, -0.1370, -0.3186],
        [-0.3246,  0.2055, -0.2229,  0.2745, -0.3158, -0.2750, -0.2994, -0.2295],
        [ 0.3366, -0.0877,  0.2693, -0.0182,  0.0807,  0.3117, -0.1184,  0.2946]],
       requires_grad=True)
model.fc2.weight Parameter containing:
tensor([[ 0.4189,  0.0985,  0.1723,  0.3804],
        [-0.4593, -0.2356,  0.4772,  0.2784],
        [ 0.3269, -0.1874, -0.2173,  0.4407],
        [-0.1061, -0.1248,  0.0309, -0.0062],
        [-0.4322, -0.2868, -0.0319,  0.4647],
        [ 0.4048, -0.2150,  0.0692,  0.3228],
        [-0.2252, -0.1353, -0.0433,  0.3396],
        [-0.2936, -0.4118,  0.4875, -0.0782],
        [-0.2625,  0.0192,  0.0509, -0.1670],
        [ 0.2474,  0.3056,  0.0418,  0.2265]], requires_grad=True)

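Case 2: freezing fc1 with method 1

Method 1

Pass all parameters to the optimizer

optimizer = optim.SGD(model.parameters(), lr=1e-2)  # all parameters are passed in

Set requires_grad to False on the parameters of the layer to freeze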

for name, param in model.named_parameters():
    if "fc1" in name:
        param.requires_grad = False

Code:

# Case 2: freezing fc1 with method 1
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-2)  # the optimizer receives all parameters

# Model parameters before training
print("model.fc1.weight", model.fc1.weight)
print("model.fc2.weight", model.fc2.weight)

# Freeze the parameters of fc1
for name, param in model.named_parameters():
    if "fc1" in name:
        param.requires_grad = False

for epoch in range(10):
    x = torch.randn((3, 8))
    label = torch.randint(0,10,[3]).long()
    output = model(x)
 
    loss = loss_fn(output, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("model.fc1.weight", model.fc1.weight)
print("model.fc2.weight", model.fc2.weight)

Output (fc1.weight is identical before and after training, while fc2.weight has changed):

(bbn) jyzhang@admin2-X10DAi:~/test$ python -u "/home/jyzhang/test/net.py"
model.fc1.weight Parameter containing:
tensor([[ 0.3163, -0.1592, -0.2360,  0.1436,  0.1158,  0.0406, -0.0627,  0.0566],
        [-0.1688,  0.3519,  0.2464, -0.2693,  0.1284,  0.0544, -0.0188,  0.2404],
        [ 0.0738,  0.2013,  0.0868,  0.1396, -0.2885,  0.3431, -0.1109,  0.2549],
        [ 0.1222, -0.1877,  0.3511,  0.1951,  0.2147, -0.0427, -0.3374, -0.0653]],
       requires_grad=True)
model.fc2.weight Parameter containing:
tensor([[-0.1830, -0.3147, -0.1698,  0.3235],
        [-0.1347,  0.3096,  0.4895,  0.1221],
        [ 0.2735, -0.2238,  0.4713, -0.0683],
        [-0.3150, -0.1905,  0.3645,  0.3766],
        [-0.0340,  0.3212,  0.0650,  0.1380],
        [-0.2500,  0.1128, -0.3338, -0.4151],
        [ 0.0446, -0.4776, -0.3655,  0.0822],
        [-0.1871, -0.0602, -0.4855, -0.3604],
        [-0.3296,  0.0523, -0.3424,  0.2151],
        [-0.2478,  0.1424,  0.4547, -0.1969]], requires_grad=True)
model.fc1.weight Parameter containing:
tensor([[ 0.3163, -0.1592, -0.2360,  0.1436,  0.1158,  0.0406, -0.0627,  0.0566],
        [-0.1688,  0.3519,  0.2464, -0.2693,  0.1284,  0.0544, -0.0188,  0.2404],
        [ 0.0738,  0.2013,  0.0868,  0.1396, -0.2885,  0.3431, -0.1109,  0.2549],
        [ 0.1222, -0.1877,  0.3511,  0.1951,  0.2147, -0.0427, -0.3374, -0.0653]])
model.fc2.weight Parameter containing:
tensor([[-0.1821, -0.3155, -0.1637,  0.3213],
        [-0.1353,  0.3130,  0.4807,  0.1245],
        [ 0.2731, -0.2206,  0.4687, -0.0718],
        [-0.3138, -0.1925,  0.3561,  0.3809],
        [-0.0344,  0.3152,  0.0606,  0.1332],
        [-0.2501,  0.1154, -0.3267, -0.4137],
        [ 0.0400, -0.4723, -0.3586,  0.0808],
        [-0.1823, -0.0667, -0.4854, -0.3543],
        [-0.3285,  0.0547, -0.3388,  0.2166],
        [-0.2497,  0.1410,  0.4551, -0.2008]], requires_grad=True)
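
Comparing printed tensors by eye is error-prone, so the same check can be done programmatically. The following is a minimal sketch, assuming the network defined earlier (renamed `Net` here) and a fixed seed; the variable names `fc1_before`, `fc2_before`, `fc1_unchanged` and `fc2_changed` are introduced only for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim


class Net(nn.Module):
    """Same two-layer network as above (renamed Net for this sketch)."""

    def __init__(self, num_class=10):
        super().__init__()
        self.fc1 = nn.Linear(8, 4)
        self.fc2 = nn.Linear(4, num_class)

    def forward(self, x):
        return self.fc2(self.fc1(x))


torch.manual_seed(0)  # assumption: fixed seed, only to make the run repeatable
model = Net()

# Method 1: freeze fc1 by turning off gradient tracking for its parameters
for name, param in model.named_parameters():
    if "fc1" in name:
        param.requires_grad = False

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-2)  # all parameters passed in

# Snapshot the weights so the comparison does not rely on reading printouts
fc1_before = model.fc1.weight.detach().clone()
fc2_before = model.fc2.weight.detach().clone()

for step in range(10):
    x = torch.randn(3, 8)
    label = torch.randint(0, 10, (3,))
    loss = loss_fn(model(x), label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Exact element-wise comparison of the weights before and after training
fc1_unchanged = torch.equal(fc1_before, model.fc1.weight)
fc2_changed = not torch.equal(fc2_before, model.fc2.weight)
print("fc1 unchanged:", fc1_unchanged)
print("fc2 changed:", fc2_changed)
print("fc1 grad:", model.fc1.weight.grad)
```

Because fc1 has requires_grad=False, autograd never allocates a gradient for it, and the optimizer skips parameters whose gradient is None.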

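Method 2

Pass only the parameters of the unfrozen fc2 layer to the optimizer

optimizer = optim.SGD(model.fc2.parameters(), lr=1e-2)  # the optimizer receives only fc2's parameters

Note: with this method there is no need to set requires_grad to False on the parameters of the frozen layer.

Code:

# Case 3: freezing fc1 with method 2
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.fc2.parameters(), lr=1e-2)  # the optimizer receives only fc2's parameters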
print("model.fc1.weight", model.fc1.weight)
print("model.fc2.weight", model.fc2.weight)

for epoch in range(10):
    x = torch.randn((3, 8))
    label = torch.randint(0,10,[3]).long()
    output = model(x)
 
    loss = loss_fn(output, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
 
print("model.fc1.weight", model.fc1.weight)
print("model.fc2.weight", model.fc2.weight)
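
No output is shown for method 2, so here is a small sketch of what it does and does not do. It assumes two standalone layers, `fc1` and `fc2`, as stand-ins for the layers of the network above; `fc1_before`, `fc1_unchanged` and `fc1_has_grad` are illustrative names.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)  # assumption: fixed seed, only to make the run repeatable

# Stand-ins for the fc1 and fc2 layers of the network above
fc1 = nn.Linear(8, 4)
fc2 = nn.Linear(4, 10)

# Method 2: only fc2's parameters are handed to the optimizer
optimizer = optim.SGD(fc2.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

fc1_before = fc1.weight.detach().clone()

x = torch.randn(3, 8)
label = torch.randint(0, 10, (3,))
loss = loss_fn(fc2(fc1(x)), label)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# fc1 is not updated, because the optimizer never sees its parameters ...
fc1_unchanged = torch.equal(fc1_before, fc1.weight)
# ... but autograd still computed and stored a gradient for it
fc1_has_grad = fc1.weight.grad is not None
print("fc1 unchanged:", fc1_unchanged)
print("fc1 has a gradient:", fc1_has_grad)
```

Note that optimizer.zero_grad() only clears the gradients of the parameters the optimizer holds, so under method 2 the gradients of fc1 keep accumulating across iterations unless they are cleared separately, for example with model.zero_grad().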

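There are two ways to achieve this goal: set requires_grad to False on the parameters of the layers that should not be updated, or pass only the parameters that should be updated when constructing the optimizer.

The best practice is to pass the optimizer only the parameters with requires_grad=True, which uses somewhat less memory and is also more efficient.

Recommended approach

Set requires_grad to False on the parameters that should not be updated, and also keep those parameters out of the optimizer.

Step 1: set requires_grad to False on the parameters that should not be updated

# Freeze the parameters of fc1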
for name, param in model.named_parameters():
    if "fc1" in name:
        param.requires_grad = False

Step 2: do not pass the frozen parameters to the optimizer

# Use a filter so that only parameters with requires_grad=True are passed in
optimizer = optim.SGD(filter(lambda p : p.requires_grad, model.parameters()), lr=1e-2)
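
The two steps can be combined and checked in one place. This is a sketch under stated assumptions: the model is rebuilt as an `nn.Sequential` with named sub-modules `fc1` and `fc2` so that the block is self-contained, and the names `trainable` and `optimizer_params` are introduced only for illustration.

```python
from collections import OrderedDict

import torch
import torch.nn as nn
import torch.optim as optim

# Equivalent to the network above: fc2(fc1(x)), with the same layer names
model = nn.Sequential(OrderedDict([
    ("fc1", nn.Linear(8, 4)),
    ("fc2", nn.Linear(4, 10)),
]))

# Step 1: freeze fc1. startswith("fc1.") avoids matching a layer such as "fc10"
for name, param in model.named_parameters():
    if name.startswith("fc1."):
        param.requires_grad = False

# Step 2: hand only the trainable parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable, lr=1e-2)

# Collect every parameter the optimizer actually holds
optimizer_params = [p for group in optimizer.param_groups for p in group["params"]]
print("parameters held by the optimizer:", len(optimizer_params))

# One training step
loss = nn.CrossEntropyLoss()(model(torch.randn(3, 8)), torch.randint(0, 10, (3,)))
optimizer.zero_grad()
loss.backward()
optimizer.step()

print("fc1 grad:", model.fc1.weight.grad)
print("fc2 grad is set:", model.fc2.weight.grad is not None)
```

A list comprehension is used here instead of filter(...) only because the resulting list can be inspected and reused; both forms give the optimizer the same parameters.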

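Conclusion

The recommended approach saves GPU memory and speeds up training:

Saving memory: the frozen parameters are not passed to the optimizer, and with requires_grad=False no gradient tensors are allocated for them
Speeding up training: setting requires_grad to False on the frozen parameters skips the gradient computation for them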
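
A quick way to confirm that a freeze took effect is to count the trainable parameters. The sketch below assumes standalone `fc1` and `fc2` layers with the same shapes as in the network above; `layers`, `total` and `trainable` are illustrative names. It uses `nn.Module.requires_grad_(False)`, which sets requires_grad on every parameter of a module and has the same effect as the loop shown earlier.

```python
import torch.nn as nn

# Stand-ins for the two layers of the network above
fc1 = nn.Linear(8, 4)
fc2 = nn.Linear(4, 10)
layers = nn.ModuleList([fc1, fc2])

# Freeze every parameter of fc1 in one call
fc1.requires_grad_(False)

total = sum(p.numel() for p in layers.parameters())
trainable = sum(p.numel() for p in layers.parameters() if p.requires_grad)

# fc1 has 8*4 + 4 = 36 parameters, fc2 has 4*10 + 10 = 50
print("total parameters:", total)          # 86
print("trainable parameters:", trainable)  # 50
```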
