PyTorch Basics (12): Tips

2024-01-08 16:38:56

Resuming Training from a Checkpoint

Saving a checkpoint

checkpoint_interval = 5
MAX_EPOCH = 10
# ... build net, optimizer, dataloader, etc. ...
start_epoch = -1
for epoch in range(start_epoch + 1, MAX_EPOCH):
    # ... train for one epoch ...
    if (epoch + 1) % checkpoint_interval == 0:
        checkpoint = {"model_state_dict": net.state_dict(),
                      "optimizer_state_dict": optimizer.state_dict(),
                      "epoch": epoch}
        path_checkpoint = "./checkpoint_{}_epoch.pkl".format(epoch)
        torch.save(checkpoint, path_checkpoint)

Loading a checkpoint

path_checkpoint = "./checkpoint_4_epoch.pkl"
checkpoint = torch.load(path_checkpoint)

net.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

start_epoch = checkpoint['epoch']
scheduler.last_epoch = start_epoch  # keep the LR scheduler in step with the resumed epoch
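With start_epoch restored, training resumes from the epoch after the checkpoint, using the same loop as in the saving snippet above:

for epoch in range(start_epoch + 1, MAX_EPOCH):
    # ... training continues from epoch 5 onward in this example ...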

Finetune

Steps

1. Load the model
2. Load the model's pretrained parameters
3. Replace the output layer

# 1. Load the model
model = ...  # build the same architecture the pretrained weights were trained with
# 2. Load the pretrained parameters
state_dict_load = torch.load('path/to/pretrained_state_dict')
model.load_state_dict(state_dict_load)
# 3. Replace the output layer
outlayer_input_d = model.outlayer.in_features
model.outlayer = nn.Linear(outlayer_input_d, classes)
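As a concrete instance of these three steps, a minimal sketch assuming torchvision's resnet18, whose output layer is named fc (the weights file path here is hypothetical):

import torch
import torch.nn as nn
from torchvision import models

# 1. Load the model
model = models.resnet18()
# 2. Load the pretrained parameters (hypothetical local file)
state_dict_load = torch.load('resnet18_pretrained.pth')
model.load_state_dict(state_dict_load)
# 3. Replace the output layer, e.g. for a 2-class task
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 2)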

Training strategies

1. Freeze the pretrained layers
Freeze the pretrained layers first, then replace the output layer (see the sketch after the snippet below).

for param in model.parameters():
    param.requires_grad = False   # freeze everything loaded so far
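The output layer replaced afterwards is created with requires_grad=True, so it is the only part that trains. The optimizer can then be built over just the trainable parameters, a minimal sketch:

model.outlayer = nn.Linear(outlayer_input_d, classes)   # replaced after freezing, stays trainable
optimizer = optim.SGD(filter(lambda p: p.requires_grad, model.parameters()),
                      lr=LR, momentum=0.9)              # only the new layer is updated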

2. Use the optimizer's parameter groups to give the pretrained layers a smaller learning rate

outlayer_params_id = list(map(id, model.outlayer.parameters()))   # id() returns each parameter's memory address
base_params = filter(lambda p: id(p) not in outlayer_params_id, model.parameters())
optimizer = optim.SGD([
        {'params': base_params, 'lr': LR * 0.1},                  # pretrained layers: smaller LR
        {'params': model.outlayer.parameters(), 'lr': LR}],       # new output layer: full LR
        momentum=0.9)

GPU Parallelism

Methods

1. torch.cuda.device_count(): number of currently visible, usable GPUs
2. torch.cuda.get_device_name(): get a GPU's name
3. torch.cuda.manual_seed(): set the random seed for the current GPU
4. torch.cuda.manual_seed_all(): set the random seed for all visible GPUs
5. os.environ.setdefault("CUDA_VISIBLE_DEVICES", "1,3"): restrict which GPUs are visible (see the sketch after this list)
Suppose there are 4 physical GPUs (0, 1, 2, 3).
Setting CUDA_VISIBLE_DEVICES to "1,3" exposes only physical GPUs 1 and 3;
inside Python they then appear as logical GPU 0 and GPU 1.
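A minimal sketch tying these together; the environment variable must be set before the first CUDA call:

import os
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "1,3")  # set before any CUDA initialization

import torch
torch.cuda.manual_seed_all(42)            # seed every visible GPU
print(torch.cuda.device_count())          # -> 2 (logical GPU 0 and 1)
print(torch.cuda.get_device_name(0))      # name of physical GPU 1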

torch.nn.DataParallel

torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)
module: the model to wrap and replicate
device_ids: GPUs to scatter across; defaults to all visible GPUs
output_device: device on which the outputs are gathered (defaults to device_ids[0])

With batch_size=16 and 4 GPUs, each GPU receives a sub-batch of 4.
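A minimal usage sketch (assuming at least one visible GPU; a toy Linear stands in for any model):

import torch
import torch.nn as nn

net = nn.Linear(10, 2)
if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net)        # replicate across all visible GPUs
net = net.to('cuda')

x = torch.randn(16, 10).to('cuda')    # the batch of 16 is split across the GPUs
out = net(x)                          # sub-results are gathered on output_device (GPU 0 by default)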

Free GPU Memory

def get_gpu_memory():
    import os
    os.system('nvidia-smi -q -d Memory | grep -A4 GPU | grep Free > tmp.txt')
    memory_gpu = [int(x.split()[2]) for x in open('tmp.txt', 'r').readlines()]  # "Free : 1234 MiB" -> 1234
    os.system('rm tmp.txt')
    return memory_gpu

nvidia-smi: NVIDIA's GPU status tool
-q: query mode
-d: limit the query to the given section (here, Memory)
grep GPU: match the line that opens each GPU block
-A4: also print the 4 lines after each match
grep Free: keep only the "Free" memory lines
> tmp.txt: redirect the output into a file
rm tmp.txt: delete the temporary file

import os
import numpy as np

gpu_memory = get_gpu_memory()
gpu_list = np.argsort(gpu_memory)[::-1]                      # GPU indices, most free memory first
gpu_list_str = ','.join(map(str, gpu_list))
os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)  # must run before CUDA is initialized
print("\ngpu free memory: {}".format(gpu_memory))
print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"]))

Common Errors

1. Loading a CUDA-trained model on a machine where CUDA is unavailable (CPU only)

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

Fix: torch.load(path_state_dict, map_location="cpu")
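Concretely, on a CPU-only machine the load looks like:

state_dict = torch.load(path_state_dict, map_location="cpu")
net.load_state_dict(state_dict)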

2. Loading a model trained in parallel on CUDA: the state_dict keys do not match

RuntimeError: Error(s) in loading state_dict for FooNet: Missing key(s) in state_dict: "linears.0.weight", "linears.1.weight", "linears.2.weight". Unexpected key(s) in state_dict: "module.linears.0.weight", "module.linears.1.weight", "module.linears.2.weight".

One option is to rename the model's keys, but that is cumbersome.
The easier fix is to rename the keys of the state_dict itself:

from collections import OrderedDict

new_state_dict = OrderedDict()
for k, v in state_dict_load.items():
    namekey = k[7:] if k.startswith('module.') else k   # strip the "module." prefix added by DataParallel
    new_state_dict[namekey] = v
net.load_state_dict(new_state_dict)

Detail: tensor.to(device) is not an in-place operation; it leaves the original tensor unchanged, so the result must be assigned to a new variable. Module.to(device), by contrast, moves the module in place.
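A minimal sketch of the difference:

import torch
import torch.nn as nn

x = torch.randn(3)
x_gpu = x.to('cuda')   # returns a new tensor; x itself stays on the CPU
net = nn.Linear(3, 1)
net.to('cuda')         # moves the module's parameters in place; no reassignment needed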

Source: https://blog.csdn.net/weixin_62891098/article/details/135417780