PyTorch is much slower than TensorFlow on a classification task (42 min vs. 11 min)

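Background and goal

I recently started using PyTorch.
Following the official tutorial, I ran a classification task on CIFAR-10, but it is abnormally slow (it took 42 min).

I put together equivalent code in TensorFlow as a test, and it finishes in about 11 min, which is far shorter.

I would appreciate any advice on why the two differ.

The conditions I tried are as follows: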

environment: Colab Pro+
dataset: CIFAR-10
classifier: VGG16
optimizer: Adam
loss: cross-entropy
batch size: 32

PyTorch

Python
import copy
import time

import torch
import torchvision
from torch import nn
from torchvision import transforms, models
from tqdm import tqdm

# Upscale the 32x32 CIFAR-10 images to the 224x224 input size used by VGG16
trans = transforms.Compose([transforms.Resize((224, 224)),
                            transforms.ToTensor()])

data = {phase: torchvision.datasets.CIFAR10('./', train=(phase == 'train'), transform=trans, download=True)
        for phase in ['train', 'test']}
# num_workers is left at its default (0), so data loading runs in the main process
dataloaders = {phase: torch.utils.data.DataLoader(data[phase], batch_size=32, shuffle=True)
               for phase in ['train', 'test']}


def train_model(model, criterion, optimizer, dataloaders, device, num_epochs=5):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'test']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in tqdm(iter(dataloaders[phase])):
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            # NOTE: len(dataloaders[phase]) is the number of batches, not the number
            # of images, so the printed loss and accuracy are scaled up by roughly
            # the batch size (an accuracy above 1 is possible).
            epoch_loss = running_loss / len(dataloaders[phase])
            epoch_acc = running_corrects.double() / len(dataloaders[phase])

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'test' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# VGG16 with randomly initialized weights (no pretraining)
model = models.vgg16(pretrained=False)
model = model.to(device)

model = train_model(model=model,
                    criterion=nn.CrossEntropyLoss(),
                    optimizer=torch.optim.Adam(model.parameters(), lr=0.001),
                    dataloaders=dataloaders,
                    device=device,
                    )

Epoch 0/4
----------
  0%|          | 0/1563 [00:00<?, ?it/s]/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
100%|██████████| 1563/1563 [07:50<00:00,  3.32it/s]
train Loss: 75.5199 Acc: 3.2809
100%|██████████| 313/313 [00:38<00:00,  8.11it/s]
test Loss: 73.7274 Acc: 3.1949

Epoch 1/4
----------
100%|██████████| 1563/1563 [07:50<00:00,  3.33it/s]
train Loss: 73.8162 Acc: 3.2514
100%|██████████| 313/313 [00:38<00:00,  8.13it/s]
test Loss: 73.6114 Acc: 3.1949

Epoch 2/4
----------
100%|██████████| 1563/1563 [07:49<00:00,  3.33it/s]
train Loss: 73.7741 Acc: 3.1369
100%|██████████| 313/313 [00:38<00:00,  8.11it/s]
test Loss: 73.5873 Acc: 3.1949

Epoch 3/4
----------
100%|██████████| 1563/1563 [07:49<00:00,  3.33it/s]
train Loss: 73.7493 Acc: 3.1331
100%|██████████| 313/313 [00:38<00:00,  8.12it/s]
test Loss: 73.6191 Acc: 3.1949

Epoch 4/4
----------
100%|██████████| 1563/1563 [07:49<00:00,  3.33it/s]
train Loss: 73.7289 Acc: 3.1939
100%|██████████| 313/313 [00:38<00:00,  8.13it/s]
test Loss: 73.5955 Acc: 3.1949

Training complete in 42m 22s
Best val Acc: 3.194888
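
A note on reading the log above: the loop divides the summed loss and the number of correct predictions by `len(dataloaders[phase])`, which is the number of batches (1563 for train, 313 for test) rather than the number of images. That is why the accuracy exceeds 1. The sketch below converts the reported value back to a per-image accuracy, assuming the standard CIFAR-10 test split of 10,000 images; the helper name `per_sample_accuracy` is made up for illustration.

```python
def per_sample_accuracy(reported_acc: float, num_batches: int, num_samples: int) -> float:
    """Convert an accuracy that was divided by the batch count into a per-image accuracy."""
    correct = reported_acc * num_batches  # recover the raw number of correct predictions
    return correct / num_samples


# "Best val Acc: 3.194888" was computed over 313 test batches
test_acc = per_sample_accuracy(3.194888, num_batches=313, num_samples=10_000)
print(f"{test_acc:.4f}")  # prints 0.1000
```

Read this way, the PyTorch model sits at about 10% test accuracy, the same chance-level figure as the `val_accuracy: 0.1000` in the TensorFlow log. Dividing by `len(dataloaders[phase].dataset)` inside the loop would report per-image values directly.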

TensorFlow

Python
import time

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras import applications

ds_test, ds_train = tfds.load('cifar10', split=['test', 'train'])


def resize(ip):
    image = ip['image']
    label = ip['label']
    image = tf.image.resize(image, (224, 224))  # upscale to the VGG16 input size
    image = tf.expand_dims(image, 0)            # add a leading batch dimension of 1
    label = tf.one_hot(label, 10)               # one-hot labels for categorical_crossentropy
    label = tf.expand_dims(label, 0)
    return (image, label)


ds_train_ = ds_train.map(resize)
ds_test_ = ds_test.map(resize)

# VGG16 with randomly initialized weights and a 10-class output layer
model = applications.vgg16.VGG16(input_shape=(224, 224, 3), weights=None, classes=10)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

batch_size = 32
since = time.time()
history = model.fit(ds_train_,
                    batch_size=batch_size,
                    steps_per_epoch=len(ds_train) // batch_size,
                    epochs=5,
                    validation_steps=len(ds_test),
                    validation_data=ds_test_,
                    shuffle=True,)
time_elapsed = time.time() - since
print('Training complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))

Epoch 1/5
1562/1562 [==============================] - 125s 69ms/step - loss: 36.9022 - accuracy: 0.1069 - val_loss: 2.3031 - val_accuracy: 0.1000
Epoch 2/5
1562/1562 [==============================] - 129s 83ms/step - loss: 2.3031 - accuracy: 0.1005 - val_loss: 2.3033 - val_accuracy: 0.1000
Epoch 3/5
1562/1562 [==============================] - 129s 83ms/step - loss: 2.3035 - accuracy: 0.1069 - val_loss: 2.3031 - val_accuracy: 0.1000
Epoch 4/5
1562/1562 [==============================] - 129s 83ms/step - loss: 2.3038 - accuracy: 0.1024 - val_loss: 2.3030 - val_accuracy: 0.1000
Epoch 5/5
1562/1562 [==============================] - 129s 83ms/step - loss: 2.3028 - accuracy: 0.1024 - val_loss: 2.3033 - val_accuracy: 0.1000
Training complete in 11m 23s
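
When comparing the two timings, it may help to count how many images each script pushes through the network per epoch. As far as can be read from the code above, the TensorFlow dataset is never batched with `.batch()`: `resize` gives every element a leading batch dimension of 1, so each of the 1562 training steps appears to consume a single image. The arithmetic below is a sketch under that assumption, not a measurement, and the function name `images_per_epoch` is hypothetical.

```python
def images_per_epoch(steps: int, images_per_step: int) -> int:
    """Number of training images seen in one epoch."""
    return steps * images_per_step


# PyTorch: DataLoader(batch_size=32) over 50,000 images gives 1563 batches
# (the last batch is smaller, so cap the total at the dataset size).
pytorch_images = min(images_per_epoch(1563, 32), 50_000)

# TensorFlow: steps_per_epoch = 50_000 // 32 = 1562, and each dataset element
# holds one image (assumption: fit() does not re-batch a tf.data.Dataset).
tensorflow_images = images_per_epoch(50_000 // 32, 1)

print(pytorch_images, tensorflow_images)          # prints 50000 1562
print(round(pytorch_images / tensorflow_images))  # prints 32
```

If this reading is right, the PyTorch run trains on roughly 32 times more images per epoch, so 42 min vs. 11 min is not a like-for-like comparison. Batching the TensorFlow pipeline (for example, dropping `tf.expand_dims` and calling `.batch(32)` on the mapped dataset) would be one way to align the two runs before drawing conclusions about framework speed.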


lazykyama

2021/09/08 15:35

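At a quick glance, the following points caught my attention. The first one in particular cannot be determined from the code alone, so please add the details to the question.
* Colab assigns a different GPU type depending on when you run it. Run the `nvidia-smi` command and check which GPU is in use when PyTorch is slow (V100, P100, or other).
* PyTorch's `DataLoader` defaults to `num_workers=0`. This means data loading runs in the main process, so any time spent loading adds directly to the total. Depending on the number of CPU cores, please try a value greater than 1 and check whether the timing changes.
  - https://pytorch.org/docs/stable/data.html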
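
The two checks above could be scripted roughly as follows. This is a minimal sketch, not a tuned configuration: the helper names `gpu_name` and `loader_kwargs` and the cap of 4 workers are assumptions for illustration, and `persistent_workers` assumes a PyTorch version that supports it (1.7 or later).

```python
import os
import shutil
import subprocess
from typing import Optional


def gpu_name() -> Optional[str]:
    """Return the GPU name reported by nvidia-smi, or None if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    name = result.stdout.strip()
    return name if result.returncode == 0 and name else None


def loader_kwargs(max_workers: int = 4) -> dict:
    """Keyword arguments for torch.utils.data.DataLoader with background workers."""
    workers = min(max_workers, os.cpu_count() or 1)
    return {
        "num_workers": workers,             # > 0 moves loading out of the main process
        "pin_memory": True,                 # can speed up host-to-GPU copies
        "persistent_workers": workers > 0,  # keep workers alive between epochs
    }


print("GPU:", gpu_name())
print("DataLoader kwargs:", loader_kwargs())
# Usage with the question's code (not executed here):
# DataLoader(data[phase], batch_size=32, shuffle=True, **loader_kwargs())
```

Timing one epoch with `num_workers=0` and again with these settings, on the same GPU type, would show whether data loading is the bottleneck.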