The following program is only partial, but it ran fine on Google Colab; on a GPU machine at my company, however, it raises an error, and I would like to know the cause.

The program I ran:
```python
import gym
from creversi.gym_reversi.envs import ReversiVecEnv
from creversi import *

import os
import datetime
import math
import random
import numpy as np
from collections import namedtuple
from itertools import count
from tqdm import tqdm_notebook as tqdm

import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

BATCH_SIZE = 256
vecenv = ReversiVecEnv(BATCH_SIZE)

# if gpu is to be used
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

######################################################################
# Replay Memory

Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'next_actions', 'reward'))

class ReplayMemory(object):

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0

    def push(self, *args):
        """Saves a transition."""
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = Transition(*args)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

######################################################################
# DQN

k = 192
fcl_units = 256

class DQN(nn.Module):

    def __init__(self):
        super(DQN, self).__init__()
        self.conv1 = nn.Conv2d(2, k, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(k)
        self.conv2 = nn.Conv2d(k, k, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(k)
        self.conv3 = nn.Conv2d(k, k, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(k)
        self.conv4 = nn.Conv2d(k, k, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm2d(k)
        self.conv5 = nn.Conv2d(k, k, kernel_size=3, padding=1)
        self.bn5 = nn.BatchNorm2d(k)
        self.conv6 = nn.Conv2d(k, k, kernel_size=3, padding=1)
        self.bn6 = nn.BatchNorm2d(k)
        self.conv7 = nn.Conv2d(k, k, kernel_size=3, padding=1)
        self.bn7 = nn.BatchNorm2d(k)
        self.conv8 = nn.Conv2d(k, k, kernel_size=3, padding=1)
        self.bn8 = nn.BatchNorm2d(k)
        self.conv9 = nn.Conv2d(k, k, kernel_size=3, padding=1)
        self.bn9 = nn.BatchNorm2d(k)
        self.conv10 = nn.Conv2d(k, k, kernel_size=3, padding=1)
        self.bn10 = nn.BatchNorm2d(k)
        self.fcl1 = nn.Linear(k * 64, fcl_units)
        self.fcl2 = nn.Linear(fcl_units, 65)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = F.relu(self.bn4(self.conv4(x)))
        x = F.relu(self.bn5(self.conv5(x)))
        x = F.relu(self.bn6(self.conv6(x)))
        x = F.relu(self.bn7(self.conv7(x)))
        x = F.relu(self.bn8(self.conv8(x)))
        x = F.relu(self.bn9(self.conv9(x)))
        x = F.relu(self.bn10(self.conv10(x)))
        x = F.relu(self.fcl1(x.view(-1, k * 64)))
        x = self.fcl2(x)
        return x.tanh()

def get_states(envs):
    features_vec = np.zeros((BATCH_SIZE, 2, 8, 8), dtype=np.float32)
    for i, env in enumerate(envs):
        env.board.piece_planes(features_vec[i])
    return torch.from_numpy(features_vec).to(device)
```
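For context, the feature tensor that get_states builds is a (BATCH_SIZE, 2, 8, 8) float32 array, i.e. only 128 KiB for the whole batch. A standalone size check (my own sketch, not part of the program above):

```python
import numpy as np

BATCH_SIZE = 256
features_vec = np.zeros((BATCH_SIZE, 2, 8, 8), dtype=np.float32)
# 256 boards * 2 planes * 8 * 8 squares * 4 bytes = 131072 bytes (128 KiB)
print(features_vec.nbytes)  # 131072
```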
The error occurs at the final line, return torch.from_numpy(features_vec).to(device). It does not appear on Google Colab, but on the company's GPU-equipped Linux machine I get:
```
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
I searched online but could not identify the cause.
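Since the failing line is only a ~128 KiB host-to-device copy, the tensor size alone should not exhaust GPU memory. A minimal check like the following (a sketch using only the standard torch.cuda API) would show whether a CUDA context can be created at all on that machine:

```python
import torch

# List every GPU PyTorch can see, with its total memory.
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, {props.total_memory / 2**20:.0f} MiB total")

# A one-element transfer isolates the failing step from the rest of the program.
x = torch.zeros(1).to("cuda")
print("transfer to", x.device, "succeeded")
```

If even this one-element transfer raises the same out-of-memory error, the problem is presumably not the size of features_vec.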
The GPU environment on the company machine is as follows:
```
-bash-4.2$ lspci | grep -i nvidia
3b:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
3b:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
3b:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1)
3b:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
af:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
af:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
```
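Since the machine has two different cards (an RTX 2080 Ti and a GTX 1080 Ti), which physical GPU the process ends up on may also matter. A sketch of pinning the process to a single card via CUDA_VISIBLE_DEVICES (the index "0" is an assumption; nvidia-smi shows the actual numbering):

```python
import os

# Must be set before CUDA is initialized (safest: before importing torch).
# "0" is an assumed index; check nvidia-smi for the actual device numbering.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    print("using:", torch.cuda.get_device_name(0))
```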
Any guidance would be greatly appreciated. Thank you very much in advance.