How to get the sample index from a PyTorch dataset

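What I want to do

I am training on CIFAR10 with PyTorch and want to get the index of each image, that is, its position within the dataset.
I tried subclassing the CIFAR10 class, but I get an error saying the object has no attribute called data.
If anyone knows how to do this, I would appreciate your advice.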

Code I tried

class MyCIFAR10(torchvision.datasets.CIFAR10):
    def __init__(self, root, train=True, transform=None, target_transform=None, download=False):
        super(MyCIFAR10, self).__init__(root, train=train, transform=transform, target_transform=target_transform, download=download)

    def __getitem__(self, index):
        img, target = self.data[index], self.targets[index]
        img = Image.fromarray(img)

        if self.transform is not None:
            img = self.transform(img)

        if self.target_transform is not None:
            target = self.target_transform(target)

        return img, target, index  # <- I want to pass this index back to the loader!

## Result
Using the class above produces the following error.

Traceback (most recent call last):
  File "main.py", line 186, in <module>
    train_loss, train_acc = train(epoch)
  File "main.py", line 132, in train
    for batch_idx, (inputs, targets, index) in enumerate(trainloader):
  File "/home/xxx/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 560, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/home/xxx/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 560, in <listcomp>
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "main.py", line 46, in __getitem__
    img, target = self.data[index], self.targets[index]
AttributeError: 'MyCIFAR10' object has no attribute 'data'

0kcal

2020/02/01 06:54 (edited)

Hi. I don't know the answer, but to begin with, I don't think `data` exists in the setup above.

```
C:\_temp_work>python -m pydoc torchvision.datasets.CIFAR10
Help on class CIFAR10 in torchvision.datasets:

torchvision.datasets.CIFAR10 = class CIFAR10(torchvision.datasets.vision.VisionDataset)
| torchvision.datasets.CIFAR10(root, train=True, transform=None, target_transform=None, download=False)
|
| `CIFAR10 <https://www.cs.toronto.edu/~kriz/cifar.html>`_ Dataset.
|
| Args:
| root (string): Root directory of dataset where directory
| ``cifar-10-batches-py`` exists or will be saved to if download is set to True.
| train (bool, optional): If True, creates dataset from training set, otherwise
| creates from test set.
| transform (callable, optional): A function/transform that takes in an PIL image
| and returns a transformed version. E.g, ``transforms.RandomCrop``
| target_transform (callable, optional): A function/transform that takes in the
| target and transforms it.
| download (bool, optional): If true, downloads the dataset from the internet and
| puts it in root directory. If dataset is already downloaded, it is not
| downloaded again.

C:\_temp_work>
```

s-uchi

2020/02/01 07:19

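Thank you for the reply. Looking at https://pytorch.org/docs/stable/_modules/torchvision/datasets/cifar.html#CIFAR10, I was hoping to reuse the self.data defined in the CIFAR10 class. My understanding is that calling super() makes the parent's variables visible as well, but I don't really understand inheritance, lol.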
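On the inheritance point: a subclass instance does see every attribute that the parent's `__init__` assigns, as long as `super().__init__()` runs. So an `AttributeError` like the one above suggests the installed parent class simply does not define that name. One possible cause (an assumption, not confirmed in this thread) is that the installed torchvision is older than the source shown under `docs/stable`; as far as I recall, older releases stored the images under names such as `train_data` rather than `data`. The sketch below uses hypothetical stand-in classes (`OldParent`, `NewParent`, `make_child`), not torchvision code, to show the effect and how `vars()` reveals which attributes really exist.

```python
class OldParent:
    """Hypothetical stand-in for an older library version."""
    def __init__(self):
        self.train_data = [10, 20, 30]


class NewParent:
    """Hypothetical stand-in for a newer library version."""
    def __init__(self):
        self.data = [10, 20, 30]


def make_child(parent_cls):
    """Build an instance of a subclass that reads `self.data` in __getitem__."""
    class Child(parent_cls):
        def __init__(self):
            # Runs the parent's __init__, so its attributes exist on self.
            super().__init__()

        def __getitem__(self, index):
            return self.data[index], index

    return Child()


new_child = make_child(NewParent)
print(new_child[1])  # works because this parent defines `data`

old_child = make_child(OldParent)
print(sorted(vars(old_child)))  # the attribute names that actually exist
try:
    old_child[1]
except AttributeError as err:
    print("AttributeError:", err)
```

On the real dataset, printing `torchvision.__version__` and `vars(dataset).keys()` would be the equivalent check.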


2 answers

Best answer

For CIFAR10, for example, you can load the data as follows:

```python
import torch
import torchvision
import torchvision.transforms as transforms

# Assumption: the original post does not show `transform`; ToTensor() is a placeholder.
transform = transforms.ToTensor()

trainset = torchvision.datasets.CIFAR10(root='../data/raw', train=True,
                                        download=True, transform=transform)

testset = torchvision.datasets.CIFAR10(root='../data/raw', train=False,
                                       download=True, transform=transform)

# Create the DataLoaders
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)
```

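Next, you can get the indices within a mini-batch as shown below. With the batch size of 4 used above, the inner for loop yields the indices 0, 1, 2, 3 together with their data. Does this answer your question?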

```python
for i, data in enumerate(trainloader, 0):
    inputs, labels = data
    for index, result in enumerate(inputs):
        print(index, result)  # index within the batch and its data
```

<Addendum in response to the question>
For example, defining a class like the following gives you a dataset that also returns the index you are after.

```python
from torch.utils.data import Dataset


class Subset(Dataset):
    """
    Subset of a dataset at specified indices; also returns each sample's index.

    Arguments:
        data (sequence): Images of the whole dataset
        label (sequence): Labels of the whole dataset
        indices (sequence): Indices in the whole set selected for subset
    """
    def __init__(self, data, label, indices):
        self.data = data
        self.label = label
        self.indices = indices

    def __getitem__(self, idx):
        # Map the position in the subset to the index in the whole dataset.
        index = self.indices[idx]
        # trainset.data holds the raw arrays, so no transform is applied here.
        return self.data[index], self.label[index], index

    def __len__(self):
        return len(self.indices)


train_size = len(trainset)  # 50000 for the CIFAR10 training set
indices = list(range(0, train_size))  # [0, 1, ..., 49999]
train_dataset = Subset(trainset.data, trainset.targets, indices)

# Show the image, label and index of the sample at position 2
img, label, index = train_dataset[2]
print(img)
print(label)
print(index)
```
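A variation that does not depend on the `data` attribute at all is to wrap the existing dataset and append the index to whatever it returns, so the wrapped dataset's own transforms still run. This is only a sketch: `IndexedDataset` is a hypothetical name, and a small `TensorDataset` of fake images stands in for CIFAR10 so the example runs without a download.

```python
import torch
from torch.utils.data import DataLoader, Dataset, TensorDataset


class IndexedDataset(Dataset):
    """Hypothetical wrapper: returns (img, target, index) for a map-style dataset."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __getitem__(self, index):
        # Delegate to the wrapped dataset so its transforms are still applied.
        img, target = self.dataset[index]
        return img, target, index

    def __len__(self):
        return len(self.dataset)


# Stand-in data (assumption): 10 fake 3x32x32 images whose label equals their position.
base = TensorDataset(torch.zeros(10, 3, 32, 32), torch.arange(10))
loader = DataLoader(IndexedDataset(base), batch_size=4, shuffle=True)

seen = []
for inputs, targets, indices in loader:
    # `indices` is a tensor of dataset positions, even with shuffle=True.
    seen.extend(indices.tolist())

print(sorted(seen))
```

With the real data, `IndexedDataset(trainset)` would take the place of `IndexedDataset(base)`, and the training loop would unpack three values per batch as in the question.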

Posted 2020/02/01 07:23

Edited 2020/02/01 09:30

bamboo-nova

Total score 1408

s-uchi

2020/02/01 07:34

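My question was poorly worded (I have edited the post). What I wanted was not the index within a batch but the position of each image among the 50,000 images in the dataset. What I really want to do is track how the features of the same image change during training, with shuffle enabled in the data loader.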

bamboo-nova

2020/02/01 09:30

I have added an addendum to my answer. I think this lets you build the dataset you are after.
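For the tracking goal itself, once every batch carries the dataset index of each sample, one simple approach is to key a dictionary by that index and append the feature each time the sample is seen. The following is a minimal sketch, not a full training loop: `record_batch` and `history` are hypothetical names, and the plain lists stand in for what would be `indices.tolist()` and detached feature tensors in real code.

```python
from collections import defaultdict


def record_batch(history, indices, features):
    """Append each sample's feature to the list stored under its dataset index."""
    for idx, feat in zip(indices, features):
        history[int(idx)].append(feat)


history = defaultdict(list)

# Two simulated epochs; shuffling changes the order within a batch, not the indices.
record_batch(history, [2, 0, 1], [[0.2], [0.0], [0.1]])     # epoch 1
record_batch(history, [1, 2, 0], [[0.15], [0.25], [0.05]])  # epoch 2

print(history[2])  # features of image 2, in epoch order
```

Storing full feature vectors for all 50,000 images every epoch can use a lot of memory, so tracking only a chosen subset of indices may be more practical.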


Hi.
I don't know the answer, but
to begin with, I don't think `data` exists in the setup above.

```
C:\_temp_work>python -m pydoc torchvision.datasets.CIFAR10
Help on class CIFAR10 in torchvision.datasets:

torchvision.datasets.CIFAR10 = class CIFAR10(torchvision.datasets.vision.VisionDataset)
| torchvision.datasets.CIFAR10(root, train=True, transform=None, target_transform=None, download=False)
|
| `CIFAR10 <https://www.cs.toronto.edu/~kriz/cifar.html>`_ Dataset.
|
| Args:
| root (string): Root directory of dataset where directory
| ``cifar-10-batches-py`` exists or will be saved to if download is set to True.
| train (bool, optional): If True, creates dataset from training set, otherwise
| creates from test set.
| transform (callable, optional): A function/transform that takes in an PIL image
| and returns a transformed version. E.g, ``transforms.RandomCrop``
| target_transform (callable, optional): A function/transform that takes in the
| target and transforms it.
| download (bool, optional): If true, downloads the dataset from the internet and
| puts it in root directory. If dataset is already downloaded, it is not
| downloaded again.

C:\_temp_work>
```
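One caveat when reading `pydoc` output: it documents the class (signature, docstring, methods), so attributes that `__init__` assigns on the instance generally do not appear there. The absence of `data` from the listing therefore does not by itself show whether an instance has it. A small sketch with a hypothetical `Toy` class illustrates the difference; `vars()` on an instance shows what was actually set.

```python
import pydoc


class Toy:
    """Hypothetical toy class."""

    def __init__(self):
        self.pixel_data = [1, 2, 3]  # instance attribute, assigned at construction time

    def __len__(self):
        return len(self.pixel_data)


# Render the same kind of text that `python -m pydoc` would print for the class.
doc_text = pydoc.render_doc(Toy, renderer=pydoc.plaintext)

print("__len__" in doc_text)     # methods appear in the class documentation
print("pixel_data" in doc_text)  # the instance attribute name does not
print(sorted(vars(Toy())))       # attributes that exist on an actual instance
```

Applied to the question, inspecting `vars(dataset).keys()` on a constructed CIFAR10 object would show which attribute names the installed version really provides.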