I don't understand what to hide with the masks in PyTorch's Transformer module

Background / what I want to achieve

I'm building a Transformer model with PyTorch's Transformer module.
However, I'm stuck because I can't work out what the ***_mask arguments of forward are supposed to hide.

Transformer — PyTorch master documentation
According to the reference above, src_mask has size (S,S) and tgt_mask has size (T,T), and I believe it says to set True or 1 at the positions you want to hide.

What I've considered

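For example, src_mask has size (S,S), i.e. (sequence length × sequence length), with no batch dimension. So I concluded that it is not meant to mark the <pad> positions of each individual sentence in the batch.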

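That would make it a mask applied identically to every sentence in the batch. For tgt I can imagine it is the mask that hides future positions, but for src I started wondering whether one is needed at all, and that is where I got lost.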

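On top of that, when I looked at where this argument is actually used, I found it ends up in the MultiHeadAttention module, but the mask there had size (N,S), i.e. (batch size × sequence length), which confused me even more.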

I have only just managed to understand how the Transformer works, so I may be misunderstanding something, but I would really appreciate an answer.

1 answer

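I was struggling with the same problem, and since searching turned up nothing, I read the source code and worked out an answer. My reply may come a little late, but I think I now understand how it works, so here it is: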

First, the documentation attached to the function here says:

- attn_mask: 2D mask :math:`(L, S)` where L is the target sequence length, S is the source sequence length.
  3D mask :math:`(N*num_heads, L, S)` where N is the batch size, L is the target sequence length,
  S is the source sequence length. attn_mask ensures that position i is allowed to attend the unmasked
  positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend
  while the zero positions will be unchanged. If a BoolTensor is provided, positions with ``True``
  are not allowed to attend while ``False`` values will be unchanged. If a FloatTensor
  is provided, it will be added to the attention weight.
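
To make the docstring above concrete, here is a minimal sketch of how I understand the two kinds of mask behave in `nn.MultiheadAttention`. It assumes a reasonably recent PyTorch that accepts boolean masks; the dimension values and tensor names (`bool_mask`, `float_mask`, `pad_mask`) are made up for illustration and do not come from the question. It compares a boolean `attn_mask` of shape (L, S) with the float version built from it, and also passes a `key_padding_mask` of shape (N, S), which as far as I can tell is the per-sentence (batch size × sequence length) mask.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes: batch, target length, source length, embed dim, heads.
N, L, S, E, H = 2, 4, 4, 8, 2

mha = nn.MultiheadAttention(embed_dim=E, num_heads=H, dropout=0.0)
mha.eval()

# The default layout is (sequence, batch, embedding).
query = torch.randn(L, N, E)
key = value = torch.randn(S, N, E)

# 2D boolean attn_mask of shape (L, S): True means "may NOT attend".
# The strict upper triangle hides future positions (a causal mask).
bool_mask = torch.triu(torch.ones(L, S, dtype=torch.bool), diagonal=1)

# Float mask with -inf at the same positions: it is added to the
# attention scores before the softmax, so those weights become 0.
float_mask = torch.zeros(L, S).masked_fill(bool_mask, float("-inf"))

with torch.no_grad():
    out_bool, w_bool = mha(query, key, value, attn_mask=bool_mask)
    out_float, w_float = mha(query, key, value, attn_mask=float_mask)

# key_padding_mask has shape (N, S): one row per sentence in the batch,
# with True marking the <pad> keys that nobody should attend to.
pad_mask = torch.tensor([[False, False, False, True],
                         [False, False, True, True]])

with torch.no_grad():
    _, w_pad = mha(query, key, value, key_padding_mask=pad_mask)

print(w_bool.shape)                         # attention weights, averaged over heads
print(torch.allclose(out_bool, out_float))  # bool and -inf float masks agree
print(w_pad[1])                             # last two columns are zero for sentence 1
```

So the (L, S) mask is shared by every sentence in the batch, while the (N, S) mask differs per sentence, which is why padding is handled by the latter.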