Answer edit history

6

fix answer

2023/03/08 16:15

Posted

ps_aux_grep

Score 1581

test CHANGED
@@ -3,6 +3,6 @@
  > * Choose the number of input and output channels to be divisible by 8 (for FP16) or 4 (for TF32) to run efficiently on Tensor Cores. For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see [Channels In And Out](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels).
  > * Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead; see [Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#params-perf).
 
- As stated above, the hardware and its API, CUDA, are optimized for these sizes. In the following [figure](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) from the referenced page, the horizontal axis `C` is the number of channels (what the question calls the number of filters), and execution time is shorter when `C = 4, 8, 16`.
+ As stated above, the hardware and its API, cuDNN, are optimized for these sizes. In the following [figure](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) from the referenced page, the horizontal axis `C` is the number of channels (what the question calls the number of filters), and execution time is shorter when `C = 4, 8, 16`.
  ![](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg)
  Multiplication and division by a power of 2 may also reduce to shift operations in some cases.
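To illustrate the remark about shifts, here is a minimal sketch in plain Python showing that multiplying or dividing an integer by a power of 2 is the same as shifting its bits. The helper names (`is_power_of_two`, `mul_pow2`, `div_pow2`) are made up for this illustration, and whether this actually matters inside convolution kernels is only a guess, as the answer itself says.

```python
def is_power_of_two(n: int) -> bool:
    """True if n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def mul_pow2(x: int, k: int) -> int:
    """Multiply x by 2**k using a left shift."""
    return x << k


def div_pow2(x: int, k: int) -> int:
    """Floor-divide x by 2**k using a right shift."""
    return x >> k


if __name__ == "__main__":
    for n in (12, 16, 64):
        print(n, "is a power of two:", is_power_of_two(n))
    # Same results as 5 * 8 and 40 // 8, computed with shifts.
    print(mul_pow2(5, 3), div_pow2(40, 3))
```

For sizes that are not powers of 2, such as 12, this shortcut does not apply and a general multiply or divide is needed.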

5

fix context

2023/03/08 15:57

Posted

ps_aux_grep

Score 1581

test CHANGED
@@ -3,6 +3,6 @@
  > * Choose the number of input and output channels to be divisible by 8 (for FP16) or 4 (for TF32) to run efficiently on Tensor Cores. For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see [Channels In And Out](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels).
  > * Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead; see [Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#params-perf).
 
- As stated above, the hardware and its API, CUDA, are optimized for these sizes. In the following [figure](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) from the referenced page, the horizontal axis `C` is the number of filters, and execution time is shorter when `C = 4, 8, 16`.
+ As stated above, the hardware and its API, CUDA, are optimized for these sizes. In the following [figure](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) from the referenced page, the horizontal axis `C` is the number of channels (what the question calls the number of filters), and execution time is shorter when `C = 4, 8, 16`.
  ![](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg)
  Multiplication and division by a power of 2 may also reduce to shift operations in some cases.

4

fix answer

2023/03/08 15:55

Posted

ps_aux_grep

Score 1581

test CHANGED
@@ -3,8 +3,6 @@
  > * Choose the number of input and output channels to be divisible by 8 (for FP16) or 4 (for TF32) to run efficiently on Tensor Cores. For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see [Channels In And Out](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels).
  > * Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead; see [Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#params-perf).
 
- > The cuDNN library provides some convolution implementations using FFT and Winograd transforms.
-
- As stated above, the hardware and its API, CUDA, are optimized for these sizes (FFT is most efficient for power-of-2 sizes). In the following [figure](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) from the referenced page, note that the horizontal axis `C` is the number of filters.
+ As stated above, the hardware and its API, CUDA, are optimized for these sizes. In the following [figure](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) from the referenced page, the horizontal axis `C` is the number of filters, and execution time is shorter when `C = 4, 8, 16`.
  ![](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg)
  Multiplication and division by a power of 2 may also reduce to shift operations in some cases.

3

fix answer

2023/03/08 15:48

Posted

ps_aux_grep

Score 1581

test CHANGED
@@ -2,6 +2,9 @@
  [NVIDIA - Convolutional Layers User's Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#:~:text=Choose,Effects)
  > * Choose the number of input and output channels to be divisible by 8 (for FP16) or 4 (for TF32) to run efficiently on Tensor Cores. For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see [Channels In And Out](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels).
  > * Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead; see [Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#params-perf).
- As stated above, the hardware and its API, CUDA, are optimized for these sizes. In the following [figure](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) from the referenced page, note that the horizontal axis `C` is the number of filters.
+
+ > The cuDNN library provides some convolution implementations using FFT and Winograd transforms.
+
+ As stated above, the hardware and its API, CUDA, are optimized for these sizes (FFT is most efficient for power-of-2 sizes). In the following [figure](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) from the referenced page, note that the horizontal axis `C` is the number of filters.
  ![](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg)
  Multiplication and division by a power of 2 may also reduce to shift operations in some cases.

2

append answer

2023/03/08 15:44

Posted

ps_aux_grep

Score 1581

test CHANGED
@@ -2,6 +2,6 @@
  [NVIDIA - Convolutional Layers User's Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#:~:text=Choose,Effects)
  > * Choose the number of input and output channels to be divisible by 8 (for FP16) or 4 (for TF32) to run efficiently on Tensor Cores. For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see [Channels In And Out](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels).
  > * Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead; see [Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#params-perf).
- As stated above, the hardware and its API, CUDA, are optimized for these sizes.
-
+ As stated above, the hardware and its API, CUDA, are optimized for these sizes. In the following [figure](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) from the referenced page, note that the horizontal axis `C` is the number of filters.
+ ![](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg)
  Multiplication and division by a power of 2 may also reduce to shift operations in some cases.

1

fix link

2023/03/08 15:37

Posted

ps_aux_grep

Score 1581

test CHANGED
@@ -1,5 +1,5 @@
  I think the main reason is the hardware.
- [NVIDIA - Convolutional Layers User's Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html)
+ [NVIDIA - Convolutional Layers User's Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#:~:text=Choose,Effects)
  > * Choose the number of input and output channels to be divisible by 8 (for FP16) or 4 (for TF32) to run efficiently on Tensor Cores. For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see [Channels In And Out](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels).
  > * Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead; see [Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#params-perf).
  As stated above, the hardware and its API, CUDA, are optimized for these sizes.
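
To make the divisibility rule in the quote concrete, here is a minimal sketch that rounds a channel count up to the next multiple of 8 (FP16) or 4 (TF32). The helper name `round_up_to_multiple` and the sample channel counts are my own illustration and do not come from the NVIDIA guide.

```python
def round_up_to_multiple(channels: int, multiple: int = 8) -> int:
    """Return the smallest multiple of `multiple` that is >= channels."""
    if channels <= 0 or multiple <= 0:
        raise ValueError("channels and multiple must be positive")
    return ((channels + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    # Hypothetical channel counts, chosen only for illustration.
    for c in (3, 30, 64, 100):
        print(c, "->", round_up_to_multiple(c, 8), "(multiple of 8),",
              round_up_to_multiple(c, 4), "(multiple of 4)")
```

A count that is already aligned, such as 64, comes back unchanged, so the helper is safe to apply to every layer width.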