Answer edit history

fix answer

2023/03/08 16:15

Posted

ps_aux_grep

Score 1581

answer CHANGED Viewed

@@ -3,6 +3,6 @@
 > * Choose the number of input and output channels to be divisible by 8 (for FP16) or 4 (for TF32) to run efficiently on Tensor Cores. For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see [Channels In And Out](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels).
 > * Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead; see [Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#params-perf).
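-As this states, the hardware and its API, CUDA, are optimized for such sizes. In the following [image](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) from the reference page, the horizontal axis `C` is the number of channels (what the question calls the number of filters), and execution time drops at `C = 4,8,16`.
+As this states, the hardware and its API, cuDNN, are optimized for such sizes. In the following [image](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) from the reference page, the horizontal axis `C` is the number of channels (what the question calls the number of filters), and execution time drops at `C = 4,8,16`.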
 ![](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg)
 It may also help that multiplying or dividing by a power of two can be done with a shift operation.

fix context

2023/03/08 15:57

Posted

ps_aux_grep

Score 1581

answer CHANGED Viewed

@@ -3,6 +3,6 @@
 > * Choose the number of input and output channels to be divisible by 8 (for FP16) or 4 (for TF32) to run efficiently on Tensor Cores. For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see [Channels In And Out](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels).
 > * Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead; see [Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#params-perf).
-As this states, the hardware and its API, CUDA, are optimized for such sizes. In the following [image](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) from the reference page, the horizontal axis `C` is the number of filters, and execution time drops at `C = 4,8,16`.
+As this states, the hardware and its API, CUDA, are optimized for such sizes. In the following [image](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) from the reference page, the horizontal axis `C` is the number of channels (what the question calls the number of filters), and execution time drops at `C = 4,8,16`.
 ![](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg)
 It may also help that multiplying or dividing by a power of two can be done with a shift operation.

fix answer

2023/03/08 15:55

Posted

ps_aux_grep

Score 1581

answer CHANGED Viewed

@@ -3,8 +3,6 @@
 > * Choose the number of input and output channels to be divisible by 8 (for FP16) or 4 (for TF32) to run efficiently on Tensor Cores. For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see [Channels In And Out](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels).
 > * Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead; see [Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#params-perf).
-> The cuDNN library provides some convolution implementations using FFT and Winograd transforms.
-As this states, the hardware and its API, CUDA, are optimized for such sizes (FFT works best with powers of two). Note that in the following [image](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) from the reference page, the horizontal axis `C` is the number of filters.
+As this states, the hardware and its API, CUDA, are optimized for such sizes. In the following [image](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) from the reference page, the horizontal axis `C` is the number of filters, and execution time drops at `C = 4,8,16`.
 ![](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg)
 It may also help that multiplying or dividing by a power of two can be done with a shift operation.

fix answer

2023/03/08 15:48

Posted

ps_aux_grep

Score 1581

answer CHANGED Viewed

@@ -2,6 +2,9 @@
 [NVIDIA - Convolutional Layers User's Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#:~:text=Choose,Effects)
 > * Choose the number of input and output channels to be divisible by 8 (for FP16) or 4 (for TF32) to run efficiently on Tensor Cores. For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see [Channels In And Out](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels).
 > * Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead; see [Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#params-perf).
-As this states, the hardware and its API, CUDA, are optimized for such sizes. Note that in the following [image](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) from the reference page, the horizontal axis `C` is the number of filters.
+> The cuDNN library provides some convolution implementations using FFT and Winograd transforms.
+As this states, the hardware and its API, CUDA, are optimized for such sizes (FFT works best with powers of two). Note that in the following [image](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) from the reference page, the horizontal axis `C` is the number of filters.
 ![](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg)
 It may also help that multiplying or dividing by a power of two can be done with a shift operation.

append answer

2023/03/08 15:44

Posted

ps_aux_grep

Score 1581

answer CHANGED Viewed

@@ -2,6 +2,6 @@
 [NVIDIA - Convolutional Layers User's Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#:~:text=Choose,Effects)
 > * Choose the number of input and output channels to be divisible by 8 (for FP16) or 4 (for TF32) to run efficiently on Tensor Cores. For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see [Channels In And Out](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels).
 > * Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead; see [Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#params-perf).
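-As this states, the hardware and its API, CUDA, are optimized for such sizes.
+As this states, the hardware and its API, CUDA, are optimized for such sizes. Note that in the following [image](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) from the reference page, the horizontal axis `C` is the number of filters.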
+![](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg)
 It may also help that multiplying or dividing by a power of two can be done with a shift operation.

fix link

2023/03/08 15:37

Posted

ps_aux_grep

Score 1581

answer CHANGED Viewed

@@ -1,5 +1,5 @@
 I think hardware reasons play a large part.
-[NVIDIA - Convolutional Layers User's Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html)
+[NVIDIA - Convolutional Layers User's Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#:~:text=Choose,Effects)
 > * Choose the number of input and output channels to be divisible by 8 (for FP16) or 4 (for TF32) to run efficiently on Tensor Cores. For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see [Channels In And Out](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels).
 > * Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead; see [Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#params-perf).
 As this states, the hardware and its API, CUDA, are optimized for such sizes.
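
As a rough illustration of the NVIDIA guidance quoted in the revisions above (channel counts divisible by 8 for FP16 or 4 for TF32), here is a minimal Python sketch that rounds a channel count up to the next such multiple. The function names `round_up_to_multiple` and `round_up_pow2_multiple` are hypothetical helpers made up for this example, not part of cuDNN or any framework. The second variant only illustrates the remark about power-of-two arithmetic: when the multiple is a power of two, the rounding can be done with a bit mask instead of a division.

```python
def round_up_to_multiple(channels: int, multiple: int = 8) -> int:
    """Return the smallest multiple of `multiple` that is >= channels."""
    if channels <= 0 or multiple <= 0:
        raise ValueError("channels and multiple must be positive")
    # Ceiling division, then scale back up.
    return -(-channels // multiple) * multiple


def round_up_pow2_multiple(channels: int, multiple: int = 8) -> int:
    """Same result, using a bit mask; valid only when `multiple` is a power of two."""
    if channels <= 0 or multiple <= 0 or multiple & (multiple - 1):
        raise ValueError("multiple must be a positive power of two")
    # Adding (multiple - 1) and clearing the low bits rounds up.
    return (channels + multiple - 1) & ~(multiple - 1)


if __name__ == "__main__":
    # Print the padded channel count from both variants for a few inputs.
    for c in (3, 30, 64, 100):
        print(c, round_up_to_multiple(c, 8), round_up_pow2_multiple(c, 8))
```

For example, a 3-channel input padded for TF32 would use `round_up_to_multiple(3, 4)`. Whether the padding actually pays off depends on the GPU, data type, and library version, so it is worth measuring.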