Answer edit history
Revision 6: "fix answer" by test (CHANGED)

@@ -3,6 +3,6 @@
  > * Choose the number of input and output channels to be divisible by 8 (for FP16) or 4 (for TF32) to run efficiently on Tensor Cores. For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see [Channels In And Out](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels).
  > * Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead; see [Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#params-perf).

- As stated above, the hardware and its API, CUDA, are optimized for this. In the following [figure](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) shown on that reference page, where the horizontal axis `C` is the number of channels (what the question calls the number of filters), execution time is reduced at `C = 4,8,16`.
+ As stated above, the hardware and its API, cuDNN, are optimized for this. In the following [figure](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) shown on that reference page, where the horizontal axis `C` is the number of channels (what the question calls the number of filters), execution time is reduced at `C = 4,8,16`.

  It may also be that multiplication and division by powers of two can be done with just shift operations.
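The guidance in the latest revision above amounts to a simple sizing rule. Below is a minimal sketch of applying it, assuming PyTorch; `round_up_channels` is a hypothetical helper introduced here for illustration, not something from the answer or the NVIDIA guide.

```python
import torch
import torch.nn as nn

def round_up_channels(c: int, multiple: int = 8) -> int:
    """Round a channel count up to the next multiple of `multiple`
    (8 for FP16 Tensor Cores, 4 for TF32)."""
    return ((c + multiple - 1) // multiple) * multiple

# A layer that "wants" 60 output channels gets padded to 64, which is also
# a multiple of 64 and therefore tiles well (see the Quantization Effects link).
out_ch = round_up_channels(60)                      # -> 64
conv = nn.Conv2d(in_channels=8, out_channels=out_ch, kernel_size=3, padding=1)

x = torch.randn(32, 8, 224, 224)                    # batch of 32, 8 input channels
y = conv(x)
print(out_ch, tuple(y.shape))                       # 64 (32, 64, 224, 224)
```

Whether such a layer actually lands on Tensor Core kernels still depends on the dtype, the memory layout, and the cuDNN version, so treat this as a rule of thumb rather than a guarantee.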
Revision 5: "fix context" by test (CHANGED)

@@ -3,6 +3,6 @@
  > * Choose the number of input and output channels to be divisible by 8 (for FP16) or 4 (for TF32) to run efficiently on Tensor Cores. For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see [Channels In And Out](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels).
  > * Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead; see [Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#params-perf).

- As stated above, the hardware and its API, CUDA, are optimized for this. In the following [figure](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) shown on that reference page, where the horizontal axis `C` is the number of filters, execution time is reduced at `C = 4,8,16`.
+ As stated above, the hardware and its API, CUDA, are optimized for this. In the following [figure](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) shown on that reference page, where the horizontal axis `C` is the number of channels (what the question calls the number of filters), execution time is reduced at `C = 4,8,16`.

  It may also be that multiplication and division by powers of two can be done with just shift operations.
Revision 4: "fix answer" by test (CHANGED)

@@ -3,8 +3,6 @@
  > * Choose the number of input and output channels to be divisible by 8 (for FP16) or 4 (for TF32) to run efficiently on Tensor Cores. For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see [Channels In And Out](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels).
  > * Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead; see [Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#params-perf).

- > The cuDNN library provides some convolution implementations using FFT and Winograd transforms.
-
- As stated above, the hardware and its API, CUDA, are optimized for this (FFT is optimal at powers of two). Note that in the following [figure](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) shown on that reference page, the horizontal axis `C` is the number of filters.
+ As stated above, the hardware and its API, CUDA, are optimized for this. In the following [figure](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) shown on that reference page, where the horizontal axis `C` is the number of filters, execution time is reduced at `C = 4,8,16`.

  It may also be that multiplication and division by powers of two can be done with just shift operations.
Revision 3: "fix answer" by test (CHANGED)

@@ -2,6 +2,9 @@
  [NVIDIA - Convolutional Layers User's Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#:~:text=Choose,Effects)
  > * Choose the number of input and output channels to be divisible by 8 (for FP16) or 4 (for TF32) to run efficiently on Tensor Cores. For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see [Channels In And Out](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels).
  > * Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead; see [Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#params-perf).
- As stated above, the hardware and its API, CUDA, are optimized for this. Note that in the following [figure](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) shown on that reference page, the horizontal axis `C` is the number of filters.
+
+ > The cuDNN library provides some convolution implementations using FFT and Winograd transforms.
+
+ As stated above, the hardware and its API, CUDA, are optimized for this (FFT is optimal at powers of two). Note that in the following [figure](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) shown on that reference page, the horizontal axis `C` is the number of filters.

  It may also be that multiplication and division by powers of two can be done with just shift operations.
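Revision 3 above quotes cuDNN's FFT and Winograd based convolution implementations and adds the remark that FFT works best at power-of-two sizes. The sketch below illustrates that idea in plain NumPy (it is not how cuDNN is implemented; `next_pow2` is a hypothetical helper): a linear convolution computed through an FFT whose length is padded up to a power of two matches the direct convolution.

```python
import numpy as np

def next_pow2(n: int) -> int:
    # Smallest power of two >= n; power-of-two lengths are the cheapest FFT case.
    return 1 << (n - 1).bit_length()

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)   # input signal
k = rng.standard_normal(27)     # filter

# Linear convolution via FFT: zero-pad both operands to a power-of-two length
# that is at least len(x) + len(k) - 1, multiply the spectra, transform back.
n_out = len(x) + len(k) - 1
n_fft = next_pow2(n_out)
y_fft = np.fft.irfft(np.fft.rfft(x, n_fft) * np.fft.rfft(k, n_fft), n_fft)[:n_out]

# Agrees with the direct time-domain convolution up to floating-point error.
assert np.allclose(y_fft, np.convolve(x, k))
```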
Revision 2: "append answer" by test (CHANGED)

@@ -2,6 +2,6 @@
  [NVIDIA - Convolutional Layers User's Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#:~:text=Choose,Effects)
  > * Choose the number of input and output channels to be divisible by 8 (for FP16) or 4 (for TF32) to run efficiently on Tensor Cores. For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see [Channels In And Out](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels).
  > * Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead; see [Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#params-perf).
- As stated above, the hardware and its API, CUDA, are optimized for this.
-
+ As stated above, the hardware and its API, CUDA, are optimized for this. Note that in the following [figure](https://docscontent.nvidia.com/dita/00000186-1a08-d34f-a596-3f291b140000/deeplearning/performance/dl-performance-convolutional/graphics/specialized-kernels.svg) shown on that reference page, the horizontal axis `C` is the number of filters.
+
  It may also be that multiplication and division by powers of two can be done with just shift operations.
Revision 1: "fix link" by test (CHANGED)

@@ -1,5 +1,5 @@
  I think this is largely due to hardware reasons.
- [NVIDIA - Convolutional Layers User's Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html)
+ [NVIDIA - Convolutional Layers User's Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#:~:text=Choose,Effects)
  > * Choose the number of input and output channels to be divisible by 8 (for FP16) or 4 (for TF32) to run efficiently on Tensor Cores. For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see [Channels In And Out](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels).
  > * Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead; see [Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#params-perf).
  As stated above, the hardware and its API, CUDA, are optimized for this.