GPU recognition is slow on WSL2

Background / what I want to achieve

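I am trying to do machine learning on a GPU under WSL2.
I am not using docker; the environment is set up with anaconda.
When I run the sample code shown below, each epoch computes quickly, but it takes a long time (about 4 minutes) before training actually starts.
I would like to shorten the time until training starts.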

Problem / error messages

It takes about 4 minutes until the first epoch starts.
Output log:

2021-12-01 20:27:10.719869: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2.4.1
2021-12-01 20:27:11.888173: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-12-01 20:27:11.896558: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-12-01 20:27:12.079597: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2021-12-01 20:27:12.079652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 24.00GiB deviceMemoryBandwidth: 871.81GiB/s
2021-12-01 20:27:12.079689: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-12-01 20:27:12.080760: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-12-01 20:27:12.080809: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-12-01 20:27:12.081861: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-12-01 20:27:12.082063: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-12-01 20:27:12.083123: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-12-01 20:27:12.083701: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-12-01 20:27:12.085944: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-12-01 20:27:12.086643: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2021-12-01 20:27:12.087221: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2021-12-01 20:27:12.087256: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-12-01 20:27:12.087462: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-12-01 20:27:12.089355: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2021-12-01 20:27:12.089399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 24.00GiB deviceMemoryBandwidth: 871.81GiB/s
2021-12-01 20:27:12.089415: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-12-01 20:27:12.089459: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-12-01 20:27:12.089500: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-12-01 20:27:12.089533: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-12-01 20:27:12.089566: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-12-01 20:27:12.089599: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-12-01 20:27:12.089617: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-12-01 20:27:12.089648: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-12-01 20:27:12.090284: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2021-12-01 20:27:12.090791: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2021-12-01 20:27:12.090826: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-12-01 20:27:12.090865: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-12-01 20:30:04.588825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-12-01 20:30:04.588864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0
2021-12-01 20:30:04.588873: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N
2021-12-01 20:30:04.589799: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2021-12-01 20:30:04.589836: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1489] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2021-12-01 20:30:04.590397: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2021-12-01 20:30:04.590931: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2021-12-01 20:30:04.590986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21793 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6)
2021-12-01 20:30:04.591612: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-12-01 20:30:06.308908: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 188160000 exceeds 10% of free system memory.
2021-12-01 20:30:06.387992: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-12-01 20:30:06.388428: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3600005000 Hz
Epoch 1/5
2021-12-01 20:30:06.629448: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
1875/1875 [==============================] - 95s 2ms/step - loss: 2.3165 - accuracy: 0.1009
Epoch 2/5
1875/1875 [==============================] - 3s 2ms/step - loss: 2.3027 - accuracy: 0.1000
Epoch 3/5
1875/1875 [==============================] - 3s 2ms/step - loss: 2.3027 - accuracy: 0.1014
Epoch 4/5
1875/1875 [==============================] - 3s 2ms/step - loss: 2.3027 - accuracy: 0.1002
Epoch 5/5
1875/1875 [==============================] - 4s 2ms/step - loss: 2.3026 - accuracy: 0.1007
313/313 - 1s - loss: 2.3027 - accuracy: 0.1007

Test accuracy: 0.1006999984383583
time:  284.1962425708771
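To see where the time goes, the timestamps in a log like the one above can be compared line by line. The following is a minimal sketch, assuming the log has been saved as text and that each TensorFlow line starts with a `YYYY-MM-DD HH:MM:SS.ffffff:` prefix as it does here; the function name `largest_gap` is hypothetical.

```python
import re
from datetime import datetime

# Matches the timestamp prefix of TensorFlow log lines, e.g.
# "2021-12-01 20:27:12.090865: I tensorflow/..."
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}): ")


def largest_gap(log_text):
    """Return (seconds, line_before, line_after) for the longest pause
    between two consecutive timestamped lines, or None if the text has
    fewer than two timestamped lines."""
    stamped = []
    for line in log_text.splitlines():
        match = TIMESTAMP.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            stamped.append((when, line))

    best = None
    for (t0, before), (t1, after) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, before, after)
    return best
```

In the log above, the longest pause falls between the `libcudart.so.10.1` line at 20:27:12.090865 and the "Device interconnect" line at 20:30:04.588825, roughly 172 seconds. The first epoch then takes 95s, against 3 to 4s for the later ones.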

Relevant source code

python
import time
import tensorflow as tf
from tensorflow import keras
import numpy as np

time0 = time.time()
print(tf.__version__)

# Dummy data: random images with random integer class labels in [0, 10)
train_images = np.random.rand(60000, 28, 28)
train_labels = np.random.randint(0, 10, 60000)
test_images = np.random.rand(10000, 28, 28)
test_labels = np.random.randint(0, 10, 10000)

# Small fully connected classifier
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5)

test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)

print('\nTest accuracy:', test_acc)
print("time: ", time.time() - time0)
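The script above reports only one total time. To separate the import, GPU discovery, and the first GPU operation, each phase could be timed on its own. This is a rough sketch rather than a definitive benchmark: `time_phases` and `tensorflow_phases` are hypothetical helper names, and the only TensorFlow calls used are `tf.config.list_physical_devices`, `tf.random.uniform`, and `tf.matmul`.

```python
import time


def time_phases(phases):
    """Run each (name, callable) pair in order and return {name: seconds}."""
    durations = {}
    for name, action in phases:
        start = time.perf_counter()
        action()
        durations[name] = time.perf_counter() - start
    return durations


def tensorflow_phases():
    """Build the phases to time. TensorFlow is imported lazily inside the
    first phase so that the import itself is measured."""
    state = {}

    def import_tf():
        import tensorflow as tf
        state["tf"] = tf

    def list_gpus():
        state["gpus"] = state["tf"].config.list_physical_devices("GPU")

    def first_op():
        tf = state["tf"]
        # The first op makes TensorFlow create its devices;
        # .numpy() waits until the result is actually available.
        x = tf.random.uniform((256, 256))
        tf.matmul(x, x).numpy()

    return [
        ("import tensorflow", import_tf),
        ("list physical GPUs", list_gpus),
        ("first op", first_op),
    ]
```

Calling `time_phases(tensorflow_phases())` in a fresh Python process and printing the returned dictionary shows which phase accounts for the wait before `model.fit` starts.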

What I tried

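Using nvidia-smi and similar tools, I have confirmed that memory is being allocated on the GPU.

I also checked how the behavior changes in other environments:
・Without using the GPU
・In the docker image environment provided by NVIDIA
・On a separate PC running native Linux
In all of these, training started right away.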

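Compared with the native Linux environment, the output log in the WSL2 environment additionally contains error messages saying there is no NUMA support. However, the same message was also printed in the WSL2 + docker environment (tagged I rather than E), so I suspect that the lack of NUMA support is not the direct cause of the delay.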
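The NUMA message in the log refers to a sysfs file that could not be opened, and the same path can be checked directly from Python. This is a minimal sketch, assuming the path layout printed in the log; `read_numa_node` and its `sysfs_root` parameter are hypothetical names introduced for illustration.

```python
from pathlib import Path


def read_numa_node(pci_bus_id, sysfs_root="/sys"):
    """Return the NUMA node recorded for a PCI device, or None when the
    numa_node file is missing or unreadable."""
    path = Path(sysfs_root) / "bus" / "pci" / "devices" / pci_bus_id / "numa_node"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None
```

Running `read_numa_node("0000:01:00.0")` in each environment (WSL2, WSL2 + docker, native Linux) would show whether the file is absent in both WSL2 cases, which is what the log messages suggest.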

Additional information

OS: Windows 11 Pro, version 21H2
CPU: Intel Core i9-9900K @ 3.60GHz
GPU: NVIDIA GeForce RTX 3090
GPU driver: NVIDIA GeForce Game Ready 510.06
CUDA: 11.6
Ubuntu 20.04.3 LTS
anaconda 3.5.1
python 3.7.11
tensorflow 2.4.1
numpy 1.20.3
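The versions listed here can be compared with the libraries TensorFlow actually opened according to the log. Below is a small sketch for pulling those names out of the log text; `loaded_libraries` is a hypothetical helper, and the pattern assumes the "Successfully opened dynamic library" wording seen in the log.

```python
import re

# Matches e.g. "Successfully opened dynamic library libcudart.so.10.1"
OPENED = re.compile(r"Successfully opened dynamic library (lib\w+)\.so\.([\d.]+)")


def loaded_libraries(log_text):
    """Return {library name: version suffix} for every library the log
    reports as opened."""
    return {name: version for name, version in OPENED.findall(log_text)}
```

For the log in this post, the result includes `libcudart` 10.1 and `libcudnn` 7, while the list above gives CUDA 11.6. I do not know whether this difference is related to the delay.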