TPU and GPU give different results for the same model

Background / what I want to achieve

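With the code below, on a GPU acc rises above 0.9 and val_acc reaches about 0.6,
but on a TPU both stay low and stop changing partway through training.
In addition, I cannot raise the batch size any further, yet the TPU run is slower than the GPU run.
Is there a problem in my code?
If anyone knows of information that might help, I would be glad to hear it.
Thank you in advance.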

Problem / error messages


Epoch 1/50
INFO:tensorflow:New input shapes; (re-)compiling: mode=train (# of cores 8), [TensorSpec(shape=(384,), dtype=tf.int32, name='core_id0'), TensorSpec(shape=(384, 128, 128, 3), dtype=tf.float32, name='input_1_10'), TensorSpec(shape=(384, 5), dtype=tf.float32, name='dense_1_target_30')]
INFO:tensorflow:Overriding default placeholder.
INFO:tensorflow:Remapping placeholder for input_1
INFO:tensorflow:Started compiling
INFO:tensorflow:Finished compiling. Time elapsed: 39.066269874572754 secs
INFO:tensorflow:Setting weights on TPU model.
1/4 [======>.......................] - ETA: 6:59 - loss: 1.5895 - acc: 0.2354INFO:tensorflow:New input shapes; (re-)compiling: mode=train (# of cores 8), [TensorSpec(shape=(354,), dtype=tf.int32, name='core_id0'), TensorSpec(shape=(354, 128, 128, 3), dtype=tf.float32, name='input_1_10'), TensorSpec(shape=(354, 5), dtype=tf.float32, name='dense_1_target_30')]
INFO:tensorflow:Overriding default placeholder.
INFO:tensorflow:Remapping placeholder for input_1
INFO:tensorflow:Started compiling
INFO:tensorflow:Finished compiling. Time elapsed: 27.03714942932129 secs
3/4 [=====================>........] - ETA: 1:00 - loss: 3.1962 - acc: 0.3590INFO:tensorflow:New input shapes; (re-)compiling: mode=eval (# of cores 8), [TensorSpec(shape=(384,), dtype=tf.int32, name='core_id_10'), TensorSpec(shape=(384, 128, 128, 3), dtype=tf.float32, name='input_1_10'), TensorSpec(shape=(384, 5), dtype=tf.float32, name='dense_1_target_30')]
INFO:tensorflow:Overriding default placeholder.
INFO:tensorflow:Remapping placeholder for input_1
INFO:tensorflow:Started compiling
INFO:tensorflow:Finished compiling. Time elapsed: 19.703600883483887 secs
3/4 [=====================>........] - ETA: 1:02 - loss: 4.8907 - acc: 0.4226INFO:tensorflow:New input shapes; (re-)compiling: mode=eval (# of cores 8), [TensorSpec(shape=(354,), dtype=tf.int32, name='core_id_10'), TensorSpec(shape=(354, 128, 128, 3), dtype=tf.float32, name='input_1_10'), TensorSpec(shape=(354, 5), dtype=tf.float32, name='dense_1_target_30')]
INFO:tensorflow:Overriding default placeholder.
INFO:tensorflow:Remapping placeholder for input_1
INFO:tensorflow:Started compiling
INFO:tensorflow:Finished compiling. Time elapsed: 20.408061981201172 secs
4/4 [==============================] - 257s 64s/step - loss: 4.9464 - acc: 0.4242
4/4 [==============================] - 453s 113s/step - loss: 2.7730 - acc: 0.3726 - val_loss: 4.9464 - val_acc: 0.4242
Epoch 2/50
# (epochs 2 to 7 omitted because of the character limit)
Epoch 8/50
4/4 [==============================] - 256s 64s/step - loss: 8.4696 - acc: 0.4241
4/4 [==============================] - 268s 67s/step - loss: 1.4502 - acc: 0.4241 - val_loss: 8.4696 - val_acc: 0.4241
Epoch 9/50
4/4 [==============================] - 250s 63s/step - loss: 11.8974 - acc: 0.1984
4/4 [==============================] - 262s 65s/step - loss: 1.4516 - acc: 0.4240 - val_loss: 11.8974 - val_acc: 0.1984
Epoch 10/50
4/4 [==============================] - 253s 63s/step - loss: 11.6360 - acc: 0.1995
4/4 [==============================] - 265s 66s/step - loss: 1.4501 - acc: 0.4241 - val_loss: 11.6360 - val_acc: 0.1995
Epoch 11/50
4/4 [==============================] - 252s 63s/step - loss: 10.8887 - acc: 0.2013
4/4 [==============================] - 264s 66s/step - loss: 1.4497 - acc: 0.4241 - val_loss: 10.8887 - val_acc: 0.2013
Epoch 12/50
4/4 [==============================] - 255s 64s/step - loss: 9.6763 - acc: 0.2067
4/4 [==============================] - 266s 67s/step - loss: 1.4494 - acc: 0.4241 - val_loss: 9.6763 - val_acc: 0.2067
Epoch 13/50
4/4 [==============================] - 253s 63s/step - loss: 9.1260 - acc: 0.2096
4/4 [==============================] - 265s 66s/step - loss: 1.4492 - acc: 0.4241 - val_loss: 9.1260 - val_acc: 0.2096
Epoch 14/50
4/4 [==============================] - 254s 64s/step - loss: 5.9425 - acc: 0.2304
4/4 [==============================] - 266s 67s/step - loss: 1.4491 - acc: 0.4241 - val_loss: 5.9425 - val_acc: 0.2304
Epoch 15/50
4/4 [==============================] - 255s 64s/step - loss: 5.1473 - acc: 0.2458
4/4 [==============================] - 266s 67s/step - loss: 1.4494 - acc: 0.4241 - val_loss: 5.1473 - val_acc: 0.2458
Epoch 16/50
4/4 [==============================] - 254s 63s/step - loss: 3.9491 - acc: 0.2666
4/4 [==============================] - 265s 66s/step - loss: 1.4492 - acc: 0.4240 - val_loss: 3.9491 - val_acc: 0.2666
Epoch 17/50
4/4 [==============================] - 255s 64s/step - loss: 2.5912 - acc: 0.3689
4/4 [==============================] - 266s 67s/step - loss: 1.4495 - acc: 0.4241 - val_loss: 2.5912 - val_acc: 0.3689
Epoch 18/50
4/4 [==============================] - 252s 63s/step - loss: 2.1936 - acc: 0.3486
4/4 [==============================] - 264s 66s/step - loss: 1.4489 - acc: 0.4241 - val_loss: 2.1936 - val_acc: 0.3486
Epoch 19/50
4/4 [==============================] - 255s 64s/step - loss: 1.7816 - acc: 0.3842
4/4 [==============================] - 267s 67s/step - loss: 1.4492 - acc: 0.4241 - val_loss: 1.7816 - val_acc: 0.3842
Epoch 20/50
4/4 [==============================] - 258s 65s/step - loss: 1.5642 - acc: 0.4237
4/4 [==============================] - 270s 67s/step - loss: 1.4491 - acc: 0.4241 - val_loss: 1.5642 - val_acc: 0.4237
Epoch 21/50
4/4 [==============================] - 258s 64s/step - loss: 1.4927 - acc: 0.4227
4/4 [==============================] - 270s 67s/step - loss: 1.4491 - acc: 0.4241 - val_loss: 1.4927 - val_acc: 0.4227
Epoch 22/50
4/4 [==============================] - 257s 64s/step - loss: 1.4711 - acc: 0.4227
4/4 [==============================] - 269s 67s/step - loss: 1.4492 - acc: 0.4241 - val_loss: 1.4711 - val_acc: 0.4227
Epoch 23/50
4/4 [==============================] - 257s 64s/step - loss: 1.4529 - acc: 0.4243
4/4 [==============================] - 268s 67s/step - loss: 1.4496 - acc: 0.4240 - val_loss: 1.4529 - val_acc: 0.4243
# almost no change from here on
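
A note on the log above: each mode is compiled twice, once for a leading dimension of 384 and once for 354, which is consistent with the last batch of an epoch being smaller than the others. The sketch below only illustrates how such a short final batch arises. `batch_sizes` is a hypothetical helper, and the sample counts are made up for illustration, not taken from the dataset in this question.

```python
def batch_sizes(num_samples, batch_size):
    """Return the size of every batch produced in one pass over the data."""
    full_batches, remainder = divmod(num_samples, batch_size)
    sizes = [batch_size] * full_batches
    if remainder:
        # The leftover samples form one shorter batch at the end.
        sizes.append(remainder)
    return sizes


# Hypothetical numbers: 1000 and 1024 samples with a batch size of 256.
print(batch_sizes(1000, 256))  # [256, 256, 256, 232]
print(batch_sizes(1024, 256))  # [256, 256, 256, 256]
```

If the short batch is indeed what triggers the second compilation, a sample count that is a multiple of the batch size should avoid it. This has not been verified here.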

Relevant source code

import os  # needed to read COLAB_TPU_ADDR below

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = {'png', 'retina'}

from tensorflow.keras.applications import Xception
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
# Use tf.keras throughout instead of mixing in the standalone keras package
from tensorflow.keras.optimizers import Adam, RMSprop, SGD
from tensorflow.keras.regularizers import l2  # used by the Dense layer below
from tensorflow.keras.utils import to_categorical

from tensorflow.keras.preprocessing.image import ImageDataGenerator

import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.contrib.tpu.python.tpu import keras_support
from tensorflow.keras.models import Model,load_model
from functools import reduce

from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True

classes = ["White", "Black", "Asian", "Indian", "Others"]
num_classes = len(classes)
image_size = 128

K.clear_session()
# Define the network
net = Xception(include_top=False, weights="imagenet", input_shape=(image_size,image_size,3))

# Freeze all layers except the last 5
for layer in net.layers[:-5]:
    layer.trainable = False
x = net.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, kernel_regularizer=l2(0.001), activation = 'relu')(x)
predictions = Dense(num_classes, activation = 'softmax')(x)
    
model = Model(inputs = net.inputs, outputs = predictions)

# Freeze the first 108 layers
for layer in model.layers[:108]:
    layer.trainable = False

    # Keep the Batch Normalization layers trainable
    if layer.name.startswith('batch_normalization'):
        layer.trainable = True
    if layer.name.endswith('bn'):
        layer.trainable = True

# Train layer 109 and onward
for layer in model.layers[108:]:
    layer.trainable = True
    
model.compile(
    optimizer = tf.train.AdamOptimizer(learning_rate=0.01),
    loss = 'categorical_crossentropy',
    metrics = ["accuracy"]
)

# Convert the Keras model to run on the Colab TPU
tpu_grpc_url = "grpc://"+os.environ["COLAB_TPU_ADDR"]
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu_grpc_url)
strategy = keras_support.TPUDistributionStrategy(tpu_cluster_resolver)
model = tf.contrib.tpu.keras_to_tpu_model(model, strategy=strategy)

datagen = ImageDataGenerator(
    rescale=1./255,
    featurewise_center = False,
    samplewise_center = False,
    featurewise_std_normalization = False,
    samplewise_std_normalization = False,
    zca_whitening = False,
    rotation_range = 0,
    width_shift_range = 0.1,
    height_shift_range = 0.1,
    horizontal_flip = True,
    vertical_flip = False
)

batch_size = 256  # 2048 when running on the TPU

train_generator = datagen.flow_from_directory(
        '/content/data/train1',
        target_size=(image_size, image_size),
        batch_size=batch_size, 
        follow_links = True
)

validation_generator = datagen.flow_from_directory(
        '/content/data/validation1',
        target_size=(image_size, image_size),
        batch_size=batch_size, 
        follow_links = True
)

hist = model.fit_generator(
    train_generator,
    epochs = 50,
    validation_data = validation_generator,
    verbose = 1,
    max_queue_size=3,
)
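
The training accuracy in the log stays at about 0.424 from epoch 8 onward, which could mean that the model outputs the same class for every image. Below is a minimal sketch for checking that, assuming the predictions are available as a list of per-class probability rows (for example the result of `model.predict(...)` converted with `.tolist()`). `predicted_class_counts` is a hypothetical helper and the rows are made-up values.

```python
from collections import Counter


def predicted_class_counts(probabilities):
    """Count how often each class index has the highest probability."""
    counts = Counter()
    for row in probabilities:
        # Index of the largest value in the row (the first one wins on ties).
        best = max(range(len(row)), key=row.__getitem__)
        counts[best] += 1
    return counts


# Made-up softmax outputs for 3 samples and 5 classes.
rows = [
    [0.50, 0.20, 0.10, 0.10, 0.10],
    [0.45, 0.25, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.10, 0.10, 0.10],
]
print(predicted_class_counts(rows))  # Counter({0: 3})
```

If a single index accounts for nearly all predictions on the real data, the plateau would match the share of that class in the dataset rather than anything the model has learned.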