Visualizing the attention used in BERT training

Question

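I have only just started using BERT, and I would like to compute the attention weights from the model and extract the attention values between words. I am using a model pre-trained on Japanese Wikipedia.
I looked through Qiita articles, but I could not find any that use TensorFlow; they all use Keras or PyTorch.
Could you tell me how to do this with TensorFlow?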

modeling.py

It looks like attention_scores is computed here:

  # `query_layer` = [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)
  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  # Take the dot product between "query" and "key" to get the raw
  # attention scores.
  # `attention_scores` = [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))
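
For reference, the values one usually visualizes are not the raw attention_scores but the result of a softmax over the last axis (T), so that each query position gets a distribution over key positions. Below is a minimal sketch of that step, assuming the same shapes as in the excerpt above. It uses NumPy rather than TensorFlow purely for illustration (np.matmul and the hand-written softmax stand in for tf.matmul and tf.nn.softmax), the function name attention_probs_sketch and the toy sizes are hypothetical, and it omits the attention mask that modeling.py adds before the softmax.

```python
import math

import numpy as np


def attention_probs_sketch(query_layer, key_layer, size_per_head):
    """Hypothetical helper: [B, N, F, H] and [B, N, T, H] -> [B, N, F, T]."""
    # Dot product of query and key, like tf.matmul(..., transpose_b=True).
    attention_scores = np.matmul(query_layer, np.swapaxes(key_layer, -1, -2))
    # Scale by 1 / sqrt(size_per_head), as in the excerpt above.
    attention_scores = attention_scores * (1.0 / math.sqrt(float(size_per_head)))
    # Numerically stable softmax over the last axis (T).
    shifted = attention_scores - attention_scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)


# Toy sizes, chosen only for this example (not taken from any real model).
B, N, F, T, H = 1, 2, 4, 4, 8
rng = np.random.default_rng(0)
query_layer = rng.standard_normal((B, N, F, H))
key_layer = rng.standard_normal((B, N, T, H))

attention_probs = attention_probs_sketch(query_layer, key_layer, H)
print(attention_probs.shape)  # (1, 2, 4, 4)
```

Each attention_probs[b, n, f, :] sums to 1 and can be read as how strongly token f attends to every token in head n. In TensorFlow, the corresponding tensor would have to be exposed from the graph (for example, by returning it alongside the layer output) and then evaluated; exactly how to do that depends on how the model is built and run, so treat this as a starting point rather than a finished recipe.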