自然言語処理の小説データ学習の際に起こるメモリ不足の解消

実現したこと

メモリ不足のエラーの解消

発生している問題

現在、LSTMを用いて文章生成を行おうと思い実装サイトを参考にしてコードを組みました。しかし、

x = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)

の部分でメモリ不足のエラーを起こしています。恐らく入力データ(data.txt)が長すぎなのが原因だと思われます。

解決方法としてはデータを分割し、学習してモデルを生成するのが良いと思いますが、どのようにコードを書いていいのかわかりません。コードの書き方のご教授いただければと思います。

入力データは複数の小説データを空白、ルビ、改行なしで繋げたものを用いています
小説1小説2小説3…小説N
と続いています

また、モデルを生成した後に、モデルの更新が行えるかどうか、その書き方も教えていただきたいです。
例
小説1学習→モデル生成→小説2学習→モデル更新→小説3学習→モデルの更新…小説N学習→モデルの更新
のような処理はできるのでしょうか。

よろしくお願いいたします。

ソースコード

from __future__ import print_function
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.optimizers import RMSprop
from janome.tokenizer import Tokenizer
import numpy as np
import random
import sys
import io
import time

path = './data.txt'
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower()
print('corpus length:', len(text))

text = Tokenizer().tokenize(text, wakati=True)  # 分かち書きする
chars = text
count = 0
char_indices = {}  # 辞書初期化
indices_char = {}  # 逆引き辞書初期化

for word in chars:
    if not word in char_indices:  # 未登録なら
        char_indices[word] = count  # 登録する
        count += 1
# 逆引き辞書を辞書から作成する
indices_char = dict([(value, key) for (key, value) in char_indices.items()])

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 5
step = 1
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    start_index = 0  # テキストの最初からスタート
    for diversity in [0.2]:  # diversity は 0.2のみ使用
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        # sentence はリストなので文字列へ変換して使用
        generated += "".join(sentence)
        print(sentence)

        # sentence はリストなので文字列へ変換して使用
        print('----- Generating with seed: "' + "".join(sentence) + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:]
            # sentence はリストなので append で結合する
            sentence.append(next_char)

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()


print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

# fit
global_start_time = time.time()
model.fit(x, y,
          batch_size=128,
          epochs=60,
          callbacks=[print_callback])

print('time : ', time.time() - global_start_time)
# save
model.save('model.h5')

meg_

2020/08/05 22:12

・マシンのスペック、エラーメッセージはどうなっていますか？・> 実装サイトを参考にしてコードを組みました。 ←　サイトのURLを掲載ください。

quickquip

2020/08/06 01:55

https://teratail.com/help/question-tips#questionTips3-4-1 > 実際に起きた結果を示しましょう。例えば、「○○というエラーが表示された」、「レイアウトがこのように崩れてしまった」等です。 > 表示されたエラーメッセージをそのままコピー&ペーストしましょう。自分でタイプしなおしたり、自分で解釈・要約しようとしてはいけません。

8524ba23

2020/08/09 03:08

print( len(sentences), maxlen, len(chars)) を実行した結果も記載ください。

EchoOfDeath

2020/08/09 13:04

記載不足で申し訳ありません。マシンスペック(Surface Pro4) ・OS　Windows 10 Pro ・Ram　8GB ・プロセッサ　Intel Core i5 ・SSD 256GB 参考サイト https://github.com/kuc-arc-f/LSTM_make_colab エラー結果 x = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool) MemoryError: Unable to allocate 1.92 TiB for an array with shape (650095, 5, 650100) and data type bool print( len(sentences), maxlen, len(chars)) を実行した結果 len(sentences): 650095 maxlen: 5 len(chars): 650100 です。よろしくお願いいたします。