TF-IDF: Is the data too large? How to handle this error

Question

### What I want to do

I want to compute TF-IDF, but I get the error below, possibly because the output data is too large. The TF-IDF is computed over a very large body of text, and since splitting the data could change the resulting values, I would rather not split it. If there is a way to resolve this error, I would be grateful for your advice.

Memory error: Unable to allocate 89.4 GiB for an array with shape (19720, 608678) and data type float64

```python
# (omitted)
for i in ['description2']:
    print(i)
    tfidf_vec2 = TfidfVectorizer(analyzer='word',ngram_range=(1,2))
    y = tfidf_vec2.fit_transform(df[i].values.tolist())
    df_tfidf2 = pd.DataFrame(y.toarray(), columns=tfidf_vec2.get_feature_names())
```

```
MemoryError                               Traceback (most recent call last)
 in
      3     tfidf_vec2 = TfidfVectorizer(analyzer='word',ngram_range=(1,2))
      4     y = tfidf_vec2.fit_transform(df[i].values.tolist())
----> 5     df_tfidf2 = pd.DataFrame(y.toarray(), columns=tfidf_vec2.get_feature_names())

~\anaconda3\lib\site-packages\scipy\sparse\compressed.py in toarray(self, order, out)
   1023         if out is None and order is None:
   1024             order = self._swap('cf')[0]
-> 1025         out = self._process_toarray_args(order, out)
   1026         if not (out.flags.c_contiguous or out.flags.f_contiguous):
   1027             raise ValueError('Output array must be C or F contiguous')

~\anaconda3\lib\site-packages\scipy\sparse\base.py in _process_toarray_args(self, order, out)
   1187             return out
   1188         else:
-> 1189             return np.zeros(self.shape, dtype=self.dtype, order=order)
   1190
   1191

MemoryError: Unable to allocate 89.4 GiB for an array with shape (19720, 608678) and data type float64
```

Accepted Answer

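It depends on what kind of machine you are using, but taking Windows 10 as an example, [Windows 10の最低メモリ容量・最大メモリ容量について](https://pssection9.com/archives/windows10-minimum-maximum-memory-capacity.html#:~:text=Windows%2010%20Home%2064bit%E3%81%A7,%E3%81%BE%E3%81%A7%E3%81%97%E3%81%8B%E8%AA%8D%E8%AD%98%E3%81%97%E3%81%BE%E3%81%9B%E3%82%93%E3%80%82) (on the minimum and maximum memory capacity of Windows 10) states that "Windows 10 Home recognizes up to 128 GB of memory." Depending on the motherboard, however, the limit can be lower: according to the same article, "PCs released in the last few years, depending on the manufacturer, generally recognize up to 32 GB in the case of laptops and 64 GB to 128 GB in the case of desktops."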
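To compare that limit with what the error is asking for, the 89.4 GiB figure can be reproduced from the array shape in the message and the 8-byte size of a float64. This is only a back-of-the-envelope sketch, and the helper name `dense_size_gib` is made up for illustration:

```python
def dense_size_gib(shape, bytes_per_item=8):
    """Size of a dense array in GiB; float64 uses 8 bytes per element."""
    rows, cols = shape
    return rows * cols * bytes_per_item / 2**30

# Shape taken from the MemoryError message in the question.
size = dense_size_gib((19720, 608678))
print(f"{size:.1f} GiB")  # 89.4 GiB
```

That is about 96 GB in decimal units for the array alone, so even a 128 GB machine would have limited headroom for the operating system and for any further copies made while building the DataFrame.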

Check your machine, and if it can be expanded to 128 GB, how about adding more memory?
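If adding memory is not an option, one alternative worth considering is to skip the dense conversion. The traceback shows the allocation failing inside `y.toarray()`, not inside `fit_transform`, so the TF-IDF values have already been computed over the whole corpus, without splitting, and are held in a SciPy sparse matrix. The sketch below keeps them sparse. The three-document `df` is a toy stand-in for the real data, which the question does not show, and whether a sparse result fits the later processing is an assumption:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for df['description2']; the real corpus is not in the question.
df = pd.DataFrame({"description2": [
    "red apple pie",
    "green apple tart",
    "red cherry pie",
]})

tfidf_vec2 = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
# fit_transform returns a SciPy sparse matrix; only non-zero entries are stored.
y = tfidf_vec2.fit_transform(df["description2"].values.tolist())

# Bytes actually held by the sparse (CSR) representation.
sparse_bytes = y.data.nbytes + y.indices.nbytes + y.indptr.nbytes
print(y.shape, sparse_bytes)

# get_feature_names_out() exists in newer scikit-learn releases;
# fall back to the get_feature_names() call used in the question otherwise.
if hasattr(tfidf_vec2, "get_feature_names_out"):
    names = tfidf_vec2.get_feature_names_out()
else:
    names = tfidf_vec2.get_feature_names()

# Wrap the sparse matrix in a DataFrame without densifying it.
df_tfidf2 = pd.DataFrame.sparse.from_spmatrix(y, columns=names)
```

With several hundred thousand columns, building a sparse DataFrame can itself be slow, so working with `y` directly (for example, slicing rows or passing it to an estimator that accepts sparse input) may be the more practical route.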
