Edit history

Question edit history

Revision

2021/02/20 08:24

Posted by

fukubaka

Score 11

title CHANGED

File without changes

body CHANGED

@@ -1,5 +1,5 @@
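 ### What I want to do
-I want to compute TF-IDF, but the following error occurred, perhaps because the output data is too large. If there is a way to resolve this error, I would be grateful for your advice.
+I want to compute TF-IDF, but the following error occurred, perhaps because the output data is too large. I am computing TF-IDF over a very large body of text, and since the values could change, I would prefer not to split the data. If there is a way to resolve this error, I would be grateful for your advice.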
 Memory error:

Revision

2021/02/20 08:24

Posted by

fukubaka

Score 11

title CHANGED

@@ -1,1 +1,1 @@
-TF-IDF: ~~I want to output the words as well.~~
+TF-IDF: data size? How to deal with the error

body CHANGED

@@ -1,34 +1,41 @@
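-I would like to change this code so that the names of the extracted words are output together with the values after SVD.
-I would appreciate your guidance.
+### What I want to do
+I want to compute TF-IDF, but the following error occurred, perhaps because the output data is too large. If there is a way to resolve this error, I would be grateful for your advice.
-```Enter language here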
-class MecabTokenizer:
-    def __init__(self):
-        self.wakati = MeCab.Tagger('-Owakati')
-        self.wakati.parse('')
-def tokenize(self, line):
-        txt = self.wakati.parse(line)
-        txt = txt.split()
-        return txt
-def mecab_tokenizer(self, line):
-        node = self.wakati.parseToNode(line)
-        keywords = []
-        while node:
-            if node.feature.split(",")[0] == "名詞" or node.feature.split(",")[0] == "形容詞":
-                keywords.append(node.surface)
-            node = node.next
-        return keywords
+Memory error:
-n_comp = 40 # number of dimensions after dimensionality reduction
+Unable to allocate 89.4 GiB for an array with shape (19720, 608678) and data type float64
+```python
+# (omitted)
-for i in ['Title1','Title2','Title3']:
+for i in ['description2']:
     print (i)
-    tfidf_vec = TfidfVectorizer(analyzer='word',ngram_range=(1,2))
+    tfidf_vec2 = TfidfVectorizer(analyzer='word',ngram_range=(1,2))
+    y = tfidf_vec2.fit_transform(df[i].values.tolist())
+    df_tfidf2 = pd.DataFrame(y.toarray(), columns=tfidf_vec2.get_feature_names())
+```
+```
+MemoryError                               Traceback (most recent call last)
+<ipython-input-23-d4e72a34568f> in <module>
+      3     tfidf_vec2 = TfidfVectorizer(analyzer='word',ngram_range=(1,2))
-    text_tfidf = tfidf_vec.fit_transform(df[i].values.tolist() )
+      4     y = tfidf_vec2.fit_transform(df[i].values.tolist())
-    text_svd = TruncatedSVD(n_components=n_comp,algorithm='arpack',random_state=9999)
-    df_svd = pd.DataFrame(text_svd.fit_transform(text_tfidf))
+----> 5     df_tfidf2 = pd.DataFrame(y.toarray(), columns=tfidf_vec2.get_feature_names())
+~\anaconda3\lib\site-packages\scipy\sparse\compressed.py in toarray(self, order, out)
+   1023         if out is None and order is None:
+   1024             order = self._swap('cf')[0]
+-> 1025         out = self._process_toarray_args(order, out)
+   1026         if not (out.flags.c_contiguous or out.flags.f_contiguous):
-    df_svd.columns = ['svd_'+str(i)+str(j+1) for j in range(n_comp)]
+   1027             raise ValueError('Output array must be C or F contiguous')
+~\anaconda3\lib\site-packages\scipy\sparse\base.py in _process_toarray_args(self, order, out)
-    df = pd.concat([df,df_svd],axis=1)
+   1187             return out
+   1188         else:
+-> 1189             return np.zeros(self.shape, dtype=self.dtype, order=order)
+   1190
+   1191
+MemoryError: Unable to allocate 89.4 GiB for an array with shape (19720, 608678) and data type float64
 ```

Revision

2021/02/20 08:20

Posted by

fukubaka

Score 11

title CHANGED

File without changes

body CHANGED

@@ -23,11 +23,11 @@
 n_comp = 40 # number of dimensions after dimensionality reduction
-for i in ['channelTitle2','description2','title2']:#,'new_title','new_description',
+for i in ['Title1','Title2','Title3']:
     print (i)
     tfidf_vec = TfidfVectorizer(analyzer='word',ngram_range=(1,2))
-text_tfidf = tfidf_vec.fit_transform(df[i].values.tolist() )
+    text_tfidf = tfidf_vec.fit_transform(df[i].values.tolist() )
-    text_svd = TruncatedSVD(n_components=n_comp, algorithm='arpack',random_state=9999)
+    text_svd = TruncatedSVD(n_components=n_comp,algorithm='arpack',random_state=9999)
     df_svd = pd.DataFrame(text_svd.fit_transform(text_tfidf))
     df_svd.columns = ['svd_'+str(i)+str(j+1) for j in range(n_comp)]
     df = pd.concat([df,df_svd],axis=1)