What does the R lda package error "must be non-negative and less than the number of words" mean?

I am using the lda.collapsed.gibbs.sampler() function from the lda package.
The arguments I built myself are documents and vocab. I created both with the same types and in the same way as the sample data, but I got the error below. Why does this happen?

Documentation site for the lda package

R
result <- lda.collapsed.gibbs.sampler(documents, k, vocab,
                                      25,   # number of iterations
                                      0.1,  # Dirichlet hyperparameter alpha (topic proportions)
                                      0.1,  # Dirichlet hyperparameter eta (topic-word distributions)
                                      compute.log.likelihood = TRUE)

Error message
Error in structure(.Call("collapsedGibbsSampler", documents, as.integer(K),  : 
  Word (251) must be non-negative and less than the number of words (251).

If anyone knows what is going on, I would appreciate an explanation.

After receiving the answers

Thank you for your answers.
I then tried a simple example in which vocab has indices only up to 9 and documents also uses indices only up to 9, but the same error was printed.

Error message
Error in structure(.Call("collapsedGibbsSampler", documents, as.integer(K),  : 
  Word (9) must be non-negative and less than the number of words (9).

What I tried
To check whether the lengths variable inside lda.collapsed.gibbs.sampler() really holds the number of words, I copied that part into my own code and ran it. It did contain the word count.

I also checked with typeof, and the types matched.

2 answers

Best answer

It looks like every word index that appears in documents has to be smaller than the number of entries in vocab.
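
A minimal sketch of that rule as a check you can run before calling the sampler. It assumes each document is a 2-row integer matrix whose first row holds 0-based word indices and whose second row holds counts, as in the examples here. `check_word_indices` is a hypothetical helper, not part of the lda package.

```r
# Hypothetical helper (not part of the lda package).
# Assumes each document is a 2-row integer matrix:
#   row 1 = 0-based word indices, row 2 = counts.
check_word_indices <- function(documents, vocab) {
  # Look only at the index row of every document, not at the counts
  idx <- unlist(lapply(documents, function(d) d[1, ]))
  out_of_range <- idx[idx < 0 | idx >= length(vocab)]
  if (length(out_of_range) > 0) {
    stop(sprintf("Word index %d is outside 0..%d",
                 out_of_range[1], length(vocab) - 1L))
  }
  invisible(TRUE)
}

vocab <- c("i", "have", "pen", ".", "you", "are", "dog", "love", "japan")
ok_docs  <- list(matrix(c(0L, 8L, 1L, 1L), nrow = 2, byrow = TRUE))
bad_docs <- list(matrix(c(1L, 9L, 1L, 1L), nrow = 2, byrow = TRUE))

check_word_indices(ok_docs, vocab)   # passes silently
res <- try(check_word_indices(bad_docs, vocab), silent = TRUE)
inherits(res, "try-error")           # TRUE: index 9 is out of range for 9 words
```

With a 9-word vocab the valid indices are 0 through 8, which matches the "Word (9) ... number of words (9)" message in the question.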

R
library(lda)
vocab <- c("i", "have", "pen", ".", "you", "are", "dog", "love", "japan")
k <- 10

R
# Success: every word index (row 1) is between 0 and 8
d2 <- list(matrix(c(1:4, rep(1, 4)), 2, 4, byrow = TRUE),
           matrix(c(5:7, rep(1, 3)), 2, 3, byrow = TRUE),
           matrix(c(1, 8, 0, rep(1, 3)), 2, 3, byrow = TRUE))
documents <- lapply(d2, function(x) matrix(as.integer(x), 2))
documents

result <- lda.collapsed.gibbs.sampler(documents, k, vocab,
                                      25,   # number of iterations
                                      0.1,  # Dirichlet hyperparameter alpha
                                      0.1,  # Dirichlet hyperparameter eta
                                      compute.log.likelihood = TRUE)

max(unique(unlist(documents)))     # 8
length(unique(unlist(documents)))  # 9
length(vocab)                      # 9

R
# Failure 1: index 9 appears, but a 9-word vocab only covers 0..8
d2 <- list(matrix(c(1:4, rep(1, 4)), 2, 4, byrow = TRUE),
           matrix(c(5:7, rep(1, 3)), 2, 3, byrow = TRUE),
           matrix(c(1, 8, 9, rep(1, 3)), 2, 3, byrow = TRUE))
documents <- lapply(d2, function(x) matrix(as.integer(x), 2))
documents

# This call stops with the "must be non-negative and less than ..." error
result <- lda.collapsed.gibbs.sampler(documents, k, vocab,
                                      25,   # number of iterations
                                      0.1,  # Dirichlet hyperparameter alpha
                                      0.1,  # Dirichlet hyperparameter eta
                                      compute.log.likelihood = TRUE)
max(unique(unlist(documents)))     # 9
length(unique(unlist(documents)))  # 9
length(vocab)                      # 9

R
# Failure 2: only 8 distinct values are used, but index 9 is still out of range
d2 <- list(matrix(c(1:4, rep(1, 4)), 2, 4, byrow = TRUE),
           matrix(c(5:7, rep(1, 3)), 2, 3, byrow = TRUE),
           matrix(c(9, 9, 9, rep(1, 3)), 2, 3, byrow = TRUE))
documents <- lapply(d2, function(x) matrix(as.integer(x), 2))
documents

# This call stops with the same error
result <- lda.collapsed.gibbs.sampler(documents, k, vocab,
                                      25,   # number of iterations
                                      0.1,  # Dirichlet hyperparameter alpha
                                      0.1,  # Dirichlet hyperparameter eta
                                      compute.log.likelihood = TRUE)

max(unique(unlist(documents)))     # 9
length(unique(unlist(documents)))  # 8
length(vocab)                      # 9
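
If the index row was built from R's usual 1-based positions (for example with `match()`), shifting it down by one gives the 0-based indices the sampler expects. A rough sketch using base R only. The `tokens` list and the `to_lda_document` helper are hypothetical and exist only for illustration.

```r
vocab <- c("i", "have", "pen", ".", "you", "are", "dog", "love", "japan")

# Hypothetical tokenized documents, for illustration only
tokens <- list(c("i", "have", "pen", "."),
               c("you", "are", "dog"),
               c("i", "love", "japan"))

# Build a 2-row integer matrix per document:
#   row 1 = 0-based position of each distinct word in vocab
#   row 2 = how often that word occurs in the document
to_lda_document <- function(words, vocab) {
  counts <- table(match(words, vocab) - 1L)  # match() is 1-based, so shift by one
  rbind(as.integer(names(counts)),
        as.integer(counts))
}

documents <- lapply(tokens, to_lda_document, vocab = vocab)
max_index <- max(sapply(documents, function(d) max(d[1, ])))
max_index < length(vocab)   # TRUE: "japan" maps to 8, which is below 9
```

Words that are missing from vocab come back from `match()` as NA and are silently dropped by `table()` in this sketch. The lda package also provides `lexicalize()`, which may be a more convenient way to build documents and vocab together.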

Posted 2020/10/15 03:31

shimiken

Total score 368

oika77

2020/10/15 03:46

Thank you so, so much. That solved it. I searched for the error message but could not find anything I could understand, so I was not sure what to do.

It seems to fail when vocab has fewer entries than the number of distinct words that appear in documents.

R
# Load the sample data
library(lda)
data(cora.documents)
data(cora.vocab)
k <- 10

R
# Success: the full vocabulary
result <- lda.collapsed.gibbs.sampler(cora.documents,
                                      k,
                                      cora.vocab,
                                      25,
                                      0.1,
                                      0.1,
                                      compute.log.likelihood = TRUE)

length(unique(unlist(cora.documents)))
length(cora.vocab)

R
# Fails when vocab has too few words
vocab1 <- cora.vocab[1:2960]
result <- lda.collapsed.gibbs.sampler(cora.documents,
                                      k,
                                      vocab1,
                                      25,
                                      0.1,
                                      0.1,
                                      compute.log.likelihood = TRUE)

length(unique(unlist(cora.documents)))
length(vocab1)

R
# Extra words in vocab are fine
vocab2 <- c(cora.vocab, "rdgfsfgf")
result <- lda.collapsed.gibbs.sampler(cora.documents,
                                      k,
                                      vocab2,
                                      25,
                                      0.1,
                                      0.1,
                                      compute.log.likelihood = TRUE)

length(unique(unlist(cora.documents)))
length(vocab2)
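
Taken together with the toy examples in the best answer, the deciding quantity appears to be the largest word index rather than the number of distinct words. A small sketch of the difference, assuming the same 2-row layout (first row: 0-based indices). `required_vocab_size` is a hypothetical helper, not part of the lda package.

```r
# Hypothetical helper: smallest vocab length that covers every index,
# assuming row 1 of each document holds 0-based word indices.
required_vocab_size <- function(documents) {
  max(unlist(lapply(documents, function(d) d[1, ]))) + 1L
}

# Same shape as the "9, 9, 9" failure case: index 8 never appears
docs <- list(matrix(c(1:4, rep(1L, 4)), 2, 4, byrow = TRUE),
             matrix(c(5:7, rep(1L, 3)), 2, 3, byrow = TRUE),
             matrix(c(9L, 9L, 9L, rep(1L, 3)), 2, 3, byrow = TRUE))

distinct_words <- length(unique(unlist(lapply(docs, function(d) d[1, ]))))
distinct_words               # 8 distinct indices: 1..7 and 9
required_vocab_size(docs)    # 10, so a 9-word vocab is still too short
```

When every index from 0 up to the maximum is actually used, the two numbers are equal, which would explain why counting distinct words also predicts the failure after trimming cora.vocab.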