Overview

Text Data

The Disputed Authorship of The Federalist Papers

Title page of The Federalist, Volume 1. Source: Library of Congress.

AFTER an unequivocal experience of the inefficiency of the subsisting federal government, you are called upon to deliberate on a new Constitution for the United States of America.
        
This shall accordingly constitute the subject of my next address.
## Install the two required packages (only needs to be done once)
## install.packages("tm")
## install.packages("SnowballC")

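## Load the two required packages
## (library() attaches one package per call; a second positional argument
## would be taken as `help`, not as another package)
library(tm)
library(SnowballC)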
## Loading required package: NLP
## Load the raw corpus
corpus.raw <- VCorpus(DirSource(directory = "federalist", pattern = "fp"))
corpus.raw
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 85
## Convert to lower case
corpus.prep <- tm_map(corpus.raw, content_transformer(tolower)) 

## Collapse runs of whitespace into a single space
corpus.prep <- tm_map(corpus.prep, stripWhitespace) 

## Remove punctuation
corpus.prep <- tm_map(corpus.prep, removePunctuation)

## Remove numbers
corpus.prep <- tm_map(corpus.prep, removeNumbers) 
head(stopwords("english"))
## [1] "i"      "me"     "my"     "myself" "we"     "our"
## Remove stop words
corpus <- tm_map(corpus.prep, removeWords, stopwords("english")) 
## Stem the remaining words
corpus <- tm_map(corpus, stemDocument) 
## Output truncated to save space
head(content(corpus[[10]]), 3) # Federalist No. 10
## [1] "among numer advantag promis wellconstruct union none"
## [2] "deserv accur develop tendenc break"                  
## [3] "control violenc faction friend popular govern never"
AMONG the numerous advantages promised by a well-constructed Union, none
deserves to be more accurately developed than its tendency to break and
control the violence of faction. The friend of popular governments never
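
To make the effect of each preprocessing step easier to see, the following is a minimal sketch that approximates the same steps in base R on a single sentence. It is not how `tm` implements them internally. The input string is the opening of Federalist No. 10 quoted above with a made-up numeral appended, and `toy.stop` is a hypothetical, very short stop word list standing in for `stopwords("english")`. Stemming is left out because it requires a stemmer such as the one in SnowballC.

```r
## Toy input: opening of Federalist No. 10 plus a made-up numeral,
## so that every step has something to remove
txt <- "AMONG the numerous advantages promised by a well-constructed Union,   none 10"

## Hypothetical, very short stop word list (an assumption for illustration)
toy.stop <- c("the", "by", "a")

clean <- tolower(txt)                           # convert to lower case
clean <- gsub("[[:punct:]]", "", clean)         # remove punctuation
clean <- gsub("[[:digit:]]", "", clean)         # remove numbers
words <- strsplit(clean, "[[:space:]]+")[[1]]   # split on runs of whitespace
words <- words[!(words %in% toy.stop) & words != ""]  # drop stop words and empty strings
words
```

This leaves the seven words among, numerous, advantages, promised, wellconstructed, union, and none. Removing the hyphen is what turns "well-constructed" into a single token, which is why the stemmed `tm` output for Federalist No. 10 contains "wellconstruct".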

Document-Term Matrix

dtm <- DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 85, terms: 4849)>>
## Non-/sparse entries: 44917/367248
## Sparsity           : 89%
## Maximal term length: 18
## Weighting          : term frequency (tf)
inspect(dtm[1:5, 1:8])
## <<DocumentTermMatrix (documents: 5, terms: 8)>>
## Non-/sparse entries: 4/36
## Sparsity           : 90%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## Sample             :
##           Terms
## Docs       abandon abat abb abet abhorr abil abject abl
##   fp01.txt       0    0   0    0      0    0      0   1
##   fp02.txt       0    0   0    0      0    1      0   0
##   fp03.txt       0    0   0    0      0    0      0   2
##   fp04.txt       0    0   0    0      0    0      0   1
##   fp05.txt       0    0   0    0      0    0      0   0
dtm.mat <- as.matrix(dtm)
dtm.mat[1:5, 1:8]
##           Terms
## Docs       abandon abat abb abet abhorr abil abject abl
##   fp01.txt       0    0   0    0      0    0      0   1
##   fp02.txt       0    0   0    0      0    1      0   0
##   fp03.txt       0    0   0    0      0    0      0   2
##   fp04.txt       0    0   0    0      0    0      0   1
##   fp05.txt       0    0   0    0      0    0      0   0
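
As an illustration of what the matrix above contains, here is a small sketch that builds a document-term matrix by hand in base R. The two documents in `docs` are made-up, already preprocessed word vectors, and the names `docs`, `terms`, and `toy.dtm` are hypothetical. It is meant only to show the structure (one row per document, one column per term, each cell a count), not to reproduce `DocumentTermMatrix()`.

```r
## Two made-up, already preprocessed documents (hypothetical data)
docs <- list(doc1 = c("union", "govern", "union"),
             doc2 = c("govern", "faction"))

## Vocabulary: every distinct term, sorted alphabetically
terms <- sort(unique(unlist(docs)))

## One row per document, one column per term; each cell is a term count.
## factor(..., levels = terms) makes table() report zero counts as well.
toy.dtm <- t(sapply(docs, function(w) table(factor(w, levels = terms))))
toy.dtm

## Share of cells equal to zero, analogous to the "Sparsity" line printed for dtm
mean(toy.dtm == 0)
```

In this toy matrix two of the six cells are zero, so the share of zero cells is one third. The Federalist matrix is far sparser (89%) because most of the 4849 stemmed terms appear in only a few of the 85 essays.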

References

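  1. 社会科学のためのデータ分析入門 (上) [Introduction to Data Analysis for the Social Sciences, Vol. 1]. Iwanami Shoten.

  2. 社会科学のためのデータ分析入門 (下) [Introduction to Data Analysis for the Social Sciences, Vol. 2]. Iwanami Shoten.

  3. F. Mosteller and D. L. Wallace (1963) “Inference in an authorship problem.” Journal of the American Statistical Association, vol. 58, no. 302, pp. 275-309.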