A Stylistic Comparison of Mori Ogai and Natsume Soseki

First, let us examine whether Mori Ogai and Natsume Soseki differ in sentence length.

Note that you may not choose Akutagawa or Dazai as the authors for your own analysis.

The case of Mori Ogai

First, download the files, unzip them, and strip the ruby (the furigana reading annotations).
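The Aozora() helper sourced below is not shown in this write-up, so what it does internally is not known here. As a rough sketch of the ruby-stripping step only, assuming the usual Aozora Bunko markup (readings inside 《》, an optional ｜ marking where the annotated span begins, and editorial notes inside ［＃ ］), something like the following could be used. The function name strip_ruby and the sample line are made up for illustration.

```r
# Hypothetical helper; this is NOT the Aozora() function from AozoraURL.R.
strip_ruby <- function(lines) {
  lines <- gsub("《[^》]*》", "", lines)    # drop ruby readings such as 《たかせぶね》
  lines <- gsub("［＃[^］]*］", "", lines)  # drop editorial annotations
  lines <- gsub("｜", "", lines)           # drop the marker that opens a ruby span
  lines
}

# Sample line written for this sketch
sample_line <- "高瀬舟《たかせぶね》は｜小舟《こぶね》である。［＃注記］"
strip_ruby(sample_line)
```

Only the text between the brackets is removed, so the base characters that the reading was attached to stay in place.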

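For example, suppose we pick Mori Ogai from Aozora Bunko. Choose several files in modern orthography (新字新仮名, new character forms and new kana usage) whose listed size is 3000 or more, and look up their URLs.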

Takasebune (高瀬舟), 1916, size 9290: http://www.aozora.gr.jp/cards/000129/files/45245_ruby_21882.zip

Abe Ichizoku (阿部一族), 1913, size 30780: http://www.aozora.gr.jp/cards/000129/files/673_ruby_23254.zip

Kano You ni (かのように), 1917, size 23856: http://www.aozora.gr.jp/cards/000129/files/678_ruby_22883.zip

Niwatori (鶏), 1918, size 20390: http://www.aozora.gr.jp/cards/000129/files/42375_ruby_18247.zip

Futari no Tomo (二人の友), 1915, size 13018: http://www.aozora.gr.jp/cards/000129/files/675_ruby_23191.zip

Shokudo (食堂), 1922, size 8019: http://www.aozora.gr.jp/cards/000129/files/2599_ruby_23032.zip

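From these, we restrict the analysis to four works of relatively similar size: Abe Ichizoku, Kano You ni, Niwatori, and Futari no Tomo.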

# In RStudio, download, unzip, and strip the ruby. The author's initial and the year of the work are added to each output file name.

source("/var/data/AozoraURL.R")
# Aozora
# ('http://www.aozora.gr.jp/cards/000129/files/45245_ruby_21882.zip',
# 'M_1916Takasebune')
Aozora("http://www.aozora.gr.jp/cards/000129/files/673_ruby_23254.zip", "O_1913Abe")

## [1] "./NORUBY/O_1913Abe2.txt"

Aozora("http://www.aozora.gr.jp/cards/000129/files/678_ruby_22883.zip", "O_1917Kanoyouni")

## [1] "./NORUBY/O_1917Kanoyouni2.txt"

Aozora("http://www.aozora.gr.jp/cards/000129/files/42375_ruby_18247.zip", "O_1918Niwatori")

## [1] "./NORUBY/O_1918Niwatori2.txt"

Aozora("http://www.aozora.gr.jp/cards/000129/files/675_ruby_23191.zip", "O_1915Hutarino")

## [1] "./NORUBY/O_1915Hutarino2.txt"

#
# Aozora('http://www.aozora.gr.jp/cards/000129/files/2599_ruby_23032.zip','M_1922Shoukudo')
# Check the files in the folder
dir("NORUBY")

## [1] "O_1913Abe2.txt"       "O_1915Hutarino2.txt"  "O_1917Kanoyouni2.txt"
## [4] "O_1918Niwatori2.txt"  "asobi2.txt"           "karasu2.txt"         
## [7] "niwatori2.txt"

(folder <- getwd())  # check the current folder

## [1] "/home/ishida/AkuDa"

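Analyzing the files together

Note that NORUBY also contains three files that were not downloaded above (asobi2.txt, karasu2.txt, niwatori2.txt). The loop below therefore processes seven files rather than four, and niwatori2.txt has the same length as O_1918Niwatori2.txt.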

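library(RMeCab)
tmp <- paste(folder, "NORUBY", sep = "/")  # the files are in NORUBY under the current folder
setwd(tmp)  # move into that folder
txts <- dir()  # names of all files in the folder
mori <- data.frame()  # container for the results

for (i in txts) {
    x <- sum(nchar(readLines(i)))  # total number of characters in the file
    y <- RMeCabFreq(i)  # frequency table of morphemes
    kuten <- y[y$Info2 == "句点", ]  # rows tagged as sentence-final period (句点)
    z <- sum(y$Freq)  # total number of tokens, punctuation included
    mori <- rbind(mori, data.frame(text = i, chars = x, words = z, kuten = kuten$Freq))
}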

## file = O_1913Abe2.txt 
## length = 2862 
## file = O_1915Hutarino2.txt 
## length = 1388 
## file = O_1917Kanoyouni2.txt 
## length = 2249 
## file = O_1918Niwatori2.txt 
## length = 1928 
## file = asobi2.txt 
## length = 1256 
## file = karasu2.txt 
## length = 1041 
## file = niwatori2.txt 
## length = 1928

Next, compute the median sentence length.

median(mori$words / mori$kuten)  # words per sentence
median(mori$chars / mori$kuten)  # characters per sentence

The median comes to 20.1081 words per sentence and 29.583 characters per sentence.
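The ratio used above treats the number of sentence-final periods (句点) as the number of sentences, so words divided by periods gives an average sentence length for each work, and the median is then taken across works. The following sketch shows the arithmetic on a small data frame with the same column names. The file names and counts are invented for illustration and are not taken from the texts.

```r
# Toy counts, made up for this sketch (not results from the actual files)
toy <- data.frame(
  text  = c("a.txt", "b.txt", "c.txt"),
  chars = c(3000, 4500, 2400),  # characters per file
  words = c(2000, 3150, 1500),  # tokens per file
  kuten = c(100, 150, 80)       # sentence-final periods per file
)

words_per_sentence <- toy$words / toy$kuten  # 20, 21, 18.75
chars_per_sentence <- toy$chars / toy$kuten  # 30, 30, 30

median(words_per_sentence)  # 20
median(chars_per_sentence)  # 30
```

Because each work contributes a single ratio, the median here summarizes works, not individual sentences.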

Plot the results.

plot(mori$words/mori$kuten, main = "Words per sentence", xlab = "Work", type = "l")

[Figure: line plot of words per sentence for each work]

plot(mori$chars/mori$kuten, main = "Characters per sentence", xlab = "Work", type = "l")

[Figure: line plot of characters per sentence for each work]

setwd(folder)  # leave NORUBY before deleting it
unlink(tmp, recursive = TRUE)  # delete the folder holding the Ogai files so the next run starts clean

The case of Natsume Soseki

First, select the files, then download and process them.

Higan Sugi Made (彼岸過ぎまで), 1912: http://www.aozora.gr.jp/cards/000148/files/765_ruby_2469.zip

Garasudo no Uchi (硝子戸の中), 1910: http://www.aozora.gr.jp/cards/000148/files/792_ruby_2117.zip

Yume Juya (夢十夜), 1916: http://www.aozora.gr.jp/cards/000148/files/799_ruby_6024.zip

Eijitsu Shohin (永日小品), 1909: http://www.aozora.gr.jp/cards/000148/files/758_ruby_6056.zip

Buncho (文鳥), 1909, size 12220: http://www.aozora.gr.jp/cards/000148/files/753_ruby_1701.zip

Nihyaku Toka (二百十日), 1906, size 30688: http://www.aozora.gr.jp/cards/000148/files/751_ruby_1539.zip

Watakushi no Kojinshugi (私の個人主義), 1916, size 24336: http://www.aozora.gr.jp/cards/000148/files/772_ruby_33099.zip

London To (倫敦塔), 1916, size 21433: http://www.aozora.gr.jp/cards/000148/files/1076_ruby_4527.zip

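From these, we restrict the analysis to four works of relatively similar size: Buncho, Nihyaku Toka, Watakushi no Kojinshugi, and London To.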

source("/var/data/AozoraURL.R")

# Aozora ('http://www.aozora.gr.jp/cards/000148/files/765_ruby_2469.zip',
# 'N_1912Higan') Aozora
# ('http://www.aozora.gr.jp/cards/000148/files/792_ruby_2117.zip',
# 'N_1910Grasu') Aozora
# ('http://www.aozora.gr.jp/cards/000148/files/799_ruby_6024.zip',
# 'N_1909Eijitsu') Aozora
# ('http://www.aozora.gr.jp/cards/000148/files/758_ruby_6056.zip',
# 'N_1916Yume')
Aozora("http://www.aozora.gr.jp/cards/000148/files/753_ruby_1701.zip", "S_1909Bun")

## [1] "./NORUBY/S_1909Bun2.txt"

Aozora("http://www.aozora.gr.jp/cards/000148/files/751_ruby_1539.zip", "S_1906Nihyaku")

## [1] "./NORUBY/S_1906Nihyaku2.txt"

Aozora("http://www.aozora.gr.jp/cards/000148/files/772_ruby_33099.zip", "S_1916Watashi")

## [1] "./NORUBY/S_1916Watashi2.txt"

Aozora("http://www.aozora.gr.jp/cards/000148/files/1076_ruby_4527.zip", "S_1916London")

## [1] "./NORUBY/S_1916London2.txt"




# Check the files in the folder
dir("NORUBY")

## [1] "S_1906Nihyaku2.txt" "S_1909Bun2.txt"     "S_1916London2.txt" 
## [4] "S_1916Watashi2.txt"

(folder <- getwd())  # check the current folder

## [1] "/home/ishida/AkuDa"

Analyzing the files together

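library(RMeCab)
tmp <- paste(folder, "NORUBY", sep = "/")  # the files are in NORUBY under the current folder
setwd(tmp)  # move into that folder
txts <- dir()  # names of all files in the folder
natu <- data.frame()  # container for the results

for (i in txts) {
    x <- sum(nchar(readLines(i)))  # total number of characters in the file
    y <- RMeCabFreq(i)  # frequency table of morphemes
    kuten <- y[y$Info2 == "句点", ]  # rows tagged as sentence-final period (句点)
    z <- sum(y$Freq)  # total number of tokens, punctuation included
    natu <- rbind(natu, data.frame(text = i, chars = x, words = z, kuten = kuten$Freq))
}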

## file = S_1906Nihyaku2.txt 
## length = 2322 
## file = S_1909Bun2.txt 
## length = 1235 
## file = S_1916London2.txt 
## length = 2424 
## file = S_1916Watashi2.txt 
## length = 2094

Next, compute the median sentence length.

median(natu$words / natu$kuten)  # words per sentence
median(natu$chars / natu$kuten)  # characters per sentence

The median comes to 24.257 words per sentence and 36.7951 characters per sentence.

Plot the results.

plot(natu$words/natu$kuten, main = "Words per sentence", xlab = "Work", type = "l")

[Figure: line plot of words per sentence for each work]

plot(natu$chars/natu$kuten, main = "Characters per sentence", xlab = "Work", type = "l")

[Figure: line plot of characters per sentence for each work]

setwd(folder)  # leave NORUBY before deleting it
unlink(tmp, recursive = TRUE)  # delete the folder holding the Soseki files so the next run starts clean

Now check whether sentence length differs between Ogai and Soseki.

boxplot(natu$words/natu$kuten, mori$words/mori$kuten, names = c("Natsume", "Mori"))  # the argument is names, not name

[Figure: boxplots of words per sentence for Natsume and Mori]

Judging from this result alone, we cannot say that the two authors differ in sentence length.

Comparing the two authors

Changing perspective, we examine whether the two authors differ in how they combine particles with the comma (読点).

Load the files for Ogai and Soseki.

source("/var/data/AozoraURL.R")

Aozora("http://www.aozora.gr.jp/cards/000129/files/673_ruby_23254.zip", "O_1913Abe")

## [1] "./NORUBY/O_1913Abe2.txt"

Aozora("http://www.aozora.gr.jp/cards/000129/files/678_ruby_22883.zip", "O_1917Kanoyouni")

## [1] "./NORUBY/O_1917Kanoyouni2.txt"

Aozora("http://www.aozora.gr.jp/cards/000129/files/42375_ruby_18247.zip", "O_1918Niwatori")

## [1] "./NORUBY/O_1918Niwatori2.txt"

Aozora("http://www.aozora.gr.jp/cards/000129/files/675_ruby_23191.zip", "O_1915Hutarino")

## [1] "./NORUBY/O_1915Hutarino2.txt"

Aozora("http://www.aozora.gr.jp/cards/000148/files/753_ruby_1701.zip", "S_1909Bun")

## [1] "./NORUBY/S_1909Bun2.txt"

Aozora("http://www.aozora.gr.jp/cards/000148/files/751_ruby_1539.zip", "S_1906Nihyaku")

## [1] "./NORUBY/S_1906Nihyaku2.txt"

Aozora("http://www.aozora.gr.jp/cards/000148/files/772_ruby_33099.zip", "S_1916Watashi")

## [1] "./NORUBY/S_1916Watashi2.txt"

Aozora("http://www.aozora.gr.jp/cards/000148/files/1076_ruby_4527.zip", "S_1916London")

## [1] "./NORUBY/S_1916London2.txt"


# Check the files in the folder
dir("NORUBY")

## [1] "O_1913Abe2.txt"       "O_1915Hutarino2.txt"  "O_1917Kanoyouni2.txt"
## [4] "O_1918Niwatori2.txt"  "S_1906Nihyaku2.txt"   "S_1909Bun2.txt"      
## [7] "S_1916London2.txt"    "S_1916Watashi2.txt"

(folder <- getwd())  # check the current folder

## [1] "/home/ishida/AkuDa"

Analyzing the files together

library(RMeCab)
tmp <- paste(folder, "NORUBY", sep = "/")  # the files are in NORUBY under the current folder
setwd(tmp)  # move into that folder

# Extract character N-grams from every file in the folder
x <- docNgram(tmp, type = 0)

## file = /home/ishida/AkuDa/NORUBY/O_1913Abe2.txt Ngram = 2 
## length = 8463 
## 
## file = /home/ishida/AkuDa/NORUBY/O_1915Hutarino2.txt Ngram = 2 
## length = 4324 
## 
## file = /home/ishida/AkuDa/NORUBY/O_1917Kanoyouni2.txt Ngram = 2 
## length = 7120 
## 
## file = /home/ishida/AkuDa/NORUBY/O_1918Niwatori2.txt Ngram = 2 
## length = 6226 
## 
## file = /home/ishida/AkuDa/NORUBY/S_1906Nihyaku2.txt Ngram = 2 
## length = 7649 
## 
## file = /home/ishida/AkuDa/NORUBY/S_1909Bun2.txt Ngram = 2 
## length = 3867 
## 
## file = /home/ishida/AkuDa/NORUBY/S_1916London2.txt Ngram = 2 
## length = 7205 
## 
## file = /home/ishida/AkuDa/NORUBY/S_1916Watashi2.txt Ngram = 2 
## length = 6625
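The row names used in the next step, such as "[は-、]", suggest that each row of the result is a pair of adjacent characters. As an illustration of that idea only, and not of how docNgram works internally, a base-R sketch that counts character bigrams in a single string might look like this. char_bigrams is a hypothetical helper and the sample sentence is made up.

```r
# Hypothetical helper: count adjacent character pairs, labeled "[a-b]"
char_bigrams <- function(s) {
  ch <- strsplit(s, "")[[1]]         # split the string into single characters
  n <- length(ch)
  if (n < 2) return(table(character(0)))
  pairs <- paste0("[", ch[-n], "-", ch[-1], "]")  # each character with its successor
  table(pairs)
}

# Sample text written for this sketch
bg <- char_bigrams("雨は、降る。風は、吹く。")
bg["[は-、]"]  # how often は is directly followed by a comma
sum(bg)        # total number of bigrams: one fewer than the number of characters
```

A particle followed by a comma is simply one of these bigrams, which is why the analysis can pick them out by row name.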

From the N-grams just read in, extract several combinations of a particle followed by a comma.

x <- x[rownames(x) %in% c("[と-、]", "[て-、]", "[は-、]", "[が-、]", 
    "[で-、]", "[に-、]", "[ら-、]", "[も-、]"), ]

Run a principal component analysis.

x <- princomp(t(x))

Plot the result.

biplot(x)

[Figure: biplot of the principal component analysis]
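To make the last three steps concrete, here is a small self-contained sketch of the same pattern: keep selected rows of a bigram-by-document count matrix, transpose it so that documents become the observations, and pass it to princomp. The counts and the document labels A1 to B2 are invented for illustration and say nothing about the actual texts.

```r
# Toy bigram-by-document counts, invented for this sketch
m <- matrix(
  c(12,  9, 11, 10,  3,  4,
     5,  6,  4,  5, 14, 12,
     8,  7,  9,  6,  7,  9,
     2,  3,  1,  4,  2,  3),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("[は-、]", "[も-、]", "[て-、]", "[の-で]"),
    c("A1", "A2", "A3", "A4", "B1", "B2")
  )
)

# Keep only the particle-plus-comma rows of interest
keep <- c("[は-、]", "[も-、]", "[て-、]")
sub <- m[rownames(m) %in% keep, ]

# princomp expects observations in rows, so transpose: 6 documents x 3 bigrams
pca <- princomp(t(sub))
summary(pca)  # variance explained by each component
biplot(pca)   # documents as points, bigrams as arrows
```

In a biplot, documents that lie in the direction of an arrow have relatively high counts for that bigram, which is how the figure is read in the conclusion. princomp needs at least as many observations as variables, so the number of documents must not be smaller than the number of bigrams kept.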

Conclusion

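In this figure, Mori Ogai's works cluster toward the top, which suggests that Ogai frequently uses 「で、」, 「は、」, and 「て、」. Natsume Soseki's works, by contrast, cluster toward the lower right, which suggests that Soseki favors 「も、」, 「と、」, and 「ら、」.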