Session 4 (November 6): Task Check and Weekly Assignment
To Do
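□ Read in the data
□ Summarize the data and compute descriptive statistics
□ Create a frequency table
□ Create a histogram
□ Create a box plot
□ Create a scatter plot
□ Compute a correlation coefficient
□ Create a cross tabulation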
Assignment
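・Draw a box plot of the Japanese (kokugo) scores by class.
・Compute the correlation coefficient between the science (rika) and social studies (syakai) scores.
・When height is split into four groups, how many students from class C fall into the tallest group (157 cm to 172 cm)?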
First, read in the data and convert the type of the sex variable. This is a review of the previous session.
sample <- read.csv("sample(mac).csv", header = TRUE, na.strings = "*")
sample$sex <- factor(sample$sex, labels = c("male", "female"))
summary(sample)
##       ID         class     sex        height         weight
##  Min.   :  1.0   A:34   male  :50   Min.   :132   Min.   :33.2
##  1st Qu.: 25.8   B:33   female:50   1st Qu.:145   1st Qu.:50.6
##  Median : 50.5   C:33               Median :150   Median :56.0
##  Mean   : 50.5                      Mean   :151   Mean   :56.8
##  3rd Qu.: 75.2                      3rd Qu.:157   3rd Qu.:63.1
##  Max.   :100.0                      Max.   :172   Max.   :87.0
##
##     kokugo         sansuu          rika          syakai
##  Min.   :34.0   Min.   :58.0   Min.   :34.0   Min.   :20.0
##  1st Qu.:55.0   1st Qu.:68.0   1st Qu.:46.5   1st Qu.:40.8
##  Median :64.0   Median :72.0   Median :51.0   Median :48.0
##  Mean   :64.5   Mean   :71.5   Mean   :50.5   Mean   :49.4
##  3rd Qu.:74.0   3rd Qu.:75.5   3rd Qu.:54.0   3rd Qu.:57.2
##  Max.   :94.0   Max.   :86.0   Max.   :66.0   Max.   :86.0
##  NA's   :1      NA's   :1      NA's   :1
##      eigo
##  Min.   :25.0
##  1st Qu.:49.0
##  Median :61.0
##  Mean   :59.9
##  3rd Qu.:71.0
##  Max.   :94.0
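To see what na.strings does without the course file, here is a minimal sketch that reads a made-up three-row CSV from a text string. The column values, the object name toy, and the assumption that sex is coded 1 and 2 in the file are illustrative only, not taken from sample(mac).csv.

```r
# Hypothetical CSV supplied as text, so no file is needed.
csv_text <- "ID,sex,kokugo
1,1,70
2,2,*
3,1,64"

# "*" marks a missing value, as in the course data.
toy <- read.csv(text = csv_text, header = TRUE, na.strings = "*")

# Assumption: sex is stored as the codes 1 and 2.
toy$sex <- factor(toy$sex, levels = c(1, 2), labels = c("male", "female"))

str(toy)            # kokugo is numeric, sex is a factor
is.na(toy$kokugo)   # the "*" entry has become NA
```

Without na.strings = "*", the asterisk would force the whole kokugo column to be read as text (character or factor, depending on the R version) rather than as numbers.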
The function for the mean is mean. The summary function already reports it, but just to be sure.
mean(sample$height)
## [1] 151.4
mean(sample$weight)
## [1] 56.84
mean(sample$kokugo)
## [1] NA
If the variable contains missing values, the result is returned as NA, so special handling is needed.
The na.rm option excludes the missing values.
mean(sample$kokugo, na.rm = TRUE)
## [1] 64.48
var(sample$sansuu, na.rm = TRUE)
## [1] 35.56
sd(sample$rika, na.rm = TRUE)
## [1] 5.616
median(sample$syakai)
## [1] 48
quantile(sample$rika, probs = seq(0, 1, 0.25), na.rm = TRUE)
## 0% 25% 50% 75% 100%
## 34.0 46.5 51.0 54.0 66.0
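The effect of na.rm can be checked on a small made-up vector. This is only a sketch; the scores in x are invented for illustration.

```r
# Made-up scores with one missing value.
x <- c(52, 48, NA, 60, 55)

mean(x)                  # NA, because one value is missing
mean(x, na.rm = TRUE)    # mean of the four observed values
sum(!is.na(x))           # number of observed values actually used

# sd is the square root of var, with the same na.rm handling.
all.equal(sd(x, na.rm = TRUE), sqrt(var(x, na.rm = TRUE)))

# quantile stops with an error on NA unless na.rm = TRUE is given.
quantile(x, probs = seq(0, 1, 0.25), na.rm = TRUE)
```

Note that median also returns NA for x unless na.rm = TRUE is set; the call on syakai above works as is because that variable has no missing values.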
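When you want group-wise characteristics, such as the mean height of the males or the variability of the science scores in each class, use the by function.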
by(sample$height, sample$sex, mean)
## sample$sex: male
## [1] 158
## --------------------------------------------------------
## sample$sex: female
## [1] 144.8
by(sample$rika, sample$class, sd, na.rm = TRUE)
## sample$class: A
## [1] 5.328
## --------------------------------------------------------
## sample$class: B
## [1] 5.011
## --------------------------------------------------------
## sample$class: C
## [1] 6.39
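As a possible alternative to by, tapply returns the same group-wise values as a plain named vector, which is easier to reuse in later calculations. The sketch below uses a made-up data frame (toy), not the course data.

```r
# Made-up data: five people, one missing height.
toy <- data.frame(
  sex    = factor(c("male", "male", "female", "female", "female")),
  height = c(160, 170, 150, NA, 154)
)

# Same pattern as above: data, grouping factor, function, extra arguments.
by(toy$height, toy$sex, mean, na.rm = TRUE)

# tapply gives the same numbers as a named vector.
means <- tapply(toy$height, toy$sex, mean, na.rm = TRUE)
means
means[["male"]] - means[["female"]]   # difference between the group means
```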
Creating a frequency table. Use the table function.
table(sample$class)
##
## A B C
## 34 33 33
The hist function bins a continuous variable automatically. It also draws the plot.
height.hist <- hist(sample$height)
height.hist
## $breaks
## [1] 130 135 140 145 150 155 160 165 170 175
##
## $counts
## [1] 1 7 17 25 17 16 7 6 4
##
## $intensities
## [1] 0.002 0.014 0.034 0.050 0.034 0.032 0.014 0.012 0.008
##
## $density
## [1] 0.002 0.014 0.034 0.050 0.034 0.032 0.014 0.012 0.008
##
## $mids
## [1] 132.5 137.5 142.5 147.5 152.5 157.5 162.5 167.5 172.5
##
## $xname
## [1] "sample$height"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
hist(sample$height, col = 1:10, breaks = quantile(sample$height))
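If only the frequency table of a continuous variable is wanted, one option is to call hist with plot = FALSE, or to combine cut and table. The sketch below uses made-up heights and arbitrarily chosen breaks.

```r
# Made-up heights in cm.
heights <- c(132, 138, 141, 144, 147, 149, 150, 153, 158, 171)

# plot = FALSE returns the histogram object without drawing anything.
h <- hist(heights, breaks = seq(130, 175, by = 15), plot = FALSE)
h$breaks
h$counts

# The same counts as a labelled frequency table.
# include.lowest = TRUE matches the default behaviour of hist.
freq <- table(cut(heights, breaks = h$breaks, include.lowest = TRUE))
freq
```

When the breaks are not equally spaced, as with the quantile breaks above, hist draws densities rather than counts by default.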
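To look at the spread of the data within each group, a box plot (boxplot) works well.
Incidentally, use this example to learn the general calling format: a formula (response ~ grouping variable) plus a data argument.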
boxplot(height ~ sex, data = sample)
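The formula format is shared by other functions as well. As a sketch on a made-up data frame (toy, with invented scores), the same formula can be passed to boxplot and to aggregate.

```r
# Made-up scores for two classes.
toy <- data.frame(
  class = factor(rep(c("A", "B"), each = 5)),
  score = c(50, 55, 60, 65, 70, 40, 60, 62, 64, 90)
)

# response ~ group, data = ...; plot = FALSE returns the numbers only.
bp <- boxplot(score ~ class, data = toy, plot = FALSE)
bp$names   # group labels
bp$stats   # whisker ends, hinges and median, one column per group
bp$out     # points beyond the whiskers

# The same formula works with aggregate.
med <- aggregate(score ~ class, data = toy, FUN = median)
med
```

Dropping plot = FALSE draws the box plot itself, with the points in bp$out shown as individual outliers.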
From here on we deal with the relationship between two variables. First, the scatter plot.
plot(sample$height, sample$weight)
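Compute Pearson's correlation coefficient. Note that missing values are handled differently here: cor takes a use argument instead of na.rm.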
cor(sample$height, sample$weight)
## [1] 0.7331
cor(sample$kokugo, sample$sansuu, use = "complete.obs")
## [1] -0.091
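What use = "complete.obs" does can be checked by hand: it keeps only the cases where both variables are observed. A minimal sketch with made-up vectors x and y:

```r
# Made-up paired data; the last x value is missing.
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 6, 9, 10)

cor(x, y)                                  # NA
r1 <- cor(x, y, use = "complete.obs")      # drops the incomplete pair

# The same thing done manually.
ok <- complete.cases(x, y)
r2 <- cor(x[ok], y[ok])

all.equal(r1, r2)
```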
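Creating a cross tabulation.
Before that, the cut function divides a continuous variable into classes. Note that cut uses left-open intervals by default, so the minimum value matches no interval and becomes NA.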
sample$height.class <- cut(sample$height, breaks = quantile(sample$height))
sample$height.class
## [1] (157,172] (157,172] (157,172] (157,172] (157,172] (150,157] (145,150]
## [8] (145,150] (150,157] (157,172] (150,157] (150,157] (150,157] (157,172]
## [15] (157,172] (145,150] (157,172] (150,157] (150,157] (145,150] (157,172]
## [22] (157,172] (132,145] (150,157] (157,172] (150,157] (150,157] (145,150]
## [29] (145,150] (157,172] (157,172] (157,172] (157,172] (157,172] (145,150]
## [36] (145,150] (157,172] (157,172] (157,172] (157,172] (157,172] (150,157]
## [43] (150,157] (157,172] (150,157] (150,157] (157,172] (157,172] (145,150]
## [50] (150,157] (145,150] (145,150] (132,145] (145,150] (145,150] (132,145]
## [57] (150,157] (150,157] (132,145] (132,145] (132,145] (132,145] <NA>
## [64] (132,145] (132,145] (150,157] (132,145] (132,145] (132,145] (150,157]
## [71] (132,145] (145,150] (150,157] (132,145] (145,150] (132,145] (132,145]
## [78] (150,157] (150,157] (145,150] (132,145] (145,150] (145,150] (145,150]
## [85] (150,157] (150,157] (145,150] (132,145] (145,150] (132,145] (145,150]
## [92] (132,145] (132,145] (145,150] (132,145] (132,145] (145,150] (145,150]
## [99] (150,157] (132,145]
## Levels: (132,145] (145,150] (150,157] (157,172]
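The single <NA> in the output above arises because the intervals are open on the left, so the smallest height is not contained in (132,145]. The sketch below, on made-up heights, shows the effect and one way to avoid it with include.lowest = TRUE.

```r
# Made-up heights in cm.
h <- c(132, 140, 145, 150, 151, 157, 160, 172)
q <- quantile(h)

default_cut <- cut(h, breaks = q)                         # (a,b] intervals
closed_cut  <- cut(h, breaks = q, include.lowest = TRUE)  # first one is [a,b]

sum(is.na(default_cut))      # the minimum is lost
which(is.na(default_cut))    # position of the minimum
sum(is.na(closed_cut))       # nothing is lost
levels(closed_cut)
```

With include.lowest = TRUE the first level is printed with a square bracket on the left, and every case is assigned to a class.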
The cross tabulation.
tab1 <- table(sample$height.class, sample$class)
tab1
##
## A B C
## (132,145] 6 12 6
## (145,150] 10 8 7
## (150,157] 9 4 12
## (157,172] 9 9 7
Adding the marginal frequencies
addmargins(tab1)
##
## A B C Sum
## (132,145] 6 12 6 24
## (145,150] 10 8 7 25
## (150,157] 9 4 12 25
## (157,172] 9 9 7 25
## Sum 34 33 32 99
Relative frequencies (proportions of the grand total)
prop.table(tab1)
##
## A B C
## (132,145] 0.06061 0.12121 0.06061
## (145,150] 0.10101 0.08081 0.07071
## (150,157] 0.09091 0.04040 0.12121
## (157,172] 0.09091 0.09091 0.07071
For row-wise relative frequencies pass 1 as the second argument; for column-wise relative frequencies pass 2.
prop.table(tab1, 1)
##
## A B C
## (132,145] 0.25 0.50 0.25
## (145,150] 0.40 0.32 0.28
## (150,157] 0.36 0.16 0.48
## (157,172] 0.36 0.36 0.28
prop.table(tab1, 2)
##
## A B C
## (132,145] 0.1765 0.3636 0.1875
## (145,150] 0.2941 0.2424 0.2188
## (150,157] 0.2647 0.1212 0.3750
## (157,172] 0.2647 0.2727 0.2188
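A quick way to keep the margin argument straight is to check which sums equal 1. A minimal sketch on a made-up 2 x 2 table (the counts and labels are invented):

```r
# Made-up counts: rows are height groups, columns are classes.
tab <- as.table(matrix(
  c(6, 10, 12, 8), nrow = 2,
  dimnames = list(height = c("low", "high"), class = c("A", "B"))
))

sum(prop.table(tab))           # whole table sums to 1
rowSums(prop.table(tab, 1))    # margin 1: each row sums to 1
colSums(prop.table(tab, 2))    # margin 2: each column sums to 1

# Row percentages, rounded for reading.
round(100 * prop.table(tab, 1), 1)
```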
The xtabs function can also be used.
tab1 <- xtabs(~height.class + class, data = sample)
tab1
## class
## height.class A B C
## (132,145] 6 12 6
## (145,150] 10 8 7
## (150,157] 9 4 12
## (157,172] 9 9 7
addmargins(tab1)
## class
## height.class A B C Sum
## (132,145] 6 12 6 24
## (145,150] 10 8 7 25
## (150,157] 9 4 12 25
## (157,172] 9 9 7 25
## Sum 34 33 32 99
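The grand total is 99 rather than 100 because both table and xtabs silently drop cases whose class is NA. As a sketch on made-up data (toy), useNA = "ifany" in table makes such cases visible.

```r
# Made-up data: one case has no height group.
toy <- data.frame(
  grp   = factor(c("low", "high", NA, "low", "high")),
  class = factor(c("A", "A", "B", "B", "B"))
)

t1 <- table(toy$grp, toy$class)
t2 <- xtabs(~ grp + class, data = toy)
sum(t1)   # the NA case is not counted
sum(t2)   # same here

# Show the NA cases as their own row.
t3 <- table(toy$grp, toy$class, useNA = "ifany")
addmargins(t3)
```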
Bonus
The psych package provides the handy describe function. With it, descriptive statistics are fully covered.
library(psych)
describe(sample)
## var n mean sd median trimmed mad min max
## ID 1 100 50.50 29.01 50.50 50.50 37.06 1.00 100.00
## class* 2 100 1.99 0.82 2.00 1.99 1.48 1.00 3.00
## sex* 3 100 1.50 0.50 1.50 1.50 0.74 1.00 2.00
## height 4 100 151.38 9.05 149.99 150.89 8.94 131.91 172.28
## weight 5 100 56.84 9.75 56.04 56.53 9.12 33.25 86.97
## kokugo 6 99 64.48 13.01 64.00 64.31 13.34 34.00 94.00
## sansuu 7 99 71.54 5.96 72.00 71.60 5.93 58.00 86.00
## rika 8 99 50.47 5.62 51.00 50.53 5.93 34.00 66.00
## syakai 9 100 49.44 13.04 48.00 49.04 11.86 20.00 86.00
## eigo 10 100 59.88 14.95 61.00 59.76 16.31 25.00 94.00
## height.class* 11 99 2.52 1.12 3.00 2.52 1.48 1.00 4.00
## range skew kurtosis se
## ID 99.00 0.00 -1.24 2.90
## class* 2.00 0.02 -1.54 0.08
## sex* 1.00 0.00 -2.02 0.05
## height 40.37 0.42 -0.40 0.91
## weight 53.72 0.35 0.13 0.98
## kokugo 60.00 0.11 -0.55 1.31
## sansuu 28.00 -0.12 -0.47 0.60
## rika 32.00 -0.17 -0.01 0.56
## syakai 66.00 0.26 -0.01 1.30
## eigo 69.00 0.05 -0.42 1.49
## height.class* 3.00 -0.02 -1.38 0.11
describe(sample$height)
## var n mean sd median trimmed mad min max range skew kurtosis
## 1 1 100 151.4 9.05 150 150.9 8.94 131.9 172.3 40.37 0.42 -0.4
## se
## 1 0.91
With the describeBy function, descriptive statistics for each group are also easy to obtain.
describeBy(sample$height, group = sample$sex)
## group: male
## var n mean sd median trimmed mad min max range skew kurtosis
## 1 1 50 158 7.37 157.3 157.7 6.35 141.2 172.3 31.04 0.18 -0.63
## se
## 1 1.04
## --------------------------------------------------------
## group: female
## var n mean sd median trimmed mad min max range skew kurtosis
## 1 1 50 144.8 4.8 145.8 145.1 4.71 131.9 152.7 20.81 -0.43 -0.32
## se
## 1 0.68
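The gmodels package prints cross tabulations nicely. In the run recorded here the package failed to load, so CrossTable could not be found; it probably needs to be installed or reinstalled first, for example with install.packages("gmodels").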
library(gmodels)
## Error: package/namespace load failed for 'gmodels'
CrossTable(sample$height.class, sample$class)
## Error: could not find function "CrossTable"