Session 4 (November 6): Task Check and Weekly Assignment

Data Aggregation and Visualization

To Do
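□ Read in the data
□ Summarize the data and compute descriptive statistics
□ Create a frequency table
□ Create a histogram
□ Create a box plot
□ Create a scatter plot
□ Compute a correlation coefficient
□ Create a cross-tabulation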

Assignment
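・Draw box plots of the Japanese (kokugo) scores by class.
・Compute the correlation coefficient between the science (rika) and social studies (syakai) scores.
・When height is split into four groups, how many students in class C fall into the tallest group (157 cm to 172 cm)?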

First, read in the data and convert sex to a factor. This is a review of the previous session.

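# stringsAsFactors = TRUE keeps class a factor on R >= 4.0, as in the summary below
sample <- read.csv("sample(mac).csv", header = TRUE, na.strings = "*", stringsAsFactors = TRUE)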
sample$sex <- factor(sample$sex, labels = c("male", "female"))
summary(sample)

##        ID        class      sex         height        weight    
##  Min.   :  1.0   A:34   male  :50   Min.   :132   Min.   :33.2  
##  1st Qu.: 25.8   B:33   female:50   1st Qu.:145   1st Qu.:50.6  
##  Median : 50.5   C:33               Median :150   Median :56.0  
##  Mean   : 50.5                      Mean   :151   Mean   :56.8  
##  3rd Qu.: 75.2                      3rd Qu.:157   3rd Qu.:63.1  
##  Max.   :100.0                      Max.   :172   Max.   :87.0  
##                                                                 
##      kokugo         sansuu          rika          syakai    
##  Min.   :34.0   Min.   :58.0   Min.   :34.0   Min.   :20.0  
##  1st Qu.:55.0   1st Qu.:68.0   1st Qu.:46.5   1st Qu.:40.8  
##  Median :64.0   Median :72.0   Median :51.0   Median :48.0  
##  Mean   :64.5   Mean   :71.5   Mean   :50.5   Mean   :49.4  
##  3rd Qu.:74.0   3rd Qu.:75.5   3rd Qu.:54.0   3rd Qu.:57.2  
##  Max.   :94.0   Max.   :86.0   Max.   :66.0   Max.   :86.0  
##  NA's   :1      NA's   :1      NA's   :1                    
##       eigo     
##  Min.   :25.0  
##  1st Qu.:49.0  
##  Median :61.0  
##  Mean   :59.9  
##  3rd Qu.:71.0  
##  Max.   :94.0
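
As a small self-contained illustration of the same steps, the sketch below reads a made-up CSV from a text string instead of the course file. The data and the name `toy` are hypothetical; only `read.csv`, `na.strings`, and `factor` come from the text above.

```r
# Toy CSV text standing in for the real file; "*" marks a missing score.
csv_text <- "ID,class,sex,kokugo
1,A,1,70
2,B,2,*
3,A,2,55"

toy <- read.csv(text = csv_text, header = TRUE, na.strings = "*",
                stringsAsFactors = TRUE)

# sex is coded 1/2 in the file; attach readable labels.
toy$sex <- factor(toy$sex, labels = c("male", "female"))

str(toy)                # class and sex are factors; kokugo has one NA
sum(is.na(toy$kokugo))  # number of missing kokugo scores
```

Because "*" is declared as the missing-value marker, the kokugo column is read as numbers rather than as text.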

The function for the mean is mean. The summary function already reports it, but here it is explicitly.

mean(sample$height)

## [1] 151.4

mean(sample$weight)

## [1] 56.84

mean(sample$kokugo)

## [1] NA

When the data contain missing values, the result is NA, so they need special handling.
The na.rm option excludes missing values from the calculation.

mean(sample$kokugo, na.rm = TRUE)

## [1] 64.48
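
A minimal sketch of the same behaviour on a made-up vector (the values in `x` are hypothetical):

```r
x <- c(60, 70, NA, 80)   # hypothetical scores, one missing

mean(x)                  # NA, because one value is missing
mean(x, na.rm = TRUE)    # 70, the mean of the three observed values
sum(!is.na(x))           # 3, the number of values actually used
```

It is worth checking how many values remain after removal, since the mean is then based on fewer observations.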

var(sample$sansuu, na.rm = TRUE)

## [1] 35.56

sd(sample$rika, na.rm = TRUE)

## [1] 5.616

median(sample$syakai)

## [1] 48

quantile(sample$rika, probs = seq(0, 1, 0.25), na.rm = TRUE)

##   0%  25%  50%  75% 100% 
## 34.0 46.5 51.0 54.0 66.0

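To get group-wise summaries, such as the mean height of males or the spread (here the standard deviation) of science scores in each class, use the by function.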

by(sample$height, sample$sex, mean)

## sample$sex: male
## [1] 158
## -------------------------------------------------------- 
## sample$sex: female
## [1] 144.8

by(sample$rika, sample$class, sd, na.rm = TRUE)

## sample$class: A
## [1] 5.328
## -------------------------------------------------------- 
## sample$class: B
## [1] 5.011
## -------------------------------------------------------- 
## sample$class: C
## [1] 6.39
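
The by function is not the only option. As a sketch on made-up data (the data frame `toy` is hypothetical), base R's tapply and aggregate compute the same group-wise statistics and return them in forms that are easier to reuse:

```r
toy <- data.frame(
  sex    = factor(c("male", "male", "female", "female")),
  height = c(160, 170, 150, NA)
)

# tapply returns a named vector, one element per group.
means <- tapply(toy$height, toy$sex, mean, na.rm = TRUE)
means

# aggregate returns a data frame; with the formula interface,
# rows containing NA are dropped by default.
agg <- aggregate(height ~ sex, data = toy, FUN = mean)
agg
```

Which one to use is mostly a matter of what shape of result is needed afterwards.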

Create a frequency table with the table function.

table(sample$class)

## 
##  A  B  C 
## 34 33 33
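
One point to keep in mind: by default table leaves out missing values. The sketch below uses a made-up vector (`cls` is hypothetical) to show the default and the `useNA` argument:

```r
cls <- c("A", "B", "A", NA, "C", "A")   # hypothetical class labels

freq <- table(cls)                      # NA is dropped by default
freq

freq_na <- table(cls, useNA = "ifany")  # adds an NA column when needed
freq_na
```

Comparing `sum(freq)` with `length(cls)` is a quick way to see whether any cases were dropped.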

The hist function automatically bins a continuous variable into suitable intervals, and it draws the plot as well.

height.hist <- hist(sample$height)

[Figure: histogram of sample$height]

height.hist

## $breaks
##  [1] 130 135 140 145 150 155 160 165 170 175
## 
## $counts
## [1]  1  7 17 25 17 16  7  6  4
## 
## $intensities
## [1] 0.002 0.014 0.034 0.050 0.034 0.032 0.014 0.012 0.008
## 
## $density
## [1] 0.002 0.014 0.034 0.050 0.034 0.032 0.014 0.012 0.008
## 
## $mids
## [1] 132.5 137.5 142.5 147.5 152.5 157.5 162.5 167.5 172.5
## 
## $xname
## [1] "sample$height"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

hist(sample$height, col = 1:10, breaks = quantile(sample$height))

[Figure: histogram of sample$height with breaks at the quartiles]
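
To look only at the binning without drawing anything, hist accepts `plot = FALSE`. The following sketch uses made-up heights (`h` is hypothetical) and explicit 5 cm breaks:

```r
h <- c(133, 141, 144, 148, 149, 152, 156, 158, 163, 171)  # hypothetical heights

# plot = FALSE returns the histogram object without opening a plot.
hh <- hist(h, breaks = seq(130, 175, by = 5), plot = FALSE)

hh$breaks   # bin boundaries
hh$counts   # number of observations in each bin
hh$mids     # bin midpoints
```

Each bin is closed on the right, so a value of exactly 145 would be counted in the 140 to 145 bin.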

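To see the spread of the data by group, a box plot (boxplot) works well.
Incidentally, this is a good place to learn the general calling format used by many R functions: a formula such as height ~ sex together with a data argument.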

boxplot(height ~ sex, data = sample)

[Figure: box plots of height by sex]
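
The formula reads as "response ~ grouping variable". As a sketch on made-up scores (the data frame `toy` is hypothetical), calling boxplot with `plot = FALSE` returns the numbers behind each box instead of drawing it:

```r
toy <- data.frame(
  class  = factor(rep(c("A", "B"), each = 5)),
  kokugo = c(50, 60, 70, 80, 90, 40, 45, 50, 55, 100)
)

# Same formula interface as above; plot = FALSE returns the statistics only.
bp <- boxplot(kokugo ~ class, data = toy, plot = FALSE)

bp$names   # group labels, one per box
bp$stats   # per column: lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out     # points drawn individually as outliers
```

In this toy data the score of 100 in class B lies more than 1.5 times the box height above the upper hinge, so it is reported as an outlier rather than as the whisker end.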

From here on we look at the relationship between two variables. First, a scatter plot.

plot(sample$height, sample$weight)

[Figure: scatter plot of weight (y) against height (x)]

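Compute Pearson's correlation coefficient. Note that missing values are handled differently here: cor takes a use argument instead of na.rm.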

cor(sample$height, sample$weight)

## [1] 0.7331

cor(sample$kokugo, sample$sansuu, use = "complete.obs")

## [1] -0.091
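
A minimal sketch of the difference, using made-up vectors (`x` and `y` are hypothetical):

```r
x <- c(1, 2, 3, 4, NA)   # one missing value
y <- c(2, 4, 6, 8, 10)

cor(x, y)                         # NA: the default keeps every pair
cor(x, y, use = "complete.obs")   # uses only pairs where both values are present
```

With `use = "complete.obs"` the fifth pair is dropped entirely, so the coefficient is computed from the remaining four pairs.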

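Creating a cross-tabulation.
Before that, the cut function splits a continuous variable into classes. Note that the intervals are open on the left, so the observation equal to the minimum falls outside the first interval and becomes NA (one <NA> appears in the output below); passing include.lowest = TRUE to cut would keep it.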

sample$height.class <- cut(sample$height, breaks = quantile(sample$height))
sample$height.class

##   [1] (157,172] (157,172] (157,172] (157,172] (157,172] (150,157] (145,150]
##   [8] (145,150] (150,157] (157,172] (150,157] (150,157] (150,157] (157,172]
##  [15] (157,172] (145,150] (157,172] (150,157] (150,157] (145,150] (157,172]
##  [22] (157,172] (132,145] (150,157] (157,172] (150,157] (150,157] (145,150]
##  [29] (145,150] (157,172] (157,172] (157,172] (157,172] (157,172] (145,150]
##  [36] (145,150] (157,172] (157,172] (157,172] (157,172] (157,172] (150,157]
##  [43] (150,157] (157,172] (150,157] (150,157] (157,172] (157,172] (145,150]
##  [50] (150,157] (145,150] (145,150] (132,145] (145,150] (145,150] (132,145]
##  [57] (150,157] (150,157] (132,145] (132,145] (132,145] (132,145] <NA>     
##  [64] (132,145] (132,145] (150,157] (132,145] (132,145] (132,145] (150,157]
##  [71] (132,145] (145,150] (150,157] (132,145] (145,150] (132,145] (132,145]
##  [78] (150,157] (150,157] (145,150] (132,145] (145,150] (145,150] (145,150]
##  [85] (150,157] (150,157] (145,150] (132,145] (145,150] (132,145] (145,150]
##  [92] (132,145] (132,145] (145,150] (132,145] (132,145] (145,150] (145,150]
##  [99] (150,157] (132,145]
## Levels: (132,145] (145,150] (150,157] (157,172]
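
The single <NA> above comes from the smallest height, which sits exactly on the lowest break. The sketch below reproduces this on made-up heights (`h` is hypothetical) and shows the effect of `include.lowest = TRUE`:

```r
h <- c(132, 140, 145, 150, 157, 160, 172)   # hypothetical heights
q <- quantile(h)                            # 0%, 25%, 50%, 75%, 100%

# Default: intervals are (a, b], so the minimum is not in any interval.
g1 <- cut(h, breaks = q)
sum(is.na(g1))    # one value is lost

# include.lowest = TRUE closes the first interval on the left.
g2 <- cut(h, breaks = q, include.lowest = TRUE)
sum(is.na(g2))    # nothing is lost
levels(g2)        # the first level now starts with "["
```

Applied to the course data, this would keep all 100 students in the cross-tabulation instead of 99.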

The cross-tabulation itself:

tab1 <- table(sample$height.class, sample$class)
tab1

##            
##              A  B  C
##   (132,145]  6 12  6
##   (145,150] 10  8  7
##   (150,157]  9  4 12
##   (157,172]  9  9  7

Adding marginal totals

addmargins(tab1)

##            
##              A  B  C Sum
##   (132,145]  6 12  6  24
##   (145,150] 10  8  7  25
##   (150,157]  9  4 12  25
##   (157,172]  9  9  7  25
##   Sum       34 33 32  99

Relative frequencies (proportions of the grand total)

prop.table(tab1)

##            
##                   A       B       C
##   (132,145] 0.06061 0.12121 0.06061
##   (145,150] 0.10101 0.08081 0.07071
##   (150,157] 0.09091 0.04040 0.12121
##   (157,172] 0.09091 0.09091 0.07071

For row-wise relative frequencies, pass 1 as the margin argument; for column-wise relative frequencies, pass 2.

prop.table(tab1, 1)

##            
##                A    B    C
##   (132,145] 0.25 0.50 0.25
##   (145,150] 0.40 0.32 0.28
##   (150,157] 0.36 0.16 0.48
##   (157,172] 0.36 0.36 0.28

prop.table(tab1, 2)

##            
##                  A      B      C
##   (132,145] 0.1765 0.3636 0.1875
##   (145,150] 0.2941 0.2424 0.2188
##   (150,157] 0.2647 0.1212 0.3750
##   (157,172] 0.2647 0.2727 0.2188
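
A quick way to remember which margin is which is to check what sums to 1. The sketch below uses a small made-up table (`tab` and its labels are hypothetical):

```r
tab <- table(
  group = c("low", "low", "high", "high", "high"),
  class = c("A", "B", "A", "A", "B")
)

sum(prop.table(tab))          # whole table sums to 1
rowSums(prop.table(tab, 1))   # margin 1: each row sums to 1
colSums(prop.table(tab, 2))   # margin 2: each column sums to 1
```

So margin 1 answers "within this height group, what share is in each class", and margin 2 answers "within this class, what share is in each height group".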

The xtabs function can also be used.

tab1 <- xtabs(~height.class + class, data = sample)
tab1

##             class
## height.class  A  B  C
##    (132,145]  6 12  6
##    (145,150] 10  8  7
##    (150,157]  9  4 12
##    (157,172]  9  9  7

addmargins(tab1)

##             class
## height.class  A  B  C Sum
##    (132,145]  6 12  6  24
##    (145,150] 10  8  7  25
##    (150,157]  9  4 12  25
##    (157,172]  9  9  7  25
##    Sum       34 33 32  99
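
As a sketch on made-up data (the data frame `toy` and its columns are hypothetical), table and xtabs give the same counts; the visible difference is that xtabs labels the dimensions with the variable names:

```r
toy <- data.frame(
  grp = factor(c("low", "low", "high", "high")),
  cls = factor(c("A", "B", "A", "A"))
)

t1 <- table(toy$grp, toy$cls)          # positional arguments, unnamed dimensions
t2 <- xtabs(~ grp + cls, data = toy)   # formula interface, named dimensions

all(as.vector(t1) == as.vector(t2))    # same counts
names(dimnames(t2))                    # "grp" "cls"
```

The named dimensions are why the xtabs output above prints "height.class" and "class" as headers.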

Bonus

The psych package provides the handy describe function, which covers the descriptive statistics in one call.

library(psych)
describe(sample)

##               var   n   mean    sd median trimmed   mad    min    max
## ID              1 100  50.50 29.01  50.50   50.50 37.06   1.00 100.00
## class*          2 100   1.99  0.82   2.00    1.99  1.48   1.00   3.00
## sex*            3 100   1.50  0.50   1.50    1.50  0.74   1.00   2.00
## height          4 100 151.38  9.05 149.99  150.89  8.94 131.91 172.28
## weight          5 100  56.84  9.75  56.04   56.53  9.12  33.25  86.97
## kokugo          6  99  64.48 13.01  64.00   64.31 13.34  34.00  94.00
## sansuu          7  99  71.54  5.96  72.00   71.60  5.93  58.00  86.00
## rika            8  99  50.47  5.62  51.00   50.53  5.93  34.00  66.00
## syakai          9 100  49.44 13.04  48.00   49.04 11.86  20.00  86.00
## eigo           10 100  59.88 14.95  61.00   59.76 16.31  25.00  94.00
## height.class*  11  99   2.52  1.12   3.00    2.52  1.48   1.00   4.00
##               range  skew kurtosis   se
## ID            99.00  0.00    -1.24 2.90
## class*         2.00  0.02    -1.54 0.08
## sex*           1.00  0.00    -2.02 0.05
## height        40.37  0.42    -0.40 0.91
## weight        53.72  0.35     0.13 0.98
## kokugo        60.00  0.11    -0.55 1.31
## sansuu        28.00 -0.12    -0.47 0.60
## rika          32.00 -0.17    -0.01 0.56
## syakai        66.00  0.26    -0.01 1.30
## eigo          69.00  0.05    -0.42 1.49
## height.class*  3.00 -0.02    -1.38 0.11

describe(sample$height)

##   var   n  mean   sd median trimmed  mad   min   max range skew kurtosis
## 1   1 100 151.4 9.05    150   150.9 8.94 131.9 172.3 40.37 0.42     -0.4
##     se
## 1 0.91

The describeBy function gives descriptive statistics by group just as easily.

describeBy(sample$height, group = sample$sex)

## group: male
##   var  n mean   sd median trimmed  mad   min   max range skew kurtosis
## 1   1 50  158 7.37  157.3   157.7 6.35 141.2 172.3 31.04 0.18    -0.63
##     se
## 1 1.04
## -------------------------------------------------------- 
## group: female
##   var  n  mean  sd median trimmed  mad   min   max range  skew kurtosis
## 1   1 50 144.8 4.8  145.8   145.1 4.71 131.9 152.7 20.81 -0.43    -0.32
##     se
## 1 0.68

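The gmodels package produces nicely formatted cross-tabulations. In the run below the package failed to load, so CrossTable was not found; installing gmodels and its dependencies with install.packages("gmodels") should resolve this.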

library(gmodels)

## Error: package/namespace load failed for 'gmodels'

CrossTable(sample$height.class, sample$class)

## Error: could not find function "CrossTable"
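
One way to avoid this kind of error in a script is to check whether the package can be loaded before calling it. The sketch below uses made-up data (`toy` is hypothetical) and falls back to base R when gmodels is unavailable:

```r
toy <- data.frame(
  grp = c("low", "low", "high", "high"),
  cls = c("A", "B", "A", "A")
)

if (requireNamespace("gmodels", quietly = TRUE)) {
  # gmodels is available: print the formatted cross-tabulation.
  gmodels::CrossTable(toy$grp, toy$cls)
} else {
  # gmodels is missing or failed to load: use base R instead.
  message("gmodels is not available; falling back to base R")
  print(addmargins(table(toy$grp, toy$cls)))
}
```

requireNamespace returns FALSE instead of stopping with an error, so the rest of the script keeps running either way.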