HW1

Question

Concerning the anorexia{MASS}, explain the difference between > anorexia[,2] > anorexia[“Prewt”]

and > lm(Postwt ~ Treat, data=anorexia) > lm(Postwt ~ as.numeric(Treat), data=anorexia)

Does the second output make sense in the context of the ‘anorexia’ data analysis?

data(anorexia, package = "MASS")
str(anorexia)
## 'data.frame':    72 obs. of  3 variables:
##  $ Treat : Factor w/ 3 levels "CBT","Cont","FT": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Prewt : num  80.7 89.4 91.8 74 78.1 88.3 87.3 75.1 80.6 78.4 ...
##  $ Postwt: num  80.2 80.1 86.4 86.3 76.1 78.1 75.1 86.7 73.5 84.6 ...
head(anorexia)
##   Treat Prewt Postwt
## 1  Cont  80.7   80.2
## 2  Cont  89.4   80.1
## 3  Cont  91.8   86.4
## 4  Cont  74.0   86.3
## 5  Cont  78.1   76.1
## 6  Cont  88.3   78.1
class(anorexia)
## [1] "data.frame"

what’s different between anorexia[,2] and anorexia[“Prewt”]

#
anorexia[,2]
##  [1] 80.7 89.4 91.8 74.0 78.1 88.3 87.3 75.1 80.6 78.4 77.6 88.7 81.3 78.1 70.5
## [16] 77.3 85.2 86.0 84.1 79.7 85.5 84.4 79.6 77.5 72.3 89.0 80.5 84.9 81.5 82.6
## [31] 79.9 88.7 94.9 76.3 81.0 80.5 85.0 89.2 81.3 76.5 70.0 80.4 83.3 83.0 87.7
## [46] 84.2 86.4 76.5 80.2 87.8 83.3 79.7 84.5 80.8 87.4 83.8 83.3 86.0 82.5 86.7
## [61] 79.6 76.9 94.2 73.4 80.5 81.6 82.1 77.6 83.5 89.9 86.0 87.3
class(anorexia[,2])
## [1] "numeric"
typeof(anorexia[,2])
## [1] "double"
#
anorexia["Prewt"]
##    Prewt
## 1   80.7
## 2   89.4
## 3   91.8
## 4   74.0
## 5   78.1
## 6   88.3
## 7   87.3
## 8   75.1
## 9   80.6
## 10  78.4
## 11  77.6
## 12  88.7
## 13  81.3
## 14  78.1
## 15  70.5
## 16  77.3
## 17  85.2
## 18  86.0
## 19  84.1
## 20  79.7
## 21  85.5
## 22  84.4
## 23  79.6
## 24  77.5
## 25  72.3
## 26  89.0
## 27  80.5
## 28  84.9
## 29  81.5
## 30  82.6
## 31  79.9
## 32  88.7
## 33  94.9
## 34  76.3
## 35  81.0
## 36  80.5
## 37  85.0
## 38  89.2
## 39  81.3
## 40  76.5
## 41  70.0
## 42  80.4
## 43  83.3
## 44  83.0
## 45  87.7
## 46  84.2
## 47  86.4
## 48  76.5
## 49  80.2
## 50  87.8
## 51  83.3
## 52  79.7
## 53  84.5
## 54  80.8
## 55  87.4
## 56  83.8
## 57  83.3
## 58  86.0
## 59  82.5
## 60  86.7
## 61  79.6
## 62  76.9
## 63  94.2
## 64  73.4
## 65  80.5
## 66  81.6
## 67  82.1
## 68  77.6
## 69  83.5
## 70  89.9
## 71  86.0
## 72  87.3
class(anorexia["Prewt"])
## [1] "data.frame"
typeof(anorexia["Prewt"])
## [1] "list"

根據結果,兩者皆是取”Prewt”這個欄位的資料,但仍有一些差異:

anorexia[,2]的結構是一個numeric,其型態為double

vector,而anorexia[“Prewt”]的結構仍是一個data.frame,其型態為list

what’s different after as.numeric(Treat) in linear regression

lm(Postwt ~ Treat, data=anorexia)
## 
## Call:
## lm(formula = Postwt ~ Treat, data = anorexia)
## 
## Coefficients:
## (Intercept)    TreatCont      TreatFT  
##      85.697       -4.589        4.798

在anorexia中,Treat是一個3 level的factor(“CBT”,“Cont”,“FT”),因此線性回歸結果是以”CBT”為參考組,計算”Cont”與”FT”兩組的beta

lm(Postwt ~ as.numeric(Treat), data=anorexia)
## 
## Call:
## lm(formula = Postwt ~ as.numeric(Treat), data = anorexia)
## 
## Coefficients:
##       (Intercept)  as.numeric(Treat)  
##            82.036              1.711

透過as.numeric()的語法,將Treat從factor(3個level)轉為數值,因此結果只會出現一個beata

結論:由於Treat是個類別,所以做數值轉換後的結果並不合理

HW2

Question

Find all possible sums from rolling three dice using R. If possible, construct a histogram for the sum of three dice. Is this a probability histogram or an empirical histogram?

# All possible dice rolls:
dice.sums <- outer(outer(1:6, 1:6, FUN = '+'), 1:6, FUN = '+')
dice.sums
## , , 1
## 
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    3    4    5    6    7    8
## [2,]    4    5    6    7    8    9
## [3,]    5    6    7    8    9   10
## [4,]    6    7    8    9   10   11
## [5,]    7    8    9   10   11   12
## [6,]    8    9   10   11   12   13
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    4    5    6    7    8    9
## [2,]    5    6    7    8    9   10
## [3,]    6    7    8    9   10   11
## [4,]    7    8    9   10   11   12
## [5,]    8    9   10   11   12   13
## [6,]    9   10   11   12   13   14
## 
## , , 3
## 
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    5    6    7    8    9   10
## [2,]    6    7    8    9   10   11
## [3,]    7    8    9   10   11   12
## [4,]    8    9   10   11   12   13
## [5,]    9   10   11   12   13   14
## [6,]   10   11   12   13   14   15
## 
## , , 4
## 
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    6    7    8    9   10   11
## [2,]    7    8    9   10   11   12
## [3,]    8    9   10   11   12   13
## [4,]    9   10   11   12   13   14
## [5,]   10   11   12   13   14   15
## [6,]   11   12   13   14   15   16
## 
## , , 5
## 
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    7    8    9   10   11   12
## [2,]    8    9   10   11   12   13
## [3,]    9   10   11   12   13   14
## [4,]   10   11   12   13   14   15
## [5,]   11   12   13   14   15   16
## [6,]   12   13   14   15   16   17
## 
## , , 6
## 
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    8    9   10   11   12   13
## [2,]    9   10   11   12   13   14
## [3,]   10   11   12   13   14   15
## [4,]   11   12   13   14   15   16
## [5,]   12   13   14   15   16   17
## [6,]   13   14   15   16   17   18

3個骰子的所有可能有216種,擲出3的機率是1/216,擲出4的機率是1+2/216,擲出5的機率是1+2+3/216….依此類推。 擲出3到11的組合從1,1+2,1+2+3,1+2+3+4,…1+2+..+8

# Probability of dice.sums= 4: 0.0138
prob <- mean(dice.sums== 4)
prob
## [1] 0.01388889
#將216種結果變為vector
all<- c(as.matrix(dice.sums)) |>hist(prob = T, main="", xlab ="sums from rolling three dice", breaks=2:19)

class(c(as.matrix(dice.sums)))
## [1] "integer"

solution & Question

outer(as.numeric(outer(1:6, 1:6, '+')), 1:6, '+') |> hist(prob = T, breaks= 2:19) #hist好像不能改xlim?
axis(side=1, at=c(2:19)) ##為什麼是2:19?

HW3

Question

The IQ and behavior problem page has a dataset and a script of R code chunks. Generate a markdown file from the script to push the output in HTML for posting to course Moodle site. Explain what the code chunks do with your comments in the markdown file.

#Row names in the first column
dta <- read.table("IQ_Beh.txt", header = T, row.names = 1)
# structure of data
str(dta)
## 'data.frame':    94 obs. of  3 variables:
##  $ Dep: chr  "N" "N" "N" "N" ...
##  $ IQ : int  103 124 124 104 96 92 124 99 92 116 ...
##  $ BP : int  4 12 9 3 3 3 6 4 3 9 ...
# show first 6 rows
head(dta)
##   Dep  IQ BP
## 1   N 103  4
## 2   N 124 12
## 3   N 124  9
## 4   N 104  3
## 5   D  96  3
## 6   N  92  3
#data structure
class(dta)
## [1] "data.frame"
# dimention of data
dim(dta)
## [1] 94  3
# variable names
names(dta)
## [1] "Dep" "IQ"  "BP"
# 驗證dta$BP是否為vector
is.vector(dta$BP)
## [1] TRUE
# select first row
dta[1, ]
##   Dep  IQ BP
## 1   N 103  4
# select 1-3 row的 IQ
dta[1:3, "IQ"]
## [1] 103 124 124
# 先針對BP row由小到大排序,並列出最後六筆(row)的各欄位(column)資料
tail(dta[order(dta$BP), ])
##    Dep  IQ BP
## 16   N  89 11
## 58   N 117 11
## 66   N 126 11
## 2    N 124 12
## 73   D  99 13
## 12   D  22 17
# 先針對BP row由大到小排序,並列出最後四筆(row)的各欄位(column)資料
tail(dta[order(-dta$BP), ], 4)
##    Dep  IQ BP
## 77   N 124  1
## 80   N 121  1
## 24   N 106  0
## 75   N 122  0
# with的功能與sas中的"data="相似
with(dta, hist(IQ, xlab = "IQ", main = ""))

# boxplot
boxplot(BP ~ Dep, data = dta, 
        xlab = "Depression", 
        ylab = "Behavior problem score")

# 原本的Dep 是character,需先變成factor
dta$Dep<-with(dta, as.factor(Dep))
str(dta)
## 'data.frame':    94 obs. of  3 variables:
##  $ Dep: Factor w/ 2 levels "D","N": 2 2 2 2 1 2 2 2 2 2 ...
##  $ IQ : int  103 124 124 104 96 92 124 99 92 116 ...
##  $ BP : int  4 12 9 3 3 3 6 4 3 9 ...
#原本col=dta$Dep,好像沒有"dta$"也抓地到資料?
plot(IQ ~ BP, data = dta, pch = 20, col = Dep, 
     xlab = "Behavior problem score", ylab = "IQ")
grid()

# type = "n" 代表不要任何點或線,只呈現外框,無值
plot(BP ~ IQ, data = dta, type = "n",
     ylab = "Behavior problem score", xlab = "IQ")
# 把點換成用文字呈現,cex是字體大小
text(dta$IQ, dta$BP, labels = dta$Dep, cex = 0.5)
# 增加一條 Depression 組的回歸線
abline(lm(BP ~ IQ, data = dta, subset = Dep == "D"))
# 增加一條 non-Depression 組的回歸線,line type 使用虛線
abline(lm(BP ~ IQ, data = dta, subset = Dep == "N"), lty = 2)

HW4

Question

The usBirths2015.txt is a dataset of monthly births in the US in 2015. Summarize the number of births by season.

tidyverse approach

dta4 <- read.table("usBirths2015.txt", header = T)

#
library(dplyr)
dta4 <- dta4|> mutate(
  season = ifelse(month %in% c("December", "January", "February"), "Winter",
           ifelse(month %in% c("March","April","May"), "Spring",
           ifelse(month %in% c("June","July","August"),"Summer","Fall"))))

#
with(dta4, aggregate(birth~as.factor(season), FUN = sum))
##   as.factor(season)   birth
## 1              Fall 1005343
## 2            Spring  977672
## 3            Summer 1035747
## 4            Winter  959735

simplifies solution

#調整一下row的順序按照season順序排
dta4v <- dta4[c(3:12,1:2),]
#repeated each spring, summer...3 times
dta4v$season <- rep(c("Spring", "Summer", "Fall", "Winter"), rep(3,4))

HW5

Question

Ten subjects read a paragraph consisting of seven sentences. The reading time (in seconds) for each sentence was the outcome measure. The predictors are the serial position of the sentence (Sp), the number of words in the sentences (Wrds), and the number of new arguments in the sentence (New).

  1. Rank subjects by their reading speeed

  2. Estimate, on average, how long does it take to read a word.

dta5 <- read.table("readingtimes.txt", header = T)
str(dta5)
## 'data.frame':    7 obs. of  14 variables:
##  $ Snt : int  1 2 3 4 5 6 7
##  $ Sp  : int  1 2 3 4 5 6 7
##  $ Wrds: int  13 16 9 9 10 18 6
##  $ New : int  1 3 2 2 3 4 1
##  $ S01 : num  3.43 6.48 1.71 3.68 4 ...
##  $ S02 : num  2.79 5.41 2.34 3.71 2.9 ...
##  $ S03 : num  4.16 4.49 3.02 2.87 2.99 ...
##  $ S04 : num  3.07 5.06 2.46 2.73 2.67 ...
##  $ S05 : num  3.62 9.29 6.04 4.21 3.88 ...
##  $ S06 : num  3.16 5.64 2.46 6.24 3.22 ...
##  $ S07 : num  3.23 8.36 4.92 3.72 3.14 ...
##  $ S08 : num  7.16 4.31 3.37 6.33 6.14 ...
##  $ S09 : num  1.54 2.95 1.38 1.15 2.76 ...
##  $ S10 : num  4.06 6.65 2.18 3.66 3.33 ...
head(dta5)
##   Snt Sp Wrds New   S01   S02   S03   S04   S05    S06    S07   S08   S09   S10
## 1   1  1   13   1 3.429 2.795 4.161 3.071 3.625  3.161  3.232 7.161 1.536 4.063
## 2   2  2   16   3 6.482 5.411 4.491 5.063 9.295  5.643  8.357 4.313 2.946 6.652
## 3   3  3    9   2 1.714 2.339 3.018 2.464 6.045  2.455  4.920 3.366 1.375 2.179
## 4   4  4    9   2 3.679 3.714 2.866 2.732 4.205  6.241  3.723 6.330 1.152 3.661
## 5   5  5   10   3 4.000 2.902 2.991 2.670 3.884  3.223  3.143 6.143 2.759 3.330
## 6   6  6   18   4 6.973 8.018 6.625 7.571 8.795 13.188 11.170 6.071 7.964 7.866

Rank

AA<- apply(dta5[5:14],2,mean) #2 means column
AA
##      S01      S02      S03      S04      S05      S06      S07      S08 
## 4.130143 3.847000 3.774286 3.779286 5.620000 5.371286 5.228429 5.011429 
##      S09      S10 
## 2.741000 4.493714
# what's different in rank, order, sort
order(AA) #關心的是index, 數值由小到大排列,回傳其值的位置(index)
##  [1]  9  3  4  2  1 10  8  7  6  5
rank(AA) #關心的是subject,回傳每個subject的排名
## S01 S02 S03 S04 S05 S06 S07 S08 S09 S10 
##   5   4   2   3  10   9   8   7   1   6
sort(AA) #關心的是數值排序,回傳數值由小到大排列相數對應之值與subject
##      S09      S03      S04      S02      S01      S10      S08      S07 
## 2.741000 3.774286 3.779286 3.847000 4.130143 4.493714 5.011429 5.228429 
##      S06      S05 
## 5.371286 5.620000

Reading speed

# 7個句子,每句在10位平均閱讀時間
snt_sum<- apply(dta5[5:14],1,mean)
#每句平均閱讀時間除每句單字數量
snt_wrds<- snt_sum/dta5$Wrds
#總平均單字數之閱讀時間
mean(snt_wrds)
## [1] 0.3783395

從每句平均閱讀時間除以每句單字數量結果可以發現,7個句子在10個受試者的平均閱讀時間都不相同,代表每個句子間,還有其他因素影響閱讀的速度,可能是句子的位置、也可能是句子的新單詞數等。