Concerning the anorexia{MASS}, explain the difference between > anorexia[,2] > anorexia[“Prewt”]
and > lm(Postwt ~ Treat, data=anorexia) > lm(Postwt ~ as.numeric(Treat), data=anorexia)
Does the second output make sense in the context of the ‘anorexia’ data analysis?
data(anorexia, package = "MASS")
str(anorexia)
## 'data.frame': 72 obs. of 3 variables:
## $ Treat : Factor w/ 3 levels "CBT","Cont","FT": 2 2 2 2 2 2 2 2 2 2 ...
## $ Prewt : num 80.7 89.4 91.8 74 78.1 88.3 87.3 75.1 80.6 78.4 ...
## $ Postwt: num 80.2 80.1 86.4 86.3 76.1 78.1 75.1 86.7 73.5 84.6 ...
head(anorexia)
## Treat Prewt Postwt
## 1 Cont 80.7 80.2
## 2 Cont 89.4 80.1
## 3 Cont 91.8 86.4
## 4 Cont 74.0 86.3
## 5 Cont 78.1 76.1
## 6 Cont 88.3 78.1
class(anorexia)
## [1] "data.frame"
#
2] anorexia[,
## [1] 80.7 89.4 91.8 74.0 78.1 88.3 87.3 75.1 80.6 78.4 77.6 88.7 81.3 78.1 70.5
## [16] 77.3 85.2 86.0 84.1 79.7 85.5 84.4 79.6 77.5 72.3 89.0 80.5 84.9 81.5 82.6
## [31] 79.9 88.7 94.9 76.3 81.0 80.5 85.0 89.2 81.3 76.5 70.0 80.4 83.3 83.0 87.7
## [46] 84.2 86.4 76.5 80.2 87.8 83.3 79.7 84.5 80.8 87.4 83.8 83.3 86.0 82.5 86.7
## [61] 79.6 76.9 94.2 73.4 80.5 81.6 82.1 77.6 83.5 89.9 86.0 87.3
class(anorexia[,2])
## [1] "numeric"
typeof(anorexia[,2])
## [1] "double"
#
"Prewt"] anorexia[
## Prewt
## 1 80.7
## 2 89.4
## 3 91.8
## 4 74.0
## 5 78.1
## 6 88.3
## 7 87.3
## 8 75.1
## 9 80.6
## 10 78.4
## 11 77.6
## 12 88.7
## 13 81.3
## 14 78.1
## 15 70.5
## 16 77.3
## 17 85.2
## 18 86.0
## 19 84.1
## 20 79.7
## 21 85.5
## 22 84.4
## 23 79.6
## 24 77.5
## 25 72.3
## 26 89.0
## 27 80.5
## 28 84.9
## 29 81.5
## 30 82.6
## 31 79.9
## 32 88.7
## 33 94.9
## 34 76.3
## 35 81.0
## 36 80.5
## 37 85.0
## 38 89.2
## 39 81.3
## 40 76.5
## 41 70.0
## 42 80.4
## 43 83.3
## 44 83.0
## 45 87.7
## 46 84.2
## 47 86.4
## 48 76.5
## 49 80.2
## 50 87.8
## 51 83.3
## 52 79.7
## 53 84.5
## 54 80.8
## 55 87.4
## 56 83.8
## 57 83.3
## 58 86.0
## 59 82.5
## 60 86.7
## 61 79.6
## 62 76.9
## 63 94.2
## 64 73.4
## 65 80.5
## 66 81.6
## 67 82.1
## 68 77.6
## 69 83.5
## 70 89.9
## 71 86.0
## 72 87.3
class(anorexia["Prewt"])
## [1] "data.frame"
typeof(anorexia["Prewt"])
## [1] "list"
根據結果,兩者皆是取”Prewt”這個欄位的資料,但仍有一些差異:
anorexia[,2]的結構是一個numeric,其型態為double
vector,而anorexia[“Prewt”]的結構仍是一個data.frame,其型態為list
lm(Postwt ~ Treat, data=anorexia)
##
## Call:
## lm(formula = Postwt ~ Treat, data = anorexia)
##
## Coefficients:
## (Intercept) TreatCont TreatFT
## 85.697 -4.589 4.798
在anorexia中,Treat是一個3 level的factor(“CBT”,“Cont”,“FT”),因此線性回歸結果是以”CBT”為參考組,計算”Cont”與”FT”兩組的beta
lm(Postwt ~ as.numeric(Treat), data=anorexia)
##
## Call:
## lm(formula = Postwt ~ as.numeric(Treat), data = anorexia)
##
## Coefficients:
## (Intercept) as.numeric(Treat)
## 82.036 1.711
透過as.numeric()的語法,將Treat從factor(3個level)轉為數值,因此結果只會出現一個beata
結論:由於Treat是個類別,所以做數值轉換後的結果並不合理
Find all possible sums from rolling three dice using R. If possible, construct a histogram for the sum of three dice. Is this a probability histogram or an empirical histogram?
# All possible dice rolls:
<- outer(outer(1:6, 1:6, FUN = '+'), 1:6, FUN = '+')
dice.sums dice.sums
## , , 1
##
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 3 4 5 6 7 8
## [2,] 4 5 6 7 8 9
## [3,] 5 6 7 8 9 10
## [4,] 6 7 8 9 10 11
## [5,] 7 8 9 10 11 12
## [6,] 8 9 10 11 12 13
##
## , , 2
##
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 4 5 6 7 8 9
## [2,] 5 6 7 8 9 10
## [3,] 6 7 8 9 10 11
## [4,] 7 8 9 10 11 12
## [5,] 8 9 10 11 12 13
## [6,] 9 10 11 12 13 14
##
## , , 3
##
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 5 6 7 8 9 10
## [2,] 6 7 8 9 10 11
## [3,] 7 8 9 10 11 12
## [4,] 8 9 10 11 12 13
## [5,] 9 10 11 12 13 14
## [6,] 10 11 12 13 14 15
##
## , , 4
##
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 6 7 8 9 10 11
## [2,] 7 8 9 10 11 12
## [3,] 8 9 10 11 12 13
## [4,] 9 10 11 12 13 14
## [5,] 10 11 12 13 14 15
## [6,] 11 12 13 14 15 16
##
## , , 5
##
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 7 8 9 10 11 12
## [2,] 8 9 10 11 12 13
## [3,] 9 10 11 12 13 14
## [4,] 10 11 12 13 14 15
## [5,] 11 12 13 14 15 16
## [6,] 12 13 14 15 16 17
##
## , , 6
##
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 8 9 10 11 12 13
## [2,] 9 10 11 12 13 14
## [3,] 10 11 12 13 14 15
## [4,] 11 12 13 14 15 16
## [5,] 12 13 14 15 16 17
## [6,] 13 14 15 16 17 18
3個骰子的所有可能有216種,擲出3的機率是1/216,擲出4的機率是1+2/216,擲出5的機率是1+2+3/216….依此類推。 擲出3到11的組合從1,1+2,1+2+3,1+2+3+4,…1+2+..+8
# Probability of dice.sums= 4: 0.0138
<- mean(dice.sums== 4)
prob prob
## [1] 0.01388889
#將216種結果變為vector
<- c(as.matrix(dice.sums)) |>hist(prob = T, main="", xlab ="sums from rolling three dice", breaks=2:19) all
class(c(as.matrix(dice.sums)))
## [1] "integer"
outer(as.numeric(outer(1:6, 1:6, '+')), 1:6, '+') |> hist(prob = T, breaks= 2:19) #hist好像不能改xlim?
axis(side=1, at=c(2:19)) ##為什麼是2:19?
The IQ and behavior problem page has a dataset and a script of R code chunks. Generate a markdown file from the script to push the output in HTML for posting to course Moodle site. Explain what the code chunks do with your comments in the markdown file.
#Row names in the first column
<- read.table("IQ_Beh.txt", header = T, row.names = 1) dta
# structure of data
str(dta)
## 'data.frame': 94 obs. of 3 variables:
## $ Dep: chr "N" "N" "N" "N" ...
## $ IQ : int 103 124 124 104 96 92 124 99 92 116 ...
## $ BP : int 4 12 9 3 3 3 6 4 3 9 ...
# show first 6 rows
head(dta)
## Dep IQ BP
## 1 N 103 4
## 2 N 124 12
## 3 N 124 9
## 4 N 104 3
## 5 D 96 3
## 6 N 92 3
#data structure
class(dta)
## [1] "data.frame"
# dimention of data
dim(dta)
## [1] 94 3
# variable names
names(dta)
## [1] "Dep" "IQ" "BP"
# 驗證dta$BP是否為vector
is.vector(dta$BP)
## [1] TRUE
# select first row
1, ] dta[
## Dep IQ BP
## 1 N 103 4
# select 1-3 row的 IQ
1:3, "IQ"] dta[
## [1] 103 124 124
# 先針對BP row由小到大排序,並列出最後六筆(row)的各欄位(column)資料
tail(dta[order(dta$BP), ])
## Dep IQ BP
## 16 N 89 11
## 58 N 117 11
## 66 N 126 11
## 2 N 124 12
## 73 D 99 13
## 12 D 22 17
# 先針對BP row由大到小排序,並列出最後四筆(row)的各欄位(column)資料
tail(dta[order(-dta$BP), ], 4)
## Dep IQ BP
## 77 N 124 1
## 80 N 121 1
## 24 N 106 0
## 75 N 122 0
# with的功能與sas中的"data="相似
with(dta, hist(IQ, xlab = "IQ", main = ""))
# boxplot
boxplot(BP ~ Dep, data = dta,
xlab = "Depression",
ylab = "Behavior problem score")
# 原本的Dep 是character,需先變成factor
$Dep<-with(dta, as.factor(Dep))
dtastr(dta)
## 'data.frame': 94 obs. of 3 variables:
## $ Dep: Factor w/ 2 levels "D","N": 2 2 2 2 1 2 2 2 2 2 ...
## $ IQ : int 103 124 124 104 96 92 124 99 92 116 ...
## $ BP : int 4 12 9 3 3 3 6 4 3 9 ...
#原本col=dta$Dep,好像沒有"dta$"也抓地到資料?
plot(IQ ~ BP, data = dta, pch = 20, col = Dep,
xlab = "Behavior problem score", ylab = "IQ")
grid()
# type = "n" 代表不要任何點或線,只呈現外框,無值
plot(BP ~ IQ, data = dta, type = "n",
ylab = "Behavior problem score", xlab = "IQ")
# 把點換成用文字呈現,cex是字體大小
text(dta$IQ, dta$BP, labels = dta$Dep, cex = 0.5)
# 增加一條 Depression 組的回歸線
abline(lm(BP ~ IQ, data = dta, subset = Dep == "D"))
# 增加一條 non-Depression 組的回歸線,line type 使用虛線
abline(lm(BP ~ IQ, data = dta, subset = Dep == "N"), lty = 2)
The usBirths2015.txt is a dataset of monthly births in the US in 2015. Summarize the number of births by season.
<- read.table("usBirths2015.txt", header = T)
dta4
#
library(dplyr)
<- dta4|> mutate(
dta4 season = ifelse(month %in% c("December", "January", "February"), "Winter",
ifelse(month %in% c("March","April","May"), "Spring",
ifelse(month %in% c("June","July","August"),"Summer","Fall"))))
#
with(dta4, aggregate(birth~as.factor(season), FUN = sum))
## as.factor(season) birth
## 1 Fall 1005343
## 2 Spring 977672
## 3 Summer 1035747
## 4 Winter 959735
#調整一下row的順序按照season順序排
<- dta4[c(3:12,1:2),]
dta4v #repeated each spring, summer...3 times
$season <- rep(c("Spring", "Summer", "Fall", "Winter"), rep(3,4)) dta4v
Ten subjects read a paragraph consisting of seven sentences. The reading time (in seconds) for each sentence was the outcome measure. The predictors are the serial position of the sentence (Sp), the number of words in the sentences (Wrds), and the number of new arguments in the sentence (New).
Rank subjects by their reading speeed
Estimate, on average, how long does it take to read a word.
<- read.table("readingtimes.txt", header = T)
dta5 str(dta5)
## 'data.frame': 7 obs. of 14 variables:
## $ Snt : int 1 2 3 4 5 6 7
## $ Sp : int 1 2 3 4 5 6 7
## $ Wrds: int 13 16 9 9 10 18 6
## $ New : int 1 3 2 2 3 4 1
## $ S01 : num 3.43 6.48 1.71 3.68 4 ...
## $ S02 : num 2.79 5.41 2.34 3.71 2.9 ...
## $ S03 : num 4.16 4.49 3.02 2.87 2.99 ...
## $ S04 : num 3.07 5.06 2.46 2.73 2.67 ...
## $ S05 : num 3.62 9.29 6.04 4.21 3.88 ...
## $ S06 : num 3.16 5.64 2.46 6.24 3.22 ...
## $ S07 : num 3.23 8.36 4.92 3.72 3.14 ...
## $ S08 : num 7.16 4.31 3.37 6.33 6.14 ...
## $ S09 : num 1.54 2.95 1.38 1.15 2.76 ...
## $ S10 : num 4.06 6.65 2.18 3.66 3.33 ...
head(dta5)
## Snt Sp Wrds New S01 S02 S03 S04 S05 S06 S07 S08 S09 S10
## 1 1 1 13 1 3.429 2.795 4.161 3.071 3.625 3.161 3.232 7.161 1.536 4.063
## 2 2 2 16 3 6.482 5.411 4.491 5.063 9.295 5.643 8.357 4.313 2.946 6.652
## 3 3 3 9 2 1.714 2.339 3.018 2.464 6.045 2.455 4.920 3.366 1.375 2.179
## 4 4 4 9 2 3.679 3.714 2.866 2.732 4.205 6.241 3.723 6.330 1.152 3.661
## 5 5 5 10 3 4.000 2.902 2.991 2.670 3.884 3.223 3.143 6.143 2.759 3.330
## 6 6 6 18 4 6.973 8.018 6.625 7.571 8.795 13.188 11.170 6.071 7.964 7.866
<- apply(dta5[5:14],2,mean) #2 means column
AA AA
## S01 S02 S03 S04 S05 S06 S07 S08
## 4.130143 3.847000 3.774286 3.779286 5.620000 5.371286 5.228429 5.011429
## S09 S10
## 2.741000 4.493714
# what's different in rank, order, sort
order(AA) #關心的是index, 數值由小到大排列,回傳其值的位置(index)
## [1] 9 3 4 2 1 10 8 7 6 5
rank(AA) #關心的是subject,回傳每個subject的排名
## S01 S02 S03 S04 S05 S06 S07 S08 S09 S10
## 5 4 2 3 10 9 8 7 1 6
sort(AA) #關心的是數值排序,回傳數值由小到大排列相數對應之值與subject
## S09 S03 S04 S02 S01 S10 S08 S07
## 2.741000 3.774286 3.779286 3.847000 4.130143 4.493714 5.011429 5.228429
## S06 S05
## 5.371286 5.620000
# 7個句子,每句在10位平均閱讀時間
<- apply(dta5[5:14],1,mean)
snt_sum#每句平均閱讀時間除每句單字數量
<- snt_sum/dta5$Wrds
snt_wrds#總平均單字數之閱讀時間
mean(snt_wrds)
## [1] 0.3783395
從每句平均閱讀時間除以每句單字數量結果可以發現,7個句子在10個受試者的平均閱讀時間都不相同,代表每個句子間,還有其他因素影響閱讀的速度,可能是句子的位置、也可能是句子的新單詞數等。