Q1

Concerning the anorexia{MASS}, explain the difference between > anorexia[,2] > anorexia[“Prewt”]

and > lm(Postwt ~ Treat, data=anorexia) > lm(Postwt ~ as.numeric(Treat), data=anorexia)

Does the second output make sense in the context of the ‘anorexia’ data analysis?

#load data from package
#install.packages("MASS")
data(anorexia,package ="MASS")

#exam data structure
str(anorexia)

## 'data.frame':    72 obs. of  3 variables:
##  $ Treat : Factor w/ 3 levels "CBT","Cont","FT": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Prewt : num  80.7 89.4 91.8 74 78.1 88.3 87.3 75.1 80.6 78.4 ...
##  $ Postwt: num  80.2 80.1 86.4 86.3 76.1 78.1 75.1 86.7 73.5 84.6 ...

anorexia的資料結構是data.frame，3個變數：Treat是Factor(3 levels：“CBT”,“Cont”,“FT”)；Prewt、Postwt前後測體重是num vector。

# show first 6 data
head(anorexia,8)

##   Treat Prewt Postwt
## 1  Cont  80.7   80.2
## 2  Cont  89.4   80.1
## 3  Cont  91.8   86.4
## 4  Cont  74.0   86.3
## 5  Cont  78.1   76.1
## 6  Cont  88.3   78.1
## 7  Cont  87.3   75.1
## 8  Cont  75.1   86.7

tail(anorexia,8)

##    Treat Prewt Postwt
## 65    FT  80.5   75.2
## 66    FT  81.6   77.8
## 67    FT  82.1   95.5
## 68    FT  77.6   90.7
## 69    FT  83.5   92.5
## 70    FT  89.9   93.8
## 71    FT  86.0   91.7
## 72    FT  87.3   98.0

anorexia[,2] #[列,行=2]，anorexia[,2]代表取出第二個變數Prewt所有觀察值

##  [1] 80.7 89.4 91.8 74.0 78.1 88.3 87.3 75.1 80.6 78.4 77.6 88.7 81.3 78.1 70.5
## [16] 77.3 85.2 86.0 84.1 79.7 85.5 84.4 79.6 77.5 72.3 89.0 80.5 84.9 81.5 82.6
## [31] 79.9 88.7 94.9 76.3 81.0 80.5 85.0 89.2 81.3 76.5 70.0 80.4 83.3 83.0 87.7
## [46] 84.2 86.4 76.5 80.2 87.8 83.3 79.7 84.5 80.8 87.4 83.8 83.3 86.0 82.5 86.7
## [61] 79.6 76.9 94.2 73.4 80.5 81.6 82.1 77.6 83.5 89.9 86.0 87.3

anorexia$Prewt #結果跟anorexia[,2]一樣

##  [1] 80.7 89.4 91.8 74.0 78.1 88.3 87.3 75.1 80.6 78.4 77.6 88.7 81.3 78.1 70.5
## [16] 77.3 85.2 86.0 84.1 79.7 85.5 84.4 79.6 77.5 72.3 89.0 80.5 84.9 81.5 82.6
## [31] 79.9 88.7 94.9 76.3 81.0 80.5 85.0 89.2 81.3 76.5 70.0 80.4 83.3 83.0 87.7
## [46] 84.2 86.4 76.5 80.2 87.8 83.3 79.7 84.5 80.8 87.4 83.8 83.3 86.0 82.5 86.7
## [61] 79.6 76.9 94.2 73.4 80.5 81.6 82.1 77.6 83.5 89.9 86.0 87.3

anorexia["Prewt"] #data.frame中只留下Prewt

##    Prewt
## 1   80.7
## 2   89.4
## 3   91.8
## 4   74.0
## 5   78.1
## 6   88.3
## 7   87.3
## 8   75.1
## 9   80.6
## 10  78.4
## 11  77.6
## 12  88.7
## 13  81.3
## 14  78.1
## 15  70.5
## 16  77.3
## 17  85.2
## 18  86.0
## 19  84.1
## 20  79.7
## 21  85.5
## 22  84.4
## 23  79.6
## 24  77.5
## 25  72.3
## 26  89.0
## 27  80.5
## 28  84.9
## 29  81.5
## 30  82.6
## 31  79.9
## 32  88.7
## 33  94.9
## 34  76.3
## 35  81.0
## 36  80.5
## 37  85.0
## 38  89.2
## 39  81.3
## 40  76.5
## 41  70.0
## 42  80.4
## 43  83.3
## 44  83.0
## 45  87.7
## 46  84.2
## 47  86.4
## 48  76.5
## 49  80.2
## 50  87.8
## 51  83.3
## 52  79.7
## 53  84.5
## 54  80.8
## 55  87.4
## 56  83.8
## 57  83.3
## 58  86.0
## 59  82.5
## 60  86.7
## 61  79.6
## 62  76.9
## 63  94.2
## 64  73.4
## 65  80.5
## 66  81.6
## 67  82.1
## 68  77.6
## 69  83.5
## 70  89.9
## 71  86.0
## 72  87.3

#比較資料結構
str(anorexia[,2]) #vector

##  num [1:72] 80.7 89.4 91.8 74 78.1 88.3 87.3 75.1 80.6 78.4 ...

str(anorexia$Prewt) #vector

##  num [1:72] 80.7 89.4 91.8 74 78.1 88.3 87.3 75.1 80.6 78.4 ...

str(anorexia["Prewt"]) #data.frame

## 'data.frame':    72 obs. of  1 variable:
##  $ Prewt: num  80.7 89.4 91.8 74 78.1 88.3 87.3 75.1 80.6 78.4 ...

lm(Postwt ~ Treat, data=anorexia)

## 
## Call:
## lm(formula = Postwt ~ Treat, data = anorexia)
## 
## Coefficients:
## (Intercept)    TreatCont      TreatFT  
##      85.697       -4.589        4.798

lm(Postwt ~ as.numeric(Treat), data=anorexia)

## 
## Call:
## lm(formula = Postwt ~ as.numeric(Treat), data = anorexia)
## 
## Coefficients:
##       (Intercept)  as.numeric(Treat)  
##            82.036              1.711

Treat是有3 levels：“CBT”,“Cont”,“FT”。這三種治療方式如果是劑量強度不同，那麼Treat就是ordinal variable，這時候類別變數轉成數值as.numeric(Treat)才有意義。

Q2

Find all possible sums from rolling three dice using R. If possible, construct a histogram for the sum of three dice. Is this a probability histogram or an empirical histogram?

#骰子可能出現的數值
dice<-1:6
dice

## [1] 1 2 3 4 5 6

#sample(x, size, replace = FALSE, prob = NULL)
#x為任意vector或整數
#size=抽取的樣本數
#replace = FALSE 邏輯指令, 設定是否可重複抽取.
#prob 設定每一個個體被抽取之相對應機率或比率之向量。若無設定值NULL, 則每一個個體被抽取之相對應機率為相等


#擲一個骰子，重複七次
x<- sample(1:6, 7, rep=T, prob = NULL) #rep=T取出放回
x

## [1] 2 5 6 6 4 3 2

#三顆骰子出現的數值加總範圍
dice<-3*1:6
dice

## [1]  3  6  9 12 15 18

#擲三個骰子，重複10次
x<- sample(3:18, 10, rep=T) 
x

##  [1]  7 15 15  4  7 17 10  8 14 14

#construct a histogram for the sum of three dice
sample(3:18, 10, rep=T) |> hist(prob=T, main="") #丟10次

#construct a histogram for the sum of three dice
sample(3:18,1000, rep=T) |> hist(prob=T, main="") #1000次

這是empirical histogram。理論上，骰子每一面出現的機率會一樣，服從discrete uniform distribution，但是實際模擬出現的結果卻不一定如此。

擲骰子作法2

n <- 6000 #將丟擲次數設定為一參數
dice <- sample(1:6, size = n, replace = TRUE) #重複丟擲一公正骰子6000次，並將結果存入dice變數
head(dice, 100) #顯示前100筆結果

##   [1] 3 5 4 4 2 1 5 4 2 3 3 5 5 2 1 1 1 6 1 2 1 4 2 1 1 5 5 5 2 5 3 3 2 1 3 2 1
##  [38] 3 6 1 1 3 6 5 1 2 3 6 6 5 4 6 2 2 5 4 2 1 3 1 3 4 3 5 1 2 6 3 1 4 2 2 3 5
##  [75] 4 3 2 6 1 5 2 6 1 3 6 3 2 5 5 1 6 4 1 1 6 3 4 2 1 1

dice_count <- table(dice) #利用table函數幫助我們建立次數表
dice_count #顯示次數表

## dice
##    1    2    3    4    5    6 
##  954 1018 1036  969 1016 1007

#將次數表結果畫成條形圖
library(ggplot2)
ggplot(data.frame(dice_count), aes(x = dice, y = Freq)) +
  geom_bar(stat = "identity") + 
  geom_hline(aes(yintercept = n / 6)) #畫出條形圖，並給出期望次數水平線，注意我們需先將次數表由table型態轉為dataframe型態，這樣ggplot才能解讀，其中第一欄為dice，第二欄為Freq

#將y軸由次數轉成比例
df_dice_count <- data.frame(dice_count) #先將次數表由table型態轉為dataframe型態方便下面的計算
for (i in 1:6) df_dice_count$Freq[i] <- df_dice_count$Freq[i] / n #將每一筆次數都除以總數即得比例，再放回dataframe中
ggplot(df_dice_count, aes(x = dice, y = Freq)) +
  geom_bar(stat = "identity") + 
  geom_hline(aes(yintercept = 1 / 6)) #畫出條形圖

#進一步寫出像是一次丟3個骰子，連丟10000次，將每一次的點數和記錄下來並製成次數表畫成條形圖，看看圖形跟前面的有何不同

dice <- function(n_dice, times){
  dice_sum <- function(n_dice){
    dice <- sample(1:6, size = n_dice, replace = TRUE)
    return(sum(dice))
  }
replicate(times, dice_sum(n_dice))
}
dice <- dice(3,10000)
ggplot(data.frame(table(dice)), aes(x = dice, y = Freq)) +
  geom_bar(stat="identity")

Q3

The IQ and behavior problem page has a dataset and a script of R code chunks. Generate a markdown file from the script to push the output in HTML for posting to course Moodle site. Explain what the code chunks do with your comments in the markdown file.

# input data
dta <- read.table("C:/Users/Ching-Fang Wu/Documents/dataM/IQ_Beh.txt", header = T,row.names = 1)

row.names = TRUE，是設定輸出 row names。不清楚row.names = 1是甚麼意思，但刪除後會多出一個變數Mom

# exam data structure
str(dta)

## 'data.frame':    94 obs. of  3 variables:
##  $ Dep: chr  "N" "N" "N" "N" ...
##  $ IQ : int  103 124 124 104 96 92 124 99 92 116 ...
##  $ BP : int  4 12 9 3 3 3 6 4 3 9 ...

資料結構是data.frame，有三個變數，Dep是character，IQ和BP是整數

# show first 6 lines
head(dta)

##   Dep  IQ BP
## 1   N 103  4
## 2   N 124 12
## 3   N 124  9
## 4   N 104  3
## 5   D  96  3
## 6   N  92  3

# 資料結構類型
class(dta)

## [1] "data.frame"

# 資料維度
dim(dta)

## [1] 94  3

# 變數名稱
names(dta)

## [1] "Dep" "IQ"  "BP"

# 檢查資料型別是否為vector
is.vector(dta$BP)

## [1] TRUE

# 取出第一列row
dta[1, ]

##   Dep  IQ BP
## 1   N 103  4

# 前三筆IQ
dta[1:3, "IQ"]

## [1] 103 124 124

#dta$BP 取出BP的數值
#order(dta$BP) order()函數可回傳由小到大之元素位置
#dta[order(dta$BP), ] dataframe按照BP由小排到大
tail(dta[order(dta$BP), ]) #tail選出最後六筆，也就是BP最高的六位受試者的資料

##    Dep  IQ BP
## 16   N  89 11
## 58   N 117 11
## 66   N 126 11
## 2    N 124 12
## 73   D  99 13
## 12   D  22 17

#order(-dta$BP)是將數值由大排到小，另一種方式：order(dta$BP,decreasing = T)


#BP由高到低排序，再選出最後4位，也就是BP值最低的四位受試者
tail(dta[order(-dta$BP), ], 4)

##    Dep  IQ BP
## 77   N 124  1
## 80   N 121  1
## 24   N 106  0
## 75   N 122  0

#with()Evaluate an Expression in a Data Environmen
with(dta, hist(IQ, xlab = "IQ", main = ""))

with(dta, hist(BP, xlab = "BP", main = ""))

#用hist()在dta上

#用boxplot()畫盒形圖
boxplot(BP ~ Dep, data = dta, 
      xlab = "Depression", 
      ylab = "Behavior problem score")

#也可以用with()畫盒形圖
with(dta,boxplot(BP ~ Dep,xlab = "Depression", 
        ylab = "Behavior problem score"))

# 用plot()畫散布圖scatter plots
plot(IQ ~ BP, data = dta, pch = 20, col=2, 
     xlab = "Behavior problem score", ylab = "IQ")
grid()

#
plot(BP ~ IQ, data = dta, type = "n",
     ylab = "Behavior problem score", xlab = "IQ")
text(dta$IQ, dta$BP, labels = dta$Dep, cex = 0.5) #資料點加上標籤
abline(lm(BP ~ IQ, data = dta, subset = Dep == "D")) #加"D"趨勢線
abline(lm(BP ~ IQ, data = dta, subset = Dep == "N"), lty = 2) #加"N"趨勢線

Q&A

Q：有時候會看到plot()、text()、abline()函數之間用”+“連起來，但上例卻不用，使用時機為何呢?

我還試過pipe |>，但不能run，WHY?? 畫圖不是一道一道程序加上去的嗎?

A： The output of R base plot is sent to the standard device, i.e, your screen. (You can also save it as a graphical object to the computer drive.) It is not saved as an object in an R session. That is why you can’t use operator such as ‘pipe’ or ‘+’ as in ggplot2 with it. We will explain the differences among base R graphics, lattice and ggplot2 later.

Try the followings in an R session:

plot(women) lines(women) text(women)

g = lattice::xyplot(height ~ weight, women)

show(g)

update(g, type=c(“g”, “p”, “r”))

There is no need to involve + or pipe operators as the graphic device is kept open in the first case and the object is saved in the second.

plot(women) #畫點
lines(women) #畫線
text(women) #標註點值

#g = lattice::xyplot(height ~ weight, women)
#show(g)
#update(g, type=c(“g”, “p”, “r”))

##end

Q4

The usBirths2015.txt is a dataset of monthly births in the US in 2015. Summarize the number of births by season.

dta_usBirths2015 <- read.table("C:/Users/Ching-Fang Wu/Documents/dataM/usBirths2015.txt", header = T, row.names = 1 )

str(dta_usBirths2015)

## 'data.frame':    12 obs. of  1 variable:
##  $ month: chr  "January" "February" "March" "April" ...

summary(dta_usBirths2015)

##     month          
##  Length:12         
##  Class :character  
##  Mode  :character

12個月當中出生人數最少(Min.)時為298058人，最多Max.為353415人。平均每個月有Mean:331541。

#Summarize the number of births by season.
#計算四季出生人數
spring<-sum(dta_usBirths2015$birth[1:3])
summer<-sum(dta_usBirths2015$birth[4:6])
autumn<-sum(dta_usBirths2015$birth[7:9])
winter<-sum(dta_usBirths2015$birth[10:12])

data.frame(spring,summer,autumn,winter)

##   spring summer autumn winter
## 1      0      0      0      0

Q5

Ten subjects read a paragraph consisting of seven sentences. The reading time (in seconds) for each sentence was the outcome measure. The predictors are the serial position of the sentence (Sp), the number of words in the sentences (Wrds), and the number of new arguments in the sentence (New). (a) Rank subjects by their reading speeed (b) Estimate, on average, how long does it take to read a word.

Source: Lorch, R.F., & Myers, J.L. (1990). Regression analyses of repeated measures data in cognitive research. Journal of Experimental Psychology: Learning, Memory, and Cognition. 16, 149-157.

#input data
dta_readingtimes <- read.table("C:/Users/Ching-Fang Wu/Documents/dataM/readingtimes.txt", header = T)

#
str(dta_readingtimes)

## 'data.frame':    7 obs. of  14 variables:
##  $ Snt : int  1 2 3 4 5 6 7
##  $ Sp  : int  1 2 3 4 5 6 7
##  $ Wrds: int  13 16 9 9 10 18 6
##  $ New : int  1 3 2 2 3 4 1
##  $ S01 : num  3.43 6.48 1.71 3.68 4 ...
##  $ S02 : num  2.79 5.41 2.34 3.71 2.9 ...
##  $ S03 : num  4.16 4.49 3.02 2.87 2.99 ...
##  $ S04 : num  3.07 5.06 2.46 2.73 2.67 ...
##  $ S05 : num  3.62 9.29 6.04 4.21 3.88 ...
##  $ S06 : num  3.16 5.64 2.46 6.24 3.22 ...
##  $ S07 : num  3.23 8.36 4.92 3.72 3.14 ...
##  $ S08 : num  7.16 4.31 3.37 6.33 6.14 ...
##  $ S09 : num  1.54 2.95 1.38 1.15 2.76 ...
##  $ S10 : num  4.06 6.65 2.18 3.66 3.33 ...

#
dta_readingtimes

##   Snt Sp Wrds New   S01   S02   S03   S04   S05    S06    S07   S08   S09   S10
## 1   1  1   13   1 3.429 2.795 4.161 3.071 3.625  3.161  3.232 7.161 1.536 4.063
## 2   2  2   16   3 6.482 5.411 4.491 5.063 9.295  5.643  8.357 4.313 2.946 6.652
## 3   3  3    9   2 1.714 2.339 3.018 2.464 6.045  2.455  4.920 3.366 1.375 2.179
## 4   4  4    9   2 3.679 3.714 2.866 2.732 4.205  6.241  3.723 6.330 1.152 3.661
## 5   5  5   10   3 4.000 2.902 2.991 2.670 3.884  3.223  3.143 6.143 2.759 3.330
## 6   6  6   18   4 6.973 8.018 6.625 7.571 8.795 13.188 11.170 6.071 7.964 7.866
## 7   7  7    6   1 2.634 1.750 2.268 2.884 3.491  3.688  2.054 1.696 1.455 3.705

#先計算每位受試者平均閱讀時間
colMeans(dta_readingtimes[,5:14])

##      S01      S02      S03      S04      S05      S06      S07      S08 
## 4.130143 3.847000 3.774286 3.779286 5.620000 5.371286 5.228429 5.011429 
##      S09      S10 
## 2.741000 4.493714

#排序
order(colMeans(dta_readingtimes[,5:14]))

##  [1]  9  3  4  2  1 10  8  7  6  5

第九位(S09)的閱讀速度最快，第五位(S05)最慢。

#Estimate, on average, how long does it take to read a word.

colMeans(dta_readingtimes[,5:14])/dta_readingtimes[,3]

## Warning in colMeans(dta_readingtimes[, 5:14])/dta_readingtimes[, 3]: 較長的物件
## 長度並非較短物件長度的倍數

##       S01       S02       S03       S04       S05       S06       S07       S08 
## 0.3177033 0.2404375 0.4193651 0.4199206 0.5620000 0.2984048 0.8714048 0.3854945 
##       S09       S10 
## 0.1713125 0.4993016

##The end

DataM L3 Exercise 1-5

Ching-Fang Wu

2021-10-24

Q1

Q2

擲骰子作法2

Q3

Q&A

Q4

Q5