Edit the R script with your own comments to explain what each code chunk does, then compile the script into an HTML report for publication on Moodle.
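One way to compile the report, for example with rmarkdown (assuming the annotated script is saved as hw1.R, a hypothetical file name), is to spin the .R file directly: roxygen-style #' lines become prose and the remaining lines are executed as code.
# a minimal sketch; rmarkdown::render() spins an .R script (via knitr::spin) and knits it to HTML
rmarkdown::render("hw1.R", output_format = "html_document")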
example of regression with sampling weights
set some options
options(digits=3, show.signif.stars=FALSE)
package management
# install.packages("pacman")
load packages
pacman::p_load(alr4, tidyverse)
load data
data(UN11, package="alr4")
seed the random number generator to get the same sample
set.seed(6102)
pick 81 countries from three regions
arrange the rows in alphabetical order of region
dta1 <- UN11 %>%
filter(region %in% c("Africa", "Asia", "Europe")) %>%
sample_n(81) %>%
arrange(region)
From the UN11 data set, randomly select 81 countries from the Africa, Asia, and Europe regions, then sort the rows by region.
first 6 lines of data frame
head(dta1)
region group fertility ppgdp lifeExpF pctUrban
1 Africa africa 2.64 2654 75.5 44
2 Africa africa 4.74 737 63.2 28
3 Africa africa 2.38 7255 54.1 62
4 Africa africa 4.36 1131 61.0 42
5 Africa africa 4.62 802 59.2 23
6 Africa africa 4.98 16852 52.9 40
View the first 6 rows of the data.
data dimensions - rows and columns
dim(dta1)
[1] 81 6
Number of rows and columns in the data.
how many countries in each of the three regions
R3 <- table(dta1$region)
R3
Africa Asia Caribbean Europe Latin Amer
28 34 0 19 0
North America NorthAtlantic Oceania
0 0 0
Count the sampled countries in each of the three regions.
percentage of countries from each of the three regions selected
w <- R3/table(UN11$region)
w
Africa Asia Caribbean Europe Latin Amer
0.528 0.680 0.000 0.487 0.000
North America NorthAtlantic Oceania
0.000 0.000 0.000
Compute the proportion of countries sampled from each region of the original UN11 data set.
add the sampling weights variable to data
skip over countries in regions not selected
dta1$wt <- rep(1/w[w != 0], R3[R3 != 0])
Assign each observation the sampling weight for its region, i.e., the inverse of that region's sampling fraction.
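As a quick sanity check (not part of the original script), the weights should sum to the total number of UN11 countries in the three sampled regions, because each weight is the inverse of its region's sampling fraction.
# weighted sample size should recover the population count for the three regions
sum(dta1$wt)
sum(table(UN11$region)[c("Africa", "Asia", "Europe")])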
simple regression (ordinary least squares)
summary(m0 <- lm(fertility ~ log(ppgdp), data=dta1))
Call:
lm(formula = fertility ~ log(ppgdp), data = dta1)
Residuals:
Min 1Q Median 3Q Max
-1.936 -0.809 -0.162 0.611 3.029
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.4222 0.6448 11.51 < 2e-16
log(ppgdp) -0.5622 0.0772 -7.29 2.1e-10
Residual standard error: 1.1 on 79 degrees of freedom
Multiple R-squared: 0.402, Adjusted R-squared: 0.394
F-statistic: 53.1 on 1 and 79 DF, p-value: 2.11e-10
weighted regression (using the sampling weights)
summary(m1 <- update(m0, weights=wt))
Call:
lm(formula = fertility ~ log(ppgdp), data = dta1, weights = wt)
Weighted Residuals:
Min 1Q Median 3Q Max
-2.648 -1.028 -0.192 0.793 4.170
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.5651 0.6382 11.85 < 2e-16
log(ppgdp) -0.5771 0.0761 -7.59 5.6e-11
Residual standard error: 1.45 on 79 degrees of freedom
Multiple R-squared: 0.421, Adjusted R-squared: 0.414
F-statistic: 57.5 on 1 and 79 DF, p-value: 5.58e-11
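To see how little the sampling weights change the fit, the two sets of estimates can be placed side by side (a small addition, not in the original script).
# compare ordinary and weighted coefficient estimates
cbind(ordinary = coef(m0), weighted = coef(m1))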
plot
ggplot(dta1,
aes(log(ppgdp), fertility, label=region)) +
stat_smooth(method="lm", formula=y ~ x, se=F, col="peru", lwd=rel(.5)) +
stat_smooth(aes(weight=wt), method="lm", formula=y ~ x, se=F, lwd=rel(.5), col="gray")+
geom_text(check_overlap=TRUE, size=rel(2.3), aes(color=region))+
labs(x="GDP (US$ in log unit)",
y="Number of children per woman") +
theme_minimal() +
theme(legend.position="NONE")
The peru-colored (orange-brown) line is the ordinary regression fit; the slightly offset gray line is the weighted regression fit.
The end
Replicate the analysis in the lexical decision latencies example using the heid data set from the languageR package. Perform a generalized least-squares regression and compare the results against an ordinary regression of RT on BaseFrequency.
You will need to augment the data with a variable for trial:
heid$Trial <- as.numeric(unlist(apply(as.matrix(table(heid$Subject)), 1, seq)))
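The apply() call builds a within-subject trial counter by generating 1:n for each subject's n rows. An equivalent, arguably clearer sketch with dplyr (assuming the rows are already in presentation order within each subject, as the apply() version also assumes) is:
# hypothetical dplyr alternative: number the trials within each subject
heid <- heid %>%
  group_by(Subject) %>%
  mutate(Trial = row_number()) %>%
  ungroup()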
trial-by-trial reaction time data
pacman::p_load(languageR, lattice, tidyverse, nlme, lmtest)
data(heid, package="languageR")
head(heid)
Subject Word RT BaseFrequency
1 pp1 basaalheid 6.69 3.56
2 pp1 markantheid 6.81 5.16
3 pp1 ontroerdheid 6.51 5.55
4 pp1 contentheid 6.58 4.50
5 pp1 riantheid 6.86 4.53
6 pp1 tembaarheid 6.35 0.00
str(heid)
'data.frame': 832 obs. of 4 variables:
$ Subject : Factor w/ 26 levels "pp1","pp10","pp11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Word : Factor w/ 40 levels "aftandsheid",..: 4 24 28 10 33 38 25 5 16 30 ...
$ RT : num 6.69 6.81 6.51 6.58 6.86 6.35 6.69 7.17 6.58 6.98 ...
$ BaseFrequency: num 3.56 5.16 5.55 4.5 4.53 0 1.61 3.61 2.56 5.86 ...
heid$Trial <- as.numeric(unlist(apply(as.matrix(table(heid$Subject)), 1, seq)))
str(heid)
'data.frame': 832 obs. of 5 variables:
$ Subject : Factor w/ 26 levels "pp1","pp10","pp11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Word : Factor w/ 40 levels "aftandsheid",..: 4 24 28 10 33 38 25 5 16 30 ...
$ RT : num 6.69 6.81 6.51 6.58 6.86 6.35 6.69 7.17 6.58 6.98 ...
$ BaseFrequency: num 3.56 5.16 5.55 4.5 4.53 0 1.61 3.61 2.56 5.86 ...
$ Trial : num 1 2 3 4 5 6 7 8 9 10 ...
dta2 <- heid %>% select(Subject, Trial, RT, BaseFrequency)
ot <- theme_set(theme_bw())
word frequency effect shown by an ordinary regression fit
ggplot(data = dta2, aes(x = BaseFrequency, y = RT)) +
geom_point() +
stat_smooth(method="lm") +
labs(x = "Word Frequency", y = "Reaction Time (transformed)")
trial effect by subject (reaction times plotted across trials for each subject)
ggplot(data = dta2, aes(x = Trial, y = RT)) +
geom_point(pch =1, size = rel(.5)) +
geom_line() +
facet_wrap(~ Subject) +
labs(x = "Trial ID", y = "Reaction Time (transformed)")
autocorrelation by individual
acf.fnc(dta2, group = "Subject", time = "Trial", x = "RT", plot = TRUE)
simple linear model ignoring trial dependence - least-squares
summary(m0 <- lm(RT ~ BaseFrequency, data = dta2))
Call:
lm(formula = RT ~ BaseFrequency, data = dta2)
Residuals:
Min 1Q Median 3Q Max
-0.5166 -0.2158 -0.0525 0.1599 0.8903
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.6417 0.0207 320.39 <2e-16
BaseFrequency -0.0155 0.0047 -3.29 0.001
Residual standard error: 0.282 on 830 degrees of freedom
Multiple R-squared: 0.0129, Adjusted R-squared: 0.0117
F-statistic: 10.8 on 1 and 830 DF, p-value: 0.00104
refit the model above by maximum likelihood using gls
summary(m0a <- gls(RT ~ BaseFrequency, data = dta2, method = "ML"))
Generalized least squares fit by maximum likelihood
Model: RT ~ BaseFrequency
Data: dta2
AIC BIC logLik
258 273 -126
Coefficients:
Value Std.Error t-value p-value
(Intercept) 6.64 0.0207 320 0.000
BaseFrequency -0.02 0.0047 -3 0.001
Correlation:
(Intr)
BaseFrequency -0.882
Standardized residuals:
Min Q1 Med Q3 Max
-1.835 -0.766 -0.186 0.568 3.161
Residual standard error: 0.282
Degrees of freedom: 832 total; 830 residual
simple linear model with autocorrelation error using gls
summary(m1 <- update(m0a, correlation = corAR1(form = ~ Trial | Subject)))
Generalized least squares fit by maximum likelihood
Model: RT ~ BaseFrequency
Data: dta2
AIC BIC logLik
64.3 83.2 -28.1
Correlation Structure: AR(1)
Formula: ~Trial | Subject
Parameter estimate(s):
Phi
0.462
Coefficients:
Value Std.Error t-value p-value
(Intercept) 6.64 0.02193 302.5 0e+00
BaseFrequency -0.01 0.00387 -3.6 4e-04
Correlation:
(Intr)
BaseFrequency -0.699
Standardized residuals:
Min Q1 Med Q3 Max
-1.835 -0.765 -0.184 0.586 3.159
Residual standard error: 0.281
Degrees of freedom: 832 total; 830 residual
anova(m0a, m1)
Model df AIC BIC logLik Test L.Ratio p-value
m0a 1 3 258.5 272.6 -126.2
m1 2 4 64.3 83.2 -28.1 1 vs 2 196 <.0001
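The likelihood-ratio test strongly favors the AR(1) error structure. As an additional check (not in the original script), nlme's intervals() returns an approximate confidence interval for the autocorrelation parameter Phi.
# approximate 95% CI for Phi and the residual standard error of m1
intervals(m1, which = "var-cov")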
collect residuals from the two models above
dta2 <- dta2 %>% mutate(res0 = rstandard(m0),
res1 = resid(m1, type = "normalized"))
Durbin-Watson test of serial correlation
dwtest(res0 ~ BaseFrequency, data = dta2)
Durbin-Watson test
data: res0 ~ BaseFrequency
DW = 1, p-value <2e-16
alternative hypothesis: true autocorrelation is greater than 0
dwtest(res1 ~ BaseFrequency, data = dta2)
Durbin-Watson test
data: res1 ~ BaseFrequency
DW = 2, p-value = 1
alternative hypothesis: true autocorrelation is greater than 0
check acf for residuals
acf.fnc(dta2, group = "Subject", time = "Trial", x = "res1", plot = TRUE)
End
Replicate the analyses shown in the heights of boys example for the heights of girls data. You can ignore the grouping by mothers in the analysis.
regression with correlations over time and in clusters
# set significant digits for printing
# don't print silly stars
options(digits=4, show.signif.stars=FALSE)
# handle package loading automatically
# install.packages("pacman")
# load packages
pacman::p_load(tidyverse, nlme, broom, magrittr, MuMIn)
heightg <- read.csv("~/data/girl_height.csv", header = T)
str(heightg)
'data.frame': 100 obs. of 4 variables:
$ Height: num 111 116 122 126 130 ...
$ Girl : Factor w/ 20 levels "G1","G10","G11",..: 1 1 1 1 1 12 12 12 12 12 ...
$ Age : int 6 7 8 9 10 6 7 8 9 10 ...
$ Mother: Factor w/ 3 levels "M","S","T": 2 2 2 2 2 2 2 2 2 2 ...
dta3 <- heightg %>%
select(Height, Age)
dta3 %>% cor
Height Age
Height 1.0000 0.8551
Age 0.8551 1.0000
dta3 %>% var %>% diag
Height Age
90.28 2.02
ot <- theme_set(theme_bw())
ggplot(heightg, aes(x = Age, y = Height, color = Mother)) +
geom_point() +
stat_smooth(aes(group = 1), method = "lm") +
labs(x = "Age (scaled)", y = "Height (cm)") +
theme(legend.position = "NONE")
m0 <- gls(Height ~ Age, data = heightg)
m1 <- update(m0, corr = corAR1(form = ~ 1 | Girl), method = "ML")
m2 <- update(m0, weights = varIdent(form = ~ 1 | Mother))
m3 <- update(m1, weights = varIdent(form = ~ 1 | Mother))
model.sel(m0, m1, m2, m3)
Model selection table
(Intrc) Age correlation weights REML df logLik AICc delta weight
m3 82.47 5.558 crAR1(Girl) vrI(Mth) F 6 -162.8 338.5 0.00 1
m1 82.47 5.684 crAR1(Girl) F 4 -172.7 353.8 15.26 0
m2 82.46 5.549 vrI(Mth) 5 -292.0 594.6 256.14 0
m0 82.52 5.716 3 -300.8 607.8 269.27 0
Abbreviations:
correlation: crAR1(Girl) = 'corAR1(~1|Girl)'
weights: vrI(Mth) = 'varIdent(~1|Mother)'
REML: F = 'FALSE'
Models ranked by AICc(x)
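Note that m0 and m2 were fitted by REML while m1 and m3 used ML, so their information criteria are not strictly comparable. A hedged sketch of a cleaner ranking refits the REML models by maximum likelihood first:
# refit the REML models with ML so all four likelihoods are on the same scale
model.sel(update(m0, method = "ML"), m1, update(m2, method = "ML"), m3)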
lapply(list(m0, m1, m2, m3), summary)
[[1]]
Generalized least squares fit by REML
Model: Height ~ Age
Data: heightg
AIC BIC logLik
607.5 615.3 -300.8
Coefficients:
Value Std.Error t-value p-value
(Intercept) 82.52 2.8439 29.02 0
Age 5.72 0.3501 16.33 0
Correlation:
(Intr)
Age -0.985
Standardized residuals:
Min Q1 Med Q3 Max
-1.8561 -0.7402 -0.1696 0.7974 2.6348
Residual standard error: 4.951
Degrees of freedom: 100 total; 98 residual
[[2]]
Generalized least squares fit by maximum likelihood
Model: Height ~ Age
Data: heightg
AIC BIC logLik
353.3 363.8 -172.7
Correlation Structure: AR(1)
Formula: ~1 | Girl
Parameter estimate(s):
Phi
0.9782
Coefficients:
Value Std.Error t-value p-value
(Intercept) 82.47 1.380 59.75 0
Age 5.68 0.111 51.21 0
Correlation:
(Intr)
Age -0.643
Standardized residuals:
Min Q1 Med Q3 Max
-1.8432 -0.7005 -0.1174 0.8831 2.7934
Residual standard error: 4.781
Degrees of freedom: 100 total; 98 residual
[[3]]
Generalized least squares fit by REML
Model: Height ~ Age
Data: heightg
AIC BIC logLik
594 606.9 -292
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | Mother
Parameter estimates:
S M T
1.0000 0.8014 1.8171
Coefficients:
Value Std.Error t-value p-value
(Intercept) 82.46 2.3331 35.34 0
Age 5.55 0.2872 19.32 0
Correlation:
(Intr)
Age -0.985
Standardized residuals:
Min Q1 Med Q3 Max
-1.88260 -0.64836 0.07568 0.75427 2.40785
Residual standard error: 3.961
Degrees of freedom: 100 total; 98 residual
[[4]]
Generalized least squares fit by maximum likelihood
Model: Height ~ Age
Data: heightg
AIC BIC logLik
337.6 353.2 -162.8
Correlation Structure: AR(1)
Formula: ~1 | Girl
Parameter estimate(s):
Phi
0.9787
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | Mother
Parameter estimates:
S M T
1.0000 0.6923 1.5494
Coefficients:
Value Std.Error t-value p-value
(Intercept) 82.47 1.1305 72.95 0
Age 5.56 0.0903 61.55 0
Correlation:
(Intr)
Age -0.639
Standardized residuals:
Min Q1 Med Q3 Max
-1.77189 -0.71727 0.06484 0.80091 2.56095
Residual standard error: 4.265
Degrees of freedom: 100 total; 98 residual
#### autocorrelations of the normalized residuals for each model
acf(resid(m0, type = "normalized"), main = "m0")
acf(resid(m1, type = "normalized"), main = "m1")
acf(resid(m2, type = "normalized"), main = "m2")
acf(resid(m3, type = "normalized"), main = "m3")
The end
The data set concerns student evaluations of instructors' beauty and teaching quality for several courses at the University of Texas. The teaching evaluations were done at the end of the semester, and the beauty judgments were made later by six students who had not attended the classes and were not aware of the course evaluations. Carry out the analysis presented in the R script by commenting on what each code chunk does. You can spin the script to a markdown file for HTML output.
Source: Hamermesh, D. S., & Parker, A. M. (2005). Beauty in the classroom: Instructors' pulchritude and putative pedagogical productivity. Economics of Education Review, 24, 369-376. Reported in Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models, p. 51.
Column 1: Course evaluation score
Column 2: Beauty score
Column 3: Gender of professor, 1 = Female, 0 = Male
Column 4: Professor age in years
Column 5: Minority status of professor, 1 = Minority, 0 = Others
Column 6: Tenure status of professor, 1 = Tenured, 0 = No
Column 7: Course ID
dta4 <- read.table("~/data/course_eval.txt", h=T)
str(dta4)
'data.frame': 463 obs. of 7 variables:
$ Score : num 4.3 4.5 3.7 4.3 4.4 4.2 4 3.4 4.5 3.9 ...
$ Beauty : num 0.202 -0.826 -0.66 -0.766 1.421 ...
$ Gender : int 1 0 0 1 1 0 1 1 1 0 ...
$ Age : int 36 59 51 40 31 62 33 51 33 47 ...
$ Minority: int 1 0 0 0 0 0 0 0 0 0 ...
$ Tenure : int 0 1 1 1 0 1 0 1 0 0 ...
$ Course : int 3 0 4 2 0 0 4 0 0 4 ...
dta4$Course <- as.factor(dta4$Course)
library(lattice)
xyplot(Score ~ Beauty | Course, data=dta4,
ylab="Average course evaluation score",
xlab="Beauty judgment score",
type=c("p", "g", "r"))
dtag <-
groupedData(Score ~ Beauty | Course,
data=as.data.frame(dta4),
FUN=mean,
labels=list(x="Beauty score",
y="Couse evaluation score" ))
plot(dtag, asp=1)
m0 <- lm(Score ~ Beauty, data=dta4)
summary(m0)
Call:
lm(formula = Score ~ Beauty, data = dta4)
Residuals:
Min 1Q Median 3Q Max
-1.8002 -0.3630 0.0725 0.4021 1.1037
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0100 0.0255 157.21 < 2e-16
Beauty 0.1330 0.0322 4.13 4.2e-05
Residual standard error: 0.545 on 461 degrees of freedom
Multiple R-squared: 0.0357, Adjusted R-squared: 0.0336
F-statistic: 17.1 on 1 and 461 DF, p-value: 4.25e-05
with(dta4, plot(Beauty, Score, bty="n",
xlab="Beauty score",
ylab="Average course evaluation"))
grid()
abline(m0)
t.test(coef(lmList(Score ~ Beauty | Course, data = dta4))["Beauty"])
One Sample t-test
data: coef(lmList(Score ~ Beauty | Course, data = dta4))["Beauty"]
t = 0.54, df = 30, p-value = 0.6
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-2.188 3.769
sample estimates:
mean of x
0.7905
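To make the one-liner above more transparent (a small expansion, not in the original script), the per-course Beauty slopes can be extracted first and then tested against zero:
# unpack the t.test() call: one slope per course, then a one-sample t-test
bs <- coef(lmList(Score ~ Beauty | Course, data = dta4))[, "Beauty"]
summary(bs)  # distribution of the course-specific slopes
t.test(bs)   # same test as above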
m1 <- lm(Score ~ Course/Beauty - 1, data = dta4)
summary(m1)
Call:
lm(formula = Score ~ Course/Beauty - 1, data = dta4)
Residuals:
Min 1Q Median 3Q Max
-1.8108 -0.2739 0.0135 0.3314 1.0935
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Course0 4.0316 0.0304 132.83 < 2e-16
Course1 3.7049 1.2199 3.04 0.00254
Course2 4.4852 0.4880 9.19 < 2e-16
Course3 3.8806 0.2535 15.31 < 2e-16
Course4 3.8359 0.1207 31.79 < 2e-16
Course5 4.0863 0.3073 13.30 < 2e-16
Course6 4.2195 0.3775 11.18 < 2e-16
Course7 3.4491 0.4324 7.98 1.6e-14
Course8 5.0785 1.0201 4.98 9.5e-07
Course9 3.9284 0.2150 18.27 < 2e-16
Course10 4.8393 1.8830 2.57 0.01053
Course11 4.1322 0.4060 10.18 < 2e-16
Course12 3.3081 0.6667 4.96 1.0e-06
Course13 4.1574 0.6147 6.76 4.8e-11
Course14 3.6171 0.4442 8.14 4.9e-15
Course15 2.4200 0.4187 5.78 1.5e-08
Course16 -16.9734 22.0939 -0.77 0.44280
Course17 4.4669 0.2982 14.98 < 2e-16
Course18 4.2042 0.3147 13.36 < 2e-16
Course19 3.5362 0.2520 14.03 < 2e-16
Course20 4.4540 0.4957 8.99 < 2e-16
Course21 3.6620 0.1425 25.70 < 2e-16
Course22 3.7286 0.2064 18.06 < 2e-16
Course23 3.5310 0.5096 6.93 1.7e-11
Course24 3.7810 0.3552 10.65 < 2e-16
Course25 20.7529 13.6132 1.52 0.12818
Course26 4.2462 0.3138 13.53 < 2e-16
Course27 4.5522 0.6720 6.77 4.5e-11
Course28 3.6051 1.3974 2.58 0.01024
Course29 3.8039 0.3909 9.73 < 2e-16
Course30 4.3076 0.3272 13.17 < 2e-16
Course0:Beauty 0.1462 0.0378 3.87 0.00013
Course1:Beauty 0.9109 1.3378 0.68 0.49632
Course2:Beauty 0.2416 0.8978 0.27 0.78795
Course3:Beauty -0.0458 1.4114 -0.03 0.97412
Course4:Beauty 0.5401 0.2909 1.86 0.06405
Course5:Beauty 0.2275 0.4206 0.54 0.58888
Course6:Beauty 0.7948 0.6393 1.24 0.21453
Course7:Beauty -0.1954 0.4447 -0.44 0.66051
Course8:Beauty 1.8980 1.4103 1.35 0.17913
Course9:Beauty -1.9625 0.7006 -2.80 0.00534
Course10:Beauty 0.5348 1.9240 0.28 0.78117
Course11:Beauty -0.4049 0.5014 -0.81 0.41987
Course12:Beauty -1.9471 1.6706 -1.17 0.24452
Course13:Beauty 0.5055 0.9294 0.54 0.58682
Course14:Beauty -0.1386 0.8917 -0.16 0.87659
Course15:Beauty -0.3753 0.5578 -0.67 0.50140
Course16:Beauty -18.2149 19.1411 -0.95 0.34187
Course17:Beauty 0.3354 0.4471 0.75 0.45349
Course18:Beauty -0.1303 0.4932 -0.26 0.79172
Course19:Beauty -0.3202 0.3246 -0.99 0.32451
Course20:Beauty 0.0692 0.8893 0.08 0.93801
Course21:Beauty -0.0380 0.1891 -0.20 0.84096
Course22:Beauty 0.1641 0.2424 0.68 0.49881
Course23:Beauty -1.3347 0.7649 -1.74 0.08178
Course24:Beauty 0.0756 0.9732 0.08 0.93811
Course25:Beauty 40.5972 32.6559 1.24 0.21453
Course26:Beauty 0.2253 0.3363 0.67 0.50332
Course27:Beauty 0.9815 1.2156 0.81 0.41987
Course28:Beauty 0.7384 0.9699 0.76 0.44693
Course29:Beauty 0.4722 0.2924 1.61 0.10711
Course30:Beauty 0.1544 0.2311 0.67 0.50451
Residual standard error: 0.525 on 401 degrees of freedom
Multiple R-squared: 0.985, Adjusted R-squared: 0.983
F-statistic: 434 on 62 and 401 DF, p-value: <2e-16
There is no Course × Beauty interaction; the Beauty slope does not differ reliably across courses.
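One way to back up this claim formally (an addition, not in the original script) is a nested-model F test comparing a common-slope model against the per-course-slope model m1; m1_common is a hypothetical name.
# F test: one common Beauty slope vs. a separate slope per course
m1_common <- lm(Score ~ Course + Beauty - 1, data = dta4)
anova(m1_common, m1)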
show the Beauty slopes with p < .05 (the index must be offset past the 31 course intercepts)
summary(m1)$coef[which(summary(m1)$coef[-c(1:31), 4] < 0.05) + 31, ]
                Estimate Std. Error t value Pr(>|t|)
Course0:Beauty    0.1462     0.0378    3.87  0.00013
Course9:Beauty   -1.9625     0.7006   -2.80  0.00534
The end