DataM: Homework Exercise 0427 - Functions
DataM: Homework Exercise 0427 - Functions
- HW exercise 1.
- Question (a): test if any pairs of the five variables: read, write, math, science, and socst, are different in means.
- Question (b): test if the 4 different ethnic groups have the same mean scores for each of the 5 variables (individually): read, write, math, science, and socst.
- Question (c): Perform all pairwise simple regressions for these variables:
read,write,math,science, andsocst.
- HW exercise 2.
- HW exercise 3.
- HW exercise 4 (c-stat example).
- HW exercise 5.
- HW exercise 6.
HW exercise 1.
Use the data in the high schools example to solve the following problems:
- Question (a): test if any pairs of the five variables: read, write, math, science, and socst, are different in means.
- Question (b): test if the 4 different ethnic groups have the same mean scores for each of the 5 variables (individually): read, write, math, science, and socst.
- Question (c): Perform all pairwise simple regressions for these variables: read, write, math, science, and socst.
- Column 1: Student ID
- Column 2: Gender, M = Male, F = Female
- Column 3: Race
- Column 4: Socioeconomic status
- Column 5: School type
- Column 6: Program type
- Column 7: Reading score
- Column 8: Writing score
- Column 9: Math score
- Column 10: Science score
- Column 11: Social science studies score
Load in the data set and check its structure.
'data.frame': 200 obs. of 11 variables:
$ id : int 70 121 86 141 172 113 50 11 84 48 ...
$ female : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
$ race : Factor w/ 4 levels "african-amer",..: 4 4 4 4 4 4 1 3 4 1 ...
$ ses : Factor w/ 3 levels "high","low","middle": 2 3 1 1 3 3 3 3 3 3 ...
$ schtyp : Factor w/ 2 levels "private","public": 2 2 2 2 2 2 2 2 2 2 ...
$ prog : Factor w/ 3 levels "academic","general",..: 2 3 2 3 1 1 2 1 2 1 ...
$ read : int 57 68 44 63 47 44 50 34 63 57 ...
$ write : int 52 59 33 44 52 52 59 46 57 55 ...
$ math : int 41 53 54 47 57 51 42 45 54 52 ...
$ science: int 47 63 58 53 53 63 53 39 58 NA ...
$ socst : int 57 61 31 56 61 61 61 36 51 51 ...
Trun the data set into a long form and do some variable transformation.
'data.frame': 1000 obs. of 8 variables:
$ id : int 70 121 86 141 172 113 50 11 84 48 ...
$ female : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
$ race : Factor w/ 4 levels "african-amer",..: 4 4 4 4 4 4 1 3 4 1 ...
$ ses : Factor w/ 3 levels "high","low","middle": 2 3 1 1 3 3 3 3 3 3 ...
$ schtyp : Factor w/ 2 levels "private","public": 2 2 2 2 2 2 2 2 2 2 ...
$ prog : Factor w/ 3 levels "academic","general",..: 2 3 2 3 1 1 2 1 2 1 ...
$ variable: Factor w/ 5 levels "read","write",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : int 57 68 44 63 47 44 50 34 63 57 ...
Question (a): test if any pairs of the five variables: read, write, math, science, and socst, are different in means.
Data visualization (point plot with error bar)
library(ggplot2)
dta1_long %>% group_by(variable) %>%
summarise(group_MEAN = mean(value, na.rm = TRUE),
group_SE = sd(value, na.rm = TRUE) / sqrt(sum(!is.na(value)))) %>%
ggplot(mapping = aes(x = variable, y = group_MEAN)) +
geom_point(aes(color = variable), size = 4) +
geom_errorbar(aes(ymax = group_MEAN + 1.96*group_SE,
ymin = group_MEAN - 1.96*group_SE), width=.3, size=.4) +
xlab('Variable') + ylab('Mean score') + theme_bw() +
theme(legend.position = 'none')It seems that there is no obvious difference among these five measurements.
One-way ANOVA with repeated measure and Tukey HSD
Error: id
Df Sum Sq Mean Sq
variable 1 4264 4264
Error: id:variable
Df Sum Sq Mean Sq
variable 4 51.03 12.76
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
variable 4 322 80.59 0.851 0.493
Residuals 985 93271 94.69
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = value ~ variable, data = dta1_long)
$variable
diff lwr upr p adj
write-read 0.5450000 -2.171434 3.261434 0.9821514
math-read 0.4150000 -2.301434 3.131434 0.9936400
science-read -0.3120513 -3.045842 2.421740 0.9979459
socst-read 0.1750000 -2.541434 2.891434 0.9997850
math-write -0.1300000 -2.846434 2.586434 0.9999341
science-write -0.8570513 -3.590842 1.876740 0.9124627
socst-write -0.3700000 -3.086434 2.346434 0.9959143
science-math -0.7270513 -3.460842 2.006740 0.9503030
socst-math -0.2400000 -2.956434 2.476434 0.9992492
socst-science 0.4870513 -2.246740 3.220842 0.9885731
ANOVA showed that there is no significant difference among these five measurements. Tukey HSD test also showed that there is no pair of the five variables different in mean.
Question (b): test if the 4 different ethnic groups have the same mean scores for each of the 5 variables (individually): read, write, math, science, and socst.
Data visualization (faceted point plot with error bar)
dta1_long %>% group_by(race, variable) %>%
summarise(group_MEAN = mean(value, na.rm = TRUE),
group_SE = sd(value, na.rm = TRUE) / sqrt(sum(!is.na(value)))) %>%
ggplot(mapping = aes(x = race, y = group_MEAN)) +
facet_wrap(. ~ variable, nrow = 1) +
geom_point(aes(color = race), size = 2) +
geom_errorbar(aes(ymax = group_MEAN + 1.96*group_SE,
ymin = group_MEAN - 1.96*group_SE), width=.2, size=.35) +
xlab('Variable') + ylab('Mean score') + theme_bw() +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 80, margin = margin(t = 10)))It seems that there is an obvious ethnic difference in read, write, math, and science. There seems to be no ethnic difference in socst.
One-way ANOVA and Tukey HSD in specified measurement
f <- function(df) {
model <- aov(value ~ race, data = df)
return(list(summary(model), TukeyHSD(model)))
}
results <- lapply(split(dta1_long, dta1_long$variable), f)(1) read
[[1]]
Df Sum Sq Mean Sq F value Pr(>F)
race 3 1750 583.3 5.964 0.000654 ***
Residuals 196 19170 97.8
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[[2]]
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = value ~ race, data = df)
$race
diff lwr upr p adj
asian-african-amer 5.1090909 -4.510360 14.728542 0.5157500
hispanic-african-amer -0.1333333 -7.891989 7.625323 0.9999682
white-african-amer 7.1241379 1.011569 13.236706 0.0150717
hispanic-asian -5.2424242 -14.573094 4.088246 0.4662523
white-asian 2.0150470 -5.999200 10.029294 0.9149192
white-hispanic 7.2574713 1.610254 12.904689 0.0056791
There is a significant difference among four races in the measurement of read (\(p < .001\)). Also, Tukey HSD showed that:
- White participants have higher
readscores than African American participants do (\(p < .05\)). - White participants have higher
readscores than Hispanic participants do (\(p < .01\)).
(2) write
[[1]]
Df Sum Sq Mean Sq F value Pr(>F)
race 3 1914 638.1 7.833 5.78e-05 ***
Residuals 196 15965 81.5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[[2]]
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = value ~ race, data = df)
$race
diff lwr upr p adj
asian-african-amer 9.800000 1.0214200 18.578580 0.0219249
hispanic-african-amer -1.741667 -8.8221106 5.338777 0.9197990
white-african-amer 5.855172 0.2769254 11.433419 0.0355640
hispanic-asian -11.541667 -20.0567093 -3.026624 0.0030738
white-asian -3.944828 -11.2585205 3.368865 0.5023337
white-hispanic 7.596839 2.4432653 12.750413 0.0010226
There is a significant difference among four races in the measurement of write (\(p < .001\)). Also, Tukey HSD showed that:
- Asian participants have higher
writescores than African American participants do (\(p < .05\)). - White participants have higher
writescores than African American participants do (\(p < .05\)). - Hispanic participants have lower
writescores than Asian participants do (\(p < .01\)). - White participants have higher
writescores than Hispanic participants do (\(p < .01\)).
(3) math
[[1]]
Df Sum Sq Mean Sq F value Pr(>F)
race 3 1842 614.0 7.703 6.84e-05 ***
Residuals 196 15624 79.7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[[2]]
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = value ~ race, data = df)
$race
diff lwr upr p adj
asian-african-amer 10.5227273 1.838424 19.207030 0.0104479
hispanic-african-amer 0.6666667 -6.337737 7.671071 0.9947149
white-african-amer 7.2224138 1.704074 12.740754 0.0046342
hispanic-asian -9.8560606 -18.279657 -1.432465 0.0145446
white-asian -3.3003135 -10.535462 3.934835 0.6389480
white-hispanic 6.5557471 1.457520 11.653975 0.0056430
There is a significant difference among four races in the measurement of math (\(p < .001\)). Also, Tukey HSD showed that:
- Asian participants have higher
mathscores than African American participants do (\(p < .05\)). - White participants have higher
mathscores than African American participants do (\(p < .01\)). - Hispanic participants have lower
mathscores than Asian participants do (\(p < .05\)). - White participants have higher
mathscores than Hispanic participants do (\(p < .01\)).
(4) science
[[1]]
Df Sum Sq Mean Sq F value Pr(>F)
race 3 3170 1056.5 13.06 8.5e-08 ***
Residuals 191 15447 80.9
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
5 observations deleted due to missingness
[[2]]
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = value ~ race, data = df)
$race
diff lwr upr p adj
asian-african-amer 9.033493 0.202789 17.864197 0.0428018
hispanic-african-amer 3.796339 -3.429558 11.022236 0.5249642
white-african-amer 11.726835 6.033066 17.420603 0.0000016
hispanic-asian -5.237154 -13.781662 3.307353 0.3875623
white-asian 2.693342 -4.601452 9.988136 0.7739897
white-hispanic 7.930496 2.691577 13.169415 0.0006985
There is a significant difference among four races in the measurement of science (\(p < .001\)). Also, Tukey HSD showed that:
- Asian participants have higher
sciencescores than African American participants do (\(p < .05\)). - White participants have higher
sciencescores than African American participants do (\(p < .001\)). - White participants have higher
sciencescores than Hispanic participants do (\(p < .001\)).
(5) socst
[[1]]
Df Sum Sq Mean Sq F value Pr(>F)
race 3 944 314.6 2.804 0.041 *
Residuals 196 21992 112.2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[[2]]
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = value ~ race, data = df)
$race
diff lwr upr p adj
asian-african-amer 1.550000 -8.7533662 11.853366 0.9798366
hispanic-african-amer -1.658333 -9.9686075 6.651941 0.9549249
white-african-amer 4.232759 -2.3143961 10.779913 0.3395595
hispanic-asian -3.208333 -13.2023873 6.785721 0.8392806
white-asian 2.682759 -5.9012785 11.266796 0.8497705
white-hispanic 5.891092 -0.1576264 11.939810 0.0593904
There is a significant difference among four races in the measurement of socst but the significance is not high (\(p < .05\)). Tukey HSD showed that no comparison pair is significant.
Question (c): Perform all pairwise simple regressions for these variables: read, write, math, science, and socst.
Four simple regression models with read as y and (1) write, (2) math, (3) science, and (4) socst as x
dta1 %>% dplyr::select(write, math, science, socst) %>%
lapply(., function(x) lm(dta1$read ~ x)) %>%
lapply(., broom::tidy)$write
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 18.2 3.31 5.49 1.21e- 7
2 x 0.646 0.0617 10.5 1.11e-20
$math
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 14.1 3.12 4.52 1.08e- 5
2 x 0.725 0.0583 12.4 1.28e-26
$science
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 18.2 3.10 5.87 1.91e- 8
2 x 0.654 0.0586 11.2 1.22e-22
$socst
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 21.1 2.84 7.43 3.22e-12
2 x 0.594 0.0532 11.2 9.29e-23
The slope of four models are all significant (\(p < .001\)).
Three simple regression models with write as y and (1) math, (2) science, and (3) socst as x
dta1 %>% dplyr::select(math, science, socst) %>%
lapply(., function(x) lm(dta1$write ~ x)) %>%
lapply(., broom::tidy)$math
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 19.9 3.02 6.58 4.20e-10
2 x 0.625 0.0566 11.0 2.09e-22
$science
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 24.4 3.03 8.03 9.27e-14
2 x 0.546 0.0574 9.50 8.15e-18
$socst
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 24.8 2.67 9.28 3.11e-17
2 x 0.534 0.0500 10.7 2.45e-21
The slope of three models are all significant (\(p < .001\)).
Two simple regression models with math as y and (1) science and (2) socst as x
dta1 %>% dplyr::select(science, socst) %>%
lapply(., function(x) lm(dta1$math ~ x)) %>%
lapply(., broom::tidy)$science
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 21.4 2.84 7.54 1.76e-12
2 x 0.600 0.0537 11.2 1.06e-22
$socst
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 27.7 2.78 9.97 3.09e-19
2 x 0.475 0.0520 9.13 7.83e-17
The slope of two models are both significant (\(p < .001\)).
HW exercise 2.
The formula \(P = L (r/(1-(1+r)^{-M})\) describes the payment you have to make per month for M number of months if you take out a loan of L amount today at a monthly interest rate of r. Compute how much you will have to pay per month for 10, 15, 20, 25, or 30 years if you borrow NT$5,000,000, 10,000,000, or 15,000,000 from a bank that charges you 2%, 5%, or 7% for the monthly interest rate.
HW exercise 3.
The following R script is an attempt to demonstrate the correspondence between parameter estimations by the least square method and the maximum likelihood method for the case of simple linear regression with a constant normal error term.
- Construct a function from the script so that any deviance value for pairs of parameter estimates can be found.
- Generalize the function further so that it will work with any data sets that can be modeled by a simple linear regression with a constant normal error term.
Call:
lm(formula = weight ~ height, data = women)
Residuals:
Min 1Q Median 3Q Max
-1.7333 -1.1333 -0.3833 0.7417 3.1167
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
height 3.45000 0.09114 37.85 1.09e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
##
param <- c(coef(m0)[1], coef(m0)[2])
##
a <- param[1]
##
b <- param[2]
##
yhat <- a + b*women$height
##
e <- summary(m0)$sigma
##
lkhd <- dnorm(women$weight, mean=yhat, sd=e)
##
dvnc <- -2 * sum(log(lkhd))
##
ci_a <- coef(m0)[1] + unlist(summary(m0))$coefficients3*c(-2,2)
##
ci_b <- coef(m0)[2] + unlist(summary(m0))$coefficients4*c(-2,2)
##
bb <- expand.grid(a=seq(ci_a[1], ci_a[2], len=50),
b=seq(ci_b[1], ci_b[2], len=50))1
2
Create the function
lm_with_CI <- function(m0) {
print(summary(m0))
param <- c(coef(m0)[1], coef(m0)[2])
a <- param[1]
b <- param[2]
ci_a <- coef(m0)[1] + unlist(summary(m0))$coefficients3*c(-1.96, 1.96)
ci_b <- coef(m0)[2] + unlist(summary(m0))$coefficients4*c(-1.96, 1.96)
print(paste0('95% CI of the intercept: [', min(ci_a), ', ', max(ci_a), ']'))
print(paste0('95% CI of the slope: [', min(ci_b), ', ', max(ci_b), ']'))
}Use the function
Call:
lm(formula = weight ~ height, data = women)
Residuals:
Min 1Q Median 3Q Max
-1.7333 -1.1333 -0.3833 0.7417 3.1167
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
height 3.45000 0.09114 37.85 1.09e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
[1] "95% CI of the intercept: [-99.1530770025164, -75.8802563308165]"
[1] "95% CI of the slope: [3.27137246888624, 3.62862753111375]"
Call:
lm(formula = weight ~ Time, data = ChickWeight)
Residuals:
Min 1Q Median 3Q Max
-138.331 -14.536 0.926 13.533 160.669
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.4674 3.0365 9.046 <2e-16 ***
Time 8.8030 0.2397 36.725 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 38.91 on 576 degrees of freedom
Multiple R-squared: 0.7007, Adjusted R-squared: 0.7002
F-statistic: 1349 on 1 and 576 DF, p-value: < 2.2e-16
[1] "95% CI of the intercept: [21.5159560282947, 33.4188942714666]"
[1] "95% CI of the slope: [8.33322725058112, 9.27285128480826]"
HW exercise 4 (c-stat example).
Modify the following R script to create a function to compute the c-statistic illustrated with the data set in the article: Tryon, W.W. (1984). A simplified time-series analysis for evaluating treatment interventions. Journal of Applied Behavioral Analysis, 34(4), 230-233.
Load and check the dataset
'data.frame': 42 obs. of 1 variable:
$ nc: int 28 46 39 45 24 20 35 37 36 40 ...
Plot figure 1
- Modify axis setting to make the plot be more similar with the original figure.
- Use
lapplyand user-defined function to draw three segments at the same time.
plot(1:42, dta[,1], type = 'n', xaxt='n', yaxt ='n', xaxs='i', yaxs='i', # 1
xlab="Observations", ylab="Number of Children")
lines(1:42, dta[,1])
abline(v=10, lty=2)
abline(v=32, lty=2)
# 1
axis(1, at = seq(0, 40, by=5), labels = c(NA, seq(5, 40, by=5)))
axis(2, at = seq(10, 45, by=5), labels = c(NA, seq(15, 45, by=5)), las = 1)
# 2
#segments(1, mean(dta[1:10,1]),10, mean(dta[1:10,1]),col="red")
#segments(11, mean(dta[11:32,1]),32, mean(dta[11:32,1]),col="red")
#segments(33, mean(dta[33:42,1]),42, mean(dta[33:42,1]),col="red")
lapply(as.data.frame(matrix(c(1, 10, 11, 32, 33, 42), 2, 3)),
function(x) segments(x[1], mean(dta[x[1]:x[2], 1]),
x[2], mean(dta[x[1]:x[2], 1]), col = 'red'))Calculate c-stat for first baseline phase
The original script
cden <- 1-(sum(diff(dta[1:10,1])^2)/(2*(10-1)*var(dta[1:10,1])))
sc <- sqrt((10-2)/((10-1)*(10+1)))
pval <- 1-pnorm(cden/sc)
pval[1] 0.2866238
The modified script
f_cstat <- function(x) {
n <- length(x)
cden <- 1 - (sum(diff(x)^2)/(2*(n-1)*var(x)))
return(cden)
}
f_sc <- function(x) {
n <- length(x)
sc <- sqrt((n-2) / ((n-1) * (n+1)))
return(sc)
}
f_pval <- function(x) {
cden <- f_cstat(x)
sc <- f_sc(x)
return(1 - pnorm(cden / sc))
}
f_cstat(dta[1:10, 1])[1] 0.1601208
[1] 0.2842676
[1] 0.2866238
Calculate c-stat for first baseline plus group tokens
The original script
n <- 32
cden <- 1-(sum(diff(dta[1:n,1])^2)/(2*(n-1)*var(dta[1:n,1])))
sc <- sqrt((n-2)/((n-1)*(n+1)))
pval <- 1-pnorm(cden/sc)
list(z=cden/sc,pvalue=pval)$z
[1] 3.879054
$pvalue
[1] 5.243167e-05
The modified script
F_cstat <- function(x) {
n <- length(x)
cden <- 1 - (sum(diff(x)^2)/(2*(n-1)*var(x)))
sc <- sqrt((n-2)/((n-1)*(n+1)))
pval <- 1-pnorm(cden/sc)
return(list(c.stat = cden, sc = sc, z = cden/sc, p.value = pval))
}
F_cstat(dta[1:32, 1])$c.stat
[1] 0.6642762
$sc
[1] 0.1712469
$z
[1] 3.879054
$p.value
[1] 5.243167e-05
HW exercise 5.
Plot the likelihood function to estimate the probability of graduate admission by gender, respectively, for Dept A in UCBAdmissions{datasets}. Construct approximate 95%-CI for each gender. Do they overlap?
\(LL = \sum_{y_i=1}log(\theta) + \sum_{y_i \neq 1} log(1-\theta)\)
set.seed(1234)
theta <- seq(0.01, 0.99, by = .01)
n_m <- sum(UCBAdmissions[,1,1])
n_f <- sum(UCBAdmissions[,2,1])
p_m <- UCBAdmissions[1,1,1] / n_m
p_f <- UCBAdmissions[1,2,1] / n_f
y_m <- rbinom(n, 1, p_m)
y_f <- rbinom(n, 1, p_f)
llkhd_m <- sum(y_m) * log(theta) + (n - sum(y_m)) * log(1 - theta)
llkhd_f <- sum(y_f) * log(theta) + (n - sum(y_f)) * log(1 - theta)
plot(theta, llkhd_f, xlab = 'Probability', ylab = 'Likelihood', main = 'Grid search', type='n')
grid()
lines(theta, llkhd_m, col = 'hotpink1')
lines(theta, llkhd_f, col = 'dodgerblue1')
phat <- mean(y_m)
abline(v = phat, lty = 3, col = 'hotpink1')
arrows(phat - 2*sqrt(phat*(1-phat))/sqrt(n_m), -60,
phat + 2*sqrt(phat*(1-phat))/sqrt(n_m), -60,
code = 3, length = 0.1, angle = 90, col = 'hotpink1', lwd = 2)
phat <- mean(y_f)
abline(v = phat, lty = 3, col = 'dodgerblue1')
arrows(phat - 2*sqrt(phat*(1-phat))/sqrt(n_f), -80,
phat + 2*sqrt(phat*(1-phat))/sqrt(n_f), -80,
code = 3, length = 0.1, angle = 90, col = 'dodgerblue1', lwd = 2)
legend(x = .2, y = -2200, legend = c('Male', 'Female'), cex = 1.2, lty = 1,
col = c('hotpink1', 'dodgerblue1'), bty = 'n')They do not overlap, which means that the graduate admission probability of different genders are different.
HW exercise 6.
(Bonus) The data set contains inter-response times (in milliseconds) in the resting activity of a single neuron recorded from the spinal cord of a cat. Write a function to fit an exponential distribution to the data. More specifically, estimate the rate parameter of the exponential distribution using the maximum likelihood method.
Source: McGill, W.J. (1963). Luce, Bush, & Galanter, eds. Handbook of Mathematical Psychology.
\(L = L(y_1, y_2, ..., y_n; \theta) = \prod_{i=1}^n L(y_i; \theta) = \prod_{i=1}^n \theta e^{-\theta y}\)
\(LL = \sum_{i=1}^n log(y_i; \theta) = \theta ^n exp[-\theta \sum_{i=1}^n y_i]\)
\(\bar {y} = \frac{1}{n}\sum_{i=1}^n y_i\) is the maximum liklihood estimate of the rate parameter \(\theta\).
'data.frame': 404 obs. of 1 variable:
$ RT: int 4 2 4 4 5 5 5 5 5 5 ...
RT
Min. : 2.00
1st Qu.: 13.75
Median : 30.50
Mean : 39.57
3rd Qu.: 58.50
Max. :125.00
[1] 39.57426