DataM: Homework Exercise 0427 - Functions

library(dplyr)
library(ggplot2)

HW exercise 1.

Use the data in the high schools example to solve the following problems:

  • Question (a): test whether any pair of the five variables (read, write, math, science, and socst) differs in mean.
  • Question (b): test whether the four ethnic groups have the same mean score on each of the five variables (individually): read, write, math, science, and socst.
  • Question (c): perform all pairwise simple regressions among these variables: read, write, math, science, and socst.

The data set has the following columns:

  • Column 1: Student ID
  • Column 2: Gender, M = Male, F = Female
  • Column 3: Race
  • Column 4: Socioeconomic status
  • Column 5: School type
  • Column 6: Program type
  • Column 7: Reading score
  • Column 8: Writing score
  • Column 9: Math score
  • Column 10: Science score
  • Column 11: Social science studies score


Load the data set and check its structure.

dta1 <- read.table('../data/high_schools.txt', header = TRUE)
head(dta1)
str(dta1)
'data.frame':   200 obs. of  11 variables:
 $ id     : int  70 121 86 141 172 113 50 11 84 48 ...
 $ female : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
 $ race   : Factor w/ 4 levels "african-amer",..: 4 4 4 4 4 4 1 3 4 1 ...
 $ ses    : Factor w/ 3 levels "high","low","middle": 2 3 1 1 3 3 3 3 3 3 ...
 $ schtyp : Factor w/ 2 levels "private","public": 2 2 2 2 2 2 2 2 2 2 ...
 $ prog   : Factor w/ 3 levels "academic","general",..: 2 3 2 3 1 1 2 1 2 1 ...
 $ read   : int  57 68 44 63 47 44 50 34 63 57 ...
 $ write  : int  52 59 33 44 52 52 59 46 57 55 ...
 $ math   : int  41 53 54 47 57 51 42 45 54 52 ...
 $ science: int  47 63 58 53 53 63 53 39 58 NA ...
 $ socst  : int  57 61 31 56 61 61 61 36 51 51 ...

Turn the data set into long form by stacking the five score variables.

dta1_long <- dta1 %>% reshape2::melt(., measure.vars = colnames(.)[7:11])
str(dta1_long)
'data.frame':   1000 obs. of  8 variables:
 $ id      : int  70 121 86 141 172 113 50 11 84 48 ...
 $ female  : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
 $ race    : Factor w/ 4 levels "african-amer",..: 4 4 4 4 4 4 1 3 4 1 ...
 $ ses     : Factor w/ 3 levels "high","low","middle": 2 3 1 1 3 3 3 3 3 3 ...
 $ schtyp  : Factor w/ 2 levels "private","public": 2 2 2 2 2 2 2 2 2 2 ...
 $ prog    : Factor w/ 3 levels "academic","general",..: 2 3 2 3 1 1 2 1 2 1 ...
 $ variable: Factor w/ 5 levels "read","write",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value   : int  57 68 44 63 47 44 50 34 63 57 ...
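
For reference, the same reshape can be done with tidyr instead of reshape2 (a sketch, assuming tidyr >= 1.0 is available; pivot_longer returns variable as character, so it is re-leveled here):

dta1_long2 <- dta1 %>%
  tidyr::pivot_longer(cols = read:socst,    # stack the five score columns
                      names_to = 'variable', values_to = 'value') %>%
  mutate(variable = factor(variable,
                           levels = c('read', 'write', 'math', 'science', 'socst')))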


Question (a): test whether any pair of the five variables (read, write, math, science, and socst) differs in mean.

Data visualization (point plot with error bar)

library(ggplot2)
dta1_long %>% group_by(variable) %>%
  summarise(group_MEAN = mean(value, na.rm = TRUE),
            group_SE = sd(value, na.rm = TRUE) / sqrt(sum(!is.na(value)))) %>%
  ggplot(mapping = aes(x = variable, y = group_MEAN)) +
  geom_point(aes(color = variable), size = 4) +
  geom_errorbar(aes(ymax = group_MEAN + 1.96*group_SE,              
                    ymin = group_MEAN - 1.96*group_SE), width=.3, size=.4) +
  xlab('Variable') + ylab('Mean score') + theme_bw() +
  theme(legend.position = 'none')

It seems that there is no obvious difference among these five measurements.

One-way repeated-measures ANOVA and Tukey HSD

model <- aov(value ~ variable + Error(id / variable), data = dta1_long)
summary(model)

Error: id
         Df Sum Sq Mean Sq
variable  1   4264    4264

Error: id:variable
         Df Sum Sq Mean Sq
variable  4  51.03   12.76

Error: Within
           Df Sum Sq Mean Sq F value Pr(>F)
variable    4    322   80.59   0.851  0.493
Residuals 985  93271   94.69               
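
Note that id is stored as an integer, so the Error() term above does not define subject strata as intended (hence the odd degrees of freedom in the id stratum). Converting it to a factor gives a proper repeated-measures fit; a sketch, whose output would differ from the above:

model_rm <- aov(value ~ variable + Error(factor(id)/variable), data = dta1_long)
summary(model_rm)

Because TukeyHSD() does not accept multi-stratum (aovlist) fits, the post-hoc test below is run on a plain between-subjects aov.
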
TukeyHSD(aov(value ~ variable, data = dta1_long))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = value ~ variable, data = dta1_long)

$variable
                    diff       lwr      upr     p adj
write-read     0.5450000 -2.171434 3.261434 0.9821514
math-read      0.4150000 -2.301434 3.131434 0.9936400
science-read  -0.3120513 -3.045842 2.421740 0.9979459
socst-read     0.1750000 -2.541434 2.891434 0.9997850
math-write    -0.1300000 -2.846434 2.586434 0.9999341
science-write -0.8570513 -3.590842 1.876740 0.9124627
socst-write   -0.3700000 -3.086434 2.346434 0.9959143
science-math  -0.7270513 -3.460842 2.006740 0.9503030
socst-math    -0.2400000 -2.956434 2.476434 0.9992492
socst-science  0.4870513 -2.246740 3.220842 0.9885731

The ANOVA showed no significant difference among the five measurements, and Tukey HSD likewise found no pair of the five variables that differ in mean.


Question (b): test whether the four ethnic groups have the same mean score on each of the five variables (individually): read, write, math, science, and socst.

Data visualization (faceted point plot with error bar)

dta1_long %>% group_by(race, variable) %>%
  summarise(group_MEAN = mean(value, na.rm = TRUE),
            group_SE = sd(value, na.rm = TRUE) / sqrt(sum(!is.na(value)))) %>%
  ggplot(mapping = aes(x = race, y = group_MEAN)) +
  facet_wrap(. ~ variable, nrow = 1) + 
  geom_point(aes(color = race), size = 2) +
  geom_errorbar(aes(ymax = group_MEAN + 1.96*group_SE,              
                    ymin = group_MEAN - 1.96*group_SE), width=.2, size=.35) +
  xlab('Variable') + ylab('Mean score') + theme_bw() +
  theme(legend.position = 'none',
        axis.text.x = element_text(angle = 80, margin = margin(t = 10)))

There appears to be a clear difference among ethnic groups in read, write, math, and science, but little or none in socst.

One-way ANOVA and Tukey HSD for each measurement

f <- function(df) {
  # one-way ANOVA of score on ethnic group, followed by Tukey HSD
  model <- aov(value ~ race, data = df)
  return(list(summary(model), TukeyHSD(model)))
}

results <- lapply(split(dta1_long, dta1_long$variable), f)

(1) read

results$read
[[1]]
             Df Sum Sq Mean Sq F value   Pr(>F)    
race          3   1750   583.3   5.964 0.000654 ***
Residuals   196  19170    97.8                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[[2]]
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = value ~ race, data = df)

$race
                            diff        lwr       upr     p adj
asian-african-amer     5.1090909  -4.510360 14.728542 0.5157500
hispanic-african-amer -0.1333333  -7.891989  7.625323 0.9999682
white-african-amer     7.1241379   1.011569 13.236706 0.0150717
hispanic-asian        -5.2424242 -14.573094  4.088246 0.4662523
white-asian            2.0150470  -5.999200 10.029294 0.9149192
white-hispanic         7.2574713   1.610254 12.904689 0.0056791

There is a significant difference among the four ethnic groups on read (\(p < .001\)). Tukey HSD further showed that:

  • White participants have higher read scores than African American participants do (\(p < .05\)).
  • White participants have higher read scores than Hispanic participants do (\(p < .01\)).


(2) write

results$write
[[1]]
             Df Sum Sq Mean Sq F value   Pr(>F)    
race          3   1914   638.1   7.833 5.78e-05 ***
Residuals   196  15965    81.5                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[[2]]
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = value ~ race, data = df)

$race
                            diff         lwr       upr     p adj
asian-african-amer      9.800000   1.0214200 18.578580 0.0219249
hispanic-african-amer  -1.741667  -8.8221106  5.338777 0.9197990
white-african-amer      5.855172   0.2769254 11.433419 0.0355640
hispanic-asian        -11.541667 -20.0567093 -3.026624 0.0030738
white-asian            -3.944828 -11.2585205  3.368865 0.5023337
white-hispanic          7.596839   2.4432653 12.750413 0.0010226

There is a significant difference among the four ethnic groups on write (\(p < .001\)). Tukey HSD further showed that:

  • Asian participants have higher write scores than African American participants do (\(p < .05\)).
  • White participants have higher write scores than African American participants do (\(p < .05\)).
  • Hispanic participants have lower write scores than Asian participants do (\(p < .01\)).
  • White participants have higher write scores than Hispanic participants do (\(p < .01\)).


(3) math

results$math
[[1]]
             Df Sum Sq Mean Sq F value   Pr(>F)    
race          3   1842   614.0   7.703 6.84e-05 ***
Residuals   196  15624    79.7                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[[2]]
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = value ~ race, data = df)

$race
                            diff        lwr       upr     p adj
asian-african-amer    10.5227273   1.838424 19.207030 0.0104479
hispanic-african-amer  0.6666667  -6.337737  7.671071 0.9947149
white-african-amer     7.2224138   1.704074 12.740754 0.0046342
hispanic-asian        -9.8560606 -18.279657 -1.432465 0.0145446
white-asian           -3.3003135 -10.535462  3.934835 0.6389480
white-hispanic         6.5557471   1.457520 11.653975 0.0056430

There is a significant difference among the four ethnic groups on math (\(p < .001\)). Tukey HSD further showed that:

  • Asian participants have higher math scores than African American participants do (\(p < .05\)).
  • White participants have higher math scores than African American participants do (\(p < .01\)).
  • Hispanic participants have lower math scores than Asian participants do (\(p < .05\)).
  • White participants have higher math scores than Hispanic participants do (\(p < .01\)).


(4) science

results$science
[[1]]
             Df Sum Sq Mean Sq F value  Pr(>F)    
race          3   3170  1056.5   13.06 8.5e-08 ***
Residuals   191  15447    80.9                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
5 observations deleted due to missingness

[[2]]
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = value ~ race, data = df)

$race
                           diff        lwr       upr     p adj
asian-african-amer     9.033493   0.202789 17.864197 0.0428018
hispanic-african-amer  3.796339  -3.429558 11.022236 0.5249642
white-african-amer    11.726835   6.033066 17.420603 0.0000016
hispanic-asian        -5.237154 -13.781662  3.307353 0.3875623
white-asian            2.693342  -4.601452  9.988136 0.7739897
white-hispanic         7.930496   2.691577 13.169415 0.0006985

There is a significant difference among the four ethnic groups on science (\(p < .001\)). Tukey HSD further showed that:

  • Asian participants have higher science scores than African American participants do (\(p < .05\)).
  • White participants have higher science scores than African American participants do (\(p < .001\)).
  • White participants have higher science scores than Hispanic participants do (\(p < .001\)).


(5) socst

results$socst
[[1]]
             Df Sum Sq Mean Sq F value Pr(>F)  
race          3    944   314.6   2.804  0.041 *
Residuals   196  21992   112.2                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[[2]]
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = value ~ race, data = df)

$race
                           diff         lwr       upr     p adj
asian-african-amer     1.550000  -8.7533662 11.853366 0.9798366
hispanic-african-amer -1.658333  -9.9686075  6.651941 0.9549249
white-african-amer     4.232759  -2.3143961 10.779913 0.3395595
hispanic-asian        -3.208333 -13.2023873  6.785721 0.8392806
white-asian            2.682759  -5.9012785 11.266796 0.8497705
white-hispanic         5.891092  -0.1576264 11.939810 0.0593904

There is a significant difference among the four ethnic groups on socst, though only barely (\(p < .05\)). Tukey HSD showed that no pairwise comparison is significant (the closest, white-hispanic, has \(p = .059\)).


Question (c): perform all pairwise simple regressions among these variables: read, write, math, science, and socst.

Four simple regression models with read as y and (1) write, (2) math, (3) science, and (4) socst as x

dta1 %>% dplyr::select(write, math, science, socst) %>%
  lapply(., function(x) lm(dta1$read ~ x)) %>%
  lapply(., broom::tidy)
$write
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   18.2      3.31        5.49 1.21e- 7
2 x              0.646    0.0617     10.5  1.11e-20

$math
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   14.1      3.12        4.52 1.08e- 5
2 x              0.725    0.0583     12.4  1.28e-26

$science
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   18.2      3.10        5.87 1.91e- 8
2 x              0.654    0.0586     11.2  1.22e-22

$socst
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   21.1      2.84        7.43 3.22e-12
2 x              0.594    0.0532     11.2  9.29e-23

The slopes of all four models are significant (\(p < .001\)).

Three simple regression models with write as y and (1) math, (2) science, and (3) socst as x

dta1 %>% dplyr::select(math, science, socst) %>%
  lapply(., function(x) lm(dta1$write ~ x)) %>%
  lapply(., broom::tidy)
$math
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   19.9      3.02        6.58 4.20e-10
2 x              0.625    0.0566     11.0  2.09e-22

$science
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   24.4      3.03        8.03 9.27e-14
2 x              0.546    0.0574      9.50 8.15e-18

$socst
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   24.8      2.67        9.28 3.11e-17
2 x              0.534    0.0500     10.7  2.45e-21

The slopes of all three models are significant (\(p < .001\)).

Two simple regression models with math as y and (1) science and (2) socst as x

dta1 %>% dplyr::select(science, socst) %>%
  lapply(., function(x) lm(dta1$math ~ x)) %>%
  lapply(., broom::tidy)
$science
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   21.4      2.84        7.54 1.76e-12
2 x              0.600    0.0537     11.2  1.06e-22

$socst
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   27.7      2.78        9.97 3.09e-19
2 x              0.475    0.0520      9.13 7.83e-17

The slopes of both models are significant (\(p < .001\)).

Simple regression model with science as y and socst as x

with(dta1, lm(science ~ socst)) %>% broom::tidy()

The slope of the model is significant (\(p < .001\)).
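
For reference, all ten pairwise regressions can also be generated in one pass; a minimal sketch (object names are my own):

vars  <- c('read', 'write', 'math', 'science', 'socst')
pairs <- combn(vars, 2, simplify = FALSE)      # every unordered pair
fits  <- lapply(pairs, function(p)
  broom::tidy(lm(reformulate(p[2], response = p[1]), data = dta1)))
names(fits) <- sapply(pairs, paste, collapse = ' ~ ')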


HW exercise 2.

The formula \(P = L \cdot \frac{r}{1-(1+r)^{-M}}\) gives the payment \(P\) you have to make per month for \(M\) months if you take out a loan of amount \(L\) today at a monthly interest rate \(r\). Compute the monthly payment for 10, 15, 20, 25, or 30 years if you borrow NT$5,000,000, NT$10,000,000, or NT$15,000,000 from a bank that charges a monthly interest rate of 2%, 5%, or 7%.

Define the function

# monthly payment P for a loan of L at monthly rate r over M months
interest <- function(L, r, M) {
  P <- L * (r / (1 - (1 + r)^(-M)))
  return(P)
}

Compute the monthly payments

data.frame(L = rep(c(5000000, 10000000, 15000000), each = 15),
           r = rep(c(2, 5, 7) / 100, each = 5, 3),
           M = rep(c(10, 15, 20, 25, 30) * 12, 9)) %>%
  mutate(P = round(mapply(interest, L, r, M)))
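
Since the formula is vectorized, the same grid can also be built with expand.grid and P computed without mapply (a sketch; the row order differs from the above):

expand.grid(M = c(10, 15, 20, 25, 30) * 12,
            r = c(2, 5, 7) / 100,
            L = c(5000000, 10000000, 15000000)) %>%
  mutate(P = round(interest(L, r, M)))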


HW exercise 3.

The following R script is an attempt to demonstrate the correspondence between parameter estimation by the least-squares method and by the maximum likelihood method for the case of simple linear regression with a constant normal error term.

  1. Construct a function from the script so that any deviance value for pairs of parameter estimates can be found.
  2. Generalize the function further so that it will work with any data sets that can be modeled by a simple linear regression with a constant normal error term.
##
m0 <- lm(weight ~ height, data=women)
##
summary(m0)

Call:
lm(formula = weight ~ height, data = women)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7333 -1.1333 -0.3833  0.7417  3.1167 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
height        3.45000    0.09114   37.85 1.09e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared:  0.991, Adjusted R-squared:  0.9903 
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14
##
param <- c(coef(m0)[1], coef(m0)[2])
##
a <- param[1]
##
b <- param[2]
##
yhat <- a + b*women$height
##
e <- summary(m0)$sigma
##
lkhd <- dnorm(women$weight, mean=yhat, sd=e)
##
dvnc <- -2 * sum(log(lkhd))
##
ci_a <- coef(m0)[1] + unlist(summary(m0))$coefficients3*c(-2,2)
##
ci_b <- coef(m0)[2] + unlist(summary(m0))$coefficients4*c(-2,2)
##
bb <- expand.grid(a=seq(ci_a[1], ci_a[2], len=50),
                  b=seq(ci_b[1], ci_b[2], len=50))

Part 1

dvnc_fun <- function(x) {
  # deviance for a parameter pair x = c(a, b), with the error SD taken
  # from the least-squares fit, following the script above
  m0 <- lm(weight ~ height, data = women)
  yhat <- x[1] + x[2] * women$height
  e <- summary(m0)$sigma
  -2 * sum(dnorm(women$weight, mean = yhat, sd = e, log = TRUE))
}
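
At the least-squares estimates this reproduces the script's dvnc, e.g.:

dvnc_fun(coef(lm(weight ~ height, data = women)))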

Part 2

Create the function

lm_with_CI <- function(m0) {
  print(summary(m0))
  # approximate 95% CIs: estimate +/- 1.96 SE (coefficients3/4 are the
  # two standard errors in the flattened summary)
  ci_a <- coef(m0)[1] + unlist(summary(m0))$coefficients3*c(-1.96, 1.96)
  ci_b <- coef(m0)[2] + unlist(summary(m0))$coefficients4*c(-1.96, 1.96)
  print(paste0('95% CI of the intercept: [',  min(ci_a), ', ', max(ci_a), ']'))
  print(paste0('95% CI of the slope: [',  min(ci_b), ', ', max(ci_b), ']'))
}

Use the function

lm_with_CI(lm(weight ~ height, data = women))

Call:
lm(formula = weight ~ height, data = women)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7333 -1.1333 -0.3833  0.7417  3.1167 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
height        3.45000    0.09114   37.85 1.09e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared:  0.991, Adjusted R-squared:  0.9903 
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

[1] "95% CI of the intercept: [-99.1530770025164, -75.8802563308165]"
[1] "95% CI of the slope: [3.27137246888624, 3.62862753111375]"
lm_with_CI(lm(weight ~ Time, data = ChickWeight))

Call:
lm(formula = weight ~ Time, data = ChickWeight)

Residuals:
     Min       1Q   Median       3Q      Max 
-138.331  -14.536    0.926   13.533  160.669 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  27.4674     3.0365   9.046   <2e-16 ***
Time          8.8030     0.2397  36.725   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 38.91 on 576 degrees of freedom
Multiple R-squared:  0.7007,    Adjusted R-squared:  0.7002 
F-statistic:  1349 on 1 and 576 DF,  p-value: < 2.2e-16

[1] "95% CI of the intercept: [21.5159560282947, 33.4188942714666]"
[1] "95% CI of the slope: [8.33322725058112, 9.27285128480826]"


HW exercise 4 (c-stat example).

Modify the following R script to create a function that computes the c-statistic, illustrated with the data set from the article: Tryon, W. W. (1982). A simplified time-series analysis for evaluating treatment interventions. Journal of Applied Behavior Analysis, 15(3), 423-429.

Load and check the dataset

# read in data 
dta <- read.table("../data/cstat.txt", header=T)
str(dta)
'data.frame':   42 obs. of  1 variable:
 $ nc: int  28 46 39 45 24 20 35 37 36 40 ...
head(dta)

Plot Figure 1

  1. Modify the axis settings to make the plot more similar to the original figure.
  2. Use lapply and a user-defined function to draw the three segments at once.
plot(1:42, dta[,1], type = 'n', xaxt='n', yaxt ='n', xaxs='i', yaxs='i', # 1
     xlab="Observations", ylab="Number of Children")
lines(1:42, dta[,1])
abline(v=10, lty=2)
abline(v=32, lty=2)

# 1
axis(1, at = seq(0, 40, by=5), labels = c(NA, seq(5, 40, by=5)))
axis(2, at = seq(10, 45, by=5), labels = c(NA, seq(15, 45, by=5)), las = 1)

# 2
#segments(1, mean(dta[1:10,1]),10, mean(dta[1:10,1]),col="red")
#segments(11, mean(dta[11:32,1]),32, mean(dta[11:32,1]),col="red")
#segments(33, mean(dta[33:42,1]),42, mean(dta[33:42,1]),col="red")
lapply(as.data.frame(matrix(c(1, 10, 11, 32, 33, 42), 2, 3)),
       function(x) segments(x[1], mean(dta[x[1]:x[2], 1]),
                            x[2], mean(dta[x[1]:x[2], 1]), col = 'red'))

Calculate the c-stat for the first baseline phase

The original script

cden <- 1-(sum(diff(dta[1:10,1])^2)/(2*(10-1)*var(dta[1:10,1])))
sc <- sqrt((10-2)/((10-1)*(10+1)))
pval <- 1-pnorm(cden/sc)
pval
[1] 0.2866238

The modified script

f_cstat <- function(x) {
  # C statistic: 1 - sum of squared successive differences / (2 (n-1) var)
  n <- length(x)
  cden <- 1 - (sum(diff(x)^2)/(2*(n-1)*var(x)))
  return(cden)
}

f_sc <- function(x) {
  # standard error of the C statistic
  n <- length(x)
  sc <- sqrt((n-2) / ((n-1) * (n+1)))
  return(sc)
}

f_pval <- function(x) {
  # one-sided p value for z = C / SE
  cden <- f_cstat(x)
  sc <- f_sc(x)
  return(1 - pnorm(cden / sc))
}

f_cstat(dta[1:10, 1])
[1] 0.1601208
f_sc(dta[1:10, 1])
[1] 0.2842676
f_pval(dta[1:10, 1])
[1] 0.2866238

Calculate the c-stat for the first baseline plus group tokens

The original script

n <- 32
cden <- 1-(sum(diff(dta[1:n,1])^2)/(2*(n-1)*var(dta[1:n,1])))
sc <- sqrt((n-2)/((n-1)*(n+1)))
pval <- 1-pnorm(cden/sc)
list(z=cden/sc,pvalue=pval)
$z
[1] 3.879054

$pvalue
[1] 5.243167e-05

The modified script

F_cstat <- function(x) {
  # C statistic, its standard error, z, and one-sided p value in one call
  n <- length(x)
  cden <- 1 - (sum(diff(x)^2)/(2*(n-1)*var(x)))
  sc <- sqrt((n-2)/((n-1)*(n+1)))
  pval <- 1-pnorm(cden/sc)
  return(list(c.stat = cden, sc = sc, z = cden/sc, p.value = pval))
}

F_cstat(dta[1:32, 1])
$c.stat
[1] 0.6642762

$sc
[1] 0.1712469

$z
[1] 3.879054

$p.value
[1] 5.243167e-05
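
The same function applies to any phase or combination of phases, e.g. the final ten observations: F_cstat(dta[33:42, 1]).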


HW exercise 5.

Plot the likelihood function to estimate the probability of graduate admission by gender, respectively, for Dept A in UCBAdmissions{datasets}. Construct approximate 95%-CI for each gender. Do they overlap?

\(LL = \sum_{y_i = 1} \log(\theta) + \sum_{y_i = 0} \log(1-\theta)\)

set.seed(1234)
theta <- seq(0.01, 0.99, by = .01)

n_m <- sum(UCBAdmissions[,1,1])
n_f <- sum(UCBAdmissions[,2,1])
p_m <- UCBAdmissions[1,1,1] / n_m
p_f <- UCBAdmissions[1,2,1] / n_f
y_m <- rbinom(n_m, 1, p_m)
y_f <- rbinom(n_f, 1, p_f)
llkhd_m <- sum(y_m) * log(theta) + (n_m - sum(y_m)) * log(1 - theta)
llkhd_f <- sum(y_f) * log(theta) + (n_f - sum(y_f)) * log(1 - theta)

plot(theta, llkhd_f, xlab = 'Probability', ylab = 'Log-likelihood', main = 'Grid search', type='n')
grid()
lines(theta, llkhd_m, col = 'hotpink1')
lines(theta, llkhd_f, col = 'dodgerblue1')

phat <- mean(y_m)
abline(v = phat, lty = 3, col = 'hotpink1')
arrows(phat - 2*sqrt(phat*(1-phat))/sqrt(n_m), -60,
       phat + 2*sqrt(phat*(1-phat))/sqrt(n_m), -60,
       code = 3, length = 0.1, angle = 90, col = 'hotpink1', lwd = 2)

phat <- mean(y_f)
abline(v = phat, lty = 3, col = 'dodgerblue1')
arrows(phat - 2*sqrt(phat*(1-phat))/sqrt(n_f), -80,
       phat + 2*sqrt(phat*(1-phat))/sqrt(n_f), -80,
       code = 3, length = 0.1, angle = 90, col = 'dodgerblue1', lwd = 2)

legend(x = .2, y = -2200, legend = c('Male', 'Female'), cex = 1.2, lty = 1,
       col = c('hotpink1', 'dodgerblue1'), bty = 'n')

The two intervals do not overlap, which suggests that the graduate admission probabilities of the two genders differ in Dept A.
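
Note that the simulation step is not strictly necessary: the log-likelihood depends on the data only through the number of admits, which can be read off the table directly (a sketch using the same theta grid):

k_m <- UCBAdmissions['Admitted', 'Male', 'A']     # males admitted to Dept A
k_f <- UCBAdmissions['Admitted', 'Female', 'A']   # females admitted to Dept A
llkhd_m <- k_m * log(theta) + (n_m - k_m) * log(1 - theta)
llkhd_f <- k_f * log(theta) + (n_f - k_f) * log(1 - theta)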


HW exercise 6.

(Bonus) The data set contains inter-response times (in milliseconds) in the resting activity of a single neuron recorded from the spinal cord of a cat. Write a function to fit an exponential distribution to the data. More specifically, estimate the rate parameter of the exponential distribution using the maximum likelihood method.

Source: McGill, W. J. (1963). In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of Mathematical Psychology.

\(L = L(y_1, y_2, ..., y_n; \theta) = \prod_{i=1}^n L(y_i; \theta) = \prod_{i=1}^n \theta e^{-\theta y_i}\)

\(LL = \sum_{i=1}^n \log L(y_i; \theta) = n \log(\theta) - \theta \sum_{i=1}^n y_i\)

Setting \(dLL/d\theta = n/\theta - \sum_{i=1}^n y_i = 0\) shows that the maximum likelihood estimate of the rate parameter \(\theta\) is \(\hat{\theta} = 1/\bar{y}\), where \(\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i\) is the sample mean.

dta6 <- read.table('../data/data_hw0427_6.txt', header = TRUE)
str(dta6)
'data.frame':   404 obs. of  1 variable:
 $ RT: int  4 2 4 4 5 5 5 5 5 5 ...
summary(dta6)
       RT        
 Min.   :  2.00  
 1st Qu.: 13.75  
 Median : 30.50  
 Mean   : 39.57  
 3rd Qu.: 58.50  
 Max.   :125.00  
mean(dta6$RT)
[1] 39.57426
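
A minimal fitting function, maximizing the exponential log-likelihood numerically and comparing the result against the closed-form estimate \(1/\bar{y}\) (the function name and search interval are my own choices):

fit_exp <- function(y) {
  # negative log-likelihood of Exp(theta): -(n log(theta) - theta sum(y))
  nll <- function(theta) -(length(y) * log(theta) - theta * sum(y))
  opt <- optimize(nll, interval = c(1e-6, 1))
  list(theta_mle = opt$minimum, closed_form = 1 / mean(y))
}

fit_exp(dta6$RT)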

Jay Liao

2020-05-11