Setup:

knitr::opts_chunk$set(echo = TRUE)
library(alr4)
library(tidyverse)
library(readr)


Question 1

1.1) Show that, in terms of the model parameters, \(\delta= c\beta_1\)

\(\delta = E(dheight \mid mheight = x+c) - E(dheight \mid mheight = x) = [\beta_0 + \beta_1(x+c)] - [\beta_0 + \beta_1 x] = c\beta_1\)


1.2) Find a formula for a point estimator of \(\delta\) using the OLS estimate \(\widehat{\beta}_1\): \(\widehat{\delta} = c\widehat{\beta}_1\), where \(\widehat{\beta}_1 = SXY/SXX\).

1.3) Find a formula for the variance of the estimator: \(\mbox{Var}(\widehat{\delta} \mid X) = \mbox{Var}(c\widehat{\beta}_1 \mid X) = c^2\,\mbox{Var}(\widehat{\beta}_1 \mid X) = \frac{\sigma^2c^2}{SXX}\)
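
The slope variance used in 1.3 can be derived in two lines. This is a sketch, assuming the errors are uncorrelated with constant variance \(\sigma^2\) given \(X\); the weights \(k_i\) are a symbol introduced here only for illustration.

\[
\widehat{\beta}_1 = \frac{SXY}{SXX} = \sum_{i=1}^{n} k_i y_i, \qquad k_i = \frac{x_i - \bar{x}}{SXX}
\]

\[
\mbox{Var}(\widehat{\beta}_1 \mid X) = \sigma^2 \sum_{i=1}^{n} k_i^2 = \sigma^2\,\frac{SXX}{SXX^2} = \frac{\sigma^2}{SXX}
\quad\Longrightarrow\quad
\mbox{Var}(c\widehat{\beta}_1 \mid X) = \frac{c^2\sigma^2}{SXX}
\]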


1.4) Find a formula for the standard error of the estimator, \(se(\widehat{\delta} \mid X)\), by taking the square root of your answer from part (3) and substituting the estimate \(\widehat{\sigma}\) for the unknown parameter \(\sigma\).

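\(se(\widehat{\delta} \mid X) = \sqrt{\frac{\widehat{\sigma}^2c^2}{SXX}} = \frac{|c|\,\widehat{\sigma}}{\sqrt{SXX}} = |c|\,se(\widehat{\beta}_1 \mid X)\)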

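1.5) Derive the formula to calculate a 95% confidence interval for \(\delta\). Now show this confidence interval formula for \(\delta\) as a factor of the confidence interval for \(\beta_1\). (That is, you may see it as c times the formula for the confidence interval for \(\beta_1\).)

\(\widehat{\delta} \pm t(0.025, n-2)\,se(\widehat{\delta} \mid X) = c\widehat{\beta}_1 \pm t(0.025, n-2)\sqrt{\frac{\widehat{\sigma}^2c^2}{SXX}}\), where \(t(0.025, n-2)\) is the upper 2.5% point of the t distribution with \(n-2\) degrees of freedom.

For \(c > 0\) this equals \(c\left[\widehat{\beta}_1 \pm t(0.025, n-2)\,se(\widehat{\beta}_1 \mid X)\right]\), that is, \(c\) times the 95% confidence interval for \(\beta_1\).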


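1.6) Let us assume that mothers' heights differ by \(5\) inches, that is, \(c=5\) in all the above formulas. Use all of the above information to calculate a 95% confidence interval for \(\delta\) and write a sentence interpreting this interval in terms of mothers' and daughters' heights.

mheight <- Heights$mheight
dheight <- Heights$dheight
n <- length(mheight)

xbar <- mean(mheight)
ybar <- mean(dheight)
SXY <- sum((mheight - xbar) * (dheight - ybar))
SXX <- sum((mheight - xbar)^2)
B1 <- SXY / SXX
B0 <- ybar - B1 * xbar

# residual variance estimate: RSS / (n - 2)
sigma2.hat <- sum((dheight - B0 - B1 * mheight)^2) / (n - 2)

# standard error of delta-hat = 5 * B1, then the t-based 95% interval
se.delta <- sqrt(sigma2.hat * 5^2 / SXX)
LB <- 5 * B1 - qt(.975, df = n - 2) * se.delta
UB <- 5 * B1 + qt(.975, df = n - 2) * se.delta
round(c(LB, UB), 2)
## [1] 2.45 2.96

95% Confidence Interval: (2.45, 2.96). We are 95% confident that daughters whose mothers are 5 inches taller are, on average, between 2.45 and 2.96 inches taller.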
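1.7) We might wonder whether the difference in average heights of daughters whose mothers' heights differ by \(c\) inches is also \(c\) inches. Use an appropriate hypothesis test to assess whether there is sufficient evidence to support this claim.

\(H_0:\beta_1 = 1\) vs \(H_1:\beta_1 \neq 1\). Under \(H_0\), \(\delta = c\beta_1 = c\), so for \(c = 5\) the hypothesized value is \(\delta = 5\). Since 5 lies outside the 95% confidence interval for \(\delta\) from part 1.6, we reject the null hypothesis at the 5% level: the data do not support the claim that the daughters' average heights differ by the full \(c\) inches.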
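
The same decision can be reached with an explicit t test of \(H_0:\beta_1 = 1\). The following is a minimal sketch on simulated data; the data and the names `x`, `y`, `t.stat`, `p.val`, and `reject` are made up for illustration and are not the Heights data. It forms \(t = (\widehat{\beta}_1 - 1)/se(\widehat{\beta}_1)\) and checks that the two-sided test at the 5% level agrees with asking whether 1 falls inside the 95% confidence interval for \(\beta_1\).

```r
# Simulated data (hypothetical values, not the Heights data)
set.seed(1)
x <- rnorm(200, mean = 62, sd = 2.4)
y <- 30 + 0.54 * x + rnorm(200, sd = 2.3)

fit <- lm(y ~ x)
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]

# t statistic for H0: beta1 = 1 (summary() only reports the test of beta1 = 0)
t.stat <- (est - 1) / se
p.val  <- 2 * pt(abs(t.stat), df = df.residual(fit), lower.tail = FALSE)

# equivalent check: is 1 outside the 95% confidence interval for beta1?
ci <- confint(fit, "x", level = 0.95)
reject <- p.val < 0.05
outside <- (1 < ci[1]) || (1 > ci[2])

c(t.stat = t.stat, p.value = p.val)
```

With the real data, the same lines would be applied to `lm(dheight ~ mheight, data = Heights)`.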



Question 2

This question relates to the salary data set, which can be found in the alr4 package. Simply load the package and access the salary data set.

This data set contains the salaries of faculty members in a small Midwestern college in the early 1980s.

This data frame contains the following columns:

  • degree: Factor with levels “PhD” or “Masters”
  • rank: Factor, “Asst”, “Assoc” or “Prof”
  • sex: Factor, “Male” or “Female”
  • year: Years in current rank
  • ysdeg: Years since highest degree earned
  • salary: dollars per year
# keep only the rows for male faculty
males <- salary %>% filter(sex == "Male")

Use the statistical inference techniques you learned in class to answer the following questions.

2.1) Find a point estimate for average salary for male faculty members who have 10 years in their current rank. - Please interpret the point estimate for average salary in simple words.

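```r
# simple linear regression of salary on years in rank, male faculty only
males.lm <- lm(salary ~ year, data = filter(salary, sex == "Male"))

# estimated mean salary at year = 10
predict(males.lm, newdata = data.frame(year = 10))
```

The value returned by predict(), \(\widehat{\beta}_0 + 10\widehat{\beta}_1\), is a single-number estimate of the population mean salary of male faculty members with 10 years in their current rank.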

2.2) Find a 95% confidence interval for the average salary for male faculty who have 10 years in their current rank. - Please interpret the confidence interval in simple words.

```r
  males.lm <- lm(salary ~ year, data = filter(salary, sex == "Male"))
  # 95% confidence interval for the mean salary at year = 10
  predict(males.lm, newdata = data.frame(year = 10),
          interval = "confidence", level = 0.95)
```

Interpretation: We are 95% confident that the true average salary for the population of male faculty members who have 10 years in their current rank lies between the lwr and upr values returned by predict().
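
The confidence interval for a mean response can also be reproduced by hand. The following is a minimal sketch on simulated data; the numbers are made up for illustration and are not the salary data, and names such as `salary.sim` and `x.star` are hypothetical. It assumes the usual simple linear regression model and compares the hand-computed interval with the one from predict().

```r
# Simulated data (hypothetical values, not the salary data)
set.seed(42)
year <- sample(0:25, 40, replace = TRUE)
salary.sim <- 18000 + 750 * year + rnorm(40, sd = 4000)
fit <- lm(salary.sim ~ year)

x.star <- 10
n <- length(year)
SXX <- sum((year - mean(year))^2)
sigma.hat <- summary(fit)$sigma   # sqrt(RSS / (n - 2))

# fitted mean at x.star and its standard error
y.hat <- sum(coef(fit) * c(1, x.star))
se.fit <- sigma.hat * sqrt(1 / n + (x.star - mean(year))^2 / SXX)

# t-based 95% confidence interval for the mean response
by.hand <- y.hat + c(-1, 1) * qt(0.975, df = n - 2) * se.fit

# the same interval from predict()
from.predict <- predict(fit, newdata = data.frame(year = x.star),
                        interval = "confidence")
```

Here `by.hand` and the `lwr` and `upr` columns of `from.predict` agree up to floating-point error, which is a quick way to confirm that the formula was coded correctly.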


2.3) Plot salary versus year for male faculty and then add the following.

  • Add the estimated regression line to the plot.
  • Add lines to the plot representing point-wise 95% confidence intervals for the average salary function. Also provide a formula explaining how these intervals (i.e., confidence intervals) are computed.
  • You can do this by creating a grid of points using the seq function:

males.lm <- lm(salary ~ year, data = males)
yrs.grid <- seq(from = min(males$year), to = max(males$year), length.out = 100)

# point-wise 95% confidence intervals for the mean salary on the grid
mmod.cis <- predict(males.lm, newdata = data.frame(year = yrs.grid), interval = "confidence")

plot(x = males$year, y = males$salary, xlab = "Years", ylab = "Salary ($)", main = "Salary vs Year")

# fitted regression line (black) and confidence limits (red)
lines(yrs.grid, mmod.cis[, "fit"], lwd = 1.5)
lines(yrs.grid, mmod.cis[, "lwr"], col = "red")
lines(yrs.grid, mmod.cis[, "upr"], col = "red")
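
The question also asks how these intervals are computed. Assuming the usual simple linear regression model with constant error variance, which is the model that lm() and predict(..., interval = "confidence") work with here, the point-wise 95% confidence interval at a grid value \(x\) is

\[
\widehat{y}(x) \pm t(0.025, n-2)\,\mbox{sefit}(x), \qquad
\widehat{y}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x, \qquad
\mbox{sefit}(x) = \widehat{\sigma}\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{SXX}}
\]

where \(n\) is the number of faculty members used in the fit, \(\bar{x}\) is their mean number of years in rank, \(SXX = \sum (x_i - \bar{x})^2\), and \(\widehat{\sigma}^2 = RSS/(n-2)\). The term \((x - \bar{x})^2\) is why the red band is narrowest near \(\bar{x}\) and widens toward the ends of the grid.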

2.4) Midwestern college has decided to hire a new faculty member, Michael Finch (a male), as an Assistant professor for the upcoming Spring semester. So, Michael literally has \(0\) years in his current rank. He wishes to predict his salary to decide on whether to accept this job offer.

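Please help this new faculty member, Michael, to predict his salary in the following manner.
- Please find (1) a point estimate for Michael's salary and (2) an interval estimate for Michael's salary.
- Please interpret the point estimate and interval estimate for the salary.

males <- salary %>% filter(sex == "Male")

xbar <- mean(males$year)
ybar <- mean(males$salary)
n <- length(males$salary)

SXY <- sum((males$year - xbar) * (males$salary - ybar))
SXX <- sum((males$year - xbar)^2)

b1 <- SXY / SXX
b0 <- ybar - b1 * xbar

# 1. Point estimate: fitted salary at year = 0
pe.salary <- b0 + b1 * 0
pe.salary

# 2. Interval estimate: 95% prediction interval for one new salary at year = 0
sigma.hat <- sqrt(sum((males$salary - b0 - b1 * males$year)^2) / (n - 2))
se.pred <- sigma.hat * sqrt(1 + (1 / n) + ((0 - xbar)^2 / SXX))
LL.Salary <- pe.salary - qt(.975, df = n - 2) * se.pred
UL.Salary <- pe.salary + qt(.975, df = n - 2) * se.pred

# interval estimate
c(LL.Salary, UL.Salary)

Point estimate: pe.salary is the fitted salary for a male faculty member with 0 years in his current rank, and it is our single best guess for Michael's salary. Interval estimate: we are 95% confident that Michael's salary will fall between LL.Salary and UL.Salary. This is a prediction interval for one new observation, so it is wider than the confidence interval for the mean salary at year = 0.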
2.5) Now add lines to the plot produced in part (3) representing point-wise 95% prediction intervals for salary. You can still use the grid of points used in part (3). Also, you can use the sample R code we discussed in class to create these plots.

plot(males$year, males$salary, xlab = "Years", ylab = "Salary ($)", main = "Salary vs Year")

# point-wise 95% prediction intervals on the same grid
mmod.pis <- predict(males.lm, newdata = data.frame(year = yrs.grid),
                    interval = "prediction")

# fitted line (black), confidence limits (red), prediction limits (dark blue)
lines(yrs.grid, mmod.cis[, "fit"], lwd = 1.5)
lines(yrs.grid, mmod.cis[, "lwr"], col = "red")
lines(yrs.grid, mmod.cis[, "upr"], col = "red")
lines(yrs.grid, mmod.pis[, "lwr"], col = "darkblue")
lines(yrs.grid, mmod.pis[, "upr"], col = "darkblue")

2.6) Briefly discuss the features of confidence interval and prediction interval.
A confidence interval is an interval estimate that, with \(100(1-\alpha)\%\) confidence, captures the true mean response \(E(Y \mid X = x)\); here, that is the true average salary of male faculty with a given number of years in their current rank. A prediction interval instead targets a single new observation of \(y\) at a specified value of \(x\). The prediction interval is always wider than the confidence interval at the same \(x\), because predicting one new salary carries the uncertainty in the estimated mean plus the variability \(\sigma^2\) of an individual salary around that mean. Both intervals are narrowest at \(x = \bar{x}\) and widen as \(x\) moves away from it.
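
These two features can be checked numerically. The following is a minimal sketch on simulated data; the values and the names `cis`, `pis`, `ci.width`, and `pi.width` are made up for illustration and are not the salary data. It compares the widths of the two interval types over a grid and locates where the confidence interval is narrowest.

```r
# Simulated data (hypothetical values, not the salary data)
set.seed(7)
x <- runif(50, min = 0, max = 25)
y <- 18000 + 750 * x + rnorm(50, sd = 4000)
fit <- lm(y ~ x)

grid <- data.frame(x = seq(0, 25, length.out = 100))
cis <- predict(fit, newdata = grid, interval = "confidence")
pis <- predict(fit, newdata = grid, interval = "prediction")

ci.width <- cis[, "upr"] - cis[, "lwr"]
pi.width <- pis[, "upr"] - pis[, "lwr"]

# the prediction interval is wider at every grid point
all(pi.width > ci.width)

# the confidence interval is narrowest at the grid point nearest mean(x)
narrowest <- grid$x[which.min(ci.width)]
c(narrowest = narrowest, xbar = mean(x))
```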

2.7) Michael was told, as part of his decision on accepting this job, that the average salary for male faculty members who are new to Midwestern college is greater than 30,000 dollars per year. Literally, the new faculty members will have \(0\) years in their current rank. Use an appropriate hypothesis test to assess whether there is sufficient evidence to support this claim. You may use a p-value to support your answer.

\(H_0 \; : \; E(Y \mid X = x_*) = \mu_{*0} = 30000\) vs. \(H_1 \; : \; E(Y \mid X = x_*) > 30000\), where \(x_* = 0\) and \(\mu_{*0} = 30000\) is the mean salary under the null hypothesis.

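males.lm <- lm(salary ~ year, data = filter(salary, sex == "Male"))

# fitted mean salary at year = 0 and its standard error
fit0 <- predict(males.lm, newdata = data.frame(year = 0), se.fit = TRUE)

# t statistic for H0: E(Y | X = 0) = 30000
t.stat <- (fit0$fit - 30000) / fit0$se.fit

# one-sided p-value for H1: E(Y | X = 0) > 30000, with n - 2 degrees of freedom
pval2 <- pt(t.stat, df = fit0$df, lower.tail = FALSE)
pval2

Because the fitted mean salary at year = 0 is below 30,000 dollars, the t statistic is negative and the one-sided p-value is greater than 0.5, far above \(\alpha = .05\). We therefore fail to reject the null hypothesis: there is not sufficient evidence that the average salary of new male faculty members exceeds 30,000 dollars.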