Setup:
knitr::opts_chunk$set(echo = TRUE)
library(alr4)
library(tidyverse)
library(readr)
1.1) Show that, in terms of the model parameters, \(\delta= c\beta_1\)
\(E(dheight \mid mheight = x+c) - E(dheight \mid mheight = x) = [\beta_0 + \beta_1(x+c)] - [\beta_0 + \beta_1 x] = c\beta_1\), so \(\delta = c\beta_1\).
1.2) Find a formula for the point estimator of \(\delta\) using the OLS estimate \(\widehat{\beta}_1\): \(\widehat{\delta}\) = \(c\widehat{\beta}_1\)
1.3) Find a formula for the variance of the estimator: \(\mbox{Var}(\widehat{\delta} \mid X) = \mbox{Var}(c\widehat{\beta}_1 \mid X) = c^2\,\mbox{Var}(\widehat{\beta}_1 \mid X) = \frac{\sigma^2c^2}{SXX}\)
1.4) Find a formula for the standard error of the estimator, \(se(\widehat{\delta} \mid X)\), by taking the square root of your answer from part (3) and substituting the estimate \(\widehat{\sigma}\) for the unknown parameter \(\sigma\).
\(se(\widehat{\delta} \mid X) = \sqrt{\frac{\widehat{\sigma}^2c^2}{SXX}} = \frac{|c|\,\widehat{\sigma}}{\sqrt{SXX}}\)
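1.5) Derive the formula to calculate a 95% confidence interval for \(\delta\).
Now show this confidence interval formula for \(\delta\) as a factor of the confidence interval
for \(\beta_1\). (That is, you may see
it as \(c\) times the formula for the confidence interval for \(\beta_1\).)
\(c\widehat{\beta}_1 \pm t(0.05/2,\, n-2)\,\sqrt{\frac{\widehat{\sigma}^2c^2}{SXX}} = c\left[\widehat{\beta}_1 \pm t(0.05/2,\, n-2)\,\frac{\widehat{\sigma}}{\sqrt{SXX}}\right]\) for \(c > 0\), where \(t(0.05/2,\, n-2)\) is the upper 2.5% point of the \(t\) distribution with \(n-2\) degrees of freedom. For large \(n\) this is close to the normal quantile \(z(0.05/2)\).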
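As a quick numerical check of the scaling relationship above, the following sketch uses simulated data to confirm that the interval for \(\delta\) is \(c\) times the interval that `confint()` reports for the slope. The sample size, coefficients, and variable names are hypothetical and are not taken from the Heights data. It uses the \(t\) quantile with \(n-2\) degrees of freedom, which is what `confint()` uses for a linear model.

```r
# Simulated data; every value below is an illustrative assumption
set.seed(1)
n <- 200
x <- rnorm(n, mean = 62, sd = 2.5)
y <- 30 + 0.5 * x + rnorm(n, sd = 2)
fit <- lm(y ~ x)

c_shift <- 5                                    # the shift c in the predictor
b1 <- coef(fit)[["x"]]                          # OLS slope estimate
se_b1 <- summary(fit)$coefficients["x", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))      # t quantile with n - 2 df

# Interval for delta = c * beta_1 built directly from the formula
ci_delta <- c_shift * b1 + c(-1, 1) * t_crit * c_shift * se_b1

# The same interval obtained by scaling confint() for the slope
ci_scaled <- c_shift * unname(confint(fit, "x", level = 0.95)[1, ])

ci_delta
ci_scaled
```

The two printed vectors should agree up to floating-point error, because multiplying the estimate and its standard error by the same positive constant rescales both endpoints.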
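# Mother and daughter heights from alr4::Heights; the shift is c = 5 inches
mheight <- Heights$mheight
dheight <- Heights$dheight
n <- length(mheight)
xbar <- mean(mheight)
ybar <- mean(dheight)
SXY <- sum((mheight - xbar) * (dheight - ybar))
SXX <- sum((mheight - xbar)^2)
B1 <- SXY/SXX
B0 <- ybar - B1*xbar
# Estimate sigma^2 by RSS/(n - 2), not by sd(mheight)
sigma2.hat <- sum((dheight - B0 - B1*mheight)^2)/(n - 2)
se.delta <- sqrt(sigma2.hat*5^2/SXX)
LB <- 5*B1 - qt(.975, df = n - 2) * se.delta
UB <- 5*B1 + qt(.975, df = n - 2) * se.delta
c(LB,UB)
95% Confidence Interval: approximately (2.45, 2.96)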
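1.7) We might wonder whether the difference in average
heights of daughters is also \(c\)
inches when their mothers' heights differ by \(c\) inches. Use an appropriate hypothesis
test to assess whether there is sufficient evidence to support this
claim.
\(H_0:\beta_1 = 1\) vs. \(H_1:\beta_1 \neq 1\). The claim \(\delta = c\) is equivalent to \(\beta_1 = 1\). With \(c = 5\), the 95% confidence interval for \(\delta = 5\beta_1\) does not contain 5 (equivalently, the interval for \(\beta_1\) does not contain 1), so we reject the null hypothesis at the 5% level: the data do not support the claim.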
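The same decision can be reached with an explicit \(t\) test of \(H_0:\beta_1 = 1\). The sketch below runs that test on simulated data, so the sample size, coefficients, and object names are hypothetical rather than the Heights values. It also checks that rejecting at the 5% level coincides with 1 falling outside the 95% confidence interval for the slope.

```r
# Simulated data; every value below is an illustrative assumption
set.seed(2)
n <- 200
x <- rnorm(n, mean = 62, sd = 2.5)
y <- 30 + 0.5 * x + rnorm(n, sd = 2)
fit <- lm(y ~ x)

b1 <- coef(fit)[["x"]]
se_b1 <- summary(fit)$coefficients["x", "Std. Error"]

# t statistic for H0: beta_1 = 1 (not the default H0: beta_1 = 0)
t_stat <- (b1 - 1) / se_b1

# Two-sided p-value with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

# Duality with the 95% confidence interval for the slope
ci_b1 <- unname(confint(fit, "x", level = 0.95)[1, ])
one_outside_ci <- (1 < ci_b1[1]) || (1 > ci_b1[2])

c(t = t_stat, p = p_value)
one_outside_ci
```

Note that `summary(fit)` reports the test of \(\beta_1 = 0\), so the statistic for \(\beta_1 = 1\) has to be formed by hand as above.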
This question relates to the salary data set, which
can be found in the alr4 package. Simply load the package and
access the salary data set.
This data set is about the salaries of faculty members in a small Midwestern college in the early 1980s.
This data frame contains the following columns:
# Keep only the male faculty (group_by(sex = "Male") would relabel every row, not subset)
males <- salary %>% filter(sex == "Male")
Use the statistical inference techniques you learned in class to answer the following questions.
2.1) Find a point estimate for average salary for male faculty members who have 10 years in their current rank. - Please interpret the point estimate for average salary in simple words.
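```r
# Regression of salary on years in current rank, male faculty only
males.fit <- lm(salary ~ year, data = salary, subset = sex == "Male")
# Estimated mean salary at year = 10
predict(males.fit, newdata = data.frame(year = 10))
```
The fitted value \(\widehat{\beta}_0 + 10\widehat{\beta}_1\) is a single-number estimate of the population mean salary of male faculty members with 10 years in their current rank.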
2.2) Find a 95% confidence interval for the average salary for male faculty who have 10 years in their current rank. - Please interpret the confidence interval in simple words.
```r
# Regression of salary on years in current rank, male faculty only
males.fit <- lm(salary ~ year, data = salary, subset = sex == "Male")
# 95% confidence interval for the mean salary at year = 10
predict(males.fit, newdata = data.frame(year = 10),
        interval = "confidence", level = 0.95)
```
Interpretation: We are 95% confident that the true average salary for the population of male faculty members who have 10 years in their current rank lies between the `lwr` and `upr` limits returned above.
2.3) Plot salary versus year for male faculty and then add the following.
- Add the estimated regression line to the plot.
- Add lines to the plot representing point-wise 95% confidence intervals for the average salary function. Also provide a formula explaining how these intervals (i.e., confidence intervals) are computed.
- You can do this by creating a grid of points using the seq function:
males.lm <- lm(salary ~ year, data = males)
yrs.grid = seq(from = min(males$year), to = max(males$year), length.out = 100)
mmod.cis = predict(males.lm, newdata = data.frame(year = yrs.grid), interval = "confidence")
mmod.pis = predict(males.lm, newdata = data.frame(year = yrs.grid),
interval = "prediction")
plot (x=males$year, y= males$salary, xlab = "Years", ylab = "Salary ($)", main = "Salary vs Year")
# confidence level
lines(yrs.grid, mmod.cis[, "fit"], lwd = 1.5)
lines(yrs.grid, mmod.cis[, "lwr"], col = "red")
lines(yrs.grid, mmod.cis[, "upr"], col = "red")
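One way to write the formula behind the red confidence limits, assuming the usual simple linear regression notation (\(\widehat{\sigma}^2 = RSS/(n-2)\), \(SXX = \sum (x_i - \bar{x})^2\), and \(t(0.025,\, n-2)\) the upper 2.5% point of the \(t\) distribution with \(n-2\) degrees of freedom), is, at each grid value \(x_*\):

\[
\widehat{\beta}_0 + \widehat{\beta}_1 x_* \;\pm\; t(0.025,\, n-2)\,\widehat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_* - \bar{x})^2}{SXX}}
\]

These are the `lwr` and `upr` columns that `predict()` returns with `interval = "confidence"` at its default 95% level. The term \((x_* - \bar{x})^2/SXX\) is why the band is narrowest at \(\bar{x}\) and widens toward the ends of the grid.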
2.4) Midwestern college has decided to hire a new faculty member, Michael Finch (a male), as an Assistant professor for the upcoming Spring semester. So, Michael literally has \(0\) years in his current rank. He wishes to predict his salary to decide on whether to accept this job offer.
Please help this new faculty member, Michael, to predict his salary in the following manner.
- Please find (1) a point estimate for Michael's salary and (2) an interval estimate for Michael's salary.
- Please interpret the point estimate and the interval estimate for the salary.
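# Summary statistics for the regression of salary on year (males data)
xbar <- mean(males$year)
ybar <- mean(males$salary)
n <- length(males$salary)
SXY <- sum((males$year - xbar) * (males$salary - ybar))
SXX <- sum((males$year - xbar)^2)
b1 <- SXY/SXX
b0 <- ybar - b1*xbar
#1 Point Estimate: fitted value at year = 0 (the intercept)
pe.salary <- b0 + b1*0
pe.salary
#2 Interval Estimate: 95% prediction interval for a new observation at year = 0
sigma.hat <- sqrt(sum((males$salary - b0 - b1*males$year)^2)/(n - 2))
sepred.Salary <- sigma.hat*sqrt(1 + (1/n) + ((0 - xbar)^2/SXX))
LL.Salary <- pe.salary - qt(.975, df = n - 2)*sepred.Salary
UL.Salary <- pe.salary + qt(.975, df = n - 2)*sepred.Salary
# interval estimate
c(LL.Salary,UL.Salary)
Point Estimate: `pe.salary`, the estimated intercept, is the predicted salary for a male faculty member with 0 years in his current rank.
Interval Estimate: we are 95% confident that Michael's actual salary will fall between `LL.Salary` and `UL.Salary`. This is a prediction interval for one new individual, not a confidence interval for the average salary.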
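For reference, the standard 95% prediction interval for a single new response at \(x_* = 0\) under the simple linear regression model can be written as follows. The notation is an assumption here: \(\widehat{\sigma}^2 = RSS/(n-2)\), \(SXX = \sum (x_i - \bar{x})^2\), and \(t(0.025,\, n-2)\) is the upper 2.5% point of the \(t\) distribution with \(n-2\) degrees of freedom.

\[
\widehat{\beta}_0 + \widehat{\beta}_1 \cdot 0 \;\pm\; t(0.025,\, n-2)\,\widehat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(0 - \bar{x})^2}{SXX}}
\]

The leading 1 under the square root is the variance of the new observation's own error term. It is what separates a prediction interval from a confidence interval for the mean, and it does not shrink as \(n\) grows.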
2.5) Now add lines to the
plot produced in part (3) representing point-wise 95% prediction
intervals for salary. You can still use grid of points used in part (3).
Also, you can use the sample R code we discussed in the class to create
these plots
plot (males$year, males$salary, xlab = "Years", ylab = "Salary ($)", main = "Salary vs Year")
## Adding prediction intervals
mmod.pis = predict(males.lm, newdata = data.frame(year = yrs.grid),
interval = "prediction")
lines(yrs.grid, mmod.cis[, "fit"], lwd = 1.5)
lines(yrs.grid, mmod.cis[, "lwr"], col = "red")
lines(yrs.grid, mmod.cis[, "upr"], col = "red")
lines(yrs.grid, mmod.pis[, "lwr"], col = "darkblue")
lines(yrs.grid, mmod.pis[, "upr"], col = "darkblue")
2.6) Briefly discuss the features of confidence
interval and prediction interval.
A confidence interval is an interval estimate that, with \(100(1-\alpha)\)% confidence, captures the
true mean response at a given value of x; in this case, the true average salary of male faculty
with a given number of years in their current rank. A prediction interval has a different target:
a single new observation of y at a specified value of x. The prediction
interval is always wider than the confidence interval at the same x because, in addition to the
uncertainty in estimating the mean, it must account for the variability of an individual
observation around that mean. Both intervals are narrowest at \(x = \bar{x}\) and widen as x moves away from it.
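The width comparison can be checked numerically. The sketch below fits a simple linear regression to simulated data and compares the widths of the two kinds of interval from `predict()` on a small grid. The data frame `sim`, its coefficients, and its noise level are hypothetical and are not the salary data.

```r
# Simulated data; every value below is an illustrative assumption
set.seed(3)
n <- 50
sim <- data.frame(year = sample(0:25, n, replace = TRUE))
sim$pay <- 18000 + 750 * sim$year + rnorm(n, sd = 3000)
fit <- lm(pay ~ year, data = sim)

grid <- data.frame(year = seq(0, 25, by = 5))
conf_int <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
pred_int <- predict(fit, newdata = grid, interval = "prediction", level = 0.95)

# Both intervals share the same centre (the fitted value) but differ in width
widths <- data.frame(
  year = grid$year,
  ci_width = conf_int[, "upr"] - conf_int[, "lwr"],
  pi_width = pred_int[, "upr"] - pred_int[, "lwr"]
)
widths
```

In every row `pi_width` should exceed `ci_width`, and both should be smallest for the grid value closest to the mean of `sim$year`.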
2.7) Michael was told as a part of his decision on accepting this job that the average salary for male faculty members who are new to Midwestern college is greater than 30,000 dollars per year. Literally, the new faculty members will have \(0\) years in their current rank. Use an appropriate hypothesis test to assess whether there is sufficient evidence to support this claim. You may use a p-value to support your answer.
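\(H_0 \; : \; E(Y \mid X = x_*) = \mu_{*0}=30000\) vs. \(H_1 \; : \; E(Y \mid X = x_*) > 30000\), with \(x_* = 0\).
# At x* = 0 the mean function equals the intercept, so test the intercept of males.lm
est <- coef(summary(males.lm))["(Intercept)", "Estimate"]
se.est <- coef(summary(males.lm))["(Intercept)", "Std. Error"]
# t statistic given the null is true
tval <- (est - 30000)/se.est
# one-sided (upper-tail) p-value with n - 2 degrees of freedom
pval2 <- pt(tval, df = df.residual(males.lm), lower.tail = FALSE)
pval2
The estimated intercept is below 30,000, so the t statistic is negative and the upper-tail p-value exceeds 0.5, well above \(\alpha = .05\). We therefore fail to reject the null hypothesis: the data give no evidence that the average salary of new male faculty members is greater than $30,000.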