title: “Homework 3” author: “Ava Moore” output: pdf_document: latex_engine: lualatex
install.packages(“readxl”)
We computed the following:
\[ X^T X = \begin{bmatrix} 40 & 2884 & 2960 \\ 2884 & 218048 & 212533 \\ 2960 & 212533 & 225782 \end{bmatrix} \]
\[ X^T Y = \begin{bmatrix} 110.89 \\ 8225.68 \\ 8409.77 \end{bmatrix} \]
\[ (X^T X)^{-1} = \begin{bmatrix} 1.5065 & -0.0082 & -0.0120 \\ -0.0082 & 0.0001000 & 0.0000131 \\ -0.0120 & 0.0000131 & 0.0001500 \end{bmatrix} \]
Using the formula \(\hat{\beta} = (X^T X)^{-1} X^T Y\):
\[ \hat{\beta} = \begin{bmatrix} -1.5705 \\ 0.0257 \\ 0.0336 \end{bmatrix} \]
Verbal Score Coefficient (\(\hat{\beta}_1 = 0.0257\)):
If math score doesn’t change (holding it constant), if a 1-point
increase in the verbal test score occurs, we would expect an
increase of approximately 0.0257 in GPA.
Math Score Coefficient (\(\hat{\beta}_2 = 0.0336\)):
If verbal score doesn’t change (holding it constant), and a
1-point increase in the math test score occurs, we
would expect an increase of approximately 0.0336 in
GPA.
model <- lm(gpa ~ verbal + math, data = data)
summary(model)
##
## Call:
## lm(formula = gpa ~ verbal + math, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10980 -0.20134 0.05034 0.22738 0.74313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.570537 0.493749 -3.181 0.00297 **
## verbal 0.025732 0.004024 6.395 1.83e-07 ***
## math 0.033615 0.004928 6.822 4.90e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4023 on 37 degrees of freedom
## Multiple R-squared: 0.6811, Adjusted R-squared: 0.6638
## F-statistic: 39.51 on 2 and 37 DF, p-value: 6.585e-10
The code performs hypothesis testing to evaluate the significance of each parameter on GPA.
Intercept (β₀):
Verbal Score (β₁):
Math Score (β₂):
verbal_score <- 75
math_score <- 90
beta_0 <- -1.5705
beta_1 <- 0.0257
beta_2 <- 0.0336
# Predicted GPA
predicted_gpa <- beta_0 + beta_1 * verbal_score + beta_2 * math_score
predicted_gpa
## [1] 3.381
The estimated GPA for this student is 3.381
The 95% confidence interval for each coefficient \(\hat{\beta}_i\) is calculated as:
\[ \hat{\beta}_i \pm t_{\alpha/2} \cdot \text{SE}(\hat{\beta}_i) \]
to determine
\[ t_{\alpha/2} \]
we use the quantile function which takes 0.975, corresponding to the cumulative probability of the upper or lower tail, and the degrees of freedom. Thus, the equations are as follows
Given: - Standard errors \(\text{SE}(\hat{\beta}_0) = 0.493748502\) \(\text{SE}(\hat{\beta}_1) = 0.004023571\) \(\text{SE}(\hat{\beta}_2) = 0.004927507\)
Degrees of freedom (df): \(37\)
Critical t-value (\(t_{\alpha/2}\)) for a 95% confidence level:
Thus,
\[ CI_{\hat{\beta}_0} = -1.57053654 \pm 2.026 \cdot 0.493748502 \]
Resulting in the interval:
\[ CI_{\hat{\beta}_0} = [-2.57096604, -0.57010705] \]
and
\[ CI_{\hat{\beta}_1} = 0.02573212 \pm 2.026 \cdot 0.004023571 \]
Resulting in the interval:
\[ CI_{\hat{\beta}_1} = [0.01757959, 0.03388465] \]
and
\[ CI_{\hat{\beta}_2} = 0.03361487 \pm 2.026 \cdot 0.004927507 \]
Resulting in the interval:
\[ CI_{\hat{\beta}_2} = [0.02363079, 0.04359895] \]
coef_summary <- summary(model)$coefficients
betas <- coef_summary[, 1]
se_betas <- coef_summary[, 2]
df <- model$df.residual
t_value <- qt(0.975, df)
lower_betas <- betas - t_value * se_betas
upper_betas <- betas + t_value * se_betas
data.frame(Estimate = betas, Lower_95_CI = lower_betas, Upper_95_CI = upper_betas)
## Estimate Lower_95_CI Upper_95_CI
## (Intercept) -1.57053654 -2.57096604 -0.57010705
## verbal 0.02573212 0.01757959 0.03388465
## math 0.03361487 0.02363079 0.04359895
se_betas
## (Intercept) verbal math
## 0.493748502 0.004023571 0.004927507
df
## [1] 37
new_student <- data.frame(verbal = 75, math = 90)
conf_interval <- predict(model, new_student, interval = "confidence", level = 0.95)
pred_interval <- predict(model, new_student, interval = "prediction", level = 0.95)
conf_interval
## fit lwr upr
## 1 3.384711 3.176158 3.593264
pred_interval
## fit lwr upr
## 1 3.384711 2.543365 4.226057
The results tell us that we are 95% certain that the mean GPA of all students who score 75 verbal and 90 math is between [3.175, 3.593]
On the other hand,we are 95% certain that any one student with these scores has a GPA between [2.543, 4.226]
Code for calculating the quadratic model:
data$verbal2 <- data$verbal^2
data$math2 <- data$math^2
data$verbal_math <- data$verbal * data$math
quad_model <- lm(gpa ~ verbal + math + verbal2 + math2 + verbal_math, data = data)
summary(quad_model)
##
## Call:
## lm(formula = gpa ~ verbal + math + verbal2 + math2 + verbal_math,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39203 -0.10935 -0.04432 0.09508 0.42601
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.9167631 1.3544134 -7.322 1.75e-08 ***
## verbal 0.1668098 0.0212447 7.852 3.85e-09 ***
## math 0.1375972 0.0267340 5.147 1.11e-05 ***
## verbal2 -0.0011082 0.0001173 -9.449 4.88e-11 ***
## math2 -0.0008433 0.0001594 -5.290 7.23e-06 ***
## verbal_math 0.0002411 0.0001440 1.675 0.103
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1871 on 34 degrees of freedom
## Multiple R-squared: 0.9366, Adjusted R-squared: 0.9272
## F-statistic: 100.4 on 5 and 34 DF, p-value: < 2.2e-16
We obtained a P value of 4.88e-11 for Verbal^2 We obtained a P value of 7.23e-06 for Math^2 We obtained a P value of 0.103 for VerbalxMath
These p values indicate that we overfit the model using VerbalxMath, because the p value is above .05. However, it indicates that Verbal^2 and Math^2 had significant effect in the model. Overall, this is reliable, but we would want to be careful that our model is generalizable at extreme points, and we would proably want to remove the x1x2 (verbal x math) term.
We are testing the Hypothesis that the quadratic terms do not significantly improve the model, and that the linear model is significant.
model_linear <- lm(gpa ~ verbal + math, data = data)
model_quadratic <- lm(gpa ~ verbal + math + I(verbal^2) + I(math^2) + I(verbal * math), data = data)
rss_linear <- sum(residuals(model_linear)^2)
rss_quad <- sum(residuals(model_quadratic)^2)
df_linear <- df.residual(model_linear)
df_quad <- df.residual(model_quadratic)
numerator <- (rss_linear - rss_quad) / (df_linear - df_quad)
denominator <- rss_quad / df_quad
F_stat <- numerator / denominator
p_value <- pf(F_stat, df1 = df_linear - df_quad, df2 = df_quad, lower.tail = FALSE)
F_stat
## [1] 45.65476
p_value
## [1] 5.100844e-12
Result: F_stat 45.65476 p_value 5.100844e-12
Since the P value is very small (< .05), we reject the null hypothesis and conclude that
Our initial assumption was correct, as the F test indicates that the model improves with the addition of quadratic terms.