304HW3.knit

title: “Homework 3” author: “Ava Moore” output: pdf_document: latex_engine: lualatex

install.packages(“readxl”)

A) Matrix Calculations

We computed the following:

\(X^T X\)

\[ X^T X = \begin{bmatrix} 40 & 2884 & 2960 \\ 2884 & 218048 & 212533 \\ 2960 & 212533 & 225782 \end{bmatrix} \]

\(X^T Y\)

\[ X^T Y = \begin{bmatrix} 110.89 \\ 8225.68 \\ 8409.77 \end{bmatrix} \]

\((X^T X)^{-1}\)

\[ (X^T X)^{-1} = \begin{bmatrix} 1.5065 & -0.0082 & -0.0120 \\ -0.0082 & 0.0001000 & 0.0000131 \\ -0.0120 & 0.0000131 & 0.0001500 \end{bmatrix} \]

Estimated Coefficients \(\hat{\beta}\)

Using the formula \(\hat{\beta} = (X^T X)^{-1} X^T Y\):

\[ \hat{\beta} = \begin{bmatrix} -1.5705 \\ 0.0257 \\ 0.0336 \end{bmatrix} \]

B) Interpretation:

Verbal Score Coefficient (\(\hat{\beta}_1 = 0.0257\)):
If math score doesn’t change (holding it constant), if a 1-point increase in the verbal test score occurs, we would expect an increase of approximately 0.0257 in GPA.
Math Score Coefficient (\(\hat{\beta}_2 = 0.0336\)):
If verbal score doesn’t change (holding it constant), and a 1-point increase in the math test score occurs, we would expect an increase of approximately 0.0336 in GPA.

C) P Values:

model <- lm(gpa ~ verbal + math, data = data)
summary(model)

## 
## Call:
## lm(formula = gpa ~ verbal + math, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10980 -0.20134  0.05034  0.22738  0.74313 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.570537   0.493749  -3.181  0.00297 ** 
## verbal       0.025732   0.004024   6.395 1.83e-07 ***
## math         0.033615   0.004928   6.822 4.90e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4023 on 37 degrees of freedom
## Multiple R-squared:  0.6811, Adjusted R-squared:  0.6638 
## F-statistic: 39.51 on 2 and 37 DF,  p-value: 6.585e-10

The code performs hypothesis testing to evaluate the significance of each parameter on GPA.

Intercept (β₀):
- Null Hypothesis (H₀): The intercept is equal to 0 (no intercept and no effect on GPA).
- Alternative Hypothesis (H₁): The intercept is not equal to 0.
- The p-value for the intercept is 0.00297, which is less than 0.05. Therefore, we reject the null hypothesis and conclude that if both math and verbal scores are equal to zero, the predicted GPA is not zero. Since the estimated value is negative, we conclude that the baseline GPA is less than zero.
Verbal Score (β₁):
- Null Hypothesis (H₀): Verbal test scores do not significantly affect GPA (β₁ = 0).
- Alternative Hypothesis (H₁): Verbal test scores do significantly affect GPA (β₁ ≠ 0).
- The p-value for the verbal coefficient is 1.83e-07, which is much smaller than 0.05. Therefore, we reject the null hypothesis and conclude that the verbal test score does, in fact, have a significant effect on the estimated GPA of a student.
Math Score (β₂):
- Null Hypothesis (H₀): Math test scores do not significantly affect GPA (β₂ = 0).
- Alternative Hypothesis (H₁): Math test scores significantly affect GPA (β₂ ≠ 0).
- The p-value for the math coefficient is 4.90e-08, which is also much smaller than 0.05. We reject the null hypothesis and conclude that math test scores also have a statistically significant effect on GPA.

D) Estimated GPA for a student with verbal score 75 and math score 90

verbal_score <- 75
math_score <- 90


beta_0 <- -1.5705
beta_1 <- 0.0257   
beta_2 <- 0.0336   

# Predicted GPA
predicted_gpa <- beta_0 + beta_1 * verbal_score + beta_2 * math_score
predicted_gpa

## [1] 3.381

The estimated GPA for this student is 3.381

D) Confidence Interval:

The 95% confidence interval for each coefficient \(\hat{\beta}_i\) is calculated as:

\[ \hat{\beta}_i \pm t_{\alpha/2} \cdot \text{SE}(\hat{\beta}_i) \]

to determine

\[ t_{\alpha/2} \]

we use the quantile function which takes 0.975, corresponding to the cumulative probability of the upper or lower tail, and the degrees of freedom. Thus, the equations are as follows

Given: - Standard errors \(\text{SE}(\hat{\beta}_0) = 0.493748502\) \(\text{SE}(\hat{\beta}_1) = 0.004023571\) \(\text{SE}(\hat{\beta}_2) = 0.004927507\)

Degrees of freedom (df): \(37\)
Critical t-value (\(t_{\alpha/2}\)) for a 95% confidence level:
- \(t_{\alpha/2} \approx 2.026\).

Thus,

\[ CI_{\hat{\beta}_0} = -1.57053654 \pm 2.026 \cdot 0.493748502 \]

Resulting in the interval:

\[ CI_{\hat{\beta}_0} = [-2.57096604, -0.57010705] \]

and

\[ CI_{\hat{\beta}_1} = 0.02573212 \pm 2.026 \cdot 0.004023571 \]

Resulting in the interval:

\[ CI_{\hat{\beta}_1} = [0.01757959, 0.03388465] \]

and

\[ CI_{\hat{\beta}_2} = 0.03361487 \pm 2.026 \cdot 0.004927507 \]

Resulting in the interval:

\[ CI_{\hat{\beta}_2} = [0.02363079, 0.04359895] \]

coef_summary <- summary(model)$coefficients
betas <- coef_summary[, 1]  
se_betas <- coef_summary[, 2]  


df <- model$df.residual


t_value <- qt(0.975, df)


lower_betas <- betas - t_value * se_betas
upper_betas <- betas + t_value * se_betas


data.frame(Estimate = betas, Lower_95_CI = lower_betas, Upper_95_CI = upper_betas)

##                Estimate Lower_95_CI Upper_95_CI
## (Intercept) -1.57053654 -2.57096604 -0.57010705
## verbal       0.02573212  0.01757959  0.03388465
## math         0.03361487  0.02363079  0.04359895

se_betas

## (Intercept)      verbal        math 
## 0.493748502 0.004023571 0.004927507

df

## [1] 37

new_student <- data.frame(verbal = 75, math = 90)


conf_interval <- predict(model, new_student, interval = "confidence", level = 0.95)


pred_interval <- predict(model, new_student, interval = "prediction", level = 0.95)

conf_interval

##        fit      lwr      upr
## 1 3.384711 3.176158 3.593264

pred_interval

##        fit      lwr      upr
## 1 3.384711 2.543365 4.226057

The results tell us that we are 95% certain that the mean GPA of all students who score 75 verbal and 90 math is between [3.175, 3.593]

On the other hand,we are 95% certain that any one student with these scores has a GPA between [2.543, 4.226]

A) Quadratic model

Code for calculating the quadratic model:

data$verbal2 <- data$verbal^2
data$math2 <- data$math^2
data$verbal_math <- data$verbal * data$math

quad_model <- lm(gpa ~ verbal + math + verbal2 + math2 + verbal_math, data = data)


summary(quad_model)

## 
## Call:
## lm(formula = gpa ~ verbal + math + verbal2 + math2 + verbal_math, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39203 -0.10935 -0.04432  0.09508  0.42601 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.9167631  1.3544134  -7.322 1.75e-08 ***
## verbal       0.1668098  0.0212447   7.852 3.85e-09 ***
## math         0.1375972  0.0267340   5.147 1.11e-05 ***
## verbal2     -0.0011082  0.0001173  -9.449 4.88e-11 ***
## math2       -0.0008433  0.0001594  -5.290 7.23e-06 ***
## verbal_math  0.0002411  0.0001440   1.675    0.103    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1871 on 34 degrees of freedom
## Multiple R-squared:  0.9366, Adjusted R-squared:  0.9272 
## F-statistic: 100.4 on 5 and 34 DF,  p-value: < 2.2e-16

We obtained a P value of 4.88e-11 for Verbal^2 We obtained a P value of 7.23e-06 for Math^2 We obtained a P value of 0.103 for VerbalxMath

These p values indicate that we overfit the model using VerbalxMath, because the p value is above .05. However, it indicates that Verbal^2 and Math^2 had significant effect in the model. Overall, this is reliable, but we would want to be careful that our model is generalizable at extreme points, and we would proably want to remove the x1x2 (verbal x math) term.

B) Partial Sum of Squares F test

We are testing the Hypothesis that the quadratic terms do not significantly improve the model, and that the linear model is significant.

model_linear <- lm(gpa ~ verbal + math, data = data)
model_quadratic <- lm(gpa ~ verbal + math + I(verbal^2) + I(math^2) + I(verbal * math), data = data)

rss_linear <- sum(residuals(model_linear)^2)
rss_quad <- sum(residuals(model_quadratic)^2)

df_linear <- df.residual(model_linear)
df_quad <- df.residual(model_quadratic)


numerator <- (rss_linear - rss_quad) / (df_linear - df_quad)
denominator <- rss_quad / df_quad
F_stat <- numerator / denominator


p_value <- pf(F_stat, df1 = df_linear - df_quad, df2 = df_quad, lower.tail = FALSE)


F_stat

## [1] 45.65476

p_value

## [1] 5.100844e-12

Result: F_stat 45.65476 p_value 5.100844e-12

Since the P value is very small (< .05), we reject the null hypothesis and conclude that

Our initial assumption was correct, as the F test indicates that the model improves with the addition of quadratic terms.