title: “Homework 3” author: “Ava Moore” output: pdf_document: latex_engine: lualatex


install.packages(“readxl”)


A) Matrix Calculations

We computed the following:

\(X^T X\)

\[ X^T X = \begin{bmatrix} 40 & 2884 & 2960 \\ 2884 & 218048 & 212533 \\ 2960 & 212533 & 225782 \end{bmatrix} \]

\(X^T Y\)

\[ X^T Y = \begin{bmatrix} 110.89 \\ 8225.68 \\ 8409.77 \end{bmatrix} \]

\((X^T X)^{-1}\)

\[ (X^T X)^{-1} = \begin{bmatrix} 1.5065 & -0.0082 & -0.0120 \\ -0.0082 & 0.0001000 & 0.0000131 \\ -0.0120 & 0.0000131 & 0.0001500 \end{bmatrix} \]

Estimated Coefficients \(\hat{\beta}\)

Using the formula \(\hat{\beta} = (X^T X)^{-1} X^T Y\):

\[ \hat{\beta} = \begin{bmatrix} -1.5705 \\ 0.0257 \\ 0.0336 \end{bmatrix} \]


B) Interpretation:

C) P Values:

model <- lm(gpa ~ verbal + math, data = data)
summary(model)
## 
## Call:
## lm(formula = gpa ~ verbal + math, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10980 -0.20134  0.05034  0.22738  0.74313 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.570537   0.493749  -3.181  0.00297 ** 
## verbal       0.025732   0.004024   6.395 1.83e-07 ***
## math         0.033615   0.004928   6.822 4.90e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4023 on 37 degrees of freedom
## Multiple R-squared:  0.6811, Adjusted R-squared:  0.6638 
## F-statistic: 39.51 on 2 and 37 DF,  p-value: 6.585e-10

The code performs hypothesis testing to evaluate the significance of each parameter on GPA.

D) Estimated GPA for a student with verbal score 75 and math score 90

verbal_score <- 75
math_score <- 90


beta_0 <- -1.5705
beta_1 <- 0.0257   
beta_2 <- 0.0336   

# Predicted GPA
predicted_gpa <- beta_0 + beta_1 * verbal_score + beta_2 * math_score
predicted_gpa
## [1] 3.381

The estimated GPA for this student is 3.381

D) Confidence Interval:

The 95% confidence interval for each coefficient \(\hat{\beta}_i\) is calculated as:

\[ \hat{\beta}_i \pm t_{\alpha/2} \cdot \text{SE}(\hat{\beta}_i) \]

to determine

\[ t_{\alpha/2} \]

we use the quantile function which takes 0.975, corresponding to the cumulative probability of the upper or lower tail, and the degrees of freedom. Thus, the equations are as follows

Given: - Standard errors \(\text{SE}(\hat{\beta}_0) = 0.493748502\) \(\text{SE}(\hat{\beta}_1) = 0.004023571\) \(\text{SE}(\hat{\beta}_2) = 0.004927507\)

Thus,

\[ CI_{\hat{\beta}_0} = -1.57053654 \pm 2.026 \cdot 0.493748502 \]

Resulting in the interval:

\[ CI_{\hat{\beta}_0} = [-2.57096604, -0.57010705] \]

and

\[ CI_{\hat{\beta}_1} = 0.02573212 \pm 2.026 \cdot 0.004023571 \]

Resulting in the interval:

\[ CI_{\hat{\beta}_1} = [0.01757959, 0.03388465] \]

and

\[ CI_{\hat{\beta}_2} = 0.03361487 \pm 2.026 \cdot 0.004927507 \]

Resulting in the interval:

\[ CI_{\hat{\beta}_2} = [0.02363079, 0.04359895] \]

coef_summary <- summary(model)$coefficients
betas <- coef_summary[, 1]  
se_betas <- coef_summary[, 2]  


df <- model$df.residual


t_value <- qt(0.975, df)


lower_betas <- betas - t_value * se_betas
upper_betas <- betas + t_value * se_betas


data.frame(Estimate = betas, Lower_95_CI = lower_betas, Upper_95_CI = upper_betas)
##                Estimate Lower_95_CI Upper_95_CI
## (Intercept) -1.57053654 -2.57096604 -0.57010705
## verbal       0.02573212  0.01757959  0.03388465
## math         0.03361487  0.02363079  0.04359895
se_betas
## (Intercept)      verbal        math 
## 0.493748502 0.004023571 0.004927507
df
## [1] 37
new_student <- data.frame(verbal = 75, math = 90)


conf_interval <- predict(model, new_student, interval = "confidence", level = 0.95)


pred_interval <- predict(model, new_student, interval = "prediction", level = 0.95)

conf_interval
##        fit      lwr      upr
## 1 3.384711 3.176158 3.593264
pred_interval
##        fit      lwr      upr
## 1 3.384711 2.543365 4.226057

The results tell us that we are 95% certain that the mean GPA of all students who score 75 verbal and 90 math is between [3.175, 3.593]

On the other hand,we are 95% certain that any one student with these scores has a GPA between [2.543, 4.226]

A) Quadratic model

Code for calculating the quadratic model:

data$verbal2 <- data$verbal^2
data$math2 <- data$math^2
data$verbal_math <- data$verbal * data$math

quad_model <- lm(gpa ~ verbal + math + verbal2 + math2 + verbal_math, data = data)


summary(quad_model)
## 
## Call:
## lm(formula = gpa ~ verbal + math + verbal2 + math2 + verbal_math, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39203 -0.10935 -0.04432  0.09508  0.42601 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.9167631  1.3544134  -7.322 1.75e-08 ***
## verbal       0.1668098  0.0212447   7.852 3.85e-09 ***
## math         0.1375972  0.0267340   5.147 1.11e-05 ***
## verbal2     -0.0011082  0.0001173  -9.449 4.88e-11 ***
## math2       -0.0008433  0.0001594  -5.290 7.23e-06 ***
## verbal_math  0.0002411  0.0001440   1.675    0.103    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1871 on 34 degrees of freedom
## Multiple R-squared:  0.9366, Adjusted R-squared:  0.9272 
## F-statistic: 100.4 on 5 and 34 DF,  p-value: < 2.2e-16

We obtained a P value of 4.88e-11 for Verbal^2 We obtained a P value of 7.23e-06 for Math^2 We obtained a P value of 0.103 for VerbalxMath

These p values indicate that we overfit the model using VerbalxMath, because the p value is above .05. However, it indicates that Verbal^2 and Math^2 had significant effect in the model. Overall, this is reliable, but we would want to be careful that our model is generalizable at extreme points, and we would proably want to remove the x1x2 (verbal x math) term.

B) Partial Sum of Squares F test

We are testing the Hypothesis that the quadratic terms do not significantly improve the model, and that the linear model is significant.

model_linear <- lm(gpa ~ verbal + math, data = data)
model_quadratic <- lm(gpa ~ verbal + math + I(verbal^2) + I(math^2) + I(verbal * math), data = data)

rss_linear <- sum(residuals(model_linear)^2)
rss_quad <- sum(residuals(model_quadratic)^2)

df_linear <- df.residual(model_linear)
df_quad <- df.residual(model_quadratic)


numerator <- (rss_linear - rss_quad) / (df_linear - df_quad)
denominator <- rss_quad / df_quad
F_stat <- numerator / denominator


p_value <- pf(F_stat, df1 = df_linear - df_quad, df2 = df_quad, lower.tail = FALSE)


F_stat
## [1] 45.65476
p_value
## [1] 5.100844e-12

Result: F_stat 45.65476 p_value 5.100844e-12

Since the P value is very small (< .05), we reject the null hypothesis and conclude that

Our initial assumption was correct, as the F test indicates that the model improves with the addition of quadratic terms.