Response variable: Y = alumni_giving_rate; predictors: X1 = percent_of_classes_under_20, X2 = student_faculty_ratio.
Import the data from https://bgreenwell.github.io/uc-bana7052/data/alumni.csv and load the required libraries.
url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)
str(alumni)
attach(alumni)
library(tidyverse)
library(investr)
library(here)
a) What is the final estimated model?
The MLR model with normal errors is given by \(Y_i = \beta_0 + \sum_{j=1}^{p-1} \beta_j X_{ij} + \epsilon_i\), \(i = 1, 2, \ldots, n\).
alumni_fit <- lm(alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio, data = alumni)
coef(alumni_fit)
(Intercept) percent_of_classes_under_20 student_faculty_ratio
39.6555835 0.1661686 -1.7021103
From the coefficients result above, we can estimate that the final model is \(\hat{Y} = 39.66 + 0.166 X_1 - 1.702 X_2\).
b) What is the predicted alumni_giving_rate for
an observation with percent_of_classes_under_20 = 50 and
student_faculty_ratio = 10?
X1 <- 50
X2 <- 10
Y <- 39.6555835 + 0.1661686 * X1 - 1.7021103 * X2
print(paste("The predicted alumni giving rate is",Y))
[1] "The predicted alumni giving rate is 30.9429105"
The predicted giving rate is approximately 30.94%.
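As a quick check, the same prediction can be obtained directly from the fitted model with predict(); this is a minimal sketch that assumes the alumni_fit object from part a) is still in the workspace.
new_obs <- data.frame(percent_of_classes_under_20 = 50, student_faculty_ratio = 10)
predict(alumni_fit, newdata = new_obs)  # should return approximately 30.94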
c) Test the statistical significance of the regression coefficient using t-tests; use α=0.05. Obtain the t-statistics and p-values, interpret the results, make a conclusion (i.e. reject or not reject), and explain why.
summary(alumni_fit)
Call:
lm(formula = alumni_giving_rate ~ percent_of_classes_under_20 +
student_faculty_ratio, data = alumni)
Residuals:
Min 1Q Median 3Q Max
-15.00 -6.57 -1.95 4.42 24.56
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.6556 13.5076 2.936 0.005225 **
percent_of_classes_under_20 0.1662 0.1626 1.022 0.312128
student_faculty_ratio -1.7021 0.4421 -3.850 0.000371 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.098 on 45 degrees of freedom
Multiple R-squared: 0.5613, Adjusted R-squared: 0.5418
F-statistic: 28.79 on 2 and 45 DF, p-value: 8.869e-09
To assess significance with t-tests, we use a marginal (partial) t-test for each coefficient. From the summary above, the t-statistics are 1.022 for \(\beta_1\) and -3.850 for \(\beta_2\). For each predictor \(j\), the null hypothesis is \(H_0: \beta_j = 0\) (no significant linear relationship between the response and that predictor, given the other predictor), and the alternative is \(H_1: \beta_j \neq 0\) (a significant linear relationship exists). We then compare each p-value with \(\alpha = 0.05\).
The p-value for percent_of_classes_under_20 is 0.312, which is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis for \(\beta_1\): there is no significant linear relationship between alumni_giving_rate and percent_of_classes_under_20 after controlling for student_faculty_ratio. The p-value for student_faculty_ratio is 0.00037, which is less than 0.05, so we reject the null hypothesis for \(\beta_2\): there is a significant linear relationship between alumni_giving_rate and student_faculty_ratio.
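For reference, the t-statistics and p-values used above can also be pulled programmatically from the coefficient table; a small sketch assuming the same fitted object:
coef(summary(alumni_fit))[, c("t value", "Pr(>|t|)")]  # t-statistics and p-values for all terms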
d) What is the F statistic? Is it significant? Clearly write out the null hypothesis, F-statistic, and p-value and interpret the test results.
summary(alumni_fit)
Call:
lm(formula = alumni_giving_rate ~ percent_of_classes_under_20 +
student_faculty_ratio, data = alumni)
Residuals:
Min 1Q Median 3Q Max
-15.00 -6.57 -1.95 4.42 24.56
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.6556 13.5076 2.936 0.005225 **
percent_of_classes_under_20 0.1662 0.1626 1.022 0.312128
student_faculty_ratio -1.7021 0.4421 -3.850 0.000371 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.098 on 45 degrees of freedom
Multiple R-squared: 0.5613, Adjusted R-squared: 0.5418
F-statistic: 28.79 on 2 and 45 DF, p-value: 8.869e-09
# Manually construct the F-test by comparing the full model with the intercept-only (reduced) model
alumni_fit_reduced <- lm(alumni_giving_rate ~ 1, data = alumni)
anova(alumni_fit_reduced, alumni_fit)
Analysis of Variance Table
Model 1: alumni_giving_rate ~ 1
Model 2: alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio
Res.Df RSS Df Sum of Sq F Pr(>F)
1 47 8491.5
2 45 3724.9 2 4766.6 28.793 8.869e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
To carry out the overall F-test, the null hypothesis is \(H_0: \beta_1 = \beta_2 = 0\) (neither predictor has a linear relationship with the response), and the alternative is \(H_1: \beta_j \neq 0\) for at least one \(j \in \{1, 2\}\), i.e., at least one of \(\beta_1\) and \(\beta_2\) is nonzero, so the predictors jointly explain a significant amount of the variation in the response.
From the summary output and the manual F-test, the estimated coefficients are \(\hat\beta_1 = 0.1662\) and \(\hat\beta_2 = -1.7021\), and the F statistic is 28.793 on 2 and 45 degrees of freedom with a p-value of \(8.869 \times 10^{-9}\), far below the 0.05 threshold. Therefore we reject the null hypothesis: taken together, the predictors have a significant linear relationship with the response alumni_giving_rate.
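To make the F-statistic formula explicit, it can be reconstructed by hand from the residual sums of squares reported in the ANOVA table above (values copied from that output); this is an illustrative check, not additional output from the original analysis.
sse_reduced <- 8491.5  # RSS of the intercept-only model (47 df)
sse_full <- 3724.9     # RSS of the two-predictor model (45 df)
f_stat <- ((sse_reduced - sse_full) / (47 - 45)) / (sse_full / 45)
f_stat  # approximately 28.79
pf(f_stat, df1 = 2, df2 = 45, lower.tail = FALSE)  # p-value, approximately 8.87e-09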
Turning to the individual estimates: with all else held constant, each one-percentage-point increase in percent_of_classes_under_20 is associated with an expected increase of approximately 0.166 in alumni_giving_rate, and each one-unit increase in student_faculty_ratio is associated with an expected decrease of approximately 1.702.
e) What is the value of the coefficient of determination? Please interpret.
summary(alumni_fit)$adj.r.squared
[1] 0.5418446
The summary output reports two R-squared values. Because the ordinary \(R^2\) can only increase as more terms are added to the model, I report the adjusted R-squared: \(R^2_{adj} = 0.5418\). Roughly speaking, about 54.18% of the variability in alumni_giving_rate is explained by the model.
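The two R-squared values are linked by the usual adjustment for the number of estimated parameters; the short sketch below (using n = 48 observations and p = 3 estimated coefficients, consistent with the 45 residual degrees of freedom above) recovers the adjusted value from the ordinary one.
r2 <- summary(alumni_fit)$r.squared               # 0.5613
n <- nrow(alumni); p <- length(coef(alumni_fit))  # n = 48, p = 3
1 - (1 - r2) * (n - 1) / (n - p)                  # matches summary(alumni_fit)$adj.r.squared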
f) What is the correlation coefficient \(r_1\) between
percent_of_classes_under_20 and
alumni_giving_rate and the correlation coefficient \(r_2\) between
student_faculty_ratio and alumni_giving_rate?
Do you see any relationship between \(r_1\), \(r_2\), and \(R^2\)?
R2 <- summary(alumni_fit)$adj.r.squared
r1 <- cor(alumni$percent_of_classes_under_20, alumni$alumni_giving_rate)
r2 <- cor(alumni$student_faculty_ratio, alumni$alumni_giving_rate)
print(paste("The correlation coefficient r_1 = ", r1))
[1] "The correlation coefficient r_1 = 0.645650419081598"
print(paste("The correlation coefficient r_2 = ", r2))
[1] "The correlation coefficient r_2 = -0.742397463273406"
print(paste("The coefficient of determination R^2 = ", R2))
[1] "The coefficient of determination R^2 = 0.541844592524699"
In simple linear regression we have \(r^2 = R^2\); in multiple linear regression this one-to-one relationship breaks down. \(R^2\) reflects the combined contribution of both predictors, and \(r_1^2\) and \(r_2^2\) do not simply add up to \(R^2\) because the predictors overlap in the information they carry about the response.
There is still a clear pattern among these values. \(r_1\) is positive and moderately strong, suggesting that as the percentage of classes with fewer than 20 students increases, the giving rate also tends to increase. \(r_2\) is negative and stronger in magnitude than \(r_1\), indicating that as the student-to-faculty ratio increases, the giving rate tends to decrease. The \(R^2\) value is moderate, indicating that roughly 54% of the variance in the giving rate is explained by the model.
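A quick numeric comparison (using r1, r2, and the unadjusted R-squared from the fit above) makes this concrete; note that \(r_1^2 + r_2^2\) need not equal \(R^2\) because the two predictors share information.
c(r1_squared = r1^2, r2_squared = r2^2,
  multiple_R2 = summary(alumni_fit)$r.squared)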
a) Generate N = 1000 observations from the model \(Y \sim 5X_1 - 2X_2 + \epsilon\), where \(X_1 \sim N(\mu = 2, \sigma = 0.1)\), \(X_2 \sim N(\mu = 0, \sigma = 0.4)\), and \(\epsilon \sim N(\mu = 0, \sigma = 0.5)\).
set.seed(7052)                            # for reproducibility
x1 <- rnorm(1000, mean = 2, sd = 0.1)     # predictor X1
x2 <- rnorm(1000, mean = 0, sd = 0.4)     # predictor X2
error <- rnorm(1000, mean = 0, sd = 0.5)  # error term
y <- 10 + 5 * x1 - 2 * x2 + error         # response
mdata <- data.frame(x1, x2, y)            # assemble the simulated data set
head(mdata, n = 3)
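As an optional sanity check, the sample moments of the simulated columns should be close to the parameters specified above.
round(c(mean_x1 = mean(x1), sd_x1 = sd(x1),
        mean_x2 = mean(x2), sd_x2 = sd(x2),
        mean_err = mean(error), sd_err = sd(error)), 3)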
b) Fit a multiple linear regression to the simulated data from part a. What is the estimated prediction equation? Report the estimated coefficients and their standard errors. Are they significant? Clearly write out the null and alternative hypotheses, observed t-statistic(s), p-value(s), and interpret the estimates and test results. What is fitted model’s MSE?
mdata_fit <- lm(y ~ x1 + x2, data = mdata)
summary(mdata_fit)
Call:
lm(formula = y ~ x1 + x2, data = mdata)
Residuals:
Min 1Q Median 3Q Max
-1.57953 -0.34033 -0.00096 0.31540 1.63513
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.26397 0.30604 33.54 <2e-16 ***
x1 4.87776 0.15252 31.98 <2e-16 ***
x2 -2.03892 0.03836 -53.16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4843 on 997 degrees of freedom
Multiple R-squared: 0.7913, Adjusted R-squared: 0.7909
F-statistic: 1890 on 2 and 997 DF, p-value: < 2.2e-16
mse <- mean(mdata_fit$residuals^2)
print(paste("MSE:", mse))
[1] "MSE: 0.233881211623243"
I used the lm() function to fit the multiple linear regression. From the summary output, the estimated coefficients are \(\hat\beta_1 = 4.878\) and \(\hat\beta_2 = -2.039\). Based on the full model \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\), the estimated prediction equation is \(\hat{y} = 10.26 + 4.878X_1 - 2.039X_2\).
The estimated coefficients and standard errors are 4.878 (SE 0.153) for x1 and -2.039 (SE 0.038) for x2.
Because the model has multiple predictors, we test each coefficient marginally: the null hypothesis is \(H_0: \beta_j = 0\) (no significant linear relationship between the response and that predictor) and the alternative is \(H_1: \beta_j \neq 0\) (a significant linear relationship exists). The marginal t-statistic and p-value for each predictor are reported in the summary above.
From the results, the t-statistic for x1 is 31.98 with a p-value below \(2 \times 10^{-16}\), far below the 0.05 threshold, so we reject the null hypothesis for x1: there is a significant linear relationship between y and x1. Likewise, the t-statistic for x2 is -53.16 with a p-value below \(2 \times 10^{-16}\), so we also reject the null hypothesis for x2. The overall F statistic is 1890 with a p-value below \(2.2 \times 10^{-16}\), so we reject the overall null hypothesis as well: the two predictors jointly have a significant linear relationship with y.
x1 has a positive linear relationship with y (as x1 increases, y tends to increase), while x2 has a negative linear relationship with y (as x2 increases, y tends to decrease).
The MSE is about 0.234, indicating a reasonably good fit of the model to the data.
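One note on the MSE: mean(residuals^2) divides by n, whereas the textbook MSE divides by the residual degrees of freedom, n - p. The sketch below (assuming mdata_fit from above) shows both; the second version equals the squared residual standard error, 0.4843^2.
n_sim <- nrow(mdata); p_sim <- length(coef(mdata_fit))
c(mse_div_n = mean(residuals(mdata_fit)^2),                           # divides by n = 1000
  mse_div_n_minus_p = sum(residuals(mdata_fit)^2) / (n_sim - p_sim))  # divides by n - p = 997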
c) Repeat part b), but re-simulate the data and change the error term to \(\epsilon \sim N(\mu = 0, \sigma = 1)\) (i.e., standard normal).
set.seed(7052)
x1 <- rnorm(1000, mean = 2, sd = 0.1)
x2 <- rnorm(1000, mean = 0, sd = 0.4)
error <- rnorm(1000, mean = 0, sd = 1)
y <- 10 + 5 * x1 - 2 * x2 + error
mdata2 <- data.frame(x1, x2, y)
head(mdata2, n=3)
mdata_fit2 <- lm(y ~ x1 + x2, data = mdata2)
summary(mdata_fit2)
Call:
lm(formula = y ~ x1 + x2, data = mdata2)
Residuals:
Min 1Q Median 3Q Max
-3.1591 -0.6807 -0.0019 0.6308 3.2703
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.52794 0.61207 17.20 <2e-16 ***
x1 4.75551 0.30504 15.59 <2e-16 ***
x2 -2.07784 0.07671 -27.09 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9687 on 997 degrees of freedom
Multiple R-squared: 0.4904, Adjusted R-squared: 0.4894
F-statistic: 479.8 on 2 and 997 DF, p-value: < 2.2e-16
mse2 <- mean(mdata_fit2$residuals^2)
print(paste("MSE for mdata_fit2 is :", mse2))
[1] "MSE for mdata_fit2 is : 0.935524846492974"
Using the same procedure on the new data frame, the estimated coefficients are \(\hat\beta_1 = 4.756\) and \(\hat\beta_2 = -2.078\). Based on the full model \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\), the estimated prediction equation is \(\hat{y} = 10.53 + 4.756X_1 - 2.078X_2\).
The estimated coefficients and standard errors are 4.756 (SE 0.305) for x1 and -2.078 (SE 0.077) for x2. With the error standard deviation doubled, the standard error of each coefficient roughly doubles as well.
The hypotheses are the same as before: \(H_0: \beta_j = 0\) (no significant linear relationship between the response and that predictor) versus \(H_1: \beta_j \neq 0\). The marginal t-statistic and p-value for each predictor are again taken from the summary.
The t-statistic for x1 is 15.59 with a p-value below \(2 \times 10^{-16}\), so we reject the null hypothesis for x1: there is a significant linear relationship between y and x1. Likewise, the t-statistic for x2 is -27.09 with a p-value below \(2 \times 10^{-16}\), so we also reject the null hypothesis for x2. The F statistic is 479.8 with a p-value below \(2.2 \times 10^{-16}\), so we again reject the overall null hypothesis: the predictors jointly have a significant linear relationship with y.
As in the previous fit, x1 has a positive linear relationship with y and x2 has a negative one.
The MSE is about 0.936, a substantial increase over the previous fit, indicating a worse fit than the \(\sigma = 0.5\) case.
d) Repeat parts a)–c) using N=400. What do you conclude? What is the effect to the model parameter estimates when the error variance gets smaller? What is the effect when the sample size gets bigger?
I'll call the data set with N = 400 and error standard deviation 0.5 mdata3, and the one with N = 400 and error standard deviation 1 mdata4.
set.seed(7052)
x1 <- rnorm(400, mean = 2, sd = 0.1)
x2 <- rnorm(400, mean = 0, sd = 0.4)
error <- rnorm(400, mean = 0, sd = 0.5)
y <- 10 + 5 * x1 - 2 * x2 + error
mdata3 <- data.frame(x1, x2, y)
head(mdata3, n=3)
set.seed(7052)
x1 <- rnorm(400, mean = 2, sd = 0.1)
x2 <- rnorm(400, mean = 0, sd = 0.4)
error <- rnorm(400, mean = 0, sd = 1)
y <- 10 + 5 * x1 - 2 * x2 + error
mdata4 <- data.frame(x1, x2, y)
head(mdata4, n=3)
Fit the linear regression model to mdata3 and mdata4:
mdata_fit3 <- lm(y ~ x1 + x2, data = mdata3)
summary(mdata_fit3)
Call:
lm(formula = y ~ x1 + x2, data = mdata3)
Residuals:
Min 1Q Median 3Q Max
-1.85547 -0.36374 0.00169 0.32801 1.31769
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.69826 0.52106 20.53 <2e-16 ***
x1 4.66233 0.25869 18.02 <2e-16 ***
x2 -1.95779 0.06508 -30.08 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5076 on 397 degrees of freedom
Multiple R-squared: 0.7521, Adjusted R-squared: 0.7509
F-statistic: 602.3 on 2 and 397 DF, p-value: < 2.2e-16
mse3 <- mean(mdata_fit3$residuals^2)
print(paste("MSE of mdata_fit3:", mse3))
[1] "MSE of mdata_fit3: 0.255756735784694"
mdata_fit4 <- lm(y ~ x1 + x2, data = mdata4)
summary(mdata_fit4)
Call:
lm(formula = y ~ x1 + x2, data = mdata4)
Residuals:
Min 1Q Median 3Q Max
-3.7109 -0.7275 0.0034 0.6560 2.6354
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.3965 1.0421 10.936 < 2e-16 ***
x1 4.3247 0.5174 8.359 1.08e-15 ***
x2 -1.9156 0.1302 -14.716 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.015 on 397 degrees of freedom
Multiple R-squared: 0.4142, Adjusted R-squared: 0.4113
F-statistic: 140.4 on 2 and 397 DF, p-value: < 2.2e-16
mse4 <- mean(mdata_fit4$residuals^2)
print(paste("MSE of mdata_fit4:", mse4))
[1] "MSE of mdata_fit4: 1.02302694313878"
From the summary output for mdata3 and mdata4 we can see the following.
For mdata3, with intercept 10.70 and slopes 4.662 and -1.958, the estimated prediction equation is \(\hat{y} = 10.70 + 4.662X_1 - 1.958X_2\). For mdata4, with intercept 11.40 and slopes 4.325 and -1.916, the estimated prediction equation is \(\hat{y} = 11.40 + 4.325X_1 - 1.916X_2\).
The estimated coefficients and standard errors for mdata3 are 4.662 (SE 0.259) for x1 and -1.958 (SE 0.065) for x2; for mdata4 they are 4.325 (SE 0.517) for x1 and -1.916 (SE 0.130) for x2. Again, doubling the error standard deviation roughly doubles the standard error of each coefficient.
The hypothesis tests for mdata3 and mdata4 are set up identically: the null hypothesis is \(H_0: \beta_j = 0\) (no linear relationship between y and that predictor) and the alternative is \(H_1: \beta_j \neq 0\).
For mdata3, the t-statistic for x1 is 18.02 with a p-value below \(2 \times 10^{-16}\), well below the 0.05 threshold, so we reject the null hypothesis for x1: there is a significant linear relationship between y and x1. Likewise, the t-statistic for x2 is -30.08 with a p-value below \(2 \times 10^{-16}\), so we also reject the null hypothesis for x2. The F statistic is 602.3 with a p-value below \(2.2 \times 10^{-16}\), so the overall test also rejects the null hypothesis: the predictors jointly have a significant linear relationship with y. The same holds for mdata4: the t-statistic for x1 is 8.359 (p-value \(1.08 \times 10^{-15}\)) and the t-statistic for x2 is -14.716 (p-value below \(2 \times 10^{-16}\)), so both coefficients are significant, and the F statistic of 140.4 (p-value below \(2.2 \times 10^{-16}\)) confirms the overall significance.
As before, x1 has a positive linear relationship with y (y tends to increase as x1 increases) and x2 has a negative one (y tends to decrease as x2 increases).
This simulation also shows clear patterns as the sample size and the error standard deviation change, summarized below (values collected from the four model summaries above):
Model        N     error sd   estimate of beta1 (SE)   estimate of beta2 (SE)   MSE
mdata_fit    1000   0.5        4.878 (0.153)            -2.039 (0.038)           0.234
mdata_fit2   1000   1.0        4.756 (0.305)            -2.078 (0.077)           0.936
mdata_fit3    400   0.5        4.662 (0.259)            -1.958 (0.065)           0.256
mdata_fit4    400   1.0        4.325 (0.517)            -1.916 (0.130)           1.023
As the sample size increases, the coefficient estimates become more precise and the standard errors shrink for both predictors; the t-statistics grow in magnitude and the p-values shrink, so the coefficients are more likely to be statistically significant. Conversely, as the error variance increases, the variability in the response increases and the standard errors grow (roughly doubling here when the error standard deviation doubles); the t-statistics shrink in magnitude and the p-values grow, so the coefficients are less likely to be significant.
e) What about the MSE from each model?
From the table above, we can see that the MSE changes little with sample size when the error variance is held fixed. When the error variance increases, however, the MSE increases as well, because the residuals become larger; so a larger error variance generally means a worse fit to the data.
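For reproducibility, the comparison table can also be assembled directly from the four fitted objects; a minimal sketch assuming mdata_fit, mdata_fit2, mdata_fit3, and mdata_fit4 are all still in the workspace.
fits <- list(mdata_fit = mdata_fit, mdata_fit2 = mdata_fit2,
             mdata_fit3 = mdata_fit3, mdata_fit4 = mdata_fit4)
comparison <- do.call(rbind, lapply(fits, function(fit) {
  cf <- coef(summary(fit))  # coefficient table: estimates, SEs, t values, p-values
  data.frame(n = nobs(fit),
             b1 = cf["x1", "Estimate"], se_b1 = cf["x1", "Std. Error"],
             b2 = cf["x2", "Estimate"], se_b2 = cf["x2", "Std. Error"],
             MSE = mean(residuals(fit)^2))
}))
round(comparison, 3)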
a) Write out the multiple linear regression model with normal errors in matrix form.
The regression model can be expressed as \(\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\), where \(\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}\) is an \(n \times 1\) vector of responses, \(\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix}\) is a \(p \times 1\) vector of coefficients, \(\boldsymbol{\epsilon}\) is the \(n \times 1\) vector of normal errors, and \(\mathbf{X}\) is the \(n \times p\) model matrix \(\mathbf{X} = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1,p-1} \\ 1 & X_{21} & X_{22} & \cdots & X_{2,p-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{n,p-1} \end{bmatrix}\).
b) State the model assumptions from part a).
The usual assumption is that the errors are independent and normally distributed with mean zero and constant variance: \(\boldsymbol{\epsilon} \sim N(\mathbf{0}_n, \sigma^2 \mathbf{I}_n)\), where \(\mathbf{I}_n\) is the \(n \times n\) identity matrix.
c) Write out the model matrix X for the model in part a.
\(\mathbf{X}\) is the \(n \times p\) model matrix \(\mathbf{X} = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1,p-1} \\ 1 & X_{21} & X_{22} & \cdots & X_{2,p-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{n,p-1} \end{bmatrix}\), with a leading column of ones for the intercept.
d) What is the least squares estimate of β in part a)?
To find the value of \(\beta\) that minimizes the SSE, differentiate the SSE with respect to \(\beta\) and set the result equal to zero, which gives the normal equations \(-2X^\intercal Y + 2X^\intercal X \beta = \mathbf{0}_p\), so \(\hat\beta = (X^\intercal X)^{-1} X^\intercal Y\). \(X\) must be of full (column) rank for \(X^\intercal X\) to be invertible.
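As an illustration of the formula, the closed-form estimate can be computed directly for the alumni model from problem 1 (assuming alumni_fit is still available), and it reproduces the coefficients returned by lm().
X <- model.matrix(alumni_fit)              # n x p model matrix, first column is all ones
Y <- alumni$alumni_giving_rate             # response vector
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)  # solves the normal equations
cbind(manual = drop(beta_hat), lm = coef(alumni_fit))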
e) What does it mean for the estimate in part d) to be unbiased?
As one of the properties of \(\hat\beta\), unbiasedness means that the expected value of the estimator equals the true parameter value: \(E(\hat\beta) = \beta\).
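A one-line verification under the assumptions in part b): using \(\hat\beta = (X^\intercal X)^{-1} X^\intercal Y\) and \(E(Y) = X\beta\), we have \(E(\hat\beta) = (X^\intercal X)^{-1} X^\intercal E(Y) = (X^\intercal X)^{-1} X^\intercal X \beta = \beta\), so on average the least squares estimator recovers the true coefficients.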