##Question 1. (15 points) Alumni Donation Data (Multiple Linear Regression). Continue with the same data from homework 1 and fit a multiple linear regression model to the data, where the alumni giving rate is the response variable (Y), and the percentage of classes with fewer than 20 students (X1) and Student/Faculty Ratio (X2) are the predictors.
## loading data
data <- read.csv("alumni.csv")
head(data)
## school percent_of_classes_under_20
## 1 Boston College 39
## 2 Brandeis University 68
## 3 Brown University 60
## 4 California Institute of Technology 65
## 5 Carnegie Mellon University 67
## 6 Case Western Reserve Univ. 52
## student_faculty_ratio alumni_giving_rate private
## 1 13 25 1
## 2 8 33 1
## 3 8 40 1
## 4 3 46 1
## 5 10 28 1
## 6 8 31 1
attach(data)
Y <- alumni_giving_rate
X1 <- percent_of_classes_under_20
X2 <- student_faculty_ratio
MLR <- lm(Y ~ X1 + X2, data = data)
summary(MLR)
##
## Call:
## lm(formula = Y ~ X1 + X2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.00 -6.57 -1.95 4.42 24.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6556 13.5076 2.936 0.005225 **
## X1 0.1662 0.1626 1.022 0.312128
## X2 -1.7021 0.4421 -3.850 0.000371 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.098 on 45 degrees of freedom
## Multiple R-squared: 0.5613, Adjusted R-squared: 0.5418
## F-statistic: 28.79 on 2 and 45 DF, p-value: 8.869e-09
##a. What is your final estimated model?
The final estimated model is Y-hat = 39.66 + 0.17 X1 - 1.70 X2 (coefficients rounded from the estimates 39.6556, 0.1662, and -1.7021).
##b. What is the predicted alumni giving rate for an observation with (X1=50,X2=10)?
Y = 39.66 + (0.17 * 50) - (1.7 * 10)
Y
## [1] 31.16
The predicted alumni giving rate for X1 = 50 and X2 = 10 is 31.16.
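As a cross-check, the same prediction can be obtained with predict() on the fitted model; this is a small sketch assuming the MLR object from above, and since it uses the unrounded coefficients its result differs slightly from the hand calculation with rounded values.
# Prediction from the fitted model; newdata must use the predictor names from the formula (X1, X2)
predict(MLR, newdata = data.frame(X1 = 50, X2 = 10))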
##c. Using the model summary, test the statistical significance of each regression coefficient using t-tests; use α=0.05. Obtain the t-statistics and p-values, interpret the results, make a conclusion (i.e. reject or not reject), and explain why. Note: please explain what the null hypothesis is.
summary(MLR)
##
## Call:
## lm(formula = Y ~ X1 + X2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.00 -6.57 -1.95 4.42 24.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6556 13.5076 2.936 0.005225 **
## X1 0.1662 0.1626 1.022 0.312128
## X2 -1.7021 0.4421 -3.850 0.000371 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.098 on 45 degrees of freedom
## Multiple R-squared: 0.5613, Adjusted R-squared: 0.5418
## F-statistic: 28.79 on 2 and 45 DF, p-value: 8.869e-09
The summary above reports the t-statistic for each coefficient, computed as the estimate divided by its standard error.
The t-statistics are 2.936 for beta0 (intercept), 1.022 for beta1 (X1), and -3.850 for beta2 (X2); the corresponding p-values are 0.005225, 0.312128, and 0.000371.
The signs of the estimates give the direction of the association: beta1 > 0 (the giving rate tends to rise with the percentage of small classes) and beta2 < 0 (the giving rate falls as the student/faculty ratio rises).
For each coefficient the null hypothesis is H0: beta_j = 0, i.e. the predictor has no linear effect on Y once the other predictor is accounted for, against Ha: beta_j != 0.
For X1, p = 0.3121 > 0.05, so we fail to reject H0: the coefficient of X1 is not significantly different from zero at the 0.05 level.
For X2, p = 0.000371 < 0.05, so we reject H0: the coefficient of X2 is significantly different from zero.
The intercept is also significant (p = 0.005225 < 0.05), although it is usually not the hypothesis of primary interest.
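As a sanity check, the t-statistics and p-values in the coefficient table can be reproduced by hand from the estimates and standard errors; a minimal sketch using the fitted MLR object:
# Recompute each t-statistic (estimate / standard error) and its two-sided p-value
coefs <- summary(MLR)$coefficients
t_stats <- coefs[, "Estimate"] / coefs[, "Std. Error"]
p_vals <- 2 * pt(abs(t_stats), df = MLR$df.residual, lower.tail = FALSE)
cbind(t_stat = t_stats, p_value = p_vals)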
##d. What is the F statistic of the model? Is it significant? Clearly write out the null hypothesis, F-statistic, and p-value and interpret the test results.
F-statistic: 28.79 on 2 and 45 DF, p-value: 8.869e-09.
The null hypothesis of the overall F-test is H0: beta1 = beta2 = 0 (neither predictor has a linear effect on Y), against Ha: at least one beta_j != 0. Since the p-value (8.869e-09) is far below 0.05, we reject H0: the model as a whole is statistically significant, at least one of X1 and X2 is significantly related to the alumni giving rate, and together they explain a significant amount of the variation in Y.
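The overall F-statistic and its p-value can also be extracted from the summary object; a short sketch assuming the fitted MLR object from above:
# Overall F-test: extract the statistic and recompute its p-value from the F distribution
f <- summary(MLR)$fstatistic
f
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)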
##e. What is the value of the coefficient of determination? Please interpret.
summary_R2 <- summary(MLR)
r_squared <- summary_R2$r.squared
r_squared
## [1] 0.5613406
The R-squared value of 0.5613 means that about 56.1% of the variation in the alumni giving rate is explained by X1 (percent_of_classes_under_20) and X2 (student_faculty_ratio); the model accounts for a moderate share of the variability in Y.
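Equivalently, R^2 can be computed by hand as 1 - SSE/SST; a sketch that takes the response from the model frame (the Y object was overwritten by the prediction calculation in part b):
# R^2 by hand: 1 - SSE/SST
y <- model.response(model.frame(MLR)) # original response values
SSE <- sum(residuals(MLR)^2) # residual sum of squares
SST <- sum((y - mean(y))^2) # total sum of squares
1 - SSE / SST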
##f. What is the correlation coefficient r1 between X1 and Y and the correlation coefficient r2 between X2 and Y? Do you see any relationship between r1, r2, and R2?
r1 <- cor(data$percent_of_classes_under_20, data$alumni_giving_rate)
r2 <- cor(data$student_faculty_ratio, data$alumni_giving_rate)
r1
## [1] 0.6456504
r2
## [1] -0.7423975
These results show that the alumni giving rate tends to be higher when the percentage of classes under 20 students is higher (r1 = 0.646) and lower when the student/faculty ratio is higher (r2 = -0.742), i.e. when there are more students per professor.
In simple linear regression the square of the correlation (r1^2) equals R^2, but in multiple linear regression the relationship is not that straightforward.
The R-squared of a multiple regression model measures how well all the predictors jointly explain the variation in the response, and it also depends on how the predictors are correlated with each other; because X1 and X2 are correlated, R^2 (0.561) is not simply r1^2 + r2^2, and the joint explanatory power can differ from what the individual correlations with Y would suggest.
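To make the relationship concrete, the squared marginal correlations can be placed next to the model R^2; a sketch using the objects defined above:
# Squared marginal correlations vs. the multiple R^2
c(r1_sq = r1^2, r2_sq = r2^2, sum = r1^2 + r2^2, R_sq = summary(MLR)$r.squared)
# Correlation between the two predictors, which is why the pieces do not simply add up
cor(data$percent_of_classes_under_20, data$student_faculty_ratio)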
##Question 2. (10 points) Simulation Study (Multiple Linear Regression). Assume mean function E(Y|X)=10+5X1−2X2
##a. Generate data with X1∼N(μ=2,σ=0.1), X2∼N(μ=0,σ=0.4), sample size n=100, and error term ϵ∼N(μ=0,σ=0.5).
set.seed(123)
n <- 100
X1 <- rnorm(n, mean = 2, sd = 0.1)
X2 <- rnorm(n, mean = 0, sd = 0.4)
error_term <- rnorm(n, mean = 0, sd = 0.5)
beta0 <- 10
beta1 <- 5
beta2 <- -2
Y <- beta0 + beta1 * X1 + beta2 * X2 + error_term
simulated_data <- data.frame(X1, X2, Y)
head(simulated_data)
## X1 X2 Y
## 1 1.943952 -0.28416263 21.38749
## 2 1.976982 0.10275348 20.33561
## 3 2.155871 -0.09867675 20.84414
## 4 2.007051 -0.13901704 20.58489
## 5 2.012929 -0.38064743 20.61877
## 6 2.171506 -0.01801109 20.65543
##b. Fit a multiple linear regression to the simulated data from part a. What is the estimated prediction equation? Report the estimated coefficients and their standard errors. Are they significant? Clearly write out the null and alternative hypotheses, observed t-statistic(s), p-value(s), and interpret the estimates and test results. What is fitted model’s MSE?
lm_model <- lm(Y ~ X1 + X2, data = simulated_data)
summary(lm_model)
##
## Call:
## lm(formula = Y ~ X1 + X2, data = simulated_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93651 -0.33037 -0.06222 0.31068 1.03991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3992 1.0543 10.813 < 2e-16 ***
## X1 4.3341 0.5243 8.266 7.29e-13 ***
## X2 -1.9702 0.1237 -15.922 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4756 on 97 degrees of freedom
## Multiple R-squared: 0.7758, Adjusted R-squared: 0.7712
## F-statistic: 167.8 on 2 and 97 DF, p-value: < 2.2e-16
residuals <- residuals(lm_model)
mse <- mean(residuals^2)
mse
## [1] 0.2194547
The estimated prediction equation is: Y-hat = 11.399 + 4.334 X1 - 1.970 X2.
The estimated coefficients are 11.399, 4.334, and -1.970, with standard errors 1.054, 0.524, and 0.124 respectively. All of them are statistically significant at alpha = 0.05.
Null hypothesis (for each coefficient): H0: beta_j = 0, i.e. Y does not depend linearly on the corresponding predictor.
Alternative hypothesis: Ha: beta_j != 0, i.e. the coefficient is non-zero and the predictor does affect Y.
The observed t-statistics are 10.813, 8.266, and -15.922, with p-values < 2e-16, 7.29e-13, and < 2e-16 respectively.
All p-values are far below 0.05, so we reject each null hypothesis and conclude that Y is related to both X1 and X2; the estimates are also close to the true values 10, 5, and -2 used to generate the data.
The MSE of the fitted model (mean of the squared residuals) is 0.2195, close to the true error variance sigma^2 = 0.25.
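Because the data are simulated, the estimates can also be compared directly with the true coefficients; a sketch placing the true values next to the estimates and their 95% confidence intervals:
# True coefficients vs. estimates and 95% confidence intervals
cbind(true = c(10, 5, -2), estimate = coef(lm_model), confint(lm_model))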
##c. Repeat part b), but re-simulate the data and change the error term to ϵ∼N(0,σ=1)
set.seed(123)
n1 <- 100
X11 <- rnorm(n1, mean = 2, sd = 0.1)
X21 <- rnorm(n1, mean = 0, sd = 0.4)
error_term1 <- rnorm(n1, mean = 0, sd = 1) # error SD increased from 0.5 to 1
beta0 <- 10
beta1 <- 5
beta2 <- -2
Y1 <- beta0 + beta1 * X11 + beta2 * X21 + error_term1
simulated_data1 <- data.frame(X11, X21, Y1)
head(simulated_data1)
## X11 X21 Y1
## 1 1.943952 -0.28416263 22.48690
## 2 1.976982 0.10275348 20.99182
## 3 2.155871 -0.09867675 20.71156
## 4 2.007051 -0.13901704 20.85648
## 5 2.012929 -0.38064743 20.41160
## 6 2.171506 -0.01801109 20.41731
lm_model1 <- lm(Y1 ~ X11 + X21, data = simulated_data1)
summary(lm_model1)
##
## Call:
## lm(formula = Y1 ~ X11 + X21, data = simulated_data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8730 -0.6607 -0.1245 0.6214 2.0798
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.7985 2.1085 6.070 2.48e-08 ***
## X11 3.6683 1.0487 3.498 0.000709 ***
## X21 -1.9405 0.2475 -7.841 5.84e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9513 on 97 degrees of freedom
## Multiple R-squared: 0.4413, Adjusted R-squared: 0.4298
## F-statistic: 38.31 on 2 and 97 DF, p-value: 5.472e-13
residuals1 <- residuals(lm_model1)
mse1 <- mean(residuals1^2)
mse1
## [1] 0.8778187
The estimated prediction equation is: Y-hat = 12.799 + 3.668 X1 - 1.941 X2.
The estimated coefficients are 12.799, 3.668, and -1.941, with standard errors 2.109, 1.049, and 0.248 respectively. All of them are statistically significant at alpha = 0.05.
Null hypothesis (for each coefficient): H0: beta_j = 0, i.e. Y does not depend linearly on the corresponding predictor.
Alternative hypothesis: Ha: beta_j != 0, i.e. the coefficient is non-zero and the predictor does affect Y.
The observed t-statistics are 6.070, 3.498, and -7.841, with p-values 2.48e-08, 0.000709, and 5.84e-12 respectively.
All p-values are below 0.05, so we reject each null hypothesis and conclude that Y is related to both X1 and X2, although the estimates are noisier than in part b because of the larger error variance.
The MSE of the fitted model is 0.8778, consistent with the true error variance sigma^2 = 1 and roughly four times the MSE from part b.
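Comparing the two fits side by side shows the effect of the larger error variance: the coefficient standard errors roughly double when the error SD doubles from 0.5 to 1 (a sketch using the two fitted model objects):
# Standard errors under error SD = 0.5 vs. error SD = 1
cbind(sd_0.5 = summary(lm_model)$coefficients[, "Std. Error"],
      sd_1 = summary(lm_model1)$coefficients[, "Std. Error"])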
##d. Repeat parts a)–c) using n=400. What do you conclude? What is the effect to the model parameter estimates when the error variance gets smaller? What is the effect when the sample size gets bigger?
set.seed(123)
n <- 400
X12 <- rnorm(n, mean = 2, sd = 0.1)
X22 <- rnorm(n, mean = 0, sd = 0.4)
error_term2 <- rnorm(n, mean = 0, sd = 0.5)
beta0 <- 10
beta1 <- 5
beta2 <- -2
Y2 <- beta0 + beta1 * X12 + beta2 * X22 + error_term2
simulated_data2 <- data.frame(X12, X22, Y2)
head(simulated_data2)
## X12 X22 Y2
## 1 1.943952 -0.02942241 19.95675
## 2 1.976982 -0.46746057 20.49083
## 3 2.155871 -0.25389931 21.71475
## 4 2.007051 -0.01153662 20.63480
## 5 2.012929 0.26827839 19.66622
## 6 2.171506 -0.66021862 22.25002
lm_model2 <- lm(Y2 ~ X12 + X22, data = simulated_data2)
summary(lm_model2)
##
## Call:
## lm(formula = Y2 ~ X12 + X22, data = simulated_data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3078 -0.3684 -0.0175 0.3040 1.5914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.64074 0.53201 20.00 <2e-16 ***
## X12 4.69084 0.26548 17.67 <2e-16 ***
## X22 -2.00673 0.06468 -31.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5137 on 397 degrees of freedom
## Multiple R-squared: 0.7583, Adjusted R-squared: 0.7571
## F-statistic: 622.8 on 2 and 397 DF, p-value: < 2.2e-16
residuals2 <- residuals(lm_model2)
mse2 <- mean(residuals2^2)
mse2
## [1] 0.2618728
set.seed(123)
n <- 400
X13 <- rnorm(n, mean = 2, sd = 0.1)
X23 <- rnorm(n, mean = 0, sd = 0.4)
error_term3 <- rnorm(n, mean = 0, sd = 1)
beta0 <- 10
beta1 <- 5
beta2 <- -2
Y3 <- beta0 + beta1 * X13 + beta2 * X23 + error_term3
simulated_data3 <- data.frame(X13, X23, Y3)
head(simulated_data3)
## X13 X23 Y3
## 1 1.943952 -0.02942241 20.13489
## 2 1.976982 -0.46746057 20.16182
## 3 2.155871 -0.25389931 22.14235
## 4 2.007051 -0.01153662 21.21126
## 5 2.012929 0.26827839 19.80436
## 6 2.171506 -0.66021862 22.32207
lm_model3 <- lm(Y3 ~ X13 + X23, data = simulated_data3)
summary(lm_model3)
##
## Call:
## lm(formula = Y3 ~ X13 + X23, data = simulated_data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6156 -0.7368 -0.0350 0.6080 3.1828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.2815 1.0640 10.603 < 2e-16 ***
## X13 4.3817 0.5310 8.252 2.32e-15 ***
## X23 -2.0135 0.1294 -15.564 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.027 on 397 degrees of freedom
## Multiple R-squared: 0.4333, Adjusted R-squared: 0.4304
## F-statistic: 151.8 on 2 and 397 DF, p-value: < 2.2e-16
residuals3 <- residuals(lm_model3)
mse3 <- mean(residuals3^2)
mse3
## [1] 1.047491
When the sample size increases from n = 100 to n = 400, the parameter estimates move closer to the true values (10, 5, -2) and their standard errors shrink by roughly a factor of two. Similarly, when the error variance gets smaller, the estimates become more precise: halving the error SD from 1 to 0.5 roughly halves the standard errors and brings the estimates closer to the true values.
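A side-by-side table of the four fits makes the pattern easier to see; a sketch assuming the four model objects fitted above:
# Coefficient estimates from all four simulations vs. the true values
rbind(true = c(10, 5, -2),
      n100_sd0.5 = coef(lm_model),
      n100_sd1 = coef(lm_model1),
      n400_sd0.5 = coef(lm_model2),
      n400_sd1 = coef(lm_model3))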
mse
## [1] 0.2194547
mse1
## [1] 0.8778187
mse2
## [1] 0.2618728
mse3
## [1] 1.047491
##e. What about the MSE (Residual standard error squared) from each model?
The MSE tracks the error variance, not the sample size: it is roughly 0.22 and 0.26 when sigma = 0.5 (true sigma^2 = 0.25) and roughly 0.88 and 1.05 when sigma = 1 (true sigma^2 = 1). Increasing the sample size does not systematically change the MSE; it only makes it a more stable estimate of sigma^2. Note also that mse here was computed as mean(residuals^2) = SSE/n, which is slightly smaller than the residual standard error squared, SSE/(n - p); with n = 100 or 400 the difference is minor.
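For completeness, the residual standard error squared, SSE/(n - p), can be read off each fit directly; a sketch using the four model objects:
# Residual standard error squared for each model; each should be close to the true sigma^2
c(n100_sd0.5 = summary(lm_model)$sigma^2,
  n100_sd1 = summary(lm_model1)$sigma^2,
  n400_sd0.5 = summary(lm_model2)$sigma^2,
  n400_sd1 = summary(lm_model3)$sigma^2)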