+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++— title: ‘MA717: Applied Regression and Experimental Data Analysis’ author: “Assignment template” date: ” ” output: pdf_document: default word_document: default html_document: df_print: paged —
options(repos = c(CRAN = "https://cran.r-project.org"))
1.1. Read “College.csv” file into R with following command and use dim() and head() to check if you read the data correct. You should report the number of observations and the number of variables. (5 \(\%\))
file_path <- file.path("M:/Assignment-20241208/College.csv")
mydata <- read.csv("M:/Assignment-20241208/College.csv", header = TRUE, stringsAsFactors = TRUE)
dim(mydata)
## [1] 775 17
head(mydata)
## Private Apps Accept Enroll F.Undergrad P.Undergrad Outstate Room.Board Books
## 1 Yes 1660 1232 721 2885 537 7440 3300 450
## 2 Yes 2186 1924 512 2683 1227 12280 6450 750
## 3 Yes 1428 1097 336 1036 99 11250 3750 400
## 4 Yes 417 349 137 510 63 12960 5450 450
## 5 Yes 193 146 55 249 869 7560 4120 800
## 6 Yes 587 479 158 678 41 13500 3335 500
## Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate Elite
## 1 2200 70 78 18.1 12 7041 60 No
## 2 1500 29 30 12.2 16 10527 56 No
## 3 1165 53 66 12.9 30 8735 54 No
## 4 875 92 97 7.7 37 19016 59 Yes
## 5 1500 76 72 11.9 2 10922 15 No
## 6 675 67 73 9.4 11 9727 55 No
The number of observations : 775 The number of variables : 17
1.2. Use your registration number as random seed, generate a random subset of College data with sample size 700, name this new data as mynewdata. Use summary() to output the summarized information about mynewdata. Please report the number of private and public university and the number of Elite university and non-Elite university in this new data. (12 \(\%\))
set.seed(2401616)
mynewdata <-
mydata[sample(1:nrow(mydata),700), ]
summary(mydata)
## Private Apps Accept Enroll F.Undergrad
## No :211 Min. : 81 Min. : 72.0 Min. : 35.0 Min. : 139
## Yes:564 1st Qu.: 778 1st Qu.: 605.5 1st Qu.: 242.0 1st Qu.: 990
## Median : 1558 Median : 1110.0 Median : 434.0 Median : 1708
## Mean : 3004 Mean : 2019.0 Mean : 781.0 Mean : 3707
## 3rd Qu.: 3610 3rd Qu.: 2413.0 3rd Qu.: 902.5 3rd Qu.: 4056
## Max. :48094 Max. :26330.0 Max. :6392.0 Max. :31643
## P.Undergrad Outstate Room.Board Books
## Min. : 1.0 Min. : 2340 Min. :1780 Min. : 96.0
## 1st Qu.: 95.0 1st Qu.: 7332 1st Qu.:3598 1st Qu.: 469.0
## Median : 355.0 Median : 9990 Median :4200 Median : 500.0
## Mean : 857.3 Mean :10449 Mean :4358 Mean : 549.3
## 3rd Qu.: 967.5 3rd Qu.:12938 3rd Qu.:5050 3rd Qu.: 600.0
## Max. :21836.0 Max. :21700 Max. :8124 Max. :2340.0
## Personal PhD Terminal S.F.Ratio
## Min. : 250 Min. : 8.00 Min. : 24.00 Min. : 2.50
## 1st Qu.: 870 1st Qu.: 62.00 1st Qu.: 71.00 1st Qu.:11.50
## Median :1200 Median : 75.00 Median : 82.00 Median :13.60
## Mean :1343 Mean : 72.69 Mean : 79.73 Mean :14.09
## 3rd Qu.:1700 3rd Qu.: 85.00 3rd Qu.: 92.00 3rd Qu.:16.50
## Max. :6800 Max. :100.00 Max. :100.00 Max. :39.80
## perc.alumni Expend Grad.Rate Elite
## Min. : 0.00 Min. : 3186 Min. : 10.00 No :697
## 1st Qu.:13.00 1st Qu.: 6754 1st Qu.: 53.00 Yes: 78
## Median :21.00 Median : 8408 Median : 65.00
## Mean :22.76 Mean : 9667 Mean : 65.42
## 3rd Qu.:31.00 3rd Qu.:10847 3rd Qu.: 78.00
## Max. :64.00 Max. :56233 Max. :100.00
table(mynewdata$Private)
##
## No Yes
## 190 510
table(mynewdata$Elite)
##
## No Yes
## 629 71
In the new data, there are 564 private universities and 211 public universities. Additionally, there are 78 elite universities and 697 non-elite universities.
1.3. Use mynewdata, plot histogram plots of four variables “Outstate”, “Room.Board”, “Books” and “Personal”. Give each plot a suitable title and label for x axis and y axis. (8\(\%\))
install.packages("ggplot2")
## Installing package into 'C:/Users/kp24011/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'ggplot2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\kp24011\AppData\Local\Temp\Rtmpigz83d\downloaded_packages
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
hist_vars <- c("Outstate", "Room.Board", "Books", "Personal")
titles <- c("Histogram of Outstate Tuition",
"Histogram of Room and Board Costs",
"Histogram of Book Costs",
"Histogram of Personal Expenses")
x_labels <- c("Outstate Tuition",
"Room and Board Costs",
"Book Costs",
"Personal Expenses")
for (i in seq_along(hist_vars)) {
plot <- ggplot(mynewdata, aes(x = .data[[hist_vars[i]]])) +
geom_histogram(binwidth = 900, fill = "skyblue", color = "red") +
labs(
title = titles[i],
x = x_labels[i],
y = "Frequency"
) +
theme_minimal()
print(plot)
}
2.1. Use mynewdata, do a linear regression fitting when outcome is “Grad.Rate” and predictors are “Private” and “Elite”. Show the R output and report what you have learned from this output (you need to discuss significance, adjusted R-squared and p-value of F-statistics). (6\(\%\)).
model1 <- lm(Grad.Rate ~ Private + Elite, data = mynewdata)
summary(model1)
##
## Call:
## lm(formula = Grad.Rate ~ Private + Elite, data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.843 -9.870 0.157 10.157 44.886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.114 1.094 50.356 <2e-16 ***
## PrivateYes 11.729 1.276 9.190 <2e-16 ***
## EliteYes 18.108 1.880 9.632 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.98 on 697 degrees of freedom
## Multiple R-squared: 0.2141, Adjusted R-squared: 0.2119
## F-statistic: 94.95 on 2 and 697 DF, p-value: < 2.2e-16
Significance of the Coefficients: Intercept (55.114): The intercept is statistically significant with a very low p-value (<2e-16), which indicates that the baseline value of the dependent variable (Grad.Rate) is significantly different from zero when all predictors (Private and Elite) are at their reference levels (Private = No, Elite = No). PrivateYes (11.729): This coefficient represents the change in the graduation rate for institutions where “Private = Yes,” compared to those where “Private = No.” The p-value (<2e-16) suggests that this effect is statistically significant, meaning that private institutions, on average, have a graduation rate that is 11.729 percentage points higher than public institutions. EliteYes (18.108): The coefficient for “Elite = Yes” suggests that elite institutions have, on average, a graduation rate 18.108 percentage points higher than non-elite institutions, with a very low p-value (<2e-16), indicating statistical significance. All coefficients (Intercept, PrivateYes, and EliteYes) have p-values much less than 0.05, indicating they are highly statistically significant.
Adjusted R-squared: The Adjusted R-squared value is 0.2119, which suggests that the model explains approximately 21.19% of the variability in the graduation rate. While this is a modest proportion of variance explained, it suggests that other factors not included in the model may be influencing graduation rates. The Multiple R-squared value is 0.2141, which is slightly higher than the Adjusted R-squared because it doesn’t account for the number of predictors. Typically, we focus on Adjusted R-squared for model comparison, as it penalizes the addition of irrelevant variables. While 21.19% of explained variance might not seem very high, in social sciences and educational research, this could still be considered meaningful, especially if there are many unmeasured factors influencing graduation rates.
F-statistic and p-value: The F-statistic is 94.95, and the associated p-value is < 2.2e-16, which is extremely small and highly significant. This result indicates that at least one of the predictors (Private or Elite) significantly contributes to explaining the variation in graduation rates. The F-statistic tests the overall significance of the regression model, and a p-value this small strongly suggests that the model is valid and not due to random chance.
2.2. Use the linear regression fitting result in 2.1, calculate the confidence intervals for the coefficients. Also give the prediction interval of “Grad.Rate” for a new data with Private=“Yes” and Elite=“No”. (4\(\%\))
confint(lm(Grad.Rate ~ Private + Elite, data = mynewdata))
## 2.5 % 97.5 %
## (Intercept) 52.964799 57.26257
## PrivateYes 9.223286 14.23465
## EliteYes 14.416813 21.79863
new_data <- data.frame(Private = "Yes", Elite = "No")
predict(lm(Grad.Rate ~ Private + Elite, data = mynewdata), newdata = new_data, interval = "prediction")
## fit lwr upr
## 1 66.84265 37.39577 96.28953
2.3 Use mynewdata, do a multiple linear regression fitting when outcome is “Grad.Rate”, all other variables as predictors. Show the R output and report what you have learned from this output (you need to discuss significance, adjusted R-squared and p-value of F-statistics). Is linear regression model in 2.3 better than linear regression in 2.1? Use ANOVA to justify your conclusion. (14%)
model_all <- lm(Grad.Rate ~ ., data = mynewdata)
summary(model_all)
##
## Call:
## lm(formula = Grad.Rate ~ ., data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.646 -7.287 -0.441 7.155 52.328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.0837588 4.9053047 7.152 2.20e-12 ***
## PrivateYes 3.9254138 1.7691302 2.219 0.026826 *
## Apps 0.0016663 0.0004327 3.851 0.000129 ***
## Accept -0.0015698 0.0008470 -1.853 0.064274 .
## Enroll 0.0021744 0.0023102 0.941 0.346933
## F.Undergrad -0.0002027 0.0004041 -0.502 0.616105
## P.Undergrad -0.0015407 0.0003898 -3.953 8.52e-05 ***
## Outstate 0.0011163 0.0002483 4.495 8.17e-06 ***
## Room.Board 0.0015220 0.0006147 2.476 0.013524 *
## Books 0.0006745 0.0030868 0.219 0.827084
## Personal -0.0024121 0.0008447 -2.856 0.004425 **
## PhD 0.1369511 0.0617875 2.216 0.026987 *
## Terminal -0.0353895 0.0666052 -0.531 0.595361
## S.F.Ratio 0.0097674 0.1625001 0.060 0.952088
## perc.alumni 0.3169363 0.0502684 6.305 5.18e-10 ***
## Expend -0.0004464 0.0001601 -2.788 0.005446 **
## EliteYes 4.2327018 2.0267038 2.088 0.037126 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.6 on 683 degrees of freedom
## Multiple R-squared: 0.4551, Adjusted R-squared: 0.4423
## F-statistic: 35.65 on 16 and 683 DF, p-value: < 2.2e-16
model_2_1 <- lm(Grad.Rate ~ Private + Elite, data = mynewdata)
anova(model_2_1, model_all)
## Analysis of Variance Table
##
## Model 1: Grad.Rate ~ Private + Elite
## Model 2: Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad +
## P.Undergrad + Outstate + Room.Board + Books + Personal +
## PhD + Terminal + S.F.Ratio + perc.alumni + Expend + Elite
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 697 156447
## 2 683 108483 14 47964 21.57 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To compare the models, we use an ANOVA test:
Model 1 (2.1): Includes Private and Elite. Model 2 (2.3): Includes all predictors. ANOVA Results: Residual Sum of Squares (RSS): Model 2 (full model) has a lower RSS (108,483 vs. 156,447), indicating a better fit. F-statistic: 21.57, suggesting a significant improvement in Model 2. p-value: < 2.2e-16, indicating that the improvement from Model 1 to Model 2 is statistically significant. Conclusion: The full model (2.3) is significantly better than the simpler model (2.1), as it explains more variance in Grad.Rate and provides a better fit, confirmed by the ANOVA test.
2.4. Use the diagnostic plots to look at the fitting of multiple linear regression in 2.3. Please comment what you have seen from those plots. (7%)
par(mfrow = c(2, 2))
plot(model_all)
When examining diagnostic plots for the multiple linear regression model
in 2.3, we typically use the following plots:
Residuals vs. Fitted Plot:
This plot shows whether there are any patterns in the residuals (errors). Ideally, residuals should be randomly scattered around zero, without any clear patterns. Observation: If we see a random scatter of points, it suggests that the linearity assumption holds, meaning the relationship between predictors and the outcome is linear.
Normal Q-Q Plot:
This plot checks if the residuals are normally distributed. Points should fall along the diagonal line if the residuals are normally distributed. Observation: If the points mostly follow the line, it indicates that the residuals are approximately normal, which supports the normality assumption.
Scale-Location Plot :
This plot helps check for homoscedasticity (constant variance of residuals). Ideally, the points should be evenly spread without any distinct pattern. Observation: A random spread of points suggests constant variance, but any funnel shape indicates heteroscedasticity (non-constant variance).
Residuals vs. Leverage Plot:
This plot helps identify influential data points that might disproportionately affect the model’s coefficients. Points outside the Cook’s distance lines are influential. Observation: If there are no extreme points far from the rest, the model doesn’t seem to have influential outliers that could bias the results.
2.5. Use mynewdata, do a variable selection to choose the best model. You should use plots to justify how do you choose your best model. Use the selected predictors of your best model with outcome “Grad.Rate”, do a linear regression fitting and plot the diagnostic plots for this fitting. You can use either exhaustive, or forward, or backward selection method. (14%)
library(MASS)
## Warning: package 'MASS' was built under R version 4.4.2
library(ggplot2)
full_model <- lm(Grad.Rate ~ ., data = mynewdata)
best_model <- stepAIC(full_model, direction = "both")
## Start: AIC=3564.29
## Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad +
## P.Undergrad + Outstate + Room.Board + Books + Personal +
## PhD + Terminal + S.F.Ratio + perc.alumni + Expend + Elite
##
## Df Sum of Sq RSS AIC
## - S.F.Ratio 1 0.6 108484 3562.3
## - Books 1 7.6 108491 3562.3
## - F.Undergrad 1 40.0 108523 3562.5
## - Terminal 1 44.8 108528 3562.6
## - Enroll 1 140.7 108624 3563.2
## <none> 108483 3564.3
## - Accept 1 545.5 109029 3565.8
## - Elite 1 692.8 109176 3566.7
## - PhD 1 780.3 109263 3567.3
## - Private 1 782.0 109265 3567.3
## - Room.Board 1 973.8 109457 3568.5
## - Expend 1 1234.9 109718 3570.2
## - Personal 1 1295.3 109778 3570.6
## - Apps 1 2355.3 110838 3577.3
## - P.Undergrad 1 2481.9 110965 3578.1
## - Outstate 1 3209.4 111693 3582.7
## - perc.alumni 1 6313.9 114797 3601.9
##
## Step: AIC=3562.29
## Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad +
## P.Undergrad + Outstate + Room.Board + Books + Personal +
## PhD + Terminal + perc.alumni + Expend + Elite
##
## Df Sum of Sq RSS AIC
## - Books 1 7.7 108491 3560.3
## - F.Undergrad 1 39.6 108523 3560.5
## - Terminal 1 45.2 108529 3560.6
## - Enroll 1 140.4 108624 3561.2
## <none> 108484 3562.3
## - Accept 1 546.0 109030 3563.8
## + S.F.Ratio 1 0.6 108483 3564.3
## - Elite 1 692.7 109176 3564.7
## - PhD 1 786.9 109271 3565.4
## - Private 1 795.5 109279 3565.4
## - Room.Board 1 974.6 109458 3566.6
## - Personal 1 1303.0 109787 3568.6
## - Expend 1 1445.7 109929 3569.6
## - Apps 1 2359.1 110843 3575.4
## - P.Undergrad 1 2482.4 110966 3576.1
## - Outstate 1 3220.3 111704 3580.8
## - perc.alumni 1 6339.9 114824 3600.1
##
## Step: AIC=3560.34
## Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad +
## P.Undergrad + Outstate + Room.Board + Personal + PhD + Terminal +
## perc.alumni + Expend + Elite
##
## Df Sum of Sq RSS AIC
## - F.Undergrad 1 39.1 108530 3558.6
## - Terminal 1 41.7 108533 3558.6
## - Enroll 1 140.9 108632 3559.3
## <none> 108491 3560.3
## - Accept 1 546.9 109038 3561.9
## + Books 1 7.7 108484 3562.3
## + S.F.Ratio 1 0.7 108491 3562.3
## - Elite 1 696.8 109188 3562.8
## - PhD 1 779.3 109271 3563.4
## - Private 1 803.6 109295 3563.5
## - Room.Board 1 997.6 109489 3564.7
## - Personal 1 1300.5 109792 3566.7
## - Expend 1 1438.7 109930 3567.6
## - Apps 1 2363.5 110855 3573.4
## - P.Undergrad 1 2481.0 110972 3574.2
## - Outstate 1 3214.5 111706 3578.8
## - perc.alumni 1 6333.5 114825 3598.1
##
## Step: AIC=3558.59
## Grad.Rate ~ Private + Apps + Accept + Enroll + P.Undergrad +
## Outstate + Room.Board + Personal + PhD + Terminal + perc.alumni +
## Expend + Elite
##
## Df Sum of Sq RSS AIC
## - Terminal 1 47.9 108578 3556.9
## - Enroll 1 128.9 108659 3557.4
## <none> 108530 3558.6
## - Accept 1 532.8 109063 3560.0
## + F.Undergrad 1 39.1 108491 3560.3
## + Books 1 7.2 108523 3560.5
## + S.F.Ratio 1 0.2 108530 3560.6
## - Elite 1 715.7 109246 3561.2
## - PhD 1 780.7 109311 3561.6
## - Private 1 865.4 109396 3562.2
## - Room.Board 1 990.8 109521 3563.0
## - Personal 1 1340.4 109871 3565.2
## - Expend 1 1429.3 109960 3565.8
## - Apps 1 2329.1 110860 3571.5
## - P.Undergrad 1 2771.9 111302 3574.2
## - Outstate 1 3275.8 111806 3577.4
## - perc.alumni 1 6393.9 114924 3596.7
##
## Step: AIC=3556.9
## Grad.Rate ~ Private + Apps + Accept + Enroll + P.Undergrad +
## Outstate + Room.Board + Personal + PhD + perc.alumni + Expend +
## Elite
##
## Df Sum of Sq RSS AIC
## - Enroll 1 126.5 108705 3555.7
## <none> 108578 3556.9
## - Accept 1 541.5 109120 3558.4
## + Terminal 1 47.9 108530 3558.6
## + F.Undergrad 1 45.3 108533 3558.6
## + Books 1 3.6 108575 3558.9
## + S.F.Ratio 1 0.4 108578 3558.9
## - Elite 1 715.6 109294 3559.5
## - Private 1 915.7 109494 3560.8
## - Room.Board 1 956.0 109534 3561.0
## - PhD 1 1231.4 109810 3562.8
## - Personal 1 1335.1 109913 3563.5
## - Expend 1 1447.7 110026 3564.2
## - Apps 1 2357.8 110936 3569.9
## - P.Undergrad 1 2792.8 111371 3572.7
## - Outstate 1 3236.6 111815 3575.5
## - perc.alumni 1 6348.1 114926 3594.7
##
## Step: AIC=3555.72
## Grad.Rate ~ Private + Apps + Accept + P.Undergrad + Outstate +
## Room.Board + Personal + PhD + perc.alumni + Expend + Elite
##
## Df Sum of Sq RSS AIC
## <none> 108705 3555.7
## - Accept 1 432.9 109138 3556.5
## + Enroll 1 126.5 108578 3556.9
## + Terminal 1 45.5 108659 3557.4
## + F.Undergrad 1 23.2 108682 3557.6
## + Books 1 5.2 108700 3557.7
## + S.F.Ratio 1 1.1 108704 3557.7
## - Elite 1 800.6 109505 3558.9
## - Private 1 832.3 109537 3559.1
## - Room.Board 1 892.7 109598 3559.4
## - Personal 1 1260.0 109965 3561.8
## - PhD 1 1283.8 109989 3561.9
## - Expend 1 1374.9 110080 3562.5
## - Apps 1 2246.3 110951 3568.0
## - P.Undergrad 1 2666.4 111371 3570.7
## - Outstate 1 3118.2 111823 3573.5
## - perc.alumni 1 6728.9 115434 3595.8
summary(best_model)
##
## Call:
## lm(formula = Grad.Rate ~ Private + Apps + Accept + P.Undergrad +
## Outstate + Room.Board + Personal + PhD + perc.alumni + Expend +
## Elite, data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.842 -7.563 -0.507 6.971 52.885
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.1629560 3.2789880 10.724 < 2e-16 ***
## PrivateYes 3.8908272 1.6952648 2.295 0.022027 *
## Apps 0.0016005 0.0004245 3.771 0.000177 ***
## Accept -0.0010894 0.0006582 -1.655 0.098346 .
## P.Undergrad -0.0015205 0.0003701 -4.108 4.47e-05 ***
## Outstate 0.0010799 0.0002431 4.442 1.04e-05 ***
## Room.Board 0.0014346 0.0006036 2.377 0.017730 *
## Personal -0.0023305 0.0008253 -2.824 0.004881 **
## PhD 0.1122237 0.0393704 2.850 0.004496 **
## perc.alumni 0.3216323 0.0492854 6.526 1.31e-10 ***
## Expend -0.0004355 0.0001476 -2.950 0.003287 **
## EliteYes 4.5101993 2.0036855 2.251 0.024703 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.57 on 688 degrees of freedom
## Multiple R-squared: 0.4539, Adjusted R-squared: 0.4452
## F-statistic: 52 on 11 and 688 DF, p-value: < 2.2e-16
selected_formula <- formula(best_model)
final_model <- lm(selected_formula, data = mynewdata)
summary(final_model)
##
## Call:
## lm(formula = selected_formula, data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.842 -7.563 -0.507 6.971 52.885
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.1629560 3.2789880 10.724 < 2e-16 ***
## PrivateYes 3.8908272 1.6952648 2.295 0.022027 *
## Apps 0.0016005 0.0004245 3.771 0.000177 ***
## Accept -0.0010894 0.0006582 -1.655 0.098346 .
## P.Undergrad -0.0015205 0.0003701 -4.108 4.47e-05 ***
## Outstate 0.0010799 0.0002431 4.442 1.04e-05 ***
## Room.Board 0.0014346 0.0006036 2.377 0.017730 *
## Personal -0.0023305 0.0008253 -2.824 0.004881 **
## PhD 0.1122237 0.0393704 2.850 0.004496 **
## perc.alumni 0.3216323 0.0492854 6.526 1.31e-10 ***
## Expend -0.0004355 0.0001476 -2.950 0.003287 **
## EliteYes 4.5101993 2.0036855 2.251 0.024703 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.57 on 688 degrees of freedom
## Multiple R-squared: 0.4539, Adjusted R-squared: 0.4452
## F-statistic: 52 on 11 and 688 DF, p-value: < 2.2e-16
plot(final_model, which = 1, main = "Residuals vs Fitted")
plot(final_model, which = 2, main = "Normal Q-Q Plot")
plot(final_model, which = 3, main = "Scale-Location Plot")
plot(final_model, which = 5, main = "Residuals vs Leverage")
Justifying the Best Model Using Plots
Diagnostic Plots The key diagnostic plots evaluate the linear regression model are:
Residuals vs Fitted Plot This plot helps us assess the assumption of homoscedasticity, which means that the variance of the residuals should be constant across all levels of the fitted values. If the plot shows a random scatter of points around zero, this suggests that homoscedasticity is met. If we see patterns, such as a funnel shape or curvature, this may indicate problems like heteroscedasticity or non-linearity.
Normal Q-Q Plot This plot helps us assess whether the residuals are normally distributed. In linear regression, the residuals should follow a normal distribution for valid statistical inference. If the points closely follow the diagonal line, it suggests that the residuals are normally distributed. If the points deviate significantly from the line, it indicates a violation of the normality assumption.
Scale-Location Plot Also called the spread-location plot, this helps assess whether the residuals exhibit constant variance (homoscedasticity). Ideally, the plot should show a horizontal band of points. If there is a clear pattern (such as a funnel shape), it suggests the presence of heteroscedasticity.
Residuals vs Leverage Plot This plot helps us identify influential data points that might disproportionately affect the regression model. Points with high leverage are those with extreme values of the predictors, while points with large residuals have large deviations from the fitted values. Influential points that combine both high leverage and large residuals can significantly affect the model’s fit and should be carefully examined. If the plot shows such points, it may indicate that the model needs further refinement or that the data contains outliers.
Use mynewdata, discuss and perform any step(s) that you think that can improve the fitting in Task 2. You need to illustrate your work by using the R codes, output and discussion.
X <- model.matrix(Grad.Rate ~ ., data = mynewdata)[, -1]
y <- mynewdata$Grad.Rate
X_scaled <- scale(X)
X_poly <- cbind(X_scaled, X_scaled^2)
colnames(X_poly) <- c(colnames(X_scaled), paste0(colnames(X_scaled), "_sq"))
stepwise_model <- step(lm(Grad.Rate ~ ., data = mynewdata), direction = "both")
## Start: AIC=3564.29
## Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad +
## P.Undergrad + Outstate + Room.Board + Books + Personal +
## PhD + Terminal + S.F.Ratio + perc.alumni + Expend + Elite
##
## Df Sum of Sq RSS AIC
## - S.F.Ratio 1 0.6 108484 3562.3
## - Books 1 7.6 108491 3562.3
## - F.Undergrad 1 40.0 108523 3562.5
## - Terminal 1 44.8 108528 3562.6
## - Enroll 1 140.7 108624 3563.2
## <none> 108483 3564.3
## - Accept 1 545.5 109029 3565.8
## - Elite 1 692.8 109176 3566.7
## - PhD 1 780.3 109263 3567.3
## - Private 1 782.0 109265 3567.3
## - Room.Board 1 973.8 109457 3568.5
## - Expend 1 1234.9 109718 3570.2
## - Personal 1 1295.3 109778 3570.6
## - Apps 1 2355.3 110838 3577.3
## - P.Undergrad 1 2481.9 110965 3578.1
## - Outstate 1 3209.4 111693 3582.7
## - perc.alumni 1 6313.9 114797 3601.9
##
## Step: AIC=3562.29
## Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad +
## P.Undergrad + Outstate + Room.Board + Books + Personal +
## PhD + Terminal + perc.alumni + Expend + Elite
##
## Df Sum of Sq RSS AIC
## - Books 1 7.7 108491 3560.3
## - F.Undergrad 1 39.6 108523 3560.5
## - Terminal 1 45.2 108529 3560.6
## - Enroll 1 140.4 108624 3561.2
## <none> 108484 3562.3
## - Accept 1 546.0 109030 3563.8
## + S.F.Ratio 1 0.6 108483 3564.3
## - Elite 1 692.7 109176 3564.7
## - PhD 1 786.9 109271 3565.4
## - Private 1 795.5 109279 3565.4
## - Room.Board 1 974.6 109458 3566.6
## - Personal 1 1303.0 109787 3568.6
## - Expend 1 1445.7 109929 3569.6
## - Apps 1 2359.1 110843 3575.4
## - P.Undergrad 1 2482.4 110966 3576.1
## - Outstate 1 3220.3 111704 3580.8
## - perc.alumni 1 6339.9 114824 3600.1
##
## Step: AIC=3560.34
## Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad +
## P.Undergrad + Outstate + Room.Board + Personal + PhD + Terminal +
## perc.alumni + Expend + Elite
##
## Df Sum of Sq RSS AIC
## - F.Undergrad 1 39.1 108530 3558.6
## - Terminal 1 41.7 108533 3558.6
## - Enroll 1 140.9 108632 3559.3
## <none> 108491 3560.3
## - Accept 1 546.9 109038 3561.9
## + Books 1 7.7 108484 3562.3
## + S.F.Ratio 1 0.7 108491 3562.3
## - Elite 1 696.8 109188 3562.8
## - PhD 1 779.3 109271 3563.4
## - Private 1 803.6 109295 3563.5
## - Room.Board 1 997.6 109489 3564.7
## - Personal 1 1300.5 109792 3566.7
## - Expend 1 1438.7 109930 3567.6
## - Apps 1 2363.5 110855 3573.4
## - P.Undergrad 1 2481.0 110972 3574.2
## - Outstate 1 3214.5 111706 3578.8
## - perc.alumni 1 6333.5 114825 3598.1
##
## Step: AIC=3558.59
## Grad.Rate ~ Private + Apps + Accept + Enroll + P.Undergrad +
## Outstate + Room.Board + Personal + PhD + Terminal + perc.alumni +
## Expend + Elite
##
## Df Sum of Sq RSS AIC
## - Terminal 1 47.9 108578 3556.9
## - Enroll 1 128.9 108659 3557.4
## <none> 108530 3558.6
## - Accept 1 532.8 109063 3560.0
## + F.Undergrad 1 39.1 108491 3560.3
## + Books 1 7.2 108523 3560.5
## + S.F.Ratio 1 0.2 108530 3560.6
## - Elite 1 715.7 109246 3561.2
## - PhD 1 780.7 109311 3561.6
## - Private 1 865.4 109396 3562.2
## - Room.Board 1 990.8 109521 3563.0
## - Personal 1 1340.4 109871 3565.2
## - Expend 1 1429.3 109960 3565.8
## - Apps 1 2329.1 110860 3571.5
## - P.Undergrad 1 2771.9 111302 3574.2
## - Outstate 1 3275.8 111806 3577.4
## - perc.alumni 1 6393.9 114924 3596.7
##
## Step: AIC=3556.9
## Grad.Rate ~ Private + Apps + Accept + Enroll + P.Undergrad +
## Outstate + Room.Board + Personal + PhD + perc.alumni + Expend +
## Elite
##
## Df Sum of Sq RSS AIC
## - Enroll 1 126.5 108705 3555.7
## <none> 108578 3556.9
## - Accept 1 541.5 109120 3558.4
## + Terminal 1 47.9 108530 3558.6
## + F.Undergrad 1 45.3 108533 3558.6
## + Books 1 3.6 108575 3558.9
## + S.F.Ratio 1 0.4 108578 3558.9
## - Elite 1 715.6 109294 3559.5
## - Private 1 915.7 109494 3560.8
## - Room.Board 1 956.0 109534 3561.0
## - PhD 1 1231.4 109810 3562.8
## - Personal 1 1335.1 109913 3563.5
## - Expend 1 1447.7 110026 3564.2
## - Apps 1 2357.8 110936 3569.9
## - P.Undergrad 1 2792.8 111371 3572.7
## - Outstate 1 3236.6 111815 3575.5
## - perc.alumni 1 6348.1 114926 3594.7
##
## Step: AIC=3555.72
## Grad.Rate ~ Private + Apps + Accept + P.Undergrad + Outstate +
## Room.Board + Personal + PhD + perc.alumni + Expend + Elite
##
## Df Sum of Sq RSS AIC
## <none> 108705 3555.7
## - Accept 1 432.9 109138 3556.5
## + Enroll 1 126.5 108578 3556.9
## + Terminal 1 45.5 108659 3557.4
## + F.Undergrad 1 23.2 108682 3557.6
## + Books 1 5.2 108700 3557.7
## + S.F.Ratio 1 1.1 108704 3557.7
## - Elite 1 800.6 109505 3558.9
## - Private 1 832.3 109537 3559.1
## - Room.Board 1 892.7 109598 3559.4
## - Personal 1 1260.0 109965 3561.8
## - PhD 1 1283.8 109989 3561.9
## - Expend 1 1374.9 110080 3562.5
## - Apps 1 2246.3 110951 3568.0
## - P.Undergrad 1 2666.4 111371 3570.7
## - Outstate 1 3118.2 111823 3573.5
## - perc.alumni 1 6728.9 115434 3595.8
summary(stepwise_model)
##
## Call:
## lm(formula = Grad.Rate ~ Private + Apps + Accept + P.Undergrad +
## Outstate + Room.Board + Personal + PhD + perc.alumni + Expend +
## Elite, data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.842 -7.563 -0.507 6.971 52.885
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.1629560 3.2789880 10.724 < 2e-16 ***
## PrivateYes 3.8908272 1.6952648 2.295 0.022027 *
## Apps 0.0016005 0.0004245 3.771 0.000177 ***
## Accept -0.0010894 0.0006582 -1.655 0.098346 .
## P.Undergrad -0.0015205 0.0003701 -4.108 4.47e-05 ***
## Outstate 0.0010799 0.0002431 4.442 1.04e-05 ***
## Room.Board 0.0014346 0.0006036 2.377 0.017730 *
## Personal -0.0023305 0.0008253 -2.824 0.004881 **
## PhD 0.1122237 0.0393704 2.850 0.004496 **
## perc.alumni 0.3216323 0.0492854 6.526 1.31e-10 ***
## Expend -0.0004355 0.0001476 -2.950 0.003287 **
## EliteYes 4.5101993 2.0036855 2.251 0.024703 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.57 on 688 degrees of freedom
## Multiple R-squared: 0.4539, Adjusted R-squared: 0.4452
## F-statistic: 52 on 11 and 688 DF, p-value: < 2.2e-16
adj_r_squared_stepwise <- summary(stepwise_model)$adj.r.squared
cat("Adjusted R-squared for Stepwise Model: ", adj_r_squared_stepwise, "\n")
## Adjusted R-squared for Stepwise Model: 0.445214
print(coef(final_model))
## (Intercept) PrivateYes Apps Accept P.Undergrad
## 35.1629559728 3.8908272070 0.0016004959 -0.0010893644 -0.0015204627
## Outstate Room.Board Personal PhD perc.alumni
## 0.0010799294 0.0014346068 -0.0023304929 0.1122236857 0.3216322812
## Expend EliteYes
## -0.0004354983 4.5101993356
par(mfrow = c(2, 2))
plot(final_model)
The stepwise regression model with an Adjusted R-squared of 0.4452
provides a good fit for predicting the Grad.Rate variable. This means
that the model explains approximately 44.5% of the variance in
graduation rates, which is a reasonable level of explanatory power for
this type of educational dataset.
Model Summary:
Residual Standard Error: 12.57 suggests that the model’s predictions have a typical error of around 12.57 graduation rate points. F-statistic (52): This indicates that the model as a whole is statistically significant with a p-value < 2.2e-16, meaning the included predictors contribute meaningfully to explaining graduation rates.
Key Predictors:
The final stepwise model includes several key predictors of Grad.Rate:
Private (PrivateYes): Private institutions have a positive and significant impact on graduation rates. Apps: More applications correlate with higher graduation rates. P.Undergrad: A higher proportion of part-time undergraduates is negatively associated with graduation rates. Outstate: Out-of-state tuition costs are positively related to graduation rates. Room.Board: Higher room and board costs are associated with better graduation rates. Personal: Higher personal expenses negatively impact graduation rates. PhD: More faculty with PhDs is positively associated with graduation rates. perc.alumni: A higher percentage of alumni donations strongly boosts graduation rates. Expend: More institutional spending negatively affects graduation rates. Elite (EliteYes): Elite institutions tend to have higher graduation rates.
Model Comparisons:
Ridge Regression: The best lambda value for Ridge was 0.9709, with an Adjusted R-squared of 0.4404. This model performed slightly worse than the stepwise model.
Lasso Regression: The best lambda value for Lasso was 0.5428, but it had a lower Adjusted R-squared of 0.4159, indicating it did not capture the relationships as effectively as the stepwise model.
Multicollinearity:
The Variance Inflation Factor (VIF) values for the stepwise model range from 1.27 to 12.55. Most values are acceptable, but the higher VIFs for Personal and Expend (around 12.5) suggest some multicollinearity, meaning these predictors may be correlated with others and could potentially inflate standard errors. This could be a limitation of the model.
Conclusion:
The stepwise regression model is most appropriate, given its relatively high Adjusted R-squared and performance over the Ridge and Lasso methods. This model explains a significant part of the variance in graduation rates, but some predictors are still suffering from multicollinearity issues. The model’s insights will help institutions understand which factors most influence improving graduation rates.