+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++— title: ‘MA717: Applied Regression and Experimental Data Analysis’ author: “Assignment template” date: ” ” output: pdf_document: default word_document: default html_document: df_print: paged —

options(repos = c(CRAN = "https://cran.r-project.org"))

1.1. Read “College.csv” file into R with following command and use dim() and head() to check if you read the data correct. You should report the number of observations and the number of variables. (5 \(\%\))

file_path <- file.path("M:/Assignment-20241208/College.csv")
mydata <- read.csv("M:/Assignment-20241208/College.csv", header = TRUE, stringsAsFactors = TRUE)
dim(mydata)
## [1] 775  17
head(mydata)
##   Private Apps Accept Enroll F.Undergrad P.Undergrad Outstate Room.Board Books
## 1     Yes 1660   1232    721        2885         537     7440       3300   450
## 2     Yes 2186   1924    512        2683        1227    12280       6450   750
## 3     Yes 1428   1097    336        1036          99    11250       3750   400
## 4     Yes  417    349    137         510          63    12960       5450   450
## 5     Yes  193    146     55         249         869     7560       4120   800
## 6     Yes  587    479    158         678          41    13500       3335   500
##   Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate Elite
## 1     2200  70       78      18.1          12   7041        60    No
## 2     1500  29       30      12.2          16  10527        56    No
## 3     1165  53       66      12.9          30   8735        54    No
## 4      875  92       97       7.7          37  19016        59   Yes
## 5     1500  76       72      11.9           2  10922        15    No
## 6      675  67       73       9.4          11   9727        55    No

The number of observations : 775 The number of variables : 17

1.2. Use your registration number as random seed, generate a random subset of College data with sample size 700, name this new data as mynewdata. Use summary() to output the summarized information about mynewdata. Please report the number of private and public university and the number of Elite university and non-Elite university in this new data. (12 \(\%\))

set.seed(2401616)
mynewdata <- 
  mydata[sample(1:nrow(mydata),700), ]
summary(mydata)
##  Private        Apps           Accept            Enroll        F.Undergrad   
##  No :211   Min.   :   81   Min.   :   72.0   Min.   :  35.0   Min.   :  139  
##  Yes:564   1st Qu.:  778   1st Qu.:  605.5   1st Qu.: 242.0   1st Qu.:  990  
##            Median : 1558   Median : 1110.0   Median : 434.0   Median : 1708  
##            Mean   : 3004   Mean   : 2019.0   Mean   : 781.0   Mean   : 3707  
##            3rd Qu.: 3610   3rd Qu.: 2413.0   3rd Qu.: 902.5   3rd Qu.: 4056  
##            Max.   :48094   Max.   :26330.0   Max.   :6392.0   Max.   :31643  
##   P.Undergrad         Outstate       Room.Board       Books       
##  Min.   :    1.0   Min.   : 2340   Min.   :1780   Min.   :  96.0  
##  1st Qu.:   95.0   1st Qu.: 7332   1st Qu.:3598   1st Qu.: 469.0  
##  Median :  355.0   Median : 9990   Median :4200   Median : 500.0  
##  Mean   :  857.3   Mean   :10449   Mean   :4358   Mean   : 549.3  
##  3rd Qu.:  967.5   3rd Qu.:12938   3rd Qu.:5050   3rd Qu.: 600.0  
##  Max.   :21836.0   Max.   :21700   Max.   :8124   Max.   :2340.0  
##     Personal         PhD            Terminal        S.F.Ratio    
##  Min.   : 250   Min.   :  8.00   Min.   : 24.00   Min.   : 2.50  
##  1st Qu.: 870   1st Qu.: 62.00   1st Qu.: 71.00   1st Qu.:11.50  
##  Median :1200   Median : 75.00   Median : 82.00   Median :13.60  
##  Mean   :1343   Mean   : 72.69   Mean   : 79.73   Mean   :14.09  
##  3rd Qu.:1700   3rd Qu.: 85.00   3rd Qu.: 92.00   3rd Qu.:16.50  
##  Max.   :6800   Max.   :100.00   Max.   :100.00   Max.   :39.80  
##   perc.alumni        Expend        Grad.Rate      Elite    
##  Min.   : 0.00   Min.   : 3186   Min.   : 10.00   No :697  
##  1st Qu.:13.00   1st Qu.: 6754   1st Qu.: 53.00   Yes: 78  
##  Median :21.00   Median : 8408   Median : 65.00            
##  Mean   :22.76   Mean   : 9667   Mean   : 65.42            
##  3rd Qu.:31.00   3rd Qu.:10847   3rd Qu.: 78.00            
##  Max.   :64.00   Max.   :56233   Max.   :100.00
table(mynewdata$Private)
## 
##  No Yes 
## 190 510
table(mynewdata$Elite)
## 
##  No Yes 
## 629  71

In the new data, there are 564 private universities and 211 public universities. Additionally, there are 78 elite universities and 697 non-elite universities.

1.3. Use mynewdata, plot histogram plots of four variables “Outstate”, “Room.Board”, “Books” and “Personal”. Give each plot a suitable title and label for x axis and y axis. (8\(\%\))

install.packages("ggplot2")
## Installing package into 'C:/Users/kp24011/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'ggplot2' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\kp24011\AppData\Local\Temp\Rtmpigz83d\downloaded_packages
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
hist_vars <- c("Outstate", "Room.Board", "Books", "Personal")
titles <- c("Histogram of Outstate Tuition", 
            "Histogram of Room and Board Costs", 
            "Histogram of Book Costs", 
            "Histogram of Personal Expenses")
x_labels <- c("Outstate Tuition", 
              "Room and Board Costs", 
              "Book Costs", 
              "Personal Expenses")

for (i in seq_along(hist_vars)) {
  plot <- ggplot(mynewdata, aes(x = .data[[hist_vars[i]]])) +
    geom_histogram(binwidth = 900, fill = "skyblue", color = "red") +
    labs(
      title = titles[i], 
      x = x_labels[i], 
      y = "Frequency"
    ) +
    theme_minimal()
  print(plot)
}

2.1. Use mynewdata, do a linear regression fitting when outcome is “Grad.Rate” and predictors are “Private” and “Elite”. Show the R output and report what you have learned from this output (you need to discuss significance, adjusted R-squared and p-value of F-statistics). (6\(\%\)).

model1 <- lm(Grad.Rate ~ Private + Elite, data = mynewdata)
summary(model1)
## 
## Call:
## lm(formula = Grad.Rate ~ Private + Elite, data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.843  -9.870   0.157  10.157  44.886 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   55.114      1.094  50.356   <2e-16 ***
## PrivateYes    11.729      1.276   9.190   <2e-16 ***
## EliteYes      18.108      1.880   9.632   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.98 on 697 degrees of freedom
## Multiple R-squared:  0.2141, Adjusted R-squared:  0.2119 
## F-statistic: 94.95 on 2 and 697 DF,  p-value: < 2.2e-16
  1. Significance of the Coefficients: Intercept (55.114): The intercept is statistically significant with a very low p-value (<2e-16), which indicates that the baseline value of the dependent variable (Grad.Rate) is significantly different from zero when all predictors (Private and Elite) are at their reference levels (Private = No, Elite = No). PrivateYes (11.729): This coefficient represents the change in the graduation rate for institutions where “Private = Yes,” compared to those where “Private = No.” The p-value (<2e-16) suggests that this effect is statistically significant, meaning that private institutions, on average, have a graduation rate that is 11.729 percentage points higher than public institutions. EliteYes (18.108): The coefficient for “Elite = Yes” suggests that elite institutions have, on average, a graduation rate 18.108 percentage points higher than non-elite institutions, with a very low p-value (<2e-16), indicating statistical significance. All coefficients (Intercept, PrivateYes, and EliteYes) have p-values much less than 0.05, indicating they are highly statistically significant.

  2. Adjusted R-squared: The Adjusted R-squared value is 0.2119, which suggests that the model explains approximately 21.19% of the variability in the graduation rate. While this is a modest proportion of variance explained, it suggests that other factors not included in the model may be influencing graduation rates. The Multiple R-squared value is 0.2141, which is slightly higher than the Adjusted R-squared because it doesn’t account for the number of predictors. Typically, we focus on Adjusted R-squared for model comparison, as it penalizes the addition of irrelevant variables. While 21.19% of explained variance might not seem very high, in social sciences and educational research, this could still be considered meaningful, especially if there are many unmeasured factors influencing graduation rates.

  3. F-statistic and p-value: The F-statistic is 94.95, and the associated p-value is < 2.2e-16, which is extremely small and highly significant. This result indicates that at least one of the predictors (Private or Elite) significantly contributes to explaining the variation in graduation rates. The F-statistic tests the overall significance of the regression model, and a p-value this small strongly suggests that the model is valid and not due to random chance.

2.2. Use the linear regression fitting result in 2.1, calculate the confidence intervals for the coefficients. Also give the prediction interval of “Grad.Rate” for a new data with Private=“Yes” and Elite=“No”. (4\(\%\))

confint(lm(Grad.Rate ~ Private + Elite, data = mynewdata))
##                 2.5 %   97.5 %
## (Intercept) 52.964799 57.26257
## PrivateYes   9.223286 14.23465
## EliteYes    14.416813 21.79863
new_data <- data.frame(Private = "Yes", Elite = "No")


predict(lm(Grad.Rate ~ Private + Elite, data = mynewdata), newdata = new_data, interval = "prediction")
##        fit      lwr      upr
## 1 66.84265 37.39577 96.28953

2.3 Use mynewdata, do a multiple linear regression fitting when outcome is “Grad.Rate”, all other variables as predictors. Show the R output and report what you have learned from this output (you need to discuss significance, adjusted R-squared and p-value of F-statistics). Is linear regression model in 2.3 better than linear regression in 2.1? Use ANOVA to justify your conclusion. (14%)

model_all <- lm(Grad.Rate ~ ., data = mynewdata)


summary(model_all)
## 
## Call:
## lm(formula = Grad.Rate ~ ., data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.646  -7.287  -0.441   7.155  52.328 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.0837588  4.9053047   7.152 2.20e-12 ***
## PrivateYes   3.9254138  1.7691302   2.219 0.026826 *  
## Apps         0.0016663  0.0004327   3.851 0.000129 ***
## Accept      -0.0015698  0.0008470  -1.853 0.064274 .  
## Enroll       0.0021744  0.0023102   0.941 0.346933    
## F.Undergrad -0.0002027  0.0004041  -0.502 0.616105    
## P.Undergrad -0.0015407  0.0003898  -3.953 8.52e-05 ***
## Outstate     0.0011163  0.0002483   4.495 8.17e-06 ***
## Room.Board   0.0015220  0.0006147   2.476 0.013524 *  
## Books        0.0006745  0.0030868   0.219 0.827084    
## Personal    -0.0024121  0.0008447  -2.856 0.004425 ** 
## PhD          0.1369511  0.0617875   2.216 0.026987 *  
## Terminal    -0.0353895  0.0666052  -0.531 0.595361    
## S.F.Ratio    0.0097674  0.1625001   0.060 0.952088    
## perc.alumni  0.3169363  0.0502684   6.305 5.18e-10 ***
## Expend      -0.0004464  0.0001601  -2.788 0.005446 ** 
## EliteYes     4.2327018  2.0267038   2.088 0.037126 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.6 on 683 degrees of freedom
## Multiple R-squared:  0.4551, Adjusted R-squared:  0.4423 
## F-statistic: 35.65 on 16 and 683 DF,  p-value: < 2.2e-16
model_2_1 <- lm(Grad.Rate ~ Private + Elite, data = mynewdata)
anova(model_2_1, model_all)
## Analysis of Variance Table
## 
## Model 1: Grad.Rate ~ Private + Elite
## Model 2: Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad + 
##     P.Undergrad + Outstate + Room.Board + Books + Personal + 
##     PhD + Terminal + S.F.Ratio + perc.alumni + Expend + Elite
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1    697 156447                                 
## 2    683 108483 14     47964 21.57 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Significance of Predictors: Significant predictors (p < 0.05): PrivateYes, Apps, P.Undergrad, Outstate, Room.Board, Personal, PhD, perc.alumni, Expend, and EliteYes. Non-significant predictors (p > 0.05): Enroll, F.Undergrad, Books, Terminal, S.F.Ratio. Interpretation: Many predictors significantly influence Grad.Rate, while some (e.g., Enroll, S.F.Ratio) do not.
  2. Adjusted R-squared: Multiple R-squared: 0.4551, meaning the model explains 45.5% of the variance in Grad.Rate. Adjusted R-squared: 0.4423, suggesting a moderate fit, accounting for the number of predictors in the model.
  3. F-statistic and p-value: F-statistic: 35.65, indicating the model is overall significant. p-value: < 2.2e-16, which is extremely small, confirming that at least one predictor is significantly related to Grad.Rate.
  4. ANOVA Comparison with Model 1 (Private + Elite): F-statistic for the ANOVA comparison: 21.57. p-value: < 2.2e-16, showing that the full model (2.3) is significantly better than the simpler model (2.1).

To compare the models, we use an ANOVA test:

Model 1 (2.1): Includes Private and Elite. Model 2 (2.3): Includes all predictors. ANOVA Results: Residual Sum of Squares (RSS): Model 2 (full model) has a lower RSS (108,483 vs. 156,447), indicating a better fit. F-statistic: 21.57, suggesting a significant improvement in Model 2. p-value: < 2.2e-16, indicating that the improvement from Model 1 to Model 2 is statistically significant. Conclusion: The full model (2.3) is significantly better than the simpler model (2.1), as it explains more variance in Grad.Rate and provides a better fit, confirmed by the ANOVA test.

2.4. Use the diagnostic plots to look at the fitting of multiple linear regression in 2.3. Please comment what you have seen from those plots. (7%)

par(mfrow = c(2, 2))
plot(model_all)

When examining diagnostic plots for the multiple linear regression model in 2.3, we typically use the following plots:

Residuals vs. Fitted Plot:

This plot shows whether there are any patterns in the residuals (errors). Ideally, residuals should be randomly scattered around zero, without any clear patterns. Observation: If we see a random scatter of points, it suggests that the linearity assumption holds, meaning the relationship between predictors and the outcome is linear.

Normal Q-Q Plot:

This plot checks if the residuals are normally distributed. Points should fall along the diagonal line if the residuals are normally distributed. Observation: If the points mostly follow the line, it indicates that the residuals are approximately normal, which supports the normality assumption.

Scale-Location Plot :

This plot helps check for homoscedasticity (constant variance of residuals). Ideally, the points should be evenly spread without any distinct pattern. Observation: A random spread of points suggests constant variance, but any funnel shape indicates heteroscedasticity (non-constant variance).

Residuals vs. Leverage Plot:

This plot helps identify influential data points that might disproportionately affect the model’s coefficients. Points outside the Cook’s distance lines are influential. Observation: If there are no extreme points far from the rest, the model doesn’t seem to have influential outliers that could bias the results.

2.5. Use mynewdata, do a variable selection to choose the best model. You should use plots to justify how do you choose your best model. Use the selected predictors of your best model with outcome “Grad.Rate”, do a linear regression fitting and plot the diagnostic plots for this fitting. You can use either exhaustive, or forward, or backward selection method. (14%)

library(MASS)
## Warning: package 'MASS' was built under R version 4.4.2
library(ggplot2)

full_model <- lm(Grad.Rate ~ ., data = mynewdata)

best_model <- stepAIC(full_model, direction = "both")
## Start:  AIC=3564.29
## Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad + 
##     P.Undergrad + Outstate + Room.Board + Books + Personal + 
##     PhD + Terminal + S.F.Ratio + perc.alumni + Expend + Elite
## 
##               Df Sum of Sq    RSS    AIC
## - S.F.Ratio    1       0.6 108484 3562.3
## - Books        1       7.6 108491 3562.3
## - F.Undergrad  1      40.0 108523 3562.5
## - Terminal     1      44.8 108528 3562.6
## - Enroll       1     140.7 108624 3563.2
## <none>                     108483 3564.3
## - Accept       1     545.5 109029 3565.8
## - Elite        1     692.8 109176 3566.7
## - PhD          1     780.3 109263 3567.3
## - Private      1     782.0 109265 3567.3
## - Room.Board   1     973.8 109457 3568.5
## - Expend       1    1234.9 109718 3570.2
## - Personal     1    1295.3 109778 3570.6
## - Apps         1    2355.3 110838 3577.3
## - P.Undergrad  1    2481.9 110965 3578.1
## - Outstate     1    3209.4 111693 3582.7
## - perc.alumni  1    6313.9 114797 3601.9
## 
## Step:  AIC=3562.29
## Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad + 
##     P.Undergrad + Outstate + Room.Board + Books + Personal + 
##     PhD + Terminal + perc.alumni + Expend + Elite
## 
##               Df Sum of Sq    RSS    AIC
## - Books        1       7.7 108491 3560.3
## - F.Undergrad  1      39.6 108523 3560.5
## - Terminal     1      45.2 108529 3560.6
## - Enroll       1     140.4 108624 3561.2
## <none>                     108484 3562.3
## - Accept       1     546.0 109030 3563.8
## + S.F.Ratio    1       0.6 108483 3564.3
## - Elite        1     692.7 109176 3564.7
## - PhD          1     786.9 109271 3565.4
## - Private      1     795.5 109279 3565.4
## - Room.Board   1     974.6 109458 3566.6
## - Personal     1    1303.0 109787 3568.6
## - Expend       1    1445.7 109929 3569.6
## - Apps         1    2359.1 110843 3575.4
## - P.Undergrad  1    2482.4 110966 3576.1
## - Outstate     1    3220.3 111704 3580.8
## - perc.alumni  1    6339.9 114824 3600.1
## 
## Step:  AIC=3560.34
## Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad + 
##     P.Undergrad + Outstate + Room.Board + Personal + PhD + Terminal + 
##     perc.alumni + Expend + Elite
## 
##               Df Sum of Sq    RSS    AIC
## - F.Undergrad  1      39.1 108530 3558.6
## - Terminal     1      41.7 108533 3558.6
## - Enroll       1     140.9 108632 3559.3
## <none>                     108491 3560.3
## - Accept       1     546.9 109038 3561.9
## + Books        1       7.7 108484 3562.3
## + S.F.Ratio    1       0.7 108491 3562.3
## - Elite        1     696.8 109188 3562.8
## - PhD          1     779.3 109271 3563.4
## - Private      1     803.6 109295 3563.5
## - Room.Board   1     997.6 109489 3564.7
## - Personal     1    1300.5 109792 3566.7
## - Expend       1    1438.7 109930 3567.6
## - Apps         1    2363.5 110855 3573.4
## - P.Undergrad  1    2481.0 110972 3574.2
## - Outstate     1    3214.5 111706 3578.8
## - perc.alumni  1    6333.5 114825 3598.1
## 
## Step:  AIC=3558.59
## Grad.Rate ~ Private + Apps + Accept + Enroll + P.Undergrad + 
##     Outstate + Room.Board + Personal + PhD + Terminal + perc.alumni + 
##     Expend + Elite
## 
##               Df Sum of Sq    RSS    AIC
## - Terminal     1      47.9 108578 3556.9
## - Enroll       1     128.9 108659 3557.4
## <none>                     108530 3558.6
## - Accept       1     532.8 109063 3560.0
## + F.Undergrad  1      39.1 108491 3560.3
## + Books        1       7.2 108523 3560.5
## + S.F.Ratio    1       0.2 108530 3560.6
## - Elite        1     715.7 109246 3561.2
## - PhD          1     780.7 109311 3561.6
## - Private      1     865.4 109396 3562.2
## - Room.Board   1     990.8 109521 3563.0
## - Personal     1    1340.4 109871 3565.2
## - Expend       1    1429.3 109960 3565.8
## - Apps         1    2329.1 110860 3571.5
## - P.Undergrad  1    2771.9 111302 3574.2
## - Outstate     1    3275.8 111806 3577.4
## - perc.alumni  1    6393.9 114924 3596.7
## 
## Step:  AIC=3556.9
## Grad.Rate ~ Private + Apps + Accept + Enroll + P.Undergrad + 
##     Outstate + Room.Board + Personal + PhD + perc.alumni + Expend + 
##     Elite
## 
##               Df Sum of Sq    RSS    AIC
## - Enroll       1     126.5 108705 3555.7
## <none>                     108578 3556.9
## - Accept       1     541.5 109120 3558.4
## + Terminal     1      47.9 108530 3558.6
## + F.Undergrad  1      45.3 108533 3558.6
## + Books        1       3.6 108575 3558.9
## + S.F.Ratio    1       0.4 108578 3558.9
## - Elite        1     715.6 109294 3559.5
## - Private      1     915.7 109494 3560.8
## - Room.Board   1     956.0 109534 3561.0
## - PhD          1    1231.4 109810 3562.8
## - Personal     1    1335.1 109913 3563.5
## - Expend       1    1447.7 110026 3564.2
## - Apps         1    2357.8 110936 3569.9
## - P.Undergrad  1    2792.8 111371 3572.7
## - Outstate     1    3236.6 111815 3575.5
## - perc.alumni  1    6348.1 114926 3594.7
## 
## Step:  AIC=3555.72
## Grad.Rate ~ Private + Apps + Accept + P.Undergrad + Outstate + 
##     Room.Board + Personal + PhD + perc.alumni + Expend + Elite
## 
##               Df Sum of Sq    RSS    AIC
## <none>                     108705 3555.7
## - Accept       1     432.9 109138 3556.5
## + Enroll       1     126.5 108578 3556.9
## + Terminal     1      45.5 108659 3557.4
## + F.Undergrad  1      23.2 108682 3557.6
## + Books        1       5.2 108700 3557.7
## + S.F.Ratio    1       1.1 108704 3557.7
## - Elite        1     800.6 109505 3558.9
## - Private      1     832.3 109537 3559.1
## - Room.Board   1     892.7 109598 3559.4
## - Personal     1    1260.0 109965 3561.8
## - PhD          1    1283.8 109989 3561.9
## - Expend       1    1374.9 110080 3562.5
## - Apps         1    2246.3 110951 3568.0
## - P.Undergrad  1    2666.4 111371 3570.7
## - Outstate     1    3118.2 111823 3573.5
## - perc.alumni  1    6728.9 115434 3595.8
summary(best_model)
## 
## Call:
## lm(formula = Grad.Rate ~ Private + Apps + Accept + P.Undergrad + 
##     Outstate + Room.Board + Personal + PhD + perc.alumni + Expend + 
##     Elite, data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.842  -7.563  -0.507   6.971  52.885 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.1629560  3.2789880  10.724  < 2e-16 ***
## PrivateYes   3.8908272  1.6952648   2.295 0.022027 *  
## Apps         0.0016005  0.0004245   3.771 0.000177 ***
## Accept      -0.0010894  0.0006582  -1.655 0.098346 .  
## P.Undergrad -0.0015205  0.0003701  -4.108 4.47e-05 ***
## Outstate     0.0010799  0.0002431   4.442 1.04e-05 ***
## Room.Board   0.0014346  0.0006036   2.377 0.017730 *  
## Personal    -0.0023305  0.0008253  -2.824 0.004881 ** 
## PhD          0.1122237  0.0393704   2.850 0.004496 ** 
## perc.alumni  0.3216323  0.0492854   6.526 1.31e-10 ***
## Expend      -0.0004355  0.0001476  -2.950 0.003287 ** 
## EliteYes     4.5101993  2.0036855   2.251 0.024703 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.57 on 688 degrees of freedom
## Multiple R-squared:  0.4539, Adjusted R-squared:  0.4452 
## F-statistic:    52 on 11 and 688 DF,  p-value: < 2.2e-16
selected_formula <- formula(best_model)
final_model <- lm(selected_formula, data = mynewdata)

summary(final_model)
## 
## Call:
## lm(formula = selected_formula, data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.842  -7.563  -0.507   6.971  52.885 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.1629560  3.2789880  10.724  < 2e-16 ***
## PrivateYes   3.8908272  1.6952648   2.295 0.022027 *  
## Apps         0.0016005  0.0004245   3.771 0.000177 ***
## Accept      -0.0010894  0.0006582  -1.655 0.098346 .  
## P.Undergrad -0.0015205  0.0003701  -4.108 4.47e-05 ***
## Outstate     0.0010799  0.0002431   4.442 1.04e-05 ***
## Room.Board   0.0014346  0.0006036   2.377 0.017730 *  
## Personal    -0.0023305  0.0008253  -2.824 0.004881 ** 
## PhD          0.1122237  0.0393704   2.850 0.004496 ** 
## perc.alumni  0.3216323  0.0492854   6.526 1.31e-10 ***
## Expend      -0.0004355  0.0001476  -2.950 0.003287 ** 
## EliteYes     4.5101993  2.0036855   2.251 0.024703 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.57 on 688 degrees of freedom
## Multiple R-squared:  0.4539, Adjusted R-squared:  0.4452 
## F-statistic:    52 on 11 and 688 DF,  p-value: < 2.2e-16
plot(final_model, which = 1, main = "Residuals vs Fitted")

plot(final_model, which = 2, main = "Normal Q-Q Plot")

plot(final_model, which = 3, main = "Scale-Location Plot")

plot(final_model, which = 5, main = "Residuals vs Leverage")

Justifying the Best Model Using Plots

Diagnostic Plots The key diagnostic plots evaluate the linear regression model are:

  1. Residuals vs Fitted Plot This plot helps us assess the assumption of homoscedasticity, which means that the variance of the residuals should be constant across all levels of the fitted values. If the plot shows a random scatter of points around zero, this suggests that homoscedasticity is met. If we see patterns, such as a funnel shape or curvature, this may indicate problems like heteroscedasticity or non-linearity.

  2. Normal Q-Q Plot This plot helps us assess whether the residuals are normally distributed. In linear regression, the residuals should follow a normal distribution for valid statistical inference. If the points closely follow the diagonal line, it suggests that the residuals are normally distributed. If the points deviate significantly from the line, it indicates a violation of the normality assumption.

  3. Scale-Location Plot Also called the spread-location plot, this helps assess whether the residuals exhibit constant variance (homoscedasticity). Ideally, the plot should show a horizontal band of points. If there is a clear pattern (such as a funnel shape), it suggests the presence of heteroscedasticity.

  4. Residuals vs Leverage Plot This plot helps us identify influential data points that might disproportionately affect the regression model. Points with high leverage are those with extreme values of the predictors, while points with large residuals have large deviations from the fitted values. Influential points that combine both high leverage and large residuals can significantly affect the model’s fit and should be carefully examined. If the plot shows such points, it may indicate that the model needs further refinement or that the data contains outliers.

Use mynewdata, discuss and perform any step(s) that you think that can improve the fitting in Task 2. You need to illustrate your work by using the R codes, output and discussion.

X <- model.matrix(Grad.Rate ~ ., data = mynewdata)[, -1]
y <- mynewdata$Grad.Rate

X_scaled <- scale(X)


X_poly <- cbind(X_scaled, X_scaled^2)
colnames(X_poly) <- c(colnames(X_scaled), paste0(colnames(X_scaled), "_sq"))


stepwise_model <- step(lm(Grad.Rate ~ ., data = mynewdata), direction = "both")
## Start:  AIC=3564.29
## Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad + 
##     P.Undergrad + Outstate + Room.Board + Books + Personal + 
##     PhD + Terminal + S.F.Ratio + perc.alumni + Expend + Elite
## 
##               Df Sum of Sq    RSS    AIC
## - S.F.Ratio    1       0.6 108484 3562.3
## - Books        1       7.6 108491 3562.3
## - F.Undergrad  1      40.0 108523 3562.5
## - Terminal     1      44.8 108528 3562.6
## - Enroll       1     140.7 108624 3563.2
## <none>                     108483 3564.3
## - Accept       1     545.5 109029 3565.8
## - Elite        1     692.8 109176 3566.7
## - PhD          1     780.3 109263 3567.3
## - Private      1     782.0 109265 3567.3
## - Room.Board   1     973.8 109457 3568.5
## - Expend       1    1234.9 109718 3570.2
## - Personal     1    1295.3 109778 3570.6
## - Apps         1    2355.3 110838 3577.3
## - P.Undergrad  1    2481.9 110965 3578.1
## - Outstate     1    3209.4 111693 3582.7
## - perc.alumni  1    6313.9 114797 3601.9
## 
## Step:  AIC=3562.29
## Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad + 
##     P.Undergrad + Outstate + Room.Board + Books + Personal + 
##     PhD + Terminal + perc.alumni + Expend + Elite
## 
##               Df Sum of Sq    RSS    AIC
## - Books        1       7.7 108491 3560.3
## - F.Undergrad  1      39.6 108523 3560.5
## - Terminal     1      45.2 108529 3560.6
## - Enroll       1     140.4 108624 3561.2
## <none>                     108484 3562.3
## - Accept       1     546.0 109030 3563.8
## + S.F.Ratio    1       0.6 108483 3564.3
## - Elite        1     692.7 109176 3564.7
## - PhD          1     786.9 109271 3565.4
## - Private      1     795.5 109279 3565.4
## - Room.Board   1     974.6 109458 3566.6
## - Personal     1    1303.0 109787 3568.6
## - Expend       1    1445.7 109929 3569.6
## - Apps         1    2359.1 110843 3575.4
## - P.Undergrad  1    2482.4 110966 3576.1
## - Outstate     1    3220.3 111704 3580.8
## - perc.alumni  1    6339.9 114824 3600.1
## 
## Step:  AIC=3560.34
## Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad + 
##     P.Undergrad + Outstate + Room.Board + Personal + PhD + Terminal + 
##     perc.alumni + Expend + Elite
## 
##               Df Sum of Sq    RSS    AIC
## - F.Undergrad  1      39.1 108530 3558.6
## - Terminal     1      41.7 108533 3558.6
## - Enroll       1     140.9 108632 3559.3
## <none>                     108491 3560.3
## - Accept       1     546.9 109038 3561.9
## + Books        1       7.7 108484 3562.3
## + S.F.Ratio    1       0.7 108491 3562.3
## - Elite        1     696.8 109188 3562.8
## - PhD          1     779.3 109271 3563.4
## - Private      1     803.6 109295 3563.5
## - Room.Board   1     997.6 109489 3564.7
## - Personal     1    1300.5 109792 3566.7
## - Expend       1    1438.7 109930 3567.6
## - Apps         1    2363.5 110855 3573.4
## - P.Undergrad  1    2481.0 110972 3574.2
## - Outstate     1    3214.5 111706 3578.8
## - perc.alumni  1    6333.5 114825 3598.1
## 
## Step:  AIC=3558.59
## Grad.Rate ~ Private + Apps + Accept + Enroll + P.Undergrad + 
##     Outstate + Room.Board + Personal + PhD + Terminal + perc.alumni + 
##     Expend + Elite
## 
##               Df Sum of Sq    RSS    AIC
## - Terminal     1      47.9 108578 3556.9
## - Enroll       1     128.9 108659 3557.4
## <none>                     108530 3558.6
## - Accept       1     532.8 109063 3560.0
## + F.Undergrad  1      39.1 108491 3560.3
## + Books        1       7.2 108523 3560.5
## + S.F.Ratio    1       0.2 108530 3560.6
## - Elite        1     715.7 109246 3561.2
## - PhD          1     780.7 109311 3561.6
## - Private      1     865.4 109396 3562.2
## - Room.Board   1     990.8 109521 3563.0
## - Personal     1    1340.4 109871 3565.2
## - Expend       1    1429.3 109960 3565.8
## - Apps         1    2329.1 110860 3571.5
## - P.Undergrad  1    2771.9 111302 3574.2
## - Outstate     1    3275.8 111806 3577.4
## - perc.alumni  1    6393.9 114924 3596.7
## 
## Step:  AIC=3556.9
## Grad.Rate ~ Private + Apps + Accept + Enroll + P.Undergrad + 
##     Outstate + Room.Board + Personal + PhD + perc.alumni + Expend + 
##     Elite
## 
##               Df Sum of Sq    RSS    AIC
## - Enroll       1     126.5 108705 3555.7
## <none>                     108578 3556.9
## - Accept       1     541.5 109120 3558.4
## + Terminal     1      47.9 108530 3558.6
## + F.Undergrad  1      45.3 108533 3558.6
## + Books        1       3.6 108575 3558.9
## + S.F.Ratio    1       0.4 108578 3558.9
## - Elite        1     715.6 109294 3559.5
## - Private      1     915.7 109494 3560.8
## - Room.Board   1     956.0 109534 3561.0
## - PhD          1    1231.4 109810 3562.8
## - Personal     1    1335.1 109913 3563.5
## - Expend       1    1447.7 110026 3564.2
## - Apps         1    2357.8 110936 3569.9
## - P.Undergrad  1    2792.8 111371 3572.7
## - Outstate     1    3236.6 111815 3575.5
## - perc.alumni  1    6348.1 114926 3594.7
## 
## Step:  AIC=3555.72
## Grad.Rate ~ Private + Apps + Accept + P.Undergrad + Outstate + 
##     Room.Board + Personal + PhD + perc.alumni + Expend + Elite
## 
##               Df Sum of Sq    RSS    AIC
## <none>                     108705 3555.7
## - Accept       1     432.9 109138 3556.5
## + Enroll       1     126.5 108578 3556.9
## + Terminal     1      45.5 108659 3557.4
## + F.Undergrad  1      23.2 108682 3557.6
## + Books        1       5.2 108700 3557.7
## + S.F.Ratio    1       1.1 108704 3557.7
## - Elite        1     800.6 109505 3558.9
## - Private      1     832.3 109537 3559.1
## - Room.Board   1     892.7 109598 3559.4
## - Personal     1    1260.0 109965 3561.8
## - PhD          1    1283.8 109989 3561.9
## - Expend       1    1374.9 110080 3562.5
## - Apps         1    2246.3 110951 3568.0
## - P.Undergrad  1    2666.4 111371 3570.7
## - Outstate     1    3118.2 111823 3573.5
## - perc.alumni  1    6728.9 115434 3595.8
summary(stepwise_model)
## 
## Call:
## lm(formula = Grad.Rate ~ Private + Apps + Accept + P.Undergrad + 
##     Outstate + Room.Board + Personal + PhD + perc.alumni + Expend + 
##     Elite, data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.842  -7.563  -0.507   6.971  52.885 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.1629560  3.2789880  10.724  < 2e-16 ***
## PrivateYes   3.8908272  1.6952648   2.295 0.022027 *  
## Apps         0.0016005  0.0004245   3.771 0.000177 ***
## Accept      -0.0010894  0.0006582  -1.655 0.098346 .  
## P.Undergrad -0.0015205  0.0003701  -4.108 4.47e-05 ***
## Outstate     0.0010799  0.0002431   4.442 1.04e-05 ***
## Room.Board   0.0014346  0.0006036   2.377 0.017730 *  
## Personal    -0.0023305  0.0008253  -2.824 0.004881 ** 
## PhD          0.1122237  0.0393704   2.850 0.004496 ** 
## perc.alumni  0.3216323  0.0492854   6.526 1.31e-10 ***
## Expend      -0.0004355  0.0001476  -2.950 0.003287 ** 
## EliteYes     4.5101993  2.0036855   2.251 0.024703 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.57 on 688 degrees of freedom
## Multiple R-squared:  0.4539, Adjusted R-squared:  0.4452 
## F-statistic:    52 on 11 and 688 DF,  p-value: < 2.2e-16
adj_r_squared_stepwise <- summary(stepwise_model)$adj.r.squared
cat("Adjusted R-squared for Stepwise Model: ", adj_r_squared_stepwise, "\n")
## Adjusted R-squared for Stepwise Model:  0.445214
print(coef(final_model))
##   (Intercept)    PrivateYes          Apps        Accept   P.Undergrad 
## 35.1629559728  3.8908272070  0.0016004959 -0.0010893644 -0.0015204627 
##      Outstate    Room.Board      Personal           PhD   perc.alumni 
##  0.0010799294  0.0014346068 -0.0023304929  0.1122236857  0.3216322812 
##        Expend      EliteYes 
## -0.0004354983  4.5101993356
par(mfrow = c(2, 2))
plot(final_model)

The stepwise regression model with an Adjusted R-squared of 0.4452 provides a good fit for predicting the Grad.Rate variable. This means that the model explains approximately 44.5% of the variance in graduation rates, which is a reasonable level of explanatory power for this type of educational dataset.

Model Summary:

Residual Standard Error: 12.57 suggests that the model’s predictions have a typical error of around 12.57 graduation rate points. F-statistic (52): This indicates that the model as a whole is statistically significant with a p-value < 2.2e-16, meaning the included predictors contribute meaningfully to explaining graduation rates.

Key Predictors:

The final stepwise model includes several key predictors of Grad.Rate:

Private (PrivateYes): Private institutions have a positive and significant impact on graduation rates. Apps: More applications correlate with higher graduation rates. P.Undergrad: A higher proportion of part-time undergraduates is negatively associated with graduation rates. Outstate: Out-of-state tuition costs are positively related to graduation rates. Room.Board: Higher room and board costs are associated with better graduation rates. Personal: Higher personal expenses negatively impact graduation rates. PhD: More faculty with PhDs is positively associated with graduation rates. perc.alumni: A higher percentage of alumni donations strongly boosts graduation rates. Expend: More institutional spending negatively affects graduation rates. Elite (EliteYes): Elite institutions tend to have higher graduation rates.

Model Comparisons:

Ridge Regression: The best lambda value for Ridge was 0.9709, with an Adjusted R-squared of 0.4404. This model performed slightly worse than the stepwise model.

Lasso Regression: The best lambda value for Lasso was 0.5428, but it had a lower Adjusted R-squared of 0.4159, indicating it did not capture the relationships as effectively as the stepwise model.

Multicollinearity:

The Variance Inflation Factor (VIF) values for the stepwise model range from 1.27 to 12.55. Most values are acceptable, but the higher VIFs for Personal and Expend (around 12.5) suggest some multicollinearity, meaning these predictors may be correlated with others and could potentially inflate standard errors. This could be a limitation of the model.

Conclusion:

The stepwise regression model is most appropriate, given its relatively high Adjusted R-squared and performance over the Ridge and Lasso methods. This model explains a significant part of the variance in graduation rates, but some predictors are still suffering from multicollinearity issues. The model’s insights will help institutions understand which factors most influence improving graduation rates.