+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++— title: ‘MA717: Applied Regression and Experimental Data Analysis’ author: “Assignment template” date: ” ” output: pdf_document: default word_document: default html_document: df_print: paged —

options(repos = c(CRAN = "https://cran.rstudio.com"))

1.1. Read “College.csv” file into R with following command and use dim() and head() to check if you read the data correct. You should report the number of observations and the number of variables. (5 \(\%\))

file_path <- file.path("M:/Assignment-20241208/College.csv")
mydata <- read.csv("M:/Assignment-20241208/College.csv", header = TRUE, stringsAsFactors = TRUE)
dim(mydata)
## [1] 775  17
head(mydata)
##   Private Apps Accept Enroll F.Undergrad P.Undergrad Outstate Room.Board Books
## 1     Yes 1660   1232    721        2885         537     7440       3300   450
## 2     Yes 2186   1924    512        2683        1227    12280       6450   750
## 3     Yes 1428   1097    336        1036          99    11250       3750   400
## 4     Yes  417    349    137         510          63    12960       5450   450
## 5     Yes  193    146     55         249         869     7560       4120   800
## 6     Yes  587    479    158         678          41    13500       3335   500
##   Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate Elite
## 1     2200  70       78      18.1          12   7041        60    No
## 2     1500  29       30      12.2          16  10527        56    No
## 3     1165  53       66      12.9          30   8735        54    No
## 4      875  92       97       7.7          37  19016        59   Yes
## 5     1500  76       72      11.9           2  10922        15    No
## 6      675  67       73       9.4          11   9727        55    No

1.2. Use your registration number as random seed, generate a random subset of College data with sample size 700, name this new data as mynewdata. Use summary() to output the summarized information about mynewdata. Please report the number of private and public university and the number of Elite university and non-Elite university in this new data. (12 \(\%\))

set.seed(2401616)
mynewdata <- 
  mydata[sample(1:nrow(mydata),700), ]
summary(mydata)
##  Private        Apps           Accept            Enroll        F.Undergrad   
##  No :211   Min.   :   81   Min.   :   72.0   Min.   :  35.0   Min.   :  139  
##  Yes:564   1st Qu.:  778   1st Qu.:  605.5   1st Qu.: 242.0   1st Qu.:  990  
##            Median : 1558   Median : 1110.0   Median : 434.0   Median : 1708  
##            Mean   : 3004   Mean   : 2019.0   Mean   : 781.0   Mean   : 3707  
##            3rd Qu.: 3610   3rd Qu.: 2413.0   3rd Qu.: 902.5   3rd Qu.: 4056  
##            Max.   :48094   Max.   :26330.0   Max.   :6392.0   Max.   :31643  
##   P.Undergrad         Outstate       Room.Board       Books       
##  Min.   :    1.0   Min.   : 2340   Min.   :1780   Min.   :  96.0  
##  1st Qu.:   95.0   1st Qu.: 7332   1st Qu.:3598   1st Qu.: 469.0  
##  Median :  355.0   Median : 9990   Median :4200   Median : 500.0  
##  Mean   :  857.3   Mean   :10449   Mean   :4358   Mean   : 549.3  
##  3rd Qu.:  967.5   3rd Qu.:12938   3rd Qu.:5050   3rd Qu.: 600.0  
##  Max.   :21836.0   Max.   :21700   Max.   :8124   Max.   :2340.0  
##     Personal         PhD            Terminal        S.F.Ratio    
##  Min.   : 250   Min.   :  8.00   Min.   : 24.00   Min.   : 2.50  
##  1st Qu.: 870   1st Qu.: 62.00   1st Qu.: 71.00   1st Qu.:11.50  
##  Median :1200   Median : 75.00   Median : 82.00   Median :13.60  
##  Mean   :1343   Mean   : 72.69   Mean   : 79.73   Mean   :14.09  
##  3rd Qu.:1700   3rd Qu.: 85.00   3rd Qu.: 92.00   3rd Qu.:16.50  
##  Max.   :6800   Max.   :100.00   Max.   :100.00   Max.   :39.80  
##   perc.alumni        Expend        Grad.Rate      Elite    
##  Min.   : 0.00   Min.   : 3186   Min.   : 10.00   No :697  
##  1st Qu.:13.00   1st Qu.: 6754   1st Qu.: 53.00   Yes: 78  
##  Median :21.00   Median : 8408   Median : 65.00            
##  Mean   :22.76   Mean   : 9667   Mean   : 65.42            
##  3rd Qu.:31.00   3rd Qu.:10847   3rd Qu.: 78.00            
##  Max.   :64.00   Max.   :56233   Max.   :100.00
table(mynewdata$Private)
## 
##  No Yes 
## 190 510
table(mynewdata$Elite)
## 
##  No Yes 
## 629  71

1.3. Use mynewdata, plot histogram plots of four variables “Outstate”, “Room.Board”, “Books” and “Personal”. Give each plot a suitable title and label for x axis and y axis. (8\(\%\))

install.packages("ggplot2")
## Installing package into 'C:/Users/kp24011/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'ggplot2' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\kp24011\AppData\Local\Temp\Rtmp82DspW\downloaded_packages
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
hist_vars <- c("Outstate", "Room.Board", "Books", "Personal")
titles <- c("Histogram of Outstate Tuition", 
            "Histogram of Room and Board Costs", 
            "Histogram of Book Costs", 
            "Histogram of Personal Expenses")
x_labels <- c("Outstate Tuition", 
              "Room and Board Costs", 
              "Book Costs", 
              "Personal Expenses")

for (i in seq_along(hist_vars)) {
  plot <- ggplot(mynewdata, aes(x = .data[[hist_vars[i]]])) +
    geom_histogram(binwidth = 900, fill = "blue", color = "black") +
    labs(
      title = titles[i], 
      x = x_labels[i], 
      y = "Frequency"
    ) +
    theme_minimal()
  print(plot)
}

2.1. Use mynewdata, do a linear regression fitting when outcome is “Grad.Rate” and predictors are “Private” and “Elite”. Show the R output and report what you have learned from this output (you need to discuss significance, adjusted R-squared and p-value of F-statistics). (6\(\%\)).

model1 <- lm(Grad.Rate ~ Private + Elite, data = mynewdata)
summary(model1)
## 
## Call:
## lm(formula = Grad.Rate ~ Private + Elite, data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.843  -9.870   0.157  10.157  44.886 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   55.114      1.094  50.356   <2e-16 ***
## PrivateYes    11.729      1.276   9.190   <2e-16 ***
## EliteYes      18.108      1.880   9.632   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.98 on 697 degrees of freedom
## Multiple R-squared:  0.2141, Adjusted R-squared:  0.2119 
## F-statistic: 94.95 on 2 and 697 DF,  p-value: < 2.2e-16

2.2. Use the linear regression fitting result in 2.1, calculate the confidence intervals for the coefficients. Also give the prediction interval of “Grad.Rate” for a new data with Private=“Yes” and Elite=“No”. (4\(\%\))

new_data <- data.frame(Private = "Yes", Elite = "No")
predict(model1, new_data, interval = "prediction")
##        fit      lwr      upr
## 1 66.84265 37.39577 96.28953

2.3 Use mynewdata, do a multiple linear regression fitting when outcome is “Grad.Rate”, all other variables as predictors. Show the R output and report what you have learned from this output (you need to discuss significance, adjusted R-squared and p-value of F-statistics). Is linear regression model in 2.3 better than linear regression in 2.1? Use ANOVA to justify your conclusion. (14%)

model2 <- lm(Grad.Rate ~ ., data = mynewdata)
summary(model2)
## 
## Call:
## lm(formula = Grad.Rate ~ ., data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.646  -7.287  -0.441   7.155  52.328 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.0837588  4.9053047   7.152 2.20e-12 ***
## PrivateYes   3.9254138  1.7691302   2.219 0.026826 *  
## Apps         0.0016663  0.0004327   3.851 0.000129 ***
## Accept      -0.0015698  0.0008470  -1.853 0.064274 .  
## Enroll       0.0021744  0.0023102   0.941 0.346933    
## F.Undergrad -0.0002027  0.0004041  -0.502 0.616105    
## P.Undergrad -0.0015407  0.0003898  -3.953 8.52e-05 ***
## Outstate     0.0011163  0.0002483   4.495 8.17e-06 ***
## Room.Board   0.0015220  0.0006147   2.476 0.013524 *  
## Books        0.0006745  0.0030868   0.219 0.827084    
## Personal    -0.0024121  0.0008447  -2.856 0.004425 ** 
## PhD          0.1369511  0.0617875   2.216 0.026987 *  
## Terminal    -0.0353895  0.0666052  -0.531 0.595361    
## S.F.Ratio    0.0097674  0.1625001   0.060 0.952088    
## perc.alumni  0.3169363  0.0502684   6.305 5.18e-10 ***
## Expend      -0.0004464  0.0001601  -2.788 0.005446 ** 
## EliteYes     4.2327018  2.0267038   2.088 0.037126 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.6 on 683 degrees of freedom
## Multiple R-squared:  0.4551, Adjusted R-squared:  0.4423 
## F-statistic: 35.65 on 16 and 683 DF,  p-value: < 2.2e-16
anova(model1, model2)
## Analysis of Variance Table
## 
## Model 1: Grad.Rate ~ Private + Elite
## Model 2: Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad + 
##     P.Undergrad + Outstate + Room.Board + Books + Personal + 
##     PhD + Terminal + S.F.Ratio + perc.alumni + Expend + Elite
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1    697 156447                                 
## 2    683 108483 14     47964 21.57 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.4. Use the diagnostic plots to look at the fitting of multiple linear regression in 2.3. Please comment what you have seen from those plots. (7%)

par(mfrow = c(2, 2))
plot(model2)

2.5. Use mynewdata, do a variable selection to choose the best model. You should use plots to justify how do you choose your best model. Use the selected predictors of your best model with outcome “Grad.Rate”, do a linear regression fitting and plot the diagnostic plots for this fitting. You can use either exhaustive, or forward, or backward selection method. (14%)

library(MASS)
library(ggplot2)

full_model <- lm(Grad.Rate ~ ., data = mynewdata)
best_model <- stepAIC(full_model, direction = "both")
## Start:  AIC=3564.29
## Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad + 
##     P.Undergrad + Outstate + Room.Board + Books + Personal + 
##     PhD + Terminal + S.F.Ratio + perc.alumni + Expend + Elite
## 
##               Df Sum of Sq    RSS    AIC
## - S.F.Ratio    1       0.6 108484 3562.3
## - Books        1       7.6 108491 3562.3
## - F.Undergrad  1      40.0 108523 3562.5
## - Terminal     1      44.8 108528 3562.6
## - Enroll       1     140.7 108624 3563.2
## <none>                     108483 3564.3
## - Accept       1     545.5 109029 3565.8
## - Elite        1     692.8 109176 3566.7
## - PhD          1     780.3 109263 3567.3
## - Private      1     782.0 109265 3567.3
## - Room.Board   1     973.8 109457 3568.5
## - Expend       1    1234.9 109718 3570.2
## - Personal     1    1295.3 109778 3570.6
## - Apps         1    2355.3 110838 3577.3
## - P.Undergrad  1    2481.9 110965 3578.1
## - Outstate     1    3209.4 111693 3582.7
## - perc.alumni  1    6313.9 114797 3601.9
## 
## Step:  AIC=3562.29
## Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad + 
##     P.Undergrad + Outstate + Room.Board + Books + Personal + 
##     PhD + Terminal + perc.alumni + Expend + Elite
## 
##               Df Sum of Sq    RSS    AIC
## - Books        1       7.7 108491 3560.3
## - F.Undergrad  1      39.6 108523 3560.5
## - Terminal     1      45.2 108529 3560.6
## - Enroll       1     140.4 108624 3561.2
## <none>                     108484 3562.3
## - Accept       1     546.0 109030 3563.8
## + S.F.Ratio    1       0.6 108483 3564.3
## - Elite        1     692.7 109176 3564.7
## - PhD          1     786.9 109271 3565.4
## - Private      1     795.5 109279 3565.4
## - Room.Board   1     974.6 109458 3566.6
## - Personal     1    1303.0 109787 3568.6
## - Expend       1    1445.7 109929 3569.6
## - Apps         1    2359.1 110843 3575.4
## - P.Undergrad  1    2482.4 110966 3576.1
## - Outstate     1    3220.3 111704 3580.8
## - perc.alumni  1    6339.9 114824 3600.1
## 
## Step:  AIC=3560.34
## Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad + 
##     P.Undergrad + Outstate + Room.Board + Personal + PhD + Terminal + 
##     perc.alumni + Expend + Elite
## 
##               Df Sum of Sq    RSS    AIC
## - F.Undergrad  1      39.1 108530 3558.6
## - Terminal     1      41.7 108533 3558.6
## - Enroll       1     140.9 108632 3559.3
## <none>                     108491 3560.3
## - Accept       1     546.9 109038 3561.9
## + Books        1       7.7 108484 3562.3
## + S.F.Ratio    1       0.7 108491 3562.3
## - Elite        1     696.8 109188 3562.8
## - PhD          1     779.3 109271 3563.4
## - Private      1     803.6 109295 3563.5
## - Room.Board   1     997.6 109489 3564.7
## - Personal     1    1300.5 109792 3566.7
## - Expend       1    1438.7 109930 3567.6
## - Apps         1    2363.5 110855 3573.4
## - P.Undergrad  1    2481.0 110972 3574.2
## - Outstate     1    3214.5 111706 3578.8
## - perc.alumni  1    6333.5 114825 3598.1
## 
## Step:  AIC=3558.59
## Grad.Rate ~ Private + Apps + Accept + Enroll + P.Undergrad + 
##     Outstate + Room.Board + Personal + PhD + Terminal + perc.alumni + 
##     Expend + Elite
## 
##               Df Sum of Sq    RSS    AIC
## - Terminal     1      47.9 108578 3556.9
## - Enroll       1     128.9 108659 3557.4
## <none>                     108530 3558.6
## - Accept       1     532.8 109063 3560.0
## + F.Undergrad  1      39.1 108491 3560.3
## + Books        1       7.2 108523 3560.5
## + S.F.Ratio    1       0.2 108530 3560.6
## - Elite        1     715.7 109246 3561.2
## - PhD          1     780.7 109311 3561.6
## - Private      1     865.4 109396 3562.2
## - Room.Board   1     990.8 109521 3563.0
## - Personal     1    1340.4 109871 3565.2
## - Expend       1    1429.3 109960 3565.8
## - Apps         1    2329.1 110860 3571.5
## - P.Undergrad  1    2771.9 111302 3574.2
## - Outstate     1    3275.8 111806 3577.4
## - perc.alumni  1    6393.9 114924 3596.7
## 
## Step:  AIC=3556.9
## Grad.Rate ~ Private + Apps + Accept + Enroll + P.Undergrad + 
##     Outstate + Room.Board + Personal + PhD + perc.alumni + Expend + 
##     Elite
## 
##               Df Sum of Sq    RSS    AIC
## - Enroll       1     126.5 108705 3555.7
## <none>                     108578 3556.9
## - Accept       1     541.5 109120 3558.4
## + Terminal     1      47.9 108530 3558.6
## + F.Undergrad  1      45.3 108533 3558.6
## + Books        1       3.6 108575 3558.9
## + S.F.Ratio    1       0.4 108578 3558.9
## - Elite        1     715.6 109294 3559.5
## - Private      1     915.7 109494 3560.8
## - Room.Board   1     956.0 109534 3561.0
## - PhD          1    1231.4 109810 3562.8
## - Personal     1    1335.1 109913 3563.5
## - Expend       1    1447.7 110026 3564.2
## - Apps         1    2357.8 110936 3569.9
## - P.Undergrad  1    2792.8 111371 3572.7
## - Outstate     1    3236.6 111815 3575.5
## - perc.alumni  1    6348.1 114926 3594.7
## 
## Step:  AIC=3555.72
## Grad.Rate ~ Private + Apps + Accept + P.Undergrad + Outstate + 
##     Room.Board + Personal + PhD + perc.alumni + Expend + Elite
## 
##               Df Sum of Sq    RSS    AIC
## <none>                     108705 3555.7
## - Accept       1     432.9 109138 3556.5
## + Enroll       1     126.5 108578 3556.9
## + Terminal     1      45.5 108659 3557.4
## + F.Undergrad  1      23.2 108682 3557.6
## + Books        1       5.2 108700 3557.7
## + S.F.Ratio    1       1.1 108704 3557.7
## - Elite        1     800.6 109505 3558.9
## - Private      1     832.3 109537 3559.1
## - Room.Board   1     892.7 109598 3559.4
## - Personal     1    1260.0 109965 3561.8
## - PhD          1    1283.8 109989 3561.9
## - Expend       1    1374.9 110080 3562.5
## - Apps         1    2246.3 110951 3568.0
## - P.Undergrad  1    2666.4 111371 3570.7
## - Outstate     1    3118.2 111823 3573.5
## - perc.alumni  1    6728.9 115434 3595.8
summary(best_model)
## 
## Call:
## lm(formula = Grad.Rate ~ Private + Apps + Accept + P.Undergrad + 
##     Outstate + Room.Board + Personal + PhD + perc.alumni + Expend + 
##     Elite, data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.842  -7.563  -0.507   6.971  52.885 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.1629560  3.2789880  10.724  < 2e-16 ***
## PrivateYes   3.8908272  1.6952648   2.295 0.022027 *  
## Apps         0.0016005  0.0004245   3.771 0.000177 ***
## Accept      -0.0010894  0.0006582  -1.655 0.098346 .  
## P.Undergrad -0.0015205  0.0003701  -4.108 4.47e-05 ***
## Outstate     0.0010799  0.0002431   4.442 1.04e-05 ***
## Room.Board   0.0014346  0.0006036   2.377 0.017730 *  
## Personal    -0.0023305  0.0008253  -2.824 0.004881 ** 
## PhD          0.1122237  0.0393704   2.850 0.004496 ** 
## perc.alumni  0.3216323  0.0492854   6.526 1.31e-10 ***
## Expend      -0.0004355  0.0001476  -2.950 0.003287 ** 
## EliteYes     4.5101993  2.0036855   2.251 0.024703 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.57 on 688 degrees of freedom
## Multiple R-squared:  0.4539, Adjusted R-squared:  0.4452 
## F-statistic:    52 on 11 and 688 DF,  p-value: < 2.2e-16
selected_formula <- formula(best_model)
final_model <- lm(selected_formula, data = mynewdata)
summary(final_model)
## 
## Call:
## lm(formula = selected_formula, data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.842  -7.563  -0.507   6.971  52.885 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.1629560  3.2789880  10.724  < 2e-16 ***
## PrivateYes   3.8908272  1.6952648   2.295 0.022027 *  
## Apps         0.0016005  0.0004245   3.771 0.000177 ***
## Accept      -0.0010894  0.0006582  -1.655 0.098346 .  
## P.Undergrad -0.0015205  0.0003701  -4.108 4.47e-05 ***
## Outstate     0.0010799  0.0002431   4.442 1.04e-05 ***
## Room.Board   0.0014346  0.0006036   2.377 0.017730 *  
## Personal    -0.0023305  0.0008253  -2.824 0.004881 ** 
## PhD          0.1122237  0.0393704   2.850 0.004496 ** 
## perc.alumni  0.3216323  0.0492854   6.526 1.31e-10 ***
## Expend      -0.0004355  0.0001476  -2.950 0.003287 ** 
## EliteYes     4.5101993  2.0036855   2.251 0.024703 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.57 on 688 degrees of freedom
## Multiple R-squared:  0.4539, Adjusted R-squared:  0.4452 
## F-statistic:    52 on 11 and 688 DF,  p-value: < 2.2e-16
plot(final_model, which = 1, main = "Residuals vs Fitted")

plot(final_model, which = 2, main = "Normal Q-Q Plot")

plot(final_model, which = 3, main = "Scale-Location Plot")

plot(final_model, which = 5, main = "Residuals vs Leverage")

Use mynewdata, discuss and perform any step(s) that you think that can improve the fitting in Task 2. You need to illustrate your work by using the R codes, output and discussion.

required_packages <- c("ggplot2", "MASS", "glmnet", "boot")
new_packages <- required_packages[!(required_packages %in% installed.packages()[, "Package"])]

if (length(new_packages) > 0) {
  install.packages(new_packages)
}

# Load the required libraries
library(ggplot2)
library(MASS)
library(boot)

if ("glmnet" %in% installed.packages()[, "Package"]) {
  library(glmnet)
} else {
  message("glmnet package is not installed. Skipping Lasso regression.")
}
## Warning: package 'glmnet' was built under R version 4.4.2
## Loading required package: Matrix
## Loaded glmnet 4.1-8
initial_model <- lm(Grad.Rate ~ ., data = mynewdata)
summary(initial_model)
## 
## Call:
## lm(formula = Grad.Rate ~ ., data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.646  -7.287  -0.441   7.155  52.328 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.0837588  4.9053047   7.152 2.20e-12 ***
## PrivateYes   3.9254138  1.7691302   2.219 0.026826 *  
## Apps         0.0016663  0.0004327   3.851 0.000129 ***
## Accept      -0.0015698  0.0008470  -1.853 0.064274 .  
## Enroll       0.0021744  0.0023102   0.941 0.346933    
## F.Undergrad -0.0002027  0.0004041  -0.502 0.616105    
## P.Undergrad -0.0015407  0.0003898  -3.953 8.52e-05 ***
## Outstate     0.0011163  0.0002483   4.495 8.17e-06 ***
## Room.Board   0.0015220  0.0006147   2.476 0.013524 *  
## Books        0.0006745  0.0030868   0.219 0.827084    
## Personal    -0.0024121  0.0008447  -2.856 0.004425 ** 
## PhD          0.1369511  0.0617875   2.216 0.026987 *  
## Terminal    -0.0353895  0.0666052  -0.531 0.595361    
## S.F.Ratio    0.0097674  0.1625001   0.060 0.952088    
## perc.alumni  0.3169363  0.0502684   6.305 5.18e-10 ***
## Expend      -0.0004464  0.0001601  -2.788 0.005446 ** 
## EliteYes     4.2327018  2.0267038   2.088 0.037126 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.6 on 683 degrees of freedom
## Multiple R-squared:  0.4551, Adjusted R-squared:  0.4423 
## F-statistic: 35.65 on 16 and 683 DF,  p-value: < 2.2e-16
reduced_model <- lm(Grad.Rate ~ Private + Outstate + Room.Board + Books, data = mynewdata)
summary(reduced_model)
## 
## Call:
## lm(formula = Grad.Rate ~ Private + Outstate + Room.Board + Books, 
##     data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.683  -8.822   0.174   8.252  50.846 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.2031086  2.6564354  14.005   <2e-16 ***
## PrivateYes   1.0023971  1.4038844   0.714   0.4755    
## Outstate     0.0021001  0.0001976  10.626   <2e-16 ***
## Room.Board   0.0014794  0.0006369   2.323   0.0205 *  
## Books       -0.0016624  0.0032745  -0.508   0.6118    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.78 on 695 degrees of freedom
## Multiple R-squared:  0.337,  Adjusted R-squared:  0.3332 
## F-statistic: 88.33 on 4 and 695 DF,  p-value: < 2.2e-16
interaction_model <- lm(Grad.Rate ~ Private * Elite + Outstate * Room.Board + Books, data = mynewdata)
summary(interaction_model)
## 
## Call:
## lm(formula = Grad.Rate ~ Private * Elite + Outstate * Room.Board + 
##     Books, data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.385  -8.263   0.370   7.894  51.475 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.485e+01  6.174e+00   5.646 2.40e-08 ***
## PrivateYes           2.565e+00  1.455e+00   1.763 0.078275 .  
## EliteYes             1.656e+01  3.960e+00   4.183 3.25e-05 ***
## Outstate             2.265e-03  5.981e-04   3.787 0.000166 ***
## Room.Board           2.263e-03  1.466e-03   1.544 0.123111    
## Books               -2.576e-03  3.226e-03  -0.799 0.424740    
## PrivateYes:EliteYes -9.887e+00  4.531e+00  -2.182 0.029438 *  
## Outstate:Room.Board -8.827e-08  1.224e-07  -0.721 0.471021    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.55 on 692 degrees of freedom
## Multiple R-squared:  0.3615, Adjusted R-squared:  0.355 
## F-statistic: 55.96 on 7 and 692 DF,  p-value: < 2.2e-16
mynewdata$Log_Outstate <- log(mynewdata$Outstate)


transformed_model <- lm(Grad.Rate ~ Log_Outstate + Private + Room.Board + Books, data = mynewdata)
summary(transformed_model)
## 
## Call:
## lm(formula = Grad.Rate ~ Log_Outstate + Private + Room.Board + 
##     Books, data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.954  -9.349   0.164   8.569  55.486 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.216e+02  1.679e+01  -7.244 1.16e-12 ***
## Log_Outstate  1.948e+01  2.043e+00   9.534  < 2e-16 ***
## PrivateYes    4.980e-01  1.482e+00   0.336  0.73690    
## Room.Board    1.972e-03  6.382e-04   3.089  0.00209 ** 
## Books        -1.282e-03  3.321e-03  -0.386  0.69954    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.97 on 695 degrees of freedom
## Multiple R-squared:  0.3185, Adjusted R-squared:  0.3145 
## F-statistic: 81.19 on 4 and 695 DF,  p-value: < 2.2e-16
if ("glmnet" %in% installed.packages()[, "Package"]) {
  
  x <- model.matrix(Grad.Rate ~ ., data = mynewdata)[, -1]  # Exclude intercept
  y <- mynewdata$Grad.Rate
  
  
  lasso_model <- glmnet(x, y, alpha = 1)  # alpha=1 for Lasso
  plot(lasso_model)
  
  
  cv_lasso <- cv.glmnet(x, y, alpha = 1)
  best_lambda <- cv_lasso$lambda.min
  message("Optimal Lambda for Lasso: ", best_lambda)
 
  final_lasso <- glmnet(x, y, alpha = 1, lambda = best_lambda)
  print(coef(final_lasso))
} else {
  message("Skipping Lasso regression due to missing glmnet package.")
}

## Optimal Lambda for Lasso: 0.374138951508501
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                         s0
## (Intercept)  -2.3062823830
## PrivateYes    1.9746908469
## Apps          0.0006889281
## Accept        .           
## Enroll        .           
## F.Undergrad   .           
## P.Undergrad  -0.0013483737
## Outstate      0.0004515911
## Room.Board    0.0012986077
## Books         .           
## Personal     -0.0022244024
## PhD           0.0761476962
## Terminal      .           
## S.F.Ratio     .           
## perc.alumni   0.3109259399
## Expend        .           
## EliteYes      3.5193309024
## Log_Outstate  4.9076892693
AIC(initial_model, reduced_model, interaction_model, transformed_model)
##                   df      AIC
## initial_model     18 5552.803
## reduced_model      6 5666.036
## interaction_model  9 5645.757
## transformed_model  6 5685.367
adj_r2 <- c(
  initial_model = summary(initial_model)$adj.r.squared,
  reduced_model = summary(reduced_model)$adj.r.squared,
  interaction_model = summary(interaction_model)$adj.r.squared,
  transformed_model = summary(transformed_model)$adj.r.squared
)
print(adj_r2)
##     initial_model     reduced_model interaction_model transformed_model 
##         0.4422926         0.3332164         0.3550010         0.3145467
cv_error <- cv.glm(mynewdata, transformed_model, K = 10)
cv_error$delta  # 10-fold CV error
## [1] NaN NaN
final_model <- transformed_model  
summary(final_model)
## 
## Call:
## lm(formula = Grad.Rate ~ Log_Outstate + Private + Room.Board + 
##     Books, data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.954  -9.349   0.164   8.569  55.486 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.216e+02  1.679e+01  -7.244 1.16e-12 ***
## Log_Outstate  1.948e+01  2.043e+00   9.534  < 2e-16 ***
## PrivateYes    4.980e-01  1.482e+00   0.336  0.73690    
## Room.Board    1.972e-03  6.382e-04   3.089  0.00209 ** 
## Books        -1.282e-03  3.321e-03  -0.386  0.69954    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.97 on 695 degrees of freedom
## Multiple R-squared:  0.3185, Adjusted R-squared:  0.3145 
## F-statistic: 81.19 on 4 and 695 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(final_model)

final_model_coefficients <- as.data.frame(summary(final_model)$coefficients)
write.csv(final_model_coefficients, "final_model_coefficients.csv")


final_model_summary <- summary(final_model)
final_model_metrics <- list(
  r_squared = final_model_summary$r.squared,
  adj_r_squared = final_model_summary$adj.r.squared,
  residual_standard_error = final_model_summary$sigma,
  f_statistic = final_model_summary$fstatistic[1]
)
final_model_metrics_df <- as.data.frame(final_model_metrics)
write.csv(final_model_metrics_df, "final_model_metrics.csv", row.names = FALSE)