+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++— title: ‘MA717: Applied Regression and Experimental Data Analysis’ author: “Assignment template” date: ” ” output: pdf_document: default word_document: default html_document: df_print: paged —
options(repos = c(CRAN = "https://cran.rstudio.com"))
1.1. Read “College.csv” file into R with following command and use dim() and head() to check if you read the data correct. You should report the number of observations and the number of variables. (5 \(\%\))
file_path <- file.path("M:/Assignment-20241208/College.csv")
mydata <- read.csv("M:/Assignment-20241208/College.csv", header = TRUE, stringsAsFactors = TRUE)
dim(mydata)
## [1] 775 17
head(mydata)
## Private Apps Accept Enroll F.Undergrad P.Undergrad Outstate Room.Board Books
## 1 Yes 1660 1232 721 2885 537 7440 3300 450
## 2 Yes 2186 1924 512 2683 1227 12280 6450 750
## 3 Yes 1428 1097 336 1036 99 11250 3750 400
## 4 Yes 417 349 137 510 63 12960 5450 450
## 5 Yes 193 146 55 249 869 7560 4120 800
## 6 Yes 587 479 158 678 41 13500 3335 500
## Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate Elite
## 1 2200 70 78 18.1 12 7041 60 No
## 2 1500 29 30 12.2 16 10527 56 No
## 3 1165 53 66 12.9 30 8735 54 No
## 4 875 92 97 7.7 37 19016 59 Yes
## 5 1500 76 72 11.9 2 10922 15 No
## 6 675 67 73 9.4 11 9727 55 No
1.2. Use your registration number as random seed, generate a random subset of College data with sample size 700, name this new data as mynewdata. Use summary() to output the summarized information about mynewdata. Please report the number of private and public university and the number of Elite university and non-Elite university in this new data. (12 \(\%\))
set.seed(2401616)
mynewdata <-
mydata[sample(1:nrow(mydata),700), ]
summary(mydata)
## Private Apps Accept Enroll F.Undergrad
## No :211 Min. : 81 Min. : 72.0 Min. : 35.0 Min. : 139
## Yes:564 1st Qu.: 778 1st Qu.: 605.5 1st Qu.: 242.0 1st Qu.: 990
## Median : 1558 Median : 1110.0 Median : 434.0 Median : 1708
## Mean : 3004 Mean : 2019.0 Mean : 781.0 Mean : 3707
## 3rd Qu.: 3610 3rd Qu.: 2413.0 3rd Qu.: 902.5 3rd Qu.: 4056
## Max. :48094 Max. :26330.0 Max. :6392.0 Max. :31643
## P.Undergrad Outstate Room.Board Books
## Min. : 1.0 Min. : 2340 Min. :1780 Min. : 96.0
## 1st Qu.: 95.0 1st Qu.: 7332 1st Qu.:3598 1st Qu.: 469.0
## Median : 355.0 Median : 9990 Median :4200 Median : 500.0
## Mean : 857.3 Mean :10449 Mean :4358 Mean : 549.3
## 3rd Qu.: 967.5 3rd Qu.:12938 3rd Qu.:5050 3rd Qu.: 600.0
## Max. :21836.0 Max. :21700 Max. :8124 Max. :2340.0
## Personal PhD Terminal S.F.Ratio
## Min. : 250 Min. : 8.00 Min. : 24.00 Min. : 2.50
## 1st Qu.: 870 1st Qu.: 62.00 1st Qu.: 71.00 1st Qu.:11.50
## Median :1200 Median : 75.00 Median : 82.00 Median :13.60
## Mean :1343 Mean : 72.69 Mean : 79.73 Mean :14.09
## 3rd Qu.:1700 3rd Qu.: 85.00 3rd Qu.: 92.00 3rd Qu.:16.50
## Max. :6800 Max. :100.00 Max. :100.00 Max. :39.80
## perc.alumni Expend Grad.Rate Elite
## Min. : 0.00 Min. : 3186 Min. : 10.00 No :697
## 1st Qu.:13.00 1st Qu.: 6754 1st Qu.: 53.00 Yes: 78
## Median :21.00 Median : 8408 Median : 65.00
## Mean :22.76 Mean : 9667 Mean : 65.42
## 3rd Qu.:31.00 3rd Qu.:10847 3rd Qu.: 78.00
## Max. :64.00 Max. :56233 Max. :100.00
table(mynewdata$Private)
##
## No Yes
## 190 510
table(mynewdata$Elite)
##
## No Yes
## 629 71
1.3. Use mynewdata, plot histogram plots of four variables “Outstate”, “Room.Board”, “Books” and “Personal”. Give each plot a suitable title and label for x axis and y axis. (8\(\%\))
install.packages("ggplot2")
## Installing package into 'C:/Users/kp24011/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'ggplot2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\kp24011\AppData\Local\Temp\Rtmp82DspW\downloaded_packages
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
hist_vars <- c("Outstate", "Room.Board", "Books", "Personal")
titles <- c("Histogram of Outstate Tuition",
"Histogram of Room and Board Costs",
"Histogram of Book Costs",
"Histogram of Personal Expenses")
x_labels <- c("Outstate Tuition",
"Room and Board Costs",
"Book Costs",
"Personal Expenses")
for (i in seq_along(hist_vars)) {
plot <- ggplot(mynewdata, aes(x = .data[[hist_vars[i]]])) +
geom_histogram(binwidth = 900, fill = "blue", color = "black") +
labs(
title = titles[i],
x = x_labels[i],
y = "Frequency"
) +
theme_minimal()
print(plot)
}
2.1. Use mynewdata, do a linear regression fitting when outcome is “Grad.Rate” and predictors are “Private” and “Elite”. Show the R output and report what you have learned from this output (you need to discuss significance, adjusted R-squared and p-value of F-statistics). (6\(\%\)).
model1 <- lm(Grad.Rate ~ Private + Elite, data = mynewdata)
summary(model1)
##
## Call:
## lm(formula = Grad.Rate ~ Private + Elite, data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.843 -9.870 0.157 10.157 44.886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.114 1.094 50.356 <2e-16 ***
## PrivateYes 11.729 1.276 9.190 <2e-16 ***
## EliteYes 18.108 1.880 9.632 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.98 on 697 degrees of freedom
## Multiple R-squared: 0.2141, Adjusted R-squared: 0.2119
## F-statistic: 94.95 on 2 and 697 DF, p-value: < 2.2e-16
2.2. Use the linear regression fitting result in 2.1, calculate the confidence intervals for the coefficients. Also give the prediction interval of “Grad.Rate” for a new data with Private=“Yes” and Elite=“No”. (4\(\%\))
new_data <- data.frame(Private = "Yes", Elite = "No")
predict(model1, new_data, interval = "prediction")
## fit lwr upr
## 1 66.84265 37.39577 96.28953
2.3 Use mynewdata, do a multiple linear regression fitting when outcome is “Grad.Rate”, all other variables as predictors. Show the R output and report what you have learned from this output (you need to discuss significance, adjusted R-squared and p-value of F-statistics). Is linear regression model in 2.3 better than linear regression in 2.1? Use ANOVA to justify your conclusion. (14%)
model2 <- lm(Grad.Rate ~ ., data = mynewdata)
summary(model2)
##
## Call:
## lm(formula = Grad.Rate ~ ., data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.646 -7.287 -0.441 7.155 52.328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.0837588 4.9053047 7.152 2.20e-12 ***
## PrivateYes 3.9254138 1.7691302 2.219 0.026826 *
## Apps 0.0016663 0.0004327 3.851 0.000129 ***
## Accept -0.0015698 0.0008470 -1.853 0.064274 .
## Enroll 0.0021744 0.0023102 0.941 0.346933
## F.Undergrad -0.0002027 0.0004041 -0.502 0.616105
## P.Undergrad -0.0015407 0.0003898 -3.953 8.52e-05 ***
## Outstate 0.0011163 0.0002483 4.495 8.17e-06 ***
## Room.Board 0.0015220 0.0006147 2.476 0.013524 *
## Books 0.0006745 0.0030868 0.219 0.827084
## Personal -0.0024121 0.0008447 -2.856 0.004425 **
## PhD 0.1369511 0.0617875 2.216 0.026987 *
## Terminal -0.0353895 0.0666052 -0.531 0.595361
## S.F.Ratio 0.0097674 0.1625001 0.060 0.952088
## perc.alumni 0.3169363 0.0502684 6.305 5.18e-10 ***
## Expend -0.0004464 0.0001601 -2.788 0.005446 **
## EliteYes 4.2327018 2.0267038 2.088 0.037126 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.6 on 683 degrees of freedom
## Multiple R-squared: 0.4551, Adjusted R-squared: 0.4423
## F-statistic: 35.65 on 16 and 683 DF, p-value: < 2.2e-16
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: Grad.Rate ~ Private + Elite
## Model 2: Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad +
## P.Undergrad + Outstate + Room.Board + Books + Personal +
## PhD + Terminal + S.F.Ratio + perc.alumni + Expend + Elite
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 697 156447
## 2 683 108483 14 47964 21.57 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.4. Use the diagnostic plots to look at the fitting of multiple linear regression in 2.3. Please comment what you have seen from those plots. (7%)
par(mfrow = c(2, 2))
plot(model2)
2.5. Use mynewdata, do a variable selection to choose the best model. You should use plots to justify how do you choose your best model. Use the selected predictors of your best model with outcome “Grad.Rate”, do a linear regression fitting and plot the diagnostic plots for this fitting. You can use either exhaustive, or forward, or backward selection method. (14%)
library(MASS)
library(ggplot2)
full_model <- lm(Grad.Rate ~ ., data = mynewdata)
best_model <- stepAIC(full_model, direction = "both")
## Start: AIC=3564.29
## Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad +
## P.Undergrad + Outstate + Room.Board + Books + Personal +
## PhD + Terminal + S.F.Ratio + perc.alumni + Expend + Elite
##
## Df Sum of Sq RSS AIC
## - S.F.Ratio 1 0.6 108484 3562.3
## - Books 1 7.6 108491 3562.3
## - F.Undergrad 1 40.0 108523 3562.5
## - Terminal 1 44.8 108528 3562.6
## - Enroll 1 140.7 108624 3563.2
## <none> 108483 3564.3
## - Accept 1 545.5 109029 3565.8
## - Elite 1 692.8 109176 3566.7
## - PhD 1 780.3 109263 3567.3
## - Private 1 782.0 109265 3567.3
## - Room.Board 1 973.8 109457 3568.5
## - Expend 1 1234.9 109718 3570.2
## - Personal 1 1295.3 109778 3570.6
## - Apps 1 2355.3 110838 3577.3
## - P.Undergrad 1 2481.9 110965 3578.1
## - Outstate 1 3209.4 111693 3582.7
## - perc.alumni 1 6313.9 114797 3601.9
##
## Step: AIC=3562.29
## Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad +
## P.Undergrad + Outstate + Room.Board + Books + Personal +
## PhD + Terminal + perc.alumni + Expend + Elite
##
## Df Sum of Sq RSS AIC
## - Books 1 7.7 108491 3560.3
## - F.Undergrad 1 39.6 108523 3560.5
## - Terminal 1 45.2 108529 3560.6
## - Enroll 1 140.4 108624 3561.2
## <none> 108484 3562.3
## - Accept 1 546.0 109030 3563.8
## + S.F.Ratio 1 0.6 108483 3564.3
## - Elite 1 692.7 109176 3564.7
## - PhD 1 786.9 109271 3565.4
## - Private 1 795.5 109279 3565.4
## - Room.Board 1 974.6 109458 3566.6
## - Personal 1 1303.0 109787 3568.6
## - Expend 1 1445.7 109929 3569.6
## - Apps 1 2359.1 110843 3575.4
## - P.Undergrad 1 2482.4 110966 3576.1
## - Outstate 1 3220.3 111704 3580.8
## - perc.alumni 1 6339.9 114824 3600.1
##
## Step: AIC=3560.34
## Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad +
## P.Undergrad + Outstate + Room.Board + Personal + PhD + Terminal +
## perc.alumni + Expend + Elite
##
## Df Sum of Sq RSS AIC
## - F.Undergrad 1 39.1 108530 3558.6
## - Terminal 1 41.7 108533 3558.6
## - Enroll 1 140.9 108632 3559.3
## <none> 108491 3560.3
## - Accept 1 546.9 109038 3561.9
## + Books 1 7.7 108484 3562.3
## + S.F.Ratio 1 0.7 108491 3562.3
## - Elite 1 696.8 109188 3562.8
## - PhD 1 779.3 109271 3563.4
## - Private 1 803.6 109295 3563.5
## - Room.Board 1 997.6 109489 3564.7
## - Personal 1 1300.5 109792 3566.7
## - Expend 1 1438.7 109930 3567.6
## - Apps 1 2363.5 110855 3573.4
## - P.Undergrad 1 2481.0 110972 3574.2
## - Outstate 1 3214.5 111706 3578.8
## - perc.alumni 1 6333.5 114825 3598.1
##
## Step: AIC=3558.59
## Grad.Rate ~ Private + Apps + Accept + Enroll + P.Undergrad +
## Outstate + Room.Board + Personal + PhD + Terminal + perc.alumni +
## Expend + Elite
##
## Df Sum of Sq RSS AIC
## - Terminal 1 47.9 108578 3556.9
## - Enroll 1 128.9 108659 3557.4
## <none> 108530 3558.6
## - Accept 1 532.8 109063 3560.0
## + F.Undergrad 1 39.1 108491 3560.3
## + Books 1 7.2 108523 3560.5
## + S.F.Ratio 1 0.2 108530 3560.6
## - Elite 1 715.7 109246 3561.2
## - PhD 1 780.7 109311 3561.6
## - Private 1 865.4 109396 3562.2
## - Room.Board 1 990.8 109521 3563.0
## - Personal 1 1340.4 109871 3565.2
## - Expend 1 1429.3 109960 3565.8
## - Apps 1 2329.1 110860 3571.5
## - P.Undergrad 1 2771.9 111302 3574.2
## - Outstate 1 3275.8 111806 3577.4
## - perc.alumni 1 6393.9 114924 3596.7
##
## Step: AIC=3556.9
## Grad.Rate ~ Private + Apps + Accept + Enroll + P.Undergrad +
## Outstate + Room.Board + Personal + PhD + perc.alumni + Expend +
## Elite
##
## Df Sum of Sq RSS AIC
## - Enroll 1 126.5 108705 3555.7
## <none> 108578 3556.9
## - Accept 1 541.5 109120 3558.4
## + Terminal 1 47.9 108530 3558.6
## + F.Undergrad 1 45.3 108533 3558.6
## + Books 1 3.6 108575 3558.9
## + S.F.Ratio 1 0.4 108578 3558.9
## - Elite 1 715.6 109294 3559.5
## - Private 1 915.7 109494 3560.8
## - Room.Board 1 956.0 109534 3561.0
## - PhD 1 1231.4 109810 3562.8
## - Personal 1 1335.1 109913 3563.5
## - Expend 1 1447.7 110026 3564.2
## - Apps 1 2357.8 110936 3569.9
## - P.Undergrad 1 2792.8 111371 3572.7
## - Outstate 1 3236.6 111815 3575.5
## - perc.alumni 1 6348.1 114926 3594.7
##
## Step: AIC=3555.72
## Grad.Rate ~ Private + Apps + Accept + P.Undergrad + Outstate +
## Room.Board + Personal + PhD + perc.alumni + Expend + Elite
##
## Df Sum of Sq RSS AIC
## <none> 108705 3555.7
## - Accept 1 432.9 109138 3556.5
## + Enroll 1 126.5 108578 3556.9
## + Terminal 1 45.5 108659 3557.4
## + F.Undergrad 1 23.2 108682 3557.6
## + Books 1 5.2 108700 3557.7
## + S.F.Ratio 1 1.1 108704 3557.7
## - Elite 1 800.6 109505 3558.9
## - Private 1 832.3 109537 3559.1
## - Room.Board 1 892.7 109598 3559.4
## - Personal 1 1260.0 109965 3561.8
## - PhD 1 1283.8 109989 3561.9
## - Expend 1 1374.9 110080 3562.5
## - Apps 1 2246.3 110951 3568.0
## - P.Undergrad 1 2666.4 111371 3570.7
## - Outstate 1 3118.2 111823 3573.5
## - perc.alumni 1 6728.9 115434 3595.8
summary(best_model)
##
## Call:
## lm(formula = Grad.Rate ~ Private + Apps + Accept + P.Undergrad +
## Outstate + Room.Board + Personal + PhD + perc.alumni + Expend +
## Elite, data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.842 -7.563 -0.507 6.971 52.885
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.1629560 3.2789880 10.724 < 2e-16 ***
## PrivateYes 3.8908272 1.6952648 2.295 0.022027 *
## Apps 0.0016005 0.0004245 3.771 0.000177 ***
## Accept -0.0010894 0.0006582 -1.655 0.098346 .
## P.Undergrad -0.0015205 0.0003701 -4.108 4.47e-05 ***
## Outstate 0.0010799 0.0002431 4.442 1.04e-05 ***
## Room.Board 0.0014346 0.0006036 2.377 0.017730 *
## Personal -0.0023305 0.0008253 -2.824 0.004881 **
## PhD 0.1122237 0.0393704 2.850 0.004496 **
## perc.alumni 0.3216323 0.0492854 6.526 1.31e-10 ***
## Expend -0.0004355 0.0001476 -2.950 0.003287 **
## EliteYes 4.5101993 2.0036855 2.251 0.024703 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.57 on 688 degrees of freedom
## Multiple R-squared: 0.4539, Adjusted R-squared: 0.4452
## F-statistic: 52 on 11 and 688 DF, p-value: < 2.2e-16
selected_formula <- formula(best_model)
final_model <- lm(selected_formula, data = mynewdata)
summary(final_model)
##
## Call:
## lm(formula = selected_formula, data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.842 -7.563 -0.507 6.971 52.885
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.1629560 3.2789880 10.724 < 2e-16 ***
## PrivateYes 3.8908272 1.6952648 2.295 0.022027 *
## Apps 0.0016005 0.0004245 3.771 0.000177 ***
## Accept -0.0010894 0.0006582 -1.655 0.098346 .
## P.Undergrad -0.0015205 0.0003701 -4.108 4.47e-05 ***
## Outstate 0.0010799 0.0002431 4.442 1.04e-05 ***
## Room.Board 0.0014346 0.0006036 2.377 0.017730 *
## Personal -0.0023305 0.0008253 -2.824 0.004881 **
## PhD 0.1122237 0.0393704 2.850 0.004496 **
## perc.alumni 0.3216323 0.0492854 6.526 1.31e-10 ***
## Expend -0.0004355 0.0001476 -2.950 0.003287 **
## EliteYes 4.5101993 2.0036855 2.251 0.024703 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.57 on 688 degrees of freedom
## Multiple R-squared: 0.4539, Adjusted R-squared: 0.4452
## F-statistic: 52 on 11 and 688 DF, p-value: < 2.2e-16
plot(final_model, which = 1, main = "Residuals vs Fitted")
plot(final_model, which = 2, main = "Normal Q-Q Plot")
plot(final_model, which = 3, main = "Scale-Location Plot")
plot(final_model, which = 5, main = "Residuals vs Leverage")
Use mynewdata, discuss and perform any step(s) that you think that can improve the fitting in Task 2. You need to illustrate your work by using the R codes, output and discussion.
required_packages <- c("ggplot2", "MASS", "glmnet", "boot")
new_packages <- required_packages[!(required_packages %in% installed.packages()[, "Package"])]
if (length(new_packages) > 0) {
install.packages(new_packages)
}
# Load the required libraries
library(ggplot2)
library(MASS)
library(boot)
if ("glmnet" %in% installed.packages()[, "Package"]) {
library(glmnet)
} else {
message("glmnet package is not installed. Skipping Lasso regression.")
}
## Warning: package 'glmnet' was built under R version 4.4.2
## Loading required package: Matrix
## Loaded glmnet 4.1-8
initial_model <- lm(Grad.Rate ~ ., data = mynewdata)
summary(initial_model)
##
## Call:
## lm(formula = Grad.Rate ~ ., data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.646 -7.287 -0.441 7.155 52.328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.0837588 4.9053047 7.152 2.20e-12 ***
## PrivateYes 3.9254138 1.7691302 2.219 0.026826 *
## Apps 0.0016663 0.0004327 3.851 0.000129 ***
## Accept -0.0015698 0.0008470 -1.853 0.064274 .
## Enroll 0.0021744 0.0023102 0.941 0.346933
## F.Undergrad -0.0002027 0.0004041 -0.502 0.616105
## P.Undergrad -0.0015407 0.0003898 -3.953 8.52e-05 ***
## Outstate 0.0011163 0.0002483 4.495 8.17e-06 ***
## Room.Board 0.0015220 0.0006147 2.476 0.013524 *
## Books 0.0006745 0.0030868 0.219 0.827084
## Personal -0.0024121 0.0008447 -2.856 0.004425 **
## PhD 0.1369511 0.0617875 2.216 0.026987 *
## Terminal -0.0353895 0.0666052 -0.531 0.595361
## S.F.Ratio 0.0097674 0.1625001 0.060 0.952088
## perc.alumni 0.3169363 0.0502684 6.305 5.18e-10 ***
## Expend -0.0004464 0.0001601 -2.788 0.005446 **
## EliteYes 4.2327018 2.0267038 2.088 0.037126 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.6 on 683 degrees of freedom
## Multiple R-squared: 0.4551, Adjusted R-squared: 0.4423
## F-statistic: 35.65 on 16 and 683 DF, p-value: < 2.2e-16
reduced_model <- lm(Grad.Rate ~ Private + Outstate + Room.Board + Books, data = mynewdata)
summary(reduced_model)
##
## Call:
## lm(formula = Grad.Rate ~ Private + Outstate + Room.Board + Books,
## data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.683 -8.822 0.174 8.252 50.846
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2031086 2.6564354 14.005 <2e-16 ***
## PrivateYes 1.0023971 1.4038844 0.714 0.4755
## Outstate 0.0021001 0.0001976 10.626 <2e-16 ***
## Room.Board 0.0014794 0.0006369 2.323 0.0205 *
## Books -0.0016624 0.0032745 -0.508 0.6118
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.78 on 695 degrees of freedom
## Multiple R-squared: 0.337, Adjusted R-squared: 0.3332
## F-statistic: 88.33 on 4 and 695 DF, p-value: < 2.2e-16
interaction_model <- lm(Grad.Rate ~ Private * Elite + Outstate * Room.Board + Books, data = mynewdata)
summary(interaction_model)
##
## Call:
## lm(formula = Grad.Rate ~ Private * Elite + Outstate * Room.Board +
## Books, data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.385 -8.263 0.370 7.894 51.475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.485e+01 6.174e+00 5.646 2.40e-08 ***
## PrivateYes 2.565e+00 1.455e+00 1.763 0.078275 .
## EliteYes 1.656e+01 3.960e+00 4.183 3.25e-05 ***
## Outstate 2.265e-03 5.981e-04 3.787 0.000166 ***
## Room.Board 2.263e-03 1.466e-03 1.544 0.123111
## Books -2.576e-03 3.226e-03 -0.799 0.424740
## PrivateYes:EliteYes -9.887e+00 4.531e+00 -2.182 0.029438 *
## Outstate:Room.Board -8.827e-08 1.224e-07 -0.721 0.471021
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.55 on 692 degrees of freedom
## Multiple R-squared: 0.3615, Adjusted R-squared: 0.355
## F-statistic: 55.96 on 7 and 692 DF, p-value: < 2.2e-16
mynewdata$Log_Outstate <- log(mynewdata$Outstate)
transformed_model <- lm(Grad.Rate ~ Log_Outstate + Private + Room.Board + Books, data = mynewdata)
summary(transformed_model)
##
## Call:
## lm(formula = Grad.Rate ~ Log_Outstate + Private + Room.Board +
## Books, data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.954 -9.349 0.164 8.569 55.486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.216e+02 1.679e+01 -7.244 1.16e-12 ***
## Log_Outstate 1.948e+01 2.043e+00 9.534 < 2e-16 ***
## PrivateYes 4.980e-01 1.482e+00 0.336 0.73690
## Room.Board 1.972e-03 6.382e-04 3.089 0.00209 **
## Books -1.282e-03 3.321e-03 -0.386 0.69954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.97 on 695 degrees of freedom
## Multiple R-squared: 0.3185, Adjusted R-squared: 0.3145
## F-statistic: 81.19 on 4 and 695 DF, p-value: < 2.2e-16
if ("glmnet" %in% installed.packages()[, "Package"]) {
x <- model.matrix(Grad.Rate ~ ., data = mynewdata)[, -1] # Exclude intercept
y <- mynewdata$Grad.Rate
lasso_model <- glmnet(x, y, alpha = 1) # alpha=1 for Lasso
plot(lasso_model)
cv_lasso <- cv.glmnet(x, y, alpha = 1)
best_lambda <- cv_lasso$lambda.min
message("Optimal Lambda for Lasso: ", best_lambda)
final_lasso <- glmnet(x, y, alpha = 1, lambda = best_lambda)
print(coef(final_lasso))
} else {
message("Skipping Lasso regression due to missing glmnet package.")
}
## Optimal Lambda for Lasso: 0.374138951508501
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -2.3062823830
## PrivateYes 1.9746908469
## Apps 0.0006889281
## Accept .
## Enroll .
## F.Undergrad .
## P.Undergrad -0.0013483737
## Outstate 0.0004515911
## Room.Board 0.0012986077
## Books .
## Personal -0.0022244024
## PhD 0.0761476962
## Terminal .
## S.F.Ratio .
## perc.alumni 0.3109259399
## Expend .
## EliteYes 3.5193309024
## Log_Outstate 4.9076892693
AIC(initial_model, reduced_model, interaction_model, transformed_model)
## df AIC
## initial_model 18 5552.803
## reduced_model 6 5666.036
## interaction_model 9 5645.757
## transformed_model 6 5685.367
adj_r2 <- c(
initial_model = summary(initial_model)$adj.r.squared,
reduced_model = summary(reduced_model)$adj.r.squared,
interaction_model = summary(interaction_model)$adj.r.squared,
transformed_model = summary(transformed_model)$adj.r.squared
)
print(adj_r2)
## initial_model reduced_model interaction_model transformed_model
## 0.4422926 0.3332164 0.3550010 0.3145467
cv_error <- cv.glm(mynewdata, transformed_model, K = 10)
cv_error$delta # 10-fold CV error
## [1] NaN NaN
final_model <- transformed_model
summary(final_model)
##
## Call:
## lm(formula = Grad.Rate ~ Log_Outstate + Private + Room.Board +
## Books, data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.954 -9.349 0.164 8.569 55.486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.216e+02 1.679e+01 -7.244 1.16e-12 ***
## Log_Outstate 1.948e+01 2.043e+00 9.534 < 2e-16 ***
## PrivateYes 4.980e-01 1.482e+00 0.336 0.73690
## Room.Board 1.972e-03 6.382e-04 3.089 0.00209 **
## Books -1.282e-03 3.321e-03 -0.386 0.69954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.97 on 695 degrees of freedom
## Multiple R-squared: 0.3185, Adjusted R-squared: 0.3145
## F-statistic: 81.19 on 4 and 695 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(final_model)
final_model_coefficients <- as.data.frame(summary(final_model)$coefficients)
write.csv(final_model_coefficients, "final_model_coefficients.csv")
final_model_summary <- summary(final_model)
final_model_metrics <- list(
r_squared = final_model_summary$r.squared,
adj_r_squared = final_model_summary$adj.r.squared,
residual_standard_error = final_model_summary$sigma,
f_statistic = final_model_summary$fstatistic[1]
)
final_model_metrics_df <- as.data.frame(final_model_metrics)
write.csv(final_model_metrics_df, "final_model_metrics.csv", row.names = FALSE)