1.1. Read “College.csv” file into R with following command and use dim() and head() to check if you read the data correct. You should report the number of observations and the number of variables. (5 \(\%\))
mydata<-read.csv("College.csv", header=T, stringsAsFactors=TRUE)
dim(mydata)
## [1] 775 17
head(mydata)
tail(mydata)
View(mydata)
print(paste("Number of observations: ", dim(mydata)[1]))
## [1] "Number of observations: 775"
print(paste("Number of variables: ", dim(mydata)[2]))
## [1] "Number of variables: 17"
**dim()- Returns the dimensions of the dataset as a numeric vector: the number of rows and columns.(Here it tells about no.of observations and no. of varibles)
head() and tail()- Dispaly top five and last five values of our varibles of Data Set.**
1.2. Use your registration number as random seed, generate a random subset of College data with sample size 700, name this new data as mynewdata. Use summary() to output the summarized information about mynewdata. Please report the number of private and public university and the number of Elite university and non-Elite university in this new data. (12 \(\%\))
set.seed(2404264)
mynewdata <- mydata[sample(nrow(mydata), 700), ]
summary(mynewdata)
## Private Apps Accept Enroll F.Undergrad
## No :196 Min. : 81.0 Min. : 72.0 Min. : 35.0 Min. : 139
## Yes:504 1st Qu.: 796.8 1st Qu.: 622.2 1st Qu.: 243.8 1st Qu.: 999
## Median : 1600.0 Median : 1150.5 Median : 438.0 Median : 1720
## Mean : 3049.8 Mean : 2058.2 Mean : 796.0 Mean : 3796
## 3rd Qu.: 3629.5 3rd Qu.: 2407.5 3rd Qu.: 902.2 3rd Qu.: 4140
## Max. :48094.0 Max. :26330.0 Max. :6392.0 Max. :31643
## P.Undergrad Outstate Room.Board Books
## Min. : 1.0 Min. : 2340 Min. :1780 Min. : 96.0
## 1st Qu.: 94.0 1st Qu.: 7248 1st Qu.:3600 1st Qu.: 467.2
## Median : 350.5 Median : 9900 Median :4192 Median : 500.0
## Mean : 870.0 Mean :10424 Mean :4350 Mean : 548.4
## 3rd Qu.: 967.8 3rd Qu.:12897 3rd Qu.:5050 3rd Qu.: 600.0
## Max. :21836.0 Max. :21700 Max. :8124 Max. :2340.0
## Personal PhD Terminal S.F.Ratio
## Min. : 250 Min. : 8.0 Min. : 30.00 Min. : 2.50
## 1st Qu.: 850 1st Qu.: 62.0 1st Qu.: 71.00 1st Qu.:11.50
## Median :1200 Median : 75.0 Median : 82.00 Median :13.60
## Mean :1339 Mean : 72.6 Mean : 79.84 Mean :14.09
## 3rd Qu.:1700 3rd Qu.: 85.0 3rd Qu.: 92.00 3rd Qu.:16.50
## Max. :6800 Max. :100.0 Max. :100.00 Max. :39.80
## perc.alumni Expend Grad.Rate Elite
## Min. : 0.00 Min. : 3186 Min. : 10.00 No :633
## 1st Qu.:13.00 1st Qu.: 6744 1st Qu.: 53.00 Yes: 67
## Median :21.00 Median : 8361 Median : 65.00
## Mean :22.72 Mean : 9682 Mean : 65.05
## 3rd Qu.:31.00 3rd Qu.:10876 3rd Qu.: 77.00
## Max. :64.00 Max. :56233 Max. :100.00
private_public_counts <- table(mynewdata$Private)
elite_non_elite_counts <- table(mynewdata$Elite)
elite_non_elite_counts
##
## No Yes
## 633 67
# Print the results
print(paste("Number of private universities: ", private_public_counts["Yes"]))
## [1] "Number of private universities: 504"
print(paste("Number of public universities: ", private_public_counts["No"]))
## [1] "Number of public universities: 196"
print(paste("Number of Elite universities: ", elite_non_elite_counts["Yes"]))
## [1] "Number of Elite universities: 67"
print(paste("Number of non-Elite universities: ", elite_non_elite_counts["No"]))
## [1] "Number of non-Elite universities: 633"
There are 505 private and 73 elite out of 700 university
For each variable, the output provideded above tells the following: Minimum (Min.): The smallest value of the varible 1st Quartile (1st Qu.): The value below which 25% of the data falls. Median (Median): The middle value (50th percentile). Mean (Mean): The average value. 3rd Quartile (3rd Qu.): The value below which 75% of the data falls. Maximum (Max.): The largest value.
1.3. Use mynewdata, plot histogram plots of four variables “Outstate”, “Room.Board”, “Books” and “Personal”. Give each plot a suitable title and label for x axis and y axis. (8\(\%\))
hist(mynewdata$Outstate, main= "outstates", xlab="Dollars", ylab="students")
hist(mynewdata$Room.Board, main= "Room.Board", xlab="Dollars", ylab="students")
hist(mynewdata$Books,main= "Books", xlab="Dollars", ylab="students")
hist(mynewdata$Personal,main= "Personals", xlab="Dollars", ylab="students")
Comments: 1)outstates 1.1)The histogram is right-skewed, meaning most of
the tuition values are concentrated on the lower end. 1.2)A significant
number of institutions have tuition costs around 7,000 Dollars to 12,000
Dollars. These ranges have the tallest bars, indicating the highest
frequency of values.
Room Board 2.1)The histogram appears to be right-skewed,indicating that most roomboard costs are concentrated at lower values. 2.2)Almost 70-75 percent of the costs fall between 3K and 5K dollars, as indicated by the tallest bars in this range 2.3)
Books 3.1) this hist. is right skewed. 3.2) Most of the all the institute books price range varies in arange of 400 -600 Dollars, with a very few books cost>1000 Dollars.
Personals 4.1) This is highly right skewed histogram. 4.2)Majority of the population spends between 500-2000 dollars on personal care.
2.1. Use mynewdata, do a linear regression fitting when outcome is “Grad.Rate” and predictors are “Private” and “Elite”. Show the R output and report what you have learned from this output (you need to discuss significance, adjusted R-squared and p-value of F-statistics). (6\(\%\)).
my_model<-lm(Grad.Rate ~ Private + Elite, data=mynewdata)
summary(my_model)
##
## Call:
## lm(formula = Grad.Rate ~ Private + Elite, data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.462 -9.005 0.538 10.148 45.148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.852 1.069 51.303 <2e-16 ***
## PrivateYes 11.610 1.260 9.211 <2e-16 ***
## EliteYes 19.196 1.924 9.979 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.91 on 697 degrees of freedom
## Multiple R-squared: 0.2261, Adjusted R-squared: 0.2239
## F-statistic: 101.8 on 2 and 697 DF, p-value: < 2.2e-16
print(paste('There is approximately no bias in the prediction because the median = 0.23'))
## [1] "There is approximately no bias in the prediction because the median = 0.23"
print(paste('2)With an Estimate Std. of 11.810 and 18.583 of "Private" and "Elite" respectively. Both are highly Significant as p<2e-16 also in both the cases(***)'))
## [1] "2)With an Estimate Std. of 11.810 and 18.583 of \"Private\" and \"Elite\" respectively. Both are highly Significant as p<2e-16 also in both the cases(***)"
print(paste('3)F-statistic(107.8, p < 2.2e-16)indicating the model is highly significant '))
## [1] "3)F-statistic(107.8, p < 2.2e-16)indicating the model is highly significant "
print(paste('4) we reject the null hypothesis that the predictors(Private + Elite) doesnot effect our model'))
## [1] "4) we reject the null hypothesis that the predictors(Private + Elite) doesnot effect our model"
1)There is approximately no bias in the prediction because the median = 0.48 2) Estimate of 11.771 and 18.244 for Private and Elite institute respectively (both positive) suggest that they influence 11.771 times and 18.244 times more than public and non elite institutes. 3) Both private and elite institute are highly significant predictors (p<<<0.05) with *** significance level. 4) we reject the null hypothesis that the predictors(Private + Elite) doesnot effect our model 5) Residual error is 15.19, Suggesting that, on average, the predictions made by the model are off by about 15.19 units 6)21.64% of the variation in the dependent variable(Grad. Rate) is explained by the independent variables(Private + Elite) in the model. 7)F-statistic(calculated) = 96.24 , infering that the model is a good fit overall.
2.2. Use the linear regression fitting result in 2.1, calculate the confidence intervals for the coefficients. Also give the prediction interval of “Grad.Rate” for a new data with Private=“Yes” and Elite=“No”. (4\(\%\))
confint(my_model)
## 2.5 % 97.5 %
## (Intercept) 52.753012 56.95145
## PrivateYes 9.135059 14.08434
## EliteYes 15.419496 22.97296
PYEN <- data.frame(Private = "Yes", Elite = "No")
prediction <- predict(my_model, newdata=PYEN, interval = "prediction")
prediction
## fit lwr upr
## 1 66.46193 37.16478 95.75909
#comments
print(paste('for private institute this means tht 95% confidence is there that then data will fall in between 9.135 and 14.084 for private'))
## [1] "for private institute this means tht 95% confidence is there that then data will fall in between 9.135 and 14.084 for private"
print(paste('for elite institute this means tht 95% confidence is there that then data will fall in between 15.419 and 22.973 for elite'))
## [1] "for elite institute this means tht 95% confidence is there that then data will fall in between 15.419 and 22.973 for elite"
print(paste('With 95% confidence, the true value of the dependent variable for PYEN is expected to fall between 37.16 and 95.76.'))
## [1] "With 95% confidence, the true value of the dependent variable for PYEN is expected to fall between 37.16 and 95.76."
comments 1)for private institute this means tht 95% confidence is there that then data will fall in between 9.135 and 14.084 for private 2)for elite institute this means tht 95% confidence is there that then data will fall in between 15.419 and 22.973 for elite 3)With 95% confidence, the true value of the dependent variable for PYEN is expected to fall between 37.16 and 95.76.
2.3 Use mynewdata, do a multiple linear regression fitting when outcome is “Grad.Rate”, all other variables as predictors. Show the R output and report what you have learned from this output (you need to discuss significance, adjusted R-squared and p-value of F-statistics). Is linear regression model in 2.3 better than linear regression in 2.1? Use ANOVA to justify your conclusion. (14%)
mydatafull<-lm(Grad.Rate ~., data=mynewdata)
summary(mydatafull)
##
## Call:
## lm(formula = Grad.Rate ~ ., data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.737 -6.933 -0.145 7.032 52.402
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.2867184 5.0890425 6.934 9.51e-12 ***
## PrivateYes 4.8579960 1.7536277 2.770 0.005753 **
## Apps 0.0017950 0.0004540 3.954 8.48e-05 ***
## Accept -0.0017403 0.0008709 -1.998 0.046075 *
## Enroll 0.0021284 0.0023255 0.915 0.360387
## F.Undergrad -0.0001757 0.0004065 -0.432 0.665736
## P.Undergrad -0.0014847 0.0003935 -3.773 0.000175 ***
## Outstate 0.0011137 0.0002401 4.639 4.20e-06 ***
## Room.Board 0.0011539 0.0006302 1.831 0.067539 .
## Books -0.0018664 0.0029919 -0.624 0.532939
## Personal -0.0021909 0.0008166 -2.683 0.007477 **
## PhD 0.1619072 0.0570072 2.840 0.004644 **
## Terminal -0.0388527 0.0642770 -0.604 0.545741
## S.F.Ratio 0.0081628 0.1715337 0.048 0.962059
## perc.alumni 0.3001929 0.0506449 5.927 4.88e-09 ***
## Expend -0.0004182 0.0001575 -2.655 0.008106 **
## EliteYes 4.4441537 2.1234405 2.093 0.036726 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.64 on 683 degrees of freedom
## Multiple R-squared: 0.455, Adjusted R-squared: 0.4422
## F-statistic: 35.64 on 16 and 683 DF, p-value: < 2.2e-16
# Perform ANOVA to compare the two models
my_model<-lm(Grad.Rate ~ Private + Elite, data=mynewdata)
mydatafull<-lm(Grad.Rate ~., data=mynewdata)
anova_result <- anova(my_model, mydatafull)
# View the ANOVA result
print(anova_result)
## Analysis of Variance Table
##
## Model 1: Grad.Rate ~ Private + Elite
## Model 2: Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad +
## P.Undergrad + Outstate + Room.Board + Books + Personal +
## PhD + Terminal + S.F.Ratio + perc.alumni + Expend + Elite
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 697 154855
## 2 683 109058 14 45797 20.487 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
print(paste('Since the p-value is very small, we reject the null hypothesis that both models fit equally well. Therefore, mydatafull model is a significantly better model than my_model.'))
## [1] "Since the p-value is very small, we reject the null hypothesis that both models fit equally well. Therefore, mydatafull model is a significantly better model than my_model."
Comments: Let us first discuss summary of mydatafull in 2.3 part
1)The predictor perc.alumini effect most followed by outstate, P.Undergrad, Apps with less significance Predictor are room.Board,private,PhD and Expend.(Decreasing P value,Also p<<0.05 and *** and ** significance) 2) Adjusted R square is 45.14 percent indicating the inclusion of more no. of predictor has perfected the model(model 2.3). 3)Although F-statistic for Model2.3 = 36.95 which has decreases significantly i.e.from 96.24 in model 2.1, but is still very much significant as the p(model 2.3)<<0.05).
MODEL 2.1 Vs MODEL 2.3 As per the above ANOVA table 1) RSS value for model 2.3 is 109997< RSS Model 2.1, meaning it explains more of the variance in the data compared to Model 2.1. This indicates a better fit for Model 2.3. 2) Sum Of Square is 50799, this expalin that additional predictors in model 2.3 expalin better my Grad.Rate 3)Since the p(Model 2.3)<<<0.05, Confirming that the model 2.3 is better than Linear regression Model 2.1
Conclusion:
Model 2.3 is better than Model 2.1
2.4. Use the diagnostic plots to look at the fitting of multiple linear regression in 2.3. Please comment what you have seen from those plots. (7%)
myfit.poly2 <- lm(Grad.Rate ~ poly(perc.alumni, 2) + poly(Outstate, 2), data = mynewdata)
summary(myfit.poly2)
##
## Call:
## lm(formula = Grad.Rate ~ poly(perc.alumni, 2) + poly(Outstate,
## 2), data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.789 -8.155 0.087 8.035 52.541
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.0486 0.5052 128.755 < 2e-16 ***
## poly(perc.alumni, 2)1 113.8126 16.2483 7.005 5.86e-12 ***
## poly(perc.alumni, 2)2 -34.2126 14.0999 -2.426 0.0155 *
## poly(Outstate, 2)1 192.3375 16.2297 11.851 < 2e-16 ***
## poly(Outstate, 2)2 -0.7914 14.1212 -0.056 0.9553
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.37 on 695 degrees of freedom
## Multiple R-squared: 0.3795, Adjusted R-squared: 0.3759
## F-statistic: 106.3 on 4 and 695 DF, p-value: < 2.2e-16
myfit.poly3<-lm(Grad.Rate ~ poly(perc.alumni, 3, raw=T)+Outstate, data=mynewdata)
summary(myfit.poly3)
##
## Call:
## lm(formula = Grad.Rate ~ poly(perc.alumni, 3, raw = T) + Outstate,
## data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.243 -8.369 0.080 7.716 52.526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.2142422 2.8684696 11.230 <2e-16 ***
## poly(perc.alumni, 3, raw = T)1 1.1333428 0.3688194 3.073 0.0022 **
## poly(perc.alumni, 3, raw = T)2 -0.0242706 0.0141201 -1.719 0.0861 .
## poly(perc.alumni, 3, raw = T)3 0.0001997 0.0001586 1.259 0.2083
## Outstate 0.0017916 0.0001513 11.842 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.35 on 695 degrees of freedom
## Multiple R-squared: 0.3809, Adjusted R-squared: 0.3773
## F-statistic: 106.9 on 4 and 695 DF, p-value: < 2.2e-16
myfit.poly5<-lm(Grad.Rate~poly(perc.alumni, 5, raw=T)+Outstate, data=mynewdata)
summary(myfit.poly5)
##
## Call:
## lm(formula = Grad.Rate ~ poly(perc.alumni, 5, raw = T) + Outstate,
## data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.228 -8.336 -0.118 7.786 52.730
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.437e+01 4.959e+00 6.931 9.57e-12 ***
## poly(perc.alumni, 5, raw = T)1 7.236e-01 1.308e+00 0.553 0.580
## poly(perc.alumni, 5, raw = T)2 -8.924e-03 1.215e-01 -0.073 0.941
## poly(perc.alumni, 5, raw = T)3 4.735e-04 4.895e-03 0.097 0.923
## poly(perc.alumni, 5, raw = T)4 -1.895e-05 8.759e-05 -0.216 0.829
## poly(perc.alumni, 5, raw = T)5 2.000e-07 5.697e-07 0.351 0.726
## Outstate 1.792e-03 1.513e-04 11.841 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.35 on 693 degrees of freedom
## Multiple R-squared: 0.3823, Adjusted R-squared: 0.377
## F-statistic: 71.5 on 6 and 693 DF, p-value: < 2.2e-16
# Diagnostic plots for the multiple linear regression
par(mfrow = c(2, 2)) # Set up a 2x2 plotting layout
plot(myfit.poly2)
par(mfrow = c(2, 2)) # Set up a 2x2 plotting layout
plot(myfit.poly3)
par(mfrow = c(2, 2)) # Set up a 2x2 plotting layout
plot(myfit.poly5)
Residuals vs. Fitted: 1)This plot helps us to assess whether the residuals are randomly distributed and how much. 2)Forpoly2,we might see some curvature, indicating that a linear model might not be sufficient. As you move to degree 3 or 5, the residuals may appear more randomly distributed.
Q-Q Plot: 1)This plot assesses the normaliancy of the residual data points. 2)Deviations from the diagnol line indicating more non-normality. Higher-degree polynomials may show more deviations.
Scale-Location Plot: 1)This checks for constant variance of residual. 2)If the spread of residuals increases with fitted values, it suggests the variance of errors in a model is not the same for all observations, which can be more pronounced in higher-degree models.
Residuals vs. Leverage: 1)This plot show influential points. 2)as the poly degree is increased the Standard residuals, Leverage drecreases i.e Firstly leverage varies from 0 to 0.02 but now (Poly 5) leverage varies from 0.0 to 0.01
2.5. Use mynewdata, do a variable selection to choose the best model. You should use plots to justify how do you choose your best model. Use the selected predictors of your best model with outcome “Grad.Rate”, do a linear regression fitting and plot the diagnostic plots for this fitting. You can use either exhaustive, or forward, or backward selection method. (14%)
#Using forward selection method
#install.packages("leaps")
library(leaps)
myfit.fwd <- regsubsets(Grad.Rate ~ ., data = mynewdata, nvmax = 16, method = "forward")
summary(myfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Grad.Rate ~ ., data = mynewdata, nvmax = 16,
## method = "forward")
## 16 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Apps FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Outstate FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## EliteYes FALSE FALSE
## 1 subsets of each size up to 16
## Selection Algorithm: forward
## PrivateYes Apps Accept Enroll F.Undergrad P.Undergrad Outstate
## 1 ( 1 ) " " " " " " " " " " " " "*"
## 2 ( 1 ) " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " "*" "*"
## 5 ( 1 ) " " "*" " " " " " " "*" "*"
## 6 ( 1 ) " " "*" " " " " " " "*" "*"
## 7 ( 1 ) "*" "*" " " " " " " "*" "*"
## 8 ( 1 ) "*" "*" " " " " " " "*" "*"
## 9 ( 1 ) "*" "*" " " " " " " "*" "*"
## 10 ( 1 ) "*" "*" "*" " " " " "*" "*"
## 11 ( 1 ) "*" "*" "*" " " " " "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " "*" " "
## 4 ( 1 ) " " " " " " " " " " " " "*" " "
## 5 ( 1 ) " " " " "*" " " " " " " "*" " "
## 6 ( 1 ) " " " " "*" "*" " " " " "*" " "
## 7 ( 1 ) " " " " "*" "*" " " " " "*" " "
## 8 ( 1 ) " " " " "*" "*" " " " " "*" " "
## 9 ( 1 ) " " " " "*" "*" " " " " "*" "*"
## 10 ( 1 ) " " " " "*" "*" " " " " "*" "*"
## 11 ( 1 ) "*" " " "*" "*" " " " " "*" "*"
## 12 ( 1 ) "*" " " "*" "*" " " " " "*" "*"
## 13 ( 1 ) "*" " " "*" "*" "*" " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## EliteYes
## 1 ( 1 ) " "
## 2 ( 1 ) " "
## 3 ( 1 ) " "
## 4 ( 1 ) " "
## 5 ( 1 ) " "
## 6 ( 1 ) " "
## 7 ( 1 ) " "
## 8 ( 1 ) "*"
## 9 ( 1 ) "*"
## 10 ( 1 ) "*"
## 11 ( 1 ) "*"
## 12 ( 1 ) "*"
## 13 ( 1 ) "*"
## 14 ( 1 ) "*"
## 15 ( 1 ) "*"
## 16 ( 1 ) "*"
print(paste('Top prdictor selection order in forward is <---- Outstate,perc.alumni,Apps,P.Undergrad,Room.Board,phd,EliteYes,privateYes,Personal,Terminal,Accept,Books,Enroll,F.Undergrad,S.F.Ratio,Expend'))
## [1] "Top prdictor selection order in forward is <---- Outstate,perc.alumni,Apps,P.Undergrad,Room.Board,phd,EliteYes,privateYes,Personal,Terminal,Accept,Books,Enroll,F.Undergrad,S.F.Ratio,Expend"
myfit.fwd.sum <- summary(myfit.fwd)
par(mfrow=c(2,2))
plot(myfit.fwd.sum$rss)
plot(myfit.fwd.sum$adjr2)
plot(myfit.fwd.sum$cp)
plot(myfit.fwd.sum$bic)
which.max(myfit.fwd.sum$adjr2) #12
## [1] 11
which.min(myfit.fwd.sum$cp)#11
## [1] 11
which.min(myfit.fwd.sum$bic)#5
## [1] 7
coef(myfit.fwd, 6)
## (Intercept) Apps P.Undergrad Outstate Personal
## 38.7739545528 0.0008453855 -0.0017793762 0.0013605965 -0.0026138905
## PhD perc.alumni
## 0.1001651379 0.3208192162
Use mynewdata, discuss and perform any step(s) that you think that can improve the fitting in Task 2. You need to illustrate your work by using the R codes, output and discussion.
We must start by cleaning the data ,removing outlier and standarizing the numeric variables
# Load dataset
data <- read.csv("College.csv")
# Convert categorical columns
data$Private <- as.factor(data$Private)
data$Elite <- as.factor(data$Elite)
# Handle outliers
handle_outliers <- function(column) {
Q1 <- quantile(data[[column]], 0.25)
Q3 <- quantile(data[[column]], 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
data[[column]][data[[column]] < lower_bound] <- lower_bound
data[[column]][data[[column]] > upper_bound] <- upper_bound
}
columns_to_check <- c("Books", "Room.Board", "Expend", "Grad.Rate")
for (col in columns_to_check) {
handle_outliers(col)
}
# Standardize numeric variables
numeric_columns <- c("Apps", "Accept", "Enroll", "F.Undergrad", "P.Undergrad",
"Outstate", "Room.Board", "Books", "Personal", "PhD",
"Terminal", "S.F.Ratio", "perc.alumni", "Expend", "Grad.Rate")
data[numeric_columns] <- scale(data[numeric_columns])
# Save cleaned data
write.csv(data, "Cleaned_College.csv", row.names = FALSE)
#Top 3 predictor selection order in forward selection process
# Fitting Polynomial Models for Outstate
poly_model_2 <- lm(Grad.Rate ~ poly(Outstate, 2) + Private + Elite, data = mynewdata)
poly_model_3 <- lm(Grad.Rate ~ poly(Outstate, 3) + Private + Elite, data = mynewdata)
poly_model_5 <- lm(Grad.Rate ~ poly(Outstate, 5) + Private + Elite, data = mynewdata)
# Summarize the models for Outstate
summary(poly_model_2)
##
## Call:
## lm(formula = Grad.Rate ~ poly(Outstate, 2) + Private + Elite,
## data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.860 -8.295 0.547 7.941 51.913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.792 1.210 51.909 <2e-16 ***
## poly(Outstate, 2)1 213.161 18.150 11.744 <2e-16 ***
## poly(Outstate, 2)2 -19.402 14.921 -1.300 0.194
## PrivateYes 1.789 1.474 1.213 0.225
## EliteYes 10.120 1.994 5.074 5e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.63 on 695 degrees of freedom
## Multiple R-squared: 0.3543, Adjusted R-squared: 0.3506
## F-statistic: 95.35 on 4 and 695 DF, p-value: < 2.2e-16
summary(poly_model_3)
##
## Call:
## lm(formula = Grad.Rate ~ poly(Outstate, 3) + Private + Elite,
## data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.227 -8.222 0.540 8.086 50.260
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.804 1.209 51.959 < 2e-16 ***
## poly(Outstate, 3)1 212.958 18.136 11.742 < 2e-16 ***
## poly(Outstate, 3)2 -19.782 14.912 -1.327 0.185
## poly(Outstate, 3)3 -19.787 13.642 -1.450 0.147
## PrivateYes 1.754 1.473 1.190 0.234
## EliteYes 10.257 1.995 5.141 3.55e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.62 on 694 degrees of freedom
## Multiple R-squared: 0.3563, Adjusted R-squared: 0.3516
## F-statistic: 76.82 on 5 and 694 DF, p-value: < 2.2e-16
summary(poly_model_5)
##
## Call:
## lm(formula = Grad.Rate ~ poly(Outstate, 5) + Private + Elite,
## data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.090 -7.750 0.605 7.973 48.895
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.0736 1.2255 51.468 < 2e-16 ***
## poly(Outstate, 5)1 214.6920 18.2478 11.765 < 2e-16 ***
## poly(Outstate, 5)2 -21.8125 14.9516 -1.459 0.1451
## poly(Outstate, 5)3 -19.9728 13.6347 -1.465 0.1434
## poly(Outstate, 5)4 23.3389 13.8849 1.681 0.0932 .
## poly(Outstate, 5)5 -0.9222 13.8054 -0.067 0.9468
## PrivateYes 1.3376 1.4998 0.892 0.3728
## EliteYes 10.5726 2.0231 5.226 2.29e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.62 on 692 degrees of freedom
## Multiple R-squared: 0.3589, Adjusted R-squared: 0.3524
## F-statistic: 55.34 on 7 and 692 DF, p-value: < 2.2e-16
# Comments:
print(paste('the intercept Estimate value is 63.516, refering that when all predictors are zero, the predicted value of Grad.Rate is 64.16.'))
## [1] "the intercept Estimate value is 63.516, refering that when all predictors are zero, the predicted value of Grad.Rate is 64.16."
print(paste('The estimated coefficient is 230.06, meaning that for each unit increase in this component of Outstate, Grad.Rate increases by 230.06.'))
## [1] "The estimated coefficient is 230.06, meaning that for each unit increase in this component of Outstate, Grad.Rate increases by 230.06."
print(paste('Std. Error: This represents the standard error of each coefficient estimate. Smaller values indicate more precise estimates.'))
## [1] "Std. Error: This represents the standard error of each coefficient estimate. Smaller values indicate more precise estimates."
print(paste('poly(Outstate, 5)1 has a p-value of < 2e-16, indicating it is highly significant.'))
## [1] "poly(Outstate, 5)1 has a p-value of < 2e-16, indicating it is highly significant."
print(paste('PrivateYes has a p-value of 0.9664 is very high, indicating that this variable is not significant in predicting Grad.Rate.'))
## [1] "PrivateYes has a p-value of 0.9664 is very high, indicating that this variable is not significant in predicting Grad.Rate."
print(paste('EliteYes has a highly significant p-value of 1.91e-06'))
## [1] "EliteYes has a highly significant p-value of 1.91e-06"
print(paste('poly(Outstate, 5)1 and EliteYes are highly significant predictors, with very low p-values'))
## [1] "poly(Outstate, 5)1 and EliteYes are highly significant predictors, with very low p-values"
# Fitting Polynomial Models for perc.alumni
poly_model_2 <- lm(Grad.Rate ~ poly(perc.alumni, 2) + Private + Elite, data = mynewdata)
poly_model_3 <- lm(Grad.Rate ~ poly(perc.alumni, 3) + Private + Elite, data = mynewdata)
poly_model_5 <- lm(Grad.Rate ~ poly(perc.alumni, 5) + Private + Elite, data = mynewdata)
# Summarize the models for perc.alumni
summary(poly_model_2)
##
## Call:
## lm(formula = Grad.Rate ~ poly(perc.alumni, 2) + Private + Elite,
## data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.174 -8.194 0.705 8.757 48.354
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.414 1.096 54.227 < 2e-16 ***
## poly(perc.alumni, 2)1 157.000 16.157 9.717 < 2e-16 ***
## poly(perc.alumni, 2)2 -43.647 14.259 -3.061 0.00229 **
## PrivateYes 5.915 1.304 4.537 6.71e-06 ***
## EliteYes 14.369 1.920 7.483 2.20e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.93 on 695 degrees of freedom
## Multiple R-squared: 0.3265, Adjusted R-squared: 0.3226
## F-statistic: 84.23 on 4 and 695 DF, p-value: < 2.2e-16
summary(poly_model_3)
##
## Call:
## lm(formula = Grad.Rate ~ poly(perc.alumni, 3) + Private + Elite,
## data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.532 -8.281 0.622 8.656 48.426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.394 1.096 54.203 < 2e-16 ***
## poly(perc.alumni, 3)1 157.152 16.157 9.727 < 2e-16 ***
## poly(perc.alumni, 3)2 -43.377 14.260 -3.042 0.00244 **
## poly(perc.alumni, 3)3 14.572 13.977 1.043 0.29750
## PrivateYes 5.965 1.304 4.572 5.71e-06 ***
## EliteYes 14.213 1.926 7.379 4.55e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.92 on 694 degrees of freedom
## Multiple R-squared: 0.3276, Adjusted R-squared: 0.3227
## F-statistic: 67.61 on 5 and 694 DF, p-value: < 2.2e-16
summary(poly_model_5)
##
## Call:
## lm(formula = Grad.Rate ~ poly(perc.alumni, 5) + Private + Elite,
## data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.932 -8.113 0.637 8.706 48.531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.438 1.097 54.177 < 2e-16 ***
## poly(perc.alumni, 5)1 157.405 16.168 9.736 < 2e-16 ***
## poly(perc.alumni, 5)2 -43.509 14.269 -3.049 0.00238 **
## poly(perc.alumni, 5)3 14.530 13.985 1.039 0.29917
## poly(perc.alumni, 5)4 14.474 13.946 1.038 0.29969
## poly(perc.alumni, 5)5 5.299 13.935 0.380 0.70385
## PrivateYes 5.900 1.307 4.516 7.42e-06 ***
## EliteYes 14.241 1.927 7.389 4.29e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.93 on 692 degrees of freedom
## Multiple R-squared: 0.3287, Adjusted R-squared: 0.322
## F-statistic: 48.42 on 7 and 692 DF, p-value: < 2.2e-16
# Fitting Polynomial Models for Apps
poly_model_2 <- lm(Grad.Rate ~ poly(Apps, 2) + Private + Elite, data = mynewdata)
poly_model_3 <- lm(Grad.Rate ~ poly(Apps, 3) + Private + Elite, data = mynewdata)
poly_model_5 <- lm(Grad.Rate ~ poly(Apps, 5) + Private + Elite, data = mynewdata)
# Summarize the models for Apps
summary(poly_model_2)
##
## Call:
## lm(formula = Grad.Rate ~ poly(Apps, 2) + Private + Elite, data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.259 -8.663 0.264 9.825 50.818
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.204 1.138 45.014 < 2e-16 ***
## poly(Apps, 2)1 124.418 17.089 7.280 9.04e-13 ***
## poly(Apps, 2)2 -48.068 15.052 -3.193 0.00147 **
## PrivateYes 17.410 1.437 12.112 < 2e-16 ***
## EliteYes 13.681 1.993 6.863 1.49e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.35 on 695 degrees of freedom
## Multiple R-squared: 0.2849, Adjusted R-squared: 0.2808
## F-statistic: 69.23 on 4 and 695 DF, p-value: < 2.2e-16
summary(poly_model_3)
##
## Call:
## lm(formula = Grad.Rate ~ poly(Apps, 3) + Private + Elite, data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.494 -8.538 0.412 9.619 51.460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.543 1.142 44.266 < 2e-16 ***
## poly(Apps, 3)1 130.221 17.012 7.655 6.51e-14 ***
## poly(Apps, 3)2 -50.819 14.938 -3.402 0.000707 ***
## poly(Apps, 3)3 53.054 14.465 3.668 0.000263 ***
## PrivateYes 18.379 1.449 12.684 < 2e-16 ***
## EliteYes 13.300 1.979 6.722 3.73e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.22 on 694 degrees of freedom
## Multiple R-squared: 0.2985, Adjusted R-squared: 0.2935
## F-statistic: 59.06 on 5 and 694 DF, p-value: < 2.2e-16
summary(poly_model_5)
##
## Call:
## lm(formula = Grad.Rate ~ poly(Apps, 5) + Private + Elite, data = mynewdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.318 -8.600 0.730 8.998 49.535
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.396 1.122 44.909 < 2e-16 ***
## poly(Apps, 5)1 132.897 16.700 7.958 7.15e-15 ***
## poly(Apps, 5)2 -52.184 14.658 -3.560 0.000396 ***
## poly(Apps, 5)3 53.585 14.192 3.776 0.000173 ***
## poly(Apps, 5)4 -53.758 13.981 -3.845 0.000132 ***
## poly(Apps, 5)5 52.737 13.995 3.768 0.000178 ***
## PrivateYes 18.666 1.424 13.104 < 2e-16 ***
## EliteYes 12.675 1.946 6.513 1.42e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.95 on 692 degrees of freedom
## Multiple R-squared: 0.3267, Adjusted R-squared: 0.3199
## F-statistic: 47.98 on 7 and 692 DF, p-value: < 2.2e-16