MA717: Applied Regression and Experimental Data Analysis

1.1. Read “College.csv” file into R with following command and use dim() and head() to check if you read the data correct. You should report the number of observations and the number of variables. (5 \(\%\))

mydata<-read.csv("College.csv", header=T, stringsAsFactors=TRUE)
dim(mydata)

## [1] 775  17

head(mydata)

tail(mydata)

View(mydata)

print(paste("Number of observations: ", dim(mydata)[1]))

## [1] "Number of observations:  775"

print(paste("Number of variables: ", dim(mydata)[2]))

## [1] "Number of variables:  17"

**dim()- Returns the dimensions of the dataset as a numeric vector: the number of rows and columns.(Here it tells about no.of observations and no. of varibles)

head() and tail()- Dispaly top five and last five values of our varibles of Data Set.**

1.2. Use your registration number as random seed, generate a random subset of College data with sample size 700, name this new data as mynewdata. Use summary() to output the summarized information about mynewdata. Please report the number of private and public university and the number of Elite university and non-Elite university in this new data. (12 \(\%\))

set.seed(2404264)

mynewdata <- mydata[sample(nrow(mydata), 700), ]

summary(mynewdata)

##  Private        Apps             Accept            Enroll        F.Undergrad   
##  No :196   Min.   :   81.0   Min.   :   72.0   Min.   :  35.0   Min.   :  139  
##  Yes:504   1st Qu.:  796.8   1st Qu.:  622.2   1st Qu.: 243.8   1st Qu.:  999  
##            Median : 1600.0   Median : 1150.5   Median : 438.0   Median : 1720  
##            Mean   : 3049.8   Mean   : 2058.2   Mean   : 796.0   Mean   : 3796  
##            3rd Qu.: 3629.5   3rd Qu.: 2407.5   3rd Qu.: 902.2   3rd Qu.: 4140  
##            Max.   :48094.0   Max.   :26330.0   Max.   :6392.0   Max.   :31643  
##   P.Undergrad         Outstate       Room.Board       Books       
##  Min.   :    1.0   Min.   : 2340   Min.   :1780   Min.   :  96.0  
##  1st Qu.:   94.0   1st Qu.: 7248   1st Qu.:3600   1st Qu.: 467.2  
##  Median :  350.5   Median : 9900   Median :4192   Median : 500.0  
##  Mean   :  870.0   Mean   :10424   Mean   :4350   Mean   : 548.4  
##  3rd Qu.:  967.8   3rd Qu.:12897   3rd Qu.:5050   3rd Qu.: 600.0  
##  Max.   :21836.0   Max.   :21700   Max.   :8124   Max.   :2340.0  
##     Personal         PhD           Terminal        S.F.Ratio    
##  Min.   : 250   Min.   :  8.0   Min.   : 30.00   Min.   : 2.50  
##  1st Qu.: 850   1st Qu.: 62.0   1st Qu.: 71.00   1st Qu.:11.50  
##  Median :1200   Median : 75.0   Median : 82.00   Median :13.60  
##  Mean   :1339   Mean   : 72.6   Mean   : 79.84   Mean   :14.09  
##  3rd Qu.:1700   3rd Qu.: 85.0   3rd Qu.: 92.00   3rd Qu.:16.50  
##  Max.   :6800   Max.   :100.0   Max.   :100.00   Max.   :39.80  
##   perc.alumni        Expend        Grad.Rate      Elite    
##  Min.   : 0.00   Min.   : 3186   Min.   : 10.00   No :633  
##  1st Qu.:13.00   1st Qu.: 6744   1st Qu.: 53.00   Yes: 67  
##  Median :21.00   Median : 8361   Median : 65.00            
##  Mean   :22.72   Mean   : 9682   Mean   : 65.05            
##  3rd Qu.:31.00   3rd Qu.:10876   3rd Qu.: 77.00            
##  Max.   :64.00   Max.   :56233   Max.   :100.00

private_public_counts <- table(mynewdata$Private)

elite_non_elite_counts <- table(mynewdata$Elite)
elite_non_elite_counts

## 
##  No Yes 
## 633  67

# Print the results
print(paste("Number of private universities: ", private_public_counts["Yes"]))

## [1] "Number of private universities:  504"

print(paste("Number of public universities: ", private_public_counts["No"]))

## [1] "Number of public universities:  196"

print(paste("Number of Elite universities: ", elite_non_elite_counts["Yes"]))

## [1] "Number of Elite universities:  67"

print(paste("Number of non-Elite universities: ", elite_non_elite_counts["No"]))

## [1] "Number of non-Elite universities:  633"

There are 505 private and 73 elite out of 700 university

For each variable, the output provideded above tells the following: Minimum (Min.): The smallest value of the varible 1st Quartile (1st Qu.): The value below which 25% of the data falls. Median (Median): The middle value (50th percentile). Mean (Mean): The average value. 3rd Quartile (3rd Qu.): The value below which 75% of the data falls. Maximum (Max.): The largest value.

1.3. Use mynewdata, plot histogram plots of four variables “Outstate”, “Room.Board”, “Books” and “Personal”. Give each plot a suitable title and label for x axis and y axis. (8\(\%\))

hist(mynewdata$Outstate, main= "outstates", xlab="Dollars", ylab="students")

hist(mynewdata$Room.Board, main= "Room.Board", xlab="Dollars", ylab="students")

hist(mynewdata$Books,main= "Books", xlab="Dollars", ylab="students")

hist(mynewdata$Personal,main= "Personals", xlab="Dollars", ylab="students")

Comments: 1)outstates 1.1)The histogram is right-skewed, meaning most of the tuition values are concentrated on the lower end. 1.2)A significant number of institutions have tuition costs around 7,000 Dollars to 12,000 Dollars. These ranges have the tallest bars, indicating the highest frequency of values.

Room Board 2.1)The histogram appears to be right-skewed,indicating that most roomboard costs are concentrated at lower values. 2.2)Almost 70-75 percent of the costs fall between 3K and 5K dollars, as indicated by the tallest bars in this range 2.3)
Books 3.1) this hist. is right skewed. 3.2) Most of the all the institute books price range varies in arange of 400 -600 Dollars, with a very few books cost>1000 Dollars.
Personals 4.1) This is highly right skewed histogram. 4.2)Majority of the population spends between 500-2000 dollars on personal care.

2.1. Use mynewdata, do a linear regression fitting when outcome is “Grad.Rate” and predictors are “Private” and “Elite”. Show the R output and report what you have learned from this output (you need to discuss significance, adjusted R-squared and p-value of F-statistics). (6\(\%\)).

my_model<-lm(Grad.Rate ~ Private + Elite, data=mynewdata)
summary(my_model)

## 
## Call:
## lm(formula = Grad.Rate ~ Private + Elite, data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.462  -9.005   0.538  10.148  45.148 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   54.852      1.069  51.303   <2e-16 ***
## PrivateYes    11.610      1.260   9.211   <2e-16 ***
## EliteYes      19.196      1.924   9.979   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.91 on 697 degrees of freedom
## Multiple R-squared:  0.2261, Adjusted R-squared:  0.2239 
## F-statistic: 101.8 on 2 and 697 DF,  p-value: < 2.2e-16

print(paste('There is approximately no bias in the prediction because the median = 0.23'))

## [1] "There is approximately no bias in the prediction because the median = 0.23"

print(paste('2)With an Estimate Std. of 11.810 and 18.583 of "Private" and "Elite" respectively. Both are highly Significant as p<2e-16 also in both the cases(***)'))

## [1] "2)With an Estimate Std. of 11.810 and 18.583 of \"Private\" and \"Elite\" respectively. Both are highly Significant as p<2e-16 also in both the cases(***)"

print(paste('3)F-statistic(107.8, p < 2.2e-16)indicating the model is highly significant '))

## [1] "3)F-statistic(107.8, p < 2.2e-16)indicating the model is highly significant "

print(paste('4) we reject the null hypothesis that the predictors(Private + Elite) doesnot effect our model'))

## [1] "4) we reject the null hypothesis that the predictors(Private + Elite) doesnot effect our model"

1)There is approximately no bias in the prediction because the median = 0.48 2) Estimate of 11.771 and 18.244 for Private and Elite institute respectively (both positive) suggest that they influence 11.771 times and 18.244 times more than public and non elite institutes. 3) Both private and elite institute are highly significant predictors (p<<<0.05) with *** significance level. 4) we reject the null hypothesis that the predictors(Private + Elite) doesnot effect our model 5) Residual error is 15.19, Suggesting that, on average, the predictions made by the model are off by about 15.19 units 6)21.64% of the variation in the dependent variable(Grad. Rate) is explained by the independent variables(Private + Elite) in the model. 7)F-statistic(calculated) = 96.24 , infering that the model is a good fit overall.

2.2. Use the linear regression fitting result in 2.1, calculate the confidence intervals for the coefficients. Also give the prediction interval of “Grad.Rate” for a new data with Private=“Yes” and Elite=“No”. (4\(\%\))

confint(my_model)

##                 2.5 %   97.5 %
## (Intercept) 52.753012 56.95145
## PrivateYes   9.135059 14.08434
## EliteYes    15.419496 22.97296

PYEN <- data.frame(Private = "Yes", Elite = "No")

prediction <- predict(my_model, newdata=PYEN, interval = "prediction")

prediction

##        fit      lwr      upr
## 1 66.46193 37.16478 95.75909

#comments
print(paste('for private institute this means tht 95% confidence is there that then data will fall in between 9.135 and 14.084 for private'))

## [1] "for private institute this means tht 95% confidence is there that then data will fall in between 9.135 and 14.084 for private"

print(paste('for elite institute this means tht 95% confidence is there that then data will fall in between 15.419 and 22.973 for elite'))

## [1] "for elite institute this means tht 95% confidence is there that then data will fall in between 15.419 and 22.973 for elite"

print(paste('With 95% confidence, the true value of the dependent variable for PYEN is expected to fall between 37.16 and 95.76.'))

## [1] "With 95% confidence, the true value of the dependent variable for PYEN is expected to fall between 37.16 and 95.76."

comments 1)for private institute this means tht 95% confidence is there that then data will fall in between 9.135 and 14.084 for private 2)for elite institute this means tht 95% confidence is there that then data will fall in between 15.419 and 22.973 for elite 3)With 95% confidence, the true value of the dependent variable for PYEN is expected to fall between 37.16 and 95.76.

2.3 Use mynewdata, do a multiple linear regression fitting when outcome is “Grad.Rate”, all other variables as predictors. Show the R output and report what you have learned from this output (you need to discuss significance, adjusted R-squared and p-value of F-statistics). Is linear regression model in 2.3 better than linear regression in 2.1? Use ANOVA to justify your conclusion. (14%)

mydatafull<-lm(Grad.Rate ~., data=mynewdata) 
summary(mydatafull)

## 
## Call:
## lm(formula = Grad.Rate ~ ., data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.737  -6.933  -0.145   7.032  52.402 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.2867184  5.0890425   6.934 9.51e-12 ***
## PrivateYes   4.8579960  1.7536277   2.770 0.005753 ** 
## Apps         0.0017950  0.0004540   3.954 8.48e-05 ***
## Accept      -0.0017403  0.0008709  -1.998 0.046075 *  
## Enroll       0.0021284  0.0023255   0.915 0.360387    
## F.Undergrad -0.0001757  0.0004065  -0.432 0.665736    
## P.Undergrad -0.0014847  0.0003935  -3.773 0.000175 ***
## Outstate     0.0011137  0.0002401   4.639 4.20e-06 ***
## Room.Board   0.0011539  0.0006302   1.831 0.067539 .  
## Books       -0.0018664  0.0029919  -0.624 0.532939    
## Personal    -0.0021909  0.0008166  -2.683 0.007477 ** 
## PhD          0.1619072  0.0570072   2.840 0.004644 ** 
## Terminal    -0.0388527  0.0642770  -0.604 0.545741    
## S.F.Ratio    0.0081628  0.1715337   0.048 0.962059    
## perc.alumni  0.3001929  0.0506449   5.927 4.88e-09 ***
## Expend      -0.0004182  0.0001575  -2.655 0.008106 ** 
## EliteYes     4.4441537  2.1234405   2.093 0.036726 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.64 on 683 degrees of freedom
## Multiple R-squared:  0.455,  Adjusted R-squared:  0.4422 
## F-statistic: 35.64 on 16 and 683 DF,  p-value: < 2.2e-16

# Perform ANOVA to compare the two models
my_model<-lm(Grad.Rate ~ Private + Elite, data=mynewdata)
mydatafull<-lm(Grad.Rate ~., data=mynewdata) 
anova_result <- anova(my_model, mydatafull)

# View the ANOVA result
print(anova_result)

## Analysis of Variance Table
## 
## Model 1: Grad.Rate ~ Private + Elite
## Model 2: Grad.Rate ~ Private + Apps + Accept + Enroll + F.Undergrad + 
##     P.Undergrad + Outstate + Room.Board + Books + Personal + 
##     PhD + Terminal + S.F.Ratio + perc.alumni + Expend + Elite
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    697 154855                                  
## 2    683 109058 14     45797 20.487 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

print(paste('Since the p-value is very small, we reject the null hypothesis that both models fit equally well. Therefore, mydatafull model is a significantly better model than my_model.'))

## [1] "Since the p-value is very small, we reject the null hypothesis that both models fit equally well. Therefore, mydatafull model is a significantly better model than my_model."

Comments: Let us first discuss summary of mydatafull in 2.3 part

1)The predictor perc.alumini effect most followed by outstate, P.Undergrad, Apps with less significance Predictor are room.Board,private,PhD and Expend.(Decreasing P value,Also p<<0.05 and *** and ** significance) 2) Adjusted R square is 45.14 percent indicating the inclusion of more no. of predictor has perfected the model(model 2.3). 3)Although F-statistic for Model2.3 = 36.95 which has decreases significantly i.e.from 96.24 in model 2.1, but is still very much significant as the p(model 2.3)<<0.05).

MODEL 2.1 Vs MODEL 2.3 As per the above ANOVA table 1) RSS value for model 2.3 is 109997< RSS Model 2.1, meaning it explains more of the variance in the data compared to Model 2.1. This indicates a better fit for Model 2.3. 2) Sum Of Square is 50799, this expalin that additional predictors in model 2.3 expalin better my Grad.Rate 3)Since the p(Model 2.3)<<<0.05, Confirming that the model 2.3 is better than Linear regression Model 2.1

Conclusion:

Model 2.3 is better than Model 2.1

2.4. Use the diagnostic plots to look at the fitting of multiple linear regression in 2.3. Please comment what you have seen from those plots. (7%)

myfit.poly2 <- lm(Grad.Rate ~ poly(perc.alumni, 2) + poly(Outstate, 2), data = mynewdata)
summary(myfit.poly2)

## 
## Call:
## lm(formula = Grad.Rate ~ poly(perc.alumni, 2) + poly(Outstate, 
##     2), data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.789  -8.155   0.087   8.035  52.541 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            65.0486     0.5052 128.755  < 2e-16 ***
## poly(perc.alumni, 2)1 113.8126    16.2483   7.005 5.86e-12 ***
## poly(perc.alumni, 2)2 -34.2126    14.0999  -2.426   0.0155 *  
## poly(Outstate, 2)1    192.3375    16.2297  11.851  < 2e-16 ***
## poly(Outstate, 2)2     -0.7914    14.1212  -0.056   0.9553    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.37 on 695 degrees of freedom
## Multiple R-squared:  0.3795, Adjusted R-squared:  0.3759 
## F-statistic: 106.3 on 4 and 695 DF,  p-value: < 2.2e-16

myfit.poly3<-lm(Grad.Rate ~ poly(perc.alumni, 3, raw=T)+Outstate, data=mynewdata) 
summary(myfit.poly3)

## 
## Call:
## lm(formula = Grad.Rate ~ poly(perc.alumni, 3, raw = T) + Outstate, 
##     data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.243  -8.369   0.080   7.716  52.526 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    32.2142422  2.8684696  11.230   <2e-16 ***
## poly(perc.alumni, 3, raw = T)1  1.1333428  0.3688194   3.073   0.0022 ** 
## poly(perc.alumni, 3, raw = T)2 -0.0242706  0.0141201  -1.719   0.0861 .  
## poly(perc.alumni, 3, raw = T)3  0.0001997  0.0001586   1.259   0.2083    
## Outstate                        0.0017916  0.0001513  11.842   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.35 on 695 degrees of freedom
## Multiple R-squared:  0.3809, Adjusted R-squared:  0.3773 
## F-statistic: 106.9 on 4 and 695 DF,  p-value: < 2.2e-16

myfit.poly5<-lm(Grad.Rate~poly(perc.alumni, 5, raw=T)+Outstate, data=mynewdata) 
summary(myfit.poly5)

## 
## Call:
## lm(formula = Grad.Rate ~ poly(perc.alumni, 5, raw = T) + Outstate, 
##     data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.228  -8.336  -0.118   7.786  52.730 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.437e+01  4.959e+00   6.931 9.57e-12 ***
## poly(perc.alumni, 5, raw = T)1  7.236e-01  1.308e+00   0.553    0.580    
## poly(perc.alumni, 5, raw = T)2 -8.924e-03  1.215e-01  -0.073    0.941    
## poly(perc.alumni, 5, raw = T)3  4.735e-04  4.895e-03   0.097    0.923    
## poly(perc.alumni, 5, raw = T)4 -1.895e-05  8.759e-05  -0.216    0.829    
## poly(perc.alumni, 5, raw = T)5  2.000e-07  5.697e-07   0.351    0.726    
## Outstate                        1.792e-03  1.513e-04  11.841  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.35 on 693 degrees of freedom
## Multiple R-squared:  0.3823, Adjusted R-squared:  0.377 
## F-statistic:  71.5 on 6 and 693 DF,  p-value: < 2.2e-16

# Diagnostic plots for the multiple linear regression
par(mfrow = c(2, 2))  # Set up a 2x2 plotting layout
plot(myfit.poly2)

par(mfrow = c(2, 2))  # Set up a 2x2 plotting layout
plot(myfit.poly3)

par(mfrow = c(2, 2))  # Set up a 2x2 plotting layout
plot(myfit.poly5)

Residuals vs. Fitted: 1)This plot helps us to assess whether the residuals are randomly distributed and how much. 2)Forpoly2,we might see some curvature, indicating that a linear model might not be sufficient. As you move to degree 3 or 5, the residuals may appear more randomly distributed.

Q-Q Plot: 1)This plot assesses the normaliancy of the residual data points. 2)Deviations from the diagnol line indicating more non-normality. Higher-degree polynomials may show more deviations.

Scale-Location Plot: 1)This checks for constant variance of residual. 2)If the spread of residuals increases with fitted values, it suggests the variance of errors in a model is not the same for all observations, which can be more pronounced in higher-degree models.

Residuals vs. Leverage: 1)This plot show influential points. 2)as the poly degree is increased the Standard residuals, Leverage drecreases i.e Firstly leverage varies from 0 to 0.02 but now (Poly 5) leverage varies from 0.0 to 0.01

2.5. Use mynewdata, do a variable selection to choose the best model. You should use plots to justify how do you choose your best model. Use the selected predictors of your best model with outcome “Grad.Rate”, do a linear regression fitting and plot the diagnostic plots for this fitting. You can use either exhaustive, or forward, or backward selection method. (14%)

#Using forward selection method
#install.packages("leaps")
library(leaps)

myfit.fwd <- regsubsets(Grad.Rate ~ ., data = mynewdata, nvmax = 16, method = "forward")
summary(myfit.fwd)

## Subset selection object
## Call: regsubsets.formula(Grad.Rate ~ ., data = mynewdata, nvmax = 16, 
##     method = "forward")
## 16 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Outstate        FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## EliteYes        FALSE      FALSE
## 1 subsets of each size up to 16
## Selection Algorithm: forward
##           PrivateYes Apps Accept Enroll F.Undergrad P.Undergrad Outstate
## 1  ( 1 )  " "        " "  " "    " "    " "         " "         "*"     
## 2  ( 1 )  " "        " "  " "    " "    " "         " "         "*"     
## 3  ( 1 )  " "        "*"  " "    " "    " "         " "         "*"     
## 4  ( 1 )  " "        "*"  " "    " "    " "         "*"         "*"     
## 5  ( 1 )  " "        "*"  " "    " "    " "         "*"         "*"     
## 6  ( 1 )  " "        "*"  " "    " "    " "         "*"         "*"     
## 7  ( 1 )  "*"        "*"  " "    " "    " "         "*"         "*"     
## 8  ( 1 )  "*"        "*"  " "    " "    " "         "*"         "*"     
## 9  ( 1 )  "*"        "*"  " "    " "    " "         "*"         "*"     
## 10  ( 1 ) "*"        "*"  "*"    " "    " "         "*"         "*"     
## 11  ( 1 ) "*"        "*"  "*"    " "    " "         "*"         "*"     
## 12  ( 1 ) "*"        "*"  "*"    "*"    " "         "*"         "*"     
## 13  ( 1 ) "*"        "*"  "*"    "*"    " "         "*"         "*"     
## 14  ( 1 ) "*"        "*"  "*"    "*"    " "         "*"         "*"     
## 15  ( 1 ) "*"        "*"  "*"    "*"    "*"         "*"         "*"     
## 16  ( 1 ) "*"        "*"  "*"    "*"    "*"         "*"         "*"     
##           Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend
## 1  ( 1 )  " "        " "   " "      " " " "      " "       " "         " "   
## 2  ( 1 )  " "        " "   " "      " " " "      " "       "*"         " "   
## 3  ( 1 )  " "        " "   " "      " " " "      " "       "*"         " "   
## 4  ( 1 )  " "        " "   " "      " " " "      " "       "*"         " "   
## 5  ( 1 )  " "        " "   "*"      " " " "      " "       "*"         " "   
## 6  ( 1 )  " "        " "   "*"      "*" " "      " "       "*"         " "   
## 7  ( 1 )  " "        " "   "*"      "*" " "      " "       "*"         " "   
## 8  ( 1 )  " "        " "   "*"      "*" " "      " "       "*"         " "   
## 9  ( 1 )  " "        " "   "*"      "*" " "      " "       "*"         "*"   
## 10  ( 1 ) " "        " "   "*"      "*" " "      " "       "*"         "*"   
## 11  ( 1 ) "*"        " "   "*"      "*" " "      " "       "*"         "*"   
## 12  ( 1 ) "*"        " "   "*"      "*" " "      " "       "*"         "*"   
## 13  ( 1 ) "*"        " "   "*"      "*" "*"      " "       "*"         "*"   
## 14  ( 1 ) "*"        "*"   "*"      "*" "*"      " "       "*"         "*"   
## 15  ( 1 ) "*"        "*"   "*"      "*" "*"      " "       "*"         "*"   
## 16  ( 1 ) "*"        "*"   "*"      "*" "*"      "*"       "*"         "*"   
##           EliteYes
## 1  ( 1 )  " "     
## 2  ( 1 )  " "     
## 3  ( 1 )  " "     
## 4  ( 1 )  " "     
## 5  ( 1 )  " "     
## 6  ( 1 )  " "     
## 7  ( 1 )  " "     
## 8  ( 1 )  "*"     
## 9  ( 1 )  "*"     
## 10  ( 1 ) "*"     
## 11  ( 1 ) "*"     
## 12  ( 1 ) "*"     
## 13  ( 1 ) "*"     
## 14  ( 1 ) "*"     
## 15  ( 1 ) "*"     
## 16  ( 1 ) "*"

print(paste('Top prdictor selection order in forward is <---- Outstate,perc.alumni,Apps,P.Undergrad,Room.Board,phd,EliteYes,privateYes,Personal,Terminal,Accept,Books,Enroll,F.Undergrad,S.F.Ratio,Expend'))

## [1] "Top prdictor selection order in forward is <---- Outstate,perc.alumni,Apps,P.Undergrad,Room.Board,phd,EliteYes,privateYes,Personal,Terminal,Accept,Books,Enroll,F.Undergrad,S.F.Ratio,Expend"

myfit.fwd.sum <- summary(myfit.fwd)

par(mfrow=c(2,2))
plot(myfit.fwd.sum$rss)
plot(myfit.fwd.sum$adjr2)
plot(myfit.fwd.sum$cp)
plot(myfit.fwd.sum$bic)

which.max(myfit.fwd.sum$adjr2) #12

## [1] 11

which.min(myfit.fwd.sum$cp)#11

## [1] 11

which.min(myfit.fwd.sum$bic)#5

## [1] 7

coef(myfit.fwd, 6)

##   (Intercept)          Apps   P.Undergrad      Outstate      Personal 
## 38.7739545528  0.0008453855 -0.0017793762  0.0013605965 -0.0026138905 
##           PhD   perc.alumni 
##  0.1001651379  0.3208192162

Use mynewdata, discuss and perform any step(s) that you think that can improve the fitting in Task 2. You need to illustrate your work by using the R codes, output and discussion.

We must start by cleaning the data ,removing outlier and standarizing the numeric variables

# Load dataset
data <- read.csv("College.csv")

# Convert categorical columns
data$Private <- as.factor(data$Private)
data$Elite <- as.factor(data$Elite)

# Handle outliers
handle_outliers <- function(column) {
  Q1 <- quantile(data[[column]], 0.25)
  Q3 <- quantile(data[[column]], 0.75)
  IQR <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  data[[column]][data[[column]] < lower_bound] <- lower_bound
  data[[column]][data[[column]] > upper_bound] <- upper_bound
}
columns_to_check <- c("Books", "Room.Board", "Expend", "Grad.Rate")
for (col in columns_to_check) {
  handle_outliers(col)
}

# Standardize numeric variables
numeric_columns <- c("Apps", "Accept", "Enroll", "F.Undergrad", "P.Undergrad", 
                     "Outstate", "Room.Board", "Books", "Personal", "PhD", 
                     "Terminal", "S.F.Ratio", "perc.alumni", "Expend", "Grad.Rate")
data[numeric_columns] <- scale(data[numeric_columns])

# Save cleaned data
write.csv(data, "Cleaned_College.csv", row.names = FALSE)

#Top 3 predictor selection order in forward selection process

# Fitting Polynomial Models for Outstate
poly_model_2 <- lm(Grad.Rate ~ poly(Outstate, 2) + Private + Elite, data = mynewdata)
poly_model_3 <- lm(Grad.Rate ~ poly(Outstate, 3) + Private + Elite, data = mynewdata)
poly_model_5 <- lm(Grad.Rate ~ poly(Outstate, 5) + Private + Elite, data = mynewdata)

# Summarize the models for Outstate
summary(poly_model_2)

## 
## Call:
## lm(formula = Grad.Rate ~ poly(Outstate, 2) + Private + Elite, 
##     data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.860  -8.295   0.547   7.941  51.913 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          62.792      1.210  51.909   <2e-16 ***
## poly(Outstate, 2)1  213.161     18.150  11.744   <2e-16 ***
## poly(Outstate, 2)2  -19.402     14.921  -1.300    0.194    
## PrivateYes            1.789      1.474   1.213    0.225    
## EliteYes             10.120      1.994   5.074    5e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.63 on 695 degrees of freedom
## Multiple R-squared:  0.3543, Adjusted R-squared:  0.3506 
## F-statistic: 95.35 on 4 and 695 DF,  p-value: < 2.2e-16

summary(poly_model_3)

## 
## Call:
## lm(formula = Grad.Rate ~ poly(Outstate, 3) + Private + Elite, 
##     data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.227  -8.222   0.540   8.086  50.260 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          62.804      1.209  51.959  < 2e-16 ***
## poly(Outstate, 3)1  212.958     18.136  11.742  < 2e-16 ***
## poly(Outstate, 3)2  -19.782     14.912  -1.327    0.185    
## poly(Outstate, 3)3  -19.787     13.642  -1.450    0.147    
## PrivateYes            1.754      1.473   1.190    0.234    
## EliteYes             10.257      1.995   5.141 3.55e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.62 on 694 degrees of freedom
## Multiple R-squared:  0.3563, Adjusted R-squared:  0.3516 
## F-statistic: 76.82 on 5 and 694 DF,  p-value: < 2.2e-16

summary(poly_model_5)

## 
## Call:
## lm(formula = Grad.Rate ~ poly(Outstate, 5) + Private + Elite, 
##     data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.090  -7.750   0.605   7.973  48.895 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         63.0736     1.2255  51.468  < 2e-16 ***
## poly(Outstate, 5)1 214.6920    18.2478  11.765  < 2e-16 ***
## poly(Outstate, 5)2 -21.8125    14.9516  -1.459   0.1451    
## poly(Outstate, 5)3 -19.9728    13.6347  -1.465   0.1434    
## poly(Outstate, 5)4  23.3389    13.8849   1.681   0.0932 .  
## poly(Outstate, 5)5  -0.9222    13.8054  -0.067   0.9468    
## PrivateYes           1.3376     1.4998   0.892   0.3728    
## EliteYes            10.5726     2.0231   5.226 2.29e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.62 on 692 degrees of freedom
## Multiple R-squared:  0.3589, Adjusted R-squared:  0.3524 
## F-statistic: 55.34 on 7 and 692 DF,  p-value: < 2.2e-16

# Comments:
print(paste('the intercept Estimate  value is 63.516, refering that when all predictors are zero, the predicted value of Grad.Rate is 64.16.'))

## [1] "the intercept Estimate  value is 63.516, refering that when all predictors are zero, the predicted value of Grad.Rate is 64.16."

print(paste('The estimated coefficient is 230.06, meaning that for each unit increase in this component of Outstate, Grad.Rate increases by 230.06.'))

## [1] "The estimated coefficient is 230.06, meaning that for each unit increase in this component of Outstate, Grad.Rate increases by 230.06."

print(paste('Std. Error: This represents the standard error of each coefficient estimate. Smaller values indicate more precise estimates.'))

## [1] "Std. Error: This represents the standard error of each coefficient estimate. Smaller values indicate more precise estimates."

print(paste('poly(Outstate, 5)1 has a p-value of < 2e-16, indicating it is highly significant.'))

## [1] "poly(Outstate, 5)1 has a p-value of < 2e-16, indicating it is highly significant."

print(paste('PrivateYes has a p-value of 0.9664 is very high, indicating that this variable is not significant in predicting Grad.Rate.'))

## [1] "PrivateYes has a p-value of 0.9664 is very high, indicating that this variable is not significant in predicting Grad.Rate."

print(paste('EliteYes has a highly significant p-value of 1.91e-06'))

## [1] "EliteYes has a highly significant p-value of 1.91e-06"

print(paste('poly(Outstate, 5)1 and EliteYes are highly significant predictors, with very low p-values'))

## [1] "poly(Outstate, 5)1 and EliteYes are highly significant predictors, with very low p-values"

# Fitting Polynomial Models for perc.alumni
poly_model_2 <- lm(Grad.Rate ~ poly(perc.alumni, 2) + Private + Elite, data = mynewdata)
poly_model_3 <- lm(Grad.Rate ~ poly(perc.alumni, 3) + Private + Elite, data = mynewdata)
poly_model_5 <- lm(Grad.Rate ~ poly(perc.alumni, 5) + Private + Elite, data = mynewdata)

# Summarize the models for perc.alumni
summary(poly_model_2)

## 
## Call:
## lm(formula = Grad.Rate ~ poly(perc.alumni, 2) + Private + Elite, 
##     data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.174  -8.194   0.705   8.757  48.354 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             59.414      1.096  54.227  < 2e-16 ***
## poly(perc.alumni, 2)1  157.000     16.157   9.717  < 2e-16 ***
## poly(perc.alumni, 2)2  -43.647     14.259  -3.061  0.00229 ** 
## PrivateYes               5.915      1.304   4.537 6.71e-06 ***
## EliteYes                14.369      1.920   7.483 2.20e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.93 on 695 degrees of freedom
## Multiple R-squared:  0.3265, Adjusted R-squared:  0.3226 
## F-statistic: 84.23 on 4 and 695 DF,  p-value: < 2.2e-16

summary(poly_model_3)

## 
## Call:
## lm(formula = Grad.Rate ~ poly(perc.alumni, 3) + Private + Elite, 
##     data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.532  -8.281   0.622   8.656  48.426 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             59.394      1.096  54.203  < 2e-16 ***
## poly(perc.alumni, 3)1  157.152     16.157   9.727  < 2e-16 ***
## poly(perc.alumni, 3)2  -43.377     14.260  -3.042  0.00244 ** 
## poly(perc.alumni, 3)3   14.572     13.977   1.043  0.29750    
## PrivateYes               5.965      1.304   4.572 5.71e-06 ***
## EliteYes                14.213      1.926   7.379 4.55e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.92 on 694 degrees of freedom
## Multiple R-squared:  0.3276, Adjusted R-squared:  0.3227 
## F-statistic: 67.61 on 5 and 694 DF,  p-value: < 2.2e-16

summary(poly_model_5)

## 
## Call:
## lm(formula = Grad.Rate ~ poly(perc.alumni, 5) + Private + Elite, 
##     data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.932  -8.113   0.637   8.706  48.531 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             59.438      1.097  54.177  < 2e-16 ***
## poly(perc.alumni, 5)1  157.405     16.168   9.736  < 2e-16 ***
## poly(perc.alumni, 5)2  -43.509     14.269  -3.049  0.00238 ** 
## poly(perc.alumni, 5)3   14.530     13.985   1.039  0.29917    
## poly(perc.alumni, 5)4   14.474     13.946   1.038  0.29969    
## poly(perc.alumni, 5)5    5.299     13.935   0.380  0.70385    
## PrivateYes               5.900      1.307   4.516 7.42e-06 ***
## EliteYes                14.241      1.927   7.389 4.29e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.93 on 692 degrees of freedom
## Multiple R-squared:  0.3287, Adjusted R-squared:  0.322 
## F-statistic: 48.42 on 7 and 692 DF,  p-value: < 2.2e-16

# Fitting Polynomial Models for Apps
poly_model_2 <- lm(Grad.Rate ~ poly(Apps, 2) + Private + Elite, data = mynewdata)
poly_model_3 <- lm(Grad.Rate ~ poly(Apps, 3) + Private + Elite, data = mynewdata)
poly_model_5 <- lm(Grad.Rate ~ poly(Apps, 5) + Private + Elite, data = mynewdata)

# Summarize the models for Apps
summary(poly_model_2)

## 
## Call:
## lm(formula = Grad.Rate ~ poly(Apps, 2) + Private + Elite, data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.259  -8.663   0.264   9.825  50.818 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      51.204      1.138  45.014  < 2e-16 ***
## poly(Apps, 2)1  124.418     17.089   7.280 9.04e-13 ***
## poly(Apps, 2)2  -48.068     15.052  -3.193  0.00147 ** 
## PrivateYes       17.410      1.437  12.112  < 2e-16 ***
## EliteYes         13.681      1.993   6.863 1.49e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.35 on 695 degrees of freedom
## Multiple R-squared:  0.2849, Adjusted R-squared:  0.2808 
## F-statistic: 69.23 on 4 and 695 DF,  p-value: < 2.2e-16

summary(poly_model_3)

## 
## Call:
## lm(formula = Grad.Rate ~ poly(Apps, 3) + Private + Elite, data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.494  -8.538   0.412   9.619  51.460 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      50.543      1.142  44.266  < 2e-16 ***
## poly(Apps, 3)1  130.221     17.012   7.655 6.51e-14 ***
## poly(Apps, 3)2  -50.819     14.938  -3.402 0.000707 ***
## poly(Apps, 3)3   53.054     14.465   3.668 0.000263 ***
## PrivateYes       18.379      1.449  12.684  < 2e-16 ***
## EliteYes         13.300      1.979   6.722 3.73e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.22 on 694 degrees of freedom
## Multiple R-squared:  0.2985, Adjusted R-squared:  0.2935 
## F-statistic: 59.06 on 5 and 694 DF,  p-value: < 2.2e-16

summary(poly_model_5)

## 
## Call:
## lm(formula = Grad.Rate ~ poly(Apps, 5) + Private + Elite, data = mynewdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.318  -8.600   0.730   8.998  49.535 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      50.396      1.122  44.909  < 2e-16 ***
## poly(Apps, 5)1  132.897     16.700   7.958 7.15e-15 ***
## poly(Apps, 5)2  -52.184     14.658  -3.560 0.000396 ***
## poly(Apps, 5)3   53.585     14.192   3.776 0.000173 ***
## poly(Apps, 5)4  -53.758     13.981  -3.845 0.000132 ***
## poly(Apps, 5)5   52.737     13.995   3.768 0.000178 ***
## PrivateYes       18.666      1.424  13.104  < 2e-16 ***
## EliteYes         12.675      1.946   6.513 1.42e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.95 on 692 degrees of freedom
## Multiple R-squared:  0.3267, Adjusted R-squared:  0.3199 
## F-statistic: 47.98 on 7 and 692 DF,  p-value: < 2.2e-16

MA717: Applied Regression and Experimental Data Analysis

Assignment template