Fei Tian is a newly founded college, so I am interested in how to make it better known and attract more applicants. U.S. News and World Report’s college data include the number of applications each school received, so I use them to explore what makes some colleges receive more applications than others.
The College dataset in the ISLR2 package contains statistics for 777 US colleges from the 1995 issue of US News and World Report. The variables I am going to study are as follows:
Apps: Number of applications received
Outstate: Out-of-state tuition
PhD: Pct. of faculty with Ph.D.’s
S.F.Ratio: Student/faculty ratio
perc.alumni: Pct. alumni who donate
Expend: Instructional expenditure per student
Grad.Rate: Graduation rate
library(ISLR2)   # College data
library(dplyr)   # %>%, mutate(), select()
library(GGally)  # ggpairs()
college <- College %>%
  mutate(Accept.Rate = Accept/Apps*100) %>%  # acceptance rate in percent
  dplyr::select(Apps, Outstate, PhD, S.F.Ratio, perc.alumni, Expend, Grad.Rate, Accept.Rate)
summary(college)
      Apps          Outstate          PhD           S.F.Ratio      perc.alumni
 Min.   :   81   Min.   : 2340   Min.   :  8.00   Min.   : 2.50   Min.   : 0.00
 1st Qu.:  776   1st Qu.: 7320   1st Qu.: 62.00   1st Qu.:11.50   1st Qu.:13.00
 Median : 1558   Median : 9990   Median : 75.00   Median :13.60   Median :21.00
 Mean   : 3002   Mean   :10441   Mean   : 72.66   Mean   :14.09   Mean   :22.74
 3rd Qu.: 3624   3rd Qu.:12925   3rd Qu.: 85.00   3rd Qu.:16.50   3rd Qu.:31.00
 Max.   :48094   Max.   :21700   Max.   :103.00   Max.   :39.80   Max.   :64.00
     Expend        Grad.Rate       Accept.Rate
 Min.   : 3186   Min.   : 10.00   Min.   : 15.45
 1st Qu.: 6751   1st Qu.: 53.00   1st Qu.: 67.56
 Median : 8377   Median : 65.00   Median : 77.88
 Mean   : 9660   Mean   : 65.46   Mean   : 74.69
 3rd Qu.:10830   3rd Qu.: 78.00   3rd Qu.: 84.85
 Max.   :56233   Max.   :118.00   Max.   :100.00
ggpairs(college)
lm.fit <- lm(Apps ~ Outstate + PhD + S.F.Ratio + perc.alumni + Expend + Grad.Rate + Accept.Rate)
summary(lm.fit)
Call:
lm(formula = Apps ~ Outstate + PhD + S.F.Ratio + perc.alumni +
    Expend + Grad.Rate + Accept.Rate)

Residuals:
   Min     1Q Median     3Q    Max
 -5881  -1719   -503    946  40384

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.274e+03  1.324e+03  -0.962 0.336183
Outstate    -1.618e-01  4.659e-02  -3.473 0.000544 ***
PhD          6.798e+01  8.144e+00   8.347 3.24e-16 ***
S.F.Ratio    1.972e+02  3.839e+01   5.136 3.56e-07 ***
perc.alumni -6.568e+01  1.168e+01  -5.625 2.59e-08 ***
Expend       2.289e-01  3.512e-02   6.519 1.28e-10 ***
Grad.Rate    3.036e+01  8.572e+00   3.542 0.000421 ***
Accept.Rate -5.968e+01  8.900e+00  -6.705 3.89e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3180 on 769 degrees of freedom
Multiple R-squared: 0.3309, Adjusted R-squared: 0.3248
F-statistic: 54.33 on 7 and 769 DF, p-value: < 2.2e-16
Comment: The linear model summary above shows that all seven selected predictors are statistically significant (p < 0.001) for the number of applications a college receives. Together they explain about 33% of the variance in applications.
anova_result <- aov(Apps ~ Outstate + PhD + S.F.Ratio + perc.alumni + Expend + Grad.Rate + Accept.Rate)
summary(anova_result)
             Df    Sum Sq   Mean Sq F value   Pr(>F)
Outstate      1 2.924e+07 2.924e+07   2.892 0.089451 .
PhD           1 1.880e+09 1.880e+09 185.867  < 2e-16 ***
S.F.Ratio     1 1.309e+08 1.309e+08  12.942 0.000342 ***
perc.alumni   1 2.549e+08 2.549e+08  25.206 6.41e-07 ***
Expend        1 8.517e+08 8.517e+08  84.217  < 2e-16 ***
Grad.Rate     1 2.448e+08 2.448e+08  24.202 1.06e-06 ***
Accept.Rate   1 4.547e+08 4.547e+08  44.963 3.89e-11 ***
Residuals   769 7.777e+09 1.011e+07
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
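Comment: aov() reports sequential (Type I) sums of squares, so each predictor is tested after adjusting only for the predictors listed before it. Out-of-state tuition is entered first and is therefore tested on its own (p = 0.089), which suggests that, by itself, it would not significantly affect applications.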
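To see how much the ANOVA result for out-of-state tuition depends on the order of the predictors, the sequential table can be put next to marginal tests from drop1(). This is only a sketch: it rebuilds the data with base R's transform() instead of the dplyr pipeline, and the names full_fit, seq_table and marg_table are my own, not part of the analysis above.

```r
# Sketch only: rebuilds the analysis data with base R so the block runs on its own.
library(ISLR2)  # provides the College data frame

college <- transform(College, Accept.Rate = Accept / Apps * 100)

full_fit <- lm(Apps ~ Outstate + PhD + S.F.Ratio + perc.alumni +
                 Expend + Grad.Rate + Accept.Rate, data = college)

# Sequential (Type I) table: each term is adjusted only for the terms listed before it
seq_table <- anova(full_fit)

# Marginal tests: each term is adjusted for all the other terms in the model
marg_table <- drop1(full_fit, test = "F")

print(seq_table["Outstate", c("F value", "Pr(>F)")])
print(marg_table["Outstate", c("F value", "Pr(>F)")])
```

The marginal F-test for a single term is equivalent to the squared t-test in the lm summary, so it does not change when the predictors are reordered.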
lm.fit2 <- lm(Apps ~ PhD + S.F.Ratio + perc.alumni + Expend + Grad.Rate + Accept.Rate)
summary(lm.fit2)
Call:
lm(formula = Apps ~ PhD + S.F.Ratio + perc.alumni + Expend +
    Grad.Rate + Accept.Rate)

Residuals:
   Min     1Q Median     3Q    Max
 -6989  -1744   -489    982  40845

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.554e+03  1.331e+03  -1.168   0.2432
PhD          6.447e+01  8.139e+00   7.921 8.24e-15 ***
S.F.Ratio    2.253e+02  3.779e+01   5.960 3.84e-09 ***
perc.alumni -7.599e+01  1.137e+01  -6.681 4.54e-11 ***
Expend       1.824e-01  3.269e-02   5.579 3.36e-08 ***
Grad.Rate    2.023e+01  8.118e+00   2.492   0.0129 *
Accept.Rate -6.239e+01  8.930e+00  -6.987 6.10e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3203 on 770 degrees of freedom
Multiple R-squared: 0.3204, Adjusted R-squared: 0.3151
F-statistic: 60.5 on 6 and 770 DF, p-value: < 2.2e-16
Comment: After removing out-of-state tuition, the R-squared value (the share of the variance in applications that the model explains) dropped by 1.05 percentage points, from 33.09% to 32.04%.
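Because R-squared can never decrease when a predictor is added, the adjusted R-squared is the fairer number to compare across the two models. As a small check, the standard adjustment formula can be applied to the R-squared values printed in the two summaries. The helper name adjusted_r2 is my own and is used only for illustration.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 777  # observations in the College data

adj_full    <- adjusted_r2(0.3309, n, p = 7)  # model with Outstate
adj_reduced <- adjusted_r2(0.3204, n, p = 6)  # model without Outstate

print(round(c(full = adj_full, reduced = adj_reduced), 4))
```

The results should match the Adjusted R-squared lines in the two summaries, which also fall by about one percentage point when out-of-state tuition is dropped.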
anova(lm.fit, lm.fit2)
Analysis of Variance Table

Model 1: Apps ~ Outstate + PhD + S.F.Ratio + perc.alumni + Expend + Grad.Rate +
    Accept.Rate
Model 2: Apps ~ PhD + S.F.Ratio + perc.alumni + Expend + Grad.Rate + Accept.Rate
  Res.Df        RSS Df  Sum of Sq     F    Pr(>F)
1    769 7777257083
2    770 7899226672 -1 -121969589 12.06 0.0005439 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Comment: The partial F-test comparing the two models gives F = 12.06 with p = 0.0005, well below 0.05, so dropping out-of-state tuition makes the fit significantly worse. I therefore decided to keep out-of-state tuition as a predictor, which also retains the higher R-squared value.
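The F statistic in the model comparison can be reproduced by hand from the two residual sums of squares, which makes it clearer what the test measures. The sketch below only uses numbers copied from the outputs above; the variable names are my own.

```r
# Residual sums of squares and degrees of freedom from anova(lm.fit, lm.fit2)
rss_full    <- 7777257083
df_full     <- 769
rss_reduced <- 7899226672
df_reduced  <- 770

# Partial F: extra error from dropping Outstate, relative to the full model's error variance
df_diff <- df_reduced - df_full
f_stat  <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)
p_value <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)

# With a single dropped term, F should equal the squared t value of that term
t_outstate <- -3.473  # from summary(lm.fit)

print(c(F = f_stat, t_squared = t_outstate^2, p = p_value))
```

Up to rounding, the F statistic agrees with the squared t value for Outstate, so this test and the t-test in the first summary are the same test.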
par(mfrow = c(2, 2), mar = c(5, 5, 4, 2) + 0.1) # set up a 2x2 plot layout with increased margins
plot(lm.fit)
Residuals vs Fitted: There’s a slight pattern rather than random scatter, suggesting possible non-linearity in the relationship. Point 484 appears to be an outlier.
Q-Q Residuals: The tails (especially the upper tail) deviate from the diagonal line, indicating the residuals aren’t perfectly normally distributed.
Scale-Location: The spread of residuals isn’t entirely consistent across fitted values, suggesting possible uneven variance.
Residuals vs Leverage: Point 484 has high standardized residual and moderate leverage, making it potentially influential. Points 285 and 21 also have relatively high leverage.
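The points flagged in the plots can also be looked up numerically. The following is a sketch under the assumption that the data and model are rebuilt as below (base R transform() instead of the dplyr pipeline); diag_table and top_influential are names I made up for this illustration.

```r
# Sketch only: rebuilds the data and model with base R so the block runs on its own.
library(ISLR2)  # provides the College data frame

college <- transform(College, Accept.Rate = Accept / Apps * 100)
full_fit <- lm(Apps ~ Outstate + PhD + S.F.Ratio + perc.alumni +
                 Expend + Grad.Rate + Accept.Rate, data = college)

# Per-observation diagnostics behind the Residuals vs Leverage panel
diag_table <- data.frame(
  row       = seq_len(nrow(college)),
  college   = rownames(college),
  std_resid = rstandard(full_fit),
  leverage  = hatvalues(full_fit),
  cooks_d   = cooks.distance(full_fit)
)

# Three most influential observations by Cook's distance
top_influential <- diag_table[order(diag_table$cooks_d, decreasing = TRUE)[1:3], ]
print(top_influential, row.names = FALSE)
```

Refitting the model without the most influential rows and comparing the coefficients would show how much the conclusions depend on them.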
summary(lm.fit)
Call:
lm(formula = Apps ~ Outstate + PhD + S.F.Ratio + perc.alumni +
    Expend + Grad.Rate + Accept.Rate)

Residuals:
   Min     1Q Median     3Q    Max
 -5881  -1719   -503    946  40384

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.274e+03  1.324e+03  -0.962 0.336183
Outstate    -1.618e-01  4.659e-02  -3.473 0.000544 ***
PhD          6.798e+01  8.144e+00   8.347 3.24e-16 ***
S.F.Ratio    1.972e+02  3.839e+01   5.136 3.56e-07 ***
perc.alumni -6.568e+01  1.168e+01  -5.625 2.59e-08 ***
Expend       2.289e-01  3.512e-02   6.519 1.28e-10 ***
Grad.Rate    3.036e+01  8.572e+00   3.542 0.000421 ***
Accept.Rate -5.968e+01  8.900e+00  -6.705 3.89e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3180 on 769 degrees of freedom
Multiple R-squared: 0.3309, Adjusted R-squared: 0.3248
F-statistic: 54.33 on 7 and 769 DF, p-value: < 2.2e-16
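Student/faculty ratio (197.2): Each one-unit increase in the student/faculty ratio is associated with about 197 more applications, holding the other predictors fixed. This is somewhat counterintuitive, as lower ratios are typically preferred.
Percentage of faculty with PhDs (67.98): Each percentage-point increase in faculty with PhDs is associated with about 68 more applications, all else equal. This indicates that faculty credentials may be important to applicants.
Percentage of alumni who donate (-65.68): Each percentage-point increase in alumni who donate is associated with about 66 fewer applications, all else equal. This unexpected negative relationship might reflect other factors.
Accept.Rate (-59.68): Each percentage-point increase in the acceptance rate is associated with about 60 fewer applications, all else equal. This suggests that more selective colleges attract more applicants, although Accept.Rate is computed as Accept/Apps, so part of this negative relationship is built into the variable.
Graduation rate (30.36): Each percentage-point increase in the graduation rate is associated with about 30 more applications, all else equal, so higher graduation rates appear to attract more applicants.
Instructional expenditure per student (0.2289): Each additional dollar of instructional expenditure per student is associated with about 0.23 more applications, all else equal. Higher instructional expenditures might signal better resources and facilities, leading to more applications.
Out-of-state tuition (-0.1618): Each additional dollar of out-of-state tuition is associated with about 0.16 fewer applications, all else equal, so higher tuition may slightly discourage applications.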
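Since the predictors are measured on very different scales (dollars, percentages, ratios), the raw coefficients are hard to compare. One way to read them is to multiply each coefficient by a more meaningful step in its predictor. The step sizes below are hypothetical and chosen only for illustration; the coefficients are copied from the summary above.

```r
# Coefficients copied from summary(lm.fit)
coefs <- c(Outstate = -0.1618, PhD = 67.98, S.F.Ratio = 197.2,
           perc.alumni = -65.68, Expend = 0.2289,
           Grad.Rate = 30.36, Accept.Rate = -59.68)

# Hypothetical step sizes: $1,000 for dollar amounts, 10 points for percentages,
# one unit for the student/faculty ratio
steps <- c(Outstate = 1000, PhD = 10, S.F.Ratio = 1,
           perc.alumni = 10, Expend = 1000,
           Grad.Rate = 10, Accept.Rate = 10)

# Expected change in applications for each step, holding the other predictors fixed
effect <- coefs * steps[names(coefs)]
print(round(effect, 1))
```

Read this way, a $1,000 increase in out-of-state tuition goes with roughly 162 fewer applications, and a $1,000 increase in instructional expenditure per student with roughly 229 more. These are associations within this model, not causal effects.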
The model shows some violations of regression assumptions, particularly non-linearity and non-constant variance. It may oversimplify complex relationships between variables.
Additional variables and transformations that could enhance the analysis include the percentage of new students from the top 10% of their high school class (Top10perc in the College data) and a squared term for instructional expenditure per student.
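A possible next step, sketched below, is to add these two terms to the current model and compare the fits. This is an untested suggestion rather than a result of the analysis: the data are rebuilt with base R's transform(), and the names base_fit and ext_fit are my own.

```r
# Sketch only: rebuilds the data and model with base R so the block runs on its own.
library(ISLR2)  # provides the College data frame

college <- transform(College, Accept.Rate = Accept / Apps * 100)

base_fit <- lm(Apps ~ Outstate + PhD + S.F.Ratio + perc.alumni +
                 Expend + Grad.Rate + Accept.Rate, data = college)

# Add the top-10% share and a squared expenditure term to the existing formula
ext_fit <- update(base_fit, . ~ . + Top10perc + I(Expend^2))

# Partial F-test for the two added terms, plus adjusted R-squared for both models
print(anova(base_fit, ext_fit))
print(c(base     = summary(base_fit)$adj.r.squared,
        extended = summary(ext_fit)$adj.r.squared))
```

The diagnostic plots would need to be checked again for the extended model, since added terms do not by themselves fix non-constant variance.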