Colleges Through Time

Author

Erin Morrison

Introduction

I will be working with data about different colleges ranging from 2001-2022. This data set was simplified and posted on Kaggle however it is originally from the U.S. Department of Education College Scorecard. It contains values including the institution name, average SAT score, acceptance rate, median household income, cost of attendance, and more. I want to look at the change in SAT scores and acceptance rates over time. I am also curious to see what factors are big drivers for the cost of attendance and acceptance rates for different colleges.

Libraries & Data

# Libraries
library(tidyverse)

Warning: package 'ggplot2' was built under R version 4.4.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggfortify)

Warning: package 'ggfortify' was built under R version 4.4.2

library(GGally)

Warning: package 'GGally' was built under R version 4.4.2

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

library(alluvial)

Warning: package 'alluvial' was built under R version 4.4.2

library(ggalluvial)

Warning: package 'ggalluvial' was built under R version 4.4.2

library(plotly)

Warning: package 'plotly' was built under R version 4.4.3


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

# Data
setwd("~/DATA 110")
college <- read_csv("college_data.csv")

Rows: 58123 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): INSTNM, CITY, STABBR, ZIP, REGION, PREDDEG, CCBASIC, CCUGPROF, CCS...
dbl (23): UNITID, LOCALE, LATITUDE, LONGITUDE, ADM_RATE_ALL, PPTUG_EF, SAT_A...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Cleaning

names(college) <- tolower(names(college))

Linear Regression Acceptance Rate

# Model
admitMod <- lm(adm_rate_all ~ sat_avg_all + median_hh_inc + preddeg + ugds_white + sqrt(ugds_hisp) + region + tuitionfee_in, college)
summary(admitMod)


Call:
lm(formula = adm_rate_all ~ sat_avg_all + median_hh_inc + preddeg + 
    ugds_white + sqrt(ugds_hisp) + region + tuitionfee_in, data = college)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.63624 -0.09495  0.01863  0.10862  0.44972 

Coefficients:
                                                                   Estimate
(Intercept)                                                       1.457e+00
sat_avg_all                                                      -8.835e-04
median_hh_inc                                                     2.606e-06
preddegPredominantly bachelor's-degree granting                   7.213e-02
preddegPredominantly certificate-degree granting                 -1.332e-01
ugds_white                                                        3.136e-01
sqrt(ugds_hisp)                                                   1.392e-01
regionGreat Lakes (IL, IN, MI, OH, WI)                           -8.833e-02
regionMid East (DE, DC, MD, NJ, NY, PA)                          -5.146e-02
regionNew England (CT, ME, MA, NH, RI, VT)                       -1.165e-01
regionOutlying Areas (AS, FM, GU, MH, MP, PR, PW, VI)             7.325e-02
regionPlains (IA, KS, MN, MO, NE, ND, SD)                        -9.827e-02
regionRocky Mountains (CO, ID, MT, UT, WY)                       -3.296e-02
regionSoutheast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV) -1.193e-01
regionSouthwest (AZ, NM, OK, TX)                                 -1.051e-01
regionU.S. Service Schools                                       -6.503e-01
tuitionfee_in                                                    -2.928e-06
                                                                 Std. Error
(Intercept)                                                       6.631e-02
sat_avg_all                                                       5.334e-05
median_hh_inc                                                     7.109e-07
preddegPredominantly bachelor's-degree granting                   2.451e-02
preddegPredominantly certificate-degree granting                  8.497e-02
ugds_white                                                        2.976e-02
sqrt(ugds_hisp)                                                   4.942e-02
regionGreat Lakes (IL, IN, MI, OH, WI)                            2.515e-02
regionMid East (DE, DC, MD, NJ, NY, PA)                           2.420e-02
regionNew England (CT, ME, MA, NH, RI, VT)                        2.847e-02
regionOutlying Areas (AS, FM, GU, MH, MP, PR, PW, VI)             1.646e-01
regionPlains (IA, KS, MN, MO, NE, ND, SD)                         2.703e-02
regionRocky Mountains (CO, ID, MT, UT, WY)                        3.724e-02
regionSoutheast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)  2.448e-02
regionSouthwest (AZ, NM, OK, TX)                                  2.836e-02
regionU.S. Service Schools                                        1.649e-01
tuitionfee_in                                                     3.874e-07
                                                                 t value
(Intercept)                                                       21.964
sat_avg_all                                                      -16.564
median_hh_inc                                                      3.666
preddegPredominantly bachelor's-degree granting                    2.943
preddegPredominantly certificate-degree granting                  -1.568
ugds_white                                                        10.537
sqrt(ugds_hisp)                                                    2.817
regionGreat Lakes (IL, IN, MI, OH, WI)                            -3.512
regionMid East (DE, DC, MD, NJ, NY, PA)                           -2.127
regionNew England (CT, ME, MA, NH, RI, VT)                        -4.091
regionOutlying Areas (AS, FM, GU, MH, MP, PR, PW, VI)              0.445
regionPlains (IA, KS, MN, MO, NE, ND, SD)                         -3.636
regionRocky Mountains (CO, ID, MT, UT, WY)                        -0.885
regionSoutheast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)  -4.872
regionSouthwest (AZ, NM, OK, TX)                                  -3.707
regionU.S. Service Schools                                        -3.944
tuitionfee_in                                                     -7.559
                                                                 Pr(>|t|)    
(Intercept)                                                       < 2e-16 ***
sat_avg_all                                                       < 2e-16 ***
median_hh_inc                                                    0.000259 ***
preddegPredominantly bachelor's-degree granting                  0.003318 ** 
preddegPredominantly certificate-degree granting                 0.117225    
ugds_white                                                        < 2e-16 ***
sqrt(ugds_hisp)                                                  0.004945 ** 
regionGreat Lakes (IL, IN, MI, OH, WI)                           0.000464 ***
regionMid East (DE, DC, MD, NJ, NY, PA)                          0.033660 *  
regionNew England (CT, ME, MA, NH, RI, VT)                       4.63e-05 ***
regionOutlying Areas (AS, FM, GU, MH, MP, PR, PW, VI)            0.656473    
regionPlains (IA, KS, MN, MO, NE, ND, SD)                        0.000291 ***
regionRocky Mountains (CO, ID, MT, UT, WY)                       0.376399    
regionSoutheast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV) 1.28e-06 ***
regionSouthwest (AZ, NM, OK, TX)                                 0.000220 ***
regionU.S. Service Schools                                       8.56e-05 ***
tuitionfee_in                                                    8.93e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1624 on 1035 degrees of freedom
  (57071 observations deleted due to missingness)
Multiple R-squared:  0.4143,    Adjusted R-squared:  0.4052 
F-statistic: 45.75 on 16 and 1035 DF,  p-value: < 2.2e-16

# Make dataset of only data included in the model
mod1Data <- college |>
  filter(!is.na(sat_avg_all), !is.na(median_hh_inc), !is.na(ugds_white), !is.na(tuitionfee_in)) |>
  select(adm_rate_all, sat_avg_all, median_hh_inc, preddeg, ugds_white, ugds_hisp, region, tuitionfee_in)
mod1Data$model <- predict(admitMod)

# Graph model vs actual
mod1Data |> ggplot(aes(adm_rate_all, model)) +
  geom_point() +
  geom_abline(color = "blue")

# Show normalized variable
hist(sqrt(mod1Data$ugds_hisp))

# Diagnostic plots
autoplot(admitMod, 1:4, nrow=2, ncol=2)

Warning: Removed 301 rows containing missing values or values outside the scale range
(`geom_line()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_segment()`).

ggpairs(mod1Data, columns = 2:8)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Equation

acceptance rate = -8.835e-04(average SAT) + 2.606e-06(median household income) + 3.136e-01(white undergrad students) + 1.392e-01 * sqrt(hispanic undergrad students) - 2.928e-06(in-state tuition) - 8.833e-02(Great Lakes region) - 5.146e-02(Mid East region) - 1.165e-01(New England region) + 7.325e-02(Outlying Areas region) - 9.827e-02(Plains region) - 3.296e-02(Rocky Mountains region) - 1.193e-01(Southeast region) - 1.051e-01(Southwest region) - 6.503e-01(U.S. Service Schools region) + 7.213e-02(predominantly bachelor’s-degree) - 1.332e-01(predominantly certificate-degree) + 1.457

Analysis

I attempted to make this linear regression as simple as possible by removing variables that made very minimal improvements to the model along with having high p-values, however there were still many variables that contributed to its adjusted R^2 value. Most included numerical variables had reasonably normal distributions except ugds_white, ugds_hisp, and tuitionfee_in. I was able to improve the normality of ugds_hisp but not the other two. There is minimal collinearity, while some variables do appear somewhat correlated (like sat_avg_all and median_hh_inc) they still maintain a fair spread. There seem to be a few outliers that could be affecting model performance. All numerical variables in the model have very significant p-values of < 0.001. Both categorical variables have at least one significant factor level. For the most part each variable affects the model in a predictable way (for example higher SAT scores correlate to lower acceptance rates). However I was surprised that higher median household income is associated with higher acceptance rates when I would have expected the opposite. It was also interesting to see that higher proportions of both white and hispanic students are associated with higher acceptance rates. Over all this model has an adjusted R^2 of 0.4052 so it was able to explain 40.52% of the variability in the data. It should also be mentioned that due to missing data this model may not be representative of the full data set.

Linear Regression 2

# Model
costMod <- lm(costt4_a ~ median_hh_inc + preddeg + sqrt(ugds_hisp) + tuitionfee_in, college) 
summary(costMod)


Call:
lm(formula = costt4_a ~ median_hh_inc + preddeg + sqrt(ugds_hisp) + 
    tuitionfee_in, data = college)

Residuals:
     Min       1Q   Median       3Q      Max 
-14685.1  -2067.3      8.5   2074.3  22809.3 

Coefficients:
                                                   Estimate Std. Error t value
(Intercept)                                       7.199e+03  6.089e+02  11.824
median_hh_inc                                     6.916e-02  9.494e-03   7.284
preddegPredominantly bachelor's-degree granting   1.593e+03  3.511e+02   4.537
preddegPredominantly certificate-degree granting  1.092e+03  6.299e+02   1.733
sqrt(ugds_hisp)                                  -1.499e+03  5.393e+02  -2.779
tuitionfee_in                                     1.007e+00  6.888e-03 146.238
                                                 Pr(>|t|)    
(Intercept)                                       < 2e-16 ***
median_hh_inc                                    4.94e-13 ***
preddegPredominantly bachelor's-degree granting  6.12e-06 ***
preddegPredominantly certificate-degree granting  0.08324 .  
sqrt(ugds_hisp)                                   0.00551 ** 
tuitionfee_in                                     < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3789 on 1688 degrees of freedom
  (56429 observations deleted due to missingness)
Multiple R-squared:  0.9524,    Adjusted R-squared:  0.9523 
F-statistic:  6755 on 5 and 1688 DF,  p-value: < 2.2e-16

# Make dataset of only data included in the model
mod2Data <- college |>
  filter(!is.na(costt4_a), !is.na(median_hh_inc)) |>
  select(costt4_a, median_hh_inc, preddeg, ugds_hisp, tuitionfee_in)
mod2Data$model <- predict(costMod)

# Graph model vs actual
mod2Data |> ggplot(aes(costt4_a, model)) +
  geom_point() +
  geom_abline(color = "blue")

# Diagnostic plots
autoplot(costMod, 1:4, nrow=2, ncol=2)

ggpairs(mod2Data, columns = 2:5)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Equation

cost of attendance = 6.916e-02(median household income) - 1.499e+03 * sqrt(hispanic undergrad students) + 1.007(in-state tuition) + 1.593e+03(predominantly bachelor’s-degree) + 1.092e+03(predominantly certificate-degree) + 7.199e+03

Analysis

Like the first model, I attempted to make this linear regression as simple as possible by removing variables with high p-values that made very minimal improvements. The only numerical variable I couldn’t normalize was tuitionfee_in which is a bit of a concern since it is a big driver of the model. There is minimal collinearity, while some variables do appear somewhat correlated (like tuitionfee_in and median_hh_inc) they still maintain significant variability. There seem to be a few outliers that could be affecting model performance. All numerical variables in the model have very significant p-values of < 0.006. The categorical variable has one factor level with a p-value < 0.001. The way the variables affect the model in this case is pretty expected, tuition sets a base line cost which is then tweaked by other factors in the student population. For example, schools with a higher median household incomes don’t need to give out as much financial aid and therefore makes the cost of attendance higher. Over all this model has an adjusted R^2 of 0.9523 so it was able to explain 95.23% of the variability in the data. This is quite a high R^2 but isn’t entirely unexpected as tuition is entered into the model. However, I was surprised by just how well it preformed considering the variability of financial aid and scholarship packages from school to school. Like the first model, it should be mentioned that due to missing data this model may not representative of the full data set.

Visualization

# trying and failing to find average SAT for each college
avgSAT <- college |>
  select(instnm, sat_avg_all, year) |>
  pivot_wider(names_from = year, values_from = sat_avg_all) # https://tidyr.tidyverse.org/reference/pivot_wider.html

Warning: Values from `sat_avg_all` are not uniquely identified; output will contain
list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} |>
  dplyr::summarise(n = dplyr::n(), .by = c(instnm, year)) |>
  dplyr::filter(n > 1L)

avgSAT <- avgSAT |> mutate(mean = college[instnm,2])

college |>
  ggplot(aes(year, sat_avg_all)) +
  geom_smooth() +
  geom_vline(xintercept = 2015) + # last year before SAT redesign https://satsuite.collegeboard.org/media/pdf/redesigned-sat-results-pilot-validity-study.pdf
  geom_vline(xintercept = 2020) # many schools switch to test optional

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Warning: Removed 25699 rows containing non-finite outside the scale range
(`stat_smooth()`).

# Filter for the Forbes top 5 colleges
best <- college |>
  filter(instnm %in% c("Princeton University", "Stanford University", "Massachusetts Institute of Technology", "Yale University", "University of California-Berkeley")) # https://www.forbes.com/top-colleges/

# Graph
p <- best |> ggplot(aes(year, adm_rate_all, alluvium = instnm)) + #maybe make alluvial
  geom_alluvium(aes(fill = instnm), decreasing = FALSE, alpha = .6, color = "white", width = .15,) +
  geom_smooth(data = college, aes(year, adm_rate_all, alluvium = "All Colleges"), color = "black") +
  geom_text(aes(x=2015.5, y=0.74, label="All Colleges Average"), cex=4, color="#808080") +
  theme_minimal() +
  scale_fill_manual(values = c("red", "#FF671F", "#4D4F53", "#C4820E", "blue")) +
  labs(title = "Forbes Top 5 US Colleges", subtitle = "Acceptance Rates Over Time 2001-2022",
       x = "Year",
       y = "Acceptance Rate",
       caption = "Data: U.S. Department of Education College Scorecard\nRanking: Forbes’ List of America’s Top Colleges") +
  guides(fill = guide_legend(title="College")) # https://stackoverflow.com/questions/14622421/how-to-change-legend-title-in-ggplot

Warning in geom_smooth(data = college, aes(year, adm_rate_all, alluvium = "All
Colleges"), : Ignoring unknown aesthetics: alluvium

## FINAL PLOT ##
p

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

# Add plotly (it removes my caption and subtitle)
p <- ggplotly(p)

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Conclusion

I started by making all column names lowercase to make the data easier to work with. Then I explored many possible plots, though I didn’t polish this graph I find the trends in average SAT scores quite interesting. I attempted working with SAT scores and puting schools with the best SATs into an alluvial but it wasn’t very interesting especial since they didn’t follow the trend shown among all the colleges. So insead I settled on making an alluvial of the top five colleges’ acceptance rates and filtered the data for just those colleges. To make the final visualization I used both the filtered data for the alluvial and the full data set to add the trend geom_smooth line. This plot shows that while all these schools had lower than average acceptance rates in 2001 they greatly and steadily decreased all the way to 2022. It also shows that acceptance rates as a whole have stayed relatively stable throughout this time. I am disappointed that the plotly removes the caption and subtitle. I would also have liked to make its tooltip more readable. If I were to do this again I would have made the graph in highcharter to make the interactivity cleaner.