The purpose of this exploratory analysis is to examine how college tuition varies across several factors and observe trends across time and regions.

Rising College Tuition and Associated Factors

CONTEXT

The data is compiled from several datasets sourced from the Tidy Tuesday GitHub (TTG), which is maintained by the R for Data Science (R4DS) Online Learning Community (https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-03-10/readme.md). This includes college tuition, salary potential, and diversity in schools, as well as a separate dataset tracking historical tuition to provide context for the rate at which tuition has risen over the past few decades.

EDA

QUESTIONS OF INTEREST

  1. How has college tuition changed across 2- and 4-year institutions in the US over time?

  2. Of the numerical variables, which is most strongly correlated with tuition costs? Is this relationship linear? What type of regression model fits the data well?

  3. How are tuition costs distributed across the US? Which states have a higher average tuition?

Introduction

Every academic year, the price tag of attending Swarthmore increases by nearly $2,000 in tuition alone [1] (with the exception of 2020-2021 due to COVID-19). While tuition increased from $65,774 in 2017 to $70,744 in 2020, the top student wage at Swarthmore went from $10.20 [2] per hour in 2017 to $10.80 [3] per hour in 2020, which amounts to a little over $1,500 per semester before tax, assuming a work schedule of 10 hours per week for a 14-week academic term. By my calculation, Swarthmore tuition increased by 7.56% over those 3 years while the top student wage increased by 5.88%. Although work study, scholarships, grants, and financial aid provide a variety of options, the upsurge in tuition remains a concern for American youth. Aside from mortgages, student loans make up the largest portion of average household debt [4].
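The growth figures above can be checked with a few lines of arithmetic. The sketch below is a minimal illustration that uses only the numbers quoted in the paragraph; the helper name `pct_change` is my own, and it measures change relative to the 2017 starting value. Dividing by the 2020 figure instead would give about 7.03% for tuition.

```r
# Figures quoted in the paragraph above
tuition_2017 <- 65774
tuition_2020 <- 70744
wage_2017 <- 10.20
wage_2020 <- 10.80

# Percent change relative to the starting (old) value
pct_change <- function(old, new) (new - old) / old * 100

tuition_growth <- pct_change(tuition_2017, tuition_2020)
wage_growth <- pct_change(wage_2017, wage_2020)

# Pre-tax earnings for one semester: 10 hours/week for 14 weeks
semester_pay <- wage_2020 * 10 * 14

round(c(tuition = tuition_growth, wage = wage_growth), 2)
semester_pay
```

Under these inputs, tuition grows faster than the top student wage over the same three years.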

Over the past decade, college tuition has risen over 25% [5]. Adjusting for inflation lowers the rate of price growth dramatically, to an average of 2.4% per year over the last 10 years, according to CollegeBoard [6]. However, financial aid and tax benefits have not kept pace with rising prices, and student loan debt upon graduation has escalated 76% since 2000, outstripping the pace of inflation by over 40% [7]. Furthermore, wage growth in America remains slow. The U.S. Census Bureau reported an increase of 0.8% in the median household income from 2017 to 2018 [8]. Looking to previous years, the real median household income fell sharply after the 2008 recession and only started to rise again in 2013. COVID-19 destabilized the US economy even further in 2020 going into 2021. Tuition continues to increase, albeit at the lowest rate in decades [9]. Still, tuition remains a significant concern for current and prospective students in times of uncertainty.

Rising higher education costs have been attributed mainly to cuts in state funding [10], along with other factors such as increased demand, rising costs of financial aid, and the need for more faculty and services [11]. As evidenced by their prominence as a debate topic during the 2020 presidential campaign, rising tuition rates and student debt are essential to address because of the monetary burden that interest payments place on young workers and/or their families at the start of a career. For young adults, failure to meet interest payments could also lower their credit rating, which can severely hurt their chances of taking out a mortgage, renting an apartment, obtaining insurance, and pursuing career opportunities. For their families, this could result in decreased opportunities for younger siblings and struggles to pay their own bills.

In addition, a Forbes article stated that in 2015-2016 black Millennial graduates were the most likely to borrow for educational purposes, and borrowed at a higher rate than their white, Hispanic, Asian, and multiracial peers [12]. The article suggests that escalating tuition prices do not affect students equally across race/ethnicity. Furthermore, a report released by the UN in 2016 found that even when controlling for age, educational level, and residential location, workers from an ethnic minority had a significantly lower likelihood of holding a skilled (managerial, professional, and technical) job relative to white, non-indigenous workers [13]. A 2017 meta-analysis reported little change in racial discrimination in hiring since 1989, particularly for black people [14]. Within colleges themselves, one study suggested a significant negative association between rising tuition costs and the racial/ethnic diversity of enrolled students among 2- and 4-year public institutions [15]. With growing concerns about the hiring and salary potential of graduates across different races/ethnicities, this leads to the central question: How are tuition costs distributed across various factors?

Using data on tuition and fees from 2018-2019 and enrollment diversity from 2014, organized by college/university by the Chronicle of Higher Education (CHE) [16], I aim to investigate the relationship between college tuition and a variety of factors including diversity, enrollment number, type of institution, degree length, and location. While some caution should be taken when interpreting analyses that combine data from different years, especially since tuition is expected to increase each year, I acquired the organized data sets for ease of use from the Tidy Tuesday GitHub (TTG), which is maintained by the R for Data Science (R4DS) Online Learning Community [17]. For further context, TTG also provided the salary potential and historic tuition rates of colleges within the US. Additional data from the National Center for Education Statistics (NCES) [18] can also be referenced. Using data from CHE and NCES compiled by R4DS, I can create analyses that provide insight into the relative costs of different types of institutions and offered degrees, as well as whether colleges of a distinct price bracket are clustered in certain states.

I am primarily interested in measuring tuition by the total out-of-state cost and its association with the previously mentioned factors. Using the additional data provided by the GitHub repository, I can also compare the potential salary ranks for graduates across race and school type for a smaller subset of colleges/universities. While acknowledging the limitations of comparing data across different years and the limited number of variables examined in the dataset, my goal is to produce a comprehensive exploratory analysis model that investigates which of the factors (diversity, type of degree and institution, residence, salary potential, etc.) are most strongly associated with tuition. This includes regression analysis, choropleths, interactive plots, and more. Thus, future analyses could use this exploratory model as a template for more comprehensive tuition data pooled from credible sources, which could include adding US colleges that were not included or had incomplete data in the original data set. In addition, tuition analyses could be expanded to cover the relationship between tuition and other relevant topics such as student debt, state funds, and hiring trends for graduates across colleges, as well as associated factors such as rates of financial aid. Overall, these types of analyses and observations are crucial to helping wider audiences make informed decisions about how the cost of attending college differs across several factors. Further analyses could also contribute to increasing awareness of student debt as well as gaps in wage and salary potential that appear unevenly distributed across demographics.

EDA

In the dataset, notice that the name variable repeats several times. This is because each college is split into multiple rows, one per race/ethnicity enrollment category. Thus, the actual number of unique colleges/universities we are examining is 2973.
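As a minimal sketch of how repeated names can be separated from unique colleges, the example below counts rows and distinct names. The toy data frame and its values are made up for illustration; only the column names `name`, `category`, and `enrollment` come from the dataset.

```r
library(dplyr)

# Toy stand-in (not the real data): one row per college per enrollment category
toy <- data.frame(
  name = c("College A", "College A", "College B", "College B", "College C"),
  category = c("White", "Black", "White", "Black", "White"),
  enrollment = c(500, 120, 300, 90, 210)
)

n_rows <- nrow(toy)                 # counts every repeated name
n_colleges <- n_distinct(toy$name)  # counts each college once

c(rows = n_rows, colleges = n_colleges)
```

The same `n_distinct()` call on the full data would presumably give the count of unique colleges reported above.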

There are several numerical variables in this dataset: in_state_tuition (price of tuition for in-state students), in_state_total (total price of education for in-state students), out_of_state_tuition (price of tuition for out-of-state students), out_of_state_total (total price of education for out-of-state students), and total_enrollment (total number of students enrolled). The variable room_and_board (price of room and board) was excluded since it was inconsistently reported across colleges/universities.

Pairs Plot for numerical variables

It is not especially useful to compare the out-of-state total with the in-state total, since they are expected to be heavily correlated because they are calculated from similar factors. In addition, several of the graphs display an odd straight line with a more diffuse cluster of points around the beginning of the line. One area of interest is whether linear regression fits in-state tuition and total enrollment, since they appear to be slightly negatively correlated.

Corr: -0.173
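The pairwise correlations summarized in a pairs plot can also be inspected numerically with base R's `cor()`. The sketch below uses a made-up toy data frame, not the real data, so its values will not match the -0.173 reported above. It only illustrates how two cost totals built from similar components end up highly correlated.

```r
# Toy values for illustration only
toy <- data.frame(
  in_state_total     = c(12000, 18000, 25000, 40000, 52000),
  out_of_state_total = c(20000, 27000, 31000, 40000, 52000),
  total_enrollment   = c(30000, 22000, 15000, 4000, 2500)
)

# Pearson correlation matrix; complete.obs drops rows with missing values
cor_mat <- cor(toy, use = "complete.obs")

round(cor_mat, 2)
```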

Linear Model of Multiple Variables


Call:
lm(formula = out_of_state_total ~ in_state_total + total_enrollment + 
    type + degree_length, data = tut_all)

Coefficients:
        (Intercept)       in_state_total     total_enrollment  
         -2659.8277               0.9908               0.1347  
        typePrivate           typePublic  degree_length4 Year  
         -1367.9256            7325.2473            4195.2165  

Here, the intercept can be interpreted as the baseline value at the reference level of each categorical variable (with the numerical predictors at zero), which for type of school is For Profit and for degree_length is 2 Year.
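To make the role of the reference levels concrete, the sketch below plugs the printed (rounded) coefficients into the model equation by hand. The school values of 20,000 for `in_state_total` and 10,000 for `total_enrollment` are hypothetical and chosen only for illustration, and the short coefficient names are my own labels.

```r
# Coefficients copied from the printed lm() output (already rounded)
b <- c(
  intercept        = -2659.8277,
  in_state_total   = 0.9908,
  total_enrollment = 0.1347,
  type_private     = -1367.9256,
  type_public      = 7325.2473,
  degree_4_year    = 4195.2165
)

# Hypothetical school, chosen only for illustration
in_state_total   <- 20000
total_enrollment <- 10000

# Baseline: for-profit, 2-year (both dummy variables equal 0)
baseline <- b[["intercept"]] +
  b[["in_state_total"]] * in_state_total +
  b[["total_enrollment"]] * total_enrollment

# Public, 4-year: add the two dummy coefficients to the baseline
public_4yr <- baseline + b[["type_public"]] + b[["degree_4_year"]]

c(baseline = baseline, public_4yr = public_4yr, gap = public_4yr - baseline)
```

With everything else held fixed, the gap between the two predictions is the sum of the `typePublic` and `degree_length4 Year` coefficients.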

Coefficient Plot

From the coefficient plot, it appears that 4-year public schools differ the most in out-of-state total cost from the baseline (a 2-year for-profit school).

Distribution of Tuition by Type of Institution and Degree Length

# A tibble: 1 x 19
  name  state state_code type  degree_length room_and_board in_state_tuition
  <chr> <chr> <chr>      <chr> <chr>                  <dbl>            <dbl>
1 Univ~ Texas TX         Other Other                     NA             8448
# ... with 12 more variables: in_state_total <dbl>, out_of_state_tuition <dbl>,
#   out_of_state_total <dbl>, total_enrollment <dbl>, category <chr>,
#   enrollment <dbl>, rank <dbl>, state_name <chr>, early_career_pay <dbl>,
#   mid_career_pay <dbl>, make_world_better_percent <dbl>, stem_percent <dbl>

In general, it appears that colleges with 4-year degrees have higher tuition prices than those with 2-year degrees. In addition, private institutions display higher median tuition costs for both degree lengths.

Furthermore, there is a very small number of values for the school type Other, which upon closer inspection includes only one college: The University of North Texas at DallasSystem. This point should be excluded from further analyses because a single data point cannot yield a meaningful average, so it would not be appropriate to run statistical analyses comparing the mean of Other to the other school types.
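One way to drop the single Other school before comparing group means is a simple filter. The sketch below uses a made-up toy data frame; only the column names `type` and `out_of_state_total` come from the dataset.

```r
library(dplyr)

# Toy stand-in with one "Other" school (values are illustrative)
toy <- data.frame(
  name = c("School A", "School B", "School C", "School D"),
  type = c("Public", "Private", "Other", "For Profit"),
  out_of_state_total = c(30000, 52000, 21000, 26000)
)

# Keep every school type except "Other"
toy_clean <- toy %>% filter(type != "Other")

table(toy_clean$type)
```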

Overall, this plot shows that the total enrollment of students who identify as white is greater than the total enrollment of those who identify with an ethnicity categorized as a minority or as mixed race. Going forward, we should compare percentages rather than total enrollment numbers.
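A minimal sketch of converting enrollment counts into percentages, assuming each row holds one category's `enrollment` alongside the school's `total_enrollment` (both column names appear in the dataset). The toy values and the category label "Total Minority" are illustrative assumptions.

```r
library(dplyr)

# Toy stand-in: two schools, two categories each
toy <- data.frame(
  name = c("College A", "College A", "College B", "College B"),
  category = c("White", "Total Minority", "White", "Total Minority"),
  enrollment = c(600, 400, 150, 350),
  total_enrollment = c(1000, 1000, 500, 500)
)

# Share of each category within its own school, in percent
toy_pct <- toy %>%
  mutate(pct_enrollment = enrollment / total_enrollment * 100)

toy_pct
```

Percentages put a 500-student school and a 1,000-student school on the same scale, which raw counts do not.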

Question 1: How has college tuition changed across 2- and 4-year institutions in the US over time?

Historical Tuition

[1] TRUE
# A tibble: 90 x 4
# Groups:   year [19]
   type             year    tuition_type    tuition_cost
   <chr>            <chr>   <chr>                  <dbl>
 1 All Institutions 1985-86 4 Year Constant        12274
 2 All Institutions 1985-86 2 Year Constant         7508
 3 All Institutions 1995-96 4 Year Constant        16224
 4 All Institutions 1995-96 2 Year Constant         7421
 5 All Institutions 2000-01 4 Year Constant        17909
 6 All Institutions 2000-01 2 Year Constant         7576
 7 All Institutions 2001-02 4 Year Constant        18573
 8 All Institutions 2001-02 2 Year Constant         7786
 9 All Institutions 2002-03 4 Year Constant        19240
10 All Institutions 2002-03 2 Year Constant         8331
# ... with 80 more rows
# A tibble: 90 x 5
   type             year    tuition_type    tuition_cost  Year
   <chr>            <chr>   <chr>                  <dbl> <dbl>
 1 All Institutions 1985-86 4 Year Constant        12274  1985
 2 All Institutions 1985-86 2 Year Constant         7508  1985
 3 All Institutions 1995-96 4 Year Constant        16224  1985
 4 All Institutions 1995-96 2 Year Constant         7421  1985
 5 All Institutions 2000-01 4 Year Constant        17909  1985
 6 All Institutions 2000-01 2 Year Constant         7576  1985
 7 All Institutions 2001-02 4 Year Constant        18573  1995
 8 All Institutions 2001-02 2 Year Constant         7786  1995
 9 All Institutions 2002-03 4 Year Constant        19240  1995
10 All Institutions 2002-03 2 Year Constant         8331  1995
# ... with 80 more rows
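In the second printout, the numeric Year column does not track the year label row by row (for example, 1995-96 is paired with 1985), so it may be worth deriving it directly from the label. The sketch below is one possible approach, not necessarily how the column was originally built: it takes the first four characters of each academic-year string.

```r
# Academic-year labels as they appear in the printout
year_label <- c("1985-86", "1995-96", "2000-01", "2001-02", "2002-03")

# Starting calendar year = first four characters, converted to numeric
start_year <- as.numeric(substr(year_label, 1, 4))

data.frame(year = year_label, Year = start_year)
```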

Observation and Interpretation

We can see that the tuition (adjusted for inflation) of private, 4-year institutions has risen more sharply over the last two decades than that of public colleges/universities. As recently as 2016, the cost of 4-year degrees appears to be nearly double the price of 2-year degrees.

To observe changes in tuition over the past few decades, see: https://rpubs.com/eswat1/FProj_Support1

Question 2: Of the numerical variables, which is most strongly correlated with tuition costs? Is this relationship linear?

# A tibble: 2 x 7
  term              estimate std_error statistic p_value  lower_ci  upper_ci
  <chr>                <dbl>     <dbl>     <dbl>   <dbl>     <dbl>     <dbl>
1 intercept        19071.      124.        154.        0 18829.    19314.   
2 total_enrollment    -0.325     0.012     -27.1       0    -0.348    -0.301

This simple linear model was fit with the EDA in mind, since the correlation between in_state_tuition and total_enrollment was found to be weakly negative (-0.173).

Interpretation

There appears to be a slight negative linear relationship between In State Tuition and Total Enrollment. However, the LOESS line might be a better fit than the OLS line for this data, especially given the number of outliers and heavy clustering towards the lower end of enrollment.

The observations appear dependent and there appears to be a strong linear relationship between the explanatory variable and the residuals.

Furthermore, the distribution of the residuals is unimodal but heavily right-skewed.

Finally, while the slope appears around 0 in the fitted values vs. residuals plot, the data is clustered to the right side, which is unusual and suggests that a linear model does not fit the data well.

Thus, it appears that a more complex non-linear model is required to study the association between in-state tuition and total enrollment.
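The comparison between an OLS line and a LOESS curve can be sketched on simulated data. Everything below is an assumption made for illustration: the data are generated, not taken from the tuition dataset, and are deliberately curved with enrollment clustered at the low end. A lower in-sample residual sum of squares does not by itself prove a better model, since LOESS is more flexible by construction.

```r
set.seed(1)

# Simulated data only: enrollment clustered at the low end, curved relationship
n <- 200
total_enrollment <- rexp(n, rate = 1 / 5000)
in_state_tuition <- 5000 + 15000 * exp(-total_enrollment / 3000) +
  rnorm(n, sd = 1000)
sim <- data.frame(total_enrollment, in_state_tuition)

# Straight-line fit versus a local-regression smoother
ols <- lm(in_state_tuition ~ total_enrollment, data = sim)
lo  <- loess(in_state_tuition ~ total_enrollment, data = sim)

# Residual sum of squares: smaller means the fit tracks the points more closely
rss_ols <- sum(resid(ols)^2)
rss_lo  <- sum(resid(lo)^2)

c(ols = rss_ols, loess = rss_lo)

# plot(fitted(ols), resid(ols)) would draw the fitted-vs-residuals plot
```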

Question 3: How are tuition costs distributed across the US? Which states have a higher average tuition?

Interactive map of schools and locations and tuition.

# A tibble: 1,227,031 x 19
# Groups:   state [49]
    long   lat group order region subregion name  state state_code type 
   <dbl> <dbl> <dbl> <int> <chr>  <chr>     <chr> <chr> <chr>      <chr>
 1 -87.5  30.4     1     1 alaba~ <NA>      Alab~ Alab~ AL         Publ~
 2 -87.5  30.4     1     1 alaba~ <NA>      Alab~ Alab~ AL         Publ~
 3 -87.5  30.4     1     1 alaba~ <NA>      Alab~ Alab~ AL         Publ~
 4 -87.5  30.4     1     1 alaba~ <NA>      Amri~ Alab~ AL         Priv~
 5 -87.5  30.4     1     1 alaba~ <NA>      Athe~ Alab~ AL         Publ~
 6 -87.5  30.4     1     1 alaba~ <NA>      Aubu~ Alab~ AL         Publ~
 7 -87.5  30.4     1     1 alaba~ <NA>      Aubu~ Alab~ AL         Publ~
 8 -87.5  30.4     1     1 alaba~ <NA>      Bevi~ Alab~ AL         Publ~
 9 -87.5  30.4     1     1 alaba~ <NA>      Birm~ Alab~ AL         Priv~
10 -87.5  30.4     1     1 alaba~ <NA>      Bish~ Alab~ AL         Publ~
# ... with 1,227,021 more rows, and 9 more variables: degree_length <chr>,
#   room_and_board <dbl>, in_state_tuition <dbl>, in_state_total <dbl>,
#   out_of_state_tuition <dbl>, out_of_state_total <dbl>, state_avg <dbl>,
#   above_average <lgl>, pct_above <dbl>
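library(ggplot2)   # plotting
library(ggthemes)  # assumed source of theme_map()
# coord_map() additionally requires the mapproj package to be installed

# Base layer: state outlines from tut_map, filled by state.
# Note: mapping fill to state_avg (with scale_fill_viridis_c()) would
# instead shade each state by its average tuition.
p0 <- ggplot(data = tut_map,
             mapping = aes(x = long, y = lat,
                           group = group, fill = state))
# Thin light-gray borders, Albers projection
p1 <- p0 + geom_polygon(color = "gray90", size = 0.1) +
    coord_map(projection = "albers", lat0 = 39, lat1 = 45)
# Discrete viridis palette because state is categorical
p2 <- p1 + scale_fill_viridis_d() +
    labs(title = "2018-2019 Tuition Rates Across the US", fill = NULL)
# Strip axes and hide the legend
p3 <- p2 + theme_map() +
  theme(legend.position = "none")

p3

# plotly_build(p3)  # optional interactive version (requires plotly)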
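The joined table above carries the columns `state_avg` and `above_average`. A minimal sketch of how such per-state summaries could be computed is shown below. The definitions are my assumptions (mean `out_of_state_total` per state, flagged against the overall mean across schools), and the toy data are made up.

```r
library(dplyr)

# Toy stand-in: a few schools in three hypothetical states
toy <- data.frame(
  state = c("A", "A", "B", "B", "C"),
  out_of_state_total = c(10000, 20000, 30000, 50000, 25000)
)

# Overall mean across all schools (assumed comparison point)
overall_avg <- mean(toy$out_of_state_total)

# One row per state: average cost and whether it exceeds the overall mean
state_summary <- toy %>%
  group_by(state) %>%
  summarise(state_avg = mean(out_of_state_total)) %>%
  mutate(above_average = state_avg > overall_avg) %>%
  arrange(desc(state_avg))

state_summary
```

Summarising to one row per state before joining to the map polygons would also keep the joined table far smaller than one that repeats every polygon point for every school.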

Interpretation

Conclusion

Limitations

Due to time constraints, the interactivity of this exploratory analysis was limited, and the data required further cleaning or analysis of more specific subsets. In terms of the data, the tuition dataset is from 2018-2019 while the diversity data is from 2014, and both were consolidated and organized by the Tidy Tuesday GitHub. We should be cautious when drawing conclusions from time-dependent data collected in different periods, especially for variables that constantly change: tuition is expected to increase each year, as are enrollment numbers. The dataset also covers slightly fewer than 3,000 colleges, which is less than 60% of the total number of colleges/universities in the US.

Future Analyses

Thus, future analyses could use this exploratory model as a template for more comprehensive tuition data pooled from credible sources, which could include adding US colleges that were not included or had incomplete data from the original data set.

In addition, tuition analyses could be expanded to cover the relationship between tuition and other relevant topics such as student debt, state funds, and hiring trends for graduates across colleges, as well as associated factors such as rates of financial aid.

Overall, these types of analyses and observations are crucial to helping wider audiences make informed decisions about how the cost of attending college differs across several factors.

Further analyses could also contribute to increasing awareness of student debt as well as gaps in wage and salary potential that appear unevenly distributed across demographics.