Introduction

Choosing a college is stressful enough without worrying about how much it costs, how likely you are to graduate, or how much debt you’ll be stuck with afterward. I wanted to use this project to dig into real data about 4-year colleges and see how bad (or good) things really are. The dataset I’m using is CollegeScores4yr from the Lock5Data collection. It covers tuition, graduation rates, student debt, enrollment, and federal aid. Since I’m at the stage of life where everyone is either in college, trying to transfer, or figuring out loans, I thought it would be worth looking at these questions with actual statistics.

For this project, I’ll answer 10 simple, real-life questions using descriptive statistics (means, standard deviations, variances, and correlations) and plots (histograms, boxplots, and bar charts).


Data Preparation

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data_link <- "https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv"
college <- read_csv(data_link)
## Rows: 2012 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): Name, State, Accred, Control, Region, Locale
## dbl (31): ID, Main, MainDegree, HighDegree, Latitude, Longitude, AdmitRate, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(college)
##  [1] "Name"        "State"       "ID"          "Main"        "Accred"     
##  [6] "MainDegree"  "HighDegree"  "Control"     "Region"      "Locale"     
## [11] "Latitude"    "Longitude"   "AdmitRate"   "MidACT"      "AvgSAT"     
## [16] "Online"      "Enrollment"  "White"       "Black"       "Hispanic"   
## [21] "Asian"       "Other"       "PartTime"    "NetPrice"    "Cost"       
## [26] "TuitionIn"   "TuitonOut"   "TuitionFTE"  "InstructFTE" "FacSalary"  
## [31] "FullTimeFac" "Pell"        "CompRate"    "Debt"        "Female"     
## [36] "FirstGen"    "MedIncome"

Question 1 - Do private colleges really charge way more than public ones on average?

college %>%
  group_by(Control) %>%
  summarise(mean_tuition = mean(TuitionFTE, na.rm = TRUE))
## # A tibble: 3 × 2
##   Control mean_tuition
##   <chr>          <dbl>
## 1 Private       15919.
## 2 Profit        16268.
## 3 Public         8100.
ggplot(college, aes(x = Control, y = TuitionFTE)) +
  geom_boxplot() +
  labs(title = "Tuition: Private vs Public Colleges", x = "Type", y = "Tuition per Full-Time Equivalent")
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

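Yes. Private colleges average about $15,900 in tuition per full-time-equivalent student and for-profit colleges about $16,300, roughly double the $8,100 average at public colleges. The boxplot also shows that tuition varies a lot from school to school within each type. This definitely confirms what most people already assume.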


Question 2 - How many schools have graduation rates below 50%?

ggplot(college, aes(x = CompRate)) +
  geom_histogram(binwidth = 5, color = "black", fill = "skyblue") +
  geom_vline(xintercept = 50, color = "red", linetype = "dashed") +
  labs(title = "Graduation Rates Distribution", x = "Graduation Rate (%)", y = "Number of Colleges")
## Warning: Removed 167 rows containing non-finite outside the scale range
## (`stat_bin()`).

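The bars to the left of the dashed red line are the schools with graduation rates under 50%, and there are a surprising number of them. That makes you wonder whether students are leaving because of cost, academics, or something else. Note that 167 schools have no graduation rate reported and are left out of the plot.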
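
The histogram shows the shape, but the question asks for a number. Here is a rough sketch of how the count could be computed. The helper name `count_below` is my own invention, and the toy vector is made up for illustration, not taken from the dataset.

```r
# Hypothetical helper: count and share of values below a cutoff, ignoring NAs.
count_below <- function(x, cutoff) {
  x <- x[!is.na(x)]                      # drop missing values first
  c(n_below = sum(x < cutoff),           # how many are below the cutoff
    share   = mean(x < cutoff))          # what fraction of non-missing values
}

# Toy graduation rates in percent (made-up values, NOT from CollegeScores4yr)
toy_rates <- c(35, 48, 52, 61, NA, 77, 49.9, 50)
result <- count_below(toy_rates, 50)
print(result)
```

On the real data the call would be `count_below(college$CompRate, 50)`, assuming CompRate is stored in percent as the histogram axis suggests.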


Question 3 - What is the standard deviation of average net price?

sd(college$NetPrice, na.rm = TRUE)
## [1] 7854.096

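Net price has a standard deviation of about $7,850, so the amount students actually pay after aid differs a lot from one school to the next. That spread is a reminder that a single sticker price doesn’t describe what people end up paying.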
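
A standard deviation is easier to judge next to the mean. The sketch below reports both, plus their ratio (the coefficient of variation). The helper name `spread_summary` and the toy prices are assumptions for illustration only.

```r
# Hypothetical helper: mean, standard deviation, and coefficient of variation.
spread_summary <- function(x) {
  x <- x[!is.na(x)]          # ignore missing values
  m <- mean(x)
  s <- sd(x)                 # sample standard deviation (n - 1 denominator)
  c(mean = m, sd = s, cv = s / m)
}

# Toy net prices in dollars (made-up values, NOT from CollegeScores4yr)
toy_prices <- c(10000, 20000, 30000, NA)
out <- spread_summary(toy_prices)
print(out)
```

With the real data this would be `spread_summary(college$NetPrice)`.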


Question 4 - How uneven is the admission rate?

ggplot(college, aes(x = AdmitRate)) +
  geom_histogram(binwidth = 0.05, color = "black", fill = "lightgreen") +
  labs(title = "Distribution of Admission Rates", x = "Admission Rate", y = "Number of Colleges")
## Warning: Removed 360 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggplot(college, aes(y = AdmitRate)) +
  geom_boxplot() +
  labs(title = "Boxplot of Admission Rates")
## Warning: Removed 360 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Admission rates are all over the place. Some schools take almost everybody, while others are super selective. No surprise there, but it’s cool to see it visually.


Question 5 - What is the variance of admission rates?

var(college$AdmitRate, na.rm = TRUE)
## [1] 0.04333848

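A variance of 0.043 is hard to read on its own because it is in squared units. Its square root, the standard deviation, is about 0.21, so a typical school sits roughly 21 percentage points away from the mean admission rate. That confirms how spread out admission rates are. Schools are definitely not following one common pattern.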
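
Since AdmitRate is a proportion between 0 and 1, the variance is easier to interpret after taking its square root. A quick check using the value printed above (the variable names are just for this example):

```r
admit_var <- 0.04333848        # variance of AdmitRate reported above
admit_sd  <- sqrt(admit_var)   # standard deviation, back in the original units
round(admit_sd, 3)
```

The same number should come out of `sd(college$AdmitRate, na.rm = TRUE)`.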


Question 6 - Does higher tuition lead to higher graduation rates?

ggplot(college, aes(x = TuitionFTE, y = CompRate)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Tuition vs Graduation Rate", x = "Tuition per FTE", y = "Graduation Rate (%)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 169 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 169 rows containing missing values or values outside the scale range
## (`geom_point()`).

cor(college$TuitionFTE, college$CompRate, use = "complete.obs")
## [1] 0.4556305

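The correlation is about 0.46, which is a moderate positive relationship: schools that collect more tuition per student tend to graduate a larger share of them. This is only an association, so it doesn’t show that paying more causes higher graduation rates, and it definitely doesn’t guarantee a diploma.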
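
One way to put the correlation in perspective is to square it, which gives the share of the variation in graduation rate that a straight-line fit on tuition accounts for. A small sketch using the value printed above (variable names are just for this example):

```r
r <- 0.4556305          # correlation between TuitionFTE and CompRate reported above
r_squared <- r^2        # proportion of variance explained by the linear fit
round(r_squared, 2)
```

So tuition accounts for roughly a fifth of the variation in graduation rates, which leaves most of it to other factors.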


Question 7 - What is the proportion of private vs public colleges?

ggplot(college, aes(x = Control)) +
  geom_bar() +
  labs(title = "Private vs Public Colleges", x = "Type of Institution", y = "Count")

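The bar chart shows how many colleges fall into each of the three Control categories (Private, Profit, and Public), not just private and public. To answer the question as a proportion, each count has to be divided by the total number of colleges.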
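
The bar chart gives counts, while the question asks for proportions. Here is a minimal sketch of the conversion with base R’s `table()` and `prop.table()`. The toy vector is made up for illustration and does not reflect the real mix of schools.

```r
# Toy control types (made-up values, NOT from CollegeScores4yr)
toy_control <- c("Private", "Private", "Public", "Profit", "Private", "Public", NA)

counts <- table(toy_control)   # counts per category; NA is dropped by default
props  <- prop.table(counts)   # divide each count by the total
round(props, 2)
```

On the real data this would be `prop.table(table(college$Control))`.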


Question 8 - What is the average percentage of students receiving federal aid (Pell Grants)?

mean(college$Pell, na.rm = TRUE)
## [1] 37.85296

On average, about 38% of students at these schools receive Pell Grants, which tells you a lot about how many students need financial help.


Question 9 - What’s the distribution of student enrollment sizes?

ggplot(college, aes(x = Enrollment)) +
  geom_histogram(binwidth = 5000, color = "black", fill = "purple") +
  labs(title = "Distribution of Enrollment Sizes", x = "Enrollment", y = "Number of Colleges")
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

Most colleges are on the smaller side, but a few very large universities stretch the distribution to the right and pull the average well above the size of a typical school.
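
A simple way to see that pull is to compare the mean with the median, since the median is not affected by a few huge values. The toy enrollments below are made up for illustration, not taken from the dataset.

```r
# Toy enrollments (made-up values, NOT from CollegeScores4yr):
# five small schools and one very large university
toy_enrollment <- c(800, 1200, 1500, 2000, 2500, 40000)

center <- c(mean   = mean(toy_enrollment),
            median = median(toy_enrollment))
print(center)   # mean 8000, median 1750
```

For the real data the comparison would be `mean(college$Enrollment, na.rm = TRUE)` against `median(college$Enrollment, na.rm = TRUE)`.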


Question 10 - Do colleges with higher graduation rates also have less debt?

ggplot(college, aes(x = CompRate, y = Debt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Graduation Rate vs Median Debt", x = "Graduation Rate (%)", y = "Median Debt ($)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 273 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 273 rows containing missing values or values outside the scale range
## (`geom_point()`).

cor(college$CompRate, college$Debt, use = "complete.obs")
## [1] -0.15836

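Interestingly, the correlation is about -0.16, so schools with higher graduation rates tend to have students with slightly less debt, but the relationship is weak. It might be because their students finish faster or manage costs better, though this data can’t confirm either explanation.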