Choosing a college is stressful enough without having to worry about how much it costs, how likely you are to graduate, or how much debt you’ll be stuck with afterward. I wanted to use this project to dig into real data on 4-year colleges and see how bad (or good) things really are. The dataset I’m using is CollegeScores4yr from the Lock5Data collection. It contains information on tuition, graduation rates, student debt, enrollment, and federal aid. Since I’m at the stage of life where everyone is either in college, trying to transfer, or figuring out loans, I thought it would be worth looking at this with actual statistics.
For this project, I’ll be answering 10 simple but real-life questions using descriptive statistics like means, standard deviations, histograms, boxplots, barplots, and correlations.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data_link <- "https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv"
college <- read_csv(data_link)
## Rows: 2012 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Name, State, Accred, Control, Region, Locale
## dbl (31): ID, Main, MainDegree, HighDegree, Latitude, Longitude, AdmitRate, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(college)
## [1] "Name" "State" "ID" "Main" "Accred"
## [6] "MainDegree" "HighDegree" "Control" "Region" "Locale"
## [11] "Latitude" "Longitude" "AdmitRate" "MidACT" "AvgSAT"
## [16] "Online" "Enrollment" "White" "Black" "Hispanic"
## [21] "Asian" "Other" "PartTime" "NetPrice" "Cost"
## [26] "TuitionIn" "TuitonOut" "TuitionFTE" "InstructFTE" "FacSalary"
## [31] "FullTimeFac" "Pell" "CompRate" "Debt" "Female"
## [36] "FirstGen" "MedIncome"
college %>%
  group_by(Control) %>%
  summarise(mean_tuition = mean(TuitionFTE, na.rm = TRUE))
## # A tibble: 3 × 2
## Control mean_tuition
## <chr> <dbl>
## 1 Private 15919.
## 2 Profit 16268.
## 3 Public 8100.
ggplot(college, aes(x = Control, y = TuitionFTE)) +
  geom_boxplot() +
  labs(title = "Tuition by Type of College", x = "Type", y = "Tuition per Full-Time Equivalent")
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
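Private and for-profit ("Profit") colleges both average roughly $16,000 in tuition per full-time equivalent student, about double the $8,100 average at public colleges, though the range varies a lot from school to school. That confirms what most people already assume.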
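A mean can be pulled up by a handful of very expensive schools, so one possible follow-up is to put the median and the group size next to it. This is only a sketch: the helper name `summarise_by_control` is my own, it uses base R rather than dplyr, and the `toy` data frame holds made-up numbers that stand in for `college`.

```r
# Hypothetical helper: mean, median, and non-missing count of TuitionFTE
# for each Control group.
summarise_by_control <- function(df) {
  groups <- split(df$TuitionFTE, df$Control)
  data.frame(
    Control = names(groups),
    mean_tuition = sapply(groups, mean, na.rm = TRUE),
    median_tuition = sapply(groups, median, na.rm = TRUE),
    n = sapply(groups, function(x) sum(!is.na(x))),
    row.names = NULL
  )
}

# Made-up values for illustration only, NOT taken from CollegeScores4yr.
toy <- data.frame(
  Control = c("Private", "Private", "Private", "Public", "Public", "Public", "Profit"),
  TuitionFTE = c(10000, 12000, 50000, 7000, 9000, NA, 15000)
)

tuition_summary <- summarise_by_control(toy)
print(tuition_summary)

# With the real data this would be: summarise_by_control(college)
```

In the toy data the Private mean is twice its median because of one expensive school, which is the kind of gap worth checking for in the real groups.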
ggplot(college, aes(x = CompRate)) +
  geom_histogram(binwidth = 5, color = "black", fill = "skyblue") +
  geom_vline(xintercept = 50, color = "red", linetype = "dashed") +
  labs(title = "Graduation Rates Distribution", x = "Graduation Rate (%)", y = "Number of Colleges")
## Warning: Removed 167 rows containing non-finite outside the scale range
## (`stat_bin()`).
A surprising number of schools fall to the left of the dashed red line at 50%, which makes you wonder whether students are leaving because of costs, academics, or other reasons.
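The histogram shows the shape, but "a surprising number" can also be turned into an actual share. Here is a minimal sketch of how that could be computed. The helper `share_below` is a name I made up, and `toy_rates` is invented data, not the real `CompRate` column.

```r
# Hypothetical helper: proportion of non-missing values below a cutoff.
share_below <- function(x, cutoff = 50) {
  mean(x < cutoff, na.rm = TRUE)
}

# Made-up graduation rates for illustration only.
toy_rates <- c(35, 48, 52, 61, 77, NA, 44, 90)

below_half <- share_below(toy_rates)
print(below_half)  # 3 of the 7 non-missing values are under 50

# With the real data this would be: share_below(college$CompRate)
```

Taking the mean of a TRUE/FALSE vector gives the proportion of TRUE values, and `na.rm = TRUE` keeps schools with a missing rate out of the denominator.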
sd(college$NetPrice, na.rm = TRUE)
## [1] 7854.096
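A standard deviation of about $7,850 means net prices differ a lot from school to school. Since net price is what students pay after aid, it also suggests that the sticker price is not always the price people actually end up paying.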
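The standard deviation of `NetPrice` alone does not compare sticker price to net price, so a possible extra check is to look at the gap between the two directly. This sketch assumes that `Cost` is the published cost of attendance and `NetPrice` is the average price after aid. The helper `price_gap` is my own name, and the `toy` numbers are made up.

```r
# Hypothetical helper: summary of the gap between sticker price (Cost)
# and net price (NetPrice). Rows missing either value are dropped.
price_gap <- function(df) {
  gap <- df$Cost - df$NetPrice
  c(
    mean_gap = mean(gap, na.rm = TRUE),
    sd_gap = sd(gap, na.rm = TRUE),
    n = sum(!is.na(gap))
  )
}

# Made-up values for illustration only, NOT taken from CollegeScores4yr.
toy <- data.frame(
  Cost = c(30000, 45000, 20000, NA),
  NetPrice = c(18000, 25000, 15000, 12000)
)

gap_summary <- price_gap(toy)
print(gap_summary)

# With the real data this would be: price_gap(college)
```

A large average gap would back up the idea that most students pay well below the sticker price.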
ggplot(college, aes(x = AdmitRate)) +
  geom_histogram(binwidth = 0.05, color = "black", fill = "lightgreen") +
  labs(title = "Distribution of Admission Rates", x = "Admission Rate", y = "Number of Colleges")
## Warning: Removed 360 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(college, aes(y = AdmitRate)) +
  geom_boxplot() +
  labs(title = "Boxplot of Admission Rates")
## Warning: Removed 360 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Admission rates are all over the place. Some schools take almost everybody, while others are super selective. No surprise there, but it’s cool to see it visually.
var(college$AdmitRate, na.rm = TRUE)
## [1] 0.04333848
A variance of about 0.043 on a 0-to-1 scale confirms how spread out admission rates are. Schools are definitely not following one common pattern.
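Variance is in squared units, so it can be easier to read as a standard deviation, which is back on the same 0-to-1 scale as the admission rate. A quick sketch using the variance printed above (the variable names are my own):

```r
# Variance of AdmitRate as reported in the output above.
admit_var <- 0.04333848

# The standard deviation is the square root of the variance.
admit_sd <- sqrt(admit_var)
print(round(admit_sd, 3))  # about 0.208

# Computing it straight from the data would be:
# sd(college$AdmitRate, na.rm = TRUE)
```

A standard deviation of roughly 0.21 means a typical school sits about 21 percentage points away from the average admission rate.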
ggplot(college, aes(x = TuitionFTE, y = CompRate)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Tuition vs Graduation Rate", x = "Tuition per FTE", y = "Graduation Rate (%)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 169 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 169 rows containing missing values or values outside the scale range
## (`geom_point()`).
cor(college$TuitionFTE, college$CompRate, use = "complete.obs")
## [1] 0.4556305
With a correlation of about 0.46, there’s a moderate positive trend, but it’s far from a perfect one. Schools that charge more tend to graduate more of their students, but paying more doesn’t guarantee a diploma.
ggplot(college, aes(x = Control)) +
  geom_bar() +
  labs(title = "Number of Colleges by Type", x = "Type of Institution", y = "Count")
There are definitely more public colleges, which makes sense since state schools are everywhere.
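A bar chart makes the ranking visible, but exact counts make the comparison easier to check. Below is a small sketch of one way to tabulate them. The helper `count_by_control` is a name I made up, and the `toy` data is invented, so it says nothing about the real counts.

```r
# Hypothetical helper: number of schools per Control group, largest first.
count_by_control <- function(df) {
  counts <- table(df$Control, useNA = "ifany")
  sort(counts, decreasing = TRUE)
}

# Made-up values for illustration only, NOT taken from CollegeScores4yr.
toy <- data.frame(
  Control = c("Public", "Private", "Private", "Profit", "Private", "Public")
)

control_counts <- count_by_control(toy)
print(control_counts)

# With the real data this would be: count_by_control(college)
```

Dividing the result by `sum(control_counts)` would turn the counts into shares of all schools.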
mean(college$Pell, na.rm = TRUE)
## [1] 37.85296
On average, around 38% of students are receiving Pell Grants, which tells you a lot about how many students need financial help.
ggplot(college, aes(x = Enrollment)) +
  geom_histogram(binwidth = 5000, color = "black", fill = "purple") +
  labs(title = "Distribution of Enrollment Sizes", x = "Enrollment", y = "Number of Colleges")
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
The distribution is skewed to the right: most colleges are actually on the smaller side, but a few big universities pull the average up a lot.
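One way to see how much the big universities pull the average is to compare the mean with the median, since the median barely moves when a few huge values are added. A minimal sketch, with a helper name (`center_summary`) and enrollment numbers that are both made up:

```r
# Hypothetical helper: mean and median of a numeric vector, ignoring NAs.
center_summary <- function(x) {
  c(mean = mean(x, na.rm = TRUE), median = median(x, na.rm = TRUE))
}

# Made-up enrollments for illustration only: five small schools, one huge one.
toy_enrollment <- c(800, 1200, 1500, 2000, 2500, 45000, NA)

enrollment_center <- center_summary(toy_enrollment)
print(enrollment_center)

# With the real data this would be: center_summary(college$Enrollment)
```

In the toy data the mean is about five times the median, which is the signature of a right-skewed distribution.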
ggplot(college, aes(x = CompRate, y = Debt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Graduation Rate vs Median Debt", x = "Graduation Rate (%)", y = "Median Debt ($)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 273 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 273 rows containing missing values or values outside the scale range
## (`geom_point()`).
cor(college$CompRate, college$Debt, use = "complete.obs")
## [1] -0.15836
Interestingly, schools with better graduation rates tend to have students with slightly less debt, although with a correlation of about -0.16 the relationship is weak. It might be because students there finish faster or because those schools manage costs better.
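To put the two correlations in this report on the same footing, it can help to square them. For a straight-line relationship, r squared is the share of the variation in one variable that the other accounts for. This sketch only reuses the correlation values printed above, and the variable names are my own.

```r
# Correlations as reported in the output above.
r_tuition_grad <- 0.4556305   # TuitionFTE vs CompRate
r_grad_debt <- -0.15836       # CompRate vs Debt

# Squaring r gives the share of variation accounted for by a linear fit.
shared <- c(
  tuition_vs_grad = r_tuition_grad^2,
  grad_vs_debt = r_grad_debt^2
)

# Express as percentages, rounded to one decimal place.
shared_pct <- round(100 * shared, 1)
print(shared_pct)  # about 20.8 and 2.5
```

So tuition lines up with roughly a fifth of the variation in graduation rates, while graduation rate lines up with only a few percent of the variation in debt.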