Mini Project 1

Question 3

Project title:

Is College Worth It? Comparing Cost and Early Career Earnings

3–5 research questions:

1. Do colleges with higher tuition usually have higher median earnings after graduation?
2. Do public and private colleges differ in tuition and earnings?
3. Which states have the highest average tuition?
4. Which states have the highest median earnings for graduates?
5. Do schools whose students carry higher median debt also have higher median earnings?

Coding:

library(tidyverse)

## Warning: package 'dplyr' was built under R version 4.4.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readr)

# Load dataset
college <- read_csv("Most-Recent-Cohorts-Institution.csv")

## Rows: 6322 Columns: 3308
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2368): OPEID, OPEID6, INSTNM, CITY, STABBR, ZIP, ACCREDAGENCY, INSTURL,...
## dbl  (851): UNITID, SCH_DEG, HCM2, MAIN, NUMBRANCH, PREDDEG, HIGHDEG, CONTRO...
## lgl   (89): LOCALE2, UG, UGDS_WHITENH, UGDS_BLACKNH, UGDS_API, UGDS_AIANOLD,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(college)

## # A tibble: 6 × 3,308
##   UNITID OPEID    OPEID6 INSTNM   CITY  STABBR ZIP   ACCREDAGENCY INSTURL NPCURL
##    <dbl> <chr>    <chr>  <chr>    <chr> <chr>  <chr> <chr>        <chr>   <chr> 
## 1 100654 00100200 001002 Alabama… Norm… AL     35762 Southern As… www.aa… www.a…
## 2 100663 00105200 001052 Univers… Birm… AL     3529… Southern As… https:… https…
## 3 100690 02503400 025034 Amridge… Mont… AL     3611… Southern As… https:… https…
## 4 100706 00105500 001055 Univers… Hunt… AL     35899 Southern As… www.ua… uah.c…
## 5 100724 00100500 001005 Alabama… Mont… AL     3610… Southern As… www.al… tcc.r…
## 6 100751 00105100 001051 The Uni… Tusc… AL     3548… Southern As… www.ua… ua.ai…
## # ℹ 3,298 more variables: SCH_DEG <dbl>, HCM2 <dbl>, MAIN <dbl>,
## #   NUMBRANCH <dbl>, PREDDEG <dbl>, HIGHDEG <dbl>, CONTROL <dbl>,
## #   ST_FIPS <dbl>, REGION <dbl>, LOCALE <dbl>, LOCALE2 <lgl>, LATITUDE <dbl>,
## #   LONGITUDE <dbl>, CCBASIC <dbl>, CCUGPROF <dbl>, CCSIZSET <dbl>, HBCU <dbl>,
## #   PBI <dbl>, ANNHI <dbl>, TRIBAL <dbl>, AANAPII <dbl>, HSI <dbl>,
## #   NANTI <dbl>, MENONLY <dbl>, WOMENONLY <dbl>, RELAFFIL <dbl>,
## #   ADM_RATE <dbl>, ADM_RATE_ALL <dbl>, SATVR25 <dbl>, SATVR75 <dbl>, …

# Keep only the columns needed for the analysis and give them readable names
college_clean <- college %>%
  select(
    school = INSTNM,
    state = STABBR,
    tuition = TUITIONFEE_IN,
    cost = COSTT4_A,
    low_income_students = PCTPELL,
    median_debt = GRAD_DEBT_MDN,
    median_earnings = MD_EARN_WNE_P10
  )

# Convert earnings and debt to numeric (dataset stores some as text).
# Entries that are not valid numbers become NA, which triggers the coercion warning below.
college_clean <- college_clean %>%
  mutate(
    median_earnings = as.numeric(median_earnings),
    median_debt = as.numeric(median_debt)
  )

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `median_debt = as.numeric(median_debt)`.
## Caused by warning:
## ! NAs introduced by coercion
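
The coercion warning above means some debt values were text that `as.numeric()` could not parse. A minimal sketch of how one might list those values before converting is below. The helper name `non_numeric_values` and the toy vector are hypothetical, and the string "PrivacySuppressed" is only a placeholder, since the output above does not show which values failed.

```r
# Return the distinct raw values that as.numeric() cannot parse.
# Values that are already NA are skipped: they are missing, not malformed.
non_numeric_values <- function(x) {
  x <- as.character(x)
  parsed <- suppressWarnings(as.numeric(x))
  unique(x[!is.na(x) & is.na(parsed)])
}

# Toy vector for illustration only; "PrivacySuppressed" is a placeholder,
# not a value confirmed from the output above.
toy_debt <- c("19500", "PrivacySuppressed", NA, "23250.5", "PrivacySuppressed")
non_numeric_values(toy_debt)
```

On the real data this would need to run on the raw column (for example `college$GRAD_DEBT_MDN`) before the `mutate()` step, because `college_clean$median_debt` is already numeric afterwards.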

# Drop rows missing tuition or median earnings
# (cost, low_income_students, and median_debt can still contain NA)
college_clean <- college_clean %>%
  drop_na(tuition, median_earnings)

# Basic summary statistics
summary(college_clean)

##     school             state              tuition           cost      
##  Length:3399        Length:3399        Min.   :    0   Min.   : 4451  
##  Class :character   Class :character   1st Qu.: 5550   1st Qu.:16844  
##  Mode  :character   Mode  :character   Median :12097   Median :25815  
##                                        Mean   :18075   Mean   :32025  
##                                        3rd Qu.:25023   3rd Qu.:42555  
##                                        Max.   :72097   Max.   :93512  
##                                                        NA's   :347    
##  low_income_students  median_debt    median_earnings 
##  Min.   :0.0000      Min.   : 2819   Min.   : 11998  
##  1st Qu.:0.2457      1st Qu.:11719   1st Qu.: 37722  
##  Median :0.3389      Median :19976   Median : 44950  
##  Mean   :0.3778      Mean   :18507   Mean   : 48203  
##  3rd Qu.:0.4696      3rd Qu.:24807   3rd Qu.: 55939  
##  Max.   :1.0000      Max.   :43021   Max.   :143372  
##  NA's   :279         NA's   :298
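
The summary shows a minimum tuition of 0, which may be worth a closer look before interpreting averages. The sketch below shows one way to list such schools. The helper `low_tuition_schools` and the toy rows are hypothetical, and the column names are assumed to match `college_clean`.

```r
library(dplyr)

# Hypothetical helper: list schools whose reported tuition is at or below a threshold.
low_tuition_schools <- function(df, threshold = 0) {
  df %>%
    filter(tuition <= threshold) %>%
    select(school, state, tuition, median_earnings) %>%
    arrange(state, school)
}

# Toy rows with made-up values, for illustration only.
toy <- data.frame(
  school = c("School A", "School B", "School C"),
  state = c("NY", "CA", "NY"),
  tuition = c(0, 12000, 35000),
  median_earnings = c(41000, 45000, 60000)
)
low_tuition_schools(toy)
```

Calling `low_tuition_schools(college_clean)` would show which schools report zero tuition, so they can be checked by hand rather than silently included.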

# Average tuition and earnings by state, sorted from highest to lowest earnings
state_summary <- college_clean %>%
  group_by(state) %>%
  summarise(
    avg_tuition = mean(tuition, na.rm = TRUE),
    avg_earnings = mean(median_earnings, na.rm = TRUE)
  ) %>%
  arrange(desc(avg_earnings))

print(state_summary)

## # A tibble: 58 × 3
##    state avg_tuition avg_earnings
##    <chr>       <dbl>        <dbl>
##  1 RI         39308        65514.
##  2 MA         34673.       61678.
##  3 DC         34101.       60394.
##  4 CT         31057.       60388.
##  5 NJ         19837.       56295.
##  6 PA         27033.       55945.
##  7 NH         21665.       54872.
##  8 NY         25186        54869.
##  9 CA         16909.       54469.
## 10 MD         19207.       54074.
## # ℹ 48 more rows
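
The question about public and private colleges is not covered by the code above. A possible sketch follows. It assumes the `CONTROL` column is also kept in `select()` (for example as `control = CONTROL`) and that it is coded 1 = public, 2 = private nonprofit, 3 = private for-profit. That coding is an assumption that should be checked against the dataset's data dictionary. The helper name and toy rows are hypothetical.

```r
library(dplyr)

# Assumed coding of CONTROL (verify against the data dictionary):
# 1 = public, 2 = private nonprofit, 3 = private for-profit.
control_labels <- c("Public", "Private nonprofit", "Private for-profit")

# Hypothetical helper: average tuition and earnings for each ownership type.
summarise_by_control <- function(df) {
  df %>%
    mutate(control = factor(control, levels = 1:3, labels = control_labels)) %>%
    group_by(control) %>%
    summarise(
      n_schools = n(),
      avg_tuition = mean(tuition, na.rm = TRUE),
      avg_earnings = mean(median_earnings, na.rm = TRUE),
      .groups = "drop"
    )
}

# Toy rows with made-up values, for illustration only.
toy <- data.frame(
  control = c(1, 1, 2, 3),
  tuition = c(8000, 10000, 40000, 15000),
  median_earnings = c(45000, 47000, 60000, 35000)
)
summarise_by_control(toy)
```

Including `n_schools` makes it easier to judge how much weight each group average deserves.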

# Relationship between tuition and earnings
ggplot(college_clean, aes(x = tuition, y = median_earnings)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", color = "blue") +
  labs(
    title = "Do Higher-Tuition Schools Have Higher Earnings?",
    x = "In-State Tuition and Fees (USD)",
    y = "Median Earnings 10 Years After Enrollment"
  )

## `geom_smooth()` using formula = 'y ~ x'
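
The scatter plot shows the trend visually. To put a number on it, one option is a correlation coefficient together with the slope of the same linear fit that `geom_smooth(method = "lm")` draws. The sketch below uses base R. The helper name `tuition_earnings_fit` and the toy rows are hypothetical.

```r
# Hypothetical helper: correlation, fitted slope, and R-squared
# for a simple linear regression of earnings on tuition.
tuition_earnings_fit <- function(df) {
  fit <- lm(median_earnings ~ tuition, data = df)
  list(
    correlation = cor(df$tuition, df$median_earnings, use = "complete.obs"),
    slope = unname(coef(fit)["tuition"]),  # change in earnings per extra dollar of tuition
    r_squared = summary(fit)$r.squared
  )
}

# Toy rows with made-up values, for illustration only.
toy <- data.frame(
  tuition = c(5000, 10000, 20000, 40000),
  median_earnings = c(38000, 42000, 46000, 59000)
)
tuition_earnings_fit(toy)
```

Running `tuition_earnings_fit(college_clean)` would give the corresponding figures for the real data. A positive slope describes an association between tuition and earnings, not a causal effect.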

# Debt vs earnings
ggplot(college_clean, aes(x = median_debt, y = median_earnings)) +
  geom_point(alpha = 0.4) +
  labs(
    title = "Student Debt vs Earnings",
    x = "Median Debt (USD)",
    y = "Median Earnings 10 Years After Enrollment (USD)"
  )

## Warning: Removed 298 rows containing missing values or values outside the scale range
## (`geom_point()`).
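
The 298 removed rows match the 298 missing `median_debt` values in the summary. To compare debt and earnings school by school, one possible approach is a debt-to-earnings ratio, sketched below. The helper name `add_debt_to_earnings` and the toy rows are hypothetical.

```r
library(dplyr)

# Hypothetical helper: drop rows missing debt or earnings, then add
# median debt as a share of median earnings, highest ratio first.
add_debt_to_earnings <- function(df) {
  df %>%
    filter(!is.na(median_debt), !is.na(median_earnings)) %>%
    mutate(debt_to_earnings = median_debt / median_earnings) %>%
    arrange(desc(debt_to_earnings))
}

# Toy rows with made-up values, for illustration only.
toy <- data.frame(
  school = c("School A", "School B", "School C"),
  median_debt = c(20000, NA, 12000),
  median_earnings = c(40000, 50000, 60000)
)
add_debt_to_earnings(toy)
```

A lower ratio means debt is small relative to later earnings. Filtering out the missing debt values explicitly also avoids the plotting warning.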

Conclusion:

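The results suggest that schools with higher tuition tend to have higher median earnings among their former students. The state summary points the same way: RI, MA, DC, and CT rank highest on average earnings and also have the highest average tuition among the ten states shown. Because the dataset only includes colleges, it compares schools with one another and cannot show whether attending college raises earnings compared with not attending.

However, the relationship is not perfect, and it is an association rather than evidence that higher tuition causes higher earnings. Other factors such as school quality, field of study, and location also influence earnings after college.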

Jake Memoli

2026-03-24
