Question 1

data(cars)
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Question 2

library(jsonlite)

api_base <- "https://min-api.cryptocompare.com/data/v2/histoday"
f_sym <- "BTC"
t_sym <- "USD"
days <- "100"

final_link <- paste0(api_base, "?fsym=", f_sym, "&tsym=", t_sym, "&limit=", days)

res <- fromJSON(final_link)

btc_data <- res$Data$Data

max_price <- max(btc_data$close)

print(max_price)
## [1] 96945.09
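
To report when the maximum close occurred, not only its value, the row holding the maximum can be located with which.max(). The sketch below uses a small invented data frame standing in for btc_data; the `time` column (UNIX seconds) is an assumption about the shape of the API response, and the prices are made up for illustration.

```r
# Hypothetical stand-in for btc_data; the prices are invented, not real data.
# Assumption: the response carries a `time` column in UNIX seconds.
btc_demo <- data.frame(
  time  = c(1704067200, 1704153600, 1704240000),
  close = c(100.5, 250.25, 175.0)
)

# Index of the row with the highest closing price
i_max <- which.max(btc_demo$close)

# Convert the UNIX timestamp of that row to a calendar date (UTC)
max_time  <- as.POSIXct(btc_demo$time[i_max], origin = "1970-01-01", tz = "UTC")
max_date  <- as.Date(max_time)
max_close <- btc_demo$close[i_max]

print(max_date)
print(max_close)
```

Using which.max() keeps the price and its date tied to the same row, which max() alone does not.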

Project Overview

Ultimate Goal

The goal of this project is to evaluate whether higher tuition and student debt are associated with higher post-graduation earnings, and to determine which institutions provide stronger financial return on investment (ROI).


Research Questions

  1. Is there a relationship between tuition cost and median earnings 10 years after graduation?
  2. Does higher student loan debt correlate with higher or lower earnings?
  3. Do private institutions generate higher earnings than public institutions?
  4. Is admission rate related to post-graduation earnings?
  5. Which institutions provide the strongest ROI (earnings relative to debt)?

Data Extraction

Description

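We use the College Scorecard dataset from the U.S. Department of Education. This institution-level dataset includes tuition, debt, earnings, and institutional characteristics. Note that the Scorecard defines its earnings measure (MD_EARN_WNE_P10) relative to when students entered the institution, so "10 years" in this report refers to that measure.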


library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded

Import Dataset

college <- read_csv("Most-Recent-Cohorts-Institution.csv")
## Rows: 6429 Columns: 3306
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2380): OPEID, OPEID6, INSTNM, CITY, STABBR, ZIP, ACCREDAGENCY, INSTURL,...
## dbl  (851): UNITID, SCH_DEG, HCM2, MAIN, NUMBRANCH, PREDDEG, HIGHDEG, CONTRO...
## lgl   (75): LOCALE2, UG, UGDS_WHITENH, UGDS_BLACKNH, UGDS_API, UGDS_AIANOLD,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Description:
We import the raw College Scorecard dataset into R for cleaning and analysis.


Data Cleaning and Preparation

Select Relevant Variables

college_clean <- college %>%
  select(
    INSTNM,
    CONTROL,
    TUITIONFEE_IN,
    ADM_RATE,
    C150_4,
    DEBT_MDN,
    MD_EARN_WNE_P10
  )

Description:
We keep only the variables necessary to answer our research questions.


Handle Missing Values

college_clean <- college_clean %>%
  mutate(across(where(is.character), ~na_if(., "NULL")))

Description:
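We recode the literal string "NULL" as NA in every character column so that missing values are recognized by R. No rows are dropped at this step; observations missing tuition, debt, or earnings are excluded later, when the plotting and modeling functions omit incomplete rows.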
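
The chunk above only recodes "NULL" strings as NA. A minimal sketch of that recoding, followed by one way rows missing a key variable could be removed explicitly with complete.cases(), is shown below. It uses base R and an invented toy data frame (`demo`), not the real dataset.

```r
# Toy data standing in for college_clean; all values are invented.
demo <- data.frame(
  INSTNM          = c("A", "B", "C", "D"),
  TUITIONFEE_IN   = c("5000", "NULL", "12000", "30000"),
  DEBT_MDN        = c("9000", "15000", "NULL", "20000"),
  MD_EARN_WNE_P10 = c("40000", "52000", "61000", "NULL"),
  stringsAsFactors = FALSE
)

# Recode the literal string "NULL" as NA in every character column
is_chr <- vapply(demo, is.character, logical(1))
demo[is_chr] <- lapply(demo[is_chr], function(x) replace(x, x == "NULL", NA))

# Count missing values per column before deciding what to drop
na_counts <- colSums(is.na(demo))
print(na_counts)

# Keep only rows where all three key variables are present
key_vars <- c("TUITIONFEE_IN", "DEBT_MDN", "MD_EARN_WNE_P10")
demo_complete <- demo[complete.cases(demo[, key_vars]), ]
print(demo_complete)
```

Dropping rows explicitly makes the analysis sample the same for every plot and model, at the cost of discarding institutions that are missing only one variable.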


Convert Variables to Numeric

college_clean <- college_clean %>%
  mutate(
    TUITIONFEE_IN = as.numeric(TUITIONFEE_IN),
    DEBT_MDN = as.numeric(DEBT_MDN),
    MD_EARN_WNE_P10 = as.numeric(MD_EARN_WNE_P10),
    ADM_RATE = as.numeric(ADM_RATE),
    C150_4 = as.numeric(C150_4)
  )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `DEBT_MDN = as.numeric(DEBT_MDN)`.
## Caused by warning:
## ! NAs introduced by coercion

Description:
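Some columns are stored as characters. We convert them to numeric for proper calculations and modeling. The coercion warning for DEBT_MDN means that some entries in that column are not numbers (most likely privacy-suppressed values); these entries become NA.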
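
Before accepting a coercion warning, it can help to list exactly which entries failed to convert. The sketch below does this on an invented vector; the marker "PrivacySuppressed" is an assumption about what the non-numeric entries look like and should be checked against the real column.

```r
# Invented values; "PrivacySuppressed" is assumed to be the non-numeric marker.
debt_raw <- c("9500", "PrivacySuppressed", NA, "15000", "12,000")

# Attempt the conversion quietly so the failing entries can be inspected
debt_num <- suppressWarnings(as.numeric(debt_raw))

# Entries that were present but could not be parsed as numbers
failed <- unique(debt_raw[!is.na(debt_raw) & is.na(debt_num)])
print(failed)
```

If the only failing entries are suppression markers, treating them as NA is reasonable; entries such as "12,000" would instead point to a formatting problem that should be fixed before conversion.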


Convert Institution Type to Factor

college_clean$CONTROL <- factor(
  college_clean$CONTROL,
  levels = c(1,2,3),
  labels = c("Public","Private Nonprofit","Private For-Profit")
)

Description:
We relabel the institution type variable to improve readability.
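
A small sketch of how factor() maps codes to labels, on invented codes. Two details matter for later steps: a code that is not listed in `levels` becomes NA, and the first level ("Public") is the reference category when the factor is used in lm().

```r
# Invented codes; 4 is deliberately not a valid institution type
control_codes <- c(1, 3, 2, 1, 4)

control_f <- factor(
  control_codes,
  levels = c(1, 2, 3),
  labels = c("Public", "Private Nonprofit", "Private For-Profit")
)

# Tabulate the labels, showing any codes that did not match a level
print(table(control_f, useNA = "ifany"))

# The first level is the baseline in regression output
print(levels(control_f)[1])
```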


Create ROI Variable

college_clean <- college_clean %>%
  mutate(
    ROI_ratio = MD_EARN_WNE_P10 / DEBT_MDN
  )

Description:
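We create a return-on-investment (ROI) proxy by dividing median earnings by median debt. A ratio of 4, for example, means median earnings are four times median debt, so higher values indicate a stronger return. The ratio is NA wherever either earnings or debt is missing.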
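
To address which institutions have the strongest ROI, the ratio can be sorted in descending order. The following is a minimal sketch on invented data (the school names and values are hypothetical), written in base R.

```r
# Invented institutions and values, for illustration only
roi_demo <- data.frame(
  INSTNM          = c("School A", "School B", "School C", "School D"),
  MD_EARN_WNE_P10 = c(40000, 60000, 45000, NA),
  DEBT_MDN        = c(10000, 20000, 9000, 12000),
  stringsAsFactors = FALSE
)

# Same ROI proxy as in the report: median earnings divided by median debt
roi_demo$ROI_ratio <- roi_demo$MD_EARN_WNE_P10 / roi_demo$DEBT_MDN

# Sort from highest to lowest ROI; na.last = NA drops rows with a missing ratio
ranked <- roi_demo[order(roi_demo$ROI_ratio, decreasing = TRUE, na.last = NA), ]

print(head(ranked, 3))
```

A ratio can look high simply because median debt is very low, so a ranking like this is best read alongside the debt and earnings columns themselves.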


Descriptive Data Analysis

Summary Statistics

summary(college_clean)
##     INSTNM                        CONTROL     TUITIONFEE_IN      ADM_RATE    
##  Length:6429        Public            :2056   Min.   :  600   Min.   :0.000  
##  Class :character   Private Nonprofit :1953   1st Qu.: 5688   1st Qu.:0.604  
##  Mode  :character   Private For-Profit:2420   Median :11790   Median :0.779  
##                                               Mean   :17238   Mean   :0.728  
##                                               3rd Qu.:23186   3rd Qu.:0.908  
##                                               Max.   :69330   Max.   :1.000  
##                                               NA's   :2700    NA's   :4483   
##      C150_4         DEBT_MDN     MD_EARN_WNE_P10    ROI_ratio      
##  Min.   :0.000   Min.   : 1932   Min.   :  8579   Min.   : 0.6588  
##  1st Qu.:0.372   1st Qu.: 7000   1st Qu.: 31830   1st Qu.: 3.0765  
##  Median :0.525   Median : 9500   Median : 40568   Median : 3.8182  
##  Mean   :0.520   Mean   :11269   Mean   : 43508   Mean   : 4.3167  
##  3rd Qu.:0.671   3rd Qu.:15000   3rd Qu.: 51994   3rd Qu.: 5.1352  
##  Max.   :1.000   Max.   :38980   Max.   :143372   Max.   :20.8075  
##  NA's   :4157    NA's   :1146    NA's   :1149     NA's   :1615

Description:
We examine summary statistics to understand variable distributions and ranges.


Distribution of Tuition

ggplot(college_clean, aes(x = TUITIONFEE_IN)) +
  geom_histogram(bins = 30) +
  labs(title = "Distribution of In-State Tuition",
       x = "Tuition",
       y = "Frequency")
## Warning: Removed 2700 rows containing non-finite outside the scale range
## (`stat_bin()`).

Description:
This histogram shows how tuition is distributed across institutions.


Distribution of Earnings

ggplot(college_clean, aes(x = MD_EARN_WNE_P10)) +
  geom_histogram(bins = 30) +
  labs(title = "Distribution of Median Earnings (10 Years)",
       x = "Earnings",
       y = "Frequency")
## Warning: Removed 1149 rows containing non-finite outside the scale range
## (`stat_bin()`).

Description:
This histogram shows how post-graduation earnings vary across schools.


Correlation Matrix

numeric_vars <- college_clean %>%
  select(TUITIONFEE_IN, DEBT_MDN, MD_EARN_WNE_P10, ADM_RATE, C150_4)

cor_matrix <- cor(numeric_vars, use = "complete.obs")

corrplot(cor_matrix, method = "circle")

Description:
We calculate correlations to evaluate relationships between tuition, debt, earnings, admission rate, and completion rate.
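
With use = "complete.obs", cor() keeps only rows where every selected variable is present, which can shrink the sample considerably when some columns have many NAs. The sketch below, on invented data, counts the complete rows and contrasts this with use = "pairwise.complete.obs", which uses all rows available for each pair of variables.

```r
# Invented data with scattered missing values
vars_demo <- data.frame(
  tuition  = c(5000, 12000, 30000, 45000, 8000, NA),
  earnings = c(35000, 42000, 60000, 70000, NA, 50000),
  adm_rate = c(0.90, 0.80, 0.40, NA, 0.95, 0.60)
)

# Number of rows that "complete.obs" will actually use
n_complete <- sum(complete.cases(vars_demo))
print(n_complete)

# Listwise deletion: one common sample for every correlation
cor_complete <- cor(vars_demo, use = "complete.obs")

# Pairwise deletion: each correlation uses the rows available for that pair
cor_pairwise <- cor(vars_demo, use = "pairwise.complete.obs")

print(round(cor_complete, 2))
print(round(cor_pairwise, 2))
```

Listwise deletion keeps the matrix internally consistent; pairwise deletion uses more data but each cell may describe a different set of institutions.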


Research Question Analysis

Tuition vs Earnings

ggplot(college_clean, aes(x = TUITIONFEE_IN, y = MD_EARN_WNE_P10)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm") +
  labs(title = "Tuition vs Post-Graduation Earnings",
       x = "In-State Tuition",
       y = "Median Earnings (10 Years)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2957 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2957 rows containing missing values or values outside the scale range
## (`geom_point()`).

Description:
This scatterplot with regression line shows whether higher tuition is associated with higher earnings.


Debt vs Earnings

ggplot(college_clean, aes(x = DEBT_MDN, y = MD_EARN_WNE_P10)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm") +
  labs(title = "Debt vs Earnings",
       x = "Median Debt",
       y = "Median Earnings (10 Years)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1615 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1615 rows containing missing values or values outside the scale range
## (`geom_point()`).

Description:
This plot evaluates whether higher student debt is associated with higher earnings.


Institution Type vs Earnings

ggplot(college_clean, aes(x = CONTROL, y = MD_EARN_WNE_P10)) +
  geom_boxplot() +
  labs(title = "Earnings by Institution Type",
       x = "Institution Type",
       y = "Median Earnings (10 Years)")
## Warning: Removed 1149 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Description:
This boxplot compares the distribution of median earnings across public, private nonprofit, and private for-profit institutions.
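
The center line of each box is the group median, which can also be computed directly to put numbers on the comparison. A minimal sketch on invented data, using tapply() in base R:

```r
# Invented earnings by institution type
types <- c("Public", "Private Nonprofit", "Private For-Profit")
earn_demo <- data.frame(
  CONTROL = factor(
    c("Public", "Public", "Private Nonprofit", "Private Nonprofit",
      "Private For-Profit", "Private For-Profit"),
    levels = types
  ),
  MD_EARN_WNE_P10 = c(40000, 50000, 55000, NA, 30000, 34000)
)

# Median earnings per institution type, ignoring missing values
group_medians <- tapply(
  earn_demo$MD_EARN_WNE_P10,
  earn_demo$CONTROL,
  median,
  na.rm = TRUE
)

print(group_medians)
```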


Admission Rate vs Earnings

ggplot(college_clean, aes(x = ADM_RATE, y = MD_EARN_WNE_P10)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm") +
  labs(title = "Admission Rate vs Earnings",
       x = "Admission Rate",
       y = "Median Earnings (10 Years)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 4656 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 4656 rows containing missing values or values outside the scale range
## (`geom_point()`).

Description:
This plot shows whether more selective institutions (lower admission rates) are associated with higher earnings.


Regression Model

model <- lm(MD_EARN_WNE_P10 ~ TUITIONFEE_IN + DEBT_MDN + ADM_RATE + C150_4 + CONTROL,
            data = college_clean)

summary(model)
## 
## Call:
## lm(formula = MD_EARN_WNE_P10 ~ TUITIONFEE_IN + DEBT_MDN + ADM_RATE + 
##     C150_4 + CONTROL, data = college_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -48214  -5970  -1173   5268  79411 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3.657e+04  1.961e+03  18.649  < 2e-16 ***
## TUITIONFEE_IN              4.088e-01  3.189e-02  12.821  < 2e-16 ***
## DEBT_MDN                   3.918e-01  8.069e-02   4.855 1.32e-06 ***
## ADM_RATE                  -8.563e+03  1.540e+03  -5.561 3.16e-08 ***
## C150_4                     2.909e+04  2.127e+03  13.674  < 2e-16 ***
## CONTROLPrivate Nonprofit  -1.299e+04  9.845e+02 -13.199  < 2e-16 ***
## CONTROLPrivate For-Profit -1.472e+03  1.697e+03  -0.867    0.386    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11810 on 1539 degrees of freedom
##   (4883 observations deleted due to missingness)
## Multiple R-squared:  0.4465, Adjusted R-squared:  0.4443 
## F-statistic: 206.9 on 6 and 1539 DF,  p-value: < 2.2e-16

Description:
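We estimate a multiple linear regression of median earnings on tuition, median debt, admission rate, completion rate, and institution type, with public institutions as the reference category. The model is fit on the 1,546 institutions with complete data (4,883 rows are dropped for missingness) and explains about 45% of the variance in earnings (R-squared = 0.4465). Holding the other predictors fixed, tuition, debt, and completion rate have positive, statistically significant coefficients, while admission rate has a negative one. Private nonprofit institutions are estimated to earn about $12,990 less than otherwise comparable public institutions; the private for-profit difference is not statistically significant (p = 0.386). These are associations, not causal effects.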
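
To make the coefficient scale concrete, the estimates printed in the summary above can be plugged into a short calculation. The coefficients below are copied from that output; the institution profile (tuition, debt, admission rate, completion rate) is hypothetical and chosen only for illustration.

```r
# Coefficients copied from the regression summary above
coefs <- c(
  intercept         = 3.657e+04,
  tuition           = 4.088e-01,
  debt              = 3.918e-01,
  adm_rate          = -8.563e+03,
  c150              = 2.909e+04,
  private_nonprofit = -1.299e+04
)

# Change in predicted earnings for a $1,000 increase in tuition
tuition_effect_1000 <- coefs[["tuition"]] * 1000

# Change for a 10 percentage point increase in completion rate (0.10 on a 0-1 scale)
completion_effect_10pt <- coefs[["c150"]] * 0.10

# Hypothetical public institution (all four values are invented)
pred_public <- coefs[["intercept"]] +
  coefs[["tuition"]]  * 12000 +
  coefs[["debt"]]     * 10000 +
  coefs[["adm_rate"]] * 0.70 +
  coefs[["c150"]]     * 0.55

# Same profile at a private nonprofit: add the CONTROL coefficient
pred_nonprofit <- pred_public + coefs[["private_nonprofit"]]

print(tuition_effect_1000)
print(completion_effect_10pt)
print(pred_public)
print(pred_nonprofit)
```

With the fitted model object available, predict(model, newdata = ...) would give the same kind of prediction without retyping the coefficients.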


Conclusion

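This project explored the relationship between tuition, student debt, and post-graduation earnings using College Scorecard data.
In the regression, higher tuition, higher median debt, and higher completion rates are associated with higher earnings, while higher admission rates (less selective institutions) are associated with lower earnings; together with institution type, these variables explain about 45% of the variation in earnings. Because many institutions are missing one or more variables, these results describe the subset with complete data and should be read as associations rather than causal effects.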