Required packages: Tidyverse

library(tidyverse)

Bonus: Reading in the CSV file from my GitHub repository

csv_link <- 'https://raw.githubusercontent.com/pmahdi/cuny-bridge/main/causaldata_scorecard.csv'
college_data_original <- read_csv(file = csv_link)
head(college_data_original, 3)
## # A tibble: 3 × 9
##    ...1 unitid inst_name           state…¹ pred_…²  year earni…³ count…⁴ count…⁵
##   <dbl>  <dbl> <chr>               <chr>     <dbl> <dbl>   <dbl>   <dbl>   <dbl>
## 1     1 100654 Alabama A & M Univ… AL            3  2007   36600     116    1139
## 2     2 100663 University of Alab… AL            3  2007   40800     366    2636
## 3     3 100690 Amridge University  AL            3  2007      NA       6      25
## # … with abbreviated variable names ¹​state_abbr, ²​pred_degree_awarded_ipeds,
## #   ³​earnings_med, ⁴​count_not_working, ⁵​count_working

Question: Are students of institutions that predominantly award 4+ year degrees working and earning more than the students of other institutions?

Initial wranglings:

  • Dropping column 1 as an unnecessary index column
college_data <- college_data_original[-1]
  • Renaming pred_degree_awarded_ipeds to degrees_awarded for convenience and clarity
filt_degrees <- which(names(college_data) == 'pred_degree_awarded_ipeds')
names(college_data)[filt_degrees] <- 'degrees_awarded'
  • Changing the degrees_awarded column’s data type from double to character to match what that data represents
college_data$degrees_awarded <- as.character(college_data$degrees_awarded)
college_data[college_data$degrees_awarded == '1', 'degrees_awarded'] <- 'less than 2 yr'
college_data[college_data$degrees_awarded == '2', 'degrees_awarded'] <- '2 yr'
college_data[college_data$degrees_awarded == '3', 'degrees_awarded'] <- '4 yr or more'
  • Inserting new column representing the proportion of working students per institution for ease of comparison (raw student numbers vary greatly between institutions)
working_ratio <- college_data['count_working'] / (college_data['count_working'] + college_data['count_not_working'])
college_data['working_ratio'] <- working_ratio
  • Checking all the initial changes
head(college_data, 3)
## # A tibble: 3 × 9
##   unitid inst_name         state…¹ degre…²  year earni…³ count…⁴ count…⁵ worki…⁶
##    <dbl> <chr>             <chr>   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 100654 Alabama A & M Un… AL      4 yr o…  2007   36600     116    1139   0.908
## 2 100663 University of Al… AL      4 yr o…  2007   40800     366    2636   0.878
## 3 100690 Amridge Universi… AL      4 yr o…  2007      NA       6      25   0.806
## # … with abbreviated variable names ¹​state_abbr, ²​degrees_awarded,
## #   ³​earnings_med, ⁴​count_not_working, ⁵​count_working, ⁶​working_ratio

Data analysis:

  • Comparing overall median values to target group median values, with target group being the institutions offering 4+ year degrees
earnings_all <- median(college_data$earnings_med, na.rm = TRUE)
earnings_target_group <- median(college_data$earnings_med[college_data$degrees_awarded == '4 yr or more'], na.rm = TRUE)

working_ratio_all <- median(college_data$working_ratio, na.rm = TRUE)
working_ratio_target_group <- median(college_data$working_ratio[college_data$degrees_awarded == '4 yr or more'], na.rm = TRUE)

print(list('earnings_all' = earnings_all, 'earnings_target_group' = earnings_target_group, 'working_ratio_all' = working_ratio_all, 'working_ratio_target_group' = working_ratio_target_group))
## $earnings_all
## [1] 31600
## 
## $earnings_target_group
## [1] 41700
## 
## $working_ratio_all
## [1] 0.8304795
## 
## $working_ratio_target_group
## [1] 0.888587

Target group’s median values pertaining to both variables (earnings_med and working_ratio) are higher than the overall median values for those variables. This suggests that the target group (institutions predominantly offering 4+ year degrees) has higher values within the two columns.

  • Comparing the 3 groups more explicitly
result <- aggregate(cbind(earnings_med, working_ratio) ~ degrees_awarded, college_data, median)
names(result) <- c('degrees_awarded', 'median_earnings_med', 'median_working_ratio')
result <- result[c(3, 1, 2), ]
result
##   degrees_awarded median_earnings_med median_working_ratio
## 3  less than 2 yr               24200            0.7835052
## 1            2 yr               30900            0.8244955
## 2    4 yr or more               41800            0.8891667
  • Visual comparison for confirmation

Analyzing the relevant median values both mathematically and visually suggests that the answer to the question initially posed at the beginning of this report is a yes. Students who graduated from institutions offering mostly 4+ year degrees are indeed earning more money and working more than the students who graduated from institutions offering mostly shorter degrees. The disparity is much more pronounced in the case of earnings data, with many more high-value outliers visible in the target group’s boxplot than in the other groups’ boxplots.

A follow-up question: Is there any correlation between the variables earnings_med and working_ratio?

cor_earnings_working <- cor(x = college_data$earnings_med, y = college_data$working_ratio, method = 'pearson', use = 'pairwise.complete.obs')
cor_earnings_working
## [1] 0.7164122

A value of 0.7164122 indicates that there is positive correlation between the variables earnings_med and working_ratio.