R Bridge Course | Final Project

Required packages: Tidyverse

library(tidyverse)

Bonus: Reading in the CSV file from my GitHub repository

csv_link <- 'https://raw.githubusercontent.com/pmahdi/cuny-bridge/main/causaldata_scorecard.csv'
college_data_original <- read_csv(file = csv_link)
head(college_data_original, 3)

## # A tibble: 3 × 9
##    ...1 unitid inst_name           state…¹ pred_…²  year earni…³ count…⁴ count…⁵
##   <dbl>  <dbl> <chr>               <chr>     <dbl> <dbl>   <dbl>   <dbl>   <dbl>
## 1     1 100654 Alabama A & M Univ… AL            3  2007   36600     116    1139
## 2     2 100663 University of Alab… AL            3  2007   40800     366    2636
## 3     3 100690 Amridge University  AL            3  2007      NA       6      25
## # … with abbreviated variable names ¹state_abbr, ²pred_degree_awarded_ipeds,
## #   ³earnings_med, ⁴count_not_working, ⁵count_working

Question: Are students of institutions that predominantly award 4+ year degrees working and earning more than the students of other institutions?

Initial wranglings:

Dropping column 1 as an unnecessary index column

college_data <- college_data_original[-1]

Renaming pred_degree_awarded_ipeds to degrees_awarded for convenience and clarity

filt_degrees <- which(names(college_data) == 'pred_degree_awarded_ipeds')
names(college_data)[filt_degrees] <- 'degrees_awarded'

Changing the degrees_awarded column’s data type from double to character to match what that data represents

college_data$degrees_awarded <- as.character(college_data$degrees_awarded)
college_data[college_data$degrees_awarded == '1', 'degrees_awarded'] <- 'less than 2 yr'
college_data[college_data$degrees_awarded == '2', 'degrees_awarded'] <- '2 yr'
college_data[college_data$degrees_awarded == '3', 'degrees_awarded'] <- '4 yr or more'

Inserting new column representing the proportion of working students per institution for ease of comparison (raw student numbers vary greatly between institutions)

working_ratio <- college_data['count_working'] / (college_data['count_working'] + college_data['count_not_working'])
college_data['working_ratio'] <- working_ratio

Checking all the initial changes

head(college_data, 3)

## # A tibble: 3 × 9
##   unitid inst_name         state…¹ degre…²  year earni…³ count…⁴ count…⁵ worki…⁶
##    <dbl> <chr>             <chr>   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 100654 Alabama A & M Un… AL      4 yr o…  2007   36600     116    1139   0.908
## 2 100663 University of Al… AL      4 yr o…  2007   40800     366    2636   0.878
## 3 100690 Amridge Universi… AL      4 yr o…  2007      NA       6      25   0.806
## # … with abbreviated variable names ¹state_abbr, ²degrees_awarded,
## #   ³earnings_med, ⁴count_not_working, ⁵count_working, ⁶working_ratio

Data analysis:

Comparing overall median values to target group median values, with target group being the institutions offering 4+ year degrees

earnings_all <- median(college_data$earnings_med, na.rm = TRUE)
earnings_target_group <- median(college_data$earnings_med[college_data$degrees_awarded == '4 yr or more'], na.rm = TRUE)

working_ratio_all <- median(college_data$working_ratio, na.rm = TRUE)
working_ratio_target_group <- median(college_data$working_ratio[college_data$degrees_awarded == '4 yr or more'], na.rm = TRUE)

print(list('earnings_all' = earnings_all, 'earnings_target_group' = earnings_target_group, 'working_ratio_all' = working_ratio_all, 'working_ratio_target_group' = working_ratio_target_group))

## $earnings_all
## [1] 31600
## 
## $earnings_target_group
## [1] 41700
## 
## $working_ratio_all
## [1] 0.8304795
## 
## $working_ratio_target_group
## [1] 0.888587

Target group’s median values pertaining to both variables (earnings_med and working_ratio) are higher than the overall median values for those variables. This suggests that the target group (institutions predominantly offering 4+ year degrees) has higher values within the two columns.

Comparing the 3 groups more explicitly

result <- aggregate(cbind(earnings_med, working_ratio) ~ degrees_awarded, college_data, median)
names(result) <- c('degrees_awarded', 'median_earnings_med', 'median_working_ratio')
result <- result[c(3, 1, 2), ]
result

##   degrees_awarded median_earnings_med median_working_ratio
## 3  less than 2 yr               24200            0.7835052
## 1            2 yr               30900            0.8244955
## 2    4 yr or more               41800            0.8891667

Visual comparison for confirmation

Analyzing the relevant median values both mathematically and visually suggests that the answer to the question initially posed at the beginning of this report is a yes. Students who graduated from institutions offering mostly 4+ year degrees are indeed earning more money and working more than the students who graduated from institutions offering mostly shorter degrees. The disparity is much more pronounced in the case of earnings data, with many more high-value outliers visible in the target group’s boxplot than in the other groups’ boxplots.

A follow-up question: Is there any correlation between the variables earnings_med and working_ratio?

First, we must ensure that earnings_med and working_ratio are normally distributed. Only then can we calculate the Pearson correlation coefficient to ascertain the answer. Plotting frequency histograms for both variables allows us to check that they are normally distributed.

Both variables look to be normally distributed. Next, graphing a scatter plot for the two variables allows for identification of any problematic wild outliers.

The scatter plot is satisfactory for our purpose. Next, the Pearson correlation coefficient is calculated. Its value should be between -1 and 1. Its sign determines whether the variables increase proportionally or inversely to each other, while its absolute value determines the strength of the correlation. A value of 0 indicates no correlation and a value of 1 or -1 indicates perfect correlation.

cor_earnings_working <- cor(x = college_data$earnings_med, y = college_data$working_ratio, method = 'pearson', use = 'pairwise.complete.obs')
cor_earnings_working

## [1] 0.7164122

A value of 0.7164122 indicates that there is positive correlation between the variables earnings_med and working_ratio.

R Bridge Course | Final Project

Prinon Mahdi

2022-07-27

Bonus: Reading in the CSV file from my GitHub repository

Question: Are students of institutions that predominantly award 4+ year degrees working and earning more than the students of other institutions?

Initial wranglings:

Data analysis:

A follow-up question: Is there any correlation between the variables earnings_med and working_ratio?