The cost of a college education is a matter of great importance to thousands of U.S. high school students who embark on their journey to higher education each year. However, as the price of a college degree increases and the student debt crisis mounts, its return on investment coming into question. As a consequence, higher education institutions are seeing significant drops in enrollment as Gen Z becomes disillusioned.

Using data from the U.S. Department of Education’s College Scorecard, this analysis will explore the link between tuition rates, graduation rates, admission rate, and future earnings for thousands of colleges in the United States. Uncovering the relationships between these factors could be helpful for students who are exploring their college options and want the best opportunity for a return on their investment.

Data Source

The data is compiled by universities nationwide the U.S. Department of Education (USDE) and made available on their website. The file is titled “Most Recent Institution-Level Data.” Data is reported to the USDE for students who were eligible for some form of federal aid.

Link to data: https://collegescorecard.ed.gov/data

# Reading in the data
colleges <- read.csv(file = 'Most-Recent-Cohorts-Institution.csv')
# head(colleges)
# Install and reading in color library
#install.packages('viridis')
library('viridis')
## Warning: package 'viridis' was built under R version 4.2.3
## Loading required package: viridisLite
library(car)
## Loading required package: carData
# Retrieve the number of rows and columns
dim(colleges)
## [1] 6681 2989

Section 1

Is there a relationship between colleges’ and universities’ tuition and their 6-year graduation rate?

A college degree cannot help students earn a higher income if they do not graduate. Furthermore, failing to graduate from college after taking out student loans can have a negative impact on a student’s available income, making it harder to afford necessities like rent or groceries as they pay off their debt.

This section will examine whether universities that charge higher levels of tuition also have better graduation rates. It utilizes the 6-year graduation rate, which is generally considered in higher education to be the true average time it takes a student to complete their undergraduate studies rather than four years.

# Selecting only institution names, the cost of attendance and graduation rate
cost <- colleges[, c('INSTNM', 'COSTT4_A', 'C150_4')]
cost$COSTT4_A <- as.numeric(cost$COSTT4_A)
## Warning: NAs introduced by coercion
cost$C150_4 <- as.numeric(cost$C150_4)
## Warning: NAs introduced by coercion
colnames(cost) <- c('Institution Name', 'Cost of Attendance', 'Graduation Rate')
head(cost)
##                      Institution Name Cost of Attendance Graduation Rate
## 1            Alabama A & M University              23445          0.2866
## 2 University of Alabama at Birmingham              25542          0.6117
## 3                  Amridge University              20100          0.2500
## 4 University of Alabama in Huntsville              24861          0.5714
## 5            Alabama State University              21892          0.3177
## 6           The University of Alabama              30016          0.7214
tail(cost)
##                                           Institution Name Cost of Attendance
## 6676     Pennsylvania State University-Penn State Shenango                 NA
## 6677 Pennsylvania State University-Penn State Wilkes-Barre                 NA
## 6678         Pennsylvania State University-Penn State York                 NA
## 6679 Pennsylvania State University-Penn State Great Valley                 NA
## 6680   Pennsylvania State University-Penn State Harrisburg                 NA
## 6681   Pennsylvania State University-Penn State Brandywine                 NA
##      Graduation Rate
## 6676              NA
## 6677              NA
## 6678              NA
## 6679              NA
## 6680              NA
## 6681              NA
# Checking for NAs
sum(is.na(cost))
## [1] 7769
# Removing rows with NAs.
cost <- cost[complete.cases(cost), ]
sum(is.na(cost))
## [1] 0
# Checking the dimensions of the remaining rows
dim(cost)
## [1] 2139    3
head(cost)
##                      Institution Name Cost of Attendance Graduation Rate
## 1            Alabama A & M University              23445          0.2866
## 2 University of Alabama at Birmingham              25542          0.6117
## 3                  Amridge University              20100          0.2500
## 4 University of Alabama in Huntsville              24861          0.5714
## 5            Alabama State University              21892          0.3177
## 6           The University of Alabama              30016          0.7214
tail(cost)
##                                                     Institution Name
## 5977 Indiana Institute of Technology-College of Professional Studies
## 5996                                                Yeshiva of Ocean
## 5997         Congregation Talmidei Mesivta Tiferes Shmiel Aleksander
## 6007                             Yeshiva Gedolah of Woodlake Village
## 6046                                                  Urshan College
## 6106                               The Pennsylvania State University
##      Cost of Attendance Graduation Rate
## 5977              22827          0.1185
## 5996              14100          0.8333
## 5997              23300          0.1136
## 6007              14300          0.5000
## 6046              11054          0.1176
## 6106              33960          0.7281

We can see plainly that while the Graduation Rate almost normally distributed and slightly left skewed, the Cost of Attendance is skewed right. Those values are gathered around the lower end of the cost scale.

# Examining the data with histograms
hist(cost$`Cost of Attendance`)

hist(cost$`Graduation Rate`)

# Plotting the data shows a somewhat linear relationship
plot(cost$`Cost of Attendance`, cost$`Graduation Rate`, xlab="Cost of Attendance", ylab="Graduation Rate")  

cost.lm <- lm(cost$`Graduation Rate` ~ cost$`Cost of Attendance`)
summary(cost.lm)
## 
## Call:
## lm(formula = cost$`Graduation Rate` ~ cost$`Cost of Attendance`)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58940 -0.10638  0.00005  0.10263  0.61930 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2.784e-01  8.416e-03   33.08   <2e-16 ***
## cost$`Cost of Attendance` 7.005e-06  2.183e-07   32.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1754 on 2137 degrees of freedom
## Multiple R-squared:  0.3252, Adjusted R-squared:  0.3249 
## F-statistic:  1030 on 1 and 2137 DF,  p-value: < 2.2e-16
plot(cost$`Cost of Attendance`, cost$`Graduation Rate`, xlab="Cost of Attendance", ylab="Graduation Rate")  
abline(cost.lm)

Based on the coefficients and p-value, there appears to be a linear relationship between the cost of tuition and graduation rate where a $1 increase in tuition leads to a minuscule increase in the graduate rate. But a plot of residuals reveals that there is a non-constant error variance. Therefore, a log transformation is needed to see if the model can be improved.

par(mfrow = c(2, 2))
plot(cost.lm)

# Performing a log transformation on Cost of Attendance
plot(cost$`Graduation Rate` ~ log(cost$`Cost of Attendance`), xlab="Cost of Attendance", ylab="Graduation Rate")

cost_log.lm <- lm(cost$`Graduation Rate` ~ log(cost$`Cost of Attendance`))
summary(cost_log.lm)
## 
## Call:
## lm(formula = cost$`Graduation Rate` ~ log(cost$`Cost of Attendance`))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.60561 -0.10510  0.00225  0.11275  0.64335 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -1.790086   0.076905  -23.28   <2e-16 ***
## log(cost$`Cost of Attendance`)  0.223880   0.007445   30.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.179 on 2137 degrees of freedom
## Multiple R-squared:  0.2973, Adjusted R-squared:  0.297 
## F-statistic: 904.2 on 1 and 2137 DF,  p-value: < 2.2e-16
# Plotting the  model
plot(log(cost$`Cost of Attendance`), cost$`Graduation Rate`, xlab="Cost of Attendance", ylab="Graduation Rate")  
abline(cost_log.lm)

# Plotting the residuals
par(mfrow = c(2, 2))
plot(cost_log.lm)

Conclusion:

The log transformation has improved the model, and we an observe much more randomness in the plot of the residuals. We can say there is a linear correlation between universities’ tuition and graduation rates, where a 1 unit increase in tuition will lead to a .22% increase in graduation rate.

This is perhaps because universities that charge more money may have more student and academic support resources.

Section 2

Does a college’s region impact how much money students will earn?

High-cost and prestigious universities tend to be concentrated on the East and West Coasts, which could lead to the assumption that attending college in these areas will give students the best chance of later earning a high salary. This section will look at whether the geographic region is correlated with students’ starting salaries.

# Selecting only institution names, the cost of attendance, median earnings, region and graduation rate
geo_region <- colleges[, c('INSTNM', 'COSTT4_A', 'C150_4', 'MD_EARN_WNE_P6' , 'REGION')]
geo_region$COSTT4_A <- as.numeric(geo_region$COSTT4_A)
## Warning: NAs introduced by coercion
geo_region$C150_4 <- as.numeric(geo_region$C150_4)
## Warning: NAs introduced by coercion
geo_region$MD_EARN_WNE_P6 <- as.numeric(geo_region$MD_EARN_WNE_P6)
## Warning: NAs introduced by coercion
colnames(geo_region) <- c('Institution Name', 'Cost of Attendance', 'Graduation Rate', 'Median Earnings', 'Region')
head(geo_region)
##                      Institution Name Cost of Attendance Graduation Rate
## 1            Alabama A & M University              23445          0.2866
## 2 University of Alabama at Birmingham              25542          0.6117
## 3                  Amridge University              20100          0.2500
## 4 University of Alabama in Huntsville              24861          0.5714
## 5            Alabama State University              21892          0.3177
## 6           The University of Alabama              30016          0.7214
##   Median Earnings Region
## 1           28704      5
## 2           39271      5
## 3           32939      5
## 4           47533      5
## 5           25949      5
## 6           44485      5
tail(geo_region)
##                                           Institution Name Cost of Attendance
## 6676     Pennsylvania State University-Penn State Shenango                 NA
## 6677 Pennsylvania State University-Penn State Wilkes-Barre                 NA
## 6678         Pennsylvania State University-Penn State York                 NA
## 6679 Pennsylvania State University-Penn State Great Valley                 NA
## 6680   Pennsylvania State University-Penn State Harrisburg                 NA
## 6681   Pennsylvania State University-Penn State Brandywine                 NA
##      Graduation Rate Median Earnings Region
## 6676              NA              NA      2
## 6677              NA              NA      2
## 6678              NA              NA      2
## 6679              NA              NA      2
## 6680              NA              NA      2
## 6681              NA              NA      2
# Removing NAs
geo_region <- geo_region[complete.cases(geo_region), ]
sum(is.na(geo_region))
## [1] 0
head(geo_region)
##                      Institution Name Cost of Attendance Graduation Rate
## 1            Alabama A & M University              23445          0.2866
## 2 University of Alabama at Birmingham              25542          0.6117
## 3                  Amridge University              20100          0.2500
## 4 University of Alabama in Huntsville              24861          0.5714
## 5            Alabama State University              21892          0.3177
## 6           The University of Alabama              30016          0.7214
##   Median Earnings Region
## 1           28704      5
## 2           39271      5
## 3           32939      5
## 4           47533      5
## 5           25949      5
## 6           44485      5
tail(geo_region)
##                                                     Institution Name
## 5711                                               Husson University
## 5777                   Indiana Wesleyan University-National & Global
## 5814                                             Carolina University
## 5871                                     Purdue University Northwest
## 5974     Drury University-College of Continuing Professional Studies
## 5977 Indiana Institute of Technology-College of Professional Studies
##      Cost of Attendance Graduation Rate Median Earnings Region
## 5711              31303          0.5764           38215      1
## 5777              23591          0.3833           49124      3
## 5814              22830          0.4615           28038      5
## 5871              17408          0.4192           38320      3
## 5974              20867          0.3125           32347      4
## 5977              22827          0.1185           35502      3
dim(geo_region)
## [1] 2031    5
# Renaming columns
geo_regions <- geo_region[, c('Institution Name', 'Region', 'Median Earnings')]
geo_regions$Region[geo_regions$Region == 1] <- 'New England'
geo_regions$Region[geo_regions$Region == 2] <- 'Mid East'
geo_regions$Region[geo_regions$Region == 3] <- 'Great Lakes'
geo_regions$Region[geo_regions$Region == 4] <- 'Plains'
geo_regions$Region[geo_regions$Region == 5] <- 'Southeast'
geo_regions$Region[geo_regions$Region == 6] <- 'Southwest'
geo_regions$Region[geo_regions$Region == 7] <- 'Rocky Mountains'
geo_regions$Region[geo_regions$Region == 8] <- 'Far West'
geo_regions$Region[geo_regions$Region == 9] <- 'Outlying Areas'
head(geo_regions)
##                      Institution Name    Region Median Earnings
## 1            Alabama A & M University Southeast           28704
## 2 University of Alabama at Birmingham Southeast           39271
## 3                  Amridge University Southeast           32939
## 4 University of Alabama in Huntsville Southeast           47533
## 5            Alabama State University Southeast           25949
## 6           The University of Alabama Southeast           44485
# Calculating the median earnings per region 
group_region <- aggregate(geo_regions$`Median Earnings`, list(geo_regions$Region), median)
colnames(group_region) <- c("Region", "Median")
group_region <- group_region[-1,]
group_region$Region <- as.character(group_region$Region)
geo_region <- geo_region[complete.cases(geo_region), ]
group_region
##             Region  Median
## 2         Far West 41851.0
## 3      Great Lakes 39914.0
## 4         Mid East 43284.5
## 5      New England 44057.0
## 6   Outlying Areas 19917.0
## 7           Plains 39996.0
## 8  Rocky Mountains 35365.0
## 9        Southeast 34078.0
## 10       Southwest 35876.0
# Visualizing the results
barplot(group_region$Median, col=viridis(9), xlab = 'Regions', ylab = 'Student Median Earnings')
legend("bottomright", fill = viridis(9), legend = group_region$Region, xpd=TRUE, cex=.6, bg = 'white')

We can see that New England, Mid East, and Far West (which include the states New York and California) have the highest median earnings for students. The Southeast and Outlying Areas, which include poor states like Louisiana and U.S. territories like Guam, have the lowest median earnings for students.

# Getting correlations for all columns, excluding the names column
cor(geo_region[,-1])
##                    Cost of Attendance Graduation Rate Median Earnings
## Cost of Attendance          1.0000000       0.5934087       0.4936390
## Graduation Rate             0.5934087       1.0000000       0.5593289
## Median Earnings             0.4936390       0.5593289       1.0000000
## Region                     -0.2985036      -0.2278097      -0.2273482
##                        Region
## Cost of Attendance -0.2985036
## Graduation Rate    -0.2278097
## Median Earnings    -0.2273482
## Region              1.0000000
# Generating a heat map
heatmap(abs(cor(geo_region[,-1])), cexRow = .7, cexCol = .7)

The heatmap of correlations is not showing a strong relationship between Earnings and the Region of a college or university. However, it reveals a correlation between Median Earnings and Graduation Rate. This deserves further investigation.

Section 3

Is there a correlation between colleges’ Graduation Rate and students’ Median Earnings?

These histograms show that Graduation Rate is fairly normally distributed, but Median Earnings is right-skewed. This may call for a log transformation.

hist(geo_region$`Graduation Rate`)

hist(geo_region$`Median Earnings`)

# Plotting the data without log regression
earnings_grad.lm <- lm(geo_regions$`Median Earnings` ~ geo_region$`Graduation Rate`)
plot(geo_region$`Graduation Rate`, geo_regions$`Median Earnings`)
abline(earnings_grad.lm)

# Residuals without log regression
par(mfrow = c(2, 2))
plot(earnings_grad.lm)

# Plot with log regression on 
earnings_grad.lm <- lm(log(geo_regions$`Median Earnings`) ~ geo_region$`Graduation Rate`)
plot(geo_region$`Graduation Rate`, log(geo_regions$`Median Earnings`))
abline(earnings_grad.lm)

par(mfrow = c(2, 2))
plot(earnings_grad.lm)

summary(earnings_grad.lm)
## 
## Call:
## lm(formula = log(geo_regions$`Median Earnings`) ~ geo_region$`Graduation Rate`)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.30847 -0.10641  0.01391  0.12497  1.19952 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  10.13555    0.01456  695.95   <2e-16 ***
## geo_region$`Graduation Rate`  0.78816    0.02591   30.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2426 on 2029 degrees of freedom
## Multiple R-squared:  0.3132, Adjusted R-squared:  0.3129 
## F-statistic: 925.3 on 1 and 2029 DF,  p-value: < 2.2e-16

Conclusion

While the model isn’t perfect, we do see some correlation between Median Earnings and Graduation Rate. As the Graduation Rate increases by 1 percent, Median Earnings will increase by 0.79 units.

Section 4

Are tech graduates really getting high salaries?

People think that an IT background earns college graduates a lot a high starting salary. We did a regression analysis on the this topic to find out if a correlation exists between having a STEM degree and high earnings.

We took the percentage of students who were awarded degrees in all kinds of technology programs (including Computer Engineering, Communication Technology, Science and Technology) from all universities in the data set. Additional data includes the median earnings after 10 years and the cost of attendance of every university, categorized based on whether their cost of attendance was above or below the mean tuition rate.

What we found is interesting! The increased percentage of technology graduates in U.S. doesn’t account for the increase in median earnings 10 years after graduation.

This shows that IT graduates are not really getting paid higher salaries than their peers!

# Subsetting the data
earningsTech <- colleges[, c('INSTNM', 'COSTT4_A', 'MD_EARN_WNE_P10', 'PCIP10','PCIP15','PCIP41')]
earningsTech$COSTT4_A <- as.numeric(earningsTech$COSTT4_A)
## Warning: NAs introduced by coercion
earningsTech$MD_EARN_WNE_P10 <- as.numeric(earningsTech$MD_EARN_WNE_P10)
## Warning: NAs introduced by coercion
earningsTech$PCIP10 <- as.numeric(earningsTech$PCIP10)
## Warning: NAs introduced by coercion
earningsTech$PCIP15 <- as.numeric(earningsTech$PCIP15)
## Warning: NAs introduced by coercion
earningsTech$PCIP41 <- as.numeric(earningsTech$PCIP41)
## Warning: NAs introduced by coercion
colnames(earningsTech) <- c('Institution Name', 'Cost of Attendance', 'Earnings', 'CommTech', 'EngTech', 'SciTech')
head(earningsTech)
##                      Institution Name Cost of Attendance Earnings CommTech
## 1            Alabama A & M University              23445    36339   0.0393
## 2 University of Alabama at Birmingham              25542    46990   0.0000
## 3                  Amridge University              20100    37895   0.0000
## 4 University of Alabama in Huntsville              24861    54361   0.0000
## 5            Alabama State University              21892    32084   0.0000
## 6           The University of Alabama              30016    52751   0.0000
##   EngTech SciTech
## 1  0.0154   0e+00
## 2  0.0000   4e-04
## 3  0.0000   0e+00
## 4  0.0000   0e+00
## 5  0.0000   0e+00
## 6  0.0000   0e+00
#Removing Na's
earningsTech <- earningsTech[complete.cases(earningsTech), ]
sum(is.na(earningsTech))
## [1] 0
#Combined the percentage of all Tech related degrees. 
earningsTech$TotalTech <- earningsTech$CommTech + earningsTech$EngTech + earningsTech$SciTech



#log transformation for the percentage of degrees 
earningsTech$TotalTech <- log(earningsTech$TotalTech)

#Removing infinite values after log transformation.
for (i in 1:nrow(earningsTech)){
  if(is.infinite(earningsTech$TotalTech[i])){
    earningsTech <- earningsTech[-i,]
  }
}

for (i in 1:nrow(earningsTech)){
  if(is.infinite(earningsTech$TotalTech[i])){
    earningsTech <- earningsTech[-i,]
  }
}
for (i in 1:nrow(earningsTech)){
  if(is.infinite(earningsTech$TotalTech[i])){
    earningsTech <- earningsTech[-i,]
  }
}

for (i in 1:nrow(earningsTech)){
  if(is.infinite(earningsTech$TotalTech[i])){
    earningsTech <- earningsTech[-i,]
  }
}

for (i in 1:nrow(earningsTech)){
  if(is.infinite(earningsTech$TotalTech[i])){
    earningsTech <- earningsTech[-i,]
  }
}




# Categorizes universities as having a high or not high tuition rate
earningsTech$Inst_Range <- ifelse(earningsTech$`Cost of Attendance` >=
    mean(earningsTech$`Cost of Attendance`), "AboveAvg", "BelowAvg")

The regression analysis begins here:

#Creating a model
techModel <- lm(earningsTech$Earnings ~ earningsTech$TotalTech + earningsTech$`Cost of Attendance`+earningsTech$Inst_Range)

#Full Model
techmodel1 <- lm(earningsTech$Earnings ~ earningsTech$TotalTech + earningsTech$`Cost of Attendance`+earningsTech$Inst_Range + earningsTech$TotalTech: earningsTech$Inst_Range)
#Summary of the Model
summary(techModel)
## 
## Call:
## lm(formula = earningsTech$Earnings ~ earningsTech$TotalTech + 
##     earningsTech$`Cost of Attendance` + earningsTech$Inst_Range)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -26098  -4800   -504   4084  47282 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        3.056e+04  1.159e+03  26.363  < 2e-16 ***
## earningsTech$TotalTech            -1.115e+03  1.783e+02  -6.252 5.42e-10 ***
## earningsTech$`Cost of Attendance`  4.146e-01  2.631e-02  15.761  < 2e-16 ***
## earningsTech$Inst_RangeBelowAvg   -3.889e+03  7.296e+02  -5.331 1.14e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8300 on 1359 degrees of freedom
## Multiple R-squared:  0.4175, Adjusted R-squared:  0.4163 
## F-statistic: 324.7 on 3 and 1359 DF,  p-value: < 2.2e-16
summary(techmodel1)
## 
## Call:
## lm(formula = earningsTech$Earnings ~ earningsTech$TotalTech + 
##     earningsTech$`Cost of Attendance` + earningsTech$Inst_Range + 
##     earningsTech$TotalTech:earningsTech$Inst_Range)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -25309  -4979   -574   4164  48295 
## 
## Coefficients:
##                                                          Estimate Std. Error
## (Intercept)                                             2.863e+04  1.363e+03
## earningsTech$TotalTech                                 -1.653e+03  2.686e+02
## earningsTech$`Cost of Attendance`                       4.137e-01  2.625e-02
## earningsTech$Inst_RangeBelowAvg                        -4.855e+02  1.467e+03
## earningsTech$TotalTech:earningsTech$Inst_RangeBelowAvg  9.574e+02  3.582e+02
##                                                        t value Pr(>|t|)    
## (Intercept)                                             21.009  < 2e-16 ***
## earningsTech$TotalTech                                  -6.153    1e-09 ***
## earningsTech$`Cost of Attendance`                       15.762  < 2e-16 ***
## earningsTech$Inst_RangeBelowAvg                         -0.331  0.74072    
## earningsTech$TotalTech:earningsTech$Inst_RangeBelowAvg   2.673  0.00761 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8282 on 1358 degrees of freedom
## Multiple R-squared:  0.4206, Adjusted R-squared:  0.4189 
## F-statistic: 246.4 on 4 and 1358 DF,  p-value: < 2.2e-16
pairs(earningsTech$Earnings ~ earningsTech$TotalTech + earningsTech$`Cost of Attendance`)

# Partial F-test
anova(techModel, techmodel1)
## Analysis of Variance Table
## 
## Model 1: earningsTech$Earnings ~ earningsTech$TotalTech + earningsTech$`Cost of Attendance` + 
##     earningsTech$Inst_Range
## Model 2: earningsTech$Earnings ~ earningsTech$TotalTech + earningsTech$`Cost of Attendance` + 
##     earningsTech$Inst_Range + earningsTech$TotalTech:earningsTech$Inst_Range
##   Res.Df        RSS Df Sum of Sq      F   Pr(>F)   
## 1   1359 9.3626e+10                                
## 2   1358 9.3136e+10  1 489949973 7.1439 0.007612 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Plotting the Model
plot(techModel)

abline(techModel)
## Warning in abline(techModel): only using the first two of 4 regression
## coefficients

#Fitted Values Plot
plot(techModel$fitted.values, earningsTech$Earnings, main = "Fitted Values vs Earnings")
abline(lm(earningsTech$Earnings ~techModel$fitted.values))

residualPlots(techModel)

##                                   Test stat Pr(>|Test stat|)  
## earningsTech$TotalTech              -0.4436           0.6574  
## earningsTech$`Cost of Attendance`    1.2005           0.2302  
## Tukey test                           2.0663           0.0388 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Variance Inflation Factors
vif(techModel)
##            earningsTech$TotalTech earningsTech$`Cost of Attendance` 
##                          1.004803                          2.260711 
##           earningsTech$Inst_Range 
##                          2.256711
#Added Value plots
avPlots(techModel)

Conclusion

The partial F-test shows the interaction terms does not have any significance, leading us to move forward with the reduced model. From this model, universities charging below-average tuition rates also have students who earn lower salaries compared to their peers at high-tuition universities.

However, our analysis does not support the idea that the tech graduates in the U.S. are more highly paid than their peers.

Section 5

How is the admission rate correlated with open admission policy of a college?

In this section, we explore whether having an open admission policy impacts the admission rate of an institution. If a college allows anyone to enroll, we might expect that its admission rate is higher than a more selective institution.

library("ggpubr")
## Warning: package 'ggpubr' was built under R version 4.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.3
# Replacing nulls with 0
o<-colleges$OPENADMP
o[o == "NULL"] <- 0
OPENADMP<-o
df<-data.frame(OPENADMP)

# Replacing nulls with 0
m<-colleges$ADM_RATE
m[m == "NULL"] <- 0
df['ADM_RATE']=m


##Spearman correlation coefficient
res2 <-cor.test(as.numeric(df$ADM_RATE), as.numeric(df$OPENADMP),  method = "spearman")
## Warning in cor.test.default(as.numeric(df$ADM_RATE), as.numeric(df$OPENADMP), :
## Cannot compute exact p-value with ties
res2
## 
##  Spearman's rank correlation rho
## 
## data:  as.numeric(df$ADM_RATE) and as.numeric(df$OPENADMP)
## S = 6947130117, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8602241

Conclusion

Based on the Spearman correlation test , the p-value is < 2.2e-16, which is less than the significance level alpha = 0.05. We can conclude that ADM_RATE and OPENADMP are significantly correlated with a correlation coefficient of 0.86 and p-value of < 2.2e-16.

Section 6

Is tution rate correlated with admission rate of a college?

In this section, we explore whether there is a relationship between the exclusivity of a college and how much it costs to attend.

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(df1$TUITIONFEE_IN) and as.numeric(df1$ADM_RATE)
## t = 52.967, df = 6679, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5267615 0.5605403
## sample estimates:
##       cor 
## 0.5438711

Conclusion

Based on Pearson’s correlation test, the p-value is < 2.2e-16, which is less than the significance level alpha = 0.05. We can conclude that TUITIONFEE_IN and ADM_RATE are significantly correlated with a correlation coefficient of 0.54 and p-value of < 2.2e-16.

Our Overall Findings

Tuition and Graduation Rate: We did find a correlation between these two variables. Universities that charge more for tuition also tend to have better graduation rates, which means more students are finishing their studies. It’s possible that these institutions have more resources to support student success.

Region and Earnings: We did not find a strong relationship between where a student graduates and their future salaries. This debunks any misconception students might have that going specifically to the East or West Coasts for college would in itself lead to higher earnings.

Earnings and Graduation Rate: We did find a relationship between universities’ graduation rates and how much their students go on to earn. It stands to reason that if more students complete their studies, they are earning more money.

STEM and Earnings: We did not find a strong correlation between earning a STEM degree and making higher earnings than students who graduated with different degrees. However, we did find a relationship between tuition rates and earnings. Universities that charge more money for attendance also tend to have graduates who earn higher salaries.

Admission Rate and Open Admission Policy: We found a correlation between these variables. Universities with Open Admission Policies do in fact enroll more students. This leads us to believe that more exclusive universities would have smaller overall student bodies.

Admission Rate and Tuition: We did find a correlation between the exclusivity of a university and its tuition rate. The more selective a university is, the higher its tuition rate tends to be.