Introduction

This analysis uses a dataset from the Department of Education covering all higher education institutions in 2023 to examine the factors that affect graduation rates. The dataset contains 6,352 rows and 78 columns, with information including school names, tuition fees, geographical location, demographic proportions, SAT averages, admission rates, and financial aid statistics. Preparation involved simplifying and renaming columns for readability and removing columns that were not relevant to this exploration. Rows with missing data (NAs) were removed before each analysis question was explored rather than when the dataset was first prepared, so that no observations relevant to the variable being studied were discarded unnecessarily. New data frames and columns were also created for this analysis.

In this assignment, the focus is on consolidating theoretical knowledge into practical skills with the application of Ridge and LASSO regression techniques. The goal is to build linear and logistic models by implementing Ridge and LASSO functions over a range of regularization parameter lambda values. The assignment entails the creation of two regularized models: a regression model predicting Completion Rate (C150_4_L4), and a logistic regression model predicting Completion Rate Median and Completion Rate High columns.

Ridge and LASSO are regularization techniques that reduce overfitting in statistical models by adding a penalty term to the cost function. Ridge regression adds a penalty proportional to the sum of the squared coefficients, while LASSO adds a penalty proportional to the sum of the absolute values of the coefficients. The main practical difference follows from the penalty term: Ridge shrinks coefficients towards zero without setting them exactly to zero, while LASSO can set some coefficients to exactly zero, effectively performing variable selection.
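For reference, the two penalized least-squares objectives can be written as below. The notation is introduced here for illustration and is an assumption rather than something taken from the assignment: y_i is the response, x_i the vector of p predictors, beta_0 the intercept, beta_j the coefficients, and lambda >= 0 the regularization parameter.

```latex
\hat{\beta}^{\text{ridge}}
  = \arg\min_{\beta_0,\,\beta}\;
    \sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
    + \lambda \sum_{j=1}^{p} \beta_j^{2}

\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta_0,\,\beta}\;
    \sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The intercept is not penalized in either case. Software may scale the loss term differently (glmnet, for example, divides the residual sum of squares by 2n), so lambda values are only comparable within one parameterization.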

Additionally, outliers are identified and removed using Mahalanobis Distance and Local Outlier Factor methods. Ridge regression is then applied, estimating lambda.min and lambda.1se values and fitting the model against the training set to report interesting findings and performance metrics. Similarly, LASSO regression is applied to the training set, with focus on reporting coefficients and identifying any coefficients that reduce to zero.

By comparing the performance of the Ridge and LASSO models, insights can be gained into which regularization technique better suits the dataset and the predictive task at hand. This comparison will shed light on the effectiveness of each method and whether the outcomes align with expectations.

# Week 5

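# Importing and preparing the dataset 

# Packages providing the functions called in this report
# (assumed; the chunk that originally loaded them is not shown)
library(readr)    # read_csv()
library(dplyr)    # %>%
library(tidyr)    # drop_na()
library(janitor)  # clean_names()
library(skimr)    # skim()
library(Amelia)   # missmap()
library(dbscan)   # lof()
library(ggplot2)  # ggplot()
library(knitr)    # kable()
library(caret)    # createDataPartition()
library(broom)    # tidy()
library(rempsyc)  # nice_table()
library(glmnet)   # cv.glmnet()

CollegeDataset <- read_csv("college_scorecard_Short1.csv")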
## Rows: 6352 Columns: 78
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): INSTNM, CITY, STABBR, Completion_Rate_Median, Completion_Rate_High...
## dbl (65): SCHTYPE, ICLEVEL, REGION, LOCALE, C150_4_L4, NUMBRANCH, PREDDEG, H...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
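# Clean names
# Note: clean_names() returns a renamed copy. The result is only printed, not assigned,
# so CollegeDataset keeps its original column names (e.g. C150_4_L4), which the code below uses.

clean_names(CollegeDataset)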

# Column Names

colnames(CollegeDataset)

# Learn about the dataset

dim(CollegeDataset) # checking dimensions of the dataset

summary(CollegeDataset)

# Checking the structure of the dataset 

str(CollegeDataset)

skim(CollegeDataset)

# Checking for NAs in columns

colSums(is.na(CollegeDataset))
# Various columns have a decent amount of NAs

# Removing NAs from response (dependent) variables 

# Completion_Rate

CollegeDataset <- CollegeDataset %>% drop_na(C150_4_L4)

# Completion_Rate_Median

CollegeDataset <- CollegeDataset %>% drop_na(Completion_Rate_Median)

# Completion_Rate_High

CollegeDataset <- CollegeDataset %>% drop_na(Completion_Rate_High)

# Checking dimensions now after NA removal

dim(CollegeDataset)

Variable Selection

To ensure the robustness and efficiency of the analysis, a systematic process of variable selection was conducted on the dataset obtained from the Department of Education. Initially, a comprehensive missing data report (Figure 1) was generated to understand the extent of missing values in the dataset. After removing NAs from the three response variables C150_4_L4, Completion Rate Median, and Completion Rate High, the dataset decreased to 5242 observations with 78 columns.

To streamline the dataset while retaining its informative value, columns with more than 300 missing values were eliminated, resulting in the removal of 21 columns. This step reduced the dataset to 5242 rows and 57 columns, minimizing the impact of missing data on subsequent analyses. Further refinement was undertaken by selecting columns based on their relevance to the analysis objectives and the least amount of missing data. This process resulted in the creation of a subset, CollegeDataset2, containing 5242 observations across 13 carefully chosen columns. These columns include essential predictor variables such as enrollment demographics (UG25ABV, UGDS, UGDS_WHITE, UGDS_BLACK, UGDS_HISP, and UGDS_ASIAN), metrics related to financial aid (PCTPELL and PCTFLOAN), and various institutional characteristics (HIGHDEG and NUMBRANCH).

# ------------------------------------
# Generate a missing data report
# ------------------------------------

missmap(CollegeDataset, main = "Figure 1: Missing Values vs Observed")

# Currently the CollegeDataset has 5242 rows and 78 columns

# ------------------------------------
# List columns with most missing values
# ------------------------------------

sort(colSums(is.na(CollegeDataset)), decreasing = TRUE)

# SAT_AVG             ADM_RATE                ADMCON7           ROOMBOARD_ON 
# 4219                   3385                   3373                   3313 
# OTHEREXPENSE_ON     ENDOWBEGIN               ENDOWEND               PFTFAC 
# 3310                   2856                   2856                   2095 
# BOOKSUPPLY          TUITIONFEE_IN         TUITIONFEE_OUT          AVGFACSAL 
# 2027                   1982                   1982                   1833 
# MARRIED          MEDIAN_HH_INC           POVERTY_RATE             UNEMP_RATE 
# 925                    904                    904                    904 
# FEMALE              FIRST_GEN              DEPENDENT        MD_EARN_WNE_1YR 
# 897                    741                    509                    429 
# RET_FT_4_L4         AGE_ENTRY                 FAMINC             COSTT4_A_P 
# 397                    225                    225                    212 
# Num4_PUB_PRIV      PAR_ED_PCT_1STGEN          PAR_ED_PCT_MS          PAR_ED_PCT_HS 
# 208                    207                    207                    207 
# DEP_INC_AVG         APPL_SCH_N              IRPS_2MOR             IRPS_ASIAN 
# 207                    207                    123                    123 
# IRPS_BLACK          IRPS_HISP             IRPS_WHITE             IRPS_WOMEN 
# 123                    123                    123                    123 
# IRPS_MEN              UG25ABV          GRAD_DEBT_MDN         WDRAW_DEBT_MDN 
# 123                     81                     50                     50 
# PCTPELL             PCTFLOAN                SCHTYPE                STUFACR 
# 8                      8                      7                      4 
# UGDS             UGDS_WHITE             UGDS_BLACK              UGDS_HISP 
# 2                      2                      2                      2 
# UGDS_ASIAN          PCIP09                 PCIP11                 PCIP13 
# 2                     1                      1                      1 
# PCIP27              PCIP40                 PCIP41                 PCIP42 
# 1                      1                      1                      1 
# PCIP51                PCIP52                TUITFTE               INEXPFTE 
# 1                      1                      1                      1 

# ------------------------------------
# Eliminate columns with more than 300 missing values
# Note that the cutoff of 300 is somewhat arbitrary: it removes the columns with
# many NAs without losing too much valuable data
# ------------------------------------

CollegeDataset = CollegeDataset[, colSums(is.na(CollegeDataset)) < 300]

# dim(CollegeDataset) # 5242 by 57 columns meaning that 21 columns were removed

sort(colSums(is.na(CollegeDataset)), decreasing = TRUE)

# These are the remaining columns
# AGE_ENTRY            FAMINC             COSTT4_A_P          Num4_PUB_PRIV 
# 225                    225                    212                    208 
# PAR_ED_PCT_1STGEN   PAR_ED_PCT_MS          PAR_ED_PCT_HS            DEP_INC_AVG 
# 207                    207                    207                    207 
# APPL_SCH_N            IRPS_2MOR             IRPS_ASIAN             IRPS_BLACK 
# 207                    123                    123                    123 
# IRPS_HISP             IRPS_WHITE             IRPS_WOMEN               IRPS_MEN 
# 123                    123                    123                    123 
# UG25ABV          GRAD_DEBT_MDN         WDRAW_DEBT_MDN                PCTPELL 
# 81                     50                     50                      8 
# PCTFLOAN                SCHTYPE                STUFACR              UGDS 
# 8                      7                      4                      2 
# UGDS_WHITE             UGDS_BLACK              UGDS_HISP             UGDS_ASIAN 
# 2                      2                      2                      2 
# PCIP09                 PCIP11                 PCIP13                 PCIP27 
# 1                      1                      1                      1 
# PCIP40                 PCIP41                 PCIP42                 PCIP51 
# 1                      1                      1                      1 
# PCIP52                TUITFTE               INEXPFTE                ICLEVEL 
# 1                      1                      1                      0 
# INSTNM                 REGION                 LOCALE                   CITY 
# 0                      0                      0                      0 
# STABBR              C150_4_L4 Completion_Rate_Median   Completion_Rate_High 
# 0                      0                      0                      0 
# ZIP              NUMBRANCH                PREDDEG                HIGHDEG 
# 0                      0                      0                      0 
# CCBASIC               CCUGPROF               CCSIZSET                   HBCU 
# 0                      0                      0                      0 
# DISTANCEONLY 
# 0 

# colnames(CollegeDataset)

# str(CollegeDataset)

# ------------------------------------
# Keep the following columns and create a subset
# This is the first pass of shortening the dataframe
# Choose predictor variables based on the areas of interest and least amount of NAs 
# ------------------------------------

CollegeDataset2 = CollegeDataset[, c("C150_4_L4","Completion_Rate_Median",
                                     "Completion_Rate_High",
                                    "NUMBRANCH",
                                    "UG25ABV",
                                    "HIGHDEG",
                                    "UGDS",             
                                    "UGDS_WHITE",
                                    "UGDS_BLACK",
                                    "UGDS_HISP",             
                                    "UGDS_ASIAN",
                                    "PCTPELL",
                                    "PCTFLOAN")]

# Convert HIGHDEG numbers to categories

CollegeDataset2$HIGHDEG <- factor(CollegeDataset2$HIGHDEG, levels = 0:4, 
                                  labels = c("Non-Degree-Granting", "Certificate Degree", 
                                             "Associate Degree", "Bachelors Degree", 
                                             "Graduate Degree"))

dim(CollegeDataset2) # 5242 observations with 13 columns

sort(colSums(is.na(CollegeDataset2)), decreasing = TRUE)

# UG25ABV                PCTPELL               PCTFLOAN                   UGDS 
# 81                     8                      8                      2 
# UGDS_WHITE             UGDS_BLACK              UGDS_HISP             UGDS_ASIAN 
# 2                      2                      2                      2 
# C150_4_L4 Completion_Rate_Median   Completion_Rate_High              NUMBRANCH 
# 0                      0                      0                      0 
# HIGHDEG 
# 0 
                                      
# remove missing values

CollegeDataset2 = na.omit(CollegeDataset2)

# dim(CollegeDataset2) # 5154 observations with 13 columns

To ensure interpretability and ease of analysis, certain categorical variables were transformed into factor variables. Notably, the variable HIGHDEG, representing the highest degree awarded by institutions, was categorized into meaningful labels: “Non-Degree-Granting”, “Certificate Degree”, “Associate Degree”, “Bachelors Degree”, and “Graduate Degree.”

After variable selection and transformation, the resulting dataset, CollegeDataset2, comprised 5154 observations with 13 columns. This selection process keeps the dataset complete and relevant for the analyses that follow, which examine the factors influencing graduation rates in higher education institutions.

Removing Outliers

To initiate this analysis, outliers were identified and removed from the dataset using two distinct methods: Mahalanobis Distance and Local Outlier Factor (LOF). These methods provide valuable insights into data points that deviate significantly from the majority, assisting in the creation of robust statistical models.

Mahalanobis Distance

The Mahalanobis Distance method detects outliers based on the calculated distance of each data point from the centroid of the dataset, accounting for the covariance structure of the variables. For this analysis, the numeric columns relevant to the study, namely C150_4_L4, NUMBRANCH, UG25ABV, UGDS, UGDS_WHITE, PCTPELL, PCTFLOAN, UGDS_BLACK, UGDS_HISP, and UGDS_ASIAN, were selected.
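As a minimal sketch of the distance described above, the example below computes it by hand on a small simulated matrix and compares the result with base R's mahalanobis(). The data and the names X, mu, and S are hypothetical and only for illustration; they are not the college columns. Note that mahalanobis() returns the squared distance, so a percentile cutoff applies to that squared scale.

```r
# Simulated two-column data (hypothetical, not the college dataset)
set.seed(42)
X <- cbind(a = rnorm(50), b = rnorm(50, sd = 3))

mu <- colMeans(X)  # centroid of the data
S  <- cov(X)       # covariance matrix of the columns

# Squared distance by hand: (x - mu)' S^{-1} (x - mu) for each row
centered  <- sweep(X, 2, mu)
d2_manual <- rowSums((centered %*% solve(S)) * centered)

# Same quantity from base R (also on the squared scale)
d2_builtin <- mahalanobis(X, mu, S)

# Flag rows above the 95th percentile, mirroring the rule used in this report
flag <- d2_builtin > quantile(d2_builtin, 0.95)
```

Because the cutoff is a percentile of the distances themselves, roughly 5% of rows are flagged by construction, whatever the shape of the data.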

# print types of columns in college dataset 2

# sapply(CollegeDataset2, class) # Not all numeric

# Use mahalanobis distance to detect outliers, based on the following numeric columns:
# -------------------------------------------------------------------------
# "C150_4_L4"   "NUMBRANCH"   "UGDS" "UGDS_WHITE"  "PCTPELL" "PCTFLOAN"
# "UGDS_BLACK"  "UGDS_HISP"             "UGDS_ASIAN"  "UG25ABV"             
# -------------------------------------------------------------------------

listofcols = c("C150_4_L4","NUMBRANCH", "UG25ABV", "UGDS", "UGDS_WHITE",
               "PCTPELL","PCTFLOAN","UGDS_BLACK","UGDS_HISP","UGDS_ASIAN")

# check if the columns in the list are numeric

# sapply(CollegeDataset2[, listofcols], class) #All numeric

# Use mahalanobis distance to detect outliers

outliers = mahalanobis(CollegeDataset2[, listofcols], 
                       colMeans(CollegeDataset2[, listofcols]), 
                       cov(CollegeDataset2[, listofcols]))
# print(outliers)

# Use the quantile function to find the 95th percentile of the mahalanobis distance
# The ones above this value are the outliers

Outlier_Threshold = quantile(outliers, 0.95)

# print(Outlier_Threshold) # 26.05974

# print the outliers

# dim(CollegeDataset2[outliers > Outlier_Threshold, ]) # 258 outliers

# Create a column to identify the outliers

CollegeDataset2$Outliers_Maha = ifelse(outliers > Outlier_Threshold, 1, 0)

# Print first 6 rows

Outlier_Maha_College2_6 <- head(CollegeDataset2)

# Present as a nice table

knitr::kable(Outlier_Maha_College2_6, caption = 
               "College Dataset w/ Outliers (Mahalanobis Distance)")
College Dataset w/ Outliers (Mahalanobis Distance)

| C150_4_L4 | Completion_Rate_Median | Completion_Rate_High | NUMBRANCH | UG25ABV | HIGHDEG | UGDS | UGDS_WHITE | UGDS_BLACK | UGDS_HISP | UGDS_ASIAN | PCTPELL | PCTFLOAN | Outliers_Maha |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.2807 | Below Median | Other | 1 | 0.0617 | Graduate Degree | 5098 | 0.0184 | 0.8978 | 0.0114 | 0.0014 | 0.6853 | 0.6552 | 0 |
| 0.6245 | Above Median | Other | 1 | 0.1794 | Graduate Degree | 13284 | 0.5297 | 0.2458 | 0.0669 | 0.0767 | 0.3253 | 0.4401 | 0 |
| 0.4444 | Below Median | Other | 1 | 0.8606 | Graduate Degree | 251 | 0.2470 | 0.6932 | 0.0438 | 0.0000 | 0.7852 | 0.8423 | 0 |
| 0.6072 | Above Median | Other | 1 | 0.1519 | Graduate Degree | 7358 | 0.7196 | 0.0871 | 0.0610 | 0.0357 | 0.2377 | 0.3578 | 0 |
| 0.2843 | Below Median | Other | 1 | 0.0677 | Graduate Degree | 3495 | 0.0152 | 0.9259 | 0.0129 | 0.0020 | 0.7205 | 0.7637 | 0 |
| 0.7223 | Above Median | Other | 1 | 0.0735 | Graduate Degree | 30725 | 0.7676 | 0.1050 | 0.0549 | 0.0137 | 0.1712 | 0.3454 | 0 |

# Create a data frame with Outliers_Maha and their colors based on the threshold

Outliers_Maha_df <- data.frame(
  Outliers_Maha = outliers,
  color = ifelse(outliers > Outlier_Threshold, "red", "black"))

# plot outliers based on the mahalanobis distance

ggplot(Outliers_Maha_df, aes(x = seq_along(Outliers_Maha), y = Outliers_Maha, color = color)) + geom_point(shape = 19) + scale_color_identity() +
  labs(x = "Index",
       y = "Mahalanobis Distance",
       title = "Figure 2: Outliers Based on Mahalanobis Distance",
       caption = "Red points indicate outliers beyond the threshold: 26.06") + theme_bw()

The analysis involved computing the Mahalanobis Distance for each observation and identifying outliers beyond the 95th percentile threshold. The threshold for outliers was determined to be 26.06, indicating that any value surpassing this threshold is classified as an outlier. These outliers were flagged and labeled accordingly, with the column Outliers_Maha indicating their presence. If the Mahalanobis Distance of a data point exceeded the threshold, it was marked as “1” in the Outliers_Maha column; otherwise, it was labeled as “0.” Figure 2 illustrates the outliers within the dataset, identified using the Mahalanobis distance, with the red data points representing those outliers.

Local Outlier Factor (LOF)

The LOF method detects outliers by examining the local density deviation of a data point with respect to its neighbors. A data point is considered an outlier if its local density is substantially lower than that of its neighbors. In this analysis, the LOF was calculated for the same set of numeric columns. Outliers were identified using a threshold value of 1.5, which indicates a substantial deviation in density compared to the surrounding data points. These outliers were flagged and labeled accordingly, with the column Outliers_LOF indicating their presence. If the LOF of a data point exceeded the threshold of 1.5, it was marked as “1” in the Outliers_LOF column; otherwise, it was labeled as “0”. Figure 3 provides a visual representation of the outliers within the dataset, with the red data points indicating those identified using the LOF algorithm.
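The density comparison behind the LOF score can be stated compactly. The notation here is an assumption introduced for illustration: N_k(p) is the set of k nearest neighbors of point p, d(p, o) is the distance between two points, and d_k(o) is the distance from o to its k-th nearest neighbor. The neighborhood size k plays the role of the minPts argument in the code.

```latex
\text{reach-dist}_k(p, o) = \max\bigl\{\, d_k(o),\; d(p, o) \,\bigr\}

\text{lrd}_k(p) = \left( \frac{1}{\lvert N_k(p) \rvert}
    \sum_{o \in N_k(p)} \text{reach-dist}_k(p, o) \right)^{-1}

\text{LOF}_k(p) = \frac{1}{\lvert N_k(p) \rvert}
    \sum_{o \in N_k(p)} \frac{\text{lrd}_k(o)}{\text{lrd}_k(p)}
```

A score near 1 means the point is about as densely surrounded as its neighbors, while a score well above 1 (here, above 1.5) means its neighborhood is markedly sparser.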

# Outlier detection using LOF 
# ---------------------------

Outliers_LOF = lof(CollegeDataset2[, listofcols], minPts =  5)

# Create a column to identify the outliers

CollegeDataset2$Outliers_LOF = ifelse(Outliers_LOF > 1.5, 1, 0)

# print the outliers

# dim(CollegeDataset2[Outliers_LOF > 1.5, ]) # 452 outliers

# Print first 6 rows

Outlier_LOF_College2_6 <- head(CollegeDataset2)

# Present as a nice table

knitr::kable(Outlier_LOF_College2_6, caption = 
               "College Dataset w/ Outliers (LOF Method)")
College Dataset w/ Outliers (LOF Method)

| C150_4_L4 | Completion_Rate_Median | Completion_Rate_High | NUMBRANCH | UG25ABV | HIGHDEG | UGDS | UGDS_WHITE | UGDS_BLACK | UGDS_HISP | UGDS_ASIAN | PCTPELL | PCTFLOAN | Outliers_Maha | Outliers_LOF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.2807 | Below Median | Other | 1 | 0.0617 | Graduate Degree | 5098 | 0.0184 | 0.8978 | 0.0114 | 0.0014 | 0.6853 | 0.6552 | 0 | 0 |
| 0.6245 | Above Median | Other | 1 | 0.1794 | Graduate Degree | 13284 | 0.5297 | 0.2458 | 0.0669 | 0.0767 | 0.3253 | 0.4401 | 0 | 0 |
| 0.4444 | Below Median | Other | 1 | 0.8606 | Graduate Degree | 251 | 0.2470 | 0.6932 | 0.0438 | 0.0000 | 0.7852 | 0.8423 | 0 | 0 |
| 0.6072 | Above Median | Other | 1 | 0.1519 | Graduate Degree | 7358 | 0.7196 | 0.0871 | 0.0610 | 0.0357 | 0.2377 | 0.3578 | 0 | 0 |
| 0.2843 | Below Median | Other | 1 | 0.0677 | Graduate Degree | 3495 | 0.0152 | 0.9259 | 0.0129 | 0.0020 | 0.7205 | 0.7637 | 0 | 0 |
| 0.7223 | Above Median | Other | 1 | 0.0735 | Graduate Degree | 30725 | 0.7676 | 0.1050 | 0.0549 | 0.0137 | 0.1712 | 0.3454 | 0 | 0 |

# Create a data frame with outliers_lof and their colors based on the threshold

Outliers_LOF_df <- data.frame(
  LOF_Outliers = Outliers_LOF,
  color = ifelse(Outliers_LOF > 1.5, "red", "black"))

# plot outliers based on the LOF

ggplot(Outliers_LOF_df, aes(x = seq_along(LOF_Outliers), y = LOF_Outliers, color = color)) +
  geom_point(shape = 19) +
  scale_color_identity() +
  labs(x = "Index",
       y = "Local Outlier Factor (LOF)",
       title = "Figure 3: Outliers Based on Local Outlier Factor (LOF)",
       caption = "Red points indicate outliers with LOF greater than 1.5") + theme_bw()

Mahalanobis Distance & LOF

To provide a comprehensive overview of outliers, the results from both the Mahalanobis Distance and LOF methods were combined. For each observation, the two 0/1 flags were summed to count how many methods identified it as an outlier. Table 1 shows the distribution of this count, with “0” indicating that neither method flagged the observation, “1” that exactly one method flagged it, and “2” that both Mahalanobis Distance and LOF flagged it.

Table 1 shows that 4487 observations were not flagged by either method, 624 observations were flagged by one method, and 43 observations were flagged by both. The column Outliers in the dataset holds this count. To maintain a high-quality dataset for subsequent analysis while keeping as many observations as possible, only observations flagged by both methods (Outliers = 2) were removed. This resulted in a dataset containing 5111 rows and 16 columns.
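The arithmetic behind these counts can be checked directly from Table 1. The short sketch below copies the three frequencies by hand (the vector name flag_counts is hypothetical) and derives the row totals before and after removal, plus the total number of individual flags.

```r
# Frequencies copied from Table 1: number of methods that flagged each row
flag_counts <- c(`0` = 4487, `1` = 624, `2` = 43)

# Rows before removing anything
total_rows <- sum(flag_counts)

# Rows kept when only the rows flagged by both methods are dropped
rows_kept <- sum(flag_counts[c("0", "1")])

# Total individual flags raised across the two methods
total_flags <- sum(as.integer(names(flag_counts)) * flag_counts)
```

Here total_rows is 5154, rows_kept is 5111, and total_flags is 710, which matches the 258 Mahalanobis flags plus the 452 LOF flags noted in the code comments.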

# Sum the two columns to get the total number of outliers

CollegeDataset2$Outliers = CollegeDataset2$Outliers_Maha + CollegeDataset2$Outliers_LOF

Outliers_Table <- table(CollegeDataset2$Outliers)

knitr::kable(Outliers_Table, caption = "Table 1: 
             Summary of Outliers Detected by Mahalanobis Distance and LOF Methods")
Table 1: Summary of Outliers Detected by Mahalanobis Distance and LOF Methods

| Var1 | Freq |
|---|---|
| 0 | 4487 |
| 1 | 624 |
| 2 | 43 |

# Subset data by keeping rows where Outliers = 0 or 1
# To keep as much data as possible for a better quality model, only rows with
# Outliers = 2 are removed, because both the Mahalanobis Distance and LOF
# flag those rows as outliers

CollegeDataset2 = CollegeDataset2[CollegeDataset2$Outliers %in% c(0, 1), ]

# dim(CollegeDataset2) # 5111 rows by 16 columns

Multiple Linear Regression Model (Completion Rate)

A multiple linear regression (MLR) model was fitted to explore the factors impacting college completion rates. The study centered on the completion rate (C150_4_L4) as the response variable, with pertinent predictor variables such as the number of branches, the highest degree awarded, demographics of the student body, and financial aid statistics.

Table 2 reports the estimated coefficient, standard error, test statistic, p-value, and 95% confidence interval for each term, showing the direction and magnitude of each factor's association with college completion rates.

With coefficients rounded to two decimals, the fitted multiple linear regression model is as follows:

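$$
\begin{aligned}
\text{C150\_4\_L4} ={}& 0.46 - 0.01\,\text{NUMBRANCH} + 0.20\,\text{HIGHDEG}_{\text{Certificate Degree}} - 0.06\,\text{HIGHDEG}_{\text{Associate Degree}} \\
&- 0.06\,\text{HIGHDEG}_{\text{Bachelors Degree}} - 0.01\,\text{HIGHDEG}_{\text{Graduate Degree}} - 0.06\,\text{UG25ABV} + 0.00\,\text{UGDS} \\
&+ 0.07\,\text{UGDS\_WHITE} - 0.11\,\text{UGDS\_BLACK} + 0.08\,\text{UGDS\_HISP} + 0.51\,\text{UGDS\_ASIAN} \\
&- 0.16\,\text{PCTPELL} + 0.25\,\text{PCTFLOAN}
\end{aligned}
$$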

# Prep Work

# Subset the CollegeDataset2 to select to focus on Completion Rate

CR_Data <- CollegeDataset2 [c("NUMBRANCH",
                               "HIGHDEG",
                              "UG25ABV",
                              "UGDS",
                               "UGDS_WHITE",
                               "UGDS_BLACK",
                               "UGDS_HISP",             
                               "UGDS_ASIAN",
                               "PCTPELL",
                               "PCTFLOAN",
                              "C150_4_L4")]

# ---------------------------------
# Fit the MLR model
# ---------------------------------

# Split the data into TRAIN and TEST sets

set.seed(1996)  # Set seed for reproducibility

# TRAIN

CR_Train_Index = createDataPartition(CR_Data$C150_4_L4, p = .8, list = FALSE) # 80% for training

CR_Train = CR_Data[ CR_Train_Index,]

# dim(CR_Train) # about 80% of the rows, 11 columns

# TEST 20% for testing

CR_Test = CR_Data[-CR_Train_Index,]

# dim(CR_Test) # the remaining rows (about 20%), 11 columns

# ------------------------------------
# fit the model
# ------------------------------------

CR_Model = lm(C150_4_L4 ~ NUMBRANCH + 
              HIGHDEG +
              UG25ABV +
              UGDS +
              UGDS_WHITE +
              UGDS_BLACK +
              UGDS_HISP +            
              UGDS_ASIAN +
              PCTPELL +
              PCTFLOAN, data = CR_Train)

# summary(CR_Model)

CR_Model_Table <- tidy(CR_Model, conf.int = TRUE)

# Making a table to present the findings

nice_table(CR_Model_Table, title = "Table 2:
           Multiple Linear Regression Results of Factors Influencing College Completion Rate")

Table 2: Multiple Linear Regression Results of Factors Influencing College Completion Rate

| Term | estimate | std.error | statistic | p | 95% CI |
|---|---|---|---|---|---|
| (Intercept) | 0.46 | 0.07 | 6.45 | < .001*** | [0.32, 0.60] |
| NUMBRANCH | -0.01 | 0.00 | -5.42 | < .001*** | [-0.01, -0.00] |
| HIGHDEGCertificate Degree | 0.20 | 0.07 | 3.01 | .003** | [0.07, 0.33] |
| HIGHDEGAssociate Degree | -0.06 | 0.07 | -0.95 | .344 | [-0.20, 0.07] |
| HIGHDEGBachelors Degree | -0.06 | 0.07 | -0.83 | .407 | [-0.19, 0.08] |
| HIGHDEGGraduate Degree | -0.01 | 0.07 | -0.14 | .890 | [-0.14, 0.12] |
| UG25ABV | -0.06 | 0.02 | -3.68 | < .001*** | [-0.09, -0.03] |
| UGDS | 0.00 | 0.00 | 0.52 | .600 | [-0.00, 0.00] |
| UGDS_WHITE | 0.07 | 0.02 | 2.98 | .003** | [0.02, 0.12] |
| UGDS_BLACK | -0.11 | 0.03 | -4.35 | < .001*** | [-0.17, -0.06] |
| UGDS_HISP | 0.08 | 0.03 | 2.94 | .003** | [0.03, 0.13] |
| UGDS_ASIAN | 0.51 | 0.05 | 10.12 | < .001*** | [0.41, 0.61] |
| PCTPELL | -0.16 | 0.02 | -7.89 | < .001*** | [-0.20, -0.12] |
| PCTFLOAN | 0.25 | 0.01 | 17.68 | < .001*** | [0.22, 0.28] |

# Extract the Multiple R-squared and Adjusted R-squared

Multiple_R_squared <- summary(CR_Model)$r.squared

Adjusted_R_squared <- summary(CR_Model)$adj.r.squared

# Create a data frame with the extracted values

CR_RSquared_Results_Table <- data.frame(
  "Metric" = c("Multiple R-squared", "Adjusted R-squared"),
  "Value" = c(Multiple_R_squared, Adjusted_R_squared))

kable(CR_RSquared_Results_Table, caption = "Table 3:
      Multiple Linear Regression Model Performance Metrics")
Table 3: Multiple Linear Regression Model Performance Metrics

| Metric | Value |
|---|---|
| Multiple R-squared | 0.3295391 |
| Adjusted R-squared | 0.3274012 |

Furthermore, Table 3 provides the performance metrics of the multiple linear regression model used to analyze the factors affecting college completion rates. Two key metrics are presented: the Multiple R-squared and Adjusted R-squared values.

The Multiple R-squared (R^2) value, also known as the coefficient of determination, indicates the proportion of the variance in college completion rates explained by the predictor variables included in the model. With a value of approximately 0.33, it suggests that around 33% of the variability in completion rates can be attributed to the variables considered in the regression model. A higher R-squared value suggests a better fit of the model to the data, signifying stronger explanatory power collectively from the predictor variables.

The Adjusted R-squared value, approximately 0.33 in this case, adjusts for the number of predictor variables in the model. It penalizes the inclusion of unnecessary variables that do not significantly enhance the model’s explanatory ability. Despite being slightly lower than the Multiple R-squared, the Adjusted R-squared value provides a more accurate assessment of the model’s goodness-of-fit, considering the model’s complexity.
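The adjustment can be written explicitly. In the formula below, the symbols are assumptions introduced for illustration: n is the number of training observations and p is the number of estimated slope terms, excluding the intercept (13 in Table 2, because HIGHDEG contributes four indicator terms).

```latex
R^2_{\text{adj}} = 1 - \bigl(1 - R^2\bigr)\,\frac{n - 1}{n - p - 1}
```

When n is large relative to p, the ratio (n - 1)/(n - p - 1) is close to 1, which is why the two values in Table 3 differ only in the third decimal place.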

These metrics indicate that the regression model explains only about a third of the variability in college completion rates, which implies that other, unaccounted factors also influence completion rates. The Adjusted R-squared (0.327) is nearly identical to the Multiple R-squared (0.330), so the fit is not being inflated by the number of predictor variables included in the model.

Ridge Regression

In the exploration of the multiple linear regression model, ridge regression is applied to further analyze the factors influencing college completion rates. Ridge regression introduces regularization to the model, aiming to mitigate multicollinearity and overfitting by adding a penalty term to the coefficient estimates.
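To illustrate how the penalty shrinks coefficient estimates, the following sketch fits ridge regression in closed form on simulated data. It is a textbook formulation under stated assumptions (standardized predictors, centered response, no intercept), not the scaling glmnet uses internally, and the helper ridge_coef and the simulated data are hypothetical.

```r
# Simulated data (hypothetical, not the college dataset)
set.seed(1)
n <- 100
X <- scale(matrix(rnorm(n * 3), n, 3))          # standardized predictors
y <- as.numeric(X %*% c(2, -1, 0.5) + rnorm(n)) # linear signal plus noise
y <- y - mean(y)                                # center the response

# Closed-form ridge estimate: (X'X + lambda * I)^{-1} X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

b0    <- ridge_coef(X, y, 0)     # lambda = 0 reproduces ordinary least squares
b10   <- ridge_coef(X, y, 10)    # moderate penalty
b1000 <- ridge_coef(X, y, 1000)  # heavy penalty
```

As lambda grows, the coefficients move towards zero but do not reach it exactly, which is the behavior that separates ridge from LASSO.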

In employing ridge regression with an alpha value of 0, the process involves determining the optimal lambda value through cross-validation. This lambda value serves as the regularization parameter, crucial for minimizing the mean squared error of the model. Specifically, two important lambda values are calculated: lambda.min and lambda.1se.

Lambda.min is the value of lambda that minimizes the cross-validated mean squared error. It prioritizes prediction accuracy and applies the lighter of the two penalties, potentially resulting in more complex models.

Lambda.1se, on the other hand, is the largest lambda value whose cross-validated error is within one standard error of the minimum error. It provides a more conservative approach, favoring simpler models that are less susceptible to overfitting while maintaining reasonable predictive accuracy.
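The selection rule for the two values can be sketched in a few lines of base R. The numbers below are made up for illustration and are not the cross-validation results from this analysis; the helper pick_lambdas is hypothetical and only mirrors the rule described above.

```r
# Toy cross-validation summary (hypothetical numbers)
lambda <- c(1, 0.5, 0.1, 0.05, 0.01)            # candidate penalties
cvm    <- c(0.060, 0.052, 0.045, 0.044, 0.047)  # mean CV error per lambda
cvsd   <- c(0.002, 0.002, 0.002, 0.002, 0.002)  # standard error of each mean

pick_lambdas <- function(lambda, cvm, cvsd) {
  i_min <- which.min(cvm)                  # position of the lowest CV error
  threshold <- cvm[i_min] + cvsd[i_min]    # minimum error plus one standard error
  list(
    lambda.min = lambda[i_min],
    lambda.1se = max(lambda[cvm <= threshold])  # largest lambda under the threshold
  )
}

chosen <- pick_lambdas(lambda, cvm, cvsd)
chosen$lambda.min  # 0.05
chosen$lambda.1se  # 0.1
```

In this toy case the error at lambda = 0.1 is within one standard error of the minimum, so the one-standard-error rule prefers the heavier penalty.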

The two values differ (lambda.min: 0.02281474, lambda.1se: 0.05300874). Coefficients were extracted at both values, and Table 4 reports those at lambda.min. Comparing the two helps determine the level of regularization that balances model complexity and predictive accuracy.

# ------------------------------------
# Now apply regularization to the model
# ------------------------------------

# -------------------------------------
# Apply Ridge
# alpha = 0 for Ridge
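# Use cv.glmnet to find the best lambda
#---------------------------------------

# Note: as.matrix() does not dummy-encode the factor HIGHDEG, so it enters as one column;
# model.matrix(C150_4_L4 ~ ., data = CR_Train)[, -1] would expand it into indicator columns.

CR_Model_Ridge_CV = cv.glmnet(as.matrix(CR_Train[, -ncol(CR_Train)]), 
                              CR_Train$C150_4_L4, alpha = 0)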

# print the best lambda

# CR_Model_Ridge_CV$lambda.min # lambda.min = 0.005179007

# Print coefficients for the best lambda

CR_Coefficients_Ridge <- predict(CR_Model_Ridge_CV, s = "lambda.min", type = "coefficients")

# Convert the sparse matrix to a regular matrix

CR_Coefficients_Ridge_matrix <- as.matrix(CR_Coefficients_Ridge)

# Convert the matrix to a data frame
# Convert coefficients to a data frame

CR_Coefficients_Ridge_df <- as.data.frame(CR_Coefficients_Ridge_matrix)

# Create a basic table using kable

kable(CR_Coefficients_Ridge_df, caption = "Table 4: Ridge Regression Coefficients 
      Analyzing Factors Influencing College Completion Rate")
Table 4: Ridge Regression Coefficients Analyzing Factors Influencing College Completion Rate

| | lambda.min |
|---|---|
| (Intercept) | 0.3707355 |
| NUMBRANCH | -0.0074852 |
| HIGHDEG | 0.0000000 |
| UG25ABV | 0.0497032 |
| UGDS | -0.0000035 |
| UGDS_WHITE | 0.1537852 |
| UGDS_BLACK | -0.0315805 |
| UGDS_HISP | 0.1548810 |
| UGDS_ASIAN | 0.5598989 |
| PCTPELL | -0.0761827 |
| PCTFLOAN | 0.2576537 |

# print the best lambda.1se

# CR_Model_Ridge_CV$lambda.1se # lambda.1se = 0.05300874

# Print coefficients for the best lambda.1se

CR_Coefficients_Ridge_Lambda1se <- predict(CR_Model_Ridge_CV, s = "lambda.1se", type = "coefficients")

# Convert the sparse matrix to a regular matrix

CR_Coefficients_Ridge_Lambda1se_matrix <- as.matrix(CR_Coefficients_Ridge_Lambda1se)

# Convert the matrix to a data frame
# Convert coefficients to a data frame

CR_Coefficients_Ridge_Lambda1se_df <- as.data.frame(CR_Coefficients_Ridge_Lambda1se_matrix)

# Create a basic table using kable

# kable(CR_Coefficients_Ridge_Lambda1se_df, caption = "Table 4B: Ridge Regression Coefficients 
      # Analyzing Factors Influencing College Completion Rate")

### lambda.min and lambda.1se produce different coefficients: the larger penalty at lambda.1se generally shrinks the coefficients further toward zero. Only the lambda.min coefficients are tabulated (Table 4).

In Table 4, the results of the ridge regression analysis are presented, detailing the coefficients of predictor variables influencing college completion rates. The analysis reveals valuable insights into the magnitude and direction of these relationships. For example:

  • Intercept (0.3707355): Serving as the baseline, this coefficient represents the estimated completion rate when all predictor variables are zero. It provides context for interpreting the effects of other variables. 

  • NUMBRANCH (-0.0074852): The negative coefficient suggests that an increase in the number of branches is associated with a slight decrease in completion rates. This implies that institutions with more branches may face challenges in achieving higher completion rates, possibly due to logistical or administrative complexities. 

  • HIGHDEG (0): The coefficient is exactly zero, so the highest degree awarded by the college contributes nothing to the predictions of this model. Because ridge regression rarely shrinks a coefficient to exactly zero, this may indicate that HIGHDEG takes a single value in the training data rather than that degree type is unrelated to completion rates. 

  • UG25ABV (0.0497032): A positive coefficient indicates that a higher percentage of undergraduate students over 25 years old correlates with a modest increase in completion rates. This suggests that institutions with a significant proportion of older undergraduate students may have higher completion rates, possibly due to their greater maturity and dedication to completing their studies. 

  • UGDS (-0.0000035): The coefficient looks close to zero because UGDS is measured in individual students. Each additional 10,000 undergraduates is associated with a decrease of about 0.035 in the predicted completion rate, so the effect of enrollment size is small but not absent. 

  • Demographic Composition

    • UGDS_WHITE (0.1537852): A positive coefficient indicates that a higher percentage of White students is associated with higher completion rates. This underscores the importance of diversity and inclusion efforts in fostering positive educational outcomes for all student demographics. 

    • UGDS_BLACK (-0.0315805): The negative coefficient suggests that a higher percentage of Black students correlates with lower completion rates. Addressing equity gaps and providing tailored support to underrepresented minority groups may be essential in improving completion rates for these students. 

    • UGDS_HISP (0.1548810): The positive coefficient indicates that a higher percentage of Hispanic students is associated with higher completion rates. This highlights the importance of culturally responsive practices and targeted interventions to support Hispanic student success. 

    • UGDS_ASIAN (0.5598989): Among the demographic variables, the percentage of Asian students has the largest positive coefficient, so institutions with a higher share of Asian students tend to have higher completion rates. Ridge regression does not provide significance tests, so this describes the size of the association rather than its statistical significance. Understanding the factors contributing to the success of Asian students may inform strategies to enhance completion rates for all students. 

  • Financial Aid Metrics

    • PCTPELL (-0.0761827): The negative coefficient suggests that a higher percentage of Pell Grant recipients correlates with lower completion rates. This underscores the challenges faced by economically disadvantaged students and the importance of addressing barriers to their success. 

    • PCTFLOAN (0.2576537): The positive coefficient indicates that a higher percentage of students with federal loans is associated with higher completion rates. This suggests that access to financial aid, particularly in the form of federal loans, may facilitate degree completion for some students. 

Therefore, the ridge regression model is as follows:

Predicted (C150_4_L4) = 0.3707355 + (-0.0074852) * NUMBRANCH + (0.0000000) * HIGHDEG + (0.0497032) * UG25ABV + (-0.0000035) * UGDS + (0.1537852) * UGDS_WHITE + (-0.0315805) * UGDS_BLACK + (0.1548810) * UGDS_HISP + (0.5598989) * UGDS_ASIAN + (-0.0761827) * PCTPELL + (0.2576537) * PCTFLOAN

Overall, ridge regression enhances the understanding of the relationships between predictor variables and college completion rates by incorporating regularization techniques. The resulting coefficients offer valuable insights for decision-makers in academia to identify and address factors affecting completion rates effectively.
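To make the fitted equation concrete, the sketch below applies the Table 4 coefficients to a single college. The coefficients are copied from Table 4; the predictor values describe a hypothetical institution invented for illustration, and predict_completion is a helper name introduced here, not part of glmnet.

```r
# Ridge coefficients at lambda.min, copied from Table 4
ridge_coefs <- c(
  "(Intercept)" = 0.3707355, NUMBRANCH = -0.0074852, HIGHDEG = 0,
  UG25ABV = 0.0497032, UGDS = -0.0000035, UGDS_WHITE = 0.1537852,
  UGDS_BLACK = -0.0315805, UGDS_HISP = 0.1548810, UGDS_ASIAN = 0.5598989,
  PCTPELL = -0.0761827, PCTFLOAN = 0.2576537
)

# Hypothetical college (values invented for illustration)
college <- c(
  NUMBRANCH = 1, HIGHDEG = 4, UG25ABV = 0.20, UGDS = 5000,
  UGDS_WHITE = 0.50, UGDS_BLACK = 0.10, UGDS_HISP = 0.20, UGDS_ASIAN = 0.05,
  PCTPELL = 0.30, PCTFLOAN = 0.40
)

# Intercept plus the sum of coefficient * predictor value
predict_completion <- function(coefs, predictors) {
  unname(coefs["(Intercept)"] + sum(coefs[names(predictors)] * predictors))
}

predicted_rate <- predict_completion(ridge_coefs, college)
print(round(predicted_rate, 4))
```

For this hypothetical college the largest single contribution comes from PCTFLOAN (0.2576537 * 0.40), while the enrollment term (-0.0000035 * 5000) lowers the prediction by 0.0175.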

Train Data

In this ridge regression analysis, the data is partitioned into training and testing sets. The analysis begins with the training set, which consists of 3,554 observations; together with the 1,520 test observations, this corresponds to a split of roughly 70% for training and 30% for testing. Two metrics, labeled 4A and 4B, were computed on the training data to assess the model’s fit.

  • 4A: The ridge regression R^2 obtained at lambda.min was 0.1240899, meaning that approximately 12.41% of the variability in college completion rates is explained by the predictor variables in the model. The model therefore captures only a modest portion of the variation in completion rates. 
# make predictions (lambda.min)

CR_Predictions_Ridge_CV_Train = predict(CR_Model_Ridge_CV, newx = as.matrix(CR_Train[, -ncol(CR_Train)]),s = "lambda.min")

# Calculate R2

CR_R2_Ridge_CV_Train = 1 - (sum((CR_Train$C150_4_L4 - CR_Predictions_Ridge_CV_Train)^2) / sum((CR_Train$C150_4_L4 - mean(CR_Train$C150_4_L4))^2))

kable(CR_R2_Ridge_CV_Train, caption = "4A: Ridge Regression R-squared for College Completion Rate Prediction (lambda.min)")
4A: Ridge Regression R-squared for College Completion Rate Prediction (lambda.min)
x
0.1240899
# make predictions (lambda.1se)

CR_Predictions_Ridge_CV_Train_Lambda1se = predict(CR_Model_Ridge_CV, newx = as.matrix(CR_Train[, -ncol(CR_Train)]),s = "lambda.1se")

# Calculate R2

CR_R2_Ridge_CV_Train_Lambda1se = 1 - (sum((CR_Train$C150_4_L4 - CR_Predictions_Ridge_CV_Train_Lambda1se)^2) / sum((CR_Train$C150_4_L4 - mean(CR_Train$C150_4_L4))^2))

kable(CR_R2_Ridge_CV_Train_Lambda1se, caption = "4B: Ridge Regression R-squared for College Completion Rate Prediction (lambda.1se)")
4B: Ridge Regression R-squared for College Completion Rate Prediction (lambda.1se)
x
0.1138844
  • 4B: Similarly, the Ridge Regression R^2 value computed using the lambda.1se parameter was 0.1138844. This suggests that approximately 11.39% of the variability in college completion rates can be explained by the model. While slightly lower than the R^2 value obtained with lambda.min, it still signifies a meaningful level of explanatory power. 

Neither R^2 value is high, but both show that the ridge regression model captures a portion of the variability in completion rates. The heavier penalty at lambda.1se costs about one percentage point of explained variance on the training data (0.1138844 versus 0.1240899). This information can guide strategic decision-making within educational institutions, helping to identify areas for improvement and interventions to enhance student outcomes. 
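The R^2 values above all come from the same formula, 1 minus the ratio of the residual sum of squares to the total sum of squares. A small helper makes that calculation explicit and avoids repeating the long expression for every model. This is a sketch: r_squared is a name introduced here, and the vectors in the usage lines are made-up numbers, not values from the college dataset.

```r
# R^2 = 1 - SSE / SST, as used for tables 4A to 4D
r_squared <- function(actual, predicted) {
  actual <- as.numeric(actual)
  predicted <- as.numeric(predicted)          # predict() on glmnet objects returns a matrix
  sse <- sum((actual - predicted)^2)          # residual sum of squares
  sst <- sum((actual - mean(actual))^2)       # total sum of squares
  1 - sse / sst
}

# Toy usage with made-up numbers
toy_actual    <- c(1, 2, 3, 4)
toy_predicted <- c(1.1, 1.9, 3.2, 3.8)
toy_r2 <- r_squared(toy_actual, toy_predicted)
print(toy_r2)
```

With such a helper, each R^2 in this section would reduce to a single call of the form r_squared(CR_Train$C150_4_L4, CR_Predictions_Ridge_CV_Train), which also makes it harder to mix up training and test objects.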

Test Data

The test dataset, comprising 1,520 observations, is used to evaluate how well the ridge regression model predicts college completion rates on data it was not fitted to. Two metrics, labeled 4C and 4D, were computed for this purpose.

  • 4C: The ridge regression R^2 obtained for the test data at lambda.min was 0.1417864, meaning that approximately 14.18% of the variability in college completion rates is explained by the predictor variables in the model. The model therefore has only a modest ability to predict completion rates for the test data.
# make predictions (lambda.min)

CR_Predictions_Ridge_CV_Test = predict(CR_Model_Ridge_CV, newx = as.matrix(CR_Test[, -ncol(CR_Test)]), s = "lambda.min")

# Calculate R2

CR_R2_Ridge_CV_Test = 1 - (sum((CR_Test$C150_4_L4 - CR_Predictions_Ridge_CV_Test)^2) / sum((CR_Test$C150_4_L4 - mean(CR_Test$C150_4_L4))^2))

kable(CR_R2_Ridge_CV_Test, caption = "4C: Ridge Regression R-squared for College Completion Rate Prediction (Test Data)")
4C: Ridge Regression R-squared for College Completion Rate Prediction (Test Data)
x
0.1417864
# make predictions (lambda.1se) on the test data
# Note: the 4D value printed below (0.1138844) was produced by evaluating
# on CR_Train and matches 4B; the code here evaluates on CR_Test instead

CR_Predictions_Ridge_CV_Test_Lambda1se = predict(CR_Model_Ridge_CV, newx = as.matrix(CR_Test[, -ncol(CR_Test)]), s = "lambda.1se")

# Calculate R2 on the test data

CR_R2_Ridge_CV_Test_Lambda1se = 1 - (sum((CR_Test$C150_4_L4 - CR_Predictions_Ridge_CV_Test_Lambda1se)^2) / sum((CR_Test$C150_4_L4 - mean(CR_Test$C150_4_L4))^2))

kable(CR_R2_Ridge_CV_Test_Lambda1se, caption = "4D: Ridge Regression R-squared for College Completion Rate Prediction (Test Data, lambda.1se)")
4D: Ridge Regression R-squared for College Completion Rate Prediction (lambda.1se)
x
0.1138844
  • 4D: The value shown for lambda.1se (0.1138844) is identical to the training value in 4B, because it was computed from the training data (CR_Train) rather than the test data (CR_Test). It therefore describes the training fit, about 11.39% of the variability explained, and should not be read as a test-set result. The test-set R^2 at lambda.1se requires evaluating the lambda.1se predictions against CR_Test.

These findings describe how the ridge regression model performs on unseen test data. At lambda.min, the test R^2 (0.1417864) is slightly higher than the training R^2 (0.1240899), so there is no sign of overfitting, but the model still explains only about 14% of the variability in completion rates. Performance also depends on the choice of regularization parameter. These insights can inform decision-making within educational institutions, aiding in the development of strategies to support student success and improve completion rates.

LASSO

In this section, the application of LASSO (Least Absolute Shrinkage and Selection Operator) regression is detailed. LASSO is employed to identify the most significant predictors of college completion rates while simultaneously performing variable selection.

The analysis begins with fitting the LASSO model to the training data. By setting the alpha parameter to 1, LASSO regularization is applied, favoring sparse models where some coefficients are reduced to exactly zero. Through cross-validation, the optimal lambda value is determined to strike a balance between minimizing prediction error and preventing overfitting. In this analysis, both the lambda.min value (0.0003104733) and the lambda.1se value (0.007341109) are calculated, reflecting different levels of regularization. Table 5A and 5B will respectively present the LASSO model based on each lambda value, providing insights into the selected predictors and their coefficients.
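The difference between the two penalties can be illustrated with a minimal sketch. It assumes the textbook special case of orthonormal predictors, where (up to how lambda is scaled) ridge rescales each least-squares coefficient by 1 / (1 + lambda) and LASSO applies soft-thresholding. The coefficient values and lambda are hypothetical, and the two functions are illustrative helpers, not glmnet functions.

```r
# Ridge in the orthonormal case: proportional shrinkage, never exactly zero
ridge_shrink <- function(beta_ols, lambda) {
  beta_ols / (1 + lambda)
}

# LASSO in the orthonormal case: soft-thresholding, small coefficients become zero
soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(abs(beta_ols) - lambda, 0)
}

beta_ols <- c(0.50, -0.05, 0.20)   # hypothetical least-squares coefficients
lambda   <- 0.10                   # hypothetical penalty

ridge_coefs <- ridge_shrink(beta_ols, lambda)
lasso_coefs <- soft_threshold(beta_ols, lambda)

print(rbind(ols = beta_ols, ridge = ridge_coefs, lasso = lasso_coefs))
```

In this toy case the second coefficient is smaller in magnitude than lambda, so LASSO sets it to zero while ridge only shrinks it. The same mechanism explains why several predictors drop out of Table 5B at the larger lambda.1se.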

Table 5A (lambda.min) presents the coefficients resulting from the LASSO regression analysis, which aims to identify the most significant predictors of college completion rates. Starting with the intercept (0.3504087), this coefficient represents the completion rate when all predictor variables are zero. In this context, it serves as a baseline for comparison.

Moving on to the predictor variables:

  • NUMBRANCH (-0.0074946): This negative coefficient suggests that an increase in the number of branches corresponds to a slight reduction in completion rates. While the effect is relatively minor, it highlights the potential challenges associated with managing multiple branches and the importance of strategic planning in mitigating any adverse impact on completion rates. 

  • HIGHDEG (0): The coefficient is exactly zero, so the highest degree awarded by the college is excluded from the LASSO model and contributes nothing to the predicted completion rate. 

  • UG25ABV (0.0494927): The positive coefficient indicates that a higher percentage of undergraduate students over 25 years old is associated with a slight increase in completion rates. This suggests that older students may possess greater determination or commitment to completing their studies, contributing positively to overall completion rates. 

  • UGDS (-0.0000036): The coefficient looks close to zero because UGDS is measured in individual students. Each additional 10,000 undergraduates is associated with a decrease of about 0.036 in the predicted completion rate, so enrollment size has a small effect in this model. 

  • Demographic Composition: 

    • UGDS_WHITE (0.1754200): This positive coefficient suggests that a higher percentage of White students positively influences completion rates. It implies that colleges with a larger proportion of White students tend to exhibit higher completion rates, possibly due to various socio-economic factors or institutional support mechanisms tailored to this demographic. 

    • UGDS_BLACK (-0.0105645): Conversely, the negative coefficient indicates that a higher percentage of Black students is associated with lower completion rates. This underscores the importance of addressing disparities in educational outcomes and implementing targeted interventions to support Black students and enhance their chances of completing their academic programs. 

    • UGDS_HISP (0.1787894): The positive coefficient suggests that a higher percentage of Hispanic students positively impacts completion rates. Educational institutions with a significant Hispanic student population may benefit from cultural competency initiatives and support services tailored to the needs of Hispanic students to improve their retention and completion rates. 

    • UGDS_ASIAN (0.5943168): With the highest coefficient among demographic variables, a higher percentage of Asian students significantly boosts completion rates. This indicates that colleges with a substantial Asian student population tend to have higher completion rates, possibly due to cultural factors, academic preparedness, or other institutional characteristics that facilitate student success. 

  • Financial Aid Metrics: 

    • PCTPELL (-0.0817687): The negative coefficient suggests that a higher percentage of Pell Grant recipients is associated with lower completion rates. This highlights the challenges faced by students from low-income backgrounds and underscores the importance of financial aid policies and support programs in promoting student success and retention. 

    • PCTFLOAN (0.2646317): The positive coefficient indicates that a higher percentage of students with federal loans correlates with higher completion rates. This suggests that access to federal loans may enable students to overcome financial barriers and complete their academic programs successfully. 

Therefore, the LASSO regression model with the coefficients at lambda.min is as follows:

Predicted (C150_4_L4) = 0.3504087 + (-0.0074946) * NUMBRANCH + 0.0494927 * UG25ABV + (-0.0000036) * UGDS + 0.1754200 * UGDS_WHITE + (-0.0105645) * UGDS_BLACK + 0.1787894 * UGDS_HISP + 0.5943168 * UGDS_ASIAN + (-0.0817687) * PCTPELL + 0.2646317 * PCTFLOAN

# ------------------------------------
# Apply Lasso and Fit the model
# alpha = 1 for Lasso
# Use cv.glmnet to find the best lambda
# ------------------------------------

# Creating Model

CR_Model_Lasso_CV = cv.glmnet(as.matrix(CR_Train[, -ncol(CR_Train)]), CR_Train$C150_4_L4, alpha = 1)

# print the best lambda (lambda.min)

# CR_Model_Lasso_CV$lambda.min # lambda.min = 0.0003104733

# Print coefficients for the best lambda

CR_Coefficients_LASSO <- predict(CR_Model_Lasso_CV, s = "lambda.min", type = "coefficients")

# Convert the sparse matrix to a regular matrix

CR_Coefficients_LASSO_Matrix <- as.matrix(CR_Coefficients_LASSO)

# Convert the matrix to a data frame
# Convert coefficients to a data frame

CR_Coefficients_LASSO_df <- as.data.frame(CR_Coefficients_LASSO_Matrix)

# Create a basic table using kable

kable(CR_Coefficients_LASSO_df, caption = "Table 5A: LASSO Regression Coefficients 
      Analyzing Factors Influencing College Completion Rate")
Table 5A: LASSO Regression Coefficients Analyzing Factors Influencing College Completion Rate

Term           lambda.min
(Intercept)     0.3504087
NUMBRANCH      -0.0074946
HIGHDEG         0.0000000
UG25ABV         0.0494927
UGDS           -0.0000036
UGDS_WHITE      0.1754200
UGDS_BLACK     -0.0105645
UGDS_HISP       0.1787894
UGDS_ASIAN      0.5943168
PCTPELL        -0.0817687
PCTFLOAN        0.2646317

# print the best lambda (lambda.1se)

# CR_Model_Lasso_CV$lambda.1se # lambda.1se =  0.007341109

# Print coefficients for the best lambda (lambda.1se)

CR_Coefficients_LASSO_Lambda1se <- predict(CR_Model_Lasso_CV, s = "lambda.1se", type = "coefficients")

# Convert the sparse matrix to a regular matrix

CR_Coefficients_LASSO_Lambda1se_Matrix <- as.matrix(CR_Coefficients_LASSO_Lambda1se)

# Convert the matrix to a data frame
# Convert coefficients to a data frame

CR_Coefficients_LASSO_Lambda1se_df <- as.data.frame(CR_Coefficients_LASSO_Lambda1se_Matrix)

# Create a basic table using kable

kable(CR_Coefficients_LASSO_Lambda1se_df, caption = "Table 5B: LASSO Regression Coefficients 
      Analyzing Factors Influencing College Completion Rate")
Table 5B: LASSO Regression Coefficients Analyzing Factors Influencing College Completion Rate

Term           lambda.1se
(Intercept)     0.4853050
NUMBRANCH      -0.0048453
HIGHDEG         0.0000000
UG25ABV         0.0000000
UGDS           -0.0000021
UGDS_WHITE      0.0235416
UGDS_BLACK     -0.1174150
UGDS_HISP       0.0000000
UGDS_ASIAN      0.3369200
PCTPELL         0.0000000
PCTFLOAN        0.2025453

Table 5B (lambda.1se) displays the coefficients derived from the LASSO regression analysis, aimed at identifying significant predictors of college completion rates.

Intercept (0.4853050): This intercept represents the baseline completion rate when all predictor variables are zero, providing a reference point for comparison.

Predictor Variables:

  • NUMBRANCH (-0.0048453): The negative coefficient suggests that an increase in the number of branches is associated with a slight decrease in completion rates. While the effect is minimal, it highlights the importance of strategic planning in managing multiple branches to mitigate any adverse impact on completion rates. 

  • HIGHDEG (0): The coefficient is exactly zero, so the highest degree awarded by the college is excluded from the LASSO model. 

  • UG25ABV (0): At this stronger penalty, LASSO shrinks the coefficient to exactly zero, removing the percentage of undergraduate students over 25 years old from the model. 

  • UGDS (-0.0000021): UGDS is measured in individual students, so each additional 10,000 undergraduates is associated with a decrease of about 0.021 in the predicted completion rate. 

  • Demographic Composition: 

    • UGDS_WHITE (0.0235416): A positive coefficient implies that a higher percentage of White students positively influences completion rates, possibly due to socio-economic factors or institutional support mechanisms tailored to this demographic. 

    • UGDS_BLACK (-0.1174150): Conversely, the negative coefficient indicates that a higher percentage of Black students is associated with lower completion rates, highlighting the importance of addressing disparities in educational outcomes. 

    • UGDS_HISP (0): LASSO sets this coefficient to exactly zero, so the percentage of Hispanic students is dropped from the model at lambda.1se. 

    • UGDS_ASIAN (0.3369200): The positive coefficient indicates that a higher percentage of Asian students significantly boosts completion rates, suggesting cultural factors or institutional characteristics that facilitate student success. 

  • Financial Aid Metrics: 

    • PCTPELL (0): LASSO sets this coefficient to exactly zero, so the percentage of Pell Grant recipients is dropped from the model at lambda.1se. 

    • PCTFLOAN (0.2025453): A positive coefficient implies that a higher percentage of students with federal loans correlates with higher completion rates, indicating the role of federal loans in overcoming financial barriers to academic completion. 

Therefore, the LASSO regression model with the coefficients at lambda.1se is as follows:

Predicted (C150_4_L4) = 0.4853050 + (-0.0048453) * NUMBRANCH + (-0.0000021) * UGDS + 0.0235416 * UGDS_WHITE + (-0.1174150) * UGDS_BLACK + 0.3369200 * UGDS_ASIAN + 0.2025453 * PCTFLOAN

These findings provide valuable insights into the complex interplay of factors influencing college completion rates and can inform strategic decision-making and policy development aimed at enhancing student success and retention in higher education institutions.
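The variable selection performed at lambda.1se can be read directly from the coefficients. The sketch below copies the Table 5B values into a named vector and separates retained predictors from dropped ones; the object names are introduced here for illustration only.

```r
# LASSO coefficients at lambda.1se, copied from Table 5B (intercept omitted)
lasso_1se <- c(
  NUMBRANCH = -0.0048453, HIGHDEG = 0, UG25ABV = 0, UGDS = -0.0000021,
  UGDS_WHITE = 0.0235416, UGDS_BLACK = -0.1174150, UGDS_HISP = 0,
  UGDS_ASIAN = 0.3369200, PCTPELL = 0, PCTFLOAN = 0.2025453
)

retained <- names(lasso_1se)[lasso_1se != 0]   # predictors kept in the model
dropped  <- names(lasso_1se)[lasso_1se == 0]   # predictors removed by the penalty

cat("Retained:", paste(retained, collapse = ", "), "\n")
cat("Dropped: ", paste(dropped, collapse = ", "), "\n")
```

Six of the ten predictors remain, which matches the terms in the lambda.1se equation above.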

Train Data

The LASSO analysis uses the same training and testing partition as the ridge analysis, beginning with the training set of 3,554 observations (roughly 70% of the data). With lambda.min (5C), the model achieved an R-squared value of 0.1244594 on the training data, meaning that approximately 12.45% of the variability in college completion rates is explained by the predictor variables in the model.

# make predictions (lambda.min)

CR_Predictions_Lasso_CV_Train = predict(CR_Model_Lasso_CV, newx = as.matrix(CR_Train[, -ncol(CR_Train)]), s = "lambda.min")

# Calculate R2

CR_R2_Lasso_CV_Train = 1 - (sum((CR_Train$C150_4_L4 - CR_Predictions_Lasso_CV_Train)^2) / sum((CR_Train$C150_4_L4 - mean(CR_Train$C150_4_L4))^2))

kable(CR_R2_Lasso_CV_Train, caption = "5C: LASSO Regression R-squared for College Completion Rate Prediction (lambda.min)")
5C: LASSO Regression R-squared for College Completion Rate Prediction (lambda.min)
x
0.1244594
# make predictions (lambda.1se)

CR_Predictions_Lasso_CV_Train_Lambda1se = predict(CR_Model_Lasso_CV, newx = as.matrix(CR_Train[, -ncol(CR_Train)]), s = "lambda.1se")

# Calculate R2

CR_R2_Lasso_CV_Train_Lambda1se = 1 - (sum((CR_Train$C150_4_L4 - CR_Predictions_Lasso_CV_Train_Lambda1se)^2) / sum((CR_Train$C150_4_L4 - mean(CR_Train$C150_4_L4))^2))

kable(CR_R2_Lasso_CV_Train_Lambda1se, caption = "5D: LASSO Regression R-squared for College Completion Rate Prediction (lambda.1se)")
5D: LASSO Regression R-squared for College Completion Rate Prediction (lambda.1se)
x
0.1056068

Similarly, utilizing the lambda.1se (5D) parameter resulted in an R-squared value of 0.1056068. These findings suggest that while the LASSO model demonstrates some predictive ability, it explains only a modest portion of the variance in college completion rates. This implies that factors beyond those included in the current model may also influence completion rates and should be considered for a more comprehensive understanding. Further refinement of the model or exploration of additional predictors may be necessary to improve its predictive accuracy and capture a more substantial portion of the variability in completion rates.

Test Data

In assessing the test data, the LASSO regression analysis revealed notable insights into the predictive performance of the model. When using the lambda.min parameter, the model achieved an R^2 value of 0.1435921. This indicates that approximately 14.36% of the variability in college completion rate prediction can be explained by the predictor variables included in the model.

# make predictions (lambda.min)

CR_Predictions_Lasso_CV_Test = predict(CR_Model_Lasso_CV, newx = as.matrix(CR_Test[, -ncol(CR_Test)]), s = "lambda.min")

# Calculate R2

CR_R2_Lasso_CV_Test = 1 - (sum((CR_Test$C150_4_L4 - CR_Predictions_Lasso_CV_Test)^2) / sum((CR_Test$C150_4_L4 - mean(CR_Test$C150_4_L4))^2))

kable(CR_R2_Lasso_CV_Test, caption = "5E: LASSO Regression R-squared for College Completion Rate Prediction (Test Data)")
5E: LASSO Regression R-squared for College Completion Rate Prediction (Test Data)
x
0.1435921
# make predictions (lambda.1se)

CR_Predictions_Lasso_CV_Test_Lambda1se = predict(CR_Model_Lasso_CV, newx = as.matrix(CR_Test[, -ncol(CR_Test)]), s = "lambda.1se")

# Calculate R2

CR_R2_Lasso_CV_Test_Lambda1se = 1 - (sum((CR_Test$C150_4_L4 - CR_Predictions_Lasso_CV_Test_Lambda1se)^2) / sum((CR_Test$C150_4_L4 - mean(CR_Test$C150_4_L4))^2))

kable(CR_R2_Lasso_CV_Test_Lambda1se, caption = "5F: LASSO Regression R-squared for College Completion Rate Prediction (Test Data)")
5F: LASSO Regression R-squared for College Completion Rate Prediction (Test Data)
x
0.1083415

Similarly, employing the lambda.1se parameter resulted in an R^2 value of 0.1083415. These findings suggest that the LASSO model demonstrates some predictive ability on the test data but explains a modest portion of the variance in college completion rates. While the model exhibits a slightly higher R^2 value compared to the training data, indicating better performance on the test set, it still suggests that additional factors beyond those considered in the current model may influence completion rates. Therefore, further refinement or exploration of additional predictors may be necessary to enhance the model’s predictive accuracy and capture a more substantial portion of the variability in completion rates.
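The R^2 values reported across sections 4A to 5F can be collected into one table for comparison. The sketch below uses only the values printed in this report; the data frame and column names are introduced here for illustration. The ridge lambda.1se test value is left as NA because the number printed under 4D is identical to the training value in 4B.

```r
# R-squared values as reported in 4A-4D (ridge) and 5C-5F (LASSO)
r2_summary <- data.frame(
  model  = c("Ridge", "Ridge", "LASSO", "LASSO"),
  lambda = c("lambda.min", "lambda.1se", "lambda.min", "lambda.1se"),
  train  = c(0.1240899, 0.1138844, 0.1244594, 0.1056068),
  test   = c(0.1417864, NA, 0.1435921, 0.1083415)
)

# Positive values mean the model did at least as well on unseen data
r2_summary$test_minus_train <- r2_summary$test - r2_summary$train

# Row with the highest test R-squared (NA entries are ignored by which.max)
best <- r2_summary[which.max(r2_summary$test), ]

print(r2_summary)
print(best)
```

On these figures, LASSO at lambda.min has the highest test R^2, although its margin over ridge at lambda.min is less than 0.002.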

Logistic Regression Model (Completion Rate Median)

As previously noted in another analysis, “Completion Rate Median” was chosen as the response variable because it is balanced. A balanced response variable is one whose classes are distributed relatively evenly, so that neither class dominates the other in frequency. In the case of “Completion Rate Median,” the dataset displays a near-equal distribution between its categories: 2,534 instances are classified as “below median” (0) and 2,577 instances are classified as “above median” (1). This balance means that predictive models developed using this response variable are not pushed toward either outcome by class frequencies alone, leading to a more robust analysis.

The frequency table in Table 6 shows this near-equal split. The balance makes classification accuracy easier to interpret, because a model that always predicted the larger class would be correct only about half the time.
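The class balance can be quantified with a few lines of base R. The counts are taken from Table 6; the object names are introduced here for illustration.

```r
# Class counts from Table 6 (0 = below median, 1 = above median)
class_counts <- c("0" = 2534, "1" = 2577)

# Share of each class in the data
class_shares <- class_counts / sum(class_counts)

# Accuracy of a trivial classifier that always predicts the larger class
majority_baseline <- max(class_shares)

print(round(class_shares, 4))
print(round(majority_baseline, 4))
```

The majority-class baseline of roughly 50% is a useful reference point when reading the confusion matrices later in this section.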

A logistic regression analysis was conducted to investigate the factors influencing whether a college’s completion rate is above the median (Completion_Rate_Median). Completion_Rate_Median is the response variable, and the predictor variables include the number of branches, types of degrees awarded, demographics of the student body, and financial aid statistics. Table 7 reports the estimated coefficients, their standard errors, test statistics, p-values, and 95% confidence intervals.

Therefore, the logistic regression model is as follows:

logit(P(Completion_Rate_Median = 1)) = −0.75 − 0.03 * NUMBRANCH + 2.14 * HIGHDEG_Certificate_Degree − 0.60 * HIGHDEG_Associate_Degree − 0.51 * HIGHDEG_Bachelors_Degree − 0.21 * HIGHDEG_Graduate_Degree − 0.36 * UG25ABV + 0.00 * UGDS + 0.29 * UGDS_WHITE − 1.84 * UGDS_BLACK + 0.57 * UGDS_HISP + 5.51 * UGDS_ASIAN − 2.03 * PCTPELL + 2.94 * PCTFLOAN
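The logit equation gives log-odds, which the logistic function converts to a probability. The sketch below works through one case using the rounded Table 7 estimates. The college is hypothetical (a bachelor's-level institution with invented predictor values), and because the UGDS estimate is rounded to 0.00 in Table 7, enrollment contributes nothing in this illustration.

```r
# Rounded estimates from Table 7 (only the HIGHDEG level used below is included)
logit_coefs <- c(
  "(Intercept)" = -0.75, NUMBRANCH = -0.03, HIGHDEG_Bachelors_Degree = -0.51,
  UG25ABV = -0.36, UGDS = 0.00, UGDS_WHITE = 0.29, UGDS_BLACK = -1.84,
  UGDS_HISP = 0.57, UGDS_ASIAN = 5.51, PCTPELL = -2.03, PCTFLOAN = 2.94
)

# Hypothetical college; the intercept term is multiplied by 1
college <- c(
  "(Intercept)" = 1, NUMBRANCH = 1, HIGHDEG_Bachelors_Degree = 1,
  UG25ABV = 0.20, UGDS = 5000, UGDS_WHITE = 0.50, UGDS_BLACK = 0.10,
  UGDS_HISP = 0.20, UGDS_ASIAN = 0.05, PCTPELL = 0.30, PCTFLOAN = 0.40
)

log_odds    <- sum(logit_coefs * college[names(logit_coefs)])  # linear predictor
probability <- plogis(log_odds)                                # 1 / (1 + exp(-log_odds))
predicted_class <- as.integer(probability > 0.5)               # same 0.5 cutoff as the report

print(c(log_odds = log_odds, probability = probability, class = predicted_class))
```

A negative log-odds value corresponds to a probability below 0.5, so this hypothetical college would be classified as below the median.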

Furthermore, this model was evaluated on training and test datasets. The training dataset comprises 80% of the data, 4,088 observations, the total of the counts in Table 8. A confusion matrix was generated to compare the model’s predictions with the actual outcomes for the completion rate median.

# Prep Work

# Subset CollegeDataset2 to the predictors and the response "Completion_Rate_Median"

CRM_Data <- CollegeDataset2 [c("NUMBRANCH",
                               "HIGHDEG",
                               "UG25ABV",
                               "UGDS",
                               "UGDS_WHITE",
                               "UGDS_BLACK",
                               "UGDS_HISP",             
                               "UGDS_ASIAN",
                               "PCTPELL",
                               "PCTFLOAN",
                               "Completion_Rate_Median")]

# ----------------------------------------------------
# Convert y (Completion_Rate_Median) values to 0 and 1
# ----------------------------------------------------

CRM_Data$Completion_Rate_Median <- ifelse(CRM_Data$Completion_Rate_Median 
                                           == "Above Median", 1, 0)

CRM_Freq_Table <- table(CRM_Data$Completion_Rate_Median)

knitr::kable(CRM_Freq_Table, caption = "Table 6: 
            Frequency Table of Completion Rate Median")
Table 6: Frequency Table of Completion Rate Median
Var1 Freq
0 2534
1 2577
# ---------------------------------
# Fit the logistic regression model
# ---------------------------------

# Split the data into training and test sets

set.seed(1996)  # Set seed for reproducibility

# TRAIN

CRM_Train_Indices <- sample(nrow(CRM_Data), nrow(CRM_Data) * 0.8) # 80% for training

CRM_Train_Data <- CRM_Data[CRM_Train_Indices, ]

# dim(CRM_Train_Data): 4088 rows (matches the Table 8 total), 11 columns

# TEST

CRM_Test_Data <- CRM_Data[-CRM_Train_Indices, ] # 20% for testing

# dim(CRM_Test_Data): 1023 rows (matches the Table 9 total), 11 columns

# Fit the logistic regression model on the CRM_Train_Data

CRM_Model <- glm(Completion_Rate_Median ~ NUMBRANCH + HIGHDEG + UG25ABV +
                 UGDS + UGDS_WHITE + UGDS_BLACK + UGDS_HISP + UGDS_ASIAN + 
                 PCTPELL + PCTFLOAN, data = CRM_Train_Data, family = binomial)

# summary(CRM_Model)  # model fitted on the training data only

# Presenting findings in table format 

CRM_Model_Table <- tidy(CRM_Model, conf.int = TRUE)

nice_table(CRM_Model_Table, title = "Table 7:
Logistic Regression Analysis of Factors Influencing College Completion Rate Median")

Table 7:
Logistic Regression Analysis of Factors Influencing College Completion Rate Median

Term                         estimate   std.error   statistic   p            95% CI
(Intercept)                  -0.75      1.00        -0.75       .454         [-2.94, 1.12]
NUMBRANCH                    -0.03      0.02        -2.28       .023*        [-0.06, -0.01]
HIGHDEGCertificate Degree     2.14      0.94         2.27       .023*        [0.39, 4.25]
HIGHDEGAssociate Degree      -0.60      0.94        -0.64       .524         [-2.36, 1.51]
HIGHDEGBachelors Degree      -0.51      0.95        -0.54       .592         [-2.28, 1.61]
HIGHDEGGraduate Degree       -0.21      0.95        -0.22       .822         [-1.98, 1.90]
UG25ABV                      -0.36      0.20        -1.81       .070         [-0.76, 0.03]
UGDS                          0.00      0.00         0.48       .632         [-0.00, 0.00]
UGDS_WHITE                    0.29      0.33         0.87       .384         [-0.35, 0.94]
UGDS_BLACK                   -1.84      0.36        -5.09       < .001***    [-2.54, -1.13]
UGDS_HISP                     0.57      0.35         1.66       .096         [-0.10, 1.26]
UGDS_ASIAN                    5.51      0.78         7.03       < .001***    [4.00, 7.08]
PCTPELL                      -2.03      0.27        -7.47       < .001***    [-2.56, -1.50]
PCTFLOAN                      2.94      0.19        15.46       < .001***    [2.57, 3.32]

# TRAIN

# Make predictions

CRM_Predictions_Logistic = predict(CRM_Model, CRM_Train_Data, type = "response")

# convert the probabilities to 0 and 1

CRM_Predictions_Logistic = ifelse(CRM_Predictions_Logistic > 0.5, 1, 0)

# Print the confusion matrix

CRM_Train_Confusion_Matrix <- table(CRM_Train_Data$Completion_Rate_Median, CRM_Predictions_Logistic)

# print(CRM_Train_Confusion_Matrix)

kable(CRM_Train_Confusion_Matrix, caption = "Table 8: 
      Confusion Matrix for College Completion Rate Median from Train Data")

Table 8: Confusion Matrix for College Completion Rate Median from Train Data

           Predicted 0   Predicted 1
Actual 0          1524           498
Actual 1           470          1596

# TEST

# Make predictions

CRM_Predictions_Logistic_Test = predict(CRM_Model, CRM_Test_Data, type = "response")

# convert the probabilities to 0 and 1

CRM_Predictions_Logistic_Test = ifelse(CRM_Predictions_Logistic_Test > 0.5, 1, 0)

# Print the confusion matrix

CRM_Test_Confusion_Matrix <- table(CRM_Test_Data$Completion_Rate_Median, CRM_Predictions_Logistic_Test)

# print(CRM_Test_Confusion_Matrix)

kable(CRM_Test_Confusion_Matrix, caption = "Table 9: 
      Confusion Matrix for College Completion Rate Median from Test Data")

Table 9: Confusion Matrix for College Completion Rate Median from Test Data

           Predicted 0   Predicted 1
Actual 0           393           119
Actual 1           129           382

In the confusion matrix for the train dataset (Table 8), rows are the actual classes and columns are the predicted classes. The model correctly identified 1,524 low-completion colleges (true negatives) and 1,596 high-completion colleges (true positives). It misclassified 498 low-completion colleges as high (false positives) and 470 high-completion colleges as low (false negatives), for a training accuracy of about 76.3% (3,120 of 4,088).

The test dataset, the held-out 20% of the data (1,023 observations in Table 9), was used to evaluate how well the model generalizes. Here the model correctly identified 393 low-completion colleges (true negatives) and 382 high-completion colleges (true positives), while 119 low-completion colleges were predicted as high (false positives) and 129 high-completion colleges were predicted as low (false negatives). Test accuracy is about 75.8% (775 of 1,023), close to the training figure, which suggests little overfitting.

Overall, the logistic regression model classifies roughly three out of four colleges correctly on both the train and test data. There is still room for improvement, particularly in reducing misclassifications. Further investigation into the misclassified instances, continuous monitoring, and recalibration of the model can enhance its predictive accuracy over time. Considering additional variables or refining existing ones may also improve the model’s performance.
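
The counts in Tables 8 and 9 can be turned into summary rates with a few lines of base R. The sketch below is illustrative only: the helper name classification_rates is hypothetical and not part of the original analysis, and the matrices are typed in by hand from the tables, with rows as actual classes and columns as predicted classes.

```r
# Confusion matrices copied from Tables 8 and 9 (rows = actual, columns = predicted)
train_cm <- matrix(c(1524, 498, 470, 1596), nrow = 2, byrow = TRUE,
                   dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
test_cm  <- matrix(c(393, 119, 129, 382), nrow = 2, byrow = TRUE,
                   dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Hypothetical helper: summary rates from a 2x2 confusion matrix
classification_rates <- function(cm) {
  c(accuracy    = sum(diag(cm)) / sum(cm),
    sensitivity = cm["1", "1"] / sum(cm["1", ]),  # share of actual 1s predicted as 1
    specificity = cm["0", "0"] / sum(cm["0", ]))  # share of actual 0s predicted as 0
}

train_rates <- classification_rates(train_cm)
test_rates  <- classification_rates(test_cm)
print(round(rbind(train = train_rates, test = test_rates), 3))
```

Reporting sensitivity and specificity alongside accuracy shows whether errors fall mainly on one class, which a single accuracy figure would hide.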

Ridge Regression

In the exploration of logistic regression to analyze the factors influencing college completion rates, ridge regression was employed to enhance the model’s predictive performance. Ridge regression introduces regularization, mitigating issues such as multicollinearity and overfitting by adding a penalty on the squared size of the coefficients to the model’s loss function.

In this analysis, two lambda values obtained through cross-validation were used to set the level of regularization: lambda.min (0.01134566), the value that minimizes the cross-validated error, and lambda.1se (0.05516942), the largest value whose error is still within one standard error of that minimum. The second value gives a more heavily regularized model at a small cost in cross-validated fit.
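
As a rough illustration of how these two values relate, the sketch below applies the one-standard-error rule to a made-up cross-validation curve. The vectors lambda, cvm and cvsd are hypothetical numbers chosen for the example (they mimic the kind of per-lambda mean error and standard error that cv.glmnet reports) and are not taken from this analysis.

```r
# Hypothetical cross-validation summary (illustrative numbers, not from this report)
lambda <- c(0.200, 0.100, 0.050, 0.020, 0.010, 0.005)
cvm    <- c(1.320, 1.290, 1.262, 1.255, 1.250, 1.252)  # mean CV error per lambda
cvsd   <- c(0.015, 0.014, 0.013, 0.013, 0.013, 0.013)  # standard error of that mean

# lambda.min: the lambda with the smallest mean CV error
i_min      <- which.min(cvm)
lambda_min <- lambda[i_min]

# lambda.1se: the largest lambda whose error is within one SE of the minimum
threshold  <- cvm[i_min] + cvsd[i_min]
lambda_1se <- max(lambda[cvm <= threshold])

print(c(lambda_min = lambda_min, lambda_1se = lambda_1se))
```

Because lambda.1se is by construction at least as large as lambda.min, its coefficients are shrunk at least as strongly.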

Table 10A presents the ridge regression coefficients at lambda.min. Because the model is a logistic regression, each coefficient is a change in the log-odds that a college falls in the high-completion class (Completion_Rate_Median = 1) per one-unit increase in the predictor, holding the others fixed. It is not a change in the completion rate itself, and it describes an association rather than a causal effect. Exponentiating a coefficient gives an odds ratio. For example:

  • Intercept (-1.3229162): The log-odds of the high-completion class when all predictors are zero, which corresponds to a probability of about 0.21. No college has every predictor at zero, so the intercept has little practical meaning on its own.

  • NUMBRANCH (-0.0411465): Each additional branch is associated with a decrease of about 0.041 in the log-odds (an odds ratio of about 0.96). One possible explanation is that spreading resources across multiple branches weakens support systems.

  • HIGHDEG (0): The coefficient is exactly 0.0000000. HIGHDEG is the highest degree awarded, a categorical variable with several levels in Table 7. Ridge regression shrinks coefficients but does not normally set them to exactly zero, so this value may reflect how the column was encoded in the matrix passed to cv.glmnet rather than a true absence of effect.

  • UG25ABV (0.5633525): A higher share of undergraduates aged 25 and over is associated with higher log-odds of the high-completion class (about 0.563 per unit). Older students may be more motivated or better supported, although the unregularized model in Table 7 estimates the opposite sign (-0.36), so this association should be read with caution.

  • UGDS (-0.0000343): The coefficient is negative but very close to zero, indicating that total undergraduate enrollment has a negligible association with the outcome.

  • UGDS_WHITE (0.8949482): A higher share of White students is associated with higher log-odds (about 0.895 per unit). This may indicate that colleges with a higher proportion of White students have better support systems or resources for completion.

  • UGDS_BLACK (-0.6196046): A higher share of Black students is associated with lower log-odds (about 0.620 per unit). This highlights potential disparities or challenges faced by Black students in completing their degrees.

  • UGDS_HISP (1.1171618): A higher share of Hispanic students is associated with higher log-odds (about 1.117 per unit). Colleges with a larger Hispanic student population may have support systems tailored to their needs.

  • UGDS_ASIAN (4.3124523): This is the largest coefficient in the model: a higher share of Asian students is associated with substantially higher log-odds (about 4.312 per unit). This may reflect strong academic performance among these students or characteristics of the institutions they attend.

  • PCTPELL (-0.7402258): A higher share of Pell Grant recipients is associated with lower log-odds (about 0.740 per unit), consistent with the challenges faced by economically disadvantaged students in completing their degrees.

  • PCTFLOAN (2.2515051): A higher share of students with federal loans is associated with higher log-odds (about 2.252 per unit), which may reflect the role of financial aid in supporting students through completion.

Therefore, the ridge logistic regression model at lambda.min is as follows:

log-odds(Completion_Rate_Median = 1) = -1.3229162 - 0.0411465 * NUMBRANCH + 0.5633525 * UG25ABV - 0.0000343 * UGDS + 0.8949482 * UGDS_WHITE - 0.6196046 * UGDS_BLACK + 1.1171618 * UGDS_HISP + 4.3124523 * UGDS_ASIAN - 0.7402258 * PCTPELL + 2.2515051 * PCTFLOAN
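
Since the fitted equation is on the log-odds scale, a short sketch may help show how it maps to odds ratios and to a predicted probability. The coefficients below are copied from Table 10A; the predictor values for the example college are invented for illustration, and treating the share variables as proportions between 0 and 1 is an assumption.

```r
# Ridge coefficients at lambda.min, copied from Table 10A
ridge_min <- c(
  "(Intercept)" = -1.3229162, NUMBRANCH = -0.0411465, HIGHDEG = 0,
  UG25ABV = 0.5633525, UGDS = -0.0000343, UGDS_WHITE = 0.8949482,
  UGDS_BLACK = -0.6196046, UGDS_HISP = 1.1171618, UGDS_ASIAN = 4.3124523,
  PCTPELL = -0.7402258, PCTFLOAN = 2.2515051
)

# Odds ratios: multiplicative change in the odds per one-unit increase
odds_ratios <- exp(ridge_min[-1])

# Hypothetical college (made-up predictor values, shares assumed to be on a 0-1 scale)
college <- c(NUMBRANCH = 1, HIGHDEG = 0, UG25ABV = 0.30, UGDS = 5000,
             UGDS_WHITE = 0.50, UGDS_BLACK = 0.15, UGDS_HISP = 0.20,
             UGDS_ASIAN = 0.05, PCTPELL = 0.40, PCTFLOAN = 0.50)

# Linear predictor (log-odds), then the logistic transform to a probability
log_odds    <- ridge_min[["(Intercept)"]] + sum(ridge_min[names(college)] * college)
probability <- plogis(log_odds)

print(round(odds_ratios, 3))
print(round(c(log_odds = log_odds, probability = probability), 3))
```

The same 0.5 cutoff used in the report then turns the probability into a predicted class.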

#-------------------------
# Apply Ridge
# Fit the model
# Find the best lambda
#------------------------

# Creating Model

CRM_Model_Ridge_CV_Logistic = cv.glmnet(as.matrix(CRM_Train_Data[, -ncol(CRM_Train_Data)]), CRM_Train_Data$Completion_Rate_Median, alpha = 0, family = "binomial")

# print the best lambda (lambda.min)

# CRM_Model_Ridge_CV_Logistic$lambda.min # -- lambda.min = 0.01134566

# print coefficients for the best lambda

CRM_Coefficients_Ridge <- predict(CRM_Model_Ridge_CV_Logistic, s = "lambda.min", type = "coefficients")

# Convert the sparse matrix to a regular matrix

CRM_Coefficients_Ridge_matrix <- as.matrix(CRM_Coefficients_Ridge)

# Convert the matrix to a data frame
# Convert coefficients to a data frame

CRM_Coefficients_Ridge_df <- as.data.frame(CRM_Coefficients_Ridge_matrix)

# Create a basic table using kable

kable(CRM_Coefficients_Ridge_df, caption = "Table 10A: 
Ridge Regression Coefficients Analyzing Factors Influencing College Completion Rate Median")

Table 10A: Ridge Regression Coefficients Analyzing Factors Influencing College Completion Rate Median

Term          lambda.min
(Intercept)   -1.3229162
NUMBRANCH     -0.0411465
HIGHDEG        0.0000000
UG25ABV        0.5633525
UGDS          -0.0000343
UGDS_WHITE     0.8949482
UGDS_BLACK    -0.6196046
UGDS_HISP      1.1171618
UGDS_ASIAN     4.3124523
PCTPELL       -0.7402258
PCTFLOAN       2.2515051

# print the best lambda (lambda.1se)

# CRM_Model_Ridge_CV_Logistic$lambda.1se # -- lambda.1se = 0.05516942

# print coefficients for the best lambda

CRM_Coefficients_Ridge_Lambda1se <- predict(CRM_Model_Ridge_CV_Logistic, s = "lambda.1se", type = "coefficients")

# Convert the sparse matrix to a regular matrix

CRM_Coefficients_Ridge_Lambda1se_matrix <- as.matrix(CRM_Coefficients_Ridge_Lambda1se)

# Convert the matrix to a data frame
# Convert coefficients to a data frame

CRM_Coefficients_Ridge_Lambda1se_df <- as.data.frame(CRM_Coefficients_Ridge_Lambda1se_matrix)

# Create a basic table using kable

kable(CRM_Coefficients_Ridge_Lambda1se_df, caption = "Table 10B: 
Ridge Regression Coefficients Analyzing Factors Influencing College Completion Rate Median")

Table 10B: Ridge Regression Coefficients Analyzing Factors Influencing College Completion Rate Median

Term          lambda.1se
(Intercept)   -0.8796900
NUMBRANCH     -0.0312556
HIGHDEG        0.0000000
UG25ABV        0.3992692
UGDS          -0.0000275
UGDS_WHITE     0.5035802
UGDS_BLACK    -0.7721856
UGDS_HISP      0.5752159
UGDS_ASIAN     3.0434189
PCTPELL       -0.3536183
PCTFLOAN       1.7197953

Table 10B presents the ridge regression coefficients at lambda.1se. As in any logistic regression, each coefficient is a change in the log-odds of the high-completion class (Completion_Rate_Median = 1) per one-unit increase in the predictor, not a change in the completion rate itself, and it describes an association rather than a causal effect. For example:

  • Intercept (-0.8796900): The log-odds of the high-completion class when all predictors are zero, which corresponds to a probability of about 0.29. This has little practical meaning on its own, since no college has every predictor at zero.

  • NUMBRANCH (-0.0312556): Each additional branch is associated with a decrease of about 0.031 in the log-odds (an odds ratio of about 0.97). Spreading resources across multiple branches might weaken support systems.

  • HIGHDEG (0): The coefficient is exactly 0.0000000. Ridge regression does not normally set coefficients to exactly zero, so this value may reflect how the categorical HIGHDEG column was encoded rather than a true absence of effect.

  • UG25ABV (0.3992692): A higher share of undergraduates aged 25 and over is associated with higher log-odds (about 0.399 per unit). Older students may be more motivated or have better support systems.

  • UGDS (-0.0000275): The coefficient is very close to zero, indicating that total enrollment has a negligible association with the outcome.

  • UGDS_WHITE (0.5035802): A higher share of White students is associated with higher log-odds (about 0.504 per unit). This may indicate that such colleges have better support systems or resources for completion.

  • UGDS_BLACK (-0.7721856): A higher share of Black students is associated with lower log-odds (about 0.772 per unit). This highlights potential disparities or challenges faced by Black students in completing their degrees.

  • UGDS_HISP (0.5752159): A higher share of Hispanic students is associated with higher log-odds (about 0.575 per unit). Colleges with a larger Hispanic student population may have support systems tailored to their needs.

  • UGDS_ASIAN (3.0434189): This remains the largest coefficient: a higher share of Asian students is associated with substantially higher log-odds (about 3.043 per unit). This may reflect strong academic performance among these students or characteristics of the institutions they attend.

  • PCTPELL (-0.3536183): A higher share of Pell Grant recipients is associated with lower log-odds (about 0.354 per unit), consistent with the challenges faced by economically disadvantaged students.

  • PCTFLOAN (1.7197953): A higher share of students with federal loans is associated with higher log-odds (about 1.720 per unit), which may reflect the role of financial aid in supporting students through completion.

Therefore, the ridge logistic regression model at lambda.1se is as follows:

log-odds(Completion_Rate_Median = 1) = -0.8796900 - 0.0312556 * NUMBRANCH + 0.3992692 * UG25ABV - 0.0000275 * UGDS + 0.5035802 * UGDS_WHITE - 0.7721856 * UGDS_BLACK + 0.5752159 * UGDS_HISP + 3.0434189 * UGDS_ASIAN - 0.3536183 * PCTPELL + 1.7197953 * PCTFLOAN
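
HIGHDEG is reported with a coefficient of exactly zero in both ridge tables, which is unusual for ridge regression. One possible cause, which cannot be confirmed from the output shown here, is the way a categorical column behaves when a data frame is passed through as.matrix(). The sketch below uses a small made-up data frame (the object toy and its values are hypothetical) to contrast as.matrix() with model.matrix(), which expands a factor into numeric indicator columns.

```r
# Toy data frame (hypothetical values) with one categorical predictor
toy <- data.frame(
  NUMBRANCH = c(1, 3, 2, 1),
  HIGHDEG   = factor(c("Certificate Degree", "Associate Degree",
                       "Bachelors Degree", "Graduate Degree")),
  PCTPELL   = c(0.55, 0.40, 0.30, 0.20)
)

# as.matrix() on mixed columns yields a character matrix, not a numeric one
x_raw <- as.matrix(toy)

# model.matrix() expands the factor into numeric 0/1 indicator columns;
# [, -1] drops the intercept column, which glmnet adds on its own
x_dummy <- model.matrix(~ NUMBRANCH + HIGHDEG + PCTPELL, data = toy)[, -1]

print(mode(x_raw))
print(colnames(x_dummy))
```

If HIGHDEG was stored as a factor or character column in CRM_Train_Data, building the predictor matrix with model.matrix() would be one way to let the regularized models estimate a separate coefficient per degree level, as the unregularized model in Table 7 does.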

Train Data

In Table 11A (lambda.min), the confusion matrix groups the model’s train-data predictions into four outcomes, with rows as actual classes and columns as predicted classes. True negatives (0,0) are low-completion colleges correctly predicted as low; the model identified 1,272 such cases. False positives (0,1) are low-completion colleges incorrectly predicted as high; there were 750 of these. False negatives (1,0) are high-completion colleges incorrectly predicted as low; there were 678 of these. True positives (1,1) are high-completion colleges correctly predicted as high; the model identified 1,388 such cases. Overall training accuracy is about 65.1% (2,660 of 4,088).

# Make Predictions (lambda.min)

CRM_Predictions_Ridge_Logistic_Train = predict(CRM_Model_Ridge_CV_Logistic, newx = as.matrix(CRM_Train_Data[, -ncol(CRM_Train_Data)]), s = "lambda.min", type = "response")

# Convert the probabilities to 0 and 1

CRM_Predictions_Ridge_Logistic_Train = ifelse(CRM_Predictions_Ridge_Logistic_Train > 0.5, 1, 0)

# Print the confusion matrix

CRM_Train_CM_Ridge <- table(CRM_Train_Data$Completion_Rate_Median, CRM_Predictions_Ridge_Logistic_Train)

# print(CRM_Train_CM_Ridge)

kable(CRM_Train_CM_Ridge, caption = "Table 11A: 
      Ridge Regression Confusion Matrix for College Completion Rate Median (lambda.min)")

Table 11A: Ridge Regression Confusion Matrix for College Completion Rate Median (lambda.min)

           Predicted 0   Predicted 1
Actual 0          1272           750
Actual 1           678          1388

# Make Predictions (lambda.1se)

CRM_Predictions_Ridge_Logistic_Train_lambda.1se = predict(CRM_Model_Ridge_CV_Logistic, newx = as.matrix(CRM_Train_Data[, -ncol(CRM_Train_Data)]), s = "lambda.1se", type = "response")

# Convert the probabilities to 0 and 1

CRM_Predictions_Ridge_Logistic_Train_lambda.1se = ifelse(CRM_Predictions_Ridge_Logistic_Train_lambda.1se > 0.5, 1, 0)

# Print the confusion matrix

CRM_Train_CM_Ridge_lambda.1se <- table(CRM_Train_Data$Completion_Rate_Median, CRM_Predictions_Ridge_Logistic_Train_lambda.1se)

# print(CRM_Train_CM_Ridge_lambda.1se)

kable(CRM_Train_CM_Ridge_lambda.1se, caption = "Table 11B: 
      Ridge Regression Confusion Matrix for College Completion Rate Median (lambda.1se)")

Table 11B: Ridge Regression Confusion Matrix for College Completion Rate Median (lambda.1se)

           Predicted 0   Predicted 1
Actual 0          1271           751
Actual 1           668          1398

Similarly, Table 11B (lambda.1se) shows the model’s train-data performance under stronger regularization, with the same structure as Table 11A. True negatives (0,0) were nearly unchanged at 1,271 (down from 1,272), and false positives (0,1) rose by one to 751. False negatives (1,0) fell from 678 to 668 and true positives (1,1) rose from 1,388 to 1,398, so slightly more high-completion colleges were identified correctly.

Test Data

In the test-data matrix at lambda.min (Table 12A), true negatives (0,0) are low-completion colleges correctly predicted as low; the model identified 328 such cases. False positives (0,1) are low-completion colleges incorrectly predicted as high, accounting for 184 instances. False negatives (1,0) are high-completion colleges incorrectly predicted as low, totaling 169 instances. True positives (1,1) are high-completion colleges correctly predicted as high; the model identified 342 such cases.

# Make Predictions (lambda.min)

CRM_Predictions_Ridge_Logistic_Test = predict(CRM_Model_Ridge_CV_Logistic, newx = as.matrix(CRM_Test_Data[, -ncol(CRM_Test_Data)]), s = "lambda.min", type = "response")

# Convert the probabilities to 0 and 1

CRM_Predictions_Ridge_Logistic_Test = ifelse(CRM_Predictions_Ridge_Logistic_Test > 0.5, 1, 0)

# Print the confusion matrix

CRM_Test_CM_Ridge <- table(CRM_Test_Data$Completion_Rate_Median, CRM_Predictions_Ridge_Logistic_Test)

# print(CRM_Test_CM_Ridge)

kable(CRM_Test_CM_Ridge, caption = "Table 12A: 
      Ridge Regression Confusion Matrix for College Completion Rate Median (lambda.min)")

Table 12A: Ridge Regression Confusion Matrix for College Completion Rate Median (lambda.min)

           Predicted 0   Predicted 1
Actual 0           328           184
Actual 1           169           342

# Make Predictions (lambda.1se)

CRM_Predictions_Ridge_Logistic_Test_lambda.1se = predict(CRM_Model_Ridge_CV_Logistic, newx = as.matrix(CRM_Test_Data[, -ncol(CRM_Test_Data)]), s = "lambda.1se", type = "response")

# Convert the probabilities to 0 and 1

CRM_Predictions_Ridge_Logistic_Test_lambda.1se = ifelse(CRM_Predictions_Ridge_Logistic_Test_lambda.1se > 0.5, 1, 0)

# Print the confusion matrix

CRM_Test_CM_Ridge_lambda.1se <- table(CRM_Test_Data$Completion_Rate_Median, CRM_Predictions_Ridge_Logistic_Test_lambda.1se)

# print(CRM_Test_CM_Ridge_lambda.1se)

kable(CRM_Test_CM_Ridge_lambda.1se, caption = "Table 12B: 
      Ridge Regression Confusion Matrix for College Completion Rate Median (lambda.1se)")

Table 12B: Ridge Regression Confusion Matrix for College Completion Rate Median (lambda.1se)

           Predicted 0   Predicted 1
Actual 0           321           191
Actual 1           169           342

Similarly, Table 12B (lambda.1se) shows the model’s test-data predictions under stronger regularization. True negatives (0,0) fell from 328 to 321 and false positives (0,1) rose from 184 to 191, meaning more low-completion colleges were predicted as high. False negatives (1,0) stayed at 169 and true positives (1,1) stayed at 342, so predictions for high-completion colleges were unchanged.

The differences between the two matrices show how sensitive the model is to the regularization parameter, and here the effect is small. On the test data the ridge model classifies about 65.5% of colleges correctly at lambda.min (670 of 1,023) and about 64.8% at lambda.1se (663 of 1,023), below the 75.8% achieved by the unregularized logistic model in Table 9 (775 of 1,023). Accurate prediction of completion is crucial for decision-making in education and business contexts, so the number of false positives and false negatives marks a clear limitation of the ridge model and an area for improvement.
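
To compare the models on equal terms, test accuracy can be computed directly from the confusion matrices. The sketch below types in the counts from Tables 9, 12A and 12B by hand (rows are actual classes, columns are predicted classes); the list and element names are chosen for this example only.

```r
# Test-set confusion matrices from Tables 9, 12A and 12B (rows = actual, columns = predicted)
test_cms <- list(
  logistic         = matrix(c(393, 119, 129, 382), nrow = 2, byrow = TRUE),
  ridge_lambda_min = matrix(c(328, 184, 169, 342), nrow = 2, byrow = TRUE),
  ridge_lambda_1se = matrix(c(321, 191, 169, 342), nrow = 2, byrow = TRUE)
)

# Accuracy = correctly classified / total, for each model
test_accuracy <- sapply(test_cms, function(cm) sum(diag(cm)) / sum(cm))
print(round(test_accuracy, 3))
```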

LASSO

To analyze the factors influencing the college completion rate median, logistic regression was employed, and LASSO regression was implemented to enhance the model’s predictive capability. LASSO is a regularization technique that mitigates issues like multicollinearity and overfitting by adding a penalty on the absolute size of the coefficients to the model’s loss function.

As with ridge, two lambda values obtained through cross-validation were used to set the degree of regularization: lambda.min (0.0007464686), the value that minimizes the cross-validated error, and lambda.1se (0.008385261), the largest value whose error is within one standard error of that minimum. Together they trade off the model’s complexity against its predictive accuracy.
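
The practical difference between the LASSO and ridge penalties can be seen in a textbook special case: a single coefficient whose unpenalized estimate is z, where both penalized estimates have a closed form. The sketch below is only meant to build intuition. It is not what cv.glmnet computes for the logistic models in this report, and the function names and input values are hypothetical.

```r
# Closed-form penalized estimates for one coefficient, given its unpenalized
# estimate z (textbook special case; not what cv.glmnet computes internally)
lasso_shrink <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)  # soft-thresholding
ridge_shrink <- function(z, lambda) z / (1 + lambda)                    # proportional shrinkage

z      <- c(3.0, -0.5, 1.5)  # hypothetical unpenalized estimates
lambda <- 1

print(lasso_shrink(z, lambda))
print(ridge_shrink(z, lambda))
```

In this simplified setting the LASSO estimate becomes exactly zero once the unpenalized estimate is smaller in magnitude than lambda, while the ridge estimate only moves toward zero without reaching it.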

In Table 13A, the LASSO regression coefficients are reported at lambda.min, the value that minimizes the cross-validated error. Each coefficient is a change in the log-odds of the high-completion class (Completion_Rate_Median = 1) per one-unit increase in the predictor, not a change in the completion rate itself, and it describes an association rather than a causal effect. Key findings include:

  • Intercept (-1.6472394): The log-odds of the high-completion class when all predictors are zero, which corresponds to a probability of about 0.16. This has little practical meaning on its own, since no college has every predictor at zero.

  • NUMBRANCH (-0.0424150): Each additional branch is associated with a decrease of about 0.042 in the log-odds (an odds ratio of about 0.96). Spreading resources across multiple branches might weaken support systems.

  • HIGHDEG (0): The coefficient is zero. LASSO can set coefficients to exactly zero, which would normally mean the variable was dropped from the model. Because ridge (Tables 10A and 10B) also reports exactly zero for HIGHDEG, the value may instead reflect how this categorical column was encoded.

  • UG25ABV (0.6031363): A higher share of undergraduates aged 25 and over is associated with higher log-odds (about 0.603 per unit). Older students may have better support systems or motivation.

  • UGDS (-0.0000365): The coefficient is negative but very close to zero, indicating that total enrollment has a negligible association with the outcome.

  • UGDS_WHITE (1.2263659): A higher share of White students is associated with higher log-odds (about 1.226 per unit). This may indicate that such colleges have better support systems or resources for completion.

  • UGDS_BLACK (-0.3344819): A higher share of Black students is associated with lower log-odds (about 0.334 per unit). This highlights potential disparities or challenges faced by Black students in completing their degrees.

  • UGDS_HISP (1.5149724): A higher share of Hispanic students is associated with higher log-odds (about 1.515 per unit). Colleges with a larger Hispanic student population may have support systems tailored to their needs.

  • UGDS_ASIAN (5.0068296): This is the largest coefficient: a higher share of Asian students is associated with substantially higher log-odds (about 5.007 per unit). This may reflect strong academic performance among these students or characteristics of the institutions they attend.

  • PCTPELL (-0.9134661): A higher share of Pell Grant recipients is associated with lower log-odds (about 0.913 per unit), consistent with the challenges faced by economically disadvantaged students.

  • PCTFLOAN (2.4440316): A higher share of students with federal loans is associated with higher log-odds (about 2.444 per unit), which may reflect the role of financial aid in supporting students through completion.

Therefore, the LASSO logistic regression model with the coefficients at lambda.min is as follows:

log-odds(Completion_Rate_Median = 1) = -1.6472394 - 0.0424150 * NUMBRANCH + 0.6031363 * UG25ABV - 0.0000365 * UGDS + 1.2263659 * UGDS_WHITE - 0.3344819 * UGDS_BLACK + 1.5149724 * UGDS_HISP + 5.0068296 * UGDS_ASIAN - 0.9134661 * PCTPELL + 2.4440316 * PCTFLOAN

# Apply Lasso
# Fit the model
# Find the best lambda

# Create Model

CRM_Model_Lasso_CV_Logistic = cv.glmnet(as.matrix(CRM_Train_Data[, -ncol(CRM_Train_Data)]), CRM_Train_Data$Completion_Rate_Median, alpha = 1, family = "binomial")

# print the best lambda (lambda.min)

# CRM_Model_Lasso_CV_Logistic$lambda.min # lambda.min = 0.0007464686

# print coefficients for the best lambda

CRM_Coefficients_LASSO <- predict(CRM_Model_Lasso_CV_Logistic, s = "lambda.min", type = "coefficients")

# Convert the sparse matrix to a regular matrix

CRM_Coefficients_LASSO_matrix <- as.matrix(CRM_Coefficients_LASSO)

# Convert the matrix to a data frame
# Convert coefficients to a data frame

CRM_Coefficients_LASSO_df <- as.data.frame(CRM_Coefficients_LASSO_matrix)

# Create a basic table using kable

kable(CRM_Coefficients_LASSO_df, caption = "Table 13A: LASSO Regression Coefficients 
      Analyzing Factors Influencing College Completion Rate Median")

Table 13A: LASSO Regression Coefficients Analyzing Factors Influencing College Completion Rate Median

Term          lambda.min
(Intercept)   -1.6472394
NUMBRANCH     -0.0424150
HIGHDEG        0.0000000
UG25ABV        0.6031363
UGDS          -0.0000365
UGDS_WHITE     1.2263659
UGDS_BLACK    -0.3344819
UGDS_HISP      1.5149724
UGDS_ASIAN     5.0068296
PCTPELL       -0.9134661
PCTFLOAN       2.4440316

# print the best lambda (lambda.1se)

# CRM_Model_Lasso_CV_Logistic$lambda.1se # -- lambda.1se = 0.008385261

# print coefficients for the best lambda

CRM_Coefficients_LASSO_lambda.1se <- predict(CRM_Model_Lasso_CV_Logistic, s = "lambda.1se", type = "coefficients")

# Convert the sparse matrix to a regular matrix

CRM_Coefficients_LASSO_lambda.1se_matrix <- as.matrix(CRM_Coefficients_LASSO_lambda.1se)

# Convert the matrix to a data frame
# Convert coefficients to a data frame

CRM_Coefficients_LASSO_lambda.1se_df <- as.data.frame(CRM_Coefficients_LASSO_lambda.1se_matrix)

# Create a basic table using kable

kable(CRM_Coefficients_LASSO_lambda.1se_df, caption = "Table 13B: LASSO Regression Coefficients 
      Analyzing Factors Influencing College Completion Rate Median")

Table 13B: LASSO Regression Coefficients Analyzing Factors Influencing College Completion Rate Median

Term          lambda.1se
(Intercept)   -0.9059768
NUMBRANCH     -0.0296236
HIGHDEG        0.0000000
UG25ABV        0.3132804
UGDS          -0.0000269
UGDS_WHITE     0.3977125
UGDS_BLACK    -0.9236192
UGDS_HISP      0.5292159
UGDS_ASIAN     3.3484793
PCTPELL       -0.4000180
PCTFLOAN       2.0676466

Table 13B presents the LASSO regression coefficients obtained at lambda.1se. As before, each coefficient is a change in the log-odds of the high-completion class (Completion_Rate_Median = 1) per one-unit increase in the predictor, not a change in the completion rate itself, and it describes an association rather than a causal effect. Key findings include:

  • Intercept (-0.9059768): The log-odds of the high-completion class when all predictors are zero, which corresponds to a probability of about 0.29. This has little practical meaning on its own, since no college has every predictor at zero.

  • NUMBRANCH (-0.0296236): Each additional branch is associated with a decrease of about 0.030 in the log-odds (an odds ratio of about 0.97). Spreading resources across multiple branches might weaken support systems.

  • HIGHDEG (0): The coefficient is zero. LASSO can set coefficients to exactly zero, but because ridge (Tables 10A and 10B) also reports exactly zero for HIGHDEG, the value may reflect how this categorical column was encoded rather than variable selection.

  • UG25ABV (0.3132804): A higher share of undergraduates aged 25 and over is associated with higher log-odds (about 0.313 per unit). Older students may have better support systems or motivation.

  • UGDS (-0.0000269): The coefficient is negative but very close to zero, indicating that total enrollment has a negligible association with the outcome.

  • UGDS_WHITE (0.3977125): A higher share of White students is associated with higher log-odds (about 0.398 per unit). This may indicate that such colleges have better support systems or resources for completion.

  • UGDS_BLACK (-0.9236192): A higher share of Black students is associated with lower log-odds (about 0.924 per unit). This highlights potential disparities or challenges faced by Black students in completing their degrees.

  • UGDS_HISP (0.5292159): A higher share of Hispanic students is associated with higher log-odds (about 0.529 per unit). Colleges with a larger Hispanic student population may have support systems tailored to their needs.

  • UGDS_ASIAN (3.3484793): This remains the largest coefficient: a higher share of Asian students is associated with substantially higher log-odds (about 3.348 per unit). This may reflect strong academic performance among these students or characteristics of the institutions they attend.

  • PCTPELL (-0.4000180): A higher share of Pell Grant recipients is associated with lower log-odds (about 0.400 per unit), consistent with the challenges faced by economically disadvantaged students.

  • PCTFLOAN (2.0676466): A higher share of students with federal loans is associated with higher log-odds (about 2.068 per unit), which may reflect the role of financial aid in supporting students through completion.

Therefore, the LASSO logistic regression model with the coefficients at lambda.1se is as follows:

log-odds(Completion_Rate_Median = 1) = -0.9059768 - 0.0296236 * NUMBRANCH + 0.3132804 * UG25ABV - 0.0000269 * UGDS + 0.3977125 * UGDS_WHITE - 0.9236192 * UGDS_BLACK + 0.5292159 * UGDS_HISP + 3.3484793 * UGDS_ASIAN - 0.4000180 * PCTPELL + 2.0676466 * PCTFLOAN
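
One way to summarize how much extra shrinkage lambda.1se applies is to compare the sum of absolute coefficients, the quantity the LASSO penalty acts on, between Tables 13A and 13B. The sketch below copies the coefficients from those tables by hand; the object names are chosen for this example only.

```r
# LASSO coefficients (intercept excluded), copied from Tables 13A and 13B
lasso_min <- c(NUMBRANCH = -0.0424150, HIGHDEG = 0, UG25ABV = 0.6031363,
               UGDS = -0.0000365, UGDS_WHITE = 1.2263659, UGDS_BLACK = -0.3344819,
               UGDS_HISP = 1.5149724, UGDS_ASIAN = 5.0068296,
               PCTPELL = -0.9134661, PCTFLOAN = 2.4440316)
lasso_1se <- c(NUMBRANCH = -0.0296236, HIGHDEG = 0, UG25ABV = 0.3132804,
               UGDS = -0.0000269, UGDS_WHITE = 0.3977125, UGDS_BLACK = -0.9236192,
               UGDS_HISP = 0.5292159, UGDS_ASIAN = 3.3484793,
               PCTPELL = -0.4000180, PCTFLOAN = 2.0676466)

# L1 norm: the sum of absolute coefficients at each lambda
l1_norm <- c(lambda.min = sum(abs(lasso_min)), lambda.1se = sum(abs(lasso_1se)))

# Per-coefficient change in magnitude when moving from lambda.min to lambda.1se
magnitude_change <- abs(lasso_1se) - abs(lasso_min)

print(round(l1_norm, 3))
print(names(magnitude_change)[magnitude_change > 0])
```

The total shrinks at lambda.1se even though an individual coefficient can grow in magnitude, which correlated predictors make possible.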

Train Data

The LASSO model’s performance on the train data was evaluated with confusion matrices, in which rows are actual classes and columns are predicted classes. In Table 14A, at lambda.min, the model correctly classified 1,278 low-completion colleges (true negatives) and 1,387 high-completion colleges (true positives). It predicted high for 744 colleges that were actually low (false positives) and low for 679 colleges that were actually high (false negatives). Errors are therefore spread fairly evenly across the two classes, with an overall training accuracy of about 65.2% (2,665 of 4,088).

# Make Predictions (lambda.min)

CRM_Predictions_Lasso_Log_Train = predict(CRM_Model_Lasso_CV_Logistic, newx = as.matrix(CRM_Train_Data[, -ncol(CRM_Train_Data)]), s = "lambda.min", type = "response")

# convert the probabilities to 0 and 1

CRM_Predictions_Lasso_Log_Train = ifelse(CRM_Predictions_Lasso_Log_Train > 0.5, 1, 0)

# Print the confusion matrix

CRM_Train_CM_LASSO <- table(CRM_Train_Data$Completion_Rate_Median, CRM_Predictions_Lasso_Log_Train)

# print(CRM_Train_CM_LASSO)

kable(CRM_Train_CM_LASSO, caption = "Table 14A: 
      LASSO Regression Confusion Matrix for College Completion Rate Median (lambda.min)")

Table 14A: LASSO Regression Confusion Matrix for College Completion Rate Median (lambda.min)

           Predicted 0   Predicted 1
Actual 0          1278           744
Actual 1           679          1387

# Make Predictions (lambda.1se)

CRM_Predictions_Lasso_Log_Train_lambda.1se = predict(CRM_Model_Lasso_CV_Logistic, newx = as.matrix(CRM_Train_Data[, -ncol(CRM_Train_Data)]), s = "lambda.1se", type = "response")

# convert the probabilities to 0 and 1

CRM_Predictions_Lasso_Log_Train_lambda.1se = ifelse(CRM_Predictions_Lasso_Log_Train_lambda.1se > 0.5, 1, 0)

# Print the confusion matrix

CRM_Train_CM_LASSO_lambda.1se <- table(CRM_Train_Data$Completion_Rate_Median, CRM_Predictions_Lasso_Log_Train_lambda.1se)

# print(CRM_Train_CM_LASSO_lambda.1se)

kable(CRM_Train_CM_LASSO_lambda.1se, caption = "Table 14B: 
      LASSO Regression Confusion Matrix for College Completion Rate Median (lambda.1se)")

Table 14B: LASSO Regression Confusion Matrix for College Completion Rate Median (lambda.1se)

           Predicted 0   Predicted 1
Actual 0          1267           755
Actual 1           661          1405

Similarly, in Table 14B, at lambda.1se, the model correctly classified 1,267 low-completion institutions (true negatives) and 1,405 high-completion institutions (true positives). It predicted high for 755 institutions that were actually low (false positives) and predicted low for 661 that were actually high (false negatives). Compared with lambda.min, the model at lambda.1se identifies slightly more high-completion institutions (1,405 versus 1,387) and makes slightly fewer misclassifications overall (1,416 versus 1,423), at the cost of more false positives.
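
The comparison between the two lambda values can be made concrete by deriving summary metrics from the counts in Tables 14A and 14B. The sketch below is illustrative and not part of the original analysis; the helper name cm_metrics is hypothetical, and it assumes the layout produced by table(actual, predicted), with actual classes in rows and predicted classes in columns.

```r
# Hypothetical helper: summary metrics from a 2x2 confusion matrix laid out
# as table(actual, predicted), i.e. rows = actual class, columns = predicted.
cm_metrics <- function(cm) {
  tn <- cm["0", "0"]; fp <- cm["0", "1"]
  fn <- cm["1", "0"]; tp <- cm["1", "1"]
  c(accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),   # share of actual 1s predicted as 1
    specificity = tn / (tn + fp))   # share of actual 0s predicted as 0
}

# Counts copied from Tables 14A and 14B (train data)
dims   <- list(actual = c("0", "1"), predicted = c("0", "1"))
cm_min <- matrix(c(1278, 744, 679, 1387), nrow = 2, byrow = TRUE, dimnames = dims)
cm_1se <- matrix(c(1267, 755, 661, 1405), nrow = 2, byrow = TRUE, dimnames = dims)

metrics <- round(rbind(lambda.min = cm_metrics(cm_min),
                       lambda.1se = cm_metrics(cm_1se)), 3)
print(metrics)
```

Reading sensitivity and specificity side by side shows whether a change in lambda shifts errors from one class to the other rather than reducing them.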

Test Data

On the test data, the LASSO logistic model was again assessed with confusion matrices (rows are actual classes, columns are predicted classes). In Table 15A, at lambda.min, the model correctly classified 329 institutions as low (0) (true negatives) and 338 as high (1) (true positives). It predicted high for 183 institutions that were actually low (false positives) and predicted low for 173 that were actually high (false negatives). The model again treats the two classes fairly evenly, but it misclassifies 356 of 1,023 test observations (about 35%).

# Make Predictions (lambda.min)

CRM_Predictions_Lasso_Log_Test = predict(CRM_Model_Lasso_CV_Logistic, newx = as.matrix(CRM_Test_Data[, -ncol(CRM_Test_Data)]), s = "lambda.min", type = "response")

# Convert the probabilities to 0 and 1

CRM_Predictions_Lasso_Log_Test = ifelse(CRM_Predictions_Lasso_Log_Test > 0.5, 1, 0)

# Print the confusion matrix

CRM_Test_CM_LASSO <- table(CRM_Test_Data$Completion_Rate_Median, CRM_Predictions_Lasso_Log_Test)

# print(CRM_Test_CM_LASSO)

kable(CRM_Test_CM_LASSO, caption = "Table 15A: 
      LASSO Regression Confusion Matrix for College Completion Rate Medium (lambda.min)")
Table 15A: LASSO Regression Confusion Matrix for College Completion Rate Medium (lambda.min)
             Predicted 0   Predicted 1
Actual 0             329           183
Actual 1             173           338
# Make Predictions (lambda.1se)

CRM_Predictions_Lasso_Log_Test_lambda.1se = predict(CRM_Model_Lasso_CV_Logistic, newx = as.matrix(CRM_Test_Data[, -ncol(CRM_Test_Data)]), s = "lambda.1se", type = "response")

# Convert the probabilities to 0 and 1

CRM_Predictions_Lasso_Log_Test_lambda.1se = ifelse(CRM_Predictions_Lasso_Log_Test_lambda.1se > 0.5, 1, 0)

# Print the confusion matrix

CRM_Test_CM_LASSO_lambda.1se <- table(CRM_Test_Data$Completion_Rate_Median, CRM_Predictions_Lasso_Log_Test_lambda.1se)

# print(CRM_Test_CM_LASSO_lambda.1se)

kable(CRM_Test_CM_LASSO_lambda.1se, caption = "Table 15B: 
      LASSO Regression Confusion Matrix for College Completion Rate Medium (lambda.1se)")
Table 15B: LASSO Regression Confusion Matrix for College Completion Rate Medium (lambda.1se)
             Predicted 0   Predicted 1
Actual 0             319           193
Actual 1             166           345

Similarly, in Table 15B, at lambda.1se, the model correctly classified 319 low-completion institutions (true negatives) and 345 high-completion institutions (true positives). It predicted high for 193 institutions that were actually low (false positives) and predicted low for 166 that were actually high (false negatives). At lambda.1se the model therefore identifies slightly more high-completion institutions (345 versus 338) but makes slightly more misclassifications overall (359 versus 356) than at lambda.min.

Logistic Regression Model (Completion Rate High)

Table 16 shows that the Completion Rate High variable is imbalanced: 795 institutions (about 16%) are categorized as “High” and 4,316 (about 84%) fall into the “Other” category.

Acknowledging this imbalance is crucial for devising effective modeling strategies, because a model can reach high accuracy on such data simply by predicting the majority class. Strategies such as oversampling, undersampling, or algorithms designed for imbalanced data may be employed to address this.
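
One way to make the imbalance actionable is to compare any model against the majority-class baseline and, if reweighting is chosen, to derive inverse-frequency class weights. The sketch below uses only the counts from Table 16; the weighting formula is one common convention and an assumption here, not a step taken in this analysis.

```r
# Class counts from Table 16 (0 = Other, 1 = High)
class_counts <- c("0" = 4316, "1" = 795)

# Accuracy obtained by always predicting the majority class
baseline_accuracy <- max(class_counts) / sum(class_counts)

# Inverse-frequency weights, scaled so that each class carries the same
# total weight. This particular formula is an illustrative assumption.
class_weights <- sum(class_counts) / (length(class_counts) * class_counts)

print(round(baseline_accuracy, 3))
print(round(class_weights, 3))
```

Any classifier for Completion_Rate_High should be judged against this baseline rather than against 50%. Per-row weights built from class_weights could, for example, be supplied through the weights argument of cv.glmnet().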

A logistic regression analysis was performed to explore the determinants of high college completion rates (Completion_Rate_High). The investigation centered on Completion_Rate_High as the response variable, examining a range of predictor variables encompassing the number of branches, types of degrees conferred, student demographic characteristics, and financial aid metrics.

Table 17 reports the fitted model. The intercept estimate of 0.13 is the log-odds of a high completion rate when all numeric predictors are zero and HIGHDEG is at its reference level; its p-value (.900) indicates that it is not statistically significant. The statistically significant predictors (p < .05) are NUMBRANCH, the Associate, Bachelors, and Graduate levels of HIGHDEG, UG25ABV, UGDS, UGDS_BLACK, UGDS_ASIAN, PCTPELL, and PCTFLOAN, while the Certificate level of HIGHDEG, UGDS_WHITE, and UGDS_HISP are not.

Therefore, the logistic regression model is as follows:

logit(P(Completion_Rate_High = 1)) = 0.13 - 0.22 * NUMBRANCH - 0.37 * HIGHDEGCertificate Degree - 2.69 * HIGHDEGAssociate Degree - 1.90 * HIGHDEGBachelors Degree - 2.42 * HIGHDEGGraduate Degree + 0.85 * UG25ABV + 0.00 * UGDS + 0.36 * UGDS_WHITE - 1.38 * UGDS_BLACK - 0.13 * UGDS_HISP + 4.42 * UGDS_ASIAN - 2.10 * PCTPELL + 0.85 * PCTFLOAN
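
Because the coefficients above are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for four of the terms, using the rounded estimates and confidence bounds from Table 17; it is an illustration of the transformation, not additional output from the fitted model.

```r
# Estimates and 95% CI bounds copied from Table 17 (rounded to two decimals)
coefs <- data.frame(
  term     = c("NUMBRANCH", "UGDS_ASIAN", "PCTPELL", "PCTFLOAN"),
  estimate = c(-0.22, 4.42, -2.10, 0.85),
  lower    = c(-0.30, 2.90, -2.75, 0.39),
  upper    = c(-0.15, 6.02, -1.47, 1.31)
)

# Exponentiating a log-odds coefficient (or its CI bound) gives an odds ratio
coefs$odds_ratio <- exp(coefs$estimate)
coefs$or_lower   <- exp(coefs$lower)
coefs$or_upper   <- exp(coefs$upper)

print(coefs[, c("term", "odds_ratio", "or_lower", "or_upper")], digits = 3)
```

For example, the NUMBRANCH odds ratio of about 0.80 means that each additional branch multiplies the odds of a high completion rate by roughly 0.80, holding the other predictors fixed.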

# Prep Work

# Subset data with factors of focus

CRH_Data <- CollegeDataset2[c("NUMBRANCH",
                              "HIGHDEG",
                              "UG25ABV",
                              "UGDS",
                              "UGDS_WHITE",
                              "UGDS_BLACK",
                              "UGDS_HISP",             
                              "UGDS_ASIAN",
                              "PCTPELL",
                              "PCTFLOAN",
                              "Completion_Rate_High")]

# ----------------------------------------------------
# Convert y (Completion_Rate_High) values to 0 and 1
# ----------------------------------------------------

CRH_Data$Completion_Rate_High <- ifelse(CRH_Data$Completion_Rate_High 
                                         == "High", 1, 0)

CRH_Freq_Table <- table(CRH_Data$Completion_Rate_High)

kable(CRH_Freq_Table, caption = "Table 16: 
            Frequency Table of College Completion Rate High")
Table 16: Frequency Table of College Completion Rate High
Completion_Rate_High   Freq
0 (Other)              4316
1 (High)                795
# ---------------------------------
# Fit the logistic regression model
# ---------------------------------

# Split the data into TEST and TRAIN sets 

set.seed(1996)  # Set seed for reproducibility

# TRAIN

CRH_Train_Indices <- sample(nrow(CRH_Data), nrow(CRH_Data) * 0.8) # 80% for training

CRH_Train_Data <- CRH_Data[CRH_Train_Indices, ]

# dim(CRH_Train_Data) 4123 rows and 11 columns

# TEST

CRH_Test_Data <- CRH_Data[-CRH_Train_Indices, ] # 20% for testing

# dim(CRH_Test_Data) 1031 rows and 11 columns 

# Fit the logistic regression model on the CRH_Train_Data

CRH_Model <- glm(Completion_Rate_High ~ NUMBRANCH + HIGHDEG + UG25ABV +
                   UGDS + UGDS_WHITE + UGDS_BLACK + UGDS_HISP + UGDS_ASIAN +
                   PCTPELL + PCTFLOAN, data = CRH_Train_Data, family = binomial)

# summary(CRH_Model)

# Presenting findings in table format 

CRH_Model_Table <- tidy(CRH_Model, conf.int = TRUE)

nice_table(CRH_Model_Table, title = "Table 17:
Logistic Regression Analysis of Factors Influencing College Completion Rate High")

Table 17: Logistic Regression Analysis of Factors Influencing College Completion Rate High

Term                        estimate  std.error  statistic  p           95% CI
(Intercept)                     0.13       1.02       0.13  .900        [-2.07, 2.08]
NUMBRANCH                      -0.22       0.04      -5.60  < .001***   [-0.30, -0.15]
HIGHDEGCertificate Degree      -0.37       0.93      -0.40  .691        [-2.13, 1.69]
HIGHDEGAssociate Degree        -2.69       0.94      -2.87  .004**      [-4.48, -0.61]
HIGHDEGBachelors Degree        -1.90       0.94      -2.02  .043*       [-3.70, 0.18]
HIGHDEGGraduate Degree         -2.42       0.94      -2.58  .010**      [-4.21, -0.34]
UG25ABV                         0.85       0.25       3.38  .001***     [0.36, 1.35]
UGDS                            0.00       0.00       3.58  < .001***   [0.00, 0.00]
UGDS_WHITE                      0.36       0.42       0.86  .390        [-0.44, 1.21]
UGDS_BLACK                     -1.38       0.47      -2.96  .003**      [-2.28, -0.45]
UGDS_HISP                      -0.13       0.44      -0.29  .775        [-0.98, 0.77]
UGDS_ASIAN                      4.42       0.79       5.57  < .001***   [2.90, 6.02]
PCTPELL                        -2.10       0.33      -6.44  < .001***   [-2.75, -1.47]
PCTFLOAN                        0.85       0.23       3.62  < .001***   [0.39, 1.31]

# TRAIN

# Make predictions

CRH_Predictions_Logistic_Train = predict(CRH_Model, CRH_Train_Data, type = "response")

# convert the probabilities to 0 and 1

CRH_Predictions_Logistic_Train = ifelse(CRH_Predictions_Logistic_Train > 0.5, 1, 0)

# Print the confusion matrix

CRH_Train_Confusion_Matrix <- table(CRH_Train_Data$Completion_Rate_High, CRH_Predictions_Logistic_Train)

# print(CRH_Train_Confusion_Matrix)

kable(CRH_Train_Confusion_Matrix, caption = "Table 18: 
      Confusion Matrix for College Completion Rate High from Train Data")
Table 18: Confusion Matrix for College Completion Rate High from Train Data
             Predicted 0   Predicted 1
Actual 0            3406            40
Actual 1             587            55
# TEST

# Make predictions

CRH_Predictions_Logistic_Test = predict(CRH_Model, CRH_Test_Data, type = "response")

# convert the probabilities to 0 and 1

CRH_Predictions_Logistic_Test = ifelse(CRH_Predictions_Logistic_Test > 0.5, 1, 0)

# Print the confusion matrix

CRH_Test_Confusion_Matrix <- table(CRH_Test_Data$Completion_Rate_High, CRH_Predictions_Logistic_Test)

# print(CRH_Test_Confusion_Matrix)

kable(CRH_Test_Confusion_Matrix, caption = "Table 19: 
      Confusion Matrix for College Completion Rate High from Test Data")
Table 19: Confusion Matrix for College Completion Rate High from Test Data
             Predicted 0   Predicted 1
Actual 0             861             9
Actual 1             141            12

This model was evaluated on both the train and test datasets. The training dataset comprises 80% of the data (4,123 observations), and a confusion matrix was generated to compare the model’s predictions with the actual Completion_Rate_High outcomes.

In the confusion matrix for the train dataset (Table 18, which totals 4,088 classified observations), the model correctly identified 3,406 institutions without a high completion rate (true negatives) and 55 institutions with a high completion rate (true positives). It predicted a high completion rate for 40 institutions that did not have one (false positives) and missed 587 institutions that did (false negatives).

Similarly, the test dataset, representing 20% of the data (1,031 observations), was used to assess generalization. In the confusion matrix for the test dataset (Table 19, which totals 1,023 classified observations), the model correctly identified 861 true negatives and 12 true positives, with 9 false positives and 141 false negatives.

Overall, the logistic regression model’s accuracy is high (about 85% on both the train and test data) mainly because it predicts the majority class for almost every institution. It identifies only 55 of 642 high-completion institutions in the training data (about 9%) and 12 of 153 in the test data (about 8%). There is therefore substantial room for improvement, particularly in reducing false negatives. Further exploration of misclassified instances, a classification threshold other than 0.5, and additional or refined predictor variables may improve the model’s ability to identify institutions with high completion rates.
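
The effect of the 0.5 cut-off on a rare class can be illustrated with simulated data. The sketch below does not use the college dataset: the data, the single predictor, and the helper rates_at are invented for illustration, and the alternative cut-off (the observed prevalence) is only one of several reasonable choices.

```r
set.seed(1)

# Simulated stand-in data (not the college dataset): one predictor and a
# rare positive class, loosely mimicking the imbalance in Table 16.
n <- 2000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-2 + 1 * x))

fit  <- glm(y ~ x, family = binomial)
prob <- predict(fit, type = "response")

# Hypothetical helper: sensitivity and specificity at a probability cut-off
rates_at <- function(cutoff) {
  pred <- as.integer(prob > cutoff)
  c(cutoff      = cutoff,
    sensitivity = sum(pred == 1 & y == 1) / sum(y == 1),
    specificity = sum(pred == 0 & y == 0) / sum(y == 0))
}

# Compare the default 0.5 cut-off with one set at the observed prevalence
threshold_table <- rbind(rates_at(0.5), rates_at(mean(y)))
print(round(threshold_table, 3))
```

Lowering the cut-off trades specificity for sensitivity, so the choice depends on whether missing a high-completion institution or wrongly flagging one is the costlier error.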

Ridge Regression

To extend the logistic regression analysis of high college completion rates (Completion_Rate_High), ridge regression was applied with the aim of improving the model’s predictive performance. Ridge regression is a regularization technique that addresses multicollinearity and overfitting by adding a penalty on the size of the coefficients.

Two lambda values, lambda.min and lambda.1se, were used to set the level of regularization. These values, lambda.min (0.004535024) and lambda.1se (0.1555726) as recorded alongside the model code, were determined via cross-validation to balance model complexity against predictive accuracy.
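
The difference between the two values can be shown with a small worked example. lambda.min is the lambda with the lowest mean cross-validated error, and lambda.1se is the largest lambda whose error is within one standard error of that minimum. The numbers below are hypothetical and chosen only to illustrate the rule; they are not taken from the fitted models.

```r
# Hypothetical cross-validation results (illustrative numbers only)
lambda  <- c(0.001, 0.005, 0.01, 0.05, 0.1, 0.5)
cv_mean <- c(0.792, 0.780, 0.782, 0.786, 0.795, 0.830)  # mean CV error
cv_se   <- c(0.010, 0.010, 0.010, 0.010, 0.010, 0.010)  # its standard error

# lambda.min: the lambda with the lowest mean CV error
i_min      <- which.min(cv_mean)
lambda_min <- lambda[i_min]

# lambda.1se: the largest lambda whose mean CV error is within one
# standard error of the minimum
within_1se <- cv_mean <= cv_mean[i_min] + cv_se[i_min]
lambda_1se <- max(lambda[within_1se])

print(c(lambda.min = lambda_min, lambda.1se = lambda_1se))
```

Because lambda.1se is never smaller than lambda.min, it always applies at least as much shrinkage, which is why the coefficients at lambda.1se are generally closer to zero.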

Table 20A presents the ridge regression coefficients at lambda.min, revealing the impact of predictor variables on college completion rates high:

  • Intercept (-2.7698630): The intercept is the log-odds of a high completion rate when all predictors are zero, which corresponds to a probability of about 5.9%. It is not a completion rate of -276.99%, and it has little practical meaning on its own because an institution with every predictor at zero is not realistic.

  • NUMBRANCH (-0.1688791): Each additional branch lowers the log-odds of a high completion rate by about 0.169 (an odds ratio of about 0.84). One possible explanation is that spreading resources across multiple branches weakens support systems.

  • HIGHDEG (0): The coefficient is exactly 0, so the highest degree awarded contributes nothing to this model’s predictions. Because ridge regression normally shrinks coefficients toward zero without setting them to exactly zero, this result may reflect how the categorical HIGHDEG column was encoded rather than a genuine absence of effect.

  • UG25ABV (1.7869716): A higher share of undergraduates aged 25 and above is associated with higher log-odds of a high completion rate, by about 1.787 per unit increase (that is, moving the share from 0 to 1). Older students may have better support systems or motivation.

  • UGDS (-0.0000173): Total undergraduate enrollment has a negligible effect; each additional student changes the log-odds by only -0.0000173.

  • UGDS_WHITE (1.3168060): A higher share of White students is associated with higher log-odds of a high completion rate, by about 1.317 per unit. This may indicate that colleges with a higher proportion of White students have better support systems or resources.

  • UGDS_BLACK (-0.2853582): Conversely, a higher share of Black students is associated with lower log-odds, by about 0.285 per unit. This points to potential disparities or challenges faced by Black students in completing their degrees.

  • UGDS_HISP (0.8180263): A higher share of Hispanic students is associated with higher log-odds, by about 0.818 per unit. Colleges with a larger Hispanic student population may have support systems tailored to their needs.

  • UGDS_ASIAN (4.3771100): This is the largest coefficient in the model. A higher share of Asian students is associated with higher log-odds of a high completion rate, by about 4.377 per unit, which may reflect the strong academic performance of Asian students.

  • PCTPELL (-1.0062817): A higher share of Pell Grant recipients is associated with lower log-odds, by about 1.006 per unit. This points to the challenges faced by economically disadvantaged students in completing their degrees.

  • PCTFLOAN (0.5674224): Conversely, a higher share of students with federal loans is associated with higher log-odds, by about 0.567 per unit. This may reflect the role of financial aid in supporting students through completion.

Therefore, the ridge regression model (lambda.min) is as follows:

logit(P(Completion_Rate_High = 1)) = -2.7698630 - 0.1688791 * NUMBRANCH + 1.7869716 * UG25ABV - 0.0000173 * UGDS + 1.3168060 * UGDS_WHITE - 0.2853582 * UGDS_BLACK + 0.8180263 * UGDS_HISP + 4.3771100 * UGDS_ASIAN - 1.0062817 * PCTPELL + 0.5674224 * PCTFLOAN
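
Since the model is logistic, the right-hand side of this equation is a log-odds value that must pass through the inverse logit to become a probability. The sketch below applies the Table 20A coefficients to a hypothetical institution; the predictor values in x_new are invented for illustration and assume that the share variables are stored as proportions between 0 and 1.

```r
# Ridge coefficients at lambda.min, copied from Table 20A
ridge_coef <- c("(Intercept)" = -2.7698630, NUMBRANCH = -0.1688791,
                HIGHDEG = 0, UG25ABV = 1.7869716, UGDS = -0.0000173,
                UGDS_WHITE = 1.3168060, UGDS_BLACK = -0.2853582,
                UGDS_HISP = 0.8180263, UGDS_ASIAN = 4.3771100,
                PCTPELL = -1.0062817, PCTFLOAN = 0.5674224)

# Hypothetical institution (values invented for illustration).
# HIGHDEG has a zero coefficient, so its value does not affect the result.
x_new <- c("(Intercept)" = 1, NUMBRANCH = 1, HIGHDEG = 0, UG25ABV = 0.20,
           UGDS = 5000, UGDS_WHITE = 0.50, UGDS_BLACK = 0.10,
           UGDS_HISP = 0.15, UGDS_ASIAN = 0.10, PCTPELL = 0.30,
           PCTFLOAN = 0.40)

log_odds    <- sum(ridge_coef * x_new[names(ridge_coef)])
probability <- plogis(log_odds)   # inverse logit: 1 / (1 + exp(-log_odds))

# Probability implied by the intercept alone (all predictors at zero)
baseline_probability <- plogis(ridge_coef[["(Intercept)"]])

print(c(log_odds = log_odds, probability = probability,
        baseline_probability = baseline_probability))
```

The intercept alone corresponds to a probability of roughly 0.059, which is why it should be read as a log-odds value and not as a completion rate.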

#-------------------------
# Apply Ridge
# Fit the model
# Find the best lambda
#------------------------

# Creating Model

CRH_Model_Ridge_CV_Logistic = cv.glmnet(as.matrix(CRH_Train_Data[, -ncol(CRH_Train_Data)]), CRH_Train_Data$Completion_Rate_High, alpha = 0, family = "binomial")

# print the best lambda (lambda.min)

# CRH_Model_Ridge_CV_Logistic$lambda.min # lambda.min = 0.004535024

# print coefficients for the best lambda

CRH_Coefficients_Ridge <- predict(CRH_Model_Ridge_CV_Logistic, s = "lambda.min", type = "coefficients")

# Convert the sparse matrix to a regular matrix

CRH_Coefficients_Ridge_matrix <- as.matrix(CRH_Coefficients_Ridge)

# Convert the matrix to a data frame
# Convert coefficients to a data frame

CRH_Coefficients_Ridge_df <- as.data.frame(CRH_Coefficients_Ridge_matrix)

# Create a basic table using kable

kable(CRH_Coefficients_Ridge_df, caption = "Table 20A: 
Ridge Regression Coefficients Analyzing Factors Influencing College Completion Rate High")
Table 20A: Ridge Regression Coefficients Analyzing Factors Influencing College Completion Rate High
lambda.min
(Intercept) -2.7698630
NUMBRANCH -0.1688791
HIGHDEG 0.0000000
UG25ABV 1.7869716
UGDS -0.0000173
UGDS_WHITE 1.3168060
UGDS_BLACK -0.2853582
UGDS_HISP 0.8180263
UGDS_ASIAN 4.3771100
PCTPELL -1.0062817
PCTFLOAN 0.5674224
# print the best lambda (lambda.1se)

# CRH_Model_Ridge_CV_Logistic$lambda.1se #-- lambda.1se =  0.1555726

# print coefficients for the best lambda

CRH_Coefficients_Ridge_lambda.1se <- predict(CRH_Model_Ridge_CV_Logistic, s = "lambda.1se", type = "coefficients")

# Convert the sparse matrix to a regular matrix

CRH_Coefficients_Ridge_matrix_lambda.1se <- as.matrix(CRH_Coefficients_Ridge_lambda.1se)

# Convert the matrix to a data frame
# Convert coefficients to a data frame

CRH_Coefficients_Ridge_lambda.1se_df <- as.data.frame(CRH_Coefficients_Ridge_matrix_lambda.1se)

# Create a basic table using kable

kable(CRH_Coefficients_Ridge_lambda.1se_df, caption = "Table 20B: 
Ridge Regression Coefficients Analyzing Factors Influencing College Completion Rate High")
Table 20B: Ridge Regression Coefficients Analyzing Factors Influencing College Completion Rate High
lambda.1se
(Intercept) -1.8601314
NUMBRANCH -0.0473951
HIGHDEG 0.0000000
UG25ABV 0.6482804
UGDS -0.0000085
UGDS_WHITE 0.2874011
UGDS_BLACK -0.3799900
UGDS_HISP 0.0022770
UGDS_ASIAN 1.9933289
PCTPELL -0.3080749
PCTFLOAN 0.0881586

In Table 20B, which presents the ridge regression coefficients at lambda.1se, valuable insights into the factors influencing college completion rates, particularly focusing on the Completion Rate High category, are revealed. Key findings include:

  • Intercept (-1.8601314): The intercept is the log-odds of a high completion rate when all predictors are zero, which corresponds to a probability of about 13.5%. It is not a completion rate of -186.01%, and it has little practical meaning on its own.

  • NUMBRANCH (-0.0473951): Each additional branch lowers the log-odds of a high completion rate by about 0.047 (an odds ratio of about 0.95). One possible explanation is that spreading resources across multiple branches weakens support systems.

  • HIGHDEG (0): The coefficient is again exactly 0, so the highest degree awarded contributes nothing to this model’s predictions. As at lambda.min, an exact zero is unusual for ridge regression and may reflect how the categorical HIGHDEG column was encoded.

  • UG25ABV (0.6482804): A higher share of undergraduates aged 25 and above is associated with higher log-odds of a high completion rate, by about 0.648 per unit. Older students may be more motivated or have better support systems.

  • UGDS (-0.0000085): Total undergraduate enrollment has a negligible effect; each additional student changes the log-odds by only -0.0000085.

  • UGDS_WHITE (0.2874011): A higher share of White students is associated with higher log-odds, by about 0.287 per unit. This may indicate that colleges with a higher proportion of White students have better support systems or resources.

  • UGDS_BLACK (-0.3799900): Conversely, a higher share of Black students is associated with lower log-odds, by about 0.380 per unit. This points to potential disparities or challenges faced by Black students in completing their degrees.

  • UGDS_HISP (0.0022770): The share of Hispanic students has a minimal effect at this level of regularization, raising the log-odds by only about 0.002 per unit.

  • UGDS_ASIAN (1.9933289): This remains the largest coefficient in the model. A higher share of Asian students is associated with higher log-odds of a high completion rate, by about 1.993 per unit, which may reflect the strong academic performance of Asian students.

  • PCTPELL (-0.3080749): A higher share of Pell Grant recipients is associated with lower log-odds, by about 0.308 per unit. This points to the challenges faced by economically disadvantaged students in completing their degrees.

  • PCTFLOAN (0.0881586): Conversely, a higher share of students with federal loans is associated with higher log-odds, by about 0.088 per unit. This may reflect the role of financial aid in supporting students through completion.

Therefore, the ridge regression model (lambda.1se) is as follows:

logit(P(Completion_Rate_High = 1)) = -1.8601314 - 0.0473951 * NUMBRANCH + 0.6482804 * UG25ABV - 0.0000085 * UGDS + 0.2874011 * UGDS_WHITE - 0.3799900 * UGDS_BLACK + 0.0022770 * UGDS_HISP + 1.9933289 * UGDS_ASIAN - 0.3080749 * PCTPELL + 0.0881586 * PCTFLOAN
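
Tables 20A and 20B both report HIGHDEG as exactly 0, while Table 17 shows the same variable entering glm() as a categorical predictor with labelled levels. One thing worth checking is how such a column is turned into a matrix before it reaches cv.glmnet(). The sketch below uses a small invented data frame (not the college data) to contrast as.matrix() with model.matrix(); it does not claim to reproduce what happened in this analysis.

```r
# Toy data frame standing in for the predictors (values are invented)
toy <- data.frame(
  NUMBRANCH = c(1, 3, 2, 1),
  HIGHDEG   = factor(c("Associate Degree", "Bachelors Degree",
                       "Graduate Degree", "Bachelors Degree")),
  PCTPELL   = c(0.45, 0.30, 0.20, 0.35)
)

# as.matrix() on mixed column types falls back to a character matrix
x_char <- as.matrix(toy)
print(is.character(x_char))

# model.matrix() expands the factor into numeric dummy columns; the
# intercept column is dropped because glmnet fits its own intercept
x_num <- model.matrix(~ ., data = toy)[, -1]
print(is.numeric(x_num))
print(colnames(x_num))
```

A numeric matrix built this way could be passed as the x argument to cv.glmnet(); whether doing so changes the HIGHDEG result would need to be checked on the actual data.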

Train Data

On the train dataset, the ridge model’s classification performance for Completion_Rate_High is shown in Tables 21A and 21B (rows are actual classes, columns are predicted classes). In Table 21A, at lambda.min, the model correctly identified 3,435 institutions without a high completion rate (true negatives) and 19 institutions with a high completion rate (true positives). It predicted a high completion rate for 11 institutions that did not have one (false positives) and missed 623 institutions that did (false negatives).

Table 21B displays the confusion matrix for lambda.1se. Here, the model correctly identified 3,445 true negatives and only 2 true positives, with 1 false positive and 640 false negatives. The stronger regularization makes the model predict the “High” class even less often.

# Make Predictions (lambda.min)

CRH_Predictions_Ridge_Logistic_Train = predict(CRH_Model_Ridge_CV_Logistic, newx = as.matrix(CRH_Train_Data[, -ncol(CRH_Train_Data)]), s = "lambda.min", type = "response")

# Convert the probabilities to 0 and 1

CRH_Predictions_Ridge_Logistic_Train = ifelse(CRH_Predictions_Ridge_Logistic_Train > 0.5, 1, 0)

# Print the confusion matrix

CRH_Train_CM_Ridge <- table(CRH_Train_Data$Completion_Rate_High, CRH_Predictions_Ridge_Logistic_Train)

# print(CRH_Train_CM_Ridge)

kable(CRH_Train_CM_Ridge, caption = "Table 21A: 
      Ridge Regression Confusion Matrix for College Completion Rate High (lambda.min)")
Table 21A: Ridge Regression Confusion Matrix for College Completion Rate High (lambda.min)
             Predicted 0   Predicted 1
Actual 0            3435            11
Actual 1             623            19
# Make Predictions (lambda.1se)

CRH_Predictions_Ridge_Logistic_Train_lambda.1se = predict(CRH_Model_Ridge_CV_Logistic, newx = as.matrix(CRH_Train_Data[, -ncol(CRH_Train_Data)]), s = "lambda.1se", type = "response")

# Convert the probabilities to 0 and 1

CRH_Predictions_Ridge_Logistic_Train_lambda.1se = ifelse(CRH_Predictions_Ridge_Logistic_Train_lambda.1se > 0.5, 1, 0)

# Print the confusion matrix

CRH_Train_CM_Ridge_lambda.1se <- table(CRH_Train_Data$Completion_Rate_High, CRH_Predictions_Ridge_Logistic_Train_lambda.1se)

# print(CRH_Train_CM_Ridge_lambda.1se)

kable(CRH_Train_CM_Ridge_lambda.1se, caption = "Table 21B: 
      Ridge Regression Confusion Matrix for College Completion Rate High (lambda.1se)")
Table 21B: Ridge Regression Confusion Matrix for College Completion Rate High (lambda.1se)
             Predicted 0   Predicted 1
Actual 0            3445             1
Actual 1             640             2

True positives and true negatives are the institutions the model classified correctly as having, or not having, a high completion rate. False positives are institutions predicted to have a high completion rate that did not, and false negatives are institutions predicted not to have a high completion rate that did.

These findings show that the ridge model classifies institutions without a high completion rate almost perfectly but rarely detects high completion: it identifies 19 of 642 high-completion institutions at lambda.min (about 3%) and 2 of 642 at lambda.1se (under 1%). The misclassifications are almost entirely false negatives. Improving the model’s ability to detect the “High” class is crucial for it to support decisions related to college completion rates.

Test Data

On the test dataset, the ridge model’s classification performance for Completion_Rate_High is shown in Tables 22A and 22B. In Table 22A, at lambda.min, the model correctly identified 865 institutions without a high completion rate (true negatives) and 4 institutions with a high completion rate (true positives). It produced 5 false positives and missed 149 institutions that did have a high completion rate (false negatives).

Table 22B presents the confusion matrix for lambda.1se. The table has a single column because the model predicted class 0 for every test observation. It therefore has 870 true negatives and 153 false negatives, with no true positives and no false positives: none of the 153 high-completion institutions in the test data were identified.

# Make Predictions (lambda.min)

CRH_Predictions_Ridge_Logistic_Test = predict(CRH_Model_Ridge_CV_Logistic, newx = as.matrix(CRH_Test_Data[, -ncol(CRH_Test_Data)]), s = "lambda.min", type = "response")

# Convert the probabilities to 0 and 1

CRH_Predictions_Ridge_Logistic_Test = ifelse(CRH_Predictions_Ridge_Logistic_Test > 0.5, 1, 0)

# Print the confusion matrix

CRH_Test_CM_Ridge <- table(CRH_Test_Data$Completion_Rate_High, CRH_Predictions_Ridge_Logistic_Test)

# print(CRH_Test_CM_Ridge)

kable(CRH_Test_CM_Ridge, caption = "Table 22A: 
      Ridge Regression Confusion Matrix for College Completion Rate High (lambda.min)")
Table 22A: Ridge Regression Confusion Matrix for College Completion Rate High (lambda.min)
             Predicted 0   Predicted 1
Actual 0             865             5
Actual 1             149             4
# Make Predictions (lambda.1se)

CRH_Predictions_Ridge_Logistic_Test_lambda.1se = predict(CRH_Model_Ridge_CV_Logistic, newx = as.matrix(CRH_Test_Data[, -ncol(CRH_Test_Data)]), s = "lambda.1se", type = "response")

# Convert the probabilities to 0 and 1

CRH_Predictions_Ridge_Logistic_Test_lambda.1se = ifelse(CRH_Predictions_Ridge_Logistic_Test_lambda.1se > 0.5, 1, 0)

# Print the confusion matrix

CRH_Test_CM_Ridge_lambda.1se <- table(CRH_Test_Data$Completion_Rate_High, CRH_Predictions_Ridge_Logistic_Test_lambda.1se)

# print(CRH_Test_CM_Ridge_lambda.1se)

kable(CRH_Test_CM_Ridge_lambda.1se, caption = "Table 22B: 
      Ridge Regression Confusion Matrix for College Completion Rate High (lambda.1se)")
Table 22B: Ridge Regression Confusion Matrix for College Completion Rate High (lambda.1se)
             Predicted 0
Actual 0             870
Actual 1             153

As with the train data, true positives and true negatives are correct classifications, false positives are institutions wrongly predicted to have a high completion rate, and false negatives are institutions with a high completion rate that the model missed.

These findings show that, on the test dataset, the model achieves high accuracy (about 85%) only by assigning nearly every institution to the majority class. Its difficulty lies in identifying high-completion institutions, as evidenced by the 149 false negatives at lambda.min and 153 at lambda.1se. Addressing these misclassifications is essential for improving the model’s reliability in predicting high college completion rates.
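
Table 22B lost its second column because table() only creates columns for values that actually occur in the predictions. A minimal sketch of one way to keep the full 2 x 2 layout is to convert both vectors to factors with fixed levels before tabulating; the vectors below are toy values, not the model’s output.

```r
# Toy vectors: the classifier predicts 0 for every observation
actual    <- c(0, 0, 0, 1, 1)
predicted <- c(0, 0, 0, 0, 0)

# Without fixed levels, the unobserved predicted class has no column
cm_collapsed <- table(actual, predicted)
print(dim(cm_collapsed))

# Fixing the levels keeps both classes as rows and columns
cm_full <- table(actual    = factor(actual,    levels = c(0, 1)),
                 predicted = factor(predicted, levels = c(0, 1)))
print(cm_full)
```

With the levels fixed, the empty “predicted 1” column appears as zeros, which makes the absence of true positives visible at a glance.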

LASSO

To analyze factors influencing high college completion rates, LASSO was then applied to the logistic regression model. Two lambda values, lambda.min (0.0006280493) and lambda.1se (0.01788707), were selected through cross-validation to balance the model’s complexity against its predictive accuracy.

In Table 23A, the LASSO regression coefficients provide crucial insights into the factors influencing college completion rates high. These coefficients are derived at the lambda.min value, which represents the optimal level of regularization for the model. Key findings include:

  • Intercept (-3.0166681): The intercept is the log-odds of a high completion rate when all predictors are zero, which corresponds to a probability of about 4.7%. It is not a completion rate of -301.67%.

  • NUMBRANCH (-0.1853803): Each additional branch lowers the log-odds of a high completion rate by about 0.185 (an odds ratio of about 0.83). One possible explanation is that spreading resources across multiple branches weakens support systems.

  • HIGHDEG (0): The coefficient is exactly 0, meaning that HIGHDEG (highest degree awarded) does not contribute to this model’s predictions at this lambda.

  • UG25ABV (1.8535113): A higher share of undergraduates aged 25 and above is associated with higher log-odds of a high completion rate, by about 1.854 per unit increase (that is, moving the share from 0 to 1). Older students may have better support systems or motivation.

  • UGDS (-0.0000166): The coefficient is negative but very close to zero, indicating that total undergraduate enrollment has a negligible effect.

  • UGDS_WHITE (1.6001851): A higher share of White students is associated with higher log-odds, by about 1.600 per unit. This may indicate that colleges with a higher proportion of White students have better support systems or resources.

  • UGDS_BLACK (-0.0180945): A higher share of Black students is associated with slightly lower log-odds, by about 0.018 per unit, a much smaller effect than in the ridge models.

  • UGDS_HISP (1.0913299): A higher share of Hispanic students is associated with higher log-odds, by about 1.091 per unit. Colleges with a larger Hispanic student population may have support systems tailored to their needs.

  • UGDS_ASIAN (4.7485007): This is the largest coefficient in the model. A higher share of Asian students is associated with higher log-odds of a high completion rate, by about 4.749 per unit, which may reflect the strong academic performance of Asian students.

  • PCTPELL (-1.0592512): A higher share of Pell Grant recipients is associated with lower log-odds, by about 1.059 per unit. This points to the challenges faced by economically disadvantaged students in completing their degrees.

  • PCTFLOAN (0.5948730): Conversely, a higher share of students with federal loans is associated with higher log-odds, by about 0.595 per unit. This may reflect the role of financial aid in supporting students through completion.

The LASSO regression model with the coefficients at lambda.min is as follows:

logit(P(Completion_Rate_High = 1)) = -3.0166681 - 0.1853803 * NUMBRANCH + 1.8535113 * UG25ABV - 0.0000166 * UGDS + 1.6001851 * UGDS_WHITE - 0.0180945 * UGDS_BLACK + 1.0913299 * UGDS_HISP + 4.7485007 * UGDS_ASIAN - 1.0592512 * PCTPELL + 0.5948730 * PCTFLOAN
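
Because the model is logistic, the equation returns log-odds rather than a probability. The following is a minimal sketch, in base R, of how the Table 23A coefficients could be converted into a predicted probability and into odds ratios. The example college's predictor values are hypothetical and chosen only for illustration, and the shares are assumed to be stored as proportions between 0 and 1.

```r
# Coefficients copied from Table 23A (lambda.min); HIGHDEG is 0 and is omitted.
coef_min <- c(
  "(Intercept)" = -3.0166681,
  NUMBRANCH     = -0.1853803,
  UG25ABV       =  1.8535113,
  UGDS          = -0.0000166,
  UGDS_WHITE    =  1.6001851,
  UGDS_BLACK    = -0.0180945,
  UGDS_HISP     =  1.0913299,
  UGDS_ASIAN    =  4.7485007,
  PCTPELL       = -1.0592512,
  PCTFLOAN      =  0.5948730
)

# Hypothetical college (illustrative values, not taken from the dataset).
college <- c(
  "(Intercept)" = 1,
  NUMBRANCH = 1, UG25ABV = 0.20, UGDS = 5000,
  UGDS_WHITE = 0.50, UGDS_BLACK = 0.10, UGDS_HISP = 0.15, UGDS_ASIAN = 0.10,
  PCTPELL = 0.35, PCTFLOAN = 0.40
)

# Linear predictor (log-odds), then the inverse logit to get a probability.
log_odds    <- sum(coef_min * college[names(coef_min)])
probability <- plogis(log_odds)   # 1 / (1 + exp(-log_odds))

# Odds ratios: multiplicative change in the odds per one-unit increase.
odds_ratios <- exp(coef_min[-1])

print(round(c(log_odds = log_odds, probability = probability), 3))
print(round(odds_ratios, 3))
```

An odds ratio below 1 (for example NUMBRANCH) lowers the odds of a high completion rate, while a value above 1 raises them.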

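# Fit a LASSO-penalized logistic regression (alpha = 1, family = "binomial").
# cv.glmnet() cross-validates over a grid of lambda values; every column of
# CRH_Train_Data except the last is used as a predictor.

CRH_Model_Lasso_CV_Logistic = cv.glmnet(as.matrix(CRH_Train_Data[, -ncol(CRH_Train_Data)]), CRH_Train_Data$Completion_Rate_High, alpha = 1, family = "binomial")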

# print the best lambda (lambda.min)

# CRH_Model_Lasso_CV_Logistic$lambda.min # -- lambda.min = 0.0006280493

# print coefficients for the best lambda

CRH_Coefficients_LASSO <- predict(CRH_Model_Lasso_CV_Logistic, s = "lambda.min", type = "coefficients")

# Convert the sparse matrix to a regular matrix

CRH_Coefficients_LASSO_matrix <- as.matrix(CRH_Coefficients_LASSO)

# Convert the coefficient matrix to a data frame

CRH_Coefficients_LASSO_df <- as.data.frame(CRH_Coefficients_LASSO_matrix)

# Create a basic table using kable

kable(CRH_Coefficients_LASSO_df, caption = "Table 23A: LASSO Regression Coefficients 
      Analyzing Factors Influencing College Completion Rate High")

Table 23A: LASSO Regression Coefficients Analyzing Factors Influencing College Completion Rate High

Predictor      lambda.min
(Intercept)    -3.0166681
NUMBRANCH      -0.1853803
HIGHDEG         0.0000000
UG25ABV         1.8535113
UGDS           -0.0000166
UGDS_WHITE      1.6001851
UGDS_BLACK     -0.0180945
UGDS_HISP       1.0913299
UGDS_ASIAN      4.7485007
PCTPELL        -1.0592512
PCTFLOAN        0.5948730

# print the best lambda (lambda.1se)

# CRH_Model_Lasso_CV_Logistic$lambda.1se #-- lambda.1se = 0.01788707

# print coefficients at lambda.1se

CRH_Coefficients_LASSO_lambda.1se <- predict(CRH_Model_Lasso_CV_Logistic, s = "lambda.1se", type = "coefficients")

# Convert the sparse matrix to a regular matrix

CRH_Coefficients_LASSO_matrix_lambda.1se <- as.matrix(CRH_Coefficients_LASSO_lambda.1se)

# Convert the coefficient matrix to a data frame

CRH_Coefficients_LASSO_df_lambda.1se <- as.data.frame(CRH_Coefficients_LASSO_matrix_lambda.1se)

# Create a basic table using kable

kable(CRH_Coefficients_LASSO_df_lambda.1se, caption = "Table 23B: LASSO Regression Coefficients 
      Analyzing Factors Influencing College Completion Rate High")

Table 23B: LASSO Regression Coefficients Analyzing Factors Influencing College Completion Rate High

Predictor      lambda.1se
(Intercept)    -2.0358658
NUMBRANCH      -0.0623581
HIGHDEG         0.0000000
UG25ABV         0.7650294
UGDS            0.0000000
UGDS_WHITE      0.2964689
UGDS_BLACK     -0.2249161
UGDS_HISP       0.0000000
UGDS_ASIAN      2.2646633
PCTPELL         0.0000000
PCTFLOAN        0.0000000

The LASSO analysis at lambda.1se, shown in Table 23B, yields a sparser model. The intercept is estimated at -2.0358658, which is the log-odds of a high completion rate when all predictor variables are zero (a probability of about 11.5%), not a completion rate.

For the predictor variables, NUMBRANCH has a negative coefficient of -0.0623581, so each additional branch lowers the log-odds of a high completion rate by about 0.062. In contrast, UG25ABV has a positive coefficient of 0.7650294, indicating that a larger share of undergraduates aged 25 and above is associated with higher log-odds (about 0.765 per one-unit increase in the share).

Several predictor variables (HIGHDEG, UGDS, UGDS_HISP, PCTPELL, and PCTFLOAN) have coefficients of exactly zero. LASSO removed them from the model at this stronger level of regularization; this means they add little predictive value beyond the retained variables, not that they were tested and found statistically insignificant.

UGDS_BLACK has a negative coefficient of -0.2249161, so a larger share of Black students is associated with lower log-odds of a high completion rate (about 0.225 per unit). UGDS_WHITE has a positive coefficient of 0.2964689, so a larger share of White students is associated with higher log-odds (about 0.296 per unit).

UGDS_ASIAN has the largest coefficient, 2.2646633: a larger share of Asian students is associated with substantially higher log-odds of a high completion rate (about 2.265 per unit).

Overall, these findings highlight the role of demographic composition and student characteristics in predicting whether a college has a high completion rate.
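
The lambda.1se value used for Table 23B is, per the glmnet documentation, the largest lambda whose cross-validated error is within one standard error of the minimum. The sketch below illustrates that rule in base R. The lambda grid, mean errors, and standard errors are hypothetical numbers, not output from this report; in a fitted cv.glmnet object the corresponding components are named lambda, cvm, and cvsd.

```r
# Hypothetical cross-validation summary (illustrative numbers only).
lambda <- c(0.100, 0.050, 0.020, 0.010, 0.005, 0.001)   # penalty grid
cvm    <- c(0.900, 0.870, 0.855, 0.850, 0.848, 0.849)   # mean CV error
cvsd   <- c(0.010, 0.010, 0.009, 0.009, 0.009, 0.009)   # standard error of cvm

# lambda.min: the lambda with the smallest mean CV error.
i_min      <- which.min(cvm)
lambda_min <- lambda[i_min]

# One-standard-error rule: the largest lambda whose error is still within
# one standard error of the minimum error.
threshold  <- cvm[i_min] + cvsd[i_min]
lambda_1se <- max(lambda[cvm <= threshold])

print(c(lambda_min = lambda_min, lambda_1se = lambda_1se))
```

The rule trades a small amount of cross-validated error for a stronger penalty, which is why the lambda.1se model in this report (lambda = 0.01788707) keeps fewer predictors than the lambda.min model (lambda = 0.0006280493).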

The LASSO logistic regression model for a high college completion rate, with the coefficients at lambda.1se, is as follows (predictors with a coefficient of zero are omitted):

logit(P(Completion_Rate_High = 1)) = -2.0358658 - 0.0623581 * NUMBRANCH + 0.7650294 * UG25ABV + 0.2964689 * UGDS_WHITE - 0.2249161 * UGDS_BLACK + 2.2646633 * UGDS_ASIAN
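
To make the variable selection explicit, the sketch below compares the coefficients from Tables 23A and 23B and lists which predictors are retained at lambda.1se and which are dropped relative to lambda.min. The values are copied from the two tables; the object names are illustrative and not part of the original analysis.

```r
# Slope coefficients copied from Table 23A (lambda.min) and Table 23B (lambda.1se).
coef_min <- c(NUMBRANCH = -0.1853803, HIGHDEG = 0, UG25ABV = 1.8535113,
              UGDS = -0.0000166, UGDS_WHITE = 1.6001851, UGDS_BLACK = -0.0180945,
              UGDS_HISP = 1.0913299, UGDS_ASIAN = 4.7485007,
              PCTPELL = -1.0592512, PCTFLOAN = 0.5948730)

coef_1se <- c(NUMBRANCH = -0.0623581, HIGHDEG = 0, UG25ABV = 0.7650294,
              UGDS = 0, UGDS_WHITE = 0.2964689, UGDS_BLACK = -0.2249161,
              UGDS_HISP = 0, UGDS_ASIAN = 2.2646633,
              PCTPELL = 0, PCTFLOAN = 0)

# Predictors still in the model at lambda.1se.
retained <- names(coef_1se)[coef_1se != 0]

# Predictors that were nonzero at lambda.min but are zero at lambda.1se.
dropped <- names(coef_min)[coef_min != 0 & coef_1se == 0]

comparison <- data.frame(lambda.min = coef_min, lambda.1se = coef_1se)
print(comparison)
print(retained)
print(dropped)
```

HIGHDEG is zero in both tables, so it is neither retained nor counted as newly dropped.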

Train Data

In the analysis of high college completion rates using LASSO logistic regression, the confusion matrices for the training dataset were examined to assess the model’s predictive performance. In Table 24A, the confusion matrix at lambda.min, the model correctly identified 3,432 colleges without a high completion rate (true negatives) and 19 colleges with a high completion rate (true positives). However, 623 colleges with a high completion rate were classified as not having one (false negatives), and 14 colleges were classified as having a high completion rate when they did not (false positives).

Similarly, in Table 24B, the confusion matrix at lambda.1se, the model correctly identified 3,443 colleges without a high completion rate (true negatives) and 5 colleges with one (true positives). There were 637 false negatives and 3 false positives.

# Make Predictions (lambda.min)

CRH_Predictions_Lasso_Log_Train = predict(CRH_Model_Lasso_CV_Logistic, newx = as.matrix(CRH_Train_Data[, -ncol(CRH_Train_Data)]), s = "lambda.min", type = "response")

# convert the probabilities to 0 and 1

CRH_Predictions_Lasso_Log_Train = ifelse(CRH_Predictions_Lasso_Log_Train > 0.5, 1, 0)

# Print the confusion matrix

CRH_Train_CM_LASSO <- table(CRH_Train_Data$Completion_Rate_High, CRH_Predictions_Lasso_Log_Train)

# print(CRH_Train_CM_LASSO)

kable(CRH_Train_CM_LASSO, caption = "Table 24A: 
      LASSO Regression Confusion Matrix for College Completion Rate High (lambda.min)")

Table 24A: LASSO Regression Confusion Matrix for College Completion Rate High (lambda.min)

            Predicted 0   Predicted 1
Actual 0           3432            14
Actual 1            623            19

# Make Predictions (lambda.1se)

CRH_Predictions_Lasso_Log_Train_lambda.1se = predict(CRH_Model_Lasso_CV_Logistic, newx = as.matrix(CRH_Train_Data[, -ncol(CRH_Train_Data)]), s = "lambda.1se", type = "response")

# convert the probabilities to 0 and 1

CRH_Predictions_Lasso_Log_Train_lambda.1se = ifelse(CRH_Predictions_Lasso_Log_Train_lambda.1se > 0.5, 1, 0)

# Print the confusion matrix

CRH_Train_CM_LASSO_lambda.1se <- table(CRH_Train_Data$Completion_Rate_High, CRH_Predictions_Lasso_Log_Train_lambda.1se)

# print(CRH_Train_CM_LASSO_lambda.1se)

kable(CRH_Train_CM_LASSO_lambda.1se, caption = "Table 24B: 
      LASSO Regression Confusion Matrix for College Completion Rate High (lambda.1se)")

Table 24B: LASSO Regression Confusion Matrix for College Completion Rate High (lambda.1se)

            Predicted 0   Predicted 1
Actual 0           3443             3
Actual 1            637             5

These findings suggest that the LASSO model, despite an overall training accuracy of about 84%, is not effective at identifying colleges with a high completion rate. It classifies almost every college as not high: only 19 of the 642 high-completion colleges (about 3%) are detected at lambda.min, and 5 of 642 (under 1%) at lambda.1se. The accuracy is therefore driven mainly by the majority class, which alone makes up about 84% of the training data. False positives are rare (14 and 3, respectively), so the dominant error is failing to recognize colleges that do have a high completion rate.

These misclassifications can have implications for decision-making processes, as they may lead to inaccurate assessments of college performance or resource allocation. Therefore, further refinement of the model, possibly through adjusting regularization parameters or including additional predictor variables, may be necessary to improve its predictive accuracy and reduce misclassifications.
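
The summary measures behind this assessment can be computed directly from the confusion matrices. The sketch below rebuilds Tables 24A and 24B in base R and derives accuracy, sensitivity, specificity, and the accuracy of always predicting the majority class. The helper classification_metrics() is a hypothetical name introduced here for illustration; it is not part of the original analysis.

```r
# Rows are actual classes, columns are predicted classes (Tables 24A and 24B).
labels <- list(actual = c("0", "1"), predicted = c("0", "1"))
cm_min <- matrix(c(3432, 14, 623, 19), nrow = 2, byrow = TRUE, dimnames = labels)
cm_1se <- matrix(c(3443,  3, 637,  5), nrow = 2, byrow = TRUE, dimnames = labels)

# Hypothetical helper: summary measures from a 2 x 2 confusion matrix.
classification_metrics <- function(cm) {
  tn <- cm["0", "0"]; fp <- cm["0", "1"]
  fn <- cm["1", "0"]; tp <- cm["1", "1"]
  c(accuracy          = (tp + tn) / sum(cm),
    sensitivity       = tp / (tp + fn),            # share of actual 1s detected
    specificity       = tn / (tn + fp),            # share of actual 0s detected
    baseline_accuracy = max(rowSums(cm)) / sum(cm)) # always predict majority class
}

metrics <- rbind(lambda.min = classification_metrics(cm_min),
                 lambda.1se = classification_metrics(cm_1se))
print(round(metrics, 4))
```

Comparing the accuracy column with baseline_accuracy shows how much of the apparent performance comes from class imbalance rather than from the predictors.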

Test Data

In the evaluation of the LASSO model’s performance on the test dataset, two confusion matrices were analyzed to assess its ability to predict a high college completion rate. In Table 25A, the confusion matrix at lambda.min, the model correctly identified 865 colleges without a high completion rate (true negatives) and 4 colleges with one (true positives). However, 149 colleges with a high completion rate were classified as not having one (false negatives), and 5 colleges were classified as having a high completion rate when they did not (false positives).

Similarly, in Table 25B, the confusion matrix at lambda.1se, the model correctly identified 869 colleges without a high completion rate (true negatives) and 1 college with one (true positive). There were 152 false negatives and 1 false positive.

# Make Predictions (lambda.min)

CRH_Predictions_Lasso_Log_Test = predict(CRH_Model_Lasso_CV_Logistic, newx = as.matrix(CRH_Test_Data[, -ncol(CRH_Test_Data)]), s = "lambda.min", type = "response")

# Convert the probabilities to 0 and 1

CRH_Predictions_Lasso_Log_Test = ifelse(CRH_Predictions_Lasso_Log_Test > 0.5, 1, 0)

# Print the confusion matrix

CRH_Test_CM_LASSO <- table(CRH_Test_Data$Completion_Rate_High, CRH_Predictions_Lasso_Log_Test)

# print(CRH_Test_CM_LASSO)

kable(CRH_Test_CM_LASSO, caption = "Table 25A: 
      LASSO Regression Confusion Matrix for College Completion Rate High (lambda.min)")

Table 25A: LASSO Regression Confusion Matrix for College Completion Rate High (lambda.min)

            Predicted 0   Predicted 1
Actual 0            865             5
Actual 1            149             4

# Make Predictions (lambda.1se)

CRH_Predictions_Lasso_Log_Test_lambda.1se = predict(CRH_Model_Lasso_CV_Logistic, newx = as.matrix(CRH_Test_Data[, -ncol(CRH_Test_Data)]), s = "lambda.1se", type = "response")

# Convert the probabilities to 0 and 1

CRH_Predictions_Lasso_Log_Test_lambda.1se = ifelse(CRH_Predictions_Lasso_Log_Test_lambda.1se > 0.5, 1, 0)

# Print the confusion matrix

CRH_Test_CM_LASSO_lambda.1se <- table(CRH_Test_Data$Completion_Rate_High, CRH_Predictions_Lasso_Log_Test_lambda.1se)

# print(CRH_Test_CM_LASSO_lambda.1se)

kable(CRH_Test_CM_LASSO_lambda.1se, caption = "Table 25B: 
      LASSO Regression Confusion Matrix for College Completion Rate High (lambda.1se)")

Table 25B: LASSO Regression Confusion Matrix for College Completion Rate High (lambda.1se)

            Predicted 0   Predicted 1
Actual 0            869             1
Actual 1            152             1

These results show the same pattern on the test set as on the training set: the LASSO model has very limited ability to identify the positive class. It detects only 4 of the 153 high-completion colleges (about 3%) at lambda.min and 1 of 153 (under 1%) at lambda.1se. Overall accuracy is about 85% in both cases, which is no better than always predicting the majority class (870 of 1,023 colleges, about 85%). False positives are rare (5 and 1, respectively), so the dominant error is again failing to recognize colleges that do have a high completion rate.

Such misclassifications can have implications for decision-making processes, as they may lead to inaccurate assessments of college performance or resource allocation. Therefore, further refinement of the model, such as adjusting regularization parameters or considering additional predictor variables, may be necessary to enhance its predictive accuracy and reduce misclassifications.
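
One refinement that does not require refitting the model is to lower the 0.5 probability cutoff used to convert predicted probabilities into classes, since a cutoff of 0.5 tends to favor the majority class when the outcome is imbalanced. The sketch below illustrates the trade-off on simulated data with a plain glm() logistic model. The data, the model, and the helper rates_at() are all hypothetical and stand in for the report's glmnet predictions.

```r
set.seed(42)  # simulated data, for illustration only

# Simulate an imbalanced binary outcome (roughly 15% positives) with one predictor.
n <- 2000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-2 + 1.0 * x))

# Ordinary logistic regression and its fitted probabilities.
fit  <- glm(y ~ x, family = binomial)
prob <- predict(fit, type = "response")

# Hypothetical helper: sensitivity and specificity at a given probability cutoff.
rates_at <- function(cutoff) {
  pred <- as.integer(prob > cutoff)
  c(cutoff      = cutoff,
    sensitivity = sum(pred == 1 & y == 1) / sum(y == 1),
    specificity = sum(pred == 0 & y == 0) / sum(y == 0))
}

# Default cutoff, a lower cutoff, and the observed share of positives.
cutoffs <- c(0.5, 0.3, mean(y))
results <- t(sapply(cutoffs, rates_at))
print(round(results, 3))
```

Lowering the cutoff can only add predicted positives, so sensitivity rises while specificity falls; the appropriate balance depends on the relative cost of the two kinds of error.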

Conclusion

The analysis of logistic regression, ridge regression, and LASSO regression models provided insight into the factors associated with high college completion rates. Comparing the penalized models at the lambda.min and lambda.1se values revealed differences in their coefficients and predictive capabilities. Each of the models employed (logistic regression, ridge regression, LASSO regression, and multiple linear regression) has strengths and limitations that are important to consider when choosing a predictive tool.

The logistic regression model, which focused on predicting high college completion rates, offered a comprehensive understanding of the relationships between various predictor variables and the likelihood of achieving high completion rates. For instance, it highlighted the impact of factors such as the presence of different degree programs, student demographics, and financial aid metrics on completion rates. However, despite its interpretability, logistic regression struggled with misclassifications, particularly in identifying instances of high completion rates accurately. 

On the other hand, ridge regression addressed concerns such as multicollinearity and overfitting by adding a penalty on the size of the coefficients to the cost function. The ridge regression model demonstrated improved classification performance compared to logistic regression, particularly in accurately identifying non-completion instances. However, it still faced challenges in accurately predicting completion instances, indicating room for further refinement.

Similarly, LASSO regression, another regularization-based approach, provided valuable insights into the factors influencing high college completion rates. By selecting a subset of relevant predictor variables while shrinking others to zero, LASSO regression offered a balance between model complexity and predictive accuracy. However, like ridge regression, it exhibited limitations in accurately predicting completion instances, especially for the positive class. 
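
The different behavior of the two penalties can be seen in the textbook special case of an orthonormal design, where both estimators have closed forms: ridge rescales every coefficient by a constant factor, while LASSO applies soft-thresholding. The sketch below uses hypothetical unpenalized coefficients and a hypothetical lambda; the exact scaling of lambda depends on how the objective function is written, so the numbers are illustrative only and do not correspond to the glmnet fits in this report.

```r
# Unpenalized coefficient estimates and penalty strength (hypothetical values).
b_ols  <- c(2.0, 0.8, 0.3, -0.1)
lambda <- 0.5

# Ridge (orthonormal case): proportional shrinkage toward zero;
# nonzero inputs never become exactly zero.
b_ridge <- b_ols / (1 + lambda)

# LASSO (orthonormal case): soft-thresholding; coefficients whose
# magnitude is below lambda are set exactly to zero.
b_lasso <- sign(b_ols) * pmax(abs(b_ols) - lambda, 0)

print(data.frame(ols = b_ols, ridge = round(b_ridge, 3), lasso = b_lasso))
```

In this example the two smallest coefficients are removed by LASSO but only reduced by ridge, which is the mechanism behind the zero entries in the LASSO coefficient tables.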

In contrast, multiple linear regression, a more traditional modeling technique, offered insights into the linear associations between predictor variables and high completion rates. Despite its simplicity compared to the regularization-based approaches, multiple linear regression provided valuable insights into the relationships between predictor variables and completion rates, albeit without addressing issues like multicollinearity or overfitting. 

Comparing the performance of these models, while ridge regression and LASSO regression demonstrated improved classification performance compared to logistic regression, none of the models achieved optimal predictive accuracy. Misclassifications persisted across all models, indicating the complexity of predicting high college completion rates accurately. This outcome was somewhat expected, given the inherent challenges associated with predicting human behavior and educational outcomes. 

Ultimately, the choice of the “better” model depends on the specific goals and requirements of the analysis. If interpretability and understanding the linear relationships between predictor variables and completion rates are paramount, multiple linear regression may be preferred. However, if predictive accuracy and addressing issues like multicollinearity and overfitting are crucial, ridge regression or LASSO regression may offer more suitable alternatives. 

For reference, the fitted logistic regression model was:

logit (P (Y = `Completion Rate High`)) = 0.13 - 0.22*`NUMBRANCH` - 0.37*`HIGHDEGCertificate Degree` - 2.69*`HIGHDEGAssociate Degree` - 1.90*`HIGHDEGBachelors Degree` - 2.42*`HIGHDEGGraduate Degree` + 0.85*`UG25ABV` + 0.36*`UGDS_WHITE` - 1.38*`UGDS_BLACK` - 0.13*`UGDS_HISP` + 4.42*`UGDS_ASIAN` - 2.10*`PCTPELL` + 0.85*`PCTFLOAN` 

While no single model emerged as the definitive “best” performer, each provided valuable insights into the complex interplay of factors influencing high college completion rates. Further refinement and iteration of these models, possibly through incorporating additional predictor variables or exploring alternative modeling techniques, are necessary to enhance their predictive accuracy and reliability for informing decision-making processes related to educational outcomes.
