Using a dataset from the Department of Education that covers all higher education institutions in 2023, this analysis examines the factors associated with graduation rates. The dataset contains 6,352 rows and 78 columns, with information ranging from school names, tuition fees, geographical location, and demographic proportions to SAT averages, admission rates, and financial aid statistics. Preparation included simplifying and renaming columns for readability and removing columns that were not relevant to this exploration. Rows with missing data (NAs) were removed before each analysis question was explored rather than once at the start, so that observations are dropped only when they lack values for the variables under study. Additionally, new data frames and columns were created for this analysis.
In this assignment, the focus is on consolidating theoretical
knowledge into practical skills with the application of Ridge and LASSO
regression techniques. The goal is to build linear and logistic models
by implementing Ridge and LASSO functions over a range of regularization
parameter lambda values. The assignment entails the creation of two
regularized models: a regression model predicting Completion Rate
(C150_4_L4), and a logistic regression model predicting
Completion Rate Median and
Completion Rate High columns.
Ridge and LASSO are regularization techniques used to prevent overfitting in statistical models by adding a penalty term to the cost function. Ridge regression adds a penalty equivalent to the square of the magnitude of coefficients, while LASSO adds a penalty equivalent to the absolute value of the magnitude of coefficients. The main difference between the two lies in the penalty term: Ridge tends to shrink coefficients towards zero, while LASSO tends to set some coefficients to exactly zero, effectively performing variable selection.
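For reference, the two penalties can be written side by side. This is the standard textbook form for a linear model; the symbols ($\beta_j$ for coefficients, $\lambda$ for the regularization parameter, $n$ observations, $p$ predictors) are notation assumed here, not taken from the assignment.

```latex
\begin{aligned}
\hat{\beta}^{\text{ridge}} &= \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^{2} \\
\hat{\beta}^{\text{lasso}} &= \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \lvert\beta_j\rvert
\end{aligned}
```

The squared penalty shrinks all coefficients smoothly, whereas the absolute-value penalty can push individual coefficients to exactly zero, which is why LASSO performs variable selection.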
Additionally, outliers are identified and removed using Mahalanobis Distance and Local Outlier Factor methods. Ridge regression is then applied, estimating lambda.min and lambda.1se values and fitting the model against the training set to report interesting findings and performance metrics. Similarly, LASSO regression is applied to the training set, with focus on reporting coefficients and identifying any coefficients that reduce to zero.
By comparing the performance of the Ridge and LASSO models, insights can be gained into which regularization technique better suits the dataset and the predictive task at hand. This comparison will shed light on the effectiveness of each method and whether the outcomes align with expectations.
# Week 5
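# Importing and preparing the dataset
# Packages assumed from the functions called below
library(tidyverse) # read_csv(), drop_na(), %>%, ggplot()
library(janitor)   # clean_names()
library(skimr)     # skim()
library(Amelia)    # missmap()
library(dbscan)    # lof()
library(caret)     # createDataPartition()
library(broom)     # tidy()
library(rempsyc)   # nice_table()
library(knitr)     # kable()
library(glmnet)    # cv.glmnet()
CollegeDataset <- read_csv("college_scorecard_Short1.csv")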
## Rows: 6352 Columns: 78
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): INSTNM, CITY, STABBR, Completion_Rate_Median, Completion_Rate_High...
## dbl (65): SCHTYPE, ICLEVEL, REGION, LOCALE, C150_4_L4, NUMBRANCH, PREDDEG, H...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
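# Clean names (the result is only printed, not assigned, so the original
# upper-case column names used in the rest of the script are kept)
clean_names(CollegeDataset)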
# Column Names
colnames(CollegeDataset)
# Learn about the dataset
dim(CollegeDataset) # checking dimensions of the dataset
summary(CollegeDataset)
# Checking the structure of the dataset
str(CollegeDataset)
skim(CollegeDataset)
# Checking for NAs in columns
colSums(is.na(CollegeDataset))
# Various columns have a decent amount of NAs
# Removing NAs from response (dependent) variables
# Completion_Rate
CollegeDataset <- CollegeDataset %>% drop_na(C150_4_L4)
# Completion_Rate_Median
CollegeDataset <- CollegeDataset %>% drop_na(Completion_Rate_Median)
# Completion_Rate_High
CollegeDataset <- CollegeDataset %>% drop_na(Completion_Rate_High)
# Checking dimensions now after NA removal
dim(CollegeDataset)
To ensure the robustness and efficiency of the analysis, a systematic
process of variable selection was conducted on the dataset obtained from
the Department of Education. Initially, a comprehensive missing data
report (Figure 1) was generated to understand the extent of missing
values in the dataset. After removing NAs from the three response
variables C150_4_L4, Completion Rate Median,
and Completion Rate High, the dataset decreased to 5242
observations with 78 columns.
To streamline the dataset while retaining its informative value,
columns with more than 300 missing values were eliminated, resulting in
the removal of 21 columns. This step reduced the dataset to 5242 rows
and 57 columns, minimizing the impact of missing data on subsequent
analyses. Further refinement was undertaken by selecting columns based
on their relevance to the analysis objectives and the least amount of
missing data. This process resulted in the creation of a subset,
CollegeDataset2, containing 5242 observations across 13
carefully chosen columns. These columns include essential predictor
variables such as enrollment demographics (UG25ABV,
UGDS, UGDS_WHITE, UGDS_BLACK,
UGDS_HISP, and UGDS_ASIAN), metrics related to
financial aid (PCTPELL and PCTFLOAN), and
various institutional characteristics (HIGHDEG and
NUMBRANCH).
# ------------------------------------
# Generate a missing data report
# ------------------------------------
missmap(CollegeDataset, main = "Figure 1: Missing Values vs Observed")
# Currently the CollegeDataset has 5242 rows and 78 columns
# ------------------------------------
# List columns with most missing values
# ------------------------------------
sort(colSums(is.na(CollegeDataset)), decreasing = TRUE)
# SAT_AVG ADM_RATE ADMCON7 ROOMBOARD_ON
# 4219 3385 3373 3313
# OTHEREXPENSE_ON ENDOWBEGIN ENDOWEND PFTFAC
# 3310 2856 2856 2095
# BOOKSUPPLY TUITIONFEE_IN TUITIONFEE_OUT AVGFACSAL
# 2027 1982 1982 1833
# MARRIED MEDIAN_HH_INC POVERTY_RATE UNEMP_RATE
# 925 904 904 904
# FEMALE FIRST_GEN DEPENDENT MD_EARN_WNE_1YR
# 897 741 509 429
# RET_FT_4_L4 AGE_ENTRY FAMINC COSTT4_A_P
# 397 225 225 212
# Num4_PUB_PRIV PAR_ED_PCT_1STGEN PAR_ED_PCT_MS PAR_ED_PCT_HS
# 208 207 207 207
# DEP_INC_AVG APPL_SCH_N IRPS_2MOR IRPS_ASIAN
# 207 207 123 123
# IRPS_BLACK IRPS_HISP IRPS_WHITE IRPS_WOMEN
# 123 123 123 123
# IRPS_MEN UG25ABV GRAD_DEBT_MDN WDRAW_DEBT_MDN
# 123 81 50 50
# PCTPELL PCTFLOAN SCHTYPE STUFACR
# 8 8 7 4
# UGDS UGDS_WHITE UGDS_BLACK UGDS_HISP
# 2 2 2 2
# UGDS_ASIAN PCIP09 PCIP11 PCIP13
# 2 1 1 1
# PCIP27 PCIP40 PCIP41 PCIP42
# 1 1 1 1
# PCIP51 PCIP52 TUITFTE INEXPFTE
# 1 1 1 1
# ------------------------------------
# Eliminate columns with more than 300 missing values
# Note that the threshold of 300 is somewhat arbitrary: it keeps columns with a
# manageable number of NAs without losing too much valuable data
# ------------------------------------
CollegeDataset = CollegeDataset[, colSums(is.na(CollegeDataset)) < 300]
# dim(CollegeDataset) # 5242 by 57 columns meaning that 21 columns were removed
sort(colSums(is.na(CollegeDataset)), decreasing = TRUE)
# These are the remaining columns
# AGE_ENTRY FAMINC COSTT4_A_P Num4_PUB_PRIV
# 225 225 212 208
# PAR_ED_PCT_1STGEN PAR_ED_PCT_MS PAR_ED_PCT_HS DEP_INC_AVG
# 207 207 207 207
# APPL_SCH_N IRPS_2MOR IRPS_ASIAN IRPS_BLACK
# 207 123 123 123
# IRPS_HISP IRPS_WHITE IRPS_WOMEN IRPS_MEN
# 123 123 123 123
# UG25ABV GRAD_DEBT_MDN WDRAW_DEBT_MDN PCTPELL
# 81 50 50 8
# PCTFLOAN SCHTYPE STUFACR UGDS
# 8 7 4 2
# UGDS_WHITE UGDS_BLACK UGDS_HISP UGDS_ASIAN
# 2 2 2 2
# PCIP09 PCIP11 PCIP13 PCIP27
# 1 1 1 1
# PCIP40 PCIP41 PCIP42 PCIP51
# 1 1 1 1
# PCIP52 TUITFTE INEXPFTE ICLEVEL
# 1 1 1 0
# INSTNM REGION LOCALE CITY
# 0 0 0 0
# STABBR C150_4_L4 Completion_Rate_Median Completion_Rate_High
# 0 0 0 0
# ZIP NUMBRANCH PREDDEG HIGHDEG
# 0 0 0 0
# CCBASIC CCUGPROF CCSIZSET HBCU
# 0 0 0 0
# DISTANCEONLY
# 0
# colnames(CollegeDataset)
# str(CollegeDataset)
# ------------------------------------
# Keep the following columns and create a subset
# This is the first pass of shortening the dataframe
# Choose predictor variables based on the areas of interest and least amount of NAs
# ------------------------------------
CollegeDataset2 = CollegeDataset[, c("C150_4_L4","Completion_Rate_Median",
"Completion_Rate_High",
"NUMBRANCH",
"UG25ABV",
"HIGHDEG",
"UGDS",
"UGDS_WHITE",
"UGDS_BLACK",
"UGDS_HISP",
"UGDS_ASIAN",
"PCTPELL",
"PCTFLOAN")]
# Convert HIGHDEG numbers to categories
CollegeDataset2$HIGHDEG <- factor(CollegeDataset2$HIGHDEG, levels = 0:4,
labels = c("Non-Degree-Granting", "Certificate Degree",
"Associate Degree", "Bachelors Degree",
"Graduate Degree"))
dim(CollegeDataset2) # 5242 observations with 13 columns
sort(colSums(is.na(CollegeDataset2)), decreasing = TRUE)
# PCTPELL PCTFLOAN UGDS_WHITE UGDS_BLACK
# 8 8 2 2
# UGDS_HISP UGDS_ASIAN C150_4_L4 Completion_Rate_Median
# 2 2 0 0
# Completion_Rate_High NUMBRANCH UGDS STABBR
# 0 0 2 0
# HIGHDEG UG25ABV
# 0 81
# remove missing values
CollegeDataset2 = na.omit(CollegeDataset2)
# dim(CollegeDataset2) # 5154 observations with 13 columns
To ensure interpretability and ease of analysis, certain categorical
variables were transformed into factor variables. Notably, the variable
HIGHDEG, representing the highest degree awarded by
institutions, was categorized into meaningful labels:
“Non-Degree-Granting”, “Certificate Degree”, “Associate Degree”,
“Bachelors Degree”, and “Graduate Degree.”
After variable selection and transformation, the resulting dataset,
CollegeDataset2, comprised 5154 observations with 13
columns. This meticulous process of variable selection guarantees the
integrity and relevance of the dataset for future analyses, facilitating
meaningful insights into the factors influencing graduation rates in
higher education institutions.
To initiate this analysis, outliers were identified and removed from the dataset using two distinct methods: Mahalanobis Distance and Local Outlier Factor (LOF). These methods provide valuable insights into data points that deviate significantly from the majority, assisting in the creation of robust statistical models.
The Mahalanobis Distance method detects outliers based on the
calculated distance of each data point from the centroid of the dataset,
accounting for the covariance structure of the variables. For this
analysis, the numeric columns relevant to the study, namely
C150_4_L4, NUMBRANCH, UG25ABV,
UGDS, UGDS_WHITE, PCTPELL,
PCTFLOAN, UGDS_BLACK, UGDS_HISP,
and UGDS_ASIAN, were selected.
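For reference, the distance being computed can be written as follows. The symbols are notation assumed here: $x$ is one observation's vector of the selected columns, $\mu$ is the vector of column means, and $\Sigma$ is the covariance matrix.

```latex
D^{2}(x) = (x - \mu)^{\top}\, \Sigma^{-1}\, (x - \mu)
```

R's `mahalanobis()` returns this squared distance $D^{2}$, so the threshold reported later is on the squared scale.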
# print types of columns in college dataset 2
# sapply(CollegeDataset2, class) # Not all numeric
# Use mahalanobis distance to detect outliers, based on the following numeric columns:
# -------------------------------------------------------------------------
# "C150_4_L4" "NUMBRANCH" "UGDS" "UGDS_WHITE" "PCTPELL" "PCTFLOAN"
# "UGDS_BLACK" "UGDS_HISP" "UGDS_ASIAN" "UG25ABV"
# -------------------------------------------------------------------------
listofcols = c("C150_4_L4","NUMBRANCH", "UG25ABV", "UGDS", "UGDS_WHITE",
"PCTPELL","PCTFLOAN","UGDS_BLACK","UGDS_HISP","UGDS_ASIAN")
# check if the columns in the list are numeric
# sapply(CollegeDataset2[, listofcols], class) #All numeric
# Use mahalanobis distance to detect outliers
outliers = mahalanobis(CollegeDataset2[, listofcols],
colMeans(CollegeDataset2[, listofcols]),
cov(CollegeDataset2[, listofcols]))
# print(outliers)
# Use the quantile function to find the 95th percentile of the mahalanobis distance
# The ones above this value are the outliers
Outlier_Threshold = quantile(outliers, 0.95)
# print(Outlier_Threshold) # 26.05974
# print the outliers
# dim(CollegeDataset2[outliers > Outlier_Threshold, ]) # 258 outliers
# Create a column to identify the outliers
CollegeDataset2$Outliers_Maha = ifelse(outliers > Outlier_Threshold, 1, 0)
# Print first 6 rows
Outlier_Maha_College2_6 <- head(CollegeDataset2)
# Present as a nice table
knitr::kable(Outlier_Maha_College2_6, caption =
"College Dataset w/ Outliers (Mahalanobis Distance)")
| C150_4_L4 | Completion_Rate_Median | Completion_Rate_High | NUMBRANCH | UG25ABV | HIGHDEG | UGDS | UGDS_WHITE | UGDS_BLACK | UGDS_HISP | UGDS_ASIAN | PCTPELL | PCTFLOAN | Outliers_Maha |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.2807 | Below Median | Other | 1 | 0.0617 | Graduate Degree | 5098 | 0.0184 | 0.8978 | 0.0114 | 0.0014 | 0.6853 | 0.6552 | 0 |
| 0.6245 | Above Median | Other | 1 | 0.1794 | Graduate Degree | 13284 | 0.5297 | 0.2458 | 0.0669 | 0.0767 | 0.3253 | 0.4401 | 0 |
| 0.4444 | Below Median | Other | 1 | 0.8606 | Graduate Degree | 251 | 0.2470 | 0.6932 | 0.0438 | 0.0000 | 0.7852 | 0.8423 | 0 |
| 0.6072 | Above Median | Other | 1 | 0.1519 | Graduate Degree | 7358 | 0.7196 | 0.0871 | 0.0610 | 0.0357 | 0.2377 | 0.3578 | 0 |
| 0.2843 | Below Median | Other | 1 | 0.0677 | Graduate Degree | 3495 | 0.0152 | 0.9259 | 0.0129 | 0.0020 | 0.7205 | 0.7637 | 0 |
| 0.7223 | Above Median | Other | 1 | 0.0735 | Graduate Degree | 30725 | 0.7676 | 0.1050 | 0.0549 | 0.0137 | 0.1712 | 0.3454 | 0 |
# Create a data frame with Outliers_Maha and their colors based on the threshold
Outliers_Maha_df <- data.frame(
Outliers_Maha = outliers,
color = ifelse(outliers > Outlier_Threshold, "red", "black"))
# plot outliers based on the mahalanobis distance
ggplot(Outliers_Maha_df, aes(x = seq_along(Outliers_Maha), y = Outliers_Maha, color = color)) + geom_point(shape = 19) + scale_color_identity() +
labs(x = "Index",
y = "Mahalanobis Distance",
title = "Figure 2: Outliers Based on Mahalanobis Distance",
caption = "Red points indicate outliers beyond the threshold: 26.06") + theme_bw()
The analysis involved computing the Mahalanobis Distance for each
observation and identifying outliers beyond the 95th percentile
threshold. The threshold for outliers was determined to be 26.06,
indicating that any value surpassing this threshold is classified as an
outlier. These outliers were flagged and labeled accordingly, with the
column Outliers_Maha indicating their presence. If the
Mahalanobis Distance of a data point exceeded the threshold, it was
marked as “1” in the Outliers_Maha column; otherwise, it
was labeled as “0.” Figure 2 illustrates the outliers within the
dataset, identified using the Mahalanobis distance, with the red data
points representing those outliers.
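The percentile rule described above can be illustrated on simulated data. This is a minimal sketch using base R only; the toy matrix `X` and the chi-square cutoff shown for comparison are assumptions introduced for illustration and are not part of the original analysis.

```r
# Toy data: 500 rows, 3 numeric columns (hypothetical, for illustration only)
set.seed(1)
X <- cbind(a = rnorm(500), b = rnorm(500), c = rnorm(500))

# mahalanobis() returns SQUARED distances from the centroid
d2 <- mahalanobis(X, colMeans(X), cov(X))

# Rule used in this report: flag everything above the 95th percentile.
# By construction this always flags about 5% of the rows.
flag_pct <- d2 > quantile(d2, 0.95)

# A common alternative (an assumption, not used in the report):
# compare against a chi-square quantile with df = number of columns
flag_chi <- d2 > qchisq(0.95, df = ncol(X))

sum(flag_pct)  # number of rows flagged by the percentile rule
sum(flag_chi)  # number of rows flagged by the chi-square rule
```

Because a percentile cutoff flags a fixed share of rows regardless of how the data are distributed, the count of flagged rows reflects the chosen percentile rather than the data themselves.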
The LOF method detects outliers by examining the local density
deviation of a data point with respect to its neighbors. A data point is
considered an outlier if its density significantly deviates from that of
its neighbors. In this analysis, the LOF was calculated for the same set
of numeric columns. Outliers were identified using a threshold value of
1.5, which indicates a substantial deviation in density compared to the
surrounding data points. These outliers were flagged and labeled
accordingly, with the column Outliers_LOF indicating their
presence. If the LOF of a data point exceeded the threshold of 1.5, it
was marked as “1” in the Outliers_LOF column; otherwise, it
was labeled as “0”. Figure 3 provides a visual representation of the
outliers within the dataset, with the red data points indicating those
identified using the LOF algorithm.
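For reference, the LOF score can be written in the usual textbook form. The notation is assumed here: $N_k(p)$ is the set of $k$ nearest neighbors of point $p$ (the neighborhood size is controlled by `minPts` in the call), $d(p,o)$ is the distance between two points, and $\text{lrd}_k$ is the local reachability density.

```latex
\begin{aligned}
\text{reach-dist}_k(p, o) &= \max\{\, k\text{-distance}(o),\; d(p, o) \,\} \\
\text{lrd}_k(p) &= \left( \frac{1}{\lvert N_k(p) \rvert} \sum_{o \in N_k(p)} \text{reach-dist}_k(p, o) \right)^{-1} \\
\text{LOF}_k(p) &= \frac{1}{\lvert N_k(p) \rvert} \sum_{o \in N_k(p)} \frac{\text{lrd}_k(o)}{\text{lrd}_k(p)}
\end{aligned}
```

A score near 1 means a point is about as dense as its neighbors; scores well above 1, such as the 1.5 cutoff used here, indicate a point that sits in a sparser region than its neighbors.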
# Outlier detection using LOF
# ---------------------------
Outliers_LOF = lof(CollegeDataset2[, listofcols], minPts = 5)
# Create a column to identify the outliers
CollegeDataset2$Outliers_LOF = ifelse(Outliers_LOF > 1.5, 1, 0)
# print the outliers
# dim(CollegeDataset2[Outliers_LOF > 1.5, ]) # 452 outliers
# Print first 6 rows
Outlier_LOF_College2_6 <- head(CollegeDataset2)
# Present as a nice table
knitr::kable(Outlier_LOF_College2_6, caption =
"College Dataset w/ Outliers (LOF Method)")
| C150_4_L4 | Completion_Rate_Median | Completion_Rate_High | NUMBRANCH | UG25ABV | HIGHDEG | UGDS | UGDS_WHITE | UGDS_BLACK | UGDS_HISP | UGDS_ASIAN | PCTPELL | PCTFLOAN | Outliers_Maha | Outliers_LOF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.2807 | Below Median | Other | 1 | 0.0617 | Graduate Degree | 5098 | 0.0184 | 0.8978 | 0.0114 | 0.0014 | 0.6853 | 0.6552 | 0 | 0 |
| 0.6245 | Above Median | Other | 1 | 0.1794 | Graduate Degree | 13284 | 0.5297 | 0.2458 | 0.0669 | 0.0767 | 0.3253 | 0.4401 | 0 | 0 |
| 0.4444 | Below Median | Other | 1 | 0.8606 | Graduate Degree | 251 | 0.2470 | 0.6932 | 0.0438 | 0.0000 | 0.7852 | 0.8423 | 0 | 0 |
| 0.6072 | Above Median | Other | 1 | 0.1519 | Graduate Degree | 7358 | 0.7196 | 0.0871 | 0.0610 | 0.0357 | 0.2377 | 0.3578 | 0 | 0 |
| 0.2843 | Below Median | Other | 1 | 0.0677 | Graduate Degree | 3495 | 0.0152 | 0.9259 | 0.0129 | 0.0020 | 0.7205 | 0.7637 | 0 | 0 |
| 0.7223 | Above Median | Other | 1 | 0.0735 | Graduate Degree | 30725 | 0.7676 | 0.1050 | 0.0549 | 0.0137 | 0.1712 | 0.3454 | 0 | 0 |
# Create a data frame with outliers_lof and their colors based on the threshold
Outliers_LOF_df <- data.frame(
LOF_Outliers = Outliers_LOF,
color = ifelse(Outliers_LOF > 1.5, "red", "black"))
# plot outliers based on the LOF
ggplot(Outliers_LOF_df, aes(x = seq_along(Outliers_LOF), y = LOF_Outliers, color = color)) +
geom_point(shape = 19) +
scale_color_identity() +
labs(x = "Index",
y = "Local Outlier Factor (LOF)",
title = "Figure 3: Outliers Based on Local Outlier Factor (LOF)",
caption = "Red points indicate outliers with LOF greater than 1.5") + theme_bw()
To provide a comprehensive overview of outliers, the results from both the Mahalanobis Distance and LOF methods were combined. The two outlier flags were summed for each observation. Table 1 shows the resulting distribution, with “0” indicating observations flagged by neither method, “1” denoting observations flagged by exactly one method, and “2” representing observations flagged by both Mahalanobis Distance and LOF.
Upon analysis, it was found that 4487 observations were flagged by
neither method, 624 observations were flagged as outliers by one method, and
43 observations were identified as outliers by both methods. The column
Outliers in the dataset represents these totals.
Additionally, to maintain a high-quality dataset for subsequent
analysis, observations flagged as outliers by both methods (outliers =
2) were removed. This resulted in a dataset containing 5111 rows and 16
columns (5154 minus the 43 removed), retaining enough observations while
limiting the influence of extreme values on the model.
# Sum the two columns to get the total number of outliers
CollegeDataset2$Outliers = CollegeDataset2$Outliers_Maha + CollegeDataset2$Outliers_LOF
Outliers_Table <- table(CollegeDataset2$Outliers)
knitr::kable(Outliers_Table, caption = "Table 1:
Summary of Outliers Detected by Mahalanobis Distance and LOF Methods")
| Var1 | Freq |
|---|---|
| 0 | 4487 |
| 1 | 624 |
| 2 | 43 |
# Subset data by keeping where outliers = 0 or 1
# I want to keep as much of the data to get a better quality model so that's
# why I chose to remove outliers that = 2 as both the Mahalanobis Distance and LOF
# determine there is an outlier at that specific row
CollegeDataset2 = CollegeDataset2[CollegeDataset2$Outliers %in% c(0, 1), ]
# dim(CollegeDataset2) # 5111 rows by 16 columns
A multiple linear regression (MLR) model was fitted to explore
the factors associated with college completion rates. The study centered on
the completion rate (C150_4_L4) as the response variable,
with pertinent predictor variables such as the number of branches, the
highest degree awarded, demographics of the student body, and financial aid
statistics.
The estimates in Table 2 show how each factor is associated with college completion rates, holding the other predictors constant. Because C150_4_L4 and the share variables are proportions between 0 and 1, the effects below are expressed in percentage points. For example:
Intercept (0.46): The intercept is the expected completion rate when all numeric predictors are zero and HIGHDEG is at its reference level (Non-Degree-Granting). In this model, that baseline is approximately 46%.
NUMBRANCH (-0.01): Each additional branch is associated with a decrease in the completion rate of approximately 1 percentage point. One possible explanation is that spreading resources across multiple branches weakens support systems, hindering overall completion rates.
HIGHDEG Variables:
HIGHDEGCertificate Degree (0.20): HIGHDEG is a factor, so each HIGHDEG term compares one category with the reference level (Non-Degree-Granting). Institutions whose highest award is a certificate have completion rates about 20 percentage points higher than the reference group, and the difference is statistically significant (p = .003).
HIGHDEGAssociate Degree (-0.06): Institutions whose highest award is an associate degree have completion rates about 6 percentage points lower than the reference group, but the difference is not statistically significant (p = .344).
HIGHDEGBachelors Degree (-0.06): Similarly, the estimate for institutions whose highest award is a bachelor’s degree is slightly negative but not statistically significant (p = .407).
HIGHDEGGraduate Degree (-0.01): The estimate is close to zero and not statistically significant (p = .890), so graduate-level institutions do not differ detectably from the reference group.
UG25ABV (-0.06): Each percentage point increase in the share of undergraduates over 25 years old is associated with a decrease of about 0.06 percentage points in the completion rate. This may reflect challenges faced by older students in completing their degrees.
UGDS Variables:
UGDS_WHITE (0.07): Each percentage point increase in the proportion of White students is associated with an increase of about 0.07 percentage points in completion rates.
UGDS_BLACK (-0.11): Conversely, each percentage point increase in the proportion of Black students is associated with a decrease of about 0.11 percentage points in completion rates.
UGDS_HISP (0.08): Each percentage point increase in the proportion of Hispanic students is associated with an increase of about 0.08 percentage points in completion rates.
UGDS_ASIAN (0.51): This is the largest coefficient in the model: each percentage point increase in the proportion of Asian students is associated with an increase of about 0.51 percentage points in completion rates.
PCTPELL (-0.16): Each percentage point increase in Pell Grant recipients is associated with a decrease of about 0.16 percentage points in completion rates. This may reflect the challenges faced by economically disadvantaged students in completing their degrees.
PCTFLOAN (0.25): Each percentage point increase in students with federal loans is associated with an increase of about 0.25 percentage points in completion rates. This is consistent with financial aid supporting students through completion, although the model does not establish causation.
Therefore, the multiple linear regression model is as follows:
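$$
\begin{aligned}
\text{C150\_4\_L4} ={}& 0.46 - 0.01\,\text{NUMBRANCH} \\
&+ 0.20\,\text{HIGHDEG}_{\text{Certificate Degree}} - 0.06\,\text{HIGHDEG}_{\text{Associate Degree}} \\
&- 0.06\,\text{HIGHDEG}_{\text{Bachelors Degree}} - 0.01\,\text{HIGHDEG}_{\text{Graduate Degree}} \\
&- 0.06\,\text{UG25ABV} + 0.00\,\text{UGDS} \\
&+ 0.07\,\text{UGDS\_WHITE} - 0.11\,\text{UGDS\_BLACK} + 0.08\,\text{UGDS\_HISP} + 0.51\,\text{UGDS\_ASIAN} \\
&- 0.16\,\text{PCTPELL} + 0.25\,\text{PCTFLOAN}
\end{aligned}
$$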
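To make the fitted equation concrete, the sketch below plugs a single institution into it. The coefficients are the rounded estimates from Table 2; the institution's values are hypothetical and chosen only for illustration, so the result is not a prediction from the actual fitted model object.

```r
# Rounded coefficients from Table 2 (HIGHDEG reference level: Non-Degree-Granting)
coefs <- c(
  "(Intercept)" = 0.46,
  "NUMBRANCH" = -0.01,
  "HIGHDEGCertificate Degree" = 0.20,
  "HIGHDEGAssociate Degree" = -0.06,
  "HIGHDEGBachelors Degree" = -0.06,
  "HIGHDEGGraduate Degree" = -0.01,
  "UG25ABV" = -0.06,
  "UGDS" = 0.00,
  "UGDS_WHITE" = 0.07,
  "UGDS_BLACK" = -0.11,
  "UGDS_HISP" = 0.08,
  "UGDS_ASIAN" = 0.51,
  "PCTPELL" = -0.16,
  "PCTFLOAN" = 0.25
)

# Hypothetical institution: graduate-degree granting, one branch
x <- c(
  "(Intercept)" = 1,
  "NUMBRANCH" = 1,
  "HIGHDEGCertificate Degree" = 0,
  "HIGHDEGAssociate Degree" = 0,
  "HIGHDEGBachelors Degree" = 0,
  "HIGHDEGGraduate Degree" = 1,   # dummy variable switched on for this category
  "UG25ABV" = 0.15,
  "UGDS" = 5000,
  "UGDS_WHITE" = 0.70,
  "UGDS_BLACK" = 0.10,
  "UGDS_HISP" = 0.06,
  "UGDS_ASIAN" = 0.04,
  "PCTPELL" = 0.25,
  "PCTFLOAN" = 0.35
)

# Predicted completion rate (a proportion between 0 and 1)
pred <- sum(coefs * x[names(coefs)])
pred
```

Only one HIGHDEG dummy is set to 1 at a time; an institution at the reference level would have all four set to 0.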
# Prep Work
# Subset CollegeDataset2 to focus on Completion Rate
CR_Data <- CollegeDataset2 [c("NUMBRANCH",
"HIGHDEG",
"UG25ABV",
"UGDS",
"UGDS_WHITE",
"UGDS_BLACK",
"UGDS_HISP",
"UGDS_ASIAN",
"PCTPELL",
"PCTFLOAN",
"C150_4_L4")]
# ---------------------------------
# Fit the MLR model
# ---------------------------------
# Split the data into TRAIN and TEST sets
set.seed(1996) # Set seed for reproducibility
# TRAIN
CR_Train_Index = createDataPartition(CR_Data$C150_4_L4, p = .8, list = FALSE) # 80% for training
CR_Train = CR_Data[ CR_Train_Index,]
# dim(CR_Train) # about 80% of the rows, 11 columns
# TEST 20% for testing
CR_Test = CR_Data[-CR_Train_Index,]
# dim(CR_Test) # about 20% of the rows, 11 columns
# ------------------------------------
# fit the model
# ------------------------------------
CR_Model = lm(C150_4_L4 ~ NUMBRANCH +
HIGHDEG +
UG25ABV +
UGDS +
UGDS_WHITE +
UGDS_BLACK +
UGDS_HISP +
UGDS_ASIAN +
PCTPELL +
PCTFLOAN, data = CR_Train)
# summary(CR_Model)
CR_Model_Table <- tidy(CR_Model, conf.int = TRUE)
# Making a table to present the findings
nice_table(CR_Model_Table, title = "Table 2:
Multiple Linear Regression Results of Factors Influencing College Completion Rate")
Table 2: Multiple Linear Regression Results of Factors Influencing College Completion Rate
| Term | estimate | std.error | statistic | p | 95% CI |
|---|---|---|---|---|---|
| (Intercept) | 0.46 | 0.07 | 6.45 | < .001*** | [0.32, 0.60] |
| NUMBRANCH | -0.01 | 0.00 | -5.42 | < .001*** | [-0.01, -0.00] |
| HIGHDEGCertificate Degree | 0.20 | 0.07 | 3.01 | .003** | [0.07, 0.33] |
| HIGHDEGAssociate Degree | -0.06 | 0.07 | -0.95 | .344 | [-0.20, 0.07] |
| HIGHDEGBachelors Degree | -0.06 | 0.07 | -0.83 | .407 | [-0.19, 0.08] |
| HIGHDEGGraduate Degree | -0.01 | 0.07 | -0.14 | .890 | [-0.14, 0.12] |
| UG25ABV | -0.06 | 0.02 | -3.68 | < .001*** | [-0.09, -0.03] |
| UGDS | 0.00 | 0.00 | 0.52 | .600 | [-0.00, 0.00] |
| UGDS_WHITE | 0.07 | 0.02 | 2.98 | .003** | [0.02, 0.12] |
| UGDS_BLACK | -0.11 | 0.03 | -4.35 | < .001*** | [-0.17, -0.06] |
| UGDS_HISP | 0.08 | 0.03 | 2.94 | .003** | [0.03, 0.13] |
| UGDS_ASIAN | 0.51 | 0.05 | 10.12 | < .001*** | [0.41, 0.61] |
| PCTPELL | -0.16 | 0.02 | -7.89 | < .001*** | [-0.20, -0.12] |
| PCTFLOAN | 0.25 | 0.01 | 17.68 | < .001*** | [0.22, 0.28] |
# Extract the Multiple R-squared and Adjusted R-squared
Multiple_R_squared <- summary(CR_Model)$r.squared
Adjusted_R_squared <- summary(CR_Model)$adj.r.squared
# Create a data frame with the extracted values
CR_RSquared_Results_Table <- data.frame(
"Metric" = c("Multiple R-squared", "Adjusted R-squared"),
"Value" = c(Multiple_R_squared, Adjusted_R_squared))
kable(CR_RSquared_Results_Table, caption = "Table 3:
Multiple Linear Regression Model Performance Metrics")
| Metric | Value |
|---|---|
| Multiple R-squared | 0.3295391 |
| Adjusted R-squared | 0.3274012 |
Furthermore, Table 3 provides the performance metrics of the multiple linear regression model used to analyze the factors affecting college completion rates. Two key metrics are presented: the Multiple R-squared and Adjusted R-squared values.
The Multiple R-squared (R^2) value, also known as the coefficient of determination, indicates the proportion of the variance in college completion rates explained by the predictor variables included in the model. With a value of approximately 0.33, it suggests that around 33% of the variability in completion rates can be attributed to the variables considered in the regression model. A higher R-squared value suggests a better fit of the model to the data, signifying stronger explanatory power collectively from the predictor variables.
The Adjusted R-squared value, approximately 0.33 in this case, adjusts for the number of predictor variables in the model. It penalizes the inclusion of unnecessary variables that do not significantly enhance the model’s explanatory ability. Despite being slightly lower than the Multiple R-squared, the Adjusted R-squared value provides a more accurate assessment of the model’s goodness-of-fit, considering the model’s complexity.
From these metrics, it can be inferred that the regression model, while statistically significant, explains only about a third of the variability in college completion rates. This implies the presence of other unaccounted factors influencing completion rates. The Adjusted R-squared (0.327) is almost identical to the Multiple R-squared (0.330), so the fit is not being inflated by the number of predictor variables included in the analysis.
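The relationship between the two metrics in Table 3 can be checked by hand. The sketch below applies the standard adjusted R-squared formula with p = 13 estimated slopes (the non-intercept rows of Table 2). The training-set size is not stated in the text, so `n_assumed = 4091` is an assumption chosen because it is consistent with the reported values.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

r2 <- 0.3295391      # Multiple R-squared from Table 3
p <- 13              # non-intercept coefficients in Table 2
n_assumed <- 4091    # ASSUMPTION: training rows, not reported in the text

adj <- adjusted_r2(r2, n_assumed, p)
adj  # close to the Adjusted R-squared reported in Table 3
```

With thousands of observations and only 13 predictors, the correction factor is very close to 1, which is why the two metrics are nearly equal.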
In the exploration of the multiple linear regression model, ridge regression is applied to further analyze the factors influencing college completion rates. Ridge regression introduces regularization to the model, aiming to mitigate multicollinearity and overfitting by adding a penalty term to the coefficient estimates.
In employing ridge regression with an alpha value of 0, the process
involves determining the optimal lambda value through cross-validation.
This lambda value serves as the regularization parameter, crucial for
minimizing the mean squared error of the model. Specifically, two
important lambda values are calculated: lambda.min and
lambda.1se.
Lambda.min represents the value of lambda that minimizes
the mean squared error, ensuring the best possible fit to the data. It
prioritizes prediction accuracy, potentially resulting in more complex
models.
Lambda.1se, on the other hand, is the largest lambda
value within one standard error of lambda.min. It provides
a more conservative approach, favoring simpler models that are less
susceptible to overfitting while maintaining reasonable predictive
accuracy.
The two values differ (lambda.min: 0.02281474,
lambda.1se: 0.05300874). Table 4 reports the ridge
regression coefficients at lambda.min, and the coefficients at
lambda.1se were extracted for comparison. Examining both values
helps in determining the level of regularization that best
balances model complexity and predictive accuracy.
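The selection of the two lambda values can be sketched with a small made-up cross-validation curve. The vectors below are hypothetical; in a real `cv.glmnet` object the corresponding components are `lambda`, `cvm` (mean cross-validated error), and `cvsd` (its standard error).

```r
# Hypothetical cross-validation results, for illustration only
lambda <- c(1, 0.5, 0.1, 0.05, 0.01)          # candidate penalties
cvm    <- c(0.060, 0.050, 0.041, 0.040, 0.042) # mean CV error per lambda
cvsd   <- rep(0.002, 5)                        # standard error of each mean

# lambda.min: the penalty with the lowest mean CV error
i_min <- which.min(cvm)
lambda_min <- lambda[i_min]

# lambda.1se: the LARGEST penalty whose error is within one
# standard error of the minimum
lambda_1se <- max(lambda[cvm <= cvm[i_min] + cvsd[i_min]])

c(lambda_min = lambda_min, lambda_1se = lambda_1se)
```

In this toy curve the one-standard-error rule picks a penalty twice as large as the error-minimizing one, trading a small loss in fit for a more heavily regularized model.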
# ------------------------------------
# Now apply regularization to the model
# ------------------------------------
# -------------------------------------
# Apply Ridge
# alpha = 0 for Ridge
# Use cv.glmnet to find the best lambda
#---------------------------------------
CR_Model_Ridge_CV = cv.glmnet(as.matrix(CR_Train[, -ncol(CR_Train)]),
CR_Train$C150_4_L4, alpha = 0)
# print the best lambda
# CR_Model_Ridge_CV$lambda.min # lambda.min = 0.005179007
# Print coefficients for the best lambda
CR_Coefficients_Ridge <- predict(CR_Model_Ridge_CV, s = "lambda.min", type = "coefficients")
# Convert the sparse matrix to a regular matrix
CR_Coefficients_Ridge_matrix <- as.matrix(CR_Coefficients_Ridge)
# Convert the matrix to a data frame
CR_Coefficients_Ridge_df <- as.data.frame(CR_Coefficients_Ridge_matrix)
# Create a basic table using kable
kable(CR_Coefficients_Ridge_df, caption = "Table 4: Ridge Regression Coefficients
Analyzing Factors Influencing College Completion Rate")
| | lambda.min |
|---|---|
| (Intercept) | 0.3707355 |
| NUMBRANCH | -0.0074852 |
| HIGHDEG | 0.0000000 |
| UG25ABV | 0.0497032 |
| UGDS | -0.0000035 |
| UGDS_WHITE | 0.1537852 |
| UGDS_BLACK | -0.0315805 |
| UGDS_HISP | 0.1548810 |
| UGDS_ASIAN | 0.5598989 |
| PCTPELL | -0.0761827 |
| PCTFLOAN | 0.2576537 |
# print the best lambda.1se
# CR_Model_Ridge_CV$lambda.1se # lambda.1se = 0.05300874
# Print coefficients for the best lambda.1se
CR_Coefficients_Ridge_Lambda1se <- predict(CR_Model_Ridge_CV, s = "lambda.1se", type = "coefficients")
# Convert the sparse matrix to a regular matrix
CR_Coefficients_Ridge_Lambda1se_matrix <- as.matrix(CR_Coefficients_Ridge_Lambda1se)
# Convert the matrix to a data frame
CR_Coefficients_Ridge_Lambda1se_df <- as.data.frame(CR_Coefficients_Ridge_Lambda1se_matrix)
# Create a basic table using kable (built from the lambda.1se data frame)
# kable(CR_Coefficients_Ridge_Lambda1se_df, caption = "Table 4B: Ridge Regression Coefficients
# Analyzing Factors Influencing College Completion Rate")
### lambda.min and lambda.1se are different values; the larger lambda.1se applies a stronger penalty, so its coefficients are generally shrunk further toward zero than those at lambda.min.
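One detail of the glmnet input is worth illustrating: `as.matrix()` on a data frame that contains a factor returns a character matrix, whereas glmnet expects a numeric one. A common way to handle this is to expand the factor into dummy variables with `model.matrix()`. The sketch below uses a small hypothetical data frame (`toy`, `deg`, `x`) and base R only; it is not the code used to produce Table 4.

```r
# Hypothetical data frame with one factor and one numeric column
toy <- data.frame(
  deg = factor(c("A", "B", "C", "A")),
  x   = c(1.5, 2.0, 3.0, 4.0)
)

# as.matrix() coerces everything to character when a factor is present
m_chr <- as.matrix(toy)
is.character(m_chr)  # TRUE

# model.matrix() expands the factor into numeric dummy columns;
# dropping the first column removes the intercept, which glmnet adds itself
X <- model.matrix(~ deg + x, data = toy)[, -1]
is.numeric(X)        # TRUE
colnames(X)          # one dummy per non-reference level, plus x
```

With this encoding each non-reference category receives its own penalized coefficient, in the same way that `lm()` produced separate HIGHDEG terms in Table 2.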
In Table 4, the results of the ridge regression analysis are presented, detailing the coefficients of predictor variables influencing college completion rates. The analysis reveals valuable insights into the magnitude and direction of these relationships. For example:
Intercept (0.3707355): Serving as the baseline, this coefficient represents the estimated completion rate when all predictor variables are zero. It provides context for interpreting the effects of other variables.
NUMBRANCH (-0.0074852): The negative coefficient suggests that an increase in the number of branches is associated with a slight decrease in completion rates. This implies that institutions with more branches may face challenges in achieving higher completion rates, possibly due to logistical or administrative complexities.
HIGHDEG (0): With a coefficient close to zero, the types of degrees awarded by the college have a negligible influence on completion rates, according to this analysis. This suggests that regardless of the degree types offered (certificate, associate, bachelor, or master’s), their impact on completion rates is minimal.
UG25ABV (0.0497032): A positive coefficient indicates that a higher percentage of undergraduate students over 25 years old correlates with a modest increase in completion rates. This suggests that institutions with a significant proportion of older undergraduate students may have higher completion rates, possibly due to their greater maturity and dedication to completing their studies.
UGDS (-0.0000035): The coefficient looks close to zero because UGDS counts individual students; an increase of 10,000 undergraduates corresponds to a predicted completion rate about 0.035 lower. Enrollment size therefore has a small effect for most institutions, which points to other factors as the main levers for improving outcomes.
Demographic Composition:
UGDS_WHITE (0.1537852): A positive coefficient indicates that a higher percentage of White students is associated with higher completion rates. This is an institution-level association and may reflect socio-economic or institutional differences rather than an effect of demographic composition itself.
UGDS_BLACK (-0.0315805): The negative coefficient suggests that a higher percentage of Black students correlates with lower completion rates. Addressing equity gaps and providing tailored support to underrepresented minority groups may be essential in improving completion rates for these students.
UGDS_HISP (0.1548810): The positive coefficient indicates that a higher percentage of Hispanic students is associated with higher completion rates. This highlights the importance of culturally responsive practices and targeted interventions to support Hispanic student success.
UGDS_ASIAN (0.5598989): Among the demographic variables, the percentage of Asian students has the largest positive coefficient, so institutions with a higher share of Asian students tend to have noticeably higher completion rates. Understanding the factors behind this association may inform strategies to enhance completion rates for all students.
Financial Aid Metrics:
PCTPELL (-0.0761827): The negative coefficient suggests that a higher percentage of Pell Grant recipients correlates with lower completion rates. This underscores the challenges faced by economically disadvantaged students and the importance of addressing barriers to their success.
PCTFLOAN (0.2576537): The positive coefficient indicates that a higher percentage of students with federal loans is associated with higher completion rates. This suggests that access to financial aid, particularly in the form of federal loans, may facilitate degree completion for some students.
Therefore, the ridge regression model is as follows:
Predicted (C150_4_L4) = 0.3707355
  + (-0.0074852) * NUMBRANCH
  + (0.0000000) * HIGHDEG
  + (0.0497032) * UG25ABV
  + (-0.0000035) * UGDS
  + (0.1537852) * UGDS_WHITE
  + (-0.0315805) * UGDS_BLACK
  + (0.1548810) * UGDS_HISP
  + (0.5598989) * UGDS_ASIAN
  + (-0.0761827) * PCTPELL
  + (0.2576537) * PCTFLOAN
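To make the fitted equation concrete, the short R sketch below evaluates it for a single institution. The coefficients are copied from Table 4; the predictor values describe a hypothetical school and were chosen only for illustration, not taken from the dataset.

```r
# Ridge coefficients at lambda.min, copied from Table 4
ridge_coef <- c(
  "(Intercept)" = 0.3707355, NUMBRANCH = -0.0074852, HIGHDEG = 0,
  UG25ABV = 0.0497032, UGDS = -0.0000035, UGDS_WHITE = 0.1537852,
  UGDS_BLACK = -0.0315805, UGDS_HISP = 0.1548810, UGDS_ASIAN = 0.5598989,
  PCTPELL = -0.0761827, PCTFLOAN = 0.2576537
)

# Hypothetical institution (illustrative values only, not from the dataset)
example_school <- c(
  NUMBRANCH = 1, HIGHDEG = 3, UG25ABV = 0.2, UGDS = 5000,
  UGDS_WHITE = 0.5, UGDS_BLACK = 0.1, UGDS_HISP = 0.2, UGDS_ASIAN = 0.05,
  PCTPELL = 0.4, PCTFLOAN = 0.5
)

# Intercept plus the sum of coefficient * value, matched by predictor name
predicted_rate <- unname(
  ridge_coef["(Intercept)"] + sum(ridge_coef[names(example_school)] * example_school)
)
print(round(predicted_rate, 4))
```

Matching by name rather than by position guards against the predictors being listed in a different order than the coefficients.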
Overall, ridge regression enhances the understanding of the relationships between predictor variables and college completion rates by incorporating regularization techniques. The resulting coefficients offer valuable insights for decision-makers in academia to identify and address factors affecting completion rates effectively.
In this ridge regression analysis, the data is partitioned into training and testing sets. The analysis begins with the training set, which consists of 3,554 observations (about 70% of the 5,074 rows; the test set holds the remaining 1,520). For the training data, two metrics, labeled 4A and 4B, were computed to assess the model’s predictive performance.
With the lambda.min parameter, the training R-squared (4A) was 0.1240899. This indicates that approximately 12.41% of the variability in college completion rates is explained by the predictor variables included in the model, so the model captures only a modest portion of the variation in completion rates.
# make predictions (lambda.min)
CR_Predictions_Ridge_CV_Train = predict(CR_Model_Ridge_CV, newx = as.matrix(CR_Train[, -ncol(CR_Train)]),s = "lambda.min")
# Calculate R2
CR_R2_Ridge_CV_Train = 1 - (sum((CR_Train$C150_4_L4 - CR_Predictions_Ridge_CV_Train)^2) / sum((CR_Train$C150_4_L4 - mean(CR_Train$C150_4_L4))^2))
kable(CR_R2_Ridge_CV_Train, caption = "4A: Ridge Regression R-squared for College Completion Rate Prediction (lambda.min)")
| x |
|---|
| 0.1240899 |
# make predictions (lambda.1se)
CR_Predictions_Ridge_CV_Train_Lambda1se = predict(CR_Model_Ridge_CV, newx = as.matrix(CR_Train[, -ncol(CR_Train)]),s = "lambda.1se")
# Calculate R2
CR_R2_Ridge_CV_Train_Lambda1se = 1 - (sum((CR_Train$C150_4_L4 - CR_Predictions_Ridge_CV_Train_Lambda1se)^2) / sum((CR_Train$C150_4_L4 - mean(CR_Train$C150_4_L4))^2))
kable(CR_R2_Ridge_CV_Train_Lambda1se, caption = "4B: Ridge Regression R-squared for College Completion Rate Prediction (lambda.1se)")
| x |
|---|
| 0.1138844 |
With the lambda.1se parameter, the training R-squared (4B) was 0.1138844, meaning that approximately 11.39% of the variability in college completion rates is explained by the model. This is slightly lower than the R^2 obtained with lambda.min, as expected under the stronger penalty.
Neither R^2 value is high: the ridge model captures only a small portion of the variability in completion rates, so conclusions drawn from it for institutional decision-making should be treated as tentative.
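The R-squared expression is repeated verbatim in every evaluation chunk of this report. As a minimal sketch, it could be wrapped in a small helper; the name `r_squared` is an assumption introduced here, and the toy numbers are made up to check the two boundary cases.

```r
# R-squared: 1 - (residual sum of squares / total sum of squares)
r_squared <- function(actual, predicted) {
  predicted <- as.numeric(predicted)  # flatten a one-column prediction matrix
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  1 - ss_res / ss_tot
}

# Toy check with made-up numbers (not from the college dataset)
toy_actual <- c(0.2, 0.4, 0.6, 0.8)
r2_perfect <- r_squared(toy_actual, toy_actual)
r2_mean_only <- r_squared(toy_actual, rep(mean(toy_actual), 4))
print(c(perfect = r2_perfect, mean_only = r2_mean_only))
```

With such a helper, a call like `r_squared(CR_Train$C150_4_L4, CR_Predictions_Ridge_CV_Train)` would stand in for the long inline expression, which makes it harder to mix up training and test objects.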
The test dataset, comprising 1,520 observations, is used to evaluate how well the ridge regression model predicts college completion rates on unseen data. Two metrics, labeled 4C and 4D, were computed to assess the model’s predictive performance.
With the lambda.min parameter, the test R-squared (4C) was 0.1417864. This indicates that approximately 14.18% of the variability in college completion rates is explained by the predictor variables included in the model, a modest level of predictive ability on the test data.
# make predictions (lambda.min)
CR_Predictions_Ridge_CV_Test = predict(CR_Model_Ridge_CV, newx = as.matrix(CR_Test[, -ncol(CR_Test)]), s = "lambda.min")
# Calculate R2
CR_R2_Ridge_CV_Test = 1 - (sum((CR_Test$C150_4_L4 - CR_Predictions_Ridge_CV_Test)^2) / sum((CR_Test$C150_4_L4 - mean(CR_Test$C150_4_L4))^2))
kable(CR_R2_Ridge_CV_Test, caption = "4C: Ridge Regression R-squared for College Completion Rate Prediction (Test Data)")
| x |
|---|
| 0.1417864 |
# make predictions (lambda.1se) on the held-out test set
CR_Predictions_Ridge_CV_Test_Lambda1se = predict(CR_Model_Ridge_CV, newx = as.matrix(CR_Test[, -ncol(CR_Test)]), s = "lambda.1se")
# Calculate R2 against the test responses
CR_R2_Ridge_CV_Test_Lambda1se = 1 - (sum((CR_Test$C150_4_L4 - CR_Predictions_Ridge_CV_Test_Lambda1se)^2) / sum((CR_Test$C150_4_L4 - mean(CR_Test$C150_4_L4))^2))
kable(CR_R2_Ridge_CV_Test_Lambda1se, caption = "4D: Ridge Regression R-squared for College Completion Rate Prediction (Test Data, lambda.1se)")
| x |
|---|
| 0.1138844 |
For the lambda.1se parameter, the reported R-squared (4D) is 0.1138844, or approximately 11.39% of the variability in college completion rates. This figure is identical to the training value in 4B, so it should not be read as independent evidence of test-set performance; the lambda.min result in 4C is the dependable test-set figure here.
On unseen test data, the ridge model explains about 14% of the variability in completion rates at lambda.min. This is comparable to its training performance, which indicates little overfitting but limited predictive power. The model’s performance also varies with the choice of regularization parameter. These results can inform decision-making within educational institutions, but they should be weighed alongside factors not captured by the model.
In this section, the application of LASSO (Least Absolute Shrinkage and Selection Operator) regression is detailed. LASSO is employed to identify the most significant predictors of college completion rates while simultaneously performing variable selection.
The analysis begins with fitting the LASSO model to the training
data. By setting the alpha parameter to 1, LASSO regularization is
applied, favoring sparse models where some coefficients are reduced to
exactly zero. Through cross-validation, the optimal lambda value is
determined to strike a balance between minimizing prediction error and
preventing overfitting. In this analysis, both the
lambda.min value (0.0003104733) and the
lambda.1se value (0.007341109) are calculated, reflecting
different levels of regularization. Table 5A and 5B will respectively
present the LASSO model based on each lambda value, providing insights
into the selected predictors and their coefficients.
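Why LASSO produces exact zeros while ridge does not can be seen in a simplified setting. Assuming orthonormal predictors, with the LASSO objective written as (1/2) * RSS + lambda * sum(|b|) and the ridge objective as (1/2) * RSS + (lambda / 2) * sum(b^2), LASSO soft-thresholds each least-squares coefficient while ridge only rescales it. The sketch below uses made-up coefficients and is not how the models in this report were fitted; glmnet runs coordinate descent on the actual data and scales lambda in its own way.

```r
# LASSO under an orthonormal design: shrink toward zero by lambda, clip at zero
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

# Ridge under the same assumption: proportional shrinkage, never exactly zero
ridge_shrink <- function(b, lambda) b / (1 + lambda)

# Hypothetical least-squares coefficients (illustrative only)
ols_coef <- c(large = 0.50, small = -0.05, medium = 0.20)
lambda <- 0.10

lasso_coef <- soft_threshold(ols_coef, lambda)
ridge_coef_toy <- ridge_shrink(ols_coef, lambda)
print(rbind(ols = ols_coef, lasso = lasso_coef, ridge = ridge_coef_toy))
```

Any coefficient whose magnitude is below lambda is set to zero by the soft threshold, which is the variable-selection behavior seen in Table 5B.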
Table 5A (lambda.min) presents the coefficients
resulting from the LASSO regression analysis, which aims to identify the
most significant predictors of college completion rates. Starting with
the intercept (0.3504087), this coefficient represents the completion
rate when all predictor variables are zero. In this context, it serves
as a baseline for comparison.
Moving on to the predictor variables:
NUMBRANCH (-0.0074946): This negative coefficient suggests that an increase in the number of branches corresponds to a slight reduction in completion rates. While the effect is relatively minor, it highlights the potential challenges associated with managing multiple branches and the importance of strategic planning in mitigating any adverse impact on completion rates.
HIGHDEG (0): With a coefficient of zero, the types of degrees awarded by the college contribute nothing to the predicted completion rate in the LASSO model at lambda.min.
UG25ABV (0.0494927): The positive coefficient indicates that a higher percentage of undergraduate students over 25 years old is associated with a slight increase in completion rates. This suggests that older students may possess greater determination or commitment to completing their studies, contributing positively to overall completion rates.
UGDS (-0.0000036): The coefficient looks close to zero because UGDS counts individual students; an increase of 10,000 undergraduates corresponds to a predicted completion rate about 0.036 lower. While enrollment size is an important consideration for educational institutions, the model suggests that its effect on completion rates is small.
Demographic Composition:
UGDS_WHITE (0.1754200): This positive coefficient suggests that a higher percentage of White students positively influences completion rates. It implies that colleges with a larger proportion of White students tend to exhibit higher completion rates, possibly due to various socio-economic factors or institutional support mechanisms tailored to this demographic.
UGDS_BLACK (-0.0105645): Conversely, the negative coefficient indicates that a higher percentage of Black students is associated with lower completion rates. This underscores the importance of addressing disparities in educational outcomes and implementing targeted interventions to support Black students and enhance their chances of completing their academic programs.
UGDS_HISP (0.1787894): The positive coefficient suggests that a higher percentage of Hispanic students positively impacts completion rates. Educational institutions with a significant Hispanic student population may benefit from cultural competency initiatives and support services tailored to the needs of Hispanic students to improve their retention and completion rates.
UGDS_ASIAN (0.5943168): With the highest coefficient among demographic variables, a higher percentage of Asian students significantly boosts completion rates. This indicates that colleges with a substantial Asian student population tend to have higher completion rates, possibly due to cultural factors, academic preparedness, or other institutional characteristics that facilitate student success.
Financial Aid Metrics:
PCTPELL (-0.0817687): The negative coefficient suggests that a higher percentage of Pell Grant recipients is associated with lower completion rates. This highlights the challenges faced by students from low-income backgrounds and underscores the importance of financial aid policies and support programs in promoting student success and retention.
PCTFLOAN (0.2646317): The positive coefficient indicates that a higher percentage of students with federal loans correlates with higher completion rates. This suggests that access to federal loans may enable students to overcome financial barriers and complete their academic programs successfully.
Therefore, the LASSO regression model with the coefficients at lambda.min is as follows:
Predicted (C150_4_L4) = 0.3504087
  + (-0.0074946) * NUMBRANCH
  + 0.0494927 * UG25ABV
  + (-0.0000036) * UGDS
  + 0.1754200 * UGDS_WHITE
  + (-0.0105645) * UGDS_BLACK
  + 0.1787894 * UGDS_HISP
  + 0.5943168 * UGDS_ASIAN
  + (-0.0817687) * PCTPELL
  + 0.2646317 * PCTFLOAN
# ------------------------------------
# Apply Lasso and Fit the model
# alpha = 1 for Lasso
# Use Cv.glmnet to find the best lambda
# ------------------------------------
# Creating Model
CR_Model_Lasso_CV = cv.glmnet(as.matrix(CR_Train[, -ncol(CR_Train)]), CR_Train$C150_4_L4, alpha = 1)
# print the best lambda (lambda.min)
# CR_Model_Lasso_CV$lambda.min # lambda.min = 0.0003104733
# Print coefficients for the best lambda
CR_Coefficients_LASSO <- predict(CR_Model_Lasso_CV, s = "lambda.min", type = "coefficients")
# Convert the sparse matrix to a regular matrix
CR_Coefficients_LASSO_Matrix <- as.matrix(CR_Coefficients_LASSO)
# Convert the matrix to a data frame
# Convert coefficients to a data frame
CR_Coefficients_LASSO_df <- as.data.frame(CR_Coefficients_LASSO_Matrix)
# Create a basic table using kable
kable(CR_Coefficients_LASSO_df, caption = "Table 5A: LASSO Regression Coefficients
Analyzing Factors Influencing College Completion Rate")
| | lambda.min |
|---|---|
| (Intercept) | 0.3504087 |
| NUMBRANCH | -0.0074946 |
| HIGHDEG | 0.0000000 |
| UG25ABV | 0.0494927 |
| UGDS | -0.0000036 |
| UGDS_WHITE | 0.1754200 |
| UGDS_BLACK | -0.0105645 |
| UGDS_HISP | 0.1787894 |
| UGDS_ASIAN | 0.5943168 |
| PCTPELL | -0.0817687 |
| PCTFLOAN | 0.2646317 |
# print the best lambda (lambda.1se)
# CR_Model_Lasso_CV$lambda.1se # lambda.1se = 0.007341109
# Print coefficients for the best lambda (lambda.1se)
CR_Coefficients_LASSO_Lambda1se <- predict(CR_Model_Lasso_CV, s = "lambda.1se", type = "coefficients")
# Convert the sparse matrix to a regular matrix
CR_Coefficients_LASSO_Lambda1se_Matrix <- as.matrix(CR_Coefficients_LASSO_Lambda1se)
# Convert the matrix to a data frame
# Convert coefficients to a data frame
CR_Coefficients_LASSO_Lambda1se_df <- as.data.frame(CR_Coefficients_LASSO_Lambda1se_Matrix)
# Create a basic table using kable
kable(CR_Coefficients_LASSO_Lambda1se_df, caption = "Table 5B: LASSO Regression Coefficients
Analyzing Factors Influencing College Completion Rate")
| | lambda.1se |
|---|---|
| (Intercept) | 0.4853050 |
| NUMBRANCH | -0.0048453 |
| HIGHDEG | 0.0000000 |
| UG25ABV | 0.0000000 |
| UGDS | -0.0000021 |
| UGDS_WHITE | 0.0235416 |
| UGDS_BLACK | -0.1174150 |
| UGDS_HISP | 0.0000000 |
| UGDS_ASIAN | 0.3369200 |
| PCTPELL | 0.0000000 |
| PCTFLOAN | 0.2025453 |
Table 5B (lambda.1se) displays the coefficients derived from the LASSO regression analysis, aimed at identifying significant predictors of college completion rates.
Intercept (0.4853050): This intercept represents the baseline completion rate when all predictor variables are zero, providing a reference point for comparison.
Predictor Variables:
NUMBRANCH (-0.0048453): The negative coefficient suggests that an increase in the number of branches is associated with a slight decrease in completion rates. While the effect is minimal, it highlights the importance of strategic planning in managing multiple branches to mitigate any adverse impact on completion rates.
HIGHDEG (0): With a coefficient of zero, the types of degrees awarded by the college are excluded from the LASSO model.
UG25ABV (0): The coefficient is zero at lambda.1se, so the LASSO drops the percentage of undergraduate students over 25 years old from the model; it had a small positive coefficient at lambda.min.
UGDS (-0.0000021): The coefficient is small because UGDS counts individual students; an increase of 10,000 undergraduates corresponds to a predicted completion rate about 0.021 lower.
Demographic Composition:
UGDS_WHITE (0.0235416): A positive coefficient implies that a higher percentage of White students positively influences completion rates, possibly due to socio-economic factors or institutional support mechanisms tailored to this demographic.
UGDS_BLACK (-0.1174150): Conversely, the negative coefficient indicates that a higher percentage of Black students is associated with lower completion rates, highlighting the importance of addressing disparities in educational outcomes.
UGDS_HISP (0): With a coefficient of zero, the percentage of Hispanic students is dropped from the LASSO model at lambda.1se.
UGDS_ASIAN (0.3369200): The positive coefficient indicates that a higher percentage of Asian students significantly boosts completion rates, suggesting cultural factors or institutional characteristics that facilitate student success.
Financial Aid Metrics:
PCTPELL (0): The coefficient is zero at lambda.1se, so the percentage of Pell Grant recipients is dropped from the model, although it had a negative coefficient at lambda.min.
PCTFLOAN (0.2025453): A positive coefficient implies that a higher percentage of students with federal loans correlates with higher completion rates, indicating the role of federal loans in overcoming financial barriers to academic completion.
Therefore, the LASSO regression model with the coefficients at lambda.1se is as follows:
Predicted (C150_4_L4) = 0.4853050
  - 0.0048453 * NUMBRANCH
  - 0.0000021 * UGDS
  + 0.0235416 * UGDS_WHITE
  - 0.1174150 * UGDS_BLACK
  + 0.3369200 * UGDS_ASIAN
  + 0.2025453 * PCTFLOAN
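The variable selection performed at lambda.1se can be read directly off the coefficient vector. The sketch below copies the values from Table 5B and assumes that the entries printed as 0.0000000 are exact zeros, as the equation above treats them.

```r
# LASSO coefficients at lambda.1se, copied from Table 5B (intercept omitted)
lasso_1se_coef <- c(
  NUMBRANCH = -0.0048453, HIGHDEG = 0, UG25ABV = 0, UGDS = -0.0000021,
  UGDS_WHITE = 0.0235416, UGDS_BLACK = -0.1174150, UGDS_HISP = 0,
  UGDS_ASIAN = 0.3369200, PCTPELL = 0, PCTFLOAN = 0.2025453
)

# Predictors whose coefficient was shrunk to zero (dropped from the model)
dropped <- names(lasso_1se_coef)[lasso_1se_coef == 0]
# Predictors that remain in the model
retained <- names(lasso_1se_coef)[lasso_1se_coef != 0]
print(dropped)
print(retained)
```

On a fitted object, the same split could be obtained from the coefficient data frame built in the chunk above rather than from hand-copied values.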
These findings provide valuable insights into the complex interplay of factors influencing college completion rates and can inform strategic decision-making and policy development aimed at enhancing student success and retention in higher education institutions.
In this LASSO analysis, the same training and testing partition is used. The analysis begins with the training set, which consists of 3,554 observations (about 70% of the data). With the lambda.min parameter (5C), the model achieved an R-squared value of 0.1244594 on the training data, meaning that approximately 12.45% of the variability in college completion rates is explained by the predictor variables included in the model.
# make predictions (lambda.min)
CR_Predictions_Lasso_CV_Train = predict(CR_Model_Lasso_CV, newx = as.matrix(CR_Train[, -ncol(CR_Train)]), s = "lambda.min")
# Calculate R2
CR_R2_Lasso_CV_Train = 1 - (sum((CR_Train$C150_4_L4 - CR_Predictions_Lasso_CV_Train)^2) / sum((CR_Train$C150_4_L4 - mean(CR_Train$C150_4_L4))^2))
kable(CR_R2_Lasso_CV_Train, caption = "5C: LASSO Regression R-squared for College Completion Rate Prediction (lambda.min)")
| x |
|---|
| 0.1244594 |
# make predictions (lambda.1se)
CR_Predictions_Lasso_CV_Train_Lambda1se = predict(CR_Model_Lasso_CV, newx = as.matrix(CR_Train[, -ncol(CR_Train)]), s = "lambda.1se")
# Calculate R2
CR_R2_Lasso_CV_Train_Lambda1se = 1 - (sum((CR_Train$C150_4_L4 - CR_Predictions_Lasso_CV_Train_Lambda1se)^2) / sum((CR_Train$C150_4_L4 - mean(CR_Train$C150_4_L4))^2))
kable(CR_R2_Lasso_CV_Train_Lambda1se, caption = "5D: LASSO Regression R-squared for College Completion Rate Prediction (lambda.1se)")
| x |
|---|
| 0.1056068 |
Similarly, utilizing the lambda.1se (5D) parameter
resulted in an R-squared value of 0.1056068. These findings suggest that
while the LASSO model demonstrates some predictive ability, it explains
only a modest portion of the variance in college completion rates. This
implies that factors beyond those included in the current model may also
influence completion rates and should be considered for a more
comprehensive understanding. Further refinement of the model or
exploration of additional predictors may be necessary to improve its
predictive accuracy and capture a more substantial portion of the
variability in completion rates.
In assessing the test data, the LASSO regression analysis revealed
notable insights into the predictive performance of the model. When
using the lambda.min parameter, the model achieved an R^2
value of 0.1435921. This indicates that approximately 14.36% of the
variability in college completion rate prediction can be explained by
the predictor variables included in the model.
# make predictions (lambda.min)
CR_Predictions_Lasso_CV_Test = predict(CR_Model_Lasso_CV, newx = as.matrix(CR_Test[, -ncol(CR_Test)]), s = "lambda.min")
# Calculate R2
CR_R2_Lasso_CV_Test = 1 - (sum((CR_Test$C150_4_L4 - CR_Predictions_Lasso_CV_Test)^2) / sum((CR_Test$C150_4_L4 - mean(CR_Test$C150_4_L4))^2))
kable(CR_R2_Lasso_CV_Test, caption = "5E: LASSO Regression R-squared for College Completion Rate Prediction (Test Data)")
| x |
|---|
| 0.1435921 |
# make predictions (lambda.1se)
CR_Predictions_Lasso_CV_Test_Lambda1se = predict(CR_Model_Lasso_CV, newx = as.matrix(CR_Test[, -ncol(CR_Test)]), s = "lambda.1se")
# Calculate R2
CR_R2_Lasso_CV_Test_Lambda1se = 1 - (sum((CR_Test$C150_4_L4 - CR_Predictions_Lasso_CV_Test_Lambda1se)^2) / sum((CR_Test$C150_4_L4 - mean(CR_Test$C150_4_L4))^2))
kable(CR_R2_Lasso_CV_Test_Lambda1se, caption = "5F: LASSO Regression R-squared for College Completion Rate Prediction (Test Data)")
| x |
|---|
| 0.1083415 |
Similarly, employing the lambda.1se parameter resulted
in an R^2 value of 0.1083415. These findings suggest that the LASSO
model demonstrates some predictive ability on the test data but explains
a modest portion of the variance in college completion rates. While the
model exhibits a slightly higher R^2 value compared to the training
data, indicating better performance on the test set, it still suggests
that additional factors beyond those considered in the current model may
influence completion rates. Therefore, further refinement or exploration
of additional predictors may be necessary to enhance the model’s
predictive accuracy and capture a more substantial portion of the
variability in completion rates.
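The R-squared values reported across outputs 4A to 4D and 5C to 5F are easier to compare side by side. The sketch below collects them in one data frame; the values are copied from the outputs above, and the ridge test entry for lambda.1se is left as NA because the reported 4D value is identical to the training value in 4B.

```r
# R-squared values as reported in outputs 4A-4D and 5C-5F
r2_summary <- data.frame(
  model  = rep(c("Ridge", "LASSO"), each = 2),
  lambda = rep(c("lambda.min", "lambda.1se"), times = 2),
  train  = c(0.1240899, 0.1138844, 0.1244594, 0.1056068),
  test   = c(0.1417864, NA, 0.1435921, 0.1083415)
)

# Positive gap means the model scored higher on test than on train
r2_summary$test_minus_train <- r2_summary$test - r2_summary$train
print(r2_summary)
```

In every available comparison the test value is slightly higher than the training value, and ridge and LASSO differ by less than 0.01 at lambda.min.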
As previously noted in another analysis, “Completion Rate Median” was chosen as the response variable because it is balanced: its classes are distributed relatively evenly, so neither class dominates the other in frequency. Specifically, 2,534 instances are classified as “below median” (0) and 2,577 as “above median” (1). This balance reduces the risk that a predictive model simply favors the majority class, leading to a more robust analysis.
The frequency table provided in Table 6 represents this observation, demonstrating a nearly equivalent number of instances in each category, further validating the balanced nature of the response variable. This balance enhances the reliability of predictive models built on this variable, ensuring that they accurately represent the underlying data distribution and can provide actionable insights without being skewed towards any specific outcome.
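The balance claim can be quantified from the Table 6 counts. The sketch below computes each class share and the accuracy of always predicting the larger class, which serves as a no-skill reference point for the classifier evaluated later; the variable names are introduced here for illustration.

```r
# Class counts copied from Table 6 (0 = below median, 1 = above median)
class_counts <- c("0" = 2534, "1" = 2577)
class_share <- class_counts / sum(class_counts)

# Accuracy of always predicting the larger class (a no-skill reference point)
majority_baseline <- max(class_share)
print(round(class_share, 4))
print(round(majority_baseline, 4))
```

With shares this close to one half, a model needs to beat roughly 50% accuracy to show any predictive value.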
A logistic regression analysis was conducted to investigate the factors associated with a college being above or below the median completion rate (Completion_Rate_Median). Completion_Rate_Median is the response variable, and the predictor variables include the number of branches, types of degrees awarded, demographics of the student body, and financial aid statistics. Table 7 reports the estimated coefficients, which are on the log-odds scale, along with their statistical significance:
Intercept (-0.75): The intercept is the log-odds of a college being above the median completion rate when all numeric predictors are zero and HIGHDEG is at its reference category. It is not a completion rate, and it is not statistically significant (p = 0.454).
NUMBRANCH (-0.03): Each additional branch lowers the log-odds of being above the median by about 0.03, a small but statistically significant effect (p = 0.023). One possible explanation is that spreading resources across multiple branches weakens support systems.
HIGHDEG Variables (each compared with the reference degree category):
HIGHDEGCertificate Degree (2.14): Colleges in the Certificate Degree category have log-odds of being above the median that are 2.14 higher than those in the reference category, a statistically significant difference (p = 0.023).
HIGHDEGAssociate Degree (-0.60): The coefficient is negative but not statistically significant (p = 0.524).
HIGHDEGBachelors Degree (-0.51): The coefficient is negative but not statistically significant (p = 0.592).
HIGHDEGGraduate Degree (-0.21): The coefficient is close to zero and not statistically significant (p = 0.822).
UG25ABV (-0.36): A higher share of undergraduates over 25 years old is associated with lower log-odds of being above the median, but the effect is not statistically significant at the 5% level (p = 0.070).
UGDS Variables:
UGDS_WHITE (0.29): A higher share of White students is associated with higher log-odds of being above the median, though the effect is not statistically significant (p = 0.384).
UGDS_BLACK (-1.84): A higher share of Black students is associated with lower log-odds of being above the median, and the effect is statistically significant (p < 0.001).
UGDS_HISP (0.57): A higher share of Hispanic students is associated with higher log-odds of being above the median, though the effect is not statistically significant (p = 0.096).
UGDS_ASIAN (5.51): This is the largest coefficient in the model. A higher share of Asian students is associated with substantially higher log-odds of being above the median, and the effect is statistically significant (p < 0.001).
PCTPELL (-2.03): A higher share of Pell Grant recipients is associated with lower log-odds of being above the median (p < 0.001), consistent with the challenges faced by economically disadvantaged students.
PCTFLOAN (2.94): A higher share of students with federal loans is associated with higher log-odds of being above the median (p < 0.001), which may reflect the role of financial aid in supporting completion.
Therefore, the logistic regression model is as follows:
logit(P(Completion_Rate_Median = 1)) = -0.75
  - 0.03 * NUMBRANCH
  + 2.14 * HIGHDEG_Certificate_Degree
  - 0.60 * HIGHDEG_Associate_Degree
  - 0.51 * HIGHDEG_Bachelors_Degree
  - 0.21 * HIGHDEG_Graduate_Degree
  - 0.36 * UG25ABV
  + 0.00 * UGDS
  + 0.29 * UGDS_WHITE
  - 1.84 * UGDS_BLACK
  + 0.57 * UGDS_HISP
  + 5.51 * UGDS_ASIAN
  - 2.03 * PCTPELL
  + 2.94 * PCTFLOAN
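Because the model is written on the logit scale, its coefficients are changes in log-odds rather than changes in the completion rate. The sketch below shows the two standard conversions in base R: exponentiating a coefficient gives an odds ratio per one-unit increase in the predictor, and `plogis()` maps a linear predictor to a probability. The coefficients are copied from Table 7; the linear predictor values are hypothetical.

```r
# Logistic coefficients (log-odds scale), copied from Table 7
logit_coef <- c(
  NUMBRANCH = -0.03, UG25ABV = -0.36, UGDS_WHITE = 0.29, UGDS_BLACK = -1.84,
  UGDS_HISP = 0.57, UGDS_ASIAN = 5.51, PCTPELL = -2.03, PCTFLOAN = 2.94
)

# exp(coefficient) = multiplicative change in the odds per one-unit increase
odds_ratio <- exp(logit_coef)
print(round(odds_ratio, 3))

# The inverse logit turns a linear predictor into a probability
linear_predictor <- c(-2, 0, 2)  # hypothetical values for illustration
probability <- plogis(linear_predictor)
print(round(probability, 3))
```

An odds ratio below 1 (as for NUMBRANCH, about 0.97 per extra branch) lowers the odds of being above the median, and a ratio above 1 raises them. For the share variables, a one-unit increase spans the whole range of the variable, so their odds ratios describe an extreme contrast.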
Furthermore, this model was evaluated on a train and a test dataset. On the training dataset, which comprises 80% of the total data and accounts for the 4,088 observations classified in Table 8, the logistic regression model was evaluated to understand its predictive performance. A confusion matrix was generated to compare the model’s predictions with the actual outcomes for Completion_Rate_Median.
# Prep Work
# Subset CollegeDataset2 to the predictors and the "Completion_Rate_Median" response
CRM_Data <- CollegeDataset2[c("NUMBRANCH",
"HIGHDEG",
"UG25ABV",
"UGDS",
"UGDS_WHITE",
"UGDS_BLACK",
"UGDS_HISP",
"UGDS_ASIAN",
"PCTPELL",
"PCTFLOAN",
"Completion_Rate_Median")]
# ----------------------------------------------------
# Convert y (Completion_Rate_Median) values to 0 and 1
# ----------------------------------------------------
CRM_Data$Completion_Rate_Median <- ifelse(CRM_Data$Completion_Rate_Median
== "Above Median", 1, 0)
CRM_Freq_Table <- table(CRM_Data$Completion_Rate_Median)
knitr::kable(CRM_Freq_Table, caption = "Table 6:
Frequency Table of Completion Rate Median")
| Var1 | Freq |
|---|---|
| 0 | 2534 |
| 1 | 2577 |
# ---------------------------------
# Fit the logistic regression model
# ---------------------------------
# Split the data into TRAIN sets
set.seed(1996) # Set seed for reproducibility
# TRAIN
CRM_Train_Indices <- sample(nrow(CRM_Data), nrow(CRM_Data) * 0.8) # 80% for training
CRM_Train_Data <- CRM_Data[CRM_Train_Indices, ]
# dim(CRM_Train_Data): 4088 rows (matches the Table 8 total)
# TEST
CRM_Test_Data <- CRM_Data[-CRM_Train_Indices, ] # 20% for testing
# dim(CRM_Test_Data): 1023 rows (matches the Table 9 total)
# Fit the logistic regression model on the CRM_Train_Data
CRM_Model <- glm(Completion_Rate_Median ~ NUMBRANCH + HIGHDEG + UG25ABV +
UGDS + UGDS_WHITE + UGDS_BLACK + UGDS_HISP + UGDS_ASIAN +
PCTPELL + PCTFLOAN, data = CRM_Train_Data, family = binomial)
# summary(CRM_Model) ##### USE TRAIN MODEL!!!
# Presenting findings in table format
CRM_Model_Table <- tidy(CRM_Model, conf.int = TRUE)
nice_table(CRM_Model_Table, title = "Table 7:
Logistic Regression Analysis of Factors Influencing College Completion Rate Median")
Table 7: Logistic Regression Analysis of Factors Influencing College Completion Rate Median
| Term | estimate | std.error | statistic | p | 95% CI |
|---|---|---|---|---|---|
| (Intercept) | -0.75 | 1.00 | -0.75 | .454 | [-2.94, 1.12] |
| NUMBRANCH | -0.03 | 0.02 | -2.28 | .023* | [-0.06, -0.01] |
| HIGHDEGCertificate Degree | 2.14 | 0.94 | 2.27 | .023* | [0.39, 4.25] |
| HIGHDEGAssociate Degree | -0.60 | 0.94 | -0.64 | .524 | [-2.36, 1.51] |
| HIGHDEGBachelors Degree | -0.51 | 0.95 | -0.54 | .592 | [-2.28, 1.61] |
| HIGHDEGGraduate Degree | -0.21 | 0.95 | -0.22 | .822 | [-1.98, 1.90] |
| UG25ABV | -0.36 | 0.20 | -1.81 | .070 | [-0.76, 0.03] |
| UGDS | 0.00 | 0.00 | 0.48 | .632 | [-0.00, 0.00] |
| UGDS_WHITE | 0.29 | 0.33 | 0.87 | .384 | [-0.35, 0.94] |
| UGDS_BLACK | -1.84 | 0.36 | -5.09 | < .001*** | [-2.54, -1.13] |
| UGDS_HISP | 0.57 | 0.35 | 1.66 | .096 | [-0.10, 1.26] |
| UGDS_ASIAN | 5.51 | 0.78 | 7.03 | < .001*** | [4.00, 7.08] |
| PCTPELL | -2.03 | 0.27 | -7.47 | < .001*** | [-2.56, -1.50] |
| PCTFLOAN | 2.94 | 0.19 | 15.46 | < .001*** | [2.57, 3.32] |
# TRAIN
# Make predictions
CRM_Predictions_Logistic = predict(CRM_Model, CRM_Train_Data, type = "response")
# convert the probabilities to 0 and 1
CRM_Predictions_Logistic = ifelse(CRM_Predictions_Logistic > 0.5, 1, 0)
# Print the confusion matrix
CRM_Train_Confusion_Matrix <- table(CRM_Train_Data$Completion_Rate_Median, CRM_Predictions_Logistic)
# print(CRM_Train_Confusion_Matrix)
kable(CRM_Train_Confusion_Matrix, caption = "Table 8:
Confusion Matrix for College Completion Rate Median from Train Data")
| | 0 | 1 |
|---|---|---|
| 0 | 1524 | 498 |
| 1 | 470 | 1596 |
# TEST
# Make predictions
CRM_Predictions_Logistic_Test = predict(CRM_Model, CRM_Test_Data, type = "response")
# convert the probabilities to 0 and 1
CRM_Predictions_Logistic_Test = ifelse(CRM_Predictions_Logistic_Test > 0.5, 1, 0)
# Print the confusion matrix
CRM_Test_Confusion_Matrix <- table(CRM_Test_Data$Completion_Rate_Median, CRM_Predictions_Logistic_Test)
# print(CRM_Test_Confusion_Matrix)
kable(CRM_Test_Confusion_Matrix, caption = "Table 9:
Confusion Matrix for College Completion Rate Median from Test Data")
| | 0 | 1 |
|---|---|---|
| 0 | 393 | 119 |
| 1 | 129 | 382 |
In the confusion matrix for the train dataset (Table 8), rows give the actual class and columns the predicted class. The model correctly identified 1,524 colleges with a low completion rate (class 0, true negatives) and 1,596 colleges with a high completion rate (class 1, true positives). It also misclassified 498 low-completion colleges as high (false positives) and 470 high-completion colleges as low (false negatives), for a train accuracy of about 76.3% (3,120 of 4,088).
Similarly, the test dataset, representing 20% of the data with 1,023 observations (the total of the counts in Table 9), was used to evaluate the model’s generalization performance. In the test dataset’s confusion matrix (Table 9), the model correctly identified 393 low-completion colleges (true negatives) and 382 high-completion colleges (true positives). It misclassified 119 low-completion colleges as high (false positives) and 129 high-completion colleges as low (false negatives), for a test accuracy of about 75.8% (775 of 1,023).
Overall, the logistic regression model performs reasonably well: its accuracy is about 76% on both the train and test data, so there is little sign of overfitting. However, roughly one in four colleges is still misclassified. Further investigation into the misclassified instances, periodic recalibration of the model, and adding or refining predictor variables may improve its accuracy in predicting the completion rate class.
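The counts in Tables 8 and 9 can be summarized as accuracy, sensitivity, and specificity. The following is a minimal sketch in base R, assuming rows are the actual class and columns the predicted class (the layout produced by table(actual, predicted) above). The helper names make_cm and classification_metrics are hypothetical and are not part of the original analysis.

```r
# Counts copied from Tables 8 and 9; rows = actual class, columns = predicted class.
make_cm <- function(counts) {
  matrix(counts, nrow = 2, byrow = TRUE,
         dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
}
cm_train <- make_cm(c(1524, 498, 470, 1596))
cm_test  <- make_cm(c(393, 119, 129, 382))

# Hypothetical helper: summary metrics for a 2x2 confusion matrix.
classification_metrics <- function(cm) {
  tn <- cm["0", "0"]; fp <- cm["0", "1"]
  fn <- cm["1", "0"]; tp <- cm["1", "1"]
  c(accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),   # share of actual 1s predicted as 1
    specificity = tn / (tn + fp))   # share of actual 0s predicted as 0
}

metrics <- rbind(train = classification_metrics(cm_train),
                 test  = classification_metrics(cm_test))
print(round(metrics, 3))
```

Reporting sensitivity and specificity alongside accuracy shows whether the errors fall mainly on one class, which a single accuracy figure would hide.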
To complement the logistic regression analysis of the factors influencing college completion rates, ridge regression was applied to the same binary outcome. Ridge adds a penalty proportional to the sum of the squared coefficients to the logistic loss, which shrinks the coefficient estimates toward zero and helps mitigate multicollinearity and overfitting.
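For reference, the penalized objective that glmnet documents for the binomial family can be written as below. The notation (N observations, responses y_i, predictor vectors x_i, intercept \beta_0, coefficient vector \beta) is introduced here for illustration and does not appear in the original analysis. Setting \alpha = 0 gives the ridge penalty and \alpha = 1 gives the LASSO penalty; the intercept is not penalized.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
  - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
\;+\; \lambda\Big[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
  + \alpha\,\lVert\beta\rVert_1\Big]
```

Larger values of \lambda put more weight on the penalty and therefore shrink the coefficients more strongly.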
In this analysis, two lambda values, lambda.min and
lambda.1se, were obtained through cross-validation with cv.glmnet.
lambda.min (0.01134566) is the value that minimizes the
cross-validated error, while lambda.1se (0.05516942) is the largest
value whose error is within one standard error of that minimum, which
gives a more heavily regularized model.
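Table 10A presents the ridge regression coefficients at
lambda.min. Because the model is logistic, each coefficient is a change in the
log-odds that a college falls in the high completion class (Completion_Rate_Median = 1), not a change in the completion rate itself. For example: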
Intercept (-1.3229162): The intercept is the log-odds of the high completion class when all predictor variables are zero, which corresponds to a probability of about 0.21. A college with zero branches and zero enrollment does not exist, so the intercept has little practical meaning on its own.
NUMBRANCH (-0.0411465): Each additional branch lowers the log-odds of the high completion class by about 0.041 (an odds ratio of about 0.96). One possible explanation is that spreading resources across multiple branches weakens support systems.
HIGHDEG (0): The coefficient for HIGHDEG is exactly 0.0000000. Ridge shrinks coefficients but does not normally set them to exactly zero, so this more likely reflects how the column was encoded than a genuine absence of effect: HIGHDEG (highest degree awarded) enters the unregularized model as a factor (Table 7), whereas as.matrix() does not expand factors into dummy variables. The zero should therefore not be read as evidence that the highest degree awarded is unrelated to completion.
UG25ABV (0.5633525): A higher share of undergraduates aged 25 and above is associated with an increase of about 0.563 in the log-odds per unit. The sign is the opposite of the unregularized estimate in Table 7 (-0.36), so this association is not stable across models and should be interpreted with caution.
UGDS (-0.0000343): The coefficient for UGDS is negative but very close to zero, indicating that total enrollment has a negligible effect per additional student.
UGDS_WHITE (0.8949482): A higher share of White students is associated with an increase of about 0.895 in the log-odds per unit. This may indicate that colleges with a higher proportion of White students have more resources or support systems, although the model cannot establish the cause.
UGDS_BLACK (-0.6196046): Conversely, a higher share of Black students is associated with a decrease of about 0.620 in the log-odds per unit. This points to potential disparities or challenges faced by Black students in completing their degrees.
UGDS_HISP (1.1171618): A higher share of Hispanic students is associated with an increase of about 1.117 in the log-odds per unit. Colleges with a larger Hispanic student population may have support systems tailored to their needs, although this explanation is speculative.
UGDS_ASIAN (4.3124523): This is the largest coefficient: a higher share of Asian students is associated with an increase of about 4.312 in the log-odds per unit. Because glmnet reports no standard errors or p-values, this is a statement about magnitude rather than statistical significance, and it describes an institution-level association rather than the performance of individual students.
PCTPELL (-0.7402258): A higher share of Pell Grant recipients is associated with a decrease of about 0.740 in the log-odds per unit, consistent with the challenges faced by economically disadvantaged students in completing their degrees.
PCTFLOAN (2.2515051): Conversely, a higher share of students with federal loans is associated with an increase of about 2.252 in the log-odds per unit, which may reflect the role of financial aid in supporting students through completion.
Therefore, the ridge logistic regression model at lambda.min, written on the log-odds scale, is as follows:
log-odds(Completion_Rate_Median = 1) = -1.3229162 -
0.0411465 * NUMBRANCH + 0.5633525 * UG25ABV -
0.0000343 * UGDS + 0.8949482 * UGDS_WHITE -
0.6196046 * UGDS_BLACK +
1.1171618 * UGDS_HISP +
4.3124523 * UGDS_ASIAN -
0.7402258 * PCTPELL + 2.2515051 * PCTFLOAN
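Because the model is logistic, the right-hand side of the equation above is on the log-odds scale. The following is a minimal sketch, using only base R, of converting the Table 10A values to a probability and to odds ratios. The vector name ridge_min is a hypothetical label, and the 0.10 step assumes that the share variables are stored as proportions between 0 and 1.

```r
# Coefficients at lambda.min, copied from Table 10A (HIGHDEG omitted: it is 0).
ridge_min <- c("(Intercept)" = -1.3229162, NUMBRANCH = -0.0411465,
               UG25ABV = 0.5633525, UGDS = -0.0000343,
               UGDS_WHITE = 0.8949482, UGDS_BLACK = -0.6196046,
               UGDS_HISP = 1.1171618, UGDS_ASIAN = 4.3124523,
               PCTPELL = -0.7402258, PCTFLOAN = 2.2515051)

# plogis() is the inverse logit: it maps log-odds to a probability.
baseline_prob <- unname(plogis(ridge_min["(Intercept)"]))

# exp(coefficient) is the odds ratio for a one-unit increase in a predictor.
odds_ratio_per_unit <- exp(ridge_min[-1])

# Assumption: share variables are proportions in [0, 1], so a 0.10 step
# corresponds to a 10-percentage-point increase.
odds_ratio_per_10pts <- exp(0.10 * ridge_min[c("PCTPELL", "PCTFLOAN")])

print(round(baseline_prob, 3))
print(round(odds_ratio_per_unit, 3))
print(round(odds_ratio_per_10pts, 3))
```

Odds ratios below 1 indicate lower odds of the high completion class and ratios above 1 indicate higher odds, which is often easier to communicate than raw log-odds.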
#-------------------------
# Apply Ridge
# Fit the model
# Find the best lambda
#------------------------
# Creating Model
CRM_Model_Ridge_CV_Logistic = cv.glmnet(as.matrix(CRM_Train_Data[, -ncol(CRM_Train_Data)]), CRM_Train_Data$Completion_Rate_Median, alpha = 0, family = "binomial")
# print the best lambda (lambda.min)
# CRM_Model_Ridge_CV_Logistic$lambda.min # -- lambda.min = 0.01134566
# print coefficients for the best lambda
CRM_Coefficients_Ridge <- predict(CRM_Model_Ridge_CV_Logistic, s = "lambda.min", type = "coefficients")
# Convert the sparse matrix to a regular matrix
CRM_Coefficients_Ridge_matrix <- as.matrix(CRM_Coefficients_Ridge)
# Convert the matrix to a data frame
# Convert coefficients to a data frame
CRM_Coefficients_Ridge_df <- as.data.frame(CRM_Coefficients_Ridge_matrix)
# Create a basic table using kable
kable(CRM_Coefficients_Ridge_df, caption = "Table 10A:
Ridge Regression Coefficients Analyzing Factors Influencing College Completion Rate Median (lambda.min)")
| | lambda.min |
|---|---|
| (Intercept) | -1.3229162 |
| NUMBRANCH | -0.0411465 |
| HIGHDEG | 0.0000000 |
| UG25ABV | 0.5633525 |
| UGDS | -0.0000343 |
| UGDS_WHITE | 0.8949482 |
| UGDS_BLACK | -0.6196046 |
| UGDS_HISP | 1.1171618 |
| UGDS_ASIAN | 4.3124523 |
| PCTPELL | -0.7402258 |
| PCTFLOAN | 2.2515051 |
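Ridge does not usually shrink a coefficient to exactly zero, so the 0.0000000 reported for HIGHDEG may point to how the predictor matrix was built rather than to the predictor itself. The sketch below uses made-up toy data (only the column names echo the report) to show that as.matrix() on a data frame with a factor column returns a character matrix, whereas model.matrix() expands the factor into numeric dummy columns. Whether this explains the result above is an assumption, not something the original output confirms.

```r
# Toy data with invented values; only the column names mirror the report.
toy <- data.frame(
  HIGHDEG = factor(c("Certificate Degree", "Associate Degree", "Bachelors Degree",
                     "Graduate Degree", "Certificate Degree", "Bachelors Degree")),
  PCTPELL = c(0.61, 0.45, 0.30, 0.22, 0.70, 0.35),
  Completion_Rate_Median = c(1, 0, 1, 1, 0, 1)
)

# as.matrix() on mixed column types produces a character matrix,
# which is not a valid numeric design matrix.
x_raw <- as.matrix(toy[, -ncol(toy)])
print(is.character(x_raw))

# model.matrix() dummy-codes the factor; [, -1] drops the intercept column.
x_dummy <- model.matrix(Completion_Rate_Median ~ ., data = toy)[, -1]
print(is.numeric(x_dummy))
print(colnames(x_dummy))
```

A matrix built this way could be passed as the x argument of cv.glmnet in place of as.matrix(...). The lambda values and coefficients would then differ from those reported here.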
# print the best lambda (lambda.1se)
# CRM_Model_Ridge_CV_Logistic$lambda.1se # -- lambda.1se = 0.05516942
# print coefficients for the best lambda
CRM_Coefficients_Ridge_Lambda1se <- predict(CRM_Model_Ridge_CV_Logistic, s = "lambda.1se", type = "coefficients")
# Convert the sparse matrix to a regular matrix
CRM_Coefficients_Ridge_Lambda1se_matrix <- as.matrix(CRM_Coefficients_Ridge_Lambda1se)
# Convert the matrix to a data frame
# Convert coefficients to a data frame
CRM_Coefficients_Ridge_Lambda1se_df <- as.data.frame(CRM_Coefficients_Ridge_Lambda1se_matrix)
# Create a basic table using kable
kable(CRM_Coefficients_Ridge_Lambda1se_df, caption = "Table 10B:
Ridge Regression Coefficients Analyzing Factors Influencing College Completion Rate Median (lambda.1se)")
| | lambda.1se |
|---|---|
| (Intercept) | -0.8796900 |
| NUMBRANCH | -0.0312556 |
| HIGHDEG | 0.0000000 |
| UG25ABV | 0.3992692 |
| UGDS | -0.0000275 |
| UGDS_WHITE | 0.5035802 |
| UGDS_BLACK | -0.7721856 |
| UGDS_HISP | 0.5752159 |
| UGDS_ASIAN | 3.0434189 |
| PCTPELL | -0.3536183 |
| PCTFLOAN | 1.7197953 |
Table 10B presents the ridge regression coefficients at
lambda.1se and offers detailed insights into the impact of
predictor variables on college completion rates. For example:
Intercept (-0.8796900): The intercept is the log-odds of the high completion class (Completion_Rate_Median = 1) when all predictor variables are zero, which corresponds to a probability of about 0.29. It has little practical meaning on its own, because no college has all predictors equal to zero.
NUMBRANCH (-0.0312556): Each additional branch lowers the log-odds of the high completion class by about 0.031 (an odds ratio of about 0.97). One possible explanation is that spreading resources across multiple branches weakens support systems.
HIGHDEG (0): The coefficient for HIGHDEG is again exactly 0.0000000. Since ridge does not normally produce exact zeros, this likely reflects how the factor was encoded in the predictor matrix rather than a genuine absence of effect of the highest degree awarded.
UG25ABV (0.3992692): A higher share of undergraduates aged 25 and above is associated with an increase of about 0.399 in the log-odds per unit. The sign differs from the unregularized estimate in Table 7 (-0.36), so this association should be interpreted with caution.
UGDS (-0.0000275): The coefficient for UGDS is -0.0000275, indicating that total enrollment has a negligible effect per additional student.
UGDS_WHITE (0.5035802): A higher share of White students is associated with an increase of about 0.504 in the log-odds per unit. This may indicate that colleges with a higher proportion of White students have more resources or support systems, although the model cannot establish the cause.
UGDS_BLACK (-0.7721856): Conversely, a higher share of Black students is associated with a decrease of about 0.772 in the log-odds per unit. This points to potential disparities or challenges faced by Black students in completing their degrees.
UGDS_HISP (0.5752159): A higher share of Hispanic students is associated with an increase of about 0.575 in the log-odds per unit. Colleges with a larger Hispanic student population may have support systems tailored to their needs, although this explanation is speculative.
UGDS_ASIAN (3.0434189): This remains the largest coefficient: a higher share of Asian students is associated with an increase of about 3.043 in the log-odds per unit. This describes an institution-level association, and glmnet provides no significance test for it.
PCTPELL (-0.3536183): A higher share of Pell Grant recipients is associated with a decrease of about 0.354 in the log-odds per unit, consistent with the challenges faced by economically disadvantaged students in completing their degrees.
PCTFLOAN (1.7197953): Conversely, a higher share of students with federal loans is associated with an increase of about 1.720 in the log-odds per unit, which may reflect the role of financial aid in supporting students through completion.
Therefore, the ridge logistic regression model at lambda.1se, written on the log-odds scale, is as follows:
log-odds(Completion_Rate_Median = 1) = -0.8796900 -
0.0312556 * NUMBRANCH + 0.3992692 * UG25ABV -
0.0000275 * UGDS + 0.5035802 * UGDS_WHITE -
0.7721856 * UGDS_BLACK +
0.5752159 * UGDS_HISP +
3.0434189 * UGDS_ASIAN -
0.3536183 * PCTPELL + 1.7197953 * PCTFLOAN
In Table 11A (lambda.min, train data), rows give the actual class and columns the predicted class. True negatives (0,0) are colleges with a low completion rate (class 0) that the model also predicted as low; there are 1,272 such cases. False positives (0,1) are low-completion colleges that the model predicted as high; there are 750. False negatives (1,0) are high-completion colleges that the model predicted as low; there are 678. True positives (1,1) are high-completion colleges correctly predicted as high; there are 1,388. Overall train accuracy is about 65.1% (2,660 of 4,088).
# Make Predictions (lambda.min)
CRM_Predictions_Ridge_Logistic_Train = predict(CRM_Model_Ridge_CV_Logistic, newx = as.matrix(CRM_Train_Data[, -ncol(CRM_Train_Data)]), s = "lambda.min", type = "response")
# Convert the probabilities to 0 and 1
CRM_Predictions_Ridge_Logistic_Train = ifelse(CRM_Predictions_Ridge_Logistic_Train > 0.5, 1, 0)
# Print the confusion matrix
CRM_Train_CM_Ridge <- table(CRM_Train_Data$Completion_Rate_Median, CRM_Predictions_Ridge_Logistic_Train)
# print(CRM_Train_CM_Ridge)
kable(CRM_Train_CM_Ridge, caption = "Table 11A:
Ridge Regression Confusion Matrix for College Completion Rate Median from Train Data (lambda.min)")
| | 0 | 1 |
|---|---|---|
| 0 | 1272 | 750 |
| 1 | 678 | 1388 |
# Make Predictions (lambda.1se)
CRM_Predictions_Ridge_Logistic_Train_lambda.1se = predict(CRM_Model_Ridge_CV_Logistic, newx = as.matrix(CRM_Train_Data[, -ncol(CRM_Train_Data)]), s = "lambda.1se", type = "response")
# Convert the probabilities to 0 and 1
CRM_Predictions_Ridge_Logistic_Train_lambda.1se = ifelse(CRM_Predictions_Ridge_Logistic_Train_lambda.1se > 0.5, 1, 0)
# Print the confusion matrix
CRM_Train_CM_Ridge_lambda.1se <- table(CRM_Train_Data$Completion_Rate_Median, CRM_Predictions_Ridge_Logistic_Train_lambda.1se)
# print(CRM_Train_CM_Ridge_lambda.1se)
kable(CRM_Train_CM_Ridge_lambda.1se, caption = "Table 11B:
Ridge Regression Confusion Matrix for College Completion Rate Median from Train Data (lambda.1se)")
| | 0 | 1 |
|---|---|---|
| 0 | 1271 | 751 |
| 1 | 668 | 1398 |
Similarly, Table 11B (lambda.1se) reports the train-data results under the stronger penalty, with the same layout as Table 11A. True negatives (0,0) were essentially unchanged at 1,271 (down from 1,272), and false positives (0,1) rose by one to 751. False negatives (1,0) decreased from 678 to 668 and true positives (1,1) increased from 1,388 to 1,398, so the model identified slightly more high-completion colleges. Train accuracy is about 65.3% (2,669 of 4,088).
In the test-data matrix at lambda.min (Table 12A), true negatives (0,0) are low-completion colleges that the model correctly predicted as low; there are 328 such cases. False positives (0,1) are low-completion colleges predicted as high, accounting for 184 instances. False negatives (1,0) are high-completion colleges predicted as low, totaling 169 instances. True positives (1,1) are high-completion colleges correctly predicted as high; there are 342 such cases. Test accuracy is about 65.5% (670 of 1,023).
# Make Predictions (lambda.min)
CRM_Predictions_Ridge_Logistic_Test = predict(CRM_Model_Ridge_CV_Logistic, newx = as.matrix(CRM_Test_Data[, -ncol(CRM_Test_Data)]), s = "lambda.min", type = "response")
# Convert the probabilities to 0 and 1
CRM_Predictions_Ridge_Logistic_Test = ifelse(CRM_Predictions_Ridge_Logistic_Test > 0.5, 1, 0)
# Print the confusion matrix
CRM_Test_CM_Ridge <- table(CRM_Test_Data$Completion_Rate_Median, CRM_Predictions_Ridge_Logistic_Test)
# print(CRM_Test_CM_Ridge)
kable(CRM_Test_CM_Ridge, caption = "Table 12A:
Ridge Regression Confusion Matrix for College Completion Rate Median from Test Data (lambda.min)")
| | 0 | 1 |
|---|---|---|
| 0 | 328 | 184 |
| 1 | 169 | 342 |
# Make Predictions (lambda.1se)
CRM_Predictions_Ridge_Logistic_Test_lambda.1se = predict(CRM_Model_Ridge_CV_Logistic, newx = as.matrix(CRM_Test_Data[, -ncol(CRM_Test_Data)]), s = "lambda.1se", type = "response")
# Convert the probabilities to 0 and 1
CRM_Predictions_Ridge_Logistic_Test_lambda.1se = ifelse(CRM_Predictions_Ridge_Logistic_Test_lambda.1se > 0.5, 1, 0)
# Print the confusion matrix
CRM_Test_CM_Ridge_lambda.1se <- table(CRM_Test_Data$Completion_Rate_Median, CRM_Predictions_Ridge_Logistic_Test_lambda.1se)
# print(CRM_Test_CM_Ridge_lambda.1se)
kable(CRM_Test_CM_Ridge_lambda.1se, caption = "Table 12B:
Ridge Regression Confusion Matrix for College Completion Rate Median from Test Data (lambda.1se)")
| | 0 | 1 |
|---|---|---|
| 0 | 321 | 191 |
| 1 | 169 | 342 |
Similarly, Table 12B (lambda.1se) shows the test-data predictions under the stronger penalty. True negatives (0,0) decreased from 328 to 321 and false positives (0,1) increased from 184 to 191, indicating more low-completion colleges incorrectly predicted as high. False negatives (1,0) and true positives (1,1) were unchanged at 169 and 342, respectively. Test accuracy is about 64.8% (663 of 1,023).
The differences between the two matrices are small: moving from lambda.min to lambda.1se changes only a handful of classifications, so the ridge model is not very sensitive to the choice between these two values. True positives and true negatives are the model’s correct predictions, while false positives and false negatives are its misclassifications. On the test data the ridge models classify about 65% of colleges correctly, compared with about 76% for the unregularized logistic regression (Table 9), so the regularized fit as specified does not improve predictive accuracy. Accurate classification matters for decision-making in education and business contexts, and the remaining false positives and false negatives mark the model’s limitations and areas for improvement.
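To put the ridge results next to the unregularized model, the test-set accuracies can be computed directly from Tables 9, 12A, and 12B. This is a small sketch in base R; the object names are hypothetical, and each vector lists the counts in the order true negatives, false positives, false negatives, true positives.

```r
# Test-set counts copied from Tables 9, 12A, and 12B.
test_counts <- list(
  logistic         = c(tn = 393, fp = 119, fn = 129, tp = 382),
  ridge_lambda_min = c(tn = 328, fp = 184, fn = 169, tp = 342),
  ridge_lambda_1se = c(tn = 321, fp = 191, fn = 169, tp = 342)
)

# Accuracy = correctly classified colleges / all colleges in the test set.
test_accuracy <- sapply(test_counts,
                        function(x) unname((x["tn"] + x["tp"]) / sum(x)))
print(round(test_accuracy, 3))
```

Comparing the models on the same held-out data keeps the comparison fair, since all three matrices cover the same test colleges.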
To analyze the factors influencing the Completion_Rate_Median outcome further, LASSO regression was also applied to the logistic model. LASSO adds a penalty proportional to the sum of the absolute values of the coefficients to the logistic loss, which mitigates overfitting and can set some coefficients exactly to zero, effectively performing variable selection.
As with ridge, two lambda values were obtained through
cross-validation: lambda.min at 0.0007464686, which minimizes the
cross-validated error, and lambda.1se at 0.008385261, the largest
value whose error is within one standard error of that minimum.
Together they trade off the model’s complexity against its
predictive accuracy.
In Table 13A, the LASSO regression coefficients are reported at the
lambda.min value, which minimizes the cross-validated error. As the
model is logistic, each coefficient is a change in the log-odds of the
high completion class (Completion_Rate_Median = 1). Key
findings include:
Intercept (-1.6472394): The intercept is the log-odds of the high completion class when all predictor variables are zero, which corresponds to a probability of about 0.16. It has little practical meaning on its own, because no college has all predictors equal to zero.
NUMBRANCH (-0.0424150): Each additional branch lowers the log-odds of the high completion class by about 0.042 (an odds ratio of about 0.96). One possible explanation is that spreading resources across multiple branches weakens support systems.
HIGHDEG (0): The coefficient for HIGHDEG is zero. LASSO can set coefficients exactly to zero, but the same zero appears in the ridge fit (Tables 10A and 10B), which suggests an issue with how the factor was encoded in the predictor matrix rather than variable selection. It should not be read as evidence that the highest degree awarded is unrelated to completion.
UG25ABV (0.6031363): A higher share of undergraduates aged 25 and above is associated with an increase of about 0.603 in the log-odds per unit. The sign differs from the unregularized estimate in Table 7 (-0.36), so this association should be interpreted with caution.
UGDS (-0.0000365): The coefficient for UGDS is negative but very close to zero, indicating that total enrollment has a negligible effect per additional student.
UGDS_WHITE (1.2263659): A higher share of White students is associated with an increase of about 1.226 in the log-odds per unit. This may indicate that colleges with a higher proportion of White students have more resources or support systems, although the model cannot establish the cause.
UGDS_BLACK (-0.3344819): Conversely, a higher share of Black students is associated with a decrease of about 0.334 in the log-odds per unit. This points to potential disparities or challenges faced by Black students in completing their degrees.
UGDS_HISP (1.5149724): A higher share of Hispanic students is associated with an increase of about 1.515 in the log-odds per unit. Colleges with a larger Hispanic student population may have support systems tailored to their needs, although this explanation is speculative.
UGDS_ASIAN (5.0068296): This is the largest coefficient: a higher share of Asian students is associated with an increase of about 5.007 in the log-odds per unit. This describes an institution-level association, and glmnet provides no significance test for it.
PCTPELL (-0.9134661): A higher share of Pell Grant recipients is associated with a decrease of about 0.913 in the log-odds per unit, consistent with the challenges faced by economically disadvantaged students in completing their degrees.
PCTFLOAN (2.4440316): Conversely, a higher share of students with federal loans is associated with an increase of about 2.444 in the log-odds per unit, which may reflect the role of financial aid in supporting students through completion.
Therefore, the LASSO logistic regression model with the coefficients at
lambda.min, written on the log-odds scale, is as follows:
log-odds(Completion_Rate_Median = 1) = -1.6472394 -
0.0424150 * NUMBRANCH + 0.6031363 * UG25ABV -
0.0000365 * UGDS + 1.2263659 * UGDS_WHITE -
0.3344819 * UGDS_BLACK +
1.5149724 * UGDS_HISP +
5.0068296 * UGDS_ASIAN -
0.9134661 * PCTPELL + 2.4440316 * PCTFLOAN
# Apply Lasso
# Fit the model
# Find the best lambda
# Create Model
CRM_Model_Lasso_CV_Logistic = cv.glmnet(as.matrix(CRM_Train_Data[, -ncol(CRM_Train_Data)]), CRM_Train_Data$Completion_Rate_Median, alpha = 1, family = "binomial")
# print the best lambda (lambda.min)
# CRM_Model_Lasso_CV_Logistic$lambda.min # lambda.min = 0.0007464686
# print coefficients for the best lambda
CRM_Coefficients_LASSO <- predict(CRM_Model_Lasso_CV_Logistic, s = "lambda.min", type = "coefficients")
# Convert the sparse matrix to a regular matrix
CRM_Coefficients_LASSO_matrix <- as.matrix(CRM_Coefficients_LASSO)
# Convert the matrix to a data frame
# Convert coefficients to a data frame
CRM_Coefficients_LASSO_df <- as.data.frame(CRM_Coefficients_LASSO_matrix)
# Create a basic table using kable
kable(CRM_Coefficients_LASSO_df, caption = "Table 13A: LASSO Regression Coefficients
Analyzing Factors Influencing College Completion Rate Median (lambda.min)")
| | lambda.min |
|---|---|
| (Intercept) | -1.6472394 |
| NUMBRANCH | -0.0424150 |
| HIGHDEG | 0.0000000 |
| UG25ABV | 0.6031363 |
| UGDS | -0.0000365 |
| UGDS_WHITE | 1.2263659 |
| UGDS_BLACK | -0.3344819 |
| UGDS_HISP | 1.5149724 |
| UGDS_ASIAN | 5.0068296 |
| PCTPELL | -0.9134661 |
| PCTFLOAN | 2.4440316 |
# print the best lambda (lambda.1se)
# CRM_Model_Lasso_CV_Logistic$lambda.1se # -- lambda.1se = 0.008385261
# print coefficients for the best lambda
CRM_Coefficients_LASSO_lambda.1se <- predict(CRM_Model_Lasso_CV_Logistic, s = "lambda.1se", type = "coefficients")
# Convert the sparse matrix to a regular matrix
CRM_Coefficients_LASSO_lambda.1se_matrix <- as.matrix(CRM_Coefficients_LASSO_lambda.1se)
# Convert the matrix to a data frame
# Convert coefficients to a data frame
CRM_Coefficients_LASSO_lambda.1se_df <- as.data.frame(CRM_Coefficients_LASSO_lambda.1se_matrix)
# Create a basic table using kable
kable(CRM_Coefficients_LASSO_lambda.1se_df, caption = "Table 13B: LASSO Regression Coefficients
Analyzing Factors Influencing College Completion Rate Median (lambda.1se)")
| | lambda.1se |
|---|---|
| (Intercept) | -0.9059768 |
| NUMBRANCH | -0.0296236 |
| HIGHDEG | 0.0000000 |
| UG25ABV | 0.3132804 |
| UGDS | -0.0000269 |
| UGDS_WHITE | 0.3977125 |
| UGDS_BLACK | -0.9236192 |
| UGDS_HISP | 0.5292159 |
| UGDS_ASIAN | 3.3484793 |
| PCTPELL | -0.4000180 |
| PCTFLOAN | 2.0676466 |
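The effect of the larger penalty at lambda.1se can be checked by comparing the size of the coefficients in Tables 13A and 13B. A short base R sketch follows. The vector names are hypothetical labels, and the intercept is left out because glmnet does not penalize it.

```r
predictors <- c("NUMBRANCH", "HIGHDEG", "UG25ABV", "UGDS", "UGDS_WHITE",
                "UGDS_BLACK", "UGDS_HISP", "UGDS_ASIAN", "PCTPELL", "PCTFLOAN")

# LASSO coefficients copied from Table 13A (lambda.min) and Table 13B (lambda.1se).
lasso_min <- setNames(c(-0.0424150, 0, 0.6031363, -0.0000365, 1.2263659,
                        -0.3344819, 1.5149724, 5.0068296, -0.9134661, 2.4440316),
                      predictors)
lasso_1se <- setNames(c(-0.0296236, 0, 0.3132804, -0.0000269, 0.3977125,
                        -0.9236192, 0.5292159, 3.3484793, -0.4000180, 2.0676466),
                      predictors)

# Total absolute size of the reported coefficients: a rough measure of shrinkage
# (glmnet penalizes the coefficients of the standardized predictors by default).
l1_norm <- c(lambda.min = sum(abs(lasso_min)), lambda.1se = sum(abs(lasso_1se)))

# Predictors set exactly to zero, and predictors whose magnitude grew at lambda.1se.
zeroed <- predictors[lasso_1se == 0]
grew   <- predictors[abs(lasso_1se) > abs(lasso_min)]

print(round(l1_norm, 3))
print(zeroed)
print(grew)
```

Not every coefficient has to shrink: under a larger penalty the total size falls, but an individual coefficient can grow in magnitude as correlated predictors are shrunk.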
Table 13B presents the LASSO regression coefficients obtained at the
lambda.1se value, offering valuable insights into the
factors influencing college completion rates. Key findings include:
Intercept (-0.9059768): The intercept is the log-odds of the high completion class (Completion_Rate_Median = 1) when all predictor variables are zero, which corresponds to a probability of about 0.29. It has little practical meaning on its own, because no college has all predictors equal to zero.
NUMBRANCH (-0.0296236): Each additional branch lowers the log-odds of the high completion class by about 0.030 (an odds ratio of about 0.97). One possible explanation is that spreading resources across multiple branches weakens support systems.
HIGHDEG (0): The coefficient for HIGHDEG is zero. Because the same zero appears in the ridge fit (Tables 10A and 10B), it more likely reflects how the factor was encoded in the predictor matrix than variable selection by LASSO.
UG25ABV (0.3132804): A higher share of undergraduates aged 25 and above is associated with an increase of about 0.313 in the log-odds per unit. The sign differs from the unregularized estimate in Table 7 (-0.36), so this association should be interpreted with caution.
UGDS (-0.0000269): The coefficient for UGDS is negative but very close to zero, indicating that total enrollment has a negligible effect per additional student.
UGDS_WHITE (0.3977125): A higher share of White students is associated with an increase of about 0.398 in the log-odds per unit. This may indicate that colleges with a higher proportion of White students have more resources or support systems, although the model cannot establish the cause.
UGDS_BLACK (-0.9236192): Conversely, a higher share of Black students is associated with a decrease of about 0.924 in the log-odds per unit. This points to potential disparities or challenges faced by Black students in completing their degrees.
UGDS_HISP (0.5292159): A higher share of Hispanic students is associated with an increase of about 0.529 in the log-odds per unit. Colleges with a larger Hispanic student population may have support systems tailored to their needs, although this explanation is speculative.
UGDS_ASIAN (3.3484793): This remains the largest coefficient: a higher share of Asian students is associated with an increase of about 3.348 in the log-odds per unit. This describes an institution-level association, and glmnet provides no significance test for it.
PCTPELL (-0.4000180): A higher share of Pell Grant recipients is associated with a decrease of about 0.400 in the log-odds per unit, consistent with the challenges faced by economically disadvantaged students in completing their degrees.
PCTFLOAN (2.0676466): Conversely, a higher share of students with federal loans is associated with an increase of about 2.068 in the log-odds per unit, which may reflect the role of financial aid in supporting students through completion.
Therefore, the LASSO logistic regression model with the coefficients at
lambda.1se, written on the log-odds scale, is as follows:
log-odds(Completion_Rate_Median = 1) = -0.9059768 -
0.0296236 * NUMBRANCH + 0.3132804 * UG25ABV -
0.0000269 * UGDS + 0.3977125 * UGDS_WHITE -
0.9236192 * UGDS_BLACK +
0.5292159 * UGDS_HISP +
3.3484793 * UGDS_ASIAN -
0.4000180 * PCTPELL + 2.0676466 * PCTFLOAN
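To illustrate how the fitted equation turns into a classification, the sketch below evaluates it for one hypothetical institution and applies the same 0.5 cutoff used in the code above. The coefficient values are copied from Table 13B; the institution's predictor values are invented for illustration and assume that share variables are proportions between 0 and 1.

```r
# LASSO coefficients at lambda.1se, copied from Table 13B.
lasso_1se <- c("(Intercept)" = -0.9059768, NUMBRANCH = -0.0296236, HIGHDEG = 0,
               UG25ABV = 0.3132804, UGDS = -0.0000269,
               UGDS_WHITE = 0.3977125, UGDS_BLACK = -0.9236192,
               UGDS_HISP = 0.5292159, UGDS_ASIAN = 3.3484793,
               PCTPELL = -0.4000180, PCTFLOAN = 2.0676466)

# Hypothetical institution (invented values, not taken from the dataset).
school <- c(NUMBRANCH = 1, HIGHDEG = 0, UG25ABV = 0.25, UGDS = 5000,
            UGDS_WHITE = 0.50, UGDS_BLACK = 0.15, UGDS_HISP = 0.20,
            UGDS_ASIAN = 0.05, PCTPELL = 0.40, PCTFLOAN = 0.50)

# Linear predictor (log-odds): intercept plus the sum of coefficient * value.
log_odds <- unname(lasso_1se["(Intercept)"] +
                     sum(lasso_1se[names(school)] * school))

# Inverse logit gives the predicted probability of the high completion class.
prob_high <- plogis(log_odds)

# Same 0.5 threshold as in the prediction code above.
predicted_class <- as.integer(prob_high > 0.5)

print(round(c(log_odds = log_odds, prob_high = prob_high), 3))
print(predicted_class)
```

This mirrors what predict(..., type = "response") followed by ifelse(... > 0.5, 1, 0) does for each row of the data.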
In analyzing the train data, the LASSO regression model’s performance
was evaluated using confusion matrices, with rows giving the actual class
and columns the predicted class. In Table 14A, at the
lambda.min value, the model correctly classified 1,278
colleges with a low completion rate (0) as low (true negatives) and
1,387 colleges with a high completion rate (1) as high (true
positives). However, 744 colleges were predicted to be high but were
actually low (false positives), and 679 were predicted to be low but
were actually high (false negatives). This indicates that the
model, at the lambda.min value, treats the two classes fairly evenly
but misclassifies a substantial share of colleges, with a train
accuracy of about 65.2% (2,665 of 4,088).
# Make Predictions (lambda.min)
CRM_Predictions_Lasso_Log_Train = predict(CRM_Model_Lasso_CV_Logistic, newx = as.matrix(CRM_Train_Data[, -ncol(CRM_Train_Data)]), s = "lambda.min", type = "response")
# convert the probabilities to 0 and 1
CRM_Predictions_Lasso_Log_Train = ifelse(CRM_Predictions_Lasso_Log_Train > 0.5, 1, 0)
# Print the confusion matrix
CRM_Train_CM_LASSO <- table(CRM_Train_Data$Completion_Rate_Median, CRM_Predictions_Lasso_Log_Train)
# print(CRM_Train_CM_LASSO)
kable(CRM_Train_CM_LASSO, caption = "Table 14A:
LASSO Regression Confusion Matrix for College Completion Rate Median from Train Data (lambda.min)")
| | 0 | 1 |
|---|---|---|
| 0 | 1278 | 744 |
| 1 | 679 | 1387 |
# Make Predictions (lambda.1se)
CRM_Predictions_Lasso_Log_Train_lambda.1se = predict(CRM_Model_Lasso_CV_Logistic, newx = as.matrix(CRM_Train_Data[, -ncol(CRM_Train_Data)]), s = "lambda.1se", type = "response")
# convert the probabilities to 0 and 1
CRM_Predictions_Lasso_Log_Train_lambda.1se = ifelse(CRM_Predictions_Lasso_Log_Train_lambda.1se > 0.5, 1, 0)
# Print the confusion matrix
CRM_Train_CM_LASSO_lambda.1se <- table(CRM_Train_Data$Completion_Rate_Median, CRM_Predictions_Lasso_Log_Train_lambda.1se)
# print(CRM_Train_CM_LASSO_lambda.1se)
kable(CRM_Train_CM_LASSO_lambda.1se, caption = "Table 14B:
LASSO Regression Confusion Matrix for College Completion Rate Median from Train Data (lambda.1se)")
| | 0 | 1 |
|---|---|---|
| 0 | 1267 | 755 |
| 1 | 661 | 1405 |
Similarly, in Table 14B, at the lambda.1se value, the
model correctly classified 1,267 low-completion colleges (true
negatives) and 1,405 high-completion colleges (true
positives). However, 755 colleges were predicted to be high but were
actually low (false positives), and 661 were predicted to be low but
were actually high (false negatives). This suggests that at the
lambda.1se value, the model identified slightly more
high-completion colleges and made slightly fewer misclassifications
overall (1,416 versus 1,423) compared to the
lambda.min value.
In analyzing the test data, the LASSO regression model’s performance
was assessed using confusion matrices, with rows giving the actual class
and columns the predicted class. In Table 15A, at the
lambda.min value, the model correctly classified 329
colleges with a low completion rate (0) as low (true negatives) and
338 colleges with a high completion rate (1) as high (true
positives). However, 183 colleges were predicted to be high but were
actually low (false positives), and 173 were predicted to be low but
were actually high (false negatives). This indicates that the
model, at the lambda.min value, treats the two classes fairly evenly
but misclassifies a substantial share of colleges, with a test
accuracy of about 65.2% (667 of 1,023).
# Make Predictions (lambda.min)
CRM_Predictions_Lasso_Log_Test = predict(CRM_Model_Lasso_CV_Logistic, newx = as.matrix(CRM_Test_Data[, -ncol(CRM_Test_Data)]), s = "lambda.min", type = "response")
# Convert the probabilities to 0 and 1
CRM_Predictions_Lasso_Log_Test = ifelse(CRM_Predictions_Lasso_Log_Test > 0.5, 1, 0)
# Print the confusion matrix
CRM_Test_CM_LASSO <- table(CRM_Test_Data$Completion_Rate_Median, CRM_Predictions_Lasso_Log_Test)
# print(CRM_Test_CM_LASSO)
kable(CRM_Test_CM_LASSO, caption = "Table 15A:
LASSO Regression Confusion Matrix for College Completion Rate Medium (lambda.min)")
| | 0 | 1 |
|---|---|---|
| 0 | 329 | 183 |
| 1 | 173 | 338 |
# Make Predictions (lambda.1se)
CRM_Predictions_Lasso_Log_Test_lambda.1se = predict(CRM_Model_Lasso_CV_Logistic, newx = as.matrix(CRM_Test_Data[, -ncol(CRM_Test_Data)]), s = "lambda.1se", type = "response")
# Convert the probabilities to 0 and 1
CRM_Predictions_Lasso_Log_Test_lambda.1se = ifelse(CRM_Predictions_Lasso_Log_Test_lambda.1se > 0.5, 1, 0)
# Print the confusion matrix
CRM_Test_CM_LASSO_lambda.1se <- table(CRM_Test_Data$Completion_Rate_Median, CRM_Predictions_Lasso_Log_Test_lambda.1se)
# print(CRM_Test_CM_LASSO_lambda.1se)
kable(CRM_Test_CM_LASSO_lambda.1se, caption = "Table 15B:
LASSO Regression Confusion Matrix for College Completion Rate Medium (lambda.1se)")
| | 0 | 1 |
|---|---|---|
| 0 | 319 | 193 |
| 1 | 166 | 345 |
Similarly, in Table 15B, at the lambda.1se value, the
model correctly classified 319 institutions with low completion rates
(true negatives) and 345 with high completion rates (true positives).
However, 193 institutions were predicted to be high but were actually
low (false positives), and 166 were predicted to be low but were
actually high (false negatives). At the lambda.1se value the model
therefore identified slightly more high completion rates (345 versus
338 true positives) but made slightly more misclassifications overall
(359 versus 356) than at the lambda.min value.
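The counts in Tables 15A and 15B can be condensed into accuracy, sensitivity, and specificity, which makes the two lambda values easier to compare. The following is a minimal sketch, assuming rows hold the actual class and columns the predicted class (the layout produced by table(actual, predicted)); the helper name cm_metrics is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: summary metrics from a 2x2 confusion matrix
# laid out as table(actual, predicted) with classes "0" and "1".
cm_metrics <- function(cm) {
  tn <- cm["0", "0"]; fp <- cm["0", "1"]
  fn <- cm["1", "0"]; tp <- cm["1", "1"]
  c(accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),   # share of actual 1s predicted as 1
    specificity = tn / (tn + fp))   # share of actual 0s predicted as 0
}

# Counts copied from Tables 15A and 15B (test data); matrix() fills by column
cm_min <- matrix(c(329, 173, 183, 338), nrow = 2,
                 dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
cm_1se <- matrix(c(319, 166, 193, 345), nrow = 2,
                 dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

metrics <- round(rbind(lambda.min = cm_metrics(cm_min),
                       lambda.1se = cm_metrics(cm_1se)), 3)
print(metrics)
```

Reading the two rows side by side shows the trade-off described above: lambda.1se gains a little sensitivity at the cost of specificity and overall accuracy.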
Table 16 shows that the Completion Rate High variable is
imbalanced: 795 institutions are categorized as “High” and 4,316 fall
into the “Other” categories, so the “High” class makes up only about
15.6% of the 5,111 observations.
Acknowledging this imbalance is crucial for devising effective modeling strategies. With so few “High” cases, a classifier can reach high overall accuracy simply by predicting “Other” for most institutions, so accuracy alone can be misleading. Strategies such as oversampling, undersampling, or algorithms designed for imbalanced data may be employed to address this.
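One simple alternative to resampling is to weight observations inversely to their class frequency. The sketch below is only illustrative and was not part of the original analysis: it derives such weights from the Table 16 counts, and the variable names (class_counts, class_weights, obs_weights) are assumptions introduced here.

```r
# Class counts from Table 16 (0 = Other, 1 = High)
class_counts <- c("0" = 4316, "1" = 795)

# Inverse-frequency weights: each class ends up with the same total weight
class_weights <- sum(class_counts) / (length(class_counts) * class_counts)
print(round(class_weights, 3))

# Illustrative outcome vector with the same class mix as Table 16
y <- rep(c(0, 1), times = class_counts)
obs_weights <- unname(class_weights[as.character(y)])

# Total weight per class is now equal
weight_totals <- tapply(obs_weights, y, sum)
print(weight_totals)
```

A vector like obs_weights could be supplied through the weights argument of glm() or cv.glmnet(). Note that glm() with family = binomial may warn about non-integer weights, and whether reweighting would outperform resampling on this dataset is untested here.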
A logistic regression analysis was performed to explore the
determinants of high college completion rates
(Completion_Rate_High). The investigation centered on
Completion_Rate_High as the response variable, examining a
range of predictor variables encompassing the number of branches, types
of degrees conferred, student demographic characteristics, and financial
aid metrics.
Table 17 reports the fitted coefficients for completion rates categorized as “high”. Each estimate is on the log-odds scale. The intercept (0.13) is the log odds of a high completion rate when every numeric predictor is zero and HIGHDEG is at its reference level; with p = .900 it is not statistically distinguishable from zero. Key findings include:
NUMBRANCH (-0.22): Each additional branch is associated with a decrease of about 0.22 in the log odds of a high completion rate (an odds ratio of roughly 0.80). This suggests that institutions with more branches may face challenges in ensuring high completion rates across all locations.
HIGHDEG Variables (each compared with the reference HIGHDEG category, which is omitted from Table 17):
Certificate Degree (-0.37): Institutions whose highest degree awarded is a certificate have log odds of a high completion rate 0.37 lower than the reference category, but the difference is not statistically significant (p = .691).
Associate Degree (-2.69): Institutions whose highest degree awarded is an associate degree have log odds 2.69 lower than the reference category (p = .004), the largest gap among the degree levels. This suggests that these institutions may face significant challenges in achieving high completion rates.
Bachelor’s Degree (-1.90): Institutions whose highest degree awarded is a bachelor’s degree have log odds 1.90 lower than the reference category (p = .043), although the 95% confidence interval [-3.70, 0.18] includes zero, so the evidence is weaker than for the associate and graduate levels.
Graduate Degree (-2.42): Institutions whose highest degree awarded is a graduate degree have log odds 2.42 lower than the reference category (p = .010).
UG25ABV (0.85): A one-unit increase in the proportion of undergraduates aged 25 and above is associated with an increase of 0.85 in the log odds of a high completion rate. One possible explanation is that institutions with a more mature student body have support structures or programs tailored to non-traditional students.
UGDS (0.00): The coefficient is positive and statistically significant (p < .001) but rounds to 0.00 because UGDS is measured in individual students. The change in log odds per additional student is tiny, so any practical effect appears only across enrollment differences of thousands of students.
Demographic Variables:
UGDS_WHITE (0.36): A one-unit increase in the proportion of White students is associated with an increase of 0.36 in the log odds of a high completion rate, but the effect is not statistically significant (p = .390).
UGDS_BLACK (-1.38): Conversely, a one-unit increase in the proportion of Black students is associated with a decrease of 1.38 in the log odds of a high completion rate (p = .003). This highlights potential disparities or challenges faced by Black students in completing their degrees.
UGDS_HISP (-0.13): A one-unit increase in the proportion of Hispanic students is associated with a slight decrease of 0.13 in the log odds of a high completion rate. The effect is small and not statistically significant (p = .775), so the model provides no clear evidence of an association in either direction.
UGDS_ASIAN (4.42): A one-unit increase in the proportion of Asian students is associated with an increase of 4.42 in the log odds of a high completion rate, the largest positive coefficient in the model. This may reflect characteristics of the institutions that enroll larger shares of Asian students, although the model does not establish a causal link.
Financial Aid Variables:
PCTPELL (-2.10): A one-unit increase in the proportion of Pell Grant recipients is associated with a decrease of 2.10 in the log odds of a high completion rate. This highlights the challenges faced by economically disadvantaged students in completing their degrees.
PCTFLOAN (0.85): A one-unit increase in the proportion of students with federal loans is associated with an increase of 0.85 in the log odds of a high completion rate. This may point to the role of financial aid in supporting students through completion.
Therefore, the logistic regression model is as follows:
logit(P(Completion_Rate_High = 1)) = 0.13
- 0.22 × NUMBRANCH
- 0.37 × HIGHDEGCertificate Degree
- 2.69 × HIGHDEGAssociate Degree
- 1.90 × HIGHDEGBachelors Degree
- 2.42 × HIGHDEGGraduate Degree
+ 0.85 × UG25ABV + 0.00 × UGDS + 0.36 × UGDS_WHITE
- 1.38 × UGDS_BLACK - 0.13 × UGDS_HISP
+ 4.42 × UGDS_ASIAN - 2.10 × PCTPELL
+ 0.85 × PCTFLOAN
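Log-odds coefficients are often easier to read as odds ratios. The following is a minimal sketch that exponentiates a selection of the Table 17 estimates. It assumes the share variables (UG25ABV, UGDS_BLACK, UGDS_ASIAN, PCTPELL, PCTFLOAN) are stored as proportions between 0 and 1, in which case a 0.10 step (10 percentage points) is a more realistic change than a full unit.

```r
# Coefficients (log odds) copied from Table 17
log_odds <- c(NUMBRANCH = -0.22, UG25ABV = 0.85, UGDS_BLACK = -1.38,
              UGDS_ASIAN = 4.42, PCTPELL = -2.10, PCTFLOAN = 0.85)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 2))

# Assumption: the share variables run from 0 to 1, so scale the step to 0.10
shares <- log_odds[names(log_odds) != "NUMBRANCH"]
odds_ratios_10pt <- exp(0.10 * shares)
print(round(odds_ratios_10pt, 2))
```

An odds ratio below 1 indicates lower odds of a high completion rate, and a value above 1 indicates higher odds.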
# Prep Work
# Subset data with factors of focus
CRH_Data <- CollegeDataset2[c("NUMBRANCH",
"HIGHDEG",
"UG25ABV",
"UGDS",
"UGDS_WHITE",
"UGDS_BLACK",
"UGDS_HISP",
"UGDS_ASIAN",
"PCTPELL",
"PCTFLOAN",
"Completion_Rate_High")]
# ----------------------------------------------------
# Convert y (Completion_Rate_High) values to 0 and 1
# ----------------------------------------------------
CRH_Data$Completion_Rate_High <- ifelse(CRH_Data$Completion_Rate_High
== "High", 1, 0)
CRH_Freq_Table <- table(CRH_Data$Completion_Rate_High)
kable(CRH_Freq_Table, caption = "Table 16:
Frequency Table of College Completion Rate High")
| Var1 | Freq |
|---|---|
| 0 | 4316 |
| 1 | 795 |
# ---------------------------------
# Fit the logistic regression model
# ---------------------------------
# Split the data into TEST and TRAIN sets
set.seed(1996) # Set seed for reproducibility
# TRAIN
CRH_Train_Indices <- sample(nrow(CRH_Data), nrow(CRH_Data) * 0.8) # 80% for training
CRH_Train_Data <- CRH_Data[CRH_Train_Indices, ]
# dim(CRH_Train_Data) 4123 rows and 11 columns
# TEST
CRH_Test_Data <- CRH_Data[-CRH_Train_Indices, ] # 20% for testing
# dim(CRH_Test_Data) 1031 rows and 11 columns
# Fit the logistic regression model on the CRH_Train_Data
CRH_Model <- glm(Completion_Rate_High ~ NUMBRANCH + HIGHDEG + UG25ABV +
UGDS + UGDS_WHITE + UGDS_BLACK + UGDS_HISP + UGDS_ASIAN +
PCTPELL + PCTFLOAN, data = CRH_Train_Data, family = binomial)
# summary(CRH_Model)
# Presenting findings in table format
CRH_Model_Table <- tidy(CRH_Model, conf.int = TRUE)
nice_table(CRH_Model_Table, title = "Table 17:
Logistic Regression Analysis of Factors Influencing College Completion Rate High")
Table 17: Logistic Regression Analysis of Factors Influencing College Completion Rate High
| Term | estimate | std.error | statistic | p | 95% CI |
|---|---|---|---|---|---|
| (Intercept) | 0.13 | 1.02 | 0.13 | .900 | [-2.07, 2.08] |
| NUMBRANCH | -0.22 | 0.04 | -5.60 | < .001*** | [-0.30, -0.15] |
| HIGHDEGCertificate Degree | -0.37 | 0.93 | -0.40 | .691 | [-2.13, 1.69] |
| HIGHDEGAssociate Degree | -2.69 | 0.94 | -2.87 | .004** | [-4.48, -0.61] |
| HIGHDEGBachelors Degree | -1.90 | 0.94 | -2.02 | .043* | [-3.70, 0.18] |
| HIGHDEGGraduate Degree | -2.42 | 0.94 | -2.58 | .010** | [-4.21, -0.34] |
| UG25ABV | 0.85 | 0.25 | 3.38 | .001*** | [0.36, 1.35] |
| UGDS | 0.00 | 0.00 | 3.58 | < .001*** | [0.00, 0.00] |
| UGDS_WHITE | 0.36 | 0.42 | 0.86 | .390 | [-0.44, 1.21] |
| UGDS_BLACK | -1.38 | 0.47 | -2.96 | .003** | [-2.28, -0.45] |
| UGDS_HISP | -0.13 | 0.44 | -0.29 | .775 | [-0.98, 0.77] |
| UGDS_ASIAN | 4.42 | 0.79 | 5.57 | < .001*** | [2.90, 6.02] |
| PCTPELL | -2.10 | 0.33 | -6.44 | < .001*** | [-2.75, -1.47] |
| PCTFLOAN | 0.85 | 0.23 | 3.62 | < .001*** | [0.39, 1.31] |
# TRAIN
# Make predictions
CRH_Predictions_Logistic_Train = predict(CRH_Model, CRH_Train_Data, type = "response")
# convert the probabilities to 0 and 1
CRH_Predictions_Logistic_Train = ifelse(CRH_Predictions_Logistic_Train > 0.5, 1, 0)
# Print the confusion matrix
CRH_Train_Confusion_Matrix <- table(CRH_Train_Data$Completion_Rate_High, CRH_Predictions_Logistic_Train)
# print(CRH_Train_Confusion_Matrix)
kable(CRH_Train_Confusion_Matrix, caption = "Table 18:
Confusion Matrix for College Completion Rate High from Train Data")
| | 0 | 1 |
|---|---|---|
| 0 | 3406 | 40 |
| 1 | 587 | 55 |
# TEST
# Make predictions
CRH_Predictions_Logistic_Test = predict(CRH_Model, CRH_Test_Data, type = "response")
# convert the probabilities to 0 and 1
CRH_Predictions_Logistic_Test = ifelse(CRH_Predictions_Logistic_Test > 0.5, 1, 0)
# Print the confusion matrix
CRH_Test_Confusion_Matrix <- table(CRH_Test_Data$Completion_Rate_High, CRH_Predictions_Logistic_Test)
# print(CRH_Test_Confusion_Matrix)
kable(CRH_Test_Confusion_Matrix, caption = "Table 19:
Confusion Matrix for College Completion Rate High from Test Data")
| | 0 | 1 |
|---|---|---|
| 0 | 861 | 9 |
| 1 | 141 | 12 |
Furthermore, this model was evaluated on both a train and a test dataset. On the training dataset, which comprises 80% of the total data and consists of 4,123 observations, a confusion matrix was generated to compare the model’s predictions with the actual outcomes. In each matrix, rows show the actual class and columns show the predicted class.
In the confusion matrix for the train dataset (Table 18), the logistic regression model correctly identified 3,406 institutions that were not in the high completion category (true negatives) and 55 that were (true positives). However, 40 institutions were predicted as high when they were not (false positives), and 587 high-completion institutions were predicted as not high (false negatives). Of the 642 high-completion institutions in Table 18, the model therefore detected only 55, a sensitivity of about 8.6%.
Similarly, the test dataset, representing 20% of the total data with 1,031 observations, was used to assess the model’s generalization performance. In the confusion matrix for the test dataset (Table 19), the model correctly identified 861 institutions that were not in the high completion category (true negatives) and 12 that were (true positives). Nevertheless, 9 institutions were predicted as high when they were not (false positives), and 141 high-completion institutions were predicted as not high (false negatives), a sensitivity of about 7.8% (12 of 153).
Overall, the logistic regression model classifies about 85% of the tabulated institutions correctly on both the train and test data, but this accuracy largely reflects the dominance of the “Other” class: the model misses most institutions with high completion rates. There is therefore room for improvement, particularly in reducing false negatives. Further exploration of misclassified instances, a classification threshold other than 0.5, and additional or refined predictor variables may all improve the model’s ability to identify high completion rates.
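The low sensitivity is partly a consequence of applying a 0.5 cutoff to an imbalanced outcome. The sketch below illustrates the effect on simulated data only (not the College Scorecard data): it compares sensitivity at the default cutoff with sensitivity at a cutoff equal to the observed share of positives. The function name sensitivity_at and the simulation settings are assumptions made for illustration.

```r
set.seed(1)  # simulated data only; not the College Scorecard data

# Simulate an imbalanced binary outcome driven by one predictor
n <- 2000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-2 + 1.2 * x))

fit  <- glm(y ~ x, family = binomial)
prob <- predict(fit, type = "response")

# Sensitivity (recall for class 1) at a given probability cutoff
sensitivity_at <- function(cutoff) {
  pred <- as.integer(prob > cutoff)
  sum(pred == 1 & y == 1) / sum(y == 1)
}

cutoff_default    <- 0.5
cutoff_prevalence <- mean(y)  # one common alternative: the observed share of 1s

sens_default    <- sensitivity_at(cutoff_default)
sens_prevalence <- sensitivity_at(cutoff_prevalence)
print(c(default = sens_default, prevalence = sens_prevalence))
```

Lowering the cutoff can only add predicted positives, so sensitivity cannot fall; the price is more false positives, so the cutoff is a trade-off rather than a free improvement.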
To extend the logistic regression analysis of the factors influencing high college completion rates, ridge regression was applied with the aim of improving the model’s predictive performance. Ridge regression is a regularization technique that addresses concerns such as multicollinearity and overfitting by adding a penalty on the size of the coefficient estimates.
Throughout this analysis, two crucial lambda values, namely
lambda.min and lambda.1se, were utilized to
ascertain the optimal level of regularization for the model. These
lambda values, specifically lambda.min (0.01134566) and
lambda.1se (0.05516942), were determined via
cross-validation to strike a balance between model complexity and
predictive accuracy.
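The two values differ in how they are chosen: lambda.min minimizes the mean cross-validated error, while lambda.1se is the largest lambda whose error is still within one standard error of that minimum, giving a more heavily regularized model. The following sketch applies that rule to made-up cross-validation results. The vectors lambda, cvm, and cvsd are hypothetical stand-ins for the corresponding components of a cv.glmnet fit.

```r
# Hypothetical cross-validation results, ordered from largest to smallest lambda
lambda <- c(0.50, 0.20, 0.10, 0.05, 0.02, 0.01)
cvm    <- c(0.920, 0.850, 0.805, 0.800, 0.790, 0.795)  # mean CV error
cvsd   <- rep(0.02, length(lambda))                    # standard error of cvm

# lambda.min: the lambda with the smallest mean CV error
i_min      <- which.min(cvm)
lambda_min <- lambda[i_min]

# lambda.1se: the largest lambda whose error is within one SE of the minimum
threshold  <- cvm[i_min] + cvsd[i_min]
lambda_1se <- max(lambda[cvm <= threshold])

print(c(lambda.min = lambda_min, lambda.1se = lambda_1se))
```

Because lambda.1se is never smaller than lambda.min, its coefficients are shrunk at least as strongly, which matches the pattern seen when Tables 20A and 20B are compared.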
Table 20A presents the ridge regression coefficients at
lambda.min. Because the model is logistic, each coefficient is a change in
the log odds of falling in the high completion rate category, not a change in the completion rate itself:
Intercept (-2.7698630): The intercept is the log odds of a high completion rate when all predictor variables are zero, which corresponds to a probability of about 5.9%. This value has little practical significance on its own, because an institution with every predictor at zero is not realistic.
NUMBRANCH (-0.1688791): For each additional branch, the log odds of a high completion rate decrease by approximately 0.169. This may indicate that spreading resources across multiple branches weakens support systems, leading to lower completion rates.
HIGHDEG (0): The coefficient for HIGHDEG is exactly 0, so degree level contributes nothing to this model’s predictions. Because ridge regression shrinks coefficients toward zero without normally setting them to exactly zero, this value probably reflects how the categorical HIGHDEG column was encoded when the data frame was converted with as.matrix(), rather than evidence that degree level is unrelated to completion.
UG25ABV (1.7869716): A one-unit increase in the proportion of undergraduates aged 25 and above is associated with an increase of approximately 1.787 in the log odds of a high completion rate. This suggests that older students may have better support systems or motivation.
UGDS (-0.0000173): The coefficient for UGDS is -0.0000173 per enrolled student, indicating that total enrollment has a negligible effect on the log odds of a high completion rate.
UGDS_WHITE (1.3168060): A one-unit increase in the proportion of White students corresponds to an increase of approximately 1.317 in the log odds of a high completion rate. This may indicate that colleges with a higher proportion of White students have better support systems or resources for completion.
UGDS_BLACK (-0.2853582): Conversely, a one-unit increase in the proportion of Black students corresponds to a decrease of approximately 0.285 in the log odds of a high completion rate. This highlights potential disparities or challenges faced by Black students in completing their degrees.
UGDS_HISP (0.8180263): A one-unit increase in the proportion of Hispanic students corresponds to an increase of approximately 0.818 in the log odds of a high completion rate. This may suggest that colleges with a larger Hispanic student population have support systems tailored to their needs.
UGDS_ASIAN (4.3771100): This is the largest coefficient: a one-unit increase in the proportion of Asian students corresponds to an increase of approximately 4.377 in the log odds of a high completion rate. This may reflect characteristics of the institutions that enroll larger shares of Asian students, although the model does not establish a causal link.
PCTPELL (-1.0062817): A one-unit increase in the proportion of Pell Grant recipients corresponds to a decrease of approximately 1.006 in the log odds of a high completion rate. This highlights the challenges faced by economically disadvantaged students in completing their degrees.
PCTFLOAN (0.5674224): Conversely, a one-unit increase in the proportion of students with federal loans corresponds to an increase of approximately 0.567 in the log odds of a high completion rate. This may point to the role of financial aid in supporting students through completion.
Therefore, the ridge regression model (lambda.min) is as
follows:
logit(P(Completion_Rate_High = 1)) = -2.7698630
- 0.1688791 × NUMBRANCH + 1.7869716 × UG25ABV
- 0.0000173 × UGDS + 1.3168060 × UGDS_WHITE
- 0.2853582 × UGDS_BLACK
+ 0.8180263 × UGDS_HISP
+ 4.3771100 × UGDS_ASIAN
- 1.0062817 × PCTPELL + 0.5674224 × PCTFLOAN
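Since the fitted equation is on the log-odds scale, a predicted probability is obtained by applying the inverse logit. The sketch below evaluates the Table 20A coefficients for a single hypothetical institution; the predictor values in x are invented for illustration and are not taken from the dataset.

```r
# Coefficients copied from Table 20A (ridge, lambda.min);
# HIGHDEG is omitted because its coefficient is 0
beta <- c("(Intercept)" = -2.7698630, NUMBRANCH = -0.1688791,
          UG25ABV = 1.7869716, UGDS = -0.0000173,
          UGDS_WHITE = 1.3168060, UGDS_BLACK = -0.2853582,
          UGDS_HISP = 0.8180263, UGDS_ASIAN = 4.3771100,
          PCTPELL = -1.0062817, PCTFLOAN = 0.5674224)

# Hypothetical institution (illustrative values, not taken from the dataset)
x <- c("(Intercept)" = 1, NUMBRANCH = 1, UG25ABV = 0.20, UGDS = 5000,
       UGDS_WHITE = 0.50, UGDS_BLACK = 0.10, UGDS_HISP = 0.15,
       UGDS_ASIAN = 0.10, PCTPELL = 0.30, PCTFLOAN = 0.40)

log_odds <- sum(beta * x[names(beta)])  # linear predictor
p_high   <- plogis(log_odds)            # inverse logit: 1 / (1 + exp(-log_odds))

# Probability implied by the intercept alone (all predictors at zero)
p_intercept <- plogis(beta[["(Intercept)"]])
print(round(c(p_high = p_high, p_intercept = p_intercept), 3))
```

The ifelse(... > 0.5, 1, 0) step used in the analysis is then applied to probabilities of this kind.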
#-------------------------
# Apply Ridge
# Fit the model
# Find the best lambda
#------------------------
# Creating Model
CRH_Model_Ridge_CV_Logistic = cv.glmnet(as.matrix(CRH_Train_Data[, -ncol(CRH_Train_Data)]), CRH_Train_Data$Completion_Rate_High, alpha = 0, family = "binomial")
# print the best lambda (lambda.min)
# CRH_Model_Ridge_CV_Logistic$lambda.min # lambda.min = 0.004535024
# print coefficients for the best lambda
CRH_Coefficients_Ridge <- predict(CRH_Model_Ridge_CV_Logistic, s = "lambda.min", type = "coefficients")
# Convert the sparse matrix to a regular matrix
CRH_Coefficients_Ridge_matrix <- as.matrix(CRH_Coefficients_Ridge)
# Convert the matrix to a data frame
# Convert coefficients to a data frame
CRH_Coefficients_Ridge_df <- as.data.frame(CRH_Coefficients_Ridge_matrix)
# Create a basic table using kable
kable(CRH_Coefficients_Ridge_df, caption = "Table 20A:
Ridge Regression Coefficients Analyzing Factors Influencing College Completion Rate High")
| | lambda.min |
|---|---|
| (Intercept) | -2.7698630 |
| NUMBRANCH | -0.1688791 |
| HIGHDEG | 0.0000000 |
| UG25ABV | 1.7869716 |
| UGDS | -0.0000173 |
| UGDS_WHITE | 1.3168060 |
| UGDS_BLACK | -0.2853582 |
| UGDS_HISP | 0.8180263 |
| UGDS_ASIAN | 4.3771100 |
| PCTPELL | -1.0062817 |
| PCTFLOAN | 0.5674224 |
# print the best lambda (lambda.1se)
# CRH_Model_Ridge_CV_Logistic$lambda.1se #-- lambda.1se = 0.1555726
# print coefficients for the best lambda
CRH_Coefficients_Ridge_lambda.1se <- predict(CRH_Model_Ridge_CV_Logistic, s = "lambda.1se", type = "coefficients")
# Convert the sparse matrix to a regular matrix
CRH_Coefficients_Ridge_matrix_lambda.1se <- as.matrix(CRH_Coefficients_Ridge_lambda.1se)
# Convert the matrix to a data frame
# Convert coefficients to a data frame
CRH_Coefficients_Ridge_lambda.1se_df <- as.data.frame(CRH_Coefficients_Ridge_matrix_lambda.1se)
# Create a basic table using kable
kable(CRH_Coefficients_Ridge_lambda.1se_df, caption = "Table 20B:
Ridge Regression Coefficients Analyzing Factors Influencing College Completion Rate High")
| | lambda.1se |
|---|---|
| (Intercept) | -1.8601314 |
| NUMBRANCH | -0.0473951 |
| HIGHDEG | 0.0000000 |
| UG25ABV | 0.6482804 |
| UGDS | -0.0000085 |
| UGDS_WHITE | 0.2874011 |
| UGDS_BLACK | -0.3799900 |
| UGDS_HISP | 0.0022770 |
| UGDS_ASIAN | 1.9933289 |
| PCTPELL | -0.3080749 |
| PCTFLOAN | 0.0881586 |
Table 20B presents the ridge regression coefficients at
lambda.1se. As before, each coefficient is a change in the log odds of
falling in the Completion Rate High category, and the stronger penalty
shrinks most coefficients toward zero relative to lambda.min. Key findings include:
Intercept (-1.8601314): The intercept is the log odds of a high completion rate when all predictor variables are zero, which corresponds to a probability of about 13.5%. This value has little practical significance on its own.
NUMBRANCH (-0.0473951): For each additional branch, the log odds of a high completion rate decrease by approximately 0.047. This may suggest that spreading resources across multiple branches weakens support systems.
HIGHDEG (0): The coefficient for HIGHDEG is 0.0000000, so degree level contributes nothing to this model’s predictions. Because ridge regression does not normally set coefficients to exactly zero, this value probably reflects how the categorical HIGHDEG column was encoded rather than a genuine absence of effect.
UG25ABV (0.6482804): A one-unit increase in the proportion of undergraduates aged 25 and above corresponds to an increase of approximately 0.648 in the log odds of a high completion rate. This suggests that older students may be more motivated or have better support systems.
UGDS (-0.0000085): The coefficient for UGDS is -0.0000085 per enrolled student, suggesting that total enrollment has a negligible effect on the log odds of a high completion rate.
UGDS_WHITE (0.2874011): A one-unit increase in the proportion of White students corresponds to an increase of approximately 0.287 in the log odds of a high completion rate. This may indicate that colleges with a higher proportion of White students have better support systems or resources for completion.
UGDS_BLACK (-0.3799900): Conversely, a one-unit increase in the proportion of Black students corresponds to a decrease of approximately 0.380 in the log odds of a high completion rate. This highlights potential disparities or challenges faced by Black students in completing their degrees.
UGDS_HISP (0.0022770): The proportion of Hispanic students has a minimal effect, with an increase of approximately 0.002 in the log odds of a high completion rate.
UGDS_ASIAN (1.9933289): This remains the largest coefficient: a one-unit increase in the proportion of Asian students corresponds to an increase of approximately 1.993 in the log odds of a high completion rate. This may reflect characteristics of the institutions that enroll larger shares of Asian students, although the model does not establish a causal link.
PCTPELL (-0.3080749): A one-unit increase in the proportion of Pell Grant recipients corresponds to a decrease of approximately 0.308 in the log odds of a high completion rate. This highlights the challenges faced by economically disadvantaged students in completing their degrees.
PCTFLOAN (0.0881586): Conversely, a one-unit increase in the proportion of students with federal loans corresponds to an increase of approximately 0.088 in the log odds of a high completion rate. This may point to the role of financial aid in supporting students through completion.
Therefore, the ridge regression model (lambda.1se) is as
follows:
logit(P(Completion_Rate_High = 1)) = -1.8601314
- 0.0473951 × NUMBRANCH + 0.6482804 × UG25ABV
- 0.0000085 × UGDS + 0.2874011 × UGDS_WHITE
- 0.3799900 × UGDS_BLACK
+ 0.0022770 × UGDS_HISP
+ 1.9933289 × UGDS_ASIAN - 0.3080749 × PCTPELL
+ 0.0881586 × PCTFLOAN
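The single zero coefficient for HIGHDEG in both ridge tables suggests the categorical column was not expanded into indicator variables before fitting. The sketch below uses a toy data frame (illustrative values only) to show how as.matrix() and model.matrix() treat a factor column differently; it is a possible alternative encoding, not the encoding used in the original analysis.

```r
# Toy data frame standing in for the predictors (values are illustrative)
toy <- data.frame(
  NUMBRANCH = c(1, 3, 2, 1),
  HIGHDEG   = factor(c("Certificate Degree", "Associate Degree",
                       "Bachelors Degree", "Graduate Degree")),
  PCTPELL   = c(0.45, 0.60, 0.30, 0.25)
)

# as.matrix() on mixed column types returns a character matrix
x_raw <- as.matrix(toy)
print(typeof(x_raw))

# model.matrix() expands the factor into numeric indicator columns;
# [, -1] drops the intercept column, since glmnet fits its own intercept
x_num <- model.matrix(~ ., data = toy)[, -1]
print(colnames(x_num))
print(typeof(x_num))
```

A numeric matrix built this way could be passed as the x argument of cv.glmnet(), which would give each HIGHDEG level its own penalized coefficient, comparable to the separate HIGHDEG terms in Table 17.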
On the train dataset, the ridge regression model’s classification performance for the high completion rate category is shown in Tables 21A and 21B, where rows are the actual class and columns the predicted class. In Table 21A (lambda.min), the model correctly identified 3,435 institutions that were not in the high completion category (true negatives) and 19 that were (true positives). However, 11 institutions were predicted as high when they were not (false positives), and 623 high-completion institutions were predicted as not high (false negatives).
Table 21B displays the confusion matrix for the stronger regularization at lambda.1se. Here, the model correctly identified 3,445 institutions that were not in the high completion category (true negatives) and only 2 that were (true positives). There was 1 false positive (an institution predicted as high when it was not) and 640 false negatives (high-completion institutions predicted as not high).
# Make Predictions (lambda.min)
CRH_Predictions_Ridge_Logistic_Train = predict(CRH_Model_Ridge_CV_Logistic, newx = as.matrix(CRH_Train_Data[, -ncol(CRH_Train_Data)]), s = "lambda.min", type = "response")
# Convert the probabilities to 0 and 1
CRH_Predictions_Ridge_Logistic_Train = ifelse(CRH_Predictions_Ridge_Logistic_Train > 0.5, 1, 0)
# Print the confusion matrix
CRH_Train_CM_Ridge <- table(CRH_Train_Data$Completion_Rate_High, CRH_Predictions_Ridge_Logistic_Train)
# print(CRH_Train_CM_Ridge)
kable(CRH_Train_CM_Ridge, caption = "Table 21A:
Ridge Regression Confusion Matrix for College Completion Rate High (lambda.min)")
| | 0 | 1 |
|---|---|---|
| 0 | 3435 | 11 |
| 1 | 623 | 19 |
# Make Predictions (lambda.1se)
CRH_Predictions_Ridge_Logistic_Train_lambda.1se = predict(CRH_Model_Ridge_CV_Logistic, newx = as.matrix(CRH_Train_Data[, -ncol(CRH_Train_Data)]), s = "lambda.1se", type = "response")
# Convert the probabilities to 0 and 1
CRH_Predictions_Ridge_Logistic_Train_lambda.1se = ifelse(CRH_Predictions_Ridge_Logistic_Train_lambda.1se > 0.5, 1, 0)
# Print the confusion matrix
CRH_Train_CM_Ridge_lambda.1se <- table(CRH_Train_Data$Completion_Rate_High, CRH_Predictions_Ridge_Logistic_Train_lambda.1se)
# print(CRH_Train_CM_Ridge_lambda.1se)
kable(CRH_Train_CM_Ridge_lambda.1se, caption = "Table 21B:
Ridge Regression Confusion Matrix for College Completion Rate High (lambda.1se)")
| | 0 | 1 |
|---|---|---|
| 0 | 3445 | 1 |
| 1 | 640 | 2 |
True positives and true negatives are institutions the model classified correctly as high and not high, respectively. False positives are institutions predicted to be in the high completion category that were not, and false negatives are institutions in the high completion category that the model predicted as not high.
These findings highlight the classification performance of the ridge regression model in predicting high college completion rates. While the model is accurate in identifying institutions outside the high category, it struggles to identify those within it, as evidenced by the large number of false negatives (623 at lambda.min and 640 at lambda.1se). Improving the model’s ability to detect institutions with high completion rates is crucial for enhancing its effectiveness in supporting decision-making processes related to college completion rates.
On the test dataset, the ridge regression model’s classification
performance for the high completion rate category is displayed in
Tables 22A and 22B. Table 22A shows the confusion matrix for
lambda.min. Here, the model correctly identified 865 institutions
that were not in the high completion category (true negatives) and 4
that were (true positives). However, 5 institutions were predicted as
high when they were not (false positives), and 149 high-completion
institutions were predicted as not high (false negatives).
Table 22B presents the confusion matrix for lambda.1se. It has a
single predicted column because the model predicted “not high” for
every institution: all 870 institutions outside the high completion
category were classified correctly (true negatives), but all 153
institutions within it were missed (false negatives), leaving no true
positives and no false positives. The resulting test accuracy of about
85.0% (870 of 1,023) is therefore the same as always predicting the
majority class.
# Make Predictions (lambda.min)
CRH_Predictions_Ridge_Logistic_Test = predict(CRH_Model_Ridge_CV_Logistic, newx = as.matrix(CRH_Test_Data[, -ncol(CRH_Test_Data)]), s = "lambda.min", type = "response")
# Convert the probabilities to 0 and 1
CRH_Predictions_Ridge_Logistic_Test = ifelse(CRH_Predictions_Ridge_Logistic_Test > 0.5, 1, 0)
# Print the confusion matrix
CRH_Test_CM_Ridge <- table(CRH_Test_Data$Completion_Rate_High, CRH_Predictions_Ridge_Logistic_Test)
# print(CRH_Test_CM_Ridge)
kable(CRH_Test_CM_Ridge, caption = "Table 22A:
Ridge Regression Confusion Matrix for College Completion Rate High (lambda.min)")
| | 0 | 1 |
|---|---|---|
| 0 | 865 | 5 |
| 1 | 149 | 4 |
# Make Predictions (lambda.1se)
CRH_Predictions_Ridge_Logistic_Test_lambda.1se = predict(CRH_Model_Ridge_CV_Logistic, newx = as.matrix(CRH_Test_Data[, -ncol(CRH_Test_Data)]), s = "lambda.1se", type = "response")
# Convert the probabilities to 0 and 1
CRH_Predictions_Ridge_Logistic_Test_lambda.1se = ifelse(CRH_Predictions_Ridge_Logistic_Test_lambda.1se > 0.5, 1, 0)
# Print the confusion matrix
CRH_Test_CM_Ridge_lambda.1se <- table(CRH_Test_Data$Completion_Rate_High, CRH_Predictions_Ridge_Logistic_Test_lambda.1se)
# print(CRH_Test_CM_Ridge_lambda.1se)
kable(CRH_Test_CM_Ridge_lambda.1se, caption = "Table 22B:
Ridge Regression Confusion Matrix for College Completion Rate High (lambda.1se)")
| | 0 |
|---|---|
| 0 | 870 |
| 1 | 153 |
As with the train data, true positives and true negatives are the correct predictions of the high and not-high categories, false positives are institutions wrongly predicted as high, and false negatives are high-completion institutions wrongly predicted as not high.
These findings highlight the model’s classification performance in predicting high college completion rates on the test dataset. Despite achieving high accuracy in identifying institutions outside the high category, the model largely failed to identify those within it, as evidenced by the 149 false negatives at lambda.min and 153 at lambda.1se. Addressing these misclassifications is essential for improving the model’s reliability and effectiveness in predicting college completion rates, thereby supporting informed decision-making processes.
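Table 22B collapses to one column because table() only reports classes that actually occur in the predictions. The sketch below shows, on short illustrative vectors, one way to keep the full 2x2 layout by fixing the factor levels. The vectors actual and predicted are made up for the example.

```r
# Illustrative vectors: the model predicts class 0 for every observation
actual    <- c(0, 0, 0, 1, 1)
predicted <- c(0, 0, 0, 0, 0)

# Plain table() drops the unused predicted class, giving a single column
cm_plain <- table(actual, predicted)
print(cm_plain)

# Fixing the factor levels keeps both classes, so the matrix is always 2x2
cm_full <- table(actual    = factor(actual,    levels = c(0, 1)),
                 predicted = factor(predicted, levels = c(0, 1)))
print(cm_full)
```

With both columns present, the zero counts for true positives and false positives are explicit instead of implied by a missing column.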
LASSO regression was then applied to the logistic model of high
college completion rates to assess whether it improves predictive
performance. As with ridge regression, two lambda values, lambda.min and
lambda.1se, were used to set the degree of regularization. These
values, lambda.min at 0.0006280493 and
lambda.1se at 0.01788707, were selected
through cross-validation to balance the
model’s complexity against its predictive accuracy.
In Table 23A, the LASSO regression coefficients provide crucial
insights into the factors influencing college completion rates high.
These coefficients are derived at the lambda.min value,
which represents the optimal level of regularization for the model. Key
findings include:
Intercept (-3.0166681): The intercept is the log-odds of a high completion rate when all predictor variables are zero. It corresponds to a baseline probability of about 4.7% (the inverse logit of -3.0167).
NUMBRANCH (-0.1853803): Each additional branch lowers the log-odds of a high completion rate by about 0.185, which multiplies the odds by roughly 0.83. One possible explanation is that spreading resources across multiple branches weakens support systems.
HIGHDEG (0): LASSO shrank this coefficient to exactly zero, so the highest degree awarded drops out of the model at this penalty. A zero coefficient reflects the variable selection performed by the penalty, not a significance test.
UG25ABV (1.8535113): A higher share of undergraduates aged 25 and above is associated with higher log-odds of a high completion rate (about 1.854 per unit increase). Older students may have stronger motivation or support systems, although the model does not test this directly.
UGDS (-0.0000166): The coefficient is negative but very small per enrolled student; a difference of 10,000 undergraduates changes the log-odds by only about -0.17, so enrollment size has a minor effect.
UGDS_WHITE (1.6001851): A higher share of White students is associated with higher log-odds of a high completion rate (about 1.600 per unit increase). This may reflect institutional resources or other characteristics correlated with enrollment composition rather than a direct effect.
UGDS_BLACK (-0.0180945): The coefficient is negative but close to zero, so at this penalty the share of Black students shows almost no association with the outcome once the other predictors are included.
UGDS_HISP (1.0913299): A higher share of Hispanic students is associated with higher log-odds of a high completion rate (about 1.091 per unit increase). Colleges with a larger Hispanic student population may have support systems tailored to their needs, though this interpretation is speculative.
UGDS_ASIAN (4.7485007): This is the largest coefficient in the model: a higher share of Asian students is associated with substantially higher log-odds of a high completion rate (about 4.749 per unit increase). Because the coefficients are on the original scale of each variable, the magnitude may partly reflect the narrow range of this share across institutions.
PCTPELL (-1.0592512): A higher share of Pell Grant recipients is associated with lower log-odds of a high completion rate (about -1.059 per unit increase), consistent with the challenges faced by economically disadvantaged students in completing their degrees.
PCTFLOAN (0.5948730): A higher share of students with federal loans is associated with higher log-odds of a high completion rate (about 0.595 per unit increase), which may point to the role of financial aid in supporting students through completion.
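Since logistic coefficients live on the log-odds scale, converting them can make the list above easier to read. The following base R sketch copies the values from Table 23A and applies the inverse logit and the exponential; the object names are illustrative. It assumes the share variables are stored as proportions between 0 and 1, in which case a one-unit increase spans their whole range.

```r
# Coefficients copied from Table 23A (lambda.min); these are on the log-odds scale
coefs <- c(
  "(Intercept)" = -3.0166681, NUMBRANCH = -0.1853803, HIGHDEG = 0,
  UG25ABV = 1.8535113, UGDS = -0.0000166, UGDS_WHITE = 1.6001851,
  UGDS_BLACK = -0.0180945, UGDS_HISP = 1.0913299, UGDS_ASIAN = 4.7485007,
  PCTPELL = -1.0592512, PCTFLOAN = 0.5948730
)

# Baseline probability when every predictor equals zero (inverse logit of the intercept)
baseline_prob <- plogis(coefs[["(Intercept)"]])

# Odds ratios: multiplicative change in the odds for a one-unit increase in a predictor
odds_ratios <- exp(coefs[names(coefs) != "(Intercept)"])

round(baseline_prob, 3)
round(odds_ratios, 3)
```

An odds ratio below 1 (for example NUMBRANCH) indicates lower odds of a high completion rate, a value above 1 indicates higher odds, and a coefficient of zero maps to an odds ratio of exactly 1.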
The LASSO logistic regression model with the coefficients at
lambda.min is as follows (HIGHDEG is omitted because its coefficient is zero):
logit (P (Y = `Completion_Rate_High`)) = -3.0166681 - 0.1853803*`NUMBRANCH` + 1.8535113*`UG25ABV` - 0.0000166*`UGDS` + 1.6001851*`UGDS_WHITE` - 0.0180945*`UGDS_BLACK` + 1.0913299*`UGDS_HISP` + 4.7485007*`UGDS_ASIAN` - 1.0592512*`PCTPELL` + 0.5948730*`PCTFLOAN`
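# Fit a LASSO-penalized logistic regression (alpha = 1), using cross-validation to choose lambda.
# The predictors are every column except the last, which holds the response Completion_Rate_High.
CRH_Model_Lasso_CV_Logistic <- cv.glmnet(as.matrix(CRH_Train_Data[, -ncol(CRH_Train_Data)]), CRH_Train_Data$Completion_Rate_High, alpha = 1, family = "binomial")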
# print the best lambda (lambda.min)
# CRH_Model_Lasso_CV_Logistic$lambda.min # -- lambda.min = 0.0006280493
# print coefficients for the best lambda
CRH_Coefficients_LASSO <- predict(CRH_Model_Lasso_CV_Logistic, s = "lambda.min", type = "coefficients")
# Convert the sparse matrix to a regular matrix
CRH_Coefficients_LASSO_matrix <- as.matrix(CRH_Coefficients_LASSO)
# Convert the coefficient matrix to a data frame
CRH_Coefficients_LASSO_df <- as.data.frame(CRH_Coefficients_LASSO_matrix)
# Create a basic table using kable
kable(CRH_Coefficients_LASSO_df, caption = "Table 23A: LASSO Regression Coefficients Analyzing Factors Influencing College Completion Rate High")
| | lambda.min |
|---|---|
| (Intercept) | -3.0166681 |
| NUMBRANCH | -0.1853803 |
| HIGHDEG | 0.0000000 |
| UG25ABV | 1.8535113 |
| UGDS | -0.0000166 |
| UGDS_WHITE | 1.6001851 |
| UGDS_BLACK | -0.0180945 |
| UGDS_HISP | 1.0913299 |
| UGDS_ASIAN | 4.7485007 |
| PCTPELL | -1.0592512 |
| PCTFLOAN | 0.5948730 |
# print the best lambda (lambda.1se)
# CRH_Model_Lasso_CV_Logistic$lambda.1se #-- lambda.1se = 0.01788707
# print coefficients for the best lambda
CRH_Coefficients_LASSO_lambda.1se <- predict(CRH_Model_Lasso_CV_Logistic, s = "lambda.1se", type = "coefficients")
# Convert the sparse matrix to a regular matrix
CRH_Coefficients_LASSO_matrix_lambda.1se <- as.matrix(CRH_Coefficients_LASSO_lambda.1se)
# Convert the coefficient matrix to a data frame
CRH_Coefficients_LASSO_df_lambda.1se <- as.data.frame(CRH_Coefficients_LASSO_matrix_lambda.1se)
# Create a basic table using kable
kable(CRH_Coefficients_LASSO_df_lambda.1se, caption = "Table 23B: LASSO Regression Coefficients Analyzing Factors Influencing College Completion Rate High")
| | lambda.1se |
|---|---|
| (Intercept) | -2.0358658 |
| NUMBRANCH | -0.0623581 |
| HIGHDEG | 0.0000000 |
| UG25ABV | 0.7650294 |
| UGDS | 0.0000000 |
| UGDS_WHITE | 0.2964689 |
| UGDS_BLACK | -0.2249161 |
| UGDS_HISP | 0.0000000 |
| UGDS_ASIAN | 2.2646633 |
| PCTPELL | 0.0000000 |
| PCTFLOAN | 0.0000000 |
Table 23B reports the LASSO coefficients at lambda.1se, the stronger of the two penalties. The intercept is -2.0358658, the log-odds of a high completion rate when all predictor variables are zero, which corresponds to a baseline probability of about 11.5%.
Among the predictors, NUMBRANCH has a negative coefficient of -0.0623581, so each additional branch lowers the log-odds of a high completion rate by about 0.062. In contrast, UG25ABV has a positive coefficient of 0.7650294, so a higher share of undergraduates aged 25 and above is associated with higher log-odds (about 0.765 per unit increase).
Several predictors, namely HIGHDEG, UGDS, UGDS_HISP, PCTPELL, and PCTFLOAN, have coefficients of exactly zero. LASSO removed these variables from the model at this level of regularization; this is a result of the penalty, not a test of statistical significance.
UGDS_BLACK has a negative coefficient of -0.2249161, so a higher share of Black students is associated with lower log-odds of a high completion rate. This coefficient is larger in magnitude than at lambda.min (-0.0180945), which can happen when correlated predictors are dropped from the model. UGDS_WHITE has a positive coefficient of 0.2964689, so a higher share of White students is associated with higher log-odds (about 0.296 per unit increase).
UGDS_ASIAN again has the largest coefficient, 2.2646633, so a higher share of Asian students is associated with markedly higher log-odds of a high completion rate.
Overall, the variables that survive the stronger penalty are the number of branches, the share of older undergraduates, and three demographic shares, which points to demographic composition and student characteristics as the most stable predictors of a high completion rate in this model.
The LASSO logistic regression model for high completion rates, with the
coefficients at lambda.1se, is as follows (predictors with zero coefficients are omitted):
logit (P (Y = `Completion_Rate_High`)) = -2.0358658 - 0.0623581*`NUMBRANCH` + 0.7650294*`UG25ABV` + 0.2964689*`UGDS_WHITE` - 0.2249161*`UGDS_BLACK` + 2.2646633*`UGDS_ASIAN`
In the analysis of high completion rates using LASSO regression, the
confusion matrices for the train dataset were examined to assess the
model’s predictive performance. In Table 24A, representing the confusion
matrix at lambda.min, the model correctly identified 3,432 colleges that
did not have a high completion rate (true negatives) and 19 colleges
that did (true positives). However, 623 colleges with a high completion
rate were classified as not having one (false negatives), and 14
colleges were classified as having a high completion rate when they did
not (false positives).
Similarly, in Table 24B, which represents the confusion matrix at
lambda.1se, the model correctly identified 3,443 colleges that did not
have a high completion rate (true negatives) and 5 colleges that did
(true positives). There were 637 false negatives and 3 false positives.
# Make Predictions (lambda.min)
CRH_Predictions_Lasso_Log_Train = predict(CRH_Model_Lasso_CV_Logistic, newx = as.matrix(CRH_Train_Data[, -ncol(CRH_Train_Data)]), s = "lambda.min", type = "response")
# convert the probabilities to 0 and 1
CRH_Predictions_Lasso_Log_Train = ifelse(CRH_Predictions_Lasso_Log_Train > 0.5, 1, 0)
# Print the confusion matrix
CRH_Train_CM_LASSO <- table(CRH_Train_Data$Completion_Rate_High, CRH_Predictions_Lasso_Log_Train)
# print(CRH_Train_CM_LASSO)
kable(CRH_Train_CM_LASSO, caption = "Table 24A: LASSO Regression Confusion Matrix for College Completion Rate High (Train, lambda.min)")
| Actual / Predicted | 0 | 1 |
|---|---|---|
| 0 | 3432 | 14 |
| 1 | 623 | 19 |
# Make Predictions (lambda.1se)
CRH_Predictions_Lasso_Log_Train_lambda.1se = predict(CRH_Model_Lasso_CV_Logistic, newx = as.matrix(CRH_Train_Data[, -ncol(CRH_Train_Data)]), s = "lambda.1se", type = "response")
# convert the probabilities to 0 and 1
CRH_Predictions_Lasso_Log_Train_lambda.1se = ifelse(CRH_Predictions_Lasso_Log_Train_lambda.1se > 0.5, 1, 0)
# Print the confusion matrix
CRH_Train_CM_LASSO_lambda.1se <- table(CRH_Train_Data$Completion_Rate_High, CRH_Predictions_Lasso_Log_Train_lambda.1se)
# print(CRH_Train_CM_LASSO_lambda.1se)
kable(CRH_Train_CM_LASSO_lambda.1se, caption = "Table 24B: LASSO Regression Confusion Matrix for College Completion Rate High (Train, lambda.1se)")
| Actual / Predicted | 0 | 1 |
|---|---|---|
| 0 | 3443 | 3 |
| 1 | 637 | 5 |
These findings show that, on the train dataset, the LASSO model identifies colleges without a high completion rate well but rarely identifies colleges with one. At lambda.min it detects 19 of the 642 high-completion colleges (about 3%), and at lambda.1se only 5 (under 1%). Overall accuracy (84.4% and 84.3%, respectively) is therefore driven almost entirely by the majority class, which makes up 84.3% of the train dataset. False positives are rare (14 and 3), so the dominant error is the false negative: the model overlooks most colleges that do achieve a high completion rate.
These misclassifications can have implications for decision-making processes, as they may lead to inaccurate assessments of college performance or resource allocation. Further refinement of the model, possibly through adjusting regularization parameters or including additional predictor variables, may be necessary to improve its predictive accuracy and reduce misclassifications.
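The accuracy, sensitivity, and specificity implied by Tables 24A and 24B can be computed directly from the four cell counts. The sketch below is base R only; `cm_metrics` is a hypothetical helper written for this illustration, and the matrices are re-entered by hand from the tables rather than taken from the fitted model objects.

```r
# Summarize a 2x2 confusion matrix (rows = actual class, columns = predicted class)
cm_metrics <- function(cm) {
  tn <- cm["0", "0"]; fp <- cm["0", "1"]
  fn <- cm["1", "0"]; tp <- cm["1", "1"]
  c(accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),   # share of actual positives detected
    specificity = tn / (tn + fp))   # share of actual negatives detected
}

# Counts copied from Tables 24A and 24B (train data)
dims   <- list(actual = c("0", "1"), predicted = c("0", "1"))
cm_min <- matrix(c(3432, 14, 623, 19), nrow = 2, byrow = TRUE, dimnames = dims)
cm_1se <- matrix(c(3443, 3, 637, 5), nrow = 2, byrow = TRUE, dimnames = dims)

round(rbind(lambda.min = cm_metrics(cm_min),
            lambda.1se = cm_metrics(cm_1se)), 3)
```

Reporting sensitivity alongside accuracy makes the imbalance visible: a model can reach high accuracy here while detecting very few of the positive cases.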
In the evaluation of the LASSO regression model’s performance on the
test dataset, two confusion matrices were analyzed to assess its ability
to predict high completion rates. In Table 25A, representing the
confusion matrix at lambda.min, the model correctly identified 865
colleges that did not have a high completion rate (true negatives) and 4
colleges that did (true positives). However, 149 colleges with a high
completion rate were classified as not having one (false negatives), and
5 colleges were classified as having a high completion rate when they
did not (false positives).
Similarly, in Table 25B, which represents the confusion matrix at
lambda.1se, the model correctly identified 869 colleges that did not
have a high completion rate (true negatives) and 1 college that did
(true positive). There were 152 false negatives and 1 false positive.
# Make Predictions (lambda.min)
CRH_Predictions_Lasso_Log_Test = predict(CRH_Model_Lasso_CV_Logistic, newx = as.matrix(CRH_Test_Data[, -ncol(CRH_Test_Data)]), s = "lambda.min", type = "response")
# Convert the probabilities to 0 and 1
CRH_Predictions_Lasso_Log_Test = ifelse(CRH_Predictions_Lasso_Log_Test > 0.5, 1, 0)
# Print the confusion matrix
CRH_Test_CM_LASSO <- table(CRH_Test_Data$Completion_Rate_High, CRH_Predictions_Lasso_Log_Test)
# print(CRH_Test_CM_LASSO)
kable(CRH_Test_CM_LASSO, caption = "Table 25A: LASSO Regression Confusion Matrix for College Completion Rate High (Test, lambda.min)")
| Actual / Predicted | 0 | 1 |
|---|---|---|
| 0 | 865 | 5 |
| 1 | 149 | 4 |
# Make Predictions (lambda.1se)
CRH_Predictions_Lasso_Log_Test_lambda.1se = predict(CRH_Model_Lasso_CV_Logistic, newx = as.matrix(CRH_Test_Data[, -ncol(CRH_Test_Data)]), s = "lambda.1se", type = "response")
# Convert the probabilities to 0 and 1
CRH_Predictions_Lasso_Log_Test_lambda.1se = ifelse(CRH_Predictions_Lasso_Log_Test_lambda.1se > 0.5, 1, 0)
# Print the confusion matrix
CRH_Test_CM_LASSO_lambda.1se <- table(CRH_Test_Data$Completion_Rate_High, CRH_Predictions_Lasso_Log_Test_lambda.1se)
# print(CRH_Test_CM_LASSO_lambda.1se)
kable(CRH_Test_CM_LASSO_lambda.1se, caption = "Table 25B: LASSO Regression Confusion Matrix for College Completion Rate High (Test, lambda.1se)")
| Actual / Predicted | 0 | 1 |
|---|---|---|
| 0 | 869 | 1 |
| 1 | 152 | 1 |
The test results follow the same pattern as the train results. The model classifies nearly every college as not having a high completion rate: it detects 4 of the 153 high-completion colleges at lambda.min (about 2.6%) and 1 at lambda.1se (about 0.7%). Overall accuracy is 84.9% at lambda.min and 85.0% at lambda.1se, which is no better than the 85.0% obtained by assigning every college to the majority class. False positives remain rare (5 and 1), so the main limitation is again the large number of false negatives.
Such misclassifications can have implications for decision-making processes, as they may lead to inaccurate assessments of college performance or resource allocation. Further refinement of the model, such as adjusting regularization parameters or considering additional predictor variables, may be necessary to enhance its predictive accuracy and reduce misclassifications.
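One way to put the test accuracy in context is to compare it with the no-information rate, that is, the accuracy of always predicting the majority class. The base R sketch below re-enters the counts from Tables 25A and 25B by hand; the object names are illustrative.

```r
# Counts copied from Tables 25A and 25B (test data): rows = actual, columns = predicted
cm_test <- list(
  lambda.min = matrix(c(865, 5, 149, 4), nrow = 2, byrow = TRUE),
  lambda.1se = matrix(c(869, 1, 152, 1), nrow = 2, byrow = TRUE)
)

# Accuracy of each model: correct predictions sit on the diagonal
accuracy <- sapply(cm_test, function(cm) sum(diag(cm)) / sum(cm))

# No-information rate: accuracy from always predicting the majority class (0),
# i.e. the share of actual negatives (first row) among all test observations
no_info_rate <- sum(cm_test$lambda.min[1, ]) / sum(cm_test$lambda.min)

round(c(accuracy, no_information_rate = no_info_rate), 4)
```

When a model's accuracy does not exceed this rate, accuracy alone gives little evidence that the predictors help classify the positive class.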
The logistic regression, ridge regression, and LASSO regression models each provided insight into the factors associated with high college completion rates. Comparing the regularized models at lambda.min and lambda.1se also showed how the strength of the penalty affects both the coefficients and the predictions.
Logistic Regression vs. Ridge Regression: The unpenalized logistic regression has no lambda, because it does not incorporate regularization. Ridge regression adds a penalty to address issues like multicollinearity and overfitting. Because the ridge penalty shrinks coefficients toward zero without setting them exactly to zero, all predictors remain in the model at both lambda.min and lambda.1se; lambda.min simply applies less shrinkage. Both lambda values in ridge regression produced similar confusion matrix results, with a notable number of misclassifications.
LASSO Regression: LASSO regression, like ridge regression, used regularization but with a different penalty term, one that promotes sparsity in the coefficient estimates. The model at lambda.1se was the more parsimonious of the two, with five non-zero coefficients compared to nine at lambda.min. The sparser model also produced more false negatives (637 versus 623 in the train dataset and 152 versus 149 in the test dataset), although both versions missed the large majority of high-completion colleges.
Multiple Linear Regression: Alongside the classification models, multiple linear regression, a conventional modeling technique, was used to examine linear associations between the predictors and completion rates. Its coefficient estimates showed how changes in predictors such as the number of branches, demographic composition, and financial aid metrics correlated with changes in completion rates, complementing the findings from the other models.
Comparison of Lambda Values: In both regularized models, lambda.min applies a weaker penalty than lambda.1se and therefore yields the more complex model. For LASSO, this meant nine non-zero coefficients at lambda.min versus five at lambda.1se. The additional complexity brought only a marginal gain in detecting the positive class, and misclassifications remained frequent at both values.
Performance Evaluation: In terms of predictive performance, logistic regression, despite its simplicity, showed promising results, with relatively low misclassification rates in both the train and test datasets. Ridge regression and LASSO regression struggled to predict the positive class; for LASSO, false negatives far outnumbered true positives in both datasets, while false positives were rare.
Unexpected Insights: While logistic regression was expected to perform adequately due to its straightforward nature, the underperformance of ridge regression and LASSO regression was somewhat unexpected. Regularization was anticipated to improve predictive accuracy by mitigating overfitting. However, regularization reduces model flexibility, so a penalty that is too strong can lead to underfitting. In addition, high-completion colleges make up only about 15% to 16% of the observations, so classifying at a 0.5 probability threshold favors the majority class and may explain much of the weak performance on the positive class.
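The difference in sparsity between the two LASSO penalties discussed above can be checked by counting non-zero coefficients. This base R sketch re-enters the values from Tables 23A and 23B by hand (intercept omitted); the data frame name is illustrative.

```r
# Coefficients copied from Tables 23A and 23B (intercept omitted)
lasso_coefs <- data.frame(
  term = c("NUMBRANCH", "HIGHDEG", "UG25ABV", "UGDS", "UGDS_WHITE",
           "UGDS_BLACK", "UGDS_HISP", "UGDS_ASIAN", "PCTPELL", "PCTFLOAN"),
  lambda.min = c(-0.1853803, 0, 1.8535113, -0.0000166, 1.6001851,
                 -0.0180945, 1.0913299, 4.7485007, -1.0592512, 0.5948730),
  lambda.1se = c(-0.0623581, 0, 0.7650294, 0, 0.2964689,
                 -0.2249161, 0, 2.2646633, 0, 0)
)

# Number of predictors retained at each penalty
n_nonzero <- colSums(lasso_coefs[, c("lambda.min", "lambda.1se")] != 0)
n_nonzero

# Predictors kept at lambda.min but dropped at lambda.1se
dropped <- with(lasso_coefs, term[lambda.min != 0 & lambda.1se == 0])
dropped
```

The same comparison applied to ridge coefficients would show no dropped predictors, since the ridge penalty shrinks coefficients without setting them exactly to zero.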
In evaluating the performance of the models developed in this analysis report, it becomes evident that each approach offers unique insights into the factors influencing high college completion rates. Among the models employed – logistic regression, ridge regression, LASSO regression, and multiple linear regression – each has its strengths and limitations, which are crucial to consider in determining the most effective predictive tool.
The logistic regression model, which focused on predicting high college completion rates, offered a comprehensive understanding of the relationships between various predictor variables and the likelihood of achieving high completion rates. For instance, it highlighted the impact of factors such as the presence of different degree programs, student demographics, and financial aid metrics on completion rates. However, despite its interpretability, logistic regression struggled with misclassifications, particularly in identifying instances of high completion rates accurately.
On the other hand, ridge regression addressed concerns such as multicollinearity and overfitting by adding a penalty on the size of the coefficient estimates. The ridge regression model identified non-completion instances with relatively high accuracy. However, it still faced challenges in accurately predicting completion instances, indicating room for further refinement.
Similarly, LASSO regression, another regularization-based approach, provided valuable insights into the factors influencing high college completion rates. By selecting a subset of relevant predictor variables while shrinking others to zero, LASSO regression offered a balance between model complexity and predictive accuracy. However, like ridge regression, it exhibited limitations in accurately predicting completion instances, especially for the positive class.
In contrast, multiple linear regression, a more traditional modeling technique, offered insights into the linear associations between predictor variables and high completion rates. Despite its simplicity compared to the regularization-based approaches, multiple linear regression provided valuable insights into the relationships between predictor variables and completion rates, albeit without addressing issues like multicollinearity or overfitting.
Comparing the performance of these models, none achieved strong predictive accuracy for the positive class. Misclassifications persisted across all models, indicating how difficult it is to predict high college completion rates accurately. This outcome was somewhat expected, given the inherent challenges associated with predicting human behavior and educational outcomes.
Ultimately, the choice of the “better” model depends on the specific goals and requirements of the analysis. If interpretability and understanding the linear relationships between predictor variables and completion rates are paramount, multiple linear regression may be preferred. However, if predictive accuracy and addressing issues like multicollinearity and overfitting are crucial, ridge regression or LASSO regression may offer more suitable alternatives.
For instance, consider the logistic regression model:
logit (P (Y = `Completion Rate High`)) = 0.13 - 0.22*`NUMBRANCH` - 0.37*`HIGHDEGCertificate Degree` - 2.69*`HIGHDEGAssociate Degree` - 1.90*`HIGHDEGBachelors Degree` - 2.42*`HIGHDEGGraduate Degree` + 0.85*`UG25ABV` + 0.36*`UGDS_WHITE` - 1.38*`UGDS_BLACK` - 0.13*`UGDS_HISP` + 4.42*`UGDS_ASIAN` - 2.10*`PCTPELL` + 0.85*`PCTFLOAN`
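To show how such an equation turns into a predicted probability, the base R sketch below evaluates it for one institution. The coefficients are copied from the equation above; the institution's values in `x` are hypothetical and chosen only for illustration (a single-branch, bachelor's-granting college with made-up shares), not taken from the dataset.

```r
# Coefficients copied from the logistic regression equation above
beta <- c(
  "(Intercept)" = 0.13, NUMBRANCH = -0.22,
  "HIGHDEGCertificate Degree" = -0.37, "HIGHDEGAssociate Degree" = -2.69,
  "HIGHDEGBachelors Degree" = -1.90, "HIGHDEGGraduate Degree" = -2.42,
  UG25ABV = 0.85, UGDS_WHITE = 0.36, UGDS_BLACK = -1.38, UGDS_HISP = -0.13,
  UGDS_ASIAN = 4.42, PCTPELL = -2.10, PCTFLOAN = 0.85
)

# Hypothetical institution (illustrative values only, not from the dataset)
x <- c(
  "(Intercept)" = 1, NUMBRANCH = 1,
  "HIGHDEGCertificate Degree" = 0, "HIGHDEGAssociate Degree" = 0,
  "HIGHDEGBachelors Degree" = 1, "HIGHDEGGraduate Degree" = 0,
  UG25ABV = 0.20, UGDS_WHITE = 0.50, UGDS_BLACK = 0.15, UGDS_HISP = 0.15,
  UGDS_ASIAN = 0.05, PCTPELL = 0.40, PCTFLOAN = 0.50
)

log_odds <- sum(beta * x[names(beta)])  # linear predictor on the logit scale
prob     <- plogis(log_odds)            # inverse logit gives the probability
c(log_odds = log_odds, probability = prob)
```

With a 0.5 classification threshold, an institution is labeled as having a high completion rate only when its log-odds are positive, which helps explain why few institutions receive a positive prediction when most coefficients and the baseline pull the log-odds below zero.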
While no single model emerged as the definitive “best” performer, each provided valuable insights into the complex interplay of factors influencing high college completion rates. Further refinement and iteration of these models, possibly through incorporating additional predictor variables or exploring alternative modeling techniques, are necessary to enhance their predictive accuracy and reliability for informing decision-making processes related to educational outcomes.
Bluman, A. (2018). Elementary statistics: A step by step approach (10th ed.). McGraw Hill. Goodreads. (n.d.).
Kabacoff, R. I. (2022). R in action: Data analysis and graphics with R and tidyverse (3rd ed.).