Project Overview

Abstract

Our dashboard draws on data from the US Department of Education’s College Scorecard to provide a comprehensive view of higher education institutions across the United States. It presents key metrics such as graduation rates, median earnings, student demographics, and institutional characteristics in a single, user-friendly interface.

By consolidating this diverse set of data points, our dashboard enables users to uncover trends, patterns, and correlations. This information can be used for strategic planning, policy development, and program evaluation. In short, our dashboard offers a valuable resource for anyone interested in gaining insights into the higher education landscape in the United States.

Data Introduction

The data comes from the US Department of Education’s College Scorecard; we use the institution-level data set. It covers each institution from 1996-97 through 2021-22 and includes information on institutional characteristics, enrollment, student aid, costs, and student outcomes. The data is drawn from federal reporting by institutions, data on federal financial aid, and tax information, with most of it coming from the Integrated Postsecondary Education Data System (IPEDS). The data is categorized under data area 4, Education.

The full dataset can be accessed through the following link

More Information about the data can be found here

The work was split up in the following ways:

Divya

  • Multiple Regression
  • Ridge Regression
  • Naive Bayes

Grace

  • Loess Regression
  • kNN Classification
  • Classification using Logistic Regression

Row

Data Dictionary

Data Dictionary for Institution Dataset
Name | Variable Name | Data Type | Description
Institute Name | INSTNM | Categorical | Institute name
City | CITY | Categorical | The city in which the institute is located
Region | REGION | Categorical | 0: U.S. Service Schools 1: New England 2: Mid East 3: Great Lakes 4: Plains 5: Southeast 6: Southwest 7: Rocky Mountains 8: Far West
Mean earnings of students working and not enrolled 10 years after entry | MN_EARN_WNE_P10 | Numerical | NA
Control of institution | CONTROL | Categorical | 1: Public 2: Private nonprofit 3: Private for-profit
Bachelor’s degree in Engineering | CIP14BACHL | Categorical | 0: Program not offered 1: Program offered 2: Program offered through an exclusively distance-education program
Undergraduate student to instructional faculty ratio | STUFACR | Numerical | NA
Average faculty salary | AVGFACSAL | Numerical | Average faculty salary per month
Retention | RET_FT4 | Numerical | First-time, full-time student retention rate at four-year institutions
Number of Branches | NUMBRANCH | Numerical | NA
Percent of all undergraduate students receiving a federal student loan | PCTFLOAN | Numerical | NA
Admission rate | ADM_RATE | Numerical | NA
Average SAT score | SAT_AVG | Numerical | NA
Open admission | OPENADMP | Categorical | 1: Yes 2: No 3: Does not enroll first-time students
Graduate population | GRADS | Numerical | NA
On-time Completion in a 4 year institute | C100_4 | Numerical | Completion rate for first-time, full-time students at four-year institutions (100% of expected time to completion)
Number of undergraduate students | UGDS | Numerical | NA
Total share of enrollment of undergraduate students who are white | UGDS_WHITE | Numerical | NA
Total share of enrollment of undergraduate students who are Asian | UGDS_ASIAN | Numerical | NA
Total share of enrollment of undergraduate students who are Black | UGDS_BLACK | Numerical | NA
Total share of enrollment of undergraduate students who are Hispanic | UGDS_HISP | Numerical | NA
Share of full time faculty that are women | IRPS_WOMEN | Numerical | NA
Bachelor’s degree in Biological And Biomedical Sciences | CIP26BACHL | Categorical | 0: Program not offered 1: Program offered 2: Program offered through an exclusively distance-education program

Introduction

Research Question

What combination of predictors best explains the variation in retention rate of first-time, full-time students at 4-year institutions?

Row

Methodology

I will use multiple linear regression to find the best combination of predictors and then test the model’s accuracy. The response variable is “Retention” (RET_FT4), and the candidate predictor variables are:

  • Admission rate (ADM_RATE)
  • Average SAT score (SAT_AVG)
  • Completion rate (C100_4)
  • Undergraduate population (UGDS)
  • Undergraduate student to instructional faculty ratio (STUFACR)

Row

Data Table

INSTNM RET_FT4 ADM_RATE SAT_AVG C100_4 UGDS STUFACR
1 Alabama A & M University 0.5797 0.7160 954 0.1121 5098 18
2 University of Alabama at Birmingham 0.8392 0.8854 1266 0.4209 13284 19
4 University of Alabama in Huntsville 0.7899 0.7367 1300 0.3528 7358 19
5 Alabama State University 0.6436 0.9799 955 0.1014 3495 13
6 The University of Alabama 0.8855 0.7890 1244 0.5264 30725 19
9 Auburn University at Montgomery 0.6303 0.9680 1069 0.1272 3742 15

Model Building and Assumptions

What combination of predictors best explains the variation in retention rate of first-time, full-time students at 4-year institutions?

Row

Stepwise variable selection

After performing stepwise regression analysis, it was found that SAT average, on-time completion rate, and undergraduate population have a statistically significant association with retention rate. However, Admission Rates and Student Faculty ratio were found to be less significant and were subsequently removed. The best model is RET_FT4 ~ SAT_AVG + C100_4 + UGDS.

Row

Normality Assumption

Based on the Q-Q plot, most data points fall along the Q-Q line, but the tails deviate from it, which suggests that the normality assumption is violated. The Shapiro-Wilk test supports this: its small p-value provides evidence against the null hypothesis that the residuals are normally distributed. Therefore, we conclude that normality is violated.

Row

QQ Plot

Shapiro test

Shapiro-Wilk Test Results
Test Statistic P_Value
W Shapiro-Wilk Test 0.9306323 0

Row

Independence Assumption

According to the Durbin-Watson test, the D-W statistic (1.82) is less than 2, indicating positive autocorrelation in the residuals. The p-value (0.004) is less than 0.05, so we reject the null hypothesis of no autocorrelation and conclude that the independence assumption is violated.
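
For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals; values near 2 indicate no first-order autocorrelation, and values below 2 indicate positive autocorrelation. The sketch below illustrates the calculation on made-up residuals. The function name and the data are illustrative assumptions, not part of the original analysis.

```python
def durbin_watson(residuals):
    """Return sum((e_t - e_{t-1})^2) / sum(e_t^2) for residuals in observation order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals: alternating signs push the statistic above 2,
# while residuals that drift slowly push it toward 0.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
drifting = [1.0, 0.9, 0.8, -0.7, -0.8, -0.9]

dw_alternating = durbin_watson(alternating)
dw_drifting = durbin_watson(drifting)
print(dw_alternating, dw_drifting)
```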

Durbin Watson Test

Durbin-Watson Test Results
Test Statistic P_Value
Durbin-Watson Test 1.822582 0.004

Row

Constant Variance Assumption

A Levene test was not conducted on the linear regression. Based on the Breusch-Pagan test, we conclude that the constant variance assumption is violated, since the p-value is less than 0.05.

Breusch-Pagan Test

Breusch-Pagan Test Results
Test LM_Statistic P_Value
BP Breusch-Pagan Test 59.29666 0

Analyzing Relationships, Model Assumptions, Outliers, and Influential Points

What combination of predictors best explains the variation in retention rate of first-time, full-time students at 4-year institutions?

Row

Linearity

The scatterplot matrix below shows the pairwise relationships between the key variables in our analysis: “RET_FT4,” “SAT_AVG,” “C100_4,” and “UGDS.”

Each cell in the upper triangle of the matrix presents a smoothed scatterplot between two variables. Moreover, the lower triangle of the matrix displays correlation coefficients. Positive correlations (values close to 1) suggest that if one variable increases, the other tends to increase as well. On the other hand, negative correlations (values close to -1) indicate an inverse relationship.

The second plot, Standardized Residuals vs Fitted Values, displays a cone shape, which indicates heteroscedasticity in the model.

Row

Scatterplot Matrix

Residuals vs Fitted Values

Row

Outliers and influential points

According to the plot, we can observe that there are three substantial outliers, which are 458, 1857, and 5979. However, they were not eliminated from the dataset. Additionally, based on the table, we can identify six influential points that must be considered.

Row

Outliers

Influential points

Influential points Test Results
Student_Residuals Unadjusted_P_Value
158 -6.738894 0.00e+00
4680 6.019783 0.00e+00
1857 5.939086 0.00e+00
3024 -4.964702 8.00e-07
936 4.934349 9.00e-07
824 -4.174032 3.25e-05

Chosen Model

What combination of predictors best explains the variation in retention rate of first-time, full-time students at 4-year institutions?

Row

Summary Statistics

According to the low p-value for the F statistic, we can conclude that the model is statistically significant. Furthermore, all the predictors have a significant association with the response (retention rate).

The multiple \(R^2\) value of 0.66 indicates that 66% of the variability in retention rate of first years is explained by the predictors. However, there may be some additional predictors that were not considered during the initial model building process, which could account for some of the variability.

The adjusted \(R^2\) value of 0.665 adjusts the multiple \(R^2\) value to account for the number of predictors in the model. Since the two \(R^2\) values are very close, the penalty for the number of predictors is small, which suggests that the model does not include unnecessary predictors.
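
For reference, the standard relationship between the two statistics is given below, where \(n\) is the number of observations used to fit the model and \(p\) is the number of predictors (here \(p = 3\)); the value of \(n\) is not stated in this report.

\[ R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} \]

Because the factor \((n - 1)/(n - p - 1)\) is at least 1, the adjusted value cannot exceed the multiple \(R^2\). The reported figures (0.66 and 0.665) are therefore presumably the same quantity up to rounding or truncation.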

Row

Summary Statistics

Model Summary for RET_FT4
term estimate std.error statistic p.value
(Intercept) 0.1435778 0.0256953 5.587715 0
SAT_AVG 0.0004355 0.0000275 15.812946 0
C100_4 0.1958750 0.0177822 11.015231 0
UGDS 0.0000028 0.0000003 9.331482 0

Row

Scatter Plots of predictors against response variable

Based on the plots, most universities have fewer than 20,000 undergraduate students. However, some universities have significantly larger undergraduate populations, making the distribution of the undergraduate population right-skewed. The average SAT score falls mostly between 1000 and 1200, which is not surprising since the majority of students score in this range. Relatively few institutions have an average below 1000 or above 1500, although there are more above 1500 than below 1000.

Row

SAT Score

Completion Rate

Undergraduate Population

Row

Correlation coefficients

There is a strong positive correlation (0.781) between average SAT scores and retention rates. This suggests that colleges and universities with higher average SAT scores typically have higher retention rates for first-time, full-time students.

Similarly, retention rates have a strong positive correlation (0.715) with on-time completion rates.

Retention rates have a moderate positive correlation (0.316) with the undergraduate population. Furthermore, there is a moderate negative correlation (-0.406) between retention rates and admission rates (ADM_RATE). This indicates that institutions with lower admission rates tend to have higher retention rates for first-time, full-time students.

The other correlations are also informative. For example, average SAT scores are negatively correlated with admission rates (-0.537), which implies that colleges and universities with higher average SAT scores typically have lower admission rates. This is expected, since more selective institutions tend to admit students with higher scores.

Row

Correlation coefficients matrix

RET_FT4 ADM_RATE SAT_AVG C100_4 UGDS STUFACR
RET_FT4 1.0000000 -0.4061152 0.7809558 0.7148796 0.3158825 -0.1387544
ADM_RATE -0.4061152 1.0000000 -0.5366314 -0.4438455 -0.0200048 0.3105386
SAT_AVG 0.7809558 -0.5366314 1.0000000 0.7807436 0.2422476 -0.2550316
C100_4 0.7148796 -0.4438455 0.7807436 1.0000000 0.0632319 -0.3767704
UGDS 0.3158825 -0.0200048 0.2422476 0.0632319 1.0000000 0.4633320
STUFACR -0.1387544 0.3105386 -0.2550316 -0.3767704 0.4633320 1.0000000
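
Each entry in the matrix is a Pearson correlation coefficient. As a minimal sketch of the calculation, the snippet below computes the coefficient for RET_FT4 and SAT_AVG using only the six rows shown in the Data Table above, so the result will not match the full-sample value of 0.781. The function name is illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# The six example rows from the Data Table (not the full dataset).
ret_ft4 = [0.5797, 0.8392, 0.7899, 0.6436, 0.8855, 0.6303]
sat_avg = [954, 1266, 1300, 955, 1244, 1069]

r_sample = pearson(ret_ft4, sat_avg)
print(round(r_sample, 3))
```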

Row

VIF points

Since the VIF scores are all below 10, multicollinearity is not a significant concern among the predictor variables in the model. SAT_AVG and C100_4 are correlated with each other (0.78), but their VIF scores (about 2.8 and 2.7) are not high enough to cause concern.
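
To make the VIF scores easier to interpret: the VIF of predictor j equals 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The sketch below inverts that formula for the VIF values reported in the table, showing how much of each predictor's variance the other predictors explain. The variable names are illustrative.

```python
# VIF values copied from the VIF table.
vif = {"SAT_AVG": 2.832160, "C100_4": 2.676659, "UGDS": 1.110226}

# Invert VIF = 1 / (1 - R_j^2) to recover R_j^2 for each predictor.
implied_r2 = {name: 1 - 1 / value for name, value in vif.items()}

for name, r2 in implied_r2.items():
    print(f"{name}: about {r2:.0%} of its variance is explained by the other predictors")
```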

Row

VIF

x
SAT_AVG 2.832160
C100_4 2.676659
UGDS 1.110226

Row

Predictions

I attempted to predict the retention rate for a fictitious school. I assigned values to the predictor variables and used the predict function to obtain a retention rate for my multiple linear regression model. Based on this prediction, it appears that a school with an average SAT score of 1530, an undergraduate population of 15,000, and an on-time completion rate of 0.27 would have a 90% retention rate.
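
The prediction can be checked by hand against the coefficients in the Model Summary table. The sketch below is an illustration rather than the original code (which used the predict function); it evaluates the fitted equation with the rounded coefficients, so small differences from the exact model output are possible.

```python
# Coefficients copied from the "Model Summary for RET_FT4" table.
coefficients = {
    "(Intercept)": 0.1435778,
    "SAT_AVG": 0.0004355,
    "C100_4": 0.1958750,
    "UGDS": 0.0000028,
}

def predict_retention(sat_avg, c100_4, ugds):
    """Evaluate the fitted linear equation for one school."""
    return (
        coefficients["(Intercept)"]
        + coefficients["SAT_AVG"] * sat_avg
        + coefficients["C100_4"] * c100_4
        + coefficients["UGDS"] * ugds
    )

predicted = predict_retention(sat_avg=1530, c100_4=0.27, ugds=15000)
print(round(predicted, 3))  # 0.905 with the rounded coefficients above
```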

Ridge Regression

Research Question

How does adding regularization affect the prediction accuracy of models for predicting completion rates for first-time students at 4-year institutions, while controlling for other variables such as in-state tuition and fees and number of undergraduate students?

Row

Variables for model

I will use the completion rate (C100_4) as the response variable. This variable is the proportion of first-time, full-time students who complete their degree within the expected time (i.e., 100% of the expected time to completion). The predictor variables are as follows:

  • In-state tuition/fees (TUITIONFEE_IN)
  • Number of undergrad students (UGDS)
  • Admission rate (ADM_RATE)
  • Undergraduate student to instructional faculty ratio (STUFACR)
  • Average SAT score (SAT_AVG)

Row

Data Table

INSTNM TUITIONFEE_IN UGDS ADM_RATE STUFACR SAT_AVG C100_4
1 Alabama A & M University 10024 5098 0.7160 18 954 0.1121
2 University of Alabama at Birmingham 8568 13284 0.8854 19 1266 0.4209
4 University of Alabama in Huntsville 11488 7358 0.7367 19 1300 0.3528
5 Alabama State University 11068 3495 0.9799 13 955 0.1014
6 The University of Alabama 11620 30725 0.7890 19 1244 0.5264
9 Auburn University at Montgomery 8860 3742 0.9680 15 1069 0.1272

Row

Fitting the model

The table below summarizes the components of the fitted model object returned by glmnet, not the variables in the dataset.

Var1 - The name of a component of the fitted model object

Var2 - The attribute being reported for that component (Length, Class, or Mode)

Freq - The value of that attribute

For example: a0 has a length of 100, no special class, and a mode of ‘numeric’.

The lengths reflect the regularization path rather than the number of observations: lambda holds the 100 penalty values that were tried, a0 holds one intercept per lambda value, and beta is a sparse coefficient matrix with 500 entries (5 predictors by 100 lambda values). The remaining components store bookkeeping information, such as the number of observations (nobs) and the original function call (call).

Row

Model

Var1 Var2 Freq
a0 Length 100
beta Length 500
df Length 100
dim Length 2
lambda Length 100
dev.ratio Length 100
nulldev Length 1
npasses Length 1
jerr Length 1
offset Length 1
call Length 4
nobs Length 1
a0 Class -none-
beta Class dgCMatrix
df Class -none-
dim Class -none-
lambda Class -none-
dev.ratio Class -none-
nulldev Class -none-
npasses Class -none-
jerr Class -none-
offset Class -none-
call Class -none-
nobs Class -none-
a0 Mode numeric
beta Mode S4
df Mode numeric
dim Mode numeric
lambda Mode numeric
dev.ratio Mode numeric
nulldev Mode numeric
npasses Mode numeric
jerr Mode numeric
offset Mode logical
call Mode call
nobs Mode numeric

Row

cv_fit

Using the cv.glmnet() function, we can evaluate how well the model generalizes at each lambda value through cross-validation. The output plot displays the cross-validation error as a curve over lambda (on a logarithmic scale), and the lowest point indicates the optimal lambda value. In this case, the optimal lambda value is 0.01, the value that minimizes the cross-validation error. This suggests that a relatively low amount of regularization is being applied in the model.
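
To illustrate the idea of choosing lambda by cross-validation, here is a minimal sketch using a single predictor and no intercept, where the ridge slope has the closed form Sxy / (Sxx + lambda). This is not the glmnet implementation (glmnet standardizes predictors and scales the penalty differently), and the data, fold scheme, and function names are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge slope for one predictor with no intercept: Sxy / (Sxx + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def cv_error(xs, ys, lam, folds=3):
    """Mean squared prediction error over the folds; fold i holds out every folds-th point."""
    total, count = 0.0, 0
    pairs = list(zip(xs, ys))
    for i in range(folds):
        train = [p for j, p in enumerate(pairs) if j % folds != i]
        held_out = [p for j, p in enumerate(pairs) if j % folds == i]
        slope = ridge_slope([x for x, _ in train], [y for _, y in train], lam)
        total += sum((y - slope * x) ** 2 for x, y in held_out)
        count += len(held_out)
    return total / count

# Hypothetical, roughly centered data following y = 2x plus noise.
xs = [-3, -2, -1, 0, 1, 2, 3, -1.5, 1.5]
ys = [-6.2, -3.9, -2.1, 0.3, 1.8, 4.1, 5.9, -2.8, 3.2]

lambdas = [0.001, 0.01, 0.1, 1, 10, 100]
errors = {lam: cv_error(xs, ys, lam) for lam in lambdas}
best_lambda = min(errors, key=errors.get)
print(best_lambda, errors[best_lambda])
```

Larger lambda values shrink the slope toward zero, so when the unpenalized fit is already good, cross-validation tends to select a small lambda, as in the result reported above.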

Row

Lambda

Extracted fitted model

Length Class Mode
a0 100 -none- numeric
beta 500 dgCMatrix S4
df 100 -none- numeric
dim 2 -none- numeric
lambda 100 -none- numeric
dev.ratio 100 -none- numeric
nulldev 1 -none- numeric
npasses 1 -none- numeric
jerr 1 -none- numeric
offset 1 -none- logical
call 4 -none- call
nobs 1 -none- numeric

Row

Making predictions

According to the analysis, the ridge regression model is able to explain approximately 68% of the variability in the completion rates for first-time students at 4-year institutions. This indicates that the model is fairly effective in predicting completion rates using the given predictors.

Row

Model Evaluation

Model Evaluation Metrics
Metric Value
Mean Squared Error (MSE) 0.0130827
R-squared (R^2) 0.6786567
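
The two metrics in the table follow the usual definitions: MSE is the average squared difference between actual and predicted values, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch on made-up completion rates (the data and function names are hypothetical):

```python
def mse(actual, predicted):
    """Mean squared error between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """1 - SS_res / SS_tot, the share of variance explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical actual and predicted completion rates for four schools.
actual = [0.10, 0.40, 0.35, 0.50]
predicted = [0.15, 0.35, 0.30, 0.55]

model_mse = mse(actual, predicted)
model_r2 = r_squared(actual, predicted)
print(model_mse, model_r2)
```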

Introduction

Research Question

How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?

Row

Methodology

I will use a LOESS fit for this analysis, with “Mean earnings of students working and not enrolled 10 years after entry” (MN_EARN_WNE_P10) as the response and “Retention at 4-year Institution” (RET_FT4) as the predictor.

In addition, I will split the data by region and use regions 1-8, which excludes U.S. Service Schools and the outlying areas.

  • 1: New England (CT, ME, MA, NH, RI, VT)
  • 2: Mid East (DE, DC, MD, NJ, NY, PA)
  • 3: Great Lakes (IL, IN, MI, OH, WI)
  • 4: Plains (IA, KS, MN, MO, NE, ND, SD)
  • 5: Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)
  • 6: Southwest (AZ, NM, OK, TX)
  • 7: Rocky Mountains (CO, ID, MT, UT, WY)
  • 8: Far West (AK, CA, HI, NV, OR, WA)

For each region, I will fit LOESS models for span values from 0.25 to 0.75 in steps of 0.05 and calculate the residual standard error (RSE) for each span. I will then choose the three spans with the lowest RSE and graph them.
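
The span search described above can be sketched as follows. This is a simplified stand-in, not the original analysis: it fits a tricube-weighted local line (degree 1) at each point, whereas R's loess defaults to local quadratics, and it reports a plain root-mean-square residual rather than loess's RSE, which uses an equivalent number of parameters. The data and function names are hypothetical.

```python
import math

def tricube(u):
    """Tricube kernel: (1 - |u|^3)^3 inside the window, 0 outside."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0

def local_linear_fit(xs, ys, span):
    """Fitted value at each x from a weighted line through the nearest span * n points."""
    n = len(xs)
    k = max(3, math.ceil(span * n))
    fitted = []
    for x0 in xs:
        bandwidth = sorted(abs(x - x0) for x in xs)[k - 1]
        weights = [tricube((x - x0) / bandwidth) for x in xs]
        sw = sum(weights)
        mx = sum(w * x for w, x in zip(weights, xs)) / sw
        my = sum(w * y for w, y in zip(weights, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def rms_residual(xs, ys, span):
    """Root-mean-square in-sample residual for a given span."""
    fitted = local_linear_fit(xs, ys, span)
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(ys, fitted)) / len(ys))

# Hypothetical retention rates (x) and earnings (y) for 20 schools.
xs = [0.50 + 0.02 * i for i in range(20)]
ys = [30000 + 40000 * x ** 2 + (1500 if i % 3 == 0 else -1000) for i, x in enumerate(xs)]

spans = [round(0.25 + 0.05 * i, 2) for i in range(11)]  # 0.25, 0.30, ..., 0.75
rse_by_span = {span: rms_residual(xs, ys, span) for span in spans}
top_three = sorted(rse_by_span, key=rse_by_span.get)[:3]
print(top_three)
```

Because the error here is measured on the same points used for fitting, smaller spans will usually score better; a held-out or cross-validated error would give a fairer comparison between spans.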

Row

Region 1

INSTNM STABBR REGION RET_FT4 MN_EARN_WNE_P10
539 Albertus Magnus College CT 1 0.6420 56000
542 University of Bridgeport CT 1 0.5516 47100
543 Central Connecticut State University CT 1 0.7232 51600
544 Charter Oak State College CT 1 0.6667 43700
546 Connecticut College CT 1 0.8611 62900
548 University of Connecticut CT 1 0.9164 66000

Region 2

INSTNM STABBR REGION RET_FT4 MN_EARN_WNE_P10
590 Delaware State University DE 2 0.7510 38400
591 University of Delaware DE 2 0.9148 64000
592 Goldey-Beacom College DE 2 0.8214 48800
594 Wilmington University DE 2 0.5866 46600
595 American University DC 2 0.9046 67700
596 The Catholic University of America DC 2 0.8800 60500

Region 3

INSTNM STABBR REGION RET_FT4 MN_EARN_WNE_P10
839 American Academy of Art College IL 3 0.7037 35600
841 School of the Art Institute of Chicago IL 3 0.8266 40800
842 Augustana College IL 3 0.8527 56400
843 Aurora University IL 3 0.7789 46200
847 Blackburn College IL 3 0.6724 41700
849 Bradley University IL 3 0.8651 58300

Region 4

INSTNM STABBR REGION RET_FT4 MN_EARN_WNE_P10
473 Walden University MN 4 0.2500 62500
1061 Briar Cliff University IA 4 0.6337 44400
1062 Buena Vista University IA 4 0.7301 42600
1065 Central College IA 4 0.7901 48400
1066 Clarke University IA 4 0.6507 44800
1067 Coe College IA 4 0.7391 50000

Region 5

INSTNM STABBR REGION RET_FT4 MN_EARN_WNE_P10
1 Alabama A & M University AL 5 0.5797 35500
2 University of Alabama at Birmingham AL 5 0.8392 48400
4 University of Alabama in Huntsville AL 5 0.7899 52000
5 Alabama State University AL 5 0.6436 30600
6 The University of Alabama AL 5 0.8855 51600
9 Auburn University at Montgomery AL 5 0.6303 38000

Region 6

INSTNM STABBR REGION RET_FT4 MN_EARN_WNE_P10
69 Brookline College-Phoenix AZ 6 0.4667 23700
70 Arizona State University Campus Immersion AZ 6 0.8600 55500
72 University of Arizona AZ 6 0.8403 56000
79 Embry-Riddle Aeronautical University-Prescott AZ 6 0.7754 70300
82 Grand Canyon University AZ 6 0.7059 58500
87 Dine College AZ 6 0.0909 23400

Region 7

INSTNM STABBR REGION RET_FT4 MN_EARN_WNE_P10
488 Adams State University CO 7 0.5924 38100
490 Arapahoe Community College CO 7 0.5225 40700
492 University of Colorado Denver/Anschutz Medical Campus CO 7 0.7480 71200
493 University of Colorado Colorado Springs CO 7 0.6664 49300
495 University of Colorado Boulder CO 7 0.8738 59700
496 Colorado Christian University CO 7 0.8229 44600

Region 8

INSTNM STABBR REGION RET_FT4 MN_EARN_WNE_P10
55 University of Alaska Anchorage AK 8 0.6869 51200
57 University of Alaska Fairbanks AK 8 0.6793 45100
58 University of Alaska Southeast AK 8 0.6471 39500
59 Alaska Pacific University AK 8 0.6842 45300
164 Academy of Art University CA 8 0.6899 47300
176 Art Center College of Design CA 8 0.8504 62300

Region 1

Region 1: New England

How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?

Row

Residual Standard Error (RSE) Values for LOESS Regression

span RSE_loess
0.25 13517.94
0.30 13640.93
0.35 13737.59
0.40 13844.75
0.45 13926.70
0.50 14027.89
0.55 14117.45
0.60 14190.65
0.65 14341.78
0.70 14491.43
0.75 14658.62

Row

Span 0.25

Span 0.30

Span 0.35

Region 2

Region 2: Mid East

How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?

Row

Residual Standard Error (RSE) Values for LOESS Regression

span RSE_loess
0.25 13371.78
0.30 13433.87
0.35 13472.65
0.40 13507.91
0.45 13545.81
0.50 13575.71
0.55 13609.19
0.60 13626.63
0.65 13634.73
0.70 13643.16
0.75 13661.11

Row

Span 0.25

Span 0.30

Span 0.35

Region 3

Region 3: Great Lakes

How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?

Row

Residual Standard Error (RSE) Values for LOESS Regression

span RSE_loess
0.25 9300.800
0.30 9308.149
0.35 9307.209
0.40 9310.907
0.45 9314.890
0.50 9320.280
0.55 9325.684
0.60 9333.756
0.65 9341.502
0.70 9356.009
0.75 9374.692

Row

Span 0.25

Span 0.30

Span 0.35

Region 4

Region 4: Plains

How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?

Row

Residual Standard Error (RSE) Values for LOESS Regression

span RSE_loess
0.25 9002.696
0.30 9059.466
0.35 9100.983
0.40 9133.296
0.45 9160.274
0.50 9187.892
0.55 9215.928
0.60 9237.815
0.65 9260.071
0.70 9276.991
0.75 9299.451

Row

Span 0.25

Span 0.30

Span 0.35

Region 5

Region 5: Southeast

How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?

Row

Residual Standard Error (RSE) Values for LOESS Regression

span RSE_loess
0.25 8891.443
0.30 8924.220
0.35 8917.468
0.40 8940.496
0.45 8956.736
0.50 8971.868
0.55 8984.336
0.60 8994.147
0.65 9005.815
0.70 9013.932
0.75 9023.108

Row

Span 0.25

Span 0.30

Span 0.35

Region 6

Region 6: Southwest

How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?

Row

Residual Standard Error (RSE) Values for LOESS Regression

span RSE_loess
0.25 9799.169
0.30 9823.838
0.35 9807.848
0.40 9867.717
0.45 9915.619
0.50 9955.095
0.55 9984.825
0.60 10004.162
0.65 9989.940
0.70 9993.568
0.75 10004.536

Row

Span 0.25

Span 0.30

Span 0.35

Region 7

Region 7: Rocky Mountains

How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?

Row

Residual Standard Error (RSE) Values for LOESS Regression

span RSE_loess
0.25 10114.56
0.30 10254.46
0.35 10279.87
0.40 10330.13
0.45 10446.56
0.50 10646.09
0.55 10811.20
0.60 10895.39
0.65 10928.38
0.70 11021.68
0.75 11064.21

Row

Span 0.25

Span 0.30

Span 0.35

Region 8

Region 8: Far West

How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?

Row

Residual Standard Error (RSE) Values for LOESS Regression

span RSE_loess
0.25 12636.05
0.30 12787.71
0.35 12915.42
0.40 13001.78
0.45 13037.83
0.50 13068.02
0.55 13093.43
0.60 13117.40
0.65 13135.49
0.70 13158.21
0.75 13177.23

Row

Span 0.25

Span 0.30

Span 0.35

Conclusion

Research Question

How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?

Row

Conclusion

In general, the LOESS regression analysis indicates that most regions have a positive exponential growth with a positive y-intercept. This suggests that as retention rates increase in a school, the median earnings after graduation are expected to be higher. The only region that did not have a clear exponential curve was region 7, which could be due to the significantly lower number of schools in this region. Nonetheless, there is still a general positive trend between retention and median earnings after graduation.

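For all regions, the span with the lowest RSE was 0.25, and 0.30 and 0.35 were the next two lowest (in regions 3, 5, and 6, the RSE for 0.35 was slightly lower than for 0.30). By this measure, the LOESS regression lines with a span of 0.25 fit the data most closely. Note that a smaller span uses fewer neighboring points for each local fit, so it follows the data more closely and will generally produce a lower in-sample RSE; the larger spans of 0.30 and 0.35 give smoother lines and are less prone to overfitting, not more.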

Row

Region 1

Region 2

Region 3

Region 4

Region 5

Region 6

Region 7

Region 8

kNN Classification

Research Question

Can we accurately classify schools as public, private nonprofit, or private for-profit based on various features in the five states with the most schools?

Row

Which states have the most colleges?

Based on the bar graph, California has the most schools, with around 700. New York and Texas have the next most schools, with 450 and 415 schools, respectively. I will use the data from the top five states for my kNN classification, which will be California, New York, Texas, Florida, and Pennsylvania.

Row

Number of Schools by State

Row

Variables for my kNN model

I will be using the following variables for my kNN model:

  • Number of undergraduate students (UGDS)
  • Undergraduate student to instructional faculty ratio (STUFACR)
  • Average faculty salary per month (AVGFACSAL)
  • Retention (RET_FT4)
  • Number of Branches (NUMBRANCH)
  • Percent of all undergraduate students receiving a federal student loan (PCTFLOAN)

I will be using “Control of Institution” as my response variable. This variable indicates whether an institution is public (1), private non-profit (2), or private for-profit (3). For information about the other variables, please refer to the data dictionary located in the Project Overview tab.

Row

Head of Dataset

INSTNM STABBR UGDS RET_FT4 STUFACR AVGFACSAL PCTFLOAN NUMBRANCH CONTROL
164 Academy of Art University CA 5294 0.6899 14 9482 0.4938 1 3
176 Art Center College of Design CA 2026 0.8504 8 7942 0.4686 1 2
179 Azusa Pacific University CA 3937 0.7928 10 8957 0.4772 1 2
183 Bethesda University CA 264 0.4706 10 2482 0.2035 1 2
184 Biola University CA 3592 0.8780 13 9807 0.4786 1 2
190 California Baptist University CA 8234 0.7866 19 10151 0.6467 1 2

Row

Scatter Plots of predictors against response variable

The plots below compare each predictor with the response variable. While there is significant overlap in the graphs, most private for-profit colleges tend to have fewer undergraduate students (less than 10,000) than public and private non-profit colleges. Moreover, faculty members at public schools receive an average salary that exceeds $5,000 per month, whereas faculty members at private schools can earn less than $5,000 per month. Private for-profit faculty members earn between $2,500 and $10,000 per month.

Row

Undergraduates

Student to Faculty Ratio

Faculty Salary

Federal Student Loan

Retention

Branches

Row

Splitting data into training and testing

I split the dataset into 70% training and 30% testing sets. Then, I plotted the training data and the testing points (represented by diamonds) on separate graphs. Overall, the graphs show that the two sets exhibit similar patterns, which suggests that the testing set is representative of the training data.
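
A 70/30 split of this kind can be sketched as below. The function name, the fixed seed, and the placeholder rows are assumptions for illustration; the original split was not necessarily produced this way.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # placeholder for 100 school records
train, test = train_test_split(rows)
print(len(train), len(test))
```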

Row

RET_FT4 vs UGDS

STUFACR vs AVGFACSAL

PCTFLOAN vs NUMBRANCH

Row

Running kNN for several k values

The graph below displays the accuracy of the model for different k-values. According to the graph, a k-value of 5 resulted in the highest accuracy, about 80%. As the k-value increases beyond that, the accuracy of the model decreases. To further analyze the performance of the kNN model, I will produce a confusion matrix for the k-value of 5.
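
The procedure of running kNN for several k values and comparing accuracy can be sketched as follows. The toy data, the labels 1 to 3 (mirroring the CONTROL codes), and the function names are hypothetical; real features such as UGDS and AVGFACSAL would need to be scaled first so that no single variable dominates the distance.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to the query (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k):
    """Share of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train, features, k) == label for features, label in test)
    return hits / len(test)

# Hypothetical scaled features with three well-separated classes.
train = [
    ((0.10, 0.20), 1), ((0.20, 0.10), 1), ((0.15, 0.25), 1),
    ((0.80, 0.90), 2), ((0.90, 0.80), 2), ((0.85, 0.95), 2),
    ((0.10, 0.90), 3), ((0.20, 0.80), 3), ((0.15, 0.85), 3),
]
test = [((0.12, 0.18), 1), ((0.88, 0.87), 2), ((0.18, 0.88), 3)]

accuracy_by_k = {k: accuracy(train, test, k) for k in (1, 3, 5)}
print(accuracy_by_k)
```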

Row

kNN Accuracy by k Value

Row

Confusion matrix for k = 5

This is the confusion matrix for k=5, the most accurate kNN model, with an accuracy of approximately 80% (149 of 186 test schools classified correctly). Rows are predicted classes and columns are actual classes. Of the 44 schools predicted to be Public, 35 are actually public; of the 129 predicted to be Private Nonprofit, 104 are correct; and of the 13 predicted to be Private For-Profit, 10 are correct.

However, 7 Private Nonprofit colleges were incorrectly classified as Public, and 13 Public colleges were incorrectly classified as Private Nonprofit. These misclassifications suggest some overlap in the features of these two classes. The classifier struggled most with Private For-Profit colleges: of the 24 in the test set, only 10 were classified correctly, while 12 were classified as Private Nonprofit and 2 as Public.

It is essential to note that the smaller number of Private For-Profit colleges may have contributed to the lower accuracy for this class.

Row

Confusion Matrix

Confusion Matrix for kNN classification
Public Private Nonprofit Private For-Profit
Predicted Public 35 7 2
Predicted Nonprofit 13 104 12
Predicted For-Profit 0 3 10
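
The overall accuracy and per-class rates can be recomputed directly from the table. The sketch below assumes, as the row labels indicate, that rows are predicted classes and columns are actual classes; the variable names are illustrative.

```python
labels = ["Public", "Private Nonprofit", "Private For-Profit"]

# Counts copied from the confusion matrix: rows are predicted, columns are actual.
matrix = [
    [35, 7, 2],
    [13, 104, 12],
    [0, 3, 10],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(3))
accuracy = correct / total

# Precision: correct share of each predicted class (row-wise).
precision = {labels[i]: matrix[i][i] / sum(matrix[i]) for i in range(3)}
# Recall: correct share of each actual class (column-wise).
recall = {labels[j]: matrix[j][j] / sum(matrix[i][j] for i in range(3)) for j in range(3)}

print(round(accuracy, 3))
print(precision)
print(recall)
```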

Naive Bayes Classification

Research Question

Based on the following features, can we accurately predict the region where a school is located?

Row

Geographic Distribution of Schools by Region

The plot shows the geographic distribution of schools across the 10 regions defined by the College Scorecard dataset. The regions are as follows:

  • 0: U.S. Service Schools
  • 1: New England (CT, ME, MA, NH, RI, VT)
  • 2: Mid East (DE, DC, MD, NJ, NY, PA)
  • 3: Great Lakes (IL, IN, MI, OH, WI)
  • 4: Plains (IA, KS, MN, MO, NE, ND, SD)
  • 5: Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)
  • 6: Southwest (AZ, NM, OK, TX)
  • 7: Rocky Mountains (CO, ID, MT, UT, WY)
  • 8: Far West (AK, CA, HI, NV, OR, WA)
  • 9: Outlying Areas (AS, FM, GU, MH, MP, PR, PW, VI)

Row

Geographic Distribution of Schools

Row

Variables in model

I will be using the following variables as predictors in my Naive Bayes:

  • Admission rate (ADM_RATE)
  • Average SAT score (SAT_AVG)
  • Open admission (OPENADMP)
  • Graduate population (GRADS)
  • On-time Completion in a 4 year institute (C100_4)
  • Undergraduate population (UGDS)
  • State (STABBR)

I will be using ‘Region’ (REGION) as my response variable. There are 10 regions, but I will only be using 8 of them. Specifically, I will be excluding U.S. Service Schools (0) and Outlying Areas (9). The remaining regions that I will be using are as follows:

  • 1: New England (CT, ME, MA, NH, RI, VT)
  • 2: Mid East (DE, DC, MD, NJ, NY, PA)
  • 3: Great Lakes (IL, IN, MI, OH, WI)
  • 4: Plains (IA, KS, MN, MO, NE, ND, SD)
  • 5: Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)
  • 6: Southwest (AZ, NM, OK, TX)
  • 7: Rocky Mountains (CO, ID, MT, UT, WY)
  • 8: Far West (AK, CA, HI, NV, OR, WA)
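
The region list above can be written as a lookup table, which also makes the filtering step explicit. This is a minimal Python sketch, not the code behind the dashboard: the record layout is an assumption, the first record matches a school in the dashboard's data, and the other two are hypothetical.

```python
# Region codes and member states, copied from the list above.
REGION_STATES = {
    1: ["CT", "ME", "MA", "NH", "RI", "VT"],
    2: ["DE", "DC", "MD", "NJ", "NY", "PA"],
    3: ["IL", "IN", "MI", "OH", "WI"],
    4: ["IA", "KS", "MN", "MO", "NE", "ND", "SD"],
    5: ["AL", "AR", "FL", "GA", "KY", "LA", "MS", "NC", "SC", "TN", "VA", "WV"],
    6: ["AZ", "NM", "OK", "TX"],
    7: ["CO", "ID", "MT", "UT", "WY"],
    8: ["AK", "CA", "HI", "NV", "OR", "WA"],
}

# Invert the table: each state maps to exactly one region.
STATE_TO_REGION = {
    state: region
    for region, states in REGION_STATES.items()
    for state in states
}

# The first record matches the dashboard's data; the other two are hypothetical.
records = [
    {"INSTNM": "Alabama A & M University", "STABBR": "AL", "REGION": 5},
    {"INSTNM": "Hypothetical Service Academy", "STABBR": "MD", "REGION": 0},
    {"INSTNM": "Hypothetical Outlying College", "STABBR": "PR", "REGION": 9},
]

# Keep only schools in regions 1 through 8 (drops codes 0 and 9).
kept = [r for r in records if r["REGION"] in REGION_STATES]

print(len(STATE_TO_REGION))
print([r["INSTNM"] for r in kept])
```

Filtering on the REGION code rather than on the state is deliberate: a U.S. Service School (region 0) can sit in a state that otherwise belongs to one of the eight regions.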

Head of data table

INSTNM STABBR ADM_RATE SAT_AVG OPENADMP GRADS C100_4 UGDS REGION
1 Alabama A & M University AL 0.7160 954 2 862 0.1121 5098 5
2 University of Alabama at Birmingham AL 0.8854 1266 2 8742 0.4209 13284 5
4 University of Alabama in Huntsville AL 0.7367 1300 2 2067 0.3528 7358 5
5 Alabama State University AL 0.9799 955 2 465 0.1014 3495 5
6 The University of Alabama AL 0.7890 1244 2 6631 0.5264 30725 5
9 Auburn University at Montgomery AL 0.9680 1069 2 982 0.1272 3742 5

Naive Bayes

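I split the dataset into 70% training and 30% testing data for the Naive Bayes model. The model was about 95% accurate (251 of the 264 schools in the confusion matrix were classified correctly). Correct predictions include 15 instances of region 1, 39 of region 2, 48 of region 3, 35 of region 4, 70 of region 5, 20 of region 6, 8 of region 7, and 16 of region 8. Reading rows as predicted regions and columns as actual regions, the misclassifications include 4 instances of region 2 predicted as region 1, 1 instance of region 3 predicted as region 5, and 1 instance of region 8 predicted as region 7. Because each state belongs to exactly one region, the State (STABBR) predictor by itself identifies the region, so the high accuracy likely reflects that predictor more than the admissions and enrollment variables. Refitting the model without STABBR would be a more informative test of the remaining features.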
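
The 70/30 split described above can be sketched as follows. This is an illustrative Python version under stated assumptions: the function name, the seed, and the row count of 880 are hypothetical (880 is chosen only so that the 30% share equals the 264 schools in the confusion matrix; the actual row count is not stated).

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into training and testing lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical row identifiers standing in for the schools in the dataset.
rows = list(range(880))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before cutting matters here because the data table is ordered by state, so an unshuffled split would put whole regions in only one of the two sets.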

Confusion matrix

Confusion Matrix for Naive Bayes
1 2 3 4 5 6 7 8
Predicted 1 15 4 1 0 2 1 0 0
Predicted 2 0 39 0 1 0 1 0 0
Predicted 3 0 0 48 0 0 0 0 0
Predicted 4 0 0 0 35 0 0 0 0
Predicted 5 0 0 1 0 70 0 0 1
Predicted 6 0 0 0 0 0 20 0 0
Predicted 7 0 0 0 0 0 0 8 1
Predicted 8 0 0 0 0 0 0 0 16
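
To make the matrix easier to read, the off-diagonal cells can be listed as (actual region, predicted region, count) triples. The Python sketch below is for illustration only and assumes that rows are predicted regions and columns are actual regions, as the row labels indicate.

```python
# Confusion matrix for Naive Bayes, copied from the table above.
# Assumption: rows are predicted regions, columns are actual regions.
regions = [1, 2, 3, 4, 5, 6, 7, 8]
matrix = [
    [15, 4, 1, 0, 2, 1, 0, 0],   # predicted 1
    [0, 39, 0, 1, 0, 1, 0, 0],   # predicted 2
    [0, 0, 48, 0, 0, 0, 0, 0],   # predicted 3
    [0, 0, 0, 35, 0, 0, 0, 0],   # predicted 4
    [0, 0, 1, 0, 70, 0, 0, 1],   # predicted 5
    [0, 0, 0, 0, 0, 20, 0, 0],   # predicted 6
    [0, 0, 0, 0, 0, 0, 8, 1],    # predicted 7
    [0, 0, 0, 0, 0, 0, 0, 16],   # predicted 8
]

n = len(regions)
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(n))

# Every nonzero off-diagonal cell is a misclassification.
errors = [
    (regions[j], regions[i], matrix[i][j])  # (actual, predicted, count)
    for i in range(n)
    for j in range(n)
    if i != j and matrix[i][j] > 0
]

print(f"accuracy = {correct / total:.3f}")
for actual, predicted, count in errors:
    print(f"actual {actual} -> predicted {predicted}: {count}")
```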

Classification using Logistic Regression

Research Question

How accurately can logistic regression predict whether colleges in the United States offer a Bachelor’s degree in engineering?

What Percentage of Colleges Offer a Bachelor’s Degree in Engineering?

The following pie chart displays the percentage of colleges in the United States that offer a Bachelor’s degree in engineering. The chart shows that nearly equal numbers of colleges do and do not offer this degree. Based on this observation, I am interested in whether logistic regression can accurately predict if a college offers this type of degree.

Distribution of engineering degrees

Variables for the model

I will be using the following variables for my regression:

  • Admission Rate (ADM_RATE)
  • Average SAT Scores (SAT_AVG)
  • Total share of enrollment of undergraduate students who are white (UGDS_WHITE)
  • Total share of enrollment of undergraduate students who are Asian (UGDS_ASIAN)
  • Total share of enrollment of undergraduate students who are Black (UGDS_BLACK)
  • Total share of enrollment of undergraduate students who are Hispanic (UGDS_HISP)
  • Share of full-time faculty who are women (IRPS_WOMEN)
  • Bachelor’s degree in Biological And Biomedical Sciences (CIP26BACHL)

I will be using the “Bachelor’s degree in Engineering” variable (CIP14BACHL) as my response. This variable indicates whether a college offers a Bachelor’s degree in Engineering, with 0 indicating no, 1 indicating yes, and 2 indicating that the program is offered through an exclusively distance-education program. I will remove any schools with a value of 2 from the data set, so that the task is simply to classify whether a school offers the program or not.

The head of the data table that will be used for the regression is displayed below.

Head of data table

INSTNM ADM_RATE SAT_AVG CIP26BACHL UGDS_WHITE UGDS_ASIAN UGDS_BLACK UGDS_HISP IRPS_WOMEN CIP14BACHL
1 Alabama A & M University 0.7160 954 1 0.0184 0.0014 0.8978 0.0114 0.5024 1
2 University of Alabama at Birmingham 0.8854 1266 1 0.5297 0.0767 0.2458 0.0669 0.4433 1
4 University of Alabama in Huntsville 0.7367 1300 1 0.7196 0.0357 0.0871 0.0610 0.4644 1
5 Alabama State University 0.9799 955 1 0.0152 0.0020 0.9259 0.0129 0.4796 1
6 The University of Alabama 0.7890 1244 1 0.7676 0.0137 0.1050 0.0549 0.4585 1
9 Auburn University at Montgomery 0.9680 1069 1 0.4129 0.0283 0.4487 0.0131 0.5022 0

Finding the right model

I used stepwise variable selection to find the best model. The selected model is CIP14BACHL ~ SAT_AVG + CIP26BACHL + UGDS_ASIAN + IRPS_WOMEN.

The coefficients of the selected model are displayed below.

Model building

Coefficients
Term Estimate
(Intercept) -2.5994839
SAT_AVG 0.0025489
CIP26BACHL1 3.1441358
UGDS_ASIAN 6.0400929
IRPS_WOMEN -7.6897723
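
To show how these coefficients turn into a prediction, the sketch below applies the logistic function to one school's values. It is an illustrative Python example under stated assumptions: it treats the model as a standard binomial logit for the probability that CIP14BACHL equals 1 and treats CIP26BACHL1 as a 0/1 indicator. The function name is hypothetical, and the input values are those listed for Alabama A & M University in the data table.

```python
import math

# Coefficients copied from the table above.
COEFFICIENTS = {
    "(Intercept)": -2.5994839,
    "SAT_AVG": 0.0025489,
    "CIP26BACHL1": 3.1441358,
    "UGDS_ASIAN": 6.0400929,
    "IRPS_WOMEN": -7.6897723,
}


def predict_probability(sat_avg, offers_biology, ugds_asian, irps_women):
    """Return the modeled probability that a school offers an engineering Bachelor's.

    Assumption: offers_biology is 1 if CIP26BACHL equals 1, otherwise 0.
    """
    log_odds = (
        COEFFICIENTS["(Intercept)"]
        + COEFFICIENTS["SAT_AVG"] * sat_avg
        + COEFFICIENTS["CIP26BACHL1"] * offers_biology
        + COEFFICIENTS["UGDS_ASIAN"] * ugds_asian
        + COEFFICIENTS["IRPS_WOMEN"] * irps_women
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# Values for Alabama A & M University from the data table.
p = predict_probability(sat_avg=954, offers_biology=1, ugds_asian=0.0014, irps_women=0.5024)

# exp(coefficient) is the multiplicative change in the odds for a one-unit increase.
odds_ratio_biology = math.exp(COEFFICIENTS["CIP26BACHL1"])

print(f"predicted probability = {p:.3f}")
print(f"odds ratio for offering biology = {odds_ratio_biology:.1f}")
```

A school would be classified as “Yes” when the probability exceeds a chosen cutoff. A cutoff of 0.5 is a common default, but the cutoff used for the dashboard is not stated.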

Plotting the predictors

The following plots show each numerical predictor against the response variable, categorized by the categorical predictor: whether the college offers a Bachelor’s degree in Biological and Biomedical Sciences.

SAT_AVG

UGDS_ASIAN

IRPS_WOMEN

Making predictions using testing and training data

We have evaluated the performance of our model on both training and testing data. On the training data, our model has an accuracy of 68% and a misclassification rate of approximately 31.79%. The confusion matrix reveals that the model made 285 correct predictions of “Yes” and 202 correct predictions of “No.” However, the model also misclassified 125 instances as “Yes” and 102 instances as “No.”

On the testing data, our model’s accuracy is slightly lower at 63%, with a misclassification rate of approximately 36.93%. The confusion matrix shows that there were 65 instances incorrectly predicted as “Yes” and 48 instances incorrectly predicted as “No.” Overall, the model’s performance on the testing data is not as good as the training data, with a higher rate of misclassification.

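To further assess the model’s effectiveness, we performed 10-fold cross-validation. The average misclassification rate reported across the 10 folds was approximately 20%. This figure is well below the 32% and 37% misclassification rates observed on the training and testing data, so it should be interpreted with caution: it may have been computed with a different error measure (for example, a squared-error cost rather than a 0/1 misclassification cost). Taken together, the results suggest that the logistic regression model has modest predictive power on new data, with about 63% accuracy on the testing set for a response that is split nearly evenly between the two classes.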
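
For reference, 10-fold cross-validation scored with a 0/1 cost (so that the result is directly comparable to a misclassification rate) can be sketched as below. This is a generic Python illustration, not the procedure used for the dashboard: all function names are hypothetical, and a majority-class placeholder stands in for the logistic regression so the example stays self-contained.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cv_misclassification(fit, predict, X, y, k=10, seed=0):
    """Average 0/1 error over k folds; each fold is held out once."""
    rates = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_X = [x for i, x in enumerate(X) if i not in held]
        train_y = [t for i, t in enumerate(y) if i not in held]
        model = fit(train_X, train_y)
        wrong = sum(predict(model, X[i]) != y[i] for i in held_out)
        rates.append(wrong / len(held_out))
    return sum(rates) / k


# Placeholder classifier (hypothetical): always predicts the majority training class.
def fit_majority(train_X, train_y):
    return max(set(train_y), key=train_y.count)


def predict_majority(model, x):
    return model


# Synthetic data: 60 positive and 40 negative labels.
X = [[i] for i in range(100)]
y = [1] * 60 + [0] * 40
rate = cv_misclassification(fit_majority, predict_majority, X, y)
print(f"cross-validated misclassification rate = {rate:.2f}")
```

Swapping in a real model only requires a fit function that returns a fitted object and a predict function that returns a class label for one observation.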

Confusion matrix for training data

Yes No
Predicted Yes 285 125
Predicted No 102 202

Confusion matrix for testing data

Yes No
Predicted Yes 110 65
Predicted No 48 83
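
The accuracy and misclassification rates quoted for the training and testing data follow directly from these two matrices. The Python sketch below recomputes them and adds sensitivity and specificity. It assumes that rows are predictions and columns are actual values, as the row labels indicate, and the function name is hypothetical.

```python
def summarize(tp, fp, fn, tn):
    """Summary rates for a 2x2 confusion matrix with 'Yes' as the positive class."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),  # share of actual Yes predicted as Yes
        "specificity": tn / (tn + fp),  # share of actual No predicted as No
    }


# Counts copied from the matrices above.
training = summarize(tp=285, fp=125, fn=102, tn=202)
testing = summarize(tp=110, fp=65, fn=48, tn=83)

for name, stats in (("training", training), ("testing", testing)):
    print(name, {key: round(value, 4) for key, value in stats.items()})
```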