Our dashboard uses data from the US Department of Education’s College Scorecard to provide a comprehensive view of higher education institutions across the United States. It presents key metrics such as graduation rates, median earnings, student demographics, and institutional characteristics in an integrated, user-friendly interface.
By consolidating this diverse set of data points, our dashboard enables users to uncover trends, patterns, and correlations. This information can be used for strategic planning, policy development, and program evaluation. In short, our dashboard offers a valuable resource for anyone interested in gaining insights into the higher education landscape in the United States.
The data we are using is from the US Department of Education’s College Scorecard. We are using the institution-level data set, which includes information for each institution from the 1996-97 through 2021-22 academic years. It covers institutional characteristics, enrollment, student aid, costs, and student outcomes. The data comes from federal reporting by institutions, federal financial aid records, and tax information, with most of it drawn from the Integrated Postsecondary Education Data System (IPEDS). The data is categorized under data area 4, Education.
The full dataset can be accessed through the following link.
More information about the data can be found here.
The work was split up in the following ways:
Divya
Grace
| Name | Variable Name | Data Type | Description |
|---|---|---|---|
| Institute Name | INSTNM | Categorical | Institute name |
| City | CITY | Categorical | The city in which the institute is located |
| Region | REGION | Categorical | 0: U.S. Service Schools; 1: New England; 2: Mid East; 3: Great Lakes; 4: Plains; 5: Southeast; 6: Southwest; 7: Rocky Mountains; 8: Far West; 9: Outlying Areas |
| Median earnings of students working and not enrolled 10 years after entry | MN_EARN_WNE_P10 | Numerical | NA |
| Control of institution | CONTROL | Categorical | 1: Public 2: Private nonprofit 3: Private for-profit |
| Bachelor’s degree in Engineering | CIP14BACHL | Categorical | 0: Program not offered 1: Program offered 2: Program offered through an exclusively distance-education program |
| Undergraduate student to instructional faculty ratio | STUFACR | Numerical | NA |
| Average faculty salary | AVGFACSAL | Numerical | Average faculty salary per month |
| Retention | RET_FT4 | Numerical | First-time, full-time student retention rate at four-year institutions |
| Number of Branches | NUMBRANCH | Numerical | NA |
| Percent of all undergraduate students receiving a federal student loan | PCTFLOAN | Numerical | NA |
| Admission rate | ADM_RATE | Numerical | NA |
| Average SAT score | SAT_AVG | Numerical | NA |
| Open admission | OPENADMP | Categorical | 1: Yes 2: No 3: Does not enroll first-time students |
| Graduate population | GRADS | Numerical | NA |
| On-time Completion in a 4 year institute | C100_4 | Numerical | Completion rate for first-time, full-time students at four-year institutions (100% of expected time to completion) |
| Number of undergraduate students | UGDS | Numerical | NA |
| Total share of enrollment of undergraduate students who are white | UGDS_WHITE | Numerical | NA |
| Total share of enrollment of undergraduate students who are Asian | UGDS_ASIAN | Numerical | NA |
| Total share of enrollment of undergraduate students who are Black | UGDS_BLACK | Numerical | NA |
| Total share of enrollment of undergraduate students who are Hispanic | UGDS_HISP | Numerical | NA |
| Share of full time faculty that are women | IRPS_WOMEN | Numerical | NA |
| Bachelor’s degree in Biological And Biomedical Sciences | CIP26BACHL | Categorical | 0: Program not offered 1: Program offered 2: Program offered through an exclusively distance-education program |
What combination of predictors best explains the variation in retention rate of first-time, full-time students at 4-year institutions?
I will be using multiple linear regression to find the best combination of predictors and test its accuracy. I will use “Retention” (RET_FT4) as my response variable and the following predictor variables:
| | INSTNM | RET_FT4 | ADM_RATE | SAT_AVG | C100_4 | UGDS | STUFACR |
|---|---|---|---|---|---|---|---|
| 1 | Alabama A & M University | 0.5797 | 0.7160 | 954 | 0.1121 | 5098 | 18 |
| 2 | University of Alabama at Birmingham | 0.8392 | 0.8854 | 1266 | 0.4209 | 13284 | 19 |
| 4 | University of Alabama in Huntsville | 0.7899 | 0.7367 | 1300 | 0.3528 | 7358 | 19 |
| 5 | Alabama State University | 0.6436 | 0.9799 | 955 | 0.1014 | 3495 | 13 |
| 6 | The University of Alabama | 0.8855 | 0.7890 | 1244 | 0.5264 | 30725 | 19 |
| 9 | Auburn University at Montgomery | 0.6303 | 0.9680 | 1069 | 0.1272 | 3742 | 15 |
What combination of predictors best explains the variation in retention rate of first-time, full-time students at 4-year institutions?
After performing stepwise regression analysis, SAT average, on-time completion rate, and undergraduate population were found to have a statistically significant association with retention rate. Admission rate and student-faculty ratio were less significant and were removed. The best model is RET_FT4 ~ SAT_AVG + C100_4 + UGDS.
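A minimal sketch of how this selection could be carried out in R is shown below; the data frame name `scorecard` is a placeholder, and `step()` is one of several equivalent ways to run AIC-based stepwise selection.

```r
# Fit the full model with all candidate predictors, then prune it with
# AIC-based stepwise selection. `scorecard` is a placeholder data frame name.
full_model <- lm(RET_FT4 ~ ADM_RATE + SAT_AVG + C100_4 + UGDS + STUFACR,
                 data = scorecard)

best_model <- step(full_model, direction = "both", trace = FALSE)

summary(best_model)  # expected to retain SAT_AVG, C100_4, and UGDS
```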
Based on the plot, although many of the data points fall along the QQ line, the tails deviate from it and exhibit non-normal behavior. This suggests that the normality assumption is violated, which is further supported by the Shapiro-Wilk test: its small p-value provides evidence against the null hypothesis of normality.
| | Test | Statistic | P_Value |
|---|---|---|---|
| W | Shapiro-Wilk Test | 0.9306323 | 0 |
According to the Durbin-Watson test results, the D-W statistic is less than 2, indicating a positive autocorrelation. Furthermore, the p-value is less than 0.05, providing evidence to reject the null hypothesis, and thereby violating the independence assumption.
| Test | Statistic | P_Value |
|---|---|---|
| Durbin-Watson Test | 1.822582 | 0.004 |
A Levene test was not conducted for this linear regression. Based on the results of the Breusch-Pagan test, we can conclude that the constant variance assumption is violated, since the p-value is less than 0.05.
| | Test | LM_Statistic | P_Value |
|---|---|---|---|
| BP | Breusch-Pagan Test | 59.29666 | 0 |
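For reference, a sketch of the three assumption checks, assuming the fitted model object is called `best_model`; the `car` and `lmtest` packages are one common choice for these tests, not necessarily the ones used originally.

```r
library(car)     # durbinWatsonTest()
library(lmtest)  # bptest()

# Normality of residuals (shapiro.test() accepts 3 to 5000 observations)
shapiro.test(residuals(best_model))

# Independence of residuals (Durbin-Watson)
durbinWatsonTest(best_model)

# Constant variance (Breusch-Pagan)
bptest(best_model)
```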
What combination of predictors best explains the variation in retention rate of first-time, full-time students at 4-year institutions?
The scatterplot matrix below visually shows the relationships between the key variables in our analysis: RET_FT4, SAT_AVG, C100_4, and UGDS.
Each cell in the upper triangle of the matrix presents a smoothed scatterplot between two variables. Moreover, the lower triangle of the matrix displays correlation coefficients. Positive correlations (values close to 1) suggest that if one variable increases, the other tends to increase as well. On the other hand, negative correlations (values close to -1) indicate an inverse relationship.
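One way such a matrix could be produced is sketched below, assuming the same placeholder data frame `scorecard`; `GGally::ggpairs()` is an assumption here, and base `pairs()` would work as well.

```r
library(GGally)

# Upper triangle: smoothed scatterplots; lower triangle: correlation coefficients
ggpairs(scorecard[, c("RET_FT4", "SAT_AVG", "C100_4", "UGDS")],
        upper = list(continuous = "smooth"),
        lower = list(continuous = "cor"))
```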
The second plot, Standardized Residuals vs. Fitted Values, displays a cone shape, indicating heteroscedasticity in the model.
According to the plot, there are three substantial outliers: observations 458, 1857, and 5979. They were not removed from the dataset. Additionally, based on the table, we can identify six influential points that should be taken into account.
| | Student_Residuals | Unadjusted_P_Value |
|---|---|---|
| 158 | -6.738894 | 0.00e+00 |
| 4680 | 6.019783 | 0.00e+00 |
| 1857 | 5.939086 | 0.00e+00 |
| 3024 | -4.964702 | 8.00e-07 |
| 936 | 4.934349 | 9.00e-07 |
| 824 | -4.174032 | 3.25e-05 |
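The studentized residuals and unadjusted p-values above are the kind of output an outlier test produces; a sketch using the `car` package (an assumption) is shown below.

```r
library(car)

# Bonferroni outlier test on studentized residuals (reports the largest ones)
outlierTest(best_model, n.max = 6)

# Influence diagnostics: Cook's distance and hat values flag influential points
influencePlot(best_model)
```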
What combination of predictors best explains the variation in retention rate of first-time, full-time students at 4-year institutions?
According to the low p-value for the F statistic, we can conclude that the model is statistically significant. Furthermore, all the predictors have a significant association with the response (retention rate).
The multiple \(R^2\) value of 0.66 indicates that 66% of the variability in retention rate of first years is explained by the predictors. However, there may be some additional predictors that were not considered during the initial model building process, which could account for some of the variability.
The adjusted \(R^2\) value of 0.665 adjusts the multiple \(R^2\) for the number of predictors in the model. Since the two values are very close, we can conclude that the predictors in the model are not merely inflating the multiple \(R^2\).
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.1435778 | 0.0256953 | 5.587715 | 0 |
| SAT_AVG | 0.0004355 | 0.0000275 | 15.812946 | 0 |
| C100_4 | 0.1958750 | 0.0177822 | 11.015231 | 0 |
| UGDS | 0.0000028 | 0.0000003 | 9.331482 | 0 |
Based on the plots provided, the number of undergraduate students at most universities is below 20,000, while a few universities have much larger undergraduate populations, making the distribution of undergraduate enrollment right-skewed. Turning to SAT scores, most institutions’ averages fall between 1000 and 1200; relatively few fall below 1000 or above 1500, with somewhat more above 1500 than below 1000.
There exists a significant positive relationship between the average SAT scores and student retention rates. This suggests that colleges and universities with higher SAT scores typically have higher retention rates for first-time, full-time students.
Similarly, retention rates also have a strong positive correlation with on-time completion rates.
Retention rates have a moderate positive correlation with the undergraduate population. Furthermore, there is a moderate negative correlation (-0.406) between retention rates and admission rates (ADM_RATE). This indicates that institutions with lower admission rates tend to have higher retention rates for first-time, full-time students.
The other correlations between variables also provide valuable insight into their relationships. For example, average SAT scores exhibit a negative correlation with admission rates, implying that colleges and universities with higher average SAT scores typically have lower admission rates. This is consistent with more selective institutions enrolling students with higher test scores.
| | RET_FT4 | ADM_RATE | SAT_AVG | C100_4 | UGDS | STUFACR |
|---|---|---|---|---|---|---|
| RET_FT4 | 1.0000000 | -0.4061152 | 0.7809558 | 0.7148796 | 0.3158825 | -0.1387544 |
| ADM_RATE | -0.4061152 | 1.0000000 | -0.5366314 | -0.4438455 | -0.0200048 | 0.3105386 |
| SAT_AVG | 0.7809558 | -0.5366314 | 1.0000000 | 0.7807436 | 0.2422476 | -0.2550316 |
| C100_4 | 0.7148796 | -0.4438455 | 0.7807436 | 1.0000000 | 0.0632319 | -0.3767704 |
| UGDS | 0.3158825 | -0.0200048 | 0.2422476 | 0.0632319 | 1.0000000 | 0.4633320 |
| STUFACR | -0.1387544 | 0.3105386 | -0.2550316 | -0.3767704 | 0.4633320 | 1.0000000 |
Since the VIF scores are all below 10, multicollinearity is not a significant concern among the predictor variables in the model. SAT_AVG and C100_4 are moderately correlated with each other, but not highly enough to cause concern.
| Predictor | VIF |
|---|---|
| SAT_AVG | 2.832160 |
| C100_4 | 2.676659 |
| UGDS | 1.110226 |
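A sketch of how the correlation matrix and VIF values could be computed, using the same placeholder names as above:

```r
library(car)

# Correlation matrix of the response and candidate predictors
cor(scorecard[, c("RET_FT4", "ADM_RATE", "SAT_AVG", "C100_4", "UGDS", "STUFACR")],
    use = "complete.obs")

# Variance inflation factors for the final model; values below 10 suggest
# multicollinearity is not a serious concern
vif(best_model)
```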
I attempted to predict the retention rate for a fictitious school. I assigned values to the predictor variables and used the predict function to obtain a retention rate for my multiple linear regression model. Based on this prediction, it appears that a school with an average SAT score of 1530, an undergraduate population of 15,000, and an on-time completion rate of 0.27 would have a 90% retention rate.
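A sketch of that prediction; the input values match the description above, while the object names are assumptions:

```r
# Hypothetical school: SAT average 1530, on-time completion rate 0.27,
# 15,000 undergraduate students
new_school <- data.frame(SAT_AVG = 1530, C100_4 = 0.27, UGDS = 15000)

predict(best_model, newdata = new_school)  # about 0.90, i.e., a 90% retention rate
```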
How does adding regularization affect the prediction accuracy of models for predicting completion rates for first-time students at 4-year institutions, while controlling for other variables such as in-state tuition and fees and number of undergraduate students?
I will be using the completion rate as the response variable in my analysis. This variable indicates the percentage of first-time, full-time students who complete their degree within the expected time of completion (i.e., 100% completion rate). The predictor variables that I will be using in my analysis are as follows:
| | INSTNM | TUITIONFEE_IN | UGDS | ADM_RATE | STUFACR | SAT_AVG | C100_4 |
|---|---|---|---|---|---|---|---|
| 1 | Alabama A & M University | 10024 | 5098 | 0.7160 | 18 | 954 | 0.1121 |
| 2 | University of Alabama at Birmingham | 8568 | 13284 | 0.8854 | 19 | 1266 | 0.4209 |
| 4 | University of Alabama in Huntsville | 11488 | 7358 | 0.7367 | 19 | 1300 | 0.3528 |
| 5 | Alabama State University | 11068 | 3495 | 0.9799 | 13 | 955 | 0.1014 |
| 6 | The University of Alabama | 11620 | 30725 | 0.7890 | 19 | 1244 | 0.5264 |
| 9 | Auburn University at Montgomery | 8860 | 3742 | 0.9680 | 15 | 1069 | 0.1272 |
Var1 - the name of each component of the fitted model object
Var2 - the attribute being reported for that component (Length, Class, or Mode)
Freq - the value of that attribute
For example, a0 (the sequence of intercepts) has a length of 100, class ‘-none-’, and mode ‘numeric’.
The components have different lengths because they describe different parts of the fit (for example, 100 lambda values and 500 coefficient entries), not because the number of observations varies. They also have different classes, such as the dgCMatrix coefficient matrix, and different storage modes (numeric, logical, S4, or call), which indicate how they are stored in memory. Together, these entries summarize the structure of the fitted model object.
| Var1 | Var2 | Freq |
|---|---|---|
| a0 | Length | 100 |
| beta | Length | 500 |
| df | Length | 100 |
| dim | Length | 2 |
| lambda | Length | 100 |
| dev.ratio | Length | 100 |
| nulldev | Length | 1 |
| npasses | Length | 1 |
| jerr | Length | 1 |
| offset | Length | 1 |
| call | Length | 4 |
| nobs | Length | 1 |
| a0 | Class | -none- |
| beta | Class | dgCMatrix |
| df | Class | -none- |
| dim | Class | -none- |
| lambda | Class | -none- |
| dev.ratio | Class | -none- |
| nulldev | Class | -none- |
| npasses | Class | -none- |
| jerr | Class | -none- |
| offset | Class | -none- |
| call | Class | -none- |
| nobs | Class | -none- |
| a0 | Mode | numeric |
| beta | Mode | S4 |
| df | Mode | numeric |
| dim | Mode | numeric |
| lambda | Mode | numeric |
| dev.ratio | Mode | numeric |
| nulldev | Mode | numeric |
| npasses | Mode | numeric |
| jerr | Mode | numeric |
| offset | Mode | logical |
| call | Mode | call |
| nobs | Mode | numeric |
By using the cv.glmnet() function, we can evaluate how well each model generalizes through cross-validation. The output plot displays a curve whose lowest point indicates the optimal lambda value. In this case, the optimal lambda is approximately 0.01, the value that minimizes the cross-validation error (the plot’s x-axis is on the log scale). This suggests that a relatively small amount of regularization is being applied to the model.
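A sketch of the ridge fit and its cross-validation is shown below; the object and variable names are assumptions, and `alpha = 0` selects ridge regression in glmnet. Calling `summary()` on the fitted object yields a component table of the same kind as the one below.

```r
library(glmnet)

# Keep complete cases of the response and the five predictors
vars <- c("C100_4", "TUITIONFEE_IN", "UGDS", "ADM_RATE", "STUFACR", "SAT_AVG")
dat  <- na.omit(scorecard[, vars])

X <- as.matrix(dat[, -1])  # predictor matrix
y <- dat$C100_4            # completion rate (response)

# Ridge regression (alpha = 0) with 10-fold cross-validation over lambda
ridge_fit <- glmnet(X, y, alpha = 0)
cv_fit    <- cv.glmnet(X, y, alpha = 0)

plot(cv_fit)        # CV error as a function of log(lambda)
cv_fit$lambda.min   # lambda that minimizes the CV error (about 0.01 here)
summary(ridge_fit)  # Length / Class / Mode summary of the model components
```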
| | Length | Class | Mode |
|---|---|---|---|
| a0 | 100 | -none- | numeric |
| beta | 500 | dgCMatrix | S4 |
| df | 100 | -none- | numeric |
| dim | 2 | -none- | numeric |
| lambda | 100 | -none- | numeric |
| dev.ratio | 100 | -none- | numeric |
| nulldev | 1 | -none- | numeric |
| npasses | 1 | -none- | numeric |
| jerr | 1 | -none- | numeric |
| offset | 1 | -none- | logical |
| call | 4 | -none- | call |
| nobs | 1 | -none- | numeric |
According to the analysis, the ridge regression model is able to explain approximately 68% of the variability in the completion rates for first-time students at 4-year institutions. This indicates that the model is fairly effective in predicting completion rates using the given predictors.
| Metric | Value |
|---|---|
| Mean Squared Error (MSE) | 0.0130827 |
| R-squared (R^2) | 0.6786567 |
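A sketch of how those two metrics could be computed from the cross-validated fit, continuing the placeholder names above:

```r
# Predictions at the cross-validated lambda, then MSE and R-squared
pred <- predict(cv_fit, newx = X, s = "lambda.min")

mse <- mean((y - pred)^2)
r2  <- 1 - sum((y - pred)^2) / sum((y - mean(y))^2)

c(MSE = mse, R_squared = r2)
```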
How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?
I will be using a LOESS fit to analyze this. I will use the “Mean earnings of students working and not enrolled 10 years after entry” variable as my response and “Retention at 4-year Institution” as my predictor.
In addition, I will split the data by region, using regions 1-8, which excludes U.S. Service Schools and the Outlying Areas.
For each region, I will fit LOESS models for span values from 0.25 to 0.75 in steps of 0.05 and calculate the RSE for each fit. I will then choose the three most accurate fits and graph them, as sketched below.
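A sketch of this span search for a single region; the RSE here is taken from the `loess` fit’s residual standard error, which may differ in detail from the calculation used originally, and the object names are assumptions.

```r
# For one region: fit LOESS at each span and record the residual standard error
spans <- seq(0.25, 0.75, by = 0.05)

region_data <- subset(scorecard, REGION == 1 &
                        !is.na(RET_FT4) & !is.na(MN_EARN_WNE_P10))

rse_by_span <- sapply(spans, function(sp) {
  fit <- loess(MN_EARN_WNE_P10 ~ RET_FT4, data = region_data, span = sp)
  fit$s  # residual standard error of the fit
})

data.frame(span = spans, RSE_loess = rse_by_span)
```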
| | INSTNM | STABBR | REGION | RET_FT4 | MN_EARN_WNE_P10 |
|---|---|---|---|---|---|
| 539 | Albertus Magnus College | CT | 1 | 0.6420 | 56000 |
| 542 | University of Bridgeport | CT | 1 | 0.5516 | 47100 |
| 543 | Central Connecticut State University | CT | 1 | 0.7232 | 51600 |
| 544 | Charter Oak State College | CT | 1 | 0.6667 | 43700 |
| 546 | Connecticut College | CT | 1 | 0.8611 | 62900 |
| 548 | University of Connecticut | CT | 1 | 0.9164 | 66000 |
| | INSTNM | STABBR | REGION | RET_FT4 | MN_EARN_WNE_P10 |
|---|---|---|---|---|---|
| 590 | Delaware State University | DE | 2 | 0.7510 | 38400 |
| 591 | University of Delaware | DE | 2 | 0.9148 | 64000 |
| 592 | Goldey-Beacom College | DE | 2 | 0.8214 | 48800 |
| 594 | Wilmington University | DE | 2 | 0.5866 | 46600 |
| 595 | American University | DC | 2 | 0.9046 | 67700 |
| 596 | The Catholic University of America | DC | 2 | 0.8800 | 60500 |
| | INSTNM | STABBR | REGION | RET_FT4 | MN_EARN_WNE_P10 |
|---|---|---|---|---|---|
| 839 | American Academy of Art College | IL | 3 | 0.7037 | 35600 |
| 841 | School of the Art Institute of Chicago | IL | 3 | 0.8266 | 40800 |
| 842 | Augustana College | IL | 3 | 0.8527 | 56400 |
| 843 | Aurora University | IL | 3 | 0.7789 | 46200 |
| 847 | Blackburn College | IL | 3 | 0.6724 | 41700 |
| 849 | Bradley University | IL | 3 | 0.8651 | 58300 |
| | INSTNM | STABBR | REGION | RET_FT4 | MN_EARN_WNE_P10 |
|---|---|---|---|---|---|
| 473 | Walden University | MN | 4 | 0.2500 | 62500 |
| 1061 | Briar Cliff University | IA | 4 | 0.6337 | 44400 |
| 1062 | Buena Vista University | IA | 4 | 0.7301 | 42600 |
| 1065 | Central College | IA | 4 | 0.7901 | 48400 |
| 1066 | Clarke University | IA | 4 | 0.6507 | 44800 |
| 1067 | Coe College | IA | 4 | 0.7391 | 50000 |
| | INSTNM | STABBR | REGION | RET_FT4 | MN_EARN_WNE_P10 |
|---|---|---|---|---|---|
| 1 | Alabama A & M University | AL | 5 | 0.5797 | 35500 |
| 2 | University of Alabama at Birmingham | AL | 5 | 0.8392 | 48400 |
| 4 | University of Alabama in Huntsville | AL | 5 | 0.7899 | 52000 |
| 5 | Alabama State University | AL | 5 | 0.6436 | 30600 |
| 6 | The University of Alabama | AL | 5 | 0.8855 | 51600 |
| 9 | Auburn University at Montgomery | AL | 5 | 0.6303 | 38000 |
| | INSTNM | STABBR | REGION | RET_FT4 | MN_EARN_WNE_P10 |
|---|---|---|---|---|---|
| 69 | Brookline College-Phoenix | AZ | 6 | 0.4667 | 23700 |
| 70 | Arizona State University Campus Immersion | AZ | 6 | 0.8600 | 55500 |
| 72 | University of Arizona | AZ | 6 | 0.8403 | 56000 |
| 79 | Embry-Riddle Aeronautical University-Prescott | AZ | 6 | 0.7754 | 70300 |
| 82 | Grand Canyon University | AZ | 6 | 0.7059 | 58500 |
| 87 | Dine College | AZ | 6 | 0.0909 | 23400 |
| | INSTNM | STABBR | REGION | RET_FT4 | MN_EARN_WNE_P10 |
|---|---|---|---|---|---|
| 488 | Adams State University | CO | 7 | 0.5924 | 38100 |
| 490 | Arapahoe Community College | CO | 7 | 0.5225 | 40700 |
| 492 | University of Colorado Denver/Anschutz Medical Campus | CO | 7 | 0.7480 | 71200 |
| 493 | University of Colorado Colorado Springs | CO | 7 | 0.6664 | 49300 |
| 495 | University of Colorado Boulder | CO | 7 | 0.8738 | 59700 |
| 496 | Colorado Christian University | CO | 7 | 0.8229 | 44600 |
| | INSTNM | STABBR | REGION | RET_FT4 | MN_EARN_WNE_P10 |
|---|---|---|---|---|---|
| 55 | University of Alaska Anchorage | AK | 8 | 0.6869 | 51200 |
| 57 | University of Alaska Fairbanks | AK | 8 | 0.6793 | 45100 |
| 58 | University of Alaska Southeast | AK | 8 | 0.6471 | 39500 |
| 59 | Alaska Pacific University | AK | 8 | 0.6842 | 45300 |
| 164 | Academy of Art University | CA | 8 | 0.6899 | 47300 |
| 176 | Art Center College of Design | CA | 8 | 0.8504 | 62300 |
How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?
| span | RSE_loess |
|---|---|
| 0.25 | 13517.94 |
| 0.30 | 13640.93 |
| 0.35 | 13737.59 |
| 0.40 | 13844.75 |
| 0.45 | 13926.70 |
| 0.50 | 14027.89 |
| 0.55 | 14117.45 |
| 0.60 | 14190.65 |
| 0.65 | 14341.78 |
| 0.70 | 14491.43 |
| 0.75 | 14658.62 |
How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?
| span | RSE_loess |
|---|---|
| 0.25 | 13371.78 |
| 0.30 | 13433.87 |
| 0.35 | 13472.65 |
| 0.40 | 13507.91 |
| 0.45 | 13545.81 |
| 0.50 | 13575.71 |
| 0.55 | 13609.19 |
| 0.60 | 13626.63 |
| 0.65 | 13634.73 |
| 0.70 | 13643.16 |
| 0.75 | 13661.11 |
How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?
| span | RSE_loess |
|---|---|
| 0.25 | 9300.800 |
| 0.30 | 9308.149 |
| 0.35 | 9307.209 |
| 0.40 | 9310.907 |
| 0.45 | 9314.890 |
| 0.50 | 9320.280 |
| 0.55 | 9325.684 |
| 0.60 | 9333.756 |
| 0.65 | 9341.502 |
| 0.70 | 9356.009 |
| 0.75 | 9374.692 |
How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?
| span | RSE_loess |
|---|---|
| 0.25 | 9002.696 |
| 0.30 | 9059.466 |
| 0.35 | 9100.983 |
| 0.40 | 9133.296 |
| 0.45 | 9160.274 |
| 0.50 | 9187.892 |
| 0.55 | 9215.928 |
| 0.60 | 9237.815 |
| 0.65 | 9260.071 |
| 0.70 | 9276.991 |
| 0.75 | 9299.451 |
How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?
| span | RSE_loess |
|---|---|
| 0.25 | 8891.443 |
| 0.30 | 8924.220 |
| 0.35 | 8917.468 |
| 0.40 | 8940.496 |
| 0.45 | 8956.736 |
| 0.50 | 8971.868 |
| 0.55 | 8984.336 |
| 0.60 | 8994.147 |
| 0.65 | 9005.815 |
| 0.70 | 9013.932 |
| 0.75 | 9023.108 |
How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?
| span | RSE_loess |
|---|---|
| 0.25 | 9799.169 |
| 0.30 | 9823.838 |
| 0.35 | 9807.848 |
| 0.40 | 9867.717 |
| 0.45 | 9915.619 |
| 0.50 | 9955.095 |
| 0.55 | 9984.825 |
| 0.60 | 10004.162 |
| 0.65 | 9989.940 |
| 0.70 | 9993.568 |
| 0.75 | 10004.536 |
How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?
| span | RSE_loess |
|---|---|
| 0.25 | 10114.56 |
| 0.30 | 10254.46 |
| 0.35 | 10279.87 |
| 0.40 | 10330.13 |
| 0.45 | 10446.56 |
| 0.50 | 10646.09 |
| 0.55 | 10811.20 |
| 0.60 | 10895.39 |
| 0.65 | 10928.38 |
| 0.70 | 11021.68 |
| 0.75 | 11064.21 |
How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?
| span | RSE_loess |
|---|---|
| 0.25 | 12636.05 |
| 0.30 | 12787.71 |
| 0.35 | 12915.42 |
| 0.40 | 13001.78 |
| 0.45 | 13037.83 |
| 0.50 | 13068.02 |
| 0.55 | 13093.43 |
| 0.60 | 13117.40 |
| 0.65 | 13135.49 |
| 0.70 | 13158.21 |
| 0.75 | 13177.23 |
How does the relationship between retention rates and median earnings after graduation vary across different regions (e.g., New England, Mid East, Southwest, etc) based on the College Scorecard data?
In general, the LOESS regression analysis indicates that for most regions the fitted curve rises in a roughly exponential fashion from a positive y-intercept. This suggests that as retention rates increase at a school, median earnings after graduation tend to be higher. The only region without a clear exponential-looking curve was region 7, which could be due to the significantly lower number of schools in that region; nonetheless, it still shows a general positive trend between retention and median earnings after graduation.
For all regions, the most accurate span observed was 0.25, followed by 0.30 and 0.35. Based on the graphs for all regions, the LOESS regression lines with a span of 0.25 fit best, as the lines for the two larger spans tend to oversmooth the data.
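A sketch of how the top spans could be overlaid for a single region with ggplot2 (faceting over all eight regions is omitted; object names continue the assumptions above):

```r
library(ggplot2)

# Overlay the three best spans (0.25, 0.30, 0.35) for one region
ggplot(region_data, aes(x = RET_FT4, y = MN_EARN_WNE_P10)) +
  geom_point(alpha = 0.4) +
  geom_smooth(aes(colour = "0.25"), method = "loess", span = 0.25, se = FALSE) +
  geom_smooth(aes(colour = "0.30"), method = "loess", span = 0.30, se = FALSE) +
  geom_smooth(aes(colour = "0.35"), method = "loess", span = 0.35, se = FALSE) +
  labs(colour = "LOESS span",
       x = "Retention rate (RET_FT4)",
       y = "Mean earnings 10 years after entry (MN_EARN_WNE_P10)")
```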
Can we accurately classify schools as public, private nonprofit, or private for-profit based on various features in the top five states with the most schools?
Based on the bar graph, California has the most schools, with around 700. New York and Texas have the next most schools, with 450 and 415 schools, respectively. I will use the data from the top five states for my kNN classification, which will be California, New York, Texas, Florida, and Pennsylvania.
I will be using the following variables for my kNN model:
I will be using “Control of Institution” as my response variable. This variable indicates whether an institution is public (1), private non-profit (2), or private for-profit (3). For information about the other variables, please refer to the data dictionary located in the Project Overview tab.
| | INSTNM | STABBR | UGDS | RET_FT4 | STUFACR | AVGFACSAL | PCTFLOAN | NUMBRANCH | CONTROL |
|---|---|---|---|---|---|---|---|---|---|
| 164 | Academy of Art University | CA | 5294 | 0.6899 | 14 | 9482 | 0.4938 | 1 | 3 |
| 176 | Art Center College of Design | CA | 2026 | 0.8504 | 8 | 7942 | 0.4686 | 1 | 2 |
| 179 | Azusa Pacific University | CA | 3937 | 0.7928 | 10 | 8957 | 0.4772 | 1 | 2 |
| 183 | Bethesda University | CA | 264 | 0.4706 | 10 | 2482 | 0.2035 | 1 | 2 |
| 184 | Biola University | CA | 3592 | 0.8780 | 13 | 9807 | 0.4786 | 1 | 2 |
| 190 | California Baptist University | CA | 8234 | 0.7866 | 19 | 10151 | 0.6467 | 1 | 2 |
The plots presented below compare each predictor with the response variable. While there is significant overlap in the graphs, most private for-profit colleges tend to have smaller undergraduate enrollments (fewer than 10,000 students) than public and private non-profit colleges. Moreover, faculty at public schools receive an average salary above $5,000 per month, whereas faculty at private schools can earn less than $5,000 per month; private for-profit faculty earn between $2,500 and $10,000 per month.
I split the dataset into a 70% training and 30% testing split. Then, I plotted the training data and the testing points (represented by diamonds) on separate graphs. Overall, all graphs revealed that the two datasets exhibit similar patterns, indicating that the model predictions will align with the actual values.
The graph below displays the accuracy of the model for different k-values. According to the graph, the k-value of 5 resulted in the highest accuracy of 80%. As the k-values increase, the accuracy of the model decreases. To further analyze the performance of the kNN model, I will execute a confusion matrix for the k-value of 5.
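A sketch of this kNN workflow, assuming the five-state subset is in a placeholder data frame `top5`; scaling the predictors and looping over k reflect common practice and may differ from the original details.

```r
library(class)

set.seed(42)

predictors <- c("UGDS", "RET_FT4", "STUFACR", "AVGFACSAL", "PCTFLOAN", "NUMBRANCH")
knn_data   <- na.omit(top5[, c(predictors, "CONTROL")])

# Scale the predictors so distances are not dominated by large-valued variables
X <- scale(knn_data[, predictors])
y <- factor(knn_data$CONTROL)

# 70/30 train/test split
train_idx <- sample(nrow(X), size = round(0.7 * nrow(X)))

# Test-set accuracy across a range of k values
accuracy <- sapply(1:20, function(k) {
  pred <- knn(train = X[train_idx, ], test = X[-train_idx, ],
              cl = y[train_idx], k = k)
  mean(pred == y[-train_idx])
})

which.max(accuracy)  # k = 5 gave the highest accuracy (about 80%) in this analysis
```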
This is the confusion matrix for k=5, which represents the most accurate kNN with an accuracy rate of approximately 80%. There were 35 correct predictions out of 44 actual instances for Public colleges, 104 correct predictions out of 129 for Private non-profit colleges, and 10 correct predictions out of 13 instances for Private for-profit colleges.
However, there were 7 instances where Public colleges were incorrectly classified as Private Nonprofit, and 13 instances where Private Nonprofit colleges were incorrectly classified as Public. These misclassifications suggest some overlap in the features or characteristics of these classes. The classifier also struggled with Private For-Profit, misclassifying 3 instances as Private Nonprofit, but none as Public.
It is essential to note that a smaller number of instances for Private For-Profit colleges may have contributed to the lower accuracy rate for this class.
| | Predicted Public | Predicted Nonprofit | Predicted For-Profit |
|---|---|---|---|
| Actual Public | 35 | 7 | 2 |
| Actual Private Nonprofit | 13 | 104 | 12 |
| Actual Private For-Profit | 0 | 3 | 10 |
Based on the following features, can we accurately predict the region where a school is located?
The plot shows the geographic distribution of schools across the 10 regions defined by the College Scorecard dataset. The regions are as follows:
I will be using the following variables as predictors in my Naive Bayes:
I will be using the term ‘Region’ as my response variable. There are 10 regions, but I will only be using 8 of them. Specifically, I will be excluding U.S. Service Schools (0) and Outlying Areas (9). The remaining regions that I will be using are as follows:
| | INSTNM | STABBR | ADM_RATE | SAT_AVG | OPENADMP | GRADS | C100_4 | UGDS | REGION |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Alabama A & M University | AL | 0.7160 | 954 | 2 | 862 | 0.1121 | 5098 | 5 |
| 2 | University of Alabama at Birmingham | AL | 0.8854 | 1266 | 2 | 8742 | 0.4209 | 13284 | 5 |
| 4 | University of Alabama in Huntsville | AL | 0.7367 | 1300 | 2 | 2067 | 0.3528 | 7358 | 5 |
| 5 | Alabama State University | AL | 0.9799 | 955 | 2 | 465 | 0.1014 | 3495 | 5 |
| 6 | The University of Alabama | AL | 0.7890 | 1244 | 2 | 6631 | 0.5264 | 30725 | 5 |
| 9 | Auburn University at Montgomery | AL | 0.9680 | 1069 | 2 | 982 | 0.1272 | 3742 | 5 |
I split the dataset into 70% training and 30% testing for the Naive Bayes model. The model was 95% accurate; its confusion matrix is shown below. Notable correct predictions include 15 instances of region 1, 39 of region 2, 48 of region 3, 35 of region 4, 70 of region 5, 20 of region 6, 8 of region 7, and 16 of region 8. However, there were instances of misclassification, such as 4 instances of region 1 falsely predicted as region 2, 1 instance of region 5 as region 3, and 1 instance of region 8 as region 7. These suggest the need for further investigation and refinement of the model’s features or training approach.
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Predicted 1 | 15 | 4 | 1 | 0 | 2 | 1 | 0 | 0 |
| Predicted 2 | 0 | 39 | 0 | 1 | 0 | 1 | 0 | 0 |
| Predicted 3 | 0 | 0 | 48 | 0 | 0 | 0 | 0 | 0 |
| Predicted 4 | 0 | 0 | 0 | 35 | 0 | 0 | 0 | 0 |
| Predicted 5 | 0 | 0 | 1 | 0 | 70 | 0 | 0 | 1 |
| Predicted 6 | 0 | 0 | 0 | 0 | 0 | 20 | 0 | 0 |
| Predicted 7 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 1 |
| Predicted 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 |
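For reference, a sketch of how such a Naive Bayes model could be fit and evaluated, assuming the `e1071` package and a placeholder data frame `region_df` restricted to regions 1-8:

```r
library(e1071)

set.seed(42)

nb_vars <- c("ADM_RATE", "SAT_AVG", "OPENADMP", "GRADS", "C100_4", "UGDS", "REGION")
nb_data <- na.omit(region_df[, nb_vars])

nb_data$REGION   <- factor(nb_data$REGION)
nb_data$OPENADMP <- factor(nb_data$OPENADMP)

# 70/30 train/test split
train_idx <- sample(nrow(nb_data), size = round(0.7 * nrow(nb_data)))
train <- nb_data[train_idx, ]
test  <- nb_data[-train_idx, ]

# Fit the Naive Bayes classifier and evaluate it on the held-out data
nb_fit <- naiveBayes(REGION ~ ., data = train)
pred   <- predict(nb_fit, newdata = test)

table(Actual = test$REGION, Predicted = pred)  # confusion matrix
mean(pred == test$REGION)                      # overall accuracy (about 0.95 here)
```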
How does logistic regression predict whether colleges in the United States offer a Bachelor’s degree in engineering?
The following pie chart displays the percentage of colleges in the United States that provide a Bachelor’s degree in engineering. The chart shows nearly an equal number of colleges that provide and don’t provide this degree. Based on this observation, I am interested in knowing whether logistic regression can be used to predict with accuracy if a college offers this type of degree.
I will be using the following variables for my regression:
I will be using the “Bachelor’s degree in Engineering” variable as my response. This variable indicates whether a college offers a Bachelor’s degree in Engineering, with 0 indicating no, 1 indicating yes, and 2 indicating that the program is offered through an exclusively distance-education program. I will be eliminating any schools with a value of 2 from the data set, so I will just focus on classifying whether a school simply offers the program or not.
The head of the data table that will be used for the regression is displayed below.
| | INSTNM | ADM_RATE | SAT_AVG | CIP26BACHL | UGDS_WHITE | UGDS_ASIAN | UGDS_BLACK | UGDS_HISP | IRPS_WOMEN | CIP14BACHL |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Alabama A & M University | 0.7160 | 954 | 1 | 0.0184 | 0.0014 | 0.8978 | 0.0114 | 0.5024 | 1 |
| 2 | University of Alabama at Birmingham | 0.8854 | 1266 | 1 | 0.5297 | 0.0767 | 0.2458 | 0.0669 | 0.4433 | 1 |
| 4 | University of Alabama in Huntsville | 0.7367 | 1300 | 1 | 0.7196 | 0.0357 | 0.0871 | 0.0610 | 0.4644 | 1 |
| 5 | Alabama State University | 0.9799 | 955 | 1 | 0.0152 | 0.0020 | 0.9259 | 0.0129 | 0.4796 | 1 |
| 6 | The University of Alabama | 0.7890 | 1244 | 1 | 0.7676 | 0.0137 | 0.1050 | 0.0549 | 0.4585 | 1 |
| 9 | Auburn University at Montgomery | 0.9680 | 1069 | 1 | 0.4129 | 0.0283 | 0.4487 | 0.0131 | 0.5022 | 0 |
I used stepwise variable selection to find the best model; the selected model is CIP14BACHL ~ SAT_AVG + CIP26BACHL + UGDS_ASIAN + IRPS_WOMEN, as sketched below.
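A sketch of the logistic fit and stepwise selection, assuming a placeholder data frame `engineering_df` with the columns shown above and CIP14BACHL coded 0/1:

```r
# CIP26BACHL is categorical (0 = not offered, 1 = offered), so treat it as a factor
engineering_df$CIP26BACHL <- factor(engineering_df$CIP26BACHL)

# Full logistic model, then AIC-based stepwise selection in both directions
full_logit <- glm(CIP14BACHL ~ ADM_RATE + SAT_AVG + CIP26BACHL + UGDS_WHITE +
                    UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + IRPS_WOMEN,
                  family = binomial, data = engineering_df)

best_logit <- step(full_logit, direction = "both", trace = FALSE)

coef(best_logit)  # expected terms: SAT_AVG, CIP26BACHL1, UGDS_ASIAN, IRPS_WOMEN
```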
The coefficients of the best regression model are displayed below.
| Term | Coefficient |
|---|---|
| (Intercept) | -2.5994839 |
| SAT_AVG | 0.0025489 |
| CIP26BACHL1 | 3.1441358 |
| UGDS_ASIAN | 6.0400929 |
| IRPS_WOMEN | -7.6897723 |
The following plots show the numerical predictors against the response variable, categorized by the categorical predictor indicating whether a Bachelor’s degree in Biological and Biomedical Sciences is offered.
We have evaluated the performance of our model on both training and testing data. On the training data, our model has an accuracy of 68% and a misclassification rate of approximately 31.79%. The confusion matrix reveals that the model made 285 correct predictions of “Yes” and 202 correct predictions of “No.” However, the model also misclassified 125 instances as “Yes” and 102 instances as “No.”
On the testing data, our model’s accuracy is slightly lower at 63%, with a misclassification rate of approximately 36.93%. The confusion matrix shows that there were 65 instances incorrectly predicted as “Yes” and 48 instances incorrectly predicted as “No.” Overall, the model’s performance on the testing data is not as good as the training data, with a higher rate of misclassification.
To further assess the model’s effectiveness, we performed 10-fold cross-validation. The average misclassification rate across the 10 folds was approximately 20%. These results suggest that the logistic regression model performs well and is effective with new data. Furthermore, these results indicate that the model’s performance is stable and consistent across different subsets of the dataset.
| | Yes | No |
|---|---|---|
| Predicted Yes | 285 | 125 |
| Predicted No | 102 | 202 |
| | Yes | No |
|---|---|---|
| Predicted Yes | 110 | 65 |
| Predicted No | 48 | 83 |
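Finally, a sketch of the 10-fold cross-validation mentioned above, using `boot::cv.glm()` with a misclassification cost function; this is an assumption about tooling, and other cross-validation utilities would work equally well.

```r
library(boot)

# Misclassification rate at a 0.5 probability threshold
misclass_cost <- function(actual, predicted_prob) {
  mean(actual != as.numeric(predicted_prob > 0.5))
}

set.seed(42)

# 10-fold cross-validation of the selected logistic model
# (assumes the model columns of engineering_df contain no missing values)
cv_result <- cv.glm(engineering_df, best_logit, cost = misclass_cost, K = 10)

cv_result$delta[1]  # average misclassification rate across the folds (about 0.20)
```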