In this project, I’ll be working with the CollegeScores4yr dataset from Lock5Stat to explore various factors that may influence college outcomes. This dataset provides a range of variables related to four-year colleges in the U.S., covering topics like graduation rates, cost of attendance, student demographics, and financial aid.
knitr::opts_chunk$set(echo = FALSE)
# Load the readxl package, which provides read_excel()
library(readxl)
# Load data from the Excel file
data <- read_excel("CollegeScores4yr.xlsx")
As I reviewed the variable descriptions in the CollegeScores4yr dataset, several questions came to mind. Here are the questions I came up with:
ChatGPT provided an alternative set of questions focusing on statistical relationships and broader patterns:
After comparing both sets, I noticed overlap in areas like:
1. Graduation rates by college type (public vs. private)
2. The impact of financial aid on graduation rates
3. Student-to-faculty ratios and outcomes
4. Cost-related factors like family income and attendance costs
After reviewing both sets, here are the final ten questions selected for in-depth analysis:
##        Name       State          ID        Main      Accred  MainDegree
##           0           0           0           0           0           0
##  HighDegree     Control      Region      Locale    Latitude   Longitude
##           0           0           0           0           0           0
##   AdmitRate      MidACT      AvgSAT      Online  Enrollment       White
##         360         760         735           0           1           1
##       Black    Hispanic       Asian       Other    PartTime    NetPrice
##           1           1           1           1           1         162
##        Cost   TuitionIn   TuitonOut  TuitionFTE InstructFTE   FacSalary
##         162          94          94           2           2          54
## FullTimeFac        Pell    CompRate        Debt      Female    FirstGen
##         127           5         167         152         166         225
##   MedIncome
##          51
## [1] 1256 37
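The report hides its code (echo = FALSE), so the calls behind the counts and dimensions above are not shown. One plausible way to produce output of this shape, sketched in base R on a small made-up data frame (the `toy` values are hypothetical, not taken from the dataset):

```r
# Toy stand-in for the college data; the values are invented for illustration.
toy <- data.frame(
  Name     = c("A", "B", "C", "D"),
  Cost     = c(20000, NA, 35000, 41000),
  CompRate = c(55, 62, NA, 48)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows with no missing values, then report rows and columns
complete_rows <- na.omit(toy)
print(dim(complete_rows))
```

Whether the original analysis dropped incomplete rows in this way is an assumption; the sketch only shows how such a summary can be computed.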
Outliers can reveal unique cases or unusual data points. Here, we’ll identify outliers in some key variables like “Cost” and “MedIncome” to understand the range and detect any extreme values.
The boxplots highlight any extreme values in college costs and median family income. Outliers in these variables could point to unusually high-cost institutions or to colleges whose students come from families with very high or very low incomes.
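To list the outlying values rather than only see them as points on a boxplot, base R's `boxplot.stats()` applies the same 1.5 × IQR rule that `boxplot()` uses. A minimal sketch, using invented cost values rather than the real `Cost` column:

```r
# Hypothetical cost values, invented for illustration (not from the dataset)
cost <- c(18000, 21000, 22500, 24000, 26000, 27500, 30000, 95000, NA)

# boxplot.stats() ignores NA values and flags points that lie more than
# 1.5 times the hinge spread beyond the lower or upper hinge
stats <- boxplot.stats(cost)
outliers <- stats$out
print(outliers)
```

Applying the same call to a column such as `data$Cost` would return the specific values drawn as outliers, which can then be matched back to college names.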
## [1] 56.5987
The average graduation rate of about 56.6% provides a baseline for the comparisons that follow. A higher average would indicate greater overall success in supporting students through to graduation.
## # A tibble: 3 × 2
## Control average_graduation_rate
## <chr> <dbl>
## 1 Private 59.4
## 2 Profit 50.0
## 3 Public 52.2
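Comparing average graduation rates by institutional control shows private colleges highest (59.4%), followed by public (52.2%) and for-profit (50.0%) colleges. These gaps suggest that control type is related to success rates, possibly through factors like funding or student demographics.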
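The tibble output above suggests the original used dplyr, but the code is hidden. As a hedged base R equivalent, a grouped mean can be computed with `aggregate()`; the `toy` data frame below is made up for illustration:

```r
# Toy data with the same column names as the report; values are invented.
toy <- data.frame(
  Control  = c("Private", "Private", "Public", "Public", "Profit"),
  CompRate = c(60, 70, 50, NA, 40)
)

# The formula interface drops rows with missing values before averaging
by_control <- aggregate(CompRate ~ Control, data = toy, FUN = mean)
names(by_control)[2] <- "average_graduation_rate"
print(by_control)
```

Group sizes matter when reading such a table, so it can help to also report how many colleges fall in each category, for example with `table(toy$Control)`.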
## [1] 0.5830873
The correlation of about 0.58 is moderately positive: colleges whose students come from higher-income families tend to report higher average SAT scores. This might reflect socioeconomic factors influencing SAT performance.
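Because both variables contain missing values, a correlation like this one needs an explicit rule for handling them. A minimal sketch with `cor()` and `use = "complete.obs"`, on invented vectors that stand in for `MedIncome` and `AvgSAT`:

```r
# Hypothetical values for illustration only (income in thousands of dollars)
med_income <- c(30, 45, 50, 62, 80, NA)
avg_sat    <- c(950, 1010, 1080, NA, 1250, 1100)

# use = "complete.obs" keeps only the pairs where both values are present;
# without it, cor() returns NA as soon as any value is missing
r <- cor(med_income, avg_sat, use = "complete.obs")
print(r)
```

Whether the report used this option or removed incomplete rows beforehand is not shown, so this is one possible approach rather than the original method.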
## # A tibble: 1,151 × 2
## Enrollment avg_cost
## <dbl> <dbl>
## 1 25 14795
## 2 59 NaN
## 3 63 24909
## 4 67 35522
## 5 75 17415
## 6 77 41282
## 7 78 18863
## 8 90 35504.
## 9 98 67350
## 10 100 28994
## # ℹ 1,141 more rows
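The average attendance cost by enrollment size could reveal whether larger or smaller colleges tend to cost more, which may affect affordability and access. Because the table groups by each exact enrollment value, it has 1,151 groups, too many to read a trend from directly. A NaN indicates a group with no non-missing cost values.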
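Grouping by exact enrollment yields one row per distinct enrollment value. One alternative, sketched here under assumed cut-offs, is to bin enrollment into size bands first. The breaks (1,000 and 5,000 students) and the `toy` values are hypothetical choices for illustration:

```r
# Toy data; values are invented and the size bands are an assumption.
toy <- data.frame(
  Enrollment = c(250, 800, 1500, 4000, 12000, 30000),
  Cost       = c(30000, NA, 42000, 38000, 25000, 27000)
)

# Bin enrollment into three bands: (0, 1000], (1000, 5000], (5000, Inf]
toy$size_band <- cut(
  toy$Enrollment,
  breaks = c(0, 1000, 5000, Inf),
  labels = c("small", "medium", "large")
)

# Average cost per band, ignoring missing costs
avg_cost <- tapply(toy$Cost, toy$size_band, mean, na.rm = TRUE)
print(avg_cost)
```

With `na.rm = TRUE`, a band in which every cost is missing would still return NaN, since there is nothing left to average.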
## [1] 0.5823886
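The correlation of about 0.58 is positive, so higher-cost colleges tend to have higher graduation rates. This fits the reading that cost reflects the resources a college provides more than it acts as a barrier, although a correlation alone cannot separate the two explanations.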
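A correlation coefficient alone does not say whether the relationship is statistically significant. Base R's `cor.test()` reports a p-value and confidence interval alongside the estimate. A minimal sketch on invented values standing in for `Cost` and `CompRate`:

```r
# Hypothetical values for illustration only (not from the dataset)
cost      <- c(15000, 22000, 28000, 35000, 41000, 52000, 60000, 67000)
comp_rate <- c(35, 48, 44, 55, 61, 58, 72, 80)

# Pearson correlation test: estimate, p-value, and 95% confidence interval
result <- cor.test(cost, comp_rate)
print(result$estimate)
print(result$p.value)
print(result$conf.int)
```

`cor.test()` drops incomplete pairs on its own, so it can be applied to columns with missing values directly.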
This plot shows whether graduation rates vary with the percentage of minority students, which could point to the effectiveness of inclusivity efforts.
## [1] 0.6254544
The correlation of about 0.63 is positive and fairly strong. It might indicate that smaller class sizes or lower student-to-faculty ratios go along with better SAT outcomes, possibly due to personalized attention.
## [1] 145
Identifying the 145 colleges with high graduation rates (above 80%) helps highlight successful institutions and establish benchmarks for excellence.
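A count like this can be obtained by summing a logical comparison. A short sketch on invented graduation rates (the vector below is hypothetical):

```r
# Hypothetical graduation rates in percent, invented for illustration
comp_rate <- c(45, 82, 91, NA, 80, 67, 88)

# TRUE counts as 1 and FALSE as 0; na.rm = TRUE skips missing rates.
# The comparison is strict, so a rate of exactly 80 is not counted.
n_high <- sum(comp_rate > 80, na.rm = TRUE)
print(n_high)
```

Using `which(comp_rate > 80)` instead returns the positions of the matching colleges, which is useful for looking up their names.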
## [1] -0.4201928
The negative correlation of about -0.42 suggests that colleges with lower admission rates, that is, more selective institutions, tend to attract higher-achieving students, as reflected in SAT scores.
## [1] -0.6657039
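The correlation of about -0.67 is negative, so colleges with higher values on the financial aid measure tend to have lower graduation rates. This more likely reflects the economic circumstances of students who qualify for aid than any harm from the aid itself, which underscores the importance of accessible education funding.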
The histogram shows the spread of graduation rates across colleges, helping identify common success levels and outliers.
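The bin counts behind a histogram can also be inspected directly. A minimal sketch with `hist(..., plot = FALSE)`, using invented rates and an assumed bin width of 10 percentage points:

```r
# Hypothetical graduation rates in percent, invented for illustration
comp_rate <- c(12, 35, 48, 52, 55, 57, 63, 68, 74, 91)

# plot = FALSE returns the bin breaks and counts without drawing anything
h <- hist(comp_rate, breaks = seq(0, 100, by = 10), plot = FALSE)
print(h$breaks)
print(h$counts)
```

The bin width is a choice; narrower bins show more detail but make the shape noisier.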
This scatter plot shows how average SAT scores vary with median family income, indicating whether colleges with wealthier student bodies also report higher scores.
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
A trend in this plot may indicate if larger or smaller institutions have higher attendance costs, potentially impacting student choices.
This analysis identifies several factors associated with college success, including family income, financial aid availability, and institution type. These factors show moderate to strong correlations with outcomes like graduation rates and SAT scores, suggesting that socioeconomic elements and institutional characteristics play important roles in student achievement. The descriptive statistics and visualizations show correlations only; they do not establish causation.

Patterns related to minority representation and financial aid point to broader systemic trends that likely affect student outcomes on a national scale, emphasizing the importance of inclusive and accessible support structures. Future research could apply inferential statistics to test these relationships and, with suitable study designs, investigate causal links, offering more actionable insights for policy and educational planning.
Citations and References

Lock5Stat. (2024). CollegeScores4yr dataset. Retrieved from https://www.lock5stat.com/datapage3e.html