I used the data from the CollegeScores4yr dataset, which provides detailed information on U.S. colleges and universities that primarily grant bachelor’s degrees. The data, collected by the U.S. Department of Education through the College Scorecard project, includes variables such as tuition, admission rates, student demographics, faculty salaries, and completion rates.
To guide my analysis, I developed 10 research questions through a combination of my own inquiries and AI-generated suggestions. I first proposed questions based on my understanding of the data, then used ChatGPT to generate additional questions, and finally selected the most relevant combination to explore key patterns in college characteristics, costs, and student outcomes.
I proposed the following 10 questions based on my own understanding of the data:
1. What is the mean in-state tuition (TuitionIn) for four-year colleges in the dataset?
2. What is the sample variance of the total cost (Cost) among all institutions?
3. What are the quantiles of student debt (Debt) for graduates across the dataset?
4. What are the percentiles of median family income (MedIncome) for the listed colleges?
5. Create a stem-and-leaf plot to display the distribution of completion rate (CompRate).
6. What is the mean monthly salary for full-time faculty (FacSalary)?
7. Create a histogram showing the distribution of average net price (NetPrice) across colleges.
8. Create a boxplot to visualize the spread and outliers of total cost (Cost).
9. Create a boxplot to display the distribution of the percentage of female students (Female).
10. Create a histogram showing the distribution of first-generation students (FirstGen) across institutions.
These are questions generated by ChatGPT:
1. What is the median admission rate (AdmitRate) across all four-year colleges?
2. What is the standard deviation of the average SAT score (AvgSAT) among all institutions?
3. What is the mean percentage of part-time students (PartTime) in Midwest vs. West regions?
4. Compare median instructional spending (InstructFTE) between public and private colleges using a boxplot.
5. What is the correlation between tuition cost (Cost) and average debt (Debt)?
6. Create a boxplot to compare completion rates (CompRate) across different types of school control (Control).
7. What is the average completion rate (CompRate) among colleges where more than 50% of students receive Pell grants?
8. Calculate the correlation between in-state tuition (TuitionIn) and median family income (MedIncome).
9. What is the correlation between median family income (MedIncome) and net price (NetPrice)?
10. Create a scatter diagram to display the relationship between average total cost (Cost) and completion rate (CompRate).
Here I will explore the questions in detail.
college= read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
## Name State ID Main
## 1 Alabama A & M University AL 100654 1
## 2 University of Alabama at Birmingham AL 100663 1
## 3 Amridge University AL 100690 1
## 4 University of Alabama in Huntsville AL 100706 1
## 5 Alabama State University AL 100724 1
## 6 The University of Alabama AL 100751 1
## Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
## MainDegree HighDegree Control Region Locale Latitude Longitude AdmitRate
## 1 3 4 Public Southeast City 34.78337 -86.56850 0.9027
## 2 3 4 Public Southeast City 33.50570 -86.79935 0.9181
## 3 3 4 Private Southeast City 32.36261 -86.17401 NA
## 4 3 4 Public Southeast City 34.72456 -86.64045 0.8123
## 5 3 4 Public Southeast City 32.36432 -86.29568 0.9787
## 6 3 4 Public Southeast City 33.21187 -87.54598 0.5330
## MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1 18 929 0 4824 2.5 90.7 0.9 0.2 5.6 6.6
## 2 25 1195 0 12866 57.8 25.9 3.3 5.9 7.1 25.2
## 3 NA NA 1 322 7.1 14.3 0.6 0.3 77.6 54.4
## 4 28 1322 0 6917 74.2 10.7 4.6 4.0 6.5 15.0
## 5 18 935 0 4189 1.5 93.8 1.0 0.3 3.5 7.7
## 6 28 1278 0 32387 78.5 10.1 4.7 1.2 5.6 7.9
## NetPrice Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1 15184 22886 9857 18236 9227 7298 6983
## 2 17535 24129 8328 19032 11612 17235 10640
## 3 9649 15080 6900 6900 14738 5265 3866
## 4 19986 22108 10280 21480 8727 9748 9391
## 5 12874 19413 11068 19396 9003 7983 7399
## 6 21973 28836 10780 28100 13574 10894 10016
## FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1 71.3 71.0 23.96 1068 56.4 36.6 23.6
## 2 89.9 35.3 52.92 3755 63.9 34.1 34.5
## 3 100.0 74.2 18.18 109 64.9 51.3 15.0
## 4 64.6 27.7 48.62 1347 47.6 31.0 44.8
## 5 54.2 73.8 27.69 1294 61.3 34.3 22.1
## 6 74.0 18.0 67.87 6430 61.5 22.6 66.7
I selected the following 10 questions from a combination of my own questions and the ones generated by ChatGPT.
mean(college$TuitionIn, na.rm= TRUE)
## [1] 21948.55
The mean in-state tuition is $21,948.55.
var(college$Cost, na.rm= TRUE)
## [1] 233433900
The sample variance of the total cost is 233,433,900 (in squared dollars).
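Because the variance is expressed in squared dollars, it can be easier to interpret as a standard deviation. Taking the square root of the reported variance (233,433,900) gives roughly $15,279. The following is a minimal sketch in base R; the `cost` vector holds only the six Cost values printed by head(college) above and is a toy subset for illustration, not the full dataset.

```r
# Toy subset: Cost for the six institutions shown by head(college).
cost <- c(22886, 24129, 15080, 22108, 19413, 28836)

cost_var <- var(cost)        # sample variance, in squared dollars
cost_sd  <- sqrt(cost_var)   # standard deviation, in dollars

# sd() computes the same quantity directly.
stopifnot(isTRUE(all.equal(cost_sd, sd(cost))))

# If the full data frame has been loaded as `college`, the same idea applies.
if (exists("college") && is.data.frame(college)) {
  print(sd(college$Cost, na.rm = TRUE))
}
```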
quantile(college$Debt, 0.95, na.rm= TRUE)
## 95%
## 10243
The 95th percentile of student debt (Debt) across the dataset is 10,243; that is, 95% of institutions have a Debt value at or below 10,243.
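The original question asked for quantiles in the plural, so it may help to request several at once. This is a minimal sketch in base R, assuming the default quantile method (type 7); the `debt` vector contains only the six Debt values printed by head(college) and is a toy subset, so its quantiles do not describe the full dataset.

```r
# Toy subset: Debt for the six institutions shown by head(college).
debt <- c(1068, 3755, 109, 1347, 1294, 6430)

# Request several quantiles in one call via the probs argument.
probs  <- c(0.05, 0.25, 0.50, 0.75, 0.95)
debt_q <- quantile(debt, probs = probs, na.rm = TRUE)
print(debt_q)

# On the full data frame, if it has been loaded as `college`:
if (exists("college") && is.data.frame(college)) {
  print(quantile(college$Debt, probs = probs, na.rm = TRUE))
}
```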
boxplot(CompRate ~ Control, data = college,
main = "Completion Rate by School Control",
xlab = "School Control",
ylab = "Completion Rate (%)",
col = c("lightblue", "lightgreen", "lightpink"))
Private schools show the highest median completion rate at approximately 55%, followed by public schools at around 48%, while for-profit schools have the lowest median completion rate at roughly 25%.
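The medians above are read off the boxplot by eye; they could also be computed directly. The sketch below is one possible way to do that in base R. The helper name `median_by_group` is hypothetical, and the toy vectors hold only the six rows printed by head(college), which include no for-profit school and are not representative of the full dataset.

```r
# Hypothetical helper: median of `values` within each level of `groups`.
median_by_group <- function(values, groups) {
  tapply(values, groups, median, na.rm = TRUE)
}

# Toy subset: CompRate and Control for the six rows shown by head(college).
comp_rate <- c(23.96, 52.92, 18.18, 48.62, 27.69, 67.87)
control   <- c("Public", "Public", "Private", "Public", "Public", "Public")

toy_medians <- median_by_group(comp_rate, control)
print(toy_medians)

# On the full data frame, if it has been loaded as `college`:
if (exists("college") && is.data.frame(college)) {
  print(median_by_group(college$CompRate, college$Control))
}
```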
stem(college$CompRate)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 0 | 000000000000000000000003345566677778889999
## 1 | 00011111111111222333333444444444444455555555666666666677777777888888+1
## 2 | 00000000000001111111111111222222222222222333333333334444444444444445+78
## 3 | 00000000000000000111111111111111111222222222222222222222222222333333+131
## 4 | 00000000000000000000000000000001111111111111111111111111111111111222+241
## 5 | 00000000000000000000000000000000000000000000001111111111111111111112+276
## 6 | 00000000000000000000000000000000001111111111111111111111112222222222+218
## 7 | 00000000000000000000111111111111111111111111222222222222222222222233+99
## 8 | 00000000000000000111111111111111222222222222333333333333333333344444+40
## 9 | 0000000011111111111112222222222233333334444444445555555555666677
## 10 | 000000000000000
The stem-and-leaf plot shows that most completion rates fall between 20% and 80% (each stem represents tens of percentage points), with the largest group in the 50s and fewer institutions at very low or very high rates. Overall, the distribution is single-peaked and roughly symmetric around that peak.
hist(college$FirstGen,
main= "Histogram of first-generation students",
col= "orange",
xlab="FirstGen"
)
The distribution of first-generation students across institutions is approximately normal, with the majority of institutions having between 20% and 50% first-generation students, and a peak frequency around 35-40%.
cor(college$MedIncome, college$NetPrice, use= "complete.obs")
## [1] 0.5151298
The correlation between median family income and net price is about 0.515, a moderate positive linear association.
median(college$AdmitRate, na.rm=TRUE)
## [1] 0.69505
The median admission rate across all four-year colleges is 0.69505, or about 69.5%.
boxplot(college$Female,
main="The distribution of the percentage of female students",
ylab="Female students"
)
The percentage of female students across institutions has a median of approximately 60%, with the middle 50% of institutions ranging from about 55% to 65%.
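The median and quartiles quoted above are estimated from the plot. The sketch below shows how the five numbers behind a boxplot could be obtained in base R with `fivenum()`, which returns the minimum, lower hinge, median, upper hinge, and maximum. The `female` vector holds only the six Female values printed by head(college), so it is a toy subset rather than the full dataset.

```r
# Toy subset: Female (%) for the six institutions shown by head(college).
female <- c(56.4, 63.9, 64.9, 47.6, 61.3, 61.5)

# Minimum, lower hinge, median, upper hinge, maximum.
female_stats <- fivenum(female, na.rm = TRUE)
names(female_stats) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
print(female_stats)

# On the full data frame, if it has been loaded as `college`:
if (exists("college") && is.data.frame(college)) {
  print(fivenum(college$Female, na.rm = TRUE))
}
```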
plot(college$Cost, college$CompRate,
main= "Completion Rate vs Total Cost",
xlab="Total Cost",
ylab="Completion Rate",
col="black")
The scatter plot reveals a weak positive relationship between total cost and completion rate, with institutions ranging from approximately $10,000 to $75,000 in total cost and completion rates spanning from near 0% to 100%.
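A correlation coefficient would put a number on the visual impression from the scatter plot. This is a minimal sketch in base R; the toy vectors hold only the six rows printed by head(college), so the resulting value says nothing about the strength of the relationship in the full dataset.

```r
# Toy subset: Cost and CompRate for the six rows shown by head(college).
cost      <- c(22886, 24129, 15080, 22108, 19413, 28836)
comp_rate <- c(23.96, 52.92, 18.18, 48.62, 27.69, 67.87)

# Pearson correlation; "complete.obs" drops pairs with a missing value.
r_toy <- cor(cost, comp_rate, use = "complete.obs")
print(r_toy)

# On the full data frame, if it has been loaded as `college`:
if (exists("college") && is.data.frame(college)) {
  print(cor(college$Cost, college$CompRate, use = "complete.obs"))
}
```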
This analysis of the CollegeScores4yr dataset revealed several important patterns in U.S. four-year colleges:
Cost and Affordability: The average in-state tuition is approximately $21,949, with substantial variation in total cost across institutions (variance of 233,433,900 in squared dollars). The 95th percentile of student debt is $10,243, meaning that 95% of institutions report a Debt value at or below that level.
Completion Rates: Completion rates vary substantially by institutional type, with private schools achieving the highest median rate (~55%), followed by public schools (~48%), and for-profit institutions showing considerably lower rates (~25%). The relationship between cost and completion rate shows only a weak positive association, suggesting that higher costs do not guarantee better completion outcomes.
Student Demographics: First-generation students comprise between 20% and 50% of enrollment at most institutions, with a peak around 35-40% and an approximately normal distribution. Female students represent about 60% of enrollment at a typical institution, with the middle 50% of institutions falling between about 55% and 65%.
Key Findings: The moderate positive correlation between median family income and net price (r = 0.515) suggests that institutions whose students come from higher-income families tend to charge higher net prices. However, the weak relationship between cost and completion rates indicates that factors beyond institutional cost (such as student support services, academic preparation, and institutional resources) likely play more important roles in student success.
These findings highlight the complexity of higher education outcomes and suggest that prospective students should consider multiple factors beyond cost when evaluating institutions.