In this project, I used the dataset CollegeScores4yr.csv from the Lock5Stat website (Lock et al., 2023). The data included information on four-year colleges in the United States, such as admission rates, test scores, tuition, net price, student debt, faculty salary, completion rates, and student demographics.
I first drafted my own questions and then asked ChatGPT to suggest additional simple questions based on Chapter 6 methods. From those, I chose the final 10 questions to analyze. Each question can be addressed with descriptive statistics tools from Chapter 6, such as the mean, median, standard deviation, histograms, box plots, and scatterplots.
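I propose the following 10 questions:
1. Mean and median admission rate for all schools
2. Standard deviation of average student debt
3. Average faculty salary for all schools
4. Median net price by control type (public, private, and for-profit)
5. Histogram of admission rates
6. Histogram of average SAT scores
7. Box plot of net price by region
8. Box plot of completion rate by control type
9. Scatter plot and correlation of admission rate vs. completion rate
10. Scatter plot and correlation of student debt vs. median family income
I explore each question in detail below.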
The mean admission rate across all colleges in the dataset is 0.682, and the median is 0.715. This means that, on average, schools admit about 68% of their applicants, and half of all colleges admit more than 71.5%. Because the median is slightly higher than the mean, the distribution has a mild left skew, suggesting that while many schools are fairly open-admission, a smaller group of highly selective colleges pulls the mean downward.
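As a small illustration of the mean-versus-median reasoning above, the sketch below uses a made-up vector of admission rates (`admit_toy` is hypothetical and not taken from CollegeScores4yr). A few very low values pull the mean below the median, which is the pattern associated with a left skew.

```r
# Toy admission rates (hypothetical values, NOT from CollegeScores4yr)
admit_toy <- c(0.08, 0.15, 0.55, 0.65, 0.70, 0.72, 0.78, 0.85, 0.90, NA)

# na.rm = TRUE drops the missing value before computing each summary
toy_mean   <- mean(admit_toy, na.rm = TRUE)
toy_median <- median(admit_toy, na.rm = TRUE)

# A mean below the median is consistent with a tail toward low values
c(mean = toy_mean, median = toy_median)
toy_mean < toy_median
```

Without `na.rm = TRUE`, `mean(admit_toy)` would return `NA`, which is why the project code passes that argument throughout.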
The standard deviation of average student debt is $4,728, showing that debt levels vary moderately across different institutions. A standard deviation of this size means that while many colleges have debt levels near the overall average, others deviate by several thousand dollars, indicating meaningful differences in the financial burden students face depending on which school they attend.
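To make the meaning of the standard deviation concrete, here is a minimal sketch that computes the sample standard deviation by hand and compares it with R's `sd()`. The debt values in `debt_toy` are invented for illustration and are not from the dataset.

```r
# Toy average-debt values in dollars (hypothetical, NOT from CollegeScores4yr)
debt_toy <- c(18000, 21000, 23500, 25000, 27500, 31000)

n <- length(debt_toy)

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by (n - 1)
sd_by_hand <- sqrt(sum((debt_toy - mean(debt_toy))^2) / (n - 1))

# sd() uses the same (n - 1) denominator, so the two should agree
all.equal(sd_by_hand, sd(debt_toy))
```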
The average monthly faculty salary for all schools is $6,878, indicating what full-time instructors typically earn across institutions. This value provides insight into how much colleges invest in their faculty. Although salaries differ by region and institution type, this mean reflects the overall level of compensation provided at four-year colleges in the United States.
When comparing median net prices across school types, public colleges have a median net price of $14,476, private colleges have a median of $24,443, and for-profit institutions have a median of $21,604. These results show that private schools tend to be the most expensive for students, while public schools remain the most affordable option. For-profit colleges fall in between but remain noticeably more costly than public institutions.
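The group comparison above can be sketched on a tiny made-up data frame. The column names mirror the dataset, but the rows and the group labels in `toy_price` are hypothetical placeholders chosen for illustration.

```r
# Toy data (hypothetical rows, NOT from CollegeScores4yr)
toy_price <- data.frame(
  Control  = c("Public", "Public", "Public", "Private", "Private", "Profit", "Profit"),
  NetPrice = c(12000, 15000, NA, 22000, 26000, 20000, 23000)
)

# The formula interface of aggregate() omits rows with missing values
# by default, so the NA row is dropped before the medians are computed
med_by_control <- aggregate(NetPrice ~ Control, data = toy_price, FUN = median)

# Sort from most to least affordable for easier reading
med_by_control[order(med_by_control$NetPrice), ]
```

Sorting the result by `NetPrice` makes the ordering of the groups easy to read off directly.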
The histogram of admission rates shows that most colleges admit a large portion of their applicants, typically between 60% and 90%. Only a small number of schools fall in the highly selective range, near 0–20%. The overall distribution is left-skewed, with most schools having high admission rates and only a few being very selective, which pulls the tail toward the lower end of the scale.
The histogram of average SAT scores reveals that most schools cluster between 950 and 1200, indicating a moderate and fairly typical level of student academic preparation. The distribution is slightly right-skewed, with a small number of highly competitive colleges reporting much higher average SAT scores. This pattern suggests that while some institutions attract very high-achieving students, the majority fall within a more common mid-range.
The box plot of net price by region shows clear differences in affordability across the United States. New England has the highest median net price, reflecting the many private institutions concentrated in that area. Regions such as the Rocky Mountains and Plains have the lowest typical net prices, suggesting that students in those regions often face more affordable college costs. These regional differences highlight how location influences a student’s expected financial burden.
The box plot of completion rates by control type shows distinct patterns across school categories. Private colleges generally have higher completion rates with moderate variation, while public institutions show a wider spread, indicating more diversity in student outcomes. For-profit schools display both the lowest medians and the greatest variation, suggesting inconsistent performance and lower student success rates overall in that sector.
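A box plot shows the median as the center line and the interquartile range (IQR) as the box, so the visual comparison above can also be checked numerically. The sketch below does this with `tapply()` on an invented data frame; `toy_comp` and its values are hypothetical and not taken from the dataset.

```r
# Toy completion rates in percent (hypothetical, NOT from CollegeScores4yr)
toy_comp <- data.frame(
  Control  = c(rep("Private", 4), rep("Public", 4), rep("Profit", 4)),
  CompRate = c(55, 60, 65, 70,   35, 50, 60, 75,   10, 25, 45, 70)
)

# Median (center line of each box) and IQR (height of each box) by group
medians <- tapply(toy_comp$CompRate, toy_comp$Control, median, na.rm = TRUE)
iqrs    <- tapply(toy_comp$CompRate, toy_comp$Control, IQR, na.rm = TRUE)

rbind(median = medians, IQR = iqrs)
```

A larger IQR corresponds to a taller box, which is one way to put a number on "greater variation" when comparing groups.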
The scatter plot comparing admission rate and completion rate reveals a moderately strong negative relationship, supported by a correlation of –0.52. This means that more selective schools (those admitting a smaller percentage of their applicants) tend to have higher graduation rates. Conversely, schools with higher admission rates typically see lower completion rates. This pattern suggests that selectivity may reflect differences in academic preparation, institutional resources, or student support.
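Because some schools are missing one or both variables, the correlation has to be computed on complete pairs only. The sketch below illustrates what `use = "complete.obs"` does, using invented vectors (`admit` and `comp` are hypothetical and not taken from the dataset).

```r
# Toy paired values with some missing entries (hypothetical data)
admit <- c(0.10, 0.25, 0.40, 0.60, 0.75, 0.90, NA)
comp  <- c(92,   85,   NA,   60,   55,   48,   50)

# With the default settings, any missing value makes the result NA
r_default <- cor(admit, comp)

# use = "complete.obs" keeps only the pairs where both values are present
r_complete <- cor(admit, comp, use = "complete.obs")

# Number of pairs that actually enter the calculation
n_used <- sum(complete.cases(admit, comp))

c(default = r_default, complete = r_complete, pairs_used = n_used)
```

Reporting the number of complete pairs alongside the correlation shows how many schools the estimate is actually based on.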
The scatter plot of student debt versus median family income shows a moderate negative correlation of –0.42. Because each observation is a college, this indicates that schools whose students come from higher-income families tend to report lower average student debt, while schools serving lower-income students tend to report higher average debt. This relationship may reflect differences in financial aid eligibility, borrowing needs, and institutional funding structures that influence how much students must rely on loans.
In this project, I used descriptive statistics from Chapter 6 to explore the CollegeScores4yr dataset. I examined measures of center (mean and median), spread (standard deviation), and graphical summaries (histograms, box plots, scatterplots, and a barplot).
The results showed how admission rates, net price, faculty salaries, and student debt vary across different types of schools and regions. Histograms revealed the shapes of key variables such as admission rates and SAT scores. Box plots highlighted differences in net price and completion rates between regions and control types. Scatterplots and correlations suggested possible relationships, such as how completion rates might be related to admission rates, and how student debt might be related to median family income.
Overall, this analysis provided a clearer picture of how costs, selectivity, and outcomes differ among colleges, using only the descriptive tools from Chapter 6.
Dataset citation
Lock, R. H., Lock, P. F., Lock Morgan, K., Lock, E. F., & Lock, D. F. (2023). CollegeScores4yr.csv [Data set]. Lock5Stat. https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv
Below is the complete R code used in this project.
(This chunk displays the code but does not re-run it when knitting.)
# Load the dataset
college <- read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
# Question 1: Mean and median admission rate for all schools
mean(college$AdmitRate, na.rm = TRUE)
median(college$AdmitRate, na.rm = TRUE)
# Question 2: Standard deviation of average student debt
sd(college$Debt, na.rm = TRUE)
# Question 3: Average faculty salary for all schools
mean(college$FacSalary, na.rm = TRUE)
# Question 4: Median net price by control type (public, private, for-profit)
# Store the result once so the barplot below can reuse it
med_price <- aggregate(NetPrice ~ Control, data = college, median, na.rm = TRUE)
med_price
# Optional barplot for median net price by control type
barplot(
  med_price$NetPrice,
  names.arg = med_price$Control,
  main = "Median Net Price by Control Type",
  xlab = "Control Type", ylab = "Median Net Price"
)
# Question 5: Histogram of admission rates
hist(college$AdmitRate,
     main = "Histogram of Admission Rates",
     xlab = "Admission Rate",
     col = "lightblue", border = "white")
# Question 6: Histogram of average SAT scores
hist(college$AvgSAT,
     main = "Histogram of Average SAT Scores",
     xlab = "Average SAT Score",
     col = "lightgreen", border = "white")
# Question 7: Box plot of net price by region
boxplot(NetPrice ~ Region, data = college,
        main = "Net Price by Region",
        xlab = "Region", ylab = "Net Price",
        col = "lightpink")
# Question 8: Box plot of completion rate by control type
boxplot(CompRate ~ Control, data = college,
        main = "Completion Rate by Control Type",
        xlab = "School Type", ylab = "Completion Rate",
        col = "lightyellow")
# Question 9: Scatter plot and correlation of admission rate vs. completion rate
plot(college$AdmitRate, college$CompRate,
     main = "Admission Rate vs. Completion Rate",
     xlab = "Admission Rate", ylab = "Completion Rate",
     pch = 19, col = "steelblue")
cor(college$AdmitRate, college$CompRate, use = "complete.obs")
# Question 10: Scatter plot and correlation of student debt vs. median family income
plot(college$MedIncome, college$Debt,
     main = "Student Debt vs. Median Family Income",
     xlab = "Median Family Income ($1000s)", ylab = "Average Student Debt",
     pch = 19, col = "darkred")
cor(college$MedIncome, college$Debt, use = "complete.obs")