This project explores data on U.S. four-year colleges and universities from the CollegeScores4yr dataset. These data come from the U.S. Department of Education’s College Scorecard and include information about admission rates, test scores, costs, financial aid, graduation rates, faculty salaries, and student demographics for bachelor’s-granting institutions.
My goal is to use descriptive statistics from Chapter 6 to answer ten simple questions about these colleges. Each question involves at most one or two variables and focuses on topics such as total cost, net price, completion rate, types of institutions, and student characteristics. For each question, I use appropriate methods such as the mean, median, variance, standard deviation, correlation, and graphical tools like histograms, boxplots, barplots, and a pie chart.
Together, these summaries give an overview of how affordable different colleges are, how successful their students are, and how these characteristics vary by institution type and region.
Each question below is answered with the R code used, its printed output (lines beginning with ##), and a short interpretation.
# Numerical summaries for Cost
summary(college$Cost)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    5950   21357   30699   34277   44876   72717     162
sd(college$Cost, na.rm = TRUE)
## [1] 15278.54
# Graphical summaries for Cost
hist(college$Cost,
     main = "Distribution of Total Cost of Attendance",
     xlab = "Total annual cost (tuition, room, board, etc.)")
boxplot(college$Cost,
        main = "Boxplot of Total Cost of Attendance",
        ylab = "Total annual cost")
Answer (Q1):
The histogram and boxplot show that total cost of attendance spans a
wide range, from $5,950 to $72,717, with a long right tail. The middle
half of colleges cost between about $21,400 and $44,900 per year, while
a smaller number of schools are much more expensive. Even the most
expensive school lies below the upper 1.5 × IQR cutoff that boxplot()
uses by default (about $80,150), so the tail shows up as a long upper
whisker rather than as individual outlier points. The distribution of
cost is right-skewed: most colleges are moderately priced, but the
costliest institutions pull the mean ($34,277) above the median
($30,699).
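One way to put a number on the skew described above is the quartile (Bowley) skewness, which compares how far the upper and lower quartiles sit from the median. This is only a sketch added for illustration, not one of the Chapter 6 methods used in the project; the three quartile values are copied from the summary() output for Cost.

```r
# Quartiles of Cost, copied from the summary() output above
q1  <- 21357
med <- 30699
q3  <- 44876

# Bowley (quartile) skewness:
#   positive -> the upper half of the box is wider (right skew)
#   negative -> the lower half of the box is wider (left skew)
bowley_skew <- (q3 + q1 - 2 * med) / (q3 - q1)
round(bowley_skew, 3)
```

A positive value is consistent with the mean exceeding the median, since both point to a longer upper tail.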
# Numerical summaries for NetPrice
summary(college$NetPrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     923   14494   19338   19887   24443   55775     162
var(college$NetPrice, na.rm = TRUE)
## [1] 61686826
sd(college$NetPrice, na.rm = TRUE)
## [1] 7854.096
# Graphical summaries for NetPrice
hist(college$NetPrice,
     main = "Distribution of Net Price",
     xlab = "Net price (after average aid)")
boxplot(college$NetPrice,
        main = "Boxplot of Net Price",
        ylab = "Net price (after average aid)")
Answer (Q2):
The typical net price is about $19,000 to $20,000: the median is
$19,338 and the mean is $19,887. The standard deviation of about $7,850
(a variance of about 61.7 million squared dollars) indicates that net
price differs by many thousands of dollars across schools. The
histogram shows more colleges at lower-to-middle net prices and fewer
at very high net prices, and the most expensive schools (up to $55,775)
appear as high outliers in the boxplot. Overall, the typical net price
is moderate, but there is substantial variation in what students
actually pay.
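The high outliers mentioned above can be checked against the 1.5 × IQR rule that boxplot() applies by default. The sketch below uses the quartiles printed by summary() as a stand-in for the hinges that boxplot() actually computes; that substitution is an assumption, although the two are normally very close for a dataset of this size.

```r
# Values copied from the summary() output for NetPrice
q1        <- 14494
q3        <- 24443
min_price <- 923
max_price <- 55775

# 1.5 * IQR fences (quartiles used as an approximation to boxplot hinges)
iqr         <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower = lower_fence, upper = upper_fence)

# TRUE means at least one point is drawn beyond that whisker
has_high_outliers <- max_price > upper_fence
has_low_outliers  <- min_price < lower_fence
c(high = has_high_outliers, low = has_low_outliers)
```

Under these assumptions only the upper fence is exceeded, so the outlier points sit at the expensive end of the distribution.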
# Numerical summaries for completion rate
summary(college$CompRate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    0.00   38.18   52.45   52.14   66.67  100.00     167
var(college$CompRate, na.rm = TRUE)
## [1] 446.1695
sd(college$CompRate, na.rm = TRUE)
## [1] 21.12272
# Graphical summaries for completion rate
hist(college$CompRate,
     main = "Distribution of Completion Rates",
     xlab = "Completion rate (%)")
boxplot(college$CompRate,
        main = "Boxplot of Completion Rates",
        ylab = "Completion rate (%)")
Answer (Q3):
Completion rates vary widely from college to college, covering the
full range from 0% to 100%. The mean (52.1%) and median (52.5%) are
nearly equal, which suggests a roughly symmetric distribution centered
just above 50%. The middle half of schools falls between 38.2% and
66.7%, an interquartile range of about 28.5 percentage points, and the
standard deviation is about 21 percentage points. Completion rate is
therefore not tightly clustered around a single value; there is
considerable spread in student success across institutions.
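The spread measures quoted for completion rate are linked by simple arithmetic: the standard deviation is the square root of the variance, and the interquartile range is the third quartile minus the first. A minimal check, using only numbers copied from the output above:

```r
# Values copied from the var() and summary() output for CompRate
var_comp <- 446.1695
q1_comp  <- 38.18
q3_comp  <- 66.67

# Standard deviation is the square root of the variance
sd_comp <- sqrt(var_comp)

# Interquartile range: width of the middle half of the data
iqr_comp <- q3_comp - q1_comp

round(c(sd = sd_comp, iqr = iqr_comp), 2)
```

The standard deviation is in percentage points, like the data, which makes it easier to interpret than the variance.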
# Correlation between admission rate and completion rate
cor(college$AdmitRate, college$CompRate, use = "complete.obs")
## [1] -0.3482341
Answer (Q4):
The correlation between admission rate and completion rate is about
-0.35. The negative sign means that colleges with higher admission
rates (less selective) tend to have lower completion rates, while more
selective colleges (lower admission rates) tend to have higher
completion rates. A positive value would have indicated the opposite,
and a value close to 0 would have indicated very little linear
relationship.
In size, -0.35 indicates a moderate negative linear relationship: as admission rate increases, completion rate tends to decrease, although admission rate alone is far from a perfect predictor of completion rate.
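Because both variables contain missing values, the cor() call relies on use = "complete.obs", which keeps only the rows where both values are present. The sketch below shows that behavior on a small made-up example; the vectors admit and comp are hypothetical and are not taken from the dataset.

```r
# Hypothetical toy data (not from CollegeScores4yr); two rows have an NA
admit <- c(90, 75, NA, 40, 20)
comp  <- c(45, NA, 60, 70, 85)

# Rows where both values are present
ok <- complete.cases(admit, comp)
sum(ok)

# cor() with complete.obs drops the incomplete rows for us ...
r_complete <- cor(admit, comp, use = "complete.obs")

# ... which matches dropping them by hand
r_manual <- cor(admit[ok], comp[ok])
all.equal(r_complete, r_manual)
```

Without the use argument, cor() would return NA here because of the missing values.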
# Filter to public vs private only
college_pub_priv <- subset(college, Control %in% c("Public", "Private"))
# Numerical summaries by Control
tapply(college_pub_priv$TuitionIn,
       college_pub_priv$Control,
       mean, na.rm = TRUE)
##   Private    Public
## 28703.695  9199.717
tapply(college_pub_priv$TuitionIn,
       college_pub_priv$Control,
       median, na.rm = TRUE)
## Private  Public
##   29200    8498
# Boxplot comparing in-state tuition for public vs private colleges
boxplot(TuitionIn ~ Control,
        data = college_pub_priv,
        main = "In-State Tuition by Institution Control",
        xlab = "Control (Public vs Private)",
        ylab = "In-state tuition (dollars)")
Answer (Q5):
The mean and median tuition values by control, together with the
side-by-side boxplot, show clear differences between public and private
colleges (for-profit colleges are excluded from this comparison).
Private institutions charge far more: their mean in-state tuition is
about $28,700 (median $29,200), compared with about $9,200 (median
$8,498) for public institutions, roughly three times as much. There is
some overlap between the groups, but the type of institution (public vs
private) is strongly related to how much students pay in tuition.
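The size of the public/private gap can be expressed as a ratio and as a dollar difference. This sketch re-enters the group means and medians from the tapply() output above so that it runs on its own; the variable names are illustrative.

```r
# Group summaries copied from the tapply() output above
mean_tuition   <- c(Private = 28703.695, Public = 9199.717)
median_tuition <- c(Private = 29200,     Public = 8498)

# How many times higher is private in-state tuition?
ratio_mean   <- mean_tuition[["Private"]]   / mean_tuition[["Public"]]
ratio_median <- median_tuition[["Private"]] / median_tuition[["Public"]]
round(c(mean = ratio_mean, median = ratio_median), 2)

# Dollar gap between the two group medians
median_gap <- median_tuition[["Private"]] - median_tuition[["Public"]]
median_gap
```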
# Mean net price by region
mean_netprice_region <- tapply(college$NetPrice,
                               college$Region,
                               mean,
                               na.rm = TRUE)
mean_netprice_region
##   Midwest Northeast Southeast Territory      West
## 19624.613 21912.643 18953.509  7562.362 20061.101
# Barplot of mean net price by region
barplot(mean_netprice_region,
        main = "Mean Net Price by Region",
        xlab = "Region",
        ylab = "Mean net price (dollars)",
        las = 2)
Answer (Q6):
The barplot of mean net price by region shows that average net price
is not the same in every region. The Northeast has the highest mean
net price (about $21,900), followed by the West (about $20,100), the
Midwest (about $19,600), and the Southeast (about $19,000). The
Territory group is far lower, at about $7,600. Location is therefore
associated with typical affordability after aid, although the four
non-territory regions differ from one another by only about $3,000.
Even so, there is still variation within each region that is not
captured by the regional averages alone.
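Sorting the regional means makes the ranking easier to read than the alphabetical order that tapply() returns. The sketch below rebuilds the named vector from the printed output so that it is self-contained; in the actual analysis the existing mean_netprice_region object could be passed to sort() directly.

```r
# Regional means copied from the tapply() output above
mean_netprice_region <- c(Midwest   = 19624.613,
                          Northeast = 21912.643,
                          Southeast = 18953.509,
                          Territory = 7562.362,
                          West      = 20061.101)

# Order regions from most to least expensive
sorted_means <- sort(mean_netprice_region, decreasing = TRUE)
round(sorted_means)

# Gap between the highest and lowest means, leaving out the Territory group
regions_only <- mean_netprice_region[names(mean_netprice_region) != "Territory"]
gap <- diff(range(regions_only))
round(gap)
```

Passing sorted_means to barplot() instead of the unsorted vector would draw the bars in the same descending order.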
# Numerical summaries of enrollment by locale (median)
median_enroll_locale <- tapply(college$Enrollment,
                               college$Locale,
                               median,
                               na.rm = TRUE)
median_enroll_locale
##   City  Rural Suburb   Town
## 1904.0  825.0 1669.5 1756.0
# Boxplot of enrollment by locale
boxplot(Enrollment ~ Locale,
        data = college,
        main = "Enrollment by Locale",
        xlab = "Locale",
        ylab = "Enrollment",
        las = 2)
Answer (Q7):
The boxplot compares the distribution of enrollments across the four
locales (City, Suburb, Town, and Rural). The medians and spreads differ
by locale: colleges in cities have the largest median enrollment (1,904
students), followed by towns (1,756) and suburbs (1,669.5), while rural
colleges are much smaller (825). Overall, enrollment size is associated
with locale, with the typical rural college enrolling fewer than half
as many students as the typical city college.
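The locales with the largest and smallest typical enrollment can be picked out programmatically rather than by reading the table. A small sketch, with the medians re-entered from the output above:

```r
# Median enrollment by locale, copied from the tapply() output above
median_enroll_locale <- c(City = 1904, Rural = 825, Suburb = 1669.5, Town = 1756)

# Names of the locales with the highest and lowest median enrollment
largest  <- names(which.max(median_enroll_locale))
smallest <- names(which.min(median_enroll_locale))
c(largest = largest, smallest = smallest)

# Smallest median as a fraction of the largest
size_ratio <- median_enroll_locale[[smallest]] / median_enroll_locale[[largest]]
round(size_ratio, 2)
```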
# Frequency and proportion of institution types
control_counts <- table(college$Control)
control_counts
##
## Private  Profit  Public
##    1243     170     599
prop.table(control_counts)
##
##    Private     Profit     Public
## 0.61779324 0.08449304 0.29771372
# Pie chart of institution types
pie(control_counts,
    main = "Proportion of Colleges by Control Type",
    labels = paste0(names(control_counts), " (",
                    round(100 * prop.table(control_counts), 1), "%)"))
Answer (Q8):
The table() output and pie chart show how many colleges in the dataset
are Private, Public, or For-profit (labeled Profit in the data).
Private colleges make up the largest share, with 1,243 of the 2,012
institutions (61.8%), and form the biggest slice of the pie. Public
colleges account for 599 (29.8%), and for-profit colleges are clearly
least common, with 170 (8.4%). This graph summarizes the overall mix
of institution types among the four-year colleges in the dataset.
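The percentages used in the pie chart labels are the counts divided by their total. The sketch below reproduces them from the printed counts, using a plain named vector in place of the table object so that it runs without the dataset.

```r
# Counts copied from the table() output above
control_counts <- c(Private = 1243, Profit = 170, Public = 599)

# Total number of colleges with a recorded control type
total <- sum(control_counts)
total

# Share of each type, as a percentage rounded to one decimal place
pct <- round(100 * control_counts / total, 1)
pct
```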
# Numerical summaries for Pell grant percentage
summary(college$Pell)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    0.00   25.60   35.70   37.85   47.45   97.60       5
sd(college$Pell, na.rm = TRUE)
## [1] 17.88267
# Graphical summaries for Pell grant percentage
hist(college$Pell,
     main = "Distribution of Pell Grant Percentage",
     xlab = "Percent of students receiving Pell grants")
boxplot(college$Pell,
        main = "Boxplot of Pell Grant Percentage",
        ylab = "Percent of students receiving Pell grants")
Answer (Q9):
The histogram and boxplot show the distribution of the percentage of
students who receive Pell grants at each college. The middle half of
schools falls between 25.6% and 47.5%, with a median of 35.7% and a
slightly higher mean of 37.85%, while individual institutions range
from 0% to 97.6%. The standard deviation of about 17.9 percentage
points indicates that Pell participation varies substantially from
campus to campus, reflecting differences in the financial need of the
student populations that different colleges serve.
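Standard deviations measured in dollars and in percentage points cannot be compared directly. One rough way to put them on a common scale is the coefficient of variation (standard deviation divided by the mean). This is an added illustration rather than a method from the project, and it should be read with caution for percentages bounded between 0 and 100; the means and standard deviations are copied from the output for Q1, Q2, Q3, and Q9.

```r
# Means and standard deviations copied from the output above
spread <- data.frame(
  variable = c("Cost", "NetPrice", "CompRate", "Pell"),
  mean     = c(34277, 19887, 52.14, 37.85),
  sd       = c(15278.54, 7854.096, 21.12272, 17.88267)
)

# Coefficient of variation: spread relative to the typical value
spread$cv <- round(spread$sd / spread$mean, 2)

# Largest relative spread first
spread[order(-spread$cv), ]
```

All four ratios fall between roughly 0.4 and 0.5, so each variable has a standard deviation of a little under half its mean.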
# Correlation between median family income and student debt
cor(college$MedIncome, college$Debt, use = "complete.obs")
## [1] -0.1207221
Answer (Q10):
The correlation between median family income and average student debt
tells us whether students from higher-income backgrounds tend to borrow
more or less. A positive correlation would suggest that students at
colleges with higher median family incomes also graduate with higher
average debt, while a negative correlation would indicate that students
from wealthier families tend to borrow less.
The correlation printed by cor() is about -0.12, which is negative but
close to 0, so the linear relationship is weak. Colleges with higher
median family incomes tend to have slightly lower average debt, but
family income alone tells us little about borrowing; other factors
(such as tuition, aid policies, and student choices) play an important
role in how much students borrow.
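Squaring a correlation gives the share of variation in one variable that a straight-line relationship with the other accounts for, which helps compare the two correlations in this project. A minimal sketch using the two values printed by cor() above (the variable names are illustrative):

```r
# Correlations copied from the cor() output for Q4 and Q10
r_admit_comp  <- -0.3482341
r_income_debt <- -0.1207221

# Squared correlation: proportion of variance accounted for by a linear fit
r_squared <- c(admit_comp  = r_admit_comp^2,
               income_debt = r_income_debt^2)

# Expressed as percentages
round(100 * r_squared, 1)
```

On this scale, the admission rate and completion rate relationship is several times stronger than the income and debt relationship, although both leave most of the variation unexplained.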
In this project, I used descriptive statistics and simple graphs to study four-year colleges in the CollegeScores4yr dataset. For costs, the histograms and boxplots showed that total cost is right-skewed and net price is mildly so: most colleges are moderately priced, but a smaller group of institutions is much more expensive. The spread of these cost variables is large (standard deviations of about $15,300 for total cost and $7,850 for net price), indicating big differences in affordability across schools.
Student outcomes also vary. Completion rates range from 0% to 100%, with a standard deviation of about 21 percentage points, and the correlation analysis suggests a moderate negative relationship (r ≈ -0.35) between admission rate and completion rate: more selective colleges tend to have higher completion rates. The analysis of in-state tuition by control type showed that private colleges charge roughly three times as much as public institutions on average, and the barplot of net price by region showed the highest mean net price in the Northeast and by far the lowest in the Territory group.
Other summaries highlighted structural differences among institutions. The boxplot of enrollment by locale showed that typical school size differs by location, with city colleges largest and rural colleges smallest at the median. The pie chart of institution control showed that private colleges are the most common in the dataset (about 62%), followed by public (about 30%) and for-profit (about 8%) colleges. Finally, the distribution of Pell grant percentages showed that student financial need varies widely across schools, while the correlation between median family income and student debt was weak (r ≈ -0.12), so family income alone says little about how much students borrow.
Overall, the Chapter 6 tools (numerical summaries such as the mean, median, variance, and standard deviation; correlations; and basic plots such as histograms, boxplots, barplots, and a pie chart) provided a clear picture of how costs, outcomes, and student characteristics differ among U.S. four-year colleges.