What is the average admission rate (AdmitRate) for U.S. colleges that primarily grant bachelor’s degrees?
What is the median percentage of female students (Female) across all colleges in the dataset?
How much do average tuition and fees for in-state students (TuitionIn) vary across schools?
What is the shape of the distribution of undergraduate enrollment (Enrollment)?
Is there a correlation between average SAT scores (AvgSAT) and college completion rate (CompRate)?
What is the range and standard deviation of faculty salaries (FacSalary)?
Create a boxplot of student loan debt (Debt). Are there any visible outliers?
Which control type (Private, Public, Profit) has the highest average instructional spending per FTE student (InstructFTE)?
What proportion of schools are located in each U.S. region (Region)?
How does the percentage of first-generation students (FirstGen) differ between schools with high (>60%) and low (<40%) Pell Grant rates (Pell)?
Compare the distributions of in-state tuition (TuitionIn) between schools in the Northeast and Midwest regions using side-by-side boxplots.
What is the correlation between median family income (MedIncome) and net price (NetPrice)?
Is there more variability in faculty salaries (FacSalary) at public or private institutions?
Create a histogram of the completion rate (CompRate). Does the distribution appear symmetric, skewed left, or skewed right?
Do colleges in suburban areas (Locale == "Suburb") have a higher average SAT score (AvgSAT) than those in rural areas (Locale == "Rural")?
What is the median debt (Debt) among students at colleges where more than 50% of students receive Pell grants?
Create a barplot showing the average admission rate (AdmitRate) for each region (Region).
Compare the percentage of Hispanic students (Hispanic) and Asian students (Asian) across all schools using boxplots.
Among colleges with average ACT scores (MidACT) above 28, what is the mean completion rate (CompRate)?
Is there a relationship between part-time enrollment (PartTime) and total enrollment (Enrollment)?
What is the average admission rate (AdmitRate) for U.S. colleges that primarily grant bachelor’s degrees?
What is the median percentage of female students (Female) across all colleges in the dataset?
How much do average tuition and fees for in-state students (TuitionIn) vary across schools?
Create a histogram of undergraduate enrollment (Enrollment). What general shape does the distribution have?
Is there a correlation between average SAT scores (AvgSAT) and college completion rate (CompRate)?
Create a boxplot of student loan debt (Debt). Are there any visible outliers?
What proportion of schools are located in each U.S. region (Region)?
Compare the distributions of in-state tuition (TuitionIn) between schools in the Northeast and Midwest regions using side-by-side boxplots.
What is the correlation between median family income (MedIncome) and net price (NetPrice)?
Create a barplot showing the average admission rate (AdmitRate) for each region (Region).
We will explore the questions in detail.
college = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
## Name State ID Main
## 1 Alabama A & M University AL 100654 1
## 2 University of Alabama at Birmingham AL 100663 1
## 3 Amridge University AL 100690 1
## 4 University of Alabama in Huntsville AL 100706 1
## 5 Alabama State University AL 100724 1
## 6 The University of Alabama AL 100751 1
## Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
## MainDegree HighDegree Control Region Locale Latitude Longitude AdmitRate
## 1 3 4 Public Southeast City 34.78337 -86.56850 0.9027
## 2 3 4 Public Southeast City 33.50570 -86.79935 0.9181
## 3 3 4 Private Southeast City 32.36261 -86.17401 NA
## 4 3 4 Public Southeast City 34.72456 -86.64045 0.8123
## 5 3 4 Public Southeast City 32.36432 -86.29568 0.9787
## 6 3 4 Public Southeast City 33.21187 -87.54598 0.5330
## MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1 18 929 0 4824 2.5 90.7 0.9 0.2 5.6 6.6
## 2 25 1195 0 12866 57.8 25.9 3.3 5.9 7.1 25.2
## 3 NA NA 1 322 7.1 14.3 0.6 0.3 77.6 54.4
## 4 28 1322 0 6917 74.2 10.7 4.6 4.0 6.5 15.0
## 5 18 935 0 4189 1.5 93.8 1.0 0.3 3.5 7.7
## 6 28 1278 0 32387 78.5 10.1 4.7 1.2 5.6 7.9
## NetPrice Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1 15184 22886 9857 18236 9227 7298 6983
## 2 17535 24129 8328 19032 11612 17235 10640
## 3 9649 15080 6900 6900 14738 5265 3866
## 4 19986 22108 10280 21480 8727 9748 9391
## 5 12874 19413 11068 19396 9003 7983 7399
## 6 21973 28836 10780 28100 13574 10894 10016
## FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1 71.3 71.0 23.96 1068 56.4 36.6 23.6
## 2 89.9 35.3 52.92 3755 63.9 34.1 34.5
## 3 100.0 74.2 18.18 109 64.9 51.3 15.0
## 4 64.6 27.7 48.62 1347 47.6 31.0 44.8
## 5 54.2 73.8 27.69 1294 61.3 34.3 22.1
## 6 74.0 18.0 67.87 6430 61.5 22.6 66.7
mean(college$AdmitRate, na.rm = TRUE)
## [1] 0.6702025
The average admission rate across the colleges in the dataset is approximately 67%, so a typical school accepts about two-thirds of its applicants. This indicates that 4-year colleges in the dataset are, on average, only moderately selective.
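A mean of about 67% does not by itself show how many schools admit more than two-thirds of their applicants. A minimal sketch of a direct check follows; `admit_demo` is a small made-up vector standing in for `college$AdmitRate`, not values from the dataset.

```r
# Hypothetical admission rates for illustration only; not values from the dataset.
admit_demo <- c(0.95, 0.90, 0.85, 0.60, 0.05, NA)

# Mean admission rate, ignoring missing values.
avg_rate <- mean(admit_demo, na.rm = TRUE)

# Share of schools admitting more than two-thirds of applicants.
# The comparison yields TRUE/FALSE, and the mean of a logical vector
# is the proportion of TRUE values.
share_above <- mean(admit_demo > 2/3, na.rm = TRUE)

avg_rate
share_above
```

With the real data, the same two calls would be applied to `college$AdmitRate`.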
median(college$Female, na.rm = TRUE)
## [1] 59.15
The median percentage of female students is 59.15%, meaning half of the schools enroll a higher proportion of women and half enroll a lower one. This reflects a gender imbalance favoring female enrollment at many institutions.
var(college$TuitionIn, na.rm = TRUE)
## [1] 199665280
sd(college$TuitionIn, na.rm = TRUE)
## [1] 14130.3
The standard deviation of in-state tuition is about $14,130, indicating substantial variation among colleges. Tuition costs vary widely, likely due to differences in state funding and institution type.
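The two outputs above are linked: the standard deviation is the square root of the variance, which is why it is expressed in dollars rather than squared dollars. A quick check using the printed variance:

```r
# Variance of TuitionIn as printed above (units: squared dollars).
tuition_var <- 199665280

# The standard deviation is the square root of the variance (units: dollars).
tuition_sd <- sqrt(tuition_var)

# Rounded to one decimal place for comparison with the sd() output above.
round(tuition_sd, 1)
```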
hist(college$Enrollment,
main = "Distribution of Enrollment",
col = "skyblue",
xlab = "Enrollment",
breaks = 30)
The enrollment distribution is right-skewed, with most colleges
enrolling relatively few students, and a few large universities
enrolling tens of thousands. This suggests that small to mid-sized
colleges dominate the higher education landscape.
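One rough numeric check of right skew is to compare the mean with the median: a few very large values pull the mean above the median. This is a heuristic rather than a guarantee. The sketch below uses made-up enrollment figures (`enroll_demo` is hypothetical, not taken from the dataset).

```r
# Hypothetical enrollments for illustration only: mostly small schools
# plus one very large university.
enroll_demo <- c(300, 800, 1200, 1500, 2500, 4000, 32000)

enroll_mean   <- mean(enroll_demo)
enroll_median <- median(enroll_demo)

# In a right-skewed sample the mean typically exceeds the median.
enroll_mean > enroll_median
```

On the real data, the same comparison would use `mean(college$Enrollment, na.rm = TRUE)` and `median(college$Enrollment, na.rm = TRUE)`.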
cor(college$AvgSAT, college$CompRate, use = "complete.obs")
## [1] 0.8189495
plot(college$AvgSAT, college$CompRate,
xlab = "Average SAT", ylab = "Completion Rate",
main = "SAT vs Completion Rate", col = "blue")
The correlation coefficient is around 0.82, showing a strong positive
relationship between SAT scores and completion rates. Schools with
higher-scoring students tend to have better graduation outcomes,
possibly because those students arrive better prepared academically.
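The `use = "complete.obs"` argument matters here because `AvgSAT` contains missing values (the `head()` output shows an `NA` for Amridge University). A small sketch with made-up values shows the difference; `sat_demo` and `comp_demo` are hypothetical.

```r
# Hypothetical values for illustration only; each vector has one NA.
sat_demo  <- c(900, 1000, NA, 1200, 1300)
comp_demo <- c(25, 40, 30, 60, NA)

# By default cor() returns NA if any value is missing.
r_default <- cor(sat_demo, comp_demo)

# With use = "complete.obs", rows with a missing value in either
# variable are dropped before the correlation is computed.
r_complete <- cor(sat_demo, comp_demo, use = "complete.obs")

# Number of schools that actually contribute to the correlation.
n_used <- sum(complete.cases(sat_demo, comp_demo))
```

Reporting `n_used` alongside the correlation makes clear how many schools the coefficient is based on.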
boxplot(college$Debt,
main = "Boxplot of Student Loan Debt",
col = "tomato", horizontal = TRUE)
The boxplot reveals several outliers, representing colleges where
students take on significantly higher or lower debt. This suggests that
student borrowing varies widely, possibly due to institutional aid
policies or cost of attendance.
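By default, `boxplot()` draws as outliers the points lying more than 1.5 times the interquartile range beyond the box. The flagged values can be listed with `boxplot.stats()`. The vector below is made up for illustration and is not taken from the `Debt` column.

```r
# Hypothetical debt values for illustration only.
debt_demo <- c(100, 1000, 1100, 1300, 1400, 1500, 1600, 6500)

# boxplot.stats() returns the five-number summary ($stats) and the
# points beyond the whiskers ($out), using the default coef = 1.5.
debt_stats <- boxplot.stats(debt_demo)

debt_stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
debt_stats$out    # values that boxplot() would draw as outliers
```

On the real data, `boxplot.stats(college$Debt)$out` would list the outlying values, and `length()` of that result would give their count.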
region_counts = table(college$Region)
region_percent = prop.table(region_counts) * 100
pie(region_percent,
main = "Proportion of Schools by Region",
col = rainbow(length(region_percent)))
The Northeast has the largest share of colleges (~27.4%), followed by
the Midwest, Southeast, and West. This highlights regional clustering in
the U.S. college system, with some areas hosting significantly more
institutions.
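The pie chart shows the shares visually but does not print them. Printing the sorted percentage table is one way to read off figures such as the share of the largest region. The sketch uses a made-up region vector (`region_demo`), not the dataset.

```r
# Hypothetical region labels for illustration only.
region_demo <- c("Midwest", "Northeast", "Northeast", "Southeast",
                 "West", "Northeast", "Midwest", "Southeast")

# Counts -> proportions -> percentages, sorted from largest to smallest.
region_pct_demo <- sort(prop.table(table(region_demo)) * 100,
                        decreasing = TRUE)

round(region_pct_demo, 1)
```

Applied to `college$Region`, the same pipeline would print the ranking described above.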
boxplot(TuitionIn ~ Region, data = subset(college, Region %in% c("Northeast", "Midwest")),
main = "In-State Tuition: Northeast vs Midwest",
col = c("lightblue", "lightgreen"))
Colleges in the Northeast tend to have higher in-state tuition than
those in the Midwest, with greater variability. This indicates regional
cost differences, potentially driven by local economic factors or
institutional missions.
cor(college$MedIncome, college$NetPrice, use = "complete.obs")
## [1] 0.5151298
plot(college$MedIncome, college$NetPrice,
xlab = "Median Family Income", ylab = "Net Price",
main = "Income vs Net Price", col = "darkgreen")
The correlation between median family income and net price is about
0.52, showing a moderate positive relationship. Schools whose students
come from higher-income families tend to have higher net prices, likely
because those students receive less need-based aid.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
region_admit = college %>%
group_by(Region) %>%
summarize(avg_admit = mean(AdmitRate, na.rm = TRUE))
barplot(region_admit$avg_admit,
names.arg = region_admit$Region,
col = "orchid",
main = "Average Admission Rate by Region",
ylab = "Admission Rate")
The Midwest region has the highest average admission rate (~69%), while
the Southeast has the lowest (~65%). This implies varying levels of
competitiveness across regions, with the Midwest being relatively more
accessible.
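The same group averages can be computed in base R with `tapply()`, and sorting them makes the highest and lowest regions easy to spot. This is an alternative sketch, not the method used above, and the data frame `admit_df_demo` is made up.

```r
# Hypothetical data for illustration only.
admit_df_demo <- data.frame(
  Region    = c("Midwest", "Midwest", "Southeast", "Southeast", "West"),
  AdmitRate = c(0.70, 0.80, 0.60, NA, 0.50)
)

# Mean admission rate per region, ignoring missing values.
avg_by_region <- tapply(admit_df_demo$AdmitRate, admit_df_demo$Region,
                        mean, na.rm = TRUE)

# Sort from highest to lowest so the ranking is explicit.
avg_sorted <- sort(avg_by_region, decreasing = TRUE)
avg_sorted
```

Passing the sorted result to `barplot(avg_sorted, ylab = "Admission Rate")` would draw the bars in ranked order.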
In this project, we used data on U.S. colleges to explore important trends using statistical tools like mean, median, standard deviation, boxplots, histograms, barplots, pie charts, and correlation. Each question helped us understand a different part of the college experience, from costs to admissions to student outcomes.
We found that the average college admits about 67% of applicants, and the typical school has around 59% female students. In-state tuition costs vary a lot, with some schools charging much more than others, likely due to differences in state policies and whether a school is public or private. Most colleges have small to mid-sized enrollments, while a few very large ones bring the average up.
There is a strong positive relationship between SAT scores and completion rates: colleges with higher SAT averages tend to graduate a larger share of their students. We also found a moderate link between median family income and net price, meaning students at schools that draw from wealthier families often pay more, likely because they qualify for less financial aid.
A boxplot of student loan debt showed that some students borrow much more than others, depending on the college. When we looked at regions, the Northeast had the most schools, and the Midwest had the highest average admission rate, making it more accessible. Tuition in the Northeast was generally higher than in the Midwest, and we saw big differences in costs between regions.
Overall, this project showed how simple graphs and summary statistics can help us better understand the differences between colleges. These tools made it easier to see patterns in tuition, enrollment, diversity, and outcomes that would be hard to notice just by reading numbers in a table.
mean(college$AdmitRate, na.rm = TRUE)
median(college$Female, na.rm = TRUE)
var(college$TuitionIn, na.rm = TRUE)
sd(college$TuitionIn, na.rm = TRUE)
hist(college$Enrollment,
     main = "Distribution of Enrollment",
     col = "skyblue",
     xlab = "Enrollment",
     breaks = 30)
cor(college$AvgSAT, college$CompRate, use = "complete.obs")
plot(college$AvgSAT, college$CompRate,
     xlab = "Average SAT", ylab = "Completion Rate",
     main = "SAT vs Completion Rate", col = "blue")
boxplot(college$Debt,
        main = "Boxplot of Student Loan Debt",
        col = "tomato", horizontal = TRUE)
region_counts = table(college$Region)
region_percent = prop.table(region_counts) * 100
pie(region_percent,
    main = "Proportion of Schools by Region",
    col = rainbow(length(region_percent)))
boxplot(TuitionIn ~ Region, data = subset(college, Region %in% c("Northeast", "Midwest")),
        main = "In-State Tuition: Northeast vs Midwest",
        col = c("lightblue", "lightgreen"))
cor(college$MedIncome, college$NetPrice, use = "complete.obs")
plot(college$MedIncome, college$NetPrice,
     xlab = "Median Family Income", ylab = "Net Price",
     main = "Income vs Net Price", col = "darkgreen")
library(dplyr)
region_admit = college %>%
  group_by(Region) %>%
  summarize(avg_admit = mean(AdmitRate, na.rm = TRUE))
barplot(region_admit$avg_admit,
        names.arg = region_admit$Region,
        col = "orchid",
        main = "Average Admission Rate by Region",
        ylab = "Admission Rate")