This report presents an exploratory statistical analysis of the
CollegeScores4yr dataset, derived from the U.S.
Department of Education’s College Scorecard database.
The dataset provides information for more than 2,000 four-year
U.S. colleges and universities, including their costs,
selectivity, graduation outcomes, and demographic characteristics.
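The purpose of this project is to describe these characteristics and the relationships among them using descriptive statistics and graphics.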
All analyses focus on measures of center, spread,
distribution, and association.
Figures and tables were produced directly in RStudio (Posit Cloud) using
base R functions.
The complete R code is provided in Appendix A for
transparency and reproducibility.
# Load dataset directly from the Lock5 website
url <- "https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv"
college <- read.csv(url, stringsAsFactors = FALSE)
# Examine structure and key features
str(college)
## 'data.frame': 2012 obs. of 37 variables:
## $ Name : chr "Alabama A & M University" "University of Alabama at Birmingham" "Amridge University" "University of Alabama in Huntsville" ...
## $ State : chr "AL" "AL" "AL" "AL" ...
## $ ID : int 100654 100663 100690 100706 100724 100751 100812 100830 100858 100937 ...
## $ Main : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Accred : chr "Southern Association of Colleges and Schools Commission on Colleges" "Southern Association of Colleges and Schools Commission on Colleges" "Southern Association of Colleges and Schools Commission on Colleges" "Southern Association of Colleges and Schools Commission on Colleges" ...
## $ MainDegree : int 3 3 3 3 3 3 3 3 3 3 ...
## $ HighDegree : int 4 4 4 4 4 4 3 4 4 3 ...
## $ Control : chr "Public" "Public" "Private" "Public" ...
## $ Region : chr "Southeast" "Southeast" "Southeast" "Southeast" ...
## $ Locale : chr "City" "City" "City" "City" ...
## $ Latitude : num 34.8 33.5 32.4 34.7 32.4 ...
## $ Longitude : num -86.6 -86.8 -86.2 -86.6 -86.3 ...
## $ AdmitRate : num 0.903 0.918 NA 0.812 0.979 ...
## $ MidACT : int 18 25 NA 28 18 28 NA 22 27 26 ...
## $ AvgSAT : int 929 1195 NA 1322 935 1278 NA 1083 1282 1231 ...
## $ Online : int 0 0 1 0 0 0 0 0 0 0 ...
## $ Enrollment : int 4824 12866 322 6917 4189 32387 2801 4211 23391 1283 ...
## $ White : num 2.5 57.8 7.1 74.2 1.5 78.5 77.4 49.6 81.8 79.3 ...
## $ Black : num 90.7 25.9 14.3 10.7 93.8 10.1 12.6 38.3 6.3 11.9 ...
## $ Hispanic : num 0.9 3.3 0.6 4.6 1 4.7 2.7 1.3 3.4 1.8 ...
## $ Asian : num 0.2 5.9 0.3 4 0.3 1.2 0.9 2.6 2.4 5.5 ...
## $ Other : num 5.6 7.1 77.6 6.5 3.5 5.6 6.5 8.2 6.2 1.6 ...
## $ PartTime : num 6.6 25.2 54.4 15 7.7 7.9 56.9 23.2 8.4 0.9 ...
## $ NetPrice : int 15184 17535 9649 19986 12874 21973 NA 15310 23416 22893 ...
## $ Cost : int 22886 24129 15080 22108 19413 28836 NA 19892 30458 50440 ...
## $ TuitionIn : int 9857 8328 6900 10280 11068 10780 NA 8020 10968 35804 ...
## $ TuitonOut : int 18236 19032 6900 21480 19396 28100 NA 17140 29640 35804 ...
## $ TuitionFTE : int 9227 11612 14738 8727 9003 13574 6713 8709 15479 10088 ...
## $ InstructFTE: int 7298 17235 5265 9748 7983 10894 8017 7487 12067 10267 ...
## $ FacSalary : int 6983 10640 3866 9391 7399 10016 8268 7518 10137 7774 ...
## $ FullTimeFac: num 71.3 89.9 100 64.6 54.2 74 42.7 97.4 85.5 66.2 ...
## $ Pell : num 71 35.3 74.2 27.7 73.8 18 44.6 44.2 14.9 19.2 ...
## $ CompRate : num 24 52.9 18.2 48.6 27.7 ...
## $ Debt : int 1068 3755 109 1347 1294 6430 913 959 4152 268 ...
## $ Female : num 56.4 63.9 64.9 47.6 61.3 61.5 70.5 69.3 53.2 52 ...
## $ FirstGen : num 36.6 34.1 51.3 31 34.3 22.6 47.7 38.2 17.3 17.2 ...
## $ MedIncome : num 23.6 34.5 15 44.8 22.1 66.7 29.6 29.7 72 68.1 ...
The dataset includes 2012 observations (colleges)
and 37 variables.
Key variables used in this project include:
| Variable | Description |
|---|---|
| Cost | Total annual cost (tuition + room + board) |
| NetPrice | Average annual net cost paid by students |
| AdmitRate | Proportion of applicants admitted |
| AvgSAT | Average SAT score of admitted students |
| Control | Institution type (Public, Private, or Profit) |
| Region | Geographic region of the institution |
| CompRate | Student completion (graduation) rate |
| Debt | Typical student loan debt at graduation |
| Enrollment | Number of enrolled undergraduates |
The primary goals of this project are:
- To summarize numerical variables using mean, median, SD, variance, and IQR.
- To visualize distributions using histograms and boxplots.
- To compare groups using categorical variables (e.g., Control, Region).
- To explore linear relationships via correlation.
All missing data are handled using na.rm = TRUE (for
single-variable statistics) and use = "complete.obs" (for
correlations).
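The two missing-data conventions can be illustrated with a minimal sketch. The vectors `x` and `y` below are hypothetical toy values, not columns from the dataset; the point is only to show what `na.rm = TRUE` and `use = "complete.obs"` do.

```r
# Hypothetical toy vectors (not taken from the dataset) with missing values
x <- c(10, 20, NA, 40, 50)
y <- c(1, NA, 3, 5, 4)

# Without na.rm = TRUE, a single NA makes the result NA
mean(x)                          # NA

# With na.rm = TRUE, the NA is dropped before averaging
mean_x <- mean(x, na.rm = TRUE)  # mean of 10, 20, 40, 50

# use = "complete.obs" keeps only positions where both x and y are observed
r_xy <- cor(x, y, use = "complete.obs")

# Number of complete pairs actually used in the correlation
n_pairs <- sum(complete.cases(x, y))
n_pairs
```

Because each statistic drops missing values separately, different summaries in this report may be based on different numbers of schools.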
The following final set of questions (F1–F10) was chosen to represent a balanced mix of cost, selectivity, outcomes, and institutional characteristics.
mean_cost <- mean(college$Cost, na.rm = TRUE)
median_cost <- median(college$Cost, na.rm = TRUE)
sd_cost <- sd(college$Cost, na.rm = TRUE)
iqr_cost <- IQR(college$Cost, na.rm = TRUE)
data.frame(Mean = mean_cost, Median = median_cost, SD = sd_cost, IQR = iqr_cost)
## Mean Median SD IQR
## 1 34277 30699 15279 23519
Interpretation:
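The average total annual cost is roughly 34,277 dollars, somewhat above the median of 30,699 dollars, which is consistent with a right-skewed distribution.
The large SD (about 15,279) and IQR (about 23,519) indicate substantial variability, likely reflecting
differences between public and private institutions.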
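The project goals also list variance, which the summary above does not report. The sketch below shows how variance, SD, and IQR relate to one another, using a hypothetical vector `toy_cost` (illustrative values, not from the dataset).

```r
# Hypothetical costs in USD (illustrative values, not from the dataset)
toy_cost <- c(12000, 18000, 25000, 31000, 47000, 62000, NA)

toy_var <- var(toy_cost, na.rm = TRUE)  # variance, in squared dollars
toy_sd  <- sd(toy_cost, na.rm = TRUE)   # standard deviation, in dollars

# The variance is the square of the standard deviation
all.equal(toy_var, toy_sd^2)

# IQR() is the distance between the third and first quartiles
q <- quantile(toy_cost, probs = c(0.25, 0.75), na.rm = TRUE)
all.equal(IQR(toy_cost, na.rm = TRUE), unname(q[2] - q[1]))

# A mean above the median is consistent with right skew
mean(toy_cost, na.rm = TRUE) > median(toy_cost, na.rm = TRUE)
```

The same calls applied to `college$Cost` would give the variance for the real data; SD is usually easier to interpret because it is in dollars rather than squared dollars.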
summary(college$AdmitRate)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.540 0.695 0.670 0.825 1.000 360
hist(college$AdmitRate, main = "Distribution of Admission Rates",
xlab = "Admission Rate", col = "lightblue", border = "white")
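The histogram shows that admission rates cluster around moderate to
high values: the median is 0.695, and the middle half of schools admit between 54% and 82.5% of applicants.
While some schools are highly selective, most are
relatively accessible. Admission rates are missing for 360 schools, which are excluded from the histogram.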
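A statement such as "some schools are highly selective" can be quantified as a proportion. The sketch below uses a hypothetical vector `toy_admit`, and the 0.25 cutoff for "highly selective" is an assumption chosen only for illustration.

```r
# Hypothetical admission rates (illustrative values, not from the dataset)
toy_admit <- c(0.08, 0.22, 0.55, 0.68, 0.71, 0.83, 0.95, NA)

# Assumed cutoff for "highly selective"; not defined in the report
cutoff <- 0.25

# The mean of a logical vector is the proportion of TRUE values;
# na.rm = TRUE restricts the denominator to non-missing schools
share_selective <- mean(toy_admit < cutoff, na.rm = TRUE)

# Count of missing values, which summary() reports as NA's
n_missing <- sum(is.na(toy_admit))

share_selective
n_missing
```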
tbl_control <- table(college$Control)
prop.table(tbl_control)
##
## Private Profit Public
## 0.61779 0.08449 0.29771
barplot(tbl_control, main = "Number of Schools by Control Type",
xlab = "Control", ylab = "Count", col = c("steelblue", "gray70", "tomato"))
Observation:
Private institutions make up the majority (about 62%), public
institutions account for about 30%, and for-profit colleges
form a small minority (about 8%).
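The proportions are easier to read as sorted percentages. A minimal sketch, with the values typed in from the `prop.table()` output above rather than recomputed from the data:

```r
# Proportions as reported in the prop.table() output above
control_prop <- c(Private = 0.61779, Profit = 0.08449, Public = 0.29771)

# Convert to percentages, round to one decimal, and sort largest first
control_pct <- sort(round(100 * control_prop, 1), decreasing = TRUE)
control_pct
```

With the data loaded, the same result should follow from applying `sort(round(100 * prop.table(tbl_control), 1), decreasing = TRUE)` directly.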
hist(college$NetPrice, main = "Distribution of NetPrice",
xlab = "NetPrice (USD)", col = "lightgreen", border = "white")
summary(college$NetPrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 923 14494 19338 19887 24443 55775 162
Interpretation:
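The distribution of NetPrice is right-skewed: the mean (19,887) exceeds the median (19,338), and the maximum (55,775) lies far above the third quartile (24,443).
While many colleges fall in a moderate range, a few institutions have
substantially higher net prices.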
cor_avgSAT_cost <- cor(college$AvgSAT, college$Cost, use = "complete.obs")
cor_avgSAT_cost
## [1] 0.5374
A moderate positive correlation (0.537) indicates that institutions with higher average SAT scores tend to have higher total costs.
cor_admit_sat <- cor(college$AdmitRate, college$AvgSAT, use = "complete.obs")
cor_admit_sat
## [1] -0.4221
A moderate negative correlation (-0.422) indicates that schools with lower admission rates tend to have higher average SAT scores, as expected.
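One common way to gauge the strength of a correlation is to square it: for a simple linear fit, r squared is the share of variance in one variable accounted for by the other. The sketch below applies this to the two coefficients reported above (typed in by hand, not recomputed).

```r
# Correlations as reported in the output above
r_sat_cost  <- 0.5374   # cor(AvgSAT, Cost)
r_admit_sat <- -0.4221  # cor(AdmitRate, AvgSAT)

# Squared correlations: share of variance explained by a simple linear fit
r_squared <- round(c(SAT_Cost = r_sat_cost, Admit_SAT = r_admit_sat)^2, 3)
r_squared
# SAT_Cost Admit_SAT
#    0.289     0.178
```

On this scale, each relationship accounts for well under half of the variation, so a good deal of variability remains unexplained by a straight line.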
boxplot(NetPrice ~ Control, data = college,
main = "NetPrice by Control Type", xlab = "Control", ylab = "NetPrice (USD)",
col = c("skyblue", "orange", "gray80"))
Public institutions generally have lower median net prices, while private institutions show greater variability and higher typical costs.
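A boxplot reading like this can be checked numerically by computing the median and IQR within each group. The sketch below uses a hypothetical data frame `toy` with made-up values; only the column names `Control` and `NetPrice` come from the dataset.

```r
# Hypothetical mini data frame (values are illustrative, not from the dataset)
toy <- data.frame(
  Control  = c("Public", "Public", "Public", "Private", "Private", "Private"),
  NetPrice = c(12000, 15000, NA, 18000, 26000, 40000)
)

# Median (typical value) of NetPrice within each control type, ignoring NAs
med_by_control <- tapply(toy$NetPrice, toy$Control, median, na.rm = TRUE)

# IQR (variability of the middle half) within each control type
iqr_by_control <- tapply(toy$NetPrice, toy$Control, IQR, na.rm = TRUE)

med_by_control
iqr_by_control
```

Replacing `toy` with `college` would give the group medians and IQRs that correspond to the boxes in the plot.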
par(mar = c(8, 5, 4, 2) + 0.1)
boxplot(Cost ~ Region, data = college,
main = "Cost by Region", xlab = "", ylab = "Cost (USD)",
las = 2, cex.axis = 0.9, col = "lightgray")
mtext("Region", side = 1, line = 6)
tapply(college$Cost, college$Region, median, na.rm = TRUE)
## Midwest Northeast Southeast Territory West
## 35135 39046 28410 13250 27160
Observation:
Median costs differ by region: the Northeast
has the highest median (39,046) and the Territories have
the lowest (13,250).
cor_comp_debt <- cor(college$CompRate, college$Debt, use = "complete.obs")
cor_comp_debt
## [1] -0.1584
The correlation (-0.158) between completion rate and student debt is
weak and negative,
suggesting little linear association between these two outcomes.
hist(college$Enrollment, main = "Distribution of Enrollment",
xlab = "Enrollment", col = "lightgray", border = "white")
hist(log10(college$Enrollment[college$Enrollment > 0]),
main = "Distribution of log10(Enrollment)", xlab = "log10(Enrollment)",
col = "skyblue", border = "white")
Interpretation:
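Enrollment is strongly right-skewed: most colleges are relatively
small,
while a few large universities form a long upper tail. Plotting log10(Enrollment) compresses that tail, which makes the bulk of the distribution easier to read.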
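The effect of the log transformation can be seen by comparing the mean and median before and after. The sketch below uses a hypothetical vector `toy_enroll` (illustrative values, not from the dataset) and mirrors the `> 0` filter used in the histogram code.

```r
# Hypothetical enrollments (illustrative values, not from the dataset)
toy_enroll <- c(0, 150, 400, 900, 2500, 6000, 45000)

# log10(0) is -Inf, so keep strictly positive values only
enroll_pos <- toy_enroll[toy_enroll > 0]
log_enroll <- log10(enroll_pos)

# On the raw scale the mean sits far above the median (right skew)
c(mean = mean(enroll_pos), median = median(enroll_pos))

# On the log10 scale the mean and median are much closer together
c(mean = mean(log_enroll), median = median(log_enroll))
```

Each unit on the log10 scale is a tenfold change in enrollment, so a value of 3 corresponds to 1,000 students and 4 to 10,000.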
This project demonstrates how descriptive statistics and visual summaries can uncover meaningful patterns in educational data.
This analysis provides a descriptive foundation for future studies on accessibility, affordability, and educational equity.
The following code reproduces all analyses in this report.
# Load Data
url <- "https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv"
college <- read.csv(url, stringsAsFactors = FALSE)
# 1. Summary of Cost
mean(college$Cost, na.rm=TRUE); median(college$Cost, na.rm=TRUE)
sd(college$Cost, na.rm=TRUE); IQR(college$Cost, na.rm=TRUE)
# 2. AdmitRate Distribution
summary(college$AdmitRate); hist(college$AdmitRate)
# 3. Control Composition
table(college$Control); prop.table(table(college$Control))
# 4. NetPrice Distribution
summary(college$NetPrice); hist(college$NetPrice)
# 5. AvgSAT vs Cost
cor(college$AvgSAT, college$Cost, use="complete.obs")
# 6. AdmitRate vs AvgSAT
cor(college$AdmitRate, college$AvgSAT, use="complete.obs")
# 7. NetPrice by Control
boxplot(NetPrice ~ Control, data=college)
# 8. Cost by Region
par(mar=c(8,5,4,2)+0.1)
boxplot(Cost ~ Region, data=college, las=2)
mtext("Region", side=1, line=6)
# 9. CompRate vs Debt
cor(college$CompRate, college$Debt, use="complete.obs")
# 10. Enrollment
hist(college$Enrollment)
hist(log10(college$Enrollment[college$Enrollment>0]))