1 Executive Summary

This report presents an exploratory statistical analysis of the CollegeScores4yr dataset, derived from the U.S. Department of Education’s College Scorecard database.
The dataset provides information for more than 2,000 four-year U.S. colleges and universities, including their costs, selectivity, graduation outcomes, and demographic characteristics.

The purpose of this project is to:

  1. Develop and analyze simple research questions using Chapter 6 descriptive methods.
  2. Employ R Markdown to integrate code, computation, and narrative.
  3. Present findings clearly, concisely, and reproducibly in a professional report.

All analyses focus on measures of center, spread, distribution, and association.
Figures and tables were produced directly in RStudio (Posit Cloud) using base R functions.
The complete R code is provided in Appendix A for transparency and reproducibility.


2 Dataset Overview

# Load dataset directly from the Lock5 website
url <- "https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv"
college <- read.csv(url, stringsAsFactors = FALSE)
# Examine structure and key features
str(college)
## 'data.frame':    2012 obs. of  37 variables:
##  $ Name       : chr  "Alabama A & M University" "University of Alabama at Birmingham" "Amridge University" "University of Alabama in Huntsville" ...
##  $ State      : chr  "AL" "AL" "AL" "AL" ...
##  $ ID         : int  100654 100663 100690 100706 100724 100751 100812 100830 100858 100937 ...
##  $ Main       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Accred     : chr  "Southern Association of Colleges and Schools Commission on Colleges" "Southern Association of Colleges and Schools Commission on Colleges" "Southern Association of Colleges and Schools Commission on Colleges" "Southern Association of Colleges and Schools Commission on Colleges" ...
##  $ MainDegree : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ HighDegree : int  4 4 4 4 4 4 3 4 4 3 ...
##  $ Control    : chr  "Public" "Public" "Private" "Public" ...
##  $ Region     : chr  "Southeast" "Southeast" "Southeast" "Southeast" ...
##  $ Locale     : chr  "City" "City" "City" "City" ...
##  $ Latitude   : num  34.8 33.5 32.4 34.7 32.4 ...
##  $ Longitude  : num  -86.6 -86.8 -86.2 -86.6 -86.3 ...
##  $ AdmitRate  : num  0.903 0.918 NA 0.812 0.979 ...
##  $ MidACT     : int  18 25 NA 28 18 28 NA 22 27 26 ...
##  $ AvgSAT     : int  929 1195 NA 1322 935 1278 NA 1083 1282 1231 ...
##  $ Online     : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ Enrollment : int  4824 12866 322 6917 4189 32387 2801 4211 23391 1283 ...
##  $ White      : num  2.5 57.8 7.1 74.2 1.5 78.5 77.4 49.6 81.8 79.3 ...
##  $ Black      : num  90.7 25.9 14.3 10.7 93.8 10.1 12.6 38.3 6.3 11.9 ...
##  $ Hispanic   : num  0.9 3.3 0.6 4.6 1 4.7 2.7 1.3 3.4 1.8 ...
##  $ Asian      : num  0.2 5.9 0.3 4 0.3 1.2 0.9 2.6 2.4 5.5 ...
##  $ Other      : num  5.6 7.1 77.6 6.5 3.5 5.6 6.5 8.2 6.2 1.6 ...
##  $ PartTime   : num  6.6 25.2 54.4 15 7.7 7.9 56.9 23.2 8.4 0.9 ...
##  $ NetPrice   : int  15184 17535 9649 19986 12874 21973 NA 15310 23416 22893 ...
##  $ Cost       : int  22886 24129 15080 22108 19413 28836 NA 19892 30458 50440 ...
##  $ TuitionIn  : int  9857 8328 6900 10280 11068 10780 NA 8020 10968 35804 ...
##  $ TuitonOut  : int  18236 19032 6900 21480 19396 28100 NA 17140 29640 35804 ...
##  $ TuitionFTE : int  9227 11612 14738 8727 9003 13574 6713 8709 15479 10088 ...
##  $ InstructFTE: int  7298 17235 5265 9748 7983 10894 8017 7487 12067 10267 ...
##  $ FacSalary  : int  6983 10640 3866 9391 7399 10016 8268 7518 10137 7774 ...
##  $ FullTimeFac: num  71.3 89.9 100 64.6 54.2 74 42.7 97.4 85.5 66.2 ...
##  $ Pell       : num  71 35.3 74.2 27.7 73.8 18 44.6 44.2 14.9 19.2 ...
##  $ CompRate   : num  24 52.9 18.2 48.6 27.7 ...
##  $ Debt       : int  1068 3755 109 1347 1294 6430 913 959 4152 268 ...
##  $ Female     : num  56.4 63.9 64.9 47.6 61.3 61.5 70.5 69.3 53.2 52 ...
##  $ FirstGen   : num  36.6 34.1 51.3 31 34.3 22.6 47.7 38.2 17.3 17.2 ...
##  $ MedIncome  : num  23.6 34.5 15 44.8 22.1 66.7 29.6 29.7 72 68.1 ...

The dataset includes 2012 observations (colleges) and 37 variables.
Key variables used in this project include:

Variable     Description
Cost         Total annual cost (tuition + room + board)
NetPrice     Average annual net cost paid by students
AdmitRate    Proportion of applicants admitted
AvgSAT       Average SAT score of admitted students
Control      Institution type (Public, Private, or Profit)
Region       Geographic region of the institution
CompRate     Student completion (graduation) rate
Debt         Typical student loan debt at graduation
Enrollment   Number of enrolled undergraduates
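
Several of these variables contain missing values, as the NA entries in the str() output show. The sketch below counts missing values per column. It rebuilds only the first three rows of five variables as printed in the str() output above, so that it runs without downloading the file; the names first3, key_vars, and na_counts are illustrative and not part of the original analysis.

```r
# First three rows of five key variables, copied from the str() output above
first3 <- data.frame(
  Cost      = c(22886, 24129, 15080),
  NetPrice  = c(15184, 17535, 9649),
  AdmitRate = c(0.903, 0.918, NA),
  AvgSAT    = c(929, 1195, NA),
  Control   = c("Public", "Public", "Private")
)

# Count missing values in each selected column
key_vars  <- c("Cost", "NetPrice", "AdmitRate", "AvgSAT", "Control")
na_counts <- colSums(is.na(first3[, key_vars]))
na_counts
```

With the full dataset loaded, the same call on college[, key_vars] would give the missing-value count for each variable across all 2012 colleges.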

3 Research Design

3.1 Project Goals

The primary goals of this project are:

- To summarize numerical variables using mean, median, SD, variance, and IQR.
- To visualize distributions using histograms and boxplots.
- To compare groups using categorical variables (e.g., Control, Region).
- To explore linear relationships via correlation.

All missing data are handled using na.rm = TRUE (for single-variable statistics) and use = "complete.obs" (for correlations).
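
As a minimal illustration of what these two options do, the sketch below uses toy vectors (hypothetical values, not taken from CollegeScores4yr). The object names are chosen for the example only.

```r
# Toy vectors with missing values (hypothetical, for illustration only)
x <- c(10, 20, NA, 40, 50)
y <- c(1, 2, 3, NA, 5)

# Without na.rm = TRUE, a single NA makes the whole result NA
mean_raw <- mean(x)

# na.rm = TRUE drops the NA and averages the four observed values
mean_x <- mean(x, na.rm = TRUE)

# use = "complete.obs" keeps only the rows where both x and y are observed
n_pairs <- sum(complete.cases(x, y))
r_xy    <- cor(x, y, use = "complete.obs")

c(mean_raw = mean_raw, mean_x = mean_x, n_pairs = n_pairs, r_xy = r_xy)
```

One consequence of this approach is that each statistic can rest on a different number of colleges, depending on which variables are missing for which schools.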

3.2 Part 1 — Ten Self-Proposed Questions

  1. What are the mean and median of Cost across all colleges?
  2. What is the spread of NetPrice (SD, variance, and IQR)?
  3. What is the distribution of AdmitRate?
  4. What proportion of schools are Public, Private, or Profit?
  5. Which Region has the highest median Cost?
  6. What is the correlation between AvgSAT and Cost?
  7. What is the correlation between AdmitRate and AvgSAT?
  8. How does NetPrice differ across Control groups?
  9. What is the correlation between CompRate and Debt?
  10. What is the distribution of Enrollment?

3.3 Part 2 — Ten ChatGPT-Proposed Questions

  1. What are the median and IQR of InstructFTE?
  2. What is the correlation between MedIncome and NetPrice?
  3. Do Locale categories differ in Cost (boxplots)?
  4. What share of schools are Online-only?
  5. What is the correlation between FullTimeFac and FacSalary?
  6. What is the distribution of Pell?
  7. How does Debt differ across Control groups?
  8. What is the correlation between AvgSAT and FacSalary?
  9. Which Region has the largest median CompRate?
  10. What is the correlation between TuitionIn and TuitionOut?

3.4 Part 3 — Final Ten Questions Analyzed

The final set of questions (F1–F10) was chosen to represent a balanced mix of cost, selectivity, outcomes, and institutional characteristics.


4 Results & Discussion

4.1 F1. Central Tendency and Spread of Cost

mean_cost   <- mean(college$Cost, na.rm = TRUE)
median_cost <- median(college$Cost, na.rm = TRUE)
sd_cost     <- sd(college$Cost, na.rm = TRUE)
iqr_cost    <- IQR(college$Cost, na.rm = TRUE)
data.frame(Mean = mean_cost, Median = median_cost, SD = sd_cost, IQR = iqr_cost)
##    Mean Median    SD   IQR
## 1 34277  30699 15279 23519

Interpretation:
The mean total annual cost is about $34,277, above the median of $30,699, which points to a right-skewed distribution.
The large SD ($15,279) and IQR ($23,519) indicate substantial variability, likely reflecting differences between public and private institutions.
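
The same set of summaries is needed for other variables (for example, the SD, variance, and IQR of NetPrice), so it can be wrapped in a small helper. The function describe_numeric below is a hypothetical convenience written for this illustration, and toy_cost holds made-up values rather than data from the file.

```r
# Hypothetical helper: center and spread of one numeric vector, ignoring NAs
describe_numeric <- function(x) {
  c(n      = sum(!is.na(x)),
    mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    var    = var(x, na.rm = TRUE),
    IQR    = IQR(x, na.rm = TRUE))
}

# Toy values for illustration only (one missing, one large value)
toy_cost <- c(12000, 18000, 25000, NA, 31000, 64000)
stats <- describe_numeric(toy_cost)
stats
```

In the toy data the single large value pulls the mean above the median, the same pattern seen for Cost. With the dataset loaded, describe_numeric(college$NetPrice) would return the corresponding summaries for NetPrice.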


4.2 F2. Distribution of Admission Rate

summary(college$AdmitRate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.540   0.695   0.670   0.825   1.000     360
hist(college$AdmitRate, main = "Distribution of Admission Rates",
     xlab = "Admission Rate", col = "lightblue", border = "white")

The histogram shows that admission rates cluster at moderate to high values (median 0.695),
so while some schools are highly selective, most are relatively accessible.
The 360 colleges with no reported admission rate are excluded from these summaries.
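
A frequency table by bin can complement the histogram by giving exact counts. The sketch below is one way to do this with cut(). The admission rates are toy values and the choice of four equal-width bins is an assumption made for the example.

```r
# Toy admission rates (hypothetical), including one missing value
admit <- c(0.05, 0.32, 0.55, 0.68, 0.70, 0.81, 0.93, 1.00, NA)

# Split the 0-1 range into four equal-width bins;
# include.lowest = TRUE keeps a rate of exactly 0 in the first bin
bins <- cut(admit, breaks = seq(0, 1, by = 0.25), include.lowest = TRUE)

# useNA = "ifany" adds a count of schools with no recorded rate
bin_counts <- table(bins, useNA = "ifany")
bin_counts
```

Applied to college$AdmitRate, the same two calls would show how many colleges fall in each quarter of the range, alongside the count of missing values.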


4.3 F3. Composition of Institution Type (Control)

tbl_control <- table(college$Control)
prop.table(tbl_control)
## 
## Private  Profit  Public 
## 0.61779 0.08449 0.29771
barplot(tbl_control, main = "Number of Schools by Control Type",
        xlab = "Control", ylab = "Count", col = c("steelblue", "gray70", "tomato"))

Observation:
Private institutions make up about 62% of the dataset and public institutions about 30%, with for-profit colleges forming a small minority (about 8%).
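
For reporting, proportions are often easier to read as rounded percentages sorted by size. The following sketch shows one way to do that; the control vector is a toy example with made-up group sizes.

```r
# Toy institution types (hypothetical counts, for illustration only)
control <- c("Private", "Public", "Private", "Profit", "Private",
             "Public", "Private", "Private", "Public", "Private")

# Convert counts to percentages, rounded to one decimal place
pct <- round(100 * prop.table(table(control)), 1)

# Sort from largest to smallest group
pct_sorted <- sort(pct, decreasing = TRUE)
pct_sorted
```

Note that table() orders categories alphabetically (Private, Profit, Public), which is the order in which the col vector in barplot() is applied.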


4.4 F4. Spread and Distribution of NetPrice

hist(college$NetPrice, main = "Distribution of NetPrice",
     xlab = "NetPrice (USD)", col = "lightgreen", border = "white")

summary(college$NetPrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     923   14494   19338   19887   24443   55775     162

Interpretation:
The distribution of NetPrice is right-skewed.
While many colleges fall in a moderate range, a few institutions have significantly higher net prices.


4.5 F5. Relationship between AvgSAT and Cost

cor_avgSAT_cost <- cor(college$AvgSAT, college$Cost, use = "complete.obs")
cor_avgSAT_cost
## [1] 0.5374

A moderate positive correlation (0.537) indicates that institutions with higher average SAT scores, a proxy for academic selectivity, tend to have higher total costs.


4.6 F6. Relationship between AdmitRate and AvgSAT

cor_admit_sat <- cor(college$AdmitRate, college$AvgSAT, use = "complete.obs")
cor_admit_sat
## [1] -0.4221

A moderate negative correlation (-0.422) reveals that schools with lower admission rates typically have higher SAT scores, as expected.
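
Because F5 and F6 involve the same three variables, the pairwise correlations can also be viewed together in one matrix. The sketch below uses a toy data frame with hypothetical values. It uses use = "pairwise.complete.obs", which for each pair of columns gives the same result as calling cor() on that pair with use = "complete.obs".

```r
# Toy data frame (hypothetical values, for illustration only)
toy <- data.frame(
  AvgSAT    = c(1450, 1300, 1200, NA, 1100, 1000),
  Cost      = c(62000, 48000, NA, 30000, 27000, 21000),
  AdmitRate = c(0.20, 0.35, 0.50, 0.60, NA, 0.90)
)

# Each entry uses every row where that particular pair is observed
cor_mat <- cor(toy, use = "pairwise.complete.obs")
round(cor_mat, 2)
```

With the real data, cor(college[, c("AvgSAT", "Cost", "AdmitRate")], use = "pairwise.complete.obs") would place the F5 and F6 correlations in a single table.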


4.7 F7. Variation of NetPrice by Control

boxplot(NetPrice ~ Control, data = college,
        main = "NetPrice by Control Type", xlab = "Control", ylab = "NetPrice (USD)",
        col = c("skyblue", "orange", "gray80"))

Public institutions generally have lower median net prices, while private institutions show greater variability and higher typical costs.
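
The boxplot comparison can be backed with numbers by computing the median and IQR within each group, using the same tapply() pattern applied to Cost by Region in F8. The sketch below runs on a toy data frame with hypothetical values.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  Control  = c("Public", "Public", "Public", "Private",
               "Private", "Private", "Profit", "Profit"),
  NetPrice = c(12000, 15000, NA, 18000, 26000, 34000, 20000, 24000)
)

# Median and IQR of NetPrice within each Control group, ignoring NAs
med_by_control <- tapply(toy$NetPrice, toy$Control, median, na.rm = TRUE)
iqr_by_control <- tapply(toy$NetPrice, toy$Control, IQR, na.rm = TRUE)

rbind(median = med_by_control, IQR = iqr_by_control)
```

Replacing toy with college would give the group medians and IQRs that correspond to the boxes in the plot above.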


4.8 F8. Variation of Cost by Region

par(mar = c(8, 5, 4, 2) + 0.1)
boxplot(Cost ~ Region, data = college,
        main = "Cost by Region", xlab = "", ylab = "Cost (USD)",
        las = 2, cex.axis = 0.9, col = "lightgray")
mtext("Region", side = 1, line = 6)

tapply(college$Cost, college$Region, median, na.rm = TRUE)
##   Midwest Northeast Southeast Territory      West 
##     35135     39046     28410     13250     27160

Observation:
Median costs differ by region: the Northeast has the highest median ($39,046) and the Territory group the lowest ($13,250).


4.9 F9. Relationship between Completion Rate and Debt

cor_comp_debt <- cor(college$CompRate, college$Debt, use = "complete.obs")
cor_comp_debt
## [1] -0.1584

The correlation (-0.158) between completion rate and student debt is weak and negative,
suggesting limited linear association between these two outcomes.


4.10 F10. Distribution of Enrollment

hist(college$Enrollment, main = "Distribution of Enrollment",
     xlab = "Enrollment", col = "lightgray", border = "white")

hist(log10(college$Enrollment[college$Enrollment > 0]),
     main = "Distribution of log10(Enrollment)", xlab = "log10(Enrollment)",
     col = "skyblue", border = "white")

Interpretation:
Enrollment sizes are highly skewed. Most colleges are relatively small,
while a few large universities dominate the upper tail.
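
A quick numeric check of skewness is to compare the mean with the median, and the log transform can be made explicit about which values it drops. The sketch below uses toy enrollment figures (hypothetical values). The keep filter is an illustrative variant of the subsetting used above that also removes NAs before taking logs.

```r
# Toy enrollment values (hypothetical), including a zero and a missing value
enroll <- c(250, 400, 800, 1200, 2500, 0, NA, 45000)

# A mean far above the median signals right skew
center <- c(mean   = mean(enroll, na.rm = TRUE),
            median = median(enroll, na.rm = TRUE))

# log10 is undefined at 0, so keep only observed, positive values
keep <- !is.na(enroll) & enroll > 0
log_enroll <- log10(enroll[keep])

center
range(log_enroll)
```

In the report's own code, college$Enrollment[college$Enrollment > 0] leaves missing values in place as NA, and hist() then drops them, so both approaches plot the same values.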


5 Conclusion

This project demonstrates how descriptive statistics and visual summaries can uncover meaningful patterns in educational data.

  • Cost and Net Price: Substantial variation across institutions; private and northeastern schools tend to be most expensive.
  • Selectivity: Higher AvgSAT correlates with higher Cost and lower AdmitRate.
  • Institutional Differences: Cost varies markedly by region, and net price varies by control type; these are descriptive associations, not causal effects.
  • Scale: Enrollment distributions highlight major differences between small colleges and large universities.
  • Outcomes: Completion rate and debt show weak correlation, implying other factors may drive student success.

This analysis provides a descriptive foundation for future studies on accessibility, affordability, and educational equity.


6 Appendix A — Full R Code

The following code reproduces all analyses in this report.

# Load Data
url <- "https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv"
college <- read.csv(url, stringsAsFactors = FALSE)

# 1. Summary of Cost
mean(college$Cost, na.rm=TRUE); median(college$Cost, na.rm=TRUE)
sd(college$Cost, na.rm=TRUE); IQR(college$Cost, na.rm=TRUE)

# 2. AdmitRate Distribution
summary(college$AdmitRate); hist(college$AdmitRate)

# 3. Control Composition
table(college$Control); prop.table(table(college$Control))

# 4. NetPrice Distribution
summary(college$NetPrice); hist(college$NetPrice)

# 5. AvgSAT vs Cost
cor(college$AvgSAT, college$Cost, use="complete.obs")

# 6. AdmitRate vs AvgSAT
cor(college$AdmitRate, college$AvgSAT, use="complete.obs")

# 7. NetPrice by Control
boxplot(NetPrice ~ Control, data=college)

# 8. Cost by Region
par(mar=c(8,5,4,2)+0.1)
boxplot(Cost ~ Region, data=college, las=2)
mtext("Region", side=1, line=6)

# 9. CompRate vs Debt
cor(college$CompRate, college$Debt, use="complete.obs")

# 10. Enrollment
hist(college$Enrollment)
hist(log10(college$Enrollment[college$Enrollment>0]))