1. Introduction

For this project, I looked at data collected by the U.S. Department of Education as part of the College Scorecard project. The dataset includes information on all colleges and universities that grant 4-year bachelor’s degrees. After looking at the provided data, I chose 10 questions to explore; each one is stated and answered in the Analysis section.

Analysis

college <- read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
##                                  Name State     ID Main
## 1            Alabama A & M University    AL 100654    1
## 2 University of Alabama at Birmingham    AL 100663    1
## 3                  Amridge University    AL 100690    1
## 4 University of Alabama in Huntsville    AL 100706    1
## 5            Alabama State University    AL 100724    1
## 6           The University of Alabama    AL 100751    1
##                                                                Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
##   MainDegree HighDegree Control    Region Locale Latitude Longitude AdmitRate
## 1          3          4  Public Southeast   City 34.78337 -86.56850    0.9027
## 2          3          4  Public Southeast   City 33.50570 -86.79935    0.9181
## 3          3          4 Private Southeast   City 32.36261 -86.17401        NA
## 4          3          4  Public Southeast   City 34.72456 -86.64045    0.8123
## 5          3          4  Public Southeast   City 32.36432 -86.29568    0.9787
## 6          3          4  Public Southeast   City 33.21187 -87.54598    0.5330
##   MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1     18    929      0       4824   2.5  90.7      0.9   0.2   5.6      6.6
## 2     25   1195      0      12866  57.8  25.9      3.3   5.9   7.1     25.2
## 3     NA     NA      1        322   7.1  14.3      0.6   0.3  77.6     54.4
## 4     28   1322      0       6917  74.2  10.7      4.6   4.0   6.5     15.0
## 5     18    935      0       4189   1.5  93.8      1.0   0.3   3.5      7.7
## 6     28   1278      0      32387  78.5  10.1      4.7   1.2   5.6      7.9
##   NetPrice  Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1    15184 22886      9857     18236       9227        7298      6983
## 2    17535 24129      8328     19032      11612       17235     10640
## 3     9649 15080      6900      6900      14738        5265      3866
## 4    19986 22108     10280     21480       8727        9748      9391
## 5    12874 19413     11068     19396       9003        7983      7399
## 6    21973 28836     10780     28100      13574       10894     10016
##   FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1        71.3 71.0    23.96 1068   56.4     36.6      23.6
## 2        89.9 35.3    52.92 3755   63.9     34.1      34.5
## 3       100.0 74.2    18.18  109   64.9     51.3      15.0
## 4        64.6 27.7    48.62 1347   47.6     31.0      44.8
## 5        54.2 73.8    27.69 1294   61.3     34.3      22.1
## 6        74.0 18.0    67.87 6430   61.5     22.6      66.7

Now that we have our questions, we’re going to analyze each one to come up with an answer:

Q1: What is the average and spread of admission rates across U.S. 4-year colleges?

q1_mean   <- mean(college$AdmitRate, na.rm = TRUE)
q1_median <- median(college$AdmitRate, na.rm = TRUE)
q1_sd     <- sd(college$AdmitRate, na.rm = TRUE)
q1_range  <- range(college$AdmitRate, na.rm = TRUE)

q1_mean; q1_median; q1_sd; q1_range
## [1] 0.6702025
## [1] 0.69505
## [1] 0.208179
## [1] 0 1
hist(college$AdmitRate, main = "Admission Rate (Distribution)", xlab = "Admission Rate", col = "pink")

Based on the histogram, admission rates at US colleges cluster between roughly 60% and 80%, which is consistent with a mean of about 67% and a median of about 70%. This suggests that most institutions are relatively accessible to applicants, with fewer colleges being very selective. In general, colleges are more likely to have higher acceptance rates than lower ones. Out of curiosity, I looked up SCSU and found that our acceptance rate is around 95%, which is above the typical range but fits the overall pattern of colleges admitting most of their applicants.
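One way to check the 60% to 80% reading of the histogram is to compute the share of colleges whose admission rate falls in that band. The sketch below is only an illustration, not part of the original analysis: `share_between` is a hypothetical helper, and the demo vector holds just the six `AdmitRate` values printed by `head(college)`, so its result says nothing about the full dataset.

```r
# Hypothetical helper: fraction of non-missing values inside [lo, hi].
share_between <- function(x, lo, hi) {
  x <- x[!is.na(x)]          # drop missing admission rates first
  mean(x >= lo & x <= hi)    # the mean of TRUE/FALSE values is a proportion
}

# Demo on the six AdmitRate values shown by head(college); one is NA.
admit_head <- c(0.9027, 0.9181, NA, 0.8123, 0.9787, 0.5330)
share_between(admit_head, 0.60, 0.80)

# On the full data the call would be:
# share_between(college$AdmitRate, 0.60, 0.80)
```

Running the commented-out call on `college$AdmitRate` would give the exact proportion behind the visual impression from the histogram.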

Q2: How variable is the total cost of attending a 4-year college, and are there any outliers?

q2_var <- var(college$Cost, na.rm = TRUE)
q2_sd  <- sd(college$Cost, na.rm = TRUE)
q2_rng <- range(college$Cost, na.rm = TRUE)
q2_iqr <- IQR(college$Cost, na.rm = TRUE)

q2_var; q2_sd; q2_rng; q2_iqr
## [1] 233433900
## [1] 15278.54
## [1]  5950 72717
## [1] 23519.25
hist(college$Cost, main = "Cost", xlab = "Cost", col = "blue")

The total cost of attending a 4-year college varies widely across the US, from $5,950 to $72,717, with a standard deviation of about $15,279 and an interquartile range of about $23,519. Most colleges fall toward the lower end of this range, but a few have much higher costs, as shown by the histogram above. This indicates that while college can be affordable for many, some institutions charge significantly more.
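The outlier part of Q2 can be made concrete with the common 1.5 × IQR rule, which is closely related to the default whisker rule in R’s `boxplot`. The following is a minimal sketch under that assumption: `iqr_outliers` is a hypothetical helper, and the demo uses only the six `Cost` values printed by `head(college)`.

```r
# Hypothetical helper: values lying more than k * IQR outside the quartiles.
iqr_outliers <- function(x, k = 1.5) {
  x <- x[!is.na(x)]                                # ignore missing costs
  q <- quantile(x, c(0.25, 0.75), names = FALSE)   # first and third quartiles
  fence <- k * (q[2] - q[1])                       # k times the IQR
  x[x < q[1] - fence | x > q[2] + fence]           # keep only flagged values
}

# Demo on the six Cost values shown by head(college).
cost_head <- c(22886, 24129, 15080, 22108, 19413, 28836)
iqr_outliers(cost_head)

# On the full data the call would be:
# iqr_outliers(college$Cost)
```

An empty result means no value crosses either fence; any values returned for the full `Cost` column would be the candidate outliers the question asks about.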

Q3: What is the typical completion rate, and how is it distributed?

q3_summary <- summary(college$CompRate)
q3_summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   38.18   52.45   52.14   66.67  100.00     167
boxplot(college$CompRate, horizontal = TRUE, main = "Completion Rate — Boxplot", col = "purple")

Looking at the boxplot above, completion rates among 4-year colleges range from 0% to 100%, with the middle half of schools graduating between about 38% and 67% of their students. The median completion rate is around 52%, so half of all institutions graduate at least about half of their students. The mean (52.1%) and median (52.5%) are nearly equal and the boxplot looks fairly symmetrical with no extreme outliers, suggesting that the distribution is not strongly skewed. Note that 167 colleges have no reported completion rate and are left out of this summary.

Q4: What is the average student debt upon graduation?

q4_mean   <- mean(college$Debt, na.rm = TRUE)
q4_median <- median(college$Debt, na.rm = TRUE)
q4_sd     <- sd(college$Debt, na.rm = TRUE)
q4_rng    <- range(college$Debt, na.rm = TRUE)

q4_mean; q4_median; q4_sd; q4_rng
## [1] 2365.655
## [1] 713.5
## [1] 5360.986
## [1]    10 48216
boxplot(college$Debt, horizontal = TRUE, main = "Debt — Boxplot", col = "green")

The boxplot for average student debt shows a strongly right-skewed distribution. This means that while most students graduate with relatively low debt, a few institutions have students with extremely high loan balances. The mean debt, $2,366, is much higher than the median, $714, further confirming this skew. The range of debt spans from $10 to over $48,000, highlighting how financial burdens can differ drastically between colleges.
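The mean-versus-median comparison used here as a skew check can be written out explicitly. The sketch below is illustrative only: the first part reuses the mean and median reported in the output above, and the second part uses just the six `Debt` values printed by `head(college)`; the variable names are hypothetical.

```r
# Ratio of the reported mean to the reported median for Debt.
reported_mean   <- 2365.655
reported_median <- 713.5
mean_to_median  <- reported_mean / reported_median
round(mean_to_median, 1)   # a ratio well above 1 points to right skew

# Same comparison on the six Debt values shown by head(college).
debt_head   <- c(1068, 3755, 109, 1347, 1294, 6430)
debt_mean   <- mean(debt_head)
debt_median <- median(debt_head)
debt_mean - debt_median    # a positive gap also hints at a long right tail
```

A mean far above the median is a quick signal of right skew, but it does not replace looking at the boxplot or histogram.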

Q5: What is the correlation between admission rate and average SAT score?

plot(college$AdmitRate, college$AvgSAT,
     xlab = "Admission Rate", ylab = "Average SAT Scores",
     main = "Admission Rate vs Average SAT Scores")

q5_cor <- cor(college$AdmitRate, college$AvgSAT, use = "complete.obs")
q5_cor
## [1] -0.4221255

The scatterplot shows a moderate negative correlation (r = –0.42) between admission rate and average SAT scores. This means that more selective colleges tend to have higher average SAT scores, while schools with high acceptance rates typically have lower SAT averages. This relationship aligns with expectations because highly selective schools, such as those in the Ivy League, have both low acceptance rates and high SAT scores among admitted students.
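Because both columns contain missing values, the correlation above relies on `use = "complete.obs"`, which keeps only rows where both variables are present. The sketch below shows what that option does, using only the six rows printed by `head(college)`; the resulting value describes those few rows and is not the r = –0.42 from the full data.

```r
# AdmitRate and AvgSAT for the six rows shown by head(college); row 3 is missing both.
admit <- c(0.9027, 0.9181, NA, 0.8123, 0.9787, 0.5330)
sat   <- c(929, 1195, NA, 1322, 935, 1278)

# Built-in handling: drop any row with an NA in either variable.
r_builtin <- cor(admit, sat, use = "complete.obs")

# The same thing done by hand with complete.cases().
keep     <- complete.cases(admit, sat)
r_manual <- cor(admit[keep], sat[keep])

sum(keep)                       # number of rows actually used
all.equal(r_builtin, r_manual)  # both approaches agree
```

Reporting `sum(keep)` alongside r makes it clear how many colleges the correlation is actually based on.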

Q6: Do colleges with higher median family income tend to have higher net prices?

plot(college$MedIncome, college$NetPrice,
     xlab = "Median Family Income (in $1,000)",
     ylab = "Net Price",
     main = "Median Income vs Net Price")

q6_cor <- cor(college$MedIncome, college$NetPrice, use = "complete.obs")
q6_cor
## [1] 0.5151298

The scatterplot shows a moderate positive correlation (r = 0.52) between median family income and net price. This suggests that colleges with students from higher-income families tend to have higher net prices, while those serving lower-income populations are generally more affordable. Although the data show some variation, the upward trend indicates that family income and college cost are meaningfully related.

Q7: How do net prices differ between Public, Private, and For-Profit colleges?

boxplot(NetPrice ~ Control, data = college,
        main = "Net Price by Control",
        xlab = "Control", ylab = "NetPrice")

q7_means   <- tapply(college$NetPrice, college$Control, mean,   na.rm = TRUE)
q7_medians <- tapply(college$NetPrice, college$Control, median, na.rm = TRUE)
q7_means; q7_medians
##  Private   Profit   Public 
## 22259.02 23309.99 14295.06
## Private  Profit  Public 
##   21836   23179   14376

The boxplot comparing net price by control type shows that public colleges have the lowest overall costs, with a median net price of about $14,000, while private and for-profit institutions are considerably more expensive, averaging around $22,000–$23,000. The spread of data is wider for private and for-profit schools, indicating large differences in pricing within those groups. This suggests that public colleges are generally more affordable and consistent in cost, while private and for-profit institutions vary more and tend to charge higher prices.
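The comparison of spread between control types is read off the boxplot, but it can also be computed per group in the same way as the means and medians. The sketch below assumes the IQR is an acceptable measure of spread here and uses only the six rows printed by `head(college)`, which include a single private school, so the numbers are for illustration only; `net_head` and `spread_by_control` are hypothetical names.

```r
# Control and NetPrice for the six rows shown by head(college).
net_head <- data.frame(
  Control  = c("Public", "Public", "Private", "Public", "Public", "Public"),
  NetPrice = c(15184, 17535, 9649, 19986, 12874, 21973)
)

# IQR of net price within each control type.
spread_by_control <- tapply(net_head$NetPrice, net_head$Control, IQR, na.rm = TRUE)
spread_by_control

# On the full data the call would be:
# tapply(college$NetPrice, college$Control, IQR, na.rm = TRUE)
```

Swapping `IQR` for `sd` in the same call gives the standard deviation per group instead.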

Q8: What is the average percentage of female students, and what does its distribution look like?

q8_mean <- mean(college$Female, na.rm = TRUE)
q8_sd   <- sd(college$Female, na.rm = TRUE)
q8_rng  <- range(college$Female, na.rm = TRUE)

q8_mean; q8_sd; q8_rng
## [1] 59.29588
## [1] 12.34421
## [1] 11.8 98.0
hist(college$Female, main = "Female Percentage", xlab = "Female (%)", col = "pink")

The histogram shows that most 4-year colleges have between 55% and 65% female students, indicating a slight majority of women in higher education. The distribution is fairly symmetrical, suggesting that gender representation is consistent across most institutions. Very few schools have extremely high or low female enrollment, meaning most colleges maintain a relatively balanced student population.

Q9: What is the typical difference between out of state and in state tuition?

tuition_diff <- college$TuitonOut - college$TuitionIn   
q9_mean   <- mean(tuition_diff, na.rm = TRUE)
q9_median <- median(tuition_diff, na.rm = TRUE)
q9_sd     <- sd(tuition_diff, na.rm = TRUE)
q9_rng    <- range(tuition_diff, na.rm = TRUE)

q9_mean; q9_median; q9_sd; q9_rng
## [1] 3388.112
## [1] 0
## [1] 6081.32
## [1]     0 32650
hist(tuition_diff, main = "Tuition Difference", xlab = "Out - In", col = "brown")
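Since the minimum difference is 0 and the median is also 0, at least half of the colleges report identical in-state and out-of-state tuition. A direct way to see this is to compute the share of schools with no gap at all. The sketch below uses only the six rows printed by `head(college)` and hypothetical variable names, so the proportion it returns is not the one for the full dataset.

```r
# TuitionIn and TuitonOut for the six rows shown by head(college).
tuition_in  <- c(9857, 8328, 6900, 10280, 11068, 10780)
tuition_out <- c(18236, 19032, 6900, 21480, 19396, 28100)

gap <- tuition_out - tuition_in             # out-of-state minus in-state
share_no_gap <- mean(gap == 0, na.rm = TRUE)  # proportion with identical tuition
share_no_gap

# On the full data the call would be:
# mean(tuition_diff == 0, na.rm = TRUE)
```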

The histogram above shows that most colleges have little or no gap between in-state and out-of-state tuition, but a few institutions charge non-residents significantly more. The median difference is $0, while the mean difference is about $3,388, and some colleges charge up to $32,650 more for out-of-state students. This strongly right-skewed distribution suggests that while most schools keep tuition consistent, some universities have much higher tuition rates for non-residents.

Q10: Is there a correlation between faculty salary and completion rate?

plot(college$FacSalary, college$CompRate,
     xlab = "Faculty Salary (monthly)",
     ylab = "Completion Rate",
     main = "Faculty Salary vs Completion Rate")

q10_cor <- cor(college$FacSalary, college$CompRate, use = "complete.obs")
q10_cor
## [1] 0.577221

The scatterplot shows a moderate to strong positive correlation (r = 0.58) between faculty salary and completion rate. Colleges that pay their faculty more tend to have higher student success rates. This suggests that institutions with better resources attract experienced faculty and provide stronger academic support to students. While correlation does not imply causation, the trend suggests that investment in faculty may contribute to improved student outcomes.
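To put the strength of these correlations in perspective, r can be squared: for a straight-line fit with one predictor, r² is the share of variance in one variable accounted for by the other. The sketch below simply squares the three correlations reported in the output for Q5, Q6, and Q10; the vector names are hypothetical.

```r
# Correlations reported in the output above.
r_values <- c(Q5 = -0.4221255, Q6 = 0.5151298, Q10 = 0.577221)

# Squared correlations: share of variance explained by a linear fit.
r_squared <- r_values^2
round(r_squared, 2)
```

For Q10 this comes to about a third, so roughly two thirds of the variation in completion rates is not accounted for by faculty salary in a linear fit.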

Summary

After analyzing the CollegeScores4yr dataset, I noticed a few clear patterns about colleges in the US. Admission rates cluster between 60% and 80%, meaning most schools are fairly open to applicants. Public universities tend to be the most affordable, while private and for-profit schools cost noticeably more and vary a lot in price. The median completion rate is around 52%, so half of all colleges graduate fewer than about 52% of their students. Student debt is usually low, but there are a few schools where the average debt is really high. I also found that more selective schools usually have higher SAT scores, and colleges whose students come from higher-income families tend to have higher net prices. Most schools have between 55% and 65% female students, showing a slight majority of women. When comparing in-state and out-of-state tuition, most schools don’t have a big difference (the median gap is $0), but some charge non-residents a lot more. Finally, colleges with higher faculty salaries also tend to have better student completion rates. Overall, these trends show that cost, selectivity, and resources are closely tied to student outcomes across different types of institutions, although correlation alone does not show what causes what.