Introduction

10 Simple Questions from Diverse Perspectives that can be addressed using Chapter 6 Methods:

  1. What is the average admission rate (AdmitRate) for U.S. colleges that primarily grant bachelor’s degrees?

  2. What is the median percentage of female students (Female) across all colleges in the dataset?

  3. How much do average tuition and fees for in-state students (TuitionIn) vary across schools?

  4. What is the shape of the distribution of undergraduate enrollment (Enrollment)?

  5. Is there a correlation between average SAT scores (AvgSAT) and college completion rate (CompRate)?

  6. What is the range and standard deviation of faculty salaries (FacSalary)?

  7. Create a boxplot of student loan debt (Debt). Are there any visible outliers?

  8. Which control type (Private, Public, Profit) has the highest average instructional spending per FTE student (InstructFTE)?

  9. What proportion of schools are located in each U.S. region (Region)?

  10. How does the percentage of first-generation students (FirstGen) differ between schools with high (>60%) and low (<40%) Pell Grant rates (Pell)?

10 Questions Obtained from ChatGPT:

  1. Compare the distributions of in-state tuition (TuitionIn) between schools in the Northeast and Midwest regions using side-by-side boxplots.

  2. What is the correlation between median family income (MedIncome) and net price (NetPrice)?

  3. Is there more variability in faculty salaries (FacSalary) at public or private institutions?

  4. Create a histogram of the completion rate (CompRate). Does the distribution appear symmetric, skewed left, or skewed right?

  5. Do colleges in suburban areas (Locale == “Suburb”) have a higher average SAT score (AvgSAT) than those in rural areas (Locale == “Rural”)?

  6. What is the median debt (Debt) among students at colleges where more than 50% of students receive Pell grants?

  7. Create a barplot showing the average admission rate (AdmitRate) for each region (Region).

  8. Compare the percentage of Hispanic students (Hispanic) and Asian students (Asian) across all schools using boxplots.

  9. Among colleges with average ACT scores (MidACT) above 28, what is the mean completion rate (CompRate)?

  10. Is there a relationship between part-time enrollment (PartTime) and total enrollment (Enrollment)?

10 Questions Selected from the Previous 20 for Analysis:

  1. What is the average admission rate (AdmitRate) for U.S. colleges that primarily grant bachelor’s degrees?

  2. What is the median percentage of female students (Female) across all colleges in the dataset?

  3. How much do average tuition and fees for in-state students (TuitionIn) vary across schools?

  4. Create a histogram of undergraduate enrollment (Enrollment). What general shape does the distribution have?

  5. Is there a correlation between average SAT scores (AvgSAT) and college completion rate (CompRate)?

  6. Create a boxplot of student loan debt (Debt). Are there any visible outliers?

  7. What proportion of schools are located in each U.S. region (Region)?

  8. Compare the distributions of in-state tuition (TuitionIn) between schools in the Northeast and Midwest regions using side-by-side boxplots.

  9. What is the correlation between median family income (MedIncome) and net price (NetPrice)?

  10. Create a barplot showing the average admission rate (AdmitRate) for each region (Region).

Analysis

We explore each of the ten selected questions in turn, starting by loading the dataset.

college = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
##                                  Name State     ID Main
## 1            Alabama A & M University    AL 100654    1
## 2 University of Alabama at Birmingham    AL 100663    1
## 3                  Amridge University    AL 100690    1
## 4 University of Alabama in Huntsville    AL 100706    1
## 5            Alabama State University    AL 100724    1
## 6           The University of Alabama    AL 100751    1
##                                                                Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
##   MainDegree HighDegree Control    Region Locale Latitude Longitude AdmitRate
## 1          3          4  Public Southeast   City 34.78337 -86.56850    0.9027
## 2          3          4  Public Southeast   City 33.50570 -86.79935    0.9181
## 3          3          4 Private Southeast   City 32.36261 -86.17401        NA
## 4          3          4  Public Southeast   City 34.72456 -86.64045    0.8123
## 5          3          4  Public Southeast   City 32.36432 -86.29568    0.9787
## 6          3          4  Public Southeast   City 33.21187 -87.54598    0.5330
##   MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1     18    929      0       4824   2.5  90.7      0.9   0.2   5.6      6.6
## 2     25   1195      0      12866  57.8  25.9      3.3   5.9   7.1     25.2
## 3     NA     NA      1        322   7.1  14.3      0.6   0.3  77.6     54.4
## 4     28   1322      0       6917  74.2  10.7      4.6   4.0   6.5     15.0
## 5     18    935      0       4189   1.5  93.8      1.0   0.3   3.5      7.7
## 6     28   1278      0      32387  78.5  10.1      4.7   1.2   5.6      7.9
##   NetPrice  Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1    15184 22886      9857     18236       9227        7298      6983
## 2    17535 24129      8328     19032      11612       17235     10640
## 3     9649 15080      6900      6900      14738        5265      3866
## 4    19986 22108     10280     21480       8727        9748      9391
## 5    12874 19413     11068     19396       9003        7983      7399
## 6    21973 28836     10780     28100      13574       10894     10016
##   FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1        71.3 71.0    23.96 1068   56.4     36.6      23.6
## 2        89.9 35.3    52.92 3755   63.9     34.1      34.5
## 3       100.0 74.2    18.18  109   64.9     51.3      15.0
## 4        64.6 27.7    48.62 1347   47.6     31.0      44.8
## 5        54.2 73.8    27.69 1294   61.3     34.3      22.1
## 6        74.0 18.0    67.87 6430   61.5     22.6      66.7

Q1: What is the average admission rate (AdmitRate) for U.S. colleges that primarily grant bachelor’s degrees?

mean(college$AdmitRate, na.rm = TRUE)
## [1] 0.6702025

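The average admission rate is approximately 67%, so a typical school in the dataset admits about two of every three applicants. The code averages over every college with a reported rate rather than filtering to bachelor’s-granting schools, and a mean alone does not show what share of schools fall above or below that figure. Overall, this indicates a generally moderate level of selectivity among 4-year colleges in the dataset.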
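The question asks about colleges that primarily grant bachelor’s degrees, while the call above averages every row. Below is a minimal sketch of restricting the mean to those schools. It assumes that MainDegree code 3 marks bachelor’s-granting institutions (an assumption about the coding, not confirmed in this report) and uses a small made-up data frame, `toy`, in place of `college`.

```r
# Hypothetical toy data standing in for `college`; all values are made up.
toy <- data.frame(
  MainDegree = c(3, 3, 2, 3, 4),
  AdmitRate  = c(0.90, 0.50, 0.80, NA, 0.30)
)

# Assumption: MainDegree == 3 means "primarily grants bachelor's degrees".
bachelors <- subset(toy, MainDegree == 3)

# na.rm = TRUE drops the school whose admission rate is missing.
mean_bach <- mean(bachelors$AdmitRate, na.rm = TRUE)
mean_bach
```

In the six rows printed by head(college), every school has MainDegree equal to 3, so this filter may change the result very little for this dataset.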

Q2: What is the median percentage of female students (Female) across all colleges in the dataset?

median(college$Female, na.rm = TRUE)
## [1] 59.15

The median percentage of female students is 59.15%, meaning half of the schools enroll a higher proportion of women and half a lower one. Because the median is well above 50%, women make up the majority of students at most institutions in the dataset.

Q3: How much do average tuition and fees for in-state students (TuitionIn) vary across schools?

var(college$TuitionIn, na.rm = TRUE)
## [1] 199665280
sd(college$TuitionIn, na.rm = TRUE)
## [1] 14130.3

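The standard deviation of in-state tuition is about $14,130, indicating substantial variation among colleges. The variance (about 199.7 million) is measured in squared dollars, so the standard deviation is the easier figure to interpret. Tuition costs vary widely, likely due to differences in state funding and institutional type.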
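As a rough illustration of how these measures of spread relate, the sketch below uses a handful of made-up tuition values (not taken from the dataset) to confirm that the standard deviation is the square root of the variance, and adds the range and interquartile range as complementary summaries.

```r
# Hypothetical tuition values in dollars (made up for illustration).
tuition <- c(5000, 9000, 12000, 30000, 45000, NA)

v <- var(tuition, na.rm = TRUE)   # variance, in squared dollars
s <- sd(tuition, na.rm = TRUE)    # standard deviation, in dollars

# The standard deviation is the square root of the variance.
all.equal(s, sqrt(v))

# Range and interquartile range give a complementary view of spread.
tuition_range <- range(tuition, na.rm = TRUE)
tuition_iqr   <- IQR(tuition, na.rm = TRUE)
```

The interquartile range is less affected by a few very expensive schools than the standard deviation, which can make it a useful companion figure.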

Q4: Create a histogram of undergraduate enrollment (Enrollment). What general shape does the distribution have?

hist(college$Enrollment, 
     main = "Distribution of Enrollment", 
     col = "skyblue", 
     xlab = "Enrollment", 
     breaks = 30)

The enrollment distribution is right-skewed, with most colleges enrolling relatively few students, and a few large universities enrolling tens of thousands. This suggests that small to mid-sized colleges dominate the higher education landscape.
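One simple numeric check for right skew is to compare the mean with the median. The sketch below uses made-up enrollment figures (not values from the dataset) to show how a single very large school pulls the mean well above the median.

```r
# Hypothetical enrollments: many small schools and one very large one.
enrollment <- c(300, 500, 800, 1200, 1500, 2000, 35000)

mean_enroll   <- mean(enrollment)
median_enroll <- median(enrollment)

# In a right-skewed distribution the long upper tail pulls the mean
# above the median.
mean_enroll > median_enroll
```

The same comparison could be run on college$Enrollment with na.rm = TRUE to back up what the histogram shows.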

Q5: Is there a correlation between average SAT scores (AvgSAT) and college completion rate (CompRate)?

cor(college$AvgSAT, college$CompRate, use = "complete.obs")
## [1] 0.8189495
plot(college$AvgSAT, college$CompRate, 
     xlab = "Average SAT", ylab = "Completion Rate",
     main = "SAT vs Completion Rate", col = "blue")

The correlation coefficient is about 0.82, showing a strong positive linear relationship between average SAT scores and completion rates. Schools that enroll higher-scoring students tend to have better graduation outcomes, possibly because of academic preparedness, although correlation alone does not establish the cause.
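To make clear what cor() with use = "complete.obs" computes, the sketch below reproduces the Pearson correlation from its definition on a few made-up values containing missing entries. The vectors `sat` and `comp` are hypothetical stand-ins for the dataset columns.

```r
# Hypothetical SAT averages and completion rates (made up), with gaps.
sat  <- c(900, 1000, 1100, NA, 1300, 1400)
comp <- c(30, 45, 50, 60, NA, 85)

# Keep only the rows where both values are present, which is what
# use = "complete.obs" does.
ok <- complete.cases(sat, comp)
x <- sat[ok]
y <- comp[ok]

# Pearson correlation computed from its definition.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The built-in function should agree with the manual calculation.
r_builtin <- cor(sat, comp, use = "complete.obs")
all.equal(r_manual, r_builtin)
```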

Q6: Create a boxplot of student loan debt (Debt). Are there any visible outliers?

boxplot(college$Debt, 
        main = "Boxplot of Student Loan Debt", 
        col = "tomato", horizontal = TRUE)

The boxplot reveals several outliers, representing colleges where students take on significantly higher or lower debt. This suggests that student borrowing varies widely, possibly due to institutional aid policies or cost of attendance.
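Outliers can also be listed numerically rather than read off the plot. The sketch below uses boxplot.stats(), which applies the 1.5 × IQR rule that boxplot() uses by default, on a few made-up debt values (not taken from the dataset).

```r
# Hypothetical debt values (made up); one is far below and one far above
# the rest.
debt <- c(100, 900, 1000, 1100, 1200, 1300, 1400, 6500)

# boxplot.stats() returns the five-number summary in $stats and the points
# beyond 1.5 * IQR from the box in $out.
stats <- boxplot.stats(debt)
outliers <- stats$out
outliers
```

Calling boxplot.stats(college$Debt)$out would list the flagged values for the real data, making it possible to count them and see whether they sit above or below the box.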

Q7: What proportion of schools are located in each U.S. region (Region)?

region_counts = table(college$Region)
region_percent = prop.table(region_counts) * 100
pie(region_percent, 
    main = "Proportion of Schools by Region", 
    col = rainbow(length(region_percent)))

The Northeast has the largest share of colleges (~27.4%), followed by the Midwest, Southeast, and West. This highlights regional clustering in the U.S. college system, with some areas hosting significantly more institutions.
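A pie chart does not show the percentages themselves, so it can help to print them as well. The sketch below uses a short made-up vector of region labels (a stand-in for college$Region) to sort and round the shares for display.

```r
# Hypothetical region labels (made up), standing in for college$Region.
region <- c("Northeast", "Midwest", "Northeast", "West",
            "Southeast", "Midwest", "Northeast", "West")

region_counts  <- table(region)
region_percent <- prop.table(region_counts) * 100

# Sort from largest to smallest share and round for display.
region_sorted <- round(sort(region_percent, decreasing = TRUE), 1)
region_sorted
```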

Q8: Compare the distributions of in-state tuition (TuitionIn) between schools in the Northeast and Midwest regions using side-by-side boxplots.

boxplot(TuitionIn ~ Region, data = subset(college, Region %in% c("Northeast", "Midwest")),
        main = "In-State Tuition: Northeast vs Midwest",
        col = c("lightblue", "lightgreen"))

Colleges in the Northeast tend to have higher in-state tuition than those in the Midwest, with greater variability. This indicates regional cost differences, potentially driven by local economic factors or institutional missions.
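The visual comparison can be backed up with per-region summaries. The sketch below computes the median and interquartile range by region with tapply(), using a small made-up data frame `toy` in place of `college`.

```r
# Hypothetical tuition data (made up) for the two regions being compared.
toy <- data.frame(
  Region    = c("Northeast", "Northeast", "Northeast",
                "Midwest", "Midwest", "Midwest"),
  TuitionIn = c(20000, 35000, 50000, 10000, 15000, 20000)
)

# Median per region: the center line of each box.
med_by_region <- tapply(toy$TuitionIn, toy$Region, median)

# Interquartile range per region: approximately the height of each box
# (boxplot() uses hinges, which can differ slightly from quantiles).
iqr_by_region <- tapply(toy$TuitionIn, toy$Region, IQR)

med_by_region
iqr_by_region
```

A larger median indicates higher typical tuition, and a larger interquartile range indicates greater variability within the region.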

Q9: What is the correlation between median family income (MedIncome) and net price (NetPrice)?

cor(college$MedIncome, college$NetPrice, use = "complete.obs")
## [1] 0.5151298
plot(college$MedIncome, college$NetPrice,
     xlab = "Median Family Income", ylab = "Net Price",
     main = "Income vs Net Price", col = "darkgreen")

The correlation between median family income and net price is about 0.52, showing a moderate positive relationship. Because each row is a school, this means that schools whose students come from higher-income families tend to have higher net prices, likely because those students receive less need-based aid.
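One way to compare the strength of the two correlations reported in this analysis is to square them. For a straight-line relationship, r² is the proportion of variance in one variable accounted for by the other. The sketch below uses only the two values printed above.

```r
# Correlations reported in this analysis.
r_sat_comp  <- 0.8189495   # AvgSAT vs CompRate (Q5)
r_inc_price <- 0.5151298   # MedIncome vs NetPrice (Q9)

# Squaring r gives the proportion of variance in one variable that a
# straight-line relationship with the other accounts for.
r2_sat_comp  <- r_sat_comp^2
r2_inc_price <- r_inc_price^2

round(c(r2_sat_comp, r2_inc_price), 2)
```

By this measure, median family income accounts for roughly a quarter of the variation in net price, compared with about two-thirds for average SAT and completion rate.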

Q10: Create a barplot showing the average admission rate (AdmitRate) for each region (Region).

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
region_admit = college %>%
  group_by(Region) %>%
  summarize(avg_admit = mean(AdmitRate, na.rm = TRUE))

barplot(region_admit$avg_admit, 
        names.arg = region_admit$Region, 
        col = "orchid", 
        main = "Average Admission Rate by Region", 
        ylab = "Admission Rate")

The Midwest region has the highest average admission rate (~69%), while the Southeast has the lowest (~65%). This implies varying levels of competitiveness across regions, with the Midwest being relatively more accessible.
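The same group means can be computed without dplyr, which also avoids the masking messages shown above. The sketch below is one base R alternative using tapply(), with a made-up data frame `toy` in place of `college`. Sorting the bars and fixing the axis to the 0 to 1 range are illustrative choices, not part of the original analysis.

```r
# Hypothetical admission rates (made up), standing in for `college`.
toy <- data.frame(
  Region    = c("Midwest", "Midwest", "Southeast", "Southeast", "West"),
  AdmitRate = c(0.70, 0.80, 0.60, NA, 0.50)
)

# tapply() returns group means named by region; na.rm is passed to mean().
avg_admit <- tapply(toy$AdmitRate, toy$Region, mean, na.rm = TRUE)

# Sorting first makes the highest and lowest regions easy to read off.
avg_admit <- sort(avg_admit, decreasing = TRUE)

# The names of avg_admit are used as bar labels automatically.
barplot(avg_admit,
        col = "orchid",
        main = "Average Admission Rate by Region",
        ylab = "Admission Rate",
        ylim = c(0, 1))
```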

Summary

In this project, we used data on U.S. colleges to explore important trends using statistical tools like mean, median, standard deviation, boxplots, histograms, barplots, pie charts, and correlation. Each question helped us understand a different part of the college experience, from costs to admissions to student outcomes.

We found that the average college admits about 67% of applicants, and the typical school has around 59% female students. In-state tuition costs vary a lot, with some schools charging much more than others, likely due to differences in state policies and whether a school is public or private. Most colleges have small to mid-sized enrollments, while a few very large ones bring the average up.

There is a strong positive relationship between SAT scores and graduation rates, showing that students at colleges with higher SAT averages are more likely to graduate. We also found a link between family income and net price, meaning wealthier students often pay more for college—likely because they qualify for less financial aid.

A boxplot of student loan debt showed that some students borrow much more than others, depending on the college. When we looked at regions, the Northeast had the most schools, and the Midwest had the highest average admission rate, making it more accessible. Tuition in the Northeast was generally higher than in the Midwest, and we saw big differences in costs between regions.

Overall, this project showed how simple graphs and summary statistics can help us better understand the differences between colleges. These tools made it easier to see patterns in tuition, enrollment, diversity, and outcomes that would be hard to notice just by reading numbers in a table.

Appendix: R Code Used

Q1: Average Admission Rate

mean(college$AdmitRate, na.rm = TRUE)

Q2: Median Percentage of Female Students

median(college$Female, na.rm = TRUE)

Q3: TuitionIn Variance and Standard Deviation

var(college$TuitionIn, na.rm = TRUE)
sd(college$TuitionIn, na.rm = TRUE)

Q4: Histogram of Enrollment

hist(college$Enrollment, 
     main = "Distribution of Enrollment", 
     col = "skyblue", 
     xlab = "Enrollment", 
     breaks = 30)

Q5: Correlation and Scatterplot of AvgSAT vs CompRate

cor(college$AvgSAT, college$CompRate, use = "complete.obs")
plot(college$AvgSAT, college$CompRate, 
     xlab = "Average SAT", ylab = "Completion Rate",
     main = "SAT vs Completion Rate", col = "blue")

Q6: Boxplot of Student Loan Debt

boxplot(college$Debt, 
        main = "Boxplot of Student Loan Debt", 
        col = "tomato", horizontal = TRUE)

Q7: Proportion of Schools by Region (Pie Chart)

region_counts = table(college$Region)
region_percent = prop.table(region_counts) * 100
pie(region_percent, 
    main = "Proportion of Schools by Region", 
    col = rainbow(length(region_percent)))

Q8: TuitionIn Boxplots for Northeast vs Midwest

boxplot(TuitionIn ~ Region, data = subset(college, Region %in% c("Northeast", "Midwest")),
        main = "In-State Tuition: Northeast vs Midwest",
        col = c("lightblue", "lightgreen"))

Q9: Correlation and Scatterplot of MedIncome vs NetPrice

cor(college$MedIncome, college$NetPrice, use = "complete.obs")
plot(college$MedIncome, college$NetPrice,
     xlab = "Median Family Income", ylab = "Net Price",
     main = "Income vs Net Price", col = "darkgreen")

Q10: Barplot of Average Admission Rate by Region

library(dplyr)
region_admit = college %>%
  group_by(Region) %>%
  summarize(avg_admit = mean(AdmitRate, na.rm = TRUE))

barplot(region_admit$avg_admit, 
        names.arg = region_admit$Region, 
        col = "orchid", 
        main = "Average Admission Rate by Region", 
        ylab = "Admission Rate")