Introduction

This project explores data on U.S. four-year colleges and universities from the CollegeScores4yr dataset. These data come from the U.S. Department of Education’s College Scorecard and include information about admission rates, test scores, costs, financial aid, graduation rates, faculty salaries, and student demographics for bachelor’s-granting institutions.

My goal is to use descriptive statistics from Chapter 6 to answer ten simple questions about these colleges. Each question involves at most one or two variables and focuses on topics such as total cost, net price, completion rate, types of institutions, and student characteristics. For each question, I use appropriate methods such as the mean, median, variance, standard deviation, correlation, and graphical tools like histograms, boxplots, barplots, and a pie chart.

Together, these summaries give an overview of how affordable different colleges are, how successful their students are, and how these characteristics vary by institution type and region.

Analysis

Each question below is answered with one or more of these methods (mean, median, variance, standard deviation, correlation, histogram, boxplot, barplot, and pie chart), applied to the colleges in the dataset.

Q1: What is the distribution of total cost (Cost)?

# Numerical summaries for Cost
summary(college$Cost)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    5950   21357   30699   34277   44876   72717     162
sd(college$Cost, na.rm = TRUE)
## [1] 15278.54
# Graphical summaries for Cost
hist(college$Cost,
     main = "Distribution of Total Cost of Attendance",
     xlab = "Total annual cost (tuition, room, board, etc.)")

boxplot(college$Cost,
        main = "Boxplot of Total Cost of Attendance",
        ylab = "Total annual cost")

Answer (Q1):
Total cost of attendance spans a wide range, from $5,950 to $72,717, with a standard deviation of about $15,279. The mean ($34,277) exceeds the median ($30,699) by roughly $3,600, which points to a right-skewed distribution: most colleges are moderately priced, and a smaller group of expensive institutions pulls the mean above the median. In the boxplot this skew shows up as a longer upper whisker rather than as outliers. Using the printed quartiles, the upper fence is 44,876 + 1.5 × 23,519 ≈ 80,155, which is above the maximum of 72,717, so the 1.5 × IQR rule flags no high outliers.
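
The skew and outlier statements can be checked against the printed summary alone. The sketch below is a minimal check, assuming the 1.5 × IQR convention that boxplot() uses by default and plugging in the quartiles from summary() (boxplot() itself works from hinges, which can differ slightly from these quartiles). The variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary(college$Cost) output above
cost_min  <- 5950
cost_q1   <- 21357
cost_med  <- 30699
cost_mean <- 34277
cost_q3   <- 44876
cost_max  <- 72717

# Interquartile range and the 1.5 * IQR fences
cost_iqr    <- cost_q3 - cost_q1
upper_fence <- cost_q3 + 1.5 * cost_iqr
lower_fence <- cost_q1 - 1.5 * cost_iqr

# A positive gap (mean above median) is consistent with right skew
mean_minus_median <- cost_mean - cost_med

# TRUE would mean at least one value lies beyond a fence
has_high_outlier <- cost_max > upper_fence
has_low_outlier  <- cost_min < lower_fence

print(c(iqr = cost_iqr, upper = upper_fence, lower = lower_fence))
print(c(high = has_high_outlier, low = has_low_outlier))
```

With these inputs the upper fence is 80,154.5, which is above the maximum of 72,717, so the rule flags no high outlier for Cost. The skew appears instead in the mean exceeding the median by 3,578.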


Q2: What is the typical net price (NetPrice) students pay?

# Numerical summaries for NetPrice
summary(college$NetPrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     923   14494   19338   19887   24443   55775     162
var(college$NetPrice, na.rm = TRUE)
## [1] 61686826
sd(college$NetPrice, na.rm = TRUE)
## [1] 7854.096
# Graphical summaries for NetPrice
hist(college$NetPrice,
     main = "Distribution of Net Price",
     xlab = "Net price (after average aid)")

boxplot(college$NetPrice,
        main = "Boxplot of Net Price",
        ylab = "Net price (after average aid)")

Answer (Q2):
The mean ($19,887) and median ($19,338) are close, so a typical net price is a little under $20,000 per year. The standard deviation of about $7,854 shows that net price differs by many thousands of dollars across schools. The histogram shows more colleges at lower-to-middle net prices and fewer at very high net prices. The maximum of $55,775 lies well above the upper fence of about $39,367 (Q3 + 1.5 × IQR), so the most expensive colleges appear as high outliers in the boxplot. Overall, the typical net price is moderate, but there is substantial variation in what students actually pay.
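
The variance and standard deviation printed above carry the same information on different scales. As a small sketch using only the printed values (the variable names are illustrative), taking the square root of the variance recovers the standard deviation, which is easier to interpret because it is in dollars rather than squared dollars.

```r
# Values copied from the var() and sd() output above
netprice_var <- 61686826
netprice_sd  <- 7854.096

# The standard deviation is the square root of the variance
sd_from_var <- sqrt(netprice_var)

# Any difference comes only from rounding in the printed output
rounding_gap <- abs(sd_from_var - netprice_sd)

print(sd_from_var)
print(rounding_gap)
```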


Q3: How variable are completion rates (CompRate) across colleges?

# Numerical summaries for completion rate
summary(college$CompRate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   38.18   52.45   52.14   66.67  100.00     167
var(college$CompRate, na.rm = TRUE)
## [1] 446.1695
sd(college$CompRate, na.rm = TRUE)
## [1] 21.12272
# Graphical summaries for completion rate
hist(college$CompRate,
     main = "Distribution of Completion Rates",
     xlab = "Completion rate (%)")

boxplot(college$CompRate,
        main = "Boxplot of Completion Rates",
        ylab = "Completion rate (%)")

Answer (Q3):
Completion rates vary widely from college to college, covering the full range from 0% to 100%. The mean (52.1%) and median (52.5%) nearly coincide, which suggests a roughly symmetric distribution around the middle. The middle half of colleges spans 38.2% to 66.7%, an interquartile range of about 28.5 percentage points, and the standard deviation is about 21.1 percentage points. Completion rate is therefore not tightly clustered around a single value; there is considerable spread in student success across institutions.


Q4: Is there a linear relationship between admission rate (AdmitRate) and completion rate (CompRate)?

# Correlation between admission rate and completion rate
cor(college$AdmitRate, college$CompRate, use = "complete.obs")
## [1] -0.3482341

Answer (Q4):
The correlation between admission rate and completion rate is about -0.35. The negative sign means that colleges with higher admission rates (less selective) tend to have lower completion rates, while more selective colleges (lower admission rates) tend to have higher completion rates.

The size of the correlation indicates a moderate negative linear relationship: as admission rate increases, completion rate tends to decrease, but admission rate alone is far from a perfect predictor of completion rate. Because use = "complete.obs" is set, only colleges with both values recorded enter the calculation.
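
The role of use = "complete.obs" in the cor() call can be seen on a small made-up example. The vectors below are toy values chosen for illustration, not taken from the dataset, and the names admit and comp are hypothetical.

```r
# Toy vectors (not from the dataset); each has one missing value
admit <- c(10, 20, 30, NA, 50)
comp  <- c(90, 80, NA, 60, 50)

# Without a 'use' argument, any NA makes the result NA
r_default <- cor(admit, comp)

# "complete.obs" drops every pair with a missing value,
# leaving (10, 90), (20, 80), and (50, 50)
r_complete <- cor(admit, comp, use = "complete.obs")

print(r_default)
print(r_complete)
```

The three remaining pairs lie exactly on a downward line, so the second call gives a correlation of -1, while the first call returns NA. This is why the missing values reported by summary() have to be handled before a correlation can be computed.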


Q5: Do public and private colleges differ in average in-state tuition (TuitionIn)?

# Filter to public vs private only
college_pub_priv = subset(college, Control %in% c("Public", "Private"))

# Numerical summaries by Control
tapply(college_pub_priv$TuitionIn,
       college_pub_priv$Control,
       mean, na.rm = TRUE)
##   Private    Public 
## 28703.695  9199.717
tapply(college_pub_priv$TuitionIn,
       college_pub_priv$Control,
       median, na.rm = TRUE)
## Private  Public 
##   29200    8498
# Boxplot comparing in-state tuition for public vs private colleges
boxplot(TuitionIn ~ Control,
        data = college_pub_priv,
        main = "In-State Tuition by Institution Control",
        xlab = "Control (Public vs Private)",
        ylab = "In-state tuition (dollars)")

Answer (Q5):
The mean and median tuition values by control, together with the side-by-side boxplot, show a clear difference between public and private colleges. Private institutions have a mean in-state tuition of about $28,704 and a median of $29,200, compared with about $9,200 and $8,498 for public institutions, so private tuition is roughly three times public tuition on both measures, although the boxplot shows some overlap between the groups. For-profit colleges are excluded from this comparison. This suggests that the type of institution (public vs private) is strongly related to how much a college charges in tuition.
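
The size of the public-private gap can be expressed both in dollars and as a ratio. The sketch below uses only the means and medians printed above; the variable names are illustrative.

```r
# Means and medians copied from the tapply() output above
tuition_mean   <- c(Private = 28703.695, Public = 9199.717)
tuition_median <- c(Private = 29200, Public = 8498)

# Absolute gap in dollars
gap_mean   <- unname(tuition_mean["Private"] - tuition_mean["Public"])
gap_median <- unname(tuition_median["Private"] - tuition_median["Public"])

# Relative gap: how many times larger private tuition is
ratio_mean   <- unname(tuition_mean["Private"] / tuition_mean["Public"])
ratio_median <- unname(tuition_median["Private"] / tuition_median["Public"])

print(round(c(gap_mean = gap_mean, gap_median = gap_median)))
print(round(c(ratio_mean = ratio_mean, ratio_median = ratio_median), 2))
```

The gap is about $19,500 to $20,700, and private tuition is a little over three times public tuition whether the mean or the median is used.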


Q6: How does average net price vary across regions (Region)?

# Mean net price by region
mean_netprice_region = tapply(college$NetPrice,
                              college$Region,
                              mean,
                              na.rm = TRUE)
mean_netprice_region
##   Midwest Northeast Southeast Territory      West 
## 19624.613 21912.643 18953.509  7562.362 20061.101
# Barplot of mean net price by region
barplot(mean_netprice_region,
        main = "Mean Net Price by Region",
        xlab = "Region",
        ylab = "Mean net price (dollars)",
        las = 2)

Answer (Q6):
Mean net price is not the same in every region. The Northeast has the highest mean (about $21,913), followed by the West ($20,061), the Midwest ($19,625), and the Southeast ($18,954); colleges in the Territory group are far lower at about $7,562. The four regions other than Territory therefore differ by only about $3,000, while the Territory group stands apart. Location is associated with typical affordability, but there is still variation within each region that the regional averages alone do not capture.
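
Ranking the regional means makes the comparison easier to read than the alphabetical order that tapply() returns. This is a minimal sketch that re-enters the printed means as a named vector; the helper names ranked, others, and spread_others are illustrative.

```r
# Regional means copied from the tapply() output above
mean_netprice_region <- c(Midwest = 19624.613, Northeast = 21912.643,
                          Southeast = 18953.509, Territory = 7562.362,
                          West = 20061.101)

# Order regions from most to least expensive
ranked <- sort(mean_netprice_region, decreasing = TRUE)
print(round(ranked))

# Spread among the regions other than Territory
others <- ranked[names(ranked) != "Territory"]
spread_others <- max(others) - min(others)
print(round(spread_others))
```

Passing a sorted vector such as ranked to barplot() would draw the bars in the same descending order.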


Q7: Do colleges in different locales (Locale) tend to have different enrollments (Enrollment)?

# Numerical summaries of enrollment by locale (median)
median_enroll_locale = tapply(college$Enrollment,
                              college$Locale,
                              median,
                              na.rm = TRUE)
median_enroll_locale
##   City  Rural Suburb   Town 
## 1904.0  825.0 1669.5 1756.0
# Boxplot of enrollment by locale
boxplot(Enrollment ~ Locale,
        data = college,
        main = "Enrollment by Locale",
        xlab = "Locale",
        ylab = "Enrollment",
        las = 2)

Answer (Q7):
The boxplot compares the distribution of enrollments across locales (City, Suburb, Town, and Rural). City colleges have the largest median enrollment (1,904), followed by Town (1,756) and Suburb (1,669.5), while Rural colleges are much smaller, with a median of 825. The City, Town, and Suburb medians are fairly close to one another, so the main contrast is between rural colleges and all others. Overall, typical enrollment is associated with locale: the median rural college is less than half the size of the median city college.


Q8: What proportion of colleges are public, private, or for-profit (Control)?

# Frequency and proportion of institution types
control_counts = table(college$Control)
control_counts
## 
## Private  Profit  Public 
##    1243     170     599
prop.table(control_counts)
## 
##    Private     Profit     Public 
## 0.61779324 0.08449304 0.29771372
# Pie chart of institution types
pie(control_counts,
    main = "Proportion of Colleges by Control Type",
    labels = paste0(names(control_counts), " (",
                    round(100 * prop.table(control_counts), 1), "%)"))

Answer (Q8):
The table() output and pie chart show how many colleges in the dataset fall into each control type. Private colleges make up the largest share (1,243 of 2,012, about 61.8%), public colleges are next (599, about 29.8%), and for-profit colleges, labeled Profit in the data, are least common (170, about 8.4%). The pie chart summarizes this mix of institution types among the four-year colleges in the dataset.
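
The shares can be reproduced from the printed counts alone. The sketch below re-enters the counts as a named vector (an assumption made so the example runs without the data file) and applies the same prop.table() step used above.

```r
# Counts copied from the table(college$Control) output above
control_counts <- c(Private = 1243, Profit = 170, Public = 599)

# Total number of colleges with a recorded control type
total_colleges <- sum(control_counts)

# prop.table() on a plain numeric vector divides each entry by the total
control_props <- prop.table(control_counts)

# Percentages rounded to one decimal, largest share first
control_pct <- sort(round(100 * control_props, 1), decreasing = TRUE)

print(total_colleges)
print(control_pct)
```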


Q9: What is the distribution of the percentage of students receiving Pell grants (Pell)?

# Numerical summaries for Pell grant percentage
summary(college$Pell)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   25.60   35.70   37.85   47.45   97.60       5
sd(college$Pell, na.rm = TRUE)
## [1] 17.88267
# Graphical summaries for Pell grant percentage
hist(college$Pell,
     main = "Distribution of Pell Grant Percentage",
     xlab = "Percent of students receiving Pell grants")

boxplot(college$Pell,
        main = "Boxplot of Pell Grant Percentage",
        ylab = "Percent of students receiving Pell grants")

Answer (Q9):
The histogram and boxplot show the distribution of the percentage of students who receive Pell grants at each college. The middle half of colleges falls between 25.6% and 47.45%, with a median of 35.7% and a slightly higher mean of 37.85%, which suggests mild right skew. Values range from 0% to 97.6%, and the standard deviation of about 17.9 percentage points indicates that Pell participation varies substantially from campus to campus, reflecting differences in the financial need of the student populations that different colleges serve.


Q10: Is there a linear relationship between median family income (MedIncome) and average student debt (Debt)?

# Correlation between median family income and student debt
cor(college$MedIncome, college$Debt, use = "complete.obs") 
## [1] -0.1207221

Answer (Q10):
The correlation between median family income and average student debt tells us whether colleges whose students come from higher-income families tend to have higher or lower average debt. A positive correlation would mean that colleges with higher median family incomes also have higher average debt, while a negative correlation would mean that they tend to have lower average debt.

The correlation printed by cor() is about -0.12: negative, but close to 0. The linear relationship is therefore weak. Colleges with higher median family incomes tend to have slightly lower average debt, but family income accounts for very little of the variation in debt, and other factors (such as tuition, aid policies, and student choices) likely play a larger role in how much students borrow.
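
One common way to judge the strength of a correlation is to square it, which gives the share of variance in one variable that a straight-line relationship with the other accounts for. The sketch below applies this to the two correlations printed in this report; the squared-correlation interpretation is an addition for illustration, and the variable names are hypothetical.

```r
# Correlations copied from the cor() output for Q4 and Q10
r_admit_comp  <- -0.3482341
r_income_debt <- -0.1207221

# Squared correlation: share of variance accounted for by a linear fit
r_squared <- c(admit_comp = r_admit_comp^2, income_debt = r_income_debt^2)

print(round(r_squared, 3))
```

The income-debt correlation corresponds to roughly 1.5% of shared variance, compared with roughly 12% for admission rate and completion rate, which is why the former reads as weak and the latter as moderate.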


Summary

In this project, I used descriptive statistics and simple graphs to study four-year colleges in the CollegeScores4yr dataset. For costs, the histograms and boxplots showed that both total cost and net price are right-skewed: most colleges are moderately priced, but a smaller group of institutions is much more expensive. The spread of these cost variables is large (standard deviations of about $15,279 for total cost and $7,854 for net price), indicating big differences in affordability across schools.

Student outcomes also vary. Completion rates range from low to high, with a fairly large spread, and the correlation analysis suggests a moderate negative relationship between admission rate and completion rate: more selective colleges tend to have higher completion rates. The analysis of in-state tuition by control type showed that private colleges usually charge more than public institutions, and the barplot of net price by region suggested that typical affordability differs across regions.

Other summaries highlighted structural differences among institutions. The boxplot of enrollment by locale showed that typical school size differs by where the college is located. The pie chart of institution control summarized how common public, private, and for-profit colleges are in the dataset. Finally, the distribution of Pell grant percentages showed that student financial need varies widely across schools, while the weak negative correlation (about -0.12) between median family income and average student debt showed that family income alone says little about how much students borrow.

Overall, the Chapter 6 tools (numerical summaries such as the mean, median, variance, and standard deviation; correlations; and basic plots such as histograms, boxplots, barplots, and a pie chart) provided a clear picture of how costs, outcomes, and student characteristics differ among U.S. four-year colleges.