Introduction

This report aims to explore Descriptive Statistics through the use of R and a data set consisting of 37 variables utilized as a scorecard system for 4-year colleges and universities. This information was sourced from the US Department of Education’s College Scorecards and aggregated and stored at https://www.lock5stat.com/datapage3e.html, with detailed descriptions of each variable located at https://www.lock5stat.com/datasets3e/Lock5DataGuide3e.pdf.

I proposed 10 questions independently, and then utilized a popular Large Language Model (LLM) to propose another 10 questions. The purpose of this was to compile a final list of 10 questions chosen from the two sets to form the foundational structure of my exploratory activities.

The 10 questions that I proposed were:

What is the median family income of students at public, private, and for-profit universities?
What is the mean admission rate among all institutions?
What is the variance in the completion rate between public, private, and for-profit institutions?
What is the standard deviation of net cost among public, private, and for-profit institutions?
Is there a correlation between the percentage of enrolled first-generation students and program completion rate?
What is the distribution of test scores - both ACT and SAT - across all institutions?
How do Faculty Salaries vary by state?
What is the proportion of private vs. public vs. for-profit institutions?
What is the average percentage of students receiving Pell Grants per state?
What is the median difference between in-state and out-of-state tuition by region?

The 10 questions proposed by my chosen LLM (ChatGPT, model GPT-4o) were:

What is the average admission rate across all colleges?
What is the median student enrollment across all colleges?
How much do faculty salaries vary across different institutions?
How consistent are the net prices of colleges?
Is there a relationship between median family income and student debt?
What is the distribution of SAT scores among colleges?
How do tuition fees vary across states?
What is the average Pell Grant recipient percentage for different regions?
What proportion of colleges are public vs. private?
How much do college completion rates fluctuate among institutions?

ChatGPT generated these questions from the prompt: “Suggest ten simple questions from diverse perspectives that can be addressed using descriptive statistical methods. The methods can be mean, median, variance, standard deviation, correlation, histogram, boxplot, barplot, or pie chart. Don’t choose a question that involves more than two variables.”

After reviewing the questions that I had posited and what had been generated by the LLM, I settled on 10 questions for this exploratory activity.

The Final 10:

What is the mean admission rate among all institutions?
What is the median student enrollment across all colleges?
What is the variance in the completion rate between public, private, and for-profit institutions?
How consistent are the net prices of colleges?
What is the distribution of test scores - both ACT and SAT - across all institutions?
Is there a relationship between median family income and student debt?
How do in-state tuition fees vary across states?
What is the proportion of private vs. public vs. for-profit institutions?
What is the average percentage of students receiving Pell Grants per state?
What is the median family income of students at public, private, and for-profit universities?

Analysis

In this section, we will be exploring each of the 10 questions in more detail. I did not necessarily focus on any one aspect of the scorecards, but rather asked questions which I thought would be insightful on their own, or when comparing the data of each type of institution.

Question 1

What is the mean admission rate across all colleges?

The first question we asked is easily answered by simply averaging the “AdmitRate” column of the provided data. When we do this we see that the mean admission rate across all institutions is 67.02 percent of applicants.

But when we look a little further, we can see that:

–The mean admission rate for Private institutions is: 64.996 percent of applicants;

–The mean admission rate for Profit institutions is: 74.489 percent of applicants;

–The mean admission rate for Public institutions is: 70.080 percent of applicants;

I found this result to be more insightful as it shows that public and for-profit institutions are actually less accepting than many might think, while private institutions, on average, are not as restrictive in their admissions as their reputations might lead us to believe. I was actually very surprised by the result of the admission rate for for-profit (Profit) institutions: less than 75% of applicants.
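The overall mean sits between the group means because it is a count-weighted average of them, so the control type with the most institutions pulls it hardest. The following is a minimal sketch of that relationship in base R. The data frame `toy` and all of its values are invented for illustration; only the column names `Control` and `AdmitRate` follow the data set.

```r
# Hypothetical toy data: admission rates (as proportions) for six institutions.
# Column names mirror the report's data set; the values are invented.
toy <- data.frame(
  Control   = c("Private", "Private", "Private", "Public", "Public", "Profit"),
  AdmitRate = c(0.60, 0.70, 0.65, 0.72, 0.68, 0.75)
)

# Mean admission rate and institution count within each control type
group_means  <- tapply(toy$AdmitRate, toy$Control, mean, na.rm = TRUE)
group_counts <- table(toy$Control)

# Overall mean computed directly from every row
overall_mean <- mean(toy$AdmitRate, na.rm = TRUE)

# The same value rebuilt as a count-weighted average of the group means
weights       <- as.numeric(group_counts[names(group_means)])
weighted_mean <- sum(as.numeric(group_means) * weights) / sum(weights)

print(round(100 * group_means, 1))
print(round(100 * c(overall = overall_mean, weighted = weighted_mean), 1))
```

With missing admission rates, the weights would need to count only the non-missing rows in each group for the two values to agree.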

Question 2

What is the median student enrollment across all colleges?

The median enrollment for all institutions is: 1722 students. However, when we once again look at the data by institution control type, we see the following:

–The median enrollment for Private institutions is: 1275.5 students;

–The median enrollment for Profit institutions is: 509.5 students;

–The median enrollment for Public institutions is: 6622.0 students;

When I asked this question, I had an assumption in mind about how the results would skew. I assumed that the median enrollment for for-profit institutions would be quite low compared with the other two control types, as these schools tend to carry a stigma and often offer somewhat niche (but relevant) degrees. I was quite surprised to see that the disparity between private and public institutions was as great as it was. I initially expected something closer to a 2:1 ratio of students enrolled at public institutions over private, so a ratio of more than 5:1 was quite shocking to me. I suspect this disparity is due largely to religiously based institutions being overwhelmingly private, and such institutions can be exceptionally restrictive with regard to enrollment. I chose to analyze this question using the median rather than the mean in an effort to mitigate the impact of significant outlier values, but a quick review of the data set shows that there are 145 institutions with an enrollment of 150 students or fewer.
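To illustrate why the median is less sensitive to extreme values than the mean, and how a count of very small institutions can be obtained, here is a minimal base R sketch. The `enrollment` vector is invented for illustration and is not taken from the data set.

```r
# Hypothetical enrollments: mostly small schools plus one very large one
enrollment <- c(120, 150, 900, 1300, 1700, 2500, 45000, NA)

# The mean is pulled upward by the single large value
mean_enr <- mean(enrollment, na.rm = TRUE)

# The median is the middle value and ignores how large the maximum is
median_enr <- median(enrollment, na.rm = TRUE)

# Count of institutions at or below a small-enrollment threshold
n_small <- sum(enrollment <= 150, na.rm = TRUE)

print(c(mean = mean_enr, median = median_enr, n_small = n_small))
```

Applied to the real data, the same `sum(... <= 150, na.rm = TRUE)` pattern on the `Enrollment` column would reproduce the count of small institutions mentioned above.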

Question 3

What is the variance in completion rate between public, private, and for-profit schools?

Control	Variance	Average
Private	445.905	55.681
Profit	417.920	29.417
Public	305.820	50.201

This result is interesting. It shows that while private institutions have the highest average completion rate, that rate also varies more from institution to institution than it does at public schools. Public schools have a somewhat lower average completion rate, but they appear to be much more consistent in their results. The for-profit institutions fare the worst: their average completion rate is significantly lower, and it also varies widely from institution to institution, although slightly less than at private institutions.
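Variance is expressed in squared units, so it can be easier to compare the groups after converting to standard deviations (in percentage points) and to the coefficient of variation (spread relative to the group mean). The sketch below does this using only the figures from the table above; the object name `completion` is my own.

```r
# Variances and means copied from the table above (completion rate, in percent)
completion <- data.frame(
  Control  = c("Private", "Profit", "Public"),
  Variance = c(445.905, 417.920, 305.820),
  Average  = c(55.681, 29.417, 50.201)
)

# Standard deviation is the square root of the variance (percentage points)
completion$SD <- sqrt(completion$Variance)

# Coefficient of variation: spread relative to the group's own mean
completion$CV <- completion$SD / completion$Average

print(cbind(completion["Control"], round(completion[, c("SD", "CV")], 2)))
```

Relative to their own averages, for-profit institutions show the largest spread, even though their raw variance is slightly below that of private institutions.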

Question 4

How consistent are the net prices of colleges?

Control	Average Cost	Standard Deviation
Overall	19886.82	7854.10
Private	22259.02	7853.95
Profit	23309.99	7132.69
Public	14295.06	4359.57

The results of this analysis were mostly unsurprising to me. Public schools are often subsidized by state governments in order to encourage individuals to attend higher learning institutions, and public universities generally need to be more accessible. Together, the lower average cost and (in my opinion, significantly) lower standard deviation mean that roughly 90 percent of public schools would fall below the overall average cost, assuming a normal distribution, since the overall average sits about 1.28 public standard deviations above the public average. What did surprise me, albeit slightly, was that for-profit institutions had a lower standard deviation than private ones. Given the nature of for-profit institutions, I had originally assumed that there would be a wider distribution in overall cost than at private institutions.
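The share of public schools expected to fall below the overall average can be checked directly from the table values. The sketch below relies on the assumption that net price is normally distributed within the public group, which the data may not satisfy; the variable names are my own.

```r
# Summary values copied from the net price table above (US dollars)
overall_mean <- 19886.82
public_mean  <- 14295.06
public_sd    <- 4359.57

# How many public standard deviations the overall mean lies above the public mean
z <- (overall_mean - public_mean) / public_sd

# Share of public institutions expected below the overall mean,
# ASSUMING net price is normally distributed within the public group
share_below <- pnorm(overall_mean, mean = public_mean, sd = public_sd)

print(round(c(z = z, share_below = share_below), 3))
```

An empirical check on the actual data, such as the proportion of public rows with `NetPrice` below the overall mean, would not depend on the normality assumption.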

Question 5

What is the distribution of test scores among institutions?

The results of this particular analysis were very insightful to me. When positing this question originally, I expected to see higher frequencies of middling test scores among public institutions, but was shocked to see that private institutions actually had a higher frequency of such scores. It was quite unsurprising to see that very high test scores appeared exclusively at private institutions. Similarly, I was not shocked to see that very few test scores overall were reported for for-profit institutions, as I expected such places to largely disregard the results of standardized admissions tests.

Question 6

Is there a relationship between median family income and student debt?

While this question may seem like a “no-brainer” initially, the results of the analysis tell a story that one might not expect to find as the result of such a question. Students from private institutions seem to overwhelmingly accrue little or no debt, regardless of family income level. I suspect that this may be related significantly to academic success and scholarships, but that data is not available to me and I’d rather not conjecture. In contrast, at all levels of family income, there appear to be significantly more public institutions that do not follow the overall trend line, indicating that students who attend public institutions are more likely to accrue debt over the course of their education. It is also worth mentioning that at median family incomes of approximately $45,000 and above, there appear to be no for-profit institutions whose debt levels depart from the overall trend. I did find this result interesting, because I expected there to be significantly higher levels of student debt at all income levels until $100,000 or greater.
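One way to put numbers on "following the trend line" is to compute the correlation and then look at each institution's residual, meaning its vertical distance from the fitted line. The sketch below is illustrative only: the data frame `toy` and its values are invented, and only the column names `Control`, `MedIncome`, and `Debt` follow the data set.

```r
# Hypothetical toy data; column names follow the report, values are invented
toy <- data.frame(
  Control   = c("Private", "Private", "Public", "Public", "Profit", "Profit"),
  MedIncome = c(30000, 80000, 35000, 60000, 25000, 40000),
  Debt      = c(12000, 20000, 16000, 19000, 11000, 15000)
)

# Pearson correlation between median family income and debt
r <- cor(toy$MedIncome, toy$Debt, use = "complete.obs")

# Least-squares line, the same kind of fit geom_smooth(method = "lm") draws
fit <- lm(Debt ~ MedIncome, data = toy)

# Residual = observed debt minus the debt predicted by the trend line
toy$Residual <- resid(fit)

# Typical distance from the trend line within each control type
mean_abs_resid <- tapply(abs(toy$Residual), toy$Control, mean)

print(round(r, 3))
print(round(mean_abs_resid, 0))
```

A larger mean absolute residual for one control type would support the visual impression that its institutions stray further from the overall trend.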

Question 7

How do in-state tuition fees vary across states?

For this question, I chose to limit the analysis to only in-state tuition as I felt that the restriction would help with reducing the effort necessary to create the function for analyzing the data, as well as keep the plot from being too complex or too crowded to easily analyze. Wyoming immediately stood out to me, prompting me to do a quick review of the data set. There is only one institution in Wyoming that is included in the data set, which quickly explains the resulting box which was(n’t) created. For the most part, I think the results paint a somewhat obvious picture: States with lower median tuition fees generally tend to have more institutions with tuition rates higher than the median, resulting in a significantly larger upper whisker and quartile, while the inverse also seems to generally be true. I found the similarity in variance among states to be a good indicator of accessibility. While a high variance can be frustrating for prospective students trying to “shop” for the right center of higher education, it also shows that we, as a society, have done rather well to ensure that there are an adequate number of institutions available to those who may be less financially privileged, even before accounting for assistance programs.
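The collapsed box for a state with a single institution follows from how a boxplot is built: with one observation, the whiskers, quartiles, and median all coincide. A minimal base R sketch, using an invented data frame `toy` whose state codes and tuition values are purely illustrative:

```r
# Hypothetical toy data: one state with a single institution, one with several
toy <- data.frame(
  State     = c("WY", "OH", "OH", "OH", "OH", "OH"),
  TuitionIn = c(5000, 8000, 9500, 11000, 30000, 45000)
)

# Number of institutions per state; a count of 1 cannot produce a real box
n_by_state <- table(toy$State)

# Five-number summary (lower whisker, Q1, median, Q3, upper whisker) per state
stats_by_state <- lapply(split(toy$TuitionIn, toy$State),
                         function(x) boxplot.stats(x)$stats)

print(n_by_state)
print(stats_by_state)
```

For the single-institution state, all five values are identical, so the plot shows only a line where the box would be.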

Question 8

What is the proportion of institutions that are private vs. public vs. for-profit?

While I wish that this were a question that I could explore or expand upon more, it was truly just one that I was curious about. I was slightly surprised to see that there are more than 2 private institutions for every public one, but it does also make sense to me that this would be the case.
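The proportions behind a pie chart, and a ratio such as private institutions per public institution, can be read straight from a frequency table. The sketch below uses an invented `control` vector; the counts are not the real ones.

```r
# Hypothetical toy labels; the real values come from the Control column
control <- c(rep("Private", 5), rep("Public", 2), rep("Profit", 1))

# Frequency of each control type and its share of the total
counts <- table(control)
props  <- prop.table(counts)

# Ratio of private to public institutions
private_per_public <- counts[["Private"]] / counts[["Public"]]

print(round(100 * props, 1))
print(private_per_public)
```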

Question 9

What is the average percentage of students receiving Pell Grants by state or territory?

When I asked this question, I truly wasn’t sure what to expect as a result. Seeing that the three top recipient territories were in fact not states was an outcome I had not previously considered. Mississippi being the top (actual) state for Pell Grant receipt was not a surprise to me; Mississippi has a rather weak GDP and relies heavily on federal assistance. What was slightly surprising, however, was that fewer than half of students in 35 out of 50 states, as well as D.C., receive Pell Grants to assist with their tuition and educational costs. This shouldn’t be surprising, as Pell Grants are intended for students with a need for greater financial assistance than the average, yet it is a topic that has piqued my interest further and would be interesting to explore in the future.
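Counting how many states or territories fall below the 50 percent mark is a short step once the per-state averages are available. The sketch below assumes `Pell` is stored on a 0 to 100 percent scale. The data frame `toy`, its state codes, and its values are invented for illustration.

```r
# Hypothetical toy data; Pell is ASSUMED to be on a 0-100 percent scale
toy <- data.frame(
  State = c("MS", "MS", "PR", "PR", "VT", "VT"),
  Pell  = c(55, 61, 80, 76, 22, 30)
)

# Average Pell Grant percentage per state or territory, highest first
avg_pell <- sort(tapply(toy$Pell, toy$State, mean, na.rm = TRUE),
                 decreasing = TRUE)

# Number of states or territories where fewer than half of students receive Pell
n_below_half <- sum(avg_pell < 50)

print(avg_pell)
print(n_below_half)
```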

Question 10

What is the median family income of students at public, private, and for-profit universities?

Median Family Income by Institution Type

When posing this question, I initially wondered if we would see public institutions having the lowest median family income due to their general availability and access to significant subsidies and tuition assistance programs. I am shocked to find that it is actually for-profit institutions which have the lowest median income. My initial follow-up question to these results is to wonder if for-profit institutions trend towards predatory behavior focused on lower-income families who are less likely to exhibit financial literacy skills and therefore more likely to take on loans with high interest rates in order to pay for their education. This thought is, of course, entirely speculative. I was pleasantly surprised to see that there is not a significant disparity in median income between the public and private institutions, which signifies to me that both institution types are admitting students from a wide variety of familial incomes.

Conclusion

While certainly not comprehensive, this was an interesting and insightful exploration through some of the data and statistics around colleges and universities in the United States. I was surprised by some of the results of the analysis, which is a good thing. We need to have our biases and preconceived ideas around things shaken from time to time, and data analysis is a wonderful tool for doing just that. We also need to take care to be responsible when interpreting the outputs and forming our hypotheses so that we do not inadvertently frame the results in a dishonest or manipulative way. Statistical analysis is a powerful, powerful tool, but it can be used to paint a single setting in a nearly infinite number of ways. As the saying goes: “All models are wrong, some just happen to be useful.”

Appendix

A list of all code chunks used for data processing in each question:

#Question 1:
  #   ```{r Admission Rates Mean, message=FALSE, warning=FALSE,
  #       echo=FALSE, results='asis'}
  #   # Function to calculate the mean admission rate
  #   mean_admission_rate <- function(data) {
  #     # Use the function argument rather than the global data frame
  #     mean_rate <- mean(data$AdmitRate, na.rm = TRUE)
  #     return(paste("The mean admission rate is:",
  #       round(mean_rate * 100, 3), "percent of applicants."))
  #   }
  # 
  # # Calculate mean admission rate and output result
  #   output_text <- mean_admission_rate(college_data)
  #   invisible(cat(output_text))
  #   ```
  #   ```{r Admission by Control, message=FALSE, warning=FALSE, echo=FALSE, results='asis'}
  #   # Function to calculate the mean admission rate by Control type
  #   mean_admission_rate_by_control <- function(data) {
  #     admission_rate_by_control <- data %>%
  #       group_by(Control) %>%
  #       summarise(mean_admission_rate = mean(AdmitRate, na.rm = TRUE)) %>%
  #       mutate(mean_admission_rate = round(mean_admission_rate * 100, 3))  # Round to 3 decimal places
  #     
  #     result_text <- apply(admission_rate_by_control, 1, function(row) {
  #       paste("The mean admission rate for", row["Control"], "institutions is:", row["mean_admission_rate"], "percent of applicants.")
  #     })
  #     
  #     return(result_text)
  #   }
  #   
  #   # Output the admission rates by control type
  #   control_adm_text <- mean_admission_rate_by_control(college_data)
  #   writeLines(control_adm_text, sep = "\n")
  #   ```
#Question 2:
  # ```{r Enrollment Rates Median, message=FALSE, warning=FALSE, echo=FALSE}
  # # Function to calculate the median student enrollment
  # median_enrollment <- function(data) {
  #   median(data$Enrollment, na.rm = TRUE)  # use the argument, not the global object
  # }
  # 
  # median_enrollment_by_control <- function(data) {
  #   enroll_rate <- data %>%
  #     group_by(Control) %>%
  #     summarise(median_enrollment_control = median(Enrollment, na.rm = TRUE))
  # 
  # 
  #   result_text <- apply(enroll_rate, 1, function(row) {
  #     paste("  --The median enrollment for", row["Control"], "institutions is:", row["median_enrollment_control"], "students;\n")
  #     })
  #   return(result_text)
  # }
  # # Calculate median student enrollment
  # median_enr_result <- median_enrollment(college_data)
  # median_enr_control_result <- median_enrollment_by_control(college_data)
  # 
  # cat(paste("The median enrollment for all institutions is:", median_enr_result, "students.\n"))
  # cat(paste("However, when we once again look at the data by institution control type, we see the following:\n"))
  # writeLines(median_enr_control_result)
  # ```

#Question 3:
  # ```{r Completion Rate Variance, message=FALSE, warning=FALSE, echo=FALSE, results='asis'}
  # # # Calculate variance in completion rate by institution type
  # # variance_completion_rate(data)
  # library(knitr)  # For table formatting
  # library(kableExtra)
  # # Function to calculate variance in completion rate by institution type
  # variance_completion_rate <- function(data) {
  #   data %>%
  #     group_by(Control) %>%
  #     summarise(
  #       Variance = var(CompRate, na.rm = TRUE),
  #       Average = mean(CompRate, na.rm = TRUE)  # New column for mean completion rate
  #     )  # Capitalized column name for readability
  # 
  # }
  # 
  # # Calculate variance in completion rate by institution type and format output as a table
  # variance_table <- variance_completion_rate(college_data)
  # kable(variance_table, format = "html", digits = 3) %>% # Display with 3 decimal places
  #   kable_styling(full_width = FALSE) %>%  # Keep a nice table width
  #   column_spec(1, width = "20px")  # Set column width for more spacing
  # ```

#Question 4:
  # ```{r Net Price Std Dev, message=FALSE, warning=FALSE, echo=FALSE, results='asis'}
  # # # Function to calculate the standard deviation of net prices (measure of consistency)
  # library(dplyr)
  # library(knitr)
  # library(kableExtra)
  # 
  # # Function to calculate standard deviation of NetPrice
  # net_price_consistency <- function(data) {
  #   overall_sd <- sd(data$NetPrice, na.rm = TRUE)  # Overall standard deviation
  #   overall_mean <- mean(data$NetPrice, na.rm = TRUE)
  #   
  #   control_stats <- data %>%
  #     group_by(Control) %>%
  #     summarise(
  #       `Average Cost` = mean(NetPrice, na.rm = TRUE),
  #       `Standard Deviation` = sd(NetPrice, na.rm = TRUE)
  #       )  # SD for each Control type
  #   
  #   # Combine overall and grouped SDs into a formatted table
  #   overall_row <- data.frame(Control = "Overall", "Average Cost" = overall_mean, "Standard Deviation" = overall_sd)
  #   colnames(overall_row) <- colnames(control_stats)
  #   final_table <- bind_rows(overall_row, control_stats)
  #   
  #   return(final_table)
  # }
  # 
  # # Generate and display the table
  # net_price_table <- net_price_consistency(college_data)
  # 
  # kable(net_price_table, format = "html", digits = 2) %>%
  #   kable_styling(full_width = FALSE) %>%
  #   column_spec(1, width = "15px")  # Adjust spacing for readability
  # ```

#Question 5:
  # ```{r Test Score Dist Histo, message=FALSE, warning=FALSE,
  # echo=FALSE}
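  # # pivot_longer() comes from tidyr; load it here in case the setup chunk does not
  # library(tidyr)
  # # Function to create layered histograms of test scores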
  # test_score_distribution <- function(data) {
  #   # Reshape data to long format for plotting both test scores
  #   data_long <- data %>%
  #     select(MidACT, AvgSAT, Control) %>%
  #     pivot_longer(cols = c(MidACT, AvgSAT), names_to =
  #        "TestType", values_to = "Score")
  #   
  #   # Plot histogram with different test scores
  #   ggplot(data_long, aes(x = Score, fill = Control)) +
  #     # Transparent overlapping bins
  #     geom_histogram(position = "identity", alpha = 0.6, bins = 30) +  
  #     # Separate ACT & SAT with different x scales
  #     facet_wrap(~TestType, scales = "free_x") +  
  #     labs(title = "Distribution of Test Scores by Institution Type",
  #          x = "Test Score", y = "Frequency", fill = "Institution Type") +
  #     theme_minimal()
  # }
  # 
  # # Call function with dataset
  # test_score_distribution(college_data)
  # ```

#Question 6:
  # ```{r MedInc rel StudDebt Plot,
  #   message=FALSE, warning=FALSE, echo=FALSE}
  # # Function to create a scatter plot showing the relationship
  # income_debt_relationship <- function(data) {
  #   # Plot the data frame passed in, not the global object
  #   ggplot(data, aes(x = MedIncome,
  #                    y = Debt)) +
  #     geom_point(aes(color = Control)) +
  #     geom_smooth(method = "lm", se = FALSE,
  #                 color = "black") +
  #     labs(
  #       title = "Relationship between Median Family Income and Student Debt",
  #          x = "Median Family Income", y = "Student Debt") +
  #     theme_minimal()
  # }
  # 
  # # Create scatter plot of income vs debt
  # income_debt_relationship(college_data)
  # ```
#Question 7:
  # ```{r I-S Tuition Variance, message=FALSE, fig.width=10, fig.height=10, 
  #     warning=FALSE, echo=FALSE}
  # # Function to create a boxplot of tuition fees by state
  # tuition_by_state <- function(data) {
  #   data %>%
  #     # Sort by median tuition
  #     mutate(State = reorder(State, TuitionIn, median, na.rm = TRUE)) %>%  
  #     ggplot(aes(x = State, y = TuitionIn)) +
  #     # Highlight outliers
  #     geom_boxplot(outlier.color = "red", outlier.shape = 16, outlier.size = 2) +  
  #     coord_flip() +  # Flip coordinates to prevent label overlap
  #     labs(title = "In-State Tuition Fees by State",
  #          x = "State",
  #          y = "Tuition Fees") +
  #     theme_minimal() +
  #     # Increase label size for readability
  #     theme(axis.text.y = element_text(size = 10))  
  # }
  # 
  # # Create boxplot of tuition fees by state
  # tuition_by_state(college_data)
  # ```
#Question 8:
  # ```{r Control Proportion Pie, message=FALSE, warning=FALSE, echo=FALSE}
  # # Function to create a pie chart showing the proportion of school types
  # school_type_proportion <- function(data) {
  #   # Count institutions per control type from the data frame passed in
  #   school_type_counts <- data %>%
  #     count(Control) %>%
  #     mutate(proportion = n / sum(n), percentage = scales::percent(proportion))
  #   
  #   ggplot(school_type_counts, aes(x = "", y = proportion, fill = Control)) +
  #     geom_bar(stat = "identity", width = 1) +
  #     coord_polar(theta = "y") +
  #     geom_text(aes(label = percentage), position = position_stack(vjust = 0.5), size = 8) +
  #     labs(title = "Proportion of Institutions by Control Type", fill = "Control Type") +
  #     theme_void()
  # }
  # 
  # # Create pie chart of school type proportions
  # school_type_proportion(college_data)
  # ```
#Question 9:
  # ```{r Avg Pell by State BarPlot, message=FALSE, warning=FALSE, echo=FALSE, fig.height=10}
  # # Function to create a bar plot of average Pell Grants by state
  # avg_pell_grants_by_state_plot <- function(data) {
  #   # Calculate average Pell Grant percentage by state from the data passed in
  #   avg_pell_grants <- data %>%
  #     group_by(State) %>%
  #     summarise(avg_pell_grants = mean(Pell, na.rm = TRUE)) %>%
  #     arrange(avg_pell_grants) %>%
  #     as.data.frame()
  #   
  #   # Create a bar plot
  #   ggplot(avg_pell_grants, aes(x = reorder(State, avg_pell_grants), y = avg_pell_grants)) +
  #     geom_bar(stat = "identity", fill = "skyblue") +
  #     coord_flip() +  # Flip the axes to make the states easier to read
  #     labs(
  #       # Build the title with explicit line breaks so no indentation leaks into it
  #       title = paste0("Average Percentage of Students Receiving Pell Grant Awards\n",
  #                      "by State/Territory,\nOrdered Highest to Lowest Percentage"),
  #       x = "State",
  #       y = "Average Percentage of Students"
  #     ) +
  #     theme_minimal() +  # Use a minimal theme for cleaner appearance
  #     # Optional: adjust x-axis text angle for readability
  #     theme(axis.text.x = element_text(angle = 45, hjust = 1))  
  # }
  # 
  # # Call the function to display the plot
  # avg_pell_grants_by_state_plot(college_data)
  # ```
#Question 10:
  # ```{r Median Family Income Barplot, fig.width=10, fig.height=6, warning=FALSE, message=FALSE, echo=FALSE, results='asis'}
  # library(ggplot2)
  # library(dplyr)
  # library(knitr)
  # library(kableExtra)
  # 
  # # Function to compute median income and create a barplot
  # plot_median_income_barplot <- function(data) {
  #   
  #   # Compute median family income by institution type
  #   median_income_summary <- data %>%
  #     group_by(Control) %>%
  #     summarise(Median_Family_Income = median(MedIncome, na.rm = TRUE)) %>%
  #     arrange(Median_Family_Income)  # Sort by median income
  #   
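  #   # Output formatted table; print() is needed because only the last
  #   # expression in a function (the plot) is returned and auto-printed
  #   cat("### Median Family Income by Institution Type\n\n")
  #   print(kable(median_income_summary, format = "html", digits = 0) %>%
  #     kable_styling(full_width = FALSE, position = "left"))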
  #   
  #   # Create a barplot
  #   ggplot(median_income_summary, aes(x = reorder(Control, Median_Family_Income), y = Median_Family_Income, fill = Control)) +
  #     geom_col(color = "black") +
  #     labs(
  #       title = "Median Family Income by Institution Type",
  #       x = "Institution Type",
  #       y = "Median Family Income (USD)"
  #     ) +
  #     scale_fill_manual(values = c("Private" = "blue", "Profit" = "red", "Public" = "green")) +  
  #     theme_minimal() +
  #     coord_flip()  # Flip for better readability
  # }
  # 
  # # Run the function
  # plot_median_income_barplot(college_data)
  # ```

Analyzing 4-Year College Scorecards Through Descriptive Statistical Methods

Caleb Fitzsimonds

2025-03-27