Introduction

In this project, I will be using R to statistically analyze data from US four-year colleges and universities that primarily grant bachelor’s degrees (referred to as “colleges” from here on for brevity) from the year 2012. The data is from the Lock5 Datasets, Third Edition, authored by Robin H. Lock, Patti Frazer Lock, Kari Lock Morgan, Eric F. Lock, and Dennis F. Lock.

The primary goal is to demonstrate R's capabilities for interpreting and visualizing data. I chose questions largely about how educational access and attainment relate to economic factors, along with a few other questions that give reason to use additional analysis methods.

Methodology

I used a mixture of statistical methods in R to analyze and visualize the data, including mean, median, mode, range, variance, standard deviation, correlation, histograms, box plots, bar plots, and pie charts.

Since data is missing in parts of the data set, all statistical methods were set to ignore missing data.
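As a minimal sketch of what ignoring missing data can look like in R, the snippet below uses a small made-up data frame (the values are hypothetical, not taken from the Lock5 data). It assumes the `na.rm = TRUE` argument for summary functions and `use = "complete.obs"` for `cor()`; the code behind this report may handle missing values differently.

```r
# Hypothetical toy data with missing values (not from the Lock5 data set)
colleges <- data.frame(
  Cost      = c(20000, NA, 40000, 30000),
  AdmitRate = c(0.80, 0.60, NA, 0.50)
)

# Without na.rm, a single missing value makes the result NA
mean_with_na <- mean(colleges$Cost)

# na.rm = TRUE drops missing values before computing
mean_cost <- mean(colleges$Cost, na.rm = TRUE)
sd_cost   <- sd(colleges$Cost, na.rm = TRUE)

# cor() takes a different argument: keep only rows complete in both variables
r <- cor(colleges$Cost, colleges$AdmitRate, use = "complete.obs")
```

Note that `cor()` has no `na.rm` argument, so the `use` setting is what decides how incomplete rows are treated.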

The questions to analyze in the data were varied and were largely meant to stand alone rather than prove an overarching narrative. They were chosen based on curiosity and an attempt to use a wide variety of statistical methods. The questions were restricted to only require two variables to be compared at a time.

Questions to Analyze

I was tasked with suggesting ten questions to analyze in this data set, using ChatGPT to suggest another ten questions, and then choosing ten final questions from those two sets of questions to use for the project.

My Questions

These are the ten questions that I came up with for this data set:

  1. What is the average cost by region?
  2. What is the median cost by control type? (To compare costs between public/private/profit.)
  3. How does the admittance rate correlate with the cost?
  4. What is the percentage and standard deviation of the number of female students?
  5. What is the average cost of online college versus in-person?
  6. What percentage of enrolled students are at online colleges?
  7. What percentage of enrolled students are at public, private, or for-profit colleges?
  8. How does the median income vary by state?
  9. How does the completion rate correlate with the admittance rate?
  10. How does the completion rate correlate with the median income?

ChatGPT Questions

These are the ten questions that ChatGPT came up with for this data set:

  1. What is the distribution of admission rates (AdmitRate) among four-year colleges?
  2. What are the mean and median faculty salaries (FacSalary) across all colleges?
  3. What is the variance and standard deviation of in-state tuition (TuitionIn) among colleges?
  4. How does the average enrollment (Enrollment) compare between public and private colleges (Control)?
  5. What is the distribution of female student percentages (Female) across colleges?
  6. Is there a correlation between completion rates (CompRate) and median income (MedIncome)?
  7. What is the proportion of colleges in each region (Region)?
  8. What is the distribution of average SAT scores (AvgSAT) among colleges?
  9. What is the mean proportion of part-time students (PartTime) across colleges?
  10. How does the average net price (NetPrice) vary across different locales (Locale)?

Final Questions

These are the final questions that I decided to investigate in this project:

  1. What is the average cost by region?
  2. What is the median cost by control type? (To compare costs between public/private/profit.)
  3. How does the admittance rate correlate with the cost?
  4. What is the percentage and standard deviation of the number of female students?
  5. What is the average cost of online college versus in-person?
  6. What percentage of enrolled students are at online colleges?
  7. What percentage of enrolled students are at public, private, or for-profit colleges?
  8. What is the variance and standard deviation of in-state tuition (TuitionIn) among colleges?
  9. Is there a correlation between completion rates (CompRate) and median income (MedIncome)?
  10. How does the completion rate correlate with the admittance rate?

These questions provide interesting data points to analyze and call for a variety of statistical analysis methods.

Analysis

Q1: Average Cost by Region

We will analyze the average cost of colleges and universities in each region to understand how costs vary geographically.

##      Region     Cost
## 1   Midwest 34306.75
## 2 Northeast 39767.80
## 3 Southeast 31455.29
## 4 Territory 12772.89
## 5      West 32775.23

Colleges in the Northeast are the most expensive at an average cost of $39,768, followed by the Midwest at $34,307. The West and Southeast have relatively similar costs of $32,775 and $31,455, respectively. Colleges in territories are much cheaper than colleges anywhere else at only $12,773.

The mode, meaning the region with the highest number of colleges, is the Northeast. This is an interesting data point considering that it is also the most expensive region.

From this, we can conclude that average college costs are relatively similar across regions except for territories, where they are considerably cheaper, and the Northeast, where they are moderately more expensive. The greater abundance of colleges in the Northeast does not appear to reduce costs relative to other regions.
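One way a per-region average and the modal region could be computed is sketched below. The `Region` and `Cost` column names match the output in this section, but the data frame is a made-up toy example, and `aggregate()` with `table()` is an assumption about the approach rather than the exact code used.

```r
# Hypothetical toy data (not from the Lock5 data set)
colleges <- data.frame(
  Region = c("Midwest", "Northeast", "Northeast", "West", "Midwest", "Northeast"),
  Cost   = c(30000, 42000, 38000, 33000, NA, 40000)
)

# Mean cost per region; the formula interface drops rows with NA by default
avg_cost <- aggregate(Cost ~ Region, data = colleges, FUN = mean)

# Mode of a categorical variable: the level that occurs most often
region_counts <- table(colleges$Region)
mode_region   <- names(region_counts)[which.max(region_counts)]
```

Base R has no built-in function for the statistical mode (`mode()` returns the storage type), which is why the sketch counts levels with `table()`.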

Q2: Median Cost by Control Type

To compare how college costs vary by control type, meaning public, private, or for-profit, we will plot the median cost by the control type.

##   Control  Cost
## 1 Private 41488
## 2  Profit 27935
## 3  Public 21148

Predictably, public colleges are the cheapest at a median cost of $21,148. Somewhat unexpectedly considering their designated purpose, for-profit colleges are cheaper than private colleges at $27,935, while private colleges are the most expensive at $41,488.

Private colleges show the widest distribution, spanning both the most and least expensive colleges in the data set. Public colleges span a much narrower range, while for-profit colleges are also narrow but slightly wider. Public colleges having a narrower range and lower overall cost makes sense due to public funding and regulations keeping costs down.
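The comparison of medians and spreads could be reproduced along the following lines. This is a sketch on hypothetical toy values, assuming `aggregate()` for the medians and the max-minus-min range as the measure of spread.

```r
# Hypothetical toy data (not from the Lock5 data set)
colleges <- data.frame(
  Control = c("Private", "Private", "Private", "Profit", "Profit",
              "Public", "Public", "Public"),
  Cost    = c(12000, 41000, 70000, 25000, 31000, 18000, 21000, 24000)
)

# Median cost per control type
med_cost <- aggregate(Cost ~ Control, data = colleges, FUN = median)

# Spread per control type: maximum minus minimum cost
spread <- tapply(colleges$Cost, colleges$Control,
                 function(x) diff(range(x, na.rm = TRUE)))

# boxplot(Cost ~ Control, data = colleges) would draw the same comparison
```

The median is a reasonable choice here because a few very expensive or very cheap colleges shift it less than they would shift the mean.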

Q3: Correlation between Admission Rate and Cost

To analyze how admissions selectivity varies according to cost, we will plot the correlation between the admittance rate and the cost.

## [1] "Correlation between Admission Rate and Cost: -0.304"

The correlation of -0.304 indicates a weak to moderate negative relationship between admission rate and cost. The graph shows that from admission rates of around 50% up to the highest rate of 100%, costs vary greatly, spanning the full range from about $5,950 to $72,717.

Yet below an admission rate of about 50%, colleges begin to cluster more heavily around the higher end of the cost range, and below an admittance rate of 20%, they are almost exclusively colleges with higher costs above about $60,000. This reinforces the hypothesis that more prestigious colleges, which charge higher tuitions, are also more difficult to gain admittance to due to high demand and strict requirements.
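A correlation line in the format shown in this section could be produced as sketched below. The data are hypothetical, and the use of `cor()` with `use = "complete.obs"` plus `print(paste(...))` is an assumption inferred from the printed output, not the confirmed source code.

```r
# Hypothetical toy data (not from the Lock5 data set)
colleges <- data.frame(
  AdmitRate = c(0.10, 0.25, 0.50, 0.75, 0.90, NA),
  Cost      = c(65000, 60000, 40000, 45000, 20000, 30000)
)

# Pearson correlation, using only rows where both values are present
r <- cor(colleges$AdmitRate, colleges$Cost, use = "complete.obs")

print(paste("Correlation between Admission Rate and Cost:", round(r, 3)))

# plot(Cost ~ AdmitRate, data = colleges) would draw the matching scatter plot
```

A single coefficient summarizes only the linear trend, so it is worth reading alongside the scatter plot when the spread changes across the range, as it does here.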

Q4: Percentage and Standard Deviation of Female Students

We will analyze the number, percentage, and standard deviation of female students across colleges and visualize the percentages with a histogram to gain a better understanding of the distribution of female students across colleges.

## Standard deviation of female students: 12.34%
## Overall average percentage of female students: 59.3%
## Total number of female students across all colleges: 5,198,993

The overall average percentage of female students across all colleges is 59.3%, for a total of 5,198,993 female students.

Graphing the data using a histogram:

We can see that the histogram of female student percentages is roughly bell-shaped, with most colleges near the average of 59.3%. The standard deviation is 12.34 percentage points. This shows that while some outlier colleges differ considerably in their percentage of female students, about two-thirds of colleges fall within 12.34 percentage points of the mean, as expected for an approximately normal distribution.
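The "about two-thirds within one standard deviation" reading can be checked directly rather than assumed from the shape of the histogram. The sketch below does this on hypothetical percentages; the `Female` column name comes from the data set, while the values and the check itself are illustrative.

```r
# Hypothetical percentages of female students (not from the Lock5 data set)
colleges <- data.frame(
  Female = c(35, 48, 55, 58, 60, 61, 63, 66, 72, 90, NA)
)

m <- mean(colleges$Female, na.rm = TRUE)
s <- sd(colleges$Female, na.rm = TRUE)

# Share of colleges whose percentage lies within one standard deviation of the mean
within_one_sd <- mean(abs(colleges$Female - m) <= s, na.rm = TRUE)

# hist(colleges$Female) would draw the distribution
```

For data that are close to normal, this share should come out near 68%; a much larger or smaller share suggests heavy tails or skew.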

Q5: Average Cost by Online Availability

We will compare and graph the average costs of in-person and online colleges to analyze the cost difference between them.

##   Online     Cost
## 1      0 34392.04
## 2      1 19231.00

This graph shows the considerable difference in average cost between in-person colleges ($34,392.04) and online colleges ($19,231.00), with online colleges costing only 55.9% as much as in-person colleges on average. As the box plot shows, online colleges also span a much narrower range of costs. From this, we can conclude that online colleges tend to be considerably cheaper, though some in-person colleges are cheaper still.
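The 55.9% figure follows from the two group means in the output above. As a quick arithmetic check (the variable names are only for illustration):

```r
# Mean costs as reported in the output for this question
cost_in_person <- 34392.04
cost_online    <- 19231.00

# Online cost as a percentage of in-person cost
ratio_pct <- round(cost_online / cost_in_person * 100, 1)

# Absolute gap in dollars
difference <- cost_in_person - cost_online
```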

Q6: Enrollment Percentage at Online Colleges

We will visualize the percentage of students that are online students in this data set.

##   Online Enrollment Percentage
## 1      0    8801493   97.58839
## 2      1     217503    2.41161

We can see that in this data from 2012, online students make up a small sliver of all students, about 2.41%. This seems remarkably low by today's standards, and it is reasonable to expect that more recent data would show a larger share as online schooling has become more common.
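A table with enrollment totals and percentages per group, like the one in this section, could be built as sketched below. The toy values are hypothetical, and summing with `aggregate()` is an assumed approach.

```r
# Hypothetical toy data (not from the Lock5 data set); Online is 1 for online colleges
colleges <- data.frame(
  Online     = c(0, 0, 0, 1, 0),
  Enrollment = c(5000, 12000, NA, 1000, 2000)
)

# Total enrollment per group; rows with missing Enrollment are dropped
by_online <- aggregate(Enrollment ~ Online, data = colleges, FUN = sum)

# Each group's share of all enrolled students, in percent
by_online$Percentage <- by_online$Enrollment / sum(by_online$Enrollment) * 100
```

Summing enrollment weights each college by its size, which differs from counting what percentage of colleges are online.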

Q7: Enrollment Percentage by Control Type

We will visualize the percentages of students at each college control type: public, private, and for-profit, to see which ones are most popular.

##   Control Enrollment Percentage
## 1 Private    2608979  28.927599
## 2  Profit     463281   5.136725
## 3  Public    5946736  65.935676

The majority of enrolled students, 65.94%, are at public colleges, while a considerable portion, 28.93%, are at private colleges. For-profit colleges make up a much smaller percentage, 5.14%. This fits with our earlier finding that public colleges have the lowest cost, making them accessible to the most students.
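The percentages can be recomputed from the enrollment totals in the table for this question. The sketch below uses those reported totals with `prop.table()`; the vector name is only for illustration.

```r
# Enrollment totals as reported in the output for this question
enrollment <- c(Private = 2608979, Profit = 463281, Public = 5946736)

# Share of total enrollment per control type, in percent
pct <- round(prop.table(enrollment) * 100, 2)

# Total enrolled students across all three control types
total <- sum(enrollment)
```

The total also matches the sum of the online and in-person enrollment figures reported for Q6, which is a useful consistency check between the two tables.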

Q8: Variance and Standard Deviation of In-State Tuition

We will analyze the variance and standard deviation of in-state tuition to see how much it varies between institutions.

## [1] "Variance of In-State Tuition: 199665279.72"
## [1] "Standard Deviation of In-State Tuition: 14130.3"

The calculated variance is 199,665,279.72 (about 2.0 × 10^8, in squared dollars), and the standard deviation is $14,130.30. Because the variance is expressed in squared units, the standard deviation is the easier figure to interpret: in-state tuition at a typical college differs from the mean by roughly $14,130. This high standard deviation indicates substantial variability in in-state tuition among colleges.
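The two reported figures are tied together by the fact that the standard deviation is the square root of the variance. A short check using the reported variance (the variable names are only for illustration):

```r
# Variance of in-state tuition as reported in the output, in squared dollars
variance <- 199665279.72

# Standard deviation is the square root of the variance, in dollars
std_dev <- sqrt(variance)

print(paste("Standard Deviation of In-State Tuition:", round(std_dev, 1)))
```

On the raw data, the same relationship holds between `var(x, na.rm = TRUE)` and `sd(x, na.rm = TRUE)`.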

Q9: Correlation between Completion Rate and Median Income

To understand how income can impact academic outcomes, we will analyze the correlation between completion rates and median incomes for colleges in the data set.

## [1] "Correlation between Completion Rate and Median Income: 0.644"

The correlation between the completion rate and median income is 0.644, indicating a moderate positive correlation. This means that as the completion rate increases, so does the median family income, on average.

This plot visualizes the general trend that as median income increases, so does the completion rate. Above a 70% completion rate, there are far fewer colleges with a median income below $50,000. This supports the hypothesis that lower income levels correlate with lower completion rates, suggesting that students at an economic disadvantage may also be at a disadvantage for graduating.

One outlier in the data, the Jewish Theological Seminary of America with a median income of $179,900, was omitted from the graph due to its considerably higher median family income, which compressed the rest of the data points. With a completion rate of 80.49%, it fits the trend of colleges with higher median incomes generally having higher completion rates, though it demonstrates that having the highest median income on its own is not enough to guarantee the highest completion rate due to other factors involved in completion rates.
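Leaving an outlier out of a plot without removing it from the data can be done by plotting a filtered copy. The sketch below is hypothetical: only the $179,900 and 80.49% values come from the text, the other rows, the college labels, and the cutoff are made up, and it assumes the correlation is computed on all colleges, which the report does not state.

```r
# Toy data: the last row uses the outlier values quoted in the text,
# all other rows and the Name labels are hypothetical
colleges <- data.frame(
  Name      = c("A", "B", "C", "D", "E"),
  MedIncome = c(30000, 45000, 60000, 80000, 179900),
  CompRate  = c(35, 50, 62, 75, 80.49)
)

# Correlation on the full data, outlier included (an assumption of this sketch)
r_all <- cor(colleges$CompRate, colleges$MedIncome, use = "complete.obs")

# Plotting copy: drop colleges above a hypothetical income cutoff
income_cutoff <- 150000
plot_data <- subset(colleges, MedIncome <= income_cutoff)

# plot(CompRate ~ MedIncome, data = plot_data) would draw the trimmed scatter plot
```

Keeping the filter separate from the original data frame makes it clear that the exclusion affects only the visualization.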

Q10: Correlation between Admission Rate and Completion Rate

To analyze how greater selectivity in college admissions, meaning a lower admission rate, can correlate with the completion rate, we will plot the admission rate versus the completion rate.

## [1] "Correlation between Admission Rate and Completion Rate: -0.348"

The correlation between the admission and completion rates is -0.348, a weak to moderate negative correlation. It means that as the admittance rate increases, the completion rate decreases slightly, on average. This supports the hypothesis that the most selective colleges with the lowest admission rates would also have the highest completion rates, with most colleges below a 20% admission rate having a completion rate of above 80%.

As the admission rate increases, however, completion rates become much more variable, indicating a stronger correlation between admission rates and completion rates only at the lowest admission rates. Above those admission rates, other factors are likely more significant for the completion rate.
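The observation that completion rates are more variable at higher admission rates could be quantified by comparing the spread within admission-rate bands. The sketch below is one hypothetical way to do so: the data are made up, and the 20% cutoff and band labels are assumptions chosen to mirror the threshold mentioned above.

```r
# Hypothetical toy data (not from the Lock5 data set)
colleges <- data.frame(
  AdmitRate = c(0.08, 0.12, 0.18, 0.45, 0.60, 0.75, 0.90, 0.95),
  CompRate  = c(92, 88, 90, 40, 75, 55, 30, 68)
)

# Split colleges into two bands at an assumed 20% admission-rate cutoff
colleges$Band <- cut(colleges$AdmitRate, breaks = c(0, 0.20, 1),
                     labels = c("selective", "less_selective"))

# Standard deviation of completion rates within each band
spread_by_band <- tapply(colleges$CompRate, colleges$Band, sd, na.rm = TRUE)
```

A clearly larger standard deviation in the less selective band would be consistent with the pattern described in the scatter plot.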

Conclusion

We used R to perform statistical analysis of various data points and correlations in this four-year college data set. The methods used include mean, median, mode, range, variance, standard deviation, correlation, histograms, box plots, bar plots, and pie charts. Together, they demonstrate R's ability to surface statistical patterns in a data set and to help answer a variety of questions about it.