In this project, I will be using R to statistically analyze data from US four-year colleges and universities that primarily grant bachelor’s degrees (referred to as “colleges” from here on for brevity) from the year 2012. The data is from the Lock5 Datasets, Third Edition, authored by Robin H. Lock, Patti Frazer Lock, Kari Lock Morgan, Eric F. Lock, and Dennis F. Lock.
The primary goal was to demonstrate R's capabilities for interpreting and visualizing data. I chose questions that largely concern how educational access and attainment relate to economic factors, along with a few others that call for different analysis methods.
I used a mixture of statistical methods in R to analyze and visualize the data, including mean, median, mode, range, variance, standard deviation, correlation, histograms, box plots, bar plots, and pie charts.
Since data is missing in parts of the data set, all statistical methods were set to ignore missing data.
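As a minimal sketch of what ignoring missing data looks like in base R, the example below uses made-up values (the vectors `cost` and `admit_rate` are hypothetical, not taken from the data set). Summary functions accept `na.rm = TRUE`, while `cor()` uses the `use` argument instead.

```r
# Hypothetical toy values; NA marks a missing observation.
cost <- c(21000, NA, 35000, 42000)
admit_rate <- c(0.80, 0.65, NA, 0.20)

# Without na.rm = TRUE, a single NA makes the whole result NA.
mean_with_na <- mean(cost)

# With na.rm = TRUE, missing values are dropped before computing.
mean_cost <- mean(cost, na.rm = TRUE)
median_cost <- median(cost, na.rm = TRUE)
sd_cost <- sd(cost, na.rm = TRUE)

# cor() has no na.rm argument; use = "complete.obs" keeps only
# the pairs in which both values are present.
r <- cor(admit_rate, cost, use = "complete.obs")
```

Note that `use = "complete.obs"` drops an observation whenever either variable is missing, so the correlation may be based on fewer rows than either single-variable summary.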
The questions to analyze in the data were varied and were largely meant to stand alone rather than prove an overarching narrative. They were chosen based on curiosity and an attempt to use a wide variety of statistical methods. The questions were restricted to only require two variables to be compared at a time.
I was tasked with suggesting ten questions to analyze in this data set, using ChatGPT to suggest another ten questions, and then choosing ten final questions from those two sets of questions to use for the project.
These are the ten questions that I came up with for this data set:
These are the ten questions that ChatGPT came up with for this data set:
These are the final questions that I decided to investigate in this project:
These questions provide interesting data points to analyze and call for a variety of statistical analysis methods.
We will analyze the average cost of colleges and universities in each region to understand how costs vary geographically.
## Region Cost
## 1 Midwest 34306.75
## 2 Northeast 39767.80
## 3 Southeast 31455.29
## 4 Territory 12772.89
## 5 West 32775.23
Colleges in the Northeast are the most expensive at an average cost of $39,768, followed by the Midwest at $34,307. The West and Southeast have relatively similar costs of $32,775 and $31,455, respectively. Colleges in territories are much cheaper than colleges anywhere else at only $12,773.
The region with the highest number of colleges, the mode, is the Northeast region. This is an interesting data point considering that it is also the most expensive region.
From this, we can conclude that average college costs are relatively similar across regions except for territories, where they are considerably cheaper, and the Northeast, where they are moderately more expensive. The greater abundance of colleges in the Northeast does not appear to reduce costs relative to other regions.
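The regional comparison above can be sketched in base R as follows. The data frame here is a small made-up stand-in with the same `Region` and `Cost` column names that appear in the output; the values are hypothetical.

```r
# Hypothetical stand-in for the college data, with one missing cost.
colleges <- data.frame(
  Region = c("Northeast", "Northeast", "Northeast",
             "Midwest", "Midwest", "West"),
  Cost   = c(40000, 38000, NA, 34000, 35000, 33000)
)

# Mean cost per region. The formula interface of aggregate()
# omits rows with a missing Cost by default (na.action = na.omit).
region_cost <- aggregate(Cost ~ Region, data = colleges, FUN = mean)

# Mode of a categorical variable: the most frequent level.
region_counts <- table(colleges$Region)
region_mode <- names(region_counts)[which.max(region_counts)]
```

Base R has no built-in function for the statistical mode, which is why the sketch counts levels with `table()` and picks the largest with `which.max()`.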
To compare how college costs vary by control type, meaning public, private, or for-profit, we will plot the median cost by the control type.
## Control Cost
## 1 Private 41488
## 2 Profit 27935
## 3 Public 21148
Predictably, public colleges are the cheapest at a median cost of $21,148. Somewhat unexpectedly considering their designated purpose, for-profit colleges are cheaper than private colleges at $27,935, while private colleges are the most expensive at $41,488.
Private colleges show the widest distribution, spanning both the most and least expensive colleges in the data set. Public colleges span a much narrower range, while for-profit colleges are also narrow but slightly wider. Public colleges having a narrower range and lower overall cost makes sense due to public funding and regulations keeping costs down.
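A minimal sketch of the two quantities discussed here, the median cost and the width of the cost range per control type, is shown below. The values are made up for illustration; only the column names `Control` and `Cost` come from the output above.

```r
# Hypothetical stand-in data.
colleges <- data.frame(
  Control = c("Private", "Private", "Private",
              "Profit", "Profit",
              "Public", "Public", "Public"),
  Cost    = c(9000, 42000, 70000, 22000, 30000, 18000, 21000, 25000)
)

# Median cost per control type.
median_cost <- aggregate(Cost ~ Control, data = colleges, FUN = median)

# Width of the cost range (max minus min) per control type.
cost_spread <- tapply(
  colleges$Cost, colleges$Control,
  function(x) diff(range(x, na.rm = TRUE))
)
```

Calling `boxplot(Cost ~ Control, data = colleges)` on the same data frame would draw one box per control type, which is one way to see the medians and ranges side by side.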
To analyze how admissions selectivity varies with cost, we will plot the admission rate against the cost and compute the correlation between them.
## [1] "Correlation between Admission Rate and Cost: -0.304"
This graph shows that for admission rates from around 50% up to 100%, costs span nearly the full range in the data, from about $5,950 to $72,717.
Below an admission rate of about 50%, however, colleges cluster more heavily toward the higher end of the cost range, and below an admission rate of 20%, they are almost exclusively colleges with costs above about $60,000. This is consistent with the hypothesis that more prestigious colleges, which charge more, are also harder to get into because of high demand and strict requirements.
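The sketch below shows one way to compute such a correlation and to inspect the most selective colleges separately. The column name `AdmitRate` and all values are assumptions made for illustration.

```r
# Hypothetical stand-in data: admission rate as a proportion, cost in dollars.
colleges <- data.frame(
  AdmitRate = c(0.08, 0.15, 0.45, 0.60, 0.75, 0.90, 1.00),
  Cost      = c(68000, 64000, 50000, 20000, 55000, 12000, 30000)
)

# Pearson correlation, ignoring incomplete pairs.
r <- cor(colleges$AdmitRate, colleges$Cost, use = "complete.obs")

# Colleges admitting fewer than 20% of applicants.
selective <- colleges[colleges$AdmitRate < 0.20, ]
min_selective_cost <- min(selective$Cost)
```

A single correlation coefficient summarizes the whole range, so checking a subset like `selective` is a simple way to see whether the pattern is concentrated at one end.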
We will analyze the number, percentage, and standard deviation of female students across colleges and visualize the percentages with a histogram to gain a better understanding of the distribution of female students across colleges.
## Standard deviation of female students: 12.34%
## Overall average percentage of female students: 59.3%
## Total number of female students across all colleges: 5,198,993
The overall average percentage of female students across all colleges is 59.3%, for a total of 5,198,993 female students.
Graphing the data using a histogram:
We can see that the distribution of female student percentages is roughly bell-shaped, with most colleges near the average of 59.3%. The standard deviation is 12.34 percentage points. For a roughly normal distribution, about two-thirds of values fall within one standard deviation of the mean, so while some outlier colleges differ considerably, most colleges are clustered within about 12.34 percentage points of the mean.
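The two-thirds figure is a rule of thumb for bell-shaped data, and it can be checked directly. The sketch below does so on a small made-up vector (`female_pct` is hypothetical); on the real data the same three lines would give the actual share.

```r
# Hypothetical percentages of female students at eleven colleges.
female_pct <- c(35, 48, 52, 55, 58, 60, 61, 63, 66, 70, 85)

m <- mean(female_pct, na.rm = TRUE)
s <- sd(female_pct, na.rm = TRUE)

# Share of colleges within one standard deviation of the mean.
# The comparison yields TRUE/FALSE, and mean() of a logical vector
# is the proportion of TRUE values.
within_one_sd <- mean(abs(female_pct - m) <= s, na.rm = TRUE)
```

`hist(female_pct)` would draw the corresponding histogram.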
We will compare and graph the average costs of in-person and online colleges to analyze the cost difference between them.
## Online Cost
## 1 0 34392.04
## 2 1 19231.00
This graph shows the considerable difference in average cost between in-person colleges ($34,392.04) and online colleges ($19,231.00): online colleges cost only 55.9% as much as in-person colleges on average. As the box plot shows, online colleges also span a much narrower range of costs. From this, we can conclude that online colleges tend to be considerably cheaper, though some in-person colleges are cheaper still.
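The 55.9% figure follows directly from the two group means in the output above. A short check, with the vector names chosen here for illustration:

```r
# Group means as reported in the output (Online = 0 and Online = 1).
mean_cost <- c(in_person = 34392.04, online = 19231.00)

# Online cost as a percentage of in-person cost.
ratio_pct <- round(100 * mean_cost[["online"]] / mean_cost[["in_person"]], 1)

# Absolute difference in dollars.
cost_gap <- mean_cost[["in_person"]] - mean_cost[["online"]]
```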
We will visualize the percentage of students that are online students in this data set.
## Online Enrollment Percentage
## 1 0 8801493 97.58839
## 2 1 217503 2.41161
We can see that in this 2012 data, students at online colleges make up only a small sliver of the total, about 2.41% of all students. This seems remarkably low by today's standards, and it is reasonable to assume that more recent data would show a larger share as online schooling has become more common.
We will visualize the percentages of students at each college control type: public, private, and for-profit, to see which ones are most popular.
## Control Enrollment Percentage
## 1 Private 2608979 28.927599
## 2 Profit 463281 5.136725
## 3 Public 5946736 65.935676
The majority of enrolled students, 65.94%, are at public colleges, while a considerable portion, 28.93%, are at private colleges. For-profit colleges make up a much smaller percentage, 5.14%. This fits with our earlier finding that public colleges have the lowest cost, making them accessible to the most students.
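The percentages can be reproduced from the enrollment totals in the output above. This is a small sketch using those reported totals; the vector name `enrollment` is chosen here for illustration.

```r
# Enrollment totals by control type, as reported in the output.
enrollment <- c(Private = 2608979, Profit = 463281, Public = 5946736)

# prop.table() divides each entry by the sum of the vector.
share_pct <- round(100 * prop.table(enrollment), 2)

total_students <- sum(enrollment)
```

The same named vector can be passed to `pie(enrollment)` or `barplot(enrollment)` to visualize the shares.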
We will analyze the variance and standard deviation of in-state tuition to see how much it varies between institutions.
## [1] "Variance of In-State Tuition: 199665279.72"
## [1] "Standard Deviation of In-State Tuition: 14130.3"
The calculated variance is 199,665,279.72 (about 1.9966528 × 10^8, in squared dollars), and the standard deviation, its square root, is $14,130.30. This high standard deviation indicates significant variation in in-state tuition among colleges: a typical college's in-state tuition differs from the average by roughly $14,130.
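The relationship between the two numbers can be verified in a few lines. The vector `tuition_in` below is made up for illustration; the last line uses the variance reported in the output.

```r
# Hypothetical in-state tuition values, one of them missing.
tuition_in <- c(5000, 9500, 12000, 31000, 45000, NA)

v <- var(tuition_in, na.rm = TRUE)  # in squared dollars
s <- sd(tuition_in, na.rm = TRUE)   # in dollars

# The standard deviation is the square root of the variance.
same <- isTRUE(all.equal(s, sqrt(v)))

# Applying the same relationship to the reported variance.
reported_sd <- round(sqrt(199665279.72), 1)
```

Because the variance is in squared dollars, the standard deviation is usually the easier of the two to interpret.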
To understand how income can impact academic outcomes, we will analyze the correlation between completion rates and median incomes for colleges in the data set.
## [1] "Correlation between Completion Rate and Median Income: 0.644"
The correlation between the completion rate and median income is 0.644, indicating a moderate positive correlation. This means that as the completion rate increases, so does the median family income, on average.
This plot visualizes the general trend that as median income increases, so does the completion rate. Above a 70% completion rate, there are far fewer colleges with a median income below $50,000. This supports the hypothesis that lower income levels correlate with lower completion rates, suggesting that students at an economic disadvantage may also be at a disadvantage for graduating.
One outlier, the Jewish Theological Seminary of America, was omitted from the graph because its median family income of $179,900 is so much higher than the rest that it compressed the other data points. With a completion rate of 80.49%, it fits the trend of colleges with higher median incomes generally having higher completion rates. It also shows that the highest median income does not by itself guarantee the highest completion rate, since other factors also affect completion.
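One way to leave an extreme point out of a plot without removing it from the data is sketched below. The column names `MedIncome` and `CompRate`, the $150,000 cutoff, and all rows except the outlier's two reported values are assumptions for illustration. The source does not say whether the correlation was computed with or without the outlier; this sketch computes it on all rows.

```r
# Hypothetical stand-in data; only the last row uses values from the text.
colleges <- data.frame(
  Name      = c("A", "B", "C", "D", "Outlier"),
  MedIncome = c(28000, 41000, 55000, 72000, 179900),
  CompRate  = c(35.0, 48.0, 62.0, 81.0, 80.49)
)

# Correlation on all rows (assumption: the outlier is kept here).
r_all <- cor(colleges$CompRate, colleges$MedIncome, use = "complete.obs")

# Data used for plotting only, with the extreme income filtered out.
# The 150000 cutoff is a hypothetical choice.
plot_data <- subset(colleges, MedIncome < 150000)
```

`plot(CompRate ~ MedIncome, data = plot_data)` would then draw the scatter plot without the compressed axis.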
To analyze how greater selectivity in college admissions, meaning a lower admission rate, can correlate with the completion rate, we will plot the admission rate versus the completion rate.
## [1] "Correlation between Admission Rate and Completion Rate: -0.348"
The correlation between the admission and completion rates is -0.348, a weak to moderate negative correlation. This means that as the admission rate increases, the completion rate tends to decrease slightly. It supports the hypothesis that the most selective colleges, those with the lowest admission rates, also have the highest completion rates: most colleges below a 20% admission rate have a completion rate above 80%.
As the admission rate increases, however, completion rates become much more variable, which suggests that the relationship between admission and completion rates is strongest at the lowest admission rates. Above those rates, other factors likely matter more for the completion rate.
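The claim that completion rates are more variable at higher admission rates can be checked by comparing the spread within admission-rate bands. The sketch below uses made-up data and a hypothetical 20% split; the column names are assumptions.

```r
# Hypothetical stand-in data: admission rate as a proportion,
# completion rate as a percentage.
colleges <- data.frame(
  AdmitRate = c(0.07, 0.12, 0.18, 0.55, 0.65, 0.75, 0.85, 0.95),
  CompRate  = c(92, 88, 85, 30, 75, 45, 60, 20)
)

# Split into two bands. cut() uses right-closed intervals by default,
# so the bands are (0, 0.2] and (0.2, 1].
colleges$Band <- cut(
  colleges$AdmitRate,
  breaks = c(0, 0.2, 1),
  labels = c("at most 20%", "above 20%")
)

# Standard deviation of the completion rate within each band.
spread <- tapply(colleges$CompRate, colleges$Band, sd, na.rm = TRUE)
```

A noticeably larger standard deviation in the upper band would be consistent with the pattern described above.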
We used R to perform statistical analysis of various data points and correlations in this four-year college data set. The methods used include mean, median, mode, range, variance, standard deviation, correlation, histograms, box plots, bar plots, and pie charts. Together, they demonstrate R's ability to surface statistical patterns in a data set and help answer a variety of questions.