Introduction

The purpose of this document is to analyze universities in the state of Ohio and how they compare to other regional and national schools using R, a programming language and software environment designed for statistical analysis.

I will be using data from the US Department of Education, which is one of many federal organizations responsible for collecting and reporting data on institutions of higher education. The data collected includes vital statistics of all institutions of post-secondary education in the United States receiving some form of financial aid.

This document presents key analyses and trends showing how institutions of post-secondary education in Ohio compare to other schools regionally and nationally. It shares visualizations, data tables, and commentary on these trends to help readers learn more about the United States’ post-secondary education system.

Packages Required

The R packages I loaded and used are listed below, along with their purpose in my analysis:

tidyverse: The tidyverse is a collection of open source R packages that help model, transform, and visualize data. In the tidyverse, I used the ggplot2 package frequently throughout my analysis to create my visualizations, and I used the dplyr package to filter and manipulate the data.

DT: The DT package is used to create an HTML widget to display R data objects in an easily readable data table.

knitr: This package helped create many of the tables I used throughout the document.

scales: I used the scales package to prevent R from labeling my axes in scientific notation.

To install these packages in R, use the following command: install.packages(c("tidyverse", "DT", "knitr", "scales"))

Data Preparation

Cleaning the Data

The data on 7,115 institutions from the US Department of Education is hosted here: http://asayanalytics.com/scorecard.

Before any analysis can begin, necessary cleaning and wrangling of the data must be done. The dataset contains missing values, blatant data errors, and data conformity issues.

Missing Values: Thousands of schools had missing values in their data. Most of the missing data came from admission rates and test scores. For the purpose of this research, I did not exclude schools with missing data.

Blatant Data Errors: Several schools had “-3” listed in the “LOCALE” column, which is outside the range of possible values described by the data dictionary. I changed all “-3” occurrences in the “LOCALE” column to “NA”. Furthermore, the “UGDS” column was not stored as numeric, so I converted it so that R recognizes it as numeric.

Data Conformity: I unified the variations of values in the “CONTROL” column, and I unified all “NULL” values in the “HBCU” and “UGDS” columns to “NA”.
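As a minimal sketch of the cleaning steps above (not the exact code used for this analysis), the snippet below applies them to a small invented data frame. The column names “LOCALE”, “UGDS”, “CONTROL”, and “HBCU” come from the data; the “INSTNM” column, the toy rows, and the text labels used for “CONTROL” are assumptions made for illustration.

```r
library(dplyr)

# Toy rows standing in for the real file; all values are invented.
raw <- data.frame(
  INSTNM  = c("School A", "School B", "School C", "School D"),
  CONTROL = c("1", "Public", "2", "Private for-profit"),
  LOCALE  = c(11, -3, 43, 21),
  HBCU    = c("0", "NULL", "1", "0"),
  UGDS    = c("1200", "NULL", "350", "8000"),
  stringsAsFactors = FALSE
)

clean <- raw %>%
  mutate(
    # -3 is outside the documented LOCALE range, so treat it as missing
    LOCALE = na_if(LOCALE, -3),
    # Literal "NULL" strings become NA
    HBCU = na_if(HBCU, "NULL"),
    # Convert UGDS from text to numeric once "NULL" is gone
    UGDS = as.numeric(na_if(UGDS, "NULL")),
    # Map every variation onto the codes 1/2/3 (the text labels are hypothetical)
    CONTROL = case_when(
      CONTROL %in% c("1", "Public")             ~ 1L,
      CONTROL %in% c("2", "Private nonprofit")  ~ 2L,
      CONTROL %in% c("3", "Private for-profit") ~ 3L
    )
  )

print(clean)
```

Replacing “NULL” with NA before calling as.numeric() avoids the coercion warning R would otherwise raise for the text values.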

Dummy Variables

After cleaning up the data, I added some dummy variables to assist me in my study:

“FAMINC_GREATER_THAN_OHIO” indicates whether the average family income of students attending the school is greater or less than the median household income for Ohio in 2019, which was $54,021.

“UNIVERSITY_OR_COLLEGE” indicates whether the institution has the word ‘College’ or ‘University’ in its name.

“BORDERS_OHIO” indicates if the institution is in a state that borders Ohio.

“AGE25” indicates if the institution has an average age of entry greater than or equal to 25 years.

“COST_ABOVE_AVERAGE” indicates if the institution has an average cost of attendance greater than the average cost nationwide.

“UGDS_ABOVE_AVERAGE” indicates if the institution has a greater number of undergraduate students than the nationwide average.
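The dummy variables above could be built along the following lines. This is a sketch under stated assumptions rather than the code behind this report: apart from “UGDS”, the source column names (“INSTNM”, “STABBR”, “FAMINC”, “AGE_ENTRY”, “COSTT4_A”) are assumed, the rows are invented, and Ohio itself is treated as not bordering Ohio.

```r
library(dplyr)

# Invented rows; column names other than UGDS are assumptions.
schools <- data.frame(
  INSTNM    = c("Alpha University", "Beta College", "Gamma Institute"),
  STABBR    = c("OH", "KY", "CA"),
  FAMINC    = c(60000, 30000, 80000),
  AGE_ENTRY = c(20.5, 27.1, 25.0),
  COSTT4_A  = c(30000, 12000, NA),
  UGDS      = c(5000, 300, 1500),
  stringsAsFactors = FALSE
)

ohio_median_income <- 54021                           # 2019 figure quoted above
border_states <- c("IN", "KY", "MI", "PA", "WV")      # states that share a border with Ohio

schools <- schools %>%
  mutate(
    FAMINC_GREATER_THAN_OHIO = as.integer(FAMINC > ohio_median_income),
    UNIVERSITY_OR_COLLEGE    = as.integer(grepl("university|college", INSTNM,
                                                ignore.case = TRUE)),
    BORDERS_OHIO             = as.integer(STABBR %in% border_states),
    AGE25                    = as.integer(AGE_ENTRY >= 25),
    # Averages are taken over the rows that report a value
    COST_ABOVE_AVERAGE       = as.integer(COSTT4_A > mean(COSTT4_A, na.rm = TRUE)),
    UGDS_ABOVE_AVERAGE       = as.integer(UGDS > mean(UGDS, na.rm = TRUE))
  )

print(schools)
```

A school with a missing cost keeps NA in “COST_ABOVE_AVERAGE” instead of being forced to 0, which keeps missing data distinguishable from below-average schools.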

About the Data

There are 7,115 schools listed in the data. Basic information, such as the school’s name and location, is included. In addition, there are two columns describing the control of the institution and its locale. The control explains whether the school is public (1), private nonprofit (2), or private for-profit (3). The locale describes whether the school is located in a city (11-13), a suburb (21-23), a town (31-33), or a rural area (41-43). The dataset already had dummy variables indicating if the institution is men only, women only, or designated as an HBCU. Lastly, the data includes a range of demographic information, such as ACT and SAT scores, admission rates, family income of students (in real 2015 dollars), and so on.

Data Table

Here is the cleaned data, which you can search, sort, or filter at your leisure. I selected 11 of the 35 variables in the data that I considered vital to include in an easily readable data table.

Summary Statistics

The following tables list summary statistics for particular variables of interest, computed over the institutions that report each value. The average admission rate across the country is 68.2%. The average ACT midpoint and average SAT score across institutions are 23.4 and 1,132, respectively. The average institution has a population of 2,426 undergraduate students. The average cost of attendance across post-secondary schools is $26,337. Lastly, the average family income of college students is $38,483.

Admission Rate Summary Statistics
Statistic   Value
Min.        0.000
1st Qu.     0.550
Median      0.710
Mean        0.682
3rd Qu.     0.840
Max.        1.000
NA’s        5078

ACT Midpoint Summary Statistics
Statistic   Value
Min.        6.000
1st Qu.     21.000
Median      23.000
Mean        23.442
3rd Qu.     25.000
Max.        35.000
NA’s        5823

SAT Average Summary Statistics
Statistic   Value
Min.        564.000
1st Qu.     1044.750
Median      1117.000
Mean        1131.774
3rd Qu.     1195.000
Max.        1558.000
NA’s        5795

Undergraduate Population Summary Statistics
Statistic   Value
Min.        0.000
1st Qu.     106.000
Median      401.000
Mean        2426.058
3rd Qu.     2018.000
Max.        77269.000
NA’s        748

Cost of Attendance Summary Statistics
Statistic   Value
Min.        0.00
1st Qu.     14000.25
Median      22646.50
Mean        26337.07
3rd Qu.     33941.75
Max.        93704.00
NA’s        3531

Age of Entry Summary Statistics
Statistic   Value
Min.        17.430
1st Qu.     23.175
Median      25.780
Mean        26.007
3rd Qu.     28.505
Max.        58.900
NA’s        500

Family Income Summary Statistics
Statistic   Value
Min.        321.00
1st Qu.     22668.00
Median      31447.00
Mean        38482.72
3rd Qu.     48098.50
Max.        174263.00
NA’s        500
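A table in this Statistic/Value layout can be produced from base R’s summary() function. The sketch below uses a short invented vector of admission rates rather than the real column, and the object names are hypothetical.

```r
# Invented admission rates, including two missing values
adm_rate <- c(0.55, 0.71, NA, 0.84, 1.00, NA, 0.30)

# summary() returns the five-number summary, the mean, and the NA count
stats <- summary(adm_rate)

# Reshape the named result into a two-column table
stats_table <- data.frame(
  Statistic = names(stats),
  Value     = round(as.numeric(stats), 3)
)

print(stats_table)
```

The resulting data frame could then be passed to a table function such as knitr::kable() for display.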

Directed Analysis

To find out conclusively if the old adage of “private schools cost more than public schools” is true, I grouped all institutions by their control and calculated the mean cost of attendance in R. The graph below shows that public schools are the cheapest on average, at just over $15,000. Private nonprofit schools average almost $40,000. Private for-profit schools, at over $25,000, are considerably cheaper than their nonprofit counterparts but still substantially more expensive than public schools. In short, the old adage is generally true. On average, a private school will cost more than a public one.
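The grouping step described above can be sketched as follows. The cost column name “COSTT4_A” is an assumption, and the six rows are invented so that the group means are easy to check by hand.

```r
library(dplyr)

# Invented rows; CONTROL uses the 1/2/3 coding described in this document
schools <- data.frame(
  CONTROL  = c(1, 1, 2, 2, 3, 3),
  COSTT4_A = c(14000, 16000, 38000, 42000, 24000, NA)
)

cost_by_control <- schools %>%
  mutate(CONTROL = factor(CONTROL, levels = 1:3,
                          labels = c("Public", "Private nonprofit",
                                     "Private for-profit"))) %>%
  group_by(CONTROL) %>%
  # na.rm = TRUE skips schools that do not report a cost
  summarise(mean_cost = mean(COSTT4_A, na.rm = TRUE), .groups = "drop")

print(cost_by_control)
```

Converting “CONTROL” to a labeled factor before grouping means the labels carry through to any table or bar graph built from the result.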

The following shows the large difference between the average family income of Xavier students and that of students in Ohio and nationwide. I filtered for Xavier University and for universities in Ohio and extracted the mean family income for each, along with the nationwide mean. I then placed the means into vectors and built a new data frame from them so I could display the results in this bar graph. The average family income of a Xavier student is over $110,000, while Ohio and nationwide students have an average hovering around $40,000.

The following graph and table compare the average cost of attendance of Ohio schools with that of bordering states and the nation, respectively. To find the values for the graph, I first kept only schools that have “university” or “college” in their name. Then, I filtered for Ohio and its bordering states into a new data frame and grouped the schools by state to calculate the mean for each state. For the table, I placed Ohio’s average cost and the nationwide average cost into a vector to be used in a new data frame, and then created a table from it. Based on the results, Indiana and Pennsylvania have the most expensive schools in the region, while West Virginia has the cheapest. Ohio is the 3rd most expensive state on average in the region. Compared with the national average, Ohio is around $400 more expensive.

Average Cost of Attendance of Ohio Compared Nationwide
Location     Average Cost of Attendance
Ohio         26583.51
Nationwide   26145.10
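The regional and national comparison could be assembled roughly as shown here. This is an illustrative sketch: the column names “INSTNM”, “STABBR”, and “COSTT4_A” are assumptions, and the rows are invented.

```r
library(dplyr)

# Invented rows; column names are assumptions
schools <- data.frame(
  INSTNM   = c("Alpha University", "Beta College", "Gamma Institute",
               "Delta University", "Epsilon College", "Zeta University"),
  STABBR   = c("OH", "OH", "OH", "IN", "WV", "CA"),
  COSTT4_A = c(28000, 24000, 9000, 30000, 18000, 40000),
  stringsAsFactors = FALSE
)

region <- c("OH", "IN", "KY", "MI", "PA", "WV")   # Ohio plus its bordering states

# Keep only schools with "university" or "college" in the name
named <- schools %>%
  filter(grepl("university|college", INSTNM, ignore.case = TRUE))

# Mean cost per state within the region, most expensive first
cost_by_state <- named %>%
  filter(STABBR %in% region) %>%
  group_by(STABBR) %>%
  summarise(mean_cost = mean(COSTT4_A, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(mean_cost))

# Two-row table comparing Ohio with the whole filtered data set
ohio_vs_nation <- data.frame(
  Location     = c("Ohio", "Nationwide"),
  Average_Cost = c(mean(named$COSTT4_A[named$STABBR == "OH"], na.rm = TRUE),
                   mean(named$COSTT4_A, na.rm = TRUE))
)

print(cost_by_state)
print(ohio_vs_nation)
```

Because the name filter is applied before either average is taken, the nationwide figure here is computed over a smaller set of schools than an unfiltered mean of the full data would be.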

To find the 10 colleges with the highest and lowest percentages of undergraduate students receiving a Pell grant, I had to do some filtering based on the number of undergraduates at each school. Without filtering, many of the top schools with a 100% Pell grant percentage had hardly any undergraduates attending. Some colleges with a 0% Pell grant percentage didn’t have a single undergraduate. To resolve this, I kept only schools that have at least 500 undergraduates. Jarvis Christian College had the highest Pell grant percentage at 98%, with 861 undergraduates attending. Ten schools had a 0% Pell grant percentage, including the United States Coast Guard Academy, the United States Naval Academy, and the American College of Financial Services.

Top 10 Schools that have Highest Percentage of Undergraduate Students Receiving a Pell Grant
Institution                                            Undergraduate Population  Pell Grant Pct
Jarvis Christian College                               861                       0.98
Atenas College                                         932                       0.97
Platt College-Miller-Motte Technical-Columbus          635                       0.97
Automeca Technical College-Bayamon                     528                       0.96
Vatterott College-Berkeley                             545                       0.95
Inter American University of Puerto Rico-Barranquitas  1784                      0.94
Talladega College                                      782                       0.93
Instituto de Banca y Comercio Inc                      11239                     0.93
Mech-Tech College                                      2448                      0.93
Vatterott College-Dividend                             689                       0.93

Top 10 Schools that have Lowest Percentage of Undergraduate Students Receiving a Pell Grant
Institution                                            Undergraduate Population  Pell Grant Pct
United States Coast Guard Academy                      1044                      0
The Southern Baptist Theological Seminary              873                       0
United States Naval Academy                            4495                      0
Hillsdale College                                      1511                      0
University of Oklahoma-Health Sciences Center          786                       0
American College of Financial Services                 8354                      0
Grove City College                                     2327                      0
Martinsburg College                                    879                       0
Northeast Lakeview College                             3779                      0
California Institute of Arts & Technology              804                       0
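The enrollment filter and ranking behind these two tables might look like the sketch below. The column names “INSTNM” and “PCTPELL” are assumptions, and the five rows are invented to show how very small schools would otherwise dominate both ends of the ranking.

```r
library(dplyr)

# Invented rows; INSTNM and PCTPELL are assumed column names
schools <- data.frame(
  INSTNM  = c("Tiny School", "Mid College", "Big University",
              "Empty Institute", "Large Academy"),
  UGDS    = c(12, 861, 11239, 0, 4495),
  PCTPELL = c(1.00, 0.98, 0.93, 0.00, 0.00),
  stringsAsFactors = FALSE
)

# Keep schools with at least 500 undergraduates (rows with NA are dropped too)
eligible <- schools %>% filter(UGDS >= 500)

# Ten highest and ten lowest Pell grant percentages
highest_pell <- eligible %>% arrange(desc(PCTPELL)) %>% head(10)
lowest_pell  <- eligible %>% arrange(PCTPELL) %>% head(10)

print(highest_pell)
print(lowest_pell)
```

When several schools tie on the Pell grant percentage, arrange() leaves them in their original row order, so which tied schools land in the first ten rows depends on the order of the file.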

Self-Directed Analysis

The following four graphs compare the average cost of attendance against the number of undergraduates, the percent of students receiving a Pell grant, the average faculty salary, and the average family income. If one of these variables were to be classified as the dependent variable, I would argue for the average cost of attendance, because it seems logical that the cost to attend an institution would depend on numerous other factors, such as how much the faculty are paid and how much income the student or the student’s family has available. I would probably use a linear regression analysis to evaluate the effect these variables have on the cost of an institution. Furthermore, I color coded the control of the institution in 3 of the 4 graphs, which highlights the clusters and trends of those schools better. I did not see any patterns when color coding the points for the cost vs. undergraduate population graph, so I left the color coding out of that one.

Generally speaking, as the Pell grant percentage increases, the cost decreases, especially for private nonprofit schools. As the average faculty salary increases, the cost increases. Lastly, as the average family income increases, the cost increases as well.
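The linear regression mentioned above was not run for this report, but a minimal sketch of what it could look like is given here. The data are invented, the column names other than “UGDS” are assumptions, and with only eight rows the coefficients illustrate the mechanics only.

```r
# Invented data; COSTT4_A, PCTPELL, AVGFACSAL, and FAMINC are assumed names
schools <- data.frame(
  COSTT4_A  = c(12000, 15000, 19000, 22000, 30000, 36000, 41000, 52000),
  UGDS      = c(8000, 300, 15000, 1200, 2500, 600, 4000, 6500),
  PCTPELL   = c(0.70, 0.62, 0.55, 0.48, 0.35, 0.30, 0.22, 0.12),
  AVGFACSAL = c(4500, 5200, 5000, 6100, 7400, 7000, 8800, 10500),
  FAMINC    = c(22000, 28000, 31000, 39000, 52000, 60000, 78000, 110000)
)

# Cost of attendance as the dependent variable, the other four as predictors
fit <- lm(COSTT4_A ~ UGDS + PCTPELL + AVGFACSAL + FAMINC, data = schools)

# One intercept plus one slope per predictor
print(coef(fit))

# Pairwise direction of each relationship, matching the scatterplots
print(cor(schools$COSTT4_A, schools[, c("UGDS", "PCTPELL", "AVGFACSAL", "FAMINC")]))
```

On the real data, summary(fit) would report standard errors and p-values for each slope, which is what would show whether the visual trends hold once the other variables are held fixed.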

To focus on the difference in population between urban and rural schools, I filtered for schools that have “university” or “college” in their name, filtered out gendered schools, and kept only public schools because private schools tend to be smaller. I selected schools that received the “11” classification for their locale because it fits the description of “heavily urbanized”. Finally, I compared these urban schools with schools that received the “43” classification because it fits the description of “very rural”. The following graph reiterates the obvious insight that urban schools have a significantly higher undergraduate population than rural schools, with very urban schools averaging over 15,000 students and very rural schools averaging 1,250 students.
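The chain of filters described above can be sketched as follows. The rows are invented, and the column names “INSTNM”, “MENONLY”, and “WOMENONLY” are assumptions about how the name and single-gender indicators are stored.

```r
library(dplyr)

# Invented rows; INSTNM, MENONLY, and WOMENONLY are assumed column names
schools <- data.frame(
  INSTNM    = c("Metro University", "City College", "Prairie College",
                "Hill University", "Urban Women's College",
                "Downtown Institute", "Private City University"),
  CONTROL   = c(1, 1, 1, 1, 1, 1, 2),
  LOCALE    = c(11, 11, 43, 43, 11, 11, 11),
  MENONLY   = c(0, 0, 0, 0, 0, 0, 0),
  WOMENONLY = c(0, 0, 0, 0, 1, 0, 0),
  UGDS      = c(18000, 14000, 1000, 1500, 900, 700, 3000),
  stringsAsFactors = FALSE
)

ugds_by_locale <- schools %>%
  filter(
    grepl("university|college", INSTNM, ignore.case = TRUE),  # name filter
    MENONLY == 0, WOMENONLY == 0,                             # drop gendered schools
    CONTROL == 1,                                             # public schools only
    LOCALE %in% c(11, 43)                                     # the two locale extremes
  ) %>%
  mutate(LOCALE_TYPE = if_else(LOCALE == 11, "Very urban", "Very rural")) %>%
  group_by(LOCALE_TYPE) %>%
  summarise(mean_ugds = mean(UGDS, na.rm = TRUE),
            n_schools = n(), .groups = "drop")

print(ugds_by_locale)
```

Reporting the number of schools in each group alongside the mean shows how many institutions each average rests on after the filters are applied.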

I wanted to find out if there were any significant differences in the price of a university depending on its locale. Are urban schools the most expensive? Is there a price difference between schools in suburbs and towns? How much cheaper (or more expensive) are rural schools compared to urban schools? After filtering to limit bias, I calculated the mean cost for each locale and displayed it in a bar graph. Based on the graph below, rural schools have the cheapest cost of attendance at around $13,000. Urban schools are the most expensive, but suburban and town schools aren’t much cheaper in comparison. Urban, suburban, and town schools all have an average price just above $15,000. A linear regression analysis would be appropriate to further validate these findings.
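One way to collapse the locale codes into the four locale types before averaging is sketched here. The code ranges follow the description in the About the Data section; the cost column name “COSTT4_A” and the rows are assumptions, and the bias filters mentioned above are not reproduced.

```r
library(dplyr)

# Invented rows; COSTT4_A is an assumed column name
schools <- data.frame(
  LOCALE   = c(11, 13, 21, 23, 31, 33, 41, 43, NA),
  COSTT4_A = c(16000, 15500, 15200, 15400, 15000, 15300, 12500, 13500, 20000)
)

cost_by_locale <- schools %>%
  filter(!is.na(LOCALE)) %>%                 # schools without a locale cannot be placed
  mutate(LOCALE_TYPE = case_when(
    LOCALE %in% 11:13 ~ "City",
    LOCALE %in% 21:23 ~ "Suburb",
    LOCALE %in% 31:33 ~ "Town",
    LOCALE %in% 41:43 ~ "Rural"
  )) %>%
  group_by(LOCALE_TYPE) %>%
  summarise(mean_cost = mean(COSTT4_A, na.rm = TRUE), .groups = "drop")

print(cost_by_locale)
```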

Next, I wanted to find out if the average family income of the student had an effect on the average ACT score across all universities. Does a higher family income usually result in a higher ACT score? I plotted the relationship on a scatterplot and drew a regression line, and it appears that it does. As family income increases, the average ACT score of the institution increases as well. A formal linear regression analysis would be appropriate to further validate these findings.