This document presents the results of analyses conducted in R comparing Xavier University to other post-secondary institutions in Ohio and nationwide.
The data was collected and reported by the US Department of Education from institutions of post-secondary education within the United States. This federal body is responsible for managing and tracking university and college data on demographic and institutional characteristics. As a result, the data collected can be considered a reliable basis for comparisons across the institutions represented.
This analysis will help you, the reader, develop a firm understanding of how post-secondary education in Ohio compares regionally and nationally. These comparisons are drawn on the basis of factors such as the economic demographics of undergraduate students, gender balance, and the cost of an education. By reviewing the analyses provided, you may be able to draw further conclusions about which school may or may not be a fit for you. Beyond that, you can also use these analyses to draw broader conclusions about societal demographics and statistics across the various geographical regions of the United States.
Before we can begin the analysis, we first have to load the packages that we will be using in R:
tidyverse: bundles a collection of analysis tools into one cohesive package, allowing us to manipulate, wrangle, and visualize data.
stringr: lets us work with text (string) values and manipulate them for data-cleaning purposes.
DT: a simple package that lets us build interactive data tables.
dplyr: provides extensive data wrangling capabilities for narrowing down observations and building in-depth analyses.
library(tidyverse)
library(stringr)
library(DT)
library(dplyr)
The data was downloaded from a .csv file stored in OneDrive and named “college_scorecard”. This represents the master data set from which all further analysis will be conducted. Now that we have our primary data set, we need to clean and prepare the data.
college_scorecard <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/kelinj_xavier_edu/EctJCArrkZhGp6WpKOfARLEBYU1DHDGErhED4rDYw9TO9Q?download=1")
If you take a look at the data set, you can see that there are a handful of missing values. We don’t want to eliminate all of these observations, because then we would remove nearly all of our data. We do, however, still want to be mindful of where these missing values are. The code below and its corresponding output show the number of missing values (marked “NA”) in each column:
colSums(is.na(college_scorecard))
## ID INSTNM CITY STABBR ZIP CONTROL LOCALE LATITUDE
## 0 0 0 0 0 0 444 445
## LONGITUDE HBCU MENONLY WOMENONLY ADM_RATE ACTCM25 ACTCM75 ACTCMMID
## 445 0 444 444 5078 5823 5823 5823
## SAT_AVG UGDS COSTT4_A AVGFACSAL PCTPELL PCTFLOAN AGE_ENTRY FEMALE
## 5795 0 3531 2868 770 770 500 1429
## MARRIED DEPENDENT VETERAN FIRST_GEN FAMINC
## 1392 921 4538 1247 500
At first glance, you may notice the large number of missing values in the Average SAT and three ACT columns. While this may be alarming right off the bat, the situation becomes less dire with some logical reasoning: not all schools require or report their students’ SAT or ACT scores. Altogether, most of these missing values can be explained by different standards and internal practices adopted by each school.
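If we want a quick sanity check on that reasoning, an optional line like the one below (a minimal sketch, not part of the original cleaning steps) computes the share of institutions with no reported average SAT score:
#Optional check: proportion of institutions with a missing average SAT score
mean(is.na(college_scorecard$SAT_AVG))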
Next, we need to inspect the data values to make sure there weren’t any data entry errors. The lines of code below illustrate examples where we need to test for these errors, such as negative values in counts or percentages greater than 100%. The corresponding outputs show no matching rows, so we are good to go.
#Testing whether any variables that shouldn't logically be negative contain negative values.
college_scorecard %>%
filter(UGDS < 0) %>%
head(1)
## # A tibble: 0 x 29
## # … with 29 variables: ID <dbl>, INSTNM <chr>, CITY <chr>, STABBR <chr>,
## # ZIP <chr>, CONTROL <chr>, LOCALE <dbl>, LATITUDE <dbl>, LONGITUDE <dbl>,
## # HBCU <chr>, MENONLY <dbl>, WOMENONLY <dbl>, ADM_RATE <dbl>, ACTCM25 <dbl>,
## # ACTCM75 <dbl>, ACTCMMID <dbl>, SAT_AVG <dbl>, UGDS <chr>, COSTT4_A <dbl>,
## # AVGFACSAL <dbl>, PCTPELL <dbl>, PCTFLOAN <dbl>, AGE_ENTRY <dbl>,
## # FEMALE <dbl>, MARRIED <dbl>, DEPENDENT <dbl>, VETERAN <dbl>,
## # FIRST_GEN <dbl>, FAMINC <dbl>
#Testing whether any variables that shouldn't logically exceed 100% contain such values.
college_scorecard %>%
filter(PCTPELL > 1) %>%
head(1)
## # A tibble: 0 x 29
## # … with 29 variables: ID <dbl>, INSTNM <chr>, CITY <chr>, STABBR <chr>,
## # ZIP <chr>, CONTROL <chr>, LOCALE <dbl>, LATITUDE <dbl>, LONGITUDE <dbl>,
## # HBCU <chr>, MENONLY <dbl>, WOMENONLY <dbl>, ADM_RATE <dbl>, ACTCM25 <dbl>,
## # ACTCM75 <dbl>, ACTCMMID <dbl>, SAT_AVG <dbl>, UGDS <chr>, COSTT4_A <dbl>,
## # AVGFACSAL <dbl>, PCTPELL <dbl>, PCTFLOAN <dbl>, AGE_ENTRY <dbl>,
## # FEMALE <dbl>, MARRIED <dbl>, DEPENDENT <dbl>, VETERAN <dbl>,
## # FIRST_GEN <dbl>, FAMINC <dbl>
We must also ensure that missing values are recorded in a way R recognizes (the data uses the SQL-style ‘NULL’, which we convert to NA). Additionally, we should recode the numerical values that serve as codes into their intended, more descriptive string form.
#Changes the SQL terminology for missingness to the standard that R uses.
college_scorecard[college_scorecard == 'NULL'] <- NA
#Changes the assigned numerical codes for the control of the institutions to the actual text.
college_scorecard$CONTROL[college_scorecard$CONTROL == '1'] <- 'Public'
college_scorecard$CONTROL[college_scorecard$CONTROL == '2'] <- 'Private nonprofit'
college_scorecard$CONTROL[college_scorecard$CONTROL == '3'] <- 'Private for-profit'
college_scorecard$CONTROL[college_scorecard$CONTROL == 'For profit'] <- 'Private for-profit'
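As an aside, the same recoding can be expressed in a single step with dplyr’s case_when(); the sketch below is an equivalent alternative to the assignments above, not a required part of the workflow:
#Equivalent recode of the CONTROL codes using case_when()
college_scorecard <- college_scorecard %>%
  mutate(CONTROL = case_when(
    CONTROL == "1" ~ "Public",
    CONTROL == "2" ~ "Private nonprofit",
    CONTROL %in% c("3", "For profit") ~ "Private for-profit",
    TRUE ~ CONTROL
  ))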
In some instances, it makes sense to set up “dummy variables” for ease of future analysis. These variables are usually binary, meaning they only contain two possible, and often opposite, values. The code below creates the following dummy variables:
Indicates whether the average family income for students at a given institution is above or below Ohio’s median of $54,021.
college_scorecard$OHMEDINC <- ifelse(college_scorecard$FAMINC > 54021, "Above", "Below")
Classifies an institution as a University or College based on its name. Some exceptions apply to this rule, such as Boston College, which is technically a university.
college_scorecard$INSTCL <- ifelse(str_detect(college_scorecard$INSTNM, "University"),"University", "College")
Indicates whether the institution is in a state that borders Ohio. Ohio itself is coded as “Yes” for this dummy variable, since most comparisons that use it will include Ohio.
college_scorecard$BORDEROH <- ifelse(college_scorecard$STABBR %in% c("MI", "KY", "IN", "PA", "WV", "OH"), "Yes", "No")
Indicates whether the institution’s admission rate is above or below the average for Ohio’s institutions.
#Need to run in order to find the average admissions rate for institutions in Ohio
college_scorecard %>%
filter(STABBR == "OH") %>%
summarize(mean(ADM_RATE, na.rm = TRUE))
## # A tibble: 1 x 1
## `mean(ADM_RATE, na.rm = TRUE)`
## <dbl>
## 1 0.708
college_scorecard$OHAVG_ADM_RATE <- ifelse(college_scorecard$ADM_RATE > 0.708, "Above", "Below")
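As an alternative to hard-coding the 0.708 value from the output above, the computed average could be stored in an object and reused; a minimal sketch:
#Store the Ohio average admission rate and reuse it when creating the dummy variable
oh_avg_adm <- college_scorecard %>%
  filter(STABBR == "OH") %>%
  summarize(avg_adm = mean(ADM_RATE, na.rm = TRUE)) %>%
  pull(avg_adm)
college_scorecard$OHAVG_ADM_RATE <- ifelse(college_scorecard$ADM_RATE > oh_avg_adm, "Above", "Below")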
Indicates whether or not the majority of an institution’s students are listed as being female.
college_scorecard$MAJFEM <- ifelse(college_scorecard$FEMALE > 0.50, "Yes", "No")
Indicates whether or not the majority of an institution’s students receive federal student loans to help pay for their education.
college_scorecard$MAJPCTFLOAN <- ifelse(college_scorecard$PCTFLOAN > 0.50, "Yes", "No")
Below is an interactive data table for exploring the primary data frame.
for_data_table <- college_scorecard %>%
select(INSTNM, CITY, STABBR, CONTROL, ADM_RATE, UGDS, COSTT4_A, FAMINC)
datatable(for_data_table)
Finally, we’ve cleaned and prepared our data for analysis, so let’s dive in. To start off on a basic note, let’s observe the number of post-secondary institutions in Ohio and each state that borders Ohio.
border_states <- college_scorecard %>%
filter(BORDEROH == "Yes") %>%
group_by(STABBR) %>%
summarize(count = n())
ggplot(data = border_states, aes(x = STABBR, y = count)) +
geom_col(fill = "dodgerblue3") +
ggtitle("# of Institutions in Ohio & Bordering States") +
xlab("States") +
ylab("# of Institutions")
Based on this simple graph, we can see that Pennsylvania has the most post-secondary institutions of the represented states, with Ohio coming in second. This should come as no surprise, considering Pennsylvania also has the largest population and population density of the listed states.
Another analysis we can run is examining how the cost of attendance varies with students’ average household income.
ggplot(data = college_scorecard, aes(x = FAMINC, y = COSTT4_A)) +
geom_point() +
ggtitle("Cost of Attendance vs. Household Income") +
scale_x_continuous(name = "Average Household Income",
labels = scales::dollar) +
scale_y_continuous(name = "Cost of Attendance",
labels = scales::dollar)
## Warning: Removed 3554 rows containing missing values (geom_point).
As expected, the graph shows that students from households with lower average incomes tend to attend institutions with a lower cost of attendance.
Another simple analysis: comparing the number of undergraduates across each institutional type:
college_scorecard$UGDS <- as.numeric(college_scorecard$UGDS)
undergraduate_control <- college_scorecard %>%
group_by(CONTROL) %>%
summarize(total = sum(UGDS, na.rm = TRUE))
ggplot(data = undergraduate_control, aes(x = CONTROL, y = total)) +
geom_col(fill = "coral3") +
ggtitle("# of Undergraduates Enrolled Per Institition Type") +
xlab("Institution Type") +
scale_y_continuous(name ="# of Undergraduates",
labels = scales::comma)
Again, no surprises here… There are far more undergraduates attending public institutions.
In a more intuitive analysis, we can explore the relationship between average SAT scores and average family incomes across Ohio and the states that border it.
sat_income_border <- college_scorecard %>%
filter(BORDEROH == "Yes")
ggplot(data = sat_income_border, aes(x = SAT_AVG, y = FAMINC)) +
geom_point() +
facet_wrap(~ STABBR) +
ggtitle("SAT Scores vs. Household Income in Ohio & Bordering States") +
xlab("Average SAT Scores") +
scale_y_continuous(name = "Average Household Income",
labels = scales::dollar)
## Warning: Removed 947 rows containing missing values (geom_point).
Looking at this graph, it’s easy to discern that higher average household incomes generally correspond with higher average SAT scores. This is likely attributable to the availability of additional resources, which can give a student access to better testing materials and lower familial stress.
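For readers who want a single number to back up that impression, an optional correlation check (a sketch, not part of the original analysis) can be run on the same filtered data:
#Optional: correlation between average SAT score and average family income
#for institutions in Ohio and its bordering states
cor(sat_income_border$SAT_AVG, sat_income_border$FAMINC, use = "complete.obs")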
Now it’s time to dive a little deeper into this dataset by answering some more pointed questions.
The first thing we will explore is an old adage: “Private schools cost more than public schools.” To answer this question, I broke down the dataset by creating a new data frame grouped by institutional control and summarized by the average cost of attendance. This allowed me to then create my plot…
tuition_control <- college_scorecard %>%
group_by(CONTROL) %>%
summarize(AVGCOST = mean(COSTT4_A, na.rm = TRUE))
ggplot(data = tuition_control, aes(x = CONTROL, y = AVGCOST)) +
geom_col(fill = "goldenrod3") +
ggtitle("Average Attendance Cost by Institution Type") +
xlab("Institution Type") +
scale_y_continuous(name = "Average Cost of Attendance",
labels = scales::dollar)
Indeed, the adage rings true. On average, private institutions are more than twice as expensive as their public counterparts. Within private institutions, nonprofits are, on average, slightly more than $10,000 more expensive than for-profits, which may be because for-profit institutions can use profits to partially offset tuition costs.
We turn more local in our next inquiry as we explore how the median family income of students at Xavier University compares to the Ohio and national medians. To start, we filter and summarize the median family income of students for just Xavier University and load it into its own data frame. Following that, we do the same for median family incomes at the state and national levels. Then, by binding all of these new data frames into one larger table, we have a means of evaluating this question.
income_xavier <- college_scorecard %>%
filter(INSTNM == "Xavier University") %>%
summarize(XAV_FAMINC = median(FAMINC, na.rm = TRUE))
income_national <- college_scorecard %>%
summarize(NAT_FAMINC = median(FAMINC, na.rm = TRUE))
income_ohio <- college_scorecard %>%
filter(STABBR == "OH") %>%
summarize(OH_FAMINC = median(FAMINC, na.rm = TRUE))
income_comparison <- bind_cols(income_xavier, income_ohio, income_national)
income_comparison
## # A tibble: 1 x 3
## XAV_FAMINC OH_FAMINC NAT_FAMINC
## <dbl> <dbl> <dbl>
## 1 114330. 34061. 31447.
This produces some shocking results, as the median family income for students at Xavier is more than 3 times Ohio’s and the nation’s medians. Since Xavier is a private school, this does not come as much of a surprise. As we saw in our earlier graph, higher costs of attendance correspond with higher average household incomes. Since Xavier’s all-in cost of attendance is roughly $50,000, this is consistent with the results of that prior analysis.
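To make the comparison explicit, the medians in the table above can be expressed as ratios; the sketch below is optional and simply divides the columns of income_comparison:
#Optional: Xavier's median family income as a multiple of the Ohio and national medians
income_comparison %>%
  mutate(XAV_VS_OH = XAV_FAMINC / OH_FAMINC,
         XAV_VS_NAT = XAV_FAMINC / NAT_FAMINC)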
On that note, let’s dig into cost of attendance. Specifically, we will analyze how the cost of attending an Ohio ‘university’ compares to universities in states that border Ohio, as well as nationally. In this case, we have to filter for two variables: first, institutions that are specifically universities, and second, states that border Ohio. After filtering, we summarize the average cost of attendance for each state. Finally, we add the national average alongside the other comparisons to create a readable table.
attendance_costs <- college_scorecard %>%
filter(INSTCL == "University",
BORDEROH == "Yes") %>%
group_by(STABBR) %>%
summarize(AVG_COST = mean(COSTT4_A, na.rm = TRUE))
attendance_costs_national <- college_scorecard %>%
summarize(AVG_COST = mean(COSTT4_A, na.rm = TRUE))
attendance_costs <- bind_rows(attendance_costs, attendance_costs_national)
attendance_costs
## # A tibble: 7 x 2
## STABBR AVG_COST
## <chr> <dbl>
## 1 IN 31127.
## 2 KY 30710.
## 3 MI 29689.
## 4 OH 28736.
## 5 PA 36888.
## 6 WV 22005
## 7 <NA> 26337.
In the table above, the left-hand column shows the state (or region) and the right-hand column shows the average cost of attending an institution in that state. Please note that the “NA” row represents the national average; the state abbreviation is missing for that row because it aggregates all states. On average, it is more expensive to attend school in Pennsylvania than nationally and than in the other states bordering Ohio. In fact, West Virginia is the only state in the table that isn’t more expensive than the national average.
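If the NA label feels confusing, an optional tweak (a sketch using tidyr’s replace_na(), which loads with the tidyverse) relabels that row explicitly:
#Optional: label the national row explicitly instead of leaving the state as NA
attendance_costs %>%
  mutate(STABBR = replace_na(STABBR, "National"))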
Turning the corner, we’ll now explore which schools have the highest and lowest percentages of undergraduate students receiving a Pell grant. For this, I decided to filter for only universities and display the top and bottom 3 institutions. To do so, I loaded the 3 lowest percentages, which were all 0, into a data frame and did the same for the 3 highest. I then combined the data frames and arranged them in ascending order for ease of observation.
lowest_pell <- college_scorecard %>%
filter(INSTCL == "University") %>%
select(INSTNM, STABBR, CONTROL, PCTPELL) %>%
arrange(PCTPELL) %>%
head(3)
highest_pell <- college_scorecard %>%
filter(INSTCL == "University") %>%
select(INSTNM, STABBR, CONTROL, PCTPELL) %>%
arrange(desc(PCTPELL)) %>%
head(3)
pell_grants <- bind_rows(lowest_pell, highest_pell) %>%
arrange(PCTPELL)
pell_grants
## # A tibble: 6 x 4
## INSTNM STABBR CONTROL PCTPELL
## <chr> <chr> <chr> <dbl>
## 1 University of Oklahoma-Health Sciences Center OK Public 0
## 2 Faith International University WA Private nonprofit 0
## 3 Taft University System CO Private for-prof… 0
## 4 Selma University AL Private nonprofit 0.96
## 5 Dewey University-Hato Rey PR Private nonprofit 0.96
## 6 Dewey University-Bayamon PR Private nonprofit 1
Overall, there were far more than 3 institutions with 0% of students receiving a Pell grant, so the 3 represented here are far from unique. At the other end, two branches of Dewey University appear among the top three, at a whopping 100% and 96%.
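For anyone curious how many universities actually report 0% of students with a Pell grant, an optional count (a minimal sketch) is below:
#Optional: number of universities reporting 0% of students receiving a Pell grant
college_scorecard %>%
  filter(INSTCL == "University", PCTPELL == 0) %>%
  nrow()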