Introduction

This document analyzes more than 7,000 post-secondary institutions using data from the Department of Education, which is updated yearly. The purpose is to discover how institutions in Ohio compare to those outside of Ohio on a range of factors. After reading, you should have a sense of where Ohio excels in education and where it can improve.

Stay tuned for my first exam to see data cleaning, visualizations, and beautiful coding/formatting. I hope you enjoy the ride.

Packages Required

library(tidyverse) – Takes advantage of all the capabilities of the tidyverse to work with data

library(dplyr) – Allows for easy manipulation of dataframes to work with variables

library(DT) – Allows me to create aesthetically pleasing data tables

library(data.table) – Is useful for adding new columns and quickly analyzing data

Initial Steps

Data Import

This step is essential: the data is downloaded from the website where it is stored and loaded into the environment for further manipulation.

Data Cleaning

Investigating missing values within each column

  • Many of the variables have a lot of missing values because some institutions do not require things like SAT or ACT scores, or do not report some of these metrics. I don’t see how imputing these missing entries would help the analysis, so I think it is best to keep the values as “NA” and exclude them from calculations.
       ID    INSTNM      CITY    STABBR       ZIP   CONTROL    LOCALE  LATITUDE 
        0         0         0         0         0         0       444       445 
LONGITUDE      HBCU   MENONLY WOMENONLY  ADM_RATE   ACTCM25   ACTCM75  ACTCMMID 
      445         0       444       444      5078      5823      5823      5823 
  SAT_AVG      UGDS  COSTT4_A AVGFACSAL   PCTPELL  PCTFLOAN AGE_ENTRY    FEMALE 
     5795         0      3531      2868       770       770       500      1429 
  MARRIED DEPENDENT   VETERAN FIRST_GEN    FAMINC 
     1392       921      4538      1247       500 
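
Counts like the ones above can be produced with a per-column tally of NA values. The following is a minimal sketch in base R; `schools_demo` and its values are made up for illustration and are not the real data set.

```r
# Toy stand-in for the institution data; values are made up for illustration.
schools_demo <- data.frame(
  INSTNM   = c("A College", "B University", "C Institute"),
  ADM_RATE = c(0.65, NA, NA),
  SAT_AVG  = c(1100, NA, 1250)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(schools_demo))
print(na_counts)
```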

Change 0% admission rates to NA, a data entry error

Adjust values within Locale column to fix data entry error

Create uniformity in missing values by changing “NULL” to NA which is R friendly.

Change data types once Nulls are removed

Adjust control values to be dummy variables rather than character strings for analysis
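
The cleaning steps above could look roughly like the sketch below. It uses base R on a made-up data frame (`raw_demo`), and the CONTROL label strings are assumptions, since the raw labels are not shown in this report. The steps are ordered so that the zero check runs on numeric values, which may differ from the order used in the actual code. The Locale fix is left out because the nature of that error is not described here.

```r
# Toy raw data: columns arrive as character, with "NULL" marking missing values.
raw_demo <- data.frame(
  INSTNM   = c("A College", "B University", "C Institute"),
  CONTROL  = c("Public", "Private nonprofit", "Private for-profit"),  # assumed labels
  ADM_RATE = c("0.65", "0", "NULL"),
  SAT_AVG  = c("1100", "NULL", "1250"),
  stringsAsFactors = FALSE
)

clean_demo <- raw_demo

# 1. Replace the literal string "NULL" with NA in every column.
clean_demo[clean_demo == "NULL"] <- NA

# 2. Convert numeric columns now that no "NULL" strings remain.
num_cols <- c("ADM_RATE", "SAT_AVG")
clean_demo[num_cols] <- lapply(clean_demo[num_cols], as.numeric)

# 3. Treat a 0% admission rate as a data entry error.
clean_demo$ADM_RATE[which(clean_demo$ADM_RATE == 0)] <- NA

# 4. Recode CONTROL from label strings to integer codes.
control_codes <- c("Public" = 1L, "Private nonprofit" = 2L, "Private for-profit" = 3L)
clean_demo$CONTROL <- unname(control_codes[clean_demo$CONTROL])

print(clean_demo)
```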

Variable Creation

Some additional variables are necessary to do the analysis I am looking to do on the data. These dummy variables will make visualizations in the future easier to create and understand!

For the columns created in these six questions, a value of 1 indicates Yes, and 0 indicates No

  1. Is the average family income at this post-secondary institution higher than the Ohio median household income in 2018?

  2. Is this institution a University or College?

  3. Does the state this institution is located in border Ohio?

  4. Is the enrollment of this institution higher than the nationwide average?

  5. Are the majority of entrants above the age of 22 (the societal norm for graduation age)?

  6. If the post-secondary institution is public, is its admission rate below average (when available)?
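
As a rough illustration of how the first five dummy variables could be built, here is a base R sketch on made-up data. The new column names, the name-matching rule for "University or College", and the income threshold are all assumptions. In particular, `ohio_median_income` is a placeholder value, not the actual 2018 figure, and average entry age is used as a stand-in for "majority of entrants above 22".

```r
# Toy data; all values below are illustrative.
schools_demo <- data.frame(
  INSTNM    = c("Alpha University", "Beta Beauty School", "Gamma College"),
  STABBR    = c("OH", "KY", "CA"),
  UGDS      = c(12000, 150, 2500),
  AGE_ENTRY = c(20.5, 27.1, 23.0),
  FAMINC    = c(80000, 25000, NA)
)

ohio_median_income <- 56000                       # placeholder, NOT the real 2018 value
border_states <- c("IN", "KY", "MI", "PA", "WV")  # states that share a border with Ohio

# Each comparison yields TRUE/FALSE/NA; as.integer() maps these to 1/0/NA.
schools_demo$HIGH_INCOME      <- as.integer(schools_demo$FAMINC > ohio_median_income)
schools_demo$UNIV_COLLEGE     <- as.integer(grepl("University|College", schools_demo$INSTNM))
schools_demo$BORDERS_OHIO     <- as.integer(schools_demo$STABBR %in% border_states)
schools_demo$ABOVE_AVG_ENROLL <- as.integer(schools_demo$UGDS > mean(schools_demo$UGDS, na.rm = TRUE))
schools_demo$OLDER_ENTRANTS   <- as.integer(schools_demo$AGE_ENTRY > 22)

print(schools_demo)
```

Note that a missing FAMINC stays NA in the dummy column rather than being forced to 0, which matches the choice to leave missing values as NA.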

Data and Variable Explanations

  • There are 7,115 observations, each of which represents a post-secondary institution.
  • Missing values are recorded as NA
Variable                    Description
ID                          A unique ID assigned to each institution
INSTNM                      Name of institution
CITY                        City of institution
STABBR                      State of institution
ZIP                         Zip code of institution
CONTROL                     Type of institution (1 = public, 2 = private nonprofit, 3 = private for-profit)
LOCALE                      Type of location (urban, suburbs, rural)
LATITUDE                    Latitude
LONGITUDE                   Longitude
HBCU                        1 if it is a Historically Black College/University
MENONLY                     1 if only men are admitted
WOMENONLY                   1 if only women are admitted
ADM_RATE                    Admission rate as a percentage
ACTCM25, ACTCM75, ACTCMMID  25th percentile, 75th percentile, and midpoint of ACT scores
SAT_AVG                     Average SAT score
UGDS                        Number of enrolled undergraduates
COSTT4_A                    Cost of attendance each year
AVGFACSAL                   Average faculty salary
PCTPELL                     Percent of students receiving a Pell grant
PCTFLOAN                    Percent of students using federal loans
AGE_ENTRY                   Average age of students when enrolling
FEMALE                      Percent of students that are female
MARRIED                     Percent of students that are married
DEPENDENT                   Percent of students that are dependent
VETERAN                     Percent of students that are veterans
FIRST_GEN                   Percent of students that are first-generation college students
FAMINC                      Average family income of students enrolled
Dummy Var.                  The last six columns were created based on the questions in the Variable Creation tab.

Data

Analysis

Summary Statistics

  • Across the 7,115 institutions in the data set, the largest undergraduate enrollment at a single institution is 77,269, but the overall average is only 2,426.

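  • Interestingly, the average SAT score for these institutions is 1132. The average admission rate is printed as 2426.06, which matches the average enrollment rather than a rate, so that figure needs to be recomputed from ADM_RATE.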

  • As for the students themselves, on average 46% of students at an institution are the first member of their family to attend college (first generation), 64% are female, and the average age at entry is 26.

  • Lastly, the families of students at these post-secondary institutions have an average income of $38,483. The highest institution-level average family income is $174,263.
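
Because missing values are kept as NA, each of these summaries has to skip them explicitly. A minimal base R sketch on made-up numbers (the values in `schools_demo` are illustrative, not taken from the data set):

```r
# Toy data with some missing values.
schools_demo <- data.frame(
  UGDS    = c(12000, 150, 2500, 800),
  SAT_AVG = c(1200, NA, 1050, NA),
  FAMINC  = c(80000, 25000, NA, 41000)
)

# na.rm = TRUE drops NA entries before computing each statistic;
# without it, any column containing an NA would return NA.
summary_stats <- c(
  max_enrollment  = max(schools_demo$UGDS, na.rm = TRUE),
  mean_enrollment = mean(schools_demo$UGDS, na.rm = TRUE),
  mean_sat        = mean(schools_demo$SAT_AVG, na.rm = TRUE),
  mean_income     = mean(schools_demo$FAMINC, na.rm = TRUE),
  max_income      = max(schools_demo$FAMINC, na.rm = TRUE)
)
print(round(summary_stats, 2))
```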

Simple Analysis

  1. Number of institutions in Ohio and in each border state
  2. Cost of attendance as it varies by family income
  3. Number of undergraduate students by institution type
  4. Relationship between ACT/SAT scores and family income for states that border Ohio

Directed Analysis

1. Do private schools cost more than public schools?

  • This shows that, on average, private schools cost more than public schools across the nation. I thought a bar graph was the simplest way to visualize this. Specifically, nonprofit private schools cost over $20,000 more on average than public schools. Based on this visualization, I do support the usual thinking that private schools cost more.

2. How does Xavier’s average family income compare to institutions within Ohio and nationally?

  • Xavier’s Average Family Income
  avg_income
1   114329.6
  • Ohio Institutions’ Average Family Income
  avg_income
1   42138.52
  • National Institutions’ Average Family Income
  avg_income
1   38471.25
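
A comparison like this comes down to averaging FAMINC over three different row selections. The sketch below shows one way to do it in base R with made-up values. The helper `avg_income()` is hypothetical, and whether the national figure in the report includes Ohio schools is an assumption (here it does).

```r
# Toy data; incomes are made up for illustration.
schools_demo <- data.frame(
  INSTNM = c("Xavier University", "Ohio School B", "Other School C", "Other School D"),
  STABBR = c("OH", "OH", "KY", "CA"),
  FAMINC = c(110000, 40000, 30000, NA)
)

# Hypothetical helper: mean family income over the selected rows, ignoring NA.
avg_income <- function(rows) mean(schools_demo$FAMINC[rows], na.rm = TRUE)

income_comparison <- c(
  xavier   = avg_income(schools_demo$INSTNM == "Xavier University"),
  ohio     = avg_income(schools_demo$STABBR == "OH"),
  national = avg_income(rep(TRUE, nrow(schools_demo)))  # all rows, Ohio included
)
print(income_comparison)
```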

3. How does the cost to attend a university in Ohio compare to border states?

  • This boxplot visualization shows that Ohio is right about in the middle of the pack compared to the border states, but there is a left skew to Ohio’s distribution as well as an outlier on the high end. Pennsylvania has the highest average cost and West Virginia the lowest.

How does the cost to attend a university in Ohio compare nationally?

  • Secondly, Ohio is just above the national average for tuition costs. This visualization does not include states that border Ohio. Nationally, there is more variability to the data as there are more outliers on the high end of costs, especially compared to Ohio.

4. What schools have the highest and lowest percentage of undergraduate students receiving a Pell grant?

  • There are 74 institutions where 0% of their students have Pell grants. I chose to show this within a searchable data table because there are too many to just list, and it allows for easy use for viewers.

  • On the other hand, there are 48 institutions where 100% of students receive Pell grants.
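
Finding the schools at either extreme is a simple filter. This base R sketch assumes PCTPELL is stored as a proportion between 0 and 1, which may not match how the column is stored in this report, and the rows are made up. The resulting data frames could then be passed to `DT::datatable()` to get a searchable table.

```r
# Toy data; PCTPELL assumed to be a proportion (0 to 1).
schools_demo <- data.frame(
  INSTNM  = c("A College", "B University", "C Institute", "D Academy"),
  PCTPELL = c(0, 0.42, 1, NA)
)

# which() drops NA comparisons, so schools that do not report PCTPELL are excluded.
no_pell  <- schools_demo[which(schools_demo$PCTPELL == 0), ]
all_pell <- schools_demo[which(schools_demo$PCTPELL == 1), ]

print(no_pell)
print(all_pell)
```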

Self-Directed Analysis!

What variable has most effect on determining where students attend school?

1. Compare the average cost of attendance across the number of undergraduates, the percent of students receiving a Pell grant, the average faculty salary and the average family income.

  • I think that family income is the most influential variable for students deciding where to attend school because of the strong positive correlation it has with cost of attendance. You could evaluate the effect by looking at the correlation coefficient and the r-squared value of a regression between the two.
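
To put a number on that relationship, one option is to compute the correlation coefficient and the r-squared of a simple linear regression. The sketch below does this in base R for cost versus family income on made-up values; the same two lines could be repeated for UGDS, PCTPELL, and AVGFACSAL to compare them. A strong correlation alone does not show that income drives the choice of school.

```r
# Toy data; values are made up for illustration.
schools_demo <- data.frame(
  COSTT4_A = c(15000, 22000, 31000, 45000, 52000, NA),
  FAMINC   = c(25000, 38000, 47000, 80000, 95000, 60000)
)

# Pearson correlation using only rows where both values are present.
r <- cor(schools_demo$COSTT4_A, schools_demo$FAMINC, use = "complete.obs")

# Simple linear regression; lm() drops rows containing NA by default.
fit <- lm(COSTT4_A ~ FAMINC, data = schools_demo)
r_squared <- summary(fit)$r.squared

# With a single predictor, r-squared equals the squared correlation.
print(c(r = r, r_squared = r_squared))
```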

How do student populations change from urban to rural areas?

2. Compare the student populations of schools in heavily urbanized areas with those in very rural areas.

  • The populations that attend heavily urbanized schools vs. definitively rural schools are different - and not just in number of students. Rural schools have a lower percentage of females compared to males, and their rate of students using federal loans is almost halved.

  • Likewise, the distribution of students who are first-generation college students is much wider for urban institutions, and there are outliers on both sides of the scale. This suggests a clear difference in the type of students who seek out a rural school: they are more likely to have a parent who also attended college.

  • Lastly, students who attend a rural institution are younger, on average, by about 2 years. People who are older may not want to go to the middle of nowhere to get their degree and instead stick to an urban area where they are able to work or be close to family and friends.

I would use a two-sample t-test to evaluate whether the differences between the two means in these cases are meaningful.
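
A two-sample t-test of this kind could be run as sketched below. The data are made up, and `LOCALE_GROUP` is a hypothetical column that would first have to be derived from the LOCALE codes. `t.test()` uses the Welch version by default, which does not assume equal variances in the two groups.

```r
# Toy data: average entry age at five urban and five rural schools (made up).
schools_demo <- data.frame(
  LOCALE_GROUP = rep(c("urban", "rural"), each = 5),
  AGE_ENTRY    = c(27.1, 26.4, 28.0, 25.9, 27.5,
                   24.8, 25.2, 23.9, 24.5, 25.6)
)

# Welch two-sample t-test comparing mean entry age between the two groups.
age_test <- t.test(AGE_ENTRY ~ LOCALE_GROUP, data = schools_demo)
print(age_test)
```

The same call works for FEMALE, PCTFLOAN, or FIRST_GEN by swapping the left-hand side of the formula.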

Do female only institutions actually have smarter students than male only and co-ed?

  1. There are often stereotypes attributed to gender, e.g., “girls go to college to get more knowledge, boys go to Jupiter to get more stupider.” I want to investigate whether schools that only admit women do in fact have higher test scores or are more difficult to get into than their male-only counterparts.
  • Approach: I will use the columns which designate men-only and women-only schools as 1s and 0s, the control variable, the SAT/ACT columns, admission rates, and the number of undergraduates to compare samples. I want to first understand the populations of men’s vs. women’s schools.

  • Visualization:

  • Statistical Test to Confirm Results: I’d use a two-sample t-test here to look at the differences between the means of each “type” of school to see if they are significantly different. I could also use a regression analysis to look at the correlation in the last visualization.

What type of school are dependent students most likely to attend?

  • As someone who is a member of the dependent college student population, I think it will be interesting to see if there is a difference in where they attend school in terms of location and cost. I will look at schools that have an above-average rate of dependency.

  • Approach: I want to use the dummy variable which determines if it is a college or university, as well as the locale, and control variables to determine characteristics of the school.

  • Visualization:

There are 3301 schools below average dependent rate and 2893 above the average. 921 schools do not report a rate for dependent students.

  • These visualizations show that the private, non-profit group has the highest count of schools with an above-average dependency rate as well as a family income above the Ohio median. This suggests that students who are dependent on someone else’s income (i.e., do not pay for their own school) and have a higher family income will likely choose a private, non-profit school.

  • Private, non-profit schools have the highest average dependent rate of around 65%, compared to public schools at around 60%.

  • When looking specifically at the location of these schools, those with locale code “31”, a town less than ten miles from an urbanized area, have the highest rate. I attribute this to the college town phenomenon. Locale code “21”, which covers schools located in suburbs, has the lowest rate.

  • Statistical Test to Confirm Results: I would use a one-way ANOVA to test for differences in dependency in both the location of the school and the type of school.
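
A one-way ANOVA for school type could be set up as in the sketch below, using base R and made-up dependency rates. The group labels are illustrative. Testing location would follow the same pattern, with LOCALE converted to a factor in place of CONTROL.

```r
# Toy data: dependency rates for four schools of each type (made up).
schools_demo <- data.frame(
  CONTROL   = factor(rep(c("public", "private_nonprofit", "private_forprofit"), each = 4)),
  DEPENDENT = c(0.60, 0.58, 0.63, 0.61,
                0.66, 0.64, 0.67, 0.65,
                0.35, 0.40, 0.38, 0.33)
)

# One-way ANOVA: does mean dependency rate differ across school types?
dep_aov <- aov(DEPENDENT ~ CONTROL, data = schools_demo)

# summary() returns a list; its first element is the ANOVA table.
aov_table <- summary(dep_aov)[[1]]
print(aov_table)
```

A significant F statistic only says that at least one group mean differs; a follow-up such as `TukeyHSD(dep_aov)` would be needed to see which pairs differ.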