1.1 Purpose of the document: This document compares higher education institutions in Ohio to other regional and national schools.
1.2 Explanation of data: This data is a subset of the Department of Education’s college scorecard from the 2018-2019 reporting year. Each observation is a unique higher education institution. For each institution, the data includes their location, acceptance rate, gender ratio, financial aid recipient ratio, average standardized test scores, tuition cost, average family income, and more.
1.3 How analysis helps: This analysis compares institutions on the aforementioned variables by grouping and categorizing schools based on their size, exclusivity, cost, location, and more. You will be able to understand how Ohio schools compare regionally and nationally and see opportunities for further analysis.
Packages
2.1 There are 3 packages necessary for the analysis.
2.3 Tidyverse: The tidyverse is a collection of packages that share a common design and syntax, which makes for a more seamless data science workflow.
Dplyr: Dplyr is comparable to the SQL language and helps users manipulate datasets easily.
Datatable: Datatable creates interactive JavaScript tables for the data.
Import Data 3.1
Cleaning Data 3.2
## ID INSTNM CITY STABBR ZIP CONTROL LOCALE LATITUDE
## 0 0 0 0 0 0 444 445
## LONGITUDE HBCU MENONLY WOMENONLY ADM_RATE ACTCM25 ACTCM75 ACTCMMID
## 445 0 444 444 5078 5823 5823 5823
## SAT_AVG UGDS COSTT4_A AVGFACSAL PCTPELL PCTFLOAN AGE_ENTRY FEMALE
## 5795 0 3531 2868 770 770 500 1429
## MARRIED DEPENDENT VETERAN FIRST_GEN FAMINC
## 1392 921 4538 1247 500
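The counts above appear to be the number of missing values in each column. A minimal sketch of how such a tally can be produced is shown here; the `toy` data frame and its values are made up for illustration and are not the scorecard data.

```r
# Hypothetical toy data; the real analysis uses the College Scorecard subset.
toy <- data.frame(
  INSTNM   = c("School A", "School B", "School C"),
  ADM_RATE = c(0.65, NA, 0.80),
  SAT_AVG  = c(NA, NA, 1210)
)

# is.na() marks each missing cell; colSums() totals them per column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Columns with a count of 0, such as ID and INSTNM in the output above, are complete.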
A. Several observations have a locale of “-3”. This value is not listed as an option in the data dictionary. I chose to keep these values as they were and treat the “-3”s as their own group. In a real-life scenario, I would research where each of these institutions is located and reclassify them.
B. Some institutions have an admission rate of 0.00. This could mean that the institution is shutting down and no more students are being admitted. Again, further research could be conducted into these specific schools to see if that is the case or if the data is erroneous.
## ID
## " 100654"
## INSTNM
## "A T Still University of Health Sciences"
## CITY
## "Aberdeen"
## STABBR
## "AK"
## ZIP
## "00602"
## CONTROL
## "1"
## LOCALE
## "-3"
## LATITUDE
## "-14.322636"
## LONGITUDE
## "-100.03748"
## HBCU
## "0"
## MENONLY
## " 0"
## WOMENONLY
## " 0"
## ADM_RATE
## "0.00"
## ACTCM25
## " 1"
## ACTCM75
## " 9"
## ACTCMMID
## " 6"
## SAT_AVG
## " 564"
## UGDS
## "0"
## COSTT4_A
## " 0"
## AVGFACSAL
## " 0"
## PCTPELL
## "0.00"
## PCTFLOAN
## "0.00"
## AGE_ENTRY
## "17.43"
## FEMALE
## "0.02"
## MARRIED
## "0.00"
## DEPENDENT
## "0.03"
## VETERAN
## "0.00"
## FIRST_GEN
## "0.09"
## FAMINC
## " 321.3853"
## ID INSTNM CITY STABBR
## "49005401" "ZMS The Academy" "Zanesville" "WY"
## ZIP CONTROL LOCALE LATITUDE
## "99801" "Public" "43" " 71.324702"
## LONGITUDE HBCU MENONLY WOMENONLY
## " 171.37813" "NULL" " 1" " 1"
## ADM_RATE ACTCM25 ACTCM75 ACTCMMID
## "1.00" "34" "35" "35"
## SAT_AVG UGDS COSTT4_A AVGFACSAL
## "1558" "NULL" "93704" "22924"
## PCTPELL PCTFLOAN AGE_ENTRY FEMALE
## "1.00" "1.00" "58.90" "0.98"
## MARRIED DEPENDENT VETERAN FIRST_GEN
## "0.82" "0.99" "0.35" "0.96"
## FAMINC
## "174263.2500"
UGDS, the number of undergraduate certificate/degree-seeking students, is stored as a factor variable instead of a number. This has been rectified by adding a ‘UGDS_num’ column in numeric format.
In several columns, missing values are recorded as the text “NULL”. For the analysis and the sake of consistency, all “NULL” values have been changed to “NA”.
The CONTROL variable describes if the institution is public or private. In some instances, it is coded as a number and in others, it is written in words. The data set has been corrected so all CONTROL data is the number code.
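The three cleaning steps above could be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the report: the `toy` rows are made up, and the written-out labels other than “Public” are guesses at how CONTROL might be spelled in the raw file.

```r
library(dplyr)

# Hypothetical toy rows that mimic the three issues described above.
toy <- data.frame(
  INSTNM  = c("School A", "School B", "School C"),
  UGDS    = factor(c("1200", "NULL", "350")),
  CONTROL = c("1", "Public", "Private nonprofit"),
  stringsAsFactors = FALSE
)

clean <- toy %>%
  mutate(
    # Convert every column to character, then turn the literal "NULL" into NA.
    across(everything(), ~ na_if(as.character(.x), "NULL")),
    # Numeric copy of undergraduate enrollment.
    UGDS_num = as.numeric(UGDS),
    # Map written-out control types to number codes (label spellings are assumed).
    CONTROL = case_when(
      CONTROL == "Public" ~ "1",
      CONTROL == "Private nonprofit" ~ "2",
      CONTROL == "Private for-profit" ~ "3",
      TRUE ~ CONTROL
    )
  )

print(clean)
```

Converting through `as.character()` first matters for factor columns, because calling `as.numeric()` directly on a factor returns its internal level codes rather than the printed values.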
Dummy Variables
Many dummy variables have been added to the data set to help the analysis.
A ‘Relative_Income’ variable shows if the family income is under or over the average Ohio household income of $54,021.
An ‘Inst_Status’ variable shows if the name of the institution contains the word “University” (identified by the value “1”), “College” (identified by the value “2”), or neither (identified by the value “3”).
A ‘Border_OH’ variable reports “Yes” if the state of the institution borders the state of Ohio and “No” if not.
A ‘Cost_HML’ variable classifies the cost to attend the institution as “High” (over $30,000), “Medium” ($20,000 to $30,000), or “Low” (less than $20,000).
A ‘Selective_Status’ variable labels schools that admit fewer than 10% of applicants “Selective”. All other institutions are labeled “Not Selective”.
A ‘Service_Academy_Status’ variable labels the 5 United States service academies (e.g., the United States Naval Academy) as “Yes” and all other institutions as “No”.
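A minimal sketch of how several of these dummy variables could be derived with dplyr follows. The `toy` rows are made up, the “Over”/“Under” labels are assumed, and so are the tie-breaking choices (a name containing both words counts as a university; an income exactly at the threshold counts as “Under”).

```r
library(dplyr)

# Hypothetical toy rows; values are illustrative only.
toy <- data.frame(
  INSTNM   = c("Alpha University", "Beta College", "Gamma Institute"),
  STABBR   = c("OH", "PA", "CA"),
  FAMINC   = c(60000, 40000, 54021),
  COSTT4_A = c(35000, 25000, 12000),
  ADM_RATE = c(0.08, 0.70, 0.55)
)

border_states <- c("PA", "MI", "IN", "KY", "WV")  # states that border Ohio

flagged <- toy %>%
  mutate(
    # Compare family income with the Ohio household average from the text.
    Relative_Income = if_else(FAMINC > 54021, "Over", "Under"),
    # Classify by the words that appear in the institution name.
    Inst_Status = case_when(
      grepl("University", INSTNM) ~ "1",
      grepl("College", INSTNM) ~ "2",
      TRUE ~ "3"
    ),
    Border_OH = if_else(STABBR %in% border_states, "Yes", "No"),
    # A missing cost matches no condition and therefore stays NA.
    Cost_HML = case_when(
      COSTT4_A > 30000 ~ "High",
      COSTT4_A >= 20000 ~ "Medium",
      COSTT4_A < 20000 ~ "Low"
    ),
    Selective_Status = if_else(ADM_RATE < 0.10, "Selective", "Not Selective")
  )

print(flagged)
```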
Data Explanation 3.3
There are 7,115 institutions listed in the data set.
## [1] 7115
Unavailable data is listed as “NA” in the data set.
Below is a description of the key variables used in the analysis.
## Variable_Name
## 1 ID
## 2 INSTNM
## 3 STABBR
## 4 ADM_RATE
## 5 ACTCMMID
## 6 COSTT4_A
## 7 PCTPELL
## 8 FAMINC
## 9 Inst_Status
## 10 Border_OH
## 11 UGDS_num
## Description
## 1 Unique ID for institution
## 2 Institution name
## 3 State postcode
## 4 admission rate
## 5 midpoint of the ACT cumulative score
## 6 average total cost of attendance
## 7 percentage of undergraduate students receiving a Pell Grant
## 8 average family income in real 2015 dollars
## 9 1 for university, 2 for college, 3 for neither university or college
## 10 Yes for being in a state that borders Ohio, No otherwise
## 11 enrollment of undergraduate certificate/degree-seeking students
3.4 Below is an interactive table with a small sample of the data set and key variables. The observations chosen for the sample have the highest mid-ACT scores.
Summary 3.5
Here is a brief summary of the average values for all key continuous variables in the data set. For example, the average admission rate across all higher education institutions is 68%.
## # A tibble: 1 x 6
## Avg_Pct_Pell Avg_Adm_Rate Avg_ACT_Mid Avg_Cost Avg_Fam_Inc Avg_UGDS
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.482 0.682 23.4 26337. 38483. 2426.
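A one-row summary like the one above can be produced with `summarise()`. The sketch below uses made-up values and only three of the six columns; it assumes missing values should be dropped before averaging.

```r
library(dplyr)

# Hypothetical toy data with some missing values.
toy <- data.frame(
  PCTPELL  = c(0.40, 0.60, NA),
  ADM_RATE = c(0.50, NA, 0.90),
  COSTT4_A = c(20000, 30000, 25000)
)

# na.rm = TRUE drops missing values before each average is taken.
averages <- toy %>%
  summarise(
    Avg_Pct_Pell = mean(PCTPELL, na.rm = TRUE),
    Avg_Adm_Rate = mean(ADM_RATE, na.rm = TRUE),
    Avg_Cost     = mean(COSTT4_A, na.rm = TRUE)
  )

print(averages)
```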
Here are some summaries for the key categorical variables.
Below is a list of the states with the highest number of higher education institutions. California tops the list with 716 institutions and Ohio falls at #6 with 318 schools.
## # A tibble: 59 x 2
## STABBR n
## <chr> <int>
## 1 CA 716
## 2 NY 452
## 3 TX 446
## 4 FL 412
## 5 PA 378
## 6 OH 318
## 7 IL 277
## 8 MI 195
## 9 NC 188
## 10 MO 180
## # ... with 49 more rows
Below is a list of the number of institutions within each Institution Status (1 being university, 2 college, and 3 neither).
## # A tibble: 3 x 2
## Inst_Status n
## <chr> <int>
## 1 2 2829
## 2 3 2515
## 3 1 1771
Below is a list of the number of institutions in each type of control (1 being public, 2 being private non-profit, and 3 being private for-profit).
## # A tibble: 3 x 2
## CONTROL n
## <chr> <int>
## 1 3 2998
## 2 1 2076
## 3 2 2041
Simple Analysis and Trends
4.1 Show the number of institutions in Ohio and in each of the states that borders Ohio. Ohio, with 318 institutions, has the second greatest number, behind Pennsylvania with 378.
## # A tibble: 6 x 2
## STABBR n
## <chr> <int>
## 1 PA 378
## 2 OH 318
## 3 MI 195
## 4 IN 155
## 5 KY 101
## 6 WV 74
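A count like the one above can be obtained by filtering to Ohio and its neighbors and then counting rows per state. This is a minimal sketch on made-up rows; `ohio_region` is a name introduced here for illustration.

```r
library(dplyr)

# Hypothetical toy data: one row per institution.
toy <- data.frame(
  STABBR = c("OH", "OH", "PA", "PA", "PA", "MI", "CA", "TX")
)

ohio_region <- c("OH", "PA", "MI", "IN", "KY", "WV")  # Ohio plus bordering states

# Keep only the regional states, then count institutions per state, largest first.
region_counts <- toy %>%
  filter(STABBR %in% ohio_region) %>%
  count(STABBR, sort = TRUE)

print(region_counts)
```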
4.2 Illustrate how the cost of attendance varies by family income for all institutions. As the graph shows, there is a fairly strong positive correlation between family income and cost of attendance.
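The relationship described in 4.2 can also be checked numerically with a correlation coefficient. The sketch below uses made-up values and assumes that rows with a missing value in either variable should be ignored.

```r
# Hypothetical toy data; the last row has a missing income.
toy <- data.frame(
  FAMINC   = c(30000, 50000, 70000, 90000, NA),
  COSTT4_A = c(15000, 24000, 31000, 42000, 20000)
)

# Pearson correlation using only rows where both values are present.
r <- cor(toy$FAMINC, toy$COSTT4_A, use = "complete.obs")
print(r)

# Scatterplot of cost against family income.
plot(toy$FAMINC, toy$COSTT4_A,
     xlab = "Average family income", ylab = "Average cost of attendance")
```

A value of `r` near 1 indicates a strong positive linear relationship, and a value near 0 indicates little or none.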
4.3 Compare the number of undergraduates across each of the 3 institutional control types for all institutions. The average number of undergraduates at public and private non-profit institutions is very similar, around 2,050 students. Private for-profit institutions (3) have the greatest number of students on average.
## # A tibble: 3 x 3
## CONTROL mean n
## <chr> <dbl> <int>
## 1 1 NA 2076
## 2 2 NA 2041
## 3 3 NA 2998
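The mean column above prints as NA, which is what `mean()` returns when a group contains missing values and `na.rm` is left at its default. A minimal sketch of the difference, on made-up enrollment figures, assuming the NA values come from missing entries in UGDS_num:

```r
library(dplyr)

# Hypothetical toy data; one public institution has missing enrollment.
toy <- data.frame(
  CONTROL  = c("1", "1", "2", "2", "3"),
  UGDS_num = c(1000, NA, 500, 1500, 300)
)

# Default behavior: any group that contains an NA yields an NA mean.
naive <- toy %>%
  group_by(CONTROL) %>%
  summarise(mean = mean(UGDS_num), n = n())

# Dropping missing values first gives a usable average for every group.
fixed <- toy %>%
  group_by(CONTROL) %>%
  summarise(mean = mean(UGDS_num, na.rm = TRUE), n = n())

print(naive)
print(fixed)
```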
4.4 Show a relationship between ACT or SAT scores and family income across each of the states that border the state of Ohio. Family income and average SAT scores are positively correlated. Indiana and Pennsylvania tend to have higher incomes and test scores than the other states while West Virginia and Kentucky have lower incomes and test scores.
Directed Analysis
5.1 Do you find support for the old adage: “Private schools cost more than public schools”? Explain.
Approach: The CONTROL variable categorizes institutions based on private or public status. I will use a boxplot to show the distribution of the cost of attendance across each of the institution types.
Result: Public schools are coded as 1 and private schools are coded as 2 (non-profit) and 3 (for-profit). Yes, this supports the claim that private schools cost more. However, the costs of private non-profit institutions vary much more widely.
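A boxplot of this kind, together with a numeric summary of center and spread, could be sketched as below. The cost figures are made up purely to illustrate the calls and do not reproduce the scorecard results.

```r
# Hypothetical toy data: three institutions per control type.
toy <- data.frame(
  CONTROL  = c("1", "1", "1", "2", "2", "2", "3", "3", "3"),
  COSTT4_A = c(15000, 20000, 25000, 20000, 45000, 70000, 25000, 30000, 35000)
)

# One box per control type; the thick line inside each box is the median.
boxplot(COSTT4_A ~ CONTROL, data = toy,
        xlab = "Control type", ylab = "Average cost of attendance")

# Median and interquartile range per control type, to read alongside the plot.
cost_median <- tapply(toy$COSTT4_A, toy$CONTROL, median, na.rm = TRUE)
cost_iqr    <- tapply(toy$COSTT4_A, toy$CONTROL, IQR, na.rm = TRUE)

print(cost_median)
print(cost_iqr)
```

Note that a boxplot displays medians and quartiles rather than means, so the interquartile range is the natural companion measure of spread.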
5.2 How does the average family income of students at Xavier University compare nationally? Within Ohio?
Approach: I will filter the data to include only certain institutions in the calculation (i.e., Xavier, then Ohio schools, then all schools nationwide). Then I will average the FAMINC variable.
Result: Xavier families earn on average $114,330. Ohio families earn on average $42,380. National families earn on average $38,483. Xavier families are very affluent compared to the families of Ohio and nationwide institutions.
## # A tibble: 1 x 2
## INSTNM FAMINC
## <chr> <dbl>
## 1 Xavier University 114330.
## # A tibble: 1 x 1
## national_income_mean
## <dbl>
## 1 38483.
## # A tibble: 1 x 1
## ohio_income_mean
## <dbl>
## 1 42380.
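The three figures above come from the same pattern: filter to a subset, then average FAMINC. A minimal sketch on made-up rows (only the institution name “Xavier University” and the column names are taken from the report):

```r
library(dplyr)

# Hypothetical toy data; incomes are illustrative only.
toy <- data.frame(
  INSTNM = c("Xavier University", "Ohio School B", "School C", "School D"),
  STABBR = c("OH", "OH", "PA", "CA"),
  FAMINC = c(110000, 40000, 50000, NA)
)

# Single institution: just pull its row.
xavier_income <- toy %>%
  filter(INSTNM == "Xavier University") %>%
  select(INSTNM, FAMINC)

# Ohio only: filter by state, then average.
ohio_income_mean <- toy %>%
  filter(STABBR == "OH") %>%
  summarise(ohio_income_mean = mean(FAMINC, na.rm = TRUE))

# Nationwide: no filter, average everything that is not missing.
national_income_mean <- toy %>%
  summarise(national_income_mean = mean(FAMINC, na.rm = TRUE))

print(xavier_income)
print(ohio_income_mean)
print(national_income_mean)
```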
5.3 How does the cost of attending an Ohio ‘university’ compare to universities in states that border Ohio? What about universities nationally, not considering state?
Approach: I will filter the data to include only universities, then apply a second filter for the institutions in each calculation (i.e., Ohio schools, then regional schools, then all schools nationwide). Then I will average the attendance cost variable.
Result: Ohio universities cost on average $28,736. Border state universities cost on average $32,779. National universities cost on average $30,351. Therefore, on average, Ohio universities are the most affordable.
## # A tibble: 1 x 1
## ohio_tuition_mean
## <dbl>
## 1 28736.
## # A tibble: 1 x 1
## border_tuition_mean
## <dbl>
## 1 32779.
## # A tibble: 1 x 1
## national_tuition_mean
## <dbl>
## 1 30351.
5.4 What schools have the highest and lowest percentage of undergraduate students receiving a Pell grant?
Approach: I will sort all institutions by the percent of students receiving a Pell Grant. The first table shows 10 schools where all students receive a Pell Grant. The second table shows 10 schools where no students receive a Pell Grant.
Result: I chose to show 10 for formatting and ease of understanding but there are many more institutions that have 100% and 0% receiving Pell Grants. Many of the schools where all students receive Pell Grants are beauty schools. Many of the schools where no students receive a Pell Grant are theology schools.
## # A tibble: 48 x 2
## INSTNM PCTPELL
## <chr> <dbl>
## 1 MTI Business College Inc 1
## 2 Mr Bela's School of Cosmetology Inc 1
## 3 Southern School of Beauty Inc 1
## 4 Victoria Beauty College Inc 1
## 5 Central School of Practical Nursing 1
## 6 Virginia School of Hair Design 1
## 7 Instituto de Educacion Tecnica Ocupacional La Reine-Manati 1
## 8 Colegio Mayor de Tecnologia Inc 1
## 9 Liceo de Arte-Dise-O y Comercio 1
## 10 Nouvelle Institute 1
## # ... with 38 more rows
## # A tibble: 74 x 2
## INSTNM PCTPELL
## <chr> <dbl>
## 1 Bais Binyomin Academy 0
## 2 United States Coast Guard Academy 0
## 3 American Islamic College 0
## 4 Principia College 0
## 5 The Southern Baptist Theological Seminary 0
## 6 New Orleans Baptist Theological Seminary 0
## 7 United States Naval Academy 0
## 8 MGH Institute of Health Professions 0
## 9 Saint John's Seminary 0
## 10 Hillsdale College 0
## # ... with 64 more rows
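Lists like the two above can be produced by keeping the rows at each extreme of PCTPELL. The sketch below uses made-up rows and assumes institutions with a missing PCTPELL should be ignored.

```r
library(dplyr)

# Hypothetical toy data; School D has no reported Pell share.
toy <- data.frame(
  INSTNM  = c("School A", "School B", "School C", "School D"),
  PCTPELL = c(1.00, 0.35, 0.00, NA)
)

# Institutions tied for the highest Pell share, showing at most 10.
highest_pell <- toy %>%
  filter(PCTPELL == max(PCTPELL, na.rm = TRUE)) %>%
  select(INSTNM, PCTPELL) %>%
  head(10)

# Institutions tied for the lowest Pell share, showing at most 10.
lowest_pell <- toy %>%
  filter(PCTPELL == min(PCTPELL, na.rm = TRUE)) %>%
  select(INSTNM, PCTPELL) %>%
  head(10)

print(highest_pell)
print(lowest_pell)
```

`filter()` drops rows where the comparison evaluates to NA, so institutions without a reported value never appear in either list.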
Self-Directed Analysis
6.1 Compare the average cost of attendance across the number of undergraduates, the percent of students receiving a Pell grant, the average faculty salary and the average family income in whatever way you choose. If one of these variables was classified as a ‘dependent’ variable, which would you say it is and how would you evaluate the effect of the other variables on your dependent variable?
Approach: I created a dummy variable to classify institutions’ cost as High, Medium, or Low. This is my dependent variable. I used this grouping to calculate the average number of undergraduates, family income, etc.
Result: High cost institutions tend to have the highest average faculty salary and family income. Low cost institutions have the greatest number of undergraduate students on average. Medium cost institutions have the largest share of students receiving Pell Grants.
## # A tibble: 4 x 5
## Cost_HML Avg_Pct_Pell Avg_Fac_Sal Avg_Fam_Inc Avg_UGDS
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 High 0.354 7673. 68967. 2574.
## 2 Low 0.423 6372. 32605. 5051.
## 3 Medium 0.520 6392. 43898. 4361.
## 4 <NA> 0.552 5932. 28716. 268.
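A grouped summary like the table above can be sketched as follows, using made-up rows and assuming missing values are dropped within each group.

```r
library(dplyr)

# Hypothetical toy data; values are illustrative only.
toy <- data.frame(
  Cost_HML  = c("High", "High", "Low", "Low", "Medium"),
  PCTPELL   = c(0.30, 0.40, 0.45, NA, 0.50),
  AVGFACSAL = c(8000, 7000, 6000, 6500, 6400),
  FAMINC    = c(70000, 65000, 30000, 35000, 44000),
  UGDS_num  = c(2000, 3000, 5000, 5200, 4300)
)

# One row per cost tier, with each average computed inside its own tier.
by_cost <- toy %>%
  group_by(Cost_HML) %>%
  summarise(
    Avg_Pct_Pell = mean(PCTPELL, na.rm = TRUE),
    Avg_Fac_Sal  = mean(AVGFACSAL, na.rm = TRUE),
    Avg_Fam_Inc  = mean(FAMINC, na.rm = TRUE),
    Avg_UGDS     = mean(UGDS_num, na.rm = TRUE)
  )

print(by_cost)
```

Because Cost_HML is a character column, the groups print in alphabetical order (High, Low, Medium), which matches the ordering in the table above.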
6.2 Compare the student populations of schools in heavily urbanized areas with those in very rural areas. Keep in mind, the type of school varies considerably by urban and rural areas. Do your best to control for this bias with the variables you have available to focus on differences within the populations of urban and rural schools and NOT the differences between the type of school.
Approach: “11” is the most urban locale and “43” is the most rural locale. I will filter for institutions in these two locales and compare the student populations on the number of undergraduate students, percent receiving federal loans, their age of entry, dependent status, and percent of first-generation students.
Results: Many more students at urban schools receive federal loans (about 51% versus 27%), and the average age of entry is 27, compared to 25 at very rural schools. Both ages are much higher than 18, the conventional age at which someone graduates high school and begins higher education.
To check for statistical significance, I would test if there was a statistically significant difference in the means with an ANOVA.
## # A tibble: 2 x 6
## LOCALE Avg_Fed_Loan Avg_Age_Entry Avg_Pct_Dependent Avg_Pct_First_Gen Avg_UGDS
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11 0.506 26.6 0.454 0.463 2426.
## 2 43 0.271 24.8 0.555 0.433 2426.
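The ANOVA mentioned above could be run as sketched below. The ages are made up, so the resulting statistics say nothing about the real data; the sketch only shows the calls. With exactly two groups, a one-way ANOVA and a pooled-variance t-test give the same p-value, which the last lines illustrate.

```r
# Hypothetical toy data: age of entry at three urban and three rural schools.
toy <- data.frame(
  LOCALE    = factor(c("11", "11", "11", "43", "43", "43")),
  AGE_ENTRY = c(27.0, 26.0, 28.0, 24.0, 25.0, 26.0)
)

# One-way ANOVA of age of entry by locale.
fit <- aov(AGE_ENTRY ~ LOCALE, data = toy)
anova_table <- summary(fit)[[1]]
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]

# Equivalent two-sample t-test that assumes equal variances.
t_res <- t.test(AGE_ENTRY ~ LOCALE, data = toy, var.equal = TRUE)

print(anova_table)
print(t_res$p.value)
```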
6.3
Question 1: I applied to four universities when I was a senior in high school. Some of the important variables I considered when making my decision were mid-ACT, gender ratio, cost, locale, and family income. How do these schools compare?
Approach: I will filter for the four schools I applied to (John Carroll University, University of Dayton, Xavier University and Villanova University). Then I will report their mid-ACT, locale, etc.
Results: UD and XU are city schools, whereas JCU and VU are suburban. JCU is the only school with more men than women. VU is more prestigious, with a much higher mid-ACT score (32, compared to around 26 for the others); however, it also costs much more than the other schools. All 4 of the schools have average family incomes over $110,000.
Statistical Method: If I wanted to see if there were statistical differences in their means, I would use an ANOVA test across each of the continuous variables. This would have helped inform my college decision to see where the schools significantly differed from each other.
## # A tibble: 4 x 6
## INSTNM LOCALE FEMALE ACTCMMID COSTT4_A FAMINC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 University of Dayton 12 0.52 27 56370 134621.
## 2 John Carroll University 21 0.47 25 52159 111360.
## 3 Xavier University 11 0.570 25 50880 114330.
## 4 Villanova University 21 0.580 32 65649 139368.
Question 2: In impoverished states, the cycle of poverty is often hard for families to break. Adults with a higher education degree/certificate are more likely to break the cycle than those without. Often, these students are first-generation. In the most impoverished states, what percent of students are first-generation? How does this compare to the richest states?
Approach: After some research on states’ poverty levels, I filtered for the 5 most impoverished states (MS, NM, LA, WV, and AR). I averaged the percent of first-generation students and the percent of Pell Grant recipients for each of the 5 states. Then I did the same for the 5 richest states (NH, MN, HI, ND, and MD).
Results: The most impoverished states all have a similar share of first-generation students (~42-51%), whereas the richest states have a lower range of first-generation students (~31-45%). The same pattern holds for those receiving Pell Grants. This is good news because it seems that more students are given the opportunity to go to a higher education institution in the states where it is needed the most.
Statistical Method: I would use an ANOVA to test the difference in means between the most impoverished and richest states to see if there is a statistically significant difference.
## # A tibble: 5 x 3
## STABBR Avg_First_Gen_Rate Avg_Pct_Pell
## <chr> <dbl> <dbl>
## 1 AR 0.489 0.565
## 2 LA 0.508 0.571
## 3 MS 0.424 0.592
## 4 NM 0.472 0.495
## 5 WV 0.512 0.541
## # A tibble: 5 x 3
## STABBR Avg_First_Gen_Rate Avg_Pct_Pell
## <chr> <dbl> <dbl>
## 1 HI 0.453 0.398
## 2 MD 0.453 0.464
## 3 MN 0.331 0.380
## 4 ND 0.316 0.358
## 5 NH 0.368 0.382
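The two tables above follow the same recipe with different state lists. A minimal sketch on made-up rows is shown here; `summarise_states()` is a hypothetical helper introduced for illustration, not a function from the report or from dplyr.

```r
library(dplyr)

# Hypothetical toy data; shares are illustrative only.
toy <- data.frame(
  STABBR    = c("MS", "MS", "NM", "NH", "NH", "OH"),
  FIRST_GEN = c(0.40, 0.50, 0.45, 0.30, 0.40, 0.42),
  PCTPELL   = c(0.60, 0.55, 0.50, 0.35, 0.40, 0.45)
)

poorest <- c("MS", "NM", "LA", "WV", "AR")  # state lists taken from the text
richest <- c("NH", "MN", "HI", "ND", "MD")

# Hypothetical helper: per-state averages for a chosen set of states.
summarise_states <- function(data, states) {
  data %>%
    filter(STABBR %in% states) %>%
    group_by(STABBR) %>%
    summarise(
      Avg_First_Gen_Rate = mean(FIRST_GEN, na.rm = TRUE),
      Avg_Pct_Pell       = mean(PCTPELL, na.rm = TRUE)
    )
}

poor_summary <- summarise_states(toy, poorest)
rich_summary <- summarise_states(toy, richest)

print(poor_summary)
print(rich_summary)
```

Wrapping the pipeline in one function keeps the two tables consistent, since both are guaranteed to use the same filters and averages.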