R EXAM
## Parsed with column specification:
## cols(
## .default = col_double(),
## INSTNM = col_character(),
## CITY = col_character(),
## STABBR = col_character(),
## ZIP = col_character(),
## CONTROL = col_character(),
## HBCU = col_character(),
## UGDS = col_character()
## )
## See spec(...) for full column specifications.
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
Introduction (Section 1)
1.1 The purpose of this doucment is to analyize the different universities across the great state of Ohio and see how they compare to other universities across the United States. ##
1.2 The data that I used was reported by the Department of Education and compares Ohio’s universities to the rest of the US. Some of the data that is crucial to comparing these universites are: 1. Admission Rate 2. 25th, 75th, and midpoint ACT cumulative scores 3. Average SAT scores 4. Faculty salary 5. Average cost of attendance
1.3 This analysis will help any individual better understand how institutions of post secondary education in Ohio compare to other schools regionally and nationally. This analysis will include visualizations comparing the number of insitituions in ohio to ohio’s border states a comparasin between cost of attendance and family income for all instituions
a comparasin between the number of undergrads across 3 institutions a comparasin between the relationship between ACT scores and fmaily incomes across each state that borders Ohio. plus more!!
Packages (section 2)
Here are the packages that I used to analyze, create and define this dataset. I used all of them I think.
| Package | Explanation |
|---|---|
| Tidyverse | To tidy up the data. |
| ggplot2 | I chose the 2nd ggplot because sequels are usually better. Also because this makes better visualizations than base r. |
| skimr | This is really good at summarising data (see how I spelled summarising? that’s a R joke, normies wouldn’t understand.) |
| dplyr | Used for data manipulation. (I misspelled this like a thousand times. dlpyr? dpylr? dlypr? idk this was very frusterating) |
| data.table | more data manipulatoin. |
| tidyr | to clean messy data |
Data Prep, Cleaning, and Wranglin’ (Section 3)
Okay, So the first step for me was to conform the data. I looked at the the data, and there was a lot of NULL’s where there should of been NA’s. Null is SQL language (gross) and wee need all missing data to be NA.
I ran this command to transform all NULLS to NA:
is.na(scorecard) <- scorecard == “NULL”
Next I created three Dummy variables that my professor required and 3 variables of my choosing
Dummy Variable 1
“Median household income for Ohio in 2019 is $54,021. Create a dummy variable indicating whether the average family income of students attending the school is greater or less than this value.”
For this I created a dummy variable using mutate, which is apart of dlypr
x <- 54021
scorecard <- scorecard%>%
mutate(above_or_below_median_HHIncome=ifelse(FAMINC < x ,"lower than median",
ifelse(FAMINC >= x , "greater than median",
"exactly the median"))) this added a new column that told you if household income for each institution was above or below $54,021.
Dummy Variable 2
“Create a dummy variable indicating whether the institution is a ‘Univeristy’ or ‘College’, as determined by whether the institution has the word ‘College’ or ‘University’ in its name”
scorecard<-scorecard%>%
mutate(College_or_university = ifelse(grepl("College", INSTNM), "College",
ifelse(grepl("University", INSTNM), "University", "Other")))Dummy Variable 3
“Create a dummy variable for institutions in states that border Ohio.”
scorecard<-scorecard %>%
mutate(Borders_Ohio = ifelse(grepl("IN", STABBR), 1,
ifelse(grepl("PA", STABBR), 1,
ifelse(grepl("WV", STABBR), 1,
ifelse(grepl("KY", STABBR), 1,
ifelse(grepl("OH", STABBR), 1,
ifelse(grepl("MI", STABBR),1, 0)))))))scorecard<-scorecard %>%
mutate(Borders_Ohio = ifelse(grepl("IN", STABBR), 1,
ifelse(grepl("PA", STABBR), 1,
ifelse(grepl("WV", STABBR), 1,
ifelse(grepl("KY", STABBR), 1,
ifelse(grepl("OH", STABBR), 1,
ifelse(grepl("MI", STABBR),1, 0)))))))Dummy Variable 4
This is the first dummy variable that I created myself. It takes the average age of entry into a institution and indicates whether the age of entry of students attending the school is greater or less than this the average.
y <- 26.00699169
scorecard <- scorecard%>%
mutate(above_or_below_avg_age_of_entry =ifelse(AGE_ENTRY < y ,"lower than avg age",
ifelse(AGE_ENTRY >= y , "greater than avg age",
"exactly the avg age")))Dummy Variable 5
This is my second dummy variable. This tells you whether or not a school is considered urban or rural. If it is neither, it is labeled as “somewhere inbetween.”
scorecard<-scorecard %>%
mutate(Rural_or_Urban = ifelse(grepl("11", LOCALE), "Urban",
ifelse(grepl("12", LOCALE), "Urban",
ifelse(grepl("13", LOCALE), "Urban",
ifelse(grepl("41", LOCALE), "Rural",
ifelse(grepl("42", LOCALE), "Rural",
ifelse(grepl("43", LOCALE), "Rural", "Somewhere Inbetwen")))))))Dummy Variable 6
This is my last dummy variable. This variable tells you whehter or not a univeristy is located in Ohio’s three largest cities by population: Cinci, Columbus, and Cleveland.
Simple Analysis and Trends (Section 4)
4.1 “Show the number of institutions in Ohio and in each of the states that borders Ohio.”
4.2 “Illustrate how the cost for attendance varies by family income for all institutions”
ggplot(scorecard, aes(x= scorecard$FAMINC, y = scorecard$COSTT4_A)) +
geom_point() +
labs(x= "Family Income", y= "cost for attendance")## Warning: Removed 3554 rows containing missing values (geom_point).
4.3
ggplot(scorecard, aes(x=scorecard$CONTROL, y= scorecard$UGDS)) +
geom_point() +
labs(x= "type of control", y= "amount of students")4.4
OHIO
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 262 rows containing non-finite values (stat_smooth).
## Warning: Removed 262 rows containing missing values (geom_point).
PENNSYLVANIA
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 278 rows containing non-finite values (stat_smooth).
## Warning: Removed 278 rows containing missing values (geom_point).
INDIANA
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 117 rows containing non-finite values (stat_smooth).
## Warning: Removed 117 rows containing missing values (geom_point).
WEST VIRGINIA
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 57 rows containing non-finite values (stat_smooth).
## Warning: Removed 57 rows containing missing values (geom_point).
KENTUCKY
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 76 rows containing non-finite values (stat_smooth).
## Warning: Removed 76 rows containing missing values (geom_point).
MICHIGAN
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 157 rows containing non-finite values (stat_smooth).
## Warning: Removed 157 rows containing missing values (geom_point).
Directed Analysis (Section 5)
5.1
Do you find support for the old adage "private schools cost more than public schools?
When analyzing the data it is easy to find answers to questions you didn’t even know you were asking. one of these questions is exactly tihs. I created a ggplot bar graph an compared how much universirties cost based on their control types. from this, it is easy to see that private nonproft schools are much more expensive. Thus, the old addage is true.
ggplot(scorecard, aes(x=scorecard$CONTROL, y= scorecard$COSTT4_A)) +
geom_bar(stat = "identity") +
labs(x= "control type", y = "cost for attendance")## Warning: Removed 3531 rows containing missing values (position_stack).
5.2
How does the average family income of students at Xavier University compare nationally? Within Ohio?
When looking at the data, it is easy to see that Xavier University has one of the highest. Within the nation, you can see Xavier falls in the 68th spot for highest family income with an average income of $114,329.
I did this by creating a ggplot histogram and comparing family income to the other universities and institutions in Ohio, and then The rest of the US.
here is the code and visualizations implemented:
scorecard %>%
filter(!STABBR %in% c("OH" )) %>%
group_by(INSTNM) %>%
summarize(count=n()) %>%
filter(count > 100) %>%
arrange(desc(count)) %>%
mutate(highlight_flag = ifelse(INSTNM == 'Xavier University', T, F)) %>%
ggplot(aes(x=scorecard$INSTNM, y=scorecard$FAMINC))+
ggtitle("Average Family Income of Higher Learning Institutions in the USA") +
labs(x= "Institution Name", y="Average Family Income") +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust=1))## Warning: Removed 500 rows containing missing values (position_stack).
5.3
How does the cost of attending an Ohio ‘university’ compare to universities in states that border Ohio? What about universities nationally, not considering state?
When comparing the cost of attending an Ohio Uni to Ohio’s border states, we can see that Ohio actually comes in second to last place, right before West Virginia. Ohio’s average cost for attendace is $29253.34.
When comparing nationally, can see that Ohio universities are averagly priced. It is not among the outliers such as NY MA, or SC.
BY STATE
scorecard_filtered1 <- filter(scorecard, College_or_university == "University" )
by_state <- group_by(scorecard_filtered1, STABBR)
summary_by_cost_by_state <- summarise(by_state, mean_cost = mean(COSTT4_A, na.rm = TRUE))
scorecard_filteredstates <- filter(summary_by_cost_by_state, STABBR == 'PA' |STABBR == 'MI' | STABBR == 'IN' | STABBR == 'KY' | STABBR == 'WV' | STABBR == 'OH')
ggplot(scorecard_filteredstates, aes(x= STABBR, y=mean_cost)) +
geom_bar(stat = "identity",) +
ggtitle("Ohio and Its Border States Higher Education: Average Cost of Attendance") +
labs(x= "State", y= "Cost Average")BY NATION
ggplot(summary_by_cost_by_state, aes(x= STABBR, y=mean_cost)) +
geom_bar(stat = "identity") +
ggtitle("US States Higher Education: Average Cost of Attendance ") +
labs(x= "state", y= "cost average") +
theme(axis.text.x = element_text(angle = 45, hjust=1))5.4
What schools have the highest and lowest percentage of undergraduate students receiving a Pell grant?
To do this I first found the maximum and minimum of the pell grant column. From there, I created a new dataset and filtered out all the data that was not the max and min pell grants. From this data we can see that the are multitple schools that have a 0% pell grants and 100% pell grant undergrads. 74 schools have 0 undergrads with pell grants and 48 schools have an undergrad class that all have pell grants (100%)
## [1] 1
## [1] 0
scorecard_maxmin <- filter(scorecard, PCTPELL == "0" | PCTPELL == "1")
table(scorecard_maxmin$PCTPELL)##
## 0 1
## 74 48
Self-Directed Analysis (Section 6)
6.1
“Compare the average cost of attendance across the number of undergraduates, the percent of students receiving a Pell grant, the average faculty salary and the average family income in whatever way you choose. If one of these variables was classified as a ‘dependent’ variable, which would you say it is and how would you evaluate the effect of the other variables on your dependent variable?”
Okay, for this visualization I was supposed to compare the cost of attendance across the number of undergrads, the percent of students receiving a pell grants, the averge faculty salary, and the average family income. I tried to do this using a ggplot comparing cost of attendance, average faculty salary, and average family income to the percent of students recieving a pell grant. However, I could never get my visualization to work. rip.
here is what I tried to run
ggplot(
geom_point(scorecard, aes(x=COSTT4_A, y=PCTPELL)) +
geom_smooth(scorecard, aes(x=COSTT4_A, y=PCTPELL), fill="blue",
color="blue", size=1)
geom_point(scorecard, aes(x=AVGFACSAL, y=PCTPELL)) +
geom_smooth(scorecard, aes(x=AVGFACSAL, y=PCTPELL), fill="red",
color="red", size=1)
geom_point(scorecard, aes(x=FAMINC, y=PCTPELL)) +
geom_smooth(scorecard, aes(x=FAMINC, y=PCTPELL), fill="green",
color="green", size=1) ## Error: <text>:5:3: unexpected symbol
## 4: color="blue", size=1)
## 5: geom_point
## ^
I would have classified PCTPELL as the dependent variable, and the other three would have been dependents. I would evalute the effect of the other varibales depending on my dependent variable by observing to see if the amount of pell grants given shows a correlation between family income, faculty salary, and cost of attendance. The assumption I would make is that the lower these three independent variables are, the percent of Pell Grants would increase.
6.2 “Keep in mind, the type of school varies considerably by urban and rural areas. Do your best to control for this bias with the variables you have available to focus on differences within the populations of urban and rural schools and NOT the differences between the type of school.”
“If you were to evaluate the results of your initial findings using a form of statistical inference, what form would it take and how would you pursue it”
scorecard_Urban <- filter(scorecard, Rural_or_Urban == "Urban")
scorecard_Rural <- filter(scorecard, Rural_or_Urban == "Rural")
scorecard_RuralOH <- filter(scorecard_Rural, STABBR == "OH" )
scorecard_UrbanOH <- filter(scorecard_Urban, STABBR == "OH" )PCTPELL
In this example, I am comparing rural and urban institutions in OHIO and the percent of undergrads that have recieved pell grants. I chose to only look at institutions in Ohio because it would make the visualizations easier to interpret.
When looking at both visuals, we can see Urban schools are much more likely to have students who have recived pell grants. Only one Ohio Urban School has zero students receiving a pell grant, Athenaeum of Ohio, which is a seminary school with only 51 students.
We can also see that Urban schools have a large amount of schools that have .75% to 100% of students reciving pell grants. 15 different Urban schools have these percentages compared to rural schools who have only have 3 schools.
ggplot(scorecard_RuralOH, aes(x=INSTNM, y= PCTPELL)) +
geom_point(stat= "identity", color = "blue") +
ggtitle("Percent of Students Reciving Pell Grant at RURAL Schools in OHIO") +
labs(x= "Rural Schools" , y= "pctpell") +
theme(axis.text.x = element_text(angle = 45, hjust=1))## Warning: Removed 2 rows containing missing values (geom_point).
ggplot(scorecard_UrbanOH, aes(x=INSTNM, y= PCTPELL)) +
geom_point(stat= "identity", color = "red") +
ggtitle("Percent of Students Reciving Pell Grant at URBAN Schools in OHIO") +
labs(x= "Urban Schools" , y= "pctpell") +
theme(axis.text.x = element_text(angle = 45, hjust=1))AVGFACSAL
The next visualization compares average faculty salry with schools in Urban and Rural areas.
From this data, we can see that faculty salary in urban areas is much higher than in rural. Only one Urban institution has a average faculty salary below $2500 a month, God’s Bible School and College, which only has an enrollment of 239 students.
We can also see that the maximum monthly salary is much much lower in rural areas. The maximum is Wright State University-Lake Campus for the rural deomgraphic, with an average montly salary of 9598 USD. Whereas the Urban demographics average monthly salary is 12,153 USD at THE Ohio State University-Main Campus.
ggplot(scorecard_RuralOH, aes(x=INSTNM, y= AVGFACSAL)) +
geom_point(stat= "identity", color = "blue") +
ggtitle("Average Faculty Salary at RURAL Schools in OHIO", subtitle = "Salary per Month") +
labs(x= "Rural Schools" , y= "Average Faculty Salary") +
theme(axis.text.x = element_text(angle = 45, hjust=1))## Warning: Removed 21 rows containing missing values (geom_point).
ggplot(scorecard_UrbanOH, aes(x=INSTNM, y= AVGFACSAL)) +
geom_point(stat= "identity", color = "red") +
ggtitle("Average Faculty Salary at URBAN Schools in OHIO", subtitle = "Salary per Month") +
labs(x= "Urban Schools" , y= "AVerage Faculty Salary") +
theme(axis.text.x = element_text(angle = 45, hjust=1))## Warning: Removed 20 rows containing missing values (geom_point).
If I were to evalute the result of my findings using a statistical method of inference, I would use Anova.
6.3
Two Additional Self Directed Analytical Questions
What is the relationship between Admission Rates of UNIVERSITIES and COLLEGES and the share of female students in Ohio?
I find this quesiton interesting. I want to see if there is any correlation to lower or higher admission rates if the female share is higher.
I intend to answer this question by seperating institutions into two different subsets: those below the national average of for share of females and those above the average. Then, I will filter out Schools that are not in Ohio. After that I will filter out schools that are not colleges or universities. Then I will create a scatterplot using ggplot to see if there is any correlation to Admission rate and being above or below the national average for share of females as students.
1 ABOVE AVERAGE
## [1] 0.6401899
f <- 0.6401899
scorecard_female_high <- filter(scorecard, FEMALE > f)
scorecard_female_highOH <- filter(scorecard_female_high, STABBR == "OH")
scorecard_female_highOH_UNI <- filter(scorecard_female_highOH, College_or_university == "College" | College_or_university == "University" )
ggplot(scorecard_female_highOH_UNI, aes(x=INSTNM, y= ADM_RATE)) +
geom_point(stat= "identity", color = "orange") +
ggtitle("Admission Rate of Schools with a Share of Females Above Average", subtitle = "National Average Share of Females = 64%") +
labs(x= "Schools with Share of Females ABOVE Average" , y= "Admission Rate (%)") +
theme(axis.text.x = element_text(angle = 45, hjust=1))## Warning: Removed 48 rows containing missing values (geom_point).
BELOW AVERAGE
scorecard_female_low <-filter(scorecard, FEMALE < f)
scorecard_female_lowOH <- filter(scorecard_female_low, STABBR == "OH")
scorecard_female_lowOH_UNI <- filter(scorecard_female_lowOH, College_or_university == "College" | College_or_university == "University" )
ggplot(scorecard_female_lowOH_UNI, aes(x=INSTNM, y= ADM_RATE)) +
geom_point(stat= "identity", color = "magenta") +
ggtitle("Admission Rate of Schools with a Share of Females Below Average", subtitle = "National Average Share of Females = 64%") +
labs(x= "Schools with Share of Females BELOW Average" , y= "Admission Rate") +
theme(axis.text.x = element_text(angle = 45, hjust=1))## Warning: Removed 55 rows containing missing values (geom_point).
This is what I did. First, I found the national average for share of female students. it was right around 64%. than I assigned it the the value “f.” Next I created a subset only containing colleges and universities in Ohio where the share of female students is above the national average. Then, I created the same subset except for colleges and universities in Ohio below the national average for share of female students. I then created two scatterplots for each subset.
DATA ANALYZATION of ABOVE AVERAGE PLOT
From the data, the first thing you can see is that of the 65 institutions that have a Female Averge (FEMAVG for short) that is higher than the national average, 48 do not record their admission rate. I thought this was very interesting, thus did not remove the missing data.
The next thing you notice is that from the 17 schools that do record the admission rate, 11 of them have an acceptance rate above 70% or above. if you assume that the of the schools that do not record their admission rate follow the same pattern, you would assume that 42 schools would have an admission rate over 70%. that is 64.6% of all observations for schools with a share of females above average. WOW!
DATA ANALYZATION of BELOW AVERAGE PLOT
Tthe first big difference you see is that a much higher percentage of schools record their data compated to schools with an above average share of females. 48% of schools below average have a recorded admission rate, whereas only 26% of schools above average only have a recorded admission rate.
Tthe next thing you notice is that there are more schools recorded with lower admission rates. 7% of recorded admission rates are below 40%. this is a little more compared to the above average plot where only 6% of recorded admission rates are below 40%.
From this data we can assume that schools with a below average share of females are a little harder to get into. but only by 1%. So really, it’s not that different. Very very interesting!!
The method of inference I would use to privude further valudation would be ANOVA.
2
QUESTION: Does the national average age of entry for colleges and universities across Ohio correlate to average faculty salary in any way?
I think this is interesting because I would love if faculty are paid more or less for teaching younger or older students. I really don’t thing there is going to be a difference.
I intend to answer this question by seperating average age of entry into two different subsets: those below the national average age of entry and those above the average the national average. Then, I will filter out universties and colleges that are not in Ohio. Then I will create a scatterplot using ggplot to see if there is any correlation between the average age of entry and faculty salary.
BELOW AVERAGE
a <- mean(scorecard$AGE_ENTRY, na.rm = TRUE)
scorecard_age_low <- filter(scorecard, AGE_ENTRY < a)
scorecard_age_lowOH <- filter(scorecard_age_low, STABBR == "OH")
scorecard_age_lowOH_UNI <- filter(scorecard_age_lowOH, College_or_university == "College" | College_or_university == "University" )
ggplot(scorecard_age_lowOH_UNI, aes(x=INSTNM, y= AVGFACSAL)) +
geom_point(stat= "identity", color = "purple") +
ggtitle("Average Faculty Salary of Schools where Age of Entry is Below the National Average", subtitle = "National Average = 26") +
labs(x= "Schools with age of entry is BELOW Average" , y= "Average Faculty Salary (per Month)") +
theme(axis.text.x = element_text(angle = 45, hjust=1))## Warning: Removed 3 rows containing missing values (geom_point).
scorecard_age_high <- filter(scorecard, AGE_ENTRY > a)
scorecard_age_highOH <- filter(scorecard_age_high, STABBR == "OH")
scorecard_age_highOH_UNI <- filter(scorecard_age_highOH, College_or_university == "College" | College_or_university == "University" )
ggplot(scorecard_age_highOH_UNI, aes(x=INSTNM, y= AVGFACSAL)) +
geom_point(stat= "identity", color = "light blue") +
ggtitle("Average Faculty Salary of Schools where Age of Entry is Above the National Average", subtitle = "National Average = 26") +
labs(x= "Schools with age of entry is ABOVE Average" , y= "Average Faculty Salary (per Month)") +
theme(axis.text.x = element_text(angle = 45, hjust=1))## Warning: Removed 9 rows containing missing values (geom_point).
This is what I did.
First, I found the national average age of entry. It is 26. than I assigned it the the value “a.” Second, I created a subset that only contained universities and colleges in Ohio where the average age of entry was above the national average. I did this two times. one scappterplot for above average and one for below.
DATA ANALYZATION of BELOW AVERAGE PLOT
The first thing I noticed when looking at the above average plot was that most universities and colleges pay their faculty between 5000 and 10000 a year. of the 105 observations, 94 (89% of observations) are between $5000 and $10000. only 3 (2.8% of observations) data points are above $10,000.
DATA ANALYZATION of ABOVE AVERAGE PLOT
When looking at this scatterplot, you can see that the many of the dots have shifted downward. most schools seem to pay between $3750 and $6250. Of the 61 observable data points 37 are between $3750 and $6250. This is 60.7% of all observable points. Only 16.39% of observations are above $6250.
From this data we can assume that if you are a professor and want to make a lot more money, go work at a school where the students are much older. after seeing this data, It doesn’t surprise me. Grad schools, where students are usually older, are usually much more expensive than an undergradute degree. WOW!!!! The method of inference I would use to privude further valudation would be ANOVA.