Hello everyone, I have prepared a comprehensive analysis on the college scorecard table. The U.S. Department of Education collected this data. This includes information on colleges and universities such as: location, standardized test scores of incoming students, enrollment, cost of attendance, faculty salary, student demographics, among others.
There is a focus on post-secondary education in the state of Ohio. How will these schools compare to other states in the Midwest? How will they compare nationally? I will use a variety of ways to show that, through tables and visualizations.
Let’s get started!
Tidyverse is a versatile package used to read in and wrangle the data, as well as visualize it. Stringr is a package used to manipulate obsrevations that are character-based. Such as names of Institutions, or state abbreviations, among others. DT is a package we can use to make nice data tables.
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stringr)
library(DT)
The following data analysis is brought to you by Casey Donegan. The data is made available from this link: https://myxavier-my.sharepoint.com/:x:/g/personal/doneganm_xavier_edu/EVBhwdVcFFhGhZK5RxvpbvcBScE_krfax9UZoGaP-Fm3bw?download=1
We are loading the data directly from OneDrive using read_csv()
scorecard<- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/doneganm_xavier_edu/EVBhwdVcFFhGhZK5RxvpbvcBScE_krfax9UZoGaP-Fm3bw?download=1")
## Parsed with column specification:
## cols(
## .default = col_double(),
## INSTNM = col_character(),
## CITY = col_character(),
## STABBR = col_character(),
## ZIP = col_character(),
## CONTROL = col_character(),
## HBCU = col_character(),
## UGDS = col_character()
## )
## See spec(...) for full column specifications.
Before conducting analysis, I want to see if there is any missing data.
colSums(is.na(scorecard))
## ID INSTNM CITY STABBR ZIP CONTROL LOCALE LATITUDE
## 0 0 0 0 0 0 444 445
## LONGITUDE HBCU MENONLY WOMENONLY ADM_RATE ACTCM25 ACTCM75 ACTCMMID
## 445 0 444 444 5078 5823 5823 5823
## SAT_AVG UGDS COSTT4_A AVGFACSAL PCTPELL PCTFLOAN AGE_ENTRY FEMALE
## 5795 0 3531 2868 770 770 500 1429
## MARRIED DEPENDENT VETERAN FIRST_GEN FAMINC
## 1392 921 4538 1247 500
5823 schools do not have ACT score data (82% of them). Almost as many schools do not have SAT score data (81%). This makes sense because many schools do not require standardized testing for acceptance. One demographic that is missing a lot of information is veteran presence at schools. Plus, nearly half of the schools do not have cost of attendance available.
We will keep those nuggets of information in mind when conducting analysis.
There are some minor changes that need to be made to the data.
scorecard[scorecard == 'NULL'] <- NA
scorecard[scorecard == 'Public'] <- 1
scorecard[scorecard == 'Private nonprofit'] <- 2
scorecard[scorecard == 'Private for-profit'] <- 3
scorecard[scorecard == 'For profit'] <- 3
This helps straighten out some discrepancies in the data. There was not much uniformity in the data until these changes. High five!
Let’s create a new column that indicates whether a school’s average family income is greater than $54,021. That is the median house income for Ohio in 2019. Let’s make another column that indicates whether a school is a ‘University’ or ‘College’. How about a column that indicates whether a school is in a state that borders Ohio.
scorecard$UpperIncome<- ifelse(scorecard$FAMINC > 54021, "Yes", "No")
scorecard$CorU <- ifelse(str_detect(scorecard$INSTNM, "University"),
"University", "College/Other")
scorecard$border<- ifelse(scorecard$STABBR %in% c("KY","IN","MI","WV","PA"), "bordering OH", "not bordering OH")
Let’s make a column that indicates whether a school admits first-generation students at a higher rate than the median of all schools. A column that indicates whether a school’s average faculty salary is higher than the national median. How about a column that indicates whether a school’s students begin at an age greater than 30 years old on average.
scorecard$starter<- ifelse(scorecard$FIRST_GEN > median(scorecard$FIRST_GEN, na.rm = TRUE), "Yes", "No")
scorecard$ExpensiveFaculty<- ifelse(scorecard$AVGFACSAL > median(scorecard$AVGFACSAL, na.rm = TRUE), "Yes", "No")
scorecard$OlderStudents<- ifelse(scorecard$AGE_ENTRY > 30, "Yes", "No")
Let’s see this in a nice, neat table:
FinalScorecard<- data.frame(scorecard$INSTNM, scorecard$STABBR, scorecard$CONTROL, scorecard$LOCALE, scorecard$UGDS, scorecard$COSTT4_A, scorecard$AVGFACSAL, scorecard$PCTPELL, scorecard$AGE_ENTRY, scorecard$FIRST_GEN, scorecard$FAMINC)
datatable(FinalScorecard)
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
How neat!
I am going to show a few visualizations.
4.1: The first one will show the number of schools in the Ohio Valley (OH, PA, WV, MI, IN, KY).
states<- scorecard %>%
group_by(STABBR) %>%
summarize(count = n(),
AvgSAT = mean(SAT_AVG, na.rm = TRUE),
AvgIncome = mean(FAMINC, na.rm = TRUE),
MTuition = mean(COSTT4_A, na.rm = TRUE),
MIncome = mean(FAMINC, na.rm = TRUE))
OhioValley<- states %>%
filter(states$STABBR %in% c("OH","KY","IN","MI","WV","PA"))
ggplot(data = OhioValley,
aes(x = OhioValley$STABBR, y = OhioValley$count))+
geom_col(fill = "red")+
ggtitle("Number of Schools in the Ohio Valley")+
xlab("State")+
ylab("Number of Institutions")
As we can see: Ohio and Pennsylvania have the most schools in the Ohio Valley.
4.2: This next one shows the relationship between cost of attendance and family income. This includes all schools in the college scorecard table.
ggplot(data = scorecard,
aes(x = scorecard$COSTT4_A, y = scorecard$FAMINC))+
geom_point(alpha = .25)+
ggtitle("Family Income vs. Cost of Attendance")+
scale_x_continuous(name = "Cost of Attendance",
labels = scales::dollar)+
scale_y_continuous(name = "Family Income",
labels = scales::dollar)+
geom_smooth(method = "lm")
## Warning: Removed 3554 rows containing non-finite values (stat_smooth).
## Warning: Removed 3554 rows containing missing values (geom_point).
Family income and Cost of Attendance appear to have a positive, linear relationship.
4.3: The third visual shows enrollment across the three institution control types (public, private nonprofit, and private for-profit).
scorecard$UGDS<- as.numeric(scorecard$UGDS)
by_control<- scorecard %>%
group_by(CONTROL, starter) %>%
summarize(students = sum(UGDS, na.rm = TRUE),
AVGcost = mean(COSTT4_A, na.rm = TRUE),
count = n(), na.rm = TRUE)
by_control[by_control == '1']<- "Public"
by_control[by_control == '2']<- "Private nonprofit"
by_control[by_control == '3']<- "Private for-profit"
ggplot(data = by_control,
aes(x = CONTROL, y = students))+
geom_col(fill = "blue")+
ggtitle("Student Enrollment by Type of Institution")+
xlab("Type of Institution")+
ylab("Number of Undergraduates")+
scale_y_continuous(labels = scales::comma)
This does not come as a surprise. Public institutions have far more students than others as a total. Considering lower tuitions (generally), public schools are more popular.
4.4: The final two visuals will show how SAT scores and family income are related in the Ohio Valley.
neighbors<- scorecard %>%
filter(scorecard$STABBR %in% c("OH","KY","IN","MI","WV","PA"))
ggplot(data = neighbors,
aes(x = neighbors$SAT_AVG, y = neighbors$FAMINC))+
geom_point()+
facet_wrap(~ neighbors$STABBR)+
ggtitle("Family Income and SAT Scores by State")+
xlab("SAT Score")+
scale_y_continuous(name = "Family Income",
labels = scales::dollar)
## Warning: Removed 947 rows containing missing values (geom_point).
ggplot(data = neighbors,
aes(x = neighbors$SAT_AVG, y = neighbors$FAMINC, color = STABBR))+
geom_point(alpha = .5)+
ggtitle("Family Income and SAT Scores by State")+
xlab("SAT Score")+
scale_y_continuous(name = "Family Income",
labels = scales::dollar)+
scale_fill_discrete(name = "State", labels = c("OH","KY","IN","MI","PA","WV"))
## Warning: Removed 947 rows containing missing values (geom_point).
Here are two ways of looking at this relationship.
5.1: Do private schools cost more than public schools? This will be answered by a neat visualization that compares average cost between private and public.
ggplot(data = by_control,
aes(x = CONTROL, y = AVGcost))+
geom_col(fill = "red")+
ggtitle("Are Private Schools more Expensive than Public Schools?")+
xlab("Type of School")+
scale_y_continuous(name = "Average Tuition",
labels = scales::dollar)
As we can see, private schools are more expensive, on average. Public schools average out to a tuition of around 15,000 dollars. Private for-profit schools average around 26,000 dollars. Ironically, private nonprofit schools tend to be most expensive, averaging at about 39,000 dollars.
5.2: How does the average family income of students at Xavier University compare nationally? What about Ohio? I will create a table that compares Xavier’s average with the national average, as well as Ohio’s average.
Xavier<- scorecard %>%
filter(INSTNM == "Xavier University") %>%
select(INSTNM, FAMINC)
NatlFAMINC<- scorecard %>%
summarize(FAMINC = mean(FAMINC, na.rm = TRUE))
Ohio<- scorecard %>%
group_by(STABBR) %>%
filter(STABBR == "OH") %>%
summarize(FAMINC = mean(FAMINC, na.rm = TRUE))
bind_rows(Xavier, Ohio, NatlFAMINC)
## # A tibble: 3 x 3
## INSTNM FAMINC STABBR
## <chr> <dbl> <chr>
## 1 Xavier University 114330. <NA>
## 2 <NA> 42380. OH
## 3 <NA> 38483. <NA>
Xavier’s average family income is much higher than all schools of Ohio (about 42,380 dollars). Ohio’s average is higher than the nation’s (about 38,483 dollars).
5.3: How does the cost of attending an Ohio ‘university’ compare to universities in states that border Ohio? What about universities nationally? We will compare the national average with the averages of states in the Ohio Valley with a visualization:
NatlMed<- scorecard %>%
summarize(MTuition = mean(COSTT4_A, na.rm = TRUE),
count = n(),
AvgSAT = mean(SAT_AVG, na.rm = TRUE),
MIncome = mean(FAMINC, na.rm = TRUE))
OhioValley<- bind_rows(OhioValley, NatlMed)
ggplot(data = OhioValley,
aes(x = STABBR, y = MTuition))+
geom_col(fill = "turquoise")+
ggtitle("Cost of Attendance in the Ohio Valley")+
xlab("State")+
scale_y_continuous(name = "Average Cost of Attendance",
labels = scales::dollar)
Ohio’s, Kentucky’s and Michigan’s tuitions are close to the national median. West Virginia’s are noticeably lower than the nation’s. Whereas Indiana’s and Pennsylvania’s are noticeably higher than the nation’s.
5.4: What schools have the highest and lowest percentage of undergraduate students receiving a Pell grant? I will compose a table that contains the top schools of Pell recipients paired with schools that have the least amount of Pell recipients.
LowPell<- scorecard %>%
arrange(PCTPELL) %>%
select(INSTNM, PCTPELL) %>%
head()
HighPell<- scorecard %>%
arrange(desc(PCTPELL)) %>%
select(INSTNM, PCTPELL) %>%
head()
bind_rows(HighPell, LowPell)
## # A tibble: 12 x 2
## INSTNM PCTPELL
## <chr> <dbl>
## 1 MTI Business College Inc 1
## 2 Mr Bela's School of Cosmetology Inc 1
## 3 Southern School of Beauty Inc 1
## 4 Victoria Beauty College Inc 1
## 5 Central School of Practical Nursing 1
## 6 Virginia School of Hair Design 1
## 7 Bais Binyomin Academy 0
## 8 United States Coast Guard Academy 0
## 9 American Islamic College 0
## 10 Principia College 0
## 11 The Southern Baptist Theological Seminary 0
## 12 New Orleans Baptist Theological Seminary 0
The first six schools listed have student bodies completely composed of Pell grant recipients. The last six schools listed do not have ANY students that received the Pell grant.
There are a few questions I would like to answer as we continue.
6.1: Can we explain cost of attendance at schools? Let’s explore to see if enrollment, percentage of students receiving the Pell grant, average faculty salary, and average family income predict cost of attendance. We will find out by running a multiple linear regression.
LinearModel<- scorecard %>%
select(PCTPELL, AVGFACSAL, FAMINC, COSTT4_A, UGDS)
linfit<- lm(LinearModel$COSTT4_A ~ LinearModel$UGDS + LinearModel$PCTPELL + LinearModel$AVGFACSAL + LinearModel$FAMINC)
linfit
##
## Call:
## lm(formula = LinearModel$COSTT4_A ~ LinearModel$UGDS + LinearModel$PCTPELL +
## LinearModel$AVGFACSAL + LinearModel$FAMINC)
##
## Coefficients:
## (Intercept) LinearModel$UGDS LinearModel$PCTPELL
## -9478.5942 -0.4724 19953.2002
## LinearModel$AVGFACSAL LinearModel$FAMINC
## 1.1334 0.4595
summary(linfit)
##
## Call:
## lm(formula = LinearModel$COSTT4_A ~ LinearModel$UGDS + LinearModel$PCTPELL +
## LinearModel$AVGFACSAL + LinearModel$FAMINC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37877 -6538 -1028 6367 78865
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.479e+03 9.979e+02 -9.499 <2e-16 ***
## LinearModel$UGDS -4.724e-01 2.769e-02 -17.060 <2e-16 ***
## LinearModel$PCTPELL 1.995e+04 1.140e+03 17.507 <2e-16 ***
## LinearModel$AVGFACSAL 1.133e+00 9.799e-02 11.567 <2e-16 ***
## LinearModel$FAMINC 4.595e-01 8.583e-03 53.540 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9696 on 3397 degrees of freedom
## (3713 observations deleted due to missingness)
## Multiple R-squared: 0.5961, Adjusted R-squared: 0.5957
## F-statistic: 1254 on 4 and 3397 DF, p-value: < 2.2e-16
Our results from the regression are statistically significant! This is proved by the p-value less than 2.2 * 10e-16. Nearly 60% of variation in cost of attendance can be explained by the independent variables (enrollment, Pell grant recipients, faculty salary, family income).
Here is our regression equation: cost of attendance = -9478.5942 - 0.4724(enrollment) + 19953.2002(% Pell recipients) + 1.1334(faculty salary) + 0.4595(family income)
Key takeaways from the equation: As enrollment increases, cost of attendance decreases (generally). As all other variables increase, cost of attendance also increases (generally).
6.2: Do student populations differ by locale? Should we expect more students in heavily urbanized areas? Or more rural areas?
by_locale<- scorecard %>%
group_by(LOCALE) %>%
filter(LOCALE %in% c("11", "12", "13", "21", "22", "23", "31",
"32", "33", "41", "42", "43")) %>%
summarize(AvgEnrollment = mean(UGDS, na.rm = TRUE))
by_locale[by_locale == '11']<- "Large City"
by_locale[by_locale == '12']<- "Midsize City"
by_locale[by_locale == '13']<- "Small City"
by_locale[by_locale == '21']<- "Large Suburb"
by_locale[by_locale == '22']<- "Midsize Suburb"
by_locale[by_locale == '23']<- "Small Suburb"
by_locale[by_locale == '31']<- "Fringe Town"
by_locale[by_locale == '32']<- "Distant Town"
by_locale[by_locale == '33']<- "Remote Town"
by_locale[by_locale == '41']<- "Fringe Rural"
by_locale[by_locale == '42']<- "Distant Rural"
by_locale[by_locale == '43']<- "Remote Rural"
ggplot(data = by_locale,
aes(x = LOCALE, y = by_locale$AvgEnrollment))+
geom_col(fill = "orange")+
ggtitle("Average Enrollment by Locale")+
xlab("Locale")+
scale_y_continuous(name = "Average Enrollment",
labels = scales::comma)+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Schools in rural areas tend to have smaller student populations. While student populations at schools in cities tend to have more than others.Schools from a “midsize city” are the only locale to average more than 3,000 students per school. To test this, I would consider a logistic regression to try and predict enrollment.
6.3A: Do first-generation students tend to go to public school? Considering that cost of attendance at public schools tends to be lower, would it make sense that it would cause this behavior?
FirstGen<- scorecard %>%
group_by(starter, CONTROL) %>%
filter(starter == "Yes") %>%
summarize(students = n())
FirstGen[FirstGen == '1']<- "Public"
FirstGen[FirstGen == '2']<- "Private nonprofit"
FirstGen[FirstGen == '3']<- "Private for-profit"
ggplot(data = FirstGen,
aes(x = CONTROL, y = students))+
geom_col(fill = "green")+
ggtitle("First-Generation Enrollment by Type of School")+
xlab("Type of School")+
scale_y_continuous(name = "Enrollment",
labels = scales::comma)
First-generation students tend to go to private for-profit schools more than anyone else. According to this data, the number of first-generation students at private schools more than doubles those at public schools.
6.3B: Is there a relationship between faculty salary and family income? Do established schools get students from families with high incomes?
I will test to see if there is a positive relationship between the two variables.
ggplot(data = scorecard,
aes(x = AVGFACSAL, y = FAMINC))+
geom_point(alpha = .5)+
ggtitle("Faculty Salary vs Family Income",
subtitle = "College Scorecard Data collected by U.S. Dept. of Education")+
scale_x_continuous(name = "Faculty Salary",
labels = scales::dollar)+
scale_y_continuous(name = "Family Income",
labels = scales::dollar)+
geom_smooth(method = "lm")
## Warning: Removed 3169 rows containing non-finite values (stat_smooth).
## Warning: Removed 3169 rows containing missing values (geom_point).
There appears to be a positive relationship between faculty salary and family income. I was curious to see if institutions that take in students with well-off families, have well-paid faculties. When money is not of issue, families are more likely to spend the money to send their children to a school who pays their faculty well. When a school has a higher-end faculty salary, one could assume the school has a great reputation.
Thank you for taking the time to check out my presentation! Now you know much more about the college scorecard than you could’ve ever imagined. You got some conversational insights for the dinner table and you can sound like a smarty pants (maybe).