The U.S. Department of Education publishes a lot of data on universities’ enrollment demographics in its “College Scorecard” dataset. Of particular interest to me were the income and major-choice data. I wanted to know: “Is family income level predictive of a student’s major choice?”
TL;DR: Sort of? At least the way I went about doing it.
The data is available free of charge from the U.S. Department of Education website. It’s nearly half a GB and has 1,700+ columns, so make sure you download the data dictionary as well if you’re going to work with this dataset.
We’ll look solely at public institutions, for no other reason than I went to one.
student_data <- read.csv("~/Documents/LevelEdu/student_data.csv", header=TRUE, stringsAsFactors=FALSE)
#We'll look solely at public institutions
pub_institutions <- student_data[student_data$CONTROL == 1, ]
#Include the columns we're interested in: major choice percentages, and the family incomes of dependent students
#See the data dictionary if confused
pub_institutions <- pub_institutions[, c(1:5, 60:98, 321:325, which(colnames(pub_institutions)=="DEP_INC_AVG"), which(colnames(pub_institutions)=="IND_INC_AVG"))]
#Missing values in these columns are stored as the literal string "NULL"
#Column 6 is the first major-share column (PCIP01); column 45 is the first income-bracket count
pub_institutions <- pub_institutions[pub_institutions[,6] != "NULL" & pub_institutions[,45] != "NULL", ]
pub_institutions <- pub_institutions[complete.cases(pub_institutions$DEP_INC_AVG), ]
pub_institutions <- pub_institutions[pub_institutions$DEP_INC_AVG != "PrivacySuppressed", ]
#Change a few of the ambiguously named columns to a more meaningful name
names(pub_institutions)[45:49] <- c("income_30", "income_48", "income_75", "income_110", "income_110plus")
head(pub_institutions)
## UNITID OPEID opeid6 INSTNM CITY
## 1 100654 100200 1002 Alabama A & M University Normal
## 2 100663 105200 1052 University of Alabama at Birmingham Birmingham
## 4 100706 105500 1055 University of Alabama in Huntsville Huntsville
## 5 100724 100500 1005 Alabama State University Montgomery
## 6 100751 105100 1051 The University of Alabama Tuscaloosa
## 7 100760 100700 1007 Central Alabama Community College Alexander City
## PCIP01 PCIP03 PCIP04 PCIP05 PCIP09 PCIP10 PCIP11 PCIP12 PCIP13 PCIP14
## 1 0.0397 0.0199 0.0116 0 0 0.0348 0.0348 0 0.149 0.1175
## 2 0 0 0 0.0018 0.0456 0 0.0099 0 0.0862 0.0632
## 4 0 0 0 0 0.0318 0 0.0273 0 0.0173 0.2566
## 5 0 0 0 0 0.0733 0 0.045 0 0.215 0
## 6 0 0.0054 0 0.0022 0.1084 0 0.0068 0 0.084 0.064
## 7 0 0 0 0 0 0 0.0186 0 0 0
## PCIP15 PCIP16 PCIP19 PCIP22 PCIP23 PCIP24 PCIP25 PCIP26 PCIP27 PCIP29
## 1 0.0348 0 0.0281 0 0.0182 0.0546 0 0.1026 0.0199 0
## 2 0 0.009 0 0 0.0203 0.0262 0 0.0619 0.0135 0
## 4 0 0.0173 0 0 0.0309 0 0 0.0855 0.0218 0
## 5 0 0 0 0 0.0183 0 0 0.1033 0.0183 0
## 6 0 0.0068 0.07 0 0.0178 0 0 0.0348 0.0076 0
## 7 0.0669 0 0 0 0 0.4833 0 0 0 0
## PCIP30 PCIP31 PCIP38 PCIP39 PCIP40 PCIP41 PCIP42 PCIP43 PCIP44 PCIP45
## 1 0 0 0 0 0.0248 0 0.0579 0.005 0.0364 0.048
## 2 0 0 0.0095 0 0.0181 0 0.084 0.028 0.0244 0.0501
## 4 0 0 0.0082 0 0.0209 0 0.0218 0 0 0.0173
## 5 0 0.0183 0 0 0.015 0 0.0617 0.1183 0.065 0.015
## 6 0.0302 0 0.006 0 0.0074 0 0.0354 0.0216 0.0124 0.0422
## 7 0.0372 0 0 0 0 0 0 0 0 0
## PCIP46 PCIP47 PCIP48 PCIP49 PCIP50 PCIP51 PCIP52 PCIP54 CIP01CERT1
## 1 0 0 0 0 0.0166 0 0.1457 0 0
## 2 0 0 0 0 0.0415 0.209 0.1765 0.0212 0
## 4 0 0 0 0 0.0346 0.172 0.2247 0.0118 0
## 5 0 0 0 0 0.0567 0.0633 0.1067 0.0067 0
## 6 0 0 0 0 0.036 0.0946 0.287 0.0194 0
## 7 0 0.0558 0.0297 0 0 0.2045 0.1041 0 0
## income_30 income_48 income_75 income_110 income_110plus DEP_INC_AVG
## 1 398 101 65 28 15 33054.68926
## 2 311 164 128 150 140 59852.54783
## 4 97 52 48 51 46 63370.50758
## 5 638 135 63 40 13 32377.76273
## 6 399 196 235 260 350 95103.51296
## 7 330 21 12 4 0 31503.4807
## IND_INC_AVG
## 1 PrivacySuppressed
## 2 PrivacySuppressed
## 4 PrivacySuppressed
## 5 PrivacySuppressed
## 6 PrivacySuppressed
## 7 PrivacySuppressed
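As an aside, the sentinel strings filtered out above can also be turned into NA when the file is read, which lets R parse the affected columns as numeric right away. The sketch below uses a tiny made-up inline CSV (not the real Scorecard file) and the `na.strings` argument of `read.csv`; the values in it are invented for illustration.

```r
# Toy stand-in for the Scorecard file; the values are invented.
csv_text <- "INSTNM,PCIP01,DEP_INC_AVG
School A,0.0397,33054.68
School B,NULL,59852.54
School C,0.0054,PrivacySuppressed"

# Treat both sentinel strings (plus the default "NA") as missing at read time.
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                na.strings = c("NA", "NULL", "PrivacySuppressed"))

# Both data columns now parse as numeric, so ordinary NA handling applies.
str(toy)

# Keep only rows where both columns of interest are present.
toy_clean <- toy[complete.cases(toy[, c("PCIP01", "DEP_INC_AVG")]), ]
toy_clean
```

With this approach the later `as.numeric()` conversions would no longer be needed, at the cost of losing the distinction between "NULL" and "PrivacySuppressed".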
Now, the columns PCIP01 through PCIP54 in our data frame hold the share of each major at each individual school, but these names are again not meaningful, so let’s change them to their counterparts in the data dictionary. I did this by hand because there weren’t that many and it was easy to copy and paste, but it could certainly be done programmatically.
While we’re at it, we’re going to group the majors into broader categories: Vocational, STEM, Humanities, Social Sciences, and Other. We’re aggregating like this because there isn’t quite enough data in every column to do what we’re trying to do.
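For the programmatic route, one possible sketch is a named lookup vector that maps the PCIP codes to readable labels. The three-entry `lookup` below is a hypothetical excerpt typed by hand, and `toy` is a made-up stand-in for `pub_institutions`; in practice the lookup would be built from the data dictionary file, whose layout is not assumed here.

```r
# Hypothetical excerpt of a code-to-label lookup (the full version would be
# derived from the data dictionary).
lookup <- c(PCIP01 = "agriculture", PCIP03 = "resources", PCIP04 = "architecture")

# Toy data frame standing in for pub_institutions.
toy <- data.frame(UNITID = c(100654, 100663),
                  PCIP01 = c(0.0397, 0),
                  PCIP03 = c(0.0199, 0),
                  PCIP04 = c(0.0116, 0))

# Rename only the columns found in the lookup; leave the rest untouched.
hits <- names(toy) %in% names(lookup)
names(toy)[hits] <- unname(lookup[names(toy)[hits]])
names(toy)
```

Matching by name rather than by position means the rename still works if the column order changes.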
majors = "agriculture,resources,architecture,ethnic_cultural_gender,communication,communications_technology,computer,personal_culinary,education,engineering,engineering_technology,language,family_consumer_science,legal,english,humanities,library,biological,mathematics,military,multidiscipline,parks_recreation_fitness,philosophy_religious,theology_religious_vocation,physical_science,science_technology,psychology,security_law_enforcement,public_administration_social_service,social_science,construction,mechanic_repair_technology,precision_production,transportation,visual_performing,health,business_marketing,history"
majors <- strsplit(majors, ",")[[1]]
names(pub_institutions)[6:43] <- majors
VOCATION <- c("construction", "personal_culinary", "parks_recreation_fitness",
              "mechanic_repair_technology", "theology_religious_vocation",
              "precision_production", "security_law_enforcement", "transportation")
STEM <- c("agriculture", "computer", "engineering", "mathematics", "physical_science",
          "engineering_technology", "science_technology", "architecture", "biological",
          "health")
HUMAN <- c("ethnic_cultural_gender", "humanities", "communication", "library",
           "philosophy_religious", "visual_performing",
           "education", "english", "communications_technology")
#I really couldn't decide where "legal" should go, so I put it in social sciences, but it could easily end up in STEM or OTHER
SOC <- c("social_science", "family_consumer_science",
         "public_administration_social_service", "history",
         "language", "psychology", "legal")
OTHER <- setdiff(majors, c(STEM, HUMAN, VOCATION, SOC))
Now that we have some broad major categories, let’s find out what the majority of majors…er…the major major…the major with the most enrollments at each school.
#Create the column up front so each row's assignment is well-defined
pub_institutions$major_type <- NA_character_
for (i in seq_len(nrow(pub_institutions))){
  #Sum the shares in each category for school i (the shares were read in as text)
  stem_p <- sum(as.numeric(as.character(pub_institutions[i,colnames(pub_institutions) %in% STEM])))
  human_p <- sum(as.numeric(as.character(pub_institutions[i,colnames(pub_institutions) %in% HUMAN])))
  vocation_p <- sum(as.numeric(as.character(pub_institutions[i,colnames(pub_institutions) %in% VOCATION])))
  soc_p <- sum(as.numeric(as.character(pub_institutions[i,colnames(pub_institutions) %in% SOC])))
  other_p <- sum(as.numeric(as.character(pub_institutions[i,colnames(pub_institutions) %in% OTHER])))
  #Label the school with whichever category has the largest total share
  pub_institutions$major_type[i] <- c("STEM","HUMAN","VOCATION","SOC","OTHER")[which.max(c(stem_p,human_p,vocation_p,soc_p,other_p))]
}
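The same per-school tally can also be sketched without an explicit loop. The version below runs on a small made-up data frame with a cut-down category list (both are illustrative stand-ins, not the real data), summing each category's columns with `rowSums` and picking the largest with `max.col`.

```r
# Toy major shares for three schools, stored as text like the real columns.
# All values are invented.
toy <- data.frame(
  computer     = c("0.30", "0.05", "0.10"),
  engineering  = c("0.20", "0.05", "0.10"),
  english      = c("0.10", "0.50", "0.10"),
  construction = c("0.05", "0.10", "0.60"),
  stringsAsFactors = FALSE
)

# Cut-down category definitions covering only the toy columns.
groups <- list(STEM = c("computer", "engineering"),
               HUMAN = "english",
               VOCATION = "construction")

# Convert the text shares to a numeric matrix once.
shares <- sapply(toy, as.numeric)

# Build one column of summed shares per category.
totals <- sapply(groups, function(cols) rowSums(shares[, cols, drop = FALSE]))

# Pick the largest category per row; ties go to the first, as with which.max.
toy$major_type <- colnames(totals)[max.col(totals, ties.method = "first")]
toy$major_type
```

One difference to keep in mind: `which.max` skips NA totals, while `max.col` does not, so missing shares would need to be handled before this step on real data.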
Great, now we have what amounts to a list of schools, the most common major type at that school, and the average income of the families at that school. Sounds like that’s enough for an ANOVA test! But first let’s do a bit of plotting and see if our intuitions about major types hold.
library(rbokeh)
p <- figure(title="Family Income level by Type of Major") %>%
  ly_boxplot(factor(pub_institutions$major_type), as.numeric(pub_institutions$DEP_INC_AVG),
             color="blue", xlab="Type of Major", ylab="Income in 2014 Dollars")
p
Huh. That’s interesting: it looks like STEM, OTHER, and SOC majors come from the wealthiest families, while VOCATION and HUMAN (humanities) majors come from poorer families (at least, in this dataset). Let’s look at another way to show this difference in family income, using a one-way ANOVA test and a TukeyHSD plot.
#DEP_INC_AVG was read in as text, so convert it to numeric before fitting
fit_aov <- aov(as.numeric(pub_institutions$DEP_INC_AVG) ~ pub_institutions$major_type)
tuk <- TukeyHSD(fit_aov)
par(mar=c(5,10,2,2))
plot(tuk, las=1)
When reading a Tukey plot like this, we only care about the intervals that do not cross the dotted line at 0, since those are the pairwise differences that are statistically significant.
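The same reading can be done on the numbers behind the plot. `TukeyHSD` returns one matrix per model term with the columns `diff`, `lwr`, `upr`, and `p adj`. The sketch below builds a small simulated data set (the group means, spread, and sample sizes are made up) and keeps only the pairs whose interval excludes zero.

```r
set.seed(1)

# Simulated average incomes for three school types; all numbers are invented.
toy <- data.frame(
  major_type = factor(rep(c("HUMAN", "STEM", "VOCATION"), each = 30)),
  income = c(rnorm(30, mean = 50000, sd = 8000),
             rnorm(30, mean = 65000, sd = 8000),
             rnorm(30, mean = 35000, sd = 8000))
)

fit <- aov(income ~ major_type, data = toy)
tuk_toy <- TukeyHSD(fit)

# One row per pair of groups: difference in means, interval bounds, adjusted p.
pair_tbl <- as.data.frame(tuk_toy$major_type)

# An interval excludes zero when both of its bounds have the same sign.
pair_tbl[pair_tbl$lwr > 0 | pair_tbl$upr < 0, ]
```

Filtering on the interval bounds matches what the eye does on the plot; filtering on the `p adj` column at the matching significance level should select the same pairs.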
With this plot we can see that the SOC-HUMAN interval lies on the positive side of the 0 line. In other words, social science majors tend to come from higher-income families than humanities majors. Or rather, schools where social science majors are the largest group have students that come from higher-income families than schools where humanities majors are the largest group. Likewise with STEM majors.
And, just as with the boxplot, schools dominated by vocational majors consistently show lower family incomes when compared with other schools.
Now, this is a pretty surface-level analysis, and I’ve made a number of assumptions that might not hold up in the real world. It is interesting, though, to see that some intuitions about income level and major choice hold when you look at the data. What do you think? Am I way off the mark? Is this analysis bunk? Leave me a comment or hit me up on Twitter.