The Question: Do kids from low-income and high-income families choose different majors?

The U.S. Department of Education publishes a lot of data on universities’ enrollment demographics in its “College Scorecard” dataset. Of particular interest to me were the income and major-choice data. I wanted to know: “Is family income level predictive of a student’s major choice?”

TL;DR: Sort of? At least the way I went about doing it.

Loading and Subsetting

The data is available from the US DoEd website free of charge. It’s nearly half a GB and has 1,700+ columns, so make sure you download the data dictionary as well if you’re going to work with this dataset.

We’ll look solely at public institutions, for no other reason than I went to one.

student_data <- read.csv("~/Documents/LevelEdu/student_data.csv", header=TRUE, stringsAsFactors=FALSE)

#We'll look solely at public institutions
pub_institutions <- student_data[student_data$CONTROL == 1, ]

#Include the columns we're interested in: major choice percentages, and the family incomes of dependent students
#See the data dictionary if confused
income_cols <- match(c("DEP_INC_AVG", "IND_INC_AVG"), colnames(pub_institutions))
pub_institutions <- pub_institutions[, c(1:5, 60:98, 321:325, income_cols)]

#Missing values in these columns are stored as the literal string "NULL", not as NA
pub_institutions <- pub_institutions[pub_institutions[,6] != "NULL" & pub_institutions[,45]!="NULL", ]
pub_institutions <- pub_institutions[complete.cases(pub_institutions$DEP_INC_AVG), ]
pub_institutions <- pub_institutions[pub_institutions$DEP_INC_AVG != "PrivacySuppressed", ]

#Change a few of the ambiguously named columns to a more meaningful name
names(pub_institutions)[45:49] <- c("income_30", "income_48", "income_75", "income_110", "income_110plus")

head(pub_institutions)
##   UNITID  OPEID opeid6                              INSTNM           CITY
## 1 100654 100200   1002            Alabama A & M University         Normal
## 2 100663 105200   1052 University of Alabama at Birmingham     Birmingham
## 4 100706 105500   1055 University of Alabama in Huntsville     Huntsville
## 5 100724 100500   1005            Alabama State University     Montgomery
## 6 100751 105100   1051           The University of Alabama     Tuscaloosa
## 7 100760 100700   1007   Central Alabama Community College Alexander City
##   PCIP01 PCIP03 PCIP04 PCIP05 PCIP09 PCIP10 PCIP11 PCIP12 PCIP13 PCIP14
## 1 0.0397 0.0199 0.0116      0      0 0.0348 0.0348      0  0.149 0.1175
## 2      0      0      0 0.0018 0.0456      0 0.0099      0 0.0862 0.0632
## 4      0      0      0      0 0.0318      0 0.0273      0 0.0173 0.2566
## 5      0      0      0      0 0.0733      0  0.045      0  0.215      0
## 6      0 0.0054      0 0.0022 0.1084      0 0.0068      0  0.084  0.064
## 7      0      0      0      0      0      0 0.0186      0      0      0
##   PCIP15 PCIP16 PCIP19 PCIP22 PCIP23 PCIP24 PCIP25 PCIP26 PCIP27 PCIP29
## 1 0.0348      0 0.0281      0 0.0182 0.0546      0 0.1026 0.0199      0
## 2      0  0.009      0      0 0.0203 0.0262      0 0.0619 0.0135      0
## 4      0 0.0173      0      0 0.0309      0      0 0.0855 0.0218      0
## 5      0      0      0      0 0.0183      0      0 0.1033 0.0183      0
## 6      0 0.0068   0.07      0 0.0178      0      0 0.0348 0.0076      0
## 7 0.0669      0      0      0      0 0.4833      0      0      0      0
##   PCIP30 PCIP31 PCIP38 PCIP39 PCIP40 PCIP41 PCIP42 PCIP43 PCIP44 PCIP45
## 1      0      0      0      0 0.0248      0 0.0579  0.005 0.0364  0.048
## 2      0      0 0.0095      0 0.0181      0  0.084  0.028 0.0244 0.0501
## 4      0      0 0.0082      0 0.0209      0 0.0218      0      0 0.0173
## 5      0 0.0183      0      0  0.015      0 0.0617 0.1183  0.065  0.015
## 6 0.0302      0  0.006      0 0.0074      0 0.0354 0.0216 0.0124 0.0422
## 7 0.0372      0      0      0      0      0      0      0      0      0
##   PCIP46 PCIP47 PCIP48 PCIP49 PCIP50 PCIP51 PCIP52 PCIP54 CIP01CERT1
## 1      0      0      0      0 0.0166      0 0.1457      0          0
## 2      0      0      0      0 0.0415  0.209 0.1765 0.0212          0
## 4      0      0      0      0 0.0346  0.172 0.2247 0.0118          0
## 5      0      0      0      0 0.0567 0.0633 0.1067 0.0067          0
## 6      0      0      0      0  0.036 0.0946  0.287 0.0194          0
## 7      0 0.0558 0.0297      0      0 0.2045 0.1041      0          0
##   income_30 income_48 income_75 income_110 income_110plus DEP_INC_AVG
## 1       398       101        65         28             15 33054.68926
## 2       311       164       128        150            140 59852.54783
## 4        97        52        48         51             46 63370.50758
## 5       638       135        63         40             13 32377.76273
## 6       399       196       235        260            350 95103.51296
## 7       330        21        12          4              0  31503.4807
##         IND_INC_AVG
## 1 PrivacySuppressed
## 2 PrivacySuppressed
## 4 PrivacySuppressed
## 5 PrivacySuppressed
## 6 PrivacySuppressed
## 7 PrivacySuppressed
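
The placeholder strings handled above can also be turned into real missing values at read time. Here is a minimal sketch, assuming a tiny made-up CSV (the three rows below are illustrative, not taken from the Scorecard file); `na.strings` is a standard `read.csv` argument.

```r
# Toy CSV standing in for the real file; the rows are invented for illustration.
csv_text <- "INSTNM,PCIP01,DEP_INC_AVG
School A,0.0397,33054.68926
School B,NULL,59852.54783
School C,0,PrivacySuppressed"

# na.strings lists the literal strings that should be read as NA,
# so the affected columns are parsed as numeric instead of character.
toy <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE,
                na.strings = c("NA", "NULL", "PrivacySuppressed"))

str(toy)

# Keep only the rows where both the major share and the income are known.
toy_clean <- toy[complete.cases(toy[, c("PCIP01", "DEP_INC_AVG")]), ]
toy_clean
```

With the columns read as numeric, later steps would not need `as.numeric()` conversions.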

Now, we have some information about the percentage of chosen majors at each individual school in our data frame in the columns PCIP01 through PCIP54, but these names are again not meaningful so let’s change them to their counterparts in the data dictionary. I did this by hand because there weren’t that many and it was easy to copy and paste, but it could certainly be done programmatically.

While we’re at it, we’re going to group the majors into broader categories: Vocational, STEM, Humanities, Social Sciences, and Other. We’re aggregating like this because many of the individual major columns are too sparse for the comparison we’re trying to make.

majors = "agriculture,resources,architecture,ethnic_cultural_gender,communication,communications_technology,computer,personal_culinary,education,engineering,engineering_technology,language,family_consumer_science,legal,english,humanities,library,biological,mathematics,military,multidiscipline,parks_recreation_fitness,philosophy_religious,theology_religious_vocation,physical_science,science_technology,psychology,security_law_enforcement,public_administration_social_service,social_science,construction,mechanic_repair_technology,precision_production,transportation,visual_performing,health,business_marketing,history"


majors <- strsplit(majors, ",")[[1]]
names(pub_institutions)[6:43] <- majors
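
For anyone who would rather not rely on column positions, the rename could be driven by a lookup instead. This is only a sketch: `pcip_lookup` is a hypothetical named vector with three entries (matching the first three names in the list above), and the toy data frame is made up; a full lookup would have to be built from the data dictionary.

```r
# Hypothetical lookup from Scorecard column codes to readable names.
# Only three entries are shown here.
pcip_lookup <- c(PCIP01 = "agriculture",
                 PCIP03 = "resources",
                 PCIP04 = "architecture")

# Toy data frame with invented values.
toy <- data.frame(UNITID = 1:2,
                  PCIP01 = c(0.04, 0),
                  PCIP03 = c(0.02, 0),
                  PCIP04 = c(0.01, 0))

# Rename only the columns found in the lookup; everything else is left alone.
hits <- names(toy) %in% names(pcip_lookup)
names(toy)[hits] <- unname(pcip_lookup[names(toy)[hits]])

names(toy)
```

Matching by name means the rename still works if the column order changes.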

VOCATION <- c("construction", "personal_culinary", "parks_recreation_fitness", 
              "mechanic_repair_technology", "theology_religious_vocation", 
              "precision_production", "security_law_enforcement", "transportation")


STEM <- c("agriculture", "computer", "engineering", "mathematics", "physical_science", 
          "engineering_technology", "science_technology", "architecture", "biological", 
          "health")


HUMAN <- c("ethnic_cultural_gender", "humanities", "communication", "library", 
           "philosophy_religious",  "visual_performing", 
           "education", "english", "communications_technology")

#I really couldn't decide where "legal" should go, so I put it in social sciences, but it could easily end up in STEM or OTHER

SOC <- c("social_science", "family_consumer_science", 
         "public_administration_social_service", "history", 
         "language", "psychology", "legal")

OTHER <- setdiff(majors, c(STEM, HUMAN, VOCATION, SOC))

Now that we have some broad major categories, let’s find out what the majority of majors…er…the major major…the major with the most enrollments at each school.

#Label each school with the category that has the largest combined share of majors
pub_institutions$major_type <- NA_character_
for (i in seq_len(nrow(pub_institutions))){
  #The shares are stored as character, so convert before summing
  stem_p <- sum(as.numeric(as.character(pub_institutions[i,colnames(pub_institutions) %in% STEM])))
  human_p <- sum(as.numeric(as.character(pub_institutions[i,colnames(pub_institutions) %in% HUMAN])))
  vocation_p <- sum(as.numeric(as.character(pub_institutions[i,colnames(pub_institutions) %in% VOCATION])))
  soc_p <- sum(as.numeric(as.character(pub_institutions[i,colnames(pub_institutions) %in% SOC])))
  other_p <- sum(as.numeric(as.character(pub_institutions[i,colnames(pub_institutions) %in% OTHER])))
  
  #which.max returns the position of the largest total (the first one on ties)
  pub_institutions$major_type[i] <- c("STEM","HUMAN","VOCATION","SOC","OTHER")[which.max(c(stem_p,human_p,vocation_p,soc_p,other_p))]
}
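
The same labelling can be done without a row-by-row loop. Below is a rough sketch on a made-up three-school data frame with only three categories; the column values and the `groups` list are illustrative, and it assumes none of the shares are missing.

```r
# Toy data: three schools, shares stored as character like the real columns.
toy <- data.frame(
  computer     = c("0.30", "0.05", "0.10"),
  engineering  = c("0.25", "0.05", "0.10"),
  english      = c("0.10", "0.50", "0.10"),
  construction = c("0.05", "0.10", "0.60"),
  stringsAsFactors = FALSE
)

# Illustrative category definitions (a cut-down version of the real ones).
groups <- list(STEM = c("computer", "engineering"),
               HUMAN = "english",
               VOCATION = "construction")

# For each category, add up its columns element-wise: one total per school.
group_totals <- do.call(cbind, lapply(groups, function(cols) {
  Reduce(`+`, lapply(toy[cols], as.numeric))
}))

# max.col gives the column index of each row's largest value;
# ties.method = "first" matches the tie behaviour of which.max.
toy$major_type <- colnames(group_totals)[max.col(group_totals, ties.method = "first")]

toy$major_type
```

On the full dataset this avoids subsetting the data frame once per row, which is where the loop spends most of its time.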

Plotting and Basic Analysis

Great, now we have what amounts to a list of schools, the most common major type at that school, and the average income of the families at that school. Sounds like that’s enough for an ANOVA test! But first let’s do a bit of plotting and see if our intuitions about major types hold.

library(rbokeh)

p <- figure(title="Family Income level by Type of Major") %>%
  ly_boxplot(factor(pub_institutions$major_type), as.numeric(pub_institutions$DEP_INC_AVG),
             color="blue", xlab="Type of Major", ylab="Income in 2014 Dollars")
p
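
A numeric companion to the boxplot is the median income per major type. The sketch below uses six invented schools, so the numbers mean nothing; it only shows the shape of the calculation with base R's `tapply`.

```r
# Toy data: invented incomes, stored as character like DEP_INC_AVG.
toy <- data.frame(
  major_type  = c("STEM", "STEM", "HUMAN", "HUMAN", "VOCATION", "VOCATION"),
  DEP_INC_AVG = c("90000", "70000", "50000", "40000", "30000", "32000"),
  stringsAsFactors = FALSE
)

# Median income within each major type, after converting to numeric.
medians <- tapply(as.numeric(toy$DEP_INC_AVG), toy$major_type, median)

# Highest median first.
sort(medians, decreasing = TRUE)
```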

Huh. That’s interesting: it looks like STEM, OTHER, and SOC majors come from the wealthiest families, while VOCATION and HUMAN majors come from poorer families (at least, in this dataset). Let’s look at another way to show this difference in family income, with a one-way ANOVA test and a TukeyHSD plot.

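#DEP_INC_AVG was read in as character, so convert it before fitting
fit_aov <- aov(as.numeric(DEP_INC_AVG) ~ factor(major_type), data = pub_institutions)
tuk <- TukeyHSD(fit_aov)

#Widen the left margin so the pair labels fit
par(mar=c(5,10,2,2))
plot(tuk, las=1)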

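When looking at a Tukey plot like this, we only want to look at the intervals that do not cross the dotted line at 0. Each interval is a confidence interval for the difference in mean income between two groups, so an interval that excludes 0 marks a pair whose difference is statistically significant.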
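
The same check can be read off the TukeyHSD table instead of the plot. This is a minimal sketch on simulated data: the group means, spread, and sample sizes are invented, so the result says nothing about the real schools.

```r
# Simulated incomes for three groups; all parameters are made up.
set.seed(42)
toy <- data.frame(
  major_type = factor(rep(c("HUMAN", "STEM", "VOCATION"), each = 30)),
  income = c(rnorm(30, mean = 50000, sd = 5000),
             rnorm(30, mean = 70000, sd = 5000),
             rnorm(30, mean = 35000, sd = 5000))
)

fit <- aov(income ~ major_type, data = toy)
summary(fit)  # overall F-test across the three groups

# TukeyHSD returns one matrix per factor with columns diff, lwr, upr, p adj.
tuk_tab <- as.data.frame(TukeyHSD(fit)$major_type)

# An interval excludes zero when both of its ends have the same sign.
tuk_tab$excludes_zero <- tuk_tab$lwr > 0 | tuk_tab$upr < 0

tuk_tab[, c("diff", "lwr", "upr", "excludes_zero")]
```

The sign of `diff` tells you which group in the pair is higher: a row named `STEM-HUMAN` with a positive `diff` means the STEM group has the higher mean.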

With this plot we can see that the SOC-HUMAN interval lies on the positive side of the 0 line. In other words, social science majors tend to come from higher-income families than humanities majors. Or rather, schools where social science is the most common major type have students who come from higher-income families. Likewise with STEM majors.

And, just as with the boxplot, schools where vocational majors are most common consistently show lower family incomes than the other schools.

Now, this is a pretty surface-level analysis, and I’ve made a number of assumptions that might not hold up in the real world. It is interesting, though, to see that some intuitions about income level and major choice hold when you look at the data. What do you think? Am I way off the mark? Is this analysis bunk? Leave me a comment or hit me up on Twitter.