Modeling Inequality in STEM

Michael Inkles
May 29, 2016

Overview

The goal of this project was to create a regression model that takes very basic demographics – age, race, sex, and place of birth – and uses them to predict how likely someone is to work in a STEM (science, technology, engineering, and math) career. The end result is a logistic regression model that uses these demographic predictors independently to predict a person's likelihood of having a degree in STEM or a job in STEM.

The data used for this project is from the 2013 American Community Survey, found at https://www.census.gov/programs-surveys/acs/data.html

Loading the data

There are many columns in the census data, but I only needed 2 columns for outcome ('SCIENGP' for science degree, 'SOCP' for science occupation), and 5 columns for predictors - Age, sex, hispanic origin, place of birth, and race.

library(data.table)

# Select only the relevant columns
pop_cols <- c('AGEP','SEX','HISP','POBP',
              'RAC1P','SCIENGP','SOCP')

# Load in all of the ACS2013 data
df <- rbind((fread("pums/ss13pusa.csv", select=pop_cols)),
            (fread("pums/ss13pusb.csv", select=pop_cols)))


Read 0.0% of 1613672 rows
Read 18.0% of 1613672 rows
Read 38.4% of 1613672 rows
Read 58.3% of 1613672 rows
Read 77.5% of 1613672 rows
Read 95.4% of 1613672 rows
Read 1613672 rows and 7 (of 283) columns from 1.416 GB file in 00:00:11

Read 0.0% of 1519123 rows
Read 20.4% of 1519123 rows
Read 40.2% of 1519123 rows
Read 59.9% of 1519123 rows
Read 81.6% of 1519123 rows
Read 1519123 rows and 7 (of 283) columns from 1.333 GB file in 00:00:10

Cleaning the data

First, I subsetted the data on two conditions:

Over 23 - Anyone under 23 is less likely to have a bachelor's degree in general, so this will skew the model.
Born in the USA (POPB>60) - I'll be using state of birth as a predictor, since the model gets too complicated if I include every US state AND every country of birth

df <- df[df$AGEP>22 & df$POBP<60]

Cleaning the data

Next, I recoded each of the predictor variables with labels, to match the labels that will be used on the UI of the Shiny app.

Each state had to be recoded manually.

df$State[df$POBP == 1] <- "Alabama"
df$State[df$POBP == 2] <- "Alaska"
df$State[df$POBP == 4] <- "Arizona"

Sex had to be recoded to 0 and 1 and then labelled Male and Female.

df$SEX <- df$SEX - 1
df$sex_recode[df$SEX == 0] <- "Male"
df$sex_recode[df$SEX == 1] <- "Female"

Cleaning the data

For race, I included 3 non-hispanic races (White, Black, Asian), as well as various Hispanic nationalities and groups of nationalities. All others were categorized as Other. This way of grouping people by race turned out to be the most consistently predictive in cross-validations. For more reading on this, see my blog post at https://michaelinkles.wordpress.com/2016/03/12/whats-special-about-florida/

df$race_recode <- "Other"
df$race_recode[df$RAC1P == 1] <- "White"
df$race_recode[df$RAC1P == 2] <- "Black"
df$race_recode[df$RAC1P == 6] <- "Asian"
df$race_recode[df$HISP == 2] <- "Mexican"
df$race_recode[df$HISP == 3] <- "Puerto_Rican"
df$race_recode[df$HISP == 4] <- "Cuban"
df$race_recode[df$HISP %in% 5:12] <- "Other_Central_American"
df$race_recode[df$HISP %in% 13:22] <- "South_American"
df$race_recode[df$HISP == 23] <- "Spaniard"
df$race_recode[df$HISP == 24] <- "All_Other_Hispanic"

Cleaning the data

For my outcome variables, I wanted to create two binary variables: has STEM degree yes/no and has STEM career yes/no. The first was easy, as the census data had a variable SCIENGP in which 1 was “has a bachelor's degree or higher in STEM”.

df$science_degree <- 0
df$science_degree[df$SCIENGP == 1] <- 1

For science occupation, I used a list of SOCP codes from http://www.bls.gov/soc/Attachment_C_STEM.pdf and assigned those a list of those codes to a variable called science_job_codes.

df$science_occupation <- 0
df$science_occupation[df$SOCP %in% science_job_codes] <- 1

Creating the models

Exploratory analysis shows that the effect of age is different for males and females in the science degree outcome. plot of chunk unnamed-chunk-10

Creating the models

Because of this, I need to add an interaction variable for Sex*Age to include in the model

df$sex_age <- df$SEX * df$AGEP

I also need to make sure race and state are factor variables. I can simply convert race to a factor variable, but since state has more than the maximum number of factors (32), I need to set up dummy variables.

dummies <- data.frame(matrix(NA, ncol = length(states), nrow = nrow(df)))
for(i in seq(1,length(states))){
  dummies[,i] <- ifelse(df$State == states[i], 1, 0)
}
names(dummies) <- states
df <- cbind(df, dummies)

Creating the models

I am creating two models. The first uses the original subset of data to predict how likely someone is to have a STEM degree, based on demographics. The second model uses a subset of people who have science degrees and predicts how likely someone in that subset is to have a job in STEM.

# create subset of data that has science degree, to use as a base for the science occupation model
df_degree <- df[df$science_degree == 1,]

# create science occupation model
lm_occupation <- glm(science_occupation ~ AGEP + SEX + sex_age + race_recode, data=df_degree, family="binomial")

Creating the models

# create science degree model
# leave out Pennsylvania as it will be the reference category
lm_degree <- glm(science_degree ~ AGEP + SEX + sex_age + race_recode +
                   Alabama + Alaska + Arizona + Arkansas + California + Colorado +
                   Connecticut + Delaware + DC + Florida + Georgia + Hawaii + Idaho +
                   Illinois + Indiana + Iowa + Kansas + Kentucky + Louisiana + Maine +
                   Maryland + Massachusetts + Michigan + Minnesota + Mississippi + 
                   Missouri + Montana + Nebraska + Nevada + New_Hampshire +
                   New_Mexico + New_Jersey + New_York + North_Carolina + North_Dakota +
                   Ohio + Oklahoma + Oregon + Rhode_Island + South_Carolina +
                   South_Dakota + Tennessee + Texas + Utah + Vermont + Virginia +
                   Washington + West_Virginia + Wisconsin + Wyoming, data=df, family="binomial")

Creating the Shiny App

Because the script for creating the models takes too long to run, I don't want to have to run it every time I run the Shiny app. Instead, I created a data frame of all possible demographic combinations using expand_grid to import into the Shiny App.

ages <- seq(1,100)
sexes <- c(0,1)
races <- c("White","Asian","Black","Other","Mexican",
           "Puerto_Rican","Cuban","Other_Central_American",
           "South_American","Spaniard","All_Other_Hispanic")
predict_data <- expand.grid(states, races, sexes, ages)
predict_data <- rename(predict_data,c("Var1"="State","Var2"="race_recode","Var3"="SEX","Var4"="AGEP"))
predict_data$sex_age = predict_data$SEX * predict_data$AGEP

Results

The Shiny app can be viewed at https://michaelinkles.shinyapps.io/ShinySTEM/. Simply input demographics for two people to compare their likelihood of getting degrees and jobs in STEM.