############################################################
############################################################
### ###
### Combining categorical data from multiple columns ###
### ###
############################################################
############################################################
### In this example, researchers want to use the two columns containing each parent or guardian's education level to determine
### whether a student is a first-generation student. That means taking information from two columns and building a new column from it.
### When I tackle this sort of problem, I start by looking at what the data look like now and what I want them to become.
summary(data$Par2.Educ)
##     Bachelor's    High School   I don't know             NA Not applicable
##             10              8              1              1              1
##            PhD   Some College
##              2              1
summary(data$Par1.Educ)
##   Bachelor's  High School I don't know           NA          PhD
##            9           10            2            1            2
# There are three things I notice:
## First, there are multiple categories that indicate a parent went to college (Bachelor's, PhD, Some College), but I only care whether it's college or not.
## This means I can simplify the data by recoding it.
## Second, there are multiple instances where students say "I don't know", which tells me nothing about that parent's education.
## Third, in Par2.Educ there is a student who said the question was not applicable. This implies there wasn't a second parent or guardian in the house.
## This is an instance of missing data, but if I'm not careful with my coding, this student could get dropped from the data set. I need to look for them at the end.
### Let's start by simplifying the data using recode(), which comes from the dplyr package
library(dplyr) # needed for recode(); harmless if dplyr is already loaded
data$P1E_recode <- recode(data$Par1.Educ, "Bachelor's" = "College",
                          "High School" = "NoCollege",
                          "PhD" = "College",
                          .default = NA_character_) # this tells R to recode any value I didn't list as NA. Otherwise, unlisted values would stay the same.
## In case anyone wants a demonstration: without ".default = NA_character_" the "I don't know" response doesn't change.
### To run this code, delete the # in front of each line and then run it like any other code.
#data$P1E_recode2 <- recode(data$Par1.Educ, "Bachelor's" = "College",
# "High School" = "NoCollege",
# "PhD" = "College")
#summary(data$P1E_recode2)
table(data$P1E_recode, data$Par1.Educ)
##
##             Bachelor's High School I don't know NA PhD
##   College            9           0            0  0   2
##   NoCollege          0          10            0  0   0
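### The recode step above can also be sketched in base R, in case recode() is not available. This is only an illustration:
### toy_educ, lookup, and toy_recode are made-up names, not part of the data. Indexing a named vector by name returns NA
### for any name it does not contain, which plays the same role as .default = NA_character_.
toy_educ <- factor(c("Bachelor's", "High School", "I don't know", "PhD", NA))
lookup <- c("Bachelor's" = "College", "High School" = "NoCollege", "PhD" = "College")
# as.character() matters here: indexing with a factor would use its integer codes instead of its labels
toy_recode <- unname(lookup[as.character(toy_educ)])
toy_recode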
data$P2E_recode <- recode(data$Par2.Educ, "Bachelor's" = "College",
                          "High School" = "NoCollege",
                          "PhD" = "College",
                          "Some College" = "SomColl",
                          "Not applicable" = "No2",
                          .default = NA_character_)
table(data$P2E_recode, data$Par2.Educ)
##
##             Bachelor's High School I don't know NA Not applicable PhD
##   College           10           0            0  0              0   2
##   NoCollege          0           8            0  0              0   0
##   No2                0           0            0  0              1   0
##   SomColl            0           0            0  0              0   0
##
##             Some College
##   College              0
##   NoCollege            0
##   No2                  0
##   SomColl              1
### Now that we've simplified the data, we can start combining it. Again, we have to think through what we want our outcome variable to be.
### F.Gen = a student who doesn't have any parents or guardians who attended college
### C.Gen = a student who has at least one parent or guardian who attended college
### We will use ifelse statements to make these changes. ifelse statements work like if ... then statements. You can link conditions with '&' or '|'.
### & means both conditions have to be true for a row to get the new category. | means that as long as one of the conditions is true, the row gets the new category.
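### To make the difference concrete, here is a small sketch on made-up vectors (p1 and p2 are hypothetical,
### not columns from the data). The same pairs of values are tested with & and then with |.
p1 <- c("NoCollege", "NoCollege", "College")
p2 <- c("NoCollege", "College", "College")
and_result <- ifelse(p1 == "NoCollege" & p2 == "NoCollege", "F.Gen", "C.Gen") # both must be NoCollege
or_result <- ifelse(p1 == "NoCollege" | p2 == "NoCollege", "F.Gen", "C.Gen")  # one NoCollege is enough
data.frame(p1, p2, and_result, or_result) # only the middle row differs between the two versions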
# Let's look at some simple examples
# Using &
data$Gen.Status <- ifelse(data$P1E_recode == "NoCollege" & data$P2E_recode == "NoCollege", "F.Gen", "C.Gen")
# The logic of this statement is: if parent 1 did not go to college and parent 2 did not go to college, call the student first generation.
# Call everyone else continuing generation.
# To visualize what happened, create a new data frame with just the three relevant columns and look at it. Notice what happens to the student who
## has a parent with SomColl, the student who has missing data for one parent, and the student with only one parent.
check<-data[c("Gen.Status", "P1E_recode", "P2E_recode")]
check
## Gen.Status P1E_recode P2E_recode
## 1 C.Gen NoCollege SomColl
## 2 F.Gen NoCollege NoCollege
## 3 F.Gen NoCollege NoCollege
## 4 C.Gen College College
## 5 C.Gen NoCollege No2
## 6 C.Gen College College
## 7 C.Gen College College
## 8 C.Gen College College
## 9 C.Gen College College
## 10 F.Gen NoCollege NoCollege
## 11 F.Gen NoCollege NoCollege
## 12 C.Gen College College
## 13 C.Gen <NA> College
## 14 <NA> <NA> <NA>
## 15 F.Gen NoCollege NoCollege
## 16 C.Gen College College
## 17 F.Gen NoCollege NoCollege
## 18 C.Gen College College
## 19 C.Gen College College
## 20 <NA> <NA> <NA>
## 21 F.Gen NoCollege NoCollege
## 22 F.Gen NoCollege NoCollege
## 23 C.Gen College College
## 24 C.Gen College College
# Using |
data$Gen.Status <- ifelse(data$P1E_recode == "NoCollege" | data$P2E_recode == "NoCollege", "F.Gen", "C.Gen")
# The logic of this statement is: if parent 1 did not go to college or parent 2 did not go to college, call the student first generation.
# Call everyone else continuing generation.
# To visualize what happened, create a new data frame with just the three relevant columns and look at it.
## Again, notice what happens to the three outlier cases.
check<-data[c("Gen.Status", "P1E_recode", "P2E_recode")]
check
## Gen.Status P1E_recode P2E_recode
## 1 F.Gen NoCollege SomColl
## 2 F.Gen NoCollege NoCollege
## 3 F.Gen NoCollege NoCollege
## 4 C.Gen College College
## 5 F.Gen NoCollege No2
## 6 C.Gen College College
## 7 C.Gen College College
## 8 C.Gen College College
## 9 C.Gen College College
## 10 F.Gen NoCollege NoCollege
## 11 F.Gen NoCollege NoCollege
## 12 C.Gen College College
## 13 <NA> <NA> College
## 14 <NA> <NA> <NA>
## 15 F.Gen NoCollege NoCollege
## 16 C.Gen College College
## 17 F.Gen NoCollege NoCollege
## 18 C.Gen College College
## 19 C.Gen College College
## 20 <NA> <NA> <NA>
## 21 F.Gen NoCollege NoCollege
## 22 F.Gen NoCollege NoCollege
## 23 C.Gen College College
## 24 C.Gen College College
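### Why does student 13 (NA for one parent, College for the other) get C.Gen with & but NA with |? R uses
### three-valued logic: a comparison with NA returns NA, and & or | only give a definite answer when the
### known side settles the result on its own. A small sketch:
NA == "NoCollege"   # NA: R can't tell whether a missing value equals "NoCollege"
NA & FALSE          # FALSE: the result is FALSE whatever the missing value is
NA | FALSE          # NA: the result depends on the missing value
NA & TRUE           # NA
NA | TRUE           # TRUE
# ifelse() returns NA wherever its test is NA, so the two cases above end up as "C.Gen" and NA
na_demo <- ifelse(c(NA & FALSE, NA | FALSE), "F.Gen", "C.Gen")
na_demo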
# This mostly works, except for the student with a parent with some college, who is labeled F.Gen. Luckily, ifelse statements can chain.
## Unfortunately, each row takes the category of the first condition it matches, so overlapping conditions don't work and we can't use | here. Look at what happens if we use | statements:
data$Gen.Status<- ifelse(data$P1E_recode=="NoCollege"|data$P2E_recode=="NoCollege", "F.Gen",
ifelse(data$P1E_recode=="SomColl"|data$P2E_recode=="SomColl", "SomColl", "C.Gen"))
check<-data[c("Gen.Status", "P1E_recode", "P2E_recode")]
check
## Gen.Status P1E_recode P2E_recode
## 1 F.Gen NoCollege SomColl
## 2 F.Gen NoCollege NoCollege
## 3 F.Gen NoCollege NoCollege
## 4 C.Gen College College
## 5 F.Gen NoCollege No2
## 6 C.Gen College College
## 7 C.Gen College College
## 8 C.Gen College College
## 9 C.Gen College College
## 10 F.Gen NoCollege NoCollege
## 11 F.Gen NoCollege NoCollege
## 12 C.Gen College College
## 13 <NA> <NA> College
## 14 <NA> <NA> <NA>
## 15 F.Gen NoCollege NoCollege
## 16 C.Gen College College
## 17 F.Gen NoCollege NoCollege
## 18 C.Gen College College
## 19 C.Gen College College
## 20 <NA> <NA> <NA>
## 21 F.Gen NoCollege NoCollege
## 22 F.Gen NoCollege NoCollege
## 23 C.Gen College College
## 24 C.Gen College College
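### The SomColl label never appears because, in a chain, each row takes the result of the first test that is TRUE.
### Later tests are only consulted for rows where the earlier ones were FALSE. A toy sketch with made-up numbers
### (score, overlapping, and narrow_first are hypothetical names used only here):
score <- c(2, 6, 12)
# Every score above 4 is also above 0, so the "high" branch can never be reached
overlapping <- ifelse(score > 0, "positive", ifelse(score > 4, "high", "other"))
# Putting the narrower test first lets both categories appear
narrow_first <- ifelse(score > 4, "high", ifelse(score > 0, "positive", "other"))
data.frame(score, overlapping, narrow_first)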
### So we have to use &, but to limit our typing we want to think about the simplest way to do this. First gen is the most stringent category.
### Either both parents have no college, or there is only one parent and that parent has no college. So if we code those two cases, then everyone else
### is accurately captured as continuing generation.
data$Gen.Status<- ifelse(data$P1E_recode=="NoCollege"& data$P2E_recode=="NoCollege", "F.Gen",
ifelse(data$P1E_recode=="NoCollege" & data$P2E_recode=="No2", "F.Gen", "C.Gen"))
check<-data[c("Gen.Status", "P1E_recode", "P2E_recode")]
check
## Gen.Status P1E_recode P2E_recode
## 1 C.Gen NoCollege SomColl
## 2 F.Gen NoCollege NoCollege
## 3 F.Gen NoCollege NoCollege
## 4 C.Gen College College
## 5 F.Gen NoCollege No2
## 6 C.Gen College College
## 7 C.Gen College College
## 8 C.Gen College College
## 9 C.Gen College College
## 10 F.Gen NoCollege NoCollege
## 11 F.Gen NoCollege NoCollege
## 12 C.Gen College College
## 13 C.Gen <NA> College
## 14 <NA> <NA> <NA>
## 15 F.Gen NoCollege NoCollege
## 16 C.Gen College College
## 17 F.Gen NoCollege NoCollege
## 18 C.Gen College College
## 19 C.Gen College College
## 20 <NA> <NA> <NA>
## 21 F.Gen NoCollege NoCollege
## 22 F.Gen NoCollege NoCollege
## 23 C.Gen College College
## 24 C.Gen College College
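### One way to do the final check mentioned at the start (looking for students who could silently drop out) is to count
### the NAs explicitly. This sketch uses a made-up data frame, toy, standing in for the real data.
toy <- data.frame(P1E_recode = c("NoCollege", "NoCollege", "College", NA, NA),
                  P2E_recode = c("NoCollege", "No2", "College", "College", NA))
toy$Gen.Status <- ifelse(toy$P1E_recode == "NoCollege" & toy$P2E_recode == "NoCollege", "F.Gen",
                  ifelse(toy$P1E_recode == "NoCollege" & toy$P2E_recode == "No2", "F.Gen", "C.Gen"))
table(toy$Gen.Status, useNA = "ifany")  # useNA makes the NA rows show up in the counts
toy[is.na(toy$Gen.Status), ]            # the rows that still have no category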
## ifelse statements can be chained like this ad nauseam. I've had statements that ran over 20 lines for factors with a lot of levels and combinations.
### ifelse statements can also be frustrating because it's hard to remember the logic of the statements you are building when the chains get long.
#### I often write them out in words.
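### One way to "write them out in words" directly in the code is to give each condition a name before using it.
### This is only a sketch on made-up vectors (par1 and par2 are hypothetical); on this toy input it gives the same
### labels as the chained & version above, but I have not checked it against every possible combination of values.
par1 <- c("NoCollege", "NoCollege", "NoCollege", "College", NA)
par2 <- c("NoCollege", "No2", "SomColl", "College", NA)
both_no_college <- par1 == "NoCollege" & par2 == "NoCollege"   # two parents, neither attended college
only_parent_no_college <- par1 == "NoCollege" & par2 == "No2"  # one parent, who did not attend college
gen_status <- ifelse(both_no_college | only_parent_no_college, "F.Gen", "C.Gen")
# The chained form, for comparison
gen_chained <- ifelse(par1 == "NoCollege" & par2 == "NoCollege", "F.Gen",
               ifelse(par1 == "NoCollege" & par2 == "No2", "F.Gen", "C.Gen"))
data.frame(par1, par2, gen_status, gen_chained)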