############################################################
############################################################
###                                                      ###  
###   Combining categorical data from multiple columns   ###
###                                                      ###
############################################################
############################################################

###  In this example, researchers want to use the two columns containing each parent or guardian's education level to determine
#### whether a student is a first generation student.  So they want to take information from two columns and make a new column.

###  When I tackle these sorts of issues, I first strategize by looking at what the data looks like now and what I want it to be.
summary(data$Par2.Educ)
##     Bachelor's    High School   I don't know             NA Not applicable 
##             10              8              1              1              1 
##            PhD   Some College 
##              2              1
summary(data$Par1.Educ)
##   Bachelor's  High School I don't know           NA          PhD 
##            9           10            2            1            2
#There are three things I notice 
##  First, there are multiple categories that indicate a parent went to college (Bachelor's, PhD, Some College), but I only care whether it's college or not.
##  This means I could simplify this data by recoding it.

## Second, there are multiple instances where students say "I don't know" or "NA".  These responses don't tell me anything about the parent's education level.

## Third, in Par2.Educ there is a student who said the question was not applicable.  This implies there wasn't a second parent or guardian in the house.
##  This is an instance of missing data, but if I'm not careful with my coding, this student could get dropped from the data set.  I need to look for them in the end.

### Let's start by simplifying the data using recode


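library(dplyr) # recode() comes from the dplyr package; this line is harmless if dplyr is already loaded

# .default = NA_character_ tells R that any value I didn't include should be recoded as NA.
# Otherwise, values I didn't list would stay the same.
data$P1E_recode <- recode(data$Par1.Educ, "Bachelor's" = "College",
                         "High School" = "NoCollege",
                         "PhD" = "College",
                         .default = NA_character_)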


## In case anyone wants a demonstration: without ".default = NA_character_" the "I don't know" response doesn't change.
### To run this code, just delete the # in front of each line and then run it like any other code.
#data$P1E_recode2 <- recode(data$Par1.Educ, "Bachelor's" = "College",
                          # "High School" = "NoCollege",
                          # "PhD" = "College")
#summary(data$P1E_recode2)
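
## The same comparison can be sketched on a small made-up vector, so the effect of .default is visible
## without touching the real data.  This assumes dplyr is installed.  toy_educ, toy_with_default, and
## toy_without_default are hypothetical names used only for illustration.
library(dplyr)
toy_educ <- c("Bachelor's", "High School", "I don't know")
toy_with_default <- recode(toy_educ, "Bachelor's" = "College",
                           "High School" = "NoCollege",
                           .default = NA_character_)
toy_without_default <- recode(toy_educ, "Bachelor's" = "College",
                              "High School" = "NoCollege")
toy_with_default     # the unlisted value ("I don't know") becomes NA
toy_without_default  # the unlisted value stays "I don't know"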


table(data$P1E_recode, data$Par1.Educ)
##            
##             Bachelor's High School I don't know NA PhD
##   College            9           0            0  0   2
##   NoCollege          0          10            0  0   0
data$P2E_recode <- recode(data$Par2.Educ, "Bachelor's" = "College",
                          "High School" = "NoCollege",
                          "PhD" = "College",
                          "Some College" = "SomColl",
                          "Not applicable" = "No2",
                          .default = NA_character_)

table(data$P2E_recode, data$Par2.Educ)
##            
##             Bachelor's High School I don't know NA Not applicable PhD
##   College           10           0            0  0              0   2
##   NoCollege          0           8            0  0              0   0
##   No2                0           0            0  0              1   0
##   SomColl            0           0            0  0              0   0
##            
##             Some College
##   College              0
##   NoCollege            0
##   No2                  0
##   SomColl              1
### Now, that we've simplified the data, we can start combining it.  Again, we have to think through what we want our outcome variable to be.
### F.Gen = a student who doesn't have any parents or guardians who attended college
### C.Gen = a student who has at least one parent or guardian who attended college

###  We will use ifelse statements to make these changes.  ifelse statements work like if ... then statements. You can link conditions with an '&' or '|'.
### & means both columns have to look exactly as you specify to change to your new category.  | means as long as one of them looks how you specify, change it to the new category.
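
## A small sketch of how & and | treat missing values, which matters for the students with NA in a parent column.
## toy_a and toy_b are made-up logical vectors (hypothetical names, not columns in the data set).
toy_a <- c(TRUE, TRUE, FALSE, NA, NA, NA)
toy_b <- c(TRUE, FALSE, FALSE, TRUE, FALSE, NA)
toy_and <- toy_a & toy_b   # NA & FALSE is FALSE, but NA & TRUE is NA
toy_or  <- toy_a | toy_b   # NA | TRUE is TRUE,   but NA | FALSE is NA
data.frame(toy_a, toy_b, toy_and, toy_or)
## ifelse returns NA wherever its test is NA, so a missing value in one column can either
## disappear (the test still resolves to TRUE or FALSE) or carry through to the new column.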

#Let's look at some simple examples

#Using &
data$Gen.Status<- ifelse(data$P1E_recode=="NoCollege"& data$P2E_recode=="NoCollege", "F.Gen", "C.Gen")
# The logic of this statement is: if parent 1 did not attend college and parent 2 did not attend college, then call the student first generation.
# Call everyone else continuing generation.

#To visualize what happened create a new dataset with just the three relevant columns and look at them. Notice what happens to the student who
## has a parent with SomColl, the student who has missing data for one parent, and the student with only one parent.
check<-data[c("Gen.Status", "P1E_recode", "P2E_recode")]
check
##    Gen.Status P1E_recode P2E_recode
## 1       C.Gen  NoCollege    SomColl
## 2       F.Gen  NoCollege  NoCollege
## 3       F.Gen  NoCollege  NoCollege
## 4       C.Gen    College    College
## 5       C.Gen  NoCollege        No2
## 6       C.Gen    College    College
## 7       C.Gen    College    College
## 8       C.Gen    College    College
## 9       C.Gen    College    College
## 10      F.Gen  NoCollege  NoCollege
## 11      F.Gen  NoCollege  NoCollege
## 12      C.Gen    College    College
## 13      C.Gen       <NA>    College
## 14       <NA>       <NA>       <NA>
## 15      F.Gen  NoCollege  NoCollege
## 16      C.Gen    College    College
## 17      F.Gen  NoCollege  NoCollege
## 18      C.Gen    College    College
## 19      C.Gen    College    College
## 20       <NA>       <NA>       <NA>
## 21      F.Gen  NoCollege  NoCollege
## 22      F.Gen  NoCollege  NoCollege
## 23      C.Gen    College    College
## 24      C.Gen    College    College
# Using |
data$Gen.Status<- ifelse(data$P1E_recode=="NoCollege"|data$P2E_recode=="NoCollege", "F.Gen", "C.Gen")
# The logic of this statement is: if parent 1 did not attend college or parent 2 did not attend college, then call the student first generation.
# Call everyone else continuing generation.

#To visualize what happened create a new dataset with just the three relevant columns and look at them.
## Again, notice what happens to the three outlier cases
check<-data[c("Gen.Status", "P1E_recode", "P2E_recode")]
check
##    Gen.Status P1E_recode P2E_recode
## 1       F.Gen  NoCollege    SomColl
## 2       F.Gen  NoCollege  NoCollege
## 3       F.Gen  NoCollege  NoCollege
## 4       C.Gen    College    College
## 5       F.Gen  NoCollege        No2
## 6       C.Gen    College    College
## 7       C.Gen    College    College
## 8       C.Gen    College    College
## 9       C.Gen    College    College
## 10      F.Gen  NoCollege  NoCollege
## 11      F.Gen  NoCollege  NoCollege
## 12      C.Gen    College    College
## 13       <NA>       <NA>    College
## 14       <NA>       <NA>       <NA>
## 15      F.Gen  NoCollege  NoCollege
## 16      C.Gen    College    College
## 17      F.Gen  NoCollege  NoCollege
## 18      C.Gen    College    College
## 19      C.Gen    College    College
## 20       <NA>       <NA>       <NA>
## 21      F.Gen  NoCollege  NoCollege
## 22      F.Gen  NoCollege  NoCollege
## 23      C.Gen    College    College
## 24      C.Gen    College    College
# This mostly works, except that the student who has a parent with some college is labeled F.Gen.  Luckily, ifelse statements can be chained.
## Unfortunately, the categories can't overlap: ifelse checks the first test first, and any row that passes it never reaches the second test.
## That means we can't use | here.  Look at what happens if we do:
data$Gen.Status<- ifelse(data$P1E_recode=="NoCollege"|data$P2E_recode=="NoCollege", "F.Gen", 
                         ifelse(data$P1E_recode=="SomColl"|data$P2E_recode=="SomColl", "SomColl", "C.Gen"))
check<-data[c("Gen.Status", "P1E_recode", "P2E_recode")]
check
##    Gen.Status P1E_recode P2E_recode
## 1       F.Gen  NoCollege    SomColl
## 2       F.Gen  NoCollege  NoCollege
## 3       F.Gen  NoCollege  NoCollege
## 4       C.Gen    College    College
## 5       F.Gen  NoCollege        No2
## 6       C.Gen    College    College
## 7       C.Gen    College    College
## 8       C.Gen    College    College
## 9       C.Gen    College    College
## 10      F.Gen  NoCollege  NoCollege
## 11      F.Gen  NoCollege  NoCollege
## 12      C.Gen    College    College
## 13       <NA>       <NA>    College
## 14       <NA>       <NA>       <NA>
## 15      F.Gen  NoCollege  NoCollege
## 16      C.Gen    College    College
## 17      F.Gen  NoCollege  NoCollege
## 18      C.Gen    College    College
## 19      C.Gen    College    College
## 20       <NA>       <NA>       <NA>
## 21      F.Gen  NoCollege  NoCollege
## 22      F.Gen  NoCollege  NoCollege
## 23      C.Gen    College    College
## 24      C.Gen    College    College
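
## Why SomColl never shows up: a minimal sketch on made-up vectors (toy_p1, toy_p2, and toy_status are hypothetical names).
## The first toy student matches both tests, but the outer ifelse claims them before the inner one is consulted.
toy_p1 <- c("NoCollege", "College", "College")
toy_p2 <- c("SomColl", "SomColl", "College")
toy_status <- ifelse(toy_p1 == "NoCollege" | toy_p2 == "NoCollege", "F.Gen",
                     ifelse(toy_p1 == "SomColl" | toy_p2 == "SomColl", "SomColl", "C.Gen"))
toy_status
## Only the second toy student, who fails the first test, can be labeled SomColl.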
### So we have to use &, but to limit our typing we want to think about the simplest way to do this.  First gen is the most stringent category.
#### Either both parents did not attend college, or there is only one parent and that parent did not attend college.  So if we code those two cases, then everyone else
#### is accurately captured as continuing generation (students with missing data for both parents stay NA).

data$Gen.Status<- ifelse(data$P1E_recode=="NoCollege"& data$P2E_recode=="NoCollege", "F.Gen",
                         ifelse(data$P1E_recode=="NoCollege" & data$P2E_recode=="No2", "F.Gen", "C.Gen"))
check<-data[c("Gen.Status", "P1E_recode", "P2E_recode")]
check
##    Gen.Status P1E_recode P2E_recode
## 1       C.Gen  NoCollege    SomColl
## 2       F.Gen  NoCollege  NoCollege
## 3       F.Gen  NoCollege  NoCollege
## 4       C.Gen    College    College
## 5       F.Gen  NoCollege        No2
## 6       C.Gen    College    College
## 7       C.Gen    College    College
## 8       C.Gen    College    College
## 9       C.Gen    College    College
## 10      F.Gen  NoCollege  NoCollege
## 11      F.Gen  NoCollege  NoCollege
## 12      C.Gen    College    College
## 13      C.Gen       <NA>    College
## 14       <NA>       <NA>       <NA>
## 15      F.Gen  NoCollege  NoCollege
## 16      C.Gen    College    College
## 17      F.Gen  NoCollege  NoCollege
## 18      C.Gen    College    College
## 19      C.Gen    College    College
## 20       <NA>       <NA>       <NA>
## 21      F.Gen  NoCollege  NoCollege
## 22      F.Gen  NoCollege  NoCollege
## 23      C.Gen    College    College
## 24      C.Gen    College    College
## ifelse statements can be chained like this ad nauseam.  I've had statements that went over 20 lines for factors with a lot of levels and combos.
### ifelse statements can also be frustrating because it's hard to remember the logic of the statements you are building when the chains get long.
#### I often write them out in words.
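
## One way to do that, sketched on a made-up data frame: put each rule in words right next to its test,
## then count the NAs at the end so no student is silently dropped.
## toy_check is a hypothetical stand-in for the real data; only the column names match.
toy_check <- data.frame(P1E_recode = c("NoCollege", "NoCollege", "College", NA, NA),
                        P2E_recode = c("NoCollege", "No2", "College", "College", NA),
                        stringsAsFactors = FALSE)
toy_check$Gen.Status <- ifelse(
  # Rule 1: neither parent attended college, so the student is first gen
  toy_check$P1E_recode == "NoCollege" & toy_check$P2E_recode == "NoCollege", "F.Gen",
  ifelse(
    # Rule 2: there is only one parent and they did not attend college, so the student is first gen
    toy_check$P1E_recode == "NoCollege" & toy_check$P2E_recode == "No2", "F.Gen",
    # Otherwise: the student is continuing gen
    "C.Gen"))
table(toy_check$Gen.Status, useNA = "ifany")   # useNA = "ifany" adds a count of NA results
toy_check[is.na(toy_check$Gen.Status), ]       # the rows that still need a decision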