Most of the functions and ideas for this lab come from this recommended reading book. (hold control or command when clicking on the link)
We’ll be using data and functions from the tidyverse and knitr libraries, make sure that you have those libraries loaded.

Consider the following data frames

dat

##   XX10001 XX10002 XX10003 XX10004
## 1    a001       1       1      15
## 2    a002       1       1      16
## 3    a003       0       1      15
## 4    a004       1       2      15
## 5    a005       0       3      16

dat1

##   XX10001 XA10001 XA10002
## 1    a001      65     140
## 2    a002      71     135
## 3    a003      66     180
## 4    a004      62     110
## 5    a005      73     210

dat2

##    XX10001 XB10001 XB10002 XB10003
## 1     a001       1       A       B
## 2     a001       2       A       A
## 3     a002       1       A       B
## 4     a002       2       B       C
## 5     a003       1       B       D
## 6     a003       2       C       B
## 7     a004       1       C       C
## 8     a004       2       A       A
## 9     a005       1       A       B
## 10    a005       2       A       A

When you import data, it’s not always in a user friendly format. The above data frames would be difficult to work with in their cuurent state.
There are many techniques used to clean data. I offer a few in the following paragraphs and chunks:

After consulting the codebook, we can see that dat2 is the student grades, XX10001 is the student ID number, XB10001 is the semester number, XB10002 is the final grades for history and XB10003 is the final grades for math.

First, we’ll start with a technique that you’ve already seen

Basic Cleaning / Tidying / Renaming

dat2$stuid <- dat2$XX10001
dat2$sem <- dat2$XB10001
dat2$hist <- dat2$XB10002
dat2$math <- dat2$XB10003
print(dat2)
##    XX10001 XB10001 XB10002 XB10003 stuid sem hist math
## 1     a001       1       A       B  a001   1    A    B
## 2     a001       2       A       A  a001   2    A    A
## 3     a002       1       A       B  a002   1    A    B
## 4     a002       2       B       C  a002   2    B    C
## 5     a003       1       B       D  a003   1    B    D
## 6     a003       2       C       B  a003   2    C    B
## 7     a004       1       C       C  a004   1    C    C
## 8     a004       2       A       A  a004   2    A    A
## 9     a005       1       A       B  a005   1    A    B
## 10    a005       2       A       A  a005   2    A    A

Now we’d like to only have the variables in our dataframe that we wish to work with. We may select just those variables we wish to work with thusly

Select

dat2 <- select(dat2, c(stuid,sem,math,hist)) #selects only the variables we want to work with
print(dat2)   
##    stuid sem math hist
## 1   a001   1    B    A
## 2   a001   2    A    A
## 3   a002   1    B    A
## 4   a002   2    C    B
## 5   a003   1    D    B
## 6   a003   2    B    C
## 7   a004   1    C    C
## 8   a004   2    A    A
## 9   a005   1    B    A
## 10  a005   2    A    A

Notice that the order of the variables matched the order that they were selected.

Next we’ll introduce a new method of adding variables to a data frame
Looking at our codebook for dat1, we see that XA10001 is height in inches and XA10002 is weight in pounds.

Mutate and the Pipe

Mutate is a function in the tidyverse package that allows one to add variables to your data frame.
The pipe (which looks like %>% ) is a powerful tool that allows you to join functions together. Think of the pipe as “and then.” The shortcut for the pipe is

Command-Shift-m for macs
Control-Shift-m for PCs
dat1 <- dat1 %>%         #this creates a structure telling R that we are working within the data frame dat1 while in this pipe
  mutate( stuid=XX10001, ht=XA10001, wt=XA10002) %>%       #here we are able to add another pipe to this to select the variables we want to use
  select(stuid,ht,wt)
print(dat1)
##   stuid ht  wt
## 1  a001 65 140
## 2  a002 71 135
## 3  a003 66 180
## 4  a004 62 110
## 5  a005 73 210

Next in our codebook we see that for the values of dat, XX10001 is still the student ID. XX10002 is sex, with 0 for male and 1 for female. XX10003 is Race/Ethnicity with 1 for NHW, 2 for NHB and 3 for Hisp, and XX10004 is age.

Transmute and ifelse

Transmute essentially combines mutate and select.
Ifelse returns a value dependent on a test that you specify. Remember that if else likes ==,<=, >=, or !=.

dat <- dat %>%     #This is how I like to write my code.  lots of space and easier to read
  transmute(
    stuid = XX10001,
    sex = ifelse(XX10002 == 1,"f","m"),       #here you can see that if XX0001 is 1 we get f otherwise we get m
    race_eth = ifelse(XX10003 == 1, "nhw",       # here I nest ifelse statements to be able to account for three categories
                      ifelse(XX10003 == 2, "nhb","hisp")),     # you should have the same number of closing parenthesis as you have ifelses
    age =XX10004
  )
print(dat)
##   stuid sex race_eth age
## 1  a001   f      nhw  15
## 2  a002   f      nhw  16
## 3  a003   m      nhw  15
## 4  a004   f      nhb  15
## 5  a005   m     hisp  16

Other ways to manipulate data

Filter

Filter is a great way of only looking at the data you wish to see. It is very similar to subset.
For example, if we were interested in only viewing female data in the data frame dat:

datf <- dat %>% 
  filter(sex == "f")
print(datf)
##   stuid sex race_eth age
## 1  a001   f      nhw  15
## 2  a002   f      nhw  16
## 3  a004   f      nhb  15

Arrange

Arrange is a useful function when you want to reorder your data. For example, suppose we wanted to see the data arranged from youngest to oldest, female first then male

datarr <- dat %>% 
  arrange(age,sex) # notice that R works in alphabetical order
print(datarr)
##   stuid sex race_eth age
## 1  a001   f      nhw  15
## 2  a004   f      nhb  15
## 3  a003   m      nhw  15
## 4  a002   f      nhw  16
## 5  a005   m     hisp  16
datarr <- dat %>% 
  arrange(age, desc(sex)) # if we wanted to see males first we would instruct R to show that variable in descending order thusly
print(datarr)
##   stuid sex race_eth age
## 1  a003   m      nhw  15
## 2  a001   f      nhw  15
## 3  a004   f      nhb  15
## 4  a005   m     hisp  16
## 5  a002   f      nhw  16

Joins

Please see here for a detailed account of joins. (hold control or command and click on link)
Sometimes the data we want to analyze exists on more than one data frame. So we must learn how to combine data frames. There are many, many ways of doing this in R, the following is one of those ways that I find intuitive and easy to use.
To begin, there needs to be a variable in each of the data frames that is a unique identifier of each object in the data frame. In the above data frames that we’ve been working with, the variable stuid will be the key that we utilize to join tables together.

Suppose we are interested in looking at grades of the students by sex. While this may be relatively easy to see when we have just 5 objects in our data frames, imagine trying to accomplish this with 100,000 objects.

Left_Join, right_join, inner_join, etc…

Recommend looking at help to see the details on these

datcombl <- dat %>% 
  left_join(dat2) #as long as the key variable has the exact same name in both data frames, R will figure it out, otherwise you must specify the variables that are the keys
## Joining, by = "stuid"
datcombr <- dat %>% 
  right_join(dat2)
## Joining, by = "stuid"
datcombi <- dat %>% 
  inner_join(dat2)
## Joining, by = "stuid"
datcombf <- dat %>% 
  full_join(dat2)
## Joining, by = "stuid"
print(datcombl)
##    stuid sex race_eth age sem math hist
## 1   a001   f      nhw  15   1    B    A
## 2   a001   f      nhw  15   2    A    A
## 3   a002   f      nhw  16   1    B    A
## 4   a002   f      nhw  16   2    C    B
## 5   a003   m      nhw  15   1    D    B
## 6   a003   m      nhw  15   2    B    C
## 7   a004   f      nhb  15   1    C    C
## 8   a004   f      nhb  15   2    A    A
## 9   a005   m     hisp  16   1    B    A
## 10  a005   m     hisp  16   2    A    A
#print(datcombr)
#print(datcombi)
#print(datcombf)

ACS examples

Next, we’ll take a look at an example using some data from the ACS (American Community Survey).

I pulled the data for this lab from IPUMS. This is from the 2015 5 year ACS. It is a fairly large file, so once it is downloaded, I’ll take a random sample of 100,000 to make it easier to execute code with.

Suppose I was interested in seeing a breakdown by race (nhw, nhb, and hisp) and education (less than high school and at least high school) for those individuals at or above the age of 25

dat <- read_csv("2015_5yr_ACS_4oct.csv")# That's a big file.  It takes time
## Parsed with column specification:
## cols(
##   .default = col_double()
## )
## See spec(...) for full column specifications.
#this size data is fairy unwieldy for this level of instruction.

#tab2 <- table(dat$HISPAND)
#print(tab2)
set.seed(1230) # here I set a seed so we all have the same data in our reduced data frame
datred <-dat[sample(nrow(dat),100000),] # Here I take a random sample of 100,000 out of our 15.6 million individuals
#view(datred)
#tab3 <- table(datred$HISPAND)
#print(tab3)

Examople 1

dat1 <- datred %>% 
  transmute(
    race = ifelse(RACE == 1 & HISPAN == 0, 'nhw',
                  ifelse(RACE == 2 & HISPAN == 0, 'nhb',
                         ifelse(HISPAN >= 1 & HISPAN <= 4, 'hisp', NA))),# race category with three categories
    educ = ifelse(EDUC <= 5 & EDUC >= 1, 'lths',
                  ifelse(EDUC >= 6, 'hs', NA)), # create an education variable to see less than high school and at least high school
    age = AGE, #I don't like seeing caps in variables when I code (personal preference)
    sex = ifelse(SEX == 1, "m","f"),
    statefip = STATEFIP,
    countyfip = COUNTYFIP,
  ) %>% 
  filter( age >= 25) #Here I'm only interested in those at or above the age of 25

head(dat1)
## # A tibble: 6 x 6
##   race  educ    age sex   statefip countyfip
##   <chr> <chr> <dbl> <chr>    <dbl>     <dbl>
## 1 nhw   hs       39 f           42         0
## 2 hisp  lths     62 f            6        37
## 3 nhw   hs       59 f           36       103
## 4 nhb   hs       41 m           48       157
## 5 nhw   hs       61 m            6        85
## 6 nhb   hs       43 f           26       121
dat1 <- na.omit(dat1) # here I remove the rows that contain NA
head(dat1)
## # A tibble: 6 x 6
##   race  educ    age sex   statefip countyfip
##   <chr> <chr> <dbl> <chr>    <dbl>     <dbl>
## 1 nhw   hs       39 f           42         0
## 2 hisp  lths     62 f            6        37
## 3 nhw   hs       59 f           36       103
## 4 nhb   hs       41 m           48       157
## 5 nhw   hs       61 m            6        85
## 6 nhb   hs       43 f           26       121
tab1 <- table(dat1$educ, dat1$race) # this creates a 2x3 table
print(tab1)
##       
##         hisp   nhb   nhw
##   hs    5389  5875 46298
##   lths  2127   907  3300
tab1a <- round(prop.table(tab1,2), digits = 3)*100 # the 2 tells R to compute percentages by column, a 1 would do it by row
print(tab1a)
##       
##        hisp  nhb  nhw
##   hs   71.7 86.6 93.3
##   lths 28.3 13.4  6.7
#tab1b<- round(prop.table(tab1,1), digits = 3)*100
#print(tab1b)

Next I’m interested to see what the differences in incomes are between males and females in the US. I’m not sure why the code book didn’t contain the income categories but I’ll leave notes when I add them to our data frame. I’ve decided to categorize the individual income variable into 4 categories in an attempt learn more than just a comparison of the mean values.

Example 2

dat2 <- datred %>% 
    filter(INCTOT > 0 & INCTOT < 9000000) %>%  # to remove those who are unemployed and those NA answers (9999999)
  transmute(
    sex = ifelse(SEX == 1, "m","f"),
    statefip = STATEFIP,
    countyfip = COUNTYFIP,
    inc = factor(ifelse(INCTOT <= 20000, "<$20k",
                 ifelse(INCTOT >= 20001 & INCTOT <= 60000, "$20k-$60k", 
                        ifelse(INCTOT >= 60001 & INCTOT <= 120000, "$60k-$120k",
                               ifelse(INCTOT >= 120001,">$120k", NA))))),
    age = AGE
  ) %>% 
      filter(age >= 18) 
dat2 <- na.omit(dat2)

#looking at the number of individuals
tab2 <- table(dat2$inc, dat2$sex)
tab2 <- tab2[c(1,3,4,2),] # here I reorder the rows so they make sense
kable(tab2) # kable is a way of making table, or making basic tables look better (compare the output with the previous example).  This is just an introduction
f m
<$20k 16840 10605
$20k-$60k 14227 14715
$60k-$120k 3913 6412
>$120k 911 2813
#looking at the percentages
tab2a <- round(prop.table(tab2,1), digits = 3)*100 
kable(tab2a)  
f m
<$20k 61.4 38.6
$20k-$60k 49.2 50.8
$60k-$120k 37.9 62.1
>$120k 24.5 75.5

Exercises

1. What if we were interested in examining the education of individuals from the Northern Triangle countries of Guatemala, Honduras, and El Salvador as compared to other Hispanic individuals, non-Hispanic white, non-Hispanic black, and non-Hispanic other individuals. Create a data frame that has the following variables:

a. Race/Ethnicity variable with the following categories: 1) nhw 2) nhb 43 hisp northern triangle origin 4) hisp other
b. Education variable with the following categories: 1) less than a GED or High School diploma 2) GED or High School Diploma 4) Bachelor’s Degree or higher
c. Age only between 18 and 65 inclusive

2. Determine the percentage by race/ethnicity of each of these education categories

3. Given that Texas is FIPS code 48 adn Bexar county is 29, Compare the income, utilizing the same categories as example two, amongst the following race ethnicities: 1) non-Hispanic white 2) non Hispanic black 3)non-Hispanic other and 4) Hispanic in the following locations:

a. The United States

b. Texas

c. Bexar county

4. Using the Education categories from exercise 1b, determine the percentages by the race ethnicity categories in 3 in the following locations:

a. The United States

b. Texas

c. Bexar county