Most of the functions and ideas for this lab come from this recommended reading book. (hold control or command when clicking on the link)
We’ll be using data and functions from the tidyverse and knitr libraries, make sure that you have those libraries loaded.
dat
## XX10001 XX10002 XX10003 XX10004
## 1 a001 1 1 15
## 2 a002 1 1 16
## 3 a003 0 1 15
## 4 a004 1 2 15
## 5 a005 0 3 16
dat1
## XX10001 XA10001 XA10002
## 1 a001 65 140
## 2 a002 71 135
## 3 a003 66 180
## 4 a004 62 110
## 5 a005 73 210
dat2
## XX10001 XB10001 XB10002 XB10003
## 1 a001 1 A B
## 2 a001 2 A A
## 3 a002 1 A B
## 4 a002 2 B C
## 5 a003 1 B D
## 6 a003 2 C B
## 7 a004 1 C C
## 8 a004 2 A A
## 9 a005 1 A B
## 10 a005 2 A A
When you import data, it’s not always in a user friendly format. The above data frames would be difficult to work with in their cuurent state.
There are many techniques used to clean data. I offer a few in the following paragraphs and chunks:
After consulting the codebook, we can see that dat2 is the student grades, XX10001 is the student ID number, XB10001 is the semester number, XB10002 is the final grades for history and XB10003 is the final grades for math.
First, we’ll start with a technique that you’ve already seen
dat2$stuid <- dat2$XX10001
dat2$sem <- dat2$XB10001
dat2$hist <- dat2$XB10002
dat2$math <- dat2$XB10003
print(dat2)
## XX10001 XB10001 XB10002 XB10003 stuid sem hist math
## 1 a001 1 A B a001 1 A B
## 2 a001 2 A A a001 2 A A
## 3 a002 1 A B a002 1 A B
## 4 a002 2 B C a002 2 B C
## 5 a003 1 B D a003 1 B D
## 6 a003 2 C B a003 2 C B
## 7 a004 1 C C a004 1 C C
## 8 a004 2 A A a004 2 A A
## 9 a005 1 A B a005 1 A B
## 10 a005 2 A A a005 2 A A
Now we’d like to only have the variables in our dataframe that we wish to work with. We may select just those variables we wish to work with thusly
dat2 <- select(dat2, c(stuid,sem,math,hist)) #selects only the variables we want to work with
print(dat2)
## stuid sem math hist
## 1 a001 1 B A
## 2 a001 2 A A
## 3 a002 1 B A
## 4 a002 2 C B
## 5 a003 1 D B
## 6 a003 2 B C
## 7 a004 1 C C
## 8 a004 2 A A
## 9 a005 1 B A
## 10 a005 2 A A
Notice that the order of the variables matched the order that they were selected.
Next we’ll introduce a new method of adding variables to a data frame
Looking at our codebook for dat1, we see that XA10001 is height in inches and XA10002 is weight in pounds.
Mutate is a function in the tidyverse package that allows one to add variables to your data frame.
The pipe (which looks like %>% ) is a powerful tool that allows you to join functions together. Think of the pipe as “and then.” The shortcut for the pipe is
dat1 <- dat1 %>% #this creates a structure telling R that we are working within the data frame dat1 while in this pipe
mutate( stuid=XX10001, ht=XA10001, wt=XA10002) %>% #here we are able to add another pipe to this to select the variables we want to use
select(stuid,ht,wt)
print(dat1)
## stuid ht wt
## 1 a001 65 140
## 2 a002 71 135
## 3 a003 66 180
## 4 a004 62 110
## 5 a005 73 210
Next in our codebook we see that for the values of dat, XX10001 is still the student ID. XX10002 is sex, with 0 for male and 1 for female. XX10003 is Race/Ethnicity with 1 for NHW, 2 for NHB and 3 for Hisp, and XX10004 is age.
Transmute essentially combines mutate and select.
Ifelse returns a value dependent on a test that you specify. Remember that if else likes ==,<=, >=, or !=.
dat <- dat %>% #This is how I like to write my code. lots of space and easier to read
transmute(
stuid = XX10001,
sex = ifelse(XX10002 == 1,"f","m"), #here you can see that if XX0001 is 1 we get f otherwise we get m
race_eth = ifelse(XX10003 == 1, "nhw", # here I nest ifelse statements to be able to account for three categories
ifelse(XX10003 == 2, "nhb","hisp")), # you should have the same number of closing parenthesis as you have ifelses
age =XX10004
)
print(dat)
## stuid sex race_eth age
## 1 a001 f nhw 15
## 2 a002 f nhw 16
## 3 a003 m nhw 15
## 4 a004 f nhb 15
## 5 a005 m hisp 16
Filter is a great way of only looking at the data you wish to see. It is very similar to subset.
For example, if we were interested in only viewing female data in the data frame dat:
datf <- dat %>%
filter(sex == "f")
print(datf)
## stuid sex race_eth age
## 1 a001 f nhw 15
## 2 a002 f nhw 16
## 3 a004 f nhb 15
Arrange is a useful function when you want to reorder your data. For example, suppose we wanted to see the data arranged from youngest to oldest, female first then male
datarr <- dat %>%
arrange(age,sex) # notice that R works in alphabetical order
print(datarr)
## stuid sex race_eth age
## 1 a001 f nhw 15
## 2 a004 f nhb 15
## 3 a003 m nhw 15
## 4 a002 f nhw 16
## 5 a005 m hisp 16
datarr <- dat %>%
arrange(age, desc(sex)) # if we wanted to see males first we would instruct R to show that variable in descending order thusly
print(datarr)
## stuid sex race_eth age
## 1 a003 m nhw 15
## 2 a001 f nhw 15
## 3 a004 f nhb 15
## 4 a005 m hisp 16
## 5 a002 f nhw 16
Please see here for a detailed account of joins. (hold control or command and click on link)
Sometimes the data we want to analyze exists on more than one data frame. So we must learn how to combine data frames. There are many, many ways of doing this in R, the following is one of those ways that I find intuitive and easy to use.
To begin, there needs to be a variable in each of the data frames that is a unique identifier of each object in the data frame. In the above data frames that we’ve been working with, the variable stuid will be the key that we utilize to join tables together.
Suppose we are interested in looking at grades of the students by sex. While this may be relatively easy to see when we have just 5 objects in our data frames, imagine trying to accomplish this with 100,000 objects.
Recommend looking at help to see the details on these
datcombl <- dat %>%
left_join(dat2) #as long as the key variable has the exact same name in both data frames, R will figure it out, otherwise you must specify the variables that are the keys
## Joining, by = "stuid"
datcombr <- dat %>%
right_join(dat2)
## Joining, by = "stuid"
datcombi <- dat %>%
inner_join(dat2)
## Joining, by = "stuid"
datcombf <- dat %>%
full_join(dat2)
## Joining, by = "stuid"
print(datcombl)
## stuid sex race_eth age sem math hist
## 1 a001 f nhw 15 1 B A
## 2 a001 f nhw 15 2 A A
## 3 a002 f nhw 16 1 B A
## 4 a002 f nhw 16 2 C B
## 5 a003 m nhw 15 1 D B
## 6 a003 m nhw 15 2 B C
## 7 a004 f nhb 15 1 C C
## 8 a004 f nhb 15 2 A A
## 9 a005 m hisp 16 1 B A
## 10 a005 m hisp 16 2 A A
#print(datcombr)
#print(datcombi)
#print(datcombf)
Next, we’ll take a look at an example using some data from the ACS (American Community Survey).
I pulled the data for this lab from IPUMS. This is from the 2015 5 year ACS. It is a fairly large file, so once it is downloaded, I’ll take a random sample of 100,000 to make it easier to execute code with.
Suppose I was interested in seeing a breakdown by race (nhw, nhb, and hisp) and education (less than high school and at least high school) for those individuals at or above the age of 25
dat <- read_csv("2015_5yr_ACS_4oct.csv")# That's a big file. It takes time
## Parsed with column specification:
## cols(
## .default = col_double()
## )
## See spec(...) for full column specifications.
#this size data is fairy unwieldy for this level of instruction.
#tab2 <- table(dat$HISPAND)
#print(tab2)
set.seed(1230) # here I set a seed so we all have the same data in our reduced data frame
datred <-dat[sample(nrow(dat),100000),] # Here I take a random sample of 100,000 out of our 15.6 million individuals
#view(datred)
#tab3 <- table(datred$HISPAND)
#print(tab3)
dat1 <- datred %>%
transmute(
race = ifelse(RACE == 1 & HISPAN == 0, 'nhw',
ifelse(RACE == 2 & HISPAN == 0, 'nhb',
ifelse(HISPAN >= 1 & HISPAN <= 4, 'hisp', NA))),# race category with three categories
educ = ifelse(EDUC <= 5 & EDUC >= 1, 'lths',
ifelse(EDUC >= 6, 'hs', NA)), # create an education variable to see less than high school and at least high school
age = AGE, #I don't like seeing caps in variables when I code (personal preference)
sex = ifelse(SEX == 1, "m","f"),
statefip = STATEFIP,
countyfip = COUNTYFIP,
) %>%
filter( age >= 25) #Here I'm only interested in those at or above the age of 25
head(dat1)
## # A tibble: 6 x 6
## race educ age sex statefip countyfip
## <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 nhw hs 39 f 42 0
## 2 hisp lths 62 f 6 37
## 3 nhw hs 59 f 36 103
## 4 nhb hs 41 m 48 157
## 5 nhw hs 61 m 6 85
## 6 nhb hs 43 f 26 121
dat1 <- na.omit(dat1) # here I remove the rows that contain NA
head(dat1)
## # A tibble: 6 x 6
## race educ age sex statefip countyfip
## <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 nhw hs 39 f 42 0
## 2 hisp lths 62 f 6 37
## 3 nhw hs 59 f 36 103
## 4 nhb hs 41 m 48 157
## 5 nhw hs 61 m 6 85
## 6 nhb hs 43 f 26 121
tab1 <- table(dat1$educ, dat1$race) # this creates a 2x3 table
print(tab1)
##
## hisp nhb nhw
## hs 5389 5875 46298
## lths 2127 907 3300
tab1a <- round(prop.table(tab1,2), digits = 3)*100 # the 2 tells R to compute percentages by column, a 1 would do it by row
print(tab1a)
##
## hisp nhb nhw
## hs 71.7 86.6 93.3
## lths 28.3 13.4 6.7
#tab1b<- round(prop.table(tab1,1), digits = 3)*100
#print(tab1b)
Next I’m interested to see what the differences in incomes are between males and females in the US. I’m not sure why the code book didn’t contain the income categories but I’ll leave notes when I add them to our data frame. I’ve decided to categorize the individual income variable into 4 categories in an attempt learn more than just a comparison of the mean values.
dat2 <- datred %>%
filter(INCTOT > 0 & INCTOT < 9000000) %>% # to remove those who are unemployed and those NA answers (9999999)
transmute(
sex = ifelse(SEX == 1, "m","f"),
statefip = STATEFIP,
countyfip = COUNTYFIP,
inc = factor(ifelse(INCTOT <= 20000, "<$20k",
ifelse(INCTOT >= 20001 & INCTOT <= 60000, "$20k-$60k",
ifelse(INCTOT >= 60001 & INCTOT <= 120000, "$60k-$120k",
ifelse(INCTOT >= 120001,">$120k", NA))))),
age = AGE
) %>%
filter(age >= 18)
dat2 <- na.omit(dat2)
#looking at the number of individuals
tab2 <- table(dat2$inc, dat2$sex)
tab2 <- tab2[c(1,3,4,2),] # here I reorder the rows so they make sense
kable(tab2) # kable is a way of making table, or making basic tables look better (compare the output with the previous example). This is just an introduction
| f | m | |
|---|---|---|
| <$20k | 16840 | 10605 |
| $20k-$60k | 14227 | 14715 |
| $60k-$120k | 3913 | 6412 |
| >$120k | 911 | 2813 |
#looking at the percentages
tab2a <- round(prop.table(tab2,1), digits = 3)*100
kable(tab2a)
| f | m | |
|---|---|---|
| <$20k | 61.4 | 38.6 |
| $20k-$60k | 49.2 | 50.8 |
| $60k-$120k | 37.9 | 62.1 |
| >$120k | 24.5 | 75.5 |