For this homework assignment, we must first load and clean wage data from IPUMS. The survey data is hosted on my professor’s GitHub account.
#First, let's load some libraries we will need
library(haven)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Now we load the ACS Microdata
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")
As pointed out in class, not every value of the “incwage” variable is a real wage. The codes “999999” and “999998” mean “N/A” and “Missing”, respectively. I’ll recode both to NA and call the cleaned variable “mywage”.
#the original values give us a weird mean for incwage
paste("original $incwage mean value:",round(mean(ipums$incwage,na.rm = TRUE),digits=2),sep=" ")
## [1] "original $incwage mean value: 205672.4"
#So, we recode 999999 and 999998 as "NA"
ipums<-ipums %>% mutate(mywage=ifelse(incwage %in% c(999998,999999),NA, incwage))
#much better!
paste("new $mywage mean value:", round(mean(ipums$mywage,na.rm = TRUE),digits=2),sep=" ")
## [1] "new $mywage mean value: 27489.69"
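To see why two sentinel codes can distort the mean so badly, here is a minimal sketch on made-up numbers. The vector `toy_wage` and its values are hypothetical and are not taken from the ACS file; only the codes 999998 and 999999 come from the text above. It uses base R so it runs without any packages.

```r
# Hypothetical toy wages; 999999 and 999998 stand in for the IPUMS sentinel codes
toy_wage <- c(0, 12000, 35000, 999999, 48000, 999998)

# Recode the sentinel codes to NA, mirroring the ifelse() call used for mywage
toy_clean <- ifelse(toy_wage %in% c(999998, 999999), NA, toy_wage)

mean(toy_wage)                 # pulled far upward by the two sentinel codes
mean(toy_clean, na.rm = TRUE)  # mean of the four real values only
sum(is.na(toy_clean))          # number of values that were recoded
```

Checking `sum(is.na(...))` after a recode like this is a quick way to confirm how many observations were treated as missing.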
Now, we will estimate the mean, median, standard deviation and sample size of the cleaned data.
ipums %>%
  summarise(newmedian = median(mywage, na.rm = TRUE),
            newmean = mean(mywage, na.rm = TRUE),
            newsd = sd(mywage, na.rm = TRUE),
            sample_size = n())
## # A tibble: 1 x 4
## newmedian newmean newsd sample_size
## <dbl> <dbl> <dbl> <int>
## 1 7000 27489.69 50665.1 300552
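One thing to keep in mind about the sample size: `n()` counts every row, including rows where mywage is NA, while the median, mean, and standard deviation drop those rows through `na.rm`. The sketch below shows the difference on a hypothetical toy vector (not the ACS data), written in base R. Inside a dplyr `summarise()` the equivalent count of usable values would be `sum(!is.na(mywage))`.

```r
# Hypothetical cleaned wage vector with two missing values
toy_mywage <- c(7000, NA, 25000, 0, NA)

n_rows  <- length(toy_mywage)       # what n() would report: all rows
n_valid <- sum(!is.na(toy_mywage))  # rows that actually enter the mean and sd

c(n_rows = n_rows, n_valid = n_valid)
```

Reporting both numbers makes it clear how many observations the statistics are really based on.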
Here, we will calculate the same summary statistics for adults (over age 25) in the labor force, by education and sex.
#according to the IPUMS codebook, "2" in the variable "labforce" means the person is in the labor force
#in the variable "sex", "1" is "male" and "2" is "female"
ipums %>%
  mutate(mywage = ifelse(incwage %in% c(999998, 999999), NA, incwage)) %>%
  filter(labforce == 2, age > 25) %>%
  mutate(sexrecode = ifelse(sex == 1, "Male", "Female")) %>%
  #collapse the detailed education codes into four levels plus "missing"
  mutate(edurec = case_when(educd %in% c(0:61) ~ "nohs",
                            educd %in% c(62:64) ~ "hs",
                            educd %in% c(65:100) ~ "somecoll",
                            educd %in% c(101:116) ~ "collgrad",
                            educd == 999 ~ "missing")) %>%
  group_by(edurec, sexrecode) %>%
  summarise(newmedian = median(mywage, na.rm = TRUE),
            newmean = mean(mywage, na.rm = TRUE),
            newsd = sd(mywage, na.rm = TRUE),
            sample_size = n())
## # A tibble: 8 x 6
## # Groups: edurec [?]
## edurec sexrecode newmedian newmean newsd sample_size
## <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 collgrad Female 49500 58349.99 57186.03 22975
## 2 collgrad Male 70000 93302.39 98629.45 23391
## 3 hs Female 21900 25729.43 26523.98 13199
## 4 hs Male 33000 38268.80 37317.79 17005
## 5 nohs Female 15000 18361.46 23312.47 3724
## 6 nohs Male 23000 28374.60 33113.55 6261
## 7 somecoll Female 28000 33058.59 31811.67 18914
## 8 somecoll Male 40550 48672.07 47372.76 18670
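For comparison, the same kind of education recode and grouped median can be sketched in base R with `cut()` and `tapply()`. This is an alternative technique, not the method used above, and the data frame `toy` with its values is hypothetical. Only the cut points (61, 64, 100, 116) and the code 999 come from the recode in this homework.

```r
# Hypothetical toy data; educd values chosen to hit each education group
toy <- data.frame(
  educd = c(30, 63, 63, 71, 101, 114, 999),
  wage  = c(15000, 22000, 30000, 28000, 50000, 70000, 40000)
)

# cut() uses right-closed intervals: (-1,61], (61,64], (64,100], (100,116]
toy$edurec <- cut(toy$educd,
                  breaks = c(-1, 61, 64, 100, 116),
                  labels = c("nohs", "hs", "somecoll", "collgrad"))

# Codes outside every interval (here 999) come back as NA, so label them explicitly
toy$edurec <- as.character(toy$edurec)
toy$edurec[is.na(toy$edurec)] <- "missing"

# Median wage per education group
toy_medians <- tapply(toy$wage, toy$edurec, median)
toy_medians
```

Unlike `case_when()`, `cut()` needs numeric break points, so it only works when the codes for each group form one contiguous range, as they do here.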
This completes my homework for the week. Thank you!