Background
This program reads in the SAS dataset provided by the Census Bureau for the most recent wave of the Household Pulse Survey. Unfortunately, the dataset comes with no labels and no program for creating them, so you’ll have to do it yourself.
Labels are a useful way to look at data because they retain the underlying value, number, or order while showing you the words. In R, this is done through factor variables. In SAS and other statistical programs, it’s done through labels. They do much the same thing, and we’ll work with factors here.
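To make the idea concrete, here is a minimal sketch using made-up codes rather than the survey data. The vector `codes` and its "Yes"/"No" labels are hypothetical. It shows that a factor prints words while keeping an integer underneath, and that a value with no defined level becomes NA.

```r
# Hypothetical numeric codes, as they might arrive from a SAS file
codes <- c(1, 2, 2, 1, 3)

# Only 1 and 2 are given labels; 3 has no level defined, so it becomes NA
answer <- factor(codes, levels = c(1, 2), labels = c("Yes", "No"))

print(answer)              # shows the words
# Integer position of each level (equal to the original codes here
# only because the levels happen to be 1 and 2)
print(as.integer(answer))
print(table(answer, useNA = "ifany"))
```

The NA for the unlisted code is what protects you from typos in the levels: anything you didn't define drops out instead of silently getting a label.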
You may need to install a few packages before this will run.
library(knitr) # to get some options set
library(rmdformats) # to use nicer setups.
knitr::opts_chunk$set(echo = TRUE,
message=FALSE,
error=TRUE,
eval=TRUE)
library(tidyverse) # the usual
library(janitor)
library(lubridate)
library(forcats) # part of the tidyverse, for working with categorical data
library(haven) # for using our saved SAS data
library(pollster) # for getting crosstabs with error margins
library(reactable) # for formatting numbers in tables.
Read raw data
Using the haven package, load the downloaded SAS data. We don’t need to deal with replicate weights – those are for re-estimating the margins of error. Note that this file is saved in a subfolder called sasdata/HPS_Week29_PUF_SAS. If you saved it elsewhere, you should change this.
raw_sas_hps <- read_sas("sasdata/HPS_Week29_PUF_SAS/pulse2021_puf_29.sas7bdat")
colnames(raw_sas_hps)
## [1] "WEEK" "RECVDVACC" "DOSES" "GETVACRV"
## [5] "WHYNOT1" "WHYNOT2" "WHYNOT3" "WHYNOT4"
## [9] "WHYNOT5" "WHYNOT6" "WHYNOT7" "WHYNOT8"
## [13] "WHYNOT9" "WHYNOT10" "WHYNOT11" "WHYNOTB1"
## [17] "WHYNOTB2" "WHYNOTB3" "WHYNOTB4" "WHYNOTB5"
## [21] "WHYNOTB6" "HADCOVID" "MS" "WRKLOSSRV"
## [25] "EXPCTLOSS" "ANYWORK" "KINDWORK" "RSNNOWRKRV"
## [29] "CURFOODSUF" "FOODRSNRV1" "FOODRSNRV2" "FOODRSNRV3"
## [33] "FOODRSNRV4" "FREEFOOD" "ANXIOUS" "WORRY"
## [37] "INTEREST" "DOWN" "HLTHINS1" "HLTHINS2"
## [41] "HLTHINS3" "HLTHINS4" "HLTHINS5" "HLTHINS6"
## [45] "HLTHINS7" "HLTHINS8" "DELAY" "NOTGET"
## [49] "TENURE" "MORTCONF" "TEACH1" "TEACH2"
## [53] "TEACH3" "TEACH4" "TEACH5" "TEACH6"
## [57] "TEACH7" "TEACH8" "COMPAVAIL" "INTRNTAVAIL"
## [61] "INTRNTRV1" "INTRNTRV2" "INTRNTRV3" "INTRNTRV4"
## [65] "SCHLHRS" "INCOME" "CHILDFOOD" "SPNDSRC1"
## [69] "SPNDSRC2" "SPNDSRC3" "SPNDSRC4" "SPNDSRC5"
## [73] "SPNDSRC6" "SPNDSRC7" "SPNDSRC8" "SPNDSRC9"
## [77] "EIP_YN" "EIPSPND1" "EIPSPND2" "EIPSPND3"
## [81] "EIPSPND4" "EIPSPND5" "EIPSPND6" "EIPSPND7"
## [85] "EIPSPND8" "EIPSPND9" "EIPSPND10" "EIPSPND11"
## [89] "EIPSPND12" "EIPSPND13" "TW_YN" "TW_COV"
## [93] "UI_APPLYRV" "UI_RECVRV" "SSA_RECV" "SSA_APPLYRV"
## [97] "SSAPGMRV1" "SSAPGMRV2" "SSAPGMRV3" "SSAPGMRV4"
## [101] "SSAPGMRV5" "SSALIKELYRV" "SSAEXPCT1" "SSAEXPCT2"
## [105] "SSAEXPCT3" "SSAEXPCT4" "SSAEXPCT5" "SSADECISN"
## [109] "EXPNS_DIF" "WHYCHNGD1" "WHYCHNGD2" "WHYCHNGD3"
## [113] "WHYCHNGD4" "WHYCHNGD5" "WHYCHNGD6" "WHYCHNGD7"
## [117] "WHYCHNGD8" "WHYCHNGD9" "WHYCHNGD10" "WHYCHNGD11"
## [121] "WHYCHNGD12" "WHYCHNGD13" "FEWRTRIP1" "FEWRTRIP2"
## [125] "FEWRTRIP3" "FEWRTRANS" "PLNDTRIPS" "SNAP_YN"
## [129] "PRESCRIPT" "MH_SVCS" "MH_NOTGET" "LIVQTRRV"
## [133] "RENTCUR" "MORTCUR" "EVICT" "FORCLOSE"
## [137] "PSCHNG1" "PSCHNG2" "PSCHNG3" "PSCHNG4"
## [141] "PSCHNG5" "PSCHNG6" "PSCHNG7" "PSWHYCHG1"
## [145] "PSWHYCHG2" "PSWHYCHG3" "PSWHYCHG4" "PSWHYCHG5"
## [149] "PSWHYCHG6" "PSWHYCHG7" "PSWHYCHG8" "PSWHYCHG9"
## [153] "ACTVDUTY1" "ACTVDUTY2" "ACTVDUTY3" "ACTVDUTY4"
## [157] "ACTVDUTY5" "COVPRVNT" "WKVOL" "SETTING"
## [161] "UI_RECVNOW" "EIPRV" "CHNGSHOP1" "CHNGSHOP2"
## [165] "CHNGSHOP3" "CHNGSVCS1" "CHNGSVCS2" "CHNGSVCS3"
## [169] "CHNGSHP1ML" "CHNGSHP2ML" "CHNGSHP3ML" "CHNGSVC1ML"
## [173] "CHNGSVC2ML" "CHNGSVC3ML" "CASHUSE" "PRVRIDESHR"
## [177] "TELEHLTH" "TELECHLD" "PRVNTIVE" "PRVNTWHY1"
## [181] "PRVNTWHY2" "PRVNTWHY3" "PRVNTWHY4" "PRVNTWHY5"
## [185] "PRVNTWHY6" "PRVNTWHY7" "SEEING" "HEARING"
## [189] "REMEMBERING" "MOBILITY" "ENROLLNONE" "HYBRID"
## [193] "SCHLFOOD" "SCHLFDHLP1" "SCHLFDHLP2" "SCHLFDHLP3"
## [197] "SCHLFDHLP4" "CHLDCARE" "CHLDIMPCT1" "CHLDIMPCT2"
## [201] "CHLDIMPCT3" "CHLDIMPCT4" "CHLDIMPCT5" "CHLDIMPCT6"
## [205] "CHLDIMPCT7" "CHLDIMPCT8" "CHLDIMPCT9" "ENRPUBCHK"
## [209] "ENRPRVCHK" "ENRHMSCHK" "ABIRTH_YEAR" "AGENDER"
## [213] "EGENDER" "AHISPANIC" "ARACE" "AHHLD_NUMPER"
## [217] "AHHLD_NUMKID" "RRACE" "RHISPANIC" "AEDUC"
## [221] "EEDUC" "THHLD_NUMPER" "THHLD_NUMKID" "TSPNDFOOD"
## [225] "TSPNDPRPD" "TNUM_PS" "TENROLLPUB" "TENROLLPRV"
## [229] "TENROLLHMSCH" "TBIRTH_YEAR" "THHLD_NUMADLT" "SCRAM"
## [233] "EST_ST" "EST_MSA" "PWEIGHT" "HWEIGHT"
## [237] "PRIVHLTH" "PUBHLTH" "REGION"
Factor labels example
Unfortunately, the Census Bureau has not provided translations for each of the variables. That means you have to do some recoding yourself. One way to do this is with factors, which take the numeric information and convert it to words, retaining the underlying values and making sure that you don’t have illegal labels. Here’s an example with a couple of the variables:
raw_sas_hps %>%
  select(EGENDER, PWEIGHT) %>%
  mutate(gender = factor(EGENDER, levels = c(1, 2), labels = c("Male", "Female"))) %>%
  count(gender, EGENDER, wt = PWEIGHT)
Recode example
Here’s an example of recoding the values for vaccines. The idea is that they have several questions, each of which could be true or false. The logic goes like this:
- Did you receive a vaccination? 1 (Yes) or 2 (No)
- Have you received or do you expect to get all doses? 1 or 2, only asked of those that had a vaccination. (1=Yes, 2=No)
- Do you intend to get a vaccine? 1 (Definitely) … 3 (Unsure) … 5 (Definitely NOT). Only asked of those who said No to the first question.
- Why not? (WHYNOT1-WHYNOT11) A series of questions only asked of those who may not get it: either they answered 2 to the “all doses” question, or they didn’t say “definitely” to the previous question.
- Why don’t you think you need it? (WHYNOTB1-WHYNOTB6) Only asked of those who selected WHYNOT3.
Any of these can also be -99, which means “Seen but not selected”.
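Because -99 is a code rather than a real answer, one option is to convert it to NA before testing whether a reason was selected. This is a minimal base R sketch on made-up values. The vector name `whynot1` is only for illustration and is not read from the survey file.

```r
# Made-up responses: 1 = selected, -99 = seen but not selected, NA = not asked
whynot1 <- c(1, -99, -99, 1, NA)

# Turn the -99 code into a missing value
whynot1_clean <- replace(whynot1, whynot1 %in% c(-99), NA)

# A 0/1 "selected" flag; %in% treats NA as "not selected" instead of returning NA
selected <- as.integer(whynot1_clean %in% 1)
print(selected)
```

Whether "seen but not selected" and "never asked" should be treated the same way is a judgment call that depends on the question you're answering.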
So you have to work out the logic and then decide which of the answers is most important. Note that it’s quite possible that they will answer more than one reason, so you have to decide on a hierarchy of which one is most important. Let’s take them one at a time:
1 = I'm concerned about side effects
2 = I don't know it will work
3 = I don't think I need it.
4 = I don't like vaccines
5 = Doctor hasn't recommended it
6 = I want to wait to see if it's safe
7 = I think others need it more right now
8 = Cost
9 = Don't trust the vaccine
10 = Don't trust the government
11 = Other
B1 = If don't need, already had COVID
B2 = I'm not member of high-risk group
B3 = Same, Plan to use masks instead
B4 = Don't believe COVID is a serious illness
B5 = Don't think vaccines are beneficial
B6 = Other reason not to get it.
So say we want to turn this into a one-answer question, we might use this logic:
If you got one or plan to get both, then Yes to vaccinated.
If you don’t plan to get one or both:
Want to get = Planned.
Don’t plan to get:
- Already had COVID (B1)
- Want to wait / doctor not recommended / don’t know it will work
- Side effects / don’t like vaccines / Don’t think vaccines are beneficial / wait for safe
- Don’t think I need it. (other than already had)
- Others need it more / cost (suggests they’d want it if they could/ using other precautions)
- Don’t trust vaccine / government
- Everyone else who didn’t get it or didn’t say why.
One way to do this is to assign numeric codes first, then translate them to words. Here is an example. (NOTE: rowwise() is really slow, so this takes longer than you’d expect. It was used so I didn’t need to write a bunch of OR conditions, but it does slow things down. It implicitly groups by each row.)
vaccines <-
  raw_sas_hps %>%
  rowwise() %>%
  mutate(new_covid =
    case_when(
      RECVDVACC == 1 & DOSES == 1 ~ 10, # already got it
      GETVACRV == 1 ~ 20, # will get it
      WHYNOTB1 == 1 ~ 31, # already had covid
      1 %in% c(WHYNOT5, WHYNOT6, WHYNOT2, WHYNOTB6) ~ 32, # wait / doctor / work / other reasons don't need
      1 %in% c(WHYNOT1, WHYNOT4, WHYNOT6, WHYNOTB5) ~ 33, # STILL NEED REST OF 3
      1 %in% c(WHYNOT7, WHYNOT8) ~ 34, # cost, others need more
      1 %in% c(WHYNOT9, WHYNOT10, WHYNOTB4) ~ 35, # mistrust of vaccine / government
      GETVACRV >= 2 ~ 39 # other reasons not getting it (including those that didn't say anything)
    )) %>%
  # ungroup() removes the implicit grouping by row that rowwise() adds.
  # SUPER IMPORTANT - ALL THE ANSWERS WILL BE WRONG IF YOU DON'T UNGROUP
  ungroup() %>%
  mutate(covid_as_factor = factor(new_covid,
    levels = c(10, 20, 31, 32, 33, 34, 35, 39),
    labels = c("Getting or got vaccinated",
               "Plan to get vaccinated",
               "Already had COVID",
               "Want to wait (doctor, see if it works, etc.)",
               "Side effects, vaccine fear, etc.",
               "Cost / others need it more",
               "Mistrust government and vaccine",
               "Other reasons for not getting it"))
  ) %>%
  select(SCRAM, PWEIGHT, new_covid, covid_as_factor, RECVDVACC:WHYNOTB6)
You can troubleshoot it by checking across variables if they should have answered:
vaccines %>%
  filter(is.na(new_covid)) %>%
  pivot_longer(cols = c(WHYNOT1:WHYNOTB6), names_to = "VARIABLE", values_to = "VALUE") %>%
  head()
So there are none left with no answer. I’m a little surprised that there are no answers in which there was no response, but they may have included this in the basic response. (Usually, at least a few people refuse to answer some questions. We’ll want to check on this.)
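The rowwise() step in the recode above was there to avoid writing out OR conditions. One possible alternative, sketched here on a small made-up data frame, is dplyr's if_any(), which tests several columns at once without grouping by row. The data frame `toy` and the helper columns `wait_group` and `cost_group` are hypothetical, only two of the reason groups are shown, and this assumes a reasonably recent version of dplyr.

```r
library(dplyr)

# Made-up rows that borrow a few of the survey's column names
toy <- data.frame(
  GETVACRV = c(3, 4, 5, 2),
  WHYNOT2  = c(1, -99, -99, -99),
  WHYNOT5  = c(-99, -99, -99, -99),
  WHYNOT7  = c(-99, 1, -99, -99),
  WHYNOT8  = c(-99, -99, -99, -99)
)

toy_coded <- toy %>%
  mutate(
    # if_any() is TRUE for a row when any of the listed columns equals 1
    wait_group = if_any(c(WHYNOT2, WHYNOT5), ~ .x %in% 1),
    cost_group = if_any(c(WHYNOT7, WHYNOT8), ~ .x %in% 1),
    # case_when() still takes the first condition that matches
    new_covid = case_when(
      wait_group    ~ 32,
      cost_group    ~ 34,
      GETVACRV >= 2 ~ 39
    )
  )

print(toy_coded$new_covid)
```

Because nothing is grouped, there is no ungroup() to forget. Using `%in% 1` rather than `== 1` means NA and -99 both count as not selected.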
Now we can do some crosstabs on the data, using both the raw number of responses and the weights.
Weighted data
The idea behind weighted data is that it adds up to a population. In each of these rows, there are two weights – one for people and another for households. I believe that only one respondent in each household was allowed, so there shouldn’t be any double-counting of people in households. (All questions are just about yourself.)
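As a rough illustration of what a weight does, this sketch compares a plain count with a weighted count on made-up rows. The data frame `toy` and its numbers are invented; only the column name PWEIGHT is borrowed from the survey.

```r
library(dplyr)

# Made-up respondents; each weight is the number of people that row stands for
toy <- data.frame(
  answer  = c("Yes", "Yes", "No", "No", "No"),
  PWEIGHT = c(1000, 3000, 500, 500, 1000)
)

unweighted <- count(toy, answer)               # number of rows per answer
weighted   <- count(toy, answer, wt = PWEIGHT) # sum of weights per answer

# Weighted share of the estimated population
weighted$pct <- 100 * weighted$n / sum(weighted$n)

print(unweighted)
print(weighted)
```

Here "No" has more rows but "Yes" represents more people, which is why the weighted and unweighted results can point in different directions.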
Now you can change all of them into words:
vaccines %>%
  count(covid_as_factor, wt = PWEIGHT)
Here, I’m using a shortcut for group by and summarise: the pollster package. It has commands for frequencies for one variable (topline), two variables (crosstab) and three variables (crosstab_3way) that are common when using weighted data in polls. topline includes missing values; moe_topline removes them but gives you a margin of error. I don’t know how to get rid of the extra “Valid Percent” column - it seems to break the function.
#topline(vaccines, covid_as_factor, weight = PWEIGHT)
moe_topline(vaccines, covid_as_factor, weight = PWEIGHT) %>%
  select(-"Valid Percent") %>%
  reactable(
    compact = TRUE,
    # default: show every numeric column as a percentage with one decimal
    defaultColDef = colDef(format = colFormat(suffix = "%", digits = 1)),
    # override for the population estimate: whole numbers with thousands separators
    columns = list("Frequency" = colDef(name = "Pop estimate", format = colFormat(separators = TRUE, prefix = "", digits = 0)))
  )