Exercise

  1. The Household Pulse Survey was a 20-minute online survey that measures how emergent social and economic issues impacted households across the country.” Find and download the Household Pulse Survey (HPS) Public Use File from June 23-July 5 2021 (Week 33). This dataset is readily available from the US Census Bureau website.

  2. How many person-level observations does this data set contain? Are there any particular inclusion criteria you note from the codebook?

    The data contains 66262 person-level observations.

    From the codebook, the main inclusion criterion is that respondents must be born before 2003 and reside in a US household in one of the 50 states or Washington D.C.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read_csv("pulse2021_puf_33.csv")
Rows: 66262 Columns: 239
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (2): SCRAM, EST_ST
dbl (237): WEEK, EST_MSA, REGION, HWEIGHT, PWEIGHT, TBIRTH_YEAR, ABIRTH_YEAR...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(data)
spc_tbl_ [66,262 × 239] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ SCRAM        : chr [1:66262] "V330000001S30010049000112" "V330000001S33010328200122" "V330000001S37010166600122" "V330000001S37010752600122" ...
 $ WEEK         : num [1:66262] 33 33 33 33 33 33 33 33 33 33 ...
 $ EST_ST       : chr [1:66262] "01" "01" "01" "01" ...
 $ EST_MSA      : num [1:66262] NA NA NA NA NA NA NA NA NA NA ...
 $ REGION       : num [1:66262] 2 2 2 2 2 2 2 4 4 4 ...
 $ HWEIGHT      : num [1:66262] 5585 711 589 951 7459 ...
 $ PWEIGHT      : num [1:66262] 15635 1991 1649 887 6960 ...
 $ TBIRTH_YEAR  : num [1:66262] 1987 1952 1975 1968 1956 ...
 $ ABIRTH_YEAR  : num [1:66262] 2 2 2 2 2 2 2 2 2 2 ...
 $ EGENDER      : num [1:66262] 1 1 2 2 2 1 2 1 1 1 ...
 $ AGENDER      : num [1:66262] 2 2 2 2 2 2 2 2 2 2 ...
 $ RHISPANIC    : num [1:66262] 2 1 1 1 1 1 1 1 1 1 ...
 $ AHISPANIC    : num [1:66262] 2 2 2 2 2 2 2 2 2 2 ...
 $ RRACE        : num [1:66262] 4 1 2 1 1 1 1 1 4 1 ...
 $ ARACE        : num [1:66262] 2 2 2 2 2 2 2 2 2 2 ...
 $ EEDUC        : num [1:66262] 4 7 6 7 3 6 7 6 6 6 ...
 $ AEDUC        : num [1:66262] 2 2 2 2 2 2 2 2 2 2 ...
 $ MS           : num [1:66262] 1 1 3 1 3 1 1 3 5 5 ...
 $ THHLD_NUMPER : num [1:66262] 4 3 3 3 1 5 2 1 1 3 ...
 $ AHHLD_NUMPER : num [1:66262] 2 2 2 2 2 2 2 2 2 1 ...
 $ THHLD_NUMKID : num [1:66262] 1 0 0 2 0 0 0 0 0 0 ...
 $ AHHLD_NUMKID : num [1:66262] 2 2 2 2 2 2 2 2 2 2 ...
 $ THHLD_NUMADLT: num [1:66262] 3 3 3 1 1 5 2 1 1 3 ...
 $ ACTVDUTY1    : num [1:66262] 1 1 1 1 1 1 -99 1 1 1 ...
 $ ACTVDUTY2    : num [1:66262] -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 ...
 $ ACTVDUTY3    : num [1:66262] -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 ...
 $ ACTVDUTY4    : num [1:66262] -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 ...
 $ ACTVDUTY5    : num [1:66262] -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 ...
 $ RECVDVACC    : num [1:66262] 1 1 1 1 2 1 1 1 2 1 ...
 $ DOSES        : num [1:66262] 1 1 1 1 -88 1 1 1 -88 1 ...
 $ GETVACRV     : num [1:66262] -88 -88 -88 -88 4 -88 -88 -88 4 -88 ...
 $ WHYNOT1      : num [1:66262] -88 -88 -88 -88 1 -88 -88 -88 -99 -88 ...
 $ WHYNOT2      : num [1:66262] -88 -88 -88 -88 1 -88 -88 -88 -99 -88 ...
 $ WHYNOT3      : num [1:66262] -88 -88 -88 -88 -99 -88 -88 -88 -99 -88 ...
 $ WHYNOT4      : num [1:66262] -88 -88 -88 -88 -99 -88 -88 -88 -99 -88 ...
 $ WHYNOT5      : num [1:66262] -88 -88 -88 -88 -99 -88 -88 -88 1 -88 ...
 $ WHYNOT6      : num [1:66262] -88 -88 -88 -88 1 -88 -88 -88 -99 -88 ...
 $ WHYNOT7      : num [1:66262] -88 -88 -88 -88 -99 -88 -88 -88 -99 -88 ...
 $ WHYNOT8      : num [1:66262] -88 -88 -88 -88 -99 -88 -88 -88 -99 -88 ...
 $ WHYNOT9      : num [1:66262] -88 -88 -88 -88 -99 -88 -88 -88 -99 -88 ...
 $ WHYNOT10     : num [1:66262] -88 -88 -88 -88 1 -88 -88 -88 -99 -88 ...
 $ WHYNOT11     : num [1:66262] -88 -88 -88 -88 -99 -88 -88 -88 1 -88 ...
 $ WHYNOTB1     : num [1:66262] -88 -88 -88 -88 -88 -88 -88 -88 -88 -88 ...
 $ WHYNOTB2     : num [1:66262] -88 -88 -88 -88 -88 -88 -88 -88 -88 -88 ...
 $ WHYNOTB3     : num [1:66262] -88 -88 -88 -88 -88 -88 -88 -88 -88 -88 ...
 $ WHYNOTB4     : num [1:66262] -88 -88 -88 -88 -88 -88 -88 -88 -88 -88 ...
 $ WHYNOTB5     : num [1:66262] -88 -88 -88 -88 -88 -88 -88 -88 -88 -88 ...
 $ WHYNOTB6     : num [1:66262] -88 -88 -88 -88 -88 -88 -88 -88 -88 -88 ...
 $ COVPRVNT     : num [1:66262] 2 1 2 2 -88 1 1 1 -88 1 ...
 $ HADCOVID     : num [1:66262] 2 2 2 1 2 2 1 2 2 2 ...
 $ WRKLOSSRV    : num [1:66262] 1 2 1 2 2 2 -99 1 2 1 ...
 $ EXPCTLOSS    : num [1:66262] 1 2 2 2 2 2 2 1 2 1 ...
 $ ANYWORK      : num [1:66262] 1 2 1 1 1 1 2 1 1 2 ...
 $ KINDWORK     : num [1:66262] 4 -88 2 1 2 2 -88 4 1 -88 ...
 $ RSNNOWRKRV   : num [1:66262] -88 7 -88 -88 -88 -88 7 -88 -88 12 ...
 $ TW_YN        : num [1:66262] 1 2 1 1 2 2 2 2 2 2 ...
 $ TW_COV       : num [1:66262] 1 -88 1 1 -88 -88 -88 -88 -88 -88 ...
 $ WKVOL        : num [1:66262] 1 1 2 1 1 2 2 2 1 1 ...
 $ SETTING      : num [1:66262] 15 2 -88 1 4 -88 -88 -88 16 16 ...
 $ UI_APPLYRV   : num [1:66262] 1 2 2 2 2 2 2 2 2 1 ...
 $ UI_RECVRV    : num [1:66262] 1 2 2 2 2 2 2 2 2 1 ...
 $ UI_RECVNOW   : num [1:66262] 2 -88 -88 -88 -88 -88 -88 -88 -88 1 ...
 $ SSA_RECV     : num [1:66262] 2 1 2 2 1 2 1 1 2 2 ...
 $ SSA_APPLYRV  : num [1:66262] 1 2 2 2 2 2 2 2 2 2 ...
 $ SSAPGMRV1    : num [1:66262] -99 -88 -88 -88 -88 -88 -88 -88 -88 -88 ...
 $ SSAPGMRV2    : num [1:66262] -99 -88 -88 -88 -88 -88 -88 -88 -88 -88 ...
 $ SSAPGMRV3    : num [1:66262] -99 -88 -88 -88 -88 -88 -88 -88 -88 -88 ...
 $ SSAPGMRV4    : num [1:66262] -99 -88 -88 -88 -88 -88 -88 -88 -88 -88 ...
 $ SSAPGMRV5    : num [1:66262] 1 -88 -88 -88 -88 -88 -88 -88 -88 -88 ...
 $ SSALIKELYRV  : num [1:66262] -88 -88 4 4 -88 4 -88 -88 4 4 ...
 $ SSAEXPCT1    : num [1:66262] -88 -88 -88 -88 -88 -88 -88 -88 -88 -88 ...
 $ SSAEXPCT2    : num [1:66262] -88 -88 -88 -88 -88 -88 -88 -88 -88 -88 ...
 $ SSAEXPCT3    : num [1:66262] -88 -88 -88 -88 -88 -88 -88 -88 -88 -88 ...
 $ SSAEXPCT4    : num [1:66262] -88 -88 -88 -88 -88 -88 -88 -88 -88 -88 ...
 $ SSAEXPCT5    : num [1:66262] -88 -88 -88 -88 -88 -88 -88 -88 -88 -88 ...
 $ SSADECISN    : num [1:66262] 3 -88 1 1 -88 1 -88 -88 1 1 ...
 $ EIP_YN       : num [1:66262] 2 2 1 2 1 2 2 2 2 2 ...
 $ EIPRV        : num [1:66262] -88 -88 3 -88 1 -88 -88 -88 -88 -88 ...
 $ EIPSPND1     : num [1:66262] -88 -88 -99 -88 1 -88 -88 -88 -88 -88 ...
 $ EIPSPND2     : num [1:66262] -88 -88 -99 -88 -99 -88 -88 -88 -88 -88 ...
 $ EIPSPND3     : num [1:66262] -88 -88 -99 -88 -99 -88 -88 -88 -88 -88 ...
 $ EIPSPND4     : num [1:66262] -88 -88 -99 -88 1 -88 -88 -88 -88 -88 ...
 $ EIPSPND5     : num [1:66262] -88 -88 -99 -88 -99 -88 -88 -88 -88 -88 ...
 $ EIPSPND6     : num [1:66262] -88 -88 -99 -88 -99 -88 -88 -88 -88 -88 ...
 $ EIPSPND7     : num [1:66262] -88 -88 1 -88 -99 -88 -88 -88 -88 -88 ...
 $ EIPSPND8     : num [1:66262] -88 -88 1 -88 -99 -88 -88 -88 -88 -88 ...
 $ EIPSPND9     : num [1:66262] -88 -88 -99 -88 -99 -88 -88 -88 -88 -88 ...
 $ EIPSPND10    : num [1:66262] -88 -88 -99 -88 -99 -88 -88 -88 -88 -88 ...
 $ EIPSPND11    : num [1:66262] -88 -88 -99 -88 -99 -88 -88 -88 -88 -88 ...
 $ EIPSPND12    : num [1:66262] -88 -88 -99 -88 -99 -88 -88 -88 -88 -88 ...
 $ EIPSPND13    : num [1:66262] -88 -88 -99 -88 -99 -88 -88 -88 -88 -88 ...
 $ EXPNS_DIF    : num [1:66262] 2 1 2 1 2 1 1 4 1 2 ...
 $ CHNGSHOP1    : num [1:66262] 1 2 1 2 1 2 2 1 2 2 ...
 $ CHNGSHOP2    : num [1:66262] 2 2 2 2 2 2 2 2 2 2 ...
 $ CHNGSHOP3    : num [1:66262] 1 2 1 2 1 2 2 1 2 2 ...
 $ CHNGSVCS1    : num [1:66262] 1 2 -99 2 1 2 2 2 2 2 ...
 $ CHNGSVCS2    : num [1:66262] 1 2 1 2 1 2 2 2 2 2 ...
 $ CHNGSVCS3    : num [1:66262] 1 2 1 2 2 2 2 2 2 2 ...
 $ CHNGSHP1ML   : num [1:66262] 1 -88 1 -88 -88 -88 -88 2 -88 -88 ...
  [list output truncated]
 - attr(*, "spec")=
  .. cols(
  ..   SCRAM = col_character(),
  ..   WEEK = col_double(),
  ..   EST_ST = col_character(),
  ..   EST_MSA = col_double(),
  ..   REGION = col_double(),
  ..   HWEIGHT = col_double(),
  ..   PWEIGHT = col_double(),
  ..   TBIRTH_YEAR = col_double(),
  ..   ABIRTH_YEAR = col_double(),
  ..   EGENDER = col_double(),
  ..   AGENDER = col_double(),
  ..   RHISPANIC = col_double(),
  ..   AHISPANIC = col_double(),
  ..   RRACE = col_double(),
  ..   ARACE = col_double(),
  ..   EEDUC = col_double(),
  ..   AEDUC = col_double(),
  ..   MS = col_double(),
  ..   THHLD_NUMPER = col_double(),
  ..   AHHLD_NUMPER = col_double(),
  ..   THHLD_NUMKID = col_double(),
  ..   AHHLD_NUMKID = col_double(),
  ..   THHLD_NUMADLT = col_double(),
  ..   ACTVDUTY1 = col_double(),
  ..   ACTVDUTY2 = col_double(),
  ..   ACTVDUTY3 = col_double(),
  ..   ACTVDUTY4 = col_double(),
  ..   ACTVDUTY5 = col_double(),
  ..   RECVDVACC = col_double(),
  ..   DOSES = col_double(),
  ..   GETVACRV = col_double(),
  ..   WHYNOT1 = col_double(),
  ..   WHYNOT2 = col_double(),
  ..   WHYNOT3 = col_double(),
  ..   WHYNOT4 = col_double(),
  ..   WHYNOT5 = col_double(),
  ..   WHYNOT6 = col_double(),
  ..   WHYNOT7 = col_double(),
  ..   WHYNOT8 = col_double(),
  ..   WHYNOT9 = col_double(),
  ..   WHYNOT10 = col_double(),
  ..   WHYNOT11 = col_double(),
  ..   WHYNOTB1 = col_double(),
  ..   WHYNOTB2 = col_double(),
  ..   WHYNOTB3 = col_double(),
  ..   WHYNOTB4 = col_double(),
  ..   WHYNOTB5 = col_double(),
  ..   WHYNOTB6 = col_double(),
  ..   COVPRVNT = col_double(),
  ..   HADCOVID = col_double(),
  ..   WRKLOSSRV = col_double(),
  ..   EXPCTLOSS = col_double(),
  ..   ANYWORK = col_double(),
  ..   KINDWORK = col_double(),
  ..   RSNNOWRKRV = col_double(),
  ..   TW_YN = col_double(),
  ..   TW_COV = col_double(),
  ..   WKVOL = col_double(),
  ..   SETTING = col_double(),
  ..   UI_APPLYRV = col_double(),
  ..   UI_RECVRV = col_double(),
  ..   UI_RECVNOW = col_double(),
  ..   SSA_RECV = col_double(),
  ..   SSA_APPLYRV = col_double(),
  ..   SSAPGMRV1 = col_double(),
  ..   SSAPGMRV2 = col_double(),
  ..   SSAPGMRV3 = col_double(),
  ..   SSAPGMRV4 = col_double(),
  ..   SSAPGMRV5 = col_double(),
  ..   SSALIKELYRV = col_double(),
  ..   SSAEXPCT1 = col_double(),
  ..   SSAEXPCT2 = col_double(),
  ..   SSAEXPCT3 = col_double(),
  ..   SSAEXPCT4 = col_double(),
  ..   SSAEXPCT5 = col_double(),
  ..   SSADECISN = col_double(),
  ..   EIP_YN = col_double(),
  ..   EIPRV = col_double(),
  ..   EIPSPND1 = col_double(),
  ..   EIPSPND2 = col_double(),
  ..   EIPSPND3 = col_double(),
  ..   EIPSPND4 = col_double(),
  ..   EIPSPND5 = col_double(),
  ..   EIPSPND6 = col_double(),
  ..   EIPSPND7 = col_double(),
  ..   EIPSPND8 = col_double(),
  ..   EIPSPND9 = col_double(),
  ..   EIPSPND10 = col_double(),
  ..   EIPSPND11 = col_double(),
  ..   EIPSPND12 = col_double(),
  ..   EIPSPND13 = col_double(),
  ..   EXPNS_DIF = col_double(),
  ..   CHNGSHOP1 = col_double(),
  ..   CHNGSHOP2 = col_double(),
  ..   CHNGSHOP3 = col_double(),
  ..   CHNGSVCS1 = col_double(),
  ..   CHNGSVCS2 = col_double(),
  ..   CHNGSVCS3 = col_double(),
  ..   CHNGSHP1ML = col_double(),
  ..   CHNGSHP2ML = col_double(),
  ..   CHNGSHP3ML = col_double(),
  ..   CHNGSVC1ML = col_double(),
  ..   CHNGSVC2ML = col_double(),
  ..   CHNGSVC3ML = col_double(),
  ..   CASHUSE = col_double(),
  ..   WHYCHNGD1 = col_double(),
  ..   WHYCHNGD2 = col_double(),
  ..   WHYCHNGD3 = col_double(),
  ..   WHYCHNGD4 = col_double(),
  ..   WHYCHNGD5 = col_double(),
  ..   WHYCHNGD6 = col_double(),
  ..   WHYCHNGD7 = col_double(),
  ..   WHYCHNGD8 = col_double(),
  ..   WHYCHNGD9 = col_double(),
  ..   WHYCHNGD10 = col_double(),
  ..   WHYCHNGD11 = col_double(),
  ..   WHYCHNGD12 = col_double(),
  ..   WHYCHNGD13 = col_double(),
  ..   SPNDSRC1 = col_double(),
  ..   SPNDSRC2 = col_double(),
  ..   SPNDSRC3 = col_double(),
  ..   SPNDSRC4 = col_double(),
  ..   SPNDSRC5 = col_double(),
  ..   SPNDSRC6 = col_double(),
  ..   SPNDSRC7 = col_double(),
  ..   SPNDSRC8 = col_double(),
  ..   SPNDSRC9 = col_double(),
  ..   FEWRTRIP1 = col_double(),
  ..   FEWRTRIP2 = col_double(),
  ..   FEWRTRIP3 = col_double(),
  ..   PRVRIDESHR = col_double(),
  ..   FEWRTRANS = col_double(),
  ..   PLNDTRIPS = col_double(),
  ..   CURFOODSUF = col_double(),
  ..   CHILDFOOD = col_double(),
  ..   FOODRSNRV1 = col_double(),
  ..   FOODRSNRV2 = col_double(),
  ..   FOODRSNRV3 = col_double(),
  ..   FOODRSNRV4 = col_double(),
  ..   FREEFOOD = col_double(),
  ..   SNAP_YN = col_double(),
  ..   TSPNDFOOD = col_double(),
  ..   TSPNDPRPD = col_double(),
  ..   ANXIOUS = col_double(),
  ..   WORRY = col_double(),
  ..   INTEREST = col_double(),
  ..   DOWN = col_double(),
  ..   HLTHINS1 = col_double(),
  ..   HLTHINS2 = col_double(),
  ..   HLTHINS3 = col_double(),
  ..   HLTHINS4 = col_double(),
  ..   HLTHINS5 = col_double(),
  ..   HLTHINS6 = col_double(),
  ..   HLTHINS7 = col_double(),
  ..   HLTHINS8 = col_double(),
  ..   PRIVHLTH = col_double(),
  ..   PUBHLTH = col_double(),
  ..   DELAY = col_double(),
  ..   NOTGET = col_double(),
  ..   TELEHLTH = col_double(),
  ..   TELECHLD = col_double(),
  ..   PRESCRIPT = col_double(),
  ..   MH_SVCS = col_double(),
  ..   MH_NOTGET = col_double(),
  ..   PRVNTIVE = col_double(),
  ..   PRVNTWHY1 = col_double(),
  ..   PRVNTWHY2 = col_double(),
  ..   PRVNTWHY3 = col_double(),
  ..   PRVNTWHY4 = col_double(),
  ..   PRVNTWHY5 = col_double(),
  ..   PRVNTWHY6 = col_double(),
  ..   PRVNTWHY7 = col_double(),
  ..   SEEING = col_double(),
  ..   HEARING = col_double(),
  ..   REMEMBERING = col_double(),
  ..   MOBILITY = col_double(),
  ..   TENURE = col_double(),
  ..   LIVQTRRV = col_double(),
  ..   RENTCUR = col_double(),
  ..   MORTCUR = col_double(),
  ..   MORTCONF = col_double(),
  ..   EVICT = col_double(),
  ..   FORCLOSE = col_double(),
  ..   ENRPUBCHK = col_double(),
  ..   ENRPRVCHK = col_double(),
  ..   ENRHMSCHK = col_double(),
  ..   TENROLLPUB = col_double(),
  ..   TENROLLPRV = col_double(),
  ..   TENROLLHMSCH = col_double(),
  ..   ENROLLNONE = col_double(),
  ..   TEACH1 = col_double(),
  ..   TEACH2 = col_double(),
  ..   TEACH3 = col_double(),
  ..   TEACH4 = col_double(),
  ..   TEACH5 = col_double(),
  ..   TEACH6 = col_double(),
  ..   TEACH7 = col_double(),
  ..   TEACH8 = col_double(),
  ..   HYBRID = col_double(),
  ..   COMPAVAIL = col_double(),
  ..   INTRNTAVAIL = col_double(),
  ..   INTRNTRV1 = col_double(),
  ..   INTRNTRV2 = col_double(),
  ..   INTRNTRV3 = col_double(),
  ..   INTRNTRV4 = col_double(),
  ..   SCHLHRS = col_double(),
  ..   SCHLFOOD = col_double(),
  ..   SCHLFDHLP1 = col_double(),
  ..   SCHLFDHLP2 = col_double(),
  ..   SCHLFDHLP3 = col_double(),
  ..   SCHLFDHLP4 = col_double(),
  ..   CHLDCARE = col_double(),
  ..   CHLDIMPCT1 = col_double(),
  ..   CHLDIMPCT2 = col_double(),
  ..   CHLDIMPCT3 = col_double(),
  ..   CHLDIMPCT4 = col_double(),
  ..   CHLDIMPCT5 = col_double(),
  ..   CHLDIMPCT6 = col_double(),
  ..   CHLDIMPCT7 = col_double(),
  ..   CHLDIMPCT8 = col_double(),
  ..   CHLDIMPCT9 = col_double(),
  ..   TNUM_PS = col_double(),
  ..   PSCHNG1 = col_double(),
  ..   PSCHNG2 = col_double(),
  ..   PSCHNG3 = col_double(),
  ..   PSCHNG4 = col_double(),
  ..   PSCHNG5 = col_double(),
  ..   PSCHNG6 = col_double(),
  ..   PSCHNG7 = col_double(),
  ..   PSWHYCHG1 = col_double(),
  ..   PSWHYCHG2 = col_double(),
  ..   PSWHYCHG3 = col_double(),
  ..   PSWHYCHG4 = col_double(),
  ..   PSWHYCHG5 = col_double(),
  ..   PSWHYCHG6 = col_double(),
  ..   PSWHYCHG7 = col_double(),
  ..   PSWHYCHG8 = col_double(),
  ..   PSWHYCHG9 = col_double(),
  ..   INCOME = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
weight <- read_csv("pulse2021_repwgt_puf_33.csv")
Rows: 66262 Columns: 162
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (1): SCRAM
dbl (161): WEEK, HWEIGHT1, HWEIGHT2, HWEIGHT3, HWEIGHT4, HWEIGHT5, HWEIGHT6,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#str(weight)
  1. Describe the relationship between COVID vaccination (ignore number of doses) and a few key demographic variables (approximate age, gender, education – measured as having at least a Bachelor’s degree) using a regression-based approach. You can drop any individuals whose response to the COVID vaccination question is missing for the rest of the exercise.

On average, a one-year increment in age is associated with a 1.037-fold relative increment in the odds of recieving COVID vaccination (95% CI: 1.036–1.039, p < 0.001),holding gender and education constant.

On average, people who are female is associated with a 0.886 -fold relative decrement in the odds of recieving COVID vaccination (95% CI: 0.844–0.930, p <0.001) ,compared to people who are not male,holding age and education constant.

On average, people who have at least a Bachelor’s degree is associated with a 3.243 -fold relative increment in the odds of recieving COVID vaccination (95% CI: 3.087–3.407, p < 0.001), compared to people who does not have at least a Bachelor’s degree,holding age and gender constant.

library(skimr)
library(lmtest)
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
library(broom)

analysis_data <- data %>%
  select(RECVDVACC, TBIRTH_YEAR, EGENDER, EEDUC)

skim(analysis_data)
Data summary
Name analysis_data
Number of rows 66262
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
RECVDVACC 0 1 0.38 8.67 -99 1 1 1 2 ▁▁▁▁▇
TBIRTH_YEAR 0 1 1967.21 15.70 1933 1955 1966 1980 2003 ▃▇▇▆▂
EGENDER 0 1 1.59 0.49 1 1 2 2 2 ▆▁▁▁▇
EEDUC 0 1 5.31 1.46 1 4 6 7 7 ▁▂▃▂▇
analysis_data <- analysis_data %>%
  filter(RECVDVACC %in% c(1, 2)) %>%  # 這樣就夠了
  mutate(
    vaccinated = ifelse(RECVDVACC == 1, 1, 0),
    age = 2021 - TBIRTH_YEAR,
    female = ifelse(EGENDER == 2, 1, 0),
    bachelors_plus = ifelse(EEDUC >= 6, 1, 0)
  ) 



model <- glm(vaccinated ~ age + female + bachelors_plus, 
             data = analysis_data, 
             family = "binomial")

result <- tidy(coeftest(model), conf.int=T)%>%
  mutate(OR=round(exp(estimate),3),
         OR_low=round(exp(conf.low),3),
         OR_high=round(exp(conf.high),3)
         )

result 
# A tibble: 4 × 10
  term     estimate std.error statistic  p.value conf.low conf.high    OR OR_low
  <chr>       <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl> <dbl>  <dbl>
1 (Interc…  -0.426   0.0437       -9.75 1.86e-22  -0.511    -0.340  0.653  0.6  
2 age        0.0367  0.000785     46.8  0          0.0352    0.0383 1.04   1.04 
3 female    -0.121   0.0248       -4.89 9.96e- 7  -0.170    -0.0727 0.886  0.844
4 bachelo…   1.18    0.0252       46.7  0          1.13      1.23   3.24   3.09 
# ℹ 1 more variable: OR_high <dbl>
  1. Create a machine-learning model (eg: KNN, shrinkage, tree-based, clustering) that predicts individual-level COVID vaccination status from other variables available in the HPS dataset. Briefly explain why you chose this method. (Note: Please limit yourself to 10-15 minutes of coding. We want this to be an opportunity to think creatively without being onerous.)

I chose to use a random forest model because it is an ensemble method that aggregates multiple decision trees, which helps improve prediction accuracy and reduce overfitting. Additionally, the model’s use of Gini impurity to rank variable importance provides a useful starting point for identifying key predictors and exploring additional variables.

After this, I rendered a singel decision tree to improve interpretation. It turns out having higher specificity and a better F1 score.

Clean data and number of variables went from 240 to 105

library(tidyverse)
library(caret)
Loading required package: lattice

Attaching package: 'caret'
The following object is masked from 'package:purrr':

    lift
library(rpart)
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':

    combine
The following object is masked from 'package:ggplot2':

    margin
library(glmnet)
Loading required package: Matrix

Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':

    expand, pack, unpack
Loaded glmnet 4.1-9
library(pROC)
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'
The following objects are masked from 'package:stats':

    cov, smooth, var
#str(data) #66,262 × 239

#remove rows missing RECVDVACC
data_clean <- data
data_clean <- data_clean %>%
  filter(!is.na(RECVDVACC) & RECVDVACC %in% c(1, 2)) %>%
  mutate(vaccination_status = factor(ifelse(RECVDVACC == 1, 1, 0), 
                                   levels = c(0, 1),
                                   labels = c("Not_Vaccinated", "Vaccinated")))

data_clean <- data_clean %>%
  mutate(MSA_flag = ifelse(is.na(EST_MSA), "Non_Metro", "Metro"))

#recode multiple_response questions
library(readxl)
library(dplyr)
library(stringr)
dict <- read_excel("pulse2021_data.dictionary_CSV_33.xlsx", skip=4)
New names:
• `` -> `...4`
#names(dict)

condition1 <- dict %>%
  filter(str_detect(`Question Wording`, regex("Select all that apply", ignore_case = TRUE))) %>%
  pull(Variable)

condition2 <- dict %>%
  filter(str_detect(Range, fixed("<blank>,1"))) %>%
  pull(Variable)

condition3 <- dict %>%
  filter(str_detect(Range, fixed("1:2"))) %>%
  pull(Variable)

condition4 <- dict %>%
  filter(str_detect(Range, fixed("<blank>,1,2"))) %>%
  pull(Variable)

multiple_response <- unique(c(condition1, condition2, condition3, condition4))
multiple_response <- multiple_response[multiple_response %in% names(data_clean)]

for (var in multiple_response) {
  data_clean[[var]] <- ifelse(data_clean[[var]] %in% c(-88, -99), 2, data_clean[[var]])
}

# change -88 / -99 to 2 (No)
for (var in multiple_response) {
  data_clean[[var]] <- ifelse(data_clean[[var]] %in% c(-88, -99), 2, data_clean[[var]])
}

# Define variables to exclude
exclude_vars <- c(
  
  # Remove to prevent data leak
  "RECVDVACC", "SCRAM", "PWEIGHT", "HWEIGHT",
  "DOSES", "GETVACRV", paste0("WHYNOT", 1:11), paste0("WHYNOTB", 1:6), "COVPRVNT",
  
  # Remove administrative data
  "ABIRTH_YEAR", "AGENDER", "AHISPANIC", "ARACE", 
  "AEDUC", "AHHLD_NUMPER", "AHHLD_NUMKID", "WEEK", "EST_MSA",
  
  # Remove too detailed conditional questions
  "EIPRV", paste0("EIPSPND", 1:13), "RENTCUR", "MORTCUR", 
  "EVICT", "FORCLOSE", "CHLDCARE", 
  paste0("CHLDIMPCT", c(2:4, 6:8)),
  "TENROLLPUB", "TENROLLPRV", "TENROLLHMSCH", "ENROLLNONE",
  paste0("TEACH", 1:8), "HYBRID", "COMPAVAIL", 
  paste0("INTRNTRV", 1:4), "INTRNTAVAIL", "SCHLHRS", 
  "SCHLFOOD", paste0("SCHLFDHLP", 1:4), 
  "PRVNTIVE", paste0("PRVNTWHY", 1:7), "TELECHLD",
  "TNUM_PS", paste0("PSCHNG", 1:7), paste0("PSWHYCHG", 1:9),
  "MORTCONF", "SSADECISN", "KINDWORK", "RSNNOWRKRV", 
  paste0("FOODRSNRV", 1:4), "UI_RECVNOW", "CHILDFOOD",paste0("ACTVDUTY", 2:5),
  
  # Remove variables not in data dictionary
  "ENRPRVCHK", "ENRHMSCHK", "ENRPUBCHK"
    )

# Apply exclusions
data_clean <- data_clean[, !names(data_clean) %in% exclude_vars]

# Recode missing data as NA
data_clean$PRIVHLTH <- ifelse(data_clean$PRIVHLTH == 3, NA, data_clean$PRIVHLTH)
data_clean$PUBHLTH  <- ifelse(data_clean$PUBHLTH  == 3, NA, data_clean$PUBHLTH)

# Recode non applicable questions
data_clean$SSALIKELYRV[data_clean$SSALIKELYRV %in% c(-88, -99) & 
                        !(data_clean$SSA_APPLYRV == 2 & data_clean$SSA_RECV != 1)] <- -1

# Recode -88, -99 to NA
vars_to_na <- c("SSALIKELYRV", "HADCOVID", "SETTING", "EXPNS_DIF", 
                "CASHUSE", "CURFOODSUF", "MOBILITY", "SEEING", "ANXIOUS", 
                "REMEMBERING", "HEARING", "MS", "DOWN", "TENURE", "WORRY", 
                "INTEREST", "LIVQTRRV", "INCOME", "TSPNDPRPD", "TSPNDFOOD")

data_clean[vars_to_na] <- lapply(data_clean[vars_to_na], function(x) {
  replace(x, x %in% c(-88, -99), NA)
})

#find variables with -99
cols_with_minus99 <- sapply(data_clean, function(x) {
  is.numeric(x) && any(x == -99, na.rm = TRUE)})
neg99_prop <- sapply(data_clean[cols_with_minus99], function(col) {
  sum(col == -99, na.rm = TRUE) / sum(!is.na(col))})
#sort(neg99_prop)
#names(neg99_prop)

#find variables with -88
cols_with_minus88 <- sapply(data_clean, function(x) any(x == -88, na.rm = TRUE))
neg88_prop <- sapply(data_clean[cols_with_minus88], function(col) {
  sum(col == -88, na.rm = TRUE) / sum(!is.na(col))
})
#sort(neg88_prop)


#remove columns with missing data >=0.2
threshold <- 0.2
missing_rate <- colMeans(is.na(data_clean))
data_clean <- data_clean[, missing_rate < threshold]


#skim(data_clean)
#glimpse(data_clean)




sapply(data_clean, function(x) length(unique(x)))
            EST_ST             REGION        TBIRTH_YEAR            EGENDER 
                51                  4                 71                  2 
         RHISPANIC              RRACE              EEDUC                 MS 
                 2                  4                  7                  6 
      THHLD_NUMPER       THHLD_NUMKID      THHLD_NUMADLT          ACTVDUTY1 
                10                  6                 10                  2 
          HADCOVID          WRKLOSSRV          EXPCTLOSS            ANYWORK 
                 4                  2                  2                  2 
             TW_YN             TW_COV              WKVOL         UI_APPLYRV 
                 2                  2                  2                  2 
         UI_RECVRV           SSA_RECV        SSA_APPLYRV          SSAPGMRV1 
                 2                  2                  2                  2 
         SSAPGMRV2          SSAPGMRV3          SSAPGMRV4          SSAPGMRV5 
                 2                  2                  2                  2 
       SSALIKELYRV          SSAEXPCT1          SSAEXPCT2          SSAEXPCT3 
                 6                  2                  2                  2 
         SSAEXPCT4          SSAEXPCT5             EIP_YN          EXPNS_DIF 
                 2                  2                  2                  5 
         CHNGSHOP1          CHNGSHOP2          CHNGSHOP3          CHNGSVCS1 
                 2                  2                  2                  2 
         CHNGSVCS2          CHNGSVCS3         CHNGSHP1ML         CHNGSHP2ML 
                 2                  2                  2                  2 
        CHNGSHP3ML         CHNGSVC1ML         CHNGSVC2ML         CHNGSVC3ML 
                 2                  2                  2                  2 
           CASHUSE          WHYCHNGD1          WHYCHNGD2          WHYCHNGD3 
                 4                  2                  2                  2 
         WHYCHNGD4          WHYCHNGD5          WHYCHNGD6          WHYCHNGD7 
                 2                  2                  2                  2 
         WHYCHNGD8          WHYCHNGD9         WHYCHNGD10         WHYCHNGD11 
                 2                  2                  2                  2 
        WHYCHNGD12         WHYCHNGD13           SPNDSRC1           SPNDSRC2 
                 2                  2                  2                  2 
          SPNDSRC3           SPNDSRC4           SPNDSRC5           SPNDSRC6 
                 2                  2                  2                  2 
          SPNDSRC7           SPNDSRC8           SPNDSRC9          FEWRTRIP1 
                 2                  2                  2                  2 
         FEWRTRIP2          FEWRTRIP3         PRVRIDESHR          FEWRTRANS 
                 2                  2                  2                  2 
         PLNDTRIPS         CURFOODSUF           FREEFOOD            SNAP_YN 
                 2                  5                  2                  2 
         TSPNDFOOD            ANXIOUS              WORRY           INTEREST 
               470                  5                  5                  5 
              DOWN           HLTHINS1           HLTHINS2           HLTHINS3 
                 5                  2                  2                  2 
          HLTHINS4           HLTHINS5           HLTHINS6           HLTHINS7 
                 2                  2                  2                  2 
          HLTHINS8              DELAY             NOTGET           TELEHLTH 
                 2                  2                  2                  2 
         PRESCRIPT            MH_SVCS          MH_NOTGET         CHLDIMPCT1 
                 2                  2                  2                  2 
        CHLDIMPCT5         CHLDIMPCT9 vaccination_status           MSA_flag 
                 2                  2                  2                  2 
# variable type
not_factor_vars <- c(
  "TBIRTH_YEAR", "THHLD_NUMPER", "THHLD_NUMKID", "THHLD_NUMADLT",
  "TSPNDFOOD", "ANXIOUS", "WORRY", "INTEREST", "DOWN")

binary_vars <- c(
  "EGENDER", "RHISPANIC", "ACTVDUTY1", "WRKLOSSRV", "EXPCTLOSS", "ANYWORK", "TW_YN", "TW_COV", 
  "WKVOL", "UI_APPLYRV", "UI_RECVRV", "SSA_RECV", "SSA_APPLYRV", "SSAPGMRV1", "SSAPGMRV2", 
  "SSAPGMRV3", "SSAPGMRV4", "SSAPGMRV5", "SSAEXPCT1", "SSAEXPCT2", "SSAEXPCT3", "SSAEXPCT4", 
  "SSAEXPCT5", "EIP_YN", "CHNGSHOP1", "CHNGSHOP2", "CHNGSHOP3", "CHNGSVCS1", "CHNGSVCS2", 
  "CHNGSVCS3", "CHNGSHP1ML", "CHNGSHP2ML", "CHNGSHP3ML", "CHNGSVC1ML", "CHNGSVC2ML", 
  "CHNGSVC3ML", "WHYCHNGD1", "WHYCHNGD2", "WHYCHNGD3", "WHYCHNGD4", "WHYCHNGD5", "WHYCHNGD6", 
  "WHYCHNGD7", "WHYCHNGD8", "WHYCHNGD9", "WHYCHNGD10", "WHYCHNGD11", "WHYCHNGD12", 
  "WHYCHNGD13", "SPNDSRC1", "SPNDSRC2", "SPNDSRC3", "SPNDSRC4", "SPNDSRC5", "SPNDSRC6", 
  "SPNDSRC7", "SPNDSRC8", "SPNDSRC9", "FEWRTRIP1", "FEWRTRIP2", "FEWRTRIP3", "PRVRIDESHR", 
  "FEWRTRANS", "PLNDTRIPS", "FREEFOOD", "SNAP_YN", "HLTHINS1", "HLTHINS2", "HLTHINS3", 
  "HLTHINS4", "HLTHINS5", "HLTHINS6", "HLTHINS7", "HLTHINS8", "DELAY", "NOTGET", "TELEHLTH", 
  "PRESCRIPT", "MH_SVCS", "MH_NOTGET", "CHLDIMPCT1", "CHLDIMPCT5", "CHLDIMPCT9", "MSA_flag"
)

data_clean[binary_vars] <- lapply(data_clean[binary_vars], function(x) {
  x_new <- ifelse(x == 1, "Yes", ifelse(x == 2, "No", NA))
  factor(x_new, levels = c("No", "Yes"))
})

# Turn rest variables to factor
to_factor_vars <- setdiff(names(data_clean), c(not_factor_vars, binary_vars))
data_clean[to_factor_vars] <- lapply(data_clean[to_factor_vars], factor)


data_clean$AGE <- 2021 - data_clean$TBIRTH_YEAR

skim(data_clean)
Data summary
Name data_clean
Number of rows 65762
Number of columns 105
_______________________
Column type frequency:
factor 95
numeric 10
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
EST_ST 0 1.00 FALSE 51 06: 5289, 48: 3491, 53: 2472, 12: 2447
REGION 0 1.00 FALSE 4 4: 21407, 2: 21028, 3: 13191, 1: 10136
EGENDER 0 1.00 FALSE 2 No: 38857, Yes: 26905
RHISPANIC 0 1.00 FALSE 2 Yes: 59772, No: 5990
RRACE 0 1.00 FALSE 4 1: 53761, 2: 5351, 3: 3496, 4: 3154
EEDUC 0 1.00 FALSE 7 6: 18825, 7: 17342, 4: 13820, 3: 7549
MS 405 0.99 FALSE 5 1: 38415, 5: 11970, 3: 10139, 2: 3634
ACTVDUTY1 0 1.00 FALSE 2 Yes: 64537, No: 1225
HADCOVID 77 1.00 FALSE 3 2: 57813, 1: 7412, 3: 460
WRKLOSSRV 0 1.00 FALSE 2 No: 57379, Yes: 8383
EXPCTLOSS 0 1.00 FALSE 2 No: 60129, Yes: 5633
ANYWORK 0 1.00 FALSE 2 Yes: 38256, No: 27506
TW_YN 0 1.00 FALSE 2 No: 43379, Yes: 22383
TW_COV 0 1.00 FALSE 2 No: 47935, Yes: 17827
WKVOL 0 1.00 FALSE 2 Yes: 33074, No: 32688
UI_APPLYRV 0 1.00 FALSE 2 No: 60949, Yes: 4813
UI_RECVRV 0 1.00 FALSE 2 No: 61620, Yes: 4142
SSA_RECV 0 1.00 FALSE 2 No: 45033, Yes: 20729
SSA_APPLYRV 0 1.00 FALSE 2 No: 63689, Yes: 2073
SSAPGMRV1 0 1.00 FALSE 2 No: 64989, Yes: 773
SSAPGMRV2 0 1.00 FALSE 2 No: 65177, Yes: 585
SSAPGMRV3 0 1.00 FALSE 2 No: 65652, Yes: 110
SSAPGMRV4 0 1.00 FALSE 2 No: 65513, Yes: 249
SSAPGMRV5 0 1.00 FALSE 2 No: 65004, Yes: 758
SSALIKELYRV 1335 0.98 FALSE 5 4: 40374, -1: 21648, 3: 1306, 1: 650
SSAEXPCT1 0 1.00 FALSE 2 No: 64970, Yes: 792
SSAEXPCT2 0 1.00 FALSE 2 No: 65099, Yes: 663
SSAEXPCT3 0 1.00 FALSE 2 No: 65655, Yes: 107
SSAEXPCT4 0 1.00 FALSE 2 No: 65376, Yes: 386
SSAEXPCT5 0 1.00 FALSE 2 No: 64870, Yes: 892
EIP_YN 0 1.00 FALSE 2 No: 60846, Yes: 4916
EXPNS_DIF 3231 0.95 FALSE 4 1: 37985, 2: 11851, 3: 7696, 4: 4999
CHNGSHOP1 0 1.00 FALSE 2 No: 45266, Yes: 20496
CHNGSHOP2 0 1.00 FALSE 2 No: 52888, Yes: 12874
CHNGSHOP3 0 1.00 FALSE 2 No: 42865, Yes: 22897
CHNGSVCS1 0 1.00 FALSE 2 No: 43006, Yes: 22756
CHNGSVCS2 0 1.00 FALSE 2 No: 48806, Yes: 16956
CHNGSVCS3 0 1.00 FALSE 2 No: 59372, Yes: 6390
CHNGSHP1ML 0 1.00 FALSE 2 No: 56528, Yes: 9234
CHNGSHP2ML 0 1.00 FALSE 2 No: 60795, Yes: 4967
CHNGSHP3ML 0 1.00 FALSE 2 No: 57396, Yes: 8366
CHNGSVC1ML 0 1.00 FALSE 2 No: 55785, Yes: 9977
CHNGSVC2ML 0 1.00 FALSE 2 No: 58057, Yes: 7705
CHNGSVC3ML 0 1.00 FALSE 2 No: 63793, Yes: 1969
CASHUSE 4094 0.94 FALSE 3 3: 45502, 2: 11592, 1: 4574
WHYCHNGD1 0 1.00 FALSE 2 No: 59568, Yes: 6194
WHYCHNGD2 0 1.00 FALSE 2 No: 58376, Yes: 7386
WHYCHNGD3 0 1.00 FALSE 2 No: 55013, Yes: 10749
WHYCHNGD4 0 1.00 FALSE 2 No: 56312, Yes: 9450
WHYCHNGD5 0 1.00 FALSE 2 No: 58802, Yes: 6960
WHYCHNGD6 0 1.00 FALSE 2 No: 65012, Yes: 750
WHYCHNGD7 0 1.00 FALSE 2 No: 62543, Yes: 3219
WHYCHNGD8 0 1.00 FALSE 2 No: 64424, Yes: 1338
WHYCHNGD9 0 1.00 FALSE 2 No: 62186, Yes: 3576
WHYCHNGD10 0 1.00 FALSE 2 No: 63834, Yes: 1928
WHYCHNGD11 0 1.00 FALSE 2 No: 55878, Yes: 9884
WHYCHNGD12 0 1.00 FALSE 2 No: 64161, Yes: 1601
WHYCHNGD13 0 1.00 FALSE 2 No: 61929, Yes: 3833
SPNDSRC1 0 1.00 FALSE 2 Yes: 47922, No: 17840
SPNDSRC2 0 1.00 FALSE 2 No: 50079, Yes: 15683
SPNDSRC3 0 1.00 FALSE 2 No: 54668, Yes: 11094
SPNDSRC4 0 1.00 FALSE 2 No: 61939, Yes: 3823
SPNDSRC5 0 1.00 FALSE 2 No: 63126, Yes: 2636
SPNDSRC6 0 1.00 FALSE 2 No: 57790, Yes: 7972
SPNDSRC7 0 1.00 FALSE 2 No: 64226, Yes: 1536
SPNDSRC8 0 1.00 FALSE 2 No: 63112, Yes: 2650
SPNDSRC9 0 1.00 FALSE 2 No: 64300, Yes: 1462
FEWRTRIP1 0 1.00 FALSE 2 No: 50101, Yes: 15661
FEWRTRIP2 0 1.00 FALSE 2 No: 54740, Yes: 11022
FEWRTRIP3 0 1.00 FALSE 2 Yes: 35798, No: 29964
PRVRIDESHR 0 1.00 FALSE 2 No: 56834, Yes: 8928
FEWRTRANS 0 1.00 FALSE 2 No: 60587, Yes: 5175
PLNDTRIPS 0 1.00 FALSE 2 No: 40689, Yes: 25073
CURFOODSUF 7381 0.89 FALSE 4 1: 45595, 2: 9303, 3: 2717, 4: 766
FREEFOOD 0 1.00 FALSE 2 No: 62953, Yes: 2809
SNAP_YN 0 1.00 FALSE 2 No: 61138, Yes: 4624
HLTHINS1 0 1.00 FALSE 2 Yes: 34551, No: 31211
HLTHINS2 0 1.00 FALSE 2 No: 54213, Yes: 11549
HLTHINS3 0 1.00 FALSE 2 No: 49492, Yes: 16270
HLTHINS4 0 1.00 FALSE 2 No: 59764, Yes: 5998
HLTHINS5 0 1.00 FALSE 2 No: 63289, Yes: 2473
HLTHINS6 0 1.00 FALSE 2 No: 63211, Yes: 2551
HLTHINS7 0 1.00 FALSE 2 No: 65338, Yes: 424
HLTHINS8 0 1.00 FALSE 2 No: 63748, Yes: 2014
DELAY 0 1.00 FALSE 2 No: 58685, Yes: 7077
NOTGET 0 1.00 FALSE 2 No: 59900, Yes: 5862
TELEHLTH 0 1.00 FALSE 2 No: 52798, Yes: 12964
PRESCRIPT 0 1.00 FALSE 2 No: 53698, Yes: 12064
MH_SVCS 0 1.00 FALSE 2 No: 60013, Yes: 5749
MH_NOTGET 0 1.00 FALSE 2 No: 60973, Yes: 4789
CHLDIMPCT1 0 1.00 FALSE 2 No: 65488, Yes: 274
CHLDIMPCT5 0 1.00 FALSE 2 No: 65677, Yes: 85
CHLDIMPCT9 0 1.00 FALSE 2 No: 65589, Yes: 173
vaccination_status 0 1.00 FALSE 2 Vac: 57066, Not: 8696
MSA_flag 65762 0.00 FALSE 0 No: 0, Yes: 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
TBIRTH_YEAR 0 1.00 1967.23 15.69 1933 1955 1966 1980 2003 ▃▇▇▆▂
THHLD_NUMPER 0 1.00 2.73 1.49 1 2 2 4 10 ▇▅▂▁▁
THHLD_NUMKID 0 1.00 0.61 1.02 0 0 0 1 5 ▇▁▁▁▁
THHLD_NUMADLT 0 1.00 2.12 0.98 1 2 2 2 10 ▇▂▁▁▁
TSPNDFOOD 12973 0.80 203.33 149.54 0 100 175 250 900 ▇▆▁▁▁
ANXIOUS 11300 0.83 1.79 1.02 1 1 1 2 4 ▇▅▁▂▂
WORRY 11422 0.83 1.64 0.94 1 1 1 2 4 ▇▃▁▁▁
INTEREST 11441 0.83 1.60 0.90 1 1 1 2 4 ▇▃▁▁▁
DOWN 11388 0.83 1.57 0.89 1 1 1 2 4 ▇▃▁▁▁
AGE 0 1.00 53.77 15.69 18 41 55 66 88 ▃▇▇▇▂
#glimpse(data_clean) #65762*105

Partition the data into training and testing dataset

set.seed(123)
trainIndex <- createDataPartition(data_clean$vaccination_status, p = 0.7, list = FALSE)
train_data <- data_clean[trainIndex, ]
test_data <- data_clean[-trainIndex, ]

#check balance
prop.table(table(train_data$vaccination_status))  # Distribution in training

Not_Vaccinated     Vaccinated 
     0.1322472      0.8677528 
prop.table(table(test_data$vaccination_status))   # Distribution in test

Not_Vaccinated     Vaccinated 
     0.1322046      0.8677954 
prop.table(table(data_clean$vaccination_status))  # Original distribution

Not_Vaccinated     Vaccinated 
     0.1322344      0.8677656 
prop.table(table(analysis_data$vaccinated))  # Original distribution

        0         1 
0.1322344 0.8677656 

Random Forest to select variable automatically

library(randomForest)
library(caret)
library(pROC)
library(ROSE)
Loaded ROSE 0.0-4
# imputation
train_data_imputed <- na.roughfix(train_data)
test_data_imputed <- na.roughfix(test_data)

# check no N/A
colSums(is.na(train_data_imputed))
            EST_ST             REGION        TBIRTH_YEAR            EGENDER 
                 0                  0                  0                  0 
         RHISPANIC              RRACE              EEDUC                 MS 
                 0                  0                  0                  0 
      THHLD_NUMPER       THHLD_NUMKID      THHLD_NUMADLT          ACTVDUTY1 
                 0                  0                  0                  0 
          HADCOVID          WRKLOSSRV          EXPCTLOSS            ANYWORK 
                 0                  0                  0                  0 
             TW_YN             TW_COV              WKVOL         UI_APPLYRV 
                 0                  0                  0                  0 
         UI_RECVRV           SSA_RECV        SSA_APPLYRV          SSAPGMRV1 
                 0                  0                  0                  0 
         SSAPGMRV2          SSAPGMRV3          SSAPGMRV4          SSAPGMRV5 
                 0                  0                  0                  0 
       SSALIKELYRV          SSAEXPCT1          SSAEXPCT2          SSAEXPCT3 
                 0                  0                  0                  0 
         SSAEXPCT4          SSAEXPCT5             EIP_YN          EXPNS_DIF 
                 0                  0                  0                  0 
         CHNGSHOP1          CHNGSHOP2          CHNGSHOP3          CHNGSVCS1 
                 0                  0                  0                  0 
         CHNGSVCS2          CHNGSVCS3         CHNGSHP1ML         CHNGSHP2ML 
                 0                  0                  0                  0 
        CHNGSHP3ML         CHNGSVC1ML         CHNGSVC2ML         CHNGSVC3ML 
                 0                  0                  0                  0 
           CASHUSE          WHYCHNGD1          WHYCHNGD2          WHYCHNGD3 
                 0                  0                  0                  0 
         WHYCHNGD4          WHYCHNGD5          WHYCHNGD6          WHYCHNGD7 
                 0                  0                  0                  0 
         WHYCHNGD8          WHYCHNGD9         WHYCHNGD10         WHYCHNGD11 
                 0                  0                  0                  0 
        WHYCHNGD12         WHYCHNGD13           SPNDSRC1           SPNDSRC2 
                 0                  0                  0                  0 
          SPNDSRC3           SPNDSRC4           SPNDSRC5           SPNDSRC6 
                 0                  0                  0                  0 
          SPNDSRC7           SPNDSRC8           SPNDSRC9          FEWRTRIP1 
                 0                  0                  0                  0 
         FEWRTRIP2          FEWRTRIP3         PRVRIDESHR          FEWRTRANS 
                 0                  0                  0                  0 
         PLNDTRIPS         CURFOODSUF           FREEFOOD            SNAP_YN 
                 0                  0                  0                  0 
         TSPNDFOOD            ANXIOUS              WORRY           INTEREST 
                 0                  0                  0                  0 
              DOWN           HLTHINS1           HLTHINS2           HLTHINS3 
                 0                  0                  0                  0 
          HLTHINS4           HLTHINS5           HLTHINS6           HLTHINS7 
                 0                  0                  0                  0 
          HLTHINS8              DELAY             NOTGET           TELEHLTH 
                 0                  0                  0                  0 
         PRESCRIPT            MH_SVCS          MH_NOTGET         CHLDIMPCT1 
                 0                  0                  0                  0 
        CHLDIMPCT5         CHLDIMPCT9 vaccination_status           MSA_flag 
                 0                  0                  0                  0 
               AGE 
                 0 
colSums(is.na(test_data_imputed))
            EST_ST             REGION        TBIRTH_YEAR            EGENDER 
                 0                  0                  0                  0 
         RHISPANIC              RRACE              EEDUC                 MS 
                 0                  0                  0                  0 
      THHLD_NUMPER       THHLD_NUMKID      THHLD_NUMADLT          ACTVDUTY1 
                 0                  0                  0                  0 
          HADCOVID          WRKLOSSRV          EXPCTLOSS            ANYWORK 
                 0                  0                  0                  0 
             TW_YN             TW_COV              WKVOL         UI_APPLYRV 
                 0                  0                  0                  0 
         UI_RECVRV           SSA_RECV        SSA_APPLYRV          SSAPGMRV1 
                 0                  0                  0                  0 
         SSAPGMRV2          SSAPGMRV3          SSAPGMRV4          SSAPGMRV5 
                 0                  0                  0                  0 
       SSALIKELYRV          SSAEXPCT1          SSAEXPCT2          SSAEXPCT3 
                 0                  0                  0                  0 
         SSAEXPCT4          SSAEXPCT5             EIP_YN          EXPNS_DIF 
                 0                  0                  0                  0 
         CHNGSHOP1          CHNGSHOP2          CHNGSHOP3          CHNGSVCS1 
                 0                  0                  0                  0 
         CHNGSVCS2          CHNGSVCS3         CHNGSHP1ML         CHNGSHP2ML 
                 0                  0                  0                  0 
        CHNGSHP3ML         CHNGSVC1ML         CHNGSVC2ML         CHNGSVC3ML 
                 0                  0                  0                  0 
           CASHUSE          WHYCHNGD1          WHYCHNGD2          WHYCHNGD3 
                 0                  0                  0                  0 
         WHYCHNGD4          WHYCHNGD5          WHYCHNGD6          WHYCHNGD7 
                 0                  0                  0                  0 
         WHYCHNGD8          WHYCHNGD9         WHYCHNGD10         WHYCHNGD11 
                 0                  0                  0                  0 
        WHYCHNGD12         WHYCHNGD13           SPNDSRC1           SPNDSRC2 
                 0                  0                  0                  0 
          SPNDSRC3           SPNDSRC4           SPNDSRC5           SPNDSRC6 
                 0                  0                  0                  0 
          SPNDSRC7           SPNDSRC8           SPNDSRC9          FEWRTRIP1 
                 0                  0                  0                  0 
         FEWRTRIP2          FEWRTRIP3         PRVRIDESHR          FEWRTRANS 
                 0                  0                  0                  0 
         PLNDTRIPS         CURFOODSUF           FREEFOOD            SNAP_YN 
                 0                  0                  0                  0 
         TSPNDFOOD            ANXIOUS              WORRY           INTEREST 
                 0                  0                  0                  0 
              DOWN           HLTHINS1           HLTHINS2           HLTHINS3 
                 0                  0                  0                  0 
          HLTHINS4           HLTHINS5           HLTHINS6           HLTHINS7 
                 0                  0                  0                  0 
          HLTHINS8              DELAY             NOTGET           TELEHLTH 
                 0                  0                  0                  0 
         PRESCRIPT            MH_SVCS          MH_NOTGET         CHLDIMPCT1 
                 0                  0                  0                  0 
        CHLDIMPCT5         CHLDIMPCT9 vaccination_status           MSA_flag 
                 0                  0                  0                  0 
               AGE 
                 0 
# Oversampling to balance classes
balanced_data <- ovun.sample(vaccination_status ~ ., 
                             data = train_data_imputed, 
                             method = "over", 
                             N = 2 * nrow(train_data_imputed))$data

# Train random forest model 
rf_model <- randomForest(vaccination_status ~ ., 
                         data = balanced_data, 
                         ntree = 200, 
                         classwt = c("Not_Vaccinated" = 7, "Vaccinated" = 1), 
                         importance = TRUE)

#evaluate
rf_pred <- predict(rf_model, newdata = test_data_imputed)
conf <- confusionMatrix(rf_pred, test_data_imputed$vaccination_status)
Warning in confusionMatrix.default(rf_pred,
test_data_imputed$vaccination_status): Levels are not in the same order for
reference and data. Refactoring data to match.
conf$byClass["F1"]
       F1 
0.3199798 
# ROC and AUC
pred_prob <- predict(rf_model, test_data_imputed, type = "prob")[, "Vaccinated"]
roc_obj <- roc(test_data_imputed$vaccination_status, pred_prob, levels = c("Not_Vaccinated", "Vaccinated"))
Setting direction: controls < cases
cat("AUC:", auc(roc_obj), "\n")
AUC: 0.7951859 
#which variables predict better
library(knitr)
variable_importance <- importance(rf_model)
tmp <- tibble(feature = rownames(variable_importance),
              Gini=variable_importance[,1])%>%
                arrange(desc(Gini))
kable(tmp[1:30,])
feature Gini
EST_ST 32.5810629
EEDUC 32.3451706
HADCOVID 31.1740012
REGION 20.4997849
TW_COV 15.3378684
EXPNS_DIF 10.8131585
RHISPANIC 7.7790768
RRACE 7.7070942
TW_YN 5.5211151
CHLDIMPCT5 5.1790426
AGE 4.4319888
CHLDIMPCT9 4.0189691
TBIRTH_YEAR 3.5569210
CURFOODSUF 2.9064471
WRKLOSSRV 2.4907813
HLTHINS1 2.2569281
SSAPGMRV1 1.8518839
CHLDIMPCT1 1.7915478
WHYCHNGD11 1.7526195
SSAPGMRV4 1.5731954
ANYWORK 1.4831377
HLTHINS4 1.2419224
SSAEXPCT5 0.9834883
SPNDSRC4 0.9809902
THHLD_NUMKID 0.7298347
WHYCHNGD4 0.6061365
WHYCHNGD13 0.3712734
ACTVDUTY1 0.1506334
HLTHINS7 0.0370651
SSAPGMRV5 0.0339254
#direction of prediction
library(pdp)

Attaching package: 'pdp'
The following object is masked from 'package:purrr':

    partial
library(ggplot2)

top_vars <- tmp$feature[1:5]

pdp_plots <- list()
for(var in top_vars) {
  pdp_data <- partial(rf_model, 
                      pred.var = var, 
                      which.class = "Vaccinated",
                      prob = TRUE,
                      plot = FALSE)
  
  p <- ggplot(pdp_data, aes_string(x = var, y = "yhat")) +
    geom_line(size = 1.2) +
    geom_point() +
    labs(title = paste("Partial Dependence Plot:", var),
         x = var,
         y = "Probability of Being Vaccinated") +
    theme_minimal()
  
  pdp_plots[[var]] <- p
  print(p)
}
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?

`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?

`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?

`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?

`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?

Single decision tree (better for interpretation)

library(rpart.plot)

dt_model <- rpart(vaccination_status ~ ., 
                  data = balanced_data, 
                  method = "class")

dt_model <- rpart(vaccination_status ~ ., data = balanced_data, method = "class")

rpart.plot(dt_model)

#evaluate
dt_pred <- predict(dt_model, newdata = test_data_imputed, type = "class")
conf <- confusionMatrix(dt_pred, test_data_imputed$vaccination_status)
Warning in confusionMatrix.default(dt_pred,
test_data_imputed$vaccination_status): Levels are not in the same order for
reference and data. Refactoring data to match.
conf
Confusion Matrix and Statistics

                Reference
Prediction       Not_Vaccinated Vaccinated
  Not_Vaccinated           1852       5810
  Vaccinated                756      11309
                                          
               Accuracy : 0.6672          
                 95% CI : (0.6605, 0.6737)
    No Information Rate : 0.8678          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.2036          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.71012         
            Specificity : 0.66061         
         Pos Pred Value : 0.24171         
         Neg Pred Value : 0.93734         
             Prevalence : 0.13220         
         Detection Rate : 0.09388         
   Detection Prevalence : 0.38840         
      Balanced Accuracy : 0.68537         
                                          
       'Positive' Class : Not_Vaccinated  
                                          
conf$byClass["F1"]
       F1 
0.3606621 
# ROC and AUC
pred_prob_dt <- predict(dt_model, test_data_imputed, type = "prob")[, "Vaccinated"]
roc_obj_dt <- roc(test_data_imputed$vaccination_status, pred_prob_dt, levels = c("Not_Vaccinated", "Vaccinated"))
Setting direction: controls < cases
cat("AUC:", auc(roc_obj_dt), "\n")
AUC: 0.7109319 
  1. We are interested in the ability of the Household Pulse to generate reliable estimates of the COVID-vaccination rate as a rapid-response survey. Calculate state-level vaccination rates (weighted), and report the mean, median, and standard deviation of the state-level rate.
  • Mean state-level vaccination rate: 0.7967

  • Median state-level vaccination rate: 0.7953

  • Standard deviation of state-level vaccination rates: 0.0737

This indicates that on average, approximately 79.7% of individuals across states reported being vaccinated, with some variation between states as reflected in the standard deviation.

library(survey)
Loading required package: grid
Loading required package: survival

Attaching package: 'survival'
The following object is masked from 'package:caret':

    cluster

Attaching package: 'survey'
The following object is masked from 'package:graphics':

    dotchart
library(tidyverse)

# Read data
data <- read_csv("pulse2021_puf_33.csv")
Rows: 66262 Columns: 239
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (2): SCRAM, EST_ST
dbl (237): WEEK, EST_MSA, REGION, HWEIGHT, PWEIGHT, TBIRTH_YEAR, ABIRTH_YEAR...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weight <- read_csv("pulse2021_repwgt_puf_33.csv")
Rows: 66262 Columns: 162
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (1): SCRAM
dbl (161): WEEK, HWEIGHT1, HWEIGHT2, HWEIGHT3, HWEIGHT4, HWEIGHT5, HWEIGHT6,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Merge data
full_data <- left_join(data, weight, by = "SCRAM")

# Process vaccination variable
analysis_data <- full_data %>%
  mutate(
    vaccinated = case_when(
      RECVDVACC == 1 ~ 1,  # Yes
      RECVDVACC == 2 ~ 0,  # No
      TRUE ~ NA_real_      # Missing
    )
  ) %>%
  select(SCRAM, EST_ST, vaccinated, PWEIGHT, num_range("PWEIGHT", 1:80)) %>%
  drop_na(vaccinated, PWEIGHT)

# Get replicate weight names
repweight_names <- names(analysis_data)[grepl("^PWEIGHT[0-9]+$", names(analysis_data))]

repweight_names
 [1] "PWEIGHT1"  "PWEIGHT2"  "PWEIGHT3"  "PWEIGHT4"  "PWEIGHT5"  "PWEIGHT6" 
 [7] "PWEIGHT7"  "PWEIGHT8"  "PWEIGHT9"  "PWEIGHT10" "PWEIGHT11" "PWEIGHT12"
[13] "PWEIGHT13" "PWEIGHT14" "PWEIGHT15" "PWEIGHT16" "PWEIGHT17" "PWEIGHT18"
[19] "PWEIGHT19" "PWEIGHT20" "PWEIGHT21" "PWEIGHT22" "PWEIGHT23" "PWEIGHT24"
[25] "PWEIGHT25" "PWEIGHT26" "PWEIGHT27" "PWEIGHT28" "PWEIGHT29" "PWEIGHT30"
[31] "PWEIGHT31" "PWEIGHT32" "PWEIGHT33" "PWEIGHT34" "PWEIGHT35" "PWEIGHT36"
[37] "PWEIGHT37" "PWEIGHT38" "PWEIGHT39" "PWEIGHT40" "PWEIGHT41" "PWEIGHT42"
[43] "PWEIGHT43" "PWEIGHT44" "PWEIGHT45" "PWEIGHT46" "PWEIGHT47" "PWEIGHT48"
[49] "PWEIGHT49" "PWEIGHT50" "PWEIGHT51" "PWEIGHT52" "PWEIGHT53" "PWEIGHT54"
[55] "PWEIGHT55" "PWEIGHT56" "PWEIGHT57" "PWEIGHT58" "PWEIGHT59" "PWEIGHT60"
[61] "PWEIGHT61" "PWEIGHT62" "PWEIGHT63" "PWEIGHT64" "PWEIGHT65" "PWEIGHT66"
[67] "PWEIGHT67" "PWEIGHT68" "PWEIGHT69" "PWEIGHT70" "PWEIGHT71" "PWEIGHT72"
[73] "PWEIGHT73" "PWEIGHT74" "PWEIGHT75" "PWEIGHT76" "PWEIGHT77" "PWEIGHT78"
[79] "PWEIGHT79" "PWEIGHT80"
# Create survey design (ignore warnings)
svy_design <- svrepdesign(
  data = analysis_data,
  weights = ~PWEIGHT,
  repweights = analysis_data[repweight_names],
  type = "JK1",
  combined.weights = FALSE,
  mse = TRUE
)
Warning in svrepdesign.default(data = analysis_data, weights = ~PWEIGHT, : Data
look like combined weights: mean replication weight is 3775.31631273037 and
mean sampling weight is 3775.4637608826
Warning in svrepdesign.default(data = analysis_data, weights = ~PWEIGHT, :
scale (n-1)/n not provided: guessing from weights
# Calculate state-level estimates
state_estimates <- svyby(
  formula = ~vaccinated,
  by = ~EST_ST,
  design = svy_design,
  FUN = svymean,
  keep.var = FALSE
) %>%
  rename(vaccination_rate = statistic)  

state_estimates
   EST_ST vaccination_rate
01     01        0.7437622
02     02        0.7126474
04     04        0.7895504
05     05        0.7118875
06     06        0.8795519
08     08        0.8465841
09     09        0.8761389
10     10        0.8741673
11     11        0.9349731
12     12        0.7736586
13     13        0.7204214
15     15        0.8824271
16     16        0.7138166
17     17        0.8436551
18     18        0.7581182
19     19        0.7838675
20     20        0.7749813
21     21        0.7568192
22     22        0.6725276
23     23        0.8363105
24     24        0.8587940
25     25        0.9072533
26     26        0.7924953
27     27        0.8779015
28     28        0.6405879
29     29        0.7850180
30     30        0.7659171
31     31        0.8084022
32     32        0.7952617
33     33        0.8463384
34     34        0.8792668
35     35        0.8591901
36     36        0.8299612
37     37        0.8196761
38     38        0.7040710
39     39        0.7639145
40     40        0.7369978
41     41        0.8448790
42     42        0.8240156
44     44        0.8767355
45     45        0.7329581
46     46        0.7454531
47     47        0.7097751
48     48        0.7835179
49     49        0.8052163
50     50        0.9105776
51     51        0.8800049
53     53        0.8450497
54     54        0.7027265
55     55        0.8148422
56     56        0.6005584
# Calculate statistics
mean_rate <- mean(state_estimates$vaccination_rate, na.rm = TRUE)
median_rate <- median(state_estimates$vaccination_rate, na.rm = TRUE)
sd_rate <- sd(state_estimates$vaccination_rate, na.rm = TRUE)

# Print results
cat("State-level vaccination statistics:\n")
State-level vaccination statistics:
cat(sprintf("Mean: %.2f%%\n", mean_rate * 100))
Mean: 79.67%
cat(sprintf("Median: %.2f%%\n", median_rate * 100))
Median: 79.53%
cat(sprintf("Standard deviation: %.2f%%\n", sd_rate * 100))
Standard deviation: 7.37%
  1. We have provided an Excel with daily state-level vaccination rates throughout a period of time in 2021 (Source: CDC). Using the vaccination rate for July 1, 2021, merge this external source into your estimates of the state-level vaccination rate from the HPS. You might find it helpful to use the state FIPS code crosswalk in the second sheet of the Excel.
  2. Does the HPS provide accurate estimates of the true state-level vaccination rate? Specifically, report the 3 states whose estimate is most accurate and the 3 states whose estimate is least accurate.

Most Accurate States (smallest absolute error):

Vermont – HPS: 91.06%, CDC: 74.0%, Absolute Error: 17.06%

Maine – HPS: 83.63%, CDC: 66.5%, Absolute Error: 17.13%

Hawaii – HPS: 88.24%, CDC: 69.9%, Absolute Error: 18.34%

Least Accurate States (largest absolute error):

Missouri – HPS: 78.50%, CDC: 45.0%, Absolute Error: 33.50%

North Carolina – HPS: 81.97%, CDC: 48.7%, Absolute Error: 33.27%

District of Columbia – HPS: 93.50%, CDC: 61.3%, Absolute Error: 32.20%

library(readxl)
library(dplyr)

# Read CDC data (7/1/2021)
cdc_data_raw <- read_excel("cdc_vaccination_rates.xlsx", sheet = "CDC_Vax") %>%
  mutate(
    Date = as.Date(Date, format = "%m/%d/%Y"),
    Location = toupper(trimws(Location))  # 標準化 Location
  ) %>%
  filter(Date == as.Date("2021-07-01")) %>%
  select(Location, cdc_vax_rate = Administered_Dose1_Pop_Pct)

# Read sheet 2
fips_crosswalk <- read_excel("cdc_vaccination_rates.xlsx", sheet = "Sheet 2") %>%
  mutate(
    FIPS_Code = as.character(`FIPS Code`),
    Location = toupper(trimws(Location)),
    State = trimws(State)
  ) %>%
  select(FIPS_Code, State, Location)

# --- 3. combine HPS and fips_crosswalk and CDC

hps_estimates <- state_estimates %>%
  mutate(FIPS_Code = as.character(EST_ST)) %>%
  left_join(fips_crosswalk, by = "FIPS_Code") %>%
  mutate(hps_vax_rate = vaccination_rate * 100) %>%  
  select(State, hps_vax_rate) %>%
  drop_na()

cdc_merged <- cdc_data_raw %>%
  left_join(fips_crosswalk, by = "Location") %>%
  select(State, cdc_vax_rate) %>%
  drop_na()

comparison_df <- hps_estimates %>%
  inner_join(cdc_merged, by = "State") %>%
  mutate(
    absolute_error = abs(hps_vax_rate - cdc_vax_rate),
    percent_error = absolute_error / cdc_vax_rate * 100
  )

# Result

cat("\n3 Most Accurate States:\n")

3 Most Accurate States:
print(
  comparison_df %>%
    arrange(absolute_error) %>%
    head(3) %>%
    select(State, HPS_Estimate = hps_vax_rate, CDC_Truth = cdc_vax_rate,
           Absolute_Error = absolute_error, Percent_Error = percent_error)
)
    State HPS_Estimate CDC_Truth Absolute_Error Percent_Error
1 Vermont     91.05776      74.0       17.05776      23.05102
2   Maine     83.63105      66.5       17.13105      25.76097
3  Hawaii     88.24271      69.9       18.34271      26.24135
cat("\n3 Least Accurate States:\n")

3 Least Accurate States:
print(
  comparison_df %>%
    arrange(desc(absolute_error)) %>%
    head(3) %>%
    select(State, HPS_Estimate = hps_vax_rate, CDC_Truth = cdc_vax_rate,
           Absolute_Error = absolute_error, Percent_Error = percent_error)
)
                 State HPS_Estimate CDC_Truth Absolute_Error Percent_Error
1             Missouri     78.50180      45.0       33.50180      74.44845
2       North Carolina     81.96761      48.7       33.26761      68.31132
3 District of Columbia     93.49731      61.3       32.19731      52.52416
  1. Create a plot that shows the relationship between HPS state vaccination rate and the vaccination rate reported from the CDC.

    According to the plot, HPS tends to overestimate vaccination rates.

library(ggplot2)
library(dplyr)

# First ensure your data is properly formatted
comparison_df <- comparison_df %>%
  filter(State %in% c(state.name, "District of Columbia")) %>%
  mutate(State = factor(State))

# Create the plot with state abbreviations to reduce clutter
state_abbs <- c(state.abb, "DC")
names(state_abbs) <- c(state.name, "District of Columbia")

ggplot(comparison_df, aes(x = cdc_vax_rate, y = hps_vax_rate)) +
  geom_point(color = "blue", size = 3) +
  geom_text(
    aes(label = state_abbs[State]), 
    vjust = -0.8, 
    size = 3, 
    check_overlap = TRUE  # Prevents label overlapping
  ) +
  geom_abline(
    slope = 1, 
    intercept = 0, 
    linetype = "dashed", 
    color = "red", 
    linewidth = 0.7
  ) +
  labs(
    x = "CDC-Reported Vaccination Rate (%)",
    y = "HPS-Estimated Vaccination Rate (%)",
    title = "Comparison of HPS vs CDC State Vaccination Rates",
    subtitle = "Dashed red line represents perfect agreement"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  ) +
  scale_x_continuous(limits = c(0, 100)) +
  scale_y_continuous(limits = c(0, 100))

  1. Using your preferred package for synthetic control and the CDC dataset, estimate the effect of Florida’s Executive Order banning vaccine passports (Issued April 2, 2021) on vaccination rates. Write up a brief description of the results (no more than 1-2 paragraphs) and include any plots that might help illustrate your findings.

    Contrary to initial expectations that banning vaccine passports would decrease vaccination rates, the results reveal a surprising positive effect. The average treatment effect was +1.51 percentage points (not negative), meaning Florida’s vaccination rate was 1.51 percentage points higher than its synthetic control after the ban. As shown in the gap plot, while there was a brief negative dip immediately following April 2, 2021 (possibly reflecting initial uncertainty), Florida’s vaccination rate quickly exceeded the synthetic control by early May. The positive gap continued to widen, reaching over 3 percentage points by July 2021. This counterintuitive finding suggests that prohibiting vaccine mandates may have paradoxically increased vaccine uptake, possibly by reducing resistance among individuals who opposed coercion, or by fostering a sense of voluntary choice that made vaccination more appealing to hesitant populations.

library(Synth)
##
## Synth Package: Implements Synthetic Control Methods.
## See https://web.stanford.edu/~jhain/synthpage.html for additional information.
library(ggplot2)
library(dplyr)
library(readxl)

# Load and prepare data
cdc_panel <- read_excel("cdc_vaccination_rates.xlsx", sheet = "CDC_Vax") %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"),
         Location = toupper(trimws(Location))) %>%
  left_join(fips_crosswalk, by = "Location") %>%
  select(State, Date, Administered_Dose1_Pop_Pct) %>%
  filter(!is.na(State))

# Convert to data.frame (Synth doesn't work well with tibbles)
cdc_panel <- as.data.frame(cdc_panel)

# Create numeric identifiers
state_names <- unique(cdc_panel$State)
cdc_panel$unit.numeric <- match(cdc_panel$State, state_names)

# Create time variable
dates <- sort(unique(cdc_panel$Date))
cdc_panel$time.numeric <- match(cdc_panel$Date, dates)

# Identify Florida and control states
florida_id <- which(state_names == "Florida")
control_ids <- which(state_names != "Florida")

# Identify pre-treatment period
florida_ban_date <- as.Date("2021-04-02")
pre_period <- which(dates < florida_ban_date)

# Check data structure
print(paste("Number of states:", length(state_names)))
[1] "Number of states: 51"
print(paste("Number of time periods:", length(dates)))
[1] "Number of time periods: 138"
print(paste("Pre-treatment periods:", length(pre_period)))
[1] "Pre-treatment periods: 47"
print(paste("Florida ID:", florida_id))
[1] "Florida ID: 48"
# Reshape data to wide format for predictors
# Create lagged vaccination rates as predictors
cdc_wide <- cdc_panel %>%
  select(unit.numeric, time.numeric, Administered_Dose1_Pop_Pct) %>%
  arrange(unit.numeric, time.numeric)

# Prepare data for Synth
dataprep.out <- dataprep(
  foo = cdc_wide,
  predictors = "Administered_Dose1_Pop_Pct",
  predictors.op = "mean",
  dependent = "Administered_Dose1_Pop_Pct",
  unit.variable = "unit.numeric",
  time.variable = "time.numeric",
  treatment.identifier = florida_id,
  controls.identifier = control_ids,
  time.predictors.prior = pre_period,
  time.optimize.ssr = pre_period,
  time.plot = 1:length(dates)
)

# Run synthetic control
synth.out <- synth(dataprep.out)

X1, X0, Z1, Z0 all come directly from dataprep object.


**************** 
 optimization over w weights: computing synthtic control unit 
 


**************** 
**************** 
**************** 

MSPE (LOSS V): 0.03725125 

solution.v:
 1 

solution.w:
 0.01438288 0.01498492 0.008588144 0.01764769 0.01611909 0.01439032 0.01252901 0.01421751 0.01582097 0.01806373 0.01655958 0.01643426 0.01585174 0.01077619 0.01722208 0.0172598 0.01623748 0.01762184 0.013594 0.01285317 0.01100613 0.01738618 0.01795118 0.01435365 0.01610139 0.01381523 0.2476301 0.0112812 0.01588258 0.01481215 0.01418873 0.009052314 0.01811195 0.01424107 0.017011 0.01316164 0.01663501 0.0145318 0.01543611 0.01823165 0.01574902 0.01823388 0.01578245 0.0164219 0.01634745 0.01821058 0.01693931 0.01483572 0.01820323 0.01730098 
# Extract results for plotting
synth.tables <- synth.tab(dataprep.res = dataprep.out,
                          synth.res = synth.out)

# Get the actual and synthetic values
florida_actual <- dataprep.out$Y1plot
florida_synth <- dataprep.out$Y0plot %*% synth.out$solution.w

# Create results dataframe
results_df <- data.frame(
  Date = dates,
  Florida_Actual = as.numeric(florida_actual),
  Florida_Synthetic = as.numeric(florida_synth),
  Gap = as.numeric(florida_actual - florida_synth)
)

# Plot 1: Florida vs Synthetic Florida
p1 <- ggplot(results_df, aes(x = Date)) +
  geom_line(aes(y = Florida_Actual, color = "Florida")) +
  geom_line(aes(y = Florida_Synthetic, color = "Synthetic Florida"), 
            linetype = "dashed") +
  geom_vline(xintercept = florida_ban_date, color = "gray40", 
             linetype = "dotted") +
  labs(title = "Florida vs Synthetic Florida: First Dose Vaccination Rates",
       x = "Date",
       y = "Vaccination Rate (%)",
       color = NULL) +
  theme_minimal() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5))+ annotate("text", x = florida_ban_date + 5, 
           y = max(results_df$Florida_Actual) * 0.95,
           label = "Policy starts", hjust = 0, color = "gray40", size = 3)

print(p1)

# Plot 2: Gap between Florida and Synthetic Florida
p2 <- ggplot(results_df, aes(x = Date, y = Gap)) +
  geom_line(color = "steelblue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray60") +
  geom_vline(xintercept = florida_ban_date, color = "gray50", 
             linetype = "dotted") +
  annotate("text", x = florida_ban_date + 5, 
           y = min(results_df$Gap) * 0.95,
           label = "Policy starts", hjust = 0, color = "gray40", size = 3) +
  labs(title = "Vaccination Rate Gap: Florida vs Synthetic Control",
       x = "Date",
       y = "Gap (Florida - Synthetic, % points)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

print(p2)

# Print weights of control states
weights_df <- data.frame(
  State = state_names[control_ids],
  Weight = synth.out$solution.w
)
# Filter and sort manually to avoid the error
weights_df_filtered <- weights_df[weights_df$Weight > 0.01, ]
if(nrow(weights_df_filtered) > 0) {
  weights_df_filtered <- weights_df_filtered[order(-weights_df_filtered$Weight), ]
}

print("States contributing to Synthetic Florida (weights > 1%):")
[1] "States contributing to Synthetic Florida (weights > 1%):"
print(weights_df)
                  State    w.weight
1  District of Columbia 0.014382877
2                  Iowa 0.014984919
3            New Mexico 0.008588144
4                  Utah 0.017647695
5          Pennsylvania 0.016119093
6         West Virginia 0.014390319
7          North Dakota 0.012529012
8         New Hampshire 0.014217513
9              Illinois 0.015820969
10            Louisiana 0.018063734
11             Delaware 0.016559582
12             Colorado 0.016434260
13             Nebraska 0.015851740
14         South Dakota 0.010776193
15            Tennessee 0.017222082
16             Michigan 0.017259795
17           California 0.016237479
18               Nevada 0.017621839
19         Rhode Island 0.013594002
20        Massachusetts 0.012853167
21          Connecticut 0.011006129
22                 Ohio 0.017386176
23       South Carolina 0.017951176
24             Oklahoma 0.014353650
25              Wyoming 0.016101387
26              Vermont 0.013815227
27              Georgia 0.247630125
28                Texas 0.011281203
29              Alabama 0.015882585
30            Minnesota 0.014812148
31           New Jersey 0.014188730
32               Alaska 0.009052314
33              Indiana 0.018111945
34               Hawaii 0.014241067
35       North Carolina 0.017010998
36                Maine 0.013161635
37             New York 0.016635008
38              Montana 0.014531796
39             Kentucky 0.015436112
40             Missouri 0.018231654
41              Arizona 0.015749024
42          Mississippi 0.018233877
43             Virginia 0.015782452
44           Washington 0.016421897
45             Maryland 0.016347448
46             Arkansas 0.018210582
47               Kansas 0.016939307
49            Wisconsin 0.014835719
50                Idaho 0.018203229
51               Oregon 0.017300983
# Calculate average treatment effect
post_period <- which(dates >= florida_ban_date)
ate <- mean(results_df$Gap[post_period])
print(paste("Average Treatment Effect (post-ban):", round(ate, 2), "percentage points"))
[1] "Average Treatment Effect (post-ban): 1.51 percentage points"
# Calculate pre-treatment fit
pre_period_full <- which(dates < florida_ban_date)
pre_rmse <- sqrt(mean(results_df$Gap[pre_period_full]^2))
print(paste("Pre-treatment RMSE:", round(pre_rmse, 3)))
[1] "Pre-treatment RMSE: 0.193"
# Save plots
ggsave("florida_synth_control.png", p1, width = 10, height = 6)
ggsave("florida_synth_gap.png", p2, width = 10, height = 6)