📄 Project Overview

This report details the process of constructing an analytic dataset from the National Health and Aging Trends Study (NHATS) to examine the relationship between financial strain and dementia classification. Data were drawn from Rounds 1, 5, 6, and 7 of NHATS and harmonized to build a clean, merged dataset suitable for analysis. We restricted the cohort to sample-person respondents living in the community, excluding those in nursing homes, assisted living facilities, or other institutions.

Round 6 served as the anchor wave because it contained the study’s primary exposure variable, Financial Strain, derived from four items about whether participants skipped meals or struggled to pay for rent, utilities, or medical care. Participants were coded as Any Strain if they answered “Yes” to any item, No Strain only if they answered “No” to all four items, and Missing if no item was answered “Yes” and at least one response was refused, unknown, or inapplicable.

Round 7 provided the study’s primary outcome, dementia classification, combining clinician diagnosis with cognitive testing. Participants were classified as Probable Dementia if a diagnosis was confirmed, Possible Dementia if ≥2 cognitive domains (memory, orientation, executive function) were impaired, No Dementia if dementia was ruled out and fewer than two domains were impaired, or Missing if classification was not possible.

To support covariate completeness, demographic and socioeconomic data were merged from Rounds 1 and 5, with Round 5 values prioritized and Round 1 used to backfill missing values. The resulting FullData dataset retained all Round 6 participants, preserving expected structural missingness from other waves. From this, an analysis dataset called CleanData was created by excluding anyone missing either the exposure or outcome, while retaining explicit “Missing” categories for other covariates to ensure transparency in descriptive analyses and reporting.


# --- Load Libraries ---
library(tidyverse)   # Data wrangling
library(haven)       # Read SAS files
library(janitor)     # Clean variable names
library(skimr)       # Data inspection
library(arsenal)    # For beautiful Table 1s
library(Hmisc)       # for label()

🛠 Helper Definitions

Before I started cleaning NHATS data, I defined a few helper objects and functions to standardize how I handled missing data codes and factors across the dataset.

  • First, I created vectors for NHATS missing data codes (e.g., -1, -7) and their labels.
  • Next, I set up a standard Yes/No structure that I reuse for multiple variables.
  • Finally, I wrote a helper function, safe_collapse(), that safely collapses all the special missing codes into one single “Missing” level for any factor variable.
# Missing data codes used across NHATS
special_missing_levels <- c("-1", "-7", "-8", "-9")
special_missing_labels <- c("Inapplicable", "Refused", "Don’t know", "Missing")

# Common Yes/No structure across NHATS
yes_no_levels <- c("1", "2", special_missing_levels)
yes_no_labels <- c("Yes", "No", special_missing_labels)

# Convert special missing codes to numeric for case_when logic
special_missing_numeric <- as.numeric(special_missing_levels)

# 🛠 Helper: Collapse special missing levels into a single “Missing” factor level.
safe_collapse <- function(x, missing_levels = special_missing_labels) {
  if (!is.factor(x)) {
    return(x) # Do not attempt to collapse if it's not a factor
  }
  valid <- intersect(levels(x), missing_levels)
  if (length(valid) > 0) {
    x |> fct_collapse(Missing = valid) |> fct_explicit_na("Missing")
  } else {
    x |> fct_explicit_na("Missing")
  }
}
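To make the intended result of safe_collapse() concrete, here is a small dependency-free sketch in base R that performs the same kind of collapse on a toy Yes/No factor. The function name collapse_missing_base() and the toy values are assumptions made for illustration; they are not part of the pipeline. It can serve as a cross-check, for instance if fct_explicit_na() raises a deprecation warning (forcats 1.0.0 deprecated it in favor of fct_na_value_to_level()).

```r
# Toy factor built with the same codes as the NHATS Yes/No items
codes  <- c("1", "2", "-1", "-7", "-8", "-9")
labels <- c("Yes", "No", "Inapplicable", "Refused", "Don't know", "Missing")
x <- factor(c("1", "2", "-7", "-8", NA, "1"), levels = codes, labels = labels)

# Hypothetical base-R stand-in for safe_collapse() (illustration only)
collapse_missing_base <- function(x, missing_labels) {
  if (!is.factor(x)) return(x)          # leave non-factors untouched
  lv <- levels(x)
  lv[lv %in% missing_labels] <- "Missing"
  levels(x) <- lv                       # duplicated names merge into one level
  if (!"Missing" %in% levels(x)) levels(x) <- c(levels(x), "Missing")
  x[is.na(x)] <- "Missing"              # true NA also becomes "Missing"
  x
}

collapsed <- collapse_missing_base(x, labels[3:6])
table(collapsed)
```

Tabulating a variable before and after the collapse is a quick way to confirm that only the special codes and NA values moved into “Missing”.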

🧹 Cleaning Round 1 Data (R1)

At this stage of the project, I was focused on cleaning up Round 1 (R1) and Round 5 (R5), which would ultimately provide the covariates for my analysis.
R1’s role here wasn’t to stand alone; it was meant to supplement R5 where R5 was incomplete.

Specifically, for R1, I:
- Imported the Round 1 SP file from NHATS.
- Selected only the key variables I knew I would need for covariates.
- Filtered down to sample-person respondents living at home, because those are the cases relevant for my analysis.
- Renamed variables to have clearer, more intuitive names for later merging.
- Converted household size and income to numeric values.
- Simplified occupation codes into 6 buckets and converted several fields into factors.
- Recoded -1 (inapplicable) values of the Section 8 housing variable as “No”, since they result from skip logic, to reduce missing data for this variable.

R1_SP <- read_sas("NHATS_Round_1_SP_File.sas7bdat")


R1_SP_clean <- R1_SP |>

# Selecting the variables of interest
  select(
    spid, is1resptype, r1dresid, r1d2intvrage, r1dgender, el1higstschl,
    lf1doccpctgy, lf1occupaton, hp1ownrentot, lf1workfpay, hp1sec8pubsn,
    ip1cmedicaid, ew1progneed1, ew1progneed2, ew1progneed3, el1hlthchild,
    ia1totinc, hh1dhshldnum, rl1dracehisp
  ) |>
  
# Filtering to sample-person responders living at home
  filter(is1resptype == 1 & r1dresid == 1) |>   
  
# Renaming variables 
  rename(
    Responder1 = is1resptype,
    Residential1 = r1dresid,
    Age1 = r1d2intvrage,
    Gender1 = r1dgender,
    Education1 = el1higstschl,
    Occupation1 = lf1doccpctgy,
    EverWorked1 = lf1occupaton,
    HomeOwnership1 = hp1ownrentot,
    HouseholdSize1 = hh1dhshldnum,
    HouseholdIncome1 = ia1totinc,
    RetirementStatus1 = lf1workfpay,
    Section81 = hp1sec8pubsn,
    Medicaid1 = ip1cmedicaid,
    FoodAssist11 = ew1progneed1,
    FoodAssist21 = ew1progneed2,
    FoodAssist31 = ew1progneed3,
    ChildhoodHealth1 = el1hlthchild,
    RaceEthnicity1 = rl1dracehisp
  ) |>
  mutate(
    
# changing household size and income to numeric variables
    HouseholdSize1 = case_when(
      as.character(HouseholdSize1) %in% special_missing_levels ~ NA_real_,
      TRUE ~ as.numeric(as.character(HouseholdSize1)) # Ensure conversion from potentially factor/character
    ),
    HouseholdIncome1 = case_when(
      as.character(HouseholdIncome1) %in% special_missing_levels ~ NA_real_,
      TRUE ~ as.numeric(as.character(HouseholdIncome1)) # Ensure conversion from potentially factor/character
    ),
  
    
# Simplifying occupation codes into 6 buckets 
    Occupation1 = case_when(
      Occupation1 %in% 1:11 ~ "1",       # Management / Professional
      Occupation1 %in% 12:15 ~ "2",      # Service
      Occupation1 %in% 16:17 ~ "3",      # Sales / Office
      Occupation1 %in% 18:20 ~ "4",      # Construction / Farming
      Occupation1 %in% 21:23 ~ "5",      # Production
      EverWorked1 %in% c(2,3) ~ "6",     # Never worked / Homemaker
      TRUE ~ NA_character_
    ),
    
# Converting variables to labeled factors 
    Responder1 = factor(Responder1, c("1","2"), c("Sample_Person","Proxy")),
    Residential1 = factor(Residential1, c("1","2","3","4","5", special_missing_levels),
                          c("Home/apartment","Retirement community","Assisted living","Nursing home","Other institution", special_missing_labels)),
    Age1 = factor(Age1, c("1","2","3","4","5","6", special_missing_levels),
                  c("65–69","70–74","75–79","80–84","85–89","90+", special_missing_labels)),
    Gender1 = factor(Gender1, c("1","2"), c("Male","Female")),
    Education1 = factor(Education1, c("1","2","3","4","5","6","7","8","9", special_missing_levels),
                        c("No schooling","1–8th grade","9–12 (no diploma)","HS grad",
                          "Vocational","Some college","Associate","Bachelor","Master/PhD", special_missing_labels)),
    Occupation1 = factor(Occupation1, c("1","2","3","4","5","6"),
                         c("Management/Professional","Service","Sales/Office",
                           "Construction/Farming","Production","Homemaker")),
    HomeOwnership1 = factor(HomeOwnership1, c("1","2","3", special_missing_levels),
                            c("Own","Rent","Other", special_missing_labels)),
    RetirementStatus1 = factor(RetirementStatus1, c("1","2","3", special_missing_levels),
                               c("Yes","No","Retired", special_missing_labels)),
    Section81 = factor(Section81, yes_no_levels, yes_no_labels),
    Medicaid1 = factor(Medicaid1, yes_no_levels, yes_no_labels),
    FoodAssist11 = factor(FoodAssist11, yes_no_levels, yes_no_labels),
    FoodAssist21 = factor(FoodAssist21, yes_no_levels, yes_no_labels),
    FoodAssist31 = factor(FoodAssist31, yes_no_levels, yes_no_labels),
    ChildhoodHealth1 = factor(ChildhoodHealth1, c("1","2","3","4","5", special_missing_levels),
                              c("Excellent","Very good","Good","Fair","Poor", special_missing_labels)),
    RaceEthnicity1 = factor(RaceEthnicity1)
  ) |>
  
  # Handling -1s "inapplicable" as "no" for Section 8 housing
  mutate(Section81 = fct_recode(Section81, "No" = "Inapplicable"))

🧽 Cleaning Round 5 Data (R5)

After cleaning up Round 1 (R1), I shifted my focus to Round 5 (R5). This was the core dataset I relied on for the covariates used for analysis, since R5 represents a more current snapshot of the NHATS cohort of interest.

I used R1 mainly as a supplement, but R5 needed a full clean on its own first.

Here’s what I did for R5 (mirroring R1):
- Imported the Round 5 SP file.
- Selected the key variables that would ultimately serve as covariates in my analysis.
- Filtered the dataset to include only sample-person respondents living at home (just as I did for R1).
- Renamed variables for consistency and readability when merging across rounds.
- Converted household size and income into numeric format to handle NHATS’ special missing codes.
- Collapsed the occupation categories into simplified buckets and built out factors for categorical fields like education, marital status, and Section 8 status.
- Ensured everything was factorized and labeled consistently for later merging.
- Recoded -1 (inapplicable) values of the Section 8 housing variable as “No”, since they result from skip logic, to reduce missing data for this variable.

R5_SP <- read_sas("NHATS_Round_5_SP_File_v2.sas7bdat")


R5_SP_clean <- R5_SP |>

# Selecting the variables of interest
  select(
    spid, is5resptype, r5dresid, r5dcontnew, r5d2intvrage, r5dgender, el5higstschl,
    lf5doccpctgy, lf5occupaton, hp5ownrentot, lf5workfpay, hp5sec8pubsn,
    ip5cmedicaid, ew5progneed1, ew5progneed2, ew5progneed3, el5hlthchild,
    hh5dmarstat, ia5totinc, hh5dhshldnum, fl5newsample, rl5dracehisp
  ) |>
  
# Filtering to sample-person responders living at home
  filter(is5resptype == 1 & r5dresid == 1) |>
  
# Renaming variables 
  rename(
    Responder5 = is5resptype,
    Residential5 = r5dresid,
    NewPerson = r5dcontnew,
    Age = r5d2intvrage,
    Gender = r5dgender,
    Education = el5higstschl,
    Occupation = lf5doccpctgy,
    EverWorked = lf5occupaton,
    HomeOwnership = hp5ownrentot,
    HouseholdSize = hh5dhshldnum,
    HouseholdIncome = ia5totinc,
    RetirementStatus = lf5workfpay,
    Section8 = hp5sec8pubsn,
    Medicaid = ip5cmedicaid,
    FoodAssist1 = ew5progneed1,
    FoodAssist2 = ew5progneed2,
    FoodAssist3 = ew5progneed3,
    ChildhoodHealth = el5hlthchild,
    MaritalStatus = hh5dmarstat,
    Newsample = fl5newsample,
    RaceEthnicity = rl5dracehisp
  ) |>
  mutate(
    
# changing household size and income to numeric variables
    HouseholdSize = case_when(
      as.character(HouseholdSize) %in% special_missing_levels ~ NA_real_,
      TRUE ~ as.numeric(as.character(HouseholdSize))
    ),
    HouseholdIncome = case_when(
      as.character(HouseholdIncome) %in% special_missing_levels ~ NA_real_,
      TRUE ~ as.numeric(as.character(HouseholdIncome))
    ),
    
# Simplifying occupation codes into 7 buckets (the 6 used for R1 plus "Not working/retired" for inapplicable)
    Occupation = case_when(
      Occupation %in% 1:11 ~ "1",
      Occupation %in% 12:15 ~ "2",
      Occupation %in% 16:17 ~ "3",
      Occupation %in% 18:20 ~ "4",
      Occupation %in% 21:23 ~ "5",
      EverWorked %in% c(2,3) ~ "6",
      Occupation == -1 ~ "7",
      TRUE ~ NA_character_
    ),
    
# Converting variables to labeled factors 
    Responder5 = factor(Responder5, c("1","2"), c("Sample_Person","Proxy")),
    Residential5 = factor(Residential5, c("1","2","3","4","5", special_missing_levels),
                          c("Home/apartment","Retirement community","Assisted living","Nursing home","Other institution", special_missing_labels)),
    NewPerson = factor(NewPerson, yes_no_levels, yes_no_labels),
    Age = factor(Age, c("1","2","3","4","5","6", special_missing_levels),
                 c("65–69","70–74","75–79","80–84","85–89","90+", special_missing_labels)),
    Gender = factor(Gender, c("1","2"), c("Male","Female")),
    Education = factor(Education, c("1","2","3","4","5","6","7","8","9", special_missing_levels),
                       c("No schooling","1–8th grade","9–12 (no diploma)","HS grad",
                         "Vocational","Some college","Associate","Bachelor","Master/PhD", special_missing_labels)),
    Occupation = factor(Occupation, c("1","2","3","4","5","6","7"),
                        c("Management/Professional","Service","Sales/Office","Construction/Farming","Production","Homemaker","Not working/retired")),
    HomeOwnership = factor(HomeOwnership, c("1","2","3", special_missing_levels),
                           c("Own","Rent","Other", special_missing_labels)),
    RetirementStatus = factor(RetirementStatus, c("1","2","3", special_missing_levels),
                              c("Yes","No","Retired", special_missing_labels)),
    Section8 = factor(Section8, yes_no_levels, yes_no_labels),
    Medicaid = factor(Medicaid, yes_no_levels, yes_no_labels),
    FoodAssist1 = factor(FoodAssist1, yes_no_levels, yes_no_labels),
    FoodAssist2 = factor(FoodAssist2, yes_no_levels, yes_no_labels),
    FoodAssist3 = factor(FoodAssist3, yes_no_levels, yes_no_labels),
    ChildhoodHealth = factor(ChildhoodHealth, c("1","2","3","4","5", special_missing_levels),
                             c("Excellent","Very good","Good","Fair","Poor", special_missing_labels)),
    MaritalStatus = factor(MaritalStatus, c("1","2","3","4","5","6", special_missing_levels),
                           c("Married","Living with Partner","Separated","Divorced","Widowed","Never married", special_missing_labels)),
    Newsample = factor(Newsample, yes_no_levels, yes_no_labels),
    RaceEthnicity = factor(RaceEthnicity)
  ) |>
  
  # Handling -1s "inapplicable" as "no" for Section 8 housing
  mutate(Section8 = fct_recode(Section8, "No" = "Inapplicable"))
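Because the Section 8 recode folds “Inapplicable” into an existing level, a before/after cross-tabulation is a cheap way to confirm that only those cases moved. This is a minimal sketch on a toy factor (the values are invented), using the same fct_recode() call as the pipeline.

```r
library(forcats)

# Toy Section 8 variable with the NHATS Yes/No coding
sec8 <- factor(c("1", "2", "-1", "-1", "-8"),
               levels = c("1", "2", "-1", "-7", "-8", "-9"),
               labels = c("Yes", "No", "Inapplicable", "Refused",
                          "Don't know", "Missing"))

# Same recode as above: skip-logic "Inapplicable" is treated as "No"
sec8_recoded <- fct_recode(sec8, "No" = "Inapplicable")

# Rows = original level, columns = recoded level
table(before = sec8, after = sec8_recoded)
```

In the cross-tab, every “Inapplicable” row should land in the “No” column, and all other levels should stay on the diagonal.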

🔗 Merging Round 1 and Round 5 Data

Once I had cleaned both R1 and R5, the next step was to merge them together.
Round 5 serves as the main covariate dataset, but I wanted to pull in information from Round 1 to fill in any gaps where R5 data were missing or coded as one of NHATS’s special missing values (like -1, -7, -8, or -9).

Here’s what I did:

  • I started by using the cleaned versions of both datasets: R5_SP_clean and R1_SP_clean.
  • I used left_join() to merge R1 into R5, keeping all Round 5 participants (since R5 is my primary round for covariates).
  • For each covariate, I told R to use the R1 value only if R5’s value was marked as missing.
  • After filling from R1, I standardized the missingness across the merged dataset by collapsing all special missing codes into one "Missing" category for easier viewing.
  • Finally, I kept only the “_final” columns so the dataset is clean and ready for the next step to integrate with R6 and R7 data.
# Merge Round 5 with Round 1
R5_merged <- R5_SP_clean |> 
  left_join(R1_SP_clean, by = "spid") |> 
  
# Fill from R1 if R5 is special missing (-1, -7, -8, -9)
  mutate(
    Responder_final       = if_else(Responder5 %in% special_missing_labels, Responder1, Responder5),
    Residential_final     = if_else(Residential5 %in% special_missing_labels, Residential1, Residential5),
    Age_final             = if_else(Age %in% special_missing_labels, Age1, Age),
    Gender_final          = if_else(Gender %in% special_missing_labels, Gender1, Gender),
    Education_final       = if_else(Education %in% special_missing_labels, Education1, Education),
    Occupation_final      = if_else(Occupation %in% special_missing_labels, Occupation1, Occupation),
    HomeOwnership_final   = if_else(HomeOwnership %in% special_missing_labels, HomeOwnership1, HomeOwnership),
    HouseholdSize_final   = if_else(HouseholdSize %in% special_missing_labels, HouseholdSize1, HouseholdSize),  
    HouseholdIncome_final = if_else(HouseholdIncome %in% special_missing_labels, HouseholdIncome1, HouseholdIncome),
    RetirementStatus_final = if_else(RetirementStatus %in% special_missing_labels, RetirementStatus1, RetirementStatus),
    Section8_final        = if_else(Section8 %in% special_missing_labels, Section81, Section8),
    Medicaid_final        = if_else(Medicaid %in% special_missing_labels, Medicaid1, Medicaid),
    FoodAssist1_final     = if_else(FoodAssist1 %in% special_missing_labels, FoodAssist11, FoodAssist1),
    FoodAssist2_final     = if_else(FoodAssist2 %in% special_missing_labels, FoodAssist21, FoodAssist2),
    FoodAssist3_final     = if_else(FoodAssist3 %in% special_missing_labels, FoodAssist31, FoodAssist3),
    ChildhoodHealth_final = if_else(ChildhoodHealth %in% special_missing_labels, ChildhoodHealth1, ChildhoodHealth),
    RaceEthnicity_final   = if_else(RaceEthnicity %in% special_missing_labels, RaceEthnicity1, RaceEthnicity),
    
# Include Round 5 only variables
    NewPerson_final       = NewPerson,
    MaritalStatus_final   = MaritalStatus,
    Newsample_final       = Newsample
  )

# Collapse missing levels consistently into one “Missing” level
R5_merged <- R5_merged |> 
  mutate(across(ends_with("_final"), safe_collapse, .names = "{.col}"))

# Keep only final columns for downstream merges
R5_merged <- R5_merged |> 
  select(spid, ends_with("_final"))
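The backfill rule can be checked on a toy example. The sketch below uses two invented tibbles (r5_toy and r1_toy are hypothetical names) and character columns rather than factors to keep it short; it applies the same left_join() plus if_else() pattern as the merge above.

```r
library(dplyr)

missing_labels <- c("Inapplicable", "Refused", "Don't know", "Missing")

# Toy Round 5 and Round 1 data; participant 3 has no Round 1 record
r5_toy <- tibble(spid = c(1, 2, 3),
                 Education = c("HS grad", "Refused", "Bachelor"))
r1_toy <- tibble(spid = c(1, 2),
                 Education1 = c("HS grad", "Some college"))

merged_toy <- r5_toy |>
  left_join(r1_toy, by = "spid") |>
  mutate(
    # Use the R1 value only when the R5 value carries a special missing label
    Education_final = if_else(Education %in% missing_labels,
                              Education1, Education)
  )

merged_toy
```

Participant 2 picks up the Round 1 value, while participants 1 and 3 keep their Round 5 values. Note that the rule keys on the special missing labels, so a true NA in a Round 5 column would not trigger a backfill.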

🧰 Cleaning Round 6 Data (R6)

After merging R1 and R5, I moved on to Round 6 (R6).
This round provided the key exposure variable for my analysis: Financial Strain.

Here’s how I approached cleaning R6:

  • Imported the Round 6 SP file.
  • Selected only the 4 variables related to basic needs strain (e.g., skipping meals, struggling to pay rent, utilities, or medical bills).
  • Filtered the dataset to include only sample-person respondents living at home.
  • Renamed the variables for clarity (e.g., ew6mealskip1 became SkippedMeals).
  • Converted these items into factors using the Yes/No coding already defined earlier in the helper objects.
  • Built a new variable called FinancialStrainFlag:
    • Marked participants as “Any Strain” if they answered Yes to any of the four items.
    • Marked them as “No Strain” if they explicitly answered No to all four items.
    • Marked them as “Missing” if no item was answered Yes and at least one response was missing, refused, or inapplicable.
R6_SP <- read_sas("NHATS_Round_6_SP_File_V2.sas7bdat")

R6_SP_clean <- R6_SP |>
  
# Selecting the variables of interest
  select(spid, is6resptype, r6dresid,
         ew6mealskip1, ew6nopayhous, ew6nopayutil, ew6nopaymed) |>
  
#Filtering to sample-person responders living at home
  filter(is6resptype == 1 & r6dresid == 1) |>

# Renaming variables 
  rename(
    SkippedMeals = ew6mealskip1,
    UnableToPayRent = ew6nopayhous,
    UnableToPayUtilities = ew6nopayutil,
    UnableToPayMedical = ew6nopaymed
  ) |>
  
# Convert all 4 financial strain indicators into labeled factors at once
  mutate(
    across(c(SkippedMeals, UnableToPayRent, UnableToPayUtilities, UnableToPayMedical),
           ~ factor(.x, yes_no_levels, yes_no_labels)),
    
# Creating new Financial Strain Flag variable
    FinancialStrainFlag = case_when(
      
# If ANY of the four is YES, classify as Any Strain
      SkippedMeals == "Yes" | UnableToPayRent == "Yes" |
        UnableToPayUtilities == "Yes" | UnableToPayMedical == "Yes" ~ "Any Strain",
      
# If ALL FOUR are explicitly NO, classify as No Strain
      SkippedMeals == "No" & UnableToPayRent == "No" &
        UnableToPayUtilities == "No" & UnableToPayMedical == "No" ~ "No Strain",
      
# If we get here, at least one answer is missing/inapplicable/refused
      TRUE ~ "Missing"
    ) |> 
  
# Treat FinancialStrainFlag as a factor variable with three levels
  factor(c("No Strain", "Any Strain", "Missing"))
  )
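The three-way rule is easiest to verify on a handful of invented response patterns. This sketch reuses the same case_when() logic on a toy tibble (strain_toy is a hypothetical name) and includes a refused item and a true NA to show that both fall through to “Missing” unless another item is “Yes”.

```r
library(dplyr)

strain_toy <- tibble(
  SkippedMeals         = c("No", "No",      "Yes",     "No"),
  UnableToPayRent      = c("No", "Refused", "Refused", "No"),
  UnableToPayUtilities = c("No", "No",      "No",      NA),
  UnableToPayMedical   = c("No", "No",      "No",      "No")
) |>
  mutate(FinancialStrainFlag = case_when(
    # Any single "Yes" wins, even if other items are refused or NA
    SkippedMeals == "Yes" | UnableToPayRent == "Yes" |
      UnableToPayUtilities == "Yes" | UnableToPayMedical == "Yes" ~ "Any Strain",
    # "No Strain" requires an explicit "No" on all four items
    SkippedMeals == "No" & UnableToPayRent == "No" &
      UnableToPayUtilities == "No" & UnableToPayMedical == "No" ~ "No Strain",
    # Everything else: at least one unusable answer and no "Yes"
    TRUE ~ "Missing"
  ))

strain_toy$FinancialStrainFlag
```

The third row shows the asymmetry in the rule: one “Yes” is enough for “Any Strain” despite a refusal, whereas one refusal is enough to block “No Strain”.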

🧴 Cleaning Round 7 Data (R7)

Next, I turned to Round 7 (R7), used to create the dementia classification outcome in the analysis.

R7 contained both:
- A clinician-verified dementia diagnosis field.
- Multiple cognitive test results across three domains: memory, orientation, and executive function.

Here’s what I did step by step:

  • Imported the R7 SP file and selected the variables I needed for cognition and dementia classification.
  • Filtered the data to keep only participants living at home, since those are my analytic population. However, I allowed proxy responders, to account for individuals who may have developed dementia since R6 and were unable to answer for themselves.
  • Renamed variables so their purpose was clear (e.g., cg7dwrdimmrc became MemoryImmediate).
  • Calculated a memory score by combining immediate and delayed recall scores (assigning 0 if a test result was missing).
  • Calculated an orientation score by recoding orientation items (like date, president, and vice president questions) into correct/incorrect and then summing the incorrect answers.
  • Flagged impairment in executive function based on the clock-draw task.
  • Counted how many of the three domains (memory, orientation, executive) were impaired for each participant.
  • Created the final dementia classification (dementia_class) using this logic:
    • Probable Dementia if the NHATS clinician diagnosis confirmed dementia.
    • Possible Dementia if there was no formal diagnosis but two or more domains were impaired.
    • No Dementia if the diagnosis explicitly ruled it out, fewer than two domains were impaired, and no domain was missing.
    • Missing if Probable Dementia was not confirmed and the diagnosis field or any of the memory, orientation, or executive function data were missing.
R7_SP <- read_sas("NHATS_Round_7_SP_File.sas7bdat")

R7_SP_clean <- R7_SP |>
  
# Select variables of interest
  select(spid, is7resptype, r7dresid,
         hc7disescn9, cg7dwrdimmrc, cg7dwrddlyrc,
         cg7todaydat1, cg7todaydat2, cg7todaydat3, cg7todaydat4,
         cg7presidna1, cg7presidna3, cg7vpname1, cg7vpname3,
         cg7dclkdraw) |>
  
# Filter to only community dwelling respondents, but did not filter out proxy respondents this time. 
  filter(r7dresid == 1) |>   
  
# Rename variables
  rename(
    DementiaDx = hc7disescn9,           # Clinician-verified dementia diagnosis
    MemoryImmediate = cg7dwrdimmrc,     # Immediate recall
    MemoryDelayed = cg7dwrddlyrc,       # Delayed recall
    ExecutiveDraw = cg7dclkdraw         # Clock draw
  ) |>

  mutate(
# Calculate 🧠 MEMORY SCORE: Immediate + Delayed Recall
# If either is missing, we assign 0 points for that part (as NHATS coding convention).
    memory_score = if_else(MemoryImmediate >= 0, MemoryImmediate, 0) +
      if_else(MemoryDelayed >= 0, MemoryDelayed, 0),
    memory_impaired = memory_score <= 3,  # Binary: <=3 points means impaired
    
# Calculate 🧭 ORIENTATION SCORE: Recode correct/incorrect items (1=correct, 2=incorrect)
    across(c(cg7todaydat1, cg7todaydat2, cg7todaydat3, cg7todaydat4,
             cg7presidna1, cg7presidna3, cg7vpname1, cg7vpname3),
           ~ case_when(.x == 2 ~ 1,     # Incorrect = 1 point (impaired)
                       .x == 1 ~ 0,     # Correct = 0 points
                       TRUE ~ NA_real_), 
           .names = "{.col}_rec"),
    
    orientation_score = rowSums(across(ends_with("_rec")), na.rm = TRUE),
    orientation_impaired = orientation_score >= 5,   # ≥5 wrong answers = impaired
    
# Calculate 🕰 EXECUTIVE FUNCTION: Clock draw task
    exec_impaired = ExecutiveDraw %in% c(0, 1),  # 0/1 scores indicate impairment
    
# 🏷 Count how many domains (memory/orientation/executive) are impaired
    impaired_domains = rowSums(across(c(memory_impaired, orientation_impaired, exec_impaired)), 
                               na.rm = TRUE),
    
#  Classify dementia 
    dementia_class = case_when(
      
  # 1️⃣ **Probable Dementia**: If the NHATS clinician diagnosis says so, trust it
      DementiaDx == 1 ~ "Probable Dementia",
      
  # 2️⃣ **Missing**: If DementiaDx is missing/refused/inapplicable (-1, -7, -8, -9)
      DementiaDx %in% special_missing_numeric ~ "Missing",
      
  # 3️⃣ **Possible Dementia**: No formal diagnosis, but ≥2 impaired domains
      DementiaDx != 1 & impaired_domains >= 2 ~ "Possible Dementia",
      
  # 4️⃣ **No Dementia**: Explicit diagnosis of “No dementia” and all three domains are fully observed (none missing) AND <2 impaired domains
      DementiaDx == 2 & impaired_domains < 2 &
        !is.na(memory_impaired) & !is.na(orientation_impaired) & !is.na(exec_impaired) ~ "No Dementia",
      
  # 5️⃣ **Missing**: Anything else (e.g., partial data, can’t confidently classify)
      TRUE ~ "Missing"
  
  # Treat dementia_class as a factor variable with four levels
    ) |> factor(c("No Dementia", "Possible Dementia", "Probable Dementia", "Missing"))
  )
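One consequence of the zero-fill rule for the memory score is worth checking on toy data: a refused or skipped recall test (a negative code) contributes 0 points, so such a respondent is scored as memory-impaired rather than missing. The sketch below uses invented values, and memory_observed is a hypothetical extra column, not part of the pipeline above; it simply flags whether both recall tests were actually completed.

```r
library(dplyr)

cog_toy <- tibble(
  MemoryImmediate = c(5, -7, 2),   # -7 = refused
  MemoryDelayed   = c(4, -7, 1)
) |>
  mutate(
    # Same scoring rule as above: negative codes count as 0 points
    memory_score = if_else(MemoryImmediate >= 0, MemoryImmediate, 0) +
      if_else(MemoryDelayed >= 0, MemoryDelayed, 0),
    memory_impaired = memory_score <= 3,
    # Hypothetical helper: TRUE only when both tests have real scores
    memory_observed = MemoryImmediate >= 0 & MemoryDelayed >= 0
  )

cog_toy
```

The second row is flagged as impaired with a score of 0 even though no test was completed. Cross-tabulating memory_impaired against a flag like memory_observed would show how many classifications depend on zero-filled scores.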

πŸ— Final Merge: Combining All Rounds

With all four rounds cleaned, I moved on to the final merge step.
This was where I brought everything together into one unified dataset for analysis.

Here’s what I did:

  • Started with Round 6 (R6) as the base since it contained the exposure variable (Financial Strain).
  • Left-joined the merged R1/R5 dataset so I could bring in all of the covariates.
  • Left-joined Round 7 (R7) to add the dementia classification outcome.
  • Saved that combined dataset as FullData, which keeps everyone from R6, even participants with missing exposures or outcomes.

Then I created CleanData. This step was more than just removing missing values; it also filtered out people who couldn’t contribute to the analysis.

  • I filtered out anyone without usable exposure or outcome data:
    • If FinancialStrainFlag was coded as "Missing" or was a true missing value (NA), they were removed.
    • If dementia_class was "Missing" or NA, they were removed too.

Importantly, this means that participants without Round 7 data (and therefore no dementia classification at all) were also excluded, since they wouldn’t meet the requirement of having a non-missing dementia_class.

This filtering step ensured that CleanData only included participants with both:
- A valid financial strain classification from R6, and
- A valid dementia classification from R7.

# FullData keeps everyone from R6, even if FinancialStrainFlag or dementia_class = "Missing"
FullData <- R6_SP_clean |>
  left_join(R5_merged, by = "spid") |>
  left_join(R7_SP_clean, by = "spid")

# CleanData removes people missing either the exposure (FinancialStrainFlag) or outcome (dementia_class)
CleanData <- FullData |>
  filter(
    !is.na(FinancialStrainFlag),       # drop rows that are NA (true missing)
    !is.na(dementia_class),            # drop rows that are NA (true missing)
    FinancialStrainFlag != "Missing",  # drop coded "Missing" exposures
    dementia_class != "Missing"        # drop coded "Missing" outcomes
  )
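To account for who drops out between FullData and CleanData, it can help to separate people with no Round 7 record from people whose exposure or outcome was coded "Missing". The following is a minimal sketch on invented data (r6_toy and r7_toy are hypothetical names) that mirrors the join and filter above and uses anti_join() to list the unmatched participants.

```r
library(dplyr)

r6_toy <- tibble(spid = c(101, 102, 103, 104),
                 FinancialStrainFlag = c("No Strain", "Any Strain",
                                         "Missing", "No Strain"))
r7_toy <- tibble(spid = c(101, 102, 103),
                 dementia_class = c("No Dementia", "Missing",
                                    "Possible Dementia"))

# One row per participant is assumed on both sides of the join
stopifnot(!anyDuplicated(r6_toy$spid), !anyDuplicated(r7_toy$spid))

full_toy <- left_join(r6_toy, r7_toy, by = "spid")

clean_toy <- full_toy |>
  filter(!is.na(FinancialStrainFlag), !is.na(dementia_class),
         FinancialStrainFlag != "Missing", dementia_class != "Missing")

# Participants in R6 with no R7 record at all
no_r7 <- anti_join(r6_toy, r7_toy, by = "spid")

nrow(full_toy)   # everyone from R6 is retained
nrow(clean_toy)  # only rows with usable exposure and outcome
no_r7$spid       # dropped because there is no R7 interview
```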

πŸ” Inspecting Missingness in FullData and CleanData

Before doing any analysis, I skimmed both FullData and CleanData to understand the pattern of missing values and where they come from.


skim(FullData)
Data summary
Name FullData
Number of rows 5628
Number of columns 57
_______________________
Column type frequency:
factor 24
logical 3
numeric 30
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
SkippedMeals 0 1.00 FALSE 4 No: 5522, Yes: 58, Don: 37, Ref: 11
UnableToPayRent 0 1.00 FALSE 4 No: 5457, Yes: 111, Don: 41, Ref: 19
UnableToPayUtilities 0 1.00 FALSE 4 No: 5388, Yes: 182, Don: 42, Ref: 16
UnableToPayMedical 0 1.00 FALSE 4 No: 5389, Yes: 181, Don: 42, Ref: 16
FinancialStrainFlag 0 1.00 FALSE 3 No : 5219, Any: 343, Mis: 66
Responder_final 45 0.99 FALSE 1 Sam: 5583, Pro: 0
Residential_final 45 0.99 FALSE 1 Hom: 5583, Ret: 0, Ass: 0, Nur: 0
Age_final 45 0.99 FALSE 6 70–: 1468, 75–: 1271, 80–: 1009, 65–: 856
Gender_final 45 0.99 FALSE 2 Fem: 3190, Mal: 2393
Education_final 45 0.99 FALSE 10 HS : 1450, Som: 800, Bac: 743, Mas: 722
Occupation_final 45 0.99 FALSE 8 Not: 2736, Man: 1017, Sal: 541, Pro: 458
HomeOwnership_final 45 0.99 FALSE 4 Own: 4178, Ren: 872, Oth: 426, Mis: 107
RetirementStatus_final 45 0.99 FALSE 4 Ret: 2645, No: 2081, Yes: 754, Mis: 103
Section8_final 45 0.99 FALSE 3 No: 5307, Yes: 263, Mis: 13
Medicaid_final 45 0.99 FALSE 3 No: 4746, Yes: 699, Mis: 138
FoodAssist1_final 45 0.99 FALSE 3 No: 5011, Yes: 457, Mis: 115
FoodAssist2_final 45 0.99 FALSE 3 No: 5318, Yes: 149, Mis: 116
FoodAssist3_final 45 0.99 FALSE 3 No: 5123, Yes: 343, Mis: 117
ChildhoodHealth_final 45 0.99 FALSE 6 Exc: 2685, Ver: 1481, Goo: 920, Fai: 275
RaceEthnicity_final 45 0.99 FALSE 6 1: 3898, 2: 1136, 4: 314, 3: 132
NewPerson_final 45 0.99 FALSE 2 No: 2847, Yes: 2736, Mis: 0
MaritalStatus_final 45 0.99 FALSE 7 Mar: 2745, Wid: 1702, Div: 691, Nev: 209
Newsample_final 45 0.99 FALSE 2 Yes: 2847, Mis: 2736, No: 0
dementia_class 705 0.87 FALSE 4 No : 4506, Pos: 202, Mis: 130, Pro: 85

Variable type: logical

skim_variable n_missing complete_rate mean count
memory_impaired 705 0.87 0.11 FAL: 4384, TRU: 539
orientation_impaired 705 0.87 0.07 FAL: 4603, TRU: 320
exec_impaired 705 0.87 0.04 FAL: 4741, TRU: 182

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
spid 0 1.00 15095468.64 4998327.09 1e+07 10006283 20000129 20003671 20007119 ▇▁▁▁▇
is6resptype 0 1.00 1.00 0.00 1e+00 1 1 1 1 ▁▁▇▁▁
r6dresid 0 1.00 1.00 0.00 1e+00 1 1 1 1 ▁▁▇▁▁
HouseholdSize_final 45 0.99 1.99 1.06 1e+00 1 2 2 11 ▇▁▁▁▁
HouseholdIncome_final 2317 0.59 67394.31 477048.95 0e+00 19000 35000 67000 25000000 ▇▁▁▁▁
is7resptype 705 0.87 1.01 0.12 1e+00 1 1 1 2 ▇▁▁▁▁
r7dresid 705 0.87 1.00 0.00 1e+00 1 1 1 1 ▁▁▇▁▁
DementiaDx 705 0.87 2.16 0.95 -8e+00 2 2 2 7 ▁▁▁▇▁
MemoryImmediate 705 0.87 4.74 2.06 -7e+00 4 5 6 10 ▁▁▂▇▂
MemoryDelayed 705 0.87 3.45 2.31 -7e+00 2 4 5 9 ▁▁▅▇▂
cg7todaydat1 705 0.87 1.04 0.28 -1e+00 1 1 1 2 ▁▁▁▇▁
cg7todaydat2 705 0.87 1.23 0.46 -1e+00 1 1 1 2 ▁▁▁▇▂
cg7todaydat3 705 0.87 1.05 0.29 -1e+00 1 1 1 2 ▁▁▁▇▁
cg7todaydat4 705 0.87 1.04 0.27 -1e+00 1 1 1 2 ▁▁▁▇▁
cg7presidna1 705 0.87 1.04 0.41 -7e+00 1 1 1 2 ▁▁▁▁▇
cg7presidna3 705 0.87 1.14 0.50 -7e+00 1 1 1 2 ▁▁▁▁▇
cg7vpname1 705 0.87 1.38 0.57 -7e+00 1 1 2 2 ▁▁▁▁▇
cg7vpname3 705 0.87 1.69 0.56 -7e+00 1 2 2 2 ▁▁▁▁▇
ExecutiveDraw 705 0.87 3.65 1.61 -9e+00 3 4 5 5 ▁▁▁▁▇
memory_score 705 0.87 8.32 3.67 0e+00 6 9 11 19 ▂▆▇▃▁
cg7todaydat1_rec 734 0.87 0.06 0.23 0e+00 0 0 0 1 ▇▁▁▁▁
cg7todaydat2_rec 734 0.87 0.24 0.43 0e+00 0 0 0 1 ▇▁▁▁▂
cg7todaydat3_rec 734 0.87 0.06 0.24 0e+00 0 0 0 1 ▇▁▁▁▁
cg7todaydat4_rec 734 0.87 0.05 0.23 0e+00 0 0 0 1 ▇▁▁▁▁
cg7presidna1_rec 752 0.87 0.07 0.25 0e+00 0 0 0 1 ▇▁▁▁▁
cg7presidna3_rec 752 0.87 0.17 0.37 0e+00 0 0 0 1 ▇▁▁▁▂
cg7vpname1_rec 750 0.87 0.40 0.49 0e+00 0 0 1 1 ▇▁▁▁▆
cg7vpname3_rec 750 0.87 0.72 0.45 0e+00 0 1 1 1 ▃▁▁▁▇
orientation_score 705 0.87 1.76 1.63 0e+00 1 1 2 8 ▇▆▁▁▁
impaired_domains 705 0.87 0.21 0.54 0e+00 0 0 0 3 ▇▁▁▁▁

What I saw in FullData:

  • 705 missing values for many Roundβ€―7 variables (e.g., dementia_class, MemoryImmediate, MemoryDelayed):
    These are people from Roundβ€―6 who never had a Roundβ€―7 interview at all (not surveyed, not alive, or otherwise not present in R7).
    Because R7_SP_clean was left-joined to R6, those rows remain, but every R7 variable is NA.

  • 45 missing values for many R1/R5-derived covariates (e.g., Education_final, Occupation_final):
    These are participants in R6 who weren’t present in R1 or R5.
    The left_join() kept them, but all R1/R5 variables stayed blank (NA).

  • 2,317 missing values for HouseholdIncome_final (TODO: I still need to fix this):
    This isn’t from missing interviews; it’s from NHATS survey coding.
    Some participants refused to answer, didn’t know, or weren’t asked, and those responses are collapsed to NA.

  • FullData is designed this way: it keeps everyone from Round 6 so nothing is lost prematurely, which means we expect a lot of structural missingness from other rounds.
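
The structural missingness described above can be reproduced on a toy example. This is a minimal sketch with made-up data frames (r6, r7) and values. It uses base R's merge(..., all.x = TRUE), which behaves like the left_join() used in this report, so it is an illustration rather than the actual merge code.

```r
# Toy stand-ins for the Round 6 and Round 7 files (hypothetical values).
r6 <- data.frame(
  spid = c(101, 102, 103, 104),
  FinancialStrainFlag = c("No Strain", "Any Strain", "No Strain", "No Strain")
)
r7 <- data.frame(
  spid = c(101, 103),
  MemoryImmediate = c(5, 4),
  MemoryDelayed = c(4, 2)
)

# all.x = TRUE keeps every Round 6 row, like dplyr::left_join(r6, r7, by = "spid").
full <- merge(r6, r7, by = "spid", all.x = TRUE)

# Participants with no Round 7 record are NA on every Round 7 column.
na_counts <- colSums(is.na(full))
print(na_counts)

# Flag the rows that have no Round 7 data at all.
r7_vars <- setdiff(names(r7), "spid")
full$no_r7 <- rowSums(!is.na(full[r7_vars])) == 0
print(full)
```

A flag like no_r7 makes it easy to separate "never interviewed in that round" from item-level missingness when describing the data.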

Now let’s have a look at the CleanData set.

skim(CleanData)
Data summary
Name CleanData
Number of rows 4756
Number of columns 57
_______________________
Column type frequency:
factor 24
logical 3
numeric 30
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
SkippedMeals 0 1.00 FALSE 2 No: 4709, Yes: 47, Ina: 0, Ref: 0
UnableToPayRent 0 1.00 FALSE 2 No: 4663, Yes: 93, Ina: 0, Ref: 0
UnableToPayUtilities 0 1.00 FALSE 3 No: 4601, Yes: 154, Ref: 1, Ina: 0
UnableToPayMedical 0 1.00 FALSE 2 No: 4604, Yes: 152, Ina: 0, Ref: 0
FinancialStrainFlag 0 1.00 FALSE 2 No : 4469, Any: 287, Mis: 0
Responder_final 30 0.99 FALSE 1 Sam: 4726, Pro: 0
Residential_final 30 0.99 FALSE 1 Hom: 4726, Ret: 0, Ass: 0, Nur: 0
Age_final 30 0.99 FALSE 6 70–: 1305, 75–: 1102, 80–: 842, 65–: 754
Gender_final 30 0.99 FALSE 2 Fem: 2701, Mal: 2025
Education_final 30 0.99 FALSE 10 HS : 1208, Som: 697, Bac: 636, Mas: 636
Occupation_final 30 0.99 FALSE 8 Not: 2314, Man: 872, Sal: 463, Pro: 384
HomeOwnership_final 30 0.99 FALSE 4 Own: 3597, Ren: 712, Oth: 342, Mis: 75
RetirementStatus_final 30 0.99 FALSE 4 Ret: 2218, No: 1749, Yes: 686, Mis: 73
Section8_final 30 0.99 FALSE 3 No: 4503, Yes: 215, Mis: 8
Medicaid_final 30 0.99 FALSE 3 No: 4050, Yes: 576, Mis: 100
FoodAssist1_final 30 0.99 FALSE 3 No: 4254, Yes: 392, Mis: 80
FoodAssist2_final 30 0.99 FALSE 3 No: 4530, Yes: 115, Mis: 81
FoodAssist3_final 30 0.99 FALSE 3 No: 4347, Yes: 296, Mis: 83
ChildhoodHealth_final 30 0.99 FALSE 6 Exc: 2296, Ver: 1266, Goo: 768, Fai: 227
RaceEthnicity_final 30 0.99 FALSE 6 1: 3303, 2: 987, 4: 255, 3: 109
NewPerson_final 30 0.99 FALSE 2 No: 2412, Yes: 2314, Mis: 0
MaritalStatus_final 30 0.99 FALSE 6 Mar: 2364, Wid: 1387, Div: 588, Nev: 189
Newsample_final 30 0.99 FALSE 2 Yes: 2412, Mis: 2314, No: 0
dementia_class 0 1.00 FALSE 3 No : 4473, Pos: 199, Pro: 84, Mis: 0

Variable type: logical

skim_variable n_missing complete_rate mean count
memory_impaired 0 1 0.10 FAL: 4261, TRU: 495
orientation_impaired 0 1 0.06 FAL: 4448, TRU: 308
exec_impaired 0 1 0.04 FAL: 4581, TRU: 175

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
spid 0 1.00 15103687.19 4998232.90 1e+07 10006363.2 20000129 20003683 20007119 ▇▁▁▁▇
is6resptype 0 1.00 1.00 0.00 1e+00 1.0 1 1 1 ▁▁▇▁▁
r6dresid 0 1.00 1.00 0.00 1e+00 1.0 1 1 1 ▁▁▇▁▁
HouseholdSize_final 30 0.99 1.98 1.05 1e+00 1.0 2 2 11 ▇▁▁▁▁
HouseholdIncome_final 1885 0.60 70141.54 511528.88 0e+00 19638.5 36000 70000 25000000 ▇▁▁▁▁
is7resptype 0 1.00 1.01 0.10 1e+00 1.0 1 1 2 ▇▁▁▁▁
r7dresid 0 1.00 1.00 0.00 1e+00 1.0 1 1 1 ▁▁▇▁▁
DementiaDx 0 1.00 2.03 0.50 1e+00 2.0 2 2 7 ▇▁▁▁▁
MemoryImmediate 0 1.00 4.79 2.01 -7e+00 4.0 5 6 10 ▁▁▂▇▂
MemoryDelayed 0 1.00 3.50 2.29 -7e+00 2.0 4 5 9 ▁▁▅▇▂
cg7todaydat1 0 1.00 1.04 0.26 -1e+00 1.0 1 1 2 ▁▁▁▇▁
cg7todaydat2 0 1.00 1.23 0.45 -1e+00 1.0 1 1 2 ▁▁▁▇▂
cg7todaydat3 0 1.00 1.05 0.28 -1e+00 1.0 1 1 2 ▁▁▁▇▁
cg7todaydat4 0 1.00 1.04 0.26 -1e+00 1.0 1 1 2 ▁▁▁▇▁
cg7presidna1 0 1.00 1.05 0.37 -7e+00 1.0 1 1 2 ▁▁▁▁▇
cg7presidna3 0 1.00 1.15 0.47 -7e+00 1.0 1 1 2 ▁▁▁▁▇
cg7vpname1 0 1.00 1.38 0.55 -7e+00 1.0 1 2 2 ▁▁▁▁▇
cg7vpname3 0 1.00 1.69 0.52 -7e+00 1.0 2 2 2 ▁▁▁▁▇
ExecutiveDraw 0 1.00 3.70 1.48 -9e+00 3.0 4 5 5 ▁▁▁▁▇
memory_score 0 1.00 8.41 3.64 0e+00 6.0 9 11 19 ▂▅▇▃▁
cg7todaydat1_rec 22 1.00 0.05 0.22 0e+00 0.0 0 0 1 ▇▁▁▁▁
cg7todaydat2_rec 22 1.00 0.24 0.43 0e+00 0.0 0 0 1 ▇▁▁▁▂
cg7todaydat3_rec 22 1.00 0.06 0.24 0e+00 0.0 0 0 1 ▇▁▁▁▁
cg7todaydat4_rec 22 1.00 0.05 0.22 0e+00 0.0 0 0 1 ▇▁▁▁▁
cg7presidna1_rec 29 0.99 0.07 0.25 0e+00 0.0 0 0 1 ▇▁▁▁▁
cg7presidna3_rec 29 0.99 0.17 0.37 0e+00 0.0 0 0 1 ▇▁▁▁▂
cg7vpname1_rec 27 0.99 0.40 0.49 0e+00 0.0 0 1 1 ▇▁▁▁▅
cg7vpname3_rec 27 0.99 0.71 0.45 0e+00 0.0 1 1 1 ▃▁▁▁▇
orientation_score 0 1.00 1.73 1.62 0e+00 1.0 1 2 8 ▇▆▁▁▁
impaired_domains 0 1.00 0.21 0.54 0e+00 0.0 0 0 3 ▇▁▁▁▁

What I saw in CleanData:

  • All participants missing the exposure (FinancialStrainFlag) or outcome (dementia_class) were removed.
    That’s why there are 0 missing for those two key variables here.

  • Only 30 missing values remain for R1/R5-derived covariates:
    These are likely the same participants who had no data in R1 or R5.
    They stayed because they had R6 exposure and R7 outcome data, but we couldn’t backfill their covariates.

  • 22–29 missing values on the _rec cognition items (e.g., cg7todaydat1_rec):
    These are mostly due to item-level missingness in R7.
    Some respondents skipped or refused certain test items, and NHATS coded those as special missing values (e.g., -7, -8), which I collapsed to NA.

  • 1,885 missing for HouseholdIncome_final (TODO: I still need to fix this):
    Same story as in FullData. This is a survey artifact (refusals, “don’t know,” or skipped questions).

  • CleanData eliminated missingness in the key analytic variables (exposure and outcome) but kept “real-world” missingness for some covariates and individual test items. This is expected in a complex survey merge and doesn’t need to be over-cleaned; it’s just important to explain.
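
The collapse of special codes to NA mentioned above can be sketched as follows. The raw values, the recode_item() helper, and the assumption that 1 means a correct answer and 2 an incorrect one are hypothetical and only illustrate the pattern; they are not the recode actually used for the _rec items.

```r
# Hypothetical raw item: 1 = correct, 2 = incorrect, negative = special code.
raw_item <- c(1, 2, 1, -7, 1, -1, 2, -8)

# Assumed recode: 1 -> 0 (no error), 2 -> 1 (error), anything else -> NA.
recode_item <- function(x) {
  out <- rep(NA_real_, length(x))
  out[x %in% 1] <- 0
  out[x %in% 2] <- 1
  out
}
item_rec <- recode_item(raw_item)

# Record why a value is missing before the special code is discarded.
missing_reason <- ifelse(raw_item < 0, paste0("code ", raw_item), "answered")
print(table(missing_reason, is_na = is.na(item_rec)))

# Same idea for income: negative codes become NA instead of entering summaries.
raw_income <- c(19000, 35000, -7, 67000, -8)
income <- ifelse(raw_income < 0, NA, raw_income)
print(summary(income))
```

Keeping a separate reason variable is one way to report how much of the income missingness comes from each kind of non-response.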


📊 Building Table 1

To summarize the characteristics of my dataset, I created a Table 1 for CleanData using the arsenal package.

I wanted this table to compare participants by Financial Strain Flag across key demographic and health variables. Before building the table, I:

  • Labeled the variables in CleanData so the table would display readable column names instead of raw variable names.
  • Specified which covariates should appear as factors (e.g., Gender_final, Education_final) and which should remain numeric (e.g., HouseholdSize_final, HouseholdIncome_final).
  • Chose summary statistics for numeric variables: here I displayed the median and standard deviation rather than the mean, since income and household size are often skewed.

Finally, I ran tableby() to generate the table, and then used summary() to print it as text.
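
The reasoning behind preferring the median for skewed variables can be seen in a small base R sketch. All values below are made up, and the two-group comparison is only an illustration of how a single extreme income can affect a mean-based test more than a rank-based one.

```r
# Hypothetical incomes: one group contains a single extreme value.
no_strain  <- c(30000, 36000, 40000, 45000, 52000, 70000, 25000000)
any_strain <- c(9000, 12000, 15000, 16000, 18000, 21000, 24000)

# The mean and SD of the first group are driven by the outlier; the median is not.
stats <- c(mean = mean(no_strain), median = median(no_strain), sd = sd(no_strain))
print(round(stats))

# Quartiles describe spread without being dominated by the extreme value.
print(quantile(no_strain, probs = c(0.25, 0.75)))

d <- data.frame(
  income = c(no_strain, any_strain),
  group  = factor(rep(c("No Strain", "Any Strain"), each = 7))
)

# Mean-based comparison (one-way ANOVA) versus rank-based (Kruskal-Wallis).
p_anova <- anova(lm(income ~ group, data = d))[["Pr(>F)"]][1]
p_kw    <- kruskal.test(income ~ group, data = d)$p.value
print(c(anova = p_anova, kruskal = p_kw))
```

The SD is as sensitive to outliers as the mean, so quartiles may be a better companion to the median. The arsenal documentation for tableby.control() lists numeric.stats options such as "medianq1q3" and a Kruskal-Wallis option, numeric.test = "kwt"; these are worth checking against the installed version before use.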

## 📊 Building Table 1 with Arsenal (Hmisc Label Method)

# Step 1: Assign pretty labels to all variables in CleanData
label(CleanData$FinancialStrainFlag)    <- "Financial Strain Flag"
label(CleanData$dementia_class)         <- "Dementia Classification"
label(CleanData$Age_final)              <- "Age Group"
label(CleanData$Gender_final)           <- "Gender"
label(CleanData$Education_final)        <- "Education Level"
label(CleanData$Occupation_final)       <- "Occupation Category"
label(CleanData$HomeOwnership_final)    <- "Home Ownership"
label(CleanData$HouseholdSize_final)    <- "Household Size"
label(CleanData$HouseholdIncome_final)  <- "Household Income"
label(CleanData$Section8_final)         <- "Receives Section 8"
label(CleanData$Medicaid_final)         <- "Receives Medicaid"
label(CleanData$ChildhoodHealth_final)  <- "Childhood Health Status"
label(CleanData$RaceEthnicity_final)    <- "Race and Ethnicity"
label(CleanData$MaritalStatus_final)    <- "Marital Status"

# Step 2: Build Table 1 comparing participants by Financial Strain Flag
tab1 <- tableby(
  FinancialStrainFlag ~
    dementia_class +
    Age_final +
    Gender_final +
    Education_final +
    Occupation_final +
    HomeOwnership_final +
    HouseholdSize_final +
    HouseholdIncome_final +
    Section8_final +
    Medicaid_final +
    ChildhoodHealth_final +
    RaceEthnicity_final +
    MaritalStatus_final,
  data = CleanData,
  numeric.stats = c("median", "sd")
)

# Step 3: Output summary (labels will now appear automatically)
summary(tab1, text = TRUE)
| | No Strain (N=4469) | Any Strain (N=287) | Total (N=4756) | p value |
|:---|:---:|:---:|:---:|---:|
| Dementia Classification | | | | |
| - No Dementia | 4218 (94.4%) | 255 (88.9%) | 4473 (94.0%) | |
| - Possible Dementia | 174 (3.9%) | 25 (8.7%) | 199 (4.2%) | |
| - Probable Dementia | 77 (1.7%) | 7 (2.4%) | 84 (1.8%) | |
| - Missing | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | |
| Age Group | | | | |
| - N-Miss | 27 | 3 | 30 | |
| - 65–69 | 691 (15.6%) | 63 (22.2%) | 754 (16.0%) | |
| - 70–74 | 1209 (27.2%) | 96 (33.8%) | 1305 (27.6%) | |
| - 75–79 | 1035 (23.3%) | 67 (23.6%) | 1102 (23.3%) | |
| - 80–84 | 803 (18.1%) | 39 (13.7%) | 842 (17.8%) | |
| - 85–89 | 495 (11.1%) | 15 (5.3%) | 510 (10.8%) | |
| - 90+ | 209 (4.7%) | 4 (1.4%) | 213 (4.5%) | |
| - Missing | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | |
| Gender | | | | 0.039 |
| - N-Miss | 27 | 3 | 30 | |
| - Male | 1920 (43.2%) | 105 (37.0%) | 2025 (42.8%) | |
| - Female | 2522 (56.8%) | 179 (63.0%) | 2701 (57.2%) | |
| Education Level | | | | < 0.001 |
| - N-Miss | 27 | 3 | 30 | |
| - No schooling | 16 (0.4%) | 2 (0.7%) | 18 (0.4%) | |
| - 1–8th grade | 304 (6.8%) | 49 (17.3%) | 353 (7.5%) | |
| - 9–12 (no diploma) | 455 (10.2%) | 55 (19.4%) | 510 (10.8%) | |
| - HS grad | 1139 (25.6%) | 69 (24.3%) | 1208 (25.6%) | |
| - Vocational | 320 (7.2%) | 18 (6.3%) | 338 (7.2%) | |
| - Some college | 665 (15.0%) | 32 (11.3%) | 697 (14.7%) | |
| - Associate | 224 (5.0%) | 19 (6.7%) | 243 (5.1%) | |
| - Bachelor | 613 (13.8%) | 23 (8.1%) | 636 (13.5%) | |
| - Master/PhD | 623 (14.0%) | 13 (4.6%) | 636 (13.5%) | |
| - Missing | 83 (1.9%) | 4 (1.4%) | 87 (1.8%) | |
| Occupation Category | | | | < 0.001 |
| - N-Miss | 27 | 3 | 30 | |
| - Management/Professional | 832 (18.7%) | 40 (14.1%) | 872 (18.5%) | |
| - Service | 175 (3.9%) | 27 (9.5%) | 202 (4.3%) | |
| - Sales/Office | 433 (9.7%) | 30 (10.6%) | 463 (9.8%) | |
| - Construction/Farming | 210 (4.7%) | 16 (5.6%) | 226 (4.8%) | |
| - Production | 352 (7.9%) | 32 (11.3%) | 384 (8.1%) | |
| - Homemaker | 127 (2.9%) | 7 (2.5%) | 134 (2.8%) | |
| - Not working/retired | 2190 (49.3%) | 124 (43.7%) | 2314 (49.0%) | |
| - Missing | 123 (2.8%) | 8 (2.8%) | 131 (2.8%) | |
| Home Ownership | | | | < 0.001 |
| - N-Miss | 27 | 3 | 30 | |
| - Own | 3448 (77.6%) | 149 (52.5%) | 3597 (76.1%) | |
| - Rent | 605 (13.6%) | 107 (37.7%) | 712 (15.1%) | |
| - Other | 319 (7.2%) | 23 (8.1%) | 342 (7.2%) | |
| - Missing | 70 (1.6%) | 5 (1.8%) | 75 (1.6%) | |
| Household Size | | | | 0.077 |
| - Median | 2.000 | 2.000 | 2.000 | |
| - SD | 1.030 | 1.365 | 1.053 | |
| Household Income | | | | 0.188 |
| - Median | 40000.000 | 15000.000 | 36000.000 | |
| - SD | 528553.118 | 26876.641 | 511528.883 | |
| Receives Section 8 | | | | < 0.001 |
| - N-Miss | 27 | 3 | 30 | |
| - Yes | 180 (4.1%) | 35 (12.3%) | 215 (4.5%) | |
| - No | 4256 (95.8%) | 247 (87.0%) | 4503 (95.3%) | |
| - Missing | 6 (0.1%) | 2 (0.7%) | 8 (0.2%) | |
| Receives Medicaid | | | | < 0.001 |
| - N-Miss | 27 | 3 | 30 | |
| - Yes | 488 (11.0%) | 88 (31.0%) | 576 (12.2%) | |
| - No | 3864 (87.0%) | 186 (65.5%) | 4050 (85.7%) | |
| - Missing | 90 (2.0%) | 10 (3.5%) | 100 (2.1%) | |
| Childhood Health Status | | | | < 0.001 |
| - N-Miss | 27 | 3 | 30 | |
| - Excellent | 2181 (49.1%) | 115 (40.5%) | 2296 (48.6%) | |
| - Very good | 1193 (26.9%) | 73 (25.7%) | 1266 (26.8%) | |
| - Good | 708 (15.9%) | 60 (21.1%) | 768 (16.3%) | |
| - Fair | 207 (4.7%) | 20 (7.0%) | 227 (4.8%) | |
| - Poor | 72 (1.6%) | 14 (4.9%) | 86 (1.8%) | |
| - Missing | 81 (1.8%) | 2 (0.7%) | 83 (1.8%) | |
| Race and Ethnicity | | | | < 0.001 |
| - N-Miss | 27 | 3 | 30 | |
| - 1 | 3209 (72.2%) | 94 (33.1%) | 3303 (69.9%) | |
| - 2 | 851 (19.2%) | 136 (47.9%) | 987 (20.9%) | |
| - 3 | 94 (2.1%) | 15 (5.3%) | 109 (2.3%) | |
| - 4 | 223 (5.0%) | 32 (11.3%) | 255 (5.4%) | |
| - 5 | 3 (0.1%) | 1 (0.4%) | 4 (0.1%) | |
| - 6 | 62 (1.4%) | 6 (2.1%) | 68 (1.4%) | |
| Marital Status | | | | |
| - N-Miss | 27 | 3 | 30 | |
| - Married | 2275 (51.2%) | 89 (31.3%) | 2364 (50.0%) | |
| - Living with Partner | 104 (2.3%) | 9 (3.2%) | 113 (2.4%) | |
| - Separated | 69 (1.6%) | 16 (5.6%) | 85 (1.8%) | |
| - Divorced | 529 (11.9%) | 59 (20.8%) | 588 (12.4%) | |
| - Widowed | 1297 (29.2%) | 90 (31.7%) | 1387 (29.3%) | |
| - Never married | 168 (3.8%) | 21 (7.4%) | 189 (4.0%) | |
| - Missing | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | |
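
In the table above, the p value is blank for Dementia Classification, Age Group, and Marital Status. These are also the three variables that show an all-zero Missing row. A possible explanation (an assumption, not verified against the arsenal source here) is that the test is skipped when a category has no observations. The base R sketch below, with hypothetical toy factors, shows how an unused level produces an empty row and how droplevels() removes it.

```r
# Hypothetical outcome with a declared but unused "Missing" level.
dementia <- factor(
  c("No", "No", "Possible", "Probable", "No", "Possible"),
  levels = c("No", "Possible", "Probable", "Missing")
)
strain <- factor(
  c("No Strain", "Any Strain", "No Strain", "Any Strain", "No Strain", "Any Strain")
)

# The unused level still appears in the cross-tabulation as a row of zeros.
tab_with_empty <- table(dementia, strain)
print(rowSums(tab_with_empty))

# droplevels() removes factor levels that have no observations.
tab_clean <- table(droplevels(dementia), strain)
print(dim(tab_clean))
```

If this is the cause, applying droplevels() to the affected factors in CleanData before calling tableby() would be one thing to try. Whether the p values then appear should be checked on the real data.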