Setup

Load packages

library(ggplot2)
library(dplyr)
library(survey)
library(mice)
library(reshape2)
library(usmap)
#library(tidyverse)

Load data

Download Links:

brfss2013.Rdata

Codebook

load("brfss2013.RData")

Part 1: Data

How the observations in the sample are collected

Landline sample

Cellular sample

The implications of this data collection method on generalizability and causality

Since there is no random assignment,

Potential sources of bias

Potential non-response bias may be considered for this data since the surveys were conducted using land lines and cell phones. However, this can be avoided by using the ‘survey’ package and the provided weighting variables in the brfss2013 data set.


Part 2: Research questions

Research quesion 1: What are the probabilities of owning or renting a home given an individual’s income level and race?

Higher income individuals may be more able to afford owning a house. There may also be differences in home-ownership rates based on race due to systemic inequalities and historical factors.

Research quesion 2: Is there any association between physical activity level and tobacco use among young adults?

It is reasonable to assume people who engage in regular physical activities also adopt other healthy lifestyle choices.

Research quesion 3: Is there a pattern in sleep duration, education level, and state residence among individuals who self-report excellent versus poor general health status?

This question explores the three separate variable types on general health: health behavior, demographics, and geographic region.


Part 3: Exploratory data analysis

Research quesion 1: What are the probabilities of owning or renting a home given an individual’s income level and race?

Let’s first look into our three variables separately.

# summary for variable renthom1
brfss2013 %>% count(renthom1) %>% mutate(percent = n/sum(n))
##            renthom1      n    percent
## 1               Own 351636 0.71503431
## 2              Rent 110980 0.22567231
## 3 Other arrangement  20711 0.04211479
## 4              <NA>   8448 0.01717859

There are three levels of housing types with the frequency and percent for each.

# summary for variable income2
brfss2013 %>% count(income2) %>% mutate(percent = n/sum(n))
##             income2      n    percent
## 1 Less than $10,000  25441 0.05173301
## 2 Less than $15,000  26794 0.05448427
## 3 Less than $20,000  34873 0.07091251
## 4 Less than $25,000  41732 0.08485995
## 5 Less than $35,000  48867 0.09936861
## 6 Less than $50,000  61509 0.12507549
## 7 Less than $75,000  65231 0.13264399
## 8   $75,000 or more 115902 0.23568095
## 9              <NA>  71426 0.14524122

We can see that the $75,000 group accounts for roughly a quarter of our data set.

# summary for variable race
# rename levels
levels(brfss2013$X_race) <- c("White","Black","American Indian or Alaskan Native","Asian","Native Hawaiian or Pacific Islander","Other race","Multiracial","Hispanic")

brfss2013 %>% count(X_race) %>% mutate(percent = n/sum(n))
##                                X_race      n     percent
## 1                               White 376253 0.765091759
## 2                               Black  39339 0.079993900
## 3   American Indian or Alaskan Native   7676 0.015608764
## 4                               Asian   9508 0.019334045
## 5 Native Hawaiian or Pacific Islander   1547 0.003145748
## 6                          Other race   2701 0.005492349
## 7                         Multiracial   9121 0.018547100
## 8                            Hispanic  37060 0.075359667
## 9                                <NA>   8570 0.017426669

The ‘White’ group accounts for the majority of the data set.

We will use ‘mice package’ to fill in the NA values.

The basic idea behind the algorithm is to treat each variable that has missing values as a dependent variable in regression and treat the others as independent (predictors).

We will use an imputation method called predictive mean matching.

set1 <- subset(brfss2013, select = c("income2","X_race","renthom1"))
table(is.na(set1))
## 
##   FALSE    TRUE 
## 1386881   88444
# list the variable being imputed on the left-hand side of the equation, and the predictor variables on the right-hand side of the equation, separated by the tilde (~) symbol.
imp1 <- mice(set1, method= "pmm", seed=1234)
## 
##  iter imp variable
##   1   1  income2  X_race  renthom1
##   1   2  income2  X_race  renthom1
##   1   3  income2  X_race  renthom1
##   1   4  income2  X_race  renthom1
##   1   5  income2  X_race  renthom1
##   2   1  income2  X_race  renthom1
##   2   2  income2  X_race  renthom1
##   2   3  income2  X_race  renthom1
##   2   4  income2  X_race  renthom1
##   2   5  income2  X_race  renthom1
##   3   1  income2  X_race  renthom1
##   3   2  income2  X_race  renthom1
##   3   3  income2  X_race  renthom1
##   3   4  income2  X_race  renthom1
##   3   5  income2  X_race  renthom1
##   4   1  income2  X_race  renthom1
##   4   2  income2  X_race  renthom1
##   4   3  income2  X_race  renthom1
##   4   4  income2  X_race  renthom1
##   4   5  income2  X_race  renthom1
##   5   1  income2  X_race  renthom1
##   5   2  income2  X_race  renthom1
##   5   3  income2  X_race  renthom1
##   5   4  income2  X_race  renthom1
##   5   5  income2  X_race  renthom1
imputed_data1 <- complete(imp1)

# rename columns
imputed_data1 <- rename(imputed_data1,income_lev = income2, race = X_race, housing_type = renthom1)

# rename income column's levels
imputed_data1 <- imputed_data1 %>% mutate(income_lev = as.factor(case_when(
  income_lev == "Less than $10,000" ~ "$0 - $10,000",
  income_lev == "Less than $15,000" ~ "$10,000 - $15,000",
  income_lev == "Less than $20,000" ~ "$15,000 - $20,000",
  income_lev == "Less than $25,000" ~ "$20,000 - $25,000",
  income_lev == "Less than $35,000" ~ "$25,000 - $35,000",
  income_lev == "Less than $50,000" ~ "$35,000 - $50,000",
  income_lev == "Less than $75,000" ~ "$50,000 - $75,000",
  income_lev == "$75,000 or more"   ~ "$75,000 +"    ) ) )

imputed_data1 %>% head()
##          income_lev  race housing_type
## 1 $50,000 - $75,000 Black          Own
## 2         $75,000 + White          Own
## 3         $75,000 + White          Own
## 4 $50,000 - $75,000 White          Own
## 5 $35,000 - $50,000 White          Own
## 6         $75,000 + Black          Own

The first six rows of our desired subset with the columns renamed for better readability

Here are the histograms for the three variables

# histograms
imputed_data1 %>% ggplot(aes(y = housing_type)) + 
  geom_bar()

imputed_data1 %>% ggplot(aes(y = income_lev)) + 
  geom_bar()

imputed_data1 %>% ggplot(aes(y = race)) + 
  geom_bar()

Now let’s explore our method of finding the probability of owning/renting given income level and race.

Recall the fact that the “White” race category has the highest number of people in the data set. This can affect the calculation of our probabilities. If one race category has a much larger frequency than the others, it may dominate the conditional probabilities and make it difficult to make meaningful comparisons between the different race categories.

It is more useful to normalize the frequencies by the total number of people in each income+race combination. This would give us the proportion of people who own/rent in each race category within each income level, which can help to reveal any patterns or differences between the race categories and/or income levels.

# Define the categories for income and race
income_categories <- c("$0 - $10,000", "$10,000 - $15,000", "$15,000 - $20,000", "$20,000 - $25,000", "$25,000 - $35,000", "$35,000 - $50,000", "$50,000 - $75,000", "$75,000 +")
race_categories <- c("White", "Black", "American Indian or Alaskan Native", "Asian", "Native Hawaiian or Pacific Islander", "Other race", "Multiracial", "Hispanic")

# Create two empty matrices to store the probabilities
prob_matrix_own <- matrix(NA, nrow = length(income_categories), ncol = length(race_categories))
prob_matrix_rent <- matrix(NA, nrow = length(income_categories), ncol = length(race_categories))


# Loop over the income and race categories and calculate the probabilities
for (i in seq_along(income_categories)) {
  for (j in seq_along(race_categories)) {
    # Subset the data to only include a certain (income,race) pair
    subset_data <- subset(imputed_data1, income_lev == income_categories[i] & race == race_categories[j])
    
    # Calculate the total number of people in the subset
    n <- nrow(subset_data)
    
    # Calculate the number of people who own or rent their home in the subset
    rent_own_table <- table(subset_data$housing_type)
    n_own <- rent_own_table[1]
    n_rent <- rent_own_table[2]
    
    # Calculate the probabilities and store them in the matrix
    prob_own <- n_own / n
    prob_rent <- n_rent / n
    prob_matrix_own[i, j] <- prob_own
    prob_matrix_rent[i,j] <- prob_rent
  }
}

# Add row and column names to the probability matrices
rownames(prob_matrix_own) <- income_categories
colnames(prob_matrix_own) <- race_categories
rownames(prob_matrix_rent) <- income_categories
colnames(prob_matrix_rent) <- race_categories

# View the probability matrices
prob_matrix_own
##                       White     Black American Indian or Alaskan Native
## $0 - $10,000      0.3554515 0.3462592                         0.3867450
## $10,000 - $15,000 0.5096915 0.3432439                         0.5645342
## $15,000 - $20,000 0.5984650 0.3214887                         0.4882914
## $20,000 - $25,000 0.6067832 0.5196784                         0.5690276
## $25,000 - $35,000 0.7373117 0.5161115                         0.6050339
## $35,000 - $50,000 0.8485875 0.5670000                         0.7405784
## $50,000 - $75,000 0.8744283 0.7101263                         0.7413295
## $75,000 +         0.9184410 0.8398802                         0.8157603
##                       Asian Native Hawaiian or Pacific Islander Other race
## $0 - $10,000      0.2154472                           0.1751412  0.3732394
## $10,000 - $15,000 0.4148352                           0.1111111  0.3377778
## $15,000 - $20,000 0.3348348                           0.3276836  0.5492958
## $20,000 - $25,000 0.4389831                           0.4024390  0.6273292
## $25,000 - $35,000 0.4497768                           0.3093525  0.7317647
## $35,000 - $50,000 0.5129241                           0.4360190  0.6600985
## $50,000 - $75,000 0.5443131                           0.4950495  0.5755814
## $75,000 +         0.7901483                           0.7431507  0.8428325
##                   Multiracial  Hispanic
## $0 - $10,000        0.2768730 0.3630114
## $10,000 - $15,000   0.2876858 0.3903815
## $15,000 - $20,000   0.4154412 0.4517747
## $20,000 - $25,000   0.4535017 0.3356902
## $25,000 - $35,000   0.5310078 0.5623113
## $35,000 - $50,000   0.6543408 0.6079576
## $50,000 - $75,000   0.7675528 0.7306763
## $75,000 +           0.7623126 0.7653846
prob_matrix_rent
##                        White     Black American Indian or Alaskan Native
## $0 - $10,000      0.44574896 0.5664910                         0.4437919
## $10,000 - $15,000 0.42309898 0.5665942                         0.3748597
## $15,000 - $20,000 0.34584860 0.5865611                         0.4579358
## $20,000 - $25,000 0.34055333 0.4092256                         0.3625450
## $25,000 - $35,000 0.22961645 0.4480449                         0.3552759
## $35,000 - $50,000 0.12367809 0.3868000                         0.2383874
## $50,000 - $75,000 0.10949410 0.2700335                         0.2066474
## $75,000 +         0.06651064 0.1438083                         0.1664817
##                       Asian Native Hawaiian or Pacific Islander Other race
## $0 - $10,000      0.5325203                           0.6101695  0.4542254
## $10,000 - $15,000 0.4725275                           0.6592593  0.5688889
## $15,000 - $20,000 0.5375375                           0.5084746  0.3873239
## $20,000 - $25,000 0.4830508                           0.4817073  0.2888199
## $25,000 - $35,000 0.3828125                           0.5863309  0.2376471
## $35,000 - $50,000 0.4361874                           0.4170616  0.2733990
## $50,000 - $75,000 0.3840473                           0.3217822  0.3895349
## $75,000 +         0.1827092                           0.1369863  0.1347150
##                   Multiracial  Hispanic
## $0 - $10,000        0.5428882 0.5001677
## $10,000 - $15,000   0.6061571 0.5290167
## $15,000 - $20,000   0.4828431 0.4689429
## $20,000 - $25,000   0.4167623 0.6194761
## $25,000 - $35,000   0.3478682 0.3626563
## $35,000 - $50,000   0.2765273 0.3275862
## $50,000 - $75,000   0.1779141 0.2424517
## $75,000 +           0.1927195 0.2123482

Here is the plot of owning a home given each income level and race category

# Convert the probability matrix to a data frame
prob_df <- melt(prob_matrix_own, varnames = c("income", "race"), value.name = "probability")
prob_df$income <- factor(prob_df$income, levels = income_categories)

# Plot the probability distribution by income level
ggplot(prob_df, aes(x = income, y = probability, group = race, color = race)) +
  geom_line() +
  geom_point() +
  ggtitle(expression(atop(
    paste("Probability of ", underline("Owning a Home"), " Given Income Level and Race Category")))) +
  labs(x = "Income Level", y = "Probability", color = "Race") +
  theme(axis.text.x = element_text(angle = 90))

Plot for renting a home.

# Convert the probability matrix to a data frame
prob_df <- melt(prob_matrix_rent, varnames = c("income", "race"), value.name = "probability")
prob_df$income <- factor(prob_df$income, levels = income_categories)

# Plot the probability distribution by income level
ggplot(prob_df, aes(x = income, y = probability, group = race, color = race)) +
  geom_line() +
  geom_point() +
  ggtitle(expression(atop(
    paste("Probability of ", underline("Renting a Home"), " Given Income Level and Race Category")))) +
  labs(x = "Income Level", y = "Probability", color = "Race") +
  theme(axis.text.x = element_text(angle = 90))

Research quesion 2: Is there any association between physical activity level and tobacco use among young adults?

#filter the data to include only young adults, defined as those between the ages of 18 and 24
young_adults <- filter(brfss2013, X_age_g == "Age 18 to 24")

# check to see if the 'X_llcpwt' weighting variable has any missing values
sum(is.na(young_adults$X_llcpwt))
## [1] 0

This output means there are no missing values in the weighting variable.

set2 <- subset(young_adults, select = c("seqno", "X_psu","X_ststr","X_llcpwt","X_pacat1","X_rfsmok3","X_smoker3"))
sum(is.na(set2))
## [1] 7242
imp2 <- mice(set2, method = c("","","","","sample", "logreg.boot", "sample"), seed=4321) 
## 
##  iter imp variable
##   1   1  X_pacat1  X_rfsmok3  X_smoker3
##   1   2  X_pacat1  X_rfsmok3  X_smoker3
##   1   3  X_pacat1  X_rfsmok3  X_smoker3
##   1   4  X_pacat1  X_rfsmok3  X_smoker3
##   1   5  X_pacat1  X_rfsmok3  X_smoker3
##   2   1  X_pacat1  X_rfsmok3  X_smoker3
##   2   2  X_pacat1  X_rfsmok3  X_smoker3
##   2   3  X_pacat1  X_rfsmok3  X_smoker3
##   2   4  X_pacat1  X_rfsmok3  X_smoker3
##   2   5  X_pacat1  X_rfsmok3  X_smoker3
##   3   1  X_pacat1  X_rfsmok3  X_smoker3
##   3   2  X_pacat1  X_rfsmok3  X_smoker3
##   3   3  X_pacat1  X_rfsmok3  X_smoker3
##   3   4  X_pacat1  X_rfsmok3  X_smoker3
##   3   5  X_pacat1  X_rfsmok3  X_smoker3
##   4   1  X_pacat1  X_rfsmok3  X_smoker3
##   4   2  X_pacat1  X_rfsmok3  X_smoker3
##   4   3  X_pacat1  X_rfsmok3  X_smoker3
##   4   4  X_pacat1  X_rfsmok3  X_smoker3
##   4   5  X_pacat1  X_rfsmok3  X_smoker3
##   5   1  X_pacat1  X_rfsmok3  X_smoker3
##   5   2  X_pacat1  X_rfsmok3  X_smoker3
##   5   3  X_pacat1  X_rfsmok3  X_smoker3
##   5   4  X_pacat1  X_rfsmok3  X_smoker3
##   5   5  X_pacat1  X_rfsmok3  X_smoker3
## Warning: Number of logged events: 1
imputed_data2 <- complete(imp2)

imputed_data2 %>% head()
##        seqno      X_psu X_ststr  X_llcpwt              X_pacat1 X_rfsmok3
## 1 2013006443 2013006443   12019 1459.8473 Insufficiently active        No
## 2 2013001722 2013001722   11091  834.3642 Insufficiently active        No
## 3 2013001798 2013001798   11091 1137.5738         Highly active        No
## 4 2013004740 2013004740   11091  267.9022         Highly active        No
## 5 2013005110 2013005110   12019 1124.6586 Insufficiently active        No
## 6 2013005363 2013005363   12019 1717.1831         Highly active        No
##      X_smoker3
## 1 Never smoked
## 2 Never smoked
## 3 Never smoked
## 4 Never smoked
## 5 Never smoked
## 6 Never smoked

We used random sample from observed values for X_pacat1 and X_smoker3 and logistic regression for X_rfsmok3 since it is binary. We did not impute these variables “X_seqno”, “X_psu”,“X_ststr”,“X_llcpwt” because they are inputs to create a survey design object.

Let’s look at the summary statistics for physical activity level and tobacco use.

summary(imputed_data2$X_pacat1)
##         Highly active                Active Insufficiently active 
##                  8546                  5610                  6966 
##              Inactive 
##                  6081
summary(imputed_data2$X_rfsmok3)
##    No   Yes 
## 21825  5378
summary(imputed_data2$X_smoker3)
## Current smoker - now smokes every day Current smoker - now smokes some days 
##                                  3466                                  1907 
##                         Former smoker                          Never smoked 
##                                  2070                                 19760

Let’s now create a bar plot to investigate the potential association.

# create a weighted survey design object
design2 <- svydesign(
  data = imputed_data2,
  id = ~seqno,
  cluster = imputed_data2$X_psu,
  strata = imputed_data2$X_ststr,
  weights = imputed_data2$X_llcpwt,
  nest = TRUE
)

# create a weighted contingency table
# returns a table with the weighted frequencies for each combination of categories
tab <- svytable(~X_rfsmok3+X_pacat1, design2)

# normalize the weighted frequencies by computing the proportion of each table entry relative to the total sum of all entries
normalized_tab <- prop.table(tab)


# adjust plot margins
par(mar = c(5,4,4,5) + 0.5)
# visualize the association using a bar plot
ggplot(as.data.frame(normalized_tab), aes(x = X_pacat1, y = Freq, fill = X_rfsmok3)) + 
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Association in Physical Activity Level and Tobacco Use",
       x = "Physical Activity Levels",
       y = "Percentages",
       fill = "Tobacco Use") +
  scale_fill_manual(values = c("mediumseagreen", "red"))

Let’s dig a little bit deeper.

tab2 <- svytable(~X_smoker3+X_pacat1, design2, 1)
df_tab2 <- as.data.frame(tab2)

# visualize the association using a relative frequency segmented bar plot
ggplot(df_tab2, aes(x = X_pacat1, y = Freq, fill = X_smoker3)) +
  geom_col(position = position_fill()) +
  labs(title = "Relative Frequency Segmented Bar Plot",
       x = "Physical Activity Level", y = "", fill = "Tobacco Use Status") +
  scale_fill_manual(values = c("red","orange","yellow","mediumseagreen"))

Due to the general nature of our question, in which we don’t assume which variable is the explanatory variable and which is the response, let’s create both the column and row proportion contingency tables.

# create column proportion contingency table
# interpret the relationship from side to side
# does physical activity level affect tobacco use status?
colp_tab <- prop.table(tab, margin = 2)

# print the normalized table
print(colp_tab)
##          X_pacat1
## X_rfsmok3 Highly active    Active Insufficiently active  Inactive
##       No      0.8022698 0.8427361             0.8299502 0.7726700
##       Yes     0.1977302 0.1572639             0.1700498 0.2273300
ggplot(as.data.frame(colp_tab), aes(x = X_pacat1, y = Freq, fill = X_rfsmok3)) + 
  geom_bar(stat = "identity") +
  labs(title = "Column Proportion Contingency Plot",
       x = "Physical Activity Levels",
       y = "Percentages",
       fill = "Tobacco Use") +
  scale_fill_manual(values = c("mediumseagreen", "red"))

# create row proportion contingency table
# interpret the relationship from top to bottom
# does tobacco use status affect physical activity level?
rowp_tab <- prop.table(tab, margin = 1)

# print the normalized table
print(rowp_tab)
##          X_pacat1
## X_rfsmok3 Highly active    Active Insufficiently active  Inactive
##       No      0.3170953 0.2092409             0.2521617 0.2215021
##       Yes     0.3339364 0.1668419             0.2207622 0.2784595
ggplot(as.data.frame(rowp_tab), aes(x = X_pacat1, y = Freq, fill = X_rfsmok3)) + 
  geom_bar(stat = "identity",position = "dodge") +
  labs(title = "Row Proportion Contingency Plot",
       x = "Physical Activity Levels",
       y = "Percentages",
       fill = "Tobacco Use") +
  scale_fill_manual(values = c("mediumseagreen", "red"))

Research question 3: Is there a pattern in sleep duration, education level, and state residence among individuals who self-report excellent versus poor general health status?

Cleaning and Pre-processing.

# filter the data to include only individuals who self report 'Excellent' or Poor general health
exc_pr <- subset(brfss2013, genhlth == "Excellent" | genhlth == "Poor")
# remove other levels in "genhlth" variable to only have two levels: excellent, poor
exc_pr$genhlth <- factor(exc_pr$genhlth, levels = c("Excellent", "Poor"))


# select only variables of interest
set3 <- exc_pr[,c("seqno","X_llcpwt", "X_ststr", "X_psu","genhlth", "sleptim1", "X_educag", "X_state")]

# Count the number of missing values for each variable
missing_counts <- colSums(is.na(set3))
# Print the missing value counts for each variable
print(missing_counts)
##    seqno X_llcpwt  X_ststr    X_psu  genhlth sleptim1 X_educag  X_state 
##        1        1        1        0        0     1972      599        0
# remove missing values for the variables "seqno", "X_llcpwt", "X_llcpwt"
set3 <- set3[complete.cases(set3[,c("seqno", "X_llcpwt", "X_ststr")]), ]

Impute missing values.

# impute missing values
imp3 <- mice(set3, method = "rf", seed = 123)
## 
##  iter imp variable
##   1   1  sleptim1  X_educag
##   1   2  sleptim1  X_educag
##   1   3  sleptim1  X_educag
##   1   4  sleptim1  X_educag
##   1   5  sleptim1  X_educag
##   2   1  sleptim1  X_educag
##   2   2  sleptim1  X_educag
##   2   3  sleptim1  X_educag
##   2   4  sleptim1  X_educag
##   2   5  sleptim1  X_educag
##   3   1  sleptim1  X_educag
##   3   2  sleptim1  X_educag
##   3   3  sleptim1  X_educag
##   3   4  sleptim1  X_educag
##   3   5  sleptim1  X_educag
##   4   1  sleptim1  X_educag
##   4   2  sleptim1  X_educag
##   4   3  sleptim1  X_educag
##   4   4  sleptim1  X_educag
##   4   5  sleptim1  X_educag
##   5   1  sleptim1  X_educag
##   5   2  sleptim1  X_educag
##   5   3  sleptim1  X_educag
##   5   4  sleptim1  X_educag
##   5   5  sleptim1  X_educag
## Warning: Number of logged events: 51
imputed_data3 <- complete(imp3)

# rename variables/columns
imputed_data3 <- rename(imputed_data3, gen_health = genhlth, sleep_hrs = sleptim1, edu_lv = X_educag, state = X_state)

Summary statistics and plots for sleep duration by general health status.

# create a design survey object 
design3 <- svydesign(
  data = imputed_data3,
  id = ~seqno,
  cluster = imputed_data3$X_psu,
  strata = imputed_data3$X_ststr,
  weights = imputed_data3$X_llcpwt,
  nest = TRUE)

########################## ############# ############# ############# ############# sleep and general health
# svymean(~sleep_hrs, by = ~gen_health, design = design3)

# Summary statistics for sleep duration by general health status
# Histogram of sleep_hrs as summary statistics
ggplot(imputed_data3, aes(x = sleep_hrs)) +
  geom_histogram(binwidth = 1, fill = "#69b3a2", color = "black") +
  labs(title = "Distribution of Sleep Duration", x = "Sleep Duration (Hours)", y = "Count")

# Bar plot of genhlth
ggplot(set3, aes(x = genhlth, fill = genhlth)) +
  geom_bar() +
  labs(title = "General Health Status", x = "General Health Status", y = "Count") +
  scale_fill_manual(values = c("mediumseagreen", "red")) # custom colors

# Create a box plot of sleptim1 for each level of genhlth
ggplot(imputed_data3, aes(x = gen_health, y = sleep_hrs)) +
  geom_boxplot() +
  labs(x = "General Health Status", y = "Sleep Duration") +
  ggtitle("Sleep Duration by General Health Status")

Summary statistics and plot for education level by general health status.

########################## ############# ############# ############# ############# education level and general health
# the weighted counts indicate the number of people in the population that each combination of categories represents
# the weighted counts are population estimates for each combination of the variables of interest
educ_table <- svytable(~edu_lv + gen_health, design = design3)
norm_educ_table <- prop.table(educ_table)

# Create a data frame from the svytable object
educ_table_df <- as.data.frame(norm_educ_table)

# Create bar chart for education level by general health status
ggplot(educ_table_df, aes(x = Freq, y = edu_lv, fill = gen_health)) +
  theme(plot.title = element_text(size = 12, face = "bold", hjust = 0.5)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("mediumseagreen", "red")) +
  labs(x = "Percentage", y = "Education Level", 
       title = "Distribution of Education Level by General Health Status")

Create choropleth map to explore if there is any pattern in where individuals who self report excellent/poor health reside in the US.

########################## ############# ############# ############# #############  state residence and general health
# Calculate percentage of individuals who self-report excellent or poor general health status by state

state_table <- svytable(~state + gen_health, design = design3)
rowp_state_df <- as.data.frame(prop.table(state_table, margin = 1))

# create a vector of levels to remove
levels_to_remove <- c("0", "80", "Puerto Rico", "Guam")
# remove the levels
rowp_state_df <- rowp_state_df[!(rowp_state_df$state %in% levels_to_remove), ]


# Subset the state_data data frame to include only poor health status percentages
poor_health_data <- subset(rowp_state_df, gen_health == "Poor")
colnames(poor_health_data)[3] <- "percentage"
# Subset the state_data data frame to include only excellent health status percentages
excellent_health_data <- subset(rowp_state_df, gen_health == "Excellent")
colnames(excellent_health_data)[3] <- "percentage"


# This will create a choropleth map that shows the percentage of individuals who self-report excellent or poor general health status by state.
# Create a choropleth map for poor health
plot_usmap(data = poor_health_data, values = "percentage", include = poor_health_data$state) +
  scale_fill_gradient(low = "#F0FFF0", high = "#8B0000") +
  theme(legend.position = "right",plot.title = element_text(size = 12, face = "bold", hjust = 0.5)) +
  ggtitle("The South more likely to have 'Poor' General Health")

# choropleth map for excellent health
plot_usmap(data = excellent_health_data, values = "percentage", include = excellent_health_data$state) +
  scale_fill_gradient(low = "#F0FFF0", high = "#1E4620") +
  theme(legend.position = "right",plot.title = element_text(size = 12, face = "bold", hjust = 0.5)) +  
  ggtitle("Northeast, Midwest more likely to have 'Excellent' General Health")