Package used for the data analysis:

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3

Dataset used for the analysis:

load("brfss2013.RData")

Part 1: Data

The Behavioral Risk Factor Surveillance Systen is an ongoing surveillance system between US states and territories and the CDC to “to measure behavioral risk factors for the non-institutionalized adult population.” The data was gathered through the use of monthly landline- and cellphone-based surveys about preventative health practices and risk behaviors of chronic diseases, injuries, and infectious diseases. The available data is from the 2013 results of the survey.
Individuals were classified based on landline or cellphone use, with cellphone users classified as those who use a cellphone for 90% or more of their calls. For landline interviews, interviewers spoke with a random individual of a given household. For cellphone interviews, interviewers spoke with individuals possessing a cellphone residing in a private residence or college housing. There is no direct method to account for non telephone-based interviews.
The system randomly sampled households using disproportionate stratified samples; simple random samples were used in Puerto Rico and Guam. The data was weighted using iterative proportional fitting, providing a way for groups underrepresented in the sample to be accurately represented in the final data set. Due to the sampling methods used and the expanse of the surveillane system, this data is generalizable to the entire United States and subsequent territories. Causality cannot be inferred due to the design structure of this project, i.e., there is no assignment of experiment and control groups.
There is a potential of bias in this dataset based on the possibility of underrepresented groups even after weighting, the ability of individuals to recall information specific to the interview, and the possibility of individuals altering their responses due to the knowledge of being interviewed.
Below are three areas for further research from this dataset.

Part 2: Areas of Research Interest

Area 1: Healthcare does not appear to be consistently accessible today based on different factors. How accessible was healthcare in 2013 based on education level?

Area 2: Economic burden may exacerbate preexisting mental health concerns. In 2013, how did individuals with a depressive disorder and of varying employment statuses and sex feel stress due to the costs of rent?

Area 3: Tobacco use is known to be hazardous to health. How did tobacco use in 2013 vary based on marital status and weight of male respondents?


Part 3: Exploratory data analysis

Area 1: Healthcare Access

The majority of individuals interviewed were high school graduates, college graduates, or attended college or technical school but did not graduate. 2,274 individuals did not provide a response. 12.22 % of individuals didn’t see a doctor due to cost, and 26.28% of individuals went for longer than one year without a routine checkup. The last percentage is 100% minus the percentage of individuals who went for less than one year without a routine checkup minus the percentage of individuals who did not provide a response.
brfss2013 %>% 
  count(educa) %>% 
  mutate(`Proportion (Percentage)`= round(n / 491775 * 100, 2)) %>%
  rename(
    `Education Level` = educa, 
    Total = n
  )
##                                                Education Level  Total
## 1                   Never attended school or only kindergarten    677
## 2                              Grades 1 through 8 (Elementary)  13395
## 3                        Grades 9 though 11 (Some high school)  28141
## 4                       Grade 12 or GED (High school graduate) 142971
## 5 College 1 year to 3 years (Some college or technical school) 134197
## 6                   College 4 years or more (College graduate) 170120
## 7                                                         <NA>   2274
##   Proportion (Percentage)
## 1                    0.14
## 2                    2.72
## 3                    5.72
## 4                   29.07
## 5                   27.29
## 6                   34.59
## 7                    0.46
brfss2013 %>% 
  count(medcost) %>% 
  mutate(`Proportion (Percentage)`= round(n / 491775 * 100, 2)) %>%
  rename(
    `Could not see doctor because of cost?` = medcost,
    Total = n 
  )
##   Could not see doctor because of cost?  Total Proportion (Percentage)
## 1                                   Yes  60107                   12.22
## 2                                    No 430447                   87.53
## 3                                  <NA>   1221                    0.25
brfss2013 %>% 
  count(checkup1) %>% 
  mutate(`Proportion (Percentage)`= round(n / 491775 * 100, 2)) %>%
  rename(
    `Length of Time Since Last Routine Checkup` = checkup1,
    Total = n
  )
##   Length of Time Since Last Routine Checkup  Total Proportion (Percentage)
## 1                          Within past year 356228                   72.44
## 2                       Within past 2 years  57298                   11.65
## 3                       Within past 5 years  33674                    6.85
## 4                       5 or more years ago  33748                    6.86
## 5                                     Never   4522                    0.92
## 6                                      <NA>   6305                    1.28
Below is a visualization of those who could not afford the cost of a doctor by education level and those who went longer than one year for a routine checkup by education level.
High school graduates without further study and those who attended college or technical school but did not gradauate were the largest groups to say that they couldn’t see a doctor due to cost.
brfss2013 %>%
  filter(medcost == "Yes") %>%
  mutate(educa = fct_explicit_na(educa, "No Answer")) %>% 
  ggplot() + 
  geom_histogram(aes(x = educa, fill = educa), stat = "count") +
  labs(
    title = "Couldn't See a Doctor Due to Cost by Education Level",
    x = "Education Level",
    y = "Total",
    fill = "Education Level"
  ) +
  scale_y_continuous(breaks = seq(0, 20000, 1000)) +
  theme_classic() + 
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  ) 
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `educa = fct_explicit_na(educa, "No Answer")`.
## Caused by warning:
## ! `fct_explicit_na()` was deprecated in forcats 1.0.0.
## ℹ Please use `fct_na_value_to_level()` instead.
## Warning in geom_histogram(aes(x = educa, fill = educa), stat = "count"):
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`

At first glance, high school graduates, those who attended college or technical school but did not gradauate, and college graduates appeared the most frequent to go longer than one year without a routine checkup. This may be explained by a lack of data for adults with less than a high-school education.
brfss2013 %>%
  filter(checkup1 != "Within past year") %>%
  mutate(educa = fct_explicit_na(educa, "No Answer")) %>% 
  ggplot() + 
  geom_histogram(aes(x = educa, fill = educa), stat = "count") +
  facet_wrap(~checkup1) +
  labs(
    title = "Length of Time Since Last Routine Checkup",
    subtitle = "(Greater Than One Year)",
    x = "Education Level",
    y = "Total",
    fill = "Education Level"
  ) +
  scale_y_continuous(breaks = seq(0, 50000, 5000)) +
  theme_classic() + 
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.subtitle = element_text(hjust = 0.5)
  ) 
## Warning in geom_histogram(aes(x = educa, fill = educa), stat = "count"):
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`

Area 2: Mental Health

95,779 individuals reported being told they have a depressive disorder, and 2,289 did not give an answer. Of those who reported being told they have a depressive disorder, 28,457 identified as male, 67,321 identified as female, and 1 individual did not answer. Most individuals described as being told they have a depressive disorder did not answer; this may be due to Social Context being an optional interview module. NA values will be excluded for clarity.
brfss2013 %>% 
  count(addepev2) %>%
  rename(
    `Ever told you have a depressive disorder?` = addepev2,
    Total = n
    )
##   Ever told you have a depressive disorder?  Total
## 1                                       Yes  95779
## 2                                        No 393707
## 3                                      <NA>   2289
brfss2013 %>% 
  count(addepev2, sex) %>%
  rename(
    `Ever told you have a depressive disorder?` = addepev2,
    Sex = sex,
    Total = n
    )  
##   Ever told you have a depressive disorder?    Sex  Total
## 1                                       Yes   Male  28457
## 2                                       Yes Female  67321
## 3                                       Yes   <NA>      1
## 4                                        No   Male 171760
## 5                                        No Female 221946
## 6                                        No   <NA>      1
## 7                                      <NA>   Male   1096
## 8                                      <NA> Female   1188
## 9                                      <NA>   <NA>      5
brfss2013 %>% 
  filter(addepev2 == "Yes") %>%
  count(scntmony) %>% 
  rename(
    `Stress experienced by depressed individuals due to cost of rent?` = scntmony,
    Total = n
  )
##   Stress experienced by depressed individuals due to cost of rent? Total
## 1                                                           Always  2417
## 2                                                          Usually  1106
## 3                                                        Sometimes  2570
## 4                                                           Rarely  1797
## 5                                                            Never  4689
## 6                                                             <NA> 83200
Of those who answered this question, the majority of individuals were reported never feeling stressed over the cost of rent over the last 12 months.
brfss2013 %>%
  filter(addepev2 == "Yes" & !is.na(scntmony) & !is.na(sex)) %>%
  ggplot() + 
  geom_bar(aes(x = scntmony, fill = sex), stat = "count", position = "dodge") +
  labs(
    title = "Frequency of Stress Experienced Due to Cost of Rent by Gender",
    subtitle = "(Scored as Times over the Last 12 Months)",
    x = "Stress Level",
    y = "Total",
    fill = "Sex"
  ) +
  scale_y_continuous(breaks = seq(0, 4000, 400)) +
  theme_classic() + 
  theme(plot.subtitle = element_text(hjust = 0.5)) 

Area 3: Tobacco Use

A new factor needs to be created to represent single or married due to the number of available categories. For practicality’s sake, consider “single” to represent categories other than married.
brfss2013 <- brfss2013 %>% 
  mutate(
    m_status = as.factor(if_else(marital == "Married", "Married", "Single")),
    weight2 = as.numeric(weight2)
    ) 
We see that most male individuals responded as married while 1,322 did not answer. Let someone be considered a smoker if they answered “every day” or “some days.” Most male individuals in general reported not smoking, yet also almost half of male individuals did not answer.
brfss2013 %>%
  filter(sex == "Male") %>%
  count(m_status) %>%
  rename(
    `Marital Status` = m_status,
    Total = n
  )
##   Marital Status  Total
## 1        Married 114060
## 2         Single  85931
## 3           <NA>   1322
brfss2013 %>%
  filter(sex == "Male") %>%
  count(m_status, smokday2) %>%
  mutate(
    `Proportion (Percentage)` = round(n / 201313 * 100, 2)) %>%
  rename(
    `Marital Status` = m_status,
    `Frequency of days now smoking?` = smokday2,
    Total = n 
  )
##    Marital Status Frequency of days now smoking? Total Proportion (Percentage)
## 1         Married                      Every day  9739                    4.84
## 2         Married                      Some days  3685                    1.83
## 3         Married                     Not at all 41290                   20.51
## 4         Married                           <NA> 59346                   29.48
## 5          Single                      Every day 15245                    7.57
## 6          Single                      Some days  5907                    2.93
## 7          Single                     Not at all 24520                   12.18
## 8          Single                           <NA> 40259                   20.00
## 9            <NA>                      Every day   112                    0.06
## 10           <NA>                      Some days    61                    0.03
## 11           <NA>                     Not at all   259                    0.13
## 12           <NA>                           <NA>   890                    0.44
This data is to be used for conceptual purposes only. It is unlikely that the average reported weight in pounds of male individuals in 2013 is 99.88 pounds. This may be explained by the nature of how the data was collected and individual’s misconception about weight, lack of a scale, and so forth. Following this logic, data, married males who didn’t smoke reported on average the highest weight of 105.26 pounds (standard deviation 44.09 pounds), while single males who smoked daily reported on average the lowest weight of 91.47 pounds (standard deviation 49.02 pounds). Reported weights ranged from 1 to 570 pounds.
brfss2013 %>%
  filter(sex == "Male") %>%
  summarize(
    `Average Weight` = mean(weight2),
    `Median Weight` = median(weight2),
    `Minimum Weight` = min(weight2),
    `Maximum Weight` = max(weight2),
    `Standard Deviation of Weight` = sd(weight2)
  )
##   Average Weight Median Weight Minimum Weight Maximum Weight
## 1       99.87906            93              1            570
##   Standard Deviation of Weight
## 1                     48.49325
brfss2013 %>%
  filter(sex == "Male") %>%
  group_by(m_status, smokday2) %>%
  summarize(
    `Average Weight` = mean(weight2),
    `Median Weight` = median(weight2),
    `Minimum Weight` = min(weight2),
    `Maximum Weight` = max(weight2),
    `Standard Deviation of Weight` = sd(weight2)
  ) %>%
  rename(
    `Marital Status` = m_status,
    `Frequency of days now smoking?` = smokday2
  )
## `summarise()` has grouped output by 'm_status'. You can override using the
## `.groups` argument.
## # A tibble: 12 × 7
## # Groups:   Marital Status [3]
##    `Marital Status` Frequency of days now smo…¹ `Average Weight` `Median Weight`
##    <fct>            <fct>                                  <dbl>           <dbl>
##  1 Married          Every day                               98.0              91
##  2 Married          Some days                              102.               93
##  3 Married          Not at all                             105.               99
##  4 Married          <NA>                                   102.               97
##  5 Single           Every day                               91.5              83
##  6 Single           Some days                               93.9              85
##  7 Single           Not at all                              99.7              93
##  8 Single           <NA>                                    96.3              88
##  9 <NA>             Every day                               83.1              78
## 10 <NA>             Some days                               86.0              83
## 11 <NA>             Not at all                              83.0              83
## 12 <NA>             <NA>                                    56.7              58
## # ℹ abbreviated name: ¹​`Frequency of days now smoking?`
## # ℹ 3 more variables: `Minimum Weight` <dbl>, `Maximum Weight` <dbl>,
## #   `Standard Deviation of Weight` <dbl>
Due to the poor quality of weight data and the high prevalence of outliers, filter for reported weight between 0 and 200 pounds in order to clearly view the boxplots based on each smoking response. All categories of smoking and marital status possess a similar spread of weight between the first and third quantile. Those who responded “Not at all” had the highest median weight in both groups, and the largest difference in median weight occured in the “Some days” group. Married male individuals appeared to have a higher median weight.
brfss2013 %>%
  mutate(smokday2 = fct_explicit_na(smokday2, "No Answer")) %>%
  filter(sex == "Male" & !is.na(m_status) & weight2 <= 200) %>%
  ggplot() +
  geom_boxplot(aes(x = smokday2, y = weight2, fill = m_status)) +
  labs(
    title = "Reported Weight (pounds) by Marital and Smoking Status",
    x = "Smoking Status",
    y = "Reported Weight (pounds)", 
    fill = "Marital Status"
  ) +
  scale_y_continuous(breaks = seq(0, 200, 20)) +
  theme_classic()