Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("brfss2013.RData")

Part 1: Data

This project concerns the preliminary analysis and illustration of data related to preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population.

The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC). The BRFSS is administered and supported by CDC’s Population Health Surveillance Branch, under the Division of Population Health at the National Center for Chronic Disease Prevention and Health Promotion. BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US.

we will investigate 3 questions using the Brfss2013 dataset. This was collected from non-institutionalized adult population, aged 18 years or older, who reside in the US using both landline telephone- and cellular telephone-based surveys. Results are generalizable to the US population, as this is an observational study that uses random sampling. Since there is no random assignment in this study, we cannot make causal conclusions.

By reviewing the Brfss2013 Codebook and taking a close look to the variable “_state" we can safely assume there is no coveniece bias, as there is not an extreme amount of observations for any of the states. As this study does not employ volunteers we can also exclude the possibility of voluntary response bias. However, there might be non-response bias since there are variables in the Brfss2013 Codebook where we can notice observations called “Refused” or “Missing”.


Part 2: Research questions

Research quesion 1:

What is the relationship between people’s employment status and taking less than 2 years since they last visited a doctor for a routine checkup? What proportion of people who are in good general health take less than 2 years to visit a doctor for a routine check up according to their employment status?

Research quesion 2:

Is there a relationship between education level and being tested for HIV in the US population? Are educated people that have been tested for HIV more likely to be males or females?

Research quesion 3:

Is there a relationship between owning a home and being an adult in good or better health in the US population?


Part 3: Exploratory data analysis

Research quesion 1:

brfss2013 %>%
  select(employ1, checkup1, genhlth) %>%
  str()
## 'data.frame':    491775 obs. of  3 variables:
##  $ employ1 : Factor w/ 8 levels "Employed for wages",..: 7 1 1 7 7 1 1 7 7 5 ...
##  $ checkup1: Factor w/ 5 levels "Within past year",..: 1 1 1 2 4 1 1 1 1 1 ...
##  $ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...

Since our question is about people’s employment status (variable named “employ1”), we will exclude the observasions named in the BRFSS Codebook as “Refused” and “Missing” from our analysis. Likewise, since our question is about people taking more or less than 2 years to visit a doctor for a routine check up (variable named “checkup1”), we will exclude the observasions named in the BRFSS Codebook as “Refused” and “Don’t know/Not sure” from our analysis. For the last part of the question concerning people’s general health (variable named “genhlth”), we will exclude the observasions named in the BRFSS Codebook as “Missing”, “Refused” and “Don’t know/Not sure” from our analysis. We will also present the distributions of the above mentioned variables.

brfss2013 %>%
  filter(!is.na(employ1)) %>%
  group_by(employ1) %>%
  summarise(count = n()) %>%
  mutate(r.freq = count/sum(count))
## # A tibble: 8 x 3
##   employ1                           count r.freq
##   <fct>                             <int>  <dbl>
## 1 Employed for wages               202200 0.414 
## 2 Self-employed                     39832 0.0816
## 3 Out of work for 1 year or more    14074 0.0288
## 4 Out of work for less than 1 year  12242 0.0251
## 5 A homemaker                       31647 0.0648
## 6 A student                         12682 0.0260
## 7 Retired                          138259 0.283 
## 8 Unable to work                    37453 0.0767
brfss2013 %>%
  filter(!is.na(checkup1)) %>%
  group_by(checkup1) %>%
  summarise(count = n()) %>%
  mutate(r.freq = count/sum(count))
## # A tibble: 5 x 3
##   checkup1             count  r.freq
##   <fct>                <int>   <dbl>
## 1 Within past year    356228 0.734  
## 2 Within past 2 years  57298 0.118  
## 3 Within past 5 years  33674 0.0694 
## 4 5 or more years ago  33748 0.0695 
## 5 Never                 4522 0.00931
brfss2013 %>%
  filter(!is.na(genhlth)) %>%
  group_by(genhlth) %>%
  summarise(count = n()) %>%
  mutate(r.freq = count/sum(count))
## # A tibble: 5 x 3
##   genhlth    count r.freq
##   <fct>      <int>  <dbl>
## 1 Excellent  85482 0.175 
## 2 Very good 159076 0.325 
## 3 Good      150555 0.307 
## 4 Fair       66726 0.136 
## 5 Poor       27951 0.0571
brfss2013_1 <- brfss2013 %>%
  filter(!is.na(employ1), !is.na(checkup1), !is.na(genhlth)) %>%
  select(employ1, checkup1, genhlth)
ggplot(data = brfss2013_1, aes(x = employ1)) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1))

brfss2013_1 <- brfss2013 %>%
  filter(!is.na(employ1), !is.na(checkup1), !is.na(genhlth)) %>%
  select(employ1, checkup1, genhlth)
ggplot(data = brfss2013_1, aes(x = checkup1)) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1))

brfss2013_1 <- brfss2013 %>%
  filter(!is.na(employ1), !is.na(checkup1), !is.na(genhlth)) %>%
  select(employ1, checkup1, genhlth)
ggplot(data = brfss2013_1, aes(x = genhlth)) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1))

As we can see both from the tables and the charts: the “employ1” distribution’s highest point is for “Employed for wages” with 202200 observations, the “checkup1” distribution’shighest point is for “Within past year” with 356228 and the “genhlth” distribution’s highest point is for “Very good” with 159076 observations.

We need to turn the scale of how long has it been since people last visited a doctor for a routine checkup (variable named “checkup1”) into numeric values (variable called “num_checkup1”).

brfss2013 <- brfss2013 %>%
  filter(!is.na(checkup1)) %>%
  mutate(num_checkup1 = as.numeric(checkup1))
brfss2013 %>%
  filter(!is.na(checkup1)) %>%
  group_by(checkup1,num_checkup1) %>%
  summarise(count = n()) 
## # A tibble: 5 x 3
## # Groups:   checkup1 [5]
##   checkup1            num_checkup1  count
##   <fct>                      <dbl>  <int>
## 1 Within past year               1 356228
## 2 Within past 2 years            2  57298
## 3 Within past 5 years            3  33674
## 4 5 or more years ago            4  33748
## 5 Never                          5   4522

To define whether people’s check up was gotten in the past 2 years (variable named “less than 2 years”) or not (variable named “more than 2 years”), we will seperate the “Within past year” and “Within past 2 years” observations on the dataset from the “Within past 5 years”, “5 or more years ago” and “Never” observations given on the general health question of our dataset.

Having done that, we can now compare the percentages of people between who have gotten their check up in the past 2 years (variable called “less_p”) and those that gotten their check up in more than 2 years (variable called “more_p”) across people’s employment statuses.

A segmented bar plot will help us visualise the distribution of how recently people got check up across people’s employment statuses.

brfss2013 <- brfss2013 %>%
  filter(!is.na(checkup1), !is.na(employ1)) %>%
  mutate(rec_checkup = ifelse(num_checkup1 <= 2, "less than 2 years" , "more than 2 years"))
brfss2013 %>%
  filter(!is.na(checkup1)) %>%
  group_by(rec_checkup) %>%
  summarise(count = n()) %>%
  mutate(r.freq = count/sum(count))
## # A tibble: 2 x 3
##   rec_checkup        count r.freq
##   <chr>              <int>  <dbl>
## 1 less than 2 years 410779  0.852
## 2 more than 2 years  71485  0.148
brfss2013_1 <- brfss2013 %>%
  filter(!is.na(employ1), !is.na(checkup1), !is.na(genhlth)) %>%
  select(employ1, checkup1, genhlth, rec_checkup)
ggplot(data = brfss2013_1, aes(x = rec_checkup)) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1))

brfss2013 %>%
  filter(!is.na(checkup1), !is.na(employ1)) %>%
  group_by(employ1) %>%
  summarise(count = n(), more_p = sum(rec_checkup == "more than 2 years") / n(), less_p = sum(rec_checkup == "less than 2 years") / n()) %>%
  arrange(desc(more_p))
## # A tibble: 8 x 4
##   employ1                           count more_p less_p
##   <fct>                             <int>  <dbl>  <dbl>
## 1 Out of work for 1 year or more    13831 0.260   0.740
## 2 Out of work for less than 1 year  12056 0.246   0.754
## 3 Self-employed                     39402 0.235   0.765
## 4 A student                         12434 0.195   0.805
## 5 Employed for wages               200257 0.177   0.823
## 6 A homemaker                       31104 0.152   0.848
## 7 Unable to work                    36639 0.110   0.890
## 8 Retired                          136541 0.0667  0.933
brfss2013_na <- brfss2013 %>%
  filter(!is.na(checkup1), !is.na(employ1)) %>%
  select(rec_checkup, employ1)
ggplot(data = brfss2013_na, aes(x = employ1, fill = rec_checkup)) + geom_bar(position=position_dodge()) + theme(axis.text.x = element_text(angle = 90, hjust = 1))

As we can notice both from the data table and from the chart, for all employment statuses more people seem to last visited a doctor for a routine checkup in less than 2 years (ranging between 74.08% and 93.33% for the different employment statuses) than those that last visited a doctor for a routine checkup in more than 2 years (ranging between 6.66% and 25.95% for the different employment statuses). The data in the “brfss2013”" dataset suggest that the people’s employment status seems to not affect whether they last visited a doctor for a routine checkup in more than 2 years.

To proceed with finding out the proportions of people who are in good general health and take more than 2 years to visit a doctor for a routine check up according to their employment status, we will turn the scale of people’s general health (“genhlth”) into numeric values.

brfss2013 <- brfss2013 %>%
  filter(!is.na(genhlth)) %>%
  mutate(num_genhlth = as.numeric(genhlth))
brfss2013 %>%
  filter(!is.na(genhlth)) %>%
  group_by(genhlth,num_genhlth) %>%
  summarise(count = n())
## # A tibble: 5 x 3
## # Groups:   genhlth [5]
##   genhlth   num_genhlth  count
##   <fct>           <dbl>  <int>
## 1 Excellent           1  83986
## 2 Very good           2 156769
## 3 Good                3 147495
## 4 Fair                4  65133
## 5 Poor                5  27043

To further investigate whether people are in good general health (variable named “sound”) or not (variable named “weak”), we will seperate the “Excellent”,“Very good” and “Good” observations from the “Fair” and “Poor” observations given on the general health question of our dataset.

brfss2013 <- brfss2013 %>%
  filter(!is.na(genhlth)) %>%
  mutate(hlth_cond = ifelse(num_genhlth <= 3, "sound" , "weak"))

Therefore, we can now observe the proportions of people who are in “sound” general health condition across taking more or less than 2 years to get a routine check up and their employment status (variable named “s_h_c”). We can also observe the proportions of people who are in “weak” general health condition across taking more or less than 2 years to get a routine check up and their employment status (variable named “w_h_c”).

A segmented bar plot seperated into the “sound” and “weak” general health categories will help us visualise the distribution of how recently people got check up across people’s employment status.

brfss2013 %>%
  filter(!is.na(genhlth), !is.na(checkup1), !is.na(employ1)) %>%
  group_by(employ1, rec_checkup) %>%
  summarise(count = n(), s_h_c = sum(hlth_cond == "sound") / n(), w_h_c = sum(hlth_cond == "weak") / n())%>%
  arrange(desc(w_h_c))
## # A tibble: 16 x 5
## # Groups:   employ1 [8]
##    employ1                          rec_checkup        count s_h_c  w_h_c
##    <fct>                            <chr>              <int> <dbl>  <dbl>
##  1 Unable to work                   less than 2 years  32357 0.312 0.688 
##  2 Unable to work                   more than 2 years   3992 0.318 0.682 
##  3 Out of work for 1 year or more   less than 2 years  10200 0.701 0.299 
##  4 Out of work for 1 year or more   more than 2 years   3568 0.707 0.293 
##  5 Retired                          less than 2 years 126819 0.763 0.237 
##  6 Retired                          more than 2 years   9060 0.787 0.213 
##  7 Out of work for less than 1 year more than 2 years   2948 0.791 0.209 
##  8 A homemaker                      less than 2 years  26289 0.801 0.199 
##  9 Out of work for less than 1 year less than 2 years   9050 0.811 0.189 
## 10 A homemaker                      more than 2 years   4694 0.836 0.164 
## 11 Employed for wages               more than 2 years  35275 0.897 0.103 
## 12 Self-employed                    more than 2 years   9222 0.902 0.0978
## 13 Self-employed                    less than 2 years  30045 0.902 0.0978
## 14 A student                        more than 2 years   2412 0.912 0.0879
## 15 Employed for wages               less than 2 years 164500 0.912 0.0875
## 16 A student                        less than 2 years   9995 0.925 0.0748
brfss2013_1 <- brfss2013 %>%
  filter(!is.na(genhlth), !is.na(checkup1), !is.na(employ1)) %>%
  select(rec_checkup, employ1, hlth_cond)
ggplot(data = brfss2013_1, aes(x = employ1, fill = rec_checkup)) + geom_bar(position=position_dodge()) + facet_grid(.~hlth_cond) + theme(axis.text.x = element_text(angle = 90, hjust = 1))

The confounding variable “general health condition” appears to have affected the distribution of people’s employment status. According to both the data table and the chart above, people in “sound”" health conditions seem more likely to get a routine check up than those in “weak” health conditions. For both “sound” and “weak” health conditions, we can notice differences in percentages between taking more or less than 2 years to get a routine check up for the students, employed for wages, homemakers and those out of work for less than 1 year, while for the rest employment statuses the percentages remain more or less the same. By looking at the chart we can notice there is a difference in pattern between “sound” and “weak” health conditions; when in weak health conditions people who are unable to work or retired seem to visit the doctor more than the people in the rest of the employment statuses.

To sum up, the data in the “brfss2013”" dataset suggest that no matter the employment status people tent to take less than 2 years to get a routine check up .The proprtions of people who are in who are in good (“Sound” ) general health and take less than 2 years to visit a doctor for a routine check up according to their employment statuses are higher than those of the ones than aren’t (“Weak” general health). As this only an exploratory data analysis (whose results can be generalised to the general public but are not causal) we cannot draw definite conclusions, we can only get these indications.

Research quesion 2:

brfss2013 %>%
  select(educa, hivtst6, sex) %>%
  str()
## 'data.frame':    480426 obs. of  3 variables:
##  $ educa  : Factor w/ 6 levels "Never attended school or only kindergarten",..: 6 5 6 4 6 6 4 5 6 4 ...
##  $ hivtst6: Factor w/ 2 levels "Yes","No": 2 1 1 2 2 1 2 2 2 2 ...
##  $ sex    : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...

Since our question is about people’s completed education grade or year (variable named “educa”), we will exclude the observasions named in the BRFSS Codebook as “Refused” and “Missing” from our analysis. Likewise, since our question is about people ever being HIV tested (variable named “hivtst6”), we will exclude the observasions named in the BRFSS Codebook as “Refused”, “Missing” and “Don’t know/Not sure” from our analysis. For the last part of the question concerning people’s gender (variable named “sex”), we only need to exclude 1 “NA” observation which does not appear in the BRFSS Codebook. We will also present the distributions of the above mentioned variables.

brfss2013 %>%
  filter(!is.na(educa)) %>%
  group_by(educa) %>%
  summarise(count = n()) %>%
  mutate(r.freq = count/sum(count))
## # A tibble: 6 x 3
##   educa                                                       count  r.freq
##   <fct>                                                       <int>   <dbl>
## 1 Never attended school or only kindergarten                    631 0.00132
## 2 Grades 1 through 8 (Elementary)                             12761 0.0266 
## 3 Grades 9 though 11 (Some high school)                       27110 0.0566 
## 4 Grade 12 or GED (High school graduate)                     139477 0.291  
## 5 College 1 year to 3 years (Some college or technical scho~ 131660 0.275  
## 6 College 4 years or more (College graduate)                 167650 0.350
brfss2013 %>%
  filter(!is.na(hivtst6)) %>%
  group_by(hivtst6) %>%
  summarise(count = n()) %>%
  mutate(r.freq = count/sum(count))
## # A tibble: 2 x 3
##   hivtst6  count r.freq
##   <fct>    <int>  <dbl>
## 1 Yes     129903  0.303
## 2 No      298755  0.697
brfss2013 %>%
  group_by(sex) %>%
  summarise(count = n())
## Warning: Factor `sex` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 3 x 2
##   sex     count
##   <fct>   <int>
## 1 Male   196767
## 2 Female 283658
## 3 <NA>        1
brfss2013 %>%
  filter(!is.na(sex)) %>%
  group_by(sex) %>%
  summarise(count = n()) %>%
  mutate(r.freq = count/sum(count))
## # A tibble: 2 x 3
##   sex     count r.freq
##   <fct>   <int>  <dbl>
## 1 Male   196767  0.410
## 2 Female 283658  0.590
brfss2013_2 <- brfss2013 %>%
  filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
  select(educa, hivtst6, sex)
ggplot(data = brfss2013_2, aes(x = educa)) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + coord_flip() 

brfss2013_2 <- brfss2013 %>%
  filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
  select(educa, hivtst6, sex)
ggplot(data = brfss2013_2, aes(x = hivtst6)) + geom_bar() + coord_flip() + theme(axis.text.x = element_text(angle = 90, hjust = 1))

brfss2013_2 <- brfss2013 %>%
  filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
  select(educa, hivtst6, sex)
ggplot(data = brfss2013_2, aes(x = sex)) + geom_bar() + coord_flip() + theme(axis.text.x = element_text(angle = 90, hjust = 1))

As we can see both from the tables and the charts: the “educa” distribution’s highest point is for “College 4 years or more (College graduate)” with 167650 observations, the “hivtst6” distribution’s highest point is for “No” with 298755 observations and the “sex” distribution’s highest point is for “Female” with 283658 observations.

Having done that, we can now compare the percentages of people between who have been tested for HIV (variable called “y_per”) and those that haven’t been (variable called “n_per”) across people’s completed education grade or year.

A segmented bar plot will help us visualise the distribution of having been tested for HIV across people’s completed education grade or year.

brfss2013 %>%
  filter(!is.na(educa), !is.na(hivtst6)) %>%
  group_by(educa) %>%
  summarise(count = n(), y_per = sum(hivtst6 == "Yes") / n() *100, n_per = sum(hivtst6 == "No") / n() *100) %>%
  arrange(desc(y_per))
## # A tibble: 6 x 4
##   educa                                                   count y_per n_per
##   <fct>                                                   <int> <dbl> <dbl>
## 1 College 4 years or more (College graduate)             152528  33.5  66.5
## 2 College 1 year to 3 years (Some college or technical ~ 118251  32.4  67.6
## 3 Grades 9 though 11 (Some high school)                   23337  31.1  68.9
## 4 Never attended school or only kindergarten                524  28.1  71.9
## 5 Grades 1 through 8 (Elementary)                         10864  25.1  74.9
## 6 Grade 12 or GED (High school graduate)                 122349  24.6  75.4
brfss2013_2 <- brfss2013 %>%
  filter(!is.na(educa), !is.na(hivtst6)) %>%
  select(educa, hivtst6)
ggplot(data = brfss2013_2, aes(x = educa, fill = hivtst6)) + geom_bar(position=position_dodge()) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + coord_flip() + scale_fill_manual(values=c("#003300", "#CC0000"))

As we can notice both from the data table and from the chart, the data in the “brfss2013”" dataset suggest that higher educated people seem to get more tested for HIV.

We can further present the above mentioned indication by seperating the “Never attended school or only kindergarten”,“Grades 1 through 8 (Elementary)”, “Grades 9 though 11 (Some high school)” and “Grade 12 or GED (High school graduate)” observations from the “College 1 year to 3 years (Some college or technical school)” and “College 1 year to 3 years (Some college or technical school)” observations given on the general health question of our dataset.

brfss2013 <- brfss2013 %>%
  filter(!is.na(educa)) %>%
  mutate(num_educa = as.numeric(educa))
brfss2013 %>%
  filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
  group_by(educa,num_educa) %>%
  summarise(count = n())
## # A tibble: 6 x 3
## # Groups:   educa [6]
##   educa                                                    num_educa  count
##   <fct>                                                        <dbl>  <int>
## 1 Never attended school or only kindergarten                       1    524
## 2 Grades 1 through 8 (Elementary)                                  2  10864
## 3 Grades 9 though 11 (Some high school)                            3  23337
## 4 Grade 12 or GED (High school graduate)                           4 122349
## 5 College 1 year to 3 years (Some college or technical sc~         5 118251
## 6 College 4 years or more (College graduate)                       6 152528
brfss2013 <- brfss2013 %>%
  filter(!is.na(educa)) %>%
  mutate(edu_lvl = ifelse(num_educa <= 5, "higher" , "lower")) 
brfss2013 %>%
  filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
  group_by(edu_lvl) %>%
  summarise(count = n(), y_per = sum(hivtst6 == "Yes") / n() *100, n_per = sum(hivtst6 == "No") / n() *100)
## # A tibble: 2 x 4
##   edu_lvl  count y_per n_per
##   <chr>    <int> <dbl> <dbl>
## 1 higher  275325  28.5  71.5
## 2 lower   152528  33.5  66.5
brfss2013_2 <- brfss2013 %>%
  filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
  select(edu_lvl, hivtst6)
ggplot(data = brfss2013_2, aes(x = edu_lvl, fill = hivtst6)) + geom_bar(position=position_dodge()) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + scale_fill_manual(values=c("#003300", "#CC0000"))

Once again we can notice that higher educated people appear to get tested for HIV more than lower educated ones.

we can now observe the proportions of females/males across their completed education grade or year and having been tested for HIV (variables named “f_per” and “m_per” accordingly).

A segmented bar plot seperated into the “Female” and “Male” genders will help us visualise the distribution of having been tested for HIV across people’s completed education grade or year.

brfss2013 %>%
  filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
  group_by(educa, hivtst6) %>%
  summarise(count = n(), f_per = sum(sex == "Female") / n() *100, m_per = sum(sex == "Male") / n() *100) %>%
  arrange(desc(f_per))
## # A tibble: 12 x 5
## # Groups:   educa [6]
##    educa                                         hivtst6  count f_per m_per
##    <fct>                                         <fct>    <int> <dbl> <dbl>
##  1 College 1 year to 3 years (Some college or t~ No       79925  62.5  37.5
##  2 Grade 12 or GED (High school graduate)        No       92216  61.4  38.6
##  3 Never attended school or only kindergarten    No         377  61.0  39.0
##  4 College 1 year to 3 years (Some college or t~ Yes      38326  60.9  39.1
##  5 Grades 9 though 11 (Some high school)         No       16082  60.9  39.1
##  6 Grades 1 through 8 (Elementary)               Yes       2731  59.3  40.7
##  7 Never attended school or only kindergarten    Yes        147  59.2  40.8
##  8 College 4 years or more (College graduate)    Yes      51113  58.6  41.4
##  9 Grades 9 though 11 (Some high school)         Yes       7255  58.4  41.6
## 10 Grades 1 through 8 (Elementary)               No        8133  56.8  43.2
## 11 College 4 years or more (College graduate)    No      101415  56.1  43.9
## 12 Grade 12 or GED (High school graduate)        Yes      30133  55.2  44.8
brfss2013_2 <- brfss2013 %>%
  filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
  select(educa, hivtst6, sex)
ggplot(data = brfss2013_2, aes(x = educa, fill = hivtst6)) + geom_bar(position=position_dodge()) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + facet_grid(.~sex) + coord_flip() + scale_fill_manual(values=c("#003300", "#CC0000"))

The confounding variable “sex” appears to have affected the distribution of people being tested for HIV across their completed education grade or year.No matter the education level completed, females seem to get tested more than males. By looking more closely at both the data table and the chart, we can see that the higher the education level completed for both males and females the more they get tested for HIV relatively.

brfss2013 %>%
  filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
  group_by(edu_lvl, hivtst6) %>%
  summarise(count = n(), f_per = sum(sex == "Female") / n() *100, m_per = sum(sex == "Male") / n() *100)
## # A tibble: 4 x 5
## # Groups:   edu_lvl [2]
##   edu_lvl hivtst6  count f_per m_per
##   <chr>   <fct>    <int> <dbl> <dbl>
## 1 higher  Yes      78592  58.4  41.6
## 2 higher  No      196733  61.6  38.4
## 3 lower   Yes      51113  58.6  41.4
## 4 lower   No      101415  56.1  43.9
brfss2013_2 <- brfss2013 %>%
  filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
  select(edu_lvl, hivtst6, sex)
ggplot(data = brfss2013_2, aes(x = edu_lvl, fill = hivtst6)) + geom_bar(position=position_dodge()) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + facet_grid(.~sex) + coord_flip() + scale_fill_manual(values=c("#003300", "#CC0000"))

By looking at it from a higher/lower education angle, the indication that higher educated females get tested more than higher educated males is reaffirmed from both the table and the chart shown above.

To sum up, the data in the “brfss2013”" dataset suggest that higher educated people appear to get tested for HIV more than lower educated ones .It also seems that higher educated females get tested more than higher educated males. As this only an exploratory data analysis (whose results can be generalised to the general public but are not causal) we cannot draw definite conclusions, we can only get these indications.

Research quesion 3:

brfss2013 %>%
  select(renthom1, X_rfhlth, sex) %>%
  str()
## 'data.frame':    479289 obs. of  3 variables:
##  $ renthom1: Factor w/ 3 levels "Own","Rent","Other arrangement": 1 1 1 1 1 1 1 2 1 1 ...
##  $ X_rfhlth: Factor w/ 2 levels "Good or Better Health",..: 2 1 1 1 1 1 2 1 1 1 ...
##  $ sex     : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...

Since our question is about people owning a home (variable named “renthom1”), we will exclude the observasions named in the BRFSS Codebook as “Refused”, “Don’t know/Not sure” and “Missing” from our analysis. Likewise, since our question is about adults with fair or poor health (variable named “X_rfhlth”), we will exclude the observasions named in the BRFSS Codebook as “Don’t know/Not Sure Or Refused/Missing” from our analysis. We will also present the distributions of the above mentioned variables.

brfss2013 %>%
  filter(!is.na(renthom1)) %>%
  group_by(renthom1) %>%
  summarise(count = n()) %>%
  mutate(r.freq = count/sum(count))
## # A tibble: 3 x 3
##   renthom1           count r.freq
##   <fct>              <int>  <dbl>
## 1 Own               344831 0.729 
## 2 Rent              107946 0.228 
## 3 Other arrangement  19940 0.0422
brfss2013 %>%
  filter(!is.na(X_rfhlth)) %>%
  group_by(X_rfhlth) %>%
  summarise(count = n()) %>%
  mutate(r.freq = count/sum(count))
## # A tibble: 2 x 3
##   X_rfhlth               count r.freq
##   <fct>                  <int>  <dbl>
## 1 Good or Better Health 387383  0.808
## 2 Fair or Poor Health    91903  0.192
brfss2013_3_na <- brfss2013 %>%
  filter(!is.na(X_rfhlth), !is.na(renthom1)) %>%
  select(X_rfhlth, renthom1, sex)
ggplot(data = brfss2013_3_na, aes(x = X_rfhlth)) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1))

brfss2013_3_na <- brfss2013 %>%
  filter(!is.na(X_rfhlth), !is.na(renthom1)) %>%
  select(X_rfhlth, renthom1)
ggplot(data = brfss2013_3_na, aes(x = renthom1)) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1))

As we can see both from the tables and the charts: in the “X_rfhlth” distribution more people are in “Good or Better Health” than those in “Fair or Poor Health”, in the “renthom1” distribution more people “own” their home than then ones who “rent” (who are more than those with “other arrangments”).

Having done that, we can now compare the percentages of people between who own their home (variable called “o_per”) and those that don’t (variables called “r_per” and “other_per”) across people’s health conditions. A segmented bar plot will help us visualise the distribution of accomodation circumstances across people’s health conditions.

brfss2013 %>%
  filter(!is.na(renthom1), !is.na(X_rfhlth)) %>%
  group_by(renthom1, X_rfhlth) %>%
  summarise(count=n())
## # A tibble: 6 x 3
## # Groups:   renthom1 [3]
##   renthom1          X_rfhlth               count
##   <fct>             <fct>                  <int>
## 1 Own               Good or Better Health 287760
## 2 Own               Fair or Poor Health    57069
## 3 Rent              Good or Better Health  79695
## 4 Rent              Fair or Poor Health    28251
## 5 Other arrangement Good or Better Health  14632
## 6 Other arrangement Fair or Poor Health     5308
brfss2013 %>%
  filter(!is.na(renthom1), !is.na(X_rfhlth)) %>%
  group_by(X_rfhlth) %>%
  summarise(count = n(), o_per = sum(renthom1 == "Own") / n() *100, r_per = sum(renthom1 == "Rent") / n() *100, other_per = sum(renthom1 == "Other arrangement") / n() *100)
## # A tibble: 2 x 5
##   X_rfhlth               count o_per r_per other_per
##   <fct>                  <int> <dbl> <dbl>     <dbl>
## 1 Good or Better Health 382087  75.3  20.9      3.83
## 2 Fair or Poor Health    90628  63.0  31.2      5.86
brfss2013_3_na <- brfss2013 %>%
  filter(!is.na(renthom1), !is.na(X_rfhlth)) %>%
  select(renthom1, X_rfhlth)
ggplot(data = brfss2013_3_na, aes(x = X_rfhlth, fill = renthom1)) + geom_bar(position=position_dodge()) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + scale_fill_manual(values=c("#A224D8", "#136FCA", "#E9B524")) 

As we can notice both from the data table and from the chart, the data in the “brfss2013”" dataset suggest that more people who are in “Good or Better Health” appear to own their home (75.14%) than those in “Fair or Poor Health” (62.88%). In contrast, when it comes to renting more people who are in “Fair or Poor Health” appear to rent their home (31.21%) than those in “Good or Better Health” (20.96%).

To sum up, the data in the “brfss2013”" dataset suggest that healthier people seem to own their home. As this only an exploratory data analysis (whose results can be generalised to the general public but are not causal) we cannot draw definite conclusions, we can only get these indications.