library(ggplot2)
library(dplyr)
library(statsr)load("brfss2013.RData")This project concerns the preliminary analysis and illustration of data related to preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population.
The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC). The BRFSS is administered and supported by CDC’s Population Health Surveillance Branch, under the Division of Population Health at the National Center for Chronic Disease Prevention and Health Promotion. BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US.
we will investigate 3 questions using the Brfss2013 dataset. This was collected from non-institutionalized adult population, aged 18 years or older, who reside in the US using both landline telephone- and cellular telephone-based surveys. Results are generalizable to the US population, as this is an observational study that uses random sampling. Since there is no random assignment in this study, we cannot make causal conclusions.
By reviewing the Brfss2013 Codebook and taking a close look to the variable “_state" we can safely assume there is no coveniece bias, as there is not an extreme amount of observations for any of the states. As this study does not employ volunteers we can also exclude the possibility of voluntary response bias. However, there might be non-response bias since there are variables in the Brfss2013 Codebook where we can notice observations called “Refused” or “Missing”.
Research quesion 1:
What is the relationship between people’s employment status and taking less than 2 years since they last visited a doctor for a routine checkup? What proportion of people who are in good general health take less than 2 years to visit a doctor for a routine check up according to their employment status?
Research quesion 2:
Is there a relationship between education level and being tested for HIV in the US population? Are educated people that have been tested for HIV more likely to be males or females?
Research quesion 3:
Is there a relationship between owning a home and being an adult in good or better health in the US population?
Research quesion 1:
brfss2013 %>%
select(employ1, checkup1, genhlth) %>%
str()## 'data.frame': 491775 obs. of 3 variables:
## $ employ1 : Factor w/ 8 levels "Employed for wages",..: 7 1 1 7 7 1 1 7 7 5 ...
## $ checkup1: Factor w/ 5 levels "Within past year",..: 1 1 1 2 4 1 1 1 1 1 ...
## $ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
Since our question is about people’s employment status (variable named “employ1”), we will exclude the observasions named in the BRFSS Codebook as “Refused” and “Missing” from our analysis. Likewise, since our question is about people taking more or less than 2 years to visit a doctor for a routine check up (variable named “checkup1”), we will exclude the observasions named in the BRFSS Codebook as “Refused” and “Don’t know/Not sure” from our analysis. For the last part of the question concerning people’s general health (variable named “genhlth”), we will exclude the observasions named in the BRFSS Codebook as “Missing”, “Refused” and “Don’t know/Not sure” from our analysis. We will also present the distributions of the above mentioned variables.
brfss2013 %>%
filter(!is.na(employ1)) %>%
group_by(employ1) %>%
summarise(count = n()) %>%
mutate(r.freq = count/sum(count))## # A tibble: 8 x 3
## employ1 count r.freq
## <fct> <int> <dbl>
## 1 Employed for wages 202200 0.414
## 2 Self-employed 39832 0.0816
## 3 Out of work for 1 year or more 14074 0.0288
## 4 Out of work for less than 1 year 12242 0.0251
## 5 A homemaker 31647 0.0648
## 6 A student 12682 0.0260
## 7 Retired 138259 0.283
## 8 Unable to work 37453 0.0767
brfss2013 %>%
filter(!is.na(checkup1)) %>%
group_by(checkup1) %>%
summarise(count = n()) %>%
mutate(r.freq = count/sum(count))## # A tibble: 5 x 3
## checkup1 count r.freq
## <fct> <int> <dbl>
## 1 Within past year 356228 0.734
## 2 Within past 2 years 57298 0.118
## 3 Within past 5 years 33674 0.0694
## 4 5 or more years ago 33748 0.0695
## 5 Never 4522 0.00931
brfss2013 %>%
filter(!is.na(genhlth)) %>%
group_by(genhlth) %>%
summarise(count = n()) %>%
mutate(r.freq = count/sum(count))## # A tibble: 5 x 3
## genhlth count r.freq
## <fct> <int> <dbl>
## 1 Excellent 85482 0.175
## 2 Very good 159076 0.325
## 3 Good 150555 0.307
## 4 Fair 66726 0.136
## 5 Poor 27951 0.0571
brfss2013_1 <- brfss2013 %>%
filter(!is.na(employ1), !is.na(checkup1), !is.na(genhlth)) %>%
select(employ1, checkup1, genhlth)
ggplot(data = brfss2013_1, aes(x = employ1)) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1))brfss2013_1 <- brfss2013 %>%
filter(!is.na(employ1), !is.na(checkup1), !is.na(genhlth)) %>%
select(employ1, checkup1, genhlth)
ggplot(data = brfss2013_1, aes(x = checkup1)) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1))brfss2013_1 <- brfss2013 %>%
filter(!is.na(employ1), !is.na(checkup1), !is.na(genhlth)) %>%
select(employ1, checkup1, genhlth)
ggplot(data = brfss2013_1, aes(x = genhlth)) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1))As we can see both from the tables and the charts: the “employ1” distribution’s highest point is for “Employed for wages” with 202200 observations, the “checkup1” distribution’shighest point is for “Within past year” with 356228 and the “genhlth” distribution’s highest point is for “Very good” with 159076 observations.
We need to turn the scale of how long has it been since people last visited a doctor for a routine checkup (variable named “checkup1”) into numeric values (variable called “num_checkup1”).
brfss2013 <- brfss2013 %>%
filter(!is.na(checkup1)) %>%
mutate(num_checkup1 = as.numeric(checkup1))
brfss2013 %>%
filter(!is.na(checkup1)) %>%
group_by(checkup1,num_checkup1) %>%
summarise(count = n()) ## # A tibble: 5 x 3
## # Groups: checkup1 [5]
## checkup1 num_checkup1 count
## <fct> <dbl> <int>
## 1 Within past year 1 356228
## 2 Within past 2 years 2 57298
## 3 Within past 5 years 3 33674
## 4 5 or more years ago 4 33748
## 5 Never 5 4522
To define whether people’s check up was gotten in the past 2 years (variable named “less than 2 years”) or not (variable named “more than 2 years”), we will seperate the “Within past year” and “Within past 2 years” observations on the dataset from the “Within past 5 years”, “5 or more years ago” and “Never” observations given on the general health question of our dataset.
Having done that, we can now compare the percentages of people between who have gotten their check up in the past 2 years (variable called “less_p”) and those that gotten their check up in more than 2 years (variable called “more_p”) across people’s employment statuses.
A segmented bar plot will help us visualise the distribution of how recently people got check up across people’s employment statuses.
brfss2013 <- brfss2013 %>%
filter(!is.na(checkup1), !is.na(employ1)) %>%
mutate(rec_checkup = ifelse(num_checkup1 <= 2, "less than 2 years" , "more than 2 years"))
brfss2013 %>%
filter(!is.na(checkup1)) %>%
group_by(rec_checkup) %>%
summarise(count = n()) %>%
mutate(r.freq = count/sum(count))## # A tibble: 2 x 3
## rec_checkup count r.freq
## <chr> <int> <dbl>
## 1 less than 2 years 410779 0.852
## 2 more than 2 years 71485 0.148
brfss2013_1 <- brfss2013 %>%
filter(!is.na(employ1), !is.na(checkup1), !is.na(genhlth)) %>%
select(employ1, checkup1, genhlth, rec_checkup)
ggplot(data = brfss2013_1, aes(x = rec_checkup)) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1))brfss2013 %>%
filter(!is.na(checkup1), !is.na(employ1)) %>%
group_by(employ1) %>%
summarise(count = n(), more_p = sum(rec_checkup == "more than 2 years") / n(), less_p = sum(rec_checkup == "less than 2 years") / n()) %>%
arrange(desc(more_p))## # A tibble: 8 x 4
## employ1 count more_p less_p
## <fct> <int> <dbl> <dbl>
## 1 Out of work for 1 year or more 13831 0.260 0.740
## 2 Out of work for less than 1 year 12056 0.246 0.754
## 3 Self-employed 39402 0.235 0.765
## 4 A student 12434 0.195 0.805
## 5 Employed for wages 200257 0.177 0.823
## 6 A homemaker 31104 0.152 0.848
## 7 Unable to work 36639 0.110 0.890
## 8 Retired 136541 0.0667 0.933
brfss2013_na <- brfss2013 %>%
filter(!is.na(checkup1), !is.na(employ1)) %>%
select(rec_checkup, employ1)
ggplot(data = brfss2013_na, aes(x = employ1, fill = rec_checkup)) + geom_bar(position=position_dodge()) + theme(axis.text.x = element_text(angle = 90, hjust = 1))As we can notice both from the data table and from the chart, for all employment statuses more people seem to last visited a doctor for a routine checkup in less than 2 years (ranging between 74.08% and 93.33% for the different employment statuses) than those that last visited a doctor for a routine checkup in more than 2 years (ranging between 6.66% and 25.95% for the different employment statuses). The data in the “brfss2013”" dataset suggest that the people’s employment status seems to not affect whether they last visited a doctor for a routine checkup in more than 2 years.
To proceed with finding out the proportions of people who are in good general health and take more than 2 years to visit a doctor for a routine check up according to their employment status, we will turn the scale of people’s general health (“genhlth”) into numeric values.
brfss2013 <- brfss2013 %>%
filter(!is.na(genhlth)) %>%
mutate(num_genhlth = as.numeric(genhlth))
brfss2013 %>%
filter(!is.na(genhlth)) %>%
group_by(genhlth,num_genhlth) %>%
summarise(count = n())## # A tibble: 5 x 3
## # Groups: genhlth [5]
## genhlth num_genhlth count
## <fct> <dbl> <int>
## 1 Excellent 1 83986
## 2 Very good 2 156769
## 3 Good 3 147495
## 4 Fair 4 65133
## 5 Poor 5 27043
To further investigate whether people are in good general health (variable named “sound”) or not (variable named “weak”), we will seperate the “Excellent”,“Very good” and “Good” observations from the “Fair” and “Poor” observations given on the general health question of our dataset.
brfss2013 <- brfss2013 %>%
filter(!is.na(genhlth)) %>%
mutate(hlth_cond = ifelse(num_genhlth <= 3, "sound" , "weak"))Therefore, we can now observe the proportions of people who are in “sound” general health condition across taking more or less than 2 years to get a routine check up and their employment status (variable named “s_h_c”). We can also observe the proportions of people who are in “weak” general health condition across taking more or less than 2 years to get a routine check up and their employment status (variable named “w_h_c”).
A segmented bar plot seperated into the “sound” and “weak” general health categories will help us visualise the distribution of how recently people got check up across people’s employment status.
brfss2013 %>%
filter(!is.na(genhlth), !is.na(checkup1), !is.na(employ1)) %>%
group_by(employ1, rec_checkup) %>%
summarise(count = n(), s_h_c = sum(hlth_cond == "sound") / n(), w_h_c = sum(hlth_cond == "weak") / n())%>%
arrange(desc(w_h_c))## # A tibble: 16 x 5
## # Groups: employ1 [8]
## employ1 rec_checkup count s_h_c w_h_c
## <fct> <chr> <int> <dbl> <dbl>
## 1 Unable to work less than 2 years 32357 0.312 0.688
## 2 Unable to work more than 2 years 3992 0.318 0.682
## 3 Out of work for 1 year or more less than 2 years 10200 0.701 0.299
## 4 Out of work for 1 year or more more than 2 years 3568 0.707 0.293
## 5 Retired less than 2 years 126819 0.763 0.237
## 6 Retired more than 2 years 9060 0.787 0.213
## 7 Out of work for less than 1 year more than 2 years 2948 0.791 0.209
## 8 A homemaker less than 2 years 26289 0.801 0.199
## 9 Out of work for less than 1 year less than 2 years 9050 0.811 0.189
## 10 A homemaker more than 2 years 4694 0.836 0.164
## 11 Employed for wages more than 2 years 35275 0.897 0.103
## 12 Self-employed more than 2 years 9222 0.902 0.0978
## 13 Self-employed less than 2 years 30045 0.902 0.0978
## 14 A student more than 2 years 2412 0.912 0.0879
## 15 Employed for wages less than 2 years 164500 0.912 0.0875
## 16 A student less than 2 years 9995 0.925 0.0748
brfss2013_1 <- brfss2013 %>%
filter(!is.na(genhlth), !is.na(checkup1), !is.na(employ1)) %>%
select(rec_checkup, employ1, hlth_cond)
ggplot(data = brfss2013_1, aes(x = employ1, fill = rec_checkup)) + geom_bar(position=position_dodge()) + facet_grid(.~hlth_cond) + theme(axis.text.x = element_text(angle = 90, hjust = 1))The confounding variable “general health condition” appears to have affected the distribution of people’s employment status. According to both the data table and the chart above, people in “sound”" health conditions seem more likely to get a routine check up than those in “weak” health conditions. For both “sound” and “weak” health conditions, we can notice differences in percentages between taking more or less than 2 years to get a routine check up for the students, employed for wages, homemakers and those out of work for less than 1 year, while for the rest employment statuses the percentages remain more or less the same. By looking at the chart we can notice there is a difference in pattern between “sound” and “weak” health conditions; when in weak health conditions people who are unable to work or retired seem to visit the doctor more than the people in the rest of the employment statuses.
To sum up, the data in the “brfss2013”" dataset suggest that no matter the employment status people tent to take less than 2 years to get a routine check up .The proprtions of people who are in who are in good (“Sound” ) general health and take less than 2 years to visit a doctor for a routine check up according to their employment statuses are higher than those of the ones than aren’t (“Weak” general health). As this only an exploratory data analysis (whose results can be generalised to the general public but are not causal) we cannot draw definite conclusions, we can only get these indications.
Research quesion 2:
brfss2013 %>%
select(educa, hivtst6, sex) %>%
str()## 'data.frame': 480426 obs. of 3 variables:
## $ educa : Factor w/ 6 levels "Never attended school or only kindergarten",..: 6 5 6 4 6 6 4 5 6 4 ...
## $ hivtst6: Factor w/ 2 levels "Yes","No": 2 1 1 2 2 1 2 2 2 2 ...
## $ sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
Since our question is about people’s completed education grade or year (variable named “educa”), we will exclude the observasions named in the BRFSS Codebook as “Refused” and “Missing” from our analysis. Likewise, since our question is about people ever being HIV tested (variable named “hivtst6”), we will exclude the observasions named in the BRFSS Codebook as “Refused”, “Missing” and “Don’t know/Not sure” from our analysis. For the last part of the question concerning people’s gender (variable named “sex”), we only need to exclude 1 “NA” observation which does not appear in the BRFSS Codebook. We will also present the distributions of the above mentioned variables.
brfss2013 %>%
filter(!is.na(educa)) %>%
group_by(educa) %>%
summarise(count = n()) %>%
mutate(r.freq = count/sum(count))## # A tibble: 6 x 3
## educa count r.freq
## <fct> <int> <dbl>
## 1 Never attended school or only kindergarten 631 0.00132
## 2 Grades 1 through 8 (Elementary) 12761 0.0266
## 3 Grades 9 though 11 (Some high school) 27110 0.0566
## 4 Grade 12 or GED (High school graduate) 139477 0.291
## 5 College 1 year to 3 years (Some college or technical scho~ 131660 0.275
## 6 College 4 years or more (College graduate) 167650 0.350
brfss2013 %>%
filter(!is.na(hivtst6)) %>%
group_by(hivtst6) %>%
summarise(count = n()) %>%
mutate(r.freq = count/sum(count))## # A tibble: 2 x 3
## hivtst6 count r.freq
## <fct> <int> <dbl>
## 1 Yes 129903 0.303
## 2 No 298755 0.697
brfss2013 %>%
group_by(sex) %>%
summarise(count = n())## Warning: Factor `sex` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 3 x 2
## sex count
## <fct> <int>
## 1 Male 196767
## 2 Female 283658
## 3 <NA> 1
brfss2013 %>%
filter(!is.na(sex)) %>%
group_by(sex) %>%
summarise(count = n()) %>%
mutate(r.freq = count/sum(count))## # A tibble: 2 x 3
## sex count r.freq
## <fct> <int> <dbl>
## 1 Male 196767 0.410
## 2 Female 283658 0.590
brfss2013_2 <- brfss2013 %>%
filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
select(educa, hivtst6, sex)
ggplot(data = brfss2013_2, aes(x = educa)) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + coord_flip() brfss2013_2 <- brfss2013 %>%
filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
select(educa, hivtst6, sex)
ggplot(data = brfss2013_2, aes(x = hivtst6)) + geom_bar() + coord_flip() + theme(axis.text.x = element_text(angle = 90, hjust = 1))brfss2013_2 <- brfss2013 %>%
filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
select(educa, hivtst6, sex)
ggplot(data = brfss2013_2, aes(x = sex)) + geom_bar() + coord_flip() + theme(axis.text.x = element_text(angle = 90, hjust = 1))As we can see both from the tables and the charts: the “educa” distribution’s highest point is for “College 4 years or more (College graduate)” with 167650 observations, the “hivtst6” distribution’s highest point is for “No” with 298755 observations and the “sex” distribution’s highest point is for “Female” with 283658 observations.
Having done that, we can now compare the percentages of people between who have been tested for HIV (variable called “y_per”) and those that haven’t been (variable called “n_per”) across people’s completed education grade or year.
A segmented bar plot will help us visualise the distribution of having been tested for HIV across people’s completed education grade or year.
brfss2013 %>%
filter(!is.na(educa), !is.na(hivtst6)) %>%
group_by(educa) %>%
summarise(count = n(), y_per = sum(hivtst6 == "Yes") / n() *100, n_per = sum(hivtst6 == "No") / n() *100) %>%
arrange(desc(y_per))## # A tibble: 6 x 4
## educa count y_per n_per
## <fct> <int> <dbl> <dbl>
## 1 College 4 years or more (College graduate) 152528 33.5 66.5
## 2 College 1 year to 3 years (Some college or technical ~ 118251 32.4 67.6
## 3 Grades 9 though 11 (Some high school) 23337 31.1 68.9
## 4 Never attended school or only kindergarten 524 28.1 71.9
## 5 Grades 1 through 8 (Elementary) 10864 25.1 74.9
## 6 Grade 12 or GED (High school graduate) 122349 24.6 75.4
brfss2013_2 <- brfss2013 %>%
filter(!is.na(educa), !is.na(hivtst6)) %>%
select(educa, hivtst6)
ggplot(data = brfss2013_2, aes(x = educa, fill = hivtst6)) + geom_bar(position=position_dodge()) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + coord_flip() + scale_fill_manual(values=c("#003300", "#CC0000"))As we can notice both from the data table and from the chart, the data in the “brfss2013”" dataset suggest that higher educated people seem to get more tested for HIV.
We can further present the above mentioned indication by seperating the “Never attended school or only kindergarten”,“Grades 1 through 8 (Elementary)”, “Grades 9 though 11 (Some high school)” and “Grade 12 or GED (High school graduate)” observations from the “College 1 year to 3 years (Some college or technical school)” and “College 1 year to 3 years (Some college or technical school)” observations given on the general health question of our dataset.
brfss2013 <- brfss2013 %>%
filter(!is.na(educa)) %>%
mutate(num_educa = as.numeric(educa))
brfss2013 %>%
filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
group_by(educa,num_educa) %>%
summarise(count = n())## # A tibble: 6 x 3
## # Groups: educa [6]
## educa num_educa count
## <fct> <dbl> <int>
## 1 Never attended school or only kindergarten 1 524
## 2 Grades 1 through 8 (Elementary) 2 10864
## 3 Grades 9 though 11 (Some high school) 3 23337
## 4 Grade 12 or GED (High school graduate) 4 122349
## 5 College 1 year to 3 years (Some college or technical sc~ 5 118251
## 6 College 4 years or more (College graduate) 6 152528
brfss2013 <- brfss2013 %>%
filter(!is.na(educa)) %>%
mutate(edu_lvl = ifelse(num_educa <= 5, "higher" , "lower"))
brfss2013 %>%
filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
group_by(edu_lvl) %>%
summarise(count = n(), y_per = sum(hivtst6 == "Yes") / n() *100, n_per = sum(hivtst6 == "No") / n() *100)## # A tibble: 2 x 4
## edu_lvl count y_per n_per
## <chr> <int> <dbl> <dbl>
## 1 higher 275325 28.5 71.5
## 2 lower 152528 33.5 66.5
brfss2013_2 <- brfss2013 %>%
filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
select(edu_lvl, hivtst6)
ggplot(data = brfss2013_2, aes(x = edu_lvl, fill = hivtst6)) + geom_bar(position=position_dodge()) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + scale_fill_manual(values=c("#003300", "#CC0000"))Once again we can notice that higher educated people appear to get tested for HIV more than lower educated ones.
we can now observe the proportions of females/males across their completed education grade or year and having been tested for HIV (variables named “f_per” and “m_per” accordingly).
A segmented bar plot seperated into the “Female” and “Male” genders will help us visualise the distribution of having been tested for HIV across people’s completed education grade or year.
brfss2013 %>%
filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
group_by(educa, hivtst6) %>%
summarise(count = n(), f_per = sum(sex == "Female") / n() *100, m_per = sum(sex == "Male") / n() *100) %>%
arrange(desc(f_per))## # A tibble: 12 x 5
## # Groups: educa [6]
## educa hivtst6 count f_per m_per
## <fct> <fct> <int> <dbl> <dbl>
## 1 College 1 year to 3 years (Some college or t~ No 79925 62.5 37.5
## 2 Grade 12 or GED (High school graduate) No 92216 61.4 38.6
## 3 Never attended school or only kindergarten No 377 61.0 39.0
## 4 College 1 year to 3 years (Some college or t~ Yes 38326 60.9 39.1
## 5 Grades 9 though 11 (Some high school) No 16082 60.9 39.1
## 6 Grades 1 through 8 (Elementary) Yes 2731 59.3 40.7
## 7 Never attended school or only kindergarten Yes 147 59.2 40.8
## 8 College 4 years or more (College graduate) Yes 51113 58.6 41.4
## 9 Grades 9 though 11 (Some high school) Yes 7255 58.4 41.6
## 10 Grades 1 through 8 (Elementary) No 8133 56.8 43.2
## 11 College 4 years or more (College graduate) No 101415 56.1 43.9
## 12 Grade 12 or GED (High school graduate) Yes 30133 55.2 44.8
brfss2013_2 <- brfss2013 %>%
filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
select(educa, hivtst6, sex)
ggplot(data = brfss2013_2, aes(x = educa, fill = hivtst6)) + geom_bar(position=position_dodge()) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + facet_grid(.~sex) + coord_flip() + scale_fill_manual(values=c("#003300", "#CC0000"))The confounding variable “sex” appears to have affected the distribution of people being tested for HIV across their completed education grade or year.No matter the education level completed, females seem to get tested more than males. By looking more closely at both the data table and the chart, we can see that the higher the education level completed for both males and females the more they get tested for HIV relatively.
brfss2013 %>%
filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
group_by(edu_lvl, hivtst6) %>%
summarise(count = n(), f_per = sum(sex == "Female") / n() *100, m_per = sum(sex == "Male") / n() *100)## # A tibble: 4 x 5
## # Groups: edu_lvl [2]
## edu_lvl hivtst6 count f_per m_per
## <chr> <fct> <int> <dbl> <dbl>
## 1 higher Yes 78592 58.4 41.6
## 2 higher No 196733 61.6 38.4
## 3 lower Yes 51113 58.6 41.4
## 4 lower No 101415 56.1 43.9
brfss2013_2 <- brfss2013 %>%
filter(!is.na(educa), !is.na(hivtst6), !is.na(sex)) %>%
select(edu_lvl, hivtst6, sex)
ggplot(data = brfss2013_2, aes(x = edu_lvl, fill = hivtst6)) + geom_bar(position=position_dodge()) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + facet_grid(.~sex) + coord_flip() + scale_fill_manual(values=c("#003300", "#CC0000"))By looking at it from a higher/lower education angle, the indication that higher educated females get tested more than higher educated males is reaffirmed from both the table and the chart shown above.
To sum up, the data in the “brfss2013”" dataset suggest that higher educated people appear to get tested for HIV more than lower educated ones .It also seems that higher educated females get tested more than higher educated males. As this only an exploratory data analysis (whose results can be generalised to the general public but are not causal) we cannot draw definite conclusions, we can only get these indications.
Research quesion 3:
brfss2013 %>%
select(renthom1, X_rfhlth, sex) %>%
str()## 'data.frame': 479289 obs. of 3 variables:
## $ renthom1: Factor w/ 3 levels "Own","Rent","Other arrangement": 1 1 1 1 1 1 1 2 1 1 ...
## $ X_rfhlth: Factor w/ 2 levels "Good or Better Health",..: 2 1 1 1 1 1 2 1 1 1 ...
## $ sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
Since our question is about people owning a home (variable named “renthom1”), we will exclude the observasions named in the BRFSS Codebook as “Refused”, “Don’t know/Not sure” and “Missing” from our analysis. Likewise, since our question is about adults with fair or poor health (variable named “X_rfhlth”), we will exclude the observasions named in the BRFSS Codebook as “Don’t know/Not Sure Or Refused/Missing” from our analysis. We will also present the distributions of the above mentioned variables.
brfss2013 %>%
filter(!is.na(renthom1)) %>%
group_by(renthom1) %>%
summarise(count = n()) %>%
mutate(r.freq = count/sum(count))## # A tibble: 3 x 3
## renthom1 count r.freq
## <fct> <int> <dbl>
## 1 Own 344831 0.729
## 2 Rent 107946 0.228
## 3 Other arrangement 19940 0.0422
brfss2013 %>%
filter(!is.na(X_rfhlth)) %>%
group_by(X_rfhlth) %>%
summarise(count = n()) %>%
mutate(r.freq = count/sum(count))## # A tibble: 2 x 3
## X_rfhlth count r.freq
## <fct> <int> <dbl>
## 1 Good or Better Health 387383 0.808
## 2 Fair or Poor Health 91903 0.192
brfss2013_3_na <- brfss2013 %>%
filter(!is.na(X_rfhlth), !is.na(renthom1)) %>%
select(X_rfhlth, renthom1, sex)
ggplot(data = brfss2013_3_na, aes(x = X_rfhlth)) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1))brfss2013_3_na <- brfss2013 %>%
filter(!is.na(X_rfhlth), !is.na(renthom1)) %>%
select(X_rfhlth, renthom1)
ggplot(data = brfss2013_3_na, aes(x = renthom1)) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1))As we can see both from the tables and the charts: in the “X_rfhlth” distribution more people are in “Good or Better Health” than those in “Fair or Poor Health”, in the “renthom1” distribution more people “own” their home than then ones who “rent” (who are more than those with “other arrangments”).
Having done that, we can now compare the percentages of people between who own their home (variable called “o_per”) and those that don’t (variables called “r_per” and “other_per”) across people’s health conditions. A segmented bar plot will help us visualise the distribution of accomodation circumstances across people’s health conditions.
brfss2013 %>%
filter(!is.na(renthom1), !is.na(X_rfhlth)) %>%
group_by(renthom1, X_rfhlth) %>%
summarise(count=n())## # A tibble: 6 x 3
## # Groups: renthom1 [3]
## renthom1 X_rfhlth count
## <fct> <fct> <int>
## 1 Own Good or Better Health 287760
## 2 Own Fair or Poor Health 57069
## 3 Rent Good or Better Health 79695
## 4 Rent Fair or Poor Health 28251
## 5 Other arrangement Good or Better Health 14632
## 6 Other arrangement Fair or Poor Health 5308
brfss2013 %>%
filter(!is.na(renthom1), !is.na(X_rfhlth)) %>%
group_by(X_rfhlth) %>%
summarise(count = n(), o_per = sum(renthom1 == "Own") / n() *100, r_per = sum(renthom1 == "Rent") / n() *100, other_per = sum(renthom1 == "Other arrangement") / n() *100)## # A tibble: 2 x 5
## X_rfhlth count o_per r_per other_per
## <fct> <int> <dbl> <dbl> <dbl>
## 1 Good or Better Health 382087 75.3 20.9 3.83
## 2 Fair or Poor Health 90628 63.0 31.2 5.86
brfss2013_3_na <- brfss2013 %>%
filter(!is.na(renthom1), !is.na(X_rfhlth)) %>%
select(renthom1, X_rfhlth)
ggplot(data = brfss2013_3_na, aes(x = X_rfhlth, fill = renthom1)) + geom_bar(position=position_dodge()) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + scale_fill_manual(values=c("#A224D8", "#136FCA", "#E9B524")) As we can notice both from the data table and from the chart, the data in the “brfss2013”" dataset suggest that more people who are in “Good or Better Health” appear to own their home (75.14%) than those in “Fair or Poor Health” (62.88%). In contrast, when it comes to renting more people who are in “Fair or Poor Health” appear to rent their home (31.21%) than those in “Good or Better Health” (20.96%).
To sum up, the data in the “brfss2013”" dataset suggest that healthier people seem to own their home. As this only an exploratory data analysis (whose results can be generalised to the general public but are not causal) we cannot draw definite conclusions, we can only get these indications.