library(ggplot2)
library(dplyr)
BRFSS collects state data about U.S. residents regarding their health-related risk behaviors and events, chronic health conditions, and use of preventive services. BRFSS also collects data on important emerging health issues such as vaccine shortage and influenza-like illness.
Dataset notes:
load("brfss2013.RData")
data = brfss2013
names = names(data)
# names[!grepl("(^X_)|(_$)", names)]
Describe how the observations in the sample are collected, and the implications of this data collection method on generalizability.
With technical and methodological assistance from CDC, state health departments use in-house interviewers or contract with telephone call centers or universities to administer the BRFSS surveys continuously through the year. The states use a standardized core questionnaire, optional modules, and state-added questions. The survey is conducted using Random Digit Dialing (RDD) techniques on both landlines and cell phones. Because respondents are selected at random, the results can reasonably be generalized to U.S. residents who can be reached by telephone, although people without a phone and people who decline to answer are not represented.
Describe how the observations in the sample are collected, and the implications of this data collection method on causality.
Random sampling was used in conducting the telephone survey, and Random Digit Dialing reduces selection bias in the sample. Random sampling is not the same as random assignment, however. BRFSS is an observational survey: respondents are not assigned to treatment and control groups, so variables known or suspected to affect the outcome are not controlled. Causal inference is a difficult task in data analysis and generally requires both randomization and control. Without control, this data collection method supports conclusions about association between variables, not about causation.
Research question 1: Are people who exercise healthier than those who don’t?
We select the variable genhlth as an indicator of health, and the variable exerany2, which represents whether someone exercised in the past 30 days.
# We select the variable genhlth as indicator of health
dat1 = select(data, genhlth, exerany2) %>%
filter(!is.na(exerany2))
ggplot(dat1, aes(x = exerany2, fill = genhlth)) + geom_bar(position = "fill") + ylab("Percentages of general health") + xlab("Exercise or not")
From the picture, it is clear that people who exercised in the past 30 days report better general health than those who didn’t.
Many other variables are closely related to health, so this topic can be explored further. For now, we turn to a general description of the population.
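The proportions that the filled bar chart displays can also be tabulated directly. The sketch below uses base R on a small made-up data frame (`toy` and all of its values are illustrative assumptions, not BRFSS data) and computes, within each exercise group, the share of each general health rating.

```r
# Toy stand-in for the real data; the values are invented for illustration.
toy <- data.frame(
  exerany2 = factor(c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No")),
  genhlth  = factor(c("Excellent", "Good", "Good", "Fair",
                      "Good", "Fair", "Poor", "Poor"),
                    levels = c("Excellent", "Good", "Fair", "Poor"))
)

# Cross-tabulate, then convert each row (exercise group) to proportions.
counts <- table(toy$exerany2, toy$genhlth)
health_share <- prop.table(counts, margin = 1)
print(round(health_share, 2))
```

Using `margin = 1` normalises within each exercise group, which is the same comparison that `position = "fill"` makes in the plot.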
Research question 2: What is the health condition of the whole population?
Several variables describe the health condition of the whole population.
# select a data set with the variables related to health.
hlthdata = select(data, grep("hlth", names)) %>%
select(genhlth, physhlth, menthlth, poorhlth, qlhlth2)
# we take the variables genhlth, physhlth, menthlth, poorhlth, qlhlth2 as measures of the health condition.
summary(hlthdata)
##       genhlth          physhlth         menthlth           poorhlth
##  Excellent: 85482   Min.   : 0.000   Min.   :   0.000   Min.   :   0.0
##  Very good:159076   1st Qu.: 0.000   1st Qu.:   0.000   1st Qu.:   0.0
##  Good     :150555   Median : 0.000   Median :   0.000   Median :   0.0
##  Fair     : 66726   Mean   : 4.353   Mean   :   3.383   Mean   :   5.3
##  Poor     : 27951   3rd Qu.: 3.000   3rd Qu.:   2.000   3rd Qu.:   5.0
##  NA's     :  1985   Max.   :60.000   Max.   :5000.000   Max.   :7000.0
##                     NA's   :10957    NA's   :8627       NA's   :243153
##     qlhlth2
##  Min.   :  0.0
##  1st Qu.:  2.0
##  Median : 15.0
##  Mean   : 15.9
##  3rd Qu.: 28.0
##  Max.   :243.0
##  NA's   :491310
As the summary shows, most respondents rate their General Health as Very good or Good.
ggplot(hlthdata, aes(x = genhlth)) + geom_bar(position = "stack", fill = "green") + ylab("number of person") + xlab("general health")
Because of outliers and missing values, we use the third quartile to describe these variables. The third quartile of Number Of Days Physical Health Not Good is 3 days. The third quartile of Number Of Days Mental Health Not Good is 2 days. The third quartile of days of Poor Physical Or Mental Health is 5 days. The mean number of Days Full Of Energy In Past 30 Days is 15.9.
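To illustrate why the mean is a poor summary for these variables, here is a minimal base R sketch on invented values (the vector `days` is an assumption made for illustration, not BRFSS data). It treats counts above 30 as missing, since a "days in the past 30" answer cannot exceed 30, and compares the raw mean with the quartiles before and after cleaning.

```r
# Invented values mimicking a "days in the past 30" variable with bad entries.
days <- c(0, 0, 3, 30, 60, 5000, NA)

# Values above 30 are impossible by definition, so treat them as missing.
days_clean <- ifelse(days > 30, NA, days)

raw_mean <- mean(days, na.rm = TRUE)                  # pulled up by 60 and 5000
raw_q3   <- quantile(days, 0.75, na.rm = TRUE)        # still affected by bad values
clean_q3 <- quantile(days_clean, 0.75, na.rm = TRUE)  # uses plausible values only

print(c(raw_mean = raw_mean,
        raw_q3   = unname(raw_q3),
        clean_q3 = unname(clean_q3)))
```

Quartiles are less sensitive to a few extreme entries than the mean, but as the toy example suggests, recoding impossible values before summarising is the safer option.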
Research question 3: What is the situation of health care?
Health care matters to all of us, so we explore health care coverage and related items.
# select some variables related to health care
healthcaredata = select(data, medicare,hlthcvrg, drvisits,carercvd)
summary(healthcaredata)
##  medicare        hlthcvrg         drvisits
##  Yes :134598          :175883   Min.   : 0.00
##  No  :178386   01     : 69541   1st Qu.: 1.00
##  NA's:178791   03     : 31973   Median : 3.00
##                1      : 31230   Mean   : 5.24
##                0      : 28736   3rd Qu.: 6.00
##                02     : 27340   Max.   :76.00
##                (Other):127072   NA's   :154152
##                  carercvd
##  Very satisfied      :227880
##  Somewhat satisfied  : 98644
##  Not at all satisfied: 12035
##  NA's                :153216
##
##
##
From the table we could see that:
Plots that explain the relationships between certain variables are presented below.
Research question 1: The relationship between exercise and health
We select some related variables: General Health, Number Of Days Physical Health Not Good, Number Of Days Mental Health Not Good, and Exercise In Past 30 Days.
# Remove the outliers and NAs from the data. By definition, physhlth and menthlth cannot exceed 30 days, so larger values are outliers.
dat1 = select(data, genhlth, exerany2, physhlth, menthlth) %>%
filter(physhlth <= 30 & menthlth <= 30 & !is.na(exerany2) & !is.na(genhlth))
group_by(dat1, exerany2) %>%
summarise(phys_mean = mean(physhlth), ment_mean = mean(menthlth))
## Source: local data frame [2 x 3]
##
##   exerany2 phys_mean ment_mean
##     (fctr)     (dbl)     (dbl)
## 1      Yes  3.183990  2.825518
## 2       No  7.295681  4.768849
From the table we can see that the group who exercise has fewer days of poor physical or mental health on average than the group who don’t exercise.
ggplot(dat1, aes(x= exerany2, y = physhlth, fill = exerany2)) + geom_boxplot() + ylab(" Number Of Days Physical Health Not Good") + xlab("Exercise In Past 30 Days or not")
The mean of physhlth is 3.18 for people who exercised and 7.30 for those who did not. We can see from the picture that people who exercised in the past 30 days have a markedly smaller Number Of Days Physical Health Not Good.
ggplot(dat1, aes(x= exerany2, y = menthlth, fill = exerany2)) + geom_boxplot() + ylab(" Number Of Days Mental Health Not Good") + xlab("Exercise In Past 30 Days or not")
We can see from the picture that people who exercised in the past 30 days also have a markedly smaller Number Of Days Mental Health Not Good.
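The boxplots suggest a difference between the two groups, but a plot alone does not establish statistical significance. One possible check is a rank-based two-sample test. The sketch below runs `wilcox.test` on simulated counts (the data frame `sim` and its Poisson parameters are assumptions chosen only for illustration, not BRFSS values).

```r
set.seed(1)  # make the simulated example reproducible

# Simulated "bad health days" for two groups; NOT the BRFSS values.
sim <- data.frame(
  exerany2 = rep(c("Yes", "No"), each = 200),
  physhlth = pmin(c(rpois(200, lambda = 3), rpois(200, lambda = 7)), 30)
)

# Rank-based two-sample test; it does not assume normally distributed counts.
res <- wilcox.test(physhlth ~ exerany2, data = sim)
print(res$p.value)
```

With the real data, the same call on `dat1` would test whether the distribution of `physhlth` differs between the exercise groups. As with the rest of this analysis, a small p-value would indicate association, not causation.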
Research question 2: Tobacco use in the population
There are several variables related to tobacco use.
We use only the two computed variables X_smoker3 and X_rfsmok3.
tobacco_data = select(data, X_smoker3, X_rfsmok3) %>%
filter(!is.na(X_smoker3) & !is.na(X_rfsmok3))
summary(tobacco_data)
##                                  X_smoker3      X_rfsmok3
##  Current smoker - now smokes every day: 55161   No :399785
##  Current smoker - now smokes some days: 21493   Yes: 76654
##  Former smoker                        :138134
##  Never smoked                         :261651
As we can see, 76654 of the 476439 respondents with non-missing answers are current smokers while 399785 are not. Non-smokers are about 5.2 times as numerous as smokers. Let’s see some plots of tobacco use.
ggplot(tobacco_data, aes(x = factor(1), fill = X_smoker3))+ geom_bar() + coord_polar(theta = "y") + xlab("") + ylab("")
From the plot, we can see that most people are not smokers: more than half never smoked, and many are former smokers who have given up smoking.
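The shares behind the pie chart can be worked out from the counts in the `summary()` output above. This is a small arithmetic sketch in base R; the variable names (`smoker_counts`, `shares`, `ratio`) are illustrative choices, and the counts are copied from that output.

```r
# Counts copied from the summary() output above.
smoker_counts <- c(
  "Current smoker - now smokes every day" = 55161,
  "Current smoker - now smokes some days" = 21493,
  "Former smoker"                         = 138134,
  "Never smoked"                          = 261651
)

total   <- sum(smoker_counts)            # respondents with non-missing status
shares  <- smoker_counts / total         # share of each category
current <- sum(smoker_counts[1:2])       # every-day plus some-days smokers
ratio   <- (total - current) / current   # non-smokers per current smoker

print(round(shares, 3))
print(round(ratio, 2))
```

The two current-smoker categories add up to the `Yes` count of X_rfsmok3, which confirms that the two computed variables are consistent with each other.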
Research question 3: Tobacco use and alcohol consumption
Tobacco use and alcohol consumption are related behaviors, so we study the relationship between them. We select the variables X_smoker3, X_rfsmok3, and X_rfdrhv4 to study it.
tobacco_alcohol_data = select(data, X_smoker3, X_rfsmok3, X_rfdrhv4) %>%
filter(!is.na(X_smoker3) & !is.na(X_rfdrhv4) & !is.na(X_rfsmok3))
table(select(tobacco_alcohol_data, X_rfsmok3, X_rfdrhv4))
##          X_rfdrhv4
## X_rfsmok3     No    Yes
##       No  374377  17269
##       Yes  66000   8144
It seems that 8144 of the 17269 + 8144 = 25413 heavy drinkers (about 32%) are smokers, which is far more than the proportion \(\frac{66000+8144}{374377+17269+66000+8144} \approx 0.16\) of smokers in the population. Let’s see some plots.
ggplot(tobacco_alcohol_data, aes(x = X_rfdrhv4, fill = X_rfsmok3)) + geom_bar(position = "fill") + xlab("Heavy Alcohol Consumption") + ylab("Current smokers or not")
This picture supports the above conclusion: the proportion of smokers among heavy alcohol consumers is much higher than among those who aren’t heavy alcohol consumers.
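The comparison can be made explicit from the 2 x 2 counts printed by `table()` above. The sketch below rebuilds that table as a matrix (the object names `tab`, `smoker_share`, `overall_share`, and `odds_ratio` are illustrative) and computes the share of smokers in each drinking group, together with an odds ratio as one possible summary of the strength of the association.

```r
# 2 x 2 counts copied from the table() output above (filled column by column).
tab <- matrix(c(374377, 66000, 17269, 8144), nrow = 2,
              dimnames = list(X_rfsmok3 = c("No", "Yes"),
                              X_rfdrhv4 = c("No", "Yes")))

# Share of smokers within each drinking group (each column sums to 1).
smoker_share <- prop.table(tab, margin = 2)["Yes", ]

# Share of smokers among all respondents in the table.
overall_share <- sum(tab["Yes", ]) / sum(tab)

# Odds of smoking for heavy drinkers relative to everyone else.
odds_ratio <- (tab["Yes", "Yes"] * tab["No", "No"]) /
              (tab["Yes", "No"]  * tab["No", "Yes"])

print(round(smoker_share, 3))
print(round(overall_share, 3))
print(round(odds_ratio, 2))
```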
ggplot(tobacco_alcohol_data, aes(x = X_rfdrhv4, fill = X_smoker3)) + geom_bar(position = "fill") + xlab("Heavy Alcohol Consumption") + ylab("proportion")
This plot also shows the relationship between tobacco use and alcohol consumption. Let’s run a test to see whether they are independent variables.
chisq.test(x = data$X_rfdrhv4, y = data$X_rfsmok3)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data$X_rfdrhv4 and data$X_rfsmok3
## X-squared = 5223, df = 1, p-value < 2.2e-16
This test, with its very small p-value, shows that tobacco use and alcohol consumption are not independent variables.
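The same test can be run on the 2 x 2 counts alone, which also makes it easy to inspect the counts expected under independence. This is a sketch that assumes the filtered table shown earlier covers the same respondents as the raw columns, so the statistic should land close to the value reported above; `tab` and `res` are illustrative names.

```r
# 2 x 2 counts copied from the table() output earlier in this section.
tab <- matrix(c(374377, 66000, 17269, 8144), nrow = 2,
              dimnames = list(X_rfsmok3 = c("No", "Yes"),
                              X_rfdrhv4 = c("No", "Yes")))

# For a 2 x 2 matrix, chisq.test() applies Yates' continuity correction by default.
res <- chisq.test(tab)

print(res$statistic)         # chi-squared statistic
print(round(res$expected))   # counts expected if the variables were independent
```

Comparing `res$expected` with `tab` shows where the dependence comes from: far more heavy drinkers are smokers than independence would predict. With a sample this large, even a weak association would give a tiny p-value, so the proportions above remain the better guide to the size of the effect.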