library(ggplot2)
library(dplyr)
load("brfss2013.RData")
The objective of the Behavioral Risk Factor Surveillance System (BRFSS) is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases affecting the adult population. Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days (health-related quality of life), health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruit and vegetable consumption, arthritis burden, and seatbelt use.
Since 2011, the BRFSS has conducted both landline and cellular telephone surveys. In the landline survey, interviewers collect data from a randomly selected adult in a household. In the cellular version of the questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.
Generalization: The survey uses the Random Digit Dialing (RDD) technique for both landline and cellular interviews, so respondents are selected randomly from the population. Hence the inferences made from the study can be generalized to the adult population that the survey covers.
Causation: This is an observational study with no random assignment. We can therefore infer only associations, not causation.
Bias: Because responses are collected by telephone, respondents may be more comfortable answering personal questions than they would be in a face-to-face interview. However, the interview is not completely anonymous, so respondents may still hesitate to answer personal questions. A respondent may also be in a public place where they are not comfortable answering such questions.
Research question 1:
Is there a relation between physical activity and self-reported general health? My hypothesis is that more physical activity leads to better general health, and I want to see how different levels of activity are related to general health.
Research question 2:
Is there a relation between smoking and self-reported general health? I want to explore how current, former, and non-smokers rate their own health.
Research question 3:
Are high blood pressure and high cholesterol related to BMI? Being overweight is unhealthy in general, but is it related to problems such as high blood pressure and high cholesterol? The result can provide motivation for people to lose weight.
Research question 1:
To address this question we need to find the relation between self-reported general health and physical activity. Searching the BRFSS codebook, we find genhlth, which lists self-reported general health in categories from Excellent to Poor. Physical activity levels are listed in the X_pacat1 variable, ranging from Highly active to Inactive. We will also check whether sex plays a role in this relation.
exercise <- brfss2013 %>% select(genhlth, X_pacat1, sex) %>% na.omit()
Checking the total counts classified by the three factors.
table(exercise)
## , , sex = Male
##
##            X_pacat1
## genhlth     Highly active Active Insufficiently active Inactive
##   Excellent         13164   5868                  5324     5427
##   Very good         21742  10693                 11075    11981
##   Good              17318   8744                 10044    16935
##   Fair               5702   2815                  3631     9705
##   Poor               1523    846                  1276     5390
##
## , , sex = Female
##
##            X_pacat1
## genhlth     Highly active Active Insufficiently active Inactive
##   Excellent         18019   9013                  7433     7178
##   Very good         28008  16537                 16542    18626
##   Good              20159  11793                 14420    27100
##   Fair               6856   3942                  5855    17949
##   Poor               1897   1064                  1959     9786
We can see that highly active people are generally healthy: most of them report good or better health. Inactive people include a higher proportion of respondents in fair or poor health. The trends are similar for both sexes.
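As a rough check on this reading, the counts can be turned into within-group proportions. The sketch below re-enters the male counts from the table above by hand rather than loading the data; the object names male_counts, col_share, and fair_or_poor are my own and are not part of the original analysis.

```r
# Male counts copied by hand from the table above
# (rows: genhlth, columns: X_pacat1)
male_counts <- matrix(
  c(13164,  5868,  5324,  5427,
    21742, 10693, 11075, 11981,
    17318,  8744, 10044, 16935,
     5702,  2815,  3631,  9705,
     1523,   846,  1276,  5390),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    genhlth = c("Excellent", "Very good", "Good", "Fair", "Poor"),
    X_pacat1 = c("Highly active", "Active", "Insufficiently active", "Inactive")
  )
)

# margin = 2 divides each column by its total, giving the distribution of
# general health within each activity level
col_share <- prop.table(male_counts, margin = 2)
round(col_share, 3)

# Combined share of "Fair" and "Poor" within each activity level
fair_or_poor <- colSums(col_share[c("Fair", "Poor"), ])
round(fair_or_poor, 3)
```

With these counts, roughly 12% of highly active men report fair or poor health, compared with roughly 31% of inactive men.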
Now let's draw a mosaic plot to get a clearer picture of the distribution.
# Creating a function for making mosaic plots: bar widths show the marginal
# share of each var1 level, stacked heights show the share of var2 within it
ggMMplot <- function(var1, var2) {
  levVar1 <- length(levels(var1))
  jointTable <- prop.table(table(var1, var2))
  plotData <- as.data.frame(jointTable)
  # Marginal proportions of var1, recycled across the levels of var2
  plotData$marginVar1 <- as.numeric(prop.table(table(var1)))
  # Conditional proportion of var2 within each level of var1
  plotData$var2Height <- plotData$Freq / plotData$marginVar1
  # Horizontal centre of each bar: its left edge plus half its width
  plotData$var1Center <- c(0, cumsum(plotData$marginVar1)[seq_len(levVar1 - 1)]) +
    plotData$marginVar1 / 2
  ggplot(plotData, aes(var1Center, var2Height)) +
    geom_bar(stat = "identity", aes(width = marginVar1, fill = var2), col = "Black") +
    geom_text(aes(label = as.character(var1), x = var1Center, y = 1.05))
}
exercise$genhlth <- factor(exercise$genhlth, levels = rev(levels(exercise$genhlth)))
exercise$X_pacat1 <- factor(exercise$X_pacat1, levels = rev(levels(exercise$X_pacat1)))
ex_male <- exercise %>% filter(sex == "Male")
p1 <- ggMMplot(ex_male$X_pacat1, ex_male$genhlth)
p1 + labs(x="Physical Activity (Males)", y="General Health", fill = "")
ex_fem <- exercise %>% filter(sex == "Female")
p2 <- ggMMplot(ex_fem$X_pacat1, ex_fem$genhlth)
p2 + labs(x="Physical Activity (Females)", y="General Health", fill = "")
The plots make it clear that the proportion of people in better health is higher among physically active people, and the proportion of people in poor health is higher among inactive people. These plots show a positive association between physical activity and self-reported general health.
Research question 2:
To address this question we need to find the relation between self-reported general health and smoker type. Searching the BRFSS codebook, we find genhlth, which lists self-reported general health in categories from Excellent to Poor. Smoker types are listed in the X_smoker3 variable, ranging from daily smokers to non-smokers. We will also check whether sex plays a role in this relation.
smoke <- brfss2013 %>% select(genhlth, X_smoker3, sex) %>% na.omit()
levels(smoke$X_smoker3) <- c("Daily", "Somedays", "Former", "Non_smoker")
Checking the total counts classified by the three factors.
table(smoke)
## , , sex = Male
##
##            X_smoker3
## genhlth     Daily Somedays Former Non_smoker
##   Excellent  2526     1312   9266      21412
##   Very good  6266     2761  20256      33943
##   Good       9152     3214  21733      26827
##   Fair       4734     1540  10100       8610
##   Poor       2296      784   4408       2835
##
## , , sex = Female
##
##            X_smoker3
## genhlth     Daily Somedays Former Non_smoker
##   Excellent  2719     1364  11114      32980
##   Very good  7807     3327  23118      57213
##   Good      10309     3525  21685      49167
##   Fair       6113     2247  10618      20585
##   Poor       3004     1321   5258       7128
We can see that non-smokers are generally healthy: most of them report good or better health. Daily smokers include a higher proportion of respondents in fair or poor health. The trends are similar for both sexes.
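To compare the two sexes side by side, the counts can be reduced to a single number per smoker type. This is only a sketch: the counts are re-entered by hand from the tables above, and the helper good_or_better() and the other object names are hypothetical, not part of the original analysis.

```r
# Counts copied by hand from the tables above
# (rows: genhlth, columns: X_smoker3)
health <- c("Excellent", "Very good", "Good", "Fair", "Poor")
smoker <- c("Daily", "Somedays", "Former", "Non_smoker")

male_counts <- matrix(
  c( 2526, 1312,  9266, 21412,
     6266, 2761, 20256, 33943,
     9152, 3214, 21733, 26827,
     4734, 1540, 10100,  8610,
     2296,  784,  4408,  2835),
  nrow = 5, byrow = TRUE, dimnames = list(health, smoker)
)
female_counts <- matrix(
  c( 2719, 1364, 11114, 32980,
     7807, 3327, 23118, 57213,
    10309, 3525, 21685, 49167,
     6113, 2247, 10618, 20585,
     3004, 1321,  5258,  7128),
  nrow = 5, byrow = TRUE, dimnames = list(health, smoker)
)

# Hypothetical helper: share of each smoker type reporting "Good" or better
good_or_better <- function(counts) {
  colSums(counts[c("Excellent", "Very good", "Good"), ]) / colSums(counts)
}

# One row per sex, one column per smoker type
round(rbind(Male = good_or_better(male_counts),
            Female = good_or_better(female_counts)), 3)
```

On these counts the share reporting good or better health is higher for non-smokers than for daily smokers in both sexes (about 88% versus 72% for men, and about 83% versus 70% for women).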
Now let's draw a mosaic plot to get a clearer picture of the distribution.
smoke$genhlth <- factor(smoke$genhlth, levels = rev(levels(smoke$genhlth)))
#smoke$X_smoker3 <- factor(smoke$X_smoker3, levels = rev(levels(smoke$X_smoker3)))
ex_male <- smoke %>% filter(sex == "Male")
p1 <- ggMMplot(ex_male$X_smoker3, ex_male$genhlth)
p1 + labs(x="Smoker Type (Males)", y="General Health", fill = "")
ex_fem <- smoke %>% filter(sex == "Female")
p2 <- ggMMplot(ex_fem$X_smoker3, ex_fem$genhlth)
p2 + labs(x="Smoker Type (Females)", y="General Health", fill = "")
The plots make it clear that the proportion of people in better health is higher among non-smokers, and the proportion of people in poor health is higher among daily smokers. These plots show a negative association between smoking and self-reported general health.
Research question 3:
In this analysis we check whether high blood pressure and high cholesterol are associated with weight. Searching the codebook, we find the X_rfhype5 and X_rfchol variables, which indicate whether a person has high blood pressure and high cholesterol, respectively. The X_bmi5cat variable categorizes a person's weight, ranging from underweight to obese.
bmi <- brfss2013 %>% select(X_rfhype5, X_rfchol, X_bmi5cat) %>% na.omit()
Let's look at the counts of people with and without high blood pressure in each BMI category.
table(bmi$X_rfhype5, bmi$X_bmi5cat)
##
##       Underweight Normal weight Overweight Obese
##   No         4288         87327      80488 49156
##   Yes        1899         39276      65001 70736
Many obese people have high blood pressure. The proportion of obese and overweight people with high blood pressure is much higher than that of underweight and normal-weight people. Similarly, we check the counts of people with high cholesterol by weight group.
table(bmi$X_rfchol, bmi$X_bmi5cat)
##
##       Underweight Normal weight Overweight Obese
##   No         4302         81895      78358 58618
##   Yes        1885         44708      67131 61274
Among obese people, the numbers with high and normal cholesterol are nearly equal. However, the proportion of obese and overweight people with high cholesterol is much higher than that of underweight and normal-weight people.
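The proportions behind these statements can be worked out directly from the two tables. The sketch below re-enters the counts by hand instead of using the bmi data frame; the vector and matrix names are my own and are used only for illustration.

```r
bmi_cat <- c("Underweight", "Normal weight", "Overweight", "Obese")

# Counts copied by hand from the two tables above
bp_yes   <- c(1899, 39276, 65001, 70736)
bp_no    <- c(4288, 87327, 80488, 49156)
chol_yes <- c(1885, 44708, 67131, 61274)
chol_no  <- c(4302, 81895, 78358, 58618)

# Share of respondents with each condition within every BMI category
rates <- rbind(
  high_bp = bp_yes / (bp_yes + bp_no),
  high_cholesterol = chol_yes / (chol_yes + chol_no)
)
colnames(rates) <- bmi_cat
round(rates, 3)
```

On these counts the share with high blood pressure is about 31% for normal-weight respondents and about 59% for obese respondents; for high cholesterol the corresponding shares are about 35% and 51%.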
Let's plot the data now to get a clearer picture of the relation.
ggplot(bmi, aes(X_bmi5cat, fill = X_rfhype5)) + geom_bar() + labs(x = "BMI Category", y = "High Blood Pressure", fill = "")
ggplot(bmi, aes(X_bmi5cat, fill = X_rfchol)) + geom_bar() + labs(x = "BMI Category", y = "High Cholesterol", fill = "")
The above plots show the blue share of each bar increasing as weight increases, i.e., people in heavier BMI categories tend to have a higher proportion of high blood pressure and high cholesterol. Thus we find a positive association between weight and both high cholesterol and high blood pressure.
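Because geom_bar() stacks raw counts, the bar heights also reflect how many respondents fall in each BMI category. One possible variant, sketched here, rescales every bar to height 1 with position = "fill" so that the coloured share can be read directly as a proportion. The data frame bp_counts and its column names are hypothetical; the counts are re-entered by hand from the high blood pressure table rather than taken from brfss2013.

```r
library(ggplot2)

# High blood pressure counts by BMI category, copied by hand from the table
# in the text (hypothetical data frame, not the original bmi object)
bp_counts <- data.frame(
  bmi_cat = factor(
    rep(c("Underweight", "Normal weight", "Overweight", "Obese"), times = 2),
    levels = c("Underweight", "Normal weight", "Overweight", "Obese")
  ),
  high_bp = rep(c("No", "Yes"), each = 4),
  n = c(4288, 87327, 80488, 49156, 1899, 39276, 65001, 70736)
)

# position = "fill" rescales each bar to height 1, so the fill colours show
# within-category proportions instead of raw counts
p <- ggplot(bp_counts, aes(x = bmi_cat, y = n, fill = high_bp)) +
  geom_col(position = "fill") +
  labs(x = "BMI Category", y = "Proportion", fill = "High blood pressure")
p
```

The same approach should work on the full data by adding position = "fill" to the geom_bar() calls above.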