Setup

Load packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.RData")

Part 1: Data

The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days - health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use. Since 2011, BRFSS conducts both landline telephone- and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.


Part 2: Research questions

Research question 1:

Is there a relation between physical activity and self-reported general health? My hypothesis is that more physical activity leads to better general health, and I want to see how different levels of activity relate to general health.

Research question 2:

Is there a relation between smoking and self-reported general health? I want to explore how current, former, and non-smokers rate their own health.

Research question 3:

Are high blood pressure and high cholesterol related to BMI? Being overweight is generally considered unhealthy, but is it specifically related to problems such as high blood pressure and high cholesterol? The result could provide motivation for people to lose weight.


Part 3: Exploratory data analysis

Research question 1:

To address this question, we need to find the relation between self-reported general health and physical activity. Searching the BRFSS codebook, we find genhlth, which records self-reported general health in categories from Excellent to Poor. Physical activity levels are recorded in the X_pacat1 variable, ranging from Inactive to Highly active. We'll also check whether sex plays a role in this relation.

exercise <- brfss2013 %>% select(genhlth, X_pacat1, sex) %>% na.omit()

Checking the total counts classified by the three factors.

table(exercise)
## , , sex = Male
## 
##            X_pacat1
## genhlth     Highly active Active Insufficiently active Inactive
##   Excellent         13164   5868                  5324     5427
##   Very good         21742  10693                 11075    11981
##   Good              17318   8744                 10044    16935
##   Fair               5702   2815                  3631     9705
##   Poor               1523    846                  1276     5390
## 
## , , sex = Female
## 
##            X_pacat1
## genhlth     Highly active Active Insufficiently active Inactive
##   Excellent         18019   9013                  7433     7178
##   Very good         28008  16537                 16542    18626
##   Good              20159  11793                 14420    27100
##   Fair               6856   3942                  5855    17949
##   Poor               1897   1064                  1959     9786

We can see that highly active people are generally healthy, with most observations reporting good or better health, while inactive people include a higher proportion reporting fair or poor health. The trends are similar for both sexes.
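Raw counts are hard to compare because the activity groups differ in size. As a minimal sketch, the male counts printed above are re-entered by hand here (so the snippet runs without the survey file) and each activity column is converted to within-group percentages with prop.table. The object names `male_counts`, `male_pct`, and `fair_poor_pct` are illustrative only.

```r
# Male counts copied from the table(exercise) output above
male_counts <- matrix(
  c(13164,  5868,  5324,  5427,
    21742, 10693, 11075, 11981,
    17318,  8744, 10044, 16935,
     5702,  2815,  3631,  9705,
     1523,   846,  1276,  5390),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    genhlth  = c("Excellent", "Very good", "Good", "Fair", "Poor"),
    X_pacat1 = c("Highly active", "Active", "Insufficiently active", "Inactive")
  )
)

# margin = 2 makes each column (activity level) sum to 1
male_pct <- round(100 * prop.table(male_counts, margin = 2), 1)
male_pct

# Percentage of each activity group reporting fair or poor health
fair_poor_pct <- colSums(male_pct[c("Fair", "Poor"), ])
fair_poor_pct
```

With the full data frame, the same idea should apply directly to table(exercise$genhlth, exercise$X_pacat1).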

Now let's draw a mosaic plot to get a clearer picture of the distribution.

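# Helper that draws a mosaic-style plot: one bar per level of var1,
# bar width = share of var1, segments = share of var2 within that level

ggMMplot <- function(var1, var2){
    levVar1 <- length(levels(var1))

    # Joint proportions of every (var1, var2) combination
    jointTable <- prop.table(table(var1, var2))
    plotData <- as.data.frame(jointTable)
    # Marginal share of each var1 level (recycled across the var2 levels)
    plotData$marginVar1 <- as.numeric(prop.table(table(var1)))
    # Conditional share of var2 given var1: height of each segment
    plotData$var2Height <- plotData$Freq / plotData$marginVar1
    # Bar center: total width of the bars to the left plus half the bar's own width
    plotData$var1Center <- c(0, cumsum(plotData$marginVar1)[seq_len(levVar1 - 1)]) +
        plotData$marginVar1 / 2

    g <- ggplot(plotData, aes(var1Center, var2Height)) +
        geom_bar(stat = "identity", aes(width = marginVar1, fill = var2), col = "Black") +
        geom_text(aes(label = as.character(var1), x = var1Center, y = 1.05))
    g
}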
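To make the geometry concrete, here is a small sketch on made-up data that reproduces the three quantities the function computes: bar widths, bar centers, and segment heights. The vectors `activity` and `health` are hypothetical and are not BRFSS variables.

```r
# Hypothetical toy data, for illustration only
activity <- factor(
  c("Low", "Low", "Low", "Low", "Low", "Low", "High", "High", "High", "High"),
  levels = c("Low", "High")
)
health <- factor(
  c("Good", "Good", "Poor", "Poor", "Poor", "Poor", "Good", "Good", "Good", "Poor"),
  levels = c("Good", "Poor")
)

# Bar widths: marginal share of each activity level
widths <- as.numeric(prop.table(table(activity)))

# Bar centers: total width of the bars to the left plus half the bar's own width
centers <- c(0, cumsum(widths)[seq_len(length(widths) - 1)]) + widths / 2

# Segment heights: share of each health level within an activity level
# (margin = 2 makes every column sum to 1)
heights <- prop.table(table(health, activity), margin = 2)

widths
centers
heights
```

Because every column of `heights` sums to 1, all bars reach the same total height and only their widths reflect group size.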

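# Reverse the level order so the plot runs from Poor to Excellent
exercise$genhlth <- factor(exercise$genhlth, levels = rev(levels(exercise$genhlth)))

# Reverse the level order so the plot runs from Inactive to Highly active
exercise$X_pacat1 <- factor(exercise$X_pacat1, levels = rev(levels(exercise$X_pacat1)))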

ex_male <- exercise %>% filter(sex == "Male")
p1 <- ggMMplot(ex_male$X_pacat1, ex_male$genhlth)
p1 + labs(x="Physical Activity (Males)", y="General Health", fill = "")

ex_fem <- exercise %>% filter(sex == "Female")
p2 <- ggMMplot(ex_fem$X_pacat1, ex_fem$genhlth)
p2 + labs(x="Physical Activity (Females)", y="General Health", fill = "")

The plots make it clear that the proportion of people reporting better health is higher among physically active people, while the proportion reporting poor health is higher among inactive people. These plots show a positive association between physical activity and self-reported general health.

Research question 2:

To address this question, we need to find the relation between self-reported general health and smoker type. Searching the BRFSS codebook, we find genhlth, which records self-reported general health in categories from Excellent to Poor. Smoker types are recorded in the X_smoker3 variable, ranging from daily smokers to non-smokers. We'll also check whether sex plays a role in this relation.

smoke <- brfss2013 %>% select(genhlth, X_smoker3, sex) %>% na.omit()
levels(smoke$X_smoker3) <- c("Daily", "Somedays", "Former", "Non_smoker")
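Assigning to levels() renames by position, so the new labels must be listed in the same order as the existing levels. The sketch below shows a more defensive pattern on a toy factor: inspect the current levels first, then rename through a named lookup. The long labels are assumed placeholders rather than the verified codebook wording, and `smoker` and `new_labels` are illustrative names.

```r
# Toy factor; the long labels are assumed stand-ins, not the real codebook text
smoker <- factor(
  c("Never smoked", "Former smoker", "Smokes every day", "Never smoked"),
  levels = c("Smokes every day", "Smokes some days", "Former smoker", "Never smoked")
)

# Inspect the existing order before overwriting anything
levels(smoker)

# Named lookup (old label -> new label), so the mapping does not depend on position
new_labels <- c(
  "Smokes every day" = "Daily",
  "Smokes some days" = "Somedays",
  "Former smoker"    = "Former",
  "Never smoked"     = "Non_smoker"
)
levels(smoker) <- unname(new_labels[levels(smoker)])

table(smoker)
```

If a level were missing from the lookup, it would come back as NA, which makes a mismatch visible instead of silently mislabelling a group.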

Checking the total counts classified by the three factors.

table(smoke)
## , , sex = Male
## 
##            X_smoker3
## genhlth     Daily Somedays Former Non_smoker
##   Excellent  2526     1312   9266      21412
##   Very good  6266     2761  20256      33943
##   Good       9152     3214  21733      26827
##   Fair       4734     1540  10100       8610
##   Poor       2296      784   4408       2835
## 
## , , sex = Female
## 
##            X_smoker3
## genhlth     Daily Somedays Former Non_smoker
##   Excellent  2719     1364  11114      32980
##   Very good  7807     3327  23118      57213
##   Good      10309     3525  21685      49167
##   Fair       6113     2247  10618      20585
##   Poor       3004     1321   5258       7128

We can see that non-smokers are generally healthy, with most observations reporting good or better health, while daily smokers include a higher proportion reporting fair or poor health. The trends are similar for both sexes.
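One way to check that the pattern holds for both sexes is to compute, for each smoker type, the percentage reporting fair or poor health. This is a sketch that re-enters the counts printed above by hand; the helper `fair_poor_share` is a hypothetical name introduced for illustration.

```r
smoker_types  <- c("Daily", "Somedays", "Former", "Non_smoker")
health_levels <- c("Excellent", "Very good", "Good", "Fair", "Poor")

# Counts copied from the table(smoke) output above
male <- matrix(c(2526, 1312,  9266, 21412,
                 6266, 2761, 20256, 33943,
                 9152, 3214, 21733, 26827,
                 4734, 1540, 10100,  8610,
                 2296,  784,  4408,  2835),
               nrow = 5, byrow = TRUE,
               dimnames = list(health_levels, smoker_types))
female <- matrix(c( 2719, 1364, 11114, 32980,
                    7807, 3327, 23118, 57213,
                   10309, 3525, 21685, 49167,
                    6113, 2247, 10618, 20585,
                    3004, 1321,  5258,  7128),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(health_levels, smoker_types))

# Percentage of each smoker type reporting fair or poor health
fair_poor_share <- function(counts) {
  round(100 * colSums(counts[c("Fair", "Poor"), ]) / colSums(counts), 1)
}

shares <- rbind(Male = fair_poor_share(male), Female = fair_poor_share(female))
shares
```

Putting both sexes in one small matrix makes it easy to see whether the ordering of smoker types is the same in each row.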

Now let's draw a mosaic plot to get a clearer picture of the distribution.

smoke$genhlth <- factor(smoke$genhlth, levels = rev(levels(smoke$genhlth)))

#smoke$X_smoker3 <- factor(smoke$X_smoker3, levels = rev(levels(smoke$X_smoker3)))

ex_male <- smoke %>% filter(sex == "Male")
p1 <- ggMMplot(ex_male$X_smoker3, ex_male$genhlth)
p1 + labs(x="Smoker Type (Males)", y="General Health", fill = "")

ex_fem <- smoke %>% filter(sex == "Female")
p2 <- ggMMplot(ex_fem$X_smoker3, ex_fem$genhlth)
p2 + labs(x="Smoker Type (Females)", y="General Health", fill = "")

The plots make it clear that the proportion of people reporting better health is higher among non-smokers, while the proportion reporting poor health is higher among daily smokers. These plots show a negative association between smoking and self-reported general health.

Research question 3:

In this analysis we check whether high blood pressure and high cholesterol are associated with weight. Searching the codebook, we find the X_rfhype5 and X_rfchol variables, which indicate whether a person has high blood pressure and high cholesterol, respectively. The X_bmi5cat variable categorises a person's weight from underweight to obese.

bmi <- brfss2013 %>% select(X_rfhype5, X_rfchol, X_bmi5cat) %>% na.omit()

Let's look at the counts of people with high blood pressure in each weight category.

table(bmi$X_rfhype5, bmi$X_bmi5cat)
##      
##       Underweight Normal weight Overweight Obese
##   No         4288         87327      80488 49156
##   Yes        1899         39276      65001 70736

A large share of obese people have high blood pressure: about 59% of obese and 45% of overweight respondents, compared with about 31% of both underweight and normal weight respondents. Similarly, we check high cholesterol by weight group.

table(bmi$X_rfchol, bmi$X_bmi5cat)
##      
##       Underweight Normal weight Overweight Obese
##   No         4302         81895      78358 58618
##   Yes        1885         44708      67131 61274

Among obese people, the numbers with high and normal cholesterol are nearly equal (61274 versus 58618). However, the proportion with high cholesterol is still considerably higher among obese (about 51%) and overweight (about 46%) people than among normal weight (about 35%) and underweight (about 30%) people.
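The comparison between weight groups rests on within-group rates rather than raw counts. As a sketch, the counts from the two tables above are re-entered by hand and the "Yes" row of the column proportions is extracted; `pct_yes` and `rates` are illustrative names, not part of the original analysis.

```r
bmi_levels <- c("Underweight", "Normal weight", "Overweight", "Obese")

# Counts copied from the two tables above
high_bp <- matrix(c(4288, 87327, 80488, 49156,
                    1899, 39276, 65001, 70736),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("No", "Yes"), bmi_levels))
high_chol <- matrix(c(4302, 81895, 78358, 58618,
                      1885, 44708, 67131, 61274),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("No", "Yes"), bmi_levels))

# Percentage of each BMI category with the condition:
# column proportions (margin = 2), keeping only the "Yes" row
pct_yes <- function(counts) {
  round(100 * prop.table(counts, margin = 2)["Yes", ], 1)
}

rates <- rbind(high_bp = pct_yes(high_bp), high_chol = pct_yes(high_chol))
rates
```

With the full data frame, the same rates should follow from prop.table(table(bmi$X_rfhype5, bmi$X_bmi5cat), margin = 2).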

Let's plot the data now to get a clearer picture of the relation.

ggplot(bmi, aes(X_bmi5cat, fill = X_rfhype5)) + geom_bar() + labs(x = "BMI Category", y = "Number of respondents", fill = "High blood pressure")

ggplot(bmi, aes(X_bmi5cat, fill = X_rfchol)) + geom_bar() + labs(x = "BMI Category", y = "Number of respondents", fill = "High cholesterol")

The plots above show the share of the blue segment growing as the weight category increases, i.e., people in heavier BMI categories tend to have a higher proportion of high blood pressure and high cholesterol. Thus we observe a positive association between weight and both high cholesterol and high blood pressure.
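Stacked counts can make proportions hard to compare because the bars have different heights. As a sketch of one alternative, position = "fill" rescales every bar to height 1 so the segments can be read directly as shares. The data frame `bp_counts` is rebuilt from the blood pressure table above so the snippet runs without the survey file; its column names are illustrative.

```r
library(ggplot2)

bmi_levels <- c("Underweight", "Normal weight", "Overweight", "Obese")

# Counts copied from the blood pressure table above
bp_counts <- data.frame(
  bmi_cat = factor(rep(bmi_levels, times = 2), levels = bmi_levels),
  high_bp = rep(c("No", "Yes"), each = 4),
  n       = c(4288, 87327, 80488, 49156,
              1899, 39276, 65001, 70736)
)

# geom_col() plots pre-computed counts; position = "fill" rescales each bar to sum to 1
p <- ggplot(bp_counts, aes(x = bmi_cat, y = n, fill = high_bp)) +
  geom_col(position = "fill") +
  labs(x = "BMI Category", y = "Proportion of respondents", fill = "High blood pressure")
p
```

On the full bmi data frame, geom_bar(position = "fill") should give the same picture without pre-computing the counts.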