In this project, we carry out exploratory analysis of the BRFSS-2013 data set by setting out research questions, and then exploring relationship between identified variables to answer those questions.
library(ggplot2)
library(dplyr)
brfss2013 <- read.csv("2013.csv")
The BRFSS-2013 dataset was sampled from the non-institutionalised adult population (i.e. 18 years and older) residing in the US. The data was collected through landline and cellular-telephone based surveys.
Disproportionate stratified sampling, which is more efficient than simple random sampling, was used for the landline sample. The cellular sample was generated from randomly selected respondents, with an equal probability of selection.
As random sampling was used for both data collection methods, the data for the sample is generalizable to the population. On the other hand, as this is an observational study, it won’t be possible to make causal inferences from the data.
Are non-smoking heavy drinkers, generally healthier than regular smokers, who are not heavy drinkers?
While researching this, we’re trying to explore the impact of consuming alcohol vs smoking tobacco on a person’s health and see which is worse.
Do people who sleep fewer hours than average person, also have more than days with poor mental health?
Here we try to determine if inadequate sleep has a negative effect on their mental health.
Are people who have completed higher levels of education, more likely to consume fruits and vegetables once or more in a day?
We might assume that educated people live a healthier lifestyle i.e. exercising or eating nutritious food.
Are non-smoking heavy drinkers, generally healthier than regular smokers, who are not heavy drinkers?
The following variables will be used to analyze this:
GENHLTH: Respondent’s general health.
X_RFSMOK3: Is the respondent a current smoker?
X_RFDRHV4: Is the respondent a heavy drinker?
The categories we are intrested in are:
str(select(brfss2013,GENHLTH,X_RFSMOK3,X_RFDRHV4))
## 'data.frame': 491773 obs. of 3 variables:
## $ GENHLTH : num 4 3 3 2 3 2 4 3 1 3 ...
## $ X_RFSMOK3: num 1 1 2 1 1 1 1 2 1 1 ...
## $ X_RFDRHV4: num 1 1 2 1 1 1 1 1 1 1 ...
We start by looking individually at each one of the variables.
The factor for general health is divided in 6 levels, we use the code book to relabel them.
brfss2013$GENHLTH[brfss2013$GENHLTH == 1] <- "Excellent"
brfss2013$GENHLTH[brfss2013$GENHLTH == 2] <- "Very good"
brfss2013$GENHLTH[brfss2013$GENHLTH == 3] <- "Good"
brfss2013$GENHLTH[brfss2013$GENHLTH == 4] <- "Fair"
brfss2013$GENHLTH[brfss2013$GENHLTH == 5] <- "Poor"
brfss2013$GENHLTH[brfss2013$GENHLTH == 7] <- "Not sure"
brfss2013$GENHLTH[brfss2013$GENHLTH == 9] <- "Refused"
brfss2013$GENHLTH <- factor(brfss2013$GENHLTH)
The distribution of the different factors are as follows,
total_obs <- nrow(brfss2013)
brfss2013 %>%
group_by(GENHLTH) %>%
summarise(count=n(),percentage=n()*100/total_obs)
## Warning: Factor `GENHLTH` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 8 x 3
## GENHLTH count percentage
## <fct> <int> <dbl>
## 1 Excellent 85532 17.4
## 2 Fair 66700 13.6
## 3 Good 150548 30.6
## 4 Not sure 969 0.197
## 5 Poor 27909 5.68
## 6 Refused 1004 0.204
## 7 Very good 159104 32.4
## 8 <NA> 7 0.00142
and plotted in a histogram.
ggplot(brfss2013, aes(x=GENHLTH,)) + geom_bar() + ggtitle('General Health of Respondents') + xlab('General Health') + theme_bw()
Around 95% of the respondents in our dataset have stated they are in good health or better, and most of the people have ‘Very good’ health. There are some missing values, denoted by ‘NA’, too which we will deal with later as they don’t make much sense with our analysis.
Do similar as for the factor above.
brfss2013$X_RFSMOK3[brfss2013$X_RFSMOK3 == 1] <- "No"
brfss2013$X_RFSMOK3[brfss2013$X_RFSMOK3 == 2] <- "Yes"
brfss2013$X_RFSMOK3[brfss2013$X_RFSMOK3 == 9] <- "NA"
brfss2013$X_RFSMOK3 <- factor(brfss2013$X_RFSMOK3)
brfss2013 %>%
group_by(X_RFSMOK3) %>%
summarise(count=n(),percentage=n()*100/total_obs)
## # A tibble: 3 x 3
## X_RFSMOK3 count percentage
## <fct> <int> <dbl>
## 1 NA 15322 3.12
## 2 No 399839 81.3
## 3 Yes 76612 15.6
ggplot(brfss2013, aes(x=X_RFSMOK3)) + geom_bar() + ggtitle('Smoking Status of Respondents') + xlab('Currently a smoker?')+ theme_bw()
More than 81% of the respondents have stated they are not current smokers, though they might have smoked earlier in their lifetimes,as this was not explicitly asked.
The variable for ‘heavy drinker’ is defined as adult men having two or more drinks per day and adult women having one or more drink per day.
brfss2013$X_RFDRHV4[brfss2013$X_RFDRHV4 == 1] <- "No"
brfss2013$X_RFDRHV4[brfss2013$X_RFDRHV4 == 2] <- "Yes"
brfss2013$X_RFDRHV4[brfss2013$X_RFDRHV4 == 9] <- "NA"
brfss2013$X_RFDRHV4 <- factor(brfss2013$X_RFDRHV4)
brfss2013 %>%
group_by(X_RFDRHV4) %>%
summarise(count=n(),percentage=n()*100/total_obs)
## # A tibble: 3 x 3
## X_RFDRHV4 count percentage
## <fct> <int> <dbl>
## 1 NA 23874 4.85
## 2 No 442353 90.0
## 3 Yes 25546 5.19
ggplot(brfss2013, aes(x=X_RFDRHV4)) + geom_bar() + ggtitle('Drinking Habits of Respondents') + xlab('Heavy Drinker?') +theme_bw()
Just a little more than 5% in our dataset have stated they are a heavy drinked as defined above.
Now to get back to the original question, we create a new variable to categorise a person as ‘Smoker’, ‘Heavy drinker’, ‘Both’ or ‘None’.
brfss2013 <- brfss2013 %>%
mutate(smoke_alc = ifelse(X_RFDRHV4 == 'Yes',
ifelse(X_RFSMOK3 == 'Yes','Both','Heavy Drinker'),
ifelse(X_RFSMOK3 == 'Yes','Current Smoker','None')))
With the following distribution:
brfss2013 %>%
group_by(smoke_alc) %>%
summarise(count=n(),percentage=n()*100/total_obs)
## # A tibble: 4 x 3
## smoke_alc count percentage
## <chr> <int> <dbl>
## 1 Both 8141 1.66
## 2 Current Smoker 68471 13.9
## 3 Heavy Drinker 17405 3.54
## 4 None 397756 80.9
ggplot(brfss2013,aes(x=smoke_alc)) + geom_bar() + ggtitle('Drinking and Smoking Habits of Respondents') + xlab('Drinker or Smoker?') +theme_bw()
Close to 81% are categorised as ‘None’, almost 14% as ‘Current smoker’ and 3.5% as ‘Heavy drinker’. We will be looking at the last two of our new variables.
We can represent the counts of two categorical variables in a contingency table.
rq1_table <- table(brfss2013$smoke_alc,brfss2013$GENHLTH)
rq1_table
##
## Excellent Fair Good Not sure Poor Refused Very good
## Both 1000 1270 2959 22 444 20 2426
## Current Smoker 6908 13374 23243 152 6947 138 17708
## Heavy Drinker 4176 1362 4699 17 352 30 6769
## None 73448 50694 119647 778 20166 816 132201
But it’s easier to look at proportions rather than the counts. Here we then have a table where we see proportions of health across drinker or smokers.
prop.table(rq1_table,1)
##
## Excellent Fair Good Not sure
## Both 0.1228350326 0.1560004913 0.3634688613 0.0027023707
## Current Smoker 0.1008909011 0.1953264203 0.3394625383 0.0022199503
## Heavy Drinker 0.2399310543 0.0782533755 0.2699798908 0.0009767308
## None 0.1846587052 0.1274519170 0.3008095537 0.0019560025
##
## Poor Refused Very good
## Both 0.0545387545 0.0024567007 0.2979977890
## Current Smoker 0.1014604936 0.0020154812 0.2586242150
## Heavy Drinker 0.0202240735 0.0017236426 0.3889112324
## None 0.0507001886 0.0020515399 0.3323720930
From this we have a sense of what’s going on. Let’s visualize this table with a mosiac plot.
mosaicplot(prop.table(rq1_table,1),main='Drinking and/or Smoking vs General Health', xlab='Drinking and/or Smoking status', ylab='General Health')
Now, looking at the summary statistics, the proportions in the contingency table and the mosiac plot. We see that there is a larger proportion of individuals who are heavy drinkers and are in ‘Excellent’ and ‘Very good’ condition compared to the current smokers. Even though the are proportionally more smokers in ‘Good’ health, they are also higher in both ‘Fair’ and ‘Poor’ health, which we can consider is below par.
From this it looks like the heavy drinkers are in better health than the smokers.
Do people who sleep fewer hours than average person, also have more than days with poor mental health?
For this, we have to look at the relationship between the variables:
Checking out the type of variables that we’re dealing with:
str(select(brfss2013,SLEPTIM1,MENTHLTH))
## 'data.frame': 491773 obs. of 2 variables:
## $ SLEPTIM1: num 77 6 9 8 6 8 7 6 8 8 ...
## $ MENTHLTH: num 29 88 2 88 2 88 15 88 88 88 ...
Looking at how the SLEPTIM1 is distributed:
ggplot(brfss2013,aes(x=SLEPTIM1)) + geom_bar()
We see that there has been a mistake here as the hours slept are recorded to almost 100 in a 24 hour period.
head(brfss2013 %>%
filter(SLEPTIM1>24) %>%
select(SLEPTIM1))
## SLEPTIM1
## 1 77
## 2 77
## 3 77
## 4 77
## 5 77
## 6 77
We filter those observations out:
rq2_brfss2013 <- brfss2013 %>%
filter(SLEPTIM1 <= 24)
Plotting again:
ggplot(rq2_brfss2013,aes(x=as.factor(SLEPTIM1))) + geom_bar() + ggtitle('Amount of Sleep of Respondents') + xlab('Hours slept') + theme_bw()
As one might suspect, most people get around 6-8 hours of sleep. With the average at 7.05 hours.
rq2_brfss2013 %>%
summarise(avg_sleep = mean(SLEPTIM1))
## avg_sleep
## 1 7.051032
Now looking at days of poor mental health recorded under a 30 day period.
ggplot(rq2_brfss2013, aes(x=MENTHLTH)) + geom_histogram(bins=30)
## Warning: Removed 9 rows containing non-finite values (stat_bin).
Again, the same issue as with SLEPTIM1
rq2_brfss2013 <- rq2_brfss2013 %>%
filter(MENTHLTH <= 30)
ggplot(rq2_brfss2013, aes(x=MENTHLTH)) + geom_histogram() + ggtitle('Mental Health of Respondents') + xlab('Number of days with poor mental health (out of 30)') + theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
To answer the question, we look at the average mental health recorded of groups of people who get different hours of sleep per 24 hour period.
slep_ment <- rq2_brfss2013 %>%
group_by(hours_slept = as.factor(SLEPTIM1)) %>%
summarise(avg_poor_mental = mean(MENTHLTH), count=n())
slep_ment
## # A tibble: 24 x 3
## hours_slept avg_poor_mental count
## <fct> <dbl> <int>
## 1 1 19.4 131
## 2 2 20.7 657
## 3 3 19.7 2104
## 4 4 17.5 7729
## 5 5 14.0 14658
## 6 6 11.0 37164
## 7 7 8.15 38133
## 8 8 9.37 32899
## 9 9 10.1 6329
## 10 10 13.5 4109
## # ... with 14 more rows
Now, visualizing this new variable:
ggplot(slep_ment, aes(x=hours_slept,y=avg_poor_mental)) + geom_bar(stat='identity') + ggtitle('Do people who sleep irregularly have poor mental health?') + xlab('Hours Slept') + ylab('No. of days with poor mental health (out of 30)') + theme_bw()
Looking at the summary statistics produces above and the histogram just above, we see that the people who sleep around 7 hours per 24 hour period have a fewer days of poor mental health. With the lowest at 7 hours. While people who sleep less or more have considerably more days with poor mental health.
We can also look at mental health for people who sleep more than 12 hours:
slep_ment %>%
filter(as.integer(hours_slept) > 12)
## # A tibble: 12 x 3
## hours_slept avg_poor_mental count
## <fct> <dbl> <int>
## 1 13 20.0 90
## 2 14 18.2 207
## 3 15 15.8 186
## 4 16 15.3 155
## 5 17 19.2 15
## 6 18 19.5 83
## 7 19 20 6
## 8 20 17.1 31
## 9 21 15 1
## 10 22 16.4 5
## 11 23 30 3
## 12 24 17.8 13
We can conclude that there seems to be a relationship between inadequate sleep and days of poor mental health. But also if you sleep too much. However we can not be sure if one directly causes the other one.
Are people who have completed higher levels of education, more likely to consume fruits and vegetables once or mor in a day?
For this we’ll be using the following variables:
All three of these are categorical variables, so we can answer our question through a contingency table.
Before proceeding, we’ll change the name of the factor levels, as they make the outplot and visualizations look better. For this purpose, we’ll assume that college and technical school the same thing in X_EDUCAG variable.
brfss2013$X_FRTLT1 <- as.factor(brfss2013$X_FRTLT1)
brfss2013$X_VEGLT1 <- as.factor(brfss2013$X_VEGLT1)
levels(brfss2013$X_FRTLT1) <- c('Once or more a day','Less than once a day','NA')
levels(brfss2013$X_VEGLT1) <- c('Once or more a day','Less than once a day','NA')
brfss2013$X_EDUCAG <- as.factor(brfss2013$X_EDUCAG)
levels(brfss2013$X_EDUCAG)[1] <- c('Did not graduate high school')
levels(brfss2013$X_EDUCAG)[2] <- c('Graduated high school')
levels(brfss2013$X_EDUCAG)[3] <- c('Attended college')
levels(brfss2013$X_EDUCAG)[4] <- c('Graduated college')
levels(brfss2013$X_EDUCAG)[5] <- c('NA')
Beginning with an examination of the variables on their own.
brfss2013 %>%
group_by(X_EDUCAG) %>%
summarise(count=n(), percentage=n()*100/total_obs)
## # A tibble: 5 x 3
## X_EDUCAG count percentage
## <fct> <int> <dbl>
## 1 Did not graduate high school 42132 8.57
## 2 Graduated high school 142953 29.1
## 3 Attended college 134242 27.3
## 4 Graduated college 170173 34.6
## 5 NA 2273 0.462
ggplot(brfss2013, aes(x=X_EDUCAG)) + geom_bar() + ggtitle('Education Level of Respondents') + xlab('Completed Education Level') + theme_bw() + theme(axis.text.x = element_text(angle = 30, hjust = 1))
Around 90% of the respondents have graduated high school or higher, while 8.58% did not graduate high school. In the context of our question, we expect that as we go up the completed level of education, we’ll see higher proportions of people who consume fruits and vegetables. We’ll come back to that later.
brfss2013 %>%
group_by(X_FRTLT1) %>%
summarise(count=n(), percentage=n()*100/total_obs)
## # A tibble: 3 x 3
## X_FRTLT1 count percentage
## <fct> <int> <dbl>
## 1 Once or more a day 286488 58.3
## 2 Less than once a day 164003 33.3
## 3 NA 41282 8.39
ggplot(brfss2013, aes(x=X_FRTLT1)) + geom_bar() + ggtitle('Fruit Consumption by Respondents') + xlab('Fruit Consumption') + theme_bw() + theme(axis.text.x = element_text(angle = 30, hjust = 1))
More people in our dataset, around 58% consume fruits one or more times per day than those who don’t, around 33.3%.
brfss2013 %>%
group_by(X_VEGLT1) %>%
summarise(count=n(), percentage=n()*100/total_obs)
## # A tibble: 3 x 3
## X_VEGLT1 count percentage
## <fct> <int> <dbl>
## 1 Once or more a day 349868 71.1
## 2 Less than once a day 92422 18.8
## 3 NA 49483 10.1
ggplot(brfss2013, aes(x=X_VEGLT1)) + geom_bar() + ggtitle('Vegetable Consumption by Respondents') + xlab('Vegetable Consumption') + theme_bw() + theme(axis.text.x = element_text(angle = 30, hjust = 1))
Around 71% people in our dataset consume vegetables one or more times per day, while around 18.8% don’t. We can also see that there are more people that consume veggies (one or more times per day), than there are who consume fruits.
Answering our question, we’ll build a 2-way contingency table to count frequencies of completed education level with both fruit and vegetable consumption.
ct_f <- table(brfss2013$X_EDUCAG,brfss2013$X_FRTLT1)
prop.table(ct_f,1)
##
## Once or more a day Less than once a day
## Did not graduate high school 0.47673977 0.39829583
## Graduated high school 0.52086350 0.37769057
## Attended college 0.58036233 0.34522728
## Graduated college 0.66616913 0.27338062
## NA 0.29476463 0.16014078
##
## NA
## Did not graduate high school 0.12496440
## Graduated high school 0.10144593
## Attended college 0.07441039
## Graduated college 0.06045025
## NA 0.54509459
mosaicplot(prop.table(ct_f,1), main='Completed Education vs Fruit Consumption', xlab='Completed Education Level', ylab='Fruit Consumption')
In both the proportional frequency table, and the mosaic, we can see an increase in the proportion of people who consume fruits, as we move to increased levels of completed education, vs those who don’t. Let’s see if this also holds true for vegetable consumption.
ct_v <- table(brfss2013$X_EDUCAG,brfss2013$X_VEGLT1)
prop.table(ct_v,1)
##
## Once or more a day Less than once a day
## Did not graduate high school 0.56028672 0.28213709
## Graduated high school 0.63699258 0.24080642
## Attended college 0.72854993 0.18252112
## Graduated college 0.80321790 0.12545468
## NA 0.31412231 0.11438627
##
## NA
## Did not graduate high school 0.15757619
## Graduated high school 0.12220100
## Attended college 0.08892895
## Graduated college 0.07132741
## NA 0.57149142
mosaicplot(prop.table(ct_v,1), main='Completed Education vs Vegetable Consumption', xlab='Completed Education Level', ylab='Vegetable Consumption')
There was already a higher proportion of people who consumed vegetables than those who consumed fruits, but the increase in proportions, depending on completed education level is still evident here.
So, we can say that people with higher education levels are more likely to adopt healthy eating habits.