Data Exploratory Analysis

In this project, we carry out exploratory analysis of the BRFSS-2013 data set by setting out research questions, and then exploring relationship between identified variables to answer those questions.

Exploratory data analysis

Question 1:

Are non-smoking heavy drinkers, generally healthier than regular smokers, who are not heavy drinkers?

The following variables will be used to analyze this:

GENHLTH: Respondent’s general health.
X_RFSMOK3: Is the respondent a current smoker?
X_RFDRHV4: Is the respondent a heavy drinker?

The categories we are intrested in are:

str(select(brfss2013,GENHLTH,X_RFSMOK3,X_RFDRHV4))

## 'data.frame':    491773 obs. of  3 variables:
##  $ GENHLTH  : num  4 3 3 2 3 2 4 3 1 3 ...
##  $ X_RFSMOK3: num  1 1 2 1 1 1 1 2 1 1 ...
##  $ X_RFDRHV4: num  1 1 2 1 1 1 1 1 1 1 ...

We start by looking individually at each one of the variables.

GENHLTH

The factor for general health is divided in 6 levels, we use the code book to relabel them.

brfss2013$GENHLTH[brfss2013$GENHLTH == 1] <- "Excellent"
brfss2013$GENHLTH[brfss2013$GENHLTH == 2] <- "Very good"
brfss2013$GENHLTH[brfss2013$GENHLTH == 3] <- "Good"
brfss2013$GENHLTH[brfss2013$GENHLTH == 4] <- "Fair"
brfss2013$GENHLTH[brfss2013$GENHLTH == 5] <- "Poor"
brfss2013$GENHLTH[brfss2013$GENHLTH == 7] <- "Not sure"
brfss2013$GENHLTH[brfss2013$GENHLTH == 9] <- "Refused"

brfss2013$GENHLTH <- factor(brfss2013$GENHLTH)

The distribution of the different factors are as follows,

total_obs <- nrow(brfss2013)

brfss2013 %>%
  group_by(GENHLTH) %>%
  summarise(count=n(),percentage=n()*100/total_obs)

## Warning: Factor `GENHLTH` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## # A tibble: 8 x 3
##   GENHLTH    count percentage
##   <fct>      <int>      <dbl>
## 1 Excellent  85532   17.4    
## 2 Fair       66700   13.6    
## 3 Good      150548   30.6    
## 4 Not sure     969    0.197  
## 5 Poor       27909    5.68   
## 6 Refused     1004    0.204  
## 7 Very good 159104   32.4    
## 8 <NA>           7    0.00142

and plotted in a histogram.

ggplot(brfss2013, aes(x=GENHLTH,)) + geom_bar() + ggtitle('General Health of Respondents') + xlab('General Health') + theme_bw()

Around 95% of the respondents in our dataset have stated they are in good health or better, and most of the people have ‘Very good’ health. There are some missing values, denoted by ‘NA’, too which we will deal with later as they don’t make much sense with our analysis.

X_RFSMOK3

Do similar as for the factor above.

brfss2013$X_RFSMOK3[brfss2013$X_RFSMOK3 == 1] <- "No"
brfss2013$X_RFSMOK3[brfss2013$X_RFSMOK3 == 2] <- "Yes"
brfss2013$X_RFSMOK3[brfss2013$X_RFSMOK3 == 9] <- "NA"

brfss2013$X_RFSMOK3 <- factor(brfss2013$X_RFSMOK3)

brfss2013 %>%
  group_by(X_RFSMOK3) %>%
  summarise(count=n(),percentage=n()*100/total_obs)

## # A tibble: 3 x 3
##   X_RFSMOK3  count percentage
##   <fct>      <int>      <dbl>
## 1 NA         15322       3.12
## 2 No        399839      81.3 
## 3 Yes        76612      15.6

ggplot(brfss2013, aes(x=X_RFSMOK3)) + geom_bar() + ggtitle('Smoking Status of Respondents') + xlab('Currently a smoker?')+ theme_bw()

More than 81% of the respondents have stated they are not current smokers, though they might have smoked earlier in their lifetimes,as this was not explicitly asked.

X_RFDRHV4

The variable for ‘heavy drinker’ is defined as adult men having two or more drinks per day and adult women having one or more drink per day.

brfss2013$X_RFDRHV4[brfss2013$X_RFDRHV4 == 1] <- "No"
brfss2013$X_RFDRHV4[brfss2013$X_RFDRHV4 == 2] <- "Yes"
brfss2013$X_RFDRHV4[brfss2013$X_RFDRHV4 == 9] <- "NA"

brfss2013$X_RFDRHV4 <- factor(brfss2013$X_RFDRHV4)

brfss2013 %>%
  group_by(X_RFDRHV4) %>%
  summarise(count=n(),percentage=n()*100/total_obs)

## # A tibble: 3 x 3
##   X_RFDRHV4  count percentage
##   <fct>      <int>      <dbl>
## 1 NA         23874       4.85
## 2 No        442353      90.0 
## 3 Yes        25546       5.19

ggplot(brfss2013, aes(x=X_RFDRHV4)) + geom_bar() + ggtitle('Drinking Habits of Respondents') + xlab('Heavy Drinker?') +theme_bw()

Just a little more than 5% in our dataset have stated they are a heavy drinked as defined above.

Now to get back to the original question, we create a new variable to categorise a person as ‘Smoker’, ‘Heavy drinker’, ‘Both’ or ‘None’.

brfss2013 <- brfss2013 %>%
  mutate(smoke_alc = ifelse(X_RFDRHV4 == 'Yes',
                            ifelse(X_RFSMOK3 == 'Yes','Both','Heavy Drinker'),
                            ifelse(X_RFSMOK3 == 'Yes','Current Smoker','None')))

With the following distribution:

brfss2013 %>%
  group_by(smoke_alc) %>%
  summarise(count=n(),percentage=n()*100/total_obs)

## # A tibble: 4 x 3
##   smoke_alc       count percentage
##   <chr>           <int>      <dbl>
## 1 Both             8141       1.66
## 2 Current Smoker  68471      13.9 
## 3 Heavy Drinker   17405       3.54
## 4 None           397756      80.9

ggplot(brfss2013,aes(x=smoke_alc)) + geom_bar() + ggtitle('Drinking and Smoking Habits of Respondents') + xlab('Drinker or Smoker?') +theme_bw()

Close to 81% are categorised as ‘None’, almost 14% as ‘Current smoker’ and 3.5% as ‘Heavy drinker’. We will be looking at the last two of our new variables.

We can represent the counts of two categorical variables in a contingency table.

rq1_table <- table(brfss2013$smoke_alc,brfss2013$GENHLTH)

rq1_table

##                 
##                  Excellent   Fair   Good Not sure   Poor Refused Very good
##   Both                1000   1270   2959       22    444      20      2426
##   Current Smoker      6908  13374  23243      152   6947     138     17708
##   Heavy Drinker       4176   1362   4699       17    352      30      6769
##   None               73448  50694 119647      778  20166     816    132201

But it’s easier to look at proportions rather than the counts. Here we then have a table where we see proportions of health across drinker or smokers.

prop.table(rq1_table,1)

##                 
##                     Excellent         Fair         Good     Not sure
##   Both           0.1228350326 0.1560004913 0.3634688613 0.0027023707
##   Current Smoker 0.1008909011 0.1953264203 0.3394625383 0.0022199503
##   Heavy Drinker  0.2399310543 0.0782533755 0.2699798908 0.0009767308
##   None           0.1846587052 0.1274519170 0.3008095537 0.0019560025
##                 
##                          Poor      Refused    Very good
##   Both           0.0545387545 0.0024567007 0.2979977890
##   Current Smoker 0.1014604936 0.0020154812 0.2586242150
##   Heavy Drinker  0.0202240735 0.0017236426 0.3889112324
##   None           0.0507001886 0.0020515399 0.3323720930

From this we have a sense of what’s going on. Let’s visualize this table with a mosiac plot.

mosaicplot(prop.table(rq1_table,1),main='Drinking and/or Smoking vs General Health', xlab='Drinking and/or Smoking status', ylab='General Health')

Now, looking at the summary statistics, the proportions in the contingency table and the mosiac plot. We see that there is a larger proportion of individuals who are heavy drinkers and are in ‘Excellent’ and ‘Very good’ condition compared to the current smokers. Even though the are proportionally more smokers in ‘Good’ health, they are also higher in both ‘Fair’ and ‘Poor’ health, which we can consider is below par.

From this it looks like the heavy drinkers are in better health than the smokers.

Question 2:

Do people who sleep fewer hours than average person, also have more than days with poor mental health?

For this, we have to look at the relationship between the variables:

SLEPTIM1: On average, the hours of sleep a person gets in a 24-hour period.
MENTHLTH: Out of 30, number of days the mental health of a person wasn’t good.

Checking out the type of variables that we’re dealing with:

str(select(brfss2013,SLEPTIM1,MENTHLTH))

## 'data.frame':    491773 obs. of  2 variables:
##  $ SLEPTIM1: num  77 6 9 8 6 8 7 6 8 8 ...
##  $ MENTHLTH: num  29 88 2 88 2 88 15 88 88 88 ...

Looking at how the SLEPTIM1 is distributed:

ggplot(brfss2013,aes(x=SLEPTIM1)) + geom_bar()

We see that there has been a mistake here as the hours slept are recorded to almost 100 in a 24 hour period.

head(brfss2013 %>%
  filter(SLEPTIM1>24) %>%
  select(SLEPTIM1))

##   SLEPTIM1
## 1       77
## 2       77
## 3       77
## 4       77
## 5       77
## 6       77

We filter those observations out:

rq2_brfss2013 <- brfss2013 %>%
  filter(SLEPTIM1 <= 24)

Plotting again:

ggplot(rq2_brfss2013,aes(x=as.factor(SLEPTIM1))) + geom_bar() + ggtitle('Amount of Sleep of Respondents') + xlab('Hours slept') + theme_bw()

As one might suspect, most people get around 6-8 hours of sleep. With the average at 7.05 hours.

rq2_brfss2013 %>%
  summarise(avg_sleep = mean(SLEPTIM1))

##   avg_sleep
## 1  7.051032

MENTHLTH

Now looking at days of poor mental health recorded under a 30 day period.

ggplot(rq2_brfss2013, aes(x=MENTHLTH)) + geom_histogram(bins=30)

## Warning: Removed 9 rows containing non-finite values (stat_bin).

Again, the same issue as with SLEPTIM1

rq2_brfss2013 <- rq2_brfss2013 %>%
  filter(MENTHLTH <= 30)

ggplot(rq2_brfss2013, aes(x=MENTHLTH)) + geom_histogram() + ggtitle('Mental Health of Respondents') + xlab('Number of days with poor mental health (out of 30)') + theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

To answer the question, we look at the average mental health recorded of groups of people who get different hours of sleep per 24 hour period.

slep_ment <- rq2_brfss2013 %>%
 group_by(hours_slept = as.factor(SLEPTIM1)) %>%
 summarise(avg_poor_mental = mean(MENTHLTH), count=n()) 

slep_ment

## # A tibble: 24 x 3
##    hours_slept avg_poor_mental count
##    <fct>                 <dbl> <int>
##  1 1                     19.4    131
##  2 2                     20.7    657
##  3 3                     19.7   2104
##  4 4                     17.5   7729
##  5 5                     14.0  14658
##  6 6                     11.0  37164
##  7 7                      8.15 38133
##  8 8                      9.37 32899
##  9 9                     10.1   6329
## 10 10                    13.5   4109
## # ... with 14 more rows

Now, visualizing this new variable:

ggplot(slep_ment, aes(x=hours_slept,y=avg_poor_mental)) + geom_bar(stat='identity') + ggtitle('Do people who sleep irregularly have poor mental health?') + xlab('Hours Slept') + ylab('No. of days with poor mental health (out of 30)') + theme_bw()

Looking at the summary statistics produces above and the histogram just above, we see that the people who sleep around 7 hours per 24 hour period have a fewer days of poor mental health. With the lowest at 7 hours. While people who sleep less or more have considerably more days with poor mental health.

We can also look at mental health for people who sleep more than 12 hours:

slep_ment %>%
  filter(as.integer(hours_slept) > 12)

## # A tibble: 12 x 3
##    hours_slept avg_poor_mental count
##    <fct>                 <dbl> <int>
##  1 13                     20.0    90
##  2 14                     18.2   207
##  3 15                     15.8   186
##  4 16                     15.3   155
##  5 17                     19.2    15
##  6 18                     19.5    83
##  7 19                     20       6
##  8 20                     17.1    31
##  9 21                     15       1
## 10 22                     16.4     5
## 11 23                     30       3
## 12 24                     17.8    13

We can conclude that there seems to be a relationship between inadequate sleep and days of poor mental health. But also if you sleep too much. However we can not be sure if one directly causes the other one.

Question 3:

Are people who have completed higher levels of education, more likely to consume fruits and vegetables once or mor in a day?

For this we’ll be using the following variables:

X_EDUCAG: Computed level of education completed.
X_FRTLT1: Consume fruit 1 or times per day.
X_VEGLT1: Consume vegetables 1 or times per day.

All three of these are categorical variables, so we can answer our question through a contingency table.

Before proceeding, we’ll change the name of the factor levels, as they make the outplot and visualizations look better. For this purpose, we’ll assume that college and technical school the same thing in X_EDUCAG variable.

brfss2013$X_FRTLT1 <- as.factor(brfss2013$X_FRTLT1)
brfss2013$X_VEGLT1 <- as.factor(brfss2013$X_VEGLT1)
levels(brfss2013$X_FRTLT1) <- c('Once or more a day','Less than once a day','NA')
levels(brfss2013$X_VEGLT1) <- c('Once or more a day','Less than once a day','NA')

brfss2013$X_EDUCAG <- as.factor(brfss2013$X_EDUCAG)
levels(brfss2013$X_EDUCAG)[1] <- c('Did not graduate high school')
levels(brfss2013$X_EDUCAG)[2] <- c('Graduated high school')
levels(brfss2013$X_EDUCAG)[3] <- c('Attended college')
levels(brfss2013$X_EDUCAG)[4] <- c('Graduated college')
levels(brfss2013$X_EDUCAG)[5] <- c('NA')

Beginning with an examination of the variables on their own.

X_EDUCAG

brfss2013 %>%
  group_by(X_EDUCAG) %>%
  summarise(count=n(), percentage=n()*100/total_obs)

## # A tibble: 5 x 3
##   X_EDUCAG                      count percentage
##   <fct>                         <int>      <dbl>
## 1 Did not graduate high school  42132      8.57 
## 2 Graduated high school        142953     29.1  
## 3 Attended college             134242     27.3  
## 4 Graduated college            170173     34.6  
## 5 NA                             2273      0.462

ggplot(brfss2013, aes(x=X_EDUCAG)) + geom_bar() + ggtitle('Education Level of Respondents') + xlab('Completed Education Level') + theme_bw() + theme(axis.text.x = element_text(angle = 30, hjust = 1))

Around 90% of the respondents have graduated high school or higher, while 8.58% did not graduate high school. In the context of our question, we expect that as we go up the completed level of education, we’ll see higher proportions of people who consume fruits and vegetables. We’ll come back to that later.

X_FRTLT1

brfss2013 %>%
  group_by(X_FRTLT1) %>%
  summarise(count=n(), percentage=n()*100/total_obs)

## # A tibble: 3 x 3
##   X_FRTLT1              count percentage
##   <fct>                 <int>      <dbl>
## 1 Once or more a day   286488      58.3 
## 2 Less than once a day 164003      33.3 
## 3 NA                    41282       8.39

ggplot(brfss2013, aes(x=X_FRTLT1)) + geom_bar() + ggtitle('Fruit Consumption by Respondents') + xlab('Fruit Consumption') + theme_bw() + theme(axis.text.x = element_text(angle = 30, hjust = 1))

More people in our dataset, around 58% consume fruits one or more times per day than those who don’t, around 33.3%.

X_VEGLT1

brfss2013 %>%
  group_by(X_VEGLT1) %>%
  summarise(count=n(), percentage=n()*100/total_obs)

## # A tibble: 3 x 3
##   X_VEGLT1              count percentage
##   <fct>                 <int>      <dbl>
## 1 Once or more a day   349868       71.1
## 2 Less than once a day  92422       18.8
## 3 NA                    49483       10.1

ggplot(brfss2013, aes(x=X_VEGLT1)) + geom_bar() + ggtitle('Vegetable Consumption by Respondents') + xlab('Vegetable Consumption') + theme_bw() + theme(axis.text.x = element_text(angle = 30, hjust = 1))

Around 71% people in our dataset consume vegetables one or more times per day, while around 18.8% don’t. We can also see that there are more people that consume veggies (one or more times per day), than there are who consume fruits.

Answering our question, we’ll build a 2-way contingency table to count frequencies of completed education level with both fruit and vegetable consumption.

ct_f <- table(brfss2013$X_EDUCAG,brfss2013$X_FRTLT1)

prop.table(ct_f,1)

##                               
##                                Once or more a day Less than once a day
##   Did not graduate high school         0.47673977           0.39829583
##   Graduated high school                0.52086350           0.37769057
##   Attended college                     0.58036233           0.34522728
##   Graduated college                    0.66616913           0.27338062
##   NA                                   0.29476463           0.16014078
##                               
##                                        NA
##   Did not graduate high school 0.12496440
##   Graduated high school        0.10144593
##   Attended college             0.07441039
##   Graduated college            0.06045025
##   NA                           0.54509459

mosaicplot(prop.table(ct_f,1), main='Completed Education vs Fruit Consumption', xlab='Completed Education Level', ylab='Fruit Consumption')

In both the proportional frequency table, and the mosaic, we can see an increase in the proportion of people who consume fruits, as we move to increased levels of completed education, vs those who don’t. Let’s see if this also holds true for vegetable consumption.

ct_v <- table(brfss2013$X_EDUCAG,brfss2013$X_VEGLT1)

prop.table(ct_v,1)

##                               
##                                Once or more a day Less than once a day
##   Did not graduate high school         0.56028672           0.28213709
##   Graduated high school                0.63699258           0.24080642
##   Attended college                     0.72854993           0.18252112
##   Graduated college                    0.80321790           0.12545468
##   NA                                   0.31412231           0.11438627
##                               
##                                        NA
##   Did not graduate high school 0.15757619
##   Graduated high school        0.12220100
##   Attended college             0.08892895
##   Graduated college            0.07132741
##   NA                           0.57149142

mosaicplot(prop.table(ct_v,1), main='Completed Education vs Vegetable Consumption', xlab='Completed Education Level', ylab='Vegetable Consumption')

There was already a higher proportion of people who consumed vegetables than those who consumed fruits, but the increase in proportions, depending on completed education level is still evident here.

So, we can say that people with higher education levels are more likely to adopt healthy eating habits.

Data Exploratory Analysis

Christian Källgren

16 september 2019

Setup

The Data

Research questions

Question 1:

Question 2:

Question 3:

Exploratory data analysis

Question 1:

GENHLTH

X_RFSMOK3

X_RFDRHV4

Question 2:

MENTHLTH

Question 3:

X_EDUCAG

X_FRTLT1

X_VEGLT1