Introduction

The following is a limited exploratory data analysis (EDA) of the BRFSS data set, prepared as an assignment for the Introduction to Probability and Data course offered by Duke University on Coursera. The assignment requires each participant to choose three research questions based on their interests, the underlying data, and the learning objectives of the module. Since the data set is large (330 variables), the EDA focuses on the variables assumed relevant to the questions posed by the author.

Setup

This document was built with R version 3.4.4, and the markdown file was created in RStudio's desktop client. The complete code used to generate this document is available on GitHub (see the link at the end). For readability, the EDA focuses on summary statistics and the graphics generated from them, and omits some code output that is not required for grading the assignment.

Load packages

library(tidyverse)
library(gridExtra)
library(purrr)

Load data

load("brfss2013.RData")
br_data <- brfss2013; rm("brfss2013")

Part 1: Data

Per the CDC's brief on the BRFSS 2013 data set, the Behavioral Risk Factor Surveillance System (BRFSS) project, and the resulting data set, is a collaboration between the Centers for Disease Control and Prevention (CDC) and US states and participating territories. Data on health-related risk behaviors, chronic health conditions, and use of preventive services are collected in order to build better health promotion tools for US residents. The following aspects of the data set are relevant to this project:

A large proportion of the values are NA (about 41.87%). In fact, there are 0 complete cases. While this may present a problem, depending on the variables being analysed, not every question applies to every respondent, so missing data is to be expected. The missing data may require a mitigation strategy; any such strategy will be applied on a case-by-case basis below.

The data set contains 491775 rows (observations) and 330 columns (variables). A detailed description of the variables can be found in the BRFSS 2013 codebook.
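The missingness figures and dimensions quoted above can be checked with a few base R calls. The sketch below uses a small made-up data frame (`toy`, hypothetical) instead of `br_data` so that it runs on its own; the same calls should apply unchanged to the full data set.

```r
# Hypothetical toy data frame standing in for br_data
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c(NA, "x", "y", NA),
                  c = c(TRUE, FALSE, NA, TRUE))

dims        <- dim(toy)                  # rows (observations), columns (variables)
pct_missing <- 100 * mean(is.na(toy))    # share of all cells that are NA, in percent
n_complete  <- sum(complete.cases(toy))  # number of rows without any NA
```

Here every row has at least one NA, so `n_complete` is 0 even though only a third of the cells are missing, which mirrors the situation described for the BRFSS data.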


Part 2: Research questions

The research questions are motivated by a combination of currently topical subjects and the author's personal curiosity, given his background in healthcare and his interest in nutrition.

What is the relationship, if any, between poor mental or physical health and health coverage? Do adults over 65 (covered by Medicare) influence the relationship?

To answer this question the following variables will be used:

What is the relationship, if any, between mental health and alcohol consumption? Are there any gender differences?

To answer this question the following variables will be used: sex, menthlth, avedrnk2, and alcday5.

What is the relationship between BMI and Balanced Food consumption, if any? Balanced Food consumption will be measured by aggregating consumption of Fruits, Vegetables and Beans.

To answer this question the following variables will be used: beanday_, grenday_, orngday_, frutda1_, X_bmi5, and X_bmi5cat.


Part 3: Exploratory data analysis

Q.1 What is the relationship, if any, between poor mental or physical health and health coverage? Do adults over 65 (covered by Medicare) influence the relationship?

The following is a summary of the three chosen variables, along with a density plot of the number of self-reported poor health days. Since $poorhlth has to be a number between 0 and 30, values over 30 and NA values have been discarded.

## Table 1.1.1
##            X_age65yr        PoorHealth     Coverage    
##  Age 18 to 64   :172245   Min.   : 0.000   Yes:215291  
##  Age 65 or older: 73711   1st Qu.: 0.000   No : 30665  
##                           Median : 0.000               
##                           Mean   : 5.276               
##                           3rd Qu.: 5.000               
##                           Max.   :30.000
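The filtering step described above (keep only valid day counts, drop NA) could look roughly like this. The data frame `toy` and its values are invented for illustration; only the column names follow Table 1.1.1.

```r
# Hypothetical toy values; in the report PoorHealth comes from $poorhlth
toy <- data.frame(PoorHealth = c(0, 5, 30, 45, NA, 12),
                  Coverage   = c("Yes", "No", "Yes", "Yes", "No", "Yes"))

# Keep rows where PoorHealth is a valid day count between 0 and 30
keep  <- !is.na(toy$PoorHealth) & toy$PoorHealth >= 0 & toy$PoorHealth <= 30
valid <- toy[keep, ]
```

The explicit `!is.na()` check matters with base subsetting, because a comparison against NA yields NA and would otherwise produce an all-NA row rather than dropping it.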

Some Observations:

For those who did report not feeling well in the last 30 days, the mean and median are below.

## Table 1.1.2
## # A tibble: 2 x 3
##   Coverage  mean median
##   <fct>    <dbl>  <int>
## 1 Yes       12.2      7
## 2 No        12.3      7

The data show that those who reported feeling unwell averaged about 12 days of illness in the 30 days prior to their interview call. The median duration was 7 days, i.e. half of this group reported 7 or fewer days of illness. There does not seem to be any meaningful difference in mean or median based on the presence or absence of healthcare coverage.
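A summary like Table 1.1.2 can be produced by restricting to respondents with at least one unwell day and then summarizing per coverage group. This is a base R sketch on invented numbers (`toy` is hypothetical), not the report's own code.

```r
# Hypothetical toy data with the column names used in Table 1.1.1
toy <- data.frame(
  PoorHealth = c(0, 2, 7, 30, 0, 3, 7, 20),
  Coverage   = factor(rep(c("Yes", "No"), each = 4), levels = c("Yes", "No"))
)

# Keep only those who reported at least one day of poor health
unwell <- toy[toy$PoorHealth > 0, ]

# Mean and median number of poor health days per coverage group
means   <- tapply(unwell$PoorHealth, unwell$Coverage, mean)
medians <- tapply(unwell$PoorHealth, unwell$Coverage, median)
```

Because the distribution of day counts is right-skewed, the mean sits above the median in both groups, as it does in the report's table.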

The proportion table for the data is below.

## Table 1.1.3
##           Coverage
## PoorHealth        Yes         No
##     Unwell 0.37414416 0.05719316
##     Well   0.50117907 0.06748361
## Table 1.1.4
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  testTable1
## X-squared = 107.05, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.01658528 -0.01126320
## sample estimates:
##    prop 1    prop 2 
## 0.8674050 0.8813293
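The test in Table 1.1.4 can be approximately re-created from the published proportions. The counts below are not taken from the source code; they are back-calculated by multiplying the proportions in Table 1.1.3 by the 245956 observations in Table 1.1.1 and rounding, so treat them as an assumption.

```r
# Counts back-calculated from Table 1.1.3 (proportion * 245956, rounded)
testTable1 <- matrix(c( 92023, 14067,
                       123268, 16598),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(PoorHealth = c("Unwell", "Well"),
                                     Coverage   = c("Yes", "No")))

# Two-sample test for equality of proportions (continuity correction is the default)
res <- prop.test(testTable1)
res$estimate  # share with coverage among the Unwell and among the Well
res$conf.int  # 95% confidence interval for the difference
```

Read this way, "prop 1" and "prop 2" are the shares of respondents with coverage within the Unwell and Well groups respectively, which is what the confidence interval in the table compares.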

The difference in proportions could be due to:

Individuals in the US over the age of 65 are generally covered by Medicare. Separating the data by those above 65 years of age may therefore be useful in understanding the relationship between episodes of ill health in the previous 30-day period and some form of healthcare coverage.

Are Health Coverage and Poor Health independent? Does Medicare (Age > 65 Yrs.) influence the relationship?

Interpretation of Mosaic Plots:

Is the difference in proportions significantly greater?

## Table 1.1.5
##                 Independence_pValue Greater_pValue
## All adults                   0.0000           1.00
## Adults Over 65               0.0112           0.99
## Adults 18 to 64              0.0000           1.00
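One way to obtain a row of Table 1.1.5 is to run a chi-squared test of independence and a one-sided test of proportions on the same 2 x 2 table. The helper `two_tests` is hypothetical, and the choice of `chisq.test` for the independence column is an assumption; the counts are again back-calculated from Table 1.1.3 rather than taken from the source.

```r
# Hypothetical helper: returns the two p-values reported per row of Table 1.1.5
two_tests <- function(tab) {
  c(Independence_pValue = chisq.test(tab)$p.value,
    Greater_pValue      = prop.test(tab, alternative = "greater")$p.value)
}

# All adults: rows Unwell/Well, columns Coverage Yes/No (back-calculated counts)
all_adults <- matrix(c( 92023, 14067,
                       123268, 16598), nrow = 2, byrow = TRUE)

p_all <- round(two_tests(all_adults), 4)
```

A one-sided p-value close to 1 means the observed difference points in the opposite direction to the alternative being tested, not that the two proportions are equal.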

Interpretation of tests for independence and equality of proportions:

Conclusion: The variables Coverage and Poor Health (an episode of ill health in the previous 30 days) are not independent of each other. However, the one-sided tests (Greater_pValue in Table 1.1.5) give no evidence that the first proportion is greater than the second. Hence, there may be other influencing factors not identified in the above analysis.


Q.2 What is the relationship, if any, between mental health and alcohol consumption?

Are there any gender differences?

A brief computation of the selected variables follows, along with a summary of the data.

# Create a smaller data set with the chosen variables
q2b <- br_data %>% 
   select(sex, 
          Mental_Health = menthlth, 
          Ave_Drinks = avedrnk2, 
          Drink_Days = alcday5) %>% 
   # Create categorical variables $MHEpisode and $Drank_Alc
   mutate(MHEpisode = ifelse(Mental_Health > 0, "Yes_M", "No_M"),
          Drank_Alc = ifelse(Drink_Days > 0, "Consumed Alcohol", "Did Not Consume Alc"))

# Convert an $alcday5 code into drinking days per month:
# 1xx = days per week (scaled by 4 weeks), 2xx = days in the past 30 days
convert.DrinksAlc <- function(x) {
   if (grepl("^1", x)) { 
      y <- (as.numeric(x) - 100) * 4
   } else if (grepl("^2", x)) { 
      y <- as.numeric(x) - 200
   } else {
      y <- as.numeric(x)   # 0 (no drinking days) and NA pass through unchanged
   }
   y
}
# Apply convert.DrinksAlc to each element of $Drink_Days
q2b$Drink_Days <- map_dbl(q2b$Drink_Days, convert.DrinksAlc)
# Remove negative values produced by miscoded answers (this also drops NA rows)
q2b <- q2b %>% filter(Drink_Days >= 0)

# Create $Total_Drinks and select variables for plots
q2_data <- q2b %>% mutate(Total_Drinks = Drink_Days * Ave_Drinks) %>%
   select(sex, Mental_Health, Total_Drinks, MHEpisode, Drank_Alc)

rm(q2b)
## Table 2.1.1 - Summary of the selected variables
##      sex         Mental_Health    Total_Drinks      MHEpisode        
##  Male  :192694   Min.   : 0.00   Min.   :   1.00   Length:472127     
##  Female:279433   1st Qu.: 0.00   1st Qu.:   4.00   Class :character  
##                  Median : 0.00   Median :  10.00   Mode  :character  
##                  Mean   : 3.38   Mean   :  22.09                     
##                  3rd Qu.: 2.00   3rd Qu.:  28.00                     
##                  Max.   :30.00   Max.   :2280.00                     
##                  NA's   :7911    NA's   :240948                      
##   Drank_Alc        
##  Length:472127     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
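The element-by-element conversion of $alcday5 codes used above can also be written in vectorized form, which avoids calling the function once per row. The name `convert_drink_days` is hypothetical; the logic is intended to mirror `convert.DrinksAlc` (codes starting with 1 are days per week, scaled by 4; codes starting with 2 are days in the past 30 days).

```r
# Hypothetical vectorized variant of convert.DrinksAlc
convert_drink_days <- function(x) {
  x_num <- as.numeric(x)
  code  <- as.character(x)
  ifelse(grepl("^1", code), (x_num - 100) * 4,     # 1xx: days per week * 4
         ifelse(grepl("^2", code), x_num - 200,    # 2xx: days in past 30 days
                x_num))                            # 0 and NA pass through
}

example_days <- convert_drink_days(c(0, 101, 107, 201, 230, NA))
```

With this form the column could be converted in a single call, for example inside `mutate()`, without `map_dbl`.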

Chosen graphical representation of the data:

Interpretation of plots.

Questions selected to quantify and confirm after the EDA:

  1. Is there a significant difference in the proportion of individuals who consumed alcohol and had mental health concerns?
  2. Are women disproportionately consuming more alcohol if they have mental health concerns?
## Table 2.1.2 - Proportion of Individuals who reported mental health concerns and reported consuming alcohol.
##                      MHEpisode
## Drank_Alc                  No_M     Yes_M
##   Consumed Alcohol    0.3436590 0.1577585
##   Did Not Consume Alc 0.3465736 0.1520090
## prop.test()
## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  Alc_ME_table
## X-squared = 51.521, df = 1, p-value = 7.083e-13
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.012402840 -0.007082608
## sample estimates:
##    prop 1    prop 2 
## 0.6853750 0.6951177

Interpretation of Table 2.1.2:

There is a small difference between those who consumed alcohol and those who did not in the proportion reporting no mental health concerns (68.5% versus 69.5%). The 95% CI for this difference is -1.24 to -0.71 percentage points.
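The "prop 1" and "prop 2" estimates in the test output are row-wise proportions of Table 2.1.2. A short sketch, using only the cell proportions printed in that table, shows how they follow; the object names are hypothetical.

```r
# Cell proportions copied from Table 2.1.2
tab212 <- matrix(c(0.3436590, 0.1577585,
                   0.3465736, 0.1520090),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Drank_Alc = c("Consumed Alcohol", "Did Not Consume Alc"),
                                 MHEpisode = c("No_M", "Yes_M")))

# Share of No_M / Yes_M within each alcohol group (rows sum to 1)
row_share <- prop.table(tab212, margin = 1)

# Difference in the No_M share between the two groups, in percentage points
diff_pp <- 100 * (row_share[1, "No_M"] - row_share[2, "No_M"])
```

The resulting difference of roughly one percentage point lies inside the confidence interval reported by `prop.test()`.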

## Table 2.1.3 - Test for independence of proportions: Individuals who reported mental health concerns & reported consuming alcohol by gender.
## Number of cases in table: 464216 
## Number of factors: 3 
## Test for independence of all factors:
##  Chisq = 13548, df = 4, p-value = 0

Interpretation of Table 2.1.3:

The three selected variables ($sex, $MHEpisode, $Drank_Alc) are not independent.
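Output in the format of Table 2.1.3 is what `summary()` prints for a multi-way contingency table. The sketch below assumes that is how the table was produced and uses simulated data (`toy`, hypothetical) in place of `q2_data`.

```r
set.seed(1)
n <- 500

# Simulated stand-in for q2_data; values are random, not from BRFSS
toy <- data.frame(
  sex       = factor(sample(c("Male", "Female"), n, replace = TRUE),
                     levels = c("Male", "Female")),
  MHEpisode = factor(sample(c("No_M", "Yes_M"), n, replace = TRUE),
                     levels = c("No_M", "Yes_M")),
  Drank_Alc = factor(sample(c("Consumed Alcohol", "Did Not Consume Alc"), n, replace = TRUE),
                     levels = c("Consumed Alcohol", "Did Not Consume Alc"))
)

# Chi-squared test for mutual independence of all three factors
s <- summary(table(toy))
s$n.cases    # number of cases in the table
s$parameter  # degrees of freedom: 2*2*2 - 1 - 3 = 4
s$p.value
```

Note that this tests mutual independence of all factors at once; a small p-value does not say which pair of variables drives the dependence.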

## Visualizing difference in proportions

Interpretation of the Mosaic plot:

Conclusions:

2.1 There is a difference in alcohol consumption between Males and Females. Table 2.2.1

gender_Alc_table <- q2_data %>% select(Drank_Alc, sex) %>% table(useNA = "no")
prop.test(gender_Alc_table, correct = FALSE)
## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  gender_Alc_table
## X-squared = 9420.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.1360765 0.1416289
## sample estimates:
##    prop 1    prop 2 
## 0.4777581 0.3389054

2.2 There is a difference in mental health concerns between Males and Females. Table 2.2.2

gender_MH_table <- q2_data %>% select(MHEpisode, sex) %>% table(useNA = "no")
prop.test(gender_MH_table, correct = FALSE, alternative = "greater")
## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  gender_MH_table
## X-squared = 4131.9, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.09779562 1.00000000
## sample estimates:
##    prop 1    prop 2 
## 0.4398768 0.3395712

2.3 Among those that consume alcohol, males report fewer mental health concerns than females. Table 2.2.3

Alc_gend_MH <- q2_data %>% 
   filter(Drank_Alc == "Consumed Alcohol") %>% 
   select(sex, MHEpisode) %>% table(useNA = "no")
prop.test(Alc_gend_MH, correct = FALSE, alternative = "greater")
## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  Alc_gend_MH
## X-squared = 3027.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.1028892 1.0000000
## sample estimates:
##    prop 1    prop 2 
## 0.7407524 0.6347275

Q.3 What is the relationship between BMI and balanced food consumption, if any?

Data preparation for the analysis will consist of:

q3_balance_data <- br_data %>% select(Beans = beanday_, 
                                      Greens = grenday_, 
                                      OrangeV = orngday_,
                                      Fruits = frutda1_,
                                      BMI = X_bmi5,
                                      BMI_Levels = X_bmi5cat) %>% na.omit() %>%
   mutate(Balance_F = Beans + Greens + OrangeV + Fruits)
rm(br_data)
# Draw a reproducible random sample of 10,000 rows
set.seed(5678)
size <- nrow(q3_balance_data)
mini_q3_data <- q3_balance_data[sample(1:size, 10000),]

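Table 3.1.1 summarizes the data. The ranges of the food consumption and BMI variables are large (for example, each food variable has a maximum of 9900). Note that the BMI variable has two implied decimal places, so a value of 2675 corresponds to a BMI of 26.75.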

## Table 3.1.1
##      Beans             Greens           OrangeV            Fruits      
##  Min.   :   0.00   Min.   :   0.00   Min.   :   0.00   Min.   :   0.0  
##  1st Qu.:   7.00   1st Qu.:  14.00   1st Qu.:   7.00   1st Qu.:  33.0  
##  Median :  14.00   Median :  43.00   Median :  17.00   Median : 100.0  
##  Mean   :  27.86   Mean   :  55.08   Mean   :  29.52   Mean   : 101.5  
##  3rd Qu.:  33.00   3rd Qu.:  83.00   3rd Qu.:  43.00   3rd Qu.: 100.0  
##  Max.   :9900.00   Max.   :9900.00   Max.   :9900.00   Max.   :9900.0  
##       BMI               BMI_Levels       Balance_F    
##  Min.   :   1   Underweight  :  7283   Min.   :    0  
##  1st Qu.:2367   Normal weight:140260   1st Qu.:  111  
##  Median :2675   Overweight   :152163   Median :  179  
##  Mean   :2787   Obese        :124009   Mean   :  214  
##  3rd Qu.:3086                          3rd Qu.:  276  
##  Max.   :9769                          Max.   :20800
## Visualizing Balanced_F and BMI

Interpretation of plots 3.1 and 3.2:

Plot 3.1 visualizes the BMI distribution. The data is slightly skewed, as can be seen from the difference between the peak of the distribution and the mean BMI of ~ 28 (Overweight, but not Obese).

Plot 3.2 visualizes the $Balance_F variable across the BMI categories. The median of each group lies below the mean of the entire cohort, indicating not only skewness in the data but also that there may be variation within each category. A log10 scale has been used on the y axis to better visualize this plot.

Plot 3.3 (below) expands on plot 3.2 by showing the spread of the data. Points to note:

However, is the difference in Balance_F statistically significant across categories?

## Table 3.2.1
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Balance_F by BMI_Levels
## Kruskal-Wallis chi-squared = 4175.2, df = 3, p-value < 2.2e-16
## Table 3.2.2
## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  test_data$Balance_F and test_data$BMI_Levels 
## 
##               Underweight Normal weight Overweight
## Normal weight < 2e-16     -             -         
## Overweight    0.36        < 2e-16       -         
## Obese         1.3e-13     < 2e-16       < 2e-16   
## 
## P value adjustment method: bonferroni

The low p-value in Table 3.2.1 indicates that the BMI categories and the Balance_F variable are not independent, i.e. the distribution of Balance_F differs for at least one BMI category.

Table 3.2.2 reports Bonferroni-adjusted p-values for pairwise Wilcoxon rank sum tests of Balance_F between categories. The only pair without a statistically significant difference is Underweight versus Overweight (p = 0.36), i.e. the distributions of Balance_F in those two groups are similar. All other pairs of categories differ significantly.

This is an interesting aspect of the data and could indicate that those who are underweight and those who are overweight have similar food consumption to each other, but different from those who are obese or of normal weight.
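The two tests above follow a common pattern: an overall Kruskal-Wallis test, then pairwise Wilcoxon tests with a multiple-comparison correction. The sketch below runs both on simulated data (`toy` is hypothetical and its group means are invented), so the p-values it produces say nothing about BRFSS.

```r
set.seed(42)
levels_bmi <- c("Underweight", "Normal weight", "Overweight", "Obese")

# Simulated, right-skewed Balance_F values for four groups of 50
toy <- data.frame(
  BMI_Levels = factor(rep(levels_bmi, each = 50), levels = levels_bmi),
  Balance_F  = rlnorm(200, meanlog = rep(c(5.2, 5.4, 5.2, 5.0), each = 50), sdlog = 0.6)
)

# Overall test: does Balance_F differ in at least one category?
kw <- kruskal.test(Balance_F ~ BMI_Levels, data = toy)

# Pairwise follow-up with Bonferroni-adjusted p-values
pw <- pairwise.wilcox.test(toy$Balance_F, toy$BMI_Levels,
                           p.adjust.method = "bonferroni")
pw$p.value  # lower-triangular matrix, one cell per pair of categories
```

Rank-based tests are a reasonable choice here because Balance_F is strongly right-skewed, and the Bonferroni adjustment keeps the six pairwise comparisons from inflating the false-positive rate.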

Conclusion:

Additional exploration is required to ascertain the cause of these differences. However, it must be noted that Balance_F explains very little of the variation in BMI: the correlation in Table 3.3.1 is about -0.08, which corresponds to well under 1% of the variance.

## Table 3.3.1
## 
##  Pearson's product-moment correlation
## 
## data:  q3_balance_data$BMI and q3_balance_data$Balance_F
## t = -50.102, df = 423710, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07973452 -0.07374797
## sample estimates:
##         cor 
## -0.07674194
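The strength of this relationship is easier to judge as a share of variance explained. The short calculation below uses only the correlation and degrees of freedom printed in Table 3.3.1; the variable names are hypothetical.

```r
r  <- -0.07674194   # sample correlation from Table 3.3.1
df <- 423710        # degrees of freedom from Table 3.3.1

# Share of the variance in BMI associated with Balance_F
r_squared <- r^2

# t statistic implied by r and df, as a consistency check against the table
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
```

With more than 400,000 observations even a correlation this weak is highly significant, so the p-value reflects sample size rather than practical importance.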

A note on the cause of variability in Balance_F:

The creation of Balance_F was an artifact based on the assumption that different types of foods and their balanced intake could have a relationship with BMI. However, where does the variation in this data exist? PCA is a good method to explore the underlying variation in the data. Table 3.4.1 and Table 3.4.2 show one way to look at the underlying components of Balance_F, further exploration of which could lead to a better understanding of individual food components and their relationship to BMI.

## Table 3.4.1
## Importance of components:
##                           PC1    PC2    PC3    PC4
## Standard deviation     1.3029 0.9475 0.8658 0.8094
## Proportion of Variance 0.4244 0.2244 0.1874 0.1638
## Cumulative Proportion  0.4244 0.6488 0.8362 1.0000
## Table 3.4.2
##               PC1         PC2         PC3         PC4
## Beans   0.3706564  0.90083349 -0.22512013  0.02082708
## Greens  0.5647861 -0.22399148 -0.03979517 -0.79325960
## OrangeV 0.5251537 -0.02954757  0.77817163  0.34320472
## Fruits  0.5175365 -0.37074760 -0.58496683  0.50250966

Principal components (PC) 2 to 4 together explain ~58% of the variation in the data. PC1 explains ~42% of the variation, and Table 3.4.2 provides the loadings (weights) of the underlying food groups on each component. Further study into these techniques (beyond the scope of this overview) could provide interesting ways to understand food proportions.
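The "Proportion of Variance" row of Table 3.4.1 follows directly from the standard deviations in the same table: each component's variance divided by the total variance. A minimal check, using only the printed values:

```r
# Standard deviations of the principal components from Table 3.4.1
sdev <- c(PC1 = 1.3029, PC2 = 0.9475, PC3 = 0.8658, PC4 = 0.8094)

# Variance of each component as a share of the total variance
prop_var <- sdev^2 / sum(sdev^2)
round(prop_var, 4)

# Combined share of PC2 to PC4
rest_share <- sum(prop_var[c("PC2", "PC3", "PC4")])
```

The component variances sum to about 4, the number of food variables, which suggests (but does not confirm) that the variables were standardized before the PCA, for example with `prcomp(..., scale. = TRUE)`.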


GitHub link to original source code