This is a limited exploratory data analysis (EDA) of the BRFSS data set, prepared as an assignment for the Introduction to Probability and Data course offered by Duke University on Coursera (link). The assignment asks participants to pose three research questions based on their interests, the underlying data, and the learning objectives of the module. Since the data-set is large (330 variables), the EDA focuses on the variables assumed relevant to the questions posed by the author.
R version 3.4.4 is used, and the markdown file is created with RStudio's desktop client. For the complete code used to generate this document, please visit the GitHub link. For readability, the EDA focuses on the summary statistics and the graphics generated from them, and skips some of the code output not required for grading the assignment.
library(tidyverse)
library(gridExtra)
library(purrr)
load("brfss2013.RData")
br_data <- brfss2013; rm("brfss2013")
Per the CDC's brief on the BRFSS 2013 data-set (link), the Behavioral Risk Factor Surveillance System (BRFSS) project, and the resulting data-set, is a collaboration between the Centers for Disease Control and Prevention (CDC) and US states and participating territories. Data on health-related risk behaviors, chronic health conditions, and use of preventive services are collected in order to build better health promotion tools for US residents. The following aspects of the data-set are relevant to this project:
A large proportion of the values are NA (about 41.87%). In fact, there are 0 complete cases. While this may present a problem depending on the variables being analysed, not every question applies to every respondent, so missing data are to be expected. The missing data may require a mitigation strategy; any such strategy is applied on a case-by-case basis below.
The data-set contains 491775 rows (observations) and 330 columns (variables). A detailed description of the variables can be found in the BRFSS 2013 code-book here.
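The figures quoted above (share of missing values, number of complete cases, dimensions) can be computed along the following lines. This is a minimal sketch in base R; `toy` is a small made-up data frame standing in for `br_data`, so the printed numbers are illustrative only.

```r
# Hypothetical stand-in for br_data: 4 rows, 3 columns, some values missing
toy <- data.frame(
  poorhlth = c(0L, NA, 5L, 30L),
  hlthpln1 = factor(c("Yes", "No", NA, "Yes")),
  sex      = factor(c("Male", "Female", "Female", NA))
)

dims           <- dim(toy)                  # rows (observations), columns (variables)
prop_missing   <- mean(is.na(toy))          # share of all cells that are NA
complete_cases <- sum(complete.cases(toy))  # rows without any missing value

cat("Rows:", dims[1], "Columns:", dims[2], "\n")
cat("Percent missing:", round(100 * prop_missing, 2), "\n")
cat("Complete cases:", complete_cases, "\n")
```

Replacing `toy` with `br_data` should give the corresponding figures for the full data-set.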
The research questions are motivated by currently topical subjects and by the author's personal curiosity, given his background in healthcare and his interest in nutrition.
What is the relationship, if any, between poor mental or physical health and health coverage? Do adults over 65 (covered by Medicare) influence the relationship?
To answer this question the following variables will be used:
$poorhlth is an integer variable (discrete) representing the number of days, in the last 30 days, on which participants reported that their health (physical and mental) prevented them from doing their usual activities, such as self-care, work, or recreation.
$hlthpln1 is a factor variable with the levels Yes and No, i.e. does the respondent have any type of health care coverage?
$X_age65yr is a factor variable with the levels Age 18 to 64 and Age 65 or older.
What is the relationship, if any, between mental health and alcohol consumption? Are there any gender differences?
To answer this question the following variables will be used:
$menthlth is an integer variable (discrete) representing the number of days, within the 30 days prior to the interview, on which survey participants reported that their mental health (including stress, depression, and problems with emotions) was not good.
$avedrnk2 is an integer variable (discrete) representing the average number of alcoholic beverages consumed by the respondent on the days they drank during the 30 days prior to the interview.
$alcday5 is an integer variable (discrete) representing the number of days per week or per month on which the survey respondent consumed an alcoholic beverage in the 30 days prior to the interview.
What is the relationship between BMI and Balanced Food consumption, if any? Balanced Food consumption will be measured by aggregating consumption of Fruits, Vegetables and Beans.
To answer this question the following variables will be used:
$X_bmi5 is an integer variable (continuous) with values representing the BMI of participants, calculated from their reported height and weight.
Q.1 What is the relationship, if any, between poor mental or physical health and health coverage? Do adults over 65 (covered by Medicare) influence the relationship?
The following is a summary of the three chosen variables, along with a density plot of the number of self-reported poor health days. Since $poorhlth has to be a number between 0 and 30, values over 30 and NA values have been discarded.
## Table 1.1.1
## X_age65yr PoorHealth Coverage
## Age 18 to 64 :172245 Min. : 0.000 Yes:215291
## Age 65 or older: 73711 1st Qu.: 0.000 No : 30665
## Median : 0.000
## Mean : 5.276
## 3rd Qu.: 5.000
## Max. :30.000
Some Observations:
For those who did report feeling unwell in the last 30 days, the mean and median are below.
## Table 1.1.2
## # A tibble: 2 x 3
## Coverage mean median
## <fct> <dbl> <int>
## 1 Yes 12.2 7
## 2 No 12.3 7
The data show that those who reported feeling unwell averaged about 12 days of total illness in the 30 days prior to their interview call. The median total duration of illness was 7 days. There does not seem to be any difference in mean or median based on the presence or absence of healthcare coverage.
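The filtering and grouping steps behind Tables 1.1.1 and 1.1.2 can be sketched as follows. This is a minimal base R sketch under stated assumptions: `toy` is a made-up data frame, and the column names `PoorHealth` and `Coverage` simply mirror the table headers.

```r
# Hypothetical stand-in for the selected br_data columns (made-up values)
toy <- data.frame(
  PoorHealth = c(0, 3, 7, 45, NA, 30, 0, 14),
  Coverage   = factor(c("Yes", "Yes", "No", "Yes", "No", "No", "Yes", "Yes"),
                      levels = c("Yes", "No"))
)

# Keep valid day counts only: drop NA and anything outside 0-30
valid <- subset(toy, !is.na(PoorHealth) & PoorHealth >= 0 & PoorHealth <= 30)

# Restrict to respondents reporting at least one poor-health day
unwell <- subset(valid, PoorHealth > 0)

# Mean and median poor-health days by coverage status
by_cov <- aggregate(PoorHealth ~ Coverage, data = unwell,
                    FUN = function(d) c(mean = mean(d), median = median(d)))
print(by_cov)
```

Restricting to `PoorHealth > 0` matters here: with a median of 0 in the full sample, the zeros would otherwise dominate both summaries.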
The proportion table for the data is below.
## Table 1.1.3
## Coverage
## PoorHealth Yes No
## Unwell 0.37414416 0.05719316
## Well 0.50117907 0.06748361
## Table 1.1.4
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: testTable1
## X-squared = 107.05, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.01658528 -0.01126320
## sample estimates:
## prop 1 prop 2
## 0.8674050 0.8813293
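As a rough check, the test in Table 1.1.4 can be re-run from counts. The sketch below assumes counts back-calculated from the proportions in Table 1.1.3 and the 245,956 respondents in Table 1.1.1; `test_table` is a stand-in name for the `testTable1` object referenced in the output, whose construction is not shown.

```r
# Counts back-calculated from Table 1.1.3 x 245,956 respondents
# (an assumption: rounded to the nearest whole respondent)
test_table <- matrix(
  c( 92023, 14067,   # Unwell: Coverage Yes, No
    123268, 16598),  # Well:   Coverage Yes, No
  nrow = 2, byrow = TRUE,
  dimnames = list(PoorHealth = c("Unwell", "Well"),
                  Coverage   = c("Yes", "No"))
)

# For a 2 x 2 table, prop.test() treats rows as groups and the first
# column as "successes", so it compares the share covered in each row
res <- prop.test(test_table)  # continuity correction is on by default

print(round(prop.table(test_table), 4))  # joint proportions, as in Table 1.1.3
print(res$estimate)                      # share with coverage: Unwell vs Well
print(res$conf.int)                      # 95% CI for the difference
```

Read this way, prop 1 and prop 2 are the shares of respondents with coverage among the Unwell and Well groups respectively.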
The difference in proportions could be due to:
Individuals in the US over the age of 65 are automatically covered by Medicare and hence separating the data by those above 65 years of age may be useful in understanding the relationship between episodes of ill health in the previous 30 day period and some form of healthcare coverage.
Are Health Coverage and Poor Health independent? Does Medicare (Age > 65 Yrs.) influence the relationship?
Interpretation of Mosaic Plots:
Boxes are shaded according to the disproportionate influence of any residual. “Cells representing negative residuals are drawn in shaded of red and with broken borders; positive ones are drawn in blue with solid borders.” R Documentation for mosaicplot()
The size of each box corresponds to the proportion of the data it represents.
The influence of Medicare coverage on the data is clear if one compares Plot B to Plot C (the boxes representing 'No').
When considered in isolation, the co-variation between Coverage and ill-health episodes seems to be weaker in adults over the age of 65 and stronger in adults below 65 years.
Is the difference in proportions significantly greater?
## Table 1.1.5
## Independence_pValue Greater_pValue
## All adults 0.0000 1.00
## Adults Over 65 0.0112 0.99
## Adults 18 to 64 0.0000 1.00
Interpretation of tests for independence and equality of proportions:
Table 1.1.5 suggests that the variables Coverage and Poor Health are not independent of each other (regardless of the influence of Medicare) because the respective p-values are below 0.05.
However, the one-sided tests (Greater_pValue) give p-values close to 1, so the proportions are not significantly greater in one group than in the other at the 5% significance level, including when the Medicare population is separated out.
Conclusion: The variables Coverage and Poor Health (an episode of ill health in the previous 30 days) are not independent of each other. However, the one-sided tests do not show the proportion in one group to be significantly greater than in the other. Hence, other influencing factors, not identified in the above analysis, may be at play.
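The code behind Table 1.1.5 is not shown, so the following is only one plausible reading of its two columns: a chi-squared test of independence and a one-sided `prop.test()`. The counts for all adults are an assumption, back-calculated from Table 1.1.3 and the 245,956 respondents in Table 1.1.1.

```r
# All adults; counts back-calculated from Table 1.1.3 (an assumption)
all_adults <- matrix(
  c( 92023, 14067,   # Unwell: Coverage Yes, No
    123268, 16598),  # Well:   Coverage Yes, No
  nrow = 2, byrow = TRUE,
  dimnames = list(PoorHealth = c("Unwell", "Well"),
                  Coverage   = c("Yes", "No"))
)

# Two-sided question: are Coverage and PoorHealth independent?
p_independence <- chisq.test(all_adults)$p.value

# One-sided question: is the covered share greater among the Unwell
# (row 1) than among the Well (row 2)?
p_greater <- prop.test(all_adults, alternative = "greater")$p.value

print(round(c(Independence_pValue = p_independence,
              Greater_pValue      = p_greater), 4))
```

A one-sided p-value near 1 does not mean the groups are alike. It means the observed difference points in the opposite direction to the alternative being tested.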
Q.2 What is the relationship, if any, between mental health and alcohol consumption?
Are there any gender differences?
A brief computation of the selected variables follows, along with a summary of the data.
#create smaller data with chosen variables
q2b <- br_data %>% select(sex,
Mental_Health = menthlth,
Ave_Drinks = avedrnk2,
Drink_Days = alcday5) %>% # create categorical variables $MHEpisode, $Drank_Alc
mutate(MHEpisode = ifelse(Mental_Health > 0, "Yes_M", "No_M"),
Drank_Alc = ifelse(Drink_Days > 0, "Consumed Alcohol", "Did Not Consume Alc"))
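# Function to convert $alcday5 codes into drinking days per month:
# codes starting with 1 are days per week (scaled by 4 weeks),
# codes starting with 2 are days in the past 30 days
convert.DrinksAlc <- function(x) {
  x_chr <- as.character(x)
  if (grepl("^1", x_chr)) {
    y <- (as.numeric(x) - 100) * 4
  } else if (grepl("^2", x_chr)) {
    y <- as.numeric(x) - 200
  } else {y <- x}
  y
}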
# apply convert.DrinksAlc to $Drink_Days
q2b$Drink_Days <- map_dbl(q2b$Drink_Days, convert.DrinksAlc)
# filter days to remove negative values computed by miscoded answers.
q2b <- q2b %>% filter(Drink_Days >= 0)
# create $Total_Drinks and select variables for plots
q2_data <- q2b %>% mutate(Total_Drinks = Drink_Days * Ave_Drinks) %>%
select(sex, Mental_Health, Total_Drinks, MHEpisode, Drank_Alc)
rm(q2b)
## Table 2.1.1 - Summary of the selected variables
## sex Mental_Health Total_Drinks MHEpisode
## Male :192694 Min. : 0.00 Min. : 1.00 Length:472127
## Female:279433 1st Qu.: 0.00 1st Qu.: 4.00 Class :character
## Median : 0.00 Median : 10.00 Mode :character
## Mean : 3.38 Mean : 22.09
## 3rd Qu.: 2.00 3rd Qu.: 28.00
## Max. :30.00 Max. :2280.00
## NA's :7911 NA's :240948
## Drank_Alc
## Length:472127
## Class :character
## Mode :character
##
##
##
##
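To make the $alcday5 recoding concrete, here is a small vectorized sketch of the same rule applied to a handful of example codes. The function name `convert_drink_days` and the input vectors are hypothetical; the rule itself (leading 1 means days per week scaled by 4, leading 2 means days in the past 30) is taken from the conversion function above.

```r
# Vectorized variant of the conversion rule (hypothetical helper name)
convert_drink_days <- function(x) {
  x_chr <- as.character(x)
  ifelse(startsWith(x_chr, "1"), (x - 100) * 4,   # days per week -> per month
         ifelse(startsWith(x_chr, "2"), x - 200,  # days in the past 30 days
                x))                               # 0 (no drinks) and NA pass through
}

raw_days   <- c(0, 103, 107, 215, 230, NA)  # example alcday5 codes
ave_drinks <- c(NA, 2, 1, 3, 2, NA)         # example avedrnk2 values

drink_days   <- convert_drink_days(raw_days)
total_drinks <- drink_days * ave_drinks     # NA wherever either input is NA

print(data.frame(raw_days, drink_days, ave_drinks, total_drinks))
```

Note that `0 * NA` is `NA` in R, so a respondent with no drinking days and no recorded average gets a missing Total_Drinks rather than zero, which may account for part of the large NA count in Table 2.1.1.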
Chosen graphical representation of the data:
Interpretation of plots.
Select questions to quantify and confirm after EDA are:
## Table 2.1.2 - Proportion of Individuals who reported mental health concerns and reported consuming alcohol.
## MHEpisode
## Drank_Alc No_M Yes_M
## Consumed Alcohol 0.3436590 0.1577585
## Did Not Consume Alc 0.3465736 0.1520090
## prop.test()
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: Alc_ME_table
## X-squared = 51.521, df = 1, p-value = 7.083e-13
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.012402840 -0.007082608
## sample estimates:
## prop 1 prop 2
## 0.6853750 0.6951177
Interpretation of Table 2.1.2:
Among those who consumed alcohol, 68.5% reported no mental health concerns, compared with 69.5% of those who did not consume alcohol, so the difference in proportions is minor. The 95% CI for this difference is -1.24% to -0.71%.
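The sample estimates reported by prop.test() are row-wise conditional proportions, which can be recovered from the joint proportions in Table 2.1.2. A minimal sketch, with the values copied from that table:

```r
# Joint proportions copied from Table 2.1.2
joint <- matrix(
  c(0.3436590, 0.1577585,
    0.3465736, 0.1520090),
  nrow = 2, byrow = TRUE,
  dimnames = list(Drank_Alc = c("Consumed Alcohol", "Did Not Consume Alc"),
                  MHEpisode = c("No_M", "Yes_M"))
)

# Condition on the row: share of No_M / Yes_M within each drinking group
row_props <- prop.table(joint, margin = 1)
print(round(row_props, 4))

# Difference in the No_M share between the two groups (prop 1 - prop 2)
diff_no_m <- row_props[1, "No_M"] - row_props[2, "No_M"]
print(round(diff_no_m, 4))
```

The first column of `row_props` reproduces prop 1 and prop 2, and their difference falls inside the reported confidence interval.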
## Table 2.1.3 - Test for independence of proportions: Individuals who reported mental health concerns & reported consuming alcohol by gender.
## Number of cases in table: 464216
## Number of factors: 3
## Test for independence of all factors:
## Chisq = 13548, df = 4, p-value = 0
Interpretation of Table 2.1.3:
The three selected variables ($sex, $MHEpisode, $Drank_Alc) are not independent.
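The layout of Table 2.1.3 matches what `summary()` prints for a multi-way contingency table, so the test may have been produced along these lines. The sketch uses simulated, mutually independent toy data (not the BRFSS data), so its p-value says nothing about the real relationship.

```r
set.seed(1)
n <- 500

# Simulated stand-in data: three independent binary factors
toy <- data.frame(
  sex       = sample(c("Male", "Female"), n, replace = TRUE),
  MHEpisode = sample(c("No_M", "Yes_M"), n, replace = TRUE),
  Drank_Alc = sample(c("Consumed Alcohol", "Did Not Consume Alc"),
                     n, replace = TRUE)
)

three_way <- table(toy)       # 2 x 2 x 2 contingency table
res       <- summary(three_way)  # chi-squared test of mutual independence
print(res)
```

For three binary factors the test has 8 - 1 - 3 = 4 degrees of freedom, which agrees with the df reported in Table 2.1.3.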
## Visualizing difference in proportions
Interpretation of the Mosaic plot:
Note - Red boxes are proportions that are under-represented in the group, i.e. lower than expected if the variables were independent. Conversely, blue boxes are proportions that are over-represented in the group, i.e. higher than expected if the variables were independent.
Conclusions:
2.1 There is a difference in alcohol consumption between Males and Females. Table 2.2.1
gender_Alc_table <- q2_data %>% select(Drank_Alc, sex) %>% table(useNA = "no")
prop.test(gender_Alc_table, correct = FALSE)
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: gender_Alc_table
## X-squared = 9420.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.1360765 0.1416289
## sample estimates:
## prop 1 prop 2
## 0.4777581 0.3389054
2.2 There is a difference in mental health concerns between Males and Females. Table 2.2.2
gender_MH_table <- q2_data %>% select(MHEpisode, sex) %>% table(useNA = "no")
prop.test(gender_MH_table, correct = FALSE, alternative = "greater")
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: gender_MH_table
## X-squared = 4131.9, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.09779562 1.00000000
## sample estimates:
## prop 1 prop 2
## 0.4398768 0.3395712
2.3 Among those who consume alcohol, males report fewer mental health concerns than females. Table 2.2.3
Alc_gend_MH <- q2_data %>%
filter(Drank_Alc == "Consumed Alcohol") %>%
select(sex, MHEpisode) %>% table(useNA = "no")
prop.test(Alc_gend_MH, correct = FALSE, alternative = "greater")
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: Alc_gend_MH
## X-squared = 3027.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.1028892 1.0000000
## sample estimates:
## prop 1 prop 2
## 0.7407524 0.6347275
Q.3 What is the relationship between BMI and balanced food consumption, if any?
Data preparation for the analysis will consist of:
q3_balance_data <- br_data %>% select(Beans = beanday_,
Greens = grenday_,
OrangeV = orngday_,
Fruits = frutda1_,
BMI = X_bmi5,
BMI_Levels = X_bmi5cat) %>% na.omit() %>%
mutate(Balance_F = Beans + Greens + OrangeV + Fruits)
rm(br_data)
set.seed(5678)
size <- nrow(q3_balance_data)
mini_q3_data <- q3_balance_data[sample(1:size, 10000),]
Table 3.1.1 summarizes the data. The range of the food consumption and BMI variables seems large. Note that the BMI variable has two implied decimal places.
## Table 3.1.1
## Beans Greens OrangeV Fruits
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 7.00 1st Qu.: 14.00 1st Qu.: 7.00 1st Qu.: 33.0
## Median : 14.00 Median : 43.00 Median : 17.00 Median : 100.0
## Mean : 27.86 Mean : 55.08 Mean : 29.52 Mean : 101.5
## 3rd Qu.: 33.00 3rd Qu.: 83.00 3rd Qu.: 43.00 3rd Qu.: 100.0
## Max. :9900.00 Max. :9900.00 Max. :9900.00 Max. :9900.0
## BMI BMI_Levels Balance_F
## Min. : 1 Underweight : 7283 Min. : 0
## 1st Qu.:2367 Normal weight:140260 1st Qu.: 111
## Median :2675 Overweight :152163 Median : 179
## Mean :2787 Obese :124009 Mean : 214
## 3rd Qu.:3086 3rd Qu.: 276
## Max. :9769 Max. :20800
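The implied decimal places mean that the stored BMI values need to be divided by 100 before they are read as BMI. The sketch below applies this to the quartiles and mean from Table 3.1.1 (plus one made-up value, 1750) and assigns categories using the standard CDC adult cut-points of 18.5, 25 and 30; that these match the coding of `X_bmi5cat` is an assumption.

```r
# Stored values: quartiles and mean from Table 3.1.1, plus one made-up value
bmi_raw <- c(1750, 2367, 2675, 2787, 3086)

# Two implied decimal places: 2787 is a BMI of 27.87
bmi <- bmi_raw / 100

# Standard adult BMI categories (assumed to match X_bmi5cat)
bmi_levels <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  right  = FALSE,  # intervals are [lower, upper)
  labels = c("Underweight", "Normal weight", "Overweight", "Obese")
)

print(data.frame(bmi_raw, bmi, bmi_levels))
```

On this scale the extremes in Table 3.1.1 (a minimum of 0.01 and a maximum of 97.69) look implausible and may deserve filtering before modelling.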
## Visualizing Balance_F and BMI
Interpretation of plots 3.1 and 3.2:
Plot 3.1 visualizes the BMI distribution. The data are slightly skewed, as can be seen from the difference between the peak of the distribution and the mean BMI of ~28 (Overweight, but not Obese).
Plot 3.2 visualizes the $Balance_F variable across the BMI categories. The median of each group lies below the mean of the entire cohort, indicating not only skewness in the data but also that there may be variation among the categories. A log10 scale has been used on the y axis to better visualize this plot.
Plot 3.3 (below) expands on plot 3.2 by showing the spread of the data. Points to note:
The center of mass appears in the BMI zone of 23 - 30 i.e. normal weight to overweight.
There are fewer points in the underweight category, and the obese category spans a wider range of BMI values.
There is more variability at both ends of the BMI spectrum (to be expected).
Overall there seems to be a negative correlation between BMI and Balance_F, i.e. the higher the Balance_F, the lower the BMI.
However, is the difference in Balance_F statistically significant across categories?
## Table 3.2.1
##
## Kruskal-Wallis rank sum test
##
## data: Balance_F by BMI_Levels
## Kruskal-Wallis chi-squared = 4175.2, df = 3, p-value < 2.2e-16
## Table 3.2.2
##
## Pairwise comparisons using Wilcoxon rank sum test
##
## data: test_data$Balance_F and test_data$BMI_Levels
##
## Underweight Normal weight Overweight
## Normal weight < 2e-16 - -
## Overweight 0.36 < 2e-16 -
## Obese 1.3e-13 < 2e-16 < 2e-16
##
## P value adjustment method: bonferroni
The low p-value in Table 3.2.1 indicates that the distribution of Balance_F differs across the BMI categories, i.e. the two variables are not independent.
Table 3.2.2 reports pairwise comparisons of Balance_F between the BMI categories. The difference between the Underweight and Overweight categories is not statistically significant (p = 0.36), i.e. Balance_F is distributed similarly in those two groups. However, Balance_F differs significantly between all other pairs of categories.
This is an interesting aspect of the data and could indicate that those who are underweight and those who are overweight have similar food consumption to each other, but different from those who are obese or of normal weight.
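The two tests in Tables 3.2.1 and 3.2.2 can be run with `kruskal.test()` and `pairwise.wilcox.test()`. The sketch below uses simulated toy data in place of the `test_data` object named in the output, so the p-values it prints are illustrative only.

```r
set.seed(42)
levels_bmi <- c("Underweight", "Normal weight", "Overweight", "Obese")

# Simulated, right-skewed stand-in for Balance_F in each BMI category
toy <- data.frame(
  BMI_Levels = factor(rep(levels_bmi, each = 50), levels = levels_bmi),
  Balance_F  = c(rexp(50, 1 / 200), rexp(50, 1 / 250),
                 rexp(50, 1 / 200), rexp(50, 1 / 150))
)

# Omnibus test: does Balance_F differ across any of the categories?
kw <- kruskal.test(Balance_F ~ BMI_Levels, data = toy)

# Follow-up: which pairs differ? Bonferroni guards against multiple testing
pw <- pairwise.wilcox.test(toy$Balance_F, toy$BMI_Levels,
                           p.adjust.method = "bonferroni")

print(kw)
print(pw)
```

Rank-based tests are a reasonable choice here because Balance_F is heavily right-skewed, so they do not rely on a normality assumption.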
Conclusion:
Additional exploration is required to ascertain the cause of these differences. However, it must be noted that the variation in Balance_F explains very little of the variation in BMI: the correlation of about -0.077 corresponds to an r-squared of roughly 0.6% (see Table 3.3.1).
## Table 3.3.1
##
## Pearson's product-moment correlation
##
## data: q3_balance_data$BMI and q3_balance_data$Balance_F
## t = -50.102, df = 423710, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07973452 -0.07374797
## sample estimates:
## cor
## -0.07674194
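With a sample this large, a very small p-value can sit alongside a negligible effect size. The short calculation below uses only the correlation and degrees of freedom reported in Table 3.3.1 to show the share of variance explained and to recover the t statistic.

```r
# Values copied from Table 3.3.1
r  <- -0.07674194
df <- 423710

# Share of variance in one variable linearly associated with the other
r_squared <- r^2

# t statistic for H0: correlation = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

cat("r squared:", round(r_squared, 4), "\n")
cat("t statistic:", round(t_stat, 2), "\n")
```

The t statistic grows with the square root of the sample size, which is why a correlation this weak is still flagged as highly significant.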
A note on the cause of variability in Balance_F:
Balance_F is an artificial construct based on the assumption that a balanced intake of different types of food could have a relationship with BMI. However, where does the variation in this variable come from? PCA is a good method for exploring the underlying variation in the data. Table 3.4.1 and Table 3.4.2 show one way to look at the underlying components of Balance_F; further exploration of these could lead to a better understanding of individual food components and their relationship to BMI.
## Table 3.4.1
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.3029 0.9475 0.8658 0.8094
## Proportion of Variance 0.4244 0.2244 0.1874 0.1638
## Cumulative Proportion 0.4244 0.6488 0.8362 1.0000
## Table 3.4.2
## PC1 PC2 PC3 PC4
## Beans 0.3706564 0.90083349 -0.22512013 0.02082708
## Greens 0.5647861 -0.22399148 -0.03979517 -0.79325960
## OrangeV 0.5251537 -0.02954757 0.77817163 0.34320472
## Fruits 0.5175365 -0.37074760 -0.58496683 0.50250966
Principal components (PC) 2 to 4 explain ~58% of the variation in the data. PC1 explains ~42% of the variation, and Table 3.4.2 provides the respective loadings (weights) of the underlying food groups on each component. Further study of these techniques (beyond the scope of this overview) could provide interesting ways to understand food proportions.
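The proportions of variance in Table 3.4.1 follow directly from the standard deviations in the same table, as the sketch below shows. The inference that the four variables were standardized before the PCA (because the variances sum to roughly 4) is an assumption, since the PCA call itself is not shown.

```r
# Standard deviations copied from Table 3.4.1
sdev <- c(PC1 = 1.3029, PC2 = 0.9475, PC3 = 0.8658, PC4 = 0.8094)

variance <- sdev^2                    # variance captured by each component
prop_var <- variance / sum(variance)  # proportion of total variance
cum_prop <- cumsum(prop_var)          # cumulative proportion

print(round(rbind(prop_var, cum_prop), 4))
cat("Total variance:", round(sum(variance), 2), "\n")

# PC1 loadings copied from Table 3.4.2; each loading vector has unit length
pc1 <- c(Beans = 0.3706564, Greens = 0.5647861,
         OrangeV = 0.5251537, Fruits = 0.5175365)
cat("Sum of squared PC1 loadings:", round(sum(pc1^2), 4), "\n")
```

All four PC1 loadings are positive and of similar size, so PC1 behaves roughly like an overall intake score, not unlike the Balance_F sum itself.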
GitHub link to the original source code: link