This project report is submitted to fulfil the final project requirements for week 5 of the Introduction to Probability and Data course offered by Duke University on Coursera.
The background context regarding the assignment can be found at: https://www.coursera.org/learn/probability-intro/supplement/1E7zQ/project-information.
Behavioral Risk Factor Surveillance System
According to CDC, the “Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world”.
Further details about the BRFSS can be obtained from this link: https://www.cdc.gov/brfss/annual_data/2013/pdf/Overview_2013.pdf
The BRFSS is administered by CDC’s Population Health Surveillance Branch, under the Division of Population Health at the National Center for Chronic Disease Prevention and Health Promotion. BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days — health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use.
Data collection
Since 2011, BRFSS conducts both landline telephone- and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing. Health characteristics estimated from the BRFSS pertain to the non-institutionalized adult population, aged 18 years or older, who reside in the US. In 2013, additional question sets were included as optional modules to provide a measure for several childhood health and wellness indicators, including asthma prevalence for people aged 17 years or younger. BRFSS field operations are managed by state health departments that follow protocols adopted by the states with technical assistance provided by CDC. State health departments collaborate during survey development, and conduct the interviews themselves or by using contractors. The data are transmitted to the CDC for editing, processing, weighting, and analysis. An edited and weighted data file is provided to each participating health department for each year of data collection, and summary reports of state-specific data are prepared by the CDC.
The data and further information were obtained from the following sources:
References:
* 2013 Survey Data Information
* 2013 BRFSS Overview [PDF - 84 KB]: Provides information on the background, design, data collection and processing, statistical, and analytical issues for the combined landline and cell phone data set.
* BRFSS Questionnaire (Mandatory and Optional Modules) [PDF - 365 KB]
* 2013 BRFSS Codebook [PDF - 2.7 MB]: Codebook for the file showing variable name, location, and frequency of values for all reporting areas combined for the combined landline and cell phone data set.
* Calculated Variables in Data Files [PDF - 421 KB]
* Comparability of Data [PDF - 96 KB]: Comparability of data across reporting areas for the combined landline and cell phone data set. The BRFSS 2012 data is not directly comparable to years of BRFSS data before 2011 because of the changes in weighting methodology and the addition of the cell phone sampling frame.
* 2013 Weighting Formula [PDF - 98 KB]
* Summary Matrix of Calculated Variables (CV) in the 2013 Data File
Generalizability / Causality
BRFSS conducts both landline telephone- and cellular telephone-based surveys. In the landline survey, interviewers collect data from a randomly selected adult in a household, drawn from the non-institutionalized adult population (18 years of age and older) residing in the US. Overall, an estimated 97.5% of US households had telephone service in 2012, with coverage ranging from 95.3% in New Mexico to 98.6% in Connecticut. In 2013, BRFSS respondents who received 90 percent or more of their calls on cellular telephones were eligible for participation in the cellular telephone survey. In 2013, 50 states, the District of Columbia, Guam, and Puerto Rico collected samples of interviews conducted both by landline telephone and cellular telephone. The survey uses stratified sampling: the population is stratified by state, and random sampling via telephone is then employed within each stratum. Since all participating states and territories are sampled and telephone coverage is high (an estimated 97.5% of US households), the results can reasonably be generalized to the non-institutionalized adult US population, with the caveat that households without telephone service and people who decline to respond are not represented.
However, the survey only observes respondents and involves no random assignment to any treatment. It is therefore an observational study, and observed correlations cannot be interpreted as causal, particularly since studies of this type do not account for confounding factors that may contribute to the observed correlations.
Loading the primary data set
We shall begin by examining the data set closely. Further details can be found in the BRFSS Codebook. Let’s start by looking at the size of the dataset.
[1] 491775 330
The dataset is large and contains 491,775 rows and 330 columns. We take a closer look at the 330 variables in the columns and obtain summary statistics as below. Since the output of these commands can be quite large, it will be hidden from this report.
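The commands behind this step are not shown in the report. A minimal sketch of what they might look like follows, assuming the course data file is called `brfss2013.RData` and creates a data frame named `brfss2013` when loaded (the file name is an assumption). A tiny stand-in data frame is used when the file is absent so that the sketch still runs.

```r
# Assumption: the data file is named "brfss2013.RData" and, when loaded,
# creates a data frame called `brfss2013`.
data_file <- "brfss2013.RData"

if (file.exists(data_file)) {
  load(data_file)
} else {
  # Tiny stand-in (NOT real BRFSS data) so the sketch runs without the file.
  brfss2013 <- data.frame(
    genhlth  = factor(c("Good", "Fair", "Excellent")),
    sleptim1 = c(7L, 6L, NA)
  )
}

dims <- dim(brfss2013)      # c(number of rows, number of columns)
print(dims)
print(names(brfss2013))     # variable names
print(summary(brfss2013))   # per-variable summary statistics
```

With the full dataset the last two calls produce very long output, which is presumably why the report hides it.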
Given the large size of the dataset, a reduced subset is created in order to address the specific research questions of interest. We can then create a summary of this reduced dataset to explore the descriptive statistics for each variable.
# We create a subset of data which selects parameters relevant to the questions of interest
data_subset <- subset(brfss2013, select = c("X_state", "genhlth", "X_bmi5", "X_bmi5cat", "sleptim1", "diabete3", "educa", "income2", "hlthpln1", "exerany2"))
str(data_subset)
'data.frame': 491775 obs. of 10 variables:
$ X_state : Factor w/ 55 levels "0","Alabama",..: 2 2 2 2 2 2 2 2 2 2 ...
$ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
$ X_bmi5 : int 3916 1822 2746 2197 3594 3986 2070 NA 3017 2829 ...
$ X_bmi5cat: Factor w/ 4 levels "Underweight",..: 4 1 3 2 4 4 2 NA 4 3 ...
$ sleptim1 : int NA 6 9 8 6 8 7 6 8 8 ...
$ diabete3 : Factor w/ 4 levels "Yes","Yes, but female told only during pregnancy",..: 3 3 3 3 3 3 3 3 3 3 ...
$ educa : Factor w/ 6 levels "Never attended school or only kindergarten",..: 6 5 6 4 6 6 4 5 6 4 ...
$ income2 : Factor w/ 8 levels "Less than $10,000",..: 7 8 8 7 6 8 NA 6 8 4 ...
$ hlthpln1 : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
$ exerany2 : Factor w/ 2 levels "Yes","No": 2 1 2 1 2 1 1 1 1 1 ...
X_state genhlth X_bmi5
Florida : 33668 Excellent: 85482 Min. : 1
Kansas : 23282 Very good:159076 1st Qu.:2367
Nebraska : 17139 Good :150555 Median :2663
Massachusetts: 15071 Fair : 66726 Mean :2782
Minnesota : 14340 Poor : 27951 3rd Qu.:3081
New Jersey : 13776 NA's : 1985 Max. :9769
(Other) :374499 NA's :26727
X_bmi5cat sleptim1
Underweight : 8267 Min. : 0.000
Normal weight:154898 1st Qu.: 6.000
Overweight :167084 Median : 7.000
Obese :134799 Mean : 7.052
NA's : 26727 3rd Qu.: 8.000
Max. :450.000
NA's :7387
diabete3
Yes : 62363
Yes, but female told only during pregnancy: 4602
No :415374
No, pre-diabetes or borderline diabetes : 8604
NA's : 832
educa
Never attended school or only kindergarten : 677
Grades 1 through 8 (Elementary) : 13395
Grades 9 though 11 (Some high school) : 28141
Grade 12 or GED (High school graduate) :142971
College 1 year to 3 years (Some college or technical school):134197
College 4 years or more (College graduate) :170120
NA's : 2274
income2 hlthpln1 exerany2
$75,000 or more :115902 Yes :434571 Yes :332464
Less than $75,000: 65231 No : 55300 No :125282
Less than $50,000: 61509 NA's: 1904 NA's: 34029
Less than $35,000: 48867
Less than $25,000: 41732
(Other) : 87108
NA's : 71426
As we can see, there are a number of NA values in the dataset. We shall therefore filter the data set to remove rows that contain NA values, to ensure a complete dataset for the final analysis.
## We shall filter the data to only contain complete data sets
data_subset <- data_subset[complete.cases(data_subset),]
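To illustrate what `complete.cases()` does, here is a small sketch on a made-up data frame (the values are illustrative, not BRFSS records). It keeps only rows with no NA in any column and also counts how many rows are lost.

```r
# Toy data frame with missing values in different columns (hypothetical values).
toy <- data.frame(
  genhlth  = c("Good", NA, "Fair", "Excellent"),
  sleptim1 = c(7, 8, NA, 6),
  X_bmi5   = c(2663, 2367, 3081, 2782)
)

keep          <- complete.cases(toy)   # TRUE where a row has no NA at all
toy_complete  <- toy[keep, ]           # rows retained for analysis
rows_dropped  <- sum(!keep)            # rows removed by the filter
na_per_column <- colSums(is.na(toy))   # which variables cause the loss

print(toy_complete)
print(rows_dropped)
print(na_per_column)
```

Because the filter is applied across all ten selected variables at once, a row missing only `income2` (71,426 NAs in the summary above) is also dropped from the sleep and BMI analyses. Filtering per research question would retain more rows, at the cost of each question using a slightly different sample.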
In order to address specific relationships, we shall create further subsets of the dataset to address each of the three research questions stipulated below.
Research question 1: The first research question focuses on the association between individual income levels and people’s general health and exercise levels. The hypothesis is that there is a direct correlation between income level and both the general health and the exercise levels of the individuals surveyed.
# Data set for research question 1
Selection <- c("X_state", "genhlth", "educa", "income2", "hlthpln1", "exerany2")
q1_data <- data_subset[, Selection]
Research question 2: The second research question examines the relationship between the number of hours of sleep and the general health of the population as reported by the individuals surveyed. The hypothesis to be examined is that sleep hours are directly correlated with general health.
# Data set for research question 2
Selection <- c("X_state", "genhlth", "sleptim1")
q2_data <- data_subset[, Selection]
Research question 3:
The third research question looks at the relationship between individual BMI levels and the prevalence of diabetes across the population surveyed. We would expect individuals with a high BMI to be at higher risk of being diagnosed with diabetes.
# Data set for research question 3
Selection <- c("X_state", "X_bmi5", "X_bmi5cat", "diabete3")
q3_data <- data_subset[, Selection]
Before addressing the research questions stipulated in the previous section, we shall first do some exploratory data analysis. This section examines the statistical distribution of the numerical variables in the dataset and looks for any general relationship between the factor variables in the chosen subset of the data.
##par(mfrow = c(2, 2))
## The first row will plot the distribution of X_bmi5 parameter
hist(data_subset$X_bmi5, breaks = 100, col = "blue")
## Let's look at the same data set after log transformation
hist(log(data_subset$X_bmi5), breaks = 100, col = "blue")
We see that the distribution of the BMI parameter is right skewed, as is evident from the histogram. The log transformation is applied to see whether it brings the distribution closer to normal. Let’s do the same with the sleep parameter as below:
##par(mfrow = c(2, 2))
## The second row plots the sleep parameter
hist(data_subset$sleptim1, breaks = 20, col = "blue")
## Let's look at the same data set after log transformation
hist(log(data_subset$sleptim1), breaks = 20, col = "blue")
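A normal Q-Q plot is a complementary way to judge skew and the effect of a log transformation. The sketch below uses simulated right-skewed values rather than the BRFSS data (the lognormal parameters are arbitrary assumptions), so it only illustrates the technique.

```r
set.seed(1)

# Simulated right-skewed values standing in for a BMI-like variable.
# These are NOT BRFSS observations; the parameters are arbitrary.
bmi_like <- rlnorm(5000, meanlog = log(27), sdlog = 0.5)

# For a right-skewed distribution the mean exceeds the median.
skew_hint <- mean(bmi_like) - median(bmi_like)

# Q-Q plot of the raw values: points bend away from the reference line.
qq_raw <- qqnorm(bmi_like, main = "Normal Q-Q plot: raw values")
qqline(bmi_like, col = "red")

# Q-Q plot after a log transformation: points follow the line more closely.
qq_log <- qqnorm(log(bmi_like), main = "Normal Q-Q plot: log-transformed values")
qqline(log(bmi_like), col = "red")
```

One practical point when applying this to `sleptim1`: the variable contains zeros, and `log(0)` is `-Inf`. `hist()` silently drops non-finite values, so such respondents vanish from the log-scale histogram.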
The histograms indicate that there is clearly a right skew in both the X_bmi5 and the sleptim1 parameters. Let’s also have a quick look to see if there is any relationship between the X_bmi5 and sleptim1 parameters.
## pairs(data = data_subset, ~ X_bmi5 + sleptim1, main = "Scatterplot Matrix for X_bmi5 and sleptim1")
## pairs(data = data_subset, ~ log(X_bmi5) + log(sleptim1), main = "Scatterplot Matrix for log transformed X_bmi5 and sleptim1")
library(ggplot2)  # provides ggplot(), geom_point(), theme() and labs()
ggplot(data_subset, aes(X_bmi5, sleptim1)) +
  geom_point() +  # inherits the aesthetics set in ggplot()
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "BMI vs Sleep Time", x = "BMI", y = "Sleep Time")
[1] "The correlation between the two parameters is: "
[1] -0.05014367
The scatter plot and the correlation coefficient of about -0.050 show that there is essentially no linear relationship between the two parameters.
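The call that produced the coefficient is not shown in the report. A minimal sketch of how such a value can be computed with base R's `cor()` follows, using a made-up data frame (hypothetical values, chosen only to demonstrate the call).

```r
# Toy stand-in for data_subset (hypothetical values, not BRFSS data).
toy <- data.frame(
  X_bmi5   = c(2100, 2500, 2900, 3300, NA),
  sleptim1 = c(8, 7, 7, 6, 7)
)

# Pearson correlation; use = "complete.obs" drops rows with an NA in either
# variable. In the report the NAs were already removed, so it is optional there.
r <- cor(toy$X_bmi5, toy$sleptim1, use = "complete.obs")

print("The correlation between the two parameters is: ")
print(r)
```

Pearson's coefficient only measures linear association and is sensitive to extreme values such as the maximum of 450 reported sleep hours in the summary above, so the figure is best read alongside the scatter plot.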
It is difficult to measure the relationship between the other parameters in data_subset because they are categorical variables, to which the correlation coefficient used above does not apply. We shall, therefore, use mosaic plots to observe the relationships between these factor variables qualitatively.
The first plot looks at the relationship between general health and BMI category -
##par(mfrow = c(2, 3))
mosaicplot( ~ genhlth + X_bmi5cat, data = data_subset, xlab = "General health", ylab = "BMI Category", color = c("lightblue", "steelblue"), main = "General health vs BMI Category")
From this plot, we can clearly see an association between the general health reported by the individuals surveyed and their BMI category. Individuals who report Excellent or Very Good general health tend to fall in the normal weight category, as one would expect.
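A mosaic plot is a picture of a two-way contingency table, so the same relationship can be read numerically from row proportions. The sketch below shows this on a six-row made-up data frame (illustrative values only), using base R's `table()` and `prop.table()`.

```r
# Toy data (hypothetical): general health rating and BMI category.
toy <- data.frame(
  genhlth = factor(
    c("Excellent", "Excellent", "Excellent", "Poor", "Poor", "Poor"),
    levels = c("Excellent", "Poor")
  ),
  X_bmi5cat = factor(
    c("Normal weight", "Normal weight", "Obese", "Obese", "Obese", "Normal weight"),
    levels = c("Normal weight", "Obese")
  )
)

counts    <- table(toy$genhlth, toy$X_bmi5cat)   # cell counts
row_props <- prop.table(counts, margin = 1)      # share of each BMI category within a health rating

print(counts)
print(round(row_props, 2))
```

Each row of `row_props` sums to 1, which corresponds to how tiles are split within one bar of the mosaic plot.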
Let’s now also look at the relationship between individual diabetes category and general health -
It is not surprising to observe that the diagnosis of pre-diabetes or diabetes in individuals is directly correlated with individuals reporting themselves as in Poor or Fair health.
Let’s also look at the relationship between the general health reported and the reported education levels of individuals surveyed -
People who report good levels of health tend to have at least 4 years of college education. Likewise, there is a direct correlation between the number of years spent in education and the reported levels of general health amongst the individuals surveyed.
Let’s now look at the relationship between the general health and income levels of individuals surveyed -
The plot suggests that income levels are also directly correlated with the health reported by the individuals. Of the individuals reporting excellent health, the majority also reported income levels greater than $75,000 a year.
Next, let’s look at the relationship between the general health of individuals and whether they exercised -
It is perhaps not surprising to see that individuals reporting themselves in the best of health also reported themselves as being very active individuals engaging in some level of exercise.
And lastly, let’s look to see if there is any relationship between individual health and their enrollment in a health plan -
The plot suggests that there is a direct relationship between individuals reporting better health and their enrollment in a health plan. This relationship, however, is not as strong as some of the other relationships we have observed, which suggests that education, income and exercise are more strongly associated with self-reported health than enrollment in a health plan is. This does not undermine the importance of a health plan, but it does suggest that individual perceptions of general health are not highly dependent on whether a person is enrolled in a medical plan, which seems reasonable at an intuitive level. Let’s try to expand on some of the relationships observed in our exploratory analysis and examine them in greater depth using the research questions stipulated below.
Research question 1: Relationship between exercise and income levels and its impact on general health
We shall first create a summary of the proportion of individuals in each health category, by income level, within the q1_data set as follows -
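# We shall look at the income levels and the impact that has on general health
library(dplyr)  # provides %>%, group_by() and summarize()
# Proportion of respondents in each general health category, by income level
# (mean() of a logical vector gives the share of TRUE values)
healthIncome <- q1_data %>%
  group_by(income2) %>%
  summarize(Excellent = mean(genhlth == "Excellent"),
            V_good = mean(genhlth == "Very good"),
            Good = mean(genhlth == "Good"),
            Fair = mean(genhlth == "Fair"),
            Poor = mean(genhlth == "Poor"))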
##par(mfrow = c(2,2))
Let’s plot the results between income levels and individuals reporting Excellent or Very Good health.
# Let's plot these results grouped by income vs report health
a <- ggplot(healthIncome, aes(income2, Excellent)) +
geom_point(aes(income2, Excellent)) +
theme(axis.text.x = element_text(angle = 90)) +
labs(title="Income vs Excellent Health", x="Income", y="Excellent Health")
b <- ggplot(healthIncome, aes(income2, V_good)) +
geom_point(aes(income2, V_good)) +
theme(axis.text.x = element_text(angle = 90)) +
labs(title="Income vs Very Good Health", x="Income", y="Very Good Health")
library(gridExtra)  # provides grid.arrange()
grid.arrange(a, b, ncol = 2)
The plots confirm our previous observations. The proportion of individuals who report Excellent or Very Good health increases with income level. Let’s see what happens in the case of individuals who report Good health.
ggplot(healthIncome, aes(income2, Good)) +
geom_point(aes(income2, Good)) +
theme(axis.text.x = element_text(angle = 90)) +
labs(title="Income vs Good Health", x="Income", y="Good Health")
The plot shows that the proportion of individuals rating themselves in Good health rises with income up to the $25,000 level, after which it drops sharply as individuals earning more than $25,000 increasingly report Excellent or Very Good health, as we have already seen. What about individuals who report themselves in Fair to Poor health?
c <- ggplot(healthIncome, aes(income2, Fair)) +
geom_point(aes(income2, Fair)) +
theme(axis.text.x = element_text(angle = 90)) +
labs(title="Income vs Fair Health", x="Income", y="Fair Health")
d <- ggplot(healthIncome, aes(income2, Poor)) +
geom_point(aes(income2, Poor)) +
theme(axis.text.x = element_text(angle = 90)) +
labs(title="Income vs Poor Health", x="Income", y="Poor Health")
grid.arrange(c, d, ncol=2)
Taken together, the plots show a clear relationship between income level and self-reported general health.
Let’s now look to see if there is a relationship between exercise levels and income. We shall start once again by summarizing the dataset q1_data and grouping it into income categories. We can then calculate the proportion of individuals who exercise in each of the income categories and plot the results.
## Let's find the proportion of individuals that exercise as grouped by their income levels
physicalactivity_income <- q1_data %>% group_by(income2) %>% summarize(mean_exercise = mean(exerany2 == "Yes"))
## Let's now plot these results
ggplot(physicalactivity_income, aes(income2, mean_exercise)) +
  geom_point() +  # inherits the aesthetics set in ggplot()
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Income vs Exercise", x = "Income", y = "Proportion of people who exercise")
We observe a clear relationship between income levels and exercise, thus reiterating quantitatively what we observed in our exploratory data analysis. Furthermore, we can see that the proportion of individuals who exercise increases sharply at income levels higher than $25,000.
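The summary above relies on the fact that the mean of a logical vector is the proportion of TRUE values. A base R sketch of the same group-wise calculation on made-up data (hypothetical rows, two income levels only) shows the idea with `tapply()` instead of dplyr.

```r
# Toy stand-in for q1_data (hypothetical rows).
toy <- data.frame(
  income2 = factor(c("Less than $25,000", "Less than $25,000",
                     "$75,000 or more", "$75,000 or more", "$75,000 or more")),
  exerany2 = factor(c("Yes", "No", "Yes", "Yes", "No"), levels = c("Yes", "No"))
)

# exerany2 == "Yes" is TRUE/FALSE; its mean within each income group is the
# proportion of respondents in that group who exercise.
prop_exercise <- tapply(toy$exerany2 == "Yes", toy$income2, mean)

print(prop_exercise)
```

The result is therefore a proportion between 0 and 1 for each income group, not a count of people.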
Let’s now see if there is a relationship between health and exercise which intuitively we would expect to see. We shall once again group the q1_data by exercise levels and calculate the mean values for the various health categories and plot the data.
# We shall look at the reported health levels and how those correlate with exercise
healthExercise <- q1_data %>% group_by(exerany2) %>% summarize(Excellent = mean(genhlth == "Excellent"), V_good = mean(genhlth == "Very good"), Good = mean(genhlth == "Good"), Fair = mean(genhlth == "Fair"), Poor = mean(genhlth == "Poor"))
## Let's reshape this data into long form using tidyr to make it easier to plot
library(tidyr)  # provides gather()
healthExercise_tidy <- gather(healthExercise, key = "Health", value = "Average", Excellent, V_good, Good, Fair, Poor)
# Let's plot these results for health grouped by exercise
ggplot(healthExercise_tidy, aes(Health, Average, color = exerany2)) +
geom_point(aes(Health, Average)) +
theme(axis.text.x = element_text(angle = 90)) +
labs(title="Health vs Exercise", x="Health", y="Mean Value")
Here, it becomes evident that individuals who exercise report better health than those who do not. The proportion of individuals reporting Excellent or Very Good health is higher among those who exercised than among those who did not. This difference, however, is less dramatic for individuals who report Good, Fair or Poor health.
Research question 2:
The second research question will examine the impact sleep time has on the general health of the survey respondents and how these sleeping habits are distributed across the United States. We shall once again start by grouping the data for each state and calculating the mean sleep time for each state.
# We shall look at the mean sleep time for each State and the impact that has on general health
stateSleep <- q2_data %>% group_by(X_state) %>% summarize(mean_sleep = mean(sleptim1))
healthSleep <- q2_data %>% group_by(genhlth) %>% summarize(mean_sleep = mean(sleptim1))
par(mfrow = c(2,2))
## stateSleep
healthSleep
# A tibble: 5 x 2
genhlth mean_sleep
<fct> <dbl>
1 Excellent 7.17
2 Very good 7.08
3 Good 7.02
4 Fair 6.88
5 Poor 6.71
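The spread between the group means in the table above can be computed directly. This short sketch re-enters the five printed means by hand (rounded to two decimals, as in the output) and takes the difference between the largest and smallest.

```r
# Mean sleep time per general health rating, copied from the printed tibble.
healthSleep <- data.frame(
  genhlth    = c("Excellent", "Very good", "Good", "Fair", "Poor"),
  mean_sleep = c(7.17, 7.08, 7.02, 6.88, 6.71)
)

spread_hours   <- diff(range(healthSleep$mean_sleep))  # max minus min
spread_minutes <- spread_hours * 60

print(spread_hours)
print(spread_minutes)
```

The gap between the Excellent and Poor groups is about 0.46 hours, or roughly 28 minutes of sleep per night on average.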
We observe that the differential in mean sleep time between the different general health categories spans about 0.46 hours (7.17 hours for Excellent versus 6.71 hours for Poor). It is interesting to see how such a small differential in mean sleep time lines up with the overall health perceived by the individuals surveyed.
# Let's plot these results for mean sleep grouped by X_state and genhlth
e <- ggplot(stateSleep, aes(X_state, mean_sleep)) +
geom_point(aes(X_state, mean_sleep)) + theme(axis.text.x = element_text(angle = 90)) +
labs(title="Mean sleep hours per State",
x="State", y="Mean Sleep Hours")
f <- ggplot(healthSleep, aes(genhlth, mean_sleep, color = mean_sleep)) +
geom_point(aes(genhlth, mean_sleep)) + theme(axis.text.x = element_text(angle = 90)) +
labs(title="Mean sleep hours for each general health rating",
x="General Health Rating", y="Mean Sleep Hours")
grid.arrange(e, f, nrow=2)
The plot shows that the mean number of hours spent sleeping rises with the general health rating perceived by the individuals surveyed: respondents who report Excellent health sleep on average about 0.46 hours longer than those who report Poor health. Furthermore, a majority of the States in the dataset show a mean reported sleep time of more than 7 hours.
Research question 3: Relationship between BMI and diabetes across the States
We shall next examine the levels of diabetes and obesity across the States and how BMI levels correlate with the diagnosis of diabetes.
## Let's group the q3_data set by BMI Category
bmiCat <- q3_data %>% group_by(X_bmi5cat) %>% summarize(Median_bmi = median(X_bmi5))
bmiCat
# A tibble: 4 x 2
X_bmi5cat Median_bmi
<fct> <dbl>
1 Underweight 1771
2 Normal weight 2271
3 Overweight 2729
4 Obese 3366
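The four-digit values in this table are easier to read once rescaled. The sketch below assumes that `X_bmi5` stores BMI with two implied decimal places and that the categories follow the conventional cut-offs of 18.5, 25 and 30; both assumptions should be checked against the BRFSS codebook. It converts the printed medians and assigns each to a category with `cut()`.

```r
# Median X_bmi5 per category, copied from the printed tibble.
median_bmi5 <- c(1771, 2271, 2729, 3366)

# Assumption: two implied decimal places, so 2729 means a BMI of 27.29.
bmi <- median_bmi5 / 100

# Assumption: conventional BMI cut-offs; right = FALSE makes each interval
# closed on the left, e.g. [25, 30) for "Overweight".
category <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal weight", "Overweight", "Obese"),
  right  = FALSE
)

print(data.frame(median_bmi5, bmi, category))
```

Under these assumptions each median falls inside its own category, which is a quick consistency check between `X_bmi5` and `X_bmi5cat`.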
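X_bmi5 stores BMI with two implied decimal places, so the medians above correspond to BMIs of 17.71, 22.71, 27.29 and 33.66. The overweight and obese categories therefore have median BMIs of 27.29 and 33.66 respectively, well above the normal weight median of 22.71. Next we group the q3_data by State and calculate the mean BMI within each state.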
## Let's group the q3_data set by State
stateWeight <- q3_data %>% group_by(X_state) %>% summarize(mean_bmi = mean(X_bmi5))
## stateWeight
# Let's plot these results for mean bmi grouped by X_state
ggplot(stateWeight, aes(X_state, mean_bmi)) +
geom_point(aes(X_state, mean_bmi)) + theme(axis.text.x = element_text(angle = 90)) +
geom_hline(yintercept = 2729, color = "red") +
labs(title="Mean BMI per State", x="State", y="Mean BMI")
As we can see from the plot above, the mean BMI for most of the States falls within the overweight category; the red line marks the median BMI of that category (2729, i.e. a BMI of 27.29). Another look at the data below also shows that a diagnosis of diabetes is associated with a higher mean BMI.
## Let's group the q3_data set by diabetes diagnosis
diabetesBMI <- q3_data %>% group_by(diabete3) %>% summarize(Mean_bmi = mean(X_bmi5))
diabetesBMI
# A tibble: 4 x 2
diabete3 Mean_bmi
<fct> <dbl>
1 Yes 3176.
2 Yes, but female told only during pregnancy 2826.
3 No 2734.
4 No, pre-diabetes or borderline diabetes 3031.
## Let's plot these results for bmi and diabetes
ggplot(diabetesBMI, aes(diabete3, Mean_bmi)) +
geom_bar(stat="identity", width=0.5, color = "steelblue", fill = "steelblue") +
geom_hline(yintercept = 2729, color = "red") +
theme(axis.text.x = element_text(angle = 45)) +
labs(title="Mean BMI vs Diabetes", x="Diabetes Diagnosis", y="Mean BMI")
In conclusion, the plots analyzed within this report support the hypotheses postulated within each of the three research questions. The data from this observational study support the hypothesis that the proportion of individuals who report Excellent or Very Good health increases with income level. We also observe a clear relationship between income levels and exercise: the proportion of individuals who exercise increases sharply at income levels higher than $25,000. There is also evidence that individuals who exercise report better health than those who do not, a difference that is most pronounced among individuals who report themselves in Excellent or Very Good health.
Equally apparent is the association between the perception of general health and the number of hours of sleep that an individual gets. It is perhaps a bit surprising to observe that the majority of the States in the dataset show a mean reported sleep time of more than 7 hours. Lastly, the data also support the hypothesis that a diagnosis of diabetes is associated with high BMI levels.
All three hypotheses are plausible at an intuitive level. In an observational study of this nature, it would be unwise to claim causal relationships between the variables studied; that would require a designed experiment, and formal inference would be needed to assess the statistical significance of these correlations. Nevertheless, it stands to reason that individuals with higher incomes will be in a position to afford a better quality of life, with more time for exercise and adequate sleep. Likewise, higher income levels would very likely allow individuals to afford better health care.
Also, it is well established that obesity is associated with an increased probability of being diagnosed with diabetes, and the data within this study support that hypothesis.
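The report does not carry out any significance testing. As an illustration of what assessing statistical significance could look like for two categorical variables, the sketch below runs a chi-squared test of independence on a made-up 2 x 2 table of counts (hypothetical numbers, not derived from the BRFSS data).

```r
# Hypothetical counts: income group (rows) by self-reported health (columns).
counts <- matrix(
  c(30, 10,
    10, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    income = c("Higher", "Lower"),
    health = c("Excellent/Very good", "Fair/Poor")
  )
)

# Chi-squared test of independence (Yates' continuity correction is applied
# by default for 2 x 2 tables).
test_result <- chisq.test(counts)

print(test_result)
```

A small p-value indicates that the two variables are unlikely to be independent, but it says nothing about causation. BRFSS is also a weighted complex survey, so an unweighted test such as this one is only a rough guide for the real data.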