Introduction

This project report is submitted to fulfil the final project requirements for week 5 of the Introduction to Probability and Data course offered by Duke University on Coursera.


Part 1: Data

The background context regarding the assignment can be found at: https://www.coursera.org/learn/probability-intro/supplement/1E7zQ/project-information.

Behavioral Risk Factor Surveillance System

According to CDC, the “Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world”.

Further details about the BRFSS can be obtained from this link: https://www.cdc.gov/brfss/annual_data/2013/pdf/Overview_2013.pdf

The BRFSS is administered by CDC’s Population Health Surveillance Branch, under the Division of Population Health at the National Center for Chronic Disease Prevention and Health Promotion. BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days — health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use.

Data collection

Since 2011, BRFSS has conducted both landline telephone- and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing. Health characteristics estimated from the BRFSS pertain to the non-institutionalized adult population, aged 18 years or older, who reside in the US. In 2013, additional question sets were included as optional modules to provide a measure for several childhood health and wellness indicators, including asthma prevalence for people aged 17 years or younger. BRFSS field operations are managed by state health departments that follow protocols adopted by the states with technical assistance provided by CDC. State health departments collaborate during survey development, and conduct the interviews themselves or by using contractors. The data are transmitted to the CDC for editing, processing, weighting, and analysis. An edited and weighted data file is provided to each participating health department for each year of data collection, and summary reports of state-specific data are prepared by the CDC.

The data and further information were obtained from the following sources:

References:

* 2013 Survey Data Information
* 2013 BRFSS Overview [PDF - 84 KB]: Provides information on the background, design, data collection and processing, statistical, and analytical issues for the combined landline and cell phone data set.
* BRFSS Questionnaire (Mandatory and Optional Modules) [PDF - 365 KB]
* 2013 BRFSS Codebook [PDF - 2.7 MB]: Codebook for the file showing variable name, location, and frequency of values for all reporting areas combined for the combined landline and cell phone data set.
* Calculated Variables in Data Files [PDF - 421 KB]
* Comparability of Data [PDF - 96 KB]: Comparability of data across reporting areas for the combined landline and cell phone data set. The BRFSS 2012 data is not directly comparable to years of BRFSS data before 2011 because of the changes in weighting methodology and the addition of the cell phone sampling frame.
* 2013 Weighting Formula [PDF - 98 KB]
* Summary Matrix of Calculated Variables (CV) in the 2013 Data File

Generalizability / Causality

BRFSS conducts both landline telephone- and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household, drawn from the non-institutionalized adult population (18 years of age and older) residing in the US. Overall, an estimated 97.5% of US households had telephone service in 2012. Telephone coverage varies across states, ranging from 95.3% in New Mexico to 98.6% in Connecticut. In 2013, BRFSS respondents who received 90 percent or more of their calls on cellular telephones were eligible for participation in the cellular telephone survey. In 2013, 50 states, the District of Columbia, Guam, and Puerto Rico collected samples of interviews conducted both by landline telephone and cellular telephone.

This type of survey is based on a stratified sampling method: the population is stratified by state and territory, and random sampling via telephone is then employed within each stratum. Since all states and the participating territories are covered and an estimated 97.5% of US households have telephone service, it is reasonable to generalize the results of this survey to the non-institutionalized adult population of the US. However, because the study collects data via random telephone sampling with no random assignment of respondents to conditions, it is an observational study. Any observed correlations therefore cannot be interpreted as causal, particularly as studies of this type do not account for confounding factors that may contribute to the observed correlations.


Setup

Load data

Loading the primary data set

We shall begin by examining the data set closely. Further details can be found in the BRFSS Codebook. Let’s start by looking at the size of the dataset.

[1] 491775    330

The dataset is large and contains 491,775 rows and 330 columns. We take a closer look at the 330 parameters in the columns and obtain summary statistics as below. Since the output of these commands can be quite large, it is hidden from this report.

Given the large size of the dataset, a reduced subset is created in order to address the specific research questions of interest. We can then create a summary of this reduced dataset to explore the descriptive statistics for each parameter.

'data.frame':   491775 obs. of  10 variables:
 $ X_state  : Factor w/ 55 levels "0","Alabama",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ genhlth  : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
 $ X_bmi5   : int  3916 1822 2746 2197 3594 3986 2070 NA 3017 2829 ...
 $ X_bmi5cat: Factor w/ 4 levels "Underweight",..: 4 1 3 2 4 4 2 NA 4 3 ...
 $ sleptim1 : int  NA 6 9 8 6 8 7 6 8 8 ...
 $ diabete3 : Factor w/ 4 levels "Yes","Yes, but female told only during pregnancy",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ educa    : Factor w/ 6 levels "Never attended school or only kindergarten",..: 6 5 6 4 6 6 4 5 6 4 ...
 $ income2  : Factor w/ 8 levels "Less than $10,000",..: 7 8 8 7 6 8 NA 6 8 4 ...
 $ hlthpln1 : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
 $ exerany2 : Factor w/ 2 levels "Yes","No": 2 1 2 1 2 1 1 1 1 1 ...
          X_state            genhlth           X_bmi5     
 Florida      : 33668   Excellent: 85482   Min.   :   1   
 Kansas       : 23282   Very good:159076   1st Qu.:2367   
 Nebraska     : 17139   Good     :150555   Median :2663   
 Massachusetts: 15071   Fair     : 66726   Mean   :2782   
 Minnesota    : 14340   Poor     : 27951   3rd Qu.:3081   
 New Jersey   : 13776   NA's     :  1985   Max.   :9769   
 (Other)      :374499                      NA's   :26727  
         X_bmi5cat         sleptim1      
 Underweight  :  8267   Min.   :  0.000  
 Normal weight:154898   1st Qu.:  6.000  
 Overweight   :167084   Median :  7.000  
 Obese        :134799   Mean   :  7.052  
 NA's         : 26727   3rd Qu.:  8.000  
                        Max.   :450.000  
                        NA's   :7387     
                                       diabete3     
 Yes                                       : 62363  
 Yes, but female told only during pregnancy:  4602  
 No                                        :415374  
 No, pre-diabetes or borderline diabetes   :  8604  
 NA's                                      :   832  
                                                    
                                                    
                                                          educa       
 Never attended school or only kindergarten                  :   677  
 Grades 1 through 8 (Elementary)                             : 13395  
 Grades 9 though 11 (Some high school)                       : 28141  
 Grade 12 or GED (High school graduate)                      :142971  
 College 1 year to 3 years (Some college or technical school):134197  
 College 4 years or more (College graduate)                  :170120  
 NA's                                                        :  2274  
              income2       hlthpln1      exerany2     
 $75,000 or more  :115902   Yes :434571   Yes :332464  
 Less than $75,000: 65231   No  : 55300   No  :125282  
 Less than $50,000: 61509   NA's:  1904   NA's: 34029  
 Less than $35,000: 48867                              
 Less than $25,000: 41732                              
 (Other)          : 87108                              
 NA's             : 71426                              

As we can see, there are a number of NA values in the dataset. We shall therefore filter the dataset further, removing rows that contain NA values, to ensure a complete dataset for the final analysis.
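The filtering step described above can be sketched in base R as follows. This is a minimal illustration on an invented toy data frame, not the code used for the report; the name `data_complete` is hypothetical, and only the column names are taken from the summary output above.

```r
# Toy stand-in for the reduced subset; the values are invented for illustration.
data_subset <- data.frame(
  genhlth  = factor(c("Good", "Fair", NA, "Excellent", "Good")),
  X_bmi5   = c(2746L, NA, 2197L, 3017L, 2829L),
  sleptim1 = c(9L, 6L, 8L, NA, 8L)
)

# Count the missing values in each column before filtering.
na_counts <- colSums(is.na(data_subset))

# Keep only the rows that have no NA in any column.
data_complete <- data_subset[complete.cases(data_subset), ]

print(na_counts)
print(nrow(data_complete))
```

Note that dropping every row with any NA can discard a large share of the data when many columns are involved, so it may be preferable to filter only on the columns each research question actually uses.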


Part 2: Research questions

In order to address specific relationships, we shall create further subsets of the dataset to address each of the three research questions stipulated below.

Research question 1: The first research question examines whether individual income levels are associated with people’s general health and exercise levels. The hypothesis is that income levels are positively correlated with both the general health and the exercise levels of the individuals surveyed.

Research question 2: The second research question examines the relationship between the number of hours of sleep and the general health reported by the individuals surveyed. The hypothesis is that sleep hours are positively correlated with general health.

Research question 3: The third research question looks at the relationship between individual BMI levels and the prevalence of diabetes across the population surveyed. The hypothesis is that individuals with a high BMI are more likely to have been diagnosed with diabetes.


Part 3: Exploratory data analysis

Before addressing the research questions stipulated in the previous section, we start with some exploratory data analysis. This section examines the statistical distribution of the numerical variables in the dataset and looks for general relationships between the factor variables in the chosen subset of the dataset.

We see that the distribution of the BMI parameter (X_bmi5) is right-skewed, as is evident from the histogram and the Q-Q plot. We can also log-transform this variable to see whether the transformed distribution is closer to normal. Let’s do the same with the sleep parameter (sleptim1).
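One way to check the effect of a log transform numerically is to compare the sample skewness before and after. The sketch below uses simulated right-skewed values as a stand-in for X_bmi5 (they are not BRFSS data), and the `skewness` helper is a hypothetical function written here for illustration rather than one from a package.

```r
# Simulated right-skewed values standing in for X_bmi5 (not the BRFSS data).
set.seed(1)
bmi <- round(exp(rnorm(1000, mean = log(2700), sd = 0.2)))

# Sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw <- skewness(bmi)       # positive for a right-skewed variable
skew_log <- skewness(log(bmi))  # closer to zero if the log transform helps

# Draw the histogram and normal Q-Q plot only in an interactive session.
if (interactive()) {
  hist(log(bmi), main = "log(X_bmi5), simulated", xlab = "log(BMI x 100)")
  qqnorm(log(bmi))
  qqline(log(bmi))
}
```

A skewness near zero after the transform, together with Q-Q points lying close to the reference line, would indicate that the log scale is a reasonable choice for the variable.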

The histograms indicate a clear right skew in both the X_bmi5 and the sleptim1 parameters. Let’s also have a quick look to see if there is any relationship between the X_bmi5 and sleptim1 parameters.

[1] "The correlation between the two parameters is: "
[1] -0.05014367

Plotting the data shows very little linear association between the two parameters: the correlation coefficient is only -0.050.
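As a rough sketch of how such a coefficient can be computed when both columns contain missing values, the example below uses a small invented data frame (`toy` is a hypothetical name) and base R's `cor` with `use = "complete.obs"`. It also squares the coefficient reported above to show how little variance a linear fit would explain.

```r
# Invented values standing in for the two numeric columns.
toy <- data.frame(
  X_bmi5   = c(3916, 1822, 2746, 2197, 3594, NA, 2070, 3017),
  sleptim1 = c(NA, 6, 9, 8, 6, 8, 7, 8)
)

# use = "complete.obs" drops any row in which either value is missing.
r <- cor(toy$X_bmi5, toy$sleptim1, use = "complete.obs")

# The same result, computed explicitly on the complete rows only.
ok <- complete.cases(toy)
r_check <- cor(toy$X_bmi5[ok], toy$sleptim1[ok])

# Squaring the coefficient reported in this report gives the share of
# variance explained by a linear relationship (about 0.25%).
reported_r <- -0.05014367
reported_r_squared <- reported_r^2
print(reported_r_squared)
```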

The relationships between the other parameters in data_subset are harder to quantify because they are categorical variables, for which a correlation coefficient such as the one above is not defined. We shall, therefore, use mosaic plots to observe the relationships between these factor variables qualitatively.
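A mosaic plot is a picture of a contingency table, so the proportions behind it can be inspected directly. The following sketch builds such a table from ten invented respondents (the data frame `toy` is hypothetical) using base R's `table`, `prop.table` and `mosaicplot`; the factor levels mirror those shown in the structure output above.

```r
genhlth_levels <- c("Excellent", "Very good", "Good", "Fair", "Poor")
bmi_levels <- c("Underweight", "Normal weight", "Overweight", "Obese")

# Ten invented respondents; every BMI category appears at least once.
toy <- data.frame(
  genhlth = factor(
    c("Excellent", "Very good", "Good", "Fair", "Poor",
      "Good", "Very good", "Excellent", "Fair", "Good"),
    levels = genhlth_levels
  ),
  X_bmi5cat = factor(
    c("Normal weight", "Normal weight", "Overweight", "Obese", "Obese",
      "Obese", "Overweight", "Normal weight", "Underweight", "Underweight"),
    levels = bmi_levels
  )
)

# Cross-tabulate: rows are BMI categories, columns are health categories.
counts <- table(toy$X_bmi5cat, toy$genhlth)

# Proportion of each health category within each BMI category (rows sum to 1).
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 2))

# Draw the mosaic plot only in an interactive session.
if (interactive()) {
  mosaicplot(counts, main = "General health by BMI category (toy data)",
             xlab = "BMI category", ylab = "General health")
}
```

Reading the row proportions alongside the plot makes it easier to state how strong a visual pattern actually is.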

The first plot looks at the relationship between general health and bmi category -

From this plot, we can clearly visualize a direct relationship between the quality of general health as reported by the individuals surveyed and their obesity levels. Individuals who report an Excellent or Very Good quality of general health tend to fall under a normal weight category as one would expect to see.

Let’s now also look at the relationship between individual diabetes category and general health -

It is not surprising to observe that the diagnosis of pre-diabetes or diabetes in individuals is directly correlated with individuals reporting themselves as in Poor or Fair health.

Let’s also look at the relationship between the general health reported and the reported education levels of individuals surveyed -

People who report good levels of health tend to have at least 4 years of college education. More generally, the reported level of general health improves with the number of years spent in education among the individuals surveyed.

Let’s now look at the relationship between the general health and income levels of individuals surveyed -

The plot suggests that income levels are also positively correlated with the level of health reported by the individuals. Of the individuals reporting excellent health, the majority also reported income levels greater than $75,000 a year.

Next, let’s look at the relationship between the general health of individuals and whether they exercised -

It is perhaps not surprising to see that individuals reporting themselves in the best of health also reported themselves as being very active individuals engaging in some level of exercise.

And lastly, let’s look to see if there is any relationship between individual health and their enrollment in a health plan -

The plot suggests that there is a positive relationship between individuals reporting better health and their enrollment in a health plan. This relationship, however, is not as strong as some of the other relationships we have observed, which suggests that education, income and exercise levels are more strongly associated with self-reported health than enrollment in a health plan is. This does not undermine the importance of a health plan, but it does suggest that individual perceptions of general health do not depend heavily on enrollment in a medical plan, which seems plausible at an intuitive level. Let’s now expand on some of the relationships observed in our exploratory analysis and examine them in greater depth using the research questions stipulated above.

Research question 1: Relationship between exercise and income levels and its impact on general health

We shall first summarize the q1_data set by calculating the proportion of individuals in each health category within each income level, as follows -
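A grouped summary of this kind can be sketched in base R as shown below. The data frame `q1_toy` and its six rows are invented for illustration, and only two of the eight income levels are included; the original analysis may well have used a different approach, such as a dplyr pipeline.

```r
# Six invented respondents in two income groups.
q1_toy <- data.frame(
  income2 = factor(
    c("Less than $25,000", "Less than $25,000", "Less than $25,000",
      "$75,000 or more", "$75,000 or more", "$75,000 or more"),
    levels = c("Less than $25,000", "$75,000 or more")
  ),
  genhlth  = c("Fair", "Good", "Excellent", "Excellent", "Very good", "Good"),
  exerany2 = c("No", "No", "Yes", "Yes", "Yes", "No")
)

# Flag respondents who report Excellent or Very good health.
q1_toy$top_health <- q1_toy$genhlth %in% c("Excellent", "Very good")

# The mean of a logical vector is the proportion of TRUE values,
# so tapply() returns one proportion per income group.
prop_top_health <- tapply(q1_toy$top_health, q1_toy$income2, mean)
prop_exercise   <- tapply(q1_toy$exerany2 == "Yes", q1_toy$income2, mean)

print(prop_top_health)
print(prop_exercise)
```

Working with proportions rather than raw counts keeps the income groups comparable even though they differ greatly in size.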

Let’s plot the results between income levels and individuals reporting Excellent or Very Good health.

The plots confirm our previous observations. On average, the proportion of individuals who report Excellent or Very Good health increases with earned income level. Let’s see what happens in the case of individuals who report Good health.

The plot shows that the proportion of individuals rating themselves in Good health rises with income up to the $25,000 level, after which it drops sharply as individuals earning more than $25,000 increasingly report Excellent or Very Good health, as we have already seen. What about individuals who report themselves in Fair or Poor health?

We can clearly see that there is a direct relationship between income levels and the general health.

Let’s now look to see if there is a relationship between exercise levels and income. We shall start once again by summarizing the dataset q1_data and grouping it into income categories. We can then calculate the proportion of individuals who exercise in each of the income categories and plot the results.

We observe a clear relationship between income levels and exercise, which confirms quantitatively what we observed in the exploratory data analysis. Furthermore, the proportion of individuals who exercise increases sharply at income levels higher than $25,000.

Let’s now see if there is a relationship between health and exercise which intuitively we would expect to see. We shall once again group the q1_data by exercise levels and calculate the mean values for the various health categories and plot the data.

Here, it becomes evident that individuals who exercise report better health than those who do not. The proportion of individuals reporting Excellent or Very Good health is higher among those who exercised than among those who did not. This difference, however, is less dramatic for individuals who report Good, Fair or Poor health.

Research question 2:

The second research question examines the relationship between sleep time and the general health of the survey respondents, and how sleeping habits are distributed across the United States. We start by grouping the data and calculating the mean sleep time for each group; the table below shows the means by general health category.

# A tibble: 5 x 2
  genhlth   mean_sleep
  <fct>          <dbl>
1 Excellent       7.17
2 Very good       7.08
3 Good            7.02
4 Fair            6.88
5 Poor            6.71

We observe that the differential in mean sleep time across the general health categories spans about 0.46 hours (7.17 hours for Excellent against 6.71 hours for Poor). It would be very interesting to see if such a small differential in mean sleep times translates into any perceived gains in overall health by the individuals surveyed.
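Group means like those in the table above can be computed with base R's `aggregate`. The sketch below runs on invented values (`toy` and `valid` are hypothetical names). Because the summary output earlier in this report shows a maximum sleptim1 of 450, which cannot be hours of sleep in a day, the example also drops values above 24 hours; that cut-off is an assumption made here, not a step confirmed by the report.

```r
# Invented sleep times, including one implausible entry and one missing value.
toy <- data.frame(
  genhlth = factor(
    c("Excellent", "Excellent", "Good", "Good", "Poor", "Poor", "Poor"),
    levels = c("Excellent", "Good", "Poor")
  ),
  sleptim1 = c(8, 7, 7, 450, 6, NA, 7)
)

# Drop missing values and entries above 24 hours (assumed plausibility cut-off).
valid <- subset(toy, !is.na(sleptim1) & sleptim1 <= 24)

# Mean sleep time per general health category.
mean_sleep <- aggregate(sleptim1 ~ genhlth, data = valid, FUN = mean)
names(mean_sleep)[2] <- "mean_sleep"
print(mean_sleep)

# Spread between the highest and lowest group mean.
sleep_range <- diff(range(mean_sleep$mean_sleep))
print(sleep_range)
```

Comparing the means with and without the cut-off is a quick way to see how much a handful of extreme entries affects each group.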

The plot shows that the number of hours spent sleeping is positively correlated with general health as perceived by the individuals surveyed: even a mean differential of about 0.46 hours separates individuals who report Excellent or Very Good health from those who report Poor health. Furthermore, in a majority of the states in the dataset, individuals report a mean of more than 7 hours of sleep.

Research question 3: Relationship between BMI and diabetes across the States

We shall next examine the levels of diabetes and obesity across the States and how BMI levels correlate with the diagnosis of diabetes.

# A tibble: 4 x 2
  X_bmi5cat     Median_bmi
  <fct>              <dbl>
1 Underweight         1771
2 Normal weight       2271
3 Overweight          2729
4 Obese               3366

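We can see that the median X_bmi5 value rises from 1771 in the Underweight category to 3366 in the Obese category. X_bmi5 stores BMI with two implied decimal places, so the Normal weight median of 2271 corresponds to a BMI of 22.71, and the Overweight and Obese categories both have medians above it. Next we group the q3_data by State and calculate the mean BMI within each state.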
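The conversion and grouping described above can be sketched as follows. The data frame `q3_toy` is invented, and the category boundaries (18.5, 25 and 30) are the standard adult BMI cut-offs, assumed here to match the X_bmi5cat coding; the BRFSS codebook should be consulted for the exact definition.

```r
# Six invented respondents from two states.
q3_toy <- data.frame(
  X_state = c("Alabama", "Alabama", "Alabama", "Kansas", "Kansas", "Kansas"),
  X_bmi5  = c(1822, 2197, 3916, 2746, 2829, 3017)
)

# X_bmi5 has two implied decimal places, so divide by 100 to get BMI.
q3_toy$bmi <- q3_toy$X_bmi5 / 100

# Assumed standard adult cut-offs; right = FALSE makes each interval
# include its lower bound, e.g. [25, 30) for Overweight.
q3_toy$bmi_cat <- cut(
  q3_toy$bmi,
  breaks = c(0, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal weight", "Overweight", "Obese"),
  right = FALSE
)

# Median BMI within each category and mean BMI within each state.
median_bmi     <- tapply(q3_toy$bmi, q3_toy$bmi_cat, median)
mean_bmi_state <- tapply(q3_toy$bmi, q3_toy$X_state, mean)

print(median_bmi)
print(mean_bmi_state)
```

Reporting BMI on its usual scale makes it straightforward to compare state means against the overweight threshold of 25.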

As we can see from the plot above, the mean BMI levels for most of the States fall within the overweight category, as highlighted by the red line. Another look at the data below also shows that the diagnosis of diabetes is correlated with mean BMI.

# A tibble: 4 x 2
  diabete3                                   Mean_bmi
  <fct>                                         <dbl>
1 Yes                                           3176.
2 Yes, but female told only during pregnancy    2826.
3 No                                            2734.
4 No, pre-diabetes or borderline diabetes       3031.

Summary

In conclusion, the plots analyzed within this report support the hypotheses postulated within each of the three research questions. The data from this observational study support the hypothesis that, on average, the proportion of individuals who report Excellent or Very Good health increases with earned income level. We also observe a clear relationship between income levels and exercise, and the proportion of individuals who exercise increases sharply at income levels higher than $25,000. There is also evidence that individuals who exercise report better health than those who do not, a difference that is most pronounced among individuals who report themselves in Excellent or Very Good health. Equally apparent is the association between better perceived general health and the number of hours of sleep that an individual gets, and it is perhaps a bit surprising to observe that, in the majority of the states in the dataset, individuals report more than 7 hours of sleep. Lastly, the data also support the hypothesis that a diagnosis of diabetes is more common among individuals with high BMI levels.

All three hypotheses postulated within the research questions are plausible at an intuitive level. In an observational study of this nature, it would be unwise to infer causal relationships between the variables studied; that would require a detailed experimental protocol in which one might be better able to calculate the statistical significance of these correlations. Nevertheless, it stands to reason that individuals with higher income levels are in a position to afford a better quality of life, with more time for exercise and adequate sleep. Likewise, higher income levels would very likely allow individuals to afford better health care. It is also well established that obesity is correlated with an increased probability of being diagnosed with diabetes, and the data within this study support that hypothesis.