Exploring the BRFSS data

Setup

I will set up the working directory

setwd("D:/Git/StatsR/Prob")

Load data

load('brfss2013.RData')

Load packages

library(ggplot2)
library(dplyr)

Part 1: Data

Background

The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC). The BRFSS is administered and supported by CDC’s Population Health Surveillance Branch, under the Division of Population Health at the National Center for Chronic Disease Prevention and Health Promotion. BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. The BRFSS was initiated in 1984, with 15 states collecting surveillance data on risk behaviors through monthly telephone interviews. Over time, the number of states participating in the survey increased; by 2001, 50 states, the District of Columbia, Puerto Rico, Guam, and the US Virgin Islands were participating in the BRFSS. Today, all 50 states, the District of Columbia, Puerto Rico, and Guam collect data annually and American Samoa, Federated States of Micronesia, and Palau collect survey data over a limited point-in-time (usually one to three months).

Randomness, Generalization and Causality

This data appears to have been collected through telephone interviews using a stratified random sampling design. The sample is large, yet it is still a small fraction of the US adult population, so we can treat the sampling as random and the observations as approximately independent of each other.

Hence observations and inferences from this sample can be generalized to the population of interest, the non-institutionalized adults residing in the US.

However, this is an observational study based on random sampling, not an experiment with random assignment. Therefore, irrespective of what the analysis shows, no causal statements can be made using this data.

Data Pre Processing

We will evaluate the current data set that we loaded

dim(brfss2013)

## [1] 491775    330

This data has 491,775 observations of 330 variables.

Next, I check the completeness of all the variables.

For each column (variable), I compute the share of missing values (NAs) and compare it with a threshold of 50% of the observations, counting the columns that fall below that threshold.

# Share of missing values (NAs) in each column
x <- sapply(brfss2013, function(col) mean(is.na(col)))
# Flag the columns in which fewer than 50% of the values are missing
y <- ifelse(x < 0.5, 1, 0)
sum(y)

## [1] 193

There are 193 columns where missing values (NAs) make up less than 50% of the observations. The remaining 137 of the 330 columns are missing in 50% or more of the observations.
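The per-column share of missing values can also be computed in a vectorised way, which makes it easy to list the affected columns by name. The following is a minimal sketch on a small made-up data frame; `toy` and its values are hypothetical and are not BRFSS data.

```r
# Toy data frame standing in for brfss2013 (hypothetical values)
toy <- data.frame(
  a = c(1, NA, 3, 4),        # 25% missing
  b = c(NA, NA, NA, 4),      # 75% missing
  c = c("x", "y", NA, NA)    # 50% missing
)

# Share of missing values in each column
na_share <- colMeans(is.na(toy))

# Split the column names at the 50% threshold
mostly_complete <- names(na_share)[na_share < 0.5]
mostly_missing  <- names(na_share)[na_share >= 0.5]

length(mostly_complete)   # number of columns with less than 50% missing
mostly_missing            # names of the columns with 50% or more missing
```

Keeping the names rather than only a count makes it possible to drop or inspect the sparse columns later.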

After going through the data, I will work with a subset of this data set to frame my research questions and to try to answer them.

Data set: This contains the number of days unwell in the last 30 days (mental and physical), the respondent's self-rated general health, the sex of the respondent, the time the respondent spends on exercise, and the hours of sleep.

db1 <- data.frame(brfss2013$genhlth, brfss2013$menthlth, 
                  brfss2013$physhlth, brfss2013$sex, brfss2013$sleptim1,
                  brfss2013$exerhmm2)
sleep <- subset(db1,(complete.cases(db1)))
colnames(sleep) <- c("GenHealth","MenHealth", "PhyHealth", "Sex", "Sleep", "Excercise")
head(sleep)

##    GenHealth MenHealth PhyHealth    Sex Sleep Excercise
## 2       Good         0         0 Female     6        10
## 6  Very good         0         0 Female     8        30
## 10      Good         0         0 Female     8       100
## 11      Good         1         0   Male     6       200
## 14      Good         2         0 Female     8        45
## 15 Very good         0         0   Male     5        40
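The Excercise values shown above (10, 30, 100, 200, ...) come from exerhmm2. If that variable is coded as hours and minutes in HMM form (so that 130 would mean 1 hour 30 minutes), which is an assumption to check against the BRFSS codebook, then 100 stands for 60 minutes and the raw values are not on a linear scale. A sketch of a conversion under that assumption, where `hmm_to_minutes` is a hypothetical helper name:

```r
# Hypothetical helper: convert an HMM-coded duration to minutes.
# ASSUMPTION: the last two digits are minutes and the leading digit(s) are hours.
hmm_to_minutes <- function(hmm) {
  hours   <- hmm %/% 100   # integer division gives the hours part
  minutes <- hmm %% 100    # remainder gives the minutes part
  hours * 60 + minutes
}

# The six Excercise values from the head() output above
hmm_to_minutes(c(10, 30, 100, 200, 45, 40))
# 10 30 60 120 45 40
```

If the assumption holds, the converted variable would be the better input for any later numeric summary of exercise time.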

Part 2: Research questions

Research question 1:

Is people's perception of their general health related to the number of days they have been physically unwell in the last 30 days, and does that relationship depend on the gender of the respondent?

I want to find out whether men and women perceive their health differently, and whether that perception is related to the number of days they have been unwell in the last 30 days. Here I am considering physical health.

Research question 2:

In general, do people who sleep less tend to have a higher number of days with bad mental health, and does that also depend on gender?

We will explore whether the number of hours a person sleeps is associated with the number of unwell days as far as mental health is concerned.

Research question 3:

Is there any association between these 4 variables?

- Number of unwell days (mental)
- Number of unwell days (physical)
- Amount of sleep the respondent gets on average
- Amount of time the respondent spends on exercise

Part 3: Exploratory data analysis


Research question 1:

aggregate(sleep$PhyHealth, by = list(sleep$Sex, sleep$GenHealth), FUN = mean) # Calculating the Statistic

##    Group.1   Group.2          x
## 1     Male Excellent  0.7366589
## 2   Female Excellent  0.7772581
## 3     Male Very good  1.1502042
## 4   Female Very good  1.4304096
## 5     Male      Good  2.4841394
## 6   Female      Good  3.0650094
## 7     Male      Fair  8.6053600
## 8   Female      Fair 10.1962053
## 9     Male      Poor 21.2867946
## 10  Female      Poor 21.5208333
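The columns in the table above are labelled Group.1, Group.2 and x because the grouping variables were passed as an unnamed list. A minimal sketch of the formula interface of `aggregate`, which keeps the original column names, on a small made-up data frame (`toy` and its values are hypothetical):

```r
# Toy stand-in for the sleep data frame (hypothetical values)
toy <- data.frame(
  Sex       = c("Male", "Female", "Male", "Female", "Male", "Female"),
  GenHealth = c("Good", "Good", "Poor", "Poor", "Good", "Good"),
  PhyHealth = c(2, 4, 20, 22, 4, 2)
)

# Mean number of physically unwell days for each Sex x GenHealth combination;
# the result has columns named Sex, GenHealth and PhyHealth
phy_means <- aggregate(PhyHealth ~ Sex + GenHealth, data = toy, FUN = mean)
phy_means
```

The same call with `data = sleep` should reproduce the means shown above with more readable headers.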

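# Box plots of physically unwell days by self-rated general health
g <- ggplot(data = sleep, aes(x = GenHealth, y = PhyHealth))
g <- g + geom_boxplot(aes(fill = GenHealth))
# One panel per sex; refer to the column by name rather than sleep$Sex
g <- g + facet_grid(. ~ Sex)
g <- g + theme(axis.text.x = element_text(size = 7))
g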

From the chart as well as from the summary statistics, we see that people who rate their general health lower report more days of feeling physically unwell in the last 30 days: the average rises from under 1 day for "Excellent" to more than 21 days for "Poor". Women report slightly more unwell days than men in every category, but the difference between the sexes is small compared with the difference between health categories.

For both genders, perceived general health worsens as the number of physically unwell days in the last 30 days increases.

Research question 2:

To complete this analysis, I first create another variable that categorises sleep.

I want to use the hours of sleep as a factor variable, so I create a factor called sbucket that groups sleep hours into ranges. Each range includes its lower bound and excludes its upper bound, so "4-6" covers 4 up to (but not including) 6 hours, and the last level, " > 12", covers 12 hours or more.

sleep$sbucket <- ifelse(sleep$Sleep < 4,"< 4", 
                        ifelse(sleep$Sleep >=4 & sleep$Sleep < 6, "4-6",
                               ifelse(sleep$Sleep >=6 & sleep$Sleep <8, "6-8",
                                      ifelse(sleep$Sleep >=8 & sleep$Sleep < 10, "8-10",
                                             ifelse(sleep$Sleep >=10 & sleep$Sleep <12, "10-12",
                                                    " > 12")))))
sleep$sbucket <- as.factor(sleep$sbucket)
head(sleep)

##    GenHealth MenHealth PhyHealth    Sex Sleep Excercise sbucket
## 2       Good         0         0 Female     6        10     6-8
## 6  Very good         0         0 Female     8        30    8-10
## 10      Good         0         0 Female     8       100    8-10
## 11      Good         1         0   Male     6       200     6-8
## 14      Good         2         0 Female     8        45    8-10
## 15 Very good         0         0   Male     5        40     4-6
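The same bucketing can be expressed with base R's `cut`, which avoids the nested `ifelse` calls and produces factor levels in order of sleep duration. This is a sketch on made-up values; `hours` is hypothetical, and the last label is written here as ">= 12" to reflect that 12 itself falls in that bucket.

```r
# Hypothetical sleep durations in hours
hours <- c(3, 4, 5.5, 6, 8, 10, 12, 15)

sbucket <- cut(
  hours,
  breaks = c(-Inf, 4, 6, 8, 10, 12, Inf),
  labels = c("< 4", "4-6", "6-8", "8-10", "10-12", ">= 12"),
  right  = FALSE   # intervals are [lower, upper), matching the ifelse chain
)

levels(sbucket)   # levels follow the order of the breaks
table(sbucket)    # number of values in each bucket
```

Because the levels are already ordered, summaries and plots built on such a factor list the buckets from shortest to longest sleep without any manual reordering.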

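Now that the variable is created, we explore the data. The groups in the table below appear in the alphabetical order of the factor levels, not in order of sleep duration.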

aggregate(sleep$MenHealth, by = list(sleep$Sex, sleep$sbucket), FUN = mean)

##    Group.1 Group.2         x
## 1     Male    > 12  5.594203
## 2   Female    > 12  6.774194
## 3     Male     < 4 10.572961
## 4   Female     < 4 11.276271
## 5     Male   10-12  3.275437
## 6   Female   10-12  4.549836
## 7     Male     4-6  5.123507
## 8   Female     4-6  6.627144
## 9     Male     6-8  1.874250
## 10  Female     6-8  2.766575
## 11    Male    8-10  1.447844
## 12  Female    8-10  2.117616

The chart

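# Box plots of mentally unwell days by sleep bucket
g1 <- ggplot(data = sleep, aes(x = sbucket, y = MenHealth))
g1 <- g1 + geom_boxplot(aes(fill = sbucket))
# One panel per sex; refer to the column by name rather than sleep$Sex
g1 <- g1 + facet_grid(. ~ Sex)
# Order the buckets by sleep duration instead of alphabetically
g1 <- g1 + scale_x_discrete(limits = c("< 4","4-6","6-8","8-10","10-12"," > 12"))
g1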

Based on the summary statistics and the chart, the number of hours slept is clearly associated with the number of days of poor mental health.

People who get less than 4 hours of sleep report far more unwell days than any other group (about 10.6 days for men and 11.3 for women).

People who get 8-10 hours of sleep report the fewest unwell days, and the averages rise again for those sleeping 10 hours or more.

The pattern is the same for both genders, although women report more unwell days than men in every sleep bucket.

We calculate a summary statistic for this.

aggregate(sleep$MenHealth, by = list(sleep$Sex), FUN = mean)

##   Group.1        x
## 1    Male 2.068632
## 2  Female 2.919486

We see that women report a higher average number of unwell days than men (about 2.9 versus 2.1).

Research question 3:

For this analysis, I first construct a correlation matrix and then use it to create a correlation plot with the corrplot library.

library(corrplot)

## corrplot 0.84 loaded

mat <- cor(data.frame(sleep$MenHealth, sleep$PhyHealth, sleep$Sleep, sleep$Excercise))
mat

##                 sleep.MenHealth sleep.PhyHealth sleep.Sleep
## sleep.MenHealth     1.000000000     0.288961394 -0.12407950
## sleep.PhyHealth     0.288961394     1.000000000 -0.05933830
## sleep.Sleep        -0.124079504    -0.059338297  1.00000000
## sleep.Excercise     0.001833475    -0.006587753 -0.01239385
##                 sleep.Excercise
## sleep.MenHealth     0.001833475
## sleep.PhyHealth    -0.006587753
## sleep.Sleep        -0.012393851
## sleep.Excercise     1.000000000
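The output above wraps because building the data frame from `sleep$...` vectors produces long labels such as sleep.MenHealth. Selecting the columns by name keeps the labels short. The sketch below uses simulated data (the `toy` data frame and its distributions are hypothetical) and also computes a rank-based (Spearman) matrix, which may be worth comparing because day counts are skewed; treat that as an optional check, not as part of the original analysis.

```r
set.seed(1)

# Simulated stand-in for the sleep data frame (hypothetical values)
toy <- data.frame(
  MenHealth = rpois(200, 3),
  PhyHealth = rpois(200, 4),
  Sleep     = round(rnorm(200, mean = 7, sd = 1.5))
)

vars <- c("MenHealth", "PhyHealth", "Sleep")

# Selecting by name keeps the original column names as matrix labels
mat_pearson  <- cor(toy[, vars])
mat_spearman <- cor(toy[, vars], method = "spearman")

round(mat_pearson, 2)
round(mat_spearman, 2)
```

Either matrix can be passed to `corrplot` in the same way as `mat`.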

Now we construct a chart.

corrplot(mat, method = "number")

We see that apart from physical and mental health, the variables show a low degree of linear association, with almost no association between exercise and the other variables.

Even between physical and mental health the association is only weak to moderate (a correlation of about 0.29). We can infer that while we observe some level of association, neither of these is strongly dependent on the other.