Setup

Load packages

library(ggplot2)
library(dplyr)
library(reshape2)

Load data

load("brfss2013.RData")

Part 1: Data

The Behavioral Risk Factor Surveillance System (BRFSS) collects state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. The objective of the BRFSS is to collect uniform data by means of landline and cellular telephone surveys. The dataset brfss2013 consists of:

The sample collection methodology: In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.

Data collection method implications: Because the sample was obtained from either a randomly selected adult in a household reached by landline telephone or a randomly selected cellular telephone user who resides in a private residence or college housing, we cannot generalize the results to the entire U.S. population. The population was divided into homogeneous strata, and respondents were then randomly sampled from within each stratum. U.S. residents with no landline or cellular telephone are not taken into account.

Scope of inference: Within each stratum every subject is equally likely to be selected, so the sample is representative of the population from which it is drawn (landline or cellular telephone users). Because there is no random assignment, the groups being compared are not essentially the same, and causal conclusions cannot be drawn.
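The idea of sampling at random within strata can be sketched in a few lines of R. This is only an illustration under stated assumptions: the `population` data frame, its `stratum` labels, and the per-stratum sample size of 5 are all hypothetical, and the sketch is not the actual BRFSS sampling procedure.

```r
library(dplyr)

set.seed(1)  # make the random draw reproducible

# Hypothetical population with three strata of unequal size
population <- data.frame(
  id      = 1:60,
  stratum = rep(c("A", "B", "C"), times = c(10, 20, 30))
)

# Within each stratum, every unit has the same chance of being drawn
stratified <- population %>%
  group_by(stratum) %>%
  slice_sample(n = 5) %>%
  ungroup()

# Number of sampled units per stratum
count(stratified, stratum)
```

Units that never appear in `population` (here, the analogue of residents without a telephone) can never be drawn, which is why the results generalize only to the sampled population.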

In short, this is an observational study: the results can be generalized to the sampled population (telephone users), but they cannot support causal conclusions.


Part 2: Research questions

Research question 1: First, we ask whether there is a relationship between sleep habits and poor quality of life. The variables could be negatively correlated; in other words, as hours of sleep increase, the number of days of poor physical/mental health decreases. We will use the following variables:

Are hours of sleep related to the number of days of poor physical/mental health?
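One simple way to check the direction of such a relationship is a correlation coefficient. The sketch below uses made-up values in a hypothetical `toy` data frame (not BRFSS data) and Pearson's correlation from base R, which is only one of several possible measures.

```r
# Made-up values for illustration only (not BRFSS data)
toy <- data.frame(
  sleep_hours = c(4, 5, 6, 7, 8, 9),
  bad_days    = c(12, 9, 8, 4, 3, NA)
)

# Pearson correlation, ignoring rows with missing values
r <- cor(toy$sleep_hours, toy$bad_days, use = "complete.obs")
r
```

A negative value of `r` would be consistent with the hypothesis that more sleep goes with fewer days of poor health, although in an observational study it says nothing about causation.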

Research question 2: Second, we are interested in general health status in relation to the number of adults in the household and the gender of the respondent. Health status could change depending on the number of adults in the household and on whether the respondent is male or female. More than two adults in a household could reduce the perception of good health. We will consider the following variables:

Does perceived general health change depending on the number of adults in the household and the gender of the respondent?

Research question 3: Third, we ask whether physical activity has any impact on anxiety, and which types of physical activity are associated with a lower or higher average number of anxious days. We will consider the following variables:

Does physical activity reduce the average number of anxious days? And if the answer is yes, which sort of activity reduces anxiety the most?


Part 3: Exploratory data analysis

Research question 1: Are hours of sleep related to the number of days of poor physical/mental health?

# Let's select the variables we will work with and then summarise them.
sleep <- brfss2013 %>% select(sleptim1,physhlth,menthlth)
summary(sleep)
##     sleptim1          physhlth         menthlth       
##  Min.   :  0.000   Min.   : 0.000   Min.   :   0.000  
##  1st Qu.:  6.000   1st Qu.: 0.000   1st Qu.:   0.000  
##  Median :  7.000   Median : 0.000   Median :   0.000  
##  Mean   :  7.052   Mean   : 4.353   Mean   :   3.383  
##  3rd Qu.:  8.000   3rd Qu.: 3.000   3rd Qu.:   2.000  
##  Max.   :450.000   Max.   :60.000   Max.   :5000.000  
##  NA's   :7387      NA's   :10957    NA's   :8627

Before we can make any plots, we need to clean up the data. The variables contain likely outliers: sleep time cannot exceed 24 hours per day, and days of poor physical/mental health cannot exceed 30 days per month.

# Removing possible outliers (filter() also drops rows where any of these values is NA)
sleep <- sleep %>% filter(sleptim1<=24, menthlth<=30, physhlth<=30)

# Group the data by sleeping hours and calculate the mean number of days of poor physical/mental health
sleep <- sleep %>% group_by(sleptim1) %>%
        summarise(physic=mean(physhlth,na.rm=TRUE),mental=mean(menthlth, na.rm=TRUE))

# Stack the physic and mental columns into a single column
sleep <- sleep %>% melt(id.vars=c("sleptim1"),variable.name="TYPE",value.name="DAYS")

# Create a scatterplot. We drop the point at 23 hours due to its unusual value of 30 average days.
ggplot(data=filter(sleep, DAYS!=30), aes(x=sleptim1, y=DAYS, color=TYPE)) + geom_point(size=2) + 
        labs(title="Sleeping Hours - Average Days Health Not Good",
             y="Average Days Health Not Good",x="Sleeping Hours")

Comments:

Research question 2: Does perceived general health change depending on the number of adults in the household and the gender of the respondent?

# Let's select the variables we will work with
household <- brfss2013 %>% select(genhlth,sex,numadult)

# Removing the NA values from each variable
household <- household %>% filter(complete.cases(household))

We are now working with categorical variables, so we need to calculate the frequency of the answers rated poor, fair, good, very good, and excellent. For a clearer visualization, we also restrict the data to households with fewer than 5 adults.

# Create the frequency table
household <- as.data.frame(table(household))

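# Filter for only representative values (fewer than 5 adults).
# numadult is a factor after table(), so convert via character to compare the actual values
household <- household %>% filter(as.integer(as.character(numadult))<5)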
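When a count is stored as a factor, as numadult is after table(), as.integer() returns the internal level codes rather than the printed labels. The two agree only when the levels happen to be 1, 2, 3, ... in order. A small sketch with made-up values (the vector `x` is hypothetical):

```r
# Made-up factor whose labels are numbers stored as text
x <- factor(c("2", "4", "10"))

# Level codes: positions in levels(x), not the labels themselves
codes <- as.integer(x)

# Actual values: convert to character first, then to integer
values <- as.integer(as.character(x))

levels(x)
codes
values
```

Converting through as.character() keeps the comparison with 5 tied to the reported number of adults rather than to the ordering of the factor levels.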

# Create a barplot
ggplot(household, aes(fill=sex,y=Freq/1e3,x=genhlth)) + geom_bar(position="stack", stat="identity") + 
        ggtitle("General Health According to Adults in Household") + ylab("Frequency of responses (K)") + xlab("General Health") + 
        facet_wrap( ~numadult , ncol=2) + 
        theme(plot.title = element_text(hjust = 0.5)) +
        theme(axis.text.x = element_text(angle=30, vjust=1, hjust=1))

Comments:

Research question 3: Does physical activity reduce the average number of anxious days? And if the answer is yes, which sort of activity reduces anxiety the most?

# Let's select the variables we will work with
anxiety <- brfss2013 %>% select(exerhmm1,qlstres2,exract11) 
anxiety <- anxiety %>% filter(complete.cases(anxiety))
names(anxiety) <- c("exercise","anxiety","type")

# Convert the observations to character so that new factor levels can be created
anxiety$type <- as.character(anxiety$type)

# New factor levels
strength <- c("machine|weight")
cardio <- c("dance|aerobics|bicycling|dancing-ballet|jogging|running|walking|hiking")
sports <- c("basketball|calisthenics|softball")
focus <- c("carpentry|fishing|gardening|golf|household|hunting|yard|yoga|horseback")


# Search matches 
anxiety$type <- case_when(grepl(strength, tolower(anxiety$type)) ~ "strength",
          grepl(sports, tolower(anxiety$type))   ~ "sports",
          grepl(focus, tolower(anxiety$type))    ~ "focus",
          grepl(cardio, tolower(anxiety$type))   ~ "cardio",
          tolower(anxiety$type)=="other" ~ "other")
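To see how this kind of pattern matching behaves, including for activities that match none of the patterns, here is a small sketch. The `activity` labels and the shortened patterns are hypothetical and chosen only for illustration; they are not the values found in exract11.

```r
library(dplyr)

# Hypothetical activity labels, for illustration only
activity <- c("Weight lifting", "Yoga", "Running", "Other", "Swimming")

# Conditions are checked in order; the first match wins
grouped <- case_when(
  grepl("machine|weight", tolower(activity))  ~ "strength",
  grepl("yoga|gardening", tolower(activity))  ~ "focus",
  grepl("running|walking", tolower(activity)) ~ "cardio",
  tolower(activity) == "other"                ~ "other"
)

# Count each group, keeping unmatched labels visible as NA
table(grouped, useNA = "ifany")
```

Labels that match no condition (here "Swimming") come back as NA, so counting with useNA = "ifany" is a quick way to see how many observations fall outside the chosen groups.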

To simplify the analysis, we group the variable exract11 into 5 categories instead of the 26 unique values that exist. The activity types grouped together are quite varied in some cases, but we can assume the following factors:

# Create a scatterplot
ggplot(data=anxiety, aes(x=exercise, y=anxiety, color=type)) + geom_point(size=2) + 
        labs(title="Minutes of Physical Activity - Anxious Days",
             y="Days Felt Anxious In Past 30 Days",x="Minutes of Physical Activity") + ylim(0,35)
## Warning: Removed 1 rows containing missing values (geom_point).

Comments: