Setup

Load packages

library(ggplot2)
library(dplyr)

Load data

Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called brfss2013. Delete this note when before you submit your work.

load("brfss2013.RData")

Part 1: Data

For this project we are going to consider just the BRFSS data from 2013. Lets look into the process used by BRFSS to collect sample data. Its extensive description can be found in this article BRFSS Overview.

nrow(brfss2013)
## [1] 491775
brfss2013 %>% group_by(iyear) %>% summarize(n=n())
## # A tibble: 3 x 2
##    iyear      n
##   <fctr>  <int>
## 1   2013 486088
## 2   2014   5682
## 3   <NA>      5

we are going to be using only data from 2013 for this project.

data2013 <- brfss2013 %>% filter(iyear == 2013)
data2013 %>% group_by(iyear) %>% summarize(n=n())
## # A tibble: 1 x 2
##    iyear      n
##   <fctr>  <int>
## 1   2013 486088

For sampling, the size of population in each region is taken into consideration. BRFSS used disproportionate stratified sample (DSS) design (disproportionately sampled geographic strata) to collect samples from interviews conducted both by landline telephone and cellular telephone. Also a simple random-sample design was used for smaller for Guam and Puerto Rico.

In 2012 97.5% of US households has telephone service. In first half of 2013, only 39.4 percent cell households had cell phone service. In 2013, BRFSS respondents who received 90 percent or more of their calls on cellular telephones were eligible for participation in the cellular telephone survey.

data2013 %>% group_by(cellfon3) %>% summarize(n_percent=n()/nrow(data2013))
## # A tibble: 2 x 2
##               cellfon3 n_percent
##                 <fctr>     <dbl>
## 1 Not a cellular phone 0.7254427
## 2                 <NA> 0.2745573

The above data shows that the 72.5% of the data collected are from telephone and 27.5% unknown (may be these are the cell phones). BRFSS resolved this un evenness with a weighting methodology called iterative proportional fitting (or IPF or raking). Such weighting serves as a blanket adjustment for noncoverage and nonresponse and forces the total number of cases to equal population estimates for each geographic region, which for the BRFSS sums to the state population. Regardless of state sample design, use of the final weight in analysis is necessary if users are to make generalizations from the sample to the population.

Raking adjusts the data so that groups underrepresented in the sample can be more accurately represented in the final data set. Raking allows for the incorporation of cellular telephone survey data; it permits the introduction of additional demographic characteristics and more accurately matches sample distributions to known demographic characteristics of populations. BRFSS raking includes categories of age by gender, detailed race and ethnicity groups, education levels, marital status, regions within states, gender by race and ethnicity, telephone source, renter/owner status, and age groups by race and ethnicity.

After the weights are calculate, BRFSS uses weight trimming to increase the value of extremely low weights and decrease the value of extremely high weights. The objective of weight trimming is to reduce errors in the outcome estimates caused by unusually high or low weights in some categories.

Given the above process and BRFSS expereince in sampling from 1984, its possible they have resolved many issues in collecting good random samples. The data can be considered good representation of the population.

The data used for this study

nrow(data2013)
## [1] 486088

Since this includes only data from some geographical region, it can not be considered for generalizing over teh whole world population. This data is observational data and not experimental. Subjects are randomly assigned and are not selected for particular study groups. So we cannot attribute any causal relationship between variables in the study.


Part 2: Research questions

This project would analyze if there is any correlation between Food habits and Sleep time.

Variables for Food habits considered

  1. Fruits and Vegetables
    • fruit1: How Many Times Did You Eat Fruit?
    • fvgreen: How Many Times Did You Eat Dark Green Vegetables?
  2. Sugar Drinks
    • ssbsugar: How Often Do You Drink Regular Soda Or Pop?
  3. Sodium or Salt-Related Behavior
    • wtchsalt: Watching Sodium Or Salt Intake
  4. Alcohol Consumption
    • avedrnk2: Avg Alcoholic Drinks Per Day In Past 30

Sleep Time variable considered

  1. Inadequate Sleep
    • sleptim1: How Much Time Do You Sleep

To cross reference the results across different demographics we will be using variables from Demographics

Research quesion 1:

Is there any relationship between sleeptime sleptim1 and eating fruits fruit1? How does gender sex affect these variables ?

Is there any relationship between sleeptime sleptim1 and eating green vegetables fvgreen? How does gender sex affect these variables ?

We are going to explore more variables to build the correlation between sleep time and food habits.

Research quesion 2:

Is there any relationship between sleeptime sleptim1 and drinking soda regularly ssbsugar?

Is there any relationship between sleeptime sleptim1 and watching sodium intake wtchsalt?

Research quesion 3:

Is there any relationship between sleeptime sleptim1 and average drinks in last 30 days avedrnk2? How does above specified demographics variables affect these variables ?


Part 3: Exploratory data analysis

Lets not consider the sleep values that are NA and greater than or equal to 18. This is made from the assumption that long sleeping disorder is considered from 10 - 12 hours according to this sleep association article. Extra 12-18 hours are considered for not omitting outliers.

data2013_sleep <- data2013 %>% filter(!is.na(sleptim1)) %>% filter(sleptim1 <= 18) 
nrow(data2013_sleep)
## [1] 478668
ggplot(data = data2013_sleep, aes(x = sleptim1)) + geom_histogram(bins=30)

summary(data2013_sleep$sleptim1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.000   7.000   7.047   8.000  18.000

We can see the mean and median are very similar and its around 7 hours a day of sleep.

The variable fruit1, fvgreen, ssbsugar, occurs in a format where * first number denotes times per day (1), week (2), or month (3). * The remaining digits indicate the count of times.

Lets convert everything to the smallest unit that is number of time per day.

convert_to_days <- function(x) {
  
  case_when(
  x                   <= 100   ~ 0,
  x                   == 300 ~ 1,
  as.integer(x / 100) == 1   ~ (x %% 100),
  as.integer(x / 100) == 2   ~ (x %% 100)/7,
  as.integer(x / 100) == 3   ~ (x %% 100)/30
  )
}

Research_quesions_SleepTime

rs1: Vs Fruits and Veggies

Is there any relationship between sleeptime sleptim1 and eating fruits fruit1? Is there any relationship between sleeptime sleptim1 and eating veggies fvgreen?

Lets look into the variable data types that are going to be explored in this section

str(data2013$fruit1)
##  int [1:486088] 104 301 203 306 302 206 325 320 101 202 ...
str(data2013$fvgreen)
##  int [1:486088] 310 203 202 310 310 203 315 315 203 201 ...

Lets filter all the values that are NA from the new data data2013_sleep

rs1f <- data2013_sleep %>% filter(!is.na(fruit1))
rs1v <- data2013_sleep %>% filter(!is.na(fvgreen))

anyNA(rs1f$fruit1)
## [1] FALSE
anyNA(rs1v$fvgreen)
## [1] FALSE
nrow(rs1f)
## [1] 447089
nrow(rs1v)
## [1] 445743

Lets convert everything to the smallest unit that is number of time per day.

rs1f <- rs1f %>% mutate(fruit_per_day = convert_to_days(fruit1))
rs1v <- rs1v %>% mutate(veg_per_day = convert_to_days(fvgreen))

anyNA(rs1f$fruit_per_day)
## [1] FALSE
anyNA(rs1v$veg_per_day)
## [1] FALSE

Distribution of total number of fruits per day

ggplot(data = rs1f, aes(x = fruit_per_day)) + geom_histogram(bins=30)

Distribution of total number of veggies per day

ggplot(data = rs1v, aes(x = veg_per_day)) + geom_histogram(bins=30)

Sleep Relation with fruit and gender Lets calculate average fruit intake for each sleep group (group basd on number of hours slept)

rs1f_sleep = rs1f %>% 
      group_by(sleptim1) %>%
      summarise(avg_fruit_intake=mean(fruit_per_day))
rs1f_sleep                       
## # A tibble: 18 x 2
##    sleptim1 avg_fruit_intake
##       <int>            <dbl>
##  1        1        1.0073353
##  2        2        0.8604984
##  3        3        0.9014680
##  4        4        0.9137449
##  5        5        0.9258373
##  6        6        0.9688426
##  7        7        1.0788324
##  8        8        1.0508804
##  9        9        1.0427643
## 10       10        0.9270260
## 11       11        0.9147260
## 12       12        0.8555560
## 13       13        0.8087302
## 14       14        0.8630952
## 15       15        0.9377201
## 16       16        0.9007836
## 17       17        0.9255411
## 18       18        0.9020732

The dashed line represents the mean sleep hours.

ggplot(data = rs1f_sleep, aes(x = avg_fruit_intake, y = sleptim1)) + 
  geom_point() + 
  geom_hline(yintercept = 7, linetype = "dashed")

We can see a slight pattern in the data where there is a positive linear relationship between the variables below 7 hours of sleep and negative relation above the mean sleep time

How does gender sex affect these variables ?

rs1f_sleep_gender = rs1f %>% 
                    group_by(sex, sleptim1) %>%
                    summarise(avg_fruit_intake=mean(fruit_per_day))
rs1f_sleep_gender                       
## # A tibble: 36 x 3
## # Groups:   sex [?]
##       sex sleptim1 avg_fruit_intake
##    <fctr>    <int>            <dbl>
##  1   Male        1        1.0916215
##  2   Male        2        0.8250748
##  3   Male        3        0.7950662
##  4   Male        4        0.8018544
##  5   Male        5        0.7990218
##  6   Male        6        0.8241396
##  7   Male        7        0.9042576
##  8   Male        8        0.8761468
##  9   Male        9        0.8758238
## 10   Male       10        0.8112197
## # ... with 26 more rows

The dashed line represents the mean sleep hours.

ggplot(data = rs1f_sleep_gender, aes(x = avg_fruit_intake, y = sleptim1, color = sex)) +
  geom_point() +
  geom_hline(yintercept = 7, linetype = "dashed")

We can see the same pattern discussed above in the data where there is a positive linear relationship between the variables below 7 hours of sleep and negative relation above the mean sleep time. But also seems like females usually have larger portion of fruits than males.

Sleep Relation with veggies and gender Lets calculate average veggies intake for each sleep group (group basd on number of hours slept)

rs1v_sleep = rs1v %>% 
      group_by(sleptim1) %>%
      summarise(avg_veg_intake=mean(veg_per_day))
rs1v_sleep                       
## # A tibble: 18 x 2
##    sleptim1 avg_veg_intake
##       <int>          <dbl>
##  1        1      0.4715736
##  2        2      0.4472757
##  3        3      0.4788879
##  4        4      0.5123023
##  5        5      0.5197843
##  6        6      0.5375173
##  7        7      0.5822028
##  8        8      0.5698025
##  9        9      0.5518423
## 10       10      0.4784628
## 11       11      0.4646513
## 12       12      0.4333784
## 13       13      0.4399267
## 14       14      0.4444797
## 15       15      0.4909453
## 16       16      0.4346361
## 17       17      0.5817927
## 18       18      0.4027027

We do notice average veggie intake per day is less that fruits intake on average. The dashed line represents the mean sleep hours.

ggplot(data = rs1v_sleep, aes(x = avg_veg_intake, y = sleptim1)) + 
  geom_point() +
  geom_hline(yintercept = 7, linetype = "dashed")

We can see a linear positive relationship between average veggies intake to sleep hours below the mean sleep time and negative above the mean sleep time.

How does gender sex affect these variables ?

rs1v_sleep_gender = rs1v %>% 
                    group_by(sex, sleptim1) %>%
                    summarise(avg_veg_intake=mean(veg_per_day))
rs1v_sleep_gender                       
## # A tibble: 36 x 3
## # Groups:   sex [?]
##       sex sleptim1 avg_veg_intake
##    <fctr>    <int>          <dbl>
##  1   Male        1      0.4493239
##  2   Male        2      0.4101774
##  3   Male        3      0.4251140
##  4   Male        4      0.4684199
##  5   Male        5      0.4705434
##  6   Male        6      0.4848204
##  7   Male        7      0.5162709
##  8   Male        8      0.5035144
##  9   Male        9      0.4905388
## 10   Male       10      0.4468116
## # ... with 26 more rows

The dashed line represents the mean sleep hours.

ggplot(data = rs1v_sleep_gender, aes(x = avg_veg_intake, y = sleptim1, color = sex)) +   
  geom_point() +
    geom_hline(yintercept = 7, linetype = "dashed")

We can see a very similar pattern we found in realtionship with fruits

rs2: Vs Sugar and Salt

Lets look into the variable data types that are going to be explored in this section

str(data2013$ssbsugar)
##  int [1:486088] 305 203 202 203 NA NA NA NA NA NA ...
str(data2013$wtchsalt)
##  Factor w/ 2 levels "Yes","No": 1 2 2 1 NA NA NA NA NA NA ...

Lets filter all the values that are NA from the new data data2013_sleep

rs2su <- data2013_sleep %>% filter(!is.na(ssbsugar))
rs2sa <- data2013_sleep %>% filter(!is.na(wtchsalt))

anyNA(rs2su$ssbsugar)
## [1] FALSE
anyNA(rs2sa$wtchsalt)
## [1] FALSE
nrow(rs2su)
## [1] 98852
nrow(rs2sa)
## [1] 125334

Sleep Relation with sugar

Lets convert ssbsugar to the smallest unit that is number of time per day.

rs2su <- rs2su %>% mutate(soda_per_day = convert_to_days(ssbsugar))
anyNA(rs2su$soda_per_day)
## [1] FALSE

Distribution of total number of fruits per day

ggplot(data = rs2su, aes(x = soda_per_day)) + geom_histogram(bins=30)

Lets calculate average fruit intake for each sleep group (group basd on number of hours slept)

rs2su_sleep = rs2su %>% 
      group_by(sleptim1) %>%
      summarise(avg_sugar_intake=mean(soda_per_day))
rs2su_sleep                       
## # A tibble: 18 x 2
##    sleptim1 avg_sugar_intake
##       <int>            <dbl>
##  1        1        0.7800000
##  2        2        1.1189303
##  3        3        0.7780029
##  4        4        0.6487096
##  5        5        0.5622577
##  6        6        0.4049508
##  7        7        0.2887776
##  8        8        0.3044624
##  9        9        0.3201703
## 10       10        0.3919723
## 11       11        0.4345024
## 12       12        0.5537236
## 13       13        0.7618034
## 14       14        0.7540179
## 15       15        0.5172772
## 16       16        0.4696970
## 17       17        2.2678571
## 18       18        0.5868421

The dashed line represents the mean sleep hours.

ggplot(data = rs2su_sleep, aes(x = avg_sugar_intake, y = sleptim1)) + 
  geom_point() + 
  geom_hline(yintercept = 7, linetype = "dashed")

we can see a negative relation with sleep hours and average sugar intake and a positive relation above the average sleep hours.

Sleep Relation with salt

rs2sa_sleep = rs2sa %>% 
      group_by(wtchsalt) %>%
      summarise(mean_sleep=mean(sleptim1))
rs2sa_sleep                       
## # A tibble: 2 x 2
##   wtchsalt mean_sleep
##     <fctr>      <dbl>
## 1      Yes   7.032822
## 2       No   7.027613

The difference in the sleep time between watching sugar intake and not watching doesnt seem to be much but we have to perform hypothesis testing to make a decision whether this difference is statistically significant

rs3: Vs Alcohol

Lets look into the variable data types that are going to be explored in this section

str(data2013$marital)
##  Factor w/ 6 levels "Married","Divorced",..: 2 1 1 1 1 2 1 3 1 1 ...
str(data2013$children)
##  int [1:486088] 0 2 0 0 0 0 1 0 1 0 ...
str(data2013$educa)
##  Factor w/ 6 levels "Never attended school or only kindergarten",..: 6 5 6 4 6 6 4 5 6 4 ...
str(data2013$employ1)
##  Factor w/ 8 levels "Employed for wages",..: 7 1 1 7 7 1 1 7 7 5 ...
str(data2013$income2)
##  Factor w/ 8 levels "Less than $10,000",..: 7 8 8 7 6 8 NA 6 8 4 ...
str(data2013$sex)
##  Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
str(data2013$pregnant)
##  Factor w/ 2 levels "Yes","No": NA NA NA NA NA NA 2 NA NA NA ...

Lets convert children integer variable to factor variable. Also since the average alcohol drinks per day are in teh range of 1 to 76, letc make them factor or categorical variables to investigate further.

data2013$children <- factor(data2013$children)
data2013$avedrnk2 <- factor(data2013$avedrnk2)

str(data2013$children)
##  Factor w/ 20 levels "0","1","2","3",..: 1 3 1 1 1 1 2 1 2 1 ...
str(data2013$avedrnk2)
##  Factor w/ 48 levels "1","2","3","4",..: 2 NA 4 2 2 NA 1 1 1 NA ...

Lets make a function to investigate all the demographic variables. The function does the following 1. Removing all the NA rows in the dataframe. 2. groups by demographics and average drinks per day and calculated mean on the sleep time 3. we then plot the graph with demographic variables representing the color.

investigate_sleep <- function(df, col_name) {
  rs3 <- df %>%
         filter_(!is.na(col_name)) %>%
         group_by_(col_name, "avedrnk2") %>%
         summarise(avg_sleep_time=mean(sleptim1))
  ggplot(data = rs3, aes_string(x = "avedrnk2", y = "avg_sleep_time", color = col_name)) +   
    geom_point() +
    geom_hline(yintercept = 7, linetype = "dashed")
}

All the graphs below show that on average with increase in alcohol intake per day, more the sleep time varies away from the average sleep time.

investigate_sleep(data2013_sleep, "marital")
## Warning: Removed 7 rows containing missing values (geom_point).

investigate_sleep(data2013_sleep, "children")
## Warning: Removed 18 rows containing missing values (geom_point).

investigate_sleep(data2013_sleep, "educa")
## Warning: Removed 7 rows containing missing values (geom_point).

investigate_sleep(data2013_sleep, "employ1")
## Warning: Removed 9 rows containing missing values (geom_point).

investigate_sleep(data2013_sleep, "income2")
## Warning: Removed 9 rows containing missing values (geom_point).

investigate_sleep(data2013_sleep, "sex")
## Warning: Removed 2 rows containing missing values (geom_point).

investigate_sleep(data2013_sleep, "pregnant")
## Warning: Removed 3 rows containing missing values (geom_point).

End tabset