library(ggplot2)
library(dplyr)Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called brfss2013. Delete this note when before you submit your work.
load("brfss2013.RData")For this project we are going to consider just the BRFSS data from 2013. Lets look into the process used by BRFSS to collect sample data. Its extensive description can be found in this article BRFSS Overview.
nrow(brfss2013)## [1] 491775
brfss2013 %>% group_by(iyear) %>% summarize(n=n())## # A tibble: 3 x 2
## iyear n
## <fctr> <int>
## 1 2013 486088
## 2 2014 5682
## 3 <NA> 5
we are going to be using only data from 2013 for this project.
data2013 <- brfss2013 %>% filter(iyear == 2013)
data2013 %>% group_by(iyear) %>% summarize(n=n())## # A tibble: 1 x 2
## iyear n
## <fctr> <int>
## 1 2013 486088
For sampling, the size of population in each region is taken into consideration. BRFSS used disproportionate stratified sample (DSS) design (disproportionately sampled geographic strata) to collect samples from interviews conducted both by landline telephone and cellular telephone. Also a simple random-sample design was used for smaller for Guam and Puerto Rico.
In 2012 97.5% of US households has telephone service. In first half of 2013, only 39.4 percent cell households had cell phone service. In 2013, BRFSS respondents who received 90 percent or more of their calls on cellular telephones were eligible for participation in the cellular telephone survey.
data2013 %>% group_by(cellfon3) %>% summarize(n_percent=n()/nrow(data2013))## # A tibble: 2 x 2
## cellfon3 n_percent
## <fctr> <dbl>
## 1 Not a cellular phone 0.7254427
## 2 <NA> 0.2745573
The above data shows that the 72.5% of the data collected are from telephone and 27.5% unknown (may be these are the cell phones). BRFSS resolved this un evenness with a weighting methodology called iterative proportional fitting (or IPF or raking). Such weighting serves as a blanket adjustment for noncoverage and nonresponse and forces the total number of cases to equal population estimates for each geographic region, which for the BRFSS sums to the state population. Regardless of state sample design, use of the final weight in analysis is necessary if users are to make generalizations from the sample to the population.
Raking adjusts the data so that groups underrepresented in the sample can be more accurately represented in the final data set. Raking allows for the incorporation of cellular telephone survey data; it permits the introduction of additional demographic characteristics and more accurately matches sample distributions to known demographic characteristics of populations. BRFSS raking includes categories of age by gender, detailed race and ethnicity groups, education levels, marital status, regions within states, gender by race and ethnicity, telephone source, renter/owner status, and age groups by race and ethnicity.
After the weights are calculate, BRFSS uses weight trimming to increase the value of extremely low weights and decrease the value of extremely high weights. The objective of weight trimming is to reduce errors in the outcome estimates caused by unusually high or low weights in some categories.
Given the above process and BRFSS expereince in sampling from 1984, its possible they have resolved many issues in collecting good random samples. The data can be considered good representation of the population.
The data used for this study
nrow(data2013)## [1] 486088
Since this includes only data from some geographical region, it can not be considered for generalizing over teh whole world population. This data is observational data and not experimental. Subjects are randomly assigned and are not selected for particular study groups. So we cannot attribute any causal relationship between variables in the study.
This project would analyze if there is any correlation between Food habits and Sleep time.
Variables for Food habits considered
fruit1: How Many Times Did You Eat Fruit?fvgreen: How Many Times Did You Eat Dark Green Vegetables?ssbsugar: How Often Do You Drink Regular Soda Or Pop?wtchsalt: Watching Sodium Or Salt Intakeavedrnk2: Avg Alcoholic Drinks Per Day In Past 30Sleep Time variable considered
sleptim1: How Much Time Do You SleepTo cross reference the results across different demographics we will be using variables from Demographics
marital: Marital Statuschildren: Number Of Children In Householdeduca: Education Levelemploy1: Employment Statusincome2: Income Levelsex: Respondents Sexpregnant: Pregnancy StatusResearch quesion 1:
Is there any relationship between sleeptime sleptim1 and eating fruits fruit1? How does gender sex affect these variables ?
Is there any relationship between sleeptime sleptim1 and eating green vegetables fvgreen? How does gender sex affect these variables ?
We are going to explore more variables to build the correlation between sleep time and food habits.
Research quesion 2:
Is there any relationship between sleeptime sleptim1 and drinking soda regularly ssbsugar?
Is there any relationship between sleeptime sleptim1 and watching sodium intake wtchsalt?
Research quesion 3:
Is there any relationship between sleeptime sleptim1 and average drinks in last 30 days avedrnk2? How does above specified demographics variables affect these variables ?
Lets not consider the sleep values that are NA and greater than or equal to 18. This is made from the assumption that long sleeping disorder is considered from 10 - 12 hours according to this sleep association article. Extra 12-18 hours are considered for not omitting outliers.
data2013_sleep <- data2013 %>% filter(!is.na(sleptim1)) %>% filter(sleptim1 <= 18)
nrow(data2013_sleep)## [1] 478668
ggplot(data = data2013_sleep, aes(x = sleptim1)) + geom_histogram(bins=30)summary(data2013_sleep$sleptim1)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.000 7.000 7.047 8.000 18.000
We can see the mean and median are very similar and its around 7 hours a day of sleep.
The variable fruit1, fvgreen, ssbsugar, occurs in a format where * first number denotes times per day (1), week (2), or month (3). * The remaining digits indicate the count of times.
Lets convert everything to the smallest unit that is number of time per day.
convert_to_days <- function(x) {
case_when(
x <= 100 ~ 0,
x == 300 ~ 1,
as.integer(x / 100) == 1 ~ (x %% 100),
as.integer(x / 100) == 2 ~ (x %% 100)/7,
as.integer(x / 100) == 3 ~ (x %% 100)/30
)
}Is there any relationship between sleeptime sleptim1 and eating fruits fruit1? Is there any relationship between sleeptime sleptim1 and eating veggies fvgreen?
Lets look into the variable data types that are going to be explored in this section
str(data2013$fruit1)## int [1:486088] 104 301 203 306 302 206 325 320 101 202 ...
str(data2013$fvgreen)## int [1:486088] 310 203 202 310 310 203 315 315 203 201 ...
Lets filter all the values that are NA from the new data data2013_sleep
rs1f <- data2013_sleep %>% filter(!is.na(fruit1))
rs1v <- data2013_sleep %>% filter(!is.na(fvgreen))
anyNA(rs1f$fruit1)## [1] FALSE
anyNA(rs1v$fvgreen)## [1] FALSE
nrow(rs1f)## [1] 447089
nrow(rs1v)## [1] 445743
Lets convert everything to the smallest unit that is number of time per day.
rs1f <- rs1f %>% mutate(fruit_per_day = convert_to_days(fruit1))
rs1v <- rs1v %>% mutate(veg_per_day = convert_to_days(fvgreen))
anyNA(rs1f$fruit_per_day)## [1] FALSE
anyNA(rs1v$veg_per_day)## [1] FALSE
Distribution of total number of fruits per day
ggplot(data = rs1f, aes(x = fruit_per_day)) + geom_histogram(bins=30)Distribution of total number of veggies per day
ggplot(data = rs1v, aes(x = veg_per_day)) + geom_histogram(bins=30)Sleep Relation with fruit and gender Lets calculate average fruit intake for each sleep group (group basd on number of hours slept)
rs1f_sleep = rs1f %>%
group_by(sleptim1) %>%
summarise(avg_fruit_intake=mean(fruit_per_day))
rs1f_sleep ## # A tibble: 18 x 2
## sleptim1 avg_fruit_intake
## <int> <dbl>
## 1 1 1.0073353
## 2 2 0.8604984
## 3 3 0.9014680
## 4 4 0.9137449
## 5 5 0.9258373
## 6 6 0.9688426
## 7 7 1.0788324
## 8 8 1.0508804
## 9 9 1.0427643
## 10 10 0.9270260
## 11 11 0.9147260
## 12 12 0.8555560
## 13 13 0.8087302
## 14 14 0.8630952
## 15 15 0.9377201
## 16 16 0.9007836
## 17 17 0.9255411
## 18 18 0.9020732
The dashed line represents the mean sleep hours.
ggplot(data = rs1f_sleep, aes(x = avg_fruit_intake, y = sleptim1)) +
geom_point() +
geom_hline(yintercept = 7, linetype = "dashed")We can see a slight pattern in the data where there is a positive linear relationship between the variables below 7 hours of sleep and negative relation above the mean sleep time
How does gender sex affect these variables ?
rs1f_sleep_gender = rs1f %>%
group_by(sex, sleptim1) %>%
summarise(avg_fruit_intake=mean(fruit_per_day))
rs1f_sleep_gender ## # A tibble: 36 x 3
## # Groups: sex [?]
## sex sleptim1 avg_fruit_intake
## <fctr> <int> <dbl>
## 1 Male 1 1.0916215
## 2 Male 2 0.8250748
## 3 Male 3 0.7950662
## 4 Male 4 0.8018544
## 5 Male 5 0.7990218
## 6 Male 6 0.8241396
## 7 Male 7 0.9042576
## 8 Male 8 0.8761468
## 9 Male 9 0.8758238
## 10 Male 10 0.8112197
## # ... with 26 more rows
The dashed line represents the mean sleep hours.
ggplot(data = rs1f_sleep_gender, aes(x = avg_fruit_intake, y = sleptim1, color = sex)) +
geom_point() +
geom_hline(yintercept = 7, linetype = "dashed")We can see the same pattern discussed above in the data where there is a positive linear relationship between the variables below 7 hours of sleep and negative relation above the mean sleep time. But also seems like females usually have larger portion of fruits than males.
Sleep Relation with veggies and gender Lets calculate average veggies intake for each sleep group (group basd on number of hours slept)
rs1v_sleep = rs1v %>%
group_by(sleptim1) %>%
summarise(avg_veg_intake=mean(veg_per_day))
rs1v_sleep ## # A tibble: 18 x 2
## sleptim1 avg_veg_intake
## <int> <dbl>
## 1 1 0.4715736
## 2 2 0.4472757
## 3 3 0.4788879
## 4 4 0.5123023
## 5 5 0.5197843
## 6 6 0.5375173
## 7 7 0.5822028
## 8 8 0.5698025
## 9 9 0.5518423
## 10 10 0.4784628
## 11 11 0.4646513
## 12 12 0.4333784
## 13 13 0.4399267
## 14 14 0.4444797
## 15 15 0.4909453
## 16 16 0.4346361
## 17 17 0.5817927
## 18 18 0.4027027
We do notice average veggie intake per day is less that fruits intake on average. The dashed line represents the mean sleep hours.
ggplot(data = rs1v_sleep, aes(x = avg_veg_intake, y = sleptim1)) +
geom_point() +
geom_hline(yintercept = 7, linetype = "dashed")We can see a linear positive relationship between average veggies intake to sleep hours below the mean sleep time and negative above the mean sleep time.
How does gender sex affect these variables ?
rs1v_sleep_gender = rs1v %>%
group_by(sex, sleptim1) %>%
summarise(avg_veg_intake=mean(veg_per_day))
rs1v_sleep_gender ## # A tibble: 36 x 3
## # Groups: sex [?]
## sex sleptim1 avg_veg_intake
## <fctr> <int> <dbl>
## 1 Male 1 0.4493239
## 2 Male 2 0.4101774
## 3 Male 3 0.4251140
## 4 Male 4 0.4684199
## 5 Male 5 0.4705434
## 6 Male 6 0.4848204
## 7 Male 7 0.5162709
## 8 Male 8 0.5035144
## 9 Male 9 0.4905388
## 10 Male 10 0.4468116
## # ... with 26 more rows
The dashed line represents the mean sleep hours.
ggplot(data = rs1v_sleep_gender, aes(x = avg_veg_intake, y = sleptim1, color = sex)) +
geom_point() +
geom_hline(yintercept = 7, linetype = "dashed")We can see a very similar pattern we found in realtionship with fruits
Lets look into the variable data types that are going to be explored in this section
str(data2013$ssbsugar)## int [1:486088] 305 203 202 203 NA NA NA NA NA NA ...
str(data2013$wtchsalt)## Factor w/ 2 levels "Yes","No": 1 2 2 1 NA NA NA NA NA NA ...
Lets filter all the values that are NA from the new data data2013_sleep
rs2su <- data2013_sleep %>% filter(!is.na(ssbsugar))
rs2sa <- data2013_sleep %>% filter(!is.na(wtchsalt))
anyNA(rs2su$ssbsugar)## [1] FALSE
anyNA(rs2sa$wtchsalt)## [1] FALSE
nrow(rs2su)## [1] 98852
nrow(rs2sa)## [1] 125334
Sleep Relation with sugar
Lets convert ssbsugar to the smallest unit that is number of time per day.
rs2su <- rs2su %>% mutate(soda_per_day = convert_to_days(ssbsugar))
anyNA(rs2su$soda_per_day)## [1] FALSE
Distribution of total number of fruits per day
ggplot(data = rs2su, aes(x = soda_per_day)) + geom_histogram(bins=30)Lets calculate average fruit intake for each sleep group (group basd on number of hours slept)
rs2su_sleep = rs2su %>%
group_by(sleptim1) %>%
summarise(avg_sugar_intake=mean(soda_per_day))
rs2su_sleep ## # A tibble: 18 x 2
## sleptim1 avg_sugar_intake
## <int> <dbl>
## 1 1 0.7800000
## 2 2 1.1189303
## 3 3 0.7780029
## 4 4 0.6487096
## 5 5 0.5622577
## 6 6 0.4049508
## 7 7 0.2887776
## 8 8 0.3044624
## 9 9 0.3201703
## 10 10 0.3919723
## 11 11 0.4345024
## 12 12 0.5537236
## 13 13 0.7618034
## 14 14 0.7540179
## 15 15 0.5172772
## 16 16 0.4696970
## 17 17 2.2678571
## 18 18 0.5868421
The dashed line represents the mean sleep hours.
ggplot(data = rs2su_sleep, aes(x = avg_sugar_intake, y = sleptim1)) +
geom_point() +
geom_hline(yintercept = 7, linetype = "dashed")we can see a negative relation with sleep hours and average sugar intake and a positive relation above the average sleep hours.
Sleep Relation with salt
rs2sa_sleep = rs2sa %>%
group_by(wtchsalt) %>%
summarise(mean_sleep=mean(sleptim1))
rs2sa_sleep ## # A tibble: 2 x 2
## wtchsalt mean_sleep
## <fctr> <dbl>
## 1 Yes 7.032822
## 2 No 7.027613
The difference in the sleep time between watching sugar intake and not watching doesnt seem to be much but we have to perform hypothesis testing to make a decision whether this difference is statistically significant
Lets look into the variable data types that are going to be explored in this section
str(data2013$marital)## Factor w/ 6 levels "Married","Divorced",..: 2 1 1 1 1 2 1 3 1 1 ...
str(data2013$children)## int [1:486088] 0 2 0 0 0 0 1 0 1 0 ...
str(data2013$educa)## Factor w/ 6 levels "Never attended school or only kindergarten",..: 6 5 6 4 6 6 4 5 6 4 ...
str(data2013$employ1)## Factor w/ 8 levels "Employed for wages",..: 7 1 1 7 7 1 1 7 7 5 ...
str(data2013$income2)## Factor w/ 8 levels "Less than $10,000",..: 7 8 8 7 6 8 NA 6 8 4 ...
str(data2013$sex)## Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
str(data2013$pregnant)## Factor w/ 2 levels "Yes","No": NA NA NA NA NA NA 2 NA NA NA ...
Lets convert children integer variable to factor variable. Also since the average alcohol drinks per day are in teh range of 1 to 76, letc make them factor or categorical variables to investigate further.
data2013$children <- factor(data2013$children)
data2013$avedrnk2 <- factor(data2013$avedrnk2)
str(data2013$children)## Factor w/ 20 levels "0","1","2","3",..: 1 3 1 1 1 1 2 1 2 1 ...
str(data2013$avedrnk2)## Factor w/ 48 levels "1","2","3","4",..: 2 NA 4 2 2 NA 1 1 1 NA ...
Lets make a function to investigate all the demographic variables. The function does the following 1. Removing all the NA rows in the dataframe. 2. groups by demographics and average drinks per day and calculated mean on the sleep time 3. we then plot the graph with demographic variables representing the color.
investigate_sleep <- function(df, col_name) {
rs3 <- df %>%
filter_(!is.na(col_name)) %>%
group_by_(col_name, "avedrnk2") %>%
summarise(avg_sleep_time=mean(sleptim1))
ggplot(data = rs3, aes_string(x = "avedrnk2", y = "avg_sleep_time", color = col_name)) +
geom_point() +
geom_hline(yintercept = 7, linetype = "dashed")
}All the graphs below show that on average with increase in alcohol intake per day, more the sleep time varies away from the average sleep time.
investigate_sleep(data2013_sleep, "marital")## Warning: Removed 7 rows containing missing values (geom_point).
investigate_sleep(data2013_sleep, "children")## Warning: Removed 18 rows containing missing values (geom_point).
investigate_sleep(data2013_sleep, "educa")## Warning: Removed 7 rows containing missing values (geom_point).
investigate_sleep(data2013_sleep, "employ1")## Warning: Removed 9 rows containing missing values (geom_point).
investigate_sleep(data2013_sleep, "income2")## Warning: Removed 9 rows containing missing values (geom_point).
investigate_sleep(data2013_sleep, "sex")## Warning: Removed 2 rows containing missing values (geom_point).
investigate_sleep(data2013_sleep, "pregnant")## Warning: Removed 3 rows containing missing values (geom_point).