In this study, we will use data extracted from The Behavioral Risk Factor Surveillance System - BRFSS2013.
As stated on their website (https://www.cdc.gov/brfss/), the BRFSS is the premier “system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services”.
The data are collected by landline telephone and cellular telephone within each of the states and territories. According to the 2013 Summary Data Quality Report, released on August 15, 2014, landline and cellular phone response rates achieved both a satisfactory level, when compared to other similar surveys. The same report states that BRFSS2013 sample was random and weighted by demographic characteristics of respondents to ensure accurate estimates for most measures.
Base on that information, we can say that this survey’s sample tends to be a good representation of the population and the results have a good degree of generabizability. On the other hand, we cannot talk about causality here. This is an observational study. There’s no intervention and, therefore, no random allocation between intervention and control groups. So, despite all the controls and the quality this study shows, there’s no way we can talk about causal relationships between the variables observerd.
As a preliminary approach, we are only going to focus on five variables in this study: 1) Reported sleep time, 2) Reported High levels of cholesterol, 3) Reported weights in pounds and, 4) Reported Height in pounds and inches and 5) sex
It’s also important to note that the studies that follow are only a preliminary exploratory data analysis. All statements should be treated with caution, and should only be read as an introductory analysis. Further conclusion, specially the inferential ones, should be postponed to when we are able to run more advanced analysis.
Research quesion 1: Is there any relationship between how much time someone sleeps and cholesterol levels?
Research quesion 2: Is there any relationship between sex and cholesterol levels?
Research quesion 3: What’s the relationship between cholesterol levels and obesity?
Before addressing the research questions, we will create a new data frame containing only the variables we are interested at the moment. We can do this by using the select function.
final_project <- brfss2013 %>%
select(sleptim1, toldhi2, sex, weight2, height3)
final_project %>%
head()## sleptim1 toldhi2 sex weight2 height3
## 1 NA Yes Female 250 507
## 2 6 No Female 127 510
## 3 9 No Female 160 504
## 4 8 Yes Female 128 504
## 5 6 No Male 265 600
## 6 8 Yes Female 225 503
Research question 1:
To begin addressing the first question, we will run some summaries, to discover basic information about our data. We will start by summarizing data about sleeping time. First, from all the sample, and then grouping by sex. We use the filter function to exclude NA values from sleptim1. If we don’t do this, the numerical functions won’t be able to display the desirable outcomes.
final_project %>%
filter(!(is.na(sleptim1))) %>%
summarise(slep_mean = mean(sleptim1), slep_median = median(sleptim1), slep_sd = sd(sleptim1),
slep_min = min(sleptim1), slep_max = max(sleptim1))## slep_mean slep_median slep_sd slep_min slep_max
## 1 7.052099 7 1.60411 0 450
final_project %>%
filter(!(is.na(sleptim1))) %>%
group_by(sex) %>%
summarise(slep_mean = mean(sleptim1), slep_median = median(sleptim1), slep_sd = sd(sleptim1),
slep_min = min(sleptim1), slep_max = max(sleptim1), count = n())## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 7
## sex slep_mean slep_median slep_sd slep_min slep_max count
## <fct> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 Male 7.03 7 1.45 1 24 198910
## 2 Female 7.06 7 1.48 1 24 285474
## 3 <NA> 140. 55.5 212. 0 450 4
The first data summarization shows that mean and median values are pretty close to each other. It also shows that we have some possible outilers recorded at max and min values. When we group the sample by sex, we can see those extreme values are associated to NA entries, maybe due to some filling mistake, and that data from men and women seem to be quite uniform on every parameter described by summarization.
Now we can use those values as baseline to investigate if the amount of sleep has any relationship with reported hi cholesterol. To do this, we can start by creating (mutate) a new variable indicating people who reported sleep routines bigger or smaller than our baseline. We will use the same pipe to exclude NA values from sex and baseline variables.
final_project <- final_project %>%
mutate(baseline = ifelse(sleptim1 < 7, "below 7", "at least 7"))
final_project %>%
filter(!(is.na(sex))) %>%
filter(!(is.na(baseline))) %>%
group_by(baseline)%>%
summarise( toldhi2 = length(toldhi2)) ## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## baseline toldhi2
## <chr> <int>
## 1 at least 7 325690
## 2 below 7 158694
According to data, people who slept 7 or more hours a day reported more hi cholesterol than the people who slept less than 7 hours a day.
We can do a similar analysis, but this time adding the variable sex to the dataframe.
final_project %>%
filter(!(is.na(sex))) %>%
filter(!(is.na(baseline))) %>%
group_by(sex, baseline)%>%
summarise( toldhi2 = length(toldhi2)) ## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups: sex [2]
## sex baseline toldhi2
## <fct> <chr> <int>
## 1 Male at least 7 133280
## 2 Male below 7 65630
## 3 Female at least 7 192410
## 4 Female below 7 93064
As we can see, sex dos not alter the trend of reporting hi cholesterol if the respondent has a sleep routine bigger than 7h a day.
Research question 2: To address research question 2, we are going to create a contingency table that can show us how sex and reported hi cholesterol are distributed on this sample. To do it, we group the data by sex and reported cholesterol and then we summarize the results asking the function to count the times men and women reported or not a high cholesterol. We end the pipe using the spread() function, from tidyverse package, in order to organize Men and Women data in separated columns.
final_project %>%
filter(!(is.na(sex))) %>%
group_by(toldhi2, sex ) %>%
summarise(n = n()) %>%
spread(sex, n)## `summarise()` regrouping output by 'toldhi2' (override with `.groups` argument)
## # A tibble: 3 x 3
## # Groups: toldhi2 [3]
## toldhi2 Male Female
## <fct> <int> <int>
## 1 Yes 74917 108584
## 2 No 92402 144210
## 3 <NA> 33994 37661
The first outcome we get is the great amount of NA entries. On question 1, we saw that NA entries could be excluded from the sample without many concerns. This time, however, NA occurrence can’t be ignored on further analysis. Another thing we can see on this tibble is that both men and women tended more to report a non-hi cholesterol level when asked about it.
We can see proportions for that numbers. That’s the outcome of the following pipe. According to it, the proportion of men and women that did not report hi cholesterol was 46% and 50% respectively. We can also see that the missing proportion is almost 17% for men and 13% for woman respondents.
Although we need further analysis, data seems to show no relationship between sex and hi cholesterol reporting, with both men and women showing similar outcomes. On the other hand, could this result be different if this data had less NA entries? We don’t have enough information to answer this.
final_project %>%
filter(!(is.na(sex))) %>%
count(toldhi2, sex) %>%
group_by(sex) %>%
mutate(prop = prop.table(n)) %>%
gather(measure, value, n, prop) %>%
unite(measure, measure, toldhi2) %>%
spread(measure, value)## # A tibble: 2 x 7
## # Groups: sex [2]
## sex n_NA n_No n_Yes prop_NA prop_No prop_Yes
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Male 33994 92402 74917 0.169 0.459 0.372
## 2 Female 37661 144210 108584 0.130 0.496 0.374
Research quesion 3:
To address question 3, we will create a new variable, Body Mass Index (BMI). This is the most commonly used index to measure obesity levels. According to WHO (shorturl.at/fpCHQ), the BMI is defined as a person’s weight in kilograms divided by the square of his height in meters (kg/m2). Note that our data is in pounds and feet. As we are going to use a BMI based on metric system inputs, we will first convert them before plugging in the values on the BMI formula.
## Warning: Problem with `mutate()` input `numweight2`.
## ℹ NAs introduced by coercion
## ℹ Input `numweight2` is `as.numeric(as.character((weight2)))`.
## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
final_project <- final_project %>%
mutate(numheight = as.numeric(as.character((height3))))
final_project <- final_project %>%
mutate(weight_KG = numweight2 /2.20462262185)
final_project <- final_project %>%
mutate(height_Mt = (height3/100)/3.2808)
final_project %>%
select(sex, weight_KG, height_Mt)%>%
head()## sex weight_KG height_Mt
## 1 Female 113.39809 1.545355
## 2 Female 57.60623 1.554499
## 3 Female 72.57478 1.536211
## 4 Female 58.05982 1.536211
## 5 Male 120.20198 1.828822
## 6 Female 102.05828 1.533163
Note that originally, weight and height are recorded as factors. We first asked R to read them as numeric, and only then we used this values to create new columns for kg and meter metrics. Now, we can go further and create some summaries on weight and height.
final_project %>%
filter(!(is.na(weight_KG))) %>%
summarise(wgt_mean = mean(weight_KG), wgt_median = median(weight_KG), wgt_sd = sd(weight_KG),
wgt_min = min(weight_KG), wgt_max = max(weight_KG))## wgt_mean wgt_median wgt_sd wgt_min wgt_max
## 1 87.93668 77.1107 179.1129 1.360777 4229.749
final_project %>%
filter(!(is.na(height_Mt))) %>%
summarise(hgt_mean = mean(height_Mt), hgt_median = median(height_Mt), hgt_sd = sd(height_Mt),
hgt_min = min(height_Mt), hgt_max = max(height_Mt))## hgt_mean hgt_median hgt_sd hgt_min hgt_max
## 1 1.679829 1.542307 1.642893 0.003048037 28.98378
At the first glance, we can see we have some strange data on this data frame. Min and max values are way out of the curve we would expect from a distribution of weights and heights for human beings. On the other hand, median and mean for both weight and height are not to apart from each other, and both fall under what we would expect to be a normal standard for human adults.
Making another step on our analysis, we can now go ahead and create our BMI index , using the formula BMI = kg/m^2.
final_project <- final_project %>%
mutate(bmi = weight_KG/(height_Mt^2))
final_project %>%
select(sex, weight_KG, height_Mt, bmi)%>%
head()## sex weight_KG height_Mt bmi
## 1 Female 113.39809 1.545355 47.48422
## 2 Female 57.60623 1.554499 23.83903
## 3 Female 72.57478 1.536211 30.75276
## 4 Female 58.05982 1.536211 24.60221
## 5 Male 120.20198 1.828822 35.93922
## 6 Female 102.05828 1.533163 43.41820
According to World Health Organization site (shorturl.at/bfpzG), we can use BMI to classify people in 3 categories: healthy (BMI<25), overweight (BMI >= 25) and obesity (BMI >=30). We will create another column on our working dataframe in order to classify our bmi values into those 3 categories.
final_project <- final_project %>%
mutate(condition = ifelse(bmi < 25, "healthy", ifelse(bmi >= 30, "obsesity", "overweight")))
final_project %>%
select(sex, toldhi2, weight_KG, height_Mt, bmi,condition) %>%
head()## sex toldhi2 weight_KG height_Mt bmi condition
## 1 Female Yes 113.39809 1.545355 47.48422 obsesity
## 2 Female No 57.60623 1.554499 23.83903 healthy
## 3 Female No 72.57478 1.536211 30.75276 obsesity
## 4 Female Yes 58.05982 1.536211 24.60221 healthy
## 5 Male No 120.20198 1.828822 35.93922 obsesity
## 6 Female Yes 102.05828 1.533163 43.41820 obsesity
Now we can cross obesity entries and reported hi cholesterol entries to finally adress our third question. We do it again using mutate and spread function.
## `summarise()` regrouping output by 'condition' (override with `.groups` argument)
## # A tibble: 3 x 5
## toldhi2 healthy obsesity overweight `<NA>`
## <fct> <int> <int> <int> <int>
## 1 Yes 19129 111863 44571 7938
## 2 No 39600 119580 66020 11412
## 3 <NA> 15608 31900 19752 4402
Again, the great number of NA entries on both toldhi2 and condition variables does not let us comfortable to look for patterns on this table. Nevertheless, it looks like that people with good BMI scores tend to report less high cholesterol. Unless, of course, all the healthy that haven’t reported their cholesterol (NA entries) would report a high cholesterol, than we would have to consider the opposite situation (good BMI associating with high cholesterol reports)
When we consider obesity and overweight, the situation is even less conclusive. Actually, just looking at the table does not allow us to think of any kind of association between hi reported levels of cholesterol and obesity on the sample at hand. Again, the great numbers of NA entries weakens the analysis here, but if can trust these numbers and generalize them to the population, it could even mean that people with obese and overweight BMIs scores would tend to report low cholesterol levels more often than they would tend to report hi level ones. Or, in a more conservative conclusion, that obesity and overweight conditions does not affect cholesterol levels.
Now that we cleaned up the data in a way that addresses the needs of the research questions we want to explore, we are ready to continue with our analysis.