First of all, I have to load the data, and the specific packages that we have been using in this course (as instructed).
library(ggplot2)
library(dplyr)load("brfss2013.RData")The Behavioral Risk Factor Surveillance System (BRFSS) is a system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year.States are required to conduct at least 2,500 interviews for each of the versions of the questionnaire in order to have enough responses for weighting purposes.
I assume that states are adopting a random sampling approach, which means that data is generalizable. However, since that data are observational, only correlations can be found among the data. There are no random assignment of tests between treatment groups and control groups, which means causal relations cannot be established.
The dimensions of the data are that there are 491775 observations (interviews) in the data on 330 variables. Most of these variables are answers to survey questions, but there are also computated variables.
dim(brfss2013)## [1] 491775 330
Research quesion 1: Is metal health related to Body Mass Index?
I chose metal health, as overweight is obviously related to physical health. However, I think it is interesting to also find out what the mental relation is.
The variables that I will investigate are:
-menthlth: number of days mental health not good in the past 30 days. The exact survey question was: Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
-X_bmi5cat: this is computated variable based on height and weight of a person
Research quesion 2: What is the effect on the amount of sleep per day on general health? Are there differences among gender?
I was interested in this question, as I think many people just don’t get enough sleep as they are too busy with work, household etc.
The variables that I will investigate are:
sleptime1: How much do you sleep?
genhlth: Would you say that in general your health is….(categories)
sex: male/female
Research quesion 3: What is the relation between education level and people eating fruit and vegetables?
I chose this as I think that people with a higher education pay more attention to healthy eating.
The variables that I will investigate are:
educa. Education level. This is a factoral variable with 6 levels.
x_frtlt1.This is a computated variable, and calculates if people eat fruit at least once a day.
x-veglt1. This is also a computed variable, and calcultes if people eat vegetables at least once a day.
Research quesion 1:The relation between BMI and Mental Health
First of all, I want to look at data types and the responses of the Body Mass Index variable.
str(brfss2013$X_bmi5cat)## Factor w/ 4 levels "Underweight",..: 4 1 3 2 4 4 2 NA 4 3 ...
brfss2013 %>%
group_by(X_bmi5cat) %>%
summarise(count=n())## # A tibble: 5 × 2
## X_bmi5cat count
## <fctr> <int>
## 1 Underweight 8267
## 2 Normal weight 154898
## 3 Overweight 167084
## 4 Obese 134799
## 5 NA 26727
So the X_bmi5cat is a factor variable, and it is also ordinal. It has actually 4 categories, and I will need to filter out the NA’s.
Now, I also want to investigate the Mental Health variable.
str(brfss2013$menthlth)## int [1:491775] 29 0 2 0 2 0 15 0 0 0 ...
Mental Health is an integer variable. Since Mental Health comprises many more different responses, I thought it is visually a much better idea to use a histogram to investigate the responses. However, before I do so, I find need to check for bad data. Since the survey question inquires about the mental health status over the past 30 days, any answers outside the 0-30 range should not be considered. As I also want to see if there are NA’s, I added selecting NA’s to the filter.
brfss2013 %>%
group_by(menthlth) %>%
summarise(count=n()) %>%
filter(menthlth>30|menthlth<0|is.na(menthlth))## # A tibble: 3 × 2
## menthlth count
## <int> <int>
## 1 247 1
## 2 5000 1
## 3 NA 8627
As you can see, there are 2 values recorded that make no sense, and need to be excluded from the analysis, together with the NA’s. First of all, I am going to calculated basic statistics with the values mentioned excluded.
brfss2013 %>%
filter(!(is.na(menthlth)|menthlth>30)) %>%
summarise(median_mh = median(menthlth), mean_mh = mean(menthlth), sd_mh = sd(menthlth), n = n())## median_mh mean_mh sd_mh n
## 1 0 3.371888 7.700926 483146
If I check the (n), I can see that indeed the 8629 values that I wanted to excluded are excluded indeed. Also, since the median is 0, at least half of the people report no days of mental illness (as expected). However, I actually only want to see the people who reported some mental illness. Therefore, I am also going to filter out the 0-days values, and look at the results in a histogram.As this did not work properly by filtering and chaining data to the histogram, I made a copy of the dataframe with clean data first.
clean_mh <- brfss2013 %>%
filter(!(is.na(menthlth)|menthlth>30|menthlth==0))
ggplot(data = clean_mh, aes(x=menthlth)) +
geom_histogram(binwidth=1)The result is clearly not a normal distribution, with many people reporting mental illness a few days, and a lot of people reporting feeling mentally ill all 30 days (that’s actually the mode now). Below, I am calculating the statistics with the 0 values excluded. As you can see, the median is now 5 days, and the people reporting any days of mental illness is about 30% of the sample (148685/491775)
clean_mh %>%
summarise(median_mh = median(menthlth), mean_mh = mean(menthlth), sd_mh = sd(menthlth), n = n())## median_mh mean_mh sd_mh n
## 1 5 10.95681 10.46903 148685
The last thing that I want to do is to investigate the correlation (if there is one) with the 4 BMI categories in a box plot. However, in order to do so, I first have to also clean the NA category in the BMI variable. What I am doing below is to overwrite clean_mh again, and this time only select the cleaned variables mental health and BMI.
clean_mh <- select(brfss2013,menthlth,X_bmi5cat) %>%
filter(!(is.na(menthlth)|menthlth>30|menthlth==0|is.na(X_bmi5cat))) ggplot(clean_mh, aes(x = X_bmi5cat, y = menthlth)) +
geom_boxplot()As you can see, the categoriess reporting more mental illness days are actually Obese and Underweight. People with a normal weight and even overweight seem to have more severe issues if they have any issues (>0 days mental illness). The IQR and Median of Obese and Underweight are very similar (same for normal and overweight). I cannot say anything about the statistical significance of these results yet, but I am sure that we are going to learn about this in the next course.
Update: I later realised that the analysis above only shows the severeness of issues. However, the answer to the research question also asks for a view of the proportions reporting issues versus no issues at all. Therefore, I also wanted to compose a bar chart that shows this. First, I create a dataframe that includes the 0-days responses again. Then, I created a variable “health”, which can take the values healthy (0-days) or issues (>0 days). The graph shows that Obese and Underweight people also report more issues (smaller % of people with 0-days of mental illness).
clean_mhw0 <- select(brfss2013,menthlth,X_bmi5cat) %>%
filter(!(is.na(menthlth)|menthlth>30|is.na(X_bmi5cat))) clean_mhw0 <- clean_mhw0 %>%
mutate(health=ifelse(menthlth==0,"healthy","issues"))ggplot(data = clean_mhw0, aes(x = X_bmi5cat, fill = health)) +
geom_bar(position="fill")Research quesion 2: Sleep and general health
What is the effect on the amount of sleep per day on general health? Are there differences among gender?
Again, I first want to find out what data types and responses I am dealing with. General health is a factoral and ordinal variable with 5 categories.
str(brfss2013$genhlth)## Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
Below, you can see that there are no starnge responses, and that I will only need to exclude the NAs in my analysis.
brfss2013 %>%
group_by(genhlth) %>%
summarise(count=n())## # A tibble: 6 × 2
## genhlth count
## <fctr> <int>
## 1 Excellent 85482
## 2 Very good 159076
## 3 Good 150555
## 4 Fair 66726
## 5 Poor 27951
## 6 NA 1985
Secondly, I also want to investigate sleptim1. As can be expected, Slept time is an integer, and should not exceed 24 hours a day.
str(brfss2013$sleptim1)## int [1:491775] NA 6 9 8 6 8 7 6 8 8 ...
Similar to what I did in the 1st research question, I first want to find out if there are any clearly impossible answers recorded (in this case >24 hours/day) and also see if there are any NAs.
brfss2013 %>%
group_by(sleptim1) %>%
summarise(count=n()) %>%
filter(sleptim1<0|sleptim1>24|is.na(sleptim1))## # A tibble: 3 × 2
## sleptim1 count
## <int> <int>
## 1 103 1
## 2 450 1
## 3 NA 7387
As you can see, the 103 and 450 hours a day are clearly mistakes/bad data. I now want to plot the feasible data in a histogram. To realize this, I am first making a new dataframe.
cleansleep <- brfss2013 %>%
filter(!(is.na(sleptim1)|sleptim1>24))
ggplot(data = cleansleep, aes(x=sleptim1)) +
geom_histogram(binwidth=1)As you can see, the time slept per night is symmetrical and seems normally distributed at first sight.Below, I have calculated the basic statistics.
cleansleep %>%
summarise(median_s = median(sleptim1), mean_s = mean(sleptim1), sd_s = sd(sleptim1), n = n())## median_s mean_s sd_s n
## 1 7 7.050986 1.465987 484386
I also computed the statistics using the general health categories as groups. As you can see, there seems to be a correlation with the mean by category (more sleep means better general health).
cleansleep %>%
group_by(genhlth) %>%
filter(!(is.na(genhlth))) %>%
summarise(median_s = median(sleptim1), mean_s = mean(sleptim1), sd_s = sd(sleptim1), n = n())## # A tibble: 5 × 5
## genhlth median_s mean_s sd_s n
## <fctr> <dbl> <dbl> <dbl> <int>
## 1 Excellent 7 7.189680 1.214332 84822
## 2 Very good 7 7.103533 1.206879 157833
## 3 Good 7 7.038402 1.442974 148299
## 4 Fair 7 6.896496 1.809592 65012
## 5 Poor 6 6.737152 2.391026 26639
In order to visualize this, I tried the boxplot again. However, The boxplot was not very helpful, as it displays mediums instead of means. Therfore, I tried a different graph. However, since the diferences in the means are tiny, this visualization didn’t add much either.
The additional question that I asked myself was if the is a difference among gender. First, I created a dataframe in which I only keep the general health categories (and exclude the NAs), and the sleeping time data (which is already cleaned in cleansleep), and sex (male/female, excluding the NAs).
cleanstats <-select(cleansleep,genhlth,sleptim1,sex) %>%
filter(!(is.na(genhlth)|is.na(sex)))The first thing that I did was to check was if there is a difference in hours of sleep between males and females. As you can see below, the are hardly any differences in the statistics.
cleanstats %>%
group_by(sex) %>%
summarise(median_s = median(sleptim1), mean_s = mean(sleptim1), sd_s = sd(sleptim1), n = n())## # A tibble: 2 × 5
## sex median_s mean_s sd_s n
## <fctr> <dbl> <dbl> <dbl> <int>
## 1 Male 7 7.030745 1.443894 198148
## 2 Female 7 7.064362 1.478621 284455
Lastly, I also checked if mapping sex and general health would show any significant differences. However, as you can see below the patterns are the same. For both males and females there seems to be some correlation between general health and amount of sleep. The only difference is that the correlation seems to be a little stronger among women (excellent: a little more sleep than man, poor: a little less sleep than men).
cleanstats %>%
group_by(genhlth,sex) %>%
summarise(median_s = median(sleptim1), mean_s = mean(sleptim1), sd_s = sd(sleptim1), n = n())## Source: local data frame [10 x 6]
## Groups: genhlth [?]
##
## genhlth sex median_s mean_s sd_s n
## <fctr> <fctr> <dbl> <dbl> <dbl> <int>
## 1 Excellent Male 7 7.132367 1.234058 35477
## 2 Excellent Female 7 7.231031 1.197839 49344
## 3 Very good Male 7 7.054906 1.191493 64711
## 4 Very good Female 7 7.137316 1.216321 93121
## 5 Good Male 7 7.030881 1.416651 62271
## 6 Good Female 7 7.043846 1.461717 86028
## 7 Fair Male 7 6.931316 1.779180 25377
## 8 Fair Female 7 6.874202 1.828472 39635
## 9 Poor Male 7 6.773371 2.423805 10312
## 10 Poor Female 6 6.714277 2.369878 16327
Research quesion 3: Relation between education and fruit and vegetables consumption
First of all, I am checking the structure and values of the variables that I have chosen to investigate. These are: - educa. Education level. This is a factoral variable with 6 levels. - x_frtlt1.This is a computated variable, and calculates if people eat fruit at least once a day. - x-veglt1. This is also a computed variable, and calcultes if people eat vegetables at least once a day.
str(brfss2013$educa)## Factor w/ 6 levels "Never attended school or only kindergarten",..: 6 5 6 4 6 6 4 5 6 4 ...
brfss2013 %>%
group_by(educa) %>%
summarise(count=n())## # A tibble: 7 × 2
## educa count
## <fctr> <int>
## 1 Never attended school or only kindergarten 677
## 2 Grades 1 through 8 (Elementary) 13395
## 3 Grades 9 though 11 (Some high school) 28141
## 4 Grade 12 or GED (High school graduate) 142971
## 5 College 1 year to 3 years (Some college or technical school) 134197
## 6 College 4 years or more (College graduate) 170120
## 7 NA 2274
str(brfss2013$X_frtlt1)## Factor w/ 2 levels "Consumed fruit one or more times per day",..: 1 2 2 2 2 1 1 2 1 2 ...
brfss2013 %>%
group_by(X_frtlt1) %>%
summarise(count=n())## # A tibble: 3 × 2
## X_frtlt1 count
## <fctr> <int>
## 1 Consumed fruit one or more times per day 291729
## 2 Consumed fruit less than one time per day 171343
## 3 NA 28703
str(brfss2013$X_veglt1)## Factor w/ 2 levels "Consumed vegetables one or more times per day",..: 2 1 1 1 1 1 1 1 1 1 ...
brfss2013 %>%
group_by(X_veglt1) %>%
summarise(count=n())## # A tibble: 3 × 2
## X_veglt1 count
## <fctr> <int>
## 1 Consumed vegetables one or more times per day 359834
## 2 Consumed vegetables less than one time per day 101777
## 3 NA 30164
As you can see, there are no strange values/bad data, and I only need to filter out the NAs per category. I order to make start my analysis with only clean data, and only the variables that I need, I am copying those 3 variables into a new dataframe, and filter out the NAs. As you can see, I lost about 32.500 interview with NAs (I think there is a lot of overlap between missing fruits and missing vegetables data).
q3clean <- select(brfss2013, educa, X_frtlt1, X_veglt1) %>%
filter(!(is.na(educa)|is.na(X_frtlt1)|is.na(X_veglt1)))
dim(q3clean)## [1] 459264 3
Unfortunately, I could not find out how to use the factorial values ([1] ect) in the ifelse statements, so for the time being, I am going to create new variables using the descriptions.
q3clean <- q3clean %>%
mutate(both=ifelse((X_veglt1=="Consumed vegetables one or more times per day" & X_frtlt1=="Consumed fruit one or more times per day"),"both","no"))q3clean <- q3clean %>%
mutate(none=ifelse((X_veglt1!="Consumed vegetables one or more times per day" & X_frtlt1!="Consumed fruit one or more times per day"),"none","no"))q3clean <- q3clean %>%
mutate(fruit=ifelse((X_veglt1!="Consumed vegetables one or more times per day" & X_frtlt1=="Consumed fruit one or more times per day"),"fruit_only","no"))q3clean <- q3clean %>%
mutate(vegt=ifelse((X_veglt1=="Consumed vegetables one or more times per day" & X_frtlt1!="Consumed fruit one or more times per day"),"vegt_only","no"))So, now I have created 4 additional variables, and the values should be mutually exclusive (either eating both, neither, fruit only, or vegetables only). As you can see below, most people eat both or vegetables only.
q3clean %>%
group_by(both,none,fruit,vegt) %>%
summarise(count=n())## Source: local data frame [4 x 5]
## Groups: both, none, fruit [?]
##
## both none fruit vegt count
## <chr> <chr> <chr> <chr> <int>
## 1 both no no no 250394
## 2 no no fruit_only no 39478
## 3 no no no vegt_only 108066
## 4 no none no no 61326
Below, you can see that especially college graduates eat both more often (I had to search for a way to make the x-axis readable by the way…).
ggplot(data = q3clean, aes(x = educa, fill = both)) +
geom_bar(position="fill")+ theme(axis.text.x = element_text(angle = 25, hjust = 1))The last thing that I wanted to do, is the same thing for people who don’t eat either (I have made my life hard enough already ;-)). This picture is pretty similar to the previous one, with the “worst performance” in the “some highschool” category, and college graduates performing best (almost all eat some fruit or vegetables.)
ggplot(data = q3clean, aes(x = educa, fill = none)) +
geom_bar(position="fill")+ theme(axis.text.x = element_text(angle = 25, hjust = 1))