This “Data Analysis Project” is part of the requirements for completing the course “Introduction to Probability and Data with R” by Duke University (Coursera). Its main objectives include:
One of the first steps in the programming assignment is to load some useful packages as well as the dataset we will be working with.
library(ggplot2)
library(dplyr)
library(magrittr)
library(scales)
library(RColorBrewer)
load("brfss2013.RData")
In this first part of the assignment, I describe how the sample was collected and what this data collection method implies for the scope of inference (generalizability / causality).
The data from this project come from the “Behavioral Risk Factor Surveillance System”. You can learn more about this survey on the following website: https://www.cdc.gov/brfss/.
The Behavioral Risk Factor Surveillance System (BRFSS) is a national telephone survey that collects data from U.S. residents (adults aged 18 and older) about their health-related risk behaviors, chronic health conditions, and use of preventive services. It was established in 1984 with 15 states, but now BRFSS collects data in all 50 states as well as the District of Columbia and three U.S. territories. The survey completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.
This is a population-based study, since the data are collected across all 50 U.S. states, the District of Columbia, and three U.S. territories. Furthermore, respondents are randomly selected and interviewed by telephone. This means that those who answer the phone and agree to participate in the survey are included in the study.
Generalizability: In summary, since this is a large random sample, the results can be generalized to the adult population of the United States, with the caveat that people who cannot be reached by telephone or who decline to participate are not represented.
The BRFSS is a cross-sectional telephone survey, which means that each year the data are collected independently of previous years. In other words, the people participating in the survey are not followed over time.
Causality: Since this is an observational cross-sectional survey, we cannot establish causal inference from the data. We may draw conclusions about prevalence, correlation, and association. However, we are not able to determine the direction of an association, in other words, the causality. We cannot assume that one outcome causes the other rather than the other way around.
In this project I will be working to answer three research questions:
Background: Some studies in the area of chrononutrition have suggested a possible relationship between individuals’ sleep time and their weight status. In this question, I want to explore this possible relationship, under the hypothesis that people who sleep less have a higher BMI.
Background: It is believed that women are usually more concerned about their health than men. In addition, men are reported to have a worse overall health status than women. In this question, I want to check whether women report a better general health status than men.
Background: Are people in the highest income levels more satisfied with their lives? I am also interested in seeing whether women have a lower income than men.
To answer the first question, about the relationship between sleep time and weight status, I will be using three variables:
How Much Time Do You Sleep
Discrete variable: ranges from 1 to 24 hours. Contains NAs and refusals (removed for the analysis)
Computed BMI categories
Qualitative variable: 1 - Underweight; 2 - Normal Weight; 3 - Overweight; 4 - Obese
Respondents Sex
Binary variable: 1 - Male; 2 - Female
# Removing NAs and cleaning the data
question1 <- brfss2013 %>%
  filter(!is.na(sleptim1), !is.na(X_bmi5cat), !is.na(sex)) %>%
  select(sleptim1, X_bmi5cat, sex)
#Exploring the variables
str(question1)
## 'data.frame': 458915 obs. of 3 variables:
## $ sleptim1 : int 6 9 8 6 8 7 8 8 6 3 ...
## $ X_bmi5cat: Factor w/ 4 levels "Underweight",..: 1 3 2 4 4 2 4 3 3 3 ...
## $ sex : Factor w/ 2 levels "Male","Female": 2 2 2 1 2 2 1 2 1 1 ...
summary(question1)
## sleptim1 X_bmi5cat sex
## Min. : 1.000 Underweight : 8054 Male :195085
## 1st Qu.: 6.000 Normal weight:152911 Female:263830
## Median : 7.000 Overweight :165107
## Mean : 7.049 Obese :132843
## 3rd Qu.: 8.000
## Max. :24.000
The str function provides information about the structure of the object (question1). The dataset contains 458,915 observations after removing NAs and refusals across the three variables.
From the summary output we can see the absolute frequency of each level of the categorical variables (sex and BMI categories). The summary statistics for sleep time, a discrete variable, show a range of 1 to 24 hours of sleep, with a mean (7.049 hours) close to the median (7 hours).
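The missing-value filter used to build question1 keeps a row only when all three variables are present. A minimal base R sketch of the same idea follows; the data frame `toy` and its values are hypothetical and are not taken from brfss2013.

```r
# Hypothetical toy data; the values are made up for illustration only.
toy <- data.frame(
  sleptim1  = c(6, NA, 8, 7, 5),
  X_bmi5cat = factor(c("Obese", "Overweight", NA, "Normal weight", "Obese")),
  sex       = factor(c("Female", "Male", "Female", NA, "Male"))
)

# TRUE for rows that have no NA in any of the three columns.
keep <- complete.cases(toy[, c("sleptim1", "X_bmi5cat", "sex")])
toy_clean <- toy[keep, ]

nrow(toy)        # rows before filtering
nrow(toy_clean)  # rows after filtering
sum(!keep)       # rows dropped because of at least one NA
```

Comparing the row counts before and after filtering is a quick way to see how many observations a complete-case analysis discards.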
#Descriptive Statistics
table1 <- table(question1$sex, question1$X_bmi5cat)
prop.table(table1, 1)
##
## Underweight Normal weight Overweight Obese
## Male 0.009565061 0.268857165 0.430273983 0.291303791
## Female 0.023454497 0.380779290 0.307648865 0.288117348
aggregate(x = question1$sleptim1,
by = list(question1$sex),
FUN = mean)
## Group.1 x
## 1 Male 7.030228
## 2 Female 7.062203
aggregate(x = question1$sleptim1,
by = list(question1$X_bmi5cat),
FUN = mean)
## Group.1 x
## 1 Underweight 7.083437
## 2 Normal weight 7.118638
## 3 Overweight 7.058889
## 4 Obese 6.953118
The descriptive statistics show that the most common category is overweight among men (43.0%) and normal weight among women (38.1%). The obesity rate is about 29% for both sexes. Very few individuals were underweight (about 1% of men and 2% of women).
The mean sleep time was very similar between sexes: 7.03 h for men and 7.06 h for women.
When considering weight status, the highest mean sleep time (7.12 h) was among normal-weight individuals, while the lowest (6.95 h) was among obese participants, a difference of roughly 10 minutes.
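The second argument of prop.table decides which margin is normalized. The sketch below illustrates this with counts reconstructed from the printed row proportions and the group sizes in the summary output (proportion times row total, rounded), so the counts are an assumption rather than values read from the dataset.

```r
# Counts reconstructed as round(row proportion * row total) from the output above.
bmi_levels <- c("Underweight", "Normal weight", "Overweight", "Obese")
counts <- matrix(
  c(1866,  52450, 83940, 56829,   # Male   (n = 195085)
    6188, 100461, 81167, 76014),  # Female (n = 263830)
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("Male", "Female"), bmi = bmi_levels)
)

row_prop <- prop.table(counts, margin = 1)  # BMI distribution within each sex
col_prop <- prop.table(counts, margin = 2)  # sex distribution within each BMI category

rowSums(row_prop)         # every row sums to 1
colSums(col_prop)         # every column sums to 1
round(100 * row_prop, 1)  # row percentages, comparable to the table above
```

With margin = 1 the question answered is "what share of men (or women) falls in each BMI category", which is the comparison made in the text. With margin = 2 the result is influenced by the larger number of women in the sample.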
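As a consistency check on the group means above, the overall mean sleep time should equal the average of the group means weighted by group size. The sketch below uses only numbers copied from the summary and aggregate output; the variable names are illustrative.

```r
# Group sizes and group means copied from the summary() and aggregate() output.
n_sex    <- c(Male = 195085, Female = 263830)
mean_sex <- c(Male = 7.030228, Female = 7.062203)

n_bmi    <- c(Underweight = 8054, `Normal weight` = 152911,
              Overweight = 165107, Obese = 132843)
mean_bmi <- c(Underweight = 7.083437, `Normal weight` = 7.118638,
              Overweight = 7.058889, Obese = 6.953118)

# Both weighted means should reproduce the overall mean reported by summary().
overall_by_sex <- weighted.mean(mean_sex, w = n_sex)
overall_by_bmi <- weighted.mean(mean_bmi, w = n_bmi)
round(c(overall_by_sex, overall_by_bmi), 3)

# Gap between the highest and lowest BMI-group means, in minutes.
gap_minutes <- diff(range(mean_bmi)) * 60
round(gap_minutes, 1)
```

Both weighted means round to the overall mean of 7.049 hours, and the gap between normal-weight and obese participants is about 10 minutes of sleep.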
#Plotting graphs
ggplot(question1, aes(x = X_bmi5cat, y = sleptim1, fill = sex)) +
  geom_boxplot() + labs(y = "Hours of Sleep", x = "BMI categories") +
  theme_bw() + scale_fill_brewer(palette = "Set2")
Finally, the boxplots of sleep time by sex within each BMI category show no meaningful difference in pattern between the groups.
To answer the second question, about the relationship between reported general health and sex, I will be using one other variable besides sex:
# Removing NAs and cleaning the data
question2 <- brfss2013 %>%
  filter(!is.na(genhlth), !is.na(sex)) %>%
  select(genhlth, sex)
#Exploring the variables
str(question2)
## 'data.frame': 489788 obs. of 2 variables:
## $ genhlth: Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
## $ sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
summary(question2)
## genhlth sex
## Excellent: 85481 Male :200469
## Very good:159075 Female:289319
## Good :150555
## Fair : 66726
## Poor : 27951
After removing missing values and refusals, 489,788 observations remain. The data consist of two factors: sex, with two levels, and general health, with five levels.
table2 <- table(question2$sex, question2$genhlth)
prop.table(table2, 1)
##
## Excellent Very good Good Fair Poor
## Male 0.17828692 0.32491308 0.31425308 0.12910724 0.05343968
## Female 0.17192096 0.32469350 0.30263135 0.14117289 0.05958129
When we print a table of row proportions crossing the two factors, we see only small differences between sexes in the proportion of each category of reported general health (at most about 1.2 percentage points).
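The size of the differences between sexes can be read directly from the row proportions. The sketch below copies the proportions from the table above and expresses the female minus male difference in percentage points; the variable names are illustrative.

```r
# Row proportions copied from the prop.table() output above.
health_levels <- c("Excellent", "Very good", "Good", "Fair", "Poor")
male   <- c(0.17828692, 0.32491308, 0.31425308, 0.12910724, 0.05343968)
female <- c(0.17192096, 0.32469350, 0.30263135, 0.14117289, 0.05958129)
names(male) <- names(female) <- health_levels

# Difference (female minus male) in percentage points.
diff_pp <- round(100 * (female - male), 2)
diff_pp

# Category with the largest absolute difference.
largest <- names(which.max(abs(diff_pp)))
largest
```

The largest gap is in the "Fair" category, where the share of women is about 1.2 percentage points higher than the share of men.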
ggplot(question2, aes(x = genhlth, group = sex)) +
  # Bar heights are within-sex proportions; bars are colored by health category
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = "count") +
  geom_text(aes(label = scales::percent(..prop.., accuracy = 0.1),
                y = ..prop..), stat = "count", vjust = -.3) +
  labs(y = "Percentage", x = "General Health", fill = "General Health") + theme_bw() +
  facet_grid(~sex) + scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(labels = percent, limits = c(0, 0.4))
The plot shows the proportion of each category of reported general health for each sex. As we can see, the distributions are very similar, which suggests that men and women did not report their general health status differently.
Most participants reported a good or very good general health status (more than 60% for both sexes). Less than 10% of the participants of either sex reported a poor health status, and the share of individuals who reported excellent general health did not reach 20% among either men or women.
To answer the third question, about the relationship between income level and overall life satisfaction, in general and between sexes, I used two variables besides sex of the respondent:
Satisfaction with life
Categorical variable: 1 - Very Satisfied; 2 - Satisfied; 3 - Dissatisfied; 4 - Very Dissatisfied. Contains NAs and refusals (removed for the analysis)
Computed income categories
Qualitative variable: 1 - Less than 15,000; 2 - 15,000 to less than 25,000; 3 - 25,000 to less than 35,000; 4 - 35,000 to less than 50,000; 5 - 50,000 or more.
# Removing NAs and cleaning the data
question3 <- brfss2013 %>%
  filter(!is.na(X_incomg), !is.na(sex), !is.na(lsatisfy)) %>%
  select(X_incomg, sex, lsatisfy)
#Exploring the variables
str(question3)
## 'data.frame': 9332 obs. of 3 variables:
## $ X_incomg: Factor w/ 5 levels "Less than $15,000",..: 5 5 5 5 5 2 3 2 2 5 ...
## $ sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 1 2 2 2 ...
## $ lsatisfy: Factor w/ 4 levels "Very satisfied",..: 1 2 1 1 2 1 1 2 2 1 ...
summary(question3)
## X_incomg sex lsatisfy
## Less than $15,000 :1849 Male :3418 Very satisfied :4290
## $15,000 to less than $25,000:2083 Female:5914 Satisfied :4418
## $25,000 to less than $35,000:1179 Dissatisfied : 490
## $35,000 to less than $50,000:1274 Very dissatisfied: 134
## $50,000 or more :2947
After cleaning the data and removing NAs and refusals from the three variables of interest, only 9,332 observations remain. This is a small fraction of the full sample, so the respondents with complete data may differ systematically from the rest of the sample. The results for this question may therefore not be representative and should not be generalized to the entire study population.
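The extent of the reduction can be quantified from the counts printed above. The sketch below compares the third-question subset with the second-question subset, which is used here as a reference only because the size of the full dataset is not shown in this report.

```r
# Observation counts copied from the str() and summary() output above.
n_q2 <- 489788   # rows with non-missing genhlth and sex
n_q3 <- 9332     # rows with non-missing X_incomg, sex and lsatisfy

# Share of the reference subset that remains for question 3, in percent.
retained_pct <- 100 * n_q3 / n_q2
round(retained_pct, 1)

# Share of women in each subset, in percent.
female_pct_q2 <- 100 * 289319 / n_q2
female_pct_q3 <- 100 * 5914 / n_q3
round(c(q2 = female_pct_q2, q3 = female_pct_q3), 1)
```

Only about 1.9% of the observations remain, and the share of women rises from about 59% to about 63%, which is one sign that the reduced sample may not mirror the larger one.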
ggplot(question3, aes(x = sex, fill = X_incomg)) + geom_bar(position = "fill") +
  facet_grid(. ~ lsatisfy) + ylab("Proportion") +
  ggtitle("Life satisfaction vs. income categories by sex") +
  scale_fill_brewer("Income level", palette = "Set2") +
  theme_bw()
The plot shows a clear relationship between income level and life satisfaction. For both sexes, most of the people dissatisfied with their lives are in the lowest income category. On the other hand, most of those who reported being very satisfied with their lives had an annual income of $50,000 or more, for both sexes.
The plot illustrates that the share of people in the lowest income categories grows as life satisfaction decreases.
ggplot(question3, aes(x = sex, fill = X_incomg)) + geom_bar(position = "fill") +
  ylab("Proportion") +
  ggtitle("Income level vs Sex") +
  scale_fill_brewer("Income level", palette = "Set2") +
  theme_bw()
Finally, the plot shows that income level differs between sexes. A larger share of men than women is in the highest income level (50,000 or more), while a larger share of women than men is in the lowest income level (less than 15,000).
© Lais Duarte Batista
All Rights Reserved
August 06, 2020