Introduction

This “Data Analysis Project” is part of the requirements for the conclusion of the course “Introduction to Probability and Data with R” by the Duke Univeristy (Coursera). Its main objectives include:

Setup

One of the first steps in the programming assignment is to load some useful packages as well as the dataset we will be working with.

Load packages

library(ggplot2)
library(dplyr)
library(magrittr)
library(scales)
library(RColorBrewer)

Load data

load("brfss2013.RData")

Part 1: DATA

In this first part of the assignment, I will be describing a little bit about the sample collecting and the implications of this data collection method on the scope of inference (generalizability / causality).

The Data

The data from this project come from the “Behavioral Risk Factor Surveillance System”. You can learn more about this survey on the following website: https://www.cdc.gov/brfss/.

Background

The Behavioral Risk Factor Surveillance System (BRFSS) is a national survey that collects health-related data by telephone about U.S. residents (adults +18 years old) regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. It was established in 1984 with 15 states, but not BRFSS collects data in all 50 states as well as the District of Columbia and three U.S. territories. The survey completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.

Generalizability / Causality

This is a population-based study since the data have been collected considering the population over the 50 U.S states plus the three U.S. territories. Furthermore, the data are randomly collected, obtained by telephone interviews. It means that people are randomly selected to participate, and those who answer the phone and agree to participate in the survey are included in the study.

Generalizability: In summary, since this is a large representative random sample, we can conclude that the data for the sample is generalizable for the adult population of the United States.

The BRFSS is a cross-sectional telephone survey, which means that each year the data are collected independently from other previous years. In other words, people participating in the survey are not followed over time about their health-related aspects of interest in the survey.

Causality: Since this is an observational cross-sectional survey, we cannot establish causal inference through the data. We may draw conclusions about prevalences, correlation, and even association. However, we are not able to distinguish the direction of this association, in other words, the causality. We cannot assume that one outcome causes the other, instead of the other outcome causing the first one

Part 2: RESEARCH QUESTIONS

In this project I will be working to answer three research questions:

1. Is there any relation between the number of hours slept and the participant’s body mass index (BMI), overall, and between sexes?

Background: Some studies in the area of chrononutrition have raised a possible relationship between individuals’ sleep time and their weight status. In this question, I want to explore this possible relation, under the assumption (hypothesis) that people who sleep less have a higher BMI.

2. Are there any differences in the general health status between sexes?

Background: It is believed that women usually are more concerned about their health than men. Besides, men are related to an overall health status worse than women. In this question, I want to check if women have a reported health status higher than men.

3. Are there any differences between the reported income levels and overall life satisfaction? Are any differences in these levels of income between sexes?

Background: Are people in the highest levels of income more satisfied with their lives? Also, I am interested to see if women have a lower income than men.

Part 3: EXPLORATORY DATA ANALYSIS (EDA)

Research Question 1

To answer the first question, about the relationship between sleep time and weight status, I will be using three variables:

  • How Much Time Do You Sleep

    • var: sleptime1 - “On average, how many hours of sleep do you get in a 24-hour period?”

    Discrete variable: Range from 1-24. Presence of NAs and refuse (removed for the analysis)

  • Computed BMI categories

    • var: weight2: “Four-categories of Body Mass Index”

    Qualitative variable: 1 - Underweight; 2 - Normal Weight; 3 - Overweight; 4 - Obese

  • Respondents Sex

    • var: sex: “Indicate sex of respondent”

    Binary outcome: Assume 1 - Male; 2 - Female

# Removing NAs and cleaning the data

question1 <- brfss2013 %>%
   filter(!is.na(sleptim1)&!is.na(X_bmi5cat)&!is.na(sex))%>%
   select(sleptim1, X_bmi5cat, sex)

#Exploring the variables 

str(question1)
## 'data.frame':    458915 obs. of  3 variables:
##  $ sleptim1 : int  6 9 8 6 8 7 8 8 6 3 ...
##  $ X_bmi5cat: Factor w/ 4 levels "Underweight",..: 1 3 2 4 4 2 4 3 3 3 ...
##  $ sex      : Factor w/ 2 levels "Male","Female": 2 2 2 1 2 2 1 2 1 1 ...
summary(question1)
##     sleptim1              X_bmi5cat          sex        
##  Min.   : 1.000   Underweight  :  8054   Male  :195085  
##  1st Qu.: 6.000   Normal weight:152911   Female:263830  
##  Median : 7.000   Overweight   :165107                  
##  Mean   : 7.049   Obese        :132843                  
##  3rd Qu.: 8.000                                         
##  Max.   :24.000

The str function provides information about the structure of the object (question 1). The dataset contains 458,915 observations after NAs and refusal removal among the three categories.

From the summary output we can see the absolute frequency for each level on the categorical variables (sex and BMI categories). The summary statistics for sleep time - a discrete variable - show a range of 1-24 hours of sleep with a mean (7.049 hours) close to the median (7 hours).

#Descriptive Statistics

table1 <- table(question1$sex, question1$X_bmi5cat)
prop.table(table1, 1)
##         
##          Underweight Normal weight  Overweight       Obese
##   Male   0.009565061   0.268857165 0.430273983 0.291303791
##   Female 0.023454497   0.380779290 0.307648865 0.288117348
aggregate(x = question1$sleptim1,
          by = list(question1$sex),
          FUN = mean) 
##   Group.1        x
## 1    Male 7.030228
## 2  Female 7.062203
aggregate(x = question1$sleptim1,
          by = list(question1$X_bmi5cat),
          FUN = mean)  
##         Group.1        x
## 1   Underweight 7.083437
## 2 Normal weight 7.118638
## 3    Overweight 7.058889
## 4         Obese 6.953118

The descriptive statistics show that men are mostly overweight and women mostly normal-weight. The obesity rate is close to 30% for both sexes. Very few individuals were underweight.

The mean sleep time between sexes was very similar: 7.03h for men and 7.06h for women.

When considering weight status, the highest mean sleep (7.12h) time was among normal-weight individuals, while the lowest mean sleep time was among obese participants (6.95h).

#Plotting graphs

ggplot(question1, aes(x=X_bmi5cat, y=sleptim1, fill=sex)) + 
    geom_boxplot() + labs(y = "Hours of Sleep", x = "BMI categories") +
   theme_bw() + scale_fill_brewer(palette="Set2")

Finally, when investigating sleep time by sex among nutritional status categories the plots showed no relevant or different pattern.

  • In summary, it seems that there is not an association between sleep time and nutritional status.

Research Question 2

To answer the second question, about the relationship between reported general health and sex, besides sex, I will be using another variable:

  • Reported General Health categories
    • var: genhlt: “Five-categories of Reported General Health”
# Removing NAs and cleaning the data

question2 <- brfss2013 %>%
   filter(!is.na(genhlth)&!is.na(sex))%>%
   select(genhlth, sex)

#Exploring the variables 

str(question2)
## 'data.frame':    489788 obs. of  2 variables:
##  $ genhlth: Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
##  $ sex    : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
summary(question2)
##       genhlth           sex        
##  Excellent: 85481   Male  :200469  
##  Very good:159075   Female:289319  
##  Good     :150555                  
##  Fair     : 66726                  
##  Poor     : 27951

After removing missing values and refusals, we obtained 489788 observations. The structure of the data is organized in two factors: the factor sex with two levels, and the factor general health with five levels.

table2 <- table(question2$sex, question2$genhlth)
prop.table(table2, 1)
##         
##           Excellent  Very good       Good       Fair       Poor
##   Male   0.17828692 0.32491308 0.31425308 0.12910724 0.05343968
##   Female 0.17192096 0.32469350 0.30263135 0.14117289 0.05958129

When we print a proportional table crossing the data from both factors, we can see that there are no such differences in the proportion of each category of reported general health between sexes.

ggplot(question2, aes(x= genhlth,  group=sex)) + 
    geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
    geom_text(aes( label = scales::percent(..prop.., round(digits = 1)),
                   y= ..prop.. ), stat= "count", vjust = -.3) +
    labs(y = "Percentage", x = "General Health", fill="Sex") +theme_bw()+
   facet_grid(~sex) + scale_fill_brewer(palette="Set2")+
   scale_y_continuous(labels=percent, limits=c(0,0.4, 0.2))

The plot shows the proportion of each category of reported general health between sexes. As we can see, the data between sexes are very similar, which may suggest that men and women did not report differently their general health status.

Most of the participants reported a general health status good or very good (more than 60% for both sexes). Less than 10% of the participants in both sexes reported a poor health status, and the prevalence of individuals with reported general health considered excellent did not reach 20% in either men or women.

  • In summary, it seems that there is no association between sex and reported general health status.

Research Question 3

To answer the third question, about the relationship between income level and overall life satisfaction, in general and between sexes, I used two variables besides sex of the respondent:

  • Satisfaction with life

    • var: lsatisfy - “In general, how satisfied are you with your life?”

    Categorical variable: 1 - Very Satisfied; 2 - Satisfied; 3 - Dissatisfied; 4 - Very Dissatisfied. Presence of NAs and refuse (removed for the analysis)

  • Computed income categories

    • var: X_incomg: “Five-categories of Income Level”

    Qualitative variable: 1 - Less than 15,000; 2 - 15,000 to less than 25,000; 3 - 25,000 to less than 35,000; 4 - 35,000 to less than 50,000; 5 - 50,000 or more.

# Removing NAs and cleaning the data

question3 <- brfss2013 %>%
   filter(!is.na(X_incomg)&!is.na(sex)&!is.na(lsatisfy))%>%
   select(X_incomg, sex, lsatisfy)

#Exploring the variables 

str(question3)
## 'data.frame':    9332 obs. of  3 variables:
##  $ X_incomg: Factor w/ 5 levels "Less than $15,000",..: 5 5 5 5 5 2 3 2 2 5 ...
##  $ sex     : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 1 2 2 2 ...
##  $ lsatisfy: Factor w/ 4 levels "Very satisfied",..: 1 2 1 1 2 1 1 2 2 1 ...
summary(question3)
##                          X_incomg        sex                    lsatisfy   
##  Less than $15,000           :1849   Male  :3418   Very satisfied   :4290  
##  $15,000 to less than $25,000:2083   Female:5914   Satisfied        :4418  
##  $25,000 to less than $35,000:1179                 Dissatisfied     : 490  
##  $35,000 to less than $50,000:1274                 Very dissatisfied: 134  
##  $50,000 or more             :2947

After cleaning the data and removing NAs and refusal from the three variables of interest, we result in only 9,332 observations. It is important to consider than this data may not be representative of the population anymore due to the loss of statistical power. Then, this data cannot be generalized to the entire population of the study.

ggplot(question3, aes(x=sex, fill=X_incomg)) + geom_bar(position = "fill")+ 
  facet_grid(.~lsatisfy) + ylab("Proportion") + 
  ggtitle("Reported health vs. Income categories by sex") +
  scale_fill_brewer("Income level", palette="Set2") +
   theme_bw()

The plot shows a clear relationship between income level and life satisfaction. In both sexes, most of the people dissatisfied with their lives are in the lowest category of income level. On the other hand, for those who reported being very satisfied with their lives, most of them presented an annual income level above $50,000, for both sexes.

The plot illustrates that as the proportion of people in the lowest categories of income tends to increase, life satisfaction tends to decrease.

ggplot(question3, aes(x=sex, fill=X_incomg)) + geom_bar(position = "fill")+ 
  ylab("Proportion") + 
  ggtitle("Income level vs Sex") +
  scale_fill_brewer("Income level", palette="Set2") +
   theme_bw()

Finally, the plot shows that income level differs between sexes. Men tend to have a higher (50,000 or more) income level than women, while the lowest income level (Less than 15,000) have a higher proportion of women than men.

  • In summary, it seems that there is an association between income level and overall life satisfaction. Besides, the proportion of individuals in each income level shows that men tend to be in the highest income level more than women, while the opposite is also true. Women tend to present a higher proportion of individuals in the lowest income level.

© Lais Duarte Batista

All Rights Reserved

August 06, 2020