Overview

This is a project for the Introduction to Probability and Data with R course, which is part of Coursera’s Statistics with R Specialization.

The report performs an exploratory data analysis of the 2013 BRFSS (Behavioral Risk Factor Surveillance System) data set.

BRFSS is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. BRFSS data are used for targeting and building health promotion activities.



Setup

Load packages & data

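# Packages for data wrangling, plotting, and interactive charts
library(ggplot2); library(gridExtra); library(dplyr); library(plotly)
# Download the data only if no local copy exists; the signed URL below may have expired
if (!file.exists("./data/1024_SR-IP-w5_Happiness/brfss2013.RData")) {
  dir.create("./data/1024_SR-IP-w5_Happiness", recursive = TRUE, showWarnings = FALSE)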
download.file("https://d3c33hcgiwev3.cloudfront.net/4tiY2fqCQa-YmNn6gnGvzQ_1e7320c30a6f4b27894a54e2de50a805_brfss2013.RData?Expires=1609372800&Signature=il7z9AaJdFfLp1bthmDUPsyGE7CBaSYUNygP2NkI6t~Cwqq2mSDAYKTgEJ1uhDYe65QxIRZj-qAxBw84DnVt0XDL69e52OxTL95YyB~UT7K0e0ZXrCpfmFZ-yF3EFWTla0MHCEx5SsWx4I0AYLNBj3CyxirSGT0L98FUzE6NzgE_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A",
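              destfile = "./data/1024_SR-IP-w5_Happiness/brfss2013.RData",
              method = "curl")
}
load("./data/1024_SR-IP-w5_Happiness/brfss2013.RData")  # creates the object brfss2013
data <- as_tibble(brfss2013)  # tibble for compact printing
dim <- dim(data)              # number of observations and variables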

1. Data

According to the BRFSS 2013 Codebook, the data were collected in all 50 states as well as the District of Columbia and three U.S. territories. A quick look at the data reveals 491,775 observations of 330 variables.

Scope of inference

According to the Codebook, the survey was conducted by both landline and cellular telephone. In the landline survey, interviewers collect data from a randomly selected adult in a household. In the cellular telephone version, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.

Generalizability

  • population of interest: BRFSS’s protocols ensure that the data are representative of the population on a number of demographic characteristics, including sex, age, race, education, marital status, home ownership, phone ownership (landline telephone, cellular telephone or both) and sub-state region (The BRFSS Data User Guide).
    So the health characteristics pertain to the non-institutionalized adult population, aged 18 years or older, residing in the US;

  • potential sources of bias: the BRFSS uses a weighting process that includes two steps: design weighting and iterative proportional fitting (also known as “raking” weighting). It is an effective tool for reducing bias in the sample (The BRFSS Data User Guide).
    Nevertheless, because the survey was conducted by phone, it could miss poverty-stricken people who cannot afford any kind of phone, landline or cellular. These people are likely to be at higher risk of health problems, which could in theory introduce some bias into the data;

  • in summary (The BRFSS Data User Guide):

    • disproportionate stratified sampling (DSS) was used for the landline sample,
    • the cellular telephone sample was randomly generated from a sampling frame of confirmed cellular area code and prefix combinations,
    • a large, representative random sample was drawn in both modes of data collection.

Overall, it can be concluded that the sample is generalizable to non-institutionalized US adults, with the possible exception of poverty-stricken people without any kind of phone.
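
To make the “raking” step mentioned above more concrete, here is a minimal sketch of iterative proportional fitting in base R. It is a toy illustration only, not the BRFSS weighting procedure: the sample counts, the population margins, and all object names (sample_counts, target_sex, target_age, raked, weights) are invented for this example.

```r
# Toy sample counts (hypothetical): rows = sex, columns = age group
sample_counts <- matrix(c(30, 20,
                          35, 15),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(sex = c("Male", "Female"),
                                        age = c("18-44", "45+")))

# Hypothetical known population margins (same grand total as the sample)
target_sex <- c(Male = 49, Female = 51)
target_age <- c("18-44" = 45, "45+" = 55)

raked <- sample_counts
for (i in 1:100) {
  # Scale each row so that the row totals match the sex margin
  raked <- sweep(raked, 1, target_sex / rowSums(raked), "*")
  # Scale each column so that the column totals match the age margin
  raked <- sweep(raked, 2, target_age / colSums(raked), "*")
  # Columns now match exactly; stop once the rows match as well
  if (max(abs(rowSums(raked) - target_sex)) < 1e-8) break
}

round(raked, 2)                    # adjusted cell totals
weights <- raked / sample_counts   # weight per respondent in each cell
round(weights, 3)
```

Each pass alternately rescales rows and columns until both sets of totals agree with the targets; the resulting cell weights are what a weighted analysis would then apply to individual respondents.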

Causality

The study is observational:

  • the data were collected without interfering with how they arise,
  • there was no random assignment to any treatment or exercise,
    • participants were never assigned to groups after being sampled.

Hence the results are non-causal, and only statements about correlation can be made.


2. Research questions

Research question 1: Does money buy happiness? or

Is a US adult’s income level correlated with their life satisfaction?

  • Reason for interest: Until recently, scientists were inclined to believe that there is a certain income threshold beyond which more income makes no difference to feeling happy (see, for example: 2002, 2008_1, 2008_2, 2010). However, in August 2020 a new study appeared in the American Psychological Association journal “Emotion” (“The expanding class divide in happiness in the United States, 1972–2016”); its results show that such a threshold does not exist, at least for some categories of people.
    • So it is interesting to see which point of view the BRFSS data support.
  • Variables:
  1. income2: Income Level
  2. lsatisfy: Satisfaction With Life

Research question 2: Does health bring happiness? or

Is a US adult’s life satisfaction correlated with their general health or gender?

  • Reasons for interest:
    • I am disabled and would like to know: if I were healthy, could I be happier?
    • and also, are life satisfaction and gender related?
  • Variables:
  1. lsatisfy: Satisfaction With Life
  2. genhlth: General Health
  3. sex: Respondents Sex

Research question 3: Does depression have a woman’s face? or

Is a US adult’s depressive disorder correlated with their sex or income level?

  • Reasons for interest:
    • Since I am prone to depressive episodes, I am interested in whether depression and gender are correlated.
    • And also, are depression and income level related? (For example, a person with a higher income level may have more options for treating depression.)
  • Variables:
  1. addepev2: Ever Told You Had A Depressive Disorder
  2. sex: Respondents Sex
  3. income2: Income Level

3. Exploratory data analysis

Research question 1: Does money buy happiness? (Is a US adult’s income level correlated with their life satisfaction?)

  • Summary statistics
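# Keep income and life satisfaction, dropping respondents with either value missing
happy <- data %>%
        select(income = income2, satisfy = lsatisfy) %>%
        filter(!is.na(income), !is.na(satisfy))
# Shorten the income labels: "Less than $X" -> "< $X"; the level ending in "more" -> "> $75,000"
levels(happy$income) <- gsub("Less than", "<", levels(happy$income))
levels(happy$income) <- gsub(".*more", "> $75,000", levels(happy$income))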
summary(happy)
       income                  satisfy    
 > $75,000:1706   Very satisfied   :4290  
 < $50,000:1274   Satisfied        :4418  
 < $75,000:1241   Dissatisfied     : 490  
 < $35,000:1179   Very dissatisfied: 134  
 < $25,000:1044                           
 < $20,000:1039                           
 (Other)  :1849                           
table(happy)
           satisfy
income      Very satisfied Satisfied Dissatisfied Very dissatisfied
  < $10,000            238       461          113                52
  < $15,000            309       548          101                27
  < $20,000            374       594           52                19
  < $25,000            398       572           58                16
  < $35,000            517       587           67                 8
  < $50,000            634       589           43                 8
  < $75,000            708       504           28                 1
  > $75,000           1112       563           28                 3
  • Statistics Interpretation
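    • far more respondents are satisfied with life than dissatisfied: 8,708 of 9,332 (about 93%) answered “Very satisfied” or “Satisfied”;
    • life satisfaction also appears clearly related to income level: the share of “Very satisfied” respondents rises from about 28% (238 of 864) in the < $10,000 bracket to about 65% (1,112 of 1,706) in the > $75,000 bracket.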
  • Visualization
# Stacked bars scaled to 100%: satisfaction mix within each income bracket
gg <- ggplot(happy, aes(x = income, fill = satisfy)) +
        geom_bar(position = "fill") +
        scale_fill_brewer(name = "Satisfaction w/life", palette = "RdGy") +
        ylab("share by satisfaction") +
        theme(axis.text.x = element_text(angle = -20, vjust = 1, hjust = 0)) +
        ggtitle("Money Buys Happiness (interactive)")
ggplotly(gg)  # convert to an interactive plotly chart
  • Plot Interpretation
    • the plot also illustrates a clear relation between life satisfaction and income level: the share of “Very satisfied” respondents grows with every income bracket.

EDA answer to question 1:

  • YES, a US adult’s income level is clearly correlated with their life satisfaction
    • or, figuratively speaking, money buys happiness,

and the BRFSS data seem to support the latest study’s results on this subject.
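
The relation between income and life satisfaction is easier to judge from row-wise shares than from raw counts. The following is a minimal sketch in base R; it retypes the counts printed by table(happy) so that it runs on its own, and the names counts and row_share are introduced here for illustration only. With the data loaded, prop.table(table(happy), margin = 1) should give the same shares.

```r
# Counts copied from the income-by-satisfaction table above
counts <- matrix(c( 238, 461, 113, 52,
                    309, 548, 101, 27,
                    374, 594,  52, 19,
                    398, 572,  58, 16,
                    517, 587,  67,  8,
                    634, 589,  43,  8,
                    708, 504,  28,  1,
                   1112, 563,  28,  3),
                 ncol = 4, byrow = TRUE,
                 dimnames = list(
                   income  = c("< $10,000", "< $15,000", "< $20,000", "< $25,000",
                               "< $35,000", "< $50,000", "< $75,000", "> $75,000"),
                   satisfy = c("Very satisfied", "Satisfied",
                               "Dissatisfied", "Very dissatisfied")))

# Share of each satisfaction level within every income bracket (each row sums to 1)
row_share <- prop.table(counts, margin = 1)
round(100 * row_share, 1)

# Percentage of "Very satisfied" respondents per bracket, lowest to highest income
round(100 * row_share[, "Very satisfied"], 1)
```

Using margin = 1 conditions on income, so brackets of very different size become directly comparable.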

Research question 2: Does health bring happiness? (Is a US adult’s life satisfaction correlated with their general health or gender?)

  • Summary statistics
# Keep general health, life satisfaction, and sex, dropping rows with any of them missing
health <- data %>%
        select(health = genhlth, satisfy = lsatisfy, gender = sex) %>%
        filter(!is.na(satisfy), !is.na(health), !is.na(gender))
summary(health)
       health                  satisfy        gender    
 Excellent:1480   Very satisfied   :5373   Male  :4066  
 Very good:3268   Satisfied        :5495   Female:7555  
 Good     :3629   Dissatisfied     : 593                
 Fair     :2031   Very dissatisfied: 160                
 Poor     :1213                                         
table(health$satisfy, health$health)
                   
                    Excellent Very good Good Fair Poor
  Very satisfied          997      1914 1550  647  265
  Satisfied               456      1278 1916 1175  670
  Dissatisfied             22        70  132  167  202
  Very dissatisfied         5         6   31   42   76
table(health$satisfy, health$gender)
                   
                    Male Female
  Very satisfied    1954   3419
  Satisfied         1878   3617
  Dissatisfied       187    406
  Very dissatisfied   47    113
  • Statistics Interpretation
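    • again, far more respondents are satisfied with life than dissatisfied: 10,868 of 11,621 (about 94%) answered “Very satisfied” or “Satisfied”;
    • the health ratings are spread more evenly, with “Good” and “Very good” the most common answers;
    • there appears to be a clear relation between health and life satisfaction: about 67% of respondents in “Excellent” health are “Very satisfied” (997 of 1,480), against about 22% of those in “Poor” health (265 of 1,213);
    • for gender the picture is less clear: about 48% of men and about 45% of women are “Very satisfied”.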
  • Visualization
# Stacked bars scaled to 100%: satisfaction mix within each health rating, one panel per gender
gg <- ggplot(health, aes(x = health, fill = satisfy)) +
        geom_bar(position = "fill") +
        facet_grid(. ~ gender) +
        scale_fill_brewer(name = "Satisfaction w/life", palette = "PuOr") +
        ylab("share by satisfaction") +
        theme(axis.text.x = element_text(angle = -20, vjust = 1, hjust = 0)) +
        ggtitle("Health Brings Happiness (interactive)")
ggplotly(gg)  # convert to an interactive plotly chart
  • Plot Interpretation
    • the plot also illustrates a clear relation between health and life satisfaction: the better the health rating, the larger the share of satisfied respondents;
    • the pattern looks much the same in both panels, so health appears to relate to life satisfaction in roughly the same way for men and women;
    • as for gender and life satisfaction themselves, there does not seem to be any clear relation between them.

EDA answer to question 2:

  • YES, a US adult’s life satisfaction is clearly correlated with their general health
    • or, figuratively speaking, health brings happiness
  • and NO, life satisfaction shows no clear relation to a person’s gender.
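
Both answers can be checked with column-wise shares of the two tables above. This is a minimal sketch in base R that retypes the printed counts so it runs on its own; the names by_health, by_gender, share_health and share_gender are introduced here for illustration. With the data loaded, prop.table(table(health$satisfy, health$health), margin = 2) should give the same shares.

```r
satisfy_levels <- c("Very satisfied", "Satisfied", "Dissatisfied", "Very dissatisfied")

# Counts copied from the satisfaction-by-health table above
by_health <- matrix(c(997, 1914, 1550,  647, 265,
                      456, 1278, 1916, 1175, 670,
                       22,   70,  132,  167, 202,
                        5,    6,   31,   42,  76),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(satisfy = satisfy_levels,
                                    health  = c("Excellent", "Very good", "Good",
                                                "Fair", "Poor")))

# Counts copied from the satisfaction-by-gender table above
by_gender <- matrix(c(1954, 3419,
                      1878, 3617,
                       187,  406,
                        47,  113),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(satisfy = satisfy_levels,
                                    gender  = c("Male", "Female")))

# Satisfaction mix within each health rating and within each gender (columns sum to 1)
share_health <- prop.table(by_health, margin = 2)
share_gender <- prop.table(by_gender, margin = 2)

round(100 * share_health, 1)
round(100 * share_gender, 1)
```

Large differences between the columns of share_health, next to small differences between the columns of share_gender, are what the two answers above rest on.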

Research question 3: Does depression have a woman’s face? (Is a US adult’s depressive disorder correlated with their sex or income level?)

  • Summary statistics
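# Keep depression diagnosis, income, and sex, dropping rows with any of them missing
depr <- data %>%
        select(depression = addepev2, income = income2, gender = sex) %>%
        filter(!is.na(depression), !is.na(gender), !is.na(income))
# Shorten the income labels: "Less than $X" -> "< $X"; the level ending in "more" -> "> $75,000"
levels(depr$income) <- gsub("Less than", "<", levels(depr$income))
levels(depr$income) <- gsub(".*more", "> $75,000", levels(depr$income))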
summary(depr)
 depression         income          gender      
 Yes: 83847   > $75,000:115686   Male  :177419  
 No :334995   < $75,000: 65081   Female:241423  
              < $50,000: 61315                  
              < $35,000: 48670                  
              < $25,000: 41528                  
              < $20,000: 34716                  
              (Other)  : 51846                  
table(depr$depression, depr$gender)
     
        Male Female
  Yes  25528  58319
  No  151891 183104
table(depr$depression, depr$income)
     
      < $10,000 < $15,000 < $20,000 < $25,000 < $35,000 < $50,000 < $75,000
  Yes      9240      8731      9152      9717      9609     11056     10894
  No      15983     17892     25564     31811     39061     50259     54187
     
      > $75,000
  Yes     15448
  No     100238
  • Statistics Interpretation
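    • there seems to be a noticeable relation between depressive disorder and gender: about 24% of women (58,319 of 241,423) report a diagnosis, against about 14% of men (25,528 of 177,419);
    • for income the raw counts are harder to read because the brackets differ greatly in size, but the share reporting a depressive disorder falls steadily from about 37% in the < $10,000 bracket (9,240 of 25,223) to about 13% in the > $75,000 bracket (15,448 of 115,686).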
  • Visualization
# Left panel: gender mix among respondents with and without a depressive disorder
genplot <- ggplot(depr, aes(x = depression, fill = gender)) +
        geom_bar(position = "fill") +
        scale_fill_manual(name = "Gender",
                          values = c("cornflowerblue", "hotpink")) +
        ylab("share by gender") +
        ggtitle("Depression Has a Woman's Face")

# Right panel: income mix among respondents with and without a depressive disorder
incplot <- ggplot(depr, aes(x = depression, fill = income)) +
        geom_bar(position = "fill") +
        scale_fill_brewer(name = "Income Level", palette = "OrRd") +
        ylab("share by income")
grid.arrange(genplot, incplot, ncol = 2)  # show both panels side by side

  • Plots Interpretation
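    • the left plot also illustrates a noticeable relation between depression and gender: women make up about 70% of those reporting a depressive disorder (58,319 of 83,847), against about 55% of those who do not;
    • the right plot compares the income mix of the two groups: about 11% of the “Yes” bar falls in the < $10,000 bracket, against about 5% of the “No” bar, while > $75,000 accounts for about 18% of “Yes” and about 30% of “No”. The difference is easy to overlook among eight colour bands, but it points to a relation between depression and income level.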

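EDA answer to question 3:

  • YES, a US adult’s depressive disorder is correlated with their sex
    • or, figuratively speaking, depression has a woman’s face
  • and, contrary to the first impression from the plot, YES, depression is also related to a person’s income: the lower the income bracket, the larger the share reporting a depressive disorder (about 37% below $10,000, about 13% at $75,000 or more).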
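
Because the gender groups and income brackets differ greatly in size, the share of “Yes” answers within each group is a more direct check than the raw counts. The following is a minimal sketch in base R that retypes the counts printed above so it runs on its own; the names by_gender, by_income, dep_by_gender and dep_by_income are introduced here for illustration. With the data loaded, prop.table(table(depr$depression, depr$income), margin = 2) should give the same shares.

```r
# Counts copied from the depression-by-gender table above
by_gender <- matrix(c( 25528,  58319,
                      151891, 183104),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(depression = c("Yes", "No"),
                                    gender     = c("Male", "Female")))

# Counts copied from the depression-by-income table above
by_income <- matrix(c( 9240,  8731,  9152,  9717,  9609, 11056, 10894,  15448,
                      15983, 17892, 25564, 31811, 39061, 50259, 54187, 100238),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(
                      depression = c("Yes", "No"),
                      income     = c("< $10,000", "< $15,000", "< $20,000", "< $25,000",
                                     "< $35,000", "< $50,000", "< $75,000", "> $75,000")))

# Share of "Yes"/"No" within each gender and within each income bracket (columns sum to 1)
dep_by_gender <- prop.table(by_gender, margin = 2)
dep_by_income <- prop.table(by_income, margin = 2)

# Percentage reporting a depressive disorder, by gender and by income bracket
round(100 * dep_by_gender["Yes", ], 1)
round(100 * dep_by_income["Yes", ], 1)
```

Conditioning on gender or income (margin = 2) answers the question “how common is depression in this group?”, whereas the plots above condition on depression status and therefore also reflect how large each group is.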