This is a project for the Introduction to Probability and Data with R course, which is part of Coursera’s Statistics with R Specialization.
The report performs an exploratory data analysis of the BRFSS (Behavioral Risk Factor Surveillance System) 2013 data set.
BRFSS is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. BRFSS data are used for targeting and building health promotion activities.
Load packages & data
# Packages for plotting, plot layout, data wrangling, and interactive charts
library(ggplot2)
library(gridExtra)
library(dplyr)
library(plotly)

# Download the data set only if it is not already stored locally
rdata_path <- "./data/1024_SR-IP-w5_Happiness/brfss2013.RData"
if (!file.exists(rdata_path)) {
  dir.create(dirname(rdata_path), recursive = TRUE, showWarnings = FALSE)
  download.file("https://d3c33hcgiwev3.cloudfront.net/4tiY2fqCQa-YmNn6gnGvzQ_1e7320c30a6f4b27894a54e2de50a805_brfss2013.RData?Expires=1609372800&Signature=il7z9AaJdFfLp1bthmDUPsyGE7CBaSYUNygP2NkI6t~Cwqq2mSDAYKTgEJ1uhDYe65QxIRZj-qAxBw84DnVt0XDL69e52OxTL95YyB~UT7K0e0ZXrCpfmFZ-yF3EFWTla0MHCEx5SsWx4I0AYLNBj3CyxirSGT0L98FUzE6NzgE_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A",
                destfile = rdata_path,
                method = "curl")
}
load(rdata_path)               # creates the data frame brfss2013
data <- as_tibble(brfss2013)
data_dim <- dim(data)  # number of rows (observations) and columns (variables)

As the BRFSS 2013 Codebook says, the data were collected in all 50 states as well as the District of Columbia and three U.S. territories. A quick look at the data reveals that there are 491,775 observations of 330 variables.
According to the Codebook, the survey was conducted by both landline and cellular telephone. In the landline survey, interviewers collect data from a randomly selected adult in a household. In the cellular version, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.
population of interest: BRFSS’s protocols ensure that the data are representative of the population on a number of demographic characteristics, including sex, age, race, education, marital status, home ownership, phone ownership (landline telephone, cellular telephone or both) and sub-state region (The BRFSS Data User Guide).
The health characteristics therefore pertain to the non-institutionalized adult population, aged 18 years or older, residing in the US.
potential sources of bias: the BRFSS uses a weighting process that includes two steps: design weighting and iterative proportional fitting (also known as “raking” weighting). It is an effective tool for attempting to remove bias from the sample (The BRFSS Data User Guide).
Nevertheless, because the survey was conducted by phone, it may miss people who are too poor to afford any kind of phone, landline or cellular. These people are plausibly at higher risk of health problems, so their absence could introduce some bias into the data.
all in all (The BRFSS Data User Guide): the data sample can be generalized to the non-institutionalized adult population residing in the US, possibly excluding people who are too poor to have any kind of phone.
No random assignment was used when gathering the data: participants were never assigned to groups after being sampled. The study is therefore observational, so its results are non-causal and only correlational statements can be made.
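The raking step mentioned under potential sources of bias can be illustrated with a small sketch. This is a toy example of iterative proportional fitting in base R, not the BRFSS procedure itself: the 2 x 2 sample counts, the target margins, and the helper name rake are all invented for illustration.

```r
# Iterative proportional fitting (raking) on a toy 2 x 2 sample.
# All numbers are hypothetical; they are not BRFSS figures.
sample_counts <- matrix(c(30, 20,
                          10, 40),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(sex = c("Male", "Female"),
                                        phone = c("Landline", "Cell")))

# Hypothetical population margins the weighted sample should reproduce
target_sex   <- c(Male = 48, Female = 52)
target_phone <- c(Landline = 35, Cell = 65)

rake <- function(tab, row_target, col_target, tol = 1e-8, max_iter = 100) {
  for (i in seq_len(max_iter)) {
    # Scale each row so that the row totals match the row targets
    tab <- tab * (row_target / rowSums(tab))
    # Scale each column so that the column totals match the column targets
    tab <- sweep(tab, 2, col_target / colSums(tab), "*")
    # Columns match exactly after the step above; stop once rows match too
    if (max(abs(rowSums(tab) - row_target)) < tol) break
  }
  tab
}

raked <- rake(sample_counts, target_sex, target_phone)

# Weight given to each respondent in a cell: raked total / observed count
weights <- raked / sample_counts
round(raked, 2)
```

After raking, both sets of marginal totals agree with the targets, while the association between the two variables observed in the sample is preserved as far as the margins allow.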
Research question 1: Does money buy happiness? That is, is an adult US citizen's income level correlated with their life satisfaction?
income2: Income Level
lsatisfy: Satisfaction With Life

Research question 2: Does health bring happiness? That is, is an adult US citizen's life satisfaction correlated with their general health or gender?
lsatisfy: Satisfaction With Life
genhlth: General Health
sex: Respondents Sex

Research question 3: Does depression have a woman's face? That is, is an adult US citizen's depressive disorder correlated with their sex or income level?
addepev2: Ever Told You Had A Depressive Disorder
sex: Respondents Sex
income2: Income Level

Research question 1: Does money buy happiness? (Is an adult US citizen's income level correlated with their life satisfaction?)
# Keep only respondents who answered both questions
happy <- data %>%
  select(income = income2, satisfy = lsatisfy) %>%
  filter(!is.na(income), !is.na(satisfy))
# Shorten the income labels for plotting
levels(happy$income) <- gsub("Less than", "<", levels(happy$income))
levels(happy$income) <- gsub(".*more", "> $75,000", levels(happy$income))
summary(happy)

       income                 satisfy
 > $75,000:1706   Very satisfied   :4290
 < $50,000:1274   Satisfied        :4418
 < $75,000:1241   Dissatisfied     : 490
 < $35,000:1179   Very dissatisfied: 134
 < $25,000:1044
 < $20,000:1039
 (Other)  :1849
table(happy)

           satisfy
income      Very satisfied Satisfied Dissatisfied Very dissatisfied
  < $10,000            238       461          113                52
  < $15,000            309       548          101                27
  < $20,000            374       594           52                19
  < $25,000            398       572           58                16
  < $35,000            517       587           67                 8
  < $50,000            634       589           43                 8
  < $75,000            708       504           28                 1
  > $75,000           1112       563           28                 3
gg <- ggplot(happy, aes(x = income, fill = satisfy)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(name = "Satisfaction w/life", palette = "RdGy") +
  ylab("share by satisfaction") +
  theme(axis.text.x = element_text(angle = -20, vjust = 1, hjust = 0)) +
  ggtitle("Money Buys Happiness (interactive)")
ggplotly(gg)

EDA answer to question 1:
The share of very satisfied respondents rises with each income level, and the BRFSS data seem to support the latest study results on this subject.
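To put numbers on the pattern in the income and satisfaction table, the counts can be converted to shares within each income level. The sketch below retypes the counts from the table(happy) output above into a matrix (the object names counts and row_shares are illustrative) and uses base R's prop.table.

```r
# Counts copied from the table(happy) output above
income_levels  <- c("< $10,000", "< $15,000", "< $20,000", "< $25,000",
                    "< $35,000", "< $50,000", "< $75,000", "> $75,000")
satisfy_levels <- c("Very satisfied", "Satisfied",
                    "Dissatisfied", "Very dissatisfied")

counts <- matrix(c( 238, 461, 113, 52,
                    309, 548, 101, 27,
                    374, 594,  52, 19,
                    398, 572,  58, 16,
                    517, 587,  67,  8,
                    634, 589,  43,  8,
                    708, 504,  28,  1,
                   1112, 563,  28,  3),
                 ncol = 4, byrow = TRUE,
                 dimnames = list(income = income_levels,
                                 satisfy = satisfy_levels))

# margin = 1 divides every row by its own total,
# giving the satisfaction shares within each income level
row_shares <- prop.table(counts, margin = 1)
round(100 * row_shares, 1)
```

With these counts, the share of very satisfied respondents rises from roughly 28% in the lowest income bracket to roughly 65% in the highest. This is a descriptive comparison only; it does not establish causation.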
Research question 2: Does health bring happiness? (Is an adult US citizen's life satisfaction correlated with their general health or gender?)
# Keep only respondents with all three variables recorded
health <- data %>%
  select(health = genhlth, satisfy = lsatisfy, gender = sex) %>%
  filter(!is.na(satisfy), !is.na(health), !is.na(gender))
summary(health)

       health                  satisfy        gender
 Excellent:1480   Very satisfied   :5373   Male  :4066
 Very good:3268   Satisfied        :5495   Female:7555
 Good     :3629   Dissatisfied     : 593
 Fair     :2031   Very dissatisfied: 160
 Poor     :1213
table(health$satisfy, health$health)

                    Excellent Very good Good Fair Poor
  Very satisfied          997      1914 1550  647  265
  Satisfied               456      1278 1916 1175  670
  Dissatisfied             22        70  132  167  202
  Very dissatisfied         5         6   31   42   76

table(health$satisfy, health$gender)

                    Male Female
  Very satisfied    1954   3419
  Satisfied         1878   3617
  Dissatisfied       187    406
  Very dissatisfied   47    113
gg <- ggplot(health, aes(x = health, fill = satisfy)) +
  geom_bar(position = "fill") +
  facet_grid(. ~ gender) +
  scale_fill_brewer(name = "Satisfaction w/life", palette = "PuOr") +
  ylab("share by satisfaction") +
  theme(axis.text.x = element_text(angle = -20, vjust = 1, hjust = 0)) +
  ggtitle("Health Brings Happiness (interactive)")
ggplotly(gg)

EDA answer to question 2:
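The bars drawn with position = "fill" show, for each health level, the share of every satisfaction category. A minimal sketch of the same computation in base R follows, using the counts from the table(health$satisfy, health$health) output above. It pools both genders, because the three-way counts behind the faceted plot are not printed in this report; the object names counts and long are illustrative.

```r
# Counts copied from the table(health$satisfy, health$health) output above
counts <- matrix(c(997, 1914, 1550,  647, 265,
                   456, 1278, 1916, 1175, 670,
                    22,   70,  132,  167, 202,
                     5,    6,   31,   42,  76),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(
                   satisfy = c("Very satisfied", "Satisfied",
                               "Dissatisfied", "Very dissatisfied"),
                   health  = c("Excellent", "Very good", "Good",
                               "Fair", "Poor")))

# Long format: one row per (satisfy, health) pair, with the count in Freq
long <- as.data.frame(as.table(counts))

# Divide each count by the total of its health level;
# these are the bar segment heights that position = "fill" produces
long$share <- long$Freq / ave(long$Freq, long$health, FUN = sum)

# Share of very satisfied respondents at each health level
subset(long, satisfy == "Very satisfied", select = c(health, share))
```

With these pooled counts, the share of very satisfied respondents falls from roughly 67% among those reporting excellent health to roughly 22% among those reporting poor health.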
Research question 3: Does depression have a woman's face? (Is an adult US citizen's depressive disorder correlated with their sex or income level?)
# Keep only respondents with all three variables recorded
depr <- data %>%
  select(depression = addepev2, income = income2, gender = sex) %>%
  filter(!is.na(depression), !is.na(gender), !is.na(income))
# Shorten the income labels for plotting
levels(depr$income) <- gsub("Less than", "<", levels(depr$income))
levels(depr$income) <- gsub(".*more", "> $75,000", levels(depr$income))
summary(depr)

 depression         income          gender
 Yes: 83847   > $75,000:115686   Male  :177419
 No :334995   < $75,000: 65081   Female:241423
              < $50,000: 61315
              < $35,000: 48670
              < $25,000: 41528
              < $20,000: 34716
              (Other)  : 51846
table(depr$depression, depr$gender)

         Male Female
  Yes   25528  58319
  No   151891 183104

table(depr$depression, depr$income)

      < $10,000 < $15,000 < $20,000 < $25,000 < $35,000 < $50,000 < $75,000
  Yes      9240      8731      9152      9717      9609     11056     10894
  No      15983     17892     25564     31811     39061     50259     54187

      > $75,000
  Yes     15448
  No     100238
# Share of each gender among respondents with and without a diagnosis
genplot <- ggplot(depr, aes(x = depression, fill = gender)) +
  geom_bar(position = "fill") +
  scale_fill_manual(name = "Gender",
                    values = c("cornflowerblue", "hotpink")) +
  ylab("share by gender") +
  ggtitle("Depression Has a Woman's Face")
# Share of each income level among respondents with and without a diagnosis
incplot <- ggplot(depr, aes(x = depression, fill = income)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(name = "Income Level", palette = "OrRd") +
  ylab("share by income")
grid.arrange(genplot, incplot, ncol = 2)

EDA answer to question 3:
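The plots above condition on depression status: they show the gender and income mix among respondents with and without a diagnosis. The research question also invites the reverse view, the share ever diagnosed within each gender and income level. The sketch below computes both from the counts printed in the two tables above; the object names are illustrative.

```r
# Counts copied from the table(depr$depression, depr$gender) output above
by_gender <- matrix(c( 25528,  58319,
                      151891, 183104),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(depression = c("Yes", "No"),
                                    gender = c("Male", "Female")))

# Gender mix among respondents with and without a diagnosis
# (rows sum to 1; this is what the gender plot shows)
gender_given_depression <- prop.table(by_gender, margin = 1)

# Share ever diagnosed within each gender (columns sum to 1)
depression_given_gender <- prop.table(by_gender, margin = 2)

# Counts copied from the table(depr$depression, depr$income) output above
income_levels <- c("< $10,000", "< $15,000", "< $20,000", "< $25,000",
                   "< $35,000", "< $50,000", "< $75,000", "> $75,000")
yes <- c( 9240,  8731,  9152,  9717,  9609, 11056, 10894,  15448)
no  <- c(15983, 17892, 25564, 31811, 39061, 50259, 54187, 100238)

# Share ever diagnosed within each income level
rate_by_income <- setNames(yes / (yes + no), income_levels)

round(100 * depression_given_gender["Yes", ], 1)
round(100 * rate_by_income, 1)
```

With these counts, about 14% of men and about 24% of women report ever having been told they had a depressive disorder, and the share falls from roughly 37% in the lowest income bracket to roughly 13% in the highest. As elsewhere in this report, these are associations, not causal effects.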