knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(magrittr)
library(dplyr)
load("~/brfss2013.Rdata")
Introduction
This document was prepared to fulfill the data analysis project requirement in the final week of the course “Introduction to Probability and Data with R” by Duke University.
Data
The Behavioral Risk Factor Surveillance System (BRFSS) uses stratified sampling and collects its data through telephone surveys, asking residents of the various US states about their health-related risk behaviors, chronic health conditions, and use of preventive services. The BRFSS does not use random assignment: respondents are not assigned to treatment or control groups, so the groups being compared are non-equivalent and causal inferences cannot be drawn from the data. Since this is an observational study, the data may be considered generalizable, but causality cannot be inferred.
**Research question 1: For the general population in the US, is there a relationship between the frequency of smoking and high blood pressure?
There has been significant research on the link between smoking and high blood pressure. The American Academy of Family Physicians states, “The nicotine in cigarettes and other tobacco products makes your blood vessels get narrow and your heart beat faster, which makes your blood pressure get higher. If you quit smoking and using tobacco products, you can lower your blood pressure and your risk for heart disease and heart attack.” I am interested to see if this claim is supported by the data.**
**Research question 2: For the general population in the US, is there a correlation between the frequency of smoking and diabetes?
The U.S. Food and Drug Administration, in its article “Cigarette Smoking: A Risk Factor for Type 2 Diabetes”, states, “Smokers are 30 to 40 percent more likely to develop type 2 diabetes than nonsmokers”. I am interested to see if this claim can be supported by the BRFSS data.**
**Research question 3: For the general population in the US, is there a relationship between income level and whether people own or rent their home?
The general perception is that people with higher incomes tend to buy a home for their own use, while people with lower incomes might rent or make other arrangements. The BRFSS data will be examined to see if this is the case.
All questions are analyzed to see whether the pattern differs between males and females.**
Research question 1: Correlation between the frequency of Smoking and High Blood Pressure
# Count respondents by sex, smoking frequency and blood pressure status,
# dropping rows with a missing value in any of the three variables
q1 <- brfss2013 %>%
  select(sex, smokday2, bphigh4) %>%
  filter(!is.na(sex), !is.na(smokday2), !is.na(bphigh4)) %>%
  group_by(sex, smokday2, bphigh4) %>%
  # keep the sex/smokday2 grouping so perc is the share within each group
  summarise(count = n(), .groups = "drop_last") %>%
  mutate(perc = count / sum(count))
ggplot(q1, aes(x = factor(smokday2), y = perc*100, fill = factor(bphigh4))) +
geom_bar(stat="identity", width = 0.7) +
labs(x = "Smoking Frequency", y = "Percentage", fill = "Blood Pressure") +
theme_minimal(base_size = 10) +
facet_grid(. ~ sex)
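To make explicit what the plotted percentage means, here is a minimal sketch on a tiny made-up data frame (the `toy` rows are hypothetical and are not BRFSS responses). It assumes dplyr 1.0 or later. After `summarise()` only the last grouping variable is dropped, so `perc` is the share of each blood pressure answer within one sex and smoking-frequency group, and each such group should total 1.

```r
library(dplyr)

# Hypothetical toy responses for illustration only, not BRFSS data
toy <- data.frame(
  sex      = c("Male", "Male", "Male", "Male",
               "Female", "Female", "Female", "Female"),
  smokday2 = c("Every day", "Every day", "Not at all", "Not at all",
               "Every day", "Every day", "Every day", "Not at all"),
  bphigh4  = c("Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No")
)

toy_perc <- toy %>%
  group_by(sex, smokday2, bphigh4) %>%
  # drop only the last grouping level; the result stays grouped by sex, smokday2
  summarise(count = n(), .groups = "drop_last") %>%
  mutate(perc = count / sum(count))

# Sum the shares within each sex x smokday2 group
check <- toy_perc %>%
  summarise(total = sum(perc), .groups = "drop")

print(check)
```

Each `total` should equal 1, which is why every bar in the chart reaches 100%.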
**Narrative: About 60% of all participants who reported smoking every day or on some days have not been diagnosed with high blood pressure, whereas about 40% of them have.
About 45% of men who reported that they do not smoke at all have not been diagnosed with high blood pressure, whereas about 55% of them have.
About 50% of women who reported that they do not smoke at all have not been diagnosed with high blood pressure, whereas about 50% of them have.
These results may depend on other factors that are not considered in this study. Since this is an observational study, causality cannot be inferred.**
Research question 2: Correlation between the frequency of Smoking and Diabetes
# Count respondents by sex, smoking frequency and diabetes status,
# dropping rows with a missing value in any of the three variables
q2 <- brfss2013 %>%
  select(sex, smokday2, diabete3) %>%
  filter(!is.na(sex), !is.na(smokday2), !is.na(diabete3)) %>%
  group_by(sex, smokday2, diabete3) %>%
  # keep the sex/smokday2 grouping so perc is the share within each group
  summarise(count = n(), .groups = "drop_last") %>%
  mutate(perc = count / sum(count))
ggplot(q2, aes(x = factor(smokday2), y = perc*100, fill = factor(diabete3))) +
geom_bar(stat="identity", width = 0.7) +
labs(x = "Smoking Frequency", y = "Percentage", fill = "Diagnosed with Diabetes") +
theme_minimal(base_size = 10) +
facet_grid(. ~ sex)
**Narrative: About 12.5% of all participants who smoked every day or on some days have been diagnosed with diabetes, and about 87.5% of them have not.
About 20% of those who reported that they do not smoke at all have been diagnosed with diabetes, whereas about 80% of them have not.
These results may depend on other factors that are not considered in this study. Since this is an observational study, causality cannot be inferred.**
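The comparison above is made by eye. One possible way to check whether a difference of this size is larger than chance alone would suggest, which this report does not do, is a chi-square test of independence. The sketch below uses rounded, hypothetical counts chosen only to mimic the approximate percentages in the narrative (12.5% and 20% of 1,000 respondents each). They are not BRFSS counts, and the object names are assumptions made for illustration.

```r
# Hypothetical counts that mimic the approximate percentages in the narrative;
# these are NOT the actual BRFSS counts
toy_diabetes <- matrix(
  c(125, 875,
    200, 800),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    smoking  = c("Every day or some days", "Not at all"),
    diabetes = c("Yes", "No")
  )
)

# Chi-square test of independence between smoking group and diabetes status
test_result <- chisq.test(toy_diabetes)

print(test_result)
```

Even a small p-value here would only indicate an association in the toy table. It would say nothing about causality, for the reasons given in the narrative.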
Research question 3: Correlation between Income Levels of people and whether they own or rent their home
# Count respondents by sex, income range and housing type,
# dropping rows with a missing value in any of the three variables
q3 <- brfss2013 %>%
  select(sex, income2, renthom1) %>%
  filter(!is.na(sex), !is.na(income2), !is.na(renthom1)) %>%
  group_by(sex, income2, renthom1) %>%
  # keep the sex/income2 grouping so perc is the share within each income range
  summarise(count = n(), .groups = "drop_last") %>%
  mutate(perc = count / sum(count))
ggplot(q3, aes(x = factor(renthom1), y = perc*100, fill = factor(income2))) +
geom_bar(stat="identity", width = 0.7) +
labs(x = "Type of Home", y = "Percentage", fill = "Income Range") +
theme_minimal(base_size = 10) +
facet_grid(. ~ sex)
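One thing worth checking when reading this chart: `perc` in `q3` is computed within each sex and `income2` group, while the plot stacks those values by `renthom1`. The sketch below uses made-up counts (hypothetical, not BRFSS results, with only three income bands) to show what that implies. The shares total 1 within an income band, but not within a housing type.

```r
# Hypothetical counts for illustration only, NOT BRFSS results:
# rows are income bands, columns are housing types
toy_counts <- matrix(
  c(30, 60, 10,
    50, 40, 10,
    80, 15,  5),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    income2  = c("Less than $10,000", "Less than $35,000", "$75,000 or more"),
    renthom1 = c("Own", "Rent", "Other arrangement")
  )
)

# Share of each housing type within an income band (what perc holds in q3)
perc_within_income <- prop.table(toy_counts, margin = 1)

# Totals per income band: each equals 1
print(rowSums(perc_within_income))

# Totals per housing type, as stacked on the x axis: not equal to 1
print(colSums(perc_within_income))
```

With these toy numbers the "Own" bar would stack to 160 on a percentage axis. If bars that total 100% are wanted, one option is to map `income2` to `x` and `renthom1` to `fill`, mirroring the first two questions.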
**Narrative:
There appears to be a relationship between people's income range and the type of home they opt for. Individuals with an income greater than $20,000 tend to own their homes, whereas people with an income less than $10,000 predominantly rent. A mixed range of incomes can be observed among individuals who opt for other arrangements.
These results may depend on other factors that are not considered in this study. Since this is an observational study, causality cannot be inferred.**