Aadarsh Srivastava
Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called brfss2013. Delete this note when before you submit your work.
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey in the United States. The BRFSS is designed to identify risk factors in the adult population and report emerging trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, immunization, health status, healthy days - health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use.
Data collection procedure is explained in brfss_codebook. The data were collected from United States’ all 50 states, the District of Columbia, Puerto Rico, Guam and American Samoa, Federated States of Micronesia, and Palau, by conducting both landline telephone- and cellular telephone-based surveys. Disproportionate stratified sampling (DSS) has been used for the landline sample and the cellular telephone respondents are randomly selected with each having equal probability of selection. The dataset we are working on contains 330 variables for a total of 491, 775 observations in 2013. The missing values denoted by “NA”.
The sample data should allow us to generalize to the population of interest. It is a survey of 491,775 U.S. adults aged 18 years or older. It is based on a large stratified random sample. Potential biases are associated with non-response, incomplete interviews, missing values and convenience bias (some potential respondents may not have been included because they do not have a landline and cell phone).
There is no causation can be established as BRFSS is an observation study that can only establish correlation/association between variables.
Research quesion 1: For the general population in the US,is there a relation between person’s frequency of tobacoo consumption and number of days with poor mental health condition? Usually people say that they find escape to their mental issues thorugh tobacco usage. Ifs this just an excuse or does the Data reflect this?
Research quesion 2: For the general population in the US, is there a relation between physical activity and income level. We might suspect that higher income or employment statua might lead to more activity as there might be more freedom to give proper attention to their health.
Research quesion 3: For the general population of the US who are over 18 years of age,what is the distribution of hours of sleep based on the gender difference. Does there exist difference in the distribution based on the whether the individual is male or female?
Research quesion 1:Relationship between Frequency of days now Smoking and Number of days of Poor Mental Health reported.
#Remove the NAs from the data is in focus.
cleaned_data1 <- brfss2013 %>% filter(!is.na(smokday2),!is.na(menthlth))
#Plotting the selected feature
ggplot(data = cleaned_data1,aes(x=smokday2,fill=smokday2))+geom_bar(alpha = 0.5)+ labs(title="Distribution of Smoking habits",x="Frequency of Smoking",y="Population")+scale_y_continuous(breaks = scales::pretty_breaks(n=10))#Plotting the Bar plot to display data
pl = ggplot(data = cleaned_data1,aes(x=smokday2,y= menthlth,fill = smokday2)) + geom_bar(stat = "Identity",alpha=0.5)
pl + labs(title="Smoking and Mental Health related?",x="Frequency of Smoking",y="Days with Poor Mental Health",fill = "Frequency of Smoking")+scale_y_continuous(breaks = scales::pretty_breaks(n=10))The bar plot between Mental Health and Frequency of smoking suggests that people who smokes not at all has the highest number of days with poor mental health.But this is not true,as the Feature corresponds to the numeric value of number of days of poor mental health and should be considered a categorical variable.
pl = ggplot(data = cleaned_data1,aes(y=smokday2,x= factor(menthlth),fill = smokday2)) + geom_count()
pl + labs(title="Smoking and Mental Health related?",x="Frequency of Smoking",y="Days with Poor Mental Health")## cleaned_data1$smokday2: Every day
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 6.386 10.000 30.000
## --------------------------------------------------------
## cleaned_data1$smokday2: Some days
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 5.77 7.00 30.00
## --------------------------------------------------------
## cleaned_data1$smokday2: Not at all
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 3.159 2.000 30.000
The statistics show that the mean number of days with poor mental health is the least for people who smokes not at all. This means that people with more days with poor mental health tend to smoke more.
Research quesion 2: More Income = More Exercise?
# Remove the NAs and find porportion that does any exercise grouped by income.
cleaned_data2 <- brfss2013 %>% filter(!is.na(exerany2), !is.na(income2), !is.na(employ1)) %>% group_by(income2) %>% summarise(prop_exer = sum(exerany2 == "Yes") / n())
# Plot - first replace spaces with line breaks in income2 names.
levels(cleaned_data2$income2) <- gsub(" ", "\n", levels(cleaned_data2$income2))
ggplot(cleaned_data2, aes(income2, prop_exer)) +
geom_point(aes(income2, prop_exer)) +
labs(title="Proportion who exercised in last 30 days vs. Income", x="Income", y="Proportion Exercise")The plot clearly indicates that there is an increasing proportion of the people who exercised in the last 30 days as we jump to a higher income scale. There does come a dip in the proportion who has a income between 10,000 and 15,000 dollars which does not help us to infer anything.
Research quesion 3: Nap time also has a Battle of the Sexes?**
#Cleaning the data and then selection of required columns.
cleaned_data3 <- brfss2013 %>% filter(!is.na(ladult),!is.na(sleptim1))
cleaned_data3 <- cleaned_data3 %>% select(ladult,sleptim1)
#Getting rid of the outliers
cleaned_data3 <- cleaned_data3 %>% filter(sleptim1 < 20)
#Mean sleep of male and female respondents
levels(cleaned_data3$ladult) <- c("Male","Female")
names(cleaned_data3) <- c("Gender","Sleep_Time")
#Examining the ditribution of Male and Female respondents' hours of sleep
ggplot(data = cleaned_data3,aes(Gender,factor(Sleep_Time)))+geom_count(fill="green")## cleaned_data3$Gender: Male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 6.000 8.000 7.412 8.000 10.000
## --------------------------------------------------------
## cleaned_data3$Gender: Female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 7.500 6.611 8.000 9.000
The sleep time of male adults were observed to distributed between 5 to 10 hours where as that of the female adults were between 1 to 9 hours. For both the genders the median is greater than the mean, hence a left skewness in observed. Also the mean sleep time for the male adults were observed to be greater than that of the females.
Firstly,the process of Feature selection. The features selected are: smoke100: Smoked At Least 100 Cigarettes
avedrnk2: Avg Alcoholic Drinks Per Day In Past 30
bphigh4: Ever Told Blood Pressure High
toldhi2: Ever Told Blood Cholesterol High
weight2: Reported Weight In Pounds
cvdstrk3: Ever Diagnosed With A Stroke
Converting the above variables to numeric, and see any correlation between these numerical variables.
vars <- names(brfss2013) %in% c('smoke100', 'avedrnk2', 'bphigh4', 'toldhi2', 'weight2')
selected_brfss <- brfss2013[vars]
selected_brfss$toldhi2 <- ifelse(selected_brfss$toldhi2=="Yes", 1, 0)
selected_brfss$smoke100 <- ifelse(selected_brfss$smoke100=="Yes", 1, 0)
selected_brfss$weight2 <- as.numeric(selected_brfss$weight2)
selected_brfss$bphigh4 <- as.factor(ifelse(selected_brfss$bphigh4=="Yes", "Yes", (ifelse(selected_brfss$bphigh4=="Yes, but female told only during pregnancy", "Yes", (ifelse(selected_brfss$bphigh4=="Told borderline or pre-hypertensive", "Yes", "No"))))))
selected_brfss$bphigh4 <- ifelse(selected_brfss$bphigh4=="Yes", 1, 0)
names(selected_brfss) <- c("High_BP","High_Cholestrol","Weight","Smoke","Alcohol")
selected_corr <- na.delete(selected_brfss)
corr.matrix <- cor(selected_corr)
corrplot(corr.matrix, method="number")No any two numeric variables seem to have strong correlations.
Next,I will combine three features to make our target variable,which are: cvdinfr4: Ever Diagnosed With Heart Attack
cvdcrhd4: Ever Diagnosed With Angina Or Coronary Heart Disease
cvdstrk3: Ever Diagnosed With A Stroke
chronic_diseases <- data.frame("Stroke"=brfss2013$cvdstrk3,"Coronary_disease"=brfss2013$cvdcrhd4,"Heart_attack"=brfss2013$cvdinfr4)
chronic_diseases$Stroke <- ifelse(chronic_diseases$Stroke=="Yes",1,0)
chronic_diseases$Coronary_disease <- ifelse(chronic_diseases$Coronary_disease=="Yes",1,0)
chronic_diseases$Heart_attack <- ifelse(chronic_diseases$Heart_attack=="Yes",1,0)
chronic_diseases$Stroke <- replace(chronic_diseases$Stroke,which(is.na(chronic_diseases$Stroke)),0)
chronic_diseases$Coronary_disease <- replace(chronic_diseases$Coronary_disease,which(is.na(chronic_diseases$Coronary_disease)),0)
chronic_diseases$Heart_attack <- replace(chronic_diseases$Heart_attack,which(is.na(chronic_diseases$Heart_attack)),0)
chronic_diseases$Chronic_disease<-ifelse((chronic_diseases$Stroke+chronic_diseases$Coronary_disease+chronic_diseases$Heart_attack)>0,1,0)
model_data <- as.data.frame(cbind(selected_brfss,chronic_diseases$Chronic_disease))
names(model_data)[6] <- "chronic_disease"
model_data$High_BP <- replace(model_data$High_BP, which(is.na(model_data$High_BP)),0)
model_data$High_Cholestrol <- replace(model_data$High_Cholestrol, which(is.na(model_data$High_Cholestrol)),0)
model_data$chronic_disease <- replace(model_data$chronic_disease, which(is.na(model_data$chronic_disease)),0)
model_data$Smoke <- replace(model_data$Smoke, which(is.na(model_data$Smoke)),0)
#Replace NA with average
model_data$Alcohol <- replace(model_data$Alcohol, which(is.na(model_data$Alcohol)),round(mean(model_data$Alcohol,na.rm = T)))
summary(model_data)## High_BP High_Cholestrol Weight Smoke
## Min. :0.0000 Min. :0.0000 Min. : 1.00 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 43.00 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median : 73.00 Median :0.0000
## Mean :0.4223 Mean :0.3731 Mean : 80.22 Mean :0.4376
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:103.00 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :570.00 Max. :1.0000
## Alcohol chronic_disease
## Min. : 1.000 Min. :0.0000
## 1st Qu.: 2.000 1st Qu.:0.0000
## Median : 2.000 Median :0.0000
## Mean : 2.099 Mean :0.1164
## 3rd Qu.: 2.000 3rd Qu.:0.0000
## Max. :76.000 Max. :1.0000
## High_BP High_Cholestrol Weight Smoke Alcohol chronic_disease
## 1 1 1 154 1 2 0
## 2 0 0 30 0 2 0
## 3 0 0 63 1 4 0
## 4 0 1 31 0 2 0
## 5 1 0 169 1 2 0
## 6 1 1 128 0 2 0
After wrangling and cleaning the data, now we can fit the model.
Model Fitting
train_index <- createDataPartition(model_data$chronic_disease,p=0.8,list = FALSE,times = 1)
train <- model_data[train_index,]
test <- model_data[-train_index,]
model <- glm(chronic_disease~.,family=binomial(link = 'logit'),data=train)
summary(model)##
## Call:
## glm(formula = chronic_disease ~ ., family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0409 -0.5104 -0.3305 -0.2475 3.4258
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.403e+00 1.607e-02 -211.744 <2e-16 ***
## High_BP 1.276e+00 1.185e-02 107.704 <2e-16 ***
## High_Cholestrol 8.780e-01 1.101e-02 79.783 <2e-16 ***
## Weight 7.391e-04 8.549e-05 8.646 <2e-16 ***
## Smoke 5.596e-01 1.053e-02 53.136 <2e-16 ***
## Alcohol -6.077e-02 4.570e-03 -13.298 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 283561 on 393419 degrees of freedom
## Residual deviance: 250669 on 393414 degrees of freedom
## AIC: 250681
##
## Number of Fisher Scoring iterations: 5
Interpreting the results of my logistic regression model:
All the variables are statistically significant.
All other variables being equal, being told high cholestrol is more likely to have a stroke.
The negative coefficient for the predictor Alcohol suggests that all other variables being equal, higher the alcohol drinksper day in the past 30 days is less likely to have a stroke.
For every one unit change in weight, the log odds of having a stroke (versus no-stroke) increases by 0.0007391.
Not Smoked At Least 100 Cigarettes, less likely to have a stroke.
For a one unit increase in Avg alcoholic drinks per day in past 30 days, the log odds of having a stroke decreases by 0.06077.
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: chronic_disease
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 393419 283561
## High_BP 1 22644.2 393418 260917 < 2.2e-16 ***
## High_Cholestrol 1 7192.1 393417 253725 < 2.2e-16 ***
## Weight 1 99.5 393416 253625 < 2.2e-16 ***
## Smoke 1 2745.6 393415 250880 < 2.2e-16 ***
## Alcohol 1 211.1 393414 250669 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analyzing the deviance table we can see the drop in deviance when adding each variable one at a time.
fitted.results <- predict(model,newdata=test,type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != test$chronic_disease)
print(paste('Accuracy',(1-misClasificError)*100,"%"))## [1] "Accuracy 88.4998220731025 %"
One last note, when we analyze health survery data, we must be aware that self-reported prevalence may be biased because respondents may not be aware of their risk status. Therefore, to achieve more precise estimates, researchers are using laboratory tests as well as self-reported data.