Setup

Load packages

## Warning: package 'ggplot2' was built under R version 3.6.3
## Warning: package 'dplyr' was built under R version 3.6.3
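
The warnings above name the two packages this report relies on. A setup chunk along the following lines would load them; this is a minimal sketch that assumes both packages are already installed, not the exact chunk used in the original report.

```r
# Load the plotting and data-manipulation packages named in the warnings above.
library(ggplot2)
library(dplyr)
```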

Part 1: Data

The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all states in the United States (US), US territories, and the Centers for Disease Control and Prevention (CDC). It is designed to measure behavioral risk factors in the adult population of the US. The objective of the survey is to collect data on health practices and behaviors linked to injuries and diseases. Data have been collected through telephone surveys since 1984, and cellular telephone surveys were added in 2011.

Part 2: Research questions

Research question 1: My first research question explores the relationship between depressive disorders and two habits, smoking and alcohol consumption. My initial assumption is that individuals with depressive disorders are more likely to have such habits.

Research question 2: For my second question, I will look into the relationship between veteran status, depression, and sleep problems. The general consensus is that adults (18-64) need at least 6 hours of sleep. I will use this value to determine whether veterans are getting the recommended amount of sleep.

Research question 3: For the final question, I will look into the relationship between cancer and smoking.


Part 3: Exploratory data analysis

Research question 1:

Determine drinkers and non drinkers

## # A tibble: 3 x 2
##   alcohol_drinker  count
##   <chr>            <int>
## 1 Drinker         235412
## 2 Non drinker     236719
## 3 <NA>             19644
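
The report does not show how `alcohol_drinker` was derived, so the following is only a sketch of one way such a label could be built with dplyr. The toy data frame and the `drink_days` column are hypothetical stand-ins, not the survey data or the variable actually used.

```r
library(dplyr)

# Hypothetical toy data: `drink_days` is an assumed column holding the number
# of days on which a respondent drank; NA marks a missing answer.
toy <- data.frame(drink_days = c(0, 4, NA, 0, 12, 30))

drinker_counts <- toy %>%
  # Label each respondent; a missing answer stays NA, as in the table above.
  mutate(alcohol_drinker = ifelse(drink_days > 0, "Drinker", "Non drinker")) %>%
  group_by(alcohol_drinker) %>%
  summarise(count = n())

print(drinker_counts)
```

Keeping the missing answers as their own `NA` group, rather than dropping them, makes it visible how many respondents are excluded from later comparisons.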

Find number of individuals with depression who have drinking and smoking habits

Summaries

## # A tibble: 4 x 3
## # Groups:   addepev2 [2]
##   addepev2 smoke100  count
##   <fct>    <fct>     <int>
## 1 Yes      Yes       52850
## 2 Yes      No        40848
## 3 No       Yes      161297
## 4 No       No       219742
## # A tibble: 4 x 3
## # Groups:   addepev2 [2]
##   addepev2 alcohol_drinker  count
##   <fct>    <chr>            <int>
## 1 Yes      Drinker          41343
## 2 Yes      Non drinker      51630
## 3 No       Drinker         193242
## 4 No       Non drinker     183847

Visualization
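
The plotting code is not shown in the report. As a hedged illustration, a grouped bar chart of the smoking counts could be drawn with ggplot2 as follows; the counts are copied from the summary table above, while the data frame name, chart type, and axis labels are assumptions.

```r
library(ggplot2)

# Counts copied from the smoking summary table above.
smoking_counts <- data.frame(
  addepev2 = factor(c("Yes", "Yes", "No", "No"), levels = c("Yes", "No")),
  smoke100 = factor(c("Yes", "No", "Yes", "No"), levels = c("Yes", "No")),
  count    = c(52850, 40848, 161297, 219742)
)

# Grouped bar chart: one pair of bars per depression status.
smoking_plot <- ggplot(smoking_counts,
                       aes(x = addepev2, y = count, fill = smoke100)) +
  geom_col(position = "dodge") +
  labs(x = "Depressive disorder (addepev2)",
       y = "Number of respondents",
       fill = "Smoker (smoke100)")

print(smoking_plot)
```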

According to the visualizations above, there are slightly more non-drinkers than drinkers among respondents with a depressive disorder (51,630 versus 41,343). This does not align with my assumption that individuals with depressive disorders would be more likely to drink. For smoking, the pattern is reversed: smokers outnumber non-smokers among respondents with a depressive disorder (52,850 versus 40,848). Because these are raw counts from groups of very different sizes, they are not sufficient on their own to establish a relationship between the two variables.

Notably, there is a large difference between the number of smokers and non-smokers without depressive disorders (161,297 versus 219,742).
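
Because the groups with and without a depressive disorder differ greatly in size, within-group proportions are easier to compare than raw counts. The sketch below computes them in base R from the counts in the two summary tables; the matrix names are illustrative only.

```r
# Counts copied from the two summary tables above
# (rows: depressive disorder yes/no).
smoking <- matrix(
  c(52850, 40848,
    161297, 219742),
  nrow = 2, byrow = TRUE,
  dimnames = list(addepev2 = c("Yes", "No"), smoke100 = c("Yes", "No"))
)
drinking <- matrix(
  c(41343, 51630,
    193242, 183847),
  nrow = 2, byrow = TRUE,
  dimnames = list(addepev2 = c("Yes", "No"),
                  alcohol_drinker = c("Drinker", "Non drinker"))
)

# Row-wise proportions: the share of each habit within each depression group.
smoking_share  <- prop.table(smoking, margin = 1)
drinking_share <- prop.table(drinking, margin = 1)

print(round(100 * smoking_share, 1))
print(round(100 * drinking_share, 1))
```

With these counts, about 56% of respondents with a depressive disorder are smokers, compared with about 42% of those without. Drinkers make up about 44% of the first group and about 51% of the second.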

Research question 2:

Select veterans, depression, and sleep time variables

##   veteran3 addepev2 sleptim1
## 1      Yes      Yes        3
## 2      Yes      Yes        5
## 3      Yes       No        8
## 4      Yes       No        5
## 5      Yes       No        8
## 6      Yes       No        6

Summary

## # A tibble: 2 x 4
##   addepev2 count  mean median
##   <fct>    <int> <dbl>  <dbl>
## 1 Yes       9542  6.61      7
## 2 No       50019  7.11      7
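
The selection and summary steps above can be sketched with dplyr as follows. The column names come from the printed output, but the toy data frame, the handling of missing values, and the exact pipeline are assumptions; the results therefore describe only the toy rows, not the full survey.

```r
library(dplyr)

# The first six rows are the ones printed above; the seventh (a non-veteran)
# is a hypothetical row added only so that the filter has something to drop.
toy <- data.frame(
  veteran3 = factor(c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No"),
                    levels = c("Yes", "No")),
  addepev2 = factor(c("Yes", "Yes", "No", "No", "No", "No", "No"),
                    levels = c("Yes", "No")),
  sleptim1 = c(3, 5, 8, 5, 8, 6, 7)
)

sleep_summary <- toy %>%
  # Keep veterans with known depression status and sleep time (assumed rule).
  filter(veteran3 == "Yes", !is.na(addepev2), !is.na(sleptim1)) %>%
  select(veteran3, addepev2, sleptim1) %>%
  group_by(addepev2) %>%
  summarise(
    count  = n(),
    mean   = mean(sleptim1),
    median = median(sleptim1)
  )

print(sleep_summary)
```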

Visualization

This bar plot shows the difference between veterans with and without depression.

The mean sleep time of veterans with depression is 6.61 hours, which is below the 7 or more hours commonly recommended for adults but still above the 6-hour threshold adopted for this analysis. It is also lower than the mean sleep time of veterans without depression (7.11 hours), although the median is 7 hours in both groups.

Research question 3:

Select only smokers to examine their rate of cancer.

##   chcocncr smoke100
## 1       No      Yes
## 2       No      Yes
## 3      Yes      Yes
## 4       No      Yes
## 5       No      Yes
## 6       No      Yes

Summary

## # A tibble: 2 x 2
##   chcocncr  count
##   <fct>     <int>
## 1 Yes       24586
## 2 No       190007
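
To express the summary above as a rate, the two counts can be converted to shares. This is a small base R sketch using the counts from the table; the variable names are illustrative.

```r
# Counts of smokers with and without cancer, copied from the table above.
cancer_counts <- c(Yes = 24586, No = 190007)

# Share of each outcome among the selected smokers.
cancer_share <- prop.table(cancer_counts)

print(round(100 * cancer_share, 1))
```

With these counts, roughly 11.5% of the selected smokers report having had cancer.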

Visualization

The plot above shows that the number of smokers with cancer is far lower than the number of smokers without cancer, which is a bit surprising. This graph may be misleading, since the data record cancer of any type rather than specific cancers. If data were available specifically for lung cancer, the difference might not be as large.