Setup

Load packages

library(dplyr)
library(statsr)
library(janitor)

Load data

load("/Users/kuan/Statistics/DAprojectData.RData")

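Background on this data is given in the BRFSS 2013 overview (Centers for Disease Control and Prevention, 2014; see References).

The BRFSS 2013 codebook report, also listed under References, explains what each column in the data represents.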


Part 1: Data

Generalizability: Describe how the observations in the sample are collected, and the implications of this data collection method on the scope of inference (generalizability / causality).

In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone, resides in a private residence or college housing, and received 90 percent or more of their calls on cellular telephones. (Centers for Disease Control and Prevention, 2014, Background section)

Cellular telephone sampling frames are commercially available and the system can call random samples of cellular telephone numbers, but doing so requires specific protocols. The basis of the 2013 BRFSS sampling frame is the Telecordia database of telephone exchanges (e.g., 617-492-0000 to 617-492-9999) and 1,000 banks (e.g., 617-492-0000 to 617-492-0999). (Centers for Disease Control and Prevention, 2014, Sample Description section)

To meet the BRFSS standard for the participating states’ sample designs, one must be able to justify sample records as a probability sample of all households with telephones in the state. All participating areas met this criterion in 2013. Fifty-one projects used a disproportionate stratified sample (DSS) design for their landline samples. Guam and Puerto Rico used a simple random-sample design. (Centers for Disease Control and Prevention, 2014, Sample Description section)

In the type of DSS design that states most commonly used in the BRFSS landline telephone sampling, BRFSS divides telephone numbers into two groups, or strata, which are sampled separately. The high-density and medium-density strata contain telephone numbers that are expected to belong mostly to households. (Centers for Disease Control and Prevention, 2014, Sample Description section)
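The idea of sampling strata separately at different rates can be sketched in a few lines of R. This is only an illustration under stated assumptions: the frame size, the stratum sizes, and the sampling fractions in `rate` are hypothetical and are not BRFSS values, and the inverse-fraction design weight is the standard textbook correction for unequal sampling rates, not necessarily the weighting BRFSS applies.

```r
set.seed(1)

# Hypothetical sampling frame of 10,000 telephone numbers in two strata
frame <- data.frame(
  id = 1:10000,
  stratum = rep(c("high", "medium"), times = c(3000, 7000)),
  stringsAsFactors = FALSE
)

# Hypothetical sampling fractions: the high-density stratum is oversampled
rate <- c(high = 0.10, medium = 0.02)

# Draw a simple random sample separately within each stratum
sampled <- do.call(rbind, lapply(split(frame, frame$stratum), function(s) {
  n <- round(nrow(s) * rate[[s$stratum[1]]])
  s[sample(nrow(s), n), ]
}))

# Design weight = 1 / sampling fraction of the record's stratum
sampled$weight <- unname(1 / rate[sampled$stratum])

table(sampled$stratum)   # sample size per stratum
sum(sampled$weight)      # weighted total recovers the frame size
```

Because the strata are sampled at different rates, unweighted summaries of such a sample over-represent the oversampled stratum, which is why design weights matter when generalizing to the population.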

Causality: Describe how the observations in the sample are collected, and the implications of this data collection method on causality.

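Subjects are adults who were randomly selected by landline and/or cellular telephone. Random sampling supports generalization to the sampled population (adults in households with telephones), but the BRFSS cannot show causality because it is a retrospective observational study: responses are recorded as given, and participants are not randomly assigned to any condition.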

References:

Centers for Disease Control and Prevention. (2014, August 15). Behavioral Risk Factor Surveillance System: Overview: BRFSS 2013 [PDF]. https://www.cdc.gov/brfss/annual_data/2013/pdf/overview_2013.pdf

Centers for Disease Control and Prevention. (2014, October 24). Behavioral Risk Factor Surveillance System: 2013 codebook report: Land-line and cell-phone data [PDF]. https://www.cdc.gov/brfss/annual_data/2013/pdf/CODEBOOK13_LLCP.pdf


Part 2: Research questions

Research question 1:

Is there an association between smokday2: Frequency Of Days Now Smoking and chccopd1: (Ever Told) You Have (COPD) Chronic Obstructive Pulmonary Disease, Emphysema?

Research question 2:

What is the relationship between X_rfdrhv4: Heavy Alcohol Consumption Calculated Variable and smokday2: Frequency Of Days Now Smoking?

Research question 3:

How significant is the relationship between pregnant: Pregnancy Status and X_rfdrwm4: Adult Women Heavy Alcohol Consumption Calculated Variable?


Part 3: Exploratory data analysis

Research question 1:

lung <- table(brfss2013$smokday2, brfss2013$chccopd1)
lung
##             
##                 Yes     No
##   Every day    9999  44747
##   Some days    3186  18163
##   Not at all  16746 120536
result1 <- chisq.test(lung)
result1
## 
##  Pearson's Chi-squared test
## 
## data:  lung
## X-squared = 1210, df = 2, p-value < 2.2e-16
result1$p.value        # small p-value => evidence of association
## [1] 1.7664e-263
alpha <- 0.05
ifelse(result1$p.value < alpha, "p-value is significant", "p-value is not significant")  
## [1] "p-value is significant"
plot(brfss2013$smokday2, 
     main = "Daily Cigarettes Status",  
     xlab = "Figure 1 shows the responses of 491,775 participants regarding their daily smoking habits.")

plot(brfss2013$chccopd1, 
     main = "Having COPD, Emphysema or Chronic Bronchitis Status", 
     xlab = paste("Figure 2 portrays the responses of 491,775 participants when asked if they have ever",
                  "been told that they have COPD, emphysema or chronic bronchitis.", sep = "\n"))

At the 5% significance level, I reject the null hypothesis of independence and conclude that, in the sample data, there is a statistically significant association between smoking frequency and ever having been told one has COPD, emphysema or chronic bronchitis.
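With more than 200,000 cross-classified responses, even a weak association produces a very small p-value, so it can help to look at the size of the effect as well. The following is a minimal sketch that is not part of the original analysis: it rebuilds the table from the counts printed above (so it runs without the data file), then computes row proportions and Cramér's V. The names `lung_counts`, `copd_share`, `test`, and `cramers_v` are illustrative.

```r
# Rebuild the contingency table from the counts printed above
lung_counts <- matrix(
  c( 9999,  44747,
     3186,  18163,
    16746, 120536),
  nrow = 3, byrow = TRUE,
  dimnames = list(smokday2 = c("Every day", "Some days", "Not at all"),
                  chccopd1 = c("Yes", "No"))
)

# Share of each smoking group that reports a COPD-type diagnosis
copd_share <- prop.table(lung_counts, margin = 1)
round(copd_share, 3)

# Cramer's V = sqrt(X^2 / (n * (min(rows, cols) - 1)))
test <- chisq.test(lung_counts)
cramers_v <- sqrt(unname(test$statistic) /
                    (sum(lung_counts) * (min(dim(lung_counts)) - 1)))
cramers_v
```

V ranges from 0 (no association) to 1 (perfect association), which makes it a useful companion to the p-value in a sample of this size.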

Research question 2:

liv_lung <- table(brfss2013$X_rfdrhv4, brfss2013$smokday2)
liv_lung
##      
##       Every day Some days Not at all
##   No      47167     18834     125916
##   Yes      6212      1932       8972
result2 <- chisq.test(liv_lung)
result2
## 
##  Pearson's Chi-squared test
## 
## data:  liv_lung
## X-squared = 1302.9, df = 2, p-value < 2.2e-16
result2$p.value        # small p-value => evidence of association
## [1] 1.192845e-283
alpha <- 0.05
ifelse(result2$p.value < alpha, "p-value is significant", "p-value is not significant")  
## [1] "p-value is significant"
plot(brfss2013$X_rfdrhv4, 
     main = "Heavy Alcohol Consumption Status", 
     xlab = paste("Figure 3 demonstrates the responses of 491,775 participants about their heavy alcohol",
                  "consumption habits.", sep = "\n"))

plot(brfss2013$smokday2, 
     main = "Daily Cigarettes Status", 
     xlab = "Figure 4 shows the responses of 491,775 participants regarding their daily smoking habits.")

At the 5% significance level, I reject the null hypothesis of independence and conclude that, in the sample data, there is a statistically significant association between heavy alcohol consumption status and smoking frequency.
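The chi-squared test says that the two variables are associated but not which cells drive the result. One way to look into this, sketched here as an optional follow-up and not as part of the original analysis, is to inspect the expected counts and standardized residuals that `chisq.test()` returns. The table is rebuilt from the counts printed above; `liv_lung_counts` and `fit` are illustrative names.

```r
# Rebuild the contingency table from the counts printed above
liv_lung_counts <- matrix(
  c(47167, 18834, 125916,
     6212,  1932,   8972),
  nrow = 2, byrow = TRUE,
  dimnames = list(X_rfdrhv4 = c("No", "Yes"),
                  smokday2 = c("Every day", "Some days", "Not at all"))
)

fit <- chisq.test(liv_lung_counts)

# A common rule of thumb asks for expected counts of at least 5 in every cell
min(fit$expected)

# Standardized residuals: positive values mean more respondents than
# expected under independence, negative values mean fewer
round(fit$stdres, 1)
```

Cells with large positive or negative residuals are the ones that depart most from independence.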

Research question 3:

preg_liv <- table(brfss2013$pregnant, brfss2013$X_rfdrwm4)
preg_liv
##      
##          No   Yes
##   Yes  2928    40
##   No  66828  4066
result3 <- chisq.test(preg_liv)
result3
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  preg_liv
## X-squared = 103.63, df = 1, p-value < 2.2e-16
result3$p.value
## [1] 2.443343e-24
alpha <- 0.05
ifelse(result3$p.value < alpha, "p-value is significant", "p-value is not significant")  
## [1] "p-value is significant"
plot(brfss2013$pregnant, 
     main = "Pregnancy Status", 
     xlab = "Figure 5 demonstrates subjects' responses to the question about pregnancy status.")

plot(brfss2013$X_rfdrwm4, 
     main = "Heavy Drinker Status", 
     xlab = "Figure 6 presents participants' heavy alcohol consumption habits.")

At the 5% significance level, I reject the null hypothesis of independence and conclude that, in the sample data, pregnancy status and heavy drinker status are statistically associated. This result does not imply a causal relationship.
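For a 2 x 2 table, the direction and size of the association can be summarized with group proportions and an odds ratio. The sketch below is an optional follow-up, not part of the original analysis. It rebuilds the table from the counts printed above and uses a Wald interval on the log scale, which is one common choice among several. The names `preg_counts`, `heavy_share`, `odds_ratio`, and `ci` are illustrative.

```r
# Rebuild the contingency table from the counts printed above
preg_counts <- matrix(
  c( 2928,   40,
    66828, 4066),
  nrow = 2, byrow = TRUE,
  dimnames = list(pregnant = c("Yes", "No"),
                  X_rfdrwm4 = c("No", "Yes"))
)

# Share of heavy drinkers within each pregnancy group
heavy_share <- prop.table(preg_counts, margin = 1)[, "Yes"]
round(heavy_share, 4)

# Odds of heavy drinking, pregnant relative to non-pregnant respondents
odds <- preg_counts[, "Yes"] / preg_counts[, "No"]
odds_ratio <- unname(odds["Yes"] / odds["No"])

# Wald 95% confidence interval, computed on the log-odds scale
se_log_or <- sqrt(sum(1 / preg_counts))
ci <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)

odds_ratio
ci
```

An odds ratio below 1 with an interval that excludes 1 indicates lower odds of heavy drinking in the pregnant group, which agrees with the direction of the test result without implying causation.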