R Markdown

Load Packages

Description of the e-cigarette dataset from Center for Disease Control and Prevention, 2016.

The data set was collected via health-related telephone surveys on health-related risk behaviors, chronic health conditions, and use of preventive services from the noninstitutionalized adult population 18 years of age and older in the US during 2016 calendar year. From the study “The relationship between e-cigarette use and chronic health conditions” can be examined.

Below is a description of variables and their possible values. * SEQNO [unique identifier] * ECIGARETTE [E-cigarette user status.] 1 = Currently using e-cigarettes, 2 = Not current e-cigarette user

Question 1: Call the data set

Mydata <- read_sav("ECigSPSS.sav")
#View(Mydata)
Mydata$chronic<-NA # initializing chronic
Mydata$health<-NA # initializing chronic
head(Mydata)
## # A tibble: 6 x 15
##   Stroke Asthma Cancer  Copd Arthritis Depression Kidney Diabetes   SEX Heart
##    <dbl>  <dbl>  <dbl> <dbl>     <dbl>      <dbl>  <dbl>    <dbl> <dbl> <dbl>
## 1      0      0      0     0         0          0      0        0     1     0
## 2      0      0      0     0         0          1      0        0     2     0
## 3      0      0      0     0         1          0      0        0     2     1
## 4      0      0      0     0         1          0      0        1     1     0
## 5      0      1      0     0         0          0      0        0     1     0
## 6      0      0      0     0         0          0      0        0     2     0
## # … with 5 more variables: Age <dbl>, Ecigarette <dbl>, SEQNO <chr>,
## #   chronic <lgl>, health <lgl>
str(Mydata)
## tibble [15,000 × 15] (S3: tbl_df/tbl/data.frame)
##  $ Stroke    : num [1:15000] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "label")= chr "EVER DIAGNOSED WITH A STROKE"
##   ..- attr(*, "format.spss")= chr "F1.0"
##  $ Asthma    : num [1:15000] 0 0 0 0 1 0 0 1 0 0 ...
##   ..- attr(*, "label")= chr "EVER TOLD HAD ASTHMA"
##   ..- attr(*, "format.spss")= chr "F1.0"
##  $ Cancer    : num [1:15000] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "label")= chr "(EVER TOLD) YOU HAD ANY OTHER TYPES OF C"
##   ..- attr(*, "format.spss")= chr "F1.0"
##  $ Copd      : num [1:15000] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "label")= chr "(EVER TOLD) YOU HAVE (COPD) CHRONIC OBST"
##   ..- attr(*, "format.spss")= chr "F1.0"
##   ..- attr(*, "display_width")= int 6
##  $ Arthritis : num [1:15000] 0 0 1 1 0 0 1 1 0 0 ...
##   ..- attr(*, "label")= chr "TOLD HAVE ARTHRITIS"
##   ..- attr(*, "format.spss")= chr "F1.0"
##   ..- attr(*, "display_width")= int 11
##  $ Depression: num [1:15000] 0 1 0 0 0 0 0 1 0 0 ...
##   ..- attr(*, "label")= chr "EVER TOLD YOU HAD A DEPRESSIVE DISORDER"
##   ..- attr(*, "format.spss")= chr "F1.0"
##   ..- attr(*, "display_width")= int 12
##  $ Kidney    : num [1:15000] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "label")= chr "(EVER TOLD) YOU HAVE KIDNEY DISEASE?"
##   ..- attr(*, "format.spss")= chr "F1.0"
##  $ Diabetes  : num [1:15000] 0 0 0 1 0 0 0 1 0 0 ...
##   ..- attr(*, "label")= chr "(EVER TOLD) YOU HAVE DIABETES"
##   ..- attr(*, "format.spss")= chr "F1.0"
##   ..- attr(*, "display_width")= int 10
##  $ SEX       : num [1:15000] 1 2 2 1 1 2 2 2 2 1 ...
##   ..- attr(*, "label")= chr "RESPONDENTS SEX"
##   ..- attr(*, "format.spss")= chr "F1.0"
##   ..- attr(*, "display_width")= int 5
##  $ Heart     : num [1:15000] 0 0 1 0 0 0 0 0 0 0 ...
##   ..- attr(*, "label")= chr "RESPONDENTS THAT HAVE EVER REPORTED HAVI"
##   ..- attr(*, "format.spss")= chr "F1.0"
##   ..- attr(*, "display_width")= int 7
##  $ Age       : num [1:15000] 43 59 80 70 18 65 74 76 43 56 ...
##   ..- attr(*, "label")= chr "IMPUTED AGE VALUE COLLAPSED ABOVE 80"
##   ..- attr(*, "format.spss")= chr "F2.0"
##   ..- attr(*, "display_width")= int 5
##  $ Ecigarette: num [1:15000] 2 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "label")= chr "CURRENT E-CIGARETTE USER CALCULATED VARI"
##   ..- attr(*, "format.spss")= chr "F1.0"
##   ..- attr(*, "display_width")= int 12
##  $ SEQNO     : chr [1:15000] "2016000001" "2016000002" "2016000003" "2016000004" ...
##   ..- attr(*, "label")= chr "ANNUAL SEQUENCE NUMBER"
##   ..- attr(*, "format.spss")= chr "A30"
##   ..- attr(*, "display_width")= int 26
##  $ chronic   : logi [1:15000] NA NA NA NA NA NA ...
##  $ health    : logi [1:15000] NA NA NA NA NA NA ...

Question 2: Generate frequency tables for each of the following variables ECIGARETTE, DEPRESSION, and AGE answer questions #1 through #4 on the blackboard quiz.

Note that I am using different table styles for printing output which is tabular, using kable and kableextra packages. You can read up more on it https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html

Dep.count<-tally(Mydata$Depression)
ECig.Count<-tally(Mydata$Ecigarette)
ECig.Smoker<-tally(Mydata$Depression~Mydata$Ecigarette)

Dep.count %>%
  kbl(caption = "Frequency Table for Depression") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Frequency Table for Depression
X Freq
0 12154
1 2785
NA 61
ECig.Count %>%
  kbl(caption = "Frequency Table for ECigarette") %>%
  kable_paper("hover", full_width = F)
Frequency Table for ECigarette
X Freq
1 497
2 13974
NA 529
ECig.Smoker %>%
  kbl(caption="ECigarette vs Depression") %>%
  kable_material(c("striped", "hover"))
ECigarette vs Depression
1 2 NA
0 305 11391 458
1 188 2532 65
NA 4 51 6

Question 3: Create a new quantitative variable (CHRONIC) from the 9 no/yes chronic health categorical variables. This new variable will represent a total number of chronic health conditions ranging from 0 to 9.

Mydata$chronic<-NA # initializing chronic
Mydata$chronic <- Mydata$Stroke + Mydata$Asthma + Mydata$Cancer + Mydata$Copd + Mydata$Arthritis +  Mydata$Kidney + Mydata$Diabetes + Mydata$Depression

Problem 4: Run a Pearson correlation to evaluate the association between AGE (Explanatory) and CHRONIC (Response). Answer questions #5 and #6.

Bar Graph, Quantitative vs. Categorical Data

One limitation of such plots is that they do not display the distribution of the data - only the summary statistic for each group. The plots below correct this limitation to some extent. If you want to alleviate the issue, you can read up more and fit kernel plots https://rkabacoff.github.io/datavis/Bivariate.html#categorical-vs.quantitative

Method choice

Choice of method in the argument depends on which correlation coefficient is to be used for the test.

  • Pearson: Two quantitative variables ** Assumptions: For the Pearson correlation, both variables should be normally distributed (normally distributed variables have a bell-shaped curve). Other assumptions include linearity and homoscedasticity. Linearity assumes a straight line relationship between each of the two variables and homoscedasticity assumes that data is equally distributed about the regression line.
  • Kendall: Kendall rank correlation is a non-parametric test that measures the strength of dependence between two variables
  • Spearman: The Spearman rank correlation test does not carry any assumptions about the distribution of the data and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal. For example Is there a statistically significant relationship between participants’ level of education (high school, bachelor’s, or graduate degree) and their starting salary?

**Assumptions of the Spearman correlation are that data must be at least ordinal and the scores on one variable must be monotonically related to the other variable.

cor.age.chronic <- cor.test(Mydata$chronic, Mydata$Age, method="pearson")
cor.age.chronic
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 35.421, df = 14279, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2690512 0.2992049
## sample estimates:
##       cor 
## 0.2841983

Problem 5: Run an Anova to compare the means for CHRONIC (Response) for the two groups of ECIGARETTE (Explanatory). Answer questions #7 and #8.

EC.aov <- aov(Mydata$chronic~ Mydata$Ecigarette, data=Mydata)
summary(EC.aov)
##                      Df Sum Sq Mean Sq F value   Pr(>F)    
## Mydata$Ecigarette     1     30  30.014   19.07 1.27e-05 ***
## Residuals         13777  21689   1.574                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1221 observations deleted due to missingness
#by(data$CHRONIC, data$Ecigarette, favstats, na.rm=T)

Problem 6: Run a Chi-Square to evaluation the association between ECIGARETTE (Explanatory) and DEPRESSION (Response). Answer questions #9 & #10.

ED.chisq <- chisq.test(Mydata$Depression, Mydata$Ecigarette)
ED.chisq
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  Mydata$Depression and Mydata$Ecigarette
## X-squared = 122.47, df = 1, p-value < 2.2e-16
prop.table(ED.chisq$observed, 2)
##                  Mydata$Ecigarette
## Mydata$Depression         1         2
##                 0 0.6186613 0.8181426
##                 1 0.3813387 0.1818574

Problem 7: Test the relationship examined in Step #6 for moderation using SEX as the third variable. Answer question #11.

by(Mydata, Mydata$SEX, function(x)
  list(
    chisq.test(Mydata$Depression, Mydata$Ecigarette),
    prop.table(chisq.test(Mydata$Depression, Mydata$Ecigarette)$observed, 2)
  ))
## Mydata$SEX: 1
## [[1]]
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  Mydata$Depression and Mydata$Ecigarette
## X-squared = 122.47, df = 1, p-value < 2.2e-16
## 
## 
## [[2]]
##                  Mydata$Ecigarette
## Mydata$Depression         1         2
##                 0 0.6186613 0.8181426
##                 1 0.3813387 0.1818574
## 
## ------------------------------------------------------------ 
## Mydata$SEX: 2
## [[1]]
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  Mydata$Depression and Mydata$Ecigarette
## X-squared = 122.47, df = 1, p-value < 2.2e-16
## 
## 
## [[2]]
##                  Mydata$Ecigarette
## Mydata$Depression         1         2
##                 0 0.6186613 0.8181426
##                 1 0.3813387 0.1818574

Problem 8: Create a new variable (HEALTH) by collapsing the variable you created in step 3 (CHRONIC) into two levels, no chronic health conditions and one or more chronic health conditions. Use a dummy code of 2 for “no chronic conditions” and a dummy code of 1 for “1 or more chronic health conditions.”

 Mydata$health <- NA   # Intitialising / Creating Storage Space

Mydata$health[Mydata$chronic >= 1] <- 1

Mydata$health[Mydata$chronic == 0] <- 0

summary(Mydata)
##      Stroke            Asthma           Cancer            Copd        
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.05343   Mean   :0.1441   Mean   :0.1047   Mean   :0.09997  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##  NA's   :47        NA's   :43       NA's   :38       NA's   :56       
##    Arthritis        Depression         Kidney           Diabetes     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.4049   Mean   :0.1864   Mean   :0.04404   Mean   :0.1614  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
##  NA's   :76       NA's   :61       NA's   :59        NA's   :410     
##       SEX            Heart             Age          Ecigarette   
##  Min.   :1.000   Min.   :0.0000   Min.   :18.00   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:46.00   1st Qu.:2.000  
##  Median :2.000   Median :0.0000   Median :60.00   Median :2.000  
##  Mean   :1.591   Mean   :0.1058   Mean   :57.01   Mean   :1.966  
##  3rd Qu.:2.000   3rd Qu.:0.0000   3rd Qu.:70.00   3rd Qu.:2.000  
##  Max.   :2.000   Max.   :1.0000   Max.   :80.00   Max.   :2.000  
##                  NA's   :156                      NA's   :529    
##     SEQNO              chronic          health      
##  Length:15000       Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :1.000   Median :1.0000  
##                     Mean   :1.181   Mean   :0.6285  
##                     3rd Qu.:2.000   3rd Qu.:1.0000  
##                     Max.   :8.000   Max.   :1.0000  
##                     NA's   :719     NA's   :719

Problem 9: Run two logistic regression analyses, first examining the association between ECIGARETTE (Explanatory) and HEALTH (Response) (MODEL1) and then examining the association between ECIGARETTE (Explanatory) and HEALTH (Response) controlling for SEX as a potential confounder. Answer questions #12 and #13.

m1 <- lm(Mydata$health ~ Mydata$Ecigarette)
summary(m1)
## 
## Call:
## lm(formula = Mydata$health ~ Mydata$Ecigarette)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6859 -0.6308  0.3692  0.3692  0.3692 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.74096    0.04476  16.555   <2e-16 ***
## Mydata$Ecigarette -0.05507    0.02267  -2.429   0.0151 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.482 on 13777 degrees of freedom
##   (1221 observations deleted due to missingness)
## Multiple R-squared:  0.0004281,  Adjusted R-squared:  0.0003556 
## F-statistic: 5.901 on 1 and 13777 DF,  p-value: 0.01515

Problem 10: Generate a frequency distribution and descriptive statistics for the variable CHRONIC. Answer questions #14 and 15.

# Logistic Regression for ECIGARETTE + SEX -> HEALTH
m2 <- lm(Mydata$health ~ Mydata$Ecigarette + Mydata$SEX)
summary(m2)
## 
## Call:
## lm(formula = Mydata$health ~ Mydata$Ecigarette + Mydata$SEX)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7266 -0.5772  0.3322  0.3322  0.4229 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.604149   0.046303  13.048  < 2e-16 ***
## Mydata$Ecigarette -0.058802   0.022575  -2.605  0.00921 ** 
## Mydata$SEX         0.090603   0.008318  10.893  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.48 on 13776 degrees of freedom
##   (1221 observations deleted due to missingness)
## Multiple R-squared:  0.008964,   Adjusted R-squared:  0.00882 
## F-statistic:  62.3 on 2 and 13776 DF,  p-value: < 2.2e-16

Note: Each question is worth 2 points. An error free program is worth another 8 points