The data set was collected via health-related telephone surveys on health-related risk behaviors, chronic health conditions, and use of preventive services from the noninstitutionalized adult population 18 years of age and older in the US during 2016 calendar year. From the study “The relationship between e-cigarette use and chronic health conditions” can be examined.
Below is a description of variables and their possible values. * SEQNO [unique identifier] * ECIGARETTE [E-cigarette user status.] 1 = Currently using e-cigarettes, 2 = Not current e-cigarette user
KIDNEY [Health professional ever told you that you have kidney disease?] 0 = No, 1 = Yes
STROKE [Health professional ever told you that you had a stroke?] 0 = No, 1 = Yes
CANCER [Health professional ever told you that you had any other types of cancer besides skin cancer?] 0 = No, 1 = Yes
COPD [Health professional ever told you that you had COPD, emphysema, or chronic bronchitis?] 0 = No, 1 = Yes
HEART [Health professional ever told you that you had heart attack/myocardial infraction?] 0 = No, 1 = Yes
DIABETES [Health professional ever told you that you had diabetes?] 0 = No, 1 = Yes
ASTHMA [Health professional ever told you that you had asthma?] 0 = No, 1 = Yes
ARTHRITIS [Health professional ever told you that you had arthritis?] 0 = No, 1 = Yes
DEPRESSION [Health professional ever told you that you had a depressive disorder?] 0 = No, 1 = Yes
SEX [Indicate sex of participant] 1 = Male, 2 = Female
AGE [Age of participant.] 18-80 (A persons age reported in years)
Mydata <- read_sav("ECigSPSS.sav")
#View(Mydata)
Mydata$chronic<-NA # initializing chronic
Mydata$health<-NA # initializing chronic
head(Mydata)
## # A tibble: 6 x 15
## Stroke Asthma Cancer Copd Arthritis Depression Kidney Diabetes SEX Heart
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0 0 0 0 0 1 0
## 2 0 0 0 0 0 1 0 0 2 0
## 3 0 0 0 0 1 0 0 0 2 1
## 4 0 0 0 0 1 0 0 1 1 0
## 5 0 1 0 0 0 0 0 0 1 0
## 6 0 0 0 0 0 0 0 0 2 0
## # … with 5 more variables: Age <dbl>, Ecigarette <dbl>, SEQNO <chr>,
## # chronic <lgl>, health <lgl>
str(Mydata)
## tibble [15,000 × 15] (S3: tbl_df/tbl/data.frame)
## $ Stroke : num [1:15000] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "label")= chr "EVER DIAGNOSED WITH A STROKE"
## ..- attr(*, "format.spss")= chr "F1.0"
## $ Asthma : num [1:15000] 0 0 0 0 1 0 0 1 0 0 ...
## ..- attr(*, "label")= chr "EVER TOLD HAD ASTHMA"
## ..- attr(*, "format.spss")= chr "F1.0"
## $ Cancer : num [1:15000] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "label")= chr "(EVER TOLD) YOU HAD ANY OTHER TYPES OF C"
## ..- attr(*, "format.spss")= chr "F1.0"
## $ Copd : num [1:15000] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "label")= chr "(EVER TOLD) YOU HAVE (COPD) CHRONIC OBST"
## ..- attr(*, "format.spss")= chr "F1.0"
## ..- attr(*, "display_width")= int 6
## $ Arthritis : num [1:15000] 0 0 1 1 0 0 1 1 0 0 ...
## ..- attr(*, "label")= chr "TOLD HAVE ARTHRITIS"
## ..- attr(*, "format.spss")= chr "F1.0"
## ..- attr(*, "display_width")= int 11
## $ Depression: num [1:15000] 0 1 0 0 0 0 0 1 0 0 ...
## ..- attr(*, "label")= chr "EVER TOLD YOU HAD A DEPRESSIVE DISORDER"
## ..- attr(*, "format.spss")= chr "F1.0"
## ..- attr(*, "display_width")= int 12
## $ Kidney : num [1:15000] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "label")= chr "(EVER TOLD) YOU HAVE KIDNEY DISEASE?"
## ..- attr(*, "format.spss")= chr "F1.0"
## $ Diabetes : num [1:15000] 0 0 0 1 0 0 0 1 0 0 ...
## ..- attr(*, "label")= chr "(EVER TOLD) YOU HAVE DIABETES"
## ..- attr(*, "format.spss")= chr "F1.0"
## ..- attr(*, "display_width")= int 10
## $ SEX : num [1:15000] 1 2 2 1 1 2 2 2 2 1 ...
## ..- attr(*, "label")= chr "RESPONDENTS SEX"
## ..- attr(*, "format.spss")= chr "F1.0"
## ..- attr(*, "display_width")= int 5
## $ Heart : num [1:15000] 0 0 1 0 0 0 0 0 0 0 ...
## ..- attr(*, "label")= chr "RESPONDENTS THAT HAVE EVER REPORTED HAVI"
## ..- attr(*, "format.spss")= chr "F1.0"
## ..- attr(*, "display_width")= int 7
## $ Age : num [1:15000] 43 59 80 70 18 65 74 76 43 56 ...
## ..- attr(*, "label")= chr "IMPUTED AGE VALUE COLLAPSED ABOVE 80"
## ..- attr(*, "format.spss")= chr "F2.0"
## ..- attr(*, "display_width")= int 5
## $ Ecigarette: num [1:15000] 2 2 2 2 2 2 2 2 2 2 ...
## ..- attr(*, "label")= chr "CURRENT E-CIGARETTE USER CALCULATED VARI"
## ..- attr(*, "format.spss")= chr "F1.0"
## ..- attr(*, "display_width")= int 12
## $ SEQNO : chr [1:15000] "2016000001" "2016000002" "2016000003" "2016000004" ...
## ..- attr(*, "label")= chr "ANNUAL SEQUENCE NUMBER"
## ..- attr(*, "format.spss")= chr "A30"
## ..- attr(*, "display_width")= int 26
## $ chronic : logi [1:15000] NA NA NA NA NA NA ...
## $ health : logi [1:15000] NA NA NA NA NA NA ...
Note that I am using different table styles for printing output which is tabular, using kable and kableextra packages. You can read up more on it https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
Dep.count<-tally(Mydata$Depression)
ECig.Count<-tally(Mydata$Ecigarette)
ECig.Smoker<-tally(Mydata$Depression~Mydata$Ecigarette)
Dep.count %>%
kbl(caption = "Frequency Table for Depression") %>%
kable_classic(full_width = F, html_font = "Cambria")
X | Freq |
---|---|
0 | 12154 |
1 | 2785 |
NA | 61 |
ECig.Count %>%
kbl(caption = "Frequency Table for ECigarette") %>%
kable_paper("hover", full_width = F)
X | Freq |
---|---|
1 | 497 |
2 | 13974 |
NA | 529 |
ECig.Smoker %>%
kbl(caption="ECigarette vs Depression") %>%
kable_material(c("striped", "hover"))
1 | 2 | NA | |
---|---|---|---|
0 | 305 | 11391 | 458 |
1 | 188 | 2532 | 65 |
NA | 4 | 51 | 6 |
Mydata$chronic<-NA # initializing chronic
Mydata$chronic <- Mydata$Stroke + Mydata$Asthma + Mydata$Cancer + Mydata$Copd + Mydata$Arthritis + Mydata$Kidney + Mydata$Diabetes + Mydata$Depression
One limitation of such plots is that they do not display the distribution of the data - only the summary statistic for each group. The plots below correct this limitation to some extent. If you want to alleviate the issue, you can read up more and fit kernel plots https://rkabacoff.github.io/datavis/Bivariate.html#categorical-vs.quantitative
Choice of method in the argument depends on which correlation coefficient is to be used for the test.
**Assumptions of the Spearman correlation are that data must be at least ordinal and the scores on one variable must be monotonically related to the other variable.
cor.age.chronic <- cor.test(Mydata$chronic, Mydata$Age, method="pearson")
cor.age.chronic
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 35.421, df = 14279, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2690512 0.2992049
## sample estimates:
## cor
## 0.2841983
EC.aov <- aov(Mydata$chronic~ Mydata$Ecigarette, data=Mydata)
summary(EC.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Mydata$Ecigarette 1 30 30.014 19.07 1.27e-05 ***
## Residuals 13777 21689 1.574
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1221 observations deleted due to missingness
#by(data$CHRONIC, data$Ecigarette, favstats, na.rm=T)
ED.chisq <- chisq.test(Mydata$Depression, Mydata$Ecigarette)
ED.chisq
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: Mydata$Depression and Mydata$Ecigarette
## X-squared = 122.47, df = 1, p-value < 2.2e-16
prop.table(ED.chisq$observed, 2)
## Mydata$Ecigarette
## Mydata$Depression 1 2
## 0 0.6186613 0.8181426
## 1 0.3813387 0.1818574
by(Mydata, Mydata$SEX, function(x)
list(
chisq.test(Mydata$Depression, Mydata$Ecigarette),
prop.table(chisq.test(Mydata$Depression, Mydata$Ecigarette)$observed, 2)
))
## Mydata$SEX: 1
## [[1]]
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: Mydata$Depression and Mydata$Ecigarette
## X-squared = 122.47, df = 1, p-value < 2.2e-16
##
##
## [[2]]
## Mydata$Ecigarette
## Mydata$Depression 1 2
## 0 0.6186613 0.8181426
## 1 0.3813387 0.1818574
##
## ------------------------------------------------------------
## Mydata$SEX: 2
## [[1]]
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: Mydata$Depression and Mydata$Ecigarette
## X-squared = 122.47, df = 1, p-value < 2.2e-16
##
##
## [[2]]
## Mydata$Ecigarette
## Mydata$Depression 1 2
## 0 0.6186613 0.8181426
## 1 0.3813387 0.1818574
Mydata$health <- NA # Intitialising / Creating Storage Space
Mydata$health[Mydata$chronic >= 1] <- 1
Mydata$health[Mydata$chronic == 0] <- 0
summary(Mydata)
## Stroke Asthma Cancer Copd
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.05343 Mean :0.1441 Mean :0.1047 Mean :0.09997
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## NA's :47 NA's :43 NA's :38 NA's :56
## Arthritis Depression Kidney Diabetes
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.4049 Mean :0.1864 Mean :0.04404 Mean :0.1614
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## NA's :76 NA's :61 NA's :59 NA's :410
## SEX Heart Age Ecigarette
## Min. :1.000 Min. :0.0000 Min. :18.00 Min. :1.000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:46.00 1st Qu.:2.000
## Median :2.000 Median :0.0000 Median :60.00 Median :2.000
## Mean :1.591 Mean :0.1058 Mean :57.01 Mean :1.966
## 3rd Qu.:2.000 3rd Qu.:0.0000 3rd Qu.:70.00 3rd Qu.:2.000
## Max. :2.000 Max. :1.0000 Max. :80.00 Max. :2.000
## NA's :156 NA's :529
## SEQNO chronic health
## Length:15000 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :1.000 Median :1.0000
## Mean :1.181 Mean :0.6285
## 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :8.000 Max. :1.0000
## NA's :719 NA's :719
m1 <- lm(Mydata$health ~ Mydata$Ecigarette)
summary(m1)
##
## Call:
## lm(formula = Mydata$health ~ Mydata$Ecigarette)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6859 -0.6308 0.3692 0.3692 0.3692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.74096 0.04476 16.555 <2e-16 ***
## Mydata$Ecigarette -0.05507 0.02267 -2.429 0.0151 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.482 on 13777 degrees of freedom
## (1221 observations deleted due to missingness)
## Multiple R-squared: 0.0004281, Adjusted R-squared: 0.0003556
## F-statistic: 5.901 on 1 and 13777 DF, p-value: 0.01515
# Logistic Regression for ECIGARETTE + SEX -> HEALTH
m2 <- lm(Mydata$health ~ Mydata$Ecigarette + Mydata$SEX)
summary(m2)
##
## Call:
## lm(formula = Mydata$health ~ Mydata$Ecigarette + Mydata$SEX)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7266 -0.5772 0.3322 0.3322 0.4229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.604149 0.046303 13.048 < 2e-16 ***
## Mydata$Ecigarette -0.058802 0.022575 -2.605 0.00921 **
## Mydata$SEX 0.090603 0.008318 10.893 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.48 on 13776 degrees of freedom
## (1221 observations deleted due to missingness)
## Multiple R-squared: 0.008964, Adjusted R-squared: 0.00882
## F-statistic: 62.3 on 2 and 13776 DF, p-value: < 2.2e-16