SD2: Chi-Squared Test for Independence

Research Question / Hypothesis

I hypothesize that mental_health will not differ by smoking_history.

Package Loading, Data Import, Data Prep

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

data <- read_csv("C:/Users/jammi/Downloads/SD2 Data(1).csv")

## Parsed with column specification:
## cols(
##   sex = col_character(),
##   race = col_character(),
##   marital_status = col_character(),
##   poverty_status = col_character(),
##   age_range = col_character(),
##   health = col_character(),
##   bmi_category = col_character(),
##   mental_health = col_character(),
##   heart_attack_history = col_character(),
##   heart_condition_history = col_character(),
##   cancer_history = col_character(),
##   prediabetes_history = col_character(),
##   asthma_history = col_character(),
##   hypertension_history = col_character(),
##   smoking_history = col_character(),
##   birthcontrol_status = col_logical()
## )

head(data)

## # A tibble: 6 x 16
##   sex   race  marital_status poverty_status age_range health bmi_category
##   <chr> <chr> <chr>          <chr>          <chr>     <chr>  <chr>       
## 1 fema~ White Never Married  above poverty  18-29     Excel~ Normal      
## 2 fema~ White Married        above poverty  50-59     Very ~ Normal      
## 3 male  White DivorcedOrSep~ above poverty  60-69     Good   Normal      
## 4 male  White DivorcedOrSep~ above poverty  50-59     Fair   Obese       
## 5 fema~ White Married        above poverty  30-39     Very ~ Normal      
## 6 male  White Never Married  above poverty  18-29     Excel~ Normal      
## # ... with 9 more variables: mental_health <chr>, heart_attack_history <chr>,
## #   heart_condition_history <chr>, cancer_history <chr>,
## #   prediabetes_history <chr>, asthma_history <chr>,
## #   hypertension_history <chr>, smoking_history <chr>,
## #   birthcontrol_status <lgl>

Data Summary

[Independent Variable Response Distribution]

table(data$smoking_history)%>%
  prop.table()%>%
  round(2)

## 
##  No Yes 
## 0.6 0.4

[Dependent Variable Response Distribution]

table(data$mental_health)%>%
  prop.table()%>%
  round(2)

## 
##                 Low Risk Moderate Mental Distress   Serious Mental Illness 
##                     0.80                     0.16                     0.03

[Expected Crosstab Distribution]

table(data$smoking_history,data$mental_health)%>%
  prop.table()

##      
##         Low Risk Moderate Mental Distress Serious Mental Illness
##   No  0.50160014               0.08226560             0.01301213
##   Yes 0.30301077               0.07962800             0.02048336
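
Note that `table()` followed by `prop.table()` on the two variables returns the observed joint proportions. One way to get the proportions expected under the null hypothesis of independence is to multiply the row and column marginal proportions. The sketch below is an illustration under assumptions: it rebuilds the table from the cell counts reported in the grouped summary later in this report instead of reading the CSV, and the names `counts`, `observed`, and `expected` are illustrative, not part of the original analysis.

```r
# Cell counts copied from the grouped summary in this report
# (assumption: these six counts cover all respondents).
counts <- matrix(
  c(128367, 21053, 3330,
     77545, 20378, 5242),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    smoking_history = c("No", "Yes"),
    mental_health = c("Low Risk", "Moderate Mental Distress",
                      "Serious Mental Illness")
  )
)

# Observed joint proportions (each cell divided by the grand total)
observed <- prop.table(counts)

# Expected joint proportions under independence:
# row marginal proportion times column marginal proportion
expected <- outer(rowSums(observed), colSums(observed))

round(expected, 2)
round(observed - expected, 3)  # how far each cell is from independence
```

`chisq.test(counts)$expected` returns the same quantities expressed as counts instead of proportions.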

[Observed Crosstab Distribution]

table(data$smoking_history,data$mental_health)%>%
  prop.table()%>%
  round(2)

##      
##       Low Risk Moderate Mental Distress Serious Mental Illness
##   No      0.50                     0.08                   0.01
##   Yes     0.30                     0.08                   0.02

The table above shows the observed percentage of all respondents in each category combination (a crosstab of table percentages). About 50% of respondents have no smoking history and are at low risk, and about 30% have a smoking history and are at low risk. About 8% of respondents fall into moderate mental distress in each smoking group. Serious mental illness accounts for about 1% of respondents among those with no smoking history and about 2% among those with a smoking history. At this level of rounding, these values do not look very different from what the null hypothesis would predict.

Data Analysis

Relationship of Interest: Crosstab showing [Column%] [Row%]

data%>%
  group_by(smoking_history,mental_health)%>%
  summarize(n=n())%>%
  mutate(percent=n/sum(n))

## `summarise()` regrouping output by 'smoking_history' (override with `.groups` argument)

## # A tibble: 6 x 4
## # Groups:   smoking_history [2]
##   smoking_history mental_health                 n percent
##   <chr>           <chr>                     <int>   <dbl>
## 1 No              Low Risk                 128367  0.840 
## 2 No              Moderate Mental Distress  21053  0.138 
## 3 No              Serious Mental Illness     3330  0.0218
## 4 Yes             Low Risk                  77545  0.752 
## 5 Yes             Moderate Mental Distress  20378  0.198 
## 6 Yes             Serious Mental Illness     5242  0.0508
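
The grouped summary above gives row percentages: each count is divided by the total for its smoking_history group. A base R sketch of both row and column percentages follows, assuming the counts printed above; the object names `counts`, `row_pct`, and `col_pct` are illustrative.

```r
# Cell counts copied from the grouped summary above
counts <- matrix(
  c(128367, 21053, 3330,
     77545, 20378, 5242),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    smoking_history = c("No", "Yes"),
    mental_health = c("Low Risk", "Moderate Mental Distress",
                      "Serious Mental Illness")
  )
)

# Row percentages: share of each mental_health level within a smoking group
row_pct <- prop.table(counts, margin = 1)

# Column percentages: share of each smoking group within a mental_health level
col_pct <- prop.table(counts, margin = 2)

round(row_pct, 3)
round(col_pct, 3)
```

Row percentages answer "how is mental_health distributed within each smoking group", which is the comparison the hypothesis is about, so they are the more natural choice here.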

Relationship of Interest: [Visualization]

data%>%
  group_by(smoking_history,mental_health)%>%
  summarize(n=n())%>%
  mutate(percent=n/sum(n))%>%
  ggplot()+
  geom_col(aes(x=smoking_history,y=percent, fill=mental_health))

## `summarise()` regrouping output by 'smoking_history' (override with `.groups` argument)

[Interpretive Writing]

Among respondents with no smoking history, 84.0% are at low risk, 13.8% report moderate mental distress, and 2.2% have a serious mental illness. Among respondents with a smoking history, the corresponding shares are 75.2%, 19.8%, and 5.1%. Respondents with a smoking history are therefore less often in the low-risk group and more often in both distress categories.

Chi-Squared Test

chisq.test(data$smoking_history,data$mental_health)

## 
##  Pearson's Chi-squared test
## 
## data:  data$smoking_history and data$mental_health
## X-squared = 3505.3, df = 2, p-value < 2.2e-16

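There is a statistically significant relationship between smoking_history and mental_health (X-squared = 3505.3, df = 2, p-value < 2.2e-16), so the hypothesis that mental_health does not differ by smoking_history is not supported.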
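
To make the test less of a black box, the statistic can be rebuilt by hand as the sum of (observed - expected)^2 / expected over all cells, with (rows - 1) * (columns - 1) degrees of freedom. This is a minimal sketch, assuming the cell counts from the grouped summary earlier in this report; the names `counts`, `expected`, `statistic`, `dof`, and `p_value` are illustrative.

```r
# Cell counts copied from the grouped summary in this report
counts <- matrix(
  c(128367, 21053, 3330,
     77545, 20378, 5242),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    smoking_history = c("No", "Yes"),
    mental_health = c("Low Risk", "Moderate Mental Distress",
                      "Serious Mental Illness")
  )
)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson chi-squared statistic
statistic <- sum((counts - expected)^2 / expected)

# Degrees of freedom for a 2 x 3 table: (2 - 1) * (3 - 1) = 2
dof <- (nrow(counts) - 1) * (ncol(counts) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(statistic, df = dof, lower.tail = FALSE)

round(statistic, 1)  # should agree with the X-squared value reported above
dof
p_value
```

With more than 250,000 respondents, even modest differences in percentages produce a very large statistic, so the row percentages are worth reading alongside the p-value.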