The chi-squared test of independence is a method appropriate for testing independence between two categorical variables.
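
For reference, the statistic behind the test compares the observed counts with the counts expected if the two variables were independent. The symbols below are notation introduced here for illustration: \(O_{ij}\) is the observed count in row \(i\) and column \(j\), \(R_i\) and \(C_j\) are the row and column totals, and \(n\) is the grand total.

\[
E_{ij} = \frac{R_i \, C_j}{n}, \qquad \chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]

For a table with \(r\) rows and \(c\) columns, the statistic is compared against a chi-squared distribution with \((r-1)(c-1)\) degrees of freedom.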

Let’s first import and attach the dataset required for the analysis :

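# stringsAsFactors = TRUE keeps Gender and Smoke as factors on R >= 4.0,
# where read.table() no longer converts strings to factors by default
LungCapData = read.table(file = "../Dataset/LungCapData.txt", header = TRUE, sep = "\t", stringsAsFactors = TRUE)
attach(LungCapData)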

Let’s test the independence between the “Gender” variable and the “Smoke” variable.

Before jumping into the test, let’s examine the dataset first.

The classes of the variables being compared are :

class(Gender)
## [1] "factor"
class(Smoke)
## [1] "factor"

The levels of each variable are :

levels(Gender)
## [1] "female" "male"
levels(Smoke)
## [1] "no"  "yes"

If we want brief information about the chi-squared test function, we can simply run either of the following commands to open its help page :

help("chisq.test")
#or,
?chisq.test

Now, let’s create a contingency table to perform the test :

Tab = table(Gender,Smoke)
Tab
##         Smoke
## Gender    no yes
##   female 314  44
##   male   334  33
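
Raw counts are hard to compare because the two gender groups differ slightly in size. As a minimal sketch, the row proportions can be computed with `prop.table()`. The table is rebuilt here from the counts printed above so that the snippet runs without the data file, and `RowProp` is a name chosen only for this illustration.

```r
# Rebuild the contingency table from the counts printed above,
# so the snippet runs without the original data file.
Tab = as.table(matrix(c(314, 334, 44, 33), nrow = 2,
                      dimnames = list(Gender = c("female", "male"),
                                      Smoke  = c("no", "yes"))))

# Row proportions: share of non-smokers and smokers within each gender
RowProp = prop.table(Tab, margin = 1)
round(RowProp, 4)

# Marginal totals show the group sizes behind those proportions
addmargins(Tab)
```

Each row of `RowProp` sums to 1, so the “yes” column can be read directly as the proportion of smokers within each gender.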

Let’s visualize the data :

barplot(Tab, beside = TRUE, legend.text = TRUE)

From the above chart, it seems that the smoking group has more females than males and the non-smoking group has more males than females.

So, there might be some relation between the variables.

Let’s perform the chi-squared test to validate our hypothesis.

\(H_0\) : The variables are independent, i.e., there is no association between the variables

\(H_A\) : The variables are dependent, i.e., there is an association between the variables

Now, let’s perform the chi-squared test :

# correct = TRUE applies Yates' continuity correction (used only for 2x2 tables)
Chi = chisq.test(Tab, correct = TRUE)
Chi
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  Tab
## X-squared = 1.7443, df = 1, p-value = 0.1866

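As the p-value of this test (\(0.1866\), about \(18.7\%\)) is well above the usual \(5\%\) significance level, we fail to reject the null hypothesis.

So, the chi-squared test gives no significant evidence of an association between the variables. Note that failing to reject \(H_0\) does not prove that the variables are independent.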
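
To see where the reported statistic comes from, it can be reproduced by hand from the four cell counts. This is a sketch under the assumption that the counts are exactly those in the contingency table above. The names `Obs`, `Exp`, `ChiYates` and `ChiPlain` are introduced only for this illustration.

```r
# Observed counts from the contingency table above
Obs = matrix(c(314, 334, 44, 33), nrow = 2,
             dimnames = list(Gender = c("female", "male"),
                             Smoke  = c("no", "yes")))

# Expected counts under independence: (row total * column total) / n
Exp = outer(rowSums(Obs), colSums(Obs)) / sum(Obs)

# Yates' correction subtracts 0.5 from each |observed - expected|
ChiYates = sum((abs(Obs - Exp) - 0.5)^2 / Exp)

# The same statistic without the continuity correction
ChiPlain = sum((Obs - Exp)^2 / Exp)

# Upper-tail p-values on 1 degree of freedom
PYates = pchisq(ChiYates, df = 1, lower.tail = FALSE)
PPlain = pchisq(ChiPlain, df = 1, lower.tail = FALSE)

round(c(ChiYates = ChiYates, PYates = PYates,
        ChiPlain = ChiPlain, PPlain = PPlain), 4)
```

The corrected value should agree with the `X-squared` reported by `chisq.test()`, while the uncorrected value is the one that corresponds to the `chi2(1) = 2.077` line printed by `epi.2by2()` further down.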

Let’s see what attributes are stored :

attributes(Chi)
## $names
## [1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed" 
## [7] "expected"  "residuals" "stdres"   
## 
## $class
## [1] "htest"

Now, let’s see the expected table of our chi-squared test :

Chi$expected
##         Smoke
## Gender         no      yes
##   female 319.9779 38.02207
##   male   328.0221 38.97793
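
Each expected count is the product of the corresponding row and column totals divided by the grand total. As a worked example, using the totals from the contingency table (358 females, 648 non-smokers, 77 smokers, 725 people overall):

\[
E_{\text{female},\,\text{no}} = \frac{358 \times 648}{725} \approx 319.98, \qquad
E_{\text{female},\,\text{yes}} = \frac{358 \times 77}{725} \approx 38.02
\]

These match the first row of the expected table above. All four expected counts are well above 5, the usual rule of thumb for the chi-squared approximation to be reasonable.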

Measures of Association

The various measures of association are :

  1. Relative Risk (RR)
  2. Odds Ratio (OR)
  3. Attributable Risk/Risk Difference (RD)

The chi-squared test found no significant association between the variables, but it doesn’t give us any idea about the direction or strength of the association.

All of the above are measures of the direction and strength of the association between two categorical variables.
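
As a reminder of how these measures are usually defined, consider a \(2 \times 2\) table with counts \(a\) (exposed, outcome present), \(b\) (exposed, outcome absent), \(c\) (unexposed, outcome present) and \(d\) (unexposed, outcome absent). The letters \(a\) to \(d\) are notation introduced here for illustration.

\[
RR = \frac{a/(a+b)}{c/(c+d)}, \qquad
OR = \frac{a/b}{c/d} = \frac{ad}{bc}, \qquad
RD = \frac{a}{a+b} - \frac{c}{c+d}
\]

\(RR = 1\), \(OR = 1\) and \(RD = 0\) all correspond to no association between exposure and outcome.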

To find our RR, OR & RD, we need an additional package.

Let’s load the package :

library(epiR)
## Warning: package 'epiR' was built under R version 3.6.3
## Loading required package: survival
## Package epiR 1.0-14 is loaded
## Type help(epi.about) for summary information
## Type browseVignettes(package = 'epiR') to learn how to use epiR for applied epidemiological analyses
## 

To get any help on this package :

help(package=epiR)

To calculate the various measures, we pass our contingency table to the following function :

epi.2by2(Tab, method = "cohort.count", conf.level = 0.95)
##              Outcome +    Outcome -      Total        Inc risk *        Odds
## Exposed +          314           44        358              87.7        7.14
## Exposed -          334           33        367              91.0       10.12
## Total              648           77        725              89.4        8.42
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio                               0.96 (0.92, 1.01)
## Odds ratio                                   0.71 (0.44, 1.14)
## Attrib risk *                                -3.30 (-7.79, 1.19)
## Attrib risk in population *                  -1.63 (-5.32, 2.06)
## Attrib fraction in exposed (%)               -3.76 (-9.12, 1.34)
## Attrib fraction in population (%)            -1.82 (-4.34, 0.64)
## -------------------------------------------------------------------
##  Test that OR = 1: chi2(1) = 2.077 Pr>chi2 = 0.15
##  Wald confidence limits
##  CI: confidence interval
##  * Outcomes per 100 population units
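
The headline figures in this output can be checked by hand. The sketch below assumes the table is read exactly as printed, with females as the exposed group and “no” (not smoking) as the positive outcome. The variable names are chosen only for this illustration.

```r
# Cell counts as read by epi.2by2: "no" (not smoking) is Outcome +
FemaleNo = 314; FemaleYes = 44
MaleNo   = 334; MaleYes   = 33

# Incidence "risk" of the outcome (not smoking) per 100, by gender
RiskFemale = 100 * FemaleNo / (FemaleNo + FemaleYes)
RiskMale   = 100 * MaleNo   / (MaleNo + MaleYes)

# Risk ratio and risk difference (attributable risk per 100)
RR = RiskFemale / RiskMale
RD = RiskFemale - RiskMale

round(c(RiskFemale = RiskFemale, RiskMale = RiskMale, RR = RR, RD = RD), 2)
```

The results should line up with the `Inc risk`, `Inc risk ratio` and `Attrib risk` entries in the output above.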

Because “no” is the first column of our table, the function treats not smoking as the outcome. Looking at the odds ratio, we can say that the odds of a female not smoking are \(0.71\) times the odds of a male not smoking.

So, if we take the inverse of this :

1/0.71
## [1] 1.408451

So, we can say that the odds of a male not smoking are about \(1.4\) times the odds of a female not smoking.
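
Inverting the rounded value \(0.71\) loses a little precision. As a sketch, the odds ratio and its inverse can be computed directly from the cell counts. The table is rebuilt here from the printed counts so the snippet is self-contained, and the names `OddsFemale`, `OddsMale` and `OR` are introduced only for this illustration.

```r
# Counts from the contingency table: rows = gender, columns = smoking status
Tab = matrix(c(314, 334, 44, 33), nrow = 2,
             dimnames = list(Gender = c("female", "male"),
                             Smoke  = c("no", "yes")))

# Odds of not smoking within each gender
OddsFemale = Tab["female", "no"] / Tab["female", "yes"]
OddsMale   = Tab["male", "no"]   / Tab["male", "yes"]

# Odds ratio (female vs male) and its reciprocal (male vs female)
OR = OddsFemale / OddsMale
round(c(OR = OR, Inverse = 1 / OR), 4)
```

Working from the unrounded odds ratio gives an inverse of about \(1.42\) rather than \(1.41\), which is the value to expect when the table is rearranged.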

It’s better to arrange our contingency table in the standard layout, i.e., with the outcome of interest (“Yes”, smoking) in the first column :

# Counts entered row by row: female (yes, no), then male (yes, no)
elements = c(44, 314, 33, 334)

ContTab = matrix(elements, nrow = 2, ncol = 2, byrow = TRUE)
rownames(ContTab) = c("Female", "Male")
colnames(ContTab) = c("Yes", "No")

ContTab
##        Yes  No
## Female  44 314
## Male    33 334
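
Typing the counts by hand is error-prone. One possible alternative, shown here only as a sketch, is to reorder the columns of the existing table by name. The table is rebuilt from the printed counts so the snippet is self-contained, and `ContTab2` is a hypothetical name used only for this illustration.

```r
# Rebuild the original table (in the tutorial, Tab already exists)
Tab = as.table(matrix(c(314, 334, 44, 33), nrow = 2,
                      dimnames = list(Gender = c("female", "male"),
                                      Smoke  = c("no", "yes"))))

# Put "yes" first so that smoking becomes the first column
ContTab2 = Tab[c("female", "male"), c("yes", "no")]
ContTab2
```

Selecting rows and columns by name keeps the counts tied to their labels, so the rearranged table cannot drift out of sync with the data.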

Now, let’s find out the different measures of association :

epi.2by2(ContTab, method = "cohort.count")
##              Outcome +    Outcome -      Total        Inc risk *        Odds
## Exposed +           44          314        358             12.29      0.1401
## Exposed -           33          334        367              8.99      0.0988
## Total               77          648        725             10.62      0.1188
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio                               1.37 (0.89, 2.10)
## Odds ratio                                   1.42 (0.88, 2.28)
## Attrib risk *                                3.30 (-1.19, 7.79)
## Attrib risk in population *                  1.63 (-2.06, 5.32)
## Attrib fraction in exposed (%)               26.84 (-12.15, 52.28)
## Attrib fraction in population (%)            15.34 (-8.10, 33.69)
## -------------------------------------------------------------------
##  Test that OR = 1: chi2(1) = 2.077 Pr>chi2 = 0.15
##  Wald confidence limits
##  CI: confidence interval
##  * Outcomes per 100 population units

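From the above analysis we can see that the \(95\%\) confidence interval of the odds ratio, \((0.88, 2.28)\), contains \(1\). This indicates that the odds ratio is not significantly different from \(1\).

Still, from here we can directly interpret the point estimate: the odds of a female smoking are \(1.42\) times the odds of a male smoking or, equivalently, the odds of a male not smoking are \(1.42\) times those of a female not smoking.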
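
The output labels its intervals as Wald confidence limits. As a minimal sketch of how such an interval for the odds ratio is commonly computed, the code below works on the log scale using the four cell counts of the rearranged table. The variable names are chosen only for this illustration, and the exact method used inside `epi.2by2()` is assumed rather than confirmed here.

```r
# Cell counts from the rearranged table: rows Female/Male, columns Yes/No
FemaleYes = 44; FemaleNo = 314
MaleYes   = 33; MaleNo   = 334

# Odds ratio of smoking, female vs male
OR = (FemaleYes / FemaleNo) / (MaleYes / MaleNo)

# Standard error of log(OR) and 95% Wald limits, back-transformed
SE = sqrt(1 / FemaleYes + 1 / FemaleNo + 1 / MaleYes + 1 / MaleNo)
z  = qnorm(0.975)
CI = exp(log(OR) + c(-1, 1) * z * SE)

c(OR = OR, Lower = CI[1], Upper = CI[2])

# TRUE when the interval contains 1, i.e. not significant at the 5% level
CI[1] < 1 & 1 < CI[2]
```

The limits should come out close to the interval reported for the odds ratio above, and because that interval straddles \(1\), the data are compatible with no difference in odds between the genders.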