The chi-squared test of independence is a method appropriate for testing independence between two categorical variables.
Let’s first import and attach the required dataset for analysis:
LungCapData = read.table(file = "../Dataset/LungCapData.txt", header = TRUE, sep = "\t")
attach(LungCapData)
Let’s test the independence between the “Gender” and “Smoke” variables.
Before jumping into the test, let’s examine the dataset first.
The classes of the variables being compared are:
class(Gender)
## [1] "factor"
class(Smoke)
## [1] "factor"
The levels of each variable are:
levels(Gender)
## [1] "female" "male"
levels(Smoke)
## [1] "no" "yes"
If we want brief information about the chi-squared test, we can simply run either of the following commands to open its help page:
help("chisq.test")
#or,
?chisq.test
Now, let’s create a contingency table to perform the test:
Tab = table(Gender,Smoke)
Tab
##         Smoke
## Gender    no yes
##   female 314  44
##   male   334  33
Let’s visualize the data :
barplot(Tab, beside = TRUE, legend=TRUE)
From the above chart, it seems that the smoking group has more females than males and the non-smoking group has more males than females.
So, there might be some relation between the variables.
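The impression from the chart can also be checked numerically with row and column proportions. The following is a minimal sketch that rebuilds the table from the counts printed above instead of reading the data file; `prop.table` is base R, and the object names `within_smoke` and `within_gender` are illustrative only.

```r
# Rebuild the contingency table from the counts printed above
Tab <- as.table(matrix(c(314, 44, 334, 33), nrow = 2, byrow = TRUE,
                       dimnames = list(Gender = c("female", "male"),
                                       Smoke  = c("no", "yes"))))

# Share of each gender within each smoking group (columns sum to 1)
within_smoke <- prop.table(Tab, margin = 2)
round(within_smoke, 3)

# Share of smokers and non-smokers within each gender (rows sum to 1)
within_gender <- prop.table(Tab, margin = 1)
round(within_gender, 3)
```

With these counts, about 12.3% of females and 9.0% of males smoke, a gap of roughly 3 percentage points.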
Let’s perform the chi-squared test to check whether this apparent difference is statistically significant.
\(H_0\) : The variables are independent, i.e., there is no association between the variables.
\(H_A\) : The variables are dependent, i.e., there is an association between the variables.
Now, let’s perform the chi-squared test:
# correct = TRUE applies Yates' continuity correction
Chi = chisq.test(Tab, correct = TRUE)
Chi
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: Tab
## X-squared = 1.7443, df = 1, p-value = 0.1866
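As the p-value of this test (\(0.1866\), about \(18.7\%\)) is well above the usual \(5\%\) significance level, we fail to reject the null hypothesis.
So, the chi-squared test gives no significant evidence of an association between the variables; note that this is not proof that they are independent.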
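To see where the reported statistic and p-value come from, the calculation can be reproduced by hand. This is a sketch under the assumption that the counts are those of the contingency table above; the names `observed`, `expected`, `x2_yates` and `p_value` are illustrative.

```r
# Observed counts, as in the contingency table above
observed <- matrix(c(314, 44, 334, 33), nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("female", "male"),
                                   Smoke  = c("no", "yes")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates' correction subtracts 0.5 from each |O - E| before squaring
# (written this simply because every |O - E| here exceeds 0.5)
x2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1), which is 1 for a 2x2 table
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(x2_yates, df = df, lower.tail = FALSE)

c(statistic = x2_yates, df = df, p.value = p_value)
```

The result agrees with the `X-squared = 1.7443` and `p-value = 0.1866` printed by `chisq.test`.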
Let’s see what attributes are stored in the test result:
attributes(Chi)
## $names
## [1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
## [7] "expected" "residuals" "stdres"
##
## $class
## [1] "htest"
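Each of the names listed above can be pulled out of the result with `$`. A short sketch, assuming the same counts as the contingency table built earlier (the table is rebuilt here so the snippet runs on its own, and `x2` is an illustrative name):

```r
# Rebuild the contingency table from the counts shown earlier
Tab <- as.table(matrix(c(314, 44, 334, 33), nrow = 2, byrow = TRUE,
                       dimnames = list(Gender = c("female", "male"),
                                       Smoke  = c("no", "yes"))))
Chi <- chisq.test(Tab, correct = TRUE)

# Components of the "htest" object are accessed with `$`
Chi$statistic   # test statistic, labelled "X-squared"
Chi$parameter   # degrees of freedom
Chi$p.value     # p-value as a plain number

# unname() drops the label when only the number is needed
x2 <- unname(Chi$statistic)
x2
```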
Now, let’s see the expected table of our chi-squared test :
Chi$expected
##         Smoke
## Gender         no      yes
##   female 319.9779 38.02207
##   male   328.0221 38.97793
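The expected counts are what each cell would hold if the two variables were independent: row total times column total, divided by the grand total. The sketch below reproduces them from the observed counts and checks the common rule of thumb that every expected count should be at least 5; the table is rebuilt from the printed counts, and `expected` is an illustrative name.

```r
# Rebuild the contingency table from the counts shown earlier
Tab <- as.table(matrix(c(314, 44, 334, 33), nrow = 2, byrow = TRUE,
                       dimnames = list(Gender = c("female", "male"),
                                       Smoke  = c("no", "yes"))))

# E_ij = (row i total * column j total) / grand total
expected <- outer(rowSums(Tab), colSums(Tab)) / sum(Tab)
expected

# Rule of thumb for the chi-squared approximation: all expected counts >= 5
all(expected >= 5)
```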
In the chi-squared test, we found no significant evidence of an association, but the test doesn’t give us any idea about the strength or direction of the association between the variables.
Common measures of association for a \(2 \times 2\) table are the relative risk (RR), the odds ratio (OR) and the risk difference (RD).
All of these measure the direction and the strength of the association between two categorical variables.
To compute the RR, OR and RD, we need an additional package, epiR.
Let’s load the package:
library(epiR)
## Warning: package 'epiR' was built under R version 3.6.3
## Loading required package: survival
## Package epiR 1.0-14 is loaded
## Type help(epi.about) for summary information
## Type browseVignettes(package = 'epiR') to learn how to use epiR for applied epidemiological analyses
##
To get help on this package:
help(package=epiR)
To calculate the various measures, we pass our contingency table to the epi.2by2 function:
epi.2by2(Tab, method = "cohort.count", conf.level = 0.95)
##              Outcome +    Outcome -      Total        Inc risk *        Odds
## Exposed +          314           44        358              87.7        7.14
## Exposed -          334           33        367              91.0       10.12
## Total              648           77        725              89.4        8.42
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio 0.96 (0.92, 1.01)
## Odds ratio 0.71 (0.44, 1.14)
## Attrib risk * -3.30 (-7.79, 1.19)
## Attrib risk in population * -1.63 (-5.32, 2.06)
## Attrib fraction in exposed (%) -3.76 (-9.12, 1.34)
## Attrib fraction in population (%) -1.82 (-4.34, 0.64)
## -------------------------------------------------------------------
## Test that OR = 1: chi2(1) = 2.077 Pr>chi2 = 0.15
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 100 population units
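The `chi2(1) = 2.077` in the output above differs from the `X-squared = 1.7443` reported earlier. It appears to match the Pearson statistic without Yates' continuity correction, which can be checked in base R. This is a sketch that assumes the same counts; `corrected` and `uncorrected` are illustrative names.

```r
# Rebuild the contingency table from the counts shown earlier
Tab <- as.table(matrix(c(314, 44, 334, 33), nrow = 2, byrow = TRUE,
                       dimnames = list(Gender = c("female", "male"),
                                       Smoke  = c("no", "yes"))))

corrected   <- chisq.test(Tab, correct = TRUE)   # with Yates' correction
uncorrected <- chisq.test(Tab, correct = FALSE)  # plain Pearson statistic

round(unname(corrected$statistic), 4)    # statistic with the correction
round(unname(uncorrected$statistic), 3)  # statistic without the correction
round(uncorrected$p.value, 2)            # p-value without the correction
```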
For case-control studies, we pass method = "case.control" instead; the default conf.level is \(95\%\), so that argument can be omitted.
The RR is not returned for case-control studies, because risks cannot be estimated when subjects are sampled by outcome.
Looking at the odds ratio, we can say that the odds of a female not smoking are \(0.71\) times the odds of a male not smoking.
So, if we take the inverse of this:
1/0.71
## [1] 1.408451
So, we can say that the odds of a male not smoking are about \(1.4\) times the odds of a female not smoking.
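The same odds ratio and its inverse can be computed directly from the cell counts, which avoids the rounding in `1/0.71`. A minimal sketch, assuming the counts of the contingency table above; the variable names are illustrative.

```r
# Rebuild the contingency table from the counts shown earlier
Tab <- as.table(matrix(c(314, 44, 334, 33), nrow = 2, byrow = TRUE,
                       dimnames = list(Gender = c("female", "male"),
                                       Smoke  = c("no", "yes"))))

# Odds of not smoking within each gender
odds_female_no <- Tab["female", "no"] / Tab["female", "yes"]  # 314 / 44
odds_male_no   <- Tab["male", "no"]   / Tab["male", "yes"]    # 334 / 33

# Odds ratio in both directions
or_female_vs_male <- odds_female_no / odds_male_no
or_male_vs_female <- 1 / or_female_vs_male

round(c(or_female_vs_male, or_male_vs_female), 3)
```

Inverting the unrounded ratio gives about 1.418, slightly above the 1.408 obtained from the rounded value 0.71.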
It’s better to arrange our contingency table in the layout that epi.2by2 expects, with the outcome of interest (smoking) in the first column, i.e.,
elements = c(44, 314, 33, 334)
ContTab = matrix(elements, nrow = 2, ncol = 2, byrow = TRUE)
rownames(ContTab) = c("Female", "Male")
colnames(ContTab) = c("Yes", "No")
ContTab
##        Yes  No
## Female  44 314
## Male    33 334
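Retyping the counts works, but it is easy to mistype a number. As an alternative sketch, the columns of the original table can be reordered by name. The table is rebuilt here from the printed counts so the snippet runs on its own, `ContTab2` is an illustrative name, and the labels keep their original lowercase spelling.

```r
# Rebuild the original contingency table from the counts shown earlier
Tab <- as.table(matrix(c(314, 44, 334, 33), nrow = 2, byrow = TRUE,
                       dimnames = list(Gender = c("female", "male"),
                                       Smoke  = c("no", "yes"))))

# Put "yes" (the outcome of interest) in the first column; rows are unchanged
ContTab2 <- as.table(Tab[, c("yes", "no")])
ContTab2
```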
Now, let’s find the different measures of association:
epi.2by2(ContTab, method = "cohort.count")
##              Outcome +    Outcome -      Total        Inc risk *        Odds
## Exposed +           44          314        358             12.29      0.1401
## Exposed -           33          334        367              8.99      0.0988
## Total               77          648        725             10.62      0.1188
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio 1.37 (0.89, 2.10)
## Odds ratio 1.42 (0.88, 2.28)
## Attrib risk * 3.30 (-1.19, 7.79)
## Attrib risk in population * 1.63 (-2.06, 5.32)
## Attrib fraction in exposed (%) 26.84 (-12.15, 52.28)
## Attrib fraction in population (%) 15.34 (-8.10, 33.69)
## -------------------------------------------------------------------
## Test that OR = 1: chi2(1) = 2.077 Pr>chi2 = 0.15
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 100 population units
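From the above analysis, we can see that the \(95\%\) confidence interval for the odds ratio, \((0.88, 2.28)\), contains \(1\). This indicates that the odds ratio is not significantly different from \(1\).
Still, with the table in this layout we can read the estimate directly: the odds of a female smoking are \(1.42\) times the odds of a male smoking (equivalently, the odds of a male not smoking are \(1.42\) times those of a female not smoking).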
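The point estimates and the Wald interval for the odds ratio can be reproduced by hand from the four cell counts. This is a sketch of the standard textbook formulas, not of epiR's internal code; the cell names `n11` to `n22` and the other variable names are illustrative.

```r
# Same layout as ContTab: rows = gender, columns = smoking status
ContTab <- matrix(c(44, 314, 33, 334), nrow = 2, byrow = TRUE,
                  dimnames = list(c("Female", "Male"), c("Yes", "No")))

n11 <- ContTab["Female", "Yes"]; n12 <- ContTab["Female", "No"]
n21 <- ContTab["Male", "Yes"];   n22 <- ContTab["Male", "No"]

# Risk of smoking within each gender
risk_female <- n11 / (n11 + n12)
risk_male   <- n21 / (n21 + n22)

rr         <- risk_female / risk_male          # relative risk
rd         <- 100 * (risk_female - risk_male)  # risk difference per 100
odds_ratio <- (n11 * n22) / (n12 * n21)        # odds ratio

# Wald 95% limits: computed on the log scale, then back-transformed
se_log_or <- sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
or_ci <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(c(RR = rr, OR = odds_ratio, RD = rd,
        OR_lower = or_ci[1], OR_upper = or_ci[2]), 2)
```

Because the interval runs from below 1 to above 1, the data are compatible with no difference in odds between the two groups.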