For suggestions and queries: sulovekoirala@gmail.com
We are going to demonstrate the t-test and the chi-square test on a publicly available insurance dataset. Hypothesis testing in R is a breeze.
library(ggplot2)
library(data.table)
library(psych, warn.conflicts = FALSE)
library(knitr)
options(scipen = 999) # Avoid scientific notation in printed output
data <- fread('https://github.com/stedy/Machine-Learning-with-R-datasets/raw/master/insurance.csv') # fread() from the data.table package reads the CSV quickly
head(data)
str(data)
## Classes 'data.table' and 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
## - attr(*, ".internal.selfref")=<externalptr>
describe(data)
Now, there are several questions that we have to answer.
Before moving straight to the analysis, we have to formulate the null and alternative hypotheses.
H0: Smokers do not pay higher insurance charges than non-smokers.
H1: Smokers pay higher insurance charges than non-smokers.
We will use a significance level (α) of 0.05.
ggplot(data)+
aes(smoker, charges, fill = smoker)+
geom_boxplot(outlier.colour = "red")+
theme_minimal()+
labs(title = "Insurance charges comparison of Smokers and Non-Smokers")
The boxplot hints that non-smokers' charges sit well below smokers' charges. Most of the smokers' data are concentrated above the bulk of the non-smokers' data.
describeBy(x = data$charges, group = data$smoker)
##
## Descriptive statistics by group
## group: no
## vars n mean sd median trimmed mad min max range
## X1 1 1064 8434.27 5993.78 7345.41 7599.76 5477.15 1121.87 36910.61 35788.73
## skew kurtosis se
## X1 1.53 3.12 183.75
## ------------------------------------------------------------
## group: yes
## vars n mean sd median trimmed mad min max
## X1 1 274 32050.23 11541.55 34456.35 31782.89 15167.19 12829.46 63770.43
## range skew kurtosis se
## X1 50940.97 0.13 -1.05 697.25
We also used the describeBy command from the psych package to show descriptive statistics for the two groups. It shows that the average insurance charge of the smoker group is much higher (about 32,050 versus 8,434 for non-smokers). But is the difference statistically significant, or could it have occurred by chance? This is where hypothesis testing comes in. We are going to perform a t-test using the t.test function from R's built-in stats package.
t.test(charges~smoker, data)
##
## Welch Two Sample t-test
##
## data: charges by smoker
## t = -32.752, df = 311.85, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -25034.71 -22197.21
## sample estimates:
## mean in group no mean in group yes
## 8434.268 32050.232
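We can observe the p-value above. It is far below our defined α of 0.05. Note that t.test ran a two-sided Welch test by default, whereas our alternative hypothesis is one-sided. Because the difference lies in the hypothesized direction (the non-smoker mean is lower), the one-sided p-value is half the two-sided one and is also far below α. Hence, we reject the null hypothesis in favor of the alternative hypothesis: smokers pay higher insurance charges.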
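As a rough cross-check, the Welch statistic can be rebuilt from the group sizes, means and standard deviations printed by describeBy. The sketch below hard-codes those rounded values, so it should agree with the t.test output only up to rounding; the variable names are illustrative and not part of the original analysis.

```r
# Summary statistics copied (rounded) from the describeBy output above
n_no  <- 1064; mean_no  <- 8434.27;  sd_no  <- 5993.78
n_yes <- 274;  mean_yes <- 32050.23; sd_yes <- 11541.55

# Squared standard error of each group mean
v_no  <- sd_no^2  / n_no
v_yes <- sd_yes^2 / n_yes

# Welch t statistic: difference in means divided by its standard error
t_stat <- (mean_no - mean_yes) / sqrt(v_no + v_yes)

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (v_no + v_yes)^2 /
  (v_no^2 / (n_no - 1) + v_yes^2 / (n_yes - 1))

# Two-sided p-value, and one-sided p-value for H1: mean(no) < mean(yes)
p_two_sided <- 2 * pt(-abs(t_stat), df_welch)
p_one_sided <- pt(t_stat, df_welch)

print(c(t = t_stat, df = df_welch))
print(c(two_sided = p_two_sided, one_sided = p_one_sided))
```

With the full dataset loaded, a call such as t.test(charges ~ smoker, data = data, alternative = "less") should run the one-sided Welch test directly; "less" is the assumed direction because the groups are ordered no, then yes.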
countplot <- as.data.frame(table(data$sex, data$smoker))
countplot$Var2 <- ifelse(countplot$Var2 =="no", "Non-Smoker", "Smoker")
countplot
ggplot(countplot)+
aes(Var1, Freq, fill = Var2)+
geom_bar(stat = "identity", position = "dodge")+
labs(x = "Gender", fill = "Categories", title = "Number of Smokers and Non-Smokers by Gender", y = "Frequency")+
theme_minimal()
Since we are dealing with two categorical variables here (sex and smoker), we are going to perform a chi-square test of independence.
H0: There is no difference in the proportion of smokers between genders.
H1: There is a difference in the proportion of smokers between genders.
We will use a significance level (α) of 0.05.
table(data$sex, data$smoker)
##
## no yes
## female 547 115
## male 517 159
chisq.test(table(data$sex, data$smoker))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(data$sex, data$smoker)
## X-squared = 7.3929, df = 1, p-value = 0.006548
The test returned a p-value (0.0065) that is lower than the α we defined earlier. Therefore, we again reject the null hypothesis in favor of the alternative hypothesis, which states that the proportion of smokers differs by gender (about 17% of women versus 24% of men in this sample).
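To see where the statistic comes from, here is a minimal sketch that rebuilds it from the counts in the table above: row proportions, expected counts under independence, and the Pearson statistic with Yates' continuity correction. The counts are typed in by hand and the object names are illustrative assumptions, not part of the original analysis.

```r
# 2 x 2 table of counts copied from the output above (filled column by column)
tab <- matrix(c(547, 517, 115, 159), nrow = 2,
              dimnames = list(sex = c("female", "male"),
                              smoker = c("no", "yes")))

# Share of smokers and non-smokers within each gender (row proportions)
print(round(prop.table(tab, margin = 1), 3))

# Expected counts if sex and smoking status were independent
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
print(round(expected, 2))

# Pearson statistic with Yates' continuity correction (used for 2 x 2 tables)
x2 <- sum((abs(tab - expected) - 0.5)^2 / expected)

# Upper-tail probability of a chi-square distribution with 1 degree of freedom
p_value <- pchisq(x2, df = 1, lower.tail = FALSE)
print(c(X_squared = x2, p_value = p_value))

# Built-in test on the same table, for comparison
print(chisq.test(tab))
```

All expected counts here are far above 5, so the usual rule of thumb for the chi-square approximation is comfortably met.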
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Matt Dowle and Arun Srinivasan (2021). data.table: Extension of data.frame. R package version 1.14.0. https://CRAN.R-project.org/package=data.table
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Revelle, W. (2020). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston, Illinois, USA. R package version 2.1.3. https://CRAN.R-project.org/package=psych
Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963