Hypothesis Testing on Insurance Dataset

For suggestions and queries: sulovekoirala@gmail.com

Introduction

We are going to demonstrate the t-test and the chi-square test on a publicly available insurance dataset. Hypothesis testing using R is a breeze.

Loading Packages

library(ggplot2)
library(data.table)
library(psych, warn.conflicts = FALSE)
library(knitr)
options(scipen = 999) # Strongly discourages scientific notation in printed numbers

Importing the dataset

data <- fread('https://github.com/stedy/Machine-Learning-with-R-datasets/raw/master/insurance.csv') # fread() from the data.table package reads delimited files quickly and accepts a URL directly

head(data)

str(data)

## Classes 'data.table' and 'data.frame':   1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...
##  - attr(*, ".internal.selfref")=<externalptr>

describe(data)

Now, there are two questions that we want to answer with this dataset.

Do smokers pay higher insurance charges?

Before moving straight to the analysis, we have to formulate the null and alternative hypotheses.

H0: Smokers do not pay higher insurance charges than non-smokers.

H1: Smokers pay higher insurance charges than non-smokers.

We will use a significance level (α) of 0.05.

ggplot(data)+
    aes(smoker, charges, fill = smoker)+
    geom_boxplot(outlier.colour = "red")+
    theme_minimal()+
    labs(title = "Insurance charges comparison of Smokers and Non-Smokers")

The boxplot hints that the charges of non-smokers are well below those of smokers. Most of the smokers' charges are concentrated above those of the non-smokers.

describeBy(x = data$charges, group = data$smoker)

## 
##  Descriptive statistics by group 
## group: no
##    vars    n    mean      sd  median trimmed     mad     min      max    range
## X1    1 1064 8434.27 5993.78 7345.41 7599.76 5477.15 1121.87 36910.61 35788.73
##    skew kurtosis     se
## X1 1.53     3.12 183.75
## ------------------------------------------------------------ 
## group: yes
##    vars   n     mean       sd   median  trimmed      mad      min      max
## X1    1 274 32050.23 11541.55 34456.35 31782.89 15167.19 12829.46 63770.43
##       range skew kurtosis     se
## X1 50940.97 0.13    -1.05 697.25

We also used the describeBy command from the psych package to show descriptive statistics for the two groups. The mean insurance charge of the smoker group (32050.23) is much higher than that of the non-smoker group (8434.27). But is the difference statistically significant, or could it have occurred by chance? This is where hypothesis testing comes in. We are going to perform a t-test using the t.test function from R's built-in stats package.

t.test(charges~smoker, data)

## 
##  Welch Two Sample t-test
## 
## data:  charges by smoker
## t = -32.752, df = 311.85, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -25034.71 -22197.21
## sample estimates:
##  mean in group no mean in group yes 
##          8434.268         32050.232

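The p-value above is far below our chosen α of 0.05, so we reject the null hypothesis. Note that t.test runs a two-sided Welch test by default, so strictly it tests whether the two means differ. Because the sample mean for smokers is well above that for non-smokers, and the 95 percent confidence interval for the difference (non-smokers minus smokers) lies entirely below zero, the data support the alternative hypothesis that smokers pay higher insurance charges.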
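The numbers reported by t.test can be reproduced by hand from the group summaries that describeBy printed. The following is a minimal sketch, assuming the rounded n, mean and sd values shown above; the variable names are illustrative and not part of the original analysis. It also computes a one-sided p-value, which matches the direction of H1 more closely than the two-sided default.

```r
# Group summaries copied from the describeBy() output (rounded to 2 decimals)
n_no  <- 1064; mean_no  <- 8434.27;  sd_no  <- 5993.78
n_yes <- 274;  mean_yes <- 32050.23; sd_yes <- 11541.55

# Variance of each group's sample mean
v_no  <- sd_no^2  / n_no
v_yes <- sd_yes^2 / n_yes

# Welch t statistic: difference in means (no - yes) over its standard error
se     <- sqrt(v_no + v_yes)
t_stat <- (mean_no - mean_yes) / se

# Welch-Satterthwaite approximation to the degrees of freedom
dof <- (v_no + v_yes)^2 / (v_no^2 / (n_no - 1) + v_yes^2 / (n_yes - 1))

# Two-sided 95% confidence interval for the difference in means
ci <- (mean_no - mean_yes) + c(-1, 1) * qt(0.975, dof) * se

# One-sided p-value for H1: mean(no) - mean(yes) < 0, i.e. smokers pay more
p_one_sided <- pt(t_stat, dof)

print(round(c(t = t_stat, df = dof), 3))
print(round(ci, 2))
print(p_one_sided < 0.05)
```

Up to rounding of the inputs, the statistic, degrees of freedom and interval should agree with the Welch output above. With the full dataset loaded, the same one-sided test can be requested directly with t.test(charges ~ smoker, data = data, alternative = "less"), since the group "no" comes first and the tested difference is non-smokers minus smokers.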

Does the proportion of smokers vary (significantly) according to gender?

countplot <- as.data.frame(table(data$sex, data$smoker)) 
countplot$Var2 <- ifelse(countplot$Var2 =="no", "Non-Smoker", "Smoker")
countplot

ggplot(countplot)+
    aes(Var1, Freq, fill = Var2)+
    geom_bar(stat = "identity", position = "dodge")+
    labs(x = "Gender", fill = "Categories", title = "Counts of Smokers and Non-Smokers by Gender", y = "Frequency")+
    theme_minimal()

Since we are dealing with two categorical variables here (sex and smoker), we are going to perform a chi-square test of independence.

H0: There is no difference in the proportion of smokers between genders.

H1: There is a difference in the proportion of smokers between genders.

We will use a significance level (α) of 0.05.

table(data$sex, data$smoker)

##         
##           no yes
##   female 547 115
##   male   517 159

chisq.test(table(data$sex, data$smoker))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(data$sex, data$smoker)
## X-squared = 7.3929, df = 1, p-value = 0.006548

The test returned a p-value of 0.006548, which is less than the α of 0.05 that we defined earlier. Therefore, we again reject the null hypothesis in favor of the alternative: the proportion of smokers differs between genders. In this sample, 159 of 676 males (23.5%) smoke, compared with 115 of 662 females (17.4%).
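To see where the X-squared value comes from, the statistic can be rebuilt from the contingency table. This is a sketch under the assumption that the counts are typed in from the table output above; the object names are illustrative. It computes the expected counts under independence, applies Yates' continuity correction (the default for 2 x 2 tables in chisq.test), and compares the result with the built-in function.

```r
# Counts copied from the table(data$sex, data$smoker) output
tab <- matrix(c(547, 517, 115, 159), nrow = 2,
              dimnames = list(sex = c("female", "male"),
                              smoker = c("no", "yes")))

# Expected counts if sex and smoking status were independent:
# (row total * column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson statistic with Yates' continuity correction (subtract 0.5 from
# each absolute deviation; valid here because every deviation exceeds 0.5)
x2 <- sum((abs(tab - expected) - 0.5)^2 / expected)

# Upper-tail probability with (2 - 1) * (2 - 1) = 1 degree of freedom
p_value <- pchisq(x2, df = 1, lower.tail = FALSE)

# Cross-check against the built-in test
builtin <- chisq.test(tab)

print(round(expected, 2))
print(c(x2 = x2, p = p_value))
print(all.equal(x2, unname(builtin$statistic)))
```

All expected counts are far above 5, which is the usual rule of thumb for the chi-square approximation to be reasonable.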

Bibliography

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Matt Dowle and Arun Srinivasan (2021). data.table: Extension of data.frame. R package version 1.14.0. https://CRAN.R-project.org/package=data.table
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Revelle, W. (2020). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston, Illinois, USA. R package version 2.1.3. https://CRAN.R-project.org/package=psych
Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963

Dr. Sulove Koirala