Homework 1 : 19566100

Data set for Homework 1, was found on Kaggle. Link to data set is https://www.kaggle.com/datasets/harshsingh2209/medical-insurance-payout.

Research question: Does average expenditure for customer differ between smokers and nonsmokers ?

mydata <- read.table("./expenses.csv", header=TRUE, sep=",", dec=".")
head(mydata)
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

Explenation of data set

Unit of observation: customer of the insurance company

Sample size: 1338 units

Description of data:
  • age: Age of the customer
  • sex: Gender
  • bmi: Body Mass Index (kg/m2)
  • children: Number of children
  • smoker: Whether the customer smokes or not
  • region: Which region of the USA the customer belongs to
  • charges: The expenditure for the customer in $
mydata$sexF <- factor(mydata$sex,
                 levels = c("male", "female"),
                 labels = c("male", "female"))
mydata$smokerF <- factor(mydata$smoker,
                 levels = c("yes", "no"),
                 labels = c("yes", "no"))
head(mydata, 3)
##   age    sex   bmi children smoker    region   charges   sexF smokerF
## 1  19 female 27.90        0    yes southwest 16884.924 female     yes
## 2  18   male 33.77        1     no southeast  1725.552   male      no
## 3  28   male 33.00        3     no southeast  4449.462   male      no
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
set.seed(10)

mydata1 <- mydata %>% # for the purpose of exercise, I will take a random sample of 100 units
sample_n(100)

Data manipulations

mydata2 <- mydata1[,c(-6)]# excluding variable, that I will not use
head(mydata2)
##   age    sex    bmi children smoker   charges   sexF smokerF
## 1  19 female 32.900        0     no  1748.774 female      no
## 2  42 female 24.985        2     no  8017.061 female      no
## 3  52 female 46.750        5     no 12592.534 female      no
## 4  63   male 36.765        0     no 13981.850   male      no
## 5  58   male 25.175        0     no 11931.125   male      no
## 6  34   male 25.300        2    yes 18972.495   male     yes
library(psych)
describe(mydata2 [,c(-2,-5,-7,-8)])
##          vars   n     mean       sd   median  trimmed     mad     min     max
## age         1 100    42.04    14.56    42.50    42.40   17.05   18.00    64.0
## bmi         2 100    31.27     6.26    31.40    31.21    6.94   17.86    47.6
## children    3 100     1.22     1.39     1.00     1.00    1.48    0.00     5.0
## charges     4 100 14067.91 12524.07 10673.46 11842.21 7085.90 1141.45 60021.4
##             range  skew kurtosis      se
## age         46.00 -0.17    -1.26    1.46
## bmi         29.74  0.12    -0.35    0.63
## children     5.00  1.08     0.55    0.14
## charges  58879.95  1.56     1.80 1252.41
summary(mydata2)
##       age            sex                 bmi           children   
##  Min.   :18.00   Length:100         Min.   :17.86   Min.   :0.00  
##  1st Qu.:31.00   Class :character   1st Qu.:26.65   1st Qu.:0.00  
##  Median :42.50   Mode  :character   Median :31.40   Median :1.00  
##  Mean   :42.04                      Mean   :31.27   Mean   :1.22  
##  3rd Qu.:54.25                      3rd Qu.:36.08   3rd Qu.:2.00  
##  Max.   :64.00                      Max.   :47.60   Max.   :5.00  
##     smoker             charges          sexF    smokerF 
##  Length:100         Min.   : 1141   male  :49   yes:19  
##  Class :character   1st Qu.: 5576   female:51   no :81  
##  Mode  :character   Median :10673                       
##                     Mean   :14068                       
##                     3rd Qu.:14222                       
##                     Max.   :60021
Explanation of the parameters:
  • Age-min: The youngest customer observed is 18 years old.
  • Charges: Skewness value of 1.56 for variable charges, which means that distribution of charges is positively or right skewed.
  • Bmi-median: Half of the customers in the sample had less or equal to 31.4 bmi(body-mass index) and half of the customers had more than 31.4 bmi(body-mass index).

Statistical hypothesis

Does average expenditure for customer differ between smokers and nonsmokers ?
  • H0: There is no difference between average expenditures for smokers and nonsmokers.
  • H1: There is difference between average expenditures for smokers and nonsmokers.
Assumptions:
  • Numerical variable (charges is numeric variable)
  • The distribution of the variable is normal in both populations (Shapiro Wilk’s test)
  • Variance of expenditures must be the same for smokers and nonsmokers, but this is usually violated and Welch correction needs to be applied
H0: Variable charges is normally distributed in both groups
H1: Variable charges is not normally distributed in both groups
# Install and load necessary packages
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
if (!requireNamespace("ggpubr", quietly = TRUE)) {
  install.packages("ggpubr")
}
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.3.2
library(ggplot2)

smoker <- ggplot(mydata2[mydata2$smokerF=="yes",  ], aes(x = charges)) +
  theme_linedraw() + 
  geom_histogram(fill = "green", colour = "black") +
  ylab("Frequency") +
  ggtitle("Charges of smokers")

nonsmoker <- ggplot(mydata2[mydata2$smokerF=="no",  ], aes(x = charges)) +
  theme_linedraw() + 
  geom_histogram(fill = "red", colour = "black") +
  ylab("Frequency") +
  ggtitle("Charges of nonsmokers")

install.packages("ggpubr")
## Warning: package 'ggpubr' is in use and will not be installed
library(ggpubr)
ggarrange(smoker, nonsmoker,
          ncol= 2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(dplyr)
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.3.2
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
mydata2 %>%
  group_by(smokerF) %>%
  shapiro_test(charges)
## # A tibble: 2 × 4
##   smokerF variable statistic            p
##   <fct>   <chr>        <dbl>        <dbl>
## 1 yes     charges      0.923 0.128       
## 2 no      charges      0.834 0.0000000420

After plotting the histograms and preforming the Shapiro test, we can conclude that in group nonsmokers variable is not normally distributed. We can reject the H0 at p-value < 0.001.

Assumption of normality is violated but I will however show parametric test for the purpose of the practice.

t.test(mydata2$charges ~ mydata2$smokerF, 
       paired = FALSE,
       var.equal = FALSE,
       alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  mydata2$charges by mydata2$smokerF
## t = 8.2836, df = 20.461, p-value = 5.708e-08
## alternative hypothesis: true difference in means between group yes and group no is not equal to 0
## 95 percent confidence interval:
##  18304.42 30602.34
## sample estimates:
## mean in group yes  mean in group no 
##         33875.151          9421.768
library(effectsize)
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
## The following object is masked from 'package:psych':
## 
##     phi
effectsize::cohens_d(mydata2$charges ~ mydata2$smokerF,
                     pooled_sd = FALSE)
## Cohen's d |       95% CI
## ------------------------
## 2.45      | [1.49, 3.38]
## 
## - Estimated using un-pooled SD.
interpret_cohens_d(2.45, rules = "sawilowsky2009")
## [1] "huge"
## (Rules: sawilowsky2009)

Because normality assumption is violated I need to do non-parametric Wilcoxson Rank Sum Test.

wilcox.test(mydata2$charges ~ mydata2$smokerF,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  mydata2$charges by mydata2$smokerF
## W = 1504, p-value = 1.092e-10
## alternative hypothesis: true location shift is not equal to 0
H0: The distribution locations of smokers and nonsmokers charges are the same
H1: The distribution locations of smokers and nonsmokers charges are not the same

We can reject the null hypothesis at p < 0.001.

library(effectsize)
effectsize(wilcox.test(mydata2$charges ~ mydata2$smokerF,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided"))
## r (rank biserial) |       95% CI
## --------------------------------
## 0.95              | [0.92, 0.97]
interpret_rank_biserial(0.95)
## [1] "very large"
## (Rules: funder2019)

Conclusion

Based on the sample data, we find that there is difference in average expenditures for smokers and nonsmokers (p< 0.001), the difference in distribution is very large (r= 0.95).