Ricson Lee - HW2

Instruction 1:

I found this dataset on Kaggle titled “insurance” to study the risk present in Health Insurance as well as the interplay of various attributes of the insured and see how they affect the insurance premium.

Instruction 2:

mydata <- read.table("./insurance.csv",
                     header = TRUE,
                     sep = ",",
                     dec = ".")

I imported the data into R Studio using the read.table() function. The data was sourced from the US, which is why the decimal point is used as the decimal separator instead of the decimal comma.

Instruction 3:

head(mydata)
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

Instruction 4:

This dataset has 1338 observations with 7 variables.

Instruction 5:

The data is sourced from the third edition of “Machine Learning with R”, by Brett Lanz.

Instruction 6:

#I created a copy of mydata to work on for the sake of convenience
mydata2 <- mydata 
#Made adjustments to the categorical variables
mydata2$sex <- factor(mydata2$sex,
                              levels = c("male", "female"),
                              labels = c("male", "female"))

mydata2$smoker <- factor(mydata2$smoker,
                              levels = c("yes", "no"),
                              labels = c("yes", "no"))

mydata2$region <- factor(mydata2$region,
                              levels = c("northeast", "southeast", "southwest", "northwest"),
                              labels = c("northeast", "southeast", "southwest", "northwest"))
#Renamed the column headers to be more visually appealing
library(dplyr)
## 
## 载入程序包:'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata2 <- mydata2 %>%
  rename(Age = age)

mydata2 <- mydata2 %>%
  rename(Sex = sex)

mydata2 <- mydata2 %>%
  rename(BMI = bmi)

mydata2 <- mydata2 %>%
  rename(Children = children)

mydata2 <- mydata2 %>%
  rename(Smoker = smoker)

mydata2 <- mydata2 %>%
  rename(Region = region)

mydata2 <- mydata2 %>%
  rename(Charges = charges)
#Below is a dataframe that only contains data on male beneficiaries
mydata2_male <- mydata2 %>%
  subset(Sex == "male")

summary(mydata2_male)
##       Age            Sex           BMI           Children     Smoker   
##  Min.   :18.00   male  :676   Min.   :15.96   Min.   :0.000   yes:159  
##  1st Qu.:26.00   female:  0   1st Qu.:26.41   1st Qu.:0.000   no :517  
##  Median :39.00                Median :30.69   Median :1.000            
##  Mean   :38.92                Mean   :30.94   Mean   :1.115            
##  3rd Qu.:51.00                3rd Qu.:34.99   3rd Qu.:2.000            
##  Max.   :64.00                Max.   :53.13   Max.   :5.000            
##        Region       Charges     
##  northeast:163   Min.   : 1122  
##  southeast:189   1st Qu.: 4619  
##  southwest:163   Median : 9370  
##  northwest:161   Mean   :13957  
##                  3rd Qu.:18990  
##                  Max.   :62593
#Below is a dataframe that only contains data on female beneficiaries
mydata2_female <- mydata2 %>%
  subset(Sex == "female")

summary(mydata2_female)
##       Age            Sex           BMI           Children     Smoker   
##  Min.   :18.00   male  :  0   Min.   :16.82   Min.   :0.000   yes:115  
##  1st Qu.:27.00   female:662   1st Qu.:26.12   1st Qu.:0.000   no :547  
##  Median :40.00                Median :30.11   Median :1.000            
##  Mean   :39.50                Mean   :30.38   Mean   :1.074            
##  3rd Qu.:51.75                3rd Qu.:34.31   3rd Qu.:2.000            
##  Max.   :64.00                Max.   :48.07   Max.   :5.000            
##        Region       Charges     
##  northeast:161   Min.   : 1608  
##  southeast:175   1st Qu.: 4885  
##  southwest:162   Median : 9413  
##  northwest:164   Mean   :12570  
##                  3rd Qu.:14455  
##                  Max.   :63770
summary(mydata2)
##       Age            Sex           BMI           Children     Smoker    
##  Min.   :18.00   male  :676   Min.   :15.96   Min.   :0.000   yes: 274  
##  1st Qu.:27.00   female:662   1st Qu.:26.30   1st Qu.:0.000   no :1064  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.66   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.13   Max.   :5.000             
##        Region       Charges     
##  northeast:324   Min.   : 1122  
##  southeast:364   1st Qu.: 4740  
##  southwest:325   Median : 9382  
##  northwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770

Instruction 7:

The variable “Charges” is a numerical ratio.

#First of all, I will implement random sampling on mydata2 to ensure the integrity of the following hypothesis testing
library(dplyr)
set.seed(123)
sample_size <- 250
sampled_data <- mydata2 %>% sample_n(sample_size)
#Initial overview of the descriptive statistics by group
library(psych)
describeBy(sampled_data$Charges, g = sampled_data$Sex)
## 
##  Descriptive statistics by group 
## group: male
##    vars   n     mean       sd  median  trimmed     mad     min      max
## X1    1 135 12832.62 12160.82 8798.59 10643.54 6201.35 1135.94 48970.25
##       range skew kurtosis      se
## X1 47834.31  1.5     1.24 1046.64
## ------------------------------------------------------------ 
## group: female
##    vars   n  mean       sd   median trimmed     mad     min      max    range
## X1    1 115 13317 11746.63 10156.78   11063 6663.63 1744.46 63770.43 62025.96
##    skew kurtosis      se
## X1    2     4.18 1095.38
#Visualisation of the distribution of "Charges" for males and females
library(ggplot2)
## 
## 载入程序包:'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(sampled_data, aes(x = Charges)) +
  geom_histogram(binwidth = 30, colour = "gray") +
  facet_wrap(~Sex, ncol = 1) +
  ylab("Frequency")

#Implementing the Shapiro-Wilk test on "Charges" for both males and females
library(rstatix)
## 
## 载入程序包:'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
sampled_data %>%
  group_by(Sex) %>%
  shapiro_test(Charges)
## # A tibble: 2 × 4
##   Sex    variable statistic        p
##   <fct>  <chr>        <dbl>    <dbl>
## 1 male   Charges      0.790 1.29e-12
## 2 female Charges      0.772 4.36e-12

For both males and females,

#Here I shall use Levene's Test to test for the homogeneity of variance for the groups "male" and "female"
library(car)
## 载入需要的程序包:carData
## 
## 载入程序包:'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
leveneTest(sampled_data$Charges, group = sampled_data$Sex)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.2857 0.5935
##       248

The p-value we obtained from Levene’s Test is 0.5935, we therefore do not reject the null hypothesis.

#Implementation of independent samples t-test
t.test(sampled_data$Charges ~ sampled_data$Sex,
       var.equal = TRUE,
       alternative = "two.sided")
## 
##  Two Sample t-test
## 
## data:  sampled_data$Charges by sampled_data$Sex
## t = -0.31883, df = 248, p-value = 0.7501
## alternative hypothesis: true difference in means between group male and group female is not equal to 0
## 95 percent confidence interval:
##  -3476.655  2507.889
## sample estimates:
##   mean in group male mean in group female 
##             12832.62             13317.00
#Effect size
library(effectsize)
## 
## 载入程序包:'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
## The following object is masked from 'package:psych':
## 
##     phi
cohens_d(sampled_data$Charges ~ sampled_data$Sex,
         pooled_sd = FALSE)
## Cohen's d |        95% CI
## -------------------------
## -0.04     | [-0.29, 0.21]
## 
## - Estimated using un-pooled SD.
interpret_cohens_d(0.04, rules = "sawilowsky2009")
## [1] "tiny"
## (Rules: sawilowsky2009)

Based on the sample data, we get a p-value of 0.7501, we therefore do not reject the null hypothesis and find that the average individual medical cost billed by health insurance is not different for males and females. The effect size is tiny (d = 0.04).

#Implementation of the Wilcoxon rank-sum test
wilcox.test(sampled_data$Charges ~ sampled_data$Sex,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  sampled_data$Charges by sampled_data$Sex
## W = 7048, p-value = 0.2099
## alternative hypothesis: true location shift is not equal to 0
#Effect size
library(effectsize)
effectsize(wilcox.test(sampled_data$Charges ~ sampled_data$Sex,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided"))
## r (rank biserial) |        95% CI
## ---------------------------------
## -0.09             | [-0.23, 0.05]
interpret_rank_biserial(0.09)
## [1] "very small"
## (Rules: funder2019)

Based on the sample data, we get a p-value of 0.2099, we therefore do not reject the null hypothesis and find that the average individual medical cost billed by health insurance is not different for males and females. The effect size is very small (r = 0.09).

Verdict: