I found this dataset on Kaggle titled “insurance” to study the risk present in Health Insurance as well as the interplay of various attributes of the insured and see how they affect the insurance premium.
mydata <- read.table("./insurance.csv",
header = TRUE,
sep = ",",
dec = ".")
I imported the data into R Studio using the read.table() function. The data was sourced from the US, which is why the decimal point is used as the decimal separator instead of the decimal comma.
head(mydata)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
This dataset has 1338 observations with 7 variables.
The data is sourced from the third edition of “Machine Learning with R”, by Brett Lanz.
#I created a copy of mydata to work on for the sake of convenience
mydata2 <- mydata
#Made adjustments to the categorical variables
mydata2$sex <- factor(mydata2$sex,
levels = c("male", "female"),
labels = c("male", "female"))
mydata2$smoker <- factor(mydata2$smoker,
levels = c("yes", "no"),
labels = c("yes", "no"))
mydata2$region <- factor(mydata2$region,
levels = c("northeast", "southeast", "southwest", "northwest"),
labels = c("northeast", "southeast", "southwest", "northwest"))
#Renamed the column headers to be more visually appealing
library(dplyr)
##
## 载入程序包:'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata2 <- mydata2 %>%
rename(Age = age)
mydata2 <- mydata2 %>%
rename(Sex = sex)
mydata2 <- mydata2 %>%
rename(BMI = bmi)
mydata2 <- mydata2 %>%
rename(Children = children)
mydata2 <- mydata2 %>%
rename(Smoker = smoker)
mydata2 <- mydata2 %>%
rename(Region = region)
mydata2 <- mydata2 %>%
rename(Charges = charges)
#Below is a dataframe that only contains data on male beneficiaries
mydata2_male <- mydata2 %>%
subset(Sex == "male")
summary(mydata2_male)
## Age Sex BMI Children Smoker
## Min. :18.00 male :676 Min. :15.96 Min. :0.000 yes:159
## 1st Qu.:26.00 female: 0 1st Qu.:26.41 1st Qu.:0.000 no :517
## Median :39.00 Median :30.69 Median :1.000
## Mean :38.92 Mean :30.94 Mean :1.115
## 3rd Qu.:51.00 3rd Qu.:34.99 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## Region Charges
## northeast:163 Min. : 1122
## southeast:189 1st Qu.: 4619
## southwest:163 Median : 9370
## northwest:161 Mean :13957
## 3rd Qu.:18990
## Max. :62593
#Below is a dataframe that only contains data on female beneficiaries
mydata2_female <- mydata2 %>%
subset(Sex == "female")
summary(mydata2_female)
## Age Sex BMI Children Smoker
## Min. :18.00 male : 0 Min. :16.82 Min. :0.000 yes:115
## 1st Qu.:27.00 female:662 1st Qu.:26.12 1st Qu.:0.000 no :547
## Median :40.00 Median :30.11 Median :1.000
## Mean :39.50 Mean :30.38 Mean :1.074
## 3rd Qu.:51.75 3rd Qu.:34.31 3rd Qu.:2.000
## Max. :64.00 Max. :48.07 Max. :5.000
## Region Charges
## northeast:161 Min. : 1608
## southeast:175 1st Qu.: 4885
## southwest:162 Median : 9413
## northwest:164 Mean :12570
## 3rd Qu.:14455
## Max. :63770
summary(mydata2)
## Age Sex BMI Children Smoker
## Min. :18.00 male :676 Min. :15.96 Min. :0.000 yes: 274
## 1st Qu.:27.00 female:662 1st Qu.:26.30 1st Qu.:0.000 no :1064
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## Region Charges
## northeast:324 Min. : 1122
## southeast:364 1st Qu.: 4740
## southwest:325 Median : 9382
## northwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
The variable “Charges” is a numerical ratio.
#First of all, I will implement random sampling on mydata2 to ensure the integrity of the following hypothesis testing
library(dplyr)
set.seed(123)
sample_size <- 250
sampled_data <- mydata2 %>% sample_n(sample_size)
#Initial overview of the descriptive statistics by group
library(psych)
describeBy(sampled_data$Charges, g = sampled_data$Sex)
##
## Descriptive statistics by group
## group: male
## vars n mean sd median trimmed mad min max
## X1 1 135 12832.62 12160.82 8798.59 10643.54 6201.35 1135.94 48970.25
## range skew kurtosis se
## X1 47834.31 1.5 1.24 1046.64
## ------------------------------------------------------------
## group: female
## vars n mean sd median trimmed mad min max range
## X1 1 115 13317 11746.63 10156.78 11063 6663.63 1744.46 63770.43 62025.96
## skew kurtosis se
## X1 2 4.18 1095.38
#Visualisation of the distribution of "Charges" for males and females
library(ggplot2)
##
## 载入程序包:'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(sampled_data, aes(x = Charges)) +
geom_histogram(binwidth = 30, colour = "gray") +
facet_wrap(~Sex, ncol = 1) +
ylab("Frequency")
#Implementing the Shapiro-Wilk test on "Charges" for both males and females
library(rstatix)
##
## 载入程序包:'rstatix'
## The following object is masked from 'package:stats':
##
## filter
sampled_data %>%
group_by(Sex) %>%
shapiro_test(Charges)
## # A tibble: 2 × 4
## Sex variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 male Charges 0.790 1.29e-12
## 2 female Charges 0.772 4.36e-12
For both males and females,
Null Hypothesis: The variable “Charges” is normally distributed
Alternative Hypothesis: The variable “Charges” is not normally distributed
For males, the p-value we obtained from the Shapiro-Wilk test is 1.292164e-12, we therefore reject the null hypothesis at p < 0.001
For males, the p-value we obtained from the Shapiro-Wilk test is 4.358733e-12, we therefore reject the null hypothesis at p < 0.001
#Here I shall use Levene's Test to test for the homogeneity of variance for the groups "male" and "female"
library(car)
## 载入需要的程序包:carData
##
## 载入程序包:'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(sampled_data$Charges, group = sampled_data$Sex)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.2857 0.5935
## 248
The p-value we obtained from Levene’s Test is 0.5935, we therefore do not reject the null hypothesis.
#Implementation of independent samples t-test
t.test(sampled_data$Charges ~ sampled_data$Sex,
var.equal = TRUE,
alternative = "two.sided")
##
## Two Sample t-test
##
## data: sampled_data$Charges by sampled_data$Sex
## t = -0.31883, df = 248, p-value = 0.7501
## alternative hypothesis: true difference in means between group male and group female is not equal to 0
## 95 percent confidence interval:
## -3476.655 2507.889
## sample estimates:
## mean in group male mean in group female
## 12832.62 13317.00
#Effect size
library(effectsize)
##
## 载入程序包:'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
cohens_d(sampled_data$Charges ~ sampled_data$Sex,
pooled_sd = FALSE)
## Cohen's d | 95% CI
## -------------------------
## -0.04 | [-0.29, 0.21]
##
## - Estimated using un-pooled SD.
interpret_cohens_d(0.04, rules = "sawilowsky2009")
## [1] "tiny"
## (Rules: sawilowsky2009)
Based on the sample data, we get a p-value of 0.7501, we therefore do not reject the null hypothesis and find that the average individual medical cost billed by health insurance is not different for males and females. The effect size is tiny (d = 0.04).
#Implementation of the Wilcoxon rank-sum test
wilcox.test(sampled_data$Charges ~ sampled_data$Sex,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: sampled_data$Charges by sampled_data$Sex
## W = 7048, p-value = 0.2099
## alternative hypothesis: true location shift is not equal to 0
#Effect size
library(effectsize)
effectsize(wilcox.test(sampled_data$Charges ~ sampled_data$Sex,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## -0.09 | [-0.23, 0.05]
interpret_rank_biserial(0.09)
## [1] "very small"
## (Rules: funder2019)
Based on the sample data, we get a p-value of 0.2099, we therefore do not reject the null hypothesis and find that the average individual medical cost billed by health insurance is not different for males and females. The effect size is very small (r = 0.09).
Verdict: