mydata <- read.table("~/insurance.csv", header=TRUE, sep=",", dec=".")
head(mydata)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
A unit of observation is the primary beneficiary (person) of health insurance
The sample size in this data set is equal to 1338 units of observation
Definition of variables:
The data was taken from the website Kaggle.com: https://www.kaggle.com/datasets/mirichoi0218/insurance?resource=download
mydata_new <- mydata[c(-4)] # Removing the columns "children" & "region"
mydata_new$sexF <- factor(mydata_new$sex,
labels = c("male", "female"),
levels = c("male", "female")) # Adding a factor variable for "sex"
mydata_new$smokerF <- factor(mydata_new$smoker,
labels = c("yes", "no"),
levels = c("yes", "no")) # Adding a factor variable for "smoker"
mydata_new$regionF <- factor(mydata_new$region,
labels = c("southwest", "southeast", "northwest", "northeast"),
levels = c("southwest", "southeast", "northwest", "northeast")) # Adding a factor variable for "region"
mydata_new$ID <- seq.int(nrow(mydata_new)) # Adding a column with ID of each observation
mydata_new <- mydata_new[c(10, 1, 2, 7, 3, 4, 8, 5, 9, 6)] # Rearranging the column order
summary(mydata_new[c(-1, -3, -6, -8)]) # Showing summary of variables without ID, sex, smoker and region columns
## age sexF bmi smokerF regionF charges
## Min. :18.00 male :676 Min. :15.96 yes: 274 southwest:325 Min. : 1122
## 1st Qu.:27.00 female:662 1st Qu.:26.30 no :1064 southeast:364 1st Qu.: 4740
## Median :39.00 Median :30.40 northwest:325 Median : 9382
## Mean :39.21 Mean :30.66 northeast:324 Mean :13270
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:16640
## Max. :64.00 Max. :53.13 Max. :63770
get_summary_stats(mydata_new[-1]) # Summary statistics for numerical variables excluding "ID"
## # A tibble: 3 × 13
## variable n min max median q1 q3 iqr mad mean sd se ci
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age 1338 18 64 39 27 51 24 17.8 39.2 14.0 0.384 0.754
## 2 bmi 1338 16.0 53.1 30.4 26.3 34.7 8.40 6.20 30.7 6.10 0.167 0.327
## 3 charges 1338 1122. 63770. 9382. 4740. 16640. 11900. 7441. 13270. 12110. 331. 649.
Parameter estimate interpretation for each numerical variable:
From the previous table we can see the summary of some descriptive statistics of the numerical variables in my dataset. Before that we have the summary of all the variables (without ID, and the two categorical ones) in the dataset.
Hypothesis: - H0: Males have equal or lower medical charges than females. - H1: Males have higher medical charges than females.
describeBy(mydata_new$charges, g = mydata_new$sexF)
##
## Descriptive statistics by group
## group: male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 676 13956.75 12971.03 9369.62 11825.4 8121.53 1121.87 62592.87 61471 1.33 0.79 498.89
## ------------------------------------------------------------------------------------------
## group: female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 662 12569.58 11128.7 9412.96 10455.16 7129.08 1607.51 63770.43 62162.92 1.72 2.71 432.53
ggplot(mydata_new, aes(x = charges)) +
geom_histogram(binwidth = 10000, colour="deeppink", fill="lightpink") +
facet_wrap(~sexF, ncol = 1) +
ylab("Frequency")
shapiro.test(mydata_new$charges[mydata_new$sexF == "female"])
##
## Shapiro-Wilk normality test
##
## data: mydata_new$charges[mydata_new$sexF == "female"]
## W = 0.80539, p-value < 2.2e-16
shapiro.test(mydata_new$charges[mydata_new$sexF == "male"])
##
## Shapiro-Wilk normality test
##
## data: mydata_new$charges[mydata_new$sexF == "male"]
## W = 0.82281, p-value < 2.2e-16
Hypothesis: - H0: Distribution of charges is normal. - H1: Distribution of charges is not normal.
We can reject the null hypothesis at p<0.001 for both gender groups.
var.test(mydata_new$charges[mydata_new$sexF == "male"], mydata_new$charges[mydata_new$sexF == "female"])
##
## F test to compare two variances
##
## data: mydata_new$charges[mydata_new$sexF == "male"] and mydata_new$charges[mydata_new$sexF == "female"]
## F = 1.3585, num df = 675, denom df = 661, p-value = 7.896e-05
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 1.167092 1.581105
## sample estimates:
## ratio of variances
## 1.3585
Hypothesis: - H0: The variances of medical charges for males and females are equal. - H1: The variances of medical charges for males and females are not equal.
We can reject the null hypothesis at p<0.001, meaning that we cannot say that variances of medical charges for males and females are equal.
t.test(mydata_new$charges ~ mydata_new$sexF,
alternative = "greater",
var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: mydata_new$charges by mydata_new$sexF
## t = 2.1009, df = 1313.4, p-value = 0.01792
## alternative hypothesis: true difference in means between group male and group female is greater than 0
## 95 percent confidence interval:
## 300.3432 Inf
## sample estimates:
## mean in group male mean in group female
## 13956.75 12569.58
H0: There is no difference between mean charges for the female group and male group. H1: The mean charges for the male group are greater than for the female group.
We can reject H0 at p<0.02.
wilcox.test(mydata_new$charges ~ mydata_new$sexF,
correct = FALSE,
exact = FALSE,
alternative = "greater")
##
## Wilcoxon rank sum test
##
## data: mydata_new$charges by mydata_new$sexF
## W = 226208, p-value = 0.3643
## alternative hypothesis: true location shift is greater than 0
H0: The distribution of charges for females is equal to the distribution of charges for males. H1:The distribution of charges for females is greater that the distribution of charges for males.
We cannot reject HO (p>0.05).
effectsize(wilcox.test(mydata_new$charges ~ mydata_new$sexF,
correct = FALSE,
exact = FALSE,
alternative = "greater"))
## r (rank biserial) | 95% CI
## ---------------------------------
## 0.01 | [-0.04, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_rank_biserial(0.01)
## [1] "tiny"
## (Rules: funder2019)
The distribution of medical charges for both genders was found to be non-normal (based on the Shapiro-Wilk test), making a non-parametric test more appropriate. Using the Wilcoxon Rank-Sum test, we found no statistically significant difference in the amount of medical charges between females and males (p = 0.3643). Furthermore, the effect size was tiny (r = 0.01), indicating that gender has no meaningful impact on medical charges in this sample.
Hypothesis: - H0: There is no association between being a smoker and the region where a person lives. (Smoking status is independent of region) - H1: There is an association between being a smoker and the region where a person lives. (Smoking status is not independent of region)
results <- chisq.test(mydata_new$regionF, mydata_new$smokerF,
correct = TRUE)
results
##
## Pearson's Chi-squared test
##
## data: mydata_new$regionF and mydata_new$smokerF
## X-squared = 7.3435, df = 3, p-value = 0.06172
The p-value is higher that 0.05 and therefore, we fail to reject the null hypothesis, meaning there is no statistically significant association between smoking status and region.
addmargins(results$observed)
## mydata_new$smokerF
## mydata_new$regionF yes no Sum
## southwest 58 267 325
## southeast 91 273 364
## northwest 58 267 325
## northeast 67 257 324
## Sum 274 1064 1338
round(results$expected, 2)
## mydata_new$smokerF
## mydata_new$regionF yes no
## southwest 66.55 258.45
## southeast 74.54 289.46
## northwest 66.55 258.45
## northeast 66.35 257.65
If there is no association, we would expect 66.55 people living in southwest region to be smokers. But in reality 58 people are smokers.
round(results$res, 2)
## mydata_new$smokerF
## mydata_new$regionF yes no
## southwest -1.05 0.53
## southeast 1.91 -0.97
## northwest -1.05 0.53
## northeast 0.08 -0.04
None of the residuals exceed ±1.96, meaning there is no significant deviation from the expected frequencies. The largest residual (1.91 for Southeast smokers) is close to significance but not quite there.
effectsize::cramers_v(mydata_new$region, mydata_new$smokerF)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.06 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.06)
## [1] "very small"
## (Rules: funder2019)
The effect size, measured by Cramér’s V (V = 0.06), confirms that any observed association is extremely weak and not practically meaningful. This suggests that smoking status is distributed similarly across regions in this dataset.