Data Manipulation
colSums(is.na(mydata))
## age sex bmi children smoker region charges
## 0 0 0 0 0 0 0
mydata$ID <- seq.int(nrow(mydata))
mydata$sexF <- as.factor(mydata$sex)
mydata$regionF <- as.factor(mydata$region)
mydata$smokerF <- as.factor(mydata$smoker)
Mydata <- mydata[,c(10,1,8,3,4,11,9,7)]
sapply(Mydata[,c(2,4,5,8)], FUN=max)
## age bmi children charges
## 64.00 53.13 5.00 63770.43
sapply(Mydata[,c(2,4,5,8)], FUN=min)
## age bmi children charges
## 18.000 15.960 0.000 1121.874
All minimums and maximums are not such that it would at first glance
indicate them being a mistake (as in having negative age).
Since there are 1338 observations we will randomly select 150 of them
and do the hypothesis testing only on the selected 150.
MydataO <- Mydata
set.seed(10)
Mydata <- Mydata[sample(nrow(mydata), 150), ]
Hypotheses testing
Regions
MydataRegions <- Mydata[c(3,1,8)]
library(psych)
## Warning: package 'psych' was built under R version 4.2.2
psych::describe(MydataRegions$charges)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 150 13795.53 11987.12 10652.71 11644.81 7123.54 1141.45 60021.4 58879.95 1.58 1.92 978.74
psych::describeBy(x = MydataRegions$charges, group = MydataRegions$regionF)
##
## Descriptive statistics by group
## group: northeast
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 42 12916.33 10022.65 11722.7 11341.31 7113.46 1694.8 40904.2 39209.4 1.27 1.07 1546.53
## ------------------------------------------------------------------------------------------
## group: northwest
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 37 14953.75 13094.34 9910.36 13058.04 6443.13 2352.97 60021.4 57668.43 1.56 1.96 2152.7
## ------------------------------------------------------------------------------------------
## group: southeast
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 44 14094.74 11968.32 10737.55 12322.08 6833.64 1141.45 43896.38 42754.93 1.36 0.74 1804.29
## ------------------------------------------------------------------------------------------
## group: southwest
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 27 13088.39 13671.79 9625.92 10998.15 7321.26 1682.6 48824.45 47141.85 1.69 1.73 2631.14
We can see that arithmetic means as well as medians are different
between regions. Can we generalize that the basic health insurance
premium is different between regions?
Checking the assumptions: First, variable charges is numeric.
Secondly, we need to check if there are any outliers. Thirdly, we will
check whether the variable charges is normally distributed within each
group, defined by region. Lastly, we will check if the variance of the
variable charges is the same between all groups.
OUTLIERS
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.2.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
MydataRegions %>%
group_by(regionF) %>%
identify_outliers(charges)
## # A tibble: 16 × 5
## regionF ID charges is.outlier is.extreme
## <fct> <int> <dbl> <lgl> <lgl>
## 1 northeast 1125 40904. TRUE FALSE
## 2 northeast 912 33733. TRUE FALSE
## 3 northeast 39 39774. TRUE FALSE
## 4 northeast 477 35148. TRUE FALSE
## 5 northwest 1231 60021. TRUE TRUE
## 6 northwest 299 38746. TRUE FALSE
## 7 northwest 668 40003. TRUE FALSE
## 8 southeast 666 42560. TRUE TRUE
## 9 southeast 1324 43896. TRUE TRUE
## 10 southeast 857 40974. TRUE FALSE
## 11 southeast 1013 36580. TRUE FALSE
## 12 southeast 1023 42211. TRUE TRUE
## 13 southeast 1119 38283. TRUE FALSE
## 14 southwest 861 46114. TRUE TRUE
## 15 southwest 176 48824. TRUE TRUE
## 16 southwest 40 48173. TRUE TRUE
suppressPackageStartupMessages(library(ggpubr))
## Warning: package 'ggpubr' was built under R version 4.2.2
## Warning: package 'ggplot2' was built under R version 4.2.2
ggboxplot(MydataRegions,
x = "regionF",
y = "charges",
add = "jitter")

There are outliers and some of them are extreme, meaning we must use
non-parametric test. Let us, nonetheless, check the other assumptions as
well.
NORMAL DISTRIBUTION
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rstatix)
MydataRegions %>%
group_by(regionF) %>%
shapiro_test(charges)
## # A tibble: 4 × 4
## regionF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 northeast charges 0.860 0.000110
## 2 northwest charges 0.793 0.00000963
## 3 southeast charges 0.801 0.00000303
## 4 southwest charges 0.720 0.00000742
(H0: Charges in the specific region are normally distributed. H1:
Charges in the specific region are not normally distributed.)
We can see that p-values for all four regions are very small (p-value
< 0.001), which indicates the distribution is not normal (we can
reject the H0 hypotheses), therefore, we must use the non-parametric
test also because of normal distribution assumption not being met.
suppressPackageStartupMessages(library(car))
## Warning: package 'car' was built under R version 4.2.2
leveneTest(MydataRegions$charges, group = MydataRegions$regionF)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.239 0.869
## 146
Tests homogenity (H0: Variences of charges are the same between all
groups. H1: Variance of at least one region is not the same as the
others.)
We cannot reject the H0 at p-value = 0.869, meaning had we used
parametric test we would not have to use heteroskedastic ANOVA.
Since all the assumptions are not met, we have to use the
non-parametric test (Kruskal-Wallis Rank Sum test):
KRUSKAL-WALLIS RANK SUM TEST
kruskal.test(charges ~ regionF,
data = MydataRegions)
##
## Kruskal-Wallis rank sum test
##
## data: charges by regionF
## Kruskal-Wallis chi-squared = 0.89303, df = 3, p-value = 0.8271
H0: Distribution locations of charges are the same for all four
regions. H1: Distribution location of charges for at least one region is
different than the others.
Based on this sample data we cannot reject the H0 since the p-value
is 0.8271, thus, we cannot claim the distribution locations of the
charges grouped by regions are different.
Therefore, we discovered we cannot claim that the basic health
insurance premiums between regions are different.
library(rstatix)
kruskal_effsize( charges ~ regionF,
data = MydataRegions)
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 charges 150 -0.0144 eta2[H] small
By looking at effect size we get small differences between locations,
which is not surprising, since the difference between distribution
locations is not statistically significant.
BMI
In homework 1 we have seen that for smokers the limit BMI = 30 is
very important, whereas for non-smokers the difference was not as
obvious.
MydataSN <- subset(Mydata, smokerF != 'yes')
BMI30SN <- cut(MydataSN$bmi, c(-Inf, 29.9999, Inf), labels=c("Below30", "Above30"))
MydataSNBMI <- MydataSN[c(3,8)]
MydataSNBMI$BMI30SN <- BMI30SN
library(psych)
psych::describe(MydataSNBMI$charges)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 121 9227.74 6050.21 8965.8 8616.51 5694.04 1141.45 36580.28 35438.84 1.69 4.79 550.02
psych::describeBy(x = MydataSNBMI$charges, group = MydataSNBMI$BMI30SN)
##
## Descriptive statistics by group
## group: Below30
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 55 8073.23 5178.59 6500.24 7536.63 4060.76 1694.8 30166.62 28471.82 1.68 4.36 698.28
## ------------------------------------------------------------------------------------------
## group: Above30
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 66 10189.84 6575.06 9998.1 9517.84 5176.84 1141.45 36580.28 35438.84 1.56 4.23 809.33
Here, we can see that arithmetic means as well as medians are
different for the two groups (non-smokers with BMI below 30kg/m² and
non-smokers with BMI 30kg/m² or higher). Interestingly, the difference
is higher with median. Can we generalize the claim that the basic health
insurance premium for non-smokers is different for those with BMI
30kg/m² or higher than for those with BMI lower than 30kg/m²?
Checking the assumptions: Variable charges is numeric. We have two
independent populations. We will check for outliers. We will check
whether the variable charges is normally distributed within both groups.
Lastly, the variance of the variable charges should be the same in both
groups, but if it is not, which often happens, we just apply Welch
correction.
OUTLIERS
MydataSNBMI %>%
group_by(BMI30SN) %>%
identify_outliers(charges)
## # A tibble: 4 × 5
## BMI30SN ID charges is.outlier is.extreme
## <fct> <int> <dbl> <lgl> <lgl>
## 1 Below30 63 30167. TRUE FALSE
## 2 Above30 600 33472. TRUE TRUE
## 3 Above30 527 24060. TRUE FALSE
## 4 Above30 1013 36580. TRUE TRUE
suppressPackageStartupMessages(library(ggpubr))
ggboxplot(MydataSNBMI,
x = "BMI30SN",
y = "charges",
add = "jitter")

We get outliers, of which some are also extreme, therefore, we must
use non-parametric test, but let us still look at the normal
distribution assumption.
library(dplyr)
library(rstatix)
MydataSNBMI %>%
group_by(BMI30SN) %>%
shapiro_test(charges)
## # A tibble: 2 × 4
## BMI30SN variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Below30 charges 0.855 0.00000918
## 2 Above30 charges 0.864 0.00000329
(H0: Charges in the specific group are normally distributed. H1:
Charges in the specific group are not normally distributed.)
We can see that p-values for both groups are very small (p-value <
0.001), which indicates the distribution is not normal (we can reject
the H0 hypotheses), therefore, also based on this assumption we must use
the non-parametric test, presented below.
WILCOXON RANK SUM TEST
wilcox.test(MydataSNBMI$charges ~ MydataSNBMI$BMI30SN,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: MydataSNBMI$charges by MydataSNBMI$BMI30SN
## W = 1416, p-value = 0.0378
## alternative hypothesis: true location shift is not equal to 0
H0: Distribution locations of charges are the same for both groups.
H1: Distribution location of charges are not the same.
Based on this sample data we can reject the H0 at p-value = 0.0378,
meaning distribution locations are not the same. Therefore, the basic
health insurance premiums are different for non-smokers with BMI below
30kg/m² compared to those with BMI of 30kg/m² or more. Moreover, from
the description of the charges of the two groups we have done at the
beginning of this section we can see that the median is higher in case
of non-smokers having the BMI 30 or more. We can thus conclude that the
basic health premiums of non-smokers are higher if their BMI is 30 or
higher.
suppressPackageStartupMessages(library(effectsize))
## Warning: package 'effectsize' was built under R version 4.2.2
effectsize(wilcox.test(MydataSNBMI$charges ~ MydataSNBMI$BMI30SN,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ----------------------------------
## -0.22 | [-0.41, -0.02]
interpret_rank_biserial(0.69)
## [1] "very large"
## (Rules: funder2019)
Here, we can see that the difference between the two groups is not
only statistically significant but also very large when looking at the
effect sizes. The non-smoking policyholders should, therefore, try to
avoid having BMI higher than 30kg/m², if they want lower basic health
insurance premium.
Share of women smokers compared to the share of men smokers
(For this test we use all 1338 observations.)
(FMNS <- table(MydataO$sexF, MydataO$smokerF))
##
## no yes
## female 547 115
## male 517 159
FMNS[,2]/(FMNS[,1]+FMNS[,2])
## female male
## 0.1737160 0.2352071
The share of smokers among women is 17.45%, lower than the share of
smokers among men, which is 23.33%. Can we generalize the claim that
there is more men smokers than women smokers, and can it importantly
affect the higher arithmetic mean of men’ basic health insurance
premiums compared to women? (In homework 1 we discovered that men’
premiums had higher arithmetic mean than women’. We also observed a big
difference in the premiums’ height non-smokers have compared to those
that smokers have. Thus we are interested whether the higher share of
smokers among men can be at least one of the reasons for on average
higher premiums for men.)
ASSUMPTIONS
Observations are independent, we just need to check whether all
expected frequencies are bigger than 5.
MydataOSY <- subset(MydataO, smokerF == 'yes')
addmargins(FMNS)
##
## no yes Sum
## female 547 115 662
## male 517 159 676
## Sum 1064 274 1338
From the table we see the lowest expected frequency will be f(1,2),
so let us look at that one.
274*662/1338
## [1] 135.5665
The lowest expected frequency is above 5, meaning the assumption is
met and we can use the following test.
prop.test(x = c(sum(MydataOSY$sexF=='female', na.rm=TRUE), sum(MydataOSY$sexF=='male', na.rm=TRUE)),
n = c(sum(MydataO$sexF=='female', na.rm=TRUE), sum(MydataO$sexF=='male', na.rm=TRUE)),
correct = TRUE,
alternative = "two.sided")
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(sum(MydataOSY$sexF == "female", na.rm = TRUE), sum(MydataOSY$sexF == "male", na.rm = TRUE)) out of c(sum(MydataO$sexF == "female", na.rm = TRUE), sum(MydataO$sexF == "male", na.rm = TRUE))
## X-squared = 7.3929, df = 1, p-value = 0.006548
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.10605743 -0.01692475
## sample estimates:
## prop 1 prop 2
## 0.1737160 0.2352071
H0: The proportion of smokers among women is equal to the proportion
of smokers among men. H1: The proportion of smokers among women is not
equal to the proportion of smokers among men.
Based on the data the difference in proportion of smokers between men
and women is statistically significantly different - we can reject the
H0 at p-value = 0.006548. We can also see the sample estimates for the
two proportions that indicate a higher proportion of smokers among
men.
library(psych)
phi(FMNS)
## Phi (adj.) | 95% CI
## -------------------------
## 0.07 | [0.02, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_phi(phi(FMNS))
## Phi (adj.) | 95% CI | Interpretation
## ------------------------------------------
## 0.07 | [0.02, 1.00] | very small
##
## - One-sided CIs: upper bound fixed at [1.00].
## - Interpretation rule: funder2019
However, based on the effect size the difference between the two
proportions is very small.
From this we can conclude that the share of smokers among women is
smaller than the one among men but the difference is very small.
Consequently, the difference of these two proportions is very possibly
not the reason (at least not the main one) for higher arithmetic mean of
basic health insurance premiums for men compared to women.