Tina Pivk, 17. 1. 2023

Insurance Premium Data

mydata <- read.table("C:/Users/TinaP/Desktop/EF/IMB/multivariate analysis/MVA_2022_2023/DN-R/insurance-data/insurance.csv", header=TRUE, sep=",", dec=".")

head(mydata)

##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

Description:

unit: one USA policyholder of basic health insurance (year not given)
sample size: 1338

Variables:

age: The Age of the Policyholder (years)
sex: The Gender of the Policyholder (male/female)
bmi: The Body Mass Index of the Policyholder (kg/m²)
children: Number of Children of the Policyholder
smoker: The Policyholder’s Statues of Being a Smoker (yes/no)
region: The Region Where the Policyholder Belongs to (southwest/southeast/northwest/northeast)
charges: The Premium Charged to the Policyholder (USD)

Data source:

Jain, S. (2021). Insurance Premium Data. Kaggle. Retrieved from: https://www.kaggle.com/datasets/simranjain17/insurance (original source is Census Bureau)

Main Goal of the Analysis

The main goal of the analysis carried out in homework 2 is to look at some of the things we were looking at in homework 1 and better analyse it. Here, we will concentrate on analyzing the existence of differences the visualization of the data in homework 1 indicated as possible but quite uncertain due to the differences not being as big as when looking at some other ones. The research questions would therefore be:

are the differences in basic health insurance premiums between the four regions statistically significant
are there significant differences between premiums charged to non-smokers below BMI of 30 and premiums charged to non-smokers with BMI of 30kg/m² and more
is the difference in proportions of smokers among men and smokers among women statistically significant and could therefore be a reason for the higher arithmetic mean of premiums for men compared to women

Data Manipulation

colSums(is.na(mydata))

##      age      sex      bmi children   smoker   region  charges 
##        0        0        0        0        0        0        0

mydata$ID <- seq.int(nrow(mydata))
mydata$sexF <- as.factor(mydata$sex)
mydata$regionF <- as.factor(mydata$region)
mydata$smokerF <- as.factor(mydata$smoker)

Mydata <- mydata[,c(10,1,8,3,4,11,9,7)]

sapply(Mydata[,c(2,4,5,8)], FUN=max)

##      age      bmi children  charges 
##    64.00    53.13     5.00 63770.43

sapply(Mydata[,c(2,4,5,8)], FUN=min)

##      age      bmi children  charges 
##   18.000   15.960    0.000 1121.874

All minimums and maximums are not such that it would at first glance indicate them being a mistake (as in having negative age).

Since there are 1338 observations we will randomly select 150 of them and do the hypothesis testing only on the selected 150.

MydataO <- Mydata

set.seed(10)
Mydata <- Mydata[sample(nrow(mydata), 150), ]

Hypotheses testing

Regions

MydataRegions <- Mydata[c(3,1,8)]

library(psych)

## Warning: package 'psych' was built under R version 4.2.2

psych::describe(MydataRegions$charges)

##    vars   n     mean       sd   median  trimmed     mad     min     max    range skew kurtosis     se
## X1    1 150 13795.53 11987.12 10652.71 11644.81 7123.54 1141.45 60021.4 58879.95 1.58     1.92 978.74

psych::describeBy(x = MydataRegions$charges, group = MydataRegions$regionF)

## 
##  Descriptive statistics by group 
## group: northeast
##    vars  n     mean       sd  median  trimmed     mad    min     max   range skew kurtosis      se
## X1    1 42 12916.33 10022.65 11722.7 11341.31 7113.46 1694.8 40904.2 39209.4 1.27     1.07 1546.53
## ------------------------------------------------------------------------------------------ 
## group: northwest
##    vars  n     mean       sd  median  trimmed     mad     min     max    range skew kurtosis     se
## X1    1 37 14953.75 13094.34 9910.36 13058.04 6443.13 2352.97 60021.4 57668.43 1.56     1.96 2152.7
## ------------------------------------------------------------------------------------------ 
## group: southeast
##    vars  n     mean       sd   median  trimmed     mad     min      max    range skew kurtosis      se
## X1    1 44 14094.74 11968.32 10737.55 12322.08 6833.64 1141.45 43896.38 42754.93 1.36     0.74 1804.29
## ------------------------------------------------------------------------------------------ 
## group: southwest
##    vars  n     mean       sd  median  trimmed     mad    min      max    range skew kurtosis      se
## X1    1 27 13088.39 13671.79 9625.92 10998.15 7321.26 1682.6 48824.45 47141.85 1.69     1.73 2631.14

We can see that arithmetic means as well as medians are different between regions. Can we generalize that the basic health insurance premium is different between regions?

Checking the assumptions: First, variable charges is numeric. Secondly, we need to check if there are any outliers. Thirdly, we will check whether the variable charges is normally distributed within each group, defined by region. Lastly, we will check if the variance of the variable charges is the same between all groups.

OUTLIERS

library(rstatix)

## Warning: package 'rstatix' was built under R version 4.2.2

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

MydataRegions %>%
  group_by(regionF) %>%
  identify_outliers(charges)

## # A tibble: 16 × 5
##    regionF      ID charges is.outlier is.extreme
##    <fct>     <int>   <dbl> <lgl>      <lgl>     
##  1 northeast  1125  40904. TRUE       FALSE     
##  2 northeast   912  33733. TRUE       FALSE     
##  3 northeast    39  39774. TRUE       FALSE     
##  4 northeast   477  35148. TRUE       FALSE     
##  5 northwest  1231  60021. TRUE       TRUE      
##  6 northwest   299  38746. TRUE       FALSE     
##  7 northwest   668  40003. TRUE       FALSE     
##  8 southeast   666  42560. TRUE       TRUE      
##  9 southeast  1324  43896. TRUE       TRUE      
## 10 southeast   857  40974. TRUE       FALSE     
## 11 southeast  1013  36580. TRUE       FALSE     
## 12 southeast  1023  42211. TRUE       TRUE      
## 13 southeast  1119  38283. TRUE       FALSE     
## 14 southwest   861  46114. TRUE       TRUE      
## 15 southwest   176  48824. TRUE       TRUE      
## 16 southwest    40  48173. TRUE       TRUE

suppressPackageStartupMessages(library(ggpubr))

## Warning: package 'ggpubr' was built under R version 4.2.2

## Warning: package 'ggplot2' was built under R version 4.2.2

ggboxplot(MydataRegions, 
          x = "regionF", 
          y = "charges", 
          add = "jitter")

There are outliers and some of them are extreme, meaning we must use non-parametric test. Let us, nonetheless, check the other assumptions as well.

NORMAL DISTRIBUTION

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(rstatix)
MydataRegions %>%
  group_by(regionF) %>%
  shapiro_test(charges)

## # A tibble: 4 × 4
##   regionF   variable statistic          p
##   <fct>     <chr>        <dbl>      <dbl>
## 1 northeast charges      0.860 0.000110  
## 2 northwest charges      0.793 0.00000963
## 3 southeast charges      0.801 0.00000303
## 4 southwest charges      0.720 0.00000742

(H0: Charges in the specific region are normally distributed. H1: Charges in the specific region are not normally distributed.)

We can see that p-values for all four regions are very small (p-value < 0.001), which indicates the distribution is not normal (we can reject the H0 hypotheses), therefore, we must use the non-parametric test also because of normal distribution assumption not being met.

suppressPackageStartupMessages(library(car))

## Warning: package 'car' was built under R version 4.2.2

leveneTest(MydataRegions$charges, group = MydataRegions$regionF)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   3   0.239  0.869
##       146

Tests homogenity (H0: Variences of charges are the same between all groups. H1: Variance of at least one region is not the same as the others.)

We cannot reject the H0 at p-value = 0.869, meaning had we used parametric test we would not have to use heteroskedastic ANOVA.

Since all the assumptions are not met, we have to use the non-parametric test (Kruskal-Wallis Rank Sum test):

KRUSKAL-WALLIS RANK SUM TEST

kruskal.test(charges ~ regionF, 
             data = MydataRegions)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  charges by regionF
## Kruskal-Wallis chi-squared = 0.89303, df = 3, p-value = 0.8271

H0: Distribution locations of charges are the same for all four regions. H1: Distribution location of charges for at least one region is different than the others.

Based on this sample data we cannot reject the H0 since the p-value is 0.8271, thus, we cannot claim the distribution locations of the charges grouped by regions are different.

Therefore, we discovered we cannot claim that the basic health insurance premiums between regions are different.

library(rstatix)
kruskal_effsize( charges ~ regionF, 
                data = MydataRegions)

## # A tibble: 1 × 5
##   .y.         n effsize method  magnitude
## * <chr>   <int>   <dbl> <chr>   <ord>    
## 1 charges   150 -0.0144 eta2[H] small

By looking at effect size we get small differences between locations, which is not surprising, since the difference between distribution locations is not statistically significant.

BMI

In homework 1 we have seen that for smokers the limit BMI = 30 is very important, whereas for non-smokers the difference was not as obvious.

MydataSN <- subset(Mydata, smokerF != 'yes')
BMI30SN <- cut(MydataSN$bmi, c(-Inf, 29.9999, Inf), labels=c("Below30", "Above30"))
MydataSNBMI <- MydataSN[c(3,8)]
MydataSNBMI$BMI30SN <- BMI30SN

library(psych)
psych::describe(MydataSNBMI$charges)

##    vars   n    mean      sd median trimmed     mad     min      max    range skew kurtosis     se
## X1    1 121 9227.74 6050.21 8965.8 8616.51 5694.04 1141.45 36580.28 35438.84 1.69     4.79 550.02

psych::describeBy(x = MydataSNBMI$charges, group = MydataSNBMI$BMI30SN)

## 
##  Descriptive statistics by group 
## group: Below30
##    vars  n    mean      sd  median trimmed     mad    min      max    range skew kurtosis     se
## X1    1 55 8073.23 5178.59 6500.24 7536.63 4060.76 1694.8 30166.62 28471.82 1.68     4.36 698.28
## ------------------------------------------------------------------------------------------ 
## group: Above30
##    vars  n     mean      sd median trimmed     mad     min      max    range skew kurtosis     se
## X1    1 66 10189.84 6575.06 9998.1 9517.84 5176.84 1141.45 36580.28 35438.84 1.56     4.23 809.33

Here, we can see that arithmetic means as well as medians are different for the two groups (non-smokers with BMI below 30kg/m² and non-smokers with BMI 30kg/m² or higher). Interestingly, the difference is higher with median. Can we generalize the claim that the basic health insurance premium for non-smokers is different for those with BMI 30kg/m² or higher than for those with BMI lower than 30kg/m²?

Checking the assumptions: Variable charges is numeric. We have two independent populations. We will check for outliers. We will check whether the variable charges is normally distributed within both groups. Lastly, the variance of the variable charges should be the same in both groups, but if it is not, which often happens, we just apply Welch correction.

OUTLIERS

MydataSNBMI %>%
  group_by(BMI30SN) %>%
  identify_outliers(charges)

## # A tibble: 4 × 5
##   BMI30SN    ID charges is.outlier is.extreme
##   <fct>   <int>   <dbl> <lgl>      <lgl>     
## 1 Below30    63  30167. TRUE       FALSE     
## 2 Above30   600  33472. TRUE       TRUE      
## 3 Above30   527  24060. TRUE       FALSE     
## 4 Above30  1013  36580. TRUE       TRUE

suppressPackageStartupMessages(library(ggpubr))

ggboxplot(MydataSNBMI, 
          x = "BMI30SN", 
          y = "charges", 
          add = "jitter")

We get outliers, of which some are also extreme, therefore, we must use non-parametric test, but let us still look at the normal distribution assumption.

library(dplyr)
library(rstatix)

MydataSNBMI %>%
group_by(BMI30SN) %>%
shapiro_test(charges)

## # A tibble: 2 × 4
##   BMI30SN variable statistic          p
##   <fct>   <chr>        <dbl>      <dbl>
## 1 Below30 charges      0.855 0.00000918
## 2 Above30 charges      0.864 0.00000329

(H0: Charges in the specific group are normally distributed. H1: Charges in the specific group are not normally distributed.)

We can see that p-values for both groups are very small (p-value < 0.001), which indicates the distribution is not normal (we can reject the H0 hypotheses), therefore, also based on this assumption we must use the non-parametric test, presented below.

WILCOXON RANK SUM TEST

wilcox.test(MydataSNBMI$charges ~ MydataSNBMI$BMI30SN,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE, 
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  MydataSNBMI$charges by MydataSNBMI$BMI30SN
## W = 1416, p-value = 0.0378
## alternative hypothesis: true location shift is not equal to 0

H0: Distribution locations of charges are the same for both groups. H1: Distribution location of charges are not the same.

Based on this sample data we can reject the H0 at p-value = 0.0378, meaning distribution locations are not the same. Therefore, the basic health insurance premiums are different for non-smokers with BMI below 30kg/m² compared to those with BMI of 30kg/m² or more. Moreover, from the description of the charges of the two groups we have done at the beginning of this section we can see that the median is higher in case of non-smokers having the BMI 30 or more. We can thus conclude that the basic health premiums of non-smokers are higher if their BMI is 30 or higher.

suppressPackageStartupMessages(library(effectsize))

## Warning: package 'effectsize' was built under R version 4.2.2

effectsize(wilcox.test(MydataSNBMI$charges ~ MydataSNBMI$BMI30SN,
                      paired = FALSE,
                      correct = FALSE,
                      exact = FALSE, 
                      alternative = "two.sided"))

## r (rank biserial) |         95% CI
## ----------------------------------
## -0.22             | [-0.41, -0.02]

interpret_rank_biserial(0.69)

## [1] "very large"
## (Rules: funder2019)

Here, we can see that the difference between the two groups is not only statistically significant but also very large when looking at the effect sizes. The non-smoking policyholders should, therefore, try to avoid having BMI higher than 30kg/m², if they want lower basic health insurance premium.

Share of women smokers compared to the share of men smokers

(For this test we use all 1338 observations.)

(FMNS <- table(MydataO$sexF, MydataO$smokerF))

##         
##           no yes
##   female 547 115
##   male   517 159

FMNS[,2]/(FMNS[,1]+FMNS[,2])

##    female      male 
## 0.1737160 0.2352071

The share of smokers among women is 17.45%, lower than the share of smokers among men, which is 23.33%. Can we generalize the claim that there is more men smokers than women smokers, and can it importantly affect the higher arithmetic mean of men’ basic health insurance premiums compared to women? (In homework 1 we discovered that men’ premiums had higher arithmetic mean than women’. We also observed a big difference in the premiums’ height non-smokers have compared to those that smokers have. Thus we are interested whether the higher share of smokers among men can be at least one of the reasons for on average higher premiums for men.)

ASSUMPTIONS

Observations are independent, we just need to check whether all expected frequencies are bigger than 5.

MydataOSY <- subset(MydataO, smokerF == 'yes')

addmargins(FMNS)

##         
##            no  yes  Sum
##   female  547  115  662
##   male    517  159  676
##   Sum    1064  274 1338

From the table we see the lowest expected frequency will be f(1,2), so let us look at that one.

274*662/1338

## [1] 135.5665

The lowest expected frequency is above 5, meaning the assumption is met and we can use the following test.

prop.test(x = c(sum(MydataOSY$sexF=='female', na.rm=TRUE), sum(MydataOSY$sexF=='male', na.rm=TRUE)), 
          n = c(sum(MydataO$sexF=='female', na.rm=TRUE), sum(MydataO$sexF=='male', na.rm=TRUE)),
          correct = TRUE,
          alternative = "two.sided")

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(sum(MydataOSY$sexF == "female", na.rm = TRUE), sum(MydataOSY$sexF == "male", na.rm = TRUE)) out of c(sum(MydataO$sexF == "female", na.rm = TRUE), sum(MydataO$sexF == "male", na.rm = TRUE))
## X-squared = 7.3929, df = 1, p-value = 0.006548
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.10605743 -0.01692475
## sample estimates:
##    prop 1    prop 2 
## 0.1737160 0.2352071

H0: The proportion of smokers among women is equal to the proportion of smokers among men. H1: The proportion of smokers among women is not equal to the proportion of smokers among men.

Based on the data the difference in proportion of smokers between men and women is statistically significantly different - we can reject the H0 at p-value = 0.006548. We can also see the sample estimates for the two proportions that indicate a higher proportion of smokers among men.

library(psych)
phi(FMNS)

## Phi (adj.) |       95% CI
## -------------------------
## 0.07       | [0.02, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_phi(phi(FMNS))

## Phi (adj.) |       95% CI | Interpretation
## ------------------------------------------
## 0.07       | [0.02, 1.00] |     very small
## 
## - One-sided CIs: upper bound fixed at [1.00].
## - Interpretation rule: funder2019

However, based on the effect size the difference between the two proportions is very small.

From this we can conclude that the share of smokers among women is smaller than the one among men but the difference is very small. Consequently, the difference of these two proportions is very possibly not the reason (at least not the main one) for higher arithmetic mean of basic health insurance premiums for men compared to women.

Conclusion

To conclude, we have discovered that we cannot claim that the four regions in the US have different basic health insurance premiums charged. Furthermore, the limit BMI = 30kg/m² is important for non-smokers as well, meaning they should avoid crossing it if they want to be charged lower premiums. Lastly, the difference between the share of smokers among men is higher than the share of smokers among women, but the difference is very small, therefore, most probably not the main reason for men’ premiums being on average higher than women.

HW2