Homework 2

Ian Gošnak

Importing the data and using the function “head”

mydata <- read.table("./insurance.csv", header=TRUE, sep=",", dec=".")

head(mydata)

##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

A unit of observation in this dataset is the primary beneficiary (person) of health insurance

The sample size in this data set is equal to 1338 units of observation

Definition of variables:

age: Age of primary beneficiary (in years)
sex: Insurance contractor gender, can be female or male
bmi: Body mass index, (kg / m ^ 2) using the ratio of height to weight
children: Number of children covered by health insurance / Number of dependents
smoker: Is the beneficiary a smoker or not
region: The beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.
charges: Individual medical costs billed by health insurance (in USD)

The data was taken from the website Kaggle.com, more specifically from the link https://www.kaggle.com/datasets/mirichoi0218/insurance?resource=download

The main goal of this data analysis is to conduct some hypothesis tests within the data set, for example to test if there is a difference between the charges of males and females.

Cleaning of data

mydata$sexF <- factor(mydata$sex,
                          labels = c("male", "female"),
                          levels = c("male", "female")) # Adding a factor variable for "sex"

mydata$smokerF <- factor(mydata$smoker,
                             labels = c("smoker", "non-smoker"),
                             levels = c("yes", "no")) # Adding a factor variable for "smoker"

mydata <- mydata[c( 1, 2, 8, 3, 4, 5, 9, 6, 7)] # Rearranging the column order

set.seed(5)

mydata1 <- mydata %>% 
  sample_n(100)

Independent samples t-test for charges by gender

Male <- ggplot(mydata1[mydata1$sexF=="male",  ], aes(x = charges)) +
  theme_linedraw() + 
  geom_histogram(fill = "lightblue", colour = "black") +
  ylab("Frequency") +
  ggtitle("Charges of Males")

Female <- ggplot(mydata1[mydata1$sexF=="female",  ], aes(x = charges)) +
  theme_linedraw() + 
  geom_histogram(fill = "indianred1", colour = "black") +
  ylab("Frequency") +
  ggtitle("Charges of Females")

ggarrange(Male, Female,
          ncol= 2, nrow = 1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can already see from the graphs of the distribution of medical charges by gender that neither is normally distributed, however we need to formally test it as well.

mydata1 %>% 
  group_by(sexF) %>% 
  shapiro_test(charges)

## # A tibble: 2 × 4
##   sexF   variable statistic           p
##   <fct>  <chr>        <dbl>       <dbl>
## 1 male   charges      0.754 0.000000211
## 2 female charges      0.809 0.000000678

Testing the normality with the formal test (Shapiro test), we can see that neither distribution of charges by gender is normally distributed as the p value is lower than 0.05 and therefore non significant. We reject the null hypothesis which states that the distribution is normally distributed.

wilcox.test(mydata1$charges ~ mydata1$sexF,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mydata1$charges by mydata1$sexF
## W = 947, p-value = 0.04133
## alternative hypothesis: true location shift is not equal to 0

Because normality is not met we use the-non parametric test, which is the Wilcox Rank Sum Test.

H0: The distribution locations of male and female charges are the same Ha: The distribution locations of male and female charges are different

The test shows that we can reject the null hypothesis at 5% level of significance, however we also need to check the effect size.

effectsize(wilcox.test(mydata1$charges ~ mydata1$sexF,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided"))

## r (rank biserial) |         95% CI
## ----------------------------------
## -0.24             | [-0.44, -0.01]

interpret_rank_biserial(0.24)

## [1] "medium"
## (Rules: funder2019)

Based on the sample data we found that the men and women differ in the amount of medical charges received (p < 0.05) - the difference in distribution is “medium” (r = 0.24).

One-Way ANOVA

mydata1$regionF <- factor(mydata1$region,
                             labels = c("northeast", "southeast", "southwest", "northwest"),
                             levels = c("northeast", "southeast", "southwest", "northwest"))

leveneTest(mydata1$bmi, group = mydata1$regionF)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  3  0.5058 0.6792
##       96

One of the assumptions for one-way ANOVA is homoskedasticity which means that variances of the analyzed variable is the same in all of the groups. We formally test it with the Levene Test. Because the p value is not significant (p > 0.05) we cannot reject the H0 which says that there is homoskedasticity. The assumption is met.

ggplot(mydata1, aes(x = bmi)) +
  theme_linedraw() +
  geom_histogram(binwidth = 2, colour="black", fill = "lightgreen") +
  facet_wrap(~regionF, ncol = 1) + 
  ylab("Frequency") +
  xlab("Body Mass Index")

mydata1 %>% 
  group_by(regionF) %>% 
  shapiro_test(bmi)

## # A tibble: 4 × 4
##   regionF   variable statistic      p
##   <fct>     <chr>        <dbl>  <dbl>
## 1 northeast bmi          0.963 0.519 
## 2 southeast bmi          0.935 0.0909
## 3 southwest bmi          0.975 0.658 
## 4 northwest bmi          0.927 0.172

Another one of the assumptions is normality of the distribution in all groups. The formal test is the Shapiro Test. All of the groups have a statistically insignificant p value (p > 0.05) therefore we can not reject the H0: distribution is normally distributed. Consequently the assumption is met.

ANOVA_results <- aov(bmi ~ regionF, 
                     data = mydata1)

summary(ANOVA_results)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## regionF      3  541.5  180.48   5.913 0.000957 ***
## Residuals   96 2930.5   30.53                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because both of the assumptions are met we can use the parametric test without the Welch correction. H0: All distribution locations of bmi among regions are the same Ha: All distribution locations of bmi among regions are different

We can see that the test result was statistically significant at 0.001 p-value. This means that we can reject the null hypothesis. Additionally we have to look at the effect size and the pairwise t test to learn more about the result.

effectsize::eta_squared(ANOVA_results)

## For one-way between subjects designs, partial eta squared is equivalent to eta squared. Returning eta
##   squared.

## # Effect Size for ANOVA
## 
## Parameter | Eta2 |       95% CI
## -------------------------------
## regionF   | 0.16 | [0.05, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_eta_squared(0.16, rules = "cohen1992")

## [1] "medium"
## (Rules: cohen1992)

pairwise.t.test(x = mydata1$bmi, g = mydata1$regionF, 
                p.adj = "bonf")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  mydata1$bmi and mydata1$regionF 
## 
##           northeast southeast southwest
## southeast 0.0104    -         -        
## southwest 1.0000    0.0201    -        
## northwest 1.0000    0.0021    1.0000   
## 
## P value adjustment method: bonferroni

We found a statistically significant relationship between regions and bmi (F = 5.91, p < 0.001), the effect size is medium (𝜂2 = 0.16). Post-hoc tests revealed differences only for the se-ne, sw-se pairs at 0.05 level of significance and for the nw-se pair at 0.01 level of significance. The other pairs did not reveal statistically significant differences.

Chi-Square Test

(result <- chisq.test(mydata1$sexF, mydata1$smokerF,
           correct = TRUE)
)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata1$sexF and mydata1$smokerF
## X-squared = 0.72715, df = 1, p-value = 0.3938

addmargins(result$observed)

##             mydata1$smokerF
## mydata1$sexF smoker non-smoker Sum
##       male        7         39  46
##       female     13         41  54
##       Sum        20         80 100

Conducting a Chi-Square test to see if there is a relationship between sex and smokers.

H0: There is no association between the category Sex and category Smoker, Ha: There is association between the category Sex and category Smoker

The test shows that we can not reject the null hypothesis at 0.05 level of significance because the p value is higher than 0.05. This means that the test tells us that there is no association between the two categories. For this to be valid we need to check that the assumptions are met.

round(result$expected, 2)

##             mydata1$smokerF
## mydata1$sexF smoker non-smoker
##       male      9.2       36.8
##       female   10.8       43.2

We can see that all of the expected frequencies are larger than 5, therefore we can confirm our first assumption. The other assumption is that the observations are independent which is also true.

round(result$res, 2)

##             mydata1$smokerF
## mydata1$sexF smoker non-smoker
##       male    -0.73       0.36
##       female   0.67      -0.33

We can also see that all of the z-statistics of standardized residuals are below +- 1.96 (a = 0.05) showing that the deviations are not statistically significant.

addmargins(round(prop.table(result$observed), 3)) # Share of total

##             mydata1$smokerF
## mydata1$sexF smoker non-smoker  Sum
##       male     0.07       0.39 0.46
##       female   0.13       0.41 0.54
##       Sum      0.20       0.80 1.00

addmargins(round(prop.table(result$observed, 1), 3), 2) # Share of total by row

##             mydata1$smokerF
## mydata1$sexF smoker non-smoker   Sum
##       male    0.152      0.848 1.000
##       female  0.241      0.759 1.000

addmargins(round(prop.table(result$observed, 2), 3), 1) # share of total by column

##             mydata1$smokerF
## mydata1$sexF smoker non-smoker
##       male    0.350      0.488
##       female  0.650      0.512
##       Sum     1.000      1.000

effectsize::cramers_v(mydata1$sexF, mydata1$smokerF)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.05              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.05)

## [1] "very small"
## (Rules: funder2019)

The effect size is very small (r = 0.05)