HW1 - AJDA ROBIDA

Importing the data and using the function “head”

mydata <- read.table("~/insurance.csv", header=TRUE, sep=",", dec=".")

head(mydata)

##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

A unit of observation is the primary beneficiary (person) of health insurance

The sample size in this data set is equal to 1338 units of observation

Definition of variables:

age: Age of primary beneficiary (in years)
sex: Insurance contractor gender, can be female or male
bmi: Body mass index, (kg / m ^ 2) using the ratio of height to weight
children: Number of children covered by health insurance / Number of dependents
smoker: Is the beneficiary a smoker or not
region: The beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.
charges: Individual medical costs billed by health insurance (in USD)

The data was taken from the website Kaggle.com: https://www.kaggle.com/datasets/mirichoi0218/insurance?resource=download

Cleaning the data

mydata_new <- mydata[c(-4)] # Removing the columns "children" & "region"

mydata_new$sexF <- factor(mydata_new$sex,
                          labels = c("male", "female"),
                          levels = c("male", "female")) # Adding a factor variable for "sex"

mydata_new$smokerF <- factor(mydata_new$smoker,
                             labels = c("yes", "no"),
                             levels = c("yes", "no")) # Adding a factor variable for "smoker"

mydata_new$regionF <- factor(mydata_new$region,
                             labels = c("southwest", "southeast", "northwest", "northeast"),
                             levels = c("southwest", "southeast", "northwest", "northeast")) # Adding a factor variable for "region"

mydata_new$ID <- seq.int(nrow(mydata_new)) # Adding a column with ID of each observation

mydata_new <- mydata_new[c(10, 1, 2, 7, 3, 4, 8, 5, 9, 6)] # Rearranging the column order

Descriptive statistics

summary(mydata_new[c(-1, -3, -6, -8)]) # Showing summary of variables without ID, sex, smoker and region columns

##       age            sexF          bmi        smokerF         regionF       charges     
##  Min.   :18.00   male  :676   Min.   :15.96   yes: 274   southwest:325   Min.   : 1122  
##  1st Qu.:27.00   female:662   1st Qu.:26.30   no :1064   southeast:364   1st Qu.: 4740  
##  Median :39.00                Median :30.40              northwest:325   Median : 9382  
##  Mean   :39.21                Mean   :30.66              northeast:324   Mean   :13270  
##  3rd Qu.:51.00                3rd Qu.:34.69                              3rd Qu.:16640  
##  Max.   :64.00                Max.   :53.13                              Max.   :63770

get_summary_stats(mydata_new[-1]) # Summary statistics for numerical variables excluding "ID"

## # A tibble: 3 × 13
##   variable     n    min     max median     q1      q3      iqr     mad    mean       sd      se      ci
##   <fct>    <dbl>  <dbl>   <dbl>  <dbl>  <dbl>   <dbl>    <dbl>   <dbl>   <dbl>    <dbl>   <dbl>   <dbl>
## 1 age       1338   18      64     39     27      51      24      17.8     39.2    14.0    0.384   0.754
## 2 bmi       1338   16.0    53.1   30.4   26.3    34.7     8.40    6.20    30.7     6.10   0.167   0.327
## 3 charges   1338 1122.  63770.  9382.  4740.  16640.  11900.   7441.   13270.  12110.   331.    649.

Parameter estimate interpretation for each numerical variable:

age: The average age in the sample of 1338 observations was 39.207 years (Mean), the Q1 for age was 27 which means that 75% of people were older than 27 years and 25% younger
bmi: The median value of BMI was 30.40 this means that 50% of people had a higher BMI and 50% had a lower, the min was 15.96 this means that the lowest value of BMI recorded was 15.96
charges: The mean for charges was 13270.422 this means that on average people had that much medical charges to pay, the Q3 was 16639.913 which means that 25% of people had higher charges and 75% had lower ones

From the previous table we can see the summary of some descriptive statistics of the numerical variables in my dataset. Before that we have the summary of all the variables (without ID, and the two categorical ones) in the dataset.

RQ1 - Do males have higher medical charges than females?

Hypothesis: - H0: Males have equal or lower medical charges than females. - H1: Males have higher medical charges than females.

describeBy(mydata_new$charges, g = mydata_new$sexF)

## 
##  Descriptive statistics by group 
## group: male
##    vars   n     mean       sd  median trimmed     mad     min      max range skew kurtosis     se
## X1    1 676 13956.75 12971.03 9369.62 11825.4 8121.53 1121.87 62592.87 61471 1.33     0.79 498.89
## ------------------------------------------------------------------------------------------ 
## group: female
##    vars   n     mean      sd  median  trimmed     mad     min      max    range skew kurtosis     se
## X1    1 662 12569.58 11128.7 9412.96 10455.16 7129.08 1607.51 63770.43 62162.92 1.72     2.71 432.53

ggplot(mydata_new, aes(x = charges)) +
  geom_histogram(binwidth = 10000, colour="deeppink", fill="lightpink") +
  facet_wrap(~sexF, ncol = 1) + 
  ylab("Frequency")

shapiro.test(mydata_new$charges[mydata_new$sexF == "female"])

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata_new$charges[mydata_new$sexF == "female"]
## W = 0.80539, p-value < 2.2e-16

shapiro.test(mydata_new$charges[mydata_new$sexF == "male"])

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata_new$charges[mydata_new$sexF == "male"]
## W = 0.82281, p-value < 2.2e-16

Hypothesis: - H0: Distribution of charges is normal. - H1: Distribution of charges is not normal.

We can reject the null hypothesis at p<0.001 for both gender groups.

var.test(mydata_new$charges[mydata_new$sexF == "male"], mydata_new$charges[mydata_new$sexF == "female"])

## 
##  F test to compare two variances
## 
## data:  mydata_new$charges[mydata_new$sexF == "male"] and mydata_new$charges[mydata_new$sexF == "female"]
## F = 1.3585, num df = 675, denom df = 661, p-value = 7.896e-05
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  1.167092 1.581105
## sample estimates:
## ratio of variances 
##             1.3585

Hypothesis: - H0: The variances of medical charges for males and females are equal. - H1: The variances of medical charges for males and females are not equal.

We can reject the null hypothesis at p<0.001, meaning that we cannot say that variances of medical charges for males and females are equal.

Parametric test

t.test(mydata_new$charges ~ mydata_new$sexF,
       alternative = "greater",
       var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  mydata_new$charges by mydata_new$sexF
## t = 2.1009, df = 1313.4, p-value = 0.01792
## alternative hypothesis: true difference in means between group male and group female is greater than 0
## 95 percent confidence interval:
##  300.3432      Inf
## sample estimates:
##   mean in group male mean in group female 
##             13956.75             12569.58

H0: There is no difference between mean charges for the female group and male group. H1: The mean charges for the male group are greater than for the female group.

We can reject H0 at p<0.02.

Non-parametric test

wilcox.test(mydata_new$charges ~ mydata_new$sexF,
            correct = FALSE,
            exact = FALSE,
            alternative = "greater")

## 
##  Wilcoxon rank sum test
## 
## data:  mydata_new$charges by mydata_new$sexF
## W = 226208, p-value = 0.3643
## alternative hypothesis: true location shift is greater than 0

H0: The distribution of charges for females is equal to the distribution of charges for males. H1:The distribution of charges for females is greater that the distribution of charges for males.

We cannot reject HO (p>0.05).

effectsize(wilcox.test(mydata_new$charges ~ mydata_new$sexF,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "greater"))

## r (rank biserial) |        95% CI
## ---------------------------------
## 0.01              | [-0.04, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_rank_biserial(0.01)

## [1] "tiny"
## (Rules: funder2019)

The distribution of medical charges for both genders was found to be non-normal (based on the Shapiro-Wilk test), making a non-parametric test more appropriate. Using the Wilcoxon Rank-Sum test, we found no statistically significant difference in the amount of medical charges between females and males (p = 0.3643). Furthermore, the effect size was tiny (r = 0.01), indicating that gender has no meaningful impact on medical charges in this sample.

RQ2 - Is the BMI of a person correlated to their amount of medical charges?

Hypothesis: - H0: The BMI of a person and their amount of medical charges are not correlated. - H1: The BMI of a person and their amount of medical charges are correlated.

ggpairs(mydata_new[ ,c(5,10)])

shapiro.test(mydata_new$charges)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata_new$charges
## W = 0.81469, p-value < 2.2e-16

shapiro.test(mydata_new$bmi)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata_new$bmi
## W = 0.99389, p-value = 2.605e-05

Hypothesis: - H0: Distribution is normal. - H1: Distribution is not normal.

We can reject the null hypothesis at p<0.001 for both variables, meaning that they are not normally distributed.

rcorr(as.matrix(mydata_new[ ,c(5,10)]), 
      type = "spearman")

##          bmi charges
## bmi     1.00    0.12
## charges 0.12    1.00
## 
## n= 1338 
## 
## 
## P
##         bmi charges
## bmi          0     
## charges  0

We investigated whether there is a correlation between a person’s BMI and their amount of medical charges. The distributions of both BMI and medical charges were found to be non-normal (Shapiro-Wilk test, p < 0.001), making Spearman’s rank correlation the appropriate choice for analysis.

Using Spearman correlation, we found a very weak positive correlation (ρ = 0.12) between BMI and medical charges. The strength of the correlation is negligible, suggesting that BMI has minimal practical impact on medical charges in this dataset.

If the variables were normally distributed, Pearson’s correlation would have been used instead.

RQ3 - Is there asociation between the persons region and their smoking habbits?

Hypothesis: - H0: There is no association between being a smoker and the region where a person lives. (Smoking status is independent of region) - H1: There is an association between being a smoker and the region where a person lives. (Smoking status is not independent of region)

results <- chisq.test(mydata_new$regionF, mydata_new$smokerF,
           correct = TRUE)

results

## 
##  Pearson's Chi-squared test
## 
## data:  mydata_new$regionF and mydata_new$smokerF
## X-squared = 7.3435, df = 3, p-value = 0.06172

The p-value is higher that 0.05 and therefore, we fail to reject the null hypothesis, meaning there is no statistically significant association between smoking status and region.

addmargins(results$observed)

##                   mydata_new$smokerF
## mydata_new$regionF  yes   no  Sum
##          southwest   58  267  325
##          southeast   91  273  364
##          northwest   58  267  325
##          northeast   67  257  324
##          Sum        274 1064 1338

round(results$expected, 2)

##                   mydata_new$smokerF
## mydata_new$regionF   yes     no
##          southwest 66.55 258.45
##          southeast 74.54 289.46
##          northwest 66.55 258.45
##          northeast 66.35 257.65

If there is no association, we would expect 66.55 people living in southwest region to be smokers. But in reality 58 people are smokers.

round(results$res, 2)

##                   mydata_new$smokerF
## mydata_new$regionF   yes    no
##          southwest -1.05  0.53
##          southeast  1.91 -0.97
##          northwest -1.05  0.53
##          northeast  0.08 -0.04

None of the residuals exceed ±1.96, meaning there is no significant deviation from the expected frequencies. The largest residual (1.91 for Southeast smokers) is close to significance but not quite there.

effectsize::cramers_v(mydata_new$region, mydata_new$smokerF)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.06              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.06)

## [1] "very small"
## (Rules: funder2019)

The effect size, measured by Cramér’s V (V = 0.06), confirms that any observed association is extremely weak and not practically meaningful. This suggests that smoking status is distributed similarly across regions in this dataset.