Ricson Lee

Ricson Lee - HW2

Instruction 1:

I found this dataset on Kaggle titled “insurance” to study the risk present in Health Insurance as well as the interplay of various attributes of the insured and see how they affect the insurance premium.

Instruction 2:

mydata <- read.table("./insurance.csv",
                     header = TRUE,
                     sep = ",",
                     dec = ".")

I imported the data into R Studio using the read.table() function. The data was sourced from the US, which is why the decimal point is used as the decimal separator instead of the decimal comma.

Instruction 3:

head(mydata)

##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

Instruction 4:

This dataset has 1338 observations with 7 variables.

Unit of observation: EACH insured beneficiary
Sample Size: 1338 insured beneficiaries
age: Age of the beneficiaries
sex: Sex of the beneficiaries
bmi: Body mass index of the beneficiaries
children: Number of children covered by health insurance / Number of dependents
smoker: Whether the beneficiaries smoke
region: The beneficiaries’ residential area in the US, northeast, southeast, southwest, northwest.
charges: Individual medical costs billed by health insurance in US Dollars.

Instruction 5:

The data is sourced from the third edition of “Machine Learning with R”, by Brett Lanz.

Instruction 6:

#I created a copy of mydata to work on for the sake of convenience
mydata2 <- mydata

#Made adjustments to the categorical variables
mydata2$sex <- factor(mydata2$sex,
                              levels = c("male", "female"),
                              labels = c("male", "female"))

mydata2$smoker <- factor(mydata2$smoker,
                              levels = c("yes", "no"),
                              labels = c("yes", "no"))

mydata2$region <- factor(mydata2$region,
                              levels = c("northeast", "southeast", "southwest", "northwest"),
                              labels = c("northeast", "southeast", "southwest", "northwest"))

#Renamed the column headers to be more visually appealing
library(dplyr)

## 
## 载入程序包：'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata2 <- mydata2 %>%
  rename(Age = age)

mydata2 <- mydata2 %>%
  rename(Sex = sex)

mydata2 <- mydata2 %>%
  rename(BMI = bmi)

mydata2 <- mydata2 %>%
  rename(Children = children)

mydata2 <- mydata2 %>%
  rename(Smoker = smoker)

mydata2 <- mydata2 %>%
  rename(Region = region)

mydata2 <- mydata2 %>%
  rename(Charges = charges)

#Below is a dataframe that only contains data on male beneficiaries
mydata2_male <- mydata2 %>%
  subset(Sex == "male")

summary(mydata2_male)

##       Age            Sex           BMI           Children     Smoker   
##  Min.   :18.00   male  :676   Min.   :15.96   Min.   :0.000   yes:159  
##  1st Qu.:26.00   female:  0   1st Qu.:26.41   1st Qu.:0.000   no :517  
##  Median :39.00                Median :30.69   Median :1.000            
##  Mean   :38.92                Mean   :30.94   Mean   :1.115            
##  3rd Qu.:51.00                3rd Qu.:34.99   3rd Qu.:2.000            
##  Max.   :64.00                Max.   :53.13   Max.   :5.000            
##        Region       Charges     
##  northeast:163   Min.   : 1122  
##  southeast:189   1st Qu.: 4619  
##  southwest:163   Median : 9370  
##  northwest:161   Mean   :13957  
##                  3rd Qu.:18990  
##                  Max.   :62593

#Below is a dataframe that only contains data on female beneficiaries
mydata2_female <- mydata2 %>%
  subset(Sex == "female")

summary(mydata2_female)

##       Age            Sex           BMI           Children     Smoker   
##  Min.   :18.00   male  :  0   Min.   :16.82   Min.   :0.000   yes:115  
##  1st Qu.:27.00   female:662   1st Qu.:26.12   1st Qu.:0.000   no :547  
##  Median :40.00                Median :30.11   Median :1.000            
##  Mean   :39.50                Mean   :30.38   Mean   :1.074            
##  3rd Qu.:51.75                3rd Qu.:34.31   3rd Qu.:2.000            
##  Max.   :64.00                Max.   :48.07   Max.   :5.000            
##        Region       Charges     
##  northeast:161   Min.   : 1608  
##  southeast:175   1st Qu.: 4885  
##  southwest:162   Median : 9413  
##  northwest:164   Mean   :12570  
##                  3rd Qu.:14455  
##                  Max.   :63770

summary(mydata2)

##       Age            Sex           BMI           Children     Smoker    
##  Min.   :18.00   male  :676   Min.   :15.96   Min.   :0.000   yes: 274  
##  1st Qu.:27.00   female:662   1st Qu.:26.30   1st Qu.:0.000   no :1064  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.66   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.13   Max.   :5.000             
##        Region       Charges     
##  northeast:324   Min.   : 1122  
##  southeast:364   1st Qu.: 4740  
##  southwest:325   Median : 9382  
##  northwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770

The mean for “Age” is 39.21, this means that the average age of the beneficiaries in the dataset is 39.21 years old.
The median for “BMI” is 30.40, this means that half of the beneficiaries in the dataset has a BMI of less than or equal to 30.40, and the other half of the beneficiaries in the dataset has a BMI greater than 30.40.
The maximum value of “Charges” is 63770 USD, this means that the highest individual medical costs billed by health insurance is 63770 USD.

Instruction 7:

Research Question: Is there a difference in the individual medical costs billed by health insurance for males and females in the US?
Research Hypothesis: The average individual medical cost billed by health insurance is different for males and females.

The variable “Charges” is a numerical ratio.

#First of all, I will implement random sampling on mydata2 to ensure the integrity of the following hypothesis testing
library(dplyr)
set.seed(123)
sample_size <- 250
sampled_data <- mydata2 %>% sample_n(sample_size)

#Initial overview of the descriptive statistics by group
library(psych)
describeBy(sampled_data$Charges, g = sampled_data$Sex)

## 
##  Descriptive statistics by group 
## group: male
##    vars   n     mean       sd  median  trimmed     mad     min      max
## X1    1 135 12832.62 12160.82 8798.59 10643.54 6201.35 1135.94 48970.25
##       range skew kurtosis      se
## X1 47834.31  1.5     1.24 1046.64
## ------------------------------------------------------------ 
## group: female
##    vars   n  mean       sd   median trimmed     mad     min      max    range
## X1    1 115 13317 11746.63 10156.78   11063 6663.63 1744.46 63770.43 62025.96
##    skew kurtosis      se
## X1    2     4.18 1095.38

#Visualisation of the distribution of "Charges" for males and females
library(ggplot2)

## 
## 载入程序包：'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(sampled_data, aes(x = Charges)) +
  geom_histogram(binwidth = 30, colour = "gray") +
  facet_wrap(~Sex, ncol = 1) +
  ylab("Frequency")

From the histograms above, we can observe that the distribution of “Charges” for both males and females are positively skewed, indicating that the distribution of “Charges” for both males and females are not normal. However, this shall be tested further using the Shapiro–Wilk test.

#Implementing the Shapiro-Wilk test on "Charges" for both males and females
library(rstatix)

## 
## 载入程序包：'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

sampled_data %>%
  group_by(Sex) %>%
  shapiro_test(Charges)

## # A tibble: 2 × 4
##   Sex    variable statistic        p
##   <fct>  <chr>        <dbl>    <dbl>
## 1 male   Charges      0.790 1.29e-12
## 2 female Charges      0.772 4.36e-12

For both males and females,

Null Hypothesis: The variable “Charges” is normally distributed
Alternative Hypothesis: The variable “Charges” is not normally distributed
For males, the p-value we obtained from the Shapiro-Wilk test is 1.292164e-12, we therefore reject the null hypothesis at p < 0.001
For males, the p-value we obtained from the Shapiro-Wilk test is 4.358733e-12, we therefore reject the null hypothesis at p < 0.001

#Here I shall use Levene's Test to test for the homogeneity of variance for the groups "male" and "female"
library(car)

## 载入需要的程序包：carData

## 
## 载入程序包：'car'

## The following object is masked from 'package:psych':
## 
##     logit

## The following object is masked from 'package:dplyr':
## 
##     recode

leveneTest(sampled_data$Charges, group = sampled_data$Sex)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.2857 0.5935
##       248

Null Hypothesis: The variances for the groups “male” and “female” are homogeneous
Alternative Hypothesis: The variances for the groups “male” and “female” are not homogeneous

The p-value we obtained from Levene’s Test is 0.5935, we therefore do not reject the null hypothesis.

#Implementation of independent samples t-test
t.test(sampled_data$Charges ~ sampled_data$Sex,
       var.equal = TRUE,
       alternative = "two.sided")

## 
##  Two Sample t-test
## 
## data:  sampled_data$Charges by sampled_data$Sex
## t = -0.31883, df = 248, p-value = 0.7501
## alternative hypothesis: true difference in means between group male and group female is not equal to 0
## 95 percent confidence interval:
##  -3476.655  2507.889
## sample estimates:
##   mean in group male mean in group female 
##             12832.62             13317.00

#Effect size
library(effectsize)

## 
## 载入程序包：'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

cohens_d(sampled_data$Charges ~ sampled_data$Sex,
         pooled_sd = FALSE)

## Cohen's d |        95% CI
## -------------------------
## -0.04     | [-0.29, 0.21]
## 
## - Estimated using un-pooled SD.

interpret_cohens_d(0.04, rules = "sawilowsky2009")

## [1] "tiny"
## (Rules: sawilowsky2009)

Null Hypothesis: The difference in the means of “Charges” between group “male” and group “female” is equal to 0
Alternative Hypothesis: The difference in the means of “Charges” between group “male” and group “female” is not equal to 0

Based on the sample data, we get a p-value of 0.7501, we therefore do not reject the null hypothesis and find that the average individual medical cost billed by health insurance is not different for males and females. The effect size is tiny (d = 0.04).

#Implementation of the Wilcoxon rank-sum test
wilcox.test(sampled_data$Charges ~ sampled_data$Sex,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  sampled_data$Charges by sampled_data$Sex
## W = 7048, p-value = 0.2099
## alternative hypothesis: true location shift is not equal to 0

#Effect size
library(effectsize)
effectsize(wilcox.test(sampled_data$Charges ~ sampled_data$Sex,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided"))

## r (rank biserial) |        95% CI
## ---------------------------------
## -0.09             | [-0.23, 0.05]

interpret_rank_biserial(0.09)

## [1] "very small"
## (Rules: funder2019)

Null Hypothesis: The distribution of location of “Charges” is the same for males and females
Alternative Hypothesis: The distribution of location “Charges” is not the same for males and females

Based on the sample data, we get a p-value of 0.2099, we therefore do not reject the null hypothesis and find that the average individual medical cost billed by health insurance is not different for males and females. The effect size is very small (r = 0.09).

Verdict:

The non-parametric approach is more suitable for my case because the distribution of “Charges” for both males and females are not normal, the underlying assumptions for the parametric independent samples t-test therefore do not hold.
The answer to the research question is that there is no difference in the individual medical costs billed by health insurance for males and females in the US.

Ricson Lee - HW2

2025-03-28

Ricson Lee - HW2

Instruction 1:

Instruction 2:

Instruction 3:

Instruction 4:

Instruction 5:

Instruction 6:

Instruction 7: