Data set for Homework 2, was found on Kaggle. I used the same data set as I used in Homework 1. Link to data set is https://www.kaggle.com/datasets/harshsingh2209/medical-insurance-payout.
mydata <- read.table("./expenses.csv", header=TRUE, sep=",", dec=".")
head(mydata)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
Explanation of data set
Unit of observation: customer of the insurance company
Sample size: 1338 units
mydata$sexF <- factor(mydata$sex, # creating a factor for variable sex
levels = c("male", "female"),
labels = c("male", "female"))
mydata$smokerF <- factor(mydata$smoker, # creating a factor for variable smoker
levels = c("yes", "no"),
labels = c("yes", "no"))
head(mydata, 3)
## age sex bmi children smoker region charges sexF smokerF
## 1 19 female 27.90 0 yes southwest 16884.924 female yes
## 2 18 male 33.77 1 no southeast 1725.552 male no
## 3 28 male 33.00 3 no southeast 4449.462 male no
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(10)
mydata1 <- mydata %>% # for the purpose of exercise, I will take a random sample of 100 units
sample_n(100)
mydata2 <- mydata1[,c(-6)]# excluding variable, that I will not use
head(mydata2)
## age sex bmi children smoker charges sexF smokerF
## 1 19 female 32.900 0 no 1748.774 female no
## 2 42 female 24.985 2 no 8017.061 female no
## 3 52 female 46.750 5 no 12592.534 female no
## 4 63 male 36.765 0 no 13981.850 male no
## 5 58 male 25.175 0 no 11931.125 male no
## 6 34 male 25.300 2 yes 18972.495 male yes
library(psych)
describe(mydata2 [,c(-2,-5,-7,-8)])
## vars n mean sd median trimmed mad min max
## age 1 100 42.04 14.56 42.50 42.40 17.05 18.00 64.0
## bmi 2 100 31.27 6.26 31.40 31.21 6.94 17.86 47.6
## children 3 100 1.22 1.39 1.00 1.00 1.48 0.00 5.0
## charges 4 100 14067.91 12524.07 10673.46 11842.21 7085.90 1141.45 60021.4
## range skew kurtosis se
## age 46.00 -0.17 -1.26 1.46
## bmi 29.74 0.12 -0.35 0.63
## children 5.00 1.08 0.55 0.14
## charges 58879.95 1.56 1.80 1252.41
summary(mydata2)
## age sex bmi children
## Min. :18.00 Length:100 Min. :17.86 Min. :0.00
## 1st Qu.:31.00 Class :character 1st Qu.:26.65 1st Qu.:0.00
## Median :42.50 Mode :character Median :31.40 Median :1.00
## Mean :42.04 Mean :31.27 Mean :1.22
## 3rd Qu.:54.25 3rd Qu.:36.08 3rd Qu.:2.00
## Max. :64.00 Max. :47.60 Max. :5.00
## smoker charges sexF smokerF
## Length:100 Min. : 1141 male :49 yes:19
## Class :character 1st Qu.: 5576 female:51 no :81
## Mode :character Median :10673
## Mean :14068
## 3rd Qu.:14222
## Max. :60021
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
scatterplotMatrix(mydata2[ , c(-1,-2,-4,-5,-7,-8)], smooth=FALSE)
scatterplot(mydata2$bmi ~ mydata2$charges,
smooth = FALSE,
boxplots = FALSE,
ylab = "BMI",
xlab = "Charges")
shapiro.test(mydata2$bmi)
##
## Shapiro-Wilk normality test
##
## data: mydata2$bmi
## W = 0.9917, p-value = 0.7994
H0: Variable bmi is normally distributed.
H1: Variable bmi is not normally distributed.
We can not reject H0.
shapiro.test(mydata2$charges)
##
## Shapiro-Wilk normality test
##
## data: mydata2$charges
## W = 0.8089, p-value = 4.691e-10
H0: Variable charge is normally distributed.
H1: Variable charge is not normally distributed.
We reject H0 at p<0.001. Variable charge is not normally distributed.
Because variable charges is not normally distributed, assumption is violated and I will use Spearman correlation coefficient.
cor(mydata2$bmi, mydata2$charges,
method= "spearman")
## [1] 0.02573165
cor.test(mydata2$bmi, mydata2$charges,
method = "spearman")
## Warning in cor.test.default(mydata2$bmi, mydata2$charges, method = "spearman"):
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: mydata2$bmi and mydata2$charges
## S = 162362, p-value = 0.7994
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.02573165
Hypothesis:
H0: There is no correlation.
H1: There is a correlation.
We can not reject H0 (p>0.05). Based on sample data, we can conclude that there is not enough evidence to prove that there is correlation between bmi (body max index) and charges.
results <- chisq.test(mydata2$sexF, mydata2$smokerF,
correct = TRUE) #Correction because of 2x2 table
results
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata2$sexF and mydata2$smokerF
## X-squared = 0.3682, df = 1, p-value = 0.544
Based on the sample data, we can not reject H0, we can not claim that there is an association between gender and smoking.
addmargins(results$observed) #Checking empirical frequencies
## mydata2$smokerF
## mydata2$sexF yes no Sum
## male 11 38 49
## female 8 43 51
## Sum 19 81 100
addmargins(round(results$expected, 2)) #Checking expected frequencies
## mydata2$smokerF
## mydata2$sexF yes no Sum
## male 9.31 39.69 49
## female 9.69 41.31 51
## Sum 19.00 81.00 100
round(results$res, 2) #Checking residuals
## mydata2$smokerF
## mydata2$sexF yes no
## male 0.55 -0.27
## female -0.54 0.26
There is no significant discrepancies, because all residuals are less than 1.96.
For the purpose of practice I will interpret one as it would be significant: There is less than expected females that smokes.
addmargins(round(prop.table(results$observed), 3))
## mydata2$smokerF
## mydata2$sexF yes no Sum
## male 0.11 0.38 0.49
## female 0.08 0.43 0.51
## Sum 0.19 0.81 1.00
addmargins(round(prop.table(results$observed, 1), 3), 2)
## mydata2$smokerF
## mydata2$sexF yes no Sum
## male 0.224 0.776 1.000
## female 0.157 0.843 1.000
addmargins(round(prop.table(results$observed, 2), 3), 1)
## mydata2$smokerF
## mydata2$sexF yes no
## male 0.579 0.469
## female 0.421 0.531
## Sum 1.000 1.000
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize::cramers_v(mydata2$sexF, mydata2$smokerF)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.00 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.00)
## [1] "tiny"
## (Rules: funder2019)