Homework 2 : 19566100

Data set for Homework 2, was found on Kaggle. I used the same data set as I used in Homework 1. Link to data set is https://www.kaggle.com/datasets/harshsingh2209/medical-insurance-payout.

mydata <- read.table("./expenses.csv", header=TRUE, sep=",", dec=".")
head(mydata)
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

Explanation of data set

Unit of observation: customer of the insurance company

Sample size: 1338 units

Description of data:
  • age: Age of the customer
  • sex: Gender
  • bmi: Body Mass Index (kg/m2)
  • children: Number of children
  • smoker: Whether the customer smokes or not
  • region: Which region of the USA the customer belongs to
  • charges: The expenditure for the customer in $
mydata$sexF <- factor(mydata$sex, # creating a factor for variable sex
                 levels = c("male", "female"),
                 labels = c("male", "female"))
mydata$smokerF <- factor(mydata$smoker, # creating a factor for variable smoker
                 levels = c("yes", "no"),
                 labels = c("yes", "no"))
head(mydata, 3)
##   age    sex   bmi children smoker    region   charges   sexF smokerF
## 1  19 female 27.90        0    yes southwest 16884.924 female     yes
## 2  18   male 33.77        1     no southeast  1725.552   male      no
## 3  28   male 33.00        3     no southeast  4449.462   male      no
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
set.seed(10)

mydata1 <- mydata %>% # for the purpose of exercise, I will take a random sample of 100 units
sample_n(100)
mydata2 <- mydata1[,c(-6)]# excluding variable, that I will not use
head(mydata2)
##   age    sex    bmi children smoker   charges   sexF smokerF
## 1  19 female 32.900        0     no  1748.774 female      no
## 2  42 female 24.985        2     no  8017.061 female      no
## 3  52 female 46.750        5     no 12592.534 female      no
## 4  63   male 36.765        0     no 13981.850   male      no
## 5  58   male 25.175        0     no 11931.125   male      no
## 6  34   male 25.300        2    yes 18972.495   male     yes
library(psych)
describe(mydata2 [,c(-2,-5,-7,-8)])
##          vars   n     mean       sd   median  trimmed     mad     min     max
## age         1 100    42.04    14.56    42.50    42.40   17.05   18.00    64.0
## bmi         2 100    31.27     6.26    31.40    31.21    6.94   17.86    47.6
## children    3 100     1.22     1.39     1.00     1.00    1.48    0.00     5.0
## charges     4 100 14067.91 12524.07 10673.46 11842.21 7085.90 1141.45 60021.4
##             range  skew kurtosis      se
## age         46.00 -0.17    -1.26    1.46
## bmi         29.74  0.12    -0.35    0.63
## children     5.00  1.08     0.55    0.14
## charges  58879.95  1.56     1.80 1252.41
summary(mydata2)
##       age            sex                 bmi           children   
##  Min.   :18.00   Length:100         Min.   :17.86   Min.   :0.00  
##  1st Qu.:31.00   Class :character   1st Qu.:26.65   1st Qu.:0.00  
##  Median :42.50   Mode  :character   Median :31.40   Median :1.00  
##  Mean   :42.04                      Mean   :31.27   Mean   :1.22  
##  3rd Qu.:54.25                      3rd Qu.:36.08   3rd Qu.:2.00  
##  Max.   :64.00                      Max.   :47.60   Max.   :5.00  
##     smoker             charges          sexF    smokerF 
##  Length:100         Min.   : 1141   male  :49   yes:19  
##  Class :character   1st Qu.: 5576   female:51   no :81  
##  Mode  :character   Median :10673                       
##                     Mean   :14068                       
##                     3rd Qu.:14222                       
##                     Max.   :60021
Explanation of the parameters:
  • Age-max: The oldes customer observed is 64 years old.
  • Bmi: Skewness value of 0.12 for variable bmi means that distribution of bmi is positively or right skewed.
  • Charges-mean: The average value of expenditures is 14068 $.

1. Research question: Is there a correlation between bmi(body max index) and expenditures ?

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
scatterplotMatrix(mydata2[ , c(-1,-2,-4,-5,-7,-8)], smooth=FALSE)

scatterplot(mydata2$bmi ~ mydata2$charges,
            smooth = FALSE,
            boxplots = FALSE,
            ylab = "BMI",
            xlab = "Charges")

shapiro.test(mydata2$bmi)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata2$bmi
## W = 0.9917, p-value = 0.7994
Hypothesis for bmi:
  • H0: Variable bmi is normally distributed.

  • H1: Variable bmi is not normally distributed.

We can not reject H0.

shapiro.test(mydata2$charges)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata2$charges
## W = 0.8089, p-value = 4.691e-10
Hypothesis for charge:
  • H0: Variable charge is normally distributed.

  • H1: Variable charge is not normally distributed.

We reject H0 at p<0.001. Variable charge is not normally distributed.

Because variable charges is not normally distributed, assumption is violated and I will use Spearman correlation coefficient.

cor(mydata2$bmi, mydata2$charges, 
      method= "spearman")
## [1] 0.02573165
cor.test(mydata2$bmi, mydata2$charges,
         method = "spearman")
## Warning in cor.test.default(mydata2$bmi, mydata2$charges, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  mydata2$bmi and mydata2$charges
## S = 162362, p-value = 0.7994
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## 0.02573165

Hypothesis:

H0: There is no correlation.

H1: There is a correlation.

We can not reject H0 (p>0.05). Based on sample data, we can conclude that there is not enough evidence to prove that there is correlation between bmi (body max index) and charges.

2. Research question: Is there an association between smoking and gender?

Assumptions:
  • independence of observation (met)
  • expected frequencies need to be more than 1 (met)
  • max 20% of frequencies can be between 1 and 5 (met)
results <- chisq.test(mydata2$sexF, mydata2$smokerF,
                      correct = TRUE) #Correction because of 2x2 table

results
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata2$sexF and mydata2$smokerF
## X-squared = 0.3682, df = 1, p-value = 0.544
The hypotheses for Pearon’s Chi2 :
  • H0: There are no associations between the two categorical variables.
  • H1: There is association between the two categorical variables.

Based on the sample data, we can not reject H0, we can not claim that there is an association between gender and smoking.

addmargins(results$observed) #Checking empirical frequencies
##             mydata2$smokerF
## mydata2$sexF yes  no Sum
##       male    11  38  49
##       female   8  43  51
##       Sum     19  81 100
addmargins(round(results$expected, 2)) #Checking expected frequencies
##             mydata2$smokerF
## mydata2$sexF   yes    no Sum
##       male    9.31 39.69  49
##       female  9.69 41.31  51
##       Sum    19.00 81.00 100
round(results$res, 2) #Checking residuals
##             mydata2$smokerF
## mydata2$sexF   yes    no
##       male    0.55 -0.27
##       female -0.54  0.26

There is no significant discrepancies, because all residuals are less than 1.96.

For the purpose of practice I will interpret one as it would be significant: There is less than expected females that smokes.

Proportion tables

addmargins(round(prop.table(results$observed), 3))
##             mydata2$smokerF
## mydata2$sexF  yes   no  Sum
##       male   0.11 0.38 0.49
##       female 0.08 0.43 0.51
##       Sum    0.19 0.81 1.00
Explanation of 0.38(Male/No):
  • Out of 100 customers, there is 38% males that do not smoke.
addmargins(round(prop.table(results$observed, 1), 3), 2)
##             mydata2$smokerF
## mydata2$sexF   yes    no   Sum
##       male   0.224 0.776 1.000
##       female 0.157 0.843 1.000
Explanation of 0.776(Male/No):
  • Out of all males, 77.6% of them do not smoke.
addmargins(round(prop.table(results$observed, 2), 3), 1)
##             mydata2$smokerF
## mydata2$sexF   yes    no
##       male   0.579 0.469
##       female 0.421 0.531
##       Sum    1.000 1.000
Explanation of 0.469(Male/No):
  • Out of all customers that do no smoke, 46.9% of them are males.

Effect size

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
effectsize::cramers_v(mydata2$sexF, mydata2$smokerF)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.00              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.00)
## [1] "tiny"
## (Rules: funder2019)

Conclusion:

  • From the analysis that I did, I found out that there is no association between gender and smoking.