Homework 2 : 19566100

Data set for Homework 3, was found on Kaggle. I used the same data set as I used in Homework 1 and 2. Link to data set is https://www.kaggle.com/datasets/harshsingh2209/medical-insurance-payout .

RESEARCH QUESTION:

  • Do Body Max Index, age and smoking status influence the expenditure for a customer, paid by an insurance company ?
mydata <- read.table("./expenses.csv", header=TRUE, sep=",", dec=".")
head(mydata)
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

Explanation of data set

Unit of observation: customer of the insurance company

Sample size: 1338 units

Description of data:
  • age: Age of the customer
  • sex: Gender
  • bmi: Body Mass Index (kg/m2)
  • children: Number of children
  • smoker: Whether the customer smokes or not
  • region: Which region of the USA the customer belongs to
  • charges: The expenditure for the customer in $
mydata$sexF <- factor(mydata$sex, # creating a factor for variable sex
                 levels = c("male", "female"),
                 labels = c("male", "female"))
mydata$smokerF <- factor(mydata$smoker, # creating a factor for variable smoker
                 levels = c("yes", "no"),
                 labels = c("yes", "no"))
mydata$regionF <- factor(mydata$region, # creating a factor for variable region
                             labels = c("northeast", "southeast", "southwest", "northwest"),
                             levels = c("northeast", "southeast", "southwest", "northwest"))

head(mydata, 3)
##   age    sex   bmi children smoker    region   charges   sexF smokerF   regionF
## 1  19 female 27.90        0    yes southwest 16884.924 female     yes southwest
## 2  18   male 33.77        1     no southeast  1725.552   male      no southeast
## 3  28   male 33.00        3     no southeast  4449.462   male      no southeast
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
set.seed(4)

mydata1 <- mydata %>% # for the purpose of exercise, I will take a random sample of 400 units
sample_n(400)
head(mydata1)
##   age    sex    bmi children smoker    region   charges   sexF smokerF
## 1  21   male 20.235        3     no northeast  3861.210   male      no
## 2  47 female 26.125        1    yes northeast 23401.306 female     yes
## 3  27 female 24.750        0    yes southeast 16577.780 female     yes
## 4  53   male 24.320        0     no northwest  9863.472   male      no
## 5  61 female 21.090        0     no northwest 13415.038 female      no
## 6  39 female 22.800        3     no northeast  7985.815 female      no
##     regionF
## 1 northeast
## 2 northeast
## 3 southeast
## 4 northwest
## 5 northwest
## 6 northeast
summary(mydata1 [,c(-2,-5,-6)])
##       age             bmi           children        charges          sexF    
##  Min.   :18.00   Min.   :17.29   Min.   :0.000   Min.   : 1122   male  :190  
##  1st Qu.:27.00   1st Qu.:26.10   1st Qu.:0.000   1st Qu.: 4687   female:210  
##  Median :37.50   Median :30.25   Median :1.000   Median : 9018               
##  Mean   :38.54   Mean   :30.49   Mean   :1.097   Mean   :12402               
##  3rd Qu.:50.00   3rd Qu.:34.41   3rd Qu.:2.000   3rd Qu.:14352               
##  Max.   :64.00   Max.   :47.74   Max.   :5.000   Max.   :60021               
##  smokerF        regionF   
##  yes: 76   northeast: 98  
##  no :324   southeast:119  
##            southwest: 98  
##            northwest: 85  
##                           
## 
library(psych)
describe(mydata1 [,c(-2,-5,-6,-8,-9,-10)])
##          vars   n     mean       sd  median  trimmed     mad     min      max
## age         1 400    38.54    13.91   37.50    38.19   17.79   18.00    64.00
## bmi         2 400    30.49     6.05   30.26    30.32    6.17   17.29    47.74
## children    3 400     1.10     1.22    1.00     0.93    1.48    0.00     5.00
## charges     4 400 12402.02 11278.53 9018.46 10173.07 6928.94 1121.87 60021.40
##             range skew kurtosis     se
## age         46.00 0.15    -1.19   0.70
## bmi         30.45 0.27    -0.22   0.30
## children     5.00 1.02     0.59   0.06
## charges  58899.53 1.71     2.47 563.93
Explanation of the parameters:
  • Sex: There are 210 females in our sample.
  • Bmi: The average body max index is 30.49.
  • Charges: The maximum expenditure in sample for one customer is 60021$.

Multiple regression model

In my regression model charges is dependent variable and age, BMI and smoking status are independent variables. We can expect that with higher age, expenditures of insurance company increases, because the older the person is, more likely it is that the person have heath issues and consequently expenditures of insurance company are higher. If we observe BMI variable, we can expect that BMI(body max index) will increase expenditures, because people with higher BMI usually have more health issues and that means that insurance company expenditures are higher. And lastly, we can expect that people who smoke, have more health issues than others and consequently expenditures are higher.

Checking the assumptions

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
scatterplotMatrix(mydata1[ , c(7, 3, 1)],
                  smooth = FALSE)

From the scatterplot above we can not be sure that there is linearity between charges and bmi. I did logarithms of charges and bmi in another file but after I did a scatterplot, it was even worse. I decided to keep variables without logarithms.

Multicolinearity

fit1 <- lm(charges ~ age + bmi + smokerF, #Dependent and explanatory variable
           data = mydata1) #Name of the data frame
vif(fit1)
##      age      bmi  smokerF 
## 1.024468 1.017070 1.011959
mean(vif(fit1))
## [1] 1.017833

All tree VIF are close to 1 and also mean is very close to 1, so we can say that we do not have problems with multicolinearity.

Checking for outliers

mydata1$StdResid <- round(rstandard(fit1), 3) #Standardized residuals
mydata1$CooksD <- round(cooks.distance(fit1), 3) #Cooks distances

hist(mydata1$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

From the graph we can observe that there are some outliers (values that are below -3 or above than 3) that we need to remove.

I will now do the Saphiro Wilk test for normality. In our case this is not needed because we have enough large sample, but I will just do it as a purpose of practice.

shapiro.test(mydata1$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata1$StdResid
## W = 0.88057, p-value < 2.2e-16
  • H0: Errors are normally distributed.
  • H1: Errors are not normally distributed.

We reject the null hypothesis (p=0.001) and conclude that errors are not normally distributed, however, this does not matter since the sample size of 400 is sufficiently big.

Checking for units with high impact

hist(mydata1$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

All values are less than 1 which is good, but it should not be any gap between values.

head(mydata1[order(-mydata1$StdResid),], 15) 
##     age    sex    bmi children smoker    region  charges   sexF smokerF
## 275  52   male 34.485        3    yes northwest 60021.40   male     yes
## 89   34   male 22.420        2     no northeast 27375.90   male      no
## 178  21 female 32.680        2     no northwest 26018.95 female      no
## 137  23   male 18.715        0     no northwest 21595.38   male      no
## 90   28   male 36.400        1    yes southwest 51194.56   male     yes
## 291  23 female 24.225        2     no northeast 22395.74 female      no
## 199  26 female 29.640        4     no northeast 24671.66 female      no
## 326  28 female 24.320        1     no northeast 23288.93 female      no
## 395  60   male 28.595        0     no northeast 30260.00   male      no
## 138  60 female 25.840        0     no northwest 28923.14 female      no
## 111  53 female 32.300        2     no northeast 29186.48 female      no
## 257  19 female 27.930        3     no northwest 18838.70 female      no
## 380  52   male 26.400        3     no southeast 25992.82   male      no
## 8    28 female 27.500        2     no southwest 20177.67 female      no
## 115  35 female 38.095        2     no northeast 24915.05 female      no
##       regionF StdResid CooksD
## 275 northwest    4.053  0.074
## 89  northeast    3.818  0.029
## 178 northwest    3.560  0.026
## 137 northwest    3.520  0.048
## 90  southwest    3.453  0.053
## 291 northeast    3.341  0.025
## 199 northeast    3.297  0.015
## 326 northeast    3.280  0.019
## 395 northeast    2.915  0.020
## 138 northwest    2.846  0.023
## 111 northeast    2.813  0.011
## 257 northwest    2.713  0.016
## 380 southeast    2.648  0.012
## 8   southwest    2.591  0.009
## 115 northeast    2.519  0.012
head(mydata1[order(mydata1$StdResid),], 15) 
##     age    sex    bmi children smoker    region  charges   sexF smokerF
## 340  38   male 19.300        0    yes southwest 15820.70   male     yes
## 66   42   male 24.640        0    yes southeast 19515.54   male     yes
## 106  53   male 20.900        0    yes southeast 21195.82   male     yes
## 62   43   male 20.130        2    yes southeast 18767.74   male     yes
## 71   33   male 24.795        0    yes northeast 17904.53   male     yes
## 127  25   male 24.130        0    yes northwest 15817.99   male     yes
## 195  29 female 21.850        0    yes northeast 16115.30 female     yes
## 3    27 female 24.750        0    yes southeast 16577.78 female     yes
## 375  39   male 26.410        0    yes northeast 20149.32   male     yes
## 235  42   male 30.000        0    yes southwest 22144.03   male     yes
## 366  27   male 28.500        0    yes northwest 18310.74   male     yes
## 185  20   male 27.300        0    yes southwest 16232.85   male     yes
## 255  40 female 22.220        2    yes southeast 19444.27 female     yes
## 44   36 female 22.600        2    yes southwest 18608.26 female     yes
## 290  29 female 21.755        1    yes northeast 16657.72 female     yes
##       regionF StdResid CooksD
## 340 southwest   -1.897  0.019
## 66  southeast   -1.732  0.012
## 106 southeast   -1.700  0.018
## 62  southeast   -1.655  0.015
## 71  northeast   -1.643  0.010
## 127 northwest   -1.631  0.011
## 195 northeast   -1.620  0.012
## 3   southeast   -1.619  0.010
## 375 northeast   -1.600  0.009
## 235 southwest   -1.586  0.009
## 366 northwest   -1.534  0.008
## 185 southwest   -1.533  0.010
## 255 southeast   -1.532  0.011
## 44  southwest   -1.529  0.010
## 290 northeast   -1.525  0.011
head(mydata1[order(-mydata1$CooksD),], 10)
##     age    sex    bmi children smoker    region  charges   sexF smokerF
## 275  52   male 34.485        3    yes northwest 60021.40   male     yes
## 90   28   male 36.400        1    yes southwest 51194.56   male     yes
## 137  23   male 18.715        0     no northwest 21595.38   male      no
## 89   34   male 22.420        2     no northeast 27375.90   male      no
## 178  21 female 32.680        2     no northwest 26018.95 female      no
## 38   44 female 38.060        0    yes southeast 48885.14 female     yes
## 291  23 female 24.225        2     no northeast 22395.74 female      no
## 138  60 female 25.840        0     no northwest 28923.14 female      no
## 368  25   male 45.540        2    yes southeast 42112.24   male     yes
## 395  60   male 28.595        0     no northeast 30260.00   male      no
##       regionF StdResid CooksD
## 275 northwest    4.053  0.074
## 90  southwest    3.453  0.053
## 137 northwest    3.520  0.048
## 89  northeast    3.818  0.029
## 178 northwest    3.560  0.026
## 38  southeast    2.333  0.025
## 291 northeast    3.341  0.025
## 138 northwest    2.846  0.023
## 368 southeast    1.580  0.021
## 395 northeast    2.915  0.020

We need to eliminate all units with standardized residuals below -3 and above than +3. This units are: ID 275 89 178 137 90 291 199 326. From calculating Cooks distances, I identified that units with high impact are units with ID 275, 90 and 137. This three units were also identified as an outliers.

Removing units

mydata2 <- mydata1[abs(mydata1$StdResid) <= 3, ] 

I removed outliers. Because units with high impact were also outliers, I did not need to use another code to eliminate them.

Checking homoskedasticity

fit2 <- lm(charges ~ age + bmi + smokerF, 
           data = mydata2)

mydata2$StdFitted <- scale(fit2$fitted.values)
mydata2$StdResid <- round(rstandard(fit2), 3)

library(car)
scatterplot(y = mydata2$StdResid, x = mydata2$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                Data                 
##  -----------------------------------
##  Response : charges 
##  Variables: fitted values of charges 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    116.9762 
##  Prob > Chi2   =    2.905388e-27

From the scatterplot we can observe that variances are probably not constant. To prove it I used Breusch Pagan test.

  • H0: The variance is constant
  • H1: The variance is not constant

Based on the sample data we can reject H0(p<0.001) and conclude that heteroscedasticity is present. We will need to use robust standard errors.

Explanation

library(estimatr)
fit_robust <- lm_robust(charges ~ age + bmi + smokerF,  
                        se_type = "HC1",
                        data = mydata2)
summary(fit_robust)
## 
## Call:
## lm_robust(formula = charges ~ age + bmi + smokerF, data = mydata2, 
##     se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper  DF
## (Intercept)   9600.0    1815.79   5.287 2.081e-07   6030.0  13170.0 388
## age            261.8      19.29  13.571 1.332e-34    223.9    299.7 388
## bmi            354.4      50.87   6.966 1.402e-11    254.3    454.4 388
## smokerFno   -22880.9    1038.74 -22.027 2.458e-70 -24923.1 -20838.6 388
## 
## Multiple R-squared:  0.7651 ,    Adjusted R-squared:  0.7633 
## F-statistic: 225.3 on 3 and 388 DF,  p-value: < 2.2e-16

Charges = 9600 + 261.8age + 354.4bmi - 22880.9smokerFno

  • If age increases by 1 year, expenditures on average increases by 261.8 $ (p<0.001), assuming all other variables remain the same.
  • If Body Max Index increases by 1 (kg/m2), expenditures on average increases by 354.4 $ (p<0.001), assuming all other variables remain the same.
  • Given the values of age and bmi, the expenditures for non smokers are on average 22880.9 $ lower than theexpenditures for smokers (p<0.001).

Multiple R-squared: 76.51%. - 76.51 % of variability of the charges is explained by variability of age, bmi and smoking status.

F-statistic:

  • H0:ρ2 = 0
  • H1:ρ2 > 0

Based on the sample data we reject H0(p<0.001) and conclude that at least one independent variable has impact on dependent variable. This model is better model than model without any independent variable.

Conclusion

Based on the sample data, we found that age, Body Max Index and smoking status influence the expenditures of insurance company.