Import and show data

mydata <- read.table("~/IMB/2 semester/1) MVA/HW R/HW 3/insurance.csv", 
                     header=TRUE, 
                     sep=",", 
                     dec=".")

head(mydata, 10)
##    age    sex    bmi children smoker    region   charges
## 1   19 female 27.900        0    yes southwest 16884.924
## 2   18   male 33.770        1     no southeast  1725.552
## 3   28   male 33.000        3     no southeast  4449.462
## 4   33   male 22.705        0     no northwest 21984.471
## 5   32   male 28.880        0     no northwest  3866.855
## 6   31 female 25.740        0     no southeast  3756.622
## 7   46 female 33.440        1     no southeast  8240.590
## 8   37 female 27.740        3     no northwest  7281.506
## 9   37   male 29.830        2     no northeast  6406.411
## 10  60 female 25.840        0     no northwest 28923.137

Explanations of the data

Source: https://www.kaggle.com/datasets/mirichoi0218/insurance?select=insurance.csv

Main goal

To determine how factors such as age, sex, BMI, region, number of children, whether a person is a smoker or not impact the charges for heath insurance.

Insurance companies have to evaluate many factors when setting the appropriate price for the insurance premium. Since all of the above available variables can be found as being taken into consideration in different sources, we will be including them all in our model.

Sources:

Data manipulations

#factors

mydata$sex <- factor(mydata$sex) 
mydata$smoker <- factor(mydata$smoker)
mydata$region <- factor (mydata$region)
round(stat.desc(mydata[ , c(-2, -5, -6)]), 2)
##                   age      bmi children      charges
## nbr.val       1338.00  1338.00  1338.00      1338.00
## nbr.null         0.00     0.00   574.00         0.00
## nbr.na           0.00     0.00     0.00         0.00
## min             18.00    15.96     0.00      1121.87
## max             64.00    53.13     5.00     63770.43
## range           46.00    37.17     5.00     62648.55
## sum          52459.00 41027.62  1465.00  17755824.99
## median          39.00    30.40     1.00      9382.03
## mean            39.21    30.66     1.09     13270.42
## SE.mean          0.38     0.17     0.03       331.07
## CI.mean.0.95     0.75     0.33     0.06       649.47
## var            197.40    37.19     1.45 146652372.15
## std.dev         14.05     6.10     1.21     12110.01
## coef.var         0.36     0.20     1.10         0.91

We can observe that the customers’ age ranges from 18 to 64, whith the average at 39 years. Interestingly, we can observe that on average, people have BMIs higher than recommended (average is 30.66). Out of 1338 obsrvations, 574 have no children. The charges they are currently paying range from 1,121.87 to 63,770.43, with the average at 13,270.42 (unfortunatelly, the data set does not elaborete on which currency the costs are defined in).

Scatterplots

library(car)
scatterplotMatrix(mydata[ , c(7, 1, 3, 4)], 
                  smooth = FALSE) 

We can observe positive correlation between our (numeric) explenatory variables and the dependant variable.

library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(mydata[ , c(7, 1, 3, 4)]))
##          charges  age  bmi children
## charges     1.00 0.30 0.20     0.07
## age         0.30 1.00 0.11     0.04
## bmi         0.20 0.11 1.00     0.01
## children    0.07 0.04 0.01     1.00
## 
## n= 1338 
## 
## 
## P
##          charges age    bmi    children
## charges          0.0000 0.0000 0.0129  
## age      0.0000         0.0000 0.1205  
## bmi      0.0000  0.0000        0.6410  
## children 0.0129  0.1205 0.6410

Multiple regression model

fit1 <- lm(formula = charges ~ age + sex + bmi + children + smoker + region, data = mydata)

Assumptions

Outliers: Standardised residuals

mydata$StdResid <- round(rstandard(fit1), 3)

Is the variable normally distributed?

Histogram:

hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

Ho: variable (std. residuals) is normally distributed

H1: variable (std. residuals) is NOT normally distributed

shapiro.test(mydata$StdResid) 
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.89909, p-value < 2.2e-16

We reject Ho at p < 0.001, meaning the standardised residuals are not normallly distributed.

We have to remove the outlier(s) (that is all units with standardised residuals more than 3 or less than -3 (this is done a bit later in the code).

Outliers:

head(mydata[order(-mydata$StdResid),], 30) 
##      age    sex    bmi children smoker    region  charges StdResid
## 1301  45   male 30.360        0    yes southeast 62592.87    4.965
## 578   31 female 38.095        1    yes northeast 58571.07    4.193
## 243   55 female 26.800        1     no southwest 35160.13    4.030
## 220   24 female 23.210        0     no southeast 25081.77    3.976
## 517   20   male 35.310        1     no southeast 27724.29    3.844
## 820   33 female 35.530        0    yes northwest 55135.40    3.820
## 544   54 female 47.410        0    yes southeast 63770.43    3.808
## 1231  52   male 34.485        3    yes northwest 60021.40    3.666
## 141   34   male 22.420        2     no northeast 27375.90    3.665
## 1207  59 female 34.800        2     no southwest 36910.61    3.623
## 937   44   male 29.735        2     no northeast 32108.66    3.611
## 1028  23   male 18.715        0     no northwest 21595.38    3.604
## 1020  21 female 32.680        2     no northwest 26018.95    3.455
## 1013  61 female 33.330        4     no southeast 36580.28    3.428
## 35    28   male 36.400        1    yes southwest 51194.56    3.372
## 388   50   male 25.365        2     no northwest 30284.64    3.359
## 527   19 female 30.590        2     no northwest 24059.68    3.333
## 431   19   male 33.100        0     no southwest 23082.96    3.311
## 1040  19   male 27.265        2     no northwest 22493.66    3.282
## 1009  25   male 24.985        2     no northeast 23241.47    3.220
## 469   28 female 24.320        1     no northeast 23288.93    3.193
## 1329  23 female 24.225        2     no northeast 22395.74    3.186
## 988   45 female 27.645        1     no northwest 28340.19    3.177
## 600   52 female 37.525        2     no northwest 33471.97    3.099
## 103   18 female 30.115        0     no northeast 21344.85    3.052
## 689   47 female 24.100        1     no southwest 26236.58    3.045
## 1143  52 female 24.860        0     no southeast 27117.99    3.029
## 4     33   male 22.705        0     no northwest 21984.47    3.016
## 322   26 female 29.640        4     no northeast 24671.66    2.979
## 63    64   male 24.700        1     no northwest 30166.62    2.864

Units with high impact: Cooks distance

mydata$CooksD <- round(cooks.distance(fit1), 3) 

hist(mydata$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

Since we have many observations, the lower frequencies are not well seen in the graph, we will look at the data by the order to get a better sense of Cook’s distances.

head(mydata[order(-mydata$CooksD),], 20)
##      age    sex    bmi children smoker    region  charges StdResid CooksD
## 544   54 female 47.410        0    yes southeast 63770.43    3.808  0.020
## 1301  45   male 30.360        0    yes southeast 62592.87    4.965  0.019
## 578   31 female 38.095        1    yes northeast 58571.07    4.193  0.018
## 820   33 female 35.530        0    yes northwest 55135.40    3.820  0.015
## 1231  52   male 34.485        3    yes northwest 60021.40    3.666  0.014
## 1013  61 female 33.330        4     no southeast 36580.28    3.428  0.013
## 220   24 female 23.210        0     no southeast 25081.77    3.976  0.012
## 1028  23   male 18.715        0     no northwest 21595.38    3.604  0.011
## 35    28   male 36.400        1    yes southwest 51194.56    3.372  0.010
## 141   34   male 22.420        2     no northeast 27375.90    3.665  0.009
## 243   55 female 26.800        1     no southwest 35160.13    4.030  0.009
## 322   26 female 29.640        4     no northeast 24671.66    2.979  0.009
## 517   20   male 35.310        1     no southeast 27724.29    3.844  0.009
## 1207  59 female 34.800        2     no southwest 36910.61    3.623  0.009
## 431   19   male 33.100        0     no southwest 23082.96    3.311  0.008
## 527   19 female 30.590        2     no northwest 24059.68    3.333  0.008
## 1020  21 female 32.680        2     no northwest 26018.95    3.455  0.008
## 63    64   male 24.700        1     no northwest 30166.62    2.864  0.007
## 103   18 female 30.115        0     no northeast 21344.85    3.052  0.007
## 388   50   male 25.365        2     no northwest 30284.64    3.359  0.007
head(mydata[order(mydata$CooksD),], 20)
##    age    sex    bmi children smoker    region   charges StdResid CooksD
## 2   18   male 33.770        1     no southeast  1725.552   -0.285      0
## 3   28   male 33.000        3     no southeast  4449.462   -0.374      0
## 5   32   male 28.880        0     no northwest  3866.855   -0.285      0
## 6   31 female 25.740        0     no southeast  3756.622    0.006      0
## 7   46 female 33.440        1     no southeast  8240.590   -0.400      0
## 8   37 female 27.740        3     no northwest  7281.506   -0.127      0
## 9   37   male 29.830        2     no northeast  6406.411   -0.347      0
## 11  25   male 26.220        0     no northeast  2721.321   -0.087      0
## 13  23   male 34.400        0     no southwest  1826.843   -0.450      0
## 14  56 female 39.820        0     no southeast 11090.718   -0.633      0
## 16  19   male 24.600        1     no southwest  1837.237    0.193      0
## 17  52 female 30.780        1     no northeast 10797.336   -0.254      0
## 18  23   male 23.845        0     no northeast  2395.172    0.078      0
## 19  56   male 40.300        0     no southwest 10602.385   -0.732      0
## 21  60 female 36.005        0     no northeast 13228.847   -0.407      0
## 22  30 female 32.400        1     no southwest  4149.736   -0.351      0
## 23  18   male 34.100        0     no southeast  1137.011   -0.322      0
## 25  37   male 28.025        2     no northwest  6203.902   -0.221      0
## 26  59 female 27.720        3     no southeast 14001.134    0.164      0
## 27  63 female 23.085        0     no northeast 14451.835    0.394      0

We see no very big jumps in the values of Cook’s distance, however we will remove all units with CD more than 0.0155 (the higher jump is observed there).

All rows with |std residual| above 3 and/or CD > 0.0155 will be removed.

mydata <- mydata[c(-1301, -578, -243, -220, - 517, - 820, -544, -1231, -141, -1207, -937, -1028, -1020, -1013, -35, -388, -527, -431, -1040, -1009, -469, -1329, -988, -600, -103, -689, -1143, -4), ]

Multicolinearity: VIF

vif(fit1)
##              GVIF Df GVIF^(1/(2*Df))
## age      1.016822  1        1.008376
## sex      1.008900  1        1.004440
## bmi      1.106630  1        1.051965
## children 1.004011  1        1.002003
## smoker   1.012074  1        1.006019
## region   1.098893  3        1.015841
mean(vif(fit1))
## [1] 1.129776
sqrt(5)
## [1] 2.236068

To test for potential problem of multicolinearity variance inflation factor is used. Since our variable “region” has more than 2 categories (it has 4), we have to compare it’s GVIF to square root of 5. It is smaller, which is good. Generally speaking, all values are very close to 1, furthermore the average is also close to 1 which is very good. Therefore, we do not have to worry about multicolinearity of the variables in our model.

Re-estimation of regression model

Because we removed the outliers.

fit2 <- lm(formula = charges ~ age + sex + bmi + children + smoker + region, data = mydata)

Checking for homoskedasticity

mydata$StdFittedValues <- scale(fit2$fitted.values)

library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

The scaterplot indicates that the variance might not be constant, so it will be tested formally (Breusch Pagan test).

Ho: The variance is constant.

H1: The variance is NOT constant.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                Data                 
##  -----------------------------------
##  Response : charges 
##  Variables: fitted values of charges 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    386.2995 
##  Prob > Chi2   =    5.290442e-86

We can reject the Ho at p < 0.001, meaning the variance is NOT constant. This affects standardized errors and p-values in our model, so the function lm_robust will be used to correct it.

Final regression

library(estimatr)
fit3 <- lm_robust(formula = charges ~ age + sex + bmi + children + smoker + region, data = mydata)

summary(fit3)
## 
## Call:
## lm_robust(formula = charges ~ age + sex + bmi + children + smoker + 
##     region, data = mydata)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                   Estimate Std. Error   t value   Pr(>|t|) CI Lower  CI Upper
## (Intercept)     -12735.702     896.61 -14.20425  1.120e-42 -14494.7 -10976.74
## age                263.387      10.52  25.03834 2.987e-113    242.8    284.02
## sexmale              4.146     290.05   0.01429  9.886e-01   -564.9    573.16
## bmi                341.595      28.07  12.16914  2.412e-32    286.5    396.66
## children           441.808     116.58   3.78976  1.577e-04    213.1    670.51
## smokeryes        23732.417     531.35  44.66439 7.049e-265  22690.0  24774.81
## regionnorthwest   -539.383     414.10  -1.30254  1.930e-01  -1351.8    272.99
## regionsoutheast   -967.064     445.37  -2.17137  3.008e-02  -1840.8    -93.34
## regionsouthwest   -831.003     408.92  -2.03221  4.233e-02  -1633.2    -28.80
##                   DF
## (Intercept)     1301
## age             1301
## sexmale         1301
## bmi             1301
## children        1301
## smokeryes       1301
## regionnorthwest 1301
## regionsoutheast 1301
## regionsouthwest 1301
## 
## Multiple R-squared:    0.8 , Adjusted R-squared:  0.7988 
## F-statistic: 387.8 on 8 and 1301 DF,  p-value: < 2.2e-16

Explanations

The result:

charges = -12,735.702 + 263.387 * age + 4.146 * male + 341.595 * bmi + 441.808 * children + 23,732.417 * smokers(yes) - 539.383 * NW_region - 967.064 SE_region - 831.003 * SW_region

At p = 0.05 the following coefficients are statistically NOT significant: sexM, NW_region and will not be explained.

Multiple R squared = 0.8, which means that 80 % of the variability of the dependant variable is explained by the independant variables.

Ho: ρ^2 = 0

H1: ρ^2 > 0

We can reject Ho (that the model with no independent variables fits the data as well as our estimated model) at p < 0.001, which means that ro is higher than 0, indicating that our model is a good fit (there is relationship between at least one independent variable and the dependent variable).

Descriptive statistics

round(stat.desc(mydata[ , c(-2, -5, -6)]), 2)
##                   age      bmi children      charges StdResid  CooksD
## nbr.val       1310.00  1310.00  1310.00      1310.00  1310.00 1310.00
## nbr.null         0.00     0.00   565.00         0.00     1.00  918.00
## nbr.na           0.00     0.00     0.00         0.00     0.00    0.00
## min             18.00    15.96     0.00      1121.87    -1.87    0.00
## max             64.00    53.13     5.00     52590.83     2.98    0.01
## range           46.00    37.17     5.00     51468.96     4.85    0.01
## sum          51445.00 40191.54  1431.00  16808640.14   -98.37    0.79
## median          39.00    30.40     1.00      9178.15    -0.17    0.00
## mean            39.27    30.68     1.09     12831.02    -0.08    0.00
## SE.mean          0.39     0.17     0.03       322.81     0.02    0.00
## CI.mean.0.95     0.76     0.33     0.07       633.27     0.05    0.00
## var            197.15    37.15     1.46 136506984.21     0.75    0.00
## std.dev         14.04     6.09     1.21     11683.62     0.87    0.00
## coef.var         0.36     0.20     1.11         0.91   -11.53    1.87
##              StdFittedValues
## nbr.val         1.310000e+03
## nbr.null        0.000000e+00
## nbr.na          0.000000e+00
## min            -1.470000e+00
## max             2.640000e+00
## range           4.110000e+00
## sum             0.000000e+00
## median         -2.800000e-01
## mean            0.000000e+00
## SE.mean         3.000000e-02
## CI.mean.0.95    5.000000e-02
## var             1.000000e+00
## std.dev         1.000000e+00
## coef.var        1.991459e+16

We can observe that the customers’ age ranges from 18 to 64, with the average at 39 years. Interestingly, we can observe that on average, people have BMIs higher than recommended (average is 30.68). Out of 1310 obsrvations, 565 have no children. The charges they are currently paying range from 1,121.87 to 52,590.83, with the average at 12,831.02 (unfortunatelly, the data set does not elaborete on which currency the costs are defined in).