Import and show data
mydata <- read.table("~/IMB/2 semester/1) MVA/HW R/HW 3/insurance.csv",
header=TRUE,
sep=",",
dec=".")
head(mydata, 10)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
## 7 46 female 33.440 1 no southeast 8240.590
## 8 37 female 27.740 3 no northwest 7281.506
## 9 37 male 29.830 2 no northeast 6406.411
## 10 60 female 25.840 0 no northwest 28923.137
Explanations of the data
unit of observation: 1 customer/beneficiary
sample size: 1338
variables:
– age: age of primary beneficiary
– sex: gender (female, male)
– bmi: Body mass index, providing an understanding of body, weights
that are relatively high or low relative to height, objective index of
body weight (kg / m ^ 2) using the ratio of height to weight, ideally
18.5 to 24.9
– children: Number of children covered by health insurance / Number
of dependents
– smoker: Smoking (yes, no)
– region: the beneficiary’s residential area in the US (northeast,
southeast, southwest, northwest)
– charges: Individual medical costs billed by health
insurance
Source: https://www.kaggle.com/datasets/mirichoi0218/insurance?select=insurance.csv
Main goal
To determine how factors such as age, sex, BMI, region, number of
children, whether a person is a smoker or not impact the charges for
heath insurance.
Insurance companies have to evaluate many factors when setting the
appropriate price for the insurance premium. Since all of the above
available variables can be found as being taken into consideration in
different sources, we will be including them all in our model.
Sources:
Data manipulations
#factors
mydata$sex <- factor(mydata$sex)
mydata$smoker <- factor(mydata$smoker)
mydata$region <- factor (mydata$region)
round(stat.desc(mydata[ , c(-2, -5, -6)]), 2)
## age bmi children charges
## nbr.val 1338.00 1338.00 1338.00 1338.00
## nbr.null 0.00 0.00 574.00 0.00
## nbr.na 0.00 0.00 0.00 0.00
## min 18.00 15.96 0.00 1121.87
## max 64.00 53.13 5.00 63770.43
## range 46.00 37.17 5.00 62648.55
## sum 52459.00 41027.62 1465.00 17755824.99
## median 39.00 30.40 1.00 9382.03
## mean 39.21 30.66 1.09 13270.42
## SE.mean 0.38 0.17 0.03 331.07
## CI.mean.0.95 0.75 0.33 0.06 649.47
## var 197.40 37.19 1.45 146652372.15
## std.dev 14.05 6.10 1.21 12110.01
## coef.var 0.36 0.20 1.10 0.91
We can observe that the customers’ age ranges from 18 to 64, whith
the average at 39 years. Interestingly, we can observe that on average,
people have BMIs higher than recommended (average is 30.66). Out of 1338
obsrvations, 574 have no children. The charges they are currently paying
range from 1,121.87 to 63,770.43, with the average at 13,270.42
(unfortunatelly, the data set does not elaborete on which currency the
costs are defined in).
Scatterplots
library(car)
scatterplotMatrix(mydata[ , c(7, 1, 3, 4)],
smooth = FALSE)

We can observe positive correlation between our (numeric) explenatory
variables and the dependant variable.
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[ , c(7, 1, 3, 4)]))
## charges age bmi children
## charges 1.00 0.30 0.20 0.07
## age 0.30 1.00 0.11 0.04
## bmi 0.20 0.11 1.00 0.01
## children 0.07 0.04 0.01 1.00
##
## n= 1338
##
##
## P
## charges age bmi children
## charges 0.0000 0.0000 0.0129
## age 0.0000 0.0000 0.1205
## bmi 0.0000 0.0000 0.6410
## children 0.0129 0.1205 0.6410
Multiple regression model
fit1 <- lm(formula = charges ~ age + sex + bmi + children + smoker + region, data = mydata)
Assumptions
Outliers: Standardised residuals
mydata$StdResid <- round(rstandard(fit1), 3)
Is the variable normally distributed?
Histogram:
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")

Ho: variable (std. residuals) is normally distributed
H1: variable (std. residuals) is NOT normally distributed
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.89909, p-value < 2.2e-16
We reject Ho at p < 0.001, meaning the standardised residuals are
not normallly distributed.
We have to remove the outlier(s) (that is all units with standardised
residuals more than 3 or less than -3 (this is done a bit later in the
code).
Outliers:
head(mydata[order(-mydata$StdResid),], 30)
## age sex bmi children smoker region charges StdResid
## 1301 45 male 30.360 0 yes southeast 62592.87 4.965
## 578 31 female 38.095 1 yes northeast 58571.07 4.193
## 243 55 female 26.800 1 no southwest 35160.13 4.030
## 220 24 female 23.210 0 no southeast 25081.77 3.976
## 517 20 male 35.310 1 no southeast 27724.29 3.844
## 820 33 female 35.530 0 yes northwest 55135.40 3.820
## 544 54 female 47.410 0 yes southeast 63770.43 3.808
## 1231 52 male 34.485 3 yes northwest 60021.40 3.666
## 141 34 male 22.420 2 no northeast 27375.90 3.665
## 1207 59 female 34.800 2 no southwest 36910.61 3.623
## 937 44 male 29.735 2 no northeast 32108.66 3.611
## 1028 23 male 18.715 0 no northwest 21595.38 3.604
## 1020 21 female 32.680 2 no northwest 26018.95 3.455
## 1013 61 female 33.330 4 no southeast 36580.28 3.428
## 35 28 male 36.400 1 yes southwest 51194.56 3.372
## 388 50 male 25.365 2 no northwest 30284.64 3.359
## 527 19 female 30.590 2 no northwest 24059.68 3.333
## 431 19 male 33.100 0 no southwest 23082.96 3.311
## 1040 19 male 27.265 2 no northwest 22493.66 3.282
## 1009 25 male 24.985 2 no northeast 23241.47 3.220
## 469 28 female 24.320 1 no northeast 23288.93 3.193
## 1329 23 female 24.225 2 no northeast 22395.74 3.186
## 988 45 female 27.645 1 no northwest 28340.19 3.177
## 600 52 female 37.525 2 no northwest 33471.97 3.099
## 103 18 female 30.115 0 no northeast 21344.85 3.052
## 689 47 female 24.100 1 no southwest 26236.58 3.045
## 1143 52 female 24.860 0 no southeast 27117.99 3.029
## 4 33 male 22.705 0 no northwest 21984.47 3.016
## 322 26 female 29.640 4 no northeast 24671.66 2.979
## 63 64 male 24.700 1 no northwest 30166.62 2.864
Units with high impact: Cooks distance
mydata$CooksD <- round(cooks.distance(fit1), 3)
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")

Since we have many observations, the lower frequencies are not well
seen in the graph, we will look at the data by the order to get a better
sense of Cook’s distances.
head(mydata[order(-mydata$CooksD),], 20)
## age sex bmi children smoker region charges StdResid CooksD
## 544 54 female 47.410 0 yes southeast 63770.43 3.808 0.020
## 1301 45 male 30.360 0 yes southeast 62592.87 4.965 0.019
## 578 31 female 38.095 1 yes northeast 58571.07 4.193 0.018
## 820 33 female 35.530 0 yes northwest 55135.40 3.820 0.015
## 1231 52 male 34.485 3 yes northwest 60021.40 3.666 0.014
## 1013 61 female 33.330 4 no southeast 36580.28 3.428 0.013
## 220 24 female 23.210 0 no southeast 25081.77 3.976 0.012
## 1028 23 male 18.715 0 no northwest 21595.38 3.604 0.011
## 35 28 male 36.400 1 yes southwest 51194.56 3.372 0.010
## 141 34 male 22.420 2 no northeast 27375.90 3.665 0.009
## 243 55 female 26.800 1 no southwest 35160.13 4.030 0.009
## 322 26 female 29.640 4 no northeast 24671.66 2.979 0.009
## 517 20 male 35.310 1 no southeast 27724.29 3.844 0.009
## 1207 59 female 34.800 2 no southwest 36910.61 3.623 0.009
## 431 19 male 33.100 0 no southwest 23082.96 3.311 0.008
## 527 19 female 30.590 2 no northwest 24059.68 3.333 0.008
## 1020 21 female 32.680 2 no northwest 26018.95 3.455 0.008
## 63 64 male 24.700 1 no northwest 30166.62 2.864 0.007
## 103 18 female 30.115 0 no northeast 21344.85 3.052 0.007
## 388 50 male 25.365 2 no northwest 30284.64 3.359 0.007
head(mydata[order(mydata$CooksD),], 20)
## age sex bmi children smoker region charges StdResid CooksD
## 2 18 male 33.770 1 no southeast 1725.552 -0.285 0
## 3 28 male 33.000 3 no southeast 4449.462 -0.374 0
## 5 32 male 28.880 0 no northwest 3866.855 -0.285 0
## 6 31 female 25.740 0 no southeast 3756.622 0.006 0
## 7 46 female 33.440 1 no southeast 8240.590 -0.400 0
## 8 37 female 27.740 3 no northwest 7281.506 -0.127 0
## 9 37 male 29.830 2 no northeast 6406.411 -0.347 0
## 11 25 male 26.220 0 no northeast 2721.321 -0.087 0
## 13 23 male 34.400 0 no southwest 1826.843 -0.450 0
## 14 56 female 39.820 0 no southeast 11090.718 -0.633 0
## 16 19 male 24.600 1 no southwest 1837.237 0.193 0
## 17 52 female 30.780 1 no northeast 10797.336 -0.254 0
## 18 23 male 23.845 0 no northeast 2395.172 0.078 0
## 19 56 male 40.300 0 no southwest 10602.385 -0.732 0
## 21 60 female 36.005 0 no northeast 13228.847 -0.407 0
## 22 30 female 32.400 1 no southwest 4149.736 -0.351 0
## 23 18 male 34.100 0 no southeast 1137.011 -0.322 0
## 25 37 male 28.025 2 no northwest 6203.902 -0.221 0
## 26 59 female 27.720 3 no southeast 14001.134 0.164 0
## 27 63 female 23.085 0 no northeast 14451.835 0.394 0
We see no very big jumps in the values of Cook’s distance, however we
will remove all units with CD more than 0.0155 (the higher jump is
observed there).
All rows with |std residual| above 3 and/or CD > 0.0155 will be
removed.
mydata <- mydata[c(-1301, -578, -243, -220, - 517, - 820, -544, -1231, -141, -1207, -937, -1028, -1020, -1013, -35, -388, -527, -431, -1040, -1009, -469, -1329, -988, -600, -103, -689, -1143, -4), ]
Multicolinearity: VIF
vif(fit1)
## GVIF Df GVIF^(1/(2*Df))
## age 1.016822 1 1.008376
## sex 1.008900 1 1.004440
## bmi 1.106630 1 1.051965
## children 1.004011 1 1.002003
## smoker 1.012074 1 1.006019
## region 1.098893 3 1.015841
mean(vif(fit1))
## [1] 1.129776
sqrt(5)
## [1] 2.236068
To test for potential problem of multicolinearity variance inflation
factor is used. Since our variable “region” has more than 2 categories
(it has 4), we have to compare it’s GVIF to square root of 5. It is
smaller, which is good. Generally speaking, all values are very close to
1, furthermore the average is also close to 1 which is very good.
Therefore, we do not have to worry about multicolinearity of the
variables in our model.
Re-estimation of regression model
Because we removed the outliers.
fit2 <- lm(formula = charges ~ age + sex + bmi + children + smoker + region, data = mydata)
Checking for homoskedasticity
mydata$StdFittedValues <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)

The scaterplot indicates that the variance might not be constant, so
it will be tested formally (Breusch Pagan test).
Ho: The variance is constant.
H1: The variance is NOT constant.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -----------------------------------
## Response : charges
## Variables: fitted values of charges
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 386.2995
## Prob > Chi2 = 5.290442e-86
We can reject the Ho at p < 0.001, meaning the variance is NOT
constant. This affects standardized errors and p-values in our model, so
the function lm_robust will be used to correct it.
Final regression
library(estimatr)
fit3 <- lm_robust(formula = charges ~ age + sex + bmi + children + smoker + region, data = mydata)
summary(fit3)
##
## Call:
## lm_robust(formula = charges ~ age + sex + bmi + children + smoker +
## region, data = mydata)
##
## Standard error type: HC2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper
## (Intercept) -12735.702 896.61 -14.20425 1.120e-42 -14494.7 -10976.74
## age 263.387 10.52 25.03834 2.987e-113 242.8 284.02
## sexmale 4.146 290.05 0.01429 9.886e-01 -564.9 573.16
## bmi 341.595 28.07 12.16914 2.412e-32 286.5 396.66
## children 441.808 116.58 3.78976 1.577e-04 213.1 670.51
## smokeryes 23732.417 531.35 44.66439 7.049e-265 22690.0 24774.81
## regionnorthwest -539.383 414.10 -1.30254 1.930e-01 -1351.8 272.99
## regionsoutheast -967.064 445.37 -2.17137 3.008e-02 -1840.8 -93.34
## regionsouthwest -831.003 408.92 -2.03221 4.233e-02 -1633.2 -28.80
## DF
## (Intercept) 1301
## age 1301
## sexmale 1301
## bmi 1301
## children 1301
## smokeryes 1301
## regionnorthwest 1301
## regionsoutheast 1301
## regionsouthwest 1301
##
## Multiple R-squared: 0.8 , Adjusted R-squared: 0.7988
## F-statistic: 387.8 on 8 and 1301 DF, p-value: < 2.2e-16
Explanations
The result:
charges = -12,735.702 + 263.387 * age + 4.146 * male + 341.595 * bmi
+ 441.808 * children + 23,732.417 * smokers(yes) - 539.383 * NW_region -
967.064 SE_region - 831.003 * SW_region
At p = 0.05 the following coefficients are statistically NOT
significant: sexM, NW_region and will not be explained.
Multiple R squared = 0.8, which means that 80 % of the variability of
the dependant variable is explained by the independant variables.
- Analysis of variance (F-test):
Ho: ρ^2 = 0
H1: ρ^2 > 0
We can reject Ho (that the model with no independent variables fits
the data as well as our estimated model) at p < 0.001, which means
that ro is higher than 0, indicating that our model is a good fit (there
is relationship between at least one independent variable and the
dependent variable).
Descriptive statistics
round(stat.desc(mydata[ , c(-2, -5, -6)]), 2)
## age bmi children charges StdResid CooksD
## nbr.val 1310.00 1310.00 1310.00 1310.00 1310.00 1310.00
## nbr.null 0.00 0.00 565.00 0.00 1.00 918.00
## nbr.na 0.00 0.00 0.00 0.00 0.00 0.00
## min 18.00 15.96 0.00 1121.87 -1.87 0.00
## max 64.00 53.13 5.00 52590.83 2.98 0.01
## range 46.00 37.17 5.00 51468.96 4.85 0.01
## sum 51445.00 40191.54 1431.00 16808640.14 -98.37 0.79
## median 39.00 30.40 1.00 9178.15 -0.17 0.00
## mean 39.27 30.68 1.09 12831.02 -0.08 0.00
## SE.mean 0.39 0.17 0.03 322.81 0.02 0.00
## CI.mean.0.95 0.76 0.33 0.07 633.27 0.05 0.00
## var 197.15 37.15 1.46 136506984.21 0.75 0.00
## std.dev 14.04 6.09 1.21 11683.62 0.87 0.00
## coef.var 0.36 0.20 1.11 0.91 -11.53 1.87
## StdFittedValues
## nbr.val 1.310000e+03
## nbr.null 0.000000e+00
## nbr.na 0.000000e+00
## min -1.470000e+00
## max 2.640000e+00
## range 4.110000e+00
## sum 0.000000e+00
## median -2.800000e-01
## mean 0.000000e+00
## SE.mean 3.000000e-02
## CI.mean.0.95 5.000000e-02
## var 1.000000e+00
## std.dev 1.000000e+00
## coef.var 1.991459e+16
We can observe that the customers’ age ranges from 18 to 64, with the
average at 39 years. Interestingly, we can observe that on average,
people have BMIs higher than recommended (average is 30.68). Out of 1310
obsrvations, 565 have no children. The charges they are currently paying
range from 1,121.87 to 52,590.83, with the average at 12,831.02
(unfortunatelly, the data set does not elaborete on which currency the
costs are defined in).